Recent tweets cover two results and a related question about learning rate schedules. One paper fine-tunes a Llama-2-7B base model on the 1k longest examples of Alpaca and reports that it outperforms both AlpaGasus and LIMA. Another introduces a 2B-parameter model said to outperform Mistral 7B and Llama 13B, proposing a new learning rate scheduling strategy, the Warmup-Stable-Decay (WSD) scheduler, in place of the usual cosine schedule. Alongside these, users raise questions about adjusting the learning rate and about implementing a cosine schedule with linear warmup for training.
heard you like cosine schedule with linear warmup heard you implemented it like this SequentialLR(LinearLR(), CosineAnnealingLR()) did you try resuming training? https://t.co/Ld2blJyDF3
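For context, here is a minimal sketch of the composition that tweet is poking at, assuming a recent PyTorch version; the warmup_steps, total_steps, and learning rate values are illustrative, and the tweet's jab is that resuming this composition from a checkpoint has been error-prone in practice.

```python
# Linear warmup followed by cosine annealing, composed with SequentialLR.
# All step counts and LR values below are illustrative assumptions.
import torch
from torch.optim.lr_scheduler import SequentialLR, LinearLR, CosineAnnealingLR

model = torch.nn.Linear(10, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

warmup_steps = 100
total_steps = 1000

warmup = LinearLR(optimizer, start_factor=0.01, end_factor=1.0, total_iters=warmup_steps)
cosine = CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps, eta_min=1e-5)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[warmup_steps])

for step in range(total_steps):
    optimizer.step()   # forward/backward omitted for brevity
    scheduler.step()
```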
Hey, AI engineer followers, wat mean? Do I make a more aggressive learning rate decay? I have a lr of 1e-4 and a decay of 1e-5 with a cosine schedule… https://t.co/Z5TzD5hUKn
Some more details on how this 2B param model outperforms Mistral 7B or Llama 13B? 🤯🤔 ✨ Proposes a new learning rate scheduling strategy, Warmup-Stable-Decay (WSD) scheduler which adjusts the learning rate used in different stages of training. Outperforms cosine scheduler… https://t.co/oSQDrKFvph
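For readers wondering what the WSD shape looks like in code, here is a hedged sketch expressed as a LambdaLR multiplier: linear warmup, a long constant plateau, then a decay phase at the end. The phase lengths, the min_ratio floor, and the linear decay form are assumptions for illustration, not the paper's exact recipe.

```python
# Warmup-Stable-Decay (WSD) style schedule as a LambdaLR multiplier.
# Three phases: linear warmup -> constant peak LR -> final decay.
# Step counts and the linear decay shape are illustrative assumptions.
import torch
from torch.optim.lr_scheduler import LambdaLR

def wsd_lambda(warmup_steps, stable_steps, decay_steps, min_ratio=0.1):
    def fn(step):
        if step < warmup_steps:                    # phase 1: linear warmup
            return step / max(1, warmup_steps)
        if step < warmup_steps + stable_steps:     # phase 2: hold peak LR
            return 1.0
        # phase 3: decay toward min_ratio of the peak LR
        progress = (step - warmup_steps - stable_steps) / max(1, decay_steps)
        return max(min_ratio, 1.0 - (1.0 - min_ratio) * progress)
    return fn

model = torch.nn.Linear(10, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = LambdaLR(optimizer, lr_lambda=wsd_lambda(warmup_steps=100,
                                                     stable_steps=800,
                                                     decay_steps=100))
```

The appeal of this shape is practical: the plateau phase lets training continue indefinitely at the peak rate, and the short decay can be run from any plateau checkpoint, unlike a cosine schedule whose total length must be fixed up front.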
Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning abs: https://t.co/bbNebyI5ME Fine-tuning a Llama-2-7B base model on the 1k longest elements of Alpaca outperforms both AlpaGasus and LIMA in one-to-one comparison with different LLMs as… https://t.co/MgQc8Tgl2L
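A minimal sketch of the "1k longest" selection that tweet describes, assuming the Hugging Face datasets library and the tatsu-lab/alpaca dataset with its instruction/input/output columns; measuring length by response characters is an assumption here, and the paper's exact criterion may differ (e.g. token counts).

```python
# Select the 1,000 longest Alpaca examples for instruction fine-tuning.
# Dataset name and the character-length criterion are illustrative assumptions.
from datasets import load_dataset

alpaca = load_dataset("tatsu-lab/alpaca", split="train")

# Rank examples by output length and keep the 1,000 longest.
lengths = [len(ex["output"]) for ex in alpaca]
top_indices = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)[:1000]
longest_1k = alpaca.select(top_indices)

print(longest_1k)
```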