Recent tweets cover two results and a related question about learning rate schedules. One paper fine-tunes a Llama-2-7B base model on the 1k longest examples of Alpaca and reports that it outperforms both AlpaGasus and LIMA. Another introduces a 2B-parameter model said to outperform Mistral 7B and Llama 13B, proposing a new learning rate scheduling strategy, the Warmup-Stable-Decay (WSD) scheduler, in place of the usual cosine schedule. Alongside these, users raise questions about adjusting the learning rate and about implementing a cosine schedule with linear warmup for training.
heard you like cosine schedule with linear warmup heard you implemented it like this SequentialLR(LinearLR(), CosineAnnealingLR()) did you try resuming training? https://t.co/Ld2blJyDF3
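For context, here is a minimal sketch of the composition that tweet is poking at, assuming a recent PyTorch version; the warmup_steps, total_steps, and learning rate values are illustrative, and the tweet's jab is that resuming this composition from a checkpoint has been error-prone in practice.

```python
# Linear warmup followed by cosine annealing, composed with SequentialLR.
# All step counts and LR values below are illustrative assumptions.
import torch
from torch.optim.lr_scheduler import SequentialLR, LinearLR, CosineAnnealingLR

model = torch.nn.Linear(10, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

warmup_steps = 100
total_steps = 1000

warmup = LinearLR(optimizer, start_factor=0.01, end_factor=1.0, total_iters=warmup_steps)
cosine = CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps, eta_min=1e-5)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[warmup_steps])

for step in range(total_steps):
    optimizer.step()   # forward/backward omitted for brevity
    scheduler.step()
```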
Hey, AI engineer followers, wat mean? Do I make a more aggressive learning rate decay? I have a lr of 1e-4 and a decay of 1e-5 with a cosine schedule… https://t.co/Z5TzD5hUKn
Some more details on how this 2B param model outperforms Mistral 7B or Llama 13B? 🤯🤔 ✨ Proposes a new learning rate scheduling strategy, Warmup-Stable-Decay (WSD) scheduler which adjusts the learning rate used in different stages of training. Outperforms cosine scheduler… https://t.co/oSQDrKFvph
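For readers wondering what the WSD shape looks like in code, here is a hedged sketch expressed as a LambdaLR multiplier: linear warmup, a long constant plateau, then a decay phase at the end. The phase lengths, the min_ratio floor, and the linear decay form are assumptions for illustration, not the paper's exact recipe.

```python
# Warmup-Stable-Decay (WSD) style schedule as a LambdaLR multiplier.
# Three phases: linear warmup -> constant peak LR -> final decay.
# Step counts and the linear decay shape are illustrative assumptions.
import torch
from torch.optim.lr_scheduler import LambdaLR

def wsd_lambda(warmup_steps, stable_steps, decay_steps, min_ratio=0.1):
    def fn(step):
        if step < warmup_steps:                    # phase 1: linear warmup
            return step / max(1, warmup_steps)
        if step < warmup_steps + stable_steps:     # phase 2: hold peak LR
            return 1.0
        # phase 3: decay toward min_ratio of the peak LR
        progress = (step - warmup_steps - stable_steps) / max(1, decay_steps)
        return max(min_ratio, 1.0 - (1.0 - min_ratio) * progress)
    return fn

model = torch.nn.Linear(10, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = LambdaLR(optimizer, lr_lambda=wsd_lambda(warmup_steps=100,
                                                     stable_steps=800,
                                                     decay_steps=100))
```

The appeal of this shape is practical: the plateau phase lets training continue indefinitely at the peak rate, and the short decay can be run from any plateau checkpoint, unlike a cosine schedule whose total length must be fixed up front.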
Long Is More for Alignment: A Simple but Tough-to-Beat Baseline for Instruction Fine-Tuning abs: https://t.co/bbNebyI5ME Fine-tuning a Llama-2-7B base model on the 1k longest elements of Alpaca outperforms both AlpaGasus and LIMA in one-to-one comparison with different LLMs as… https://t.co/MgQc8Tgl2L
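A minimal sketch of the "1k longest" selection that tweet describes, assuming the Hugging Face datasets library and the tatsu-lab/alpaca dataset with its instruction/input/output columns; measuring length by response characters is an assumption here, and the paper's exact criterion may differ (e.g. token counts).

```python
# Select the 1,000 longest Alpaca examples for instruction fine-tuning.
# Dataset name and the character-length criterion are illustrative assumptions.
from datasets import load_dataset

alpaca = load_dataset("tatsu-lab/alpaca", split="train")

# Rank examples by output length and keep the 1,000 longest.
lengths = [len(ex["output"]) for ex in alpaca]
top_indices = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)[:1000]
longest_1k = alpaca.select(top_indices)

print(longest_1k)
```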