EleutherAI has released a new T5 model, Pile-T5, trained on 2 trillion tokens from the Pile using the Llama tokenizer. The release includes intermediate checkpoints and delivers a significant boost in benchmark performance. Development was led by Aran Komatsuzaki and Lintang Sutawika, advised by Colin Raffel, and the project, described as a long labor of love, also benefited from compute resources contributed by @EMostaque. The model is noted for its reproducibility and its potential utility in both natural language and code applications. The announcements for this fully open release also acknowledge @ShayneRedford's work on FLAN and TeraflopAI's efforts to provide the open-source community with permissively licensed, commercially usable datasets.
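For readers who want to experiment with the release, the following is a minimal sketch of loading such a checkpoint with Hugging Face transformers. The hub ID "EleutherAI/pile-t5-base" and the available model sizes are assumptions not stated in the posts below; check EleutherAI's Hugging Face page for the exact checkpoint names.

```python
# Minimal sketch: load a Pile-T5 checkpoint via Hugging Face transformers.
# The repository ID below is an assumption based on the announcement; the
# actual hub IDs and sizes should be confirmed on EleutherAI's model page.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "EleutherAI/pile-t5-base"  # assumed ID; larger variants may exist

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# The base checkpoint is a span-corruption (pretrained) model, so raw
# generations are illustrative only; downstream use normally involves
# finetuning, as with the original T5.
inputs = tokenizer("The Pile is a large, diverse dataset for", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```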
Data is what makes the model. We at @TeraflopAI are working hard to provide the open-source community with permissible commercially licensed datasets for training. Congrats to @arankomatsuzaki, @lintangsutawika, and @colinraffel. And thanks to @ShayneRedford for his work on FLAN. https://t.co/DheOvHTeil
Glad to see our very own @arankomatsuzaki pushing the boundaries of open-source research with a new T5 release using our data. Congrats to @lintangsutawika and @colinraffel. And @ShayneRedford for his great efforts on FLAN. https://t.co/0oZeOZhZhs
A long labor of love by the team for a new high quality T5 model. Happy to have contributed compute resources for this, will be useful for research & more as a fully open release https://t.co/pEoshSkPWq
Having teased this a couple times, I'm excited to share that @lintangsutawika and @arankomatsuzaki, advised by @colinraffel, have retrained T5 using a more modern dataset and tokenizer, and for longer. This produces a better general model for both NL and code applications. https://t.co/tLMZ18xPRz
Great release by @lintangsutawika, @arankomatsuzaki , and @colinraffel! Finally a fully-reproducible T5 model: https://t.co/wzLtG6p8JS
Introducing Pile-T5! We (EleutherAI) are thrilled to open-source our latest T5 model trained on 2T tokens from the Pile using the Llama tokenizer. ✨ Featuring intermediate checkpoints and a significant boost in benchmark performance. Work done by @lintangsutawika, me… https://t.co/qvoSWyAVjb