A new research paper, 'Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models' (2024), explores using small language models (LMs) to prune pretraining data for larger LMs. The study, conducted by researchers from Databricks, MIT, and DatologyAI, including Z Ankner, C Blakeney, K Sreenivasan, and M Marion, finds that small LMs can effectively prune data for models up to 30 times larger, and that this pruning method works in both the overtrained and data-constrained regimes. The paper highlights the potential of small LMs to improve the efficiency and performance of larger LMs by selecting high-quality subsets of large-scale text datasets. The study also examines the marginal contribution of a data point to a model's loss.
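As a sketch of the core idea above: perplexity-based pruning scores each document with a small reference model and keeps only the documents whose perplexity falls in a chosen part of the distribution (e.g. the low end). The helper names, the use of precomputed token log-probabilities, and the toy data below are illustrative assumptions, not the paper's actual pipeline.

```python
import math
from typing import List, Sequence

def perplexity(token_log_probs: List[float]) -> float:
    """Perplexity of one document: exp of the mean negative log-likelihood
    of its tokens (log-probs assumed to come from a small reference LM)."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

def prune_by_perplexity(docs: Sequence[str],
                        doc_log_probs: Sequence[List[float]],
                        keep_fraction: float = 0.5,
                        keep: str = "low") -> List[str]:
    """Keep the fraction of documents at the chosen end of the
    reference-model perplexity distribution ('low' or 'high')."""
    scored = sorted(
        zip(docs, (perplexity(lp) for lp in doc_log_probs)),
        key=lambda pair: pair[1],
        reverse=(keep == "high"),
    )
    n_keep = max(1, int(len(scored) * keep_fraction))
    return [doc for doc, _ in scored[:n_keep]]

# Toy example: two "clean" documents (high token log-probs under the
# reference model) and two "noisy" ones (low token log-probs).
docs = ["clean-a", "clean-b", "noisy-a", "noisy-b"]
log_probs = [
    [-0.5, -0.4, -0.6],   # low perplexity
    [-0.2, -0.3, -0.4],   # low perplexity
    [-3.0, -2.5, -2.8],   # high perplexity
    [-2.9, -3.1, -2.6],   # high perplexity
]
kept = prune_by_perplexity(docs, log_probs, keep_fraction=0.5, keep="low")
# → keeps the two low-perplexity documents
```

In the paper's setting, the reference model is far smaller than the model trained on the pruned data; which end of the perplexity distribution to keep is an empirical question the paper studies, so `keep` is left as a parameter here.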
[CL] Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models Z Ankner, C Blakeney, K Sreenivasan, M Marion... [Databricks & MIT & DatologyAI] (2024) https://t.co/8TngcEoRZW - Perplexity-based data pruning, where a dataset is pruned to subsets with low,… https://t.co/qN8kQ570Mb
[CL] Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models Z Ankner, C Blakeney, K Sreenivasan, M Marion... [Databricks & MIT & DatologyAI] (2024) https://t.co/8TngcEoRZW - The marginal contribution of a data point to a model's loss, defined as the… https://t.co/O6xZwL6Qg6
Finally, a pruning paper that gets me excited. Small LLMs are helpful for choosing the data for larger LLMs! https://t.co/Bsfkq1ZMsD
New paper where we explore using a small LM’s perplexity to prune the pretraining data for larger LMs. We find that small LMs can prune data for up to 30x larger LMs, data pruning works in the overtrained and data-constrained regimes, and more! https://t.co/XYbI0Ijois
Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models - In this work, we investigate whether small language models can determine high-quality subsets of large-scale text datasets that improve the performance of larger language… https://t.co/9hejOpCiVJ