Mixture of Experts (MoE) is a neural network architecture that integrates layers of experts within the Transformer block, dynamically routing each input token to a subset of experts for computation. MoE offers faster inference, since only a few experts are activated per token, but at the cost of high VRAM usage, since all experts must be kept loaded in memory. Researchers at Stanford University, Microsoft Research, and Google Research have published MegaBlocks, a paper on efficient sparse training of MoE models.
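A minimal sketch of such a routed MoE layer, assuming PyTorch; the module structure, sizes, and top-k value here are illustrative, not taken from MegaBlocks or any specific model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Top-k routed mixture-of-experts feed-forward layer (illustrative sizes)."""

    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Each expert is an independent feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        # The router scores every token against every expert.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x):
        # x: (num_tokens, d_model); flatten batch and sequence dims beforehand.
        logits = self.router(x)                             # (num_tokens, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)  # pick top-k experts per token
        weights = F.softmax(weights, dim=-1)                # normalize over chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Each expert processes only the tokens routed to it.
            token_idx, slot = (indices == e).nonzero(as_tuple=True)
            if token_idx.numel() > 0:
                out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(x[token_idx])
        return out
```

With top_k=2 of 8 experts, each token runs through only a quarter of the expert weights per layer, which is where the "faster inference" claim in the posts below comes from.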
[LG] MegaBlocks: Efficient Sparse Training with Mixture-of-Experts T Gale, D Narayanan, C Young, M Zaharia [Stanford University & Microsoft Research & Google Research] (2022) https://t.co/6Lvg8QUfen - Mixture-of-Experts (MoE) models route input tokens dynamically to expert… https://t.co/bCZTs6tu6o
Lots of confusion about MoEs out there. IIUC:
- Faster inference, as a fixed number of experts is activated per token (if sparse). E.g., if n=1, just the most appropriate expert is activated.
- High VRAM usage; all experts need to be loaded.
- Work well when you run them on many…
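A back-of-envelope sketch of the trade-off described above, using Mixtral 8x7B's published figures (8 experts, top-2 routing, ~46.7B total parameters, ~12.9B active per token) as the worked example; the memory estimate covers weights only and assumes fp16:

```python
# Active vs. total parameters in a sparse MoE (Mixtral 8x7B figures).
total_params_b = 46.7   # all experts must sit in VRAM
active_params_b = 12.9  # only the top-2 of 8 experts run per token

vram_gb_fp16 = total_params_b * 2  # 2 bytes per parameter in fp16, weights only
print(f"VRAM for weights (fp16): ~{vram_gb_fp16:.0f} GB")
print(f"FLOPs per token vs. a dense 46.7B model: ~{active_params_b / total_params_b:.0%}")
```

Only about 28% of the parameters do work for any given token, which is the inference speedup; all 100% must still be resident, which is the VRAM cost.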
What is Mixture-of-Experts (MoE)? MoE is a neural network architecture design that integrates layers of experts/models within the Transformer block. As data flows through the MoE layers, each input token is dynamically routed to a subset of the experts for computation. This… https://t.co/56mKkrHL34 https://t.co/AnYeITgHVi
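As a concrete illustration of the routing step this post describes, here is a hedged sketch of what the router computes in one MoE layer: softmax scores over experts, a top-k choice per token, and the resulting per-expert load. All sizes and names are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

# Illustrative sizes, not from any cited paper.
num_tokens, d_model, num_experts, top_k = 16, 32, 4, 2
x = torch.randn(num_tokens, d_model)           # token representations
router = torch.nn.Linear(d_model, num_experts)

scores = F.softmax(router(x), dim=-1)          # (16, 4) routing probabilities
topk_scores, topk_experts = scores.topk(top_k, dim=-1)

# Tokens per expert: the counts are usually uneven, which is exactly the
# load-imbalance problem that systems like MegaBlocks are built around.
load = torch.bincount(topk_experts.flatten(), minlength=num_experts)
print("tokens routed to each expert:", load.tolist())
```

Because these counts vary from batch to batch, fixed-capacity implementations must pad or drop tokens; MegaBlocks' block-sparse formulation is designed to avoid both.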
This is the paper you need to read to understand Mixture-of-Experts (MoE) training https://t.co/mXhVIDJwDw https://t.co/y6ZtP5fUi2
the original "Mixture of Experts" https://t.co/d8pFx6wBbT
Hierarchical Mixture of Experts https://t.co/CbonzMvD5o
Seems like a nice day to read some MoE papers https://t.co/Ixb0doT5Cw https://t.co/097JULZHyl
The Mixtral of experts is here!