MIPT Deep Learning Club #9


Emil Zakirov on “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer”

The authors study a conditional-computation layer called the Mixture of Experts (MoE), in which different parts of the network are activated for different samples. Because only a few experts run for each input, the number of parameters can be increased by a factor of 1000 while keeping the computation time per sample roughly constant.
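To make the idea concrete, here is a minimal PyTorch sketch of a sparsely-gated MoE layer with top-k gating. The class name `SparseMoE`, the sizes, and `k=2` are illustrative choices rather than the paper's implementation, and the noisy gating and load-balancing losses from the paper are omitted.

```python
# Minimal sketch of a sparsely-gated Mixture-of-Experts layer (top-k gating).
# Simplified: no noisy gating and no load-balancing loss from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoE(nn.Module):
    def __init__(self, d_model, d_hidden, num_experts=8, k=2):
        super().__init__()
        self.k = k
        # Each expert is a small feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden),
                          nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        # The gating network scores every expert for every input.
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x):                      # x: (batch, d_model)
        scores = self.gate(x)                  # (batch, num_experts)
        top_vals, top_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(top_vals, dim=-1)  # renormalize over the chosen k experts
        out = torch.zeros_like(x)
        # Only the k selected experts are evaluated for each sample,
        # so compute grows with k, not with the total number of experts.
        for slot in range(self.k):
            for e in range(len(self.experts)):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * self.experts[e](x[mask])
        return out


if __name__ == "__main__":
    layer = SparseMoE(d_model=32, d_hidden=64, num_experts=8, k=2)
    y = layer(torch.randn(4, 32))
    print(y.shape)  # torch.Size([4, 32])
```

Adding more experts grows the parameter count, but each forward pass still touches only `k` of them, which is why capacity can scale so far beyond what dense layers of the same per-sample cost would allow.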

One reason for the paper's strong results is the huge dataset, containing roughly a billion sentences in different languages. The authors beat state-of-the-art methods on language modeling and multilingual machine translation, which is not very surprising given that researchers from Google Translate are among the authors.
