aurohacker's comments | Hacker News

Great answers here: for MoE, there are compute savings but no memory savings, even though the network is super-sparse. It turns out there is a paper on predicting in advance which experts will be used in the next few layers, "Accelerating Mixture-of-Experts language model inference via plug-and-play lookahead gate on a single GPU". As to its efficacy, I'd love to know...
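To make the idea concrete, here is a minimal, hypothetical sketch of what such a lookahead gate could look like: a small probe reads the hidden states entering layer L, guesses which experts layer L+1 will route to, and those experts' weights are prefetched onto the GPU before they are needed. The names, shapes, and the simple linear predictor are my own illustrative assumptions, not the paper's actual mechanism.

```python
import torch
import torch.nn as nn

class LookaheadGate(nn.Module):
    """Hypothetical lookahead gate: predicts the next layer's experts from
    the current layer's hidden states (illustrative, not the paper's design)."""

    def __init__(self, hidden_dim: int, num_experts: int):
        super().__init__()
        # A lightweight linear probe standing in for whatever predictor is actually used.
        self.proj = nn.Linear(hidden_dim, num_experts)

    def forward(self, hidden: torch.Tensor, top_k: int = 2) -> torch.Tensor:
        # hidden: (batch, seq, hidden_dim) -> ids of experts predicted for the next layer
        logits = self.proj(hidden)                  # (batch, seq, num_experts)
        scores = logits.mean(dim=(0, 1))            # pool over tokens for a coarse guess
        return torch.topk(scores, k=top_k).indices  # expert ids to prefetch


def prefetch_experts(expert_weights, predicted_ids, device):
    """Move only the predicted experts' weights onto the accelerator ahead of time."""
    for idx in predicted_ids.tolist():
        expert_weights[idx] = expert_weights[idx].to(device, non_blocking=True)


if __name__ == "__main__":
    hidden_dim, num_experts = 512, 8
    gate = LookaheadGate(hidden_dim, num_experts)
    # Toy "experts": one weight matrix each, kept on CPU until predicted to be needed.
    experts = [torch.randn(hidden_dim, hidden_dim) for _ in range(num_experts)]
    hidden = torch.randn(1, 16, hidden_dim)   # activations entering layer L

    ids = gate(hidden)                        # predicted experts for layer L+1
    device = "cuda" if torch.cuda.is_available() else "cpu"
    prefetch_experts(experts, ids, device)
    print("Prefetched experts:", ids.tolist())
```

The win, if the prediction is accurate enough, is that you only need the active experts resident on the GPU at any moment, trading a small mispredict/refetch cost for a much smaller memory footprint.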


Figure 1 in the paper is all about the encoder and how the context and query are packaged and sent to the decoder. I wish it were more complete...

