LLM inference performance isn't only about parallelism; single-stream token generation is largely bound by memory bandwidth.
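As a rough illustration of the bandwidth point (my numbers, not the article's): if decode is memory-bandwidth bound, single-stream throughput is roughly memory bandwidth divided by the bytes of weights read per generated token.

    # Back-of-the-envelope, bandwidth-bound decode estimate.
    # Assumptions: ~3.35 TB/s of HBM bandwidth (H100 SXM class) and a
    # hypothetical 70B-parameter dense model held in fp16 (~140 GB of weights).
    hbm_bandwidth_gb_s = 3350
    weight_bytes_gb = 70e9 * 2 / 1e9     # ~140 GB read per generated token
    print(f"~{hbm_bandwidth_gb_s / weight_bytes_gb:.0f} tokens/s single-stream")  # ~24 tok/s

SRAM is much higher bandwidth than HBM, which is the whole point of Groq's design; the catch is capacity, which is where the chip count below comes from.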
Groq delivers this kind of speed by networking many, many chips together with high bandwidth interconnect. Each chip has only 230 MB of SRAM [0].
From the linked reference:
"In the case of the Mixtral model, Groq had to connect 8 racks of 9 servers each with 8 chips per server. That’s a total of 576 chips to build up the inference unit and serve the Mixtral model."
That's 576 chips across eight racks, totaling ~132 GB of SRAM to hold the model. A single 80 GB H100 can serve Mixtral without issue (quantized to 8-bit, or across two GPUs at fp16), albeit at much lower performance.
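The ~132 GB figure is just the chip count times the per-chip SRAM from the reference:

    # Total SRAM across the Groq deployment described in the reference.
    chips = 8 * 9 * 8            # 8 racks x 9 servers x 8 chips = 576 chips
    sram_per_chip_gb = 0.230     # 230 MB of SRAM per chip
    print(chips, "chips,", round(chips * sram_per_chip_gb, 1), "GB of SRAM total")  # 576 chips, 132.5 GB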
Once you consider what real-world inference serving actually requires (multiple models, multiple versions of each model, LoRA adapters, sentence-embedding models for RAG, and so on), the economics and physical footprint alone get very challenging.
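To put rough numbers on the footprint point (my assumptions, not figures from the article): with weights pinned in SRAM, every model variant you keep resident needs its own full set of chips, whereas GPUs can hold or swap model copies in HBM.

    # Footprint sketch for serving several Mixtral-sized model variants.
    # Assumptions: ~8 racks of Groq chips per resident model (as in the
    # reference), vs. 2x 80 GB GPUs per fp16 copy; purely illustrative.
    variants = 4                        # e.g. base model + finetunes / versions
    groq_racks = variants * 8           # each variant pinned in its own SRAM pool
    gpu_count = variants * 2            # assumed 2x 80 GB GPUs per fp16 copy
    print(groq_racks, "racks vs", gpu_count, "GPUs")   # 32 racks vs 8 GPUs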
It's an interesting approach and clearly very, very fast, but I'm curious to see how they do in the market:
1) This analysis uses cloud GPU costs for Nvidia pricing, and cloud providers make significant margin on their GPU instances. If you compare qty 1 retail pricing for an Nvidia DGX, Lambda Hyperplane, etc. against cloud GPU pricing (inference needs to run 24x7), break-even on buying the hardware versus renting is under seven months, depending on what it costs you to host it (rough sketch after this list).
2) Nvidia has incredibly high margins.
3) CUDA.
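A rough version of that break-even math, with made-up but plausible numbers (the hardware price, cloud rate, and hosting cost are my assumptions, not figures from the article):

    # Buy-vs-rent break-even for an 8-GPU server running inference 24x7.
    # All figures below are assumptions for illustration only.
    hardware_cost = 300_000    # qty 1 retail price for a DGX / Hyperplane class box, USD
    cloud_rate_hr = 65.0       # on-demand cloud price for a comparable 8-GPU instance, USD/hr
    hosting_cost_hr = 3.0      # colo space, power, cooling for the owned box, USD/hr
    savings_per_hr = cloud_rate_hr - hosting_cost_hr
    breakeven_months = hardware_cost / (savings_per_hr * 24 * 30)
    print(f"break-even in ~{breakeven_months:.1f} months")   # ~6.7 months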
There are some special cases where tokens per second and time to first token are incredibly important (as the article says: real-time agents, etc.), but overall I think real-world production use or deployment of Groq is a pretty challenging proposition.
[0] - https://www.semianalysis.com/p/groq-inference-tokenomics-spe...