Hacker News

I linked to the writeup by DeepSeek with their actual numbers from production, and you want "better numbers" than that?!

> Each H800 node delivers an average throughput of ~73.7k tokens/s input (including cache hits) during prefilling or ~14.8k tokens/s output during decoding.

That's a 5x difference, not 1000x. It also lines up with their pricing, as one would expect.

(The decode throughputs they give are roughly equal to yours, but you're claiming prefill performance 200x higher than they can achieve.)
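The 5x figure falls straight out of the two throughput numbers quoted above; a quick sanity check:

```python
# DeepSeek's reported per-H800-node throughput:
# ~73.7k tok/s prefill (including cache hits) vs ~14.8k tok/s decode.
prefill_tps = 73_700
decode_tps = 14_800

ratio = prefill_tps / decode_tps
print(f"prefill/decode throughput ratio: ~{ratio:.1f}x")  # ~5.0x
```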



A good rule of thumb is that a prefill token costs about 1/6th the compute of a decode token, and that you can get about 15k prefill tokens per second on Llama3 8B on a single H100. Bigger models will require more compute per token, and quantization like FP8 or FP4 will require less.
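Putting those two rule-of-thumb numbers together gives an implied decode throughput in the same ballpark as the ratio discussed above. A back-of-the-envelope sketch (the 1/6 cost ratio and 15k tok/s figures are the ones quoted here, and this ignores that decode is often memory-bandwidth-bound rather than compute-bound):

```python
# Rule of thumb: a prefill token costs ~1/6 the compute of a decode token,
# and one H100 does ~15k prefill tok/s on Llama3 8B. If decode tokens cost
# 6x more compute, the same hardware budget implies roughly:
prefill_tps = 15_000          # prefill tokens/s, Llama3 8B on one H100
prefill_cost_ratio = 1 / 6    # compute cost of prefill token vs decode token

decode_tps = prefill_tps * prefill_cost_ratio
print(f"implied decode throughput: ~{decode_tps:.0f} tok/s")  # ~2500
```

That ~6x compute gap is consistent with the roughly 5x prefill/decode throughput difference in DeepSeek's production numbers.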




