> If they're running on, say, two RTX 6000s for a total draw of ~600 watts, that would be a response time of 1.44 seconds. So obviously the median prompt doesn't go to some high-end thinking model users have to pay for.
You're not accounting for batching, which is how you get optimal GPU utilization. Maybe a batch takes 30 seconds, but it completes 30 requests in that time.
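To put rough numbers on that, here's a quick back-of-the-envelope sketch. It only uses the figures already in this thread (600 W, 1.44 s, a batch of 30 taking 30 s) and assumes the power draw stays flat at 600 W no matter how big the batch is, which is a simplification. The function names are made up for illustration.

```python
# Figures from the quoted comment: ~600 W total draw, 1.44 s per response.
POWER_W = 600.0
SERIAL_SECONDS = 1.44
BATCH_SIZE = 30  # the "30 requests" example from the reply


def energy_per_request_j(power_w, wall_seconds, batch_size):
    """Joules attributed to each request when one batch shares the GPUs.

    Assumes constant power draw for the whole batch (a simplification).
    """
    return power_w * wall_seconds / batch_size


def wall_time_for_budget_s(power_w, joules_per_request, batch_size):
    """How long a batch may run while staying within a per-request energy budget."""
    return joules_per_request * batch_size / power_w


# One request at a time: 600 W * 1.44 s, about 864 J per request.
serial_j = energy_per_request_j(POWER_W, SERIAL_SECONDS, 1)

# Batch of 30 that takes 30 s: 600 W * 30 s / 30, about 600 J per request.
batched_j = energy_per_request_j(POWER_W, 30.0, BATCH_SIZE)

# Same per-request energy as the serial case, but spread over a batch of 30:
# each user could wait about 43 s of wall-clock time.
max_wall_s = wall_time_for_budget_s(POWER_W, serial_j, BATCH_SIZE)

print(f"serial: {serial_j:.0f} J/request")
print(f"batched: {batched_j:.0f} J/request")
print(f"wall time at the same energy budget: {max_wall_s:.1f} s")
```

So under these assumptions the per-request energy says very little about how long any single response took: the same budget is compatible with 1.44 s unbatched or roughly 43 s in a batch of 30.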