
Hyperscalers do not need to achieve parity with Nvidia. There's (let's say) 50% headroom in profit margins, and plenty of headroom in the complexity custom chip efforts need to take on: they don't need the complexity or generality of Nvidia's chips. If a simple architecture allows them to do inference at 50% of the TCO and 1/5th the complexity, and reduces their Nvidia bill by 70%, that's a solid win. I'm being fast and loose with the numbers, and Trainium clearly seems to have ambitions beyond inference, but given the hundreds of billions each cloud vendor is investing in the AI buildout, a couple of billion on IP that you will own afterwards is a no-brainer. Nvidia has good products and a solid head start, but they're not unassailable or anything.


Yeah, unfortunately no amount of maneuvering is a substitute for a kill chain, where a distributed web of sensors, relays, and weapon carriers can result in an AAM being dispatched from any direction at lightspeed.


Yep I think the value of the experiment is not clear.

You want to use Spark for a large dataset with multiple stages. In this case, their I/O bandwidth from S3 is 1 GB/s, while CPU memory bandwidth is 100-200 GB/s, which is what a multi-stage job actually needs. Spark is a way to pool memory across a cluster for such a dataset, and to use cluster-internal network bandwidth for shuffling instead of going back to storage.

Maybe when you have S3 as your backend, the storage bandwidth bottleneck doesn't show up in perf, but it sure does show up in the bill. A crude rule of thumb: network bandwidth is 20x storage, main memory bandwidth is 20x network, and accelerator/GPU memory bandwidth is 10x CPU memory. It's great that single-node DuckDB/Polars are that good, but this is like racing a taxiing aircraft against motorbikes.
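To make the rule of thumb concrete, here's a toy sketch of what those ratios imply for moving a 1 TB working set (the baseline S3 number is an assumption for illustration, not a measurement):

```c++
// Toy arithmetic: time to move 1 TB through each tier under the rule-of-thumb
// ratios above. The baseline S3 bandwidth is an illustrative assumption.
#include <cstdio>

int main() {
    const double dataset_gb = 1000.0;          // 1 TB working set
    const double s3_gbps = 1.0;                // assumed object-store bandwidth
    const double net_gbps = 20.0 * s3_gbps;    // network ~20x storage
    const double dram_gbps = 20.0 * net_gbps;  // DRAM ~20x network
    const double hbm_gbps = 10.0 * dram_gbps;  // GPU memory ~10x DRAM

    std::printf("S3 scan:   %7.1f s\n", dataset_gb / s3_gbps);
    std::printf("Network:   %7.1f s\n", dataset_gb / net_gbps);
    std::printf("DRAM pass: %7.1f s\n", dataset_gb / dram_gbps);
    std::printf("HBM pass:  %7.1f s\n", dataset_gb / hbm_gbps);
    return 0;
}
```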


Network bandwidth is not 20x storage anymore. An SSD does around 10 GB/s now, which is comparable to 100 Gb Ethernet.


I think I'm talking about cluster-scale network bisection bandwidth vs attached storage bandwidth. With replication/erasure coding overhead and the economics, the order of magnitude difference still prevails.

I think your point is a good one in that it is more economics than systems physics. We size clusters to have more compute/network than storage because it is the design point that maximizes overall utility.

I think it also raises an interesting question: if we get to a point where the disparity really no longer holds, that would justify a complete rethinking of many Spark-like applications that are designed to exploit this asymmetry.


And that's for one SSD. If you're running on a server rather than a laptop, aggregate storage bandwidth will almost certainly be higher than any single network link.


The appropriate comparison point for aggregate cluster storage bandwidth would be its bisection bandwidth.

(I do HPC, IIRC ANL Aurora is < 1PB/s DAOS and 20 PB/s bisection).


> LDL-C is much much cheaper to measure. ApoB costs 36x times as much, so Insurance Companies don't like to pay for it

Unfortunately American retail prices might as well be generated by a PRNG, and do not mean much.

On Ulta Lab Tests, a basic lipid panel and an ApoB test are $22 and $36 respectively. Looking at Indian lab prices (approx. INR -> USD conversion), both are under $10 there.

https://www.ultalabtests.com/test/cholesterol-and-lipids-tes...
https://www.ultalabtests.com/test/cardio-iq-apolipoprotein-b...


Maybe 80-90% of people should take doctors at face value, but it is easy, and only getting easier (thanks to LLMs), to at least access the knowledge to better advocate for your own healthcare, with better outcomes. Of course, this requires doctors who respect your ability to provide useful input, which in your case did not happen.

My advice would be to "shop around" for doctors, establish a relationship where you demonstrate openness to what they say, try not to step on their toes unnecessarily, but also provide your own data and arguments. Some of the most life-changing interventions in my own healthcare have come from my own initiative and stubbornness, but I have doctors who humor me and respect my inputs. Credentials/vibes help here, I think: in my case, "the PhD student from the brand-name school across the street who shows up with plots and regressions" is probably a soft signal that I mean business.


Same, I don't understand the complaints against modern C++. A lambda used for things like comparators is much simpler than a struct with overloaded operators defined somewhere else.

My only complaint is the verbosity: things like `std::chrono::nanoseconds` break even simple statements into multiple lines, and you're tempted to just use uint64_t instead. And `std::thread` is fine, but if you want to name your thread you still need to get the underlying handle and call `pthread_setname_np`. It's hard work pulling off everything C++ tries to pull off.
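For reference, a minimal sketch of that workaround, assuming Linux/glibc (`pthread_setname_np` is non-portable and truncates names longer than 15 characters):

```c++
// Naming a std::thread on Linux via its native handle (glibc-specific).
#include <pthread.h>
#include <thread>

int main() {
    std::thread worker([] { /* ... do work ... */ });
    // pthread_setname_np truncates anything past 15 chars + NUL.
    pthread_setname_np(worker.native_handle(), "my-worker");
    worker.join();
    return 0;
}
```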


> And `std::thread` is fine but if you want to name your thread you still need to get the underlying handle and call `pthread_setname_np`.

Yes, but here we're getting deep into platform specifics. An even bigger pain point is thread priorities. Windows, macOS and Linux differ so fundamentally in this regard that it's really hard to create a meaningful abstraction. Certain things are better left to platform APIs.


```c++
// To lessen verbosity, try defining the following convenience aliases in a header:
#include <atomic>
#include <chrono>

using SystemClock_t = std::chrono::system_clock;
using SteadyClock_t = std::chrono::steady_clock;
using HighClock_t = std::chrono::high_resolution_clock;
using SharedDelay_t = std::atomic<SystemClock_t::duration>;

using Minutes_t = std::chrono::minutes;
using Seconds_t = std::chrono::seconds;
using MilliSecs_t = std::chrono::milliseconds;
using MicroSecs_t = std::chrono::microseconds;
using NanoSecs_t = std::chrono::nanoseconds;

using DoubleSecs_t = std::chrono::duration<double>;
using FloatingMilliSecs_t = std::chrono::duration<double, std::milli>;
using FloatingMicroSecs_t = std::chrono::duration<double, std::micro>;
```
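A quick usage sketch, assuming the alias header above is included:

```c++
// Minimal sketch: timing a block of work with the aliases above in scope.
#include <cstdio>
#include <thread>

int main() {
    const auto start = SteadyClock_t::now();
    std::this_thread::sleep_for(MilliSecs_t{5});  // stand-in for real work
    const auto elapsed = std::chrono::duration_cast<FloatingMilliSecs_t>(
        SteadyClock_t::now() - start);
    std::printf("took %.3f ms\n", elapsed.count());
    return 0;
}
```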


> Is this just a cost efficiency thing?

It's not entirely a cost thing, but even that alone would be a justifiable reason. Tail behavior of all sorts matters a lot; sophisticated congestion control and load balancing matter a lot. ML training is all about massive collectives: a single tail-latency event in an NCCL collective means every GPU in that group idles until the last one makes it.
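To illustrate the tail effect with a toy model (numbers are made up): a collective only completes when its slowest rank does.

```c++
// Toy model: an allreduce step finishes when the slowest rank arrives,
// so a single tail-latency event stalls the whole group. Numbers are made up.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> rank_ms(1024, 10.0);  // 1024 GPUs, nominally 10 ms each
    rank_ms[17] = 50.0;                       // one rank hits a tail event
    const double step_ms = *std::max_element(rank_ms.begin(), rank_ms.end());
    std::printf("step time: %.1f ms (5x slower because of one straggler)\n", step_ms);
    return 0;
}
```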

> It only takes like 1 core to terminate 200 Gb/s of reliable bytestream using a software protocol with no hardware offload over regular old 1500-byte MTU ethernet.

The conventional TCP/IP stack is a lot more than just 20 GB/s of memcpys at 200 GbE: there's a DMA into kernel buffers and then a copy into user memory, there are syscalls and interrupts back and forth, there's segmentation, checksums, reassembly, and retransmits, and overall a lot more work. RDMA eliminates all of that.

> all you need is a parallel hardware crypto accelerator
> all you need is a hardware copy/DMA engine

And when you add these and all the other requirements you get a modern RDMA network :).

The network is what kicks in when Moore's law recedes. Jensen Huang wants you to pretend that your 10,000 GPUs are one massive GPU: that only works if you have NVLink/InfiniBand or something in that league, and even then barely. And GOOG/MSFT/AMZN are too big, and the datacenter fabric is too precious, to be outsourced.


I am aware of how network protocol stacks work. Getting 200 Gb/s of reliable, in-order bytestream per core over an unreliable, out-of-order packet-switched network using standard Ethernet is not very hard with proper protocol design. If memory copying is not your bottleneck (ignoring encryption), then your protocol is bad.

Hardware crypto acceleration and a hardware memory-copy engine do not constitute an RDMA engine. The API I am describing is the receiver programming into a device an (address, length) chunk of data to decrypt and a (src, dst, length) chunk of data to move, respectively. That is a far cry from a whole hardware network protocol.


> Getting 200 Gb/s of reliable in-order bytestream per core over a unreliable, out-of-order packet-switched network using standard ethernet is not very hard with proper protocol design.

You also suggested that this can be done using a single CPU core. It seems to me that this proposal involves custom APIs (not sockets), and even if viable with a single core in the common case, would blow up in case of loss/recovery/retransmission events. Falcon provides a mostly lossless fabric with loss/retransmits/recovery taken care of by the fabric: the host CPU never handles any of these tail cases.

Ultimately there are two APIs for networks: sockets and verbs. The former is great for simplicity, compatibility, and portability; the latter is the standard for when you are willing to break compatibility for performance.


You can use a single core to do 200 Gb/s of bytestream in the presence of loss/recovery/retransmission, assuming you size your buffers adequately so you do not need to stall while waiting for a retransmit. That works out to roughly one bandwidth-delay product of buffering per retransmission of the same chunk of data that you want to survive at full speed.
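For concreteness, a back-of-the-envelope sketch of that sizing (the RTT and loss budget below are assumptions):

```c++
// Sketch: buffer sizing from the bandwidth-delay product (BDP).
// 200 Gb/s at an assumed 100 us RTT is ~2.5 MB per BDP; scale by the number
// of back-to-back retransmits of the same chunk you want to ride out.
#include <cstdio>

int main() {
    const double line_rate_gbps = 200.0;   // Gb/s
    const double rtt_us = 100.0;           // assumed round-trip time
    const int survivable_retransmits = 2;  // assumed loss budget

    const double bdp_bytes = line_rate_gbps * 1e9 / 8.0 * rtt_us * 1e-6;
    const double buffer_bytes = bdp_bytes * (1 + survivable_retransmits);
    std::printf("BDP: %.1f MB, buffer: %.1f MB\n", bdp_bytes / 1e6, buffer_bytes / 1e6);
    return 0;
}
```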

You can use such a protocol as a simple write() and read() on a single bytestream if you so desire, though you would probably be better off using a better API to avoid that unnecessary copy. Usage does not need to be any more complicated than using a TCP socket, which also provides a reliable ordered bytestream abstraction. You make bytes go in, the same bytes come out the other side.


Now do that across thousands of connections. While retaining very low p99 latency.

Even accepting a bytestream abstraction leaves opportunity on the table: if you know what protocols you are sending, you can allow some out-of-order delivery.

Asking the kernel or DPDK or whatever to juggle contention sounds like a coherency nightmare on a large-scale system, and a very hard scheduling problem, one that a hardware timing wheel just handles. Getting reliability and stability at massive concurrency and low latency feels like such an obvious place for hardware to shine, and it does here.

Maybe you can dedicate some cores of your system to maintain a low-enough-latency simulacrum, but you'd still have to shuffle all the data through those low-latency cores, which itself takes time and system bandwidth. Leaving the work to hardware with its own buffers and its own scheduling seems like an obviously good use of hardware, especially with the incredibly exact delay-based congestion control that their closed-loop timing feedback gives them: you can act way before the CPU would poll or take an interrupt again.

Then having its own upper-layer-protocol processors offloads a ton more of the hard work these applications need.

You don't seem curious or interested at all. You seem like you are here to put down and belittle. There are so many amazing wins in so many dimensions here, where the NIC can do very smart things, can specialize, and can respond with enormous speed. I'd challenge you to try just a bit to see some upsides to specialization, versus just saying a CPU can hypothetically do everything (and where is the research showing what p99 latency the best-of-breed software stacks can do?).


They are proposing custom hardware on both ends talking a custom hardware network protocol. That is an enormous amount of complexity in comparison to a custom software stack on bog-standard hardware. I would expect advantages to justify that level of complexity.

However, people like yourself talk about these hardware stacks as if they have clear advantages in performance, latency, and isolation, and make uncurious, dismissive comments, without evidence, claiming that this level of results is only achievable with dedicated hardware.

The only consistent conclusion I can come up with is that everybody just uses really bad software stacks which makes these dedicated hardware solutions seem like major improvements when they are just demonstrating performance you should expect out of your software stack. The fact that this is considered a serious improvement over RoCE which is itself viewed as a serious improvement over things like the Linux kernel TCP software stack lends support for my conclusion.

I make comments on various posts about network protocols to see if I am missing something about the problem space that actually makes it hard to do efficiently in a software protocol. Mostly I just get people parroting the claim that a performant software solution is impossible due to easily solved problems like loss/recovery/retransmission instead of actually indicating hard parts of the problem.

And as for what would be useful hardware I would go with a network with full 64K MTU and hardware copy offload with HBM or other fast bus. Then you could pretty comfortably drive ~10 Tb/s per core subject to enough memory/bus bandwidth.
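Rough arithmetic behind that figure (the per-frame CPU cost is my assumption, not a measurement):

```c++
// Back-of-the-envelope: with 64 KB frames and a hardware copy engine moving
// the payload, the core only touches per-frame metadata. The per-frame CPU
// cost below is an assumption for illustration.
#include <cstdio>

int main() {
    const double frame_bytes = 64.0 * 1024;
    const double per_frame_cpu_ns = 50.0;  // assumed descriptor-handling cost
    const double frames_per_sec = 1e9 / per_frame_cpu_ns;
    const double tbps = frames_per_sec * frame_bytes * 8.0 / 1e12;
    std::printf("%.1f Tb/s per core at %.0f ns/frame\n", tbps, per_frame_cpu_ns);
    return 0;
}
```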


It feels like you're not really trying if you can't cite anything in the same genre as what you are talking about.

I'd love to see Google's Falcon in a run-off against, say, Microsoft's Machnet. A ton of system resources goes into making Machnet and DPDK fast, and it comes with huge design tradeoffs and carefully crafted software architectures: extreme software engineering effort. The glory of Falcon is that you don't need to rebuild your app stack from the bottom up like this: you just get blazingly fast utilization with incredibly low p99 for free, on the system you already have. https://github.com/microsoft/machnet

I think you massively over-glorify the software side of things. There is absolutely a ton of potential to very carefully set up systems that get great goodput with no degradation, but it usually requires carefully built systems and eternal vigilance in how you scale your software across cores. Or you could just have some hardware that routes replies directly to the place they need to go. I don't get what feels like a weird fetishization of doing it in software, and it feels like you don't have any rigor about your criticism, can't cite any backing research, and want to accuse everyone else of being super lazy.


Yes, unfortunately even the best-intentioned individuals have very limited ability to make meaningful carbon-minimizing decisions. A carbon tax is such a sensible solution!


The other funny bit is that one-way PCIe latency is 250ns-ish (don't quote me on the exact numbers), which imposes a hard 1us constraint on latency between two hosts.
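One reading of that figure, sketched out (taking the ~250 ns number at face value): a request/response between two hosts crosses PCIe four times, which already puts the round trip near 1 us.

```c++
// Sketch of the arithmetic: a request/response between two hosts crosses
// PCIe four times (out of host A, into host B, out of B, back into A).
#include <cstdio>

int main() {
    const double pcie_one_way_ns = 250.0;  // the ballpark figure above
    const double rtt_floor_ns = 4 * pcie_one_way_ns;
    std::printf("PCIe-imposed RTT floor: %.0f ns\n", rtt_floor_ns);  // ~1 us
    return 0;
}
```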


You can go quicker with CXL, but not by much.


There are always DPUs like NextSilicon, Nvidia's Blue* line, etc., though, at under 100 ns from SoF to fast compute.


I think standard relational databases/schemas are underrated for when you need richness.

OTel, or anything in that domain, is fine when you have a distributed call graph, which inference with tool calls gives you. If that doesn't work, I think the fallback layer is just, say, ClickHouse.


Note that you can store OTel data in ClickHouse and augment the schema as needed, and get the best of both worlds. That's what we do and it works great.

