SIMD Matters: Graph Coloring (box2d.org)
189 points by matt_d on Aug 21, 2024 | 72 comments


>This is the magic of graph coloring and enables me to solve multiple contact constraints simultaneously without a race condition.

<TANGENT> This hits me, like a ton of bricks, as one of the most elegant ways to describe why I add 2 phases of clocking ("like colors on a chessboard" is the phrase I've been using) to my BitGrid[1] hobby project.

I wonder what other classes of problems this could solve. This feels oddly parallel, like a mapping of the Langlands program into computer science.[2]

[1] https://esolangs.org/wiki/Bitgrid

[2] https://en.wikipedia.org/wiki/Langlands_program


That looks cool. I want to raise some nits, though:

- If they're sending and receiving along diagonals, aren't those actually bishop-neighbors, not rook-neighbors?

- Think you meant Crypt of the Necro _dancer_

- The grid in the background makes it hard to discern whether they also send to rook-neighbors

- Secret 4th nit - Wait, are the cells just rotated 45 degrees in the diagram and not in the prose?


> aren’t those actually bishop-neighbors, not rook-neighbors?

There are 2 grids, so it depends on your frame of reference. The node connections are oriented 90 degrees to the shape of the node in the picture, which means rook neighbors is accurate relative to the nodes. The background is off by 45 degrees, so node connections relative to the background grid are diagonal, or bishop, moves.


I didn't write it up at Esolang, someone else did. I wouldn't have rotated it 45 degrees.


> The conventional wisdom is that it is difficult to achieve real gains from SIMD.

This has been my experience. Oftentimes I misunderstood how much could be gained by using SIMD, and preparing the data to be "eaten" by SIMD instructions is not trivial. Many times I've attempted to use it, only to profile and see that it didn't improve things at all and made the code really hard to understand.

Kudos to Erin here: this is really hard work, and it's great that it paid off and gave good results!


> preparing the data to be "eaten" by SIMD instructions is not trivial

I've always found my best experiences with SIMD to be one level removed from the actual scary bits. I.e., using primitives from libraries which are inherently optimized.

My favorite SIMD abstraction right now is Vector<T> in .NET:

https://learn.microsoft.com/en-us/dotnet/api/system.numerics...

If you're worried you'd get that one wrong too, you can go up another level of abstraction and use something like Matrix4x4 and its members which are also optimized.


Interestingly enough, Vector<T> itself is a bit limited and has a smaller API than Vector128/256/512<T>. It is being improved to match those and then expand further, but it happens incrementally: some changes in .NET 8, more now in 9, and more after that. After all, the compiler treats it pretty much the same as a vector of the respective width under the hood.

The main problem with writing SIMD code for the first time is the learning curve: the performance improvement doesn't come just from performing n element operations per instruction, but also from reducing the bookkeeping that usually happens per element per loop iteration, like conditional branches to terminate the loop, branchy element selects that become branchless vector shuffles, loading more data at a time, etc.

Which is why many first-time attempts leave the wrong impression that vectorization is hard, while the truth is they just stumble into known "don't do that, also do less" traps: needlessly spilling vectors or writing to intermediate buffers instead of directly, modifying individual vector elements or even iterating over them, and avoiding actual vector operations.
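
A minimal sketch of what "do less per element" looks like in practice (plain SSE2 intrinsics, names are mine, tail handling omitted):

    #include <emmintrin.h>  // SSE2 (pulls in SSE as well)

    // Scalar: one element per iteration, one branch per element.
    void clamp_min_scalar(float* v, int n, float lo) {
        for (int i = 0; i < n; ++i)
            if (v[i] < lo) v[i] = lo;
    }

    // SIMD: 4 elements per iteration, and the per-element branch becomes
    // a branchless max. (Assumes n is a multiple of 4 to keep the sketch
    // short; a real version needs a scalar tail loop.)
    void clamp_min_sse2(float* v, int n, float lo) {
        __m128 vlo = _mm_set1_ps(lo);
        for (int i = 0; i < n; i += 4) {
            __m128 x = _mm_loadu_ps(v + i);
            _mm_storeu_ps(v + i, _mm_max_ps(x, vlo));
        }
    }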

There's a new-ish guide on vectorization if you're interested: https://github.com/dotnet/runtime/blob/main/docs/coding-guid...


The main problem of SIMD remains the same: it requires working with structures of arrays rather than arrays of structures, which makes it more difficult to combine and layer objects on top of each other.


I would argue the opposite: the problem is that our languages give us only rudimentary tools to represent data in SoA styles, which makes them difficult to apply for efficient processing. In other words, it's not SIMD's fault but rather the C-like approach to data structures. In fact, SoA is also great for non-SIMD parallel implementations because it yields improvements to memory density and cache performance.
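
For concreteness, the two layouts side by side (illustrative C++, not from the article):

    // Array of structures (AoS): each body's fields sit together, so the
    // x values are sizeof(Body) bytes apart in memory.
    struct Body { float x, y, z, mass; };
    Body aos[1024];

    // Structure of arrays (SoA): each field is contiguous, so one SIMD
    // load grabs 4/8/16 x values and every cached byte gets used.
    struct Bodies {
        float x[1024];
        float y[1024];
        float z[1024];
        float mass[1024];
    };
    Bodies soa;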


This is much nicer in functional languages that have tuples. Then an array of tuples (if normalized or flattened) is both stored efficiently and ready for vectorization. You really don't want to model mathematical objects like vectors as mutable, heap-allocated objects. OO isn't the right fit here.


It's not just C, but also any object-oriented language.

Basically the most popular imperative paradigms are just ill-suited to making the most of SIMD hardware. They're designed with other goals in mind.


Express your algorithms as queries over arrays and you can change the representation of the arrays while keeping the queries the same.

Who said SQL and columnar/row storage models? ;)

Something like LINQ can do the job as well.


This seems like an odd objection, because when tumbling down the rabbit hole of performance optimization, leveraging prefetch in your memory controller would normally happen before leveraging SIMD instructions, right? Ie the point at which you start thinking about using SIMD would happen after you’ve already adopted SoA.


SIMD works fine with arrays of structures. It's sometimes awkward for people to phrase algorithms in terms of SIMD over arrays of structs, but that's often a problem imposed by poor programming language design, not a fundamental problem with SIMD itself.


Could you elaborate on this?

I'd think that a fundamental problem is memory I/O bandwidth, and (depending on cache policy) poor cache hit ratios.


If you are accessing elements in memory with a constant stride[1], then hardware prefetchers[2] do a surprisingly good job at “reading ahead” and avoiding cache misses.

A typical example would be: you have an array of objects of constant size, and you’re reading a double field from a constant offset within each object. The hardware prefetcher will “recognize” this access pattern and prefetch that offset every sizeof(obj) bytes.
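
Something like this hypothetical layout, matching the numbers below:

    struct Obj { double field; char padding[56]; };  // sizeof(Obj) == 64

    // Constant-stride access: the prefetcher keeps up, but each 64-byte
    // cache line it pulls contains only 8 bytes we actually read.
    double sum_fields(const Obj* objs, int n) {
        double s = 0.0;
        for (int i = 0; i < n; ++i)
            s += objs[i].field;  // stride of sizeof(Obj) bytes
        return s;
    }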

The major downsides (vs. a struct-of-arrays design with full spatial locality) are:

1. Every prefetch pulls a full cache line, but the cache line will include data you don’t need. In this example, every cache line might have 64 bytes of data, but you only needed the one double field (8 bytes) - the rest is not useful. If you were iterating over an array of doubles you could have pulled 8 double fields in a single cache line.

2. Specific performance is hardware-dependent, so it’s hard to guarantee performance on e.g. low-end cores, short loops, or unusually long strides.

[1] https://en.wikipedia.org/wiki/Stride_of_an_array

[2] https://en.wikipedia.org/wiki/Cache_prefetching


> I'd think that a fundamental problem is memory I/O bandwidth, and (depending on cache policy) poor cache hit ratios.

You seem to be assuming that a vectorized loop over an array of structs only ever uses one element from the struct and ignores the rest. In that case, yes, you will get some bandwidth problems, but modern hardware prefetchers are so good that a vectorized loop that only looks at every Nth byte or whatever is actually much faster than one might think. I'll also note that the serial (non-SIMD) case also suffers here, though often not quite as badly as the SIMD case.

However, there's lots of cases where you want the whole struct. E.g. complex numbers are typically a struct of two real numbers, and SIMD arithmetic on vectors of structs of complex numbers is able to work just fine with the full theoretical SIMD performance improvement.
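
For example, a two-at-a-time complex multiply on the natural interleaved layout needs nothing more than SSE3 (a sketch for illustration, not from the article):

    #include <immintrin.h>  // SSE3: moveldup/movehdup/addsub

    // Multiplies two complex<float> pairs per register, with the data kept
    // in the usual array-of-structs layout [re0, im0, re1, im1].
    __m128 cmul2(__m128 a, __m128 b) {
        __m128 ar = _mm_moveldup_ps(a);                             // [ar0, ar0, ar1, ar1]
        __m128 ai = _mm_movehdup_ps(a);                             // [ai0, ai0, ai1, ai1]
        __m128 bs = _mm_shuffle_ps(b, b, _MM_SHUFFLE(2, 3, 0, 1));  // [bi0, br0, bi1, br1]
        // real = ar*br - ai*bi, imag = ar*bi + ai*br
        return _mm_addsub_ps(_mm_mul_ps(ar, b), _mm_mul_ps(ai, bs));
    }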


> You seem to be assuming that a vectorized loop over an array of structs only ever uses one element from the struct and ignores the rest.

That is the worst case scenario of my concern with memory bandwidth.

More generally, I'd imagine that any good roofline analysis would need to consider this data-layout issue.

> but modern hardware prefetchers are so good that a vectorized loop that only looks at each Nth byte or whatever is actually much faster than one might think.

I believe you. I wasn't really thinking about how efficiently the processor can pull the correct bytes into the register lanes. I was just focusing on the fundamental memory <--> CPU xfer part of the process.

TL;DR: I'm trying to get better at using roofline analysis in my optimization efforts.


I assume they are talking about scatter/gather load and store. But even using those I assume performance typically will be worse due to cache misses.


I thought (maybe wrongly) that even with s/g load and store, the memory transactions will still be in units of contiguous addresses, sized according to the width of the memory bus.

Is that not the case?


This. Modern SIMD extensions have gathers and scatters to specifically work with these kinds of memory layout. For example, ARM64 NEON has interleaving loads and stores in the form of LD2/3/4 and respective ST* counterparts.

https://documentation-service.arm.com/static/6530e5163f12c06... (PDF)
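
A minimal NEON sketch of what those structured loads do (illustrative only, not from the linked doc):

    #include <arm_neon.h>

    // LD3 in intrinsic form: reads 12 packed floats (x0,y0,z0, x1,y1,z1, ...)
    // and deinterleaves them into three 4-lane vectors in one instruction.
    void load_xyz4(const float* p, float32x4_t* x, float32x4_t* y, float32x4_t* z) {
        float32x4x3_t v = vld3q_f32(p);
        *x = v.val[0];  // x0..x3
        *y = v.val[1];  // y0..y3
        *z = v.val[2];  // z0..z3
    }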


Sure, but how well do they perform compared to vector loads? Do they get converted to vector load + shuffle uops, and therefore require a specific layout anyway?

Last time I tried using gathers on AVX2, performance was comparable to doing scalar loads.


They are pretty good: https://dougallj.github.io/applecpu/measurements/firestorm/L...

Gathers on AVX2 used to be problematic, but I assume that shouldn't be the case today, especially if lane-crossing is minimal? (If you do know, please share!)


Gather is still terrible; the only core that handles it well is Intel's P core. AMD issues 40+ micro-ops for it in AVX2 (80 in AVX-512), and the Intel E core is much worse.

When using SIMD you must either use SoA or AoSoA for optimal performance. You can sometimes use AoS if you have a special hand coded swizzle loader for the format.


Do you know of any resources on such swizzle loaders? I've toyed around with hand-coding x86 SIMD myself, and getting everything horizontally in the right place is always a pain.


You can often find them by searching Stack Overflow (try AVX2 + deinterleave / AoS to SoA / transpose, etc.), especially in answers by Peter Cordes.

You can also write them yourself, but I'd add a verifier that checks the output against scalar code, as they can be tricky to get correct.

Intel had an article up for the 3x8 transpose, but it seems to no longer exist, so I'll just post the pseudocode:

    // xyz -> xxx
    void swizzle3_AoS_to_SoA(v8float& x, v8float& y, v8float& z) {
        v8float m14 = interleave_low_high<1, 2>(x, z);      // swap low/high 128 bits
        v8float m03 = blend<0, 0, 0, 0, 1, 1, 1, 1>(x, y);  // _mm256_blend_ps, 1 cycle
        v8float m25 = blend<0, 0, 0, 0, 1, 1, 1, 1>(y, z);

        // shuffles are all 1 cycle
        __m256 xy = _mm256_shuffle_ps(m14, m25, _MM_SHUFFLE(2, 1, 3, 2)); // upper x's and y's
        __m256 yz = _mm256_shuffle_ps(m03, m14, _MM_SHUFFLE(1, 0, 2, 1)); // lower y's and z's
        v8float xo = _mm256_shuffle_ps(m03, xy, _MM_SHUFFLE(2, 0, 3, 0));
        v8float yo = _mm256_shuffle_ps(yz, xy, _MM_SHUFFLE(3, 1, 2, 0));
        v8float zo = _mm256_shuffle_ps(yz, m25, _MM_SHUFFLE(3, 0, 3, 1));
        x = xo;
        y = yo;
        z = zo;
    }
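
If the xyz data is still in memory rather than already in registers, the commonly cited load-based variant looks roughly like this (my reconstruction using plain AVX intrinsics, since the original article is gone; treat it as a sketch):

    #include <immintrin.h>

    // Loads 8 xyz triplets (24 packed floats) starting at p and
    // deinterleaves them into x/y/z vectors of 8 lanes each.
    void load8_AoS_to_SoA(const float* p, __m256& x, __m256& y, __m256& z) {
        __m256 m03 = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_loadu_ps(p + 0)), _mm_loadu_ps(p + 12), 1);
        __m256 m14 = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_loadu_ps(p + 4)), _mm_loadu_ps(p + 16), 1);
        __m256 m25 = _mm256_insertf128_ps(_mm256_castps128_ps256(_mm_loadu_ps(p + 8)), _mm_loadu_ps(p + 20), 1);

        __m256 xy = _mm256_shuffle_ps(m14, m25, _MM_SHUFFLE(2, 1, 3, 2)); // upper x's and y's
        __m256 yz = _mm256_shuffle_ps(m03, m14, _MM_SHUFFLE(1, 0, 2, 1)); // lower y's and z's
        x = _mm256_shuffle_ps(m03, xy, _MM_SHUFFLE(2, 0, 3, 0));
        y = _mm256_shuffle_ps(yz, xy, _MM_SHUFFLE(3, 1, 2, 0));
        z = _mm256_shuffle_ps(yz, m25, _MM_SHUFFLE(3, 0, 3, 1));
    }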


Rather, it's a fundamental problem with arrays of structs.


The friction you are feeling is that arrays of structs require the computer to do more work to access individual members.


Just wrap the array of structures in SIMD; it's not that hard, at least in 95% of cases.


> I also implemented all the wide math functions to work with this. It seems that I have arranged all the data perfectly for the compiler to use automatic vectorization. But it seems this doesn’t really happen to a sufficient degree to compete with my hand written SSE2.

I will keep this example in mind the next time somebody trots out the line that you should just trust the compiler.


Very cool and informative article, and I love that Box2D is in C nowadays; that really makes it clear. Great job!

I saw a small typo:

    // wide float
    typedef b2FloatW __m128;
The `typedef` is backwards: the alias and the underlying type name are in the wrong order and need to be swapped around.
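
That is, it should read:

    // wide float
    typedef __m128 b2FloatW;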


They didn't really explain why graph coloring can be done quickly in this case. I'm not familiar with graph coloring theory, but I know the general graph coloring problems are NP-complete. I'm guessing the reason you can use a greedy algorithm to get a decent coloring quickly is because you know the graph is planar, and because you don't actually care about getting a minimal coloring.


Coloring a graph is trivial, just give every node its own color.

The NP-complete problems are finding the optimal number of colors, or determining whether the graph can be colored with a given small number of colors.

Erin's solution aims for a small number of colors, but doesn't try to be the optimal number.
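
The greedy heuristic itself is tiny. A generic first-fit sketch (not Erin's actual code, just the standard heuristic):

    #include <vector>

    // First-fit greedy coloring: each node takes the smallest color not
    // already used by one of its neighbors. O(V + E), no optimality claim.
    std::vector<int> greedy_color(const std::vector<std::vector<int>>& adj) {
        const int n = (int)adj.size();
        std::vector<int> color(n, -1);
        for (int v = 0; v < n; ++v) {
            std::vector<bool> used(n + 1, false);
            for (int u : adj[v])
                if (color[u] != -1) used[color[u]] = true;
            int c = 0;
            while (used[c]) ++c;  // smallest free color
            color[v] = c;
        }
        return color;
    }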


Probably. It's like: TSP is really really easy if you know your input is N vertices with N-1 edges forming a line.


Is the AVX2 implementation using the full 256-bit register width? If so, I am surprised that it is only 14% faster than SSE. If not, I would like to see how the results compare with the rest of the tests.


Doubling the register size only helps if CPU operations are the bottleneck. Often the bottleneck is memory access.


General questions for gamedevs here. How useful is SIMD given that now we have compute shaders on the GPU? If so, what workloads still require SIMD/why would you choose one over the other?


Physics specifically benefits from CPU processing. Efficient rendering pipelines are typically one-way (CPU -> GPU), whereas the results of physics calculations are needed both by the game logic and by rendering, and it's much simpler (and probably more efficient) to keep that computation on the CPU. The exception could be UMA architectures like the Apple M-series and the PS4, where memory transport isn't a limiting factor, though memory/cache invalidation might be an issue?


Even with UMA architectures where you eliminate the memory transport costs, it still costs a ton of time to actually launch a GPU kernel from the CPU.


Yeah, that's why I qualified with 'could'. It really depends on what facilities the hardware and driver provide. If the GPU is on the same die, perhaps the latency isn't great, but I really don't have the data on that. But I'd really like to see something like voxel deformable/destructible environments leveraging UMA on the Apple M. Seems like something that would be groundbreaking, if only Apple really cared about gaming at all.


With graphics you mostly prepare everything you want to render and then transfer all of it to the GPU. Physics still lends itself fairly well to GPU acceleration (compared to other things), but simply preparing something, transferring it to the GPU and being done is not enough. You need to at least get the results back, even just to render them, but likely also to have gameplay depend on them.

In graphics programming the expensive part is often the communication between the CPU and the GPU and trying to avoid synchronization (especially with the old graphics APIs), so transferring there and back is expensive. Also, physics code is full of branches, while graphics code usually is not. GPUs (or really wide vectorization in general) don't like branches much, and if you only run certain parts of the physics simulation on the GPU, you need to transfer there and back (and synchronize) even more.

I'm just a hobby gamedev, and I know people have done physics on the GPU (PhysX), but to me the things above sound like big hurdles.

EDIT: one more big thing is also that at least for AAA games you want to keep the GPU doing graphics so it looks good. You usually never have GPU cycles to spare.


I'm not a gamedev, but I do a lot of numerical work. GPUs are great, but they're no replacement for SIMD.

For example, I just made a little example on my desktop where I summed up 256 random Float32 numbers, and doing it in serial takes around 152 nanoseconds, whereas doing it with SIMD took just 10 nanoseconds. Doing the exact same thing with my GPU took 20 microseconds, so 2000x slower:

    julia> using CUDA, SIMD, BenchmarkTools

    julia> function vsum(::Type{Vec{N, T}}, v::Vector{T}) where {N, T}
               s = Vec{N, T}(0)
               lane = VecRange{N}(0)
               for i ∈ 1:N:length(v)
                   s += v[lane + i]
               end
               sum(s)
           end;

    julia> let L = 256
               print("Serial benchmark:  "); @btime vsum(Vec{1, Float32}, v)  setup=(v=rand(Float32, $L))
               print("SIMD benchmark:    "); @btime vsum(Vec{16, Float32}, v) setup=(v=rand(Float32, $L))
               print("GPU benchmark:     "); @btime sum(v)                    setup=(v=CUDA.rand($L))
           end;
    Serial benchmark:    152.239 ns (0 allocations: 0 bytes)
    SIMD benchmark:      10.359 ns (0 allocations: 0 bytes)
    GPU benchmark:       19.917 μs (56 allocations: 1.47 KiB)

The reason for that is simply that it just takes that long to send data back and forth to the GPU and launch a kernel. Almost none of that time was actually spent doing the computation. E.g. here's what that benchmark looks like if instead I have 256^2 numbers:

    julia> let L = 256^2
               print("Serial benchmark:  "); @btime vsum(Vec{1, Float32}, v)  setup=(v=rand(Float32, $L))
               print("SIMD benchmark:    "); @btime vsum(Vec{16, Float32}, v) setup=(v=rand(Float32, $L))
               print("GPU benchmark:     "); @btime sum(v)                    setup=(v=CUDA.rand($L))
           end;
    Serial benchmark:    42.370 μs (0 allocations: 0 bytes)
    SIMD benchmark:      2.669 μs (0 allocations: 0 bytes)
    GPU benchmark:       27.592 μs (112 allocations: 2.97 KiB)
so we're now at the point where the GPU is faster than serial, but still slower than SIMD. If we go up to 256^3 numbers, now we're able to see a convincing advantage for the GPU:

    julia> let L = 256^3
               print("Serial benchmark:  "); @btime vsum(Vec{1, Float32}, v)  setup=(v=rand(Float32, $L))
               print("SIMD benchmark:    "); @btime vsum(Vec{16, Float32}, v) setup=(v=rand(Float32, $L))
               print("GPU benchmark:     "); @btime sum(v)                    setup=(v=CUDA.rand($L))
           end;
    Serial benchmark:    11.024 ms (0 allocations: 0 bytes)
    SIMD benchmark:      2.061 ms (0 allocations: 0 bytes)
    GPU benchmark:       353.119 μs (113 allocations: 2.98 KiB)
So the lesson here is that GPUs are only worth it if you actually have enough data to saturate the GPU, but otherwise you're way better off using SIMD.

GPUs are also just generally a lot more limiting than SIMD in many other ways.


Thank you for your reply!

> GPUs are also just generally a lot more limiting than SIMD in many other ways.

What do you mean? (besides things like CUDA being available only on Nvidia/fragmentation issues.)


Here's a few random limitations I can think of other than those already mentioned:

* Float64 math is typically around 30x slower than Float32 math on "consumer-grade" GPUs due to an arbitrary limitation to stop people from using consumer grade chips for "workstation" purposes. This turns out to not be a big deal for things like machine learning, but lots of computational processes actually are rather sensitive to rounding errors and benefit a lot from using 64 bit numbers, which is very slow on GPUs.

* Writing GPU-specific functions can be quite labour intensive compared to writing CPU code. Julia's CUDA.jl and KernelAbstractions.jl packages do make a lot of things quite a bit nicer than in most languages, but it's still a lot of work to write good GPU code.

* Profiling and understanding the performance of GPU programs is typically a lot more complicated than CPU programs (even if there are some great tools for it!) because the performance model is just fundamentally more complex with more stuff going on and more random pitfalls and gotchas.


> This is a bit ironic because on x64 all math is SIMD math, it is just inefficient SIMD math.

I don't understand this statement at the end of the article? Can anyone explain? TIA.


On x86-64, compilers use SIMD instructions and registers to implement floating point math, they just use the single lane instructions. E.g. (https://godbolt.org/z/94b3r8dMn):

    float my_func(float lhs, float rhs) {
        return 2.0f * lhs - 3.0f * rhs;
    }
Becomes:

    my_func(float, float):
        addss   xmm0, xmm0
        mulss   xmm1, DWORD PTR .LC0[rip]
        subss   xmm0, xmm1
        ret
(addss, mulss and subss are SSE2 instructions.)


There is no independent scalar floating point unit in most modern CPUs. When scalar floating point arithmetic is needed, it is sent to the SIMD unit. This pretty much means that scalar and vectorised floating point operations usually have the same latency. If you do any scalar floating point operations, the CPU is just doing vectorised operations with only one useful value per register.


Is it really true that there's no scalar FPU at all? What about x87?

The instructions are still there even in 64-bit long mode, they use their own registers, and there are enough idiosyncrasies (80-bit extended-double precision, stack-based operations, etc.) that I would expect it to be easier to just include a dedicated scalar x87 FPU than try to shoehorn x87 compatibility into the SIMD units.


I assume it means "SSE2 is baseline / non-optional for x86-64, so compilers can always use SSE1/SSE2 instructions when targeting x86-64." https://stackoverflow.com/a/50786881


I really appreciate the write-up and showing the most important code snippets.

Intuitively, this feels like a narrower version of using a Z-order curve.


This is a total longshot but my PhD and research has been in graph algorithms including colouring, and I'm looking for a career change. If anyone has any advice at all (roles or companies that use these skills, tools to learn, etc) I'd be very interested.

Caveats: my knowledge is mostly theoretical (eg proving np-hardness or algorithm existence results) but I'm very good at thinking algorithmically. I have only hobbyist programming skills but I am a fast learner. Thanks!


Compiler dev is graph algorithms when it's being done well. Not a very friendly introduction to programming though.


Strange claim; can you explain a little more, please?


I generally concur. Compilers that do advanced optimizations represent both control flow and data flow of the program as a graph where the nodes can represent basic blocks or operations in the program. One of the most graph-like IRs is called the "sea of nodes" and literally represents the entire program as a directed graph of nodes and edges that represent dependencies. Most compiler passes amount to traversing, matching, and transforming one or more nodes at a time. Graph coloring itself (i.e. the graph coloring problem) is often used to solve register allocation, a key optimization needed to make efficient use of the fastest storage in modern computers, the register file. Graph coloring works on a different representation of the program called the interference graph, where nodes represent variables and edges represent interferences. Graph coloring for general graphs is NP-complete, so heuristics are used.

So yes, I mostly concur. Compiler algorithms are usually graph algorithms.


For the particularly useful subgraph of topics, see "Notes on Graph Algorithms Used in Optimizing Compilers" by Carl Offner: http://www.cs.umb.edu/~offner/files/flow_graph.pdf

That, and pretty much anything (co)authored by Robert E. Tarjan. I'm serious: https://github.com/search?q=repo%3Allvm%2Fllvm-project%20tar...

There's a good recent survey of the three workhorses: "We survey three algorithms that use depth-first search to find the strong components of a directed graph in linear time: (1) Tarjan's algorithm; (2) a cycle-finding algorithm; and (3) a bidirectional search algorithm."

Finding Strong Components Using Depth-First Search, Robert E. Tarjan and Uri Zwick: https://arxiv.org/abs/2201.07197


Control flow and data flow of programs are modelled as graphs so that certain analysis and transformations can be applied. That involves subgraph similarities, graph searches, graph transformation rules etc. And at the tail end, register allocation is graph coloring. Check out the LLVM and MLIR projects if you want some concrete modern examples.


> The Apple M2 smokes!

I was about this surprised when I made a Jupyter notebook that shuffled around and xgboosted a few gigs of numbers: after I was done prototyping on an M1 Air and ran it on my serious box (a 12700K), it was actually slower, and noticeably so.


The other problem with simd is that in modern cpu-centric languages it often requires a rewrite for every vector width.

And for 80% of the cases by the point there is enough vectorizable data for a programmer to look into simd, a gpu can provide 1000%+ of perf AND a certain level of portability.

So right now simd is a niche tool for super low-level things: certain decompression algos, bits of math here and there, solvers, etc.

And it also takes a lot of space on your cpu die. Like, A LOT.


The vector width stuff is overblown. There are in practice 3 widths across all desktop architectures (128-bit, 256-bit, 512-bit), and the 95th percentile of desktops sits squarely in the first two buckets; alternatively, you're targeting a specific SKU which you can optimize for. You'll have to write uarch-specific code for absolute peak performance anyway. It's annoying, but hardly a deal breaker for this specific task.

The bigger problem is most modern ISAs just aren't very nice to program in. Modern takes like SVE/RVV/AVX-512 are all way better here and much easier to write and think about because their instructions are much more general and have fewer edge cases.

> And it also takes a lot of space on your cpu die. Like, A LOT.

No, it doesn't (what would even be the "right" amount of die space?) But even if it did that would not make it a waste of space. Using up some amount of die space may result in huge performance gains for some subset of floating-point tasks that can't be achieved in any other way. For example software cryptography massively relies on these units, because the benefit is like multiple orders of magnitude; even if that's taking up 10% of die space for only 0.1% of software, that software has disproportionate impact -- if you took away that functional unit, then you may have a huge net decrease in efficiency overall, meaning you need more overall processors to handle the same workload from before.


I generally agree with your points. On niceness of programming, our Highway SIMD library provides portable intrinsics that cover most of the nice SVE/RVV/AVX-512. Those intrinsics are reasonably emulated on other ISAs. Is it fair to say problem solved?


I do not understand this take :) The vast majority of our code is vector length agnostic, as is required for SVE and RVV.

GPU perf/TCO is not always better than CPU, if you can even get enough GPUs, and the latency is also a concern and inducement to do everything on the GPU, which can be limiting.


> a gpu can provide 1000%+ of perf AND a certain level of portability.

Relying on the GPU only makes sense in a handful of contexts, e.g. "computing stuff fast is my core task" or "this will be deployed on beefy workstations". SIMD addresses all the remaining cases and will give benefits to virtually everyone using your software; I'm sure one could implement a GPU-based JSON parsing library that would blow simdjson out of the water, but I'm not going to deploy GTX 1060s on my cloud machines to enjoy it.


Basically everything you said was wrong...

   The other problem with simd is that in modern cpu-centric languages it often requires a rewrite for every vector width.
Nope, you can use many existing libraries that present the same interface for all sizes, or write your own(which is what I did).

   And for 80% of the cases by the point there is enough vectorizable data for a programmer to look into simd, a gpu can provide 1000%+ of perf AND a certain level of portability.
Transfer latency and bandwidth to the GPU are horrible, just utterly horrible. And the GPU-to-CPU perf difference is more like 5x, and in games etc. the GPU is already nearly maxed out.

   And it also takes a lot of space on your cpu die. Like, A LOT.
Relative to the performance it can offer, it is a very small area. The gains in the article are small compared to what I see; probably the author is new to SIMD.


Compilers are becoming reasonably good at autovectorization if you’re a bit careful how you write your code. I wouldn’t say that simd is niche. You often don’t get the really great improvements you can achieve by being clever manually, but the improvements are still very measurable in my experience.
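
A typical example of a loop that current gcc/clang will usually vectorize on their own with optimizations enabled (illustrative; results vary by compiler and flags):

    // No aliasing (restrict), no data-dependent branches, unit stride:
    // exactly the shape auto-vectorizers are built to recognize.
    void saxpy(float* __restrict y, const float* __restrict x, float a, int n) {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }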


IMHO, the overhead of perpetually babysitting compiler diagnostics or performance metrics to ensure your latest update didn't confound the auto-vectorizer is never a net positive over just using something like xsimd, Google Highway, etc.


clang may be getting better, but gcc isn't.

Being 'careful how you write your code' is putting it mildly. IME you have to hold the compiler's hand at every step. Using 'bool' instead of 'int' can make gcc give up on autovectorisation, for example.

You need to use intrinsics, or a wrapper lib around them, if you want to be sure that SIMD is being used.


The compiler can’t fix your data layout


It depends on how high-level the abstraction you're using is. Things like Halide do pretty significant reorganization of your code and data to achieve parallel speedup. And of course if you are using Tensor libraries you're doing exactly that. It all depends on the level of abstraction.


I assume that if we're talking about SIMD, we are not talking about Python, etc.


You can just use templates or macros to make it length-agnostic.
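
A toy sketch of the template approach (plain C++, no intrinsics; real libraries map a Vec<T, N> like this onto intrinsics per target, and the fixed-size loops are easy for the compiler to unroll and vectorize):

    #include <cstddef>

    // Width-agnostic vector type: the algorithm is written once against
    // Vec<T, N>, and N is chosen per target (4 for SSE, 8 for AVX, ...).
    template <typename T, std::size_t N>
    struct Vec {
        T lane[N];
        friend Vec operator+(Vec a, Vec b) {
            Vec r;
            for (std::size_t i = 0; i < N; ++i) r.lane[i] = a.lane[i] + b.lane[i];
            return r;
        }
        friend Vec operator*(Vec a, Vec b) {
            Vec r;
            for (std::size_t i = 0; i < N; ++i) r.lane[i] = a.lane[i] * b.lane[i];
            return r;
        }
    };

    // Same source for any width: Vec<float, 4>, Vec<float, 8>, ...
    template <typename V>
    V fma_like(V a, V x, V y) { return a * x + y; }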


> The other problem with simd is that in modern cpu-centric languages it often requires a rewrite for every vector width.

It does not: https://github.com/bepu/bepuphysics2/blob/master/BepuUtiliti... (in this case, even if you use width-specific Vector128/256/512 - compiler will unroll operations on them into respective 256x2/128x4/128x2 if the desired width is not supported. Unfortunately .NET does not deal as well with fallbacks for horizontal reductions/shuffles/etc. on such vectors but it’s easily fixable with a few helpers)

Naturally, there are other languages that offer similar experience like Zig or Swift.

Moreover, sometimes just targeting one specific width is already good enough (like 128b).

Also you can’t (easily) go to GPU for most tasks SIMD is used in CPUs today. Good luck parsing HTTP headers with that.


It's unlikely everyone's cloud machines have GPUs.



