You can bump upwards without an overflow check even if the size is variable.
I’ve seen multiple ways to do it. And I’ve lost days of my life to coming up with pointlessly amusing variants that get there in weird ways.
Yeah I’m also amused that they didn’t get into the fact that downward bump is just not what the HW expects you to do. You’d need a real benchmark to see that effect. And, it’s an open question whether this would matter unless your allocation rate was obscene (if it isn’t then most scans of memory will be normal loops and not the allocator).
Because stacks aren’t unbounded. You grow and shrink them. Therefore, regardless of which direction a stack grows in, it also shrinks in the opposite direction with the same likelihood.
> fact that downward bump is just not what the HW expects you to do.
What fact is this? And what does it have to do with the stack being bounded? (The bounds of the stack are unknown to most microarchitectural state and are only stored in OS structures.)
HW prefetchers support access patterns in either direction, and so do load/store instructions (stm(fd)/ldm(fd) on 32-bit ARM; ldp/stp on ARM64).
The fact that stacks are bounded means you both push and you pop. Just as frequently as a function calls another one, resulting in memory ops moving to lower addresses, functions also return, causing memory ops to move back to higher addresses. Pushes and pops are balanced. How do we know they are balanced? Because the stack is bounded.
So - the direction of stack growth isn’t interesting to prefetching strategy.
Phil style: track end and remaining. End doesn’t change. To allocate size, check that it’s <= remaining, then subtract it from remaining. End minus remaining (using the old value of remaining) is the result. Has two variants - one burns an extra register but saves an instruction. Empirically, this one is fast enough that I just use it all the time.
Bmalloc style bump: independently subtract from remaining and add to bump. It’s more instructions but they are instruction level parallel.
Another friend’s bump style: have bump/end ptrs like usual. Fast path checks if size <= end - bump. Then you add to bump like usual.
Note that instruction counts on these don’t matter so much on modern CPUs. But dependency chain length matters, a lot. Register usage matters only if you want to inline the allocator (and it’s not always beneficial to do so).