You can bump upwards without an overflow check even if the size is variable.
I’ve seen multiple ways to do it. And I’ve lost days of my life to coming up with pointlessly amusing variants that get there in weird ways.
Yeah I’m also amused that they didn’t get into the fact that downward bump is just not what the HW expects you to do. You’d need a real benchmark to see that effect. And, it’s an open question whether this would matter unless your allocation rate was obscene (if it isn’t then most scans of memory will be normal loops and not the allocator).
Because stacks aren’t unbounded. You grow and shrink them. Therefore, regardless of which direction a stack grows in, it also shrinks in the opposite direction with the same likelihood.
> fact that downward bump is just not what the HW expects you to do.
What fact is this? And what does it have to do with the stack being bounded? (The bounds of the stack are unknown to most microarchitectural state and are only stored in OS structures.)
HW prefetchers support access patterns in either direction, and so do load/store instructions (stm(fd)/ldm(fd) on 32-bit ARM; ldp/stp on ARM64).
The fact that stacks are bounded means you both push and you pop. Just as frequently as a function calls another one, resulting in memory ops moving to lower addresses, functions also return, causing memory ops to move back to higher addresses. Pushes and pops are balanced. How do we know they are balanced? Because the stack is bounded.
So - the direction of stack growth isn’t interesting to prefetching strategy.
Phil style: track end and remaining. End doesn’t change. To allocate size, check that it’s <= remaining, then subtract it from remaining. End minus remaining (using the old value of remaining) is the result. Has two variants - one burns an extra register but saves an instruction. Empirically, this one is fast enough that I just use it all the time.
Bmalloc style bump: independently subtract from remaining and add to bump. It’s more instructions but they are instruction level parallel.
Another friend’s bump style: have bump/end ptrs like usual. Fast path checks if size <= end - bump. Then you add to bump like usual.
Note that instruction counts on these don’t matter so much on modern CPUs. But dependency chain length matters, a lot. Register usage matters only if you want to inline the allocator (and it’s not always beneficial to do so).