The issue here is the SysV amd64 ABI. You could also just make your language-internal ABI not be SysV? As long as these aren't exposed to SysV C callers, you can use any calling convention you want.
For those curious, the relevant diff in neatlang is: https://github.com/Neat-Lang/neat/commit/f4ba38cefc1e26631a5.... It looks much more involved than changing the emitted LLVM calling conventions. Possibly the author wants these types exposed with some deterministic calling convention to C programs.
As any Asm programmer can tell you, this is one of the low-hanging fruits that compilers can easily be beaten at --- don't blindly follow convention, do what makes the most sense in a specific scenario.
indeed. another fun thing that c-like compilers generally don't do is multiple entry points in a function.
consider a pair of functions foo and foo0 that differ only in that the former performs an additional action on its argument -- perhaps refcount adjustment or type conversion.
you can then do (in a register-based abi like amd64 sysv):
foo:
    do the action                  ; fall through into foo0
foo0:
    rest of the function follows
and have essentially two functions for the price of one.
Function blah(…)
    Argument validation
    Call realBlah(…)
That way internally we could just call realBlah if we needed the functionality of blah without the validation. So I guess we could have used this all over the place.
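A minimal sketch of that pattern in C (the names blah/realBlah come from the comment above; the argument types and the particular checks are made up for illustration):

#include <errno.h>
#include <stddef.h>

/* Internal worker: assumes the arguments were already validated. */
static int realBlah(const char *buf, size_t len)
{
    /* ... the actual work ... */
    (void)buf;
    return (int)len;
}

/* Public entry point: validate, then forward to the worker. */
int blah(const char *buf, size_t len)
{
    if (buf == NULL || len == 0)
        return -EINVAL;
    return realBlah(buf, len);
}

Callers in the same module that already know their arguments are good can call realBlah directly and skip the checks - the same two-for-one idea as the fallthrough trick, just expressed at the source level.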
I guess (although could be wrong, this is the first time I’ve seen it) this strategy is basically incompatible with inlining, though?
Rarely if ever. This isn't about tail-calling, it's about function placement in the final image to enable branch elimination. There's no call between foo and foo0, but many of the requirements for a tail call are also required here.
You can structure the code so foo calls foo0, but the compiler and linker have to work together to pull that off and I don't think GCC and clang do so.
If the functions are being built into split sections, generally no.
If they aren't, and foo isn't called but foo0 is, I've never seen a toolchain remove just foo but not foo0, which would be nice to have.
I think tail call optimization is specifically for this. it will let a function jump to another function rather than calling it and then returning. isn't that what's essentially described here? (honest question - i am always doubting my sanity looking at this stuff :DD)
I can imagine that if higher-level code doesn't follow a specific pattern, the compiler might struggle to recognize an optimization opportunity and skip it - the higher-level programmer could potentially arrange code in ways the optimizer recognizes better.
> it will let a function jump to another function ... isn't that what's essentially described here?
Not exactly, they're describing doing it without a jump - the first function simply ends at the start of the second function, so the CPU starts running the second function directly after the first with no jump necessary.
Edit: If you're saying a tail call could enable such an optimization, you're right, but it still requires placing the functions in the right spots to eliminate the jump entirely, which is hard.
hey, thanks a lot for the elaboration / clearing up. my wording on jump was bad, but this comment drives that into my brain nicely :D...
i think your edit is on point. optimizations exist, but the hard thing is knowing when to apply them and then restructuring the binary etc. - you can imagine you might optimize one bit only to find it's impossible to run further code etc. because the optimization breaks it. or even different phases of optimization either feeding each other more optimization opportunities or negating them (how to order optimizations). compilers and optimizers are such magic really, how far we've come there. i got one book bigger and fatter than any i've ever seen, thicker than an oldskool bible. and it's on compilers :'). it's so big i'm afraid to start it!
Where it mattered for me was on an ARM core managing a much larger DSP. The DSP consumed most of the memory bandwidth, so fetching a cacheline of instructions or an MMU mapping into the ARM had long and variable latency as it had to wait for the DSP to finish a large burst to or from the shared memory.
Inlining is what can mostly mitigate this. You'd write:
int foo0(void);   /* forward declaration */

int foo(void) {
    // do something
    return foo0();
}

int foo0(void) {
    // do the rest of the things
    return 0;
}
If you can convince the compiler to inline foo0 into foo, then you get almost what you want. The compiler technically could even use the same code for both (saving some binary size and thus RAM, and getting the exact same result), though AFAIK this sort of optimization is unusual.
Not inlining, but what I've seen in the real world is that the function foo ends with something like a "jmp foo0" so everything is good. (This jmp is almost free with 100% branch prediction.) No need to inline. Just do a proper tail call optimization. Without symbols you can't tell whether they are two functions or just basic blocks in a single function.
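As a hedged concrete example (function names made up), this is the shape that triggers it - GCC and Clang at -O2 will usually lower a call in tail position to a bare jmp, since foo has nothing left to do after foo0 returns:

int foo0(int x);        /* defined in another translation unit */

int foo(int x)
{
    x += 1;             /* the extra work that only foo does */
    return foo0(x);     /* tail position: typically emitted as "jmp foo0" */
}

In GCC this is -foptimize-sibling-calls, on by default at -O2 and above.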
Yeah, for the little programs I write on AVR-8 it drives me nuts how much meaningless activity (moving the stack pointer around) that the C compiler does compared to assembly. For a PoV display engine, for instance, you might be able to reserve a few registers for the interrupt handler and still keep your inner loop variables entirely in registers.
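If it's useful, here's a minimal sketch of that register-reserving idea, assuming avr-gcc: its global register variable extension pins a variable to a register (build everything with -ffixed-r3 so other code stays out of it), so the value never has to be loaded from or stored to RAM inside the handler. The register, variable name, and interrupt vector are illustrative and device-dependent.

#include <avr/interrupt.h>
#include <stdint.h>

/* Pin the PoV column counter to r3 for the whole program
   (compile every file with -ffixed-r3 so nothing else touches it). */
register uint8_t pov_column asm("r3");

ISR(TIMER0_COMPA_vect)       /* vector name depends on the part */
{
    pov_column++;            /* stays in r3: no lds/sts to RAM */
}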
Link time optimisation is supposed to fix that (the meaningless activity). Especially on a bare metal build where you can be certain what needs to call your function. But it wouldn't surprise me if it didn't always.
Compilers can already vary how parameters are passed to internal functions, especially if link time optimizations are enabled. Not sure how good a job they do at improving performance that way, but I certainly did not enjoy how this complicated reverse engineering.
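A rough illustration (names made up): a static function whose address never escapes only has to agree with its own callers, so at -O2 or with LTO the compiler is free to pick its own register assignment, clone it with constants baked in (the .constprop/.isra suffixes you see in GCC-built binaries), or inline it away entirely - which is part of what makes the disassembly less uniform for reverse engineering.

#include <stddef.h>

/* Not visible outside this file and its address is never taken,
   so calls to scale() don't have to follow the SysV convention. */
static long scale(long value, long factor)
{
    return value * factor;
}

long scale_all(const long *values, size_t n, long factor)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += scale(values[i], factor);  /* likely inlined or cloned */
    return total;
}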
first question that came to my mind you answered :) thanks! i think it's interesting that a lot of stuff adheres to such ABIs etc., especially since they were conceived quite a while ago and often lean towards being compatible with older CPUs, where newer ones with more extended registers etc. might have features that could be used to improve this without making the structs shorter. I guess it's not super interesting to make software for specific hardware or hardware classes / generations as it'll be unusable on some machines, but having compilers etc. that _can_ produce it might be cool if you want to super optimise the code running on your system towards your system's hardware features.
https://llvm.org/docs/LangRef.html#calling-conventions