The issue here is the SysV amd64 ABI. You could also just make your language-internal ABI not be SysV? As long as these aren't exposed to SysV C callers, you can use any calling convention you want.

https://llvm.org/docs/LangRef.html#calling-conventions

For those curious, the relevant diff in neatlang is: https://github.com/Neat-Lang/neat/commit/f4ba38cefc1e26631a5.... It looks much more involved than changing the emitted LLVM calling conventions. Possibly the author wants these types exposed with some deterministic calling convention to C programs.
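
For illustration only (a hedged sketch: this is not what the neat commit does, and the attribute is a Clang extension), an internal-only helper can be given a non-SysV convention without affecting anything a plain C caller sees:

    /* hypothetical runtime helper: preserve_most changes the callee-saved
       register set, so it is no longer plain SysV and must only be called
       from language-internal call sites */
    __attribute__((preserve_most))
    void runtime_refcount_inc(void *obj);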



Or ABIs in general, really.

As any Asm programmer can tell you, this is one of the low-hanging fruits where compilers can easily be beaten --- don't blindly follow convention, do what makes the most sense in the specific scenario.


indeed. another fun thing that c-like compilers generally don't do is multiple entry points in a function.

consider a pair of functions foo and foo0 that differ only in that the former performs an additional action on its argument -- perhaps refcount adjustment or type conversion.

you can then do (in a register-based abi like amd64 sysv):

  foo:
  do the action ;fallthrough
  foo0:
  rest of the function follows
  
and have essentially two functions for the price of one.


Fun fact: Fortran has an ENTRY statement specifically for this purpose (obsoleted in the 2008 version).


This is also why “entry” was a reserved word in the original C language. (So was “fortran”.)


Hah, that’s neat.

I was at a place that had a convention to do

    Function blah(…)
    Argument validation
    Call realBlah(…)
That way internally we could just call realBlah if we needed the functionality of blah without the validation. So I guess we could have used this all over the place.
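
Roughly like this (a hedged C sketch; the validation and body are made up for illustration):

    /* the real work, no validation; internal callers can call this directly */
    static double realBlah(double x) {
        return x * x;
    }
    /* public entry point: validate, then delegate */
    double blah(double x) {
        if (x < 0.0)
            return -1.0;   /* made-up error handling */
        return realBlah(x);
    }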

I guess (although could be wrong, this is the first time I’ve seen it) this strategy is basically incompatible with inlining, though?


why would it?

recursive inlining


There’s apparently some deep and obvious connection between recursion and multiple entry points, but I don’t know what it is.


Won't tail call optimization mostly mitigate this?


Rarely if ever. This isn't about tail-calling, it's about function placement in the final image to enable branch elimination. There's no call between foo and foo0, but many of the requirements for a tail call are also required here.

You can structure the code so foo calls foo0, but the compiler and linker have to work together to pull that off and I don't think GCC and clang do so.

If the functions are being built into split sections, generally no.

If they aren't, and foo isn't called but foo0 is, I've never seen a toolchain remove just foo while keeping foo0, which would be nice to have.


I think tail call optimization is specifically for this. it will let a function jump to another function rather than returning and then calling it. isn't that what's essentially described here? (honest question - i am always doubting my sanity looking at this stuff :DD)

I can imagine that if higher-level code doesn't follow a specific pattern, the compiler might struggle to recognize an opportunity for optimizing it and skip it - the higher-level programmer could potentially arrange code in ways the optimizers recognize better.


> it will let a function jump to another function ... isn't that what's essentially described here?

Not exactly, they're describing doing it without a jump - the first function simply ends at the start of the second function, so the CPU starts running the second function directly after the first with no jump necessary.

Edit: If you're saying a tail call could enable such an optimization, you're right, but it still requires placing the functions in the right spots to eliminate the jump entirely, which is hard.


hey, thanks a lot for the elaboration / clearing up. my wording on jump was bad, but this comment drives that into my brain nicely :D...

i think your edit is on point. optimizations exist, but the hard thing is knowing when to apply them and then restructuring the binary etc. - you can imagine optimizing one bit only to find further code can't work because that optimization breaks it. or different phases of optimization either feeding each other more optimization opportunities or negating them (how to order optimizations). compilers and optimizers are such magic really, how far we've come there. i got one book bigger and fatter than any i've ever seen, thicker than an oldskool bible. and it's on compilers :'). it's so big i'm afraid to start it!


Not sure how often it's worth the effort though. Unconditional branches, while not free, are not quite the same performance trap as conditional branches.


Agreed.

Where it mattered for me was on an ARM core managing a much larger DSP. The DSP consumed most of the memory bandwidth, so fetching a cacheline of instructions or an MMU mapping into the ARM had long and variable latency as it had to wait for the DSP to finish a large burst to or from the shared memory.


Inlining is what can mostly mitigate this. You'd write:

  int foo0(void);
  int foo(void) {
    // do something
    return foo0();
  }
  int foo0(void) {
    // do the rest of the things
    return 0;
  }
If you can convince the compiler to inline foo0 into foo, then you get almost what you want. The compiler technically could even use the same code for both (saving some binary size and thus RAM, thus getting the exact same result), though AFAIK this sort of optimization is unusual.
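
If the compiler needs persuading, GCC and Clang accept an inlining hint (hedged sketch; always_inline is a compiler extension, not standard C):

  static inline __attribute__((always_inline)) int foo0(void) {
    // do the rest of the things
    return 0;
  }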


An extra gimmick is that it is often better to inline foo into its caller than foo0 into foo.

For example, in the following, that saves the cost of a function call whenever bar is nil.

    fn foo(bar) {
        if bar != nil {
            foo0(bar)
        }
    }


Not inlining, but what I've seen in the real world is that the function foo ends with something like a "jmp foo0" so everything is good. (This jmp is almost free with 100% branch prediction.) No need to inline. Just do a proper tail call optimization. Without symbols you can't tell whether they are two functions or just basic blocks in a single function.
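
i.e. something like this (a hedged sketch; whether the jmp actually materialises depends on optimization level and the usual sibling-call restrictions):

    void foo0(void *p);
    void foo(void *p) {
        /* the extra action, e.g. a refcount bump */
        foo0(p);   /* tail position: typically compiled to "jmp foo0" */
    }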


Yeah it should.


Even more fun is using jmp to chain functions together -- only works when you don't have locals and have complete knowledge of your registers!


CALL and RET are just JMPs that automatically keep track of return locations using a stack. But...

--You don't have to use a stack. A stack is a convenient data structure for this purpose, but there are other ways to do it.

--You don't even need to keep track of return points at all. That's what continuations are about: Nothing ever returns.


You can JMP to a "function" (which JMPs to another function... etc.) where the last function RETs. It all works like it's a single function body.
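
You can even ask for this from C nowadays (hedged sketch, assuming Clang's musttail statement attribute; the function signatures have to match for it to be accepted):

    int stage3(int x) { return x + 1; }
    int stage2(int x) { __attribute__((musttail)) return stage3(x * 2); }
    int stage1(int x) { __attribute__((musttail)) return stage2(x - 7); }
    /* stage1 jumps to stage2, which jumps to stage3; only stage3's RET runs */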



Yeah, for the little programs I write on AVR-8 it drives me nuts how much meaningless activity (moving the stack pointer around) the C compiler does compared to assembly. For a PoV display engine, for instance, you might be able to reserve a few registers for the interrupt handler and still keep your inner-loop variables entirely in registers.
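
(For what it's worth, avr-gcc will let you pin a variable to a register; a hedged sketch, with the register choice and interrupt vector assumed, and every object file has to agree not to touch that register, e.g. via GCC's -ffixed-<reg> option:)

    #include <stdint.h>
    #include <avr/interrupt.h>
    /* whole-program reserved register shared by the ISR and the main loop */
    register uint8_t pov_column asm("r3");
    ISR(TIMER0_COMPA_vect) {   /* vector name depends on the part */
        pov_column++;          /* no RAM access, no push/pop of r3 */
    }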


Link time optimisation is supposed to fix that (the meaningless activity). Especially on a bare metal build where you can be certain what needs to call your function. But it wouldn't surprise me if it didn't always.

Unlikely to be able to reserve a register though


Compilers can already vary how parameters are passed to internal functions, especially if link time optimizations are enabled. Not sure how good a job they do at improving performance that way, but I certainly did not enjoy how this complicated reverse engineering.


first question that came to my mind, you answered :) thanks! i think it's interesting that so much adheres to these ABIs, especially since they were conceived quite a while ago and often lean towards compatibility with older CPUs, while newer ones with more and wider registers might have features that could improve this without making the structs shorter. I guess it's not super interesting to make software for specific hardware or hardware classes / generations as it'll be unusable on some machines, but having compilers that _can_ produce it might be cool if you want to super-optimise the code running on your system towards that system's hardware features.



