
    > It's extremely simple, and breaks down the most complex networks into 4 OpTypes:
    >
    > - UnaryOps operate on one tensor and run elementwise. RELU, LOG, RECIPROCAL, etc...
    > - BinaryOps operate on two tensors and run elementwise to return one. ADD, MUL, etc...
    > - ReduceOps operate on one tensor and return a smaller tensor. SUM, MAX
    > - MovementOps operate on one tensor and move the data around, copy-free with ShapeTracker. RESHAPE, PERMUTE, EXPAND, etc...
    >
    > But how...where are your CONVs and MATMULs? Read the code to solve this mystery.
Ok, I was curious, so I read the code. The answer is that it represents a MATMUL as a 1x1 CONV. And it lied about CONV, which exists as ProcessingOps.CONV, explicitly represented and implemented: https://github.com/geohot/tinygrad/blob/c0050fab8ff0bc667e40... Quite the letdown of figuring out this 'mystery'.
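
For anyone who wants to see the trick concretely, here's a quick check of the matmul-as-1x1-conv identity. PyTorch is used purely for illustration (tinygrad's internals obviously differ), and the tensor names are mine:

    import torch
    import torch.nn.functional as F

    # Sketch only: a matmul expressed as a 1x1 convolution.
    m, k, n = 4, 5, 3
    A = torch.randn(m, k)
    B = torch.randn(k, n)

    # Rows of A become a batch of k-channel 1x1 "images";
    # columns of B become the n output channels of a 1x1 conv.
    x = A.reshape(m, k, 1, 1)        # (batch, in_ch, H, W)
    w = B.T.reshape(n, k, 1, 1)      # (out_ch, in_ch, kH, kW)
    out = F.conv2d(x, w).reshape(m, n)

    assert torch.allclose(out, A @ B, atol=1e-5)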


That CONV is only used by the older backends. The GPU and LLVM backends rewrite CONV as MUL+SUM, to be fused later, and thus only use the 4 OpTypes.

https://github.com/geohot/tinygrad/blob/master/tinygrad/lazy...
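
To make that concrete, here's the shape of the trick in NumPy (my sketch, not tinygrad's code): the conv becomes a copy-free movement op that exposes the sliding windows, an elementwise MUL, and a SUM reduce.

    import numpy as np

    def conv2d_as_mul_sum(x, w):
        # x: (H, W) input, w: (kH, kW) kernel; valid padding, stride 1
        kH, kW = w.shape
        # MovementOp: a strided, copy-free view of all kH x kW windows,
        # in the spirit of ShapeTracker
        windows = np.lib.stride_tricks.sliding_window_view(x, (kH, kW))
        # BinaryOp MUL (broadcast) + ReduceOp SUM over the kernel dims
        return (windows * w).sum(axis=(-2, -1))

    x, w = np.random.randn(6, 6), np.random.randn(3, 3)
    ref = np.array([[(x[i:i+3, j:j+3] * w).sum() for j in range(4)]
                    for i in range(4)])
    assert np.allclose(conv2d_as_mul_sum(x, w), ref)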


That's cool. Am I right in assuming that you want to automate the production of efficient GPU (or other accelerator) code based on these low-level primitives? But you would still need a piece of sorcery that can produce high-performance OpenCL code, right? And that code could be different for every device, so you would need some trial-and-error, benchmark-based compilation at the very least. Or would the OpenCL code be generated by hand for each device?


Yea, benchmark-based compilation; that's already happening in the tinygrad compiler we use for openpilot to determine the local group size. https://github.com/geohot/tinygrad/blob/caea34c52996cde2ed46...

Working on parameterizing a search space that includes more than the local group size. The end dream is some ML-guided search to optimize the kernels :)
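
The idea, as a toy sketch (the function names and candidate list here are mine, not tinygrad's): time each candidate local size on the actual device and keep the fastest.

    import time

    def autotune(run_kernel, candidates, warmup=2, reps=5):
        # Toy benchmark-based search: run each candidate a few times
        # and keep whichever local size is fastest on this device.
        best, best_t = None, float("inf")
        for local_size in candidates:
            for _ in range(warmup):
                run_kernel(local_size)      # warm caches / JIT
            t0 = time.perf_counter()
            for _ in range(reps):
                run_kernel(local_size)
            t = (time.perf_counter() - t0) / reps
            if t < best_t:
                best, best_t = local_size, t
        return best

    # e.g. autotune(launch, [(8, 8), (16, 16), (32, 4), (64, 1)])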


OK, generally I think you're doing exactly what ML is lacking right now. Another huge opportunity: instead of taking the average neural network and designing accelerators for it, design hardware-friendly networks that run well on a sane accelerator built to work with only these specialised networks (one that doesn't need 80% of its chip area for on-chip memory, for example). These might end up being completely different networks from what researchers use today. I work in this area, and I think it's also possible to use the loss function to optimise the network for specific HW.


I've done some work in the past on NN representations, and you can in fact represent Conv and MatMul in more primitive ways. I ended up writing an IR called loop_tool that exposes this stuff:

https://github.com/facebookresearch/loop_tool/blob/main/pyth...

The idea is basically this: https://news.ycombinator.com/item?id=28883086
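
The same decomposition works for matmul, e.g. in NumPy (my sketch, not loop_tool syntax): broadcast both operands into a shared (m, k, n) iteration space, multiply elementwise, and reduce over k.

    import numpy as np

    A = np.random.randn(4, 5)   # (m, k)
    B = np.random.randn(5, 3)   # (k, n)

    # Movement (reshape/expand) + elementwise MUL + SUM reduce over k
    C = (A[:, :, None] * B[None, :, :]).sum(axis=1)
    assert np.allclose(C, A @ B)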


To directly quote the source:

    # these are the llops your accelerator must implement, along with toCpu
    UnaryOps = Enum("UnaryOps", ["NOOP", "NEG", "RELU", "EXP", "LOG", "SIGN", "RECIPROCAL"])
    BinaryOps = Enum("BinaryOps", ["ADD", "SUB", "MUL", "DIV", "POW", "CMPEQ"])
    ReduceOps = Enum("ReduceOps", ["SUM", "MAX"])
    MovementOps = Enum("MovementOps", ["RESHAPE", "PERMUTE", "EXPAND", "FLIP", "STRIDED", "PAD", "SHRINK"])
    ProcessingOps = Enum("ProcessingOps", ["CONV"])
https://github.com/geohot/tinygrad/blob/caea34c52996cde2ed46...

There is a MAX but not a MIN? Is that because min(x,y) = -max(-x,-y)? But then why is there a SUB? Why is there a RELU if it's only max(0,x)? Maybe MIN is just too rare to be worth implementing?


Min is an HLOP.

From: https://github.com/geohot/tinygrad/blob/master/tinygrad/tens...

    def min(self, axis=None, keepdim=False): return -((-self).max(axis=axis, keepdim=keepdim))

All folded together, no slower than MAX.
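
Easy to sanity-check the negate-max-negate identity, e.g. in NumPy:

    import numpy as np

    x = np.random.randn(3, 4)
    # min via the same trick the HLOP uses
    assert np.allclose(-((-x).max(axis=1)), x.min(axis=1))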


But then SUB, DIV, and RELU could be HLOPs as well, no?


We could have NEG instead of SUB, but with the constant folding it's a wash. DIV is already an HLOP with RECIPROCAL (it used to use POW, but that was slower). And what would you implement RELU in terms of?
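
(For the record, the RECIPROCAL-based DIV is roughly this, paraphrasing rather than quoting the source:)

    # paraphrase of the idea, not tinygrad's verbatim source
    def div(self, y): return self * y.reciprocal()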


max(0,x)


That's a ReduceOp right now, more annoying to reason about than a UnaryOp. But in the limit, yea. Or add an elementwise BinaryOp for max.
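
With a hypothetical elementwise binary max (np.maximum semantics), RELU would fold down to a one-liner:

    import numpy as np

    # hypothetical: RELU as an elementwise binary max against 0
    def relu(x): return np.maximum(x, 0.0)

    assert (relu(np.array([-2.0, 0.0, 3.0])) == np.array([0.0, 0.0, 3.0])).all()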

Submit a PR if you can improve something!


Very similar idea to Jittor; convolution can definitely be broken down: https://github.com/Jittor/jittor/blob/master/python/jittor/n...


Just looking at the code from my phone, but it seems that the conv op calls another primitive and einsum, which I believe is just a fancy MUL with broadcasting? So it might still be technically correct?


Einsum is an expressive way of doing elementwise products and then possibly reducing them. An einsum is essentially a description of the dimensions of the input tensors and the dimensions of the resulting output after multiplication. If the output has reduced dimensions, a summation is applied over them. The einops package provides reductions such as summation, averaging, and so on.

For example, the einsum "b k n p, k -> b k n p" broadcasts the second tensor b to b[None, :, None, None] and does elementwise multiplication. It can be changed to a vector product by writing "b k n p, k -> b n p", which for all intents and purposes is identical to a.transpose(0, 2, 3, 1) @ b.

I can easily recommend the einops package; using einsum simplifies things significantly.
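
Both forms check out in NumPy, for anyone who wants to verify (array names as in the example above):

    import numpy as np

    a = np.random.randn(2, 3, 4, 5)   # b k n p
    b = np.random.randn(3)            # k

    # broadcasting form: no reduction, pure elementwise multiply
    assert np.allclose(np.einsum("bknp,k->bknp", a, b),
                       a * b[None, :, None, None])

    # reducing form: sums over k, matching the transpose-then-matmul
    assert np.allclose(np.einsum("bknp,k->bnp", a, b),
                       a.transpose(0, 2, 3, 1) @ b)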



