> It's extremely simple, and breaks down the most complex networks into 4 OpTypes:
>
> - UnaryOps operate on one tensor and run elementwise. RELU, LOG, RECIPROCAL, etc...
> - BinaryOps operate on two tensors and run elementwise to return one. ADD, MUL, etc...
> - ReduceOps operate on one tensor and return a smaller tensor. SUM, MAX
> - MovementOps operate on one tensor and move the data around, copy-free with ShapeTracker. RESHAPE, PERMUTE, EXPAND, etc...
>
> But how...where are your CONVs and MATMULs? Read the code to solve this mystery.
Ok, I was curious, so I read the code. The answer is that it represents a MATMUL as a 1x1 CONV. And it lied about CONV, which is explicitly represented and implemented as ProcessingOps.CONV: https://github.com/geohot/tinygrad/blob/c0050fab8ff0bc667e40... Quite the letdown after figuring out this 'mystery'.
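To make the "MATMUL as a 1x1 CONV" claim concrete, here's a small numpy sketch (my own illustration, not tinygrad's code): each row of the input becomes a 1x1 "image" whose channels are the input features, and a 1x1 conv then reduces to a per-pixel channel-mixing dot product.

```python
import numpy as np

def conv2d_1x1(x, w):
    # x: (batch, cin, 1, 1), w: (cout, cin, 1, 1)
    # A 1x1 conv is just a dot product over channels at each pixel.
    return np.einsum('bihw,oihw->bohw', x, w)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))   # (batch, in_features)
w = rng.standard_normal((3, 5))   # (in_features, out_features)

matmul = x @ w
# Reshape to NCHW with a 1x1 spatial extent and run the "conv".
as_conv = conv2d_1x1(x[:, :, None, None], w.T[:, :, None, None])[:, :, 0, 0]
print(np.allclose(matmul, as_conv))  # True
```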
That's cool, am I right in assuming that you want to automate the production of efficient GPU (or other accelerator) code based on these low level primitives? But you would still need a piece of sorcery that can produce high performance OpenCL code, right? And that code could be different for every device, so you would need some trial and error, benchmark-based compilation at the very least. Or would OpenCL code be generated by hand for each device?
Working on parameterizing a search space that includes more than the local group size. The end dream is some ML guided search to optimize the kernels :)
OK, generally I think you're doing exactly what I believe ML is lacking right now. Another huge opportunity: instead of taking the average neural network and designing accelerators for it, design hardware-friendly networks that run well on a sane accelerator built only for these specialised networks (one that doesn't need 80% of its chip area for on-chip memory, for example). These might end up being completely different networks from what researchers use today. I work in this area, and I think it's also possible to use the loss function to optimise the network for specific HW.
I've done some work in the past on NN representations and you actually can represent Conv and MatMul in more primitive ways. I ended up writing an IR called loop_tool that exposes this stuff:
There is a MAX but not a MIN? Is that because max(x,y) = -min(-x,-y)? But then why is there a SUB? Why is there a RELU if it's only max(0,x)? Maybe MIN is just too rare to be worth implementing?
We could have NEG instead of SUB, but with constant folding it's a wash. DIV is already an HLOP using RECIPROCAL (it used to use POW, but that was slower). And what would you implement RELU in terms of?
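A quick numpy sketch of the compositions being debated above (my own function names, not tinygrad's API): SUB via NEG+ADD, DIV via RECIPROCAL+MUL, RELU as a MAX against zero, and MIN recovered through the max identity from the parent comment.

```python
import numpy as np

def sub(x, y):       # SUB as ADD plus NEG (NEG folds to a MUL by -1)
    return x + (-1.0 * y)

def div(x, y):       # DIV as MUL plus RECIPROCAL
    return x * (1.0 / y)

def relu(x):         # RELU as a binary MAX against a constant-0 tensor
    return np.maximum(x, 0.0)

def minimum(x, y):   # MIN via the identity min(x, y) = -max(-x, -y)
    return -np.maximum(-x, -y)

x = np.array([-2.0, 3.0, 5.0])
y = np.array([4.0, -1.0, 2.0])
print(sub(x, y))       # [-6.  4.  3.]
print(minimum(x, y))   # [-2. -1.  2.]
```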
Just looking at the code from my phone, but it seems that the conv op calls another primitive and einsum, which I believe is just a fancy MUL with broadcasting? So it might still be technically correct?
Einsum is an expressive way of doing element wise products and then possibly reducing them. An einsum is essentially a description of the dimensions of the input tensors and the dimensions of the resulting output after multiplication. If the output has reduced dimensions, then a summation is applied over them. The package einops provides reductions such as summation, averaging, and so on.
For example: the einsum "b k n p, k -> b k n p" broadcasts the second tensor b to b[None, :, None, None] and does elementwise multiplication. It can be changed to a vector product by writing "b k n p, k -> b n p", which for all intents and purposes is identical to a.transpose(0, 2, 3, 1) @ b.
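Both claims check out numerically; a short numpy verification (shapes chosen arbitrarily for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((2, 3, 4, 5))  # (b, k, n, p)
b = rng.standard_normal(3)             # (k,)

# Keeping k in the output gives a broadcasted elementwise product.
kept = np.einsum('bknp,k->bknp', a, b)
print(np.allclose(kept, a * b[None, :, None, None]))  # True

# Dropping k from the output sums over it -- a contraction, equivalent
# to moving k to the last axis and doing a matrix-vector product.
reduced = np.einsum('bknp,k->bnp', a, b)
print(np.allclose(reduced, a.transpose(0, 2, 3, 1) @ b))  # True
```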
I can easily recommend the einops package; using einsum simplifies things significantly.