
    > It's extremely simple, and breaks down the most complex networks into 4 OpTypes:
    >
    > - UnaryOps operate on one tensor and run elementwise. RELU, LOG, RECIPROCAL, etc...
    > - BinaryOps operate on two tensors and run elementwise to return one. ADD, MUL, etc...
    > - ReduceOps operate on one tensor and return a smaller tensor. SUM, MAX
    > - MovementOps operate on one tensor and move the data around, copy-free with ShapeTracker. RESHAPE, PERMUTE, EXPAND, etc...
    >
    > But how...where are your CONVs and MATMULs? Read the code to solve this mystery.
Ok, I was curious, so I read the code. The answer is that it represents a MATMUL as a 1x1 CONV. And it lied about CONV, which exists as ProcessingOps.CONV, explicitly represented and implemented: https://github.com/geohot/tinygrad/blob/c0050fab8ff0bc667e40... Quite the letdown of figuring out this 'mystery'.
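
For anyone who wants to see the trick concretely, here's a quick check of the matmul-as-1x1-conv identity. PyTorch is used purely for illustration (tinygrad's internals obviously differ), and the tensor names are mine:

    import torch
    import torch.nn.functional as F

    # Sketch only: a matmul expressed as a 1x1 convolution.
    m, k, n = 4, 5, 3
    A = torch.randn(m, k)
    B = torch.randn(k, n)

    # Rows of A become a batch of k-channel 1x1 "images";
    # columns of B become the n output channels of a 1x1 conv.
    x = A.reshape(m, k, 1, 1)        # (batch, in_ch, H, W)
    w = B.T.reshape(n, k, 1, 1)      # (out_ch, in_ch, kH, kW)
    out = F.conv2d(x, w).reshape(m, n)

    assert torch.allclose(out, A @ B, atol=1e-5)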


That CONV is only used by the older backends. The GPU and LLVM backends rewrite CONV as MUL+SUM, to be fused later, and thus only use the 4 OpTypes.

https://github.com/geohot/tinygrad/blob/master/tinygrad/lazy...
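
To make that concrete, here's the shape of the trick in NumPy (my sketch, not tinygrad's code): the conv becomes a copy-free movement op that exposes the sliding windows, an elementwise MUL, and a SUM reduce.

    import numpy as np

    def conv2d_as_mul_sum(x, w):
        # x: (H, W) input, w: (kH, kW) kernel; valid padding, stride 1
        kH, kW = w.shape
        # MovementOp: a strided, copy-free view of all kH x kW windows,
        # in the spirit of ShapeTracker
        windows = np.lib.stride_tricks.sliding_window_view(x, (kH, kW))
        # BinaryOp MUL (broadcast) + ReduceOp SUM over the kernel dims
        return (windows * w).sum(axis=(-2, -1))

    x, w = np.random.randn(6, 6), np.random.randn(3, 3)
    ref = np.array([[(x[i:i+3, j:j+3] * w).sum() for j in range(4)]
                    for i in range(4)])
    assert np.allclose(conv2d_as_mul_sum(x, w), ref)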


That's cool. Am I right in assuming that you want to automate the production of efficient GPU (or other accelerator) code based on these low-level primitives? But you would still need a piece of sorcery that can produce high-performance OpenCL code, right? And that code could be different for every device, so you would need some trial-and-error, benchmark-based compilation at the very least. Or would the OpenCL code be generated by hand for each device?


Yea, benchmark-based compilation; that's already happening in the tinygrad compiler we use for openpilot to determine the local group size. https://github.com/geohot/tinygrad/blob/caea34c52996cde2ed46...

Working on parameterizing a search space that includes more than the local group size. The end dream is some ML-guided search to optimize the kernels :)
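
The idea, as a toy sketch (the function names and candidate list here are mine, not tinygrad's): time each candidate local size on the actual device and keep the fastest.

    import time

    def autotune(run_kernel, candidates, warmup=2, reps=5):
        # Toy benchmark-based search: run each candidate a few times
        # and keep whichever local size is fastest on this device.
        best, best_t = None, float("inf")
        for local_size in candidates:
            for _ in range(warmup):
                run_kernel(local_size)      # warm caches / JIT
            t0 = time.perf_counter()
            for _ in range(reps):
                run_kernel(local_size)
            t = (time.perf_counter() - t0) / reps
            if t < best_t:
                best, best_t = local_size, t
        return best

    # e.g. autotune(launch, [(8, 8), (16, 16), (32, 4), (64, 1)])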


OK, generally I think you're doing exactly what ML is lacking right now. Another huge opportunity: instead of taking the average neural network and designing accelerators for it, design hardware-friendly networks that run well on a sane accelerator built to work with only these specialised networks (one that doesn't need 80% of its chip area for on-chip memory, for example). These might end up being completely different networks from what researchers use today. I work in this area, and I think it's also possible to use the loss function to optimise the network for specific HW.


I've done some work in the past on NN representations, and you can in fact represent Conv and MatMul in more primitive ways. I ended up writing an IR called loop_tool that exposes this stuff:

https://github.com/facebookresearch/loop_tool/blob/main/pyth...

The idea is basically this: https://news.ycombinator.com/item?id=28883086
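
The same decomposition works for matmul, e.g. in NumPy (my sketch, not loop_tool syntax): broadcast both operands into a shared (m, k, n) iteration space, multiply elementwise, and reduce over k.

    import numpy as np

    A = np.random.randn(4, 5)   # (m, k)
    B = np.random.randn(5, 3)   # (k, n)

    # Movement (reshape/expand) + elementwise MUL + SUM reduce over k
    C = (A[:, :, None] * B[None, :, :]).sum(axis=1)
    assert np.allclose(C, A @ B)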


To directly quote the source:

    # these are the llops your accelerator must implement, along with toCpu
    UnaryOps = Enum("UnaryOps", ["NOOP", "NEG", "RELU", "EXP", "LOG", "SIGN", "RECIPROCAL"])
    BinaryOps = Enum("BinaryOps", ["ADD", "SUB", "MUL", "DIV", "POW", "CMPEQ"])
    ReduceOps = Enum("ReduceOps", ["SUM", "MAX"])
    MovementOps = Enum("MovementOps", ["RESHAPE", "PERMUTE", "EXPAND", "FLIP", "STRIDED", "PAD", "SHRINK"])
    ProcessingOps = Enum("ProcessingOps", ["CONV"])
https://github.com/geohot/tinygrad/blob/caea34c52996cde2ed46...

There is a MAX but not a MIN? Is that because min(x,y) = -max(-x,-y)? But then why is there a SUB? Why is there a RELU if it's only max(0,x)? Maybe MIN is just too rare to be worth implementing?


Min is an HLOP.

From: https://github.com/geohot/tinygrad/blob/master/tinygrad/tens...

    def min(self, axis=None, keepdim=False): return -((-self).max(axis=axis, keepdim=keepdim))

All folded together, no slower than MAX.
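
Easy to sanity-check the negate-max-negate identity, e.g. in NumPy:

    import numpy as np

    x = np.random.randn(3, 4)
    # min via the same trick the HLOP uses
    assert np.allclose(-((-x).max(axis=1)), x.min(axis=1))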


But then SUB, DIV, and RELU could be HLOPs as well, no?


We could have NEG instead of SUB, but with the constant folding it's a wash. DIV is already an HLOP with RECIPROCAL (it used to use POW, but that was slower). And what would you implement RELU in terms of?
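
(For the record, the RECIPROCAL-based DIV is roughly this, paraphrasing rather than quoting the source:)

    # paraphrase of the idea, not tinygrad's verbatim source
    def div(self, y): return self * y.reciprocal()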


max(0,x)


That's a ReduceOp right now, more annoying to reason about than a UnaryOp. But in the limit, yea. Or add an elementwise BinaryOp for max.
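
With a hypothetical elementwise binary max (np.maximum semantics), RELU would fold down to a one-liner:

    import numpy as np

    # hypothetical: RELU as an elementwise binary max against 0
    def relu(x): return np.maximum(x, 0.0)

    assert (relu(np.array([-2.0, 0.0, 3.0])) == np.array([0.0, 0.0, 3.0])).all()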

Submit a PR if you can improve something!


Very similar idea to Jittor; convolution can definitely be broken down: https://github.com/Jittor/jittor/blob/master/python/jittor/n...


Just looking at the code from my phone, but it seems that the conv op calls another primitive and einsum, which I believe is just a fancy MUL with broadcasting? So it might still be technically correct?


Einsum is an expressive way of doing elementwise products and then possibly reducing them. An einsum is essentially a description of the dimensions of the input tensors and the dimensions of the resulting output after multiplication. If the output has reduced dimensions, a summation is applied over them. The einops package provides reductions such as summation, averaging, and so on.

For example, the einsum "b k n p, k -> b k n p" broadcasts the second tensor b to b[None, :, None, None] and does elementwise multiplication. It can be changed to a vector product by writing "b k n p, k -> b n p", which for all intents and purposes is identical to a.transpose(0, 2, 3, 1) @ b.

I can easily recommend the einops package; using einsum simplifies things significantly.
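
Both forms check out in NumPy, for anyone who wants to verify (array names as in the example above):

    import numpy as np

    a = np.random.randn(2, 3, 4, 5)   # b k n p
    b = np.random.randn(3)            # k

    # broadcasting form: no reduction, pure elementwise multiply
    assert np.allclose(np.einsum("bknp,k->bknp", a, b),
                       a * b[None, :, None, None])

    # reducing form: sums over k, matching the transpose-then-matmul
    assert np.allclose(np.einsum("bknp,k->bnp", a, b),
                       a.transpose(0, 2, 3, 1) @ b)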



