Matrix multiplication using only addition

daniel-cussen · on July 8, 2023

Hi HN, this paper is my first proper academic publication, it's on arxiv only for now--this is a pre-print--but is being considered for publication by peer-reviewed journals concurrently. Open-access journals, of course.

I'm totally disinterested in tenure or academic recognition. For my goals being a Stanford dropout is better than any other amount of academic recognition. So i don't care about journals uh prestige numbers the impact factors i know that term but anything paywalled is bad for what i do care about, which is my business, fgemm. Means Fast/Faster/Fastest GEneral Matrix-Matrix multiplication. gemm is an acronym already used in BLAS libraries, Basic Linear Algebra Subprograms, which is what most of the time n money spent on ML goes to.

I'm going to be available to answer questions insofar as i can.

blast · on July 8, 2023

This is cool! How did you end up working with Ullman? I guess "being a Stanford dropout" explains how you met him, but there must be an interesting story here. Can you share how that happened and what the process was?

quickthrower2 · on July 8, 2023

Hi Daniel. Thanks for the inspiration. Something I have thought about too is sticking some papers out there without needing to go through expensive gates (PhD etc.).

daniel-cussen · on July 8, 2023

It's brutally hard. I had an easier time buying skylinesort.com n posting the skylinesort algorithm there, than publishing through professors n academia. Typically not feasible for undergraduates, least of all anybody not paying tuition. Same way professors are expected to have an undergraduate degree at the very least (4 profs at Stanford have just an undergraduate degree), a Master's degree (a handful have that and no more), but typically a PhD is required (literally all the other professors have PhD's). Is required. Who requires it? Who says, "I require a PhD."? Is expected? Who expects it? Who says, "I expect a PhD."? Passive voice is typical in academia. Very rare to get around the gatekeeping, frankly. I couldn't publish on arxiv for years because of lack of academic affiliation alone.

Took years to get to this point in terms of the effort I dedicate to getting recognition for my work.

Vasniktel · on July 8, 2023

Thanks Daniel. Could you expand on this comment? What did you have to do to be able to publish a paper on arxiv?

daniel-cussen · on July 8, 2023

At the time, i needed academic affiliation, meaning be in college or more likely have a professor vouch for me. What i ended up doing was return to Stanford undergrad n take classes related to algorithms, show my algorithm portfolio in office hours, then get referred to other profs, one of them being Jeffrey Ullman, in 2019. N then after emails back n forth we met in person in the Gates building, it went from there.

Very happy to have met Professor Jeffrey Ullman.

jmhimara · on July 8, 2023

Not sure if this is what you're talking about, but you don't typically pay to get a PhD (in fact you get paid in the US).

daniel-cussen · on July 9, 2023

Yeah n how much do you get paid, n for what? You get paid to take on professorial duties, TA'ing, lab assistant, that sort of thing usually. Pretty rough in many ways.

bishop77 · on July 11, 2023

How does this compare to Strassen's algorithm? Could you please provide a reference implementation?

unlikelymordant · on July 8, 2023

Could you adapt this to finding fast inverses?

daniel-cussen · on July 8, 2023

I looked at that, i concluded yes because the bottleneck of inverting a matrix is matrix multiplication. Spesh since fgemm targets 32-bit floating-point format, n has high accuracy (not saying how high but much better than Strassen, at least as good as naive matrix multiplication).

henistein · on July 9, 2023

Is there any python implementation of it? I would really like to try it out

pkoird · on July 8, 2023

Didn't look too deep into the paper but can I just say that I LOVE this style of academic writing? Accessible, full of examples, and in a conversing tone. Most of the math papers I come across jump right into the "Let $X \in F be ring of S^1$ and etc. I secretly believe that people heap abstractions after abstractions to purposefully shield the fact that the meat of what they've written is actually quite simple. Either that, or I've failed to understand that some ideas can't just be explained without invoking arcane symbols.

throw_pm23 · on July 8, 2023

Believe it or not, the "Let $X \in F$..." style is typically easier and more straightforward to write for mathematicians and theoretical computer scientists and closer to how they have the solution in their head (or on scraps of paper) anyway. The problem is, unless the result is super important, that style will get your paper rejected as it doesn't engage the reviewers who are not already familiar with the problem. So authors go instead for the conversational tone and try to find a good story, which is more often what actually shields that the work is quite simple or less important.

GrumpySloth · on July 8, 2023

It’s also easier to spot errors in that style than in prose.

celrod · on July 8, 2023

As a grad student, one of my more cynical professors said that to write a paper, you take a simple idea and then obfuscate it until it sounds complicated, and therefore seems impressive. He also said the hardest part of reading papers is understanding the notation (thanks to that obfuscation).

inopinatus · on July 8, 2023

If the whole academic career doesn't work out, patent drafting offers an even more baroque application of obfuscated writing skills.

seanhunter · on July 8, 2023

Now the whole journey of Einstein from patent examiner to physics superstar makes sense to me.

bl0rg · on July 8, 2023

He simply did not have the intellectual aptitude required for the patent business. Suddenly I don't feel so bad about becoming a math professor after failing accounting school.

tough · on July 8, 2023

He saw so many bad ideas on his patent job that he had some good ones of his own

JamesLeonis · on July 8, 2023

Now I am imagining Einstein inventing general relativity as a shorthand to deny yet another perpetual motion machine patent.

WinLychee · on July 8, 2023

I have often found that to be true. From what I've heard, it is sometimes against the author's wishes, that they have to toss in equations to "make it look good" to get past review. At least in CS, you can read the source code if it's been published, and often it's a basic technique with a slight new twist that _maybe_ works but is super hard to reproduce. Also much of the papers are fluff lol, just random equations thrown in that are just copied from the textbook.

_nalply · on July 8, 2023

The incentives are misaligned.

The general public wants something simple and useful.

The writers want recognition.

I learnt that when I wrote my own thesis. I tried to be simple and useful but I discovered something else when thinking about the subject: I wanted to make sure that my thesis gets good grades.

tgv · on July 8, 2023

Two days ago, a friend of mine sent a link to a site with audio versions of scientific articles. He sarcastically added: "as if anyone actually still reads articles." In many discplines, articles are write only. It's publish-or-perish, not read-or-perish.

kingkongjaffa · on July 8, 2023

That’s nuts to me. I read aerodynamics for my masters and would have been lost without skimming 100’s of papers and fully reading about 100 papers.

Now I’m currently doing research on a niche statistics topic and I would be lost without papers.

How does one figure out “this is what we currently understand about x” without research papers and plundering scihub?

tgv · on July 8, 2023

Not every discipline has a constructive literature. I worked in psycho-linguistics. There's almost no article that provides anything you can build on. Almost all of it is description of experiment and outcome, with some theoretical interpretation, but that is almost always specific to interpretation within a specific theory. But those theories are high-level fantasy about language processing, which makes the interpretation meaningless elsewhere, even if the article did get everything else right (and that's rare too).

So there are very few results you can expand upon. Lexical priming turned out to be reproducible and usable as a tool, the Stroop effect too. But those are exceptions. However, they don't explain the underlying mechanism. E.g., the Stroop effect is 90 years old, and there's no explanation of how it works. So if you read text books that explain you the state of the art of around 1980, you're practically up to date as far as real knowledge is concerned. The rest is infighting and publishing for its own sake.

And psycho-linguistics at least has some experimental standards, because it is a fairly limited topic and it is suited to lab settings. Other fields don't even have that. Social psychology is a joke. Articles are based on questionnaires and introspection.

That's why articles in those fields are not well read. They get read by a small audience, mostly people in the same school of thought, and mostly to add to the citation section. But not for knowledge.

refulgentis · on July 8, 2023

They're exaggerating to express personal feelings re: bitterness about lack of engagement

sublinear · on July 8, 2023

> I secretly believe that people heap abstractions after abstractions to purposefully shield the fact that the meat of what they've written is actually quite simple.

It usually is pretty simple, but what they're going for is rigor and concision. Maybe a few papers are overconstrained and could drop a few unnecessary details, but I don't think that's all that common after enough review.

cinntaile · on July 8, 2023

Sometimes people just want their work to sound more impressive than it actually is. Using in-words is a pretty standard technique, their peers don't mind because they speak the same language. To outsiders it sounds difficult. Quite common in academia.

nabla9 · on July 8, 2023

That's because the second author is Jeffrey Ullman (Turing Award, Neumann medal)

Ullman is of the authors of several legendary computer science books Dragon Book (Compilers: Principles, Techniques, and Tools) the Cinderella Book (Introduction to Automata Theory, Languages, and Computation), Green Dragon Book (Principles of Compiler Design)

If you want to learn deep stuff with clarity, those old books are still the way to go.

l33t233372 · on July 8, 2023

I think it’s just that what’s being talked about is so precise and so deep in the weeds of nested definitions that you generally need to talk like that, or at least you have to be a truly gifted communicator to write a math paper without it.

mathisfun123 · on July 8, 2023

>or at least you have to be a truly gifted communicator to write a math paper without it.

i know a lot of math (hence the name) - basically lots of stuff scattered around analysis, geometry, and complexity theory, at varying levels between senior undergrad and research level (MIP and SAT and SMT). this basically tracks my academic progression (from math undergrad to cs phd student).

the stuff that i can explain the best is the research level stuff. why? because i can explain it in the same relatable terms that i learned it through, since i learned it when i needed it - through relatable examples that clearly motivate the ideas. i've done it many times - often a junior phd student will ask me what i work on and i start telling a story that starts with some really common thing that gives a foothold ("how would you figure out which variables in a for loop are reused") and then step by step you "follow your nose" to the ideas behind the proofs and techniques and etc.

what's my point? lots of academic math is useless frippery that couldn't be motivated in this way and so it can't be articulated except formally.

l33t233372 · on July 8, 2023

If you want to explain something in a loose fashion, that works.

It doesn’t work so much for proofs in a lot of mathematics, especially because the “common ground” you speak pf starting at would be hours of explanation behind what you’re trying to say.

akasakahakada · on July 8, 2023

Read a paper with few hundreds of cites. They proposed a new algorithm to solve eigenvalue problem. I look at it, decrypt the equations, and then notice that it is just basically a bunch of if-then condition like this:

count = 0

for x in list:

  if x == 1:

    count +=1

if count != 0:

  return 3

else:

  return 1

Everything is simple, they try too hard to make things look mathematically rigorous but turn out stupid.

if you want to see how cryptic it is: https://arxiv.org/abs/2103.07510

elcritch · on July 8, 2023

That doesn't look to try to solve the eigenvalue problem.

nathan_compton · on July 8, 2023

> secretly believe

Apparently not so secretly.

Drunk_Engineer · on July 8, 2023

As a chip designer, I'm dubious of this claim:

"The advantage of performing matrix multiplication using only addition for arithmetic is that it then becomes feasible to build special- purpose chips with no multiplier circuits. Such chips will take up less space per on-chip processor."

Perhaps that is true in a toy design, but in real-world chips the multiplier uses only a very tiny fraction of chip real-estate. And even if the matrix-multiply can be eliminated, there are other uses for multiply operations.

I once attended a chip design conference where NVidia discussed its latest GPU. In one of the slides showing block layout, the designers pointed out how barely any silicon was being used for actual floating operations -- the vast majority was for pipelining and moving bits around.

imtringued · on July 8, 2023

I once saw a professor show me their latest taped out CPU (multi project wafer of course) and it had one tiny core that is supposed to control the wake-up of the larger processor. The sleep controller has no memory and is a thin slice that is barely visible. Meanwhile the primary processor is much larger but it too is completely dwarfed by the memory it is being surrounded by. The idea of throwing out parts of the ALU is ridiculous. The only situation where this would make sense is in some kind of processing in memory situation where your logic process does not permit large CPUs and you expect to have hundreds of cores per chip with multiple chips on a single DIMM.

brigade · on July 8, 2023

That’s because CPUs and GPUs are useful for more than just matrix multiplication. TPUs aren’t; they assume highly regular data movement in and out of the ALUs.

daniel-cussen · on July 9, 2023

Not that tiny, n not such little energy, if Intel AVX-512 requires throttling the chip down to 60% of full speed (n similar slowdowns are necessary on Xeon Phi, 1.4 GHz to 1.2 GHz for full) SIMD when using "heavy" operations like multiplication (and also count-leading-zeroes aka CLZ) so it's a lot of energy for sure.

hedgehog · on July 8, 2023

It of course depends on workload but I know more than a little about this specific problem space and reducing the space+energy cost of the multipliers is useful. If the idea proposed worked well enough it might be a useful block in a camera ISP chip, audio interface for wake word and speech preprocessing, and similar applications where the models are small and energy is precious.

creato · on July 8, 2023

I don’t know, sorting requires a lot of moving data around, which is expensive for energy too. Maybe this will be used for vectors of small fixed/bounded size, but then you don’t get to amortize the cost of the logs as much either.

brucethemoose2 · on July 8, 2023

This is a fascinating idea. Any real acedemic critique is over my head (and I hope others chime in), but some random thoughts:

- "logarithm LUT then add" seems delightfully simple, especially at low precision. I am going to have to read that paper too...

- The concerns about GPU style parallelism may not be as bad in "alternative" architectures. For instance, Centaur came up with a single, serial, but hilariously wide 32,768-bit SIMD core for inference: https://fuse.wikichip.org/news/3256/centaur-new-x86-server-p...

- The silicon simplification also seems relevant to Samsung's in memory computing effort: https://www.servethehome.com/samsung-hbm2-pim-and-aquabolt-x...

- I wonder if this would be relevant to llama.cpp's CPU inference?

daniel-cussen · on July 8, 2023

Well so one issue w both GPUs n CPUs which make them bad platforms for this algorithm is that, in both, FLOPS are such an important metric for sales that multiplication is highly subsidized in both those chip types. So huge amounts of area is dedicated to floating point multiplication, meaning the advantage of fgemm (the name of the algorithm is the same as the name of the company) is purely one of energy.

Which is great because if it were software it would be impossible to protect the IP. USPTO is very clear in that sense, i believe in both in re Bilski and in the Alice Corp. case which reached SCOTUS, that algorithms need to be implemented physically, typically meaning in a chip, to be patentable. So because it needs a chip to work, it is good business, if it did not it would be bad business. A chip provides every form of IP protection, all four forms, trade secret, copyright, patent, n even trademark. No other medium has that to my knowledge.

So if you have a CPU or a GPU n want it to do more work in the same amount of time, this paper promises nothing, n it keeps that promise. Nonetheless i'm advancing rapidly to the point of creating the hardware that can cut off 70% of the cost of GEMM. I considered 50% off, same thing at half the price, but it wouldn't be fair to the consumer w my economics. You see 50% discounts all the time, who cares? 70% off, you don't see that all the time. On something you actually want? Especially on a commodity, n it's still good business for me as the lowest-cost producer.

david-gpu · on July 8, 2023

> Well so one issue w both GPUs n CPUs which make them bad platforms for this algorithm is that, in both, FLOPS are such an important metric for sales that multiplication is highly subsidized in both those chip types. So huge amounts of area is dedicated to floating point multiplication, meaning the advantage of fgemm (the name of the algorithm is the same as the name of the company) is purely one of energy.

I'm having trouble understanding this. Are you saying that GPUs invest area on floating-point multipliers because FLOPS are an important marketing metric? The only thing that mattered to us was: how can we make these operations faster within the area and power constraints we have? Reducing energy consumption was thus a major goal.

I wish you luck. If I were in your shoes, I would approach NVidia or Google -- and expect to be hammered with tough questions.

daniel-cussen · on July 8, 2023

> Are you saying that GPUs invest area on floating-point multipliers because FLOPS are an important marketing metric?

Yes. That is precisely what i'm saying. If i'm mistaken in saying that that's one thing, but as far as it being what i'm saying, it very much is. It's been an important guiding principle for some time now in the project, that recent chips--including FPGA's--tend to have hard IP for floating point multiplication.

Now spending a lot of chip area on getting more FLOPS is not necessarily a bad decision if there is no alternative for achieving fast matrix multiplication. Almost any method is sensible if there was no better alternative available when the decision to use that method was made. In addition, fgemm only really makes sense when matrices contain over 1000 elements per row or column, not sure how much more than 1000 per vector but more than that. Small and in particular small and dense matrices are still best multiplied exactly the way GPUs multiply them, with many floating-point multiplier circuits in parallel. It's not stupid in the least.

Yeah so NVidia n Google have the same business model i'm going for, Google having TPUs in its datacenters that do work that cannot be reverse engineered. Google does not sell TPUs. You can use them by sending Google the work, and you'll benefit from much lower cost and faster speed. NVidia has a similar offering, just not as well-known. That's the correct business model in my analysis, and what fgemm will sell. Sell the work.

david-gpu · on July 8, 2023

> Google having TPUs in its datacenters that do work that cannot be reverse engineered

Help me understand: TPUs cannot be reverse engineered because the user doesn't have access to the physical device, but other devices like GPUs can?

Can you show some examples of reverse-engineering of GPUs that has been performed on the basis of having physical access to the dies? Are you aware of any reverse engineering done on them using other means? How much has this reverse engineering prevented e.g. NVidia from being financially successful? Finally, since patents are freely available to the public once they have been granted, does that nullify some concerns regarding reverse engineering?

I'm not an entrepeneur, so take this with a fistful of salt, but having worked at places like NVidia, I would never try to compete head to head with them, as a startup. Very few semiconductor startups achieve any success, and the ones that do start by finding a very particular market niche where the established players aren't even trying to play.

Again, I wish you good luck.

imtringued · on July 8, 2023

What about us peasants who need multiplication to actually get work done instead of playing FLOPs status games? Not everyone is bottlenecked on something as specific as matrix multiplication.

Also the claims about huge amounts of area being dedicated to multiplication are false. ALU size is mostly irrelevant.

adgjlsfhk1 · on July 8, 2023

games are like 60% matrix multiplication (especially with ray tracing)

kragen · on July 8, 2023

this is an interesting idea; in some sense rotating a point in space is only multiplying a 3-item or 4-item vector (where this idea wouldn't be useful), but rotating n points is multiplying a 3×n or 4×n matrix by the transformation vector, so if the algorithm pans out, you should be able to do that kind of stuff too; n can be pretty large

kopecs · on July 8, 2023

> A chip provides every form of IP protection, all four forms, trade secret, copyright, patent, n even trademark. No other medium has that to my knowledge.

IANAL, but I do not believe semiconductor masks are copyrightable under US law (my limited understanding is that there is essentially due to the fact that the mask is inherently functional and/or aspects of the merger doctrine). There is a separate sui generis mask work protection via 17 U.S.C. §§ 901-914.

Edit: Moreover, I'm unsure how you figure a chip itself is protected by trade secret, since reverse engineering an IC is not terribly difficult.

daniel-cussen · on July 8, 2023

I don't know why trade secret applies, but i remember reading it does. Perhaps in the rationale, or the preimage. It doesn't make all that much sense, come to think. I think Intel tried it? Intel for sure used copyright to protect chips. Hey thanks, i did not know about 17 U.S.C. §§ 901-914.

kragen · on July 8, 2023

mask works

inglor · on July 8, 2023

Did the comments here actually read the paper? It just does "russian peasant" multiplication simulating multiplication with addition.

There is no new math discovered as far as I understand. It's basically "we know how to do multiplication with a lot of additions".

If this was effective rather than just "simulate multiplication with a lot of additions" it would have been super interesting for parallelization of multiplications and communication bounds.

lang4d · on July 8, 2023

The main content of the paper is trying to minimize the number of “russian peasant” multiplications that need to be performed. I would say those are the interesting parts. Section 2.3 claims dropping the number of additions by a factor of 6 from the naive algorithm.

Seems like doing the sorting, recursion, and alignment would have a nontrivial performance penalty, but it’s still a pretty interesting idea.

mathisfun123 · on July 8, 2023

> Did the comments here actually read the paper? It just does "russian peasant" multiplication simulating multiplication with addition.

Hate to break it to you but often not even the reviewers actually read the paper.

kragen · on July 9, 2023

this is not correct

probably it would improve the paper to remove the russian-peasant-multiplication references entirely, or reduce them to a throwaway aside in one place

in part this is because you surely won't be the last person careless enough to make this obvious error

but also it's because russian-peasant multiplication is a totally normal way for hardware multipliers to work, and the main content of the paper is totally decoupled from whether the final multiplication at the end of all the reductions are done with russian-peasant multiplication or (as would probably be a better idea) something like a dadda multiplier or a booth multiplier

nabla9 · on July 8, 2023

Wow. Jeffrey Ullman? Turing Award, Neumann medal etc. He is now 80 years old.

Ullman is of the authors of several legendary computer science books Dragon Book (Compilers: Principles, Techniques, and Tools) the Cinderella Book (Introduction to Automata Theory, Languages, and Computation), Green Dragon Book (Principles of Compiler Design)

He was the thesis advisor for Sergey Brin, Ravi Sethi, Surajit Chaudhuri

giovannibonetti · on July 8, 2023

His book about Standard ML is very interesting, too. It is a great introduction to strongly-typed functional languages like Haskell, Ocaml and F#.

KETpXDDzR · on July 8, 2023

I'm unsure if this yields any computational benefits over classic multiplication.

"In real arithmetic, multiplication may be faster for the following reason: When two real numbers are multiplied, the mantissae are multiplied together and the exponents are added, and these operations can be carried out in parallel. When two real numbers are added, first the mantissa of the smaller number must be shifted so that the exponents match (a process termed normalisation). Then the mantissae must be added. The result of the addition may overflow the original word length by 1 bit, or it may generate any number of leading zeros. Therefore the result must be normalised again. There are therefore 3 steps and they must be done in series." - https://www.researchgate.net/post/Is-multiplication-slower-t...

daniel-cussen · on July 8, 2023

One addition and one move replacing one multiplication? Absolutely makes things much cheaper.

kragen · on July 9, 2023

well, probably. it likely depends on how the move works; muxes aren't free and driving long wires isn't either

but the confusion in the comment you are replying to is that it thinks you are deriving a floating-point matrix multiply algorithm, when in fact you are deriving an integer matrix multiply algorithm

floating-point adds are slightly more expensive than floating-point multiplies

integer multiplies are enormously more expensive than integer adds (in power and area, though not in time)

KETpXDDzR · on July 8, 2023

To support this: "ADDSS/SUBSS take 1–3 cycles while MULSS takes 0.5–5 cycles."

- http://www.agner.org/optimize/

sabhiram · on July 8, 2023

Fascinating paper.

We design an inference accelerator which more or less accomplishes this by quantizing input tensors into logarithmic space. This allows the multiplication (in convolution especially), to be optimized into very simple adders. This (and a few other tricks) has a very dramatic impact on how much compute density we achieve while keeping power very low. We keep the tensors in our quantized space throughout the layers of the network and convert the outputs as required on the way out of the ASIC.

We achieve impressive task level performance, but this requires some specialized training and model optimizations.

Very cool to see ideas like this propagate more into the mainstream.

KRAKRISMOTT · on July 8, 2023

Isn't matrix multiplication already a convolution? You are rotating the right hand side matrix anti clockwise 90 degrees and then convolving it upon the LHS matrix from top to bottom.

sabhiram · on July 8, 2023

The point above regarding convolution had to do specifically with accelerating 3x3 and above convolutional operations, as the product and the accumulation can be done in a few clock cycles if setup with care and love.

kragen · on July 9, 2023

no, it is not, and i am not

discrete convolution is cₙ = Σᵢaᵢbₙ₋ᵢ

there is no way in which the indexes into the input matrices in a matrix multiplication are formed from sums or differences of indices and dummy variables

however, convolution is a matrix multiplication, specifically multiplication by the circulant matrix of the convolution kernel

hth, hand

KRAKRISMOTT · on July 9, 2023

Sure it doesn't sum the whole matrix but it does sum row by row. Also how did you type out LaTeX in HN? Or is that a font?

kragen · on July 9, 2023

it sums products, but convolution is summing products in a particular way that is not general matrix multipication

i typed special characters with the compose key; cf. https://github.com/kragen/xcompose

not as easy as latex but more compatible

hamilyon2 · on July 8, 2023

I rarely let myself a negative comment, but core of article is authors realising that n-bit multiplication is n additions. So, absolutely nothing interesting or new

peepeepoopoo16 · on July 8, 2023

From what I can tell, it memoizes the intermediate additions and then uses that to amortize the adds across multiple array elements and achieve speedups.

kragen · on July 9, 2023

neither of these comments is correct

peepeepoopoo16 · on July 9, 2023

Mind offering an explanation?

kragen · on July 9, 2023

there is a perfectly clear explanation in the paper

daniel-cussen · on July 9, 2023

Uh, you clearly didn't get the proof. N-bit multiplications performed with a single addition and a single move. Proof involves showing superiority over the number of additions in Russian Peasants.

mratsim · on July 8, 2023

At a glance this sounds like a re-discovery of addition chains and using them to construct Pippenger algorithm. But applied to matrices instead of group elements.

See: https://github.com/mratsim/constantine/issues/37

kragen · on July 9, 2023

i don't think that is the case but i don't really understand pippenger's algorithm. what would pippenger's buckets correspond to in this 'fgemm' thing?

kragen · on July 8, 2023

apparently the fgemm future of artificial intelligence is... a versatrig slide rule?

wait, this actually doesn't use logarithms at all, it's more of a difference engine really

daniel-cussen · on July 8, 2023

Yeh, it bears many resemblances to Babbage's difference engine.

kragen · on July 8, 2023

yeah, sorry for making the slide rule joke before even reading the abstract

jhj · on July 8, 2023

This feels very disconnected from the realities of hardware to the point of impracticality. More energy is typically burned on RAMs/flops (storing bits and shuffling them around) than the combinational logic portion (adders/multipliers/etc) doing the arithmetic on real designs these days. Sorting, computing differences and the like involves a lot of data movement and likely temporary storage for buffering as well. I've evaluated fixed-function sorting in ASIC designs, it's not cheap at all.

This feels like the authors had some ideas of circuit design concerns from the 1980s ("hardware multipliers are very expensive!") and trying to port that to the present.

daniel-cussen · on July 9, 2023

No comment.

Roark66 · on July 8, 2023

I learned (assembly)programming on a chip that had no multiplication instruction. It was 6510 (a version of popular 6502) and I fail to see the benefit. Back then every multiplication had to be done via addition in loop and division with subtract/compare(except certain numbers like powers of 2 where one could bit shift). You can imagine how slow it was. I was envious of my friends with Amigas (68k cpu) who had chips that were capable of multiplication in hardware. It seems obvious that a properly tuned hardware implementation is always going to be faster than doing the same thing in software. Taken to the extreme this is the crux of the old RISC vs CISC debate.

imtringued · on July 8, 2023

This also reminds me of the days when division was not implemented in hardware on ARM chips.

Vasniktel · on July 8, 2023

Thanks for the paper Daniel. Very easy to read and understand.

I believe I might have found a minor typo that made me scratch my head for a second. On page 3 in the part where you describe the "follow pointers" part of the algorithm you wrote vi=sj and then cpi=csj whereas I believe you meant cvi=csj and that we can now replace vi with csj to make it cvi. Let me know if I'm misunderstanding something here.

daniel-cussen · on July 9, 2023

Thanks, help finding typos is appreciated! It'll make things easier for when this gets published in a peer-reviewed journal, which is in the works.

jiggawatts · on July 8, 2023

Instead of sorting, just count the occurrences of the distinct values. For 8-bit values, this requires only 256 registers, each with a relatively small number of bits. E.g.: if the maximum matrix size is 16K*16K, then only 14 bits per accumulator is required.

This is just Radix sort and is very easy to implement in digital circuits. It can even reuse the same adder circuits.

daniel-cussen · on July 8, 2023

Sorting is very fast since i designed a sorting algorithm tailor-made for this problem, which is about as fast as your approach. Difference being the numbers are 32-bit, so iterating through the entire array of all possible 24-bit mantissas (i know they're 23 bit, but the implicit initial 1 of normal (vs denormal) values is explicit here) would be way too slow. Otherwise you'd be right, 8-bit values you can just use counting-sort, no problem. Or 14 bits, same deal. Now also notice there's a difference between the matrix size and the mantissa size, we care about the mantissa size, so it's 24 bits.

rustybolt · on July 8, 2023

Sounds like bullshit to me? The 1x1 case reduces to multiplication so it's multiplication without multiplication?

scscsc · on July 8, 2023

Does anyone know anything about the affiliation of the first author, "Fgemm SPA"?

daniel-cussen · on July 9, 2023

Fgemm SpA, coming soon.

amai · on July 8, 2023

See also „Multiplying and Dividing on the 6502“ , which was an early microprocessor without multiplier unit

https://news.ycombinator.com/item?id=31911655

hdhsjsbv · on July 8, 2023

What is the algorithmic complexity (Big O) of this?

And thank you for great submission. I skimmed it and I enjoyed it. But didn’t read the cost function section with much attention.

amelius · on July 8, 2023

Multiplication in binary is just addition of various shifted versions of one of the operands.

So it should not be surprising that this is possible or perhaps even efficient.

daniel-cussen · on July 8, 2023

One thing I highly recommend is trying it for yourself, just with pen and paper. Think of ten two-digit numbers under 40 for it to work nicely. Just numbers under 40, 1-100 would require like 20 numbers for it to work as well as it does in realistic examples. Write them in one line, then write them sorted on the next line, with lines connecting them to where they were before. Then underneath each number write down the difference between that number n the one before it, this is called taking the first differences. Repeat the sorting followed by taking first differences until you have only two numbers, two being an arbitrary limit. You may then pretend the row and column are the same, so expand with the same vector using the lines drawn, and prefix sum where first difference was performed.

So:

11 39 23 28 31 19 32 05 01 09

sort

01 05 09 11 19 23 28 31 32 39

first differences

1 4 4 2 8 4 5 3 1 7

sort and remove duplicates

1 2 3 4 5 7 8

first differences

1 1 1 1 1 2 1

sort and remove duplicates

1 2

reduction complete

smlacy · on July 8, 2023

Writing 'n' instead of 'and' when discussing mathematics is generally a very bad idea. Your example is a good one, but your use of abbreviations in your writing is horrid.

daniel-cussen · on July 8, 2023

I think you're right in this case. Alright I'll edit if I still can. I don't use any variables anyway in my post.

EDIT: alright I fixed all the single letter abbreviations.

svnt · on July 8, 2023

almost:

> Then underneath each number write down the difference between that number n the one before it

daniel-cussen · on July 10, 2023

True. Too late to edit. Yeah that is confusing, guess I gotta write differently when discussing that. I just shave characters for character counts, which are often a problem particularly on Twitter. It's not because I don't like typing out the whole word, I generally refrain from that sort of abbreviation. It is also stylistically unique--I use a unique style for the same reason as, and vindicating, Auguste Rodin who faced problems due to his statues being too literal.

svnt · on July 11, 2023

I don’t follow the Rodin reasoning — was he actually critiqued for being too literal? I thought he was fairly unconventional when everyone else was literal. Maybe that’s what you’re referring to? His ‘fragmented’ style?

casey2 · on July 8, 2023

Rather than pen and paper I recommend using a computer if you have access to one.

   (-⟜»∘⍷∧)⍟(1+↕3) 11‿39‿23‿28‿31‿19‿32‿5‿1‿9

⟨ ⟨ 1 4 4 2 8 4 5 3 1 7 ⟩ ⟨ 1 1 1 1 1 2 1 ⟩ ⟨ 1 1 ⟩ ⟩

   -⟜»∘⍷∧ vec

⟨ 7 2 4 1 3 2 2 4 10 12 3 10 1 6 1 5 17 2 8 ⟩

   +´-⟜»∘⍷∧ vec

100

   ⍷∧ vec

⟨ 7 9 13 14 17 19 21 25 35 47 50 60 61 67 68 73 90 92 100 ⟩

   +`-⟜»∘⍷∧ vec

⟨ 7 9 13 14 17 19 21 25 35 47 50 60 61 67 68 73 90 92 100 ⟩

klysm · on July 8, 2023

Dropping the name of the language that utilizes these hieroglyphics might be helpful to those who are unfamiliar.

daniel-cussen · on July 10, 2023

I was going to say the same as donkeybeer, agree it is surely an APL variant.

donkeybeer · on July 9, 2023

Probably an apl

kragen · on July 8, 2023

i feel like this sometimes runs into problems

like, if after 7 iterations, we have

1 2 3 7 9 12 18 23 27 57 72 95 129 680 718 994 2631 10770 18047 785265

that gets down to four items after 14 iterations

1 54 2814 716356

but because that's roughly exponential it takes a beastly number of iterations to get down to 2 items

i guess i should read the paper

daniel-cussen · on July 8, 2023

Yeah i address that, that's a gotcha and if the numbers are exponentially distributed the algorithm does not work. It is not universal. It depends in part on the exponent for which the numbers are exponentially distributed, n the other optimizations you use. This is the purpose of ongoing experiments. The Fibonacci series is an interesting case, since you get rid of the two largest numbers in each pass.

Yeah hey the paper will not be that painful to read if you can already perform the reduction steps. I'll answer further questions.

Yeah so for floating point an exponential distribution is bounded in how many elements it can contain for a given exponent, so it works out quite nicely. It does not work on bignums.

Nice to see someone use lower-case i like i do!

kragen · on July 8, 2023

the paper is not painful at all

unless i fucked it up, it looks like you can insert a separate renormalization step before the sorting where you shift each number to the left by a variable amount, like a floating-point unit always does with the mantissa (except subnormals), and that seems to solve the exponential distribution problem; it always seems to get down to a single item from 10000 34-bit items in about 15 steps

no wait, it doesn't really solve it, because a vector of the first 1000 fibonacci numbers still takes 485 iterations. but the last number in that vector is a 694-bit number. it does seem to improve it enormously

i thought this might make it work much worse (because in a sense it's adding bits to the numbers: what used to be a 1-bit number might now have n-bit-wide differences with the numbers before and after it) but at least in random tests it seems to make a huge improvement

just to clarify, what i'm doing (with unsigned integers) is

    def normalize(v):
        for vi in v:
            while vi < 2**34:
                vi *= 2
            yield vi

    def nreductions(v):
        while True:
            v = list(sortu(normalize(v)))
            yield v
            v = list(diffs(v))

with 256-bit numbers and a 2**256 normalization target it seems to typically be about 30 or 40 reduction steps, not sure if those qualify as bignums to you

the shifts of course have to be undone in the other direction, just like the permutations, but i don't think that's a problem?

(oh, now i see that in §3.1 'alignment' you are already doing something like this, except that you're shifting right to reduce the number of duplicates and eliminate one extra bit of differencing per iteration, not left to reduce the dynamic range of the data. for smallish numbers that seems to be roughly as effective, but left-shift normalizing works a lot better than right-shift aligning for 256-bit numbers)

i haven't tried doing any actual vector multiplies with this algorithm yet so if i did fuck it up i wouldn't have noticed

this is a pretty exciting algorithm, thanks for sharing

daniel-cussen · on July 8, 2023

Unfortunately i cannot answer questions as unlike i chip i must now sleep. It's 2:42 AM here.

First thing in the morning i'll be on it.