NNCP: Lossless Data Compression with Neural Networks (2019) (bellard.org)
397 points by ToJans on May 22, 2021 | 160 comments


I'd like to highlight the Introduction of nncp.pdf, which explains in very compressed form how it works and encompasses all the magic.

"The lossless data compressor employs the traditional predictive approach: at each time t, the encoder uses the neural network model to compute the probability vector p of the next symbol values t knowing all the preceding symbols s0 up to st−1. The actual symbol value st is encoded using an arithmetic encoder with approximately −log2 ( pst ) bits. Then the model is updated knowing the symbol st. The decoder works symmetrically so there is no need to transmit the model parameters. It implies both encoder and decoder update their model identically.

When no preprocessing is done, st represents the byte at position t. Hence there are Ns= 256 different symbol values from 0 to Ns−1."

For those more familiar with Huffman coding: this uses arithmetic coding instead https://en.wikipedia.org/wiki/Arithmetic_coding

It allows adaptive coding (i.e. changing the probabilities of symbols dynamically). Here these probabilities are modelled dynamically using a neural network.

The magic is that the neural network parameters are defined implicitly: you don't need to transmit them, therefore you can use as big a neural network as you want.

During decoding, the network, starting from scratch, is continuously trained on the freshly decoded data, in the same way it was during encoding. The more data you compress, the more you train the internal neural network parameters, and the better the prediction for the next character gets.

You decode 20 bytes, you update the model with this new data, you decode 20 new bytes with the new model, you update the model, and so on. Every 1,000,000 bytes you can even update your model multiple times using all currently available data, to make the network converge faster.

Of course this only works if everything is exactly deterministic, which Fabrice Bellard took great effort to guarantee; no small engineering feat.
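
(To make the symmetry concrete, here is a toy sketch in Python; a trivial byte-count model stands in for the neural network, and none of this is NNCP's actual code. The decoder can replay exactly the same predict-then-update steps on the symbols it has already decoded, so no model state ever needs to be transmitted.)

    import math

    class AdaptiveByteModel:
        # Order-0 adaptive model: Laplace-smoothed byte counts. Stands in
        # for NNCP's neural net; the only property that matters here is
        # that it is deterministic and updated identically on both sides.
        def __init__(self):
            self.counts = [1] * 256
            self.total = 256

        def prob(self, symbol):
            return self.counts[symbol] / self.total

        def update(self, symbol):
            self.counts[symbol] += 1
            self.total += 1

    def ideal_code_length_bits(data):
        # Bits an arithmetic coder driven by this model would emit.
        # A decoder runs the same loop on its decoded output, keeping
        # its copy of the model exactly in sync with the encoder's.
        model = AdaptiveByteModel()
        bits = 0.0
        for b in data:
            bits += -math.log2(model.prob(b))  # cost of encoding byte b
            model.update(b)                    # model learns after each symbol
        return bits

    data = b"abababababababababab"
    print(ideal_code_length_bits(data) / len(data))  # well below 8 bits/byte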


I was just working on a similar project and didn't think you could do decode without sending weights, amazing!


The concept is rather intriguing in terms of having a compression algorithm that evolves toward an ideal optimization asymptote. I, for one, would be annoyed at the thought that compressing an identical artifact might result in a different compressed size, as it would sow doubt as to whether they actually had the same input.


As long as the initial state of the network is fixed, the compressed output will be identical every time.


What is the probability that there are no collisions by the encoder?


What do you mean by collisions?

As described, arithmetic coding does a 1:1 mapping of input to output, while the NN deterministically generates the probability distribution for the arithmetic coder. Conceptually, it would work with any deterministic/reproducible function, even a PRF or a running hash, though the point of the NN is to give good distribution estimates for interesting inputs, and thus short output from the arithmetic encoder.

Presumably, there are some limited choices during coding, which then need to be encoded as well.


100%.


Is the model actually trained (=updating the weights instead of just the state) during en/decoding?


Yes, that's the trick: the model is trained during both encoding and decoding, from the data that is encoded/decoded, continuously as the data flows, starting from the same blank state in both cases and then trained with exactly the same data (the raw data).

Because training is deterministic and the same data is presented during encoding and decoding, the weights of encoder and decoder are always kept in sync, such that the network used when encoding the nth byte is exactly the same as the network used when decoding the nth byte.

The weights of the network are the internal state.


Compression speed is around 2 KB/sec.

The figure is given in nncp_v2.1.pdf. This is roughly 500x slower than xz, which in turn is roughly 20x slower than gzip.

I have always wondered whether a compression service could be a viable commercial offering, and with NNCP's ratios and compression speed, it looks like it could! That's for cases when you need to really compress the heck out of something, e.g. prior to distributing a large static piece of data to a very large number of recipients. I think there was a service like this, used by AAA game developers for distributing patches, but it didn't take off because it had very slow decompression... and, now reading further through the paper, it looks like an issue with NNCP too -

    The NNCP v2 decompression speed is similar to its compression speed.
So it's a no-go for my brilliant and lazy startup idea, but fascinating stuff nonetheless.


I think matching the compression speed to the network speed is what you want to do.

In a place with ubiquitous fast broadband, a compromise on decompression speed might be OK.

If you're sending something to Mars... 2 KB/sec might make sense.


>If you're sending something to Mars... 2 KB/sec might make sense.

Keep in mind that 2 KB/s is on something like an RTX 3090... massively power-hungry, even for decompression.


> I think maybe matching the compression speed with the network speed is what you want to do.

You might be interested: Zstandard has an adaptive mode (zstd --adapt) that attempts to do this.


Hm, there are likely many large files in existence that are very common (DLLs, versions of shared libs, etc.) and that are frequently compressed (SCP, storage on SSDs, etc.), since disk space isn't free yet, last I checked.

Each file is really its own special theoretical multivariate space to find the best theoretical compression ratio. Usually finding the optimal or near-optimal compression is the hardest computationally.

And it's basically a perfectly cacheable computation if you have some sort of database mapping algorithm + parameters to apply to a given file.

Decompression usually isn't as big a deal for performance.

Anyway: an online database for a service that actively tries to take a known common file (matched on file size, hash, etc.) down to its best compressed representation, so we don't have to do it every time.

Cloud providers would presumably love this, since S3 probably has millions of copies of many files and would 1) get to compress the file and 2) get to share its representation across all instances.


Compression speed is around 2 KB/sec.

Apologies if this is a noob question, but could you improve it with training?


> Apologies if this is a noob question, but could you improve it with training?

All other things (GPU, code quality, etc.) being equal, the speed here mainly depends on the size of the network. Larger networks give better compression (logarithmically), but also slower speeds (linearly). There are some tricks like pruning to create significantly smaller networks with similar-ish performance, but good models are often large.

Unrelated to your question: what nobody mentions is that the decompression side also needs the complete network. With 57M and 187M parameters, those files are going to be quite large, >100MB for sure. That completely annihilates the performance wins for one-time transfers.


No, as I understand it, the network does not need to be transmitted. The decoder learns on the same stream as the encoder, so you do not need to transmit network parameters.


If I compress a large file at location A using this algorithm, and want to send it to location B, how does location B know how to decompress it?


Basically the same way your favorite codec knows how to decompress. It builds the predictor (in this case, an NN) from the known input. The predictor then assigns a symbol (or a series of symbols) to the next few bits - higher probabilities need fewer bits (I haven't checked for this specific technique, but there are techniques that don't even require an integer number of bits, e.g. range encoding).
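
(A toy illustration of the fractional-bits point, using exact rationals and ignoring the renormalization a real arithmetic/range coder needs: each symbol narrows an interval by its probability, and the output only has to name one number inside the final interval, which takes about -log2(interval width) bits. That is how a likely symbol can cost a fraction of a bit.)

    import math
    from fractions import Fraction

    def interval_for(symbols, probs):
        # Narrow [0, 1) once per symbol; the final width is the
        # product of the probabilities of the encoded symbols.
        lo, width = Fraction(0), Fraction(1)
        for s in symbols:
            start = sum(probs[:s], Fraction(0))  # cumulative start of symbol s
            lo += width * start
            width *= probs[s]
        return lo, width

    probs = [Fraction(9, 10), Fraction(1, 10)]  # symbol 0 is very likely
    lo, width = interval_for([0, 0, 0, 0, 1], probs)
    print(math.ceil(-math.log2(width)))         # ~4 bits for 5 symbols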


See https://news.ycombinator.com/item?id=27244810

As I understand it, when starting up (at the beginning of the stream) the network compresses nothing; it then guesses the next byte at every turn, deterministically from the input so far.

Decoding can "read" the fist byte, then proceed to learn and guess the second byte and so on.


see the (current) top comment

https://news.ycombinator.com/item?id=27244810

it's a very clever use of symmetry.


It's clever, but pretty much "standard" in compression. The earliest scheme I'm aware of that used symmetry between the encoder and decoder to avoid explicitly transmitting the parameters this way was LZ78 (Lempel, Ziv; 1978), but there could well be predecessors I'm not aware of.

LZ78 used it "just" to build a dictionary, but the general idea of using symmetry is there.

(A fascinating variant on this is Semantic Dictionary Encoding, from Michael Franz's doctoral dissertation (1994), which used this symmetry to compress partial ASTs to cache partially generated code for runtime code generation.)
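
(For the curious, the LZ78 version of that symmetry in a few lines of Python; a toy sketch, not a production codec. The decoder grows its phrase dictionary by exactly the same rule as the encoder, one entry per token, so the dictionary itself is never transmitted; NNCP applies the same trick to network weights.)

    def lz78_encode(data: str):
        dictionary, out, w = {"": 0}, [], ""
        for c in data:
            if w + c in dictionary:
                w += c                               # extend the current phrase
            else:
                out.append((dictionary[w], c))       # (longest-prefix index, next char)
                dictionary[w + c] = len(dictionary)  # grow dictionary deterministically
                w = ""
        if w:                                        # flush a trailing phrase
            out.append((dictionary[w[:-1]], w[-1]))
        return out

    def lz78_decode(tokens):
        entries, out = [""], []
        for index, c in tokens:
            phrase = entries[index] + c  # rebuilt from the decoder's own dictionary
            entries.append(phrase)       # identical growth rule to the encoder's
            out.append(phrase)
        return "".join(out)

    assert lz78_decode(lz78_encode("abababababab")) == "abababababab"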


Yup, though it isn't dictionary-based compression. Nonetheless, that's a fascinating take.


With preprocessing it is dictionary-based compression: it uses a fixed dictionary to improve compression.


This is what the games industry uses in practice. It is optimized for fast decompression. http://www.radgametools.com/oodle.htm


Reads like a snake-oil blurb, to be honest. No comparison against any modern compressors, e.g. LZ4; no specs; no pricing. Just "trust us, it's awesome." Looks really strange.


Of course there is comparison with LZ4: http://www.radgametools.com/oodlemermaidselkie.htm

LZ4HC is LZ4 data compressed with the slower, high-compression compressor to achieve a better ratio; it is decompressed using the regular LZ4 decompressor.

They also compare with zstd and lzma, which are the open source competition.

I don't know what "specs" you think are missing. They do say which architectures are supported, what the memory use is, whether multi-threading is supported, etc.

No public price, true.

You can read the blogs of the people who work on this code to get a feel for the caliber:

https://cbloomrants.blogspot.com/

https://fgiesen.wordpress.com/

These are known people with known reputation. The product itself is well known in the industry.


What about decompression speed?


For real

I know shipping speed is important, but it seems game developers could try harder with delta-compressing their updates; it's hard to believe a tens-of-GBs download is the best they could do when you already have a Blu-ray disc or something (unless it's new content, of course).

I wonder how much their bandwidth costs are and how quickly a reduction would pay for itself.


Does DRM contribute to this perhaps?

Given the layers of encryption and obfuscation, and leaving aside the merits or otherwise of DRM in the first place, I can imagine it could be hard to do delta patching. At least, without making it easier to crack the DRM - e.g. you’d presumably have to decrypt the asset that needed patching, apply the delta, and re-encrypt/re-sign it all client side.


Would it even pay for itself? It seems to me that bandwidth is pretty cheap compared to computation costs, so why bother trying to squeeze everything out? Just do some basic compression and you're good to go. Buy a bigger SSD if necessary; with technologies like reBAR, compression just slows down a game anyway.


As far as cloud providers go, bandwidth is where the lock in is. It's priced to make it too expensive to switch to a competitor.


What would pay for itself is p2p, e.g. bittorrent... alas.


Blizzard used to use Bittorrent in some of their installers/updaters a while ago.

I'm not sure how viable this is today, or ever was: Many residential connections are heavily asymmetric. Players presumably aren't too excited either by the idea of continuing to seed while they already play online, with shitty ISP-provided routers suffering hundreds of milliseconds of bufferbloat whenever the uplink is saturated.


>Blizzard used to use Bittorrent in some of their installers/updaters a while ago.

Yes, WoW used to have a BitTorrent-based updater.

>Many residential connections are heavily asymmetric.

this is a US-centric view, though. For instance, I have 300Mb up/down. At least for most of Europe (and I suppose Asia), it's pretty common not to be restricted in any way.

I could leave a BitTorrent client running non-stop and wouldn't feel it. Like I've said: alas. There is some stigma associated with p2p/torrents.


I have very rarely seen symmetric broadband plans available in Finland. In the past I've had 10/1, 24/1, 100/10 and now I have 1000/100. With ADSL connections being asymmetric was basically necessary because the bandwidth is so limited - I think most consumers would prefer 24/1 over 12/12. With fiber I think it's more about market segmentation: ISPs want businesses to buy significantly more expensive symmetric plans.


Not sure where you are in Europe, but in Germany you have to search a bit to find a residential ISP with symmetric bandwidth. Mine is 50/10.


Asymmetry is the rule in my country for both copper and fibre lines, although it's less pronounced with fibre. Symmetric lines are a specific business option.


> At least for most of the Europe (and I suppose Asia), it's pretty common not to be restricted in any way.

In the Philippines, 15 Mbps down / 3 Mbps up ADSL with a static public IP is not something that a foreigner could easily buy. Globe Telecom in Cebu asked for evidence that I have sufficient income to pay for this (2600 PHP per month), and their rules only allow documents from the Philippines, so my bank statement was rejected. I had to talk to their higher-level manager to negotiate an exception. And their default router has 2 seconds of bufferbloat, so I bought a TP-Link Archer C7 and installed OpenWRT to compensate.


> this is a US centric-view, though

I live in Europe and I have a 1000/0.5 Mbit/s connection.


My gut feeling says you have a 200/0.5 Mbit/s connection at most, because, just from watching a download monitor out of the corner of my eye, downloading something at 100 Mbit/s requires about 250 kbit/s of ACK packets in the upstream.

No more ACKs? No more DL!


No, I can still hit (almost) 1 Gigabit on the downlink. DOCSIS networks are so heavily asymmetric, they take special measures (ACK compression) to get around that!


Interesting... never been on Cable/DOCSIS.


That seems very atypical. As the link capacity ratio goes past 10:1, even for typical downloads, speed starts to be limited by the ability to send requests, or even TCP ACKs.

I've seen 20:1 offers on cable providers, but it's mostly marketing BS, as you're unlikely to get full download speed in most circumstances due to upload limitations.


True - my nominal bandwidth (i.e. what I pay for) is 1000/50.

I do reach the full downlink speed though due to (presumably) ACK compression in the modem and ACK synthesis/reexpansion in the CMTS. You are right that otherwise such a high ratio would not be feasible with TCP.


Plain single-stream download-only TCP should fit without compression, probably even without Delayed ACKs.

IME, the line starts to choke when you actually want to send requests and they compete with ACKs, and your internet stutters and crawls in both directions. Though that was with a slower connection; it's harder to fill a bigger pipe.

BTW, is the ACK compression counted for the consumer's benefit?


This is not the standard at all in Europe, maybe that's the standard in your specific country?


I have lived in quite a few countries, and travelled as well. Someone mentioned Germany (and Bavaria); indeed, it had some of the worst internet a couple of years back. OTOH, Austria was totally fine.


"It's fine" is a different statement from "up- and downlink bandwidths are heavily asymmetric", no?

While asymmetric connections make sense for many residential use cases, P2P is not one of them.


Your claim was that the lines were symmetric and "unrestricted". Neither are the standard unfortunately.


p2p wouldn't solve the main issues, which are excessive download times and poor user experiences for upgrades.


Fabrice Bellard is proof that 10x programmers exist. Or even more than 10x; he single-handedly wrote a software 4G LTE base station (https://bellard.org/lte/), something that would normally take a team of hundreds to develop due to the complexity of the specification.


I think hero-worship in any field is weird and unhealthy. That said, Bellard is astonishingly talented and prolific and it's hard to look at someone like that and say he's just a mortal who tries his best to write code, just like everyone else.

So what is it that causes this big gulf in productivity? I know that as a programmer I have all sorts of internal and external obstacles to just sitting down and churning out entire projects, let alone something useful with clever and high-quality code. Is a 10x rockstar just somebody who's been able to eliminate most of their obstacles, and has some amount of talent left over to solve problems with?


> it's hard to look at someone like that and say he's just a mortal who tries his best to write code, just like everyone else.

would anyone look at Usain Bolt and say that he's mortal and just like everyone else? Why is it so difficult to imagine that there are people who are just so much better than you, that you (an average person) can never catch up?

Most people have no trouble understanding that they cannot reach the level of Usain Bolt. Why should programming be any different from sprinting in this case?


Usain Bolt doesn't run 10x faster than your average fit person. It's more like 2x. So it's fair to ask if there are factors other than innate ability at play when observing orders of magnitudes of differences in programmers' outputs.


The difficulty of tasks generally increases exponentially, not linearly though. 2x in absolute terms definitely translates to 10x or more in terms of skill.


The combination of better performance, more consistently over a long period of time is what makes 10x

If you put Manchester City up against a team consisting of average people who are fit and know how to play soccer it's pretty easy to see Manchester City scoring 10x the goals - at least.


Isn’t the question here more like putting Man City (elite professionals) up against average professionals? I don’t see the 10x goal differential there necessarily. It would be unremarkable for a “10x programmer” to be 10x better than a hobbyist, which is what I see as the equivalent of someone “who is fit and knows how to play soccer”.


1) Yes, Bolt is also mortal.

2) If quantified, the performance gap between me (unfit, untrained) and Bolt would be smaller than the gap between me (a practised programmer) and Bellard.


>If quantified, the performance gap between me (unfit, untrained) and Bolt would be smaller than the gap between me (a practised programmer) and Bellard.

This is quite the assumption. Bolt is the fastest man we've ever seen. You actually think you're more closely competitive to him than Bellard???


It may be stated poorly, but what the OP is trying to say is that, in terms of raw output, the relative time difference in the 100m dash is smaller than the relative difference in software output between them.

Obviously that's silly, because the physical result is approached asymptotically. World-class athletes train and compete their whole lives for a fractional difference in result. Clearly the output of the human brain is not constrained by anything as asymptotic as what the human body can physically be trained to accomplish.


I would say an innate talent for 'coding' is a relatively minor factor. What is very rare to see is people showing the level of responsibility and grit to get on by themselves and build something. Also, in a business setting, it is unheard of for someone to have the opportunity, or even be trusted enough, to do it; not to mention all the other business factors.


Just like Usain Bolt’s physical attributes are a minor factor and it’s his grit and responsibility that makes him so quick?


No, it is completely different; the analogy doesn't work. Coding isn't running.


> I would say an innate talent for ‘coding’ is a relatively minor factor

I would argue that innate talent is strongly linked to genetics. In the case of coding, I think intelligence plays a large role.


> innate talent is strongly linked to genetics

I do not believe there is much of a spread in intelligence based on genetics. Short of disability, and given the same consistency, training, and opportunities, I seriously doubt that anything even remotely close to a 10x gap would be seen.

I am certainly not well-versed in the science on this topic, though; biology is certainly my worst subject. But I have heard of no research showing such a divide.


It plays a role, sure. There will be significant diminishing returns though. What I’m suggesting is there are other factors at play here that are way more important than a few extra IQ points. Completing a significant project isn’t one dimensional.


The trick is that even a tiny difference in ability can lead to a large difference in results as the number of decision steps involved in a task increase. The expert wastes less time in search space. This is a combination of good mental models honed from experience and attention to detail.

Something unusual about Bellard is the breadth of his knowledge. There are probably a fair few people who can write a fast, deterministic, low-level optimized neural network library. A decent number can write an autodiff library. But few can do both, and once that's done, I daresay the hard part is over.

Given such a library, implementing an LSTM or transformer model and combining it with Adaptive Modeling/incremental learning of data statistics using a neural network to feed an Arithmetic Coder is something that is within the reach of many more.

In summary, a 10x developer is not 10x smarter or 10x more skilled, they just waste less time investigating unproductive paths and are maybe 1% less likely to take wrong turns. This ends up making a huge difference for large complex tasks.


> Something unusual about Bellard is the breadth of his knowledge.

One thing I’ve noticed is that he chooses to apply himself to many different subfields of development, but always using the same technologies e.g. all his work seems to be in either C or JavaScript.

I’ve only worked in a couple of subfields but have seriously used at least 7 different languages, plus countless libraries. Compared to Bellard I’ve probably spent (wasted?) a large amount of time trying to choose, and then learn, the “best” tool for each job.


To be fair, Bellard has the luxury of doing that because he's generally writing things from near scratch and he's not beholden to other people or outside requirements.


Some people are just very intelligent and can grasp concepts quickly and apply this new knowledge efficiently. To them a person of average intelligence is often like a toddler that is struggling to put the cube into the round hole.


There are billions of people in the world, over seven thousand people with one-in-a-million intelligence, seven million people with one-in-a-thousand intelligence, but nobody else close to Fabrice Bellard in software output.


What surprises me about Bellard, besides the excellence, is the breadth. Who could predict his next topic...

Surely his intellectual life is diverse


Why is hero-worship unhealthy? I thought it was the contrary.


Because it decreases self-confidence and initiative. It makes you think of the things that the "hero" does as exceptional and out of your reach, instead of something you could do as well if you put your mind to it properly.


Interestingly enough, seeing powerful role models inspired Schwarzenegger, Bolt, martial artists, poets, architects; even Tom Cruise inspired Air Force pilots.


It seems small, but it's a big distinction between hero worship and role model!

'Hero Worship' is about elevating someone to a categorically higher level. This can lead to thinking you could never be as good, taking everything they say as gospel, defending even their poor opinions, etc.

If we -worship- 10x developers, you can maybe better see how with the emphasis on that word things can get pretty toxic pretty quick. For example, I think many would agree HR disputes should not take talent level as an input, but if one of those people isn't even a mortal...

---

None of this is necessary to have a role model, or precludes celebrating someone for doing excellent work as is this case.


Fabrice is fantastic!

There are other programmers at his level, or with the potential to be at his level, splattered all over the industry.

Unfortunately, they ‘waste’ their time on their day job stuff, often having to do things badly because technical approaches are prescribed by others etc.

Many of us meet truly great programmers, but so few of them actually demonstrate and express it. They just grin when asked to implement micro services and add some banal features to the website etc.


He's pretty much the programmer's John von Neumann of our time, isn't he? Or at least one of the few contenders for that title


Von Neumann was more a mathematician/scientist than an engineer.


Well yes, but a 10x mathematician/scientist. Plus important enough to the field of computing to be comparable I think


Totally. Plus FFmpeg…


I love how FFmpeg is way down on his homepage. Meanwhile, even non-programmers in my world know and use it.


And QEMU...


And quickjs


And LZEXE!


And tcc/otcc


Did he ever have a day job as a programmer? My bet is 'no'.


He did; at least, a few years ago a former coworker of mine worked with him in Paris at a French streaming provider or so.


FYI, the compression speed numbers from the paper:

    gzip = 17 MB/s
    xz = 1 MB/s
    CMIX = 1.6 kB/s
    NNCP = 2~3 kB/s
(Note that NNCP is using RTX 3090 here.)

Not only is this unacceptable for hot/warm data, it's still impractical even for cold data: it takes 10-16 years to compress or decompress 1 TB of data.


It's still an amazing research result.

Demonstrating a novel compression scheme with a high compression ratio is worthwhile. It might inspire something else.


It does make me wonder: if we wrote an LZ77-based algorithm (like gzip and xz) with that time-budget in mind, could we match NNCP?


I don't know enough about xz, but for gzip I can tell you: definitely not. Even with an infinite amount of computation there's a hard limit to how well gzip can compress, and it's not very good.

For example, let's say you've decoded a stream that so far contained:

123456

a clever algorithm can guess that "7890" will be next (compression is all about making good guesses about the future data). LZ77 just can't; it has no way to express such a prediction. This is a general problem for LZ77 with data that has a predictable structure but isn't a literal repetition of previous data.
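
(Making that concrete with a toy predictor; the 0.9 is an arbitrary illustrative number. A model that bets each digit continues the ascending run pays a fraction of a bit per confirmed guess, while for "1234567890" there is no earlier occurrence for LZ77 to copy from, so it is stuck emitting literals.)

    import math

    def run_model_bits(s, p_hit=0.9):
        # Predict "next digit = previous + 1 (mod 10)" with probability 0.9,
        # spreading the remaining 0.1 uniformly over the other nine digits.
        bits = 0.0
        for prev, cur in zip(s, s[1:]):
            predicted = str((int(prev) + 1) % 10)
            p = p_hit if cur == predicted else (1 - p_hit) / 9
            bits += -math.log2(p)
        return bits

    print(run_model_bits("1234567890"))  # ~1.4 bits for the 9 predicted digits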


Well, isn't that what PNG filters do? So in that sense one could build an algorithm on top of LZ77 that tries to reversibly transform data into repeating sequences.


LZ77 is basically the "scaffolding" for every compression method that uses symmetric compressors and decompressors to replicate state so you don't need to transmit it. The moment you add a function on top that does more complex prediction than copying from the existing stream, you have a general framework that is far more powerful than basic LZ77.

NNCP is such an algorithm on top of that general idea.


Filters in PNG are not too bad for what could be done with an off-the-shelf gzip and a few lines of code, but as a compression mechanism they work poorly.

All compression is prediction + entropy coding. The entropy coding part is a solved problem. Inserting LZ77 between your predictor and the entropy coder just makes it harder to communicate the exact symbols and probabilities you have predicted.

If you're building a rocket: you can strap a rocket engine to a car, and it will still go faster than the car itself, but it's easier not to build it on top of a car.


No. People are pushing the envelope for these algorithms constantly, as part of the various competitions in data compression, and they don't get anywhere close.


Every time I see a Fabrice Bellard submission here - it's like a little cake that you found in your fridge unexpectedly. Great stuff!


It's like checking out the freezer and realizing you bought ice cream a while back.


Fabrice Bellard picks up one more side project, and beats state of the art in text compression by a substantial margin...


It's a very cool achievement (an NN in 200KB, to compress? Wow), but the CMIX compression program performs better on the 1st benchmark and is competitive on the 2nd.

So he does not beat the SoTA “by a substantial margin”. Your comment is overstating the case.


True.

Aside from the compression efficiency, I'd like to see other metrics too, like the work it takes to compute the compression (in units of energy, time, or whatever), to get more of a sense of which workloads something like this would make sense for. These super-efficient compressors often only make sense for long-term archival and compress-once-transmit-often scenarios.

It appears http://www.mattmahoney.net/dc/text.html (that Bellard links to on the NNCP page) has some information:

- NNCP (w/GPU) used 161812 nanoseconds/byte to compress, 158982 nanoseconds/byte to decompress, on an Intel Xeon E3-1230 v6, 3.5 GHz, RTX 3090 GPU, using about 6GB of RAM, to compress enwik9 down to 110,034,293 bytes

- CMIX used 602867 nanoseconds/byte to compress, 601569 nanoseconds/byte to decompress, on an Intel Core i7-7700K, 32 GB DDR4 using about 25GB of RAM, to compress enwik9 down to 115,714,367 bytes

- gzip -9 used 101 nanoseconds/byte to compress, 17 nanoseconds/byte to decompress, on an... Athlon-64 3500+ in 32-bit mode running XP, to compress enwik9 down to 322,591,995 bytes

- 7zip 9.20 used 1031 nanoseconds/byte to compress, 42 nanoseconds/byte to decompress, on a Gateway M-7301U laptop with 2.0 GHz dual core Pentium T3200 (1MB L2 cache), 3 GB RAM, Vista SP1, 32 bit, to compress enwik9 down to 227,905,645 bytes

These benchmarks are not directly comparable, as vastly different machines were used, but it still gives a trend at the very least.

As expected, these super-efficient algorithms are slow, really slow. Honestly, I'd say that today they're too slow to make sense even for long-term archival, and far too slow for compress-once-transmit-often scenarios, as the decompression rate is almost as slow as the compression rate.

Still, it's impressive work, don't get me wrong.


This is very interesting, but for people who are not from the field, note:

    1. The compressors he is comparing to *are* neural networks themselves (PAQ)
    2. He is using a rather small corpus enwik8 (100,000,000 bytes), instead of the common enwik9
    3. He is not publishing the amount of RAM or hardware he used
http://mattmahoney.net/dc/dce.html (seems down)

https://web.archive.org/web/20210216231934/http://mattmahone...

It seems Bellard is starting to play with these. I bet he will find interesting things, and I would love to see him interact on the Russian compression web forum (please don't link). Big names in compression like Yann Collet, Matt Mahoney, and Jarek Duda used to hang out there.


The linked page has results for both enwik8 and enwik9, right?


Don't know how I missed enwik9. Maybe too early for me. ¯\_(ツ)_/¯


I often wonder where Fabrice Bellard finds the time for all of his projects


If you remove all the time wasters and do a strict 40 hours of work, you'd be surprised how much time you have left.

Remove the time wasted on HN/reddit/watercooling and the extra work hours spent compensating for it, and cut the "cord" for all or most entertainment. Instead, read technical stuff. Even if you do it for just a couple of hours a day you'll leave most people behind. Same with programming. Ideally do it in the morning before work, so go to sleep very early.


For me, social media on the whole has been quite a valuable use of time. It feels less like a waste and more like research or learning.

The serendipity of HN, Reddit and Twitter has exposed me to a lot of ideas and information I would otherwise never have known about.


Sure. I *try* to limit my HN/Wikipedia/etc usage because it's interesting but it gets me very distracted from my goals. For example, my monkey brain hears a song and I suddenly need to get the lyrics, the story behind it, etc. The monkey brain is very creative at finding ways to avoid doing difficult tasks.

Reddit and Twitter are mostly a waste of time nowadays due to the signal to noise ratio. I dropped my reddit addiction and I'm working on eliminating Twitter.

I wish there were a social network system where people pool and share what they are expected to be doing and help each other. Or at least encourage each other to focus. But this goes against the advertisement paradigm where they want us to be impulsive and stupid. Sad, sad, sad.


I think that a lot of serendipitous moments created by social media have actually helped me move towards my goals.

Not directly, but by giving me new perspectives and ideas that bounce off my current perspective and ideas to create new ways of thinking about things.

My perspective now is that too much focus is a bad thing because it comes at the expense of broadening your perspective.

Reddit I agree is low signal, but I think Twitter is high signal depending on who you're following.


I think even Reddit can be "high SNR" depending on the subreddit. Some subreddits dedicated to specific topics (e.g. compilers, webdev, machine learning) provide updates on industry trends/practices and interesting projects/ideas/papers, and you can ask for help with issues you face. Though I do agree that it can be a time sink, and not as productive as browsing makes you feel.


> Twitter is high signal depending on who you're following.

The signal might be good but is it the signal you should be focusing on? There is an amazing amount of interesting stuff being discovered all the time. But is it helping you achieve concrete goals?

With Twitter I see a trap. Say you work on implementing/using <Knowledge Domain X> and follow exclusively people doing similar work. They will also share things outside this domain. Since 2016, almost all science Twitter users have become highly political. And there is no way to filter this out.

This is a feature for Twitter and the like. They profit by making you emotional and sending you down a rabbit hole of impulsiveness.


> The signal might be good but is it the signal you should be focusing on? There is an amazing amount of interesting stuff being discovered all the time. But is it helping you achieve concrete goals?

My point is that putting everything through the filter of "does this directly help me achieve my goals?" isn't the best way to pursue your goals, assuming we're talking about big goals where the path to achieving them isn't clear.

I recommend checking out this podcast episode called "Why Greatness Cannot be Planned". It really changed my perspective on achieving my long term goals: https://www.youtube.com/watch?v=lhYGXYeMq_E

Also on Twitter, yeah some people don't only tweet things in their domain. But I've found that some people do reliably stick to a single domain. Though I don't really follow science people on Twitter.


Maybe it's an explore/exploit thing: social media is good for exploring, but to build depth we should switch off the socials and hit the books every other week.


Bellard is so awesome.

But can someone please explain to me how this guy finds the time to build all those amazing things? There's so much non-trivial stuff on his list. I wonder... Does he have kids? A family? Is he rich? Does he get millions to build quantum computers? I really wanna know :-)

Thanks


Oh hey I did something like this a while ago, but much less polished. I basically made an rANS encoder and hooked it up to GPT-2 (via the excellent transformers library from HuggingFace) to see what compression ratio I could get. Bellard's approach seems to use a custom model, which is probably advisable as big language models like I used are very slow. I didn't try to optimize much and only ever tested on my little personal laptop with just CPU, but I only got encoding speeds of ~0.1 Kb/s (limited mostly by available memory). However I was able to get very good compression factors of 6.5x using standard GPT-2 and 7.3x using the medium version.

Here's the code if anyone's interested: https://github.com/trsigg/lm-compression . I haven't looked at it in a while except for updating it today to work with more recent versions of the transformers library, but I think it should still function and is pretty easy to try out with different models. I thought about using something like a hidden Markov model instead which would be much faster (even more so because it would enable using tabled rather than range-based ANS), but I haven't gotten around to it.
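
(The core of the idea in a few lines, for anyone who wants to play with it; this uses the real transformers API, but replaces the rANS bit-packing with the ideal code length it approaches. The first token is skipped here; a real coder would encode it under some fixed prior. Recomputing the full forward pass per position is wasteful and just keeps the sketch short.)

    import math
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    text = "The quick brown fox jumps over the lazy dog."
    ids = tok.encode(text)

    bits = 0.0
    with torch.no_grad():
        for i in range(1, len(ids)):
            logits = model(torch.tensor([ids[:i]])).logits[0, -1]
            p = torch.softmax(logits, dim=-1)[ids[i]].item()
            bits += -math.log2(p)  # what an rANS/arithmetic coder would emit

    print(f"{bits / 8:.1f} bytes ideal vs {len(text)} bytes raw")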


After a bit more digging, it seems Bellard also has a version using GPT-2: https://bellard.org/libnc/gpt2tc.html Looks much cleaner and faster than my little winter break project, cool to see this implemented properly!


Using a pretrained model is a little bit of cheating; you would have to include the size of the weights in the ratio.


Those two metrics converge as the size of the text being compressed goes to infinity. It's necessary to include model size for things like the Hutter prize, which involve compressing a fixed text (to avoid hard-coding), but it isn't usually a useful metric for compression, especially because it makes the compression ratio depend on the size of the data being compressed.
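
To put rough numbers on it: a 500 MB model amortized over 1 GB of text adds 4 bpb of overhead (500e6 bytes x 8 / 1e9 bytes), but amortized over 1 TB it adds only 0.004 bpb; in the limit, only the per-symbol code length matters.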

Edit: I thought the model in the above paper was pre-trained, but apparently it's only trained on the data as it arrives! That's indeed a very interesting approach, I wouldn't have expected that neural network models would converge quickly enough for it to be useful.


I hope he's eligible for the Hutter prize for this improvement, should be a pretty nice sum.


True story: a long time ago, I was interviewing a guy and wanted to check his coding skills. I asked him to write some simple C function implementing a simple algorithm, and let him use my PC as I watched (did I say it was a very long time ago?). To my surprise, instead of writing the C code right away, the guy opened Far on my PC and, in the blink of an eye, using hotkeys, found some existing files on my disk, glanced through them, returned to the editor, and wrote the function.

When I asked what that was about, and why he had to dive into my C files, he admitted he doesn't care to remember C syntax and just had to look it up quickly to write his function.

It turned out he wrote most of his code in x86 assembly, and pretty complex code at that. His main project in assembly was a compressor that was faster and more efficient than RAR, which at that time was at the top. I tried it, and it was true.

Many years later, the guy went on to win the Hutter prize, and kept improving his own record year after year. I admire his grit and persistence, and how he keeps discovering new ways to improve his algorithms. And I wouldn't be surprised if he still maintains them in pure assembly (he did originally).

I did hire the guy, and enjoyed working with him.


So this guy was Alexander Rhatushnyak?


Sadly not, because it only works well on enwik9 but not on the smaller enwik8.

I would guess that's down to neural networks' poor performance with small training sets.


According to http://prize.hutter1.net/#prev he might actually be eligible.


GPUs are excluded, unfortunately.


Anything a GPU can do can be done on a CPU, albeit slower.


Uh... not so great: this algorithm is already very slow :)


This seems like an old-school and specific way of describing a much larger goal using NNs and autoencoders: taking data and reducing its dimensionality while getting a (near) perfect reconstruction of the original input from the latent-space data. Reducing (haha, pun) this broad effort to just "data compression" leaves a lot of nuance and use out of sight.


There is a long NNCP thread on the forum dedicated to compression algos:

https://encode.su/threads/3094-NNCP-Lossless-Data-Compressio...


I suspect his neural net library used in this could become quite useful and propagate far and wide if it were open-sourced: https://bellard.org/libnc/


For reference, this is what it's License section says:

  The LibNC library is free to use as a binary shared library.
  Contact the author if access to its source code is required. 
https://bellard.org/libnc/libnc.html#License


Compression ratio/speed

  Program or model     size: bytes    ratio: bpb  speed: KB/s
  xz -9                24 865 244     1.99        1020
  LSTM (small)         20 500 039     1.64        41.7
  Transformer          18 126 936     1.45        1.79 (!)
  LSTM (large2)        16 791 077     1.34        2.38

Note that decompression is not faster than compression (unlike with xz).


Okay. Stupid question. What in the world do the ratio numbers mean? I get that "bpb" means "byte-per-byte". Is it input bytes per output bytes? Or the other way around? And why do some of the ratios go below 1.0? e.g. NNCP v2 (Transformer) 0.914


bpb is bits per byte, so 8/bpb is the compression ratio. Going below 1.0 bpb means better than 8:1 compression; e.g. 0.914 bpb is 8/0.914 ≈ 8.75:1.


Aha! thank you. Now I'm much more impressed


bits per byte?


It's a lightweight ML library... but I'm not sure anything with CUDA as a dependency can be called lightweight.


Compression achieved on the first billion bytes of the English Wikipedia is only 0.88 bits per byte (bpb). Even with the decompressing program being large, at almost 200KB, that is... wow!


Note that it is an XML formatted version of Wikipedia, so some of that data is very predictable XML tags.


True, but that would hold for all common compressors too, so beating them is still a huge win.


Original 2019 discussion for those curious:

https://news.ycombinator.com/item?id=19589848


Did anyone manage to use the provided Windows binary to successfully compress anything? For me, it exits producing a 0-byte "compressed" file.


That’s impressively small!


Yep, but file metadata still takes some space.


It needs AVX-512.


Fabrice Bellard is a legend!


When I was a student at university, I thought it might be possible to combine the quantisation used in lossy compression like JPEG with NNs in some way...


paq also has a similar NN coder, but a much better and much faster one. It's written in asm, but open source. LibNC is closed.


Could this address lower file sizes for lossless video and audio?


It seems too slow for real time decompression


Real time decompression isn't the only requirement, for example media storage has massive needs now studio production is moving to the cloud. If a new lossless codec is possible even if it needs a day to compress or decompress there are still huge benefits in some applications.


No.


Could this work on an M1, if optimized for its ML cores?


So what's the Weissman score of this baby?


Extremely low: it's a very slow algorithm. Think of xz being slow and taking forever... this is 3 orders of magnitude worse.


In the TV show "Silicon Valley", the Weissman score measures compression, not speed.

In reality, the Weissman score does not exist.

I highly recommend the series if you haven't seen it :)


>the Weissman score measures compression, not speed

This is false; it uses both, hence 'middle out' was suitable for video chat and all kinds of enterprise software (incl. a custom appliance box). Overall, a do-it-all compression.

The score is so popular it has its own Wikipedia page[0]. It compares both the required time and the compression ratio of measured applications with those of a de facto standard for the given data type.

[0]: https://en.wikipedia.org/wiki/Weissman_score


Is there a church where one can go and pray to God Bellard?


You're in it.


Guys, are you sure that Bellard is human?



