NNCP: Lossless Data Compression with Neural Networks (2019) (bellard.org)
397 points by ToJans on May 22, 2021 | 160 comments


I'd like to highlight the Introduction of nncp.pdf, which explains in very compressed form how it works and encompasses all the magic.

"The lossless data compressor employs the traditional predictive approach: at each time t, the encoder uses the neural network model to compute the probability vector p of the next symbol values t knowing all the preceding symbols s0 up to st−1. The actual symbol value st is encoded using an arithmetic encoder with approximately −log2 ( pst ) bits. Then the model is updated knowing the symbol st. The decoder works symmetrically so there is no need to transmit the model parameters. It implies both encoder and decoder update their model identically.

When no preprocessing is done, st represents the byte at position t. Hence there are Ns= 256 different symbol values from 0 to Ns−1."

For those more familiar with Huffman coding: this uses arithmetic coding instead https://en.wikipedia.org/wiki/Arithmetic_coding

It allows adaptive coding (i.e. changing the probabilities of symbols dynamically). Here these probabilities are modelled dynamically using a neural network.

The magic is that the neural network parameters are defined implicitly: you don't need to transmit them, therefore you can use as big a neural network as you want.

During decoding, the network, starting from scratch, is continuously trained on the freshly decoded data, in the same way it was during encoding. The more data you compress, the more you train the internal neural network parameters, and the better the prediction for the next character gets.

You decode 20 bytes, you update the model with this new data, you decode 20 new bytes with the new model, you update the model, and so on. Every 1,000,000 bytes you can even update your model multiple times using all currently available data, to make the network converge faster.

Of course this only works if everything is exactly deterministic, which Fabrice Bellard took great effort to guarantee; no small engineering feat.
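
(To make the symmetry concrete, here is a toy sketch in Python; a trivial byte-count model stands in for the neural network, and none of this is NNCP's actual code. The decoder can replay exactly the same predict-then-update steps on the symbols it has already decoded, so no model state ever needs to be transmitted.)

    import math

    class AdaptiveByteModel:
        # Order-0 adaptive model: Laplace-smoothed byte counts. Stands in
        # for NNCP's neural net; the only property that matters here is
        # that it is deterministic and updated identically on both sides.
        def __init__(self):
            self.counts = [1] * 256
            self.total = 256

        def prob(self, symbol):
            return self.counts[symbol] / self.total

        def update(self, symbol):
            self.counts[symbol] += 1
            self.total += 1

    def ideal_code_length_bits(data):
        # Bits an arithmetic coder driven by this model would emit.
        # A decoder runs the same loop on its decoded output, keeping
        # its copy of the model exactly in sync with the encoder's.
        model = AdaptiveByteModel()
        bits = 0.0
        for b in data:
            bits += -math.log2(model.prob(b))  # cost of encoding byte b
            model.update(b)                    # model learns after each symbol
        return bits

    data = b"abababababababababab"
    print(ideal_code_length_bits(data) / len(data))  # well below 8 bits/byte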


I was just working on a similar project and didn't think you could do decode without sending weights, amazing!


The concept is rather intriguing in terms of having a compression algorithm that evolves toward an ideal optimization asymptote. I, for one, would be annoyed at the thought that compressing an identical artifact might result in a different compressed size, as it would sow doubt as to whether they actually had the same input.


As long as the initial state of the network is fixed, the compressed output will be identical every time.


What is the probability that there are no collisions by the encoder?


What do you mean by collisions?

As described, arithmetic coding does a 1:1 mapping of input to output, while the NN deterministically generates the probability distribution for the arithmetic coder. Conceptually, it would work with any deterministic/reproducible function, even a PRF or a running hash, though the point of the NN is to give good distribution estimates for interesting inputs, and thus short output from the arithmetic encoder.

Presumably, there are some limited choices during coding, which then need to be encoded as well.


100%.


Is the model actually trained (=updating the weights instead of just the state) during en/decoding?


Yes, that's the trick: the model is trained during both encoding and decoding, from the data that is encoded/decoded, continuously as the data flows, starting from the same blank state in both cases and then trained with exactly the same data (the raw data).

Because training is deterministic and the same data is presented during encoding and decoding, the weights of encoder and decoder are always kept in sync, such that the network used when encoding the nth byte is exactly the same as the network used when decoding the nth byte.

The weights of the network are the internal state.


Compression speed is around 2 KB/sec.

The figure is given in nncp_v2.1.pdf. This is roughly 500x slower than xz, which in turn is roughly 20x slower than gzip.

I have always wondered whether a compression service could be a viable commercial offering, and with NNCP's ratios and compression speed, it looks like it could! That's for cases when you need to really compress the heck out of something, e.g. prior to distributing a large static piece of data to a very large number of recipients. I think there was a service like this, used by AAA game developers for distributing patches, but it didn't take off because it had very slow decompression... and, now reading further through the paper, it looks like an issue with NNCP too -

    The NNCP v2 decompression speed is similar to its compression speed.
So it's a no-go for my brilliant and lazy startup idea, but fascinating stuff nonetheless.


I think matching the compression speed to the network speed is what you want to do.

In a place with ubiquitous fast broadband, a compromise on decompression speed might be OK.

If you're sending something to Mars... 2 KB/sec might make sense.


>If you're sending something to Mars... 2 KB/sec might make sense.

Keep in mind that 2 KB/s is on something like an RTX 3090... massively power-hungry, even for decompression.


> I think maybe matching the compression speed with the network speed is what you want to do.

You might be interested: Zstandard has an adaptive mode (zstd --adapt) that attempts to do this.


Hm, there are likely many large files in existence that are very common (DLLs, versions of shared libs, etc.) and that are frequently compressed (SCP, storage on SSDs, etc.), since disk space isn't free yet, last I checked.

Each file is really its own special theoretical multivariate space to find the best theoretical compression ratio. Usually finding the optimal or near-optimal compression is the hardest computationally.

And it's basically a perfectly cacheable computation if you have some sort of database mapping algorithm + parameters to apply to a given file.

Decompression usually isn't as big a deal for performance.

Anyway: an online database for a service that actively tries to take a known common file (matched on file size, hash, etc.) down to its best compressed representation, so we don't have to do it every time.

Cloud providers would presumably love this, since S3 probably has millions of copies of many files and would 1) get to compress the file and 2) get to share its representation across all instances.


Compression speed is around 2 KB/sec.

Apologies if this is a noob question, but could you improve it with training?


> Apologies if this is a noob question, but could you improve it with training?

All other things (GPU, code quality, etc.) being equal, the speed here mainly depends on the size of the network. Larger networks give better compression (logarithmically), but also slower speeds (linearly). There are some tricks like pruning to create significantly smaller networks with similar-ish performance, but good models are often large.

Unrelated to your question: what nobody mentions is that the decompression side also needs the complete network. With 57M and 187M parameters, those files are going to be quite large, >100MB for sure. That completely annihilates the performance wins for one-time transfers.


No, as I understand it, the network does not need to be transmitted. The decoder learns on the same stream as the encoder, so you do not need to transmit network parameters.


If I compress a large file at location A using this algorithm, and want to send it to location B, how does location B know how to decompress it?


Basically the same way your favorite codec knows how to decompress. It builds the predictor (in this case, an NN) from the known input. The predictor then assigns a symbol (or a series of symbols) to the next few bits - higher probabilities need fewer bits (I haven't checked for this specific technique, but there are techniques that don't even require an integer number of bits, e.g. range encoding).
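
(A toy illustration of the fractional-bits point, using exact rationals and ignoring the renormalization a real arithmetic/range coder needs: each symbol narrows an interval by its probability, and the output only has to name one number inside the final interval, which takes about -log2(interval width) bits. That is how a likely symbol can cost a fraction of a bit.)

    import math
    from fractions import Fraction

    def interval_for(symbols, probs):
        # Narrow [0, 1) once per symbol; the final width is the
        # product of the probabilities of the encoded symbols.
        lo, width = Fraction(0), Fraction(1)
        for s in symbols:
            start = sum(probs[:s], Fraction(0))  # cumulative start of symbol s
            lo += width * start
            width *= probs[s]
        return lo, width

    probs = [Fraction(9, 10), Fraction(1, 10)]  # symbol 0 is very likely
    lo, width = interval_for([0, 0, 0, 0, 1], probs)
    print(math.ceil(-math.log2(width)))         # ~4 bits for 5 symbols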


See https://news.ycombinator.com/item?id=27244810

As I understand it, when starting up (at the beginning of the stream) the network compresses nothing; it then guesses the next byte at every turn, deterministically from the input so far.

Decoding can "read" the fist byte, then proceed to learn and guess the second byte and so on.


see the (current) top comment

https://news.ycombinator.com/item?id=27244810

it's a very clever use of symmetry.


It's clever, but pretty much "standard" in compression. The earliest scheme I'm aware of that used symmetry between the encoder and decoder to avoid explicitly transmitting the parameters this way was LZ78 (Lempel, Ziv; 1978), but there could well be predecessors I'm not aware of.

LZ78 used it "just" to build a dictionary, but the general idea of using symmetry is there.

(A fascinating variant on this is Semantic Dictionary Encoding, from Michael Franz's doctoral dissertation (1994), which used this symmetry to compress partial ASTs to cache partially generated code for runtime code generation.)
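
(For the curious, the LZ78 version of that symmetry in a few lines of Python; a toy sketch, not a production codec. The decoder grows its phrase dictionary by exactly the same rule as the encoder, one entry per token, so the dictionary itself is never transmitted; NNCP applies the same trick to network weights.)

    def lz78_encode(data: str):
        dictionary, out, w = {"": 0}, [], ""
        for c in data:
            if w + c in dictionary:
                w += c                               # extend the current phrase
            else:
                out.append((dictionary[w], c))       # (longest-prefix index, next char)
                dictionary[w + c] = len(dictionary)  # grow dictionary deterministically
                w = ""
        if w:                                        # flush a trailing phrase
            out.append((dictionary[w[:-1]], w[-1]))
        return out

    def lz78_decode(tokens):
        entries, out = [""], []
        for index, c in tokens:
            phrase = entries[index] + c  # rebuilt from the decoder's own dictionary
            entries.append(phrase)       # identical growth rule to the encoder's
            out.append(phrase)
        return "".join(out)

    assert lz78_decode(lz78_encode("abababababab")) == "abababababab"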


Yup, though it isn't dictionary-based compression. Nonetheless, that's a fascinating take.


With preprocessing it is dictionary-based compression: it uses a fixed dictionary to improve compression.


This is what the games industry uses in practice. It is optimized for fast decompression. http://www.radgametools.com/oodle.htm


Reads like a snake-oil blurb, to be honest. No comparison against any modern compressors, e.g. LZ4; no specs; no pricing. Just "trust us, it's awesome." Looks really strange.


Of course there is comparison with LZ4: http://www.radgametools.com/oodlemermaidselkie.htm

LZ4HC is LZ4 data compressed with the slower, high-compression compressor to achieve a better ratio; it is decompressed using the regular LZ4 decompressor.

They also compare with zstd and lzma, which are the open source competition.

I don't know what "specs" you think are missing. They do say which architectures are supported, what the memory use is, whether multi-threading is supported, etc.

No public price, true.

You can read the blogs of the people who work on this code to get a feel for the caliber:

https://cbloomrants.blogspot.com/

https://fgiesen.wordpress.com/

These are known people with known reputation. The product itself is well known in the industry.


What about decompression speed?


For real

I know shipping speed is important, but it seems game developers could try harder with delta-compressing their updates; it's hard to believe a tens-of-GBs download is the best they could do when you already have a Blu-ray disc or something (unless it's new content, of course).

I wonder how much their bandwidth costs are and how quickly a reduction would pay for itself.


Does DRM contribute to this perhaps?

Given the layers of encryption and obfuscation, and leaving aside the merits or otherwise of DRM in the first place, I can imagine it could be hard to do delta patching. At least, without making it easier to crack the DRM - e.g. you’d presumably have to decrypt the asset that needed patching, apply the delta, and re-encrypt/re-sign it all client side.


Would it even pay for itself? It seems to me that bandwidth is pretty cheap compared to computation costs, so why bother trying to squeeze everything out? Just do some basic compression and you're good to go. Buy a bigger SSD if necessary; with technologies like reBAR, compression just slows down a game anyway.


As far as cloud providers go, bandwidth is where the lock in is. It's priced to make it too expensive to switch to a competitor.


What would pay for itself is p2p, e.g. bittorrent... alas.


Blizzard used to use Bittorrent in some of their installers/updaters a while ago.

I'm not sure how viable this is today, or ever was: Many residential connections are heavily asymmetric. Players presumably aren't too excited either by the idea of continuing to seed while they already play online, with shitty ISP-provided routers suffering hundreds of milliseconds of bufferbloat whenever the uplink is saturated.


>Blizzard used to use Bittorrent in some of their installers/updaters a while ago.

Yes, WoW used to have a BitTorrent-based updater.

>Many residential connections are heavily asymmetric.

this is a US-centric view, though. For instance, I have 300Mb up/down. At least for most of Europe (and I suppose Asia), it's pretty common not to be restricted in any way.

I could leave a BitTorrent client running non-stop and wouldn't feel it. Like I've said: alas. There is some stigma associated with p2p/torrents.


I have very rarely seen symmetric broadband plans available in Finland. In the past I've had 10/1, 24/1, 100/10 and now I have 1000/100. With ADSL connections being asymmetric was basically necessary because the bandwidth is so limited - I think most consumers would prefer 24/1 over 12/12. With fiber I think it's more about market segmentation: ISPs want businesses to buy significantly more expensive symmetric plans.


Not sure where you are in Europe, but in Germany you have to search a bit to find a residential ISP with symmetric bandwidth. Mine is 50/10.


Asymmetry is the rule in my country for both copper and fibre lines, although it's less pronounced with fibre. Symmetric lines are a specific business option.


> At least for most of the Europe (and I suppose Asia), it's pretty common not to be restricted in any way.

In the Philippines, 15 Mbps down / 3 Mbps up ADSL with a static public IP is not something that a foreigner could easily buy. Globe Telecom in Cebu asked for evidence that I have sufficient income to pay for this (2600 PHP per month), and their rules only allow documents from the Philippines, so my bank statement was rejected. I had to talk to their higher-level manager to negotiate an exception. And their default router has 2 seconds of bufferbloat, so I bought a TP-Link Archer C7 and installed OpenWRT to compensate.


> this is a US centric-view, though

I live in Europe and I have a 1000/0.5 Mbit/s connection.


My gut feeling says you have a 200/0.5 Mbit/s connection at most, because, just from watching a download monitor out of the corner of my eye, downloading something at 100 Mbit/s requires about 250 kbit/s of ACK packets in the upstream.

No more ACKs? No more DL!


No, I can still hit (almost) 1 Gigabit on the downlink. DOCSIS networks are so heavily asymmetric, they take special measures (ACK compression) to get around that!


Interesting... never been on Cable/DOCSIS.


That seems very atypical. As the link capacity ratio goes past 10:1, even for typical downloads, speed starts to be limited by the ability to send requests, or even TCP ACKs.

I've seen 20:1 offers on cable providers, but it's mostly marketing BS, as you're unlikely to get full download speed in most circumstances due to upload limitations.


True - my nominal bandwidth (i.e. what I pay for) is 1000/50.

I do reach the full downlink speed though due to (presumably) ACK compression in the modem and ACK synthesis/reexpansion in the CMTS. You are right that otherwise such a high ratio would not be feasible with TCP.


Plain single-stream download-only TCP should fit without compression, probably even without Delayed ACKs.

IME, the line starts to choke when you actually want to send requests and they compete with ACKs, and your internet stutters and crawls in both directions. Though that was with a slower connection; it's harder to fill a bigger pipe.

BTW, is the ACK compression counted for the consumer's benefit?


This is not the standard at all in Europe, maybe that's the standard in your specific country?


I have lived in quite a few countries, and travelled as well. Someone mentioned Germany (and Bavaria); indeed, it had some of the worst internet a couple of years back. OTOH, Austria was totally fine.


"It's fine" is a different statement from "up- and downlink bandwidths are heavily asymmetric", no?

While asymmetric connections make sense for many residential use cases, P2P is not one of them.


Your claim was that the lines were symmetric and "unrestricted". Neither are the standard unfortunately.


p2p wouldn't solve the main issues, which are excessive download times and poor user experiences for upgrades.


Fabrice Bellard is proof that 10x programmers exist. Or even more than 10x; he single-handedly wrote a software 4G LTE base station (https://bellard.org/lte/), something that would normally take a team of hundreds to develop due to the complexity of the specification.


I think hero-worship in any field is weird and unhealthy. That said, Bellard is astonishingly talented and prolific and it's hard to look at someone like that and say he's just a mortal who tries his best to write code, just like everyone else.

So what is it that causes this big gulf in productivity? I know that as a programmer I have all sorts of internal and external obstacles to just sitting down and churning out entire projects, let alone something useful with clever and high-quality code. Is a 10x rockstar just somebody who's been able to eliminate most of their obstacles, and has some amount of talent left over to solve problems with?


> it's hard to look at someone like that and say he's just a mortal who tries his best to write code, just like everyone else.

would anyone look at Usain Bolt and say that he's mortal and just like everyone else? Why is it so difficult to imagine that there are people who are just so much better than you, that you (an average person) can never catch up?

Most people have no trouble understanding that they cannot reach the level of Usain Bolt. Why should programming be any different from sprinting in this case?


Usain Bolt doesn't run 10x faster than your average fit person. It's more like 2x. So it's fair to ask if there are factors other than innate ability at play when observing orders of magnitudes of differences in programmers' outputs.


The difficulty of tasks generally increases exponentially, not linearly though. 2x in absolute terms definitely translates to 10x or more in terms of skill.


The combination of better performance, more consistently over a long period of time is what makes 10x

If you put Manchester City up against a team consisting of average people who are fit and know how to play soccer it's pretty easy to see Manchester City scoring 10x the goals - at least.


Isn’t the question here more like putting Man City (elite professionals) up against average professionals? I don’t see the 10x goal differential there necessarily. It would be unremarkable for a “10x programmer” to be 10x better than a hobbyist, which is what I see as the equivalent of someone “who is fit and knows how to play soccer”.


1) Yes, Bolt is also mortal.

2) If quantified, the performance gap between me (unfit, untrained) and Bolt would be smaller than the gap between me (a practised programmer) and Bellard.


>If quantified, the performance gap between me (unfit, untrained) and Bolt would be smaller than the gap between me (a practised programmer) and Bellard.

This is quite the assumption. Bolt is the fastest man we've ever seen. You actually think you're more closely competitive to him than Bellard???


It may be stated poorly, but what the OP is trying to say is that, in terms of raw output, the relative time difference in the 100m dash is smaller than the relative difference in software output between them.

Obviously that's silly, because the physical result is approached asymptotically. World-class athletes train and compete their whole lives for a fractional difference in result. Clearly the output of the human brain is not constrained by anything as asymptotic as what the human body can physically be trained to accomplish.


I would say an innate talent for 'coding' is a relatively minor factor. What is very rare to see is people showing the level of responsibility and grit to get on by themselves and build something. Also, in a business setting, it is unheard of for someone to have the opportunity, or even be trusted enough, to do it; not to mention all the other business factors.


Just like Usain Bolt’s physical attributes are a minor factor and it’s his grit and responsibility that makes him so quick?


No, it is completely different; the analogy doesn't work. Coding isn't running.


> I would say an innate talent for ‘coding’ is a relatively minor factor

I would argue that innate talent is strongly linked to genetics. In the case of coding, I think intelligence plays a large role.


> innate talent is strongly linked to genetics

I do not believe there is much of a spread in intelligence based on genetics. Short of disability, and given the same consistency, training, and opportunities, I seriously doubt that anything even remotely close to a 10x gap would be seen.

I am certainly not well-versed in the science on this topic, though; biology is certainly my worst subject. But I have heard of no research showing such a divide.


It plays a role, sure. There will be significant diminishing returns though. What I’m suggesting is there are other factors at play here that are way more important than a few extra IQ points. Completing a significant project isn’t one dimensional.


The trick is that even a tiny difference in ability can lead to a large difference in results as the number of decision steps involved in a task increase. The expert wastes less time in search space. This is a combination of good mental models honed from experience and attention to detail.

Something unusual about Bellard is the breadth of his knowledge. There are probably a fair few people who can write a fast, deterministic, low-level optimized neural network library. A decent number can write an autodiff library. But few can do both, and once that's done, I daresay the hard part is over.

Given such a library, implementing an LSTM or transformer model and combining it with Adaptive Modeling/incremental learning of data statistics using a neural network to feed an Arithmetic Coder is something that is within the reach of many more.

In summary, a 10x developer is not 10x smarter or 10x more skilled, they just waste less time investigating unproductive paths and are maybe 1% less likely to take wrong turns. This ends up making a huge difference for large complex tasks.


> Something unusual about Bellard is the breadth of his knowledge.

One thing I’ve noticed is that he chooses to apply himself to many different subfields of development, but always using the same technologies e.g. all his work seems to be in either C or JavaScript.

I’ve only worked in a couple of subfields but have seriously used at least 7 different languages, plus countless libraries. Compared to Bellard I’ve probably spent (wasted?) a large amount of time trying to choose, and then learn, the “best” tool for each job.


To be fair, Bellard has the luxury of doing that because he's generally writing things from near scratch and he's not beholden to other people or outside requirements.


Some people are just very intelligent and can grasp concepts quickly and apply this new knowledge efficiently. To them a person of average intelligence is often like a toddler that is struggling to put the cube into the round hole.


There are billions of people in the world, over seven thousand people with one-in-a-million intelligence, seven million people with one-in-a-thousand intelligence, but nobody else close to Fabrice Bellard in software output.


What surprises me about Bellard, besides the excellence, is the breadth. Who could predict his next topic...

Surely his intellectual life is diverse


Why is hero-worship unhealthy? I thought it was the contrary.


Because it decreases self-confidence and initiative. It makes you think of the things that the "hero" does as exceptional and out of your reach, instead of something you could do as well if you put your mind to it properly.


Interestingly enough, seeing powerful role models inspired Schwarzenegger, Bolt, martial artists, poets, architects; even Tom Cruise inspired Air Force pilots.


It seems small, but it's a big distinction between hero worship and role model!

'Hero Worship' is about elevating someone to a categorically higher level. This can lead to thinking you could never be as good, taking everything they say as gospel, defending even their poor opinions, etc.

If we -worship- 10x developers, you can maybe better see how with the emphasis on that word things can get pretty toxic pretty quick. For example, I think many would agree HR disputes should not take talent level as an input, but if one of those people isn't even a mortal...

---

None of this is necessary to have a role model, or precludes celebrating someone for doing excellent work as is this case.


Fabrice is fantastic!

There are other programmers at his level, or with the potential to be at his level, splattered all over the industry.

Unfortunately, they ‘waste’ their time on their day job stuff, often having to do things badly because technical approaches are prescribed by others etc.

Many of us meet truly great programmers, but so few of them actually demonstrate and express it. They just grin when asked to implement micro services and add some banal features to the website etc.


He's pretty much the programmer's John von Neumann of our time, isn't he? Or at least one of the few contenders for that title


Von Neumann was more a mathematician/scientist than an engineer.


Well yes, but a 10x mathematician/scientist. Plus important enough to the field of computing to be comparable I think


Totally. Plus FFmpeg…


I love how FFmpeg is way down on his homepage. Meanwhile, even non-programmers in my world know and use it.


And QEMU...


And quickjs


And LZEXE!


And tcc/otcc


Did he ever have a day job as a programmer? My bet is 'no'.


He did; at least, a few years ago a former coworker of mine worked with him in Paris at a French streaming provider or so.


FYI, the compression speed numbers from the paper:

    gzip = 17 MB/s
    xz = 1 MB/s
    CMIX = 1.6 kB/s
    NNCP = 2~3 kB/s
(Note that NNCP is using RTX 3090 here.)

Not only is this unacceptable for hot/warm data, it's still impractical even for cold data: it takes 10-16 years to compress or decompress 1 TB of data.


It's still an amazing research result.

Demonstrating a novel compression scheme with a high compression ratio is worthwhile. It might inspire something else.


It does make me wonder: if we wrote an LZ77-based algorithm (like gzip and xz) with that time-budget in mind, could we match NNCP?


I don't know enough about xz, but for gzip I can tell you: definitely not. Even with an infinite amount of computation there's a hard limit to how well gzip can compress, and it's not very good.

For example, let's say you've decoded a stream that so far contained:

123456

a clever algorithm can guess that "7890" will be next (compression is all about making good guesses about the future data). LZ77 just can't; it has no way to express such a prediction. This is a general problem for LZ77 with data that has a predictable structure but isn't a literal repetition of previous data.
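
(Making that concrete with a toy predictor; the 0.9 is an arbitrary illustrative number. A model that bets each digit continues the ascending run pays a fraction of a bit per confirmed guess, while for "1234567890" there is no earlier occurrence for LZ77 to copy from, so it is stuck emitting literals.)

    import math

    def run_model_bits(s, p_hit=0.9):
        # Predict "next digit = previous + 1 (mod 10)" with probability 0.9,
        # spreading the remaining 0.1 uniformly over the other nine digits.
        bits = 0.0
        for prev, cur in zip(s, s[1:]):
            predicted = str((int(prev) + 1) % 10)
            p = p_hit if cur == predicted else (1 - p_hit) / 9
            bits += -math.log2(p)
        return bits

    print(run_model_bits("1234567890"))  # ~1.4 bits for the 9 predicted digits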


Well, isn't that what PNG filters do? So in that sense one could build an algorithm on top of LZ77 that tries to reversibly transform data into repeating sequences.


LZ77 is basically the "scaffolding" for every compression method that uses symmetric compressors and decompressors to replicate state so you don't need to transmit it. The moment you add a function on top that does more complex prediction than copying from the existing stream, you have a general framework that is far more powerful than basic LZ77.

NNCP is such an algorithm on top of that general idea.


Filters in PNG are not too bad for what could be done with an off-the-shelf gzip and a few lines of code, but as a compression mechanism they work poorly.

All compression is prediction + entropy coding. The entropy coding part is a solved problem. Inserting LZ77 between your predictor and the entropy coder just makes it harder to communicate the exact symbols and probabilities you have predicted.

If you're building a rocket: you can strap a rocket engine to a car, and it will still go faster than the car itself, but it's easier not to build it on top of a car.


No. People are pushing the envelope for these algorithms constantly, as part of the various competitions in data compression, and they don't get anywhere close.


Every time I see a Fabrice Bellard submission here - it's like a little cake that you found in your fridge unexpectedly. Great stuff!


It's like checking out the freezer and realizing you bought ice cream a while back.


Fabrice Bellard picks up one more side project, and beats state of the art in text compression by a substantial margin...


It's a very cool achievement (an NN in 200KB, to compress? Wow), but the CMIX compression program performs better on the 1st benchmark and is competitive on the 2nd.

So he does not beat the SoTA “by a substantial margin”. Your comment is overstating the case.


True.

Aside from the compression efficiency, I'd like to see other metrics too, like the work it takes to compute the compression (in units of energy, time, or whatever), to get more of a sense of which workloads something like this would make sense for. These super-efficient compressors often only make sense for long-term archival and compress-once-transmit-often scenarios.

It appears http://www.mattmahoney.net/dc/text.html (that Bellard links to on the NNCP page) has some information:

- NNCP (w/GPU) used 161812 nanoseconds/byte to compress, 158982 nanoseconds/byte to decompress, on an Intel Xeon E3-1230 v6, 3.5 GHz, RTX 3090 GPU, using about 6GB of RAM, to compress enwik9 down to 110,034,293 bytes

- CMIX used 602867 nanoseconds/byte to compress, 601569 nanoseconds/byte to decompress, on an Intel Core i7-7700K, 32 GB DDR4 using about 25GB of RAM, to compress enwik9 down to 115,714,367 bytes

- gzip -9 used 101 nanoseconds/byte to compress, 17 nanoseconds/byte to decompress, on an... Athlon-64 3500+ in 32-bit mode running XP, to compress enwik9 down to 322,591,995 bytes

- 7zip 9.20 used 1031 nanoseconds/byte to compress, 42 nanoseconds/byte to decompress, on a Gateway M-7301U laptop with 2.0 GHz dual core Pentium T3200 (1MB L2 cache), 3 GB RAM, Vista SP1, 32 bit, to compress enwik9 down to 227,905,645 bytes

These benchmarks are not directly comparable, as vastly different machines were used, but it still gives a trend at the very least.

As expected, these super-efficient algorithms are slow, really slow. Honestly, I'd say that today they're too slow to make sense even for long-term archival, and far too slow for compress-once-transmit-often scenarios, as the decompression rate is almost as slow as the compression rate.

Still, it's impressive work, don't get me wrong.


This is very interesting, but for people who are not from the field, note:

    1. The compressors he is comparing to *are* neural networks themselves (PAQ)
    2. He is using a rather small corpus enwik8 (100,000,000 bytes), instead of the common enwik9
    3. He is not publishing the amount of RAM or hardware he used
http://mattmahoney.net/dc/dce.html (seems down)

https://web.archive.org/web/20210216231934/http://mattmahone...

It seems Bellard is starting to play with these. I bet he will find interesting things, and I would love to see him interact on the Russian compression web forum (please don't link). Big names in compression like Yann Collet, Matt Mahoney, and Jarek Duda used to hang out there.


The linked page has results for both enwik8 and enwik9, right?


Don't know how I missed enwik9. Maybe too early for me. ¯\_(ツ)_/¯


I often wonder where Fabrice Bellard finds the time for all of his projects


If you remove all the time wasters and do a strict 40 hours of work, you'd be surprised how much time you have left.

Remove the time wasted on HN/reddit/watercooling and the extra work hours spent compensating for it, and cut the "cord" for all or most entertainment. Instead, read technical stuff. Even if you do it for just a couple of hours a day you'll leave most people behind. Same with programming. Ideally do it in the morning before work, so go to sleep very early.


For me, social media on the whole has been quite a valuable use of time. It feels less like a waste and more like research or learning.

The serendipity of HN, Reddit and Twitter has exposed me to a lot of ideas and information I would otherwise never have known about.


Sure. I *try* to limit my HN/Wikipedia/etc usage because it's interesting but it gets me very distracted from my goals. For example, my monkey brain hears a song and I suddenly need to get the lyrics, the story behind it, etc. The monkey brain is very creative at finding ways to avoid doing difficult tasks.

Reddit and Twitter are mostly a waste of time nowadays due to the signal to noise ratio. I dropped my reddit addiction and I'm working on eliminating Twitter.

I wish there were a social network system where people pool and share what they are expected to be doing and help each other. Or at least encourage each other to focus. But this goes against the advertisement paradigm where they want us to be impulsive and stupid. Sad, sad, sad.


I think that a lot of serendipitous moments created by social media have actually helped me move towards my goals.

Not directly, but by giving me new perspectives and ideas that bounce off my current perspective and ideas to create new ways of thinking about things.

My perspective now is that too much focus is a bad thing because it comes at the expense of broadening your perspective.

Reddit I agree is low signal, but I think Twitter is high signal depending on who you're following.


I think even Reddit can be "high SNR" depending on the subreddit. Some subreddits dedicated to specific topics (e.g. compilers, webdev, machine learning) provide updates on industry trends/practices and interesting projects/ideas/papers, and you can ask for help with issues you face. Though I do agree that it can be a time sink, and not as productive as browsing makes you feel.


> Twitter is high signal depending on who you're following.

The signal might be good but is it the signal you should be focusing on? There is an amazing amount of interesting stuff being discovered all the time. But is it helping you achieve concrete goals?

With Twitter I see a trap. Say you work on implementing/using <Knowledge Domain X> and follow exclusively people doing similar work. They will also share things outside this domain. Since 2016, almost all science Twitter users have become highly political. And there is no way to filter this out.

This is a feature for Twitter and the like. They profit by making you emotional and sending you down a rabbit hole of impulsiveness.


> The signal might be good but is it the signal you should be focusing on? There is an amazing amount of interesting stuff being discovered all the time. But is it helping you achieve concrete goals?

My point is that putting everything through the filter of "does this directly help me achieve my goals?" isn't the best way to pursue your goals, assuming we're talking about big goals where the path to achieving them isn't clear.

I recommend checking out this podcast episode called "Why Greatness Cannot be Planned". It really changed my perspective on achieving my long term goals: https://www.youtube.com/watch?v=lhYGXYeMq_E

Also on Twitter, yeah some people don't only tweet things in their domain. But I've found that some people do reliably stick to a single domain. Though I don't really follow science people on Twitter.


Maybe it's an explore/exploit thing: social media is good for exploring, but to build depth we should switch off the socials and hit the books every other week.


Bellard is so awesome.

But can someone please explain to me how this guy finds the time to build all those amazing things? There's so much non-trivial stuff on his list. I wonder... Does he have kids? A family? Is he rich? Does he get millions to build quantum computers? I really wanna know :-)

Thanks


Oh hey I did something like this a while ago, but much less polished. I basically made an rANS encoder and hooked it up to GPT-2 (via the excellent transformers library from HuggingFace) to see what compression ratio I could get. Bellard's approach seems to use a custom model, which is probably advisable as big language models like I used are very slow. I didn't try to optimize much and only ever tested on my little personal laptop with just CPU, but I only got encoding speeds of ~0.1 Kb/s (limited mostly by available memory). However I was able to get very good compression factors of 6.5x using standard GPT-2 and 7.3x using the medium version.

Here's the code if anyone's interested: https://github.com/trsigg/lm-compression . I haven't looked at it in a while except for updating it today to work with more recent versions of the transformers library, but I think it should still function and is pretty easy to try out with different models. I thought about using something like a hidden Markov model instead which would be much faster (even more so because it would enable using tabled rather than range-based ANS), but I haven't gotten around to it.
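
(The core of the idea in a few lines, for anyone who wants to play with it; this uses the real transformers API, but replaces the rANS bit-packing with the ideal code length it approaches. The first token is skipped here; a real coder would encode it under some fixed prior. Recomputing the full forward pass per position is wasteful and just keeps the sketch short.)

    import math
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    text = "The quick brown fox jumps over the lazy dog."
    ids = tok.encode(text)

    bits = 0.0
    with torch.no_grad():
        for i in range(1, len(ids)):
            logits = model(torch.tensor([ids[:i]])).logits[0, -1]
            p = torch.softmax(logits, dim=-1)[ids[i]].item()
            bits += -math.log2(p)  # what an rANS/arithmetic coder would emit

    print(f"{bits / 8:.1f} bytes ideal vs {len(text)} bytes raw")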


After a bit more digging, it seems Bellard also has a version using GPT-2: https://bellard.org/libnc/gpt2tc.html Looks much cleaner and faster than my little winter break project, cool to see this implemented properly!


Using a pretrained model is a little bit of cheating; you would have to include the size of the weights in the ratio.


Those two metrics converge as the size of the text being compressed goes to infinity. It's necessary to include model size for things like the Hutter prize, which involve compressing a fixed text (to avoid hard-coding), but it isn't usually a useful metric for compression, especially because it makes the compression ratio depend on the size of the data being compressed.
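
To put rough numbers on it: a 500 MB model amortized over 1 GB of text adds 4 bpb of overhead (500e6 bytes x 8 / 1e9 bytes), but amortized over 1 TB it adds only 0.004 bpb; in the limit, only the per-symbol code length matters.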

Edit: I thought the model in the above paper was pre-trained, but apparently it's only trained on the data as it arrives! That's indeed a very interesting approach, I wouldn't have expected that neural network models would converge quickly enough for it to be useful.


I hope he's eligible for the Hutter prize for this improvement, should be a pretty nice sum.


True story: a long time ago, I was interviewing a guy and wanted to check his coding skills. I asked him to write some simple C function implementing a simple algorithm, and let him use my PC as I watched (did I say it was a very long time ago?). To my surprise, instead of writing the C code right away, the guy opened Far on my PC and, in the blink of an eye, using hotkeys, found some existing files on my disk, glanced through them, returned to the editor, and wrote the function.

When I asked what that was about, and why he had to dive into my C files, he admitted he doesn't care to remember C syntax and just had to look it up quickly to write his function.

It turned out he wrote most of his code in x86 assembly, and pretty complex code at that. His main project in assembly was a compressor that was faster and more efficient than RAR, which at that time was at the top. I tried it, and it was true.

Many years later, the guy went on to win the Hutter prize, and kept improving his own record year after year. I admire his grit and persistence, and how he keeps discovering new ways to improve his algorithms. And I wouldn't be surprised if he still maintains them in pure assembly (he did originally).

I did hire the guy, and enjoyed working with him.


So this guy was Alexander Rhatushnyak?


Sadly not, because it only works well on enwik9 but not on the smaller enwik8.

I would guess that's down to neural networks' poor performance with small training sets.


According to http://prize.hutter1.net/#prev he might actually be eligible.


GPUs are excluded, unfortunately.


Anything a GPU can do can be done on a CPU, albeit slower.


Uh... not so great: this algorithm is already very slow :)


This seems like an old-school and specific way of describing a much larger goal using NNs and autoencoders: taking data and reducing its dimensionality while getting a (near) perfect reconstruction of the original input from the latent-space data. Reducing (haha, pun) this broad effort to just "data compression" leaves a lot of nuance and use out of sight.


There is a long NNCP thread on the forum dedicated to compression algos:

https://encode.su/threads/3094-NNCP-Lossless-Data-Compressio...


I suspect his neural net library used in this could become quite useful and propagate far and wide if it were open-sourced: https://bellard.org/libnc/


For reference, this is what it's License section says:

  The LibNC library is free to use as a binary shared library.
  Contact the author if access to its source code is required. 
https://bellard.org/libnc/libnc.html#License


Compression ratio/speed

  Program or model     size: bytes    ratio: bpb  speed: KB/s
  xz -9                24 865 244     1.99        1020
  LSTM (small)         20 500 039     1.64        41.7
  Transformer          18 126 936     1.45        1.79 (!)
  LSTM (large2)        16 791 077     1.34        2.38

Note that decompression is not faster than compression (unlike with xz).


Okay. Stupid question. What in the world do the ratio numbers mean? I get that "bpb" means "byte-per-byte". Is it input bytes per output bytes? Or the other way around? And why do some of the ratios go below 1.0? e.g. NNCP v2 (Transformer) 0.914


bpb is bits per byte, so 8/bpb is the compression ratio. Going below 1.0 bpb means better than 8:1 compression; e.g. 0.914 bpb is 8/0.914 ≈ 8.75:1.


Aha! thank you. Now I'm much more impressed


bits per byte?


It's a lightweight ML library... but I'm not sure anything with CUDA as a dependency can be called lightweight.


Compression achieved on the first billion bytes of the English Wikipedia is only 0.88 bits per byte (bpb). Even with the decompressing program being large, at almost 200KB, that is... wow!


Note that it is an XML formatted version of Wikipedia, so some of that data is very predictable XML tags.


True, but that would hold for all common compressors too, so beating them is still a huge win.


Original 2019 discussion for those curious:

https://news.ycombinator.com/item?id=19589848


Did anyone manage to use the provided Windows binary to successfully compress anything? For me, it exits producing a 0-byte "compressed" file.


That’s impressively small!


Yep, but file metadata still takes some space.


It needs AVX-512.


Fabrice Bellard is a legend!


When I was a student at university, I thought it might be possible to combine the quantisation used in lossy compression like JPEG with NNs in some way...


paq also has a similar NN coder, but a much better and much faster one. It's written in asm, but open source. LibNC is closed.


Could this address lower file sizes for lossless video and audio?


It seems too slow for real time decompression


Real time decompression isn't the only requirement, for example media storage has massive needs now studio production is moving to the cloud. If a new lossless codec is possible even if it needs a day to compress or decompress there are still huge benefits in some applications.


No.


Could this work on an M1, if optimized for its ML cores?


So what's the Weissman score of this baby?


Extremely low: it's a very slow algorithm. Think of xz being slow and taking forever... this is 3 orders of magnitude worse.


In the TV show "Silicon Valley", the Weissman score measures compression, not speed.

In reality, the Weissman score does not exist.

I highly recommend the series if you haven't seen it :)


>the Weissman score measures compression, not speed

This is false; it uses both, hence 'middle out' was suitable for video chat and all kinds of enterprise software (incl. a custom appliance box). Overall, a do-it-all compression.

The score is so popular it has its own Wikipedia page[0]. It compares both the required time and the compression ratio of measured applications with those of a de facto standard for the given data type.

[0]: https://en.wikipedia.org/wiki/Weissman_score


Is there a church where one can go and pray to God Bellard?


You're in it.


Guys, are you sure that Bellard is human?



