
Something must be in the air. I've been working on a gzip/deflate visualizer recently as well: https://jonjohnsonjr.github.io/deflate/

This is very much a work in progress, but for folks looking for a deeper explanation of how dynamic blocks are encoded, this is my attempt to visualize them.

(This all happens locally with way too much wasm, so attempting to upload a large gzip file will likely crash the tab.)

tl;dr for btype 2 blocks:

A 3-bit block header.

Three values telling you how many extra (above the minimum number) symbols are in each tree: HLIT, HDIST, and HCLEN.
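
Here's a hedged, Go-ish sketch of reading those (br and readBits are hypothetical helpers; deflate packs bits LSB-first):

    hlit := readBits(br, 5) + 257 // number of literal/length codes (257-286)
    hdist := readBits(br, 5) + 1  // number of distance codes (1-32)
    hclen := readBits(br, 4) + 4  // number of code length codes (4-19)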

First, we read (HCLEN + 4) * 3 bits.

These are the bit counts for symbols 0-18 in the code length tree, which gives you the bit patterns for a little mini-language used to compactly encode the literal/length and distance trees. Symbols 0-15 are literal code lengths (0 meaning the symbol is omitted), 16 repeats the previous length 3-6 times, and 17 and 18 encode short (3-10) and long (11-138) runs of zeroes, which is useful for encoding blocks with sparse alphabets.

These bit counts are in a seemingly strange order that tries to push less-likely bit counts towards the end of the list so it can be truncated.
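
That order is fixed by RFC 1951; continuing the sketch above:

    // Permuted order of the code length code lengths:
    order := [19]int{16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15}

    clLens := make([]int, 19)
    for i := 0; i < hclen; i++ { // hclen above already includes the +4
        clLens[order[i]] = readBits(br, 3)
    }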

Knowing all the bit lengths for values in this alphabet allows you to reconstruct a Huffman tree (thanks to canonical Huffman codes) and decode the bit patterns for these code length codes.
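
The canonical code assignment is small enough to show in full; this is a minimal Go transcription of the algorithm in RFC 1951, section 3.2.2:

    // Given bit lengths per symbol (0 = unused), assign each symbol its
    // canonical code: codes of the same length are consecutive integers,
    // handed out in symbol order.
    func canonicalCodes(lengths []int) []int {
        maxBits, blCount := 0, map[int]int{}
        for _, l := range lengths {
            if l > 0 {
                blCount[l]++
                if l > maxBits {
                    maxBits = l
                }
            }
        }
        // Compute the first code for each bit length.
        nextCode, code := make([]int, maxBits+1), 0
        for bits := 1; bits <= maxBits; bits++ {
            code = (code + blCount[bits-1]) << 1
            nextCode[bits] = code
        }
        // Assign codes to symbols in order.
        codes := make([]int, len(lengths))
        for sym, l := range lengths {
            if l != 0 {
                codes[sym] = nextCode[l]
                nextCode[l]++
            }
        }
        return codes
    }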

That's followed by a bitstream that you decode to get the bit counts for the literal/length and distance trees. HLIT and HDIST (from earlier) tell you how many of these to expect.
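
Sketching that step with the same hypothetical helpers (clTree is the tree built from clLens above; the extra-bit counts come from RFC 1951):

    lens := make([]int, 0, hlit+hdist)
    for len(lens) < hlit+hdist {
        sym := decodeSymbol(clTree, br)
        switch {
        case sym <= 15: // literal code length
            lens = append(lens, sym)
        case sym == 16: // repeat previous length 3-6 times (2 extra bits)
            n := 3 + readBits(br, 2)
            for i := 0; i < n; i++ {
                lens = append(lens, lens[len(lens)-1])
            }
        case sym == 17: // 3-10 zeroes (3 extra bits)
            lens = append(lens, make([]int, 3+readBits(br, 3))...)
        case sym == 18: // 11-138 zeroes (7 extra bits)
            lens = append(lens, make([]int, 11+readBits(br, 7))...)
        }
    }
    litLenLens, distLens := lens[:hlit], lens[hlit:] // feed these to canonicalCodes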

Again, you can reconstruct these trees using just the bit lengths thanks to canonical Huffman codes, which gives you the bit patterns for the data bitstream.

Then you just decode the rest of the bitstream (LZSS-style literals and back-references) until you hit 256, the end-of-block (EOB) symbol.
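
A sketch of that final loop, with the same caveats (out is the output buffer; the length/distance base and extra-bit tables are in RFC 1951, section 3.2.5):

    for {
        sym := decodeSymbol(litLenTree, br)
        switch {
        case sym < 256: // literal byte
            out = append(out, byte(sym))
        case sym == 256: // end of block
            return out
        default: // back-reference: length code, then distance code
            length := lengthBase[sym-257] + readBits(br, lengthExtra[sym-257])
            d := decodeSymbol(distTree, br)
            dist := distBase[d] + readBits(br, distExtra[d])
            for i := 0; i < length; i++ { // copies may overlap themselves
                out = append(out, out[len(out)-dist])
            }
        }
    }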

If you're not already familiar with deflate, don't be discouraged if none of that made any sense. Bill Bird has an excellent (long) lecture that I recommend to everyone: https://www.youtube.com/watch?v=SJPvNi4HrWQ


Hi, this is my project! I was surprised to see it posted here. It's a debugging tool for container images that I host for myself and some friends/coworkers.

Some of its features are intentionally not very discoverable, partially to keep the interface minimal, but mostly because I like to hide them as easter eggs.

I wrote a little more context for how this (and some related tools) came to be at https://dag.dev for the curious.

Happy to answer any questions, of course, but I imagine this is a pretty niche tool.


Thanks for making it, I use it on a regular basis and it's very helpful. I could use crane, but it's often quicker just to explore a registry in a browser.


Wow, this is super cool! I tried it on some of my own images, and it supports less-common features like zstd compression and cosign signatures. It links to the documentation for all the signature fields and to the sigstore transparency log for each signature, and it shows the size of every individual file inside each layer. It's also pretty cool that the top of every page shows the shell command used to generate it.


> Rate exceeded.

Sounds interesting, but it didn't survive the Hacker News traffic.


Sorry about that. I usually don't let it scale up at all so that I don't have to worry about costs.

I've raised some limits for now so hopefully you can get through.


Even worse, in the general case you really should decompress the whole tarball through to the end, because the traditional mechanism for efficiently overwriting a file in a tarball is to append another copy of it to the end. (This is similar to why you should only trust the central directory for zip files.)
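
For example, with Go's archive/tar, resolving a name correctly means scanning every header (a sketch; lastMatch is a made-up name):

    import (
        "archive/tar"
        "io"
    )

    // lastMatch returns the final header for name, because a later entry
    // with the same name overrides an earlier one (last entry wins).
    func lastMatch(r io.Reader, name string) *tar.Header {
        tr := tar.NewReader(r)
        var last *tar.Header
        for {
            hdr, err := tr.Next()
            if err != nil { // io.EOF at the end of the archive, or a real error
                return last
            }
            if hdr.Name == name {
                last = hdr
            }
        }
    }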


I tend to agree with this sentiment, but this year I came across two situations where the use of cat is actually harmful and not just useless: https://github.com/jonjohnsonjr/til/blob/main/post/readat.md...

I still instinctively start my pipelines with cat half the time, but now I have complicated feelings about it.
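
A contrived illustration (tool stands in for anything that can seek or use ReadAt):

    # The pipe hides the fact that the input is a seekable file:
    cat big.tar.gz | tool
    # Either of these keeps it a regular, seekable file descriptor:
    tool big.tar.gz
    tool < big.tar.gz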


The proof is that they can tell you the number of leaves (again) after you have secretly removed some. Since only you know how many you removed, if the difference between their two counts matches your number, it is likely that they can indeed count the leaves on the tree. It is possible they just guessed correctly, which is why you repeat the challenge until you are convinced.
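
A toy Go sketch of the repeated challenge (count and removeLeaves are hypothetical; an honest prover's count always tracks reality, while a cheater has to guess):

    import "math/rand"

    // convinced returns true if the prover passes every round. Each round,
    // the verifier secretly removes k leaves and checks that the prover's
    // new count dropped by exactly k.
    func convinced(count func() int, removeLeaves func(int), rounds, maxK int) bool {
        n := count() // the prover's initial claim
        for i := 0; i < rounds; i++ {
            k := rand.Intn(maxK) + 1 // our secret
            removeLeaves(k)
            m := count()
            if n-m != k {
                return false
            }
            n = m
        }
        return true
    }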


Or they have a way to measure your removal, e.g., they don't count leaves but branches without leaves (in the hypothetical tree example).


We have built something very similar to what you are describing: https://github.com/chainguard-dev/apko


I gave a very compressed lightning talk last year about this: https://youtu.be/ExyWAhS2zBA


Digests cryptographically guarantee that you get the correct content, which prevents both malicious tampering (MITM, stolen credentials, etc.) and accidental mutations. This is why "immutable tags" are a bad substitute and an oxymoron.

There are also better caching properties when using content addressable identifiers. For example with kubernetes pull policies, using IfNotPresent and deploying by digest means you don't even have to check with the registry to initialize a pod if the image is already cached, which can improve startup latency.
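
Concretely, in a pod spec this might look like the following (image name and digest are made up):

    containers:
      - name: app
        image: registry.example.com/app@sha256:0123abcd...
        imagePullPolicy: IfNotPresent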


> There are also better caching properties when using content addressable identifiers. For example with kubernetes pull policies, using IfNotPresent and deploying by digest means you don't even have to check with the registry to initialize a pod if the image is already cached, which can improve startup latency.

While I agree on the unquoted part, this is also true for human-readable (aka mutable-that-should-be-immutable) tags when that pull policy is set (which is the default for everything that is not `latest`).


With a SHA you shouldn't have to change the pull policy. However, there isn't a need for Always if you have the SHA.


Can you explain why? I've found that it is a lot simpler to use git bisect if every commit builds.


There are two competing interests, and the author suggests cleaning up afterwards so that you can satisfy both.

The first interest is what you said, for every commit to build.

The second interest is to help prevent loss during development. If I am working on something and need to finish for the day, but haven't yet got the code to a working state, I want to be able to commit and push my code in case something happens to my laptop.

Also, sometimes I am making a series of big changes, but I need to make them all for the code to work. I want to have checkpoints along the way, in case I need to undo something (or even just see what I have changed since the checkpoint!). Committing WIP code that doesn't build lets you do that.

We usually solve this problem with squash merges, so every commit on the main branch is the full working feature, but there are downsides to that technique.
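
For anyone unfamiliar, the squash-merge version looks roughly like this (branch name is hypothetical):

    # Squash-merge: one buildable commit lands on main.
    git checkout main
    git merge --squash add-new-feature
    git commit

    # Or clean up the WIP commits before merging instead:
    git checkout add-new-feature
    git rebase -i main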


I'm pretty sure they mean every commit on `master` or `develop`. Not every commit in your `add_new_feature` branch.


It's in the ambiguity we can discuss where the rules make sense :)


    git bisect --first-parent
"Every merge commit should build" is a more reasonable policy.


I don't think I'd ever call a Dockerfile declarative.

