Hacker News

It is inherently not possible for Dockerfiles, as a format, to generate reproducible outputs/images. You can run whatever command you want in a Dockerfile. Docker engine itself has no way of knowing whether that command's behavior is reproducible--and in turn, has no way to guarantee reproducible images from a Dockerfile.


The format and engine could try a lot harder to make improved reproducibility the default.

As a trivial example, network access for RUN should be opt-in, not opt-out. The fact that the easiest ways to pull data in involve things like RUN wget is a design error.

A much better approach would be to have packages that install with as little script involvement as possible. Most Linux images are put together using rpm or deb packages and, other than pre/post-install scripts (which are not usually particularly necessary), package installation is fundamentally reproducible and does not require running the image. A good image building system IMO would mostly look more like:

INSTALLPACKAGES foo bar baz

And dependencies would get solved and packages installed, reproducibly.


> The fact that the easiest ways to pull data in involve things like RUN wget is a design error

Why is that? You can perfectly well get a reproducible build even when using wget: you wget your file, compute its checksum, and compare it to an expected checksum. Boom, reproducible wget.
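That pattern can be made fairly painless. A minimal sketch (the URL is a placeholder, and the pinned hash shown happens to be the sha256 of an empty file, so the stand-in download passes):

```shell
set -eu
expected="e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
# Stand-in for: wget -O tool.tar.gz "$URL"
: > tool.tar.gz
# sha256sum -c reads "hash  filename" lines and exits nonzero on mismatch,
# so a RUN step built around this fails the build if the file changes upstream.
echo "$expected  tool.tar.gz" | sha256sum -c -
```

Put the same two lines in a RUN instruction and a silently changed upstream file becomes a hard build failure instead of a surprise.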

Honestly, I've always found reproducibility harder to enforce when using Linux package managers (at least with apt-get, which messes stuff up with timestamps).


The easy way to download something in a Dockerfile:

     RUN wget URL
Your better way?

    RUN wget URL && \
        if [ "$(sha256sum <the output> | cut -d' ' -f1)" != "the hash" ]; then \
            # Wow, I sure hope I spelled this right!  Also, can a comment end with \
            echo "Hmm, sha256 was wrong.  Let's log the actual hash we saw.  Oh wait, forgot to save that.  Run sha256sum again?" >&2; \
            echo "Hmm, better not forget to fail!" >&2; \
            # Better remember that 1 is failure and 0 is success!
            exit 1; \
        fi
An actual civilized solution would involve a manifest of external resources, a lockfile, and a little library of instructions that the tooling could use to fetch or build those external resources. Any competent implementation would result in VASTLY better caching behavior than Docker or Buildah can credibly implement today -- wget uses network resources and is usually slow, COPY is oddly slow, and the tooling has no real way to know that the import of a file could be cached even if something earlier in the Dockerfile (like "apt update"!) changed.
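To make the manifest-plus-lockfile idea concrete, here is a minimal sketch. The lockfile format, file names, and the fetch stand-in are all invented for illustration; a real tool would also handle actual fetching, cache keys, and parallelism. (The pinned hash is the sha256 of an empty file, so the stand-in passes.)

```shell
#!/bin/sh
set -eu

# Hypothetical lockfile: one external resource per line,
# pinned to an exact URL and sha256.
cat > sources.lock <<'EOF'
tool.tar.gz https://example.com/tool-1.2.3.tar.gz e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
EOF

mkdir -p cache
while read -r name url sha; do
    # Fetch only on cache miss; ': >' stands in for:
    #   curl -fsSL -o "cache/$name" "$url"
    [ -f "cache/$name" ] || : > "cache/$name"
    # Verify against the pinned hash; fail the build on mismatch.
    echo "$sha  cache/$name" | sha256sum -c - >/dev/null \
        || { echo "checksum mismatch for $name ($url)" >&2; exit 1; }
done < sources.lock
echo "all resources verified"
```

Because each resource is keyed by its hash, the tooling can know a cached copy is still valid no matter what changed earlier in the build.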

Think of it like modern cargo or npm or whatever, but agnostic to the kind of resource being fetched.

If there were a manifest and lockfile, it really would not be that hard to wire apt or dnf up to it so that a dependency solver would run outside the container, fetch packages, and then install them inside the container. Of course, either COPY would need to become faster or bind mounts would have to start working reliably. Oh well.

> Honestly I've always found reproducibility harder to enforce when using Linux package managers

Timestamps could well cause issues (which would be fixable), but it's not conceptually difficult to download .rpm or .deb files and then install them. rpm -i works just fine. In fact, rpm -i --root arguably works quite a bit better than docker/podman build, and it would be straightforward to sandbox it.


> An actual civilized solution would involve a manifest of external resources, a lockfile, and a little library of instructions that the tooling could use to fetch or build those external resources.

Sounds like you're describing Nix.

I actually thought the article would be framed a bit differently when I saw the title: I think Docker and its ecosystem solve several adjacent but not intrinsically intertwined problems:

- Creating repeatable or ideally reproducible runtime environments for applications (via Dockerfiles)

- Isolating applications' runtime environments (filesystems, networks, etc.) from one another (via the Docker container runtime)

- Specifying a common distribution format for applications and their runtime environments (via Docker images)

- Providing a runtime to actually run applications in (via the Docker CLI and Docker Desktop)

In this context, a runtime environment consists of the application's dependencies, its configuration files, its temporary and cache files, its persistent state (usually via a volume or bind mount), its exposed ports, and so on.

I would argue that Docker is often used solely for dependency management and application distribution; for such use cases, things like network and filesystem isolation just present obstacles to be worked around, which is why developers complain about Docker's complexity.


What you are looking for is Mockerfiles.

https://matt-rickard.com/building-a-new-dockerfile-frontend

It’s just a proof of concept, but it at least shows what can be done if one peeks under the hood a bit.

With multi-stage builds you can already do quite a few of the things you mention, like downloading in one container and copying into another, with the download happening in parallel while apt install is running. It’s hopelessly verbose to do so, though, so one ends up not using it and just brute-forcing the simplest imperative file instead.
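A hedged sketch of that multi-stage pattern (image tags, the URL, and the hash are placeholders):

```dockerfile
# Fetch stage: under BuildKit this runs in parallel with the stage below.
FROM alpine:3.19 AS fetch
RUN wget -O /tool.tar.gz https://example.com/tool-1.2.3.tar.gz && \
    echo "<expected sha256>  /tool.tar.gz" | sha256sum -c -

# Final stage: apt-get runs here while the fetch stage downloads.
FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y --no-install-recommends ca-certificates
COPY --from=fetch /tool.tar.gz /opt/tool.tar.gz
```

Two stages for one verified download is exactly the verbosity being complained about, but it does get you parallelism and a checksum gate.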


On the one hand, that’s really cool. On the other hand, I just learned (from that article!) that the Dockerfile “syntax” is actually a reference to a Docker container. It’s turtles all the way down!

Seriously, though:

> The external files are downloaded in separate alpine images, and then use the copy helper to move them into the final image. It uses a small script to verify the checksums of the downloaded binaries: s = s.Run(shf("echo \"%s %s\" | sha256sum -c -", e.Sha256, downloadDst)).Root(). If the checksum does not match, the command fails, and the image build stops.

Having any nontrivial build operation be an invocation of an entire Docker container seems like a terrible design. Docker is cool, but actual host-native Linux userspace images are a rather nastily complicated way to express computation. What’s wrong with Lua or JavaScript or WASM or QuakeC or Java or Lisp or any other sandboxable way to express computation that is actually intended for this sort of application? (All of the above, unlike Docker, can actually represent a computation such that a defined runtime can run it portably.)

Docker images, being the sort of turtle that is not amenable to a clean build process, don’t seem like a good thing to try to fix by turtles-all-the-way-downing them.


We have built something very similar to what you are describing: https://github.com/chainguard-dev/apko


> INSTALLPACKAGES foo bar baz

That would require significant integration with the image; do you expect docker to know how to talk to apt, dnf, zypper, nix-env, apk, xbps-install, etc.?


It's at least possible in a limited sense. I'm not going to hold it up as a paragon of a solution, but cloud-init lets you just list packages and translates them automatically into a command line for most popular Linux package managers, even pacman.
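For reference, the cloud-init version of this is roughly the following cloud-config fragment (package names are the placeholder ones from upthread; cloud-init maps the list onto apt, dnf, pacman, etc. for the detected distro):

```yaml
#cloud-config
# cloud-init translates this list into the native package manager's
# install command for whatever distro the image is based on.
packages:
  - foo
  - bar
  - baz
```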

But I strongly agree with what I think is your basic gist here, which is that too many people wish Docker was something it fundamentally isn't. Its ultimate intent is as a packaging system more than a build system. Yes, it builds container images, but a container image is just a packaging method. How the software running in it builds is up to the developers of that software. Thus, Docker's goal is to work with arbitrary tooling: whatever compiler, dependency resolver, and other tooling your software already uses, you can keep using. That includes hacky bullshit shell scripts that pull in everything via wget. There is no good reason Docker should keep you from doing that if that's what you want to do. If you want deterministic, reproducible container builds, use a deterministic, reproducible build system, and put your outputs in a container image. Docker will gladly let you do that.

On the other hand, the other complaint above, about having to run the base image to build anything on top of it, I somewhat agree with. I get why they did it, because it's probably the simplest way to ensure you're not implicitly depending on the host system running Docker, so your containers won't crash for some stupid reason like the glibc in the container at runtime not matching what you had on your build host. But there were better ways to achieve this. arch-chroot, Debian's fakeroot+fakechroot, and plenty of other systems already existed to build a self-contained system on another system without implicitly building against dependencies that won't be there at runtime, and they don't require setting up and running the rather complicated Docker container engine, in particular the network bridging that can get janky, especially if your host system is using systemd-networkd. It'd be nice to have the systems for building images and running containers entirely separate and self-contained, which you can have, of course, just not with Docker.


I don't quite get the need for perfectly reproducible builds.

At least in my org, that ends up being more of a detriment than a boon. The problem? Devs hate updating libraries, which is a crucial part of security with docker.

Call me crazy, but I prefer the fact that `apt install foo` gets the latest foo and not what was pinned. We test our images before sending them to prod so if something breaks it's pretty easy to catch it.


If you want the latest foo, then tell your pinning solution that you want that. Then you get a real record of what’s actually running, you can reproduce old builds to instrument them, and you get all the other benefits of tracking what you actually built.



