Reproducible Builds in January 2022 (reproducible-builds.org)
187 points by pabs3 on Feb 6, 2022 | 47 comments


It's amazing to me how much the open source community accomplishes that huge corporations with billions of dollars can't even get close to. Thanks to all the open source contributors who give their spare time to make improvements for everybody.


It's no accident that the Google Open Source Security Team is a sponsor of reproducible-builds.org: they'd like to get the open source world up to speed with best practices that have been applied widely internally at Google for well over a decade.

Blaze (a.k.a. Bazel) rolled out at Google around 2007-ish, and the idea that consistent inputs should produce consistent outputs was fundamental to its philosophy.

Build rules like https://github.com/GoogleContainerTools/distroless to create minimal, reproducible docker images would seem radical and new to most people building docker containers these days (almost everyone uses the Dockerfile format), but it'd seem perfectly ordinary and very old-fashioned to any Googler.


I feel like I’ve hit something really special with Bazel, but there’s almost no ecosystem around it.

I'm trying to use it for game development; the model fits so perfectly for large monorepo projects with explicit dependencies. But since there's no ecosystem, I have to write a lot of my own rules from scratch, and the learning resources are many years old and in conference-talk format. It's quite dense to get into.

Even the slack requires an @google.com email to get in.


That seemed shocking, but it doesn’t appear to be true? https://github.com/bazelbuild/rules_swift/issues/536#issueco...


It was not working Friday; seems as though it is fixed now, so that's nice. I wonder how many people didn't or couldn't join, though.

Which speaks to my larger point. Feels like there’s no strong community around this.


There was an issue with the sign-up link that was fixed a few days ago. https://slack.bazel.build/ should work now. It's a public Slack and was always intended to be, as far as I know.


Thanks. I'm in now, but it must have been broken for some time.

All I could find were invalid invite links and references to FaaS programs that would generate invites, all of which had failed.


Yeah, I'm not sure how long it was broken. Slack's support for public instances seems a bit strange. Even the new invite code is only good for 2000 new members or something, after which that link will again redirect to the page that requires a @google.com address until someone notices.


The obvious question then is: why Slack?


> Even the slack requires an @google.com email to get in.

This shouldn't be the case, I'm in this slack and it's mostly non-google folks. Do you see this at https://slack.bazel.build ?


I did; that page was a log-in/sign-up page, which required an @google.com email to sign up. Someone has fixed it now; it wasn't working on Friday, which was the last time I checked.


I spent a fair amount of time at work five or six years ago trying to figure out how to make supply chain security actually possible in the general case with standard open-source tools. And I can tell you that the fact that Docker builds are fundamentally non-deterministic caused me no end of frustration and difficulty.

This was about the time that Bazel was being open-sourced, and Matt's rules_docker extension was already in there. A solution existed, so to speak, but it would have been nutty to assume that the average project would switch from the straightforward-looking Dockerfile format to using Bazel and BUILD files to construct docker containers. And Docker Inc wasn't going to play along; they were riding a high valuation that depended on them being the final word about containerization, so vocally pretending the problem didn't exist was their safest way forward.

At one point I put together a process and POC for porting the concept of reproducible builds to docker in a user-friendly format -- essentially you'd define a spec that listed your dependencies with no more specificity than you needed. Then tooling would dep-solve that spec and freeze it into a fully-reproducible manifest that encoded all the timestamps, package versions, and other bits that would otherwise have been determined at build time. Then the _actual_ build process left nothing to chance: grab the identified sources and build and assemble in a hermetic environment. You'd attach the manifest to the container, and it gave you a precise bill of materials in a format that you could confidently use for identifying vulnerabilities. Since the builds were fully hermetic, a given manifest would only ever produce one set of bits, which could be reproduced in an automated fashion, allowing you to spot supply chain inconsistencies.
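
As a rough sketch of the shape of that freeze step (hypothetical names, not the actual tooling; resolve_one() stands in for the package-manager internals that did the real work):

  import hashlib, json

  def freeze(spec_packages, resolve_one):
      # Turn a loose spec into a fully-pinned manifest: exact versions,
      # content hashes, and a fixed (not wall-clock) build timestamp.
      manifest = {"build_timestamp": 0, "packages": []}
      for name in sorted(spec_packages):          # sorted for determinism
          version, source_bytes = resolve_one(name)
          manifest["packages"].append({
              "name": name,
              "version": version,                 # pinned, not a range
              "sha256": hashlib.sha256(source_bytes).hexdigest(),
          })
      return json.dumps(manifest, sort_keys=True) # canonical encoding

The hermetic build then consumes only the frozen manifest, which is why a given manifest can only ever produce one set of bits.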

In my tooling, I leaned heavily on package providers like Debian as "owning" the upstream software dependency graph, since this was a problem they'd already solved, and Debian in particular was already serious about reproducibility in their packages.

In the end, it didn't go anywhere. There were a LOT of hacks to make it work, since the existing software wasn't designed to allow this kind of integration. For example, the dependency resolution step required splicing in a lot of internal code from package managers, and the docker container format was (and probably still is) a mess that didn't allow the end products to be properly identified as reproducible without breaking other things.

Plus, this is a problem that only people trying to do security at scale even care about. We needed a sea-change of industry thought around verifiability before my solution would seem at all valuable to people outside a few huge tech companies.


Hey Tyler!

Funny to see you here. Matt and I haven't given up on this, we're giving a lot of that another try at Chainguard.


Sweet. Glad to hear someone's working on it who knows what they're doing. :-P


A huge number of open source projects are worked on by people paid to develop them.


Yet the Reproducible Builds project has been started and driven almost entirely by Debian volunteers:

https://reproducible-builds.org/who/people/


That page specifically says their work is funded. I’m sure all the core members do a ton of volunteer work for Debian, but I believe they are now being paid for work on reproducible builds.


It's funded by donations to the Reproducible Builds project, and the donations are managed by Software Freedom Conservancy (a free software organization which also manages donations for Git, QEMU, Samba, and Wine). It's not like they are paid by their employers.

You can donate yourself. It is tax-deductible in the US.

They include logos for donations larger than $25,000. Logos are here: https://reproducible-builds.org/who/sponsors/. The top sponsor is the Google Open Source Security Team.


No, funded does not mean that people are paid for every hour of contribution they do or that they do it full-time.

Of the entire set of contributors, most are paid zero. Some are paid.


I don't know if that's fair. One of the biggest advances in build systems in ages is Bazel, and that came from Google.


Could someone with more knowledge explain why this is so difficult?

As I understand it, the goal should be that if you download a binary, and download the source and compile a binary, they should be identical.

I understand different build flags and compilers. But if a project documents their build system, why wouldn't it be reproducible?


There are several things that can make builds nondeterministic:

* Many file formats, especially (but not exclusively) archives like .zip, embed timestamps in them.

* Absolute paths are also ridiculously easy to get embedded in the build artifact.

* Other components of the user's environment can get embedded, including current locale, username, hostname. Some of these are more common than others.

* Using an actual random number generator is more common than you might think (e.g., generating a cryptographic key during the build process), although still pretty rare.

* Iterating with 'for file in directory' usually yields files in inode order by default, which makes the order inconsistent across different people's drives (although generally deterministic for a single user).

* Pointer addresses are nondeterministic. And it's easy to accidentally make a container that sorts by pointer address (e.g., a std::map or std::set keyed on pointers in C++).

* In general, there are several features of programming languages that make it easy to accidentally inject nondeterminism into your program. Concurrency is obvious, but using standard containers like Rust's HashMap or C++'s std::unordered_map combined with a "for every value in container" loop makes it easy to pick up that nondeterminism (see the sketch below). Compilers are not immune from being nondeterministic either, and while it is a bug for them to be nondeterministic, it still happens.
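
A minimal Python illustration of that last point: string hashes are randomized per process, so iterating a set can yield a different order on every run, and sorting is the usual fix:

  # Run this twice: the first line's order will usually differ between
  # runs, because Python randomizes str hashes per process.
  names = {"libfoo", "libbar", "libbaz", "libqux"}
  print(list(names))    # order depends on this process's hash seed
  print(sorted(names))  # deterministic: the usual fix in build tooling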


Another surprising example: by default, some compilers attempt more aggressive optimization if there is ample memory available.

(We discovered many surprising causes of nondeterminism during the development of Blaze aka Bazel, which prides itself on reproducibility.)


Great examples! Another one is the version number, which is really helpful in lots of situations. It can also be useful to embed the git commit for debugging. (One practical use case for these is cache busting to ensure that a browser downloads a new version of a script, for example.)


Depending on the tools used, it can be "easy" for things in the environment to leak into the build. Timestamps are a common bane of reproducible builds. I understand that a non-trivial amount of reproducible build work consists of hunting down every timestamp in a build, and then finding a way to inject alternate timestamps into them.

For example, if we just look at one of the listed reproducibility bugs in the report (https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=864082), you can get a flavor of the random bits and bobs that need to be hunted down.

We could say that a system can be relatively easily made reproducible if starting from scratch - certainly all of the tooling is there. The far more significant challenge is retrofitting reproducibility into existing large projects.


Why are people using timestamps in builds and what is the better alternative?


First of all, every filesystem has timestamps and they've got to be set to some value.

Second of all, it's useful to have some sort of version marker that lets you tell at a glance if two copies of the software are the same, or what code was running when that log file was created. Build timestamps are a classic way of doing this in a version-control-system-independent way, without advancing the 'headline' version number for every developer's test builds.

And of course, if you include things like debug symbols it's useful to be able to tell if the source files on disk have changed since the build took place, as the line numbers could be all off.

Alternatives include using a git commit hash; having a separate build pipeline for public release versions then not worrying if developers' builds aren't reproducible; and just being careful with developers' test builds.
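
As a sketch of the commit-hash alternative (assuming a git checkout is available), a build step can stamp the artifact from git metadata instead of the clock:

  import subprocess

  # Two builds of the same commit now produce identical stamps,
  # unlike a wall-clock timestamp.
  commit = subprocess.check_output(
      ["git", "rev-parse", "--short", "HEAD"], text=True
  ).strip()
  with open("version.py", "w") as f:
      f.write(f'BUILD_ID = "{commit}"\n')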


But why should different timestamps on the filesystem cause a different binary?


There’s a long list of tools that follow a general pattern: they take an input file and produce an output file, but the output file doesn’t just depend on the contents of the input file; it also embeds its modification time somewhere. Sometimes the goal is to be able to automatically regenerate the output if the input changes, by comparing timestamps. Sometimes it’s just intended as general-purpose metadata. Debian’s reproducible builds wiki page has many examples:

https://wiki.debian.org/ReproducibleBuilds/Howto#Files_in_da...

Notably, this affects archivers such as tar and zip, but there’s also a long tail of application-specific tools.
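
To make the archiver case concrete, here's a small Python sketch of a zip writer that pins every entry's timestamp (and entry order), so the same inputs always yield byte-identical output:

  import zipfile

  FIXED = (1980, 1, 1, 0, 0, 0)  # zip's earliest representable time

  def deterministic_zip(out_path, files):
      with zipfile.ZipFile(out_path, "w") as zf:
          for name, data in sorted(files.items()):  # fixed entry order
              # Pin the embedded timestamp instead of using mtime.
              zf.writestr(zipfile.ZipInfo(name, date_time=FIXED), data)

  deterministic_zip("out.zip", {"a.txt": b"hello", "b.txt": b"world"})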


Don't think binaries, think build outputs.

Imagine part of your build is generating a bunch of different files and then packaging them into an archive file, or into your installer bundle (or whatever). If your archive preserves file timestamps, then you end up with different archive files each time.


Plenty of build outputs are compressed collections of files.

Maybe you want to distribute a .jar, .deb, .whl, or docker image?


OP misspoke; the timestamp in the filesystem metadata doesn't matter for this. This is only about timestamps embedded in the binary itself.


> Why are people using timestamps in builds and what is the better alternative?

Timestamps are often an implicit part of the build which leads to the same code producing 'different' artifacts (e.g. file created timestamp, generated metadata about the build stored in the artifact itself). One alternative is to support explicitly defining the 'build timestamp' so all builds with the same timestamp produces the same artifact. This is what Maven does[0]; I'd imagine it's much harder for build systems that download and recompile dependencies.

[0] https://maven.apache.org/guides/mini/guide-reproducible-buil...
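
Concretely, the mechanism in the linked guide is a single POM property that participating plugins read instead of the build machine's clock (the date value here is just an illustration):

  <properties>
    <project.build.outputTimestamp>2022-02-01T00:00:00Z</project.build.outputTimestamp>
  </properties>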


Some people want to know when something was built. A better option is to use the Git commit of the build, or even the last-modified time of the file it's being embedded into. Alternatively, allow an env var to override what date/time the software thinks it is.
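
That env var convention exists: SOURCE_DATE_EPOCH, specified by the Reproducible Builds project. Honoring it from a build script looks roughly like this (Python sketch):

  import os, time

  # Use SOURCE_DATE_EPOCH as "now" if set, else fall back to the clock.
  build_time = int(os.environ.get("SOURCE_DATE_EPOCH", time.time()))
  stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(build_time))
  print(stamp)  # deterministic whenever SOURCE_DATE_EPOCH is pinned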


By default, many compilers include things like local filesystem paths, build server hostnames, or build timestamps into their binary artifacts. These will obviously differ build-to-build.

Even without that, it's possible to accidentally leak entropy into the build output. For example, readdir() doesn't guarantee any kind of ordering, so without sorting the list of files it is possible for a binary artifact (or even tar) to produce different output from the same input.
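
The fix for the readdir() case is one sort call; e.g. in Python (a sketch), hashing a directory listing only becomes stable once the names are sorted:

  import hashlib, os

  digest = hashlib.sha256()
  # os.listdir() order is filesystem-dependent; sorted() makes the
  # digest depend only on the names themselves.
  for name in sorted(os.listdir(".")):
      digest.update(name.encode())
  print(digest.hexdigest())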


Compilers are huge pieces of code, modified by lots and lots of hands, and until recently nobody paid any attention to whether the results were identical or not.

Your question is kinda like "if I write a few hundred lines of code, and run it, it should run, right?". The answer is yes, that's the expectation, but there is no chance that it is actually fulfilled in practice.

As an example, when Debian's reproducible builds project started, gcc had extra code to tag the executables with the time of compilation.


https://lists.reproducible-builds.org/pipermail/rb-general/2... should give you a better idea of what's going on. It's linked in the article.

  Perl sorting issues in dictionaries-common
  libxmlb used a pointer address (%p) for a hash value
  texlive-base: Reported differences in the generated ls-R


In addition to the various issues already mentioned, there are also some build systems that build in parallel and produce builds or build artefacts that depend on the order in which threads finished, which may vary from run to run.

I'm not saying you cannot build in parallel and end up with a deterministic build: that is done all the time. What I'm saying is that there are cases where some care has to be taken.
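
A tiny Python illustration of where the care goes: collecting results as threads finish bakes scheduling order into the output, while collecting in submission order stays deterministic:

  from concurrent.futures import ThreadPoolExecutor, as_completed

  def compile_unit(name):
      return name + ".o"  # stand-in for real compilation work

  units = ["a", "b", "c", "d"]
  with ThreadPoolExecutor() as pool:
      futs = [pool.submit(compile_unit, u) for u in units]
      racy = [f.result() for f in as_completed(futs)]  # order varies by run
      stable = [f.result() for f in futs]              # submission order
  print(stable)  # always ['a.o', 'b.o', 'c.o', 'd.o']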


The build environment seems to sneak into the artifacts in many ways. Check out some of the listings from the "Upstream Patches" section of the linked page. Some of the most common causes seem to be: timestamps, non-deterministic file listings, debug information containing build path data, permission bits on artifact files, etc.


Apart from what others have said, signed binaries also pose somewhat of a challenge with reproducible builds.


The NixOS reproducibility effort is tracked here: https://r13y.com


It is very ironic that the only non-reproducible path in the minimal channel is Nix itself (> 2.3)


I like and use NixOS as my primary driver, but the nix tool itself needs a product manager.

A lot of people have direct commit access and it is really hard to figure out near and medium term goals or their progress.

It's ironic how the tool that does so much to ensure other things are built reproducibly is not reproducible itself.


Thanks for sharing, did not know this project existed.

I maintain a small embedded Linux project myself, and having a reproducible build as a feature is a long-standing goal. The more tooling and awareness there is, the better.


It's a pity that Linux itself still cannot be bootstrapped without a Linux env - at least no one has done it yet. Anyway, thanks to these reproducible-builds efforts, maybe awareness of this serious problem will grow and we will see more support for projects like https://github.com/fosslinux/live-bootstrap and https://niedzejkob.p4.team/bootstrap/


I admit I've never tried this, but the Linux kernel seems like just another C program. What's to stop me from installing GNU Make, GNU coreutils, gcc, and binutils cross-compiled to target x86_64-linux-elf and bootstrapping it? It doesn't sound especially useful, but it does sound like something that would be possible if you focused on it.


Yes, it should work. You aren't in any disagreement actually. It's just that no one has done it yet, but should have.



