It's amazing to me how much the open source community accomplishes that huge corporations with billions of dollars can't even get close to. Thanks to all the open source contributors who give their spare time to make improvements for everybody.
It's no accident that the Google Open Source Security Team is a sponsor of reproducible-builds.org: they'd like to get the open source world up to speed with best practices that have been applied widely internally at Google for well over a decade.
Blaze (a.k.a. Bazel) rolled out at Google around 2007-ish, and the idea that consistent inputs should produce consistent outputs was fundamental to its philosophy.
Build rules like https://github.com/GoogleContainerTools/distroless for creating minimal, reproducible docker images would seem radical and new to most people building docker containers these days (almost everyone uses the Dockerfile format), but they'd seem perfectly ordinary and very old-fashioned to any Googler.
I feel like I’ve hit something really special with Bazel, but there’s almost no ecosystem around it.
I'm trying to use it for game development; the model fits so perfectly for large monorepo projects with explicit dependencies. But since there's no ecosystem, I have to write a lot of my own rules from scratch, and the learning resources are many years old and mostly in conference-talk format. It's quite dense to get into.
Even the Slack requires an @google.com email to get in.
There was an issue with the sign-up link that was fixed a few days ago. https://slack.bazel.build/ should work now. It's a public Slack and was always intended to be, as far as I know.
Yeah, I'm not sure how long it was broken. Slack's support for public instances seems a bit strange. Even the new invite code is only good for 2000 new members or something, after which that link will again redirect to the page that requires a @google.com address until someone notices.
I did try it: that page was a log-in or sign-up page which required an @google.com email to sign up. Someone has fixed it now; it wasn't working on Friday, which was the last time I checked.
I spent a fair amount of time at work five or six years ago trying to figure out how to make supply chain security actually possible in the general case with standard open-source tools. And I can tell you that the fact that Docker builds are fundamentally non-deterministic caused me no end of frustration and difficulty.
This was about the time that Bazel was being open-sourced, and Matt's rules_docker extension was already in there. A solution existed, so to speak, but it would have been nutty to assume that the average project would switch from the straightforward-looking Dockerfile format to using Bazel and BUILD files to construct docker containers. And Docker Inc wasn't going to play along; they were riding a high valuation that depended on them being the final word about containerization, so vocally pretending the problem didn't exist was their safest way forward.
At one point I put together a process and POC for porting the concept of reproducible builds to docker in a user-friendly format -- essentially you'd define a spec that listed your dependencies with no more specificity than you needed. Then tooling would dep-solve that spec and freeze it into a fully-reproducible manifest that encoded all the timestamps, package versions, and other bits that would otherwise have been determined at build time. Then the _actual_ build process left nothing to chance: grab the identified sources and build and assemble in a hermetic environment. You'd attach the manifest to the container, and it gave you a precise bill of materials in a format that you could confidently use for identifying vulnerabilities. Since the builds were fully hermetic, a given manifest would only ever produce one set of bits, which could be reproduced in an automated fashion, allowing you to spot supply chain inconsistencies.
In my tooling, I leaned heavily on package providers like Debian as "owning" the upstream software dependency graph, since this was a problem they'd already solved, and Debian in particular was already serious about reproducibility in their packages.
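To make the shape of that concrete, here's a toy sketch of the freeze step. Everything in it is hypothetical (the spec format, the snapshot index with its placeholder versions and hashes, the manifest fields); a real resolver would consult the distro's actual package index rather than a hard-coded dictionary:

```python
# Toy illustration of "freezing" a loose dependency spec into a fully pinned
# manifest.  The snapshot index stands in for whatever a real resolver (apt,
# etc.) would consult at freeze time; versions and hashes are placeholders.
import hashlib
import json

# Loose spec: only as much specificity as the user actually cares about.
spec = {"packages": ["openssl", "zlib"]}

# Stand-in for a snapshot of the distro's package index at freeze time.
snapshot_index = {
    "openssl": {"version": "<pinned-version>", "sha256": "<placeholder>"},
    "zlib": {"version": "<pinned-version>", "sha256": "<placeholder>"},
}

def freeze(spec, index, source_date_epoch=0):
    """Pin everything the hermetic build step would otherwise decide itself."""
    manifest = {
        "packages": {name: index[name] for name in spec["packages"]},
        # One timestamp for the whole build, instead of "whenever it ran".
        "source_date_epoch": source_date_epoch,
    }
    # The manifest hash doubles as an identity for the image it will produce.
    blob = json.dumps(manifest, sort_keys=True).encode()
    manifest["manifest_sha256"] = hashlib.sha256(blob).hexdigest()
    return manifest

print(json.dumps(freeze(spec, snapshot_index), indent=2, sort_keys=True))
```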
In the end, it didn't go anywhere. There were a LOT of hacks to make it work, since the existing software wasn't designed to allow this kind of integration. For example, the dependency resolution step required splicing in a lot of internal code from package managers, and the docker container format was (and probably still is) a mess that didn't allow the end products to be properly identified as reproducible without breaking other things.
Plus, this is a problem that only people trying to do security at scale even care about. We needed a sea-change of industry thought around verifiability before my solution would seem at all valuable to people outside a few huge tech companies.
That page specifically says their work is funded. I’m sure all the core members do a ton of volunteer work for Debian, but I believe they are now being paid for work on reproducible builds.
It's funded by donations to the Reproducible Builds project, and the donations are managed by Software Freedom Conservancy (a free software organization which also manages donations for Git, QEMU, Samba, and Wine). It's not like they are paid by their employers.
You can donate yourself. It is tax-deductible in the US.
There are several things that can make builds nondeterministic:
* Many file formats, especially (but not exclusively) archives like .zip, embed timestamps in them.
* Absolute paths are also ridiculously easy to get embedded in the build artifact.
* Other components of the user's environment can get embedded, including the current locale, username, and hostname. Some of these are more common than others.
* Using an actual random number generator is more common than you might think (e.g., generating a cryptographic key during the build process), although still pretty rare.
* Saying 'for file in directory' returns entries in a filesystem-dependent order (often related to inode or hash order rather than name order), which makes it inconsistent across different people's drives (although generally deterministic for a single user).
* Pointer addresses are nondeterministic. And it's easy to accidentally make a container that sorts by pointer address (e.g., a std::map keyed by pointers in C++).
* In general, there are several features of programming languages that make it easy to accidentally inject nondeterminism into your program. Concurrency is obvious, but using hash-based containers like Rust's HashMap or C++'s std::unordered_map combined with a "for every value in container" loop makes it easy to pick up that nondeterminism. Compilers are not immune from being nondeterministic either, and while it is a bug for them to be, it still happens. (A short sketch of a few of these issues follows this list.)
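Here's a small Python sketch pulling a few of these together - archive timestamps, directory listing order, and hash-seed-dependent iteration - along with the usual fixes. The file and directory names are made up:

```python
# A few of the sources above in one place, plus the usual fixes.
# "build_output" and "artifact.zip" are made-up names.
import os
import zipfile

build_dir = "build_output"

# 1) Directory listings: os.listdir()/readdir() order is filesystem-dependent.
files = sorted(os.listdir(build_dir))   # fix: impose an order yourself

# 2) Archive timestamps: zipfile records each member's mtime, so two otherwise
#    identical builds give byte-different zips.  Fix: clamp to a fixed time.
fixed_time = (1980, 1, 1, 0, 0, 0)      # the zip format can't go earlier than 1980
with zipfile.ZipFile("artifact.zip", "w") as zf:
    for name in files:
        info = zipfile.ZipInfo(name, date_time=fixed_time)
        with open(os.path.join(build_dir, name), "rb") as f:
            zf.writestr(info, f.read())

# 3) Hash-order iteration: Python randomizes string hashing per process (much
#    like Rust's HashMap seeds), so iterating a set can differ between runs.
symbols = {"main", "init", "helper"}
print(sorted(symbols))                  # fix: sort before emitting anything ordered
```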
Great examples! Another one is the version number, which is really helpful in lots of situations. It can also be useful to embed the git commit for debugging. (One practical use case for these is cache busting to ensure that a browser downloads a new version of a script, for example.)
Depending on the tools used, it can be "easy" for things in the environment to leak into the build. Timestamps are a common bane of reproducible builds. I understand that a non-trivial amount of reproducible build work consists of hunting down every timestamp in a build, and then finding a way to inject alternate timestamps into them.
We could say that a system can be made reproducible relatively easily if you're starting from scratch - certainly all of the tooling is there. The far more significant challenge is retrofitting reproducibility into existing large projects.
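The usual way to "inject alternate timestamps" is the SOURCE_DATE_EPOCH convention from reproducible-builds.org: the build honours a caller-supplied epoch instead of stamping artifacts with the wall clock. A minimal sketch of the pattern (the stamped file name is made up):

```python
# Honour SOURCE_DATE_EPOCH instead of stamping artifacts with the build
# machine's wall clock; "build_info.txt" is a made-up artifact name.
import os
import time

def build_timestamp() -> int:
    """Caller-supplied SOURCE_DATE_EPOCH if set, otherwise the current time."""
    return int(os.environ.get("SOURCE_DATE_EPOCH", time.time()))

# Two builds run with the same SOURCE_DATE_EPOCH now write identical bytes here.
with open("build_info.txt", "w") as f:
    f.write("built-at: %d\n" % build_timestamp())
```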
First of all, every filesystem has timestamps and they've got to be set to some value.
Second of all, it's useful to have some sort of version marker that lets you tell at a glance if two copies of the software are the same, or what code was running when that log file was created. Build timestamps are a classic way of doing this in a version-control-system-independent way, without advancing the 'headline' version number for every developer's test builds.
And of course, if you include things like debug symbols it's useful to be able to tell if the source files on disk have changed since the build took place, as the line numbers could be all off.
Alternatives include using a git commit hash; having a separate build pipeline for public release versions then not worrying if developers' builds aren't reproducible; and just being careful with developers' test builds.
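For the git-commit-hash alternative, a minimal sketch (the generated version file is a made-up example):

```python
# Embed the git commit instead of a build timestamp: the same source always
# produces the same marker, so builds stay reproducible but identifiable.
# "version.py" is a made-up name for the generated file.
import subprocess

def version_marker() -> str:
    commit = subprocess.run(
        ["git", "rev-parse", "--short", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    dirty = subprocess.run(
        ["git", "status", "--porcelain"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return commit + ("-dirty" if dirty else "")

with open("version.py", "w") as f:
    f.write('VERSION_MARKER = "%s"\n' % version_marker())
```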
There’s a long list of tools that follow a general pattern: they take an input file and produce an output file, but the output file doesn’t just depend on the contents of the input file; it also embeds its modification time somewhere. Sometimes the goal is to be able to automatically regenerate the output if the input changes, by comparing timestamps. Sometimes it’s just intended as general-purpose metadata. Debian’s reproducible builds wiki page has many examples:
Imagine part of your build generates a bunch of different files, which are then packaged into an archive file, or into your installer bundle (or whatever). If your archive preserves file timestamps, then you end up with different archive files.
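For instance, with Python's tarfile you can normalize the metadata that would otherwise leak into the archive. This is only a sketch, with a made-up directory name:

```python
# Build a tar whose bytes depend only on file names and contents: sort the
# member list and clamp per-file metadata.  "build_output" is a made-up
# directory; note that gzipping the result would reintroduce a header
# timestamp unless that is clamped too.
import os
import tarfile

SOURCE_DATE_EPOCH = int(os.environ.get("SOURCE_DATE_EPOCH", "0"))

def normalize(info: tarfile.TarInfo) -> tarfile.TarInfo:
    info.mtime = SOURCE_DATE_EPOCH   # drop the real modification time
    info.uid = info.gid = 0          # drop builder-specific ownership
    info.uname = info.gname = ""
    return info

with tarfile.open("artifact.tar", "w") as tar:
    for name in sorted(os.listdir("build_output")):   # deterministic member order
        tar.add(os.path.join("build_output", name), arcname=name, filter=normalize)
```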
> Why are people using timestamps in builds and what is the better alternative?
Timestamps are often an implicit part of the build, which leads to the same code producing 'different' artifacts (e.g. file created timestamp, generated metadata about the build stored in the artifact itself). One alternative is to support explicitly defining the 'build timestamp' so that all builds with the same timestamp produce the same artifact. This is what Maven does[0]; I'd imagine it's much harder for build systems that download and recompile dependencies.
Some people want to know when something was built. A better option is to use the Git commit of the build, or even the last-modified time of the file it's being embedded into. Alternatively, allow an env var to override what date/time the software thinks it is.
By default, many compilers include things like local filesystem paths, build server hostnames, or build timestamps into their binary artifacts. These will obviously differ build-to-build.
Even without that, it's possible to accidentally leak entropy into the build output. For example, readdir() doesn't guarantee any kind of ordering, so without sorting the list of files it is possible for a binary artifact (or even tar) to produce different output from the same input.
Compilers are huge pieces of code, modified by lots and lots of hands, and until recently nobody paid much attention to whether the results were identical or not.
Your question is kinda like "if I write a few hundred lines of code, and run it, it should run, right?". The answer is yes, that's the expectation, but there is no chance that it is actually fulfilled in practice.
As an example, when Debian's reproducible builds project started, gcc had extra code to tag the executables with the time of compilation. Other examples from that effort:
* Perl sorting issues in dictionaries-common
* libxmlb used a pointer address (%p) for a hash value
* texlive-base: reported differences in the generated ls-R
In addition to the various issues already mentioned, there are also build systems that build in parallel and produce artefacts that depend on the order in which threads finish, which may vary from run to run.
I'm not saying you cannot build in parallel and end up with a deterministic build: that is done all the time. What I'm saying is that there are cases where some care has to be taken.
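A toy illustration of the kind of care needed, using hypothetical compilation units: the work happens in parallel either way, but only one of the two collection strategies lets thread completion order decide the layout of the output.

```python
# The work happens in parallel either way; the difference is whether thread
# completion order is allowed to decide the layout of the output.
from concurrent.futures import ThreadPoolExecutor, as_completed

inputs = ["a.c", "b.c", "c.c", "d.c"]            # hypothetical compilation units

def compile_unit(name: str) -> str:
    return name.replace(".c", ".o")              # stand-in for real work

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(compile_unit, n): i for i, n in enumerate(inputs)}

    # Nondeterministic: "link order" follows whichever worker finished first.
    racy_order = [f.result() for f in as_completed(futures)]

    # Deterministic: same parallelism, but each result goes to its input's slot.
    stable_order = [None] * len(inputs)
    for fut, idx in futures.items():
        stable_order[idx] = fut.result()

print(racy_order, stable_order)
```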
The build environment seems to sneak into the artifacts in many ways. Check out some of the listings from the "Upstream Patches" section of the linked page. Some of the most common causes seem to be: timestamps, non-deterministic file listings, debug information containing build path data, permission bits on artifact files, etc.
Thanks for sharing, did not know this project existed.
I am maintaining a small embedded Linux project myself, and having a reproducible build as a feature is a long-standing goal. The more tooling and awareness there is, the better.
It's a pity that Linux itself still cannot be bootstrapped without a Linux environment - at least no one has done it yet. Anyway, thanks to these reproducible-builds efforts, maybe awareness of this serious problem will grow and we will see more support for projects like https://github.com/fosslinux/live-bootstrap and https://niedzejkob.p4.team/bootstrap/
I admit I've never tried this, but the Linux kernel seems like just another C program. What's to stop me from installing GNU Make, GNU coreutils, gcc, and binutils cross-compiled to target x86_64-linux-elf and bootstrapping it that way? It doesn't sound especially useful, but it does sound like something that would be possible if you focused on it.