The hard parts are things like unstable hash orderings, unsorted filesystem listings, parallel execution, address-space randomization, ...
Since the build is reproducible, it should not matter when it was built. If you want to trace a build back to its source, there are much better ways than a timestamp.
Therefore, I side with Tavis Ormandy on this debate: https://web.archive.org/web/20210616083816/https://blog.cmpx...
i.e. supply-chain safety
It doesn't entirely resolve Thompson's "Trusting Trust" problem, but it goes a long way.
That benefit goes away if the rest of the community all have hashes that don't agree with each other. Then the tampered-with one doesn't stand out.
Every bit of nondeterminism in your binaries, even if it's just memory layout alone, might alter the behavior, i.e. break things on some builds, which is just really not desirable.
Why would you ever want builds from the same source to have potentially different performance, different output size or otherwise different behavior?
IMO tivoization is completely unrelated, because the vendor most certainly does not need reproducible builds in order to lock down a platform.
Then I can be sure that I only make the changes I intend to do when building upon this state (instead of, for example, "fixing" something by accident because the link order of something changed which changed the memory layout which hides a bug).
Annoying edge cases come up for things like internal object serialization, where you have to sort things like JSON keys in config files.
Also, is someone else also compiling these images, so we have evidence that the Debian compiling servers were not compromised?
I think there's also a similar thing for the images, but I might be wrong and I definitely don't have the link handy at the moment.
There's lots of documentation about all of the things on Debian's site at the links in the brief. And LWN also had a story last year about Holger Levsen's talk on the topic from DebConf: https://lwn.net/Articles/985739/
There's lots of info on the Debian site about their reproducibility efforts, and there's a story from 2024's DebConf that may be of interest: https://lwn.net/Articles/985739/
Thing is, inputs can be nondeterministic too - some programs (used to) embed the current git commit hash into the final binary, so that `./foo --version` gives a quick and easy way for bug triage to check that the user isn't running a version from years ago.
I never actually checked that.
(Just playing devil’s advocate here.)
It doesn't really do anything at all for tivoisation; Tivo managed it just fine without reproducible builds.
Like, you spend millions to get that one backdoor into the compiler. And then this guy is like "Uhm. Guys. I have this largely Perl-based build process reproducing a modern GCC on a 166 MHz Pentium, swapping RAM to disk because the motherboard can't hold that much memory. But the desk fan helps with cooling. It takes about 2 or 3 months to build, but that's fine. I start it and then I work in the woods. It was identical to your releases about 5 times in the last 2 years (can't build more often), and now it isn't: something differs somewhere deep in the code sections. My Arduino-based floppy emulator is currently moving the binaries through the network"
Sure, it's a cyberpunk hero-fantasy, but deterministic builds would make this kind of shenanigans possible.
And at the end of the day, independent validation is one of the strongest ways to fight corruption.
Yes. All archive entries, date source-code macros, and any other timestamps are set to a standardized date (in the past).
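For reference, a minimal sketch of the usual mechanism, assuming the SOURCE_DATE_EPOCH convention from reproducible-builds.org (Debian derives the value from the package changelog; here the last git commit time stands in):

    # Export a fixed timestamp; GCC substitutes it for __DATE__/__TIME__,
    # and other tools can be pointed at it explicitly.
    export SOURCE_DATE_EPOCH="$(git log -1 --pretty=%ct)"

    # GNU tar can sort entries and clamp file metadata to that timestamp,
    # so the archive bytes don't depend on the build machine.
    tar --sort=name --mtime="@${SOURCE_DATE_EPOCH}" \
        --owner=0 --group=0 --numeric-owner \
        -cf out.tar build/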
You must ultimately root trust in some set of binaries and any hardware that you use.
This page has a good write-up
Every country in the world should have the capability of producing "good enough" hardware.
It's not about wanting non-reproducible builds, but about what I'm sacrificing to achieve reproducible builds. Debian's reproducible build efforts have been going for ten years, and they're still not complete. Arguably Debian could have diverted ten years of engineering resources elsewhere. There's no end to the list of worthwhile projects to tackle, and clearly Debian believes that reproducible builds are a high priority, but reasonable people can disagree on that.
This is not to say reproducible builds are not worth doing, just that depending on your project / org lifecycle and available resources (plus a lot of subjective judgement), you may want to do something else first.
One is where the upstream software developer wants to build and sign their software so that users know it came from them, but distributors also want to be the ones to build and sign the software so they know exactly what it is they are distributing. The most public example is F-Droid[1]. Reproducible builds allow both the software developer and the distributor to sign off on a single binary, giving users additional assurance that neither is sneaking something in. This is similar to the last example that Tavis gave, but shows that it is a workable process that provides real security benefit to the user, not just a hypothetical stretch.
The second is license enforcement. Companies that distribute (A/L)GPL software are required to distribute the exact source code that the binary was created from, and the ability to compile and replace the software with a modified version (for GPLv3). However, a lot of companies are lazy about this and publish source code that doesn't include all their changes. A reproducible build demonstrates that the source they provided is what was used to create the binary. Of course, the lazy ones aren't going to go out of their way to create reproducible builds, but the more reproducible the upstream build system is, the fewer extraneous differences downstream builds should have. And it allows greater confidence in the good guys who are following the license.
And like others have said, I don't see the Tivoization argument at all. TiVo didn't have reproducible builds, and they Tivo'd their software just fine. At worst a reproducible build might pacify some security minded folks that would otherwise object to Tivoization, but there will still be people who object to it out of the desire to modify the system.
The liability is EFI underneath that, and the Intel ring -1 stuff (which we should be mandating is open source).
A reproducible build means you can get the same source code and compile it, and it will be identical to the published image. This is important because otherwise you don't know if the published image actually used some other source code. If it used some other source code, the published image might have a backdoor, or something that you can't find by reading the source code.
Ages ago our device firmware release processes caught the early stage of a malware infection because the hash of one of our intermediate code generators (win32 exe) changed between two adjacent releases without any commits that should've impacted that tool.
Turns out they had hooked something into Windows to monitor for exe accesses and were accidentally patching our codegen.
Eventually you just stop trusting anything and live in the woods, I guess.
(This is an alternative to the Guix/Scheme thing).
I work on a product whose user interface in one place says something like “Copyright 2004-2025”. The second year there is generated from __DATE__, that way nobody has to do anything to keep it up to date.
Normal: Before Debian's initiative to handle this problem, most people didn't think hard about all the ways system-specific differences might wind up in binaries. For example: __DATE__ and __TIME__ macros in C, parallel builds finishing in different order, anything that produces a tar file (or zip etc.) usually by default asks the OS for the input files' modification time and puts that into the bytes of the tar file, filesystems may list files in a directory in different order and this may also get preserved in tar/zip files or other places...
Why it's important: With reproducible builds, anyone can check the official binaries of Debian match the source code. This means going forward, any bad actors who want to sneak backdoors or other malware into Debian will have to find a way to put it in the source code, where it will be easier for people to spot.
Pipe something like this into your build system:
date --date "$(git log HEAD --author-date-order --pretty=format:"%ad" --date=iso | head -n1)" +"%Y"
By far the most prevalent source of nondeterminism is timestamps, especially since timestamps crop up in file formats you don't expect (e.g., running gzip stuffs a timestamp in its output for who knows what reason). After that, it's the two big filesystem issues (absolute paths and directory iteration nondeterminism), and then it's basically a long tail of individual issues that affect but one or two packages.
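As a small illustration of the gzip case, assuming GNU gzip (the file name here is just an example):

    # -n (--no-name) omits the original file name and timestamp from the
    # gzip header, so the compressed bytes depend only on the input.
    gzip -n -9 < build.log > build.log.gz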
- At the start of the chain, developers write software they claim is secure. But very few people trust the word of just one developer.
- Over time other developers look at the code and also pronounce it secure. Once enough independent developers from different countries and backgrounds do this, people start to believe it really is secure. As a measure of security this isn't perfect, but it is verifiable and measurable in the sense that more is always better, so if you set the bar very high you can be very confident.
- Somebody takes that code, goes through a complex process to produce a binary, releases it, and pronounces it is secure because it is only based on code that you trust, because of the process above. You should not believe this. That somebody could have introduced malicious code and you would never know.
- Therefore before reproducible builds, your only way to get a binary you knew was built from code you had some level of trust in was to build it yourself. But most people can't do that, so they have to trust Debian, Google, Apple, Microsoft or whoever that no backdoors have been added. Maybe people do place their faith in those companies, but it is misplaced. It's misplaced because countries like Australia have laws that allow them to compel such companies to silently introduce malicious code and distribute it to you. Australia's law is called the "Assistance and Access Bill (2018)". Countries don't introduce such laws for no reason. It's almost certain it is being used now.
- But now the build can be reproducible. That means many developers can obtain the same trusted source code from the source the original builder claimed he used, build the binary themselves, verify it is identical to the original, and so publicly validate the claim. Once enough independent developers from different countries and backgrounds do this, people start to believe it really was built from the trusted sources.
- Ergo reproducible builds allow everyone, as opposed to just software developers, to run binaries they can be very confident were built just from code that has some measurable and verifiable level of trustworthiness.
It's a remarkable achievement for other reasons too. Although the ideas behind reproducible builds are very simple, it turned out executing them was about as simple as other straightforward ideas like "let's put a man on the moon". It seems building something as complex as an entire OS reproducibly was beyond any company, or capitalism/socialism/communism, or a country. It's the product of something we've only seen arise in the last 40 years, open source, and it's been built by a bunch of idealistic volunteers who weren't paid to do it. To wit: it wasn't done by commercial organisations like RedHat or Ubuntu. It was done by Debian. That said, other similar efforts have since arisen, like F-Droid, but they aren't on this scale.
ASLR by itself shouldn't cause reproducibility issues, but it can certainly expose bugs.
Also we have had live images of various OSes for many decades. I seem to recall that we used to load DOS from floppy disks.
The least difficult to solve for reproducible builds, but yes.
The real question is: why, in the past, was an entire ecosystem created where non-determinism was the norm and everybody thought it was somehow ok?
Instead of asking "how does one achieve reproducibility?", we may wonder "why did people go out of their way to make sure something as simple as a timestamp would screw determinism?".
For that's the anti-security mindset we have to fight. And Debian did.
"Fully Countering Trusting Trust through Diverse Double-Compiling (DDC) - Countering Trojan Horse attacks on Compilers"
https://dwheeler.com/trusting-trust/
If the build is reproducible inside VMs, then the build can be done on different architectures: say x86 and ARM. If we end up with the same live image, then we're talking something entirely different altogether: either both x86 and ARM are backdoored the same way, or the attack is in software. Or there's no backdoor (which is a possibility we have to entertain too).
Sticking it into --version output is helpful to know if, for example, the Python binary you're looking at is actually the one you just built rather than something shadowing it.
I hope they promote tools to enable easy verification on systems external to debian build machines.
Now that the build is reproducible, you don't need to trust your distro alone. It's always exactly the same binary, which means it'll have one correct sha256sum. You can have 10 other trusted entities build the same binary with the same code and publish a signature of that sha256sum, confirming they got the same thing. You can check all ten of those. The likelihood that 10 different entities are colluding to lie to you is a lot lower than just your distro lying to you.
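A rough sketch of what that check looks like in practice (the file names here are illustrative, not Debian's actual artifact names):

    # Hash the image you downloaded...
    sha256sum example-live-image.iso

    # ...and compare it against checksum lists published (and signed) by
    # several independent rebuilders.
    sha256sum -c SHA256SUMS --ignore-missing
    gpg --verify SHA256SUMS.sign SHA256SUMS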
Sometimes programs have hash tables which use object identity as key (i.e. pointer).
ASLR can cause corresponding objects in different runs of the program to have different pointers, and be ordered differently in an identity hash table.
A program producing some output which depends on this is not necessarily a bug, but becomes a reproducibility issue.
E.g. a compiler might output some object in which a symbol table is ordered by a pointer hash. The difference in order doesn't change the meaning/validity of the object file, but it is seen as the build not having reproduced exactly.
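A quick way to see the underlying effect on a typical Linux box with ASLR enabled (each grep is a fresh process, so its stack lands somewhere different every run):

    for i in 1 2 3; do
        grep '\[stack\]' /proc/self/maps   # different address range each time
    done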
I’ve said it many times and I’ll repeat it here - Debian will be one of the few Linux distros we have right now, that will still exist 100 years from now.
Yea, it’s not as modern in terms of versioning and risk compared to the likes of Arch, but that’s also a feature!
Software was more artisanal in nature…
It'd certainly be nice, but if you've ever seen an organisation unravel it can happen with startling speed. I think the naive estimate is if you pick something at random it is half-way through its lifespan; so there isn't much call yet to say Debian will make it to 100.
I don't think traditional von Neumann architecture will even be around in 100 years, as energy demands drive more efficient designs for different classes of problems. =3
I’m not, but I really think I should be. As in, there should be a thing that saves the state of the tree every time I type `make`, without any thought on my part.
This is (assuming Git—or Mercurial, or another feature-equivalent VCS) not hard in theory: just take your tree’s current state and put it somewhere, like in a merge commit to refs/compiles/master if you’re on refs/heads/master, or in the reflog for a special “stash”-like “compiles” ref, or whatever you like.
The reason I’m not doing it already is that, as far as I can tell, Git makes it stupendously hard to take a dirty working tree and index, do some Git to them (as opposed to a second worktree using the same gitdir), then put things back exactly as they were. I mean, that’s what `git stash` is supposed to do, right?.. Except if you don’t have anything staged then (sometimes?..) after `git stash pop` everything goes staged; and if you’ve added new files with `git add -N` then `git stash` will either refuse to work, or succeed but in such a way that a later `git stash pop` will not mark these files staged (or that might be the behaviour for plain `git add` on new files?). Gods help you if you have dirty submodules, or a merge conflict you’ve fixed but forgot to actually commit.
My point is, this sounds like a problem somebody’s bound to have solved by now. Does anyone have any pointers? As things are now, I take a look at it every so often, then remember or rediscover the abovementioned awfulness and give up. (Similarly for making precommit hooks run against the correct tree state when not all changes are being committed.)
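One possible approach, offered as an untested sketch rather than a known-good solution: `git stash create` builds the stash commit without touching the working tree or index, and you can park the result under your own ref. (It ignores untracked files, so it only partially addresses the `git add -N` pain.)

    # Wrap `make`: snapshot the current tree state, then build as usual.
    snapshot=$(git stash create "state before make on $(date -u +%FT%TZ)")
    if [ -n "$snapshot" ]; then
        git update-ref "refs/compiles/$(git symbolic-ref --short HEAD)" "$snapshot"
    fi
    make "$@"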
E.g.
git commit -am "Starting work on this important feature"
# make some changes
git add . && git commit --squash=HEAD -m "I made a change"
Then once you're all done, you can do an interactive rebase with --autosquash and combine them all into your original feature commit. You can also use `git reset --soft $BRANCH_OR_COMMITTISH` to go back to an earlier commit but leave all changes (except maybe new files? Sigh) staged.
You also might check out `git reflog` to find commits you might’ve orphaned.
That's just not enough. If you are hunting down tricky bugs, then even extremely minor things like the memory layout of your application might alter the behavior completely-- some uninitialized read might give you "0" every time in one build, while crashing everything with unexpected non-zero values in another; performance characteristics might change wildly and even trigger (or avoid) race conditions in builds from the exact same source thanks to cache interactions, etc.
There is a lot of developer preference in what an "ideal" process/toolchain/build environment looks like, but reproducible builds (unlike a lot of things that come down to preference) are an objective, qualitative improvement-- in the exact same way that it is an improvement if every release of your software corresponds to one exact set of source code.
Then that's a good reason not to use Debian indeed. Whatever the distro you choose, you give your trust to its maintainers.
But that's also a feature: instead of trusting random code from the Internet, you can trust random code from the Internet that has been vetted by a group of maintainers you chose to trust. Which is a bit less random, I think?
This doesn't strike me as a strong argument. That naive estimate (in whatever form[0]) is typically based on not knowing anything else about the process you're looking at. We have lots of information about Debian and similar projects, and you can update your estimate (in a Bayesian fashion) when you know this. Given that Ian Murdock started Debian 31 years ago I think more than 100 years is a very reasonable guess.
[0] see e.g. https://en.wikipedia.org/wiki/Lindy_effect
If you ever want a laugh, one should read what Canonical puts the kids through for the role. One could get a job flying a plane with less paperwork...
Authenticated signed packaging is often a slow process, and some people do prefer rapid out-of-band pip/npm/cargo/go until something goes sideways... and no one knows who was responsible (or which machine/user is compromised.)
Not really random, but understandably slow given that the task of reaching a "stable" OS release involves hundreds of projects... =3
Look, I get people use the tools they use, and Perl is fine, I guess, it does its job, but if you use it you can safely expect to be mocked for prioritizing string operations or whatever Perl offers over writing code anyone born after 1980 can read, let alone is willing to modify.
For such a social enterprise, open source orgs can be surprisingly daft when it comes to the social side of tool selection.
Would this tool be harder to write in Python? Probably. Is it a smart idea to use it regardless? Absolutely. The aesthetics of Perl are an absolute dumpster fire. Larry Wall deserves persecution for his crimes.
What I actually did at $LAST_JOB for dev tooling was to build in <commit sha> + <git diff | sha256> which is probably not amazingly reproducible, but at least you can ask "is the code I have right now what's running" which is all I needed.
Finally, there is probably enough flexibility in most build systems to pick between "reuse a cache artifact even if it has the wrong stamping metadata", "don't add any real information", and "spend an extra 45 cpu minutes on each build because I want $time baked into a module included by every other source file". I have successfully done all 3 with Bazel, for example.
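For what it's worth, the stamp described above can be computed with nothing but git and coreutils; this is a hedged reconstruction, not the original tooling:

    # "<commit sha> + <hash of uncommitted changes>" as a build stamp.
    commit=$(git rev-parse --short HEAD)
    dirty=$(git diff HEAD | sha256sum | cut -c1-12)
    echo "build-stamp: ${commit}+${dirty}"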
In a way, Flatpak/Snap/Docker were mitigations to support old programs on new systems, and old systems with updated software no longer compatible with the OS. Not an ideal solution, but a necessary one if folks also wanted to address the Windows/exe-style dominance of long-term-supported program versions.
If working with unpopular oddball stuff one notices the packages cycle out of the repositories rather regularly. =3
At my last job, some team spent forever making our software build in a special federal government build cluster for federal government customers. (Apparently a requirement for everything now? I didn't go to those meetings.) They couldn't just pull our Docker images from Docker Hub; the container had to be assembled on their infrastructure. Meanwhile, our builds were reproducible and required no external dependencies other than Bazel, so you could git checkout our release branch, "bazel build //oci" and verify that the sha256 of the containers is identical to what's on Docker Hub. No special infrastructure necessary. It even works across architectures and platforms, so while our CI machines were linux / x86_64, you can build on your darwin / aarch64 laptop and get the exact same bytes, every time.
In a world where everything is reproducible, you don't need special computers to do secure builds. You can just build on a bunch of normal computers and verify that they all generate the same bytes. That's neat!
(I'll also note that the government's requirements made no sense. The way the build ended up working was that our CI system built the binaries, and then the binaries were sent to the special cluster, and there a special Dockerfile assembled the binaries into the image that the customers would use. As far as I can tell, this offers no guarantee that the code we said was in the image was in the image, but it checked their checkbox. I don't see that stuff getting any better over the next 4 years, so...)
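(For anyone curious, the verification flow described above looks roughly like this; the branch name and output path are guesses, only the `//oci` target comes from the comment.)

    git checkout release-branch
    bazel build //oci
    # Compare the digest of the built image tarball with the one published
    # on Docker Hub (the exact output path depends on the rule used).
    sha256sum bazel-bin/oci/image.tar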
The trick will be in the details, as usual. User data that both does useful work... and plays nicely with immutability.
I suspect it would be more sensible to skip the gymnastics of trying to manicure something inherently resistant, and instead, lean in on reproducibility. Make it as you want it, skip the extra work.
Want another? Great - they're freely reproducible :)
[0] https://gcc.gnu.org/onlinedocs/gcc/Developer-Options.html#in...
The best they can do is to follow the developer's instructions to build a binary artefact and upload it somewhere. Maybe codify those instructions into a (hopefully) repeatable script like a PKGBUILD.
As long as Debian provides source packages and access to their repos, that's enough - a digital signature has nothing to do with reproducible builds; you don't actually need one to get the same bytes.
Then at some point you happen to need all the entries, you iterate, and you get a random order. Which is not necessarily a problem unless you want reproducible builds, which is just a new requirement, not exposing a latent bug.
The Lindy effect says nothing about popularity, which is how I translate your use of 'dethrone' here. It observes that something's duration of existence correlates with its chances for existence in the future.
I'll happily agree higher degrees of "freedom" are an admirable goal, but this is just rudely shitting on a hard-earned achievement.
I don't understand; isn't this exactly what maintainers do? They write a recipe (be it a PKGBUILD or something else) that builds (maybe after applying a few patches) a package that they then distribute.
Whether you use Arch or Debian, you trust that the maintainers don't inject malware into the binaries they ship. And you trust that the maintainers trust the packages they distribute. Most likely you don't personally check the PKGBUILD and the upstream project.
Yet it's lasted 31 years, which is a pretty insane amount of time in tech. That's on top of being kept up to date, with good structure, really good contributions and advancements.
On the other hand, look at CentOS, Red Hat, and Oracle and their debacle. How much did they fragment that area?
And then we have Debian just chugging along.
Perl always gets hate on HN, but I actually wonder, of those commenters, who has actually spent more than a single hour using Perl after they've read the Camel book.
Honest opinion: if you're going to be spending time in Linux in your career, then you should read the Camel book at least once. Then and only then should you get to have an opinion on Perl!
Here's one of the recent examples: https://www.reddit.com/r/debian/comments/1cv30gu/debian_keep...
And that's applied to a lot of packages. Sometimes it leads to frustrated users who come directly to frustrated developers who have no idea what they're talking about, because the developers did not intend the software to be patched and built this way. Sometimes this leads straight to vulnerabilities. Sometimes this leads to unstable software, for example when a maintainer "knows better" which libraries the software should link to.
Imagine the filtering required for potential maintainers if they rewrote the packaging to JS.
They used an official build option to not ship a feature by default, and have another package that does enable all features. If that's your best example of
> Debian adds so much of their patches on top of original software, that the result is hardly resembles the original.
then I'm inclined to conclude that Debian is way more vanilla than I thought.
If an application requires a 3 page BS explanation about how to use a footgun without self-inflicted pwning... it seems like bad design for a posix environment.
People that attempt an escalation of coercion with admins usually get a ban at minimum. Deception, threats, and abuse will not help in most cases if the maintainer is properly trained.
https://www.youtube.com/watch?v=lITBGjNEp08
Have a nice day, =3
> The sections below are currently a historical reference covering FreeBSD's migration from CVS to Subversion.
My apologies! At the end of the day, the point still stands in that SVN isn't a DVCS and so you wouldn't want to be committing unfinished code though, correct?
(I suspect I got FreeBSD mixed up with OpenBSD in my head here, embarrassing.)
Sorting had to be added to that kind of output.
So how do those work in these Debian reproducible builds? Do they outlaw those directives? Or do they set those based on something other than the current date and time? Or something else?
Any Canadian residents here know of any tax credit eligible software projects to donate to?
And yes agree, people should read the camel book!
Worth noting lest I give the wrong impression, I don't think security is a reason to avoid Debian. For me the hacked up kernels and old packages have been much more the pain points, though I mostly stopped doing that work a few years ago. As a regular user (unless you're compiling lots of software yourself) it's a non-issue
https://www.canada.ca/content/dam/cra-arc/formspubs/pub/p113...
Best regards, =3
Well yeah, but you choose the maintainers that do it the way you prefer. In your case you say you like Arch better, because they "patch less" (if I understand your feeling).
Still they do exactly what you describe they should do: write a recipe, build it and ship a binary. You can even go with Gentoo if you want to build (and possibly patch) yourself, which I personally like.
> Here's one of the recent examples: [...]
Doesn't seem like it supports your point: the very first comment on that Reddit thread explains what they did: they split one package into two packages. Again, if you're not happy with the way the Debian maintainers do it, you can go with another distro. Doesn't change the fact that if you use a distro (as opposed to building your own from scratch), then you rely on maintainers.
Once an OS is no longer actively supported, it will begin to accumulate known problems if the attack surface is large.
Thus, a legacy complex-monolith or Desktop host often rots quicker than a bag of avocados. =3
If that process involves Docker or Nix or whatever - that's fine. The point is that there is some robust way of transforming the source code to the artifact reproducibly. (The less moving parts are involved in this process though the better, just as a matter of practicality. Locking up the original build machine in a bank vault and having to use it to reproduce the binary is a bit inconvenient.)
The point here is that there is a way for me to get to a "known good" starting point and that I can be 100% confident that it is good. Having a bit-reproducible process is the no-further-doubts-possible way of achieving that.
Sure it is possible that I still get an artifact that is equivalent in all the ways that I care about if I run the build in the exact same Docker container even if the binaries don't match (because for example some build step embeds a timestamp somewhere). But at that point I'll have to start investigating if the cause of the difference is innocuous or if there are problems.
Equivalence can only happen in one way, but there's an infinite number of ways to get inequivalence.
Yes of course, I would not write any type of server in Perl; I would pick Go or Elixir or Erlang for such a use case.