764 points bertman | 88 comments
1. imcritic ◴[] No.43484638[source]
I don't get how someone achieves reproducibility of builds: what about file metadata like creation/modification timestamps? Do they forge them? Or is this data treated as not important enough (as in, 2 files with different metadata but identical contents should have the same checksum when hashed)?
replies(10): >>43484658 #>>43484661 #>>43484682 #>>43484689 #>>43484705 #>>43484760 #>>43485346 #>>43485379 #>>43486079 #>>43488794 #
2. c0l0 ◴[] No.43484658[source]
Yes.
3. HideousKojima ◴[] No.43484661[source]
Those aren't needed to generate a hash of a file. And that metadata isn't part of the file itself (or at least doesn't need to be), it's part of the filesystem or OS
replies(1): >>43484687 #
4. o11c ◴[] No.43484682[source]
Timestamps are the easiest part - you just set everything according to the chosen epoch.

The hard things involve things like unstable hash orderings, non-sorted filesystem listing, parallel execution, address-space randomization, ...
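
The "non-sorted filesystem listing" problem can be sketched in a few lines. This is a minimal illustration, not taken from any tool mentioned in the thread: `tree_digest` and `make_tree` are hypothetical names, and the choice of SHA-256 with NUL separators is an assumption made for the example. The point is only that directory entries are sorted explicitly instead of being consumed in whatever order the filesystem returns them.

```python
import hashlib
import os
import tempfile


def tree_digest(root: str) -> str:
    """Hash the relative path and contents of every file under root, in sorted order."""
    digest = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        # Sorting dirnames in place controls the order in which os.walk descends.
        dirnames.sort()
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            digest.update(rel.encode("utf-8") + b"\0")
            with open(path, "rb") as handle:
                digest.update(handle.read())
            digest.update(b"\0")
    return digest.hexdigest()


def make_tree(names):
    """Hypothetical helper: create a temporary directory holding the given files."""
    root = tempfile.mkdtemp()
    for name in names:
        with open(os.path.join(root, name), "w") as handle:
            handle.write("contents of " + name)
    return root
```

Two trees created in a different order, e.g. `make_tree(["b.txt", "a.txt"])` and `make_tree(["a.txt", "b.txt"])`, should produce the same digest.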

replies(1): >>43485157 #
5. imcritic ◴[] No.43484687[source]
That's an acceptable answer for the simple case when you distribute just a file, but what if your distribution is something more complex, like an archive with some sub-archives? Metadata in the internal files will affect the checksum of the resulting archive.
replies(2): >>43484995 #>>43490787 #
6. purkka ◴[] No.43484689[source]
Generally, yes: https://reproducible-builds.org/docs/timestamps/

Since the build is reproducible, it should not matter when it was built. If you want to trace a build back to its source, there are much better ways than a timestamp.

replies(1): >>43485622 #
7. ◴[] No.43484705[source]
8. ◴[] No.43484760[source]
9. exe34 ◴[] No.43484995{3}[source]
unless you fix them to a known epoch.
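
As a rough sketch of what "fixing them to a known epoch" can look like for nested archives: the code below packs an in-memory mapping into a tar file with sorted entries and normalized metadata. `FIXED_EPOCH`, `normalize` and `build_tar` are illustrative names, and the epoch value of 0 is an assumption; a real build would pick its own reference time.

```python
import io
import tarfile

FIXED_EPOCH = 0  # assumed reference time for this sketch


def normalize(info: tarfile.TarInfo) -> tarfile.TarInfo:
    """Overwrite the metadata fields that vary between machines and runs."""
    info.mtime = FIXED_EPOCH
    info.uid = info.gid = 0
    info.uname = info.gname = ""
    return info


def build_tar(files: dict) -> bytes:
    """Pack {name: bytes} into an uncompressed tar with sorted, normalized entries."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w", format=tarfile.USTAR_FORMAT) as archive:
        for name in sorted(files):
            info = normalize(tarfile.TarInfo(name))
            info.size = len(files[name])
            archive.addfile(info, io.BytesIO(files[name]))
    return buffer.getvalue()
```

Because the inner archive is itself built this way, an outer archive that contains it, e.g. `build_tar({"inner.tar": build_tar({...})})`, comes out byte-identical across runs as well.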
10. koolba ◴[] No.43485157[source]
ASLR shouldn’t be an issue unless you intend to capture the entire memory state of the application. It’s an intermediate representation in memory, not an output of any given step of a build.

Annoying edge cases come up for things like internal object serialization, where you have to sort things like JSON keys in config files.
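
For the JSON case, a minimal sketch (the function name `canonical_json` is made up for this example) is to serialize with sorted keys and fixed separators, so that the output does not depend on the order in which keys were inserted:

```python
import json


def canonical_json(value) -> str:
    """Serialize with sorted keys and fixed separators for byte-stable output."""
    return json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
```

With this, `canonical_json({"b": 1, "a": 2})` and `canonical_json({"a": 2, "b": 1})` give the same string.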

replies(3): >>43485872 #>>43488447 #>>43489756 #
11. jzb ◴[] No.43485346[source]
Debian uses a tool called `strip-nondeterminism` to help with this in part: https://salsa.debian.org/reproducible-builds/strip-nondeterm...

There's lots of info on the Debian site about their reproducibility efforts, and there's a story from 2024's DebConf that may be of interest: https://lwn.net/Articles/985739/

replies(1): >>43489144 #
12. echoangle ◴[] No.43485379[source]
Maybe dumb question but why would this change the reproducibility? If you clone a git repo, do you not get the meta data as it is stored in git? Or would the files have the modification date of the cloning?

I never actually checked that.

replies(1): >>43485393 #
13. mathfailure ◴[] No.43485393[source]
You clone the source from git, but then you use it to build some artifacts. The artifacts' build time may differ, yet with reproducible builds the artifacts should match.
replies(1): >>43485852 #
14. ryandrake ◴[] No.43485622[source]
C compilers offer __DATE__ and __TIME__ macros, which expand to string constants that describe the date and time that the preprocessor was invoked. Any code using these would have different strings each time it was built, and would need to be modified. I can't think of a good reason for them to be used in an actual production program, but for whatever reason, they exist.
replies(4): >>43485670 #>>43486042 #>>43487552 #>>43488994 #
15. fmbb ◴[] No.43485670{3}[source]
Toolchains for reproducible software likely let you set these values, or ensure they are 1970-01-01 00:00:00
replies(2): >>43486128 #>>43492261 #
16. echoangle ◴[] No.43485852{3}[source]
Right, but if you only clone and build, why would the files' modification dates be different compared to the version that was committed to git? Does just cloning a repo already lead to different file modification dates in my local copy?
replies(1): >>43485956 #
17. sodality2 ◴[] No.43485872{3}[source]
Let’s say a compiler is doing something in a multi-threaded manner - isn’t it possible that ASLR would affect the ordering of certain events which could change the compiled output? Sure you could just set threads to 1 but there’s probably some more edge cases in there I haven’t thought of.
replies(1): >>43486161 #
18. hoten ◴[] No.43485956{4}[source]
Git does not store or restore file modification times.
replies(2): >>43486058 #>>43486400 #
19. mananaysiempre ◴[] No.43486042{3}[source]
And that’s why GCC (among others) accepts SOURCE_DATE_EPOCH from the environment, and also has -Wdate-time. As for using __DATE__ or __TIME__ in code, I suspect that was more helpful in the age before ubiquitous source control and build IDs.
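
The same convention can be followed by a build script of your own. A minimal sketch, assuming only that `SOURCE_DATE_EPOCH` holds an integer number of seconds since the Unix epoch (the function name `build_time` is illustrative):

```python
import os
import time
from datetime import datetime, timezone


def build_time(environ=os.environ) -> datetime:
    """Return the build timestamp, preferring SOURCE_DATE_EPOCH when it is set."""
    raw = environ.get("SOURCE_DATE_EPOCH")
    seconds = int(raw) if raw is not None else int(time.time())
    return datetime.fromtimestamp(seconds, tz=timezone.utc)
```

When the variable is set, every run reports the same time; when it is not, the script falls back to the real clock, which is usually what you want for local development builds.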
replies(1): >>43488470 #
20. codetrotter ◴[] No.43486058{5}[source]
And the reason for that in turn is that if you are on one commit and check out an older commit, then restoring file modification times to what they were at the time of the older commit would cause build tools that look at file modification times to sometimes not pick up on all the changes.
21. paulddraper ◴[] No.43486079[source]
> Do they forge them?

Yes. All archive entries and date source code macros and any other timestamps are set to a standardized date (in the past).

replies(1): >>43492278 #
22. mikepurvis ◴[] No.43486128{4}[source]
Nix sets everything to the epoch, although I believe Debian's approach is to just use the date of the newest file in the dsc tarballs.
replies(2): >>43488193 #>>43492243 #
23. zamadatix ◴[] No.43486161{4}[source]
I think you'd need the compiler to guarantee serialization order of such operations regardless if you used ASLR or not. Otherwise you're just hoping thread scheduling, core clocking, thread memory access, and many other things are the same between every system trying to do a reproducible build. Even setting threads to 1 may not solve that problem class if asynchronous functions/syscalls come into play.
24. echoangle ◴[] No.43486400{5}[source]
Ah ok, that explains it.
25. repiret ◴[] No.43487552{3}[source]
> I can't think of a good reason for them

I work on a product whose user interface in one place says something like “Copyright 2004-2025”. The second year there is generated from __DATE__, that way nobody has to do anything to keep it up to date.

replies(1): >>43488135 #
26. Arelius ◴[] No.43488135{4}[source]
I mean, you could do that, it's sort-of a lie though, maybe something better would be using the date of the most recent commit, which would be both more accurate, as far as authorship goes, and actually deterministic..

Pipe something like this into your build system:

    date --date "$(git log HEAD --author-date-order --pretty=format:"%ad" --date=iso | head -n1)" +"%Y"
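
Roughly the same idea as a Python sketch, for build systems that are not shell-based. The names `year_of` and `last_commit_year` are made up here, and using `git log -1 --format=%at` (author date of HEAD as a Unix timestamp) is a simplification of the command above rather than an exact equivalent:

```python
import subprocess
from datetime import datetime, timezone


def year_of(epoch_seconds: int) -> int:
    """Convert a Unix timestamp to a calendar year in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).year


def last_commit_year(repo_dir: str = ".") -> int:
    """Year of the author date of HEAD; requires git and a repository at repo_dir."""
    output = subprocess.run(
        ["git", "log", "-1", "--format=%at"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return year_of(int(output.strip()))
```

Fixing the timezone to UTC avoids the year changing depending on where the build machine is located.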
27. yjftsjthsd-h ◴[] No.43488193{5}[source]
Nix can also set it to things other than 0; I think my favorite is to set it by the time of the commit from which you're building.
replies(2): >>43491317 #>>43492466 #
28. cperciva ◴[] No.43488447{3}[source]
FreeBSD tripped over an issue recently where a C++ program (I think clang?) used a collection of pointers and output values in an order based on the pointers rather than the values they pointed to.

ASLR by itself shouldn't cause reproducibility issues, but it can certainly expose bugs.

replies(1): >>43492945 #
29. cperciva ◴[] No.43488470{4}[source]
Source control only helps you if everything is committed. If you're, say, working on changes to the FreeBSD boot loader, you're probably not committing those changes every time you test something but it's very useful to know "this is the version I built ten minutes ago" vs "I just booted yesterday's version because I forgot to install the new code after I built it".
replies(5): >>43489912 #>>43490489 #>>43492048 #>>43492803 #>>43492948 #
30. TacticalCoder ◴[] No.43488794[source]
> ... what about files metadata like creation/modification timestamps? Do they forge them?

That's the least difficult problem to solve for reproducible builds, but yes.

The real question is: why, in the past, was an entire ecosystem created where non-determinism was the norm and everybody thought it was somehow ok?

Instead of asking: "how does one achieve reproducibility?" we may wonder "why did people go out of their way to make sure something as simple as a timestamp would screw determinism?".

For that's the anti-security mindset we have to fight. And Debian did.

replies(2): >>43490072 #>>43495889 #
31. rtpg ◴[] No.43488994{3}[source]
It's super nice to have timestamps as a quick way to know what program you're looking at.

Sticking it into --version output is helpful to know if, for example, the Python binary you're looking at is actually the one you just built rather than something shadowing that

replies(1): >>43498435 #
32. frakkingcylons ◴[] No.43489144[source]
I see this is written in Perl, is that the case with most Debian tooling?
replies(6): >>43489677 #>>43490179 #>>43490769 #>>43490826 #>>43491933 #>>43492219 #
33. dannyobrien ◴[] No.43489677{3}[source]
some, but not all. There's a bunch of historical code which means that Perl is in the base install, but modern tooling has a lot of Python too, as well as POSIX shell (not bash).
replies(1): >>43489919 #
34. kazinator ◴[] No.43489756{3}[source]
ASLR means that the pointers from malloc (which may come from mmap) are not predictable.

Sometimes programs have hash tables which use object identity as key (i.e. pointer).

ASLR can cause corresponding objects in different runs of the program to have different pointers, and be ordered differently in an identity hash table.

A program producing some output which depends on this is not necessarily a bug, but becomes a reproducibility issue.

E.g. a compiler might output some object in which a symbol table is ordered by a pointer hash. The difference in order doesn't change the meaning/validity of the object file, but it is seen as the build not having reproduced exactly.
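
A loose Python analogy of the effect, not a model of any real compiler: in CPython `id()` is derived from the object's memory address, so a table keyed on identity is ordered by wherever the allocator happened to place things. The `Symbol` class and both function names are hypothetical.

```python
class Symbol:
    """Stand-in for a compiler symbol; only the name matters for output."""

    def __init__(self, name: str):
        self.name = name


def emit_by_identity(symbols):
    """Order follows id(), i.e. the address each object happened to get."""
    table = {id(sym): sym for sym in symbols}
    return [table[key].name for key in sorted(table)]


def emit_by_name(symbols):
    """Order follows a property of the data itself, so it is stable across runs."""
    return sorted(sym.name for sym in symbols)
```

Both functions emit the same set of names; only `emit_by_name` guarantees the same sequence on every run.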

replies(1): >>43492749 #
35. lmm ◴[] No.43489912{5}[source]
> If you're, say, working on changes to the FreeBSD boot loader, you're probably not committing those changes every time you test something

Whyever not? Does the FreeBSD boot loader not have a VCS or something?

replies(2): >>43489986 #>>43490170 #
36. alfiedotwtf ◴[] No.43489919{4}[source]
Though a lot of the apt tooling is definitely written in Perl the last time I had to deep dive
replies(1): >>43491944 #
37. cperciva ◴[] No.43489986{6}[source]
It's in the FreeBSD src tree. But we usually commit code once it's working...
replies(1): >>43512975 #
38. BobbyTables2 ◴[] No.43490072[source]
You’re forgetting that source control used to not be a mainstream practice…

Software was more artisanal in nature…

39. steveklabnik ◴[] No.43490170{6}[source]
A subtlety that may be lost: FreeBSD uses CVS, and so there isn't a way to commit locally while you're working, like with a DVCS.
replies(1): >>43495755 #
40. fooker ◴[] No.43490179{3}[source]
It’s helpful to think of Perl as a superior bash, rather than a worse python, when it comes to scripting.
replies(3): >>43491457 #>>43491620 #>>43492020 #
41. mananaysiempre ◴[] No.43490489{5}[source]
> you're probably not committing those changes every time you test something

I’m not, but I really think I should be. As in, there should be a thing that saves the state of the tree every time I type `make`, without any thought on my part.

This is (assuming Git—or Mercurial, or another feature-equivalent VCS) not hard in theory: just take your tree’s current state and put it somewhere, like in a merge commit to refs/compiles/master if you’re on refs/heads/master, or in the reflog for a special “stash”-like “compiles” ref, or whatever you like.

The reason I’m not doing it already is that, as far as I can tell, Git makes it stupendously hard to take a dirty working tree and index, do some Git to them (as opposed to a second worktree using the same gitdir), then put things back exactly as they were. I mean, that’s what `git stash` is supposed to do, right?.. Except if you don’t have anything staged then (sometimes?..) after `git stash pop` everything goes staged; and if you’ve added new files with `git add -N` then `git stash` will either refuse to work, or succeed but in such a way that a later `git stash pop` will not mark these files staged (or that might be the behaviour for plain `git add` on new files?). Gods help you if you have dirty submodules, or a merge conflict you’ve fixed but forgot to actually commit.

My point is, this sounds like a problem somebody’s bound to have solved by now. Does anyone have any pointers? As things are now, I take a look at it every so often, then remember or rediscover the abovementioned awfulness and give up. (Similarly for making precommit hooks run against the correct tree state when not all changes are being committed.)

replies(1): >>43490958 #
42. londons_explore ◴[] No.43490769{3}[source]
Packaging and making build scripts is perhaps one of the most unrewarding tasks out there. As an open source project where most work is done for free, debian can't afford to be prescriptive about what languages are used for this sort of task.
replies(1): >>43492550 #
43. londons_explore ◴[] No.43490787{3}[source]
Finding and fixing cases like this are part of what the project has done...
44. jeltz ◴[] No.43490826{3}[source]
Last time I checked a lot was also written in Python.
45. beecasthurlbow ◴[] No.43490958{6}[source]
An easy (ish) option here is to use autosquashing [1], which lets you create individual commits (saving your work - yay!) and then eventually clean em up into a single commit!

Eg

    git commit -am "Starting work on this important feature"

    # make some changes
    git add . && git commit --squash HEAD -m "I made a change"

Then once you’re all done, you can do an autosquash interactive rebase (`git rebase -i --autosquash`) and combine them all into your original change commit.

You can also use `git reset --soft $BRANCH_OR_COMMITTISH` to go back to an earlier commit but leave all changes (except maybe new files? Sigh) staged.

You also might check out `git reflog` to find commits you might’ve orphaned.

[1] https://thoughtbot.com/blog/autosquashing-git-commits

46. ◴[] No.43491317{6}[source]
47. gjvc ◴[] No.43491457{4}[source]
stealing this, thank you
48. eviks ◴[] No.43491620{4}[source]
How is that helpful to ignore a better alternative just because a worse one exists?
replies(2): >>43491767 #>>43501196 #
49. palata ◴[] No.43491767{5}[source]
They precisely say they use it as a better alternative to bash. Obviously they don't think that Python is a better alternative here... or did I misunderstand the question?
replies(1): >>43491793 #
50. eviks ◴[] No.43491793{6}[source]
It's not obvious to me that they think Python is worse than Perl, which makes the phrase even less sensible.
replies(2): >>43491988 #>>43491994 #
51. johnisgood ◴[] No.43491933{3}[source]
I checked the code. Perl is suitable for these kinds of tasks.
52. johnisgood ◴[] No.43491944{5}[source]
And a lot of OpenBSD-related stuff is written in Perl, too. I do not think it is a bad thing at all.
replies(1): >>43494866 #
53. dizhn ◴[] No.43491988{7}[source]
Weird wording yes. I read it as "yes perl is better than bash" (I assume for tasks that need actual programming languages), "no it's not worse than python".
replies(1): >>43492687 #
54. palata ◴[] No.43491994{7}[source]
So you genuinely believe that they think Python is a better choice in this case, but still chose to go for Perl because they believe it's worse? How does that work?
replies(1): >>43492023 #
55. nukem222 ◴[] No.43492020{4}[source]
Notably, they forgot to improve on readability and maintainability, both of which are markedly worse with perl.

Look I get people use the tools they use and perl is fine, I guess, it does its job, but if you use it you can safely expect to be mocked for prioritizing string operations or whatever perl offers over writing code anyone born after 1980 can read, let alone is willing to modify.

For such a social enterprise, open source orgs can be surprisingly daft when it comes to the social side of tool selection.

Would this tool be harder to write in python? Probably. Is it a smart idea to use it regardless? Absolutely. The aesthetics of perl are an absolute dumpster fire. Larry Wall deserves persecution for his crimes.

replies(1): >>43493083 #
56. eviks ◴[] No.43492023{8}[source]
It works by not mixing two different people: the commenter and the implementer.

Also, it works trivially even in the case of the implementer - he might believe Python is better, but chose Perl because he likes it more

57. jrockway ◴[] No.43492048{5}[source]
Versions built into the code are nice. I think the correct answer is to commit before the build proper starts (automatically, without changing your HEAD ref) and put that in there. Then you can check version control for the date information, but if someone else happens to add the same bytes to the same base commit, they also have the same version that you do. (Similarly, you can always make the date "XXXXXXXXXXXXXXXXXXXXXX" or something, and just replace the bytes with the actual date after the build as you deploy it.)

What I actually did at $LAST_JOB for dev tooling was to build in <commit sha> + <git diff | sha256> which is probably not amazingly reproducible, but at least you can ask "is the code I have right now what's running" which is all I needed.
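
A sketch of that second scheme, under stated assumptions: `build_id` is a hypothetical name, the 12-character truncation is arbitrary, and returning the bare commit hash for a clean tree is a choice made for this example rather than something described above. The inputs would come from `git rev-parse HEAD` and the raw output of `git diff`.

```python
import hashlib


def build_id(commit_sha: str, diff: bytes) -> str:
    """Combine the base commit with a digest of the uncommitted changes."""
    if not diff:
        # Clean working tree: the commit alone identifies the build.
        return commit_sha
    return f"{commit_sha}+{hashlib.sha256(diff).hexdigest()[:12]}"
```

Two people with the same base commit and the same uncommitted bytes get the same identifier, regardless of when they built.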

Finally, there is probably enough flexibility in most build systems to pick between "reuse a cache artifact even if it has the wrong stamping metadata", "don't add any real information", and "spend an extra 45 cpu minutes on each build because I want $time baked into a module included by every other source file". I have successfully done all 3 with Bazel, for example.

replies(1): >>43500244 #
58. lamby ◴[] No.43492219{3}[source]
One of the authors of strip-nondeterminism here. The primary reason it's written in Perl is that, given that strip-nondeterminism is used when building 99.9% of all Debian packages, using any other language would have essentially made that language's runtime a dependency for building all Debian packages. (Perl is already required by the build process, whilst Python is not.)
replies(1): >>43494083 #
59. lamby ◴[] No.43492243{5}[source]
Debian's approach is actually to use the date specified in the top entry in the debian/changelog file. That's more transparent and resilient than any mtime.
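
To illustrate where that date lives, here is a simplified sketch that pulls the timestamp out of the first trailer line of a changelog. It is not how Debian's tooling does it (`dpkg-parsechangelog` is the real parser); `changelog_epoch` and the `SAMPLE` entry are made up for the example, and the code assumes the usual ` -- Name <email>  date` trailer with two spaces before the date.

```python
from email.utils import parsedate_to_datetime

SAMPLE = """\
hello (1.0-2) unstable; urgency=medium

  * Second upload.

 -- Jane Doe <jane@example.org>  Thu, 02 Jan 2025 00:00:00 +0000

hello (1.0-1) unstable; urgency=medium

  * Initial upload.

 -- Jane Doe <jane@example.org>  Wed, 01 Jan 2025 00:00:00 +0000
"""


def changelog_epoch(changelog_text: str) -> int:
    """Return the Unix timestamp of the top (most recent) changelog entry."""
    for line in changelog_text.splitlines():
        if line.startswith(" -- "):
            # The date follows the closing '>' of the email and two spaces.
            _, _, date_text = line.partition(">  ")
            return int(parsedate_to_datetime(date_text.strip()).timestamp())
    raise ValueError("no trailer line found")
```

The resulting integer is the kind of value a build would export as `SOURCE_DATE_EPOCH`.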
60. lamby ◴[] No.43492261{4}[source]
Strangely enough, sometimes using the epoch can expose bugs in libraries (etc.) when running or building in a timezone west of Greenwich due to the negative time offset taking time "below" zero.
61. lamby ◴[] No.43492278[source]
This is not quite right. At least in Debian, only files that are newer than some standardised date are set to that standardised date. This "clamping" preserves any metadata in older files.
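
Clamping is easy to state in code. A minimal sketch (the function names are illustrative, and this is not Debian's actual implementation): timestamps newer than the reference date are pulled back to it, older ones are left alone.

```python
import os


def clamp_mtime(mtime: int, source_date_epoch: int) -> int:
    """Return the timestamp unchanged unless it is newer than the reference date."""
    return min(mtime, source_date_epoch)


def clamp_tree(root: str, source_date_epoch: int) -> None:
    """Apply the clamp to the modification time of every file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            stat = os.stat(path)
            if stat.st_mtime > source_date_epoch:
                os.utime(path, (stat.st_atime, source_date_epoch))
```

Files generated during the build (which are newer than the reference date) all end up with the same timestamp, while files shipped unchanged from upstream keep their original one.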
62. terinjokes ◴[] No.43492466{6}[source]
Which is also used when the contents of a derivation will be included in a zip file. The Unix epoch is about a decade older than the zip epoch.
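
The zip limitation shows up directly in Python's `zipfile`, where `ZipInfo` refuses dates before 1980. A minimal sketch of a deterministic zip writer that clamps upward to the zip epoch (`ZIP_EPOCH`, `zip_date_time` and `build_zip` are names invented for this example):

```python
import io
import time
import zipfile

ZIP_EPOCH = (1980, 1, 1, 0, 0, 0)  # earliest date the zip format can store


def zip_date_time(epoch_seconds: int) -> tuple:
    """Convert a Unix timestamp to a zip date tuple, never earlier than 1980."""
    stamp = tuple(time.gmtime(epoch_seconds)[:6])
    return max(stamp, ZIP_EPOCH)


def build_zip(files: dict, epoch_seconds: int) -> bytes:
    """Pack {name: bytes} into a zip with sorted entries and one fixed timestamp."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in sorted(files):
            info = zipfile.ZipInfo(name, date_time=zip_date_time(epoch_seconds))
            archive.writestr(info, files[name])
    return buffer.getvalue()
```

Passing a timestamp of 0 therefore yields entries dated 1980-01-01 rather than an error.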
63. account42 ◴[] No.43492550{4}[source]
Actually it can and it is. Build system dependencies, especially ones that apply to all packages, are something that concerns the distribution as a whole and not something where each developer can just add their favorite one.
64. ben0x539 ◴[] No.43492687{8}[source]
I'm not reading it as "it's not worse than python", I am reading it as "the choice was between bash and perl, python was not an option for reasons unrelated to its merits"
65. account42 ◴[] No.43492749{4}[source]
That's just one example of nondeterminism in compilers though - at the end it's the responsibility of the compiler to provide options not to do that.
replies(1): >>43495112 #
66. account42 ◴[] No.43492803{5}[source]
Nobody cares about reproducibility of local development builds so just limit your use of date/time to those and use a more appropriate build reference for release builds.
67. ahartmetz ◴[] No.43492945{4}[source]
It is sometimes just fine to have a hash table with pointers as keys. It is by design an unordered collection, so you do not care about the order, only about finding entries.

Then at some point you happen to need all the entries, you iterate, and you get a random order. Which is not necessarily a problem unless you want reproducible builds, which is just a new requirement, not exposing a latent bug.

68. chippiewill ◴[] No.43492948{5}[source]
Which is fine, you don't need to use a reproducible build for local dev and can just use the real timestamp.
69. sgarland ◴[] No.43493083{5}[source]
Did you miss the post a few above yours, where an author of this tool explained why it’s written in Perl? Introducing a new language dependency for a build, especially of an OS, is not something you undertake lightly.
replies(1): >>43494118 #
70. flkenosad ◴[] No.43494083{4}[source]
Question: is Perl the only runtime the Debian build process relies on?
replies(1): >>43508018 #
71. nukem222 ◴[] No.43494118{6}[source]
Right. Good luck finding people who want to maintain that. It just seems incredibly short-sighted unless the current batch of maintainers intend to live forever.
replies(1): >>43495073 #
72. alfiedotwtf ◴[] No.43494866{6}[source]
I absolutely love Perl. I'm just so sad Python won because Google blessed it as a language and at the time everyone wanted to work for Google.

Perl always gets hate on HN, but I actually wonder how many of those commenters have spent more than a single hour using Perl after reading the Camel book.

Honest opinion: if you're going to be spending time in Linux in your career, then you should read the Camel book at least once. Then and only then should you get to have an opinion on Perl!

replies(1): >>43498292 #
73. sgarland ◴[] No.43495073{7}[source]
Counterpoint: if someone knows Perl, they are much more likely to have the requisite skills to be a maintainer for a distro. It’s self-selection.

Imagine the filtering required for potential maintainers if they rewrote the packaging to JS.

74. kazinator ◴[] No.43495112{5}[source]
Not for external causes like ASLR and memory allocators; those things should have their respective options for that.
replies(1): >>43503110 #
75. cperciva ◴[] No.43495755{7}[source]
FreeBSD hasn't used CVS since 2008.
replies(1): >>43495830 #
76. steveklabnik ◴[] No.43495830{8}[source]
Huh! So, before I posted this, I went to go double check, and found https://wiki.freebsd.org/VersionControl. What I missed was the (now obvious) banner saying

> The sections below are currently a historical reference covering FreeBSD's migration from CVS to Subversion.

My apologies! At the end of the day, the point still stands in that SVN isn't a DVCS and so you wouldn't want to be committing unfinished code though, correct?

(I suspect I got FreeBSD mixed up with OpenBSD in my head here, embarrassing.)

replies(2): >>43498328 #>>43500995 #
77. brohee ◴[] No.43495889[source]
TBH security is sometimes the source of the issues, as it often involves adding randomness. For example, replacing deterministic hashes with keyed hashes to protect from hash flooding DoS led to deterministic output becoming nondeterministic (e.g. when displaying a hash table in its natural order).

Sorting had to be added to that kind of output.

78. freedomben ◴[] No.43498292{7}[source]
I mostly agree with you, though I do think Perl is genuinely harder to read than many other languages. Perl was often my goto for scripts before I learned Ruby (which has many glorious perl-isms in it even if most rubyists nowadays don't know or want to acknowledge that :-D ), and even looking back at some of my own code and knowing what it does, I have to read it a lot slower and more carefully than most other langs. Perl to me feels wonderfully optimized for writing, sometimes at the expense of reading. I love Perl's power and expressiveness, especially the string processing libs, and while I appreciate the flexibility in how many different ways there are to do things, it does mean that Perl code written by someone else with different approaches can sometimes be difficult to grok. For my own scripts I don't care about any of those issues and I often optimize for writing anyway, but there are plenty of applications where I would recommend against Perl, despite my affection for it.

And yes agree, people should read the camel book!

replies(1): >>43504708 #
79. jraph ◴[] No.43498328{9}[source]
You could still use git-svn, but yeah, as another commenter wrote, I don't think reproducible build is that useful when debugging, it should be fine to have an actual timestamp in the binaries.
80. izacus ◴[] No.43498435{4}[source]
The whole point of reproducible builds is that you don't need to rely on timestamps and similar information to know which binary you're looking at.
81. ◴[] No.43500244{6}[source]
82. cperciva ◴[] No.43500995{9}[source]
Well yes, but we've actually migrated to Git now. ;-)
replies(1): >>43505962 #
83. fooker ◴[] No.43501196{5}[source]
The same reason people write C++ instead of better^TM alternatives.

Pick the tool you already know and focus on solving the problem.

84. account42 ◴[] No.43503110{6}[source]
There is no guarantee that memory allocation is deterministic even without ASLR. If your program is supposed to be deterministic but its output depends on the memory addresses returned by the allocator then your program is buggy.
85. johnisgood ◴[] No.43504708{8}[source]
> there are plenty of applications where I would recommend against Perl

Yes of course, I would not write any type of server in Perl, I would pick Go or Elixir or Erlang for such a use case.

86. steveklabnik ◴[] No.43505962{10}[source]
Welp! Egg on my face twice!
87. yrro ◴[] No.43508018{5}[source]
Any packages with "Essential: yes" (run 'apt list ~E' to see them) are required on any Debian system. Additionally, the 'build-essential' package pulls in other packages that must be present to build Debian packages via its dependencies: https://packages.debian.org/sid/build-essential
88. lmm ◴[] No.43512975{7}[source]
Huh. If I was confident enough in a change to consider it worth doing an actual boot to test I'd certainly want to have it committed, to be able to track and go back to it. Even the broken parts of history are valuable IME.