
It's been years since I went through this, but whenever someone asks me what they should read to get a deeper understanding of what a Linux distribution is, I point them to this.

Yup, it's where I got a lot of my linux knowledge.

I think that Gentoo or even Arch would provide pretty close to the same education level, though, with a lot less time to install.


Having installed Arch myself a couple of times, I think I would disagree. There's not really much in that process that teaches you how Linux actually works. It's more about managing disk partitions and moving files around than anything else.

LFS is just on a whole different level, and is on my bucket list to complete the entire process one day.


I've completed it along with BLFS and I just don't really agree.

Like, yes you get pretty familiar with autotools, sed, and patch. However, a lot of LFS is in fact just managing disk partitions and moving files around.

LFS also glosses over a lot of pretty important parts like kernel configuration.

The docs from both Gentoo and Arch, on the other hand, are much more complete and practical in explaining things and also troubleshooting problems. And at the end of the process you're left with a system that can be easily maintained.

LFS is harder, but that doesn't really mean you end up learning more. Especially since it's pretty easy to lose focus and just rely on copy/pasting the next command to run.

Edit: Just an example of what I mean.

Here is the LFS discussion of filesystems.

https://www.linuxfromscratch.org/lfs/view/stable/chapter02/c...

And here is the same Gentoo discussion.

https://wiki.gentoo.org/wiki/Handbook:AMD64/Installation/Dis...


Gentoo I understand but Arch? Does Arch go into compilation that much?

Not so much compilation, but it does delve into system management in a way that other OSes don't. Arch has few defaults set up for the user, so if you do it from scratch you'll end up needing to go through several of the general setup recommendations [1].

That's where you end up learning a lot about Linux that's particularly practical. Other Linux distros, especially for the desktop, hide a lot of this information behind nice GUIs.

[1] https://wiki.archlinux.org/title/General_recommendations


It's not just the installation process. Being forced to manage, or set up automatic management for, most parts of your system teaches you a lot. Often it's just as simple as `sudo pacman -Sy yabdabadoo`, but it's more instructive than it 'just working'.
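For instance, a lot of the post-install work on Arch has this flavour (a rough sketch of the kind of task the general recommendations cover; the package choices here are just illustrative, not anything the parent comment specifically mentioned):

    # install a network manager and turn on time sync yourself,
    # because Arch does not set any of this up for you
    sudo pacman -S --needed networkmanager
    sudo systemctl enable --now NetworkManager.service
    sudo timedatectl set-ntp true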

Agreed!

As an addendum, you have to do it on your actual working computer. Doing it in a VM or on a machine you don't use, you won't be learning nearly as much, since there is no pressure to make it truly work for you. That's where the learning happens: when the thing you wanted to configure isn't covered by the LFS docs, or the web docs are out of date, so you have to dig deeper.


Agreed. I haven't done LFS, but I've done Arch and plenty of other distros for a good while, and I definitely wouldn't say I have a rock-solid understanding of the fundamentals.

> I think that Gentoo or even Arch would provide pretty close to the same education level, though, with a lot less time to install.

it only takes three commands to install Gentoo

cfdisk /dev/hda && mkfs.xfs /dev/hda1 && mount /dev/hda1 /mnt/gentoo/ && chroot /mnt/gentoo/ && env-update && . /etc/profile && emerge sync && cd /usr/portage && scripts/bootsrap.sh && emerge system && emerge vim && vi /etc/fstab && emerge gentoo-dev-sources && cd /usr/src/linux && make menuconfig && make install modules_install && emerge gnome mozilla-firefox openoffice && emerge grub && cp /boot/grub/grub.conf.sample /boot/grub/grub.conf && vi /boot/grub/grub.conf && grub && init 6

that's the first one

https://web.archive.org/web/20230601013339/http://bash.org/?...


I remember playing with Gentoo back in 2004-2005, going through the installation procedure from "stage 1" all the way through to the working system [1]

It looks like nowadays the handbook says just go from stage 3, which makes sense - compiling everything was kinda stupid :D

[1] https://web.archive.org/web/20041013055338/http://www.gentoo...


I made the mistake of kicking off an `emerge world` from stage 1 on a Pentium 3 (I think? A P4 at the very best) with a full OpenOffice and Firefox selection.

No idea how long it would take.

One week later I finally saw my new desktop!

I learnt a hell of a lot with Gentoo - all I had was a DVD and the magazine it came with, stepping through the stage 1 install process. No internet connection to search for answers when things went wrong. Not my current daily driver, but definitely some good memories!


> it only takes three commands to install Gentoo

> cfdisk /dev/hda

Kids these days. You need to _boot_ first into a system which has those utilities. Also, hda is from last century, but it is good for learning why.


Let's break it down into traditional skill-level terminology:

Apprentice: Ubuntu, Fedora

Journeyman: Arch, Debian, Gentoo

Master: Linux From Scratch


Grandmaster: Ubuntu, Fedora

You forgot Slackware.

Another point is long-term maintainability as well, like uninstalling stuff you don't need, etc. Or does LFS solve that?

Yeah, that was a real lesson for me when I did LFS.

It was super neat when I got it running for a while, but young me who did it really didn't understand the concept of "OK, but now you need to upgrade things". That was one of my first experiences with the pain of a glibc update and going "ohhh, that's why people don't run these sorts of systems".


I used versioned AppDirs for that, e.g. /Programs/Python/3.13/. If I don't need it anymore, the directory is removed and a script runs. Similar to GoboLinux. I don't, however, use GoboLinux right now; GoboLinux unfortunately lacks documentation, whereas LFS/BLFS has better documentation. Finding information these days is hard - Google search has become sooooo bad ...

> Like unistalling stuff you don't need

This will lead to a lot of learning. /s


I learned so much installing and using Gentoo about 20 years ago

I still think that LFS taught me more about sed, gcc CFLAGS, and bootstrapping than about the underlying OS, sadly.

They should have considered that! /s

Me too, I just thought that I wouldn't trust an article on linguistics with such an error too much.

IMHO, the whole social/psychological aspect of the "conspiracy" or phenomenon or whatever you want to call it is at least as interesting as the phenomenon itself.

They had to update all the down detectors first.

I don't find the wording in the RFC to be that ambiguous actually.

> The answer to the query, possibly preface by one or more CNAME RRs that specify aliases encountered on the way to an answer.

The "possibly preface" (sic!) to me is obviously to be understood as "if there are any CNAME RRs, the answer to the query is to be prefaced by those CNAME RRs" and not "you can preface the query with the CNAME RRs or you can place them wherever you want".


I agree this doesn't seem too ambiguous - it's "you may do this.." and they said "or we may do the reverse". If I say you could prefix something, the alternative isn't that you can suffix it.

But also.. the programmers working on the software running one of the most important (end-user) DNS servers in the world:

1. Changes logic in how CNAME responses are formed

2. I assume at least some tests broke and needed to be "fixed up" (y'know - "when a CNAME is queried, I expect this response")

3. No one saw these changes in test behaviour and thought "I wonder if this order is important", or "We should research more into this", or "Are other DNS servers changing order", or "This should be flagged for a very gradual release".

4. Ends up in the test environment for, what, a month.. nothing using getaddrinfo from glibc was being used to test this environment, and no one noticed that it was broken

Cloudflare seem to be getting into the swing of breaking things and then being transparent. But this really reads as a fun "did you know", not a "we broke things again - please still use us".

There's no real RCA except to blame an RFC - but honestly, for a large-scale operation like theirs, this seems like a very big thing to slip through the cracks.

I would make a joke about South Park's oil "I'm sorry".. but they don't even seem to be


> 4. Ends up in the test environment for, what, a month.. nothing using getaddrinfo from glibc was being used to test this environment, and no one noticed that it was broken

"Testing environment" sounds to me like a real network real user devices are used with (like the network used inside CloudFlare offices). That's what I would do if I was developing a DNS server anyway, other than unit tests (which obviously wouldn't catch this unless they were explicitly written for this case) and maybe integration/end-to-end tests, which might be running in Alpine Linux containers and as such using musl. If that's indeed the case, I can easily imagine how noone noticed anything was broken. First look at this line:

> Most DNS clients don’t have this issue. For example, systemd-resolved first parses the records into an ordered set:

Now think about what real end-user devices are using: Windows/macOS/iOS obviously aren't using glibc, and Android has its own C library even though it's Linux-based, so they all probably fall under "Most DNS clients don't have this issue".

That leaves GNU/Linux, where we could reasonably expect most software to use glibc for resolving queries, so presumably anyone using Linux on their laptop would catch this, right? Except most distributions started using systemd-resolved (the most notable exception is Debian, but not many people use that on desktops/laptops), which is a local caching DNS resolver and as such acts as a middleman between glibc software and the network-configured DNS server, so it would resolve queries against 1.1.1.1 correctly and then return the results from its cache, ordered by its own ordering algorithm.
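If you want to check whether your own machine sits behind that middleman, a rough sketch (exact output varies by distro):

    # a 127.0.0.53 nameserver here means glibc talks to the local
    # systemd-resolved stub rather than directly to the network's DNS server
    grep nameserver /etc/resolv.conf
    # on systemd-resolved systems this shows the actual upstream servers
    resolvectl status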


For the output of Cloudflare’s DNS server, which serves a huge chunk of the Internet, they absolutely should have a comprehensive byte-by-byte test suite, especially for one of the most common query/result patterns.

> other than unit tests (which obviously wouldn't catch this unless they were explicitly written for this case)

They absolutely should have unit tests that detect any change in output and manually review those changes for an operation of this size.
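Even a crude golden-file check against a test instance would flag a reordering like this (a sketch only; the address, port, and file paths are made up for illustration):

    # query the resolver under test and keep only the answer section
    dig @127.0.0.1 -p 5300 www.example.com A +noall +answer > /tmp/answer.txt
    # fail on any difference from the recorded golden answer
    # (in practice you would normalize TTLs before comparing)
    diff -u testdata/www.example.com.golden /tmp/answer.txt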


> Ends up in the test environment for, what, a month.. nothing using getaddrinfo from glibc was being used to test this environment, and no one noticed that it was broken

This is the part that is shocking to me. How is getaddrinfo not called in any unit or system tests?
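Even without writing a C test, a smoke test on any glibc box exercises the same code path (the hostname here is illustrative):

    # getent ahosts resolves via NSS, i.e. through glibc's getaddrinfo,
    # so a CNAME-ordering regression that trips getaddrinfo shows up here
    getent ahosts www.example.com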


As black3r mentioned (https://news.ycombinator.com/item?id=46686096), it is likely rearranged by systemd-resolved, therefore only non-systemd glibc distributions are affected.

I would hazard a guess that their test environment has both the systemd variant and the Unbound variant (Unbound technically does not arrange them, but instead reconstructs them according to the RFC "CNAME restart" logic, because it is a recursive resolver in itself), but not a plain, directly-piped resolv.conf (presumably because who would run that in this day and age? This is sadly only a half-joke, because only a few people fall into this category.)


Probably Alpine containers, so musl's version instead of glibc's.

I was even more surprised to see that the RFC draft had original text from the author dating back to 2015. https://github.com/ableyjoe/draft-jabley-dnsop-ordered-answe...

We used to say at work that the best way to get promoted was to be the programmer who introduced the bug into production and then fixed it. Crazy if true here...


What you're suggesting seems like a spectacular leap. I do not think it is very likely that the unnamed employee at Cloudflare that was micro-optimising code in the DNS resolver is also the author of this RFC, Joe Abley (the current Director of Engineering at the company, and formerly Director of DNS Operations at ICANN).

> I assume some tests at least broke that meant they needed to be "fixed up"

OP said:

"However, we did not have any tests asserting the behavior remains consistent due to the ambiguous language in the RFC."

One could guess it's something like: back when we wrote the tests, years ago, whoever did it missed that this was required, not helped by the fact that the spec preceded RFC 2119 standardizing the all-caps "MUST"/"SHOULD" language, which would have helped us translate specs to tests more completely.


You'd think that something this widely used would have golden tests that detect any output change to trigger manual review but apparently they don't.

Oh, they explain, if I understand right, that they made the output change intentionally, for performance reasons, based on the inaccurate assumption that order did not matter in DNS responses -- because there are OTHER aspects of DNS responses in which, by spec, order does not matter, and because there were no tests saying order mattered for this component.

> "The order of RRs in a set is not significant, and need not be preserved by name servers, resolvers, or other parts of the DNS." [from RFC]

> However, RFC 1034 doesn’t clearly specify how message sections relate to RRsets.

The developer(s) assumed order didn't matter in general, because the RFC said it didn't for one aspect, and intentionally made a change to the order for performance reasons. But it turned out that change did matter.

Mistakes of this kind seem unavoidable; this one doesn't necessarily say to me that the developers made a mistake I never could have, or something.

I think the real conclusion is that they probably need tests using actual live network stacks with common components, and why didn't they have those? Not just unit tests or tests with mocks, but tests that would have actually used the real getaddrinfo function in glibc and shown it failing.
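A cheap way to cover both common libc resolvers against a live test instance would be something along these lines (a sketch only; the image tags and hostname are just illustrative, and it assumes Docker and network access in CI):

    # glibc resolver path (Debian-based image)
    docker run --rm python:3-slim python3 -c "import socket; print(socket.getaddrinfo('www.example.com', 443))"
    # musl resolver path (Alpine-based image)
    docker run --rm python:3-alpine python3 -c "import socket; print(socket.getaddrinfo('www.example.com', 443))"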


Even if there weren't tests for the return order, I would have bet that there were tests of backbone resolvers like getaddrinfo. Is it really possible that the first time anyone noticed that that crashed, or that ciscos bootlooped, was on a live query?

Yes, at least they should test the glibc case.

The article makes it very clear that the ambiguity arises in another phrase: “difference in ordering of the RRs in the answer section is not significant”, which is applied to an example; the problem with examples being that they are illustrative, viz. generalisable, and thus may permit reordering everywhere, and in any case, whether they should or shouldn’t becomes a matter of pragmatic context.

Which goes to show, one person’s “obvious understanding” is another’s “did they even read the entire document”.

All of which also serves to highlight the value of normative language, but that came later.


It wouldn't be a problem if they tested it properly... especially WHEN stuff is ambiguous.

They may not have realized their interpretation was ambiguous until after the incident; that's the kind of stuff you realize after you find a bug and do a deep dive into the literature for a post-mortem. They probably worked with the certitude that record order is irrelevant until that point.

> I don't find the wording in the RFC to be that ambiguous actually.

You might not find it ambiguous but it is ambiguous and there were attempts to fix it. You can find a warmed up discussion about this topic here: https://mailarchive.ietf.org/arch/msg/dnsop/2USkYvbnSIQ8s2vf...


I agree with you, and I also think that their interpretation of example 6.2.1 in the RFC is somewhat nonsensical. It states that “The difference in ordering of the RRs in the answer section is not significant.” But from the RFC, very clearly this comment is relevant only to that particular example; it is comparing two responses and saying that in this case, the different ordering has no semantic effect.

And perhaps this is somewhat pedantic, but they also write that “RFC 1034 section 3.6 defines Resource Record Sets (RRsets) as collections of records with the same name, type, and class.” But looking at the RFC, it never defines such a term; it does say that within a “set” of RRs “associated with a particular name” the order doesn’t matter. But even if the RFC had said “associated with a particular combination of name, type, and class”, I don’t see how that could have introduced ambiguity. It specifies an exception to a general rule, so obviously if the exception doesn’t apply, then the general rule must be followed.

Anyway, Cloudflare probably know their DNS better than I do, but I did not find the article especially persuasive; I think the ambiguity is actually just a misreading, and that the RFC does require a particular ordering of CNAME records.

(ETA:) Although admittedly, while the RFC does say that CNAMEs must come before As in the answer, I don’t necessarily see any clear rule about how CNAME chains must be ordered; the RFC just says “Domain names in RRs which point at another name should always point at the primary name and not the alias ... Of course, by the robustness principle, domain software should not fail when presented with CNAME chains or loops; CNAME chains should be followed”. So actually I guess I do agree that there is some ambiguity about the responses containing CNAME chains.


Even if 'possibly preface' is interpreted to mean CNAME RRSets should appear first there is still a broken reliance by some resolvers on the order of CNAME RRsets if there is more than one CNAME in the chain. This expectation of ordering is not promised by the relevant RFCs.

Isn't this literally noted in the article? The article even points out that the RFC is from before normative words were standardized for hard requirements.

100%

I just commented the same.

It's pretty clear that the "possibly" refers to the presence of the CNAME RRs, not the ordering.


The context makes it less clear, but even if we pretend that part is crystal, a comment that stops there is missing the point of the article. All CNAMEs at the start isn't enough. The order of the CNAMEs can cause problems despite perfect RFC compliance.

To me, this reads exactly the opposite.

My initial reading was "you can place them wherever you want". And given that multiple parties are naturally interpreting the wording in different ways, that means the wording is ambiguous by definition.

So pretty similar to product owners or project managers in your average enterprise

> we won't be able to AI our way into better communication skills

Why not?

I always find these articles funny. There's someone almost triumphantly declaring that AI is able to take over the hard-skills tasks from the oh-so-dreaded engineers, but the authors somehow cannot imagine that their soft skills - which are often the only ones they have - could be done by AI as well.


Claude?

It could be Claude. Or Sally, Joe, or Sue. Whatever name the PM goes by is immaterial.

I think many people don't realize how big this dependence is.

You're running Linux? Oh fine... on which hardware and firmware? Intel? AMD? Apple Silicon? Qualcomm? All US.

You're using the Internet? Via Cisco routers?

Europe and other regions would have to put in huge efforts to really gain independence.


The thing is that we had been allies for many decades, so the US and the EU are very entwined. You only mention one side: those chips are made using ASML machines from the Netherlands (with lenses from Germany), and the latter two use an architecture licensed from a UK company (owned by a Japanese conglomerate). It was a very successful cooperation between two continents, but since the US wants to throw that under the bus, we have to become self-sufficient.

It will take time to untangle the mutual dependencies and become more independent. That said, ARM also designs full ARM64 cores (until recently Qualcomm cores were based on ARM cores, until the new cores based on the NUVIA acquisition) and they can be fabbed in Taiwan (TSMC) and South Korea (Samsung), and hopefully Europe in some years.

Besides that, it's true that if you are running Linux, you rely on US firmware and Intel/AMD chips, but assuming that Intel ME doesn't have a bad remote kill switch, you can continue to run on existing hardware.


I think there are different forms of dependence that result in more or less severe carrying costs. Hardware is only a problem when you need to replace it or create new installations, so its carrying cost is rather low. The Microsoft 365 Copilot app is subscription-based, induces vendor lock-in with a whole software/hardware ecosystem, and is updated on a whim from the vendor with next to no customer control; its carrying cost is enormous.

All that hardware (with chips) is made by machines from ASML. Mobile devices? That's all ARM. Mobile infrastructure like 5G? Mostly Nokia or Ericsson.

China is getting closer to tech independence by the day, I imagine they are happy to sell their tech to anyone who is willing. Not saying this is good or desirable from a European perspective, but quite likely.
