Google Public DNS's approach to fight against cache poisoning attacks (googleblog.com)
212 points by tatersolid on April 7, 2024 | 69 comments


I see most of the bot traffic hiding behind Google and Cloudflare. I appreciate that. They take a lot of load off my servers. If I had one request of Google, it would be to remove the TTL cap of 21600, or raise it to 86400, to further reduce the traffic that comes through from them, as my TTLs are very high on purpose for my own reasons that I will not debate. I know memory is still a limit. CF seems to honor my TTLs, with the caveat that I have to wait for each node behind their anycast clusters to cache the record, but that's fine.

As a side note, whatever group in Google is running DNS is doing a great job. I do not see malformed garbage coming from them. Some people watch birds when they retire; I watch packets ... and critters.


> my TTL records are very high on purpose for my own reasons that I will not debate.

Why are your records high? Is the load on your servers too intense otherwise? (not trying to debate, am curious because I've really appreciated your comments/insights in the past)



Guess - possibly some sort of outbound port-knock alternative:

   if (SYN->dest port && TTL > X) { ACK };


How is this supposed to work, exactly?



Yes, but what does any of this have to do with TTL?

Are you maybe confusing IP packet TTL with DNS record TTL?


this


Waited for this to fall off the front page

For my personal hobby sites I change things so rarely that I can keep my TTLs very high, especially for NS records and most TXT records. There are some records I could theoretically set to 68 years, though obviously nobody would keep a record that long. Going against the grain, I also keep my negative-cache TTL (neg-ttl) really high to help spot the bots that ignore it. It's a hobby of mine to study bots and what they are enumerating. Sometimes it gives me a jump start on zero-day vulnerabilities, for example when a number of bots suddenly start looking for a specific A record such as cpanel, to use an old and silly example.

Non-bot clients will respect TTLs within reason. Most ISP recursive DNS servers will cap the TTL at 24 hours for NS records and sometimes higher for A, CNAME, PTR, etc. Most corporate DNS servers will be close to the defaults of whatever recursive daemon they are using, usually Active Directory, sometimes BIND. The remaining limiting factor is memory, but most recursive DNS servers these days have obscene amounts of RAM and CPU time that the DNS admin may allocate. There are ways to further optimize recursive servers, such as periodically flushing junk zones via cron and other things that do not need to be there, as well as tuning slabs and threads based on core count. This will vary by organization and requires getting detailed zone and client statistics. An example of a junk zone would be one used by a corporate spy to exfiltrate data over DNS, resulting in hundreds of thousands of unique A record lookups carrying customer or intellectual property data outbound, or TXT records inbound that are actually encrypted malware and instructions.

Bot scripts will bypass recursive DNS servers and typically talk directly to authoritative servers. Most of these scripts do not even look at zone or resource record TTLs, which is unfortunate for them, as it makes spotting them trivial. I then dig deeper into the networks they are originating from, find their IPv4/IPv6 CIDR blocks, who they peer with, and what business they claim to be. Sometimes they are squatters announcing routes from businesses that went under and had laid off the people who would have released the IP allocations. I help get those clawed back when I can, taking the IP allocations away from the squatters.
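To make the "spotting them trivial" part concrete, here is a rough sketch of that heuristic, assuming you already parse your authoritative query log into (timestamp, client IP, qname, qtype) tuples; the record name, TTL, and threshold below are made up for illustration:

    # Hypothetical sketch: flag clients that re-ask the same question long
    # before the published TTL could have expired, i.e. clients that are not
    # doing normal caching at all.
    from collections import defaultdict

    PUBLISHED_TTL = {("www.example.net.", "A"): 86400}  # what the zone actually serves

    def flag_ttl_ignorers(log_entries, min_repeats=3):
        last_seen = {}
        violations = defaultdict(int)
        for ts, client, qname, qtype in sorted(log_entries):
            key = (client, qname, qtype)
            ttl = PUBLISHED_TTL.get((qname, qtype))
            if ttl and key in last_seen and ts - last_seen[key] < ttl:
                # A well-behaved cache would not need to ask again yet.
                violations[client] += 1
            last_seen[key] = ts
        return {client for client, count in violations.items() if count >= min_repeats}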

Side project idea for Google / Cloudflare / OpenDNS / etc.: these companies have a unique view into DNS traffic and could also help spot the squatters. They could free up a significant amount of IP space if they allocated a small team of interns to analyze their traffic, use automation to build graphs and reports, and then, once confidence is high enough, submit that data to all the IP registries. Registries are also far more likely to take communication from these companies seriously than from some retired hobbyist.

This will probably come up, but some may say having high TTLs is risky. It can be, but not for me. If all of the internet shared one massive /etc/hosts and DNS ceased to exist, it would rarely have updates from me, and if a record went stale there would be no harm, no foul as it pertains to me. There are a myriad of other reasons I have unusually high zone/record TTLs, but that would turn into a blog post and I have been too lazy as of late to make any, as interest is usually very low. I don't even have my blog VMs spun up.


I aspire to be you when (if) I retire.


A recursive resolver is not a database. It's under no obligation to cache for a day.


That doesn’t prevent them from respecting the TTL values and returning them to downstream clients. They might not keep the record cached for a day, but that doesn’t mean the end user can’t keep it cached. But of course that only works if Google respects the provided TTL and doesn’t rewrite it to something much lower.


Okay stupid question… I thought TTL was a count of hops, not time in seconds. Or do people use it differently where instead of subtracting 1, their routing devices subtract n seconds?


TTL is a general term standing for "time to live". In the context of an IP packet it's a number of hops. In the context of a cache, it's a time until expiration.


IP is weird.

TTL is still in seconds, but every hop has to decrease it by at least one.

I'm not sure if there is any implementation out there that cares, but if e.g. wifi retries or a massive buffer queue leads to a packet spending more than a second on a hop, it should decrease by two.


TTL for a DNS response is in seconds, and is intended to be the maximum time a value is cached. If you request an address for a name that the server you are talking to is authoritative for, then you will always get the TTL set in the DNS records. If your request goes to a caching name server that is not authoritative, then you may get a lower value; in fact, you most likely will. If the intermediate server makes a new request upstream it can, as per the spec, just pass on the TTL it gets¹, but some seem to drop one second just in case, even though it isn't really needed².

If an intermediate server handed out the original TTL for all requests then the TTL would be effectively multiplied by each caching layer³, which is why you will only get the true TTL at the client end if a fresh request was actually made to the authoritative server(s).

--

[1] the fraction of a second difference caused by latency isn't likely to be significant for either long or short TTLs

[2] and on a really high latency link (like back when GPRS or POTS landlines were common, or some links even now in deep rural areas) 1s might not be enough anyway for the benefit they think they are giving

[3] so at my home, where lookups go PiHole->8.8.8.8->authoritative, it would make TTLs potentially up to 2x their intended length. Example: for a 1000s TTL, if PiHole gets the lookup request with 1s remaining in its cache but hands out the original TTL, my client will not re-check for another 1000s, potentially 999s late. Depending on the sequence of related requests to each DNS server from other clients, that could become 1998s in my case, longer if there were more caching layers.
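Footnote 3 boils down to whether a cache hands out the time remaining or the original TTL. A tiny illustrative sketch of the correct behavior (not any particular resolver's code):

    import time

    class CacheEntry:
        def __init__(self, value, original_ttl):
            self.value = value
            self.expires_at = time.time() + original_ttl

        def remaining_ttl(self):
            # Hand downstream clients only the time left. Handing out the
            # original TTL instead restarts the clock at every caching layer,
            # which is what multiplies the effective lifetime.
            return max(0, int(self.expires_at - time.time()))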


Derp. Right. We’re talking about caching.


Wondering what prompted the blog post. The recent publication of RFC 9539?

It would be interesting to hear how often the Google DNS servers see attempts to poison their cache. The mitigations probably prevent folks from even trying, but the numbers would be interesting.

The OARC 40 presentation PDF mentions that cookie deployment is low for large operators but that open source software has compliant implementations. Are large operators writing their own DNS servers, but badly? I would think there wouldn't be many custom implementations, and that you would be able to detect which software nameservers are running, each with known capabilities. But from the way the numbers are presented it seems they only look at behaviour without considering software (versions).


> Are large operators writing their own dns servers, but badly?

Yes. It is trivial to build a DNS server, and near impossible to write a correct DNS server. Eventually your organization gets large enough that someone thinks it is a good idea without understanding the implications.

> you would be able to detect which software nameservers are running, each with known capabilities

There are pseudo-standards for asking an authoritative server what software it is running, but everyone turns that off because somehow it makes you "more secure." What you end up having to do is probe auth servers by replaying user queries on the side, measuring if the responses you get are correct, and then keeping a database somewhere of which servers support which flags.
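A rough sketch of what such a side probe can look like, using the third-party dnspython library; this only checks one capability (whether the server echoes the question name's exact casing), and the server address and name are placeholders:

    import dns.message
    import dns.query
    import dns.rdatatype

    def echoes_question_case(auth_server_ip: str, mixed_case_qname: str) -> bool:
        # Send a query whose name already has randomized 0x20 casing and check
        # whether the authoritative server copies it back byte for byte.
        query = dns.message.make_query(mixed_case_qname, dns.rdatatype.A)
        response = dns.query.udp(query, auth_server_ip, timeout=2.0)
        return response.question[0].name.to_text() == query.question[0].name.to_text()

    # e.g. echoes_question_case("192.0.2.53", "wWw.ExAmPlE.cOm.")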


I'm a little surprised Google only implemented case randomization by default in 2022 considering it's been around since 2008. Presumably they had concerns about widespread compatibility? Although my understanding is that for a lot of DNS servers it just worked without any specific implementation effort...but maybe there was a long tail of random server types Google was concerned about.
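For anyone unfamiliar with it, the mechanism really is that small: the resolver randomizes the case of each letter in the query name and only accepts responses that echo that exact casing, since an off-path spoofer has to guess every bit. A minimal sketch (illustrative, not Google's implementation):

    import random

    def randomize_case(qname: str) -> str:
        # Flip the 0x20 bit of each ASCII letter at random ("dns0x20").
        return "".join(c.upper() if random.getrandbits(1) else c.lower() for c in qname)

    def response_matches(sent_qname: str, echoed_qname: str) -> bool:
        # A legitimate server copies the question section back verbatim, so the
        # casing matches exactly; a blind spoofer almost never gets it right.
        return sent_qname == echoed_qname

    sent = randomize_case("www.example.com")   # e.g. "wWw.exAmPLe.CoM"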


This is so weird to see. Just this morning I was checking through my public authoritative NS query logs and noticed the random capitalization. I had also noticed this in a similar work environment around the end of 2023, but attributed it to people just doing DNS wordlist bruteforcing to find stuff (I couldn't explain the case, but figured it was some evasion).

Today I let my curiosity dive deeper and quickly found the IETF publication on 0x20 encoding and this article.

Just odd to see others post it to HN on the same day... coincidences are weird.


Weird that they don't even mention that Google's public DNS performs DNSSEC validation. It does, and that's the ultimate defense against cache poisoning attacks.


No, it obviously isn't, because less than 5% of zones are signed, and an even smaller fraction of important zones (the Moz 500, the Tranco list) are.

That's why they don't mention DNSSEC: because it isn't a significant security mechanism for Internet DNS. It's also why they do mention ADoT: because it is.

I think this is also why DNSSEC advocates are so fixated on DANE, which is (necessarily) an even harder lift than getting DNSSEC deployed: because the attacks DNSSEC were ostensibly designed to address are now solved problems.

Note also that if ADoT rolls out all the way --- it's already significantly more available than DNSSEC! --- there won't even be a real architectural argument for DNSSEC anymore, because we'll have end-to-end cryptographic security for the DNS.

Thanks for calling this out! That Google feels case randomization is more important than DNSSEC is indeed telling.


> there won't even be a real architectural argument for DNSSEC anymore

ADoT relies on the NS records being DNSSEC-signed.

The TLS certificates that ADoT relies on need to be hashed into TLSA records (DANE, DNSSEC).



And? The IETF RFC draft for ADoT still specifies that it relies on DNSSEC.

https://datatracker.ietf.org/doc/draft-dickson-dprive-adot-a...


Check out the DPRIVE working group (ekr's comments in particular) for some of the backstory. DNSSEC isn't happening either way, but I think ADOX might.


I think this is a bit of an apples and oranges moment.

While I agree transport confidentiality is important, that is not what DNSSEC solves, nor should you see people claiming that it solves confidentiality.

DNSSEC protects the transport integrity of DNS responses. DNSSEC-enabled zones 100% defeat on-path cache poisoning attacks against recursive resolvers that validate DNSSEC. Full stop.

ADo"X" protects the transport confidentiality of DNS responses. I suppose this "weakly" protects the transport integrity of DNS responses, but, again, the primary purpose is confidentiality.

RFC 9539 clearly states that it specifies opportunistic encryption. An on-path attacker will find it trivial to disable encryption and start poisoning caches. The RFC states in multiple places that if TLS setup fails, the resolver falls back to plaintext DNS.

If DNSSEC fails, an on-path attacker has no similar recourse. A properly configured recursive resolver will SERVFAIL and _never_ send back a potentially poisoned response to clients for DNSSEC signed zones.


> the attacks DNSSEC were ostensibly designed to address are now solved problems.

Maybe you can help figure out where you went wrong here by explaining what, in your understanding, were the problems that DNSSEC was "ostensibly designed to address"?

> because we'll have end-to-end cryptographic security for the DNS.

In 1995 a researcher who was annoyed about people snooping his passwords over telnet invented a protocol (and gave away a free Unix program) which I guess you'd say delivers "end-to-end cryptographic security" for the remote shell. Now, when you go into a startup and you find they've set up a bunch of ad hoc SSH servers and their people are just agreeing to all the "Are you sure you want to continue ..?" messages, do you think "That's fine, it's end-to-end cryptographic security" ? Or do you immediately put that on the Must Do list for basic security because it's an obvious vulnerability ?


Many years ago, when HTTPS adoption was in the 5% range, if someone had said “HTTPS is the ultimate defense against web page spoofing”, would you have argued that it was false, just because the adoption was so low?

DNSSEC is the ultimate defense against cache poisoning attacks, no matter the adoption percentage.


You should go tell Google that.


Google is aware and has enabled DNSSEC on their recursive resolvers for a long time.

Unfortunately, most people do not DNSSEC-sign their zones, so Google has to resort to also enabling 0x20, which is helpful, but also (to an extent) security theater.


"People" in this case includes Google, which does not sign its zones.


The needs and risk profiles of Google are vastly different than basically every other organization on Earth.

(I also note that you didn’t answer my question, and instead opted for a rhetorical cheap shot reply to my second paragraph only.)


Never mind. It's OK. I don't think we're speaking the same language, so to speak.


I think your posting style of making confident statements, but responding to counter-arguments with rhetorical cheap shots and condescending non-answers, is unsuitable for HN.


I don't think personal attacks are helping your case, Teddy.


Criticizing specific behavior as being unsuitable for HN is not a personal attack.


For end users, TLS is the key protection. I don't care if my DNS is poisoned, MITMed, or malicious: if the IP address I connect to can't present a valid TLS cert, then I don't proceed.

If you can't securely authenticate your server (as HTTPS/TLS does) you have other problems too.


Unfortunately it's quite easy for some actors to get a valid TLS cert https://notes.valdikss.org.ru/jabber.ru-mitm/


Quite astonishing that someone managed to get valid certs from Let's Encrypt for domains that they didn't own. Has Let's Encrypt issued any statements about how this might have happened, and how those specific certificates were validated by them?

Still, good to see that monitoring the CT logs would have caught this problem much sooner.


As far as I remember from when I last read that article, it was a police-requested MiTM by the hosting provider. LetsEncrypt did a standard challenge (requesting http://webroot/.well-known/something) and the MiTM responded appropriately. This isn't really a problem with LE: if you can control the HTTP response to all outside servers, it's fair to say that you control the domain and should have the cert. Bad on the hosting provider for doing so? Maybe, but there is no way for LE to know.


DNS is where the web would have been if browsers hadn't basically forced websites to support HTTPS.

The reasoning is that DNS is not important enough to go through the trouble of deploying DNSSEC. These days TLS is often cited as the reason DNSSEC is not needed.

At the same time we see a lot of interest in techniques to prevent cache poisoning and other spoofing attacks. Suddenly in those cases DNS is important.

If all DNS client software dropped UDP source port randomization and randomized IDs, lots of people would be very upset, because DNS security is more important than claimed.

DNS cookies are also an interesting case. They can stop most cache poisoning attacks. But from the Google article, big DNS authoritatives do not deploy them.


The key thing to note is that anti-poisoning countermeasures deployed at major authority servers scale to provide value without incurring cost for every (serverside) Internet "user", and DNSSEC doesn't. A lot of these things seem like (really, are) half-measures, but their cost/benefit ratio is drastically different, which is why they get rolled out so quickly compared to DNSSEC, which is more akin to a forklift upgrade.


There is no "quickly" in the Google article. It took them ages to roll out 0x20. Cookies are not very well supported. And then the elephant in the room is the connection between the stub resolver and the public DNS resolvers.

The interesting thing is what happens when BGP is used to redirect traffic to DNS servers: https://www.thousandeyes.com/blog/amazon-route-53-dns-and-bg...


Did it take 25 years? That's the baseline. :)


If we didn't have the web, all networking above OSI L4 on all operating systems would have been encrypted by default. A simple set of syscalls and kernel features could have enabled it. But since the web was there, and popularized a solution for secure communications (TLS + HTTP), everyone just jumped on that bandwagon, and built skyscrapers on top of a used book store.

The weird irony is it's the old "worse is better" winning again. HTTP and TLS are fairly bad protocols, in their own ways. But put them together and they're better than whatever else exists. It's just too bad we didn't keep them and ditch the browser.


Assuming you are talking about IPSEC, that uses a model that is very hard to deploy.

The problem is that applications typically use TCP connections, but IPSEC works at the IP level. Early on, the (BSD socket) kernel API was basically fixed at the IP level instead of associating it with a TCP socket.

So the whole thing became too complex (for other reasons as well), and SSL and SSH were created as simple things that just worked.

SSL took many iterations to get any kind of security, so IPSEC had plenty of time to get it right and take over. But as far as I know, that just never happened. It also doesn't help that TLS is trivial to combine with NAT, while for IPSEC that is quite tricky.


Can you articulate what you believe is bad about TLS?


Isn't the Linux kernel at least very unhappy with the idea of adding encryption logic inside, especially in a way that exposes it to user space?


Just guessing but it could be the lack of adoption. Despite having climbed rapidly in the last few years [0] the percentage is still very low. [1]

[0] - https://www.verisign.com/en_US/company-information/verisign-...

[1] - https://www.statdns.com/


The low adoption of DNSSEC might be due to posts like these:

https://news.ycombinator.com/item?id=36171696 - Calling time on DNSSEC: The costs exceed the benefits (2023)

And also many news regarding validation failures:

https://hn.algolia.com/?q=dnssec


The rabbit hole on people gradually pulling up stakes on DNSSEC goes deeper than that; I'd say the canary in the coal mine is probably Geoff Huston switching from "of course we're going to DNSSEC everything" to "are we going to DNSSEC anything?":

https://www.potaroo.net/ispcol/2023-02/dnssec.html

(Geoff Huston is an Internet infrastructure giant.)

But really it all just boils down to the fact that the DNS zones that matter --- the ones at the busy end of the fat tail of lookups --- just aren't signed, despite 25 years of work on the standard. IPv6 is gradually mainstreaming; in countries where registrars auto-sign zones, DNSSEC is growing too, but very notably in countries where people have a choice, DNSSEC deployment is stubbornly stuck in the low single digit percentages, and the zones that are getting signed are disproportionately not in the top 10,000 of the Tranco list.


I urge everyone to actually read the blog post by Geoff Huston. From what I can tell, it does not say what tptacek says it does.

Some other counterpoints to general DNSSEC doomsayers:

• <https://blog.technitium.com/2023/05/for-dnssec-and-why-dane-...>

• <https://www.redpill-linpro.com/techblog/2019/05/06/sshfp-and...>


You keep citing these two random blog posts to me. I've never quite understood why you think they're such a mic drop. "Shreyas Zare, who develops software part-time as a hobby" thinks I'm all wrong. OK? Did Shreyas Zare connect Australia to the Internet before spending 15 years advocating for DNSSEC as the chief scientist at APNIC? I think my Pokemon wins here.


I’m not citing them to you. I’m citing them to other readers here who do not share your attitude of dismissing arguments unless they’re made by someone in authority, in which case you proclaim them valid and coincidentally supporting your viewpoint (even when they don’t; again, I urge readers to actually read Geoff Huston’s blog post for themselves). You’re literally making an argument from authority and an ad hominem argument at the same time.

Edit: Is this authoritative enough for you? <https://www.icann.org/resources/pages/dnssec-what-is-it-why-...>


I don't think we're playing the same game here. This is a random info page at ICANN from 2019. It doesn't even have a byline. I had to go to archive.org to figure out when it showed up. Why would you think this would be persuasive?


I assumed that you would consider ICANN an authority, but apparently you only consider named people to be authoritative and capable of making any argument of merit? Your mind is truly fascinating.


This is like saying that the DNSSEC working group endorses DNSSEC, Teddy. Like, yes, I agree that they do, but it's not interesting to point that out. It is in fact interesting that Geoff Huston is entertaining questions about the success of the protocol, because he's a major DNSSEC advocate and a globally recognized authority on core Internet infrastructure and, in particular, DNS measurement.

This is what I mean when I say we're not really talking to each other. I don't think you understand or care about the argument I'm making, and so you're not engaging with it. That's fine! But then: let's just stop engaging.


I’m not interested in whether ICANN or Geoff Huston is or is not endorsing DNSSEC. I’m interested in what arguments they make for and/or against DNSSEC. You, on the other hand, seem to have some electoral college model, where you don’t consider facts and arguments at all, but only care how many people in authority are for or against it. This is why you namedrop all the time, and dismiss my (and others’) arguments as coming from “randos”.


No. I've spent 16 years discussing DNSSEC in detail on this site, as the search bar will aptly show.


If you’ve ceased to be willing to discuss and argue for your opinions, your presence here is now unproductive.


I first heard about this 0x20 scheme around 2015 when I was working on a DNS cache (also at Google, but not for the public DNS team). I noticed, and had to work around, the fact that some servers were responding in vixie-case even when the requests were not. Those servers would be broken if the requesters were paying attention to 0x20, right? I wonder what software was doing that.


The idea is that the longer your domain name, the less susceptible it is to cache poisoning attacks, right? Since there are more possible case variations.
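Roughly, yes: each ASCII letter in the query name adds one bit an off-path attacker has to guess, on top of the transaction ID and source port. A quick illustration (names chosen only because they come up elsewhere in this thread):

    def extra_case_entropy_bits(qname: str) -> int:
        # One bit per letter; digits, dots and hyphens have no case to flip.
        return sum(c.isalpha() for c in qname)

    print(extra_case_entropy_bits("g.co"))             # 3 bits
    print(extra_case_entropy_bits("www.example.com"))  # 13 bits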


I updated https://crates.io/crates/dns-server to support case randomization.


Good to hear the world's infrastructure relies on a kludge like case randomization.


That's how you maintain backwards compatibility while improving security. I'm not sure lamenting the imperfection is valuable, but it is a worthwhile lesson for those designing new protocols.

If it becomes popular enough you will certainly face future security challenges you failed to even imagine. Leave some room for that.

Otherwise, this is great work.


Longer domain names are more secure!


And conversely, short domains like Google’s own g.co are less secure!




