
Try searching for thunderbolt lto.

I know MagStor has one with usb-c presentation.


LTO9 is like 45TB for <$100 (I got a bunch for €55 a piece), so 4.5TB for <$10 is being generous. And even if you didn't think they lasted 30-40 years and made copies every 3 years, it's still cheaper, not to mention you have fewer tapes to manage.

Also: I don't have a bd/dvd player in my house today, so even if there are the most tremendous gains in medical sciences I'm almost certainly not going to have one in 100+ years, so I'm not sure M-DISC even makes cost-sense for smaller volumes.

Maybe if you want to keep your data outside in the sunshine like the author of the article, but that's not me...


LTO-9 tapes are actually 18TB, but yes they are a lot cheaper than optical discs. If you can afford the drive.

> so even if there are the most tremendous gains in medical sciences I'm almost certainly not going to have one in 100+ years

Never say never. People of today are building "90s entertainment center" setups for nostalgia, complete with VCRs. Given how many generations of game consoles had DVD drives (or BD drives that supported DVDs) in them, I would fully expect the "retro gaming" market of 100 years from now to be offering devices that can play a DVD.


> Also: I don't have a bd/dvd player in my house today

You have just stumbled on the inherent problem with any archival media.

You really think you will have a working tape drive after 40 years?

Hell, in my experience tape drives are mechanically complex and full of super thin plastic wear surfaces. Do you really expect to have a working tape drive in 10 years?

As far as I can tell there is no good way to do long-term static digital archives, and in the absence of that you have to depend on dynamic archives: transfer to new media every 5 years.

I think for realistic long-term static archives the best method is to depend only on the mark 1 eyeball. Find your 100 best pictures and print them out. Identify important data and print it out. Stuff you want to leave to future generations, make sure it is in a form they can read.


I do think LTO is a common enough format, and explicitly designed to be backwards-compatible, that it is very likely to be around in 10 years. The companies that rely on it wouldn't invest in it if they didn't think the hardware would be available. 40 years is harder to say, but as someone who owns a fair bit of working tape equipment (cassette, VHS, DV) that is almost all 25+ years old, I wouldn't think it'd be impossible.

That said, I imagine optical drives will be much the same.


It is only backwards compatible two generations; occasionally something slips at the LTO Consortium (or wherever those things are designed) and you get three generations. But if I have a basement full of LTO1 tapes, no currently manufactured drive will read them. I would have to buy a used drive, and the drives were never really made all that well. Better than the DAT drives one company I worked for used for some of their backups, but still mechanically very complex, with many small delicate plastic parts that wear out quickly. Those DAT drives were super delicate and also suffered from the same generational problems LTO does. We had a bunch of DAT1 tapes somebody wanted data from, but no working drives to read them; all our working drives were newer DAT3 and DAT4.

That was always the hard part of justifying tape backup: the storage is cheap, but the drives are very expensive. And they never seemed to last as long as their price would warrant.


That also changed somehow... LTO-10 drives are not backward compatible and can only read/write LTO-10 media.

That is because LTO-10 had to make an incompatible change to go from 18TB to 30TB.

For LTO tapes? Yes they will be available since the format is so common.

LTO9 is only 18TB.

The LTO compression ratio is theoretical, and most people's data will not benefit from the native LTO compression method.


3 years is way overkill. 10 years is more reasonable.

> I can’t think of a single time I’ve needed a sorted list of only numbers.

Gosh. Let me try to convince you.

I use permutation arrays all the time: lists of indexes that can be used across multiple vectors.

This is much faster than the pattern of scanning rows, constructing tuples of (thingToSort . thingIWantInThatOrder), making a custom sort function, and destructuring those tuples...

And really, not having to write custom sort functions is really really nice.
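A rough Python sketch of the pattern (names are mine):

```python
# One permutation array orders many parallel vectors at once.
times  = [30, 10, 20]          # thingToSort
labels = ["c", "a", "b"]       # thingIWantInThatOrder

# The "grade": indexes that would put times in ascending order.
perm = sorted(range(len(times)), key=times.__getitem__)   # [1, 2, 0]

# Apply the same permutation to any vector of the same length.
sorted_times  = [times[i]  for i in perm]   # [10, 20, 30]
sorted_labels = [labels[i] for i in perm]   # ["a", "b", "c"]

# No custom sort function, no tuple construction/destructuring,
# and the median falls out of the same array:
median = times[perm[len(perm) // 2]]        # 20
```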

> Especially in telemetry, where mean is easy and median is not.

Funny. Yes median is obvious with a permutation array, and maybe mean is less so.

When your data is really big and not very variable, mean of x is roughly the same as the mean of any sufficient sample of x, and that sample can be meaningfully represented as a permutation array!

You can get such an array with reservoir sampling and some maths, and (depending on what you know of your data and variance) sometimes even simpler tricks.

That's kind of actually how the "faster than dijkstra" trick referred to in the article works: data sets with small variance have this same property, that the min of x is roughly the same as the min of a sufficient sample of x (where the size of "sufficient" has to do with the variance). And so on.
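Something like this (my own sketch; Algorithm R reservoir sampling, with an assumed sample size of 1,000):

```python
import random

def reservoir_indices(n, k, rng=random.Random(0)):
    """Keep k uniformly chosen indices out of 0..n-1 in one pass (Algorithm R)."""
    sample = list(range(k))
    for i in range(k, n):
        j = rng.randrange(i + 1)
        if j < k:
            sample[j] = i          # replace a slot with probability k/(i+1)
    return sample

data = [float(i % 100) for i in range(100_000)]   # big-ish, low-variance data
idx = reservoir_indices(len(data), 1_000)          # a small index array

approx_mean = sum(data[i] for i in idx) / len(idx)
true_mean   = sum(data) / len(data)                # 49.5
# approx_mean lands close to true_mean precisely because the variance is small
```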

Another big use-case in my code is trees: Apter trees have a flat memory layout which is convenient for permutation arrays which can simultaneously represent index, rotation, tombstones, and all sorts of other things you might need to do with a tree.
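For anyone unfamiliar with the flat layout, it's roughly this (my sketch of the idea: two parallel vectors, with the root as its own parent):

```python
# An Apter-style tree: parallel flat vectors, data[i] and parent[i].
# The root points at itself. A permutation array over node indexes can
# then reorder or tombstone the whole tree without chasing pointers.
data   = ["root", "a", "b", "a1", "a2"]
parent = [0,      0,   0,   1,    1   ]   # nodes 3 and 4 hang off node 1 ("a")

def children(node):
    # scan the parent vector; skip the root's self-reference
    return [i for i, p in enumerate(parent) if p == node and i != node]

children(0)  # [1, 2]
children(1)  # [3, 4]
```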

Give it a dig. There's good stuff in there.


> How is not having a message-id a security risk?

CVEs classify a lot of things that have nothing to do with security.

Not having a Message-ID can cause problems for loop-detection (especially on busy netnews and mailing lists), and with reliable delivery status notification.

Dealing with these things for clients who can't read the RFC wastes memory and time, which can potentially deny legitimate users access to services.
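Roughly the kind of thing an edge node wants to do (a deliberately simplified sketch, not Gmail's actual policy):

```python
# Why loop-detection wants a stable Message-ID: a relay that has seen an
# id before can drop the duplicate instead of forwarding it around a loop.
seen = set()

def accept(headers):
    msgid = headers.get("Message-ID")
    if msgid is None:
        return "defer"      # nothing stable to deduplicate on
    if msgid in seen:
        return "drop"       # already relayed once: probable loop
    seen.add(msgid)
    return "relay"

accept({"Message-ID": "<1@example.org>"})  # "relay"
accept({"Message-ID": "<1@example.org>"})  # "drop"
accept({})                                 # "defer"
```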

> It seems that Gmail is being pedantic for no reason

Now you know that feeling is just ignorance.


Well, gmail does not manage usenet groups and mailing lists. Delivery status notifications are considered best effort so it wouldn't make sense to block messages for that case.

Additionally, Gmail adds its own message identifier on every message (g-msgid) because it knows that message ids can not be trusted to be unique.

Finally just calling me ignorant is the cherry on top – please try to keep things civil on here.


> Well, [google] does not manage usenet groups and mailing lists.

They do. Sort of.

Google used to run an NNTP service and manages the largest usenet archive; they still have one of the largest mailing list servers in the world, and they still perform distribution on those lists via SMTP email.

They still have all of the problems associated with it, as do lots of other mail/news/list sites that are a fraction of Google's size.

> Delivery status notifications are considered best effort so it wouldn't make sense to block messages for that case.

Sure it does.

You consider them best-effort, but it doesn't follow that I should consider them best-effort. For a simple example: consider spam.

In any event, if you keep sending me the same message without any evidence you can handle them, I'm not going to accept your messages either, because I don't know what else you aren't doing. That's part of the subtext of "SHOULD".

Most big sites take this policy because it is internal nodes that will generate the delivery notification, but the edge nodes that are tasked with preventing loops. If the edge node adds a Message-ID based on the content, it'll waste CPU and possibly deny service; if the edge node naively adds a Message-ID like an MSA, the origin won't recognise it, and forwarded messages can loop or (if sent to a mailing list) be amplified. There are also other specific documented requirements related to Internet Mail that edge nodes not do this (e.g. RFC2821 § 6.3).

However you seem to be assuming Google is blocking messages "for this case" which is a little presumptuous. Google is presumably trying to save themselves a headache of handling errors for people who aren't prepared to do anything about it, the most common of which is spam. And the use of Message-ID in this application is documented at least as early as RFC2635.

> Additionally, Gmail adds its own message identifier on every message (g-msgid) because it knows that message ids can not be trusted to be unique.

Without knowing what Google does with the g-msgid header, you are making a mistake to assume it is equivalent to the Message-ID header just because it has a similar name. You have no reason to believe this is true.

> Finally just calling me ignorant is the cherry on top – please try to keep things civil on here.

I am sorry you are offended to not know things, but you do not know this thing, and your characterising my actions will make it very difficult for you to learn something new and not be so ignorant in the future.

Think hard exactly about what you want to happen here: Do you want Google (et al) to do something different? Do you want me to agree Google should? Who exactly are you trying to convince of what, and to what end?

I am trying to tell you how to interpret the documentation of the Internet, in this case to be successful sending email. That's it.

I am not likely to try and tell Google what to do in this case because of my own experiences running mail servers over the last 30 years, but listen: I am willing to be convinced. That's all I can do.

If it's something else, I'm sorry I just don't understand.


> If it's something else, I'm sorry I just don't understand.

I'm trying to explain to you that you're speaking very authoritatively about how things were done 30 years ago, but things have changed since then: Gmail won't even send a non-delivery notification to non-DKIM hosts.

Same thing with the g-msgid I'm telling you about – google documented it explicitly as a unique identifier, see here for example: https://developers.google.com/workspace/gmail/imap/imap-exte...

So yeah, things changed.


[flagged]


sic transit gloria mundi..

So add a message id at the first stop, or hard ban the sender server version until they confirm. A midway point that involves a doom switch is not a good option.

> So add a message id at the first stop

That should have already happened. Google is not the "first stop".

> hard ban the sender server version until they confirm

SMTP clients do not announce their version.

Also I don't work for you, stop telling me what to do.

> A midway point that involves a doom switch is not a good option.

No shit. That's almost certainly a big part of why Google blocks messages from being transited without a Message-ID.


> isn't a valid excuse to reject a client either.

Yes it absolutely is: https://www.rfc-editor.org/rfc/rfc2119 is quite clear.

    3. SHOULD   This word, or the adjective "RECOMMENDED", mean that there
       may exist valid reasons in particular circumstances to ignore a
       particular item, but the full implications must be understood and
       carefully weighed before choosing a different course.
If the client SHOULD do something and doesn't, and your server does not know why, you SHOULD disconnect and move on.

If the server has considered fully the implications of not having a Message-ID header, then it MAY continue processing.

In general, you will find most of the Internet specifications are labelled MUST if they are required for the protocol's own state-processing (i.e. as documented), while specifications are labelled SHOULD if they are required for application state-processing in some circumstances (i.e. other users of the protocol).


> If the client SHOULD do something and doesn't, and your server does not know why, you SHOULD disconnect and move on.

That is not a rule.

In this situation the server can reject any message if it wants to, and not doing a SHOULD tests the server's patience, but it's still ultimately in the "server wanted to" category, not the "RFC was violated" category.


[flagged]


You are confused about what I'm doing. I'm not telling anyone what to do. I'm saying what category their actions fall into.

And the line of yours I quoted is still not supported by anything.


> You are confused about what I'm doing.

Absolutely.

And if you're not confused by what I said, that's not obvious in the slightest.

> I'm not telling anyone what to do.

So you say.

> I'm saying what category their actions fall into.

I think it's pretty weird that you think you get to decide that all by yourself.

But I'm not playing, because you're right: I don't know why you would be doing that.

> And the line of yours I quoted is still not supported by anything.

Yes it is. It's a description of the behaviour of other Internet hosts, and it's a description of exactly what is happening in the linked article.


> But I'm not playing, because you're right: I don't know why you would be doing that.

This is the comment you originally replied to:

>> If you're implementing a server, "the client SHOULD but didn't" isn't a valid excuse to reject a client either.

>> You can do it anyway, you might even have good reasons for it, but then you sure don't get to point at the RFC and call the client broken.

They're talking about how to categorize actions, just like I am.

So I thought you were playing on that same topic.

But if you weren't on that topic, and given that you quoted only the first one of those sentences, I have a guess.

I think you didn't realize how the second sentence affects the meaning of the first one, and you misunderstood what they were saying as trying to tell servers they can't reject. They were not trying to tell servers they can't reject.

If that's the case, then this whole line of conversation is pointless, because you were rebutting an argument that nobody made.

> It's a description of the behaviour of other Internet hosts, and it's a description of exactly what is happening in the linked article.

Taking a description of behavior and throwing an RFC-style "SHOULD" in front is only going to be correct if you get lucky. "is" and "SHOULD" are different things!


That clearly means it’s not required.

How does Google know whether or not the sender has a valid reason? They cannot know that, so for them to reject an email for it means they would reject emails that have valid reasons as well.


How would the sender know the consequences of sending without the header? You shouldn’t assume anything here. As a sender, you should include it unless you’ve already worked out what the recipient is expecting or how it will be handled. Doing this with email is silly because the client is sending to so many different servers it knows nothing about, so it’s basically a requirement to include it.

> That clearly means it’s not required.

You and I have different definitions of "clearly".

It is not required for the protocol of one SMTP client sending one message to one SMTP server, but it is required for many Internet Mail applications to function properly.

This is one, for example: if you want to send an email to some sites, you are going to need a Message-ID, so you SHOULD add one if you're the originating mail site.
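In Python, for instance, the stdlib will mint one for you at the originating site (the addresses and domain here are mine, just for illustration):

```python
from email.message import EmailMessage
from email.utils import make_msgid

msg = EmailMessage()
msg["From"] = "sender@example.org"
msg["To"] = "rcpt@example.net"
msg["Subject"] = "hello"
# The originating MSA/MUA mints the id, keyed to its own domain,
# so downstream relays never have to invent one.
msg["Message-ID"] = make_msgid(domain="example.org")
msg.set_content("hi")

msg["Message-ID"]  # something like "<1755...@example.org>"
```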

> How does Google know whether or not the sender has a valid reason?

If the Sender has a valid reason, they would have responded to the RFC (Request For Comments) telling implementers what they SHOULD do, rather than do their own thing and hope for the best!

Google knows the meaning of the word SHOULD.

> it means they would reject emails that have valid reasons as well.

No shit! They reject spam for example. And there's more than a few RFCs about that. Here's one about spam that specifically talks about using Message-ID:

https://datatracker.ietf.org/doc/html/rfc2635


> If the server has considered fully the implications

The server "considers" nothing. The considerations are for the human implementers to make when building their software. And they can never presume to know why the software on the other side is working a certain way. Only that the RFC didn't make something mandatory.

The rejection isn't to be compliant with the RFC, it's a choice made by the server implementers.


Either the server must explicitly confirm to servers or the clients must accept everything. Otherwise message delivery is not guaranteed. In the context of an email protocol, this often is a silent failure which causes real-world problems.

I don’t care what the protocol RFC says, the client arbitrarily rejecting an email from the server for some missing unimportant header (for duplicate detection?) is silly.


If it was unimportant it would be MAY.

Is the server somehow unable to inject an ID if the sender did not send one? Stop hiding behind policy and think for yourself.

> Is the server somehow unable to inject an ID if the sender did not send one?

Yes. https://www.rfc-editor.org/rfc/rfc2821#section-6.3 refers to servers that do this and says very clearly:

    These changes MUST NOT be applied by an SMTP server that
       provides an intermediate relay function.
That's Google in this situation.

> Stop hiding behind policy and think for yourself.

Sometimes you should think for yourself, but sometimes, and friend let me tell you this is one of those times, you should take some time to read all of the things that other people have thought about a subject, especially when that subject is as big and old as email.

There is no good reason viva couldn't make a Message-ID, but there's a good reason to believe they can't handle delivery status notifications, and if they can't do that, they are causing bigger problems than just this.


You want me to think for myself when writing an email server that interoperates with other email servers? Are you just clueless?

That's some GNU bash shenanigans. There is no /dev/tcp in unix.

Lots of shops didn't have gnu installed: telnet was what we had.
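For anyone who hasn't seen the bash-ism in question, a rough sketch (the loopback server and port number are mine, just so the demo needs no network; against a real mail server it'd be `exec 3<>/dev/tcp/host/25`):

```shell
#!/usr/bin/env bash
# /dev/tcp/HOST/PORT is interpreted by bash itself inside redirections;
# no such path exists on the filesystem, and plain POSIX sh won't do this.

# throwaway local server (python3 assumed present)
python3 -c '
import socket
s = socket.socket()
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.bind(("127.0.0.1", 39123)); s.listen(1)
c, _ = s.accept(); c.sendall(b"hello\n"); c.close()
' &
sleep 1

exec 3<>/dev/tcp/127.0.0.1/39123   # bash opens the TCP connection on fd 3
read -r line <&3                    # read one line from the socket
echo "$line"
exec 3>&-                           # close it
wait
```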


⌥- produces a – as well. That's sometimes easier than typing `--` and hoping for the best.

That's an en-dash. You want to also hold shift to make it an em-dash.

oh cool —–—– ——— ——— –—––

cheers for that, never even noticed


I think the problem is: what is an image?

I made an attempt to enumerate them[1], and whilst I caught this issue with feImage over a decade ago simply by observing that xlink:href attributes can appear anywhere, Roundcube also misses srcset="" and probably other ways, so if the server "prefetched every image" it knew about using the Roundcube algorithm, the one in srcset would still act as a beacon.
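A toy illustration (my own sketch, not Roundcube's actual code) of why a src-only scan is not enough:

```python
import re

html = '<p><img src="cid:inline-logo" srcset="https://tracker.example/b.gif 1x"></p>'

# A scanner that only looks at src= never sees the srcset URL...
src_only = re.findall(r'\bsrc="([^"]+)"', html)

# ...so the beacon survives sanitisation and fires when the mail
# client fetches srcset candidates.
srcset_urls = re.findall(r'srcset="([^"]+)"', html)

src_only     # ['cid:inline-logo']
srcset_urls  # ['https://tracker.example/b.gif 1x']
```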

I feel like the bigger issue is the W3 (nee Google). The new HTML Sanitizer[2] interface does nothing, but some VP is somewhere patting themselves on the back for this. We don't need an object-oriented way to edit HTML, we need the database of changes we want to make.

What I would like to see is the ability to put a <pre-cache href="url"><![CDATA[...]]></pre-cache> that would allow the document to replace requests for url with the embedded data, support what we can, then just turn off networking for things we can't. If networking is enabled, just ignore the pre-cache tags. No mixing means no XSS. Networking disabled means "failures" in the sanitizer is that the page just doesn't "look" right, instead of a leak.

Until then, the HTML4-era solution of a whitelist (instead of trying to blacklist/block things) is best. That's also easier in a lot of ways, but harder to maintain, since gmail, outlook, etc. are a moving target in _their_ whitelists...

[1]: https://github.com/geocar/firewall.js

[2]: https://developer.mozilla.org/en-US/docs/Web/API/HTML_Saniti...


Why on earth does the HTML sanitiser allow blacklisting?! That can't ever be safe to use, the set of HTML elements can always change.

Note that the API is split into XSS-safe and XSS-unsafe calls. The XSS-safe calls [0] have this noted for each of them (emphasis mine):

> Then drop any elements and attributes that are not allowed by the sanitizer configuration, and any that are considered XSS-unsafe (even if allowed by the configuration)

The XSS-unsafe functions are all named "unsafe". Although considering web programmers, maybe they should have been named "UnsafeDoNotUseOrYouWillBeFired".

[0] https://developer.mozilla.org/en-US/docs/Web/API/HTML_Saniti...


I mean, at least they eventually came to their senses, but it does not inspire confidence!

https://developer.chrome.com/blog/sanitizer-api-deprecation/


That's the old sanitizer API. That was already removed and what you linked earlier is the new sanitizer API.

> What I would like to see is the ability to put a <pre-cache href="url"><![CDATA[...]]></pre-cache> that would allow the document to replace requests for url with the embedded data

multipart/related already exists.


> multipart/related already exists.

Which web browsers render multipart/related correctly served over https?


What is stopping them from doing so instead of going with a NIH solution?

Never mind the context is e-mail, which is not served to a browser over HTTPS.


Got it: So none.

As to why I prefer one thing that doesn’t exist over another thing that doesn’t exist: that depends on my priors. You might as well be asking my opinion and making fun of it before you know the answer.

What do you think the impact would be if Content-Location: suddenly gained the interpretation I suggest?

What do you think a script in the package can do to reference a part whose URL is constructed by code?


Who are you thinking of?

Netflix might be spending as much as $120m (but probably a little less), and I thought they were probably Amazon's biggest customer. Does someone (single-buyer) spend more than that with AWS?

Hetzner's revenue is somewhere around $400m, so it's probably a little scary taking on an additional 30% revenue from a single customer, and Netflix's shareholders would probably be worried about the risk of relying on a vendor that is much smaller than them.

Sometimes if the companies are friendly to the idea, they could form a joint venture, or maybe Netflix could just acquire Hetzner (and compete with Amazon?), but I think it unlikely Hetzner could take on Netflix-sized business for nontechnical reasons.

However, increasing PoP capacity by 30% within 6 months is pretty realistic, so I think they'd probably be able to physically service Netflix without changing too much, if management could get comfortable with the idea.


A $120M spend on AWS is equivalent to around a $12M spend on Hetzner Dedicated (likely even less, the factor is 10-20x in my experience), so that would be 3% of their revenue from a single customer.


> A $120M spend on AWS is equivalent to around a $12M spend on Hetzner Dedicated (likely even less, the factor is 10-20x in my experience), so that would be 3% of their revenue from a single customer.

I'm not convinced.

I assume someone at Netflix has thought about this, because if that were true and as simple as you say, Netflix would simply just buy Hetzner.

I think there are lots of reasons you could have this experience, and it still wouldn't be Netflix's experience.

For one, big applications tend to get discounts. A decade ago, when I (the company I was working for) was paying Amazon a mere $0.2M a month, I was getting much better prices from my account manager than were posted on the website.

There are other reasons (mostly from my own experiences pricing/costing big applications, but also due to some exotic/unusual Amazon features I'm sure Netflix depends on) but this is probably big enough: Volume gets discounts, and at Netflix-size I would expect spectacular discounts.

I do not think we can estimate the factor better than 1.5-2x without a really good example/case-study of a company someplace in-between: How big are the companies you're thinking about? If they're not spending at least $5m a month I doubt the figures would be indicative of the kind of savings Netflix could expect.


We run our own infrastructure, sometimes with our own financing (4), sometimes external (3). The cost is in the tens of millions per year.

When I used to compare to AWS, egress alone at list price cost as much as my whole infra hosting. All of it.

I would be very interested to understand why netflix does not go the 3/4 route. I would speculate that they get more return from putting money into optimising costs for creating original content, rather than the cloud bill.


> I would be very interested to understand why netflix does not go the 3/4 route. I would speculate that they get more return from putting money into optimising costs for creating original content, rather than the cloud bill.

I invest in Netflix, which means I'm giving them some fast cash to grow that business.

I'm not giving them cash so that they can have cash.

If they share a business plan that involves them having cash to do X, I wonder why they aren't just taking my cash to do X.

They know this. That's why on the investors calls they don't talk about "optimising costs" unless they're in trouble.

I understand self-hosting and self-building saves money in the long-long term, and so I do this in my own business, but I'm also not a public company constantly raising money.

> When I used to compare to aws, only egress at list price costs as much as my whole infra hosting. All of it.

I'm a mere 0.1% of your spend, and I get discounts.

You would not be paying "list price".

Netflix definitely would not be.


Of course netflix is optimising costs, otherwise it would not be a business, I just think they put much more effort elsewhere. They could be using other words, like "financial discipline" :)

My point is that even if I get a 20x discount on egress it's still nowhere close, since I have to buy everything else: compute and storage are more expensive, and even with 5-10x discounts from list price it's not worth it.

(Our cloud bills are in the millions as well, I am familiar with what discounts we can get)


Figma apparently spends around 300-400k/day on AWS. I think this puts them up there.


How is this reasonable? At what point do they pull a Dropbox and de-AWS? I can’t think of what they would gain with AWS over in-house hosting at that point.

I’m not surprised, but you’d think there would be some point where they would decide to build a data center of their own. It’s a mature enough company.


That $120m will become $12m when they're not using AWS.


> Hetzner's revenue is somewhere around $400m, so probably a little scary taking on an additional 30% revenue from a single customer

A little scary for both sides.

Unless we're misunderstanding something I think the $100Ms figure is hard to consider in a vacuum.


I'm largely just thinking $HUGE when throwing out that number, but there are plenty of companies that have cloud costs in that range. A quick search brings up Walmart, Meta, Netflix, Spotify, Snap, JP Morgan.


> But you can't take .so files and make one "static" binary out of them.

Yes you can!

This is more-or-less what unexec does

- https://news.ycombinator.com/item?id=21394916

For some reason nobody seems to like this sorcery, probably because it combines the worst of all worlds.

But there's almost[1] nothing special about what the dynamic linker does to get those .so files into memory that couldn't be arranged into one big file ahead of time!

[1]: ASLR would be one of those things...


What if the library you use calls dlopen later? That’ll fail.

There is no universal, working way to do it. Only some hacks which work in some special cases.


> What if the library you use calls dlopen later? That’ll fail.

Nonsense. xemacs could absolutely call dlopen.

> There is no universal, working way to do it. Only some hacks which work in some special cases.

So you say, but I remember not too long ago you weren't even aware it was possible, and you clearly didn't check one of the most prominent users of this technique, so maybe you should also explain why I or anyone else should give a fuck about what you think is a "hack"?

