This was previously submitted to HN 23 hours ago. In fact, there's even a link to the previous submission at the bottom of the Matrix article, which is how I found it.
So could HN detect dupes based on content hashes instead? I understand the HTML could differ while the content stays the same, but this would be an extra check that helps.
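A rough sketch of that idea, assuming you strip the markup and hash the remaining text (the `content_fingerprint` helper and its exact rules are mine, not anything HN actually does):

```python
import hashlib
import re

def content_fingerprint(html: str) -> str:
    """Hash the visible text of a page so markup-only differences don't matter."""
    # Crude tag stripping; a real system would want a proper HTML parser.
    text = re.sub(r"<[^>]+>", " ", html)
    # Collapse whitespace and case so trivial formatting changes don't change the hash.
    text = " ".join(text.split()).lower()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```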
Then you'll have one of these "fancy" modern websites that load their content entirely through js and it'll fail because the HTML is the same on all pages.
A URL match is probably good enough the vast majority of the time. Maybe it could also support a bit of fuzzing, such as matching with and without the leading www and both http and https. Beyond that it's probably asking for trouble.
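For what it's worth, a minimal sketch of that kind of fuzzing (the `normalize_url` helper is hypothetical, not how HN actually matches URLs):

```python
from urllib.parse import urlsplit

def normalize_url(url: str) -> str:
    """Reduce a URL to a canonical key for duplicate matching."""
    parts = urlsplit(url.strip())
    # Treat http and https as equivalent.
    scheme = "https" if parts.scheme in ("http", "https") else parts.scheme
    # Lowercase the host and strip a leading "www."
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # Drop a trailing slash so /foo and /foo/ match.
    path = parts.path.rstrip("/") or "/"
    return f"{scheme}://{host}{path}" + (f"?{parts.query}" if parts.query else "")

# All of these normalize to the same key.
assert normalize_url("http://www.example.com/post/") == \
       normalize_url("https://example.com/post")
```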
> one of these "fancy" modern websites that load their content entirely through js [will] fail because the HTML is the same on all pages.
That's a feature, not a bug. (Although it would admittedly be better to block those explicitly rather than relying on coincidental interactions with something that doesn't seem directly related.)
> Then you'll have one of these "fancy" modern websites that load their content entirely through js and it'll fail because the HTML is the same on all pages.
What are you talking about? It would be done entirely on the backend.
Mind you, I've brought it up with dang multiple times and he says it would be a hassle and too brittle to be effective (fair enough), but nothing about it would require JavaScript.
I believe the parent is talking about submitting links to websites that render their content via client-side JavaScript, and how that would break the hash-based dupe detection. They aren't suggesting that the functionality would need to be implemented in JS by HN.
Regardless, hashing the content to detect dupes is an idea that wouldn't work, for a lot of reasons.
We tried that kind of thing and it was a nightmare. Trying to make general content-processing things on the web is a full-time job and more.
I do think it's practical for us to make use of <link rel='canonical'>, though. But many pages don't include that. The OP doesn't, for example, so it wouldn't have helped here.
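For illustration, pulling that tag out of a page with Python's stdlib html.parser could look like this (the `CanonicalFinder` class is just a sketch, not HN's actual code):

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Grab the href of the first <link rel="canonical"> tag, if any."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag == "link" and self.canonical is None:
            attrs = dict(attrs)
            if (attrs.get("rel") or "").lower() == "canonical":
                self.canonical = attrs.get("href")

def find_canonical(html: str):
    parser = CanonicalFinder()
    parser.feed(html)
    return parser.canonical

# Prints "https://example.com/post"
print(find_canonical('<head><link rel="canonical" href="https://example.com/post"></head>'))
```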
Hashes of what, though? Without accessing the content, the URL is essentially the only thing to go on. Plus, having the same post make it to the front page pretty much means the system is working exactly as intended: content that HN users are interested in reaches the widest audience (not everyone checks HN multiple times a day =)
THANK YOU. I have never understood why folks on Reddit, etc. are so vehemently opposed to reposts. If people are upvoting it, that means they like it. If they like it, that means either a) it's new to them, or b) they enjoy seeing it again.
For what it's worth, my comment wasn't meant as a complaint that it was posted again. I just wanted to make sure that folks saw the previous submission, since that one has comments by the Matrix lead.
https://news.ycombinator.com/item?id=24826951