Hacker News | gregw2's comments

You articulate your case well, thank you!

I always warn people (particularly junior people) though that blindly dropping duplicates is a dangerous habit because it helps you and others in your organization ignore the causes of bad data quickly without getting them fixed at the source. Over time, that breeds a lot of complexity and inefficiency. And it can easily mask flaws in one's own logic or understanding of the data and its properties.


When I'm in pandas (or was, I don't use it anymore) I'm always downstream of some weird data process that ultimately exported to a CSV from a team that I know has very lax standards for data wrangling, or it is just not their core competency. I agree that duplicates are a smell but they happen often in the use-cases that I'm specifically reaching to pandas for.

Exactly. It’s not that getting rid of duplicates is bad, it's that they may be a symptom of something worse, e.g. incorrect aggregation logic.
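One way to act on this before dropping anything is to look at what the duplicates actually are. The sketch below uses only the standard library so it runs anywhere; the CSV contents, the `order_id` key, and the `duplicate_report` helper are all made up for illustration. In pandas, the rough equivalent would be filtering with `df.duplicated(subset=[...], keep=False)`.

```python
import csv
import io
from collections import Counter

# Hypothetical export: order_id is ASSUMED to be unique per row.
RAW_CSV = """order_id,customer,amount
1001,acme,25.00
1002,globex,40.00
1002,globex,40.00
1003,initech,15.50
1003,initech,17.50
"""

def duplicate_report(rows, key):
    """Group every row whose key value occurs more than once."""
    counts = Counter(row[key] for row in rows)
    report = {}
    for row in rows:
        if counts[row[key]] > 1:
            report.setdefault(row[key], []).append(row)
    return report

rows = list(csv.DictReader(io.StringIO(RAW_CSV)))
report = duplicate_report(rows, "order_id")

for key, group in report.items():
    # An exact repeat is a different problem from rows that share a key
    # but disagree on other columns.
    distinct = {tuple(sorted(r.items())) for r in group}
    kind = "exact repeat" if len(distinct) == 1 else "conflicting rows"
    print(f"order_id={key}: {len(group)} rows ({kind})")
```

Exact repeats may be safe to drop, but conflicting rows for the same key usually point at something upstream (a bad join, a double export, wrong aggregation) that a blind drop would hide.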

MS SQL Server was a cheaper, friendlier plugin replacement for Sybase in the early 2000s.

I built apps in an active-active bidirectional replication telecom Sybase environment and was deeply involved in migrating it to MS SQL Server in the early 2000s. I remember a fair amount of paranoia and effort around the transition as our entire business and customers' phone calls depended on it (for "reasons") but in hindsight it went quite smoothly and there were no regrets afterwards.

Then Microsoft went and added a nice BI stack to the whole thing which added a new dimension of value creation at a new low price point.


You aren't wrong that Vanguard seems more active friendly these days.

But Vanguard under Bogle always played both sides of the fence at least to some extent. They have always had that actively managed Windsor fund, right? And Wellington?

I think your article headline shows you have a fair bit more to learn about Bogle. Or at least you haven't made your case on that front. Bogle was at least as much about low cost and aligning interests of the investment client as he was about passive indexing, though he is known more for the latter.

Here's a writeup with a couple pointers to more on the topic from Bogle: https://www.bogleheads.org/forum/viewtopic.php?t=388377


Rather than switch completely, consider putting your eggs in two baskets?


Not the parent poster but, while I acknowledge your point on Canada and Europe entering the conflict (and I'd add that the highly motivated Dutch punch well above their weight in intelligence and economic spheres and this whole scenario of US invasion is a Putin dream), when you ask "why do you think it could win...", the 50k population of Greenland is smaller than Grenada (100k) and three orders of magnitude smaller than Vietnam/Afghanistan/Iraq (~40m). So I find its insurgency potential hard to compare to those examples you give.


Thanks for the pointer to this 2016 dialog!

One part of it has interesting new resonance in the era of agentic LLMs:

alankay on June 21, 2016:

This is why "the objects of the future" have to be ambassadors that can negotiate with other objects they've never seen. Think about this as one of the consequences of massive scaling ...

Nowadays, rather than the methods associated with data objects, we are dealing with "context" and "prompts".


Quite a nice insight there!

I should probably be thinking more in this direction.


A bit of "hacker history"... at the dawn of the web, in 1993, the first app (that I know of) along these lines was born: "Buzzword Bingo".

It got mentioned in WSJ of all places as news of it spread.

For the history+app from its creator, see:

https://lurkertech.com/buzzword-bingo/

(Wikipedia page: https://en.wikipedia.org/wiki/Buzzword_bingo )

I'm glad to see, 25-30 years later, the hackers/cynical-tech-workers who birthed it getting justified by actual social science research.


It's never been clear to me how effective IAEA can be at keeping out state spies from their midst if the world's best intel agencies want "in".

Going back pre-Iraq-war, back when there were "inspections" and "sanctions" on Iraq, you can dig up "page 19" articles in NYTimes where -- if I recall correctly -- the US was caught putting spy equipment on the IAEA monitoring equipment in Iraq. This is (according to Iraq) what in large part triggered Iraq to kick out US inspectors. Then the Iraq (2) war started because they wouldn't let in inspectors.

Iran's theory, glossed over at the time but also reported in the rare western press articles, was that the US intentionally got caught. (So that Saddam would have explicit pressure to get the US kicked out, so that then they (US/Israel) could have a pretext to take out Iraq.) I don't know if Iran had any actual evidence to that effect or it was a bit of a conspiracy theory; I never actually read Iranian news sources which might have had details (or might have revealed just empty posturing.)


*timeless


*infinitely nested


Not to mention that while Parquet fixes the "delimiter problem", it doesn't fix the "encoding problem".

In (simplistic) CSV, you have to pick the right delimiter or it mangles some of your data.

In Parquet you have to pick the right data type encodings for each column for your data or it gets mangled.

Your clean monetary fixed-precision decimal data from the source system becomes floating point slop in your "I didn't want to think about data types"-encoded Parquet file and then starts behaving differently (or even changing values!) due to the nature of floating point precision artifacts. Or your blanks become 0s or nulls, etc, etc.

And don't get me started on character set encodings!
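The decimal-to-float drift described above is easy to reproduce without Parquet at all. This is only a minimal sketch using Python's standard `decimal` module; the `amounts` values are made up, and it shows the float artifact itself rather than any particular Parquet writer's behavior. (Parquet does have a DECIMAL logical type, so the fix is usually to declare it explicitly instead of letting the column default to a float.)

```python
from decimal import Decimal

# Hypothetical monetary amounts as they might arrive from a source system.
amounts = ["0.10", "0.20", "0.30"]

# Fixed-precision arithmetic keeps the values exact.
as_decimal = sum(Decimal(a) for a in amounts)

# The same values stored as binary floating point pick up rounding error.
as_float = sum(float(a) for a in amounts)

print("decimal total:", as_decimal)
print("float total:  ", repr(as_float))
print("decimal equals 0.60:", as_decimal == Decimal("0.60"))
print("float equals 0.6:   ", as_float == 0.6)
```

The decimal total compares equal to 0.60, while the float total does not compare equal to 0.6, which is exactly the kind of "changing values" surprise that shows up downstream in reconciliations and equality joins.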

