
Disclaimer: my previous job had a lot of cases where CSVs were dropped off over SFTP, so your mileage may vary... JSON APIs are said to be a different flavor of crazy...

Haven't heard much beyond "ask the Old Ones", but here are some patterns I have noticed:

- Murphy's law strikes again.
- Eventually someone will want that data even though they swore it was unnecessary.
- Eventually someone will ask for a backfill/replay.
- Eventually someone will give you a duplicate file.
- Eventually someone will want to slice-and-dice the data a different way.
- Eventually someone will change the schema without telling you.

Even de-duplicating data is, in a sense, deletion: someone will eventually want the data with the duplicates intact, e.g. for detecting errors, repeats, fraud, or some other analysis that mirrors looking at the bullet holes in World War 2 bombers (survivorship bias).

Store the data as close to the original form as you can:

- Keep a timestamp of when you landed the data.
- Create a UUID for the record.
- Create a hash of the record if you can.
- Create a batch_id if you load multiple things at once (e.g. multiple CSVs).
- Don't truncate and reload a table; append to it instead.

If you still need something that looks like atomic table changes, I've gotten away with something close: a view that shows only the most recent valid batch. (Yes, this is re-inventing the database wheel, but sometimes you make do with the tools you are forced to use.) A rough sketch of the whole pattern follows below.
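Here's a minimal sketch of that landing pattern in Python with sqlite3 (stdlib only). The table and column names are made up for illustration, and "valid" is simplified to "most recent"; a real pipeline would add an explicit validation flag.

    import hashlib
    import sqlite3
    import uuid
    from datetime import datetime, timezone

    conn = sqlite3.connect("landing.db")

    # Append-only landing table: the raw record plus lineage metadata.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS raw_orders (
            record_id TEXT,   -- UUID minted at load time
            batch_id  TEXT,   -- one id per file/load
            landed_at TEXT,   -- when WE received it, not the source's clock
            row_hash  TEXT,   -- for later de-dup analysis, without deleting anything
            raw_line  TEXT    -- the record as close to original form as possible
        )
    """)

    def land_csv(path: str) -> str:
        """Append every line of a file to the landing table; never truncate."""
        batch_id = str(uuid.uuid4())
        landed_at = datetime.now(timezone.utc).isoformat()
        with open(path) as f:
            for line in f:  # header line lands too; that preserves the original form
                raw = line.rstrip("\n")
                conn.execute(
                    "INSERT INTO raw_orders VALUES (?, ?, ?, ?, ?)",
                    (str(uuid.uuid4()), batch_id, landed_at,
                     hashlib.sha256(raw.encode()).hexdigest(), raw),
                )
        conn.commit()
        return batch_id

    # "Atomic-looking" reads without truncate-and-reload: a view that
    # exposes only the batch with the newest landed_at.
    conn.execute("""
        CREATE VIEW IF NOT EXISTS current_orders AS
        SELECT * FROM raw_orders
        WHERE batch_id = (
            SELECT batch_id FROM raw_orders
            ORDER BY landed_at DESC LIMIT 1
        )
    """)

Downstream consumers query current_orders and never notice new batches arriving; meanwhile every batch ever landed stays in raw_orders for backfills, replays, and duplicate hunting.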

Someone, somewhere, will hand you a file that does not conform to the established agreement. You want to log that schema change, with a timestamp, so you can complain to them with evidence that they ain't sending you what they used to, and they didn't bother sending you an email beforehand...
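One cheap way to collect that evidence, sketched in Python (the log file name and helper are hypothetical): fingerprint the header row of each incoming file and append a timestamped entry to a log whenever it changes.

    import hashlib
    import json
    from datetime import datetime, timezone
    from pathlib import Path

    SCHEMA_LOG = Path("schema_log.jsonl")  # append-only log of schema changes

    def check_header(path: str) -> None:
        """Compare a file's header row to the last one we saw; log any drift."""
        with open(path) as f:
            header = f.readline().rstrip("\n")
        fingerprint = hashlib.sha256(header.encode()).hexdigest()

        last = None
        if SCHEMA_LOG.exists():
            lines = SCHEMA_LOG.read_text().splitlines()
            if lines:
                last = json.loads(lines[-1])["fingerprint"]

        if fingerprint != last:
            # Timestamped evidence for the inevitable
            # "we never changed anything" phone call.
            with SCHEMA_LOG.open("a") as log:
                log.write(json.dumps({
                    "seen_at": datetime.now(timezone.utc).isoformat(),
                    "file": path,
                    "header": header,
                    "fingerprint": fingerprint,
                }) + "\n")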

They're not going to fix it on your timeline, so you're probably going to end up hacking your code to work a different way... Until, you know, they switch it back...

So, yeah. Log it. Timestamp it. Hash it. UUID it. Don't trust the source system to do it right, because they will eventually change the script on you. Keep notes, and plan in such a way that you have audit logs and can move with agility.



In conclusion...

I find, in data engineering, the goal is not to prevent everything; it's to be flexible and prepared to handle lots of change, even silly changes, and to be able to audit it, observe it, maneuver around it, and keep the mean-time-to-resolution low.


sound advice. thank you.



