
Improving the data is not the fun bit, I imagine.


Cleaning data is usually a very low priority for many companies. It is often a very manual and time-consuming process, and support from management to fix it can be very poor.

Relational tables with millions or billions of rows can contain many different kinds of errors. Finding and fixing them can be like finding a few dozen needles in a field of haystacks.

I was working on a tool to make it much easier to find and fix anomalies. I used the Chicago crime data (available for free download on the city's open data portal) to show how easy it was to find and fix problems. Here is a short video of some of the things it found: https://www.youtube.com/watch?v=kqkNeU1LYEQ

I contacted the people responsible for managing the data to show them some of the problems and how to fix them. They could not have seemed less interested.


Oh, but it certainly can be, depending on the industry you are in.

It can just be frustrating because you want to deliver results, and this realization can get you stuck in the mud of "We need better measuring processes."

However, you also get to advocate for why these measuring practices matter.



