
Scenario: The file is a 100GB CSV file. The machine is a VPS with 500MB RAM. The task: Determine whether the value in the first column of the first row is ever repeated in the first column of any subsequent row.

An import solution is unlikely to work. Most import solutions would try to load the rows into memory, which doesn't work here. Maybe if the imported parser can be configured to run a user-supplied callback on individual rows and then discard them...

(P.S. solving the problem with awk still counts as writing a CSV parser --- in awk)



    import csv
    
    def first_value_repeats(path):
        r = csv.reader(open(path))
        first_row = next(r)
        for row in r:
            if row[0] == first_row[0]:
                return True
        return False
287MB file, 100M rows: 15.804s, peak memory usage 7MB


Thank you, that is beautiful. I've been proved decisively wrong and I've gained some greater appreciation for python's libraries.


It's not even a library thing. Reading a file this way reads in approximately optimal (for the FS) sized blocks. The CSV parser just rides on top of stdio.
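To make the "rides on top of stdio" point concrete, a small sketch of CPython's behavior (not from the parent comment): `open()` in binary mode returns a `BufferedReader`, and text mode wraps one, so line iteration is served from fixed-size block reads rather than per-line syscalls:

```python
import io

# CPython buffers file I/O in blocks of io.DEFAULT_BUFFER_SIZE (commonly 8 KB)
# or the filesystem's preferred block size.
print(io.DEFAULT_BUFFER_SIZE)

with open(__file__, "rb") as f:
    print(type(f).__name__)         # BufferedReader
with open(__file__, "r") as f:
    print(type(f.buffer).__name__)  # BufferedReader underneath the text layer
```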


The trivial Java solution

    BufferedReader r = new BufferedReader(new FileReader(filename));
    String line = r.readLine();
    if (line != null) {
        String first = line.split(",")[0];
        while ((line = r.readLine()) != null) {
            if (first.equals(line.split(",")[0])) {
                return true;
            }
        }
        return false;
    }
    throw new RuntimeException();
would work just fine, and very quickly too. This sort of question simply isn't hard to solve; success hinges far more on tiny implementation details. I very much doubt that any CSV parser would try to load the file all at once; it'd be too much of a performance hit.


In other words, you write your own CSV parser, in this case using Java.

Btw, CSV files can contain values with commas and even newlines inside of them. So if your point was that you don't have to write an _entire_ CSV parser, only a partial one, unfortunately that isn't true. https://en.wikipedia.org/wiki/Comma-separated_values#Example
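To illustrate the point with a throwaway sketch (made-up record, not the parent's code): `split(",")` misreads a record that is perfectly valid per RFC 4180, with a quoted comma and an embedded newline, while Python's `csv.reader` handles it:

```python
import csv
import io

# Hypothetical record: the first field contains a comma, the second an
# embedded newline -- both legal inside quoted CSV fields.
data = '"Smith, John","line one\nline two",42\r\n'

print(data.split(",")[0])  # '"Smith' -- wrong field boundary
print(next(csv.reader(io.StringIO(data, newline=""))))
# ['Smith, John', 'line one\nline two', '42']
```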

>I very much doubt that any CSV parser would try to load the file all at once

I agree, but a naive import-based solution would likely try to store all the rows at once, in some sort of list, vector, or array. That is what would cause the memory failure, not the file read itself.
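A sketch of that failure mode on toy data: materializing every row is what blows up at 100GB, while the streaming form keeps only the first key and one row in memory at a time:

```python
import csv
import io

data = "x,1\ny,2\nx,3\n"  # toy stand-in for the 100GB file

# Naive: materialize every row -- this is the list that exhausts RAM.
rows = list(csv.reader(io.StringIO(data)))
dup_naive = any(r[0] == rows[0][0] for r in rows[1:])

# Streaming: hold only the first key and the current row.
rdr = csv.reader(io.StringIO(data))
first = next(rdr)[0]
dup_stream = any(r[0] == first for r in rdr)

print(dup_naive, dup_stream)  # True True
```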


Possibly, but unlikely (most CSV libraries are built around iterators). In any case, the problem is badly underspecified. For example, what if the 100GB CSV is 100 1GB rows? What if the input is UTF-16 or UTF-32? How do you deal with the various Unicode line separators?

It just tests how well the interviewee knows CSV, which is an ill-specified format anyway. It's a fake problem: no sane person would parse a 100GB CSV on a 500MB VPS, and in real life you'd just try the naive solution, see why it didn't work, and iterate.


Please note that according to https://www.ietf.org/rfc/rfc4180.txt this is a valid record definition from a CSV file:

    "b CRLF bb","ccc" CRLF zzz,yyy,xxx

Your code will fail on this case. CSV parsing is not as simple as it sounds. Given that the format is also not well defined, it's even worse (is the first line a header or not, for example?).
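For what it's worth, Python's csv module does parse that RFC 4180 example correctly. A sketch with the CRLFs spelled out (opening the source with `newline=''` is what lets the embedded CRLF survive intact):

```python
import csv
import io

# The RFC 4180 example: two records, the first containing a literal
# CRLF inside a quoted field.
data = '"b\r\nbb","ccc"\r\nzzz,yyy,xxx\r\n'

rows = list(csv.reader(io.StringIO(data, newline="")))
print(rows)  # [['b\r\nbb', 'ccc'], ['zzz', 'yyy', 'xxx']]
```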




