
Scenario: The file is a 100GB CSV file. The machine is a VPS with 500MB RAM. The task: Determine whether the value in the first column of the first row is ever repeated in the first column of any subsequent row.

An import solution is unlikely to work. Most import solutions would try to load the rows into memory, which doesn't work here. Maybe if the imported parser can be configured to run a user-supplied callback on individual rows and then discard them...

(P.S. solving the problem with awk still counts as writing a CSV parser --- in awk)



    import csv
    
    def first_value_repeats(path):
        r = csv.reader(open(path))
        first_row = next(r)
        for row in r:
            if row[0] == first_row[0]:
                return True
        return False
287MB file, 100M rows: 15.804s, peak memory usage 7MB


Thank you, that is beautiful. I've been proved decisively wrong and I've gained some greater appreciation for python's libraries.


It's not even a library thing. Reading a file this way reads in approximately optimal (for the FS) sized blocks. The CSV parser just rides on top of stdio.
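To make the "rides on top of stdio" point concrete, a small sketch of CPython's behavior (not from the parent comment): `open()` in binary mode returns a `BufferedReader`, and text mode wraps one, so line iteration is served from fixed-size block reads rather than per-line syscalls:

```python
import io

# CPython buffers file I/O in blocks of io.DEFAULT_BUFFER_SIZE (commonly 8 KB)
# or the filesystem's preferred block size.
print(io.DEFAULT_BUFFER_SIZE)

with open(__file__, "rb") as f:
    print(type(f).__name__)         # BufferedReader
with open(__file__, "r") as f:
    print(type(f.buffer).__name__)  # BufferedReader underneath the text layer
```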


The trivial Java solution

    BufferedReader r = new BufferedReader(new FileReader(filename));
    String line = r.readLine();
    if (line != null) {
        String first = line.split(",")[0];
        while ((line = r.readLine()) != null) {
            if (first.equals(line.split(",")[0])) {
                return true;
            }
        }
        return false;
    }
    throw new RuntimeException();
would work just fine, and very quickly too. This sort of question simply isn't hard to solve; success hinges far more on tiny implementation details. I very much doubt that any CSV parser would try to load the file all at once; it'd be too much of a performance hit.


In other words, you write your own CSV parser, in this case using Java.

Btw, CSV files can contain values with commas and even newlines inside of them. So if your point was that you don't have to write an _entire_ CSV parser, only a partial one, unfortunately that isn't true. https://en.wikipedia.org/wiki/Comma-separated_values#Example
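To illustrate the point with a throwaway sketch (made-up record, not the parent's code): `split(",")` misreads a record that is perfectly valid per RFC 4180, with a quoted comma and an embedded newline, while Python's `csv.reader` handles it:

```python
import csv
import io

# Hypothetical record: the first field contains a comma, the second an
# embedded newline -- both legal inside quoted CSV fields.
data = '"Smith, John","line one\nline two",42\r\n'

print(data.split(",")[0])  # '"Smith' -- wrong field boundary
print(next(csv.reader(io.StringIO(data, newline=""))))
# ['Smith, John', 'line one\nline two', '42']
```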

>I very much doubt that any CSV parser would try to load the file all at once

I agree, but a naive import-based solution would likely try to store all the rows at once, in some sort of list, vector, or array. That is what would cause the memory failure, not the file read itself.
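A sketch of that failure mode on toy data: materializing every row is what blows up at 100GB, while the streaming form keeps only the first key and one row in memory at a time:

```python
import csv
import io

data = "x,1\ny,2\nx,3\n"  # toy stand-in for the 100GB file

# Naive: materialize every row -- this is the list that exhausts RAM.
rows = list(csv.reader(io.StringIO(data)))
dup_naive = any(r[0] == rows[0][0] for r in rows[1:])

# Streaming: hold only the first key and the current row.
rdr = csv.reader(io.StringIO(data))
first = next(rdr)[0]
dup_stream = any(r[0] == first for r in rdr)

print(dup_naive, dup_stream)  # True True
```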


Possibly, but unlikely (most CSV libraries are built around iterators). In any case, the problem is badly underspecified. For example, what if the 100GB CSV is 100 1GB rows? What if the input is UTF-16 or UTF-32? How do you deal with the various Unicode line separators?

It just tests how well the interviewee knows CSV, which is an ill-specified format anyway. It's a fake problem: no sane person would parse a 100GB CSV on a 500MB VPS, and in real life you'd just try the naive solution, see why it didn't work, and iterate.


Please note that according to https://www.ietf.org/rfc/rfc4180.txt this is a valid record definition from a CSV file:

    "b CRLF bb","ccc" CRLF zzz,yyy,xxx

Your code will fail on this case. CSV parsing is not as simple as it sounds. Given that the format is also not well defined, it's even worse (is the first line a header or not, for example?).
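For what it's worth, Python's csv module does parse that RFC 4180 example correctly. A sketch with the CRLFs spelled out (opening the source with `newline=''` is what lets the embedded CRLF survive intact):

```python
import csv
import io

# The RFC 4180 example: two records, the first containing a literal
# CRLF inside a quoted field.
data = '"b\r\nbb","ccc"\r\nzzz,yyy,xxx\r\n'

rows = list(csv.reader(io.StringIO(data, newline="")))
print(rows)  # [['b\r\nbb', 'ccc'], ['zzz', 'yyy', 'xxx']]
```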




