Scenario: The file is a 100GB CSV file. The machine is a VPS with 500MB RAM. The task: Determine whether the value in the first column of the first row is ever repeated in the first column of any subsequent row.
Importing a CSV library is unlikely to work out of the box. Most imported solutions would try to load all the rows into memory, which doesn't work here. Maybe if the imported parser can be configured to run a user-supplied callback on individual rows and then discard them (see the sketch below)...
(P.S. solving the problem with awk still counts as writing a CSV parser --- in awk)
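For illustration, a minimal sketch of that per-row style, assuming Apache Commons CSV as the imported parser (the method name firstValueRepeats is made up here); it streams one record at a time instead of materializing the whole file:

import java.io.FileReader;
import java.io.Reader;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

static boolean firstValueRepeats(String filename) throws Exception {
    try (Reader in = new FileReader(filename);
         CSVParser parser = CSVFormat.DEFAULT.parse(in)) {
        String first = null;
        for (CSVRecord record : parser) {            // records are read lazily, one at a time
            if (first == null) {
                first = record.get(0);               // column 0 of the first row
            } else if (first.equals(record.get(0))) {
                return true;                         // repeated in a later row
            }
        }
        return false;                                // never repeated (or the file was empty)
    }
}

Because the parser handles quoting, this also survives fields with embedded commas and newlines, which comes up further down the thread.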
It's not even a library thing. Reading a file this way does its reads in approximately optimal (for the filesystem) sized blocks. The CSV parser just rides on top of stdio.
BufferedReader r = new BufferedReader(new FileReader(filename)); // java.io.BufferedReader / java.io.FileReader
String line = r.readLine();
if (line != null) {
    // first column of the first row
    String first = line.split(",")[0];
    while ((line = r.readLine()) != null) {
        // compare against the first column of every later row
        if (first.equals(line.split(",")[0])) {
            return true;
        }
    }
    return false;
}
throw new RuntimeException(); // empty file
would work just fine, and very quickly, too. This sort of question simply isn't hard to solve; whether a given solution works comes down far more to tiny implementation details. I very much doubt that any CSV parser would try to load the file all at once; it'd be too much of a performance hit.
In other words, you write your own CSV parser, in this case using Java.
Btw, CSV files can contain values with commas and even newlines inside of them. So if your point was that you don't have to write an _entire_ CSV parser, only a partial one, unfortunately that isn't true. https://en.wikipedia.org/wiki/Comma-separated_values#Example
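A quick illustration of why a naive split(",") isn't even a correct partial parser (a hypothetical one-record line):

// One CSV record with two fields: `New York, NY` and `123`.
String line = "\"New York, NY\",123";
// Naive splitting breaks the quoted field apart:
System.out.println(line.split(",")[0]);   // prints: "New York   (not the real first field)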
>I very much doubt that any CSV parser would try to load the file all at once
I agree, but a naive imported solution would likely try to store all the rows at once, possibly in some sort of list or vector or array or whatever. This is what would cause the memory failure, not the file read itself.
Possibly, but unlikely (most CSV libraries will be based around iterators). In any case, the problem is stupidly underspecified. For example, what if the 100GB CSV is 100 1GB rows? What if the input is UTF-16 or UTF-32? How do you deal with the 10 Unicode line separators?
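If the encoding really is UTF-16 or UTF-32, the readLine() approach at least needs an explicit charset rather than the platform default; a sketch (filename is assumed, UTF-16 picked as an example):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Decode explicitly; for UTF-32 you'd need Charset.forName("UTF-32"), if the runtime supports it.
BufferedReader r = new BufferedReader(
        new InputStreamReader(new FileInputStream(filename), StandardCharsets.UTF_16));
// Note: readLine() only recognizes \n, \r and \r\n, not the other Unicode
// line separators (e.g. U+2028), so those would still need separate handling.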
It just tests how well the interviewee knows CSV, which is an ill-specified format anyway. It's a fake problem, as no sane person would parse 100GB CSVs on a 500MB VPS, and in real life you'd just try the naive solution, see why it didn't work, and iterate.
This is a valid record from a CSV file (CRLF marks the literal line breaks):
"b CRLF
bb","ccc" CRLF
zzz,yyy,xxx
Your code will fail on this case. CSV parsing is not as simple as it sounds. Given that the format is also not well-defined, it's even worse. (For example: is the first line a header or not?)
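Concretely, on that record the readLine()-based snippet above sees two physical lines and mis-parses the first one (the string values below are what readLine() would return):

String physicalLine1 = "\"b";             // first readLine() result
String physicalLine2 = "bb\",\"ccc\"";    // second readLine() result
System.out.println(physicalLine1.split(",")[0]); // prints: "b   -- but the real first field is b<newline>bb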