MapReduce II (followup to "MapReduce: A major step backwards")

jimbokun · on Jan 26, 2008

This is the article they should have written in the first place. Much more detail, much better qualified arguments. And the point about each generation of computer scientists reinventing the wheel is well taken, Lisp being 50 years old and all that.

apathy · on Jan 26, 2008

> efforts such as PigLatin and Sawzall appear to be promising steps in this direction.

Sawzall is a parallel logfile analyzer. It takes logfiles stored in GFS and MapReduces them for reliable billing (which used to be a nightmarish, fundamental revenue problem and is now a nonissue -- all you Nooglers live in a relative utopia). A unixy tool with a unixy mindset (gee, maybe that's because Rob Pike wrote it). It's a special-purpose tool that is incredibly good at its job, not a general-purpose filigreed hammer that is supposed to nail everything in sight. And the fact that the authors could possibly overlook this speaks volumes about their myopia.

Stonebraker naturally refused to answer the most obvious criticism of all -- HEY ASSHOLE, WHAT ABOUT BIGTABLE?

But then, if he addressed that, he'd no longer have an essay. And if Google took his advice they wouldn't have the revenues to have made it as a company. But that's not really something that academics think about, is it?

neilk · on Jan 26, 2008

This is worse than the first article.

They seem to be conflating the use of a high-level, SQL-like language with the architecture of the system. Of course you could layer a SQL-like language on top of a MapReduce-based storage and processing array, and for some queries that would be very user-friendly. If that is their whole point, it is true but trivial.

I think BigTable does have a limited join capability now.

The real difference between a MapReduce-oriented system and typical SQL storage options is something like this. MR gives you assured scalability at the cost of limiting what kinds of queries you can do, and as the authors correctly point out, some queries get more and more onerous to create. Usually, SQL storage engines place ease of querying above all else, but have to go through painful and expensive procedures to scale well.
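To make that trade-off concrete, here is a minimal single-process sketch in Python of how two SQL-style queries might be expressed as map and reduce functions. Everything in it is an assumption for illustration: `map_reduce` is a hypothetical in-memory stand-in, not Google's API, and the `users`/`clicks` rows are made up. The GROUP BY is a couple of lines; the join already needs tagging and a cross product inside the reducer, which is the kind of extra work that gets onerous as queries grow.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Hypothetical single-process stand-in: map, shuffle by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # "shuffle": collect values per key
    return {key: reducer(key, values) for key, values in groups.items()}

# Made-up rows in the form (table_name, user_id, payload).
rows = [
    ("users", 1, "alice"),
    ("users", 2, "bob"),
    ("clicks", 1, "/home"),
    ("clicks", 1, "/about"),
    ("clicks", 2, "/home"),
]

# Roughly: SELECT user_id, COUNT(*) FROM clicks GROUP BY user_id
def count_mapper(row):
    table, user_id, _payload = row
    if table == "clicks":
        yield user_id, 1

def count_reducer(_key, values):
    return sum(values)

# Roughly: SELECT u.name, c.url FROM users u JOIN clicks c ON u.id = c.user_id
def join_mapper(row):
    table, user_id, payload = row
    yield user_id, (table, payload)  # tag each value with its source table

def join_reducer(_key, tagged):
    names = [p for t, p in tagged if t == "users"]
    urls = [p for t, p in tagged if t == "clicks"]
    return [(name, url) for name in names for url in urls]

clicks_per_user = map_reduce(rows, count_mapper, count_reducer)
joined = map_reduce(rows, join_mapper, join_reducer)
```

A SQL-like front end would, in effect, generate mapper/reducer pairs like these from the query text; the storage and execution layer underneath stays the same.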

jimm · on Jan 26, 2008

Upmodded, but I chuckled when I read their first item, where they take an obviously relational data problem and then observe that you should use a relational database instead of MapReduce for it. From that, they claim an RDBMS is better.

ntoshev · on Jan 26, 2008

I believe BigTable implements the kind of indexing they like.

Re the point about scalability: while we don't really know how well MapReduce scales, its usage at Google strongly hints that it scales well. Further, I don't think relational databases scale linearly either, and it is notoriously difficult to implement large DB clusters.