I would say if a startup overhead of < 10 seconds bothers you, you're not working with "data". Of course sed and grep have less overhead, but I wouldn't even think of trying out a new tool for files/datasets larger than, say, a gigabyte. (Rough guess, I know you can use grep and sed in under 10 seconds for larger files; the point is about perspective and complexity.)
Clojure is sadly a really bad choice for fire-and-forget CLI scripts, but "large scale data processing" doesn't fit that criterion for me.
I'm mostly going to use this for parsing XML into some other formats and getting it into SQLite databases, I think. The reason I'd prefer Drake over 'raw' Python scripts is that it handles a lot of the mundane work that goes around the actual processing of the data, and I want to automate those processes.
I typically deal with sub-100MB XML documents, so processing them takes very little time, but having the quick iteration of changing the format and re-outputting is a key part of the development cycle for me, and I think very useful when you are experimenting with new data and seeing how it could be used. Doing quick transforms is awesome.
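For what it's worth, the XML-into-SQLite step described above is straightforward to sketch in Python's standard library. This is a minimal, hypothetical example (the `record`/`id`/`name` element and column names are invented, not from any real dataset), just to show the shape of the transform:

```python
# Hypothetical sketch: pull <record> elements out of an XML document
# and load them into a SQLite table. Element and column names are
# invented for illustration.
import sqlite3
import xml.etree.ElementTree as ET

def xml_to_sqlite(xml_path, db_path):
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS records (id TEXT, name TEXT)")
    root = ET.parse(xml_path).getroot()
    # Collect one row per <record> element.
    rows = [(r.findtext("id"), r.findtext("name"))
            for r in root.iter("record")]
    conn.executemany("INSERT INTO records VALUES (?, ?)", rows)
    conn.commit()
    conn.close()
```

For sub-100MB documents this kind of script runs in seconds, which is exactly why a few seconds of tool startup overhead becomes noticeable in the edit-rerun loop.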
Drip now works with Drake! Yes, it's still less than ideal if you're calling Drake hundreds of times from an automated script which you need to run quickly, but for interactive development, it should work just fine:
It's a good point, and I agree it might not be the top priority, but I also understand the frustration. I, too, find the 5s start-up time rather irritating, especially when I make errors in the workflow file or didn't specify targets correctly. So we are in search of ideas on how to fix it.
To be honest with you, no, we didn't seriously consider it. Maybe we should have. I do not know if ClojureScript would be able to work with all the dependencies we have (for example, Hadoop client library to talk to HDFS). But it's a good point nevertheless. I'll mention it in https://github.com/Factual/drake/issues/1.
I didn't realise originally that Drake integrated with HDFS. That's a really awesome feature, and I can see why the JVM made sense in development because of existing HDFS libraries.
Thanks for the response! I ask because I have an idea for a CLI program, and I want to write it in Clojure, but I'm worried about the startup time of the JVM. As I understand it, this issue is mitigated in Drake by the fact that a typical job will crunch lots of data and therefore take lots of time. That's not the case for my program; it needs to be quick.
Yes, startup times are a pain. As of this morning, Drake works with Drip, which is a nifty tool to bring down start-up times. It spins up "backup" JVMs, so the next time you run the command, a JVM is already waiting. It works great for interactive environments where at least several seconds pass between runs, but won't do much if you need to run Drake several times per second from an automated script.
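The idea behind Drip, pre-paying the expensive startup cost in a background process so the next invocation finds a warm instance waiting, can be illustrated with a small Python sketch. Everything here (the `WarmWorker` name, the fake 0.2s "startup", the uppercase "work") is invented to show the pattern, not how Drip itself is implemented:

```python
# Conceptual sketch of the pre-warmed-worker idea: start the slow
# process in advance, then hand it tasks with no startup cost per call.
import multiprocessing as mp
import time

def worker(inbox, outbox):
    time.sleep(0.2)  # stand-in for slow JVM/classloader startup
    while True:
        task = inbox.get()
        if task is None:  # sentinel: shut down
            break
        outbox.put(task.upper())  # stand-in for the actual work

class WarmWorker:
    def __init__(self):
        self.inbox = mp.Queue()
        self.outbox = mp.Queue()
        self.proc = mp.Process(target=worker, args=(self.inbox, self.outbox))
        self.proc.start()  # startup cost is paid now, in the background

    def run(self, task):
        self.inbox.put(task)
        return self.outbox.get()

    def stop(self):
        self.inbox.put(None)
        self.proc.join()
```

As with Drip, this helps interactive use (the worker is warm by the time you need it) but buys nothing if you spawn a fresh worker for every call.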
Another option is Nailgun, but it has its limitations, too.
None of this is ideal. If you want to write a very simple CLI program, keep this in mind: you may want to stay away from the JVM.