Running your models in production with TensorFlow Serving (googleresearch.blogspot.com)
170 points by hurrycane on Feb 16, 2016 | 18 comments


Model serving in production is a persistent pain point for many ML backends, and is usually done quite poorly, so this is great to see.

I'm expecting large leaps and bounds for TensorFlow itself. This improvement to surrounding infrastructure is a nice surprise, just as TensorBoard is one of the nicest "value-adds" that the original library had[4].

Google has ensured many high-quality people are active as evangelists[3], helping build a strong community and answer base. While there are still gaps between what the whitepaper[1] promises and what has made it to the open source world[2], it's coming along steadily.

My largest interests continue to be single machine performance (a profiler for performance analysis + speedier RNN implementations) and multi-device / distributed execution. Single machine performance had a huge bump from v0.5 to v0.6 for CNNs, eliminating one of the pain points there, so they're on their way.

I'd have expected this to lead to an integration with Google Compute Engine (TensorFlow training / prediction as a service) except for the conspicuous lack of GPU instances on GCE. While GPUs are usually essential for training (and theoretically could be abstracted away behind a magical GCE TF layer) there are still many situations in which you'd want access to the GPU itself, particularly as performance can be unpredictable across even similar hardware and machine learning model architectures.

[1]: http://download.tensorflow.org/paper/whitepaper2015.pdf

[2]: Extricating TensorFlow from "Google internal" must be a real challenge, given that TF's distributed training interacts with various internal infra tools whose open source equivalents have gaps.

[3]: Shout out to @mrry who seems to have his fingers permanently poised above the keyboard - http://stackoverflow.com/users/3574081/mrry?tab=answers&sort...

[4]: I've been working on a dynamic memory network (http://arxiv.org/abs/1506.07285) implementation recently and it's just lovely to see a near perfect visualization of the model architecture by default - http://imgur.com/a/PbIMI


For profiling of models, almost everything needed is already there. You only need to pass a StepStatsCollector through the Session::Run() method (I called mine RunWithStats()) and hook it up to the Executor's Args by filling in this variable: https://github.com/tensorflow/tensorflow/blob/master/tensorf... You then get a very usable set of profiling statistics by aggregating the StepStats object. For profiling individual ops on the CPU, perf is very useful.
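
For reference, once a serialized StepStats proto is back on the Python side (via a patched-in RunWithStats() like the above - the Python binding itself is hypothetical), aggregating it into per-op timings is only a few lines. A minimal sketch, using the field names from tensorflow/core/framework/step_stats.proto:

    from collections import defaultdict
    from tensorflow.core.framework import step_stats_pb2

    def summarize(serialized_step_stats):
        stats = step_stats_pb2.StepStats()
        stats.ParseFromString(serialized_step_stats)
        totals = defaultdict(int)  # node name -> op compute time in micros
        for dev in stats.dev_stats:
            for node in dev.node_stats:
                totals[node.node_name] += (node.op_end_rel_micros -
                                           node.op_start_rel_micros)
        # Print the 20 most expensive nodes first
        for name, us in sorted(totals.items(), key=lambda kv: -kv[1])[:20]:
            print('%-60s %8d us' % (name, us))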


Thanks for the shout out!


Derek is also extremely available to his colleagues at Google. He's always friendly when I ask questions, and very thoughtful. I feel lucky to work with him, however distantly! :)


[4] Did you generate that diagram with TensorFlow?

Regardless, I'd be interested in hearing more about your lessons learned from the DMN implementation - when you have some to share.


The diagram was generated using a DMN I've implemented in TensorFlow, part of my work at MetaMind. Those diagrams are useful not just in visualizing the architecture but also for spot checking certain issues.

I was planning on writing up a blog post on TensorFlow in the near future but I'm undecided on the topic. It could be about implementing something nice and simple[1], maybe an attention-based model / language model using RNNs over PTB / etc, or a broader discussion of the good and bad bits of TensorFlow.

I really like TensorFlow, so improving the tutorials seems to be an important step. Whilst the existing tutorials are a good starting point, more in-depth exploration is trial by fire, made more difficult by the construction of the graph being separate from its execution[2].

If people have particular topics they'd like to see covered, I'd love to hear about them! Ping me at smerity@smerity.com or @Smerity.

[1]: Similar to my "Question answering on the Facebook bAbi dataset using Keras" - http://smerity.com/articles/2015/keras_qa.html

[2]: Unlike Numpy, directly inspecting an intermediate value requires a bit more work. There is an InteractiveSession, but it's still less direct (quick sketch below).
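
A minimal illustration of the split footnote [2] describes - nothing has a value until a session runs the graph:

    import tensorflow as tf

    a = tf.constant([1.0, 2.0])
    b = a * 2            # builds a graph node; printing b shows no values yet

    sess = tf.InteractiveSession()
    print(b.eval())      # only now does the graph actually run: [ 2.  4.]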


Note also that we've released v0.7 of TensorFlow today - more details in the release announcement: https://groups.google.com/a/tensorflow.org/forum/#!topic/dis...


This looks great and brings TensorFlow closer to production use, where the model has a life cycle.

I wish they'd implement other well-known ML algos like trees, and give Spark ML some fight :)


There are plenty of alternatives out there to Spark ML: here is a survey of RF implementations: https://github.com/szilard/benchm-ml/tree/master/z-other-too...

There is a whole other world of algorithms out there not based on stochastic gradient descent; IMO it's sensible for TensorFlow to stick to one class of algorithms and do it well.

(Disclaimer: I work on mldb, one of the tools on that list).


mldb looks great, but I was referring to distributed model building in a horizontal way, which Spark ML does and TensorFlow says it does. If they can implement distributed Gradient Boosting Trees across nodes, maybe even with GPU support (although I'm not sure if it's applicable), that could be huge.


Once the open source version of TensorFlow releases multi-node support, this would be one way to make it work. There are potential gains from using a GPU for RF training. As for distributing: in my experience, for small models it doesn't make much difference, and for larger models the cost of distributing the dataset dominates the benefit of having multiple nodes. But an implementation carefully designed for a given node topology could be made more performant.


Comparable AUC metric to xgboost and faster? That's... pretty interesting.

Does that include the data load time into MLDB?


None of the systems include the data load time, but for mldb and the other non-distributed systems, it's only a few seconds.

(edit: my grammar is good not)


Off-topic: I always enjoy opening C++ projects from Google - they are so tidy and clean. It just feels like a work of craftsmanship, if that actually exists in software: https://github.com/tensorflow/serving/tree/master/tensorflow...

OTOH, I have a strong prejudice against JavaScript on the backend... And it's not due to it being dynamic - the same doesn't happen with Python codebases. It is completely irrational.


I don't share the same opinion after reading the NDK code.

Plain C with a C++ compiler, and pseudo-Hungarian notation.


I'm not sure if TensorFlow already provides that, but it would also be pretty awesome to access some of Google's data sets to train the models.


Which data are you after? The ImageNet data is public and they released the pretrained model.

They've promised to release (or already have released) the models for Exploring the Limits of Language Modeling[1], which were trained on the One Billion Word Benchmark corpus[2], also public data.

Note that for these, the trained models are often more immediately useful. The language model was trained for 3 weeks on 32 Tesla K40s - roughly 3 × 7 × 24 × 32 ≈ 16,000 GPU-hours. That's not something many can replicate casually.

[1] http://arxiv.org/pdf/1602.02410v2.pdf

[2] http://www.statmt.org/lm-benchmark/


They do provide some very useful pre-trained models, e.g. the full parameter set for their Inception model.
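
For anyone who hasn't tried them: these releases are typically a frozen GraphDef (weights baked in as constants) that you import and run directly. A hedged sketch - the file name and tensor names below are illustrative, check the release's README for the real ones:

    import tensorflow as tf

    # Load the released, frozen graph
    graph_def = tf.GraphDef()
    with open('inception_graph_def.pb', 'rb') as f:  # illustrative file name
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name='')

    with tf.Session() as sess:
        # Tensor names depend on the particular release
        softmax = sess.graph.get_tensor_by_name('softmax:0')
        preds = sess.run(softmax,
                         {'DecodeJpeg/contents:0': open('cat.jpg', 'rb').read()})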



