Founder of the Cloud TPU program here. If you'd like to experiment with TPU VMs for free and are willing to share your work with the world somehow (e.g. via publications or open-source projects), you can apply to participate in the TPU Research Cloud (TRC) program here:
https://sites.research.google/trc/
Hi Zak, this question is a little out of your scope, but perhaps you may know the answer: Do you know if/when Colab TPU runtimes are likely to be updated to support newer JAX functionality like pjit? (Or put another way: are Colab and Cloud TPU runtimes expected to be in sync at all?) I'd written some model code that worked great on a TPU VM and that I was excited to share on Colab (which is likely far more accessible), but then I found out that Colab simply doesn't support pjit (https://github.com/google/jax/issues/8300).
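For anyone following along, below is roughly the kind of minimal pjit call that runs fine on a TPU VM but fails on the current Colab runtime. This is just a sketch using the experimental pjit API of this JAX era (the API has since been reorganized); the function and shapes are made up for illustration:

    import jax
    import jax.numpy as jnp
    import numpy as np
    from jax.experimental import maps
    from jax.experimental.pjit import pjit, PartitionSpec

    # Build a 1-D device mesh over all local devices
    # (8 on a v2-8 / v3-8 TPU VM).
    mesh = maps.Mesh(np.array(jax.devices()), ('x',))

    # Shard the array's leading axis across the 'x' mesh axis.
    double = pjit(lambda a: a * 2,
                  in_axis_resources=PartitionSpec('x'),
                  out_axis_resources=PartitionSpec('x'))

    with mesh:
        out = double(jnp.arange(16.0))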
We love Colab and would love to upgrade the Colab TPU integration to support TPU VMs! No timeframe yet, but the right folks across JAX / Colab / Cloud TPU are very aware of this issue.
Thanks, Frank! You personally helped more Cloud TPU and TRC users than I can count, and you always came through when something needed to get done fast. I really appreciated it!
What are some of the things on the roadmap for the platform? Any immediate plans to close the command-line gap for TPU utilization, etc.?
My overall impression is that TPUs are pretty awesome, but the software stack is still a bit hard to use compared to mature GPU tooling. I'd imagine it's pretty hard for inexperienced users to get started.
If you haven't used Cloud TPUs in a while, I'd encourage you to try them now with TPU VMs and the latest versions of JAX, PyTorch / XLA, or TensorFlow. We've gotten a lot of positive feedback from customers and TRC users, so we think the overall experience has improved a lot, though there's always more we want to do.
People especially seem to find Cloud TPUs easy to use in comparison to alternatives when they are scaling up ML training workloads. Once you have a model running on a single TPU core, it is relatively straightforward from a systems perspective to scale it out to thousands of cores. You still need to work through the ML challenges of scaling, but that is more tractable when you aren't simultaneously struggling with systems-level issues.
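As a concrete illustration, here's a minimal data-parallel training step in JAX. The names, model, and shapes are illustrative, not from a real codebase: the point is that the same train_step runs on one core or many, with a single jax.lax.pmean collective doing the cross-core gradient averaging over the interconnect instead of a stack of hand-written networking code:

    import jax
    import jax.numpy as jnp

    def loss_fn(params, batch):
        preds = batch['x'] @ params['w']
        return jnp.mean((preds - batch['y']) ** 2)

    def train_step(params, batch):
        grads = jax.grad(loss_fn)(params, batch)
        # All-reduce gradients across every core over the interconnect;
        # this same line covers 8 cores or thousands.
        grads = jax.lax.pmean(grads, axis_name='cores')
        return jax.tree_util.tree_map(lambda p, g: p - 1e-3 * g,
                                      params, grads)

    p_train_step = jax.pmap(train_step, axis_name='cores')

    n = jax.local_device_count()
    # Replicate the params onto each local device, and shard the
    # batch's leading axis so each core gets its own slice.
    params = jax.device_put_replicated({'w': jnp.zeros((4, 1))},
                                       jax.local_devices())
    batch = {'x': jnp.ones((n, 32, 4)), 'y': jnp.ones((n, 32, 1))}
    params = p_train_step(params, batch)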
In particular, you don't need to master a sequence of different networking technologies as you scale up, and the TPU interconnect is so much faster at scale than other technologies (10X last time I checked) that you don't have to work as hard to avoid network bottlenecks. Support for model parallelism on Cloud TPUs is improving across the ML frameworks, too.
To be clear, training ML models at scales that we currently consider large is still very challenging on any platform - for example, the logbooks that Meta recently published are fascinating:
https://github.com/facebookresearch/metaseq/blob/main/projec...
Nice achievement, and great that you're still around after all that time to see it through to GA. Congrats! It must be a bit like watching your child cycle to school alone for the first time :)
Thanks very much! We've come a long way, but there is always more interesting work required to keep up with the deep learning frontier and enable Cloud TPU customers and TRC users to expand it further.
No, Vectorflow isn't supported out of the box, and I'm not sure the workloads it targets are the right fit for Cloud TPU hardware. However, be sure to check out the "Ranking and recommendation" section of the blog post linked above: Cloud TPUs can accelerate the ML models with very large embeddings that are increasingly common in state-of-the-art ranking and recommendation systems.