Founder of the Cloud TPU program here. If you'd like to experiment with TPU VMs for free and are willing to share your work with the world somehow (e.g. via publications or open-source projects), you can apply to participate in the TPU Research Cloud (TRC) program here:
https://sites.research.google/trc/
Hi Zak, this question is a little out of your scope, but perhaps you may know the answer: Do you know if/when Colab TPU runtimes are likely to be updated to support newer JAX functionality like pjit? (Or put another way: are Colab and Cloud TPU runtimes expected to be in sync at all?) I'd written some model code that worked great on a TPU VM and that I was excited to share on Colab (which is likely far more accessible), but then I found out that Colab simply doesn't support pjit (https://github.com/google/jax/issues/8300).
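For anyone following along, below is roughly the kind of minimal pjit call that runs fine on a TPU VM but fails on the current Colab runtime. This is just a sketch using the experimental pjit API of this JAX era (the API has since been reorganized); the function and shapes are made up for illustration:

    import jax
    import jax.numpy as jnp
    import numpy as np
    from jax.experimental import maps
    from jax.experimental.pjit import pjit, PartitionSpec

    # Build a 1-D device mesh over all local devices
    # (8 on a v2-8 / v3-8 TPU VM).
    mesh = maps.Mesh(np.array(jax.devices()), ('x',))

    # Shard the array's leading axis across the 'x' mesh axis.
    double = pjit(lambda a: a * 2,
                  in_axis_resources=PartitionSpec('x'),
                  out_axis_resources=PartitionSpec('x'))

    with mesh:
        out = double(jnp.arange(16.0))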
We love Colab and would love to upgrade the Colab TPU integration to support TPU VMs! No timeframe yet, but the right folks across JAX / Colab / Cloud TPU are very aware of this issue.
Thanks, Frank! You personally helped more Cloud TPU and TRC users than I can count, and you always came through when something needed to get done fast. I really appreciated it!
What are some of the things on the roadmap for the platform? Any immediate plans to close the command-line gap for TPU utilization, etc.?
My overall impression is that TPUs are pretty awesome, but the software stack is still a bit hard to use compared to mature GPU tooling. I'd imagine it's pretty hard for inexperienced users to get started.
If you haven't used Cloud TPUs in a while, I'd encourage you to try them now with TPU VMs and the latest versions of JAX, PyTorch / XLA, or TensorFlow. We've gotten a lot of positive feedback from customers and TRC users, so we think the overall experience has improved a lot, though there's always more we want to do.
People especially seem to find Cloud TPUs easy to use in comparison to alternatives when they are scaling up ML training workloads. Once you have a model running on a single TPU core, it is relatively straightforward from a systems perspective to scale it out to thousands of cores. You still need to work through the ML challenges of scaling, but that is more tractable when you aren't simultaneously struggling with systems-level issues.
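As a concrete illustration, here's a minimal data-parallel training step in JAX. The names, model, and shapes are illustrative, not from a real codebase: the point is that the same train_step runs on one core or many, with a single jax.lax.pmean collective doing the cross-core gradient averaging over the interconnect instead of a stack of hand-written networking code:

    import jax
    import jax.numpy as jnp

    def loss_fn(params, batch):
        preds = batch['x'] @ params['w']
        return jnp.mean((preds - batch['y']) ** 2)

    def train_step(params, batch):
        grads = jax.grad(loss_fn)(params, batch)
        # All-reduce gradients across every core over the interconnect;
        # this same line covers 8 cores or thousands.
        grads = jax.lax.pmean(grads, axis_name='cores')
        return jax.tree_util.tree_map(lambda p, g: p - 1e-3 * g,
                                      params, grads)

    p_train_step = jax.pmap(train_step, axis_name='cores')

    n = jax.local_device_count()
    # Replicate the params onto each local device, and shard the
    # batch's leading axis so each core gets its own slice.
    params = jax.device_put_replicated({'w': jnp.zeros((4, 1))},
                                       jax.local_devices())
    batch = {'x': jnp.ones((n, 32, 4)), 'y': jnp.ones((n, 32, 1))}
    params = p_train_step(params, batch)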
In particular, you don't need to master a sequence of different networking technologies as you scale up, and the TPU interconnect is so much faster at scale than other technologies (10X last time I checked) that you don't have to work as hard to avoid network bottlenecks. Support for model parallelism on Cloud TPUs is improving across the ML frameworks, too.
To be clear, training ML models at scales that we currently consider large is still very challenging on any platform - for example, the logbooks that Meta recently published are fascinating:
https://github.com/facebookresearch/metaseq/blob/main/projec...
Nice achievement, and great that you're still around after all that time to see it through to GA. Congrats! It must be a bit like watching your child cycle to school alone for the first time :)
Thanks very much! We've come a long way, but there is always more interesting work required to keep up with the deep learning frontier and enable Cloud TPU customers and TRC users to expand it further.
No, Vectorflow isn't supported out of the box, and I'm not sure the workloads it targets are the right fit for Cloud TPU hardware. However, be sure to check out the "Ranking and recommendation" section of the blog post linked above: Cloud TPUs can accelerate the ML models with very large embeddings that are increasingly common in state-of-the-art ranking and recommendation systems.