Testing HPC on Google’s TPU Matrix Engines

In an ideal platform cloud, you wouldn’t know or care about the underlying hardware and how it was composed to run your HPC – and now AI – applications. The underlying hardware in a cloud would have a mix of different types of compute and storage, an all-to-all network lashing it together, and anything you need could be dialed in on the fly.

This is precisely the kind of compute cloud that Google set out to build in April 2008 with App Engine and that, as it turned out, very few organizations wanted to buy. Companies cared – and still care – about the underlying infrastructure, but at the same time, Google still believes in its heart in the platform cloud. And that’s one of the reasons its Tensor Processing Unit, or TPU, compute engines are only available on Google Cloud. (You could argue, though, that the GroqChip matrix math units available through Groq are as much an architectural copy of the TPU as Kubernetes is of Google’s Borg container and cluster controller, Hadoop is of Google’s MapReduce analytics and storage, or CockroachDB is of Google’s Spanner SQL database.)

When you put a cluster of 2,048 TPU cores on a tightly coupled toroidal mesh network, with a combined 32TB of HBM2 memory and over 100 petaflops of single-precision floating-point math (with mixed-precision support for inference), on the cloud for people to run applications, it’s only natural that people running HPC applications will tweak their codes to use the TPU.

Over the past few years, a number of such experiments have been conducted, and the latest is by Google itself, showing how to speed up the fluid dynamics associated with predicting river floods. There are many more examples, which we’ll discuss in a moment, and it will be interesting to see whether the TPU and its clones find a home in HPC or whether this is just a bunch of science projects, however worthy of investigation.

We would point out that this is precisely how, back in 2006, mathematical algorithms and then portions of HPC codes were offloaded from CPUs to GPUs, triggering a whole revolution from which the explosion of AI several years later benefited. Now it’s AI that drives technologies like the TPU, and it’s HPC that can benefit.

In the most recent Google paper, which you can read here, researchers ported the math behind hydrodynamic models from CPUs to TPUs and measured the performance of a TPU core against a relatively modern X86 CPU core. (Comparisons with GPUs weren’t given, even though GPUs are obviously on hand, both on Google Cloud, which sells raw GPU capacity, and in the massive internal GPU farms that Google uses for various parts of its machine learning training stack.) While the TPUv4 matrix engine has been in development and pre-production for quite some time now, as far as we know it has not been rolled out on Google Cloud, and the TPUv3 matrix engines, which we have profiled here, are the only ones that people can kick the tires on.

Here’s the interesting thing that came out when we read this paper: the HPC codes to run the flood simulation were not ported from a parallel stack running, say, Fortran code and using OpenMP and MPI. Google went straight to the Saint-Venant shallow water partial differential equations used to simulate liquid flow over topography and implemented them in Python on its TensorFlow machine learning framework, using its Accelerated Linear Algebra (XLA) compiler. It is important to note that Google created a fully 2D model of the river flooding rather than the hybrid 1D-2D model that is typically run on CPU-only systems, which are computationally weak compared to a pod of TPUs. This is what the TPU simulation flow looks like for this flood simulation:

[Figure: Google TPU HPC workflow]

Google has open sourced the code behind this simulation to help HPC researchers see how this flood model works and perhaps do their own work in other parts of the HPC space. You can download this code here. There are some tricky things you need to do to set up the initial boundary conditions for any simulation, and this Google-made app handles that. And the collective operations within the TensorFlow framework, enabled by the TPUv3 network, seem to be doing a good job of determining the height of the water and its flow through the simulated topography, based on visual inspection of imagery and comparison with actual data from the real flood that was simulated.
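To give a feel for what stepping the shallow water equations on a grid involves, here is a minimal sketch in Python with NumPy. It is not Google’s code and makes several simplifying assumptions that are ours: a flat, frictionless bed (so no topography source terms), periodic boundaries, and the simple but diffusive Lax-Friedrichs scheme. The function names and the test case, a mound of water collapsing on an 8-meter grid, are hypothetical and chosen only for illustration.

```python
import numpy as np

GRAVITY = 9.81  # gravitational acceleration in m/s^2


def fluxes(h, hu, hv):
    """Return the x- and y-direction fluxes of the state (h, hu, hv)."""
    u = hu / h
    v = hv / h
    pressure = 0.5 * GRAVITY * h * h
    flux_x = (hu, hu * u + pressure, hu * v)
    flux_y = (hv, hv * u, hv * v + pressure)
    return flux_x, flux_y


def lax_friedrichs_step(h, hu, hv, dx, dt):
    """Advance the state one time step on a periodic square grid.

    Assumption: flat bed and no friction, so there are no source terms.
    """
    flux_x, flux_y = fluxes(h, hu, hv)
    updated = []
    for q, f, g in zip((h, hu, hv), flux_x, flux_y):
        # Average of the four neighbors (axis 0 is y, axis 1 is x).
        neighbors = 0.25 * (
            np.roll(q, 1, axis=0) + np.roll(q, -1, axis=0)
            + np.roll(q, 1, axis=1) + np.roll(q, -1, axis=1)
        )
        # Centered differences of the fluxes in x and y.
        dfdx = (np.roll(f, -1, axis=1) - np.roll(f, 1, axis=1)) / (2.0 * dx)
        dgdy = (np.roll(g, -1, axis=0) - np.roll(g, 1, axis=0)) / (2.0 * dx)
        updated.append(neighbors - dt * (dfdx + dgdy))
    return tuple(updated)


def simulate(n=64, dx=8.0, steps=50):
    """Let a Gaussian mound of water collapse; return initial and final depth."""
    coords = (np.arange(n) - n / 2) * dx
    x, y = np.meshgrid(coords, coords)
    h0 = 1.0 + 0.5 * np.exp(-(x**2 + y**2) / (2.0 * (4 * dx) ** 2))
    h, hu, hv = h0.copy(), np.zeros_like(h0), np.zeros_like(h0)
    # Conservative time step based on the gravity-wave speed sqrt(g * h).
    dt = 0.2 * dx / np.sqrt(GRAVITY * h0.max())
    for _ in range(steps):
        h, hu, hv = lax_friedrichs_step(h, hu, hv, dx, dt)
    return h0, h


if __name__ == "__main__":
    h0, h = simulate()
    print("relative mass change:", abs(h.sum() - h0.sum()) / h0.sum())
```

Every update here is an elementwise operation or a shift over whole arrays, which is the kind of work that maps naturally onto a framework like TensorFlow. A production flood model would add the terrain, friction, and real inflow boundary conditions that this sketch leaves out.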

As with any fluid flow simulation, resolution is important, but it comes at a high computational cost. Low resolution models give you a feel, but high resolution models give you something that looks and feels more like data. So Google ran its simulations on X86 CPU cores and TPU cores at 8-meter, 4-meter, and 2-meter grid resolutions, then pushed a quarter of a TPU pod to deliver 1-meter grid resolution. The simulation was performed on a section of the Arkansas River that flooded in May 2019. Google tested the resolution scaling against different size slices of the TPU pod, ranging from a single TPU core up to a quarter pod with 512 TPU cores. The datasets ranged from 15.5 million grid points at 8-meter resolution to 1 billion grid points at 1-meter resolution.
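The jump from 15.5 million to 1 billion grid points follows from simple geometry: on a 2D grid, halving the cell size quadruples the number of cells. Here is a quick check of that arithmetic in Python. The function name is ours, and the intermediate 4-meter and 2-meter counts are derived from the 8-meter figure rather than quoted from the paper.

```python
def grid_points(base_points, base_res_m, target_res_m):
    """Scale a 2D grid-point count from one cell size to another."""
    refinement = base_res_m / target_res_m
    return base_points * refinement ** 2


BASE_POINTS = 15.5e6  # grid points at 8-meter resolution, from the article

if __name__ == "__main__":
    for res in (8, 4, 2, 1):
        points = grid_points(BASE_POINTS, 8, res)
        print(f"{res} m grid: {points / 1e6:.1f} million points")
    # Share of the 1-meter grid handled by each core of a 512-core quarter pod.
    per_core = grid_points(BASE_POINTS, 8, 1) / 512
    print(f"points per core at 1 m on 512 cores: {per_core / 1e6:.2f} million")
```

Going from 8 meters to 1 meter is a factor of 64, which takes 15.5 million points to 992 million, or roughly the 1 billion cited.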

Here’s how the river flood simulation performed on different TPU compute sizes and at different resolutions:

[Table: Google TPU HPC simulation scaling]

For some reason, Google didn’t run this flood simulation on an entire TPU pod. As you can see in the table above, there were diminishing returns going from 128 TPU cores to 512 TPU cores at the 8-meter resolution, but at the finer resolutions the scaling still held up reasonably well as more compute was added. Even so, the scaling was dropping off pretty quickly, and maybe Google didn’t want to talk about it. OK, we think Google definitely didn’t want to talk about it. But we realize that it’s difficult to run simulations at the full scale of the iron on any supercomputer, and that on a second pass Google would no doubt be able to do better at larger scale. Just like real HPC shops do with their simulations.

So how well did the TPU predict the flood? Well enough to tell emergency responders where the hot spots were going to be, we think. Here is an aerial image of the flooding over a section of the Arkansas River at 1-meter resolution, showing the actual flood extent:

[Figure: Aerial image of the actual flood extent]

And here is where the simulation predicted the flooding would be, based on river flow similar to what happened during the flood:

[Figure: Simulated flood extent]

The other interesting piece of research done by Google was to run the same simulation on CPUs as well as on its TPUs, using the same code stack and simply swapping out the XLA compiler used for the TPU for the Eigen C++ template library for linear algebra on CPUs running Linux.

Here’s how the CPU stacked up against the TPU on a per-core level:

[Table: Google TPU HPC simulation, CPU versus TPU]

The processor in question here is a “Cascade Lake” Xeon SP-8273CL Platinum, which has 28 cores running at 2.2 GHz and is rated at around 2 teraflops at FP32 single precision. (Single-precision floating-point performance ratings in IEEE FP32 format for the TPUs have never been published.) The performance difference per core is well over 500X, which stands to reason given the number and the size of the MXU matrix math units in the TPU cores. Each TPUv3 core has two 128×128 matrix math units, and each TPUv3 chip has two cores; there are four TPUv3 chips per motherboard and 256 motherboards per TPUv3 pod. (By the way, the TPUv2 had half the HBM memory per core, at 8GB, and half the MXU units per core, at one each, compared to the TPUv3. So the speedup on TPUv2 iron, which is still available on Google Cloud, would be about half as much per core compared to the X86 iron.)
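The figures above are consistent with the pod-level numbers quoted earlier, which a few lines of Python confirm. The constant and function names are ours; the 16GB of HBM per TPUv3 core is inferred from the statement that the TPUv2 had half as much at 8GB, and we assume 1TB equals 1,024GB.

```python
CORES_PER_CHIP = 2
CHIPS_PER_BOARD = 4
BOARDS_PER_POD = 256
MXUS_PER_CORE = 2
MXU_DIM = 128            # each matrix unit is a 128 x 128 array
HBM_GB_PER_CORE = 16     # inferred: twice the 8GB per core of the TPUv2

CPU_CORES = 28
CPU_FP32_TFLOPS = 2.0    # approximate rating for the whole Xeon socket


def tpuv3_pod_summary():
    """Derive pod-level totals from the per-core and per-board figures."""
    cores = CORES_PER_CHIP * CHIPS_PER_BOARD * BOARDS_PER_POD
    return {
        "cores_per_pod": cores,
        "cores_per_quarter_pod": cores // 4,
        "hbm_tb_per_pod": cores * HBM_GB_PER_CORE / 1024,
        "mxu_cells_per_core": MXUS_PER_CORE * MXU_DIM * MXU_DIM,
    }


def cpu_gflops_per_core():
    """Approximate FP32 gigaflops available to a single Xeon core."""
    return CPU_FP32_TFLOPS * 1000 / CPU_CORES


if __name__ == "__main__":
    for name, value in tpuv3_pod_summary().items():
        print(f"{name}: {value}")
    print(f"cpu_gflops_per_core: {cpu_gflops_per_core():.1f}")
```

That works out to 2,048 cores and 32TB of HBM per pod, 512 cores in the quarter pod used for the 1-meter run, and 32,768 multiply-accumulate cells per TPUv3 core, against roughly 71 gigaflops of FP32 for each Xeon core.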

Google hasn’t shown how the X86 servers could be clustered and scaled. And it certainly didn’t talk about the cost of running the simulation on a CPU cluster versus a TPU cluster in a timely fashion, as you need for weather and emergency management simulations. But, given this data and a lot of guesswork, HPC shops can start thinking about what that cost might be. (We might do this work ourselves when the newsfeed slows in July, just for fun, finding out how a more modern cluster using AMD “Milan” Epyc 7003 processors might compare to rented capacity on TPUv3 pods and TPUv4. Hmmmm.)

As we pointed out above, the number of HPC codes that have been ported to the TPU to speed them up is increasing, and Google isn’t doing all the work, because the HPC community is curious. Here are the papers that we could find without hurting the search engine too much:

  • Large Scale Distributed Linear Algebra with Tensor Processing Units (Google, December 2021)
  • Molecular Dynamics Simulations on Cloud Computing and Machine Learning Platforms (Indiana University, November 2021)
  • Non-Uniform Fast Fourier Transform on TPUs (Google, Mass General, Harvard University, April 2021)
  • Large Scale Discrete Fourier Transform on TPUs (Google, December 2020)
  • Accelerating MRI Reconstruction on TPUs (Google, June 2020)
  • Tensor Processing Units for Financial Monte Carlo (Google, January 2020)
  • High Performance Monte Carlo Simulation of the Ising Model on TPU Clusters (Google, 2019)

Wouldn’t it be funny if, after going through all this with processors and accelerators, we ended up with an architecture that looks like an 80286 processor with a massively parallel set of 80287 coprocessors to do its math homework on? IBM did the same thing with its six-way System/3090 mainframes, slapping a vector math unit on each engine, back in 1989, when we were just getting started in this datacenter racket and Cray was first winning commercial customers in the enterprise. Everything will depend on the software that gets developed, of course.

And one final thought: any code created to speed up HPC on TPUs would probably be relatively easy to move to the matrix math engines created by Cerebras, SambaNova, and Graphcore as well as Groq.

