Google Stands Up Exascale TPUv4 Pods On The Cloud

It’s Google I/O 2022 this week, among other things, and we were hoping for a deep architectural dive into the TPUv4 matrix math engines that Google talked about at last year’s I/O event. But, alas, no such luck. Still, the search engine and advertising giant, which also happens to be one of the biggest AI innovators on the planet thanks to the gigantic amount of data it needs to chew on, did give out some additional information about the TPUv4 processors and the systems that use them.

Google also said it was installing eight pods of the TPUv4 systems in its Mayes County, Oklahoma, data center, which together deliver nearly 9 exaflops of aggregate compute capacity, for use by its Google Cloud arm so that researchers and businesses have access to the same type and capacity of computing that Google uses for its own in-house AI development and production.
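The “nearly 9 exaflops” figure checks out with simple multiplication. Here is a quick sketch, assuming the 275 teraflops BF16 peak per TPUv4 chip that Google has published in its Cloud TPU specs (the per-chip number is our input, not something stated in this article):

```python
# Back-of-the-envelope check on the "nearly 9 exaflops" claim for the
# Mayes County hub. Assumes 275 teraflops BF16 peak per TPUv4 chip,
# per Google's published Cloud TPU specs.
PODS = 8
CHIPS_PER_POD = 4_096
PEAK_TFLOPS_PER_CHIP = 275  # BF16 peak per TPUv4 chip (our assumption)

pod_pflops = CHIPS_PER_POD * PEAK_TFLOPS_PER_CHIP / 1_000   # petaflops per pod
hub_eflops = PODS * pod_pflops / 1_000                      # exaflops for the hub

print(f"Per pod: {pod_pflops:,.0f} petaflops")   # ~1,126 petaflops (~1.1 EF)
print(f"Hub total: {hub_eflops:.2f} exaflops")   # ~9.01 exaflops
```

Eight pods at roughly 1.1 exaflops each lands right at the “nearly 9 exaflops” Pichai cited.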

Google has operated data centers in Mayes County, northeast of Tulsa, since 2007 and has invested $4.4 billion in facilities there since then. The site is located in the geographic center of the United States (well, a little south and west of it), and that makes it useful because of the relatively short latencies to much of the country. And now, by definition, Mayes County has one of the largest concentrations of iron for accelerating AI workloads on the planet. (If all eight TPUv4 pods were networked together and the work could scale across them simultaneously, perhaps we could say “largest” unequivocally. . . . Google surely did, as you will see in the quote below.)

During his keynote, Sundar Pichai, who is chief executive of Google and its parent company, Alphabet, mentioned in passing that TPUv4 pods were in preview on its cloud.

[Image: Google I/O slide showing the Mayes County AI hub]

“All of the advances we shared today are only possible through continued innovation in our infrastructure,” Pichai said of the pretty cool natural language enhancements and immersive data search engines that Google has built and that feed all kinds of applications. “Recently, we announced our intention to invest $9.5 billion in data centers and offices across the United States. One of our state-of-the-art data centers is in Mayes County, Oklahoma, and I’m thrilled to announce that we’re launching the world’s largest publicly accessible machine learning center there for all of our Google Cloud customers. This machine learning hub features eight Cloud TPU v4 pods, custom-built on the same network infrastructure that powers Google’s largest neural models. They deliver nearly 9 exaflops of aggregate computing power, giving our customers unprecedented ability to run complex models and workloads. We hope this will fuel innovation in everything from medicine to logistics to sustainability and more.”

Pichai added that this TPUv4 pod-based AI hub already has 90% of its power coming from sustainable, carbon-free sources. (He didn’t say how much wind, solar or hydropower there was.)

Before we get into the speeds and feeds of the TPUv4 chips and pods, it is probably worth pointing out that, for all we know, Google already has TPUv5 pods in its internal data centers, and it might have a considerably larger collection of TPUs to drive its own models and augment its own applications with AI algorithms and routines. This would be the old way Google did things: talk about generation N of something while it was selling generation N-1 and had already moved on to generation N+1 for its internal workloads.

This does not seem to be the case. In a blog post written by Sachin Gupta, vice president and general manager of infrastructure at Google Cloud, and Max Sapozhnikov, product manager for Cloud TPUs, the pair said that when the TPUv4 systems were built last year, Google gave early access to researchers from Cohere, LG AI Research, Meta AI, and Salesforce Research, and they added that TPUv4 systems were used to create the Pathways Language Model (PaLM) that underpins the natural language processing and speech recognition innovations at the heart of today’s keynote. Specifically, PaLM was developed and tested on two TPUv4 pods, each of which has 4,096 TPUv4 matrix math engines.

If Google’s brightest new models are being developed on TPUv4s, it probably does not have a fleet of TPUv5s hiding in a data center somewhere. Although, we will add, it would be amusing if TPUv5 machines were hiding 26.7 miles southwest of our office, in the Lenoir data center, shown here from our window:

[Image: the Google data center in Lenoir, North Carolina, seen from a distance]

The strip of gray down the mountain, under the birch leaves, is the Google data center. If you squint and stare hard into the distance, the Apple data center in Maiden is off to the left and considerably further down the line.

Enough of that. Let’s talk about some feeds and speeds. Here, finally, are some specs comparing TPUv4 to TPUv3:

[Table: TPUv4 versus TPUv3 specifications]

Last year, when Pichai was hinting at TPUv4, we guessed that Google was moving to 7 nanometer processes for this generation of TPU, but given the very low power consumption, it seems it is probably etched in 5 nanometer processes. (We assumed Google was trying to keep the power envelope constant, but it clearly wanted to lower it.) We also guessed that it was doubling the core count, from two cores on the TPUv3 to four cores on the TPUv4, which Google has neither confirmed nor denied.

Doubling the cores while holding per-core performance steady would allow TPUv4 to reach 246 teraflops per chip, and going from 16 nanometers to 7 nanometers would allow roughly double the performance in the same power envelope at roughly the same clock speed. Going to 5 nanometers allows the chip to be smaller and run a bit faster while reducing power consumption, and potentially to yield better as 5 nanometer processes mature. That average power consumption decreased by 22.7 percent while clock speed increased by 11.8 percent is consistent with a jump of two process nodes and change from TPUv3 to TPUv4.
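Those percentages can be worked backwards. Here is a minimal sketch, assuming the roughly 220 watt average power for TPUv3 and 170 watts for TPUv4, and the roughly 940 MHz versus 1,050 MHz clocks, that have circulated publicly; these specific input numbers are our assumption, not figures from this article:

```python
# Working backwards to the power and clock deltas cited above.
# Input figures (watts, MHz) are assumed, not confirmed by Google here.
tpuv3_watts, tpuv4_watts = 220, 170
tpuv3_mhz, tpuv4_mhz = 940, 1050

power_drop = (tpuv3_watts - tpuv4_watts) / tpuv3_watts * 100
clock_gain = (tpuv4_mhz - tpuv3_mhz) / tpuv3_mhz * 100

print(f"Power drop: {power_drop:.1f}%")   # 22.7%, matching the article
print(f"Clock gain: {clock_gain:.1f}%")   # ~11.7%, close to the cited 11.8%
```

The power figure reproduces the 22.7 percent drop exactly, and the clock figure lands within rounding distance of the 11.8 percent increase, which suggests the assumed inputs are about right.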

There are some very interesting things in this table and in the statements that Google makes in this blog.

Aside from the 2X cores and the slight increase in clock speed brought by the chip manufacturing process for the TPUv4, it is worth noting that Google has kept the memory capacity at 32 GB and has not moved to the HBM3 memory that Nvidia uses with its “Hopper” GH100 GPU accelerators. Nvidia is obsessed with memory bandwidth on devices and, by extension with its NVLink and NVSwitch interconnects, memory bandwidth within nodes and now across nodes, with up to 256 devices in a single fabric.

Google is not as hung up (as far as we know) on memory atomics across its proprietary TPU interconnect, device memory bandwidth, or device memory capacity. The TPUv4 has the same 32 GB capacity as the TPUv3, it uses the same HBM2 memory, and it only has a 33 percent increase in speed, to just under 1.2 TB/sec. What Google cares about is the bandwidth on the TPU pod interconnect, which moves to a 3D torus design that tightly couples 64 TPUv4 chips with “wraparound connections,” which was not possible with the 2D torus interconnect used with TPUv3 pods. The extra dimension in the torus interconnect allows more TPUs to be drawn into a tighter subnetwork for collective operations. (Which raises the question: why not a 4D, or 5D, or 6D torus then?)
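The wraparound idea is easy to see in code. Here is a sketch of the neighbor math for a 4x4x4 block of 64 chips; the cube size comes from the 64-chip figure above, but the coordinate scheme is purely our illustration, not Google’s actual addressing:

```python
# Sketch of wraparound neighbors in a 3D torus. A 4x4x4 block of 64 chips
# gives every chip six direct neighbors; the modulo arithmetic is the
# "wraparound connection", so no chip ever sits on a boundary.
DIM = 4  # 4 x 4 x 4 = 64 chips

def torus_neighbors(x, y, z, dim=DIM):
    """Return the six wraparound neighbors of chip (x, y, z)."""
    return [
        ((x + dx) % dim, (y + dy) % dim, (z + dz) % dim)
        for dx, dy, dz in [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
                           (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    ]

# Even a corner chip has six neighbors thanks to the wraparound links:
print(torus_neighbors(0, 0, 0))
# [(1, 0, 0), (3, 0, 0), (0, 1, 0), (0, 3, 0), (0, 0, 1), (0, 0, 3)]
```

In a plain 2D or 3D mesh, that corner chip would have only two or three neighbors; the torus wraparound is what keeps hop counts low and collective operations fast.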

The TPUv4 pod has 4X the TPU chips, at 4,096, and twice the TPU cores per chip, which we estimate puts the pod at 16,384 cores; we believe that Google has kept the number of MXU matrix math units at two per core, but that is just a hunch. Google could keep the same number of TPU cores and double up the MXU units and get the same raw performance; the difference would be the amount of front-end scalar/vector processing feeding those MXUs. In any event, in the 16-bit BrainFloat (BF16) floating point format created by the Google Brain unit, the TPUv4 pod delivers 1.1 exaflops, compared to just 126 petaflops in BF16 for the TPUv3 pod. That is an 8.7X raw compute factor, which is backed up by a 3.3X increase in all-reduce bandwidth across the pod and a 3.75X increase in bisection bandwidth across the TPUv4 interconnect in the pod.
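The pod-level ratios above can be recomputed directly from the figures in the paragraph. In this sketch, the four-cores-per-chip count is our estimate, not a Google-confirmed number:

```python
# Recomputing the pod-level numbers from the figures cited above.
# The per-chip core count of four is our estimate, not confirmed by Google.
TPUV4_CHIPS_PER_POD = 4_096
EST_CORES_PER_CHIP = 4
tpuv4_pod_cores = TPUV4_CHIPS_PER_POD * EST_CORES_PER_CHIP  # 16,384 cores

TPUV4_POD_PFLOPS = 1_100  # 1.1 exaflops in BF16
TPUV3_POD_PFLOPS = 126

compute_factor = TPUV4_POD_PFLOPS / TPUV3_POD_PFLOPS
print(f"{tpuv4_pod_cores:,} cores, {compute_factor:.1f}X raw compute over TPUv3")
# 16,384 cores, 8.7X raw compute over TPUv3
```

Note that the raw compute grew 8.7X while the all-reduce and bisection bandwidths grew only 3.3X and 3.75X, so the interconnect is relatively more taxed per flop than it was in the TPUv3 generation.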

This bit in the blog post intrigued us: “Each Cloud TPU v4 chip has ~2.2x more peak FLOPs than Cloud TPU v3, for ~1.4x more peak FLOPs per dollar.” If you do the math on that statement, it means the price of renting TPU capacity on Google Cloud has gone up around 60 percent with TPUv4, but it does 2.2X the work. These price and performance jumps are absolutely consistent with the kind of price/performance improvement that Google expects from the switch ASICs it buys for its data centers, which typically offer 2X the bandwidth for 1.3X to 1.5X the cost. The TPUv4 is a bit more expensive, but it has a better network for running larger models, and that comes at a cost, too.
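That “around 60 percent” inference falls straight out of Google’s two ratios, with a little slack for the tildes on Google’s numbers:

```python
# The price inference from Google's statement: ~2.2x the peak FLOPs for
# ~1.4x the peak FLOPs per dollar implies the per-chip rental price rose
# by roughly 2.2 / 1.4.
flops_ratio = 2.2            # TPUv4 peak FLOPs vs TPUv3, per Google
flops_per_dollar_ratio = 1.4 # TPUv4 peak FLOPs per dollar vs TPUv3, per Google

price_ratio = flops_ratio / flops_per_dollar_ratio
print(f"Implied price increase: {(price_ratio - 1) * 100:.0f}%")  # ~57%
```

The exact quotient is about a 57 percent increase, which, given the approximation tildes on both of Google’s figures, is where the “around 60 percent” comes from.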

TPUv4 pods can be consumed as virtual machines on Google Cloud ranging in size from four chips to “thousands of chips,” and we assume that means up to an entire pod.

