Graphics processing units (GPUs) and tensor processing units (TPUs) are specialized ICs that support different types of artificial intelligence (AI) and machine learning (ML) algorithms. This article begins with a brief overview of tensors and where they fit into the mathematics of AI and ML; it then looks at the different structures of GPUs and TPUs and how they are used for AI and ML applications, and it closes by considering where central processing units (CPUs) fit in.
In mathematics, a tensor is a generalization of objects like scalars, vectors, and matrices, which have ranks, or dimensions, of 0, 1, and 2, respectively. In AI and ML, the term tensor is usually applied to mathematical objects with ranks of 3 or greater (Figure 1).
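The rank of a tensor corresponds directly to the number of array dimensions used to store it. As a minimal sketch, assuming NumPy is used purely to illustrate the idea:

```python
import numpy as np

scalar = np.array(3.0)              # rank 0
vector = np.array([1.0, 2.0, 3.0])  # rank 1
matrix = np.ones((2, 3))            # rank 2
tensor = np.ones((2, 3, 4))         # rank 3

# ndim reports the rank of each object
print(scalar.ndim, vector.ndim, matrix.ndim, tensor.ndim)  # 0 1 2 3
```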
Visually, a tensor with a rank of 3 can be pictured as a cube, a tensor with a rank of 4 as a series of stacked cubes, and so on. In addition, the values in a tensor are usually related, so the tensor transforms as a single object under operations like scaling or rotation.
Tensors are useful for representing complex data sets in machine learning, such as audio, images, and video. For example, a two-dimensional image can be represented by a tensor with a rank of 3, where the dimensions correspond to height, width, and color channel. A video can be represented by a tensor with a rank of 4, with the added dimension corresponding to time.
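The image and video examples map directly onto array shapes. The sketch below assumes a 480×640 RGB image and a 30-frame clip, chosen only for illustration:

```python
import numpy as np

# A 480x640 RGB image as a rank-3 tensor: (height, width, color channels)
image = np.zeros((480, 640, 3), dtype=np.uint8)

# A 30-frame video clip as a rank-4 tensor: (time, height, width, color channels)
video = np.zeros((30, 480, 640, 3), dtype=np.uint8)

print(image.ndim, video.ndim)  # 3 4
```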
In ML, tensors can be used to calculate and manipulate characteristics of data like edges and gradients. Gradients are particularly useful for training models that use backpropagation. (For more information on backpropagation, see the article “How does a recurrent neural network remember?”)
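To make the role of gradients concrete, here is a minimal sketch of one gradient-descent step for a linear model with a mean-squared-error loss; the data, learning rate, and function name are illustrative assumptions, and backpropagation applies the same idea layer by layer in a deep network:

```python
import numpy as np

# Toy data: 4 samples with 3 features; w is the weight vector being trained.
X = np.random.randn(4, 3)
y = np.random.randn(4)
w = np.zeros(3)

def mse_gradient(X, y, w):
    # Gradient of the mean squared error loss with respect to w.
    error = X @ w - y
    return 2.0 / len(y) * X.T @ error

w -= 0.1 * mse_gradient(X, y, w)  # one gradient-descent update
```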
How do GPUs work?
High-performance arithmetic logic units (ALUs) are the primary building blocks of GPUs and provide the computational power needed to implement the complex algorithms used in AI and ML. GPUs are built around parallel computing and can perform thousands of operations simultaneously.
NVIDIA GPUs employ the compute unified device architecture (CUDA), in which individual processing units execute parallel tasks, enabling the GPU to process large datasets and perform many calculations at once. CUDA relies on a single instruction, multiple data (SIMD) style of execution (NVIDIA’s variant is called single instruction, multiple threads, or SIMT) to run the same instruction across numerous data points in parallel.
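The SIMD idea can be illustrated without writing actual GPU kernel code. In the NumPy analogy below (an illustration, not CUDA), the vectorized form applies one “add” operation across an entire array of data points, which is conceptually what a GPU does across its parallel lanes:

```python
import numpy as np

a = np.arange(100_000, dtype=np.float32)
b = np.arange(100_000, dtype=np.float32)

# Scalar view: one instruction issued per element.
c_scalar = np.empty_like(a)
for i in range(len(a)):
    c_scalar[i] = a[i] + b[i]

# SIMD-style view: one "add" applied across all elements at once.
c_vector = a + b

assert np.allclose(c_scalar, c_vector)
```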
In addition to thousands of ALU compute units, a typical GPU includes several memory types, including global memory, shared memory, and registers, plus a high bandwidth memory (HBM) interface. Global memory is large but has relatively high latency, while shared memory is fast but limited in size.
The compute units operate in groups called streaming multiprocessors (SMs) that share access to global memory. Each SM includes multiple scalar processors, a scheduler, and its own shared memory that enables data sharing and synchronization between threads. The HBM interface connects the GPU’s compute units to its global memory and provides the high throughput needed in AI applications.
How do TPUs work?
TPUs are matrix processors specifically designed for neural network workloads. A matrix multiplier unit (MXU) is the primary building block of a TPU; it is designed to perform large numbers of multiply-accumulate operations in parallel. A TPU also has a unified buffer, high-throughput on-chip HBM, and an activation unit that applies non-linear functions.
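A matrix multiply is, at its core, a grid of multiply-accumulate (MAC) operations, which is what the MXU parallelizes in hardware. The following sketch, assuming NumPy and small illustrative matrix sizes, spells that structure out:

```python
import numpy as np

A = np.random.randn(4, 8)
B = np.random.randn(8, 5)

# Each output element accumulates products of a row of A and a column of B.
C = np.zeros((4, 5))
for i in range(4):
    for j in range(5):
        for k in range(8):
            C[i, j] += A[i, k] * B[k, j]  # one multiply-accumulate

assert np.allclose(C, A @ B)  # an MXU performs many of these MACs in parallel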
The on-chip HBM can hold large models, and TPUs can be connected into groups called Pods, which simplifies scaling to larger workloads. The DRAM operates as a single unit in parallel to feed the large number of weights used by the matrix multiplication section.
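Scaling across a Pod is often done in a data-parallel fashion: the same weights are applied to different slices of the input on different chips. The NumPy sketch below is only a conceptual analogy for that split, not actual Pod software:

```python
import numpy as np

weights = np.random.randn(512, 128)  # shared model weights
batch = np.random.randn(1024, 512)   # large input batch

# Split the batch across several "devices" and run the same multiply on each shard.
shards = np.array_split(batch, 4)
partial_results = [shard @ weights for shard in shards]
result = np.concatenate(partial_results)

assert np.allclose(result, batch @ weights)
```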
TPUs are often used as coprocessors or accelerators that execute commands received from the host processor, usually a CPU. The host is responsible for data preprocessing and delivering the data to the TPU. In addition, TPU boards are linked to CPU-based host systems to perform tasks that the TPUs aren’t designed to handle.
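The division of labor looks roughly like the sketch below, assuming NumPy and hypothetical function names (preprocess_on_host, run_on_accelerator) chosen only to mark which side does which work:

```python
import numpy as np

def preprocess_on_host(raw):
    # Host CPU work: cast and normalize the input data before hand-off.
    x = raw.astype(np.float32)
    return (x - x.mean()) / (x.std() + 1e-8)

def run_on_accelerator(x, weights):
    # Stand-in for the work dispatched to the TPU: dense matrix math plus activation.
    return np.maximum(x @ weights, 0.0)  # matrix multiply + ReLU

raw = np.random.randint(0, 255, size=(32, 784))
weights = np.random.randn(784, 10).astype(np.float32)
outputs = run_on_accelerator(preprocess_on_host(raw), weights)
```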
Where do CPUs fit in?
CPUs are designed for scalar operations, compared with the vector operations in a GPU and the matrix operations in a TPU (Figure 2). CPUs are often used with TPUs as the “control” processor. They feature several cores, but not thousands like GPUs. CPUs excel at serial processing and can handle only a small number of simultaneous (parallel) operations, but they are more flexible than GPUs or TPUs and can be easily programmed for a wide range of computational tasks.
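The scalar/vector/matrix distinction can be seen in the style of the operations themselves. The sketch below, again using NumPy purely as an illustration, shows the three styles side by side:

```python
import numpy as np

x = np.random.randn(1000)
W = np.random.randn(1000, 1000)

# Scalar style (CPU strength): one value processed at a time.
total = 0.0
for value in x:
    total += value * value

# Vector style (GPU strength): one operation applied over a whole array.
total_vec = np.dot(x, x)

# Matrix style (TPU strength): dense matrix-vector or matrix-matrix math.
y = W @ x

assert np.isclose(total, total_vec)
```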
Summary
GPUs and TPUs are specialized processors developed to support AI and ML algorithms. GPUs are designed to efficiently perform large numbers of vector operations in parallel, while TPUs implement matrix operations. GPUs are used as standalone processors, while TPUs act as accelerators or coprocessors, usually in combination with a CPU.