Speed
The most important advances have happened because computing technology has become available to almost everyone, thanks to two ubiquitous applications: cryptocurrency mining and gaming.
Machine learning algorithms, and most prominently large models such as GPT-4, require immense computing power for their training, which is nowadays made available through GPU-based computing clusters. Also available now are TPUs, chips further specialized to perform AI-oriented mathematical operations. Computing power of this magnitude would have been unthinkable even a decade ago.
Graphics Processing Units (GPUs) are computing chips initially optimized for graphical applications, such as rendering efficiently to a monitor. Subsequent generations added 3D rendering capabilities to support the flourishing computer gaming industry and have been constantly enhanced with new capabilities, such as ray tracing.
Ray tracing is a rendering technique used in computer graphics to generate an image by tracing the path of light as pixels in an image plane and simulating the effects of its encounters with virtual objects. It is capable of producing a very high degree of visual realism, more so than typical rendering methods, but at a greater computational cost. This makes ray tracing best suited to applications where the image can be rendered slowly ahead of time, such as still images and film and television visual effects. Making it fast enough for near-real-time applications such as video games required major optimizations, an effort that has driven the creation of ever more capable GPUs for online and offline game rendering.
Machine learning methods and video games share a common computational core: both require very fast matrix multiplication and manipulation, neural networks even more so. This commonality is what enabled the use of GPUs in machine learning research and allowed the growth we have observed over the last few years.
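To make the commonality concrete, here is a minimal sketch (with illustrative sizes) showing that the core computation of a dense neural-network layer is exactly a matrix multiplication, the operation GPUs were already built to accelerate:

```python
import numpy as np

# A dense neural-network layer boils down to a matrix multiplication
# plus a bias; the sizes below are arbitrary, for illustration only.
rng = np.random.default_rng(0)

batch = rng.standard_normal((64, 128))      # 64 inputs, 128 features each
weights = rng.standard_normal((128, 256))   # a layer with 256 neurons
bias = np.zeros(256)

activations = batch @ weights + bias        # one matrix multiplication
print(activations.shape)                    # (64, 256)
```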
Efficiency
In machine learning, and most importantly in deep learning, there is a constant need to increase efficiency. Data sets keep growing and require heavy, specialized mathematical computation. The heavy lifting is done by efficient implementations of algorithms targeted at GPU processing. At the same time, there is an ever increasing effort to improve both the algorithms and the hardware so that each performs at the utmost of its capacity.
GPU Basics
The architecture of a GPU is like a large factory filled with several workstations, known as Streaming Multiprocessors (SMs). Each SM is capable of handling hundreds of tasks simultaneously, functioning as the backbone of the GPU's operations. Within each SM are smaller units known as CUDA cores, the workers responsible for carrying out the actual computations. The GPU offers different types of storage where data waits to be processed: global memory, which is large but slow, and shared memory, which is much smaller but faster. Additional types include constant memory and texture memory. Workers are organized into teams (of 32, by default) called warps. The tasks, also known as threads, are in turn organized into a grid of blocks. Each block, capable of holding up to 1024 threads, can be assigned to any SM, ensuring efficient distribution and execution of work within the GPU. The purpose of this architecture is to facilitate efficient matrix multiplication, which is usually performed on row or column blocks of the matrices.
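The block-wise decomposition can be sketched in plain Python. The tile loop below mirrors, in spirit only, how a GPU splits one large multiplication into independent output blocks, each of which a group of threads could compute from small tiles held in fast shared memory (the tile size of 4 is arbitrary):

```python
import numpy as np

def tiled_matmul(a, b, tile=4):
    """Multiply a @ b one tile (block) at a time.

    Illustrative sketch of blocked matrix multiplication: each (i, j)
    output block is accumulated from pairs of small input tiles, the
    way a GPU block would work out of shared memory.
    """
    n, k = a.shape
    k2, m = b.shape
    assert k == k2
    out = np.zeros((n, m))
    for i in range(0, n, tile):            # rows of the output block
        for j in range(0, m, tile):        # columns of the output block
            for p in range(0, k, tile):    # walk along the shared dimension
                out[i:i+tile, j:j+tile] += (
                    a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
                )
    return out

a = np.arange(64, dtype=float).reshape(8, 8)
b = np.eye(8)
print(np.allclose(tiled_matmul(a, b), a @ b))  # True
```

On a real GPU each of the inner block products runs concurrently rather than in a Python loop; the point here is only the decomposition into independent blocks.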
Machine Learning Processing
The Central Processing Unit (CPU) is the primary component of a computer and performs most of the processing inside it. It interprets and carries out the instructions of a computer program, and it manages the activities of all the hardware resources in the computer. In the context of GPU processing, the CPU orchestrates pre-processing, organizing the data, and moving it from storage to memory and in and out of the GPU.
Getting the data in and out of the GPU is the most time-consuming step. Data is organized into batches that can fit entirely in the GPU's global memory. In machine learning, memory is used to store the training data, the weights and biases of the model during training, and the final trained model. The amount of memory available can significantly impact the speed and efficiency of the training process. Modern GPUs are nearing the 100 GB mark per device, which is more than the RAM available to most commodity personal computers.
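A minimal sketch of this batching, with purely illustrative sizes: the dataset is assumed too large for GPU memory, so the CPU streams it in fixed-size batches, each of which would be one host-to-device copy.

```python
import numpy as np

# Hypothetical sizes, for illustration only: the dataset does not fit
# in GPU memory, so we stream it in batches that do.
dataset = np.random.default_rng(1).standard_normal((10_000, 512))
batch_size = 256  # assumed to fit in the (hypothetical) GPU's global memory

def batches(data, size):
    """Yield consecutive batches; each would be copied to the GPU in turn."""
    for start in range(0, len(data), size):
        yield data[start:start + size]

n_transfers = sum(1 for _ in batches(dataset, batch_size))
print(n_transfers)  # 40 host-to-device copies for one pass over the data
```

Because each transfer has a fixed overhead, fewer and larger batches generally keep the GPU busier, which is one reason large GPU memory matters.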
Once the initial model has been sent to the GPU, data is sent to GPU memory for processing as well. GPUs have a highly parallel structure, initially developed for video, image, and graphics processing, that makes them more efficient (typically a 10x performance improvement) than general-purpose CPUs for algorithms that process large blocks of data in parallel. Training deep learning models employs a method called backpropagation, which is heavy on matrix multiplications. This type of operation is often called highly parallelizable because it can be split into smaller pieces of computation. Combining multiple GPUs can have a dramatic effect on performance, cutting the time to train a deep learning model from years to days.
Complexity
Matrix multiplication is a well-known operation from linear algebra. A deep learning model is a neural network with very many layers of neurons, and the amount of processing required is proportional to the number of layers.
Specifically, the number of matrix multiplications performed during a single forward pass of a neural network equals the number of layers minus one. Each layer's output is calculated by multiplying the output of the previous layer by the weight matrix of the current layer, and the input layer performs no multiplication. So, if L is the number of layers, the number of matrix multiplications is L-1.
During a single backward pass, the number of matrix multiplications is likewise the number of layers minus one. During backpropagation, the error of each layer is calculated by multiplying the error of the next layer by the transposed weight matrix of that layer. So, if L is the number of layers, the number of matrix multiplications is again L-1.
Adjusting the weights after backpropagation also involves matrix operations: each weight in the network is updated by subtracting a portion of the calculated gradient. This is done for each layer that has weights, which is the number of layers minus one (the input layer has no weights). So, if L is the number of layers, the number of matrix operations for weight adjustment is also L-1.
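The three counts above can be sketched in one training step. This is a simplified illustration, not a complete implementation: activations are kept linear and the loss is squared error so the algebra stays short, and all sizes are made up. Real networks add nonlinearities, but the number of matrix multiplications per pass is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 4                        # layers: input + two hidden + output
sizes = [8, 16, 16, 4]       # illustrative width of each layer
W = [0.1 * rng.standard_normal((sizes[i], sizes[i + 1]))
     for i in range(L - 1)]  # L-1 weight matrices

x = rng.standard_normal((32, sizes[0]))    # a batch of 32 examples
y = rng.standard_normal((32, sizes[-1]))

# Forward pass: one multiplication per weight matrix, L-1 in total.
acts = [x]
for w in W:
    acts.append(acts[-1] @ w)

# Backward pass: propagate the error through each transposed weight
# matrix, again L-1 multiplications.
deltas = [2 * (acts[-1] - y)]              # error at the output layer
for w in reversed(W):
    deltas.insert(0, deltas[0] @ w.T)      # error one layer earlier

# Weight update: one gradient (one more multiplication) per weight matrix.
lr = 0.01
for i in range(L - 1):
    W[i] -= lr * (acts[i].T @ deltas[i + 1])

print(len(W))  # 3, i.e. L-1 weight matrices updated
```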
Not all operations in this process require the same amount of computation; the cost depends largely on the number of operands (the weights in a layer). Backpropagation, which involves multiplying large matrices together, therefore takes significantly more time and is heavy in both computation and memory requirements.