
Serve

The biggest effort in creating a machine learning model is collecting the data and training on it. The larger the dataset, the more computational effort and cost are needed to train a successful model.

To enable machine learning training and inference to run at scale on the cloud, a number of technologies have been adopted. Distributed computing frameworks, such as Apache Hadoop and Spark, allow machine learning algorithms to run on multiple machines simultaneously, speeding up the process and supporting larger datasets. The integration of such frameworks with cloud service providers has enabled on-demand computing resources that reduce the cost of training campaigns. The cloud also relies heavily on containerization and orchestration technologies like Docker and Kubernetes, which allow machine learning models to be packaged and deployed in isolated environments, making them easier to scale and manage. Serverless computing, such as AWS Lambda, Google Cloud Functions, and Azure Functions, allows machine learning models to be deployed without having to manage servers, further simplifying scaling. With respect to storage, S3-based services provide the scalable storage for large datasets and models that is essential in modern machine learning.

These technologies are usually available via vendor machine learning platforms such as Google Cloud ML Engine, Amazon SageMaker, and Azure Machine Learning, which provide a comprehensive set of tools to manage, scale, and deploy machine learning models.

These technologies are detailed and summarized below.

Distributed computing

Distributed computing is a computing model where components of a software system are shared among multiple computers to improve efficiency and performance. In this model, computers, also known as nodes, communicate and coordinate their actions by passing messages to each other. A variety of tasks can be coordinated to run concurrently, making it a highly efficient way of processing large volumes of data.

In machine learning, distributed computing works by splitting the computational tasks of machine learning algorithms across multiple nodes or machines. This is very useful when dealing with large datasets or when training large models (such as large language models) that are too big to be processed by a single machine. Here's a general overview of how it's done.

Data Parallelism

The foremost strategy in distributed machine learning is data parallelism, which divides the dataset into smaller subsets. Each subset is processed by a different node (a computer), and the results of this local learning are then combined to update the model parameters.
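A minimal sketch of data-parallel training with PyTorch's DistributedDataParallel is shown below. The toy model, dataset, backend choice, and hyperparameters are illustrative assumptions; in practice the process group setup (MASTER_ADDR, MASTER_PORT, launcher) depends on the cluster.

```python
# Minimal data-parallel training sketch using PyTorch DistributedDataParallel.
# Assumes MASTER_ADDR/MASTER_PORT are set by the launcher; model and data are toys.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def train(rank: int, world_size: int):
    # Each process (one per node or GPU) joins the same process group.
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Toy dataset; DistributedSampler hands each node a distinct subset.
    data = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
    sampler = DistributedSampler(data, num_replicas=world_size, rank=rank)
    loader = DataLoader(data, batch_size=32, sampler=sampler)

    model = DDP(torch.nn.Linear(10, 1))      # DDP averages gradients across nodes
    optim = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    for x, y in loader:
        optim.zero_grad()
        loss_fn(model(x), y).backward()      # local gradients, synchronized by DDP
        optim.step()                         # every replica applies the same update

    dist.destroy_process_group()
```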

Model Parallelism

There are cases, very common nowadays, in which the model parameters cannot be updated on a single computing node because of the sheer size of the model. In those cases, model parallelism allows the machine learning model itself to be distributed, with parameter updates and inputs processed across many computing nodes. The model is split into smaller parts, and each part is processed on a different machine. How the model is split can vary considerably depending on the model and the machine learning algorithm used to train it.
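The sketch below illustrates the simplest form of model parallelism: splitting a network into two stages placed on different devices, so only activations move between them. The two-stage model and the device names "cuda:0"/"cuda:1" are assumptions for illustration.

```python
# Illustrative model-parallel sketch: the network is split into two halves
# placed on different devices; only activations cross the device boundary.
import torch
import torch.nn as nn

class TwoStageModel(nn.Module):
    def __init__(self):
        super().__init__()
        # First half of the model lives on one device...
        self.stage1 = nn.Sequential(nn.Linear(1024, 512), nn.ReLU()).to("cuda:0")
        # ...and the second half on another.
        self.stage2 = nn.Linear(512, 10).to("cuda:1")

    def forward(self, x):
        h = self.stage1(x.to("cuda:0"))
        # The intermediate activation is moved, not the whole model.
        return self.stage2(h.to("cuda:1"))

model = TwoStageModel()
out = model(torch.randn(8, 1024))   # output tensor lives on cuda:1
```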

Distributed Communication

Both model and data parallelism involve communicating the error computed on new data and the resulting parameter updates; the nodes communicate with each other during the training process. Typically, gradient descent is employed: each node calculates the gradient of the loss function with respect to the model parameters, that is, how much the error changes for a small change in the parameters. These are called local gradients, and how they are communicated and combined to incorporate updates from the entirety of the data distributed across the network can have a significant impact on the performance of the model and the time to complete the task.
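As a conceptual illustration, the toy sketch below simulates synchronous distributed gradient descent in a single process: each "node" computes a local gradient on its shard, the local gradients are averaged (as an all-reduce would do), and all nodes apply the same update. The quadratic loss, data, and learning rate are made up for the example.

```python
# Conceptual sketch of synchronous distributed gradient descent on a toy
# least-squares problem; "nodes" are simulated in-process for illustration.
def local_gradient(w, data_shard):
    # Gradient of the mean squared error for y = w * x on this node's shard.
    return sum(2 * (w * x - y) * x for x, y in data_shard) / len(data_shard)

def distributed_sgd_step(w, shards, lr=0.1):
    local_grads = [local_gradient(w, shard) for shard in shards]  # one per node
    global_grad = sum(local_grads) / len(local_grads)             # aggregate (e.g. all-reduce)
    return w - lr * global_grad                                   # identical update everywhere

shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]     # data split across two nodes
w = 0.0
for _ in range(50):
    w = distributed_sgd_step(w, shards)
print(round(w, 3))  # converges toward 2.0, the slope underlying the toy data
```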

Distributed machine learning can be implemented using various frameworks such as TensorFlow, PyTorch, and Apache Spark, which provide tools and APIs to facilitate distributed computing. This makes distributed computing a practical foundation for serving machine learning training and inference at scale: it allows faster processing of large datasets and efficient use of resources, making it an essential tool for large-scale machine learning applications.

Distributed computing in machine learning involves training models on multiple machines or nodes. Efficient communication of weight updates and gradients is crucial for the performance of these systems. Some of the communication strategies involved are briefly described below.

Parameter Server Model

In this model, one or more nodes, known as parameter servers, are designated to store the model's parameters. The other nodes compute gradients and send them to the parameter server, which updates the model parameters and sends them back to the nodes.
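The toy sketch below captures the push/pull pattern of a parameter server in a single process. The class names, learning rate, and data are hypothetical; real systems distribute the server and workers across machines and handle asynchrony and fault tolerance.

```python
# Toy parameter-server sketch (single process, for illustration only).
# Workers pull the current weight, compute a local gradient on their shard,
# and push it; the server applies the update and serves the new weight.
class ParameterServer:
    def __init__(self, w=0.0, lr=0.05):
        self.w, self.lr = w, lr

    def pull(self):
        return self.w                     # workers fetch the latest parameter

    def push(self, grad):
        self.w -= self.lr * grad          # server applies the worker's gradient

def local_gradient(w, shard):
    # Gradient of mean squared error for the toy model y = w * x on this shard.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

server = ParameterServer()
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]   # one shard per worker
for _ in range(100):
    for shard in shards:                  # each "worker" takes a turn
        grad = local_gradient(server.pull(), shard)
        server.push(grad)
print(round(server.w, 3))                 # converges toward 2.0
```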

All-Reduce

This is a communication primitive often used in distributed computing. It combines values from all nodes and shares the result with all nodes. This can be used to aggregate gradients and distribute the updated weights.
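A minimal sketch of gradient averaging via an all-reduce in PyTorch is shown below, assuming a process group has already been initialized (as in the data-parallel sketch above). The helper function name is an assumption for illustration.

```python
# Sketch: average gradients across all ranks with an all-reduce.
# Assumes torch.distributed has been initialized elsewhere.
import torch
import torch.distributed as dist

def all_reduce_gradients(model: torch.nn.Module, world_size: int):
    for param in model.parameters():
        if param.grad is None:
            continue
        # Sum this gradient tensor across all ranks, in place on every rank...
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
        # ...then divide by the number of ranks to obtain the average.
        param.grad /= world_size
```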

Model Averaging

Each node trains a separate copy of the model, and the weights are updated by averaging the weights of all models.
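A short sketch of the averaging step is given below: after a period of independent training, the replicas' weights are averaged parameter by parameter and loaded back into every replica. The helper name is hypothetical.

```python
# Sketch of periodic model averaging across independently trained replicas.
import torch

def average_models(models):
    state_dicts = [m.state_dict() for m in models]
    averaged = {}
    for name in state_dicts[0]:
        # Element-wise mean of this parameter across all replicas.
        averaged[name] = torch.stack([sd[name].float() for sd in state_dicts]).mean(dim=0)
    for m in models:
        m.load_state_dict(averaged)       # every replica continues from the average
    return models
```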

Decentralized Distributed Optimization

Instead of using a parameter server, each node communicates with a subset of other nodes. This can reduce network congestion and improve scalability. Different protocols are employed, foremost the distributed averaging algorithm described in Distributed machine learning in networks by Consensus.
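The toy sketch below illustrates the consensus/gossip idea behind decentralized averaging: each node repeatedly averages its value with its neighbours, and all nodes converge to the global mean without any central server. The ring topology and scalar "values" (standing in for parameters or gradients) are assumptions for the example.

```python
# Toy decentralized (gossip/consensus) averaging: each node averages its value
# with its neighbours' values instead of talking to a central server.
def gossip_round(values, neighbours):
    new_values = []
    for i, v in enumerate(values):
        group = [v] + [values[j] for j in neighbours[i]]
        new_values.append(sum(group) / len(group))     # local average with neighbours
    return new_values

values = [1.0, 5.0, 9.0, 13.0]                          # initial local parameters
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}     # ring topology, two neighbours each
for _ in range(20):
    values = gossip_round(values, ring)
print([round(v, 3) for v in values])                    # all nodes approach the global mean 7.0
```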

Federated Learning

This is a strategy for training models across many decentralized devices or servers holding local data samples, without exchanging the data samples themselves. Instead, locally computed updates are communicated to a central server where they are aggregated to form a global model.
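A minimal federated-averaging-style sketch, simulated in one process, is shown below: each client trains a local copy on its own data, and only the resulting weights (never the raw data) are sent back and averaged into the global model. The toy model, client data, and function names are illustrative assumptions, not a production federated learning protocol.

```python
# Minimal FedAvg-style sketch, simulated in a single process for illustration.
import copy
import torch

def client_update(global_model, data, epochs=1, lr=0.01):
    model = copy.deepcopy(global_model)             # client starts from the global weights
    optim = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        for x, y in data:
            optim.zero_grad()
            loss_fn(model(x), y).backward()
            optim.step()
    return model.state_dict()                       # only weights leave the client

def federated_round(global_model, client_datasets):
    client_states = [client_update(global_model, d) for d in client_datasets]
    averaged = {name: torch.stack([s[name] for s in client_states]).mean(dim=0)
                for name in client_states[0]}
    global_model.load_state_dict(averaged)          # aggregate into the global model
    return global_model

global_model = torch.nn.Linear(10, 1)
clients = [[(torch.randn(4, 10), torch.randn(4, 1))] for _ in range(3)]  # local data per client
for _ in range(5):
    global_model = federated_round(global_model, clients)
```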