We live in an age of information, and a large part of our lives relies on technology in the form of cell phones, computers, appliances, cars, and so on; life seems difficult without them. This massive use of technology creates enormous amounts of data and corresponding processing challenges. Applications are used by huge numbers of people around the world, and it is not acceptable for this heavy usage to degrade their performance. The question, then, is how to maintain the performance of a system while managing huge amounts of data. This is where the term “Distributed Computing” comes in. Distributed computing and artificial intelligence are the key to the influx of Big Data processing that we have seen in recent years with the increased use of technology.

What is Distributed Computing?
As the name suggests, in distributed computing an enormous task that cannot be executed by a single computer on its own is divided into multiple sub-tasks and computed on different connected computers, called nodes. The computers can be in different physical locations. Distributed computing can be implemented in different ways: nodes can work in parallel or serially, and they can communicate through shared memory or message passing.

You can split a huge task into smaller ones, compute and execute them on multiple computers/nodes in parallel, and aggregate the outputs appropriately to obtain the solution to the original problem. When you have a bigger workload to execute, you can simply add more nodes to the computation. Distributed computing is becoming popular because of its high performance, resource availability, fault tolerance and resource sharing. It allows us to compute complex tasks in less time and reduces the load on any single machine.
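As a minimal single-machine sketch of this split, compute-in-parallel and aggregate pattern, the following Python snippet (using the standard multiprocessing module; the summation task and chunk size are made up for illustration) divides one large computation into sub-tasks and combines the partial results:

# A minimal sketch of the split / compute-in-parallel / aggregate idea,
# using Python's standard multiprocessing module on a single machine.
from multiprocessing import Pool

def partial_sum(chunk):
    # Each worker computes its share of the overall task.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    n, chunk_size = 1_000_000, 100_000
    # Split the big task into smaller sub-tasks (chunks).
    chunks = [range(i, min(i + chunk_size, n)) for i in range(0, n, chunk_size)]
    with Pool() as pool:
        # Compute the sub-tasks in parallel on multiple worker processes.
        partials = pool.map(partial_sum, chunks)
    # Aggregate the partial results into the final answer.
    print(sum(partials))

The same pattern generalizes to a cluster, where the worker processes live on different nodes and communicate over the network instead of shared memory.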
Google popularized this approach when, out of necessity, it developed a distributed computing paradigm named MapReduce [1], [2] to handle its huge amounts of data. The open-source community later developed Apache Hadoop [3] based on it. Amazon Web Services [4], Microsoft Azure [5] and Google Cloud Platform [6] are some of the platforms that provide online services for distributed computing; you can purchase capacity for the time and amount you need.
How are Distributed Computing and Artificial Intelligence Related?
Talking about technology without mentioning AI is like a cake without any topping. In this era, we solve problems with the help of artificial intelligence, and no one can deny its importance. Many complex tasks are accomplished with AI across domains such as healthcare, surveillance, finance, robotics, travel, social media and more.
Simply running AI algorithms on a cluster requires a lot of effort, time and careful software engineering. On the other hand, many high-quality distributed systems exist for applying machine learning on clusters, often built from scratch, but they have their own limitations. Combining the two makes this easier: it takes less effort and time to solve a complex problem.
If you have to solve a problem that requires processing a petabyte-scale dataset, how long will it take? On a single machine, a very long time. This is where distributed computing and resource pooling come in: we can pool the processing power of multiple machines to complete the task in a reasonable time.
Imagine an image-recognition system with trillions of images belonging to thousands of categories, for example humans, animals, nature and objects. Doing this on a single machine would take an enormous amount of time; distributed computing can solve it much faster.
What are Distributed AI Models?
Growing data volumes and the demand for faster processing led to different models of distributed computing. The need for horizontal and vertical scaling of resources produced many models and their extensions. Some of them are as follows:
I. MapReduce:
In 2004, Google presented MapReduce, a programming model for large-scale data processing; Hadoop is its best-known open-source implementation. Its main features are automatic parallelism, load balancing, fault tolerance, and network and disk transfer optimization. The developer only has to write two methods: Map and Reduce.
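To make the two methods concrete, here is a toy, single-process sketch of word counting in the MapReduce style; a real framework such as Hadoop would run the map and reduce steps across many nodes and handle the shuffling, load balancing and fault tolerance itself:

# A toy, single-process sketch of the two methods a MapReduce developer writes.
from collections import defaultdict

def map_phase(document):
    # Map: emit (key, value) pairs -- here, (word, 1) for every word.
    return [(word, 1) for word in document.split()]

def reduce_phase(key, values):
    # Reduce: combine all values that share the same key.
    return key, sum(values)

documents = ["the quick brown fox", "the lazy dog", "the fox"]
# Shuffle: group intermediate pairs by key (done by the framework in practice).
grouped = defaultdict(list)
for doc in documents:
    for key, value in map_phase(doc):
        grouped[key].append(value)

word_counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(word_counts)  # e.g. {'the': 3, 'quick': 1, ...}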
II. Apache Spark:
Hadoop MapReduce was the dominant parallel programming engine for clusters, but its most notable limitation was that it writes intermediate results to disk. To address this and other drawbacks of Hadoop, Apache Spark [1] was introduced in 2009. Spark keeps intermediate data in memory, which makes it much faster, and it quickly grew into widespread use. It can be used for regression, clustering and classification problems.
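As a rough sketch of what this looks like in practice (assuming PySpark is installed, e.g. via pip install pyspark; the tiny in-memory dataset is made up), the same word count can be expressed as in-memory RDD transformations:

# A small PySpark sketch: word count as in-memory RDD transformations
# instead of on-disk MapReduce stages.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["the quick brown fox", "the lazy dog", "the fox"])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(counts.collect())
spark.stop()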
III. Ray:
There are many distributed systems, but most of them were not built with AI applications in mind, so their performance and APIs fall short for AI workloads. Ray [1] is an open-source framework for AI developed by the Berkeley Artificial Intelligence Research (BAIR) [2] group. Its main feature is that it lets developers turn a prototype algorithm that runs on a laptop into a high-performance application that runs efficiently on a cluster by adding only a few lines of code. There are two ways of using Ray: through its low-level API and through its high-level libraries.
The goal of Ray is to express computations and applications naturally, without being restricted to specific patterns such as MapReduce. In Ray, a task graph represents the application and is executed just once. Ray uses actors to encapsulate mutable state shared between different tasks. Ray RLlib is a scalable reinforcement learning library that can run on many machines; it currently includes implementations of A3C, DQN, evolution strategies, and more.
Discussion
Machine learning algorithms have advanced, and applications built on these techniques require a lot of resources and multiple machines. With the requirement for multiple machines comes the need for parallelism. However, the paradigms for performing cluster-based machine learning remain ad hoc. Distributed systems outside the domain of AI, such as Hadoop and Spark, do exist, but they only cover specific use cases. AI and machine learning developers therefore often build their own machine learning paradigms from scratch, which requires a lot of effort and time.
For example, a basic reinforcement learning algorithm can be written in about a dozen lines of Python. Running the same algorithm efficiently on a cluster of machines, however, requires a great deal of effort and software engineering: thousands of lines of code, strategies for data handling and message serialization/deserialization, and definitions of communication protocols.
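To illustrate how short the algorithm itself can be, here is a rough, self-contained sketch of tabular Q-learning on a made-up five-state chain world; the environment, hyperparameters and episode cap are purely illustrative, and none of this addresses the distributed-execution work described above:

# Tabular Q-learning on a toy 5-state "chain" world, in plain Python.
import random

n_states, n_actions = 5, 2            # actions: 0 = move left, 1 = move right
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.95, 0.3

for episode in range(500):
    state = 0
    for step in range(50):            # cap episode length
        if random.random() < epsilon:
            action = random.randrange(n_actions)                          # explore
        else:
            action = max(range(n_actions), key=lambda a: Q[state][a])     # exploit
        next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
        reward = 1.0 if next_state == n_states - 1 else 0.0
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state
        if state == n_states - 1:     # reached the goal state
            break

print(Q)  # learned action values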
Limitations of Traditional Distributed Systems
Many distributed systems today do not fulfill this need and are not compatible with AI-based libraries and APIs; they cannot adequately support AI applications and therefore lack the required performance. The flaws of traditional distributed systems are:
- Unable to execute tasks with millisecond-level latency
- Unable to perform millions of tasks per second
- Unable to support nested parallelism, i.e. tasks launched from within tasks
- Unable to handle task dependencies that are only identified dynamically at runtime
- Unable to support heterogeneous resources such as CPUs and GPUs
Distributed Computing Framework for AI
Ray fulfills this need of developers: it is an open-source framework for AI. With Ray, developers can turn prototype algorithms that run on a laptop into high-performance applications that run efficiently on clusters by adding only a few lines of code. Ray is compatible with many deep learning frameworks, such as PyTorch [3], TensorFlow [4] and MXNet [5], and it allows us to use one or several deep learning frameworks within the same application.
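As a rough illustration of this interoperability (a minimal sketch assuming ray and torch are installed; the tiny linear model, the random inputs and the evaluate function are made up), a PyTorch model can be built and run inside a Ray task:

# A minimal sketch of mixing Ray with a deep learning framework (PyTorch here).
import ray
import torch

ray.init()

@ray.remote
def evaluate(seed):
    # Each task builds and runs its own small PyTorch model.
    torch.manual_seed(seed)
    model = torch.nn.Linear(10, 1)
    x = torch.randn(4, 10)
    return model(x).sum().item()

# Run several independent evaluations in parallel on the cluster (or laptop).
results = ray.get([evaluate.remote(seed) for seed in range(4)])
print(results)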
Implementation and Results
Let’s take a look at Ray with a few code examples to understand the core concepts of this AI-oriented distributed framework. Ray can be installed with pip install ray. There are two ways of using Ray: through its low-level API and through its high-level libraries, which are themselves built on the low-level API. There are two such libraries:
- Ray RLlib [6]: a scalable library for reinforcement learning. It can run on multiple machines and has a Python-based API.
- Ray Tune [7]: a distributed library for hyperparameter search.
Ray Components
The basic component of a Ray application is a dynamic task graph that represents the whole application. It is constructed at runtime as the application runs and is executed a single time. Tasks can be nested, i.e. they can trigger the execution of further tasks. Python functions are treated as tasks, and they can depend on the results of other tasks, as in the example below:
import ray
import numpy as np

ray.init()

@ray.remote
def add(X, Y):
    # A task that depends on the outputs of other tasks.
    return X + Y

@ray.remote
def ones(S):
    # A task that produces an array of ones with shape S.
    return np.ones(S)

task1 = ones.remote((20, 20))
task2 = ones.remote((20, 20))
task3 = add.remote(task1, task2)   # depends on task1 and task2
result = ray.get(task3)            # retrieve the final result
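Tasks can also launch tasks of their own, which is the nested parallelism mentioned earlier. The following is a small illustrative sketch (the fan-out pattern and the partial_mean function are made up) of a remote task that spawns and gathers other remote tasks:

# A follow-up sketch of nested parallelism: a remote task that itself
# launches and gathers other remote tasks.
import ray
import numpy as np

ray.init(ignore_reinit_error=True)

@ray.remote
def partial_mean(seed):
    rng = np.random.default_rng(seed)
    return rng.random(1000).mean()

@ray.remote
def fan_out(num_workers):
    # Tasks within a task: dependencies appear dynamically at runtime.
    futures = [partial_mean.remote(seed) for seed in range(num_workers)]
    return float(np.mean(ray.get(futures)))

print(ray.get(fan_out.remote(8)))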
When working with machine learning algorithms, one often needs multiple tasks that operate on shared mutable state, for example the shared weights of a neural network. Ray uses actors to encapsulate mutable state shared between different tasks. An example of a Ray actor acting as a parameter server is given below:
@ray.remote
class ParameterServer(object):
    def __init__(self, X, Y):
        # X: parameter names, Y: initial parameter values (e.g. NumPy arrays).
        Y = [y.copy() for y in Y]
        self.server_parameters = dict(zip(X, Y))

    def Get(self, X):
        # Return the current values for the requested parameter names.
        return [self.server_parameters[x] for x in X]

    def Update(self, X, Y):
        # Apply additive updates to the named parameters.
        for x, y in zip(X, Y):
            self.server_parameters[x] += y
You can then instantiate the parameter server with some initial names initial_X and values initial_Y:
parameter_server_initiate = ParameterServer.remote(initial_X, initial_Y)
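Continuing the snippets above, a hedged usage sketch might look like the following: several worker tasks read the shared parameters from the actor, compute a placeholder "gradient" and push an update back. The worker function, the fake update and the number of workers are made up for illustration, and initial_X / initial_Y are assumed to be lists of parameter names and NumPy arrays:

import ray
import numpy as np

@ray.remote
def worker(server, keys):
    # Pull the current parameters from the actor...
    params = ray.get(server.Get.remote(keys))
    # ...compute a placeholder "gradient" locally...
    updates = [np.ones_like(p) * 0.01 for p in params]
    # ...and push the update back to the shared state.
    server.Update.remote(keys, updates)
    return True

keys = list(initial_X)
ray.get([worker.remote(parameter_server_initiate, keys) for _ in range(4)])
print(ray.get(parameter_server_initiate.Get.remote(keys)))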
Using Ray Libraries
Ray Tune provides support for random search, grid search and many early-stopping algorithms. Its results can be visualized in TensorBoard or read directly from the JSON logs. An example of a grid search is given below:
from ray.tune import grid_search, register_trainable, run_experiments

def Grid_Search(hyperparameters, results):
    # A dummy trainable: report a fake accuracy that grows with time.
    import time
    index = 0
    while True:
        results(total_time=index,
                average_accuracy=(index ** hyperparameters['alpha']))
        index += hyperparameters['beta']
        time.sleep(0.01)

register_trainable('Grid_Search', Grid_Search)

run_experiments({
    'grid_search_experiment': {
        'run': 'Grid_Search',
        'stop': {'average_accuracy': 100},
        'resources': {'cpu': 1, 'gpu': 0},
        'hyperparameters': {
            'alpha': grid_search([0.1, 0.2, 0.3, 0.4, 0.5]),
            'beta': grid_search([0, 1]),
        },
    }
})
Conclusion
Distributed computing is not a new term, but with the development and use of technology it is becoming increasingly popular. Distributed computing with AI is one of the foundations of the new era in technology; it can help achieve more in less time. It has many benefits and is improving day by day. In the near future it will likely be used even more, and we will be able to solve complex problems involving huge amounts of data efficiently.
Blog Description
Distributed computing and artificial intelligence are the key to the influx of Big Data processing that we have seen in recent years with the increased use of technology. Distributed computing with AI is one of the foundations of the new era in technology and can help achieve more in less time. Distributed systems outside the domain of AI do exist, but they only cover specific use cases.
Machine learning algorithms have advanced, and applications based on these techniques require multiple machines for processing. Many distributed systems today do not fulfill this need: they do not support AI-based libraries and applications and hence lack performance. Ray is an open-source framework for AI that fulfills this need. With Ray, developers can turn prototype algorithms that run on a laptop into high-performance applications that run efficiently on clusters by adding only a few lines of code, and it is compatible with many deep learning frameworks.
[1] https://rise.cs.berkeley.edu/projects/ray/
[2] https://bair.berkeley.edu/
[4] https://www.tensorflow.org/
[6] https://bair.berkeley.edu/blog/2018/01/09/ray/
[7] http://ray.readthedocs.io/en/latest/tune.html
[1] https://spark.apache.org/releases/spark-release-2-0-0.html
[1] Dean, Jeffrey, and Sanjay Ghemawat. “MapReduce: Simplified data processing on large clusters.” (2004).
[2] Dean, Jeffrey, and Sanjay Ghemawat. “MapReduce: simplified data processing on large clusters.” Communications of the ACM 51.1 (2008): 107-113.
[3] https://hadoop.apache.org/