Big Data cluster deployment in cloud for Machine Learning


Cloud-based Big Data and Machine Learning applications are becoming increasingly popular in the industry, also in academic and education sectors. In many cases, clouds are used to support the computation and storage needs of such applications by building and managing multi-VM virtual infrastructures (clusters) using some cloud-based system (IaaS), temporarily or for a longer period of time, respectively.

Recently, a popular choice to convey big data analytics or machine learning is to use Apache Spark, which is an open source distributed, cluster computing system. For  best performance, it is also recommended to use Hadoop File System (HDFS) along with Spark, with which Spark can perform data processing in data locality-aware way, i.e. moves computation to the site of the data, so avoiding data movement overhead.

Manual deployment and configuration of such a cluster in a cloud is non-trivial, error-prone and tedious task, requiring considerable expertise in both cloud and Spark-Hadoop technologies. It might take days, which may even exceed the time required to perform the actual data processing. After discarding the infrastructure when the current computation is done, the same infrastructure might have to be re-built again for a subsequent computation later, or when deciding to choose another cloud provider, respectively.

This is solution to manage (create and discard) such infrastructures rapidly, easily, and efficiently in clouds, which are guaranteed to be consistent, properly configured, well integrated, controllable, scalable and fault-tolerant. An important advantage of the proposed solution is that the main parameters of the Apache Spark architecture (such as the size of the cluster, number of CPU cores and memory configurations per workers, etc.) can be customized, the computing capacity required for processing can be scaled and cloud-independent. To fulfill these goals we used a hybrid-cloud orchestration tool called Occopus, which was developed by MTA SZTAKI. Occopus uses so called descriptors to define the required infrastructure, which contains the definition of the node types (Spark Master and Worker). The Spark Master will also be the HDFS Name Node, and the Worker nodes will be Data Nodes for the HDFS system at the same time to enable data locality-aware processing. The number of the worker nodes is configurable before deployment and scalable even at runtime. Occopus uses these descriptors (infrastructure descriptor, node definition descriptor and optionally the cloud-init files for contextualization) to deploy the required infrastructure in the target cloud. Occopus supports several interfaces to various clouds: EC2, OpenStack Nova, Docker, CloudBroker, CloudSigma, which allows of easily deploying the very same infrastructure in almost any commercial or private cloud providing such an interface (Amazon, OpenStack, OpenNebula, etc.)