The objective of this tutorial is to describe, step by step, how to install Spark on a cluster of nodes. In this example we use 1 master node and 4 slave nodes.

Platform

  • Operating System (OS). You can use Ubuntu 18.04.4 LTS or a later version; other Linux distributions such as Red Hat or CentOS will also work.
  • Spark. We have used Spark 2.4.5 (Version spark-2.4.5-bin-hadoop2.7).

Download Software

  • VMware Player for Windows
  • https://my.vmware.com/web/vmware/free#desktop_end_user_computing/vmware_player/7_0

  • Ubuntu
  • http://releases.ubuntu.com/18.04.4/ubuntu-18.04.4-desktop-amd64

  • Eclipse for Windows
  • https://www.eclipse.org/downloads/

  • PuTTY for Windows
  • http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html

  • WinSCP for Windows
  • http://winscp.net/eng/download.php

  • Spark
  • https://downloads.apache.org/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz

Below is the cluster of nodes on which we will install Spark 2.4.5. We have 1 master and 4 slaves, with the details as mentioned below.

• HadoopMasternode: 185.150.1.20 (hadoopmaster)
• HadoopSlavenode: 185.150.1.21 (hadoopslave-1)
• HadoopSlavenode: 185.150.1.22 (hadoopslave-2)
• HadoopSlavenode: 185.150.1.23 (hadoopslave-3)
• HadoopSlavenode: 185.150.1.24 (hadoopslave-4)

Installation of Spark on Master Node

Let’s install Spark on the master node (HadoopMasternode: 185.150.1.20).

Step 1. Edit the hosts file on the master node and add the below entries.

$sudo nano /etc/hosts

185.150.1.20 hadoopmaster
185.150.1.21 hadoopslave-1
185.150.1.22 hadoopslave-2
185.150.1.23 hadoopslave-3
185.150.1.24 hadoopslave-4
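To confirm that the new entries resolve correctly, you can ping each slave by hostname from the master (a quick sanity check; the -c 2 flag limits each check to two packets).

$ping -c 2 hadoopslave-1
$ping -c 2 hadoopslave-2
$ping -c 2 hadoopslave-3
$ping -c 2 hadoopslave-4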

Step 2. Verify whether Java is installed on the master node using the java -version command. If it is installed, you will see output similar to the below.

openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1~16.04-b09)
OpenJDK Server VM (build 25.252-b09, mixed mode)

Otherwise, you can install Java 8 using the below command.

$sudo apt-get install openjdk-8-jdk


Step 3. Once Java is installed, update the package source list using the below command.

$sudo apt-get update


Step 4. Now install SSH on the master node using the below command.

$sudo apt-get install openssh-server openssh-client
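After installation, you can confirm that the SSH daemon is running; on Ubuntu 18.04 (which uses systemd) the check looks like this:

$sudo systemctl status ssh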


Step 5. Once SSH is installed, generate a key pair for passwordless SSH from the master to the slaves.

$ssh-keygen -t rsa -P ""
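The key pair is written to the .ssh directory in the user's home; a quick listing should show id_rsa (the private key) and id_rsa.pub (the public key):

$ls ~/.ssh
id_rsa  id_rsa.pub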


Step 6. Now copy the content of .ssh/id_rsa.pub from the master node to .ssh/authorized_keys on every slave, as shown below.
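A convenient way to do this is ssh-copy-id, which appends the public key to the remote authorized_keys file and sets its permissions; you will be prompted for the password once per slave. This assumes the same user (here cloudduggu) exists on every node:

$ssh-copy-id cloudduggu@hadoopslave-1
$ssh-copy-id cloudduggu@hadoopslave-2
$ssh-copy-id cloudduggu@hadoopslave-3
$ssh-copy-id cloudduggu@hadoopslave-4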


Step 7. Once the key is copied, verify passwordless login from the master node to every slave node.

$ssh hadoopslave-1
$ssh hadoopslave-2
$ssh hadoopslave-3
$ssh hadoopslave-4

We can now connect to any slave machine from the master node simply by running ssh followed by the slave's hostname, with no password prompt.


Step 8. Now we are ready to install Spark on the master node. Download the software from the below link.

https://downloads.apache.org/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
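If the master node has direct internet access, you can fetch the tarball straight into the user's home directory with wget:

$cd ~
$wget https://downloads.apache.org/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz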

In our case, it is present at the below location.

/home/cloudduggu/spark-2.4.5-bin-hadoop2.7.tgz


Step 9. Now untar the file and rename the extracted directory to spark for convenience.

$tar xzf spark-2.4.5-bin-hadoop2.7.tgz

$mv spark-2.4.5-bin-hadoop2.7 spark


Spark Configuration Files Setup


Step 10. Open the .bashrc file in the user's home directory and add the below entries so the shell can locate Spark and Java.

$nano .bashrc

export SPARK_HOME="/home/cloudduggu/spark"
export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-i386/"
export PATH=$PATH:$SPARK_HOME/bin

(On a 64-bit install the JDK directory is typically /usr/lib/jvm/java-8-openjdk-amd64/; adjust JAVA_HOME to match your system.)
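Reload .bashrc so the new variables take effect in the current shell, and verify that SPARK_HOME points to the right place:

$source ~/.bashrc
$echo $SPARK_HOME
/home/cloudduggu/spark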


Step 11. Now we will set up the spark-env.sh file, which is present in $SPARK_HOME/conf/. Create it as a copy of spark-env.sh.template, then open it and export the Java home and the number of cores each worker may use.

$cd $SPARK_HOME/conf
$cp spark-env.sh.template spark-env.sh

Add the below lines to spark-env.sh:

export SPARK_WORKER_CORES=12

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-i386/
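spark-env.sh accepts other standard worker settings as well; for example, SPARK_WORKER_MEMORY caps the total memory a worker may give to executors (the 8g value below is only illustrative, size it to your machines):

export SPARK_WORKER_MEMORY=8g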


Step 12. Now list the slave nodes in the slaves file under $SPARK_HOME/conf/ (it can be created from the slaves.template file in the same directory).

$nano $SPARK_HOME/conf/slaves

Add the slave hostnames, one per line:

hadoopslave-1
hadoopslave-2
hadoopslave-3
hadoopslave-4

Spark is now set up on the master node. Let’s configure the slave nodes.


Step 13. Verify whether Java is installed on all slave nodes using the java -version command. If it is installed, you will see output similar to the below.

openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1~16.04-b09)
OpenJDK Server VM (build 25.252-b09, mixed mode)

Otherwise, you can install Java 8 using the below command.

$sudo apt-get install openjdk-8-jdk


Step 14. Now package the Spark directory on the master node and copy it to all slave nodes using the below commands.

ON hadoopmaster$ tar czf spark.tar.gz spark
ON hadoopmaster$ scp spark.tar.gz hadoopslave-1:~
ON hadoopmaster$ scp spark.tar.gz hadoopslave-2:~
ON hadoopmaster$ scp spark.tar.gz hadoopslave-3:~
ON hadoopmaster$ scp spark.tar.gz hadoopslave-4:~
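Since the four scp commands differ only in the hostname, a small shell loop does the same job and scales better if you add slaves later:

ON hadoopmaster$ for i in 1 2 3 4; do scp spark.tar.gz hadoopslave-$i:~; done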

Step 15. Now untar that file on all slave nodes.

ON hadoopslave-1 $tar xzf spark.tar.gz
ON hadoopslave-2 $tar xzf spark.tar.gz
ON hadoopslave-3 $tar xzf spark.tar.gz
ON hadoopslave-4 $tar xzf spark.tar.gz
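Because passwordless SSH is already in place, you can also run the untar remotely from the master in one loop instead of logging in to each slave:

ON hadoopmaster$ for i in 1 2 3 4; do ssh hadoopslave-$i "tar xzf spark.tar.gz"; done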

Step 16. Spark installation is now complete on all slave nodes.


Step 17. Start the Spark cluster from the master node using the below commands. The start-all.sh script launches the master process locally and a worker on every host listed in the slaves file.

$cd $SPARK_HOME
$sbin/start-all.sh
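Once the services are up, the Spark master web UI should be reachable in a browser on port 8080 (the default) of the master node, showing the four registered workers:

http://hadoopmaster:8080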


Step 18. Verify the services are running on the master and slave nodes. The jps command lists the running JVM processes.

On the master node run the below command; the output should include the Master process.

$jps
Master

On the slave nodes run the same command; the output should include the Worker process.

$jps
Worker
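To shut the cluster down later, the matching stop script run from the master node stops the master and all workers:

$cd $SPARK_HOME
$sbin/stop-all.sh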

This completes the Spark installation on a multi-node cluster.