1. Project Idea

1. This Spark MLlib (Machine Learning Library) project uses an NLP (Natural Language Processing) data flow as its example.

2. The project uses a self-made NLP CSV data file for sentiment analysis.

3. The Spark MLlib Java project runs on the Spark system and contains the sentiment analysis logic.

4. The Spring Java client project runs on the Client system and shows the result data in tabular format.

2. Spark MLlib Project Workflow

a.   Start the Client system and the Spark system.
b.   The Client system sends sentiment analysis requests to the Spark system.
c.   The Spark system processes the client request and, after the program executes successfully, uploads the result to the Client system.
d.   Once the result is uploaded, the Client system shows the output in table format.

3. Building Of Project

To run this project, you can install a VM (Virtual Machine) on your local system and configure Spark on it. With this setup, your local system works as the Client system and the Spark VM works as the Spark system. Alternatively, you can use two systems that can communicate with each other, with Spark configured on one of them.

Let us see this project in detail.

a. Client System
It is an example Spring Boot Java project. Building this project creates an executable "client.jar" file.
It contains Java code, the Spark executable jar, data files, and static pages (HTML, JavaScript, images).
The Java code has 2 files, ClientSpringBootWebApplication.java and UploadDownloadController.java.
ClientSpringBootWebApplication.java is the main project file, which is responsible for bootstrapping the application and running it on an embedded server.
UploadDownloadController.java provides the HTTP download and upload URLs used by the Spark system.
The data folder holds the NLP (Natural Language Processing) training data file (emotion.csv) and the executable Spark project file (artificial.jar).
The static folder holds the HTML pages and their dependent JS and image files. This is the main client view, which shows the Spark process result.
pom.xml is the project build file. It holds the Java dependencies and build configuration details.
To create the "client.jar" file, use the command "mvn clean package".

b. Spark System
It is a Java project. Building it creates an executable "artificial.jar" file.
The Spark project has 3 Java files (Artificial.java, TestDocument.java, TrainingDocument.java) and 1 build file (pom.xml).
Artificial.java holds the main logic. TestDocument.java is used to create the client request object, and TrainingDocument.java is used to create the NLP training dataset object.
The main code works in four logical parts:
1- Create the training dataset object.
2- Create the PipelineModel object from the training dataset.
3- Convert the client request string into the test dataset object.
4- Transform the test dataset into a predictions dataset.

1- spark.createDataFrame(trainingDocumentJavaRDD, TrainingDocument.class);
2- pipeline.fit(trainingRowDataset);
3- spark.createDataFrame(data, TestDocument.class);
4- model.transform(testDataset);
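Step 1 relies on Spark deriving the DataFrame columns from the bean class passed to createDataFrame. The following is a minimal sketch of what such a bean could look like, together with a helper that parses one emotion.csv row. The field names (id, label, text) and the fromCsvLine helper are assumptions for illustration and are not taken from the project source.

```java
import java.io.Serializable;

/**
 * Hypothetical stand-in for TrainingDocument.java.
 * Field names are assumptions. The class is package-private only to keep the
 * sketch in one file; Spark's bean-based createDataFrame generally expects a
 * public class with getters and setters.
 */
class TrainingDocument implements Serializable {
    private static final long serialVersionUID = 1L;

    private long id;
    private String label;
    private String text;

    // A no-argument constructor is part of the JavaBean convention.
    TrainingDocument() {
    }

    TrainingDocument(long id, String label, String text) {
        this.id = id;
        this.label = label;
        this.text = text;
    }

    public long getId() { return id; }
    public void setId(long id) { this.id = id; }

    public String getLabel() { return label; }
    public void setLabel(String label) { this.label = label; }

    public String getText() { return text; }
    public void setText(String text) { this.text = text; }

    /** Parses one row laid out as: id,label,statement, (with a trailing comma). */
    static TrainingDocument fromCsvLine(String line) {
        // A limit of 3 keeps any commas inside the statement itself.
        String[] parts = line.split(",", 3);
        if (parts.length < 3) {
            throw new IllegalArgumentException("Expected id,label,text but got: " + line);
        }
        String text = parts[2].trim();
        // Drop the single trailing comma that ends each sample row.
        if (text.endsWith(",")) {
            text = text.substring(0, text.length() - 1).trim();
        }
        return new TrainingDocument(Long.parseLong(parts[0].trim()), parts[1].trim(), text);
    }
}
```

Objects built this way could then be collected into the JavaRDD that step 1 passes to createDataFrame.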

pom.xml contains external code dependencies and main class details.
For creating the "artificial.jar" file, use the command "mvn clean package".

4. Run The Project

a. Client System b. Spark System


Verify that Java is installed on the Client system.
Verify that the Spark services are running on the Spark system. Also, check that the path variable is exported so the Spark commands can be found.


Download client.jar to the Client system.
Download the spark-system.sh shell script file to the Spark system.


Run client.jar on the Client system, passing server port 9090 at execution time.
You can use a different port if 9090 is already in use on the Client system.

java -jar client.jar --server.port=9090
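If you are unsure whether port 9090 is free, a quick check can be done from Java before starting the client. This is a minimal sketch using only the JDK; the class and method names (PortCheck, isPortFree, firstFreePort) are hypothetical and not part of the project.

```java
import java.io.IOException;
import java.net.ServerSocket;

/** Hypothetical helper, not part of the project. */
final class PortCheck {

    private PortCheck() {
    }

    /** Returns true if a server socket can currently be bound to the port. */
    static boolean isPortFree(int port) {
        try (ServerSocket socket = new ServerSocket(port)) {
            return true;
        } catch (IOException e) {
            // Binding failed, typically because another process holds the port.
            return false;
        }
    }

    /** Returns the first free port in [start, end], or -1 if none is free. */
    static int firstFreePort(int start, int end) {
        for (int port = start; port <= end; port++) {
            if (isPortFree(port)) {
                return port;
            }
        }
        return -1;
    }
}
```

For example, if isPortFree(9090) returns false, start the client with another value for --server.port and pass that same port to the Spark system.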

Find a client IP address that is reachable from the Spark system. Run the spark-system.sh shell script on the Spark system, passing the client IP and client port at execution time.

sh spark-system.sh <client-ip> 9090


On the client page, enter a statement and generate a sentiment analysis request. After receiving the request from the Client system, the Spark system starts processing the data.


The client system will automatically show the result as soon as the Spark system uploads the result. Spark system uploads the result after successful execution.
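The upload step can be pictured as a plain HTTP POST from the Spark system to the client IP and port it was started with. The sketch below uses the JDK HttpClient (Java 11 or later). The /upload path, the JSON body, and the ResultUploader class are assumptions for illustration, since the actual URLs are the ones provided by UploadDownloadController.java.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of the project. */
final class ResultUploader {

    private ResultUploader() {
    }

    /** Builds a POST request to the client; "/upload" is an assumed path. */
    static HttpRequest buildUploadRequest(String clientIp, int clientPort, String resultJson) {
        if (clientPort < 1 || clientPort > 65535) {
            throw new IllegalArgumentException("Invalid port: " + clientPort);
        }
        URI uri = URI.create("http://" + clientIp + ":" + clientPort + "/upload");
        return HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(resultJson, StandardCharsets.UTF_8))
                .build();
    }

    /** Sends the request and returns the HTTP status code from the client. */
    static int send(HttpRequest request) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }
}
```

Keeping request construction separate from sending makes it possible to check the target URL without a running client.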

5. Project Files Description In Detail

(i).   spark-system.sh

The shell script (spark-system.sh) makes it easy to run the Spark project jar on the Spark system.

The spark-system.sh file downloads the NLP data and the Spark project jar files. After downloading them, it starts the Spark MLlib project with the client values passed as input.

The spark-system.sh file takes 2 input variables, both required at runtime.

The first variable is $1, used to obtain the client IP address from the command line.

The second variable is $2, used to obtain the client port number from the command line.

(sh spark-system.sh <client-ip> 9090): here "sh" is the Linux command, "spark-system.sh" is the shell script file name, "<client-ip>" is the first variable, and "9090" is the second variable.

6. Update Training Dataset If Spark Project Does Not Behave Correctly

Open the emotion.csv file on the Spark system and add a new statement with the matching sentiment value.


1693933165,positive,My experience so far has been fantastic!,

1693946914,negative,I hate to my boss.,
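A new row has to follow the same comma-separated layout as the existing ones, including the trailing comma. Below is a minimal sketch of a helper that formats and appends such a row. It assumes the first column is an opaque numeric id and that the file already ends with a line break; the TrainingRowAppender class is hypothetical and not part of the project.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper, not part of the project. */
final class TrainingRowAppender {

    private TrainingRowAppender() {
    }

    /** Formats a row as: id,sentiment,statement, (trailing comma as in the samples). */
    static String formatRow(long id, String sentiment, String statement) {
        if (sentiment.isEmpty() || sentiment.contains(",")) {
            throw new IllegalArgumentException("Invalid sentiment: " + sentiment);
        }
        if (statement.contains("\n") || statement.contains("\r")) {
            throw new IllegalArgumentException("Statement must be a single line");
        }
        // Commas inside the statement are passed through unchanged; whether the
        // project's own parser accepts them is not known.
        return id + "," + sentiment + "," + statement + ",";
    }

    /** Appends the row and a line break to the CSV file, creating it if missing. */
    static void appendRow(Path csvFile, String row) throws IOException {
        byte[] bytes = (row + System.lineSeparator()).getBytes(StandardCharsets.UTF_8);
        Files.write(csvFile, bytes, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}
```

After adding rows, rerun the client and the Spark job so that the model is trained again on the updated file.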

Run the client jar on the Client system.

java -jar client.jar --server.port=9090

Run the Spark project jar on the Spark system.

spark-submit --class cloudduggu.com.Artificial artificial.jar emotion.csv 9090

:) ...enjoy the Spark machine learning library (MLlib) project.