Apache Spark SQL Project

1. Project Idea

1. In this Project, We will perform data analysis through Spark, and for that, we have taken a movie dataset of type CSV.

2. To execute this Project we have two systems, a Client Machine, and a Spark Machine.

3. On Client Machine, we will send the SQL request to Spark Machine to process data.

4. On Spark Machine, data will be processed and the result will be returned to the Client machine.

5. After receiving the result, the Client Machine will show the result on the web browser in graphical and tabular format.

2. Project Workflow

a. Client system sends SQL requests to the Spark system and waits until the Spark system responds.
b. Spark system starts execution after receiving Client system requests.
c. Spark system sends results to the Client system after successfully processing the request.
d. Client system displays data upon receiving results from the Spark system.

3. Building Of Project

To run this project you can install VM (Virtual Machine) on your local system and configured Spark on that. After this configuration, your local system will work as a client system and Spark VM will work as a Spark system. Alternatively, you can take two systems which are communicating with each other and on one of the system Spark is configured.

Let us see this project in detail and run it using the below steps.

a. Client System
	It is an example of the Spring Boot JAVA Framework. When we will build this project then it will create a "client.jar" executable file.
	It has java code, spark executable jar, data files, and static HTML pages.
	Java code has 2 files, SpringBootWebApplication.java and UploadDownloadController.java
	SpringBootWebApplication.java is the main project file, which is responsible for building code and running it on an embedded server.
	UploadDownloadController.java is used to provide download & upload URL HTTP services for the Spark system.
	data folder has unprocessed movie CSV data files and executable spark project (analysis.jar) file.
	the static folder has HTML pages and dependent js & image files. This is the main client view page which shows the spark process result.
	pom.xml is a project build tool file. This file has java dependencies and builds configuration details.
	For creating the “client.jar” file, use the command "mvn clean package".
	Click Here To Download "ClientSystem" project zip file.

b. Spark System
	It is a JAVA project, when we will build this project then it will create an “analysis.jar “executable file.
	Spark project has 3 java files (MovieAnalysis.java, Movie.java, AnalysisLogic.java) and 1 build tool file (pom.xml).
	MovieAnalysis.java is the main file which receives client request and sends the request to AnalysisLogin object for processing. After the process, the result sends back to the client system.
	Movie.java is used to create "JavaRDD" of the data CSV file. `JavaRDD<Movie> movieRDD = spark.read().csv(filePath).javaRDD()`
	AnalysisLogic.java is a main spark SQL logic code to process the SQL request.
	pom.xml contains external code dependencies and main class details.
	For creating the "analysis.jar" file, use the command "mvn clean package".
	Click Here To Download "SparkSystem" project zip file.

4. Run The Project

	a. Client System	b. Spark System
1.	Download client.jar in the client system. Click Here To Download "client.jar" executable jar file.	Download spark-system.sh shell script file in the spark system. Click Here To Download "spark-system.sh" shell script file.
1.
2.	Run client.jar in the client system. At execution time pass server port 9999. Here we can use a different port if the port already uses in the client system. `java -jar client.jar --server.port=9999`


3.	Find client system IP, which will be accessible to the Spark system.	Run spark-system. sh shell script in the Spark system. At execution time pass client-ip & client-port. `sh spark-system.sh 192.168.0.104 9999`
3.
4.	Press the request button from the client page and generate a request for the spark system. After that Client page will wait till the time the Spark system sends the result.	After receiving a request from the Client system, the Spark system will start processing data.


5.	The client system will automatically show the result as soon as the Spark system uploads the result.	Spark system uploads the result after successful execution.
5.
6.	Press another request button from the Client system and generate a new request for the Spark system.	Spark system uploads the result after successful execution.

5. Project Files Description In Detail

(i). spark-system.sh

Using a shell script (spark-system. sh) we can easily run the Spark project jar in the Spark system.

spark-system. sh, the file runs the main script within the loop. Here loop runs up to 100 times. Once the loop is completed Spark system does not monitor the Client system request.

spark-system. sh file has 2 variables that pass at runtime.

The first variable is $1, used to obtain the client IP address from the command line.

The Second variable is $2, used to obtain the client port number from the command line.

(sh spark-system.sh 192.168.0.104 9999 ): here "sh" is linux command, "spark-system.sh" is a shell script file name, "192.168.0.104" is a first variable & "9999" is a second variable.

:) ...enjoy the spark sql project.

Apache Spark SQL Project

Spark - Interview Questions & Answers

Spark - Streaming Project

1. Project Idea

2. Project Workflow

3. Building Of Project

`JavaRDD<Movie> movieRDD = spark.read().csv(filePath).javaRDD()`

4. Run The Project

1.

2.

`java -jar client.jar --server.port=9999`

3.

`sh spark-system.sh 192.168.0.104 9999`

4.

5.

6.

5. Project Files Description In Detail

(i). spark-system.sh