Apache Spark Interview Questions and Answers

Top 50 Apache Spark Questions and Answers

The following are frequently asked Apache Spark questions and answers for freshers and experienced professionals.


1. What is Apache Spark?

Apache Spark is a lightning-fast, distributed cluster-computing framework that processes data in memory. Apache Spark provides APIs for several programming languages, such as Scala, Java, R, and Python. It also ships with a rich set of tools, such as Spark SQL for processing structured data, MLlib for machine learning, GraphX for processing graphs, and Spark Streaming for processing live data streams.


2. What are the three data sources available in Apache Spark SQL?

The following data sources are available in Apache Spark SQL.

  • JSON Datasets
  • Hive tables
  • Parquet file

3. What are internal daemons in Apache Spark?

Apache Spark runs the following important internal daemons and services: Worker, Driver, Executor, MemoryStore, BlockManager, DAGScheduler, tasks, and so on.


4. What is ‘Sparse Vector’?

A sparse vector is a vector represented by two parallel arrays, one for indices and one for values. It stores only the non-zero entries in order to save space.
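As a minimal sketch using Spark MLlib's linear algebra helpers (assuming the Spark ML libraries are on the classpath), a sparse vector can be built from its size plus the indices and values arrays:

    import org.apache.spark.ml.linalg.Vectors

    // A 6-element vector with non-zero values only at positions 1 and 4.
    // Instead of storing {0.0, 3.0, 0.0, 0.0, 7.5, 0.0}, only the indices
    // array (1, 4) and the values array (3.0, 7.5) are kept.
    val sparse = Vectors.sparse(6, Array(1, 4), Array(3.0, 7.5))
    println(sparse)                                    // (6,[1,4],[3.0,7.5])
    println(sparse.toArray.mkString("[", ", ", "]"))   // [0.0, 3.0, 0.0, 0.0, 7.5, 0.0]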


5. What are the languages supported by Apache Spark for developing big data applications?

The following is a list of languages that can be used to develop big data applications with Apache Spark.

  • Java
  • Python
  • R
  • Clojure
  • Scala

6. What are accumulators?

An accumulator is a shared variable that is write-only from the workers' point of view. It is initialized once on the driver and forwarded to the workers; the workers update it according to the logic defined in their tasks, and the aggregated value is then sent back to the driver, which is the only place it can be read.
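A minimal Scala sketch (e.g. run in spark-shell) that uses a long accumulator to count malformed records; the sample data and accumulator name are illustrative:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("AccumulatorExample").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // A long accumulator registered with the SparkContext; tasks can only add to it.
    val badRecords = sc.longAccumulator("badRecords")

    sc.parallelize(Seq("1", "2", "x", "4", "y")).foreach { value =>
      // Workers update the accumulator but never read its value.
      if (scala.util.Try(value.toInt).isFailure) badRecords.add(1)
    }

    // Only the driver reads the aggregated result.
    println(s"Bad records: ${badRecords.value}")   // Bad records: 2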


7. What is the difference between transform and map on a DStream in Apache Spark?

The transform function lets developers apply arbitrary RDD-to-RDD transformations to the underlying RDDs of a stream, for example joining each batch with another RDD, whereas map applies a function element by element. In short, map works with the individual elements of a DStream, while transform gives the user access to each RDD of the DStream.
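A minimal Scala sketch of the difference; the socket source on localhost:9999 and the small lookup RDD are purely illustrative:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("TransformVsMap").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))

    val lines     = ssc.socketTextStream("localhost", 9999)
    val lookupRdd = ssc.sparkContext.parallelize(Seq(("spark", "framework"), ("mesos", "manager")))

    // map: element-wise operation on the DStream.
    val lengths = lines.map(_.length)

    // transform: exposes each underlying RDD, so RDD-level operations
    // such as a join with another RDD become possible.
    val joined = lines.transform(rdd => rdd.map(word => (word, 1)).join(lookupRdd))

    lengths.print()
    joined.print()
    ssc.start()
    ssc.awaitTermination()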


8. How to connect Apache Spark with Apache Mesos?

The following are the steps to connect Apache Spark with Apache Mesos.

  • The Spark driver program must be configured to connect to Mesos, and the Spark binary package must be placed in a location accessible by Mesos.
  • Alternatively, Apache Spark can be installed in the same location on all Mesos agents, with the property ‘spark.mesos.executor.home’ pointing to that location.

9. Is it necessary to install Apache Spark on all the nodes of the YARN cluster?

No. It is not required to install Spark on every node when running a job under YARN or Mesos, because Spark can execute on top of a YARN or Mesos cluster without requiring any change to the cluster.


10. What is the difference between Hadoop and Spark in terms of ease of use?

Apache Hadoop supports batch processing, whereas Apache Spark is an in-memory processing system. Hadoop uses the MapReduce programming model, which is Java-based and takes time to learn, although tools such as Hive and Pig make it easier to use. Apache Spark, on the other hand, provides APIs for several languages such as Python, Java, and Scala, and also offers Spark SQL for SQL users, which makes it much easier to use than Hadoop.


11. What is the Parquet file?

Parquet is a columnar file format that is supported by many data processing systems. Apache Spark SQL supports both reading and writing Parquet files.
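A minimal Scala sketch of reading and writing Parquet with Spark SQL; the output path /tmp/people.parquet is just an illustrative location:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("ParquetExample").master("local[*]").getOrCreate()
    import spark.implicits._

    // Write a small DataFrame to Parquet, then read it back.
    val people = Seq(("Alice", 29), ("Bob", 35)).toDF("name", "age")
    people.write.mode("overwrite").parquet("/tmp/people.parquet")

    val loaded = spark.read.parquet("/tmp/people.parquet")
    loaded.show()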


12. What are the components of the Apache Spark Ecosystem?

The following is the list of Apache Spark components.

  • Spark Core: This is the Apache Spark core engine for large-scale parallel and distributed data processing.
  • Spark SQL: It provides APIs for processing structured data with SQL queries.
  • Spark Streaming: This component is used to process live streaming data.
  • GraphX: It supports graphs and graph-parallel computation.
  • MLlib: It allows you to perform machine learning in Apache Spark.
  • SparkR: This component provides an interface for using the R language with Apache Spark.

13. Should you install Spark on all nodes of the YARN cluster?

There is no need to install Apache Spark on all nodes of a cluster, because Apache Spark runs on top of YARN, which handles the distribution of work. The common configuration options for running Apache Spark on YARN are executor-memory, executor-cores, deploy-mode, master, queue, and so on.


14. What is the benefit of learning MapReduce when Spark is also available?

We should still learn MapReduce, because tools such as Hive and Pig convert their queries into MapReduce phases; understanding MapReduce helps to optimize those queries, and it remains useful as data grows bigger and bigger.


15. What is Executor Memory?

The heap size allocated to a Spark executor is referred to as the Spark executor memory. It is controlled by the spark.executor.memory property or the --executor-memory flag. Every Spark application has a fixed executor heap size, which determines how much memory the application can use on each executor.
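As a hedged sketch, the executor heap can be requested either in code through the spark.executor.memory property or on the command line with --executor-memory; the 4g value and the application names below are only examples:

    import org.apache.spark.sql.SparkSession

    // Request a 4 GB heap per executor.
    val spark = SparkSession.builder()
      .appName("ExecutorMemoryExample")
      .config("spark.executor.memory", "4g")
      .getOrCreate()

    // The same setting can be passed when submitting the application:
    //   spark-submit --executor-memory 4g --class com.example.MyApp my-app.jar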


16. What is RDD in Apache Spark?

Apache Spark RDD stands for “Resilient Distributed Dataset”, which is the fundamental data structure and building block of any Spark application. An RDD is immutable and supports transformations and actions. Each dataset in an RDD is logically partitioned and spread across multiple nodes, so it can be computed on the distributed nodes of the cluster, and it can be rebuilt automatically in case of node failure.

There are the following three ways to create an RDD in Apache Spark (see the sketch after this list).

  • By parallelizing an existing collection in the driver program.
  • From an external dataset such as HDFS, a local file system, or HBase.
  • From an existing RDD, by applying a transformation.
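A minimal Scala sketch of the three creation paths; the HDFS path is illustrative:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("RddCreation").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // 1. Parallelize an existing collection in the driver program.
    val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))

    // 2. Load an external dataset (the path is illustrative).
    val fromFile = sc.textFile("hdfs:///data/input.txt")

    // 3. Derive a new RDD from an existing one via a transformation.
    val fromExisting = fromCollection.map(_ * 2)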

17. What operations can we perform with RDD?

The following two types of operations can be performed using Apache Spark RDD.

    Transformations

    Apache Spark transformations are operations applied to an RDD that produce a new RDD from the existing one. Transformations are lazy, which means that no computation is performed until an action is called. Examples of transformation functions are map, flatMap, and filter.

    Actions

    Apache Spark actions perform a computation on the dataset and return the result to the driver program. Actions use the lineage graph to load the data into the RDD. Examples of action functions are count(), collect(), and saveAsTextFile(path).
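A minimal Scala sketch showing lazy transformations followed by actions; the sample words are illustrative:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("RddOperations").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Transformations are lazy: they only describe a new RDD.
    val words     = sc.parallelize(Seq("spark", "hadoop", "spark", "mesos"))
    val pairs     = words.map(word => (word, 1))      // transformation
    val sparkOnly = pairs.filter(_._1 == "spark")     // transformation

    // Actions trigger the actual computation and return results to the driver.
    println(sparkOnly.count())                        // action: prints 2
    sparkOnly.collect().foreach(println)              // action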


18. What is Apache Spark Core?

Apache Spark Core is the base of the Spark architecture. The task of Spark Core is to provide distributed task dispatching, scheduling, and basic I/O functionalities, and to expose a programming interface for languages such as Scala, Python, and Java.

Apache Spark Core performs the following actions.

  • Fault recovery
  • Scheduling, distributing, and monitoring jobs on a cluster
  • Memory management
  • Interacting with storage systems

19. What is Apache Spark Streaming?

Apache Spark Streaming provides an interface for processing live streaming data. The input data can come from sources such as Flume, Kinesis, and so on. The data is processed by applying functions such as map, reduce, and join, and once processing is complete the output is sent to other systems such as file systems, databases, and so on.


20. What is Apache Spark SQL?

Apache Spark SQL is the Spark module for processing structured data. It brings a SQL flavor to Spark: it can process data stored in RDDs as well as data in external sources. Users can interact with Spark SQL through SQL queries or through its APIs. Apache Spark SQL supports both streaming and batch processing.
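A minimal Scala sketch; the employees data and column names are illustrative, and the same result can be obtained through SQL or the DataFrame API:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("SparkSqlExample").master("local[*]").getOrCreate()
    import spark.implicits._

    val employees = Seq(("Alice", "HR", 5000), ("Bob", "IT", 6000)).toDF("name", "dept", "salary")
    employees.createOrReplaceTempView("employees")

    // Query the same data with SQL ...
    spark.sql("SELECT dept, AVG(salary) AS avg_salary FROM employees GROUP BY dept").show()
    // ... or with the DataFrame API.
    employees.groupBy("dept").avg("salary").show()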


21. What is Apache Spark MLlib?

Apache Spark MLlib is Spark's machine learning (ML) library and one of the major components of Spark. Its goal is to make machine learning easy and scalable. It provides common learning algorithms and utilities such as classification, regression, clustering, collaborative filtering, and dimensionality reduction. MLlib lets data scientists focus on their data problems and models instead of on issues related to configuration or infrastructure (a minimal example follows the package list below).

Apache Spark MLlib is divided into two packages.

  • spark.mllib package
  • spark.ml package
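A minimal Scala sketch using the DataFrame-based spark.ml package to fit a k-means model; the tiny feature vectors, k = 2, and the seed are illustrative:

    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("MLlibExample").master("local[*]").getOrCreate()

    // A tiny feature DataFrame; real pipelines would load and vectorize real data.
    val data = Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
    ).map(Tuple1.apply)
    val df = spark.createDataFrame(data).toDF("features")

    // Fit a k-means model with two clusters.
    val model = new KMeans().setK(2).setSeed(1L).fit(df)
    model.clusterCenters.foreach(println)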

22. What is Apache Spark GraphX?

Apache Spark GraphX is the Spark module for processing graphs. It has built-in graph algorithms and builders for performing analysis on graphs. Apache Spark GraphX inherits the properties of Spark, such as distributed processing. It provides several algorithms to simplify analytical tasks, such as PageRank, Connected Components, and Triangle Counting.
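A minimal Scala sketch that builds a tiny graph and runs the built-in PageRank algorithm; the vertex names and edges are illustrative:

    import org.apache.spark.graphx.{Edge, Graph}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("GraphXExample").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // A small property graph: vertices carry a name, edges carry a weight.
    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges    = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
    val graph    = Graph(vertices, edges)

    // Run PageRank until convergence (tolerance 0.0001).
    val ranks = graph.pageRank(0.0001).vertices
    ranks.collect().foreach { case (id, rank) => println(s"$id -> $rank") }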


23. What is Apache Spark Driver?

The driver is a program that runs on the master node of the cluster and declares transformations and actions on RDDs of data. It creates a SparkContext and connects to the Spark master.


24. What are the libraries of Apache Spark SQL?

The following are four libraries of Apache Spark SQL.

  1. Data Source API
  2. DataFrame API
  3. SQL Service
  4. Interpreter & Optimizer

25. Can we use Apache Spark along with Hadoop?

Apache Spark is compatible with Hadoop: it can run on YARN and store its data in HDFS, so Spark and Hadoop can be used together to utilize the best of both YARN and HDFS.

Apache Spark can be used with Hadoop in the following way.

  • Spark can run on top of HDFS to utilize distributed storage.
  • Spark can be used as a processing unit along with MapReduce in the same cluster of nodes where MapReduce will be used for batch processing and Spark will be used for real-time processing.
  • Spark applications can use YARN.

26. What are the file systems supported by Apache Spark?

The following three file systems are supported by Apache Spark.

  1. Local File system
  2. Hadoop Distributed File System (HDFS)
  3. Amazon S3

27. Which cluster managers does Apache Spark support?

Apache Spark mainly supports three types of cluster managers.

  1. Standalone is a basic manager to set up a cluster.
  2. Apache Mesos is a commonly-used cluster manager.
  3. YARN, which is responsible for resource management in Hadoop.

28. What is the Apache Spark worker node?

Apache Spark worker nodes are the slave nodes that perform the actual operations assigned by the master node. A worker node processes the data stored on it and reports back to the master node; based on resource availability, the master node assigns tasks to the worker nodes.


29. Can we use Apache Spark to access data stored in Cassandra databases?

Yes. By using the Spark Cassandra Connector you can access data from a Cassandra database. The connector plugs Cassandra into Apache Spark so that Spark queries the local Cassandra node, which makes queries fast by reducing the network traffic needed to send data between Spark executors and Cassandra nodes.
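As a rough sketch, assuming the DataStax spark-cassandra-connector is on the classpath and that a keyspace test_ks with a users table exists (both names are hypothetical), reading Cassandra data through the DataFrame API typically looks like this:

    import org.apache.spark.sql.SparkSession

    // spark.cassandra.connection.host must point at a reachable Cassandra node.
    val spark = SparkSession.builder()
      .appName("CassandraExample")
      .config("spark.cassandra.connection.host", "127.0.0.1")
      .getOrCreate()

    // Hypothetical keyspace and table names, used purely for illustration.
    val users = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "test_ks", "table" -> "users"))
      .load()

    users.filter("age > 30").show()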


30. How to run Apache Spark on Apache Mesos?

Apache Spark can be deployed on a cluster managed by Apache Mesos. In this setup the Mesos master replaces the Spark master as the cluster manager, and Mesos decides which tasks are assigned to which machines.


31. How to connect Apache Spark with Apache Mesos?

The following are the steps to connect Apache Spark with Apache Mesos.

  • Configure the Spark driver program to connect to Mesos.
  • Make the Spark binary package accessible to Mesos.
  • Set the property spark.mesos.executor.home to the location where Spark is installed.

32. What is the Broadcast variable?

Broadcast variables store a read-only copy of a variable on each machine. They can be used to give every node a copy of a large dataset efficiently, and Spark distributes broadcast variables in a way that reduces communication cost.
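A minimal Scala sketch that broadcasts a small lookup map to all executors; the map contents are illustrative:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("BroadcastExample").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Ship a read-only lookup table to every executor once, instead of with every task.
    val countryNames = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))

    val codes = sc.parallelize(Seq("IN", "US", "IN"))
    val named = codes.map(code => countryNames.value.getOrElse(code, "unknown"))
    named.collect().foreach(println)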


33. What is a DStream in Apache Spark?

A DStream (discretized stream) is the basic abstraction of Spark Streaming: a continuous sequence of RDDs. It either receives data continuously from external sources or is generated by processing another input stream.
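A minimal Scala word-count sketch over a DStream; the socket source on localhost:9999 and the 5-second batch interval are illustrative:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("DStreamExample").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))

    // Each 5-second batch becomes one RDD in the DStream, and the same
    // transformations are applied to every batch.
    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()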


34. What are the different levels of persistence in Apache Spark?

In Apache Spark, an RDD can be persisted using the persist() or cache() methods with one of the following storage levels (see the sketch after this list).

  • MEMORY_ONLY
  • MEMORY_AND_DISK
  • MEMORY_ONLY_SER
  • MEMORY_AND_DISK_SER
  • DISK_ONLY
  • MEMORY_ONLY_2, MEMORY_AND_DISK_2
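A minimal Scala sketch of cache() versus persist() with an explicit storage level; the data is illustrative:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().appName("PersistExample").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val numbers = sc.parallelize(1 to 1000000)

    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY) on an RDD.
    numbers.cache()

    // persist() lets you pick any of the levels listed above.
    val squares = numbers.map(n => n.toLong * n)
    squares.persist(StorageLevel.MEMORY_AND_DISK)

    println(squares.count())   // the first action materializes and stores the RDD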

35. What is Akka in Apache Spark?

Apache Spark uses Akka for communication between the worker nodes and the master node; for example, when worker nodes request tasks from the master node, the messages are exchanged through Akka.


36. What are Apache Spark Datasets?

A Dataset is a distributed collection of structured data mapped to a relational schema. Two types of operations can be performed on a Dataset: transformations and actions. Applying transformations to a Dataset produces a new Dataset, while applying an action returns a result to the driver.
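A minimal Scala sketch (e.g. in spark-shell) of a typed Dataset with one transformation and one action; the Person case class and data are illustrative:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("DatasetExample").master("local[*]").getOrCreate()
    import spark.implicits._

    // The schema is derived from the case class; a Dataset is strongly typed.
    case class Person(name: String, age: Int)

    val people = Seq(Person("Alice", 29), Person("Bob", 35)).toDS()

    val adults = people.filter(_.age > 30)   // transformation: returns a new Dataset
    adults.show()                            // action: triggers the computation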


37. What are Apache Spark transformations and actions?

In Apache Spark, all transformations are lazy, which means that no execution takes place until an action is called. When we create an RDD by applying transformation methods such as map(), filter(), or flatMap(), nothing is computed because of this lazy nature; once we call action methods such as collect(), take(), or foreach(), the computation executes and the result is returned to the user.


38. What are Transformations in Apache Spark?

RDD transformations are the methods we apply to a dataset to create a new RDD: they take an existing RDD and produce a new RDD by applying a transformation function. The newly created RDDs are immutable and cannot be changed. All transformations in Spark are lazy, so when a transformation such as map(), filter(), or flatMap() is applied to an RDD, nothing happens until an action such as collect(), take(), or foreach() is invoked, at which point the actual transformation/computation on the RDD is performed.


39. What are the Actions in Apache Spark?

Actions operate on the RDDs that were created by applying transformations. The action phase is the final phase of a program: it triggers the computation and returns the result to the driver program.


40. What is Apache Spark Lazy Evaluation?

Lazy evaluation means that Apache Spark does not execute transformations at the moment they are declared. When we create an RDD by applying transformations such as map(), filter(), or flatMap(), Spark only records the lineage; the actual computation happens when an action such as collect(), take(), or foreach() is called, and the result is then returned to the user.


41. What is the difference between Apache Spark SQL, HQL, and SQL?

Apache Spark SQL is a component of Spark. It supports both SQL and Hive QL (HQL) without requiring any change in syntax, and you can join SQL/HQL tables through Spark SQL and retrieve the results.


42. What is Apache SparkContext?

The SparkContext is the entry point of a Spark application. SparkContext allows you to create RDDs, and once an RDD is created, transformations and actions can be performed on it to get the desired result.
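A minimal Scala sketch that creates a SparkContext directly; in newer code a SparkSession wraps it and exposes it as spark.sparkContext:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("SparkContextExample").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // RDDs are created through the context, and actions run through it as well.
    val rdd = sc.parallelize(Seq(1, 2, 3, 4))
    println(rdd.sum())

    sc.stop()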


43. Can you implement Machine Learning in Apache Spark?

Machine Learning can be implemented in Apache Spark by using the MLlib package.


44. What is the difference between persist () and cache () functions?

The cache() function uses the default storage level (MEMORY_ONLY for RDDs), whereas the persist() function allows the user to specify the storage level explicitly.


45. What is a checkpoint in Apache Spark?

A checkpoint is an Apache Spark Core feature that saves an RDD to reliable storage (such as HDFS) and truncates its lineage, which allows an application or driver to recover in case of failure.
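A minimal Scala sketch; the checkpoint directory hdfs:///checkpoints/my-app is illustrative and should point at reliable storage:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("CheckpointExample").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Checkpoint data must go to reliable storage such as HDFS.
    sc.setCheckpointDir("hdfs:///checkpoints/my-app")

    val rdd = sc.parallelize(1 to 100).map(_ * 2)
    rdd.checkpoint()       // marks the RDD for checkpointing and truncates its lineage
    println(rdd.count())   // the checkpoint is written when the first action runs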


46. What are the major components of the Apache Spark ecosystem?

The following are the major components of the Apache Spark ecosystem.

  • Core components: Spark Core, Spark SQL, Spark Streaming, Spark MLlib, and GraphX.
  • Language support: Spark can be integrated with multiple languages such as Java, Python, R, and Scala.
  • Cluster management: Spark can run on a standalone cluster, YARN, or Apache Mesos.

47. What are the algorithms provided by Apache Spark GraphX?

The following algorithms are supported by Apache Spark GraphX.

  • PageRank
  • Connected Components
  • Label Propagation
  • SVD++
  • Strongly Connected Components
  • Triangle Count

48. What is Apache SparkR?

Apache SparkR is a Spark module that provides an interface for using the R language with Spark. R is very popular for data analysis and has a very active community. SparkR supports operations such as selection, filtering, and aggregation on large datasets.


49. What is Apache Spark Tuning?

Apache Spark tuning refers to the procedure of adjusting settings for memory, code, I/O, and the instances used by the system so that resources work in an optimized way. Tuning is needed because of the in-memory nature of Spark computations: a Spark program can be bottlenecked by any resource in the cluster, such as CPU, network bandwidth, or memory. Most often, if the data fits in memory, the bottleneck is network bandwidth, and sometimes we also need to store RDDs in serialized form to decrease memory usage.


50. What is Dataframe in Apache Spark?

A DataFrame is a distributed collection of data organized into named columns, conceptually equivalent to a table in a traditional database. Each column has a name and a data type. DataFrames can be created from multiple sources such as Hive tables, structured data files, existing RDDs, and so on.
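A minimal Scala sketch that builds a DataFrame from a local collection; the product data is illustrative, and the same API can load Hive tables, JSON/Parquet files, or convert existing RDDs with toDF():

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("DataFrameExample").master("local[*]").getOrCreate()
    import spark.implicits._

    val sales = Seq(("laptop", 1200.0), ("phone", 800.0)).toDF("product", "price")

    sales.printSchema()                    // shows column names and data types
    sales.filter($"price" > 1000).show()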