Apache Spark Performance Tuning

What is Apache Spark Tuning?

Apache Spark tuning is the process of adjusting the settings to data for system resources such as memory, I/O, code, and so on. Performance tunning task identify the bottleneck of the system and fix it so that am optimum result can be delivered. As Apache Spark performs the in-memory operation, it is important to check the performance of the program by inspecting the usage of node CPU, memory, network bandwidth, and so on. For example, if data is not fitting in memory then the load will be on network bandwidth.

In this tutorial, we will cover the following two topics of performance tuning.

1. Data Serialization

Using Data Serialization we can store RDD in serialized form to reduce memory usage. It plays an important role in the performance of the distributed application. The formats of data that is slow to serialize objects into, or consume a large number of bytes, will greatly slow down the computation that should be tuned first to optimize a Spark application.

Apache Spark will provide a good performance by the below points.

  • By Enhancing system’s performance time.
  • By using all resources effectively.
  • By ensuring that all jobs are running on a relative execution engine.
  • By terminating those jobs which are running for a longer duration.

Data Serialization provides the following two libraries.

  • Java serialization
  • Kryo serialization
Java serialization

In Java, serialization Spark uses Java’s ObjectOutputStream framework to serialize objects post that it can work with any class you create that implements java.io.Serializable. Performance of serialization can also be controlled by using java.io.Externalizable.

Kryo serialization

Kryo serialization is used by Spark to serialize objects more quickly. It is 10x faster than Java serialization but it will not support all Serializable types and requires you to register the classes you’ll use in the program in advance for best performance.

2. Memory Tuning

There are the following three considerations for memory tunning.

  • The entire dataset should be fit in memory by considering the memory used by an object.
  • The cost involved in accessing those objects.
  • The overhead of garbage collection.

The Java objects are fast to access but consume a factor of 2-5x more space than the “raw” data inside their fields and this happens due to the below reasons.

  • The java object has 16 bytes of object header which holds that information about its class. So if there is an object which has very little data(for example one int field), then it can higher than the actual data.
  • Java Strings have about 40 bytes of overhead over the raw string data and store each character as two bytes due to String’s internal usage of UTF-16 encoding.
  • The primitive types of collections are saved as boxed objects for example java.lang.Integer.

Data Structures Tuning

By avoiding Java Features reduce memory consumption.

There are the following ways to do it.

  • If the RAM size of the node is less than 32 GB then make JVM flag -XX:+UseCompressedOops pointers to 4 bytes despite 8 in the spark-env.sh file.
  • You can use strings for keys instead of using numeric IDs or enumeration objects.
  • The high number of small objects and pointers for nested structures should be avoided.
  • A user can design his data structures to favor arrays of objects despite a standard Java or Scala collection classes.

Garbage Collection Tuning

If there is a large set of RDDs collected by a user program in that case JVM garbage collection could be a difficult task because if Java will remove old objects to create room for new once then it will have to track the complete object to find the unused object. This is a very difficult task hence using the data structure with fewer objects brings the cost down.

Memory Management

Good memory management is needed for a great performance and in Spark, memory uses mainly fall under two categories execution and storage. The execution memory is used for computation in shuffles, joins, sorts, and aggregations whereas the storage memory are used for caching and developing internal data across the cluster.

There is the following two memory configuration which should not be modified because the default value is required by most of the workloads.

  • The "spark.memory.fraction" represents the JVM heap space that is 300 MB and the rest space is reserved for the user's data structure, safeguarding against OOM errors, internal metadata, and so on.
  • The spark.memory.storageFraction display the R size.

Other Considerations For Performance Tuning

The following are some other performance tunning suggestions.

Level of Parallelism

By setting the parallelism to a level high will fully utilize the cluster resources for any operation. Apache Spark sets the map task depending upon the size of the file and parallelism can also be set by using spark.default.parallelism parameter. The recommendation is to take 2-3 tasks per CPU core.

Memory Usage of Reduce Tasks

In some cases, OutOfMemoryError memory error is thrown because Apache Spark operations such as groupByKey, sortByKey, reduceByKey create a hash table to perform grouping that could be very large. To fix this issue we can increase the parallelism level to a high value.

Broadcasting Large Variables

The broadcast functionality of SparkContext can greatly reduce the size of each serialized task and the cost of launching a job over a cluster. If a task is using a large object from the driver program consider turning it into a broadcast variable. Generally, those tasks which are s larger than about 20 KB are probably worth optimizing, and Spark prints the serialized size of each task on the master.

Data Locality

Data Locality in Apache Spark means placing code near the data despite data is traveling to the code which requires a high amount of I/O. Data locality improves performance because operation happens near the data only.

The following are the level of Data locality.

  • The PROCESS_LOCAL data and the running user's code need to available on the same JVM.
  • The NODE_LOCAL should present on the same node. It is a bit slower to PROCESS_LOCAL because data travels in between processes.
  • The NO_PREF data can be easily accessible from anywhere as this has no locality choice.
  • The RACK_LOCAL data should be present on the same rack of servers.