Apache Spark In Memory Computation

What is In-Memory Computation?

Apache Spark is a fast cluster computing system used for interactive queries as well as iterative algorithms. It provides a rich set of APIs for distributed programming, comparable to what the Hadoop MapReduce model provides. Much of Spark's performance comes from in-memory computation: data is cached in the memory of the cluster nodes, so later tasks that reuse it can skip reading it from disk or recomputing it. For workloads that reuse data, this can make those later tasks much faster (a figure of around 10 times is often quoted, though the gain depends on the workload).

An Apache Spark RDD can be persisted by using the persist() or cache() methods.

When we use the cache() method, the RDD is stored in memory, which reduces disk I/O overhead on later use. This is a good fit for iterative workloads such as machine learning, where the same dataset is read many times. Neither method computes anything immediately: the RDD is stored the first time an action computes it, and later operations on it reuse the stored partitions. The difference between cache() and persist() is that cache() always uses the default storage level, StorageLevel.MEMORY_ONLY, while persist() accepts a storage level as an argument; the available levels are described below.
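
The relationship between cache() and persist() can be sketched with a small single-process toy model. This is not the Spark API: ToyRDD, its compute_calls counter, and the reduced StorageLevel enum are hypothetical names introduced only for illustration. The sketch assumes that an unpersisted dataset is rebuilt on every action, while a persisted one is stored the first time an action computes it.

```python
from enum import Enum


class StorageLevel(Enum):
    # Toy stand-ins for a few storage level names; not Spark's StorageLevel class.
    NONE = "NONE"
    MEMORY_ONLY = "MEMORY_ONLY"
    MEMORY_AND_DISK = "MEMORY_AND_DISK"


class ToyRDD:
    """Hypothetical single-process stand-in for an RDD (illustration only)."""

    def __init__(self, compute):
        self._compute = compute            # function that rebuilds the data from scratch
        self.storage_level = StorageLevel.NONE
        self._stored = None
        self.compute_calls = 0             # how many times the data was rebuilt

    def persist(self, level=StorageLevel.MEMORY_ONLY):
        # Only marks the dataset; nothing is computed until an action runs.
        self.storage_level = level
        return self

    def cache(self):
        # Shorthand for persist() with the MEMORY_ONLY level.
        return self.persist(StorageLevel.MEMORY_ONLY)

    def collect(self):
        # An "action": reuse the stored copy if there is one, otherwise compute.
        if self._stored is not None:
            return self._stored
        self.compute_calls += 1
        data = list(self._compute())
        if self.storage_level is not StorageLevel.NONE:
            self._stored = data
        return data


uncached = ToyRDD(lambda: (x * x for x in range(5)))
uncached.collect()
uncached.collect()

cached = ToyRDD(lambda: (x * x for x in range(5))).cache()
cached.collect()
cached.collect()

print(uncached.compute_calls, cached.compute_calls)  # prints "2 1"
```

In this model the only thing cache() adds over persist() is a fixed argument, which mirrors the distinction described above.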

Persist() Storage Level In Spark

The storage levels that can be passed to persist() are listed below.

  • MEMORY_ONLY
  • MEMORY_AND_DISK
  • MEMORY_ONLY_SER
  • MEMORY_AND_DISK_SER
  • DISK_ONLY
  • MEMORY_ONLY_2, MEMORY_AND_DISK_2

Let us discuss each storage level.

MEMORY_ONLY

In this option, the RDD is stored in memory as deserialized Java objects. If the RDD does not fit in memory, the partitions that do not fit are not saved to disk; instead, they are recomputed each time they are needed.

MEMORY_AND_DISK

In this option, the RDD is stored in memory, and any partitions that do not fit in memory are stored on disk and read from there when they are needed.
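
The practical difference between MEMORY_ONLY and MEMORY_AND_DISK shows up when the data is read more than once. The sketch below is a toy model under simplifying assumptions: ToyBlockStore, build_partition, and run_two_passes are hypothetical names, memory capacity is counted in whole partitions, and nothing is ever evicted (Spark's real memory management is more involved).

```python
class ToyBlockStore:
    """Hypothetical model of where cached partitions end up (illustration only)."""

    def __init__(self, memory_slots, spill_to_disk):
        self.memory_slots = memory_slots    # how many partitions fit in memory
        self.spill_to_disk = spill_to_disk  # False ~ MEMORY_ONLY, True ~ MEMORY_AND_DISK
        self.memory = {}
        self.disk = {}
        self.computations = 0               # how many times a partition was built

    def get(self, partition_id, compute):
        if partition_id in self.memory:
            return self.memory[partition_id]
        if partition_id in self.disk:
            return self.disk[partition_id]  # slower read, but no recomputation
        self.computations += 1
        data = compute(partition_id)
        if len(self.memory) < self.memory_slots:
            self.memory[partition_id] = data
        elif self.spill_to_disk:
            self.disk[partition_id] = data
        return data


def build_partition(partition_id):
    # Stand-in for an expensive computation: three numbers per partition.
    return [partition_id * 10 + i for i in range(3)]


def run_two_passes(store, num_partitions=4):
    # Read every partition twice, as an iterative job would.
    for _ in range(2):
        for pid in range(num_partitions):
            store.get(pid, build_partition)
    return store.computations


memory_only = ToyBlockStore(memory_slots=2, spill_to_disk=False)
memory_and_disk = ToyBlockStore(memory_slots=2, spill_to_disk=True)

print(run_two_passes(memory_only))      # 6: partitions 2 and 3 are rebuilt on the second pass
print(run_two_passes(memory_and_disk))  # 4: every partition is built exactly once
```

Which level is better depends on whether rebuilding a partition is cheaper than reading it back from disk.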

MEMORY_ONLY_SER

In this option, the RDD is stored as serialized objects, one byte array per partition. This is more space-efficient than storing deserialized objects, especially with a fast serializer, but it is more CPU-intensive to read.
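
The idea of "one byte array per partition" can be illustrated in plain Python. In this sketch, pickle is only a stand-in for whatever serializer the cluster is configured with, and the partitions list and read_partition helper are hypothetical names used for illustration.

```python
import pickle

# Two small "partitions" of (id, name) records.
partitions = [
    [(i, f"user-{i}") for i in range(0, 3)],
    [(i, f"user-{i}") for i in range(3, 6)],
]

# Serialized storage: each partition becomes a single bytes object.
serialized = [pickle.dumps(p, protocol=pickle.HIGHEST_PROTOCOL) for p in partitions]


def read_partition(index):
    # Every read pays the CPU cost of deserializing the whole partition.
    return pickle.loads(serialized[index])


restored = [read_partition(i) for i in range(len(serialized))]

print(all(isinstance(blob, bytes) for blob in serialized))  # True
print(restored == partitions)                               # True
```

The trade-off described above is visible in read_partition: the stored form is compact, but the records are not usable until they have been deserialized.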

MEMORY_AND_DISK_SER

It is similar to MEMORY_ONLY_SER, but partitions that do not fit in memory are spilled to disk instead of being recomputed each time they are needed.

DISK_ONLY

It stores the RDD partitions only on disk.

MEMORY_ONLY_2, MEMORY_AND_DISK_2

These are the same as the corresponding levels above, but each partition is replicated on two cluster nodes.
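
One consequence of replication is that a cached partition can survive the loss of a single node. The sketch below is a toy model only: place_partitions and surviving_partitions are hypothetical helpers, the node names are made up, and the round-robin placement is an assumption for illustration rather than how Spark actually chooses replica locations.

```python
def place_partitions(num_partitions, nodes, replication):
    """Assign each partition to `replication` distinct nodes, round-robin.

    Assumes replication <= len(nodes) so that the replicas land on different nodes.
    """
    placement = {}
    for pid in range(num_partitions):
        placement[pid] = [nodes[(pid + r) % len(nodes)] for r in range(replication)]
    return placement


def surviving_partitions(placement, failed_node):
    # A partition survives if at least one of its holders is not the failed node.
    return sorted(
        pid
        for pid, holders in placement.items()
        if any(node != failed_node for node in holders)
    )


nodes = ["node-a", "node-b", "node-c"]
single = place_partitions(6, nodes, replication=1)
double = place_partitions(6, nodes, replication=2)

print(surviving_partitions(single, "node-a"))  # [1, 2, 4, 5]
print(surviving_partitions(double, "node-a"))  # [0, 1, 2, 3, 4, 5]
```

With a single copy, the partitions held by the failed node would have to be recomputed; with two copies, they can still be read from the other replica, at the cost of using twice the storage.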