What is In-Memory Computation?
Apache Spark is a fast cluster computing system used for interactive queries as well as iterative algorithms. Spark provides a rich set of APIs for distributed programming, comparable to what the Hadoop MapReduce model provides. Much of Spark's performance comes from caching data in the memory of cluster nodes, which lets future tasks on that data run much faster (often around 10 times faster).
An Apache Spark RDD can be persisted using the persist() or cache() methods.
When we use the cache() method, the RDD is stored in memory, which reduces disk I/O on subsequent actions. This is a good fit for iterative workloads such as machine learning, where the same RDD is reused across many operations. The difference between cache() and persist() is that cache() always uses the default level StorageLevel.MEMORY_ONLY, while persist() lets us choose from the different storage levels which we will see below.
Persist() Storage Level In Spark
The storage levels accepted by persist() are listed below.
- MEMORY_ONLY
- MEMORY_AND_DISK
- MEMORY_ONLY_SER
- MEMORY_AND_DISK_SER
- DISK_ONLY
- MEMORY_ONLY_2, MEMORY_AND_DISK_2
Let us discuss each storage level.
MEMORY_ONLY
In this option, RDD partitions are stored in memory as deserialized Java objects. If the RDD does not fit in memory, the partitions that do not fit are not spilled to disk; instead, they are recomputed on the fly each time they are needed.
MEMORY_AND_DISK
In this option, the RDD is stored in memory, and partitions that do not fit in memory are spilled to disk and read back from disk whenever they are required.
MEMORY_ONLY_SER
In this option, RDD partitions are stored as serialized Java objects, one byte array per partition. This is more CPU-intensive to read back, but space-efficient, especially when using a fast serializer.
MEMORY_AND_DISK_SER
It is similar to MEMORY_ONLY_SER, except that partitions which do not fit in memory are spilled to disk instead of being recomputed each time they are needed.
DISK_ONLY
It stores the RDD partitions only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2
These are the same as MEMORY_ONLY and MEMORY_AND_DISK, but replicate each partition on two cluster nodes.