Apache Spark is an engine for fast, large-scale data processing. It claims to run programs up to 100x faster than Hadoop MapReduce in memory, and up to 10x faster on disk. The introduction of the Hadoop MapReduce framework greatly simplified the problem of big data management and analysis in a cost-efficient way: with commodity hardware, we can apply many algorithms to large volumes of data. But MapReduce falls short when implementing complex, multi-stage algorithms. In this article, we dig deep to understand why Apache Spark upstages the Apache Hadoop MapReduce framework.
The rise of big data mandated the development of sophisticated tools that run faster and are easy to use. We need such tools for applications such as interactive query processing, ad-hoc queries on real-time streaming data, and sophisticated processing of historical data for better decision making.
The Apache Hadoop MapReduce framework was initially designed for batch processing of large amounts of data. Tools such as Hive and Pig help execute ad-hoc queries on historical data using a query language. But processing with MapReduce and tools such as Pig and Hive is slow due to disk reads and writes during data processing. To access interesting data faster, a new stack containing tools such as HBase and Impala was introduced, enabling interactive query processing. To meet the more demanding need for real-time analytics, a streaming stack consisting of Apache Storm and Kafka was introduced.
The limitations of this model are that it is expensive and complex. It is also hard to compute consistent metrics across these stacks. Furthermore, processing streaming data is slow with MapReduce because intermediate results are stored on disk.
Apache Spark introduced a unified architecture that combines streaming, interactive, and batch processing components. With Spark, it is easy to build applications using powerful APIs in Java, Python, and Scala.
In this article, we will consider three use cases, namely graph processing, iterative machine learning algorithms, and real-time data analysis, and will try to understand how Apache Spark works better than the Hadoop architecture in each of them.
Graph Processing

Most graph processing algorithms (e.g., PageRank) perform multiple iterations over the same data and require a message passing mechanism. MapReduce must be programmed explicitly to handle these iterations. Roughly, each iteration works like this: read data from HDFS, compute, write the results back to HDFS, then read them again for the next iteration. This is very inefficient, since it involves heavy disk I/O and data replication across the cluster for fault tolerance. Also, each MapReduce iteration has high latency, and none can start until the previous job has completely finished.
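The cost of that per-iteration disk round-trip can be sketched in plain Python. This is an illustrative toy, not MapReduce or Spark code: the "intermediate result on HDFS" is stood in for by a temp file, and the doubling step stands in for an arbitrary per-iteration computation.

```python
import json
import os
import tempfile

def mapreduce_style(data, iterations):
    """Each iteration persists its result to 'HDFS' (here: a temp file)
    and reads it back before the next iteration, as MapReduce must."""
    path = os.path.join(tempfile.mkdtemp(), "intermediate.json")
    for _ in range(iterations):
        data = [x * 2 for x in data]   # the per-iteration computation
        with open(path, "w") as f:     # write intermediate result to disk
            json.dump(data, f)
        with open(path) as f:          # read it back for the next iteration
            data = json.load(f)
    return data

def spark_style(data, iterations):
    """Spark keeps the working set cached in memory across iterations."""
    for _ in range(iterations):
        data = [x * 2 for x in data]   # same computation, no disk round-trip
    return data

print(mapreduce_style([1, 2, 3], 3))   # [8, 16, 24]
print(spark_style([1, 2, 3], 3))       # [8, 16, 24]
```

Both versions compute the same answer; the difference is that the first pays serialization and disk I/O on every iteration, which at cluster scale also means replication and job-scheduling latency.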
As for message passing, the PageRank algorithm, for example, requires the scores of neighboring nodes in order to evaluate the score of a particular node. These computations need messages from neighbors (or data across multiple stages of the job), a mechanism that MapReduce lacks. Graph processing tools such as Pregel and GraphLab were designed to address the need for an efficient platform for graph processing algorithms. These tools are fast and scalable, but are not efficient for the creation and post-processing of these complex multi-stage algorithms.
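The "messages from neighbors" pattern can be made concrete with a minimal, single-machine PageRank sketch; the tiny hard-coded graph and damping factor 0.85 are illustrative choices, not anything prescribed by the frameworks discussed here.

```python
def pagerank(links, iterations=20, d=0.85):
    """links maps each node to the list of nodes it points at.
    Assumes every node has at least one outgoing link."""
    nodes = list(links)
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Each node sends a "message" (a share of its rank) to its neighbors.
        incoming = {node: 0.0 for node in nodes}
        for node, targets in links.items():
            share = ranks[node] / len(targets)
            for t in targets:
                incoming[t] += share
        # Each node's new rank combines the messages it received.
        ranks = {node: (1 - d) / n + d * incoming[node] for node in nodes}
    return ranks

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
# The ranks sum to ~1.0, and "c" ends up highest since both "a" and "b" link to it.
```

Every iteration needs the previous iteration's ranks for all neighbors, which is exactly the cross-stage data flow that is awkward to express in plain MapReduce.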
The introduction of Apache Spark solved these problems to a great extent. Spark has a graph computation library called GraphX which simplifies our life. In-memory computation along with built-in graph support improves the performance of such algorithms by one or two orders of magnitude over traditional MapReduce programs. Spark uses a combination of Netty and Akka for distributing messages among the executors.
The following statistics depict the performance of the PageRank algorithm on Hadoop and Spark.
Iterative Machine Learning Algorithms
Almost all machine learning algorithms work iteratively. As we saw earlier, iterative algorithms involve I/O bottlenecks in MapReduce implementations. MapReduce uses coarse-grained tasks (task-level parallelism) that are too heavyweight for iterative algorithms. Spark, with the help of Mesos, a distributed system kernel, caches the intermediate dataset after each iteration and runs multiple iterations on the cached dataset, which reduces the I/O and helps run the algorithm faster in a fault-tolerant manner.
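A small framework-free sketch shows why caching the dataset matters for iterative ML: the expensive step (parsing raw records) happens once, and every gradient descent iteration then runs over the in-memory data, much like repeatedly operating on a cached RDD. The records, learning rate, and toy model are all illustrative.

```python
raw_lines = ["1.0,2.1", "2.0,3.9", "3.0,6.2", "4.0,7.8"]  # "x,y" records

def parse(line):
    x, y = line.split(",")
    return float(x), float(y)

# "Cache" the parsed dataset once, analogous to calling cache() on an RDD.
points = [parse(line) for line in raw_lines]

# Fit y ≈ w * x with plain gradient descent over the cached points.
w = 0.0
lr = 0.05
for _ in range(100):  # many cheap in-memory passes over the same data
    grad = sum(2 * (w * x - y) * x for x, y in points) / len(points)
    w -= lr * grad

print(round(w, 2))  # converges to roughly 2, the slope of the toy data
```

Without caching, each of those 100 passes would re-read and re-parse the input, which is precisely the per-iteration overhead MapReduce implementations suffer from.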
Spark has a built-in scalable machine learning library called MLlib which contains high-quality algorithms that leverage iteration and can yield better results than the one-pass approximations sometimes used with MapReduce.
Real time Data Analysis
Real-time data processing is a complex task to accomplish. We are familiar with the three Vs of big data: volume, variety, and velocity. The Hadoop architecture handles the volume and variety parts with ease, but real-time analysis must also consider velocity, which is a big challenge for the existing Hadoop architecture. Real-time data analysis involves collecting data generated by real-time event streams arriving at rates of millions of events per second, and processing that data in parallel as it is collected. Along with this, it should perform event correlation using a complex event processing engine to extract useful information from the streaming data. All these steps should be performed in a fault-tolerant and distributed way.
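Spark approaches this with a micro-batch model: the incoming stream is cut into small batches, and each batch is processed as a regular (parallelizable) job. Here is a hedged, single-machine sketch of that idea; the event generator, batch size, and per-user counting are illustrative stand-ins for a real source and a real analysis.

```python
def event_stream():
    """Stand-in for a live event source (e.g. a message queue)."""
    for i in range(10):
        yield {"user": f"u{i % 3}", "value": i}

def micro_batches(stream, batch_size):
    """Cut a stream into fixed-size micro-batches."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# Process each micro-batch: count events per user within the batch.
counts_per_batch = []
for batch in micro_batches(event_stream(), 4):
    counts = {}
    for e in batch:
        counts[e["user"]] = counts.get(e["user"], 0) + 1
    counts_per_batch.append(counts)

print(counts_per_batch)  # 10 events in batches of 4 -> 3 batches (4, 4, 2)
```

In Spark Streaming each such batch becomes a distributed job over the cluster, which is how the model reconciles high-velocity input with Spark's batch execution engine.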
Hadoop’s strength lies in batch processing. The basic application of Hadoop is to store petabytes of data and perform batch processing to gain insights from it. This works well for scenarios such as analyzing banking application logs to detect fraud or digging into customer data to find patterns. But Hadoop by itself cannot deliver fast data analysis; it performs stream processing only with the help of additional technologies such as Apache Kafka and Apache Storm.
Apache Spark has a built-in Streaming API which makes it easy to build scalable and fault-tolerant streaming applications. Spark Streaming includes support for recovering from failures of both driver and worker machines to ensure 24/7 operation. This makes Spark a better fit than the Hadoop MapReduce framework for such workloads.