Comparing Real-Time Analytics and Batch Processing Applications with Hadoop MapReduce and Spark

Apache Spark is an engine for fast, large-scale data processing. It claims to run programs up to 100x faster than Hadoop MapReduce in memory, and up to 10x faster on disk. The introduction of the Hadoop MapReduce framework greatly simplified the problem of big data management and analysis in a cost-efficient way: with the help of commodity hardware, we can apply many algorithms to large volumes of data. But MapReduce performs poorly on complex, multi-stage algorithms, because every stage must write its intermediate results back to disk before the next one can read them. In this article, we dig deeper to understand why Apache Spark outperforms the Hadoop MapReduce framework.
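
To make the multi-stage point concrete, here is a minimal PySpark sketch, assuming a hypothetical input file data/points.txt with one "label feature" pair per line. An iterative gradient-descent loop caches the parsed dataset once and re-scans it in memory on every pass; an equivalent chain of MapReduce jobs would re-read and re-parse the input from disk on every iteration.

```python
from pyspark import SparkContext

sc = SparkContext(appName="iterative-sketch")

# Hypothetical input: one "label feature" pair per line, e.g. "1.0 0.42"
points = (sc.textFile("data/points.txt")
            .map(lambda line: tuple(float(v) for v in line.split()))
            .cache())  # keep the parsed dataset in memory across iterations

n = points.count()  # first action materialises and caches the RDD
w = 0.0
for i in range(10):
    # Each pass re-scans the cached partitions in memory; a chain of
    # MapReduce jobs would re-read the input from HDFS every iteration.
    grad = points.map(lambda p: (w * p[1] - p[0]) * p[1]).sum()
    w -= 0.1 * grad / n
    print(f"iteration {i}: w = {w:.4f}")
```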

Unified Architecture

The rise of big data has driven the development of sophisticated tools that run faster and are easier to use. We need such tools for a range of workloads: interactive query processing, ad-hoc queries on real-time streaming data, and sophisticated processing of historical data for better decision making.
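
As a rough illustration of that unified design, here is a minimal sketch using PySpark's SparkSession API; the log path hdfs:///logs/events.json and the user_id column are hypothetical. The same session that answers this ad-hoc SQL query over historical data can also drive streaming and machine learning jobs, with no separate system per workload.

```python
from pyspark.sql import SparkSession

# One engine for several workloads: the session below serves ad-hoc SQL,
# and the same engine backs MLlib pipelines and streaming jobs.
spark = SparkSession.builder.appName("unified-sketch").getOrCreate()

# Hypothetical historical log data
events = spark.read.json("hdfs:///logs/events.json")
events.createOrReplaceTempView("events")

# Interactive, ad-hoc query: the ten most active users
spark.sql("""
    SELECT user_id, COUNT(*) AS actions
    FROM events
    GROUP BY user_id
    ORDER BY actions DESC
    LIMIT 10
""").show()
```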


Introduction to Big Data with Apache Spark (Part-1)

With the advent of new technologies, there has been an increase in the number of data sources. Web server logs, machine log files, user activity on social media, user clicks on websites, and many other sources have caused an exponential growth of data. Individually this content may not be very large, but taken across billions of users it produces terabytes or petabytes of data. For example, Facebook, with more than 950 million users, collects 500 terabytes (TB) of data every day. Such massive volumes of data, spanning structured, semi-structured, and unstructured formats, fall under the umbrella term Big Data.

Big data matters more today because the goal has shifted: in the past we collected data and built models to predict the future, called forecasting, whereas now we also build models to estimate what is happening right now, called nowcasting. A phenomenal amount of data is collected, yet only a tiny fraction of it is ever analysed. Data Science is the discipline of deriving knowledge from big data efficiently and intelligently.

The common tasks involved in data science are the following (a rough code sketch follows the list):

  1. Dig through the data to find what is useful to analyse
  2. Clean and prepare that data
  3. Define a model
  4. Evaluate the model
  5. Repeat until we get statistically good results, and hence a good model
  6. Use this model for large-scale data processing
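
As a rough sketch of steps 2 through 6 on Spark, the PySpark snippet below loads a dataset, cleans it, fits and evaluates a simple model, and could then be applied to new data at scale. The file name events.csv, the feature columns x1 and x2, and the label column y are all hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName("workflow-sketch").getOrCreate()

# Steps 1-2: load the raw data and do a crude clean-up (drop missing rows)
raw = spark.read.csv("events.csv", header=True, inferSchema=True)
clean = raw.dropna()

# Step 3: define a model over assembled feature vectors
features = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
train, test = features.transform(clean).randomSplit([0.8, 0.2], seed=42)
model = LinearRegression(featuresCol="features", labelCol="y").fit(train)

# Steps 4-5: evaluate; in practice we would tune and refit until the
# error is acceptable
rmse = RegressionEvaluator(labelCol="y", metricName="rmse") \
    .evaluate(model.transform(test))
print(f"test RMSE: {rmse:.3f}")

# Step 6: apply the fitted model to new data at scale
# predictions = model.transform(new_events)
```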
