Statistics – Understanding Basic Concepts and Dispersion

In the last post, Statistics – Understanding the Levels of Measurement, we saw what variables are and how we measure them based on the different levels of measurement. In this post, we will talk about some of the basic concepts that are important to get started with statistics, and then dive deep into the concept of dispersion.

Fig: Histogram and Distribution Curve

Histogram:

A histogram is a graphical representation of the distribution of numerical data. We know the basic bar graph, but in a histogram all the bars are connected, i.e. they touch each other, meaning there is no gap between the bins. For example, given some data points (i.e. values of the variable we measured), we create the histogram by plotting the data points against their corresponding frequency of occurrence in our random sample. We then draw the distribution curve by connecting the midpoints of the bars of the histogram. So the important point to remember here is that there are bars sitting below that curve, and the process of drawing the distribution curve is – numbers -> bars -> curve.
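To make the numbers -> bars -> curve idea concrete, here is a minimal Python sketch (not taken from the post; the sample data and bin count are made-up assumptions) that plots a histogram and then connects the midpoints of its bars:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical sample: 1,000 data points drawn from a normal distribution
np.random.seed(42)
data = np.random.normal(loc=50, scale=10, size=1000)

# Plot the histogram: data points vs. their frequency of occurrence
counts, bin_edges, _ = plt.hist(data, bins=20, edgecolor="black", alpha=0.6)

# Connect the midpoints of the bars to draw the distribution curve
midpoints = (bin_edges[:-1] + bin_edges[1:]) / 2
plt.plot(midpoints, counts, color="red", marker="o")

plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Histogram and Distribution Curve")
plt.show()
```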

Continue reading

Statistics – Understanding the Levels of Measurement

One of the most important and basic steps in learning statistics is understanding the levels of measurement for variables. Let's take a step back and first look at what a variable is. A variable is any quantity that can be measured and whose value varies across the population. For example, if we consider a population of students, the student's nationality, marks, grades, etc. are all variables defined for the entity student, and their corresponding values will differ for each student. Looking at the larger picture, if we want to compute the average salary of US citizens, we can go out and record the salary of each and every person to compute the average, or choose a random sample from the entire population, compute the average salary for that sample, and then use statistical tests to derive conclusions about the wider population.
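As a quick, hypothetical illustration of the sampling idea (the salary figures below are invented, not real data), we can estimate the population average from a random sample:

```python
import random

# Hypothetical population of salaries (illustrative numbers only)
population = [random.gauss(60_000, 15_000) for _ in range(100_000)]

# Instead of surveying everyone, draw a random sample...
sample = random.sample(population, k=1_000)

# ...and use the sample mean as an estimate of the population average
sample_mean = sum(sample) / len(sample)
print(f"Estimated average salary: {sample_mean:,.2f}")
```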

The type of statistical test that can be used to derive a conclusion about the wider population depends upon the level of measurement of the variable under consideration. The level of measurement of a variable is simply the mathematical nature of the variable, or how it is measured.

Broadly, there are four levels of measurement for variables –

Continue reading

Comparing Real Time Analytics and Batch Processing Applications with Hadoop MapReduce and Spark

Apache Spark is an engine for fast, large-scale data processing. It claims to run programs up to 100x faster than Hadoop MapReduce in memory, and 10x faster on disk. The introduction of the Hadoop MapReduce framework greatly simplified the problem of big data management and analysis in a cost-efficient way. With the help of commodity hardware, we can apply several algorithms to large volumes of data. But MapReduce falls short when implementing complex, multi-stage algorithms. In this article, we dig deeper to understand why Apache Spark upstages the Hadoop MapReduce framework.

Unified Architecture

The rise of big data mandated the development of sophisticated tools that run faster and are easy to use. We need such tools for a variety of applications, such as interactive query processing, ad-hoc queries on real-time streaming data, and sophisticated processing of historical data for better decision making.

Continue reading

Analyzing Wikipedia Text with pySpark

Spark improves usability by offering a rich set of APIs and making it easy for developers to write code. Programs in Spark are typically 5x smaller than their MapReduce equivalents. The Spark Python API (PySpark) exposes the Spark programming model to Python. To learn the basics of Spark, read through the Scala programming guide; it should be easy to follow even if you don't know Scala. PySpark provides an easy-to-use programming abstraction and parallel runtime; we can think of it as – "Here's an operation, run it on all of the data".

To use Spark, developers write a driver program that implements the high-level control flow of their application and launches various operations in parallel on the nodes of the cluster.

The typical life cycle of a Spark program, illustrated in the sketch after this list, is –

  • Create RDDs from some external data source or parallelize a collection in your driver program.
  • Lazily transform the base RDDs into new RDDs using transformations.
  • Cache some of those RDDs for future reuse.
  • Perform actions to execute parallel computation and to produce results.
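
A minimal PySpark sketch of this life cycle (the input file name and word-count logic are illustrative assumptions, not code from the post) might look like this:

```python
from pyspark import SparkContext

sc = SparkContext(appName="WikipediaWordCount")

# 1. Create a base RDD from an external data source (hypothetical file path)
lines = sc.textFile("wikipedia_sample.txt")

# 2. Lazily transform the base RDD into new RDDs
words = lines.flatMap(lambda line: line.lower().split())
pairs = words.map(lambda word: (word, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)

# 3. Cache an RDD that will be reused
counts.cache()

# 4. Perform actions to trigger the parallel computation and produce results
print(counts.count())                                 # number of distinct words
print(counts.takeOrdered(10, key=lambda kv: -kv[1]))  # ten most frequent words

sc.stop()
```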

Continue reading

Introduction to Big Data with Apache Spark (Part-2)

In part-1 of this series, we saw a brief overview of Apache Spark, the Resilient Distributed Dataset (RDD), and the Spark ecosystem. In this article, we will take a closer look at Spark's primary, fault-tolerant memory abstraction for in-memory cluster computing: the Resilient Distributed Dataset (RDD).

Motivation

One of the most popular parallel data processing paradigms, MapReduce, and its variants have been highly successful in implementing large-scale data-intensive applications on commodity clusters. However, most of these systems are built around an acyclic data flow model that is not suited to efficiently running complex, iterative machine learning and graph processing algorithms, or interactive and ad-hoc queries. All of these workloads need one thing that MapReduce lacks: efficient primitives for data sharing. In MapReduce, data is shared across different jobs (or different stages of a single job) through stable storage. As discussed in the previous article, MapReduce stores intermediate results on disk, so the reads and writes are very slow. Also, existing storage abstractions rely on data replication or update-log replication for fault tolerance, which is considerably costly for data-intensive applications.
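To make the data-sharing point concrete, here is a rough PySpark sketch (illustrative only; the file name and filter conditions are assumptions) of the pattern RDDs enable: load a dataset once, cache it in memory, and reuse it across several computations instead of re-reading it from stable storage for every job:

```python
from pyspark import SparkContext

sc = SparkContext(appName="DataSharingExample")

# Load a (hypothetical) log file once and keep it in memory
logs = sc.textFile("app_logs.txt").cache()

# Several computations reuse the same in-memory dataset; with MapReduce,
# each of these would be a separate job re-reading stable storage
errors = logs.filter(lambda line: "ERROR" in line).count()
warnings = logs.filter(lambda line: "WARN" in line).count()
print(errors, warnings)

sc.stop()
```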

Continue reading