Statistics – Understanding Basic Concepts and Dispersion

In the last post titled, Statistics – Understanding the Levels of Measurement, we have seen what variables are, and how do we measure them based on the different levels of measurement. In this post, we will talk about some of the basic concepts that are important to get started with statistics and then dive deep into the concept of dispersion.

fig: Histogram and Distribution Curve


A histogram is a graphical representation of the distribution of numerical data. We know the basic bar graph, but in a histogram, all the bars involved are connected or they touch each other – meaning that there is no gap between the points. Eg: Consider we have some data points (i.e values of the variable that we measured), we create this histogram by plotting the data points against their corresponding frequency of occurrence in our random sample. We then draw the distribution curve by connecting the midpoints of the bars in the histogram. So the important point to remember here is that there are bars sitting below that curve, and the process of drawing the distribution curve is – numbers -> bars -> curve.

Continue reading

Statistics – Understanding the Levels of Measurement

One of the most important and basic step in learning Statistics is understanding the levels of measurement for the variables. Let’s take a step back and first look at what a variable is? A variable is any quantity that can be measured and whose value varies through the population. For example, if we consider a population of students, the student’s nationality, marks, grades, etc are all the variables defined for the entity student, and their corresponding value will differ for each student. Looking at the larger picture, if we want to compute the average salary of the US citizens, we can go out and record the salary of each and every person to compute the average or choose a random sample from the entire population and compute the average salary for that sample, and then use the statistical tests to derive conclusions for a wider population.

The type of statistical test that can be used to derive a conclusion about the wider population depends upon the level of measurement of the variable under consideration. The level of measurement of a variable is nothing but the mathematical nature of a variable or, how a variable is measured.

Broadly, there are 4 levels of measurement for the variables –

Continue reading

Analyzing Wikipedia Text with pySpark

Spark improves usability by offering a rich set of APIs and making it easy for developers to write code. Programs in spark are 5x smaller than MapReduce. The Spark Python API (PySpark) exposes the Spark programming model to Python. To learn the basics of Spark, read through the Scala programming guide; it should be easy to follow even if you don’t know Scala. pySpark provides an easy-to-use programming abstraction and parallel runtime, we can think of it as – “Here’s an operation, run it on all of the data”.

To use Spark, developers write a driver program that implements the high-level control flow of their application and launches various operations in parallel on the nodes of the cluster.

The typical life cycle of a Spark program is –

  • Create RDDs from some external data source or parallelize a collection in your driver program.
  • Lazily transform the base RDDs into new RDDs using transformations.
  • Cache some of those RDDs for future reuse.
  • Perform actions to execute parallel computation and to produce results.

Continue reading

Introduction to Big Data with Apache Spark (Part-1)

With the advent of new technologies, there has been an increase in the number of data sources. Web server logs, machine log files, user activity on social media, recording a user’s clicks on the website and many other data sources have caused an exponential growth of data. Individually this content may not be very large, but when taken across billions of users, it produces terabytes or petabytes of data. For example, Facebook is collecting 500 terabytes(TB) of data everyday with more than 950 million users. Such a massive amount of data which is not only structured but also unstructured and semi-structured  is considered under the roof known as Big Data.

Big data is of more importance today, because in past we collected a lot of data and built models to predict the future, called forecasting, but now we collect data and build models to predict what is happening now, called nowcasting. So a phenomenal amount of data is collected, but only a tiny amount is ever analysed. The term Data Science means deriving knowledge from big data, efficiently and intelligently.

The common tasks involved in data science are :

  1. Dig data to find useful data to analyse
  2. Clean and prepare that data
  3. Define a model
  4. Evaluate the model
  5. Repeat until we get statistically good results, and hence a good model
  6. Use this model for large scale data processing

Continue reading

Working with DB2 Database Grants

Being a database administrator and developer, one often needs to restore databases from one database server to another. In DB2, when a database is created, by default, the DB2 instance owner becomes the owner of the database. So when you restore a database backup across different database servers where both, the source and the target database servers have the same instance owner names, it works fine. But when the source and target database servers have different instance owners, the database gets restored successfully, but there can be privilege/authorization issues when you try to access the database objects. The most common being, SQL0551N and SQL1092N.

This problem can be solved by creating the source database server’s instance owner on target database server, connecting to the database through it and then granting the DBADM privileges to the new instance owner. But there is a more elegant and simpler way to do this using the DB2_RESTORE_GRANT_ADMIN_AUTHORITIES DB2 registry variable.

Continue reading

IDUG NA 2015 : Lessons from a First Time Speaker

Finally, there it was, the most exciting and awaited week of the year. I got a chance to live my dream, to be at the International DB2 User Group Tech Conference (IDUG), Philadelphia. I must confess, Its not about being at a conference, its about the whole new experience with which I will live ever after”. Being a first time attendee and speaker, the blogs on IDUG experience from Ember Crooks (at and Pavan Kristipati (at really helped me. Also, slide-decks from the past IDUG conferences served a good reference and a starting point. I reached out for tips to the people I admire from the DB2 Community – Mike Krafick, Ember Crooks and Pavan Kristipati. They happily extended their helping hand towards me and Mike actually reviewed my presentation and helped me to improve it. Thank you Mike! Here is a summary of tips that these experts offered – I hope it will be of help to the first time attendees and speakers.

Continue reading

Federated Database Performance (Part-2)

In part-1 of Federated Database Performance, we discussed what federation is, how it works and a few tips at the query level to improve the federated database performance.

In part-2, we will discuss few configuration parameters that will help us in performance tuning of the federated systems.

The performance of the query which includes data access on the remote databases mainly depends on the PUSHDOWN mechanism. Almost all parameters that IBM provides for improving the performance of the SHIP operation is dependent on this mechanism.

PUSHDOWN Mechanism:

Using pushdown mechanism, DB2 optimizer tries to push as many operations as possible to the remote database server so that filtering of data can be done at the remote server and only the required data is transferred to the federated server. So using pushdown, we can filter the data remotely instead of bringing all data to the local DB2 server and then perform filtering. This will improve the performance of the database system by reducing the network traffic and data transfer cost and time.

SYSCAT.SERVEROPTIONS view has some interesting parameters which can be helpful in federated environments. SYSCAT.SERVEROPTIONS view catalogs the federated server settings for each remote data source. Each row specifies the server-specific options and corresponding values.

Few basic and important parameters are discussed below.

Continue reading