Exploratory Data Analysis Using Python
Spread the love

Loading

Here’s a good explanation of exploratory data analysis:

“Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, to spot anomalies, to test hypotheses and to check assumptions with the help of summary statistics and graphical representations.”

Before making inferences from data it is essential to examine all your variables.

Listening to data is important for many reasons, including :

  • To catch mistakes
  • To see patterns in the data
  • To find violations of statistical assumptions
  • To generate hypotheses

If you don’t listen to your data, you will have trouble later.

Types of Data

First of all you need to understand, the various types of data that you have in your dataset. Understanding the types of data involved is key to analyzing these and being able to process these for further use.

Nominal

ID numbers, Names of people

Categorical

Eye Color, Zip Codes

Ordinal

Rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}

Interval

Calendar dates, temperatures in Celsius or Fahrenheit, GRE(Graduate Record Examination) and IQ scores

Ratio

Temperature in Kelvin, length, time, counts

Figure : Understanding types of data

Here’s the categorization of data/measurements as qualitative or quantitative :-

Figure : Recognizing qualitative and quantitative data

Dimensionality of Data

The next important step is understanding the dimensionality of the dataset.

Understanding dimensionality of the data
Figure : Understanding dimensionality of the data

What Are the Key Properties of a Dataset?

Prior to performing any type of statistical analysis, understanding the nature of the data being analyzed is essential.

You can use EDA to identify the properties of a dataset to determine the most appropriate statistical methods to apply to the data.

You can investigate several types of properties with EDA techniques, including the following:-

  • The center of the data
  • The spread among the members of the data
  • The skewness of the data
  • The probability distribution of the data follows
  • The correlation among the elements in the dataset
  • Whether or not the parameters of the data are constant over time
  • The presence of outliers in the data

Visualization Techniques for Exploratory Data Analysis

1). Correlation Matrix Plots

2). Density Plots

3). Scatter Plots

4). Bubble Plots

Correlation Matrix Plots

If you have a data set with many columns, a good way to quickly check correlations among columns is by visualizing the correlation matrix as a heatmap.

To use linear regression for modelling, its necessary to remove correlated variables to improve your model.  One can find correlations using pandas “.corr()” function and can visualize the correlation matrix using a heatmap in seaborn.

sns.heatmap(df.corr(),cmap=’viridis’, annot=False)

Dark shades represents positive correlation while lighter shades represents negative correlation.

If you set annot=True, you’ll get values by which features are correlated to each other in grid-cells.

1). Here we can infer that “density” has strong positive correlation with “residual sugar” whereas it has strong negative correlation with “alcohol”.

2). “free sulphur dioxide” and “citric acid” has almost no correlation with “quality”.

3). Since correlation is zero we can infer there is no linear relationship between these two predictors. However it is safe to drop these features in case you’re applying Linear Regression model to the dataset.

Heatmap for understanding correlation between features
Figure : Heatmap for understanding correlation between features

Density Plots

Density plots are another way of getting a quick idea of the distribution of each attribute. The plots look like an abstracted histogram with a smooth curve drawn through the top of each bin, much like your eye tried to do with the histograms.

# Univariate Density Plots

import matplotlib.pyplot as plt

import pandas

url = “https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv

names = [‘preg’, ‘plas’, ‘pres’, ‘skin’, ‘test’, ‘mass’, ‘pedi’, ‘age’, ‘class’]

data = pandas.read_csv(url, names=names)

data.plot(kind=’density’, subplots=True, layout=(3,3), sharex=False)

plt.show()

We can see the distribution for each attribute is clearer with the histograms.

Scatter Plots

Another common visualization techniques is a scatter plot that is a two-dimensional plot representing the joint variation of two data items. Each marker (symbols such as dots, squares and plus signs) represents an observation. The marker position indicates the value for each observation. When you assign more than two measures, a scatter plot matrix is produced that is a series of scatter plots displaying every possible pairing of the measures that are assigned to the visualization. Scatter plots are used for examining the relationship, or correlations, between X and Y variables.

Despite their simplicity, scatter plots are a powerful tool for visualizing data. There’s a lot of options, flexibility, and representational power that comes with the simple change of a few parameters like color, size, shape, and regression plotting.

Regression Plotting

When we first plot our data on a scatter plot it already gives us a nice quick overview of our data. In the far left figure below, we can already see the groups where most of the data seems to bunch up and can quickly pick out the outliers.

But it’s also nice to be able to see how complicated our task might get; we can do that with regression plotting. In the middle figure below we’ve done a linear plot. It’s pretty easy to see that a linear function won’t work as many of the points are pretty far away from the line. The far-right feature uses a polynomial of order 4 and looks much more promising. So it looks like we’ll definitely need something of at least order 4 to model this dataset.

Figure : Scatter plot for studying relationship between features and picking out outliers
Figure : Scatter plot with regression lines
Figure : Scatter plot with polynomial regression fit

Code

import seaborn as sns

import matplotlib.pyplot as plt

df = sns.load_dataset(‘iris’)

# A regular scatter plot

sns.regplot(x=df[“sepal_length”], y=df[“sepal_width”], fit_reg=False)

plt.show()

# A scatter plot with a linear regression fit:

sns.regplot(x=df[“sepal_length”], y=df[“sepal_width”], fit_reg=True)

plt.show()

# A scatter plot with a polynomial regression fit:

sns.regplot(x=df[“sepal_length”], y=df[“sepal_width”], fit_reg=True, order=4)

plt.show()

Marginal Histogram

Scatter plots with marginal histograms are those which have plotted histograms on the top and side, representing the distribution of the points for the features along the x- and y- axes. It’s a small addition but great for seeing the exact distribution of our points and more accurately identify our outliers.

For example, in the figure below we can see that the why axis has a very heavy concentration of points around 3.0. Just how concentrated? That’s most easily seen in the histogram on the far right, which shows that there is at least triple as many points around 3.0 as there are for any other discrete range. We also see that there’s barely any points above 3.75 in comparison to other ranges. For the x-axis on the other hand, things are a bit more evened out, except for the outliers on the far right.

Figure : Scatter plot with marginal histogram

Code

import seaborn as sns

import matplotlib.pyplot as plt

df = sns.load_dataset(‘iris’)

sns.jointplot(x=df[“sepal_length”], y=df[“sepal_width”], kind=’scatter’)

plt.show()

References

Scatterplot Matrix

A scatterplot shows the relationship between two variables as dots in two dimensions, one axis for each attribute. You can create a scatterplot for each pair of attributes in your data. Drawing all these scatterplots together is called a scatterplot matrix.

Scatter plots are useful for spotting structured relationships between variables, like whether you could summarize the relationship between two variables with a line. Attributes with structured relationships may also be correlated and good candidates for removal from your dataset.

# Scatterplot Matrix

import matplotlib.pyplot as plt

import pandas

from pandas.plotting import scatter_matrix

url = “https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv

names = [‘preg’, ‘plas’, ‘pres’, ‘skin’, ‘test’, ‘mass’, ‘pedi’, ‘age’, ‘class’]

data = pandas.read_csv(url, names=names)

scatter_matrix(data)

plt.show()

Like the Correlation Matrix Plot, the scatterplot matrix is symmetrical. This is useful to look at the pair-wise relationships from different perspectives. Because there is little point oi drawing a scatterplot of each variable with itself, the diagonal shows histograms of each attribute.

Scatter matrix
Figure : Scatter matrix

Bubble Plots

With bubble plots we are able to use several variables to encode information. The new one we will add here is size. In the figure below we are plotting the number of french fries eaten by each person vs their height and weight. Notice that a scatter plot is only a 2D visualization tool, but that using different attributes we can represent 3-dimensional information.

Here we are using color, position, and size. The position determines the person’s height and weight, the color determines the gender, and the size determines the number of french fries eaten! The bubble plot lets us conveniently combine all of the attributes into one plot so that we can see the high-dimensional information in a simple 2D view; nothing crazy complicated.

Three dimensional correlational analysis between height, weight and sex
Figure : Three dimensional correlational analysis between height, weight and sex

Word Clouds and Network Diagrams for Unstructured Data

The variety of big data brings challenges because semi-structured and unstructured data require new visualization techniques. A word cloud visual represents the frequency of a word within a body of text with its relative size in the cloud. This technique is used on unstructured data as a way to display high- or low-frequency words.

World Cloud
Figure : World Cloud

Another visualization technique that can be used for semistructured or unstructured data is the network diagram. Network diagrams represent relationships as nodes (individual actors within the network) and ties (relationships between the individuals). They are used in many applications, for example for analysis of social networks or mapping product sales across geographic areas.

Network diagram for understanding relationship between variables
Figure : Network diagram for understanding relationship between variables

References

https://www.kdnuggets.com/2019/04/best-data-visualization-techniques.html

By Hassan Amin

Dr. Syed Hassan Amin has done Ph.D. in Computer Science from Imperial College London, United Kingdom and MS in Computer System Engineering from GIKI, Pakistan. During PhD, he has worked on Image Processing, Computer Vision, and Machine Learning. He has done research and development in many areas including Urdu and local language Optical Character Recognition, Retail Analysis, Affiliate Marketing, Fraud Prediction, 3D reconstruction of face images from 2D images, and Retinal Image analysis in addition to other areas.