Learning Exploratory Data Analysis Using Redwine Dataset

Spread the love

RedWine Dataset

Redwine is a simple and clean practice dataset for data science problems using regression or classification modelling. The classes are ordered and not balanced (e.g. there are much more normal wines than excellent or poor ones).

Input variables or features in redwine dataset (based on physicochemical tests) are :-

1 – fixed acidity

2 – volatile acidity

3 – citric acid

4 – residual sugar

5 – chlorides

6 – free sulfur dioxide

7 – total sulfur dioxide

8 – density

9 – pH

10 – sulphates

11 – alcohol

Output variable (based on sensory data) :-

12 – quality (score between 0 and 10)

Step 1 Examine Data

Data will be typically examined by first looking at a few rows, finding dimensions of the data, looking at data types and names of the columns,

Use the following python commands to examine data :-

df = pd.read_csv(‘../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv’)

df.head()

Figure : Getting started with exploring a new dataset, in this case lets examine redwine dataset

Find total number of rows and columns in the dataset using shape attribute

It is also a good practice to know the columns and their corresponding data types, along with finding whether they contain null values or not.

Figure : Examine columns and their data types in redwine dataset

Step 2 Descriptive Statistics

Descriptive statistics can give you great insight into the shape of each attribute.

Often you can create more summaries than you have time to review. The describe() function on the Pandas DataFrame lists 8 statistical properties of each attribute:

a). Count

b). Mean

c). Standard Deviation

d). Minimum Value

e). 25th Percentile

f). 50th Percentile (Median)

g). 75th Percentile

h). Maximum Value

Figure : Analyzing descriptive statistics of redwine dataset to find anomalies and usefulness of different features

There is notably a large difference between 75th %tile and max values of predictors “residual sugar”,”free sulfur dioxide”,”total sulfur dioxide”. Thus observation suggests that there are extreme values-Outliers in our data set.

Step 3 Understanding Target Variable

Few key insights just by looking at dependent variable. Target variable/Dependent variable is discrete and categorical in nature.

“quality” score scale ranges from 1 to 10;where 1 being poor and 10 being the best.

df.quality.unique()

df.quality.value_counts()

Step 4 Checking Correlation

To use linear regression for modelling, its necessary to remove correlated variables to improve your model. One can find correlations using pandas “.corr()” function and can visualize the correlation matrix using a heatmap in seaborn.

sns.heatmap(df.corr(),cmap=’viridis’, annot=False)

Dark shades represents positive correlation while lighter shades represents negative correlation.

If you set annot=True, you’ll get values by which features are correlated to each other in grid-cells.

1). Here we can infer that “density” has strong positive correlation with “residual sugar” whereas it has strong negative correlation with “alcohol”.

2). “free sulphur dioxide” and “citric acid” has almost no correlation with “quality”.

3). Since correlation is zero we can infer there is no linear relationship between these two predictors. However it is safe to drop these features in case you’re applying Linear Regression model to the dataset.

Figure : Analyzing correlation between features and target variable(quality) in redwine dataset

Step 5 Checking Outliers

This is a commonly overlooked mistake we tend to make. The temptation is to start building models on the data you’ve been given. But that’s essentially setting yourself up for failure.

Data exploration consists of many things, such as variable identification, treating missing values, feature engineering, etc.

Detecting and treating outliers is also a major cog in the data exploration stage.

The quality of your inputs decide the quality of your output!

l = df.columns.values

number_of_columns=12

number_of_rows = len(l)-1/number_of_columns

plt.figure(figsize=(number_of_columns,5*number_of_rows))

for i in range(0,len(l)):

plt.subplot(number_of_rows + 1,number_of_columns,i+1)

sns.set_style(‘whitegrid’)

sns.boxplot(df[l[i]],color=’green’,orient=’v’)

plt.tight_layout()

Figure : Using boxplot to find outliers in redwine dataset

Step 6 Skewness and Kurtosis

plt.figure(figsize=(2*number_of_columns,5*number_of_rows))

for i in range(0,len(l)):

plt.subplot(number_of_rows + 1,number_of_columns,i+1)

sns.distplot(df[l[i]],kde=True)

print(“Skewness \n “,df.skew())

print(“\n Kurtosis \n “, df.kurt())

Figure : Analyzing skewness and kurtosis in Seaborn dataset

Learning Exploratory Data Analysis Using Redwine Dataset

ByHassan Amin

RedWine Dataset

Step 1 Examine Data

Step 2 Descriptive Statistics

Step 3 Understanding Target Variable

Step 4 Checking Correlation

Step 5 Checking Outliers

Step 6 Skewness and Kurtosis

Other Interesting Articles

Related

By Hassan Amin

Related Post

Understanding and Developing Data Strategy and Monetization

Combating AI Fear

Tracking Recent LLM Model Announcements

Building a Personal Brand

Book Summary : Think and Grow Rich by Napolean Hill

Book Summary : Thinking, Fast and Slow by Nobel Laureate Daniel Kahneman

A Guide to Good and Bad Habits for Teens

Setting Up PostGIS Extension On Greenplum 6 In Ubuntu 18.04

ChatGPT is a Turning Point

Building Conversational Chatbot with GPT3

Success of Technology Companies Depends on Relationship between Technical Leadership and Management

You missed

Building a Personal Brand

Book Summary : Think and Grow Rich by Napolean Hill

Book Summary : Thinking, Fast and Slow by Nobel Laureate Daniel Kahneman

A Guide to Good and Bad Habits for Teens