Python

Learn Data Analysis with Python: Find out the Practical Code for Data Cleaning

Step by step Python Code for data cleaning ( missing values, outlier treatment, finding duplicates)

Image for post
Image from Unsplash

Introduction

If we want to apply for any data analyst or data scientist role, it is necessary to know one of the programming languages used for such roles. It could be R or Python or Scala etc. To fulfill this, I have selected Python for data analysis.

If you want to check the practical code data loading step from different sources. Please check the below link.

Learn Data Analysis with Python: Find out the practical code for data loading and saving.

The data preparation step is the most important part to win the battle of a data analysis project. This document will have information about how data cleaning ( missing values, outliers, duplicates )is possible with Python.

Raw data is full of impurity like outliers, missing values, duplicates, etc. To clean this data means, it needs to be logical, significant and regulated.

Missing Data

This is one of the most common issues. To solve these, there are many methods. Let us find out some practical codes.

Image for post
Image by Author (graderecord.csv)

This is a CSV file with two missing values in the Grade column. Let us now find out practical codes to process such types of missing information.

import pandas as pd
df = pd.read_csv(“graderecord.csv”)
df.head(10)

Code: Drop Rows with Missing values

df_no_missing = df.dropna()
df_no_missing

Code: Replace Empty rows with 0

df.fillna(0)

Code: Replace Empty rows with Mean or Median or Mode of the column

df[“Grade”].fillna(df[“Grade”].mean(), inplace=True)
df[“Grade”].fillna(df[“Grade”].median(), inplace=True)
df[“Grade”].fillna(df[“Grade”].mode(), inplace=True)

Code: Selecting Rows with No Missing values

df[df[‘Grade’].notnull()]

Code: If a column has all empty values. drop the completely empty column

# Add a Column with Empty Values
import numpy as np
df[‘newcol’] = np.nan
df.head()
# Drop Empty Column
df.dropna(axis=1, how=”all”)

Outlier Treatment

In this blog, I am considering two ways of outlier treatment.

Method 1: If data is normally distributed, then we follow the standardization method. Considering, confidence interval of 95% that means z-score is 1.96 for 5% alpha value. In conclusion, 95% of data is distributed within 1.96 standard deviations of the mean. So we can drop the value below or above this range.

Image for post
Image by Author (gradedata.csv)

import pandas as pd
df = pd.read_csv(“gradedata.csv”)
df

meangrade = df[‘grade’].mean()
stdgrade = df[‘grade’].std()
higherrange = meangrade + stdgrade * 1.96
lowerrange = meangrade — stdgrade * 1.96
df = df.drop(df[df[‘grade’] > higherrange].index)
df = df.drop(df[df[‘grade’] < lowerrange].index)
df

Method 2: In this method, we use the interquartile range (IQR). IQR is the difference between 25% (Q1 ) of the quantile and 75% (Q3) of the quantile. Any value lower than Q1–1.5*IQR or greater than Q3–1.5*IQR is considered an outlier.

q1 = df[‘grade’].quantile(.25)
q3 = df[‘grade’].quantile(.75)
iqr = q3-q1
highrange = q3 + iqr * 1.5
lowrange = q1 — iqr * 1.5
df = df.drop(df[df[‘grade’] > highrange].index)
df = df.drop(df[df[‘grade’] < lowrange].index)
df

Finding Duplicates

Using Python libraries, we can identify duplicate rows as well as unique rows of a data set.

# Creating Dataset with Duplicates
import pandas as pd
Emp = [‘Jane’,’Johny’,’Boby’,’Jane’,’Mary’,’Jony’,’Melica’,’Melica’]
Salary = [9500,7800,7600,9500,7700,7800,9900,10000]
SalaryList = zip(Emp,Salary)
df = pd.DataFrame(data = SalaryList,columns=[‘Emp’, ‘Salary’])
df

# Displaying Only Duplicates in the Dataframe
df.duplicated()

# Displaying Dataset without Duplicates
df.drop_duplicates()

# Drop Rows with Duplicate Emp, Keeping the Last Observation
df.drop_duplicates([‘Emp’], keep=”last”)

Conclusion:

In this blog, we learn how to do Python coding for data cleaning purposes. If you have any questions, please post them in the comment section.

You can check my other blogs in Medium about Python and my YouTube Channel.

2 thoughts on “Learn Data Analysis with Python: Find out the Practical Code for Data Cleaning

Leave a comment