Data Analysis: Cleaning Data with Jupyter, Part 1: Remove Outliers

The second step of data analysis is cleaning the data. This process is really important because it allows us to improve our data quality and, in doing so, increases overall productivity. When you clean your data, outdated or incorrect information is removed, leaving you with higher-quality information.

The whole process of cleaning the data includes the following steps:

  • Remove Outliers
  • Remove inappropriate values
  • Remove duplicates
  • Remove whitespaces

In this tutorial we will see how easy and fast it is to remove outliers with Jupyter. Let's start!

An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. Outliers are often bad data that we need to delete from our dataset. However, outliers should be investigated carefully before they are removed, because they often contain valuable information about the process under investigation or about the data gathering and recording process.

There are many tools and methods for finding outliers. In this article we will see how to use the standard deviation method and the interquartile range (IQR) method:

  • Standard Deviations: If the data is normally distributed, then about 95 percent of the data lies within 1.96 standard deviations of the mean. So we can treat values more than 1.96 standard deviations above or below the mean as outliers and remove them.
  • Interquartile Range (IQR): The IQR is the difference between the 75th percentile (Q3) and the 25th percentile (Q1), that is, IQR = Q3 - Q1. Any values lower than Q1 - 1.5 × IQR or greater than Q3 + 1.5 × IQR are treated as outliers and removed.
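
As a quick check of the two rules above, the following sketch uses Python's standard-library `statistics.NormalDist` to compute how much of a normal distribution falls inside each pair of limits. It is only an illustration under the assumption of perfectly normal data; the helper name `coverage` is made up for this example.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
normal = NormalDist(mu=0.0, sigma=1.0)


def coverage(lower, upper):
    """Fraction of the distribution that lies between lower and upper."""
    return normal.cdf(upper) - normal.cdf(lower)


# Standard deviation rule: mean +/- 1.96 standard deviations.
sd_coverage = coverage(-1.96, 1.96)

# IQR rule: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR.
q1 = normal.inv_cdf(0.25)
q3 = normal.inv_cdf(0.75)
iqr = q3 - q1
iqr_coverage = coverage(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print(f"Within 1.96 standard deviations: {sd_coverage:.4f}")
print(f"Within the IQR limits: {iqr_coverage:.4f}")
```

On normally distributed data, the 1.96 rule therefore flags roughly 5 percent of the values even when nothing is wrong with them, while the IQR rule flags less than 1 percent. This is one more reason to inspect flagged values before deleting them.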

Remove Outliers with Standard Deviations

The procedure is: first we import the data, then we calculate the standard deviation and the upper and lower limits. In my case, 'grade' is the column I am working on.
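
A minimal sketch of these steps with pandas could look like the following. The sample values and the file name `grades.csv` in the comment are assumptions made for illustration; only the column name 'grade' comes from the text above.

```python
import pandas as pd

# Hypothetical sample data standing in for the imported dataset. In a real
# notebook this would be something like: df = pd.read_csv("grades.csv")
# (the file name is an assumption).
df = pd.DataFrame({"grade": [70, 72, 68, 75, 71, 69, 73, 74, 70, 72, 71, 150]})

# Mean and sample standard deviation of the column.
mean = df["grade"].mean()
std = df["grade"].std()

# Lower and upper limits: 1.96 standard deviations around the mean.
lower_limit = mean - 1.96 * std
upper_limit = mean + 1.96 * std

# Keep only the rows whose grade lies inside the limits (inclusive).
df_clean = df[df["grade"].between(lower_limit, upper_limit)]

print(f"Limits: {lower_limit:.2f} to {upper_limit:.2f}")
print(f"Rows before: {len(df)}, rows after: {len(df_clean)}")
```

Note that the outlier itself inflates the mean and the standard deviation, so on small datasets this method can miss less extreme outliers.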

Remove Outliers with the Interquartile Range (IQR)

The procedure is: first we import the data, then we calculate the IQR and the upper and lower limits. In my case, 'grade' is the column I am working on.
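
The IQR variant can be sketched in the same way. As before, the sample values and the file name `grades.csv` in the comment are hypothetical; the quartiles use the default linear interpolation of pandas' `quantile`.

```python
import pandas as pd

# Hypothetical sample data standing in for the imported dataset. In a real
# notebook this would be something like: df = pd.read_csv("grades.csv")
# (the file name is an assumption).
df = pd.DataFrame({"grade": [70, 72, 68, 75, 71, 69, 73, 74, 70, 72, 71, 150]})

# First and third quartiles, and the interquartile range between them.
q1 = df["grade"].quantile(0.25)
q3 = df["grade"].quantile(0.75)
iqr = q3 - q1

# Lower and upper limits: 1.5 * IQR beyond the quartiles.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Keep only the rows whose grade lies inside the limits (inclusive).
inside = (df["grade"] >= lower_limit) & (df["grade"] <= upper_limit)
df_clean = df[inside]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"Limits: {lower_limit} to {upper_limit}")
print(f"Rows before: {len(df)}, rows after: {len(df_clean)}")
```

Because quartiles are barely affected by a single extreme value, the limits here are much tighter than with the standard deviation method on the same data.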

Now our dataset is free of outliers!
