Outlier Detection & Data Cleaning in Python - Comprehensive Guide

Apr 2, 2023

3 min read



Outlier-detection techniques are crucial for cleansing data and eliminating noise


In this guide, we use Python, the Pandas library, and the yfinance library to import and analyze financial data. We demonstrate how to clean data manually, standardize it with the StandardScaler, apply robust scaling with the RobustScaler, and normalize it using quantile filtering. These techniques improve the quality of data analysis by eliminating extreme values and focusing on more representative data points.


Step 1: Import the libraries. Start by importing yfinance, pandas, numpy, and matplotlib:


import yfinance as yf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Step 2: Download the data with yfinance. Fetch daily prices for WTI crude futures (CL=F) and the Thai oil equity PTTEP (PTTEP.BK), select the adjusted close so we get one price column per ticker, and store the result in a dataframe:



start = '2018-01-01'
end = '2021-01-01'
# CL=F -> WTI crude oil futures; PTTEP.BK -> PTTEP, a Thai oil-and-gas equity,
# so we can compare the oil price with an oil equity
df = yf.download('CL=F PTTEP.BK', start=start, end=end)['Adj Close']
df.dropna(inplace=True)  # drop sessions where either series is missing
df

Here’s what the dataframe should look like:


Date          CL=F        PTTEP.BK
2018-01-03    61.630001   80.119003
2018-01-04    62.009998   83.671089
2018-01-05    61.439999   84.460434
2018-01-08    61.730000   83.276405
2018-01-09    62.959999   84.460434
...           ...         ...
2020-12-23    48.119999   86.777565
2020-12-24    48.230000   87.233093
2020-12-28    47.619999   86.549805
2020-12-29    48.000000   89.055191
2020-12-30    48.400002   89.510719

709 rows × 2 columns



df.loc['2020-04-15':'2020-04-24']

Date          CL=F        PTTEP.BK
2020-04-15    19.870001   70.167038
2020-04-16    19.870001   66.356049
2020-04-17    18.270000   68.597801
2020-04-20   -37.630001   70.839561
2020-04-21    10.010000   68.373619
2020-04-22    13.780000   67.028572
2020-04-23    16.500000   68.149452
2020-04-24    16.940001   68.373619

Note the row for 2020-04-20: WTI futures famously settled at -$37.63 that day, a textbook outlier.


df.pct_change().plot()

[Figure: line plot of the daily percentage changes of CL=F and PTTEP.BK]


df.pct_change().plot.kde()
plt.xlim(-3,3)

[Figure: KDE plot of the daily percentage changes, x-axis limited to (-3, 3)]


import seaborn as sns
sns.pairplot(df.pct_change())
plt.show()

[Figure: seaborn pairplot of the daily percentage changes]


Next, we demonstrate how to clean the outlier data manually with Pandas. First, compute the percentage change of the prices:


df_ret = df.pct_change()
df_ret.loc['2020-04-15':'2020-04-24']

Date          CL=F       PTTEP.BK
2020-04-15   -0.011934  -0.027950
2020-04-16    0.000000  -0.054313
2020-04-17   -0.080523   0.033784
2020-04-20   -3.059661   0.032680
2020-04-21   -1.266011  -0.034810
2020-04-22    0.376623  -0.019672
2020-04-23    0.197388   0.016722
2020-04-24    0.026667   0.003289

Then, we flag the outliers with a simple threshold: any day on which the oil return is below -1 or above 1, i.e. a move of more than 100%:


# flag days where the CL=F return is below -1 or above 1
# (a move of more than 100% in either direction)
df_ret.loc[(df_ret['CL=F'] < -1) | (df_ret['CL=F'] > 1)]

Date          CL=F       PTTEP.BK
2020-04-20   -3.059661   0.03268
2020-04-21   -1.266011  -0.03481

We remove these outliers by concatenating the data from before and after the removal window (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead):


start_remove = '2020-04-17'
end_remove = '2020-04-22'
# keep everything up to start_remove and from end_remove onwards,
# dropping the two extreme sessions in between
df_remove = pd.concat([df_ret.loc[:start_remove], df_ret.loc[end_remove:]])
df_remove

Date          CL=F       PTTEP.BK
2018-01-03    NaN        NaN
2018-01-04    0.006166   0.044335
2018-01-05   -0.009192   0.009434
2018-01-08    0.004720  -0.014019
2018-01-09    0.019925   0.014218
...           ...        ...
2020-12-23    0.023394  -0.023077
2020-12-24    0.002286   0.005249
2020-12-28   -0.012648  -0.007833
2020-12-29    0.007980   0.028947
2020-12-30    0.008333   0.005115

707 rows × 2 columns


df_remove.loc['2020-04-15':'2020-04-24']

Date          CL=F       PTTEP.BK
2020-04-15   -0.011934  -0.027950
2020-04-16    0.000000  -0.054313
2020-04-17   -0.080523   0.033784
2020-04-22    0.376623  -0.019672
2020-04-23    0.197388   0.016722
2020-04-24    0.026667   0.003289
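
An equivalent approach, as a minimal sketch, is to build a boolean mask over the index instead of concatenating two slices; it keeps the same boundary dates as the .loc slices above:


# boolean-mask alternative: keep rows on or before start_remove, or on or
# after end_remove (both boundary dates are kept, matching the slices above)
mask = (df_ret.index <= start_remove) | (df_ret.index >= end_remove)
df_remove_alt = df_ret[mask]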


df_remove.plot()

[Figure: line plot of the cleaned percentage changes]


After cleaning the data, we demonstrate different methods of data transformation. First, standardize the data with the StandardScaler from scikit-learn:


from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler

scaler = StandardScaler()
scaler.fit(df)
df_std = scaler.transform(df)
df_std = pd.DataFrame(df_std, columns=df.columns)
df_std

       CL=F       PTTEP.BK
0      0.589278  -1.210682
1      0.617857  -0.979414
2      0.574989  -0.928022
3      0.596799  -1.005111
4      0.689305  -0.928022
...    ...        ...
704   -0.426787  -0.777159
705   -0.418514  -0.747500
706   -0.464391  -0.791988
707   -0.435812  -0.628868
708   -0.405728  -0.599210

709 rows × 2 columns


df.mean()
    CL=F        53.794725
    PTTEP.BK    98.714088
    dtype: float64

df.std()
    CL=F        13.305778
    PTTEP.BK    15.370026
    dtype: float64
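
As a quick sanity check, standardization can be reproduced by hand from these statistics. Note that StandardScaler uses the population standard deviation (ddof=0), while pandas' .std() defaults to ddof=1, so the explicit ddof matters:


# reproduce StandardScaler by hand: z = (x - mean) / std, with ddof=0
# because StandardScaler divides by the population standard deviation
df_std_manual = (df - df.mean()) / df.std(ddof=0)
print(np.allclose(df_std_manual.values, df_std.values))  # True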


df_std.plot.kde()
plt.xlim(-3,3)

[Figure: KDE plot of the standardized data, x-axis limited to (-3, 3)]


df_std.plot.hist(bins=[-3,-2,-1,0,1,2,3])

[Figure: histogram of the standardized data with unit-width bins from -3 to 3]


sns.pairplot(df_std)
plt.show()

[Figure: seaborn pairplot of the standardized data]


With the data standardized, outliers can be flagged directly: any row lying beyond ±3 standard deviations in either column.


# rows where either standardized column lies beyond ±3 standard deviations
df_std.loc[(df_std > 3).any(axis=1) | (df_std < -3).any(axis=1)]

       CL=F       PTTEP.BK
522   -1.887333  -3.026275
523   -2.018948  -3.040871
524   -2.513819  -3.099254
525   -2.149059  -3.216018
527   -2.288946  -3.186827
545   -6.875905  -1.814844
546   -3.292978  -1.975396
547   -3.009443  -2.062969
550   -3.084651  -1.931609
551   -3.117743  -1.931609

# keep only the rows within ±3 standard deviations (note: despite its name,
# outlier_3sd holds the data with the outliers removed)
outlier_3sd = df_std.loc[~((df_std > 3).any(axis=1) | (df_std < -3).any(axis=1))]

outlier_3sd.plot.hist()

[Figure: histogram of the data after removing rows beyond ±3 standard deviations]


outlier_3sd.plot.kde()

[Figure: KDE plot of the data after the ±3 standard deviation filter]


sns.pairplot(outlier_3sd)
plt.show()

[Figure: seaborn pairplot of the data after the ±3 standard deviation filter]


Next, we apply robust scaling with the RobustScaler:


transformer = RobustScaler(quantile_range=(25.0, 75.0)).fit(df)
df_robust = transformer.transform(df)
df_robust
    array([[ 0.29773111, -1.07570019],
           [ 0.31876029, -0.91572879],
           [ 0.28721631, -0.88017993],
           ...,
           [-0.47758723, -0.78608323],
           [-0.45655784, -0.6732509 ],
           [-0.43442162, -0.65273577]])

df_robust = pd.DataFrame(df_robust, columns=df.columns)
df_robust

       CL=F       PTTEP.BK
0      0.297731  -1.075700
1      0.318760  -0.915729
2      0.287216  -0.880180
3      0.303265  -0.933504
4      0.371334  -0.880180
...    ...        ...
704   -0.449917  -0.775826
705   -0.443830  -0.755311
706   -0.477587  -0.786083
707   -0.456558  -0.673251
708   -0.434422  -0.652736

709 rows × 2 columns
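
With this quantile_range, the RobustScaler subtracts the median and divides by the interquartile range, so the same numbers can be reproduced directly with pandas as a quick sanity check:


# reproduce RobustScaler by hand: (x - median) / IQR
iqr = df.quantile(0.75) - df.quantile(0.25)
df_robust_manual = (df - df.median()) / iqr
print(np.allclose(df_robust_manual.values, df_robust.values))  # True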


df_robust.boxplot(figsize=(10,6))
plt.show()

[Figure: boxplot of the robust-scaled data]


df_robust.plot.kde()
plt.xlim(-3,3)

[Figure: KDE plot of the robust-scaled data, x-axis limited to (-3, 3)]


sns.pairplot(df_robust)
plt.show()

[Figure: seaborn pairplot of the robust-scaled data]


Normalization with quantile filtering


# work on a copy of the raw prices; the quantile filtering below does the
# actual normalization
df_normalized = df.copy()

df_normalized

Date          CL=F        PTTEP.BK
2018-01-03    61.630001   80.119003
2018-01-04    62.009998   83.671089
2018-01-05    61.439999   84.460434
2018-01-08    61.730000   83.276405
2018-01-09    62.959999   84.460434
...           ...         ...
2020-12-23    48.119999   86.777565
2020-12-24    48.230000   87.233093
2020-12-28    47.619999   86.549805
2020-12-29    48.000000   89.055191
2020-12-30    48.400002   89.510719

709 rows × 2 columns


We then normalize using quantile filtering, keeping only values between the 5th and 95th percentiles:


low = .05
high = .95
quant_df = df_normalized.quantile([low, high])
quant_df

        CL=F     PTTEP.BK
0.05    27.194    71.288631
0.95    70.966   116.598221

# keep, per column, only values above the 5th percentile and at or below
# the 95th; everything else becomes NaN
filt_df_x = df_normalized.apply(
    lambda x: x[(x > quant_df.loc[low, x.name]) & (x <= quant_df.loc[high, x.name])],
    axis=0)
filt_df_x

Date          CL=F        PTTEP.BK
2018-01-03    61.630001   80.119003
2018-01-04    62.009998   83.671089
2018-01-05    61.439999   84.460434
2018-01-08    61.730000   83.276405
2018-01-09    62.959999   84.460434
...           ...         ...
2020-12-23    48.119999   86.777565
2020-12-24    48.230000   87.233093
2020-12-28    47.619999   86.549805
2020-12-29    48.000000   89.055191
2020-12-30    48.400002   89.510719

662 rows × 2 columns


filt_df_x.dropna()  # drop rows where either column fell outside its band

Date          CL=F        PTTEP.BK
2018-01-03    61.630001   80.119003
2018-01-04    62.009998   83.671089
2018-01-05    61.439999   84.460434
2018-01-08    61.730000   83.276405
2018-01-09    62.959999   84.460434
...           ...         ...
2020-12-23    48.119999   86.777565
2020-12-24    48.230000   87.233093
2020-12-28    47.619999   86.549805
2020-12-29    48.000000   89.055191
2020-12-30    48.400002   89.510719

614 rows × 2 columns


An alternative way is a boolean mask over the whole dataframe:


# note: despite the Q1/Q3 names, these are the 5th and 95th percentiles,
# not the quartiles
Q1 = df_normalized.quantile(0.05)
Q3 = df_normalized.quantile(0.95)
filt_manual_df = df_normalized[~((df_normalized < Q1) | (df_normalized > Q3)).any(axis=1)]
filt_manual_df

Date          CL=F        PTTEP.BK
2018-01-03    61.630001   80.119003
2018-01-04    62.009998   83.671089
2018-01-05    61.439999   84.460434
2018-01-08    61.730000   83.276405
2018-01-09    62.959999   84.460434
...           ...         ...
2020-12-23    48.119999   86.777565
2020-12-24    48.230000   87.233093
2020-12-28    47.619999   86.549805
2020-12-29    48.000000   89.055191
2020-12-30    48.400002   89.510719

614 rows × 2 columns


filt_manual_df.plot()

[Figure: line plot of the quantile-filtered prices]
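
Quantile filtering drops an entire row even when only one column is extreme. A gentler alternative, sketched below (it is not part of the original walkthrough), is to winsorize: clip each column to its own quantile band instead of removing rows:


# winsorizing sketch: clip each column to its 5th-95th percentile band
# (quant_df was computed above); no rows are lost, extremes are capped
df_clipped = df_normalized.clip(lower=quant_df.loc[low],
                                upper=quant_df.loc[high],
                                axis=1)
df_clipped.plot()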


MinMax scaler


Lastly, we scale the data to the [0, 1] interval using the MinMaxScaler:


from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(df_normalized)
df_normal_scaled = scaler.transform(df_normalized)

df_normal_scaled
    array([[0.87039633, 0.37746131],
           [0.87372847, 0.42099251],
           [0.86873023, 0.43066602],
           ...,
           [0.74754469, 0.45627149],
           [0.75087686, 0.48697528],
           [0.75438442, 0.49255782]])

df_normal_scaled = pd.DataFrame(df_normal_scaled, columns=df.columns)

df_normal_scaled

       CL=F      PTTEP.BK
0      0.870396  0.377461
1      0.873728  0.420993
2      0.868730  0.430666
3      0.871273  0.416156
4      0.882059  0.430666
...    ...       ...
704    0.751929  0.459063
705    0.752894  0.464645
706    0.747545  0.456271
707    0.750877  0.486975
708    0.754384  0.492558

709 rows × 2 columns
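
MinMaxScaler maps each column to [0, 1] using its minimum and maximum, which is easy to verify by hand:


# reproduce MinMaxScaler by hand: (x - min) / (max - min)
rng = df_normalized.max() - df_normalized.min()
df_minmax_manual = (df_normalized - df_normalized.min()) / rng
print(np.allclose(df_minmax_manual.values, df_normal_scaled.values))  # True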


df_normal_scaled.boxplot(figsize=(10,6))
plt.show()

[Figure: boxplot of the min-max scaled data]


sns.pairplot(df_normal_scaled)
plt.show()

[Figure: seaborn pairplot of the min-max scaled data]


df_normal_scaled.plot.kde()
plt.xlim(-.5,1.5)

[Figure: KDE plot of the min-max scaled data, x-axis limited to (-0.5, 1.5)]


Conclusion


These methods help improve the performance of machine learning models by reducing the impact of outliers and ensuring that features are on a similar scale. The right choice depends on the dataset and the goal of the analysis: standardization suits roughly Gaussian data, robust scaling and quantile filtering tolerate heavy outliers, and min-max scaling is useful when values must lie in a bounded range.
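
As a closing sketch (the helper name here is illustrative, not from the walkthrough above), all three scalers share the same fit/transform interface, so switching between them can be kept to a one-line change:


from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler

# hypothetical helper: fit the chosen scaler and return a labeled DataFrame
def scale_df(frame, scaler_cls=RobustScaler, **kwargs):
    scaler = scaler_cls(**kwargs)
    return pd.DataFrame(scaler.fit_transform(frame),
                        index=frame.index, columns=frame.columns)

# RobustScaler is the safer default here, given the extreme April 2020 prints
df_scaled = scale_df(df, RobustScaler, quantile_range=(25.0, 75.0))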