Outlier Detection & Data Cleaning in Python - Comprehensive Guide

Apr 2, 2023

3 min read



Outlier-detection techniques are crucial for cleansing data and eliminating noise


In this guide, we use Python, the Pandas library, and the yfinance library to import and analyze financial data. We demonstrate how to clean data manually, standardize it with the StandardScaler, apply robust scaling with the RobustScaler, and normalize it using quantile filtering. These techniques improve the quality of data analysis by eliminating extreme values and focusing on more representative data points.


Step 1: Import the libraries. Start by importing yfinance, pandas, numpy, and matplotlib:


import yfinance as yf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Step 2: Download the data with yfinance. Fetch daily prices for WTI crude futures (CL=F) and the Thai oil equity PTTEP (PTTEP.BK), select the adjusted close so we get one price column per ticker, and store the result in a dataframe:



start = '2018-01-01'
end = '2021-01-01'
# CL=F -> WTI crude oil futures; PTTEP.BK -> PTTEP, a Thai oil-and-gas equity,
# so we can compare the oil price with an oil equity
df = yf.download('CL=F PTTEP.BK', start=start, end=end)['Adj Close']
df.dropna(inplace=True)  # drop sessions where either series is missing
df

Here’s what the dataframe should look like:


Date          CL=F        PTTEP.BK
2018-01-03    61.630001   80.119003
2018-01-04    62.009998   83.671089
2018-01-05    61.439999   84.460434
2018-01-08    61.730000   83.276405
2018-01-09    62.959999   84.460434
...           ...         ...
2020-12-23    48.119999   86.777565
2020-12-24    48.230000   87.233093
2020-12-28    47.619999   86.549805
2020-12-29    48.000000   89.055191
2020-12-30    48.400002   89.510719

709 rows × 2 columns



df.loc['2020-04-15':'2020-04-24']

Date          CL=F        PTTEP.BK
2020-04-15    19.870001   70.167038
2020-04-16    19.870001   66.356049
2020-04-17    18.270000   68.597801
2020-04-20   -37.630001   70.839561
2020-04-21    10.010000   68.373619
2020-04-22    13.780000   67.028572
2020-04-23    16.500000   68.149452
2020-04-24    16.940001   68.373619

Note the row for 2020-04-20: WTI futures famously settled at -$37.63 that day, a textbook outlier.


df.pct_change().plot()

[Figure: line plot of the daily percentage changes of CL=F and PTTEP.BK]


df.pct_change().plot.kde()
plt.xlim(-3,3)

[Figure: KDE plot of the daily percentage changes, x-axis limited to (-3, 3)]


import seaborn as sns
sns.pairplot(df.pct_change())
plt.show()

[Figure: seaborn pairplot of the daily percentage changes]


Next, we demonstrate how to clean the outlier data manually with Pandas. First, compute the percentage change of the prices:


df_ret = df.pct_change()
df_ret.loc['2020-04-15':'2020-04-24']

Date          CL=F       PTTEP.BK
2020-04-15   -0.011934  -0.027950
2020-04-16    0.000000  -0.054313
2020-04-17   -0.080523   0.033784
2020-04-20   -3.059661   0.032680
2020-04-21   -1.266011  -0.034810
2020-04-22    0.376623  -0.019672
2020-04-23    0.197388   0.016722
2020-04-24    0.026667   0.003289

Then, we flag the outliers with a simple threshold: any day on which the oil return is below -1 or above 1, i.e. a move of more than 100%:


# flag days where the CL=F return is below -1 or above 1
# (a move of more than 100% in either direction)
df_ret.loc[(df_ret['CL=F'] < -1) | (df_ret['CL=F'] > 1)]

Date          CL=F       PTTEP.BK
2020-04-20   -3.059661   0.03268
2020-04-21   -1.266011  -0.03481

We remove these outliers by concatenating the data from before and after the removal window (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead):


start_remove = '2020-04-17'
end_remove = '2020-04-22'
# keep everything up to start_remove and from end_remove onwards,
# dropping the two extreme sessions in between
df_remove = pd.concat([df_ret.loc[:start_remove], df_ret.loc[end_remove:]])
df_remove

Date          CL=F       PTTEP.BK
2018-01-03    NaN        NaN
2018-01-04    0.006166   0.044335
2018-01-05   -0.009192   0.009434
2018-01-08    0.004720  -0.014019
2018-01-09    0.019925   0.014218
...           ...        ...
2020-12-23    0.023394  -0.023077
2020-12-24    0.002286   0.005249
2020-12-28   -0.012648  -0.007833
2020-12-29    0.007980   0.028947
2020-12-30    0.008333   0.005115

707 rows × 2 columns


df_remove.loc['2020-04-15':'2020-04-24']

Date          CL=F       PTTEP.BK
2020-04-15   -0.011934  -0.027950
2020-04-16    0.000000  -0.054313
2020-04-17   -0.080523   0.033784
2020-04-22    0.376623  -0.019672
2020-04-23    0.197388   0.016722
2020-04-24    0.026667   0.003289
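
An equivalent approach, as a minimal sketch, is to build a boolean mask over the index instead of concatenating two slices; it keeps the same boundary dates as the .loc slices above:


# boolean-mask alternative: keep rows on or before start_remove, or on or
# after end_remove (both boundary dates are kept, matching the slices above)
mask = (df_ret.index <= start_remove) | (df_ret.index >= end_remove)
df_remove_alt = df_ret[mask]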


df_remove.plot()

[Figure: line plot of the cleaned percentage changes]


After cleaning the data, we demonstrate different methods of data transformation. First, standardize the data with the StandardScaler from scikit-learn:


from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler

scaler = StandardScaler()
scaler.fit(df)
df_std = scaler.transform(df)
df_std = pd.DataFrame(df_std, columns=df.columns)
df_std

       CL=F       PTTEP.BK
0      0.589278  -1.210682
1      0.617857  -0.979414
2      0.574989  -0.928022
3      0.596799  -1.005111
4      0.689305  -0.928022
...    ...        ...
704   -0.426787  -0.777159
705   -0.418514  -0.747500
706   -0.464391  -0.791988
707   -0.435812  -0.628868
708   -0.405728  -0.599210

709 rows × 2 columns


df.mean()
    CL=F        53.794725
    PTTEP.BK    98.714088
    dtype: float64

df.std()
    CL=F        13.305778
    PTTEP.BK    15.370026
    dtype: float64
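
As a quick sanity check, standardization can be reproduced by hand from these statistics. Note that StandardScaler uses the population standard deviation (ddof=0), while pandas' .std() defaults to ddof=1, so the explicit ddof matters:


# reproduce StandardScaler by hand: z = (x - mean) / std, with ddof=0
# because StandardScaler divides by the population standard deviation
df_std_manual = (df - df.mean()) / df.std(ddof=0)
print(np.allclose(df_std_manual.values, df_std.values))  # True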


df_std.plot.kde()
plt.xlim(-3,3)

[Figure: KDE plot of the standardized data, x-axis limited to (-3, 3)]


df_std.plot.hist(bins=[-3,-2,-1,0,1,2,3])

[Figure: histogram of the standardized data with unit-width bins from -3 to 3]


sns.pairplot(df_std)
plt.show()

[Figure: seaborn pairplot of the standardized data]


With the data standardized, outliers can be flagged directly: any row lying beyond ±3 standard deviations in either column.


# rows where either standardized column lies beyond ±3 standard deviations
df_std.loc[(df_std > 3).any(axis=1) | (df_std < -3).any(axis=1)]

       CL=F       PTTEP.BK
522   -1.887333  -3.026275
523   -2.018948  -3.040871
524   -2.513819  -3.099254
525   -2.149059  -3.216018
527   -2.288946  -3.186827
545   -6.875905  -1.814844
546   -3.292978  -1.975396
547   -3.009443  -2.062969
550   -3.084651  -1.931609
551   -3.117743  -1.931609

# keep only the rows within ±3 standard deviations (note: despite its name,
# outlier_3sd holds the data with the outliers removed)
outlier_3sd = df_std.loc[~((df_std > 3).any(axis=1) | (df_std < -3).any(axis=1))]

outlier_3sd.plot.hist()

[Figure: histogram of the data after removing rows beyond ±3 standard deviations]


outlier_3sd.plot.kde()

[Figure: KDE plot of the data after the ±3 standard deviation filter]


sns.pairplot(outlier_3sd)
plt.show()

[Figure: seaborn pairplot of the data after the ±3 standard deviation filter]


Next, we apply robust scaling with the RobustScaler:


transformer = RobustScaler(quantile_range=(25.0, 75.0)).fit(df)
df_robust = transformer.transform(df)
df_robust
    array([[ 0.29773111, -1.07570019],
           [ 0.31876029, -0.91572879],
           [ 0.28721631, -0.88017993],
           ...,
           [-0.47758723, -0.78608323],
           [-0.45655784, -0.6732509 ],
           [-0.43442162, -0.65273577]])

df_robust = pd.DataFrame(df_robust, columns=df.columns)
df_robust

       CL=F       PTTEP.BK
0      0.297731  -1.075700
1      0.318760  -0.915729
2      0.287216  -0.880180
3      0.303265  -0.933504
4      0.371334  -0.880180
...    ...        ...
704   -0.449917  -0.775826
705   -0.443830  -0.755311
706   -0.477587  -0.786083
707   -0.456558  -0.673251
708   -0.434422  -0.652736

709 rows × 2 columns
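
With this quantile_range, the RobustScaler subtracts the median and divides by the interquartile range, so the same numbers can be reproduced directly with pandas as a quick sanity check:


# reproduce RobustScaler by hand: (x - median) / IQR
iqr = df.quantile(0.75) - df.quantile(0.25)
df_robust_manual = (df - df.median()) / iqr
print(np.allclose(df_robust_manual.values, df_robust.values))  # True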


df_robust.boxplot(figsize=(10,6))
plt.show()

[Figure: boxplot of the robust-scaled data]


df_robust.plot.kde()
plt.xlim(-3,3)

[Figure: KDE plot of the robust-scaled data, x-axis limited to (-3, 3)]


sns.pairplot(df_robust)
plt.show()

[Figure: seaborn pairplot of the robust-scaled data]


Normalization with quantile filtering


# work on a copy of the raw prices; the quantile filtering below does the
# actual normalization
df_normalized = df.copy()

df_normalized

Date          CL=F        PTTEP.BK
2018-01-03    61.630001   80.119003
2018-01-04    62.009998   83.671089
2018-01-05    61.439999   84.460434
2018-01-08    61.730000   83.276405
2018-01-09    62.959999   84.460434
...           ...         ...
2020-12-23    48.119999   86.777565
2020-12-24    48.230000   87.233093
2020-12-28    47.619999   86.549805
2020-12-29    48.000000   89.055191
2020-12-30    48.400002   89.510719

709 rows × 2 columns


We then normalize using quantile filtering, keeping only values between the 5th and 95th percentiles:


low = .05
high = .95
quant_df = df_normalized.quantile([low, high])
quant_df

        CL=F     PTTEP.BK
0.05    27.194    71.288631
0.95    70.966   116.598221

# keep, per column, only values above the 5th percentile and at or below
# the 95th; everything else becomes NaN
filt_df_x = df_normalized.apply(
    lambda x: x[(x > quant_df.loc[low, x.name]) & (x <= quant_df.loc[high, x.name])],
    axis=0)
filt_df_x

Date          CL=F        PTTEP.BK
2018-01-03    61.630001   80.119003
2018-01-04    62.009998   83.671089
2018-01-05    61.439999   84.460434
2018-01-08    61.730000   83.276405
2018-01-09    62.959999   84.460434
...           ...         ...
2020-12-23    48.119999   86.777565
2020-12-24    48.230000   87.233093
2020-12-28    47.619999   86.549805
2020-12-29    48.000000   89.055191
2020-12-30    48.400002   89.510719

662 rows × 2 columns


filt_df_x.dropna()  # drop rows where either column fell outside its band

Date          CL=F        PTTEP.BK
2018-01-03    61.630001   80.119003
2018-01-04    62.009998   83.671089
2018-01-05    61.439999   84.460434
2018-01-08    61.730000   83.276405
2018-01-09    62.959999   84.460434
...           ...         ...
2020-12-23    48.119999   86.777565
2020-12-24    48.230000   87.233093
2020-12-28    47.619999   86.549805
2020-12-29    48.000000   89.055191
2020-12-30    48.400002   89.510719

614 rows × 2 columns


An alternative way is a boolean mask over the whole dataframe:


# note: despite the Q1/Q3 names, these are the 5th and 95th percentiles,
# not the quartiles
Q1 = df_normalized.quantile(0.05)
Q3 = df_normalized.quantile(0.95)
filt_manual_df = df_normalized[~((df_normalized < Q1) | (df_normalized > Q3)).any(axis=1)]
filt_manual_df

Date          CL=F        PTTEP.BK
2018-01-03    61.630001   80.119003
2018-01-04    62.009998   83.671089
2018-01-05    61.439999   84.460434
2018-01-08    61.730000   83.276405
2018-01-09    62.959999   84.460434
...           ...         ...
2020-12-23    48.119999   86.777565
2020-12-24    48.230000   87.233093
2020-12-28    47.619999   86.549805
2020-12-29    48.000000   89.055191
2020-12-30    48.400002   89.510719

614 rows × 2 columns


filt_manual_df.plot()

[Figure: line plot of the quantile-filtered prices]
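
Quantile filtering drops an entire row even when only one column is extreme. A gentler alternative, sketched below (it is not part of the original walkthrough), is to winsorize: clip each column to its own quantile band instead of removing rows:


# winsorizing sketch: clip each column to its 5th-95th percentile band
# (quant_df was computed above); no rows are lost, extremes are capped
df_clipped = df_normalized.clip(lower=quant_df.loc[low],
                                upper=quant_df.loc[high],
                                axis=1)
df_clipped.plot()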


MinMax scaler


Lastly, we scale the data to the [0, 1] interval using the MinMaxScaler:


from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(df_normalized)
df_normal_scaled = scaler.transform(df_normalized)

df_normal_scaled
    array([[0.87039633, 0.37746131],
           [0.87372847, 0.42099251],
           [0.86873023, 0.43066602],
           ...,
           [0.74754469, 0.45627149],
           [0.75087686, 0.48697528],
           [0.75438442, 0.49255782]])

df_normal_scaled = pd.DataFrame(df_normal_scaled, columns=df.columns)

df_normal_scaled

       CL=F      PTTEP.BK
0      0.870396  0.377461
1      0.873728  0.420993
2      0.868730  0.430666
3      0.871273  0.416156
4      0.882059  0.430666
...    ...       ...
704    0.751929  0.459063
705    0.752894  0.464645
706    0.747545  0.456271
707    0.750877  0.486975
708    0.754384  0.492558

709 rows × 2 columns
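
MinMaxScaler maps each column to [0, 1] using its minimum and maximum, which is easy to verify by hand:


# reproduce MinMaxScaler by hand: (x - min) / (max - min)
rng = df_normalized.max() - df_normalized.min()
df_minmax_manual = (df_normalized - df_normalized.min()) / rng
print(np.allclose(df_minmax_manual.values, df_normal_scaled.values))  # True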


df_normal_scaled.boxplot(figsize=(10,6))
plt.show()

[Figure: boxplot of the min-max scaled data]


sns.pairplot(df_normal_scaled)
plt.show()

[Figure: seaborn pairplot of the min-max scaled data]


df_normal_scaled.plot.kde()
plt.xlim(-.5,1.5)

[Figure: KDE plot of the min-max scaled data, x-axis limited to (-0.5, 1.5)]


Conclusion


These methods help improve the performance of machine learning models by reducing the impact of outliers and ensuring that features are on a similar scale. The right choice depends on the dataset and the goal of the analysis: standardization suits roughly Gaussian data, robust scaling and quantile filtering tolerate heavy outliers, and min-max scaling is useful when values must lie in a bounded range.
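
As a closing sketch (the helper name here is illustrative, not from the walkthrough above), all three scalers share the same fit/transform interface, so switching between them can be kept to a one-line change:


from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler

# hypothetical helper: fit the chosen scaler and return a labeled DataFrame
def scale_df(frame, scaler_cls=RobustScaler, **kwargs):
    scaler = scaler_cls(**kwargs)
    return pd.DataFrame(scaler.fit_transform(frame),
                        index=frame.index, columns=frame.columns)

# RobustScaler is the safer default here, given the extreme April 2020 prints
df_scaled = scale_df(df, RobustScaler, quantile_range=(25.0, 75.0))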