Outlier techniques are crucial for data cleansing and eliminating noise
In this example, we used Python, the Pandas library, and the yfinance library to import and analyze financial data. We demonstrated how to clean data manually, standardize data using the StandardScaler, robust scaling with the RobustScaler, and normalization using quantile filtering. These techniques help improve the quality of data analysis by eliminating extreme values and focusing on more representative data points.
Step 1: Import pandas and matplotlib
Start by importing the necessary libraries:
import yfinance as yf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Step 2: Import Dataframe from Excel
Read the data from an Excel file and store it in a dataframe:
start = '2018-01-01'
end = '2021-01-01'
data = yf.download('CL=F PTTEP.BK', start=start, end=end, periods=1)
# CL=F -> WTI3 Oil price
[*********************100%***********************] 2 of 2 completed
# compare oilprice with PTTEP oil equity
df.dropna(inplace=True)
# clean data
df
Here’s what the dataframe should look like:
df.loc['2020-04-15':'2020-04-24']
df.pct_change().plot()
df.pct_change().plot.kde()
plt.xlim(-3,3)
import seaborn as sns
sns.pairplot(df.pct_change())
plt.show()
we demonstrated how to clean outlier data manually using Python and the Pandas library. First, we computed the percentage change for the given data:
df_ret = df.pct_change()
df_ret.loc['2020-04-15':'2020-04-24']
Then, we identified the outliers using a threshold of more than -1 standard deviation and 1 standard deviation:
df_ret.loc[(df_ret['CL=F']<-1)|(df_ret['CL=F']>1)]
#check is over more than - 1 sd and 1 sd
We removed the outliers by appending the data before and after the removal period:
start_remove = '2020-04-17'
end_remove = '2020-04-22'
df_remove = df_ret.loc[:start_remove].append(df_ret.loc[end_remove:])
df_remove
df_remove.loc['2020-04-15':'2020-04-24']
df_remove.plot()
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
scaler = StandardScaler()
scaler.fit(df)
StandardScaler()
df_std = scaler.transform(df)
df_std = pd.DataFrame(df_std, columns=df.columns)
df_std
df.mean()
CL=F 53.794725
PTTEP.BK 98.714088
dtype: float64
df.std()
CL=F 13.305778
PTTEP.BK 15.370026
dtype: float64
import matplotlib.pyplot as plt
df_std.plot.kde()
plt.xlim(-3,3)
df_std.plot.hist(bins=[-3,-2,-1,0,1,2,3])
import seaborn as sns
sns.pairplot(df_std)
plt.show()
df_std.loc[(df_std > 3).any(axis=1)|(df_std<-3).any(1)]
outlier_3sd = df_std.loc[~((df_std > 3).any(1) | (df_std < -3).any(1))]
outlier_3sd.plot.hist()
outlier_3sd.plot.kde()
import seaborn as sns
sns.pairplot(outlier_3sd)
plt.show()
Next, we used robust scaling with the RobustScaler:
transformer = RobustScaler( quantile_range=(25.0, 75.0)).fit(df)
df_robust=transformer.transform(df)
df_robust
array([[ 0.29773111, -1.07570019],
[ 0.31876029, -0.91572879],
[ 0.28721631, -0.88017993],
...,
[-0.47758723, -0.78608323],
[-0.45655784, -0.6732509 ],
[-0.43442162, -0.65273577]])
df_robust = pd.DataFrame(df_robust,columns=df.columns)
df_robust
df_robust.boxplot(figsize=(10,6))
plt.show()
df_robust.plot.kde()
plt.xlim(-3,3)
import seaborn as sns
sns.pairplot(df_robust)
plt.show()
Normalized
df_normalized = pd.DataFrame(df,columns=df.columns)
df_normalized
We also demonstrated normalization using quantile filtering:
low = .05
high = .95
quant_df = df_normalized.quantile([low, high])
quant_df
filt_df_x = df_normalized.apply(lambda x: x[(x>quant_df.loc[low,x.name])& (x <= quant_df.loc[high,x.name])], axis=0)
filt_df_x
filt_df_x.dropna()
alternative way
Q1 = df_normalized.quantile(0.05)
Q3 = df_normalized.quantile(0.95)
filt_manual_df = df_normalized[~((df_normalized < Q1) | (df_normalized > Q3 )).any(axis=1)]
filt_manual_df
filt_manual_df.plot()
minmax scaler
Lastly, we scaled the data using the MinMaxScaler:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(df_normalized)
MinMaxScaler()
df_normal_scaled = scaler.transform(df_normalized)
df_normal_scaled
array([[0.87039633, 0.37746131],
[0.87372847, 0.42099251],
[0.86873023, 0.43066602],
...,
[0.74754469, 0.45627149],
[0.75087686, 0.48697528],
[0.75438442, 0.49255782]])
df_normal_scaled = pd.DataFrame(df_normal_scaled,columns=df.columns)
df_normal_scaled
df_normal_scaled.boxplot(figsize=(10,6))
plt.show()
import seaborn as sns
sns.pairplot(df_normal_scaled)
plt.show()
df_normal_scaled.plot.kde()
plt.xlim(-.5,1.5)
Conclusion