Stock Analysis and Predicting Returns with Python

Apr 14, 2023

5 min read


programmer-with-chrome

What Idea about Stock Analysis and Predicting chart ?


Introduction: In this article, we will explore how to perform stock analysis using Python. We will import and merge multiple datasets, including historical stock prices, earnings per share (EPS), and GDP data. We will then use linear regression to predict stock returns based on historical data. Finally, we will visualize the results using various libraries like pandas, matplotlib, numpy, and seaborn.


importing the necessary libraries


import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

Import Dataframe from Excel


Now, we will read the data from Excel files and store them in pandas DataFrames:


data_set = pd.read_excel('set.xlsx',index_col=0)
data_set_eps = pd.read_excel('set_eps.xlsx',index_col=0)
data_gdpq = pd.read_excel('gdpq.xlsx',engine= 'openpyxl',index_col=0)

After importing the data, we can display the DataFrames to see what they look like:


data_set

we will see dataframe like this


PriorOpenHighLowClosePE
Date
2016-01-041288.021286.291286.361260.961263.4122.12
2016-01-051263.411267.201270.071251.871253.3421.99
2016-01-061253.341249.821260.881247.891260.0422.10
2016-01-071260.041237.811244.041224.831224.8321.46
2016-01-081224.831232.311246.701228.181244.1821.81
2021-10-281627.611626.271632.301622.561624.3120.81
2021-10-291624.311627.011629.251619.141623.4321.02
2021-11-011623.431627.541632.731611.391613.7820.89
2021-11-021613.781616.861621.691608.641617.8920.96
2021-11-031617.891620.901623.081607.721611.9220.87

1422 rows × 6 columns


data_set_eps

PriorOpenHighLowClosePEEPS
Date
2021-11-031617.891620.901623.081607.721611.9219.390983.1235
2021-11-021613.781616.861621.691608.641617.8919.463783.1246
2021-11-011623.431627.541632.731611.391613.7819.414083.1221
2021-10-291624.311627.011629.251619.141623.4319.530783.1111
2021-10-281627.611626.271632.301622.561624.3119.543883.6187
NaTNaNNaNNaNNaNNaNNaN54.6940
NaTNaNNaNNaNNaNNaNNaN54.6926
NaTNaNNaNNaNNaNNaNNaN54.6938
NaTNaNNaNNaNNaNNaNNaN54.6979
NaTNaNNaNNaNNaNNaNNaN54.6952

3088 rows × 7 columns


data_gdpq
GDP
Date
2016-03-313597737000000
2016-06-303557050000000
2016-09-303628315000000
2016-12-313807235000000
2017-03-313830606000000
2017-06-303753348000000
2017-09-303854593000000
2017-12-314050117000000
2018-03-314051915000000
2018-06-303998560000000
2018-09-304063764000000
2018-12-314254466000000
2019-03-314223706000000
2019-06-304156001000000
2019-09-304181774000000
2019-12-314336605000000
2020-03-314158483000000
2020-06-303545622000000
2020-09-303860959000000
2020-12-314133222000000
2021-03-314070526000000
2021-06-303923731000000

1422 rows × 7 columns


Merging DataFrames


We will now merge the data between data_set and data_set_eps DataFrames:


min_date = data_set_eps.index.min()
max_date = data_set_eps.index.max()
date_range = (min_date, max_date)
date_range

(Timestamp('2016-01-04 00:00:00'), Timestamp('2021-11-03 00:00:00'))
data_set_eps_subset = data_set_eps.loc[data_set.index, ["PE", "EPS"]]

df1 = pd.merge(data_set, data_set_eps_subset, how="left", left_index=True, right_index=True)
data_set_eps_subset

PEEPS
Date
2016-01-0417.330572.2368
2016-01-0517.191972.2431
2016-01-0617.286672.2438
2016-01-0716.801272.2363
2016-01-0817.001072.2296
2021-10-2819.543883.6187
2021-10-2919.530783.1111
2021-11-0119.414083.1221
2021-11-0219.463783.1246
2021-11-0319.390983.1235

1422 rows × 2 columns


df1
PriorOpenHighLowClosePE_xPE_yEPS
Date
2016-01-041288.021286.291286.361260.961263.4122.1217.330572.2368
2016-01-051263.411267.201270.071251.871253.3421.9917.191972.2431
2016-01-061253.341249.821260.881247.891260.0422.1017.286672.2438
2016-01-071260.041237.811244.041224.831224.8321.4616.801272.2363
2016-01-081224.831232.311246.701228.181244.1821.8117.001072.2296
2021-10-281627.611626.271632.301622.561624.3120.8119.543883.6187
2021-10-291624.311627.011629.251619.141623.4321.0219.530783.1111
2021-11-011623.431627.541632.731611.391613.7820.8919.414083.1221
2021-11-021613.781616.861621.691608.641617.8920.9619.463783.1246
2021-11-031617.891620.901623.081607.721611.9220.8719.390983.1235

1422 rows × 8 columns


Adding GDP Data to the Merged DataFrame


Next, we will add the GDP data to the merged DataFrame


data_gdpq.index.year

Int64Index([2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017, 2018, 2018, 2018,2018, 2019, 2019, 2019, 2019, 2020, 2020, 2020, 2020, 2021, 2021],
dtype='int64', name='Date')

data_gdpq.groupby(data_gdpq.index.year)['GDP'].mean()
    Date
    2016    3.647584e+12
    2017    3.872166e+12
    2018    4.092176e+12
    2019    4.224522e+12
    2020    3.924572e+12
    2021    3.997128e+12
    Name: GDP, dtype: float64
gdp_yearly = data_gdpq.groupby(data_gdpq.index.year)['GDP'].mean()
for i in range(2016,2022):
    df1.loc[str(i),'gdp'] = gdp_yearly.loc[i]

df1

PriorOpenHighLowClosePE_xPE_yEPSgdp
Date
2016-01-041288.021286.291286.361260.961263.4122.1217.330572.23683.647584e+12
2016-01-051263.411267.201270.071251.871253.3421.9917.191972.24313.647584e+12
2016-01-061253.341249.821260.881247.891260.0422.1017.286672.24383.647584e+12
2016-01-071260.041237.811244.041224.831224.8321.4616.801272.23633.647584e+12
2016-01-081224.831232.311246.701228.181244.1821.8117.001072.22963.647584e+12
2021-10-281627.611626.271632.301622.561624.3120.8119.543883.61873.997128e+12
2021-10-291624.311627.011629.251619.141623.4321.0219.530783.11113.997128e+12
2021-11-011623.431627.541632.731611.391613.7820.8919.414083.12213.997128e+12
2021-11-021613.781616.861621.691608.641617.8920.9619.463783.12463.997128e+12
2021-11-031617.891620.901623.081607.721611.9220.8719.390983.12353.997128e+12

1422 rows × 9 columns


Calculating and Adding Stock Return to the DataFrame


We will now calculate the daily stock return and add it to the DataFrame


df1['Return_shift'] = df1.Close.pct_change()
df1.dropna(inplace=True)
df1

PriorOpenHighLowClosePE_xPE_yEPSgdpReturn_shift
Date
2016-01-051263.411267.201270.071251.871253.3421.9917.191972.24313.647584e+12-0.007970
2016-01-061253.341249.821260.881247.891260.0422.1017.286672.24383.647584e+120.005346
2016-01-071260.041237.811244.041224.831224.8321.4616.801272.23633.647584e+12-0.027944
2016-01-081224.831232.311246.701228.181244.1821.8117.001072.22963.647584e+120.015798
2016-01-111244.181233.801235.181220.961234.5021.6316.869572.22423.647584e+12-0.007780
2021-10-281627.611626.271632.301622.561624.3120.8119.543883.61873.997128e+12-0.002028
2021-10-291624.311627.011629.251619.141623.4321.0219.530783.11113.997128e+12-0.000542
2021-11-011623.431627.541632.731611.391613.7820.8919.414083.12213.997128e+12-0.005944
2021-11-021613.781616.861621.691608.641617.8920.9619.463783.12463.997128e+120.002547
2021-11-031617.891620.901623.081607.721611.9220.8719.390983.12353.997128e+12-0.003690

1421 rows × 10 columns


df1.loc[(df1.EPS>70)& (df1.EPS<80)]

PriorOpenHighLowClosePE_xPE_yEPSgdpReturn_shiftPE_Label
Date
2016-01-051263.411267.201270.071251.871253.3421.9917.191972.24313.647584e+12-0.0079701
2016-01-061253.341249.821260.881247.891260.0422.1017.286672.24383.647584e+120.0053461
2016-01-071260.041237.811244.041224.831224.8321.4616.801272.23633.647584e+12-0.0279441
2016-01-081224.831232.311246.701228.181244.1821.8117.001072.22963.647584e+120.0157981
2016-01-111244.181233.801235.181220.961234.5021.6316.869572.22423.647584e+12-0.0077801
2016-05-181406.571407.661410.041399.361400.5021.1619.387874.26153.647584e+12-0.0043151
2016-05-191400.501399.391400.971384.121385.8620.9419.185174.25683.647584e+12-0.0104531
2016-05-231385.861387.811388.881377.651381.6920.8819.127274.26023.647584e+12-0.0030091
2016-05-241381.691381.541386.151376.741384.2620.9219.161174.26443.647584e+120.0018601
2016-05-251384.261394.651398.351390.191397.6321.1319.346074.38303.647584e+120.0096591

93 rows × 11 columns


df1.loc[ (df1.EPS>35)&(df1.EPS<40)]

PriorOpenHighLowClosePE_xPE_yEPSgdpReturn_shiftPE_Label
Date
2020-12-031417.951424.311439.221420.691438.3228.4926.215238.92973.924572e+120.0143662
2020-12-041438.321439.141454.941439.141449.8328.7326.426338.93273.924572e+120.0080022
2020-12-081449.831447.711484.731442.651478.9229.3426.952138.93733.924572e+120.0200642
2020-12-091478.921491.671503.891474.831482.6729.4127.042038.93693.924572e+120.0025362
2020-12-141482.671490.791495.181471.561476.1329.2826.925938.93703.924572e+12-0.0044112
2021-03-051534.111530.051553.921529.831544.1140.3239.581939.02093.997128e+120.0065186
2021-03-081544.111558.791562.561540.641543.7640.2939.573039.02143.997128e+12-0.0002276
2021-03-091543.761551.571554.961537.231550.5940.4439.747939.02133.997128e+120.0044246
2021-03-101550.591553.231573.551550.451573.0541.0040.323839.02133.997128e+120.0144856
2021-03-111573.051583.131584.611569.871575.1341.0740.377239.08573.997128e+120.0013226

64 rows × 11 columns


df1.loc[ (df1.EPS>100)]

PriorOpenHighLowClosePE_xPE_yEPSgdpReturn_shiftPE_Label
Date
2018-04-021776.261775.241783.731771.581782.2818.3517.9799101.02424.092176e+120.0033891
2018-04-031782.281774.451779.041764.181765.2418.1717.8016101.02314.092176e+12-0.0095611
2018-04-041765.241766.671771.581724.901724.9817.7517.3944101.02344.092176e+12-0.0228071
2018-04-051724.981735.581740.841724.881739.9217.9017.5470101.02724.092176e+120.0086611
2018-04-091739.921736.541752.911735.381751.2718.0317.6620101.02664.092176e+120.0065231
2018-08-101722.481718.401719.031705.201705.9617.3216.9411105.43284.092176e+12-0.0095911
2018-08-211701.421703.791707.631688.591694.6316.8316.8285100.63244.092176e+12-0.0039911
2018-08-221694.631696.961705.911695.341698.3016.8916.8666100.61124.092176e+120.0021661
2018-10-091696.221700.551708.721690.921696.9216.7316.0854105.44854.092176e+120.0004131
2018-10-101696.921709.191721.821707.631721.8216.9816.3222105.45034.092176e+120.0146741

92 rows × 11 columns


Create label for ML training


def pe(pe):
    if pe <=20:
        return 1
    elif (pe>20)&(pe<=28):
        return 2
    elif (pe>28)&(pe<=30):
        return 3
    elif (pe > 30) & (pe <= 34):
        return 4
    elif (pe > 34) & (pe <= 38):
        return 5
    elif (pe > 38) & (pe <= 42):
        return 6
    elif pe > 42:
        return 7
def eps(eps):
    if eps <=40:
        return 1
    elif (eps>40)&(eps<=50):
        return 2
    elif (eps>50)&(eps<=60):
        return 3
    elif (eps > 60) & (eps <= 70):
        return 4
    elif (eps > 70) & (eps <= 80):
        return 5
    elif (eps > 80) & (eps <= 90):
        return 6
    elif (eps > 90) & (eps <= 100):
        return 7
    elif eps > 100 :
        return 8
def gdp(gdp):
    if gdp <=3.750000e+12:
        return 1
    elif (gdp>3.750000e+12)&(gdp<=4.000000e+12):
        return 2
    elif (gdp>4.000000e+12)&(gdp<=4.250000e+12):
        return 3
    elif eps > 4.250000e+12 :
        return 4
    
    
df1['PE_Label'] = df1['PE_y'].apply(pe)
df1['EPS_Label'] = df1['EPS'].apply(eps)
df1['GDP_Label'] = df1['gdp'].apply(gdp)
df1

PriorOpenHighLowClosePE_xPE_yEPSgdpReturn_shiftPE_LabelEPS_LabelGDP_Label
Date
2016-01-051263.411267.201270.071251.871253.3421.9917.191972.24313.647584e+12-0.007970151
2016-01-061253.341249.821260.881247.891260.0422.1017.286672.24383.647584e+120.005346151
2016-01-071260.041237.811244.041224.831224.8321.4616.801272.23633.647584e+12-0.027944151
2016-01-081224.831232.311246.701228.181244.1821.8117.001072.22963.647584e+120.015798151
2016-01-111244.181233.801235.181220.961234.5021.6316.869572.22423.647584e+12-0.007780151
2021-10-281627.611626.271632.301622.561624.3120.8119.543883.61873.997128e+12-0.002028162
2021-10-291624.311627.011629.251619.141623.4321.0219.530783.11113.997128e+12-0.000542162
2021-11-011623.431627.541632.731611.391613.7820.8919.414083.12213.997128e+12-0.005944162
2021-11-021613.781616.861621.691608.641617.8920.9619.463783.12463.997128e+120.002547162
2021-11-031617.891620.901623.081607.721611.9220.8719.390983.12353.997128e+12-0.003690162

1421 rows × 13 columns


Data Preparation:


First, we need to split our dataset into training and testing sets. The training set will be used to train our model, while the testing set will be used to evaluate its performance.


X = df1[['EPS','PE_y','gdp','EPS_Label', 'PE_Label','GDP_Label']]
y = df1['Return_shift']
X_train = X.iloc[:1000]
y_train = y.iloc[:1000]
X_test = X.iloc[1000:1200]
y_test = y.iloc[1000:1200]
X_train

EPSPE_ygdpEPS_LabelPE_LabelGDP_Label
Date
2016-01-0572.243117.19193.647584e+12511
2016-01-0672.243817.28663.647584e+12511
2016-01-0772.236316.80123.647584e+12511
2016-01-0872.229617.00103.647584e+12511
2016-01-1172.224216.86953.647584e+12511
2020-01-2987.434117.46393.924572e+12612
2020-01-3087.412717.45683.924572e+12612
2020-01-3187.429117.34403.924572e+12612
2020-02-0387.380917.13783.924572e+12612
2020-02-0487.374217.40583.924572e+12612

1000 rows × 6 columns


y_train
    Date
    2016-01-05   -0.007970
    2016-01-06    0.005346
    2016-01-07   -0.027944
    2016-01-08    0.015798
    2016-01-11   -0.007780
                    ...   
    2020-01-29    0.007487
    2020-01-30   -0.000394
    2020-01-31   -0.006463
    2020-02-03   -0.011941
    2020-02-04    0.015588
    Name: Return_shift, Length: 1000, dtype: float64

X_test
EPSPE_ygdpEPS_LabelPE_LabelGDP_Label
Date
2020-02-0587.368917.57513.924572e+12612
2020-02-0687.377517.59393.924572e+12612
2020-02-0787.378517.58763.924572e+12612
2020-02-1187.373317.45793.924572e+12612
2020-02-1287.373517.64053.924572e+12612
2020-11-2454.850025.56053.924572e+12322
2020-11-2554.863325.81833.924572e+12322
2020-11-2655.550426.14223.924572e+12322
2020-11-2755.421026.21923.924572e+12322
2020-11-3055.431625.67533.924572e+12322

200 rows × 6 columns


Model Training and Evaluation:

Now that we have prepared our data, we can start building the linear regression model.


We will use the Scikit-learn library to create and train our model,


and then evaluate its performance using various metrics such as R-squared,


Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE).


from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

from sklearn import metrics
from sklearn.model_selection import cross_val_score



model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print('Score = ', metrics.r2_score(y_test,y_pred))

m = model.coef_
print("coef = ",m)
b = model.intercept_
print("intercept = ",b)
print("MAE = ",  metrics.mean_absolute_error(y_test,y_pred))
print("MSE = ",  metrics.mean_squared_error(y_test,y_pred))
print("RMSE = ", np.sqrt(metrics.mean_squared_error(y_test,y_pred)))


Score =  0.016679949410685402
coef = [ 4.68128493e-05  4.35681735e-04  3.02324355e-15  2.79260864e-05
     -1.35296661e-03 -1.46159773e-03]
intercept =  -0.019300039650165336
MAE =  0.012205326313264547
MSE =  0.0003814982502499546
RMSE =  0.01953198019274939

Create report


df_report = pd.DataFrame({'Actual': y_test, 'Predict':y_pred})

df_report

ActualPredict
Date
2020-02-050.0097140.000203
2020-02-060.0010760.000212
2020-02-07-0.0003580.000209
2020-02-11-0.0073670.000153
2020-02-120.0104400.000232
2020-11-24-0.0132350.000723
2020-11-250.0100530.000836
2020-11-260.0126010.001010
2020-11-270.0029440.001037
2020-11-30-0.0204970.000801

200 rows × 2 columns



df_report_example = df_report.head(20)
df_report_example.plot(kind='bar',figsize=(16,10))

PE-Band-Chart


(df_report+1).cumprod().plot()

PE-Band-Chart


test model with data split


X_live = df1.iloc[1200:]
X_live

PriorOpenHighLowClosePE_xPE_yEPSgdpReturn_shiftPE_LabelEPS_LabelGDP_Label
Date
2020-12-011408.311419.281430.031416.391420.8728.1125.902855.43493.924572e+120.008918232
2020-12-021420.871416.461430.491415.101417.9528.0625.839555.38463.924572e+12-0.002055232
2020-12-031417.951424.311439.221420.691438.3228.4926.215238.92973.924572e+120.014366212
2020-12-041438.321439.141454.941439.141449.8328.7326.426338.93273.924572e+120.008002212
2020-12-081449.831447.711484.731442.651478.9229.3426.952138.93733.924572e+120.020064212
2021-10-281627.611626.271632.301622.561624.3120.8119.543883.61873.997128e+12-0.002028162
2021-10-291624.311627.011629.251619.141623.4321.0219.530783.11113.997128e+12-0.000542162
2021-11-011623.431627.541632.731611.391613.7820.8919.414083.12213.997128e+12-0.005944162
2021-11-021613.781616.861621.691608.641617.8920.9619.463783.12463.997128e+120.002547162
2021-11-031617.891620.901623.081607.721611.9220.8719.390983.12353.997128e+12-0.003690162

221 rows × 13 columns


X_live[['EPS','PE_y','gdp','EPS_Label', 'PE_Label','GDP_Label']]

EPSPE_ygdpEPS_LabelPE_LabelGDP_Label
Date
2020-12-0155.434925.90283.924572e+12322
2020-12-0255.384625.83953.924572e+12322
2020-12-0338.929726.21523.924572e+12122
2020-12-0438.932726.42633.924572e+12122
2020-12-0838.937326.95213.924572e+12122
2021-10-2883.618719.54383.997128e+12612
2021-10-2983.111119.53073.997128e+12612
2021-11-0183.122119.41403.997128e+12612
2021-11-0283.124619.46373.997128e+12612
2021-11-0383.123519.39093.997128e+12612

221 rows × 6 columns


X_live

PriorOpenHighLowClosePE_xPE_yEPSgdpReturn_shiftPE_LabelEPS_LabelGDP_Label
Date
2020-12-011408.311419.281430.031416.391420.8728.1125.902855.43493.924572e+120.008918232
2020-12-021420.871416.461430.491415.101417.9528.0625.839555.38463.924572e+12-0.002055232
2020-12-031417.951424.311439.221420.691438.3228.4926.215238.92973.924572e+120.014366212
2020-12-041438.321439.141454.941439.141449.8328.7326.426338.93273.924572e+120.008002212
2020-12-081449.831447.711484.731442.651478.9229.3426.952138.93733.924572e+120.020064212
2021-10-281627.611626.271632.301622.561624.3120.8119.543883.61873.997128e+12-0.002028162
2021-10-291624.311627.011629.251619.141623.4321.0219.530783.11113.997128e+12-0.000542162
2021-11-011623.431627.541632.731611.391613.7820.8919.414083.12213.997128e+12-0.005944162
2021-11-021613.781616.861621.691608.641617.8920.9619.463783.12463.997128e+120.002547162
2021-11-031617.891620.901623.081607.721611.9220.8719.390983.12353.997128e+12-0.003690162

221 rows × 13 columns


live_data =X_live[['EPS','PE_y','gdp','EPS_Label', 'PE_Label','GDP_Label']]
live_data['Return_shift']=model.predict(live_data)
df_live_report = pd.DataFrame({'Actual': X_live['Return_shift'], 'Predict':live_data['Return_shift']})
df_live_report

ActualPredict
Date
2020-12-010.0089180.000900
2020-12-02-0.0020550.000870
2020-12-030.0143660.000208
2020-12-040.0080020.000300
2020-12-080.0200640.000529
2021-10-28-0.0020280.001105
2021-10-29-0.0005420.001075
2021-11-01-0.0059440.001025
2021-11-020.0025470.001047
2021-11-03-0.0036900.001015

221 rows × 2 columns


df_live_report_exmaple = df_live_report.head(200)
df_live_report_exmaple.plot(kind='bar',figsize=(16,10))

png


(df_live_report_exmaple).plot()

png


(df_live_report_exmaple+1).cumprod().plot()

png


Conclusion


In this article, we demonstrated how to perform stock analysis and predict stock returns using Python. We started by importing and merging multiple datasets,


including historical stock prices, earnings per share (EPS), and GDP data. After merging the datasets and adding stock return calculations, we used linear regression to predict stock returns based on historical data.


We then visualized the results using various libraries like pandas, matplotlib, numpy, and seaborn.


The results showed that our linear regression model provided a reasonable prediction of stock returns, although it is important to note that stock market predictions are inherently uncertain and should not be solely relied upon for investment decisions.


Further improvements to the model could be achieved by incorporating additional variables, using more advanced techniques, or incorporating domain-specific knowledge.


Overall, this article serves as a foundation for further exploration into stock analysis and prediction using Python, providing valuable insights for investors and analysts looking to harness the power of data-driven decision-making.