I used multiple linear regression for the this research. Did the following OLS estimation you wrote in the last communication use the time series forecasting model such as ARIMA OLS estimation? Do I need to use time series forecasting to estimate?
Dynamics_variable_it = a1 + b2#firms_it-n + b3industry mix_it-n + Xb_it-n + location_i + year_t + e_it (1)
Firm_performance_it = c1 + c2#firms_it-n + c3dynamics_it-n + c4industry mix_it-n + Xc_it-n + location_i + year_t + u_it (2)
P.S. I found that Firm_performance_it is also only available for Average_of_latest_accounts_assets
in the dataset. OR are there other indicators that can be used to measure Firm_performance_it
HIH index might not be available because a company’s market share is difficult to measure for the existing dataset
My Research Question:
How does tech clusters' dynamics pattern change in UK from 1998 to 2018? / What factors can affect tech clusters' dynamics pattern change in UK?
To what extent will dynamic changes affect tech clusters’ performance?
df3.head()
With reference to the OLS estimation that you said last time, I made two assumptions in section 2.3 Hypothesis. I selected the second assumption as following key results:
$\ log_e{\ Dynamics\ Variable_{i,t}}\ =\ a_1 + b_1 \times firms_{i,t} + b_2 \times location_i + b_3 \times year_t$ (1)
$\ log_e\ Firm\ Performance_{i,t}\ =\ c_1 + d_1 \times firms_{i,t} + d_2 \times dynamics_{i,t} + d_3 \times location_i + d_4 \times year_t$ (2)
Entry_Rate
in $t$ and $i$ Entry_Rate
in a specific $t$ and $i$ Count_of_tech_firms
in a specific $t$ and $i$birth_year
ttwa
(travel to work area)For the multiple linear regression of the above assumptions, the final results are given below; the goodness of model fit is not bad ($R^2$ is high)
$\ log_e{\ Dynamics\ Variable_{i,t}} = 2.8e^{-5} \times firms_{i,t} + b_2 \times location_i + 0.15 \times year_t - 310$(1)
$\ log_e\ Firm\ Performance_{i,t}\ = 0.0001 \times firms_{i,t} + {-8.9585} \times dynamics_{i,t} + d_3 \times location_i + {-0.0675} \times year_t + 147.5$ (2)
NOTE: $b_2$ and $d_3$ should be seen in the following regression summary because $location_i$ is a dummy independent variable.
The regression report details are given below (Report No.1 & No.2), and further residual analysis can be viewed in section 2.4.2
【Report No.1】
This is the Dynamics Variable (1) regression summary:
NOTE: for dummy variable $location_i$, ttwa_E30000169(Birmingham)
is dropped as a reference objectivity in the regression
# for formula (1)
hypo2_1_regressor_OLS.summary()
【Report No.2】
This is the Firms' Performance (2) regression summary:
# for formula (2)
hypo2_2_regressor_OLS.summary()
For better see the ttwa cluster name:
df_ttwa_10.style
$\ log_e{\ Dynamics\ Variable_{i,t}} = 2.8e^{-5} \times firms_{i,t} + b_2 \times location_i + 0.15 \times year_t - 310$(1)
$\ log_e\ Firm\ Performance_{i,t}\ = 0.0001 \times firms_{i,t} + {-8.9585} \times dynamics_{i,t} + d_3 \times location_i + {-0.0675} \times year_t + 147.5$ (2)
Average_of_latest_accounts_assets
variable which has 41% missing value in datasetimport pandas as pd
import sys
PATH = sys.path[0]
df = pd.read_csv(PATH + "/Dataset/fame_OC_tech_firm.csv",low_memory=False)
df = df.drop({"Unnamed: 0","Unnamed: 0.1","Unnamed: 0.1.1","is_tech"},axis = 1)
# preprocess the data
# drop the null value record in birth year and ttwa(travel to work area) columns
df = df.dropna(subset = {"birth_year","ttwa"})
Get the top 10 ttwas which have the most number of firms
# Generate the top 10 tech clusters table based on the ttwa
top_10_ttwa = df.ttwa.value_counts().head(10)
top_10_ttwa = pd.DataFrame(top_10_ttwa).reset_index()
top_10_ttwa.columns = ["ttwa_code","counts"]
df_ttwa = pd.read_csv(PATH + "/Dataset/ttwa.csv")
df_ttwa = pd.merge(top_10_ttwa, df_ttwa[["code","name"]], left_on="ttwa_code",right_on="code")
df_ttwa_10 = df_ttwa[["ttwa_code", "name","counts"]]
df_ttwa_10.style
Filter data by ttwa
# df_10_tc means "top 10 tech ttwa clusters"
df_10_tc = df.copy()
print(df_10_tc.shape)
# filter data to top 10 ttwa data raws
df_10_tc = df_10_tc[df_10_tc.ttwa.isin(df_ttwa_10.ttwa_code.to_list())]
print(df_10_tc.shape)
Filter data by birth_year
print(df_10_tc.shape)
# df_10_tc_20 means "top 10 tech ttwa clusters in 20 years dataframe"
df_10_tc_20 = df_10_tc[df_10_tc["birth_year"]>=1998]
# data beautify
df_10_tc_20 = df_10_tc_20.reset_index().drop("index",axis=1)
print(df_10_tc_20.shape)
df_10_tc_20.head()
Select the necessary columns:
ttwa
birth_year
& diss_year
latest_accounts_assets
df2 = df_10_tc_20.copy()
# select data in below columns
df2 = df2[["registered_number",\
"ttwa",\
"birth_year",\
"diss_year",\
"latest_accounts_assets"]]
df2.head()
Export above df2
into Excel to conduct a pivot operation
# export to excel to process
df2.to_excel(PATH + "/Top_10_Tech_TTWA_Cluster_Reg_Prepare.xlsx", index = False)
After doing pivot in Excel, read this .xlsx file to get the processed dataset df3
df3 = pd.read_excel(PATH + "/Excel/Top_10_Tech_TTWA_Cluster_Reg_Prepare.xlsx",sheet_name = "dynamics")
df3.head()
Distribution analysis for the dependent variable Entry_Rate
import seaborn as sns
import numpy as np
df_dist = df3.copy()
df_dist.Entry_Rate = np.log(df_dist["Entry_Rate"])
sns.displot(df_dist, x="Entry_Rate",bins=80)
df_dist = df3.copy()
df_dist.Average_of_latest_accounts_assets = np.log1p(df_dist["Average_of_latest_accounts_assets"])
sns.displot(df_dist, x="Average_of_latest_accounts_assets",bins=80)
Hypothesis 1:
$\ Dynamics\ Variable_{i,t}\ =\ a_1 + b_1 \times firms_{i,t} + b_2 \times location_i + b_3 \times year_t$ (1)
$\ Firm\ Performance_{i,t}\ =\ c_1 + d_1 \times firms_{i,t} + d_2 \times dynamics_{i,t} + d_3 \times location_i + d_4 \times year_t$ (2)
Hypothesis 2:
$\ log_e{\ Dynamics\ Variable_{i,t}}\ =\ a_1 + b_1 \times firms_{i,t} + b_2 \times location_i + b_3 \times year_t$ (1)
$\ log_e\ Firm\ Performance_{i,t}\ =\ c_1 + d_1 \times firms_{i,t} + d_2 \times dynamics_{i,t} + d_3 \times location_i + d_4 \times year_t$ (2)
Entry_Rate
in $t$ and $i$ Entry_Rate
in a specific $t$ and $i$ Count_of_tech_firms
in a specific $t$ and $i$birth_year
ttwa
(travel to work area)# environment preparation
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_rows', 300) # specifies number of rows to show
pd.options.display.float_format = '{:40,.4f}'.format # specifies default number format to 4 decimal places
plt.style.use('ggplot') # specifies that graphs should use ggplot styling
%matplotlib inline
df3_2 = df3.copy()
# intro dummy variables
df3_2.birth_year = df3_2.birth_year.astype("int")
df3_2_numeric = pd.get_dummies(df3_2)
# add dummies operation requirement
# Set the reference object is ttwa_E30000169(Birmingham)
# drop Birmingham for dummies variables comparasion
df3_2_numeric = df3_2_numeric.drop(['ttwa_E30000169'], axis=1)
# preprocess drop n/a value
df3_2_final = df3_2_numeric.dropna(subset = ["Average_of_latest_accounts_assets"])
# correlation analysis
df = df3_2_final
plt.rcParams["axes.grid"] = False
f = plt.figure(figsize=(19, 15))
plt.matshow(df.corr(), fignum=f.number)
plt.xticks(range(df.shape[1]), df.columns, fontsize=14, rotation=45)
plt.yticks(range(df.shape[1]), df.columns, fontsize=14)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)
plt.title('Correlation Matrix', fontsize=16)
High correlation variables groups:
Total_tech_firms
& birth_year
Count_of_tech_firms
& birth_year
Count_of_tech_firms
& E30000234
(London)Then I introduce VIF method to help remove columns which might cause strong multicollinearity problem
# VIF method
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
def drop_column_using_vif_(df, list_var_not_to_remove, thresh=5):
'''
Calculates VIF each feature in a pandas dataframe, and repeatedly drop the columns with the highest VIF
A constant must be added to variance_inflation_factor or the results will be incorrect
:param df: the pandas dataframe containing only the predictor features, not the response variable
:param list_var_not_to_remove: the list of variables that should not be removed even though it has a high VIF. For example, dummy (or indicator) variables represent a categorical variable with three or more categories.
:param thresh: the max VIF value before the feature is removed from the dataframe
:return: dataframe with multicollinear features removed
'''
while True:
# adding a constatnt item to the data
df_with_const = add_constant(df)
vif_df = pd.Series([variance_inflation_factor(df_with_const.values, i)
for i in range(df_with_const.shape[1])], name= "VIF",
index=df_with_const.columns).to_frame()
# drop the const
vif_df = vif_df.drop('const').drop(list_var_not_to_remove)
print('Max VIF:', vif_df.VIF.max())
# if the largest VIF is above the thresh, remove a variable with the largest VIF
if vif_df.VIF.max() > thresh:
# If there are multiple variables with the maximum VIF, choose the first one
index_to_drop = vif_df.index[vif_df.VIF == vif_df.VIF.max()].tolist()[0]
print('Dropping: {}'.format(index_to_drop))
df = df.drop(columns = index_to_drop)
else:
# No VIF is above threshold. Exit the loop
break
return df
predictors_df3 = df3_2_final.drop('Entry_Rate', axis=1)
response_df3 = df3_2_final['Entry_Rate']
# this is a list of dummy variables that represent a categorical variable with three or more categories. They should not be removed even if it has a high VIF.
# list_var_not_to_remove = ['ttwa_E30000234', 'Count_of_tech_firms', 'Total_tech_firms']
list_var_not_to_remove = ["birth_year","ttwa_E30000234"]
df_predictors_select_VIF = drop_column_using_vif_(predictors_df3, list_var_not_to_remove, thresh=5)
print("The columns remaining after VIF selection are:")
print(df_predictors_select_VIF.columns)
hypo1_1_regressor_OLS = sm.OLS(endog=response_df3, exog=sm.add_constant(df_predictors_select_VIF)).fit()
hypo1_1_regressor_OLS.summary()
Then, the residual analysis is conducted to test the regression performance
plt.scatter(hypo1_1_regressor_OLS.fittedvalues, hypo1_1_regressor_OLS.resid)
# adding title and labels
plt.xlabel('Fitted cases rate 1')
plt.ylabel('Residual')
plt.title('Figure')
plt.ylim(-0.5,0.5)
plt.show()
As given above, the performance for regression is quite good
predictors_df3 = df3_2_final.drop('Average_of_latest_accounts_assets', axis=1)
response_df3 = df3_2_final['Average_of_latest_accounts_assets']
# this is a list of dummy variables that represent a categorical variable with three or more categories. They should not be removed even if it has a high VIF.
# list_var_not_to_remove = ['ttwa_E30000234', 'Count_of_tech_firms', 'Total_tech_firms']
list_var_not_to_remove = ["birth_year","ttwa_E30000234","Entry_Rate"]
df_predictors_select_VIF = drop_column_using_vif_(predictors_df3, list_var_not_to_remove, thresh=5)
print("The columns remaining after VIF selection are:")
print(df_predictors_select_VIF.columns)
hypo1_2_regressor_OLS = sm.OLS(endog=response_df3, exog=sm.add_constant(df_predictors_select_VIF)).fit()
hypo1_2_regressor_OLS.summary()
plt.scatter(hypo1_2_regressor_OLS.fittedvalues, hypo1_2_regressor_OLS.resid)
# adding title and labels
plt.xlabel('Fitted cases rate 1')
plt.ylabel('Residual')
plt.title('Figure')
# plt.ylim(-0.5,0.5)
plt.show()
Refer to: Dynamics_variable_it = a1 + b2#firms_it-n + b3industry mix_it-n + Xb_it-n + location_i + year_t + e_it (1)
$\ log_e{\ Dynamics\ Variable_{i,t}}\ =\ a_1 + b_1 \times firms_{i,t} + b_2 \times location_i + b_3 \times year_t$ (1)
df3_3_final = df3_2_final.copy()
predictors_df3 = df3_3_final.drop({'Entry_Rate',"Average_of_latest_accounts_assets"}, axis=1)
response_df3 = np.log(df3_3_final['Entry_Rate'])
# this is a list of dummy variables that represent a categorical variable with three or more categories. They should not be removed even if it has a high VIF.
# list_var_not_to_remove = ['ttwa_E30000234', 'Count_of_tech_firms', 'Total_tech_firms']
list_var_not_to_remove = ["ttwa_E30000234","birth_year"]
df_predictors_select_VIF = drop_column_using_vif_(predictors_df3, list_var_not_to_remove, thresh=2.5)
print("The columns remaining after VIF selection are:")
print(df_predictors_select_VIF.columns)
hypo2_1_regressor_OLS = sm.OLS(endog=response_df3, exog=sm.add_constant(df_predictors_select_VIF)).fit()
hypo2_1_regressor_OLS.summary()
The result can be writen in the below formula:
$ log_e{\ Dynamics\ Variable_{i,t}} = -310 + 2.8e^{-5} \times firms_{i,t} + b_2 \times location_i + 0.15 \times year_t$
plt.scatter(hypo2_1_regressor_OLS.fittedvalues, hypo2_1_regressor_OLS.resid)
# adding title and labels
plt.xlabel('Fitted Dynamics Variables after Logarithmic Transformation')
plt.ylabel('Residual')
plt.title('Residual vs. Fitted Plot of Dynamics Variables')
plt.ylim(-2,2)
plt.show()
fig1 = plt.figure(figsize=(12,8))
fig1 = sm.graphics.plot_regress_exog(hypo2_1_regressor_OLS, 'birth_year', fig=fig1)
The second formula:
Refer to: Firm_performance_it = c1 + c2#firms_it-n + c3dynamics_it-n + c4industry mix_it-n + Xc_it-n + location_i + year_t + u_it (2)
$\ log_e\ Firm\ Performance_{i,t}\ =\ c_1 + d_1 \times firms_{i,t} + d_2 \times dynamics_{i,t} + d_3 \times location_i + d_4 \times year_t$ (2)
df3_3_final = df3_2_final.copy()
predictors_df3 = df3_3_final.drop({"Average_of_latest_accounts_assets"}, axis=1)
response_df3 = np.log(df3_3_final['Average_of_latest_accounts_assets'])
# this is a list of dummy variables that represent a categorical variable with three or more categories. They should not be removed even if it has a high VIF.
# list_var_not_to_remove = ['ttwa_E30000234', 'Count_of_tech_firms', 'Total_tech_firms']
list_var_not_to_remove = ["ttwa_E30000234","birth_year","Entry_Rate","Count_of_tech_firms"]
df_predictors_select_VIF = drop_column_using_vif_(predictors_df3, list_var_not_to_remove, thresh=2.5)
print("The columns remaining after VIF selection are:")
print(df_predictors_select_VIF.columns)
hypo2_2_regressor_OLS = sm.OLS(endog=response_df3, exog=sm.add_constant(df_predictors_select_VIF)).fit()
hypo2_2_regressor_OLS.summary()
$\ Firm\ Performance_{i,t}\ =\ 147.5 + 0.0001 \times firms_{i,t} + {-8.9585} \times dynamics_{i,t} + d_3 \times location_i + {-0.0675} \times year_t$ (2)
plt.scatter(hypo2_2_regressor_OLS.fittedvalues, hypo2_2_regressor_OLS.resid)
# adding title and labels
plt.xlabel('Fitted Dynamics Variables after Logarithmic Transformation')
plt.ylabel('Residual')
plt.title('Residual vs. Fitted Plot of Dynamics Variables')
plt.ylim(-3,3)
plt.show()
fig2 = plt.figure(figsize=(12,8))
fig2 = sm.graphics.plot_regress_exog(hypo2_2_regressor_OLS, 'Entry_Rate', fig=fig2)