Regression Research for Dynamics Variables (Top 10 Tech Clusters)

Current Question

  1. I used multiple linear regression for this research. Did the OLS estimation you wrote in the last communication use a time series forecasting model such as ARIMA? Do I need to use time series forecasting for the estimation?

    $Dynamics\ variable_{i,t} = a_1 + b_2 \times \#firms_{i,t-n} + b_3 \times industry\ mix_{i,t-n} + X b_{i,t-n} + location_i + year_t + e_{i,t}$ (1)

    $Firm\ performance_{i,t} = c_1 + c_2 \times \#firms_{i,t-n} + c_3 \times dynamics_{i,t-n} + c_4 \times industry\ mix_{i,t-n} + X c_{i,t-n} + location_i + year_t + u_{i,t}$ (2)

    P.S. I found that in the dataset, Firm_performance_it can only be measured by Average_of_latest_accounts_assets. Or are there other indicators that could be used to measure Firm_performance_it?
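On the $t-n$ terms in these specifications: if lagged regressors turn out to be needed, they could be built in pandas with a grouped shift rather than a full time-series model. A minimal sketch with $n = 1$ on a toy panel (column names follow df3; the data are made up):

```python
import pandas as pd

# Sketch: constructing the lagged "_it-n" regressors with a grouped shift
# (n = 1 here), on a toy panel with the same columns as df3
panel = pd.DataFrame({
    "ttwa": ["A", "A", "A", "B", "B", "B"],
    "birth_year": [1998, 1999, 2000, 1998, 1999, 2000],
    "Count_of_tech_firms": [57, 78, 82, 10, 12, 15],
})
panel = panel.sort_values(["ttwa", "birth_year"])
# within each ttwa, shift the series down by one year to get firms_{i,t-1}
panel["firms_lag1"] = panel.groupby("ttwa")["Count_of_tech_firms"].shift(1)
print(panel)
```

The first year of each cluster gets NaN for the lag, so those rows would drop out of a lagged regression.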

  2. The HHI index might not be available because a company’s market share is difficult to measure with the existing dataset
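For reference, if a market-share proxy ever became acceptable (for example, each firm's share of total assets within its cluster-year — an assumption for illustration, not something the current dataset supports), the HHI could be sketched as:

```python
import pandas as pd

# Hypothetical sketch: HHI from a market-share proxy. The firm-level frame
# and the use of assets as a share proxy are assumptions for illustration.
firms = pd.DataFrame({
    "ttwa":   ["A", "A", "A", "B", "B"],
    "year":   [2018, 2018, 2018, 2018, 2018],
    "assets": [100.0, 300.0, 600.0, 500.0, 500.0],
})

# each firm's share of total assets within its cluster-year
share = firms.groupby(["ttwa", "year"])["assets"].transform(lambda s: s / s.sum())

# HHI = sum of squared shares: 1.0 = monopoly, smaller = less concentrated
hhi = (share ** 2).groupby([firms["ttwa"], firms["year"]]).sum()
print(hhi)  # A: 0.1^2 + 0.3^2 + 0.6^2 = 0.46; B: 0.5^2 + 0.5^2 = 0.50
```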

1 Key Results

My Research Question:

  • How did tech clusters' dynamics patterns change in the UK from 1998 to 2018? / What factors can affect changes in tech clusters' dynamics patterns in the UK?

  • To what extent will dynamic changes affect tech clusters’ performance?

1.2 Data

Below is my dataset for the regression research; you can access/download it via this link

In [18]:
df3.head()
Out[18]:
ttwa birth_year Count_of_tech_firms Average_of_latest_accounts_assets Total_tech_firms Entry_Rate
0 E30000169 1998 57 154,110.7632 9047 0.0063
1 E30000169 1999 78 392,610.4737 9047 0.0086
2 E30000169 2000 82 234,658.3898 9047 0.0091
3 E30000169 2001 67 198,533.5192 9047 0.0074
4 E30000169 2002 125 169,835.2778 9047 0.0138

1.3 Hypothesis

With reference to the OLS estimation you suggested last time, I made two assumptions in section 2.3 Hypothesis. I selected the second assumption for the key results below:

$\ log_e{\ Dynamics\ Variable_{i,t}}\ =\ a_1 + b_1 \times firms_{i,t} + b_2 \times location_i + b_3 \times year_t$ (1)

$\ log_e\ Firm\ Performance_{i,t}\ =\ c_1 + d_1 \times firms_{i,t} + d_2 \times dynamics_{i,t} + d_3 \times location_i + d_4 \times year_t$ (2)

  • $Dynamics\ Variable_{i,t}$ in (1) means Entry_Rate in a specific $t$ and $i$
  • $dynamics_{i,t}$ in (2) means Entry_Rate in a specific $t$ and $i$
  • $firms_{i,t}$ means Count_of_tech_firms in a specific $t$ and $i$
  • $t$ means year -- birth_year
  • $i$ means location -- ttwa (travel to work area)
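As a side note, this specification can be written compactly with statsmodels' formula API, which builds the ttwa dummies automatically and lets Birmingham be set as the reference category. A minimal sketch on synthetic data (the real analysis uses df3; the coefficients used to simulate the data here are assumptions chosen to mimic equation (1)):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Minimal sketch of specification (1) on a synthetic panel shaped like df3
rng = np.random.default_rng(0)
n = 120
df = pd.DataFrame({
    "ttwa": rng.choice(["E30000169", "E30000234", "E30000239"], size=n),
    "birth_year": rng.integers(1998, 2019, size=n),
    "Count_of_tech_firms": rng.integers(50, 500, size=n),
})
# simulate log(Entry_Rate) from assumed coefficients plus noise
df["Entry_Rate"] = np.exp(-310 + 0.15 * df["birth_year"]
                          + 2.8e-5 * df["Count_of_tech_firms"]
                          + rng.normal(scale=0.1, size=n))

# C(ttwa, Treatment(...)) creates the dummies with Birmingham as the
# dropped reference category, mirroring the setup in section 2.4
model = smf.ols(
    "np.log(Entry_Rate) ~ Count_of_tech_firms + birth_year"
    " + C(ttwa, Treatment(reference='E30000169'))",
    data=df,
).fit()
print(model.params["birth_year"])  # recovers roughly 0.15
```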

1.4 Regression

For the multiple linear regression of the above assumptions, the final results are given below; the goodness of fit is acceptable ($R^2$ is high):

$\ log_e{\ Dynamics\ Variable_{i,t}} = -310 + 2.8 \times 10^{-5} \times firms_{i,t} + b_2 \times location_i + 0.15 \times year_t$ (1)

$\ log_e\ Firm\ Performance_{i,t} = 147.5 + 0.0001 \times firms_{i,t} - 8.9585 \times dynamics_{i,t} + d_3 \times location_i - 0.0675 \times year_t$ (2)

NOTE: the $b_2$ and $d_3$ values should be read from the following regression summaries, because $location_i$ enters as a set of dummy variables.

The regression report details are given below (Report No.1 & No.2), and further residual analysis can be viewed in section 2.4.2

【Report No.1】

This is the Dynamics Variable (1) regression summary:

NOTE: for the dummy variable $location_i$, ttwa_E30000169 (Birmingham) is dropped as the reference category in the regression

In [99]:
# for formula (1)
hypo2_1_regressor_OLS.summary()
Out[99]:
OLS Regression Results
Dep. Variable: Entry_Rate R-squared: 0.932
Model: OLS Adj. R-squared: 0.928
Method: Least Squares F-statistic: 238.5
Date: Tue, 27 Jul 2021 Prob (F-statistic): 3.43e-105
Time: 00:49:45 Log-Likelihood: -9.0648
No. Observations: 203 AIC: 42.13
Df Residuals: 191 BIC: 81.89
Df Model: 11
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -310.4470 6.992 -44.399 0.000 -324.239 -296.655
birth_year 0.1529 0.003 43.878 0.000 0.146 0.160
Count_of_tech_firms 2.804e-05 1.13e-05 2.489 0.014 5.82e-06 5.03e-05
ttwa_E30000175 -0.4938 0.082 -5.986 0.000 -0.657 -0.331
ttwa_E30000179 -0.1491 0.082 -1.829 0.069 -0.310 0.012
ttwa_E30000180 0.1377 0.082 1.669 0.097 -0.025 0.300
ttwa_E30000186 0.1193 0.083 1.446 0.150 -0.043 0.282
ttwa_E30000212 0.2416 0.083 2.928 0.004 0.079 0.404
ttwa_E30000234 -0.2003 0.099 -2.028 0.044 -0.395 -0.005
ttwa_E30000237 -0.0440 0.082 -0.533 0.595 -0.207 0.119
ttwa_E30000239 0.0074 0.082 0.091 0.928 -0.154 0.168
ttwa_E30000266 0.0248 0.083 0.300 0.765 -0.138 0.188
Omnibus: 6.590 Durbin-Watson: 1.008
Prob(Omnibus): 0.037 Jarque-Bera (JB): 6.880
Skew: 0.316 Prob(JB): 0.0321
Kurtosis: 3.643 Cond. No. 1.04e+06


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.04e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

【Report No.2】

This is the Firms' Performance (2) regression summary:

In [100]:
# for formula (2)

hypo2_2_regressor_OLS.summary()
Out[100]:
OLS Regression Results
Dep. Variable: Average_of_latest_accounts_assets R-squared: 0.745
Model: OLS Adj. R-squared: 0.729
Method: Least Squares F-statistic: 46.37
Date: Tue, 27 Jul 2021 Prob (F-statistic): 6.23e-50
Time: 00:50:00 Log-Likelihood: -99.816
No. Observations: 203 AIC: 225.6
Df Residuals: 190 BIC: 268.7
Df Model: 12
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 147.5047 22.122 6.668 0.000 103.868 191.142
birth_year -0.0675 0.011 -6.111 0.000 -0.089 -0.046
Count_of_tech_firms 0.0001 1.95e-05 5.773 0.000 7.42e-05 0.000
Entry_Rate -8.9585 1.814 -4.940 0.000 -12.536 -5.381
ttwa_E30000175 -0.2359 0.129 -1.823 0.070 -0.491 0.019
ttwa_E30000179 -0.2070 0.128 -1.618 0.107 -0.459 0.045
ttwa_E30000180 0.0600 0.129 0.464 0.643 -0.195 0.315
ttwa_E30000186 0.1323 0.129 1.023 0.308 -0.123 0.387
ttwa_E30000212 0.2389 0.129 1.845 0.067 -0.016 0.494
ttwa_E30000234 0.0065 0.159 0.041 0.967 -0.307 0.320
ttwa_E30000237 0.0750 0.129 0.580 0.563 -0.180 0.330
ttwa_E30000239 -0.0467 0.128 -0.365 0.716 -0.299 0.206
ttwa_E30000266 0.0912 0.129 0.704 0.482 -0.164 0.347
Omnibus: 103.810 Durbin-Watson: 1.682
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1355.690
Skew: -1.582 Prob(JB): 4.13e-295
Kurtosis: 15.259 Cond. No. 2.11e+06


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.11e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

To look up the ttwa cluster names:

In [101]:
df_ttwa_10.style
Out[101]:
ttwa_code name counts
0 E30000234 London 119008
1 E30000239 Manchester 18273
2 E30000266 Slough and Heathrow 17367
3 E30000169 Birmingham 9593
4 E30000237 Luton 8873
5 E30000175 Bournemouth 7546
6 E30000179 Brighton 7483
7 E30000186 Cambridge 6252
8 E30000180 Bristol 6176
9 E30000212 Guildford and Aldershot 5986

1.5 Insights

$\ log_e{\ Dynamics\ Variable_{i,t}} = -310 + 2.8 \times 10^{-5} \times firms_{i,t} + b_2 \times location_i + 0.15 \times year_t$ (1)

$\ log_e\ Firm\ Performance_{i,t} = 147.5 + 0.0001 \times firms_{i,t} - 8.9585 \times dynamics_{i,t} + d_3 \times location_i - 0.0675 \times year_t$ (2)

  1. As the number of firms in a cluster increases and the years go by, more firms enter that cluster
  2. The E30000212 (Guildford and Aldershot) and E30000186 (Cambridge) areas attract firms into their corresponding tech clusters more strongly than E30000169 (Birmingham)
  3. The second result is not so reliable because firm performance is based on the Average_of_latest_accounts_assets variable, which has 41% missing values in the dataset
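Since both fitted equations are log-linear, a coefficient $b$ implies roughly a $100 \times (e^{b} - 1)\%$ change in the dependent variable per unit increase in the regressor. A quick check of the headline coefficients:

```python
import math

# log-linear interpretation: in log_e(y) = ... + b * x, a one-unit
# increase in x multiplies y by exp(b)
year_coef = 0.1529    # birth_year coefficient in regression (1)
entry_coef = -8.9585  # Entry_Rate coefficient in regression (2)

# each additional year multiplies the entry rate by exp(0.1529),
# i.e. roughly +16.5% per year
print(math.exp(year_coef) - 1)

# a one-percentage-point (0.01) rise in Entry_Rate multiplies average
# assets by exp(-8.9585 * 0.01), i.e. roughly -8.6%
print(math.exp(entry_coef * 0.01) - 1)
```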

2 Reproducible Analysis

2.1 Read Data

In [ ]:
import pandas as pd
import sys
PATH = sys.path[0]

df = pd.read_csv(PATH + "/Dataset/fame_OC_tech_firm.csv",low_memory=False)
df = df.drop(["Unnamed: 0", "Unnamed: 0.1", "Unnamed: 0.1.1", "is_tech"], axis=1)

# preprocess the data
# drop the null value record in birth year and ttwa(travel to work area) columns
df = df.dropna(subset=["birth_year", "ttwa"])

2.2 Data Preparation

Get the top 10 TTWAs with the largest numbers of firms

In [4]:
# Generate the top 10 tech clusters table based on the ttwa
top_10_ttwa = df.ttwa.value_counts().head(10)
top_10_ttwa = pd.DataFrame(top_10_ttwa).reset_index()
top_10_ttwa.columns = ["ttwa_code","counts"]

df_ttwa = pd.read_csv(PATH + "/Dataset/ttwa.csv")
df_ttwa = pd.merge(top_10_ttwa, df_ttwa[["code","name"]], left_on="ttwa_code",right_on="code")
df_ttwa_10 = df_ttwa[["ttwa_code", "name","counts"]]
df_ttwa_10.style
Out[4]:
ttwa_code name counts
0 E30000234 London 119008
1 E30000239 Manchester 18273
2 E30000266 Slough and Heathrow 17367
3 E30000169 Birmingham 9593
4 E30000237 Luton 8873
5 E30000175 Bournemouth 7546
6 E30000179 Brighton 7483
7 E30000186 Cambridge 6252
8 E30000180 Bristol 6176
9 E30000212 Guildford and Aldershot 5986

Filter data by ttwa

In [5]:
# df_10_tc means "top 10 tech ttwa clusters"
df_10_tc = df.copy()
print(df_10_tc.shape)

# filter data to the top 10 ttwa rows
df_10_tc = df_10_tc[df_10_tc.ttwa.isin(df_ttwa_10.ttwa_code.to_list())]
print(df_10_tc.shape)
(423634, 64)
(206557, 64)

Filter data by birth_year

In [9]:
print(df_10_tc.shape)
# df_10_tc_20 means "top 10 tech ttwa clusters in 20 years dataframe"
df_10_tc_20 = df_10_tc[df_10_tc["birth_year"]>=1998]
# reset the index after filtering
df_10_tc_20 = df_10_tc_20.reset_index(drop=True)
print(df_10_tc_20.shape)
df_10_tc_20.head()
(206557, 64)
(196071, 64)
Out[9]:
registered_number incorporation_date dissolution_date latest_accounts_cash latest_accounts_assets latest_accounts_liabilities year_obs sic_year sic4 change_sic ... pcd2 pcds cty laua ctry rgn ttwa pct lep1 lep2
0 07341142 2010-08-10 NaN NaN 56331.0 NaN 2017 2007.0 6202.0 0.0 ... AL1 1AJ AL1 1AJ E10000015 E07000240 E92000001 E12000006 E30000237 E16000150 E37000017 NaN
1 10490667 2016-11-22 NaN NaN 11254.0 NaN 2018 2007.0 7211.0 0.0 ... AL1 1AU AL1 1AU E10000015 E07000240 E92000001 E12000006 E30000237 E16000150 E37000017 NaN
2 11283731 2018-03-29 NaN NaN NaN NaN 2018 2007.0 6201.0 0.0 ... AL1 1DG AL1 1DG E10000015 E07000240 E92000001 E12000006 E30000237 E16000150 E37000017 NaN
3 09852054 2015-11-02 NaN NaN 20002.0 NaN 2017 2007.0 6201.0 0.0 ... AL1 1DG AL1 1DG E10000015 E07000240 E92000001 E12000006 E30000237 E16000150 E37000017 NaN
4 08577485 2013-06-20 NaN NaN 8390.0 11888.0 2017 2007.0 8621.0 0.0 ... AL1 1DS AL1 1DS E10000015 E07000240 E92000001 E12000006 E30000237 E16000150 E37000017 NaN

5 rows × 64 columns

Select the necessary columns:

  • location: ttwa
  • year: birth_year & diss_year
  • firms' performance: latest_accounts_assets
In [257]:
df2 = df_10_tc_20.copy()

# select data in below columns
df2 = df2[["registered_number",\
        "ttwa",\
        "birth_year",\
        "diss_year",\
        "latest_accounts_assets"]]

df2.head()
Out[257]:
registered_number ttwa birth_year diss_year latest_accounts_assets
0 07341142 E30000237 2,010.0000 nan 56,331.0000
1 10490667 E30000237 2,016.0000 nan 11,254.0000
2 11283731 E30000237 2,018.0000 nan nan
3 09852054 E30000237 2,015.0000 nan 20,002.0000
4 08577485 E30000237 2,013.0000 nan 8,390.0000

Export df2 above to Excel to conduct a pivot operation

In [258]:
# export to excel to process
df2.to_excel(PATH + "/Top_10_Tech_TTWA_Cluster_Reg_Prepare.xlsx", index = False)

After doing the pivot in Excel, read the .xlsx file back to get the processed dataset df3

In [10]:
df3 = pd.read_excel(PATH + "/Excel/Top_10_Tech_TTWA_Cluster_Reg_Prepare.xlsx",sheet_name = "dynamics")
df3.head()
Out[10]:
ttwa birth_year Count_of_tech_firms Average_of_latest_accounts_assets Total_tech_firms Entry_Rate
0 E30000169 1998 57 154110.763158 9047 0.006300
1 E30000169 1999 78 392610.473684 9047 0.008622
2 E30000169 2000 82 234658.389831 9047 0.009064
3 E30000169 2001 67 198533.519231 9047 0.007406
4 E30000169 2002 125 169835.277778 9047 0.013817
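For reproducibility, the same pivot could also be sketched directly in pandas instead of Excel. The toy frame below mimics df2, and the Entry_Rate definition (cluster-year birth count over the cluster's total firms) is an inference from the figures above:

```python
import pandas as pd

# Sketch of the Excel pivot in pandas, on a toy df2-shaped frame.
# Assumed definitions (consistent with the numbers shown above):
#   Count_of_tech_firms = firms born in each (ttwa, birth_year)
#   Total_tech_firms    = all firms observed in the ttwa
#   Entry_Rate          = Count_of_tech_firms / Total_tech_firms
df2 = pd.DataFrame({
    "registered_number": ["01", "02", "03", "04"],
    "ttwa": ["E30000169", "E30000169", "E30000169", "E30000234"],
    "birth_year": [1998, 1998, 1999, 1998],
    "latest_accounts_assets": [100.0, 300.0, None, 50.0],
})

df3 = (df2.groupby(["ttwa", "birth_year"])
          .agg(Count_of_tech_firms=("registered_number", "size"),
               Average_of_latest_accounts_assets=("latest_accounts_assets", "mean"))
          .reset_index())
df3["Total_tech_firms"] = df3.groupby("ttwa")["Count_of_tech_firms"].transform("sum")
df3["Entry_Rate"] = df3["Count_of_tech_firms"] / df3["Total_tech_firms"]
print(df3)
```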

Distribution analysis (after log transformation) for the dependent variables Entry_Rate and Average_of_latest_accounts_assets

In [15]:
import seaborn as sns
import numpy as np

df_dist = df3.copy()
df_dist.Entry_Rate = np.log(df_dist["Entry_Rate"])
sns.displot(df_dist, x="Entry_Rate",bins=80)
Out[15]:
<seaborn.axisgrid.FacetGrid at 0x7ffac9c90fd0>
In [67]:
df_dist = df3.copy()
df_dist.Average_of_latest_accounts_assets = np.log1p(df_dist["Average_of_latest_accounts_assets"])
sns.displot(df_dist, x="Average_of_latest_accounts_assets",bins=80)
Out[67]:
<seaborn.axisgrid.FacetGrid at 0x7ffb01926898>

2.3 Hypothesis

Hypothesis 1:

$\ Dynamics\ Variable_{i,t}\ =\ a_1 + b_1 \times firms_{i,t} + b_2 \times location_i + b_3 \times year_t$ (1)

$\ Firm\ Performance_{i,t}\ =\ c_1 + d_1 \times firms_{i,t} + d_2 \times dynamics_{i,t} + d_3 \times location_i + d_4 \times year_t$ (2)

Hypothesis 2:

$\ log_e{\ Dynamics\ Variable_{i,t}}\ =\ a_1 + b_1 \times firms_{i,t} + b_2 \times location_i + b_3 \times year_t$ (1)

$\ log_e\ Firm\ Performance_{i,t}\ =\ c_1 + d_1 \times firms_{i,t} + d_2 \times dynamics_{i,t} + d_3 \times location_i + d_4 \times year_t$ (2)

  • $Dynamics\ Variable_{i,t}$ in (1) means Entry_Rate in a specific $t$ and $i$
  • $dynamics_{i,t}$ in (2) means Entry_Rate in a specific $t$ and $i$
  • $firms_{i,t}$ means Count_of_tech_firms in a specific $t$ and $i$
  • $t$ means year -- birth_year
  • $i$ means location -- ttwa (travel to work area)

2.4 Regression

In [68]:
# environment preparation
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_rows', 300) # specifies number of rows to show
pd.options.display.float_format = '{:40,.4f}'.format # specifies default number format to 4 decimal places
plt.style.use('ggplot') # specifies that graphs should use ggplot styling
%matplotlib inline

2.4.1 Hypothesis 1

In [38]:
df3_2 = df3.copy()

# intro dummy variables
df3_2.birth_year = df3_2.birth_year.astype("int")
df3_2_numeric = pd.get_dummies(df3_2)

# dummy-variable setup:
# set the reference category to ttwa_E30000169 (Birmingham)
# drop the Birmingham dummy so the other clusters are compared against it
df3_2_numeric = df3_2_numeric.drop(['ttwa_E30000169'], axis=1)

# preprocess drop n/a value
df3_2_final = df3_2_numeric.dropna(subset = ["Average_of_latest_accounts_assets"])

# correlation analysis
df = df3_2_final
plt.rcParams["axes.grid"] = False
f = plt.figure(figsize=(19, 15))
plt.matshow(df.corr(), fignum=f.number)
plt.xticks(range(df.shape[1]), df.columns, fontsize=14, rotation=45)
plt.yticks(range(df.shape[1]), df.columns, fontsize=14)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)
plt.title('Correlation Matrix', fontsize=16)
Out[38]:
Text(0.5, 1.0, 'Correlation Matrix')

Highly correlated variable groups:

  1. Total_tech_firms & birth_year
  2. Count_of_tech_firms & birth_year
  3. Count_of_tech_firms & E30000234 (London)

Then I introduce the VIF method to help remove columns that might cause a strong multicollinearity problem

In [69]:
# VIF method
from statsmodels.stats.outliers_influence import variance_inflation_factor 
from statsmodels.tools.tools import add_constant

def drop_column_using_vif_(df, list_var_not_to_remove, thresh=5):
    '''
    Calculates the VIF for each feature in a pandas dataframe, and repeatedly drops the column with the highest VIF.
    A constant must be added for variance_inflation_factor or the results will be incorrect.

    :param df: the pandas dataframe containing only the predictor features, not the response variable
    :param list_var_not_to_remove: the list of variables that should not be removed even though they have a high VIF. For example, dummy (or indicator) variables representing a categorical variable with three or more categories.
    :param thresh: the max VIF value before the feature is removed from the dataframe
    :return: dataframe with multicollinear features removed
    '''
    while True:
        # adding a constant term to the data
        df_with_const = add_constant(df)

        vif_df = pd.Series([variance_inflation_factor(df_with_const.values, i) 
               for i in range(df_with_const.shape[1])], name= "VIF",
              index=df_with_const.columns).to_frame()

        # drop the const
        vif_df = vif_df.drop('const').drop(list_var_not_to_remove)
        
        print('Max VIF:', vif_df.VIF.max())
        
        # if the largest VIF is above the thresh, remove a variable with the largest VIF
        if vif_df.VIF.max() > thresh:
            # If there are multiple variables with the maximum VIF, choose the first one
            index_to_drop = vif_df.index[vif_df.VIF == vif_df.VIF.max()].tolist()[0]
            print('Dropping: {}'.format(index_to_drop))
            df = df.drop(columns = index_to_drop)
        else:
            # No VIF is above threshold. Exit the loop
            break

    return df
In [84]:
predictors_df3 = df3_2_final.drop('Entry_Rate', axis=1)
response_df3 = df3_2_final['Entry_Rate']

# variables that should not be removed even if they have a high VIF (e.g., dummies representing a categorical variable with three or more categories)
list_var_not_to_remove = ["birth_year","ttwa_E30000234"]
df_predictors_select_VIF = drop_column_using_vif_(predictors_df3, list_var_not_to_remove, thresh=5)
print("The columns remaining after VIF selection are:")
print(df_predictors_select_VIF.columns)

hypo1_1_regressor_OLS = sm.OLS(endog=response_df3, exog=sm.add_constant(df_predictors_select_VIF)).fit()
hypo1_1_regressor_OLS.summary()
Max VIF: inf
Dropping: Total_tech_firms
Max VIF: 2.5658319318283667
The columns remaining after VIF selection are:
Index(['birth_year', 'Count_of_tech_firms',
       'Average_of_latest_accounts_assets', 'ttwa_E30000175', 'ttwa_E30000179',
       'ttwa_E30000180', 'ttwa_E30000186', 'ttwa_E30000212', 'ttwa_E30000234',
       'ttwa_E30000237', 'ttwa_E30000239', 'ttwa_E30000266'],
      dtype='object')
Out[84]:
OLS Regression Results
Dep. Variable: Entry_Rate R-squared: 0.838
Model: OLS Adj. R-squared: 0.828
Method: Least Squares F-statistic: 81.90
Date: Tue, 27 Jul 2021 Prob (F-statistic): 2.53e-68
Time: 00:33:48 Log-Likelihood: 555.40
No. Observations: 203 AIC: -1085.
Df Residuals: 190 BIC: -1042.
Df Model: 12
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -11.5268 0.668 -17.259 0.000 -12.844 -10.209
birth_year 0.0058 0.000 17.358 0.000 0.005 0.006
Count_of_tech_firms 4.046e-06 7.6e-07 5.323 0.000 2.55e-06 5.55e-06
Average_of_latest_accounts_assets 3.257e-08 1.77e-08 1.838 0.068 -2.39e-09 6.75e-08
ttwa_E30000175 -0.0012 0.005 -0.240 0.811 -0.011 0.009
ttwa_E30000179 0.0041 0.005 0.807 0.421 -0.006 0.014
ttwa_E30000180 0.0022 0.005 0.422 0.673 -0.008 0.012
ttwa_E30000186 -0.0014 0.005 -0.277 0.782 -0.012 0.009
ttwa_E30000212 0.0022 0.005 0.428 0.669 -0.008 0.012
ttwa_E30000234 -0.0199 0.006 -3.234 0.001 -0.032 -0.008
ttwa_E30000237 -0.0008 0.005 -0.156 0.877 -0.011 0.009
ttwa_E30000239 0.0007 0.005 0.142 0.887 -0.009 0.011
ttwa_E30000266 -0.0008 0.005 -0.148 0.883 -0.011 0.009
Omnibus: 44.698 Durbin-Watson: 0.499
Prob(Omnibus): 0.000 Jarque-Bera (JB): 86.296
Skew: 1.074 Prob(JB): 1.82e-19
Kurtosis: 5.364 Cond. No. 1.04e+08


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.04e+08. This might indicate that there are
strong multicollinearity or other numerical problems.

Then, the residual analysis is conducted to test the regression performance

In [85]:
plt.scatter(hypo1_1_regressor_OLS.fittedvalues, hypo1_1_regressor_OLS.resid)
# adding title and labels
plt.xlabel('Fitted Entry_Rate')
plt.ylabel('Residual')
plt.title('Residual vs. Fitted Plot of Entry_Rate')
plt.ylim(-0.5,0.5)
plt.show()

As shown above, the residuals scatter fairly evenly around zero, so the regression fit looks reasonable

In [86]:
predictors_df3 = df3_2_final.drop('Average_of_latest_accounts_assets', axis=1)
response_df3 = df3_2_final['Average_of_latest_accounts_assets']

# variables that should not be removed even if they have a high VIF (e.g., dummies representing a categorical variable with three or more categories)
list_var_not_to_remove = ["birth_year","ttwa_E30000234","Entry_Rate"]
df_predictors_select_VIF = drop_column_using_vif_(predictors_df3, list_var_not_to_remove, thresh=5)
print("The columns remaining after VIF selection are:")
print(df_predictors_select_VIF.columns)

hypo1_2_regressor_OLS = sm.OLS(endog=response_df3, exog=sm.add_constant(df_predictors_select_VIF)).fit()
hypo1_2_regressor_OLS.summary()
Max VIF: inf
Dropping: Total_tech_firms
Max VIF: 2.607020276192744
The columns remaining after VIF selection are:
Index(['birth_year', 'Count_of_tech_firms', 'Entry_Rate', 'ttwa_E30000175',
       'ttwa_E30000179', 'ttwa_E30000180', 'ttwa_E30000186', 'ttwa_E30000212',
       'ttwa_E30000234', 'ttwa_E30000237', 'ttwa_E30000239', 'ttwa_E30000266'],
      dtype='object')
Out[86]:
OLS Regression Results
Dep. Variable: Average_of_latest_accounts_assets R-squared: 0.617
Model: OLS Adj. R-squared: 0.593
Method: Least Squares F-statistic: 25.51
Date: Tue, 27 Jul 2021 Prob (F-statistic): 1.73e-33
Time: 00:33:55 Log-Likelihood: -2533.4
No. Observations: 203 AIC: 5093.
Df Residuals: 190 BIC: 5136.
Df Model: 12
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 3.429e+07 3.56e+06 9.635 0.000 2.73e+07 4.13e+07
birth_year -1.702e+04 1777.930 -9.575 0.000 -2.05e+04 -1.35e+04
Count_of_tech_firms 14.1961 3.141 4.519 0.000 8.000 20.392
Entry_Rate 5.362e+05 2.92e+05 1.838 0.068 -3.93e+04 1.11e+06
ttwa_E30000175 -3.437e+04 2.08e+04 -1.651 0.100 -7.54e+04 6690.439
ttwa_E30000179 -3.258e+04 2.06e+04 -1.583 0.115 -7.32e+04 8018.824
ttwa_E30000180 -2816.1829 2.08e+04 -0.135 0.893 -4.39e+04 3.82e+04
ttwa_E30000186 1.849e+04 2.08e+04 0.888 0.375 -2.26e+04 5.95e+04
ttwa_E30000212 2.169e+04 2.08e+04 1.041 0.299 -1.94e+04 6.28e+04
ttwa_E30000234 4219.1182 2.56e+04 0.165 0.869 -4.63e+04 5.47e+04
ttwa_E30000237 1.075e+04 2.08e+04 0.517 0.606 -3.03e+04 5.18e+04
ttwa_E30000239 6566.5115 2.06e+04 0.319 0.750 -3.4e+04 4.72e+04
ttwa_E30000266 -1716.2042 2.08e+04 -0.082 0.934 -4.28e+04 3.94e+04
Omnibus: 55.982 Durbin-Watson: 1.846
Prob(Omnibus): 0.000 Jarque-Bera (JB): 175.434
Skew: 1.114 Prob(JB): 8.03e-39
Kurtosis: 6.972 Cond. No. 2.11e+06


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.11e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
In [87]:
plt.scatter(hypo1_2_regressor_OLS.fittedvalues, hypo1_2_regressor_OLS.resid)
# adding title and labels
plt.xlabel('Fitted Average_of_latest_accounts_assets')
plt.ylabel('Residual')
plt.title('Residual vs. Fitted Plot of Firm Performance')
# plt.ylim(-0.5,0.5)
plt.show()

2.4.2 Hypothesis 2

Refer to: $Dynamics\ variable_{i,t} = a_1 + b_2 \times \#firms_{i,t-n} + b_3 \times industry\ mix_{i,t-n} + X b_{i,t-n} + location_i + year_t + e_{i,t}$ (1)

$\ log_e{\ Dynamics\ Variable_{i,t}}\ =\ a_1 + b_1 \times firms_{i,t} + b_2 \times location_i + b_3 \times year_t$ (1)

In [88]:
df3_3_final = df3_2_final.copy()

predictors_df3 = df3_3_final.drop(['Entry_Rate', 'Average_of_latest_accounts_assets'], axis=1)
response_df3 = np.log(df3_3_final['Entry_Rate'])

# variables that should not be removed even if they have a high VIF (e.g., dummies representing a categorical variable with three or more categories)
list_var_not_to_remove = ["ttwa_E30000234","birth_year"]
df_predictors_select_VIF = drop_column_using_vif_(predictors_df3, list_var_not_to_remove, thresh=2.5)
print("The columns remaining after VIF selection are:")
print(df_predictors_select_VIF.columns)

hypo2_1_regressor_OLS = sm.OLS(endog=response_df3, exog=sm.add_constant(df_predictors_select_VIF)).fit()
hypo2_1_regressor_OLS.summary()
Max VIF: inf
Dropping: Total_tech_firms
Max VIF: 2.13351053449709
The columns remaining after VIF selection are:
Index(['birth_year', 'Count_of_tech_firms', 'ttwa_E30000175', 'ttwa_E30000179',
       'ttwa_E30000180', 'ttwa_E30000186', 'ttwa_E30000212', 'ttwa_E30000234',
       'ttwa_E30000237', 'ttwa_E30000239', 'ttwa_E30000266'],
      dtype='object')
Out[88]:
OLS Regression Results
Dep. Variable: Entry_Rate R-squared: 0.932
Model: OLS Adj. R-squared: 0.928
Method: Least Squares F-statistic: 238.5
Date: Tue, 27 Jul 2021 Prob (F-statistic): 3.43e-105
Time: 00:34:53 Log-Likelihood: -9.0648
No. Observations: 203 AIC: 42.13
Df Residuals: 191 BIC: 81.89
Df Model: 11
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -310.4470 6.992 -44.399 0.000 -324.239 -296.655
birth_year 0.1529 0.003 43.878 0.000 0.146 0.160
Count_of_tech_firms 2.804e-05 1.13e-05 2.489 0.014 5.82e-06 5.03e-05
ttwa_E30000175 -0.4938 0.082 -5.986 0.000 -0.657 -0.331
ttwa_E30000179 -0.1491 0.082 -1.829 0.069 -0.310 0.012
ttwa_E30000180 0.1377 0.082 1.669 0.097 -0.025 0.300
ttwa_E30000186 0.1193 0.083 1.446 0.150 -0.043 0.282
ttwa_E30000212 0.2416 0.083 2.928 0.004 0.079 0.404
ttwa_E30000234 -0.2003 0.099 -2.028 0.044 -0.395 -0.005
ttwa_E30000237 -0.0440 0.082 -0.533 0.595 -0.207 0.119
ttwa_E30000239 0.0074 0.082 0.091 0.928 -0.154 0.168
ttwa_E30000266 0.0248 0.083 0.300 0.765 -0.138 0.188
Omnibus: 6.590 Durbin-Watson: 1.008
Prob(Omnibus): 0.037 Jarque-Bera (JB): 6.880
Skew: 0.316 Prob(JB): 0.0321
Kurtosis: 3.643 Cond. No. 1.04e+06


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.04e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

The result can be written as the formula below:

$ log_e{\ Dynamics\ Variable_{i,t}} = -310 + 2.8 \times 10^{-5} \times firms_{i,t} + b_2 \times location_i + 0.15 \times year_t$

In [56]:
plt.scatter(hypo2_1_regressor_OLS.fittedvalues, hypo2_1_regressor_OLS.resid)
# adding title and labels
plt.xlabel('Fitted Dynamics Variables after Logarithmic Transformation')
plt.ylabel('Residual')
plt.title('Residual vs. Fitted Plot of Dynamics Variables')
plt.ylim(-2,2)
plt.show()
In [57]:
fig1 = plt.figure(figsize=(12,8))
fig1 = sm.graphics.plot_regress_exog(hypo2_1_regressor_OLS, 'birth_year', fig=fig1)

The second formula:

Refer to: $Firm\ performance_{i,t} = c_1 + c_2 \times \#firms_{i,t-n} + c_3 \times dynamics_{i,t-n} + c_4 \times industry\ mix_{i,t-n} + X c_{i,t-n} + location_i + year_t + u_{i,t}$ (2)

$\ log_e\ Firm\ Performance_{i,t}\ =\ c_1 + d_1 \times firms_{i,t} + d_2 \times dynamics_{i,t} + d_3 \times location_i + d_4 \times year_t$ (2)

In [95]:
df3_3_final = df3_2_final.copy()

predictors_df3 = df3_3_final.drop(['Average_of_latest_accounts_assets'], axis=1)
response_df3 = np.log(df3_3_final['Average_of_latest_accounts_assets'])

# variables that should not be removed even if they have a high VIF (e.g., dummies representing a categorical variable with three or more categories)
list_var_not_to_remove = ["ttwa_E30000234","birth_year","Entry_Rate","Count_of_tech_firms"]
df_predictors_select_VIF = drop_column_using_vif_(predictors_df3, list_var_not_to_remove, thresh=2.5)
print("The columns remaining after VIF selection are:")
print(df_predictors_select_VIF.columns)

hypo2_2_regressor_OLS = sm.OLS(endog=response_df3, exog=sm.add_constant(df_predictors_select_VIF)).fit()
hypo2_2_regressor_OLS.summary()
Max VIF: inf
Dropping: Total_tech_firms
Max VIF: 1.8437386819351917
The columns remaining after VIF selection are:
Index(['birth_year', 'Count_of_tech_firms', 'Entry_Rate', 'ttwa_E30000175',
       'ttwa_E30000179', 'ttwa_E30000180', 'ttwa_E30000186', 'ttwa_E30000212',
       'ttwa_E30000234', 'ttwa_E30000237', 'ttwa_E30000239', 'ttwa_E30000266'],
      dtype='object')
Out[95]:
OLS Regression Results
Dep. Variable: Average_of_latest_accounts_assets R-squared: 0.745
Model: OLS Adj. R-squared: 0.729
Method: Least Squares F-statistic: 46.37
Date: Tue, 27 Jul 2021 Prob (F-statistic): 6.23e-50
Time: 00:40:53 Log-Likelihood: -99.816
No. Observations: 203 AIC: 225.6
Df Residuals: 190 BIC: 268.7
Df Model: 12
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 147.5047 22.122 6.668 0.000 103.868 191.142
birth_year -0.0675 0.011 -6.111 0.000 -0.089 -0.046
Count_of_tech_firms 0.0001 1.95e-05 5.773 0.000 7.42e-05 0.000
Entry_Rate -8.9585 1.814 -4.940 0.000 -12.536 -5.381
ttwa_E30000175 -0.2359 0.129 -1.823 0.070 -0.491 0.019
ttwa_E30000179 -0.2070 0.128 -1.618 0.107 -0.459 0.045
ttwa_E30000180 0.0600 0.129 0.464 0.643 -0.195 0.315
ttwa_E30000186 0.1323 0.129 1.023 0.308 -0.123 0.387
ttwa_E30000212 0.2389 0.129 1.845 0.067 -0.016 0.494
ttwa_E30000234 0.0065 0.159 0.041 0.967 -0.307 0.320
ttwa_E30000237 0.0750 0.129 0.580 0.563 -0.180 0.330
ttwa_E30000239 -0.0467 0.128 -0.365 0.716 -0.299 0.206
ttwa_E30000266 0.0912 0.129 0.704 0.482 -0.164 0.347
Omnibus: 103.810 Durbin-Watson: 1.682
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1355.690
Skew: -1.582 Prob(JB): 4.13e-295
Kurtosis: 15.259 Cond. No. 2.11e+06


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.11e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

$\ log_e\ Firm\ Performance_{i,t} = 147.5 + 0.0001 \times firms_{i,t} - 8.9585 \times dynamics_{i,t} + d_3 \times location_i - 0.0675 \times year_t$ (2)

In [92]:
plt.scatter(hypo2_2_regressor_OLS.fittedvalues, hypo2_2_regressor_OLS.resid)
# adding title and labels
plt.xlabel('Fitted Firm Performance after Logarithmic Transformation')
plt.ylabel('Residual')
plt.title('Residual vs. Fitted Plot of Firm Performance')
plt.ylim(-3,3)
plt.show()
In [93]:
fig2 = plt.figure(figsize=(12,8))
fig2 = sm.graphics.plot_regress_exog(hypo2_2_regressor_OLS, 'Entry_Rate', fig=fig2)