Spatial Research
According to the EDA, most tech firms are clustered in England, so the next step of this research focuses on England. The `ctry` column can be used as a filter to select the England tech firms (country == 'England'). The HHI is calculated with the formula below:
$HHI_{i,t} = \sum_{k=1}^{K} \left( \dfrac{n_{i,k,t}}{N_{i,t}} \right)^2$

where the index runs over industries $k = 1, \dots, K$ in location $i$ and year $t$: $n_{i,k,t}$ is the number of tech firms belonging to industry $k$, and $N_{i,t}$ is the total number of tech firms in that location and year. The index sums the squared industry shares and measures how diverse (or concentrated) the tech cluster is.
The hypotheses are given below:
$\ log_e\ {Dynamics\ Variable_{i,t}}\ =\ a_1 + b_1 \times firms_{i,t} + b_2 \times HHI_{i,t} + b_3 \times location_i + b_4 \times year_t$ (1)
$\ log_e\ {(Firm\ Performance_{i,t}+1)}\ =\ c_1 + d_1 \times firms_{i,t} + d_2 \times HHI_{i,t} + d_3 \times dynamics_{i,t} + d_4 \times location_i + d_5 \times year_t$ (2)
Variables:
- `birth_year`: the year $t$
- `ttwa` (travel to work area): the location $i$
- `Entry_Rate`: the entry rate in a specific $t$ and $i$
- `Count_of_tech_firms`: the count of tech firms in a specific $t$ and $i$
- `Herfindahl-Hirschman Index`: the HHI in a specific $t$ and $i$

Industry Mix vs. Dynamics:
- OLS regression for (1)
- residual analysis for (1)
- OLS regression for (2)
- residual analysis for (2)
The HH index is used to measure tech industry diversity. To calculate the industry mix level, `primary_uk_sic_2007` is used in a group-by operation.
#read data
import sys
import pandas as pd
PATH = sys.path[0]
df = pd.read_csv(PATH + "/Dataset/fame_OC_tech_firm.csv",low_memory = False)
dfp = df.drop({"Unnamed: 0","Unnamed: 0.1","Unnamed: 0.1.1","is_tech"},axis=1)
# data select
# dfp means df is processed
dfp = dfp[["registered_number","diss_year","birth_year","primary_uk_sic_2007","ttwa","ctry"]]
# dna means drop na value
dfp_dna = dfp.dropna(subset=["birth_year","ttwa"]).reset_index(drop=True)
# introduce the ttwa to filter england firms
# dfp_dna = dfp_dna[dfp_dna.ctry=="E92000001"]
dfp_dna = dfp_dna[dfp_dna.birth_year >= 1998]
#export to the excel to do operation
df_excel = dfp_dna.copy()
# df_excel.to_excel(PATH + "/tech_cluster_ttwa_in_England.xlsx")
# process the .csv file
df_excel = pd.read_excel(PATH + "/MySQL/tech_cluster_ttwa_in_England.xlsx")
df_excel = df_excel.fillna(0)
# df_excel.to_csv(PATH + "/MySQL/tech_cluster_ttwa_in_England.csv")
df_excel.head()
After a pivot operation in Excel, the dataset is also imported into MySQL to calculate the HH index.
The method/code is presented below:
CREATE TABLE tech_cluster_hh
SELECT t2.ttwa, t2.birth_year, sum(t2.sub_hh) as hh
FROM (SELECT t0.ttwa,
t0.birth_year,
t0.primary_uk_sic_2007,
power(count(t0.registered_number)/t1.year_total,2) as sub_hh
FROM `casa-dissertation`.tech_cluster_ttwa_in_England_year_98218_sic as t0,
(SELECT ttwa,
birth_year,
count(registered_number) as year_total
FROM `casa-dissertation`.tech_cluster_ttwa_in_England_year_98218_sic
GROUP BY ttwa, birth_year) as t1
WHERE t0.ttwa = t1.ttwa and t0.birth_year = t1.birth_year
GROUP BY t0.ttwa, t0.birth_year, t0.primary_uk_sic_2007 ) as t2
GROUP BY t2.ttwa, t2.birth_year
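The same index can be cross-checked in pandas (a minimal sketch, assuming `dfp_dna` from the cell above still holds `ttwa`, `birth_year` and `primary_uk_sic_2007` for the filtered tech firms):
# sketch: HH index per (ttwa, birth_year), mirroring the SQL above
counts = (dfp_dna
          .groupby(["ttwa", "birth_year", "primary_uk_sic_2007"])
          .size()
          .rename("n")
          .reset_index())
# total firms per (ttwa, birth_year)
counts["year_total"] = counts.groupby(["ttwa", "birth_year"])["n"].transform("sum")
# squared industry share of each SIC code
counts["sub_hh"] = (counts["n"] / counts["year_total"]) ** 2
# sum the squared shares to obtain the HH index
hh_check = counts.groupby(["ttwa", "birth_year"])["sub_hh"].sum().rename("hh").reset_index()
hh_check.head()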
Then the processed .csv files are imported to get the HH index at two scales (location & year, and location only).
【HH index in TTWA and Year】
- method: MySQL/hh_index.sql
- input: the tech_cluster_ttwa_in_England_year_98218_sic table
- output: the tech_cluster_hh table
# HH index in year t and location i
import pandas as pd
import sys
PATH = sys.path[0]
df_hh_i_t = pd.read_csv(PATH + "/MySQL/tech_cluster_hh.csv", sep=",")
df_hh_i_t.head()
【HH index in TTWA】
- method: MySQL/hh_index_drop_year.sql
- input: the tech_cluster_ttwa_in_England_year_98218_sic table
- output: the tech_cluster_hh_ttwa table
CREATE TABLE tech_cluster_hh_ttwa
SELECT t2.ttwa, sum(t2.sub_hh_2) as hh
FROM ( SELECT t0.ttwa,
t0.primary_uk_sic_2007,
power(count(t0.registered_number)/t1.ttwa_total,2) as sub_hh_2
FROM `casa-dissertation`.tech_cluster_ttwa_in_England_year_98218_sic as t0,
(SELECT ttwa,
count(registered_number) as ttwa_total
FROM `casa-dissertation`.tech_cluster_ttwa_in_England_year_98218_sic
WHERE diss_year = 0 -- diss year's firms should be excluded
GROUP BY ttwa) as t1
WHERE t0.ttwa = t1.ttwa and t0.diss_year = 0 -- diss year's firms should be excluded
GROUP BY t0.ttwa, t0.primary_uk_sic_2007 ) as t2
GROUP BY t2.ttwa
# HH index in location i (years pooled)
df_hh_i = pd.read_csv(PATH + "/MySQL/tech_cluster_hh_ttwa_only.csv", sep=",")
df_hh_i.head()
Import the geo dataset for TTWAs in England.
# spatial map prepare
# read spatial data
import geopandas as gpd
g_ttwa = gpd.read_file(PATH + "/Dataset/Spatial/Travel_to_Work_Areas__December_2011__Generalised_Clipped_Boundaries_in_United_Kingdom.geojson")
Following a CSDN post, I utilise the method below to filter the TTWA code strings.
# print(g_ttwa.TTWA11CD.value_counts())
# input
df_tmp = g_ttwa.copy()
df_tmp["col_tmp"] = g_ttwa.TTWA11CD  # assign as a real column (attribute-style assignment would not create one)
letter_str_1 = "E"
# letter_str_2 = "K"
mask = [True if i[0]==letter_str_1 else False for i in df_tmp.col_tmp]
# output
g_ttwa_England = df_tmp[mask]
# print(g_ttwa_England.TTWA11CD.value_counts())
g_ttwa_England.plot()
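For reference, the same England mask can be built with pandas string methods (a sketch that should give an identical result):
# equivalent, more concise filter: keep TTWA codes starting with "E"
g_ttwa_England = g_ttwa[g_ttwa.TTWA11CD.str.startswith("E")].copy()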
# inspect the data
g_ttwa_England.head(1)
# inspect the data
df_hh_i.head(1)
# merge with `df_hh_i`
df_ttwa_hh_mg = pd.merge(g_ttwa,df_hh_i, left_on='TTWA11CD',right_on="ttwa",how = "inner").reset_index(drop=True)
df_ttwa_hh_mg.head()
title_text = "Spatial Distribution of the HH index of tech cluster in England TTWA"
df_left = g_ttwa
left_key = 'TTWA11CD'
df_right = df_hh_i
right_key = "ttwa"
map_attribute = "hh"
# merge
df_mg = pd.merge(df_left,df_right, left_on = left_key,right_on=right_key,how = "inner").reset_index(drop=True)
# map
import pysal as ps
import numpy as np
import pandas as pd
import seaborn as sns
import geopandas as gpd
import matplotlib.pyplot as plt
tfont = {'fontname':'DejaVu Sans', 'horizontalalignment':'left'}
f1 = plt.figure()
f1.set_size_inches(15,12)
ax1 = f1.add_subplot()
df_mg.plot(column = map_attribute, legend=True, cmap='plasma',linestyle='solid', ax=ax1)
# df_ttwa_hh_mg.plot(column='count', legend=True, cmap='plasma',figsize=(180,60),ax=ax1)
f1.subplots_adjust(top=0.9)
f1.suptitle(title_text, x=0.01, size=30, **tfont)
f1.savefig(PATH + "/Img/" + title_text + ".png", dpi=300)
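Since this merge-and-plot template is reused below for the entry rate map, it could be wrapped in a small helper (a sketch based on the cell above; the function name `plot_ttwa_choropleth` is my own):
import pandas as pd
import matplotlib.pyplot as plt

def plot_ttwa_choropleth(df_left, left_key, df_right, right_key,
                         map_attribute, title_text, save_path=None):
    """Merge an attribute table onto the TTWA geometries and draw a choropleth."""
    df_mg = pd.merge(df_left, df_right, left_on=left_key, right_on=right_key,
                     how="inner").reset_index(drop=True)
    tfont = {'fontname': 'DejaVu Sans', 'horizontalalignment': 'left'}
    f1 = plt.figure()
    f1.set_size_inches(15, 12)
    ax1 = f1.add_subplot()
    df_mg.plot(column=map_attribute, legend=True, cmap='plasma', linestyle='solid', ax=ax1)
    f1.subplots_adjust(top=0.9)
    f1.suptitle(title_text, x=0.01, size=30, **tfont)
    if save_path is not None:
        f1.savefig(save_path, dpi=300)
    return df_mg
For example, the entry rate map below could then be drawn with `plot_ttwa_choropleth(g_ttwa, 'TTWA11CD', df_EnR_i, 'ttwa', 'Entry_Rate', title_text)`.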
import seaborn as sns
import numpy as np
df = df_ttwa_hh_mg.copy()
df.hh = np.log(df.hh)
attributes = "hh"
sns.displot(df, x=attributes,bins=10)
# read data
# means Entry Rate in location i
df_EnR_i = pd.read_excel(PATH + "/Excel/tech_cluster_ttwa_in_England_EnR.xlsx", sheet_name = "ttwa_EnR")
df_EnR_i.head()
The same template method is reused below:
title_text = "Spatial Distribution of the entry rate index of tech cluster in England TTWA"
df_left = g_ttwa
left_key = 'TTWA11CD'
df_right = df_EnR_i
right_key = "ttwa"
map_attribute = "Entry_Rate"
# merge
df_mg = pd.merge(df_left,df_right, left_on = left_key,right_on=right_key,how = "inner").reset_index(drop=True)
# map
import pysal as ps
import numpy as np
import pandas as pd
import seaborn as sns
import geopandas as gpd
import matplotlib.pyplot as plt
tfont = {'fontname':'DejaVu Sans', 'horizontalalignment':'left'}
f1 = plt.figure()
f1.set_size_inches(15,12)
ax1 = f1.add_subplot()
df_mg.plot(column = map_attribute, legend=True, cmap='plasma', linestyle='solid', ax=ax1)
# df_ttwa_hh_mg.plot(column='count', legend=True, cmap='plasma',figsize=(180,60),ax=ax1)
f1.subplots_adjust(top=0.9)
f1.suptitle(title_text, x=0.01, size=30, **tfont)
# f1.savefig(PATH + "/Img/" + title_text + ".png", dpi=300)
# the last dataset
import pandas as pd
import sys
PATH = sys.path[0]
df = pd.read_csv(PATH + "/Dataset/fame_OC_tech_firm.csv",low_memory=False)
df = df.drop({"Unnamed: 0","Unnamed: 0.1","Unnamed: 0.1.1","is_tech"},axis = 1)
# preprocess the data
# drop the null value record in birth year and ttwa(travel to work area) columns
df = df.dropna(subset = {"birth_year","ttwa"})
# England's filter can be processed in the final stage with merge function
# filter by `birth_year`
# df_10_tc_20 means "top 10 tech ttwa clusters in 20 years dataframe"
df = df[df["birth_year"]>=1998].reset_index(drop=True)
# select data in below columns
df = df[["registered_number",\
"ttwa",\
"birth_year",\
"diss_year",\
"latest_accounts_assets"]]
df.head()
df_EnR_i_t = df.copy()
df_EnR_i_t.head()
df1 = df_EnR_i_t.copy()
# MITO CODE START (DO NOT EDIT)
from mitosheet import * # Import necessary functions from Mito
register_analysis('UUID-e910eec5-8991-430a-8ba4-f9ddcc5283bc') # Let Mito know which analysis is being run
# Step 1
df2 = pd.DataFrame(data={})
# MITO CODE END (DO NOT EDIT)
# MITO CODE START (DO NOT EDIT)
from mitosheet import * # Import necessary functions from Mito
register_analysis('UUID-e910eec5-8991-430a-8ba4-f9ddcc5283bc') # Let Mito know which analysis is being run
# Step 1
# Pivot the data
pivot_table = df1.pivot_table(
index=['ttwa', 'birth_year'],
values=['registered_number'],
aggfunc={'registered_number': 'count'}
)
# Reset the column name and the indexes
df2 = pivot_table.rename_axis(None, axis=1).reset_index()
# MITO CODE END (DO NOT EDIT)
# MITO CODE START (DO NOT EDIT)
from mitosheet import * # Import necessary functions from Mito
register_analysis('UUID-e910eec5-8991-430a-8ba4-f9ddcc5283bc') # Let Mito know which analysis is being run
# Step 1
# Pivot the data
pivot_table = df1.pivot_table(
index=['ttwa', 'birth_year'],
values=['registered_number', 'latest_accounts_assets'],
aggfunc={'registered_number': 'count', 'latest_accounts_assets': 'count'}
)
# Reset the column name and the indexes
df2 = pivot_table.rename_axis(None, axis=1).reset_index()
# MITO CODE END (DO NOT EDIT)
# MITO CODE START (DO NOT EDIT)
from mitosheet import * # Import necessary functions from Mito
register_analysis('UUID-e910eec5-8991-430a-8ba4-f9ddcc5283bc') # Let Mito know which analysis is being run
# Step 1
# Pivot the data
pivot_table = df1.pivot_table(
index=['ttwa', 'birth_year'],
values=['registered_number', 'latest_accounts_assets'],
aggfunc={'registered_number': 'count', 'latest_accounts_assets': 'mean'}
)
# Reset the column name and the indexes
df2 = pivot_table.rename_axis(None, axis=1).reset_index()
# MITO CODE END (DO NOT EDIT)
df_EnR_i_t_pivot = df2.copy()
df_EnR_i_t_pivot = df_EnR_i_t_pivot.rename({"latest_accounts_assets":"average_assets","registered_number":"firms"},axis=1)
df_EnR_i_t_pivot.head()
# group by ttwa to compute the entry rate
df = df_EnR_i_t_pivot.copy()
df_tmp = df.groupby('ttwa').agg({"firms":sum}).rename({"firms":"total_firms"},axis=1).reset_index()
dfm = pd.merge(df, df_tmp, left_on = 'ttwa', right_on = 'ttwa', how='inner') # vlookup like function
dfm["entry_rate"] = dfm.firms/dfm.total_firms
dfm.head()
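In other words, for each TTWA $i$ and year $t$ the cell above computes $entry\_rate_{i,t} = firms_{i,t} / \sum_{t'} firms_{i,t'}$, i.e. the share of the TTWA's tech-firm births from 1998 onwards that occurred in year $t$.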
df_EnR_i_t = dfm.copy()
df_EnR_i_t = df_EnR_i_t.fillna(0)
df_EnR_i_t.birth_year = df_EnR_i_t.birth_year.astype(int)
df_EnR_i_t.head()
df_left = df_EnR_i_t
left_key = ["ttwa","birth_year"]
print("Entry Rate Dataset is" + str(df_EnR_i_t.shape) + " because it includes other countries like Scotland, Wales")
df_right = df_hh_i_t
right_key = ["ttwa","birth_year"]
print("HHI Dataset is" + str(df_hh_i_t.shape))
# merge
df_mg = pd.merge(df_left,df_right, left_on=left_key, right_on=right_key,how = "inner").reset_index(drop=True)
df_hh_EnR = df_mg.copy()
print("HHI&EnR Dataset is" + str(df_hh_EnR.shape))
df_hh_EnR.head()
df_ttwa = pd.read_csv(PATH + "/Dataset/ttwa.csv")
df_ttwa = df_ttwa[["code","name"]]
df_left = df_hh_EnR
left_key = "ttwa"
print("df_left is " + str(df_left.shape))
df_right = df_ttwa
right_key = "code"
print("df_right is " + str(df_right.shape))
# merge
df_mg = pd.merge(df_left,df_right, left_on=left_key, right_on=right_key,how = "inner").reset_index(drop=True)
print("df_merge" + str(df_mg.shape))
df_hh_EnR_ttwaName = df_mg.copy()
df_hh_EnR_ttwaName = df_hh_EnR_ttwaName[["name","code","birth_year","firms","total_firms","entry_rate", "hh", "average_assets"]]
df_hh_EnR_ttwaName = df_hh_EnR_ttwaName.rename({"name":"ttwa", "code":"ttwa_code", "birth_year":"year"},axis=1)
df_hh_EnR_ttwaName.head()
df = df_hh_EnR_ttwaName.copy() # input your data
attribute = "entry_rate" # input your X
import seaborn as sns
import numpy as np
df[attribute] = np.log(df[attribute])
sns.displot(df, x=attribute,bins=80)
df = df_hh_EnR_ttwaName.copy() # input your data
attribute = "average_assets" # input your X
import seaborn as sns
import numpy as np
df[attribute] = np.log1p(df[attribute])
sns.displot(df, x=attribute,bins=40)
df[attribute].describe()
Can we drop the values close to 0?
#df_hh_EnR_ttwaName
import math
assets_level = (math.e)**5-1
print(assets_level)
df_hh_EnR_ttwaName[df_hh_EnR_ttwaName.average_assets < assets_level].shape[0] / df_hh_EnR_ttwaName.shape[0]
Observations with average assets below $e^5 - 1 \approx 147$ (i.e. log values between roughly 0 and 4) account for only about 5% of the dataset, so these near-zero values can be dropped.
df_hh_EnR_ttwaName_asset_not_drop = df_hh_EnR_ttwaName
# drop the asset which can cause outlier
df_hh_EnR_ttwaName = df_hh_EnR_ttwaName[df_hh_EnR_ttwaName["average_assets"]>=assets_level]
# then test the distribution again
df = df_hh_EnR_ttwaName.copy()
attribute = "average_assets"
import seaborn as sns
import numpy as np
df[attribute] = np.log1p(df[attribute])
sns.displot(df, x=attribute,bins=80)
# export the df_hh_EnR_ttwaName_i_t_asset_not_drop
df_hh_EnR_ttwaName_asset_not_drop.to_csv(PATH + "/Dataset/df_hh_EnR_ttwaName_asset_not_drop.csv")
$\ Dynamics\ Variable_{i,t}\ =\ a_1 + b_1 \ firms_{i,t} + b_2 \ HHI_{i,t} + b_3 \ location_i + b_4 \ year_t + e_{i,t} $ (1)
$\ log_e\ {(Firm\ Performance_{i,t}+1)}\ =\ c_1 + d_1 \ firms_{i,t} + d_2 \ HHI_{i,t} + d_3 \ dynamics_{i,t} + d_4 \ location_i + d_5 \ year_t + e_{i,t}$ (2)
Variables:
- `birth_year`: the year $t$
- `ttwa` (travel to work area): the location $i$
- `Entry_Rate`: the entry rate in a specific $t$ and $i$
- `Count_of_tech_firms`: the count of tech firms in a specific $t$ and $i$
- `Herfindahl-Hirschman Index`: the HHI in a specific $t$ and $i$

In (1), Y is `entry_rate`; X is `ttwa`, `birth_year`, `total`, `hh`.
In (2), Y is `average_latest_accounts_assets`; X is `ttwa`, `birth_year`, `total`, `hh`, `entry_rate`.
# environment preparation
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_rows', 300) # specifies number of rows to show
pd.options.display.float_format = '{:40,.4f}'.format # specifies default number format to 4 decimal places
plt.style.use('ggplot') # specifies that graphs should use ggplot styling
%matplotlib inline
Introduce the correlation analysis.
df = df_hh_EnR_ttwaName.copy()
df.year = df.year.astype("int")
df = df.drop({"ttwa","ttwa_code"},axis=1)
# dummies operation requirement (not applied here)
# set the reference category to ttwa_E30000234 (London)
# drop London so the other TTWA dummies are compared against it
# df = df.drop(['ttwa_London'], axis=1)
# correlation analysis
plt.rcParams["axes.grid"] = False
f = plt.figure(figsize=(19, 15))
plt.matshow(df.corr(), fignum=f.number)
plt.xticks(range(df.shape[1]), df.columns, fontsize=14, rotation=45)
plt.yticks(range(df.shape[1]), df.columns, fontsize=14)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)
plt.title('Correlation Matrix', fontsize=16)
# df = pd.get_dummies(df)
Introduce the dummy variables.
df = df_hh_EnR_ttwaName.copy()
# intro dummy variables
# df.year = df.year.astype("int")
df = df.drop({"ttwa_code"},axis=1)
df = pd.get_dummies(df)
# dummies operation requirement
# set the reference category to ttwa_E30000234 (London)
# drop London so the other TTWA dummies are compared against it
# df = df.drop(['ttwa_London'], axis=1)
# gd means get dummies
df_hh_EnR_gd = df.copy()
# VIF method
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
def drop_column_using_vif_(df, list_var_not_to_remove, thresh=5):
'''
Calculates the VIF for each feature in a pandas dataframe, and repeatedly drops the column with the highest VIF
A constant must be added to variance_inflation_factor or the results will be incorrect
:param df: the pandas dataframe containing only the predictor features, not the response variable
:param list_var_not_to_remove: the list of variables that should not be removed even if they have a high VIF, for example dummy (indicator) variables that represent a categorical variable with three or more categories.
:param thresh: the max VIF value before the feature is removed from the dataframe
:return: dataframe with multicollinear features removed
'''
while True:
# adding a constant term to the data
df_with_const = add_constant(df)
vif_df = pd.Series([variance_inflation_factor(df_with_const.values, i)
for i in range(df_with_const.shape[1])], name= "VIF",
index=df_with_const.columns).to_frame()
# drop the const
vif_df = vif_df.drop('const').drop(list_var_not_to_remove)
print('Max VIF:', vif_df.VIF.max())
# if the largest VIF is above the thresh, remove a variable with the largest VIF
if vif_df.VIF.max() > thresh:
# If there are multiple variables with the maximum VIF, choose the first one
index_to_drop = vif_df.index[vif_df.VIF == vif_df.VIF.max()].tolist()[0]
print('Dropping: {}'.format(index_to_drop))
df = df.drop(columns = index_to_drop)
else:
# No VIF is above threshold. Exit the loop
break
return df
In (1):
$\ {Dynamics\ Variable_{i,t}}^{0.5}\ =\ a_1 + b_1 \times firms_{i,t} + b_2 \times HHI_{i,t} + b_3 \times location_i + b_4 \times year_t$
predictors_X = df_hh_EnR_gd.drop({'entry_rate','average_assets'}, axis=1) # drop the Y_1 and Y_2
response_Y = np.sqrt(df_hh_EnR_gd['entry_rate'])
list_var_not_to_remove = ["year","firms","hh"] # variables that should not be removed by the VIF selection even if they have a high VIF
df_predictors_select_VIF = drop_column_using_vif_(predictors_X, list_var_not_to_remove, thresh=5)
print("The columns remaining after VIF selection are:")
print(df_predictors_select_VIF.columns)
regressor_OLS = sm.OLS(endog=response_Y, exog=sm.add_constant(df_predictors_select_VIF)).fit()
regressor_OLS.summary()
$R^2 = 0.78$
In (2):
$\ log_e\ {(Firm\ Performance_{i,t}+1)}\ =\ c_1 + d_1 \times firms_{i,t} + d_2 \times HHI_{i,t} + d_3 \times dynamics_{i,t} + d_4 \times location_i + d_5 \times year_t$ (2)
predictors_X = df_hh_EnR_gd.drop({'average_assets'}, axis=1) # drop the Y_1 and Y_2
response_Y = np.log1p(df_hh_EnR_gd['average_assets'])
list_var_not_to_remove = ["entry_rate","year","firms","hh"] # variables that should not be removed by the VIF selection even if they have a high VIF
df_predictors_select_VIF = drop_column_using_vif_(predictors_X, list_var_not_to_remove, thresh=5)
print("The columns remaining after VIF selection are:")
print(df_predictors_select_VIF.columns)
regressor_OLS = sm.OLS(endog=response_Y, exog=sm.add_constant(df_predictors_select_VIF)).fit()
regressor_OLS.summary()
$R^2 = 0.43$
Small data preparation: the top 10 tech TTWA clusters.
df = pd.read_csv(PATH + "/Dataset/fame_OC_tech_firm.csv",low_memory=False)
df = df.drop({"Unnamed: 0","Unnamed: 0.1","Unnamed: 0.1.1","is_tech"},axis = 1)
df = df.dropna(subset = {"birth_year","ttwa"})
# Generate the top 10 tech clusters table based on the ttwa
top_10_ttwa = df.ttwa.value_counts().head(10)
top_10_ttwa = pd.DataFrame(top_10_ttwa).reset_index()
top_10_ttwa.columns = ["ttwa_code","counts"]
df_ttwa = pd.read_csv(PATH + "/Dataset/ttwa.csv")
df_ttwa = pd.merge(top_10_ttwa, df_ttwa[["code","name"]], left_on="ttwa_code",right_on="code")
df_ttwa_10 = df_ttwa[["ttwa_code", "name","counts"]]
# df_10_tc means "top 10 tech ttwa clusters"
df_10_tc = df_hh_EnR_ttwaName.copy()
# filter data to the top 10 ttwa rows
df_10_tc = df_10_tc[df_10_tc.ttwa_code.isin(df_ttwa_10.ttwa_code.to_list())].reset_index(drop=True)
df_10_tc = df_10_tc.drop({"ttwa_code"},axis=1)
# get dummies
df_10_tc = pd.get_dummies(df_10_tc)
df_10_tc = df_10_tc.drop({"ttwa_London"},axis=1)
# delete the resid outlier rows
df_10_tc = df_10_tc.drop(index=[182,92,141])
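The three row indices above came from the residual analysis; one way such points could be flagged programmatically (a sketch, assuming a fitted model object like `regressor_OLS` below) is:
# sketch: indices of the 3 observations with the largest absolute residuals
outlier_idx = regressor_OLS.resid.abs().nlargest(3).index.tolist()
print(outlier_idx)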
In (1):
$\ log_e\ {Dynamics\ Variable_{i,t}}\ =\ a_1 + b_1 \times firms_{i,t} + b_2 \times HHI_{i,t} + b_3 \times location_i + b_4 \times year_t$
df = df_10_tc.copy()
predictors_X = df.drop({'entry_rate','average_assets'}, axis=1) # drop the Y_1 and Y_2
#response_Y = df['entry_rate']
response_Y = np.log(df['entry_rate']) # the log transform is used here for the residual analysis
list_var_not_to_remove = ["year","firms","hh"] # variables that should not be removed by the VIF selection even if they have a high VIF
df_predictors_select_VIF = drop_column_using_vif_(predictors_X, list_var_not_to_remove, thresh=5)
print("The columns remaining after VIF selection are:")
print(df_predictors_select_VIF.columns)
regressor_OLS = sm.OLS(endog=response_Y, exog=sm.add_constant(df_predictors_select_VIF)).fit()
regressor_OLS.summary()
plt.scatter(regressor_OLS.fittedvalues, regressor_OLS.resid)
# adding title and labels
plt.xlabel('Fitted Entry Rate')
plt.ylabel('Residual')
plt.title('Figure')
#plt.ylim(-1,1)
plt.show()
In (2):
$\ log_e\ {(Firm\ Performance_{i,t}+1)}\ =\ c_1 + d_1 \times firms_{i,t} + d_2 \times HHI_{i,t} + d_3 \times dynamics_{i,t} + d_4 \times location_i + d_5 \times year_t$ (2)
df = df_10_tc.copy()
predictors_X = df.drop({'average_assets'}, axis=1) # drop the Y_1 and Y_2
response_Y = np.log1p(df['average_assets'])
list_var_not_to_remove = ["entry_rate","year","firms","hh"] # variables that should not be removed by the VIF selection even if they have a high VIF
df_predictors_select_VIF = drop_column_using_vif_(predictors_X, list_var_not_to_remove, thresh=5)
print("The columns remaining after VIF selection are:")
print(df_predictors_select_VIF.columns)
regressor_OLS = sm.OLS(endog=response_Y, exog=sm.add_constant(df_predictors_select_VIF)).fit()
regressor_OLS.summary()
plt.scatter(regressor_OLS.fittedvalues, regressor_OLS.resid)
# adding title and labels
plt.xlabel('Fitted Average Assets')
plt.ylabel('Residual')
plt.title('Figure')
# plt.ylim(-0.5,0.5)
plt.show()
In England, firm performance appears to grow less in later years. A higher HH index might be associated with lower performance, and a higher entry rate might also be associated with lower average assets. These findings are discussed against the wider literature in the next section.
# value include dynamics, industry mix and firms' performance
df1 = df_hh_EnR_ttwaName.copy()
df1.shape
df1.describe()
df1[df1.hh == 1].ttwa.unique() # TTWAs where hh equals 1 (only one industry present)
df1.head()
# export df1 to MySQL to operate
# df1.to_csv(PATH + "/MySQL/df_hh_EnR_ttwaName.csv")
Excel Operation (ref to MySQL operation ...)
DROP TABLE IF EXISTS tech_cluster_hh_enr_ttwa_for_r;
CREATE TABLE tech_cluster_hh_enr_ttwa_for_r
SELECT * FROM (select t1.ttwa
,t1.ttwa_code
,sum(t1.firms) as entry_total
,avg(t1.total_firms) as grand_total
,avg(t1.hh) as hhi
,avg(t1.entry_rate) as entry_rate
,avg(t1.average_latest_accounts_assets) as assets
from `casa-dissertation`.df_hh_enr_for_r as t1
group by t1.ttwa) as t2
Read df1
from MySQL
# MITO CODE START (DO NOT EDIT)
from mitosheet import * # Import necessary functions from Mito
register_analysis('UUID-20dc35da-8f74-48f1-9355-810587cb82e8') # Let Mito know which analysis is being run
# Step 1
df2 = pd.DataFrame(data={})
# MITO CODE END (DO NOT EDIT)
# MITO CODE START (DO NOT EDIT)
from mitosheet import * # Import necessary functions from Mito
register_analysis('UUID-20dc35da-8f74-48f1-9355-810587cb82e8') # Let Mito know which analysis is being run
# Step 1
# Pivot the data
pivot_table = df1.pivot_table(
index=['ttwa', 'ttwa_code'],
values=['firms', 'total_firms', 'hh', 'entry_rate', 'average_assets'],
aggfunc={'firms': 'sum', 'total_firms': 'mean', 'hh': 'mean', 'entry_rate': 'mean', 'average_assets': 'mean'}
)
# Reset the column name and the indexes
df2 = pivot_table.rename_axis(None, axis=1).reset_index()
# MITO CODE END (DO NOT EDIT)
# df1 = pd.read_csv(PATH + "/MySQL/tech_cluster_hh_enr_ttwa_for_r.csv",",")
df1 = df2.copy()
type(g_ttwa_England)
type(df1)
df_left = g_ttwa_England
left_key = 'TTWA11CD'
df_right = df1
right_key = "ttwa_code"
# merge
df_mg = pd.merge(df_left,df_right, left_on = left_key,right_on=right_key,how = "inner").reset_index(drop=True)
type(df_mg)
df_mg.shape
# df_mg.to_csv(PATH + "/HHI_EnR_ttwa_England_for_R.csv")
import geopandas as gpd
import fiona
# asset is not dropped
df_mg.to_file(PATH + "/Dataset/Spatial/HHI_EnR_ttwa_England_for_R_assetnd.geojson", driver='GeoJSON')