Spatial Research

1 Overview

  • The HH index is added to the OLS regression experiment (top 10 tech clusters in England)
  • Residual analysis is carried out to examine the performance of the regressions
  • Spatial patterns are detected with Moran's I and GWR

1.1 Methodology

  • Data preparation
  • OLS regression
    • Hypothesis
    • Distribution correction
    • OLS with standard errors
    • VIF (remove multicollinearity)
    • Regression
    • Residual analysis
  • Moran's I test
  • GWR

According to the EDA, most tech firms are clustered in England, so the next step of this research focuses on England; the ctry column is used as a filter to select tech firms where country == 'England'. The HHI is calculated with the formula below:

$$ Herfindahl\text{-}Hirschman\ Index_{i,t} = \sum_{k=1}^{K} \left(\frac{Tech\ Firms_{i,t,k}}{Total\ Tech\ Firms_{i,t}}\right)^2 $$

where the HH index is calculated over industries k = 1, ..., K in location i and year t. For each industry k, the number of tech firms in that industry is divided by the total number of tech firms in location i and year t, and the squared shares are summed. The index measures how concentrated (and, conversely, how diverse) the tech cluster is.
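As a hypothetical worked example (numbers invented for illustration): if a TTWA has 10 tech firms in a given year, split 6 / 3 / 1 across three SIC codes, then

$$ HHI = \left(\tfrac{6}{10}\right)^2 + \left(\tfrac{3}{10}\right)^2 + \left(\tfrac{1}{10}\right)^2 = 0.36 + 0.09 + 0.01 = 0.46 $$

A value near 1 indicates a cluster dominated by a single industry, while the minimum value 1/K indicates an even mix across K industries.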

The hypotheses are given below (a minimal fitting sketch follows the variable definitions):

$\ log_e\ {Dynamics\ Variable_{i,t}}\ =\ a_1 + b_1 \times firms_{i,t} + b_2 \times HHI_{i,t} + b_3 \times location_i + b_4 \times year_t$ (1)

$\ log_e\ {(Firm\ Performance_{i,t}+1)}\ =\ c_1 + d_1 \times firms_{i,t} + d_2 \times HHI_{i,t} + d_3 \times dynamics_{i,t} + d_4 \times location_i + d_5 \times year_t$ (2)

  • $t$ means year -- birth_year
  • $i$ means location -- ttwa (travel to work area)
  • $Dynamics\ Variable_{i,t}$ in (1) means Entry_Rate in $t$ and $i$
  • $dynamics_{i,t}$ in (2) means Entry_Rate in a specific $t$ and $i$
  • $firms_{i,t}$ means Count_of_tech_firms in a specific $t$ and $i$
  • $HHI_{i,t}$ means Herfindahl-Hirschman Index in a specific $t$ and $i$
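For orientation only, hypothesis (1) could be expressed with the statsmodels formula API roughly as sketched below. This is a minimal sketch, not the notebook's actual fitting code (which uses pd.get_dummies plus sm.OLS in section 2); the column names entry_rate, firms, hh, year and ttwa are assumed to match the merged dataset built there.

```python
# Minimal sketch of hypothesis (1) using the statsmodels formula API (assumed column names).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def fit_entry_rate_model(df: pd.DataFrame):
    # log_e(entry_rate) ~ firm count + HHI + year + TTWA fixed effects (dummy-coded via C(ttwa))
    model = smf.ols("np.log(entry_rate) ~ firms + hh + year + C(ttwa)", data=df)
    return model.fit()

# Example (hypothetical call, once the merged dataframe exists):
# results = fit_entry_rate_model(df_hh_EnR_ttwaName)
# print(results.summary())
```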

1.2 Results

Industry Mix vs. Dynamics:

[Figure: Industry Mix vs. Dynamics]

  • OLS regression for (1)

  • Residual analysis for (1)

  • OLS regression for (2)

  • Residual analysis for (2)

2 Reproducible Analysis: HH index

The HH index is used to measure tech industry diversity. To calculate the industry-mix level, the primary SIC code (primary_uk_sic_2007) is used in a group-by operation.
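As a rough cross-check of the MySQL query shown later, the same group-by could be sketched in pandas as below. This is a minimal sketch, not the notebook's own calculation (which is done in MySQL); it assumes a dataframe with the ttwa, birth_year, primary_uk_sic_2007 and registered_number columns prepared in the next cell.

```python
# Minimal pandas sketch of the HHI calculation (the notebook's own calculation is the MySQL query below).
# Assumes columns: ttwa, birth_year, primary_uk_sic_2007, registered_number.
import pandas as pd

def hhi_by_ttwa_year(df: pd.DataFrame) -> pd.DataFrame:
    # firms per (ttwa, year, SIC code)
    counts = (df.groupby(["ttwa", "birth_year", "primary_uk_sic_2007"])["registered_number"]
                .count()
                .rename("firms_k")
                .reset_index())
    # total firms per (ttwa, year)
    totals = (df.groupby(["ttwa", "birth_year"])["registered_number"]
                .count()
                .rename("year_total")
                .reset_index())
    merged = counts.merge(totals, on=["ttwa", "birth_year"])
    merged["sub_hh"] = (merged["firms_k"] / merged["year_total"]) ** 2
    # sum the squared shares over SIC codes -> HHI per (ttwa, year)
    return (merged.groupby(["ttwa", "birth_year"])["sub_hh"]
                  .sum()
                  .rename("hh")
                  .reset_index())

# Hypothetical usage, once dfp_dna (built in the next cell) exists:
# hh_i_t = hhi_by_ttwa_year(dfp_dna)
```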

According to the EDA, most tech firms are clustered in England, so this analysis focuses on England; the ctry column can be used as a filter to select country == 'England' tech firms.

In [58]:
#read data
import sys
import pandas as pd
PATH = sys.path[0]
df = pd.read_csv(PATH + "/Dataset/fame_OC_tech_firm.csv",low_memory = False)
dfp = df.drop({"Unnamed: 0","Unnamed: 0.1","Unnamed: 0.1.1","is_tech"},axis=1)

# data select
# dfp means df is processed
dfp = dfp[["registered_number","diss_year","birth_year","primary_uk_sic_2007","ttwa","ctry"]]

# dna means drop na value
dfp_dna = dfp.dropna(subset=["birth_year","ttwa"]).reset_index(drop=True)

# introduce the ttwa to filter england firms
# dfp_dna = dfp_dna[dfp_dna.ctry=="E92000001"]
dfp_dna = dfp_dna[dfp_dna.birth_year >= 1998]

#export to the excel to do operation
df_excel = dfp_dna.copy() 
# df_excel.to_excel(PATH + "/tech_cluster_ttwa_in_England.xlsx")
In [63]:
# process the .csv file
df_excel = pd.read_excel(PATH + "/MySQL/tech_cluster_ttwa_in_England.xlsx")
df_excel = df_excel.fillna(0)
# df_excel.to_csv(PATH + "/MySQL/tech_cluster_ttwa_in_England.csv")
In [62]:
df_excel.head()
Out[62]:
registered_number diss_year birth_year primary_uk_sic_2007 ttwa
0 11445613 NaN 2018 86101 E30000272
1 09464011 NaN 2015 86900 E30000272
2 09617233 NaN 2015 73200 E30000272
3 08508950 NaN 2013 86900 E30000272
4 08631480 NaN 2013 58290 E30000272

After a pivot operation in Excel, the dataset is imported into MySQL to calculate the HH index.

The method/code is presented below:

CREATE TABLE tech_cluster_hh
SELECT t2.ttwa, t2.birth_year, sum(t2.sub_hh) as hh
FROM (SELECT t0.ttwa, 
            t0.birth_year,
            t0.primary_uk_sic_2007, 
            power(count(t0.registered_number)/t1.year_total,2) as sub_hh
            FROM `casa-dissertation`.tech_cluster_ttwa_in_England_year_98218_sic as t0,
                (SELECT ttwa, 
                        birth_year,  
                        count(registered_number) as year_total 
                        FROM `casa-dissertation`.tech_cluster_ttwa_in_England_year_98218_sic
                        GROUP BY ttwa, birth_year) as t1
            WHERE t0.ttwa = t1.ttwa and t0.birth_year = t1.birth_year
            GROUP BY t0.ttwa, t0.birth_year, t0.primary_uk_sic_2007 ) as t2
GROUP BY t2.ttwa, t2.birth_year

Then the processed .csv is imported to obtain the HH index at two scales: by location and year, and by location only.

HH index in TTWA and Year

method from MySQL/hh_index.sql

input from the tech_cluster_ttwa_in_England_year_98218_sic table

output from the tech_cluster_hh table

In [3]:
# HH index by location i and year t
import pandas as pd
import sys
PATH = sys.path[0]
df_hh_i_t = pd.read_csv(PATH + "/MySQL/tech_cluster_hh.csv", sep=",")
df_hh_i_t.head()
Out[3]:
ttwa birth_year hh
0 E30000237 1998 0.147377
1 E30000212 1998 0.110137
2 E30000292 1998 0.218935
3 E30000234 1998 0.084710
4 E30000245 1998 0.155709

HH index in TTWA

method from MySQL/hh_index_drop_year.sql

input from the tech_cluster_ttwa_in_England_year_98218_sic table

output from the tech_cluster_hh_ttwa table

CREATE TABLE tech_cluster_hh_ttwa
SELECT t2.ttwa, sum(t2.sub_hh_2) as hh
FROM ( SELECT   t0.ttwa, 
                t0.primary_uk_sic_2007, 
                power(count(t0.registered_number)/t1.ttwa_total,2) as sub_hh_2
                FROM `casa-dissertation`.tech_cluster_ttwa_in_England_year_98218_sic as t0,
                    (SELECT ttwa, 
                            count(registered_number) as ttwa_total 
                            FROM `casa-dissertation`.tech_cluster_ttwa_in_England_year_98218_sic
                            WHERE diss_year = 0  -- diss year's firms should be excluded
                            GROUP BY ttwa) as t1
                WHERE t0.ttwa = t1.ttwa and t0.diss_year = 0 -- diss year's firms should be excluded
                GROUP BY t0.ttwa, t0.primary_uk_sic_2007 ) as t2
GROUP BY t2.ttwa
In [70]:
# HH index in location i only (aggregated over years)
df_hh_i = pd.read_csv(PATH + "/MySQL/tech_cluster_hh_ttwa_only.csv", sep=",")
df_hh_i.head()
Out[70]:
ttwa hh
0 E30000237 0.141279
1 E30000212 0.081876
2 E30000292 0.058302
3 E30000234 0.070716
4 E30000245 0.053532

Map Visualisation

Import the geo dataset for TTWAs in England.

In [6]:
# spatial map prepare
# read spatial data
import geopandas as gpd
g_ttwa = gpd.read_file(PATH + "/Dataset/Spatial/Travel_to_Work_Areas__December_2011__Generalised_Clipped_Boundaries_in_United_Kingdom.geojson")

Following a CSDN post, the method below is used to filter the TTWA codes by their first letter (English TTWA codes start with "E").

In [7]:
# print(g_ttwa.TTWA11CD.value_counts())
# input: English TTWA codes start with the letter "E"
letter_str_1 = "E"
# letter_str_2 = "K"

# boolean mask: keep rows whose TTWA code starts with the chosen letter
mask = [code[0] == letter_str_1 for code in g_ttwa.TTWA11CD]

# output
g_ttwa_England = g_ttwa[mask].copy()
# print(g_ttwa_England.TTWA11CD.value_counts())
g_ttwa_England.plot()
Out[7]:
<AxesSubplot:>
In [8]:
# inspect the data
g_ttwa_England.head(1)
Out[8]:
OBJECTID TTWA11CD TTWA11NM Shape__Area Shape__Length geometry
0 1 E30000004 Barnsley 3.424992e+08 129308.31911 POLYGON ((-1.44681 53.61282, -1.44475 53.61061...

Industry Mix: HHI

In [47]:
# inspect the data
df_hh_i.head(1)
Out[47]:
ttwa hh
0 E30000266 0.105359
In [51]:
# merge with `df_hh_i`

df_ttwa_hh_mg = pd.merge(g_ttwa,df_hh_i, left_on='TTWA11CD',right_on="ttwa",how = "inner").reset_index(drop=True)
df_ttwa_hh_mg.head()
Out[51]:
OBJECTID TTWA11CD TTWA11NM Shape__Area Shape__Length geometry ttwa hh
0 1 E30000004 Barnsley 3.424992e+08 129308.319110 POLYGON ((-1.44681 53.61282, -1.44475 53.61061... E30000004 0.045157
1 2 E30000018 Bradford 3.438902e+08 119605.015955 POLYGON ((-1.94399 53.95619, -1.94161 53.95252... E30000018 0.081738
2 3 E30000029 Halifax 3.639425e+08 105182.259501 POLYGON ((-2.03257 53.81023, -2.03138 53.80902... E30000029 0.067777
3 4 E30000039 Skipton 1.178893e+09 197077.350855 POLYGON ((-2.30733 54.25008, -2.30511 54.24934... E30000039 0.062993
4 5 E30000046 Dorchester and Weymouth 7.200201e+08 226091.682220 MULTIPOLYGON (((-2.22501 50.61646, -2.22520 50... E30000046 0.059792
In [106]:
title_text = "Spatial Distribution of the HH index of tech cluster in England TTWA"

df_left = g_ttwa
left_key = 'TTWA11CD'

df_right = df_hh_i
right_key = "ttwa"

map_attribute = "hh"


# merge
df_mg = pd.merge(df_left,df_right, left_on = left_key,right_on=right_key,how = "inner").reset_index(drop=True)
# map
import pysal as ps
import numpy as np
import pandas as pd
import seaborn as sns
import geopandas as gpd
import matplotlib.pyplot as plt

tfont = {'fontname':'DejaVu Sans', 'horizontalalignment':'left'}
f1 = plt.figure()
f1.set_size_inches(15,12)

ax1 = f1.add_subplot()
df_mg.plot(column = map_attribute, legend=True, cmap='plasma',linestyle='solid', ax=ax1)
# df_ttwa_hh_mg.plot(column='count', legend=True, cmap='plasma',figsize=(180,60),ax=ax1)

f1.subplots_adjust(top=0.9)
f1.suptitle(title_text, x=0.01, size=30, **tfont)
f1.savefig(PATH + "/Img/" + title_text + ".png", dpi=300)
In [84]:
import seaborn as sns
import numpy as np
df = df_ttwa_hh_mg.copy()
df.hh = np.log(df.hh)
attributes = "hh"
sns.displot(df, x=attributes,bins=10)
Out[84]:
<seaborn.axisgrid.FacetGrid at 0x7fa660a1a278>

Dynamics: Entry Rate

Here the entry rate per TTWA appears to be pre-computed in the Excel sheet as the count of firms born in 2018 divided by the TTWA's total firm count (e.g. 93 / 749 ≈ 0.124 for E30000004).

In [85]:
# read data

# means Entry Rate in location i
df_EnR_i = pd.read_excel(PATH + "/Excel/tech_cluster_ttwa_in_England_EnR.xlsx", sheet_name = "ttwa_EnR")
df_EnR_i.head()
Out[85]:
ttwa 2018 total Entry_Rate
0 E30000004 93 749 0.124166
1 E30000018 334 1771 0.188594
2 E30000029 132 945 0.139683
3 E30000039 23 225 0.102222
4 E30000046 66 623 0.105939

The template mapping method is written below.

In [104]:
title_text = "Spatial Distribution of the entry rate index of tech cluster in England TTWA"

df_left = g_ttwa
left_key = 'TTWA11CD'

df_right = df_EnR_i
right_key = "ttwa"

map_attribute = "Entry_Rate"


# merge
df_mg = pd.merge(df_left,df_right, left_on = left_key,right_on=right_key,how = "inner").reset_index(drop=True)

# map
import pysal as ps
import numpy as np
import pandas as pd
import seaborn as sns
import geopandas as gpd
import matplotlib.pyplot as plt

tfont = {'fontname':'DejaVu Sans', 'horizontalalignment':'left'}
f1 = plt.figure()
f1.set_size_inches(15,12)

ax1 = f1.add_subplot()
df_mg.plot(column = map_attribute, legend=True, cmap='plasma', linestyle='solid', ax=ax1)
# df_ttwa_hh_mg.plot(column='count', legend=True, cmap='plasma',figsize=(180,60),ax=ax1)

f1.subplots_adjust(top=0.9)
f1.suptitle(title_text, x=0.01, size=30, **tfont)
# f1.savefig(PATH + "/Img/" + title_text + ".png", dpi=300)
Out[104]:
Text(0.01, 0.98, 'Spatial Distribution of the entry rate index of tech cluster in England TTWA')

Industry Mix vs. Dynamics

[Figure: Industry Mix vs. Dynamics]

Regression

  1. Hypothesis
  2. Data preparation
  3. Distribution EDA
  4. OLS with standard errors
  5. VIF
  6. Regression
  7. Residual analysis

Data Preparation

In [9]:
# load the firm-level dataset again for the regression
import pandas as pd
import sys
PATH = sys.path[0]

df = pd.read_csv(PATH + "/Dataset/fame_OC_tech_firm.csv",low_memory=False)
df = df.drop({"Unnamed: 0","Unnamed: 0.1","Unnamed: 0.1.1","is_tech"},axis = 1)

# preprocess the data
# drop the null value record in birth year and ttwa(travel to work area) columns
df = df.dropna(subset = {"birth_year","ttwa"})

# England's filter can be processed in the final stage with merge function
# filter by `birth_year` (keep firms born from 1998 onwards)
df = df[df["birth_year"]>=1998].reset_index(drop=True)

# select data in below columns
df = df[["registered_number",\
        "ttwa",\
        "birth_year",\
        "diss_year",\
        "latest_accounts_assets"]]

df.head()
df_EnR_i_t = df.copy()
df_EnR_i_t.head()
Out[9]:
registered_number ttwa birth_year diss_year latest_accounts_assets
0 SC586896 S22000047 2018.0 NaN NaN
1 SC288343 S22000047 2005.0 NaN 1751549.0
2 SC500426 S22000047 2015.0 NaN NaN
3 SC216869 S22000047 2001.0 NaN 169943.0
4 SC366379 S22000047 2009.0 NaN NaN
In [10]:
df1 = df_EnR_i_t.copy()
In [11]:
# MITO CODE START (DO NOT EDIT)

from mitosheet import * # Import necessary functions from Mito
register_analysis('UUID-e910eec5-8991-430a-8ba4-f9ddcc5283bc') # Let Mito know which analysis is being run

# Step 1

df2 = pd.DataFrame(data={})
  
# MITO CODE END (DO NOT EDIT)
In [14]:
# MITO CODE START (DO NOT EDIT)

from mitosheet import * # Import necessary functions from Mito
register_analysis('UUID-e910eec5-8991-430a-8ba4-f9ddcc5283bc') # Let Mito know which analysis is being run

# Step 1

# Pivot the data
pivot_table = df1.pivot_table(
    index=['ttwa', 'birth_year'],
    values=['registered_number'],
    aggfunc={'registered_number': 'count'}
)

# Reset the column name and the indexes
df2 = pivot_table.rename_axis(None, axis=1).reset_index()
  
# MITO CODE END (DO NOT EDIT)
In [16]:
# MITO CODE START (DO NOT EDIT)

from mitosheet import * # Import necessary functions from Mito
register_analysis('UUID-e910eec5-8991-430a-8ba4-f9ddcc5283bc') # Let Mito know which analysis is being run

# Step 1

# Pivot the data
pivot_table = df1.pivot_table(
    index=['ttwa', 'birth_year'],
    values=['registered_number', 'latest_accounts_assets'],
    aggfunc={'registered_number': 'count', 'latest_accounts_assets': 'count'}
)

# Reset the column name and the indexes
df2 = pivot_table.rename_axis(None, axis=1).reset_index()
  
# MITO CODE END (DO NOT EDIT)
In [17]:
# MITO CODE START (DO NOT EDIT)

from mitosheet import * # Import necessary functions from Mito
register_analysis('UUID-e910eec5-8991-430a-8ba4-f9ddcc5283bc') # Let Mito know which analysis is being run

# Step 1

# Pivot the data
pivot_table = df1.pivot_table(
    index=['ttwa', 'birth_year'],
    values=['registered_number', 'latest_accounts_assets'],
    aggfunc={'registered_number': 'count', 'latest_accounts_assets': 'mean'}
)

# Reset the column name and the indexes
df2 = pivot_table.rename_axis(None, axis=1).reset_index()
  
# MITO CODE END (DO NOT EDIT)
In [18]:
df_EnR_i_t_pivot = df2.copy()
df_EnR_i_t_pivot = df_EnR_i_t_pivot.rename({"latest_accounts_assets":"average_assets","registered_number":"firms"},axis=1)
df_EnR_i_t_pivot.head()
Out[18]:
ttwa birth_year average_assets firms
0 E30000004 1998.0 19405.250000 10
1 E30000004 1999.0 759500.500000 6
2 E30000004 2000.0 286162.333333 8
3 E30000004 2001.0 129527.428571 8
4 E30000004 2002.0 56740.400000 11
In [19]:
# group by ttwa to compute the entry rate
df = df_EnR_i_t_pivot.copy()

df_tmp = df.groupby('ttwa').agg({"firms":"sum"}).rename({"firms":"total_firms"},axis=1).reset_index()

dfm = pd.merge(df, df_tmp, left_on = 'ttwa', right_on = 'ttwa', how='inner') # VLOOKUP-like merge on ttwa

dfm["entry_rate"] = dfm.firms/dfm.total_firms

dfm.head()

df_EnR_i_t = dfm.copy()
In [20]:
df_EnR_i_t = df_EnR_i_t.fillna(0)
df_EnR_i_t.birth_year = df_EnR_i_t.birth_year.astype(int)
df_EnR_i_t.head()
Out[20]:
ttwa birth_year average_assets firms total_firms entry_rate
0 E30000004 1998 19405.250000 10 749 0.013351
1 E30000004 1999 759500.500000 6 749 0.008011
2 E30000004 2000 286162.333333 8 749 0.010681
3 E30000004 2001 129527.428571 8 749 0.010681
4 E30000004 2002 56740.400000 11 749 0.014686
In [21]:
df_left = df_EnR_i_t
left_key = ["ttwa","birth_year"]
print("Entry Rate Dataset is" + str(df_EnR_i_t.shape) + " because it includes other countries like Scotland, Wales")

df_right = df_hh_i_t
right_key = ["ttwa","birth_year"]
print("HHI Dataset is" + str(df_hh_i_t.shape))

# merge
df_mg = pd.merge(df_left,df_right, left_on=left_key, right_on=right_key,how = "inner").reset_index(drop=True)

df_hh_EnR = df_mg.copy()
print("HHI&EnR Dataset is" + str(df_hh_EnR.shape))

df_hh_EnR.head()
Entry Rate Dataset is(4487, 6) because it includes other countries like Scotland, Wales
HHI Dataset is(3089, 3)
HHI&EnR Dataset is(3089, 7)
Out[21]:
ttwa birth_year average_assets firms total_firms entry_rate hh
0 E30000004 1998 19405.250000 10 749 0.013351 0.120000
1 E30000004 1999 759500.500000 6 749 0.008011 0.333333
2 E30000004 2000 286162.333333 8 749 0.010681 0.156250
3 E30000004 2001 129527.428571 8 749 0.010681 0.125000
4 E30000004 2002 56740.400000 11 749 0.014686 0.090909
In [22]:
df_ttwa = pd.read_csv(PATH + "/Dataset/ttwa.csv")
df_ttwa = df_ttwa[["code","name"]]
In [32]:
df_left = df_hh_EnR
left_key = "ttwa"
print("df_left is " + str(df_left.shape))

df_right = df_ttwa
right_key = "code"
print("df_right is " + str(df_right.shape))

# merge
df_mg = pd.merge(df_left,df_right, left_on=left_key, right_on=right_key,how = "inner").reset_index(drop=True)
print("df_merge" + str(df_mg.shape))

df_hh_EnR_ttwaName = df_mg.copy()

df_hh_EnR_ttwaName = df_hh_EnR_ttwaName[["name","code","birth_year","firms","total_firms","entry_rate", "hh", "average_assets"]]
df_hh_EnR_ttwaName = df_hh_EnR_ttwaName.rename({"name":"ttwa", "code":"ttwa_code", "birth_year":"year"},axis=1)

df_hh_EnR_ttwaName.head()
df_left is (3089, 7)
df_right is (430, 2)
df_merge(3089, 9)
Out[32]:
ttwa ttwa_code year firms total_firms entry_rate hh average_assets
0 Barnsley E30000004 1998 10 749 0.013351 0.120000 19405.250000
1 Barnsley E30000004 1999 6 749 0.008011 0.333333 759500.500000
2 Barnsley E30000004 2000 8 749 0.010681 0.156250 286162.333333
3 Barnsley E30000004 2001 8 749 0.010681 0.125000 129527.428571
4 Barnsley E30000004 2002 11 749 0.014686 0.090909 56740.400000
In [33]:
df = df_hh_EnR_ttwaName.copy() # input your data
attribute = "entry_rate" # input your X

import seaborn as sns
import numpy as np
df[attribute] = np.log(df[attribute])
sns.displot(df, x=attribute,bins=80)
Out[33]:
<seaborn.axisgrid.FacetGrid at 0x7febffb11b00>
In [34]:
df = df_hh_EnR_ttwaName.copy() # input your data
attribute = "average_assets" # input your X

import seaborn as sns
import numpy as np
df[attribute] = np.log1p(df[attribute])
sns.displot(df, x=attribute,bins=40)
df[attribute].describe()
Out[34]:
count    3089.000000
mean       10.677960
std         2.702829
min         0.000000
25%        10.502893
50%        11.213327
75%        11.915154
max        15.959450
Name: average_assets, dtype: float64

Can we drop the values close to 0?

In [35]:
# share of records with average_assets below the threshold e**5 - 1
import math
assets_level = (math.e)**5 - 1
print(assets_level)
df_hh_EnR_ttwaName[df_hh_EnR_ttwaName.average_assets < assets_level].shape[0] / df_hh_EnR_ttwaName.shape[0]
147.41315910257657
Out[35]:
0.0534153447717708

Records with average assets below e^5 - 1 ≈ 147 (log1p values of roughly 0 to 5) account for only about 5.3% of the dataset, so these near-zero values can be dropped.

In [36]:
df_hh_EnR_ttwaName_asset_not_drop = df_hh_EnR_ttwaName
# drop the asset which can cause outlier
df_hh_EnR_ttwaName = df_hh_EnR_ttwaName[df_hh_EnR_ttwaName["average_assets"]>=assets_level]
In [40]:
# then test the distribution again
df = df_hh_EnR_ttwaName.copy()
attribute = "average_assets"

import seaborn as sns
import numpy as np
df[attribute] = np.log1p(df[attribute])
sns.displot(df, x=attribute,bins=80)
Out[40]:
<seaborn.axisgrid.FacetGrid at 0x7fec272f34e0>
In [45]:
# export the df_hh_EnR_ttwaName_i_t_asset_not_drop
df_hh_EnR_ttwaName_asset_not_drop.to_csv(PATH + "/Dataset/df_hh_EnR_ttwaName_asset_not_drop.csv")

Hypothesis

$\ Dynamics\ Variable_{i,t}\ =\ a_1 + b_1 \ firms_{i,t} + b_2 \ HHI_{i,t} + b_3 \ location_i + b_4 \ year_t + e_{i,t} $ (1)

$\ log_e\ {(Firm\ Performance_{i,t}+1)}\ =\ c_1 + d_1 \ firms_{i,t} + d_2 \ HHI_{i,t} + d_3 \ dynamics_{i,t} + d_4 \ location_i + d_5 \ year_t + e_{i,t}$ (2)

  • $t$ means year -- birth_year
  • $i$ means location -- ttwa (travel to work area)
  • $Dynamics\ Variable_{i,t}$ in (1) means Entry_Rate in $t$ and $i$
  • $dynamics_{i,t}$ in (2) means Entry_Rate in a specific $t$ and $i$
  • $firms_{i,t}$ means Count_of_tech_firms in a specific $t$ and $i$
  • $HHI_{i,t}$ means Herfindahl-Hirschman Index in a specific $t$ and $i$

OLS with std error

In (1), Y is entry_rate; X includes year, firms, total_firms, hh and the ttwa dummies.

In (2), Y is average_assets (log-transformed); X additionally includes entry_rate.

In [197]:
# environment preparation
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_rows', 300) # specifies number of rows to show
pd.options.display.float_format = '{:40,.4f}'.format # specifies default number format to 4 decimal places
plt.style.use('ggplot') # specifies that graphs should use ggplot styling
%matplotlib inline

Introduce the correlation analysis:

In [198]:
df = df_hh_EnR_ttwaName.copy()

df.year = df.year.astype("int")
df = df.drop({"ttwa","ttwa_code"},axis=1)

# dummy-variable handling (used in the get_dummies cell below):
# set the reference category by dropping one dummy column, e.g. ttwa_London
# df = df.drop(['ttwa_London'], axis=1)

# correlation analysis
plt.rcParams["axes.grid"] = False
f = plt.figure(figsize=(19, 15))
plt.matshow(df.corr(), fignum=f.number)
plt.xticks(range(df.shape[1]), df.columns, fontsize=14, rotation=45)
plt.yticks(range(df.shape[1]), df.columns, fontsize=14)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)
plt.title('Correlation Matrix', fontsize=16)

# df = pd.get_dummies(df)
Out[198]:
Text(0.5, 1.0, 'Correlation Matrix')

Introduce the dummy variables:

Dummies

In [199]:
df = df_hh_EnR_ttwaName.copy()

# intro dummy variables
# df.year = df.year.astype("int")
df = df.drop({"ttwa_code"},axis=1)
df = pd.get_dummies(df)

# set the reference category by dropping one dummy column, e.g. ttwa_London
# (done for the top-10 model later; kept commented out here)
# df = df.drop(['ttwa_London'], axis=1)

# gd means get dummies
df_hh_EnR_gd = df.copy()

VIF

In [200]:
# VIF method
from statsmodels.stats.outliers_influence import variance_inflation_factor 
from statsmodels.tools.tools import add_constant

def drop_column_using_vif_(df, list_var_not_to_remove, thresh=5):
    '''
    Calculates the VIF for each feature in a pandas dataframe and repeatedly drops the column with the highest VIF.
    A constant must be added for variance_inflation_factor or the results will be incorrect.

    :param df: the pandas dataframe containing only the predictor features, not the response variable
    :param list_var_not_to_remove: the list of variables that should not be removed even if they have a high VIF, for example dummy (indicator) variables representing a categorical variable with three or more categories
    :param thresh: the max VIF value before the feature is removed from the dataframe
    :return: dataframe with multicollinear features removed
    '''
    while True:
        # add a constant term to the data
        df_with_const = add_constant(df)

        vif_df = pd.Series([variance_inflation_factor(df_with_const.values, i) 
               for i in range(df_with_const.shape[1])], name= "VIF",
              index=df_with_const.columns).to_frame()

        # drop the const
        vif_df = vif_df.drop('const').drop(list_var_not_to_remove)
        
        print('Max VIF:', vif_df.VIF.max())
        
        # if the largest VIF is above the thresh, remove a variable with the largest VIF
        if vif_df.VIF.max() > thresh:
            # If there are multiple variables with the maximum VIF, choose the first one
            index_to_drop = vif_df.index[vif_df.VIF == vif_df.VIF.max()].tolist()[0]
            print('Dropping: {}'.format(index_to_drop))
            df = df.drop(columns = index_to_drop)
        else:
            # No VIF is above threshold. Exit the loop
            break

    return df

All Tech Firms in England (OLS)

In (1):

$\ {Dynamics\ Variable_{i,t}}^{0.5}\ =\ a_1 + b_1 \times firms_{i,t} + b_2 \times HHI_{i,t} + b_3 \times location_i + b_4 \times year_t$

In [ ]:
predictors_X = df_hh_EnR_gd.drop({'entry_rate','average_assets'}, axis=1) # drop the Y_1 and Y_2
response_Y = np.sqrt(df_hh_EnR_gd['entry_rate'])
list_var_not_to_remove = ["year","firms","hh"] # this is a list of dummy variables that represent a categorical variable with three or more categories. They should not be removed even if it has a high VIF. 

df_predictors_select_VIF = drop_column_using_vif_(predictors_X, list_var_not_to_remove, thresh=5)
print("The columns remaining after VIF selection are:")
print(df_predictors_select_VIF.columns)

regressor_OLS = sm.OLS(endog=response_Y, exog=sm.add_constant(df_predictors_select_VIF)).fit()
regressor_OLS.summary()

R^2 = 0.78
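The fit above reports the default non-robust covariance. For the "OLS with standard errors" step, heteroskedasticity-robust standard errors could be requested instead via statsmodels' cov_type argument; a minimal sketch reusing the VIF-selected predictors from the cell above:

```python
# Minimal sketch: refit the same model with HC3 heteroskedasticity-robust standard errors.
# Coefficients are unchanged; only the standard errors / p-values differ from the fit above.
import statsmodels.api as sm

robust_OLS = sm.OLS(endog=response_Y,
                    exog=sm.add_constant(df_predictors_select_VIF)).fit(cov_type="HC3")
robust_OLS.summary()
```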

In (2):

$\ log_e\ {(Firm\ Performance_{i,t}+1)}\ =\ c_1 + d_1 \times firms_{i,t} + d_2 \times HHI_{i,t} + d_3 \times dynamics_{i,t} + d_4 \times location_i + d_5 \times year_t$ (2)

In [ ]:
predictors_X = df_hh_EnR_gd.drop({'average_assets'}, axis=1) # drop the Y_1 and Y_2
response_Y = np.log1p(df_hh_EnR_gd['average_assets'])
list_var_not_to_remove = ["entry_rate","year","firms","hh"] # this is a list of dummy variables that represent a categorical variable with three or more categories. They should not be removed even if it has a high VIF. 

df_predictors_select_VIF = drop_column_using_vif_(predictors_X, list_var_not_to_remove, thresh=5)
print("The columns remaining after VIF selection are:")
print(df_predictors_select_VIF.columns)

regressor_OLS = sm.OLS(endog=response_Y, exog=sm.add_constant(df_predictors_select_VIF)).fit()
regressor_OLS.summary()

R^2 = 0.43

Top 10 Tech Cluster (OLS)

Data preparation for the top 10 clusters

In [115]:
df = pd.read_csv(PATH + "/Dataset/fame_OC_tech_firm.csv",low_memory=False)
df = df.drop({"Unnamed: 0","Unnamed: 0.1","Unnamed: 0.1.1","is_tech"},axis = 1)
df = df.dropna(subset = {"birth_year","ttwa"})

# Generate the top 10 tech clusters table based on the ttwa
top_10_ttwa = df.ttwa.value_counts().head(10)
top_10_ttwa = pd.DataFrame(top_10_ttwa).reset_index()
top_10_ttwa.columns = ["ttwa_code","counts"]
In [263]:
df_ttwa = pd.read_csv(PATH + "/Dataset/ttwa.csv")
df_ttwa = pd.merge(top_10_ttwa, df_ttwa[["code","name"]], left_on="ttwa_code",right_on="code")
df_ttwa_10 = df_ttwa[["ttwa_code", "name","counts"]]

# df_10_tc means "top 10 tech ttwa clusters"
df_10_tc = df_hh_EnR_ttwaName.copy()
# filter data to top 10 ttwa data raws
df_10_tc = df_10_tc[df_10_tc.ttwa_code.isin(df_ttwa_10.ttwa_code.to_list())].reset_index(drop=True)

df_10_tc = df_10_tc.drop({"ttwa_code"},axis=1)
# get dummies
df_10_tc = pd.get_dummies(df_10_tc)
df_10_tc = df_10_tc.drop({"ttwa_London"},axis=1)

# delete rows identified as residual outliers
df_10_tc = df_10_tc.drop(index=[182,92,141])

In (1):

$\ log_e\ {Dynamics\ Variable_{i,t}}\ =\ a_1 + b_1 \times firms_{i,t} + b_2 \times HHI_{i,t} + b_3 \times location_i + b_4 \times year_t$

In [212]:
df = df_10_tc.copy()

predictors_X = df.drop({'entry_rate','average_assets'}, axis=1) # drop the Y_1 and Y_2
#response_Y = df['entry_rate']
response_Y = np.log(df['entry_rate']) # log transform chosen based on the residual analysis
list_var_not_to_remove = ["year","firms","hh"] # this is a list of dummy variables that represent a categorical variable with three or more categories. They should not be removed even if it has a high VIF. 

df_predictors_select_VIF = drop_column_using_vif_(predictors_X, list_var_not_to_remove, thresh=5)
print("The columns remaining after VIF selection are:")
print(df_predictors_select_VIF.columns)

regressor_OLS = sm.OLS(endog=response_Y, exog=sm.add_constant(df_predictors_select_VIF)).fit()
regressor_OLS.summary()
Max VIF: inf
Dropping: total_firms
Max VIF: 3.598233664337501
The columns remaining after VIF selection are:
Index(['year', 'firms', 'hh', 'ttwa_Birmingham', 'ttwa_Bournemouth',
       'ttwa_Brighton', 'ttwa_Bristol', 'ttwa_Cambridge',
       'ttwa_Guildford and Aldershot', 'ttwa_Luton', 'ttwa_Manchester',
       'ttwa_Slough and Heathrow'],
      dtype='object')
Out[212]:
OLS Regression Results
Dep. Variable: entry_rate R-squared: 0.959
Model: OLS Adj. R-squared: 0.957
Method: Least Squares F-statistic: 373.3
Date: Tue, 03 Aug 2021 Prob (F-statistic): 4.84e-125
Time: 22:09:46 Log-Likelihood: 42.872
No. Observations: 203 AIC: -59.74
Df Residuals: 190 BIC: -16.67
Df Model: 12
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -290.6295 5.694 -51.038 0.000 -301.862 -279.397
year 0.1427 0.003 50.089 0.000 0.137 0.148
firms 3.236e-05 8.76e-06 3.696 0.000 1.51e-05 4.96e-05
hh 4.1339 0.367 11.267 0.000 3.410 4.858
ttwa_Birmingham 0.2020 0.077 2.634 0.009 0.051 0.353
ttwa_Bournemouth -0.8259 0.090 -9.130 0.000 -1.004 -0.647
ttwa_Brighton -0.1465 0.079 -1.866 0.064 -0.301 0.008
ttwa_Bristol 0.3233 0.077 4.180 0.000 0.171 0.476
ttwa_Cambridge 0.3363 0.077 4.343 0.000 0.184 0.489
ttwa_Guildford and Aldershot 0.3654 0.078 4.703 0.000 0.212 0.519
ttwa_Luton -0.1255 0.081 -1.553 0.122 -0.285 0.034
ttwa_Manchester 0.2522 0.074 3.392 0.001 0.106 0.399
ttwa_Slough and Heathrow 0.1100 0.076 1.450 0.149 -0.040 0.260
Omnibus: 0.644 Durbin-Watson: 1.454
Prob(Omnibus): 0.725 Jarque-Bera (JB): 0.744
Skew: 0.022 Prob(JB): 0.689
Kurtosis: 2.707 Cond. No. 1.10e+06


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.1e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
In [213]:
plt.scatter(regressor_OLS.fittedvalues, regressor_OLS.resid)
# adding title and labels
plt.xlabel('Fitted Entry Rate')
plt.ylabel('Residual')
plt.title('Figure')
#plt.ylim(-1,1)
plt.show()

In (2):

$\ log_e\ {(Firm\ Performance_{i,t}+1)}\ =\ c_1 + d_1 \times firms_{i,t} + d_2 \times HHI_{i,t} + d_3 \times dynamics_{i,t} + d_4 \times location_i + d_5 \times year_t$ (2)

In [264]:
df = df_10_tc.copy()

predictors_X = df.drop({'average_assets'}, axis=1) # drop the Y_1 and Y_2
response_Y = np.log1p(df['average_assets'])
list_var_not_to_remove = ["entry_rate","year","firms","hh"] # this is a list of dummy variables that represent a categorical variable with three or more categories. They should not be removed even if it has a high VIF. 

df_predictors_select_VIF = drop_column_using_vif_(predictors_X, list_var_not_to_remove, thresh=5)
print("The columns remaining after VIF selection are:")
print(df_predictors_select_VIF.columns)

regressor_OLS = sm.OLS(endog=response_Y, exog=sm.add_constant(df_predictors_select_VIF)).fit()
regressor_OLS.summary()
Max VIF: inf
Dropping: total_firms
Max VIF: 4.045694598536752
The columns remaining after VIF selection are:
Index(['year', 'firms', 'entry_rate', 'hh', 'ttwa_Birmingham',
       'ttwa_Bournemouth', 'ttwa_Brighton', 'ttwa_Bristol', 'ttwa_Cambridge',
       'ttwa_Guildford and Aldershot', 'ttwa_Luton', 'ttwa_Manchester',
       'ttwa_Slough and Heathrow'],
      dtype='object')
Out[264]:
OLS Regression Results
Dep. Variable: average_assets R-squared: 0.824
Model: OLS Adj. R-squared: 0.811
Method: Least Squares F-statistic: 66.86
Date: Tue, 03 Aug 2021 Prob (F-statistic): 7.71e-63
Time: 22:46:32 Log-Likelihood: -41.356
No. Observations: 200 AIC: 110.7
Df Residuals: 186 BIC: 156.9
Df Model: 13
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 150.1955 18.208 8.249 0.000 114.275 186.117
year -0.0688 0.009 -7.569 0.000 -0.087 -0.051
firms 3.411e-05 1.93e-05 1.764 0.079 -4.03e-06 7.22e-05
entry_rate -7.4339 1.609 -4.619 0.000 -10.609 -4.259
hh 1.3541 0.626 2.163 0.032 0.119 2.589
ttwa_Birmingham -0.2711 0.127 -2.136 0.034 -0.522 -0.021
ttwa_Bournemouth -0.6849 0.146 -4.682 0.000 -0.974 -0.396
ttwa_Brighton -0.5555 0.129 -4.315 0.000 -0.810 -0.302
ttwa_Bristol -0.2290 0.129 -1.777 0.077 -0.483 0.025
ttwa_Cambridge -0.2020 0.130 -1.559 0.121 -0.458 0.054
ttwa_Guildford and Aldershot -0.0727 0.129 -0.564 0.573 -0.327 0.182
ttwa_Luton -0.2907 0.131 -2.218 0.028 -0.549 -0.032
ttwa_Manchester -0.1381 0.123 -1.119 0.264 -0.381 0.105
ttwa_Slough and Heathrow -0.1937 0.123 -1.573 0.117 -0.437 0.049
Omnibus: 5.347 Durbin-Watson: 1.922
Prob(Omnibus): 0.069 Jarque-Bera (JB): 7.800
Skew: 0.023 Prob(JB): 0.0202
Kurtosis: 3.966 Cond. No. 1.99e+06


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.99e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
In [265]:
plt.scatter(regressor_OLS.fittedvalues, regressor_OLS.resid)
# adding title and labels
plt.xlabel('Fitted Average Assets')
plt.ylabel('Residual')
plt.title('Figure')
# plt.ylim(-0.5,0.5)
plt.show()

In England, firm performance appears to decline over the years (the year coefficient is negative). A higher HH index is associated with higher performance, while a higher entry rate is associated with lower assets. These results should be discussed against the literature in the next section.

Moran's I and GWR

In [46]:
# value include dynamics, industry mix and firms' performance
df1 = df_hh_EnR_ttwaName.copy()
df1.shape
Out[46]:
(2924, 8)
In [47]:
df1.describe()
Out[47]:
year firms total_firms entry_rate hh average_assets
count 2924.000000 2924.000000 2924.000000 2924.000000 2924.000000 2.924000e+03
mean 2007.638167 113.609097 2541.871409 0.043932 0.151608 1.337131e+05
std 5.759616 667.562166 9758.846630 0.032065 0.155557 2.567358e+05
min 1998.000000 1.000000 54.000000 0.002457 0.039683 1.720000e+02
25% 2003.000000 10.000000 319.000000 0.017731 0.074395 4.152866e+04
50% 2008.000000 27.500000 784.000000 0.033211 0.101514 7.965773e+04
75% 2013.000000 76.000000 2007.000000 0.066359 0.157025 1.553704e+05
max 2018.000000 22353.000000 113412.000000 0.197096 1.000000 8.532989e+06
In [48]:
df1[df1.hh == 1].ttwa.unique() # TTWAs with hh == 1 (all tech firms in a single industry) in at least one year
Out[48]:
array(['Skipton', 'Hexham and Haltwhistle', 'Minehead',
       'Penrith and Appleby', 'Whitby', 'Barrow-in-Furness', 'Bideford',
       'Boston', 'Bridgwater', 'Bridlington', 'Bridport', 'Bude',
       'Buxton', 'Clacton', 'Cromer and Sheringham', 'Darlington',
       'Great Yarmouth', 'Hartlepool', 'Kingsbridge and Dartmouth',
       'Launceston', 'Liskeard', 'Lowestoft', 'Ludlow', 'Penzance',
       'Scarborough', 'Scunthorpe', 'Skegness and Louth', 'Wadebridge',
       'Whitehaven', 'Wisbech', 'Workington'], dtype=object)
In [49]:
df1.head()
Out[49]:
ttwa ttwa_code year firms total_firms entry_rate hh average_assets
0 Barnsley E30000004 1998 10 749 0.013351 0.120000 19405.250000
1 Barnsley E30000004 1999 6 749 0.008011 0.333333 759500.500000
2 Barnsley E30000004 2000 8 749 0.010681 0.156250 286162.333333
3 Barnsley E30000004 2001 8 749 0.010681 0.125000 129527.428571
4 Barnsley E30000004 2002 11 749 0.014686 0.090909 56740.400000
In [270]:
# export df1 to MySQL to operate
# df1.to_csv(PATH + "/MySQL/df_hh_EnR_ttwaName.csv")

Excel operation (with reference to the equivalent MySQL operation ...)

DROP TABLE IF EXISTS tech_cluster_hh_enr_ttwa_for_r;
CREATE TABLE tech_cluster_hh_enr_ttwa_for_r
SELECT * FROM (select    t1.ttwa
                        ,t1.ttwa_code
                        ,sum(t1.firms) as entry_total
                        ,avg(t1.total_firms) as grand_total
                        ,avg(t1.hh) as hhi
                        ,avg(t1.entry_rate) as entry_rate
                        ,avg(t1.average_latest_accounts_assets) as assets
                        from `casa-dissertation`.df_hh_enr_for_r as t1
                        group by t1.ttwa) as t2

Rebuild df1 (via the Mito pivot below; the MySQL read is kept as a commented-out alternative)

In [50]:
# MITO CODE START (DO NOT EDIT)

from mitosheet import * # Import necessary functions from Mito
register_analysis('UUID-20dc35da-8f74-48f1-9355-810587cb82e8') # Let Mito know which analysis is being run

# Step 1

df2 = pd.DataFrame(data={})
  
# MITO CODE END (DO NOT EDIT)
In [51]:
# MITO CODE START (DO NOT EDIT)

from mitosheet import * # Import necessary functions from Mito
register_analysis('UUID-20dc35da-8f74-48f1-9355-810587cb82e8') # Let Mito know which analysis is being run

# Step 1

# Pivot the data
pivot_table = df1.pivot_table(
    index=['ttwa', 'ttwa_code'],
    values=['firms', 'total_firms', 'hh', 'entry_rate', 'average_assets'],
    aggfunc={'firms': 'sum', 'total_firms': 'mean', 'hh': 'mean', 'entry_rate': 'mean', 'average_assets': 'mean'}
)

# Reset the column name and the indexes
df2 = pivot_table.rename_axis(None, axis=1).reset_index()
  
# MITO CODE END (DO NOT EDIT)
In [52]:
# df1 = pd.read_csv(PATH + "/MySQL/tech_cluster_hh_enr_ttwa_for_r.csv",",")
In [53]:
df1 = df2.copy()
In [54]:
type(g_ttwa_England)
type(df1)
Out[54]:
pandas.core.frame.DataFrame
In [55]:
df_left = g_ttwa_England
left_key = 'TTWA11CD'

df_right = df1
right_key = "ttwa_code"

# merge
df_mg = pd.merge(df_left,df_right, left_on = left_key,right_on=right_key,how = "inner").reset_index(drop=True)
type(df_mg)
Out[55]:
geopandas.geodataframe.GeoDataFrame
In [56]:
df_mg.shape
Out[56]:
(149, 13)
In [59]:
# df_mg.to_csv(PATH + "/HHI_EnR_ttwa_England_for_R.csv")
import geopandas as gpd
import fiona

# asset is not dropped
df_mg.to_file(PATH + "/Dataset/Spatial/HHI_EnR_ttwa_England_for_R_assetnd.geojson", driver='GeoJSON')
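The Moran's I and GWR analyses themselves appear to be run in R on the exported GeoJSON. For reference only, a minimal Python sketch of how the same tests could be approached with the esda, libpysal and mgwr packages (all assumed to be installed; this is not the notebook's actual workflow) is given below.

```python
# Minimal sketch: global Moran's I and a GWR fit on the exported layer.
# Assumes the GeoJSON written above and the esda / libpysal / mgwr packages.
import geopandas as gpd
from libpysal.weights import Queen
from esda.moran import Moran
from mgwr.sel_bw import Sel_BW
from mgwr.gwr import GWR

gdf = gpd.read_file(PATH + "/Dataset/Spatial/HHI_EnR_ttwa_England_for_R_assetnd.geojson")

# Global Moran's I for the HH index (queen-contiguity weights, row-standardised)
w = Queen.from_dataframe(gdf)
w.transform = "r"
mi = Moran(gdf["hh"].values, w)
print("Moran's I:", mi.I, "p-value (permutation):", mi.p_sim)

# GWR: entry rate explained by HHI and firm counts, with TTWA centroids as coordinates
# (for a real analysis, project to a metric CRS before taking centroids)
coords = list(zip(gdf.geometry.centroid.x, gdf.geometry.centroid.y))
y = gdf[["entry_rate"]].values
X = gdf[["hh", "firms"]].values
bw = Sel_BW(coords, y, X).search()          # bandwidth selection
gwr_results = GWR(coords, y, X, bw).fit()   # local coefficients in gwr_results.params
print("GWR bandwidth:", bw, "AICc:", gwr_results.aicc)
```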