Spatial Research

1 Overview

  • The HH index is added to the OLS regression experiment (top 10 tech clusters in England)
  • Residual analysis is carried out to examine the performance of the regressions
  • Spatial patterns are detected with Moran's I and GWR

1.1 Methodology

  • Data preparation
  • OLS regression
    • Hypothesis
    • Distribution correction
    • OLS with standard errors
    • VIF (remove multicollinearity)
    • Regression
    • Residual analysis
  • Moran's I test
  • GWR

According to the EDA, most tech firms are clustered in England, so the next step of this research focuses on England; the ctry column is used as a filter to select tech firms where country == 'England'. The HHI is calculated with the formula below:

$$ Herfindahl\text{-}Hirschman\ Index_{i,t} = \sum_{k=1}^{K} \left(\frac{Tech\ Firms_{i,t,k}}{Total\ Tech\ Firms_{i,t}}\right)^2 $$

where the HH index is calculated over industries k = 1, ..., K in location i and year t. For each industry k, the number of tech firms in that industry is divided by the total number of tech firms in location i and year t, and the squared shares are summed. The index measures how concentrated (and, conversely, how diverse) the tech cluster is.
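As a hypothetical worked example (numbers invented for illustration): if a TTWA has 10 tech firms in a given year, split 6 / 3 / 1 across three SIC codes, then

$$ HHI = \left(\tfrac{6}{10}\right)^2 + \left(\tfrac{3}{10}\right)^2 + \left(\tfrac{1}{10}\right)^2 = 0.36 + 0.09 + 0.01 = 0.46 $$

A value near 1 indicates a cluster dominated by a single industry, while the minimum value 1/K indicates an even mix across K industries.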

The hypotheses are given below (a minimal fitting sketch follows the variable definitions):

$\ log_e\ {Dynamics\ Variable_{i,t}}\ =\ a_1 + b_1 \times firms_{i,t} + b_2 \times HHI_{i,t} + b_3 \times location_i + b_4 \times year_t$ (1)

$\ log_e\ {(Firm\ Performance_{i,t}+1)}\ =\ c_1 + d_1 \times firms_{i,t} + d_2 \times HHI_{i,t} + d_3 \times dynamics_{i,t} + d_4 \times location_i + d_5 \times year_t$ (2)

  • $t$ means year -- birth_year
  • $i$ means location -- ttwa (travel to work area)
  • $Dynamics\ Variable_{i,t}$ in (1) means Entry_Rate in $t$ and $i$
  • $dynamics_{i,t}$ in (2) means Entry_Rate in a specific $t$ and $i$
  • $firms_{i,t}$ means Count_of_tech_firms in a specific $t$ and $i$
  • $HHI_{i,t}$ means Herfindahl-Hirschman Index in a specific $t$ and $i$
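For orientation only, hypothesis (1) could be expressed with the statsmodels formula API roughly as sketched below. This is a minimal sketch, not the notebook's actual fitting code (which uses pd.get_dummies plus sm.OLS in section 2); the column names entry_rate, firms, hh, year and ttwa are assumed to match the merged dataset built there.

```python
# Minimal sketch of hypothesis (1) using the statsmodels formula API (assumed column names).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def fit_entry_rate_model(df: pd.DataFrame):
    # log_e(entry_rate) ~ firm count + HHI + year + TTWA fixed effects (dummy-coded via C(ttwa))
    model = smf.ols("np.log(entry_rate) ~ firms + hh + year + C(ttwa)", data=df)
    return model.fit()

# Example (hypothetical call, once the merged dataframe exists):
# results = fit_entry_rate_model(df_hh_EnR_ttwaName)
# print(results.summary())
```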

1.2 Results

Industry Mix vs. Dynamics:

[Figure: Industry Mix vs. Dynamics]

  • OLS regression for (1)

  • Residual analysis for (1)

  • OLS regression for (2)

  • Residual analysis for (2)

2 Reproducible Analysis: HH index

The HH index is used to measure tech industry diversity. To calculate the industry-mix level, the primary SIC code (primary_uk_sic_2007) is used in a group-by operation.
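As a rough cross-check of the MySQL query shown later, the same group-by could be sketched in pandas as below. This is a minimal sketch, not the notebook's own calculation (which is done in MySQL); it assumes a dataframe with the ttwa, birth_year, primary_uk_sic_2007 and registered_number columns prepared in the next cell.

```python
# Minimal pandas sketch of the HHI calculation (the notebook's own calculation is the MySQL query below).
# Assumes columns: ttwa, birth_year, primary_uk_sic_2007, registered_number.
import pandas as pd

def hhi_by_ttwa_year(df: pd.DataFrame) -> pd.DataFrame:
    # firms per (ttwa, year, SIC code)
    counts = (df.groupby(["ttwa", "birth_year", "primary_uk_sic_2007"])["registered_number"]
                .count()
                .rename("firms_k")
                .reset_index())
    # total firms per (ttwa, year)
    totals = (df.groupby(["ttwa", "birth_year"])["registered_number"]
                .count()
                .rename("year_total")
                .reset_index())
    merged = counts.merge(totals, on=["ttwa", "birth_year"])
    merged["sub_hh"] = (merged["firms_k"] / merged["year_total"]) ** 2
    # sum the squared shares over SIC codes -> HHI per (ttwa, year)
    return (merged.groupby(["ttwa", "birth_year"])["sub_hh"]
                  .sum()
                  .rename("hh")
                  .reset_index())

# Hypothetical usage, once dfp_dna (built in the next cell) exists:
# hh_i_t = hhi_by_ttwa_year(dfp_dna)
```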

According to the EDA, most tech firms are clustered in England, so this analysis focuses on England; the ctry column can be used as a filter to select country == 'England' tech firms.

In [58]:
#read data
import sys
import pandas as pd
PATH = sys.path[0]
df = pd.read_csv(PATH + "/Dataset/fame_OC_tech_firm.csv",low_memory = False)
dfp = df.drop({"Unnamed: 0","Unnamed: 0.1","Unnamed: 0.1.1","is_tech"},axis=1)

# data select
# dfp means df is processed
dfp = dfp[["registered_number","diss_year","birth_year","primary_uk_sic_2007","ttwa","ctry"]]

# dna means drop na value
dfp_dna = dfp.dropna(subset=["birth_year","ttwa"]).reset_index(drop=True)

# introduce the ttwa to filter england firms
# dfp_dna = dfp_dna[dfp_dna.ctry=="E92000001"]
dfp_dna = dfp_dna[dfp_dna.birth_year >= 1998]

#export to the excel to do operation
df_excel = dfp_dna.copy() 
# df_excel.to_excel(PATH + "/tech_cluster_ttwa_in_England.xlsx")
In [63]:
# process the .csv file
df_excel = pd.read_excel(PATH + "/MySQL/tech_cluster_ttwa_in_England.xlsx")
df_excel = df_excel.fillna(0)
# df_excel.to_csv(PATH + "/MySQL/tech_cluster_ttwa_in_England.csv")
In [62]:
df_excel.head()
Out[62]:
registered_number diss_year birth_year primary_uk_sic_2007 ttwa
0 11445613 NaN 2018 86101 E30000272
1 09464011 NaN 2015 86900 E30000272
2 09617233 NaN 2015 73200 E30000272
3 08508950 NaN 2013 86900 E30000272
4 08631480 NaN 2013 58290 E30000272

After a pivot operation in Excel, the dataset is imported into MySQL to calculate the HH index.

The method/code is presented below:

CREATE TABLE tech_cluster_hh
SELECT t2.ttwa, t2.birth_year, sum(t2.sub_hh) as hh
FROM (SELECT t0.ttwa, 
            t0.birth_year,
            t0.primary_uk_sic_2007, 
            power(count(t0.registered_number)/t1.year_total,2) as sub_hh
            FROM `casa-dissertation`.tech_cluster_ttwa_in_England_year_98218_sic as t0,
                (SELECT ttwa, 
                        birth_year,  
                        count(registered_number) as year_total 
                        FROM `casa-dissertation`.tech_cluster_ttwa_in_England_year_98218_sic
                        GROUP BY ttwa, birth_year) as t1
            WHERE t0.ttwa = t1.ttwa and t0.birth_year = t1.birth_year
            GROUP BY t0.ttwa, t0.birth_year, t0.primary_uk_sic_2007 ) as t2
GROUP BY t2.ttwa, t2.birth_year

Then the processed .csv is imported to obtain the HH index at two scales: by location and year, and by location only.

HH index in TTWA and Year

method from MySQL/hh_index.sql

input from the tech_cluster_ttwa_in_England_year_98218_sic table

output from the tech_cluster_hh table

In [3]:
# HH index by location i and year t
import pandas as pd
import sys
PATH = sys.path[0]
df_hh_i_t = pd.read_csv(PATH + "/MySQL/tech_cluster_hh.csv", sep=",")
df_hh_i_t.head()
Out[3]:
ttwa birth_year hh
0 E30000237 1998 0.147377
1 E30000212 1998 0.110137
2 E30000292 1998 0.218935
3 E30000234 1998 0.084710
4 E30000245 1998 0.155709

HH index in TTWA

method from MySQL/hh_index_drop_year.sql

input from the tech_cluster_ttwa_in_England_year_98218_sic table

output from the tech_cluster_hh_ttwa table

CREATE TABLE tech_cluster_hh_ttwa
SELECT t2.ttwa, sum(t2.sub_hh_2) as hh
FROM ( SELECT   t0.ttwa, 
                t0.primary_uk_sic_2007, 
                power(count(t0.registered_number)/t1.ttwa_total,2) as sub_hh_2
                FROM `casa-dissertation`.tech_cluster_ttwa_in_England_year_98218_sic as t0,
                    (SELECT ttwa, 
                            count(registered_number) as ttwa_total 
                            FROM `casa-dissertation`.tech_cluster_ttwa_in_England_year_98218_sic
                            WHERE diss_year = 0  -- diss year's firms should be excluded
                            GROUP BY ttwa) as t1
                WHERE t0.ttwa = t1.ttwa and t0.diss_year = 0 -- diss year's firms should be excluded
                GROUP BY t0.ttwa, t0.primary_uk_sic_2007 ) as t2
GROUP BY t2.ttwa
In [70]:
# HH index in location i only (aggregated over years)
df_hh_i = pd.read_csv(PATH + "/MySQL/tech_cluster_hh_ttwa_only.csv", sep=",")
df_hh_i.head()
Out[70]:
ttwa hh
0 E30000237 0.141279
1 E30000212 0.081876
2 E30000292 0.058302
3 E30000234 0.070716
4 E30000245 0.053532

Map Visualisation

Import the geo dataset for TTWAs in England.

In [6]:
# spatial map prepare
# read spatial data
import geopandas as gpd
g_ttwa = gpd.read_file(PATH + "/Dataset/Spatial/Travel_to_Work_Areas__December_2011__Generalised_Clipped_Boundaries_in_United_Kingdom.geojson")

Following a CSDN post, the method below is used to filter the TTWA codes by their first letter (English TTWA codes start with "E").

In [7]:
# print(g_ttwa.TTWA11CD.value_counts())
# input: English TTWA codes start with the letter "E"
letter_str_1 = "E"
# letter_str_2 = "K"

# boolean mask: keep rows whose TTWA code starts with the chosen letter
mask = [code[0] == letter_str_1 for code in g_ttwa.TTWA11CD]

# output
g_ttwa_England = g_ttwa[mask].copy()
# print(g_ttwa_England.TTWA11CD.value_counts())
g_ttwa_England.plot()
Out[7]:
<AxesSubplot:>
In [8]:
# inspect the data
g_ttwa_England.head(1)
Out[8]:
OBJECTID TTWA11CD TTWA11NM Shape__Area Shape__Length geometry
0 1 E30000004 Barnsley 3.424992e+08 129308.31911 POLYGON ((-1.44681 53.61282, -1.44475 53.61061...

Industry Mix: HHI

In [47]:
# inspect the data
df_hh_i.head(1)
Out[47]:
ttwa hh
0 E30000266 0.105359
In [51]:
# merge with `df_hh_i`

df_ttwa_hh_mg = pd.merge(g_ttwa,df_hh_i, left_on='TTWA11CD',right_on="ttwa",how = "inner").reset_index(drop=True)
df_ttwa_hh_mg.head()
Out[51]:
OBJECTID TTWA11CD TTWA11NM Shape__Area Shape__Length geometry ttwa hh
0 1 E30000004 Barnsley 3.424992e+08 129308.319110 POLYGON ((-1.44681 53.61282, -1.44475 53.61061... E30000004 0.045157
1 2 E30000018 Bradford 3.438902e+08 119605.015955 POLYGON ((-1.94399 53.95619, -1.94161 53.95252... E30000018 0.081738
2 3 E30000029 Halifax 3.639425e+08 105182.259501 POLYGON ((-2.03257 53.81023, -2.03138 53.80902... E30000029 0.067777
3 4 E30000039 Skipton 1.178893e+09 197077.350855 POLYGON ((-2.30733 54.25008, -2.30511 54.24934... E30000039 0.062993
4 5 E30000046 Dorchester and Weymouth 7.200201e+08 226091.682220 MULTIPOLYGON (((-2.22501 50.61646, -2.22520 50... E30000046 0.059792
In [106]:
title_text = "Spatial Distribution of the HH index of tech cluster in England TTWA"

df_left = g_ttwa
left_key = 'TTWA11CD'

df_right = df_hh_i
right_key = "ttwa"

map_attribute = "hh"


# merge
df_mg = pd.merge(df_left,df_right, left_on = left_key,right_on=right_key,how = "inner").reset_index(drop=True)
# map
import pysal as ps
import numpy as np
import pandas as pd
import seaborn as sns
import geopandas as gpd
import matplotlib.pyplot as plt

tfont = {'fontname':'DejaVu Sans', 'horizontalalignment':'left'}
f1 = plt.figure()
f1.set_size_inches(15,12)

ax1 = f1.add_subplot()
df_mg.plot(column = map_attribute, legend=True, cmap='plasma',linestyle='solid', ax=ax1)
# df_ttwa_hh_mg.plot(column='count', legend=True, cmap='plasma',figsize=(180,60),ax=ax1)

f1.subplots_adjust(top=0.9)
f1.suptitle(title_text, x=0.01, size=30, **tfont)
f1.savefig(PATH + "/Img/" + title_text + ".png", dpi=300)
In [84]:
import seaborn as sns
import numpy as np
df = df_ttwa_hh_mg.copy()
df.hh = np.log(df.hh)
attributes = "hh"
sns.displot(df, x=attributes,bins=10)
Out[84]:
<seaborn.axisgrid.FacetGrid at 0x7fa660a1a278>

Dynamics: Entry Rate

Here the entry rate per TTWA appears to be pre-computed in the Excel sheet as the count of firms born in 2018 divided by the TTWA's total firm count (e.g. 93 / 749 ≈ 0.124 for E30000004).

In [85]:
# read data

# means Entry Rate in location i
df_EnR_i = pd.read_excel(PATH + "/Excel/tech_cluster_ttwa_in_England_EnR.xlsx", sheet_name = "ttwa_EnR")
df_EnR_i.head()
Out[85]:
ttwa 2018 total Entry_Rate
0 E30000004 93 749 0.124166
1 E30000018 334 1771 0.188594
2 E30000029 132 945 0.139683
3 E30000039 23 225 0.102222
4 E30000046 66 623 0.105939

The template mapping method is written below.

In [104]:
title_text = "Spatial Distribution of the entry rate index of tech cluster in England TTWA"

df_left = g_ttwa
left_key = 'TTWA11CD'

df_right = df_EnR_i
right_key = "ttwa"

map_attribute = "Entry_Rate"


# merge
df_mg = pd.merge(df_left,df_right, left_on = left_key,right_on=right_key,how = "inner").reset_index(drop=True)

# map
import pysal as ps
import numpy as np
import pandas as pd
import seaborn as sns
import geopandas as gpd
import matplotlib.pyplot as plt

tfont = {'fontname':'DejaVu Sans', 'horizontalalignment':'left'}
f1 = plt.figure()
f1.set_size_inches(15,12)

ax1 = f1.add_subplot()
df_mg.plot(column = map_attribute, legend=True, cmap='plasma', linestyle='solid', ax=ax1)
# df_ttwa_hh_mg.plot(column='count', legend=True, cmap='plasma',figsize=(180,60),ax=ax1)

f1.subplots_adjust(top=0.9)
f1.suptitle(title_text, x=0.01, size=30, **tfont)
# f1.savefig(PATH + "/Img/" + title_text + ".png", dpi=300)
Out[104]:
Text(0.01, 0.98, 'Spatial Distribution of the entry rate index of tech cluster in England TTWA')

Industry Mix vs. Dynamics

[Figure: Industry Mix vs. Dynamics]

Regression

  1. Hypothesis
  2. Data preparation
  3. Distribution EDA
  4. OLS with standard errors
  5. VIF
  6. Regression
  7. Residual analysis

Data Preparation

In [9]:
# load the firm-level dataset again for the regression
import pandas as pd
import sys
PATH = sys.path[0]

df = pd.read_csv(PATH + "/Dataset/fame_OC_tech_firm.csv",low_memory=False)
df = df.drop({"Unnamed: 0","Unnamed: 0.1","Unnamed: 0.1.1","is_tech"},axis = 1)

# preprocess the data
# drop the null value record in birth year and ttwa(travel to work area) columns
df = df.dropna(subset = {"birth_year","ttwa"})

# England's filter can be processed in the final stage with merge function
# filter by `birth_year` (keep firms born from 1998 onwards)
df = df[df["birth_year"]>=1998].reset_index(drop=True)

# select data in below columns
df = df[["registered_number",\
        "ttwa",\
        "birth_year",\
        "diss_year",\
        "latest_accounts_assets"]]

df.head()
df_EnR_i_t = df.copy()
df_EnR_i_t.head()
Out[9]:
registered_number ttwa birth_year diss_year latest_accounts_assets
0 SC586896 S22000047 2018.0 NaN NaN
1 SC288343 S22000047 2005.0 NaN 1751549.0
2 SC500426 S22000047 2015.0 NaN NaN
3 SC216869 S22000047 2001.0 NaN 169943.0
4 SC366379 S22000047 2009.0 NaN NaN
In [10]:
df1 = df_EnR_i_t.copy()
In [11]:
# MITO CODE START (DO NOT EDIT)

from mitosheet import * # Import necessary functions from Mito
register_analysis('UUID-e910eec5-8991-430a-8ba4-f9ddcc5283bc') # Let Mito know which analysis is being run

# Step 1

df2 = pd.DataFrame(data={})
  
# MITO CODE END (DO NOT EDIT)
In [14]:
# MITO CODE START (DO NOT EDIT)

from mitosheet import * # Import necessary functions from Mito
register_analysis('UUID-e910eec5-8991-430a-8ba4-f9ddcc5283bc') # Let Mito know which analysis is being run

# Step 1

# Pivot the data
pivot_table = df1.pivot_table(
    index=['ttwa', 'birth_year'],
    values=['registered_number'],
    aggfunc={'registered_number': 'count'}
)

# Reset the column name and the indexes
df2 = pivot_table.rename_axis(None, axis=1).reset_index()
  
# MITO CODE END (DO NOT EDIT)
In [16]:
# MITO CODE START (DO NOT EDIT)

from mitosheet import * # Import necessary functions from Mito
register_analysis('UUID-e910eec5-8991-430a-8ba4-f9ddcc5283bc') # Let Mito know which analysis is being run

# Step 1

# Pivot the data
pivot_table = df1.pivot_table(
    index=['ttwa', 'birth_year'],
    values=['registered_number', 'latest_accounts_assets'],
    aggfunc={'registered_number': 'count', 'latest_accounts_assets': 'count'}
)

# Reset the column name and the indexes
df2 = pivot_table.rename_axis(None, axis=1).reset_index()
  
# MITO CODE END (DO NOT EDIT)
In [17]:
# MITO CODE START (DO NOT EDIT)

from mitosheet import * # Import necessary functions from Mito
register_analysis('UUID-e910eec5-8991-430a-8ba4-f9ddcc5283bc') # Let Mito know which analysis is being run

# Step 1

# Pivot the data
pivot_table = df1.pivot_table(
    index=['ttwa', 'birth_year'],
    values=['registered_number', 'latest_accounts_assets'],
    aggfunc={'registered_number': 'count', 'latest_accounts_assets': 'mean'}
)

# Reset the column name and the indexes
df2 = pivot_table.rename_axis(None, axis=1).reset_index()
  
# MITO CODE END (DO NOT EDIT)
In [18]:
df_EnR_i_t_pivot = df2.copy()
df_EnR_i_t_pivot = df_EnR_i_t_pivot.rename({"latest_accounts_assets":"average_assets","registered_number":"firms"},axis=1)
df_EnR_i_t_pivot.head()
Out[18]:
ttwa birth_year average_assets firms
0 E30000004 1998.0 19405.250000 10
1 E30000004 1999.0 759500.500000 6
2 E30000004 2000.0 286162.333333 8
3 E30000004 2001.0 129527.428571 8
4 E30000004 2002.0 56740.400000 11
In [19]:
# group by ttwa to compute the entry rate
df = df_EnR_i_t_pivot.copy()

df_tmp = df.groupby('ttwa').agg({"firms":"sum"}).rename({"firms":"total_firms"},axis=1).reset_index()

dfm = pd.merge(df, df_tmp, left_on = 'ttwa', right_on = 'ttwa', how='inner') # VLOOKUP-like merge on ttwa

dfm["entry_rate"] = dfm.firms/dfm.total_firms

dfm.head()

df_EnR_i_t = dfm.copy()
In [20]:
df_EnR_i_t = df_EnR_i_t.fillna(0)
df_EnR_i_t.birth_year = df_EnR_i_t.birth_year.astype(int)
df_EnR_i_t.head()
Out[20]:
ttwa birth_year average_assets firms total_firms entry_rate
0 E30000004 1998 19405.250000 10 749 0.013351
1 E30000004 1999 759500.500000 6 749 0.008011
2 E30000004 2000 286162.333333 8 749 0.010681
3 E30000004 2001 129527.428571 8 749 0.010681
4 E30000004 2002 56740.400000 11 749 0.014686
In [21]:
df_left = df_EnR_i_t
left_key = ["ttwa","birth_year"]
print("Entry Rate Dataset is" + str(df_EnR_i_t.shape) + " because it includes other countries like Scotland, Wales")

df_right = df_hh_i_t
right_key = ["ttwa","birth_year"]
print("HHI Dataset is" + str(df_hh_i_t.shape))

# merge
df_mg = pd.merge(df_left,df_right, left_on=left_key, right_on=right_key,how = "inner").reset_index(drop=True)

df_hh_EnR = df_mg.copy()
print("HHI&EnR Dataset is" + str(df_hh_EnR.shape))

df_hh_EnR.head()
Entry Rate Dataset is(4487, 6) because it includes other countries like Scotland, Wales
HHI Dataset is(3089, 3)
HHI&EnR Dataset is(3089, 7)
Out[21]:
ttwa birth_year average_assets firms total_firms entry_rate hh
0 E30000004 1998 19405.250000 10 749 0.013351 0.120000
1 E30000004 1999 759500.500000 6 749 0.008011 0.333333
2 E30000004 2000 286162.333333 8 749 0.010681 0.156250
3 E30000004 2001 129527.428571 8 749 0.010681 0.125000
4 E30000004 2002 56740.400000 11 749 0.014686 0.090909
In [22]:
df_ttwa = pd.read_csv(PATH + "/Dataset/ttwa.csv")
df_ttwa = df_ttwa[["code","name"]]
In [32]:
df_left = df_hh_EnR
left_key = "ttwa"
print("df_left is " + str(df_left.shape))

df_right = df_ttwa
right_key = "code"
print("df_right is " + str(df_right.shape))

# merge
df_mg = pd.merge(df_left,df_right, left_on=left_key, right_on=right_key,how = "inner").reset_index(drop=True)
print("df_merge" + str(df_mg.shape))

df_hh_EnR_ttwaName = df_mg.copy()

df_hh_EnR_ttwaName = df_hh_EnR_ttwaName[["name","code","birth_year","firms","total_firms","entry_rate", "hh", "average_assets"]]
df_hh_EnR_ttwaName = df_hh_EnR_ttwaName.rename({"name":"ttwa", "code":"ttwa_code", "birth_year":"year"},axis=1)

df_hh_EnR_ttwaName.head()
df_left is (3089, 7)
df_right is (430, 2)
df_merge(3089, 9)
Out[32]:
ttwa ttwa_code year firms total_firms entry_rate hh average_assets
0 Barnsley E30000004 1998 10 749 0.013351 0.120000 19405.250000
1 Barnsley E30000004 1999 6 749 0.008011 0.333333 759500.500000
2 Barnsley E30000004 2000 8 749 0.010681 0.156250 286162.333333
3 Barnsley E30000004 2001 8 749 0.010681 0.125000 129527.428571
4 Barnsley E30000004 2002 11 749 0.014686 0.090909 56740.400000
In [33]:
df = df_hh_EnR_ttwaName.copy() # input your data
attribute = "entry_rate" # input your X

import seaborn as sns
import numpy as np
df[attribute] = np.log(df[attribute])
sns.displot(df, x=attribute,bins=80)
Out[33]:
<seaborn.axisgrid.FacetGrid at 0x7febffb11b00>
In [34]:
df = df_hh_EnR_ttwaName.copy() # input your data
attribute = "average_assets" # input your X

import seaborn as sns
import numpy as np
df[attribute] = np.log1p(df[attribute])
sns.displot(df, x=attribute,bins=40)
df[attribute].describe()
Out[34]:
count    3089.000000
mean       10.677960
std         2.702829
min         0.000000
25%        10.502893
50%        11.213327
75%        11.915154
max        15.959450
Name: average_assets, dtype: float64

Can we drop the values close to 0?

In [35]:
# share of records with average_assets below the threshold e**5 - 1
import math
assets_level = (math.e)**5 - 1
print(assets_level)
df_hh_EnR_ttwaName[df_hh_EnR_ttwaName.average_assets < assets_level].shape[0] / df_hh_EnR_ttwaName.shape[0]
147.41315910257657
Out[35]:
0.0534153447717708

Records with average assets below e^5 - 1 ≈ 147 (log1p values of roughly 0 to 5) account for only about 5.3% of the dataset, so these near-zero values can be dropped.

In [36]:
df_hh_EnR_ttwaName_asset_not_drop = df_hh_EnR_ttwaName
# drop the asset which can cause outlier
df_hh_EnR_ttwaName = df_hh_EnR_ttwaName[df_hh_EnR_ttwaName["average_assets"]>=assets_level]
In [40]:
# then test the distribution again
df = df_hh_EnR_ttwaName.copy()
attribute = "average_assets"

import seaborn as sns
import numpy as np
df[attribute] = np.log1p(df[attribute])
sns.displot(df, x=attribute,bins=80)
Out[40]:
<seaborn.axisgrid.FacetGrid at 0x7fec272f34e0>
In [45]:
# export the df_hh_EnR_ttwaName_i_t_asset_not_drop
df_hh_EnR_ttwaName_asset_not_drop.to_csv(PATH + "/Dataset/df_hh_EnR_ttwaName_asset_not_drop.csv")

Hypothesis

$\ Dynamics\ Variable_{i,t}\ =\ a_1 + b_1 \ firms_{i,t} + b_2 \ HHI_{i,t} + b_3 \ location_i + b_4 \ year_t + e_{i,t} $ (1)

$\ log_e\ {(Firm\ Performance_{i,t}+1)}\ =\ c_1 + d_1 \ firms_{i,t} + d_2 \ HHI_{i,t} + d_3 \ dynamics_{i,t} + d_4 \ location_i + d_5 \ year_t + e_{i,t}$ (2)

  • $t$ means year -- birth_year
  • $i$ means location -- ttwa (travel to work area)
  • $Dynamics\ Variable_{i,t}$ in (1) means Entry_Rate in $t$ and $i$
  • $dynamics_{i,t}$ in (2) means Entry_Rate in a specific $t$ and $i$
  • $firms_{i,t}$ means Count_of_tech_firms in a specific $t$ and $i$
  • $HHI_{i,t}$ means Herfindahl-Hirschman Index in a specific $t$ and $i$

OLS with std error

In (1), Y is entry_rate; X includes year, firms, total_firms, hh and the ttwa dummies.

In (2), Y is average_assets (log-transformed); X additionally includes entry_rate.

In [197]:
# environment preparation
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_rows', 300) # specifies number of rows to show
pd.options.display.float_format = '{:40,.4f}'.format # specifies default number format to 4 decimal places
plt.style.use('ggplot') # specifies that graphs should use ggplot styling
%matplotlib inline

Introduce the correlation analysis:

In [198]:
df = df_hh_EnR_ttwaName.copy()

df.year = df.year.astype("int")
df = df.drop({"ttwa","ttwa_code"},axis=1)

# dummy-variable handling (used in the get_dummies cell below):
# set the reference category by dropping one dummy column, e.g. ttwa_London
# df = df.drop(['ttwa_London'], axis=1)

# correlation analysis
plt.rcParams["axes.grid"] = False
f = plt.figure(figsize=(19, 15))
plt.matshow(df.corr(), fignum=f.number)
plt.xticks(range(df.shape[1]), df.columns, fontsize=14, rotation=45)
plt.yticks(range(df.shape[1]), df.columns, fontsize=14)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)
plt.title('Correlation Matrix', fontsize=16)

# df = pd.get_dummies(df)
Out[198]:
Text(0.5, 1.0, 'Correlation Matrix')

Introduce the dummy variables:

Dummies

In [199]:
df = df_hh_EnR_ttwaName.copy()

# intro dummy variables
# df.year = df.year.astype("int")
df = df.drop({"ttwa_code"},axis=1)
df = pd.get_dummies(df)

# set the reference category by dropping one dummy column, e.g. ttwa_London
# (done for the top-10 model later; kept commented out here)
# df = df.drop(['ttwa_London'], axis=1)

# gd means get dummies
df_hh_EnR_gd = df.copy()

VIF

In [200]:
# VIF method
from statsmodels.stats.outliers_influence import variance_inflation_factor 
from statsmodels.tools.tools import add_constant

def drop_column_using_vif_(df, list_var_not_to_remove, thresh=5):
    '''
    Calculates the VIF for each feature in a pandas dataframe and repeatedly drops the column with the highest VIF.
    A constant must be added for variance_inflation_factor or the results will be incorrect.

    :param df: the pandas dataframe containing only the predictor features, not the response variable
    :param list_var_not_to_remove: the list of variables that should not be removed even if they have a high VIF, for example dummy (indicator) variables representing a categorical variable with three or more categories
    :param thresh: the max VIF value before the feature is removed from the dataframe
    :return: dataframe with multicollinear features removed
    '''
    while True:
        # add a constant term to the data
        df_with_const = add_constant(df)

        vif_df = pd.Series([variance_inflation_factor(df_with_const.values, i) 
               for i in range(df_with_const.shape[1])], name= "VIF",
              index=df_with_const.columns).to_frame()

        # drop the const
        vif_df = vif_df.drop('const').drop(list_var_not_to_remove)
        
        print('Max VIF:', vif_df.VIF.max())
        
        # if the largest VIF is above the thresh, remove a variable with the largest VIF
        if vif_df.VIF.max() > thresh:
            # If there are multiple variables with the maximum VIF, choose the first one
            index_to_drop = vif_df.index[vif_df.VIF == vif_df.VIF.max()].tolist()[0]
            print('Dropping: {}'.format(index_to_drop))
            df = df.drop(columns = index_to_drop)
        else:
            # No VIF is above threshold. Exit the loop
            break

    return df

All Tech Firms in England (OLS)

In (1):

$\ {Dynamics\ Variable_{i,t}}^{0.5}\ =\ a_1 + b_1 \times firms_{i,t} + b_2 \times HHI_{i,t} + b_3 \times location_i + b_4 \times year_t$

In [ ]:
predictors_X = df_hh_EnR_gd.drop({'entry_rate','average_assets'}, axis=1) # drop the Y_1 and Y_2
response_Y = np.sqrt(df_hh_EnR_gd['entry_rate'])
list_var_not_to_remove = ["year","firms","hh"] # this is a list of dummy variables that represent a categorical variable with three or more categories. They should not be removed even if it has a high VIF. 

df_predictors_select_VIF = drop_column_using_vif_(predictors_X, list_var_not_to_remove, thresh=5)
print("The columns remaining after VIF selection are:")
print(df_predictors_select_VIF.columns)

regressor_OLS = sm.OLS(endog=response_Y, exog=sm.add_constant(df_predictors_select_VIF)).fit()
regressor_OLS.summary()

R^2 = 0.78
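The fit above reports the default non-robust covariance. For the "OLS with standard errors" step, heteroskedasticity-robust standard errors could be requested instead via statsmodels' cov_type argument; a minimal sketch reusing the VIF-selected predictors from the cell above:

```python
# Minimal sketch: refit the same model with HC3 heteroskedasticity-robust standard errors.
# Coefficients are unchanged; only the standard errors / p-values differ from the fit above.
import statsmodels.api as sm

robust_OLS = sm.OLS(endog=response_Y,
                    exog=sm.add_constant(df_predictors_select_VIF)).fit(cov_type="HC3")
robust_OLS.summary()
```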

In (2):

$\ log_e\ {(Firm\ Performance_{i,t}+1)}\ =\ c_1 + d_1 \times firms_{i,t} + d_2 \times HHI_{i,t} + d_3 \times dynamics_{i,t} + d_4 \times location_i + d_5 \times year_t$ (2)

In [ ]:
predictors_X = df_hh_EnR_gd.drop({'average_assets'}, axis=1) # drop the Y_1 and Y_2
response_Y = np.log1p(df_hh_EnR_gd['average_assets'])
list_var_not_to_remove = ["entry_rate","year","firms","hh"] # this is a list of dummy variables that represent a categorical variable with three or more categories. They should not be removed even if it has a high VIF. 

df_predictors_select_VIF = drop_column_using_vif_(predictors_X, list_var_not_to_remove, thresh=5)
print("The columns remaining after VIF selection are:")
print(df_predictors_select_VIF.columns)

regressor_OLS = sm.OLS(endog=response_Y, exog=sm.add_constant(df_predictors_select_VIF)).fit()
regressor_OLS.summary()

R^2 = 0.43

Top 10 Tech Cluster (OLS)

Data preparation for the top 10 clusters

In [115]:
df = pd.read_csv(PATH + "/Dataset/fame_OC_tech_firm.csv",low_memory=False)
df = df.drop({"Unnamed: 0","Unnamed: 0.1","Unnamed: 0.1.1","is_tech"},axis = 1)
df = df.dropna(subset = {"birth_year","ttwa"})

# Generate the top 10 tech clusters table based on the ttwa
top_10_ttwa = df.ttwa.value_counts().head(10)
top_10_ttwa = pd.DataFrame(top_10_ttwa).reset_index()
top_10_ttwa.columns = ["ttwa_code","counts"]
In [263]:
df_ttwa = pd.read_csv(PATH + "/Dataset/ttwa.csv")
df_ttwa = pd.merge(top_10_ttwa, df_ttwa[["code","name"]], left_on="ttwa_code",right_on="code")
df_ttwa_10 = df_ttwa[["ttwa_code", "name","counts"]]

# df_10_tc means "top 10 tech ttwa clusters"
df_10_tc = df_hh_EnR_ttwaName.copy()
# filter data to top 10 ttwa data raws
df_10_tc = df_10_tc[df_10_tc.ttwa_code.isin(df_ttwa_10.ttwa_code.to_list())].reset_index(drop=True)

df_10_tc = df_10_tc.drop({"ttwa_code"},axis=1)
# get dummies
df_10_tc = pd.get_dummies(df_10_tc)
df_10_tc = df_10_tc.drop({"ttwa_London"},axis=1)

# delete rows identified as residual outliers
df_10_tc = df_10_tc.drop(index=[182,92,141])

In (1):

$\ log_e\ {Dynamics\ Variable_{i,t}}\ =\ a_1 + b_1 \times firms_{i,t} + b_2 \times HHI_{i,t} + b_3 \times location_i + b_4 \times year_t$

In [212]:
df = df_10_tc.copy()

predictors_X = df.drop({'entry_rate','average_assets'}, axis=1) # drop the Y_1 and Y_2
#response_Y = df['entry_rate']
response_Y = np.log(df['entry_rate']) # log transform chosen based on the residual analysis
list_var_not_to_remove = ["year","firms","hh"] # this is a list of dummy variables that represent a categorical variable with three or more categories. They should not be removed even if it has a high VIF. 

df_predictors_select_VIF = drop_column_using_vif_(predictors_X, list_var_not_to_remove, thresh=5)
print("The columns remaining after VIF selection are:")
print(df_predictors_select_VIF.columns)

regressor_OLS = sm.OLS(endog=response_Y, exog=sm.add_constant(df_predictors_select_VIF)).fit()
regressor_OLS.summary()
Max VIF: inf
Dropping: total_firms
Max VIF: 3.598233664337501
The columns remaining after VIF selection are:
Index(['year', 'firms', 'hh', 'ttwa_Birmingham', 'ttwa_Bournemouth',
       'ttwa_Brighton', 'ttwa_Bristol', 'ttwa_Cambridge',
       'ttwa_Guildford and Aldershot', 'ttwa_Luton', 'ttwa_Manchester',
       'ttwa_Slough and Heathrow'],
      dtype='object')
Out[212]:
OLS Regression Results
Dep. Variable: entry_rate R-squared: 0.959
Model: OLS Adj. R-squared: 0.957
Method: Least Squares F-statistic: 373.3
Date: Tue, 03 Aug 2021 Prob (F-statistic): 4.84e-125
Time: 22:09:46 Log-Likelihood: 42.872
No. Observations: 203 AIC: -59.74
Df Residuals: 190 BIC: -16.67
Df Model: 12
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const -290.6295 5.694 -51.038 0.000 -301.862 -279.397
year 0.1427 0.003 50.089 0.000 0.137 0.148
firms 3.236e-05 8.76e-06 3.696 0.000 1.51e-05 4.96e-05
hh 4.1339 0.367 11.267 0.000 3.410 4.858
ttwa_Birmingham 0.2020 0.077 2.634 0.009 0.051 0.353
ttwa_Bournemouth -0.8259 0.090 -9.130 0.000 -1.004 -0.647
ttwa_Brighton -0.1465 0.079 -1.866 0.064 -0.301 0.008
ttwa_Bristol 0.3233 0.077 4.180 0.000 0.171 0.476
ttwa_Cambridge 0.3363 0.077 4.343 0.000 0.184 0.489
ttwa_Guildford and Aldershot 0.3654 0.078 4.703 0.000 0.212 0.519
ttwa_Luton -0.1255 0.081 -1.553 0.122 -0.285 0.034
ttwa_Manchester 0.2522 0.074 3.392 0.001 0.106 0.399
ttwa_Slough and Heathrow 0.1100 0.076 1.450 0.149 -0.040 0.260
Omnibus: 0.644 Durbin-Watson: 1.454
Prob(Omnibus): 0.725 Jarque-Bera (JB): 0.744
Skew: 0.022 Prob(JB): 0.689
Kurtosis: 2.707 Cond. No. 1.10e+06


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.1e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
In [213]:
plt.scatter(regressor_OLS.fittedvalues, regressor_OLS.resid)
# adding title and labels
plt.xlabel('Fitted Entry Rate')
plt.ylabel('Residual')
plt.title('Figure')
#plt.ylim(-1,1)
plt.show()

In (2):

$\ log_e\ {(Firm\ Performance_{i,t}+1)}\ =\ c_1 + d_1 \times firms_{i,t} + d_2 \times HHI_{i,t} + d_3 \times dynamics_{i,t} + d_4 \times location_i + d_5 \times year_t$ (2)

In [264]:
df = df_10_tc.copy()

predictors_X = df.drop({'average_assets'}, axis=1) # drop the Y_1 and Y_2
response_Y = np.log1p(df['average_assets'])
list_var_not_to_remove = ["entry_rate","year","firms","hh"] # this is a list of dummy variables that represent a categorical variable with three or more categories. They should not be removed even if it has a high VIF. 

df_predictors_select_VIF = drop_column_using_vif_(predictors_X, list_var_not_to_remove, thresh=5)
print("The columns remaining after VIF selection are:")
print(df_predictors_select_VIF.columns)

regressor_OLS = sm.OLS(endog=response_Y, exog=sm.add_constant(df_predictors_select_VIF)).fit()
regressor_OLS.summary()
Max VIF: inf
Dropping: total_firms
Max VIF: 4.045694598536752
The columns remaining after VIF selection are:
Index(['year', 'firms', 'entry_rate', 'hh', 'ttwa_Birmingham',
       'ttwa_Bournemouth', 'ttwa_Brighton', 'ttwa_Bristol', 'ttwa_Cambridge',
       'ttwa_Guildford and Aldershot', 'ttwa_Luton', 'ttwa_Manchester',
       'ttwa_Slough and Heathrow'],
      dtype='object')
Out[264]:
OLS Regression Results
Dep. Variable: average_assets R-squared: 0.824
Model: OLS Adj. R-squared: 0.811
Method: Least Squares F-statistic: 66.86
Date: Tue, 03 Aug 2021 Prob (F-statistic): 7.71e-63
Time: 22:46:32 Log-Likelihood: -41.356
No. Observations: 200 AIC: 110.7
Df Residuals: 186 BIC: 156.9
Df Model: 13
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 150.1955 18.208 8.249 0.000 114.275 186.117
year -0.0688 0.009 -7.569 0.000 -0.087 -0.051
firms 3.411e-05 1.93e-05 1.764 0.079 -4.03e-06 7.22e-05
entry_rate -7.4339 1.609 -4.619 0.000 -10.609 -4.259
hh 1.3541 0.626 2.163 0.032 0.119 2.589
ttwa_Birmingham -0.2711 0.127 -2.136 0.034 -0.522 -0.021
ttwa_Bournemouth -0.6849 0.146 -4.682 0.000 -0.974 -0.396
ttwa_Brighton -0.5555 0.129 -4.315 0.000 -0.810 -0.302
ttwa_Bristol -0.2290 0.129 -1.777 0.077 -0.483 0.025
ttwa_Cambridge -0.2020 0.130 -1.559 0.121 -0.458 0.054
ttwa_Guildford and Aldershot -0.0727 0.129 -0.564 0.573 -0.327 0.182
ttwa_Luton -0.2907 0.131 -2.218 0.028 -0.549 -0.032
ttwa_Manchester -0.1381 0.123 -1.119 0.264 -0.381 0.105
ttwa_Slough and Heathrow -0.1937 0.123 -1.573 0.117 -0.437 0.049
Omnibus: 5.347 Durbin-Watson: 1.922
Prob(Omnibus): 0.069 Jarque-Bera (JB): 7.800
Skew: 0.023 Prob(JB): 0.0202
Kurtosis: 3.966 Cond. No. 1.99e+06


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.99e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
In [265]:
plt.scatter(regressor_OLS.fittedvalues, regressor_OLS.resid)
# adding title and labels
plt.xlabel('Fitted Average Assets')
plt.ylabel('Residual')
plt.title('Figure')
# plt.ylim(-0.5,0.5)
plt.show()

In England, firm performance appears to decline over the years (the year coefficient is negative). A higher HH index is associated with higher performance, while a higher entry rate is associated with lower assets. These results should be discussed against the literature in the next section.

Moran's I and GWR

In [46]:
# value include dynamics, industry mix and firms' performance
df1 = df_hh_EnR_ttwaName.copy()
df1.shape
Out[46]:
(2924, 8)
In [47]:
df1.describe()
Out[47]:
year firms total_firms entry_rate hh average_assets
count 2924.000000 2924.000000 2924.000000 2924.000000 2924.000000 2.924000e+03
mean 2007.638167 113.609097 2541.871409 0.043932 0.151608 1.337131e+05
std 5.759616 667.562166 9758.846630 0.032065 0.155557 2.567358e+05
min 1998.000000 1.000000 54.000000 0.002457 0.039683 1.720000e+02
25% 2003.000000 10.000000 319.000000 0.017731 0.074395 4.152866e+04
50% 2008.000000 27.500000 784.000000 0.033211 0.101514 7.965773e+04
75% 2013.000000 76.000000 2007.000000 0.066359 0.157025 1.553704e+05
max 2018.000000 22353.000000 113412.000000 0.197096 1.000000 8.532989e+06
In [48]:
df1[df1.hh == 1].ttwa.unique() # TTWAs with hh == 1 (all tech firms in a single industry) in at least one year
Out[48]:
array(['Skipton', 'Hexham and Haltwhistle', 'Minehead',
       'Penrith and Appleby', 'Whitby', 'Barrow-in-Furness', 'Bideford',
       'Boston', 'Bridgwater', 'Bridlington', 'Bridport', 'Bude',
       'Buxton', 'Clacton', 'Cromer and Sheringham', 'Darlington',
       'Great Yarmouth', 'Hartlepool', 'Kingsbridge and Dartmouth',
       'Launceston', 'Liskeard', 'Lowestoft', 'Ludlow', 'Penzance',
       'Scarborough', 'Scunthorpe', 'Skegness and Louth', 'Wadebridge',
       'Whitehaven', 'Wisbech', 'Workington'], dtype=object)
In [49]:
df1.head()
Out[49]:
ttwa ttwa_code year firms total_firms entry_rate hh average_assets
0 Barnsley E30000004 1998 10 749 0.013351 0.120000 19405.250000
1 Barnsley E30000004 1999 6 749 0.008011 0.333333 759500.500000
2 Barnsley E30000004 2000 8 749 0.010681 0.156250 286162.333333
3 Barnsley E30000004 2001 8 749 0.010681 0.125000 129527.428571
4 Barnsley E30000004 2002 11 749 0.014686 0.090909 56740.400000
In [270]:
# export df1 to MySQL to operate
# df1.to_csv(PATH + "/MySQL/df_hh_EnR_ttwaName.csv")

Excel operation (with reference to the equivalent MySQL operation ...)

DROP TABLE IF EXISTS tech_cluster_hh_enr_ttwa_for_r;
CREATE TABLE tech_cluster_hh_enr_ttwa_for_r
SELECT * FROM (select    t1.ttwa
                        ,t1.ttwa_code
                        ,sum(t1.firms) as entry_total
                        ,avg(t1.total_firms) as grand_total
                        ,avg(t1.hh) as hhi
                        ,avg(t1.entry_rate) as entry_rate
                        ,avg(t1.average_latest_accounts_assets) as assets
                        from `casa-dissertation`.df_hh_enr_for_r as t1
                        group by t1.ttwa) as t2

Rebuild df1 (via the Mito pivot below; the MySQL read is kept as a commented-out alternative)

In [50]:
# MITO CODE START (DO NOT EDIT)

from mitosheet import * # Import necessary functions from Mito
register_analysis('UUID-20dc35da-8f74-48f1-9355-810587cb82e8') # Let Mito know which analysis is being run

# Step 1

df2 = pd.DataFrame(data={})
  
# MITO CODE END (DO NOT EDIT)
In [51]:
# MITO CODE START (DO NOT EDIT)

from mitosheet import * # Import necessary functions from Mito
register_analysis('UUID-20dc35da-8f74-48f1-9355-810587cb82e8') # Let Mito know which analysis is being run

# Step 1

# Pivot the data
pivot_table = df1.pivot_table(
    index=['ttwa', 'ttwa_code'],
    values=['firms', 'total_firms', 'hh', 'entry_rate', 'average_assets'],
    aggfunc={'firms': 'sum', 'total_firms': 'mean', 'hh': 'mean', 'entry_rate': 'mean', 'average_assets': 'mean'}
)

# Reset the column name and the indexes
df2 = pivot_table.rename_axis(None, axis=1).reset_index()
  
# MITO CODE END (DO NOT EDIT)
In [52]:
# df1 = pd.read_csv(PATH + "/MySQL/tech_cluster_hh_enr_ttwa_for_r.csv",",")
In [53]:
df1 = df2.copy()
In [54]:
type(g_ttwa_England)
type(df1)
Out[54]:
pandas.core.frame.DataFrame
In [55]:
df_left = g_ttwa_England
left_key = 'TTWA11CD'

df_right = df1
right_key = "ttwa_code"

# merge
df_mg = pd.merge(df_left,df_right, left_on = left_key,right_on=right_key,how = "inner").reset_index(drop=True)
type(df_mg)
Out[55]:
geopandas.geodataframe.GeoDataFrame
In [56]:
df_mg.shape
Out[56]:
(149, 13)
In [59]:
# df_mg.to_csv(PATH + "/HHI_EnR_ttwa_England_for_R.csv")
import geopandas as gpd
import fiona

# asset is not dropped
df_mg.to_file(PATH + "/Dataset/Spatial/HHI_EnR_ttwa_England_for_R_assetnd.geojson", driver='GeoJSON')
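The Moran's I and GWR analyses themselves appear to be run in R on the exported GeoJSON. For reference only, a minimal Python sketch of how the same tests could be approached with the esda, libpysal and mgwr packages (all assumed to be installed; this is not the notebook's actual workflow) is given below.

```python
# Minimal sketch: global Moran's I and a GWR fit on the exported layer.
# Assumes the GeoJSON written above and the esda / libpysal / mgwr packages.
import geopandas as gpd
from libpysal.weights import Queen
from esda.moran import Moran
from mgwr.sel_bw import Sel_BW
from mgwr.gwr import GWR

gdf = gpd.read_file(PATH + "/Dataset/Spatial/HHI_EnR_ttwa_England_for_R_assetnd.geojson")

# Global Moran's I for the HH index (queen-contiguity weights, row-standardised)
w = Queen.from_dataframe(gdf)
w.transform = "r"
mi = Moran(gdf["hh"].values, w)
print("Moran's I:", mi.I, "p-value (permutation):", mi.p_sim)

# GWR: entry rate explained by HHI and firm counts, with TTWA centroids as coordinates
# (for a real analysis, project to a metric CRS before taking centroids)
coords = list(zip(gdf.geometry.centroid.x, gdf.geometry.centroid.y))
y = gdf[["entry_rate"]].values
X = gdf[["hh", "firms"]].values
bw = Sel_BW(coords, y, X).search()          # bandwidth selection
gwr_results = GWR(coords, y, X, bw).fit()   # local coefficients in gwr_results.params
print("GWR bandwidth:", bw, "AICc:", gwr_results.aicc)
```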