For the entry pattern, how should I pick the range of incorporation years (e.g. 1950-2018)?
Should I look up literature for supporting evidence (e.g. that Apple was founded in the 1970s)?
How should I do the map making for the entry/exit/churn patterns?
The churn pattern may be similar to the entry pattern, because only a small number of records in the dataset (fame_OC) have a dissolution date.
How should I define the number of all companies in the entry rate measurement?
Does it represent the number of tech companies in a specific cluster from 19xx to 2018?
Which regression method should I pick (multiple regression?)
D/ Regression analysis - I would suggest trying two things to start. Again, for cluster i, year t:
$Dynamics\_variable_{i,t} = F(\#firms_{i,t-n},\ industry\ mix_{i,t-n},\ X_{i,t-n},\ location_i,\ year_t)$ (1)
$Firm\_performance_{i,t} = F(\#firms_{i,t-n},\ dynamics_{i,t-n},\ industry\ mix_{i,t-n},\ X_{i,t-n},\ location_i,\ year_t)$ (2)
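A minimal sketch of how equation (1) could be estimated with statsmodels, assuming the data are reshaped to a long panel with one row per (cluster i, year t); all column names and the toy numbers below are hypothetical placeholders, not the final specification:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
# hypothetical toy panel: one row per (cluster, year); replace with the real data
rng = np.random.default_rng(0)
clusters = ["London", "Manchester", "Bristol"]
years = list(range(2014, 2019))
panel = pd.DataFrame([(c, t) for c in clusters for t in years], columns=["cluster", "year"])
panel["n_firms_lag"] = rng.integers(200, 2000, len(panel))      # #firms_{i,t-n}
panel["industry_mix_lag"] = rng.uniform(0.2, 0.5, len(panel))   # industry mix_{i,t-n}
panel["entry_rate"] = rng.uniform(0.05, 0.20, len(panel))       # dynamics variable
# equation (1): dynamics variable on lagged covariates plus cluster and year fixed effects
model_1 = smf.ols("entry_rate ~ n_firms_lag + industry_mix_lag + C(cluster) + C(year)",
                  data=panel).fit()
model_1.summary()
Equation (2) would follow the same pattern, with firm performance as the outcome and the lagged dynamics variable added to the right-hand side.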
Attributes Question:
According to the previous EDA: https://zeqiang.fun/CASA0012-Dissertation/jupyterlab/EDA.html
The top 10 tech clusters can be identified by TTWA (Travel to Work Area); see the bar chart below.
The key findings of the bar chart:
The map of the top 10 tech clusters based on TTWA boundaries can be found below. The blue areas are the non-top-10 tech clusters.
The key findings of the map:
The entry rate dataset for the 10 tech clusters from 1900 to 2018 can be found below.
entry_rate.style
A simple linear regression for the London tech cluster is fitted below.
regressor_OLS.summary()
As shown above, the regression performance is not good. The exit rate and churn rate will be considered as additional variables in the next multiple regression.
The research plan for next week:
import pandas as pd
import sys
PATH = sys.path[0] # set the local file path
df = pd.read_csv(PATH + "/Dataset/fame_OC_tech_firm.csv")
# preprocess: drop the redundant index columns and the is_tech flag
df = df.drop(["Unnamed: 0", "Unnamed: 0.1", "Unnamed: 0.1.1", "is_tech"], axis=1)
df.head()
According to the previous EDA research, there are 88,803 missing values in the birth_year and incorperation_date columns, accounting for about 17% of the dataset.
# drop the records with a null birth_year or ttwa (travel to work area)
df_1 = df.dropna(subset=["birth_year", "ttwa"])
print(f"{df_1.shape[0] / df.shape[0] * 100:.2f}% of records remain\n")
# count the number of firms incorporated in each birth year
df_1.birth_year.value_counts()
# transform the value counts into a DataFrame of (birth_year, count)
df_birth_year = pd.DataFrame(df_1.birth_year.value_counts()).reset_index()
df_birth_year.rename({"index":"birth_year","birth_year":"count"},axis=1,inplace=True)
df_birth_year[df_birth_year["count"]>10]
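To visualise the entry pattern over time (one of the questions above), these yearly counts could be plotted; a minimal sketch using df_birth_year from the cell above:
import matplotlib.pyplot as plt
# bar chart of the number of firms incorporated in each birth year
ax = df_birth_year.sort_values("birth_year").plot(x="birth_year", y="count",
                                                  kind="bar", figsize=(12, 4), legend=False)
ax.set_xlabel("birth year")
ax.set_ylabel("number of incorporated firms")
plt.show()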
Entry rate for tech cluster $i$, year $t$ = firms incorporated in year $t$ / all firms in cluster $i$ at year $t$:
$Entry\ Rate_{i,t} = \frac{firms\ incorporated_{i,t}}{all\ companies_{i,t}}$
where $i$ denotes the tech cluster and $t$ the incorporation year (2009, 2014, ..., 2020).
Exit rate for tech cluster $i$, year $t$ = firms dissolved in year $t$ / all firms in cluster $i$ at year $t$:
$Exit\ Rate_{i,t} = \frac{firms\ dissolved_{i,t}}{all\ companies_{i,t}}$
Churn rate for tech cluster $i$, year $t$ = (firms incorporated - firms dissolved) in year $t$ / all firms in cluster $i$ at year $t$:
$Churn\ Rate_{i,t} = \frac{firms\ incorporated_{i,t}\ -\ firms\ dissolved_{i,t}}{all\ companies_{i,t}}$
*This index may be quite similar to the entry rate, because only a small number of firms have dissolution dates.
You could also try fitting these in first differences, e.g. change in entry / exit / churn between t and t-1.
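A minimal sketch of how these three rates (and their first differences) could be computed per cluster-year in pandas; it assumes a dissolution year column, here called dissolution_year as a placeholder, and reads "all companies" as the cumulative number of firms incorporated in cluster i up to year t, which is only one possible answer to the denominator question above:
# count incorporations and dissolutions per (cluster, year); dissolution_year is a placeholder name
entry = (df_1.groupby(["ttwa", "birth_year"]).size()
             .rename_axis(["ttwa", "year"]).rename("n_entry"))
exit_ = (df_1.groupby(["ttwa", "dissolution_year"]).size()
             .rename_axis(["ttwa", "year"]).rename("n_exit"))
rates = pd.concat([entry, exit_], axis=1).fillna(0).sort_index()
# denominator assumption: all firms incorporated in cluster i up to and including year t
rates["all_companies"] = rates.groupby(level="ttwa")["n_entry"].cumsum()
rates["entry_rate"] = rates["n_entry"] / rates["all_companies"]
rates["exit_rate"] = rates["n_exit"] / rates["all_companies"]
rates["churn_rate"] = (rates["n_entry"] - rates["n_exit"]) / rates["all_companies"]
# first differences between t and t-1 within each cluster
diffs = rates.groupby(level="ttwa")[["entry_rate", "exit_rate", "churn_rate"]].diff()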
The entry rate can reflect the level of tech cluster attraction; the measurement method is given below:
$Entry\ Rate_{i,t} = \frac{firms\ incorporated_{i,t}}{all\ companies_{i,t}}$
where $i$ denotes the tech cluster and $t$ the incorporation year (2009, 2014, ..., 2020).
Note: only a few companies (about 1%, according to the previous EDA research) have dissolution dates; the number of all companies includes the dissolved ones.
Refer to the previous EDA: https://zeqiang.fun/CASA0012-Dissertation/jupyterlab/EDA.html
The top 10 clusters based on Travel to Work Area (TTWA) are considered, as shown below.
FAQ:
1. Why did I choose TTWA?
This geographical division better reflects the relationship between population, companies and workplaces.
2. What is TTWA?
The 228 areas forming the 2011 TTWAs, covering the whole of the UK, were defined in 2015 using 2011 Census commuting flow data, indicating home and workplace address. The TTWAs are based on aggregations of Lower Layer Super Output Areas (LSOA) in England and Wales, Data Zones (DZ) in Scotland, and Super Output Areas (SOA) in Northern Ireland and in some cases span country borders. There are six cross-border TTWAs, 149 in England, 18 in Wales, 45 in Scotland and 10 in Northern Ireland.
If the ttwa is one of the 10 tech clusters, then a new column cluster_label will be set to the corresponding cluster (a sketch of this mapping is given after the filtering code below).
# generate the top 10 tech clusters table based on the ttwa
top_10_ttwa = df.ttwa.value_counts().head(10)
top_10_ttwa = pd.DataFrame(top_10_ttwa).reset_index()
top_10_ttwa.columns = ["ttwa_code","counts"]
df_ttwa = pd.read_csv(PATH + "/Dataset/ttwa.csv")
df_ttwa = pd.merge(top_10_ttwa, df_ttwa[["code","name"]], left_on="ttwa_code",right_on="code")
df_ttwa_10 = df_ttwa[["ttwa_code", "name","counts"]]
df_ttwa_10.style
Data filtering: keep only the data of the 10 tech clusters
Then the ttwa code can be treated as the cluster label.
# df_10_tc means df_10_tech_clusters
df_10_tc = df.copy()
print(df_10_tc.shape)
df_10_tc = df_10_tc[df_10_tc.ttwa.isin(df_ttwa_10.ttwa_code.to_list())]
print(df_10_tc.shape)
df_10_tc.birth_year.value_counts()
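To make the cluster_label described earlier explicit, the TTWA codes in df_10_tc could be mapped to the cluster names from df_ttwa_10; a minimal sketch:
# map each ttwa code to its TTWA name and store it as the cluster_label column
code_to_name = dict(zip(df_ttwa_10["ttwa_code"], df_ttwa_10["name"]))
df_10_tc["cluster_label"] = df_10_tc["ttwa"].map(code_to_name)
df_10_tc[["ttwa", "cluster_label"]].head()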
Data filtering: pick birth years from 2013 to 2018
Refer to the tech cluster of London, UK, from 2009 to 2018. The London technology cluster has been among the world leaders in the venture capital market and in start-up unicorn companies, and the technology cluster effect has played an important role over this period (Kerr & Robert-Nicoud, 2020).
Kerr, William R., and Frederic Robert-Nicoud. 2020. "Tech Clusters." Journal of Economic Perspectives, 34 (3): 50-76. https://www.aeaweb.org/articles?id=10.1257/jep.34.3.50
print(df_10_tc.shape)
# df_10_tc = df_10_tc[df_10_tc.birth_year>2008]
print(df_10_tc.shape)
Export the clipped CSV to Excel and run a pivot operation on:
- ttwa (considered here as the tech cluster label)
- birth_year
- registered_number
Then calculate the entry numbers in Excel.
df_10_tc_clipped = df_10_tc[["registered_number","ttwa","birth_year"]]
df_10_tc_clipped.to_csv(PATH + "/fame_OC_10_tc_clipped.csv")
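As an alternative to the Excel pivot, the same table could be produced directly in pandas; a minimal sketch using the clipped frame above (the Grand Total column mirrors the one produced by Excel):
# pivot: one row per ttwa (tech cluster), one column per birth year,
# counting registered numbers = number of firms incorporated in that year
entry_counts = pd.pivot_table(df_10_tc_clipped, index="ttwa", columns="birth_year",
                              values="registered_number", aggfunc="count", fill_value=0)
entry_counts["Grand Total"] = entry_counts.sum(axis=1)  # total firms per cluster, as in Excel
entry_counts.head()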
Import the processed dataset from the .xlsx file
# read the entry rate dataset from excel file
sheetName = "entry_rate"
df_entry_excel = pd.read_excel(PATH + "/Excel/fame_OC_10_tc_clipped.xlsx",sheet_name = sheetName)
df_entry_excel.head()
# rename the columns: ttwa -> tech_cluster, Grand Total -> all_companies
df_entry_rate = df_entry_excel.rename({"ttwa":"tech_cluster","Grand Total":"all_companies"},axis=1)
df_entry_rate.head()
Add the TTWA name to the entry rate dataset
entry_rate = pd.merge(df_entry_rate, df_ttwa_10,left_on = "tech_cluster", right_on = "ttwa_code")
entry_rate.drop({"counts"},axis=1,inplace = True)
entry_rate
Take London as the first example:
# calculate the entry rate of London
London_ER = entry_rate[entry_rate["tech_cluster"]=="E30000234"].reset_index()
London_ER.drop({"index","all_companies","tech_cluster","ttwa_code","name"},axis=1,inplace = True)
X_t = London_ER.columns.to_list() # x axis is t(year) for London
Y_entry_rate = London_ER.values.tolist()[0] # y axis is the entry rate for London
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
pd.set_option('display.max_rows', 300) # specifies number of rows to show
pd.options.display.float_format = '{:40,.4f}'.format # specifies default number format to 4 decimal places
plt.style.use('ggplot') # specifies that graphs should use ggplot styling
%matplotlib inline
# fit a linear regression model using statsmodels
regressor_OLS = sm.OLS(endog=Y_entry_rate, exog=sm.add_constant(X_t)).fit()
regressor_OLS.summary()
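As planned above, the exit rate and churn rate could be added as regressors in a multiple regression; a minimal sketch, where exit_rate and churn_rate are placeholder arrays standing in for series that would be computed per year for London in the same way as Y_entry_rate:
# placeholder regressors; in the real analysis these would be the yearly exit and churn rates for London
rng = np.random.default_rng(0)
exit_rate = rng.uniform(0.00, 0.05, len(X_t))
churn_rate = rng.uniform(0.00, 0.15, len(X_t))
# multiple regression: entry rate on year, exit rate and churn rate
X_multi = np.column_stack([np.asarray(X_t, dtype=float), exit_rate, churn_rate])
multi_OLS = sm.OLS(endog=Y_entry_rate, exog=sm.add_constant(X_multi)).fit()
multi_OLS.summary()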
to do
to do
to do