Question: the tech industry spans many industries (i.e. many SICs), but an LQ indexes one specific industry in one specific area (1:1). How should we measure a tech cluster's LQ in a given area? Perhaps with a simple average or a weighted method.
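One common option (a sketch, not taken from the source): pool the employment of all tech SICs into a single composite industry and compute one LQ from the pooled counts; this is equivalent to an employment-weighted combination of the per-SIC shares. The function name and every figure below are hypothetical:

```python
# Hypothetical sketch: a single tech-cluster LQ from pooled SIC employment.
# LQ = (area tech share of area employment) /
#      (national tech share of national employment), summed over tech SICs.

def tech_cluster_lq(area_emp, nat_emp, area_total, nat_total, sics):
    area_share = sum(area_emp[s] for s in sics) / area_total
    nat_share = sum(nat_emp[s] for s in sics) / nat_total
    return area_share / nat_share

# Made-up employment counts for two tech SICs in one area vs the nation.
area_emp = {"62011": 300, "62020": 200}
nat_emp = {"62011": 10_000, "62020": 15_000}
lq = tech_cluster_lq(area_emp, nat_emp,
                     area_total=5_000, nat_total=1_000_000,
                     sics=["62011", "62020"])
print(round(lq, 2))  # (500/5000) / (25000/1000000) = 0.1 / 0.025 = 4.0
```

A plain unweighted average of per-SIC LQs would overweight tiny SICs, which is why the pooled (weighted) form is usually preferable.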
The top 10 counts of tech firms per TTWA are given below (figures f1_ttwa, f2_ttwa, f1_laua and f2).
The top four clusters (Camden, Westminster, Hackney and Islington) belong to London; only Brighton and Hove, Birmingham, Manchester and Leeds lie outside London.
Use possible_tech_movers with asset, incorporation_date or other columns to find further characteristics of the movers.
# set the local working path
import sys
PATH = sys.path[0]
# read the tech firm records in fame_OC (15GB)
import pandas as pd
df = pd.read_csv(PATH + "/Dataset/fame_OC_tech_firm.csv",low_memory=True)
# preprocess the data
df = df.drop(columns=["Unnamed: 0", "Unnamed: 0.1", "Unnamed: 0.1.1", "is_tech"])
df.head()
# read the OC1-3
import pandas as pd
df_OC_1_1 = pd.read_csv(PATH + "/Dataset/OC_1_1.csv",low_memory=True)
# df_OC_1_2 = pd.read_csv(PATH + "/Dataset/OC_2_1.csv",low_memory=True)
# df_OC_1_3 = pd.read_csv(PATH + "/Dataset/OC_3_1.csv",low_memory=True)
# preprocess the OC_1_1 data
df_OC_1_1 = df_OC_1_1.drop(columns=["Unnamed: 0"])
df_OC_1_1.head()
# find the tech_firm_fame_OC missing value
total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.style
Detect the missing values in OC_1_1 and clean them:
# find the OC1-3 missing value
total = df_OC_1_1.isnull().sum().sort_values(ascending=False)
percent = (df_OC_1_1.isnull().sum()/df_OC_1_1.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data
# detect the unique value for country columns
df_OC_1_1.country.unique()
# drop the columns which contain NA values
df_OC_1_1_dropna = df_OC_1_1.copy()
print(df_OC_1_1_dropna.shape)
for i in df_OC_1_1_dropna.columns:
    if missing_data.Percent[i] != 0:
        df_OC_1_1_dropna = df_OC_1_1_dropna.drop(columns=[i])
print(df_OC_1_1_dropna.shape)
# 6 columns remain
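The same column drop can be done in one call; a minimal sketch on a toy DataFrame (not the notebook's data): `dropna(axis=1)` removes every column containing at least one missing value.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [1.0, np.nan, 3.0],  # has a missing value, so it gets dropped
    "c": ["x", "y", "z"],
})
df_clean = df_demo.dropna(axis=1)  # drop columns with any NA
print(df_clean.columns.tolist())   # ['a', 'c']
```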
For tech_firm_fame_OC, the temporal variables are given below:
Columns | Missing Number | Missing Percentage |
---|---|---|
dissolution_date | 512706 | 0.998049 |
incorporation_date | 88803 | 0.172867 |
* 17% of incorporation_date values are NA
* Almost 99% of dissolution_date values are NA (most tech firms survive?)
And the spatial attributes can be found in the following table:
Columns | Description |
---|---|
Postcode | |
Pcd2 | postcode split by space |
Pcds | |
Cty | County |
Laua | Local Authority / Unitary Authority / London Borough |
Ctry | Country |
Rgn | Region |
Ttwa | Travel to Work Area |
Pct | |
* 4 countries in this dataset
* 12 regions in this dataset
TTWA means "Travel to Work Area" (ref: https://www.ncl.ac.uk/media/wwwnclacuk/curds/files/RR2015-05.pdf).
The top 10 TTWAs in the UK are presented below:
top_10_ttwa = df.ttwa.value_counts().head(10)
top_10_ttwa = pd.DataFrame(top_10_ttwa).reset_index()
top_10_ttwa.columns = ["ttwa_code","counts"]
df_ttwa = pd.read_csv(PATH + "/Dataset/ttwa.csv")
df_ttwa = pd.merge(top_10_ttwa, df_ttwa[["code","name"]], left_on="ttwa_code",right_on="code")
df_ttwa_10 = df_ttwa[["ttwa_code", "counts", "name"]]
df_ttwa_10.style
Compare this against the largest TTWAs by workplace population (UK, 2011 TTWAs). However, TTWA boundaries are hard to obtain for visualisation work, so I might consider the Laua attribute instead.
The value of this TTWA EDA:
* The top 10 tech clusters can be identified, e.g. London and Manchester (more details can be found in Travel to Work Areas (Mike Coombes and Office for National Statistics, 2015))
* The workplace job population is available, which helps in computing the LQs
The difficulties:
* The boundary dataset sits in the Open Geography Portal and is distributed as (Geo)JSON, which makes pulling the spatial information into Python harder
* The boundaries and workplace population figures are not up to date as of 2021 (I cannot find open TTWA data for 2020)
import geopandas as gpd
g_ttwa = gpd.read_file(PATH + "/Dataset/Spatial/Travel_to_Work_Areas__December_2011__Generalised_Clipped_Boundaries_in_United_Kingdom.geojson")
g_ttwa.plot()
Use the Laua and Ttwa attributes together with 1. the local authority boundaries and 2. the Travel to Work Area boundaries to visualise the spatial distribution. First drop the rows with NA values in the laua and ttwa columns:
print(df.shape)
df1 = df.copy()
df1 = df1.dropna(subset=["laua","ttwa"])
print(df1.shape)
# import the laua boundaries
df_laua = pd.read_csv(PATH + "/Dataset/Spatial/Local_Authority_2017.csv")
df_laua.head()
# import the ttwa boundaries
import geopandas as gpd
df_ttwa = gpd.read_file(PATH + "/Dataset/Spatial/Travel_to_Work_Areas__December_2011__Generalised_Clipped_Boundaries_in_United_Kingdom.geojson")
df_ttwa.head()
The top 10 Local Authorities can be found in the following list:
df_laua_10 = pd.DataFrame(df1.laua.value_counts()[:12]).reset_index()  # take 12 in case some codes fail to match in the merge below
df_laua_10.columns = ["laua_code","count"]
df_laua_10 = pd.merge(df_laua_10, df_laua[["lad17cd","lad17nm"]], left_on="laua_code",right_on="lad17cd")
df_laua_10 = df_laua_10[["laua_code","count","lad17nm"]]
df_laua_10.columns = ["laua_code","count","name"]
df_laua_10.style
The top four clusters (Camden, Westminster, Hackney and Islington) belong to London; only Brighton and Hove, Birmingham, Manchester and Leeds lie outside London.
The top 10 Travel to Work Areas can be found in the following list:
top_10_ttwa = df.ttwa.value_counts().head(10)
top_10_ttwa = pd.DataFrame(top_10_ttwa).reset_index()
top_10_ttwa.columns = ["ttwa_code","counts"]
df_ttwa = pd.read_csv(PATH + "/Dataset/ttwa.csv")
df_ttwa = pd.merge(top_10_ttwa, df_ttwa[["code","name"]], left_on="ttwa_code",right_on="code")
df_ttwa_10 = df_ttwa[["ttwa_code", "counts", "name"]]
df_ttwa_10.style
This file is Local_Authority_2019_Boundaries/Local_Authority_Districts_(December_2019)_Boundaries_UK_BFC.shp
# read the shape file and plot map
import geopandas as gpd
g_laua = gpd.read_file(PATH + "/Dataset/Spatial/Local_Authority_2019_Boundaries/Local_Authority_Districts_(December_2019)_Boundaries_UK_BFC.shp")
g_ttwa = gpd.read_file(PATH + "/Dataset/Spatial/Travel_to_Work_Areas__December_2011__Generalised_Clipped_Boundaries_in_United_Kingdom.geojson")
# print(g_laua)
g_laua.plot()
g_ttwa.plot()
Get the laua codes with their frequencies:
df_laua_freq = pd.DataFrame(df1.laua.value_counts()).reset_index()
df_laua_freq.columns = ["laua_code","count"]
df_laua_freq.head()
Get the ttwa codes with their frequencies:
df_ttwa_freq = pd.DataFrame(df1.ttwa.value_counts()).reset_index()
df_ttwa_freq.columns = ["ttwa_code","count"]
df_ttwa_freq.head()
Merge df_laua_freq with g_laua, and df_ttwa_freq with g_ttwa:
df_laua_mg = pd.merge(g_laua, df_laua_freq,left_on='lad19cd',right_on='laua_code',how='inner').reset_index()
df_ttwa_mg = pd.merge(g_ttwa, df_ttwa_freq, left_on='TTWA11CD', right_on='ttwa_code',how='inner').reset_index()
df_laua_mg.plot()
df_ttwa_mg.plot()
cmap settings ref: https://matplotlib.org/stable/tutorials/colors/colormaps.html
import pysal as ps
import numpy as np
import pandas as pd
import seaborn as sns
import geopandas as gpd
import matplotlib.pyplot as plt
tfont = {'fontname':'DejaVu Sans', 'horizontalalignment':'left'}
f1_laua = plt.figure()
f1_laua.set_size_inches(20,20)
ax1 = f1_laua.add_subplot()
df_laua_mg.plot(column='count', legend=True, cmap='YlGn', ax=ax1)
# df_laua_mg.plot(column='count', legend=True, cmap='plasma',figsize=(180,60),ax=ax1)
f1_laua.subplots_adjust(top=0.95)
f1_laua.suptitle("Spatial distribution of the number of tech firm clusters in UK Local Authorities", x=0.25, size=24, **tfont)
f1_laua.savefig(PATH + '/Img/Frequency_Distribution_of_tech_firms_cluster_by_LAUA.png', dpi=300)
df_ttwa_mg
# map the ttwa and frequency of tech clusters
tfont = {'fontname':'DejaVu Sans', 'horizontalalignment':'left'}
f1_ttwa = plt.figure()
f1_ttwa.set_size_inches(20,20)
ax1 = f1_ttwa.add_subplot()
df_ttwa_mg.plot(column='count', legend=True, cmap='YlGn', ax=ax1)
# df_laua_mg.plot(column='count', legend=True, cmap='plasma',figsize=(180,60),ax=ax1)
f1_ttwa.subplots_adjust(top=0.95)
f1_ttwa.suptitle("Spatial distribution of the number of tech firm clusters in UK Travel to Work Areas", x=0.25, size=24, **tfont)
f1_ttwa.savefig(PATH+'/Img/Frequency_Distribution_of_tech_firms_cluster_by_TTWA.png', dpi=300)
To create higher contrast, set the count of every TTWA outside the top 10 to a large negative value. We need df_ttwa_10 and df_ttwa_mg to build this better visualisation.
df_ttwa_10.head()
df_ttwa_mg.head()
df_ttwa_mg_for_viz = df_ttwa_mg.copy()
df_ttwa_mg_for_viz.loc[~df_ttwa_mg_for_viz["TTWA11CD"].isin(df_ttwa_10.ttwa_code.to_list()),"count"]=-100000
df_ttwa_mg_for_viz
Note: assignment through .loc modifies the DataFrame in place; you do not need to write df = df.loc[...], just df.loc[...] = ...
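A tiny self-contained check of that behaviour (toy DataFrame, invented values):

```python
import pandas as pd

toy = pd.DataFrame({"code": ["A", "B", "C"], "count": [10, 20, 30]})
keep = ["A", "C"]
# Assignment through .loc mutates `toy` directly; no reassignment is needed.
toy.loc[~toy["code"].isin(keep), "count"] = -100_000
print(toy["count"].tolist())  # [10, -100000, 30]
```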
# map the ttwa and frequency of tech clusters again!!
tfont = {'fontname':'DejaVu Sans', 'horizontalalignment':'left'}
f1_ttwa = plt.figure()
f1_ttwa.set_size_inches(20,20)
ax1 = f1_ttwa.add_subplot()
cmap_set = "tab20b"
df_ttwa_mg_for_viz.plot(column='count', legend=True, cmap=cmap_set, ax=ax1)
# coolwarm
# winter
# tab20b
# tab20c
# df_laua_mg.plot(column='count', legend=True, cmap='plasma',figsize=(180,60),ax=ax1)
f1_ttwa.subplots_adjust(top=0.95)
f1_ttwa.suptitle("Spatial distribution of the number of tech firm clusters in UK Travel to Work Areas", x=0.25, size=24, **tfont)
f1_ttwa.savefig(PATH+'/Img/Frequency_Distribution_of_tech_firms_cluster_by_TTWA' + '_' + cmap_set + '.png', dpi=300)
df_laua_10.style
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="whitegrid")
# Initialize the matplotlib figure
f2_laua, ax2 = plt.subplots(figsize=(15,6))
sns.set_color_codes("pastel")
sns.barplot(x="count", y="name", data=df_laua_10, color="b")
# Add a legend and informative axis label
# ax.legend(ncol=2, loc="lower right", frameon=True)
ax2.set(xlim=(0, 16050), ylabel="", xlabel="the number of technology companies")
ax2.set_title("Top 10 local authorities with the largest number of technology companies in the UK",fontsize = 24)
# sns.despine(left=True, bottom=True)
f2_laua.savefig(PATH + "/Img/The_Top_10_Number_of_Tech_Firms_laua.png")
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="whitegrid")
# Initialize the matplotlib figure
f2_ttwa, ax2 = plt.subplots(figsize=(15,6))
sns.set_color_codes("pastel")
sns.barplot(x="counts", y="name", data=df_ttwa_10, color="b")
# Add a legend and informative axis label
# ax.legend(ncol=2, loc="lower right", frameon=True)
ax2.set(xlim=(0, 150000), ylabel="", xlabel="the number of technology companies")
ax2.set_title("Top 10 Travel to Work Area with the largest number of technology companies in the UK",fontsize = 24)
# sns.despine(left=True, bottom=True)
f2_ttwa.savefig(PATH + "/Img/The_Top_10_Number_of_Tech_Firms_ttwa.png")
For fame_OC, count how often each normalised firm name re-appears:
df2 = df.copy()
firm_name_freq = pd.DataFrame(df2.normalised_name.value_counts()).reset_index()
firm_name_freq.columns = ["firm_name","count"]
# firm's name re-appears
firm_name_re = firm_name_freq[firm_name_freq["count"]>1]
firm_name_re.style
In this case, I can detect 41 companies that might have moved. Then go back to fame_OC and filter these possible movers:
df_fame_movers = df.copy()
df_fame_movers = df_fame_movers[df_fame_movers.normalised_name.isin(firm_name_re.firm_name.to_list())]
# df_fame_movers.to_csv(PATH + "/df_fame_OC_movers.csv")
# detect which one has different postcode in the same normalised name
df_movers = pd.DataFrame(df_fame_movers.groupby("normalised_name")["postcode"].nunique()).reset_index()
# two or more distinct postcodes suggest these tech firms may have moved
df_movers = df_movers[df_movers["postcode"]>=2]
# export a list of tech firms that share the same name but have different postcodes
df_movers_1 = df_fame_movers[df_fame_movers["normalised_name"].isin(df_movers["normalised_name"].to_list())]
possible_tech_movers = df_movers_1[["registered_number","incorporation_date","dissolution_date","normalised_name","postcode"]].sort_values(by="normalised_name")
possible_tech_movers
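A toy check of the mover-detection logic above (firm names and postcodes invented): a name qualifies as a possible mover when it appears with two or more distinct postcodes.

```python
import pandas as pd

records = pd.DataFrame({
    "normalised_name": ["acme ltd", "acme ltd", "beta ltd", "beta ltd"],
    "postcode": ["N1 9GU", "EC1V 2NX", "SW1A 1AA", "SW1A 1AA"],
})
# Count distinct postcodes per name; >= 2 suggests a relocation.
n_postcodes = records.groupby("normalised_name")["postcode"].nunique()
movers = n_postcodes[n_postcodes >= 2].index.tolist()
print(movers)  # ['acme ltd']  (beta ltd kept one postcode)
```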