1 Key Results

  1. choropleth map: spatial distribution of the number of tech firms across UK Local Authorities
  2. bar chart: top 10 local authorities with the largest number of technology companies in the UK
  3. table: possible movers among tech firms

Question: The tech industry spans many industries (i.e. many SICs), but an LQ describes one specific industry in one specific area (1:1). How should we measure a tech cluster's LQ for an area as a whole? (Perhaps using a simple average or a weighted method?)
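One option is to aggregate the per-SIC LQs of an area into a single cluster index, either as a simple mean or as an employment-weighted mean. The sketch below uses hypothetical areas, SIC codes and employment figures, not values from the fame_OC data:

```python
import pandas as pd

# Hypothetical inputs: one LQ per (area, SIC) pair, plus employment weights.
lq = pd.DataFrame({
    "area": ["Camden", "Camden", "Leeds", "Leeds"],
    "sic4": [6201, 6202, 6201, 6202],
    "lq": [2.0, 1.0, 0.5, 1.5],
    "employment": [300, 100, 200, 200],
})

# Option 1: simple mean of the industry LQs in each area.
simple = lq.groupby("area")["lq"].mean()

# Option 2: employment-weighted mean, so larger industries dominate.
num = (lq["lq"] * lq["employment"]).groupby(lq["area"]).sum()
den = lq.groupby("area")["employment"].sum()
weighted = num / den

print(simple)
print(weighted)
```

The weighted version is usually preferable, because it stops a tiny industry with an extreme LQ from dominating the cluster index.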

The top 10 TTWAs by number of tech firms are given below

In [63]:
f1_ttwa
Out[63]:
In [70]:
f2_ttwa
Out[70]:
In [64]:
f1_laua
Out[64]:
In [245]:
f2
Out[245]:
  • The top 4 (the Camden, Westminster, Hackney and Islington clusters) are all in London

  • Only Brighton and Hove, Birmingham, Manchester and Leeds lie outside London

In [242]:
possible_tech_movers
Out[242]:
registered_number incorporation_date dissolution_date normalised_name postcode
164178 3060979 1995-05-24 NaN agility limited EN49HN
144252 07418654 2010-10-25 NaN agility limited EC1V2NX
279459 11632107 2018-10-19 NaN amd tech limited N156RH
182708 06250619 2007-05-17 NaN amd tech limited GL53PN
228143 08424152 2013-02-28 NaN conversion company limited KT70RY
512012 1981640 1986-01-23 NaN conversion company limited YO211QL
157297 SC583384 2017-12-07 NaN create limited EH107DU
285768 7460781 2010-12-06 NaN create limited N17GU
20379 11059998 2017-11-13 NaN ctrl z limited BA113EE
469818 05608171 2005-11-01 NaN ctrl z limited W1T6AD
213430 11634444 2018-10-22 NaN entertainment international limited IG110GH
323343 11143790 2018-01-10 NaN entertainment international limited NW51EN
12019 8070874 2012-05-16 NaN eye clinic limited B288EZ
211689 06219671 2007-04-19 NaN eye clinic limited HU53TS
140510 7744742 2011-08-18 NaN future corporation limited E83DL
39534 5631136 2005-11-21 NaN future corporation limited BL26RT
271030 04397427 2002-03-18 NaN globex corporation limited MK428TU
19904 09552070 2015-04-21 NaN globex corporation limited B976HR
387395 11065921 2017-11-15 NaN grace and grace limited SE64EJ
241453 11412829 2018-06-13 NaN grace and grace limited LL116UG
47123 8546782 2013-05-28 NaN hit reset limited BN16AF
224545 6598405 2008-05-20 NaN hit reset limited KT168PD
242014 3937810 2000-03-01 NaN hive limited LL181RF
511527 10797308 2017-06-01 NaN hive limited YO126HT
291684 10589202 2017-01-30 NaN jc film limited N32JU
143676 11613438 2018-10-09 NaN jc film limited EC1R0DU
424940 9549356 2015-04-19 NaN jd software limited SW152BF
72656 04405957 2002-03-28 NaN jd software limited CB249AD
133614 11306057 2018-04-12 NaN limited E162DQ
326148 10219186 2016-06-07 NaN limited NW99NL
8204 7998847 2012-03-20 NaN mak corporation limited B112AA
39799 10674075 2017-03-16 NaN mak corporation limited BL35AZ
323418 9252531 2014-10-07 NaN mind music limited NW51UA
153204 09331890 2014-11-28 NaN mind music limited EC2R6AY
192884 9989675 2016-02-05 NaN quality care solutions limited HA12SP
407150 7745467 2011-08-18 NaN quality care solutions limited SN33AJ
33981 11438196 2018-06-28 NaN red red limited BH149NY
215822 11233132 2018-03-05 NaN red red limited IG45HQ
332492 6329371 2007-07-31 NaN sar uk limited OX46NG
268281 8922191 2014-03-04 NaN sar uk limited MK11SW
180039 05549948 2005-08-31 NaN solutions design limited GL170PZ
494527 11567756 2018-09-13 NaN solutions design limited WC2H9JQ
368604 8417509 2013-02-25 NaN sri corporation limited S117AQ
398016 8817694 2013-12-17 NaN sri corporation limited SL11UA
151594 11487977 2018-07-27 NaN we unicorns limited EC2A4NE
491599 9732240 2015-08-13 NaN we unicorns limited WC2H7LA
  • The number of movers is small (46 records, i.e. 23 name pairs, of tech firms)
  • I plan to draw the direction of their movement
  • I plan to explore asset, incorporation_date and other columns to find further characteristics of movers
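For the planned direction-of-movement step, one heuristic is to treat the record with the earlier incorporation_date as the origin and the later one as the destination. This is only a sketch on a hypothetical two-row-per-name frame shaped like possible_tech_movers; the records are separate registrations, not a confirmed relocation:

```python
import pandas as pd

# Hypothetical frame shaped like possible_tech_movers (two rows per name).
movers = pd.DataFrame({
    "normalised_name": ["agility limited", "agility limited",
                        "hive limited", "hive limited"],
    "incorporation_date": ["1995-05-24", "2010-10-25",
                           "2000-03-01", "2017-06-01"],
    "postcode": ["EN49HN", "EC1V2NX", "LL181RF", "YO126HT"],
})
movers["incorporation_date"] = pd.to_datetime(movers["incorporation_date"])

# Order each name pair by incorporation date, then take the earliest
# postcode as the (assumed) origin and the latest as the destination.
ordered = movers.sort_values(["normalised_name", "incorporation_date"])
moves = ordered.groupby("normalised_name")["postcode"].agg(
    origin="first", destination="last"
).reset_index()
print(moves)
```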

2 Reproducible Exploratory Data Analysis (EDA)

In [2]:
# set the local working path
import sys
PATH = sys.path[0]

2.1 Read Data

In [7]:
# read the tech firm records in fame_OC (15GB)
import pandas as pd
df = pd.read_csv(PATH + "/Dataset/fame_OC_tech_firm.csv",low_memory=True)

# preprocess: drop the leftover index columns and the is_tech flag
df = df.drop(["Unnamed: 0", "Unnamed: 0.1", "Unnamed: 0.1.1", "is_tech"], axis=1)
df.head()
/Users/fangzeqiang/Library/Python/3.7/lib/python/site-packages/IPython/core/interactiveshell.py:3058: DtypeWarning: Columns (3,28,41,66) have mixed types.Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
Out[7]:
registered_number incorporation_date dissolution_date latest_accounts_cash latest_accounts_assets latest_accounts_liabilities year_obs sic_year sic4 change_sic ... pcd2 pcds cty laua ctry rgn ttwa pct lep1 lep2
0 06199872 2007-04-02 NaN NaN 15785.0 NaN 2017 2007.0 8622.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 11896710 NaN NaN NaN NaN NaN 2018 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 SC041753 1965-02-19 NaN NaN NaN NaN 2018 2007.0 2013.0 0.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 12125173 NaN NaN NaN NaN NaN 2018 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 11968595 NaN NaN NaN NaN NaN 2018 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 64 columns
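The DtypeWarning above appears because pandas guesses column types chunk by chunk on large files. A minimal sketch of the fix, demonstrated on an in-memory CSV: pass an explicit dtype for the offending columns (or set low_memory=False). Reading company numbers as strings also preserves their leading zeros:

```python
import io
import pandas as pd

# A mixed-type column: numeric-looking and alphanumeric company numbers.
csv = io.StringIO("registered_number,year_obs\n06199872,2017\nSC041753,2018\n")

# Forcing dtype=str avoids the per-chunk type guessing that triggers
# DtypeWarning, and keeps leading zeros intact.
df_fixed = pd.read_csv(csv, dtype={"registered_number": str})
print(df_fixed.registered_number.tolist())  # ['06199872', 'SC041753']
```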

In [37]:
# read the OC1-3
import pandas as pd
df_OC_1_1 = pd.read_csv(PATH + "/Dataset/OC_1_1.csv",low_memory=True)
# df_OC_1_2 = pd.read_csv(PATH + "/Dataset/OC_2_1.csv",low_memory=True)
# df_OC_1_3 = pd.read_csv(PATH + "/Dataset/OC_3_1.csv",low_memory=True)

# process the OC_1_1 data
df_OC_1_1 = df_OC_1_1.drop(["Unnamed: 0"], axis=1)
df_OC_1_1.head()
/Users/fangzeqiang/Library/Python/3.7/lib/python/site-packages/IPython/core/interactiveshell.py:3058: DtypeWarning: Columns (1) have mixed types.Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
Out[37]:
registered_number bvd_id id registered_addresspostal_code primaryaddresspostcodeifuk alltradingaddressespostcodeifuk sic4 birth_year diss_year streetnobuildingetcline1 ... country countryisocode regionincountry typeofregionincountry telephonenumber faxnumber addresstype address_group isdup_t_postcode _merge
0 118 NaN 233927.0 TN23 1DA NaN NaN 4611.0 1856.0 NaN The New Ashford Market ... United Kingdom GB England|South Eastern|Tonbridge (TN)|Ashford Country|Region|Postal area|Town 01233506201 NaN Trading address 1.0 0 matched (3)
1 258 NaN 12022629.0 RG12 1AN NaN NaN 7499.0 1856.0 NaN India Buildings ... United Kingdom GB England|North West|Liverpool (L)|Liverpool Country|Region|Postal area|Town NaN NaN Trading address 1.0 0 matched (3)
2 371 NaN 2111477.0 SW1Y 6BN NaN NaN 7499.0 1863.0 NaN St Clements House ... United Kingdom GB England|Eastern|Norwich (NR)|Norwich Country|Region|Postal area|Town NaN NaN Trading address 1.0 0 matched (3)
3 402 NaN 2302906.0 PL4 0RA NaN NaN 7499.0 1863.0 NaN Millbay Road ... United Kingdom GB England|South Western|Plymouth (PL)|Plymouth Country|Region|Postal area|Town 01752275850 NaN Trading address 1.0 0 matched (3)
4 425 NaN 3310625.0 EC4Y 8BB NaN NaN NaN 1856.0 2014.0 Surrey House ... United Kingdom GB England|London Outer|Kingston Upon Thames (KT)... Country|Region|Postal area|Town NaN NaN Trading address 1.0 0 matched (3)

5 rows × 30 columns

2.2 Missing Value Detection

In [64]:
# find the tech_firm_fame_OC missing value
total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.style
Out[64]:
Total Percent
dissolution_date 512706 0.998049
diss_year 512706 0.998049
companieshousenumbersofchildcomp 512703 0.998044
incorporationdate 508225 0.989327
alltradingaddressespostcodeifuk 508107 0.989097
primaryaddresspostcodeifuk 508104 0.989091
numberofchildcompanies 508104 0.989091
companieshousenumbersofcompanies 508104 0.989091
numberofcompaniesinitscorporates 508104 0.989091
registeredname 508104 0.989091
companieshousestatus 508104 0.989091
siccodes2007 508104 0.989091
istheultimateparent 508104 0.989091
latest_accounts_cash 493925 0.961490
latest_accounts_liabilities 463134 0.901551
lep2 450192 0.876358
nr_employees_last_avail_yr 349899 0.681124
latest_accounts_assets 255560 0.497481
year_accounts 224805 0.437612
keep 95745 0.186380
string 95745 0.186380
sic4 89079 0.173404
sic_year 89079 0.173404
age 88803 0.172867
incorporation_date 88803 0.172867
birth_year 88803 0.172867
id 88802 0.172865
name 88802 0.172865
normalised_name 88802 0.172865
company_type 88802 0.172865
change_sic 88802 0.172865
jurisdiction_code 88802 0.172865
max_anna 88802 0.172865
oc 88536 0.172347
bh_notmerged 88536 0.172347
bh_dummy 88536 0.172347
merge_max_anna 88536 0.172347
duo_direct_share 17374 0.033821
pct 11050 0.021510
laua 11050 0.021510
cty 11050 0.021510
lep1 11050 0.021510
rgn 11050 0.021510
ctry 11050 0.021510
ttwa 11050 0.021510
pcd2 11036 0.021483
pcds 11036 0.021483
postcode 4333 0.008435
registered_addresspostal_code 4333 0.008435
ro_full_postcode 3476 0.006766
sample 266 0.000518
ro_country 1 0.000002
year_obs 0 0.000000
guo_name 0 0.000000
guo_bvd_id 0 0.000000
fame 0 0.000000
merge_fame 0 0.000000
duo_name 0 0.000000
duo_bvd_id 0 0.000000
company_name 0 0.000000
bvd_id 0 0.000000
primary_uk_sic_2007 0 0.000000
guo_direct_share 0 0.000000
registered_number 0 0.000000
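The Total/Percent table above can also be built a little more compactly: isnull().mean() yields the missing share directly, without dividing a sum by a count. A sketch on a toy frame:

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [1, None, None, 4], "b": [None, 2, 3, 4]})

# isnull().mean() is the per-column missing share; concat pairs it with
# the raw counts, mirroring the Total/Percent table.
missing = pd.concat(
    [df_demo.isnull().sum(), df_demo.isnull().mean()],
    axis=1, keys=["Total", "Percent"],
).sort_values("Percent", ascending=False)
print(missing)
```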

detect and clean the missing values in OC_1_1

In [171]:
# find the OC1-3 missing value
total = df_OC_1_1.isnull().sum().sort_values(ascending=False)
percent = (df_OC_1_1.isnull().sum()/df_OC_1_1.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data
Out[171]:
Total Percent
faxnumber 259581 1.000000
streetnobuildingetcline4 258103 0.994306
streetnobuildingetcline4native 258103 0.994306
alltradingaddressespostcodeifuk 257100 0.990442
primaryaddresspostcodeifuk 257100 0.990442
streetnobuildingetcline3native 219300 0.844823
streetnobuildingetcline3 219300 0.844823
diss_year 190576 0.734168
bvd_id 176827 0.681202
telephonenumber 154869 0.596611
streetnobuildingetcline2 112826 0.434647
streetnobuildingetcline2native 112826 0.434647
sic4 45985 0.177151
birth_year 4844 0.018661
id 4812 0.018538
regionincountry 1174 0.004523
typeofregionincountry 1174 0.004523
country 1160 0.004469
countryisocode 1160 0.004469
city 1157 0.004457
citynative 1157 0.004457
streetnobuildingetcline1native 128 0.000493
streetnobuildingetcline1 128 0.000493
registered_addresspostal_code 11 0.000042
_merge 0 0.000000
isdup_t_postcode 0 0.000000
postcode 0 0.000000
addresstype 0 0.000000
address_group 0 0.000000
registered_number 0 0.000000
In [173]:
# detect the unique value for country columns
df_OC_1_1.country.unique()
Out[173]:
array(['United Kingdom', nan, 'Bulgaria', 'Israel',
       'United States of America', 'Jersey', 'Isle Of Man', 'Guernsey'],
      dtype=object)
In [22]:
# drop the columns which contain NA value

df_OC_1_1_dropna = df_OC_1_1.copy()
print(df_OC_1_1_dropna.shape)

for i in df_OC_1_1_dropna.columns:
    if missing_data.Percent[i] != 0:
        df_OC_1_1_dropna = df_OC_1_1_dropna.drop([i], axis=1)

print(df_OC_1_1_dropna.shape)

# we find that 6 columns remain
(259581, 30)
Out[22]:
(259581, 6)
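The column-dropping loop above can be collapsed into a single dropna call; a sketch on a toy frame:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "complete": [1, 2, 3],
    "partial": [1.0, None, 3.0],
    "empty": [None, None, None],
})

# Keep only columns with no missing values at all; equivalent to looping
# over missing_data.Percent and dropping every non-zero column.
no_na = df_demo.dropna(axis=1, how="any")
print(no_na.columns.tolist())  # ['complete']
```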

2.3 Attributes Recap (fame_OC)

For tech_firm_fame_OC, the temporal variables are given below:

Columns Missing Number Missing Percentage
dissolution_date 512706 0.998049
incorporation_date 88803 0.172867

* About 17% of incorporation_date values are N/A

* Almost 99% of dissolution_date values are N/A (most tech firms survive?)
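As a quick sanity check on that reading, the dissolved share can be computed directly from the notna mask; a sketch on toy data:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "dissolution_date": [None, None, "2014-06-01", None],
})

# Share of firms with a recorded dissolution date; the ~1% share in the
# real data suggests most tech firms are still active (or their
# dissolutions are simply not recorded).
dissolved_share = df_demo.dissolution_date.notna().mean()
print(dissolved_share)
```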

And the spatial attributes can be found in the following table:

Columns Description
Postcode full unit postcode
Pcd2 postcode split by a space
Pcds postcode with a single space (variable length)
Cty County
Laua Local/Unitary Authority/London Borough
Ctry Country
Rgn Region
Ttwa Travel to Work Area
Pct Primary Care Trust

* 4 countries in this dataset

* 12 regions in dataset

2.4 TTWA EDA (fame_OC)

TTWA stands for "Travel to Work Area"

ref: https://www.ncl.ac.uk/media/wwwnclacuk/curds/files/RR2015-05.pdf

The top 10 TTWAs in the UK are presented below:

In [10]:
top_10_ttwa = df.ttwa.value_counts().head(10)
top_10_ttwa = pd.DataFrame(top_10_ttwa).reset_index()
top_10_ttwa.columns = ["ttwa_code","counts"]

df_ttwa = pd.read_csv(PATH + "/Dataset/ttwa.csv")
df_ttwa = pd.merge(top_10_ttwa, df_ttwa[["code","name"]], left_on="ttwa_code",right_on="code")
df_ttwa_10 = df_ttwa[["ttwa_code", "counts", "name"]]
df_ttwa_10.style
Out[10]:
ttwa_code counts name
0 E30000234 146959 London
1 E30000239 21683 Manchester
2 E30000266 20224 Slough and Heathrow
3 E30000169 11262 Birmingham
4 E30000237 10408 Luton
5 E30000175 9232 Bournemouth
6 E30000179 8368 Brighton
7 E30000186 7296 Cambridge
8 E30000180 7090 Bristol
9 E30000284 6968 Warrington and Wigan

Compared with the largest TTWAs by workplace population (UK, 2011):

(screenshot WX20210712-203201@2x.png: largest TTWAs by workplace population, UK, 2011; not rendered here)

But it is hard to find TTWA boundaries for the visualisation work, so I might consider the laua attribute instead.

The value of this TTWA EDA is:

  1. the top 10 tech clusters, such as London and Manchester, can be identified (more details in Travel to Work Areas (Mike Coombes and Office for National Statistics, 2015))

  2. each TTWA's workplace population is available, which helps in computing LQs

The difficulties:

  1. The boundary dataset sits in the Open Geography Portal, and it takes extra work to extract the spatial information in Python because it is distributed as (Geo)JSON

  2. The boundaries and workplace population figures date from 2011 and are not up to date as of 2021 (I cannot find open TTWA data for 2020)

In [12]:
import geopandas as gpd
g_ttwa = gpd.read_file(PATH + "/Dataset/Spatial/Travel_to_Work_Areas__December_2011__Generalised_Clipped_Boundaries_in_United_Kingdom.geojson")
In [13]:
g_ttwa.plot()
Out[13]:
<AxesSubplot:>

3 Mapping the Tech Cluster (Laua & TTWA)

  1. count the tech firms grouped by laua & ttwa
  2. import the spatial datasets: local authority boundaries & Travel to Work Area boundaries
  3. merge the counts from 1. with the boundaries from 2. to visualise the spatial distribution

3.1 Inspect the fame_OC data

drop the rows that contain NA values in the laua or ttwa columns

In [43]:
print(df.shape)
df1 = df.copy()
df1 = df1.dropna(subset=["laua","ttwa"])
print(df1.shape)
(513708, 64)
(502658, 64)
In [134]:
# import the laua boundaries
df_laua = pd.read_csv(PATH + "/Dataset/Spatial/Local_Authority_2017.csv")
df_laua.head()
Out[134]:
objectid lad17cd lad17nm lad17nmw bng_e bng_n long lat st_areashape st_lengthshape
0 1 E06000001 Hartlepool 447157 531476 -1.27023 54.676159 9.355951e+07 71707.407523
1 2 E06000002 Middlesbrough 451141 516887 -1.21099 54.544670 5.388858e+07 43840.866561
2 3 E06000003 Redcar and Cleveland 464359 519597 -1.00611 54.567520 2.448203e+08 97993.391012
3 4 E06000004 Stockton-on-Tees 444937 518183 -1.30669 54.556911 2.049622e+08 119581.595543
4 5 E06000005 Darlington 428029 515649 -1.56835 54.535351 1.974757e+08 107206.401694
In [11]:
# import the ttwa boundaries
import geopandas as gpd
df_ttwa = gpd.read_file(PATH + "/Dataset/Spatial/Travel_to_Work_Areas__December_2011__Generalised_Clipped_Boundaries_in_United_Kingdom.geojson")
df_ttwa.head()
Out[11]:
OBJECTID TTWA11CD TTWA11NM Shape__Area Shape__Length geometry
0 1 E30000004 Barnsley 3.424992e+08 129308.319110 POLYGON ((-1.44681 53.61282, -1.44475 53.61061...
1 2 E30000018 Bradford 3.438902e+08 119605.015955 POLYGON ((-1.94399 53.95619, -1.94161 53.95252...
2 3 E30000029 Halifax 3.639425e+08 105182.259501 POLYGON ((-2.03257 53.81023, -2.03138 53.80902...
3 4 E30000039 Skipton 1.178893e+09 197077.350855 POLYGON ((-2.30733 54.25008, -2.30511 54.24934...
4 5 E30000046 Dorchester and Weymouth 7.200201e+08 226091.682220 MULTIPOLYGON (((-2.22501 50.61646, -2.22520 50...

The top 10 local authorities can be found in the following list

In [138]:
df_laua_10 = pd.DataFrame(df1.laua.value_counts()[:12]).reset_index()
df_laua_10.columns = ["laua_code","count"]

df_laua_10 = pd.merge(df_laua_10, df_laua[["lad17cd","lad17nm"]], left_on="laua_code",right_on="lad17cd")

df_laua_10 = df_laua_10[["laua_code","count","lad17nm"]]
df_laua_10.columns = ["laua_code","count","name"]
df_laua_10.style
Out[138]:
laua_code count name
0 E09000007 15897 Camden
1 E09000033 12069 Westminster
2 E09000012 12036 Hackney
3 E09000019 10271 Islington
4 E06000043 8026 Brighton and Hove
5 E08000025 7356 Birmingham
6 E09000003 7023 Barnet
7 E08000003 6254 Manchester
8 E08000035 5899 Leeds
9 E09000001 5576 City of London
  • The top 4 (the Camden, Westminster, Hackney and Islington clusters) are all in London

  • Only Brighton and Hove, Birmingham, Manchester and Leeds lie outside London

The top 10 Travel to Work Areas can be found in the following list

In [14]:
top_10_ttwa = df.ttwa.value_counts().head(10)
top_10_ttwa = pd.DataFrame(top_10_ttwa).reset_index()
top_10_ttwa.columns = ["ttwa_code","counts"]

df_ttwa = pd.read_csv(PATH + "/Dataset/ttwa.csv")
df_ttwa = pd.merge(top_10_ttwa, df_ttwa[["code","name"]], left_on="ttwa_code",right_on="code")
df_ttwa_10 = df_ttwa[["ttwa_code", "counts", "name"]]
df_ttwa_10.style
Out[14]:
ttwa_code counts name
0 E30000234 146959 London
1 E30000239 21683 Manchester
2 E30000266 20224 Slough and Heathrow
3 E30000169 11262 Birmingham
4 E30000237 10408 Luton
5 E30000175 9232 Bournemouth
6 E30000179 8368 Brighton
7 E30000186 7296 Cambridge
8 E30000180 7090 Bristol
9 E30000284 6968 Warrington and Wigan

3.2 Import the Laua & TTWA spatial info

This file is Local_Authority_2019_Boundaries/Local_Authority_Districts_(December_2019)_Boundaries_UK_BFC.shp

In [15]:
# read the shape file and plot map

import geopandas as gpd
g_laua = gpd.read_file(PATH + "/Dataset/Spatial/Local_Authority_2019_Boundaries/Local_Authority_Districts_(December_2019)_Boundaries_UK_BFC.shp")
g_ttwa = gpd.read_file(PATH + "/Dataset/Spatial/Travel_to_Work_Areas__December_2011__Generalised_Clipped_Boundaries_in_United_Kingdom.geojson")
# print(g_laua)
In [16]:
g_laua.plot()
g_ttwa.plot()
Out[16]:
<AxesSubplot:>

3.3 Merge the spatial and attribute info

get the laua frequency table (tech firm counts per local authority)

In [29]:
df_laua_freq = pd.DataFrame(df1.laua.value_counts()).reset_index()
df_laua_freq.columns = ["laua_code","count"]
df_laua_freq.head()
Out[29]:
laua_code count
0 E09000007 15897
1 E09000033 12069
2 E09000012 12036
3 E09000019 10271
4 E06000058 9437

get the ttwa frequency table (tech firm counts per travel to work area)

In [44]:
df_ttwa_freq = pd.DataFrame(df1.ttwa.value_counts()).reset_index()
df_ttwa_freq.columns = ["ttwa_code","count"]
df_ttwa_freq.head()
Out[44]:
ttwa_code count
0 E30000234 146959
1 E30000239 21683
2 E30000266 20224
3 E30000169 11262
4 E30000237 10408
  1. merge the df_laua_freq and g_laua
  2. merge the df_ttwa_freq and g_ttwa
In [48]:
df_laua_mg = pd.merge(g_laua, df_laua_freq,left_on='lad19cd',right_on='laua_code',how='inner').reset_index()
df_ttwa_mg = pd.merge(g_ttwa, df_ttwa_freq, left_on='TTWA11CD', right_on='ttwa_code',how='inner').reset_index()
df_laua_mg.plot()
df_ttwa_mg.plot()
Out[48]:
<AxesSubplot:>
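Because how='inner' silently drops boundary areas with no matching counts, it may be worth checking the match first with a left merge and indicator=True; a sketch on toy frames:

```python
import pandas as pd

# Toy stand-ins for g_laua and df_laua_freq.
boundaries = pd.DataFrame({"lad19cd": ["E09000007", "E06000001"]})
counts = pd.DataFrame({"laua_code": ["E09000007"], "count": [15897]})

# indicator=True adds a _merge column flagging which boundary areas found
# no counterpart, instead of dropping them silently as how='inner' does.
checked = pd.merge(boundaries, counts, left_on="lad19cd",
                   right_on="laua_code", how="left", indicator=True)
print(checked["_merge"].value_counts())
```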

3.4 Mapping the Frequency

In [32]:
import pysal as ps
import numpy as np
import pandas as pd
import seaborn as sns
import geopandas as gpd
import matplotlib.pyplot as plt

tfont = {'fontname':'DejaVu Sans', 'horizontalalignment':'left'}
f1_laua = plt.figure()
f1_laua.set_size_inches(20,20)

ax1 = f1_laua.add_subplot()
df_laua_mg.plot(column='count', legend=True, cmap='YlGn', ax=ax1)

f1_laua.subplots_adjust(top=0.95)
f1_laua.suptitle("Spatial distribution of the number of tech firms across UK Local Authorities", x=0.25, size=24, **tfont)
f1_laua.savefig(PATH + '/Img/Frequency_Distribution_of_tech_firms_cluster_by_LAUA.png', dpi=300)
In [34]:
df_ttwa_mg
Out[34]:
index OBJECTID TTWA11CD TTWA11NM Shape__Area Shape__Length geometry ttwa_code count
0 0 1 E30000004 Barnsley 3.424992e+08 129308.319110 POLYGON ((-1.44681 53.61282, -1.44475 53.61061... E30000004 940
1 1 2 E30000018 Bradford 3.438902e+08 119605.015955 POLYGON ((-1.94399 53.95619, -1.94161 53.95252... E30000018 2258
2 2 3 E30000029 Halifax 3.639425e+08 105182.259501 POLYGON ((-2.03257 53.81023, -2.03138 53.80902... E30000029 1164
3 3 4 E30000039 Skipton 1.178893e+09 197077.350855 POLYGON ((-2.30733 54.25008, -2.30511 54.24934... E30000039 267
4 4 5 E30000046 Dorchester and Weymouth 7.200201e+08 226091.682220 MULTIPOLYGON (((-2.22501 50.61646, -2.22520 50... E30000046 708
... ... ... ... ... ... ... ... ... ...
223 223 224 W22000030 Pwllheli and Porthmadog 6.734859e+08 255493.843537 MULTIPOLYGON (((-4.80252 52.75667, -4.80423 52... W22000030 98
224 224 225 W22000031 Rhyl 1.219876e+09 275509.758780 MULTIPOLYGON (((-3.45998 53.28200, -3.46014 53... W22000031 482
225 225 226 W22000032 Swansea 1.002925e+09 409732.315759 MULTIPOLYGON (((-3.76057 51.53667, -3.76110 51... W22000032 1293
226 226 227 W22000033 Tywyn and Dolgellau 1.419986e+09 291174.323726 MULTIPOLYGON (((-3.92305 52.75456, -3.92483 52... W22000033 51
227 227 228 W22000034 Wrexham 5.103039e+08 162452.205500 POLYGON ((-3.12262 53.15796, -3.11578 53.15503... W22000034 507

228 rows × 9 columns

In [36]:
# map the ttwa and frequency of tech clusters
tfont = {'fontname':'DejaVu Sans', 'horizontalalignment':'left'}
f1_ttwa = plt.figure()
f1_ttwa.set_size_inches(20,20)

ax1 = f1_ttwa.add_subplot()
df_ttwa_mg.plot(column='count', legend=True, cmap='YlGn', ax=ax1)

f1_ttwa.subplots_adjust(top=0.95)
f1_ttwa.suptitle("Spatial distribution of the number of tech firms across UK Travel to Work Areas", x=0.25, size=24, **tfont)
f1_ttwa.savefig(PATH+'/Img/Frequency_Distribution_of_tech_firms_cluster_by_TTWA.png', dpi=300)

To create higher contrast, set the count of every TTWA outside the top 10 to a sentinel value.

We need df_ttwa_10 & df_ttwa_mg to make a better visualisation version

In [39]:
df_ttwa_10.head()
Out[39]:
ttwa_code counts name
0 E30000234 146959 London
1 E30000239 21683 Manchester
2 E30000266 20224 Slough and Heathrow
3 E30000169 11262 Birmingham
4 E30000237 10408 Luton
In [42]:
df_ttwa_mg.head()
Out[42]:
index OBJECTID TTWA11CD TTWA11NM Shape__Area Shape__Length geometry ttwa_code count
0 0 1 E30000004 Barnsley 3.424992e+08 129308.319110 POLYGON ((-1.44681 53.61282, -1.44475 53.61061... E30000004 0
1 1 2 E30000018 Bradford 3.438902e+08 119605.015955 POLYGON ((-1.94399 53.95619, -1.94161 53.95252... E30000018 0
2 2 3 E30000029 Halifax 3.639425e+08 105182.259501 POLYGON ((-2.03257 53.81023, -2.03138 53.80902... E30000029 0
3 3 4 E30000039 Skipton 1.178893e+09 197077.350855 POLYGON ((-2.30733 54.25008, -2.30511 54.24934... E30000039 0
4 4 5 E30000046 Dorchester and Weymouth 7.200201e+08 226091.682220 MULTIPOLYGON (((-2.22501 50.61646, -2.22520 50... E30000046 0
In [55]:
df_ttwa_mg_for_viz = df_ttwa_mg.copy()
df_ttwa_mg_for_viz.loc[~df_ttwa_mg_for_viz["TTWA11CD"].isin(df_ttwa_10.ttwa_code.to_list()),"count"]=-100000
df_ttwa_mg_for_viz
Out[55]:
index OBJECTID TTWA11CD TTWA11NM Shape__Area Shape__Length geometry ttwa_code count
0 0 1 E30000004 Barnsley 3.424992e+08 129308.319110 POLYGON ((-1.44681 53.61282, -1.44475 53.61061... E30000004 -100000
1 1 2 E30000018 Bradford 3.438902e+08 119605.015955 POLYGON ((-1.94399 53.95619, -1.94161 53.95252... E30000018 -100000
2 2 3 E30000029 Halifax 3.639425e+08 105182.259501 POLYGON ((-2.03257 53.81023, -2.03138 53.80902... E30000029 -100000
3 3 4 E30000039 Skipton 1.178893e+09 197077.350855 POLYGON ((-2.30733 54.25008, -2.30511 54.24934... E30000039 -100000
4 4 5 E30000046 Dorchester and Weymouth 7.200201e+08 226091.682220 MULTIPOLYGON (((-2.22501 50.61646, -2.22520 50... E30000046 -100000
... ... ... ... ... ... ... ... ... ...
223 223 224 W22000030 Pwllheli and Porthmadog 6.734859e+08 255493.843537 MULTIPOLYGON (((-4.80252 52.75667, -4.80423 52... W22000030 -100000
224 224 225 W22000031 Rhyl 1.219876e+09 275509.758780 MULTIPOLYGON (((-3.45998 53.28200, -3.46014 53... W22000031 -100000
225 225 226 W22000032 Swansea 1.002925e+09 409732.315759 MULTIPOLYGON (((-3.76057 51.53667, -3.76110 51... W22000032 -100000
226 226 227 W22000033 Tywyn and Dolgellau 1.419986e+09 291174.323726 MULTIPOLYGON (((-3.92305 52.75456, -3.92483 52... W22000033 -100000
227 227 228 W22000034 Wrexham 5.103039e+08 162452.205500 POLYGON ((-3.12262 53.15796, -3.11578 53.15503... W22000034 -100000

228 rows × 9 columns

note: .loc assignment works in place; you don't need to write df = df.loc[...], just df.loc[...] = ...
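A minimal demonstration of that in-place behaviour on a toy frame:

```python
import pandas as pd

s = pd.DataFrame({"count": [10, 20, 30]})

# Boolean .loc assignment mutates the frame in place; no reassignment of
# the result is needed (and .loc[...] = ... returns None anyway).
s.loc[s["count"] < 25, "count"] = 0
print(s["count"].tolist())  # [0, 0, 30]
```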

In [62]:
# map the ttwa and frequency of tech clusters again!!
tfont = {'fontname':'DejaVu Sans', 'horizontalalignment':'left'}
f1_ttwa = plt.figure()
f1_ttwa.set_size_inches(20,20)

ax1 = f1_ttwa.add_subplot()
cmap_set = "tab20b"  # other candidates tried: coolwarm, winter, tab20c
df_ttwa_mg_for_viz.plot(column='count', legend=True, cmap=cmap_set, ax=ax1)

f1_ttwa.subplots_adjust(top=0.95)
f1_ttwa.suptitle("Spatial distribution of the number of tech firms across UK Travel to Work Areas", x=0.25, size=24, **tfont)
f1_ttwa.savefig(PATH+'/Img/Frequency_Distribution_of_tech_firms_cluster_by_TTWA' + '_' + cmap_set + '.png', dpi=300)
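An alternative to the -100000 sentinel, which distorts the colour scale and the legend, is to set the non-top-10 counts to NaN; geopandas can then draw those areas in a neutral colour via the missing_kwds argument of GeoDataFrame.plot (e.g. missing_kwds={"color": "lightgrey"}). The masking step, sketched on a toy frame:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_ttwa_mg; count is float so NaN fits without upcasting.
ttwa = pd.DataFrame({
    "TTWA11CD": ["E30000234", "E30000004"],
    "count": [146959.0, 940.0],
})
top_codes = ["E30000234"]

# Mask non-top-10 areas with NaN instead of a sentinel; the colour scale
# then spans only the real counts.
ttwa.loc[~ttwa["TTWA11CD"].isin(top_codes), "count"] = np.nan
print(ttwa["count"].tolist())
```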

3.5 Plot the Bar Chart

In [143]:
df_laua_10.style
Out[143]:
laua_code count name
0 E09000007 15897 Camden
1 E09000033 12069 Westminster
2 E09000012 12036 Hackney
3 E09000019 10271 Islington
4 E06000043 8026 Brighton and Hove
5 E08000025 7356 Birmingham
6 E09000003 7023 Barnet
7 E08000003 6254 Manchester
8 E08000035 5899 Leeds
9 E09000001 5576 City of London
In [240]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="whitegrid")

# Initialize the matplotlib figure
f2_laua, ax2 = plt.subplots(figsize=(15,6))

# Plot the counts as horizontal bars
sns.set_color_codes("pastel")
sns.barplot(x="count", y="name", data=df_laua_10, color="b")

# Add informative axis labels
ax2.set(xlim=(0, 16050), ylabel="", xlabel="the number of technology companies")
ax2.set_title("Top 10 local authorities with the largest number of technology companies in the UK", fontsize=24)
f2_laua.savefig(PATH + "/Img/The_Top_10_Number_of_Tech_Firms_laua.png")
In [69]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="whitegrid")

# Initialize the matplotlib figure
f2_ttwa, ax2 = plt.subplots(figsize=(15,6))

# Plot the counts as horizontal bars
sns.set_color_codes("pastel")
sns.barplot(x="counts", y="name", data=df_ttwa_10, color="b")

# Add informative axis labels
ax2.set(xlim=(0, 150000), ylabel="", xlabel="the number of technology companies")
ax2.set_title("Top 10 Travel to Work Areas with the largest number of technology companies in the UK", fontsize=24)
f2_ttwa.savefig(PATH + "/Img/The_Top_10_Number_of_Tech_Firms_ttwa.png")

4 Explore the Movers

For fame_OC, two cases are possible:

  1. the firm keeps the same name before and after the move
  2. the firm changes its name before and after the move
In [185]:
df2 = df.copy()
firm_name_freq = pd.DataFrame(df2.normalised_name.value_counts()).reset_index()
firm_name_freq.columns = ["firm_name","count"]

# firm's name re-appears
firm_name_re = firm_name_freq[firm_name_freq["count"]>1]
firm_name_re.style
Out[185]:
firm_name count
0 plexus holdings plc 4
1 rhythmone plc 4
2 mcguirk industries limited 2
3 create limited 2
4 solutions design limited 2
5 synthesizedesign limited 2
6 multiconnect limited 2
7 ja richardson engineering limited 2
8 entertainment international limited 2
9 globex corporation limited 2
10 bloody surgery limited 2
11 dwm media limited 2
12 jc film limited 2
13 sar uk limited 2
14 kcom group plc 2
15 jkl tribe limited 2
16 bucks cbt plus limited 2
17 conversion company limited 2
18 proact it uk limited 2
19 access it resource limited 2
20 ready player one limited 2
21 nmr geosciences limited 2
22 grace and grace limited 2
23 amd tech limited 2
24 red red limited 2
25 we unicorns limited 2
26 quality care solutions limited 2
27 future corporation limited 2
28 ctrl z limited 2
29 gopher it support limited 2
30 jd software limited 2
31 cliste consulting limited 2
32 sri corporation limited 2
33 eye clinic limited 2
34 hit reset limited 2
35 agility limited 2
36 acm transport limited 2
37 mak corporation limited 2
38 hive limited 2
39 limited 2
40 mind music limited 2
41 anlex8 limited 2

In this way, I can identify the 42 company names that appear more than once and might indicate a move

Then go back to fame_OC and filter for these possible movers
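One caveat with the name-matching approach: very generic normalised names (such as the bare "limited" in the list above) almost certainly pair unrelated firms rather than a single mover. A sketch of a minimum-length filter on the name stem; the threshold of 3 characters is an arbitrary choice, not taken from the analysis:

```python
import pandas as pd

names = pd.Series([
    "limited", "we unicorns limited", "hive limited", "create limited",
])

# Strip the trailing "limited" suffix and drop names whose remaining stem
# is too short to identify a specific firm.
stems = names.str.replace(r"\s*limited$", "", regex=True).str.strip()
plausible = names[stems.str.len() >= 3]
print(plausible.tolist())
```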

In [222]:
df_fame_movers = df.copy()
df_fame_movers = df_fame_movers[df_fame_movers.normalised_name.isin(firm_name_re.firm_name.to_list())]
# df_fame_movers.to_csv(PATH + "/df_fame_OC_movers.csv")
In [224]:
# count the distinct postcodes per normalised name
df_movers = pd.DataFrame(df_fame_movers.groupby("normalised_name")["postcode"].nunique()).reset_index()
# two or more distinct postcodes suggest these tech firms may have moved
df_movers = df_movers[df_movers["postcode"] >= 2]
Out[224]:
normalised_name postcode
0 access it resource limited 1
1 acm transport limited 1
2 agility limited 2
3 amd tech limited 2
4 anlex8 limited 1
5 bloody surgery limited 1
6 bucks cbt plus limited 1
7 cliste consulting limited 1
8 conversion company limited 2
9 create limited 2
10 ctrl z limited 2
11 dwm media limited 1
12 entertainment international limited 2
13 eye clinic limited 2
14 future corporation limited 2
15 globex corporation limited 2
16 gopher it support limited 1
17 grace and grace limited 2
18 hit reset limited 2
19 hive limited 2
20 ja richardson engineering limited 1
21 jc film limited 2
22 jd software limited 2
23 jkl tribe limited 1
24 kcom group plc 1
25 limited 2
26 mak corporation limited 2
27 mcguirk industries limited 1
28 mind music limited 2
29 multiconnect limited 1
30 nmr geosciences limited 1
31 plexus holdings plc 1
32 proact it uk limited 1
33 quality care solutions limited 2
34 ready player one limited 1
35 red red limited 2
36 rhythmone plc 1
37 sar uk limited 2
38 solutions design limited 2
39 sri corporation limited 2
40 synthesizedesign limited 1
41 we unicorns limited 2
In [235]:
# export the list of tech firms that share a name but have different postcodes
df_movers_1 = df_fame_movers[df_fame_movers["normalised_name"].isin(df_movers["normalised_name"].to_list())]
possible_tech_movers = df_movers_1[["registered_number","incorporation_date","dissolution_date","normalised_name","postcode"]].sort_values(by="normalised_name")
possible_tech_movers
Out[235]:
registered_number incorporation_date dissolution_date normalised_name postcode
164178 3060979 1995-05-24 NaN agility limited EN49HN
144252 07418654 2010-10-25 NaN agility limited EC1V2NX
279459 11632107 2018-10-19 NaN amd tech limited N156RH
182708 06250619 2007-05-17 NaN amd tech limited GL53PN
228143 08424152 2013-02-28 NaN conversion company limited KT70RY
512012 1981640 1986-01-23 NaN conversion company limited YO211QL
157297 SC583384 2017-12-07 NaN create limited EH107DU
285768 7460781 2010-12-06 NaN create limited N17GU
20379 11059998 2017-11-13 NaN ctrl z limited BA113EE
469818 05608171 2005-11-01 NaN ctrl z limited W1T6AD
213430 11634444 2018-10-22 NaN entertainment international limited IG110GH
323343 11143790 2018-01-10 NaN entertainment international limited NW51EN
12019 8070874 2012-05-16 NaN eye clinic limited B288EZ
211689 06219671 2007-04-19 NaN eye clinic limited HU53TS
140510 7744742 2011-08-18 NaN future corporation limited E83DL
39534 5631136 2005-11-21 NaN future corporation limited BL26RT
271030 04397427 2002-03-18 NaN globex corporation limited MK428TU
19904 09552070 2015-04-21 NaN globex corporation limited B976HR
387395 11065921 2017-11-15 NaN grace and grace limited SE64EJ
241453 11412829 2018-06-13 NaN grace and grace limited LL116UG
47123 8546782 2013-05-28 NaN hit reset limited BN16AF
224545 6598405 2008-05-20 NaN hit reset limited KT168PD
242014 3937810 2000-03-01 NaN hive limited LL181RF
511527 10797308 2017-06-01 NaN hive limited YO126HT
291684 10589202 2017-01-30 NaN jc film limited N32JU
143676 11613438 2018-10-09 NaN jc film limited EC1R0DU
424940 9549356 2015-04-19 NaN jd software limited SW152BF
72656 04405957 2002-03-28 NaN jd software limited CB249AD
133614 11306057 2018-04-12 NaN limited E162DQ
326148 10219186 2016-06-07 NaN limited NW99NL
8204 7998847 2012-03-20 NaN mak corporation limited B112AA
39799 10674075 2017-03-16 NaN mak corporation limited BL35AZ
323418 9252531 2014-10-07 NaN mind music limited NW51UA
153204 09331890 2014-11-28 NaN mind music limited EC2R6AY
192884 9989675 2016-02-05 NaN quality care solutions limited HA12SP
407150 7745467 2011-08-18 NaN quality care solutions limited SN33AJ
33981 11438196 2018-06-28 NaN red red limited BH149NY
215822 11233132 2018-03-05 NaN red red limited IG45HQ
332492 6329371 2007-07-31 NaN sar uk limited OX46NG
268281 8922191 2014-03-04 NaN sar uk limited MK11SW
180039 05549948 2005-08-31 NaN solutions design limited GL170PZ
494527 11567756 2018-09-13 NaN solutions design limited WC2H9JQ
368604 8417509 2013-02-25 NaN sri corporation limited S117AQ
398016 8817694 2013-12-17 NaN sri corporation limited SL11UA
151594 11487977 2018-07-27 NaN we unicorns limited EC2A4NE
491599 9732240 2015-08-13 NaN we unicorns limited WC2H7LA