Category Archives: Data

Visualising data on Cambridgeshire council wards with Carto

“Data is the new oil…” is a phrase often used by those in the industry to describe the vast,  valuable and extremely powerful resource available to us. We generate an astronomical amount of data, in fact 90% of the data in the world today has been created in the last two years alone according to IBM. However, there is a real skills gap when it comes to working with the data – people often perceive it as being too complicated or intimidating to work with, while those who can extract it, process it and refine it will reap the rewards.

Data is readily available if you know where to look for it. The UK Government’s data.gov.uk site opens up datasets on the environment, health and education and often these datasets come cleansed without the need to process it too much (though I would always advise the cleansing and scrutiny of data regardless of source!). As a Cambridgeshire resident, I am very lucky to have access to Cambridgeshire Insight Open Data, which is a repository of open data relating to local matters and is managed by Cambridgeshire County Council and run by a local partnership. I therefore wanted to have a go at playing around with the data and utilising the numerous free tools available to explore and visualise it with the hope of making it more accessible – let’s face it, there’s more to life than staring at a spreadsheet!

My first project will be looking at crime figures in the county and projecting them on to a map, by council ward, using an online tool called Carto. The data will be sourced from the Cambridgeshire Crime Rates and Counts 2015/16 data set on the Cambridgeshire Insight Open Data site and I will be focusing on the data related to the rate of crimes per 1,000 population at ward and district level. The data I’m using can be found in column “CU” in the Excel sheet.

The benefit of looking at the rate instead of raw numbers is that it means the stats aren’t skewed by heavily populated areas e.g. I would expect the number of crimes in heavily populated areas to be higher than those in remote rural communities, but as the rate takes in to account the population within the area, it provides a more accurate measure.

The mapping processes behind Carto aren’t sophisticated enough to recognise a council ward name in data (yet), but it does allow you to overlay KML data meaning that the wards can be “drawn” on to the map in a new layer. Where does this KML data come from? Surely we don’t have to create it ourselves? Heh, don’t be silly, someone has already done all of the hard work for us (phew!) – thank you to Alasdair Rae, a Lecturer in the Department of Town and Regional Planning at the University of Sheffield who created it for an article in the Guardian and a spreadsheet containing all of the KML shapes of all council wards in England can be downloaded via a Google Fusion table.

I then matched up the wards, by name, between the two sheets and pulled out the KML code for each one and the resulting sheet can be found here. From here, everything can be done from within Carto so create yourself a free account and prepare to be dazzled!

There are plenty of Carto tutorials but there are essentially two sides to the tool: the dataset side (where the data can be uploaded) and the mapping side (where the maps connect to your existing datasets). The great thing about Carto is that your maps can be embedded in your own website so hey presto, here’s the resulting map…

Total crime rate per 1,000 people by Cambridgeshire ward (2015/2016)

Source: Cambridgeshire Insight Open Data :: Full page view of the map

I hope to make this a regular series of small tutorials about data visualisation and the use of open data so check back in a few months to see what else I have been up to!

Disclaimer: I am employed by Cambridgeshire County Council, though I have no links with the Insight Open Data team.

Quick links

Data or it didn’t happen!

I attended the Visualized.io conference in November and amongst many brilliant talks was one by @JazzPazz about lifelogging where I first heard the phrase “Data or it didn’t happen”. Much like the phrase I’ve heard many times before, “Photo or it didn’t happen…”, we are now moving towards an age where we have to prove we have seen something, been somewhere, walked that epic walk. I think I welcome this to a point; whenever I see a sweeping statement made in the paper I want to see the numbers to prove it (though stats can of course be spun to say what you want) and anything that makes us question the drivel we are fed in the papers can only be a good thing.

What I don’t welcome is when the stats let us down!

I have started a fitness regime (HA, yes me!) and am on the Couch to 5K running program. I’ve got the podcasts on my phone, I’m signed up to Endomondo and so far it’s been great. Wow, the stats I can get on each C25K running session are awesome and I love being able to see how I’ve improved via the fastest mile, average pace or 12 minute test, I love it. UNTIL TODAY. Today, my GPS let me down and copped out over half way through the run – technology let me down. I had such a great run too; it was cold outside, the grass was crispy and my route took me along an old railway line where birds were darting through the trees and I smashed it. I was able to run each minute session in full and repeated it eight times, I’m recovering quicker between each run and I felt I went faster…but I have no proof!

Data or it didn’t happen.

I know it happened, I know how awesome it felt and that’s something I need to focus on more. Why do I log my runs and share them anyway? It’s certainly not to prove that Usain Bolt has anything to worry about! But I guess it’s for pride in myself, for motivation purposes, a badge of honour (gamification of life…) and to play with all the graphs. In a few months time this will just be a minor hiccup on the graph, but it was interesting to see how much of an effect it had on me. Next run I’m going to do even better to get my stats back on track…unless the GPS lets me down again and then it’s war!

Mapping Country Links in Tableau

Map I wanted!  

I’ve been looking to do this sort of visualisation with map data for a while; to show links between countries and to have the chunkiness of the lines correspond to particular amounts. I am still very, very new to Tableau but thought their tutorials may be a great place to start. The Advanced Mapping Techniques tutorial had exactly what I wanted, though bizarrely their training workbooks below the video don’t contain that specific example. However, by scrolling through the video and working it out myself – thank GOODNESS there was a shot of the data behind the scenes around the 1 minute mark – I managed to put together my own version, and it works! Woo hoo!

So, how did I get there? Here’s how to do it step-by-step:

1. Create a spreadsheet with the following data (you can copy & paste mine from below).

IMPORTANT – each line is a one-way path but in order for it work, there must also be a return path. For example, the data in the first row goes from the UK to US, but there also needs to be a return line back from the US to the UK (row 4). If you don’t do this, all the points in the first three rows will show in the UK unless they have return points – which is a bit pointless (fnarr!)

Country Path Total
UK UK-US 200
UK UK-Brazil 450
UK UK-China 700
US UK-US 200
Brazil UK-Brazil 450
China UK-China 700

2. Create your workbook in Tableau and import your data (I connect live so I can tweak my data behind the scenes).

3. Tableau will automatically generate the Longitude and Latitude for you, but make sure you spell the country names correctly! Then set-up the following dimensions:

Columns: Longitude
Rows: Latitude
Color: Path
Size: SUM(Total)

Here’s a screenshot to help! Oh and make sure it’s set to a “Line” graph.

Tableau setup

Hooray you now have your map set-up!

You can now do funky things like creating (sort of) arrows instead of fat lines by altering the data like this…

Country Path Total
UK UK-US 200
UK UK-Brazil 450
UK UK-China 700
US UK-US
Brazil UK-Brazil
China UK-China

This means the line width starts from nothing at the point of origin and grows as it reaches its destination. Your map should now look like this:

Map pointy lines

See, that was fairly painless, though I still want to continue having a play around with it, I’m sure there is a lot more you can do. I’d like to find a definitive list of country names that can be used in Tableau, can initials work too I wonder? Who knows, that’s for another time…

GCSE Results

There has been much chatter this week about GCSE results, which not only serves as a means by which to make me feel old (16 years ago people, 16 YEARS!) but it has also been a fabulous opportunity to dive in to the data. News outlets predictably report on the performance gap between girls & boys and the increase/decrease in the uptake of particular subjects…some things never change! I personally wanted to get my hands on the data so that I could whack it in to Tableau and play with it myself (thus also learning how to use Tableau too – double bonus!)

I obtained all of my data from a comprehensive Guardian article and wrangled it a bit get it in to the format that I needed (BIG shout out for the fabulous Tableau add-on for reshaping data in Excel!)

NOTE: As this blog is hosted by WordPress, I can’t embed the graphics due to Javascript restrictions. So if you want to see the graphics in all of their beauty and filter by different criteria, please click the link below each graphic).

GCSE Subject Breakdown

Subjects taken by all candidates in 2013 – interactive graphic allows you to filter by gender and year.

GCSE Subject Treemap

Link to interactive graphic

Percentage of grades by gender

Split by gender and grade – you can see that girls’ entries achieve higher grades than boys’. As widely reported, 8.3% of entries from girls achieved an A* compared to 5.3% of entries from boys. The interactive graphic allows you to filter by year.

Grades for entries by gender

Link to interactive graphic

Change in the percentage of entries achieving each grade

This graphic demonstrates that the percentage of entries achieving top grades has fallen between 2012 and 2013. The interactive graphic allows you to filter by gender.

Change in grades for entries

Link to interactive graphic

Results per subject

For each subject, we have a breakdown of the percentage of entries achieving each grade – for example 16.6% of Chemistry entries achieved an A* compared to 9.8% of French entries. The interactive graphic allows you to filter by gender.

Results per subject

Link to interactive graphic

This was a fantastic opportunity to for me to brush up on my Tableau skills and play with interesting and insightful data. Happy to hear comments on how I can improve my work or further areas I can explore.

Displaying English districts in Google Earth

The other day at work, I was asked to find out how many X* were in a particular district in England. Given we seem to have SO MANY ways that we slice & dice the country up with Ceremonial Counties, Parliamentary Constituencies, Local Government Districts (which can be boroughs, cities or Royal Boroughs) you can see why I started to have a small panic! But then I remembered you could search for districts in Google and it would outline them with a little red boundary line, so I gave it go.

“Woking District”

Woking Outline

Ta dahhhh!

Great…except I then wanted to overlay the data I already had mapped in a Google Fusion table with longitudes & latitudes. FAIL. I could find no way of either exporting this boundary data, or overlaying the longitude & latitude points I had setup in a fusion table. So I Googled around the subject and found this wonderful post on Stack Overflow – so I needed to head on over the OpenStreetMap, download the boundary data for my district, additionally convert the gpx file to a KML and then whack it in to Google Earth…so I did that!

1) Find the district you want in OpenStreetMap

2)  Download the gpx file and then convert it to a KML on Gpx2KML.com

3) Load your resulting KML in to Google Earth and hey presto…

Google Earth - Woking

I then exported my data with the longitude & latitude points in it from my Google Fusion table as a KML file and then imported it in to Google Earth and all the little points appear on a layer over my district boundary.

Woking Schools - Google Earth

Hooray! Fortunately, there were only four points within the boundary – I need to think about what would happen if we were talking hundreds or thousands – how would I manage that? Is there a way to filter in Google Earth to say give me all the X in boundary Y? I’m not sure but I’ll be darn sure to find out!

* Trying to maintain a bit of anonymity here!

Using Python & BeautifulSoup to scrape a Wikipedia table

Well, it was only a couple of weeks ago that I set myself a challenge to complete the Python course on Codecademy and I did it – I completed the Python track and it was fantastic! I was given the opportunity to put my newly found Python skills in to action this week as I needed to scrape some data from a Wikipedia page – I have a table of addresses and need to compare the County in the list that has been provided to the one that it really should be. This page on Wikipedia contains the data I need, for each Postcode district there’s a Postal County and I could use this data as a comparison – formatted in an HTML table like this:

WikiCapture

Normally, I’d just copy & paste the table in to Excel for use later on BUT it’s not as easy as that (oh no!), as there can be are multiple Postcode Districts within a row which is slightly annoying! To be of any use to me, I need the data to be formatted so that there is a row for each Postcode District like so (I don’t necessarily need the Postcode Area & Town but I’ll keep them anyway – I don’t like throwing away data!):

Postcode Area Postcode District Post Town Former Postal County
AB AB10 ABERDEEN Aberdeenshire
AB AB11 ABERDEEN Aberdeenshire
AB AB13 ABERDEEN Aberdeenshire
AB AB15 ABERDEEN Aberdeenshire

And so I thought this would be the perfect project for me to undertake in Python and to familiarise myself with friend-of-the-screen-scrapers, BeautifulSoup. I won’t jabber on too much about BeautifulSoup as I’m not fully up to speed on it myself yet, but from reading around the subject I gather it’s a great way to grab elements from web pages for further processing.

Step One: Wikipedia doesn’t like you…

Wikipedia doesn’t like this code:

[code language=”Python” highlight=”7″]
from bs4 import BeautifulSoup
import urllib2
wiki = “http://en.wikipedia.org/wiki/List_of_postcode_districts_in_the_United_Kingdom”
page = urllib2.urlopen(wiki)
soup = BeautifulSoup(page)
print soup
#urllib2.HTTPError: HTTP Error 403: Forbidden

[/code]

Wikipedia only allows access to recognised user agents in order to stop bots retrieving bulk content. I am not a bot, I just want to practise my Python and so to get around this you just need some additional code to the header (thanks to Stack Overflow for coming to the rescue).

Step Two: Hunt the table

If you look at the code behind the Wikipedia article, you’ll see that there are multiple tables but only one (thankfully the one we want) uses the “wikitable sortable” class – this is great as we can use BeautifulSoup to find  the table with the “wikitable sortable” class and know that we will only get this table.

[code language=”Python”]
from bs4 import BeautifulSoup
import urllib2
wiki = “http://en.wikipedia.org/wiki/List_of_postcode_districts_in_the_United_Kingdom”
header = {‘User-Agent’: ‘Mozilla/5.0’} #Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki,headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

area = “”
district = “”
town = “”
county = “”
table = soup.find(“table”, { “class” : “wikitable sortable” })
print table
[/code]

Output looks like this:

TableOutput

Great! This means that we just have the HTML table stored in our variable. Now, it’s just a case of iterating through the rows and columns…easy…*ahem*

Step Three: For your iteration pleasure

We need to do the iteration in two stages – the first stage is to iterate through each row (tr element) and then assign each element in the tr to a variable. At this stage, we will grab everything in the Postcode Districts column and store it in a list for further iteration later. To do this, I used the following code:

[code language=”Python” firstline=”19″]
for row in table.findAll(“tr”):
cells = row.findAll(“td”)
#For each “tr”, assign each “td” to a variable.
if len(cells) == 4:
area = cells[0].find(text=True)
district = cells[1].findAll(text=True)
town = cells[2].find(text=True)
county = cells[3].find(text=True)
[/code]

The .findAll function in Python returns a list and so on line 20, we obtain a list containing four elements, one for each of the columns in the table. This means they can be accessed via the cells[n].find(text=True) syntax. You’ll notice that I’ve used .findAll for the Postal Districts column, this is because I want a list of the items within the cell for iteration purposes later!

After this code executes, I have a value for the area, a list of districts, a town and a county. Now for the second part of my iteration:

[code language=”Python” firstline=”28″]
#district can be a list of lists, so we want to iterate through the top level lists first…
for x in range(len(district)):
#For each list, split the string
postcode_list = district[x].split(“,”)
#For each item in the split list…
for i in range(len(postcode_list)):
#Check it’s a postcode and not other text
if (len(postcode_list[i]) > 2) and (len(postcode_list[i]) <= 5):
#Strip out the “n” that seems to be at the start of some postcodes
write_to_file = area + “,” + postcode_list[i].lstrip(‘n’).strip() + “,” + town + “,” + county + “n”
print write_to_file
[/code]

I found that, instead of district being a standard list of postcodes, in some cases it was a list of lists (oh joy!). I was expecting it to looks like this:

[u’AB10, AB11, AB12, AB15, AB16, nAB21, AB22, AB23, AB24, AB25, nAB99, non-geo’] *

*Ignore the n signs and non-geo text – we’ll deal with them later!

I got this…

[u’AB10, AB11, AB12, AB15, AB16,‘, u’nAB21, AB22, AB23, AB24, AB25,‘, u’nAB99‘, u’non-geo‘]

And so I needed an additional layer of iteration: one for the whole list and then another for the items in the individual lists. Simple.

For each item in the list, the .split(",") function in Python allowed me to split out the comma separated list of postcodes in to a list that could be iterated over. For each item in that list, we just check to see if it’s a postcode (a check on string length sufficed nicely this time!) and then build up our output string. To deal with the n that was appended to some of the postcodes, I just left-stripped the string to remove the n characters and hey presto it worked!

I flushed the output to a CSV file as well as to the screen and it worked beautifully!

Here is the full code:

[code language=”Python”]
from bs4 import BeautifulSoup
import urllib2

wiki = “http://en.wikipedia.org/wiki/List_of_postcode_districts_in_the_United_Kingdom”
header = {‘User-Agent’: ‘Mozilla/5.0’} #Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki,headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

area = “”
district = “”
town = “”
county = “”

table = soup.find(“table”, { “class” : “wikitable sortable” })

f = open(‘output.csv’, ‘w’)

for row in table.findAll(“tr”):
cells = row.findAll(“td”)
#For each “tr”, assign each “td” to a variable.
if len(cells) == 4:
area = cells[0].find(text=True)
district = cells[1].findAll(text=True)
town = cells[2].find(text=True)
county = cells[3].find(text=True)

#district can be a list of lists, so we want to iterate through the top level lists first…
for x in range(len(district)):
#For each list, split the string
postcode_list = district[x].split(“,”)
#For each item in the split list…
for i in range(len(postcode_list)):
#Check it’s a postcode and not other text
if (len(postcode_list[i]) > 2) and (len(postcode_list[i]) <= 5):
#Strip out the “n” that seems to be at the start of some postcodes
write_to_file = area + “,” + postcode_list[i].lstrip(‘n’).strip() + “,” + town + “,” + county + “n”
print write_to_file
f.write(write_to_file)

f.close()
[/code]

Disclaimer(ish)

This code has no additional error checking or handling and was merely written to solve a small problem I had, and to put in to practise everything I’d learned so far. It does also just work for this particular table on the Wikipedia page – although it could be adapted for use on other tables. But, it was great fun to put the learning in to action and work on a real-life problem to solve. Here’s to more exercises like this!