Category Archives: Python

Using a UK postcode to find Lat and Long and other useful information

So I have a list of UK postcodes and I want to find out the latitude and longitude of that postcode to plot it on to a map. There are numerous ways to do this, including:

1. Visiting Google Maps

2. Entering the postcode in to the search box

3. Right-clicking on the little point on the map and selecting “What’s here?”

4. Copying and pasting the lat and long that magically appears in the search box.

That’s nice, quick and easy, BUT what if I’ve got hundreds or thousands of postcodes to search? I can’t spend the whole sunny weekend doing the same procedure, can I?

No.

I’m self-learning Tableau at the moment and it seems to have great location support… if you provide it with US addresses (gnrrr). I wanted to find a way of plotting a series of points on a map of the UK derived from their addresses. A bit of Google searching led me to UK Postcodes (set up by Stuart Harrison and based on Ordnance Survey OpenData), a site that lets you enter a UK postcode and returns lots of information about that location (e.g. long, lat and county, right down to district and ward). What made me excited was that the site has an API allowing you to pass it a postcode via URL (e.g. http://www.uk-postcodes.com/postcode/NW18TQ.json) and have it output the metadata in XML, CSV, JSON or RDF. PERFECT!
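To give a feel for what comes back, here’s a minimal sketch of parsing such a JSON response with Python’s json module. The payload below is a hypothetical, trimmed sample I’ve written for illustration; only the “geo” block with its “lat” and “lng” keys mirrors the fields my script actually reads.

```python
import json

# Hypothetical, trimmed sample of an API response for a postcode.
# Only the "geo" block ("lat"/"lng" keys) mirrors the fields used later on.
sample = '{"postcode": "NW1 8TQ", "geo": {"lat": 51.536327, "lng": -0.140392}}'

# Parse the JSON string into a plain dict and pull out the coordinates
data = json.loads(sample)
print("lat=%s, lng=%s" % (data['geo']['lat'], data['geo']['lng']))
```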

After a further read around the site, I found that Edd Robinson had created a library acting as a wrapper for the API, which I could import into my own Python project. And so, without further ado, here is my Python code:

[code language="Python"]
from postcodes import PostCoder
import csv

f = open('Extracted_Data_from_Postcodes.csv', 'w')
i = 0
pc = PostCoder()

loc_lat = ""
loc_long = ""

with open('Postcodes_to_Extract.csv', 'rU') as g:
    reader = csv.reader(g)

    for row in reader:
        #Col 0 = ID :: Col 1 = Postcode
        result = pc.get(str(row[1]))

        if result is not None:
            location = result['geo']

            for key, value in location.items():
                if key == "lat":
                    loc_lat = value
                if key == "lng":
                    loc_long = value

            #ID, Postcode, Lat, Long
            write_to_file = str(row[0]) + ", " + str(row[1]) + ", " + str(loc_lat) + ", " + str(loc_long) + "\n"

            #Add the iteration count to the screen output to see how far we are up to.
            print str(i + 1) + ", " + write_to_file
            f.write(write_to_file)
        else:
            #If there is a problem translating the postcode, output "BROKEN" to file to manually check.
            write_to_file = "BROKEN, " + str(result) + ", " + str(row[0]) + ", " + str(row[1]) + "\n"

            print str(i + 1) + ", BROKEN, " + str(row[0]) + ", " + str(row[1])
            f.write(write_to_file)
        i = i + 1

f.close()
[/code]

My input file looked like this:

[code]
109484, SG5 4PF
109486, MK44 2DB
109487, LU4 9UJ
109488, LU6 1RE
109489, MK43 8DY
109490, MK45 5JH
109491, MK44 3QD
109492, MK45 3BX
109493, MK17 9QL
109494, MK43 9JT
[/code]

And my screen output looked like this:

[code]
1, BROKEN, 109484, SG5 4PF
2, 109486, MK44 2DB, 52.214741, -0.461977
3, 109487, LU4 9UJ, 51.927696, -0.500824
4, 109488, LU6 1RE, 51.879322, -0.563452
5, 109489, MK43 8DY, 52.164539, -0.623209
6, 109490, MK45 5JH, 51.982376, -0.495111
7, 109491, MK44 3QD, 52.137440, -0.377085
8, 109492, MK45 3BX, 52.080341, -0.446214
9, 109493, MK17 9QL, 51.989906, -0.619803
10, 109494, MK43 9JT, 52.095955, -0.528752
[/code]

Afterwards, I was able to manually check why SG5 4PF failed – actually, I’m still not sure why it failed, but I was able to correct the lat & long via a Google Maps search. Exactly what I needed with minimal effort, and a chance to flex my muscles with Python again. I then imported the CSV into Tableau, but that’s the subject of a future blog post…

Setting up Python and Easy_Install on Windows 7

This weekend, I needed to set up Python on a Windows 7 virtual machine that I had set up on my Mac. I don’t know why, but I always seem to have a problem installing Python and setting up the environment, so I’m going to document the process here so that in X months’ time, when I need to do it again, I can just follow my own instructions!

As part of the process, I also install easy_install to install libraries and so I’ll also document that here.

1. Download the Python Windows installer from here -> http://www.python.org/download/ (I downloaded Python 2.7.5).

2. Run the installer – I normally leave all the defaults as they are, and so end up installing it in C:\Python27

3. Once installed, SET UP YOUR ENVIRONMENT VARIABLES. I cannot express how important this is, as you’ll hit so many problems without doing it (try opening the command prompt and typing python to see Windows shrug its shoulders and say “huh…dunno…”)

  • If you haven’t played with environment variables before, just stick to following these instructions as you can set them up through the Windows GUI.
  • Right click on “My Computer”, select “Properties” > “Advanced system settings” and click on the “Environment Variables” button
  • In the System Variables box, find the variable called “path” and click on the “Edit…” button
  • In the “Variable value” box, at the end of the entry, add the following text: ;C:\Python27;C:\Python27\Scripts (note the leading semicolon; also note that I’m using the default installation folder – if you have installed Python elsewhere, just change the path structure).
  • Click “OK” a couple of times and hey presto, your environment variables are set up.
  • To test it has worked, open a new command window and type in python and if you see this, you’re on to a winner…

Microsoft Windows [Version 6.1.7600]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.

C:\Users\adelegilpin>python
Python 2.7.5 (default, May 15 2013, 22:43:36) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>>

4. The next step in the process is to set up easy_install and so we need to go to the setuptools page (links to version 0.8) and download the ez_setup.py script (I normally put it straight in to my Python27 folder).

5. Open a command prompt and type python ez_setup.py install – you’ll see a load of output whizz by, which will hopefully end as follows:

Processing dependencies for setuptools==0.8
Finished processing dependencies for setuptools==0.8

C:\Python27>

6. easy_install has now been set up. To check it’s there, type easy_install into a command prompt – it will throw an error about no URLs, which tells you the tool has been set up successfully.

Done!

To use easy_install to get new libraries, just use the following syntax: easy_install <library name> (e.g. easy_install postcodes)

Using Python & BeautifulSoup to scrape a Wikipedia table

Well, it was only a couple of weeks ago that I set myself a challenge to complete the Python course on Codecademy and I did it – I completed the Python track and it was fantastic! I was given the opportunity to put my newly found Python skills into action this week, as I needed to scrape some data from a Wikipedia page. I have a table of addresses and need to compare the County in the list that has been provided to the one that it really should be. This page on Wikipedia contains the data I need: for each Postcode District there’s a Postal County that I could use as a comparison, formatted in an HTML table like this:

[Screenshot: the postcode table on the Wikipedia page]

Normally, I’d just copy & paste the table into Excel for use later on, BUT it’s not as easy as that (oh no!), as there can be multiple Postcode Districts within a row, which is slightly annoying! To be of any use to me, I need the data formatted so that there is a row for each Postcode District, like so (I don’t necessarily need the Postcode Area & Town but I’ll keep them anyway – I don’t like throwing away data!):

[code]
Postcode Area  Postcode District  Post Town  Former Postal County
AB             AB10               ABERDEEN   Aberdeenshire
AB             AB11               ABERDEEN   Aberdeenshire
AB             AB13               ABERDEEN   Aberdeenshire
AB             AB15               ABERDEEN   Aberdeenshire
[/code]

And so I thought this would be the perfect project for me to undertake in Python and to familiarise myself with friend-of-the-screen-scrapers, BeautifulSoup. I won’t jabber on too much about BeautifulSoup as I’m not fully up to speed on it myself yet, but from reading around the subject I gather it’s a great way to grab elements from web pages for further processing.

Step One: Wikipedia doesn’t like you…

Wikipedia doesn’t like this code:

[code language="Python" highlight="7"]
from bs4 import BeautifulSoup
import urllib2
wiki = "http://en.wikipedia.org/wiki/List_of_postcode_districts_in_the_United_Kingdom"
page = urllib2.urlopen(wiki)
soup = BeautifulSoup(page)
print soup
#urllib2.HTTPError: HTTP Error 403: Forbidden
[/code]

Wikipedia only allows access to recognised user agents, in order to stop bots retrieving bulk content. I am not a bot, I just want to practise my Python, so to get around this you just need to add a user-agent header to the request (thanks to Stack Overflow for coming to the rescue).

Step Two: Hunt the table

If you look at the code behind the Wikipedia article, you’ll see that there are multiple tables but only one (thankfully the one we want) uses the “wikitable sortable” class – this is great, as we can use BeautifulSoup to find the table with the “wikitable sortable” class and know that we will only get this table.

[code language="Python"]
from bs4 import BeautifulSoup
import urllib2
wiki = "http://en.wikipedia.org/wiki/List_of_postcode_districts_in_the_United_Kingdom"
header = {'User-Agent': 'Mozilla/5.0'} #Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki, headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

area = ""
district = ""
town = ""
county = ""
table = soup.find("table", {"class": "wikitable sortable"})
print table
[/code]

Output looks like this:

[Screenshot: the HTML of the table printed to the console]

Great! This means that we just have the HTML table stored in our variable. Now, it’s just a case of iterating through the rows and columns…easy…*ahem*

Step Three: For your iteration pleasure

We need to do the iteration in two stages – the first stage is to iterate through each row (tr element) and then assign each element in the tr to a variable. At this stage, we will grab everything in the Postcode Districts column and store it in a list for further iteration later. To do this, I used the following code:

[code language="Python" firstline="19"]
for row in table.findAll("tr"):
    cells = row.findAll("td")
    #For each "tr", assign each "td" to a variable.
    if len(cells) == 4:
        area = cells[0].find(text=True)
        district = cells[1].findAll(text=True)
        town = cells[2].find(text=True)
        county = cells[3].find(text=True)
[/code]

BeautifulSoup’s .findAll function returns a list, and so on line 20 we obtain a list containing four elements, one for each of the columns in the table. This means they can be accessed via the cells[n].find(text=True) syntax. You’ll notice that I’ve used .findAll for the Postcode Districts column; this is because I want a list of the items within the cell for iteration purposes later!

After this code executes, I have a value for the area, a list of districts, a town and a county. Now for the second part of my iteration:

[code language="Python" firstline="28"]
        #district can be a list of lists, so we want to iterate through the top level lists first…
        for x in range(len(district)):
            #For each list, split the string
            postcode_list = district[x].split(",")
            #For each item in the split list…
            for i in range(len(postcode_list)):
                #Check it's a postcode and not other text
                if (len(postcode_list[i]) > 2) and (len(postcode_list[i]) <= 5):
                    #Strip out the "\n" that seems to be at the start of some postcodes
                    write_to_file = area + "," + postcode_list[i].lstrip('\n').strip() + "," + town + "," + county + "\n"
                    print write_to_file
[/code]

I found that, instead of district being a standard list of postcodes, in some cases it was a list of lists (oh joy!). I was expecting it to look like this:

[u'AB10, AB11, AB12, AB15, AB16, \nAB21, AB22, AB23, AB24, AB25, \nAB99, non-geo'] *

*Ignore the \n characters and non-geo text – we’ll deal with them later!

I got this…

[u'AB10, AB11, AB12, AB15, AB16,', u'\nAB21, AB22, AB23, AB24, AB25,', u'\nAB99', u'non-geo']

And so I needed an additional layer of iteration: one for the whole list and then another for the items in the individual lists. Simple.

For each item in the list, the .split(",") function in Python allowed me to split out the comma-separated list of postcodes into a list that could be iterated over. For each item in that list, we just check to see if it’s a postcode (a check on string length sufficed nicely this time!) and then build up our output string. To deal with the \n that was appended to some of the postcodes, I just left-stripped the string to remove the \n characters and hey presto, it worked!
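Taken in isolation, the split-and-filter step can be sketched like this (the district value below is a made-up sample shaped like the scraped cell, not data from the live page):

```python
# Hypothetical sample shaped like the scraped Postcode Districts cell
district = [u'AB10, AB11, AB12,', u'\nAB21, AB22,', u'\nAB99', u'non-geo']

postcodes = []
for chunk in district:
    for item in chunk.split(','):
        # Crude length check filters out empty strings and 'non-geo'
        if 2 < len(item) <= 5:
            # Drop the leading "\n" and any surrounding spaces
            postcodes.append(item.lstrip('\n').strip())

print(postcodes)
```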

I flushed the output to a CSV file as well as to the screen and it worked beautifully!

Here is the full code:

[code language="Python"]
from bs4 import BeautifulSoup
import urllib2

wiki = "http://en.wikipedia.org/wiki/List_of_postcode_districts_in_the_United_Kingdom"
header = {'User-Agent': 'Mozilla/5.0'} #Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki, headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

area = ""
district = ""
town = ""
county = ""

table = soup.find("table", {"class": "wikitable sortable"})

f = open('output.csv', 'w')

for row in table.findAll("tr"):
    cells = row.findAll("td")
    #For each "tr", assign each "td" to a variable.
    if len(cells) == 4:
        area = cells[0].find(text=True)
        district = cells[1].findAll(text=True)
        town = cells[2].find(text=True)
        county = cells[3].find(text=True)

        #district can be a list of lists, so we want to iterate through the top level lists first…
        for x in range(len(district)):
            #For each list, split the string
            postcode_list = district[x].split(",")
            #For each item in the split list…
            for i in range(len(postcode_list)):
                #Check it's a postcode and not other text
                if (len(postcode_list[i]) > 2) and (len(postcode_list[i]) <= 5):
                    #Strip out the "\n" that seems to be at the start of some postcodes
                    write_to_file = area + "," + postcode_list[i].lstrip('\n').strip() + "," + town + "," + county + "\n"
                    print write_to_file
                    f.write(write_to_file)

f.close()
[/code]

Disclaimer(ish)

This code has no additional error checking or handling; it was merely written to solve a small problem I had, and to put into practice everything I’d learned so far. It also only works for this particular table on the Wikipedia page – although it could be adapted for use on other tables. But it was great fun to put the learning into action and work on a real-life problem. Here’s to more exercises like this!

Self development fortnight

Well, it’s been a while since my last update and a lot has been happening with this blog behind the scenes (bye, bye 1&1 hosting and hello to the wonderful Squirrel Hosting) and with myself.

In short, I am going into hospital for an operation on 6th June, which is a little bit nerve-wracking, exciting (strange… but this op will vastly improve my quality of life) and unknown – I haven’t had an operation since my tonsils were removed when I was six, so I don’t know what to expect; I only have Holby City to go on! As a consequence, I have two weeks off work – time off beforehand for preparation & relaxation, and then time off afterwards to rest & recuperate. I am putting my foot down now and saying that these two weeks will be for self development, learning new things and essentially NOT SITTING AROUND AND WATCHING DAYTIME TV (like I always find myself doing when I have the odd day off!)… except for Pointless, I love Pointless and so that’s my only exception.

So, I have created an account on Codecademy and am slowly making my way through the Python course. At the time of writing, I am up to lesson 8 on Loops and I am thoroughly enjoying it, it’s such a great way to learn a new language. I have written C++ and C# code in the past and so am not a complete beginner, but it’s great to start right at the beginning and learn a new language from scratch. It’s a bit of a revelation not to have to add a semi-colon at the end of a line of code…it feels a bit naughty!

When I have completed the Python course, I hope to undertake a small project of my own. My main motivation for learning Python is for screen-scraping and data extraction purposes and so I’d like to start a project to help me to gain experience in these areas. I will of course keep blogging about my progress and the new discoveries that I make.

The fortnight of self development starts now…