Using Python & BeautifulSoup to scrape a Wikipedia table

Well, it was only a couple of weeks ago that I set myself a challenge to complete the Python course on Codecademy and I did it – I completed the Python track and it was fantastic! I was given the opportunity to put my newly found Python skills into action this week as I needed to scrape some data from a Wikipedia page – I have a table of addresses and need to compare the County in the list that has been provided to the one that it really should be. This page on Wikipedia contains the data I need: for each Postcode District there’s a Postal County, so I could use this data as a comparison – formatted in an HTML table like this:

WikiCapture

Normally, I’d just copy & paste the table into Excel for use later on BUT it’s not as easy as that (oh no!), as there can be multiple Postcode Districts within a row, which is slightly annoying! To be of any use to me, I need the data to be formatted so that there is a row for each Postcode District, like so (I don’t necessarily need the Postcode Area & Town but I’ll keep them anyway – I don’t like throwing away data!):

Postcode Area Postcode District Post Town Former Postal County
AB AB10 ABERDEEN Aberdeenshire
AB AB11 ABERDEEN Aberdeenshire
AB AB13 ABERDEEN Aberdeenshire
AB AB15 ABERDEEN Aberdeenshire

And so I thought this would be the perfect project for me to undertake in Python and to familiarise myself with friend-of-the-screen-scrapers, BeautifulSoup. I won’t jabber on too much about BeautifulSoup as I’m not fully up to speed on it myself yet, but from reading around the subject I gather it’s a great way to grab elements from web pages for further processing.

Step One: Wikipedia doesn’t like you…

Wikipedia doesn’t like this code:

[code language="Python" highlight="7"]
from bs4 import BeautifulSoup
import urllib2
wiki = "http://en.wikipedia.org/wiki/List_of_postcode_districts_in_the_United_Kingdom"
page = urllib2.urlopen(wiki)
soup = BeautifulSoup(page)
print soup
#urllib2.HTTPError: HTTP Error 403: Forbidden
[/code]

Wikipedia only allows access to recognised user agents in order to stop bots retrieving bulk content. I am not a bot, I just want to practise my Python, and so to get around this you just need to add a User-Agent header to the request (thanks to Stack Overflow for coming to the rescue).

Step Two: Hunt the table

If you look at the code behind the Wikipedia article, you’ll see that there are multiple tables but only one (thankfully the one we want) uses the “wikitable sortable” class – this is great as we can use BeautifulSoup to find the table with the “wikitable sortable” class and know that we will only get this table.

[code language="Python"]
from bs4 import BeautifulSoup
import urllib2
wiki = "http://en.wikipedia.org/wiki/List_of_postcode_districts_in_the_United_Kingdom"
header = {'User-Agent': 'Mozilla/5.0'} # Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki, headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

area = ""
district = ""
town = ""
county = ""
table = soup.find("table", {"class": "wikitable sortable"})
print table
[/code]

Output looks like this:

TableOutput

Great! This means that we just have the HTML table stored in our variable. Now, it’s just a case of iterating through the rows and columns…easy…*ahem*

Step Three: For your iteration pleasure

We need to do the iteration in two stages – the first stage is to iterate through each row (tr element) and then assign each element in the tr to a variable. At this stage, we will grab everything in the Postcode Districts column and store it in a list for further iteration later. To do this, I used the following code:

[code language="Python" firstline="19"]
for row in table.findAll("tr"):
    cells = row.findAll("td")
    # For each "tr", assign each "td" to a variable.
    if len(cells) == 4:
        area = cells[0].find(text=True)
        district = cells[1].findAll(text=True)
        town = cells[2].find(text=True)
        county = cells[3].find(text=True)
[/code]

The .findAll function returns a list, and so on line 20 we obtain a list containing four elements, one for each of the columns in the table. This means they can be accessed via the cells[n].find(text=True) syntax. You’ll notice that I’ve used .findAll for the Postcode Districts column; this is because I want a list of the items within the cell for iteration purposes later!
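
To make the difference concrete, here’s a minimal sketch (the sample HTML is made up) of what .find(text=True) and .findAll(text=True) each return for the same kind of cell:

[code language="Python"]
from bs4 import BeautifulSoup

# A made-up row mimicking a Postcode Districts cell with a line break in it
row = BeautifulSoup("<table><tr><td>AB10, AB11,<br/>AB12</td></tr></table>")
cell = row.td

print cell.find(text=True)    # u'AB10, AB11,' - just the first text node
print cell.findAll(text=True) # [u'AB10, AB11,', u'AB12'] - every text node
[/code]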

After this code executes, I have a value for the area, a list of districts, a town and a county. Now for the second part of my iteration:

[code language="Python" firstline="28"]
# district can be a list of lists, so we want to iterate through the top level lists first…
for x in range(len(district)):
    # For each list, split the string
    postcode_list = district[x].split(",")
    # For each item in the split list…
    for i in range(len(postcode_list)):
        # Check it's a postcode and not other text
        if (len(postcode_list[i]) > 2) and (len(postcode_list[i]) <= 5):
            # Strip out the "\n" that seems to be at the start of some postcodes
            write_to_file = area + "," + postcode_list[i].lstrip('\n').strip() + "," + town + "," + county + "\n"
            print write_to_file
[/code]

I found that, instead of district being a standard list of postcodes, in some cases it was a list of lists (oh joy!). I was expecting it to look like this:

[u'AB10, AB11, AB12, AB15, AB16, \nAB21, AB22, AB23, AB24, AB25, \nAB99, non-geo'] *

*Ignore the \n characters and the non-geo text – we’ll deal with them later!

I got this…

[u'AB10, AB11, AB12, AB15, AB16,', u'\nAB21, AB22, AB23, AB24, AB25,', u'\nAB99', u'non-geo']

And so I needed an additional layer of iteration: one for the whole list and then another for the items in the individual lists. Simple.

For each item in the list, the .split(",") function allowed me to split the comma-separated string of postcodes into a list that could be iterated over. For each item in that list, we just check to see if it’s a postcode (a check on string length sufficed nicely this time!) and then build up our output string. To deal with the \n that was prepended to some of the postcodes, I just left-stripped the string to remove the \n characters and hey presto, it worked!
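
Putting those pieces together, here’s a standalone sketch of the two-stage iteration using the AB sample list from above:

[code language="Python"]
# The list of strings we got back for the AB row (from the example above)
district = [u'AB10, AB11, AB12, AB15, AB16,',
            u'\nAB21, AB22, AB23, AB24, AB25,',
            u'\nAB99', u'non-geo']

for chunk in district:             # outer loop: each string in the list
    for item in chunk.split(","):  # inner loop: each comma separated value
        # Crude length check to filter out "non-geo" and empty strings
        if 2 < len(item) <= 5:
            print item.lstrip('\n').strip()
[/code]

This prints AB10 through AB99, one postcode per line, with the \n prefixes and the non-geo text filtered out.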

I flushed the output to a CSV file as well as to the screen and it worked beautifully!

Here is the full code:

[code language="Python"]
from bs4 import BeautifulSoup
import urllib2

wiki = "http://en.wikipedia.org/wiki/List_of_postcode_districts_in_the_United_Kingdom"
header = {'User-Agent': 'Mozilla/5.0'} # Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki, headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

area = ""
district = ""
town = ""
county = ""

table = soup.find("table", {"class": "wikitable sortable"})

f = open('output.csv', 'w')

for row in table.findAll("tr"):
    cells = row.findAll("td")
    # For each "tr", assign each "td" to a variable.
    if len(cells) == 4:
        area = cells[0].find(text=True)
        district = cells[1].findAll(text=True)
        town = cells[2].find(text=True)
        county = cells[3].find(text=True)

        # district can be a list of lists, so we want to iterate through the top level lists first…
        for x in range(len(district)):
            # For each list, split the string
            postcode_list = district[x].split(",")
            # For each item in the split list…
            for i in range(len(postcode_list)):
                # Check it's a postcode and not other text
                if (len(postcode_list[i]) > 2) and (len(postcode_list[i]) <= 5):
                    # Strip out the "\n" that seems to be at the start of some postcodes
                    write_to_file = area + "," + postcode_list[i].lstrip('\n').strip() + "," + town + "," + county + "\n"
                    print write_to_file
                    f.write(write_to_file)

f.close()
[/code]

Disclaimer(ish)

This code has no additional error checking or handling and was merely written to solve a small problem I had, and to put into practice everything I’d learned so far. It also only works for this particular table on the Wikipedia page – although it could be adapted for use on other tables. But it was great fun to put the learning into action and work on a real-life problem. Here’s to more exercises like this!

Self development fortnight

Well, it’s been a while since my last update and a lot has been happening with this blog behind the scenes (bye, bye 1&1 hosting and hello to the wonderful Squirrel Hosting) and with myself.

In short, I am going into hospital for an operation on 6th June, which is a little bit nerve-wracking, exciting (strange…but this op will vastly improve my quality of life) and unknown – I haven’t had an operation since my tonsils were removed when I was six and so I don’t know what to expect, I only have Holby City to go on! As a consequence, I have two weeks off work – time off beforehand for preparation & relaxation and then time off afterwards for rest & recuperation. I am putting my foot down now and saying that these two weeks will be for self development, learning new things and essentially NOT SITTING AROUND AND WATCHING DAYTIME TV (like I always find myself doing when I have the odd day off!)…except for Pointless, I love Pointless and so that’s my only exception.

So, I have created an account on Codecademy and am slowly making my way through the Python course. At the time of writing, I am up to lesson 8 on Loops and I am thoroughly enjoying it, it’s such a great way to learn a new language. I have written C++ and C# code in the past and so am not a complete beginner, but it’s great to start right at the beginning and learn a new language from scratch. It’s a bit of a revelation not to have to add a semi-colon at the end of a line of code…it feels a bit naughty!

When I have completed the Python course, I hope to undertake a small project of my own. My main motivation for learning Python is for screen-scraping and data extraction purposes and so I’d like to start a project to help me to gain experience in these areas. I will of course keep blogging about my progress and the new discoveries that I make.

The fortnight of self development starts now…

Week 5 & 6 – A topic of our own

For the final week of the MOOC, we have been given the task of producing an infographic of our own – this means choosing a topic, gathering the information and presenting an idea to show the information in graphic form.

As my previous sketches have been for interactive infographics, I wanted to give a static graphic a go. Having so much freedom was pretty hard – there is a wealth of information and data out there, and choosing which story to go for and what angle to take was tricky! It was lucky then that I got a tweet from the team behind the BBC iPlayer pointing me to the latest performance report, and that is when inspiration struck.

The BBC produce these performance reports every month and I read them with interest – I am a stats geek and love stuff like this. The report gives stats such as the viewing figures for content on iPlayer, popular programmes, usage by device type and the gender/age group of users. It’s a wealth of information that I find fascinating. But I also love it because it’s about the iPlayer – something I use for at least two hours a day and have a certain affection for. For non-UK residents, the iPlayer is a service that the BBC officially launched at the end of 2007 which allows viewers/listeners of BBC TV programmes/radio shows to replay missed content and to watch shows live via the internet. The iPlayer is available on PCs, tablets, mobile phones, via Smart TVs and via cable operators. In essence, it’s brilliant.

I am fairly certain that the report released by the BBC is not aimed at the typical iPlayer user – it feels more for those in the media or for those who have a specific interest in audience figures, and so my goal for the infographic was to produce something that everyone could appreciate. Luckily for me, October was a record month for iPlayer usage with 213 million requests for TV or radio content – breaking the 200 million request barrier for the first time – and so I had a nice little slant for my infographic. It also meant that the story had been picked up by the press too:

BBC iPlayer tops 200 million monthly requests for first time – Digital Spy

iPlayer passes 200 million monthly requests for the first time – Digital TV Europe

Merlin and Jimmy Savile documentary help BBC iPlayer to record month – The Telegraph

BBC enjoys record iPlayer requests in October – Cable.co.uk

…but no-one had produced an infographic, and so I felt it was my duty to produce one to celebrate!

My goals for the infographic were as follows:

  • Produce something for everyone – using the stats from the October performance report but make them easier to read and emphasise their relevance.
  • What were the most popular shows in October? Why did it break the 200 million request barrier in October and not, say, during the Olympics?
  • Who and what is using the iPlayer service? What proportion of requests are coming from tablets?
  • Make a static graphic that could serve as a template for every performance report so that non-industry readers could glean the key information more easily on one page, as opposed to trawling through the report.

And so with all of this in mind (and not a lot of time to complete the task – despite two weeks to work on it, December is a crazy busy time at work!), here is what I have come up with…

October 2012: A record iPlayer month for the BBC (PDF)

Notes about the graphic

  • This is a static graphic which uses the figures from the October 2012 iPlayer Performance report but could be used as a template for other monthly reports.
  • I extracted the information that I thought would be interesting such as iPlayer requests since 2009 (as far back as the report goes), the gender breakdown of users, the devices used to access the service and the popular TV and radio shows in October. I have also put a few stats in the blurb at the top.
  • The graphic style is largely similar to my last task with minimal use of colour – I stuck to pink as that is the predominant colour in the iPlayer branding.
  • If I had more time, I would have liked to explore the peaks and troughs around the end of 2010 and beginning of 2011. Do peaks relate to the release of iPlayer apps on mobile and tablet devices for example?
  • This graphic could be made interactive and this is a project I would like to work on in the future – especially to see the variation in the share of the device types – so watch this space! 🙂

I am pretty happy with this graphic but feel there are plenty more angles to explore with this data – but this is good as it gives me something to tinker with over the Christmas holidays. Now, do you think I’ve been good enough for Santa to bring me a copy of Adobe Illustrator?

Week 4: Interactive graphic based on US unemployment stats

Our goal this week was to think about what kind of interactive graphic we could create based on the data used in the Guardian’s piece about unemployment in the US -> http://www.guardian.co.uk/news/datablog/interactive/2011/sep/08/us-unemployment-obama-jobs-speech-state-map

There is a lot of data used behind the scenes of this graphic, which is great but is also slightly frustrating. For example, if you click on a particular state, you get a wealth of additional information – but it doesn’t allow you to easily compare it to other states. The same goes for the drop-down at the top of the graphic – it’s great that you can view the unemployment rate at a particular point in time, but it’s really hard to compare unless you are focussing on a particular state. I do however like the range of comparisons that have been made with the data, especially the ability to visualise the percentage point difference from the national figure – I shall have to remember that one in future 🙂

And so, I jotted down some thoughts about what I would like to see on an interactive graphic like this and came up with the following list:

  • The Guardian piece focuses on the unemployment rate in the US since Obama came to power…what about further back?
  • Is state level in-depth enough? What about within the state – how does the unemployment rate differ within the states themselves?
  • In the accompanying course material, we were told not to add more than 6 colours to a choropleth map (which makes total sense for comparison) but what about viewing a small list of those counties with the very lowest & highest unemployment rates that would normally be enclosed in ranges?
  • Based on feedback from last week’s assignment, I wanted to focus more on the type, colour and “interactiveness” of the graphic – this is definitely where I need more practice.

And so with all of this in mind, I scribbled down possible graph/map/info ideas and arranged them on the table (see last week’s post for an idea of how it looked!) and I came up with this:

Unemployment Rates in the US (PDF)

Unemployment Rates in the US – with notes (PDF)

Notes about the graphic

  • The user is able to scroll back in time to see how the unemployment data differs on the map of the US. I added a line graph so that it is clear to see the years when the unemployment rate was particularly high/low. I did think about adding an overlay to show the years that a new President came into power – incidentally, there does seem to be a trend of the unemployment rate dropping in the year this happens – but I did not progress along this line of investigation for this project. Maybe another time 🙂
  • The map at the top is interactive and allows the user to click on a particular county to see detailed information about it as well as the state in which it belongs. The small bar chart on the left would become active when a county is selected.
  • The user also has the ability to tick the boxes and add lines to the graph showing the county and state unemployment rates and compare them to the national figures.
  • I have taken on board comments from last week about colour, type and making it appear more interactive. It was VERY hard being so restrained with colour (I’m not used to this!) but I actually found working with Colorbrewer for the map colours gave me a base to start from and I didn’t stray from there.

I am really happy with this graphic and I didn’t rush as much as I did last week. I took my time, didn’t faff around with Illustrator too much and so had more time to concentrate on what I wanted to do and actually what I’d want to see on an interactive visualisation like this.

22 days left on my Illustrator trial…will I be adding it to my Christmas list (as well as Alberto Cairo’s book and Andy Kirk’s too)? YES!

Week 3: Sketch an interactive graphic

The goal for this week was to think about how an interactive graphic based on a particular report by Publish What You Fund, and also published in a Guardian blog, would look. The data in question relates to how transparent major donor organisations are with their own data and so each organisation has been rated using a distinct set of criteria created by Publish What You Fund, therefore producing an overall transparency index.

This assignment has really stretched me this week and made me take full advantage of the sketching/note-taking apps on my tablet as I found I was coming up with ideas in random places and needed to get them down for exploration.

My first task was to find out what the heck “transparency” actually meant and how it was measured, and I was thankful that the data originated from a very well organised website. I then looked at both source websites and noted down what I thought was missing and how I would like to play with the data myself. This took about three or four days – and this is where a lot of sketching and brainstorming came in; thinking of the “what ifs…” and “oooh, how about I just change this…” scenarios.

I toyed with the data in Excel to see if I could find any interesting correlations, such as splitting the data right down to individual indicators and looking at the annual resources and budget of each donor and in turn where the money goes, but what I was really missing was information about the donor itself. I was very pleased to see the UK’s Department for International Development at the top of the list but in all honesty I really knew nothing about them, and so I wanted to build that into the graphic.

And so I started by jotting down potential graphs/data to include in my final interactive graphic and started arranging the sketches until I had something that I thought could work. Incidentally, I find jotting things down on paper like this so helpful as you invest very little time in it and it allows easy rearranging of elements – paper prototypes FTW!

From there, I installed the trial version of Illustrator CS6 and started playing around. To cut a long story short (it really was a long story as I battled with Illustrator’s graphs – I won in the end though!) I came up with the following design:

Aid Transparency Graphic (PDF)                  

Aid Transparency Graphic + Notes (PDF)

Notes about the graphic

  • The bar chart that can be seen at the top of the graphic can be manipulated by the buttons on the right hand side and the user can select to show the results of individual aid information levels or all of them (the total).
  • The user can also select to show particular countries instead of having everything on the graph which I found really hard to read in the Guardian blog.
  • If a user clicks on a donor’s name or the bar associated with that donor, the panel at the bottom will display additional information about the organisation. I added a space for some text about the organisation to add a bit of context and also a timeline to chart their major accomplishments so that users would be able to relate to an organisation’s particular focus. Both pieces of information could be scraped from donors’ websites and annual reports.
  • I have tried to minimise the use of the word “transparency” and instead used “openness” where possible as I personally wasn’t very clear about what this meant at first.

I am personally really pleased with this, as the work involved way more than playing around with a few graphs. I had to think about what I wanted to say and how I was going to represent it in a prototype form that would communicate how an interactive version would work. But I’m doing something that I love and time did indeed fly when I was tinkering all weekend!

Week 2: A critique of the “Convention Word Counts” visualisation in the NYT

Source material: http://www.nytimes.com/interactive/2012/09/06/us/politics/convention-word-counts.html

A comparison of how often speakers at the two presidential nominating conventions used different words and phrases, based on an analysis of transcripts from the Federal News Service.

Although I very much like the look of this graphic at first glance, I feel it includes too much information and too many layers of abstraction, and hides the beauty of what is quite a high impact piece.

The main graphic serves two purposes:

1. It acts like a word cloud, illustrating the frequency of words by resizing the bubbles accordingly.

2. It shows how the usage of each word is split between the two parties.

Therefore it presents the reader with the ability to see that both parties have used the words ‘Tax’, ‘Energy’ and ‘Families’ in equal measure, but that the Democrats have used the word ‘Health’ more than the Republicans, though the Republicans themselves have used the word ‘Leadership’ more. The reader is clearly able to see this by comparing the size of the bubbles, and to identify the split in usage between the two parties.

However, I do feel that it presents the reader with too much information. I don’t think it is necessary for the numbers to be present in the bubbles as they serve as a distraction – the blue/red split in the bubble itself should be enough to allow someone to see the proportion of the word’s usage. The numbers themselves are also per 25,000 words, which bombard the reader with unnecessary information. Is the average reader really interested to know that the word ‘Health’ is used 38 times per 25,000 words by the Democrats vs. 9 times per 25,000 words by the Republicans? I’d hazard a guess and say “no”, but I think they are more interested in seeing that the Democrats used it more than the Republicans overall. But I do think the numbers are interesting and so maybe they should only be displayed when a bubble has been clicked on.

I feel the descriptive text placed below the bubbles does not need to be present the whole time. It’s taking up room and in actual fact I didn’t even bother to read it when I was playing around. Also, while I love the ability to add your own words to the collection, it does allow you to add words (e.g. “UK”) that have no mentions on either side – I personally didn’t find this very interesting and found that the zero-mention words cluttered up the visualisation.

I think the visualisation would benefit from altering the shade of blue/red depending on where the bubble is located. If for example we take the word ‘Forward’, which is far over on the Democrats’ side, I think the blue should be a lot darker than for a word such as ‘Success’, which is more prominently on the Republicans’ side. This would help to reinforce the fact that there are two extremes to the viz and a middle ground shared by both parties.

Another level of abstraction I would love to see would be the ability to see who said the word and how many times. This data is shown below the bubbles but it’s not really used effectively. How fascinating would it be to see how many times Mitt Romney said Obama (and vice versa) without having to count it up yourself?!? Then if you clicked on a bar in this chart, it would take you down to that person’s section to show a breakdown of the paragraphs.

And finally, while it’s not a criticism of the visualisation itself, I am an avid follower of the Guardian Data Blog and am used to seeing a link where I can browse and download the raw data. I’m not sure what the NYT’s policy is about this, but I think the visualisation would benefit from a link at the bottom so that data geeks like us can tinker with the data ourselves.

And so here is my rough sketch to show how I would tidy it up with the main changes listed below:

BEFORE                                                                                      AFTER

NTY_Adele_Sketch

  • I have removed the figures inside the bubbles that displayed the share per 25,000 words to give the graphic a cleaner finish.
  • I have removed the descriptive text that relates to particular bubbles.
  • I have added the percentage share of the word used by each party at the top.
  • I have added two graphs below the main graphic that will be displayed when a bubble has been clicked on. They will show who in each party has used the word and their proportional usage as a whole per party.
  • If you were to click on the person’s name or a particular bar on the graph, it would take you straight down to that person’s section in the blurb below.

Week 1: Introduction to Infographics and Data Visualization course

About a month ago, I signed up to a new MOOC offered by the Knight Center for Journalism. The course is run by Alberto Cairo and is exactly the sort of course I’ve been after for a while. As an aside, Higher Education institutions in the US seem to be way ahead of the game when it comes to MOOCs; I have completed courses via Coursera in the past and they have been fabulous.

Anyway, as part of my week 1 assignment, I have been asked to critique and discuss with fellow students the following graphic:

Week 1: Social Web Involvement

As with everything, it is so much easier to critique the work of other people, and I realise that I will have fallen into some common traps when creating my own graphics. Even viewing the first week’s lectures made me cringe at the screen as I am guilty of a lot of them. But that’s why we go on courses, right? The most important point I learned was to stop thinking like a designer and think like a reader – does the graphic convey its point within three seconds?

Do I think the graphic satisfies this? No, not really. If I’m honest, for the first three seconds it did grab my attention and if I’d seen that in a newspaper I probably would have stayed on the page and wanted to explore it. However, it’s only when I delved a bit deeper that I realised how difficult it was to decipher.

So what does the designer want me to do with the graphic? Well, it’s definitely not being used for a geography lesson, as the label for the UK is far up in the North Sea and Germany is over in Russia. The only real clue as to what the graphic is about is the title, “Social Web Involvement”, and the rubric in the bottom right hand side attempts to describe it. However, it is not clear what the sizes of the donut charts mean, nor what the millions of users in the tables mean. We are told that 32,000 people were interviewed (2,000 per country) and so where the heck have these figures quoted in millions come from?!?

The graphic does present several variables, but it does not present them very clearly. The variables come in the form of the countries and each type of social web involvement. The graphic does not make comparison or organisation easy, or show correlations, as the data is bunched by country – it is not easy to find, for example, the country with the second highest proportion of people writing their own blogs. The details are scattered around the graphic and a reader has to memorise the data in order to compare it. There is a lot of redundant grey space on the graphic which could have been used more efficiently.

I would improve the graphic by removing the map completely and just concentrating on the data. The data we have can also be sliced and diced with other data, for example the GDP of the country or the population (Internet users per 100,000 people), or we could look at other methods of social web involvement – why are there no games mentioned when, according to Wikipedia, there were 10.3 million players of World of Warcraft in May 2011? As a start, I would go back to basics and just ask the question: how does social web involvement vary from country to country?

Let’s take the UK as an example (edited a wee bit to fit):

UK

We could represent this data in a bar chart that could easily be compared to other countries. Of course, sixteen of these graphs may look a little cluttered but I think this is a step in the right direction.

excel_compare_1

If we wanted to develop this idea and wanted to just focus on one particular aspect of social web involvement, all categories could be greyed out and just one particular item focused upon.

excel_compare_2

These alternatives are most definitely still a work in progress but I think immediately they are a lot clearer to read and allow the data to be compared and organised a lot more easily. Having read the forum for the course, I see that someone else has had a similar idea and created a stunning graphic -> http://www.flickr.com/photos/89317425@N05/8133822514/ so it’s good that I was thinking along similar lines and hope to keep developing my skills as the course progresses.

Fixing the root cause of ORA-12519: TNS:no appropriate service handler found

Since about the middle of last week, the Oracle server in our dev environment had been reporting “ORA-12519: TNS:no appropriate service handler found” intermittently. Not having much time to look at it (and not being much of an Oracle DBA), I found restarting the Oracle server made it go away for a day or so. By Tuesday though, it had started to get really annoying for everyone, so I had no choice but to cancel my meetings and take a proper look at the problem.

A quick search on Google revealed that it was most likely due to Oracle reaching its maximum number of processes. Most posts tended to just suggest increasing the number of processes without covering why you might suddenly be running out of them. As this had only just started happening and the load in our dev environment is meant to be very light, I really wanted to avoid just upping the number of processes as that would just be hiding the problem. I wanted to find and fix the root cause.

So a bit more searching around, and I managed to find these useful queries.

First, you need to be able to connect to your database either using the sys account or by logging into your Oracle box and using a direct SQL Plus connection as the Oracle user:

sqlplus / as sysdba

If you’re still getting errors about no available processes, then you will have to manually kill one of the Oracle processes using the kill command.

Once logged in, you can use these two queries to find out how many processes and sessions there currently are:

[code language="sql"]
select count(*) from v$process;
select count(*) from v$session;
[/code]

For 11g, the default maximum number of processes is 150, so you should get 149 back (I don’t know why it’s out by one). Once you’ve confirmed that you have definitely reached the maximum number of processes, you can use this query to see what they all are and what they’re currently doing if they’re active:

[code language="sql"]
SELECT sess.process, sess.status, sess.username, sess.schemaname, sql.sql_text
FROM v$session sess,
     v$sql sql
WHERE sql.sql_id(+) = sess.sql_id;
[/code]

The above query is adapted from this Stack Overflow question and answer.

In the case of my failing server, this revealed that one of our QA testers was checking his load test script against our dev server! Every time he ran the script, it consumed all the available processes and locked everyone else out! A quick word with the tester (and one final Oracle restart to flush out all his sessions) and normality was restored without needing to mess about with any Oracle system settings.
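
If this ever becomes a recurring problem, a small watcher script can catch the culprit before the processes run out. Here’s a rough sketch using the cx_Oracle module – the connection string, threshold and polling interval are all placeholders, not my actual setup:

[code language="Python"]
# Rough sketch: poll the process count and dump the active sessions
# when it gets close to the cap. Connection details are placeholders.
import time
import cx_Oracle

conn = cx_Oracle.connect('sys/password@devbox/ORCL', mode=cx_Oracle.SYSDBA)
cursor = conn.cursor()

while True:
    cursor.execute('select count(*) from v$process')
    count = cursor.fetchone()[0]
    print '%d processes in use' % count
    if count > 140:  # getting close to the default cap of 150
        cursor.execute("""SELECT sess.process, sess.username, sess.schemaname
                          FROM v$session sess
                          WHERE sess.status = 'ACTIVE'""")
        for row in cursor.fetchall():
            print row
    time.sleep(60)
[/code]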

C++ Unit Testing

Unit testing in managed languages such as Java and C# is easy. The general theory is that you create a Test annotation and apply that to your test functions. Next, you create a series of assert functions that throw an AssertFailed exception if an assertion doesn’t match what you expect. Finally, create a test runner that scans your assemblies or JARs using reflection for functions marked with your Test annotation and invoke them. The test runner just needs to catch any exceptions thrown by the test functions and report them somewhere. The test runner doesn’t have to care too much if the exception thrown is a failed assert or something more serious such as a null pointer exception, it can handle both of them in pretty much the same way. Tools such as NUnit or TestNG will provide all this for you so you will very rarely ever need to write any of this yourself.

With C++, things aren’t quite so easy. There isn’t really any form of annotations or reflection so discovering tests is harder. You do have exceptions, but you might not be able to use them due to the environment you are using and you don’t get call stacks with them either. And anyway, you could just get a fatal crash deep in the code you’re testing before you ever get the chance to throw an exception from one of your assertions.

This doesn’t mean that you can’t get C++ unit testing frameworks with a similar level of functionality to the ones for managed languages – Google Test is a pretty good one, for example, and CppUnitLite2 is another very portable framework. I want to take a look at how a C++ unit testing framework could be implemented as I find it an interesting problem.

Goals

  • Easy to implement test functions that can be discovered by the test runner.
  • Assert functions that will tell me what test has failed along with a call stack.
  • Fatal application crashes won’t kill the test runner but are reported as failed tests along with a call stack.
  • Possible to plug into a continuous integration build server so that it can evaluate whether a build is stable or not.

For my example framework, I’ll only be targeting Unix type platforms (Linux, Mac OS) as the methods I’ll be using are cleaner to implement, making it easier to explain the theory. This also allows me to provide a sample that will work on Ideone so you can have a play with the framework and see it running without needing to download any code.

The framework I present here takes its inspiration from Google Test so I highly recommend taking a look at that.

The Sample Framework

You can try out my sample framework on Ideone. Due to Ideone being primarily a code fiddle site to try out ideas, all your code must live in a single source file so don’t judge the structure too harshly! Normally you would separate everything out a bit and have clear interfaces between the test runner and your tests.

Test Function Registration

This is achieved by defining a macro to generate the test function declaration. The macro also creates a static object that contains the test function details and registers itself with the test runner in its constructor. The test function details contain the name of the function, the source file, the line number and a pointer to the function to execute. They can then be stored in a simple linked list for the test runner to iterate over when it comes to run the tests. By using static objects, we can ensure that all our tests are registered automatically before main() is executed, saving us the need to explicitly call a set up function containing a list of all our test functions that would need to be maintained as new tests are added.

Test Reference Class

[code language="cpp" firstline="36"]
//-----------------------------------------------------------------------------------------//
// Class for storing reference details for a test function
// Test references are stored in a simple linked list
//-----------------------------------------------------------------------------------------//
// Type def for test function pointer
typedef void (*pfnTestFunc)(void);

// Class to store test reference data
class TestRef
{
public:
    TestRef(const char * testName, const char * filename, int lineNumber, pfnTestFunc func)
    {
        function = func;
        name = testName;
        module = filename;
        line = lineNumber;
        next = NULL;

        // Register this test function to be run by the main process
        registerTest(this);
    }

    pfnTestFunc function;  // Pointer to test function
    const char * name;     // Test name
    const char * module;   // Module name
    int line;              // Module line number
    TestRef * next;        // Pointer to next test reference in the linked list
};

// Linked list to store test references
static TestRef * s_FirstTest = NULL;
static TestRef * s_LastTest = NULL;
[/code]

This is a pretty simple class as it doesn’t need to do much more than register itself. In my sample, registerTest() is a global function that just adds the object to the linked list.

[code language="cpp" firstline="206"]
// Add a new test to the linked list
void registerTest(TestRef * test)
{
    if (s_FirstTest == NULL)
    {
        s_FirstTest = test;
        s_LastTest = test;
    }
    else
    {
        s_LastTest->next = test;
        s_LastTest = test;
    }
}
[/code]

Test Registration Macro

[code language="cpp" firstline="22"]
// Macro to register a test function
#define TEST(name) \
    static void name(); \
    static TestRef s_Test_ ## name(#name, __FILE__, __LINE__, name); \
    static void name()
[/code]

It simply declares the test function prototype, constructs a static test reference object passing the function pointer into the constructor and then declares the first line of the function implementation. Here’s an example of using it:

[code language="cpp"]
TEST(MyTest)
{
    // Your test implementation…
}
[/code]

When the macro is expanded by the preprocessor, it effectively becomes:

[code language="cpp"]
static void MyTest();
static TestRef s_Test_MyTest("MyTest", "example.cpp", 1, MyTest);
static void MyTest()
{
    // Your test implementation…
}
[/code]

I’ve inserted line breaks to make it easier to read.

Test Execution

This isn’t the exact code I’ve used in my sample, but it’s doing pretty much the same thing.

[code language="cpp" light="true"]
TestRef * test = s_FirstTest;
while (test != NULL)
{
    test->function();
    // Report success or failure…
    test = test->next;
}
[/code]

Assert Function

In my sample, I’ve just used an assert macro similar to one you’re probably already using in your own code.

[code language="cpp" firstline="14"]
// Assert macro for tests
#define TESTASSERT(cond) \
    do { \
        if (!(cond)) { \
            assertHandler(__FILE__, __LINE__, #cond); \
        } \
    } while (0);
[/code]

If the assert condition fails, it turns it into a string and passes it along with the current file and line number into an assert handler function to actually report the failure.

This actually isn’t the best example for a unit testing framework as it’s really only testing for a true condition. If you were developing a fully featured framework, you would probably want more assert functions along the lines of ASSERT_EQUALS(actual,expected) and ASSERT_NOTEQUALS(actual,notexpected) so that you can report how the actual result from a test differs from what was expected. Implementing these types of functions isn’t too hard, so I won’t dwell on that now.

Assert Handler

[code language="cpp" firstline="242"]
// Handler for failed asserts
void assertHandler(const char * file, int line, const char * message)
{
    fprintf(stdout, "\nAssert failed (%s:%d):\n", file, line);
    fprintf(stdout, "  %s\n", message);
    fflush(stdout);

    dumpStack(1);

    _exit(1);
}
[/code]

The function reports the location of the failed assert along with the failed condition before dumping a stack trace and exiting. The reason for calling exit is that my framework actually runs tests in a child process separate to the test runner (more on that later). This is also why I’ve used fprintf with the stdout file handle rather than just using printf(). The child and parent processes actually share the same file handles, so I need to be explicit about where my output is going and when buffers are flushed so that I don’t get overlapping test output.

Dumping the Call Stack

For this, I’ve used a feature of glibc which is one of the reasons my sample is written for *nix.

[code language="cpp" firstline="221"]
// Dump a stack trace to stdout
// backtrace() and backtrace_symbols_fd() come from glibc's <execinfo.h>
void dumpStack(int topFunctionsToSkip)
{
    topFunctionsToSkip += 1; // We always want to skip this dumpStack() function
    void * array[64];
    size_t size = backtrace(array, 64);
    backtrace_symbols_fd(array + topFunctionsToSkip, size - topFunctionsToSkip, 1); // Adjust the array pointer to skip n elements at top of stack
}
[/code]

I provide the ability to skip a number of calls at the top of the stack so that the assert and stack dumping functions aren’t reported in the call stack. The call stack is then written to stdout directly.

The function backtrace_symbols_fd() will attempt to resolve function symbols when it outputs the stack trace but it can be a bit hit or miss with getting the names and will be affected by optimisation level. For the most likely chance to get symbols out, you need to compile with the -g option and link with -rdynamic if using gcc. When I compile and run the sample on my Raspberry Pi, I get the following call stack for a failed assert:

Assert failed (main.cpp:89):
    1 == false
./a.out[0x8c44]
./a.out(_Z7runTestP7TestRef+0x70)[0x8ed8]
./a.out(main+0xb8)[0x8d40]
/lib/libc.so.6(__libc_start_main+0x11c)[0x403cc538]

As you can see, it’s managed to find the symbols for some functions but not the one at the very top of the call stack which is where our assert failed. Fortunately, we can use the addr2line tool to look up this function:

pi@raspberrypi:~/Devs/cunit_sample$ addr2line -e a.out -f -s 0x8c44
TestAssertFailed
main.cpp:90

Calling addr2line can become quite tedious, so if you find yourself needing to do this regularly, you might find it worth writing a script (e.g. in Python) to feed stack traces into addr2line; it’s something I’ve done in the past.
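
To give an idea of what that might look like, here’s a minimal sketch that reads a stack trace like the one above from stdin – the binary[0xaddress] line format is an assumption based on my Raspberry Pi output:

[code language="Python"]
# Minimal sketch: resolve each "binary[0xaddress]" line of a stack trace
# by shelling out to addr2line. Assumes lines like "./a.out[0x8c44]".
import re
import subprocess
import sys

ADDR_RE = re.compile(r'(?P<binary>[^\s(\[]+).*\[(?P<addr>0x[0-9a-fA-F]+)\]')

for line in sys.stdin:
    match = ADDR_RE.search(line)
    if match is None:
        continue
    output = subprocess.check_output(
        ['addr2line', '-e', match.group('binary'), '-f', '-s', match.group('addr')])
    function, location = output.strip().split('\n')
    print '%s at %s' % (function, location)
[/code]

You could then pipe the test runner’s output straight through it rather than resolving each address by hand.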

Sample Test Function

[code language="cpp" firstline="74"]
TEST(TestPass1)
{
    int x = 1;
    int y = 2;
    int z = x + y;
    TESTASSERT(z == 3);
}
[/code]

Nothing too shocking there and hopefully very easy to implement.

Handling Fatal Crashes

The first C++ testing framework I wrote had all the tests running in the same process. If everything was working, this wouldn’t be a problem as all tests would pass without incident. However, if there was a fatal crash (e.g. attempting to use a null pointer), the entire test application would crash, halting all the tests and making it very difficult to assess the overall code health. This can be resolved with signal handlers that wait for crash conditions and attempt to gracefully clean up so that the test runner can keep on running. However, I still ran into bugs that could screw up the stack or heap in fatal ways, leaving me no better off in these situations.

In this sample framework, I’ve borrowed an idea from Google Chrome in that I run each test in its own process. This way a test can mess up its own process as much as it wants and it’s completely isolated from any of the other tests. It also enforces good practice with your tests as you can’t have one test depending on the side effects of another test. Each test is completely independent and can be guaranteed to run in any order which makes them much easier to debug. In addition, it makes my crash handling code much simpler as I don’t need to do any more than report the error and exit the process. Simpler code is good in my opinion.

Signal Handler

[code language="cpp" firstline="230"]
// Handler for exception signals
void crashHandler(int sig)
{
    fprintf(stdout, "\nGot signal %d in crash handler\n", sig);
    fflush(stdout);

    dumpStack(1);

    _exit(1);
}
[/code]

The handler uses the same stack dumping code as the assert handler and exits with a non-zero exit code to notify the parent test runner application that the test has failed.

The handler is registered with the following code in main():

[code language="cpp" firstline="104"]
// Register crash handlers
signal(SIGFPE, crashHandler);
signal(SIGILL, crashHandler);
signal(SIGSEGV, crashHandler);
[/code]

Here, I’ve used the antiquated signal() interface when really, I should be using sigaction(). I’ve probably not registered all the signals that could indicate a fatal code bug either. This is something I may address in the future, but for now it provides a simple example of what I’m trying to achieve.

Spawning the Child Test Process

For simplicity, my test runner just forks itself as that’s one of the easiest ways to launch a child process on *nix. It also has the advantage of not needing to do much configuration in the child process in order to run the test.

I’ve wrapped the forking and running of a single test in a function to keep all the logic in one place:

[code language="cpp" firstline="156"]
// Wrapper function to run the test in a child process
bool runTest(TestRef * test)
{
    // Fork the process, the test will actually be run by the child process
    pid_t pid = fork();

    switch (pid)
    {
    case -1:
        fprintf(stderr, "Failed to spawn child process, %d\n", errno);
        exit(1); // No point running any further tests

    case 0:
        // We're in the child process so run the test
        test->function();
        exit(0); // Test passed, so exit the child with a success code

    default: {
        // Parent process, wait for the child to exit
        int stat_val;
        pid_t child_pid = wait(&stat_val);

        if (WIFEXITED(stat_val))
        {
            // Child exited normally so check the return code
            if (WEXITSTATUS(stat_val) == 0)
            {
                // Test passed
                return true;
            }
            else
            {
                // Test failed
                return false;
            }
        }
        else
        {
            // Child process crashed in a way we couldn't handle!
            fprintf(stdout, "Child exited abnormally!\n");
            return false;
        }

        break; }
    }
}
[/code]

After the process is forked, the child process calls the test function referenced in the passed in TestRef object. If the function completes without incident, the child exits with a zero exit code to indicate success. The parent process waits for the child process to exit and then logs success or failure of the test based on the exit code of the child process.

The main test runner loop is:

[code language="cpp" firstline="111"]
int testCount = 0;
int testPasses = 0;

// Loop round all the tests in the linked list
TestRef * test = s_FirstTest;
while (test != NULL)
{
    // Print out the name of the test we're about to run
    fprintf(stdout, "%s:%s... ", test->module, test->name);
    fflush(stdout);

    testCount++;

    bool passed = runTest(test);
    if (passed == true)
    {
        testPasses++;
        fprintf(stdout, "Ok\n");
    }
    else
    {
        fprintf(stdout, "FAILED\n");
    }

    // Get the next test and loop again
    test = test->next;
}
[/code]

Plugging Into Build Server

This is just a case of following the Unix principle of your process returning 0 if it’s happy and non-zero if not. In my main() function, I keep a count of the number of tests run and the number of tests passed. I then have the following at the end of main():

[code language="cpp" firstline="139"]
// Print out final report
int exitCode;
if (testPasses == testCount)
{
    fprintf(stdout, "\n*** TEST SUCCESS ***\n");
    exitCode = 0;
}
else
{
    fprintf(stdout, "\n*** TEST FAILED ***\n");
    exitCode = 1;
}
fprintf(stdout, "%d/%d Tests Passed\n", testPasses, testCount);

return exitCode;
[/code]

Pretty much every build server has the ability to launch external processes as part of a build and report a build failure if that process doesn’t exit with a zero code. It’s just a case of building your test framework as part of your normal build process and then executing it as a post build step. Everyone should be doing it!
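
As a trivial illustration, a post build step needn’t be anything more than this sketch, where ./unit_tests is a placeholder for your compiled test runner:

[code language="Python"]
# Run the test binary and propagate its exit code so the build
# server marks the build as failed when any test fails.
# './unit_tests' is a placeholder for your compiled test runner.
import subprocess
import sys

sys.exit(subprocess.call(['./unit_tests']))
[/code]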

Other Platforms and Future Improvements

As I mentioned earlier, this sample will only work on *nix platforms. However with a bit of work, most of these ideas can be ported to other platforms.

Call Stack Dumping

Although there is no standard way to get a stack trace, it’s been possible on every platform I’ve used so far – some being easier than others, though.

For Windows, here’s one example. There’s also the CaptureStackBackTrace() function in the Windows API.

Fatal Exception Handling

I’ve already mentioned that I should switch to using sigaction() rather than signal() for registering my crash handlers.

On Windows, you could use Structured Exception Handling (SEH) to detect access violations and other fatal errors. Here’s a Stack Overflow question that covers some of the pros and cons of SEH. This is something that’s always going to be very platform specific so you may have to research this for yourself if you’re using something a bit more esoteric.

Child Process Spawning

This is one area I could put a lot more effort in. Currently, I’m only using fork() which isn’t available on all platforms and only gives me limited control over the child process. If instead I launched the child processes as completely separate processes that I attached to using stdout/stderr and specified which test to run using command line arguments, I’d have a much more portable solution. It would make debugging individual tests much easier as I could launch the test process directly from my debugger without needing to run through a complete test cycle. This would also give me more options over how I implemented my test runner as I could develop a GUI application in a completely different language if I wanted or implement distributed tests across multiple machines if my test cycle could take a long time. Finally, reading test output such as failed assertions and call stacks from stdout of the child process rather than letting the child write directly to stdout of the parent process would allow the test runner to present the output in a much nicer way or redirect it to a separate log file that only contained info about failed tests.

If I were to develop this sample further, this is an area I would certainly put more effort into.

More Restrictive Platforms

A few platforms I’ve worked on have only supported running a single process at a time. Launching another process results in the current running process being unloaded and completely replaced by the child process. This makes running tests in a background child process completely impossible. In these situations, I’d have the runTest() function run the test directly in the current process. The assert and crash handlers would also need to be updated to return control to the test runner in the case of a test failure. Your best bet would be to use normal C++ exceptions for this, but if you really don’t want to use them, you could use setjmp()/longjmp(). Whichever way you go, fatal application crashes are likely to halt your test cycles early.

If possible, I’d try to get the code I was testing to also compile on another platform such as Windows or Linux and perform most of my testing there. If you get to the point where all your tests are passing, running the tests on your target platform should just be a formality to make sure the platform specific parts of your code are also working.

Before/After Method Events

Something that I haven’t implemented in this sample but would be very easy to add would be before and after method events so that common set up and tear down code could be automatically called by the test runner. This is a standard feature of just about every other framework, so I wouldn’t consider a framework I wrote complete without it.

Classical Objects in JavaScript

I’ve recently found myself needing to do more with JavaScript than I have in the past and I thought a good way to understand some of the more interesting features of the language would be to attempt to implement Java/C# style classes and inheritance. This has obviously been done before in various JavaScript libraries and frameworks such as Mootools, but I wanted to actually understand what was going on. With that in mind, this post is not about the best way to implement classes; it’s simply about what is happening when you do.

Java vs JavaScript

The big challenge in trying to copy what Java does is that Java and JavaScript are two very different languages. Java is a classic OOP language and all the features I’m trying to emulate are a core part of it. JavaScript has objects, but they’re little more than hash tables and everything is dynamically typed.

My ultimate goal is to implement an object hierarchy in JavaScript such as this Java sample. The features I’m aiming to emulate are:

  • Member functions
  • Public and private member variables
  • Public and private statics
  • Class extending

JavaScript Objects

And to get stuck in, here’s the JavaScript implementation I came up with:

[code language="javascript"]
ObjectA = (function(){

    var privateStatic = "private static";

    function make(x, y) {
        this.publicMember = x;
        var privateMember = y;

        this.setPrivate = function(v) {
            privateMember = v;
        }
        this.getPrivate = function() {
            return privateMember;
        }

        this.calc = function() {
            return this.publicMember + privateMember;
        }
    };

    make.prototype.setPublicStatic = function(v) {
        make.prototype.publicStatic = v;
    };

    make.prototype.setPrivateStatic = function(v) {
        privateStatic = v;
    };
    make.prototype.getPrivateStatic = function() {
        return privateStatic;
    };

    make.prototype.publicStatic = "public static";

    return make;

})();

ObjectB = (function(){

    function make(x, y, z) {
        ObjectA.call(this, x, y);
        var privateSubMember = z;

        var __calc = this.calc;
        this.calc = function() {
            return (__calc.call(this) * z) + this.publicMember;
        }
    }
    make.prototype = new ObjectA();
    make.prototype.constructor = make;

    return make;

})();
[/code]

I’ve created a jsFiddle for the above along with some tests here. The first thing I should point out is that although I achieved my goals, I haven’t necessarily done it in the best way. The public static variable is almost certainly not done well, but I’ll go into that later.

Constructors

When implementing objects like this, the classes are actually functions and those functions also double up as the constructor. When you call new MyObject(), the interpreter allocates a new empty object and then passes it to the function you’ve specified for initialisation. In the code above, I’ve actually declared my constructors as functions called make() within an anonymous function which returns the constructor.

Member Functions

This is probably the easiest thing to achieve as functions are first-class objects in JavaScript. This means that they can be passed around as data and assigned to other objects in the same way as any other data type. In my above code, I attach functions to my objects in two different ways. The first is by assigning it within the constructor:

[code language="javascript" highlight="4,5,6"]
MyObject = (function(){

    function make() {
        this.doStuff = function() {
            // Stuff happens here…
        }
    }

    return make;

})();
[/code]

This approach will allocate a new instance of the function and store it directly within the constructed object each time the constructor is called. This may seem wasteful, but it’s the only way to access private members within the object.

The second way of defining member functions is to assign the function to the constructor’s prototype:

[code language=”javascript” highlight=”6,7,8″]
MyObject = (function(){

    function make() {
    }

    make.prototype.doStuff = function() {
        // Do stuff here…
    };

    return make;

})();
[/code]

The prototype member is a dictionary/hash table that is shared by all instances of the same object type. If a member isn’t found directly within the object instance, the prototype will be searched. Assigning functions to the prototype will cause only one instance of the function to be allocated per class, but these functions are only able to access public members of your object.
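The sharing works the other way around for prototype functions; using the second MyObject definition above:

[code language=”javascript”]
var a = new MyObject();
var b = new MyObject();
document.write(a.doStuff === b.doStuff); // true – both resolve to the single function on the prototype
[/code]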

If you look back at my sample JavaScript above, you can see where I’ve used the two different types of function definition where I’m accessing variables with different access levels.

Public Member Variables

These are defined by simply assigning them to the this object within the constructor:

[code language=”javascript” highlight=”4″]
MyObject = (function(){

    function make() {
        this.member = 123; // public member – accessible from anywhere
    }

    return make;

})();
[/code]

The JavaScript language has no syntax for public or private access levels so all members of an object are publicly accessible. As such, there’s not much more to say about public members.

Private Member Variables

This is where things start to get interesting. As I mentioned previously, JavaScript has no concept of public or private variables, so we have to fake it using closures. Closures are functions that are able to access variables that, while not global, are accessible within the scope where the function was originally defined. That probably didn’t make much sense, so here’s an example:

[code language=”javascript” highlight=”4,5,6,7,8″]
MyObject = (function(){

    function make() {
        var private = 123; // local to the constructor, so effectively private

        this.doStuff = function() { // a closure over the constructor’s scope
            return private * 2;
        };
    }

    return make;

})();
[/code]

The variable private has been defined locally within the constructor, which would normally mean it couldn’t be accessed from outside the constructor; it’s private, if you will. However, the function doStuff() was also defined within the constructor and so has access to the same scope, including the local variable private.
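A couple of lines show the effect (a sketch using the MyObject above):

[code language=”javascript”]
var o = new MyObject();
document.write(o.private);   // undefined – the local variable is not a member of the object
document.write(o.doStuff()); // 246 – but the closure can still reach it
[/code]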

Public Static Variables

In the code above, I’ve implemented public statics by assigning them to the constructor’s prototype. I did this so that I could access the statics in a syntactically similar way to Java. However, it has one very big problem. If I did the following:

[code language=”javascript”]
var o = new ObjectA();
o.publicStatic = "New Value";
[/code]

The static variable wouldn’t be updated; instead, a new member called “publicStatic” would be created on the object instance o, shadowing the value on the prototype.
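Here’s a short sketch of that shadowing in action, using the ObjectA from the top of the post:

[code language=”javascript”]
var a = new ObjectA(1, 2);
var b = new ObjectA(3, 4);

a.publicStatic = "New Value";

document.write(a.publicStatic); // "New Value" – found on the instance itself
document.write(b.publicStatic); // "public static" – still resolved from the shared prototype
[/code]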

A much better way to implement statics would have been for me to assign them as members of the constructor itself:

[code language=”javascript” highlight=”6,12″]
MyObject = (function(){

    function make() {
    }

    make.static = 123; // the static lives on the constructor function itself

    return make;

})();

document.write(MyObject.static);
[/code]

Each time I wanted to access the static, I would have to explicitly reference the class that the static belongs to. That isn’t a requirement in Java or C#, but it makes it completely clear what I’m doing, which is no bad thing.
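As a quick illustration of that explicit style (a sketch, continuing the MyObject above):

[code language=”javascript”]
MyObject.static = 456;           // there is exactly one copy, on the constructor function
document.write(MyObject.static); // 456 – every reader of the static sees the change
[/code]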

Private Static Variables

Back to closures again to simulate private variables. This time the static variables are local variables in the anonymous function scope.

[code language=”javascript” highlight=”3,8,9,10″]
MyObject = (function(){

    var private = 123; // shared by all instances, hidden in the anonymous function’s scope

    function make() {
    }

    make.prototype.getPrivate = function() {
        return private;
    };

    return make;

})();
[/code]

This is the whole reason for using anonymous functions to define classes. If you didn’t need private statics, you could implement everything in this post without the anonymous function wrappers. Depending on how you feel, that might be a good thing or a bad thing.
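For example, if you only need public members and prototype functions, a plain constructor works just as well – a sketch, not code from the jsFiddle:

[code language=”javascript”]
function MyObject() {
    this.member = 123; // public member
}

MyObject.prototype.doStuff = function() {
    return this.member * 2; // prototype function – no anonymous wrapper needed
};

var o = new MyObject();
document.write(o.doStuff()); // 246
[/code]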

Class Extending

By setting the constructor’s prototype to a new instance of the class you want to extend, you are able to simulate inheritance as you know it in other languages. Because the prototype is searched whenever a requested member isn’t found on the object instance being referenced, this has the effect of exposing all the super class members to the new class.

[code language=”javascript” highlight=”16,17″]
ObjectA = (function(){
    function make() {
    }

    make.prototype.hello = function() {
        document.write("Hello!");
    };

    return make;
})();

ObjectB = (function(){
    function make() {
    }

    make.prototype = new ObjectA();
    make.prototype.constructor = make;

    return make;
})();

var o = new ObjectB();
o.hello();
[/code]

By creating a new instance of ObjectA and assigning it to the prototype, we can add new items to the prototype without affecting the original ObjectA class or any other classes that extend it. However, as we’ve replaced our original prototype with a copy from another class, the constructor property needs to be set back to the correct function again. constructor is a standard property that contains a reference to the function used to construct the object, so we should always make sure it points to the right one.

Obviously, any additions you want to make to the prototype need to happen after you’ve set the super class!
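You can check the result with a couple of tests (a sketch against the ObjectA/ObjectB pair above):

[code language=”javascript”]
var o = new ObjectB();
document.write(o.constructor === ObjectB); // true – but only because we reset it ourselves
document.write(o instanceof ObjectA);      // true – the prototype chain still reaches ObjectA
[/code]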

Calling the Super Class Constructor

Generally, you should also call the super class’s constructor so that it is set up correctly in your new objects. In Java and C#, the default constructor is called for you automatically, so you often don’t need to worry about this, but in JavaScript you must always do it explicitly.

[code language=”javascript” highlight=”15″]
ObjectA = (function(){
    function make(message) {
        this.message = message;
    }

    make.prototype.hello = function() {
        document.write(this.message);
    };

    return make;
})();

ObjectB = (function(){
    function make() {
        ObjectA.call(this, "Hello World!");
    }

    make.prototype = new ObjectA();
    make.prototype.constructor = make;

    return make;
})();

var o = new ObjectB();
o.hello();
[/code]

You’ll notice that I’ve used the call() method to invoke the super class constructor. This is because the super class constructor isn’t being called as a member of any object and so the this pointer won’t be what you expect.

Here is a jsFiddle in which the super constructor is called directly. As you can see, when the ObjectA constructor is called directly from the ObjectB constructor, the this reference is actually the browser window. Using the call() method allows us to specify the object instance to use for this when calling the constructor.
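In case that jsFiddle isn’t to hand, here’s a sketch of the failure – ObjectC is a hypothetical class that forgets to use call():

[code language=”javascript”]
ObjectC = (function(){
    function make() {
        ObjectA("Hello World!"); // plain function call – inside, this is the browser window
    }

    make.prototype = new ObjectA();
    make.prototype.constructor = make;

    return make;
})();

var o = new ObjectC();
document.write(o.message);      // undefined – the new instance was never initialised
document.write(window.message); // "Hello World!" – the constructor wrote to the window instead
[/code]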

This is another downside to attempting to emulate OOP in JavaScript. The call() method is a little like MethodInfo.Invoke() in .NET and tends to be slower than calling methods directly. If you’re creating a lot of objects frequently and you need your code to run quickly, you don’t want to be calling super class constructors this way. Mind you, if you want your code to run quickly, you probably won’t be dynamically creating a large number of objects at run time anyway.

Overriding Super Class Methods

The last thing to tackle is overriding super class methods. This is done by storing a reference to the super class function and then overriding it with a closure that calls the stored reference.

[code language=”javascript” highlight=”16,17,18,19″]
ObjectA = (function(){

    function make() {
    }

    make.prototype.calc = function(x, y) {
        return x + y;
    };

    return make;
})();

ObjectB = (function(){
    function make() {

        var _calc = this.calc; // keep a reference to the super class version
        this.calc = function(x, y) {
            return _calc.call(this, x, y) * 2;
        };

    }

    make.prototype = new ObjectA();
    make.prototype.constructor = make;

    return make;
})();

var o = new ObjectB();
document.write(o.calc(1, 2)); // writes 6
[/code]

The reason for using a closure like this, rather than simply accessing your class’s prototype, is that you may not know whether the super function is a closure or attached to the prototype. This approach means the sub class doesn’t need to know the exact implementation details of the super class, which reduces the maintenance overhead if the super class implementation changes without its interface changing.

You also need to use the call() method of the super function again to ensure the this pointer is set correctly. Keep the overhead of overriding super class methods in mind if you need your code to be fast.

Conclusions

Although it is certainly possible to emulate classical OOP in JavaScript, in my opinion the overheads of doing so won’t always be worth it. Your code will be more complex and will take less optimal code paths unless you’re willing to trade away maintainability and implementation isolation. I can understand the desire to use a familiar programming style, but I would question the need for classical OOP in JavaScript. You can’t achieve an exact copy of what’s available in Java or C#, so it’s important to know how the simulation differs from the real thing.

JavaScript is a great language, and it feels like you’re doing it a bit of a disservice if you blindly try to make it act like another language it was never meant to be.