Hello World - a Histogram

After some admin delays, I collected the data from the Archives yesterday, and have been digging in with some excitement. The data consists of three big XML files, totalling around 300MB; initially I have been looking at the largest of these datasets (180MB), which records the 27,000+ series in the Archives collection.

Initial data-munging presented some challenges, as expected; many of the records contained HTML in plain text wrapped inside the XML. Archives staff had warned me about this and I'd blithely replied that it would be fine, and the more data the better. Of course the first thing that happened as I attempted to parse the XML with Processing was that the HTML broke the parser. So step one was to make a copy of the dataset without the HTML; a quick grep tutorial later and I was able to use TextWrangler to automate the process of stripping it out, reducing the file size along the way to about 50MB.
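
As a rough illustration of that stripping step, here is a minimal Python sketch, not the TextWrangler grep pattern actually used. It assumes the HTML appears as raw tags inside the XML text, and that the HTML tags can be told apart from the XML ones by name. The tag list, the element names (`series`, `title`, `note`) and the sample record are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumed list of HTML tag names to strip; a real dataset may need others.
HTML_TAGS = ("p", "br", "b", "i", "em", "strong", "a",
             "ul", "ol", "li", "font", "span", "div")

# Matches opening and closing tags (with any attributes) for the names above.
# The \b stops "b" from matching inside a longer XML tag name such as "body".
HTML_TAG_RE = re.compile(
    r"</?(?:" + "|".join(HTML_TAGS) + r")\b[^>]*>",
    re.IGNORECASE,
)

def strip_html(xml_text):
    """Return xml_text with the listed HTML tags removed, text content kept."""
    return HTML_TAG_RE.sub("", xml_text)

# Hypothetical record: the unclosed <br> is enough to break a strict XML parser.
sample = ("<series><title>Correspondence files</title>"
          "<note>Held in <b>Canberra</b>.<br> Open access.</note></series>")

cleaned = strip_html(sample)
root = ET.fromstring(cleaned)      # parses once the stray HTML is gone
print(root.find("note").text)      # prints "Held in Canberra. Open access."
```

Matching on a fixed list of tag names, rather than on anything between angle brackets, is what keeps the XML structure itself intact; the cost is that any HTML tag missing from the list will still trip the parser.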


After that the process of getting the data into Processing has been straightforward, and I'm impressed with its ability to ingest a large lump of XML without complaint. As a sort of "hello world" visualisation I decided to make a simple histogram of the entire series dataset by date; specifically, the start date of the contents of each series (click the image to see it without the nasty scaling artefacts, at full resolution). The x axis is year, with a range from 1800 to 2000; the y axis is the number of series with that start date. It's unlabelled here, but the maximum value (in 1950) is about 960. Already you can get a sense of the shape of the collection from this image: there are spikes at 1901 and 1914 that correspond, I'd guess, to Federation and World War I, and the next spike is, of course, 1939. One question I can't answer at the moment is why there is such a dramatic drop in the number of series commencing after 1960 - perhaps a change in recordkeeping or the archival process itself? Any thoughts?
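
For readers who want to see the counting behind a histogram like this, here is a minimal Python sketch (the original was built in Processing, and this is not that code). It assumes each series is an element with a child holding a start date that begins with a four-digit year; the element names `collection`, `series` and `start_date` and the sample data are hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET

def start_year_histogram(xml_text, first=1800, last=2000):
    """Count series per start year, keeping only years in [first, last]."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for series in root.iter("series"):            # hypothetical element name
        text = series.findtext("start_date", default="")
        year_part = text.strip()[:4]              # assume the year comes first
        if len(year_part) != 4 or not year_part.isdecimal():
            continue                              # skip missing or odd dates
        year = int(year_part)
        if first <= year <= last:
            counts[year] += 1
    return counts

# Hypothetical sample data, only to exercise the function.
sample = """<collection>
  <series><start_date>1901</start_date></series>
  <series><start_date>1901-01-01</start_date></series>
  <series><start_date>1914</start_date></series>
  <series><start_date>unknown</start_date></series>
  <series><start_date>1788</start_date></series>
</collection>"""

hist = start_year_histogram(sample)
for year in sorted(hist):
    print(year, "#" * hist[year])   # one text bar per year that has series
```

Each bar is simply the number of series whose start date falls in that year, so a sudden drop such as the one after 1960 can be checked directly by comparing `hist[1960]` with `hist[1961]` on the real data.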

4 comments:

I'd be looking at the data closely - that's a definite break in series that's due to something screwy. I'd be guessing that the records post 1960 are in another data set or stored differently. What's the histogram recording? Bytes in each year, or records? If the record format changed then the bytes should still be large.

17 September 2008 at 4:14 pm  

The histogram measures the number of series with records commencing in a given year: so there are dramatically fewer series that commence in 1961, compared to 1960. This data doesn't include number of records in a series - that's another layer down!

17 September 2008 at 5:42 pm  

Hey Mitchell – we haven't met but I'm part of the NAA web team. The spikes in the histogram are interesting – probably one of those things that if I'd thought hard about I might have come up with an answer for, but seeing it is infinitely cooler. I'm looking forward to keeping up with what you're doing!

19 September 2008 at 2:14 pm  

Thanks a lot Kate - that's really good to hear. Look forward to your feedback!

20 September 2008 at 8:43 am  
