As part of a massive project I’ve been working on in my spare time over the last few weeks (hence the lack of good quality blog posts), I’ve been having to handle a lot of data. By “a lot”, I mean that in a single day my computer transferred over 5Gb of data to and from remote systems and then had over 21 million MySQL database queries/updates to process: try using a computer for anything else while it’s processing the h-ll out of that many files in several different formats. (Oh, and many thanks to Jeremy Zawodny for reminding me of his post regarding MySQL’s 4Gb table limit – I had seen it before, but once I realised I was hitting the 4Gb limit after just a couple of thousand records, and I’m dealing with a minimum of 3 million records, I decided a DB redesign was needed.)
Anyway, I’ve just had to process 250 files – over 700Mb of TSV (tab separated values) data – and extract the information I need. I originally used a Perl regexp (regular expression) to separate the data in each line and then performed a brief comparison on the data (if field X is the letter “B”, “C” or “D” and field Z is “K” or “M”, then make an SQL database insert entry, otherwise ignore it). Alas, the script was SLOOOOW. After a day of processing, the script had only got through around 30 files and then crashed my machine for some reason (probably because I was trying to make it go faster by increasing its priority).
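To give an idea of what that looked like, here’s a rough sketch of the regexp approach – the field positions, field count and table name are made up for illustration rather than being the real ones:

#!/usr/bin/perl
use strict;
use warnings;

while (my $line = <>) {
    chomp $line;
    # Pull the fields apart with one big capturing regexp (the slow way).
    if ($line =~ /^([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)/) {
        my ($id, $fieldx, $desc, $fieldz) = ($1, $2, $3, $4);
        # Keep only the rows that match the criteria.
        if ($fieldx =~ /^[BCD]$/ && $fieldz =~ /^[KM]$/) {
            # The real script would escape the values before building SQL.
            print "INSERT INTO items (id, description) VALUES ('$id', '$desc');\n";
        }
    }
}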
So I decided to try a rewrite and use split(/\t/, $_) to break each line up, and then an if ($fieldx =~ /[BCD]/) style test to compare the data and store it. Perl then shot through the data, and 5.4 million records later it had extracted the 2 million items I needed. Speed? 400 seconds! Yep, that’s fast!
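The split version looks something like this (again just a sketch, with made-up field positions):

#!/usr/bin/perl
use strict;
use warnings;

while (my $line = <>) {
    chomp $line;
    # Splitting on the literal tab is far cheaper than a capturing regexp.
    my ($id, $fieldx, $desc, $fieldz) = split /\t/, $line;
    next unless defined $fieldz;    # skip short/malformed lines
    if ($fieldx =~ /^[BCD]$/ && $fieldz =~ /^[KM]$/) {
        print "INSERT INTO items (id, description) VALUES ('$id', '$desc');\n";
    }
}

The filtering is identical – only the way the fields are pulled out of each line has changed.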
What have I learnt? Well, if you know you can “trust the data” (i.e. it’s been produced by a computer rather than typed in by an error-prone human, and you know the data hasn’t been corrupted), then use split instead of regular expressions – it’s a lot faster (a brief speed test during development showed regexps taking over 843 seconds to process the same data that split got through in less than 8 seconds!).
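If you want to check the difference on your own data, Perl’s core Benchmark module will do the comparison for you – here’s a quick sketch, with a made-up sample line standing in for the real data:

#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# A made-up tab separated line standing in for a real record.
my $line = join "\t", qw(12345 B some-description K 42);

cmpthese(-5, {
    # extract the fields with a capturing regexp
    'regexp' => sub {
        my @f = $line =~ /^([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)$/;
    },
    # extract them with split instead
    'split' => sub {
        my @f = split /\t/, $line;
    },
});

Each sub gets run for at least five CPU seconds and cmpthese prints a little table showing how much faster one is than the other.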
Anyway, now that I’ve extracted the data, I’ve got to get it into the database and then do the hard bit: linking the new 2 million items with the 3 million items already in the database whilst querying around 20 remote database systems. Once that’s done and it’s all cross-referenced (which will take ages – I’ll have to use regular expressions for all the cross-referencing), I’ve then got to get the data analysed before it can be output and (finally) uploaded to my web server. It’s a lot of data, but if this pans out (which it should), this project will instantly offer something no one else on the internet offers at the moment (although I’m aware of a number of similar developments in the pipeline).