Press "Enter" to skip to content

Richy's Random Ramblings

Blogging: Comment Spam

Like practically everybody else in the blogosphere at the moment, I’m suffering quite a bit of comment spam: I had to block my first IP address yesterday – and now I’m blocking the following seven IP addresses (a sample Apache block follows the list):
209.210.176.19
209.210.176.20
209.210.176.21
209.210.176.22
209.210.176.23
80.50.117.113
64.109.143.166
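For anyone wanting to do similar: the blocking itself is just a bit of Apache configuration. A minimal sketch, assuming your blog sits behind Apache and you only want to stop these addresses POSTing comments – something like this in a .htaccess file:

    # Block the spammers' addresses from POSTing (i.e. submitting comments)
    <Limit POST>
      Order Allow,Deny
      Allow from all
      Deny from 209.210.176.19 209.210.176.20 209.210.176.21
      Deny from 209.210.176.22 209.210.176.23
      Deny from 80.50.117.113 64.109.143.166
    </Limit>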

What sort of spam have I been getting? Well, 80.50.117.113 from “klaus” was a Cheap Viagra, Vicodin, Xanax, Prescription Drugs and Penis Enlargement Pills spam, and 64.109.143.166 from “Alex Dolbayov” was for “Great Site Folks! I have another [?] big t-ts site for you which is really the #1 big t-ts site” (and that’s after I implemented Neil’s change-the-comment-CGI-script-filename patch). All the rest, spread over a variety of posts, advertised the same paedophilia-oriented porn site that a number of others have unfortunately been hit with.

Patches I’m going to try include “URLs including zipcode are prohibited”, “Avoid Comment Spam” and “Comment Spam Quick Fix”, and I’m certainly going to try Jay Allen’s MT-BlackList once it’s released (in fact, I’ve had an email from Kadyellebee of MT-Plugins to let me know that it’ll be included in the MT Plugins Manager as soon as it’s released!).

(I may also add Avoid Duplicate Comments, use some of the advice from “Seven quick tips for a spam-free blog”, and try the Comment Queue Script/MT Hack.)

Expect a few minor things to change around here once it’s all been implemented (oh, and I’ve also installed the Trickle thingy so I can schedule “future blog entries”).

Techy: Regexps are slow…

As part of a massive project I’ve been working on in my spare time over the last few weeks (hence the lack of good-quality blog posts), I’ve been having to handle a lot of data. By “a lot”, I mean that during one day my computer transferred over 5Gb of data to and from remote systems – and then had over 21 million MySQL database queries/updates to process: you try using a computer for anything else while it’s processing the h-ll out of a lot of files in several different formats. (Oh, and many thanks to Jeremy Zawodny for reminding me of his post regarding MySQL’s 4Gb table limit – I had seen it before, but once I took into account that I was hitting the 4Gb limit after just a couple of thousand records (and I’m dealing with a minimum of 3 million records), I decided a database redesign was needed.)
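(For the record, if a redesign hadn’t been on the cards: on MyISAM tables that 4Gb ceiling comes from the default 4-byte row pointers, and the usual workaround is to rebuild the table with wider ones. The table name below is just an example:)

    -- Rebuild the (hypothetical) table with wider row pointers so the
    -- MyISAM data file can grow past the default 4Gb ceiling.
    ALTER TABLE records MAX_ROWS = 1000000000 AVG_ROW_LENGTH = 200;
    -- SHOW TABLE STATUS LIKE 'records'; then reports the new Max_data_length.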

Anyway, I’ve just had to process 250 files – over 700Mb of TSV (tab-separated values) data – and extract the information I need. I originally used a Perl regexp (regular expression) to separate the data on each line and then performed a brief comparison on it (if field X is “B”, “C” or “D” and field Z is “K” or “M”, then make an SQL database insert; otherwise ignore the record). Alas, the script was SLOOOOW. After a day of processing, the script had only got through around 30 files and had then crashed my machine for some reason (probably because I was trying to make it go faster by increasing its priority).
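(The original script is long gone, but the slow version boiled down to something like this – the number and layout of fields here are invented for illustration:)

    #!/usr/bin/perl
    # Rough sketch of the slow approach: one big backtracking regexp
    # capturing every field at once. The real data had more fields.
    use strict;
    use warnings;

    while (my $line = <>) {
        chomp $line;
        my ($w, $x, $y, $z) =
            $line =~ /^([^\t]*)\t([^\t]*)\t([^\t]*)\t([^\t]*)$/
            or next;
        # ... comparisons and the SQL insert went here ...
    }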

So I decided to rewrite it to use split(/\t/, $_) to separate the data, and then an (if $fieldx =~ /[BCD]/) style test to compare the data and store it. Perl then shot through the data, and 5.4 million records later it had extracted the 2 million items I needed. Speed? 400 seconds! Yep, that’s fast!
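For the curious, the fast version amounts to something like this sketch (I’ve assumed field X is column 3 and field Z is column 7, and stubbed out the actual SQL insert):

    #!/usr/bin/perl
    # Sketch of the split-based rewrite; field positions are assumptions.
    use strict;
    use warnings;

    while (my $line = <>) {
        chomp $line;
        my @f = split /\t/, $line;
        next unless @f >= 7;    # skip short/malformed lines
        # /[BCD]/ is the character class you want here; [B|C|D] would
        # also match a literal "|" character.
        if ($f[2] =~ /^[BCD]$/ and $f[6] =~ /^[KM]$/) {
            print "would insert: $line\n";   # stand-in for the SQL insert
        }
    }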

What have I learnt? Well, if you know you can “trust the data” (i.e. it’s been produced by a computer instead of being typed in by an error-prone human, and you know it hasn’t been corrupted), then use split instead of regular expressions – it’s a lot, lot faster (a brief speed test during development showed regexps taking over 843 seconds to process the same data that a split got through in less than 8 seconds!).
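If you want to reproduce the speed test yourself, Perl’s core Benchmark module does the comparison for you – this is a synthetic harness on a made-up ten-field line, not my original test:

    #!/usr/bin/perl
    # Synthetic speed test: one-big-regexp field extraction versus split.
    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    my $line = join "\t", map { "field$_" } 1 .. 10;
    my $re   = join '\t', ('([^\t]*)') x 10;   # pattern for ten fields

    cmpthese(-3, {                 # run each sub for ~3 CPU seconds
        regexp => sub { my @f = $line =~ /^$re$/ },
        split  => sub { my @f = split /\t/, $line },
    });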

Anyway, now I’ve extracted the data, I’ve got to get it into the database and then do the hard bit: linking the new 2 million items to the 3 million items already in the database whilst querying around 20 remote database systems. Once that’s done and it’s all cross-referenced (which will take ages – I’ll have to use regular expressions for all the cross-referencing), I’ve then got to get the data analysed before it can be output and (finally) uploaded to my web server. It’s a lot of data, but if this pans out (which it should), this project will instantly offer something nobody else on the internet offers at the moment (although I’m aware of a number of similar developments in the pipeline).

Snippet: 4Gb in, no Gigs out

I spent last week running my computer 24/7 processing several gigabytes of data and eventually filled a MySQL database table with 4Gb of data (and I’ve still got 1Gb to load in!). So I’ve had to redesign the database schema and branch out to flat-file databases, and now I’m trying to get the data out of the MySQL database and into the flat files…

Except it won’t export the data now. Every time I try to extract it using a very, very basic Perl script, it bombs out with “Out of memory (Needed XXXXX bytes)” – despite the fact that I’ve got 0.5Gb of RAM and 4Gb of swap space currently configured.
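If anyone’s hit the same thing: the usual culprit is that DBD::mysql buffers the entire result set in client memory before handing you the first row. Assuming the export script uses DBI (mine does), asking for mysql_use_result makes it stream rows from the server one at a time instead – a sketch, with the connection details, table name and output file as placeholders:

    #!/usr/bin/perl
    # Streaming export sketch - assumes DBI with DBD::mysql.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('DBI:mysql:database=mydb', 'user', 'password',
                           { RaiseError => 1 });

    # mysql_use_result => 1 fetches rows from the server one at a time
    # instead of buffering the whole result set in RAM (the buffering
    # is what causes the "Out of memory" bomb-out).
    my $sth = $dbh->prepare('SELECT * FROM big_table',
                            { mysql_use_result => 1 });
    $sth->execute;

    open my $out, '>', 'flatfile.txt' or die "flatfile.txt: $!";
    while (my @row = $sth->fetchrow_array) {
        print {$out} join("\t", map { defined $_ ? $_ : '\N' } @row), "\n";
    }
    close $out;
    $dbh->disconnect;

(The other option, which skips Perl entirely, is MySQL’s SELECT … INTO OUTFILE, which writes the flat file on the server side.)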

I need to get this thing sorted ASAP – I want to be able to upload it this weekend (ready for a launch on the 1st of October) and I want to be able to use my PC again. Whilst it’s “number crunching” and processing and whatnot, my machine slows down to a crawl 🙁

Snippet: Job Vacancy

Oh – if anybody is interested, there’s a job vacancy going: contact me for more details if you’re interested (just leave a comment with your email address and I’ll forward further details). The “rough job spec” follows:

Looking for a unique job position within Leicester, UK?
Do you have experience, or want experience, with *any* or all of the following?

* Search engine optimisation/placement – our customers are looking to us to get their sites into the top ten in the major search engines such as Google, AltaVista, Yahoo etc. Have you got any experience in this field, or do you want to learn the secrets of the trade?

* Customer support – have you had a customer-facing role where you interacted with new or existing customers over the telephone and by email?

* Server administration – we’ve got a number of Linux-based web servers running Apache, cPanel, Ensim, Ensim Pro and a few other software suites. Have you had any experience of these, from either the administration or the user end? How about Microsoft FrontPage, FTP or anything like that? We’re not necessarily looking for “l33t” people, but if you can tell a web site address from an email address, it’s a start!

It’s a challenging role working in a small but growing and dedicated team, with a customer base now in its thousands! Can you “work on your own” yet still interact in a team, and do you like challenges without too much “set routine”?

Snippet: Still At Work

It’s gone 7pm and I’m still at work, and I look to be here for another hour yet finishing a recruitment site (I was meant to be able to pass the design on to a coworker, but unfortunately they didn’t “meet requirements”, and hence I’m having to do all the work myself). And once I get back home, I’ve got to give the DNS system a really good kicking as, by the looks of things, my main website is still showing the “Under maintenance” page (how it can do that with Apache down I’m not sure: attempting to stop Apache comes up with “Apache not running”, and attempting to start it comes up with “segmentation fault”), and checking on DNS Report shows there are some really strange things happening with my DNS at the moment.

Groan. Can’t I have an “easy week” for once?