Command Line awk Regular Expression for Apache logs

February 10th, 2013 by Richy B.

For testing code against a live site, I’ve had to extract all the URLs from an Apache access log – but how do you do that from the Linux command line?

The secret is to use two regular expression substitutions in an awk command – for example:

cat examine.txt | awk 'sub(/.*(GET|POST) \//,"") && sub(/ HTTP.*/,"")'

This pipes the contents of the file examine.txt to awk, which runs two substitutions on every line. The first removes “GET /” or “POST /” and everything before it, and the second removes “ HTTP” and everything after it. Because the two sub() calls are joined with &&, a line is only printed when both substitutions succeed, so anything that isn’t a GET or POST request is quietly dropped. The result is a nice list of URLs to test.
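To illustrate, here’s a made-up access log line (the IP address, timestamp, status code and size are invented purely for the example):

192.0.2.1 - - [10/Feb/2013:12:00:00 +0000] "GET /xyz.php?id=42 HTTP/1.0" 200 1234

Fed through the awk command above, that line comes out as just:

xyz.php?id=42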

Oh – and if you’d like it to produce a nice, curl-friendly file of just the URLs starting with “xyz.php” on the host example.com, then:

cat examine.txt | grep "GET /xyz.php" | awk 'sub(/.*(GET|POST) \//,"http://example.com/") && sub(/ HTTP.*/,"")' > curl.txt

should do the trick (combine that with cat curl.txt | xargs -I{} curl {} > /dev/null to test).
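If the log mentions the same URL hundreds of times, you probably don’t want to fetch it hundreds of times. One possible refinement (a sketch, assuming GNU sort and a reasonably recent curl) is to deduplicate the list and print the HTTP status code next to each URL as it’s fetched:

sort -u curl.txt | xargs -I{} curl -s -o /dev/null -w "%{http_code} {}\n" {}

Here -s silences curl’s progress meter, -o /dev/null throws away the response body, and -w prints the status code and URL so failures stand out at a glance.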

