Command Line awk Regular Expression for Apache logs

For code testing against a live site, I’ve had to extract all the URLs from an Apache access log, but how do you do this from the Linux command line?

The secret is to chain two regular expression (regexp) substitutions in a single “awk” command, for example:

cat examine.txt | awk 'sub(/.*(GET|POST) \//,"")&&sub(/ HTTP.*/,"")'

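This will pipe the contents of the file examine.txt to awk, which runs two regular expression substitutions. The first removes “GET /” or “POST /” and everything before it; the second removes “ HTTP” and everything after it. Because the two sub() calls are joined with &&, a line is only printed when both substitutions succeed, so other request types (HEAD, for example) are dropped. It’ll then give you a nice list of request paths, minus the leading slash, to test.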
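To see what the command does to a single request, here is a minimal sketch run against two made-up log lines. The sample data, the combined log format it imitates, and the helper name extract_paths are assumptions for illustration, not taken from a real log.

```shell
# Two invented access-log lines (combined format); 192.0.2.x is a documentation-only range
sample='192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] "GET /index.php?id=1 HTTP/1.1" 200 2326
192.0.2.11 - - [10/Oct/2023:13:55:40 +0000] "HEAD /robots.txt HTTP/1.1" 200 0'

# Hypothetical helper wrapping the awk one-liner from the post
extract_paths() {
  awk 'sub(/.*(GET|POST) \//,"")&&sub(/ HTTP.*/,"")'
}

# Only the GET line survives; the HEAD line fails the first sub() and is not printed
printf '%s\n' "$sample" | extract_paths
# prints: index.php?id=1
```

The HEAD request disappears because sub() returns 0 when nothing matched, which makes the whole && expression false for that line.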

Oh, and if you’d like it to produce a nice “curl friendly” file of just the URLs starting with “xyz.php” on host example.com, then:

cat examine.txt | grep "GET /xyz.php" | awk 'sub(/.*(GET|POST) \//,"http://example.com/")&&sub(/ HTTP.*/,"")' > curl.txt

should do the trick (combine that with cat curl.txt | xargs -I {} curl {} > /dev/null to test)
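As a variation, the grep step can be folded into awk itself by putting the filter pattern in front of the two substitutions. This is a sketch under the same assumptions as the post (host example.com, script xyz.php); the sample log lines and the helper name build_urls are made up for illustration.

```shell
# Two invented access-log lines; only the first requests xyz.php
log='192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] "GET /xyz.php?page=2 HTTP/1.1" 200 512
192.0.2.11 - - [10/Oct/2023:13:55:40 +0000] "GET /other.php HTTP/1.1" 200 128'

# Hypothetical helper: the leading /GET \/xyz\.php/ pattern does the job of the grep
build_urls() {
  awk '/GET \/xyz\.php/ && sub(/.*(GET|POST) \//,"http://example.com/") && sub(/ HTTP.*/,"")'
}

printf '%s\n' "$log" | build_urls
# prints: http://example.com/xyz.php?page=2
```

Escaping the dot as \. keeps the filter from also matching names such as xyzaphp, which the unescaped grep pattern would let through.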
