Parsing Apache Logfiles to get IP addresses of high-request users

I ran into this issue a while ago, where clients would complain that they were hitting their Entry Process limits very quickly, even though their websites didn't have a lot of users. Sometimes these sites were WordPress or another CMS, and generally optimizing the site for OPcache, or limiting plugin use and optimizing JS and CSS imports, would fix these issues. (I still don't know how people manage to have hundreds of WordPress plugins installed, each with their own 18 versions of jQuery… but that's another matter.)

80% of the clients who complained about this issue fell into the following category: text-based online RPGs. Yup. Like the Mafia games, or the other vampire ones I've seen. In these games you click around on the website to perform in-game actions, usually via JavaScript AJAX calls to PHP files that update the page without a reload. They'd all insist it wasn't the issue I'm about to explain… which leads me to another point. Everyone. Lies.

These clients often said they had a small community, and that no one would ever cheat or abuse the application they spent so many hours writing. Well… they did. The biggest thing I saw was that I could non-artificially hit entry-process limits on clients' accounts just by logging in and spamming action buttons on their website/game. Most of them (being hand-coded) didn't have any sort of rate limiting on the buttons you could click, so you could just click away to your heart's content, easily spinning up 20+ entry processes while waiting for MySQL queries to run or on-disk files to be accessed.


I had to prove it to them, and thus was born this script. It turns out parsing Apache log files can get pretty complicated:

The Script
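Something along these lines does what's described below. Treat it as a sketch rather than a drop-in tool: it assumes the common/combined Apache log format (client IP in field 1, the [day/month/year:hour:minute:second timestamp in field 4), the -r and -i flags match the description that follows, and the -o flag for teeing results to a file is one possible way to wire that part up.

```bash
#!/usr/bin/env bash
# busy-ips.sh -- a sketch of the approach described in this post.
# Finds minutes with more than -r requests, then IPs that made more
# than -i requests within those minutes.

REQS=10      # -r: per-minute request threshold (default 10)
IPREQS=""    # -i: per-IP threshold within a busy minute (defaults to -r)
OUTFILE=""   # -o: optional file to tee results into (an assumption here)

usage() { echo "Usage: $0 [-r num] [-i num] [-o file] access_log" >&2; exit 1; }

while getopts "r:i:o:" opt; do
  case "$opt" in
    r) REQS=$OPTARG ;;
    i) IPREQS=$OPTARG ;;
    o) OUTFILE=$OPTARG ;;
    *) usage ;;
  esac
done
shift $((OPTIND - 1))
[ $# -eq 1 ] || usage
LOG=$1
[ -z "$IPREQS" ] && IPREQS=$REQS
[ -r "$LOG" ] || { echo "Cannot read log file: $LOG" >&2; exit 1; }

report() {
  # Field 4 of a common/combined log line looks like "[10/Oct/2020:13:55:36";
  # trimming the leading "[" and the ":SS" seconds gives a per-minute bucket.
  awk -v reqs="$REQS" -v ipreqs="$IPREQS" '
    {
      minute = substr($4, 2, 17)          # e.g. 10/Oct/2020:13:55
      total[minute]++
      perip[minute "|" $1]++              # $1 is the client IP
    }
    END {
      for (key in perip) {
        split(key, k, "|")
        if (total[k[1]] > reqs && perip[key] > ipreqs)
          printf "%s  %s  %d requests (minute total: %d)\n",
                 k[1], k[2], perip[key], total[k[1]]
      }
    }' "$LOG" | sort   # lexical sort: fine within a single day of logs
}

if [ -n "$OUTFILE" ]; then
  report | tee "$OUTFILE"
else
  report
fi
```

Run it as, say, ./busy-ips.sh -r 20 -o busy.txt /var/log/apache2/access.log to flag minutes with more than 20 requests and save the results.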

I think for the most part it's pretty self-explanatory, but in short: it takes a standard Apache log file, parses it to find the minute timestamps during which there were over -r (default 10) requests, then finds the IP addresses that made over -i (defaults to the value of -r) requests during that one-minute period, and echoes them out, or optionally tees them to a file. The one thing I still need to work out is making sure the requests counted are unique and meaningful, and not just 404s for favicon.ico or something like that. Filtering those out is easy enough (see the snippet below), but the numbers can look unrealistic if you're not careful about what's in the log.
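As a rough pre-filter (again assuming the combined log format, where the request path is field 7 and the status code is field 9), you could strip that noise out before running the script:

```bash
# Drop favicon requests and 404s before analysis; illustrative only.
# In the combined log format, $7 is the request path and $9 the status code.
awk '$7 != "/favicon.ico" && $9 != 404' access_log > filtered_log
```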


Example output:
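The exact output depends on your log, but with the sketch above it looks something like this (the addresses and counts here are made up purely for illustration):

```
10/Oct/2020:13:55  203.0.113.42  27 requests (minute total: 61)
10/Oct/2020:13:56  203.0.113.42  31 requests (minute total: 44)
10/Oct/2020:14:02  198.51.100.7  18 requests (minute total: 23)
```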

Hopefully this can help someone out in the future, whether to explain the issue to a user or to do some simple analytics on their logs! Comment below if you have any comments or suggestions; I'd be happy to hear them.
