TL;DR: Always verify the bots when conducting bot analysis of server traffic logs.
What are server traffic logs?
Server traffic logs are raw data files exported directly from a web server. A server log file stores the following information about each HTTP request:
- IP address
- Date and time
- HTTP Request (file, page or asset)
- Status response codes
- Request size in bytes
Server traffic logs are generally exported as CLF (Common Log Format) files, automatically compressed on the fly to gzip files to save on bandwidth and export time.
For high-traffic websites, server log exports can run into tens of gigabytes compressed and hundreds of gigabytes unpacked; for low-traffic websites, exports are usually under 1 gigabyte compressed.
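If you'd rather not unpack those multi-gigabyte archives before analysing them, you can stream a gzipped log directly. A minimal Python sketch (the function name and file path are my own, not from any particular tool):

```python
import gzip

def read_log_lines(path):
    """Stream lines from a gzip-compressed server log without
    unpacking the whole file to disk first."""
    with gzip.open(path, mode="rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            yield line.rstrip("\n")
```

Streaming like this keeps memory use flat no matter how large the export is.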
Why are server traffic logs useful?
Server logs are immensely useful for server administrators, developers and technical SEOs. The information within them provides insight into the who, what and when of every inbound HTTP request:
- Country of IP address
- Timestamped access log
- Visitor request or bot request
- Type of Request (GET / POST)
- The requested file path on the server
- Status codes (200, 3xx, 4xx, 5xx)
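To show how those fields sit inside a single log entry, here is a minimal Python sketch that parses a Common Log Format line with a regular expression (the sample entry and IP address are made up):

```python
import re

# Regex for the Common Log Format, e.g.:
# 203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326
CLF_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_clf_line(line):
    """Return the fields of one CLF log entry as a dict,
    or None if the line does not match the format."""
    match = CLF_PATTERN.match(line)
    return match.groupdict() if match else None
```

Every tool discussed below is, at its core, doing some version of this parsing step before the analysis begins.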
Technical SEO usage of Server Logs
If you are reading this article from a technical SEO perspective, you likely already know the benefits of log analysis. For those readers on a learning curve, server log analysis is vital to better understanding crawler-bot activity such as Googlebot, Yandexbot and Bingbot (plus many others).
Depending on exactly where in the world you are reading this from, Googlebot is the crawler-bot most likely to be filling your server logs with entries. Other common crawler-bot user-agent entries are:
- Googlebot (Smartphone, Images & News)
Popular methods of reading Server Logs
From recollection, I know of five methods for reading and analysing a server log, although the tool you choose will depend on the data you intend to extract.
Popular log file analysis tools:
- Screaming Frog Log Analyser
- Apache Log Viewer
- Microsoft Excel
- BigQuery / SQL
Screaming Frog Log Analyser
Screaming Frog Log Analyser (SFLA for short) is probably the best-known log analysis tool of the bunch. Many of you will already have heard of Screaming Frog's other tool, the SEO Spider.
SFLA comes preloaded with the user-agent signatures of the crawler bots used by the main search engines. It's quick, visual, intuitive and easy to operate. Its USP for me is the ability to verify bot identities, which is crucial if you're looking to better understand where and how often Googlebot crawls. If you don't enable verify bots, you're also including all the chaff, such as bot spoofing. More on bot spoofing later in this article.
Splunk
Personally, I have never used Splunk, so I cannot comment on it. From a quick glance at their website, they've grown over the years to offer a number of different log analysis options.
Apache Log Viewer
Apache Log Viewer does what it says on the tin: it's a log viewer. There's a free version, and I would recommend downloading it; it's worth it just for seeing the data in a clean format rather than the raw log view.
Microsoft Excel
Microsoft Excel is a versatile tool for most SEOs' needs, and reading and analysing server logs is one of them. The formatting and parsing can be a little tricky to begin with, but once you've overcome that, it's a very powerful server log reader. There are plenty of tutorials online to help you.
BigQuery / SQL
BigQuery is Google's own cloud database for large-scale data analysis. You can use simple SQL queries to retrieve and analyse datasets. BigQuery is probably the best solution for really high-traffic websites: those that get millions of visits a month.
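To illustrate the kind of aggregation you would express as a GROUP BY in BigQuery or a pivot table in Excel, here is a rough Python sketch. It assumes (my assumption, not a fixed schema) that each log entry has already been parsed into a dict with "status" and "path" keys:

```python
from collections import Counter

def summarise(records):
    """Aggregate parsed log records, much like a GROUP BY in
    BigQuery or a pivot table in Excel: count hits per status
    code and surface the ten most-requested paths."""
    status_counts = Counter(r["status"] for r in records)
    top_paths = Counter(r["path"] for r in records).most_common(10)
    return status_counts, top_paths
```

The same two questions (which status codes, which URLs) are usually the first thing any of these tools answers.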
There's also a new kid on the block for log file analysis: I have seen Jet Octopus promoting itself as an alternative to Botify and Screaming Frog Log Analyser. I have never used the tool, so I cannot vouch for it, but they do have a free trial on their website.
Screaming Frog Log Analyser (SFLA) is the Winner
Out of the bunch, the standout favourite for me has to be Screaming Frog Log Analyser (SFLA). If you're a technical SEO conducting log analysis, the task at hand is keeping an eye on bot activity, mostly Googlebot's. The fact that you can quickly (within a few minutes) verify the real Googlebot among all the Googlebot entries makes it a clear winner.
What do I mean by that? Think about it: when you crawl a competitor's website, your go-to user-agent choice is going to be Googlebot Smartphone. That is essentially bot spoofing.
Sorting real Googlebot visits from spoofed Googlebot Visits
Think of this scenario: you're about to conduct a bot analysis task on your (or a client's) website to better understand bot activity over the previous month. In that month, ten of your competitors crawled the website using Screaming Frog SEO Spider or DeepCrawl to gain insights into site health, tactics and strategy, each of them selecting the Googlebot user-agent.
So, unless you sort the real Googlebot visits from the spoofed ones, you're going to analyse an incorrect dataset. From my understanding, BigQuery and Excel log analysis may overlook that fact, so again, another win for SFLA. I'm probably going to be proved wrong at some point, which would be great: BigQuery and Excel log analysis experts need to make it clearer how they verify bots.
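For what it's worth, Google's documented way of verifying Googlebot is a reverse DNS lookup on the requesting IP, a check that the hostname belongs to googlebot.com or google.com, and then a forward lookup to confirm the hostname resolves back to the same IP. A minimal Python sketch of that round trip (the function names are mine):

```python
import socket

# Domains Google documents for genuine Googlebot reverse-DNS hostnames
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def hostname_is_google(hostname):
    """True if a reverse-DNS hostname sits under Google's
    documented crawler domains."""
    return hostname.endswith(GOOGLE_SUFFIXES)

def verify_googlebot(ip):
    """Reverse-DNS the IP, check the hostname is a Google domain,
    then forward-resolve the hostname and confirm it maps back to
    the same IP. Returns False on any DNS failure."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)
        if not hostname_is_google(hostname):
            return False
        return ip in socket.gethostbyname_ex(hostname)[2]
    except (socket.herror, socket.gaierror):
        return False
```

The suffix check matters: a hostname like googlebot.com.attacker.net would pass a naive substring match but fails here.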
Googlebot Negative SEO
I think too much, especially late at night when I struggle to drop off to sleep. One of those thoughts was: how could you IP-spoof Googlebot? It's probably beyond my existing skillset, but it's not impossible for a seasoned cybersecurity professional (hacker). I'd bet they could spoof and emulate Googlebot well enough that a tool like SFLA, querying the list of known Googlebot IPs, would verify the fake as real.
You could spoof a real Googlebot IP and flood a server with requests to slow it down. You could also run dozens of crawls of a website to build up a fake Googlebot footprint in the server logs that skews data analysis, leading to bad decision-making by the target's marketing and development teams.
I told you I think too much. One last thing before I sign off: the security-conscious part. From some research and reading, I have discovered that Googlebot user-agent spoofing is one of the most popular methods of hunting for website exploits. Dig into your server logs and investigate the locations of POST requests; you'll see what I mean.
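If you want to see that for yourself, a quick way is to filter an access log down to its POST entries. A simple Python sketch (the sample log lines below are invented for illustration):

```python
def post_requests(lines):
    """Yield only the log entries whose HTTP method is POST; a
    sudden cluster of POSTs against login forms or admin paths is
    a common sign of exploit probing."""
    for line in lines:
        if '"POST ' in line:
            yield line
```

Run that over a day's log and look at which paths attract the POSTs; requests to endpoints like login handlers you don't even use are the giveaway.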