While the specifics of identifying bots in a data set vary depending on the environment, here are some common ways to spot them.
You can pull a Data Warehouse report with IP address, page views, and unique visitors. Then create a calculated column in Excel for page views per visit, and sort from highest to lowest. Bots commonly rack up a very high number of page views per visit (several hundred to several thousand), and you will see a sharp drop-off as you move into real traffic.
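If you would rather script that step than work in Excel, the same calculation is straightforward in pandas. This is only a sketch: the CSV file name and column names are assumptions, so adjust them to match your actual Data Warehouse export.

```python
import pandas as pd

# Minimal sketch, assuming the Data Warehouse report was saved as a CSV with
# hypothetical columns "ip", "page_views", and "unique_visitors".
df = pd.read_csv("data_warehouse_export.csv")

# Because bots rarely return with the same visitor ID, page views per unique
# visitor is a reasonable stand-in for page views per visit here.
df["page_views_per_visit"] = df["page_views"] / df["unique_visitors"]

# Sort from highest to lowest; bots cluster at the top with hundreds or
# thousands of page views per visit, followed by a sharp drop-off.
suspects = df.sort_values("page_views_per_visit", ascending=False)
print(suspects.head(25))
```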
Bots typically don’t have a referring URL. In segmentation, this can be filtered as Referring Domain equals Typed/Bookmarked.
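Inside Adobe Analytics you would build that as the segment above; if you are working with a raw hit-level export instead, a simple mask does the same job. The DataFrame shape and the referrer column name below are assumptions.

```python
import pandas as pd

def flag_no_referrer(hits: pd.DataFrame, referrer_col: str = "referrer") -> pd.Series:
    """Boolean mask for hits with no referring URL (the Typed/Bookmarked bucket)."""
    referrer = hits[referrer_col].fillna("").str.strip()
    return referrer.eq("")
```

On its own this also flags plenty of legitimate direct traffic, so treat it as one signal among several.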
Bots often use custom user agents that aren’t classified in the Browsers dimension, or that show up as an unknown version of a standard browser. Unknown Safari and unknown Opera have an extremely high likelihood of being bots.
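In a hit-level export, the same check can be approximated by looking for unclassified or “unknown version” browser values. The column name and the substring match are assumptions about how your export labels those values.

```python
import pandas as pd

def flag_unknown_browser(hits: pd.DataFrame, browser_col: str = "browser") -> pd.Series:
    """Boolean mask for hits whose browser is unclassified or an 'unknown version' variant."""
    browser = hits[browser_col].fillna("unknown").str.lower()
    return browser.str.contains("unknown", regex=False)
```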
We don’t mean to discredit the great open-source Linux operating system, but bots apparently like to report it as their operating system; just be careful not to exclude legitimate traffic from real Linux users. Bots also often don’t set an operating system at all, which can be segmented as Operating System equals Not Specified.
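As a rough equivalent outside the segmentation UI, you can flag hits whose operating system is Linux or missing. The column name and value labels are assumptions, and remember that this mask will also catch real Linux users.

```python
import pandas as pd

def flag_suspect_os(hits: pd.DataFrame, os_col: str = "operating_system") -> pd.Series:
    """Boolean mask for hits reporting Linux or no operating system at all."""
    os_value = hits[os_col].fillna("Not Specified").str.strip()
    return os_value.isin(["Linux", "Not Specified", ""])
```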
This applies especially to the user agent report. As you can see in the screenshot below, the “unknown version” of these browsers has almost the same number of visits as unique visitors (and almost the same number of page views). This can be isolated in segmentation by building an Include container for Single Page Visits equals Enabled or Hit Depth is less than 2.
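That Include container logic can be approximated on a hit-level export by counting hits per visit; the visitor and visit column names here are assumptions.

```python
import pandas as pd

def flag_single_page_visits(hits: pd.DataFrame,
                            visitor_col: str = "visitor_id",
                            visit_col: str = "visit_num") -> pd.Series:
    """Boolean mask for hits in visits with only one page view (hit depth < 2)."""
    hits_per_visit = hits.groupby([visitor_col, visit_col])[visitor_col].transform("size")
    return hits_per_visit < 2
```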
Bots usually get a new visitor ID every time they execute, so they only ever incur one visit and all of their traffic carries a visit number of 1.
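This is easy to check in a raw export, though on its own it also matches every genuine first-time visitor, so treat it only as a supporting signal. The column name is an assumption.

```python
import pandas as pd

def flag_first_visit_only(hits: pd.DataFrame, visit_num_col: str = "visit_num") -> pd.Series:
    """Boolean mask for hits with visit number 1; only meaningful alongside other bot signals."""
    return pd.to_numeric(hits[visit_num_col], errors="coerce").eq(1)
```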
Modern users have much higher-resolution monitors than in years past. Hits with the following resolutions seem to be very popular for bots:
You would notice a mismatch between the originating country and the time zone. For example, the location might be the United States but the time zone might be GMT.
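A rough scripted version of this check, assuming your export carries a geo country and a visitor time-zone offset in hours (both column names are assumptions), flags the exact mismatch described above: a United States location paired with a GMT clock.

```python
import pandas as pd

def flag_us_gmt_mismatch(hits: pd.DataFrame,
                         country_col: str = "geo_country",
                         tz_offset_col: str = "timezone_offset_hours") -> pd.Series:
    """Boolean mask for hits geolocated to the US but reporting a GMT (offset 0) time zone."""
    in_us = hits[country_col].fillna("").str.strip().str.lower().eq("united states")
    on_gmt = pd.to_numeric(hits[tz_offset_col], errors="coerce").eq(0)
    return in_us & on_gmt
```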
The user doesn’t log in at any point in their visit, and their user identification eVars aren’t persisting from previous visits. While some bots can be set up to authenticate, the majority aren’t that smart.
Bots typically don’t add products to a shopping cart or check out. Most of the time they don’t submit lead forms or trigger other success events, though some bots do submit simple HTML forms.
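One way to script this check (together with the login check from the previous point) is to roll success events up to the visit level and flag visits that triggered none of them. Every column name below is an assumption about how your export labels events.

```python
import pandas as pd

def flag_no_success_events(hits: pd.DataFrame,
                           visitor_col: str = "visitor_id",
                           visit_col: str = "visit_num",
                           event_cols=("login", "cart_additions", "orders", "form_submits")) -> pd.Series:
    """Boolean mask for hits in visits that never logged in, carted, ordered, or submitted a form."""
    per_visit = hits.groupby([visitor_col, visit_col])[list(event_cols)].transform("sum")
    return per_visit.sum(axis=1).eq(0)
```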
Sometimes bots try to bust the cache or otherwise break sites by hitting malformed URLs or URLs that don’t exist (like typical LAMP or WordPress admin pages), or by appending specific query strings.
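If you have page URLs in your export, a regular expression makes these probes easy to spot. The patterns below are only examples of the kind of admin and install paths bots probe for; build your own list from the malformed URLs that actually show up in your page URL report.

```python
import pandas as pd

# Example probe paths only; extend this from the malformed URLs in your own data.
PROBE_PATTERN = r"wp-admin|wp-login\.php|xmlrpc\.php|phpmyadmin|\.env\b"

def flag_probe_urls(hits: pd.DataFrame, url_col: str = "page_url") -> pd.Series:
    """Boolean mask for hits requesting admin or install paths a real visitor would never reach."""
    return hits[url_col].fillna("").str.contains(PROBE_PATTERN, case=False, regex=True)
```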
Web hosting services like Amazon Web Services or Google Cloud can be abused as bot farms. IP addresses beginning with 35.199 or 35.194 are at a high risk of being bots.
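A quick way to surface this traffic in an export is a prefix check on the IP column. The two prefixes come straight from the list above, the column name is an assumption, and a more thorough version would match against the hosting providers’ published IP ranges.

```python
import pandas as pd

# Prefixes called out above; a fuller check would use the providers' published IP ranges.
SUSPECT_PREFIXES = ("35.199.", "35.194.")

def flag_cloud_ips(hits: pd.DataFrame, ip_col: str = "ip") -> pd.Series:
    """Boolean mask for hits whose IP address starts with one of the suspect hosting prefixes."""
    return hits[ip_col].fillna("").map(lambda ip: ip.startswith(SUSPECT_PREFIXES))
```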