Update to the robots.txt parser
Summary
Siteimprove has updated its robots.txt parser to align with search engine behavior. This change may affect how your website is crawled, especially if you use custom rules or user-agent-specific configurations.
Overview
This article explains an update to the robots.txt parser used by the Siteimprove crawler, including what has changed, why the change was made, and how it may affect your crawled data.
What has happened?
We have updated the robots.txt parser used by the Siteimprove crawler to one that mirrors the parser behavior of search engines such as Google.
When did the changes take effect?
The changes were rolled out on May 2, 2022. All crawls after this date use the new parser.
What is the robots.txt parser?
The Siteimprove crawler uses a robots.txt parser to determine which URLs and files it is allowed to crawl on a specific domain.
The robots.txt file (found at yourwebsite.com/robots.txt) is downloaded and examined for each domain. Each URL is then checked against the rules for that domain and either included in the crawl or discarded.
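The check described above can be sketched with Python's standard-library urllib.robotparser. This only illustrates the general allow/disallow decision. It is not the parser Siteimprove uses, and the robots.txt content, URLs, and helper names are made-up examples.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for yourwebsite.com (assumption for illustration).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def filter_urls(parser: RobotFileParser, agent: str, urls: list[str]) -> list[str]:
    """Keep only the URLs that the rules allow for the given agent token."""
    return [url for url in urls if parser.can_fetch(agent, url)]


parser = build_parser(ROBOTS_TXT)
candidates = [
    "https://yourwebsite.com/about",
    "https://yourwebsite.com/private/report.html",
]
# URLs under /private/ are discarded; everything else is kept.
crawlable = filter_urls(parser, "SiteimproveBot", candidates)
print(crawlable)
```

Note that Python's built-in parser has its own matching rules, so its results for a real robots.txt file may differ from those of the Siteimprove crawler.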
Why did Siteimprove update the parser?
Over the past few years, it has become clear that the way robots.txt files are parsed varies greatly across different libraries and technologies.
Because most sites use the robots.txt file to accommodate search engine crawlers, Siteimprove has decided to move to a mechanism that better suits this scenario and the changing technologies.
What difference will this make to how my robots.txt is interpreted?
There are three main differences:
- The user agent tokens used for checking against robots.txt rules will change to "SiteimproveBot" and "SiteimproveBot-Crawler". This will not change the user agent string used when fetching customer pages.
- The parser will now support the wildcard characters * and $ in rule patterns but will not support regular expressions.
- The agent matching will change from “substring matching” to “exact matching”.
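The last two differences can be illustrated with a minimal sketch. It assumes that "substring matching" means a rule applied when its agent name appeared anywhere in the crawler's token, that exact matching ignores letter case, and that * matches any sequence of characters while a trailing $ marks the end of the URL (as in Google's documented behavior). The function names are hypothetical and this is not Siteimprove's implementation.

```python
import re


def agent_matches_substring(rule_agent: str, crawler_token: str) -> bool:
    """Assumed old behavior: the rule's agent name may be any part of the token."""
    return rule_agent == "*" or rule_agent.lower() in crawler_token.lower()


def agent_matches_exact(rule_agent: str, crawler_token: str) -> bool:
    """Assumed new behavior: the rule's agent name must equal the token."""
    return rule_agent == "*" or rule_agent.lower() == crawler_token.lower()


def path_matches(pattern: str, path: str) -> bool:
    """Match a rule pattern in which only * and a trailing $ are special."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal part so characters such as "." are NOT treated
    # as regular-expression syntax; only * becomes "match anything".
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += r"\Z"
    return re.match(regex, path) is not None


# A hypothetical rule group written for "Siteimprove" no longer applies.
print(agent_matches_substring("Siteimprove", "SiteimproveBot"))  # True
print(agent_matches_exact("Siteimprove", "SiteimproveBot"))      # False

# /*.pdf$ covers URLs that end in .pdf, but not ones with a query string.
print(path_matches("/*.pdf$", "/docs/report.pdf"))         # True
print(path_matches("/*.pdf$", "/docs/report.pdf?page=2"))  # False
```

Under these assumptions, a rule group that previously applied to the crawler through a partial name would stop applying, which is why rules should name the new tokens in full.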
How might this update affect my crawled data?
In most cases, you should not be affected by this update.
If your website does not use disallow rules in the robots.txt file, then you will not experience any changes in the crawl.
If, however, you have specific rules allowing Siteimprove to access your site, the crawler may now be disallowed because the user agent tokens have changed. See “What do I need to do?” later in this article.
If you have rules containing wildcard patterns, the resulting number of crawled pages may change.
You can inspect your website's robots.txt file by entering your domain followed by /robots.txt in a browser (e.g. yourwebsite.com/robots.txt). If in doubt, please contact your website's administrator.
What do I need to do?
If your website does not use disallow rules in the robots.txt file, then you do not need to do anything.
If your website has strict disallow rules in its robots.txt file, please ensure the file contains rules for one of the following user agent tokens so we can continue to crawl your website:
- SiteimproveBot
- SiteimproveBot-Crawler
Your website's administrator will know how to update your robots.txt file.
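As a rough illustration, the sketch below embeds one possible robots.txt file that blocks all crawlers except SiteimproveBot and checks it with Python's standard-library urllib.robotparser. The rules and the page URL are assumptions chosen for the example, and Python's parser is only a stand-in for the Siteimprove parser, so treat the result as indicative.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block everything by default, but let
# SiteimproveBot crawl the whole site (example rules, adapt to your site).
ROBOTS_TXT = """\
User-agent: SiteimproveBot
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Example page; any URL on the site would behave the same with these rules.
page = "https://yourwebsite.com/products/index.html"
siteimprove_allowed = parser.can_fetch("SiteimproveBot", page)
other_allowed = parser.can_fetch("SomeOtherBot", page)

print(siteimprove_allowed)  # True
print(other_allowed)        # False
```

The same structure applies if you use the SiteimproveBot-Crawler token instead.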
Where can I see when this update occurred in the platform?
Graph annotations will be added within the Siteimprove platform indicating when sites were switched to the new robots.txt parser. This will allow you to see when the change happened alongside any resulting changes in your data.
The annotations will be visible in the history graphs in Quality Assurance, Accessibility, SEO, and Policy on the day of the robots.txt update.