Siteimprove's Crawler: Frequently Asked Questions

Modified on: Thu, 11 Jun, 2026 at 9:56 PM

Summary

Get quick answers to common questions about crawl data, discrepancies, crawl control, and system behavior.

Overview

This FAQ provides direct answers to high-frequency questions related to crawler performance, discrepancies, and controls.

Crawl Status & Discrepancies

Where can I find more information on my website crawl status?

You can find the most recent scan dates and the scan times for your sites in Crawler Management.

Go to Settings > Crawler Management.

Note: Only Account Owners and Administrators have access to Crawler Management.

Why does Crawler Management show that a crawl is finished, but I still can’t see it in QA Check history?

The crawl shows as finished in Crawler Management as soon as the crawl itself is complete. The QA Check history, however, only shows a scan once the full scan, including processing of the data (link checking, accessibility, etc.), is complete.

Settings > Crawler Management > Scan History shows each stage of the scan and its status. If any stage in the scan history table says “Pending”, that scan is not complete.

The QA Check history, along with all the data in the platform, will only update when a full scan is complete.

The screenshot below shows a scan where the crawl finished but processing of the data found in the crawl did not. Therefore, the QA Check history won’t update.

You can read more about the scan stages in the scan process description.
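
The completion rule described above can be sketched in a few lines of Python. This is only an illustration, not Siteimprove's implementation: the stage names (queue, crawl, processing) follow this FAQ, while the `StageStatus` type and every status label other than "Pending" are assumptions.

```python
from enum import Enum


class StageStatus(Enum):
    PENDING = "Pending"    # label used in the Scan History table
    COMPLETE = "Complete"  # assumed label for a finished stage


def scan_is_complete(stages: dict[str, StageStatus]) -> bool:
    """A scan counts as complete only when no stage is still pending."""
    return all(status is StageStatus.COMPLETE for status in stages.values())


# The crawl finished, but processing of the crawl data did not.
scan = {
    "queue": StageStatus.COMPLETE,
    "crawl": StageStatus.COMPLETE,
    "processing": StageStatus.PENDING,
}
print(scan_is_complete(scan))  # False, so QA Check history would not update
```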

Why does Crawler Management show a different number of pages and links than the QA Check history for a specific site?

When crawling a site, we analyze (parse) all the URLs. Afterward, we process the data, which includes removing links/pages based on exclusions, aliases, deduplication rules, etc., configured for your website.

Crawler Management shows all the pages and links found during a crawl.
QA Check history will show the pages and links that have been stored after site content settings, deduplication rules, etc. have been processed.
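
As a rough illustration of why the two numbers differ, the sketch below applies aliases, exclusions, and deduplication to a list of crawled URLs. The function name, the prefix-based matching, and the example URLs are all hypothetical. The real rules are whatever is configured for your website.

```python
def stored_urls(
    found: list[str],
    excluded_prefixes: tuple[str, ...],
    aliases: dict[str, str],
) -> list[str]:
    """Return the URLs that remain after aliases, exclusions and deduplication."""
    stored: list[str] = []
    seen: set[str] = set()
    for url in found:
        # Rewrite an alias prefix to its canonical form (hypothetical rule).
        for alias, canonical in aliases.items():
            if url.startswith(alias):
                url = canonical + url[len(alias):]
                break
        if url.startswith(excluded_prefixes):
            continue  # dropped by an exclusion
        if url in seen:
            continue  # dropped as a duplicate
        seen.add(url)
        stored.append(url)
    return stored


found = [
    "https://example.com/",
    "https://www.example.com/",
    "https://example.com/about",
    "https://example.com/archive/2019",
    "https://example.com/about",
]
stored = stored_urls(
    found,
    excluded_prefixes=("https://example.com/archive/",),
    aliases={"https://www.example.com": "https://example.com"},
)
print(len(found), len(stored))  # 5 2
```

Here the first number corresponds to what Crawler Management would report and the second to what QA Check history would report.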

Why does Crawler Management show more pages and links for a site than the products (QA, Accessibility, Policy, SEO, Data Privacy)?

Crawler Management shows all the pages and links that we have seen during a crawl.
QA Check History shows the pages and links that have been stored after the crawl data has been processed, meaning those we have found, minus the pages/links that have been excluded due to site content settings.

See Site Content Settings for more information.

Why does Crawler Management show 0 pages for a site, but the products (QA, Accessibility, Policy, SEO, Data Privacy) show all pages?

If we find 0 pages in a crawl, then Crawler Management will show 0 pages, but QA still stores all the pages from the last successful scan. This state will remain until there is a new successful scan that completes all three stages (queue, crawl, processing).

The crawl may find 0 pages due to a site being down temporarily, but this mechanism means users can still work on the results of the last successful scan until the next scan completes. See also "Typical Reasons for Crawl Problems".
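
The fallback behavior can be pictured with the following sketch. The `Scan` type and the definition of "successful" (all stages complete and at least one page found) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Scan:
    pages_found: int
    all_stages_complete: bool  # queue, crawl and processing all finished


def is_successful(scan: Scan) -> bool:
    # Assumption: a scan that found 0 pages does not count as successful.
    return scan.all_stages_complete and scan.pages_found > 0


def displayed_scan(history: list[Scan]) -> Optional[Scan]:
    """Return the most recent successful scan, skipping later failed ones."""
    for scan in reversed(history):
        if is_successful(scan):
            return scan
    return None


# Oldest first: a normal scan, then a crawl that found nothing.
history = [Scan(pages_found=1200, all_stages_complete=True),
           Scan(pages_found=0, all_stages_complete=True)]
shown = displayed_scan(history)
print(shown.pages_found)  # 1200
```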

Crawl Control & Actions

Can I recheck my site or pages outside of the normal crawl schedule?

Yes, it is possible to initiate a recheck at the following levels:

Single page
Multiple pages
Group of pages
Entire site

Learn more about how to re-crawl your pages, groups, and sites.

Note: Crawl duration varies depending on the number of pages on your site and the number of sites on your account crawling simultaneously.

Can I prevent specific sections of my site from being crawled?

Yes, you can set up site content settings to include and exclude content and to remove links from your site's index.

How do I cancel the crawl of a website?

To cancel or stop a crawl on a website, please contact the Siteimprove technical support team with details of the site account and URL.

System Behavior & Impact

Which products are impacted by Site Content Settings?

Site Content Settings affect data in your content site. Content sites are used by the QA, Accessibility, SEO, and Policy products.

Site content settings will not affect data in any other Siteimprove products, including Analytics, Ads, or Performance.

What steps can be taken to reduce unnecessary load on the web server during crawling?

Siteimprove uses intelligent algorithms and looks at several parameters to determine when and what to re-check. For example, we use an MD5 key to determine if the page has changed; if the page has not changed, there is no need for a recheck.
The default delay between HTTP requests is 200 milliseconds. If we suspect the crawler is affecting the site's performance, pauses of up to 20,000 milliseconds between requests are added automatically.
If necessary, pauses between HTTP requests can be added manually by Siteimprove.
We automatically stop crawling the site if we get several time-outs or if we notice internal errors from the website server.
You can change the Site Content Settings to remove links from being checked or to exclude content from the site.
The crawl can be configured to start at a particular time/day by request.
Siteimprove can exclude parts of the site from a crawl.
The crawl frequency can be changed, within limits. If you would like to change the frequency of your crawl, please contact Customer Support.
By default, we limit the number of simultaneous crawls running on one account to two at a time.

If you would like any of the above settings changed for a crawl on your website, please contact Siteimprove Support.
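
Two of the mechanisms listed above, change detection with an MD5 key and the bounded pause between requests, can be sketched as follows. Only the MD5 digest and the 200 ms and 20,000 ms figures come from this article. The function names and the doubling back-off policy are assumptions, and the actual crawler weighs several parameters that are not shown here.

```python
import hashlib
from typing import Optional

DEFAULT_DELAY_MS = 200   # default delay between requests (from this FAQ)
MAX_DELAY_MS = 20_000    # largest automatic pause (from this FAQ)


def content_key(body: bytes) -> str:
    """MD5 digest of the page body, used as a change-detection key."""
    return hashlib.md5(body).hexdigest()


def needs_recheck(body: bytes, previous_key: Optional[str]) -> bool:
    """Recheck a page if it is new or its key differs from the stored one."""
    return previous_key is None or content_key(body) != previous_key


def next_delay_ms(current_ms: int, site_is_struggling: bool) -> int:
    """Hypothetical policy: double the pause while the site looks slow,
    never exceeding the maximum, and reset to the default otherwise."""
    if not site_is_struggling:
        return DEFAULT_DELAY_MS
    return min(max(current_ms, DEFAULT_DELAY_MS) * 2, MAX_DELAY_MS)


key = content_key(b"<html>v1</html>")
print(needs_recheck(b"<html>v1</html>", key))  # False: unchanged page
print(needs_recheck(b"<html>v2</html>", key))  # True: content changed
print(next_delay_ms(16_000, True))             # 20000: capped at the maximum
```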

Are all checks performed during the crawl?

No. Many checks are performed after the crawl is complete. The image below can be used as a rough guide to illustrate checks that will typically continue after the crawl has ended.
[Image: Crawl and check sequence]

What are some typical reasons for a problem with a website crawl?

For information on this, see the article "Typical reasons for crawl problems".

Technical Details

Where can I find the IP address and User-Agent string of the crawler?

The crawler's IP addresses and User-Agent strings are listed in the article "What IP addresses and user agents are used by Siteimprove?"
