
Use the Power of Caching to Speed up Site Crawls

Using ETags for faster crawls.

Updated today

DubBot Crawling and Cache Management

Our crawler is designed to skip over content that has not changed since the last crawl. Skipping this unchanged content can measurably reduce a site's crawl time.

As the crawler prepares to inventory a page it performs the following steps:

  1. Looks for a value in the item's HTTP response headers to compare against the last cached copy of the item. The preferred value is an entity tag (ETag), but a Last-Modified header can work as a fallback.

  2. If the value matches the one saved with the last cached copy, the item is skipped in the current crawl and the cached copy is used for site calculations.

  3. If the value differs from the one saved with the last cached copy, the item is crawled normally.
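
The steps above can be sketched in a few lines of Python. This is only an illustration of the comparison logic, not DubBot's actual implementation; the function names pick_validator and should_skip, and the idea of storing the cached value as a (header name, value) pair, are assumptions made for the example.

```python
from typing import Mapping, Optional, Tuple

# Hypothetical type for the value saved with a cached copy: (header name, value).
Validator = Tuple[str, str]


def pick_validator(headers: Mapping[str, str]) -> Optional[Validator]:
    """Step 1: prefer ETag, fall back to Last-Modified, else return None."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    for name in ("etag", "last-modified"):
        if lowered.get(name):
            return name, lowered[name]
    return None


def should_skip(headers: Mapping[str, str], cached: Optional[Validator]) -> bool:
    """Steps 2 and 3: skip only when the current value equals the cached one."""
    current = pick_validator(headers)
    return current is not None and current == cached
```

With this sketch, a page whose response carries the same ETag as the cached copy returns True from should_skip, while a page with a new ETag, or with neither header, returns False and would be crawled normally.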


What is an ETag?

The server that hosts your page sends an ETag to the DubBot crawler whenever the crawler requests content.

The ETag (entity tag) response header is an identifier for a specific version of a resource. It lets caches be more efficient and save bandwidth, as a web server does not need to resend a full response if the content has not changed. Additionally, ETags help to prevent simultaneous updates of a resource from overwriting each other.
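
To make the "does not need to resend a full response" idea concrete, here is a minimal sketch of how a server might answer a conditional request. It assumes the ETag is derived from a hash of the content, which is one common scheme but not the only one, and it compares against a single If-None-Match value for simplicity (real servers also handle lists of tags and the * wildcard). The names make_etag and respond are hypothetical.

```python
import hashlib
from typing import Dict, Optional, Tuple


def make_etag(body: bytes) -> str:
    # Assumption: a strong ETag built from a content hash. ETags are quoted strings.
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(body: bytes, if_none_match: Optional[str]) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, payload) for a GET with an optional If-None-Match."""
    etag = make_etag(body)
    if if_none_match is not None and if_none_match.strip() == etag:
        # Content unchanged: 304 Not Modified with an empty payload.
        return 304, {"ETag": etag}, b""
    # First request or changed content: send the full response.
    return 200, {"ETag": etag}, body
```

A client that echoes back the ETag it received gets a 304 with no body as long as the content is unchanged, and a full 200 response once the content (and therefore the ETag) changes.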

Why ETags are Best Practice

Some of the reasons ETags are considered an industry best practice:

  • Efficient Caching: ETags act as unique identifiers for specific versions of a resource, so browsers and caches only re-download content if the ETag changes, saving bandwidth and speeding up page loads.

  • Reduced Server Load: By avoiding unnecessary data transfers for unchanged content, servers send less data and do less work per request.

  • Granular Validation: More precise than timestamps, ETags accurately track resource changes, even if the modification time stays the same.

  • Improved User Experience: Faster loading times and responsiveness lead to better user retention and lower bounce rates.


What is a Last-Modified header?

The Last-Modified response header tells the requesting crawler (DubBot) a date and time when the host server believes the resource was last modified. It is less accurate than using an ETag, but can serve as a fallback if ETags are unavailable.
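
As a rough illustration, a Last-Modified value can be parsed and compared with Python's standard library. The function name modified_since is hypothetical, and this is a sketch of the comparison rather than how DubBot performs it. Because the header carries one-second resolution, two edits made within the same second look identical, which is one reason it is less accurate than an ETag.

```python
from email.utils import parsedate_to_datetime


def modified_since(last_modified: str, cached_last_modified: str) -> bool:
    """Return True when the current Last-Modified is later than the cached one."""
    # Both values use the HTTP date format, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    current = parsedate_to_datetime(last_modified)
    cached = parsedate_to_datetime(cached_last_modified)
    return current > cached
```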


Does Your Site Use ETags?

To find out if your site uses ETags, follow the steps below:

  1. Open Browser Developer Tools: Press F12 (or right-click and select "Inspect" or "Inspect Element") in your browser (Chrome, Firefox, Edge).

  2. Go to Network Tab: Click on the "Network" tab.

  3. Reload Page: Refresh the webpage (F5 or the reload icon) to capture network requests.

  4. Select a Resource: Click on the main document (e.g., index.html) or any other file (like a .js, .css, or .jpg) in the list.

  5. Check Headers: In the right-hand panel, find the "Headers" section and look for the Response Headers.

  6. Look for ETag: See if an ETag header is listed, often appearing as ETag: "some-unique-string" or ETag: W/"another-string" (the W/ denotes a weak ETag).

There are other methods as well. Ask your favorite search engine "How to see if my site uses ETags".
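
One scripted alternative to the browser steps is to request a page's headers with Python's standard library and look for the same two fields. This is a minimal sketch: the function names fetch_headers and summarize_validators are made up for the example, and some servers answer HEAD requests differently than GET requests, so treat the result as a hint rather than proof.

```python
import urllib.request
from typing import Dict, Mapping


def fetch_headers(url: str) -> Dict[str, str]:
    """Send a HEAD request and return the response headers as a dict."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        return dict(response.headers.items())


def summarize_validators(headers: Mapping[str, str]) -> Dict[str, object]:
    """Report the ETag (and whether it is weak) and the Last-Modified value."""
    # Header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    etag = lowered.get("etag")
    return {
        "etag": etag,
        "etag_is_weak": etag is not None and etag.startswith("W/"),
        "last_modified": lowered.get("last-modified"),
    }
```

To try it, call summarize_validators(fetch_headers("https://www.example.com/")), replacing the placeholder URL with a page on your own site. An "etag" value of None means no ETag header was returned for that request.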


If you have questions, please reach out to our DubBot Support team via email at help@dubbot.com or via the blue chat bubble in the lower right corner of your screen. We are here to help!
