Site Crawler options

Four different ways to configure the crawl of a site


DubBot can crawl websites in four different ways:

  1. Crawling based on a specific URL: Use a specified URL to crawl the site's hierarchy from the provided URL forward.

  2. Crawling a full website based on a provided page within the website: Use a specified URL to crawl all links within the same domain as the provided URL.

  3. Crawling based on a Sitemap: Use a sitemap to provide the list of pages to crawl.

  4. Crawling based on a List of URLs: Manually create a list of URLs, or upload a CSV of your URL list, to crawl more targeted content.

[Screenshot: Sites options screen focused on the Crawl Type field in the DubBot app]

1. Crawling based on a specified URL

Default (Follow site hierarchy starting at URL) should be selected as the Crawler Type.

Begin crawl at URL should be set to the URL where you would like the crawler to begin. If the URL is the website's main link, the full website will be crawled. If the URL points to a subdirectory of the main website, only links within that section of the site will be crawled. For example, entering https://example.com/news/ would limit the crawl to pages under /news/.

Please note that the URL entered must be the page's actual URL, not a URL that redirects to it. Redirects will not work.
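
If you are unsure whether a URL redirects, a quick check outside of DubBot can confirm it. Below is a minimal sketch using Python's standard library; this is not part of DubBot, and the URL shown is a placeholder:

```python
import urllib.request

def resolve(url: str) -> str:
    """Fetch the URL, following any redirects, and return the final address."""
    # urllib follows HTTP redirects automatically; geturl() reports where
    # the request actually ended up.
    with urllib.request.urlopen(url) as response:
        return response.geturl()

start = "https://example.com/about"  # placeholder; substitute your own URL
final = resolve(start)
if final != start:
    print(f"{start} redirects to {final} -- enter the final URL in DubBot")
else:
    print(f"{start} is served directly -- safe to use as-is")
```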


Users can then select the Ignored Paths tab to set up any crawling exclusions within the provided URL.

2. Crawling a full website based on a provided page within the website

Full Domain (Crawl everything on the domain starting at URL) should be selected as the Crawler Type.

Begin crawl at URL should be set to the URL where you would like the crawler to begin. The crawler will then crawl all content on the same domain as the provided URL.

How is this different from the Default selection?

Even if the provided URL points to a subdirectory of the website, any links that can be crawled will be crawled, whether or not they fall within that subdirectory.
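
To make the difference concrete, here is a minimal sketch of the two scoping rules as described above. It is illustrative only, not DubBot's actual implementation, and it assumes Default keeps the crawl under the starting URL's path while Full Domain only requires the same host:

```python
from urllib.parse import urlparse

START = "https://example.com/news/"  # hypothetical starting URL

def in_default_scope(link: str) -> bool:
    """Default: follow site hierarchy -- stay under the starting URL's path."""
    start, found = urlparse(START), urlparse(link)
    return found.netloc == start.netloc and found.path.startswith(start.path)

def in_full_domain_scope(link: str) -> bool:
    """Full Domain: crawl anything on the same domain as the starting URL."""
    return urlparse(link).netloc == urlparse(START).netloc

for link in ("https://example.com/news/2024/story",   # in both scopes
             "https://example.com/about",             # Full Domain only
             "https://other.org/page"):               # neither
    print(link, in_default_scope(link), in_full_domain_scope(link))
```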

Users can select the Ignored Paths tab to set up any crawling exclusions within the provided URL.

3. Crawling based on a Sitemap

Sitemap should be selected as the Crawler Type.

Sitemap URL should be set to the URL for the sitemap. Note that an XML sitemap is required for this option.
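
For reference, a minimal sitemap in the standard sitemaps.org XML format looks like the following (the URLs are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>
```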

Page Limit allows a user to set an approximate maximum number of pages the crawler will crawl. Leave it blank, or enter '0', for no limit.


4. Crawling based on a List of URLs

List of URLs should be selected as the Crawler Type.

The URLs to Crawl section contains a field that allows users to enter specific URLs for crawling. Enter the URL in the corresponding field and click the plus sign to add it. Users may repeat this process for as many URLs as should be part of the site.

A list of URLs can also be uploaded using a CSV file. Select the Import CSV button and navigate to the file on your computer; you should then see the URLs added to the list. The CSV should contain one URL per line. Learn more about the DubBot CSV Upload Format.
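
As an illustration, a CSV in this one-URL-per-line format might look like the following (placeholder URLs; see the DubBot CSV Upload Format article for specifics):

```csv
https://example.com/
https://example.com/about
https://example.com/contact
```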

[Screenshot: List of URLs crawler type options in DubBot, with the Add URL and Import CSV buttons highlighted]

Please note the following behaviors for the List of URLs crawler:

  • If a URL is already part of a site and is added again through the interface, it will not be added as a duplicate page.

  • If a site already contains URLs and a new URL is added or a CSV is uploaded, any newly detected URLs will be added to the existing list.
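
In other words, adding URLs behaves like a deduplicating merge. A minimal sketch of that behavior (illustrative only, not DubBot's code):

```python
def merge_urls(existing: list[str], incoming: list[str]) -> list[str]:
    """Append newly seen URLs to the existing list, skipping duplicates."""
    seen = set(existing)
    for url in incoming:
        if url not in seen:
            existing.append(url)
            seen.add(url)
    return existing

site = ["https://example.com/", "https://example.com/about"]
upload = ["https://example.com/about", "https://example.com/contact"]
print(merge_urls(site, upload))
# ['https://example.com/', 'https://example.com/about', 'https://example.com/contact']
```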
