Site Crawler Options | DubBot Help Center

DubBot allows for crawling websites in four different ways:

Crawling based on a specific URL: Crawl the site's hierarchy starting from a specified URL, so that only pages at or below that URL are crawled.
Crawling a full website based on a provided page within the website: Use a specified URL to crawl all links within the same domain as the provided URL.
Crawling based on a Sitemap: Use an XML sitemap to provide the list of pages to crawl.
Crawling based on a List of URLs: Manually create a list of URLs or upload a CSV of your URL list for crawling more targeted content.
Crawl Depth Limit
Setting a Page Limit

Sites options screen focused on Crawl Type field in DubBot App

Crawling based on a specified URL

Default (Follow site hierarchy starting at URL) should be selected as the Crawler Type.

Begin crawl at URL should be set to the URL where you would like the crawler to begin crawling. If the URL is the website's main (root) URL, the full website will be crawled. If the URL points to a subdirectory of the website, only links within that section of the site will be crawled.
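The scoping rule above can be illustrated with a short sketch. This is not DubBot's implementation; the function names and the example.edu URLs are hypothetical, and the sketch assumes a link is in scope when it is on the same host and its path sits at or below the starting URL's directory.

```python
from urllib.parse import urlsplit


def scope_prefix(start_url):
    """Return the directory path that defines the crawl scope (assumed rule)."""
    path = urlsplit(start_url).path or "/"
    last = path.rsplit("/", 1)[-1]
    if "." in last:
        # The last segment looks like a file (e.g. index.html): drop it.
        path = path[: len(path) - len(last)]
    if not path.endswith("/"):
        path += "/"
    return path


def in_default_scope(start_url, candidate_url):
    """True if candidate_url is on the same host and at or below the start directory."""
    start, candidate = urlsplit(start_url), urlsplit(candidate_url)
    if start.hostname != candidate.hostname:
        return False
    candidate_path = (candidate.path or "/").rstrip("/") + "/"
    return candidate_path.startswith(scope_prefix(start_url))


# Starting at the site root keeps the whole site in scope.
print(in_default_scope("https://example.edu/", "https://example.edu/about/team.html"))  # True
# Starting in a subdirectory keeps only that section in scope.
print(in_default_scope("https://example.edu/admissions/", "https://example.edu/athletics/"))  # False
```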

Please note that the URL entered should be the page's actual URL, not a URL that redirects to the page. Redirects will not work.

Users can then select the Ignored Paths tab to set up any crawling exclusions within the provided URL.

Crawling a full website based on a provided page within the website

Full Domain (Crawl everything on the domain starting at URL) should be selected as the Crawler Type.

Begin crawl at URL should be set to the URL where you would like the crawler to begin crawling. The crawler will then crawl all content on the website that the provided URL belongs to.

How is this different from the Default selection?

Even if the URL provided is for a subdirectory within the website, any link on the domain that can be crawled will be crawled, whether or not it falls within that subdirectory.
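As a rough illustration of the difference, the sketch below treats every link on the same host as in scope, regardless of its path. The function name and the example.edu URLs are hypothetical, and the exact-host comparison is an assumption; how DubBot treats subdomains is not covered here.

```python
from urllib.parse import urlsplit


def in_full_domain_scope(start_url, candidate_url):
    """True if both URLs share the same host; the path is ignored (assumed rule)."""
    return urlsplit(start_url).hostname == urlsplit(candidate_url).hostname


start = "https://example.edu/admissions/"
# A link outside the starting subdirectory is still in scope for Full Domain.
print(in_full_domain_scope(start, "https://example.edu/athletics/schedule.html"))  # True
# A link on a different host is not.
print(in_full_domain_scope(start, "https://other.example.org/admissions/"))  # False
```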

Users can select the Ignored Paths tab to set up any crawling exclusions within the provided URL.

Crawling based on a Sitemap

Sitemap should be selected as the Crawler Type.

Sitemap URL should be set to the URL for the sitemap. Note that an XML sitemap is required for this option.
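To check which pages a sitemap would provide before pointing the crawler at it, the URLs can be listed with a few lines of Python. This is a minimal sketch that assumes a standard sitemaps.org `urlset` document; the function name and sample URLs are hypothetical, and it does not follow sitemap index files.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text) -> list:
    """Return the <loc> values of every <url> entry in an XML sitemap."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


sample = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.edu/</loc></url>"
    "<url><loc>https://example.edu/about/</loc></url>"
    "</urlset>"
)
print(urls_from_sitemap(sample))
# ['https://example.edu/', 'https://example.edu/about/']
```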

Page Limit allows a user to set an approximate maximum number of pages for the crawler to crawl. Leave the field blank or enter '0' for no limit.

Crawling based on a List of URLs

List of URLs should be selected as the Crawler Type.

The URLs to Crawl section contains a field that allows users to enter specific URLs for crawling. Enter the URL in the corresponding field and click the plus sign to add the URL. Repeat this process for each URL that should be part of the site.

A list of URLs can also be uploaded using a CSV file. Select the Import CSV button and navigate to the file on your computer. The imported URLs should then appear in the list. The CSV should contain one URL per line. Learn more about the DubBot CSV Upload Format.
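If your URLs come from a script or an export, a file with one URL per line can be generated as sketched below. The function name and sample URLs are hypothetical, and the sketch assumes no header row; check the DubBot CSV Upload Format article for the exact requirements.

```python
import csv
import io


def urls_to_csv(urls):
    """Return CSV text with one URL per line (no header row assumed)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for url in urls:
        writer.writerow([url])
    return buffer.getvalue()


csv_text = urls_to_csv([
    "https://example.edu/",
    "https://example.edu/about/",
])
print(csv_text, end="")
# https://example.edu/
# https://example.edu/about/
```

The returned text can be saved to a `.csv` file and selected with the Import CSV button.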

Please note the following behaviors for the List of URLs crawler:

If a URL that is already part of a site is added again through the interface, it will not be added as a duplicate page.
If a site already contains URLs and a new URL is added or a CSV is uploaded, the newly detected URLs will be added to the existing list.
List of URLs crawls only the pages explicitly added to the list; no link discovery happens during the crawl, so pages linked from the listed pages are not added.
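The first two behaviors amount to merging the new URLs into the existing list while skipping duplicates. The sketch below illustrates that idea only; the function name is hypothetical, and it assumes duplicates are detected by exact string match, which may differ from how DubBot compares URLs.

```python
def merge_url_lists(existing, incoming):
    """Append URLs from incoming that are not already present, keeping order."""
    merged = list(existing)
    seen = set(existing)
    for url in incoming:
        if url not in seen:
            merged.append(url)
            seen.add(url)
    return merged


current = ["https://example.edu/", "https://example.edu/about/"]
uploaded = ["https://example.edu/about/", "https://example.edu/contact/"]
print(merge_url_lists(current, uploaded))
# ['https://example.edu/', 'https://example.edu/about/', 'https://example.edu/contact/']
```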

List of URLs crawler type options in DubBot with the URLs to crawl Add URL button and the Import CSV button highlighted

Crawl Depth Limit

The Crawl Depth Limit field lets you tell the crawler how many folder levels deep (the depth) it should go.

For example:

If the Crawl Depth Limit is set to 3, the crawler would behave as follows:

Would crawl: https://usa.gov/states/ga/counties/index.html (3 folders deep)
Would not crawl: https://usa.gov/states/ga/counties/cities/index.html (4 folders deep)
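The folder count in this example can be reproduced with a short sketch. It is an illustration rather than DubBot's actual logic: the function names are hypothetical, and it assumes a final path segment containing a dot (such as index.html) is a file and does not count as a folder.

```python
from urllib.parse import urlsplit


def folder_depth(url):
    """Count the folder levels in a URL's path (file names are not counted)."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if segments and "." in segments[-1]:
        # Assume the last segment is a file name such as index.html.
        segments = segments[:-1]
    return len(segments)


def within_depth_limit(url, limit):
    """True if the URL is no deeper than the given Crawl Depth Limit."""
    return folder_depth(url) <= limit


print(folder_depth("https://usa.gov/states/ga/counties/index.html"))  # 3
print(within_depth_limit("https://usa.gov/states/ga/counties/index.html", 3))  # True
print(within_depth_limit("https://usa.gov/states/ga/counties/cities/index.html", 3))  # False
```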

Setting a Page Limit

Page Limit allows a user to set an approximate maximum number of pages for the crawler to crawl. Leave the field blank or enter '0' for no limit.
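The "blank or 0 means no limit" rule can be sketched as follows. The function names are hypothetical and the sketch applies a strict cutoff, whereas the actual crawler treats the limit as approximate.

```python
from itertools import islice


def parse_page_limit(raw):
    """Return a positive page limit, or None when the field is blank or 0."""
    text = "" if raw is None else str(raw).strip()
    limit = int(text) if text else 0
    return limit if limit > 0 else None


def apply_page_limit(urls, raw_limit):
    """Return at most the configured number of URLs, or all of them if unlimited."""
    limit = parse_page_limit(raw_limit)
    if limit is None:
        return list(urls)
    return list(islice(urls, limit))


pages = ["/a", "/b", "/c"]
print(apply_page_limit(pages, ""))   # ['/a', '/b', '/c']
print(apply_page_limit(pages, "0"))  # ['/a', '/b', '/c']
print(apply_page_limit(pages, "2"))  # ['/a', '/b']
```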

If you have questions, please contact our DubBot Support team via email at help@dubbot.com or via the blue chat bubble in the lower right corner of your screen. We are here to help!