DubBot offers four different ways to crawl a website:
Crawling based on a specified URL: Use a specified URL to crawl the site's hierarchy from the provided URL forward.
Crawling a full website based on a provided page within the website: Use a specified URL to crawl all links within the same domain as the provided URL.
Crawling based on a Sitemap: Use a sitemap to provide the list of pages to crawl.
Crawling based on a List of URLs: Manually create a list of URLs or upload a CSV of your URL list for crawling more targeted content.
1. Crawling based on a specified URL
Default (Follow site hierarchy starting at URL) should be selected as the Crawler Type.
Begin crawl at URL should be set to the URL where you would like the crawler to begin crawling. If the URL is the site's main (root) URL, the full website will be crawled. If the URL points to a subdirectory of the site, only links within that section of the site will be crawled.
Please note that the URL entered must be the actual URL of the page, not a URL that redirects to it. Redirects will not work.
Users can then select the Ignored Paths tab to set up any crawling exclusions within the provided URL.
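No code is required to configure a crawl in DubBot, but if it helps to picture the scoping described above, the sketch below illustrates the "hierarchy from the provided URL forward" idea with a hypothetical start URL. It is only an illustration, not DubBot's actual implementation.

```python
from urllib.parse import urlparse

# Hypothetical start URL for a Default (follow site hierarchy) crawl.
START_URL = "https://www.example.edu/admissions/"

def in_default_scope(url: str, start_url: str = START_URL) -> bool:
    """Illustrative only: a URL is in scope when it shares the start URL's
    host and its path begins with the start URL's path (the hierarchy
    'from the provided URL forward')."""
    start, candidate = urlparse(start_url), urlparse(url)
    return (candidate.netloc == start.netloc
            and candidate.path.startswith(start.path))

print(in_default_scope("https://www.example.edu/admissions/apply/"))  # True
print(in_default_scope("https://www.example.edu/athletics/"))         # False
```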
2. Crawling a full website based on a provided page within the website
Full Domain (Crawl everything on the domain starting at URL) should be selected as the Crawler Type.
Begin crawl at URL should be set to the URL where you would like the crawler to begin crawling. The crawler will then crawl all content within the domain of the provided URL.
How is this different from the Default selection?
Even if the provided URL points to a subdirectory within the website, all crawlable links on the domain will be crawled, whether or not they fall within that subdirectory.
Users can select the Ignored Paths tab to set up any crawling exclusions within the provided URL.
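For comparison with the Default sketch above, here is the same kind of illustration for the Full Domain crawler type, again with hypothetical URLs and not DubBot's actual implementation: any URL on the same domain is in scope, even outside the starting subdirectory.

```python
from urllib.parse import urlparse

# Hypothetical start URL for a Full Domain crawl.
START_URL = "https://www.example.edu/admissions/"

def in_full_domain_scope(url: str, start_url: str = START_URL) -> bool:
    """Illustrative only: with the Full Domain crawler type, any URL on the
    same domain is in scope, regardless of where the crawl started."""
    return urlparse(url).netloc == urlparse(start_url).netloc

# Outside the /admissions/ subdirectory, but still on the domain: in scope.
print(in_full_domain_scope("https://www.example.edu/athletics/"))  # True
# A different domain remains out of scope.
print(in_full_domain_scope("https://blog.example.edu/news/"))      # False
```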
3. Crawling based on a Sitemap
Sitemap should be selected as the Crawler Type.
Sitemap URL should be set to the URL for the sitemap. Note that an XML sitemap is required for this option.
Page Limit allows a user to set the approximate maximum number of pages the crawler will crawl. Leave blank or enter '0' for no limit.
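If you are unsure whether the URL you plan to enter points to a valid XML sitemap, a quick local check like the sketch below can fetch it and list the page URLs it declares. The sitemap URL shown is a hypothetical placeholder, and this check is not part of DubBot itself.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical sitemap URL; substitute your site's actual XML sitemap.
SITEMAP_URL = "https://www.example.edu/sitemap.xml"
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

with urllib.request.urlopen(SITEMAP_URL) as response:
    root = ET.fromstring(response.read())

# A <urlset> contains <url><loc>...</loc></url> entries; a sitemap index
# (<sitemapindex>) lists further sitemaps instead of pages.
locs = [loc.text.strip() for loc in root.findall(".//sm:loc", NS)]
print(f"{len(locs)} URLs listed, for example:")
for url in locs[:5]:
    print(" ", url)
```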
4. Crawling based on a List of URLs
List of URLs should be selected as the Crawler Type.
The URLs to Crawl section contains a field that allows users to enter specific URLs for crawling. Enter the URL in the corresponding field and click the plus sign to add it. Repeat this process for as many URLs as should be part of the site.
A list of URLs can also be uploaded using a CSV file. Select the Import CSV button and navigate to the file on your computer. You should then see the URLs added to the list. The CSV should contain one URL per line. Learn more about the DubBot CSV Upload Format.
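If your URLs already live in a script or spreadsheet export, a short sketch like the one below (with hypothetical URLs) can produce a CSV in the one-URL-per-line layout described above; refer to the DubBot CSV Upload Format documentation for the definitive format.

```python
import csv

# Hypothetical list of pages to include in the site; swap in your own URLs.
urls = [
    "https://www.example.edu/admissions/apply/",
    "https://www.example.edu/admissions/visit/",
    "https://www.example.edu/financial-aid/",
]

# Write one URL per line, matching the CSV layout described above.
with open("dubbot-urls.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for url in urls:
        writer.writerow([url])
```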
Please note the following behaviors for the List of URLs crawler:
If a URL that is already part of a site is added again through the interface, it will not be added as a duplicate page.
If a site already contains URLs and a new URL is added or a CSV is uploaded, the newly detected URLs will be added to the existing list.