DubBot allows for crawling websites in four different ways.
- Use a specified URL to perform a crawl based on the site's hierarchy from the provided URL forward.
- Use a specified URL to crawl all links within the same domain as the provided URL.
- Use a sitemap to provide a list of pages to crawl.
- Manually create a list of URLs or upload a CSV of your URL list for crawling more targeted content.
1. Crawling based on a specified URL
Under the Crawler tab, Default (Follow site hierarchy starting at URL) should be selected as the Crawler Type.
Begin Crawl at URL should be set to the URL where you would like the crawler to begin. If the URL is the main website link, the full website will be crawled. If the URL contains a subdirectory of the main website, only links within that section of the site will be crawled. For example, a URL ending in a subdirectory such as /admissions would limit the crawl to pages under that subdirectory.
Users can then select the Ignored Paths tab to set up any crawling exclusions within the provided URL.
2. Crawling a full website based on a provided page within the website
Under the Crawler tab, Full Domain (Crawl everything on the domain starting at URL) should be selected as the Crawler Type.
Begin Crawl at URL should be set to the URL where you would like the crawler to begin. The crawler will then crawl all content within the website of the provided page's URL. Even if the URL provided is for a subdirectory within the website, any links that can be crawled will be crawled, whether or not they fall within that subdirectory.
Users can select the Ignored Paths tab to set up any crawling exclusions within the provided URL.
3. Crawling based on a Sitemap
Under the Crawler tab, Sitemap should be selected as the Crawler Type.
Sitemap URL should be set to the URL for the sitemap. Note that an XML sitemap is required for this option.
Crawler Max Depth allows users to determine how deep the crawler goes within the site's hierarchy.
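Because this option requires an XML sitemap, the file at the Sitemap URL should follow the standard sitemaps.org protocol. A minimal example is sketched below; the example.com addresses are placeholders for your own site's pages:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Each <url> entry lists one page for the crawler to visit -->
  <url>
    <loc>https://www.example.com/</loc>
  </url>
  <url>
    <loc>https://www.example.com/about</loc>
  </url>
</urlset>
```

Most content management systems can generate a sitemap in this format automatically; check that the Sitemap URL you enter returns the XML file rather than an HTML page.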
4. Crawling based on a List of URLs
Under the Crawler tab, List of URLs should be selected as the Crawler Type.
The Add URLs field allows users to enter specific URLs for crawling. Enter a URL in the corresponding field and click the Add URL button. Repeat this process for as many URLs as should be part of the site.
A list of URLs can also be uploaded using a CSV file. The CSV should contain one URL per line. Select the Choose CSV button and navigate to the file on your computer; you should then see the URLs added to the list.
Please note the following behaviors for the List of URLs crawler:
- If a URL that is already part of a site is added again through the interface, it will not be added as a duplicate page.
- If a site already contains URLs and a new URL is added or a CSV is uploaded, the newly detected URLs will be appended to the existing list.
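The CSV layout described above can be sketched as follows; the example.com URLs are placeholders for the pages you want crawled:

```csv
https://www.example.com/about
https://www.example.com/contact
https://www.example.com/news
```

Each line holds exactly one URL with no header row, so a plain text file saved with a .csv extension works as well as one exported from a spreadsheet.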