Administrators can use Ignored Path settings to exclude full folders or specific pages from site crawls. This is commonly used for sites that contain legacy content that does not need to be tested, and for splitting a site across multiple site dashboards in DubBot.
To learn how to locate a Site's Settings panel, see Modifying an Existing Site's Settings.
Finding Ignored Paths Settings
Select the Ignored Paths tab in the Site Settings section of the panel to configure Ignored Paths for a Site.
Ignore Pages with URL Queries
Enable or disable the crawling of query pages with the Ignore pages with URL queries checkbox. Query pages are those that contain a ? in the URL. This setting is helpful if you find that your crawl is picking up a lot of extra query pages from your calendar, for example. In that case, simply check this box and re-run your crawl.
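For illustration only, here is a minimal TypeScript sketch of what counts as a query page, using a hypothetical calendar URL (this is not DubBot code, just the ? rule applied with the standard URL API):

    // A query page is any URL containing a ? (i.e., a non-empty query string).
    const hasQuery = (link: string): boolean => new URL(link).search !== "";

    console.log(hasQuery("https://www.benson.edu/calendar?view=month")); // true  - ignored when the box is checked
    console.log(hasQuery("https://www.benson.edu/calendar"));            // false - always eligible for the crawl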
Add a New Ignored Item
Select the Add button.
In the pop-up that appears, use the Type dropdown menu to choose the type of ignored item you want to add:
Ignored Path
Ignored Path using Regular Expression
Ignored URL using Regular Expression
These choices are explained in detail below.
Paths entered are applied to all domains entered in a Site's settings.
Ignored Path
In the New Path field, enter the full (root-relative) path to the page or folder that you want excluded from the crawl inventory.
Use the following URL as an example: https://www.benson.edu/events/2020. To exclude the 2020 folder that resides in the events folder, add the full path that follows the .edu domain extension: /events/2020. This exclusion applies only to content within the 2020 folder.
This ignore path will be applied to all domains entered for the Site.
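As a rough sketch of how such an exclusion behaves (not DubBot's actual implementation; it assumes simple prefix matching on the path portion of the URL):

    // Hypothetical prefix check: a link is ignored when its path starts with the ignored path.
    const ignoredPath = "/events/2020";

    const isIgnored = (link: string): boolean =>
      new URL(link).pathname.startsWith(ignoredPath);

    console.log(isIgnored("https://www.benson.edu/events/2020/homecoming")); // true  - inside the 2020 folder
    console.log(isIgnored("https://www.benson.edu/events/2021/spring"));     // false - a different folder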
Ignored Path using Regular Expression vs. Ignored URL using Regular Expression
Both options use regular expression syntax to ignore URLs in the Site. Ignored URL patterns are evaluated against the full URL, including the domain; Ignored Path patterns are evaluated only against the portion of the URL after the domain.
Ignored Path using Regular Expression
Using JavaScript-based regular expression syntax, more complicated exclusions can be created. Enter regular expression paths that are root-relative to the site's domain. These regular expressions will be evaluated on the crawl URLs without looking at the domain.
For instance, enter .*/example1/.* to prevent the crawler from accessing content at any depth within a folder named example1. The domain is not taken into consideration when the URL is evaluated against the regex. For a link such as https://mysite.com/elaborating/example1/, the regular expression will be tested against the following part of the URL: /elaborating/example1/.
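The following TypeScript sketch shows that evaluation, assuming the pattern entered in the field is applied to the URL's path only (an illustration, not DubBot's internals; the page.html filename is hypothetical):

    // The field value .*/example1/.* compiled as a JavaScript regular expression.
    const pattern = new RegExp(".*/example1/.*");

    const link = new URL("https://mysite.com/elaborating/example1/page.html");
    console.log(pattern.test(link.pathname)); // true - tested against /elaborating/example1/page.html only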
Ignored URL using Regular Expression
Enter regular expressions that are evaluated against the full crawl URLs, including the domain information.
For instance, enter http:\/\/mysite\.com.*\.pdf$ to prevent the crawler from accessing files with a PDF extension within the mysite.com site. This is useful for sites that allow following redirected content but want to limit what is crawled from specified domains.
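A similar sketch for the full-URL case, assuming the pattern is applied to the entire URL string (illustration only; mysite.com stands in for a real domain):

    // The same pattern as a JavaScript regular expression literal; the \/ and \. escapes
    // match literal slashes and dots, and $ anchors the match at the end of the URL.
    const pattern = /http:\/\/mysite\.com.*\.pdf$/;

    console.log(pattern.test("http://mysite.com/docs/catalog.pdf"));    // true  - ignored
    console.log(pattern.test("http://othersite.com/docs/catalog.pdf")); // false - different domain, still crawled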
Import CSV
A CSV of paths to ignore can also be uploaded using the Import CSV button.
This CSV file should be in the format value,type, where type is one of path, regex, or url_regex.
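For example, a hypothetical CSV built from the examples in this article (one value,type pair per row) might look like this:

    /events/2020,path
    .*/example1/.*,regex
    http:\/\/mysite\.com.*\.pdf$,url_regex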
More on the Site Settings panel
Site Setup (General Tab)
Ignored Paths << You are here
If you have questions, please contact our DubBot Support team via email at help@dubbot.com or via the blue chat bubble in the lower right corner of your screen. We are here to help!