Ignored Paths for Specific Sites

Used to determine which folder paths will not be crawled while DubBot inventories a site.


Administrators can use Ignored Paths settings to exclude full folders or specific pages from site crawls. Two common uses are skipping legacy content that does not need to be tested and breaking a large site into multiple site dashboards in DubBot.

To learn how to locate a Site's Settings panel, see Modifying an existing Site's Settings.

Finding Ignored Paths Settings

Select the Ignored Paths tab in the Site Settings section of the Site Settings panel to configure the Ignored Paths for a Site.

Ignored Paths button is highlighted in the Site Settings section of the DubBot app

Ignore Pages with URL Queries

Use the Ignore pages with URL queries checkbox to control whether query pages are crawled. Query pages are those that contain a ? in the URL. This setting is helpful if your crawl is picking up a lot of extra query pages, for example from a calendar. In that case, check this box and re-run your crawl.

Ignored Path tab focusing on the Ignore pages with URL queries checkbox
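As a rough illustration of what this setting filters out, the sketch below applies the definition above (a URL that contains a ?) to a short list of crawl URLs. The calendar URLs and the function name is_query_page are made up for the example, and this is not DubBot's actual implementation.

```python
def is_query_page(url: str) -> bool:
    """Return True when the URL contains a '?', the definition of a query page used above."""
    return "?" in url


# Hypothetical crawl results: one regular page and two calendar query pages.
crawl_urls = [
    "https://www.benson.edu/events/",
    "https://www.benson.edu/events/calendar?month=2020-05",
    "https://www.benson.edu/events/calendar?month=2020-06",
]

# With the setting enabled, only URLs without a query would be kept.
kept = [url for url in crawl_urls if not is_query_page(url)]
print(kept)
```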

Add a New Ignored Item

Select the Add button.

In the pop-up that appears, use the Type dropdown menu to choose the type of ignored item you want to add:

  • Ignored Path

  • Ignored Path using Regular Expression

  • Ignored URL using Regular Expression

These choices are explained in detail below.

Paths entered are applied to all domains entered in a Site's settings.

Highlighting the Add button in the Ignored Paths tab and showing the Add New Ignored Item dialog box

Ignored by Path

In the New Path field, enter the full (root-relative) path to the page or folder that you want to exclude from the crawl inventory.

Use the following URL as an example: https://www.benson.edu/events/2020.

To exclude the 2020 folder that resides in the events folder, you would add the full path that follows the .edu domain extension as part of the exclusion. That path would be /events/2020. This exclusion will only be for content within the 2020 folder.

This ignore path will be applied to all domains entered for the Site.
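The following is a minimal sketch of how a path exclusion like /events/2020 could be evaluated. It assumes the rule is "the path itself, plus anything beneath it", compared without looking at the domain. The article does not spell out DubBot's exact matching rule, so treat the function is_ignored_by_path and the alumni.benson.edu domain as hypothetical.

```python
from urllib.parse import urlsplit


def is_ignored_by_path(url: str, ignored_path: str) -> bool:
    """Assumed rule: ignore the path itself and everything inside that folder."""
    path = urlsplit(url).path          # drops the scheme, domain, and query string
    base = ignored_path.rstrip("/")    # treat "/events/2020" and "/events/2020/" alike
    return path == base or path.startswith(base + "/")


examples = [
    "https://www.benson.edu/events/2020",
    "https://www.benson.edu/events/2020/gala.html",
    "https://www.benson.edu/events/2021/",
    "https://alumni.benson.edu/events/2020/reunion",  # hypothetical second domain
]

for url in examples:
    print(url, is_ignored_by_path(url, "/events/2020"))
```

Because only the path is compared, the same exclusion applies to every domain in the example list, which mirrors the note above that an ignored path applies to all domains entered for the Site.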

Ignored Paths by Regular Expressions vs. Ignored URL using Regular Expression

Both options use Regular Expression syntax to ignore URLs in the Site. Ignored URL expressions are evaluated against the full URL, including the domain. Ignored Path expressions are evaluated only against the part of the URL that follows the domain.

Ignored Paths by Regular Expressions

Using JavaScript-based regular expression syntax, more complicated exclusions can be created. Enter regular expression paths that are root-relative to the site's domain. These regular expressions will be evaluated on the crawl URLs without looking at the domain.

For instance, enter .*/example1/.* to prevent the crawler from accessing content at any depth within a folder named example1. The domain is not taken into consideration when the URL is evaluated against the regex.

For a link such as https://mysite.com/elaborating/example1/, the regular expression will be tested against the following part of the URL: /elaborating/example1/.
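To try a path pattern before adding it, a small local check can help. The sketch below uses Python's re module as a stand-in for the JavaScript regex engine; for a simple pattern like .*/example1/.* the two behave the same. It also assumes the query string is stripped before matching, which the article does not state. The function name is hypothetical.

```python
import re
from urllib.parse import urlsplit

# The example pattern from the article, evaluated against the path only.
PATH_PATTERN = re.compile(r".*/example1/.*")


def is_ignored_by_path_regex(url: str, pattern: re.Pattern = PATH_PATTERN) -> bool:
    """Test the pattern against the part of the URL after the domain."""
    path = urlsplit(url).path  # assumption: query string is not part of the test
    return pattern.search(path) is not None


examples = [
    "https://mysite.com/elaborating/example1/",
    "https://mysite.com/example1/page.html",
    "https://mysite.com/example1",      # no trailing slash, so "/example1/" is absent
    "https://mysite.com/example10/",    # a different folder name
]

for url in examples:
    print(url, is_ignored_by_path_regex(url))
```

Note that the pattern requires a slash on both sides of example1, so a link to the folder without a trailing slash would not be excluded by this expression.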

Ignored URL using Regular Expression

Enter regular expressions that are evaluated against the full URLs of a crawl, including the domain information.

For instance, enter http:\/\/mysite\.com.*\.pdf$ to prevent the crawler from accessing PDF files on the mysite.com site. This is useful for sites that allow following redirected content but want to limit what is crawled from specified domains.
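A quick local test shows exactly what the example expression covers. As before, Python's re module stands in for the JavaScript regex engine, and the function name is hypothetical. The https? variant at the end is a suggestion for illustration, not a pattern taken from the article.

```python
import re

# The example pattern from the article, evaluated against the full URL.
URL_PATTERN = re.compile(r"http:\/\/mysite\.com.*\.pdf$")

# Suggested variant (not from the article) that also covers https URLs.
URL_PATTERN_ANY_SCHEME = re.compile(r"https?:\/\/mysite\.com.*\.pdf$")


def is_ignored_by_url_regex(url: str, pattern: re.Pattern = URL_PATTERN) -> bool:
    """Test the pattern against the full URL, including scheme and domain."""
    return pattern.search(url) is not None


examples = [
    "http://mysite.com/docs/report.pdf",
    "https://mysite.com/docs/report.pdf",             # https is not matched by "http:"
    "http://mysite.com/docs/report.pdf?download=1",   # "$" requires the URL to end in .pdf
    "http://othersite.org/docs/report.pdf",           # different domain
]

for url in examples:
    print(url, is_ignored_by_url_regex(url))
```

The example expression matches only http:// URLs that end in a lowercase .pdf. If a site is served over https, a pattern along the lines of the variant above may be closer to what is intended.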

Import CSV

A CSV of paths to ignore can also be uploaded using the Import CSV button.

Each row of the CSV file should use the format value,type, where type is one of path, regex, or url_regex.
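For a long list of exclusions, the CSV can be generated rather than typed by hand. The sketch below writes value,type rows using Python's csv module, reusing the examples from this article. It assumes the file has no header row, which the article does not confirm. The function name build_ignore_csv is hypothetical.

```python
import csv
import io

# The three type values named in the article.
VALID_TYPES = {"path", "regex", "url_regex"}


def build_ignore_csv(rows: list[tuple[str, str]]) -> str:
    """Return CSV text with one value,type row per ignored item (no header row assumed)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for value, kind in rows:
        if kind not in VALID_TYPES:
            raise ValueError(f"unknown type {kind!r}; expected one of {sorted(VALID_TYPES)}")
        # csv.writer quotes a value automatically if it contains a comma or a quote.
        writer.writerow([value, kind])
    return buffer.getvalue()


rows = [
    ("/events/2020", "path"),
    (".*/example1/.*", "regex"),
    (r"http:\/\/mysite\.com.*\.pdf$", "url_regex"),
]

csv_text = build_ignore_csv(rows)
print(csv_text)
```

Using a CSV writer instead of joining strings by hand matters mainly for regular expressions that contain a comma, such as a quantifier like {1,3}, because the value is then quoted so it is not split into two columns.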

More on the Site Settings panel

If you have questions, please contact our DubBot Support team via email at help@dubbot.com or via the blue chat bubble in the lower right corner of your screen. We are here to help!
