Administrators can use Ignored Path settings to exclude full folders or specific pages from site crawls. This is commonly used for sites that contain legacy content that does not need to be tested, and for splitting a site across multiple site dashboards in DubBot.
To learn how to locate a Site's Settings panel, see Modifying an Existing Site's Settings.
Finding Ignored Paths Settings
Select the Ignored Paths tab in the Site Settings section of the panel to configure Ignored Paths for a Site.
Ignore Pages with URL Queries
Enable or disable the crawling of query pages with the Ignore pages with URL queries checkbox. Query pages are those that contain a ? in the URL. This setting is helpful if you find that your crawl is picking up a lot of extra query pages from your calendar, for example. In that case, simply check this box and re-run your crawl.
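For illustration only, here is a minimal TypeScript sketch of what counts as a query page, using a hypothetical calendar URL (this is not DubBot code, just the ? rule applied with the standard URL API):

    // A query page is any URL containing a ? (i.e., a non-empty query string).
    const hasQuery = (link: string): boolean => new URL(link).search !== "";

    console.log(hasQuery("https://www.benson.edu/calendar?view=month")); // true  - ignored when the box is checked
    console.log(hasQuery("https://www.benson.edu/calendar"));            // false - always eligible for the crawl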
Add a New Ignored Item
Select the Add button.
In the pop-up that appears, use the Type dropdown menu to choose the type of ignored item you want to add:
Ignored Path
Ignored Path using Regular Expression
Ignored URL using Regular Expression
These choices are explained in detail below.
Paths entered are applied to all domains entered in a Site's settings.
Ignored Path
In the New Path field, enter the full (root-relative) path to the page or folder that you want excluded from the crawl inventory.
Use the following URL as an example: https://www.benson.edu/events/2020. To exclude the 2020 folder that resides in the events folder, add the full path that follows the .edu domain extension: /events/2020. This exclusion applies only to content within the 2020 folder.
This ignore path will be applied to all domains entered for the Site.
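As a rough sketch of how such an exclusion behaves (not DubBot's actual implementation; it assumes simple prefix matching on the path portion of the URL):

    // Hypothetical prefix check: a link is ignored when its path starts with the ignored path.
    const ignoredPath = "/events/2020";

    const isIgnored = (link: string): boolean =>
      new URL(link).pathname.startsWith(ignoredPath);

    console.log(isIgnored("https://www.benson.edu/events/2020/homecoming")); // true  - inside the 2020 folder
    console.log(isIgnored("https://www.benson.edu/events/2021/spring"));     // false - a different folder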
Ignored Path using Regular Expression vs. Ignored URL using Regular Expression
Both options use regular expression syntax to ignore URLs in the Site. Ignored URL patterns are evaluated against the full URL, including the domain; Ignored Path patterns are evaluated only against the portion of the URL after the domain.
Ignored Path using Regular Expression
Using JavaScript-based regular expression syntax, more complicated exclusions can be created. Enter regular expression paths that are root-relative to the site's domain. These regular expressions will be evaluated on the crawl URLs without looking at the domain.
For instance, enter .*/example1/.* to prevent the crawler from accessing content at any depth within a folder named example1. The domain is not taken into consideration when the URL is evaluated against the regex. For a link such as https://mysite.com/elaborating/example1/, the regular expression will be tested against the following part of the URL: /elaborating/example1/.
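The following TypeScript sketch shows that evaluation, assuming the pattern entered in the field is applied to the URL's path only (an illustration, not DubBot's internals; the page.html filename is hypothetical):

    // The field value .*/example1/.* compiled as a JavaScript regular expression.
    const pattern = new RegExp(".*/example1/.*");

    const link = new URL("https://mysite.com/elaborating/example1/page.html");
    console.log(pattern.test(link.pathname)); // true - tested against /elaborating/example1/page.html only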
Ignored URL using Regular Expression
Enter regular expressions that are evaluated against the full crawl URLs, including the domain information.
For instance, enter http:\/\/mysite\.com.*\.pdf$ to prevent the crawler from accessing files with a PDF extension within the mysite.com site. This is useful for sites that allow following redirected content but want to limit what is crawled from specified domains.
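A similar sketch for the full-URL case, assuming the pattern is applied to the entire URL string (illustration only; mysite.com stands in for a real domain):

    // The same pattern as a JavaScript regular expression literal; the \/ and \. escapes
    // match literal slashes and dots, and $ anchors the match at the end of the URL.
    const pattern = /http:\/\/mysite\.com.*\.pdf$/;

    console.log(pattern.test("http://mysite.com/docs/catalog.pdf"));    // true  - ignored
    console.log(pattern.test("http://othersite.com/docs/catalog.pdf")); // false - different domain, still crawled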
Import CSV
A CSV of paths to ignore can also be uploaded using the Import CSV button.
This CSV file should be in the format value,type, where type is one of path, regex, or url_regex.
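For example, a hypothetical CSV built from the examples in this article (one value,type pair per row) might look like this:

    /events/2020,path
    .*/example1/.*,regex
    http:\/\/mysite\.com.*\.pdf$,url_regex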
More on the Site Settings panel
Site Setup (General Tab)
Ignored Paths << You are here
If you have questions, please contact our DubBot Support team via email at help@dubbot.com or via the blue chat bubble in the lower right corner of your screen. We are here to help!