If your DubBot Site or Page Set is indexing pages that do not belong to it, or that are not even on your live website's domain, you may be wondering why.
When DubBot crawls a site, it inventories every link on each crawled page. It then checks whether each link falls within the section defined in the Site Settings. If it does, the linked webpage is added to DubBot’s inventory.
If a link is determined to be part of the DubBot Site but then redirects to a section outside of the Site, or even to another website, DubBot will pick up the destination webpage and add it to your Site.
As an example, consider a site with the Starting URL of https://dubbot.com/training/. The start page of the site has a link to “support.html” but that link actually redirects to “https://help.dubbot.com/”. The site will end up with help.dubbot.com as a part of the website inventory.
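The scenario above can be sketched in a few lines of Python. This is an illustration only, not DubBot's actual crawler logic: the function name `is_in_section` and its host-plus-path-prefix rule are assumptions made for the example. It shows that the link as written on the page looks like part of the Site, while the page it redirects to does not.

```python
from urllib.parse import urljoin, urlsplit


def is_in_section(url: str, starting_url: str) -> bool:
    """Return True if url has the same host and path prefix as starting_url.

    Hypothetical matching rule for illustration; DubBot's real rule may differ.
    """
    target, start = urlsplit(url), urlsplit(starting_url)
    return (
        target.netloc.lower() == start.netloc.lower()
        and target.path.startswith(start.path)
    )


starting_url = "https://dubbot.com/training/"

# The link as written on the start page, resolved against the page's URL.
link = urljoin(starting_url, "support.html")

# Where that link ends up after the redirect described in the example.
final_url = "https://help.dubbot.com/"

print(link, is_in_section(link, starting_url))
print(final_url, is_in_section(final_url, starting_url))
```

Because the in-section decision is made on the link before the redirect is followed, the destination page can land in the inventory even though it would fail the same check.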
DubBot also uses your sitemap.xml file to crawl your site and inventory your pages. If any of your sitemap URLs redirect to web pages outside of your website, those destination pages are returned to DubBot as well.
DubBot checks for 404 responses, so broken links won’t be crawled, but URLs that return a 301 response (a redirect) are still crawled. One reason redirects are allowed is that you may be redirecting to a new page in the site that hasn’t been added to the sitemap yet.
To prevent outside pages from being crawled, review your sitemap regularly to check for redirects. When you create a redirect, check your sitemap for the original URL and remove it. This is a best practice for SEO and will keep your site in good standing. You don’t want a search engine indexing a URL for your site that points to a third-party link; this could cause the search engine to direct traffic to that third-party site in the future instead of yours.
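One way to review a sitemap for redirects is a small script like the sketch below. It is not a DubBot feature: the function names (`sitemap_urls`, `resolve_final_url`, `find_redirected_entries`) are hypothetical, and the sample sitemap and redirect mapping are made up to mirror the earlier example. It assumes a standard sitemap.xml that uses the sitemaps.org namespace.

```python
import urllib.request
import xml.etree.ElementTree as ET
from typing import Callable, Iterable, List, Tuple
from urllib.parse import urlsplit

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml: str) -> List[str]:
    """Extract every <loc> value from a sitemap.xml document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


def resolve_final_url(url: str) -> str:
    """Follow redirects over the network and return the final URL.

    urlopen raises urllib.error.HTTPError for responses such as 404.
    """
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.geturl()


def find_redirected_entries(
    urls: Iterable[str],
    resolve: Callable[[str], str] = resolve_final_url,
) -> List[Tuple[str, str, bool]]:
    """Return (sitemap_url, final_url, leaves_host) for entries that redirect."""
    results = []
    for url in urls:
        final = resolve(url)
        if final != url:
            leaves_host = (
                urlsplit(final).netloc.lower() != urlsplit(url).netloc.lower()
            )
            results.append((url, final, leaves_host))
    return results


# Made-up sitemap and redirect table so the example runs offline.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://dubbot.com/training/</loc></url>
  <url><loc>https://dubbot.com/training/support.html</loc></url>
</urlset>"""

fake_redirects = {
    "https://dubbot.com/training/support.html": "https://help.dubbot.com/",
}

flagged = find_redirected_entries(
    sitemap_urls(sample),
    resolve=lambda u: fake_redirects.get(u, u),
)
for entry in flagged:
    print(entry)
```

To check a live site, drop the `resolve` argument so the default network resolver is used. Any entry that comes back flagged is a candidate for removal from the sitemap, especially when the last field shows that the redirect leaves your host.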