If pages appear in one of your DubBot sites that are not part of that site or group, or are not even on your live website domain, you may be wondering why.
When DubBot crawls a site, it inventories every link on each crawled page. It then decides whether each link belongs to the site by checking whether the link falls within the section defined in the Site Settings. If the link is part of that section, the webpage is added to DubBot’s inventory.
If a link is determined to be part of the DubBot site but then redirects to a section outside of the Site, or even to another website, DubBot will pick up the page the redirect lands on and add it to your Site.
For example, suppose my site URL is https://dubbot.com/training/ and my Training landing page has a link to “support.html”. If that link actually redirects to “https://help.dubbot.com/”, I will end up with help.dubbot.com as a part of my website inventory.
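The behavior in this example can be illustrated with a small sketch. This is not DubBot's actual code: the prefix check in in_section, the REDIRECTS table standing in for the server's responses, and the inventory_links helper are all assumptions made for illustration. The point it shows is that the link is judged against the section as it is written on the page, and the redirect is only followed afterwards.

```python
from urllib.parse import urljoin

SITE_URL = "https://dubbot.com/training/"

# Hypothetical stand-in for the server's redirect responses.
REDIRECTS = {
    "https://dubbot.com/training/support.html": "https://help.dubbot.com/",
}


def in_section(url, site_url=SITE_URL):
    """Assumed scope rule: a URL is in the section if it starts with the site URL."""
    return url.startswith(site_url)


def inventory_links(page_url, hrefs):
    """Return the pages that end up in the inventory for one crawled page."""
    inventory = []
    for href in hrefs:
        link = urljoin(page_url, href)  # resolve the link as written on the page
        if not in_section(link):
            continue  # out-of-section links are skipped
        # The scope check has already passed, so the redirect target is kept.
        inventory.append(REDIRECTS.get(link, link))
    return inventory


pages = inventory_links(SITE_URL, ["support.html", "https://example.com/"])
print(pages)
```

Here “support.html” passes the section check as https://dubbot.com/training/support.html, so the page it redirects to is the one that lands in the inventory.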
In some cases, a link may be a generic redirect meant to work from any section in which it is loaded. To prevent these pages from being considered a part of every DubBot site, consider whether the link could be written as root-relative, starting with “/”, e.g. href="/root-relative-link" instead of href="relative-link-path".
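To see why the root-relative form helps, here is a minimal sketch using Python's standard urljoin to resolve both styles of link from the Training landing page. The file name “support.html” comes from the example above; the simple prefix test for "is this in the section" is an assumption for illustration, not DubBot's documented rule.

```python
from urllib.parse import urljoin

page_url = "https://dubbot.com/training/"

# A relative link resolves underneath whichever section the page lives in.
relative = urljoin(page_url, "support.html")

# A root-relative link resolves against the domain root, whatever the section.
root_relative = urljoin(page_url, "/support.html")

print(relative)       # https://dubbot.com/training/support.html
print(root_relative)  # https://dubbot.com/support.html

# Assumed scope rule: in the section if the URL starts with the site URL.
print(relative.startswith(page_url))       # True
print(root_relative.startswith(page_url))  # False
```

The relative link resolves inside /training/ and so looks like part of the section, while the root-relative link resolves to the same address from every section and stays outside it.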
DubBot uses your sitemap.xml file to crawl your site and find your pages. If some of your sitemap URLs redirect to web pages outside of your website, then those get returned to DubBot as well.
DubBot checks for 404 responses, so broken links won’t be crawled, but URLs that return 301 responses (redirects) are still crawled. One reason this is allowed is that you may be redirecting to a new page in your site that hasn’t been added to the sitemap yet.
To prevent outside pages from being crawled, review your sitemap often to check for redirects. When you create a redirect, check your sitemap for the original URL and remove it. This is a best practice for SEO and will keep your site in good standing. You don't want a search engine indexing a URL for your site that really points to a third-party link; this could cause the search engine to direct traffic to that third-party site in the future instead of yours.
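One way to make this review routine is a small script that lists every sitemap URL that redirects and flags the ones whose destination leaves the site. The following is a sketch under stated assumptions, not a DubBot feature: the function names are hypothetical, the sample sitemap and the FAKE_REDIRECTS table are made up so the example runs offline, and "outside the site" is approximated by a simple prefix check.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    locs = root.findall("sm:url/sm:loc", SITEMAP_NS)
    return [loc.text.strip() for loc in locs if loc.text]


def final_url(url):
    """Request the URL over the network and return where it ends up.

    urlopen follows redirects automatically; response.url is the final address.
    """
    with urlopen(url, timeout=10) as response:
        return response.url


def find_redirects(urls, site_url, resolve=final_url):
    """List (sitemap URL, destination, leaves_site) for each URL that redirects."""
    report = []
    for url in urls:
        destination = resolve(url)
        if destination != url:
            leaves_site = not destination.startswith(site_url)
            report.append((url, destination, leaves_site))
    return report


# Made-up sitemap and redirect table so the example runs without a network.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://dubbot.com/training/</loc></url>
  <url><loc>https://dubbot.com/training/support.html</loc></url>
</urlset>"""

FAKE_REDIRECTS = {
    "https://dubbot.com/training/support.html": "https://help.dubbot.com/",
}

report = find_redirects(
    sitemap_urls(SAMPLE_SITEMAP),
    "https://dubbot.com/training/",
    resolve=lambda url: FAKE_REDIRECTS.get(url, url),
)
print(report)
```

Against a live site, call find_redirects without the resolve argument so that final_url performs real requests. Every entry in the report is a candidate for removal from the sitemap, and entries marked as leaving the site are the ones that would pull outside pages into the crawl.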