How to properly prevent pages from being indexed in Google

This “Indexed, though blocked by robots.txt” notice appears in Google Search Console when Googlebot indexes pages even though you have blocked them in the robots.txt file. It happens because robots.txt only restricts crawling, not indexing: if Google discovers a blocked URL through links, it can still add that URL to the index without reading its content. There are several ways to resolve the issue, such as adjusting your robots.txt rules, adding noindex directives to the affected pages, or contacting Google support for recommendations.

Previously, we published an article about indexing sites through the Google API, with detailed instructions for setting up page indexing in the Google Cloud Console.

Ways to block pages from indexing

There are several ways to block search bots from indexing your website pages (rough examples of each method are shown after the list):

1. HTTP header X-Robots-Tag. This is an elegant and unobtrusive method configured on the server side. It is not visible in the page's source code, but you can see it in the developer tools, in the Network tab, among the response headers. The setup is usually handled by developers or server administrators.

2. Meta robots. This reliable method involves adding a robots meta tag with the desired attributes to the <head> section of the page. I prefer the noindex, follow combination, which explicitly tells Google not to index the page while still following the links on it.

3. JavaScript. Developers often suggest hiding content with scripts written in JavaScript; ready-made scripts are also freely available online. The appeal of this method is that visitors can open the page and see the content, while search robots that do not execute JavaScript cannot. Keep in mind, though, that Googlebot can render JavaScript, so this is not a guaranteed block.

4. Canonical attribute. To deal with duplicate pages, you can add a link tag with the canonical attribute and point it to the main version of the page. This trick used to work consistently, but lately it is a 50/50 bet: Google treats rel=canonical as a hint rather than a directive and can ignore it.
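
For illustration, here is roughly what each of these methods looks like. The domain, paths, and element names below are made up for the example and are not taken from a real project.

X-Robots-Tag header, set on the server (nginx syntax; Apache can set the same header with the Header directive from mod_headers):

    add_header X-Robots-Tag "noindex, nofollow";

Robots meta tag, placed inside the <head> section of the page:

    <meta name="robots" content="noindex, follow">

A minimal JavaScript sketch of the third method: the content is loaded by the browser after the page is rendered, so crawlers that do not execute JavaScript never see it (remember that Googlebot can render JavaScript, so treat this only as a sketch, not a guaranteed block):

    // Hypothetical example: /private-content.html and the #content container are placeholders
    document.addEventListener('DOMContentLoaded', function () {
      fetch('/private-content.html')
        .then(function (response) { return response.text(); })
        .then(function (html) {
          // Inject the fetched HTML into the page only in the browser
          document.getElementById('content').innerHTML = html;
        });
    });

Canonical link on a duplicate page, pointing to the main version:

    <link rel="canonical" href="https://example.com/main-page/">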

All information can be found in the Google documentation https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag.

What about robots.txt?

Contrary to popular belief, robots.txt does not block the indexing of pages on your site, and Google explicitly states this in its documentation. Its main function is to prevent certain sections from being crawled. However, Google may still include content in the index that you would rather not see in search results.
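
For context, a typical robots.txt block looks like this (the paths are made up); it only tells bots not to crawl the matching URLs and says nothing about indexing:

    User-agent: *
    Disallow: /admin/
    Disallow: /search/

If internal or external links point to one of these URLs, Google can still put it in the index without its content, and that is exactly what triggers the “Indexed, though blocked by robots.txt” notice.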

Google instructions

Indexing trap

Now I will explain why I am bringing this up. A problem arose on a site that I am promoting: many pages with GET parameters ended up in the index, cluttering the search results with unnecessary duplicates.

To avoid creating extra work for my beloved developers, I decided to quickly close such pages in the robots.txt file and move on to other tasks. But, as you have already guessed, this did not solve the problem: the pages kept appearing in the index. At some point I assumed that Google would sort it out by itself and these technical URLs would not affect the main pages in any way. Alas, the pages with parameters started to outrank the main pages.
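
For reference, that quick robots.txt fix was a rule along these lines (the exact pattern here is an illustration, not the real rules from the project):

    User-agent: *
    Disallow: /*?*

Google supports the * wildcard, so this blocks crawling of any URL with GET parameters, but it does not remove URLs that are already in the index.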

This is the “Indexed, though blocked by robots.txt” error

What should we do?

To remove unnecessary pages from the index, follow these steps (a sketch of the corresponding robots.txt changes is shown after the list):

1. Close the relevant sections from indexing using one of the two methods described above that actually remove pages from the index: the X-Robots-Tag header or the robots meta tag.

2. Allow the bot to crawl these sections in the robots.txt file; otherwise Googlebot will never see the noindex instruction.

3. Wait until all the affected pages have been removed from the index.

4. Once deindexing is complete, close these sections in robots.txt again so they don't get in Google's way and drain your crawl budget.
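
A rough sketch of what these steps look like, using a made-up rule for pages with GET parameters. During deindexing (steps 1-3) the affected pages carry a noindex tag and robots.txt does not block them, so Googlebot can crawl them and see the tag:

    # robots.txt during deindexing
    User-agent: *
    # Disallow: /*?*   <- temporarily removed

    <!-- on each page that must leave the index -->
    <meta name="robots" content="noindex, follow">

After the pages have dropped out of the index (step 4), the crawl ban goes back:

    # robots.txt after deindexing
    User-agent: *
    Disallow: /*?*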

If you need help with site indexing, write to us via the contacts at Seo House.
