
The robots.txt file is not valid

What is the robots.txt file?

The robots.txt file is a text file located at the root of your site that is intended to provide instructions to search engine crawlers (spiders).

Among other things, it allows you to:

  • define access rules for the various pages of your site, i.e. indicate which pages crawlers may or may not explore
  • declare one or more sitemap files: files that list all the pages of your site

An example of a robots.txt file, accessible via the URL https://www.tikamoon.com/robots.txt:

User-Agent: *
Allow: /
Disallow: /V2/
Disallow: /recherche
Disallow: /articlepopup.php*
Disallow: /recommander.php*
Disallow: *filtreprix=*
Disallow: *action=*
Disallow: *artid=*
sitemap : https://www.tikamoon.com/sitemap.xml

Why is it important to have this file?

It is important to have a robots.txt file because it lets you clearly define the access rules of your website, and it is also where you declare its sitemap file. When part of your site should not be explored, for security reasons or because it adds nothing, it is worth prohibiting the exploration of those pages. Prohibiting exploration does not mean the pages cannot be indexed (unlike a "noindex" instruction), but it makes their indexing very unlikely. The main benefit is that robots do not waste time (crawl budget) analyzing the content of pages that you do not want in the SERPs.

For example, if part of your community site contains user profile pages that are thin in content and added value, it is better to block access to them so that robots spend their time mainly on your pages with added value, as in the sketch below.
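A minimal sketch of such a rule, assuming the profile pages live under a hypothetical /profile/ path (adapt it to your own URL structure):

User-Agent: *
# Block thin user profile pages (hypothetical path)
Disallow: /profile/
Allow: /

With this rule, compliant crawlers skip every URL starting with /profile/ and keep their crawl budget for the rest of the site.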

If this file is absent or, more generally, if an HTTP 4xx error occurs when it is fetched, robots consider that they are authorized to explore your entire site, which can be a problem as they will eventually explore pages that you did not want them to explore.

If fetching this file fails with an HTTP 5xx error or no response at all (a timeout, for example), robots consider that they are not allowed to explore any of your site, and your pages have very little chance of appearing in the SERPs.

Similarly, if the directives contain syntax errors, robots may misinterpret your intentions and explore pages that should not be explored, or vice versa.
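For instance, a missing colon means the line no longer matches the "name:value" format, so most parsers simply ignore it and the path stays crawlable. A hedged illustration reusing the /V2/ path from the example above:

# Intended to block /V2/, but the colon is missing, so the line is ignored
Disallow /V2/
# Corrected directive
Disallow: /V2/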

How to fix the robots.txt file

To ensure that the robots.txt file is valid, it is necessary to:

  • check that the https://mywebsite.com/robots.txt page returns a text file with an HTTP 200 code within a few seconds
  • if the file does not exist or returns an HTTP 404 code, create it (a minimal example is given after this list)
  • if it returns an HTTP code other than 200, or a non-text file, intervene at the server or application level to restore access to it
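If you need to create the file from scratch, a minimal sketch that allows the whole site and declares a sitemap could look like this (replace the domain with your own, and only keep the Sitemap line if that file actually exists):

User-Agent: *
Allow: /

Sitemap: https://mywebsite.com/sitemap.xml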

Also check the syntax of the file by following these instructions (a compliant example follows the list):

  • only empty lines, comments and directives corresponding to the "name:value" format are allowed in the robots.txt file
  • make sure that the allow and disallow values are empty or start with / or *
  • do not use $ in the middle of a value (for example, Allow: /file$.html)
  • make sure there is a value for user-agent
  • make sure there are no allow or disallow directives before user-agent
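Putting these rules together, a compliant file might look like the sketch below (the paths are purely illustrative): the user-agent line comes first, every allow and disallow value starts with / or *, $ only appears at the end of a value, and comments start with #.

# Example of a syntactically valid robots.txt
User-Agent: *
Allow: /
Disallow: /private/
Disallow: *sessionid=*
Disallow: /*.pdf$
Sitemap: https://mywebsite.com/sitemap.xml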