Adding sitemaps to robots.txt

The XML Sitemap & Google News plugin can add both sitemaps to robots.txt automatically, but only if you let WordPress itself handle robots.txt requests. This means that (1) there must be no static robots.txt file in your site root and (2) there must be no .htaccess or nginx rules that prevent robots.txt requests from reaching index.php.

On a default WordPress installation there is no static robots.txt file and all robots.txt requests are handled by WordPress. When you visit your site’s yourdom.ain/robots.txt location in a browser, you should see:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Or, if you have the option Discourage search engines from indexing this site activated on Settings > Reading, you’ll see:

User-agent: *
Disallow: /

When the XML Sitemap & Google News plugin is active, the corresponding sitemap locations are added to this response automatically, depending on which of the sitemaps are enabled on Settings > Reading.

Verify your robots.txt

To verify if your site is responding to robots.txt requests correctly, follow these steps:

  1. Go to your site’s Settings > Reading admin page and find the section Additional robots.txt rules where you’ll see a link “robots.txt” in the description.
  2. Follow this link to open your site’s robots.txt location in your browser (new tab).
  3. The response should now look something like:
# XML Sitemap & Google News version 5.2.7 - https://status301.net/wordpress-plugins/xml-sitemap-feed/
Sitemap: https://yourdom.ain/sitemap.xml
Sitemap: https://yourdom.ain/sitemap-news.xml

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Note: If you do not have both sitemaps activated on Settings > Reading, then there will be only one line starting with “Sitemap:” in the response.

What if my robots.txt response is different?

For various reasons, you might get a different response when visiting your site’s robots.txt location. It might be a 404 (not found) or another error response. Or it might be a text response but with other content.

Error response

If you get a “404 Not Found” page or text response, or another error response like “500 Internal Server Error”, then most likely your hosting environment is configured to expect a static file when a .txt request is made and, when no such file is found, is not configured to process any further .htaccess rules or hand the request off to the default index.php file.

To fix this issue, you can either ask your hosting provider about it or try to find the cause yourself in your .htaccess file or server configuration.
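
For reference, on a typical Apache setup the default WordPress block in .htaccess looks something like the sketch below (your file and server configuration may differ). The two RewriteCond lines are what allow a request for a file that does not exist on disk, such as a dynamic robots.txt, to fall through to index.php so WordPress can answer it. Rules that handle .txt requests before this block, or a missing block altogether, are common causes of the problem.

# BEGIN WordPress
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteRule ^index\.php$ - [L]
# If the requested file or directory does not exist on disk (as with a
# dynamic robots.txt), pass the request on to WordPress.
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>
# END WordPress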

Or you can try to work around it by manually creating a static robots.txt file:

  1. Open a plain-text editor like Notepad (Windows), TextEdit (MacOS), Gedit/Kate/Mousepad/Leafpad (Linux) or any of the many other text editors out there, as long as it can save files in plain-text format.
  2. Copy the example robots.txt response above and paste that into the editor.
  3. Replace https://yourdom.ain with your site’s domain and remove either of the two Sitemap lines if the corresponding sitemap is not activated on your site.
  4. Save the file as robots.txt
  5. Upload it to your site root with your preferred FTP program or your hosting provider’s File Manager.

Test your robots.txt location again in your browser to make sure it was uploaded to the correct directory and in the correct format.

Note: if you enable or disable either of the two sitemaps on Settings > Reading, do not forget to manually update the static robots.txt file accordingly.

Other content

If you get a text response but with other content, then most likely there is a static robots.txt file in your site root. You can verify its existence with your preferred FTP program or your hosting provider’s file manager, or by going to your site’s admin page Settings > XML Sitemap (if you have the Sitemap Index enabled) or Settings > Google News (if you have the Google News sitemap enabled) and using the button Check for conflicts.

If a static file is found, there will be an admin message about it. The message includes an option to delete the file, but before you do, check whether any of its content should be kept.

Questions to consider: “How did it get there?” and “What is its purpose?” Read on for the implications of these questions and their answers.

How did it get there?

It might have been added manually by you or another/previous site admin, or created by another plugin. Its purpose might be to hide certain parts of your site from search engines or control indexing speed/frequency to reduce server load.

Now, if it was created manually or by a plugin that is no longer active, then you can simply move on to the next question.

But if it was created by an active (SEO) plugin, it will likely be created again at some point after deletion. So you’ll need to decide either to stop using that plugin’s functionality to manage robots.txt, or to keep using the static robots.txt file. In the latter case, use the plugin in question to add the needed sitemap rules:

Sitemap: https://yourdom.ain/sitemap.xml
Sitemap: https://yourdom.ain/sitemap-news.xml
  1. Copy the lines above and paste them into the plugin’s robots.txt editor.
  2. Replace https://yourdom.ain with your site’s domain and remove either of the two Sitemap lines if the corresponding sitemap is not activated on your site.
  3. Save settings and test your robots.txt response.

What is its purpose?

A robots.txt file is a powerful tool: it can be used to block search engines from certain parts of your site or to control indexing speed to reduce server load. But whatever its purpose, this might be a good opportunity to reconsider each rule you find there.

In most cases, the default WordPress-generated robots.txt response is just fine. If you wish to keep certain pages out of search results or prevent duplicate content, the general advice is to use noindex meta tags or response headers instead of robots.txt rules. These can be managed with most good SEO plugins.

There are many web pages explaining what can be done with a robots.txt file. A good place to start is https://www.semrush.com/blog/beginners-guide-robots-txt/

Note: be very careful when disallowing complete directories in your robots.txt response. If there are files (images, video, JavaScript, etc.) in a blocked directory that are used on pages that do get indexed, then parts of those pages will be “invisible” or, in the case of JavaScript, non-functional to the search engine. This can negatively impact your SEO results!
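
For example (a hypothetical rule, not something the plugin adds), disallowing the wp-content directory would hide all uploaded images, theme stylesheets and plugin JavaScript from search engines:

User-agent: *
# Risky: this blocks uploads, theme CSS and plugin JavaScript from crawlers.
Disallow: /wp-content/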

If you decide that no rules are worth keeping, or that they can be replaced by your SEO plugin settings, you can simply make a backup of the file and then delete it. WordPress should then start responding to robots.txt requests itself.

But if you need to keep certain rules, you have two ways to proceed:

  1. Replace the static robots.txt file with a dynamic one, managed from within your WordPress (see below).
  2. Modify the static robots.txt file (see further below).

Replace a static robots.txt file with a dynamic response

What I suggest is:

  1. Rename your static robots.txt file to robots.bak.
  2. Visit your site’s /robots.txt in your browser to see if you get a valid response.

If you get a 404 page response or error, your easiest option is to just rename robots.bak to robots.txt again and add the two sitemaps manually. See instructions under “Modify or create a static robots.txt file” below.

But if all is well, your browser should show the default WordPress-generated robots.txt with the sitemaps added by the plugin, similar to the verification example shown earlier:
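# XML Sitemap & Google News version 5.2.7 - https://status301.net/wordpress-plugins/xml-sitemap-feed/
Sitemap: https://yourdom.ain/sitemap.xml
Sitemap: https://yourdom.ain/sitemap-news.xml

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

(The plugin version and domain will reflect your own site, and there will be only one Sitemap line if only one of the sitemaps is enabled.)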

Next, open your old robots.bak file and copy any custom rules that you wish to keep. Then go to Settings > Reading in your WordPress admin and paste them into the field Additional robots.txt rules to add them to your new dynamic robots.txt response.

Attention: these custom rules will be automatically appended to the robots.txt response, below the default rules. You may need to modify your rules to make sense in that context!
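
For example, assuming a single hypothetical custom rule, Disallow: /private/, is pasted into that field, the resulting response would look something like:

# XML Sitemap & Google News version 5.2.7 - https://status301.net/wordpress-plugins/xml-sitemap-feed/
Sitemap: https://yourdom.ain/sitemap.xml
Sitemap: https://yourdom.ain/sitemap-news.xml

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /private/

In this position, the appended Disallow line reads as part of the preceding User-agent: * group, which is why rules written for a stand-alone file (for example, rules with their own User-agent lines) may need adjusting.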

Modify or create a static robots.txt file

Follow the steps below to add references to your sitemaps to your static robots.txt file.

A. Sitemap references

Reference for the regular XML Sitemap:

Sitemap: https://yourdom.ain/sitemap.xml

Reference for the Google News Sitemap:

Sitemap: https://yourdom.ain/sitemap-news.xml

Notes:
– If you use both sitemaps, add both lines; otherwise add only the line for the one you use.
– Replace https://yourdom.ain with your site’s domain.

B. User agent rules

Here are common robots.txt rules to start with. Modify them to your needs, leave them as they are, or discard them if you already have user agent rules in an existing robots.txt file:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
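
For example, a static robots.txt combining both sitemap references with these common rules (assuming both sitemaps are active and with your own domain filled in) would look like:

Sitemap: https://yourdom.ain/sitemap.xml
Sitemap: https://yourdom.ain/sitemap-news.xml

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php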

Add the appropriate combination of these rules to a static file, following these steps:

  1. If there is an existing static robots.txt file already, then use your preferred FTP program* or your hosting provider’s File Manager to download it to your computer.
  2. Open a plain-text editor** or any of the many other text editors out there, as long as it can save files in plain-text format.
  3. If you downloaded an existing robots.txt file, then open that file in your plain text editor.
  4. Copy the appropriate robots.txt rule(s) above and paste them into the editor. Replace https://yourdom.ain with your site’s domain.
  5. Save the file, overwriting the old one if you had an existing robots.txt file, or save it as a new file named robots.txt.
  6. Upload it to your site root with your preferred FTP program or your hosting provider’s File Manager. Choose “Yes” if you are asked whether to overwrite the existing robots.txt file.

*) FTP program examples: FileZilla, CrossFTP, WinSCP (Windows), FireFTP (Firefox browser extension), sFTP Client (Chrome browser extension)
**) Plain text editor examples: Notepad (Windows), TextEdit (MacOS), Gedit/Kate/Mousepad/Leafpad (Linux)