Normally, you would use the robots.txt file to tell search engines which pages, files, folders, and subdomains you want crawled. It is a useful tool for sites of any size, and a crucial one for larger websites.
The issue with the robots.txt file is that it only contains crawler directives. There are two kinds of directives: crawler directives and indexer directives.
Crawler directives tell Googlebot where it can go. They can also be used to point Googlebot to your sitemap. The most common crawler directives are Allow, Disallow, Sitemap, and User-agent. These tell search engines what and where they should crawl.
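To see how these crawler directives are read, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt contents and the example.com URLs are invented for illustration. Python's parser applies the first matching rule, which is an approximation and may not match how Googlebot resolves every Allow/Disallow conflict.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Allow: /private/press-kit.html
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
"""


def load_rules(text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
for path in ("/index.html", "/private/report.pdf", "/private/press-kit.html"):
    url = "https://example.com" + path
    print(path, "crawlable:", rules.can_fetch("Googlebot", url))

# site_maps() requires Python 3.8 or newer.
print("sitemaps:", rules.site_maps())
```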
Indexer directives tell Googlebot what it should index. Unlike crawler directives, which are usually placed in the robots.txt file, indexer directives are placed on each page or element. They can go in the HTML head of a page, e.g. <meta name="robots" content="noindex, follow">. They can also be placed on a link, e.g. <a href="http://example.com/page" rel="nofollow">example page</a>.
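As a rough illustration of how a page-level indexer directive can be read back out of a page, the sketch below uses Python's standard html.parser module. The class and function names (RobotsMetaParser, robots_meta_directives) are hypothetical, and the sketch only looks at meta tags named "robots", not at crawler-specific names.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "meta" and (attr_map.get("name") or "").lower() == "robots":
            content = attr_map.get("content") or ""
            # Directives are comma-separated, e.g. "noindex, follow".
            self.directives.extend(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_meta_directives(html):
    """Return the robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(robots_meta_directives(page))
```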
This is all well and good, but as previously mentioned, images and PDF files don't have HTML heads. Yes, you can nofollow every link on your site that points to an image or PDF, but that does not stop other people from linking to it.
The solution is to use our .htaccess file to set a custom HTTP header, the X-Robots-Tag. We can use it with any indexer directive.
Example: X-Robots-Tag HTTP header
Setting the X-Robots-Tag is the same as setting any other custom HTTP header. On Apache this requires the mod_headers module.
For a single file we can use:
# Keep this single PDF out of search results
<Files "white-paper.pdf">
  Header set X-Robots-Tag "noindex, noarchive, nosnippet"
</Files>
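Once a rule like the one above is in place, it is worth confirming that the header is actually sent. The following is a sketch only: the function names are hypothetical, it assumes the header value is a plain comma-separated list with no user-agent prefix, and fetch_directives needs network access to a URL you control.

```python
from urllib.request import Request, urlopen


def x_robots_directives(headers):
    """Return the directives found in any X-Robots-Tag headers.

    `headers` is an iterable of (name, value) pairs. Header names are
    compared case-insensitively, since HTTP header names are not
    case-sensitive.
    """
    directives = []
    for name, value in headers:
        if name.lower() == "x-robots-tag":
            directives.extend(
                part.strip().lower() for part in value.split(",") if part.strip()
            )
    return directives


def fetch_directives(url):
    """Send a HEAD request and read the X-Robots-Tag header (needs network)."""
    with urlopen(Request(url, method="HEAD")) as response:
        return x_robots_directives(response.headers.items())


# Offline check with headers shaped like the ones the rule should emit.
sample = [
    ("Content-Type", "application/pdf"),
    ("X-Robots-Tag", "noindex, noarchive, nosnippet"),
]
print(x_robots_directives(sample))
```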
To set the header for all .docx and .pdf files, we would use the following:
# The dot is escaped so it only matches a literal "." before the extension
<FilesMatch "\.(docx|pdf)$">
  Header set X-Robots-Tag "noindex, noarchive, nosnippet"
</FilesMatch>
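The file-matching pattern can be tried out before it goes live. The sketch below compares the pattern with and without an escaped dot, assuming Python's re module treats this simple expression the same way Apache's regex engine does. The filenames are invented for illustration.

```python
import re

UNESCAPED = re.compile(r".(docx|pdf)$")  # "." matches any single character
ESCAPED = re.compile(r"\.(docx|pdf)$")   # "\." matches only a literal dot

# Hypothetical filenames, invented for this example.
for name in ("white-paper.pdf", "report.docx", "notes.txt", "not-a-pdf"):
    print(
        f"{name:16} unescaped={bool(UNESCAPED.search(name))} "
        f"escaped={bool(ESCAPED.search(name))}"
    )
```

With the unescaped dot, a file named not-a-pdf would also receive the header, because any character is accepted in front of the extension. Escaping the dot limits the rule to real .docx and .pdf extensions.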
The X-Robots-Tag is an invaluable tool in the SEO toolbox. Like the rel="canonical" header, though, it should be used judiciously. Being careless with any part of the .htaccess file can cause serious problems.
Which brings us to our next topic: How to Put rel="canonical" on Non-HTML Resources.