Robots.txt is a text file containing a set of instructions that tell search engine crawlers (also called spiders or bots) which parts of a website may or may not be crawled. Search engines such as Google, Yahoo, and Bing follow the robots.txt standard, known as the Robots Exclusion Protocol, when crawling webpages.
To implement robots.txt, upload it to the root directory of your domain. It is typically the first file a search engine spider requests when it visits your website. The file lets you control crawling, and it can also point to your sitemap, which gives crawlers an overview of the URLs on your website.
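As a rough illustration of how a crawler might pick up the sitemap location from the file, here is a minimal Python sketch using the standard library's urllib.robotparser module. The file content and the example.com sitemap URL are placeholders assumed for this example, and site_maps() needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the sitemap URL is a placeholder.
ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns the listed sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)
```

The Sitemap line is independent of the User-agent groups, so it can be placed anywhere in the file.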
Do we need Robots.txt for our Website?
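Search engines are public, and anybody can use them. If you own a website, there are probably pages you do not want to appear in search results, such as admin areas or internal files. A robots.txt file asks crawlers to stay away from those pages. Keep in mind that the file is only a request to well-behaved crawlers and is itself publicly readable, so truly confidential data should also be protected by a password or other access control.
The robots.txt file is important for Search Engine Optimization (SEO) because it tells search engines which pages to crawl and which to skip.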
Why Is It Important to Handle the Robots.txt File Carefully?
When working with a robots.txt file, be meticulous and keep the following points in mind.
1. The robots.txt file governs the interaction between search engine crawlers and a website.
2. If the robots.txt file is not implemented correctly, it can hurt your ranking in search engines and adversely affect your SEO.
3. It is a set of instructions written specifically for search engine bots, and it helps crawlers understand which pages may be crawled and which are restricted.
Note: Google announced that it would stop supporting the noindex rule in robots.txt from September 2019.
How to create robots.txt File?
It is very simple to create a robots.txt file. It is a set of directives, the most common being “User-agent” and “Disallow”. First, let us understand the purpose of each.
User-agent: It names the search engine bot (crawler) that the rules which follow apply to.
Disallow: It lists the files, webpages, or directories that should be excluded from crawling.
Apart from these two directives, you can use the “#” symbol to insert a comment in the file.
Examples of Robots.txt File
If you want to give search engine bots full access, use the code below in your robots.txt file.
User-agent: *
Disallow:
# Here, all user agents are allowed to access every file on the website.
If you want to block the entire website from robots, use the code below in your robots.txt file.
User-agent: *
Disallow: /
# Here, all user agents (bots) are disallowed from accessing any file on the website.
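To see the difference between the two rule sets above, they can be fed to Python's standard urllib.robotparser module. This is only a sketch: the example.com URL is a placeholder, and Python's parser is one interpretation of the rules, so a real search engine may behave slightly differently in edge cases.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content supplied as a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


ALLOW_ALL = "User-agent: *\nDisallow:\n"
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

# Placeholder URL used only for this check.
url = "https://www.example.com/blog/post.html"

# An empty Disallow permits everything; "Disallow: /" blocks everything.
allow_result = build_parser(ALLOW_ALL).can_fetch("Googlebot", url)
block_result = build_parser(BLOCK_ALL).can_fetch("Googlebot", url)
print(allow_result, block_result)
```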
Sometimes it is easier to block a whole directory than to disallow every URL inside it one by one. For example, if you want to block a folder or directory from bots, use the code below.
User-agent: *
Disallow: /folder/
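A quick way to sanity-check a folder rule like this one is to ask Python's urllib.robotparser which URLs it would allow. The sketch below assumes a placeholder domain and two made-up page names.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /folder/"])

# Hypothetical URLs: one inside the blocked folder, one outside it.
inside = parser.can_fetch("*", "https://www.example.com/folder/page.html")
outside = parser.can_fetch("*", "https://www.example.com/about.html")
print(inside, outside)
```

Because the rule is matched as a path prefix, everything under /folder/ is blocked while the rest of the site stays crawlable.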
Sometimes you may need to block access to one particular URL. For example, suppose you want to block the URL www.beingoptimist.com/contact-us. The Disallow rule takes the path relative to the domain root, not the domain name itself, so the code in the robots.txt file is written in this format.
User-agent: *
Disallow: /contact-us
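Disallow values are matched against the path part of a URL, so a rule for a single page uses only the part after the domain (here, /contact-us). The Python sketch below derives that path from the full URL and checks the resulting rule with urllib.robotparser. The https:// scheme is an assumption made for the example.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Full page URL; the https scheme is assumed for illustration.
page_url = "https://www.beingoptimist.com/contact-us"

# Only the path ("/contact-us") belongs in the Disallow rule.
path = urlsplit(page_url).path
rules = ["User-agent: *", f"Disallow: {path}"]

parser = RobotFileParser()
parser.parse(rules)

blocked = not parser.can_fetch("*", page_url)
home_allowed = parser.can_fetch("*", "https://www.beingoptimist.com/")
print(rules[1], blocked, home_allowed)
```

Note that the rule is a prefix match, so it also covers any other URL whose path begins with /contact-us.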
How to Check the robots.txt File of a Website
To check whether your website has a robots.txt file, just add “/robots.txt” after your domain. For example:
www.beingoptimist.com/robots.txt
This shows whether your domain has a robots.txt file. If your website does not have one, create it in a plain text editor such as Notepad and upload it to the root folder of your domain, which is usually the “public_html” folder.
You can also test the file with the robots.txt Tester, which requires access to your Google Search Console account. If you have not yet verified your website in Google Search Console, you can read the article How To Use Google Webmaster Tools.
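The same check can be scripted. The Python sketch below builds the robots.txt address for any page on a site and, optionally, requests it. The function names, the https:// scheme, and the sample blog path are assumptions made for this example, and has_robots_txt() needs network access.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def has_robots_txt(page_url: str, timeout: float = 10.0) -> bool:
    """Return True if the site answers the robots.txt request with HTTP 200."""
    try:
        with urlopen(robots_txt_url(page_url), timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers HTTP errors (such as 404), connection failures, and timeouts.
        return False


# Hypothetical page path on the article's example domain.
print(robots_txt_url("https://www.beingoptimist.com/blog/some-post"))
```

Calling has_robots_txt() with any page URL of your site then reports whether the file is being served.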
Common Search Engine Bot Names to Remember
In the examples above, we used User-agent: *. Here, the asterisk stands for all search engine bots. Given below is a list of common User-agent names of search engine robots that you should remember:
Google
1. Googlebot
2. Googlebot-News (for news)
3. Googlebot-Image (for images)
4. Googlebot-Video (for videos)
Baidu
1. Baiduspider
Bing
1. Bingbot
2. MSNBot-Media (for videos and images)
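These names matter once you want different rules for different bots. As a hedged sketch, the Python snippet below parses a made-up robots.txt that keeps Googlebot-Image out of a hypothetical /images/ folder while leaving every other bot unrestricted. The domain and folder are placeholders, and Python's urllib.robotparser matches bot names in its own way, which may not be identical to how each search engine does it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one group for Googlebot-Image, one for everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot-Image
Disallow: /images/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

image_url = "https://www.example.com/images/photo.png"

# Ask the parser how each bot would be treated for the same URL.
results = {
    bot: parser.can_fetch(bot, image_url)
    for bot in ("Googlebot-Image", "Googlebot", "Bingbot")
}
print(results)
```

Only the bot named in the first group is blocked. The others fall through to the catch-all group.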
Conclusion
I hope you now understand how to use a robots.txt file to control crawler access to your webpages. Using this file, you can keep bots away from important files or folders so that they do not show up in search results. As a reminder, to create this file, write the above set of instructions in a plain text file (for example, in Notepad). Then upload the file to the root directory of your domain, i.e. the “public_html” folder.