
How to instruct web robots to exclude web-site URLs or folders?

A web robot (also known as an internet bot) is a program that periodically crawls web-site pages over the internet, analyses the crawled data and stores the necessary information in its repository.

If you don’t want some pages or folders to be crawled by bots, you need to inform them before they crawl your web-site. “robots.txt” is the text file used for this purpose. It is a simple text file, located on the web server, that contains a set of rules telling bots what to fetch and what not to fetch from your web-site.
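
For example, a minimal “robots.txt” looks like the one below (the folder name /private/ is just a placeholder for illustration). The “User-agent” line names the bot the rules apply to (“*” means all bots), and each “Disallow” line names a path that bot should not fetch:

    User-agent: *
    Disallow: /private/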

Remember that “robots.txt” is advisory: not every bot is obliged to follow the rules mentioned in it. Some bots may crawl all of your site’s pages irrespective of those rules, but most well-behaved bots will check “robots.txt” and crawl pages according to the rules it contains.

This article explains the steps to instruct internet bots to exclude web-site URLs or folders from crawling.

Step (1). Log in to your web-site’s control panel and open the File Manager.

Step (2). Look for the “robots.txt” file in the root folder where your web-site is installed. Note that the “robots.txt” file must be in the root folder of your web-site (e.g. public_html/), not in a sub-folder.

Step (3). If the “robots.txt” file does not exist, create a new one.

Step (4). Open the “robots.txt” file in the file editor and add the following lines, depending on your requirement (a combined example follows this list):

  • To exclude a folder, enter the following code, where my-folder-name is the name of the folder you want to exclude.
    Disallow: /my-folder-name/
  • To exclude the whole site:
    Disallow: /
  • To exclude a particular file, use the code below, where “file-name” is the name of the file you want to exclude (e.g. Disallow: /junk.html).
    Disallow: /file-name
  • You can also allow only one web robot to access all your files and restrict other web robots. The code below allows only Google’s web robot (user agent “Googlebot”) and disallows all other robots.
    User-agent: Googlebot
    Disallow:
    User-agent: *
    Disallow: /
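
Putting these rules together, a complete “robots.txt” could look like the sample below. The folder and file names are placeholders for illustration, and lines starting with “#” are comments:

    # Google's bot may crawl everything
    User-agent: Googlebot
    Disallow:

    # All other bots: skip the /my-folder-name/ folder and junk.html
    User-agent: *
    Disallow: /my-folder-name/
    Disallow: /junk.html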

Once you have made the changes, save the “robots.txt” file. The next time an internet bot tries to crawl pages on your web-site, it will first look at the “robots.txt” file and, depending on the rules mentioned in it, skip the excluded pages or folders.
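
If you want to double-check your rules, Python’s standard urllib.robotparser module can read a “robots.txt” file and tell you whether a given URL may be fetched by a given user agent. Below is a minimal sketch; the domain www.example.com and the paths are placeholders for illustration:

    # check_robots.py -- verify robots.txt rules using Python's standard library
    from urllib import robotparser

    # Point the parser at your site's robots.txt (placeholder domain)
    rp = robotparser.RobotFileParser()
    rp.set_url("https://www.example.com/robots.txt")
    rp.read()  # download and parse the rules

    # Ask whether specific URLs may be crawled by a given user agent
    print(rp.can_fetch("Googlebot", "https://www.example.com/junk.html"))
    print(rp.can_fetch("SomeOtherBot", "https://www.example.com/my-folder-name/page.html"))

If your site served the sample rules shown above, the first call would print True and the second False.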
