Robots Exclusion

 

Truth About Web Crawlers
By: Maksym Nesen

Wouldn't it be nice to be able to leave some code in your web site that tells the search engine spiders to make your site number one? Unfortunately, a robots.txt file or robots meta tag won't do that, but they can help the crawlers index your site better and block out the unwanted ones.

First, a few definitions:

Search Engine Spiders or Crawlers: A web crawler (also known as a web spider) is a program that browses the World Wide Web in a methodical, automated manner. Web crawlers are mainly used to create a copy of every visited page for later processing by a search engine, which indexes the downloaded pages to provide fast searches.

A web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit. As it visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit, recursively browsing the Web according to a set of policies.
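The visit-and-collect loop described above can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine works: the `crawl` function, the `LinkExtractor` class, and the in-memory `FAKE_WEB` pages are all hypothetical names invented for this example, and the "set of policies" is reduced to a simple page limit.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Visit pages breadth-first, starting from seed_urls.

    fetch is a caller-supplied function that maps a URL to its HTML text,
    or returns None if the page cannot be retrieved.
    """
    to_visit = deque(seed_urls)   # the list of URLs still to visit
    seen = set(seed_urls)         # URLs already queued, to avoid repeats
    visited = []
    while to_visit and len(visited) < max_pages:
        url = to_visit.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                to_visit.append(absolute)
    return visited


# A tiny in-memory "web" (hypothetical pages) stands in for real HTTP requests.
FAKE_WEB = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/b.html">B</a>',
    "http://example.com/b.html": '<a href="/">Home</a>',
}

print(crawl(["http://example.com/"], FAKE_WEB.get))
```

A real crawler would replace `FAKE_WEB.get` with a function that downloads pages, and would check robots.txt before each request.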

Robots.txt: The robots exclusion standard, or robots.txt protocol, is a convention to prevent well-behaved web spiders and other web robots from accessing all or part of a website. The parts that should not be accessed are listed in a file called robots.txt in the top-level directory of the website.
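Because the file always sits in the top-level directory, a crawler can work out its address from the URL of any page on the site. A minimal sketch, using Python's standard `urllib.parse` module; the function name `robots_url` and the example address are made up for illustration:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.example.com/shop/toys/index.html?page=2"))
# prints http://www.example.com/robots.txt
```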

The robots.txt protocol is purely advisory and relies on the cooperation of the web robot, so marking an area of your site out of bounds with robots.txt does not guarantee privacy. Many web site administrators have been caught out trying to use the robots file to make private parts of a website invisible to the rest of the world. However, the file is necessarily publicly available and is easily checked by anyone with a web browser.

The robots.txt patterns are matched by simple comparison against the start of the URL path, so take care that patterns matching directories end with the final / character. Otherwise, all files with names starting with that string will match, rather than just those in the intended directory.
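The effect of the missing / can be checked with Python's standard `urllib.robotparser` module. In this sketch the directory name `/private`, the file names, the crawler name `ExampleBot`, and the helper `allowed` are all hypothetical:

```python
from urllib.robotparser import RobotFileParser


def allowed(rules, path, agent="ExampleBot"):
    """Return True if the given robots.txt text lets agent fetch path."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


WITHOUT_SLASH = "User-agent: *\nDisallow: /private"
WITH_SLASH = "User-agent: *\nDisallow: /private/"

for path in ("/private/report.html", "/private-offers.html"):
    print(path, allowed(WITHOUT_SLASH, path), allowed(WITH_SLASH, path))
```

Without the trailing slash, both paths are blocked, including the unrelated page /private-offers.html. With the slash, only the file inside the directory is blocked.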

Meta Tag: Meta tags are used to provide structured data about a page, that is, data about data.

In the early 2000s, search engines veered away from reliance on meta tags, as many web sites used inappropriate keywords or stuffed keywords to obtain any and all traffic possible.

Some search engines, however, still take meta tags into some consideration when delivering results. In recent years, search engines have become smarter, penalizing websites that cheat (by repeating the same keyword several times to get a boost in the search ranking). Instead of going up in the rankings, these websites will go down, or on some search engines will be removed from the results completely.

Index a site: The act of crawling your site and gathering information.

How can the robots.txt file and meta tag help you?

In the robots.txt file you can tell harmful web crawlers to leave your web site alone and give helpful hints to the ones you want to crawl your site. Below is an example of how to disallow a web crawler from searching your site:

# this identifies the wayback machine
User-agent: ia_archiver
Disallow: /

ia_archiver is the crawler name for the Wayback Machine, which you may have heard of, and the / after Disallow tells ia_archiver not to index any of your site. The # lets you write comments to yourself so you can keep track of what you typed.

Type the above three lines into Notepad on your computer and save the file to the root directory of your web site as robots.txt. Web crawlers look for this document first at a web site before doing anything else. This helps the crawler do its job and lets the web site owner tell the spider what to do. Say, for instance, you have some data that you don't want the crawlers to see (like duplicate content for other browser referrer pages).

You can deter crawlers from indexing the duplicate directory by typing this into your robots.txt file.

User-agent: *
Disallow: /duplicate/

The * after User-agent says that this rule applies to all crawlers, and /duplicate/ after Disallow tells all crawlers to ignore this directory and not search it. Keep each User-agent line together with its Disallow lines, and separate each such group from the next with a blank line so the file functions correctly. This is how you would combine the above two commands into one robots.txt file:

# this identifies the wayback machine
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /duplicate/
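Before uploading a file like this, you can check how a rule-following crawler would read it. The sketch below feeds the combined file, written with the standard User-agent field name, to Python's standard `urllib.robotparser` module. `ExampleBot` and the page paths are hypothetical, chosen only to exercise each rule:

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# this identifies the wayback machine
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /duplicate/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

checks = [
    ("ia_archiver", "/index.html"),          # blocked from the whole site
    ("ExampleBot", "/index.html"),           # any other crawler may fetch this
    ("ExampleBot", "/duplicate/page.html"),  # but not the duplicate directory
]
for agent, path in checks:
    print(agent, path, parser.can_fetch(agent, path))
```

The three checks print False, True, and False, in that order. Note that this parser only recognizes the hyphenated spelling User-agent, so a misspelled field name would silently leave the rules without effect.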

One thing to note that is very important: anyone can access the robots.txt file of a site. So if you have information that you don't want anyone to see, don't list it in the robots.txt file. If the directory that you don't want anyone to see is not linked to from anywhere, the crawlers are unlikely to find it anyway.

An alternative way to block indexing is to put a meta tag into the page itself. It looks like this:

<meta name="robots" content="noindex,nofollow">

You put this into the <head> section of your web page. This line tells the robot crawlers not to index (search) the page and not to follow any of the hyperlinks on the page. As another example, <meta name="robots" content="noindex,follow"> tells the robot crawlers not to index the page but to follow the hyperlinks on it.
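To see the tag from the crawler's side, here is a rough sketch of how a program might read the directives out of a page, using Python's standard `html.parser` module. The class `RobotsMetaParser`, the function `robots_directives`, and the sample page are hypothetical, and real crawlers may interpret the tag in more detail than this:

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self, name="robots"):
        super().__init__()
        self.name = name
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == self.name:
            content = attr.get("content") or ""
            # Directives are comma-separated, e.g. "noindex,follow".
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html, name="robots"):
    parser = RobotsMetaParser(name)
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="robots" content="noindex,follow"></head><body></body></html>'
directives = robots_directives(page)
print("may index:", "noindex" not in directives)        # may index: False
print("may follow links:", "nofollow" not in directives)  # may follow links: True
```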

Did You Know That Google Has Its Own Meta Tag?

It looks like this:

<meta name="googlebot" content="noindex,nofollow,noarchive">

This tells the Google robot crawler not to index the page, not to follow any of the links, and not to store cached versions of your pages. You will want this if you update the content on your site frequently. It prevents web users from seeing outdated content that is served from the cache instead of being refreshed.
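If your pages are generated by a script, the tag can be built from a list of directives so the punctuation is always right. A small sketch; the helper `robots_meta_tag` is a hypothetical name, and it simply joins whatever directives you pass in without checking that a crawler understands them:

```python
from html import escape


def robots_meta_tag(directives, crawler="robots"):
    """Build a meta tag addressed to all crawlers or to one named crawler."""
    content = ",".join(directives)
    return '<meta name="{}" content="{}">'.format(
        escape(crawler, quote=True), escape(content, quote=True)
    )


print(robots_meta_tag(["noindex", "nofollow"]))
# prints <meta name="robots" content="noindex,nofollow">
print(robots_meta_tag(["noindex", "nofollow", "noarchive"], crawler="googlebot"))
# prints <meta name="googlebot" content="noindex,nofollow,noarchive">
```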

You can use this meta tag to talk specifically to Google's robots, to avoid complications, or if you are optimizing your site for Google's search engine.





Copyright © robotictoys1.com