
Video Summary

What do robots have to do with your website? And what comes to mind when you think about robots?

I’m talking about the robots, or bots, that make search engines work. These are little programs that go out and scour the internet looking for information. When these bots arrive at your site, before they start the crawling and indexing process, they first look for the robots.txt file. This file tells them how to properly look at the pages on your site.

If there is no robots.txt file found, they will crawl every page they can find. But with the robots.txt file, you can tell search engine bots to stay away from specific files and certain areas of your website.

And what’s that good for? Continue reading to find out…

Video Transcript

Hello, my name is Jeffrey Kirk. What do robots have to do with your website? And what comes to mind when you think about robots?

Back in college I studied computer science, and we had a HERO 1 robot that we could work with to learn about programming robots and getting them to do things while sensing their surroundings.

Back in the 1980s computer memory was very expensive. Today you can get a 256 GB microSD card for around $20. Memory is easy to get, and there are barely any limits to what you can store.

Computer programs can be huge, and they're often quite bloated. But back then, with the HERO 1, we had to fit all the programming into 4 KB of memory. Today's 256 GB microSD card, which is less than a square inch in size, holds more than 67 million times as much data as we had available to program the robot. And yet, we could get that robot to see its way around a maze!

Today I want to talk to you about a robots.txt file. This is not a programming language. It does not operate any type of robot you might picture from real life or science fiction.

Instead, we’re talking about the bots that make search engines work. These are little programs that go out and scour the internet looking for information. And your robots.txt file influences those bots.

Before I continue, please subscribe to my channel and click the little bell so you get notified when my next video comes out. Okay, let’s get started…

Crawling and Indexing

When you go to a search engine and do a search, you expect to get results that match your query. For that to happen, search engines need a huge database that they can reference. Because, in the moment you ask, there’s no way they can go out and suddenly look at all 2 billion or so websites to find the best answer for you.

So search engines use a process of crawling and indexing. And that means, when you do a search, the search engine can go to its database, look up the best answers, and quickly display the information to you.

The first half of this process, the crawling, is where search engines use robots, often called bots, and they visit every page they can find. They visit a page, look for any links, and then visit the linked pages, and the process continues. Everything they find is put in the index.

In an earlier video I talked about sitemaps. Sitemaps help these bots find content that is considered important and content that might not be linked. The robots.txt file provides a different function for the bots visiting your site: it contains instructions on how they should crawl parts of your site.

Robots.txt File

Basically, the robots.txt file tells a search engine which parts of the site they can visit and which parts they should not. So, when the bots arrive at your site, before they start the crawling and indexing process, they first look for the robots.txt file to know how to properly look at the pages of your site.

If they cannot find a robots.txt file, they simply proceed to crawl every page they can find. If you do have a robots.txt file, it must sit at the top level of your website, so you can check it quickly by typing your domain followed by /robots.txt.

In the case of my website, it would be at upatdawn.biz/robots.txt. Note that the file name robots.txt is case sensitive: it must be all lower case, spelled exactly, for it to work properly.

Inside the robots.txt file is a list of rules. Instructions such as “allow” and “disallow” are common, though the default condition is to allow, so the “allow” directive is rarely necessary. And since search engines were reading robots.txt files long before sitemap.xml files existed, it’s good practice to include your sitemap address at the end of the robots.txt file.
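For instance, a minimal file might look like this (the folder, page, and domain names here are just placeholders):

User-agent: *
Disallow: /private/
Allow: /private/welcome.html
Sitemap: https://www.yoursite.com/sitemap.xml

Everything under /private/ is off limits except the one page explicitly allowed, and the last line points the bots to the sitemap.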

Basically, when the two are used together, the robots.txt file tells search engine bots which files to stay away from, while the sitemap file tells them which specific files to look at.

One other important consideration, though: the robots.txt file does not tell search engines not to index a file. It only says, “don’t look at it.” If you don’t want a page to appear in the index at all, you should use the noindex tag within the page itself. But that’s a different issue.
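For reference, that noindex instruction is a single tag placed in the head section of the page’s HTML, like this:

<meta name="robots" content="noindex">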

What can you do with a robots.txt file?

  • First, you can block pages that should not be public.

Not all of your website’s pages need to be crawled and ranked. Think of duplicate pages, login pages, or pages still being staged in development.

You need these pages in your website, but you don’t want people randomly finding them. This is where a robots.txt file can be used to block pages from search engine bots.

This works best if those pages are within a separate folder. One line in the robots.txt file can eliminate crawling of the entire folder (see the example after this list).

  • Second, you can manage the “crawl budget”.

What I mean by this is that the amount of time, the resources, and the number of pages that bots spend on your site all depend on your site’s “crawl budget”.

And that crawl budget is determined by your site’s reputation, size, and backlinks. And once the budget is used up, the bots move on to another site.

So, if there are pages on your site that are not being indexed, it’s possible the crawl budget is used up before the bots get there. You can use your robots.txt file to block unnecessary pages of your site so crawlers can focus on the most important ones.

This is important because if a page is not indexed, it cannot rank in the search results.

By using robots.txt, web crawlers can focus their “energy” on indexing the more important parts of your website.

  • Third, you can prevent non-page resources from being crawled and indexed.

This is related both to the crawl budget and to the effort of blocking pages that should not be public. But in this case, we’re not talking about website pages, but rather other content.

If you don’t want PDFs, images, videos, or other media crawled by search engine bots, you can block them with the robots.txt file. You can’t place a noindex tag inside these files the way you can with an HTML page (a server can send the same signal through an X-Robots-Tag HTTP header, but that requires server configuration), so the robots.txt file is usually the simplest option. An example follows below.
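To make these three uses concrete, here is a sketch of a robots.txt file that applies all of them at once. The folder names are placeholders, and the *.pdf pattern relies on wildcard matching that major engines such as Google and Bing support:

User-agent: *
Disallow: /staging/
Disallow: /login/
Disallow: /*.pdf$

The first two lines keep the staging and login areas out of the crawl, which also frees up crawl budget for the pages that matter, and the last line blocks every PDF on the site.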

How do you create a robots.txt file?

The first step is actually very simple. Open a new text file and name it robots.txt. That’s it, you’re done with step 1.

The next step is to add the directives. The “user-agent” line names the specific bot you want to address, like Googlebot for Google or Bingbot for Bing.

And anything after “disallow” lists the pages you want to block or skip. For example, if you do not want Google to crawl the photos directory on your website, you would do something like this…

User-agent: Googlebot
Disallow: /photos

In this case, any URL whose path starts with /photos is blocked, including all the photos in the /photos subdirectory.

Or, let’s say you don’t want to block just Google from the photos directory, but want to stop all bots. Then you use an asterisk (*) to represent every bot that visits your site. So it’d be:

User-agent: *
Disallow: /photos

Or let’s say you want to block the photos directory and the archive directory. You simply add another directive below the first Disallow, so it’d be like this:

User-agent: *
Disallow: /photos
Disallow: /archive

Another example: if you leave the Disallow line empty, there isn’t anything to block, so crawlers can index all pages of your site. You’d have something like:

User-agent: *
Disallow: 
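Be careful with the opposite extreme, though: a Disallow line containing just a slash blocks your entire site from compliant bots.

User-agent: *
Disallow: /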

And then, once you’re finished adding agents and directives, you can add your sitemap to the bottom part of the file, like this…

Sitemap: https://www.yoursite.com/sitemap.xml

Then save the file and upload it to the root folder of your website. Then test it: put your domain followed by /robots.txt into your browser. If it comes up as expected, you can go to Google’s robots.txt testing tool to make sure everything works the way you want it to work. Google will show you any errors or warnings in your file.

You do need a Google Search Console account to use this tool, but that’s free to get and it’s worthwhile. So if prompted, just follow the steps Google gives you.
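If you prefer the command line, a quick fetch also confirms the file is live (swap in your own domain):

curl https://www.yoursite.com/robots.txt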

Since you’re still here, I suspect you’re trying to improve your business’s search engine visibility. And in that case, I would like to invite you to attend an upcoming free training. I’m going to show you how to get your business showing up in the best part of the search results so you get more clients.

Look for the link here and join me soon. Your business deserves to be seen online, and I will help you get there.

Thanks for watching and have a great day!
