Advanced Insights into Googlebot Crawling

Here are some interesting aspects of Googlebot's crawling of news websites that are useful to know when you want to optimise crawl efficiency.

Jul 27, 2023

A few months ago I wrote a guest newsletter for my friends Shelby & Jessie: Crawl Budget 101 for news publishers.

If you haven’t yet read the guest article, you should do so now as this is a follow-up piece, serving as a deeper dive into the intricacies of Googlebot.

Improving Googlebot Crawl Rate

A common question is how to improve the rate at which Googlebot crawls your content. This is a simple question with complicated answers, as the tactics you can employ vary depending on your circumstances and end goals.

First of all, let’s dig a little into how Google decides which pages should be crawled.

There is a concept called URL Importance which plays a big role in Google’s crawl scheduling. Simplified, URLs that are seen as more important are crawled more often.

So what makes for an important URL? Generally two factors apply:

How many links point to the URL
How often the URL is updated with new content and links

If a URL has many links pointing to it from other sources, and the content of the page served on that URL changes frequently (say, on a daily basis or more), then Google will likely choose to crawl that URL often.

The homepage and key section pages of news websites tend to fit both these criteria. That’s why news homepages and section pages are crawled very often, sometimes as much as several times a minute.

Google crawls these pages very aggressively because it wants to find newly published articles as soon as possible, so they can be indexed and served in Google’s news-specific ranking elements. Users depend on Google to find the latest stories on developing news topics, which is why Google puts in extra effort to quickly crawl and index news articles.

So one way to improve crawling of your website is to increase the importance of your homepage and section pages.

Get more links pointing to these pages, for example with site-wide top navigation that features your homepage and key sections. And make sure that your homepage and section pages prominently feature the newest articles as soon as they’re published, so that Google knows to crawl often to find new articles.

It’s All About The First Crawl

One very important aspect of Googlebot’s crawling of news websites is that Google doesn’t quickly re-crawl article URLs it has already crawled.

I believe there are underlying infrastructure reasons for this, which I explained in a talk I gave at YoastCon.

In summary, I suspect Google has multiple layers of crawling, and its most urgent crawler (what I call its ‘priority crawler’) will crawl new URLs almost as soon as they’re available. However, it won’t revisit those URLs once they’ve been crawled. Any subsequent re-crawls of URLs are done by a less urgent crawling system (the ‘regular crawler’).

This has a rather important implication for news: When you publish an article and it’s available on your website, Google will crawl it almost immediately and it will not be recrawled until hours or days later.

So if you publish an article, and then update it (change the headline, fix some typos, add some SEO magic, whatever) Google most likely has already crawled and indexed the first version of the article and will not see your changes until much later. At that stage the article isn’t news any more and will have dropped out of Top Stories.

So you get one chance to ensure your article is properly optimised for maximum visibility in Google’s news-specific ecosystem, and one chance only. And that is the moment you first publish it.

This is why it’s so incredibly important to make SEO part of your editorial workflow and ensure articles are optimised before they are published.

Any improvements made to an article after its publication are unlikely to have an impact on the article’s visibility. Unless, of course, you change the URL - because then Googlebot will treat it as an entirely new article.

I believe Live articles are an exception to this. When Google detects a live article, it’ll be regularly re-crawled to find the latest updates.

Robots.txt as Rank Management

We see the robots.txt file as a mechanism to control crawling. By default, Googlebot and other crawlers assume that every publicly accessible URL is freely crawlable, and will do their best to crawl all URLs on a website. With robots.txt disallow rules, you can prevent crawlers from accessing URLs that match a specific pattern.

For example, if you want to prevent crawling of all URLs that start with /search you’ll need this disallow rule:

User-agent: *
Disallow: /search
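One way to sanity-check a rule like this before deploying it is to run it through the robots.txt parser in Python's standard library. This is a minimal sketch: the example.com URLs and the is_allowed helper are placeholders for illustration, and urllib.robotparser uses simple prefix matching, which is close to but not identical to how Google's own parser behaves.

```python
from urllib.robotparser import RobotFileParser

# The rule from above, as it would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Placeholder URLs, for illustration only.
    for url in (
        "https://www.example.com/search?q=budget",
        "https://www.example.com/news/some-article",
    ):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, "Googlebot", url) else "blocked"
        print(f"{url}: {verdict}")
```

Keep in mind that the rule is a prefix match, so it also covers any other path that begins with /search.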

However, there is an often overlooked additional dimension to robots.txt disallow rules: it’s also a mechanism to prevent ranking in specific Google verticals.

When Google crawls your website, it will do so primarily with the Googlebot Smartphone user-agent.

There used to be a different user-agent for crawling approved news websites: Googlebot-News. But since 2011, news websites have been crawled with Google’s regular Googlebot crawler, and Googlebot-News isn’t used anymore.

Yet, robots.txt disallow rules can be specified for Googlebot-News:

User-agent: Googlebot-News
Disallow: /

The effect of this rule is not to prevent crawling. It will have no impact on Googlebot’s crawl activity on your site, because crawling doesn’t happen with Googlebot-News.

What actually happens is that this Googlebot-News disallow rule will stop your content from showing up in Google News. So it is, in effect, a rank-prevention mechanism.
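To see that this rule is scoped to the Googlebot-News token only, you can evaluate it for both tokens with Python's standard-library parser. This is a rough sketch: the access_by_agent helper and the example.com URL are made up for illustration, and the parser only tells you which token the rule matches. It says nothing about how Google applies the rule to Google News.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot-News
Disallow: /
"""


def access_by_agent(robots_txt, agents, url):
    """Map each user-agent token to whether the rules let it fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# Placeholder article URL, for illustration only.
result = access_by_agent(
    ROBOTS_TXT,
    ["Googlebot", "Googlebot-News"],
    "https://www.example.com/news/some-article",
)
print(result)
```

The regular Googlebot token is unaffected by the rule, while Googlebot-News is disallowed everywhere.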

This goes against the purpose of the robots.txt web standard. It makes sense inasmuch as it supports the purpose of a historic but now deprecated user-agent, but technically it’s not a proper use for robots.txt.

Stop LLMs from Crawling

As an additional note on robots.txt, there could be ways to prevent your content from being used by LLMs to train their generative AI.

Google introduced the GoogleOther user-agent, which they strongly hint is what their LLM uses to crawl content. Additionally, we know that OpenAI uses the Common Crawl dataset to train their LLM, and we can block the Common Crawl bot from our site with rules for the CCBot user-agent.

So, theoretically, with these disallow rules we can prevent some LLMs from being trained on our website’s content:

User-agent: GoogleOther
Disallow: /
User-agent: CCBot
Disallow: /

This is not foolproof of course, and mostly theoretical. In practice, LLMs have been harvesting the web’s copyrighted content for years to train themselves on, with little to no advance warning that this was happening.

There is some noise being made to create specific blocking mechanisms for large language models and other AI systems, but nothing has yet emerged that definitively prevents your content being used for building someone else’s proprietary AI product (except to lock it behind a very hard paywall).

Update 07 August 2023: OpenAI have now provided a method of disallowing their platform from crawling your content to improve their LLM. According to their newly published documentation, you can implement robots.txt disallow rules for GPTBot to prevent them crawling your pages:

User-agent: GPTBot
Disallow: /
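If you want to maintain these rules in one place, a small script can generate a group for each user-agent token mentioned above. This is only a sketch: the build_block_rules helper is hypothetical, the token list contains just the three names from this article, and whether a given crawler honours the rules is entirely up to its operator.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens named in this article; extend the list as needed.
AI_CRAWLER_TOKENS = ["GoogleOther", "CCBot", "GPTBot"]


def build_block_rules(tokens):
    """Build one robots.txt group per token, each disallowing the whole site."""
    groups = [f"User-agent: {token}\nDisallow: /" for token in tokens]
    return "\n\n".join(groups) + "\n"


rules = build_block_rules(AI_CRAWLER_TOKENS)
print(rules)

# Check the generated rules with the standard-library parser.
parser = RobotFileParser()
parser.parse(rules.splitlines())
for token in AI_CRAWLER_TOKENS + ["Googlebot"]:
    # example.com is a placeholder URL.
    print(token, parser.can_fetch(token, "https://www.example.com/news/some-article"))
```

The generated text can be appended to an existing robots.txt file. The regular Googlebot token is not matched by any of these groups.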

Google Detects Site Changes

One element of Google’s crawl scheduling system is a clever piece of engineering, intended to allow Google to quickly come to grips with big changes on websites.

Whenever Google detects that there have been major changes on a website, for example a new section has launched or the whole CMS has changed with new designs and content, the crawl rate will temporarily increase to enable Google to find all changes as fast as possible and update its index accordingly.

This is reflected in the crawl stats report as a spike in crawl requests.

The bigger the changes, the longer the spike in crawl requests can last. After a while, Googlebot’s crawl activity will go back down to normal levels when Google is confident its index has been sufficiently updated.

You’ll see such crawl spikes whenever a site migrates to a new domain, when URLs are updated site-wide, and/or when there is a significant change to the underlying codebase of the website.
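If you track daily Googlebot request counts yourself, for example from your server logs, a crude way to flag such spikes is to compare each day against the median. The sketch below is illustrative only: the find_crawl_spikes helper, the threshold of twice the median, and the sample numbers are all assumptions, not anything Google publishes.

```python
from statistics import median


def find_crawl_spikes(daily_requests, threshold=2.0):
    """Return the days whose request count exceeds threshold x the median."""
    baseline = median(daily_requests.values())
    return [
        day
        for day, count in daily_requests.items()
        if count > threshold * baseline
    ]


# Made-up daily Googlebot request counts, for illustration only.
sample = {
    "2023-07-01": 41000,
    "2023-07-02": 39500,
    "2023-07-03": 40200,
    "2023-07-04": 98000,
    "2023-07-05": 121000,
    "2023-07-06": 87000,
    "2023-07-07": 42000,
}

print(find_crawl_spikes(sample))
```

Using the median rather than the mean keeps the baseline from being dragged upwards by the spike itself, as long as the spike covers fewer than half of the days in the window.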

Crawl Challenges

In addition to the frequent crawl issues I outlined in my original guest article, there are some crawl challenges that many news publishers struggle with.

Internal Links with Parameters

One issue I regularly see is websites marking up internal links with URL parameters (also known as query strings) that allow them to track when the link is clicked.

This is a terrible idea and utterly self-defeating.

First, because these links with tracking parameters are new URLs, Google will crawl them whenever it sees them. This can consume quite a lot of crawl effort on Google’s end. Yet, these URLs aren’t new content in any way; they’re just the same articles, with tracking parameters added to their URLs. So there is no new content for Google to index, and crawl effort spent on these URLs is wasted.

Secondly, these URLs will of course have a rel=canonical link tag that asks Google to only index the ‘clean’ version of the URL. However, canonical tags are just hints, not directives. There are many other canonicalisation signals that Google uses to determine which version of a URL it should index.

You know what else is a (very strong) canonicalisation signal? Internal links.

When a website links to its internal URLs with tracking parameters, Google sees those internal links as canonicalisation signals and may choose to index the URL with the tracking parameter. That URL can be shown in Google’s results, and when users click on those results the visit will be registered in your web analytics as an internal click - not a Google click.

Which, of course, makes the whole point of tracking your internal clicks with URL parameters entirely unreliable, and rather silly.

So please, I implore you, do not use tracking parameters on internal links.

There are other, much more effective ways to monitor how users move through your website that don’t create ridiculous amounts of crawl waste or introduce entirely unreliable data into your web analytics.
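If tracking parameters are already baked into your templates, one possible cleanup step is to strip them from internal links before the HTML is rendered. This is a minimal sketch under stated assumptions: the parameter names in TRACKING_PARAMS, the www.example.com host, and the clean_internal_link helper are all hypothetical, so substitute whatever your own site uses.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical tracking parameter names; replace with the ones your site uses.
TRACKING_PARAMS = {"ref", "src", "utm_source", "utm_medium", "utm_campaign"}


def clean_internal_link(url, site_host="www.example.com"):
    """Strip tracking parameters from links that point at our own host."""
    parts = urlsplit(url)
    # Links to other hosts are left untouched; relative links are internal.
    if parts.netloc and parts.netloc != site_host:
        return url
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(clean_internal_link("https://www.example.com/news/story?ref=homepage"))
print(clean_internal_link("/news/story?page=2&ref=sidebar"))
```

Parameters that are not in the tracking list, such as a page number, are kept, so only the tracking noise is removed from the URL.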

Old Content & Pagination

Here are two common questions about crawl optimisation:

How to deal with older content on a news site?
How to deal with pagination in a way that is optimal for Google?

Both revolve around the same underlying question: should we keep old content?

Personally I am not in favour of bluntly deleting older articles. Those news stories on your website from many years ago might not drive traffic, but they serve an important purpose: they show your topic authority.

When you start deleting old articles, you are erasing part of your journalistic history and risk undermining the perceived authority you have on topics that you cover regularly.

When Google sees a topic page on your website that has only 10 articles visible, while another website might have the same topic page with over 200 visible articles, guess which one Google will see as the more authoritative publisher on that topic?

We want Google to see a substantial number of articles on a topic page to prove that you have a history of writing content on that topic.

But we also don’t want Google to spend huge amounts of effort crawling old pages that don’t drive traffic.

Rest assured, Google is quite smart about prioritising its crawling and unlikely to waste precious crawl resources on pages that haven’t changed in years. So it’s probably not an issue in the first place.

If you want to limit crawling of older content, you can restrict pagination on your topic pages to, say, 10 pages and serve a 404 on page 11 and beyond. That means you will end up with orphaned articles once they drop off the 10th page, but as Google never actually forgets a URL this isn’t a particularly big issue.
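The pagination cap described above could be sketched roughly as follows. The topic_page_status function, the 20 articles per page, and the way it is called are assumptions for illustration. In a real CMS this logic would live in whatever handles the topic page route.

```python
import math


def topic_page_status(page, total_articles, per_page=20, max_pages=10):
    """Return the HTTP status a paginated topic page should respond with."""
    # Anything outside the allowed range, including page 11 and beyond, is a 404.
    if page < 1 or page > max_pages:
        return 404
    # Pages past the last page that actually has articles are also a 404.
    last_page = max(1, math.ceil(total_articles / per_page))
    return 200 if page <= last_page else 404


for page in (1, 10, 11):
    print(page, topic_page_status(page, total_articles=500))
```

With 500 articles there are 25 pages' worth of content, but only the first 10 pages respond with a 200.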

Alternatively, you can implement pagination with a single ‘Next Page’ link at the bottom of the topic page and keep paginating for as long as there are articles to show. With only one pagination link on every paginated page, the crawl priority of deeper paginated pages will decrease (thanks to the PageRank damping factor) and Google will de-prioritise crawling those URLs accordingly.

In short, don’t worry too much about Google’s crawl effort on older URLs. It’s usually an imagined problem and rarely a real problem.

GSC Crawl Stats

Most of you will have your website verified as a property in Google Search Console, and you will see the Crawl Stats report for that main website - i.e. https://www.example.com.

But have you verified the entire domain in GSC?

With domain-wide verification, you get to see crawl stats reports in GSC for all subdomains associated with the main domain.

This can give you much more detailed insights into Googlebot’s crawling of your entire website, plus sometimes you find some unexpected subdomains in there that Google may be crawling (for example, secret staging sites that Google probably shouldn’t be crawling).

NESS 2023

In my previous newsletter I announced the 2023 edition of the News & Editorial SEO Summit, our virtual conference dedicated entirely to SEO for news publishers.

We were incredibly proud of the previous two editions, for which we managed to get truly awesome speakers on board. And once again, somehow, we pulled together another epic speaker roster with some huge names from SEO and publishing.

This year, the News & Editorial SEO Summit will take place on Wednesday 11th and Thursday 12th October 2023. We’re in the process of finalising the schedule and will serve up a range of amazing talks on AI & SGE, algorithm updates, topic authority, technical SEO, content syndication, SEO in the newsroom, and much more.

Get your tickets now, as early bird prices will only last until the end of August.

Miscellanea

It’s been a while since my last proper newsletter, so this round-up of interesting articles and resources will be a long one:

Podcasts:
I was a guest on two podcasts recently:

Media Voices, where I spoke about why some publishers are ranking despite being terrible websites, how AI will not be stealing our lunch just yet, and much more.
Strategy Sessions with my friend Andi Jarvis, where we touched on many aspects of SEO unique to news and a whole lot more.

Interesting Articles:

Google AMP: How Google tried to take over the web - The Verge
Google and Meta have made 6,773 grants to news publishers: What are they up to? - Press Gazette
Attitudes towards algorithms and their impact on news - Reuters Institute for the Study of Journalism
Inside the AI Factory: the humans that make tech seem human - The Verge
AI, the media, and the lessons of the past - Columbia Journalism Review
All of the internet now belongs to Google’s AI - Digital Trends
How to build a better search engine than Google - The Verge
Meta’s business model is crumbling in Europe - Coda Story

Official Google Docs:

Introducing INP to Core Web Vitals - Google Search Central
Understanding news topic authority - Google Search Central
Sitemaps ping endpoint is going away - Google Search Central

The latest in SEO:

Google Updates Canonicalization Help Document: Don't Use Canonicals For Syndicated Content - Search Engine Roundtable
Google News Indexing Disruption Resolved - Search Engine Journal
Authorship SEO for news publishers - WTF is SEO
Web Performance and SEO Guidelines - The Washington Post
Guard Your SEO: Are News Syndication Partners Hijacking Your Traffic? - NewzDash
We Analyzed Millions Of Publisher Links. Here’s How To Syndicate Your Content & PR For Free - BuzzSumo
Website Crawling: The What, Why & How To Optimize - Search Engine Journal
9 Internal Linking Case Studies - The SEO Sprint
5 HTML tags every news SEO should know - WTF is SEO
How to Succeed in Google Discover - Search Engine Journal
Ruthless Prioritisation in SEO - The SEO Sprint

Lastly, I’ve finally given up on the churning cesspool that Twitter/X has become. I’m sure I’ll miss the audience I built up over there, but we all have limits to what we’ll endure.

Currently my active socials are LinkedIn, Mastodon, and Bluesky. We’ll see what else may come around that provides meaningful engagement without the overwhelming stench of toxicity and doesn’t serve as a privacy-invading surveillance platform channeling sensitive data to questionable entities (which is why I am not and will never be on any Meta-owned property).

As always, thanks for reading and subscribing. Leave a comment if you have any thoughts or questions about Googlebot’s crawling, and I’ll see you at the next one.