Why Articles Should Be Optimised Before Publishing
One of Google's quirks means that once an article has been crawled and indexed, any changes won't necessarily be picked up by Google until it's too late.
This is an expansion of a concept I touched upon in my previous newsletter, but I felt it needed a proper deep-dive to explain its intricacies.
I have to start with a disclaimer: Much of what follows is speculation, based on my experiences and those of other SEOs I’ve spoken with over the years, and entirely unconfirmed by Google.
So what follows is not ‘established SEO knowledge’ by any means. It’s theory and hypothesis that matches observable facts, but could still be totally wrong.
Of course I don’t think it’s wrong (otherwise I wouldn’t be sharing it with all of you), but I’m not convinced it’s 100% right either.
With that disclaimer out of the way, let’s dig in.
Anyone who’s worked in online publishing for any length of time will have first- or second-hand knowledge of something like this: An article is published and contains an error. It may contain a blatant spelling mistake in the headline, a poor choice of phrase, a factual error, or something else.
Immediately after it’s been published, the article is amended and the error fixed. But, for some reason, Google has indexed the initial, erroneous version, and it just will not update its index and show the corrected article.
Hours pass and the initial version of the article containing the error remains in Google’s index and shows in search results. And then, finally, many hours later, Google reindexes the article and the corrected version starts showing.
By this stage the article is considered old for a news story: it no longer shows up in Top Stories, and has fallen down the rankings in Google News as well.
To explain why Google is often slow to update its index and show an article’s most recent version, we need to understand how its crawling systems work.
We tend to base our understanding of Googlebot on what we see in relevant reports in Google Search Console: There’s a mobile crawler, a desktop crawler, a page resource crawler, and some other miscellaneous crawlers looking at specific file types like images and videos. And of course Google’s unruly AdsBot which obeys its own set of rules.
But I don’t believe this is an accurate representation of Googlebot’s actual crawling system.
I believe Googlebot, independent of its user-agents, is actually a tiered crawling system with at least three distinct crawling processes:
Allow me to explain.
1. Priority Crawler
The first and most aggressive crawling process is what I call the Priority Crawler (or the Realtime Crawler). This crawling process is focused on crawling VIPs: Very Important Pages.
These VIPs are high value webpages that have many inbound links, change very frequently, and are regularly and consistently shown on the first page of Google’s results.
VIPs are pages like popular ecommerce homepages (think Amazon.com and Etsy.com), classified portals (job boards, property sites, etc.), and other pages that are very popular and have a high turnover of content.
VIPs also include pages like news website homepages and key section pages.
Some news homepages are crawled by Googlebot as often as once every five seconds. And this is because these news sites have a lot of incoming link value, they have top rankings in many Google search results, and there’s a high probability that Google will find a new article whenever it crawls the site’s homepage or one of its main section pages.
Some publishing sites produce staggering amounts of content (more than 500 articles a day is not uncommon), and Google is eager to crawl and index this so it can make sure its news surfaces, like Top Stories, contain the latest stories.
So this Priority Crawler, which crawls homepages and key section pages very frequently, is eager to find new URLs to crawl and send on to Google’s indexing process.
But, and this is crucial, once the Priority Crawler finds a new URL and crawls it, it then passes that URL on to the second crawling process - the Regular Crawler.
The Priority Crawler doesn’t revisit a newly discovered article. The Priority Crawler is focused on crawling VIPs, and once it finds new URLs it will crawl them almost instantly but then promptly forget about them. It’s then left to the Regular Crawler to recrawl those URLs and pick up any changes.
Such a division of crawl activity allows Google to optimise its Priority Crawling process for speed, ensuring worthwhile new content is rapidly discovered and added to Google’s index.
And at the same time, Google’s second tier Regular Crawling process doesn’t have to be super fast and can carefully manage its crawl queue to focus on URLs that deserve to be recrawled.
2. Regular Crawler
The Regular Crawler is Google’s main crawling process that does most of the work. The web has trillions of URLs and the Regular Crawler’s job is to decide which of those should be recrawled to check for changes.
Many signals feed into the Regular Crawler’s crawl queue - in reality, multiple crawl queues with various purposes, focusing on different signals and running as multithreaded processes across multiple data centres - and new URLs are added to the queue when they’re first crawled.
The Priority Crawler will do a lot of discovery and send loads of new URLs to the Regular Crawler for crawling, and of course the Regular Crawler also does a lot of discovery itself. So these crawl queues - basically lists of URLs for Googlebot to crawl in sequence - are constantly being updated and changed as new URLs are added and new signals are taken into account.
I suspect the Regular Crawler is also the crawling process that fetches page resources for the rendering stage of Google’s indexing process.
The Regular Crawler is not as urgent as the Priority Crawler. Once a newly published URL has been crawled and indexed, it is added to the Regular Crawler’s crawl queue and will be recrawled at some stage - often many hours, if not days, after the initial crawl.
The speed with which an already crawled URL, such as a news article, will be recrawled depends on many different signals. These include the timestamp shown with the article on a site’s homepage or section page, its <lastmod> element in relevant XML sitemaps, and whether or not the article is requested for indexing in Google Search Console.
3. Legacy Crawler
The third tier of crawl processes is what I call the Legacy Crawler. This is a crawling process that focuses on old and unimportant URLs.
Google has been around for 25 years now, and it has crawled an astonishing number of URLs in that time. As part of its mission to make the world’s information accessible, Google doesn’t like to ‘forget’ URLs.
Think of what happens to a news story after it disappears from a site’s homepage and section pages: It fades into the archives of the publisher, still accessible and part of the site’s history, but no longer news and no need to be crawled very often.
Initially the Regular Crawler will revisit the URL, especially if changes have been made to the article. But after a while, a few months down the line, there’s no point in continuing to revisit the article. If it’s not evergreen and won’t be updated, why recrawl it?
This is where the Legacy Crawling process comes in. Google will keep the article in its index, but at the same time will want to make sure the article still exists and hasn’t changed. So the Legacy Crawler will occasionally recrawl the article, even when there are no signals that the article should be recrawled.
The Legacy Crawler also recrawls URLs that once served content but now give a 404 Not Found or 410 Gone error. Google wants to make sure these errors persist and the URL hasn’t been reinstated, so it’ll sometimes recrawl those old URLs to make sure there’s still a Page Not Found error shown there.
The Legacy Crawler has its own crawl queue, and has a very low sense of urgency. It’ll crawl at its leisure, checking if old pages still exist and ensuring URLs that have been known to Google for years are still live.
The three distinct tiers of Googlebot crawling - Priority, Regular, and Legacy - have a profound impact on news publishers and how they should approach their SEO.
When a news article is published, Google’s Priority Crawler will find it almost immediately. The delay between publishing and crawling is very short - usually a few minutes at most - after which the article will be indexed and can be shown in Top Stories.
Any changes made to an article after Google’s initial crawling & indexing may not be picked up by Googlebot’s Regular Crawler until much later. Hours could pass before the Regular Crawler decides to recrawl the article, and Google’s index is updated to reflect any changes made.
At this stage the article isn’t news any more. For its Top Stories boxes, Google has a strong preference for newer articles. An article that is hours old may not appear in Top Stories anymore if Google has many newer articles to show there instead.
The article, with its updates and improvements, is relegated to the News tab or the Google News vertical where it attains a fraction of the traffic it could’ve achieved in Top Stories, before finally disappearing from Google’s news surfaces entirely.
The consequence for news publishers is that you need to get an article’s SEO right before it is published. You get one chance at achieving Top Stories prominence, and that chance is the moment you click ‘publish’.
Any changes or improvements made to an article after it’s been published are not guaranteed to be picked up by Google in time to make a difference on its Top Stories visibility. In fact, chances are those changes won’t be seen by Google until long after the article has ceased being news.
This is why SEO needs to be an integral part of your publishing workflow. Optimising an article for Google after it’s been published is usually fruitless, as these optimisations won’t be seen in time to make a difference to the article’s performance.
Does that mean that once an article is published, there’s no way to improve it for SEO? No opportunity to get a second chance at Top Stories?
Not quite. We do have some methods at our disposal to improve a published article’s recrawl rate.
First, we can send signals to Google that the article has changed, in the hopes that the Regular Crawler revisits it. These signals are:
Updated timestamp on the article
Updated <lastmod> element in your XML sitemaps
Updated dateModified property in the NewsArticle structured data
Prominent placement on the homepage and/or main section page
Changed headline (especially the internal link headline)
URL submitted for indexing in GSC
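To illustrate, the sitemap and structured data signals from that list might look something like this (the URL and timestamps are placeholders, not from a real site):

```xml
<!-- XML sitemap entry: update <lastmod> to the time of the article's latest edit -->
<url>
  <loc>https://www.example.com/news/breaking-story</loc>
  <lastmod>2023-10-03T14:30:00+00:00</lastmod>
</url>
```

And the matching NewsArticle structured data, with dateModified reflecting the same edit:

```json
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "headline": "Breaking Story: Major Development",
  "datePublished": "2023-10-03T09:00:00+00:00",
  "dateModified": "2023-10-03T14:30:00+00:00"
}
```

Keeping the on-page timestamp, the sitemap <lastmod>, and dateModified consistent sends Googlebot one coherent ‘this page has changed’ signal rather than conflicting dates.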
These signals can help Googlebot understand that a known article has changed and should probably be recrawled. But even with all these signals, there is no guarantee that Googlebot will actually recrawl and reindex the article with any urgency.
So we have one final trick up our sleeve: Change the URL.
If you absolutely want to guarantee Googlebot will recrawl and re-index the article, you can use the foolproof method of changing its URL. This will make Google see it as an entirely new article, and crawl and index it immediately.
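If you do change an article’s URL, you’ll typically also want the old URL to redirect to the new one so readers and inbound links still reach the story. A hypothetical server-level sketch (the paths are made up for illustration):

```nginx
# Hypothetical nginx rule: the original article URL 301-redirects to the
# republished version, while Googlebot discovers and indexes the new URL
# as a fresh article.
location = /news/breaking-story {
    return 301 /news/breaking-story-major-update;
}
```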
When a major news event happens, the journalistic instinct is to immediately cover it even if there is limited information available. Get the story out there, even if it’s just a headline, a single line of text, and a ‘More to follow’ disclaimer. Then, as new facts are uncovered and verified, you expand the article and provide more coverage of the event.
But, if you accept the tiered model of Googlebot crawling, you will understand that this is not an ideal approach to breaking news.
Instead you may need to find ways to update the breaking news story on your website in such a way that Googlebot will recrawl and reindex it constantly - by feeding those aforementioned signals into Google, or by republishing the story with a new URL every time there is a major update to report.
Don’t Be First - Be Best
I would also emphasise that being the first to publish a breaking news story isn’t necessarily good for that story’s potential Google traffic.
Take this simple graph, a typical example of a news topic’s popularity curve. Once a news event happens, the public gradually becomes aware of it, and the volume of searches on Google steadily increases until it peaks and then drops off:
When is the ideal time to publish your story on this news event? Is it as soon as possible after the event happens?
By the time the search volume on this news event reaches peak popularity, your story is relatively old. Publishers that were late to the party have newer stories, which often also contain more information.
And as we know, Google has a strong preference for newer articles in Top Stories, especially if those newer articles are more detailed and contain better information.
So you may just want to hold off on publishing your breaking news story - or, if you absolutely have to publish immediately, you will want to publish a new article (or change the existing article’s URL) when there’s a major development to report.
Over the years I have heard dozens of publishers complain that their breaking news loses out on traffic because Google prefers to fill its Top Stories boxes with later articles from other publishers.
Some ways to prevent this from happening are to publish your article a bit later to coincide with the projected peak in search volume, to keep your coverage updated and publish new articles on the event, and/or to encourage Googlebot to recrawl and reindex your content.
What about Live Articles?
I believe articles that have the LiveBlogPosting structured data are probably an exception to the Priority/Regular crawling division. Live coverage articles (that are recognised as such by Google) will remain in the Priority Crawler’s crawl queue for around 24 hours or until the coverage has ended (as defined in the coverageEndTime property).
This allows Google to recrawl live articles frequently and ensure the article’s presence in Top Stories is accurate and up to date.
You can use this for breaking news as well. Instead of a standard article, consider a Live article for a breaking story. That way you can be among the first to publish on a breaking news event and still reap the benefits of frequent recrawling and reindexing in Google’s search ecosystem.
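A minimal LiveBlogPosting sketch, based on the documented structured data format (all values are placeholders):

```json
{
  "@context": "https://schema.org",
  "@type": "LiveBlogPosting",
  "headline": "Live: Major News Event",
  "coverageStartTime": "2023-10-03T09:00:00+00:00",
  "coverageEndTime": "2023-10-04T09:00:00+00:00",
  "liveBlogUpdate": [
    {
      "@type": "BlogPosting",
      "headline": "First confirmed details",
      "datePublished": "2023-10-03T09:05:00+00:00",
      "articleBody": "Initial details of the event as they come in."
    }
  ]
}
```

Each new development is appended as another liveBlogUpdate entry, and a coverageEndTime in the future signals that the coverage is still ongoing.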
Unique to News
These challenges are unique to news publishers, due to the context of news in Google’s search results. Websites that lack the urgency of news don’t have to deal with ‘one chance’ SEO opportunities.
A non-news website can tweak its pages to maximise their SEO and continually improve, as the speed with which Google crawls and indexes is rarely an issue and popularity curves are often more gradual and seasonal.
Want More Like This?
If you enjoyed this newsletter, I will be presenting more about Google’s crawling and indexing systems at the 2023 News and Editorial SEO Summit next week.
In my talk I’ll give an overview of the current state of technical SEO, including details on Googlebot’s tiered crawling and (spoiler!) tiered indexing infrastructure, and the latest developments in tech SEO for publishing sites.
I’ll be joined at NESS by some of the best and brightest in SEO and publishing:
Wil Reynolds, CEO at Seer Interactive: Keynote on the Future of search, SEO and SEOs in the era of AI.
Glenn Gabe, SEO Consultant at G-Squared: Major Google Algorithm Updates and Their Impact on News Publishers.
Jes Scholz, Marketing Consultant: The Need for Speed - SEO Strategies for Rapid Crawling.
Claudio E. Cabrera, VP of Audience at The Athletic: How Audience and News SEO can Influence a Newsroom.
Lily Ray, Senior Director at Amsive Digital: Winning in Google Discover (Without Losing in Organic Search).
Richard Nazarewicz, Global SEO & Discovery Lead at BBC Studios: Taking Newsroom SEO Priorities Through to Product and Tech Roadmaps.
Anna Sbuttoni, Deputy Audience Editor at The Times & Sunday Times: How SEO Is Reshaping ‘Classic’ Newsrooms.
Kevin Indig, Growth Advisor: AI won’t replace writers. It will make them 10x better. This is how.
John Shehata, CEO & Founder of NewzDash: Elevating Your Enterprise SEO Game - Scaling Tactics for Massive Websites.
Check out the full schedule on the NewsSEO.io website where you can also buy tickets to this online event.
If you haven’t yet bought your NESS23 ticket, you can use the barry25 coupon code at checkout to get 25% off your purchase.
As usual I’ll end with my customary roundup of interesting articles and resources published in the last while.
Official Google docs:
An update on Web Publisher Controls - Official Google Blog
A Guide to Google Ranking Systems - Google Search Central
Influence your Byline Dates in Google Search - Google Search Central
The Comedy of Errors - Google Search Central
Why Discover traffic might change over time - Google Search Central
The Value of News Content to Google is Way More Than You Think - Tech Policy Press
Top 50 biggest news websites in the world - Press Gazette
Latest in SEO:
Google August 2023 Core Update: Winners, Losers & Analysis - Lily Ray, Amsive
The September 2023 Google Helpful Content Update - Glenn Gabe, GSQi
Former Googler: Google ‘using clicks in rankings’ - Search Engine Land
Ask the experts: paywalls, subscription and SEO - The Audiencers
That’s it for another edition. As always, thanks for reading and subscribing. Feel free to leave a comment if you have any questions, and I’ll see you at the next one!