Why Semantic HTML matters for SEO and AI

HTML is the foundational language of the web. Semantic HTML markup on your webpages can help machine systems better understand your content and its value.

Aug 06, 2025

I’ve had this post in drafts for a while, mostly as a container for me to drop bits into for when I got time to expand it into a proper newsletter. Then my good friend Jono Alderson published his excellent piece on semantic HTML, and for a few weeks I lost the will to complete mine.

But I thought I should finish my version anyway, as my focus is slightly different and perhaps a bit more practical than Jono’s. You should still definitely read Jono’s blog; it says all I want to say and more.

Semantic HTML

Let’s start with a quick overview of what semantic HTML is. As the language upon which the web is built, HTML is markup that surrounds text to provide it with structure.

The <p> tag around a block of content indicates that it is a paragraph of text. The <h1> tag around a sentence shows that it is the page’s main heading. The <ol> tag indicates the start of an ordered (usually numbered) list. The <img> tag indicates you’ll be loading an image onto the webpage. And so forth.

There was a time when semantic HTML was used to code every webpage. Content was surrounded by specific tags that indicated what each bit of content was meant for, and then CSS was applied to make it look good. It wasn’t perfect by any means, but it worked.

It also meant that you could look at the raw HTML source of a webpage and see what the page was trying to deliver, and how. The HTML signposted the structure and meaning of each bit of content on the page. You could see the purpose of the page just by looking at its code.

Then WYSIWYG editors and later JavaScript frameworks arrived on the scene, and HTML took a backseat. Instead of <p> and <table> we got endless nestings of <div> and <span> tags.

The end result is webpage HTML that lacks structure and has no meaning, until it is completely rendered in the browser and visually painted onto a screen. Only then will the user (and a machine system trying to emulate a user) understand what the page’s purpose is.

It’s why Google goes through the effort of rendering pages as part of its indexing process (even though it really doesn’t want to).

We know Google doesn’t usually have the time to render a news article before it needs to rank it in Top Stories and elsewhere. The raw HTML is therefore immensely important for news publishers. Good HTML allows Google to effortlessly extract your article content and rank your story where it deserves in Google’s ecosystem.

Semantic HTML is a key factor here. This is the reason why SEOs like me insist that an article’s headline is wrapped in the <h1> heading tag, and that this is the only instance of <h1> on an article page. The <h1> tag marks a webpage’s primary headline. It signposts where the article begins, so that Google can find the article content easily.
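To make this concrete, here is a minimal Python sketch of how a machine might pull the headline out of raw, unrendered HTML. It uses only the standard library's `html.parser`. The sample markup and the names `HeadlineFinder` and `find_headlines` are invented for illustration, and this is not how Google's systems actually work; it only shows how little effort a single, unambiguous <h1> demands from a parser.

```python
from html.parser import HTMLParser


class HeadlineFinder(HTMLParser):
    """Collects the text of every <h1> element found in raw HTML."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True
            self.headlines.append("")

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        # Text inside nested tags (such as <em>) still belongs to the headline.
        if self.in_h1:
            self.headlines[-1] += data


def find_headlines(raw_html):
    """Return the stripped text of each <h1> in the given HTML string."""
    finder = HeadlineFinder()
    finder.feed(raw_html)
    finder.close()
    return [headline.strip() for headline in finder.headlines]


# Hypothetical article markup, invented for this example.
SAMPLE_PAGE = """
<html><body>
  <div class="logo">Example News</div>
  <h1>Council approves new <em>cycle lanes</em></h1>
  <p>The vote passed on Tuesday.</p>
  <h2>Related stories</h2>
</body></html>
"""

print(find_headlines(SAMPLE_PAGE))  # prints ['Council approves new cycle lanes']
```

If the function returns more than one headline, or none at all, a parser has to start guessing which text is the real one.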

Which HTML tags are semantic?

Beyond the <h1> heading tag, there are many other semantic HTML elements you can implement that allow Google to more easily extract and index your article content. In no particular order, the elements you should be using are:

Paragraphs: Don’t use <div> and <span> tags to format the article into paragraphs. There’s been a tag for that for as long as HTML has existed, and it’s the <p> tag. Use it.
Subheadings: Use <h2>/<h3>/<h4> subheading tags to give your page structure. Within an article, use them to introduce specific sections of content. Also use them for the headers above concrete structural elements, such as recommended articles.
Images: Always use the <img> tag if you want to show an image that you’d like Google to see as well. Google explicitly recommends this.
Clickable Links: When linking to another page, either internal or external, use the <a> tag with an ‘href’ value containing the target URL. It’s the only kind of link that Google will definitely follow.
Relational Links: The <link> tag allows you to create a relationship between the current URL and another URL. This can be a canonical page, a stylesheet, an alternative language version of the current page, etc.
Lists: Bullet lists should use the <ul> tag, and numbered lists should use the <ol> tag. You can make them look however you want with CSS, but do use the list tags as the foundation.
Emphasis: When you want to highlight a specific word or phrase, there are semantic HTML tags you should use for that: <em> for emphasis (shown in italics by default), and <strong> for strong importance (shown in bold by default).

All the above tags, with the exception of <link>, are intended for the content of the webpage, providing structure and meaning to the text.
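The point about clickable links is easy to demonstrate. Below is a rough Python sketch of a very naive link extractor that, like a simple crawler, only trusts <a> elements with an 'href' value. The sample markup, the URLs, and the names `LinkCollector` and `collect_links` are all made up for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Gathers URLs from <a href="..."> elements only."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # ignore <a> tags that have no usable href
                self.links.append(href)


def collect_links(raw_html):
    """Return every href found on an <a> tag, in document order."""
    collector = LinkCollector()
    collector.feed(raw_html)
    collector.close()
    return collector.links


# Hypothetical markup: one proper link and two JavaScript-only "links".
SAMPLE = """
<p>Read the <a href="/politics/budget-vote">full budget story</a>.</p>
<span class="link" onclick="location.href='/sport/cup-final'">Cup final report</span>
<a onclick="openStory(42)">Live blog</a>
"""

print(collect_links(SAMPLE))  # prints ['/politics/budget-vote']
```

Only the first link is discovered. The other two may work fine for a human with a mouse, but to a parser reading the raw HTML they simply do not exist as links.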

There are additional semantic HTML tags that are intended to provide structure and meaning to the code of the page. These tags allow Google to identify different elements on the page, such as the navigation vs a sidebar, and process them accordingly.

[Image: diagram of HTML semantic elements, from W3Schools.com]

The <head> and <body> tags exist to separate the page’s metadata (in the <head>) from the actual content (in the <body>). Every HTML page starts with those two.
<header> can be used to wrap around the top section of the page (not to be confused with <head>), where the logo, navigation, and other stylistic elements sit.
<nav> should be used for your site’s main navigation. Mega menus, hamburger menus, top navigation links, whatever form your navigation takes, you should wrap it in the <nav> tag.
You can use <section> tags to divide your page into multiple sections. One section could be the article, another could be the comments below the article.
<article> is the tag that shows where the page’s actual main article text begins (including the headline). This is a very valuable tag for news publishers.
With <aside> you can indicate blocks of content like a sidebar of trending stories, recommended articles, or the latest news.
<footer> is used for, you guessed it, the footer of the webpage.
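Here is a small Python sketch of why these structural tags are so helpful to a machine. It keeps only the text inside <article> and ignores everything else, with no need to guess which <div> holds the story. The sample page and the names `ArticleExtractor` and `extract_article_text` are assumptions made for the example, not a description of any real crawler.

```python
from html.parser import HTMLParser


class ArticleExtractor(HTMLParser):
    """Keeps only the text that sits inside an <article> element."""

    def __init__(self):
        super().__init__()
        self.article_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth:
            self.article_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.article_depth:
            self.chunks.append(text)


def extract_article_text(raw_html):
    """Return the article text as one space-separated string."""
    extractor = ArticleExtractor()
    extractor.feed(raw_html)
    extractor.close()
    return " ".join(extractor.chunks)


# Hypothetical news page, invented for this example.
SAMPLE_PAGE = """
<body>
  <header><nav><a href="/">Home</a> <a href="/news">News</a></nav></header>
  <article>
    <h1>Storm closes coastal road</h1>
    <p>Flooding shut the A99 overnight.</p>
  </article>
  <aside><h2>Trending</h2><a href="/a">Other story</a></aside>
  <footer>Contact us</footer>
</body>
"""

print(extract_article_text(SAMPLE_PAGE))
# prints: Storm closes coastal road Flooding shut the A99 overnight.
```

The navigation, the trending sidebar, and the footer never make it into the result, because the markup itself says they are not the article.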

These structural semantic tags help search engines understand the purpose and value of each section of HTML. They enable Google to rapidly index your content and process the different elements of your pages appropriately.

There are many more semantic HTML tags at your disposal, for a wide range of purposes. Chances are there’s an HTML element for every imaginable use case. Rather than cram your code full of <div> tags to make something happen, first see if there’s a proper HTML element that does the trick.

How does it help AI?

We know that LLMs like ChatGPT and Perplexity crawl the open web for training data, as well as for specific user queries that require content from the web. What some of you may not know is that the crawlers behind these LLMs do not render JavaScript when they process webpages.

Google is the exception to the rule, as it has devoted a great deal of resources to rendering webpages as part of indexing. Because Google’s Gemini is the only LLM built on Google’s index, Gemini is the only LLM that uses content from fully rendered webpages.

So if you want to have any chance of showing up as a cited source in ChatGPT or Perplexity, you’d do well to ensure your complete page content is available in your raw unrendered HTML.
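A quick way to test this yourself is to look at the page the way a non-rendering crawler would: take the raw HTML, ignore the scripts, and see whether your content is there. The Python sketch below does this for two invented pages, one server-rendered and one that relies on JavaScript. The names `RawTextReader` and `phrase_in_raw_html` are hypothetical, and real crawlers differ in their details.

```python
from html.parser import HTMLParser


class RawTextReader(HTMLParser):
    """Collects human-readable text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.parts.append(text)


def phrase_in_raw_html(raw_html, phrase):
    """True if the phrase appears in the text of the unrendered HTML."""
    reader = RawTextReader()
    reader.feed(raw_html)
    reader.close()
    return phrase in " ".join(reader.parts)


# Hypothetical server-rendered page: the content is in the HTML itself.
SERVER_RENDERED = """
<body>
  <article><h1>Storm closes coastal road</h1><p>Flooding shut the A99.</p></article>
</body>
"""

# Hypothetical client-rendered page: the content only exists after JavaScript runs.
CLIENT_RENDERED = """
<body>
  <div id="root"></div>
  <script>
    document.getElementById("root").textContent = "Storm closes coastal road";
  </script>
</body>
"""

print(phrase_in_raw_html(SERVER_RENDERED, "Storm closes coastal road"))  # prints True
print(phrase_in_raw_html(CLIENT_RENDERED, "Storm closes coastal road"))  # prints False
```

To a crawler that does not execute JavaScript, the second page is an empty shell.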

Using semantic HTML to structure your code and provide meaning also helps these LLMs easily identify your core content. It’s much simpler for ChatGPT to parse a few dozen semantic HTML tags rather than several hundred (or even thousand) nested <div> tags to find a webpage’s main content.

If and when the ‘agentic web’ comes to life (I’m skeptical), semantic HTML is likely to be a crucial aspect of success. With meaningless <div> and <span> tags, it’s much easier for an AI agent to misunderstand what actions it should perform.

When you use semantic HTML for things like buttons, links, and forms, the chances of an AI agent failing its task are much lower. The meaning inherent in proper HTML tags will tell the AI agent where to go and what to do.
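As a rough illustration, the Python sketch below lists the actions a page offers purely from its tags: forms, buttons, and links. The sample markup and the names `ActionFinder` and `find_actions` are invented for the example, and I am not claiming that any real AI agent works this way. It only shows what the markup itself declares.

```python
from html.parser import HTMLParser


class ActionFinder(HTMLParser):
    """Lists forms, buttons, and links that are declared with semantic tags."""

    def __init__(self):
        super().__init__()
        self.actions = []
        self.button_text = None  # None means "not inside a <button>"

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "form":
            self.actions.append(("form", attributes.get("action", "")))
        elif tag == "a" and attributes.get("href"):
            self.actions.append(("link", attributes["href"]))
        elif tag == "button":
            self.button_text = ""

    def handle_endtag(self, tag):
        if tag == "button" and self.button_text is not None:
            self.actions.append(("button", self.button_text.strip()))
            self.button_text = None

    def handle_data(self, data):
        if self.button_text is not None:
            self.button_text += data


def find_actions(raw_html):
    """Return (kind, value) pairs for each action found, in document order."""
    finder = ActionFinder()
    finder.feed(raw_html)
    finder.close()
    return finder.actions


# Hypothetical sign-up block, followed by a <div> styled to look like a button.
SAMPLE = """
<form action="/newsletter/subscribe" method="post">
  <input type="email" name="email">
  <button type="submit">Subscribe</button>
</form>
<a href="/pricing">See pricing</a>
<div class="btn" onclick="subscribe()">Subscribe</div>
"""

print(find_actions(SAMPLE))
# prints [('form', '/newsletter/subscribe'), ('button', 'Subscribe'), ('link', '/pricing')]
```

The <div> that merely looks like a button is invisible here. An agent would need to render the page and interpret its scripts to discover that it does anything at all.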

What about Structured Data?

You may think that structured data has made semantic HTML obsolete. After all, with structured data you can provide machine systems with the necessary information about a page’s content and purpose in a simple machine-readable format.

This is true to an extent. However, structured data was never intended to replace semantic HTML. It serves an entirely different purpose.

Structured data has limitations that semantic HTML doesn’t have. Structured data won’t tell a machine which button adds a product to a cart, what subheading precedes a critical paragraph of text, and which links the reader should click on for more information.

By all means, use structured data to enrich your pages and help machines understand your content. But you should also use semantic HTML for the same reasons.

Used together, semantic HTML and structured data are an unbeatable combination.
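Here is a minimal Python sketch of how the two complement each other. It reads a JSON-LD block and the <h2> subheadings from the same raw HTML. The sample page and the names `PageReader` and `read_page` are invented for the example; the JSON-LD uses the schema.org NewsArticle type with made-up values.

```python
import json
from html.parser import HTMLParser


class PageReader(HTMLParser):
    """Reads JSON-LD blocks and <h2> subheadings from the same raw HTML."""

    def __init__(self):
        super().__init__()
        self.current = None  # "jsonld", "h2", or None
        self.buffer = ""
        self.json_ld = []
        self.subheadings = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self.current, self.buffer = "jsonld", ""
        elif tag == "h2":
            self.current, self.buffer = "h2", ""

    def handle_endtag(self, tag):
        if tag == "script" and self.current == "jsonld":
            self.json_ld.append(json.loads(self.buffer))
            self.current = None
        elif tag == "h2" and self.current == "h2":
            self.subheadings.append(self.buffer.strip())
            self.current = None

    def handle_data(self, data):
        if self.current:
            self.buffer += data


def read_page(raw_html):
    """Return (list of JSON-LD objects, list of <h2> texts)."""
    reader = PageReader()
    reader.feed(raw_html)
    reader.close()
    return reader.json_ld, reader.subheadings


# Hypothetical article page with both structured data and semantic markup.
SAMPLE_PAGE = """
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "NewsArticle",
 "headline": "Storm closes coastal road", "datePublished": "2025-08-06"}
</script>
<article>
  <h1>Storm closes coastal road</h1>
  <h2>Which routes are affected</h2>
  <p>The A99 is shut in both directions.</p>
  <h2>When the road reopens</h2>
  <p>Engineers expect to finish repairs by Friday.</p>
</article>
"""

structured_data, subheadings = read_page(SAMPLE_PAGE)
print(structured_data[0]["@type"])  # prints NewsArticle
print(subheadings)  # prints ['Which routes are affected', 'When the road reopens']
```

The structured data says what the page is and when it was published. The semantic HTML says how the content inside it is organised. Neither source contains the other's information.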

Build websites, not web apps

I could go off on a 2500-word rant about why we should be building websites instead of web apps and how the appification of the web is anathema to the principles on which the world wide web was founded, but I’ll spare you that particular polemic.

Suffice it to say that web apps for content-delivery websites (like news sites) are almost always inferior to plain old-fashioned websites. And websites are built, or should be, on HTML. Make use of all that HTML has to offer, and you’ll avoid 90% of the technical SEO pitfalls that web apps tend to faceplant themselves into.

News and Editorial SEO Summit 2025

The full speaker lineup and agenda for the 2025 News and Editorial SEO Summit is now complete! Take a look at the NewsSEO.io website for all the details, and if you don’t have your ticket yet you should grab it now.

Use the code barry2025 at checkout to get 20% off the ticket price.

Miscellanea

Despite the summer holidays, there’s no let-up in the stream of news and events in SEO and publishing. The fight between Cloudflare and LLMs is heating up, which brings me a lot of joy.

Latest in SEO:

Google users are less likely to click on links when an AI summary appears in the results - Pew Research
Google’s June 2025 Update Analysis: What Just Happened? - SEJ
June 2025 Core Update: Winners, Losers & Trends - Amsive
Goodbye, Featured Snippets: How SERP Features Have Evolved in the AI Era - Ahrefs
Major UK publishers have seen Google search visibility ‘drop by up to 80%’ since 2019 - Press Gazette
How to Build a Brand (with SEO) in a Post AI World - Harry Clarkson-Bennett
A publisher's playbook for YMYL and E.E.A.T - WTF is SEO?
Google’s Index is the Gatekeeper to AI - Indexing Insights
Danny Sullivan Steps Away From Google Search Liaison Role - SEJ

Interesting Articles:

Cloudflare Just Changed How AI Crawlers Scrape the Internet-at-Large - Cloudflare
Perplexity is using stealth, undeclared crawlers to evade website no-crawl directives - Cloudflare
Top trends from our latest look at the UK’s news habits - Ofcom
Google Discover adds AI summaries, threatening publishers with further traffic declines - TechCrunch
Google Seeks Licensing Talks With News Groups, Following AI Rivals - Bloomberg
Google's AI Overviews hit by EU antitrust complaint from independent publishers - Reuters
MythBusting Large Language Models - Joe Lochlann Smith
I sat in on an AI training session at KPMG. It was almost like being back at journalism school. - Business Insider
Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task - MIT
Not So Fast: AI Coding Tools Can Actually Reduce Productivity - Second Thoughts
The dangers of so-called AI experts believing their own hype - New Scientist
Google Offerwall explained: Easy way for publishers to test pay-as-you-go and ad-gated access - Press Gazette

That’s it for another edition. Thanks for reading and subscribing, and I’ll see you at the next one!

Janeth Duque

Aug 14

While most website visitors won’t know whether your HTML validates or not, the systems and tools they rely on absolutely do.

That includes:

• Search engine crawlers like Googlebot

• Assistive technologies like screen readers

• Web browsers on different devices and screen sizes

• Automated accessibility and performance audits

• Future developers who need to work on your site

When your code is full of errors, missing tags, broken structure, invalid nesting, or outdated syntax, it’s like asking those systems to read a book with missing pages, scrambled sentences, and blank chapters.

It’s for this reason that professional web developers validate their code. However, 98% of websites don’t have valid code.

The Web Design Industry as a Whole is a MESS!
