Semantic markup: adding meaning to the everyday web

Dave Everitt

As contributors to a web currently embarking on a journey of semantic transformation, the least we can do is learn the language…

We all need the machines we use to understand the data we enter into them, in order for others to retrieve it in meaningful ways. Without the work of countless programmers teasing human meaning from the sea of collective data, Google's search results would be drowned under waves of irrelevance (remember those early search engines? And the web scammers' abuse of the 'keywords' HTML meta tag?).

The story of machine-readable¹ data is the story of computing, perhaps most elegantly summed up (for geeks) in an axiom coined by the Extreme Programming movement² as 'the code is the comment' which, to the rest of us, means "don't attempt to explain what each part of your code does by cluttering it up with comments, when you can suggest its function by structuring the code intelligently". Other users - and their machines - are then able to understand the function simply by looking at the form.

Every heading in this article is marked up as a sequenced heading without a single 'bold' or 'italic' tag, and it's peppered (okay, a little over-seasoned to illustrate the point) with semantic markup…

Good, old, HTML

In the ASCII plain text of the internet before HTML and the web, asterisks, underscores and other signifiers from the ASCII character set (i.e. most, but not all, of the stuff on your keyboard) were used to indicate bold, italic, and other kinds of primitive formatting (this - oddly - survives in Microsoft Word, where adding a pair of asterisks or underscores either side of a word converts it to bold or italic text).

Then Hyper Text Markup Language (HTML) arrived with an array of tags designed to enable us all to mark up our otherwise unstructured text into meaningful blocks consisting of headings, paragraphs, tables and lists; with quotations, links, emphasis and other discrete and similarly meaningful inline elements scattered throughout that text. It distinguished between two kinds of tags for text markup: physical and logical. Physical tags just make something bold, underlined or italic, while logical tags are used to define the relationships between the various elements of your content. However, logical tags tended to get ignored because people weren't sure what they meant or how they would appear in a browser, and b and i were easier to remember.

These logical tags are essential to semantic HTML markup because - unlike bold or italic - they describe what they enclose. Emphasised or cited text isn't merely italic and - even though the em and cite tags are rendered in italics by default in web browsers - the machine knows you mean something emphasised or cited, and can process, sort, store, filter, re-present etc. accordingly.
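To make the distinction concrete, here is a minimal sketch of the same sentence marked up twice. The sentence and the book title are invented for illustration. Both versions look alike in a browser's default styles, but only the second tells the machine why the text is bold or italic.

```html
<!-- Physical markup: says only how the text should look -->
<p><b>Warning:</b> you <i>must</i> read <i>The Example Handbook</i> first.</p>

<!-- Logical markup: says what each piece of text is -->
<p><strong>Warning:</strong> you <em>must</em> read <cite>The Example Handbook</cite> first.</p>
```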

Well-formed, with hidden depth

The aim of HTML was (and still is) to encourage authors of web content (us) to learn this finite set of simple tags to lend usability, readability and portability³ to web-based information. Yet, despite the fact that an increasing number of people are authoring web content, many web content providers (like bloggers) remain ignorant of the subtleties of HTML. Its potential semantic power is thereby trampled under a stampede of ill-structured content, with little regard for the poor machines that have to read that content and re-present it to our (consequently even poorer) readers.

Even more crucial, the assistive technology used by disabled readers can make far more sense of HTML code that is well-structured⁴ than of (to take a common example) headings unintelligently marked 'big and bold' instead of being identified by a proper HTML heading tag. Listen to a screen reader zip through a badly-structured page of HTML; you'll hear a breathless rush where the (non-)headings and paragraphs blur seamlessly into a single and confusing stream. A rich client experience isn't all about mashups and AJAX widgets, it's also about well-formed information with hidden depth (isn't that how we'd like to appear to others?).

When typing up a blog entry, who (go on, put up your virtual hands) uses dfn for a defined term, cite for the source of a quote and abbr for abbreviations? Or the invaluable title attribute (hover over the word for instant recognition) to add extra snippets of meaning? Who even uses the logical HTML tags for strong and em or - depending on context - cite, instead of the physical tags b and i for bold and italic? As contributors to a web currently embarking on a journey of semantic transformation, the least we can do is learn the language.
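For anyone whose virtual hand stayed down, a sentence in a blog entry might be marked up something like this. It is only a sketch: the wording is invented, and the title attribute sits on abbr, where its meaning is least ambiguous.

```html
<p>
  A <dfn>tag</dfn> is a short label in angle brackets that tells the
  machine what a piece of text is. Tags are the building blocks of
  <abbr title="Hyper Text Markup Language">HTML</abbr>.
</p>
```

Hovering over 'HTML' in most desktop browsers shows the expansion as a tooltip, and the machine can tell that 'tag' is the term being defined.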

What does 'semantic' mean to a machine?

The collective information pool that will be 'Web 3.0' (yes, I know we're still accommodating version 2.0, but 3.0 is already on the horizon⁵) will be built on the ability to get meaningful information from any networked source, move it around, manipulate and re-present it in multiple ways. For this to work, at least some of that information must have a structure of some kind, and there are various initiatives designed to enable this to happen, some of which are already in use: the Dublin Core Metadata Initiative, microformats (like the hCard format), domain-specific ontologies, and the rest; including, of course, tags (but tags don't offer structure, just some extra information about that whole chunk of data). The machine can be programmed to understand our messy input, but it helps if we meet it at least halfway.
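As a rough illustration of meeting the machine halfway, here is a small hCard. The person, organisation and contact details are made up; the class names (vcard, fn, org, email, tel) are the ones the hCard microformat defines.

```html
<!-- hCard: the class names tell a parser what each value means -->
<div class="vcard">
  <span class="fn">Jane Example</span>,
  <span class="org">Example Press</span>,
  <a class="email" href="mailto:jane@example.com">jane@example.com</a>,
  <span class="tel">+44 1632 960000</span>
</div>
```

To a human reader this is just a line of contact details. To software that understands hCard it is a name, an organisation, an email address and a phone number that can be lifted straight into an address book.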

Refining the computer science formula
input > process > output
by examining the intentions behind it, we get something like this:
well-formed data > machine-readability > maximum human usefulness

Adapting the above formula for (say) the model web author:

1. programmers (humans, mostly) design a language for structuring information so that machines can distinguish its semantic elements (a list here, a quotation there, a paragraph, a heading…);
2. the web author (still human, but also an HTML programmer) follows these procedures when inputting their information;
3. the machine re-presents that information so that blogs, news feeds, search engines, databases, etc. receive it in a structure that preserves the author's original meaning.

But this remains in the realm of theory for the many web authors who have fallen into bad habits, applying visual styles from the formatting palettes of Word or Dreamweaver instead of using the available style sheets or semantic HTML to provide meaningful structure. The second stage above fails and the data appears to the poor machine as a mush: a long, unformatted string of text with the occasional instruction to make something appear 'bold' or 'large'. The author thinks it's a heading, but the machine is a dummy, and needs the heading to be marked up as such; otherwise it simply won't see a heading at all.

Semantic markup for dummies

How did our information get into such a mushy state? Like this: we're in the flow. We've typed a few paragraphs and need a heading. We hit return twice (after all, we need a space before the heading, right?) and type. It doesn't look like a heading so we make the text size larger. It still doesn't stand out enough, so we hit the [B] button. We pause for a micro-second and admire our 'heading'. But all the machine sees is a paragraph in large bold text with a blank line before it, because that's all it is.

Let's run through the above sequence again, semantically. We've just typed a few paragraphs and need a heading. We hit return once (leaving out the blank line), type in the heading and choose 'Heading 2' from Word's styles palette (Word documents can be semantic, too) or an HTML h2 tag ('heading 1' would be the title of the article, which we've already marked up with 'Heading 1', 'Title' or - in HTML - <h1>. Google loves a nice, explanatory <h1> tag at the start of a page…).
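In HTML, the two sequences might come out something like this. It is a sketch, not the output of any particular editor: the heading text is invented, and the first version uses the kind of presentational markup that visual editors have tended to produce.

```html
<!-- The mushy version: an empty paragraph for spacing, then a
     paragraph that merely looks like a heading -->
<p>...the end of the previous paragraph.</p>
<p>&nbsp;</p>
<p><font size="5"><b>Why structure matters</b></font></p>

<!-- The semantic version: a real second-level heading -->
<p>...the end of the previous paragraph.</p>
<h2>Why structure matters</h2>
```

A screen reader or search engine finds one heading in the second version and none in the first.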

However, since appearance often dictates our choices and software makes 'big and bold' easy, people avoid the semantic option because the default styles for these headings don't look right (too big, too much space, wrong font, etc.). But default styles (in both Word and HTML) are exactly that - default styles, so they can be changed (bold, font size, spacing before and after, etc.). But this isn't a tutorial, so you can go off and find out how to define styles elsewhere. It's not that hard :-)

To sum up:

all articles have a single title, which (if it's the only article on the page) will be a level one heading ('Heading 1' in Word or <h1> in HTML);
most articles benefit from a few subheadings ('Heading 2' or <h2>);
headings can be styled however you want, in both Word and HTML;
when you use that heading again, it assumes the same style.
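Pulled together, those four points might look like the page below. Treat it as a sketch under a few assumptions: the titles and text are placeholders, the sizes and margins are arbitrary, and the styles sit in a style element only to keep the example in one piece.

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>An example article</title>
  <style>
    /* Override the browser defaults once; every h2 then looks the same */
    h1 { font-size: 1.6em; margin: 0 0 0.5em; }
    h2 { font-size: 1.2em; font-weight: normal; margin: 1.5em 0 0.25em; }
  </style>
</head>
<body>
  <h1>An example article</h1>
  <p>Opening paragraph.</p>
  <h2>First subheading</h2>
  <p>More text.</p>
  <h2>Second subheading</h2>
  <p>Still more text.</p>
</body>
</html>
```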

What about the rest of the text?

Semantic seasoning for HTML text

It's all plain old semantic HTML from here on (or POSH), because HTML has all those neat tags that describe exactly what they enclose which, with Cascading Style Sheets (CSS), can be made to appear exactly as we wish. See the cursor change to a question mark over that word with a dotted underline? That's a dfn tag with two styles: cursor:help; border-bottom:1px dotted #333; (the semicolons here signify the end of each style). I should add that CSS styles like these are best stored in separate style sheets, not inline with the HTML.
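For the curious, the two styles just mentioned could be written as a single rule. In this sketch the rule sits in a style element so the example is complete on its own, and the sentence around the dfn is invented.

```html
<!-- On a real site this rule would live in a separate style sheet -->
<style>
  dfn {
    cursor: help;                    /* pointer becomes a help cursor on hover */
    border-bottom: 1px dotted #333;  /* dark grey dotted underline */
  }
</style>

<p>Writing <dfn title="plain old semantic HTML">POSH</dfn> is mostly a matter of habit.</p>
```

Because the rule targets the dfn tag itself, every defined term on the site picks up the same look without any extra markup.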

Those nice programmers made it all read just like English (okay, with US spelling, but you can't have everything; US-English HTML and CSS markup languages are now the lingua franca of the web, and there's no going back).

So (apart from paragraphs and headings) what are these logical HTML tags with semantic meaning? Here's a list of the most useful (tags are linked to in-depth descriptions from the WHATWG site):

dfn (actually classed as a 'physical' tag) - a defined or specialist term, usually expanded with the 'title' attribute
em - emphasis (italic by default)
strong - extra emphasis, or 'loud' (bold by default)
cite - the title or source of something quoted or referred to (italic by default)
code - to display lines of example computer code
kbd - text a user must type in
sup - superscript (usually for footnotes)
abbr - abbreviations and acronyms, usually expanded with the 'title' attribute
q - a short inline quotation (adds quotes, not supported in older versions of Internet Explorer, but can be styled with CSS)
address - what it says - use br tags for line breaks

All of these can be used on HTML pages, in blog entries, and anywhere else HTML tags are allowed. There are, of course, also the HTML list tags and various others - see the References and further information below.
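Here are a few of the tags from the list at work in a short passage. This is only a sketch: the text, the footnote number and the address are invented. Note also that HTML5 narrows address to contact details for the author of the page or article.

```html
<p>
  To list a folder's contents, type <kbd>ls</kbd> and press Enter.
  In a style sheet, <code>cursor: help;</code> changes the pointer.
  As the saying goes, <q>the least we can do is learn the language</q>.<sup>1</sup>
</p>

<address>
  Example Press<br>
  1 Example Street<br>
  Exampletown
</address>
```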

The tip of the semantic iceberg

Semantic markup goes far deeper than text emphasis, lists and addresses. It applies to the entire content of a web page, and to chunks of data within that web page like contact details, for which an array of best practice rules is emerging. That's another article, though.

If you want to know how this is being carried forward in HTML5, the good news is that most of it still applies. See the 'Text-level semantics' section of the HTML5 specification.

Notes

[1] Human readable information means documents in the traditional sense which are intended for human consumption. While these may be transformed, rendered, analyzed and indexed by machine, the idea of them being understood is an artificial-intelligence problem not addressed by the client's Web software. Machine-understandable documents, therefore, contain data explicitly prepared for machine reasoning, which can become part of a future semantic web.

[2] For an excellent critique of Extreme Programming, read: Martin Fowler (Chief Scientist, ThoughtWorks), Is Design Dead?

[3] An architectural rule which the SGML community embraced is the separation of form and content. It is an essential part of Web architecture, making possible for information to be displayed effectively independent of the display device, and greatly aiding the processing and analysis.

[4] If you already write a bit of HTML and CSS, and want to know how to do it better, here's an excellent introduction to well-structured markup, semantic organization of information into content blocks and making your HTML structurally ready for CSS: Virginia DeBolt, The Early Bird Catches the CSS: Planning Structural HTML (on Wise-Women.org). A word of warning: never use CSS hacks, so don't follow the link to Long-Term CSS Hack Management; instead, Conditional Comments are the way to tame the various versions of Internet Explorer.

[5] For more about Web 3.0 see The Web 3.0 Manifesto - The Knowledge Doubling Curve, posted by roschler, November 24, 2006.
For an amusing and informative sideways view of the 2.0 hype, see Jeffrey Zeldman, Web 3.0, January 16, 2006.

References and further information

Web Architecture from 50,000 feet (1998)
from: Tim Berners-Lee, Design Issues, Architectural and philosophical points.

Semantic (X)HTML on the MicroFormats Wiki.

The BBC's own semantic markup standard.

HTML5: A technical specification for Web developers.