Think Info

Exploring the information space

Semantic long-form

Long form: it’s been the basis of communication for millennia. We tell stories; we’ve been successfully sharing concepts with others this way for as long as we’ve been recording history – indeed, long-form communication is perhaps the fundamental enabler of the very concept of history.

Why, then, do we have such trouble migrating this most basic form of communication to the digital realm? What about how we create, manage, maintain and distribute long-form content makes it machine-unreadable?

Personally, I blame Xerox.

Space Diner, by Chris Shipton

History and long-form

If we limit ourselves to written history – it gets too convoluted if we look also at oral tradition and the use of vocal and gestural modifiers to indicate semantics – it is clear that semantic writing has been with us for a long time. Imagine the works of the Bard: Shakespeare did not have a computer at his disposal, yet his work is highly semantic. There are acts and scenes. There is stage direction, action and dialogue. In short, there is enough information stored within that long-form, written with quill on parchment, for us, centuries later, to recreate his works as books, plays, screenplays, songs, video games, educational coursework and – should the financing of the technology be viable – as interactive virtual reality adventures.

How, you might justifiably wonder, could content created with such primitive tools be repurposed so efficiently while the latest technology platforms fail so miserably at providing a format for inputting and managing this same style of material in a way that allows for automated repurposing?

Old-style long-form semantics

Did Shakespeare mark up his prose with XML tags? Did he put little descriptors alongside each element of content he created? Not in so many words; but, yes, he did.

If you have ever read a play or film script, you will know there are presentation conventions that indicate the semantic value of each element. These semantic markers may not be as explicit as XML tags, but they are as effective.

Some old-style semantic elements you will recognise easily, while others may be more subtle or part of a particular industry’s jargon. Certainly you recognise the title, or the chapter heading. You are also familiar with the section-level heading.

The problem is that computers do not know the context of a piece of long-form: they have no basis on which to identify content as being governed by one semantic convention rather than another.

The blame game

You may be wondering why I would blame Xerox for our lack-of-digital-semantic-content woes.

I know, it can’t really be all their fault, but they are at the heart of the problem. They started the epidemic: WYSIWYG, a technical concept developed to pander to the old school of content creators – specifically the marketing departments – who understand the cultural semantic language of persuasion so well, they can manipulate your perception of things with presentation.

Simply put, the semantic mark-up within a WYSIWYG environment is left where it has been for millennia: with the human authors.

Honestly, the technology (read: storage space and conditional formatting engines) was not available when Xerox created WYSIWYG to allow designers to output to their laser printers. Xerox were doing the best they could at the time, to promote their new technology – you can’t really blame them for that. Unfortunately, it was the opposite of the approach attributed to Henry Ford: they delivered the “faster horses” the customer wanted.

Solutions to semantic long-form

I have outlined a problem; what then is the solution? How do we modify our long-form content in a way that provides all the elements a computer needs to understand it?

First, we need to do a little more in the way of deconstruction:

The anatomy of long-form content

Take any piece of long-form content – the longer the better. What does it consist of? What are the semantic parts? (I may not get it perfectly accurate for every selection, but I am not going to be too far off.)

The first thing we discover is that it has a title, and that title has a context that it covers – it is a top-level label for the entire body of the work. Now, in this case, there may also be some additional metadata – author, publisher, publication date. In XML parlance, that would be <work><metadata><author /><publisher /><publication-date /> </metadata><title /><content-body /></work>.
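As a rough illustration of that skeleton, here is a minimal Python sketch that builds it with the standard library's xml.etree.ElementTree. The element names are the ones given above; the function name build_work and the sample values are placeholders invented for this example, not part of any real system.

```python
import xml.etree.ElementTree as ET


def build_work(title, author, publisher, publication_date):
    """Build the <work> skeleton: metadata, title, and an empty content body."""
    work = ET.Element("work")

    # Metadata that describes the work as a whole.
    metadata = ET.SubElement(work, "metadata")
    ET.SubElement(metadata, "author").text = author
    ET.SubElement(metadata, "publisher").text = publisher
    ET.SubElement(metadata, "publication-date").text = publication_date

    # The top-level label for the entire body of the work.
    ET.SubElement(work, "title").text = title

    # Left empty here; front matter, chapters and appendices would go inside.
    ET.SubElement(work, "content-body")
    return work


# Placeholder values, for illustration only.
work = build_work("An Example Work", "A. N. Author", "Example Press", "2013-07-09")
xml_text = ET.tostring(work, encoding="unicode")
```

The point of the sketch is only that the title and metadata sit beside the body rather than inside it, so a system can read them without parsing the content itself.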

All elements of that structure are pretty straightforward, except <content-body>. It has an interesting internal structure. Therein, we have front matter, followed by an ordered list of chapters, and perhaps appendices. Ignoring the detail of the front matter and appendices, what do we have inside that sequence of <chapter>s?

Well, each chapter consists of a chapter title, and – yes, you guessed it – a chapter body. This is <chapter><chapter-title /><chapter-body /></chapter>. Again, the title is easy. It is the body that has structure. Maybe it has sections, each of which consists of paragraphs. Or let’s look at it more like this article, where each section has a bit more structure than that: Each section has a section title and a section body.

Perhaps the most important thing to be aware of here is that the section body continues… until the next section title. The “section title” heading weight implies, from visual semantics, that it applies to all content that follows it until you reach another section title, or the end of the chapter. And within a section, you can break things down into a sequence of sub-sections. Again with sub-section titles and sub-section bodies. Ad nauseam.

When you get down to the smallest element, you will find paragraphs. Sequences of them.
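The "body continues until the next title" rule is simple enough to sketch in code. The following Python is one possible way to turn a flat run of headings and paragraphs into nested sections; the Section class, the nest_blocks function and the tuple format of the input are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    title: str
    level: int
    paragraphs: list = field(default_factory=list)
    subsections: list = field(default_factory=list)


def nest_blocks(blocks):
    """Nest a flat sequence of blocks into sections.

    Each block is ("heading", level, text) or ("para", None, text),
    with heading levels starting at 1.
    """
    root = Section(title="", level=0)
    stack = [root]  # the chain of currently open sections
    for kind, level, text in blocks:
        if kind == "heading":
            # A heading closes every open section at the same or a deeper level.
            while stack[-1].level >= level:
                stack.pop()
            section = Section(title=text, level=level)
            stack[-1].subsections.append(section)
            stack.append(section)
        else:
            # A paragraph belongs to the most recently opened section.
            stack[-1].paragraphs.append(text)
    return root


blocks = [
    ("heading", 1, "The blame game"),
    ("para", None, "first paragraph"),
    ("heading", 2, "A sub-section"),
    ("para", None, "second paragraph"),
    ("heading", 1, "Solutions"),
    ("para", None, "third paragraph"),
]
tree = nest_blocks(blocks)
```

Here the second top-level heading closes both the sub-section and the first section, which is exactly what a reader infers from heading weight alone.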

Now, you may also find some additional elements: tables, sequenced lists, images, etc. Some of these elements work as paragraph-level blocks in the sequence; some do not have a specific order in the whole, but act as metadata to a paragraph block.

The three media

Now that we have a better idea of the real semantics of long-form, we need to look at it from the perspective of technology. Of interest to us here are the three layers of the content process:

  • Creation
  • Storage
  • Presentation


If we think about this first in the context of presentation, for it is the aspect of the technology we are generally most familiar with (even the technophobes), we are talking primarily about HTML for web presentation, or XML for most others. The truth is, these formats have the functionality they need to present content in a way that reads semantically. The real structure may not be semantically valid, but the general conceptual consistency of presentation platforms means that at least visually, the semantics are represented.

If this were all we needed, WYSIWYG would be good enough.


However, we need to look also at the storage medium. Here, we need more semantic information. We need enough data that we can convert the stored content to whatever output is being requested, breaking it up in sensical ways. (There is a certain high profile tech blog that paginates posts – I have seen an example where their algorithm splits the last section over the boundary of pages 1 and 2… which wouldn’t be so bad if that last section weren’t two paragraphs of two lines each!)

In order to empower output of this long-form content to a plethora of devices and formats, we need every last bit of semantics stored. It doesn’t matter if this is XML, or a database of serialised blocks. The important thing is that the system be aware of the title-and-block structure, and the sequential elements within each block. It needs to be aware when a media element is accompanying metadata to a paragraph, and when it is a paragraph-level block in its own right.

As well as adding this title-and-block semantic approach, I would add an additional tag/container element that XML does without: a specific identifier. The reason for this – within an XML-based system – is that there is no real constraint on text in an XML structure. You can put anything you like anywhere you like. The additional element simply locks down textual elements that are real text content. It is an inline element, and can only contain actual text. This means, for example, that if you have a paragraph that includes a reference to a person, the structure would be along the lines of <p><text>This is about </text><person reference="person-id"><text>John</text></person><text> who is up to something.</text></p>

This level of in-body semantics provides a lot more flexibility to the presentation layer – it can output as a single block of text with the link and a contextually imported image alongside the paragraph, or it can render an in-line link on John. Or as text without any sort of link. The key is that it provides an additional layer of information for the system to work with.
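To make that flexibility concrete, here is a small Python sketch that takes the paragraph structure described above and renders it two ways: as plain text, and as HTML with an inline link on the person. The function names and the /people/ URL scheme are hypothetical; only the element names and the person-id reference come from the example in the text.

```python
import html
import xml.etree.ElementTree as ET

PARAGRAPH = (
    '<p><text>This is about </text>'
    '<person reference="person-id"><text>John</text></person>'
    '<text> who is up to something.</text></p>'
)


def render_plain(p):
    """Join every <text> element in document order, ignoring all other markup."""
    return "".join(t.text or "" for t in p.iter("text"))


def render_html(p, url_for):
    """Render <text> as escaped text and <person> as an inline link."""
    parts = []
    for child in p:
        if child.tag == "text":
            parts.append(html.escape(child.text or ""))
        elif child.tag == "person":
            label = "".join(t.text or "" for t in child.iter("text"))
            href = html.escape(url_for(child.get("reference")))
            parts.append('<a href="%s">%s</a>' % (href, html.escape(label)))
    return "<p>" + "".join(parts) + "</p>"


p = ET.fromstring(PARAGRAPH)
plain = render_plain(p)
linked = render_html(p, lambda ref: "/people/" + ref)  # hypothetical URL scheme
```

Because both renderers read only the <text> elements, any stray text sitting outside them is simply never output, which is the lock-down effect the dedicated element is meant to provide.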


Now that I have basically said that storage needs to be painfully granular, this doesn’t leave many options for content creation. Plainly, that same level of semantic definition must be supplied by the author.

Or must it?

Some parts, like the association of section bodies with the section title, can be automated easily enough. The rules are very simple. The big difference between the old-style WYSIWYG editor and one that supports semantic mark-up is that this input environment would actually have semantically meaningful titling levels, not those abstract <h2>, <h3> and <h4>.

The second aspect of input is the insistence, when an author wants to mark something up, that they provide a justification. So, you want a bulleted list? Don’t tell me it is a bulleted list: tell me why it is a list at all.

A semantic content editor would be contextual to the type of content being entered, so metadata would be available to identify the type of list being created. The presentation layer may choose to render that as an unordered list, or as an ordered list. Or it may render it as a carousel if that is the appropriate output styling for the list type the author is defining.
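One way to picture this is a lookup from the author's stated reason for the list to a rendering chosen by the presentation layer. The sketch below is purely illustrative: the purpose names ("procedure", "alternatives", "gallery"), the channel names and the choose_presentation function are assumptions, not features of any existing editor.

```python
# Hypothetical rules: (why the list exists, output channel) -> rendering.
PRESENTATION_RULES = {
    ("procedure", "web"): "ordered-list",
    ("procedure", "print"): "ordered-list",
    ("alternatives", "web"): "unordered-list",
    ("alternatives", "print"): "unordered-list",
    ("gallery", "web"): "carousel",
    ("gallery", "print"): "unordered-list",
}


def choose_presentation(purpose, channel):
    """Pick a rendering from the list's purpose; fall back to a plain list."""
    return PRESENTATION_RULES.get((purpose, channel), "unordered-list")


web_gallery = choose_presentation("gallery", "web")
print_gallery = choose_presentation("gallery", "print")
steps = choose_presentation("procedure", "web")
```

The author only ever states the purpose; whether that becomes bullets, numbers or a carousel is decided later, per output, without touching the stored content.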

Digital semantic long-form

Fundamentally, the first step in implementing this digitisation of semantic long-form is to modify content storage – the inner workings of the CMS – to retain a full semantic structure to the content. We need to do away with the single blob of styled rich text. We have the technology, processing power and storage space to make such a system work.

It’s a phased approach: we need to implement the storage layer (which can still support the horrid “rich text”), then swap out WYSIWYG editors for WYSISMUC editors. We can accomplish this shift without the authors even realising that their old work horses have transformed into sleek, high-powered engines of semantic efficiency.

Reference Links

Space Diner, by Chris Shipton (@ChrisShipton)
Wikipedia History of WYSIWYG
An introduction to WYSISMUC


3 responses to “Semantic long-form”

  1. Mark Baker 2013/07/09 at 00:54


    This really all sounds like the classic case for structured authoring, which today, for better or worse, means XML. Storage really is not a problem, if CMS designers can learn to see past SQL. Swap out the RDBMS and swap in either an XML database, such as eXist, BaseX, or MarkLogic, or else maybe something like MongoDB. Swap out the WYSIWYG editor and swap in an XML editor component with forms capability, such as oXygen. That’s the basics of what needs to change architecturally to do this.

    I’m not sure I understand what you mean when you say: “I would add an additional tag/container element that XML does without: a specific identifier.” XML is a metalanguage and lets you define any tag/container you want. The example you give of what you mean is perfectly ordinary XML, and very similar in intent to what I describe here:

    True, you don’t see this kind of usage in document-oriented XML very much because most of the people designing those systems still think in terms of publishing markup. They have yet to get their heads around the idea that the real purpose of XML is to turn content into a database that can be queried. More on that here:

    By the bye, the samples in those posts also illustrate my number one AX principle: never ask an author for information that is not part of their subject matter expertise.

    • Rick Yagodich 2013/07/09 at 06:53

      Thanks Mark
      In answer to the second paragraph, the reason I would add in that additional text element is two-fold. On the one hand, XML has a quirky shortfall around whitespace, so if you have an inline tag, it can easily lose a padding space between the body text of the outer tag and the inline tag (you’ve seen it – all those sites where links end with the following text attached, rather than a space away). A dedicated text tag provides that.

      On the other hand, because of XML’s flexibility, you can put text absolutely anywhere, without restriction. But that is not wanted. Having a dedicated text tag inside everything else basically allows you to discard any – read all – text that is not specifically, structurally sound.

      • mbakeranalecta 2013/07/09 at 14:00

        It’s true that whitespace handling can be tricky, and there are some tools that get it wrong. All of Altova’s tools incorrectly discard whitespace-only text nodes, and don’t support XSLT’s whitespace handling declarations. I reported this to them something like 8 years ago, and last time I checked, they still had not fixed it. (I think they actually consider it an optimization which makes the handling of data-oriented XML — which does not have mixed content — slightly faster.)

        Similarly, you have to be careful in how you process your XML, particularly in XSLT, where you can have issues with whitespace in the stylesheet finding its way into the output. But if you understand it, whitespace can be handled reliably.

        Restrictions in XML depend entirely on which schema language you are using. XSD, for instance, does not allow you to require a specific element within a mixed content element. But Schematron certainly does. Schematron lets you implement any restriction you can express as an XPath expression, which is pretty much anything. I think the new version of XSD may provide similar capability.

        On the other hand, in any situation in which you want to require the use of an element inside of mixed content, I would suggest it probably makes sense to move the textual content from the instance to the stylesheet, at which point you can require just the elements you want. (I find it hard to imagine a case for this where the text content isn’t boilerplate.)

        So I’m pretty sure that XML can do what you are trying to do, if you use the right tools.

        On the other hand, I would not object at all to an effort to come up with something better for content than XML. I’m not sure that the people who created XML expected it to replace SGML for defining markup languages for content creation. They wanted it to be a lingua franca for the Web — a role increasingly played by JSON today.

        XML is not well designed for content. The question is, is it too well entrenched to be displaced? I can see no rhyme or reason for when established standards persist or fall, so maybe there is an opportunity to displace it.

        The growing success of Markdown suggests that there is a growing appetite for simple markup solutions for authoring (where the author is not being asked for information outside of their domain of expertise!). A semantic markdown is something worth pursuing, and if you want to start such an effort, I’d be glad to participate.
