Think Info

Exploring the information space

On third-party transclusion

On 25 August 2014, Sorin Pintilie (@sorpeen, http://www.sorpin.com/) published an article on The Pastry Box Project, discussing a mechanism that would allow content to be transcluded into a web page, by applying an href="…" attribute to a <p> tag. This article is a response to that.

Transclusion is the inclusion of a small element of content from one source into other material, by reference. The transcluded content is presented as an integral part of the final material – at the point of reference – while remaining dependent on its primary source. It is included at presentation time. The principle of transclusion was part of the original description of hypertext, as published by Ted Nelson in 1965.

There are two variants of transclusion. The first, as envisaged by Nelson, is the easier: content reuse within a single publishing environment. Sorin’s article, and this one, deal with the second type: including a snippet of someone else’s content in your own publication.

Sorin’s approach to third-party transclusion follows a model that is common today: hack the problem at the browser level. He suggests attaching an href="…" to a <p> tag using a new syntax: <p href="http://samplelink.com/^^firstkeyword...lastkeyword">…</p>.

While I can see how this approach is designed to get a quick solution to the issue – indeed, there is a working model linked to from Sorin’s article – I see four fundamental issues with it:

  • Source obsolescence
  • Implementation at browser level
  • The syntax
  • Applying it to the <p> tag

Source obsolescence

The most obvious issue is this approach’s dependence on the source content staying intact. If the transcluded source is rewritten – with the keywords removed, reordered, or duplicated – the link integrity will be jeopardised.

Sorin’s model partially overcomes this, almost by accident. As a result of cross-site scripting restrictions, his sample offering only works by parsing the transcluded snippet remotely – requesting it from the same location from which the supporting scripts are served, with that environment also caching the snippet.

While caching helps, it is a fragile arrangement. The cache would need to be indefinitely persisted, including when its environment is replaced. The link to the underlying source could become defunct at any time, including before it has ever been cached. At this point, there is no value in transclusion: we might as well just copy the source material.

Implementation at browser level

The second issue is more general, one that seems to be the de facto approach to many upgrades in web functionality: hack it into browsers using jQuery, until browser developers decide they ought to support it directly.

This approach has been proven to work with other functionality. There have been quite a few successful polyfills.

But transclusion is too fundamental to be hacked in this way. There are governance implications that must be considered. The source needs to be aware that it is being transcluded by a third party.

The syntax

The syntax issue has multiple parts to it. The first is that it is not – and probably never would be – valid URI syntax. The URI specification does not allow for unilateral changes or extensions to its syntax and semantics (and the web community does not tolerate attempts).

Even if unilateral extension were not an issue, there is the inherent uncertainty of identifying the snippet correctly simply from a text match, even if the keywords are allowed to be multi-word.

Also, the ... is clearly supposed to be an ellipsis (…), but as per Sorin’s sample, is just a series of three full stops. I presume that both formats would need to be supported, in case someone actually wrote it correctly.

Also, there is the question of why one would reinvent the wheel. The XPointer specification (a system for addressing components of XML-based internet media) already provides addressing features for identifying anything you might want to transclude, including arbitrary ranges of text.
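
To illustrate – and this is only a sketch, with an invented address and element positions, built from the string-range() and range-to() facilities in the W3C xpointer() scheme drafts – a range of text could be addressed along these lines:

    http://example.com/article.xml#xpointer(string-range(//p[2],"firstkeyword")/range-to(string-range(//p[4],"lastkeyword")))

Everything non-standard lives in the fragment identifier, whose semantics are legitimately defined by the media type, rather than in a new URI syntax.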

Applying it to the <p> tag

Lastly, I take issue with the use of the <p> tag as the container for the transcluded content. What happens when we want to transclude not just a sentence, but multiple paragraphs? Or two half paragraphs? Transclusion into a <p> tag is a hack that only considers an edge case.

We already have one element fully capable of being a transclusion container: the iframe. And even if we figured it wasn’t capable enough for various reasons (size control, for starters), there are much better options available.

An approach designed for the reusable content age

It would be totally unfair of me to stop here. I have just ripped holes in Sorin’s proposed approach, so I am duty bound to explain how to do it better.

Implementation at content level, and source persistence

The first problem we need to solve is source obsolescence. How do we ensure content can be transcluded without needing to rely on a caching work-around? How do we ensure that when the source moves, the link remains valid? How do we inform the source that it is being transcluded by a third party, so the owners of that content can apply some governance to it?

This goes hand in hand with the best place to implement the model. It requires new implementations of content management. It requires content to be considered as content, not as strings of text within web pages. The core implementation must be at the level of the server that holds the content to be transcluded. It must be aware that its material is being reused.

As Eliot Kimber (@drmacro) commented in reviewing this article, transclusion is not something that can be patched into the web; it has to be built in at the lowest level.

The model I see working requires new functionality in content management systems, and in browsers. When an author wants to transclude content, he would highlight it in a page and perform a copy-like action. This action would not copy the selected text; it would query the server and obtain a specific URI for that snippet. The server would then know that snippet was to be transcluded. The returned URI is a clean reference to the content we want, from the master source.
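
As a sketch of what that copy-like action might involve – every detail here (the endpoint, the parameters, the returned address) is invented for illustration, since no such standard exists yet – the exchange with the source server could look something like this:

    request:  create snippet reference at http://source.example/ (hypothetical endpoint)
              page  = /2014/08/some-article
              start = first words of the selection
              end   = last words of the selection

    response: snippet URI = http://source.example/t/4f2a9c

The important part is that the source server, not the transcluding author, mints and records that URI.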

When the owners of the transcluded content later edit it, they have several choices. They can move it to a different page, because the snippet URI links to the content, not the page it appears on. They can edit it, so the reference updates to the new content. Or they can completely rewrite their content in a way that makes the originally transcluded content redundant, which has two sub-scenarios: retain the snippet so it can still be referenced as archived material, or mark it as obsolete.

Of course, there is the equally important question of how to inform all the parties who are transcluding the content that it is changing. I don’t have a particularly good answer to that, except to say that broadcast feed technologies already exist, so reusing one of those is likely the most viable option.

As to the question of what happens when the content source organisation replaces their content management system, it is a safe bet that any organisation using a platform that supports third-party transclusion, with content referenced in this way, would only migrate to another platform that provided the same. As long as the site itself exists, so will the snippets.

The syntax, and the container

With the source issue resolved, we need to turn to how we would reference content to be transcluded. We have already established that a transcluded snippet will have its own source-server-provided URI, so we do not need any special syntax in the request.

However, we do need to standardise how transcluded content will be returned. There need to be rules: agreements to avoid the requesting page being spammed, while also avoiding any tendency to create shell sites that simply rip others’ content. (Several of these were considerations in Nelson’s original hypertext project (Project Xanadu), and are still relevant.)

  • The transcluded snippet should be allowed to limit its length (considering copyright and fair-use guidelines – if you attempt to transclude a massive article, you might only get a paragraph and a half, and the reader would need to go to the source for more). Whether this will occur should be identified when the URI is obtained.
  • If the transcluded content begins or ends mid-paragraph, it will include leading and trailing ellipses (…) as appropriate.
  • The transcluded content should be wrapped in appropriate tags, effectively delineated as paragraphs (thereby allowing multi-paragraph transclusion).
  • The transcluded snippet should include three additional elements after the imported content: an optional element identifying the author, a link to the primary home of the transcluded snippet, and an optional identifier to indicate that the transcluded version is no longer current. There would need to be an option in the second element identifying it as an orphaned snippet that has no source article.
  • Transcluded content could support the embedding of advertising (allowing the reuse of ad-supported content), but there should always be an ad-free option (which might just mean it has a lower length threshold).
  • The transcluded content should not contain any CSS or script references (this is an issue for some of the more commonly transcluded content). Indeed, it would be fair to strip them out entirely if they were served. This would help ensure that the delivered snippet used semantically clean structure.

The fact that the transcluded content can be complex – that it may contain multiple elements – means that there are only four reasonable elements to pull the content into.

One is the iframe. The problem here is height. As we cannot know the height of the transcluded snippet, we end up with whitespace or scrollbars. Extension of the iframe definition to support a fit-to-content directive may be an option.

The next obvious element to use is the <div>. It is a simple container, which allows complex content.

The third option, and I believe the best, is the <figure> element. It has a simple benefit over the <div>, in that it supports the <figcaption> sub-element, which would be a perfect container for the author, source link and out-of-date markers.
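
As a sketch only – the address is invented, the class name is my own, and the exact structure would be for a standard to define – a snippet returned under the rules above and delivered into a <figure> might look like this:

    <figure class="transclusion">
      <p>… latter half of the first transcluded paragraph.</p>
      <p>A complete middle paragraph, delivered as clean, style-free markup.</p>
      <p>The start of the final paragraph, truncated at the source’s length limit …</p>
      <figcaption>
        Jane Doe –
        <a href="http://source.example/t/4f2a9c">original source</a>
        (this snippet is no longer current)
      </figcaption>
    </figure>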

The last option, which would require the structure of the returned content to be more tightly controlled, would be to simply use the <a> element, with a role of transclusion (<a href="…" role="transclusion">).

The DIY transclusion approach

While we wait for the web to catch up with Nelson’s ideas (which are now nearly half a century old), several organisations have already put in place mechanisms to allow their content to be transcluded by third parties. Commenting systems are one obvious example. But perhaps the most visible today is Twitter, which provides a simple syntax and script to embed tweets into any content.

Twitter’s approach demonstrates some basic positive thinking: they define the snippet size you can transclude (the tweet) and provide certain surrounding structure (the source, favourite and retweet counts, and favourite and retweet functions). The syntax is predefined – all you need is the tweet’s identifier.
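
For reference, the markup Twitter asks you to paste is roughly of this shape (the handle, status identifier, and wording below are placeholders):

    <blockquote class="twitter-tweet">
      <p>Text of the tweet…</p>
      &mdash; Author Name (@handle)
      <a href="https://twitter.com/handle/status/123456789012345678">25 August 2014</a>
    </blockquote>
    <script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>

The blockquote stands on its own if the script never loads; the script upgrades it into the fully rendered, interactive tweet.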

The downside is that in order to include tweets, you need to load Twitter’s scripts. This is primarily because of cross-site scripting restrictions, and because of the embedded retweet functionality. Twitter’s model is not reusable as is. Everyone wanting to make their own content transcludable needs their own model, their own syntax, and their own scripts.

So long as there are only a few sources that anyone cares to transclude, this can work. But it is not a long-term approach to third-party transclusion.

6 responses to “On third-party transclusion”

  1. Pingback: Transclusion Will Never Catch On | Every Page is Page One

  2. Mark Baker 2014/09/15 at 15:09

    Interesting post, Rick. I started on a comment, but it quickly assumed the proportions of a blog post in its own right: http://everypageispageone.com/2014/09/15/transclusion-will-never-catch-on/

    • Rick Yagodich 2014/09/16 at 08:13

      An interesting response, Mark, though I would disagree with a fundamental principle. Yes, you quote David Weinberger’s description of the fundamental nature of the web, but some would call that a fundamental flaw of the web. The fundamental nature is the link – the coupling mechanism. Yet so many of them are defunct; the web is frayed.

      For transclusion to work, basic content self-awareness is required. Content needs to know that it is transcluded. So the underlying platform of the web needs that built in to it. Some semblance of governance is required. I would never suggest that the right to remove one’s own content should be denied. But if we cannot have some stability – if content moves around at random, is not on the same address today as it was yesterday, or an address has completely different content tomorrow than it does today – then no link has any validity. We can no longer trust any claim of subject affinity. (And no, you cannot just rely on a search engine; if you think you can, speed up the changes to addresses/content until they cannot keep up.)

      Transclusion does not fly in the face of tight cohesion and loose coupling. If anything, it tightens the cohesion. And while it creates a slightly tighter coupling, it is a smarter coupling that knows to release one end when the other lets go…

      • Mark Baker 2014/09/16 at 14:45

        Rick,

        “The Web is frayed”. No, that is the wrong metaphor. When a cloth becomes frayed, it becomes weaker and frays more easily. Eventually, the cloth becomes too weak to hold together and it disintegrates. That does not happen on the Web.

        One of the characteristics of tightly coupled systems is that they are subject to cascade failures. Once one thing breaks, the likelihood of the next failure increases until you get an unstoppable collapse. Broken links do not cause a cascade failure in the Web. The server returns an HTTP 404 and life goes on.

        Of course, if too many links were broken, the Web would indeed cease to function. But in fact the Web does not reach that state. The Web acts as a filter. Old content with broken links slowly sinks out of sight as the Web’s various filters move more current and higher quality content to the top. Thus the Web remains adequately connected without the need for central management of its links.

        “For transclusion to work … some semblance of governance is required.”

        Yes, and that’s exactly my point, because the genius of the Web is that no semblance of governance is required. Yes, we need agreement on a few basic protocols and arbitration of addresses, but beyond that no governance is required to place content on the Web, and that is exactly the reason that the Web has eaten the world.

        “But if we cannot have some stability … then no link has any validity.”

        The Web does have some stability. It has as much stability as the individual contributors of content give it without central coordination, and that is not just enough, it is just exactly the right amount for the Web to become the dominant information platform of our time. The Web has succeeded not in spite of a lack of governance, but because of it.

        But links don’t have validity. They implicitly say, “It was here last time I looked” and that is all they claim. This is the essence of loose coupling. Links don’t have to have validity, because nothing bad happens if they break. The Web goes on.

        This offends our sense of organization and order. By everything the physical world taught us about organization, the Web should just not work. Yet it does work. It works brilliantly. It succeeds far better than previous attempts to impose high-level order on information.

        Our instincts tell us that the larger a collection becomes, the greater the need for organization and top-down control. The Web shows us that the opposite is true: that at a large scale, top-down organization breaks down and inhibits both growth and findability. (Free markets tell us the same thing.)

        “Transclusion does not fly in the face of tight cohesion and loose coupling. If anything, it tightens the cohesion.”

        Tight cohesion is a desirable property in an individual module. Things inside an individual module may well be tightly coupled. (Top-down organization works well at the small scale.) But the cohesion of that module depends on it not having a dependency on the implementation of another module. That is precisely the dependency that transclusion creates. Transclusion breaks the cohesion of the individual object.

        At the same time, it tightens the cohesion of the system, which is to say that it violates loose coupling and introduces the possibility of cascade failure. (I can speak from experience on this having spent several days last month sorting out a complex set of cascade failures in a large DITA project that were caused by complex interlocking transclusions.)

        The point of tight cohesion of modules is to encapsulate the places where tight coupling is necessary so that the system as a whole can be loosely coupled.

        “And while it creates a slightly tighter coupling, it is a smarter coupling that knows to release one end when the other lets go…”

        And that is exactly the point on which this discussion hinges. By this definition, a smarter coupling is the one that contains more information. That is a tempting definition, but the coupling that contains more information also creates a tighter coupling, which makes adding nodes to the network harder and makes cascade failures more likely — all of which is not a smart thing to do.

        The smartest coupling, therefore, is actually the one that contains the least amount of information necessary to create a coupling at all.

        And where that should lead us is this: rather than trying to impose greater order on the collection of content, we should focus instead on making the individual units of content that we create have better internal cohesion. Every Page is Page One.

  3. userinnovation1 2014/09/15 at 18:11

    Rick, interesting post. I believe transclusion already happens frequently, through embedding. WikiMedia supports this. I think the larger issue is how to describe what one wants to include from another source. To me, it makes no sense to try to grab a paragraph without first having some semantic container. In the future I expect people will, using HTML, be able to grab and reuse content from elsewhere using something like schema.org linked data. It’s true most systems are set up to do this now out of the box, but it will be a query call (SPARQL? JSON? not sure) to a URI to get the information, rather than necessarily being a hard-coded URL.

    You may be interested in a recent post I’ve published mentioning transclusion: http://storyneedle.com/four-approaches-content-reuse/

    • Rick Yagodich 2014/09/16 at 08:23

      I agree, Michael. There are many existing cases of transclusion. But I am pretty sure they all work on the same model that Twitter uses: the source provider specifies a syntax, and access to a script.

      If we are going to move beyond that hack – and I believe we need to – then we need the fundamentals of transclusion to be built in to the hosting platforms. By all means, those who do not want to responsibly govern their own content can opt out of being transcludable (or, conversely, those who do can opt in).

      As to the query call vs hard-coded URL, I think it has to follow the URL approach, which, internal to the server, can be translated into the query. Otherwise, there is the assumption of content immutability, which we all know is a myth.
