Better metadata through a DITA lens

Dec 4, 2017

From “Linda Lens”, artist unknown, via Petapixel

Modern ways of using metadata connect groups, countries, and organizations. Structured content such as DITA XML provides powerful ways to work with this metadata, but also poses challenges.

In structured authoring, metadata is best defined as “data about content”. It may describe what the content is about, who wrote it, or what it is intended to achieve. It is used to make content searchable, discoverable, manageable, interchangeable, analyzable, and more easily integrated with other systems. Structured authoring allows us to apply metadata to granular chunks of content. For example, in DITA XML, it can be applied to maps and topics, but also to elements and attributes within those topics.

Just as DITA made structured content far easier to adopt and work with, standards around metadata allow different systems and organization- or industry-specific vocabularies to be linked together successfully. This post talks about some of those standards in relation to structured authoring, as well as the basic principles behind all metadata.

It’s hard to define “metadata”

Principles of metadata

The NISO standards body distinguishes four types of metadata (http://www.niso.org/publications/press/understanding_metadata). The first three are:

Descriptive (for finding and understanding information, e.g. values from a subject taxonomy)
Administrative (technical and legal information, e.g. data about image formats to use in rendering, or copyright data to decide when and where content can be used)
Structural (the relationship of resources to other resources and container objects)

As NISO point out, these are not mutually exclusive. For example, the same piece of metadata that describes an article’s author for internal administration purposes may also enable the author byline to be displayed when the article is published.

The fourth type is markup languages, a category that reflects how these languages mix metadata and content together more closely than other content storage approaches do. The metadata in markup languages can still be classified under the first three types, but it is applied to nested, granular chunks of content: block-level and inline (phrase-level) elements. This allows ways of working with content that are not possible with traditional document- or page-oriented content management. For example:

Phrases and keywords can be automatically hyperlinked to definitions or related information without risking broken links when content changes, moves, or is deleted.
In search interfaces and elsewhere, more relevant paragraphs and blocks of information (sometimes known as microcontent) can be suggested — not just the title and short description or other snippets derived from naive indexing of literal strings.
In conversational interfaces, useful short answers (microcontent again) can be sourced from long articles provided that the inline elements in those articles are suitably annotated.
By transforming the markup into web pages enriched with the Schema.org vocabulary, search engines can understand the content better and use snippets of information in link previews.
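The first point in the list above can be sketched with DITA keys, one mechanism that supports this kind of indirect linking. The two fragments below are illustrative only: the key name, file paths, and sentence are assumptions, and whether a `<term>` with a `keyref` is rendered as a hyperlink depends on the processor and output format.

```xml
<!-- Fragment 1, in the map: bind a key to the topic that currently defines the term -->
<map>
  <title>Product guide</title>
  <keydef keys="factory-reset" href="glossary/factory-reset.dita"/>
  <topicref href="tasks/resetting-the-device.dita"/>
</map>

<!-- Fragment 2, in any topic: refer to the key, never to the file path -->
<p>If the device stops responding, perform a
  <term keyref="factory-reset">factory reset</term>.</p>
```

If the glossary topic moves or is replaced, only the `keydef` in the map needs to change; the topics that mention the term stay untouched.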

Key ideas and terms

In any publishing environment, metadata follows the same basic pattern: a series of statements is made about a piece of content. We can break each statement into three components:

The content itself, or an identifier that represents it
Some data about the content
A component that represents the type of relationship between the content and the data that describes it

For example:

In traditional data modelling terms, the content is an Entity; the relationship an Attribute; and the data a Value of that attribute. In Linked Data terminology / RDF, the same conceptual components are understood as analogous to natural language statements and are called Subject, Predicate, and Object. However, in an RDF statement the Object can alternatively represent another entity, and the Predicate a relationship with that entity. As Entity / Attribute / Value are more specific to the case of describing content directly, I'll use those terms for the rest of this post.
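As a rough illustration of the two terminologies, here is the author statement written in RDF/XML. The topic URI is a hypothetical identifier invented for this sketch, and the choice of the Dublin Core creator properties is an assumption. The first property holds a plain value, and the second shows the RDF case where the Object is another entity.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <!-- Entity (Subject): the content, identified by a hypothetical URI -->
  <rdf:Description rdf:about="http://example.com/topics/factory-reset">
    <!-- Attribute (Predicate): creator. Value (Object): a literal string -->
    <dc:creator>Francis-Noël Thomas</dc:creator>
    <!-- Same kind of Predicate, but here the Object is another entity, referenced by URI -->
    <dcterms:creator rdf:resource="http://example.com/users/0800200c9a66"/>
  </rdf:Description>
</rdf:RDF>
```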

As in any technical field, a variety of terms are in use for similar ideas. To help with definitions for the rest of the post, here are three important ideas, with some of the terms associated with them:

Schema / Ontology / Vocabulary: In a broad sense, these define containers of information (entities / classes) and the relationships (properties / elements) between these containers and the data they hold. When “vocabulary” is used on its own, it typically has this sense.
Taxonomy / Thesaurus / Controlled vocabulary / Controlled list: These structures define key concepts that can be used as data in metadata statements. This is often descriptive data about the resource (entity) that is the subject of the statement, for example the themes of a document, the user goals it addresses, or the products it relates to. A taxonomy concept does not become metadata until it is used as part of a statement: in an element or field, as a property of the resource it describes. The compound term “controlled vocabulary” always means a taxonomy rather than an ontology / schema.
Linked Data / Semantic Web: These terms describe an increasingly popular approach to defining and using metadata schemas and taxonomies. In the NISO document’s list of “Notable Metadata Languages: Examples in Broad Use”, nearly all use this approach, or at least have a published version that uses it. The approach is to express and publish semantic data so that the entities within are unambiguously defined and so that different schemas and usages (whether within an organization or across the web) are easily linked. It uses web technologies to achieve this, particularly URIs (now IRIs) as identifiers. These are typically in URL format, which can confuse people new to the Linked Data approach: they may link to useful information about the identified entity, but they don’t have to. A key technology in Linked Data is RDF, the framework that provides the subject - predicate - object terminology used above.

The following sections outline two key use cases for metadata in DITA: describing content with taxonomy, and understanding DITA structures in relation to external schemas or ontologies.

Describing DITA content with taxonomy and controlled lists

We often need to tag content using values from an external source such as an enterprise or industry taxonomy. This makes it easier for all users, internal and external, to find the content, for example via navigational hierarchies or faceted search interfaces. It can also power intelligent content recommendations, ease information sharing, and help with analyzing data on how the content is being used.

In structured content, particularly in markup languages, taxonomical metadata can describe content chunks at various levels of granularity. In DITA, these levels could be root maps, sub-maps, topics, block elements, and phrase-level elements. For example:

These examples can be expressed in DITA in various ways. The topic author could be stated using the available <prolog> metadata. (This is an example of the powerful ability of markup languages to use nested text containers, or elements, to provide metadata describing their parent objects.)

<task id="topic">
    <title>…</title>
    <prolog>
        <author>Francis-Noël Thomas</author>
    </prolog>
    …
</task>

Storing metadata directly in document XML or attached to it?

If the author value will be manually keyed in (or if the editing environment can restrict prolog element values to an allowable list), this could be a good option. Most authoring tools display element text by default and allow it to be directly edited.

However, if the author name is sourced from an external list of values, we might need to use a unique identifier for that name (which also makes life easier if the author changes name at some point). This could still be placed in the <author> element, or it could go in an attribute. Attributes are often better locations for data that is not intended to be readable or translatable. In addition, their values can be controlled by Subject Scheme maps.

Thus, using a specialized attribute, we could either add the value to the <author> element in the prolog, or, as the prolog structure becomes rather redundant semantically, directly to the root topic element itself (here, <task>):

<task id="topic" author="francis-noel-thomas">
    <title>…</title>
    …
</task>
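To show how a Subject Scheme map might constrain that attribute, here is a minimal sketch. It assumes the specialized `author` attribute from the snippet above; the `authors` and `another-author` keys are hypothetical, and a real map would normally also carry the appropriate DOCTYPE or schema reference.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<subjectScheme>
  <!-- Controlled list of author identifiers (hypothetical keys) -->
  <subjectdef keys="authors">
    <subjectdef keys="francis-noel-thomas">
      <topicmeta>
        <navtitle>Francis-Noël Thomas</navtitle>
      </topicmeta>
    </subjectdef>
    <subjectdef keys="another-author"/>
  </subjectdef>

  <!-- Bind the list to the (assumed) specialized author attribute on task -->
  <enumerationdef>
    <elementdef name="task"/>
    <attributedef name="author"/>
    <subjectdef keyref="authors"/>
  </enumerationdef>
</subjectScheme>
```

With a binding like this in place, an editor that supports Subject Scheme can offer the allowed values as a pick list and flag anything outside it.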

If we are working with external systems, it may be necessary to use a URI as the unique identifier for the author:

<task id="topic" author="http://example.com/users/0800200c9a66">
    <title>…</title>
    …
</task>

Benefits of unique identifiers for controlled values

As in the author example above, we often need to insert standard snippets of text or data in metadata statements. The values are defined in controlled lists of terms or taxonomies. This means that we can effectively exchange content between systems that have access to the same lists or taxonomies, for example web delivery platforms with faceted search interfaces, analytics tools, or taxonomy management systems. To use a value in metadata, we may feel that the easiest way is to simply use the readable string, for example “Francis-Noël Thomas” as the author of a DITA topic, or “factory reset” as the topic’s subject value: the concept that the topic is about. However, relying solely on readable strings presents problems when exchanging content between systems or working with an external taxonomy management tool:

If there are several strings for the same idea (concept) in different languages, which should be the authoritative one?
If the string is changed, for example due to capitalization rule changes, how do we update that across all the different pieces of content that refer to it?
If the string is taken from a hierarchical taxonomy or an even richer structure, how do we share changes to that structure without having to update all the uses of the string?

Many systems such as older content management systems, discussion forum software, and some CRMs use literal strings for controlled values, and integration is always difficult as a result.

The smarter, future-proof approach is to use a unique identifier for each value from a controlled vocabulary. It is this identifier, not the associated string, that is embedded in any content that needs to use it. In this way, any updates to the string do not require changes to the content. It is far easier to integrate systems and keep them in sync. In fact, a single identifier can represent a “concept” that includes readable labels in various languages and even alternative labels — synonyms — for the same idea. Clearly, for usability and accuracy, system users have to be able to work with the readable labels instead of pasting in IDs from a list. Many systems allow exactly that.
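A concept of this kind could look roughly as follows in SKOS, serialized as RDF/XML. This is only a sketch: the taxonomy URIs, the French label, the synonym, and the broader concept are all invented for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <!-- One identifier stands for the concept; content embeds this URI, not a label -->
  <skos:Concept rdf:about="http://example.com/taxonomy/factory-reset">
    <!-- Preferred labels, one per language (illustrative wording) -->
    <skos:prefLabel xml:lang="en">factory reset</skos:prefLabel>
    <skos:prefLabel xml:lang="fr">réinitialisation d'usine</skos:prefLabel>
    <!-- Alternative label: a synonym that search can also match -->
    <skos:altLabel xml:lang="en">hard reset</skos:altLabel>
    <!-- Position in the hierarchy, which can change without touching tagged content -->
    <skos:broader rdf:resource="http://example.com/taxonomy/maintenance"/>
  </skos:Concept>
</rdf:RDF>
```

Renaming a label or moving the concept under a different parent changes only this record; every topic tagged with the URI stays valid.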

Tagging inline terms with taxonomy values in FontoXML, a browser-based XML editor, via a semantic tagging menu.

In taxonomies based on RDF standards such as SKOS, the identifiers are all URIs; typically in URL format, such as:

http://example.com/users/0800200c9a66

(Their primary purpose is still to identify concepts, although a best practice in Linked Data is to have the URL point to some useful resource about the concept.) However, there is no easy way to work with URIs in DITA Subject Scheme maps, nor is there a good way to embed URIs in attributes, at least in out-of-the-box, unspecialized DITA. Discussions in the DITA Technical Committee have raised some promising opportunities to deal with these challenges.

Understanding DITA elements in relation to external ontologies

So far, we have looked at the basic model for expressing data about a chunk of content — both the data itself, and the relationship that it has with the content (i.e. what property of the content is being expressed). However, there need to be some rules for what relationships can be expressed, and perhaps also what forms the data can take. A complete metadata schema expresses:

The types of content that the data can be applied to
For each type of content, what relationships or attributes may be expressed (or, in some cases, must be present)
For each property, what values are allowed: what format do they need to be in, and are they keyed in by the author, generated automatically by the system, or sourced from a controlled list?

In DITA, some of the more mechanical rules for structuring metadata are provided in the DITA standard itself, and typically enforced in the authoring tool via the DTD or other schema format. For example, the content model for metadata elements in a topic <prolog>, and the general rules for using XML attributes, are controlled in this way.

Other rules, particularly concerning the meaning of metadata, tend to be maintained in human-readable rather than machine-readable form. If an authoring team decides that the <category> element should be used to store subject matter / aboutness metadata, this guideline may be defined in a team wiki alongside other authoring conventions.
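For instance, a convention like the one just described might produce prolog markup along these lines. The subject value is illustrative, and nothing in the markup itself records that `<category>` means "aboutness"; that meaning lives only in the team's guidelines.

```xml
<task id="topic">
  <title>…</title>
  <prolog>
    <author>Francis-Noël Thomas</author>
    <metadata>
      <!-- Team convention (not enforced by the DTD): category holds the subject of the topic -->
      <category>factory reset</category>
    </metadata>
  </prolog>
</task>
```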

However, as organizations realize the potential of integrating structured content environments with external systems for content delivery, semantic search, customer relationship management, master data management and so on, it becomes more important that DITA elements (and hence their content) can be understood in relation to external ontologies that link these systems together.

For example, the author metadata above could be understood as the Creator element from the widely known, though perhaps dated, Dublin Core metadata schema. Equally, it could be the Schema.org author property of the resulting generated page.

Schema.org is a very popular vocabulary for marking up web content, primarily to help external search engines such as Google understand the content, rank it better for relevant queries, and in some cases display enhanced result snippets. It provides the general, whole-document descriptive and administrative metadata that you would expect to find in other schemas such as Dublin Core, but it also mirrors aspects of DITA in its ability to mark up granular fragments of content in a semantically meaningful way. For example, the schema:HowTo entity (perhaps the equivalent of a task topic) can have a schema:HowToStep (equivalent of <step>) that itself relates to a schema:HowToDirection (equivalent of <cmd>). Using our previous framework, a simplified view of this metadata could be:

In fact, there are a number of statements involved to describe this piece of metadata. A more accurate way to visualise it is:

This structure can be parsed and used by services such as Google and Bing to highlight relevant data in search results. There are also some non-search-engine applications of Schema.org being discussed and developed. An obvious use case for DITA in a web publishing environment (particularly a commercial one) is to map these semantic elements to their Schema.org markup equivalents.
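As a sketch of what such a mapping might produce, here is the HowTo structure described above expressed as RDFa in an XHTML fragment. The task title and step text are invented, and the choice of properties (`step`, `itemListElement`, `text`) is one plausible reading of the Schema.org vocabulary rather than an established DITA publishing convention.

```xml
<div xmlns="http://www.w3.org/1999/xhtml" vocab="https://schema.org/" typeof="HowTo">
  <!-- DITA task title mapped to schema:name -->
  <h1 property="name">Reset the device</h1>
  <ol>
    <!-- DITA <step> mapped to schema:HowToStep -->
    <li property="step" typeof="HowToStep">
      <!-- DITA <cmd> mapped to schema:HowToDirection -->
      <span property="itemListElement" typeof="HowToDirection">
        <span property="text">Hold the power button for ten seconds.</span>
      </span>
    </li>
  </ol>
</div>
```

Each nested `typeof` opens a new entity, so the single step above yields several statements: the HowTo has a step, the step has a direction, and the direction has text.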

However, there is currently no reliable, consistent way to map DITA elements to their equivalents in external vocabularies or ontologies. The DITA Open Toolkit does in fact generate some Dublin Core markup in HTML based on certain prolog metadata elements, but there is no easy way for external applications to work with that mapping directly in the DITA content. As Linked Data techniques become increasingly prevalent, both for standard ontologies and organization-specific ones, the current approach of ad-hoc mappings shows its limitations.

For sure, it is not the DITA standard’s responsibility to incorporate every new ontology or vocabulary that comes along. However, it could be very useful for information architects to have a mechanism in DITA to indicate their own mappings from DITA elements (core or specialized) to entities and properties in external vocabularies. Simply enabling RDFa support in DITA could be one way, although it could face significant adoption and usage challenges. Perhaps a better mapping method would be to extend Subject Scheme to define mappings between DITA elements and URIs for externally defined entities and properties. Discussions in the DITA Technical Committee on these and other approaches may reveal a sensible way forward to make DITA, in the Linked Data sense, more truly semantic.