Thesauri and lists of keywords are stored by DC-X in its topic map – a set of database tables modeled after the XML Topic Maps (XTM) 1.0 standard. For an introduction to topic maps, see the wonderful article The TAO of Topic Maps by Steve Pepper.

So far we have implemented only about half of the XTM standard; we’ll look into supporting more of it when the need arises. But the core concepts are all there. [By the way: Why not RDF? Because topic maps are a higher-level abstraction (RDF triples have less semantics built in) and seemed to provide more value “out of the box”…]

The benefits of treating thesaurus and list terms as topics in a topic map:

  • Built-in support for multiple names, which we’re using to store translations for terms: All lists and thesauri can now be multi-lingual.
  • Class/instance relationships between terms: the “City” list is itself a topic, and “Hamburg” and “Oslo” are instances of the “City” topic. This way an unlimited number of lists or thesauri can co-exist. Terms can even belong to multiple lists.
  • Arbitrary relations between terms: A thesaurus hierarchy is modeled using associations like “broader/narrower” or “synonym/preferred term”. Geographic hierarchies can use “part/whole” associations.
  • External identifier URIs can be specified for any term, so terms can be mapped to the corresponding metadata in other software, whether it uses RDF or anything else that points to the same URI.
  • Custom metadata can be attached to any term. We’ll use this for thesaurus “scope notes”, geo coordinates for cities etc.
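To make the list above more concrete, here is a minimal in-memory sketch of the same ideas in Python. It is not the DC-X schema or API: the class names, field names, the example.org URI and the approximate coordinates are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Topic:
    """A term, or a whole list/thesaurus, represented as a topic (hypothetical model)."""
    topic_id: str
    names: Dict[str, str] = field(default_factory=dict)        # language code -> name
    instance_of: Set[str] = field(default_factory=set)         # ids of the lists it belongs to
    subject_identifiers: Set[str] = field(default_factory=set)  # external identifier URIs
    metadata: Dict[str, object] = field(default_factory=dict)  # scope notes, coordinates, ...


@dataclass(frozen=True)
class Association:
    """A typed relation; each member plays a named role."""
    assoc_type: str
    roles: Tuple[Tuple[str, str], ...]  # (role name, topic id) pairs


class TopicMap:
    def __init__(self) -> None:
        self.topics: Dict[str, Topic] = {}
        self.associations: List[Association] = []

    def add(self, topic: Topic) -> Topic:
        self.topics[topic.topic_id] = topic
        return topic

    def relate(self, assoc_type: str, **roles: str) -> Association:
        assoc = Association(assoc_type, tuple(sorted(roles.items())))
        self.associations.append(assoc)
        return assoc

    def instances_of(self, class_id: str) -> List[str]:
        """Ids of all topics that are instances of the given list topic."""
        return sorted(t.topic_id for t in self.topics.values()
                      if class_id in t.instance_of)

    def players(self, assoc_type: str, role: str, **fixed: str) -> List[str]:
        """Topics playing `role` in associations whose other roles match `fixed`."""
        found = []
        for assoc in self.associations:
            roles = dict(assoc.roles)
            if (assoc.assoc_type == assoc_type and role in roles
                    and all(roles.get(k) == v for k, v in fixed.items())):
                found.append(roles[role])
        return found


tm = TopicMap()
tm.add(Topic("city", names={"en": "City", "de": "Stadt"}))
tm.add(Topic("country", names={"en": "Country", "de": "Land"}))
tm.add(Topic("hamburg", names={"en": "Hamburg", "de": "Hamburg"},
             instance_of={"city"},
             subject_identifiers={"http://example.org/id/hamburg"},  # placeholder URI
             metadata={"geo": (53.55, 9.99)}))  # approximate, for illustration only
tm.add(Topic("oslo", names={"en": "Oslo"}, instance_of={"city"}))
tm.add(Topic("germany", names={"en": "Germany", "de": "Deutschland"},
             instance_of={"country"}))
tm.relate("part/whole", part="hamburg", whole="germany")

print(tm.instances_of("city"))
print(tm.players("part/whole", "whole", part="hamburg"))
```

Because a list is just another topic, adding a new list needs no new table or file: you add one more topic and mark its terms as instances of it.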

We are already importing the (multi-lingual) IPTC subject codes thesaurus and CLDR language and country name lists into the DC-X topic map via the XTM XML format. Importing custom thesauri (in a few common text file formats) is also supported. A couple of DC-X fields are set up to auto-fill lists in the topic map as documents with new values come in. Lists and thesauri can be used for auto-completion during document editing, or for lookup in an “assistant dialog”.
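For readers who have not seen the XTM XML format, the following sketch reads a tiny hand-written XTM 1.0 fragment using only the Python standard library. The element names and namespaces come from the XTM 1.0 standard; the topic ids, the use of scope topics to mark the language of a name, and the `read_topics` helper are assumptions for illustration and say nothing about how the DC-X importer works internally.

```python
import xml.etree.ElementTree as ET

XTM_NS = "http://www.topicmaps.org/xtm/1.0/"
XLINK_NS = "http://www.w3.org/1999/xlink"
NS = {"xtm": XTM_NS}
HREF = "{%s}href" % XLINK_NS

# Hand-written sample; the ids ("country-de", "lang-en", ...) are made up.
SAMPLE_XTM = """\
<topicMap xmlns="http://www.topicmaps.org/xtm/1.0/"
          xmlns:xlink="http://www.w3.org/1999/xlink">
  <topic id="lang-en"/>
  <topic id="lang-de"/>
  <topic id="country"/>
  <topic id="country-de">
    <instanceOf><topicRef xlink:href="#country"/></instanceOf>
    <baseName>
      <scope><topicRef xlink:href="#lang-en"/></scope>
      <baseNameString>Germany</baseNameString>
    </baseName>
    <baseName>
      <scope><topicRef xlink:href="#lang-de"/></scope>
      <baseNameString>Deutschland</baseNameString>
    </baseName>
  </topic>
</topicMap>
"""


def read_topics(xtm_text):
    """Return {topic id: {"instance_of": [...], "names": {scope id: name}}}."""
    root = ET.fromstring(xtm_text)
    topics = {}
    for topic in root.findall("xtm:topic", NS):
        classes = [ref.get(HREF, "").lstrip("#")
                   for ref in topic.findall("xtm:instanceOf/xtm:topicRef", NS)]
        names = {}
        for base_name in topic.findall("xtm:baseName", NS):
            scope_ref = base_name.find("xtm:scope/xtm:topicRef", NS)
            # Names without a scope are stored under the key None.
            scope = scope_ref.get(HREF, "").lstrip("#") if scope_ref is not None else None
            names[scope] = base_name.findtext("xtm:baseNameString",
                                              default="", namespaces=NS)
        topics[topic.get("id")] = {"instance_of": classes, "names": names}
    return topics


topics = read_topics(SAMPLE_XTM)
print(topics["country-de"])
```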

In an upcoming DC-X release we will add a simple topic map browser and editor so that administrators can modify lists and thesauri. We will also look into automatically following “use/preferred term” relations, so that administrators can define values that are corrected automatically during document import.
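As a rough idea of what following “use/preferred term” relations during import could look like, here is a small Python sketch. It is not the planned DC-X implementation: the `resolve_preferred` function, the plain dictionary standing in for the topic map, and the sample terms are all hypothetical.

```python
def resolve_preferred(term, use_relations):
    """Follow "use" relations until a term with no further preferred term is reached.

    `use_relations` maps a non-preferred term to the term that should be used
    instead. Chains are followed; a cycle raises ValueError instead of looping.
    """
    seen = {term}
    while term in use_relations:
        term = use_relations[term]
        if term in seen:
            raise ValueError("cycle in use/preferred term relations at %r" % term)
        seen.add(term)
    return term


# Hypothetical relations an administrator might define.
USE = {
    "NYC": "New York City",
    "New York": "New York City",
    "Big Apple": "NYC",  # chained: Big Apple -> NYC -> New York City
}

incoming = ["Big Apple", "Hamburg", "New York"]
corrected = [resolve_preferred(value, USE) for value in incoming]
print(corrected)
```

Values without a “use” relation pass through unchanged, so the correction only touches terms the administrator has explicitly mapped.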

Differences compared to DC5: Lists and thesauri are no longer stored as flat files in the file system; they live in the database. They are available out of the box in DC-X with much less configuration overhead. Multiple languages are now supported. All kinds of relations between terms are now possible, not just simple hierarchies.

