DC-X Architecture

DC-X is built on a solid foundation – well-established (and mostly open source) software. Here’s a list of the components that form the basis of DC-X:

  • The DC-X source code is written in the open source language PHP. Its web browser user interface and web service API are delivered by the open source Apache web server; batch processes are run via the PHP command line interface.
  • All data (except for files, see below) is held in an Oracle or MySQL database.
  • Imported files (images, videos, PDF etc.) are stored in the file system – which can be a simple local hard disk, a cluster file system or SAN, or a mounted network volume (NFS et al.).
  • Searches are performed by the open source Solr search engine (using its XML over HTTP API).
  • For better performance, document views are cached in memcached (also open source, accessed through its TCP/IP API).
  • User and group data and authentication are handled by an external directory service that speaks LDAP (like Active Directory or the open source OpenLDAP server).
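
To make the "XML over HTTP" search interface a little more concrete, here is a minimal Python sketch of what talking to Solr can look like. It is not taken from the DC-X source (which is PHP): the base URL, the field names and the sample response are assumptions made up for illustration, and only Solr's standard `/select` handler with its `q`, `rows` and `wt` parameters is relied upon.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical base URL; the real host, port and path depend on the installation.
SOLR_BASE = "http://localhost:8983/solr"


def build_select_url(query, rows=10, base=SOLR_BASE):
    """Build a URL for Solr's /select handler, asking for an XML response."""
    params = urlencode({"q": query, "rows": rows, "wt": "xml"})
    return f"{base}/select?{params}"


def parse_doc_ids(xml_text, id_field="id"):
    """Collect the value of `id_field` from every <doc> in a Solr XML response."""
    root = ET.fromstring(xml_text)
    ids = []
    for doc in root.iter("doc"):
        for field in doc:
            if field.get("name") == id_field:
                ids.append(field.text)
    return ids


# Hand-written sample response (assumed field names), so the sketch
# runs without a Solr server.
SAMPLE_RESPONSE = """<response>
  <result name="response" numFound="2" start="0">
    <doc><str name="id">doc-1</str><str name="title">Harbour at dusk</str></doc>
    <doc><str name="id">doc-2</str><str name="title">Harbour at dawn</str></doc>
  </result>
</response>"""

url = build_select_url("title:harbour")
ids = parse_doc_ids(SAMPLE_RESPONSE)
print(url)
print(ids)
```

Because the search engine is reached through plain HTTP requests like this one, it does not matter to the caller whether Solr runs on the same machine or on a separate server.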

In the diagram above, each component with one or more asterisks can run on its own server. Two asterisks mean that the component can be distributed across multiple servers. (Of course, you can also run all components together on a single server.)

Some notes on horizontal scaling, i.e. splitting components across multiple servers:

  • The Apache web server can run on multiple servers, with optional load balancing or round robin DNS as per customer requirements. This is supported by DC-X out of the box.
  • The command line batch processes (importers, workflows) can be split across multiple servers as well. Only one import process may run per hotfolder, but there is no limit on the number of workflow processes that can be started on one or multiple servers, since they’re synchronized through the database. This means that you can easily set up specialized “workflow servers”, keeping load off the servers delivering the user interface.
  • Oracle RAC works fine with DC-X, so that you can build a database cluster from multiple servers. Support for MySQL replication is not yet built into DC-X, but we’re planning to add it.
  • We’re leaving storage choices up to the customer, so you can work with your favourite SAN or cluster filesystem vendor as long as the device looks like a regular Unix file system to DC-X. We’re thinking about implementing external storage based on Amazon S3 or the Atom Publishing Protocol. Note that not all storage needs to be mounted on all other servers, so you can have a centralized database while still storing files in the location where they’re used.
  • Solr is accessed via HTTP, so it can run on its own hardware. Solr can replicate indexes, allowing for multiple search servers.
  • Memcached can run on multiple servers out of the box.

As you can see, there are lots of ways to scale DC-X!

Differences compared to DC5: Solr and memcached are new components, and support for MySQL, decentralized storage and workflow parallelization has been added.

Tim Strehle
About Tim Strehle

Tim was part of Digital Collections' Research & Development team from 1999 to 2017. He is an expert in metadata and thesauri.
