ANZLIC Metadata - SGML Format

What is SGML?

Standard Generalized Markup Language (SGML) became an International Organization for Standardization (ISO) standard in 1986, as ISO 8879. It is used to define the structure of electronic text files or documents, and is concerned primarily with the structure of a document rather than its content.

An SGML document consists of text held in a series of fields called elements, each delimited by markup tags at its beginning and end. These tags are enclosed in angle brackets, <>. The start and end tags carry the same name, but the name in the end tag is preceded by a forward slash, /.

SGML is designed primarily for defining the structure of electronic documents and not for direct viewing by the user. But if you did look at an SGML file, what would it look like? Below is an example of an extract from an SGML document:

<anzmeta>

<title>Eucalypts of Australia: 1996</title>

<abstract>

<p>This data is a compilation of Eucalyptus species site data from all over Australia. </p>

</abstract>

........

</anzmeta>

This format may look familiar, probably because you have seen another group of marked-up documents on the World Wide Web: HyperText Markup Language (HTML) documents. HTML is an application of SGML whose tag names have been defined specifically for use on the web. Web documents are written in HTML, which web browsers recognise and interpret for display.

The elements, order and structure of an SGML file are defined in a separate file called a Document Type Definition (DTD). HTML, for example, is defined by a DTD. A DTD allows different documents of the same type to be processed in the same way. Programs such as SGML parsers and indexing programs read the DTD to check that all the required elements are present and correctly ordered. Display programs such as word processors and web browsers can also use the DTD to present SGML to the user.
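
The kind of check a parser performs against a DTD can be sketched in a few lines of Python. This is a minimal illustration, not a DTD validator: Python's standard library has no SGML or DTD validation, so the sketch uses the standard XML parser and therefore only works on records that are also well-formed XML. The rule list REQUIRED_ORDER and the function check_structure are hypothetical names chosen for this example, and the rule itself (a title followed by an abstract) is taken from the short extract above rather than from the ANZLIC DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified rule: the children that must appear under
# <anzmeta>, in this order. The real rules are defined in the ANZLIC DTD.
REQUIRED_ORDER = ["title", "abstract"]


def check_structure(document):
    """Return a list of problems; an empty list means the checks passed."""
    root = ET.fromstring(document)
    problems = []
    if root.tag != "anzmeta":
        problems.append(f"root element is <{root.tag}>, expected <anzmeta>")

    children = [child.tag for child in root]
    for name in REQUIRED_ORDER:
        if name not in children:
            problems.append(f"missing required element <{name}>")

    # Compare the position of the first occurrence of each required element.
    positions = [children.index(n) for n in REQUIRED_ORDER if n in children]
    if positions != sorted(positions):
        problems.append("required elements are out of order")
    return problems
```

Calling check_structure on a record with both elements in the expected order returns an empty list. A record with no title returns a single "missing required element" message.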

The ANZLIC metadata DTD v1.1 defines a standard set of elements and standard structure for text files of ANZLIC metadata using SGML.

Why do we need to have a standard SGML format?

ANZLIC metadata entries are structured: each contains some 22 data items grouped into 10 categories. This information lends itself to storage in a system that can handle structured text, and both SGML and database systems are designed to store and manage this type of information. SGML also provides a convenient mechanism for exchanging metadata entries, as information in SGML can be transferred easily from one hardware and software environment to another. The SGML format can also easily be read by databases and programs for searching, checking, reporting and other functions.

SGML also has the potential to be used as a core component of the Australian Spatial Data Directory (ASDD), which is part of the Australian Spatial Data Infrastructure (ASDI). SGML documents can be created directly or exported from metadatabases. These documents can then be indexed, searched and presented to the user via the World Wide Web using a range of technologies for searching distributed indexes.


Z39.50 Protocol

Z39.50 is an international standard, first published in 1988, that defines a protocol for computer-to-computer information retrieval. It was originally designed for retrieving bibliographic information in library systems, but more recently it has been used for the search and retrieval of information about geographic datasets. Z39.50 is a network protocol that allows information to be retrieved from a number of servers on the Internet, with the results combined and presented to the user.

Examples of Internet-based data directories which employ indexed SGML formatted metadata and the Z39.50 search and retrieval protocol include the US Federal Geographic Data Committee (FGDC) Geospatial Data Clearinghouse and the Australian Spatial Data Directory.

SGML and other metadata standards

The United States has initiated a number of metadata standards which have gained international recognition, including the FGDC GEO profile for describing geospatial metadata and the Global Information Locator Service (GILS) standard for describing information resources. Both of these standards use SGML and define their own DTDs. Using consistent SGML metadata tags wherever possible, together with the Z39.50 protocol, greatly enhances the potential for interoperability in directory searching.

The ISO/TC 211 Working Group on Geographic Information is also currently working on the development of an international metadata standard which incorporates SGML as a standard format for input and output of metadata entries.

Conformance with XML v1.0 standard

The World Wide Web Consortium, which is responsible for coordinating the development of web standards, has just released XML v1.0. XML is the "eXtensible Markup Language" (extensible because its tag set is not fixed like that of HTML). XML is designed to bring the benefits of SGML to the web, namely the ability to handle large and complex documents and the ability to define your own class of documents with their own unique structure.

XML is a "cut-down" version of SGML and is fully compliant with the ISO SGML standard. The goal is to enable XML documents to be served, received and processed on the web in the way that HTML documents are today.

The ANZLIC metadata SGML DTD v1.1 is also XML v1.0 compliant. Future versions of web browsers such as Internet Explorer and Netscape will support XML documents. This would enable ANZLIC metadata in XML to be viewed and marked up directly by web browsers.
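
One practical consequence of XML compliance is that a general-purpose XML parser can read a metadata record. The sketch below uses Python's standard xml.etree.ElementTree module and assumes a record shaped like the simplified extract shown earlier in this document (the elided part of that extract is left out). The function name summarise and the shape of the returned dictionary are choices made for this illustration only; a real ANZLIC record contains many more elements.

```python
import xml.etree.ElementTree as ET

# Sample record modelled on the extract shown earlier in this document.
# A real ANZLIC record contains many more elements.
RECORD = """<anzmeta>
<title>Eucalypts of Australia: 1996</title>
<abstract>
<p>This data is a compilation of Eucalyptus species site data from all over Australia.</p>
</abstract>
</anzmeta>"""


def summarise(record):
    """Pull the title and abstract text out of one metadata record."""
    root = ET.fromstring(record)
    title = root.findtext("title", default="").strip()
    # An abstract may hold several <p> paragraphs; join them with spaces.
    paragraphs = [p.text.strip() for p in root.iterfind("abstract/p") if p.text]
    return {"title": title, "abstract": " ".join(paragraphs)}


if __name__ == "__main__":
    print(summarise(RECORD)["title"])  # prints: Eucalypts of Australia: 1996
```

The same approach would not work for SGML that relies on features XML dropped, such as omitted end tags, which is why the XML compliance of the DTD matters here.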

Key features of the ANZLIC metadata SGML format

As far as possible, the DTD reflects the structure of the ANZLIC Core Elements as outlined in the Metadata Guidelines - Version 1.0, July 1996. At times the DTD departs from this structure to facilitate the addition of elements for jurisdictional and thematic directories, but the content model (schema) is unchanged.

The DTD has a number of key features which need to be highlighted.

  1. Sound, future-proof structure - The DTD uses entities, attributes and structured elements to ensure that future additions and local changes are straightforward. It can also link to the thesaurus lists of keywords defined in the Guidelines, and it conforms to the XML v1.0 standard.

  2. Eight-character tag names - Tag names are limited to eight characters to comply with the limits of some SGML programs and to reduce the size of the SGML document. As far as possible, the DTD uses the existing GEO profile tags. The key relating tag names to full element names is documented in material referenced at the beginning of the DTD.

    One of the advantages of this approach is that interoperability between ANZLIC and FGDC systems and SGML documents is enhanced.

  3. Additional elements - There are a number of additional elements which have been included to provide a more flexible and extensible structure for the ANZLIC elements.

    The ANZLIC unique dataset ID has been included to facilitate identification of metadata records and exchange of metadata between directory systems.

    There are also four additional elements, the Bounding Coordinates, which give summary-level information on the geographic coverage of the dataset. These can be used when performing spatial searches on the SGML documents across the Internet.

    The ANZLIC fields "Custodian" and "Jurisdiction" together identify the key organisation responsible for the data. This concept of custodianship is unique to Australia. Although these fields do not map exactly to the FGDC and GILS field "Originator", that field is a key search field across international directories. For implementation reasons, an additional element, origin, has been proposed, made up of custod and jurisdic.
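
The spatial search that the Bounding Coordinates make possible can be sketched as a rectangle overlap test. This is an illustration under stated assumptions, not the method used by any particular directory: the class BoundingBox, its field names and the function overlaps are hypothetical, the coordinates are assumed to be decimal degrees, the numbers in the usage example are rough illustrative values, and boxes that cross the 180 degree meridian are not handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Four bounding coordinates in decimal degrees (hypothetical structure)."""
    north: float
    south: float
    east: float
    west: float


def overlaps(a, b):
    """Return True if the two boxes share any area or touch at an edge.

    Assumes neither box crosses the 180 degree meridian.
    """
    return (
        a.west <= b.east
        and b.west <= a.east
        and a.south <= b.north
        and b.south <= a.north
    )


if __name__ == "__main__":
    # Rough illustrative values only, not taken from any metadata record.
    dataset = BoundingBox(north=-9.0, south=-44.0, east=154.0, west=112.0)
    search_area = BoundingBox(north=-39.0, south=-44.0, east=149.0, west=143.0)
    print(overlaps(dataset, search_area))
```

Because the test needs only four numbers per record, a directory can apply it to summary-level metadata without opening the dataset itself.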