Metadata Extraction Tool

About the Metadata Extraction Tool

The Metadata Extraction Tool automatically extracts preservation-related metadata from digital files and outputs it in a standard format (XML) for use in preservation processes and activities.

The Tool was developed to programmatically extract preservation metadata, a key component in preserving digital objects, from the headers of a range of file formats. It can process a wide range of formats, including PDF documents, image files, sound files, Microsoft Office documents and many others. It was developed by the National Library of New Zealand (NLNZ) in 2003 and redeveloped in 2007. The current version of the tool is available as open-source software and can be downloaded from GitHub (external link).

Purpose of the tool

The tool builds on the library’s work on digital preservation and its logical preservation metadata schema. It is designed to:

  • automatically extract preservation-related metadata from digital files. Digital material is growing too quickly for preservation metadata to be generated by hand, and the Metadata Extraction Tool provides a simple way of automating that step; and
  • output that metadata in XML formats for use in preservation activities.

The Metadata Extraction Tool was designed primarily for preservation processes and activities; however, it can also be used for other tasks, such as extracting metadata for resource discovery.

Supported file formats

The Metadata Extraction Tool includes a number of 'adapters' that extract metadata from specific file types. Adapters are currently provided for:

  • Images: BMP, GIF, JPEG and TIFF
  • Office documents: MS Word (version 2, 6), WordPerfect, OpenOffice (version 1), MS Works, MS Excel, MS PowerPoint, and PDF
  • Audio and Video: WAV, MP3, FLAC
  • Markup languages: HTML and XML
  • Web harvests: ARC, WARC

If a file type is unknown, the tool applies a generic adapter, which extracts the data that the host system holds about any file (such as size, filename and date created).
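
As a rough illustration of the kind of data a generic adapter can report, the following sketch reads file system attributes using the standard java.nio.file API. The class name GenericFileMetadata and the map keys are assumptions made for this example; they are not taken from the tool's source code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of the Metadata Extraction Tool.
class GenericFileMetadata {

    // Collects only what the host file system reports; the file content is never parsed.
    static Map<String, String> describe(Path file) throws IOException {
        BasicFileAttributes attrs = Files.readAttributes(file, BasicFileAttributes.class);
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("filename", file.getFileName().toString());
        metadata.put("size", Long.toString(attrs.size()));
        // Creation time is platform dependent; some file systems report the
        // last-modified time here instead.
        metadata.put("created", attrs.creationTime().toString());
        metadata.put("modified", attrs.lastModifiedTime().toString());
        return metadata;
    }
}
```

Calling GenericFileMetadata.describe(path) on any existing file returns the four entries in insertion order. No format-specific fields are available at this level, which is why a generic adapter is only a fallback.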

How the tool works

The Tool is based on a library of adapters. Each adapter knows how to recognise and extract metadata from a different type of file. Adapters can handle dependencies within and between objects of varying levels of complexity, ranging from single, simple objects like TIFF files through to complex web sites or databases.


Extracting preservation metadata is a two-stage process. In the first stage, each incoming file is offered to the adapters in turn until one of them recognises the file type. That adapter extracts data from the header fields of the file and generates an internal Extensible Markup Language (XML) file.
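
The first stage can be pictured as a chain of adapters that are tried in order. The sketch below is an assumption-laden illustration, not the tool's real adapter API: the Adapter interface, SignatureAdapter, GenericAdapter and AdapterChain are hypothetical names, and recognition is reduced to comparing the leading bytes of a file with well-known signatures for PDF, GIF and JPEG.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical interface for illustration only; not the tool's real adapter API.
interface Adapter {
    String formatName();
    boolean recognises(byte[] header);
}

// Recognises a format by comparing the first bytes of a file with a known signature.
final class SignatureAdapter implements Adapter {
    private final String formatName;
    private final byte[] signature;

    SignatureAdapter(String formatName, byte[] signature) {
        this.formatName = formatName;
        this.signature = signature.clone();
    }

    @Override
    public String formatName() {
        return formatName;
    }

    @Override
    public boolean recognises(byte[] header) {
        if (header.length < signature.length) {
            return false;
        }
        for (int i = 0; i < signature.length; i++) {
            if (header[i] != signature[i]) {
                return false;
            }
        }
        return true;
    }
}

// Fallback that accepts any file, mirroring the generic adapter described earlier.
final class GenericAdapter implements Adapter {
    @Override
    public String formatName() {
        return "generic";
    }

    @Override
    public boolean recognises(byte[] header) {
        return true;
    }
}

final class AdapterChain {
    private final List<Adapter> adapters;

    AdapterChain(List<Adapter> adapters) {
        this.adapters = adapters;
    }

    // Returns the first adapter that recognises the header, or the generic fallback.
    Adapter select(byte[] header) {
        for (Adapter adapter : adapters) {
            if (adapter.recognises(header)) {
                return adapter;
            }
        }
        return new GenericAdapter();
    }

    // A small illustrative set of signatures; a real adapter library covers far more formats.
    static AdapterChain defaults() {
        return new AdapterChain(Arrays.<Adapter>asList(
                new SignatureAdapter("PDF", "%PDF".getBytes(StandardCharsets.US_ASCII)),
                new SignatureAdapter("GIF", "GIF8".getBytes(StandardCharsets.US_ASCII)),
                new SignatureAdapter("JPEG", new byte[] {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF})));
    }
}
```

With this arrangement, supporting a new file type means adding one more adapter to the list. The selection loop itself does not change.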

In the second stage, an Extensible Stylesheet Language (XSL) transformation converts the internal XML file into an XML file in the required output format. The Tool currently outputs the XML file using the NLNZ preservation metadata data model schema.
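
The transformation step can be sketched with the standard javax.xml.transform API. The class name XslStage, the sample element names (file, name, size, Object, FileName, Size) and the sample stylesheet are all invented for this example; they do not reproduce the tool's internal XML or the NLNZ schema.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper for illustration; not part of the Metadata Extraction Tool.
class XslStage {

    // Invented stand-in for the internal XML produced by an adapter.
    static final String SAMPLE_INTERNAL_XML =
            "<file><name>report.pdf</name><size>1024</size></file>";

    // Invented stylesheet that maps the internal elements to a different output vocabulary.
    static final String SAMPLE_STYLESHEET =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
            + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
            + "<xsl:template match='/file'>"
            + "<Object>"
            + "<FileName><xsl:value-of select='name'/></FileName>"
            + "<Size><xsl:value-of select='size'/></Size>"
            + "</Object>"
            + "</xsl:template>"
            + "</xsl:stylesheet>";

    // Applies the stylesheet to the internal XML and returns the transformed document.
    static String transform(String internalXml, String stylesheet) throws TransformerException {
        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer transformer =
                factory.newTransformer(new StreamSource(new StringReader(stylesheet)));
        StringWriter output = new StringWriter();
        transformer.transform(
                new StreamSource(new StringReader(internalXml)), new StreamResult(output));
        return output.toString();
    }
}
```

Because the output format lives entirely in the stylesheet, a different XML output can be produced by supplying a different XSLT file without changing the extraction code.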

Figure: Solution Architecture

Additional Information

The Tool is written in Java and XML and is distributed under the Apache License, Version 2.0. Developers may be interested in extending key components of the Metadata Extraction Tool, for example by extending existing adapters, developing new adapters to process other file types, or creating new XSLT files to generate different XML output formats.

Please refer to the (external link) page for operation guides and for more information on these components.


