To NORDINFO

Description of the Nordic Metadata project:

Cataloguing, Indexing and Retrieval of Digital Documents

Introduction

The emergence of the Internet is bringing major changes to the role of libraries. They are being transformed from document repositories into providers of networked information access. Instead of, or in addition to, providing local access to local documents, they provide global access to relevant resources anywhere on the Internet.

Increased use of the Internet makes it very important to provide good information about networked resources. Libraries should be actively involved in this work, so that our expertise and skills can be utilized in the process of making the Net more easily accessible.

Information about Internet resources is generally called metadata. Rachel Heery describes metadata in the following way in her article Review of Metadata Formats:

Metadata in its broadest sense is data about data. The familiar library catalogue record could be described as metadata in that the catalogue record is 'data about data'. Similarly database records from abstracting and indexing services are metadata (with a different variation on location data). However the term metadata is increasingly being used in the information world to specify records which refer to digital resources available across a network, and this is the definition used within this paper. By this definition a metadata record refers to another piece of information capable of existing in a separate physical form from the metadata record itself. Metadata also differs from traditional catalogue data in that the location information is held within the record in such a way to allow direct document delivery from appropriate application software, in other words the record may well contain detailed access information and the network address(es).

From the libraries' point of view, metadata provision poses a few interesting challenges. The total number of Internet resources - even document-like Internet resources, which concern us most - is growing very rapidly. Digital documents also tend to be much more unstable than printed documents. New versions are introduced frequently, and documents are often moved to other places in the Net, or renamed. Quite frequently it is not obvious at what level digital resources should be analyzed: how, for example, should WWW pages be described when a single project is presented on three pages, each covering a separate stage of the project?

A fundamental idea in metadata provision is that the authors themselves should do it. Libraries will not have the resources to catalogue all highly relevant Internet resources into the current MARC-based OPACs. The aim is to pass, to some extent, the burden of metadata creation from librarians to authors.

However, design of a metadata creation and utilisation environment for digital documents is a complex task. As these documents can be easily exported from one Nordic country to another - or even outside Scandinavia - with the help of Internet services like WWW, it is of major interest that embedded metadata can be utilized as globally as possible. This is only possible if the same metadata element set and similar "cataloguing" practices are used. Having the same element set would also make it easier to build conversions to MARC-based "legacy" systems.

Among the Nordic countries there is a special need for a shared metadata system, as it would further facilitate the already active use of ILL and document delivery services within Scandinavia. The ultimate aim of metadata provision is to enhance end-user services by making digital documents more easily searchable and deliverable over the Net.

The Nordic Metadata Project will create a Nordic metadata production, indexing and retrieval environment. This system is not intended for test use only, but primarily for production purposes.


Participants of the Nordic Metadata Project

The following organizations and people will be involved with the project:

Table 1. Project participants
Organization                                      Person
Bibsys, Norway                                    Ole Husby
Helsinki University Library, TKAY, Finland        Juha Hakala
Lund University Library, NetLab, Sweden           Traugott Koch
Munksgaard, Denmark                               Anders Geertsen
The National and University Library of Iceland    Sigbergur Fridriksson
Swedish Institute of Computer Science (SICS)      Preben Hansen

The project manager will be Juha Hakala, who originated the idea for this kind of project. He was a project group member in NORDINFO's Nordic SR-Net project, and is currently involved with e.g. ONE and Gabriel projects.

Generic information on participating organizations:

SICS = Swedish Institute of Computer Science. SICS is a non-profit research foundation funded by the Swedish National Board for Technical and Industrial Development (NUTEK) and by a group of companies. The SICS research programme is available at http://www.sics.se/sicsinfo/research_prog/title.html.

HYK/TKAY = Helsinki University Library, Automation Unit of Finnish Research Libraries. TKAY is a department of the Helsinki University Library responsible for planning and co-ordinating automated library functions in Finnish academic and research libraries. TKAY also provides services to public libraries.

Bibsys and the Lund University Electronic Library are well known in the field of library automation not only in Scandinavia, but also elsewhere in Europe. These partners therefore need no introduction here.

Munksgaard is a major Danish scientific publisher with an active interest in electronic publishing and digital documents.

The project is a representative example of Nordic cooperation. Expert organizations from Finland, Norway and Sweden cooperate in order to solve a common problem. The results of the project will benefit all five Nordic countries.


Tasks of the Nordic Metadata Project

Implementation of a shared metadata provision environment in Scandinavia is a complex task. The project proposed here will concentrate on the tasks that we believe will provide the most benefits in the future. These are outlined below, together with the organizations that will be responsible for each task.
  1. Evaluation of the existing metadata formats.

    Although the Dublin Core Metadata Element Set is a very promising format alternative, there are a few other candidates which deserve closer attention. Candidates for evaluation include IAFA Templates (IAFA = Internet Anonymous FTP Archive), MARC and SOIF (Summary Object Interchange Format), which is used in the Harvest WWW indexing service.

    This evaluation does not need to begin from tabula rasa. Rachel Heery has written an article, "Review of Metadata Formats". A pre-publication draft of the text is already available at http://www.ukoln.ac.uk/metadata/review. The article will be formally published in the October 1996 issue of the Program journal. In addition to Heery's evaluation, the EU project DESIRE will provide another review document during Autumn 1996.

    Our aim is therefore primarily to check the state of the art (as changes are very rapid in the Internet environment) and to see if there are any special Nordic needs beyond the scope of the existing reviews. Heery's review and preliminary information from the review done by the DESIRE project (which is currently in beta stage) corroborate our belief that the Dublin Core Metadata Element Set is the best metadata format currently in existence for the Nordic countries. Still, we want to provide proof that this view is correct.

    The result of this task is a review report, which ought to be maintained after the project is completed as well, as new metadata formats are introduced or old ones modified.

    The partner responsible for this task is HYK/TKAY.


  2. Enhancement of the existing Dublin Core specification.

    The Dublin Core Metadata Element Set is a simple metadata format. It is the result of cooperative work by 52 specialists (several of them library professionals) in networked information retrieval. No single organization is responsible for maintenance of the set; however, OCLC and NCSA have been the main players at the development stage. Several other major organisations, such as the Library of Congress and the British Library, have also had an important role. At the moment there are many Dublin Core-related activities in the U.S., Europe and Australia. Developers from different countries actively use an email list to inform each other of new developments. The Set and its history are described in more detail in the 1995 Dublin Metadata Workshop report. According to this report, Dublin Core's intended ecological niche is the following:

    The Dublin Core is not intended to supplant other resource descriptions, but rather to complement them. There are currently two types of resource descriptions for networked electronic documents: automatically generated indexes used by locator services such as Lycos and WebCrawler; and cataloging records, such as MARC, created by professional information providers. Automatically generated records often contain too little information to be useful, while manually generated records are too costly to create and maintain for the large number of electronic documents currently available on the Internet. Records created from the Dublin Core are intended to mediate these extremes, affording a simple structured record that may be enhanced or mapped to more complex records as called for, either by direct extension or by a link to a more elaborate record.

    Dublin Core contains the following elements (Dempsey):

    Table 2. The Dublin Core Elements
    Element      Description
    Subject      The topic addressed by the work.
    Title        The name of the object.
    Author       The person(s) primarily responsible for the intellectual content of the object.
    Publisher    The agent or agency responsible for making the object available in its current form.
    Other Agent  The person(s), such as editors, transcribers, and illustrators who have made other significant intellectual contributions to the work.
    Date         The date of publication.
    Object type  The genre of the object, such as novel, poem or dictionary.
    Form         The physical manifestation of the object, such as PostScript file or Windows executable file.
    Identifier   String or number used to uniquely identify the object.
    Relation     Relationship to other objects.
    Source       Objects, either print or electronic, from which this object is derived, if applicable.
    Language     Language of the intellectual content.
    Coverage     The spatial location and/or temporal duration characteristics of the object.

    The specification is kept simple so that the authors themselves can easily learn to provide metadata for their texts. Keeping in mind the volume and volatility of digital documents, author-made cataloguing is obviously the best way to enhance metadata provision.

    Some DC elements may be specified further with schemes. The 1995 Workshop report defines this feature in the following way (the example is taken from the definition of the subject element):

    The Subject element can be qualified by a scheme, which specifies adherence to a known classification system such as the Library of Congress Subject Headings, the Dewey Decimal System, or the Art and Architecture Thesaurus, to name a few. For example:

    Without the scheme, the Subject element is a keyword and may contain any word or phrase that describes the intellectual content of the object. For example:

    The Nordic countries do not have any generic need to add new elements to the existing ones, although some Nordic organizations may decide to add an element or a few to the Dublin Core in order to satisfy local needs. But we can foresee a requirement to define new schemes for some elements. For instance, the subject element lacks schemes for several subject heading lists and classification systems commonly used in the Nordic countries. Then, for instance, the following line in the header of an HTML document:

    <META NAME="Subject(YSA)" content="Televisio">

    would signify that the subject heading comes from the Finnish General Thesaurus and can be put into the 652 tag of the FINMARC format in the Dublin Core -> FINMARC conversion.
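    The qualified name in the META line above can be split into an element and a scheme mechanically. The following is a minimal sketch in Python, assuming only the "Element(Scheme)" naming shown in that line; the function name split_qualified_name is hypothetical and is not part of any Dublin Core specification.

```python
import re

# Pattern for names such as "Subject(YSA)" or "language(ISO 639)":
# an element name, optionally followed by a scheme in parentheses.
_QUALIFIED = re.compile(r"^\s*([^()]+?)\s*(?:\(([^()]*)\))?\s*$")


def split_qualified_name(name):
    """Return (element, scheme); element is lower-cased, scheme is None if absent."""
    match = _QUALIFIED.match(name)
    if match is None:
        raise ValueError(f"not a recognisable element name: {name!r}")
    element, scheme = match.group(1), match.group(2)
    return element.lower(), (scheme.strip() if scheme else None)
```

    A converter could then use the pair ("subject", "YSA") to decide which MARC tag receives the value.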

    The result of this task is a Nordic version of Dublin Core and a DTD (Document Type Definition) for it. These are a "superset" of the existing "pure" Dublin Core documents. As the current set of 13 elements is more or less fixed, we do not want to add new ones to the Nordic Dublin Core. However, if our metadata providers feel very strongly that some relevant data cannot be provided with the current element set, this opinion - and an enhancement proposal - will be passed to the Dublin Core maintenance group.

    The partner responsible for this task is HYK/TKAY.


  3. Creation of conversions from Dublin Core to Nordic MARC formats and vice versa

    DC-MARC conversions are an unavoidable step if we want the current MARC-based systems and the future Dublin Core-based systems to coexist peacefully. There is no rational way of providing the same metadata twice manually, as libraries do not have the resources for doing it even once. As one of the main aims of Dublin Core usage is to move at least part of the burden of metadata creation from libraries to the authors themselves, it would be disappointing if this metadata could not be used in existing MARC-based OPACs.

    The Library of Congress has already defined a conversion from Dublin Core to USMARC (see gopher://marvel.loc.gov:70/00/.listarch/usmarc/dp86.doc). Basically the conversion is relatively easy, although the resulting MARC record is not of high quality. The problem is that MARC formats need a few enhancements, or should we say changes, in order to accommodate DC data. The most important change is that a new author tag is needed, as DC does not differentiate between different kinds of authors in the same way the MARC formats do. This tag, 720, was added to USMARC in January 1996 and it will soon be incorporated into FINMARC as well. The specification of how this was done is available at gopher://marvel.loc.gov:70/00/.listarch/usmarc/dp88.doc.

    As the first step of this task, each project partner will, with the help of the LC documentation, specify the conversion from the Dublin Core Element Set to its national MARC format. It should be noted that by this time the formats should contain the 720 tag. BIBSYS will then build conversion programmes from DC to NORMARC, DANMARC, SWEMARC and FINMARC. If there is time and resources left for this task after these conversions are finished, the project will also develop and test conversions from national MARC formats to DC. Generally, conversions from a simple format to a more complex one are less complicated than conversions from a complex format to a simpler one, but according to Rebecca Guenther from the Library of Congress (who has been actively involved in the specification of the DC -> USMARC conversion), conversions from Dublin Core to MARC formats are an exception to this rule.

    The result of this task will be a set of DC -> MARC conversion programmes (that will initially run in one or more UNIX environments), plus related documentation. In addition the conversion tables will be made available in the public domain, so that other software developers can use them as a basis of their applications.
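    A conversion table of the kind mentioned above could drive a very small programme. The following Python sketch is illustrative only: the 720 (author) and 652 (Subject with the YSA scheme, FINMARC) entries come from this plan, while the 245 entry for title, the names DC_TO_FINMARC and convert, and the (tag, value) record layout are assumptions made for the example. It does not reproduce the LC specification.

```python
# Illustrative conversion table: (element, scheme) -> MARC tag.
# Only the 720 (author) and 652 (YSA subject, FINMARC) entries come from
# the project plan; the 245 entry for title is an assumption.
DC_TO_FINMARC = {
    ("author", None): "720",
    ("subject", "YSA"): "652",
    ("title", None): "245",  # assumed for illustration
}


def convert(dc_fields, table=DC_TO_FINMARC):
    """Convert (element, scheme, value) triples to (tag, value) pairs.

    Fields with no table entry are returned separately so that nothing
    is dropped silently.
    """
    converted, unmapped = [], []
    for element, scheme, value in dc_fields:
        tag = table.get((element.lower(), scheme))
        if tag is None:
            unmapped.append((element, scheme, value))
        else:
            converted.append((tag, value))
    return converted, unmapped
```

    Keeping the table separate from the code means that a NORMARC or SWEMARC variant would, under these assumptions, only need a different table.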

    The partner responsible for this task is Bibsys, which has already had independent plans to provide a DC -> NORMARC conversion.


  4. Creation of DC Metadata Syntax, User Environment and User Interaction

    This section of the project plan has been written by Juha Hakala and Preben Hansen.

    Creation of a metadata production environment may be divided into
    a) syntax issues,
    b) user training,
    c) test collection issues, and
    d) information retrieval interaction and evaluation.

    A) DC Syntax requirements and recommendations

    Dublin Core metadata is placed in the header part of HTML documents; the META tag is used. The first Dublin Core Workshop deliberately avoided the task of specifying the syntax for DC. Following the second Metadata Workshop in April 1996 (which was attended by Juha Hakala, Ole Husby and Traugott Koch; a travel report is available at http://www.bibsys.no/warwick.html ) a proposal for a Dublin Core metadata syntax has been written by Lou Burnard et al. The text is available at http://info.ox.ac.uk/%7Elou/wip/metadata.syntax.html.

    An example of Dublin Core HTML syntax (from Burnard et al.):

     
    <html>
    <head>
    <title>On the pulse of the morning</title>
    <META NAME="title" content="On the pulse of the morning">
    <META NAME="publisher" content="University of Virginia Electronic Text Center">
    <META NAME="otheragent:transcriber"
          content="University of Virginia Electronic Text Center">
    <META NAME='date(ISO)' content="1993-01-23">
    <META NAME="objectType" content="poem">
    <META NAME="form" content="1 ASCII file">
    <META NAME="form(IMT)" content="text/ASCII">
    <META NAME="source"
          content="Newspaper stories and oral performance of text at the
           Presidential inauguration of Bill Clinton">
    <META NAME="language(ISO 639)" content="en">
     ...
    </head>
    <body>
    <h1>On the pulse of the morning</h1>
     ...
    

    Another example is available in the header of this document. It can be seen by viewing the source code of the text.

    Discussion on the metadata list implies that the proposal by Burnard et al. will be accepted without any major changes. This opens the way to two important further developments, both of which are actually already under way. First, HTML editors can be enhanced with support for Dublin Core metadata provision. This is important since, as can be seen from the above example, the HTML syntax of DC metadata is fairly complex and not very easy to provide manually. One software vendor, SoftQuad, is already committed to providing Dublin Core support in its editor, HotMetal (see http://ww.sq.com). When the Dublin Core Element Set is more widely accepted, other vendors are likely to follow SoftQuad's example.

    Even with an HTML editor which does not support DC directly it is relatively easy to provide DC metadata if the author has a skeletal DC header he/she can cut and paste to his/her own HTML documents. Such a header (or a few) can be easily created and included into the project's user guide (see below).
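    A skeletal header of this kind could also be generated rather than typed. The Python sketch below is one possible approach, not the project's tool: it emits one META line per element of Table 2. The spellings "otheragent" and "objectType" follow the syntax example above; the remaining lower-case spellings and the function name skeletal_header are assumptions.

```python
from html import escape

# The 13 elements of Table 2. "otheragent" and "objectType" are spelled
# as in the syntax example; the other spellings are assumptions.
DC_ELEMENTS = [
    "subject", "title", "author", "publisher", "otheragent", "date",
    "objectType", "form", "identifier", "relation", "source",
    "language", "coverage",
]


def skeletal_header(values=None):
    """Return META lines for all 13 elements; missing values are left empty."""
    values = values or {}
    lines = []
    for name in DC_ELEMENTS:
        # Escape quotes and angle brackets so the attribute stays well-formed.
        content = escape(values.get(name, ""), quote=True)
        lines.append(f'<META NAME="{name}" content="{content}">')
    return "\n".join(lines)
```

    An author would paste the output into the head of the document and fill in the empty content attributes.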

    B) DC User guidelines

    Once an agreement on the DC HTML syntax has been reached, the project will be able to provide Nordic authors with information on how to create DC metadata. In practice the details of how the task is done will to some extent depend on the HTML editor used to write HTML. The Nordic Metadata project will cover the most common alternatives like Hot Dog and Hot Metal in a single guide. The choice of which HTML editors to use depends on the needs of our test users and international developments. The DC production guides already (or soon) available will be used as a basis for the work in order to avoid duplicate effort. This subtask will develop a WWW-based User Guide on how to use DC Metadata.

    C) Coordination of DC Test Collection creation

    Test collection creation is a prerequisite for further tests in indexing. It is also of utmost importance to see how the authors are capable of providing metadata that meets our quality expectations, and what kind of documentation and support they really need to fulfill the task. Experiences from similar projects carried out earlier suggest that the authors will need a lot of help in order to be able to produce satisfactory metadata.

    Each participating country will provide a set of HTML documents which contain DC Metadata. These documents will be written by voluntary authors who participate in the project. The set of documents will later be utilized by Lund in indexing tests (see next task).

    Copyright issues are outside the scope of Nordic Metadata. Therefore all the documents included in the test collection must be free of copyright. It is anticipated that both locally made documents and documents retrieved from the Net will be used to create a collection big enough for testing purposes.

    D) Information Retrieval Interaction and Evaluation

    To adapt retrieval services to the new metadata functionality, we have to evaluate how users interact with it. This could be done by evaluating:

    1. User Interface interaction
      The User Interface Design of the IR system must in some way adapt to the new functionality from the user's perspective. How should the interface adapt to the DC Metadata search functionality? Using methods like a survey or questionnaire we could gather information which could form a basis for recommendations for the further development of the User Interface. This could also include information retrieval performance etc. The result of this subtask should be a set of basic Requirements and Recommendations for the User Interface Design, together with published statistics from the surveys concerning the usage of the system.
    2. Support guidelines
      When using this DC Metadata enhanced system there is a need to provide some kind of support for both "ends" (i.e. the information "provider" and the "user"). This subtask will develop both DC Metadata User Guidelines and Search Support Guidelines. The User Guide produced in subtask B) will be further evaluated and enhanced during the evaluation period. The Search Support Guidelines will be developed during the same period. These two guidelines will also be a valuable input to the global DC Metadata community.

    The main deliverables of this task will be DC Syntax recommendations; the test collection of DC-enhanced HTML documents; DC User Guidelines for both authors and information searchers; and an evaluation of the interface, so that the interface adapts to the DC metadata functionality. Project documentation will cover feedback from authors and users (searchers) and discuss possible future steps in the area of user support and interface design.

    The partner responsible for this task is SICS, which has already done work in this area.


  5. Improve the discovery and retrieval of Nordic Internet documents through a metadata aware search service

    This section of the project plan has been written by Traugott Koch.

    Metadata on the Internet is produced first and foremost to allow search services to provide better and more precise retrieval results. Internet documents published in the Nordic countries are harvested, indexed and offered for retrieval in the cooperative search service "Nordic Web Index" (NWI). NWI was developed by NetLab together with DTV, with NORDINFO funding, with BTJ as sponsor and with partners in all Nordic countries; among the partners are BIBSYS, FUNET and BIBSAM. This development project was finished during Summer 1996.

    To promote the production of metadata and the implementation of services, it is of utmost importance to have a search service capable of using the metadata. It is equally important that the search service treats metadata correctly, adapts to the standards and regulations agreed upon, demonstrates the importance of metadata for improved and more precise retrieval and supports search processes in Nordic digital documents.

    The Nordic Metadata Project intends to develop and add new facilities to NWI, so that it provides a basis for good resource description, discovery and retrieval in the Nordic countries.

    This development work consists of several parts:

    1) Modify NWI's harvesting and indexing software

    The harvesting robot and the indexing software must be developed further so that they recognize and take advantage of metadata according to the chosen scheme or exchange standard (e.g. Dublin Core).
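    Recognizing embedded metadata during harvesting amounts to reading the META tags of each fetched page. The Python sketch below uses the standard library's html.parser for this; it is an illustration under stated assumptions and not NWI's actual harvester. The names MetaCollector and extract_meta are hypothetical, and the sketch collects every META tag, since the syntax discussed in this plan gives no marker that separates Dublin Core fields from other META content.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect (name, content) pairs from META tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name, content = attributes.get("name"), attributes.get("content")
        if name and content is not None:
            self.fields.append((name, content))


def extract_meta(html_text):
    """Return all (name, content) pairs found in META tags of html_text."""
    collector = MetaCollector()
    collector.feed(html_text)
    collector.close()
    return collector.fields
```

    The pairs returned here could be passed to the indexing software as separate fields instead of being merged into the full text.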

    A future extension to the project could be to adapt NWI to understand other schemes and to follow links from documents to external metadata (incl. Warwick containers), produced by libraries, publishers, subject services and other "third parties". In spite of all recommendations, the reality will be that individual records carrying metadata will be widely distributed and mingled with many more records without metadata. NWI should be made capable of prioritizing the harvesting and indexing of documents carrying metadata.
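    One simple way to prioritize documents carrying metadata is to order the work queue so that they come first. The Python sketch below assumes each queue entry is a (url, number of metadata fields) pair; that layout and the function name prioritise are hypothetical and say nothing about how NWI schedules its work.

```python
def prioritise(documents):
    """Order documents so that those carrying metadata come first.

    `documents` is a list of (url, meta_field_count) pairs. sorted() is
    stable, so the original order is kept within each of the two groups.
    """
    # The key is False (sorts first) when the document has metadata.
    return sorted(documents, key=lambda doc: doc[1] == 0)
```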

    2) Adaptation and improvements of the retrieval system

    NWI's retrieval system has to be adapted to the mix of full text and different metadata records. We intend to study different indexing, retrieval and display solutions. The project will accomplish the separation of different qualities of metadata from each other, offer filtering options adapted to the chosen metadata scheme and allow different display alternatives (for instance to display hits in metadata separate from hits in fulltext).
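    Displaying hits in metadata separately from hits in fulltext can be illustrated with a toy search over in-memory records. The Python sketch below is an assumption-laden example: the record layout (a dict with "url", "metadata" and "fulltext") and the function name search are hypothetical, and plain substring matching stands in for a real retrieval engine.

```python
def search(records, term):
    """Return (metadata_hits, fulltext_hits) for a case-insensitive term.

    Each record is a dict with "url", "metadata" (a list of
    (name, content) pairs) and "fulltext" (a string). A record that
    matches in both places is reported only among the metadata hits.
    """
    needle = term.lower()
    metadata_hits, fulltext_hits = [], []
    for record in records:
        in_metadata = any(needle in content.lower()
                          for _, content in record["metadata"])
        if in_metadata:
            metadata_hits.append(record["url"])
        elif needle in record["fulltext"].lower():
            fulltext_hits.append(record["url"])
    return metadata_hits, fulltext_hits
```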

    It might be much more difficult to apply alternative ranking algorithms to NWI's retrieval system in order to reflect the different "quality" of the records since this system is a commercial software package. The project will study the need for changed algorithms and the implementation possibilities in different software solutions.

    3) Adaptation of the user interface and search support

    The user interface must be expanded and adapted to the new possibilities of searching metadata alongside the traditional fulltext retrieval. Especially, search support must be improved considerably and adapted to Nordic documents as well as to the existing metadata and the different search purposes of Nordic users. The quantities and types of metadata created at Nordic sites will be surveyed and statistics published continuously.

    4) Establish Nordic Metadata Project's test database

    All metadata created by the project will be collected into a test database and properly indexed. If there are digital fulltext documents available for the same objects, they too will be harvested and indexed.

    5) Create test environment for retrieval experiments

    In order to allow retrieval studies to be carried out in the future (or in parallel by researchers or other institutions), a test environment will be created. Documents of different types, languages and subjects will be indexed into four different databases:

    a) the fulltext of the document
    b) the metadata
    c) the metadata enriched with headings and similar content information extracted from the documents
    d) a combination of full text and metadata

    Retrieval studies could compare the performance of different description levels and database contents and give advice for future policies, user guidelines and search services.
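    The four databases listed above differ only in which text of a document is indexed. The Python sketch below shows one way to derive the four variants from a single document, assuming the metadata is a list of (name, content) pairs and the headings have already been extracted; the function name database_views and the dictionary keys are hypothetical.

```python
def database_views(fulltext, metadata, headings):
    """Return the text to index for each of the four databases (a-d)."""
    metadata_text = " ".join(content for _, content in metadata)
    return {
        "a_fulltext": fulltext,
        "b_metadata": metadata_text,
        # c) metadata enriched with headings extracted from the document
        "c_metadata_enriched": " ".join([metadata_text, *headings]).strip(),
        # d) full text and metadata combined
        "d_combined": " ".join([fulltext, metadata_text]).strip(),
    }
```

    Running the same queries against each variant would give the comparison between description levels that the retrieval studies call for.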

    The main deliverables of this task will be the proper treatment of Nordic metadata in the Nordic Web Index, improvements in the retrieval, in the user interface and in the search support for Nordic metadata, as well as a WWW-accessible test database for metadata created by the project, an environment for comparative retrieval tests using different combinations of fulltext and metadata and statistical overviews over Nordic metadata production.

    The partner responsible for this task is Lund University Library, Development Dept. NetLab (http://www.ub2.lu.se/). NetLab has, in a NORDINFO project, developed the Nordic Web Index (http://nwi.ub2.lu.se/) and is, together with partners in all Nordic countries, responsible for a continued distributed search service. NetLab is also responsible for the indexing part of the EU project DESIRE (http://www.ub2.lu.se/desire/), working towards a European Web Index. Developments to be carried out in the Nordic Metadata Project will usefully complement the DESIRE work and ensure that Nordic approaches in this area stay closely connected with European developments.


  6. Documentation and project management

    Documentation of results is of major importance in a project like Nordic Metadata, which intends to produce applications and services that will be of high relevance to the Nordic research community. In addition to ordinary project documentation such as the final report (which, due to the international character of the project's work, will be written in English and published in HTML format), the project must also do other things to get researchers' attention.

    It is, for instance, very important to arrange a workshop at the beginning of the project to inform relevant parties both about metadata issues in general and about the project's aims in particular. NORDINFO has kindly already provided support for this workshop, which will be held in Lund in October 1996. We hope that it will also be possible to arrange another workshop when the project is completed in order to publicize its results.

    The project management routines will be kept as light as possible, so that resources are not spent on non-productive work.

    The partner responsible for project management and documentation activities, including production of the final report, is TKAY.


Timeplan

The project will start in November 1996 and will last about 1.5 years. As the project has many links to activities in other projects like DESIRE, we will be able to save a lot of time by using others' results as a basis for our work.


Costs per task

Evaluation of existing metadata formats (task 1)

  1. Creation of review report
    Volume: 0.25 manmonths
    Date: November - December 1996


Enhancement of the existing Dublin Core specification (task 2)

  1. Nordic version of Dublin Core and its DTD
    Volume: 0.25 manmonths
    Date: December 1996 - January 1997


Creation of MARC format conversions (task 3)

  1. DC -> NORMARC
    Volume: 1 manmonth
    Date: Spring 1997
  2. DC -> SWEMARC
    Volume: 0.5 manmonths
    Date: Summer 1997
  3. DC -> FINMARC
    Volume: 0.5 manmonths
    Date: Summer 1997

    Cumulative time for this task: 2 manmonths

Due to the similarity of the Nordic MARC formats it will be relatively easy to modify the NORMARC conversion programme in such a way that FINMARC and SWEMARC conversions are also possible. The project might also develop a DC -> DANMARC conversion, if we find a Danish partner who wants to help us in doing it.


Dublin Core Metadata Syntax, User Environment and User Interaction (task 4)

  1. DC Syntax requirements and recommendations
    Volume: 0.5 manmonths
    Date: Spring 1997

  2. DC User Guidelines
    Volume: 1 manmonth
    Date: Spring 1997

  3. Coordination of DC Test Collection creation
    Volume: 1 manmonth
    Date: Summer 1997

  4. Information Retrieval Interaction Evaluation
    Volume: 1.5 manmonths
    Date: Winter 1997/1998

    Cumulative time for this task: 4 manmonths


Information retrieval (task 5):

  1. Modify NWI's harvesting and indexing software
    Volume: 0.5 manmonths
    Date: Nov 1996 - Summer 1997

  2. Adaptation and improvements in the retrieval system
    Volume: 2 manmonths
    Date: Summer 1997 - end of the project

  3. Adaptation of the user interface, search support
    Volume: 0.5 manmonths
    Date: Summer 1997 - end of the project

  4. Establish Nordic Metadata Project's test database
    Volume: 0.25 manmonths
    Date: Fall 1997 - end of the project

  5. Create test environment for retrieval experiments
    Volume: 0.75 manmonths
    Date: Fall 1997 - end of the project

    Cumulative time for this task: 4 manmonths


Documentation and project management

  1. Production of the final report
    Volume: 0.5 manmonths
    Date: Spring 1998
  2. Project management
    Volume: 1 manmonth
    Date: Nov. 1996 - Spring 1998

    Cumulative time for this task: 1.5 manmonths


Cumulative time for all tasks: 12 manmonths
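
The cumulative figures can be checked against the per-subtask volumes listed above. The Python sketch below repeats those volumes and sums them with exact fractions; the dictionary layout and the labels "task 1" to "task 6" (with documentation and project management as task 6) are simply a convenient way of writing the tables down.

```python
from fractions import Fraction

# Manmonths per subtask, copied from the cost tables above.
TASKS = {
    "task 1": ["0.25"],
    "task 2": ["0.25"],
    "task 3": ["1", "0.5", "0.5"],
    "task 4": ["0.5", "1", "1", "1.5"],
    "task 5": ["0.5", "2", "0.5", "0.25", "0.75"],
    "task 6": ["0.5", "1"],
}


def task_totals(tasks=TASKS):
    """Sum the subtask volumes exactly, avoiding floating-point rounding."""
    return {name: sum(Fraction(v) for v in volumes)
            for name, volumes in tasks.items()}


totals = task_totals()
grand_total = sum(totals.values())
print(grand_total)  # 12
```

The per-task sums (0.25, 0.25, 2, 4, 4 and 1.5 manmonths) agree with the cumulative times stated for each task.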


On behalf of the project group,

Juha Hakala
Systems analyst
Helsinki University Library / TKAY