Tietolinja

News 1/1999




Testing of digital search methods:
Promoting access to historical newspapers in the North

Majken Bremer-Laamanen


Access via the computer is changing the behaviour of both the researcher and the ordinary citizen. Information on almost any matter is sought from computers, and people are getting used to direct access. If the information cannot be found via the Internet, the next step, actually visiting the library - especially in another town or country - is a very long step to take.

In time, the use of the collections in national and special research libraries will diminish unless special attention is paid to digitisation and to extended services for the collections. Digitising collections is, however, expensive and time-consuming.

At the same time, national and special libraries have a great opportunity to serve the whole country in a totally different way than before. Remote researchers can use the collections from their own location, and collections that were previously out of the public's reach can be opened up.

To promote the use of older library collections in digital form, Helsinki University Library, Kungliga Biblioteket in Stockholm, Nasjonalbiblioteksavdelninga in Rana, and Aarhus Statsbibliotek are co-operating to study methods for the digital conversion of large quantities of material. This project, "Testing digital search methods. Historical newspapers in the Nordic Countries", started in June 1998 and is funded by the participating libraries and Nordinfo. The project is planned to take three years. In all likelihood, however, the results of the OCR tests will affect the schedule, so that a fourth year will be needed to establish the historical newspaper database.

Nordic newspapers were chosen as an example of an extensive and important source material that can be found in libraries. In the Nordic countries, newspapers have been produced since the mid-17th century in all sizes and text types, and they often contain photographs of historical importance. Newspapers are normally microfilmed to preserve their contents, and microfilm has a life expectancy of several hundred years. The microfilm therefore serves as the platform for digitising Nordic newspapers published up to 1850.

Aim of the project

The aim of this project is to find economically viable and productive processes for digitising large quantities of material. The participating libraries will test optical character recognition (OCR) programs and examine productive search methods and access via the Internet. On the basis of these investigations the libraries will create a database of historical newspapers that will serve researchers in all the Nordic countries far more flexibly than existing search methods allow. The project will be divided into four phases:

  • Testing the quality of the microfilms and the suitability of different OCR programs for converting large amounts of material and for processing texts in various languages and text types, from originals of varying quality and resolution.

  • Testing of user-friendly search methods, possibly setting up a search program in one of the Nordic languages, and making use of existing library catalogues of, for example, newspaper articles and supplements.

  • Investigation of copyright issues relating to newspapers published in the 20th century.

  • Providing access via the Internet to historical newspapers published in the Nordic countries from the mid-17th century to the mid-19th century.

The first phase of the project starts with testing the options for successful OCR-reading and search. A selection of 30-50 newspaper titles on film, covering a variety of languages, paper sizes and text types, is being collected from the Nordic countries. The period covered by this selection, 1640-1990, extends beyond the time limits of the project itself.

The work is divided between the participants. For example, the National Library of Norway will digitise and OCR-read the selected microfilms, with the exception of OCR-reading the Finnish ones, because of the language. The State and University Library of Aarhus will co-operate with library users and investigate copyright issues, and the Royal Library in Sweden will focus on search issues.

Measurement criteria for digitising and OCR-reading

Information is gathered at each stage of the project: about the original item, the microfilm, the digitisation, and the OCR-reading and search. In this article I will deal mainly with the stages carried out in Finland.

The original item naturally influences all the other stages in the process. The size of the material influences the reduction ratio in microfilming and the resolution. The condition of each item is essential, as are the text type and its x-height. The x-height is the height of the smallest significant character (e), measured in millimetres in the original item.

The quality of the microfilm as a carrier and platform of information for the future is very important. It is also one of the critical factors for OCR-reading in this project. We record the brand of the microfilm camera, the generation and polarity of the film, the reduction ratio, the test resolution pattern number, the film resolution in lines per mm, and the density. We are testing reduction ratios from 10 to 21 in our project.

  • The resolution pattern number multiplied by the x-height gives the standardised Quality Index (QI) of the film. A QI of 8.0 represents a high level of legibility, 5.0 a medium level, and 3.6 a marginal, minimum level.

  • Anne Kenney and Stephen Chapman at Cornell University Library have adapted these resolution requirements to digitised images. The required scanning resolution is found by substituting the required quality index (QI) and the x-height of the smallest significant detail in the original (h, in millimetres) into the formula dpi = 3QI / (0.039h). The constant 0.039 converts millimetres to inches. Using this approach for newspapers, the resolution should be 308 dpi for an x-height of 2 mm and a QI of 8.

  • In this project the QI for both the film and the digitised image will be established according to the OCR-readability.
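The two calculations above can be written out as a small sketch. The function names are illustrative only, and the way QI values between the three quoted levels are banded (for example, treating everything from 5.0 up to 8.0 as "medium") is an assumption made for this example, not something the project has defined.

```python
# Legibility levels quoted in the text.
QI_HIGH = 8.0
QI_MEDIUM = 5.0
QI_MARGINAL = 3.6

MM_TO_INCH = 0.039  # rounded conversion constant used in the formula


def film_quality_index(pattern_number: float, x_height_mm: float) -> float:
    """QI of the film: resolution pattern number times x-height in mm."""
    return pattern_number * x_height_mm


def required_dpi(qi: float, x_height_mm: float) -> float:
    """Scanning resolution from dpi = 3QI / (0.039h)."""
    return 3 * qi / (MM_TO_INCH * x_height_mm)


def legibility(qi: float) -> str:
    """Band a QI value using the three levels (banding is an assumption)."""
    if qi >= QI_HIGH:
        return "high"
    if qi >= QI_MEDIUM:
        return "medium"
    if qi >= QI_MARGINAL:
        return "marginal"
    return "below marginal"


if __name__ == "__main__":
    # Newspaper example from the text: x-height 2 mm, QI 8.
    print(round(required_dpi(QI_HIGH, 2.0)))  # 308
```

Because the x-height sits in the denominator, halving it to 1 mm doubles the required resolution for the same QI, which is why small type is the limiting factor for newspapers.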

The film scanner is the next step, and a very important one for the results. As we are testing the digitising of large quantities of material, all partners are using a microfilm scanner for 35 mm roll film situated in Norway, in Nasjonalbibliotekavdelninga i Rana. It is a 1-bit film scanner with a linear array of 6000 pixels, which means 4350 dpi on the film. For a newspaper filmed at a reduction ratio of 16 this corresponds to 272 dpi, and to a QI of 7 if the x-height is 2 mm. The images are not enhanced after scanning, because our aim is to digitise large quantities economically. Even so, the appearance of the text in the images from microfilm is quite satisfactory.
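The scanner figures can be checked with a short calculation. This is a sketch that assumes the 6000-pixel array spans the full 35 mm width of the film, an assumption that reproduces the quoted 4350 dpi to within rounding. The QI is obtained by rearranging the formula dpi = 3QI / (0.039h), and the function names are illustrative.

```python
MM_PER_INCH = 25.4


def film_dpi(pixels: int, film_width_mm: float) -> float:
    """Resolution on the film, assuming the array covers the full film width."""
    return pixels / (film_width_mm / MM_PER_INCH)


def original_dpi(dpi_on_film: float, reduction_ratio: float) -> float:
    """Effective resolution relative to the original newspaper page."""
    return dpi_on_film / reduction_ratio


def digital_qi(dpi: float, x_height_mm: float) -> float:
    """QI from dpi = 3QI / (0.039h), solved for QI."""
    return dpi * 0.039 * x_height_mm / 3


if __name__ == "__main__":
    on_film = film_dpi(6000, 35.0)       # roughly 4350 dpi
    on_paper = original_dpi(4350, 16)    # 271.875, i.e. about 272 dpi
    print(round(on_film), round(on_paper), round(digital_qi(272, 2.0), 1))
```

The same functions show how sensitive the result is to the reduction ratio: at the upper end of the tested range, a ratio of 21, the effective resolution for the original drops to about 207 dpi.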

We are also going to run some tests with film scanners that offer higher resolution and grey-scale capability.

OCR-reading is a crucial and interesting part of this project. Will it be possible to read newspaper microfilms, or even scanned original newspapers, spanning more than 300 years? The newspapers come in various sizes and prints; some are fragile, and in others the ink has spread into the paper or is fading away.

In this project we are testing the above-mentioned measurement criteria to see whether OCR-reading is possible. In Finland and Norway we are building up a systematic approach to testing. Two programs are being tested at the moment. It should be noted that OCR errors are usually systematic, typical of the font being used but also of the language. Other challenges in OCR are:

  • Subscripts and headings cannot easily be extracted from OCR-read text for indexing by search engines.

  • In the text patterns, ii is recognised as U-umlaut and U as ll, and vice versa; u, n and m are also confused with one another.

  • In the Fraktur font pattern there are errors with f/t, t/l, l/i and i/s conversions.

  • Training fonts is not easy, because the OCR engine skips some of the fonts being trained.

  • Each newspaper in the Fraktur font has to be run through an OCR engine using a font pattern set created specially for that particular newspaper.

Good results were obtained by training the problematic text patterns within a larger context. For instance, the word "täSSä" is recognised correctly as "tässä" once "ssä" has been trained into the text pattern.
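Because the confusions listed above are systematic, one way to picture how they might be handled after OCR is to generate alternative spellings from a confusion table and keep those found in a word list. The sketch below is purely illustrative and is not the method used by the programs being tested. The table covers only the single-character pairs named above (the ii/U-umlaut and U/ll cases are left out for simplicity), and the two-word lexicon is a toy example.

```python
# Single-character confusions from the list above: OCR output -> possible true letters.
CONFUSIONS = {
    "f": "t", "t": "fl", "l": "ti", "i": "ls", "s": "i",
    "u": "nm", "n": "um", "m": "un",
}


def candidates(word: str, max_subs: int = 2) -> set:
    """Return the word plus every spelling reachable by up to max_subs
    substitutions of a confusable letter."""
    results = {word}
    frontier = {word}
    for _ in range(max_subs):
        next_frontier = set()
        for w in frontier:
            for i, ch in enumerate(w):
                for alt in CONFUSIONS.get(ch, ""):
                    next_frontier.add(w[:i] + alt + w[i + 1:])
        results |= next_frontier
        frontier = next_frontier
    return results


def correct(word: str, lexicon: set, max_subs: int = 2) -> list:
    """Return the candidate spellings that occur in the lexicon, sorted."""
    return sorted(candidates(word, max_subs) & lexicon)


if __name__ == "__main__":
    toy_lexicon = {"tässä", "lehti"}      # hypothetical word list
    print(correct("fässä", toy_lexicon))  # ['tässä']
    print(correct("lehfi", toy_lexicon))  # ['lehti']
```

The number of candidates grows quickly with word length and with max_subs, so a real lexicon and a small substitution limit would be needed to keep false matches down.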

Efficient batch processing is still under construction.

Search methods have so far been tested in Sweden with a program that uses fuzzy search methods. These tests will continue during the next two years, side by side with the work on OCR and other indexing tools.
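The article does not say which fuzzy method the Swedish program uses. As one common illustration of the idea, the sketch below matches a query against indexed words by Levenshtein edit distance, so that a word containing a single OCR error is still found. The index contents and page identifiers are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def fuzzy_search(query: str, index: dict, max_distance: int = 1) -> list:
    """Return (distance, word, pages) for every indexed word within
    max_distance of the query, closest matches first."""
    hits = []
    for word, pages in index.items():
        distance = edit_distance(query, word)
        if distance <= max_distance:
            hits.append((distance, word, pages))
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical index: OCR-read word -> pages on which it occurs.
    index = {"sanomat": ["p1"], "sanomaf": ["p2"], "lehti": ["p3"]}
    for hit in fuzzy_search("sanomat", index):
        print(hit)
```

Comparing the query with every indexed word is only workable for small indexes. It is used here to keep the example short.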

An investigation of copyright and user-friendly search will begin in Denmark during the autumn.

Summary

The project will benefit all Nordic libraries interested in digitising their collections. The general methodology and production processes for large-scale digitising of older materials are of both Nordic and international importance.

This project is one more step towards linking the reader and the library in accessing old, valuable collections. When the project is complete, we will know on what terms this can be done in an economical and user-friendly way.

Majken Bremer-Laamanen,
Head of the Centre for Microfilming and Conservation
Helsinki University Library
Email: Majlis.Bremer-Laamanen@helsinki.fi
