Tietolinja 02/2003

Archiving the Web: European experiences

Juha Hakala
Helsinki University Library

URN:NBN:fi-fe20031951




Presentation at CONSAL XII, 20-23 October 2003, Brunei

Introduction

Historically, national libraries have assumed responsibility for preserving the published cultural heritage of their countries for future generations. This mandate has usually been based on deposit laws that require publishers to submit to the national library copies of all their publications. Traditionally, this responsibility has been limited to printed and audiovisual documents. More recently, legal deposit requirements have been extended to electronic publications.

Unfortunately a mere change in jurisdiction is no guarantee that these resources will survive, as the methods libraries have developed for preserving printed materials are not applicable to electronic sources. In fact, we need to design new tools and workflows in order to cope with the new resource types. Further, with the emergence of the World Wide Web as a principal publishing venue, national libraries have become concerned about the proliferation of cultural heritage materials available only on the Web. Some of them extended their deposit responsibility to cover substantial parts of the national Web space already in the 1990s, and since 2000 the number of national libraries experimenting with Web archiving has grown fast. By now it is clear to all that manual means for collecting Web resources are impractical and that automated tools must be developed.

In this text I will describe some European Web archiving projects and the tools they have built. Legal and organisational aspects of Web archiving will not be discussed, but readers should be aware that preserving the national Web space for future generations is not only a technical question. One of the main challenges is political: creating a Copyright Act and an Act on Legal Deposit that allow the national library to carry out this task. Substantial funding will be needed as well, especially for the preservation of Web resources.

Harvesters

A Web harvester is an application that fetches and stores Web content according to user-defined parameters. The operation of any Web harvester is easy to describe at a generic level (although the details can get complicated). The harvester is first fed a set of links (URLs) to qualifying Web documents. The larger this set is, the better; in Finland we used 60.000 addresses on the second harvesting round. Pages qualify either because they belong to a valid root domain (in our case *.fi, in Brunei *.bn), or because they have been published on the Web by a Finnish organisation (for instance, nokia.com is a valid address for the Finnish national library).

These pages are fetched and analysed in order to find the hyperlinks (further URLs) embedded in them. URLs that match the specified selection criteria are put aside. The next step is to use these URLs to retrieve a second batch of documents, which is processed in the same manner. This process goes on until every valid document has been retrieved. With this simple method, large portions of the Web can be covered quickly.
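
To make the generic loop above concrete, here is a minimal Python sketch of a breadth-first harvesting loop. The seed list, the accepted domains and the regular-expression link extraction are illustrative assumptions for the example; they are not the NEDLIB harvester's actual implementation.

    # A minimal sketch of the generic harvesting loop described above.
    # The seeds, accepted domains and output scheme are hypothetical examples.
    import hashlib
    import re
    import urllib.request
    from collections import deque
    from urllib.parse import urljoin, urlparse

    SEEDS = ["http://www.example.fi/"]       # in Finland: ~60.000 seed URLs
    ACCEPTED = (".fi", "nokia.com")          # national domain plus known exceptions
    LINK_RE = re.compile(r'href="([^"#]+)"', re.IGNORECASE)

    def qualifies(url):
        """A URL qualifies if its host ends with an accepted suffix."""
        host = urlparse(url).hostname or ""
        return any(host.endswith(suffix) for suffix in ACCEPTED)

    def harvest(max_docs=100):
        queue, seen = deque(SEEDS), set(SEEDS)
        while queue and len(seen) <= max_docs:
            url = queue.popleft()
            try:
                data = urllib.request.urlopen(url, timeout=30).read()
            except Exception:
                continue                     # a real harvester logs and retries
            # Store the page under its MD5 checksum (see the Metadata section).
            digest = hashlib.md5(data).hexdigest()
            with open(digest, "wb") as out:
                out.write(data)
            # Extract embedded links and queue the ones that qualify.
            for link in LINK_RE.findall(data.decode("latin-1", "replace")):
                absolute = urljoin(url, link)
                if qualifies(absolute) and absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)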

The first harvesters were built in the mid-1990s to enable the creation of Web indexes such as AltaVista. It was at this point that the first pioneers, most notably the Internet Archive and the Royal Library of Sweden, built tools for collecting and archiving the Web. The basic technology behind Web archiving is therefore not new. However, there are not many harvesters that were designed with archiving in mind. The so-called NEDLIB harvester, developed in the EU-funded project NEDLIB (http://www.kb.nl/coop/nedlib/) in 1997 - 2000, is one of them.

These specialised harvesters have been designed to retain the retrieved documents permanently. The server must have sufficient storage for all the data, and the organisation responsible for the work should have a legal justification for it, for instance an Act on Legal Deposit which encompasses Web materials.

The idea of using harvester technology to preserve Web content first emerged in Sweden in 1996, where the Royal Library's Kulturarw3 initiative (http://www.kb.se/kw3/ENG/Default.htm) began to build tools for Web archiving. In 2003, Web archiving is an essential part of the Royal Library's deposit activities. The Swedish Web has been archived 10 times, and these ten sweeps have resulted in 185 million files containing over 5.5 terabytes of data. The Internet Archive has also been in operation since 1996, and has archived more than 50 terabytes of data from the global Internet.

The NEDLIB project decided to build its own tool for archiving when a formal assessment of existing public domain harvesters made it clear that the technical adaptation needed to accommodate archiving features would be difficult to accomplish. There was also a risk that such adaptation might compromise the harvesters' basic operating functions. For instance, one harvester evaluated by NEDLIB was built to throw away URLs queued for retrieval if certain pre-defined internal problems occurred. This behaviour may be acceptable when harvesting is done for indexing purposes, but for archiving this kind of technical solution (of course not mentioned in the documentation, but discovered by us in the source code) is highly problematic. Changing this and other non-optimal features in the tool in question and in other applications would have been difficult.

Therefore, instead of using an existing harvester, the project decided to build a new one, based on specifications written jointly by the NEDLIB partners. Had we known how difficult it would be to build a really good harvester, we might have been more forgiving of the limitations in the existing tools! The first version of the harvester was released in January 2000, but thorough testing made it clear that the application still had a lot of problems with real Web content; malformed documents often killed one or another of the harvester processes.

A second version of the harvester was published in September 2000. Testing continued in the national libraries of Norway, Estonia and Iceland, and further bugs and functional limitations were found. The errors were fixed and testing went on, which then led to yet more reported problems. This process continued until version 1.2.2 was finally deemed satisfactory in September 2002.

The NEDLIB harvester is freeware, available from http://www.csc.fi/sovellus/nedlib/. In its final form the application is quite robust, capable of harvesting tens of millions of Web documents. The first harvesting of the Finnish Web space resulted in 11.7 million files from more than 40 million locations. The second round started in September 2002. It adds documents to the first-generation archive, and by October 2003 it had extended the archive to 15 million files from more than 50 million URL addresses. The harvesting is done by a Sun E450 server with one 480 MHz CPU, 1 GB of memory and 8 x 36.4 GB of disk. The data is stored on a tape robot located at the Finnish Center for Scientific Computing, and occupies less than one terabyte in compressed form.

The problems found in testing the NEDLIB harvester were most often related either to bad data or to poorly developed HTTP server applications. Our Swedish colleagues involved with the Kulturarw3 project confirmed this result; having done the job many times, they knew very well what makes harvesting the Web a difficult enterprise. Thus, although building a harvester looks like an easy task, it is actually quite difficult, because the Web is not a friendly place. Browsers such as Internet Explorer and Netscape Navigator tolerate almost everything and are therefore very good at hiding the gory details, but any application built to process millions of HTML documents must be prepared for anything.

To give an idea of what can happen, the NEDLIB harvester once retrieved an HTML file with a large blob of binary data in it. This surprise encounter brought the HTML parser module of an early harvester version to its knees. Needless to say, the module no longer aborts if it encounters such a file, or similar pathological cases such as URLs longer than 256 bytes.
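
As an illustration of the kind of defensive parsing this requires, the following sketch extracts links while tolerating binary junk and over-long URLs. The heuristic and the limits are assumptions chosen for the example, not the harvester's actual code.

    # A sketch of defensive link extraction; the thresholds are illustrative.
    import re

    MAX_URL_LENGTH = 256       # the harvester once choked on longer URLs
    LINK_RE = re.compile(rb'href="([^"#]{1,%d})"' % MAX_URL_LENGTH, re.IGNORECASE)

    def extract_links(raw_bytes):
        """Extract href targets from possibly malformed 'HTML'."""
        # Reject documents that are mostly binary: an "HTML" file with a large
        # blob of binary data in it brought an early parser version down.
        printable = sum(32 <= b < 127 or b in (9, 10, 13) for b in raw_bytes)
        if len(raw_bytes) and printable / len(raw_bytes) < 0.8:
            return []
        links = []
        for match in LINK_RE.findall(raw_bytes):
            try:
                url = match.decode("ascii")
            except UnicodeDecodeError:
                continue                   # skip garbage instead of aborting
            if len(url) <= MAX_URL_LENGTH:
                links.append(url)
        return links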

The NEDLIB harvester was of course not built entirely from scratch. We used existing applications for, e.g., calculation of the MD5 checksum (needed for the automatic generation of identifiers for stored files, and for duplicate control). Testing revealed some problems in these third-party applications as well, and the most serious of them were fixed.

As of this writing the harvester has been tested in at least 10 European countries, and at least Finland, the Czech Republic and Norway are using it for archiving their national Web spaces. Development and use of this tool has given the national libraries a good understanding of the technical issues related to Web harvesting. These include:

Storage

A major consideration for any Web archiving system is storage. Any archiving tool will need disk space for:

  • The database controlling the harvesting activity (in the NEDLIB harvester, MySQL)
  • Workspace, in which the harvested documents are processed (e.g., extraction of metadata) and prepared for archiving 
  • The archived documents

There is no way to know in advance the exact size of a country's archive before a Web sweep has been completed. Having done the job once helps, but since the Web grows exponentially it is hard to make an accurate estimate of how large the archive will be five years later. Experiences from Iceland, Finland and Sweden indicate, however, that despite such growth the national Web space remains surprisingly small. The Finnish archive took only 500 GB in 2002; this amount of data can be stored on two disks now, and in 2013 it will occupy only a fraction of an average disk available at that time.
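
As a simple illustration of why such estimates remain rough, the sketch below compounds an assumed annual growth rate over five yearly sweeps. The 50 % growth rate is purely hypothetical and is not a figure measured in Finland, Iceland or Sweden.

    # A back-of-the-envelope projection of archive growth; the 50% annual
    # growth rate is an assumption used only to illustrate the uncertainty.
    def projected_size_gb(current_gb=500, annual_growth=0.5, years=5):
        """Compound the current archive size over a number of yearly sweeps."""
        size = current_gb
        for _ in range(years):
            size *= 1 + annual_growth
        return size

    if __name__ == "__main__":
        # 500 GB in 2002 grows to roughly 3.8 TB in five years under this assumption.
        print("%.0f GB" % projected_size_gb())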

Defining the National Web Space

In order to harvest all freely available Internet resources published in a country, just retrieving everything from the country's root domain (e.g. .sg) is not enough. The national library or other organisation responsible for the job must also obtain valid server names from other top-level domains, such as .com or .org. There are different methods for doing this.

First, the companies selling domain names could provide a list of valid domains to the national library. However, these companies tend to regard this information as proprietary and are not likely to provide the list to national libraries or any other organisation asking for it.

Second, network providers can supply lists of the domains their domestic customers are using. We managed to get such a list from two out of ten Finnish companies, which is not really a satisfactory result. To make things worse, at least in Scandinavia there are no rules or regulations that would entitle the national libraries to this domain information. Even if the data is received once, there is no guarantee of updates, since the library depends on the good will of the company delivering the information.

Third, the national library can co-operate with organisations which have, for one reason or another, collected representative sets of domestic domains and/or server names from various root domains. This is the strategy chosen by the Finnish national library; we have a partner which gave us 60.000 server names (many of them not *.fi) prior to our second harvesting round. Having such an exhaustive list of Web servers has helped us to get close to our ultimate aim, harvesting everything Finnish.

In some cases it is also possible to use linguistic methods to recognise the language of a document, and then harvest everything written in the domestic language or languages. In Finland this approach works quite well and is used to complement the pre-set server list. In the United Kingdom such an approach could only be applied to documents written in, for example, Cornish or Welsh.
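
A very rough sketch of the linguistic approach is shown below: score a page by the share of common Finnish function words it contains. The word list and the threshold are illustrative assumptions; a production system would use a proper language identifier, for instance one based on character n-grams.

    # A crude language-recognition heuristic; the stopword list and threshold
    # are illustrative assumptions, not a description of any production tool.
    import re

    FINNISH_STOPWORDS = {
        "ja", "on", "ei", "että", "oli", "se", "hän", "mutta",
        "kun", "niin", "myös", "tai", "ovat", "joka", "sekä",
    }

    def looks_finnish(text, threshold=0.05):
        """Return True if at least `threshold` of the words are Finnish stopwords."""
        words = re.findall(r"[a-zåäö]+", text.lower())
        if not words:
            return False
        hits = sum(word in FINNISH_STOPWORDS for word in words)
        return hits / len(words) >= threshold

    # Pages that look Finnish but live outside *.fi can then be queued for
    # harvesting even if they are missing from the pre-set server list.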

Based on the results gathered so far, approximately 40 % of Web documents are located in domains other than the country domain such as *.fi or *.se. Due to the new top-level domains approved by the Internet Corporation for Assigned Names and Numbers (ICANN; http://www.icann.com/) in 2001, an ever diminishing share of Web servers will be located in country domains. Therefore it is very important to co-operate with other organisations or to use linguistic tools in order to create an exhaustive list of domestic servers. Trying to maintain such a list manually in the national library will not be a successful strategy.

Deep Web

An increasing percentage of Web documents is being made available from within databases. This makes the Web server more secure and easier to manage, but unfortunately such content cannot be harvested with the tools we have built, since it cannot be linked to directly with a URL at all (or the URL may be exceedingly complex). Thus an important part of the Web is beyond our reach.

National libraries are investigating methods for avoiding this problem in a new co-operative project (see below). 

Scheduling Issues

Since an archive harvester grabs literally everything from the servers it visits, it must be well behaved. The application cannot simply retrieve files as fast as it could, since this might overload small servers. During the tests of the NEDLIB harvester we found that simple scheduling policies lead to a situation where, towards the end of a harvesting round, only a handful of very large sites are left to collect. The initial flood of data then becomes a trickle which goes on for weeks.

The solution is to build a harvester that can estimate the size of a Web server on the basis of the number of documents it contains. The harvester can then apply adjustable time limits on how often it retrieves files from small, medium-sized and large servers. Implementing this kind of automatic performance tuning made the NEDLIB harvester more complicated, but it paid off, since the time needed for completing a harvesting round was substantially diminished. It should be noted that the size of Web servers varies dramatically; most servers are tiny, but some may contain far more than 100.000 documents.
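
The following sketch illustrates size-aware scheduling of this kind: small servers get a long pause between requests, very large ones a short pause, so that the final tail of big sites does not stretch the sweep for weeks. The thresholds and delays are invented for the example and do not correspond to the NEDLIB harvester's actual tuning parameters.

    # A sketch of size-aware politeness scheduling; all values are illustrative.
    import time

    def request_delay(documents_seen_on_server):
        """Choose a per-server politeness delay based on estimated server size."""
        if documents_seen_on_server < 1000:       # tiny site: be very gentle
            return 60.0
        if documents_seen_on_server < 100000:     # medium-sized server
            return 10.0
        return 1.0                                # very large server

    class ServerQueue:
        def __init__(self):
            self.documents_seen = 0
            self.next_allowed = 0.0

        def ready(self):
            return time.time() >= self.next_allowed

        def fetched_one(self):
            self.documents_seen += 1
            self.next_allowed = time.time() + request_delay(self.documents_seen)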

Metadata

Utilizing Web harvesting technology for preservation purposes requires the development of an archive module. Its task is to generate archival metadata and to process the harvested documents so that they can be stored and indexed. Just storing the harvested files on disk is insufficient: without harvesting-related metadata it would be impossible to determine where and when the archived documents were retrieved. Nor would it be possible to prove that a document has not been changed during the archival period, which would badly compromise its authenticity.

Although the NEDLIB harvester stores the original URLs for files as accompanying metadata, it does not rely on them to serve as unique identifiers, since over time the data content in a given location will often change or the resource may be moved to new or multiple locations. Traditional identifiers, such as ISBN or ISSN embedded in the document, cannot be used as archive identifiers either, as there may be many versions of an electronic book, each with the same ISBN. In the archive all these versions, even if the differences between them are small, have to be stored and identified separately. Therefore the harvester calculates an MD5 checksum of each file, and uses that sum as the archive identifier. In addition, this unique access key enables duplicate control and authentication.

Although duplicate documents are retrieved during harvesting, they can and indeed should be removed before archiving in order to reduce storage needs. From the Icelandic experience we know that up to two thirds of the archive's contents may be duplicates unless duplicate control is enforced. The question is: can we rely on the MD5 checksum when the collection consists of hundreds of millions of documents?

The MD5 Message-Digest Algorithm (MD5; http://www.isi.edu/in-notes/rfc1321.txt) is the Internet standard RFC 1321. The MD5 value of a file is a 128-bit value similar to a checksum. Its length (conventional checksums are usually either 16 or 32 bits) means that the probability of a different or corrupted file having the same MD5 value as the file of interest is extremely small. Thus we felt that the MD5 technique would be sufficient for duplicate control in a Web archive.
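
The sketch below shows MD5-based identification and duplicate control along these lines: the checksum becomes the archive identifier, and a file whose checksum is already known is recorded as a new sighting rather than stored again. The in-memory index is a simplification used only for illustration; the NEDLIB harvester keeps this information in its database.

    # A sketch of MD5-based identification and duplicate control.
    import hashlib

    archive_index = {}   # md5 hex digest -> list of (url, timestamp) sightings

    def store(url, timestamp, content):
        digest = hashlib.md5(content).hexdigest()
        if digest in archive_index:
            # Same bytes seen before: record the new location, skip the payload.
            archive_index[digest].append((url, timestamp))
            return digest, False
        archive_index[digest] = [(url, timestamp)]
        with open(digest, "wb") as out:      # the digest doubles as the file name
            out.write(content)
        return digest, True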

Since duplicate files are deleted, the harvester must create and store a list of the URLs from which the documents were originally collected. This information may be useful in many ways, not least in uncovering copyright infringements. Based on the MD5 checksums of copyrighted documents, a publisher can use the Web archive index to check whether any unauthorised copies of a document exist. More advanced linguistic methods can be applied in order to find remarkable similarities between texts.

The NEDLIB harvester also generates a time stamp which shows the exact time the document was harvested. If the document is retrieved again from the same location and is found, on the basis of MD5, to be the same, a second time stamp is stored. The archive can then be used to verify that the document remained unchanged and available on the Web during the period defined by the first and last time stamps. If a third harvesting round finds the document unchanged, the second time stamp is updated.
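
A sketch of this time-stamp logic might look as follows; the record layout is a hypothetical simplification of what the harvester stores in its database.

    # A sketch of first/last time-stamp handling for one URL.
    def update_timestamps(record, digest, harvest_time):
        """record is a dict for one URL: {'md5': ..., 'first': ..., 'last': ...}."""
        if record.get("md5") != digest:
            # Content changed: this is a new document version, start a new record.
            return {"md5": digest, "first": harvest_time, "last": harvest_time}
        # Unchanged content: only the latest sighting is needed, so the archive
        # can prove availability between 'first' and 'last'.
        record["last"] = harvest_time
        return record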

Database Considerations

All administrative information related to harvesting, together with the metadata generated by the NEDLIB harvester, is stored in a MySQL relational database. The database and the workspace reside on disk, and the harvested files are stored in a UNIX file system, either on disk or on tape, depending on access time requirements.

The NEDLIB harvester uses TAR software to merge a configurable number of harvested documents into a single file. The archive file is compressed with ZIP software in order to save storage space. Reversing this process is relatively fast, so the response time that can be achieved with this technique is acceptable if performance is not an overriding concern.
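
The bundling step can be illustrated with Python's standard tarfile module, used here with gzip compression as an approximation of the TAR-plus-compression scheme described above; the batch size is an illustrative value, not the harvester's configuration.

    # A sketch of bundling harvested files into compressed archive files.
    import tarfile

    BATCH_SIZE = 10000   # configurable number of documents per archive file

    def bundle(file_names, archive_name):
        """Write the given harvested files into one compressed archive."""
        with tarfile.open(archive_name, "w:gz") as archive:
            for name in file_names:
                archive.add(name)

    def unbundle(archive_name, target_dir):
        """Reversing the process is relatively fast, as noted above."""
        with tarfile.open(archive_name, "r:gz") as archive:
            archive.extractall(path=target_dir)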

The archived files are not stored in a database, since this would prevent the use of tape and other slow but affordable storage media. Database usage might also complicate the long-term preservation of the documents, if they cannot be extracted from the database in their original form. Using a database for storage could also become a problem if there is an inherent limit on the database size or on the number of items it can hold, or limitations on the file formats it can handle. The main downside of using a file system for storing the archived documents is that a separate database must be built for indexing them.

 

Indexing a Web Archive

A Web archive built with a suitable harvester is not as such accessible to end users. The documents in the archive must be indexed with a full-text search engine, preferably one optimised for indexing Web content. This module of the Web archive was outside the scope of NEDLIB, but work on building an access module began in Scandinavia at the same time as NEDLIB was closing.

The Nordic Web Archive (http://nwa.nb.no/) was a collaborative project of the Nordic national libraries, which began in September 2000 and ended in June 2002. It had more funding than any previous Nordic digital library project, with a total budget of 2 million Danish crowns (250.000 euros, approximately 300.000 USD at the exchange rate of October 2003). The main part of the NWA resources came from the Nordunet2 (http://www.nordunet2.org/) research programme.

There is nothing new about indexing Web documents; indeed this activity has been the key function of every Web index, from AltaVista to Google. Many companies develop software specifically for indexing Web content (see searchenginewatch.com for a list of the best-known systems). However, these standard indexing products are not as such sufficient for a Web archive. The indexing application must be able to process the additional metadata generated by a harvester, that is, archive identifiers, location information and time stamps. Since the NEDLIB harvester stores all metadata in relational database tables, from which the data can be extracted as text, indexing the metadata should not be too difficult.
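
As an illustration of how the metadata could be handed to an indexing engine, the sketch below reads identifier, URL and time-stamp columns from a relational database and writes them out as plain text records. Here sqlite3 stands in for MySQL via Python's DB-API, and the table and column names are invented for the example.

    # A sketch of exporting harvester metadata for a text-indexing engine.
    import sqlite3

    def export_metadata(db_path, out_path):
        connection = sqlite3.connect(db_path)
        cursor = connection.cursor()
        cursor.execute(
            "SELECT md5, url, first_seen, last_seen FROM harvested_documents"
        )
        with open(out_path, "w", encoding="utf-8") as out:
            for md5, url, first_seen, last_seen in cursor:
                # One tab-separated line per archived document.
                out.write("%s\t%s\t%s\t%s\n" % (md5, url, first_seen, last_seen))
        connection.close()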

The NWA project did not have the resources to develop a text-indexing engine of its own; our preferred choice would have been an efficient public domain application. Unfortunately, an evaluation of existing products led us to conclude in January 2001 that the tools available were not suitable for indexing extremely large and varied document collections such as Web archives. The NWA project group decided that the best strategy would be to co-operate with an existing text-indexing engine vendor. In February 2001 the project group decided to acquire the search engine developed by the Norwegian company FAST. The same application has since been chosen by e.g. Elsevier.

FAST has created and made freely available a global Web index which in October 2003 contained more than 3 billion files (see http://www.alltheweb.com/). Based on projected growth rates, this capacity is definitely sufficient for indexing even the union of all Nordic Web archives for the foreseeable future.

In addition to the indexing, FAST does quite a lot of pre-processing. This includes, for example, conversion of documents into XML (indexing is done only in this format), language recognition and subsequent morphological analysis of the texts.

The NWA libraries were themselves responsible for creating the user interface. By the project's end in 2002, the Nordic national libraries had a complete tool set for harvesting, archiving and indexing the Web. The actual use of the NWA tool set (i.e. searching and navigating a Web archive) is done via a regular Web browser, and no special plugins are needed. A short description of the tool set is available at http://nwa.nb.no/aboutNwaT.php.

By the time the project was over, many of the NWA tools were not yet robust enough to be released into the public domain. The participating libraries therefore sent a proposal for an NWA II initiative to NORDINFO, which decided to support it. NWA II started in March 2003, and will complete the development of the tools so that the applications built by the NWA partners can easily be used by anyone who wants to archive Web content.

The FAST search engine will of course remain commercial. Any national library planning Web harvesting must be prepared to incur some costs, but at least in the light of the NWA and NEDLIB projects these costs are manageable. Project-based co-operation in software development has definitely been a viable option for the Nordic national libraries. None of us could have developed the tools alone, or paid a commercial company to develop them.

An important feature of the NWA tool set and the NEDLIB harvester is modularity; the tools and the modules of the harvester communicate with each other via standard interfaces. Thus it is relatively easy to, for example, reprogram the scheduler module of the harvester and leave the rest of the application as it is. Changing the search engine should not be too difficult either, although nobody has actually tried to replace FAST with another tool.

 

After NWA: International Internet Preservation Consortium

In 2002 several European national libraries and the Internet Archive started discussions about possible co-operation in developing new tools for Web archiving. It was not a difficult choice for the Nordic national libraries to get involved in these talks. Although the NEDLIB harvester works reasonably well, it has one fundamental shortcoming: the application is no longer developed and supported by the Finnish Center for Scientific Computing. Moreover, it is not certain that the Nordic national libraries can maintain the NWA tool set once the project funding is over.

The negotiations between the Internet Archive and the national libraries were successful, and the International Internet Preservation Consortium (IIPC) was formed in the summer of 2003. The consortium consists of the Internet Archive and 11 national libraries, and aims at fostering Web archiving via the development of standards, best practices and tools for this purpose. Six working groups, covering among other things access tools, the deep Web and research requirements, have been formed. As of this writing these groups are responsible for developing the new tools. Experiences from earlier projects such as NEDLIB and NWA are used in this work.

The consortium is led by the Bibliothèque nationale de France, and the participating libraries include the British Library and all the Nordic national libraries. More information about the work will be available from the consortium's Web site, to be opened in late 2003 at the address netpreserve.org.

Many IIPC libraries participate in the development of the next-generation Web archiving tool called Heritrix. The project aims at "building a flexible, extensible, robust, and scalable web crawler capable of fetching, archiving, and analyzing the full diversity and breadth of internet-accessible content". Heritrix will be an open source application; more information about the tool and its development status is available at http://sourceforge.net/projects/archive-crawler/.

IIPC and related activities will have a central role in Web archiving. The very fact that the consortium exists shows that the importance of preserving the Web for future generations has been recognised, and that the national libraries are eager to co-operate in this work. Indeed, the problems and solutions for preserving the Web are basically the same everywhere, from Europe to Asia. I hope that the tools built by NEDLIB and NWA, and the future applications still to be developed by the IIPC consortium, will be used by many national libraries and other organisations responsible for preserving the Web for the generations to come.

Note: this presentation is a modified and thoroughly updated version of an article published in RLG DigiNews in April 2001 (see http://www.rlg.org/preserv/diginews/diginews5-2.html#feature2).

 



Juha Hakala
Director, Information Technology
Helsinki University Library - The National Library of Finland
P.O.B. 26, FIN-00014 University of Helsinki, Finland
e-mail format: firstname.surname@helsinki.fi