The National Library of Finland is responsible for archiving Finnish public online material. The material harvested and preserved is intended to form a representative and diverse sample of what is publicly available online.
The harvesting and preservation of online material follows a collection policy approved by the Ministry of Education and Culture, in accordance with the Act on Collecting and Preserving Cultural Materials.
Publicly available online materials are collected with automated collection software in an annual harvest, as well as in thematic harvests that focus on a particular topic. If automatic harvesting is not possible, the National Library may also ask an online publisher to deposit materials. In such cases, the material is received as a deposit from the online publisher or aggregator.
Electronic material, whether collected or deposited (including e-books, e-newspapers and music recordings), is accessible to the public on legal deposit terminals at the National Library and other legal deposit libraries, the Library of Parliament, and the National Audiovisual Institute.
Depositing online material
Material received through deposits from online publishers includes electronic:
- books and magazines
- publication series
- government publications
- music recordings and sheet music.
A web form is available for submitting online material.
Large amounts of material may also be submitted over an SFTP connection or on a data storage device (such as an external hard disk drive or a USB flash drive). More information on submitting online material is available via email at vapaakappale(at)helsinki.fi.
If you submit metadata in ONIX format, the metadata must include at least the following fields (link to an Excel table).
Examples of ONIX metadata in XML format (the files are zip-compressed):
Automatic harvesting of web pages
1) The annual harvest
Finnish online material is harvested at least once a year with a web crawler. The National Library archives websites with .fi or .ax domain names. Finnish websites with .com, .net or other domain names are also archived. The annual harvest is not limited to any particular content, topic, or theme.
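The domain rule above can be illustrated with a minimal sketch. It covers only the .fi and .ax check; how Finnish sites under .com, .net or other domains are recognized is not described here, so that part is left out. The function name and the set of top-level domains are illustrative assumptions, not the National Library's actual configuration.

```python
from urllib.parse import urlsplit

# Assumption: scope is approximated by top-level domain only.
# Finnish sites under .com, .net etc. need a separate selection step.
IN_SCOPE_TLDS = {"fi", "ax"}


def is_in_tld_scope(url: str) -> bool:
    """Return True if the URL's host name ends in .fi or .ax."""
    host = urlsplit(url).hostname  # lowercased by urlsplit, None if absent
    if not host:
        return False
    # Drop a trailing dot (fully qualified form) and take the last label.
    tld = host.rstrip(".").rsplit(".", 1)[-1]
    return tld in IN_SCOPE_TLDS
```

For example, `is_in_tld_scope("https://www.example.fi/page")` is true, while a URL under example.com would have to be selected by some other criterion.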
The web crawler cannot collect all Finnish online publications. For example, paid online publications and databases are not available for automatic harvesting. Preserving this material requires cooperation with online publishers.
The annual harvests are described in the national bibliography at the collection level. Whether a particular website has been archived can be checked in the directory of the web archive. Keyword searches of harvested websites are also possible on the legal deposit terminals.
2) Thematic harvests
The purpose of thematic harvesting is to complement the annual harvest and to record online material relating to a particular issue or topical event. Such themes may include:
- significant government events and affairs (e.g., elections)
- other major events (e.g., sports competitions, cultural events)
- important and/or unexpected major events in global politics, natural disasters, etc.
- harvests planned in cooperation with other memory organizations or with research institutions.
Links for thematic harvests are gathered by National Library staff. Material collected in thematic harvests is available in the web archive. Thematic harvests are described in the national bibliography at the collection level.
3) Material excluded from harvesting
Online material excluded from harvesting and preservation includes:
- intranet pages of companies and organizations
- newsgroups and Internet forums
- online material with very little information, image, or sound content (such as sound and image samples in online shops, web forms and system software available on the Internet)
- registers and databases that are documents as described in the Archives Act, or which consist of such documents.
However, some of these materials may still end up in the online archive as a result of automated harvesting.
Technical information about the automatic harvesting of online material
The National Library conducts most of its harvesting using the Heritrix web crawler. The main targets for the web crawler are websites, but other data is also harvested (e.g., from FTP servers). The harvesting is paced so that the load on any single web server is spread over a long period of time and the overall strain on the network remains minor. Even the most extensive harvests have not caused a noticeable increase in data traffic at the core network level. Any load spikes may be reported via email at kk-webcrawler(at)helsinki.fi.
When harvesting online material from websites, the National Library web crawler identifies itself with the following HTTP values:
The National Library also searches for websites based in Finland by scanning through web servers and checking if they distribute websites to the outside world (HTTP/port 80). The search for new websites is conducted by the National Library computer nwa5a.lib.helsinki.fi (IP 188.8.131.52).
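As a rough illustration of the port check described above, the sketch below tests whether a host accepts a TCP connection on a given port. It does not reproduce the National Library's scanning process; the function name, the default port and the timeout are assumptions, and a real check would also need to confirm that the server actually answers HTTP requests.

```python
import socket


def accepts_tcp_connection(host: str, port: int = 80, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the name and connects; the context
        # manager closes the socket again straight away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```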
As a rule, the harvesting complies with a site's robots.txt file, if one exists. However, the National Library may decide to harvest material disallowed by the robots.txt file if it is considered significant for the harvest in progress.
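The robots.txt rule and its exception can be sketched with Python's standard `urllib.robotparser` module. The `significant` flag and the user agent string are hypothetical; they stand in for the editorial decision and the crawler identity, neither of which is specified here.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(robots_txt: str, user_agent: str, url: str,
              significant: bool = False) -> bool:
    """Decide whether a URL may be harvested.

    robots_txt is the already downloaded content of the site's robots.txt.
    significant is a hypothetical flag for material that has been judged
    important enough to harvest regardless of robots.txt.
    """
    if significant:
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With a robots.txt that contains `User-agent: *` and `Disallow: /private/`, a page under `/private/` is skipped by default and fetched only when the override flag is set.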
All harvested files, as well as the protocol-level data traffic during the file transfer, are stored in their entirety in ARC or WARC format. The National Library preserves these archive files in its databases.
Any questions regarding web harvesting may be sent to the address vapaakappale(at)helsinki.fi.