By Michael Day - November 2002
Michael Day reports on the 2nd ECDL Workshop on Web Archiving held in Rome on the 19th September 2002 [1].
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
On the 19th September 2002, around 50 people gathered in a lecture theatre at the Pontifical Gregorian University in Rome for the 2nd ECDL Workshop on Web Archiving, an event held in association with the 6th European Conference on Digital Libraries (ECDL). The workshop was jointly organised by the Bibliothèque nationale de France (BnF) and the Vienna University of Technology (TU Wien).
The workshop was a follow-up to the first workshop, held in Darmstadt on the 8th September 2001 in association with ECDL 2001. The Darmstadt workshop, entitled "What's next for digital deposit libraries," had included presentations on the Web archiving activities of selected legal deposit libraries, including the Library of Congress and the national libraries of Denmark, Finland, France and Sweden [2]. The Rome workshop had a wider focus and concentrated on three main themes: technologies, collection policies and platforms for collaboration. It supplemented a session on Web archiving in the main programme at ECDL 2002 that included papers describing parts of the experiments undertaken in France [3] and Austria [4].
The workshop began with a welcome from Catherine Lupovici of the BnF on behalf of the joint organisers. Twelve presentations followed, the morning session being given over to presentations mainly on technical issues. Raymie Stata of the Internet Archive gave the opening presentation. He described how the Internet Archive had amassed a large amount of data since it started crawling and harvesting the Web in 1996. The total collection is currently in excess of 120 terabytes and is growing steadily, as each snapshot retrieves an additional two billion pages from around 40 million sites. The Internet Archive has a broad crawling strategy in that it tries to cover as many sites as possible but does not attempt to achieve complete coverage of all sites collected. The large and growing size of the Web makes the Internet Archive an expensive operation, especially with regard to the high bandwidth required for crawlers and the ongoing cost of storing terabytes of additional data. To give users some access to the collection, the Internet Archive had launched a public interface called the "Wayback Machine" [5]. This allows users to view older versions of Web sites collected by the Internet Archive, accessed by individual URLs. Enhanced access to the full database is available to researchers (e.g. for statistical or linguistic research) on application to the archive. The Internet Archive had also collaborated on developing special collections in association with organisations like the Library of Congress, for example on the September 11 Web Archive [6].
The morning session included three presentations on Web archiving experiments undertaken in France. Patrick Ferran of Xyleme gave the first of these, describing issues relating to smart crawling and the storage and querying of XML databases. Xyleme had its origins in an INRIA research project concerned with the collection and warehousing of XML data [7]. Since 2000 it had been an independent company providing XML-based products to companies in the news, publishing and financial services sectors [8]. Ferran described the distributed "smart crawler" that had been developed by Xyleme to collect XML data on the Web, but which is also able to manage HTML. In an attempt to save bandwidth, the crawler calculates page rankings and the change frequency of pages and can be focused on particular Web sites. Once the data has been collected, it is kept in a native XML repository, the Xyleme Store, which can then be interrogated using the Xyleme QL (query language). Xyleme makes an attempt at the semantic integration of different XML DTDs by using automatic classification techniques to cluster them into different application domains.
Morgan Cundiff (Library of Congress) then described some experiments undertaken by the Network Development and MARC Standards Office of the Library of Congress on using the Metadata Encoding and Transmission Standard (METS) for the description of Web sites. METS is an XML Schema for encoding descriptive, administrative, and structural metadata for digital library objects [9]. It is an open standard, developed largely by the library community and is modular and easily extensible. After a brief introduction to the basic structure of METS with reference to extension schema, Cundiff described how both the Library of Congress and OCLC [10] were interested in using METS to support the preservation of Web sites. He then gave a demonstration of how a simple Web site (text and images only) encoded in METS could be retrieved via the "METS-Viewer" software.
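To give a rough idea of what describing a simple Web site in METS involves, the following Python sketch builds a minimal METS document containing a file section and a structural map. It is only an illustration, not the encoding used in the Library of Congress experiments: the function name build_mets, the site label and the example.org URLs are all hypothetical, and the descriptive and administrative metadata sections are left out entirely.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def q(name):
    """Return a METS element name in the qualified form ElementTree expects."""
    return f"{{{METS_NS}}}{name}"


def build_mets(site_label, files):
    """Build a minimal METS document (hypothetical helper).

    files is a list of (file_id, mime_type, url) tuples, one per
    page or image belonging to the Web site.
    """
    mets = ET.Element(q("mets"), {"LABEL": site_label})

    # fileSec: an inventory of the files that make up the site.
    file_sec = ET.SubElement(mets, q("fileSec"))
    file_grp = ET.SubElement(file_sec, q("fileGrp"))

    # structMap: how those files fit together (here, one flat division).
    struct_map = ET.SubElement(mets, q("structMap"), {"TYPE": "logical"})
    site_div = ET.SubElement(struct_map, q("div"), {"LABEL": site_label})

    for file_id, mime_type, url in files:
        file_el = ET.SubElement(
            file_grp, q("file"), {"ID": file_id, "MIMETYPE": mime_type}
        )
        ET.SubElement(
            file_el,
            q("FLocat"),
            {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": url},
        )
        # Each file is referenced from the structural map by its ID.
        ET.SubElement(site_div, q("fptr"), {"FILEID": file_id})

    return ET.tostring(mets, encoding="unicode")


example_xml = build_mets(
    "Example site",
    [
        ("F1", "text/html", "http://www.example.org/index.html"),
        ("F2", "image/gif", "http://www.example.org/logo.gif"),
    ],
)
```

A real record would normally add descriptive metadata and administrative metadata through extension schema, which is where the modularity mentioned above comes in.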
The following two presentations returned to the French experiments. Grégory Cobéna of INRIA (the French National Institute for Research in Computer Science and Control) described how the French research team had approached the computation of Web page importance. Researchers at INRIA and Xyleme had developed an algorithm that allowed page importance to be calculated at the same time as a Web crawl was being undertaken, thereby removing the need to compute page importance offline. This is described in more detail in a short paper by the INRIA and Xyleme researchers [11].
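The general idea of estimating importance during a crawl, rather than in a separate offline pass over a stored link graph, can be sketched with a simple "cash distribution" scheme. The Python below is an illustrative approximation only and should not be read as the algorithm in [11]: the function name online_importance, the toy link graph and the fixed visiting order are assumptions made for the example.

```python
from collections import defaultdict


def online_importance(links, visits):
    """Estimate page importance while crawling (illustrative sketch).

    links:  dict mapping each page to the list of pages it links to;
            every link target is assumed to be a key of the dict
    visits: iterable giving the order in which pages are fetched
    """
    pages = list(links)
    # Every page starts with an equal share of "cash".
    cash = {page: 1.0 / len(pages) for page in pages}
    # History records how much cash each page held when it was visited.
    history = defaultdict(float)

    for page in visits:
        amount = cash[page]
        history[page] += amount
        cash[page] = 0.0
        # A page without out-links shares its cash with every page.
        targets = links[page] or pages
        for target in targets:
            cash[target] += amount / len(targets)

    # Pages that have accumulated more cash over time score higher.
    total = sum(history.values()) + sum(cash.values())
    return {page: (history[page] + cash[page]) / total for page in pages}


toy_links = {"a": ["b"], "b": ["a"], "c": ["a"]}
toy_scores = online_importance(toy_links, ["a", "b", "c"] * 50)
```

Only the current cash and the running history are kept for each page, so no link matrix has to be stored and the estimate can be refined for as long as the crawl continues. In the toy graph, page "c" has no incoming links and ends up with the lowest score.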
Julien Masanès of the BnF gave the next presentation on archiving the 'deep Web.' The BnF project had adopted a twin approach to collecting the Web. While the surface Web would be collected automatically using crawler programs, those parts of the Web inaccessible to the current generation of crawler programs - the so-called 'deep Web' - would be subject to a more traditional deposit strategy. In order to test the deposit strategy, a team based at the BnF evaluated and selected over 100 sites (of all types) that would be suitable for deposit. The Web site owners were then contacted with a request for deposit. Only 50 of these signed the deposit agreement and, of these, only 34 actually made a deposit (via FTP or on physical media). This relatively low response suggests that there may be a need for tighter legal deposit rules. Masanès also described the processes of ingest, validation and metadata creation. He noted that successful ingestion sometimes required a kind of migration.
Donna Bergmark of Cornell University gave the final presentation of the technical session. She described how a parallel crawling program had been adopted for the automatic collection of Web resources on science and mathematics for the National Science Digital Library (NSDL). The crawler used was Mercator, produced by Compaq's Systems Research Center (now part of HP) and written in Java [12]. The NSDL researchers used Mercator to test the automatic generation of collections of high-quality Web resources. The technique used was focused crawling, which uses analyses of link structure and content to indicate the relevance of pages to certain subject areas. For the NSDL experiments, centroids (broadly, the highest weighted terms extracted from the results of a Web search) were generated for each topic area and matched to the documents downloaded by a Web crawl [13]. The experiment resulted in a test collection of Web documents that could be used for detailed statistical study. More information on the outcome of these experiments is available in a paper given on the main programme at ECDL 2002 [14].
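The matching of downloaded documents against a topic centroid can be illustrated with a small Python sketch. It assumes raw term frequencies, cosine similarity and an arbitrary threshold of 0.3; the weighting and thresholds actually used in the NSDL experiments are not described here, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Return a bag-of-words term-frequency vector for a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def centroid(texts, top_k=20):
    """Average the term vectors of some seed texts, keeping the top_k terms."""
    total = Counter()
    for text in texts:
        total.update(term_vector(text))
    return {term: count / len(texts) for term, count in total.most_common(top_k)}


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_relevant(page_text, topic_centroid, threshold=0.3):
    """Decide whether a downloaded page is close enough to the topic."""
    return cosine(term_vector(page_text), topic_centroid) >= threshold


# Seed texts stand in for the results of a Web search on the topic.
seeds = ["triangle angle geometry proof", "geometry of the triangle and circle"]
geometry = centroid(seeds)
```

A focused crawler would apply a test like is_relevant to each page it downloads and would favour following links from pages that pass. A production system would more plausibly use TF-IDF weights and stop-word removal than the raw counts shown here.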
The first session of the afternoon provided updates on selected Web archiving projects. The first presentation was by Deborah Woodyard, the Digital Preservation Coordinator of the British Library (BL), on its "Britain on the Web" pilot - formerly known as "Domain.uk." This small-scale project had involved the collection of 100 Web sites of relevance to the UK. BL staff selected the sites in 2001, concentrating on those perceived to have long-term historical or cultural significance, e.g. those related to the 2001 UK General Election or the Foot and Mouth epidemic. Large Web sites, those with potential copyright problems and those in the government domain were excluded. Web site owners were contacted and, if a favourable response was forthcoming, the sites were periodically captured using Web site copying software. It was hoped that the pilot could be scaled up in the future and the BL had applied for additional funding to do this.
Hans Liegmann of Die Deutsche Bibliothek (DDB) gave the next presentation on the submission and delivery interfaces that DDB had implemented for online publications. Ways of submitting documents to DDB were largely based on (or adapted from) the traditional deposit model. In general, publishers would 'push' their documents to the DDB. There were variants in this model for different types of published material. For the collection of Online Theses, universities would provide the DDB with metadata about new theses and DDB would then 'pull' the relevant documents onto the deposit server. E-journals from Springer-Link would be provided to DDB on physical media for copying to the deposit server. More recently, DDB had developed a generic submission and delivery interface that allowed publishers to submit documents and metadata to the library. While few documents had been received so far, many lessons had been learned through developing this process.
Birgit Henriksen of the Royal Danish Library (Det Kongelige Bibliotek) then gave an update on the Danish project netarchive.dk [15]. The Royal Library had been collecting selected Web publications as part of legal deposit since 1998 and had been involved in the Nordic Web Archive project. Netarchive.dk was a project that started in 2001 to test different archiving strategies and the usability of the resulting collections for research. In order to do this, a case study was made of Web pages associated with the 2001 Danish municipal elections. For legal reasons, the netarchive.dk participants could not run a Web crawler over the Danish domain, so they made agreements with the producers of selected Web sites and then collected the sites using various software packages. Henriksen suggested that the law would need to be changed so that harvesting the .dk domain could be covered by legal deposit in future. She then proposed a mixed collection strategy for Denmark based on one major 'event' being selectively archived each year, together with four automatic snapshots. She noted that some news and media Web sites would need to be collected on a more frequent basis. Henriksen ended by noting that collaboration on both national and international levels would also be important.
Neil Beagrie of the UK Joint Information Systems Committee (JISC) gave the final presentation in this session. This was entitled "Developing research and community collections for the Web" and began by describing the UK context, noting that the UK Web was a relatively large domain and that, to date, there had been few attempts made to preserve it. The BL and JISC had undertaken some pilot projects, while a small number of organisations (like the British Broadcasting Corporation) had instituted their own Web archiving initiatives. Beagrie went on to explain JISC's interest in Web archiving in its roles as a provider of services and as a funding body responsible for research projects. JISC was, therefore, proposing a pilot Web archive that would collect JISC project Web sites from early 2003. JISC's work in this area was being informed by a feasibility study being funded jointly by JISC and the Wellcome Trust. This would be published later this year.
The final session of the workshop dealt with two proposals for collaboration. First, Andreas Rauber of Vienna University of Technology gave a progress report on the European Web Archive initiative. This is a consortium of national and research libraries and other organisations that had collaborated on the production of an Expression of Interest (EoI) for an Integrated Project under the European Union's 6th Framework Programme. Rauber gave an outline of the motivations for the project and described some of the areas in which work was proposed. Michele Kimpton of the Internet Archive then gave the final presentation of the workshop on a proposed Internet Archive Consortium. This would be a means for the Internet Archive to collaborate with national libraries. National libraries would define criteria for selection, collection and access while the Internet Archive would provide technical support and develop new tools.
To sum up, the workshop provided a good review of the existing state of Web archiving initiatives. There continue to be technical developments, particularly in the development of Web crawler technology. Some of these developments are not primarily motivated by Web archiving requirements - the experiments with the automatic collection of Web resources for NSDL are a case in point - but remain relevant. Other research, e.g. experiments at Stanford University on crawling the deep Web [16], may also have implications for future Web archiving strategies.
The workshop also emphasised the growing importance of collaboration for Web archiving. The nature of the Web means that the task of collecting and preserving it is daunting. Even if all of the technical problems can be solved, the global nature of the Web and the difficulty of defining 'national' domains make it hard to know who exactly should be responsible for its preservation. National libraries undoubtedly have a role, but there will also be a need for much wider collaboration, for example with computing scientists, research libraries, archivists and records managers, funding agencies, standards organisations, publishers, art galleries, etc. Managing an ongoing collaboration between these diverse types of organisation will be difficult but may be the only way to ensure the success of Web archiving programmes.
Presentation slides from the workshop are available on the BnF Web site [17]. A shorter workshop report has also been published in D-Lib Magazine [18].
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Michael Day
Research Officer
UKOLN
University of Bath
Bath BA2 7AY
United Kingdom
URL: <http://www.ukoln.ac.uk>
Email: m.day@ukoln.ac.uk
Phone: +44 (0)1225 383923
Fax: +44 (0)1225 386838
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
For citation purposes:
Day, M. "The 2nd ECDL Workshop on Web Archiving", Cultivate Interactive, issue 8, 15 November 2002
URL: <http://www.cultivate-int.org/issue8/ecdlws2/>
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Copyright ©2000 - 2001 Cultivate. | Published by UKOLN | Design by ILRT