DELOS Working Group 2.1

Creating test environments for digital library research.



Questionnaires about digital library test collections

While many of the descriptive characteristics of DLs are interconnected and dependent (the nature of the user predefines the collection they will use, which in turn delimits the potential tool set they may adopt), it is possible to assign relatively independent criteria within the overall domain of users, technology and data/collection. Creating such a definition allows us then to select denumerable subsets as appropriate evaluation criteria.

Two similar questionnaires are presented here. The one named "available digital libraries and test collections" asks for the type of collections that are already available and could be used for research in the field of digital libraries. In contrast, the other questionnaire "desired digital library test collections" addresses the characteristics of collections that are required for future evaluations in this field.


The uses and the user are intimately connected with the four basic questions that can be asked about any market. Who is the market? What are they interested in? How and why do they behave as they do?

The "who?" question is largely a matter of demographics and hierarchy within the information chain. In the first cut, users are either internal to the DL system or external. Internal users are the people involved in the maintenance of the DL (similar to the librarians in a conventional library). External users correspond to different levels of the standard information pyramid: the mass market (general users); the primary, secondary and tertiary educational market; industrial, manufacturing or professional users; and high-level university, corporate or institutional research users. Following this classification of user demography, we can evaluate in terms of the number of users of each type and their distribution among the user classes.

The "what?" question concerns the subject area of interest to a user, that is, the domain of their DL use. For evaluation we can use the distribution of subject areas as a metric.

The third dimension relates to the ways in which users seek information, the "how?" question. Users can adopt essentially two strategies. The first is direct object-seeking, that is, the use of sophisticated tools (largely search engines) to identify specific, singular pieces of information that resolve closely defined questions. The second is the traditional wandering approach of library browsing, which may or may not contain more systematic approaches. A user may use a classification scheme or other labels to limit the domain of the browse. Alternatively, they may wander randomly around the information, lighting upon topics of interest serendipitously. For evaluation purposes, the distribution of users between these approaches can be used.

Lastly, we can consider the purpose behind a user's information encounter, the answer to the "why?" question. For some users the encounter may simply be to consume the information for pleasure or interest. For others the information may be an object to analyse critically for educational, research or review purposes. For another group, the information will be crucial to synthesizing new works via quotation, commentary, annotation or citation. Again, for evaluation purposes, the distribution of uses between these categories can be a useful metric.
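As an illustration of the distribution metrics mentioned for the user dimensions, the following minimal Python sketch computes the share of each category from a list of questionnaire answers. The function name `category_distribution` and the sample answers are hypothetical and serve only to show the calculation; they are not part of the questionnaires.

```python
from collections import Counter


def category_distribution(labels):
    """Return the fraction of answers falling into each category."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}  # no answers, so no distribution to report
    return {category: count / total for category, count in counts.items()}


# Hypothetical answers to the "how?" question (assumed data).
seeking_strategies = ["search", "browse", "search", "search"]
distribution = category_distribution(seeking_strategies)
print(distribution)
```

The same function could be applied to user classes, subject areas or purposes of use, since each of these metrics is a distribution over a fixed set of categories.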


The technological issues can be subdivided into four areas, namely user technology, information access, systems structure and document technology.

User technology deals with the functions that the DL system offers to the user. Most basically, these functions have to be provided via an appropriate user interface. Documents are made accessible via searching and browsing; furthermore, there may be a disclosure mechanism that notifies users about new documents that might be relevant to them. Once a relevant document is located, most users prefer to read it on paper; thus, a printing function is essential. Since users often work in teams, support for user groups is also an important function in DL systems, e.g. for collaborative filtering. Besides accessing existing documents, a DL system may also support the creation of new documents.

For information access, a DL system should implement a rich set of functions. Retrieval searches for documents in response to a query. Navigation follows (explicit or implicit) links between documents and/or metadata. Based on a profile specified by the user, filtering locates potentially relevant documents in a stream of incoming documents. Information extraction generates facts from text documents. Based on this input, text mining can discover correlations and trends in a document collection.
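To make the filtering function more concrete, here is a minimal sketch that matches a stream of incoming documents against a user profile. It assumes that a profile is simply a set of keywords and that a document is relevant when it shares enough terms with the profile; real DL systems typically use more refined matching. The names `tokenize`, `filter_stream` and `min_overlap` are illustrative assumptions.

```python
def tokenize(text):
    """Split text into a set of lower-cased whitespace-separated terms."""
    return set(text.lower().split())


def filter_stream(profile_terms, documents, min_overlap=1):
    """Yield documents sharing at least `min_overlap` terms with the profile."""
    profile = {term.lower() for term in profile_terms}
    for document in documents:
        if len(profile & tokenize(document)) >= min_overlap:
            yield document


# Hypothetical stream of incoming document titles (assumed data).
incoming = [
    "Metadata schemes for digital library collections",
    "Weather report for the weekend",
    "Evaluation of retrieval in a digital library",
]
matches = list(filter_stream(["digital", "library"], incoming, min_overlap=2))
print(matches)
```

Writing the filter as a generator reflects the idea of a stream: documents are examined one at a time as they arrive, rather than being collected first.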

Systems structure technology deals with the architecture of the repository (i.e. the type of system, such as a standard relational or object-oriented database management system (DBMS), a multimedia DBMS, an information retrieval system or a hypertext system) and its distribution (centralized vs. distributed system), as well as with the transport protocol involved in the communication between the system and the user.

Document technology addresses the issue of the representation of documents. First, there is the question of whether there is an explicit scheme for the metadata. The document model describes the abstract structure of documents, such as a linear, hierarchical or graph (hyperlinked) logical structure. Some document models make a clear distinction between the logical and the layout structure, whereas presentation formats such as PostScript deal only with the layout structure. In addition, some models also manage external attributes of a document (e.g. publication date, date of last revision, owner of the document). The document format specifies the syntax of the internal document representation (e.g. PostScript, PDF, RTF).
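The distinction between logical structure, external attributes and document format can be sketched as a small data structure. The following example assumes a hierarchical logical structure; the class and field names (`Section`, `Document`, `count_sections`) and the sample values are hypothetical and do not correspond to any particular document model.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Section:
    """One node of a hierarchical logical structure."""
    title: str
    text: str = ""
    children: List["Section"] = field(default_factory=list)

    def count_sections(self):
        """Count this section plus all nested sections."""
        return 1 + sum(child.count_sections() for child in self.children)


@dataclass
class Document:
    root: Section            # logical structure
    publication_date: date   # external attribute
    owner: str               # external attribute
    document_format: str     # syntax of the stored representation, e.g. "PDF"


# Hypothetical document with two top-level parts (assumed data).
report = Document(
    root=Section("Report", children=[Section("Users"), Section("Technology")]),
    publication_date=date(2000, 1, 1),
    owner="Example owner",
    document_format="PDF",
)
print(report.root.count_sections())
```

Note that nothing in this structure describes layout; a layout-only format would carry the same content without the `Section` hierarchy.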


The collections and the information objects in a Digital Library can be described using different axes: content description, quality/reliability qualifiers, and management and accessibility qualifiers.

The collections in a digital library contain information objects gathered according to some rules or ideas, on the basis of one or several attributes, so that they can be described collectively. A collection may be thematic, such as work by a specific author or composer (or "creator") (W. Shakespeare, J.S. Bach) or on a subject (mathematics, history); it may be based upon media types (paper, CDs, films, maps, among others) or age (information objects 'produced after 1968'); or it may be just a general collection for a wide audience that includes a variety of media types.

The collections may contain primary objects like the text of Shakespeare's "Romeo and Juliet", or the film "Cinderella". Collections of secondary objects contain bibliographic descriptions, holdings, or data to assist in authority control (thesauri, gazetteers, classification schemes, etc.), or they may assist in the thematic information seeking process (collections of citations).

To describe quality in an objective manner is almost impossible. It is, however, feasible to give descriptors that may help to estimate quality and authenticity. In scientific domains it may be important to know whether a collection contains 'grey' or reviewed literature, and whether the collection's owner is reputable, which naming the owner helps to establish. The metadata scheme(s) used to describe the information objects indicate the level of detail of the data (MARC format, Dublin Core, RFC1807, robot-generated, none).

A collection needs maintenance. Redundant information objects have to be removed, errors have to be repaired, and the growth of a collection must be secured; for new documents, only partial information may be available at first, to be completed later. A responsible organisation or body must be in charge of this work, and the corresponding workflow should be supported. Additional functions that need to be handled properly are user management, security and access control. Examples of possible qualifiers are the name(s) of the body in charge, maintenance intervals, and statistics on growth rate, accessibility, number of users, types of users, and others.
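One of the qualifiers listed above, the growth rate, can be derived from collection sizes recorded at successive maintenance intervals. The sketch below assumes equally spaced snapshots and defines growth as the average number of objects added per interval; this definition, the function name `average_growth_per_interval` and the sample figures are assumptions made for illustration.

```python
def average_growth_per_interval(sizes):
    """Return the mean change in collection size between consecutive snapshots."""
    if len(sizes) < 2:
        raise ValueError("at least two snapshots are needed to measure growth")
    # Intermediate snapshots cancel out, so only the first and last matter.
    return (sizes[-1] - sizes[0]) / (len(sizes) - 1)


# Hypothetical sizes recorded at three maintenance intervals (assumed data).
snapshot_sizes = [1000, 1100, 1250]
growth = average_growth_per_interval(snapshot_sizes)
print(growth)
```

A negative result would indicate that removals of redundant objects outweighed additions over the observed period.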

