
Content-Based Multimedia Information Handling: Should we Stick to Metadata?

By Paul Lewis, David Dupplaw and Kirk Martinez - February 2002

Paul Lewis, David Dupplaw and Kirk Martinez discuss retrieval and navigation as ways of accessing multimedia information and the use of content as an aid to these activities. They ask whether content-based techniques are really making a useful contribution or whether we should restrict ourselves to the use of metadata.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Introduction

First some definitions. Multimedia information is digital information which may be visual data, images or video for example; or it may be sound data, music or speech; and it may now include 3D visualisations or mixed reality experiences. Finally it will almost certainly include the medium with which we are most familiar, that is text.

In this article we use the word “document” to refer to a multimedia object. It may be a collection of text, such as an article or a book; it may be an image, a video or a frame of a video; or it may be a mixture of these. In fact it may be any type of basic digital information object.

The issues we are going to discuss apply to text as much as to images or other media and it will be useful to establish the ideas by talking first about text on its own. Let us begin by distinguishing between retrieval and navigation. Retrieval is the business of extracting a document from a collection in order to satisfy some query. The query may take a variety of forms. For example, we may require documents by a particular author, or about a particular subject. This sort of retrieval has traditionally been achieved by using indexed metadata that is stored with the document. Key terms in the metadata may give a controlled vocabulary to aid the retrieval.

Content-based retrieval of text is retrieval that uses the text of the document rather than any added metadata. Free text searching is a good example of content-based text retrieval. The words making up the content of the document are indexed and used as the basis for retrieval, sometimes in conjunction with quite sophisticated “intelligent” software used to satisfy the query. Search engines like Google and AltaVista offer content-based text retrieval on the Web.
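As an illustration only, free text searching of this kind can be sketched with a small inverted index. The tokeniser, the function names and the three sample documents are assumptions made for the example; real search engines add ranking and far more sophisticated processing.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(documents):
    """Map each word to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical collection: the words of each document are the only index terms.
documents = {
    "doc1": "Retrieval of images by colour histogram",
    "doc2": "Navigation by following links between documents",
    "doc3": "Colour and texture in image retrieval",
}
index = build_index(documents)
matches = search(index, "colour retrieval")
```

No metadata is consulted here: a document is found purely because the query words occur in its content.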

By contrast, navigation is the process of moving from one document in the information collection to another because there is some useful association between them, and this is typically achieved by following pre-authored links. On the Web this is achieved by clicking on a highlighted source anchor of a link in one document in order to navigate to the destination document to which it points. Sometimes the distinction between navigation and retrieval is unclear. For example, following links that are stored in a bookmark file under a particular subject heading could be regarded in one sense as indexed retrieval and in another sense as link-based navigation. This is also true when using a search engine to retrieve documents on a particular subject. The documents are presented initially as links to be followed. In both these examples we will regard the process as retrieval rather than navigation, as the aim is to retrieve rather than to follow an association between documents.

On the Web, navigation is mainly based on fixed links that are embedded in the documents themselves. However, it is possible for hypermedia navigation to be content-based. By this we mean that the links offered are determined at link following time and selected on the basis of the content of the chosen source anchor. Link authoring for content-based navigation involves making an association between some chosen source anchor and the address of a destination document. The link information may be stored in a separate location from the document, typically a linkbase holding source anchors and destination addresses. With this content-based approach to navigation, multiple links may be made available for a given source anchor, previously authored links may be added to new documents on the fly with minimal effort and different viewers may see different link sets depending on the linkbases which are active at the time [1].

In both content-based retrieval and content-based navigation for text, the process depends on matching content. In the case of retrieval, the textual content of the query is matched with text forming the content of the document, typically indexed in some way to accelerate the retrieval process. In content-based navigation, the query (which is typically a portion of text selected from the content of the document) is matched with the text making up the source anchors of links in the linkbase.
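A minimal sketch of content-based navigation with linkbases follows. It assumes exact, case-insensitive matching between the selected text and the source anchors; the `Link` class, the function name and the example addresses are hypothetical and are not taken from the open link service described in [1].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    source_anchor: str  # content that triggers the link
    destination: str    # address of the destination document


def find_links(selection, linkbases):
    """Return the destinations whose source anchor matches the selected text.

    Links are resolved at link-following time against whichever
    linkbases are currently active.
    """
    key = selection.strip().lower()
    destinations = []
    for linkbase in linkbases:
        for link in linkbase:
            if link.source_anchor.lower() == key:
                destinations.append(link.destination)
    return destinations


# Two hypothetical linkbases stored separately from the documents.
art_linkbase = [
    Link("colour histogram", "http://example.org/histogram"),
    Link("vase", "http://example.org/vase-gallery"),
]
history_linkbase = [
    Link("vase", "http://example.org/vase-history"),
]

all_links = find_links("Vase", [art_linkbase, history_linkbase])
art_only = find_links("Vase", [art_linkbase])
```

Because the links live outside the documents, the same selection yields two destinations when both linkbases are active and only one when the history linkbase is switched off.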

For text, these processes of content-based retrieval and navigation are sufficiently well established and widely used for us to conclude with some conviction that they are worthwhile and effective approaches to text information handling. Of course, metadata-based searches with text are also widely used and the two approaches can complement each other well. The content matching on which text content-based processes depend is in many cases a straightforward exact match between words, although statistical matches between word sets, term switching or query expansion via thesauri, word stemming and other textual techniques can greatly enhance the processes to provide more powerful retrieval and navigation facilities.

Now let us turn our attention to content-based retrieval and navigation with non-text media. We will use images as our example, although many of the comments will apply equally to other non-text media. Can we say with the same conviction as we did for text that content-based image retrieval and navigation are worthwhile and effective approaches for image information handling? Well, in short, the answer is “No, certainly not with the same conviction”. But there are circumstances where content-based retrieval and content-based navigation may be worthwhile, particularly in conjunction with metadata-based techniques. And in the longer term, as research into media processing offers up more powerful approaches, the value of content-based techniques should increase.

In the following sections we look more closely at content-based image retrieval and navigation techniques, examine why they are currently less powerful than for text and examine specific efforts to make them more effective.

Content-based Image Retrieval

The basic reason why image retrieval is more difficult than text retrieval is that the digital representation for most images is a collection of pixels. The only information which is explicit in such a representation is the colour value at each pixel point. When we look at images we, as humans, are able to interpret them automatically: we see meaningful regions of colour, recognise objects and identify scenes, all of which can usefully form the basis of effective image matching processes. In doing so, however, we are performing substantial and sophisticated information processing which relies on a large volume of prior knowledge for its success. To achieve effective content-based image retrieval (CBIR), software systems must achieve some of this extraction and interpretation in order to find something meaningful to form the basis of the content matching. By contrast, in text documents, the words themselves are explicit in the digital document and it is these that form the basis of the matching process. Hence for text retrieval in its basic form, little additional processing is required.

There have been some excellent recent reviews of content-based image retrieval [2], [3] and the reader is encouraged to look at these for further details. Querying in CBIR can take many forms but the most common is probably the query by example paradigm where the user provides a query image and asks for images from the collection that are similar to it in some way. An alternative might be to ask explicitly for images containing some particular object using a text interface to provide a description of the required image. Such an approach requires that the CBIR system can perform object recognition or scene analysis in order to find the required image and at present this is only possible in specific highly constrained application domains.

General approaches to CBIR attempt to find representations of the image which make more information explicit than simply the pixel colour values. Unsurprisingly, many of the approaches have been based on colour. The colour histogram [4] has been a simple and popular representation which captures the relative amounts of each colour in an image. But it is a global measure and does not give information about colour variations at local positions in the image. Nevertheless it provides a useful measure of some aspects of similarity between images and has been widely used in CBIR systems.
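The idea can be sketched in a few lines. The example below assumes an image is simply a list of (r, g, b) tuples with 8-bit channels, quantises each channel into a small number of bins, and compares two normalised histograms by histogram intersection, the similarity measure associated with the colour indexing work in [4]. The function names and the bin count are choices made for this illustration.

```python
def colour_histogram(pixels, bins_per_channel=4):
    """Return the fraction of pixels falling in each quantised RGB colour bin."""
    if not pixels:
        raise ValueError("image has no pixels")
    step = 256 // bins_per_channel
    counts = [0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        # Quantise each 8-bit channel; clamp in case 256 is not divisible by the bin count.
        rq, gq, bq = (min(c // step, bins_per_channel - 1) for c in (r, g, b))
        counts[(rq * bins_per_channel + gq) * bins_per_channel + bq] += 1
    return [count / len(pixels) for count in counts]


def histogram_intersection(h1, h2):
    """Similarity of two normalised histograms: 1.0 means identical colour proportions."""
    return sum(min(a, b) for a, b in zip(h1, h2))


RED, BLUE = (255, 0, 0), (0, 0, 255)
red_image = [RED, RED, RED, RED]
striped_image = [RED, BLUE, RED, BLUE]
blocked_image = [BLUE, BLUE, RED, RED]  # same colours, different layout

similarity = histogram_intersection(
    colour_histogram(red_image), colour_histogram(striped_image)
)
```

The striped and blocked images produce identical histograms even though their layouts differ, which is the global limitation noted above.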

To overcome the global nature of the colour histogram the image has sometimes been divided into patches and the colour histogram calculated for each patch. This allows images to be retrieved from a collection when the query image is only similar to a sub-section of an image in the collection. This is taken further when the images are decomposed into patches hierarchically at decreasing resolutions.
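A rough sketch of the single-level version of this idea is given below. It assumes an image is a list of rows of (r, g, b) tuples, that patches lie on a regular grid no finer than the image itself, and that patches are compared with the query by histogram intersection. The hierarchical decomposition is not shown, and all names are illustrative.

```python
def histogram(pixels, bins=2):
    """Normalised histogram over quantised RGB colours."""
    step = 256 // bins
    counts = [0] * (bins ** 3)
    for r, g, b in pixels:
        rq, gq, bq = (min(c // step, bins - 1) for c in (r, g, b))
        counts[(rq * bins + gq) * bins + bq] += 1
    return [count / len(pixels) for count in counts]


def intersection(h1, h2):
    """Histogram intersection of two normalised histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def patch_histograms(image, rows, cols):
    """Split the image into a rows x cols grid and return {(row, col): histogram}."""
    height, width = len(image), len(image[0])
    patches = {}
    for i in range(rows):
        for j in range(cols):
            top, bottom = i * height // rows, (i + 1) * height // rows
            left, right = j * width // cols, (j + 1) * width // cols
            pixels = [image[y][x] for y in range(top, bottom) for x in range(left, right)]
            patches[(i, j)] = histogram(pixels)
    return patches


def best_patch(query_pixels, image, rows, cols):
    """Return the grid position and score of the patch most similar to the query."""
    query_hist = histogram(query_pixels)
    scores = {
        position: intersection(query_hist, patch_hist)
        for position, patch_hist in patch_histograms(image, rows, cols).items()
    }
    position = max(scores, key=scores.get)
    return position, scores[position]


RED, BLUE = (255, 0, 0), (0, 0, 255)
# Mostly blue image whose top-right quarter is red.
image = [
    [BLUE, BLUE, RED, RED],
    [BLUE, BLUE, RED, RED],
    [BLUE, BLUE, BLUE, BLUE],
    [BLUE, BLUE, BLUE, BLUE],
]
query = [RED, RED, RED, RED]
position, score = best_patch(query, image, rows=2, cols=2)
```

Against the whole image the red query scores only 0.25, but against the top-right patch it scores 1.0, so the query is matched to a sub-section of the stored image.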

A representation which also tries to capture some local colour information is the colour coherence vector representation [5] which counts separately pixels which belong to large (coherent) regions of the same colour and those which do not. We have developed an approach to sub-image matching which uses a pyramid of colour coherence vectors and which can locate details of high resolution art images in large collections of such images [6]. An example of a sub-image query is shown in figure 1 and the resulting match with the located sub-image is shown in figure 2.

Figure 1: An example sub-image query
Figure 2: The resulting match, showing the located sub-image
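The colour coherence vector described above can be illustrated with a simplified sketch. It assumes the image has already been quantised into a grid of colour labels, treats pixels as connected when they touch horizontally, vertically or diagonally, and uses a size threshold chosen by the caller. Any preprocessing used in [5] and the pyramid scheme of [6] are deliberately left out, and the function name is an assumption of this example.

```python
from collections import defaultdict, deque


def colour_coherence_vector(grid, threshold):
    """Return {colour: (coherent_pixels, incoherent_pixels)}.

    A pixel is coherent when it belongs to a connected region of its own
    colour containing at least `threshold` pixels.
    """
    height, width = len(grid), len(grid[0])
    seen = [[False] * width for _ in range(height)]
    vector = defaultdict(lambda: [0, 0])
    for y in range(height):
        for x in range(width):
            if seen[y][x]:
                continue
            colour = grid[y][x]
            seen[y][x] = True
            queue = deque([(y, x)])
            size = 0
            # Breadth-first flood fill over the 8-connected region of this colour.
            while queue:
                cy, cx = queue.popleft()
                size += 1
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and not seen[ny][nx] and grid[ny][nx] == colour):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            vector[colour][0 if size >= threshold else 1] += size
    return {colour: tuple(counts) for colour, counts in vector.items()}


# Two grids with identical colour histograms (4 red, 12 blue pixels).
clustered = [
    ["R", "R", "B", "B"],
    ["R", "R", "B", "B"],
    ["B", "B", "B", "B"],
    ["B", "B", "B", "B"],
]
scattered = [
    ["R", "B", "B", "B"],
    ["B", "B", "R", "B"],
    ["B", "B", "B", "B"],
    ["R", "B", "B", "R"],
]
ccv_clustered = colour_coherence_vector(clustered, threshold=4)
ccv_scattered = colour_coherence_vector(scattered, threshold=4)
```

A plain colour histogram cannot tell these two grids apart, whereas the coherence vectors differ: the red pixels are all coherent in the first grid and all incoherent in the second.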

A representation which captures information about colour boundaries within the image has been proposed by Matas et al. [7] in an approach they call the multi-modal neighbourhood signature representation. This approach has the added benefit that sub-images can be matched directly without the need for a pyramid decomposition.

Colour is not the only basis for representations in CBIR. Texture, which in image processing refers to a measure of repeating patterns in an image, has also provided a useful basis for representations. Again the representations tend to be global and only appear useful for some particular image types where repeating patterns are a central characteristic.

For the ultimate CBIR system what we need is perfect image understanding software. We need to bridge the so-called semantic gap, even to be able to address queries like “Find me images in this collection containing a building”. Query by example alone would be inadequate for satisfying even this apparently simple request without substantial knowledge of the variety of ways in which buildings may appear in images. An even greater challenge comes from queries like “Find me images in this collection which depict acts of kindness”. It is worth noting that the semantic gap also exists for text. The gap is not as wide, but until we have perfect natural language understanding software it will continue to exist at some level.

For CBIR, a starting point would be to represent explicitly any objects in the image. Shape is an important cue to object recognition and many attempts to use shape in CBIR systems have been reported, even in the early systems like QBIC from IBM [8]. The big problem with this is knowing what constitutes an object. It is possible to segment images into regions and represent the shapes of the regions but the software needs to be trained to match or recognise particular object shapes which will typically be composed of several regions from a segmentation of the image. Some approaches to this have been reported in particular domains but general purpose CBIR systems using objects as intermediate representations are still uncommon. A rather simple example of shape finding comes from the Artiste project [9], a European project to develop a distributed art retrieval, navigation and analysis system. It includes a facility to detect images of paintings in frames of a particular shape. Most frames are rectangular but some are circular, some are triptych etc. A border finder locates the boundary of the frame in the image and a neural net classifier has been trained to use the border to deliver the frame type.

Bridging the Semantic Gap

The search for approaches to the extraction of higher level representations from images is an active area of research. Associating features extracted from images with semantic concepts has been reported [10] and in Southampton we have developed the idea of a multimedia thesaurus in our MAVIS 2 multimedia information system as an attempt to bridge the semantic gap [11].

In a traditional thesaurus, different textual representations of the same concept are associated with one another. In the multimedia thesaurus (MMT), different multimedia representations of the same concept are associated with that concept. The MMT is a multi-layer data structure used for storing the multimedia information in the system. At the highest level there is a semantic layer which records concepts and the relationships between them in the application domain. At the next level down in the simplest form of the architecture are selections from media which in some way represent the concept. For example if the concept is a vase, an image selection containing a vase is a visual representation of the concept. Associated with the image selection are the extracted signatures, for example, giving shape, texture and colour information about the vase. Also at this level we may have textual representations of the vase, so the word vase may be stored and associated with the concept vase in the semantic layer. Textual synonyms such as “amphora” may also be stored in the second level as may other visual representations or sound clips of the word vase being spoken. At the lowest level we have the raw media from which the representations have been selected and by keeping pointers to the raw media for the selections rather than a duplicate we can minimise the storage requirements.

This structure provides some valuable additional functionality in the multimedia system. For example, if query by example is being used for content-based image retrieval, and the query can be matched with a representation in the MMT, the system may be able to identify the concept forming the basis of the query and from that it may find alternative representations of the concept which may enable it to retrieve images which would otherwise have been missed. Similarly, if content-based navigation is being used and a link has been authored on one view of an object, it may be possible to follow the link using a different view as the source anchor if both views are associated with the same concept in the MMT. It is also possible to follow a link authored on the text representation of a concept from an image representation of the same concept if the user so wishes.
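To make the layered structure more concrete, here is a toy sketch of a thesaurus with a semantic layer of concepts, a layer of media selections with extracted signatures, and pointers to raw media. It is not the MAVIS 2 implementation: the class names, the file names, the two-number signatures and the use of Euclidean distance with a fixed cut-off are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Representation:
    media_type: str        # e.g. "image", "text" or "sound"
    raw_media: str         # pointer to the raw media item, not a copy of it
    selection: str         # the part of the raw media that shows the concept
    signature: tuple = ()  # extracted features, e.g. colour or shape values


@dataclass
class Concept:
    name: str
    representations: list = field(default_factory=list)


class MultimediaThesaurus:
    def __init__(self):
        self.concepts = {}  # semantic layer: concept name -> Concept

    def add(self, concept_name, representation):
        """Associate a representation with a concept, creating the concept if needed."""
        concept = self.concepts.setdefault(concept_name, Concept(concept_name))
        concept.representations.append(representation)

    def concept_of(self, signature, max_distance):
        """Return the concept whose stored signature is nearest the query, or None."""
        best = None
        for concept in self.concepts.values():
            for rep in concept.representations:
                if not rep.signature or len(rep.signature) != len(signature):
                    continue
                distance = sum((a - b) ** 2 for a, b in zip(rep.signature, signature)) ** 0.5
                if distance <= max_distance and (best is None or distance < best[0]):
                    best = (distance, concept.name)
        return best[1] if best else None

    def alternatives(self, concept_name, media_type=None):
        """Return other representations of a concept, optionally of one media type."""
        concept = self.concepts.get(concept_name)
        if concept is None:
            return []
        return [r for r in concept.representations
                if media_type is None or r.media_type == media_type]


mmt = MultimediaThesaurus()
mmt.add("vase", Representation("image", "img/0042.tif", "region 3", (0.9, 0.1)))
mmt.add("vase", Representation("text", "catalogue.txt", "vase"))
mmt.add("vase", Representation("text", "catalogue.txt", "amphora"))
mmt.add("frame", Representation("image", "img/0017.tif", "border", (0.1, 0.9)))

concept = mmt.concept_of((0.8, 0.2), max_distance=0.5)
terms = [r.selection for r in mmt.alternatives(concept, "text")]
```

In this toy setting a query image whose signature lies close to the stored vase selection is mapped to the concept, and from there the textual representations “vase” and “amphora” become available for further retrieval or link following.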

One of the problems with this approach is the building of the associations in the MMT between representations and the concepts they represent. Clearly a manual approach is possible but is time consuming in the extreme. In a prototype application [12], brief text descriptions associated with the concepts in the semantic layer were available and some of the images had sufficient metadata associated with them to allow the use of latent semantic indexing [13] to estimate the similarity between the concept description and the metadata description of the image. This facilitated automatic creation of some of the associations and others could be made by pattern matching between the images themselves. Images were then automatically associated with the concepts with which similar images were associated. Although not a fully automatic approach it enabled us to recognise this as a way of accelerating MMT building in particular application domains.

As the MMT evolves, it should be clear that a larger and larger number of representations associated with concept classes in the semantic layer will be available. To make an association between a query selection and a concept may take some considerable computation time as the representations extracted from the query are compared with representations in the MMT. However, at some stage in the evolution of the MMT it may be possible to develop a classifier which could allocate new representations to concepts more quickly than via brute force matching. We have made some preliminary investigations into the use of intelligent autonomous processes or agents for monitoring the MMT and clustering and classifying representations as their numbers become suitably large [14]. Existing associations in the MMT are used as the basis for learning by the classifiers.

Conclusion

Although we, and others, have made tentative steps towards bridging the semantic gap in multimedia information handling, particularly in the area of content and concept based retrieval and navigation, many problems remain. One of the key difficulties is that the signatures or representations that we are working with are crude and little prior knowledge is being utilised. Until more powerful image understanding techniques can be developed and incorporated into the image processing functions we will be severely handicapped in our efforts. This is even more true for other non-text media. But even for text, it is clear that retrieval and navigation will benefit from enhanced text understanding facilities.

Another difficulty is the computational problem associated with content-based media retrieval. Many of the representations are feature vectors of high dimensionality and there are serious problems with indexing such features for rapid retrieval. Although novel indexing strategies have been published, many of them collapse at very high dimensionality. Finally, it is worth mentioning that human-computer interface problems are also associated with multimedia information handling. For example, given a query image which contains a complex scene and wishing to use one of the objects in the scene as the query object, how do you indicate to the computer the limits of the object required? Interactive segmentation is a possibility but it is slow and inelegant compared with human capabilities for reasoning over images.

In spite of these continuing difficulties, significant strides have been made in recent years in the area of content-based retrieval and navigation and although metadata will continue to be an essential aid, the increasing value of content-based retrieval and content-based navigation should not be overlooked, particularly in constrained application domains and when metadata is sparse.

Acknowledgements

The authors are grateful to the European Commission for their support through grant IST-1999-11978 and to their collaborators (C2RMF(F), NCR(Dk), Giunti Interactive Labs (I), Uffizi Gallery (I), IT Innovation Centre (UK), The National Gallery (UK), The Victoria and Albert Museum(UK)) on the ARTISTE project for image data and useful conversations.

References

  1. Les A. Carr, David C. DeRoure, Hugh C. Davis and Wendy Hall (1998) Implementing an Open Link Service for the World Wide Web. World Wide Web Journal, 1, 1998.
  2. A.W.M. Smeulders, M. Worring, S. Santini, A. Gupta and R. Jain (2000) Content-Based Image Retrieval at the End of the Early Years. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12):1349-1380, 2000.
  3. J. Eakins and M. Graham. Content based image retrieval. Technical Report 39, U.K. JISC Technology Application Programme, Oct. 1999.
    URL: <http://www.jtap.ac.uk/>
  4. M. J. Swain and D. H. Ballard. Color Indexing. International Journal of Computer Vision, 7(1):11-32, 1991.
  5. Greg Pass, Ramin Zabih, and Justin Miller. Comparing Images Using Color Coherence Vectors. MultiMedia, pages 65-73. ACM, 1996.
  6. Stephen Chan, Kirk Martinez, Paul Lewis, C. Lahanier and J. Stevenson (2001) Handling Sub-Image Queries in Content-Based Retrieval of High Resolution Art Images. International Cultural Heritage Informatics Meeting p.157-163.
  7. J. Matas, D. Koubaroulis, and J. Kittler. Colour Image Retrieval and Object Recognition Using Multimodal Neighbourhood Signature. In D. Vernon, editor, Proceedings of the European Conference on Computer Vision, LNCS volume 1842, pages 48-64, Berlin, Germany, June 2000. Springer.
  8. M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele, and P. Yanker. Query by image and video content: The QBIC system. IEEE Computer, 28(9):23-32, Sept. 1995.
  9. The Artiste Project Home Page
    URL: <http://www.artisteweb.org/>
  10. Carlo Colombo, Alberto Del Bimbo and Pietro Pala. Semantics in Visual Information Retrieval. IEEE Multimedia, 38-53, July 1999.
  11. M. Dobie, R. Tansley, D. Joyce, M. Weal, P. Lewis, and W. Hall. A flexible architecture for content and concept based multimedia information exploration. In Proceedings of the Challenge of Image Retrieval (CIR'99), pages 1-12, Newcastle, UK, Feb. 1999.
  12. Robert Tansley, Colin Bird, Wendy Hall, Paul Lewis and Mark Weal (2000) Automating the Linking of Content and Concept. Proceedings ACM Multimedia 2000 p.445-448.
  13. T. K. Landauer, P. W. Foltz, and D. Laham. An introduction to latent semantic analysis. Discourse Processes, 25:259-284, 1998.
  14. Dan W. Joyce, Paul H. Lewis, Robert H. Tansley, Mark R. Dobie and Wendy Hall (2000) Semiotics and Agents for Integrating and Navigating Through Media Representations of Concepts. Storage and Retrieval for Media Databases 2000 p.120-31.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Author Details

Paul Lewis
Department of Electronics and Computer Science
University of Southampton
Southampton
SO17 1BJ

Phone: +44 2380 593715

phl@ecs.soton.ac.uk
<http://www.ecs.soton.ac.uk/~phl>

Paul Lewis is a Senior Lecturer in the Intelligence, Agents and Multimedia Research Group in the Department of Electronics and Computer Science in the University of Southampton. His research interests are in image and video analysis and their applications to multimedia information handling. He has been an investigator on numerous EPSRC and EU grants, most recently working on the development of content and concept based retrieval and navigation tools in multimedia environments.

Kirk Martinez is a lecturer in the Intelligence, Agents, Multimedia Research Group in the Department of Electronics and Computer Science in the University of Southampton. He has a BSc in Physics from the University of Reading and a PhD in image processing from the University of Essex. When he was Arts Computing lecturer in The University of London he developed image processing applications and imaging for art. His current research is content-based retrieval and museum applications of augmented reality.

David Dupplaw is a research assistant in the Department of Electronics and Computer Science in the University of Southampton working on the Artiste European project. He graduated in Computer Science from the University of Southampton and he is nearing the completion of a PhD on image representations for content-based applications.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

For citation purposes:
Lewis, P., Dupplaw, D., and Martinez, K. "Content-Based Multimedia Information Handling: Should we Stick to Metadata?", Cultivate Interactive, issue 6, 11 February 2002
URL: <http://www.cultivate-int.org/issue6/retrieval/>