Research themes
Database Summaries
The ever growing size of databases makes data summarization needed in order to
present the user a concise, yet complete view of the data. Our proposed summarization
process [9]
can roughly been described as a two step process. The first step is to rewrite the
original database data into an homogeneous user-oriented vocabulary. The second step is
then to use a concept formation algorithm against the rewritten data. The fuzzy-set
theory
provides a mathematical foundation to handle these two steps in a more efficient and
robust way than can be achieved with first order logic.
Fuzzy sets theory was introduced by L.A. Zadeh in 1965 in order to model sets whose
boundaries are not sharp. A fuzzy (sub)set F of an universe
is defined
thanks to a membership function denoted by
F which maps every element x of
into a degree
F(x) in the unit interval [0, 1]. Thus, a fuzzy set is a
generalization of regular set (whose membership function is defined on
the pair 0, 1).
In the first step, database tuples are rewritten using a user defined vocabulary. This vocabulary is intended to match as well as possible the natural language in which users expresses their knowledge. A database user usually refers to his or her data using a vocabulary appropriate for his field of expertise and understood by his or her fellows. For example, a salary will be said to be high, reasonable or average. This description in fact is an implicit categorization and there is no crisp border line between an average and a high salary. Fuzzy logic offers the mathematical ground to define such a vocabulary in term of linguistic variables where each data is more or less satisfactorily described by the concept.
In a concept formation algorithm, new data are incorporated into a concept hierarchy using a local optimization criteria to decide how should the hierarchy be modified. A quality measure is evaluated to compare the effect of operators that modify the hierarchy topology namely, creating a new node, creating a new level, merging two nodes, or splitting one. Using fuzzy logic in the evaluation of this measure, our concept formation algorithm is less prone to suffer the well known threshold effect of similar incremental algorithm.
Database query languages are typically based on first order logic. To allow for more flexible manipulation of large quantities of data, we rest on fuzzy logic to handle this operation. Using the database summary, queries with too few results can be relaxed to retrieve partially satisfactory subsets of the database. The fuzzy matching mechanism also permit to handle user queries expressed in vague or imprecise terms.
Multimedia Data Management
Multimedia data such as image, audio or video is quite different from structured data and semi-structured (text) data in that it is media-specific (with specific operations), possibly very large, and described by metadata. Multimedia data management aims at providing high-level capabilities for searching and manipulating multimedia collections efficiently and accurately. To address this objective, we rely on two kinds of techniques: multimedia data analysis and database techniques [5].
Multimedia data analysis is useful to automatically translate multimedia data into sets of features which can be used for indexing and searching. The features can be low-level or higher-level. Low-level features, e.g. color, texture and shape for images, can be obtained from signal processing techniques, e.g., colour histograms for images. Mid- to high-level features are also useful to understand the media content using more complex analysis techniques such as statistical classification or pattern recognition. In addition, media content creators can add metadata information that conveys more semantics.
Database techniques are useful to bridge the semantic gap between the media descriptions that can be automatically produced by analysis and the user requirements to search multimedia data in a natural way. Applying the fundamental principle of data abstraction to multimedia yields new theories and techniques for modeling, indexing and querying multimedia data.
Multimedia modeling is concerned with the definition of a suitable representation for media data, their metadata and the operations to be applied to them. A multimedia data model introduces an abstraction between the physical level (data files and indexes) and the conceptual representation, together with the operations to manipulate the data. An important capability is to capture relationships between content and metadata which is best done using objects and objet references. Thus, most multimedia models use or extend a relational or object-oriented data model.
Indexing is concerned with the physical access to multimedia data. The aim of indexes is to rapidly access the data requested by the query. Multimedia features such as average colors, color histograms and textures are usually modelled as points in a multidimensional space. Thus, multimedia can be efficiently indexed using multidimensional index structures, also called spatial access methods, which organize the multidimensional points as buckets, each corresponding to a disk block. There are two categories of such index structures: tree-based and hashing-based. The major issue faced by these methods is that multimedia features usually have a high number of dimensions. Thus, they suffer from the ``dimensionality curse'' which says that the performance of indexing (and thus querying) degrades as the data dimensionality increases.
Querying is concerned with the conceptual access to multimedia by the user with a high-level (SQL-like) query language. The common query in multimedia is similarity search where the objects retrieved are ordered according to some scores based on a distance function defined on a feature vector. In the presence of indexes, a similarity query involving two or more features must be decomposed into subqueries whose results need be finally integrated. In the case of image databases, a useful alternative to querying is browsing. By organizing an image collection in a suitable way, e.g. as a Galois' lattice, browsing allows for more interactive and iterative searching.
Distributed Data Management
The Atlas group considers data management in the context of distributed systems, with the objective of making distribution transparent to the users and applications. Thus we capitalise on the principles of distributed systems, in particular, large-scale distributed systems such as clusters, grid, and peer-to-peer (P2P) systems, to address issues in data replication and high availability, transaction load balancing, and query processing.
Data management in distributed systems has been traditionally achieved by distributed database systems which enable users to transparently access and update several databases in a network using a high-level query language (e.g. SQL). Transparency is achieved through a global schema which hides the local databases' heterogeneity. In its simplest form, a distributed database system is a centralized server that supports a global schema and implements distributed database techniques (query processing, transaction management, consistency management, etc.). This approach has proved effective for applications that can benefit from centralized control and full-fledge database capabilities, e.g. information systems. However, it cannot scale up to more than tens of databases. Data integration systems extend the distributed database approach to access data sources on the Internet with a simpler query language in read-only mode.
Parallel database systems also extend the distributed database approach to improve performance (transaction throughput or query response time) by exploiting database partitioning using a multiprocessor or cluster system. Although data integration systems and parallel database systems can scale up to hundreds of data sources or database partitions, they still rely on a centralized global schema and strong assumptions about the network.
In contrast, peer-to-peer (P2P) systems adopt a completely decentralized approach to data sharing. By distributing data storage and processing across autonomous peers in the network, they can scale without the need for powerful servers. Popular examples of P2P systems such as Gnutella and Kaaza have millions of users sharing petabytes of data over the Internet. Although very useful, these systems are quite simple (e.g. file sharing), support limited functions (e.g. keyword search) and use simple techniques (e.g. resource location by flooding) which have performance problems. To deal with the dynamic behavior of peers that can join and leave the system at any time, they rely on the fact that popular data get massively duplicated.
Initial research on P2P systems has focused on improving the performance of query routing in the unstructured systems which rely on flooding. This work led to structured solutions based on distributed hash tables (DHT), e.g. CAN and CHORD, or hybrid solutions with super-peers that index subsets of peers. Although these designs can give better performance guarantees, more research is needed to understand their trade-offs between fault-tolerance, scalability, self-organization, etc.
Recently, other work has concentrated on supporting advanced applications which must deal with semantically rich data (e.g., XML documents, relational tables, etc.) using a high-level SQL-like query language. Such data management in P2P systems is quite challenging because of the scale of the network and the autonomy and unreliable nature of peers. Most techniques designed for distributed database systems which statically exploit schema and network information no longer apply. New techniques are needed which should be decentralized, dynamic and self-adaptive.


