MDP2P : Massive Data in Peer-to-Peer
MDP2P is a joint research project funded by the ACI Masses de Données of the French Ministry of Research. The project is scheduled for 3 years starting in November 2003.
Final report of the project, 22 November 2006:
pdf file.

New means of creating digital data have led to the production of masses of text and multimedia data, stored on autonomous, heterogeneous and distributed equipment such as mobile devices, personal computers and enterprise servers. At the same time, the need to share those masses of distributed data is becoming increasingly important, for instance in decision-support applications or Web-based interest communities. However, the current functionality of the Web (i.e. hypermedia navigation and text retrieval) is fairly limited for such data sharing, and current technologies do not scale up to large numbers of massive data sets.
The typical enterprise solution of centralizing data management in very large servers is heavyweight (in particular for data migration), costly (in human and computing resources) and does not scale up well. A lightweight solution is to globally exploit the massive storage and computing power already available in a computer network, and to add high-level distributed data management services. The solution must be general enough to work in various contexts, such as corporate intranets with computers of very different power (PCs, departmental servers, multiprocessors, etc.) and the Internet connecting large numbers of individual computers. To strive for a general solution, we adopt a peer-to-peer (P2P) distributed architecture, where each node can support the same functionality (client or server) and communicate with any other node over the network. The potential advantages of P2P systems are node autonomy, scalability to large numbers of nodes, high availability (through replication) and performance (through parallelism).
Thus, the main objective of the project is to provide high-level services for managing text and multimedia data in large-scale P2P systems. As in database management systems, these services are not limited to file sharing (as in current P2P systems) and need to be high-level, with query capabilities and transactional support (for data consistency). Furthermore, they must provide good access performance, which can be obtained through data replication, distributed query optimization and parallel query processing. To validate our approach and show its wide range of application, we concentrate on two different P2P contexts that we know well: the Web and clusters of PCs.
The expected results are the following:
- Fundamental contributions in the various areas investigated in P2P systems and data management.
- New algorithms for distributed data management (e.g. replication, query processing) and multimedia data management (e.g. indexing, clustering).
- A better understanding of P2P applications (with which we are involved in industrial projects) such as Web-based data warehousing.
- Prototypes, based on Open Source Software such as PostgreSQL and Kerrighed (from the Paris team).
Participants:
Atlas
Team:
INRIA and LINA
2 rue de la Houssinière, BP 92208
44322 Nantes Cedex 3
Members:
José Martinez (José.Martinez@univ-nantes.fr)
Esther Pacitti (Esther.Pacitti@univ-nantes.fr)
Patrick Valduriez (Patrick.Valduriez@inria.fr)
Contact: Patrick.Valduriez@inria.fr
Gemo
Team:
INRIA UR Futurs
ZAC des Vignes --- Parc Club Orsay
4, rue Jacques Monod
91893 Orsay Cedex
Members:
Serge Abiteboul (Serge.Abiteboul@inria.fr)
Ioana Manolescu (Ioana.Manolescu@inria.fr)
Tova Milo (Tova.Milo@inria.fr)
Marie-Christine Rousset (Marie-Christine.Rousset@lri.fr)
Contact: Ioana.Manolescu@inria.fr
Paris
Team:
IRISA
Campus de Beaulieu
35042 Rennes Cedex
Members:
Yvon Jégou (Yvon.Jegou@irisa.fr)
Christine Morin (Christine.Morin@irisa.fr)
Thierry Priol (Thierry.Priol@irisa.fr)
Contact: Yvon.Jegou@irisa.fr
Texmex
Team:
IRISA
Campus de Beaulieu
35042 Rennes Cedex
Members:
Laurent Amsaleg (Laurent.Amsaleg@irisa.fr)
Patrick Gros (Patrick.Gros@irisa.fr)
Contact: Laurent.Amsaleg@irisa.fr
Research themes
Based on a generic P2P data management architecture, which we will define, we will focus on the following services and related techniques:
- Data replication and large-scale load balancing;
- Large-scale indexing and retrieval of text and multimedia documents;
- Massive data management in P2P systems.
Data replication and large-scale load balancing
Data replication is useful to improve both data availability and access performance (by favoring parallelism and load balancing). Important quality criteria for replicated data are freshness (all replicas are up to date) and consistency (all replicas are the same). The different replication techniques devised for distributed database systems yield different quality/performance trade-offs. The most general technique, multi-master (or symmetric) replication, can yield high performance since any master node can perform updates. However, this comes at the expense of freshness and consistency, since update conflicts can occur and need to be either prevented or repaired. Based on our previous work on optimistic replication for large-scale distributed systems and preventive replication for cluster systems, we plan to study the various trade-offs of symmetric replication techniques for large-scale P2P systems. We also plan to focus on complex objects such as XML documents.
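To make the notion of an update conflict in multi-master replication concrete, here is a minimal sketch in Python. It is not the replication algorithm studied in the project: it assumes a well-known generic technique (version vectors) to detect concurrent updates, and the repair rule (keep the value of the replica whose name sorts first) is a hypothetical choice made only for illustration. The names Replica, dominates and reconcile are likewise invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One master copy of a data item, tagged with a version vector."""
    node: str
    value: str = ""
    clock: dict = field(default_factory=dict)  # node name -> number of updates seen

    def update(self, value):
        # A local update bumps this node's own counter.
        self.value = value
        self.clock[self.node] = self.clock.get(self.node, 0) + 1


def dominates(a, b):
    """True if version vector a has seen every update recorded in b."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def reconcile(local, remote):
    """Bring local up to date with remote; return True if a conflict was repaired."""
    if dominates(local.clock, remote.clock):
        return False  # local is already at least as fresh as remote
    conflict = not dominates(remote.clock, local.clock)
    if conflict:
        # Assumed repair rule: keep the value of the replica whose name sorts first.
        local.value = min((local, remote), key=lambda r: r.node).value
    else:
        local.value = remote.value  # remote is strictly newer: simply catch up
    for node, count in remote.clock.items():
        local.clock[node] = max(local.clock.get(node, 0), count)
    return conflict


a, b = Replica("A"), Replica("B")
a.update("v1")
reconcile(b, a)             # b was stale: it catches up, no conflict
a.update("v2")              # two masters now update concurrently...
b.update("v3")
repaired = reconcile(a, b)  # ...which is detected as a conflict and repaired
reconcile(b, a)             # b then adopts the repaired state
```

In this sketch the two masters end up with the same value and the same version vector. A preventive scheme would instead order the updates before applying them, so that no repair step is needed.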
In a large-scale distributed system, at least two kinds of data access are difficult to support: update transactions, which can be very frequent and deal with replicated data, and retrieval queries, which deal with very large objects. Traditional distributed query processing strategies statically devise an optimal execution plan based on a cost model and statistics on the data, and execute it on selected nodes. This approach is not suited to our context, where node load can change rapidly and the cost of processing large objects is difficult to predict. Our approach is therefore to devise new dynamic techniques that yield a high degree of parallelism. To address the issue of sharing load information across nodes, we plan to study the use of distributed shared memory, in particular the Kerrighed system developed by the Paris team.
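The contrast between a static plan and a dynamic, load-aware choice can be sketched as follows. This is only an illustration under stated assumptions: a plain Python dictionary stands in for load information shared by all nodes (it does not use Kerrighed or any real shared-memory API), and the node names, query costs and the functions pick_node and dispatch are hypothetical.

```python
def pick_node(load, candidates):
    """Choose, at execution time, the candidate node with the lowest current load."""
    return min(candidates, key=lambda node: (load[node], node))


def dispatch(queries, replicas, load):
    """Assign each (name, data item, cost) query to a node and update the load table."""
    plan = {}
    for name, item, cost in queries:
        node = pick_node(load, replicas[item])
        load[node] += cost  # the shared table reflects the new work immediately
        plan[name] = node
    return plan


# A plain dict standing in for load information visible to every node.
load = {"n1": 0, "n2": 5, "n3": 0}
# Which nodes hold a replica of which data item.
replicas = {"images": ["n1", "n2"], "texts": ["n2", "n3"]}
queries = [("q1", "images", 6), ("q2", "images", 4),
           ("q3", "texts", 2), ("q4", "texts", 9)]

plan = dispatch(queries, replicas, load)
print(plan)
print(load)
```

Because the decision is taken per query against the current load table, the second query on "images" goes to a different node than the first, which a plan fixed in advance from the initial loads would not do.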
Large-scale indexing and retrieval of text and multimedia documents
Masses of digitized data tend to become really huge. This is due to the combination of the number of digitized items, the size of each of them, and the size of the indexing meta-data. In addition, audio, video, and images are mostly proprietary information. All these considerations lead us to favor a (logical) server approach with support for high performance, such as clusters of PCs. One of the main issues is then to exploit parallelism in order to scale up multimedia indexing techniques.
Indeed, multimedia meta-data indexing faces significant difficulties. First, individual meta-data are not informative enough. Conversely, taking too many of them into consideration leads to the so-called "high-dimensionality curse" problem. So far, indexing techniques that address this issue (e.g., SR-trees, X-trees) are limited to roughly 5 to 12 dimensions, whereas some multimedia meta-data would require hundreds of dimensions.
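One commonly cited symptom of the high-dimensionality problem is that distances concentrate: the nearest and the farthest neighbors of a query become almost equally far away, so an index can prune very little. The small experiment below illustrates this on uniformly random points. The function name relative_contrast, the number of points and the choice of uniform data are assumptions made for this illustration, not measurements from the project.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (farthest - nearest) / nearest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


# The contrast shrinks as the number of dimensions grows.
for dim in (2, 10, 100, 500):
    print(dim, round(relative_contrast(dim), 2))
```

When the contrast is close to zero, a tree-based index has to visit most of its nodes to answer a nearest-neighbor query, which is why approximate and parallel methods become attractive.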
We envisage combining several techniques to circumvent the problem of efficiently indexing and searching large amounts of multimedia data and meta-data. First, parallelism has been identified as a prerequisite for any solution, from collecting multimedia data, to indexing, to querying. Next, clustering algorithms have to be used in order to access not a flat repository but an organized set of meta-data (trees and lattices have been explored). These algorithms are usually expensive; they must therefore be made cheaper (probably at the expense of some clustering quality) and parallelized too. Additionally, we expect to benefit from the classification to improve indexing. More precisely, at particular nodes of the meta-data clustering graph, not all the meta-data properties need to be efficiently indexed, i.e., we can trade a single heavy index for several smaller indices, which can be queried in parallel more easily. Of course, we have to experiment with alternative solutions in order to find the best trade-off between indexing performance, disk usage, and querying performance.
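The idea of trading one heavy index for several smaller ones that are queried in parallel can be sketched as follows. This is a simplified illustration, not the project's method: the cluster centroids are assumed to be given, each "small index" is just a list scanned linearly, and the names build_partitions, search and the probes parameter are hypothetical. Scanning only the closest partitions makes the answer approximate, which mirrors the quality/performance trade-off mentioned above.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def build_partitions(vectors, centroids):
    """Group vectors by nearest centroid; each group acts as one small index."""
    parts = [[] for _ in centroids]
    for v in vectors:
        nearest = min(range(len(centroids)), key=lambda i: math.dist(v, centroids[i]))
        parts[nearest].append(v)
    return parts


def search(query, centroids, parts, probes=2):
    """Scan the `probes` closest partitions in parallel and return the best match."""
    order = sorted(range(len(centroids)), key=lambda i: math.dist(query, centroids[i]))
    chosen = [parts[i] for i in order[:probes] if parts[i]]

    def best_in(part):
        return min(part, key=lambda v: math.dist(query, v))

    with ThreadPoolExecutor() as pool:
        local_best = list(pool.map(best_in, chosen))
    if not local_best:
        return None  # every probed partition was empty
    return min(local_best, key=lambda v: math.dist(query, v))


vectors = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
centroids = [(0.3, 0.3), (10.3, 10.3)]  # assumed output of a prior clustering step
parts = build_partitions(vectors, centroids)
result = search((9, 9), centroids, parts, probes=1)
print(result)
```

On a cluster, each partition would live on a different node rather than in a thread, but the structure of the query (select partitions, search them concurrently, merge the local answers) stays the same.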
Massive data management in P2P systems
Within the Active XML system, developed in the Gemo team, we have proposed an architecture for distributed AXML document management in a P2P context, but limited to cooperating peers that all know each other. This architecture can be seen as a starting point for the management of large data volumes in a full P2P architecture, where peers can join and leave the network freely. Furthermore, the distinctive feature of AXML documents, namely that they include service calls, provides interesting opportunities for distributed document querying. Indeed, each peer may provide a set of "standardised" Web services specific to the context of P2P data management.
Our purpose is to outline the necessary functionalities and to develop solutions for a massive data management infrastructure in P2P. Our work will pursue several techniques: indexing, querying, data acquisition and distributed data management. The test application for this infrastructure will be the construction of a data warehouse in P2P mode. We will also explore the use of monitoring services and change management in this warehouse, based on existing results obtained in the Gemo team on XML change management. We envision transferring these results to a distributed P2P architecture, turning them into generic services that a peer may offer to another.
A key point of our research will be the construction of a distributed indexing scheme that allows a peer that is new to the network to locate interesting peers, documents, or services. Such an index will be built by all peers on the network, cooperating with each other. The index will then be used for routing and evaluating distributed queries. We have implemented a first prototype for distributed query evaluation; we aim at enhancing it for large-scale P2P networks.
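One generic way to let all peers share the work of maintaining such an index is a hash ring in the style of a distributed hash table, where each peer stores the index entries whose keys hash closest to it. The sketch below assumes this technique purely for illustration; the text above does not state which indexing scheme the project uses, and the class CooperativeIndex, its methods, and the peer and key names are hypothetical.

```python
import bisect
import hashlib


def ring_position(name):
    """Map a peer name or an index key to a position on the hash ring."""
    return int(hashlib.sha1(name.encode("utf-8")).hexdigest(), 16)


class CooperativeIndex:
    """Toy distributed index: each peer stores the entries whose keys hash to it."""

    def __init__(self, peers):
        self.ring = sorted((ring_position(p), p) for p in peers)
        self.store = {p: {} for p in peers}  # peer -> {key: set of providers}

    def responsible_peer(self, key):
        # First peer at or after the key's position, wrapping around the ring.
        i = bisect.bisect_left(self.ring, (ring_position(key), ""))
        return self.ring[i % len(self.ring)][1]

    def publish(self, key, provider):
        """Record that `provider` offers the document or service named `key`."""
        self.store[self.responsible_peer(key)].setdefault(key, set()).add(provider)

    def lookup(self, key):
        # Any peer, including a newcomer, can route a lookup the same way.
        return self.store[self.responsible_peer(key)].get(key, set())


index = CooperativeIndex(["peer1", "peer2", "peer3"])
index.publish("xml:catalog", "peer2")
index.publish("xml:catalog", "peer3")
print(index.lookup("xml:catalog"))
```

A newcomer only needs to know the hashing rule and one entry point to find which peer holds the entry for a key. Handling peers that join and leave, which requires moving entries between neighbors on the ring, is deliberately left out of this sketch.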
Workplan
Data replication and large-scale load balancing
- t0+12: technical specifications for data replication and load balancing, including functions for fail-stop, update propagation, failover and distributed query processing;
- t0+24: prototyping on a Linux cluster with PostgreSQL for database management and Kerrighed for distributed shared memory;
- t0+36: evolution towards a more general P2P context with clusters of servers connected by the Internet; experimentation using the clusters at LINA and at IRISA.
Large-scale indexing and retrieval of text and multimedia documents
- t0+12: technical specifications for large-scale multimedia indexing, including data-partitioning index methods, space-partitioning index methods and approximate methods;
- t0+24: technical specifications for parallelizing queries on multimedia documents, including clustering and image similarity search algorithms, and load balancing techniques;
- t0+36: prototyping on a Linux cluster of clustering and similarity search algorithms and performance comparisons.
Massive data management in P2P systems
- t0+12: technical specifications for extending AXML with advanced services and functions for P2P such as queries, change management and monitoring;
- t0+24: prototyping of the AXML extensions, as a library useful to develop complex applications;
- t0+36: optimization of query processing and experimentation with a data warehouse application.
Management
- Project meeting: October 17-20, 2006, Bases de Données Avancées (BDA) 2006, Palais des congrès, Lille.
  Organizers: S. Abiteboul, E. Pacitti.
  Participants: S. Abiteboul, R. Akbarinia, I. Manolescu, V. Martins, N. Preda, J. Quiane, E. Pacitti.
  Meeting minutes: see pdf file.
- Project meeting: September 18, 2006, working meeting, LINA, Nantes.
  Organizers: ACI Masses de données.
  Participants: Atlas team, S. Gançarski (LIP6), Atlas-GDD team (LINA).
  Guests: Rui Oliveira (Universidade do Minho, Portugal); Ricardo Jimenez-Peris (Universidad Politecnica de Madrid, Spain).
  Meeting minutes: see pdf file.
- Project meeting: April 11, 2006, working meeting, LINA, Nantes.
  Organizers: ACI Masses de données.
  Participants: S. Abiteboul, L. Amsaleg, R. Akbarinia, I. Manolescu, J. Martinez, V. Martins, E. Pacitti, N. Preda, J. Quiane, P. Valduriez, Atlas-GDD team (LINA).
  Meeting minutes: see pdf file.
  Meeting slides:
    - Intensional XML Indexing
    - Décrire 1.000.000 d'images grâce à une grappe : quelques problèmes ouverts
    - Classification et parallélisme pour une recherche efficiente
    - High dimensional data allocation in a shared nothing cluster
    - A framework for distributed data optimization
    - Query load balancing with autonomous information sources
    - Top-k query processing in DHTs
- Project meeting: October 21-23, 2005, Journées Paristic, LaBRI, Bordeaux.
  Organizers: ACIs Sécurité, Masses de données.
- Project meeting: October 18-20, 2005, Bases de Données Avancées (BDA) 2005, Palais du Grand Large, Saint Malo.
  Organizers: L. Amsaleg, I. Manolescu, and P. Valduriez.
  Participants: S. Abiteboul, R. Akbarinia, L. Amsaleg, I. Manolescu, J. Martinez, V. Martins, N. Preda, G. Raschia, P. Valduriez.
  Meeting minutes: see pdf file.
  Meeting slides:
    - Sharing Content in Structured P2P Networks
    - Optimistic Preventive Replication in a Database Cluster
    - Efficient and Effective Image Copyright Enforcement
- Project meeting: April 26, 2005, JXTA working meeting, GDS and MDP2P projects, Rennes.
  Organizers: G. Antoniu, L. Bougé and P. Valduriez.
  Participants:
    Atlas: R. Akbarinia, G. Gaumer, V. Martins, J. Quiane, P. Valduriez
    Paris: G. Antoniu, L. Bougé, M. Jan, S. Monnet
  Program: detailed program
  Meeting minutes: see detailed program
- Project meeting: March 17-18, 2005, Journées Masses de Données en P2P, Orsay.
  Organizers: S. Abiteboul, I. Manolescu, and P. Valduriez.
  Participants: Atlas, Gemo, Texmex, Paris + CEDRIC, LIP6, LSR-IMAG, PRISM, Smis.
  Program: detailed program
  Meeting minutes: see detailed program
- Project meeting: December 9, 2004, LINA, Nantes.
  Organizers: I. Manolescu and P. Valduriez.
  Participants: S. Abiteboul, R. Akbarinia, I. Manolescu, N. Mouaddib, E. Pacitti, G. Raschia, P. Valduriez and all members of the Atlas project.
  Meeting minutes: see detailed program
  Meeting slides:
    - P2P XML data warehouse data management
    - Replication and Query Processing in the APPA Data Management System
- MDP2P presentation: Journées ACI, September 16-17, 2004.
  Presentation slides: ppt file
- Project meeting: June 13-18, 2004, ACM SIGMOD/PODS, Maison de la Chimie, Paris.
  Organizers: L. Amsaleg, I. Manolescu, and P. Valduriez.
  Participants: S. Abiteboul, L. Amsaleg, P. Gros, I. Manolescu, J. Martinez, E. Pacitti, P. Valduriez.
  Meeting minutes: pdf file
- Kick-off project meeting: December 1, 2003, IRISA, Rennes.
  Organizers: L. Amsaleg and P. Valduriez.
  Participants: S. Abiteboul, L. Amsaleg, L. Duval, P. Gros, Y. Jegou, I. Manolescu, J. Martinez, P. Valduriez.
  Meeting minutes: pdf file
  Meeting slides:
    - Active XML, Data management in P2P systems
    - MDP2P -- P2P XML data warehouse management
    - Data Management in large-scale distributed systems
Participants: I. Manolescu, P. Valduriez
Meeting minutes: see pdf file
Poster: MDP2P: Masses de documents dans les systèmes P2P
Meeting slides: Large Scale Experimentation with Preventive Replication in a Database Cluster
ActiveXML: Active XML (AXML for short) is a declarative framework that harnesses web services for data integration, and is put to work in a peer-to-peer architecture.
RepDB*: Open Source Software data management component for replicating autonomous databases or data sources in a cluster system.
KadoP: KadoP is a system for sharing data and knowledge content in a peer-to-peer environment.


