- Big Data Open Source Systems Workshop (BOSS'16) at VLDB 2016. Polyglot Data Management Session. Chairs: Patrick Valduriez (INRIA), Marta Patiño (UPM)
- Transaction Management across Data Stores. International Journal of High Performance Computing and Networking. September 2016. Ricardo Jiménez-Peris, Marta Patiño Martinez, Iván Brondino, Valerio Vianello. (accepted)
- CoherentPaaS – Real-Time Network Performance Analysis in a Telco Environment Use Case. RTPBD 2016: Final Public Workshop from LeanBigData and CoherentPaaS (DISCOTEC 2016). Luis Cortesão and Diogo Regateiro.
- Transactional Support for Cloud Data Stores. RTPBD 2016: Final Public Workshop from LeanBigData and CoherentPaaS (DISCOTEC 2016). Ricardo Jiménez Peris, Marta Patiño Martínez, Ivan Brondino.
- Distributed Processing and Transaction Replication in MonetDB – Towards a Scalable Analytical Database System in the Cloud. RTPBD 2016: Final Public Workshop from LeanBigData and CoherentPaaS (DISCOTEC 2016). Ying Zhang, Dimitar Nedev, Panagiotis Koutsourakis and Martin Kersten.
Thanks to its flexibility (new computing jobs can be set up in minutes, without waiting for hardware procurement) and elasticity (resources can be allocated to instantly match the current workload), cloud computing has rapidly gained much interest from both academic and commercial users, and moving into the cloud is a clear trend in software development. To provide users with a fast, in-memory optimised analytical database system with all the conveniences of the cloud environment, we embarked upon extending the open-source column store database MonetDB with new features to make it cloud-ready. In this paper, we elaborate on the new distributed query processing and transaction replication features in MonetDB. The distributed query processing feature allows MonetDB to scale out horizontally to multiple machines, while the transaction replication schemes increase the availability of the MonetDB database servers.
- CQE: A middleware to execute queries across heterogeneous databases. RTPBD 2016: Final Public Workshop from LeanBigData and CoherentPaaS (DISCOTEC 2016). Raquel Pau.
The data management world has been evolving towards a large diversity of databases. This blooming of data management technologies, where each technology is specialized and optimal for a specific kind of processing, has led to a situation where no one size fits all. Consequently, software applications usually need to use several database technologies simultaneously, and software developers must deal with data integration issues because they cannot apply SQL queries across databases. To bridge this gap, we present a middleware called Common Query Engine (CQE), based on the open-source technology Apache Derby, that executes SQL-like queries integrating results from different data management technologies.
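For illustration only, the sketch below shows how an application might issue a cross-store query through a Derby-based engine such as CQE via plain JDBC. The JDBC URL and the table names are assumptions made for this example; the actual CQE interface may differ.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CqeQueryExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical JDBC URL; CQE builds on Apache Derby, so a Derby-style
        // client URL is assumed here purely for illustration.
        String url = "jdbc:derby://localhost:1527/cqe";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             // Illustrative query joining a relational table with a view
             // exposed over a document store; table names are invented.
             ResultSet rs = stmt.executeQuery(
                     "SELECT c.name, o.total FROM customers c " +
                     "JOIN mongo_orders o ON c.id = o.customer_id")) {
            while (rs.next()) {
                System.out.println(rs.getString("name") + " -> " + rs.getDouble("total"));
            }
        }
    }
}
```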
- ActivePivot improvements from CoherentPaaS. RTPBD 2016: Final Public Workshop from LeanBigData and CoherentPaaS (DISCOTEC 2016). Francois Sabary.
Concurrency control is one of the most important and performance-critical features of a database. Up to version 4, ActivePivot used a simple read-write lock to enforce mutual exclusion between queries and transactions. While functionally correct, this technique causes various performance issues, the most visible one being that long-running queries prevent new data from being inserted into the database. In this paper, we show how multi-version concurrency control has been implemented in ActivePivot to solve these problems. We also show how this new mechanism has been leveraged to implement both as-of and what-if analysis, two new defining features of ActivePivot.
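The following self-contained sketch illustrates the general multi-version idea the abstract describes: writers publish new versions while readers pin a snapshot and keep seeing a stable view, so long-running queries no longer block inserts. It is not ActivePivot code; all names are illustrative.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicLong;

/** Toy multi-version map: each key maps to a history of (version, value) entries. */
public class MvccMap<K, V> {
    private final Map<K, NavigableMap<Long, V>> versions = new ConcurrentHashMap<>();
    private final AtomicLong clock = new AtomicLong(0);

    /** Writers publish a new version; readers on older snapshots are unaffected. */
    public long put(K key, V value) {
        long version = clock.incrementAndGet();
        versions.computeIfAbsent(key, k -> new ConcurrentSkipListMap<>()).put(version, value);
        return version;
    }

    /** The snapshot a query pins when it starts. */
    public long snapshot() {
        return clock.get();
    }

    /** Read the latest value visible at the given snapshot. */
    public V get(K key, long snapshot) {
        NavigableMap<Long, V> history = versions.get(key);
        if (history == null) return null;
        Map.Entry<Long, V> visible = history.floorEntry(snapshot);
        return visible == null ? null : visible.getValue();
    }
}
```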
- Transactional MongoDB. RTPBD 2016: Final Public Workshop from LeanBigData and CoherentPaaS (DISCOTEC 2016). Pavlos Kranas, Sotiris Stamokostas, George Vafiadis, Athanasia Evangelinou.
The wide adoption of NoSQL data stores, and among them document data stores, owing to their rich functionality and performance, has created the need for them to also provide transactional semantics. MongoDB, a very popular document data store, requires the application developer to implement their own two-phase commit protocol whenever the application needs ACID properties. This paper presents an extension of the official MongoDB client that provides transactional semantics and Snapshot Isolation.
- STREAM-OPS: a Streaming Operator Library. RTPBD 2016: Final Public Workshop from LeanBigData and CoherentPaaS (DISCOTEC 2016). Ricardo Jiménez Peris, Valerio Vianello, Marta Patiño Martínez.
Nowadays applications consume huge amounts of live data with the requirement of real-time processing. Complex Event Processing (CEP) is a promising technology that allows these applications to process large amounts of information in real time. Some of the available CEP systems, like Apache Storm, provide a programmatic model for the definition of the continuous queries used to process data on the fly. This paper presents STREAM-OPS, a library of streaming operators written in Java that is designed to ease the definition of streaming queries. We also present an evaluation of the STREAM-OPS library integrated into two CEP systems developed in the context of the CoherentPaaS and LeanBigData European projects.
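The abstract does not show the STREAM-OPS API itself, so the sketch below only illustrates the kind of operator composition such a library enables (filter, map and a count-based sliding window), with invented interface and method names.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Predicate;

/** Illustrative streaming operators: filter, map and a count-based sliding window sum. */
public class StreamOpsSketch {

    interface Operator<I, O> {
        void onTuple(I tuple, Consumer<O> downstream);
    }

    static <T> Operator<T, T> filter(Predicate<T> p) {
        return (t, out) -> { if (p.test(t)) out.accept(t); };
    }

    static <I, O> Operator<I, O> map(Function<I, O> f) {
        return (t, out) -> out.accept(f.apply(t));
    }

    /** Emits the sum of the last 'size' values every time a new value arrives. */
    static Operator<Double, Double> slidingSum(int size) {
        Deque<Double> window = new ArrayDeque<>();
        return (v, out) -> {
            window.addLast(v);
            if (window.size() > size) window.removeFirst();
            out.accept(window.stream().mapToDouble(Double::doubleValue).sum());
        };
    }

    public static void main(String[] args) {
        Operator<String, Double> parse = map(Double::parseDouble);
        Operator<Double, Double> positive = filter(v -> v > 0);
        Operator<Double, Double> sum3 = slidingSum(3);

        // Manually wired pipeline: parse -> filter -> sliding sum.
        for (String s : new String[]{"1.0", "-2.0", "3.5", "4.0", "2.5"}) {
            parse.onTuple(s, v1 -> positive.onTuple(v1, v2 -> sum3.onTuple(v2,
                    sum -> System.out.println("window sum = " + sum))));
        }
    }
}
```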
- A Taxonomy of Multistore Systems. RTPBD 2016: Final Public Workshop from LeanBigData and CoherentPaaS (DISCOTEC 2016). Carlyna Bondiombouy and Patrick Valduriez.
Building cloud data-intensive applications often requires using multiple data stores (NoSQL, HDFS, RDBMS, etc.). However, the wide diversification of data store interfaces makes it difficult to integrate data from multiple data stores. This problem has motivated the design of a new generation of systems, called multistore systems, which provide integrated or transparent access to a number of cloud data stores through one or more query languages. In this paper, we give a taxonomy of multistore systems, based on their architecture, data model, query languages and query processing techniques. To ease comparison, we divide multistore systems based on the level of coupling with the underlying data stores, i.e. loosely-coupled, tightly-coupled and hybrid.
- Satisfying Telecom and IoT big data application requirements using multiple data stores in a coherent way. RTPBD 2016: Final Public Workshop from LeanBigData and CoherentPaaS (DISCOTEC 2016). Vassilis Spitadakis, Dimitrios Bouras, Yorgos Panagiotakis, Apostolos Hatzimanikatis.
As services are moving onto the cloud and dataset volumes are exploding, data management becomes extremely demanding. Multiple data store technologies have emerged (NoSQL, in-memory, columnar) to address specific needs. When all these needs come together in a single application, big data solutions should respond efficiently. The CoherentPaaS project addresses these issues by providing a rich PaaS with a diversity of data stores and data management technologies, each optimized for particular tasks and workloads. CoherentPaaS integrates data stores and complex event processing systems with holistic transactional coherence and a common query language to enable data correlation across stores. To assess the project results, two use cases are implemented, each addressing different needs of telecoms: a price simulation application and a platform to deploy IoT services such as vehicle monitoring. The combination of different data stores and their integration with real-time stream queries is the focus of development. Benefits and achievements are measured, assessed and reported.
- An RDMA Middleware for Asynchronous Multi-Stage Shuffling in Analytical Processing. The 16th IFIP International Conference on Distributed Applications and Interoperable Systems (DAIS 2016), Heraklion (Greece). Rui C. Gonçalves, José Pereira and Ricardo Jiménez-Peris.
- The CloudMdsQL Multistore System. Proceedings of the 2016 International Conference on Management of Data (SIGMOD Conference) Boyan Kolev, Carlyna Bondiombouy, Patrick Valduriez, Ricardo Jimenez-Peris, Raquel Pau, Jose Pereira.
- Multistore Big Data Integration with CloudMdsQL. BDA 2016: Conférence sur la Gestion de Données — Principes, Technologies et Applications. Carlyna Bondiombouy, Boyan Kolev, Oleksandra Levchenko, Patrick Valduriez.
The paper “Multistore Big Data Integration with CloudMdsQL” was presented at BDA 2016, the Conference on Data Management — Principles, Technologies and Applications, which took place in Poitiers, France. BDA (Bases de Données Avancées) is a major annual event gathering top scientists from the French database community.
Multistore systems provide integrated access to multiple, heterogeneous data stores through a single query engine, typically integrating unstructured big data stored in HDFS with relational data. In this paper, we propose a functional SQL-like query language (based on CloudMdsQL) that takes full advantage of the functionality of the underlying data processing frameworks by allowing the ad-hoc usage of user defined map/filter/reduce operators in combination with traditional SQL statements.
- Tucana: Design and Implementation of a Fast and Efficient Scale-up Key-value Store. Usenix Annual Technical Conference (Usenix ATC’2016). Anastasios Papagiannis, Giorgos Saloustros, Pilar González-Férez, Angelos Bilas.
- Towards Quantifiable Eventual Consistency. CLOSER 2016: DataDiversityConvergence-Workshop on Towards Convergence of Big Data, SQL, NoSQL, NewSQL, Data streaming/CEP, OLTP and OLAP. Francisco Maia, Miguel Matos and Fabio Coelho.
In the pursuit of highly available systems, storage systems began offering eventually consistent data models. These models are suitable for a number of applications, but not for all. In this paper we discuss a system that can offer an eventually consistent data model but can also, when needed, offer a strongly consistent one.
- Towards Performance Prediction in Massive Scale Datastores. CLOSER 2016: DataDiversityConvergence-Workshop on Towards Convergence of Big Data, SQL, NoSQL, NewSQL, Data streaming/CEP, OLTP and OLAP. Francisco Cruz, Fábio Coelho and Rui Oliveira.
Buffer caching mechanisms are paramount to improve the performance of today’s massive scale NoSQL databases. In this work, we show that in fact there is a direct and univocal relationship between the resource usage and the cache hit ratio in NoSQL databases. In addition, this relationship can be leveraged to build a mechanism that is able to estimate resource usage of the nodes composing the NoSQL cluster.
- Design and Implementation of the CloudMdsQL Multistore System. CLOSER 2016: DataDiversityConvergence-Workshop on Towards Convergence of Big Data, SQL, NoSQL, NewSQL, Data streaming/CEP, OLTP and OLAP. Boyan Kolev, Carlyna Bondiombouy, Oleksandra Levchenko, Patrick Valduriez, Ricardo Jimenez-Péris, Raquel Pau, Jose Pereira.
The blooming of different cloud data management infrastructures has turned multistore systems into a major topic in today's cloud landscape. In this paper, we give an overview of the design of a Cloud Multidatastore Query Language (CloudMdsQL) and the implementation of its query engine. CloudMdsQL is a functional SQL-like language, capable of querying multiple heterogeneous data stores (relational, NoSQL, HDFS) within a single query that can contain embedded invocations to each data store’s native query interface. The major innovation is that a CloudMdsQL query can exploit the full power of local data stores, by simply allowing some local data store native queries (e.g. a breadth-first search query against a graph database) to be called as functions, and at the same time be optimized.
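As a rough illustration of the idea of embedding native data store queries as named table expressions, the snippet below holds a CloudMdsQL-style query as a plain string. The query is modeled on the style of published CloudMdsQL examples but is not guaranteed to match the exact grammar, and MultistoreClient is a hypothetical stand-in for the query engine API.

```java
/**
 * Schematic CloudMdsQL-style query, held as a plain string. The named table
 * expressions T1/T2 and the @rdb/@graphdb store references follow the style of
 * published CloudMdsQL examples but may not match the exact grammar;
 * MultistoreClient is a hypothetical stand-in for the query engine API.
 */
public class CloudMdsqlSketch {
    public static void main(String[] args) {
        String query =
                // Relational named table expression, evaluated by an RDBMS.
                "T1(id int, name string)@rdb = ( SELECT id, name FROM customers )\n" +
                // Native named table expression: the graph store runs its own
                // traversal and exposes the result as a table.
                "T2(id int, friends int)@graphdb = {* native breadth-first search here *}\n" +
                // Integration query executed by the common query engine.
                "SELECT T1.name, T2.friends FROM T1 JOIN T2 ON T1.id = T2.id";

        // Hypothetical client, shown only to indicate where the string would go:
        // MultistoreClient client = MultistoreClient.connect("localhost", 9090);
        // client.execute(query);
        System.out.println(query);
    }
}
```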
- KVFS: An HDFS library over NoSQL databases. CLOSER 2016: DataDiversityConvergence-Workshop on Towards Convergence of Big Data, SQL, NoSQL, NewSQL, Data streaming/CEP, OLTP and OLAP. Emmanouil Pavlidakis, Stelios Mavridis, Giorgos Saloustros, and Angelos Bilas.
Recently, NoSQL stores, such as HBase, have gained acceptance and popularity due to their ability to scale out and perform queries over large amounts of data. NoSQL stores typically arrange data in tables of (key, value) pairs and support a few simple operations: get, insert, delete, and scan. Despite its simplicity, this API has proven to be extremely powerful. Nowadays most data analytics frameworks utilize distributed file systems (DFS) for storing and accessing data, and HDFS has emerged as the most popular choice due to its scalability. In this paper we explore how popular NoSQL stores, such as HBase, can provide an HDFS-style scale-out file system abstraction. We show how an HDFS-compliant filesystem can be designed on top of a key-value store. We implement our design as a user-space library (KVFS) providing an HDFS filesystem over an HBase key-value store. KVFS is designed to run Hadoop-style analytics such as MapReduce, Hive, Pig and Mahout over NoSQL stores without the use of HDFS. We perform a preliminary evaluation of KVFS against a native HDFS setup using DFSIO with a varying number of threads. Our results show that the approach of providing a filesystem API over a key-value store is a promising direction: read and write throughput of KVFS and HDFS, for big and small datasets, is identical. Both HDFS and KVFS throughput is limited by the network for small datasets and by device I/O for bigger datasets.
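A minimal, self-contained sketch of the core mapping described above: file data is split into fixed-size blocks stored as (key, value) pairs, with keys derived from the path and block index so that reading a file becomes a range scan. An in-memory sorted map stands in for the NoSQL store (the real KVFS targets HBase); names and the block size are illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.TreeMap;

/** Toy file-over-key-value mapping: path + block index -> value bytes. */
public class KvfsSketch {
    private static final int BLOCK_SIZE = 4; // tiny block size for illustration
    private final TreeMap<String, byte[]> store = new TreeMap<>(); // stand-in for the KV store

    public void write(String path, byte[] data) {
        for (int block = 0; block * BLOCK_SIZE < data.length; block++) {
            int from = block * BLOCK_SIZE;
            int to = Math.min(from + BLOCK_SIZE, data.length);
            // Zero-padded block index keeps the blocks of a file sorted, so a
            // whole-file read becomes a range scan over consecutive keys.
            store.put(String.format("%s#%08d", path, block), Arrays.copyOfRange(data, from, to));
        }
    }

    public byte[] read(String path) {
        java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream();
        store.subMap(path + "#", path + "#\uffff").values().forEach(out::writeBytes);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        KvfsSketch fs = new KvfsSketch();
        fs.write("/data/file1", "hello kvfs".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(fs.read("/data/file1"), StandardCharsets.UTF_8));
    }
}
```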
- PaaS-CEP – A Query Language for Complex Event Processing and Databases. CLOSER 2016: DataDiversityConvergence-Workshop on Towards Convergence of Big Data, SQL, NoSQL, NewSQL, Data streaming/CEP, OLTP and OLAP. Ricardo Jiménez-Peris, Valerio Vianello and Marta Patiño Martinez.
Nowadays many applications must process events at a very high rate. These events are processed on the fly, without being stored. Complex Event Processing (CEP) technology is used to implement such applications. Some CEP systems, like Apache Storm, the most popular one, lack a query language and operators to program queries as is done in traditional relational databases. This paper presents PaaS-CEP, a CEP language that provides SQL-like constructs to program queries for CEP and its integration with data stores (databases or key-value stores). Our current implementation is built on top of Apache Storm; however, the language can be used with any CEP. The paper describes the architecture of PaaS-CEP, its query language and its algebraic operators. The paper also details the integration of the CEP with traditional data stores, which allows the correlation of live streaming data with stored data.
- Multistore Big Data Integration with CloudMdsQL. Transactions on Large-Scale Data and Knowledge-Centered Systems (Springer). April 2016. Carlyna Bondiombouy, Boyan Kolev, Oleksandra Levchenko, Patrick Valduriez.
- Transactional Processing for Polyglot Persistence. AINA 2016 (IEEE): 30th International Conference on Advanced Information Networking and Applications Workshops (WAINA). Ricardo Jiménez-Peris, Marta Patiño Martinez, Iván Brondino, Valerio Vianello.
- Snapshot Isolation for Neo4j. EDBT 2016: 19th International Conference on Extending Database Technology (EDBT). Marta Patiño Martínez, Diego Burgos-Sancho, Ricardo Jiménez-Peris, Iván Brondino, Valerio Vianello, Rohit Dhamane.
- Query processing in Cloud Multistore Systems: an overview. Int. Journal of Cloud Computing (Elsevier). March 2016. Carlyna Bondiombouy, Patrick Valduriez.
- Otimização do HBase para dados estruturados (HBase optimization for structured data). INForum 2015. Francisco Neves, José Pereira, Ricardo Vilaça and Rui Oliveira.
- Query Processing in Cloud Multistore Systems. BDA 2015: Conférence sur la Gestion de Données — Principes, Technologies et Applications. Carlyna Bondiombouy.
- Integrating Big Data and Relational Data with a Functional SQL-like Query Language. Springer, DEXA 2015. Carlyna Bondiombouy, Boyan Kolev, Oleksandra Levchenko, Patrick Valduriez.
- Rethinking HBase: Design and Implementation of an Elastic Key-Value Store over Log-Structured Local Volumes. IEEE, 14th International Symposium on Parallel and Distributed Computing. July 2015. Saloustros, G, Magoutis, K.
- X-Ray: Monitoring and analysis of distributed database queries. Springer, DAIS 2015: IFIP International Conference on Distributed Applications and Interoperable Systems. José Pereira and Pedro Guimarães.
- A design approach towards MVCC provision by a document data store for enabling transactional semantics. 12th European, Mediterranean & Middle Eastern Conference on Information Systems (EMCIS 2015). Pavlos Kranas, Sotiris Stamokostas, Vrettos Moulos.
- Workload-aware table splitting for NoSQL. ACM, Symposium on Applied Computing (SAC 2015). Rui Oliveira, Ricardo Vilaça, Francisco Cruz and Francisco Maia.
- CumuloNimbo: A Cloud Scalable Multi-tier SQL Database. Bulletin of the Technical Committee on Data Engineering (TCDE). Vol. 38 (1). 2015, IEEE Computer Society. Ricardo Jimenez-Peris, Marta Patiño Martinez, Bettina Kemme, Ivan Brondino, José Pereira, Ricardo Vilaça, Francisco Cruz, Rui Oliveira and Yousuf Ahmad.
- STONE: A Streaming DDoS Defense Framework. Expert Systems With Applications, 2015, Elsevier. Vincenzo Gulisano, Mar Callau Zori, Zhang Fu, Ricardo Jiménez Peris, Marina Papatriantafilou, Marta Patiño Martinez.
- CloudMdsQL: Querying Heterogeneous Cloud Data Stores with a Common Language. Distributed and Parallel Databases, 2015, Springer. Boyan Kolev, Patrick Valduriez, Carlyna Bondiombouy, Ricardo Jiménez-Peris, Raquel Pau, José Pereira.
- On the Support of Versioning in Distributed Key-Value Stores. IEEE Xplore, 33rd IEEE International Symposium on Reliable Distributed Systems (SRDS 2014). Fábio Coelho, Rui Oliveira, Miguel Matos, Ricardo Vilaça, Pierre Sutra, Valerio Schiavoni, Etienne Rivière, Marcelo Pasin and Pascal Felber.
- pH1: A Transactional Middleware for NoSQL. IEEE Xplore, 33rd IEEE International Symposium on Reliable Distributed Systems (SRDS 2014). Fábio Coelho, Rui Oliveira, Francisco Cruz, José Pereira and Ricardo Vilaça.
- Single-click to data insights: transaction replication and deployment automation made simple for the cloud age. In Proceedings of International Conference on Technics, Technologies and Education (ICTTE 2014). Yambol, Bulgaria, ISSN 1314-9474, p. 327-337. Dimitar Nedev, Niels Nes, Hannes Mühleisen, Ying Zhang, Martin Kersten.
- Using semijoin programs to solve traversal queries in graph databases. GRAph Data management Experiences and Systems (GRADES – ACM 2014). Norbert Martinez-Bazan, David Dominguez-Sal.
- Performance Evaluation of Database Replication Systems. 18th International Database Engineering & Applications Symposium (IDEAS 2014). Rohit Dhamane, Marta Patiño Martínez, Valerio Vianello, Ricardo Jiménez-Peris.
- “Complex Event Processing Based SIEM.” In De Tangil, Guillermo Suarez, and Esther Palomar. Advances in Security Information Management: Perceptions and Outcomes. Nova Science Publishers, Inc., 2013. Gulisano, Vincenzo, Ricardo Jiménez Peris, Marta Patino Martínez, Claudio Soriente, and Valerio Vianello.
- CoherentPaaS: A Coherent and Rich PaaS with a Common Programming Model. July 2016.
- Architecture for Integrated Management Information System for Trakia University of Stara Zagora. ICTTE 2014. Nedeva, V.I. – Nedev, D.G.
- Big Data Space Fungus. CIDR 2015. Kersten, M.L.
- Capturing the Laws of (Data) Nature. CIDR 2015, Jan. 2015. Mühleisen, Hannes – Kersten, M.L. – Manegold, S.
- Genome sequence analysis with MonetDB: a case study on Ebola virus diversity. DMS2015 & BTW2015, March 2015. Robin Cijvat, Stefan Manegold, Martin Kersten, Gunnar W. Klau, Alexander Schönhuth, Tobias Marschall, Ying Zhang.
- The DBMS – your Big Data Sommelier. ICDE 2015, April 2015. Kargin, Y. – Kersten, M.L. – Manegold, S. – Pirk, H.
- NUMA obliviousness through memory mapping. ACM SIGMOD Record 2015, May 2015. Gawade, M.M. – Kersten, M.L.
- GIS Navigation Boosted by Column Stores. VLDB 2015, September 2015. Alvanaki, F. – Goncalves, R. A. – Ivanova, M. – Kersten, M.L. – Kyzirakos, K.
Companies have evolved from a world where they only had SQL databases to a world where they use different kinds of data stores, such as key-value data stores, document-oriented data stores and graph databases. This scenario has raised new challenges, such as data model heterogeneity and data consistency: due to the lack of transactional consistency across data stores, failures during business actions that update data scattered across different data stores can leave that data inconsistent. In this paper we propose an ultra-scalable transactional management layer that can be integrated with any data store with multi-versioning capabilities. This layer was integrated with six different data stores: three NoSQL data stores and three SQL-like databases. We particularly focus on the ultra-scalable transaction management API and how it can be easily integrated into any versioned data store.
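A hedged sketch of the shape such a transaction-management API could take when plugged into a versioned data store: the application brackets its work in a transaction, reads against a snapshot, and the store tags committed writes with the commit timestamp. The interface and method names below are assumptions, not the actual CoherentPaaS API.

```java
/**
 * Illustrative shape of an ultra-scalable transaction-management API as described
 * in the abstract; all names are hypothetical.
 */
public interface TransactionManagerSketch {

    /** Handle carrying the snapshot (start timestamp) and the transaction identifier. */
    interface TxContext {
        long startTimestamp();   // snapshot against which the data store resolves reads
        long transactionId();
    }

    /** Start a transaction and obtain the snapshot to read against. */
    TxContext begin();

    /**
     * Attempt to commit: the manager detects write-write conflicts and, on success,
     * returns the commit timestamp the data store attaches to the new versions.
     */
    long commit(TxContext tx) throws ConflictException;

    /** Discard the transaction's tentative versions. */
    void abort(TxContext tx);

    class ConflictException extends Exception {}
}
```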
In this paper we present the Real-Time Network Performance Analysis in a Telco Environment use case of the CoherentPaaS project. It is based on the Altice Labs product Altaia, which aims to detect network problems before any degradation or unavailability of services occurs, by actively supervising the network. However, monitoring the whole network implies analyzing large amounts of data in real time, and the current solution does not provide the required degree of scalability. The plan is to use CoherentPaaS to provide a rich Platform-as-a-Service that supports several data stores accessible via a uniform programming model and language, and complying with demanding delay, throughput and data volume requirements. All the required transformations are presented, as well as the expected associated benefits.
The last decade has seen an exponential explosion of user data generated over the internet. Traditional data management solutions such as relational databases are simply not able to process this large amount of data in a reasonable time. A need for highly scalable data management tools emerged: cloud data stores. These technologies are able to process petabytes of data, but with an important trade-off: the lack of transactional consistency. The scenario becomes even more complex for those applications whose building blocks sit on top of a hybrid data store ecosystem. This work presents a novel protocol that provides transaction semantics on top of heterogeneous data stores, transparently to applications.
A key component in large scale distributed analytical processing is shuffling, the distribution of data to multiple nodes such that the computation can be done in parallel. In this paper we describe the design and implementation of a communication middleware to support data shuffling for executing multi-stage analytical processing operations in parallel. The middleware relies on RDMA (Remote Direct Memory Access) to provide basic operations to asynchronously exchange data among multiple machines. Experimental results show that the RDMA-based middleware developed can provide a 75% reduction in the cost of communication operations on parallel analytical processing tasks, when compared with a sockets-based middleware.
The blooming of different cloud data management infrastructures has turned multistore systems into a major topic in today's cloud landscape. In this demonstration, we present a Cloud Multidatastore Query Language (CloudMdsQL), and its query engine. CloudMdsQL is a functional SQL-like language, capable of querying multiple heterogeneous data stores (relational and NoSQL) within a single query that may contain embedded invocations to each data store’s native query interface. The major innovation is that a CloudMdsQL query can exploit the full power of local data stores, by simply allowing some local data store native queries (e.g. a breadth-first search query against a graph database) to be called as functions, and at the same time be optimized. Within our demonstration, we focus on two use cases, each involving four diverse data stores (graph, document, relational, and key-value) with their corresponding CloudMdsQL queries. The query execution flows are visualized by an embedded real-time monitoring subsystem. The users can also try out different ad-hoc queries, not necessarily in the context of the use cases.
Given current technology trends towards fast storage devices and the need for increasing data processing density, it is important to examine key-value store designs that reduce CPU overhead. However, current key-value stores are still designed mostly for hard disk drives (HDDs) that exhibit a large difference between sequential and random access performance, and they incur high CPU overheads. In this paper we present Tucana, a feature-rich key-value store that achieves low CPU overhead. Our design starts from a Bε-tree approach to maintain asymptotic properties for inserts and uses three techniques to reduce overheads: copy-on-write, private allocation, and direct device management. In our design we favor choices that reduce overheads compared to sequential device accesses and large I/Os. We evaluate our approach against RocksDB, a state-of-the-art key-value store, and show that our approach improves CPU efficiency by up to 9.2x and an average of 6x across all workloads we examine. In addition, Tucana improves throughput compared to RocksDB by up to 7x. Then, we use Tucana to replace the storage engine of HBase and compare it to native HBase and Cassandra, two of the most popular NoSQL stores. Our results show that Tucana outperforms HBase by up to 8x in CPU efficiency and by up to 10x in throughput. Tucana’s improvements are even higher when compared to Cassandra.
Multistore systems have been recently proposed to provide integrated access to multiple, heterogeneous data stores through a single query engine. In particular, much attention is being paid to the integration of unstructured big data typically stored in HDFS with relational data. One main solution is to use a relational query engine that allows SQL-like queries to retrieve data from HDFS, which requires the system to provide a relational view of the unstructured data and hence is not always feasible. In this paper, we propose a functional SQL-like query language (based on CloudMdsQL) that can integrate data retrieved from different data stores, to take full advantage of the functionality of the underlying data processing frameworks by allowing the ad-hoc usage of user defined map/filter/reduce operators in combination with traditional SQL statements. Furthermore, our solution allows for optimization by enabling subquery rewriting so that bind join can be used and filter conditions can be pushed down and applied by the data processing framework as early as possible. We validate our approach through an implementation and experiments with three data stores and representative queries. The experimental results demonstrate the usability of the query language and the benefits from query optimization.
NoSQL data stores have emerged in recent years as a solution to provide scalability and flexibility in data modelling for operational databases. These data stores have proven that they are better suited for some kinds of problems than relational databases. In order to scale, they relaxed the properties provided by relational databases, mainly transactions. However, transactional semantics are still needed by most applications. In this paper we describe how CoherentPaaS provides scalable holistic transactions across SQL and NoSQL data stores, such as document-oriented data stores, key-value data stores and graph databases.
NoSQL data stores are becoming more and more popular, and graph databases are one such kind of data store. Neo4j is a very popular graph database. In Neo4j, all operations that access a graph must be performed in a transaction. Transactions in Neo4j use the read-committed isolation level; higher isolation levels are not available. In this paper we present an overview of the implementation of snapshot isolation (SI) for Neo4j. SI provides stronger guarantees than read-committed and provides more concurrency than serializability.
Building cloud data-intensive applications often requires using multiple data stores (NoSQL, HDFS, RDBMS, etc.), each optimized for one kind of data and tasks. However, the wide diversification of data store interfaces makes it difficult to access and integrate data from multiple data stores. This important problem has motivated the design of a new generation of systems, called multistore systems, which provide integrated or transparent access to a number of cloud data stores through one or more query languages. In this paper, we give an overview of query processing in multistore systems. We start by introducing the recent cloud data management solutions and query processing in multidatabase systems. Then, we describe and analyze some representative multistore systems, based on their architecture, data model, query languages and query processing techniques. To ease comparison, we divide multistore systems based on the level of coupling with the underlying data stores, i.e. loosely-coupled, tightly-coupled and hybrid. Our analysis reveals some important trends, which we discuss. We also identify some major research issues.
The ability of NoSQL systems to scale better than traditional relational databases motivates a large set of applications to migrate their data to NoSQL systems, even without aiming to exploit the provided schema flexibility. However, accessing structured data is costly due to such flexibility, incurring significant bandwidth and processing overhead. In this paper, we analyse this cost in Apache HBase and propose a new scan operation, named Prepared Scan, that optimizes the access to data structured in a regular manner by taking advantage of a schema known to the application. Using an industry standard benchmark, we show that Prepared Scan improves throughput and decreases network bandwidth consumption.
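For context, the sketch below shows a standard HBase client scan over a single column, the operation Prepared Scan is proposed to optimize. Only the plain scan is shown, since the Prepared Scan API itself is not part of the stock client; table and column names are invented.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseScanExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("usertable"))) {
            // Plain scan restricted to one column; Prepared Scan (the paper's
            // contribution) would additionally exploit the known row structure
            // to avoid shipping per-cell metadata.
            Scan scan = new Scan();
            scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("field0"));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    byte[] value = row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("field0"));
                    System.out.println(Bytes.toString(row.getRow()) + " = " + Bytes.toString(value));
                }
            }
        }
    }
}
```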
In the context of the cloud, we are witnessing a proliferation of data management solutions, including NoSQL data stores (e.g. HBase, MongoDB, Neo4j), file systems (e.g. HDFS) and parallel processing frameworks (e.g. MapReduce, Spark). This makes it very hard for a user to access and analyze efficiently her data sitting in different data stores, e.g. RDBMS, NoSQL and HDFS. Processing queries against heterogeneous data sources has long been studied in the context of multidatabase systems and data integration systems. However, these solutions no longer apply in the cloud, primarily because of the wide variety of models and languages of cloud data stores (key-value, document, table, graph, etc.). To address this problem, multistore systems have been recently proposed to provide integrated access to multiple, heterogeneous data stores through a single query engine. However, most multistore systems trade data store autonomy for performance and work only for certain categories of data stores, typically with RDBMS. In the CoherentPaaS project, we are developing the Cloud Multidatastore Query Language (CloudMdsQL). CloudMdsQL is a functional SQL-like language, capable of querying multiple heterogeneous data stores within a single query that may contain embedded invocations to each data store’s native query interface. The query engine has a fully distributed architecture, which allows query engine nodes to be collocated with data store nodes in a computer cluster. Compared to current multistore systems, CloudMdsQL is more general as it can be used to access any kind of data store, while respecting its autonomy. The major innovation of CloudMdsQL is that a query can exploit the full power of the local data stores, by simply allowing some local data store native queries (e.g. a breadth-first search query against a graph database) to be called as functions, integrated in SQL-like statements.
Multistore systems have been recently proposed to provide integrated access to multiple, heterogeneous data stores through a single query engine. In particular, much attention is being paid to the integration of unstructured big data typically stored in HDFS with relational data. One main solution is to use a relational query engine that allows SQL-like queries to retrieve data from HDFS, which requires the system to provide a relational view of the unstructured data and hence is not always feasible. In this paper, we introduce a functional SQL-like query language that can integrate data retrieved from different data stores and take full advantage of the functionality of the underlying data processing frameworks by allowing the ad hoc usage of user defined map/filter/reduce operators in combination with traditional SQL statements. Furthermore, the query language allows for optimization by enabling subquery rewriting so that filter conditions can be pushed inside and executed at the data store as early as possible. Our approach is validated with two data stores and a representative query that demonstrates the usability of the query language and evaluates the benefits from query optimization.
HBase is a prominent NoSQL system used widely in the domain of big data storage and analysis. It is structured as two layers: a lower-level distributed file system (HDFS) supporting a higher-level layer responsible for data distribution, indexing, and elasticity. Layered systems have on many occasions proven to suffer from overheads due to the isolation between layers, and HBase is increasingly seen as an instance of this. To overcome this problem we designed, implemented, and evaluated HBase-BDB, an alternative to HBase that replaces the HDFS store with a thinner layer of a log-structured B+ tree key-value store (Berkeley DB) operating over local volumes. We show that HBase-BDB overcomes HBase’s performance bottlenecks (while retaining compatibility with HBase applications) without losing on elasticity features. We evaluate the performance of HBase and HBase-BDB using the Yahoo! Cloud Serving Benchmark (YCSB) and online transaction processing (OLTP) workloads on a commercial public cloud provider. We find that HBase-BDB outperforms a tuned HBase configuration by up to 85% under a write-intensive workload due to HBase-BDB’s reduced background-write activity. HBase-BDB’s novel elasticity mechanisms operating over local volumes are shown to be as performant as HBase’s equivalent features when stress-tested under TPC-C workloads.
The integration of multiple database technologies, including both SQL and NoSQL, allows using the best tool for each aspect of a complex problem and is increasingly sought in practice. Unfortunately, this makes it difficult for database developers and administrators to obtain a clear view of the resulting composite data processing paths, as they combine operations chosen by different query optimisers, implemented by different software packages, and partitioned across distributed systems. This work addresses this challenge with the X-Ray framework, which allows monitoring code to be added to a Java-based distributed system by manipulating its bytecode at runtime. The resulting information is collected in a NoSQL database and then processed to visualise data processing paths as required for optimising integrated database systems. This proposal is demonstrated with a distributed query over a federation of Apache Derby database servers and its performance is evaluated with the standard TPC-C benchmark workload.
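The following minimal sketch shows the standard java.lang.instrument mechanism by which monitoring code can be added to a Java system by manipulating bytecode at load time, which is the general technique the abstract refers to. X-Ray's own transformer logic is not reproduced; the package filter is an invented example.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

/**
 * Minimal Java agent skeleton: registered with -javaagent:agent.jar, it sees the
 * bytecode of every loaded class and may rewrite it (a framework like X-Ray would
 * insert its monitoring probes at this point; here we only log class names).
 */
public class MonitoringAgent {
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new ClassFileTransformer() {
            @Override
            public byte[] transform(ClassLoader loader, String className,
                                    Class<?> classBeingRedefined,
                                    ProtectionDomain protectionDomain,
                                    byte[] classfileBuffer) {
                // Invented package filter, purely for illustration.
                if (className != null && className.startsWith("org/apache/derby/")) {
                    System.out.println("candidate for instrumentation: " + className);
                }
                // Returning null keeps the original bytecode unchanged.
                return null;
            }
        });
    }
}
```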
Recent challenges in cloud applications have made the use of NoSQL data stores more popular than ever. These data stores, however, lack the important guarantees that traditional relational databases offer: ACID properties and transactional semantics. Implementing ACID transactions has been a longstanding challenge for NoSQL systems. Because these systems are based on a sharded architecture, transactions necessarily require coordination across multiple servers. In this work, we propose an extended version of the popular MongoDB data store, which provides transactional semantics and ensures ACID compatibility while at the same time maintaining all of its non-functional characteristics. This is achieved via its integration with an external conflict resolution service, which is responsible for identifying conflicts between modified documents of concurrent transactions and providing ACID properties. MongoDB’s server-side core, however, remains identical to the native version, thus preserving the high performance and scalability properties that make it so popular. Moreover, having identified that our proposition cannot be exploited by distributed cloud applications that need to span a transaction across different nodes, we propose an alternative solution that can be used instead, which can be scaled across layers so as to meet the required non-functional requirements, such as throughput and latency.
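A hedged sketch of what the described usage pattern could look like from application code: the standard MongoDB Java driver (3.x-style API assumed) performs the document operations, while a hypothetical transactional wrapper brackets them and coordinates with the external conflict-resolution service. The wrapper class and its methods are assumptions and are left commented out; only the driver calls are real API.

```java
import com.mongodb.MongoClient;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

public class TransactionalMongoSketch {
    public static void main(String[] args) {
        // Standard MongoDB Java driver calls (real API, 3.x assumed).
        MongoClient mongo = new MongoClient("localhost", 27017);
        MongoDatabase db = mongo.getDatabase("shop");
        MongoCollection<Document> orders = db.getCollection("orders");

        // Hypothetical transactional wrapper standing in for the extended client
        // described in the paper; begin/commit would coordinate with the external
        // conflict-resolution service to provide snapshot isolation.
        // TxMongoClient tx = TxMongoClient.wrap(mongo);
        // tx.begin();
        orders.insertOne(new Document("item", "router").append("qty", 2));
        orders.updateOne(new Document("item", "switch"),
                new Document("$inc", new Document("qty", -1)));
        // tx.commit(); // conflicts on concurrently modified documents would abort here

        mongo.close();
    }
}
```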
Massive scale data stores, which exhibit highly desirable scalability and availability properties, are becoming pivotal systems in today's infrastructures. The scalability achieved by these data stores is anchored on data independence: there is no clear relationship between data items, and atomic inter-node operations are not a concern. This assumption over data allows aggressive data partitioning. In particular, data tables are horizontally partitioned and spread across nodes for load balancing. However, in current versions of these data stores, partitioning is either a manual process or automated but based simply on table size. We argue that size-based partitioning does not lead to acceptable load balancing, as it ignores data access patterns, namely data hotspots. Moreover, manual data partitioning is cumbersome and typically infeasible in large scale scenarios. In this paper we propose an automated table splitting mechanism that takes into account the system workload. We evaluate this mechanism, showing that it is simple, non-intrusive and effective.
This article presents an overview of the CumuloNimbo platform. CumuloNimbo is a framework for multi-tier applications that provides scalable and fault-tolerant processing of OLTP workloads. The main novelty of CumuloNimbo is that it provides a standard SQL interface and full transactional support without resorting to sharding and without needing to know the workload in advance. Scalability is achieved by distributing request execution and transaction control across many compute nodes while data is persisted in a distributed data store. In this paper we present an overview of the platform.
Distributed Denial-of-Service (DDoS) attacks aim at rapidly exhausting the communication and computational power of a network target by flooding it with large volumes of malicious traffic. In order to be effective, a DDoS defense mechanism should detect and mitigate threats quickly, while allowing legitimate users access to the attack’s target. Nevertheless, defense mechanisms proposed in the literature tend not to address detection and mitigation challenges jointly, but rather focus solely on the detection or the mitigation facet. At the same time, they usually overlook the limitations of centralized defense frameworks that, when deployed physically close to a possible target, become ineffective if DDoS attacks are able to saturate the target’s incoming links. This paper presents STONE, a framework with expert system functionality that provides effective and joint DDoS detection and mitigation. STONE characterizes the regular network traffic of a service by aggregating it into common prefixes of IP addresses, and detects attacks when the aggregated traffic deviates from the regular profile. Upon detection of an attack, STONE allows traffic from known sources to access the service while discarding suspicious traffic. STONE relies on the data streaming processing paradigm in order to characterize and detect anomalies in real time. We implemented STONE on top of StreamCloud, an elastic and parallel-distributed stream processing engine. The evaluation, conducted on real network traces, shows that STONE detects DDoS attacks rapidly, causes minimal degradation of legitimate traffic while mitigating a threat, and also exhibits a processing throughput that scales linearly with the number of nodes used to deploy and run it.
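A self-contained toy version of the aggregation step described above: traffic is grouped by IP prefix and a source is flagged when its observed rate deviates strongly from the learned baseline for its prefix. The /16 grouping, the threshold factor and all names are illustrative, not STONE's actual parameters.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy prefix-based traffic profiling in the spirit of the described detection step. */
public class PrefixProfileSketch {
    private final Map<String, Double> baselinePacketsPerSec = new HashMap<>();

    /** Group source addresses by /16 prefix, e.g. "203.0.113.7" -> "203.0". */
    static String prefix16(String ip) {
        String[] octets = ip.split("\\.");
        return octets[0] + "." + octets[1];
    }

    void learn(String prefix, double rate) {
        baselinePacketsPerSec.put(prefix, rate);
    }

    /** Flag the source if the observed rate exceeds its prefix baseline by a chosen factor. */
    boolean isSuspicious(String srcIp, double observedRate) {
        double baseline = baselinePacketsPerSec.getOrDefault(prefix16(srcIp), 1.0);
        return observedRate > 5.0 * baseline; // illustrative deviation threshold
    }

    public static void main(String[] args) {
        PrefixProfileSketch detector = new PrefixProfileSketch();
        detector.learn("203.0", 100.0);
        System.out.println(detector.isSuspicious("203.0.113.7", 900.0)); // true: mitigate
        System.out.println(detector.isSuspicious("203.0.113.7", 120.0)); // false: legitimate
    }
}
```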
The blooming of different cloud data management infrastructures, specialized for different kinds of data and tasks, has led to a wide diversification of DBMS interfaces and the loss of a common programming paradigm. In this paper, we present the design of a cloud multidatastore query language (CloudMdsQL), and its query engine. CloudMdsQL is a functional SQL-like language, capable of querying multiple heterogeneous data stores (relational and NoSQL) within a single query that may contain embedded invocations to each data store’s native query interface. The query engine has a fully distributed architecture, which provides important opportunities for optimization. The major innovation is that a CloudMdsQL query can exploit the full power of local data stores, by simply allowing some local data store native queries (e.g. a breadth-first search query against a graph database) to be called as functions, and at the same time be optimized, e.g. by pushing down select predicates, using bind join, performing join ordering, or planning intermediate data shipping. Our experimental validation, with three data stores (graph, document and relational) and representative queries, shows that CloudMdsQL satisfies the five important requirements for a cloud multidatastore query language.
The ability to access and query data stored in multiple versions is an important asset for many applications, such as Web graph analysis, collaborative editing platforms, data forensics, or correlation mining. The storage and retrieval of versioned data requires a specific API and support from the storage layer. The choice of the data structures used to maintain versioned data has a fundamental impact on the performance of insertions and queries. The appropriate data structure also depends on the nature of the versioned data and the nature of the access patterns. In this paper we study the design and implementation space for providing versioning support on top of a distributed key-value store (KVS). We define an API for versioned data access supporting multiple writers and show that a plain KVS does not offer the necessary synchronization power for implementing this API. We leverage the support for listeners at the KVS level and propose a general construction for implementing arbitrary types of data structures for storing and querying versioned data. We explore the design space of versioned data storage ranging from a flat data structure to a distributed sharded index. The resulting system, ALEPH, is implemented on top of an industrial-grade open-source KVS, Infinispan. Our evaluation, based on real-world Wikipedia access logs, studies the performance of each versioning mechanism in terms of load balancing, latency and storage overhead in the context of different access scenarios.
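As a hedged illustration of what a versioned-access API of this kind could look like, the interface below sketches multi-version put/get and history operations. These names are assumptions and do not reproduce the ALEPH interface.

```java
import java.util.List;

/**
 * Illustrative versioned key-value API in the spirit of the abstract; all names
 * are hypothetical, not the ALEPH interface.
 */
public interface VersionedKvSketch<K, V> {

    /** Append a new version for the key; returns the version number assigned. */
    long put(K key, V value);

    /** Read the latest version of the key, or null if absent. */
    V getLatest(K key);

    /** Read the value as of a specific version, or null if no version <= version exists. */
    V getAt(K key, long version);

    /** Range query over the version history of a key, oldest first. */
    List<V> history(K key, long fromVersion, long toVersion);
}
```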
NoSQL databases opt not to offer important abstractions traditionally found in relational databases in order to achieve high levels of scalability and availability: transactional guarantees and strong data consistency. In this work we propose pH1, a generic middleware layer over NoSQL databases that offers transactional guarantees with Snapshot Isolation. This is achieved in a non-intrusive manner, requiring no modifications to servers and no native support for multiple versions. Instead, the transactional context is achieved by means of a multiversion distributed cache and an external transaction certifier, exposed by extending the client’s interface with transaction bracketing primitives. We validate and evaluate pH1 with Apache Cassandra and HyperDex. First, using the YCSB benchmark, we show that the cost of providing ACID guarantees to these NoSQL databases amounts to an 11% decrease in throughput. Moreover, using the transaction-intensive TPC-C workload, pH1 showed a 22% decrease in throughput. This contrasts with OMID, a previous proposal that takes advantage of HBase’s support for multiple versions, which incurs a throughput penalty of 76% under the same conditions.
In this report we present our initial work on making the MonetDB column-store analytical database ready for cloud deployment. As we stand in the new space between research and industry, we have tried to combine approaches from both worlds. We provide details on how we utilize modern technologies and tools for automating the building of virtual machine images for cloud, datacentre and desktop use. We also explain our solution for asynchronous transaction replication in MonetDB. The report concludes with how this all ties together with our efforts to make MonetDB ready for the age where high-performance data analytics is available in a single click.
Graph data processing is gaining popularity and new solutions are appearing to analyze graphs efficiently. In this paper, we present the prototype for the new query engine of the Sparksee graph database, which is based on an algebra of operations on sets of key-value pairs. The new engine combines some regular relational database operations with some extensions oriented to collection processing and complex graph queries. We study the query plans of graph queries expressed in the new algebra, and find that most graph operations can be efficiently expressed as semijoin programs.
One of the most demanding needs in cloud computing is that of having scalable and highly available databases. One way to address these needs is to leverage the scalable replication techniques developed in the last decade, which increase both the availability and scalability of databases. Many replication protocols have been proposed during the last decade; the main research challenge was how to scale under the eager replication model, the one that provides consistency across replicas. In this paper, we examine three eager database replication systems available today, Middle-R, C-JDBC and MySQL Cluster, using the TPC-W benchmark. We analyze their architectures and replication protocols, and compare their performance both in the absence of failures and when there are failures.
Book Chapters
Additional Dissemination Assets