Efficient RDF Interchange (ERI) Format for RDF Data Streams

This page bundles the contents of the article submitted to ISWC'14, which is currently under review. Its purpose is to make the inputs and outputs of the analysis accessible, and to link and describe them. They will be stored as a Research Object (pack).

Abstract

RDF streams are sequences of timestamped RDF statements or graphs, which can be generated by several types of data sources (sensors, social networks, etc.). They may provide data at high volumes and rates, and be consumed by applications that require real-time responses. Hence it is important to publish and interchange them efficiently. In this paper, we exploit a key feature of RDF data streams, which is the regularity of their structure and data values, proposing a compressed, efficient RDF interchange (ERI) format, which can reduce the amount of data transmitted when processing RDF streams. Our experimental evaluation shows that our format achieves significant space savings w.r.t. standard data streaming compression, remaining efficient in performance.

Inputs of the evaluation

The input of the analysis consists of 16 datasets, selected on the basis of their number of triples, topic coverage, availability and, where possible, previous use in benchmarking. We define three categories of datasets: streaming (10), statistics (3) and general (3).

Streaming datasets are our main application focus. They consist of:



Statistical datasets, using the RDF Data Cube Vocabulary, are the prototypical example of other (non-streaming) data presenting clear regularities that ERI can take advantage of:

  • Eurostat_migr_reschange, population statistics from Eurostat-Linked Data (accessible here, original source)
  • Eurostat_tour_cap_nuts3, tourism statistics from Eurostat-Linked Data (accessible here, original source)
  • Eurostat_avia_paexac, transport statistics from Eurostat-Linked Data (accessible here, original source)
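The kind of structural regularity mentioned above can be illustrated with a small check over N-Triples data. The sketch below is not the ERI encoding, which this page does not describe. It only counts how many distinct predicate sets the subjects of a dataset use, assuming one well-formed triple per line. The function names and the example IRIs are hypothetical.

```python
from collections import defaultdict


def predicate_signatures(ntriples_lines):
    """Map each subject to the sorted tuple of predicates it appears with."""
    preds = defaultdict(set)
    for line in ntriples_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Subjects and predicates contain no whitespace in N-Triples,
        # so the first two whitespace-separated tokens identify them.
        subject, predicate, _rest = line.split(None, 2)
        preds[subject].add(predicate)
    return {s: tuple(sorted(p)) for s, p in preds.items()}


def regularity(ntriples_lines):
    """Return (number of subjects, number of distinct predicate sets)."""
    signatures = predicate_signatures(ntriples_lines)
    return len(signatures), len(set(signatures.values()))


# Hypothetical sample: two observations sharing the same structure.
sample = [
    '<http://example.org/obs1> <http://example.org/value> "1" .',
    '<http://example.org/obs1> <http://example.org/time> "t1" .',
    '<http://example.org/obs2> <http://example.org/value> "2" .',
    '<http://example.org/obs2> <http://example.org/time> "t2" .',
]
print(regularity(sample))
```

A dataset in which many subjects share few predicate sets, as in the sample, is regular in the sense used here.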



Finally, we experiment with general static datasets, without prior assumptions on data regularities:



MODIFICATIONS

We convert each dataset to N-Triples by means of the Any23 0.9.0 tool. LOD_Nevada, LOD_Charley and LOD_Katrina result from appending their related Turtle files by sampling date.
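The appending step can be sketched as follows. This is a minimal illustration, not the script actually used. It assumes the files have already been converted to N-Triples (a line-based format, so plain concatenation is safe) and that each file name contains its sampling date in ISO form (YYYY-MM-DD). The page states neither the order of conversion and appending nor the file naming scheme, so the file names and helper names below are hypothetical.

```python
import re
from pathlib import Path

# Assumption: the sampling date appears in the file name as YYYY-MM-DD.
DATE_RE = re.compile(r"(\d{4}-\d{2}-\d{2})")


def order_by_sampling_date(paths):
    """Return the paths sorted by the ISO date embedded in each file name."""
    def key(path):
        match = DATE_RE.search(Path(path).name)
        if match is None:
            raise ValueError(f"no sampling date in file name: {path}")
        return match.group(1)  # ISO dates sort correctly as strings
    return sorted(paths, key=key)


def append_files(paths, out_path):
    """Concatenate the files into out_path in sampling-date order."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in order_by_sampling_date(paths):
            text = Path(path).read_text(encoding="utf-8")
            out.write(text)
            if text and not text.endswith("\n"):
                out.write("\n")  # keep one triple per line across files
```

For example, `append_files(["katrina_2005-08-29.nt", "katrina_2005-08-27.nt"], "LOD_Katrina.nt")` would write the earlier sample first, whatever the order of the input list.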

LICENSE

All datasets are freely provided by the aforementioned data sources. In general, datasets are licensed under the terms of the Creative Commons Attribution-ShareAlike License and the GNU Free Documentation License.


Eurostat data is available under the original Eurostat license.

Source Code

The source code of the current prototype is accessible here.

About the authors

Javier D. Fernández Ontology Engineering Group (OEG), Univ. Politécnica de Madrid (Spain); jdfernandez@fi.upm.es
Alejandro Llaves Ontology Engineering Group (OEG), Univ. Politécnica de Madrid (Spain); allaves@fi.upm.es
Óscar Corcho Ontology Engineering Group (OEG), Univ. Politécnica de Madrid (Spain); ocorcho@fi.upm.es

Acknowledgements

The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 257641, PlanetData network of excellence. We are thankful for discussions with the authors of the RDSZ approach, especially Norberto Fernández (Universidad Carlos III de Madrid).