RKBExplorer is an RDF aggregation and search interface developed by the University of Southampton (Soton) for an EU project; it includes 3Store by default.
It was used for an experimental load of BM data, and one of the queries performed disastrously; see [OntoCollab:RS Performance].
A couple papers on RKBExplorer:
- RKBExplorer- Application and Infrastructure.pdf
- RKBExplorer- Repositories, Linked Data and Research Support.pptx
- Vlado asked what repository was used for test loading of BM data and for performance testing, and whether they have evaluated various triple stores (the report only had a title page)
- Dominic replied: test load is in 3Store (using RKBexplorer); uploaded the full eval report
- The report reviews available RDF stores and their applicability to ResearchSpace.
- by Manuel Salvadores (and/or Hugh Glaser?) of University of Southampton, for Seme4, for British Museum
- Dated 26.1.2011
- Long list: 4store, BigData, BigOWLIM, Talis platform (hosted repository), Oracle, AllegroGraph, Jena TDB, Mulgara
- Short list: determined on p.10 based on "priorities: (1) standards (2) supported architectures (3) Projects and (4) performance. Supporting SPARQL 1.0 and part of SPARQL 1.1 is the highest priority. After this, we consider supported architectures. The other two aspects most valued in this ranking are the numbers of Semantic Projects where it has been used, and performance."
- "According to this Virtuoso and 4store are considered the main candidates that are going to be evaluated in Section 5."
- "two other solutions are also potential candidates that could be taken into account in a more exhaustive evaluation, these are: Jena/TDB and BigData"
"I don't think it is necessary that the RDF database system has to be open source. As you say, there should be flexibility built-in to accommodate multiple systems. But such flexibility would require a clean and standard interface. Are RDF systems mature enough to define in your requirements what such an interface should look like? What we want to avoid of course is lock in to some kind of proprietary and expensive backend." – Project Officer, Don Waters, at the Mellon Foundation
– So open source is not critical, but portability is.
- Since all stores use a standard data representation (RDF), two other things should be considered:
- Query language: considered stores should support SPARQL 1.1 because of named graph and update requirements
- Reasoning: first the reasoning features required by ResearchSpace should be analyzed, then stores that support those features completely should be identified
- "the way the application is designed and how it interacts with the backend will be affected depending on the type of inference engine selected. When reasoning becomes an important requirement in RS then one should assume that portability between stores will not be as easy. Commercial tools tend to provide more complex and the fastest reasoning, i.e.: BigOWLIM and AllegroGraph. As soon an application is built with one of these then it will be hard to move to another that offers the same spectrum of features"
- this is a shallow, weak and subjective report. A strong case can be made that OWLIM is a better choice.
- No performance testing is done at this stage
- The report includes unverified assumptions: in fact OWLIM does support SPARQL 1.1, is platform independent, and offers leading performance and complete reasoning
- Many viable triple stores are excluded from evaluation (even future evaluation) based on completely subjective and unverified assumptions
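To make concrete what "SPARQL 1.1 support" means throughout this discussion, here are two illustrative queries exercising the features at stake (named-graph updates and aggregates). The graph URI and vocabulary (example.org) are invented for illustration only; the Python wrapper just holds the query strings.

```python
# Illustrative SPARQL 1.1 queries held as plain strings. The graph URI and
# vocabulary (example.org) are invented for illustration only.

# SPARQL 1.1 Update: rewrite a triple inside one specific named graph.
update_query = """
WITH <http://example.org/graphs/bm-project>
DELETE { ?obj <http://example.org/label> ?old }
INSERT { ?obj <http://example.org/label> "Rosetta Stone" }
WHERE  { ?obj <http://example.org/label> ?old }
"""

# SPARQL 1.1 aggregates: count objects per named graph.
aggregate_query = """
SELECT ?g (COUNT(?obj) AS ?n)
WHERE  { GRAPH ?g { ?obj a <http://example.org/MuseumObject> } }
GROUP BY ?g
"""

# A store that only "partially supports" SPARQL 1.1 may reject either query.
print("update uses:", [k for k in ("WITH", "DELETE", "INSERT") if k in update_query])
print("query uses:", [k for k in ("COUNT", "GROUP BY") if k in aggregate_query])
```

Both constructs are standard SPARQL 1.1; stores that offer them only "through extensions" or "partially" force non-portable workarounds in the application.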
The selected contenders have serious problems in the selected priority areas:
- Virtuoso does not support SPARQL 1.1 properly:
Report p.13: "SPARQL 1.1: Partially, through Virtuoso Extensions - even though some of the functionalities seem to not have standard syntaxes."
- Virtuoso's incremental updates still don't work properly:
BSBM v3 Explore and Update Results:
"Benchmark Results for the Explore-And-Update Use Case: ... Virtuoso is not listed ...When executing the update mix on Virtuoso, we ran into technical problems and are still working on solving them together with the Openlink team."
These "technical problems" are not a small matter to be dismissed, considering that BSBM Benchmark Version 3 is funded through the LOD2 project (http://lod2.eu), Openlink is a partner in LOD2, and the BSBM developers (BerlinU) were very cooperative with the triple-store developers.
- 4store supports only basic SPARQL 1.1:
Report p.18: "partially supports INSERT, DELETE, aggregates and group by."
The report cites an implementation roadmap (ref. ), but the roadmap itself says: "vague roadmap, with no timescales is something like: ...", which represents hardly any commitment.
The 4store documentation gives no hint of additional SPARQL 1.1 features
- 4store has problems with multi-client queries (!?)
BSBM v3 Results sec 6.2 Query Mixes per Hour for Multiple Clients:
"4store: * We ran into technical problems while testing with multiple clients."
Report p.9 says that OWLIM supports only SPARQL 1.0, which is not true: OWLIM supports SPARQL 1.1
- through Jena ARQ/Joseki since version 3.4 (Nov 2010)
- through Sesame (Ontotext contributions) since version 4.0 (beta in Jun 2011), for faster performance.
Just search for "1.1" on the ontotext.com website:
- Nov 25, 2010: Jena adapter (BETA): Jena's ARQ engine allows BigOWLIM to handle the latest SPARQL 1.1 extensions, e.g. aggregates. The adapter is still a beta version and has not been rigorously tested for conformance yet, but can be used with Joseki to make queries and has successfully passed BSBM and LUBM benchmarks.
- May 20, 2011: Ontotext contributes SPARQL 1.1 Query support to latest Sesame release
- June 6, 2011: Release of OWLIM version 4.0 (beta). While handling of SPARQL 1.1 through Jena was fast enough to allow OWLIM to get excellent scores in the latest BSBM evaluation, using SPARQL 1.1 through Sesame makes it even faster and available in all editions of OWLIM.
- BSBM v3 Results
- Feb 25, 2011: BSBM benchmark is extended with two new scenarios (query mixes): "Explore and Update" and "Business Intelligence". Those are based on the new features in SPARQL 1.1 (update and aggregates). The most recent BSBM results report on the performance of the leading semantic repositories (BigOWLIM, Virtuoso, 4store, BigData and Jena TDB) on 100M and 200M datasets.
BigOWLIM was the top performer on:
- loading time - BigOWLIM loads the 100M and the 200M datasets almost twice as fast as the next best product;
- query performance among those repositories that can handle update and multi-client query tasks.
- PA evaluation: the Press Association project for commercial image retrieval and browsing evaluated AllegroGraph, BigOWLIM, Oracle, Sesame, Jena TDB and Virtuoso:
"In our tests, BigOWLIM provides the best average query response time and answers the maximum number of queries for both the datasets ... it is clear to see that execution speed-wise BigOWLIM outperforms AllegroGraph and Sesame for almost all of the dataset queries."
Reasoning features and completeness: partially treated on p.21 of the RS RDF report, since the reasoning features required by RS are not yet defined.
One of the CIDOC CRM articles (A Repository for 3D Model Production and Interpretation in Culture and Beyond) states the need for OWL2 RL reasoning clearly:
- "Metadata Repository search should be enhanced by reasoning and rules concerning deductions from the metadata. For the time being we are considering as reasoning platform OWLIM ... Adding reasoning to our semantic database will make queries more effective and simple. Further, the enforcement of constraints and other rules on the metadata content at ingest and update time will ensure the consistency and correctness of the information described in the semantic web."
Indeed, the CIDOC CRM Search model (New Framework for Querying Semantic Networks - FORTH TR419 2011) relies on complicated rules that infer search information (Fundamental Categories and Fundamental Relations) from basic CRM information.
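The flavour of such rules can be sketched in a few lines. The property and instance names below (was_produced_by, took_place_at, falls_within, from_place) are hypothetical simplifications of CRM properties and Fundamental Relations, not the real identifiers:

```python
# Toy sketch of a Fundamental Relation rule in the spirit of FORTH TR419:
# derive the shortcut "thing from place" from a chain of CRM-style properties.
# All property and instance names are hypothetical simplifications.

triples = {
    ("statue", "was_produced_by", "production1"),
    ("production1", "took_place_at", "athens"),
    ("athens", "falls_within", "greece"),
}

def infer_from_place(triples):
    """Rule: X was_produced_by E, E took_place_at P, P falls_within* Q
       =>  X from_place P (and every broader place Q)."""
    inferred = set()
    for (x, p1, e) in triples:
        if p1 != "was_produced_by":
            continue
        for (e2, p2, place) in triples:
            if e2 == e and p2 == "took_place_at":
                # follow the falls_within chain upwards
                frontier = [place]
                while frontier:
                    p = frontier.pop()
                    inferred.add((x, "from_place", p))
                    frontier += [q for (s, r, q) in triples
                                 if s == p and r == "falls_within"]
    return inferred

print(infer_from_place(triples))
# contains ("statue", "from_place", "athens") and ("statue", "from_place", "greece")
```

A search interface then queries the single derived from_place property instead of walking the whole CRM chain, which is exactly why complete, fast rule support in the store matters for this model.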
p.23 lists important store features. OWLIM notes follow:
- Text Indexing: through Lucene, option to index single graph nodes or whole "molecules" (subgraph of given depth, rooted at each node)
- Co-reference: OWLIM supports owl:sameAs reasoning optimization, which allows it to provide reason-able views of large parts of the LOD, see http://factforge.net and http://linkedlifedata.com
- Security and Level of access: per named graph. This is an important feature for RS, since museum data can be open per-project
- The OWLIM Sesame API has some authorization features.
- OWLIM will be adding named-graph restrictions. This can be achieved by syntactic restriction:
- SPARQL queries should not include non-authorized named graphs
- Imported data should not refer to different named graphs
- sameAs assertions should not include nodes from different named graphs
- Single Sign On: the spec says LDAP can store a login token, but I think we'll need a separate SSO product for this purpose, such as CAS, which is open source (initially from Yale)
- Scalability: OWLIM shows best BSBM performance for SPARQL 1.1 features. OWLIM supports a cluster, and has been deployed on the cloud (100k queries answered for $1)
- Named Graphs: OWLIM supports tag-sets, which are a generalization of named graphs
- Updates at graph level and triple level: OWLIM supports this, including an important "fast retraction" mechanism:
June 22, 2010: When explicit statements are removed from the repository, any implicit statements that rely on the removed statement must also be removed. In OWLIM 3.3, removal of explicit statements is achieved by invalidating only those inferred statements that can no longer be derived in any way, which massively improves statement deletion efficiency. This allows very fast deletion of (instance) statements by computing exactly the necessary inferred statements to delete.
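The idea behind fast retraction can be sketched as a small truth-maintenance system: every inferred statement remembers which derivations support it, and retracting an explicit statement deletes only the inferences that lose their last derivation. The single-step transitivity rule below is a stand-in for a real ruleset, not OWLIM's actual algorithm:

```python
# Minimal truth-maintenance sketch of "fast retraction": each inferred triple
# keeps the set of derivations that support it; retracting an explicit triple
# deletes only those inferred triples that lose their last derivation.
from collections import defaultdict

explicit = {("a", "broader", "b"), ("b", "broader", "c"), ("b", "broader", "d")}

def derive(explicit):
    """One application of: (x broader y) & (y broader z) => (x broader z)."""
    support = defaultdict(set)   # inferred triple -> set of derivations
    for (x, p, y) in explicit:
        for (y2, p2, z) in explicit:
            if y2 == y and p == p2 == "broader" and x != z:
                support[(x, "broader", z)].add(((x, p, y), (y2, p2, z)))
    return support

support = derive(explicit)

def retract(triple):
    explicit.discard(triple)
    for inferred in list(support):
        # drop derivations that used the retracted statement
        support[inferred] = {d for d in support[inferred] if triple not in d}
        if not support[inferred]:        # no derivation left -> retract it too
            del support[inferred]

retract(("b", "broader", "c"))
print(sorted(support))   # ("a","broader","d") survives; ("a","broader","c") is gone
```

The point is that deletion touches only the inferences reachable from the retracted statement, instead of recomputing the whole closure, which is what makes instance-level deletes fast.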
The BSBM report compares (amongst others) Jena TDB v0.8.9 vs OWLIM v3.4.3129
As can be seen from the cited results, OWLIM consistently outperforms Jena TDB. Below we calculate how much faster OWLIM is:
- Explore (multiple clients, 100m triples): 1.55 times (1 client), 2.32 times (4 clients), 4.23 times (8 clients), 6.91 times (64 clients).
OWLIM efficiently exploits all opportunities for parallelization, so its throughput actually increases with the number of clients, while Jena's decreases. The performance gap therefore widens considerably, especially at large numbers of clients
- Explore (multiple clients, 200m triples): 1.24 times (1 client), 1.68 times (4 clients), 2.85 times (8 clients). The same pattern repeats for the bigger repository size: OWLIM's performance increases, while Jena's performance stays the same
- The performance gap is less pronounced here, but Jena TDB crashed on the 200m dataset with 64 clients
- Explore and Update (single client): 4.13 times (100m triples). Focusing on just the update queries we see this:
- Insert (Upd. Query 1): 27.7 times
- Delete (Upd. Query 2): 12.7 times
These huge performance differences are due to OWLIM's efficient incremental inferencing and retraction algorithms
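For reference, the speedup factors above are simply ratios of BSBM throughput (query mixes per hour, QMpH). The QMpH values below are hypothetical placeholders chosen to demonstrate the calculation, not the actual BSBM figures:

```python
# The speedup factors quoted above are ratios of BSBM throughput (query mixes
# per hour, QMpH). The QMpH numbers below are hypothetical placeholders to
# show the calculation, not the actual BSBM results.

owlim_qmph = {1: 15500, 4: 46400, 8: 84600, 64: 138200}   # hypothetical
jena_qmph  = {1: 10000, 4: 20000, 8: 20000, 64: 20000}    # hypothetical

for clients in sorted(owlim_qmph):
    factor = owlim_qmph[clients] / jena_qmph[clients]
    print(f"{clients} clients: OWLIM is {factor:.2f}x faster")
```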
(Numbered links are from BM proposal 188.8.131.52 "Incremental Assert/Retract", heading "MILARQ Offline Indexing")
- MILARQ allows the UK union catalog ClarosNet to search quickly through 40M Cultural Heritage objects: http://explore.clarosnet.org/XDB/ASP/clarosHome/
- part of the CLAROS VRERI programme: http://code.google.com/p/vreri
Funded by JISC
- programme page : http://code.google.com/p/vreri/wiki/MILARQ
- proposal : http://vreri.googlecode.com/files/Bid30%20MILARQ.pdf
Start Date: 1 March 2010, End Date: 31 October 2010, budget 52k GBP (35k funded).
The MILARQ Project seeks to enhance performance of the CLAROS Explorer VRE, to allow wider public access to a rich set of classical art resources from major European research centres, and incorporation of additional source data.
The software developments relate to Jena, an existing, widely used, open source RDF data management platform. These will include creation of multiple indexes over the underlying RDF triple store, Jena TDB, and other optimizations relating to filter performance.
The developments will be tested and evaluated initially in the context of CLAROS, but it is intended that they will be usable by any system based on the Jena TDB and LARQ RDF storage system
- project site : http://code.google.com/p/milarq
- project plan : http://code.google.com/p/milarq/wiki/ProjectPlanOutline_201003_201010
- evaluation plan : http://code.google.com/p/milarq/wiki/MILARQ_evaluation_testing
- performance results : http://code.google.com/p/milarq/wiki/ClarosServerPerformanceNotes
"The overall improvement of the test suite is about 20x, though some queries show 100x"
- final report : http://code.google.com/p/milarq/wiki/MILARQ_final_progress_post
- Create The Basic LARQ index : http://code.google.com/p/junsbriefcase/wiki/MILARQOnNaxos
Explains that index update is offline
- Works with TDB 0.8.9 : http://code.google.com/p/milarq/source/browse/README.text
(probably not yet with newer versions)
- see chat-MILARQ and CRM searching.txt for a full skype transcript (with some updates)
ResearchSpace has high requirements for the number of concurrent users and for store updates. We think that currently only OWLIM can meet these requirements cost-effectively in terms of hardware. The BSBM tests were performed on this hardware:
- Processors: Intel i7 950, 3.07GHz (4 cores)
- Memory: 24GB
- Hard Disks: 2 x 1.8TB (7,200 rpm) SATA2.
Find PA evaluation presentation (slideshare, linked from ontotext.com)
Justify why OWLIM is better than Virtuoso
Justify why OWLIM is better than Jena TDB
Describe that we use and support Jena as an API
Are MILARQ additions likely to speed up Jena enough?