Counting and analysis of repository content

Name                         Size   Creator           Creation Date       Comment
triples-BM.xls               73 kB  Vladimir Alexiev  Dec 17, 2012 17:22
triples-BM-mitac-dec-17.xls  73 kB  Dimitar Manov     Dec 17, 2012 17:02  Added statements per thesauri + approx calculations
triples-BM-201206.xls        51 kB  Vladimir Alexiev  Dec 06, 2012 16:16
triples-RKD.xls              63 kB  Vladimir Alexiev  Dec 06, 2012 16:16

Counting

  • Total statements
    select (count(*) as ?c) {?s ?p ?o}
    
  • Statements per property.
    The maximum result limit is 200 rows, so we fetch them in two portions:
    select ?p (count(*) as ?c) {?s ?p ?o} group by ?p order by ?p
    select ?p (count(*) as ?c) {?s ?p ?o} group by ?p order by ?p offset 200
    
  • Class instances (note: one instance can have many rdf:type values!)
    select ?t (count(*) as ?c) {?s rdf:type ?t} group by ?t order by ?t
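
The two-portion trick for the per-property counts generalises to any number of pages. A minimal Python sketch, assuming a caller-supplied `run_query` function (hypothetical, not part of any particular client library) that sends a SPARQL string to the repository and returns a list of result rows:

```python
def fetch_all(run_query, base_query, page_size=200):
    """Fetch every row of base_query in pages of at most page_size rows.

    run_query is a hypothetical callable: SPARQL string in, list of rows out.
    base_query must end with an ORDER BY clause so that the pages are stable.
    """
    rows = []
    offset = 0
    while True:
        page = run_query(f"{base_query} limit {page_size} offset {offset}")
        rows.extend(page)
        # A short (or empty) page means there is nothing left to fetch.
        if len(page) < page_size:
            return rows
        offset += page_size


PER_PROPERTY = "select ?p (count(*) as ?c) {?s ?p ?o} group by ?p order by ?p"
# Usage, once a real run_query is available:
# counts = fetch_all(run_query, PER_PROPERTY)
```

The `order by` in the query matters here: without a fixed order, `offset` may skip or repeat rows between pages.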
    

Analysis

We provide historical data, but focus on the latest data (triples-BM.xls of 2012-12).

Properties


Without sameAs expansion: 89995389 (2.9M=3.1% fewer triples)

  • rdf:type=58426160 is 62.9% of all triples (see breakdown below)
  • Object (business) & thesauri triples are 26.0+4.9=30.9%, of which we can assume objects are 21% and thesauri 10%.
  • FRs=5751214 are 6.2% of all triples, or 29% of business triples
  • bmo:PX_physical_description=25584 ~ rso:FC70_Thing=23993 is 3x more than the 8k objects!? Due to owl:sameAs
  • owl:sameAs=72010 is 9x more than the 8k objects.
    Each object has 3 sameAs URIs (a,b,c), which causes 9 statements: aa bb cc ab bc ca ba cb ac
    That's what an equivalence relation will do to you.
  • skos:inScheme=357283 ~ skos:Concept=357318 is the total number of thesaurus terms
  • skos:exactMatch=4495 come from RKD. E.g. rkd-plaats:renaix and rkd-plaats:renaix give 4 triples (2 symmetric, 2 reflexive)
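
The sameAs and exactMatch counts above follow from closing a set of equivalent URIs under reflexivity, symmetry and transitivity: n URIs in one equivalence class yield n*n statements. A small Python sketch of that arithmetic (the URI names `a`, `b`, `c` are the placeholders used above, and `equivalence_closure` is an illustrative helper, not repository code):

```python
from itertools import product


def equivalence_closure(uris):
    """All (subject, object) pairs implied by putting the URIs in one equivalence class."""
    return list(product(uris, repeat=2))


same_as = equivalence_closure(["a", "b", "c"])
exact_match = equivalence_closure(["x", "y"])

# Reflexive pairs relate a URI to itself; the rest come from symmetry/transitivity.
reflexive = [pair for pair in exact_match if pair[0] == pair[1]]

print(len(same_as))                        # 3 URIs per object
print(len(exact_match), len(reflexive))    # 2 matched terms
```

With 3 URIs per object this gives 9 statements per object, so 8,000 objects would account for about 72,000 owl:sameAs statements, close to the 72010 counted.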

Types

  • _:nodeXX=23528903: 40.3% of rdf:type triples are useless OWL DL restriction types
     crm:En_Whatever rdf:type [owl:Restriction...] 

    We could eliminate these (24% of all triples) by:

    1. Delete such statements after loading the ontologies and before loading the data
      delete where {?e rdfs:subClassOf ?t. ?t a owl:Restriction}
      
    2. Write a Perl script to cut down ECRM to RDFS+inverse (what Doerr wanted) + transitive
  • CRM classes=30864964: 52.8% of rdf:type triples. The counts decrease down the class hierarchy, as expected:
    owl:Thing=3627096 ~ crm:E1_CRM_Entity=3626903
    crm:E77_Persistent_Item=3092726
    crm:E2_Temporal_Entity=240162
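
The type percentages can be re-derived from the raw counts. A quick Python check, assuming both percentages are taken relative to the rdf:type total of 58426160 quoted under Properties (the group labels are ours):

```python
RDF_TYPE_TOTAL = 58_426_160  # rdf:type count from the Properties section

type_counts = {
    "OWL restrictions (_:nodeXX)": 23_528_903,
    "CRM classes": 30_864_964,
}


def share(count, total):
    """Percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


shares = {name: share(count, RDF_TYPE_TOTAL) for name, count in type_counts.items()}

# Whatever is left belongs to types not broken down on this page.
remaining = RDF_TYPE_TOTAL - sum(type_counts.values())

for name, pct in shares.items():
    print(f"{name}: {pct}%")
print(f"remaining types: {remaining} ({share(remaining, RDF_TYPE_TOTAL)}%)")
```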

Statements and MB

Objects                           0          8,000        115,000      1,500,000
Note                      thes.only       w.sameAs        current      estimated
Explicit statements       4,444,431     14,697,760     24,749,779    269,296,796
Total statements         21,992,273     99,139,808    179,458,815  2,075,903,690
Expansion ratio                4.95           6.75           7.25           7.71

Objects                       8,000        115,000
Note                     calculated     calculated
Explicit per object           1,282            177
Total per object              9,643          1,369
Expansion ratio                7.52           7.75

Objects                              115,000   1,500,000
Note                                  actual   estimated
FTS size, MB                  18         276       3,600
Storage, MB                            7,000      80,973
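
The per-object and estimated figures in the table can be reproduced from the measured ones. The Python sketch below rests on assumptions inferred from the numbers, not stated on this page: the thesauri-only column is subtracted as a baseline, statements grow linearly with the number of objects, FTS size scales with object count, and storage scales with total statements.

```python
# Figures copied from the "Statements and MB" table.
THES_EXPLICIT = 4_444_431    # explicit statements, thesauri only (0 objects)
THES_TOTAL = 21_992_273      # total statements, thesauri only
CUR_OBJECTS = 115_000        # "current" column
CUR_EXPLICIT = 24_749_779
CUR_TOTAL = 179_458_815
CUR_FTS_MB = 276
CUR_STORAGE_MB = 7_000
TARGET_OBJECTS = 1_500_000   # "estimated" column


def per_object(with_objects, thesauri_only, objects):
    """Statements attributable to one object after removing the thesauri baseline."""
    return (with_objects - thesauri_only) / objects


explicit_per_object = per_object(CUR_EXPLICIT, THES_EXPLICIT, CUR_OBJECTS)
total_per_object = per_object(CUR_TOTAL, THES_TOTAL, CUR_OBJECTS)

# Assumption: linear growth in the number of objects on top of the baseline.
est_explicit = THES_EXPLICIT + explicit_per_object * TARGET_OBJECTS
est_total = THES_TOTAL + total_per_object * TARGET_OBJECTS

# Assumption: FTS follows object count, storage follows total statements.
est_fts_mb = CUR_FTS_MB * TARGET_OBJECTS / CUR_OBJECTS
est_storage_mb = CUR_STORAGE_MB * est_total / CUR_TOTAL

print(round(explicit_per_object), round(total_per_object))
print(round(est_explicit), round(est_total), round(est_total / est_explicit, 2))
print(round(est_fts_mb), round(est_storage_mb))
```

Under these assumptions the script matches the estimated column; a different growth model (for example, thesauri growing with the collection) would give different numbers.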