Measure the speed of the different steps in the Repository Creation application
Old
rs-kbgen-times.xlsx.
The 1st dataset has 6k objects; the 2nd has 44k objects (with lots of sameAs links, though, so only about half of them show up as Museum objects in the UI).
The 2 test runs were measured on different machines (Mitac's laptop vs. Cr4 server), so a new clean nightly run on Cr4 with the small set is needed. Still, some points are obvious:
- the thesauri index has ~250k objects and takes about a minute
- the main index has 6k (or 44k) objects and takes 0.5h (2.5h), because the molecules are larger and each molecule must be navigated
- The speed of Lucene indexing is:
item | size | time | speed
thesauri | 250k | 1min | 250k/min
6k objects | 6k | 30min | 200/min
44k objects | 44k | 2.5h | 300/min
- The speed of adding objects is 700/min vs. 1300/min for the larger dataset; this also includes the overhead of parsing
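The indexing speeds can be rechecked from the sizes and times quoted above. This is only a minimal sketch: the helper name `per_minute` is illustrative, and the 44k run is assumed to take 150 minutes (the 2.5h given for the main index).

```python
def per_minute(items: int, minutes: float) -> float:
    """Throughput in items per minute."""
    return items / minutes


# (items, minutes) per run; the 44k duration is the 2.5h quoted for the main index
runs = {
    "thesauri": (250_000, 1),
    "6k objects": (6_000, 30),
    "44k objects": (44_000, 150),
}

for name, (items, minutes) in runs.items():
    print(f"{name}: {per_minute(items, minutes):,.0f}/min")
```

The 44k run works out to a little under 300/min, so object indexing is roughly three orders of magnitude slower per item than thesauri indexing.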
New Mapping
dataset | 8k | 115k | 8k | 115k
museum objects | 23,993 | 115,559 | |
owlim.properties | | | |
explicit statements | 14,697,760 | 24,749,779 | |
total statements | 99,139,808 | 192,550,603 | |
entities | 7,720,698 | 10,355,104 | |
file sizes | MB | MB | |
thesaurus FTS index size | 24 | 27 | |
object FTS index size | 126 | 13,312 | |
total repo size | 4,506 | 21,504 | |
repo-indices | 4,356 | 8,165 | |
Loading times | sec | sec | hrs | hrs
object FTS indexing | 1,317 | 48,673 | 0.4 | 13.5
add objects | 5,626 | 99,354 | 1.6 | 27.6
add images | 3,129 | 3,063 | 0.9 | 0.9
thesaurus FTS index | 171 | 167 | 0.0 | 0.0
add ontologies/thesauri | 1,282 | 1,151 | 0.4 | 0.3
Total | 11,525 | 152,408 | 3 | 42
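The hour columns and the Total row follow from the seconds columns. The sketch below rechecks them and also prints each step's share of the total; the variable names are illustrative, and the figures are copied from the table.

```python
# Step timings in seconds, copied from the table: (8k repo, 115k repo)
steps_sec = {
    "object FTS indexing": (1_317, 48_673),
    "add objects": (5_626, 99_354),
    "add images": (3_129, 3_063),
    "thesaurus FTS index": (171, 167),
    "add ontologies/thesauri": (1_282, 1_151),
}

# Column totals, one per repo
totals = [sum(col) for col in zip(*steps_sec.values())]

for name, secs in steps_sec.items():
    # Hours and share of the repo's total load time for each step
    cells = [f"{s / 3600:.1f}h ({s / t:.0%})" for s, t in zip(secs, totals)]
    print(f"{name:<24} " + " | ".join(cells))

print("Total".ljust(24), " | ".join(f"{t / 3600:.0f}h" for t in totals))
```

For the 115k repo, adding objects and object FTS indexing together account for about 97% of the load time.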
- objects: the 8k repo has sameAs, so the objects are tripled. E.g. Lucene indexes 24k objects, not 8k
- statement expansion (total:explicit) has grown from 5.5-6.5x to almost 8x (about 6.7x for the 8k repo and 7.8x for the 115k repo); this needs investigation
- the 115k repo uses the new objects, but old thesauri/images files
- FTS indexing is quite fast, but the FTS index size is still too large (13,312 MB for the object index of the 115k repo)
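The expansion ratio and the index size per object can be derived from the table figures. This is a rough sketch: the dictionary keys are illustrative, and the object counts are the "museum objects" row, which for the 8k repo already includes the sameAs tripling.

```python
# Figures copied from the table above
repos = {
    "8k": {"explicit": 14_697_760, "total": 99_139_808,
           "objects": 23_993, "object_fts_mb": 126},
    "115k": {"explicit": 24_749_779, "total": 192_550_603,
             "objects": 115_559, "object_fts_mb": 13_312},
}

for name, r in repos.items():
    # Statement expansion: total statements per explicit statement
    expansion = r["total"] / r["explicit"]
    # Object FTS index size per indexed museum object, in KB
    kb_per_object = r["object_fts_mb"] * 1024 / r["objects"]
    print(f"{name}: expansion {expansion:.1f}x, "
          f"object FTS index {kb_per_object:.0f} KB/object")
```

The per-object index size comes out at roughly 5 KB for the 8k repo and roughly 118 KB for the 115k repo, which is where the FTS size concern shows up.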
Full Set
See BM Data Volumetrics#Full Set
- Storage location was on a RAM drive, which took 55G out of 64G. Using a RAM drive for repo load is several times faster, according to previous experiments on other servers
- storage size: 50+GB
- adding BM objects: start speed 132 obj/s, end speed 26 obj/s. Approx. 20h total time
- ~2M BM objects according to the nuxeo ID file (not 1.5M as we said before)
- lots of DBPedia thesauri items w/o label; don't know where those came from
- ~407,000 thesauri items, indexed in 3min
- nuxeo ids added in 2600s (<1h)
- failed w/ Exception during Rembrandt paintings
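To put the object-loading speeds in context, the sketch below estimates how long ~2M objects would take at the start and end speeds, and the average speed implied by ~20h. It assumes all ~2M objects were loaded in that time, which is uncertain given that the run failed; the names are illustrative.

```python
TOTAL_OBJECTS = 2_000_000  # approximate count from the nuxeo ID file (assumption)


def hours_at(rate_obj_per_sec: float) -> float:
    """Hours needed to load TOTAL_OBJECTS at a constant rate."""
    return TOTAL_OBJECTS / rate_obj_per_sec / 3600


best = hours_at(132)   # if the start speed had been sustained
worst = hours_at(26)   # if the whole load had run at the end speed
avg_rate = TOTAL_OBJECTS / (20 * 3600)  # average obj/s implied by ~20h

print(f"at 132 obj/s: {best:.1f}h, at 26 obj/s: {worst:.1f}h")
print(f"average over 20h: {avg_rate:.1f} obj/s")
```

Under these assumptions the implied average (about 28 obj/s) is close to the end speed, which suggests the load ran near the slow rate for most of its duration.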