Intro
Mitac wrote a molecule extractor that saves the FTS molecules for all objects:
http://researchspace.ontotext.com/molecules.rar
- molecules-myIndex.txt: objects: traverse Complete Museum Object, collect all literals, plus prefLabels of terms
- molecules-thesIndex.txt: thesauri: prefLabels and altLabels of terms
Empty Molecules Due to SameAs
There used to be empty object molecules due to bad sameAs
http://collection.britishmuseum.org/id/codex/457748 len=0
These are now eliminated since Josh puts sameAs in separate files, and we don't load them. To remove from file:
Empty Term Molecules
There are some empty terms: 5 in RKD, 215 in BM
[^empty-terms.txt]
- 15 that are referenced by related/broader, but not defined, eg:
place/-not-found-in-the-place-thesaurus
thesauri/-sword-fitting-not-found-in-the-object-thesaurus
thesauri/-theatre/amphitheatre-not-found-in-the-subject-thesaurus
- 200 normal-looking terms, eg thesauri/x114719. Most of these are again referenced by related/broader but not defined on their own.
- To find other errors, we filter by word "broader/related". But there are cases when a related term is on a second line:
idThes:x5814 skos:related idThes:x112260, idThes:x112261.
So we throw out terms that appear several times:
RS-1379 Josh:
- fix thesauri/modification/RP: no prefLabel, this is required by RForm
- Please check these 23 terms (all are in thesaurusandplace_1.trig)
[^empty-BM-terms.txt]
- If a term is not defined, is it worth keeping it as related/broader? Eg idThes:x103973 is not defined, so what does it mean to say that "Sa'a-Ulawa" is a sub-group thereof?
idThes:x103967 a crm:E74_Group, skos:Concept;
  skos:broader idThes:x103966, idThes:x103973;
  skos:inScheme idThes:ethname;
  skos:prefLabel "Sa'a-Ulawa";
Can we find this more easily with SPARQL? This finds all 250 terms without prefLabel, but doesn't say whether they are defined on their own:
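Independently of SPARQL, the "defined on its own" check can be sketched in a few lines of Python. This is only an illustration, assuming the triples have already been parsed into (subject, predicate, object) tuples; the function names, the tuple layout and the label given to idThes:x103966 are made up for the example.

```python
# Properties that reference another term (assumption: only these two matter here).
LINK_PROPS = {"skos:broader", "skos:related"}

def undefined_terms(triples):
    """Terms that are referenced by broader/related but never appear as a subject."""
    subjects = {s for s, _, _ in triples}
    referenced = {o for _, p, o in triples if p in LINK_PROPS}
    return sorted(referenced - subjects)

def terms_without_pref_label(triples):
    """All terms (subjects or link targets) that have no skos:prefLabel."""
    labelled = {s for s, p, _ in triples if p == "skos:prefLabel"}
    terms = {s for s, _, _ in triples} | {o for _, p, o in triples if p in LINK_PROPS}
    return sorted(terms - labelled)

# Sample data based on the idThes:x103967 example; the x103966 label is hypothetical.
TRIPLES = [
    ("idThes:x103967", "rdf:type", "skos:Concept"),
    ("idThes:x103967", "skos:broader", "idThes:x103966"),
    ("idThes:x103967", "skos:broader", "idThes:x103973"),
    ("idThes:x103967", "skos:prefLabel", "Sa'a-Ulawa"),
    ("idThes:x103966", "skos:prefLabel", "hypothetical parent label"),
]

print(undefined_terms(TRIPLES))
print(terms_without_pref_label(TRIPLES))
```

The first function separates "referenced but not defined" from the broader "has no prefLabel" set, which is the distinction the prefLabel-only query cannot make.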
Investigate Molecules
Reformat
Mitac changed the dump format:
<uri>, len=<length>, <phrase1> - <phrase2> ...
Put <phrase1> on its own bulleted line:
perl -ple "s/(len=\d+),\s+([^\n]+)/\1\n - \2/" molecules-myIndex.txt > a
mv a molecules-myIndex.txt
perl -ple "s/(len=\d+),\s+([^\n]+)/\1\n - \2/" molecules-thesIndex.txt > a
mv a molecules-thesIndex.txt
(This alternative is slightly broken in case len=0):
perl -pe "s{(len=\d+),\s+}{$1\n - }" molecules-myIndex.txt > a
Molecule Lengths
- Extract lengths, sort by len (descending):
grep "^http.*len=[^0]" molecules-myIndex.txt | sort -nrs -t = -k 2 > molecule-sizes-20121219.txt
- Check molecules of identical size (especially large ones) for common junk
- Previously:
260722: RFI42168 RFI42169 RFI42170 RFI42171 RFI42172 RFI42173 RFI42174... (167 objects)
124704: PPA356887 PPA356889 PPA356890 PPA356892 PPA356893 PPA356895... (106 objects)
...
- Currently:
199486: RFC40130 RFC40131 RFC40132 RFC40133 RFC40134 RFC40135 RFC40136... (229 objects)
172289: EPF109974 EPF109976 EPF109978 EPF109979 EPF109980 EPF109981... (126 objects)
...
- Count repeated sizes
perl -ne "/len=(.*\n)/; print $1" molecule-sizes-20121219.txt | uniq -dc | sort -nr > repeated-sizes.txt
- the top-3 repeated sizes are
1059: YCA71368 YCA71442 YCA71829... (775 objects)
891: YCA70997 YCA71927 YCA37921... (610 objects)
889: YCA71479 YCA71936 YCA10263... (598 objects)
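The same count of repeated sizes can be sketched in Python. This is a rough equivalent of the grep/perl/uniq pipeline, not the script that was actually used; the function name and the example.org URIs are assumptions, and only the "len=<length>" part of each line is relied upon.

```python
import re
from collections import Counter

LEN_RE = re.compile(r"len=(\d+)")

def repeated_sizes(lines):
    """Return (count, size) pairs for sizes shared by more than one molecule,
    most frequent first. Empty molecules (len=0) are skipped, like the grep."""
    sizes = Counter()
    for line in lines:
        match = LEN_RE.search(line)
        if match and int(match.group(1)) > 0:
            sizes[int(match.group(1))] += 1
    return sorted(((n, size) for size, n in sizes.items() if n > 1), reverse=True)

# Hypothetical sample lines in the "<uri>, len=<length>" dump format.
SAMPLE = [
    "http://example.org/id/object/A, len=1059",
    "http://example.org/id/object/B, len=1059",
    "http://example.org/id/object/C, len=891",
    "http://example.org/id/object/D, len=0",
    "http://example.org/id/object/E, len=0",
]

print(repeated_sizes(SAMPLE))
```

Unlike uniq, the Counter does not need the input to be sorted by length first.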
Extract Molecule
Get molecule of specified objects
- Long repeated sizes are due to leaks:
perl fts-get.pl RFC40130 RFC40131 RFC40132 RFC40133 RFC40134 RFC40135 RFC40136
perl fts-get.pl EPF109974 EPF109976 EPF109978 EPF109979 EPF109980 EPF109981
Searching with one of these (eg RFC40130) returns all 229
- Short repeated sizes show no commonality (the repetition is a coincidence)
perl fts-get.pl YCA71368 YCA71442 YCA71829
perl fts-get.pl YCA71479 YCA71936 YCA10263
perl fts-get.pl YCA71479 YCA71936 YCA10263
Investigate Common Junk
The investigation started because:
- The FTS molecules of all BM objects are between 149998 and 152029 bytes (there are even exact matches!).
- BM objects are bigger than RKD, which are 8-22k.
Neither of these is normal. I guessed that a large amount of common junk text is collected, together with a small amount of object-specific text. I.e. we have an FTS leak: starting from object1, the properties that we chase go into a sub-object of another object2, and then to all objects of the same collection.
GAA21182 and GAA21183
(this is old)
- The FTS molecule is one long sequence of words.
- Luckily the order is approximately the same, else I couldn't have compared it
- First split it to shared-specific parts (lines) using emacs M-x compare-windows, then sorted
The wdiff program is also useful, see below
- Then compared, to extract the specific parts
- Extracted text from TTL
- Made summary table showing all specific parts and TTL text.
First char is a code: the comment applies to all lines with same first char (no char means it's ok, i.e. in both TTL and FTS specific part)
object1 | object2 | comment, all with same first char |
http://collection.britishmuseum.org/id/object/GAA21182 | http://collection.britishmuseum.org/id/object/GAA21183 | |
len=150307 | len=150081 | FTS len. Most of this is common junk |
!1882 | !1978 | in TTL and in FTS common part, checked manually |
!Acquisition date :: 1882 :: | !Acquisition date :: 1978 :: | |
!Consists of :: glass | !Consists of :: glass | |
!Purchased from :: Chester, Greville John :: | !Donated by :: Roberts, V G :: | |
?BM-GAA21182 | ?BM-GAA21183 | NuxeoID generated by Kasabov |
1882,0510.17 | 1978,0818.2 | |
2.50 | | |
3.00 | | |
448765 | 448764 | |
Dimension | | |
Dimension :: 2.50cm :: | | |
Dimension :: 3.00cm :: | | |
GAA21182 | GAA21183 | |
Object type :: bead :: | Object type :: necklace :: | |
Opaque black glass disc bead with a goose or swan stamped... | Necklace of thirty-seven blue glass beads (twenty-two cube.. | |
Subject :: bird :: | Subject :: mammal :: | |
Uses technique :: stamped :: | | |
>bead | >neck-ornament | thesaurus term. TODO check all: skos:altLabel or skos:broader? |
>cm | >necklace | |
>inscription | >xian lian | |
>stamped | | |
>Length | | |
>Width | | |
-BM Dimension | | thesaurus name. TODO investigate, should not appear (skos:inScheme is not traversed) |
-BM Inscription Type | | |
-BM Technique | | |
-The British Museum TECHNIQUE Concept Scheme | | |
-QUDT Unit | | |
.Information Object | | class name. Doesn't hurt and cannot discard it (rdfs:label is traversed even for rdf:Class) |
.Man-Made Feature | | |
.Measurement Unit | | |
.Physical Feature | | |
.Visual Item | | |
Here's a diff, keep in mind that the truncated lines are often several kb long
This statistic shows 99% common junk:
GAA21182.txt: 24914 words : 24838 99% common : 56 0% deleted : 20 0% changed
GAA21183.txt: 24871 words : 24838 99% common : 21 0% inserted : 12 0% changed
I was able to isolate the common-junk.txt by using commands like this several times:
The question is which is the sub-object that becomes shared between all objects
- I searched for this common junk string: "Purchased from :: Gordon, Margot".
- it comes from http://collection.britishmuseum.org/id/object/PPA59031/acquisition in PD_101119_PPA59031.rdf
- so I guessed that Acquisition is the shared sub-object.
- I verified properties.txt (used for GetCompleteMO) and LuceneIndexCreation.lucene (used for FTS indexing)
- in both we have this pair of properties:
property | used for |
P12i_was_present_at | Obj present at Event (eg exhibition, research) |
P11_had_participant | BM Association Mapping#Acquired Through (intermediary or contributor) |
- unfortunately it causes this leak:
obj1 -> P12i -> obj1/acquisition -> P11<P12 -> seller/buyer=BM -> P12i -> obj2/acquisition
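The leak chain can be reproduced on a toy graph. The sketch below is not the OWLIM/Lucene code; it is a plain breadth-first traversal over hypothetical dictionaries (EDGES, LITERALS) and shortened property names, assuming that BM is linked to every acquisition via P12i because P11 is a sub-property of P12.

```python
from collections import deque

# Toy data: node -> list of (property, target). All names are illustrative.
EDGES = {
    "obj1": [("P12i", "obj1/acquisition")],
    "obj2": [("P12i", "obj2/acquisition")],
    "obj1/acquisition": [("P11", "BM")],
    "obj2/acquisition": [("P11", "BM")],
    # Inferred: BM "was present at" every acquisition it took part in.
    "BM": [("P12i", "obj1/acquisition"), ("P12i", "obj2/acquisition")],
}
LITERALS = {
    "obj1/acquisition": ["Purchased from :: Chester, Greville John"],
    "obj2/acquisition": ["Donated by :: Roberts, V G"],
}

def collect_molecule(start, edges, literals, follow):
    """Collect all literal text reachable from start via the followed properties."""
    seen, queue, texts = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        texts.extend(literals.get(node, []))
        for prop, target in edges.get(node, []):
            if prop in follow and target not in seen:
                seen.add(target)
                queue.append(target)
    return texts

# Following both P12i and P11 pulls obj2's acquisition text into obj1's molecule.
print(collect_molecule("obj1", EDGES, LITERALS, {"P12i", "P11"}))
# Dropping P11 from the followed set keeps the molecule local to obj1.
print(collect_molecule("obj1", EDGES, LITERALS, {"P12i"}))
```

With thousands of objects sharing the same acquirer, the same traversal would collect the acquisition text of all of them, which matches the near-identical molecule sizes seen above.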
RKD Objects
From the above I had a hunch the common junk comes (mostly?) from Acquisition.
All BM objects have the same acquirer (P22) and that's BM. So I found the RKD objects acquired by Mauritshuis:
05_BadendeSusana.xml.ttl:<obj/2926/collection/5/entry>
07_man_met_baret.xml.ttl:<obj/2946/collection/4/entry>
08_NicolaesTulp.xml.ttl:<obj/3048/collection/3/entry>
10_oude_vrouw.xml.ttl:<obj/2952/collection/3/entry>
11_Andromeda.xml.ttl:<obj/2940/collection/5/entry>
12_lachende_man.xml.ttl:<obj/3064/collection/7/entry>
Then I extracted the corresponding molecules (2926.txt, 2946.txt, 3048.txt) and one that's not acquired by Mauritshuis: 53707.txt.
The hunch was confirmed: the share of common junk is much higher between objects with a common acquirer:
> wdiff -123s 2926.txt 3048.txt
2926.txt: 2020 words 1031 51% common 134 6% deleted 855 42% changed
3048.txt: 2582 words 1031 39% common 391 15% inserted 1160 44% changed
> wdiff -123s 2926.txt 2946.txt
2926.txt: 2020 words 1124 55% common 306 15% deleted 590 29% changed
2946.txt: 1884 words 1124 59% common 181 9% inserted 579 30% changed
> wdiff -123s 2926.txt 53707.txt
2926.txt: 2020 words 522 25% common 0 0% deleted 1498 74% changed
53707.txt: 1820 words 522 28% common 10 0% inserted 1288 70% changed
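A comparable "words in common" figure can be computed without wdiff. The sketch below uses Python's difflib, which is a different matching algorithm from wdiff, so the percentages should be expected to differ somewhat from the statistics above; the function name and the two sample strings are illustrative.

```python
from difflib import SequenceMatcher

def common_word_stats(text_a, text_b):
    """Count words shared (in order) by two molecules and the share per file."""
    a, b = text_a.split(), text_b.split()
    # autojunk=False: do not let difflib discard frequent words in long inputs.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    common = sum(block.size for block in matcher.get_matching_blocks())
    return {
        "words_a": len(a),
        "words_b": len(b),
        "common": common,
        "pct_a": 100 * common // len(a) if a else 0,
        "pct_b": 100 * common // len(b) if b else 0,
    }

# Tiny illustrative molecules; real ones would be read from e.g. 2926.txt and 3048.txt.
MOLECULE_A = "Consists of :: glass Object type :: bead"
MOLECULE_B = "Consists of :: glass Object type :: necklace"

print(common_word_stats(MOLECULE_A, MOLECULE_B))
```

On the real files, a high common share between objects with the same acquirer and a low share otherwise would point to the same conclusion as the wdiff runs.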
Shadow Objects
See Business Properties#Shadow Object with Shared Images
RS-1375
- len=199309: 229 objects "Photograph (black and white) from an album"
RFC40130 RFC40131 RFC40132 RFC40133 RFC40134 RFC40135 RFC40136 ...
Eg the first one includes foreign ids: BM-RFC40144, RFC40345, BM-RFC40143, BM-RFC40142...
- len=172269: 126 photographic negatives by Kissling, Werner, Ruatoki
EPF109974 EPF109976 EPF109978 EPF109979 EPF109980 EPF109981 EPF109982 EPF109994 EPF109995 EPF109996 EPF109997 EPF109998
- len=109100: 133 lithograph prints by Raffet, Denis Auguste Marie (sometimes with coauthors)
PPA368462 PPA368469 PPA368474 PPA368477 PPA368481 PPA368482 PPA368485 PPA368492 PPA368495 PPA368496 PPA368499 PPA368502
- len=100297: 19 bracelets, gaming sticks etc by Northwest Coast Peoples
ENA121769 ENA121770 ENA121789 ENA121790 ENA121791 ENA121828 ENA121829 ENA121830 ENA121831 ENA121832 ENA121834 EOC115652
Not sure why this happens, the first 4 don't have any images
We hot-fixed this problem by deleting such shadow objects, which caused another bug: RKD images disappeared (their objects don't have BM as owner).
The same bug appears again.
Investigate Susana FTS Molecule
RS-981
Anna
After the common junk is eliminated (see Investigate Common Junk), check the molecule for Susana to ensure it's the same as the free text in Turtle
Fixes
Focused FTS Indexing
RS-977
RS-1139
The problem was that the properties chased for the FTS index form a loop, and thus text leaks from one object to another.
Mitac patched OWLIM's Lucene module to do the same as getCompleteMO:
- use properties.txt to limit which properties are traversed
- when it reaches a skos:Concept, it uses much reduced set of properties (only labels)
- if it hits FC70_Thing a second time, cuts off
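The three rules can be illustrated with a small traversal. This is only a sketch of the idea, not Mitac's patch: the data structures, the property names (allowed_prop, other_prop), the set of label properties, and the choice to collect nothing from a second FC70_Thing are all assumptions made for the example.

```python
from collections import deque

LABEL_PROPS = {"skos:prefLabel", "skos:altLabel"}  # assumed "only labels" set

def focused_molecule(start, edges, literals, types, allowed):
    """Collect text for start, applying the three restrictions listed above."""
    seen, queue, texts = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        kind = types.get(node)
        if kind == "FC70_Thing" and node != start:
            continue  # rule 3: a second FC70_Thing cuts the traversal off
        if kind == "skos:Concept":
            # rule 2: for terms, keep labels only and do not traverse further
            texts.extend(t for p, t in literals.get(node, []) if p in LABEL_PROPS)
            continue
        texts.extend(t for _, t in literals.get(node, []))
        for prop, target in edges.get(node, []):
            # rule 1: only properties from the allowed list are followed
            if prop in allowed and target not in seen:
                seen.add(target)
                queue.append(target)
    return texts

# Hypothetical toy data.
EDGES = {
    "obj1": [("allowed_prop", "term1"), ("allowed_prop", "obj2"), ("other_prop", "note1")],
    "term1": [("skos:broader", "term2")],
}
LITERALS = {
    "obj1": [("rdfs:label", "bead")],
    "term1": [("skos:prefLabel", "stamped"), ("skos:scopeNote", "a long note")],
    "term2": [("skos:prefLabel", "technique")],
    "obj2": [("rdfs:label", "necklace")],
    "note1": [("rdfs:label", "ignored")],
}
TYPES = {"obj1": "FC70_Thing", "obj2": "FC70_Thing",
         "term1": "skos:Concept", "term2": "skos:Concept"}
ALLOWED = {"allowed_prop", "skos:broader"}

print(focused_molecule("obj1", EDGES, LITERALS, TYPES, ALLOWED))
```

In this toy run, the text of obj2, the scope note of term1 and the broader term2 all stay out of obj1's molecule.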
Misc Notes
- Extracted the properties from all BM configs, put in [BMX Issues^BM-properties.xls], marked in red the ones I have objections about
RS-934
- To analyze all possible loops, involving not just properties but also their superproperties
RS-680
- Extracted sub-prop hierarchy from ecrm-current.ttl: [^ecrm_subPropertyOf.txt]
- found a mistake, reported to ECRM mlist
- the hierarchy is not very deep (3-4 levels)
- Idea: extract with SPARQL query
- Added RSO and BMO subproperties, check against properties in use (all RSO, and BM-properties.xls)
- May have to distinguish properties for FullMO vs properties for FTS.
Eg P4_has_time-span is not needed for FTS, because there's no point FTS-indexing a date string
- Make RKD thesauri same as BM thesauri (skos:prefLabel instead of rdfs:label or P1/P3)
- Specific properties
- investigate where this comes from:
<object/PPA59031> crm:P15i_influenced <object/PPA59031/acquisition>
- P138 creates a potential loop since it's inverse of P138i and superprop P67i:
P67i_is_referred_to_by
  P70i_is_documented_in
  P138i_has_representation
P138_represents
- Can we possibly eliminate P3?
- For their important labels, objects are supposed to use rdfs:label, and terms
- We can get other interesting texts (eg bmo:PX_physical_description) explicitly
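For the loop analysis that takes superproperties into account, a first step is to expand each followed property to everything it covers through rdfs:subPropertyOf. A minimal Python sketch, assuming the hierarchy has been loaded into a dictionary from property to its direct superproperties (the dictionary, the function names and the two sample entries are illustrative; the real hierarchy would come from ecrm_subPropertyOf.txt):

```python
def superproperties(prop, sub_property_of):
    """All direct and indirect superproperties of prop."""
    found, stack = set(), [prop]
    while stack:
        for parent in sub_property_of.get(stack.pop(), ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

def effectively_followed(followed, sub_property_of):
    """Followed properties plus every subproperty whose data they also match
    when subproperty inference is on."""
    followed = set(followed)
    covered = {p for p in sub_property_of
               if superproperties(p, sub_property_of) & followed}
    return followed | covered

# Two sample entries taken from the notes above.
SUB_PROPERTY_OF = {
    "P11_had_participant": ["P12_occurred_in_the_presence_of"],
    "P138i_has_representation": ["P67i_is_referred_to_by"],
}

print(sorted(effectively_followed({"P12_occurred_in_the_presence_of"}, SUB_PROPERTY_OF)))
```

Running a loop check over the expanded set, rather than over the listed properties alone, would catch cases like the P11/P12 pair.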