
RS-151
RS-981
RS-983
RS-980

Intro

Mitac wrote a molecule extractor that saves the FTS molecules for all objects:
http://researchspace.ontotext.com/molecules.rar

  • molecules-myIndex.txt: objects: traverse Complete Museum Object, collect all literals, plus prefLabels of terms
  • molecules-thesIndex.txt: thesauri: prefLabels and altLabels of terms

Empty Molecules Due to SameAs

There used to be empty object molecules due to bad sameAs, eg:

http://collection.britishmuseum.org/id/codex/457748 len=0

These are now eliminated since Josh puts sameAs in separate files, and we don't load them. To remove from file:

Empty Term Molecules

There are some empty terms: 5 in RKD, 215 in BM

[^empty-terms.txt]

  • 15 that are referenced by related/broader, but not defined, eg:
    place/-not-found-in-the-place-thesaurus
    thesauri/-sword-fitting-not-found-in-the-object-thesaurus
    thesauri/-theatre/amphitheatre-not-found-in-the-subject-thesaurus
    
  • 200 normal-looking terms, eg thesauri/x114719. Most of these are again referenced by related/broader but not defined on their own.
  • To find other errors, we filter by word "broader/related". But there are cases when a related term is on a second line:
        idThes:x5814 
                     skos:related idThes:x112260,
                                  idThes:x112261.
    

    So we throw out terms that appear several times:

RS-1379 Josh:

  • fix thesauri/modification/RP: no prefLabel, this is required by RForm
  • Please check these 23 terms (all are in thesaurusandplace_1.trig)
    [^empty-BM-terms.txt]
  • If a term is not defined, is it worth keeping it as related/broader? Eg idThes:x103973 is not defined, so what does it mean to say that "Sa'a-Ulawa" is a sub-group thereof?
        idThes:x103967 a crm:E74_Group, skos:Concept;
                       skos:broader idThes:x103966, idThes:x103973;
                       skos:inScheme idThes:ethname;
                       skos:prefLabel "Sa'a-Ulawa";
    

Can we find this more easily with SPARQL? This finds all 250 terms without prefLabel, but doesn't say whether they are defined on their own:

Investigate Molecules

Reformat

Mitac changed the dump format:

<uri>, len=<length>, <phrase1>
 - <phrase2> ...

Put <phrase1> on its own bulleted line:

perl -ple "s/(len=\d+),\s+([^\n]+)/\1\n - \2/" molecules-myIndex.txt > a
mv a molecules-myIndex.txt
perl -ple "s/(len=\d+),\s+([^\n]+)/\1\n - \2/" molecules-thesIndex.txt > a
mv a molecules-thesIndex.txt

(This alternative is slightly broken in case len=0):

perl -pe "s{(len=\d+),\s+}{$1\n - }" molecules-myIndex.txt > a
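
A minimal Python sketch of the same reformatting, in case the one-liners need to be made robust for len=0. It assumes a header line is `<uri>, len=<length>` optionally followed by `, <phrase1>`, and that continuation lines start with a space. The function name and the example.org URIs are made up for illustration.

```python
import re

# Header line: "<uri>, len=<length>" optionally followed by ", <phrase1>".
# Assumption: continuation lines (" - <phrase>") start with whitespace.
HEADER = re.compile(r"^(?P<head>\S+\s+len=\d+)(?:,\s*(?P<phrase>.*?))?\s*$")

def reformat_line(line: str) -> str:
    """Move <phrase1> of a header line onto its own ' - ' line."""
    m = HEADER.match(line)
    if not m:
        return line                 # not a header: keep as-is
    if not m.group("phrase"):       # len=0 or no phrase: emit no bullet
        return m.group("head")
    return f"{m.group('head')}\n - {m.group('phrase')}"

if __name__ == "__main__":
    # Hypothetical sample lines, not taken from the real dump.
    for sample in ["http://example.org/obj/1, len=12, first phrase",
                   " - second phrase",
                   "http://example.org/obj/2, len=0, "]:
        print(reformat_line(sample))
```

Lines are expected without their trailing newline (eg from `line.rstrip("\n")`).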

Molecule Lengths

  • Extract lengths, sort by len (descending):
    grep "^http.*len=[^0]" molecules-myIndex.txt | sort -nrs -t = -k 2 > molecule-sizes-20121219.txt
    
  • Check molecules of identical size (especially large ones) for common junk
    • Previously:
      260722: RFI42168 RFI42169 RFI42170 RFI42171 RFI42172 RFI42173 RFI42174... (167 objects)
      124704: PPA356887 PPA356889 PPA356890 PPA356892 PPA356893 PPA356895... (106 objects)
      ...
    • Currently:
      199486: RFC40130 RFC40131 RFC40132 RFC40133 RFC40134 RFC40135 RFC40136... (229 objects)
      172289: EPF109974 EPF109976 EPF109978 EPF109979 EPF109980 EPF109981... (126 objects)
      ...
  • Count repeated sizes
    perl -ne "/len=(.*\n)/; print $1" molecule-sizes-20121219.txt | uniq -dc | sort -nr > repeated-sizes.txt
    
    • the top-3 repeated sizes are
      1059: YCA71368 YCA71442 YCA71829... (775 objects)
      891: YCA70997 YCA71927 YCA37921... (610 objects)
      889: YCA71479 YCA71936 YCA10263... (598 objects)
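
The grouping and counting above can also be sketched in Python, which gives the object ids per size in one pass. This assumes header lines of the form `<uri>, len=<length>` and takes the last URI segment as the object id. Function names and the example.org URIs are hypothetical.

```python
import re
from collections import defaultdict

# Header line: "<uri>, len=<length>..." (the comma after the uri is optional).
LEN_LINE = re.compile(r"^(http\S+?),?\s+len=(\d+)")

def group_by_size(lines):
    """Map molecule length -> object ids (last URI segment), skipping len=0."""
    groups = defaultdict(list)
    for line in lines:
        m = LEN_LINE.match(line)
        if m and int(m.group(2)) > 0:
            groups[int(m.group(2))].append(m.group(1).rsplit("/", 1)[-1])
    return groups

def repeated_sizes(groups):
    """(count, size, ids) for sizes shared by several objects, most frequent first."""
    rep = [(len(ids), size, ids) for size, ids in groups.items() if len(ids) > 1]
    return sorted(rep, key=lambda t: (-t[0], -t[1]))

if __name__ == "__main__":
    # Hypothetical sample, not real dump content.
    sample = ["http://example.org/id/object/A1, len=889, foo",
              "http://example.org/id/object/A2, len=889, bar",
              "http://example.org/id/object/A3, len=1059, baz"]
    for count, size, ids in repeated_sizes(group_by_size(sample)):
        print(f"{size}: {' '.join(ids)} ({count} objects)")
```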

Extract Molecule

Get molecule of specified objects

  • Long repeated sizes are due to leaks:
    perl fts-get.pl RFC40130 RFC40131 RFC40132 RFC40133 RFC40134 RFC40135 RFC40136
    perl fts-get.pl EPF109974 EPF109976 EPF109978 EPF109979 EPF109980 EPF109981
    

    Searching with one of these (eg RFC40130) returns all 229

  • Short repeated sizes show no commonality (the repetition is a coincidence)
    perl fts-get.pl YCA71368 YCA71442 YCA71829
    perl fts-get.pl YCA71479 YCA71936 YCA10263
    

Investigate Common Junk

The investigation started because:

  • The FTS molecules of all BM objects are between 149998 and 152029 bytes (there are even exact matches!).
  • BM objects are bigger than RKD, which are 8-22k.

Neither of these is normal. I guessed that a large amount of common junk text is collected, together with a small amount of per-object specific text. I.e. we have an FTS leak: starting from object1, the properties that we chase lead into a sub-object of another object2, and from there to all objects of the same collection.

GAA21182 and GAA21183

(this is old)

  • The FTS molecule is one long sequence of words.
  • Luckily the order is approximately the same, else I couldn't have compared it
  • First split it into shared and specific parts (lines) using emacs M-x compare-windows, then sorted
    The wdiff program is also useful, see below
  • Then compared, to extract the specific parts
  • Extracted text from TTL
  • Made summary table showing all specific parts and TTL text.
    First char is a code: the comment applies to all lines with same first char (no char means it's ok, i.e. in both TTL and FTS specific part)
object1 | object2 | comment (applies to all rows with the same first char)
http://collection.britishmuseum.org/id/object/GAA21182 | http://collection.britishmuseum.org/id/object/GAA21183 |
len=150307 | len=150081 | FTS len. Most of this is common junk
!1882 | !1978 | in TTL and in FTS common part, checked manually
!Acquisition date :: 1882 :: | !Acquisition date :: 1978 :: |
!Consists of :: glass | !Consists of :: glass |
!Purchased from :: Chester, Greville John :: | !Donated by :: Roberts, V G :: |
?BM-GAA21182 | ?BM-GAA21183 | NuxeoID generated by Kasabov
1882,0510.17 | 1978,0818.2 |
2.50 | |
3.00 | |
448765 | 448764 |
Dimension | |
Dimension :: 2.50cm :: | |
Dimension :: 3.00cm :: | |
GAA21182 | GAA21183 |
Object type :: bead :: | Object type :: necklace :: |
Opaque black glass disc bead with a goose or swan stamped... | Necklace of thirty-seven blue glass beads (twenty-two cube.. |
Subject :: bird :: | Subject :: mammal :: |
Uses technique :: stamped :: | |
>bead | >neck-ornament | thesaurus term. TODO check all: skos:altLabel or skos:broader?
>cm | >necklace |
>inscription | >xian lian |
>stamped | |
>Length | |
>Width | |
-BM Dimension | | thesaurus name. TODO investigate, should not appear (skos:inScheme is not traversed)
-BM Inscription Type | |
-BM Technique | |
-The British Museum TECHNIQUE Concept Scheme | |
-QUDT Unit | |
.Information Object | | class name. Doesn't hurt and cannot discard it (rdfs:label is traversed even for rdf:Class)
.Man-Made Feature | |
.Measurement Unit | |
.Physical Feature | |
.Visual Item | |

Here's a diff, keep in mind that the truncated lines are often several kb long

This statistic shows 99% common junk:

GAA21182.txt: 24914 words : 24838 99% common : 56 0% deleted  : 20 0% changed
GAA21183.txt: 24871 words : 24838 99% common : 21 0% inserted : 12 0% changed

I was able to isolate the common-junk.txt by using commands like this several times:

The question is which is the sub-object that becomes shared between all objects

  • unfortunately it causes this leak:
    obj1 -> P12i -> obj1/acquisition -> P11<P12 -> seller/buyer=BM -> P12i -> obj2/acquisition
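
The leak path above can be reproduced on a toy graph. This is only a sketch: the four nodes and their edges are illustrative, and the acquisition->BM edge is written as P12 on the assumption that it is inferred from P11 (since P11 is a subproperty of P12).

```python
from collections import deque

# Toy graph: node -> list of (property, target). Illustrative only.
EDGES = {
    "obj1": [("P12i", "obj1/acquisition")],
    "obj1/acquisition": [("P12", "BM")],           # inferred from P11 < P12
    "BM": [("P12i", "obj1/acquisition"),
           ("P12i", "obj2/acquisition")],          # BM took part in every acquisition
    "obj2/acquisition": [("P12", "BM"), ("P12", "obj2")],
}

def molecule(start, edges, follow):
    """Nodes reachable from `start` when chasing only the properties in `follow`."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for prop, target in edges.get(node, []):
            if prop in follow and target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

if __name__ == "__main__":
    # Chasing P12i and P12 together pulls in the other object's acquisition.
    print(sorted(molecule("obj1", EDGES, {"P12i", "P12"})))
    # Chasing P12i alone stays inside obj1.
    print(sorted(molecule("obj1", EDGES, {"P12i"})))
```

With thousands of objects sharing BM as a participant, the first call would collect every acquisition in the collection, which matches the near-identical molecule sizes.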

RKD Objects

From the above I had a hunch the common junk comes (mostly?) from Acquisition.
All BM objects have the same acquirer (P22) and that's BM. So I found the RKD objects acquired by Mauritshuis:

05_BadendeSusana.xml.ttl:<obj/2926/collection/5/entry>
07_man_met_baret.xml.ttl:<obj/2946/collection/4/entry>
08_NicolaesTulp.xml.ttl:<obj/3048/collection/3/entry>
10_oude_vrouw.xml.ttl:<obj/2952/collection/3/entry>
11_Andromeda.xml.ttl:<obj/2940/collection/5/entry>
12_lachende_man.xml.ttl:<obj/3064/collection/7/entry>

Then I extracted the corresponding molecules (2926.txt, 2946.txt, 3048.txt) and one that's not acquired by Mauritshuis (53707.txt).
The hunch was confirmed: common junk is much higher between objects with common acquirer:

> wdiff -123s 2926.txt 3048.txt
2926.txt: 2020 words 1031 51% common 134 6% deleted 855 42% changed
3048.txt: 2582 words 1031 39% common 391 15% inserted 1160 44% changed
> wdiff -123s 2926.txt 2946.txt
2926.txt: 2020 words 1124 55% common 306 15% deleted 590 29% changed
2946.txt: 1884 words 1124 59% common 181 9% inserted 579 30% changed
> wdiff -123s 2926.txt 53707.txt
2926.txt: 2020 words 522 25% common 0 0% deleted 1498 74% changed
53707.txt: 1820 words 522 28% common 10 0% inserted 1288 70% changed
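
Where wdiff is not available, a rough equivalent of the "common" figures can be sketched with Python's difflib. Note that difflib's matching algorithm is not the one wdiff uses, so the counts may differ somewhat, and it can be slow on molecules of tens of thousands of words. The function name is made up.

```python
from difflib import SequenceMatcher

def common_word_stats(text_a: str, text_b: str):
    """Return (common words, % of a that is common, % of b that is common)."""
    words_a, words_b = text_a.split(), text_b.split()
    # autojunk=False: do not let difflib discard frequent words as "junk"
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    common = sum(block.size for block in matcher.get_matching_blocks())

    def pct(total):
        # integer percentage, rounded down (above, 1031 of 2582 is shown as 39%)
        return 100 * common // total if total else 0

    return common, pct(len(words_a)), pct(len(words_b))

if __name__ == "__main__":
    # Made-up word sequences, just to show the call.
    print(common_word_stats("alpha beta gamma delta", "alpha beta epsilon delta"))
```

Usage on real files would be along the lines of `common_word_stats(open("2926.txt").read(), open("3048.txt").read())`.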


Shadow Objects

See Business Properties#Shadow Object with Shared Images
RS-1375

  • len=199309: 229 objects "Photograph (black and white) from an album"
    RFC40130 RFC40131 RFC40132 RFC40133 RFC40134 RFC40135 RFC40136 ...
    Eg the first one includes foreign ids: BM-RFC40144, RFC40345, BM-RFC40143, BM-RFC40142...
  • len=172269: 126 photographic negatives by Kissling, Werner, Ruatoki
    EPF109974 EPF109976 EPF109978 EPF109979 EPF109980 EPF109981 EPF109982 EPF109994 EPF109995 EPF109996 EPF109997 EPF109998
  • len=109100: 133 lithograph prints by Raffet, Denis Auguste Marie (sometimes with coauthors)
    PPA368462 PPA368469 PPA368474 PPA368477 PPA368481 PPA368482 PPA368485 PPA368492 PPA368495 PPA368496 PPA368499 PPA368502
  • len=100297: 19 bracelets, gaming sticks etc by Northwest Coast Peoples
    ENA121769 ENA121770 ENA121789 ENA121790 ENA121791 ENA121828 ENA121829 ENA121830 ENA121831 ENA121832 ENA121834 EOC115652
    Not sure why this happens, the first 4 don't have any images

We hot-fixed this problem by deleting such shadow objects, which caused another bug: RKD images disappeared (their objects don't have BM as owner).

The same bug appears again.

Investigate Susana FTS Molecule

RS-981 Anna
After the common junk (see Investigate Common Junk) is eliminated, check the molecule for Susana to ensure it's the same as the free text in Turtle

Fixes

Focused FTS Indexing

RS-977
RS-1139
The problem was that the properties chased for the FTS index formed a loop, so text leaked from one object to another.
Mitac patched OWLIM's Lucene module to do the same as getCompleteMO:

  • use properties.txt to limit which properties are traversed
  • when it reaches a skos:Concept, it uses a much reduced set of properties (only labels)
  • if it hits FC70_Thing a second time, cuts off
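
The three rules can be sketched as a small traversal in Python. This is not Mitac's patch, just an illustration under assumptions: the graph, the property names (has_type, part_of, acquired_in, internal_note) and all labels are made up, and "cuts off" is read as "does not enter any FC70_Thing other than the start object".

```python
# Toy data: node -> list of (property, value, is_literal). All made up.
GRAPH = {
    "obj1": [("rdfs:label", "Bead", True),
             ("has_type", "term/bead", False),
             ("part_of", "obj2", False),
             ("acquired_in", "obj1/acquisition", False)],
    "obj1/acquisition": [("rdfs:label", "Purchase 1882", True),
                         ("internal_note", "not indexed", True)],
    "term/bead": [("skos:prefLabel", "bead", True),
                  ("skos:broader", "term/ornament", False)],
    "term/ornament": [("skos:prefLabel", "ornament", True)],
    "obj2": [("rdfs:label", "Necklace", True)],
}
TYPES = {"obj1": "FC70_Thing", "obj2": "FC70_Thing",
         "term/bead": "skos:Concept", "term/ornament": "skos:Concept"}
# Stand-in for properties.txt
ALLOWED = {"rdfs:label", "has_type", "part_of", "acquired_in", "skos:broader"}
LABELS = {"skos:prefLabel", "skos:altLabel", "rdfs:label"}

def focused_molecule(start, graph, types, allowed, label_props):
    """Collect literal texts reachable from `start` under the three rules."""
    texts, seen, stack = [], {start}, [start]
    while stack:
        node = stack.pop()
        # Rule 2: on a skos:Concept only label properties are used
        props = label_props if types.get(node) == "skos:Concept" else allowed
        for prop, value, is_literal in graph.get(node, []):
            if prop not in props:       # Rule 1: property whitelist
                continue
            if is_literal:
                texts.append(value)
            elif value not in seen and types.get(value) != "FC70_Thing":
                # Rule 3: the start node is the first Thing, no other is entered
                seen.add(value)
                stack.append(value)
    return texts

if __name__ == "__main__":
    print(focused_molecule("obj1", GRAPH, TYPES, ALLOWED, LABELS))
```

In this toy run "Necklace" (second Thing), "ornament" (reached only via skos:broader from a concept) and "not indexed" (property not whitelisted) stay out of the molecule.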

Misc Notes

  • Extracted the properties from all BM configs, put in [BMX Issues^BM-properties.xls], marked in red the ones I have objections about
    RS-934
  • To analyze all possible loops, involving not just properties but also their superproperties
    RS-680
  • Extracted sub-prop hierarchy from ecrm-current.ttl: [^ecrm_subPropertyOf.txt]
    • found a mistake, reported to ECRM mlist
    • the hierarchy is not very deep (3-4 levels)
    • Idea: extract with SPARQL query
  • Added RSO and BMO subproperties, check against properties in use (all RSO, and BM-properties.xls)
  • May have to distinguish properties for FullMO vs properties for FTS.
    Eg P4_has_time-span is not needed for FTS, because there's no point FTS-indexing a date string
  • Make RKD thesauri same as BM thesauri (skos:prefLabel instead of rdfs:label or P1/P3)
  • Specific properties
    • investigate where this comes from
      <object/PPA59031> crm:P15i_influenced <object/PPA59031/acquisition>
  • P138 creates a potential loop, since it is the inverse of P138i and thus also runs opposite to P138i's superproperty P67i:
    P67i_is_referred_to_by P70i_is_documented_in P138i_has_representation
    P138_represents
    • Can we possibly eliminate P3?
      • For their important labels, objects are supposed to use rdfs:label, and terms skos:prefLabel
      • We can get other interesting texts (eg bmo:PX_physical_description) explicitly
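
The loop analysis over superproperties mentioned above could be sketched like this: for every traversed property, take it and all its superproperties, and flag it if the inverse of any of them is also traversed. The hierarchy fragment below is typed in by hand for the P138 case only, not extracted from ecrm_subPropertyOf.txt, and the `traversed` set is a made-up example.

```python
def superproperties(prop, sub_of):
    """`prop` plus all its direct and indirect superproperties."""
    result, todo = {prop}, [prop]
    while todo:
        for parent in sub_of.get(todo.pop(), ()):
            if parent not in result:
                result.add(parent)
                todo.append(parent)
    return result

def loop_candidates(traversed, sub_of, inverse_of):
    """Pairs (p, q) of traversed properties where q is the inverse of p
    or of one of p's superproperties."""
    pairs = set()
    for p in traversed:
        for s in superproperties(p, sub_of):
            q = inverse_of.get(s)
            if q in traversed:
                pairs.add((p, q))
    return pairs

# Hand-typed fragment of the hierarchy (assumption, for illustration).
SUB_OF = {
    "P138_represents": ["P67_refers_to"],
    "P70_documents": ["P67_refers_to"],
    "P138i_has_representation": ["P67i_is_referred_to_by"],
    "P70i_is_documented_in": ["P67i_is_referred_to_by"],
}
INVERSE_OF = {
    "P138_represents": "P138i_has_representation",
    "P138i_has_representation": "P138_represents",
    "P67_refers_to": "P67i_is_referred_to_by",
    "P67i_is_referred_to_by": "P67_refers_to",
    "P70_documents": "P70i_is_documented_in",
    "P70i_is_documented_in": "P70_documents",
}

if __name__ == "__main__":
    traversed = {"P138_represents", "P67i_is_referred_to_by", "P70i_is_documented_in"}
    print(loop_candidates(traversed, SUB_OF, INVERSE_OF))
```

This only finds two-step loops (a property and an inverse). Longer loops through several different properties would need a search over the actual data or the property ranges.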