
analysis of matches, precision and recall

Intro

The paper "Thesaurus Alignment for Linked Data Publishing" describes the matching:

  • Matching is done using only the EN prefLabel (FR in case of RAMEAU), i.e. simple string matching
  • Uses these string measures: Hamming distance, N-Grams, Levenshtein, Jaro, Jaro-Winkler, SMOA
  • Computes closeness with every string measure, then averages the results. The purpose of this is unclear: the authors don't judge which measure(s) work better, so in effect the average just waters down the better measures with the worse ones
  • Reports 98% precision.
    It's surprisingly good news that so much can be accomplished with label matching alone.
  • About 42% of all terms are mapped to NALT (13.4k/31.9k = 42%)
  • Homonymy example showing the need for disambiguation (taking the context into account):
    Calice (Lat, Perianthe plant) vs Calices (FR, objets liturgiques)
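
To see what averaging does to the scores, here is a minimal Python sketch. It is not the paper's implementation: it uses only three of the listed measures (Hamming, character bigrams, Levenshtein), and the normalisations to a 0..1 similarity are assumptions made for illustration. The label pairs are taken from this page.

```python
from statistics import mean


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute), two-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def levenshtein_sim(a: str, b: str) -> float:
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def hamming_sim(a: str, b: str) -> float:
    """Share of positions holding the same character.

    Assumption: surplus length of the longer string counts as mismatches.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / longest


def bigram_sim(a: str, b: str) -> float:
    """Dice coefficient over the sets of character bigrams."""
    ga = {a[i:i + 2] for i in range(len(a) - 1)}
    gb = {b[i:i + 2] for i in range(len(b) - 1)}
    if not ga or not gb:
        return float(a == b)
    return 2 * len(ga & gb) / (len(ga) + len(gb))


MEASURES = (hamming_sim, bigram_sim, levenshtein_sim)


def averaged_closeness(a: str, b: str) -> float:
    """Plain average of all measures, computed on lower-cased labels."""
    a, b = a.lower(), b.lower()
    return mean(m(a, b) for m in MEASURES)


if __name__ == "__main__":
    for pair in [("jackfruit", "jack fruit"), ("health", "wealth")]:
        scores = {m.__name__: round(m(*pair), 2) for m in MEASURES}
        print(pair, scores, "average:", round(averaged_closeness(*pair), 2))
```

Under these assumptions the legitimate variant jackfruit / jack fruit gets a lower average than the wrong pair health / wealth, because the position-sensitive Hamming score drags down the Levenshtein score.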

RDF Files

Check basic matching:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX agrovoc: <http://aims.fao.org/aos/agrovoc/>
PREFIX nalt: <http://lod.nal.usda.gov/nalt/>
# FILTER(STRSTARTS(STR(?nalt),"http://lod.nal.usda.gov/nalt/")) doesn't work, not sure why
select * {
  ?a skos:exactMatch ?n.
  OPTIONAL {?a skos:prefLabel ?agro.   FILTER(lang(?agro)="en")}
  OPTIONAL {?n skos:prefLabel ?nalt.   FILTER(lang(?nalt)="en")}
} limit 10

Correspondence

I have a couple of questions, since I'm trying to assess the Recall of the method described in the paper.

  • Is this mapping made using the approach described in the paper Thesaurus Alignment for Linked Data Publishing (DC 2011)?
    • Yes
  • It seems this is the same mapping from 2011, since the number of exactMatches (13391) is very close to the one quoted in the paper.
    • Yes
  • Do you have any info about the growth of NALT?
    The paper says "30.3k concepts" while the NALT site says "51k Preferred Terms"
    If you rerun the matcher now, do you think it'd find significantly more matches?
    • We haven’t done a lot of matching in the past 10 months or so. We have been discussing this problem. As we don’t have an alignment management tool (we hope to get one into VocBench this year) it is not so easy to maintain the matches across evolving thesauri.

Manual Recall Estimation

The paper describes that about 42% of AGROVOC terms are mapped to NALT (13.4k/31.9k = 42%). This seemed a bit low for two thesauri that are dedicated to agriculture.

  • I've tried manual matching of 25 concepts from each thesaurus (the first descriptor term from each letter of the alphabet),
    and I get 12% matches
  • The matching ratio of 42% is significantly better (3.5x), which is quite interesting (i.e. I can't explain it)

From AGROVOC to NALT

For each letter, take the first term with status=Descriptor

N | AGROVOC | NALT | Note
1 | Aaptosyax grypus | |
2 | B-lymphocytes | |
3 | C3 plants | C3 plants |
4 | Dabs | | (UF Yellowtail flounder, a fish)
5 | eagles | |
6 | F1 hybrids | |
7 | GABA | |
8 | habitat improvement | | there is habitat conservation (and alts: preservation, protection, restoration)
9 | IAA | |
10 | Jacaranda | |
11 | Kabatiella | |
12 | La Pampa | |
13 | Macaca | |
14 | NAA (Naphthylacetic acid) | | even alt doesn't match
15 | Oak (tree) | | there is oak logs
16 | Padus | | (Pachnaeus is before it in alphabetical order: no match)
17 | Q fever | Q fever |
18 | Rabaulichthys altipinnis | |
19 | Saarland | | (first non-place is Sabal: no match)
20 | T-2 toxin | | (also Tabanidae: no match)
21 | Uaru amphiacanthoides | |
22 | Vaccination | vaccination |
23 | Wadi | |
24 | X ray irradiation | |
25 | Yaks | |

3/25=12%

From NALT to AGROVOC

For each letter, take the first preferred term (not in italic).
I skip US-specific organizations etc., eg U.S. Cooperative Extension Service

N | NALT | AGROVOC
1 | A-DNA |
2 | babassu oil |
3 | C3 plants | C3 plants
4 | Daily Reference Values |
5 | early childhood education |
6 | factor VIII |
7 | galactosides |
8 | H-Y antigen |
9 | ice milk |
10 | jackfruits | Jack fruit; jackfruit (tree)
11 | kallikreins |
12 | La Nina |
13 | macroalgae |
14 | nafcillin |
15 | oases |
16 | p-anisidine value |
17 | Q fever |
18 | radiation resistance |
19 | sacral spine |
20 | table wines |
21 | udic regimes |
22 | vaccination |
23 | waferboards |
24 | X-ray diffraction |
25 | yams | yams

3/25=12%

Analysis and Precision

Most of the matches (95%) are trivial, meaning the two matched labels are the same (case-insensitive comparison).
Analysis details:
AGROVOC-NALT-analysis.xls

Below we analyze the non-trivial matches.

Deleted Terms

Some matches (231 = 1.7%) involve old AGROVOC terms that have since been removed.

# select (count(*) as ?c) { 
select * {
  ?a skos:exactMatch ?n.
  OPTIONAL {?a skos:prefLabel ?agro.   FILTER(lang(?agro)="en")}
  OPTIONAL {?n skos:prefLabel ?nalt.   FILTER(lang(?nalt)="en")}
  FILTER (!BOUND(?agro) || !BOUND(?nalt))
} 

Eg here is one such error:

  <http://aims.fao.org/aos/agrovoc/c_9655> <http://www.w3.org/2004/02/skos/core#exactMatch>
     <http://lod.nal.usda.gov/nalt/1890> .

Non-Trivial Matches

Find nontrivial matches (the labels are different): 375 (2.8%)

select * {
  ?a skos:exactMatch ?n.
  {?a skos:prefLabel ?agro.   FILTER(lang(?agro)="en")}
  {?n skos:prefLabel ?nalt.   FILTER(lang(?nalt)="en")}
  FILTER (LCASE(?agro) != LCASE(?nalt))
} ORDER BY LCASE(?agro) LIMIT 100
  OFFSET 0 # then OFFSET 100, OFFSET 200, OFFSET 300
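
Once the query results are exported, the same split into trivial, non-trivial and label-less matches can be sketched in Python. This is an illustration only: the function and category names are made up here, the first three sample pairs appear on this page, and the last row is a hypothetical stand-in for a deleted term.

```python
from collections import Counter


def classify(agro_label, nalt_label):
    """Sort one exactMatch pair into the categories used on this page."""
    if agro_label is None or nalt_label is None:
        return "missing label"   # e.g. a deleted term without an EN prefLabel
    if agro_label.lower() == nalt_label.lower():
        return "trivial"         # same label, ignoring case (like LCASE above)
    return "non-trivial"


# Sample rows: (AGROVOC EN prefLabel, NALT EN prefLabel).
rows = [
    ("C3 plants", "C3 plants"),
    ("Vaccination", "vaccination"),
    ("macroclimate", "microclimate"),
    (None, "some NALT label"),   # hypothetical placeholder label
]

counts = Counter(classify(agro, nalt) for agro, nalt in rows)
for category, count in counts.items():
    print(f"{category}: {count} ({100 * count / len(rows):.0f}%)")
```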

Precision

We find 50 wrong matches (see next section)

  • 15 are due to a systemic error
  • 35 are due to misspelling-tolerant string metrics (Levenshtein and Jaro-Winkler). These introduce false positives, eg
    aviculture vs apiculture
    health vs wealth
    forest range vs forest ranger
    health care vs health card
    Qualite de la viande vs Qualite de la vie
  • The original paper reports 98% precision, i.e. 2% false positives (about 270), which were cleaned up with maybe 20 person-days of manual work
  • We still find that 11% of these were missed by the manual cleaning.
  • Especially in biology, there are many Latin terms that are similar, but mean different things (eg genus vs species, or unrelated species)
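
The percentages quoted on this page can be re-derived from the counts. The sketch below assumes that every percentage is relative to the 13391 exactMatches mentioned in the correspondence, and that the deleted-term and non-trivial queries select disjoint sets (one requires an unbound label, the other requires both labels), so the remainder are the trivial matches.

```python
EXACT_MATCHES = 13391       # exactMatch count quoted in the correspondence
DELETED = 231               # matches where a label is missing
NON_TRIVIAL = 375           # matches whose labels differ
REPORTED_PRECISION = 0.98   # precision reported by the paper


def percent(part, whole):
    return 100 * part / whole


# Assumption: whatever is neither label-less nor non-trivial is a trivial match.
trivial = EXACT_MATCHES - DELETED - NON_TRIVIAL

# False positives implied by the reported precision, before manual cleaning.
implied_false_positives = (1 - REPORTED_PRECISION) * EXACT_MATCHES

print(f"deleted terms: {percent(DELETED, EXACT_MATCHES):.1f}%")
print(f"non-trivial:   {percent(NON_TRIVIAL, EXACT_MATCHES):.1f}%")
print(f"trivial:       {percent(trivial, EXACT_MATCHES):.1f}%")
print(f"implied false positives: {implied_false_positives:.0f}")
```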

So what is better, to allow misspelling-tolerant metrics or not?

  • The Variant Spelling section of the Excel file shows 137 good matches due to such metrics
  • The original paper implies 270 wrong matches
  • My conclusion is that since thesaurus terms are not very likely to include misspellings, such metrics do more harm than good.
  • It's better to include explicitly legitimate spelling variants (eg behaviour-behavior, programme-program) than to allow random misspellings
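
A minimal Python sketch of that alternative: normalise known spelling variants, then require an exact match. The variant table is hypothetical and holds only the two examples named above; a real table would have to be curated. The function names are made up for this illustration.

```python
# Hypothetical variant table: maps a variant spelling to one canonical form.
SPELLING_VARIANTS = {
    "behaviour": "behavior",
    "programme": "program",
}


def normalize(label: str) -> str:
    """Lower-case the label and map each word to its canonical spelling."""
    words = label.lower().split()
    return " ".join(SPELLING_VARIANTS.get(word, word) for word in words)


def labels_match(agro_label: str, nalt_label: str) -> bool:
    """Exact match after normalisation; no tolerance for arbitrary edits."""
    return normalize(agro_label) == normalize(nalt_label)


if __name__ == "__main__":
    pairs = [
        ("behaviour", "behavior"),
        ("Programme", "program"),
        ("health", "wealth"),
        ("aviculture", "apiculture"),
    ]
    for agro, nalt in pairs:
        print(agro, "/", nalt, "->", labels_match(agro, nalt))
```

The listed variants match, while one-letter neighbours such as health / wealth do not, since nothing in the table relates them.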

Wrong Matches

AGROVOC term | NALT term | comments
agricultural economics | agricultural economist | unrelated
balanitis | balanites | Male genital disease vs tree species
baphia | raphia | legume vs palm
bidens pilosa | bidens | species vs genus. Appropriate is broadMatch not exactMatch
birnaviridae | barnaviridae | different viruses, see http://viralzone.expasy.org
capillaria hepatica | capillaria | species vs genus. Appropriate is broadMatch not exactMatch
chitosan | chitin | Chitosan is produced by deacetylation of chitin
chlamydomonadales | chlamydomonadaceae | order vs family. Appropriate is narrowMatch not exactMatch
clostridium butyricum | clostridium acetobutylicum | some systemic error (misaligned matches?)
clostridium pasteurianum | clostridium butyricum | systemic error
clostridium thermocellum | clostridium pasteurianum | systemic error
cofactors | clostridium thermocellum | chemical compound bound to a protein vs bacterium
dicentrarchus | decapterus | different family: Moronidae vs Carangidae (jack)
endrin | endria | organochloride insecticide vs (can't find in NALT?)
fumariaceae | funariaceae | herbaceous plants vs mosses
integrated land management | integrated weed management | unrelated
intracellular fluid | extracellular fluid | opposite
intraspecific hybridization | interspecific hybridization | opposite
irrigation equipment | fumigation equipment | unrelated
jordan river | jordan | river vs country
larix occidentalis | strix occidentalis | larch tree vs spotted owl
macroclimate | microclimate | opposite
percophidae | percopsidae | different order: Perciformes vs Percopsiformes
petrology | metrology | unrelated
portuguesa | portugal | Venezuelan region vs European country
puccinia graminis | puccinia coronata | some systemic error (misaligned matches?)
puccinia helianthi | puccinia graminis | systemic error
puccinia hordei | puccinia helianthi | systemic error
puccinia horiana | puccinia hordei | systemic error
puccinia melanocephala | puccinia horiana | systemic error
puccinia pelargonii zonalis | puccinia melanocephala | systemic error
puccinia recondita | puccinia polysora | systemic error
puccinia striiformis | puccinia sorghi | systemic error
pumping | jumping | unrelated
pyrrhocoris | puccinia striiformis | systemic error
pythium aphanidermatum | pyrrhocori | systemic error
pythium aphanidermatum | pyrrhocoris | systemic error
pythium butleri | pythium aphanidermatum | systemic error
radium | radio | unrelated
raillietia | raillietina | Different phylum: Arthropoda vs Platyhelminthes
retinoid | retina | chemical vs part of the eye
ribosomal rna | ribosomal dna | Ribonucleic acid vs Deoxyribonucleic acid
salts | sales | unrelated
selenium | helenium | chemical element vs herbaceous plant
sesamum angustifolium | solanum angustifolium | Unrelated: family Pedaliaceae vs Solanaceae
swine vesicular disease virus | human enterovirus b | Subtype vs Species. Appropriate is broadMatch not exactMatch
syrinx | larynx | syrinx is avian equivalent to the mammalian larynx. Appropriate is broadMatch not exactMatch
toxocara canis | toxocara cati | dog roundworm vs feline roundworm. closeMatch
trichothecium | trichothelium | Unrelated: class Ascomycetes vs Lecanoromycetes
urban development | human development | vaguely related
wast | past | unrelated