
Hints and tips for working with Rembrandt (AdLib) XML

Type | Name | Size | Creator | Creation Date | Comment
Microsoft Excel Sheet | 10 Objects RKDimages (Dutch).csv | 28 kB | Maria Todorova | Nov 08, 2011 13:13 |
File | 10 Objects RKDimages (Dutch).dat | 105 kB | Maria Todorova | Nov 08, 2011 13:13 |
XML File | 10 Objects RKDimages (Dutch).xml | 310 kB | Vladimir Alexiev | Nov 14, 2011 22:47 |
Microsoft Excel Sheet | 2927 Objectrecord RKDtechnical.csv | 1 kB | Maria Todorova | Nov 07, 2011 14:14 |
File | 2927 Objectrecord RKDtechnical.dat | 4 kB | Maria Todorova | Nov 07, 2011 14:14 |
XML File | 2927 Objectrecord RKDtechnical.xml | 20 kB | Vladimir Alexiev | Nov 14, 2011 22:47 |
Microsoft Excel Sheet | 2927 Researchrecord RKDtechnical 50... | 2 kB | Maria Todorova | Nov 08, 2011 13:13 |
File | 2927 Researchrecord RKDtechnical 50... | 1 kB | Maria Todorova | Nov 07, 2011 14:16 |
XML File | 2927 Researchrecord RKDtechnical 50... | 6 kB | Vladimir Alexiev | Nov 14, 2011 22:47 |
Microsoft Excel Sheet | 2972 Documentationrecords linked to... | 6 kB | Maria Todorova | Nov 07, 2011 14:16 |
File | 2972 Documentationrecords linked to... | 6 kB | Maria Todorova | Nov 08, 2011 13:14 |
XML File | 2972 Documentationrecords linked to... | 24 kB | Vladimir Alexiev | Nov 14, 2011 22:47 |

API

  • Adlib API description: http://api.adlibsoft.com/site/
  • Wietske/Jan: Would you like to use the Adlib API to query our databases directly? If this would be useful to you, please let us know so that we can give you access.
  • Vladimir: if you have data for 12 paintings only, I think it'll be easier if we get it from the DebugXML link, or you just send it to us. Otherwise you'd have to teach us and we'd have to learn how to use the API.

XML Problems

Unstructured XML

Beware of XML files with uncorrelated multiple values, eg this from 10 Objects RKDimages (Dutch).xml or 2927 Objectrecord RKDtechnical.xml:

Here you can't robustly say which enddate (einddatum_in_collectie) relates to which collection (collectienaam), so this XML is not usable.

Maybe this (at file end) shows that an improper Adlib export option was selected:

Some element types (but not all) include a counter (eg occurrence="29"). But that'd require XSLT or full DOM loading, so we can't process it.
It's much better if the XML is properly structured (like it was previously): an element enclosing the corresponding sub-elements (eg <collectie> enclosing <einddatum_in_collectie> and <collectienaam>)
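To illustrate what correlating by the counter would involve, here is a minimal Python sketch that groups flat sibling elements by their occurrence attribute. The sample record, its values and the helper name group_by_occurrence are invented for illustration. It assumes that equal occurrence numbers on different element types really belong together, which the export does not guarantee, and it loads the whole record into memory, which is the cost mentioned above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical flat record: tag names come from the text, values are invented.
SAMPLE = """<record>
  <collectienaam occurrence="1">Collection A</collectienaam>
  <collectienaam occurrence="2">Collection B</collectienaam>
  <einddatum_in_collectie occurrence="2">1850</einddatum_in_collectie>
</record>"""

def group_by_occurrence(record, tags):
    """Group the child elements named in `tags` by their occurrence attribute."""
    groups = defaultdict(dict)
    for child in record:
        occ = child.get("occurrence")
        # Elements without a counter cannot be correlated and are skipped.
        if child.tag in tags and occ is not None:
            groups[int(occ)][child.tag] = (child.text or "").strip()
    return dict(sorted(groups.items()))

record = ET.fromstring(SAMPLE)
collections = group_by_occurrence(
    record, {"collectienaam", "einddatum_in_collectie"}
)
for occ, fields in collections.items():
    print(occ, fields)
```

Element types that lack the counter fall through this grouping entirely, which is why a properly nested export remains the better fix.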

XML Escaping

The content of [] is not properly escaped, eg:

  • "list & lijden": the & should be written as &amp;
  • The complex HTML excerpt in <file.application> doesn't have a single root: has a sequence of elements (first is <a>, last is <object>)

(Mariana) This file does not open, gives XML Parsing Error (not well-formed) in browser.
(Vlado) You can view it if you save it to disk and open with a text editor.

But the problem needs to be fixed before the file can be processed fully with XML tools.
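As a stopgap until the export is fixed, bare ampersands could be escaped before parsing. The following is a minimal Python sketch, not a full repair: the function name escape_bare_ampersands and the <title> wrapper element are invented for illustration, and it does nothing about the multi-rooted HTML excerpt in <file.application>.

```python
import re
import xml.etree.ElementTree as ET

# Matches "&" that does not start a named or numeric entity reference.
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)")

def escape_bare_ampersands(text):
    """Replace unescaped ampersands with &amp;, leaving existing entities alone."""
    return BARE_AMP.sub("&amp;", text)

broken = "<title>list & lijden</title>"  # hypothetical element name
fixed = escape_bare_ampersands(broken)
print(fixed)

# The repaired string now parses, and the text comes back unescaped.
print(ET.fromstring(fixed).text)
```

The negative lookahead keeps the substitution safe to run more than once, since an already escaped &amp; is not touched.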

Newlines

(cosmetic)
AdLib XML files don't have newlines between tags (a single huge line). Many XML tools don't mind, but you may want to fix it so you can read it more easily:

  • insert newlines:
    • open in browser (it indents and shows hierarchically), copy and paste as plain text
    • in editor, eg by replacing "><" with ">\n<"
    • for all files in the dir, using the power of pfind:
  • indent properly, eg using indent-region in emacs
  • indent properly using XSLT: the following tiny XSLT script (pretty.xsl) will format any XML in a pretty form
    • Any XSLT processor, such as sabcmd, will work
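The same two steps (inserting newlines, then indenting) can also be sketched in Python with only the standard library. This is an alternative to the editor and XSLT routes listed above, not a replacement for them; the function names and the one-line sample record are invented for illustration.

```python
import re
from xml.dom import minidom

def split_tags(xml_text):
    """Put each tag on its own line, like replacing '><' in an editor."""
    return re.sub(r">\s*<", ">\n<", xml_text)

def pretty(xml_text):
    """Re-indent a well-formed document with two spaces per level."""
    return minidom.parseString(xml_text).toprettyxml(indent="  ")

# Hypothetical single-line record in the style of an AdLib export.
one_line = (
    "<record><priref>2926</priref><collectie>"
    "<collectienaam>X</collectienaam></collectie></record>"
)

print(split_tags(one_line))
print(pretty(one_line))
```

split_tags works on any text, but pretty needs well-formed input, so files with the escaping problem have to be repaired first.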

Structural Problems

I've been extracting thesauri from the XML data, using perl, grep etc. For example, here's how to extract labels from the two nested <value> tags:

# -0 sets the record separator to NUL, so each file is read as a single record
# and the match can span line breaks
perl -0ne 'print qq{$1\t$2\n} while
m{<research\.reason_objective>\s*<.*?>(.*?)</.*?>\s*<.*?>(.*?)<.*?/research\.reason_objective>}gs' \
Rembrandt-data/xml/*.xml | sort | uniq

In about 200 places in 11 files, <research.reason_objective> has two <value> tags. But in file 07, the 16th occurrence has only one <value> tag!
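An extraction that does not assume a fixed number of <value> tags avoids this kind of surprise. Below is a minimal Python sketch of that idea, swapped in for the perl regex rather than taken from it. The sample data and the function name reason_objectives are invented, and it assumes the input files are well-formed.

```python
import xml.etree.ElementTree as ET

# Invented sample: one element with two <value> children, one with only one.
SAMPLE = """<recordList>
  <record><research.reason_objective>
    <value>value A1</value><value>value A2</value>
  </research.reason_objective></record>
  <record><research.reason_objective>
    <value>value B1</value>
  </research.reason_objective></record>
</recordList>"""

def reason_objectives(root):
    """Yield one tuple of <value> texts per element, whatever the count."""
    for elem in root.iter("research.reason_objective"):
        yield tuple((v.text or "").strip() for v in elem.findall("value"))

# Deduplicate and sort, like "sort | uniq" in the shell pipeline.
rows = sorted(set(reason_objectives(ET.fromstring(SAMPLE))))
for row in rows:
    print("\t".join(row))
```

Rows with an unexpected number of values then show up as shorter or longer lines in the output instead of being silently skipped or mismatched.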

I have rarely seen data as structurally dirty as RKD's.

Record versions

Downloadable versions from website

XML versions downloadable from website, as per Rembrandt data#Data Sources

  • [] (got from Dominic, original name was "rdb-samplerecord_English translations.xml")
  • [] (got from Jan Teuben@RKB, 10/4/2011, original name was "badendesuzanna_adlib.xml")

What I did:

  • New: split lines per tag and indented properly
  • Old: untabified (tab -> 2 spaces)
    • Split some elements (all with value "RKD", etc) to 3 lines, so they compare better to New
    • Split one line between tags
  • compared with Araxis Merge

Differences:

  • New: no English translations of the Dutch tags. But these are in our excel anyway
  • New: content is not properly escaped
  • New: includes more <value lang> variants
    • Example1 (old had just "FRONT"). "0/1" are not proper languages, interpret as 0->en, 1->nl
    • Example2 (old had just "RKD")
    • happens for the following elements
      <object.size.unit> <doc.size.unit> <file.image.location> <file.application.location> <file.spec.object_status> <file.spec.overall_detail> <file.spec.front_back> <sample.location.vert.start> <sample.location.hor.start>
    • All of these are thesaurus values to be mapped to URI, so it doesn't change the mapping
    • Now we have not just the code but also the titles
  • Old: <plaats_tentoonstelling> was a number (eg 490: a bug we noticed earlier). New: now is a proper name, eg Berlijn (but is that from a thesaurus?)
  • Old: <instelling_tentoonstelling> was empty, now is present (eg Gemäldegalerie)
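The interpretation of the <value lang> codes proposed above (0 -> en, 1 -> nl) could be applied when reading these elements. The following Python sketch shows one way; the sample values and the function name labels_by_language are invented, and the mapping itself is the assumption stated in the list above, not something confirmed by RKD.

```python
import xml.etree.ElementTree as ET

# Assumed interpretation of the non-standard lang codes: 0 -> en, 1 -> nl.
LANG_MAP = {"0": "en", "1": "nl"}

# Invented sample; the element name is one of those listed above.
SAMPLE = """<file.spec.front_back>
  <value lang="0">FRONT</value>
  <value lang="1">VOORZIJDE</value>
</file.spec.front_back>"""

def labels_by_language(elem):
    """Map each <value> to a language code, keeping unknown codes unchanged."""
    labels = {}
    for value in elem.findall("value"):
        code = value.get("lang")
        labels[LANG_MAP.get(code, code)] = (value.text or "").strip()
    return labels

labels = labels_by_language(ET.fromstring(SAMPLE))
print(labels)
```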

Versions sent by Wietske

Wietske 4-Nov-2011: Exports of a representative set of records from both RKDimages and RKDtechnical. The Susanna .xml file that has been received earlier was based on a test version of the Rembrandt Database website and this record contains a lot of "fake" data and is not totally complete. Therefore we are now sending you more up to date records from RKDimages and RKDtechnical.

Formats:

  • Wietske: The .csv file gives you the quickest overview, but it probably cannot be used, because many fields have more than one value and the .csv file does not show this.
  • Vlado: The .dat format is a line-oriented unstructured format that uses 2-character tags, which are much harder to understand.
    We can't possibly figure out that "ad" means <opmerking_datering>, but given the XML tag, Google or Excel translate tells us that it means "remark on dating"
  • So we have to use the XML format

Files:

  • 10 Objects RKDimages (Dutch).xml: 10 sample records from RKDimages
  • The records below are linked to each other: In the documentation records you will find references to the research record ID (called "priref" in our system); in the research record you will find the reference to the object record ID in RKDtechnical; and in the object record in RKDtechnical (which we will not use) you will find the reference to the object record ID in RKDimages.

Vladimir 14-Nov-2011 analysis:

  • Thanks! I see in the new Susanna data (priref=2926) that there is:
    • no fake data/remarks
    • more data repetitions, eg: bibliographic references, exhibitions...
    • extra data elements, eg: <opm._datum_uit_collectie>, <fotograaf_instituut>, <herkomst_afbeelding>
  • However, there are some problems:
    • These files have the Unstructured XML problem, so we cannot use them. Can you reexport with a different AdLib option?
    • Can you produce a single deeply nested XML per object, instead of disparate files? It's much easier for us to ingest
      • Please make Research and Documentation for object.priref=2926 not 2927, since:
        we've worked with Susanna thus far, and 2927 is not present in "10 objects"
      • How about Files and Sampling? They were present in the old version, but I can't see them in this one.
  • So this latest version has lots of potential, but it will also take us some time to ask questions, receive clarifications and figure it out.
  • The deadline for our current iteration (RS3.1) is approaching fast (5-Dec-2011)
  • So we have to continue with the old version, and come back to reconsider in the next iteration

Data availability

  • We'd like to download with DebugXML from the old site, as per Rembrandt data#Data Sources
  • But that site has been down for a while. 2 weeks ago Jan wrote that you're working to bring it back up
  • Could you bring it back up, so we can obtain 10 more paintings? (we have Susanna and Herman Doomer)