Hints and tips for working with Rembrandt (AdLib) XML
API
- Adlib API description: http://api.adlibsoft.com/site/
- Wietske/Jan: Would you like to use the Adlib API to query our databases directly? If this could be useful to you, please let us know so that we can give you access.
- Vladimir: if you have data for 12 paintings only, I think it'll be easier if we get it from the DebugXML link, or you just send it to us. Otherwise you'd have to teach us and we'd have to learn how to use the API.
XML Problems
Unstructured XML
Beware of XML files with uncorrelated multi-valued elements, eg as in 10 Objects RKDimages (Dutch).xml or 2927 Objectrecord RKDtechnical.xml:
Here you can't robustly say which enddate (einddatum_in_collectie) relates to which collection (collectienaam), so this XML is not usable.
Maybe this (at file end) shows that an improper Adlib export option was selected:
Some element types (but not all) include a counter (eg occurrence="29"). But regrouping the values by that counter would require XSLT or full DOM loading, so we can't process it.
It's much better if the XML is properly structured (like it was previously): an element enclosing the corresponding sub-elements (eg <collectie> enclosing <einddatum_in_collectie> and <collectienaam>)
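To illustrate the difference, here is a minimal Python sketch. The sample records and their values are made up for illustration; only the element names come from the notes above. With the nested form, each end date is unambiguously paired with its collection name. With the flat form we can only count the values, and the counts need not even match.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; only the element names are taken from the export.
NESTED = """<record>
  <collectie><collectienaam>A</collectienaam><einddatum_in_collectie>1900</einddatum_in_collectie></collectie>
  <collectie><collectienaam>B</collectienaam></collectie>
  <collectie><collectienaam>C</collectienaam><einddatum_in_collectie>1950</einddatum_in_collectie></collectie>
</record>"""

FLAT = """<record>
  <collectienaam>A</collectienaam><collectienaam>B</collectienaam><collectienaam>C</collectienaam>
  <einddatum_in_collectie>1900</einddatum_in_collectie><einddatum_in_collectie>1950</einddatum_in_collectie>
</record>"""

def pairs_from_nested(xml_text):
    """Return (collection name, end date or None) for each <collectie> wrapper."""
    root = ET.fromstring(xml_text)
    return [(c.findtext("collectienaam"), c.findtext("einddatum_in_collectie"))
            for c in root.iter("collectie")]

def counts_from_flat(xml_text):
    """Without wrappers we can only count the values, not pair them."""
    root = ET.fromstring(xml_text)
    return (len(root.findall("collectienaam")),
            len(root.findall("einddatum_in_collectie")))

nested_pairs = pairs_from_nested(NESTED)
flat_counts = counts_from_flat(FLAT)
print(nested_pairs)
print(flat_counts)
```

In the flat sample there are three names but only two end dates, so there is no way to tell which collection lacks an end date.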
XML Escaping
The content of [] is not properly escaped, eg:
- "list & lijden": the & should be escaped as &amp;
- The complex HTML excerpt in <file.application> doesn't have a single root: has a sequence of elements (first is <a>, last is <object>)
(Mariana) This file does not open, gives XML Parsing Error (not well-formed) in browser.
(Vlado) You can view it if you save it to disk and open with a text editor.
But the problem needs to be fixed before the file can be processed fully with XML tools.
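A quick way to check whether a file has this problem is to try parsing it. The sketch below uses Python's standard library; the <title> wrapper element is a hypothetical stand-in, not the real element name from the export.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def is_well_formed(xml_text):
    """Return True if the text parses as XML, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

raw_title = "list & lijden"

# <title> is a hypothetical element name used only for this illustration.
bad = "<title>" + raw_title + "</title>"           # bare ampersand
good = "<title>" + escape(raw_title) + "</title>"  # ampersand escaped as &amp;

print(is_well_formed(bad))
print(is_well_formed(good))
print(good)
```

Note that escape() is meant for the exporting side, where the raw text is still known. Applying it blindly to an already exported file would also escape the markup itself.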
Newlines
(cosmetic)
AdLib XML files don't have newlines between tags (a single huge line). Many XML tools don't mind, but you may want to fix it so you can read it more easily:
- insert newlines:
- open in browser (it indents and shows hierarchically), copy and paste as plain text
- in editor, eg by replacing "><" with ">\n<"
- for all files in the dir, using the power of pfind:
- indent properly, eg using indent-region in emacs
- indent properly using XSLT: the following tiny XSLT script (pretty.xsl) will pretty-print any XML
- Any xslt processor such as sabcmd will work
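As an alternative to the editor and XSLT options above, both steps can be sketched in a few lines of Python. This is only an illustration using the standard library; the one-line sample record is made up, and pretty() works only on well-formed input.

```python
from xml.dom import minidom

def split_tags(xml_text):
    # Crude fix: start a new line between adjacent tags ("><" becomes ">", newline, "<").
    return xml_text.replace("><", ">\n<")

def pretty(xml_text):
    # Proper indentation; requires the input to be well-formed XML.
    return minidom.parseString(xml_text).toprettyxml(indent="  ")

# Hypothetical single-line record, in the style of an AdLib export.
one_line = ("<record><priref>2926</priref><collectie>"
            "<collectienaam>A</collectienaam></collectie></record>")

print(split_tags(one_line))
print(pretty(one_line))
```

split_tags() is a plain string replacement, so it also works on files that an XML parser rejects because of the escaping problem.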
Structural Problems
I've been extracting thesauri from the XML data, using perl, grep etc. Eg here's how to extract the labels from the two nested <value> tags:
perl -0ne 'print qq{$1\t$2\n} while
  m{<research.reason_objective>\s*<.*?>(.*?)</.*?>\s*<.*?>(.*?)<.*?/research.reason_objective>}gs' \
  Rembrandt-data/xml/*.xml | sort | uniq
In about 200 places in 11 files, <research.reason_objective> has two <value> tags. But in file 07, the 16th occurrence has only one <value> tag!
I have rarely seen data to be as structurally dirty as RKD's data.
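Irregularities like the one above can be located with a small check. Here is a minimal Python sketch that lists every <research.reason_objective> that does not have exactly two <value> children. The sample data is made up; only the element names come from the notes above.

```python
import xml.etree.ElementTree as ET

def reason_objective_labels(xml_text):
    """Yield (occurrence index, list of <value> texts) per <research.reason_objective>."""
    root = ET.fromstring(xml_text)
    for i, elem in enumerate(root.iter("research.reason_objective"), start=1):
        yield i, [v.text for v in elem.iter("value")]

def irregular(xml_text, expected=2):
    """Return the occurrences whose number of <value> tags differs from expected."""
    return [(i, vals) for i, vals in reason_objective_labels(xml_text)
            if len(vals) != expected]

# Hypothetical sample: the second occurrence has only one <value>.
SAMPLE = """<recordList>
<record><research.reason_objective><value>restoration</value><value>restauratie</value></research.reason_objective></record>
<record><research.reason_objective><value>research</value></research.reason_objective></record>
</recordList>"""

problems = irregular(SAMPLE)
print(problems)
```

Unlike the perl regex, this requires well-formed XML, so files with the escaping problem have to be fixed first.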
Record versions
Downloadable versions from website
XML versions downloadable from website, as per Rembrandt data#Data Sources
- [] (got from Dominic, original name was "rdb-samplerecord_English translations.xml")
- [] (got from Jan Teuben@RKB, 10/4/2011, original name was "badendesuzanna_adlib.xml")
What I did:
- New: split lines per tag and indented properly
- Old: untabified (tab -> 2 spaces)
- Split some elements (all with value "RKD", etc) to 3 lines, so they compare better to New
- Split one line between tags
- compared with Araxis Merge
Differences:
- New: no English translations of the Dutch tags. But these are in our excel anyway
- New: content is not properly escaped
- New: includes more <value lang> variants
- Example1 (old had just "FRONT"). "0/1" are not proper languages, interpret as 0->en, 1->nl
- Example2 (old had just "RKD")
- happens for the following elements
<object.size.unit> <doc.size.unit> <file.image.location> <file.application.location> <file.spec.object_status> <file.spec.overall_detail> <file.spec.front_back> <sample.location.vert.start> <sample.location.hor.start>
- All of these are thesaurus values to be mapped to URI, so it doesn't change the mapping
- Now we have not just the code but also the titles
- Old: <plaats_tentoonstelling> was a number (eg 490: a bug we noticed earlier). New: now is a proper name, eg Berlijn (but is that from a thesaurus?)
- Old: <instelling_tentoonstelling> was empty, now is present (eg Gemäldegalerie)
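The interpretation of the "0/1" language codes mentioned above (0->en, 1->nl) could be applied as in the following Python sketch. The sample element and its titles are hypothetical; only the element name <file.spec.front_back> and the lang mapping come from these notes.

```python
import xml.etree.ElementTree as ET

# Interpretation from the notes: "0" and "1" are not proper language codes.
LANG_MAP = {"0": "en", "1": "nl"}

def values_by_language(elem):
    """Map each <value lang=...> child to its text, keyed by a proper language code."""
    result = {}
    for v in elem.findall("value"):
        lang = v.get("lang")
        # Unknown codes are kept as-is rather than guessed.
        result[LANG_MAP.get(lang, lang)] = v.text
    return result

# Hypothetical sample; the titles are made up for illustration.
SAMPLE = ('<file.spec.front_back>'
          '<value lang="0">Front</value><value lang="1">Voorzijde</value>'
          '</file.spec.front_back>')

labels = values_by_language(ET.fromstring(SAMPLE))
print(labels)
```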
Versions sent by Wietske
Wietske 4-Nov-2011: Exports of a representative set of records from both RKDimages and RKDtechnical. The Susanna .xml file that has been received earlier was based on a test version of the Rembrandt Database website and this record contains a lot of "fake" data and is not totally complete. Therefore we are now sending you more up to date records from RKDimages and RKDtechnical.
Formats:
- Wietske: The .csv file gives you the quickest overview, but it can probably not be used, because a lot of fields have more than one value, and this you do not see in the .csv file.
- Vlado: The .dat format is a line-oriented unstructured format; using 2-character tags that are much harder to understand.
We can't possibly figure out that "ad" means <opmerking_datering>, but with Google or Excel translate we can figure out that <opmerking_datering> means "remark on dating".
- So we have to use the XML format
Files:
- 10 Objects RKDimages (Dutch).xml: 10 sample records from RKDimages
- The records below are linked to each other: In the documentation records you will find references to the research record ID (called "priref" in our system); in the research record you will find the reference to the object record ID in RKDtechnical; and in the object record in RKDtechnical (which we will not use) you will find the reference to the object record ID in RKDimages.
- 2927 Objectrecord RKDtechnical.xml: a sample record from one of the 10 objects from RKDtechnical.
We will not present the data from this record (and in due time this record will be replaced completely by the object record in RKDimages). However, you need it to be able to see how the object ID in RKDtechnical matches the object ID in RKDimages.
- 2927 Researchrecord RKDtechnical 5000623.xml: "research record" related to this object from RKDtechnical
- 2972 Documentationrecords linked to 5000623.xml: a set of "documentation records" related to this research from RKDtechnical.
Vladimir 14-Nov-2011 analysis:
- Thanks! I see in the new Susanna data (priref=2926) that there is:
- no fake data/remarks
- more data repetitions, eg: bibliographic references, exhibitions...
- extra data elements, eg: <opm._datum_uit_collectie>, <fotograaf_instituut>, <herkomst_afbeelding>
- However, there are some problems:
- These files have the Unstructured XML problem, so we cannot use them. Can you reexport with a different AdLib option?
- Can you produce a single deeply nested XML per object, instead of disparate files? It's much easier for us to ingest
- Please make Research and Documentation for object.priref=2926 not 2927, since:
we've worked with Susanna thus far, and 2927 is not present in "10 objects"
- How about Files and Sampling? They were present in the old version, but I can't see them in this one.
- So this latest version has lots of potential, but will also take some time to ask questions, receive clarifications and figure it out.
- The deadline for our current iteration (RS3.1) is approaching fast (5-Dec-2011)
- So we have to continue with the old version, and come back to reconsider in the next iteration
Data availability
- We'd like to download with DebugXML from the old site, as per Rembrandt data#Data Sources
- But that site has been down for a while. 2 weeks ago Jan wrote that you're working to bring it back up
- Could you bring it back up, so we can obtain 10 more paintings? (we have Susanna and Herman Doomer)