Methodology

Clear functional requirements

Building a solution based on semantics starts from the client's requirements: what they want to achieve with the solution, what their business objectives are, and which business problems they want to solve. In the context of Media Graph, we aim to facilitate the management of large volumes of image briefs and captions, so that silos of images can be searched and navigated intelligently.
The first step of solution development is to define the so-called functional requirements, that is, what the system is supposed to accomplish. Functional requirements are expressed in the form "system must do <requirement>" and specify the particular results the system must produce. They drive the application architecture of the system. Examples of functional requirements are specific kinds of smart search (faceted, full-text search (FTS), etc.), content enrichment, and data aggregation.
For example, through content enrichment based on semantic text analysis, media content editors can dynamically create new content-based image captions or descriptions, while users benefit from smart content search.

Clear annotation types

Having clear annotation types is another important prerequisite for the annotation process. The Subject Matter Expert (SME), or someone else who is very familiar with the specifics of the domain, defines the annotation types based on empirical observations of the data and the content.

Annotation types (AT) are abstract descriptions of certain mentions, used for marking spans of text, i.e. recognising mentions of a person, organisation, location, date, etc. within a text. An AT may have two parts:

Initial Corpus

The corpus is a collection of documents, which can be in different formats. The ones we support here are XML, HTML, TXT and CSV. Depending on the annotation task, these texts should be sampled so that the corpus is representative and balanced. This means that the corpus should contain all types of texts (categories) present in the particular domain (e.g. for the news domain, it should contain texts about general news, social life, economy, finance, religion, sport, celebrities, etc.), and the proportion of each text type should reflect its share in real-life usage.
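The idea of a balanced corpus can be illustrated with a minimal sketch of proportional sampling. It is not the procedure used in the project; the function name sample_balanced, the category labels and the share values are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def sample_balanced(docs, shares, total, seed=0):
    """Draw about `total` documents so that each category keeps its share.

    docs   -- list of (doc_id, category) pairs (hypothetical input format)
    shares -- category -> assumed share in real-life usage (should sum to 1)
    """
    rng = random.Random(seed)

    # Group the available documents by category.
    by_category = defaultdict(list)
    for doc_id, category in docs:
        by_category[category].append(doc_id)

    sample = []
    for category, share in shares.items():
        pool = by_category.get(category, [])
        # Never ask for more documents than the category actually has.
        quota = min(len(pool), round(total * share))
        sample.extend(rng.sample(pool, quota))
    return sample


# Illustrative data: 100 documents in each of three news categories.
docs = [(f"{cat}-{i}", cat) for cat in ("general", "economy", "sport") for i in range(100)]
shares = {"general": 0.5, "economy": 0.3, "sport": 0.2}
initial_corpus = sample_balanced(docs, shares, total=20)
```

If a category has fewer documents than its quota, the sketch simply takes all of them, so the resulting sample can be smaller than requested; a real sampling step would need a policy for that case.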

We usually start a new annotation task by creating an initial corpus of a small number of documents. In this way we are able to see how well the annotation task and the initial guidelines work and, if necessary, adjust the text analysis component, the guidelines or the text collection before adding more documents to our corpus.

There is no fixed size that the corpus must reach in order to give good results, as this depends largely on how complex the annotation task is. Usually, however, we use between 100 and 500 documents with examples for evaluation and between 700 and 2000 documents for the machine learning component.

Initial annotation guidelines

We need to create initial annotation guidelines to steer our text analytics tasks. Depending on the domain and the complexity of the task, this can be done automatically or manually.

Automatic approach

Based on observation of the documents and the data, the text analysis expert creates the initial model for the phenomena (a software text-analysis component, either ML or rule-based) associated with the task we are trying to solve. This way the first annotation guidelines are automatically available. They describe the way the corpus should be annotated with the features in the model.

Manual approach

Based on observation of the documents and the data, and of the cases in which entities appear in the text or the context in which mentions of the annotation types appear, the manual annotation (MA) experts create the initial annotation guidelines. During the manual annotation process, they enrich and refine the guidelines with specific use cases.

Semantic Annotation Cycle

The semantic annotation cycle consists of the following steps:

Step 1: The initial set of documents is loaded and the project annotation schema (annotation types, features, values, etc.) is applied. A good annotation schema and accurate annotations are critical for the machine learning component, which relies on data outside of the text itself.
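To make the notion of an annotation schema (annotation types, features, values) more concrete, here is a minimal sketch of a schema and a check of one annotation against it. The SCHEMA contents, the Annotation class and the validate function are hypothetical and are not taken from the project.

```python
from dataclasses import dataclass, field

# Hypothetical schema: annotation type -> feature name -> allowed values.
SCHEMA = {
    "Person": {"gender": {"male", "female", "unknown"}},
    "Organisation": {"kind": {"company", "agency", "other"}},
    "Location": {},
    "Date": {},
}


@dataclass
class Annotation:
    type: str    # annotation type, e.g. "Person"
    start: int   # offset of the first character of the span
    end: int     # offset just past the last character of the span
    features: dict = field(default_factory=dict)


def validate(annotation, text, schema=SCHEMA):
    """Return a list of problems; an empty list means the annotation fits the schema."""
    if annotation.type not in schema:
        return [f"unknown annotation type: {annotation.type}"]

    problems = []
    if not (0 <= annotation.start < annotation.end <= len(text)):
        problems.append("span is outside the text")

    allowed = schema[annotation.type]
    for name, value in annotation.features.items():
        if name not in allowed:
            problems.append(f"unknown feature: {name}")
        elif value not in allowed[name]:
            problems.append(f"invalid value for {name}: {value}")
    return problems


text = "Reuters photographed Paris."
ok = validate(Annotation("Organisation", 0, 7, {"kind": "agency"}), text)
bad = validate(Annotation("Person", 0, 7, {"gender": "robot"}), text)
```

Running such a check when documents are loaded is one way to catch annotations that drift away from the agreed schema before they reach the machine learning component.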

Step 2: Automatic annotation is performed, based on the initial model of the phenomena (the extraction pipeline). This creates a pre-annotated corpus augmented with higher-level information from components such as tokenizers, sentence splitters, part-of-speech taggers, gazetteers, PER/ORG/LOC grammars, etc. Adding such information to a corpus allows the computer to find features that can make the defined task easier and more accurate.
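As a rough illustration of pre-annotation, the sketch below implements only one of the components listed above, a gazetteer lookup, using plain string matching. It is a simplification under stated assumptions: the GAZETTEER entries and the pre_annotate function are hypothetical, and a real extraction pipeline would combine several components.

```python
import re

# Hypothetical gazetteer: surface form -> annotation type.
GAZETTEER = {
    "Reuters": "ORG",
    "Angela Merkel": "PER",
    "Berlin": "LOC",
}


def pre_annotate(text, gazetteer=GAZETTEER):
    """Mark every whole-word gazetteer match in `text` as an annotation."""
    annotations = []
    # Try longer entries first so they win over shorter overlapping ones.
    for entry in sorted(gazetteer, key=len, reverse=True):
        pattern = r"\b" + re.escape(entry) + r"\b"
        for match in re.finditer(pattern, text):
            start, end = match.span()
            overlaps = any(start < a["end"] and a["start"] < end for a in annotations)
            if not overlaps:
                annotations.append(
                    {"type": gazetteer[entry], "start": start, "end": end, "text": match.group()}
                )
    # Return the annotations in document order.
    return sorted(annotations, key=lambda a: a["start"])


sentence = "Angela Merkel met Reuters reporters in Berlin."
pre_annotated = pre_annotate(sentence)
```

The result is a list of typed spans with character offsets, which is the kind of pre-annotated material that can then be passed on for manual curation.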

Step 3: The pre-annotated corpus is then sent to MA experts for curation. A well-defined manual curation process is essential to ensure that all automatically pre-annotated entries are handled in a consistent manner. This process consists of several steps:

Step 4: Based on observations of these pre-annotated documents and the data, and of the cases in which entities appear in the text or the context in which mentions of the annotation types appear, the MA experts revise the initial annotation guidelines and enrich them with specific use cases.

Step 5: A manually annotated corpus is created. It is further divided into two parts.
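The text does not state here what the two parts are. The sketch below assumes, in line with the corpus sizes given under Initial Corpus, that one part is held out for evaluation and the other is used by the machine learning component. The function name split_corpus and the 20% evaluation share are assumptions for illustration.

```python
import random


def split_corpus(doc_ids, evaluation_share=0.2, seed=0):
    """Shuffle the documents and split them into (evaluation, training) parts.

    evaluation_share is an assumed value, not one prescribed by the methodology.
    """
    rng = random.Random(seed)
    shuffled = list(doc_ids)   # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * evaluation_share)
    return shuffled[:cut], shuffled[cut:]


# Illustrative corpus of 1000 manually annotated documents.
corpus = [f"doc-{i}" for i in range(1000)]
evaluation_part, training_part = split_corpus(corpus)
```

Fixing the seed keeps the split reproducible, so that evaluation results from different runs are measured on the same documents.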