compared with
Current by laura.tolosi
on Jun 03, 2015 17:09.

h2. Categorization scheme

We chose the [IPTC | http://show.newscodes.org/index.html?newscodes=medtop&lang=en-GB] categorization scheme, suitable for news articles.
A. A popular categorization is based on the [IPTC | http://show.newscodes.org/index.html?newscodes=medtop&lang=en-GB] categorization scheme, which is suitable for news articles. Many of our competitors base their categorization on the IPTC standard. Advantage: it describes news well. Disadvantage: it is a flat categorization with no levels, and it is too targeted at news.

To date, the categorization comprises 17 broad topics: Arts_Culture_Entertainment, Conflicts_War_Peace, Crime_Law_Justice, Disaster_Accident, Economy_Business_Finance, Education, Environment, Health, Human_Interest, Labor, Lifestyle_Leisure, Politics, Religion_Belief, Science_Technology, Society, Sports, Weather.

For the next development versions, it is possible (and desirable) to extend the categorization scheme by appending sub-categories that are identical to or inspired by those of the IPTC. More refined categories give a more specific description of a document's topic, but they can make model fitting harder.

B. For unsupervised approaches, where the categories are not specified a priori, one can use ontology terms, such as DBpedia categories, of various degrees of specificity.

h2. Corpora
A. A corpus consisting of the long DBpedia abstracts of articles that belong to the 17 IPTC categories, as shown here: https://confluence.ontotext.com/display/GSC/Document+Classification+Corpora . The corpus is available in EN and BG.

* An ensemble model, which combines a gazetteer and a classifier. The classifier outputs "yes" or "no" for each category. It is based on a small number of features, up to 30. It also includes a reduced language model that maps words to categories via hash codes.
* An unsupervised model that works with the tagged entities in a document and tries to find DBpedia supercategories that cover those entities well. Unwanted aspect: very broad supercategories such as "Living_people" are output very often and are unspecific. The approach is promising, but a specificity score for the output categories must be introduced.
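The specificity problem of the unsupervised model can be illustrated with a minimal sketch. It assumes one possible scoring rule that is not taken from the actual system: each candidate supercategory is scored by its coverage of the document's entities multiplied by an IDF-like specificity term. The function name, the entity identifiers, and the category sizes below are all hypothetical placeholders.

```python
import math

def rank_supercategories(doc_entities, entity_categories, category_sizes, total_entities):
    """Rank categories by (coverage of the document's entities) x (IDF-like specificity)."""
    hits = {}
    for entity in doc_entities:
        for cat in entity_categories.get(entity, ()):
            hits[cat] = hits.get(cat, 0) + 1
    ranked = []
    for cat, count in hits.items():
        coverage = count / len(doc_entities)
        # Broad categories contain many entities, so their specificity is close to 0.
        specificity = math.log(total_entities / category_sizes[cat])
        ranked.append((coverage * specificity, cat))
    return [cat for _, cat in sorted(ranked, reverse=True)]

# Placeholder data (invented for illustration, not real DBpedia counts).
entity_categories = {
    "Footballer_A": {"Living_people", "Football_players"},
    "Footballer_B": {"Living_people", "Football_players"},
}
category_sizes = {"Living_people": 900_000, "Football_players": 5_000}

ranking = rank_supercategories(
    ["Footballer_A", "Footballer_B"], entity_categories, category_sizes, 1_000_000
)
```

Both categories cover every entity in this toy document, so coverage alone cannot separate them. The specificity term is what moves the narrower category ahead of "Living_people".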

Features:
* We currently use stopword elimination, stemming, and a bigram model for feature extraction.
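The feature pipeline above could be sketched as follows. This is only an illustration under stated assumptions: the stopword list is a tiny made-up sample, `toy_stem` is a crude stand-in for a real stemmer (for example Porter), and emitting unigrams alongside bigrams is a guess, since the page does not say which n-grams are kept.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption; the real list is not given).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "are", "on"}

def toy_stem(token):
    """Crude suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def extract_features(text):
    """Count stemmed unigrams and bigrams after stopword removal."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [toy_stem(t) for t in tokens if t not in STOPWORDS]
    # Bigrams are built after stopword removal, so they can span removed words.
    bigrams = [f"{a}_{b}" for a, b in zip(stems, stems[1:])]
    return Counter(stems + bigrams)

features = extract_features("The markets are falling in the banking sector")
```

Because stopwords are dropped before pairing, "falling in the banking" contributes the bigram `fall_bank` here. Whether the real pipeline pairs words before or after stopword removal is not stated.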
Algorithm:
* Multi-label classification is achieved by training K independent classifiers (perceptrons or sigmoid perceptrons), one for each of the K possible labels. Each classifier answers the question: how likely is it that sample x has label l, as opposed to not having it? After all K classifiers are trained, the labels with the highest likelihoods for a sample form its predicted label set. A rule of thumb decides how many labels are returned.
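The one-classifier-per-label scheme described above can be sketched in plain Python. This is a minimal illustration, not the production code: the training loop is ordinary stochastic gradient ascent on a sigmoid unit, and the `threshold` and `max_labels` parameters are hypothetical stand-ins for the rule of thumb, which the page does not specify.

```python
import math
import random

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

class SigmoidPerceptron:
    """One binary unit: scores how likely it is that a label applies to x."""

    def __init__(self, n_features):
        self.w = [0.0] * n_features
        self.b = 0.0

    def score(self, x):
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)) + self.b)

    def fit(self, X, y, epochs=200, lr=0.5, seed=0):
        rng = random.Random(seed)
        order = list(range(len(X)))
        for _ in range(epochs):
            rng.shuffle(order)
            for i in order:
                err = y[i] - self.score(X[i])  # log-likelihood gradient
                self.w = [wi + lr * err * xi for wi, xi in zip(self.w, X[i])]
                self.b += lr * err

def train_one_vs_rest(X, Y, labels):
    """Train one independent classifier per label; Y[i] is the label set of X[i]."""
    models = {}
    for label in labels:
        clf = SigmoidPerceptron(len(X[0]))
        clf.fit(X, [1.0 if label in ys else 0.0 for ys in Y])
        models[label] = clf
    return models

def predict_labels(models, x, threshold=0.5, max_labels=3):
    """Keep the top-scoring labels above threshold (hypothetical rule of thumb)."""
    ranked = sorted(((m.score(x), l) for l, m in models.items()), reverse=True)
    chosen = [l for s, l in ranked if s >= threshold][:max_labels]
    return chosen or [ranked[0][1]]  # always return at least the best label

# Toy data: three binary features, one indicative word per topic.
X = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
Y = [{"Sports"}, {"Politics"}, {"Health"}, {"Sports", "Politics"}]
models = train_one_vs_rest(X, Y, ["Sports", "Politics", "Health"])
both = predict_labels(models, [1, 1, 0])
single = predict_labels(models, [0, 0, 1])
```

Because the K classifiers are trained independently, their scores are not normalized against each other, which is why some cut-off rule is needed to turn a ranking into a label set.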