We chose the IPTC categorization scheme, which is suitable for news articles (see https://confluence.ontotext.com/display/GSC/Document+Classification+Corpora).
To date, the categorization comprises 17 broad topics: Arts_Culture_Entertainment, Conflicts_War_Peace, Crime_Law_Justice, Disaster_Accident, Economy_Business_Finance, Education, Environment, Health, Human_Interest, Labor, Lifestyle_Leisure, Politics, Religion_Belief, Science_Technology, Society, Sports, Weather.
For the next development versions, it is possible (and desirable) to extend the categorization scheme by adding sub-categories, identical to or inspired by those of the IPTC. More refined categories give a more specific description of a document's topic, but can cause problems with model fitting.
A. A corpus consisting of long abstracts of DBpedia articles that belong to the 17 IPTC categories, as shown here: https://confluence.ontotext.com/display/GSC/Document+Classification+Corpora. The corpus is available in EN and BG.
B. A corpus obtained from the ACM classification system. It consists of titles and abstracts of scientific papers published by ACM. Here is the file. Each row starts with CCS, which is the root category of the tree. Tab-separated fields specify paths in the tree. The leaves are articles, given as a tab-separated title and abstract. Example of articles in the category:
CCS -> General and reference -> Cross-computing tools and techniques -> Metrics
|Measured impact of crooked traceroute||Data collected using traceroute-based algorithms underpins research into the Internet's router-level topology, though it is possible to infer false links from this data...|
|Semantic mining on customer survey||Business intelligence aims to support better business decision-making. Customer survey is priceless asset for intelligent business decision-making....|
|Predicting software complexity by means of evolutionary testing||One characteristic that impedes software from achieving good levels of maintainability is the increasing complexity of software...|
|Runtime monitoring of software energy hotspots||GreenIT has emerged as a discipline concerned with the optimization of software solutions with regards to their energy consumption....|
|Structured merge with auto-tuning: balancing precision and performance||Software-merging techniques face the challenge of finding a balance between precision and performance...|
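The row format described for the ACM corpus can be read with a few lines of Python. This is only a minimal sketch: it assumes that each row holds the category path followed by exactly two leaf fields (title, then abstract), which the description above suggests but does not state outright. The names `AcmRecord` and `parse_acm_row` are hypothetical and are not part of any existing tool.

```python
from typing import NamedTuple, Tuple


class AcmRecord(NamedTuple):
    path: Tuple[str, ...]  # category path, starting at the root "CCS"
    title: str
    abstract: str


def parse_acm_row(row: str) -> AcmRecord:
    """Split one tab-separated row into category path, title, and abstract."""
    fields = row.rstrip("\n").split("\t")
    # Assumption: the last two fields are the article's title and abstract,
    # and everything before them is the category path rooted at "CCS".
    if len(fields) < 3 or fields[0] != "CCS":
        raise ValueError("expected: CCS<TAB>category...<TAB>title<TAB>abstract")
    *path, title, abstract = fields
    return AcmRecord(tuple(path), title, abstract)


# Example row assembled from the category and first article shown above.
row = "\t".join([
    "CCS",
    "General and reference",
    "Cross-computing tools and techniques",
    "Metrics",
    "Measured impact of crooked traceroute",
    "Data collected using traceroute-based algorithms underpins research "
    "into the Internet's router-level topology, though it is possible to "
    "infer false links from this data...",
])
record = parse_acm_row(row)
print(" -> ".join(record.path))
print(record.title)
```

If the actual file places the title and abstract differently (for example on separate rows), only the unpacking line in `parse_acm_row` needs to change.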