Annotation Layers
In the GMB, annotations are provided automatically by the underlying toolchain. They can be corrected via the GMB Explorer (you need to be logged in to do so). There are two main types of annotation: (1) segmentation into word and sentence tokens; and (2) various tags for word tokens.
Segmentation
To change the boundaries of words and sentences, click on the [tokens] tab in the GMB Explorer. You will see the current segmentation of the document: sentence tokens are separated by new lines, and word tokens by single spaces. Once you click on [Edit], you can adjust the segmentation of both words and sentences. This is done by selecting one of four options for each character: start of a sentence; start of a token; inside a token; or outside a token. Multi-word expressions can also be segmented in this way (locations such as New~York, organizations such as National~Basketball~Association, and titles such as Prime~Minister). In the case of haplographic full stops (a single period that both ends an abbreviation and closes the sentence), the period is not separated from the abbreviation.
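The four-way choice per character can be sketched in a few lines of Python. This is only an illustration, not the GMB toolchain: the one-letter codes S, T, I and O are names chosen here for the four options, and the sketch assumes that a multi-word expression is obtained by marking its internal space as "inside a token", which is then shown as ~ as in the examples above.

```python
from typing import List

# Hypothetical one-letter codes for the four options described above.
SENTENCE_START = "S"  # first character of a sentence (and of its first token)
TOKEN_START = "T"     # first character of a further token in the sentence
INSIDE = "I"          # continues the current token
OUTSIDE = "O"         # belongs to no token (typically whitespace)


def segment(text: str, labels: str) -> List[List[str]]:
    """Rebuild sentences (lists of tokens) from one label per character."""
    if len(text) != len(labels):
        raise ValueError("need exactly one label per character")
    sentences: List[List[str]] = []
    for char, label in zip(text, labels):
        if label == SENTENCE_START:
            sentences.append([char])        # new sentence, new token
        elif label in (TOKEN_START, INSIDE):
            if not sentences:
                raise ValueError("token material before any sentence start")
            if label == TOKEN_START:
                sentences[-1].append(char)  # new token, same sentence
            else:
                sentences[-1][-1] += char   # extend the current token
        elif label != OUTSIDE:
            raise ValueError(f"unknown label: {label!r}")
    return sentences


def render(sentences: List[List[str]]) -> str:
    """One sentence per line, tokens separated by single spaces."""
    return "\n".join(
        " ".join(token.replace(" ", "~") for token in sentence)
        for sentence in sentences
    )


text = "I left. New York waits."
#         I     _     left     .     _     New York     _     waits     .
labels = "S" + "O" + "TIII" + "T" + "O" + "SIIIIIII" + "O" + "TIIII" + "T"
print(render(segment(text, labels)))
# I left .
# New~York waits .
```

Changing a single label is enough to move a boundary: relabelling the space inside "New York" as O and the following "Y" as T would split the expression into two tokens.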
We use the Penn Treebank part-of-speech tagset, as listed in Ann Taylor, Mitchell Marcus and Beatrice Santorini (2003): The Penn Treebank: An Overview, Section 1.1.
The annotation scheme for named entities in the GMB distinguishes the following eight classes:
- Person (PER) - Person entities are limited to individuals that are human or have human characteristics, such as divine entities.
- Location (GEO) - Location entities are limited to geographical entities such as geographical areas and landmasses, bodies of water, and geological formations.
- Organization (ORG) - Organization entities are limited to corporations, agencies, and other groups of people defined by an established organizational structure.
- Geo-political Entity (GPE) - GPE entities are geographical regions defined by political and/or social groups. A GPE entity subsumes and does not distinguish between a city, a nation, its region, its government, or its people (LOC•ORG).
- Artifact (ART) - Artifacts are limited to manmade objects, structures and abstract entities, including buildings, facilities, art and scientific theories.
- Event (EVE) - Events are incidents and occasions that occur during a particular time.
- Natural Object (NAT) - Natural objects are entities that occur naturally and are not manmade, such as diseases, biological entities and other living things.
- Time (TIM) - Time entities are limited to references to certain temporal entities that have a name, such as the days of the week and months of a year. For all other temporal expressions the tagging layer timex is used (see below).
Time expressions and numericals are tagged on separate layers: timex and numex (Palmer and Day, 1997). The timex layer divides time expressions into Date and Time. The numex layer classifies numericals as Percentage or Money.
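For scripts that process these layers, the label sets above can be kept in a small lookup table. The following is a minimal sketch: the tags and class names are copied from the text, while the layer key "ne" and the helper `is_known_label` are hypothetical names introduced only for this example.

```python
from typing import Dict, Set

# Named-entity tags and class names, copied from the list above.
NE_CLASSES: Dict[str, str] = {
    "PER": "Person",
    "GEO": "Location",
    "ORG": "Organization",
    "GPE": "Geo-political Entity",
    "ART": "Artifact",
    "EVE": "Event",
    "NAT": "Natural Object",
    "TIM": "Time",
}

# Labels per tagging layer. "ne" is a hypothetical key for the named-entity
# layer; "timex" and "numex" are the layer names used in the text.
LAYERS: Dict[str, Set[str]] = {
    "ne": set(NE_CLASSES),
    "timex": {"Date", "Time"},
    "numex": {"Percentage", "Money"},
}


def is_known_label(layer: str, label: str) -> bool:
    """Return True if `label` is one of the labels listed for `layer`."""
    return label in LAYERS.get(layer, set())


print(NE_CLASSES["GPE"])                # Geo-political Entity
print(is_known_label("timex", "Date"))  # True
print(is_known_label("ne", "LOC"))      # False
```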
We use Combinatory Categorial Grammar (CCG; Steedman, 2001) for describing syntactic structure.
We use the C&C parser (Clark and Curran, 2004) trained on CCGbank (Hockenmaier and Steedman, 2007) for automatic annotation. Where the parser makes mistakes, manual correction is not as straightforward as in the case of tags, because what needs to be fixed is a tree structure rather than individual word tags. Currently there is no interface for directly editing the tree structure.
However, many parsing errors are due to wrong part-of-speech tags. In these cases, the tree can be corrected simply by correcting the part-of-speech annotation.
If this does not suffice, then many parsing errors can be fixed by correcting the CCG categories (also called supertags) of individual tokens. The category of a token (or larger phrase) determines in what ways it can combine syntactically with other phrases and therefore constrains the set of possible derivations. GMB Explorer allows users to edit token categories and creates corresponding bits of wisdom. These are then sent back to the parser to influence the derivation it will produce.
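To see why a token's category constrains the derivation, consider the two basic CCG combination rules, forward application (X/Y Y => X) and backward application (Y X\Y => X). The sketch below is a simplified illustration and not the C&C parser: the names `Functor`, `parse` and `combine` are made up for this example, only these two rules are implemented, and atomic categories are compared as plain strings.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Functor:
    result: "Category"
    slash: str          # "/" takes its argument to the right, "\" to the left
    argument: "Category"


Category = Union[str, Functor]  # an atomic category is a plain string


def _strip_parens(text: str) -> str:
    # Remove outer parentheses only if they enclose the whole string.
    while text.startswith("(") and text.endswith(")"):
        depth = 0
        for i, ch in enumerate(text):
            depth += (ch == "(") - (ch == ")")
            if depth == 0 and i < len(text) - 1:
                return text  # the first parenthesis closes early
        text = text[1:-1]
    return text


def parse(text: str) -> Category:
    # Slashes associate to the left, so split at the LAST top-level slash.
    text = _strip_parens(text.strip())
    depth = 0
    for i in range(len(text) - 1, -1, -1):
        ch = text[i]
        if ch == ")":
            depth += 1
        elif ch == "(":
            depth -= 1
        elif ch in "/\\" and depth == 0:
            return Functor(parse(text[:i]), ch, parse(text[i + 1:]))
    return text


def combine(left: Category, right: Category) -> Optional[Category]:
    # Forward application: X/Y followed by Y yields X.
    if isinstance(left, Functor) and left.slash == "/" and left.argument == right:
        return left.result
    # Backward application: Y followed by X\Y yields X.
    if isinstance(right, Functor) and right.slash == "\\" and right.argument == left:
        return right.result
    return None  # the two categories cannot combine by application


transitive = parse(r"(S\NP)/NP")            # a verb that takes an object
verb_phrase = combine(transitive, "NP")     # verb + object
sentence = combine("NP", verb_phrase)       # subject + verb phrase
blocked = combine(parse(r"S\NP"), "NP")     # intransitive verb + object
print(sentence, blocked)
```

With the transitive category the verb consumes an object and then a subject, giving S. With the intransitive category the same word cannot take the object, so changing one token's category rules out a whole family of derivations.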
There are some limitations to this way of correcting syntactic annotation:
- It requires knowledge of CCG, in particular of CCGbank's flavor of it. Notably, this flavor includes feature annotations on some atomic categories, which are appended to them in square brackets.
- Due to the nature of the C&C parser, category bits of wisdom are currently treated as hints rather than hard constraints. The parser will sometimes appear to ignore them.
- Arbitrary CCG categories cannot currently be used. The parser's statistical model is limited to a finite inventory. Although Explorer allows the input of arbitrary strings as categories, it displays suggestions as you type. You should choose categories from these suggestions only.
- Lexical categories cannot disambiguate between all attachment decisions. For example, the CCG category of a preposition determines whether it attaches to an NP or to a VP, but if more than one NP is available to attach to, there is currently no way to choose between them.
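Several of these points can be made concrete with plain string handling. The sketch below rests on stated assumptions: `TOY_INVENTORY` is a tiny hypothetical stand-in for the parser's real category inventory (which is much larger and is what Explorer's suggestions draw on), and the two preposition categories are the usual CCGbank-style categories for NP and VP attachment, shown here without features.

```python
import re
from typing import List

# Feature annotations are appended to atomic categories in square brackets,
# e.g. S[dcl] for a declarative sentence.
FEATURE = re.compile(r"\[([^\]]+)\]")


def features(category: str) -> List[str]:
    """List the feature annotations that occur in a category string."""
    return FEATURE.findall(category)


def strip_features(category: str) -> str:
    """Return the category with all feature annotations removed."""
    return FEATURE.sub("", category)


# Hypothetical miniature inventory; the real one belongs to the parser model.
TOY_INVENTORY = {
    "N",
    "NP",
    "NP[nb]/N",
    r"S[dcl]\NP",
    r"(S[dcl]\NP)/NP",
    r"(NP\NP)/NP",
    r"((S\NP)\(S\NP))/NP",
}


def in_inventory(category: str) -> bool:
    """An arbitrary but well-formed category may still be unknown to the model."""
    return category in TOY_INVENTORY


# The preposition's category fixes the TYPE of phrase it modifies,
# but not WHICH phrase of that type.
PREPOSITION_ATTACHMENT = {
    r"(NP\NP)/NP": "attaches to a noun phrase",
    r"((S\NP)\(S\NP))/NP": "attaches to a verb phrase",
}

print(features(r"(S[dcl]\NP)/(S[b]\NP)"))   # ['dcl', 'b']
print(strip_features(r"(S[dcl]\NP)/NP"))    # (S\NP)/NP
print(in_inventory(r"(S[dcl]\NP)/NP"))      # True
print(in_inventory(r"(S[dcl]/NP)/NP"))      # False
```

In a sentence with two candidate noun phrases, both readings of the preposition would carry the same category (NP\NP)/NP, which is why editing categories alone cannot select between them.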
For a more extensive description of the resource, check the publications.