Portuguese Corpus Annotated for Modality
The MODAL corpus is a corpus of 160,000 tokens annotated with modality information. It was extracted from the written sub-part of the Reference Corpus of Contemporary Portuguese (CRPC) (Généreux et al., 2012), restricted to European Portuguese and excluding documents from Politics and Law to avoid formal language usage. The following categories are marked as potential triggers when they denote an event’s modality: nouns, verbs, adverbs, adjectives that are part of a verbal phrase, and the verb+preposition combinations ter de ‘must’ and haver de ‘have to’. The annotated corpus contains 3183 triggers but 3352 attributed modal values, because ambiguous triggers receive more than one value.
To have access to the corpus, please contact firstname.lastname@example.org.
Annotation scheme: modal values
Our scheme covers a total of 13 modal values: epistemic and its sub-values (knowledge, belief, doubt, possibility and interrogative), deontic and its sub-values (obligation, permission), participant-internal and its sub-values (capacity, necessity), volition, evaluation, effort and success. The typology closely follows linguistic proposals such as Palmer (1986). The inclusion of participant-internal values follows directly from the typology of van der Auwera and Plungian (1998); these values match what is called dynamic modality in other typologies (Palmer, 1986). However, contrary to van der Auwera and Plungian (1998), the scheme does not treat participant-external modality as an independent type, but rather as a sub-type of deontic modality. Other values, such as evaluation, effort and success, are inspired by other annotation schemes for modality (Baker et al., 2010).
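The typology above can be written down as a small lookup table. The sketch below is only illustrative: the dictionary layout, the `type_subvalue` label format and the helper name `leaf_values` are assumptions, not the corpus's actual storage format. It counts a sub-value where a type has sub-values and the type itself otherwise, which is one reading that yields the stated total of 13.

```python
# Modal values as listed in the scheme description. The dictionary layout
# is an illustrative assumption, not the format used in the corpus files.
MODAL_VALUES = {
    "epistemic": ["knowledge", "belief", "doubt", "possibility", "interrogative"],
    "deontic": ["obligation", "permission"],
    "participant-internal": ["capacity", "necessity"],
    "volition": [],
    "evaluation": [],
    "effort": [],
    "success": [],
}


def leaf_values(typology):
    """Return sub-values where a type has them, otherwise the type itself."""
    leaves = []
    for main_type, sub_values in typology.items():
        if sub_values:
            # Hypothetical label format: "<type>_<sub-value>"
            leaves.extend(f"{main_type}_{sub}" for sub in sub_values)
        else:
            leaves.append(main_type)
    return leaves


ALL_VALUES = leaf_values(MODAL_VALUES)
print(len(ALL_VALUES))
```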
Annotation scheme: components
The main components of the annotation scheme are:
- Trigger: the element conveying the modal value;
- Target: the expression in the scope of the trigger;
- Source of the event mention (speaker or writer);
- Source of the modality (agent or experiencer).
The annotation was performed over each trigger, and not over the global interpretation of the sentence. The trigger receives a modal value attribute, while both trigger and target are marked for polarity. Polarity here describes the positive or negative polarity of the marked element, not the polarity of the full sentence. A trigger may be marked with negative polarity because of a negation adverb, or the negative polarity can be expressed morphologically by an affix (improbable) or by the lexical verbal form itself (proibir ‘to forbid’).
Modal verbs may have more than one meaning and it is sometimes difficult to distinguish between those modal values, even when the annotator takes into consideration a larger context. To address this issue, the scheme includes an Ambiguity field, where the annotators can write down secondary meanings when present in a specific context. The annotator chooses the most salient meaning as the main modal value.
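The components described above (trigger, target, the two sources, polarity and the Ambiguity field) could be modelled roughly as follows. This is a minimal sketch: the class names, field names and label strings are hypothetical, and the example sentence is invented for illustration rather than taken from the corpus.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Markable:
    """A marked text span; only trigger and target carry polarity."""
    text: str
    polarity: Optional[str] = None  # "positive" or "negative"


@dataclass
class ModalEvent:
    """One annotated trigger with its linked components (hypothetical layout)."""
    trigger: Markable
    modal_value: str                          # most salient meaning
    target: Optional[Markable] = None
    source_of_event: Optional[Markable] = None
    source_of_modality: Optional[Markable] = None
    ambiguity: List[str] = field(default_factory=list)  # secondary meanings

    def all_values(self) -> List[str]:
        """Main modal value followed by any secondary readings."""
        return [self.modal_value] + list(self.ambiguity)


# Invented example: "O aluno não pode sair" ('The student cannot leave').
# The labels below are illustrative, not an annotation from the corpus.
event = ModalEvent(
    trigger=Markable("pode", polarity="negative"),  # negated by the adverb "não"
    modal_value="deontic_permission",
    target=Markable("sair", polarity="positive"),
    source_of_modality=Markable("O aluno"),
    ambiguity=["participant-internal_capacity"],
)
print(event.all_values())
```

Counting `all_values()` over every trigger is one way a corpus could end up with more modal values than triggers, as the figures reported for MODAL show.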
Tense is not annotated. There is therefore no annotation of the past tense (although it provides certainty about the realization of an event), of the future (possibility), or of the conditional (unless a conjunction introducing the conditional clause can be considered a trigger). Declaratives are also not tagged, although they have an epistemic reading of belief (and even factuality) (Palmer, 1986). Following Oliveira (1988), we consider declaratives to represent the unmarked level of modality, as they provide no evident trigger for the modal value. Evidentials are not annotated as a separate value either; instead they are marked as epistemic belief (supported by evidence).
The annotation was performed with the MMAX2 annotation software tool (Müller and Strube, 2006). The MMAX2 software is platform-independent, written in Java, and can be freely downloaded from http://mmax2.sourceforge.net/. MMAX2 offers a visual interface to annotate sentences by marking textual strings and creating links between the marked elements. MMAX2 enables the annotation of non-contiguous elements and produces stand-off XML annotation. The elements of the annotation consist of markables (namely the trigger, target, source of modality and source of event) that are linked to the same modal event.
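As a rough illustration of what reading such stand-off annotation might look like, the sketch below parses a small markable file with Python's standard library. The `span` notation (`word_5..word_7,word_9` for a range plus a non-contiguous token) follows the usual MMAX2 convention, but the namespace string and the attribute names `type`, `modal_value` and `polarity` are assumptions made for this example; the actual attribute names in the MODAL files may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical markable file content; attribute names are assumptions.
SAMPLE = """<markables xmlns="www.eml.org/NameSpaces/modality">
  <markable id="markable_1" span="word_4" type="trigger"
            modal_value="deontic_obligation" polarity="positive"/>
  <markable id="markable_2" span="word_5..word_7,word_9" type="target"
            polarity="positive"/>
</markables>"""


def expand_span(span):
    """Expand 'word_5..word_7,word_9' into a list of individual token ids."""
    ids = []
    for part in span.split(","):
        if ".." in part:
            start, end = part.split("..")
            prefix, first = start.rsplit("_", 1)
            last = end.rsplit("_", 1)[1]
            ids.extend(f"{prefix}_{i}" for i in range(int(first), int(last) + 1))
        else:
            ids.append(part)
    return ids


def read_markables(xml_text):
    """Return one dict per <markable>, with 'span' replaced by a token list."""
    root = ET.fromstring(xml_text)
    result = []
    for elem in root.iter():
        # Drop the '{namespace}' prefix that ElementTree adds to tag names.
        if elem.tag.rsplit("}", 1)[-1] == "markable":
            attrs = dict(elem.attrib)
            attrs["tokens"] = expand_span(attrs.pop("span"))
            result.append(attrs)
    return result


markables = read_markables(SAMPLE)
for m in markables:
    print(m["id"], m["type"], m["tokens"])
```

Because the annotation is stand-off, the token ids would still have to be resolved against the separate token file to recover the text of each markable.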
The annotation was performed by one annotator; all difficult cases were discussed with a second annotator and included in the guidelines. A study to measure the inter-annotator agreement (IAA) for the task of modality annotation was conducted over 50 sentences by two linguists, targeting two specific features of our scheme: trigger identification and modal value attribution. We computed IAA using the kappa statistic (Cohen, 1960) for each field in the annotation. For the Trigger the kappa value was 0.65, and for the accompanying Modal value a kappa of 0.85 was obtained, similar to the reported IAA for English (Matsuyoshi et al., 2010).
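For readers unfamiliar with the kappa statistic, it corrects observed agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below is a minimal implementation of the standard two-annotator formula; the function name and the toy labels are invented for illustration and are not the data from the study.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: proportion of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions, summed.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels (hypothetical): the annotators disagree on the third item only.
annotator_1 = ["epistemic", "deontic", "deontic", "volition"]
annotator_2 = ["epistemic", "deontic", "volition", "volition"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

In this toy case the observed agreement is 0.75 and the chance agreement is 0.3125, giving a kappa of about 0.636.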
In recent years there has been a clear trend in information extraction applications to go beyond the extraction of pure facts: to focus on personal opinions in sentiment analysis and opinion mining, to distinguish between factual and probable information, and to detect uncertainty, speculation and negation in biomedical text mining. This interest has resulted in several proposals for the annotation of modality, mainly for English. The MODAL corpus contributes a modality scheme for Portuguese and its application to a 160,000-token corpus.
The corpus has been used to develop an automatic tool for modality identification and labeling, in partnership with the Department of Informatics of the University of Évora with Paulo Quaresma, Teresa Gonçalves and João Sequeira (see publications).
(2012). Modality in Text: a proposal for corpus annotation. In Proceedings of the Eighth International Conference on Language Resources and Evaluation - LREC 2012, May 21-27 2012, Istanbul (pp. 1805-1812).