COMBINA-PT - Word Combinations in Portuguese Language
The main objective of the project Word Combinations in Portuguese (COMBINA-PT) was to establish a lexicon of significant lexical associations, based on a balanced corpus of Portuguese, using automatic computational processes followed by a manual revision of results.
The observation of corpus data enables the identification and analysis of complex patterns of word association, showing that the lexicon does not consist only of simple or compound lexical items but is also populated with numerous chunks that are more or less predictable, though not fixed (Firth, 1955). These word associations, uncovered through corpus analysis, are not immediately identified when one relies only on intuition-based studies. Moreover, since new word associations are regularly created, frequency data is essential for determining (together with other criteria of linguistic analysis) whether a particular group of words may be considered a collocation with a certain stability in the language. Although lexical collocations have been widely studied for many languages (several collocation dictionaries are available for English), this is an innovative resource for Portuguese.
The project followed a corpus-driven approach, in which corpus data is the primary motivation for identifying different types of lexical associations, using a broad concept of collocation as a starting point for a reflection on the typology of lexical associations in Portuguese. As a result, different types of groups were selected, ranging from totally frozen expressions to free combinations that nevertheless show an associative preference; in all cases, the elements of the group share a direct syntactic relationship. The results provide an essential empirical working basis for a broad-coverage typology, which will contribute to existing theoretical studies and to the ongoing PhD dissertation of one of the project members (Sandra Antunes) on the typology of lexical associations and their lexicographic treatment.
The following lexical associations are examples of the different types covered:
- frozen groups (e.g., patrão fora, dia santo na loja 'while the cat is away, the mice will play');
- semi-frozen groups whose meaning cannot be predicted from the meaning of their parts (e.g., esticar o pernil 'kick the bucket'), which are not subject to syntactic variation (e.g., internal modification *esticar o grande pernil 'kick the big bucket' or passivization *o pernil foi esticado 'the bucket was kicked') but allow inflectional variation (e.g., esticaram o pernil 'kicked the bucket');
- semi-frozen groups that can be compositional and in some cases semantically idiosyncratic, and that allow one of the collocates to be replaced by other words related to it through synonymy or hyperonymy/hyponymy (e.g., onda/maré/vaga de assaltos 'wave of robberies'; países/estados membros 'member states');
- sets of favoured co-occurring forms that nevertheless constitute syntactic dependencies. These expressions are semantically and syntactically compositional but statistically idiosyncratic: they are observed with much higher frequency than any alternative lexicalization of the same concept, which may indicate that they are on their way to becoming fixed (e.g., instaurar um processo 'to bring an action'; erros e imprecisões 'mistakes and imprecisions').
Since the extraction of lexical collocations must rely on a large collection of data, a balanced written corpus of 50 million words, the COMBINA corpus, was designed on the basis of the existing CRPC corpus.
The following table shows the constitution of the COMBINA corpus:
|Supreme Court Verdicts||313,962|
The extraction task was carried out with CLUL's software tool Concor.cb, which extracts groups of 2, 3, 4 and 5 words from the corpus, together with their concordances. Several cut-off options were implemented to eliminate groups with internal punctuation, word pairs whose first or last word is a grammatical word, identified with a stop-list (in case one wishes to rule out non-lexical associations), and groups below a selected minimum total frequency. Concor.cb evaluates the groups statistically, using the lexical association measure Mutual Information (Church & Hanks, 1990), and orders the results according to this measure.
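The extraction and ranking steps described above can be illustrated with a minimal Python sketch. This is not the Concor.cb implementation, which is not shown here: the function names, the stop-list and the punctuation set are hypothetical, the stop-list test is applied to groups of every size (the text mentions it for word pairs), and Mutual Information is computed only for pairs, using the standard pairwise formula log2(f(x,y)·N / (f(x)·f(y))). How the measure is generalized to longer groups is not specified in this description.

```python
import math
from collections import Counter

# Hypothetical stop-list and punctuation set, for illustration only.
STOP_WORDS = {"de", "a", "o", "da", "do", "e", "que", "em"}
PUNCTUATION = {".", ",", ";", ":", "!", "?", "--", "(", ")", '"'}


def extract_groups(tokens, sizes=(2, 3, 4, 5), min_freq=2, stop_words=STOP_WORDS):
    """Count contiguous groups of the given sizes, applying three cut-offs."""
    counts = Counter()
    for n in sizes:
        for i in range(len(tokens) - n + 1):
            group = tuple(tokens[i:i + n])
            # Cut-off 1: drop groups that contain a punctuation token.
            if any(tok in PUNCTUATION for tok in group):
                continue
            # Cut-off 2: drop groups whose first or last word is grammatical.
            if group[0].lower() in stop_words or group[-1].lower() in stop_words:
                continue
            counts[group] += 1
    # Cut-off 3: drop groups below the minimum total frequency.
    return {g: c for g, c in counts.items() if c >= min_freq}


def pairwise_mi(f_xy, f_x, f_y, n):
    """Pairwise Mutual Information: log2(f(x,y) * N / (f(x) * f(y)))."""
    return math.log2(f_xy * n / (f_x * f_y))


def rank_pairs_by_mi(tokens, groups):
    """Score the two-word groups by MI and sort them, highest first."""
    word_freq = Counter(tokens)
    n = len(tokens)  # corpus size, counted here over all tokens
    scored = [
        (g, pairwise_mi(c, word_freq[g[0]], word_freq[g[1]], n))
        for g, c in groups.items()
        if len(g) == 2
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Calling `extract_groups` on a tokenized text and passing the result to `rank_pairs_by_mi` yields the surviving pairs ordered by association strength.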
The following table presents an example of the results produced by the software Concor.cb:
|# 10 noite de consoada 1 eg(3) og(10) ic(8.588317) fg(10) fe(16971 2290575 52) N(50866984)|
|209764730||s da SIC -- que o transmitirá na||noite de consoada||-- tomam os se|
|209764737||Povinho" à droga, passando pela||noite de consoada,||a discoteca e|
|209764744||ulham presentes numa evocação da||noite de consoada.||À medida que|
|209764751||e vai continuar a trabalhar pela||noite de consoada||adentro. Texto|
|209764758||ezes, faltar alguma coisa para a||noite de consoada.||Ainda que o l|
|209764765||as. Saiu para a rua. Nem parecia||noite de consoada.||Aqui e ali, e|
|209764772||À memória vêm-lhe imagens de uma||noite de Consoada,||muito tradici|
|209764779||enor: ao falar, por telefone, na||noite de consoada,||no intervalo|
|209764786||a vida foi deslizando assim. Na||noite de Consoada,||porém, aconte|
|209764793||ário O ADEUS ÀS ARMAS Quando, na||noite de consoada,||se iniciou a|
The first line of the preceding table shows the following information:
- Total frequency of the group;
- Distance: groups of 2 tokens can be contiguous or be separated by a maximum of 3 tokens, while groups of more than 2 tokens are contiguous (first number after the collocation);
- Number of elements of the group (eg);
- Frequency of the group at a specific distance (og);
- Lexical association measure: automatically extracted groups are statistically analysed with the association measure Mutual Information (MI), which relates the frequency of each group in the corpus to the individual frequency of each word in the group (Church & Hanks 1989) (ic);
- Total frequency of the group in all occurring distances (fg);
- Frequency of each element of the group (fe);
- Total number of words in the corpus (N).
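As an illustration of the fields listed above, the following Python sketch splits such a header line into named values. The regular expression is inferred from the single example line shown in the table, so it is an assumption about the format and not a specification of Concor.cb's output; the function name and dictionary keys are hypothetical.

```python
import re

# Pattern inferred from the one header line shown in the text (assumption).
HEADER_RE = re.compile(
    r"#\s*(?P<freq>\d+)\s+(?P<group>.+?)\s+(?P<distance>\d+)\s+"
    r"eg\((?P<eg>\d+)\)\s+og\((?P<og>\d+)\)\s+ic\((?P<ic>-?\d+(?:\.\d+)?)\)\s+"
    r"fg\((?P<fg>\d+)\)\s+fe\((?P<fe>\d+(?:\s+\d+)*)\)\s+N\((?P<n>\d+)\)"
)


def parse_header(line):
    """Split a header line of the kind shown above into named fields."""
    match = HEADER_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognised header line: {line!r}")
    return {
        "total_frequency": int(match["freq"]),
        "group": match["group"].split(),
        "distance": int(match["distance"]),
        "elements": int(match["eg"]),
        "frequency_at_distance": int(match["og"]),
        "mutual_information": float(match["ic"]),
        "frequency_all_distances": int(match["fg"]),
        "element_frequencies": [int(x) for x in match["fe"].split()],
        "corpus_size": int(match["n"]),
    }


header = ("# 10 noite de consoada 1 eg(3) og(10) ic(8.588317) "
          "fg(10) fe(16971 2290575 52) N(50866984)")
fields = parse_header(header)
```

Each frequency in `element_frequencies` corresponds, in order, to one word of `group`.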
The following lines in the table above show the concordances of the collocation in the corpus, in KWIC (Key Word in Context) format, with the indexation code of the context.
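A KWIC display of this kind can be approximated with a few lines of Python. The sketch below is only an illustration: the function name and the `width` parameter are hypothetical, and a plain character offset stands in for the corpus indexation code, whose scheme is not described here. Matching ignores case, as in the example table, where both consoada and Consoada are retrieved.

```python
import re


def kwic(text, phrase, width=32):
    """Return (offset, left context, match, right context) per occurrence."""
    rows = []
    for m in re.finditer(re.escape(phrase), text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        rows.append((m.start(), left, m.group(0), right))
    return rows


sample = ("Saiu para a rua. Nem parecia noite de consoada. "
          "Na noite de Consoada, porém, aconteceu.")
for offset, left, match, right in kwic(sample, "noite de consoada", width=12):
    print(f"{offset:>4} | {left:>12} | {match} | {right}")
```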
In order to enable the representation of these multiword units and to offer a platform for user-friendly validation and lemmatization, a relational database was designed in SQL, with an interface in Access, which makes it possible to:
- automatically import the results of the tool Concor.cb;
- manually select significant collocations, while viewing concordances;
- manually eliminate, in the concordance list, contexts that do not correspond to the collocation being analysed;
- automatically revise the collocation frequency in a new database field (real frequency), when concordance lines are eliminated;
- lemmatize the results.
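The validation workflow listed above can be sketched with Python's built-in `sqlite3` module, used here as a stand-in for the project's SQL database and Access interface. The schema, table names and column names are hypothetical; the sketch only shows how a "real frequency" field could be recomputed whenever a concordance line is eliminated.

```python
import sqlite3


def build_db():
    """Create an in-memory database with a hypothetical two-table schema."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE collocation (
            id INTEGER PRIMARY KEY,
            form TEXT NOT NULL,
            extracted_freq INTEGER NOT NULL,
            real_freq INTEGER NOT NULL
        );
        CREATE TABLE concordance (
            id INTEGER PRIMARY KEY,
            collocation_id INTEGER NOT NULL REFERENCES collocation(id),
            context TEXT NOT NULL,
            eliminated INTEGER NOT NULL DEFAULT 0
        );
    """)
    return con


def import_group(con, form, contexts):
    """Store one extracted group together with its concordance lines."""
    cur = con.execute(
        "INSERT INTO collocation (form, extracted_freq, real_freq) VALUES (?, ?, ?)",
        (form, len(contexts), len(contexts)),
    )
    coll_id = cur.lastrowid
    con.executemany(
        "INSERT INTO concordance (collocation_id, context) VALUES (?, ?)",
        [(coll_id, c) for c in contexts],
    )
    return coll_id


def eliminate_context(con, concordance_id):
    """Mark a concordance line as rejected and refresh the real frequency."""
    con.execute(
        "UPDATE concordance SET eliminated = 1 WHERE id = ?", (concordance_id,)
    )
    con.execute(
        """
        UPDATE collocation
        SET real_freq = (SELECT COUNT(*) FROM concordance
                         WHERE collocation_id = collocation.id AND eliminated = 0)
        WHERE id = (SELECT collocation_id FROM concordance WHERE id = ?)
        """,
        (concordance_id,),
    )
```

Keeping the extracted frequency alongside the revised one preserves the tool's original count while recording the effect of manual validation.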
The collocations of a subset of lemmas (only nouns, adjectives, verbs and adverbs) were manually validated and organized. Since collocations show co-occurrence preferences and fixedness, they tend to occur in only some of the forms of the lemma, which makes a full lemmatization of the selected data impossible.
Thus, in a first level of analysis, the groups were indexed in order to identify an abstract form, which gathers all possible inflected forms and is called the "group lemma". In many cases, the collocation does not occur in any other inflected variant in the corpus and the group lemma is simply the form that occurred.
In a second level, the main lemma of the group is identified. Collocations are lemmatized according to the lemma under analysis, so they are not associated with all the word forms of the group. As an example, the groups posto de abastecimento ('gas station') and postos de abastecimento ('gas stations') are both associated with the group lemma POSTO DE ABASTECIMENTO ('gas station'). This group lemma is in turn associated with the main lemma ABASTECIMENTO ('supply'), since the group was selected during the analysis of that lemma. The following results are a partial example of the group lemma posto de abastecimento ('gas station'), showing the main lemma, the group lemma, the groups and their concordances.
|GROUP LEMMA: posto de abastecimento|
|Group: posto de abastecimento|
|num "Honda Civic", assaltaram o posto de abastecimento "Galp", i|
|riação, com carácter urgente, do posto de abastecimento. Há dez d|
|comercial portuguesa. Num outro posto de abastecimento local, os|
|, disse ao JN um dos clientes do posto de abastecimento. Mais far|
|carem-se propositadamente ao seu posto de abastecimento. Mas já h|
|ssaltaram, anteontem à noite, um posto de abastecimento "Mobil",|
|assim, o funcionário de um outro posto de abastecimento na zona d|
|Vilar Formoso, que dispõe de um posto de abastecimento, o gasóle|
|e abrigo que não têm telefone, o posto de abastecimento, o que po|
|Group: postos de abastecimento|
|, afectado significativamente os postos de abastecimento localiza|
|de adição decorrer nos próprios postos de abastecimento, mas à r|
|das autoridades em controlar os postos de abastecimento. Mas que|
|igando ao encerramento de alguns postos de abastecimento. Nas Ast|
|onível na esmagadora maioria dos postos de abastecimento, pelo me|
|o. As entidades exploradoras dos postos de abastecimento que, à d|
The same process was applied to collocations containing a verb, as in the following example with the lemma ABORDAR ('to approach') and the group lemma ABORDAR A QUESTÃO (lit. 'to approach the question'):
|GROUP LEMMA: abordar a questão|
|Group: abordar a questão|
|go com os distribuidores, há que abordar a questão com cuidado. S|
|ta secção deste trabalho, tentou abordar a questão, concentrando-|
|e se tratava de uma boa forma de abordar a questão. Desde o login|
|cias" como do PÚBLICO tiveram de abordar a questão do tratamento|
|arco António Costa, preferiu não abordar a questão, limitando-se|
|Group: abordou a questão|
|sições tomadas. Arouca também já abordou a questão e o pedido de|
|com a vereadora da Acção Social, abordou a questão, tendo sido in|
|nto nos seus clubes, Artur Jorge abordou a questão assim: "Estamo|
|que revelou grande ironia quando abordou a questão relacionada co|
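The two-level organization illustrated by these examples (main lemma, group lemma, group, concordances) can be represented as a nested mapping. The Python sketch below is a hypothetical data structure and not the project's database design; the function names are illustrative, and the contexts are copied from the example tables, truncated as they appear there.

```python
from collections import defaultdict


def build_index(records):
    """Nest records as main lemma -> group lemma -> group -> contexts."""
    index = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for main_lemma, group_lemma, group, context in records:
        index[main_lemma][group_lemma][group].append(context)
    return index


def group_lemma_frequency(index, main_lemma, group_lemma):
    """Total number of contexts over all inflected groups of a group lemma."""
    return sum(len(ctxs) for ctxs in index[main_lemma][group_lemma].values())


records = [
    ("abastecimento", "posto de abastecimento", "posto de abastecimento",
     "comercial portuguesa. Num outro posto de abastecimento local, os"),
    ("abastecimento", "posto de abastecimento", "postos de abastecimento",
     "das autoridades em controlar os postos de abastecimento. Mas que"),
    ("abordar", "abordar a questão", "abordar a questão",
     "e se tratava de uma boa forma de abordar a questão. Desde o login"),
    ("abordar", "abordar a questão", "abordou a questão",
     "sições tomadas. Arouca também já abordou a questão e o pedido de"),
]
index = build_index(records)
```

With this layout, the inflected groups recorded under a group lemma are the keys of `index[main_lemma][group_lemma]`, and their contexts can be counted together or separately.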
At the end of the project, 48,000 collocations had been selected, and 20,291 group lemmas had been created and associated with 1,170 main lemmas. This work on collocations is being followed up by a PhD dissertation that aims to study the typology of collocations and their lexicographic treatment, as well as to contrastively analyse collocations in spoken and written genres.
Project results can be consulted here.