Corpus Leiria (1991)


The data set available here is the corpus of Leiria (1991), "The acquisition, by speakers of European Portuguese as a second language, of the verbal aspect expressed by the preterite and imperfect", her Master's thesis presented to the FLUL.

Twenty years after its collection and analysis, the corpus remains a valuable empirical database for studies of the acquisition of Portuguese as a second language. It can also be used together with other, quite similar corpora that were collected later. Together they form a large database of written production by learners of Portuguese, in a formal context (data collected by CLUL, 2008-2009) and in a semi-formal context (data produced by learners of Portuguese who were living in Portugal and attending language classes when the data were collected; cf. Leiria (2001) and the data collected by CELGA).

Data were collected from the examination of the basic course of Portuguese (listening comprehension and written production) at the Department of Portuguese Language and Culture of the Faculty of Letters of Lisbon. In total, 218 documents were collected, written by 168 learners of Portuguese. At the time of collection, the students had been attending Portuguese classes for different periods of time, ranging from 1 to 4 semesters of formal instruction.


Since the data were collected as part of an examination, the collection guidelines are given in the description of each of the three tests considered. The informants had to retell one of three oral narratives. The first text comes from a folk tale that, as befits this subgenre, has a simple and even predictable structure, since it is common to traditional oral literature in other cultures as well. Texts 2 and 3 draw on extralinguistic knowledge of Portuguese socio-political matters. Their narrative structure is less predictable and therefore requires greater processing capacity (cf. Leiria 1991: 62). Each text was read twice by the teacher. The first reading was slightly slower than the second and was preceded by the title. The reading of text 2 was also preceded by a short introductory text (cf. text 2).

See the narratives and the guidelines used in this activity:

a) text 1
b) text 2
c) text 3

After collection, the data were transcribed, classified and organized in accordance with the guidelines set out below.

1. Data from informants

Informants were between 18 and 55 years old; 59% were female and 41% were male. Enrolment in the course only required students to be over 17 years old and to be literate in a Western language. As a result, the student population was highly diverse.

The informants are speakers of 16 different mother tongues, representing three language families: (i) Chinese, a Sino-Siamese language; (ii) Arabic, a Semitic language; (iii) the remaining 14, which are Indo-European languages (Romance, Slavic and Germanic).

The information about the other languages known by the informants was not taken into consideration since "what each one believes means to know a second language, and the attitude they manifest in relation to this knowledge, involves very different assumptions" (cf. Leiria 1991: 66).

Since the students were living in Portugal, they were exposed not only to formal input - 14 hours a week throughout the semester (November to February/February to May) - but also to informal input. Given the heterogeneity of the group, this informal input may have varied considerably.

2. Transcription standards

According to Leiria (1991), the transcription of L2 corpora should be handled like that of literary texts, since it requires the same level of detail. For this reason, the corpus was transcribed using some of the symbols of genetic and critical editing. The same conventions were also adopted in the later projects mentioned above, which makes the data sets compatible.

< xxx> crossed-out segments
<(...)> crossed-out, unreadable segments
<#> illegible words
< # > crossed-out, unreadable

3. Classification of texts collected

Each of the 218 documents in the corpus is identified by a number, followed in parentheses by a code describing the conditions under which it was produced:

a) Number of the stimulus text the informant was exposed to;
b) Informant's mother tongue;
c) Informant's number;
d) Learning time (in semesters).

The code 145 (3.AL8.98.2) indicates, for example, that document number 145 was produced from stimulus number 3 by an informant whose mother tongue is German (German is the eighth mother tongue in the organization of the corpus) and whose informant number is 98. At the time of the exam, the student had been enrolled in the Basic Course for two semesters.
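As an illustration only, the classification code can be split into its fields with a short Python sketch. The pattern below is an assumption inferred from the single example 145 (3.AL8.98.2): it supposes that the mother-tongue field is a letter abbreviation followed by the language's ordinal number. The names `DocCode` and `parse_code` are hypothetical and not part of the corpus.

```python
import re
from typing import NamedTuple


class DocCode(NamedTuple):
    """Fields of a document code (hypothetical structure, for illustration)."""
    document: int        # document number
    stimulus: int        # a) stimulus text (1, 2 or 3)
    language: str        # b) mother-tongue abbreviation, e.g. "AL"
    language_index: int  # b) position of the language in the corpus organization
    informant: int       # c) informant number
    semesters: int       # d) learning time, in semesters


# Assumed layout: "<doc> (<stimulus>.<letters><index>.<informant>.<semesters>)"
CODE_RE = re.compile(
    r"^\s*(\d+)\s*\(\s*(\d+)\.([A-Za-z]+)(\d+)\.(\d+)\.(\d+)\s*\)\s*$"
)


def parse_code(code: str) -> DocCode:
    """Parse a code such as '145 (3.AL8.98.2)' into its fields."""
    match = CODE_RE.match(code)
    if match is None:
        raise ValueError(f"unrecognized document code: {code!r}")
    doc, stim, lang, idx, informant, semesters = match.groups()
    return DocCode(int(doc), int(stim), lang, int(idx), int(informant), int(semesters))


example = parse_code("145 (3.AL8.98.2)")
print(example)
```

Codes in the corpus that do not follow the layout of this example would raise a `ValueError`, so the pattern should be checked against the actual files before being relied on.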


The corpus contains 218 written productions by 168 informants, speakers of 16 different mother tongues. The difference between the number of documents (218) and the number of informants (168) is due to the fact that some informants contributed more than one text, since they sat more than one written examination.

Use the filter system here to see the number of documents by stimulus, by mother tongue and by the length of time the informants had been learning Portuguese.

In total, the corpus has about 55,000 transcribed words.

1. Number of texts by informants' mother tongue

German      29     English          23
Arabic      18     Italian          10
Bulgarian    5     Dutch             5
Chinese     68     Norwegian         1
Danish       2     Persian          12
Spanish     14     Polish            2
French      10     Serbo-Croatian    2
Hindi       10     Swedish           7
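
The figures in this table can be checked against the totals given above (218 documents, 16 mother tongues). The following Python sketch simply copies the counts from the table; the variable names are illustrative and not part of the corpus.

```python
# Number of texts per mother tongue, copied from the table above.
texts_by_language = {
    "German": 29, "English": 23,
    "Arabic": 18, "Italian": 10,
    "Bulgarian": 5, "Dutch": 5,  # Dutch appears as "Nederlands" in some versions
    "Chinese": 68, "Norwegian": 1,
    "Danish": 2, "Persian": 12,
    "Spanish": 14, "Polish": 2,
    "French": 10, "Serbo-Croatian": 2,
    "Hindi": 10, "Swedish": 7,
}

total_texts = sum(texts_by_language.values())
total_languages = len(texts_by_language)

print(total_texts)      # 218, matching the number of documents in the corpus
print(total_languages)  # 16, matching the number of mother tongues

# Languages ordered from most to least represented.
for language, count in sorted(texts_by_language.items(), key=lambda kv: -kv[1]):
    print(f"{language:<15}{count:>4}")
```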


a) See a file sample: 145 (3.AL8.98.2)
b) Go to the full corpus: corpus_leiria1991