Data collection for learning Portuguese as a second language

Isabel Leiria

Transcription, coding and data organization:
Rita Gonçalves

The corpus available here results from the project "Recolha de dados de aprendizagem de português língua estrangeira" (Data collection for learning Portuguese as a second language), carried out under a protocol between the Camões Institute and the Linguistics Center of the University of Lisbon.

The project's main goal was to collect data from foreign learners of Portuguese in order to create a database that can support research on Portuguese and, in particular, teacher training and the production of educational materials for Portuguese L2. The corpus now available is compatible with materials from other data collection projects, since the same collection methodology and the same transcription codes were used (see Collection of Corpora PL2 and Leiria 2006). It is a representative empirical data set, essential for research on the acquisition and learning of PL2.

From October 2008 to October 2010, data were collected from 397 Portuguese L2 learners. At the time of collection, the students were attending courses at different levels (A1-C1) at eighteen universities in the following countries: Austria, Bulgaria, South Korea, Spain, USA, France, India, Italy, Poland, United Kingdom, Czech Republic and Romania.

The following teachers from the Camões Institute collaborated in data collection in these countries: Ana Catarina de Castro, Ana Catarina de Matos, Ana Filipa Velosa, Ana Mendes e Land, Daniel Perdigão, Delfim da Silva, Francisco Nazareth, Hiteshkumar Chimanial Parmar, Joaquim Ramos, José Carlos Dias, Leonor Moura e Silva, Lia Ferreira, Mónica Pereira, Pedro Martins, Sandra Pinheiro and Vanessa Castagna.


Data were collected from participants following pre-defined guidelines that included (i) a language profile questionnaire, to which a number was assigned; and (ii) the text, identified with the informant's number. This coding system allowed us to identify texts written by the same informant, who only had to fill out the language profile questionnaire once. Once received, the materials were transcribed, coded and organized.

1. Data from informants

First, we created an Excel file bringing together the sociolinguistic data of the participants. The information, organized by the universities where the collection took place, includes the following items, which can be searched individually through the filter system:

a) personal data – age, sex, nationality, course and year of study, and university where the course is attended;

b) language background – mother tongue (LM); language of schooling; other languages known besides Portuguese; language in which the student is most proficient besides the LM (levels from the Common European Framework of Reference for Languages, CEFR);

c) Portuguese language – year in which the study of Portuguese began; other courses on Portuguese culture; contact with Portuguese speakers; proficiency in Portuguese (CEFR levels);

d) stimuli (see point 2 of the Methodology).

2. Written productions

Each written production was elicited by a stimulus. Participating teachers were given a list of 83 writing prompts (revised and expanded from those designed for the PhD of the project coordinator, Isabel Leiria), organized into three thematic sections:
1. The individual
2. Society
3. The environment

At the beginning of the project, we asked the teachers to select one or two stimuli from each of the three themes, according to the students' learning level and personal preference. The selected stimuli and the number of productions obtained for each can be consulted here.

Note: The materials provided by the University of Pusan, identified with the code PU, were not collected according to the guidelines of the present project and consist of exam texts. Sociolinguistic data on the informants are therefore not available in this case. To respect the identification system, each text was assigned the number of the stimulus that best fit the task performed.

3. Transcription standards

The texts were transcribed according to the following conventions (cf. Leiria, I. 2006 - Léxico, aquisição e ensino do Português Europeu língua não materna. Lisboa: FCG/FCT, p. 201):

< XXX >      crossed-out segments
<(...)>      crossed-out, unreadable segments
/ xxx /      added segments
/ * xxx /    conjectured readings

To conceal names and other elements that could reveal the identity of the informant, such elements were replaced by the code XXXX. This notation is also in line with the protocol of the PL2 Corpora Collection at the University of Coimbra.
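The conventions listed above lend themselves to simple pattern matching. The following is a minimal sketch, not part of the project's tooling, of how a reader might derive the learner's final version of a text by dropping crossed-out segments and keeping added and conjectured ones. The function name `final_reading`, the exact spacing tolerated around the delimiters and the sample sentence are all assumptions made for illustration; texts that use `/`, `<` or `>` for other purposes would need extra care.

```python
import re

# Assumed delimiters, taken from the list of conventions; optional
# whitespace is tolerated because actual spacing may vary.
CROSSED_OUT = re.compile(r"<[^<>]*>")               # < XXX > and <(...)>
CONJECTURED = re.compile(r"/\s*\*\s*([^/]*?)\s*/")   # / * xxx /
ADDED = re.compile(r"/\s*([^/]*?)\s*/")              # / xxx /


def final_reading(transcript: str) -> str:
    """Return the text without transcription markup (hypothetical helper)."""
    # Crossed-out material is not part of the final version.
    text = CROSSED_OUT.sub("", transcript)
    # Conjectured readings are unwrapped first, so that their opening
    # slash is not mistaken for the start of an added segment.
    text = CONJECTURED.sub(r"\1", text)
    # Added segments are kept, without their delimiters.
    text = ADDED.sub(r"\1", text)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


# Invented sample, not taken from the corpus.
print(final_reading("Eu <sou> / estou / em / * Lisboa /"))
# prints: Eu estou em Lisboa
```

Keeping the three patterns separate makes it easy to produce other views of the same transcript, for example one that retains the crossed-out segments for the study of self-correction.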

4. Coding of the texts collected

Each document is labeled with (i) the university where the collection took place; (ii) the level of proficiency in Portuguese at the time of collection (codes 1, 2 and 3 correspond, respectively, to CEFR levels A1-A2, B1-B2 and C1-C2); (iii) the informant's number (assigned on the language profile form); and (iv) the code of the stimulus (the codes follow the list of stimuli sent to the teachers).

Thus, a text written at Rutgers University (RU) by a student at level A1-A2 (1), with identification number 07, in response to stimulus 45.2L, has the following identification: RU_1_07_45.2L.
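For readers who want to process the files programmatically, the labeling scheme can be split back into its four parts. The sketch below is illustrative only: the names `TextId` and `parse_text_id` are hypothetical, and the pattern (uppercase university code, one-digit level, numeric informant number, free-form stimulus code) is inferred from the single example RU_1_07_45.2L, so identifiers that deviate from it, possibly including the PU materials, may need adjustments.

```python
import re
from typing import NamedTuple

# Level codes as described in the text: 1, 2 and 3 map to CEFR bands.
CEFR_BANDS = {"1": "A1-A2", "2": "B1-B2", "3": "C1-C2"}

# Assumed shape: UNIVERSITY_LEVEL_INFORMANT_STIMULUS, e.g. RU_1_07_45.2L.
ID_PATTERN = re.compile(r"^([A-Z]+)_([123])_(\d+)_(.+)$")


class TextId(NamedTuple):
    university: str   # e.g. "RU" for Rutgers University
    cefr_band: str    # e.g. "A1-A2"
    informant: str    # kept as a string to preserve leading zeros
    stimulus: str     # e.g. "45.2L"


def parse_text_id(text_id: str) -> TextId:
    """Split a document identifier into its components (hypothetical helper)."""
    match = ID_PATTERN.match(text_id)
    if match is None:
        raise ValueError(f"unrecognised identifier: {text_id!r}")
    university, level, informant, stimulus = match.groups()
    return TextId(university, CEFR_BANDS[level], informant, stimulus)


print(parse_text_id("RU_1_07_45.2L"))
```

Because the informant number is shared by all texts from the same person, grouping parsed identifiers by university and informant recovers the sets of texts written by a single learner.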

The corpus consists of 470 written productions by 397 informants, speakers of 28 different mother tongues. With an average of 150 words per document, the corpus totals about 70,500 transcribed words.

1. Number of texts by informants' mother tongue (LM)

German               41    English              37
Apache                1    Italian             112
Bulgarian             7    Japanese/Portuguese   1
Bulgarian/Turkish     1    Konkani              13
Catalan               2    Konkani/English       1
Korean               59    Luxembourgish         1
Czech                 2    Polish               21
Croatian              3    Portuguese           12
Slovak                1    Romanian             52
Spanish              79    Rwandan               1
Spanish/Italian       1    Russian               5
French                8    Serbian               2
French/Portuguese     2    Swedish               1
Hindi                 4
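As a quick consistency check, the per-language counts in the table can be added up and compared with the corpus totals reported above. The snippet below is only a sketch: the dictionary name is arbitrary, the counts are copied from the table, and the figure of 150 words per text is the stated average rather than a measured value.

```python
# Counts copied from the table (texts per informant mother tongue).
TEXTS_BY_LM = {
    "German": 41, "English": 37,
    "Apache": 1, "Italian": 112,
    "Bulgarian": 7, "Japanese/Portuguese": 1,
    "Bulgarian/Turkish": 1, "Konkani": 13,
    "Catalan": 2, "Konkani/English": 1,
    "Korean": 59, "Luxembourgish": 1,
    "Czech": 2, "Polish": 21,
    "Croatian": 3, "Portuguese": 12,
    "Slovak": 1, "Romanian": 52,
    "Spanish": 79, "Rwandan": 1,
    "Spanish/Italian": 1, "Russian": 5,
    "French": 8, "Serbian": 2,
    "French/Portuguese": 2, "Swedish": 1,
    "Hindi": 4,
}

AVERAGE_WORDS_PER_TEXT = 150  # average stated in the corpus description

total_texts = sum(TEXTS_BY_LM.values())
estimated_words = total_texts * AVERAGE_WORDS_PER_TEXT

print(total_texts)      # 470
print(estimated_words)  # 70500
```

The sum matches the 470 written productions, and 470 × 150 gives the approximate total of 70,500 transcribed words.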

2. Texts

a) Go to full corpus: corpus_ple
b) Go to each of the written productions:

