Corpus PESTRA

Presentation

The PESTRA corpus was compiled by Isabel Leiria for her PhD dissertation, “Lexicon, acquisition and teaching of L2 European Portuguese” (Leiria, 2001), presented to FLUL.

Like the Leiria 1991 corpus and other corpora compiled by CLUL and CELGA (2008-2010), this database aims to provide empirical data to support research on the teaching and learning of Portuguese as a foreign language, as well as the production of teaching materials in this area.

The data available here were collected as part of the exam for the basic Portuguese course held at the Department of Portuguese of the Faculty of Letters of Lisbon University, a course coordinated by Isabel Leiria and Helena Marques Dias in the 1980s and 1990s.

The corpus is composed of written production exercises by students of Portuguese as a foreign language. The students could choose one of several topics and could produce different text types: a letter, an argumentative text, a narrative or even a cooking recipe.

With the materials obtained, it was possible to gather a sample of about 50 documents for each L1: two Romance languages (Spanish and French), two Germanic languages (German and Swedish) and one non-Indo-European language (Chinese), which makes a corpus of about 250 documents.

At the time of collection, the students had attended at least one semester (120 hours of language teaching over a period of about three months).

Methodology

Each written production was elicited by a stimulus. The writing prompts were organized into three thematic sections:

1. The individual
2. Society
3. The environment

See here the distribution of documents for each thematic area and sub-themes.

After collection, the data were transcribed, classified and organized in accordance with the guidelines set out below.

1. Data from informants

The informants were between 18 and 57 years old; 64% were female and 36% were male. They are speakers of five different native languages (Spanish, French, German, Swedish or Chinese, either Mandarin or Cantonese) and, in most cases, know at least one more language besides their L1 and Portuguese.

According to Leiria (2001:196-197), "most of the Spanish, at the beginning of the course at FLUL, had not studied Portuguese, but knew English; some of them also knew German or French. Many of the French had studied Portuguese in France for between 10 and 60 hours; many knew English and some of them a little German, Spanish or Italian. Many of the Germans had studied between 30 and 60 hours of Portuguese; most of them knew English and many knew a little French. The Swedish profile was similar. Also, at the beginning of the course, several had already been living in Lisbon for almost three months. All the Chinese said they knew English; most of them had been studying Portuguese for at least two years (in Macau or the People's Republic of China), and some had even studied in Lisbon or Coimbra for one or two semesters."

Since the informants' knowledge could not be assessed by the amount of time they had been studying Portuguese, their language proficiency was established by their results in the speech production exercise that formed the first part of the test (on a scale from 0 to 4; see below the characterization of each scale item). From this information we can conclude, in general terms, that the level of comprehension decreases as the distance between the L1 and Portuguese increases (cf. Leiria, 2001:198).

Although they had not studied Portuguese before, the Spanish students had the highest level of comprehension, while the Chinese students, some already with two years of study, had the most difficulties. The informants with Swedish, German and French as L1 occupied an intermediate position in terms of the relevance of the problems detected. See here the linguistic profile of the informants according to their classification in the oral comprehension test.

Control group

The research carried out by Leiria (2001) established a control group of 50 speakers of Portuguese as L1, consisting of first-year students of the course in Portuguese and Portuguese culture and fourth-year students of Languages and Literatures at the Faculty of Arts of Lisbon. This control group was asked to write compositions on the same topics as the students of Portuguese as a foreign language. The materials thus obtained form the control subcorpus.

2. Transcription standards

As in Leiria (1991), this corpus was transcribed using some of the symbols of genetic and critical editing. The same conventions were also adopted in the data collection of other related projects (cf. data collection and CLUL/CELGA), ensuring data compatibility.

< xxx> crossed-out segments
<(...)> illegible crossed-out segments
/ xxx / added segments
/ * xxx / conjectured readings
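
As an illustration of how these symbols might be processed, the following minimal Python sketch derives a plain "final reading" from a transcribed line. It assumes, and this is not stated in the corpus documentation, that crossed-out segments should be dropped and that added or conjectured segments should be kept without their markers. It also assumes that "<", ">" and "/" occur in the transcriptions only as markup. The function name final_reading and the sample sentences are hypothetical and are not taken from the corpus.

```python
import re

# Crossed-out segments, legible or not: "< xxx>" and "<(...)>".
DELETION = re.compile(r"<[^<>]*>")
# Conjectured readings: "/ * xxx /" (optional spaces around the asterisk).
CONJECTURE = re.compile(r"/\s*\*\s*([^/]*?)\s*/")
# Added segments: "/ xxx /".
ADDITION = re.compile(r"/\s*([^/]*?)\s*/")


def final_reading(line: str) -> str:
    """Return the line without transcription markup (assumed policy:
    drop deletions, keep additions and conjectures as plain text)."""
    text = DELETION.sub("", line)
    # Conjectures are handled first so the asterisk is not kept as text.
    text = CONJECTURE.sub(r"\1", text)
    text = ADDITION.sub(r"\1", text)
    # Collapse the extra spaces left behind by removed segments.
    return " ".join(text.split())


if __name__ == "__main__":
    # Invented example, not a corpus sentence.
    print(final_reading("Eu <gosto> /adoro/ Lisboa"))
```

A researcher interested in the writers' revisions would instead keep the deleted segments, for example by collecting the DELETION matches rather than discarding them.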

3. Classification of texts collected

Each of the documents that constitute the PESTRA corpus (European Portuguese Written by Foreigners) is identified by a code that includes:

a) informant's mother tongue: Spanish, French, German, Swedish, Chinese;
b) document number: 1 – 53;
c) thematic section and subtheme: K (a – j); X (l – p); Z (q – v);
d) discourse type: a: opinion; b: narrative; c: letter; d: other;
e) oral comprehension: 0 – 4.

The code A35xlb2, for example, indicates that the document was produced by a German speaker (A) and has the number 35; that its thematic section and subtheme are, respectively, X (society) and l (social habits and behaviors); and that it is a narrative (b) produced by an informant who scored level 2 on the oral comprehension test.
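
The structure of these codes can be sketched as a small Python parser. This is only an illustration under stated assumptions: the names DocumentCode and parse_document_code are hypothetical, the L1 is returned as its raw letter because only "A" (German) is confirmed by the example above, and the thematic section is returned as a letter because only X (society) is glossed in the text.

```python
import re
from typing import NamedTuple

# Subtheme letters allowed for each thematic section, as listed in item c).
SUBTHEME_RANGES = {"k": ("a", "j"), "x": ("l", "p"), "z": ("q", "v")}
# Discourse types as listed in item d).
DISCOURSE_TYPES = {"a": "opinion", "b": "narrative", "c": "letter", "d": "other"}

# L1 letter, document number, section, subtheme, discourse type, oral level.
CODE_PATTERN = re.compile(r"([A-Za-z])(\d{1,2})([kxzKXZ])([a-vA-V])([a-dA-D])([0-4])")


class DocumentCode(NamedTuple):
    l1_letter: str           # raw letter; only "A" = German is documented
    number: int              # 1 to 53
    section: str             # "K", "X" or "Z"
    subtheme: str            # "a" to "v", depending on the section
    discourse_type: str
    oral_comprehension: int  # 0 to 4


def parse_document_code(code: str) -> DocumentCode:
    match = CODE_PATTERN.fullmatch(code.strip())
    if match is None:
        raise ValueError(f"not a valid document code: {code!r}")
    l1, number, section, subtheme, discourse, level = match.groups()
    section, subtheme = section.lower(), subtheme.lower()
    if not 1 <= int(number) <= 53:
        raise ValueError(f"document number out of range in {code!r}")
    low, high = SUBTHEME_RANGES[section]
    if not low <= subtheme <= high:
        raise ValueError(f"subtheme {subtheme!r} does not belong to section {section.upper()}")
    return DocumentCode(
        l1.upper(), int(number), section.upper(), subtheme,
        DISCOURSE_TYPES[discourse.lower()], int(level),
    )


if __name__ == "__main__":
    print(parse_document_code("A35xlb2"))
```

Checking the subtheme against its section catches mistyped codes such as a K section paired with subtheme l, which the pattern alone would accept.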

Data

The corpus consists of 309 written productions (100 to 400 words each), with approximately 50 documents per L1: 52 documents were produced by the control group (Portuguese L1); 50 by speakers whose L1 is Spanish; 53 by speakers of French; 53 by speakers of German; 52 by speakers of Swedish; and 49 by speakers of Chinese (Cantonese or Mandarin).

The order in which the subcorpora appear is based on typological order and on the linguistic distance between Portuguese and the L1. We therefore considered (i) typological order: Romance languages, Germanic languages, Chinese; and (ii) linguistic distance, from closest to most distant: Spanish, French, German, Swedish, Chinese.

In total, the corpus contains about 68,000 transcribed words.

1. Number of texts by informants' L1

L1            Number of texts
Portuguese    52
Spanish       50
French        53
German        53
Swedish       52
Chinese       49
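
The figures above can be cross-checked with a few lines of Python. The sketch below only restates numbers given in this section (the per-L1 counts and the approximate total of 68,000 words); the variable names are illustrative, and the mean length is a rough estimate because the word total is itself approximate.

```python
# Number of texts per L1, as listed above.
TEXTS_BY_L1 = {
    "Portuguese": 52,  # control group
    "Spanish": 50,
    "French": 53,
    "German": 53,
    "Swedish": 52,
    "Chinese": 49,
}
APPROX_TOTAL_WORDS = 68_000  # "about 68,000 transcribed words"

total_texts = sum(TEXTS_BY_L1.values())
learner_texts = total_texts - TEXTS_BY_L1["Portuguese"]
mean_words_per_text = APPROX_TOTAL_WORDS / total_texts

if __name__ == "__main__":
    print(total_texts)                 # 309
    print(learner_texts)               # 257
    print(round(mean_words_per_text))  # 220
```

The total matches the 309 written productions stated above, the 257 learner texts agree with the "about 250 documents" mentioned in the presentation, and the mean of roughly 220 words falls within the stated range of 100 to 400 words per document.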

2. Texts

a) See a sample file: A35xlb2
b) Go to the full corpus: PESTRA_Leiria2001.rar / PESTRA_Leiria2001.zip