NLI-PT is the first Portuguese dataset compiled for Native Language Identification (NLI), the task of identifying an author’s first language based on their second language writing.

The dataset includes 1,868 student essays written by learners of European Portuguese who are native speakers of the following L1s: Chinese, English, Spanish, German, Russian, French, Japanese, Italian, Dutch, Tetum, Arabic, Polish, Korean, Romanian, and Swedish. We collected data from three different sources: two learner corpora, COPLE2 and PEAPL2, and the dataset of the project "Recolha de dados de aprendizagem de português língua estrangeira". To unify the learner data collected from these sources, we applied a methodology previously used for the compilation of language variety corpora (Tan et al., 2014).

 

NLI-PT includes the original student text and four different types of annotation: POS, fine-grained POS, constituency parses, and dependency parses. We used the LX Parser for the simple POS and the Portuguese morphological module of Freeling for detailed POS. Concerning syntactic annotations, we used the LX Parser for constituency parsing and the DepPattern toolkit for dependencies.

NLI-PT can be used not only in NLI but also in research on several topics in the field of Second Language Acquisition and educational NLP.

NLI-PT is described in the following paper:

del Río, I., Zampieri, M. & Malmasi, S. 2018. A Portuguese Native Language Identification Dataset. The 13th Workshop on Innovative Use of NLP for Building Educational Applications, NAACL 2018. 5 June. New Orleans, USA.

 

Download the dataset

 

Data collection and corpus constitution

Amália Maria Vera-Cruz de Melo Lopes

Universidade de Cabo Verde

 

Date and place of data collection

From 2005 to 2007, Cidade da Praia

 

Average duration per interview

60 minutes

 

Informants profile (all regular inhabitants of Cidade da Praia)

  

 

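Freelancers / Others

| No. | M/F | FE | Origin: Bar | Origin: Sot. | INST: Lic | INST: Sec/M | INST: FP | Activity |
|---|---|---|---|---|---|---|---|---|
| 1 | M | 51 | + | | +*** | | | Social and cultural agent |
| 2 | M | 56 | + | | +* | | | Social and cultural agent, writer, member of Parliament |
| 3 | M | 49 | | + | +* | | | Media analyst, professor |
| 4 | M | 47 | + | | +** | | | Journalist |
| 5 | F | 68 | + | | +* | | | Writer, cultural agent |
| 6 | M | 70 | | + | | + | | Cultural agent, musician |
| 7 | M | 81 | | + | +* | | | Poet, literary translator, cultural agent |
| 8 | M | 61 | | + | +* | | | Lawyer, writer |
| 9 | M | 56 | | + | +* | | | Cultural agent, writer |
| 10 | M | 53 | + | | +* | | | Member of Parliament |
| 11 | M | 40 | + | | | | + | Cultural agent |
| 12 | M | 54 | + | | +** | | | Member of an international institution (associated with Portuguese language) |
| 13 | M | 54 | + | | +* | | | Lawyer |
| 14 | F | 52 | | + | +* | | | Member of Parliament |
| 15 | F | 46 | | + | +*** | | | Doctor |

High School Teachers of Portuguese

| No. | M/F | FE | Origin: Bar | Origin: Sot. | INST: Lic | INST: Bach |
|---|---|---|---|---|---|---|
| 16 | M | 38 | | + | | + |
| 17 | M | 56 | + | | + | |
| 18 | M | 52 | | + | | + |
| 19 | F | 28 | | + | + | |
| 20 | F | 34 | | + | | + |
| 21 | M | 28 | | + | + | |
| 22 | F | 41 | + | | | + |
| 23 | F | 42 | | + | + | |
| 24 | M | 53 | + | | + | |
| 25 | F | 34 | | + | +* | |
| 26 | F | 51 | | + | | + |
| 27 | F | 50 | + | | | + |
| 28 | F | 52 | | + | | + |
| 29 | M | 43 | | + | + | |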

             

  

Labels

FE: Age group; Bar: Windward; Sot: Leeward; INST: education level; Lic: Graduation degree; Bach: Bachelor degree; Sec / M: High school; FP: professional qualification; *: studies carried out in Portugal; **: studies carried out in another CPLP country; ***: studies carried out in a non-Portuguese speaking country.

 

Corpus

Audios

Transcriptions

 


CEPLEXicon - A Lexicon of Child European Portuguese

CEPLEXicon is a lexicon based on two different corpora of child speech: Corpus SANTOS (Santos, 2006; Santos et al., 2014) and the Freitas corpus (Freitas, 1997; Freitas et al., 2012) (part of AcEP). The lexicon results from the automatic tagging of the two corpora, using a tagger and the POS tag set produced in the research unit ANAGRAMA (Centro de Linguística da Universidade de Lisboa - CLUL) (Généreux, Hendrickx & Mendes, 2012). The automatic tagging was followed by a partial manual revision (as described in the manual).

 

This lexicon covers all the speech produced by seven monolingual Portuguese children aged 1;02.00 to 3;11.12, for a total of 114 files, each corresponding to 40-50 minutes of child-adult interaction in a naturalistic setting. The lexicon is distributed in .xls format and includes 2201 lemmas, the number of occurrences of each lemma in three age periods (<2 years of age; ≥ 2 and < 3 years of age; ≥ 3 years of age), the frequency of the lemma in each period, and the age of first occurrence for each child.
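The age notation and the three age periods above can be illustrated with a short sketch. It assumes that ages such as 1;02.00 follow the years;months.days convention common in child language research; the function names are hypothetical, and the actual layout of the .xls file is not described here.

```python
import re

# Assumed notation: years;months.days, e.g. "1;02.00" (an assumption, see above).
AGE_PATTERN = re.compile(r"^(\d+);(\d{1,2})\.(\d{1,2})$")


def parse_age(age: str) -> tuple[int, int, int]:
    """Split an age string into (years, months, days)."""
    match = AGE_PATTERN.match(age.strip())
    if match is None:
        raise ValueError(f"unrecognized age notation: {age!r}")
    years, months, days = (int(part) for part in match.groups())
    return years, months, days


def age_period(age: str) -> str:
    """Map an age to one of the three periods used in the lexicon."""
    years, _, _ = parse_age(age)
    if years < 2:
        return "<2"
    if years < 3:
        return ">=2 and <3"
    return ">=3"


if __name__ == "__main__":
    for sample in ("1;02.00", "2;06.15", "3;11.12"):
        print(sample, parse_age(sample), age_period(sample))
```

Only the year component decides the period, since the boundaries fall on whole years.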

CEPLEXicon was developed at ANAGRAMA (CLUL, Faculdade de Letras da Universidade de Lisboa), under the project Complement Clauses in the Acquisition of Portuguese (PTDC/CLE-LIN/120897/2010), funded by Fundação para a Ciência e Tecnologia.

 

The full reference to CEPLEXicon should be included in all types of work using it as a source of information, including books, papers, conference presentations or posters, evaluation tools or any other products.

 

How to cite CEPLEXicon (version 1.1):

Santos, Ana Lúcia, Maria João Freitas & Aida Cardoso (2014) CEPLEXicon - A Lexicon of Child European Portuguese. Lisboa: Anagrama (CLUL, FLUL). ISLRN: 408-817-203-152-3, ELRA ID: ELRA-L0094
Link to ELRA Catalogue.

 

 

 

Boletim de Filologia was the periodical of Centro de Estudos Filológicos (CEF), now Centro de Linguística da Universidade de Lisboa. Between 1932, the year CEF was founded, and 1993, no fewer than 32 volumes were published.

Thanks to the cooperation of Instituto Camões, a digital reproduction of Boletim is available at the site of Centro Virtual Camões.

http://cvc.instituto-camoes.pt/bdc/lingua/boletimfilologia/index.html



 

Data collection for learning Portuguese as a second language

Team
Coordination:
Isabel Leiria

Transcription, coding and data organization:
Rita Gonçalves

The corpus available here results from the project "Recolha de dados de aprendizagem de português língua estrangeira" (Data collection for learning Portuguese as a second language), held under a protocol between the Camões Institute and the Linguistics Center of the University of Lisbon.

The project's main goal was to collect data from foreign learners of Portuguese in order to create a database that can support research on Portuguese and, in particular, teacher training and the production of educational materials for Portuguese L2. The corpus, now available, is compatible with materials collected in other projects, since the same collection methodology and the same transcription codes were used (see Collection of Corpora PL2 and Leiria 2006). The corpus is a representative empirical data set that is essential for research on the acquisition and learning of PL2.

From October 2008 until October 2010, data were collected from 397 Portuguese L2 learners. At the time of collection, the students attended courses at different levels (A1-C1) in eighteen universities in the following countries: Austria, Bulgaria, South Korea, Spain, USA, France, India, Italy, Poland, United Kingdom, Czech Republic and Romania.

The following teachers from the Camões Institute collaborated in data collection in these countries: Ana Catarina de Castro, Ana Catarina de Matos, Ana Filipa Velosa, Ana Mendes e Land, Daniel Perdigão, Delfim da Silva, Francisco Nazareth, Hiteshkumar Chimanial Parmar, Joaquim Ramos, José Carlos Dias, Leonor Moura e Silva, Lia Ferreira, Mónica Pereira, Pedro Martins, Sandra Pinheiro and Vanessa Castagna.

Methodology

Data were collected from participants following pre-defined guidelines that included (i) a language profile questionnaire, to which a number was assigned; and (ii) the text, identified with the number of the respective informant. This coding system allowed us to identify texts written by the same informant, who only had to fill out the language profile questionnaire once. After collection, the materials were transcribed, coded and organized.

1. Data from informants

First, we created an Excel file that brings together the sociolinguistic data from the participants. The information, organized by the universities where the collection was made, includes the following items, which can be searched individually through the filter system:

a) personal data – age, sex, nationality, course/year of their course and university where the course is attended;

b) language background – mother tongue (LM); language of schooling; other languages known besides Portuguese; language, other than the LM, in which the student is most proficient (levels from the Common European Framework of Reference for Languages (CEFR));

c) Portuguese language – year in which the study of Portuguese began; other courses of Portuguese culture; contact with other Portuguese speakers; proficiency in Portuguese (levels from the CEFR);

d) stimuli (see point 2 of the Methodology)

2. Written productions

Each written production was obtained from a stimulus. Participating teachers were given a list of 83 writing prompts (revised and expanded from those designed for the PhD of the project coordinator, Isabel Leiria), organized into three thematic sections:
1. The individual
2. Society
3. The environment

At the beginning of the project, we asked the teachers to select one or two stimuli from each of the three themes according to the students' learning level and personal preference. Consult here the selected stimuli and the number of productions obtained for each one.

Note: The materials provided by the University of Pusan, identified with the code PU, were not collected according to the guidelines of the present project and consist of exam texts. Therefore, sociolinguistic data of the informants are not available in this case. To respect the identification system, each of these texts was assigned the number of the stimulus that best fit the task performed.

3. Transcription standards

The texts were transcribed according to the following conventions (cf. Leiria, I. 2006 - Léxico, aquisição e ensino do Português Europeu língua não materna. Lisboa: FCG/FCT, p. 201):

< XXX >  crossed-out segments
<(...)>  illegible crossed-out segments
/ xxx /  added segments
/ * xxx /  conjectured readings

To conceal names and other elements that could reveal the identity of the informant, such elements were replaced by the code XXXX. This notation is also in line with the protocol of the PL2 Corpora Collection at the University of Coimbra.
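As an illustration of how these conventions might be processed, the sketch below derives a "final reading" of a transcribed text by dropping crossed-out segments and keeping added or conjectured ones. The function name and the regular expressions are assumptions made for this example, not part of the project's tooling, and the approach would misfire on texts that contain literal slashes or angle brackets.

```python
import re

# < XXX > and <(...)>: crossed-out material, removed from the final reading.
DELETED = re.compile(r"<[^<>]*>")
# / xxx / and / * xxx /: added or conjectured material, kept without markers.
ADDED = re.compile(r"/\s*\*?\s*([^/]*?)\s*/")


def final_reading(text: str) -> str:
    """Return the text as the learner left it, without transcription markup."""
    text = DELETED.sub("", text)
    text = ADDED.sub(r"\1", text)
    # Collapse the whitespace left behind by removed segments.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    sample = "Eu < gosta > / gosto / de / * ler / com XXXX."
    print(final_reading(sample))
```

The anonymization code XXXX is left untouched, since it carries no markup of its own.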

4. Encoding of texts collected

Each document is labeled with (i) the university where the collection was made; (ii) the level of proficiency in Portuguese at the time of collection (codes 1, 2 and 3 assigned, respectively, to levels A1-A2, B1-B2 and C1-C2 of the CEFR); (iii) the number of the informant (assigned on the language profile questionnaire); and (iv) the code of the stimulus (following the list of stimuli given to the teachers).

Thus, a text written at Rutgers University (RU), produced by a student at level A1-A2 (1), with identification number 07, under the stimulus 45.2L, has the following identification: RU_1_07_45.2L.
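The identification scheme can be sketched in code as follows. This is a minimal example, not the project's own tooling: the names `TextId`, `parse_text_id` and `group_by_informant` are hypothetical, the stimulus code is treated as an opaque string, and grouping by university, level and informant number is an assumption about how to recover texts by the same informant.

```python
import re
from collections import defaultdict
from typing import NamedTuple

# CEFR bands for the proficiency codes, as given in the text.
LEVELS = {"1": "A1-A2", "2": "B1-B2", "3": "C1-C2"}

# university _ level _ informant _ stimulus, e.g. RU_1_07_45.2L
ID_PATTERN = re.compile(r"^([A-Z]{2})_([123])_(\d+)_(\S+)$")


class TextId(NamedTuple):
    university: str
    level: str
    informant: str
    stimulus: str


def parse_text_id(text_id: str) -> TextId:
    """Split an identifier into its four components."""
    match = ID_PATTERN.match(text_id)
    if match is None:
        raise ValueError(f"unrecognized identifier: {text_id!r}")
    university, level, informant, stimulus = match.groups()
    return TextId(university, LEVELS[level], informant, stimulus)


def group_by_informant(text_ids):
    """Collect stimulus codes per (university, level, informant) key."""
    groups = defaultdict(list)
    for text_id in text_ids:
        parsed = parse_text_id(text_id)
        key = (parsed.university, parsed.level, parsed.informant)
        groups[key].append(parsed.stimulus)
    return dict(groups)


if __name__ == "__main__":
    print(parse_text_id("RU_1_07_45.2L"))
    print(group_by_informant(["AL_2_08_1.1A", "AL_2_08_6.1B", "AL_2_09_1.1A"]))
```

Identifiers that do not match the pattern raise a `ValueError` rather than being silently skipped, which makes irregular entries easy to spot.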

Data
The corpus consists of 470 written productions, produced by 397 informants, speakers of 28 different mother tongues. Since each document has an average of 150 words, the corpus as a whole has about 70,500 transcribed words.

1. Number of texts by informants' LM

German 41   English 37
Apache 1   Italian 112
Bulgarian 7   Japanese/Portuguese 1
Bulgarian/Turkish 1   Konkani 13
Catalan 2   Konkani/English 1
Korean 59   Luxembourgish 1
Czech 2   Polish 21
Croatian 3   Portuguese 12
Slovak 1   Romanian 52
Spanish 79   Rwandan 1
Spanish/Italian 1   Russian 5
French 8   Serbian 2
French/Portuguese 2   Swedish 1
Hindi 4      
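As a quick consistency check, the counts in the table above can be tallied and compared with the corpus size stated earlier. The dictionary simply transcribes the table; the variable names are illustrative, and the 150-word average is the figure given in the text.

```python
# Number of texts per mother tongue (LM), transcribed from the table above.
TEXTS_BY_LM = {
    "German": 41, "English": 37, "Apache": 1, "Italian": 112,
    "Bulgarian": 7, "Japanese/Portuguese": 1, "Bulgarian/Turkish": 1,
    "Konkani": 13, "Catalan": 2, "Konkani/English": 1, "Korean": 59,
    "Luxembourgish": 1, "Czech": 2, "Polish": 21, "Croatian": 3,
    "Portuguese": 12, "Slovak": 1, "Romanian": 52, "Spanish": 79,
    "Rwandan": 1, "Spanish/Italian": 1, "Russian": 5, "French": 8,
    "Serbian": 2, "French/Portuguese": 2, "Swedish": 1, "Hindi": 4,
}

AVERAGE_WORDS_PER_TEXT = 150  # average stated in the corpus description

total_texts = sum(TEXTS_BY_LM.values())
estimated_words = total_texts * AVERAGE_WORDS_PER_TEXT

if __name__ == "__main__":
    print(total_texts)       # 470
    print(estimated_words)   # 70500
```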

2. Texts

a) Go to full corpus: corpus_ple
b) Go to each of the written productions:

BU_1_01_1.1A PR_2_01_1.1A MA_2_29_55.2M VA_3_04_44.2L  RU_3_01_45.2L
BU_1_02_1.1A PR_2_02_6.1B MA_2_30_8.1B VA_3_04_69.3Q RU_3_02_35.1J
BU_1_03_5.1B PR_2_3_55.2M MA_2_30_83.3V VA_3_05_1.1A RU_3_02_65.2O
BU_1_04_5.1B SI_2_01_34.1J MA_2_31_83.3V VA_3_05_69.3Q RU_3_03_37.1J
BU_1_05_1.1A SI_2_01_50.2L PA_1_01_1.1A VA_3_06_44.2L RU_3_03_45.2L
BU_1_06_1.1A SI_2_02_53.2L PA_1_02_75.3S VA_3_06_45.2L RU_3_04_37.1J
BU_1_07_1.1A SI_2_02_70.3Q PA_1_03_1.1A VA_3_06_69.3Q RU_3_04_45.2L
BU_1_08_1.1A SI_2_03_1.1A PA_1_04_1.1A VA_3_07_69.3Q RU_3_05_37.1J
BU_1_09_1.1A SI_2_03_53.2L PA_1_05_1.1A ED_1_01_10.1C RU_3_05_45.2L
BU_1_10_5.1B MA_1_01_1.1A PA_1_06_75.3S ED_1_02_22.1G RU_3_06_65.2O
BU_1_11_6.1B MA_1_02_1.1A PA_1_07_75.3S ED_1_03_22.1G RU_3_07_45.2L
BU_1_12_6.1B MA_1_03_1.1A PA_2_08_60.2M ED_1_04_50.2L RU_3_08_45.2L
BU_1_13_6.1B MA_1_04_1.1A PA_2_09_10.1C ED_1_05_22.1G RU_2_09_37.1J
BU_1_14_5.1B MA_1_05_1.1A PA_2_10_5.1B ED_1_06_50.2L RU_2_10_37.1J
BU_1_15_1.1A MA_1_06_1.1A PA_2_11_5.1B ED_1_07_50.2L RU_2_10_45.2L
BU_1_16_1.1A MA_1_07_1.1A PA_2_12_48.2L ED_1_08_67.2P RU_2_11_35.1J
BU_1_17_1.1A MA_1_08_1.1A PA_2_13_78.3T ED_1_09_50.2L RU_2_12_35.1J
BU_1_18_1.1A MA_1_08_55.2M PA_2_14_5.1B ED_1_10_50.2L RU_2_13_37.1J
BU_1_19_1.1A MA_1_08_78.3T PA_2_15_5.1B ED_1_11_22.1G RU_2_14_35.1J
BU_1_20_6.1B MA_1_09_1.1A PA_2_16_48.2L ED_1_12_22.1G RU_2_15_37.1J
BU_1_21_1.1A MA_1_11_1.1A PA_2_17_10.1C ED_1_13_10.1C RU_2_16_37.1J
BU_1_22_1.1A MA_1_12_1.1A PA_2_18_60.2M ED_1_14_22.1G RU_2_17_35.1J
BU_2_23_45.2L MA_2_13_8.1B PA_2_19_10.1C ED_1_15_10.1C RU_2_18_37.1J
BU_2_24_7.1B MA_2_13_55.2M PA_2_20_48.2L ED_1_16_50.2L RU_2_19_37.1J
BU_2_25_10.1C MA_2_14_8.1B PA_2_21_60.2M ED_1_17_50.2L GO_2_01_1.1A
BU_2_26_70.3Q MA_2_14_55.2M PA_2_22_60.2M ED_1_18_75.3S GO_2_02_1.1A
BU_2_27_70.3Q MA_2_15_55.2M PA_2_23_48.2L ED_1_19_50.2L GO_2_03_45.2L
BU_2_28_70.3Q MA_2_15_83.3V PA_2_24_60.2M ED_1_20_50.2L GO_2_04_1.1A
BU_2_29_7.1B MA_2_16_55.2M PA_2_25_5.1B ED_1_21_50.2L GO_2_05_1.1A
BU_2_30_7.1B MA_2_16_83.3V PA_2_26_66.2O ED_1_22_75.3S GO_2_05_45.2L
BU_2_31_7.1B MA_2_17_55.2M PA_2_27_39.1J ED_1_23_67.2P GO_2_06_45.2L
BU_2_32_10.1C MA_2_17_83.3V PA_2_28_39.1J ED_1_24_22.1G GO_2_07_1.1A
BU_2_33_70.3Q MA_2_18_8.1B PA_2_29_39.1J SO_2_01_4.1A GO_2_08_1.1A
BU_2_34_70.3Q MA_2_18_55.2M PA_2_30_67.2P SO_2_02_4.1A GO_2_09_45.2L
BU_2_35_70.3Q MA_2_19_8.1B PA_2_31_66.2O SO_2_03_45.2L GO_2_10_1.1A
BU_2_36_70.3Q MA_2_19_83.3V PA_2_32_66.2O SO_2_04_4.1A GO_2_11_45.2L
BU_2_37_70.3Q MA_2_20_8.1B PA_2_33_5.1B SO_2_05_69.3Q GO_2_12_45.2L
BU_2_38_10.1C MA_2_20_55.2M PA_2_34_5.1B SO_2_06_4.1A GO_2_13_1.1A
BU_2_39_7.1B MA_2_21_52.2L PA_2_35_39.1J SO_2_07_69.3Q GO_2_14_45.2L
BU_2_40_70.3Q MA_2_22_78.3T PA_2_36_39.1J SO_2_08_4.1A GO_2_15_45.2L
BU_2_41_7.1B MA_2_23_78.3T PA_2_37_39.1J NI_1_01_1.1A GO_2_16_1.1A
BU_2_42_1.1A MA_2_24_78.3T PA_2_38_66.2O NI_1_02_1.1A GO_2_17_45.2L
BU_2_42_7.1B MA_2_25_55.2M VA_3_01_3.1A NI_1_03_1.1A AU_1_01_25.1H
BU_2_43_70.3Q MA_2_26_55.2M VA_3_01_69.3Q NI_1_04_1.1A AU_1_02_25.1H
BU_2_44_45.2L MA_2_26_83.3V VA_3_02_3.1A NI_1_05_1.1A AU_1_03_25.1H
BU_2_45_10.1C MA_2_27_8.1B VA_3_02_69.3Q NI_1_06_1.1A AU_1_04_25.1H
BU_2_46_7.1B MA_2_27_55.2M VA_3_03_44.2L NI_1_07_1.1A AU_1_05_25.1H
BU_2_47_7.1B MA_2_28_8.1B VA_3_03_45.2L NI_1_08_1.1A AU_1_06_25.1H
BU_2_48_70.3Q MA_2_28_83.3V VA_3_03_69.3Q NI_1_09_1.1A AU_1_07_25.1H
BU_2_49_7.1B MA_2_29_8.1B VA_3_04_3.1A NI_1_10_1.1A SA_1_01_25.1H
SA_1_02_71.3Q SA_1_03_25.1H SA_1_04_25.1H SA_1_05_25.1H SA_1_06_45.2L
SA_1_07_25.1H SA_1_08_25.1H SA_1_09_25.1H SA_1_10_25.1H SA_1_11_25.1H
SA_1_11_45.2L SA_1_12_45.2L SA_1_13_25.1H SA_1_14_45.2L LI_1_01_1.1A
LI_1_02_1.1A LI_1_03_1.1A LI_1_04_1.1A LI_1_05_1.1A LI_1_06_1.1A
LI_1_07_1.1A LI_1_08_1.1A LI_1_09_1.1A LI_1_10_1.1A AL_1_01_1.1A
AL_1_02_1.1A AL_1_03_1.1A AL_1_04_1.1A AL_1_05_1.1A AL_1_06_1.1A
AL_1_07_1.1A AL_2_08_1.1A AL_2_08_6.1B AL_2_08_31.1I AL_2_08_59.2M
AL_2_08_70.3Q AL_2_09_1.1A AL_2_09_6.1B AL_2_09_31.1I AL_2_09_59.2M
AL_2_09_70.3Q AL_2_10_1.1A AL_2_10_6.1B AL_2_10_31.1I AL_2_10_59.2M
AL_2_10_70.3Q AL_2_11_1.1A AL_2_11_6.1B AL_2_11_31.1I AL_2_11_59.2M
AL_2_11_70.3Q AL_2_12_1.1A AL_2_12_6.1B AL_2_12_31.1I AL_2_12_59.2M
AL_2_12_70.3Q AL_1_13_1.1A AL_1_14_1.1A AL_1_15_1.1A AL_1_16_1.1A
AL_1_17_1.1A AL_1_18_1.1A AL_1_19_1.1A AL_1_20_1.1A AL_1_21_1.1A
AL_1_22_1.1A AL_1_23_1.1A AL_1_24_1.1A AL_1_25_1.1A AL_1_26_1.1A
AL_1_27_1.1A HU_1_01_7.1b HU_1_02_6.1B HU_1_03_8.1B HU_1_04_6.1B
HU_1_05_6.1B HU_1_06_6.1B HU_1_07_6.1B HU_1_08_7.1B HU_1_09_15.1D
HU_1_10_7.1B HU_1_11_8.1B HU_1_12_15.1D HU_1_13_15.1D HU_1_14_7.1B
HU_1_15_18.1F HU_1_16_24.1H HU_1_17_24.1H HU_1_18_24.1H HU_1_19_24.1H
HU_1_20_1.1A HU_1_21_1.1A HU_1_22_1.1A HU_1_23_1.1A HU_1_24_1.1A
HU_1_25_1.1A HU_1_26_1.1A VE_1_01_1.1A VE_1_02_1.1A VE_1_03_1.1A
VE_1_03_80.3U VE_1_04_1.1A VE_1_05_1.1A VE_1_05_74.3R VE_1_06_1.1A
VE_1_06_34.1J VE_1_07_1.1A VE_1_08_1.1A VE_1_09_1.1A VE_1_09_55.2M
VE_1_09_80.3U VE_1_10_1.1A VE_1_10_34.1J VE_1_10_48.2L VE_1_11_80.3U
VE_1_12_55.2M VE_1_12_83.3V VE_1_13_48.2L VE_1_14_1.1A VE_1_15_1.1A
VE_1_15_80.3U VE_1_16_1.1A VE_1_16_80.3U VE_1_17_1.1A VE_1_18_1.1A
VE_1_18_74.3R VE_1_19_1.1A VE_1_19_55.2M VE_1_20_1.1A VE_1_20_80.3U
VE_1_21_10.1C VE_1_21_34.1J VE_1_22_1.1A VE_1_22_26.1H VE_1_23_1.1A
VE_1_23_34.1J VE_1_24_1.1A VE_1_25_1.1A VE_1_26_57.2M VE_1_29_26.1H
VE_1_30_26.1H VE_1_30_74.3R VE_1_31_80.3U VE_1_32_26.1H VE_1_33_55.2M
VE_1_34_80.3U VE_1_35_26.1H VE_1_36_34.1J VE_1_37_34.1J VE_1_38_80.3U
VE_1_39_54.2L VE_1_40_55.2M VE_1_41_34.1J VE_3_01_34.1J VE_3_01_48.2L
VE_3_01_80.3U VE_3_02_1.1A VE_3_02_55.2M VE_3_03_83.3V VE_3_04_26.1H
VE_3_04_48.2L VE_3_05_71.3Q PU_1_01_1.1A PU_1_02_1.1A PU_1_03_1.1A
PU_1_04_1.1A PU_1_05_1.1A PU_1_06_1.1A PU_1_07_1.1A PU_1_08_1.1A
PU_1_09_1.1A PU_1_10_1.1A PU_1_11_1.1A PU_1_12_1.1A PU_1_13_1.1A
PU_1_14_1.1A PU_1_15_1.1A PU_1_16_1.1A PU_1_17_1.1A PU_1_18_1.1A
PU_1_19_1.1A PU_1_20_1.1A PU_1_21_1.1A PU_1_22_1.1A PU_1_23_1.1A
PU_1_24_1.1A PU_3_01_73.3R PU_3_02_73.3R PU_3_03_73.3R PU_3_04_73.3R
PU_3_05_73.3R PU_3_06_73.3R PU_3_07_73.3R PU_3_08_73.3R PU_3_09_73.3R
PU_3_10_73.3R PU_3_11_73.3R PU_3_12_73.3R PU_3_13_73.3R PU_3_14_73.3R
PU_3_15_73.3R PU_3_16_73.3R PU_3_17_73.3R PU_3_18_24.1H PU_3_19_24.1H
PU_3_20_24.1H PU_3_21_24.1H PU_3_22_24.1H PU_3_23_24.1H PU_3_24_24.1H
PU_3_25_24.1H PU_3_26_24.1H PU_3_27_24.1H PU_3_28_24.1H PU_3_29_24.1H
PU_3_30_24.1H PU_3_31_24.1H PU_3_32_24.1H PU_3_33_24.1H PU_3_34_24.1H
PU_3_35_24.1H