Team:

Elena Lombardo

Filipe Alves Moreira

 

Description:

The project "Sebástica manuscrita: catalogue and digital editions of Portuguese historiographical texts of the XVI and XVII centuries" ultimately aims to build a website that gathers and systematizes data on the manuscript traditions of the historiographical texts about King D. Sebastian written in Portuguese by contemporaries of the king, together with linguistically annotated digital philological editions of these texts. The stored data will give rise to a digital catalogue, viewable in graph format. The editions will serve as sources for studies in language history and will also be useful to historians and other scholars of Portuguese culture. We will consider all manuscripts that fall within the above-mentioned typology, whether they are held in national or international libraries.

With this project, we aim to fill a gap in the study of the historiographical texts of the sixteenth and seventeenth centuries dedicated to the reign of King D. Sebastian. We assume that these texts have been considered almost exclusively as sources for other subjects, rather than as a specific object of analysis from a literary, historiographical or linguistic point of view. Consequently, basic tools such as a list of texts and their respective editions and/or manuscripts are still missing, several texts remain unpublished, and almost all of those that have been published were edited without modern textual criticism criteria.

At present, we make available to the public the preliminary catalogue of the manuscripts identified in Portuguese collections: in Lisbon, the National Library, the Biblioteca da Ajuda, the Academia das Ciências de Lisboa Library and the Torre do Tombo National Archive; in Oporto, the Municipal Library; in Évora, the Public Library; in Coimbra, the University Library; and, in Viseu, the Municipal Library. The catalogue is constantly updated, standardized and corrected as our investigations progress.

 

Presentation

Catalogue

NLI-PT is the first Portuguese dataset compiled for Native Language Identification (NLI), the task of identifying an author’s first language based on their second language writing.

The dataset includes 1,868 student essays written by learners of European Portuguese who are native speakers of the following L1s: Chinese, English, Spanish, German, Russian, French, Japanese, Italian, Dutch, Tetum, Arabic, Polish, Korean, Romanian, and Swedish. We collected data from three different sources: two learner corpora, COPLE2 and PEAPL2, and the dataset of the project "Recolha de dados de aprendizagem de português língua estrangeira". To unify the learner data collected from these sources, we applied a methodology previously used for the compilation of language variety corpora (Tan et al., 2014).

 

NLI-PT includes the original student text and four different types of annotation: POS, fine-grained POS, constituency parses, and dependency parses. We used the LX Parser for the simple POS and the Portuguese morphological module of Freeling for detailed POS. Concerning syntactic annotations, we used the LX Parser for constituency parsing and the DepPattern toolkit for dependencies.

NLI-PT can be used not only in NLI but also in research on several topics in the field of Second Language Acquisition and educational NLP.

NLI-PT is described in the following paper:

del Río, I., Zampieri, M. & Malmasi, S. 2018. A Portuguese Native Language Identification Dataset. The 13th Workshop on Innovative Use of NLP for Building Educational Applications, NAACL 2018. 5 June. New Orleans, USA.

 

Download the dataset

 

Boletim de Filologia was the periodical of Centro de Estudos Filológicos (CEF), now Centro de Linguística da Universidade de Lisboa. Between 1932, the year CEF was founded, and 1993, no fewer than 32 volumes were published.

Thanks to the cooperation of Instituto Camões, a digital reproduction of Boletim is available on the website of Centro Virtual Camões.

http://cvc.instituto-camoes.pt/bdc/lingua/boletimfilologia/index.html



 

 

Data collection and corpus constitution

Amália Maria Vera-Cruz de Melo Lopes

Universidade de Cabo Verde

 

Date and place of data collection

From 2005 to 2007, Cidade da Praia

 

Average duration per interview

60 minutes

 

Informant profiles (all regular inhabitants of Cidade da Praia)

  

 

Freelancers / Others

               Origin      INST
No.  M/F  FE   Bar  Sot.   Lic    Sec/M  FP   Activity
1.   M    51   +           +***               Social and cultural agent
2.   M    56   +           +*                 Social and cultural agent, writer, member of Parliament
3.   M    49        +      +*                 Media analyst, professor
4.   M    47   +           +**                Journalist
5.   F    68   +           +*                 Writer, cultural agent
6.   M    70        +             +           Cultural agent, musician
7.   M    81        +      +*                 Poet, literary translator, cultural agent
8.   M    61        +      +*                 Lawyer, writer
9.   M    56        +      +*                 Cultural agent, writer
10.  M    53   +           +*                 Member of Parliament
11.  M    40   +                         +    Cultural agent
12.  M    54   +           +**                Member of an international institution (associated with Portuguese language)
13.  M    54   +           +*                 Lawyer
14.  F    52        +      +*                 Member of Parliament
15.  F    46        +      +***               Doctor

High School Teachers of Portuguese

               Origin      INST
No.  M/F  FE   Bar  Sot.   Lic    Bach
16.  M    38        +             +
17.  M    56   +           +
18.  M    52        +             +
19.  F    28        +      +
20.  F    34        +             +
21.  M    28        +      +
22.  F    41   +                  +
23.  F    42        +      +
24.  M    53   +           +
25.  F    34        +      +*
26.  F    51        +             +
27.  F    50   +                  +
28.  F    52        +             +
29.  M    43        +      +

             

  

Labels

FE: Age group; Bar: Windward; Sot: Leeward; INST: education level; Lic: Graduation degree; Bach: Bachelor degree; Sec / M: High school; FP: professional qualification; *: studies carried out in Portugal; **: studies carried out in another CPLP country; ***: studies carried out in a non-Portuguese speaking country.

 

Corpus

Audios

Transcriptions

 


CEPLEXicon - A Lexicon of Child European Portuguese

CEPLEXicon is a lexicon based on two different corpora of child speech – Corpus SANTOS (Santos, 2006; Santos et al., 2014) and the Freitas corpus (Freitas, 1997; Freitas et al., 2012) (part of AcEP). The lexicon results from the automatic tagging of the two corpora, using a tagger and the POS tag set produced in the research unit ANAGRAMA (Centro de Linguística da Universidade de Lisboa - CLUL) (Généreux, Hendrickx & Mendes, 2012). The automatic tagging was followed by a partial manual revision (as described in the manual).

 

This lexicon covers all the speech produced by seven monolingual Portuguese children aged 1;02.00 to 3;11.12, for a total of 114 files, each corresponding to 40-50 minutes of child-adult interaction in a naturalistic setting. The lexicon is presented in .xls format and includes 2201 lemmas, the number of occurrences of each lemma in three different age periods (< 2 years of age; ≥ 2 and < 3 years of age; ≥ 3 years of age), the frequency of the lemma in each period, and the age of first occurrence for each child.
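As an illustration of how the per-period counts and ages of first occurrence described above could be derived from tagged tokens, here is a minimal Python sketch. It is not the procedure used to build CEPLEXicon: the `(child_id, age, lemma)` token layout, the function names, the period labels and the sample tokens are all assumptions made for this example. The only things taken from the description are the `years;months.days` age notation and the three age periods. The lemma frequency per period is left out because its exact definition is not given here.

```python
from collections import Counter, defaultdict


def parse_age(age):
    """Turn 'years;months.days' (e.g. '1;02.00') into a comparable tuple."""
    years, rest = age.split(";")
    months, days = rest.split(".")
    return int(years), int(months), int(days)


def age_period(age):
    """Map an age to one of the three periods: <2, >=2 and <3, >=3 years."""
    years = parse_age(age)[0]
    if years < 2:
        return "<2"
    if years < 3:
        return ">=2,<3"
    return ">=3"


def summarize(tokens):
    """tokens: iterable of (child_id, age, lemma) triples (hypothetical layout)."""
    counts = defaultdict(Counter)  # lemma -> period -> number of occurrences
    first_seen = {}                # (lemma, child_id) -> earliest age string
    for child, age, lemma in tokens:
        counts[lemma][age_period(age)] += 1
        key = (lemma, child)
        if key not in first_seen or parse_age(age) < parse_age(first_seen[key]):
            first_seen[key] = age
    return counts, first_seen


# Invented tokens for illustration only; they are not taken from the corpora.
tokens = [
    ("C1", "2;05.10", "dar"),
    ("C1", "1;02.00", "dar"),
    ("C2", "3;11.12", "dar"),
]
counts, first_seen = summarize(tokens)
print(dict(counts["dar"]), first_seen[("dar", "C1")])
```

Ages are compared as integer tuples rather than as strings, so the ordering does not depend on zero-padding in the source files.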

CEPLEXicon was developed at ANAGRAMA (CLUL, Faculdade de Letras da Universidade de Lisboa), under the project Complement Clauses in the Acquisition of Portuguese (PTDC/CLE-LIN/120897/2010), funded by Fundação para a Ciência e Tecnologia.

 

The full reference to CEPLEXicon should be included in all types of work using it as a source of information, including books, papers, conference presentations or posters, evaluation tools or any other products.

 

How to cite CEPLEXicon (version 1.1):

Santos, Ana Lúcia, Maria João Freitas & Aida Cardoso (2014) CEPLEXicon - A Lexicon of Child European Portuguese. Lisboa: Anagrama (CLUL, FLUL). ISLRN: 408-817-203-152-3, ELRA ID: ELRA-L0094
Link to ELRA Catalogue.