Project description:

The Syntax-oriented Corpus of Portuguese Dialects (CORDIAL-SIN) is a project for the study of dialectal syntactic variation in European Portuguese, within a Principles and Parameters perspective, using a corpus markup methodology. The project aims to develop and strengthen research on syntactic dialect variation, a field with no tradition in the Portuguese domain. It does so by exploiting and processing existing recorded data. In its present state, CORDIAL-SIN is a 600,000-word corpus.

Over the past 30 years, the Dialectology team of Centro de Linguística da Universidade de Lisboa (CLUL) has built up a rich collection of recorded speech, about 4,500 hours of recordings, obtained from interviews in more than 200 localities across Portuguese territory (with a view to compiling linguistic atlases). CORDIAL-SIN is based on a geographically representative body of selected excerpts of spontaneous and semi-directed speech taken from the sound materials gathered within the scope of the following projects:

  • ALEPG Atlas Linguístico e Etnográfico de Portugal e da Galiza (Linguistic and Ethnographic Atlas of Portugal and Galicia)
  • ALLP Atlas Linguístico do Litoral Português (Linguistic Atlas of the Portuguese Coast)
  • ALEAç Atlas Linguístico e Etnográfico dos Açores (Linguistic and Ethnographic Atlas of the Azores)
  • BA Fronteira Dialectal do Barlavento Algarvio (Geographical Limits of the Dialect of Western Algarve) 
    [Luisa Segura da Cruz. 1987. A Fronteira Dialectal do Barlavento do Algarve. Assistant Research dissertation. Lisbon, Instituto Nacional de Investigação Científica.]  

The CORDIAL-SIN corpus is available in four forms: verbatim transcripts, normalized orthographic transcripts, POS-tagged transcripts, and syntactically annotated transcripts. The syntactic annotation is under development within the DUPLEX project.

The verbatim transcript contains not only the standard linguistic expressions but also annotations marking pauses, hesitations, abandoned starts, phonetic and morphological variants, repetitions, truncated words, speech overlaps, fuzzy productions, etc. These annotations follow the conventions established in Normas de Transcrição (Orthographic Transcription Conventions) and are subsequently removed automatically to produce the normalized orthographic transcript.

The morphological annotation system is adapted from the Tycho Brahe project and is applied automatically using the tagger developed by the Tycho Brahe research team. Tycho Brahe's morphological tags have an internal structure made up of the following components: a part-of-speech component (i.e. the main part of the tag); inflectional components; and diacritics and punctuation symbols (see POS Annotation Manual).
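As a rough illustration of what this internal tag structure means in practice, the following Python sketch splits a tag into its main part-of-speech component and its inflectional components. It assumes, purely for illustration, that components are joined by a hyphen; the example tags and the names `Tag` and `split_tag` are hypothetical, and the actual tag inventory and the treatment of diacritics are defined in the POS Annotation Manual.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    """A morphological tag split into its components."""
    pos: str                      # main part-of-speech component
    inflections: tuple[str, ...]  # inflectional components, in order


def split_tag(tag: str, sep: str = "-") -> Tag:
    """Split a tag such as 'D-F-P' into POS and inflectional parts.

    ASSUMPTION: components are separated by `sep` (a hyphen here).
    Diacritics and punctuation tags are not handled in this sketch.
    """
    head, *rest = tag.split(sep)
    if not head:
        raise ValueError(f"tag has no part-of-speech component: {tag!r}")
    return Tag(pos=head, inflections=tuple(rest))


# Hypothetical example tags, for illustration only.
determiner = split_tag("D-F-P")
noun = split_tag("N")
```

Keeping the part-of-speech component separate from the inflectional components makes it straightforward to group tokens by category while ignoring inflection, or the reverse.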

The syntactic annotation system is largely inspired by the annotation system developed for the Penn Parsed Corpora of Historical English. The syntactic annotation is applied to part-of-speech tagged texts and results in a tree representation in the form of labeled brackets, marking constituent boundaries, phrase and clause dependencies, sentence types, grammatical relations, null categories and certain transformational relations. Exhaustive automatic searching for predefined syntactic configurations is enabled by the search program CorpusSearch2, written by Beth Randall (open source software, downloadable from SourceForge), which is compatible with syntactic annotations in the Penn Treebank format (see Syntactic Annotation System Manual).
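To make the labeled-bracket representation concrete, here is a minimal Python sketch that reads one bracketed tree into a nested structure and looks up constituents by label. It is not CorpusSearch2 and does not reproduce its query language; the function names and the example sentence with its labels are assumptions chosen for illustration, and the real label set is documented in the Syntactic Annotation System Manual.

```python
import re

# A token is an opening bracket, a closing bracket, or any run of
# characters containing neither whitespace nor brackets.
TOKEN = re.compile(r"\(|\)|[^\s()]+")


def parse_brackets(text):
    """Parse one labeled-bracket tree into nested (label, children) tuples.

    Children are either sub-trees or plain strings (the words).
    """
    tokens = TOKEN.findall(text)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' at token {pos}")
        pos += 1
        label = ""  # some treebank files wrap trees in an unlabeled bracket
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    try:
        tree = node()
    except IndexError:
        raise ValueError("unbalanced brackets") from None
    if pos != len(tokens):
        raise ValueError("unexpected material after the tree")
    return tree


def leaves(tree):
    """Return the words of a tree in left-to-right order."""
    words = []
    for child in tree[1]:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


def find(tree, label):
    """Yield every sub-tree whose label equals `label`."""
    if tree[0] == label:
        yield tree
    for child in tree[1]:
        if not isinstance(child, str):
            yield from find(child, label)


# Hypothetical example; labels are illustrative, not taken from the manual.
example = "(IP-MAT (NP-SBJ (D o) (N homem)) (VB-P fala))"
tree = parse_brackets(example)
subjects = [leaves(t) for t in find(tree, "NP-SBJ")]
```

Representing each constituent as a `(label, children)` pair keeps the sketch short; a search tool such as CorpusSearch2 supports far richer structural queries than the exact-label lookup shown here.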