CORDIAL-SIN - Syntax-oriented Corpus of Portuguese Dialects
The Syntax-oriented Corpus of Portuguese Dialects (CORDIAL-SIN) is a project directed towards the study of the dialectal syntactic variation of European Portuguese - within a Principles and Parameters perspective - using a corpus markup methodology. Since 1999, the project has developed and enhanced research activities on Portuguese dialect syntax.
1. Studying the syntax of European Portuguese dialects from a comparative perspective.
2. Developing and enhancing research activity on syntactic dialect variation in Portugal and strengthening cooperation with international dialect syntax projects (CORDIAL-SIN participates in the networks Edisyn - European Dialect Syntax and Wedisyn - Dialect Syntax in Westmost Europe).
3. Building up and making available online the Syntax-oriented Corpus of Portuguese Dialects (CORDIAL-SIN). This corpus is updated and improved on a regular basis. The existence of the corpus feeds the goals expressed in 1. and 2. above.
4. Exploiting existing resources in order to make them available to the scientific community and relevant for the development of the field of comparative dialect syntax. A rich recorded speech collection owned by CLUL provides the 'raw material' for the constitution of the Syntax-oriented Corpus of Portuguese Dialects.
The Syntax-oriented Corpus of Portuguese Dialects (CORDIAL-SIN) is a project directed towards the study of the dialectal syntactic variation of European Portuguese - within a Principles and Parameters perspective - using a corpus markup methodology. The project aims at developing and enhancing research activities on syntactic dialect variation, a field with no tradition in the Portuguese domain. This is implemented by exploiting and treating existing recorded data. In its present state, CORDIAL-SIN is a 600,000-word corpus.
Over the past 30 years, the Dialectology team of Centro de Linguística da Universidade de Lisboa (CLUL) has built up a rich collection of recorded speech (about 4,500 hours of recordings) obtained from interviews in more than 200 localities in the Portuguese territory, with a view to producing linguistic atlases. CORDIAL-SIN is based on a geographically representative body of selected excerpts of spontaneous and semi-directed speech taken from the sound materials gathered within the scope of the following projects:
- ALEPG Atlas Linguístico e Etnográfico de Portugal e da Galiza (Linguistic and Ethnographic Atlas of Portugal and Galicia)
- ALLP Atlas Linguístico do Litoral Português (Linguistic Atlas of the Portuguese Coast)
- ALEAç Atlas Linguístico e Etnográfico dos Açores (Linguistic and Ethnographic Atlas of the Azores)
- BA Fronteira Dialectal do Barlavento Algarvio (Geographical Limits of the Dialect of Western Algarve)
[Luisa Segura da Cruz. 1987. A Fronteira Dialectal do Barlavento do Algarve. Assistant Research dissertation. Lisbon, Instituto Nacional de Investigação Científica.]
The CORDIAL-SIN corpus is presented in four different ways: verbatim transcripts, normalized orthographic transcripts, POS tagged transcripts, and syntactically annotated transcripts. The syntactic annotation is under development within the DUPLEX project.
The verbatim transcript contains not only the standard linguistic expressions but also annotations marking pauses, hesitations, abandoned starts, phonetic and morphological variants, repetitions, truncated words, overlapping speech, fuzzy productions, etc. Such annotations are marked according to the conventions established in Normas de Transcrição (Orthographic Transcription Conventions) and are afterwards automatically erased in order to produce the normalized orthographic transcript.
The morphological annotation system is adapted from the Tycho Brahe project and is automatically set using the tagger developed by the Tycho Brahe research team. Tycho Brahe's morphological tags have an internal structure and are made up of the following components: a part-of-speech component (i.e. the main part of the tag); inflectional components; diacritics and punctuation symbols (see POS Annotation Manual).
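The internal structure of such a tag can be illustrated with a minimal sketch. It assumes that the components of a tag are joined by hyphens, with the part-of-speech component first and the inflectional components after it. The separator, the sample tag and the names Tag and split_tag are assumptions made for illustration only; the actual tag inventory is defined in the POS Annotation Manual.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Tag:
    pos: str                # main part-of-speech component
    inflections: List[str]  # inflectional components, possibly empty


def split_tag(tag: str, sep: str = "-") -> Tag:
    """Split a tag into its main component and its inflectional components.

    Assumption: components are separated by `sep` and the first one is
    the part-of-speech component.
    """
    parts = tag.split(sep)
    return Tag(pos=parts[0], inflections=parts[1:])


# Hypothetical tag, used only to show the decomposition.
example = split_tag("VB-D-3S")
print(example.pos, example.inflections)
```

A tag without inflectional components simply yields an empty list, so the same function can be applied to every token of a tagged transcript.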
The syntactic annotation system has been largely inspired by the annotation system developed for the Penn Parsed Corpora of Historical English. The syntactic annotation is implemented over part-of-speech tagged texts and results in a tree representation in the form of labeled brackets, marking constituent boundaries, phrase and clause dependencies, sentence types, grammatical relations, null categories and certain transformational relations. Complete and automatic searching for predefined syntactic configurations is enabled by the search program CorpusSearch2, written by Beth Randall (open source software, downloadable from Sourceforge), which is compatible with syntactic annotations in the Penn Treebank format (see Syntactic Annotation System Manual).
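To make the labeled-bracket representation concrete, the following minimal sketch reads one bracketed tree into a nested structure and collects the constituents that carry a given label. It is not CorpusSearch2 and does not reproduce its query language; the function names, the sample sentence and the labels IP, NP, D, N and VB are assumptions chosen for illustration, and the sketch assumes that every opening bracket is immediately followed by a label.

```python
def tokenize(text):
    # Pad brackets with spaces so that split() isolates them from labels and words.
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens, pos=0):
    """Parse the constituent that opens at tokens[pos].

    Returns (node, next_pos), where node is a (label, children) pair and
    each child is either another node or a word (a plain string).
    """
    if tokens[pos] != "(":
        raise ValueError("expected '(' at token %d" % pos)
    label = tokens[pos + 1]
    pos += 2
    children = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
        else:
            child, pos = tokens[pos], pos + 1
        children.append(child)
    return (label, children), pos + 1


def find(node, label):
    """Yield every constituent, at any depth, whose label equals `label`."""
    if isinstance(node, str):
        return
    if node[0] == label:
        yield node
    for child in node[1]:
        yield from find(child, label)


def leaves(node):
    """Return the words dominated by a node, in surface order."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1]:
        words.extend(leaves(child))
    return words


# Hypothetical annotated sentence, for illustration only.
example = "(IP (NP (D o) (N homem)) (VB chegou))"
tree, _ = parse(tokenize(example))
noun_phrases = [" ".join(leaves(np)) for np in find(tree, "NP")]
print(noun_phrases)
```

Keeping the tree as plain (label, children) pairs makes simple label lookups easy to write; configurations that involve dominance or precedence between several nodes are what a dedicated search program is for.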
CORDIAL-SIN is a dialect corpus of European Portuguese. The materials for this corpus were drawn from the recordings of dialect speech collected by the ATLAS team as fieldwork interviews for linguistic atlases between 1974 and 2004 in more than 200 locations in the Portuguese territory.
CORDIAL-SIN compiles a geographically representative body of selected excerpts of spontaneous and semi-directed speech from these interviews. The informants were elderly, had little formal education, lived in a rural area, and were born and raised in the location of the interview.
The corpus amounts to 600,000 words, collected from 42 locations within the continental territory of Portugal and the archipelagos of Madeira and the Azores.
The CORDIAL-SIN data are available online in written form, in the following formats: two kinds of orthographic transcripts (differing in how much detail they mark for spoken language phenomena), a POS-tagged corpus, and a syntactically annotated corpus.
Please use the following reference:
Martins, A. M. (coord.) [2000- ]. CORDIAL-SIN: Corpus Dialectal para o Estudo da Sintaxe / Syntax-oriented Corpus of Portuguese Dialects. Lisboa, Centro de Linguística da Universidade de Lisboa. URL: http://www.clul.ulisboa.pt/en/
CORDIAL-SIN is available for download as:
CORDIAL-SIN is searchable online and interoperable with other dialect corpora through the Edisyn Search Engine.
CORDIAL-SIN by Centro de Linguística da Universidade de Lisboa is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.