Corpus SANTOS - European Portuguese
Corpus of child and child-directed speech
SANTOS - European Portuguese is a corpus of child and child-directed speech, transcribed according to the CHILDES (Child Language Data Exchange System) conventions using the CLAN software (MacWhinney, 2000). It includes around 52 hours of child-adult interaction and contains 27,595 child utterances and 70,736 adult utterances. The corpus is part of AcEP (for a full description see Santos, 2006 and Santos et al., 2014) and is available in the CHILDES Database. It is annotated with part-of-speech (POS) tags using a tagger developed at CLUL (Généreux, Hendrickx & Mendes, 2012); the list of POS tags used is available separately. This corpus is registered under the following ISLRN: 532-620-702-768-3. The corpus includes data from three children, as described in the table below:
|Child||Age range (years;months.days)||MLUw range (mean length of utterance in words)||Number of files||Number of child utterances|
|INI||1;6.6 - 3;11.12||1.530 - 3.827||21||6,591|
|TOM||1;6.18 - 3;10.16||1.286 - 3.089||30||15,548|
|INM||1;5.9 - 2;9.3||1.345 - 2.834||16||5,456|
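The figures in the table can be cross-checked with a few lines of Python. This is only an illustrative sketch: the dictionary layout and the helper name `age_in_months` are hypothetical, the values are copied from the table above, and the conversion assumes the CHILDES age notation `years;months.days` with a month approximated as 30 days.

```python
# Per-child figures copied from the table above (layout is illustrative).
CHILDREN = {
    "INI": {"first_age": "1;6.6", "last_age": "3;11.12", "files": 21, "utterances": 6591},
    "TOM": {"first_age": "1;6.18", "last_age": "3;10.16", "files": 30, "utterances": 15548},
    "INM": {"first_age": "1;5.9", "last_age": "2;9.3", "files": 16, "utterances": 5456},
}


def age_in_months(age: str) -> float:
    """Convert a CHILDES age string 'years;months.days' to months.

    Assumption: one month is approximated as 30 days.
    """
    years, rest = age.split(";")
    months, _, days = rest.partition(".")
    return int(years) * 12 + int(months) + (int(days) if days else 0) / 30


# Totals across the three children.
total_utterances = sum(child["utterances"] for child in CHILDREN.values())
total_files = sum(child["files"] for child in CHILDREN.values())

if __name__ == "__main__":
    print("child utterances:", total_utterances)
    print("files:", total_files)
    for name, child in CHILDREN.items():
        first = age_in_months(child["first_age"])
        last = age_in_months(child["last_age"])
        print(f"{name}: recorded from about {first:.1f} to {last:.1f} months")
```

The per-child utterance counts sum to the 27,595 child utterances reported in the corpus description.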
Any work that uses this corpus as a source of information should cite:
Santos, A. L. (2006). Minimal Answers. Ellipsis, Syntax and Discourse in the Acquisition of European Portuguese. Ph.D. Dissertation. Universidade de Lisboa. (Published 2009, Amsterdam / Philadelphia: John Benjamins).
Santos, A. L., M. Généreux, A. Cardoso, C. Agostinho, S. Abalada (2014). A corpus of European Portuguese child and child-directed speech. In Proceedings of the 9th Conference on Language Resources and Evaluation – LREC 2014. European Language Resources Association (ELRA).
This corpus (or its previous versions) has been used as the basis for other databases:
Santos, Ana Lúcia, Maria João Freitas & Aida Cardoso (2014). CEPLEXicon - A Lexicon of Child European Portuguese. Lisboa: Anagrama (CLUL, FLUL). ISLRN: 408-817-203-152-3, ELRA ID: ELRA-L0094