Corpus SANTOS - European Portuguese

Corpus of child and child-directed speech 


Santos - European Portuguese is a corpus of child and child-directed speech, transcribed according to the conventions of CHILDES (Child Language Data Exchange System) using the CLAN software (MacWhinney, 2000). It comprises around 52 hours of child-adult interaction, with 27,595 child utterances and 70,736 adult utterances. The corpus is part of AcEP (for a full description see Santos, 2006 and Santos et al., 2014) and is available in the CHILDES Database. It is annotated with a part-of-speech tagger developed at CLUL (Généreux, Hendrickx & Mendes, 2012); the POS tagset is documented separately. The corpus is registered under the following ISLRN: 532-620-702-768-3.

The corpus includes data from three children, as described in the table:

Child   Age (years;months.days)   MLUw            Number of files   Number of child’s utterances
INI     1;6.6 - 3;11.12           1.530 - 3.827   21                6,591
TOM     1;6.18 - 3;10.16          1.286 - 3.089   30                15,548
INM     1;5.9 - 2;9.3             1.345 - 2.834   16                5,456
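
The per-child counts in the table add up to the 27,595 child utterances reported for the corpus as a whole. As a minimal sketch, the Python snippet below checks that sum and shows one way utterances per speaker might be counted in a CHAT-formatted transcript, where main tier lines begin with `*`, a speaker code, and `:`. The function name `count_utterances`, the sample text, and the speaker codes `CHI` and `MOT` are assumptions made for illustration; they are not taken from the corpus, whose files may use other codes, and CLAN's own counting applies further rules that are not reproduced here.

```python
from collections import Counter

# Per-child utterance counts copied from the table above.
CHILD_UTTERANCES = {"INI": 6591, "TOM": 15548, "INM": 5456}


def count_utterances(chat_text: str) -> Counter:
    """Count main-tier lines per speaker code in CHAT-formatted text.

    Header lines ('@...') and dependent tiers ('%...') are skipped, and
    so are continuation lines, because none of them start with '*'.
    """
    counts = Counter()
    for line in chat_text.splitlines():
        if line.startswith("*") and ":" in line:
            speaker = line[1:line.index(":")]
            counts[speaker] += 1
    return counts


# Hypothetical sample transcript (illustration only, not corpus data).
SAMPLE = "@Begin\n*MOT:\tolha .\n*CHI:\tbola .\n*CHI:\tmais .\n@End\n"

if __name__ == "__main__":
    # Sum of the per-child counts from the table.
    print(sum(CHILD_UTTERANCES.values()))
    # Utterances per speaker in the hypothetical sample.
    print(dict(count_utterances(SAMPLE)))
```

This simple line count is only an approximation of what CLAN reports; for published figures, CLAN's own commands should be used.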


Any work that uses this corpus as a source of information should cite:

Santos, A. L. (2006). Minimal Answers. Ellipsis, Syntax and Discourse in the Acquisition of European Portuguese. Ph.D. Dissertation. Universidade de Lisboa. (Published 2009, Amsterdam / Philadelphia: John Benjamins).

Santos, A. L., M. Généreux, A. Cardoso, C. Agostinho, S. Abalada (2014) A corpus of European Portuguese child and child-directed speech. In Proceedings of the 9th Conference on Language Resources and Evaluation – LREC 2014. European Language Resources Association (ELRA).

This corpus (or one of its previous versions) has served as the basis for the following databases:

Santos, Ana Lúcia, Maria João Freitas & Aida Cardoso (2014) CEPLEXicon - A Lexicon of Child European Portuguese. Lisboa: Anagrama (CLUL, FLUL). ISLRN: 408-817-203-152-3, ELRA ID: ELRA-L0094

CDS_EP - A lexicon of child directed speech for European Portuguese from the FrePOP database