C-ORAL-ROM - Integrated Reference Corpora for Spoken Romance Languages
Multimedia edition; tools of analysis; standard linguistic measures for validation in HLT.

Information Society Technologies (IST) Programme - European Commission - Directorate-General Information Society - Action Line: IST-2000-3.3.1, Key Action 3, Contract Number IST-2000-26228.
Project approved in December 2000.
Università degli studi di Firenze (UFIR.DIT) - Italy - Coordinator
Université de Provence (UPRO) - France
Fundação da Universidade de Lisboa - Centro de Linguística da Universidade de Lisboa (FUL-CLUL) - Portugal
Universidad Autónoma de Madrid (UAM) - Spain

Assistant Contractors:
Pitch Instruments France S.A.R.L. (PITCHFRANCE)
Editions Honoré Champion (CHAMPION)
European Language Resources Distribution Agency S.A.R.L. (ELDA)
Istituto Trentino di Cultura (ITC-irst)
Instituto Cervantes (IC)

Advisory and Assessment Board:
CSELT (Telecom Italia - I)
PT - Inovação (Portugal Telecom - P)
Telefonica I+D (E)
IPO, Center for User-System Interaction (Eindhoven University of Technology - NL)
INaLF (Institut National de la Langue Française - F)
École Pratique des Hautes Études (F)
Universiteit Gent - Collate Research Network (B)

CLUL's Research Team:
Maria Fernanda Bacelar do Nascimento (coordinator)
Maria Lúcia Garcia Marques
José Bettencourt Gonçalves
Rita Veloso
Sandra Antunes
Nuno Martins
Florbela Barreto
Raquel Amaro
Beginning of the Project:
January 2001
Project Status:
Concluded in March 2004
Description of the project:

The C-ORAL-ROM project aimed to expand the language resources available for spoken language by designing, building and distributing a multilingual corpus of spontaneous speech in the four main Romance languages (Spanish, Portuguese, French and Italian), with 300,000 words per language covering both formal and informal speech. The resource comprises several components:
- a multimedia corpus containing, for each text: the acoustic source; the orthographic transcription in CHAT format, enriched with the tagging of terminal and non-terminal prosodic breaks; session metadata; text-to-speech synchronization in Win Pitch Corpus format, based on the alignment of each transcribed utterance; and a second orthographic transcription with lemma and PoS annotation;
- software tools for speech analysis (Win Pitch Corpus, developed by Pitch France);
- a concordance extraction tool (Contextes, developed by Jean Véronis);
- appendixes.
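
As a rough illustration of how a transcription tagged with prosodic breaks might be processed, the sketch below splits one speaker line into prosodic units. It is only a sketch under stated assumptions: the markers "//" (terminal break) and "/" (non-terminal break), the "*XXX:" speaker prefix, the sample utterance and the function name split_prosodic_units are all illustrative and are not taken from the description above; the actual conventions are documented with the corpus itself.

```python
import re

# ASSUMPTIONS (illustrative only, not confirmed by the project description):
#   - a speaker line starts with "*XXX:" (three-character speaker code)
#   - "//" marks a terminal prosodic break, "/" a non-terminal one
SPEAKER_LINE = re.compile(r"^\*(\w{3}):\s*(.*)$")
BREAK = re.compile(r"\s*(//|/)\s*")  # "//" is tried before "/"

KIND = {"//": "terminal", "/": "non-terminal", None: "unmarked"}


def split_prosodic_units(line):
    """Return (speaker, [(text, break_kind), ...]) or None if not a speaker line."""
    match = SPEAKER_LINE.match(line)
    if match is None:
        return None
    speaker, text = match.group(1), match.group(2)
    # With a capturing group, re.split alternates: text, marker, text, marker, ...
    parts = BREAK.split(text)
    units = []
    for i in range(0, len(parts), 2):
        chunk = parts[i].strip()
        if not chunk:
            continue  # skip the empty remainder after a final break marker
        marker = parts[i + 1] if i + 1 < len(parts) else None
        units.append((chunk, KIND[marker]))
    return speaker, units


# Hypothetical sample line, invented for this sketch.
sample = "*MAR: bom dia / como está // tudo bem //"
print(split_prosodic_units(sample))
```

Counting the units labelled "terminal" per speaker would give a simple utterance count under these assumptions.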
This resource represents the variety of speech acts performed in everyday language and enables the induction of prosodic and syntactic structures in the four languages, from both a quantitative and a qualitative point of view. C-ORAL-ROM adds value in areas such as corpus design, dialogue representation, prosodic annotation, PoS tagging, multimedia storage and speech analysis. It is also worth mentioning its usefulness as a representative multilingual resource designed for the validation of HLT (Human Language Technologies). This resource is available in two formats:
- One with full access to explore the materials, available on 8 DVDs (DVDs 1-2 French; DVDs 3-4 Italian; DVDs 5-6 Portuguese; DVDs 7-8 Spanish), distributed by ELDA. All collections have the same folder structure, which directly mirrors the C-ORAL-ROM corpus design.
- An encrypted version on 1 DVD, which does not allow, for example, full concordance extraction. It accompanies the book C-ORAL-ROM, published by John Benjamins Publishing Company in 2005. The book describes the four sub-corpora and the procedures and choices each team made in building and preparing them (lemmatization, tagging, etc.), and includes comparative linguistic studies of lexical and structural strategies in the four languages, along with models and standard linguistic measures of spoken language variability.
A more detailed description of this project is available at


