C-ORAL-ROM - Integrated Reference Corpora for Spoken Romance Languages
Description of the project:
The C-ORAL-ROM project aimed to expand the language resources available for spoken language by building and distributing a multilingual corpus of spontaneous speech in the four main Romance languages (Spanish, Portuguese, French and Italian), with 300,000 words per language, covering both formal and informal speech. The resource comprises several components:
- a multimedia corpus containing, for each text: the acoustic source; the orthographic transcription in CHAT format, enriched with tagging of terminal and non-terminal prosodic breaks; session metadata; text-to-speech synchronization in Win Pitch Corpus format, based on the alignment of each transcribed utterance; and a second orthographic transcription with lemma and PoS annotation;
- software tools for speech analysis (Win Pitch Corpus, developed by Pitch France);
- a concordance extraction tool (Contextes, developed by Jean Véronis).
This resource represents the variety of speech acts performed in everyday language and enables the induction of prosodic and syntactic structures in the four languages, from both a quantitative and a qualitative point of view. C-ORAL-ROM offers added value in areas such as corpus design, dialogue representation, prosodic annotation, PoS tagging, multimedia storage and speech analysis. It is also useful as a representative multilingual resource designed for the validation of HLT (Human Language Technologies). This resource is available in two formats:
- One with full access to explore the materials, available on 8 DVDs (DVDs 1-2 French; DVDs 3-4 Italian; DVDs 5-6 Portuguese; DVDs 7-8 Spanish), distributed by ELDA. All collections have the same folder structure, which directly mirrors the C-ORAL-ROM corpus design.
- An encrypted version, which does not allow, for example, full concordance extraction, available on 1 DVD that accompanies the book C-ORAL-ROM, published by John Benjamins Publishing Company in 2005. The book describes the four sub-corpora and the procedures and choices of each team in building and preparing them (lemmatization, tagging, etc.), and presents comparative linguistic studies of lexical and structural strategies in the four languages, along with models and standard linguistic measures of spoken language variability.