The C-ORAL-ROM project aimed to expand the language resources available for spoken language by building and distributing a multilingual corpus of spontaneous speech in the four main Romance languages (Spanish, Portuguese, French and Italian), with 300,000 words per language, covering both formal and informal speech.
The resource comprises several components:
- a multimedia corpus containing, for each text: the acoustic source; the orthographic transcription in CHAT format, enriched with tagging of terminal and non-terminal prosodic breaks; session metadata; text-to-speech synchronization in Win Pitch Corpus format, based on the alignment of each transcribed utterance; and a second orthographic transcription with lemma and PoS annotation;
- software tools for speech analysis (Win Pitch Corpus, developed by Pitch France);
- a concordance extraction tool (Contextes, developed by Jean Véronis).
This resource represents the variety of speech acts performed in everyday language and enables the induction of prosodic and syntactic structures in the four languages, from both a quantitative and a qualitative point of view. C-ORAL-ROM offers added value in corpus design, dialogue representation, prosodic annotation, PoS tagging, multimedia storage and speech analysis. It is also useful as a representative multilingual resource designed for the validation of HLT (Human Language Technologies).
This resource is available in two formats:
- A version with full access to explore the materials, available on 8 DVDs (DVDs 1-2 French; DVDs 3-4 Italian; DVDs 5-6 Portuguese; DVDs 7-8 Spanish), distributed by ELDA. All collections have the same folder structure, which directly mirrors the C-ORAL-ROM corpus design.
- An encrypted version on 1 DVD, which does not allow, for example, full concordance extraction. It accompanies the book C-ORAL-ROM, published by John Benjamins Publishing Company in 2005. The book describes the four sub-corpora and the procedures and choices of each team in their constitution and preparation (lemmatization, tagging, etc.), and includes comparative linguistic studies of lexical and structural strategies in the four languages, as well as models and standard linguistic measures of spoken language variability.
A more detailed description of this project is available at http://lablita.dit.unifi.it/coralrom.
Bacelar do Nascimento, M. F. (2002), "Quelques considérations sur la constitution et l'exploitation d'un corpus de portugais parlé", in SCARANO, A. (ed.), Macrosyntaxe et pragmatique: l'analyse de la langue orale, Bulzoni, Roma, LABLITA preprint, November 2002, pp. 221-228.
Bacelar do Nascimento, M. F., E. Cresti, M. Moneglia, A. Moreno Sandoval, J. Veronis, P. Martin, K. Choucri, V. Mapelli, D. Falavigna, A. Cid and C. Blum (2002), "The C-ORAL-ROM Project. New methods for spoken language archives in a multilingual romance corpus", in M. C. RODRIGUES and C. SUAREZ ARAUJO (eds.), Proceedings of the Third International Conference on Language Resources and Evaluation (LREC), Paris: ELRA, vol. 1, pp. 2-10.
Bacelar do Nascimento, M. F., A. Mendes and R. Amaro (2003), "Reusing Available Resources for Tagging a Spoken Portuguese Corpus", in TASHA'2003: Workshop on Tagging and Shallow Processing of Portuguese, Universidade de Lisboa, October 2003.
Bacelar do Nascimento, M. F., Bettencourt Gonçalves, J., Veloso, R., Antunes, S., Martins, N., Barreto, F., Amaro, R. and Garcia Marques, M. L. (2006), "C-ORAL-ROM: Integrated Reference Corpora for Spoken Romance Languages", MOSTRA DE LINGUÍSTICA - A Linguística em Portugal: estado da arte, projectos e produtos, CD publication.