Author: Perpétua Gonçalves (Universidade Eduardo Mondlane (UEM), Mozambique)

Collaboration: Luísa Alice Santos Pereira (Centro de Linguística da Universidade de Lisboa (CLUL), Portugal)


The Spoken Corpus Mozambique (SCM) is a linguistic resource of the “Cátedra de Português Língua Segunda e Estrangeira”, sponsored by UEM and Camões - Instituto da Cooperação e da Língua.

SCM is a monolingual corpus of Mozambican Portuguese recorded in 1986-87. It comprises 40 interviews totalling about 140,000 words.

SCM can be accessed at CLUL's CQPWeb platform. The corpus is also available at META-SHARE, a European platform responsible for the distribution of linguistic resources and tools. It is distributed under a Metashare Commons-BY-SA license.

Data collection

SCM was collected in 1986-87 by Perpétua Gonçalves, as part of her PhD thesis, A construção de uma gramática do Português em Moçambique: Aspectos da estrutura argumental dos verbos (Universidade de Lisboa, 1991).

The corpus is composed of interviews with 40 students of the Faculty of Education of UEM. At the time of data collection, these students were attending a Teacher Training Program for Portuguese (7th, 8th and 9th classes). Some of these interviews are available in audio format on request.

Subjects were informed of the general objective of the research - a study of the syntax of Mozambican Portuguese - and the interview topics were chosen in advance. The corpus is thus semi-spontaneous: given the general objective of the research, subjects were meant to produce long discourse sequences.

To collect a sample as representative as possible of the linguistic diversity of Mozambique, the main criterion for selecting subjects was the diversity of their Bantu mother tongues. Thus, despite the limitations of collecting data from a relatively small population, the corpus was produced by speakers of 11 Mozambican Bantu languages, in addition to Portuguese and Swahili: Ronga/Changana - 16 subjects (40%); Cindau - 5 (12.5%); Xitshwa - 4 (10%); Macua - 3 (7.5%); Sena - 3 (7.5%); Chope - 2 (5%); Portuguese - 2 (5%); Chuabo - 1 (2.5%); Nnyungwe - 1 (2.5%); Maconde - 1 (2.5%); Swahili - 1 (2.5%); Cimanyika - 1 (2.5%).
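The percentages in this breakdown follow directly from the subject counts. The short Python sketch below recomputes them, so the distribution can be checked or reused. The counts are copied from the list above; the names `counts`, `total`, `share` and `shares` are illustrative only and are not part of any SCM tooling.

```python
# Subject counts per mother tongue, copied from the breakdown above.
# Variable and function names are illustrative assumptions.
counts = {
    "Ronga/Changana": 16,
    "Cindau": 5,
    "Xitshwa": 4,
    "Macua": 3,
    "Sena": 3,
    "Chope": 2,
    "Portuguese": 2,
    "Chuabo": 1,
    "Nnyungwe": 1,
    "Maconde": 1,
    "Swahili": 1,
    "Cimanyika": 1,
}

# Total number of subjects across all mother tongues.
total = sum(counts.values())


def share(language):
    """Return the percentage of subjects with the given mother tongue."""
    return 100 * counts[language] / total


shares = {language: share(language) for language in counts}

# Print one line per language: count and percentage of all subjects.
for language, pct in shares.items():
    print(f"{language}: {counts[language]} ({pct:g}%)")
```

The counts add up to the 40 subjects reported for the corpus, and each percentage is simply the count divided by that total.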

Subjects were between 19 and 22 years old, and most (about 90%) were male.

Information about each subject can be accessed here.