Corpus PF

Corpus Português Fundamental

Download the corpus

A new version of this corpus is freely available at ELRA's Catalog. It consists of audio files in WAV format, aligned transcriptions in XML EXMARaLDA format and transcriptions in plain text and HTML formats. The plain text files also have automatically assigned PoS-tag information.
This resource has been assigned the International Standard Language Resource Number (ISLRN) 812-337-422-842-3.
More information can be found in

This project started in 1970 under the leadership of Luís Filipe Lindley Cintra. Its objective was to provide information on the Portuguese vocabulary most often used in situations of everyday life. To establish this vocabulary, two corpora were compiled: the Frequency Corpus and the Availability Corpus.

The Frequency Corpus is a corpus of spoken language collected between 1970 and 1974. The original collection comprises 1800 recordings (500 hours) made in Continental Portugal and the Islands, in situations of spontaneous oral communication on different themes of everyday life, with speakers of different ages and social and professional backgrounds. Of these 1800 conversations, 1400 were selected and a total of 700,000 words were transcribed; this transcribed material is what is called the Frequency Corpus.
From the Frequency Corpus, the complete list of occurring word forms was extracted (25,107 different word forms in total), together with their frequency of occurrence. This list was lemmatized and used to establish the set of lemmas with a frequency equal to or greater than 40; for lemmas with between 40 and 60 occurrences, their distribution across texts was also taken into consideration. The resulting list is the Frequency Vocabulary (Vocabulário de Frequência).
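
The selection procedure described above can be sketched in a few lines of Python. This is only an illustration, not the project's actual method: the source gives the frequency threshold (40) and the band in which text distribution was considered (40 to 60 occurrences), but not the distribution criterion itself. The `min_texts` parameter, the `lemma_of` lookup table and the function name `frequency_vocabulary` are therefore hypothetical, as is the choice to treat both ends of the band as inclusive.

```python
from collections import Counter, defaultdict


def frequency_vocabulary(tokens, lemma_of, min_freq=40, band_top=60, min_texts=5):
    """Select lemmas for a frequency vocabulary.

    tokens:    iterable of (text_id, word_form) pairs from the transcriptions
    lemma_of:  dict mapping a lowercased word form to its lemma (hypothetical
               stand-in for the lemmatization step)
    min_texts: ASSUMED distribution criterion; the source does not state it
    """
    freq = Counter()            # lemma -> total occurrences
    texts = defaultdict(set)    # lemma -> set of texts it occurs in

    for text_id, form in tokens:
        form = form.lower()
        lemma = lemma_of.get(form, form)  # fall back to the form itself
        freq[lemma] += 1
        texts[lemma].add(text_id)

    selected = []
    for lemma, n in freq.items():
        if n < min_freq:
            continue  # below the frequency threshold
        if n <= band_top and len(texts[lemma]) < min_texts:
            continue  # in the 40-60 band but concentrated in too few texts
        selected.append((lemma, n))

    # Sort by decreasing frequency, then alphabetically.
    selected.sort(key=lambda item: (-item[1], item[0]))
    return selected
```

Lemmas above the band are kept regardless of how they are distributed; lemmas inside the band are kept only if they occur in at least `min_texts` different texts. Changing `min_texts` is the place to experiment with stricter or looser distribution requirements.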

The Availability Corpus was compiled primarily between 1970 and 1974, with a supplementary survey in 1980 covering themes which were difficult to talk about before the revolution of April 25, 1974. This compilation aimed at selecting a thematic vocabulary with a lower probability of occurrence in the spoken corpus, but admittedly essential to communication: the Availability Vocabulary. Its lower occurrence in spontaneous spoken discourse is due to the fact that this vocabulary is only used in specific contexts, and also because it is often replaced by deictics or other elements. For this purpose, directed surveys were conducted in all the district capitals by means of questionnaires, each covering a specific topic (30 themes in total, such as the human body, health and illness, travel, professions and trades, art, animals, plants, politics, working relationships). Informants were asked to indicate the nouns, adjectives and verbs most appropriate to those topics. In this way, a corpus of 481,800 thematic words was obtained.

The analysis of these two corpora resulted in the Basic Vocabulary of Portuguese, with 2217 words, published in 1984 (Português Fundamental, 1984). In 1987, two further volumes were published containing a detailed description of the methods used in compiling, analyzing and establishing the vocabulary published in 1984. They also include a set of documents resulting from these collections and analyses: a sample of the transcriptions of the recorded conversations; lemmatized lists with frequency information, sorted alphabetically and by decreasing frequency, extracted from both corpora; and a joint list of the lemmas of the two corpora (Bacelar do Nascimento et al. 1987). The sample of the transcriptions of the spoken corpus published in 1987 is available on this webpage for download.