Corpus of the Diaries of the Portuguese Parliament annotated with PoS

The PTPARL Corpus is a written political corpus composed of transcriptions of Portuguese parliament sessions, which were made available in 2004. It is considered written data because the text is extensively revised during transcription. The corpus includes 1076 texts, totaling approximately 975,806 running words of European Portuguese.

The corpus consists of a text file (the corpus) and an annotated file that contains PoS annotation at token level, including punctuation. Noun phrases are also recognized and annotated with specific tags. The PoS and NP chunk annotations were produced automatically.

This corpus can be used in linguistic research and for improving and developing Natural Language Processing tools and applications.

This resource is freely available in ELRA's Catalog and has been assigned the International Standard Language Resource Number (ISLRN) 294-303-577-819-2. More information can be found in