A Portuguese Native Language Identification Dataset

NLI-PT is the first Portuguese dataset compiled for Native Language Identification (NLI), the task of identifying an author’s first language (L1) based on their second-language writing.

The dataset includes 1,868 student essays written by learners of European Portuguese who are native speakers of the following L1s: Chinese, English, Spanish, German, Russian, French, Japanese, Italian, Dutch, Tetum, Arabic, Polish, Korean, Romanian, and Swedish. We collected data from three different sources: two learner corpora, COPLE2 and PEAPL2, and the dataset of the project "Recolha de dados de aprendizagem de português língua estrangeira". To unify the learner data collected from these sources, we applied a methodology previously used for the compilation of language variety corpora (Tan et al., 2014).


NLI-PT includes the original student text and four types of annotation: POS, fine-grained POS, constituency parses, and dependency parses. We used the LX Parser for simple POS tagging and the Portuguese morphological module of Freeling for fine-grained POS tagging. For the syntactic annotations, we used the LX Parser for constituency parsing and the DepPattern toolkit for dependency parsing.
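As an illustration of how POS annotation of this kind is often used in NLI work, the following minimal sketch counts POS n-grams in a tagged sentence. It is not the procedure from the paper: the function name `pos_ngrams`, the `(token, tag)` input format, and the tag labels are all assumptions made for this example, and they do not reflect the actual NLI-PT file format or the LX Parser tag set.

```python
from collections import Counter


def pos_ngrams(tagged_tokens, n=2):
    """Count POS n-grams in one sentence.

    tagged_tokens: list of (token, tag) pairs. This input format is a
    hypothetical one chosen for the example, not the NLI-PT file format.
    """
    # Pad with sentence-boundary markers so that n-grams also capture
    # which tags open and close the sentence.
    tags = ["<s>"] + [tag for _, tag in tagged_tokens] + ["</s>"]
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))


# Illustrative sentence; the tag labels are placeholders, not LX Parser tags.
sentence = [("Eu", "PRON"), ("gosto", "V"), ("de", "PREP"), ("Lisboa", "PROPN")]
features = pos_ngrams(sentence, n=2)
```

Counts like these, aggregated per essay, could serve as input features for a classifier that predicts the author's L1. Using tags rather than words keeps the features largely independent of essay topic.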

NLI-PT can be used not only for NLI but also for research on several topics in Second Language Acquisition and educational NLP.

NLI-PT is described in the following paper:

del Río, I., Zampieri, M. & Malmasi, S. 2018. A Portuguese Native Language Identification Dataset. The 13th Workshop on Innovative Use of NLP for Building Educational Applications, NAACL 2018. 5 June. New Orleans, USA. [pdf]


Download the dataset