Tool
CRPC-NER

Named Entity Recognizer

An NER classifier based on memory-based learning, trained on the CINTIL dataset, a corpus that contains part of the Corpus de Referência do Português Contemporâneo (CRPC, Reference Corpus of Contemporary Portuguese).

 

Availability
The tool is freely available on the PORTULAN CLARIN infrastructure.

 

Annotation
The annotation includes several categories:
EVT - Event
LOC - Location
ORG - Organization
PER - Person
WRK - Work
MSC - Miscellaneous

The tool applies a tag to each token:
/O indicates that the token is not (part of) a named entity
/B indicates that the token is the first unit of a named entity
/I indicates that the token is the middle or last unit of a named entity
For /B and /I, the entity category follows after a hyphen, as in /B-PER or /I-WRK.

The output has one sentence per line, with the tag appended to each token after a slash:

De_/O a/O parte/O de_/O a/O tarde/O ,*/O Maria/B-PER Cristina/B-PER Portugal/I-PER ,*/O advogada/O ,*/O moderou/O o/O painel/O \*"/O Restrições/B-WRK a_/I-WRK o/I-WRK Conteúdo/I-WRK de_/I-WRK a/I-WRK Publicidade/I-WRK "/O ,*/O em/O que/O se/O abordaram/O duas/O temáticas/O <utt>
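As an illustration of how this output format could be consumed, the following is a minimal Python sketch that splits a tagged line into token/tag pairs and groups the B and I tags into entities. The function names parse_tagged_line and extract_entities and the shortened sample line are hypothetical and not part of the tool. The sketch assumes that the tag is whatever follows the last slash of each whitespace-separated item, and that items without a slash (such as <utt>) carry no tag.

```python
from typing import List, Tuple

def parse_tagged_line(line: str) -> List[Tuple[str, str]]:
    """Split one output line into (token, tag) pairs (hypothetical helper)."""
    pairs = []
    for item in line.split():
        if "/" not in item:
            continue  # assumption: items without a slash (e.g. <utt>) have no tag
        # Split on the LAST slash, since a token may itself contain slashes.
        token, tag = item.rsplit("/", 1)
        pairs.append((token, tag))
    return pairs

def extract_entities(pairs: List[Tuple[str, str]]) -> List[Tuple[str, List[str]]]:
    """Group B-/I- tagged tokens into (category, tokens) entities."""
    entities = []
    current = None
    for token, tag in pairs:
        if tag.startswith("B-"):
            # A B tag always opens a new entity.
            current = (tag[2:], [token])
            entities.append(current)
        elif tag.startswith("I-") and current is not None and current[0] == tag[2:]:
            # An I tag continues the open entity of the same category.
            current[1].append(token)
        else:
            # O tag, or an I tag with no matching open entity.
            current = None
    return entities

# Shortened, hypothetical sample line in the same format as the example above.
sample = "a/O tarde/O ,*/O Maria/B-PER Portugal/I-PER moderou/O o/O painel/O <utt>"
pairs = parse_tagged_line(sample)
entities = extract_entities(pairs)
print(entities)
```

Because a B tag always starts a new entity in this sketch, two consecutive B tags of the same category are read as two separate entities.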

 

Evaluation

The NER tool was evaluated by splitting the CINTIL corpus into 50k for training and for testing.
This gave the following accuracy, precision, recall and F1 (FB1) scores on the held-out test set:

processed 211479 tokens with 10631 phrases; found: 10628 phrases; correct: 10409.
accuracy:  99.72%; precision:  97.94%; recall:  97.91%; FB1:  97.93
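The precision, recall and FB1 figures can be recomputed from the phrase counts reported above. The short Python sketch below uses the standard definitions (precision = correct / found, recall = correct / gold phrases, FB1 = harmonic mean of the two); the function name prf is hypothetical. The accuracy figure is computed per token and cannot be rederived from these counts alone.

```python
def prf(correct: int, found: int, gold: int):
    """Return (precision, recall, F1) from phrase-level counts."""
    precision = correct / found   # share of predicted phrases that are correct
    recall = correct / gold       # share of gold phrases that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts taken from the evaluation output above.
p, r, f = prf(correct=10409, found=10628, gold=10631)
print(f"precision: {100 * p:.2f}%; recall: {100 * r:.2f}%; FB1: {100 * f:.2f}")
# prints: precision: 97.94%; recall: 97.91%; FB1: 97.93
```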

Coordinator