NAMD/ptwp_tagger

Tagging Portuguese Wikipedia

We want to tag the Portuguese Wikipedia using PyPLN and Palavras (for which we have a license). The goals of this project are:

  • Release a part-of-speech tagged Portuguese Wikipedia Corpus under a Creative Commons license.
  • Train a part-of-speech tagger with NLTK and release it under a free/libre software license.

Assumptions

  • We're going to use all Portuguese Wikipedia articles (pages).
  • We will probably use the Palavras tagset, which can later be translated to NLTK's tagset.
  • We won't use an incremental tagger (the entire corpus will be loaded into memory to train an NLTK tagger).
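
The tagset translation mentioned above could be sketched as a simple lookup table. This is only an illustration: the mapping below is a hypothetical, partial table from Palavras-style word-class tags to the NLTK "universal" tag names, and `PALAVRAS_TO_UNIVERSAL` and `translate_sentence` are names invented for this example, not part of the project or of NLTK.

```python
from typing import Dict, List, Tuple

TaggedSentence = List[Tuple[str, str]]

# Hypothetical, partial mapping (an assumption for illustration only):
# Palavras-style word-class tags -> NLTK "universal" tagset names.
PALAVRAS_TO_UNIVERSAL: Dict[str, str] = {
    "N": "NOUN",     # common noun
    "PROP": "NOUN",  # proper noun
    "V": "VERB",
    "ADJ": "ADJ",
    "ADV": "ADV",
    "DET": "DET",
    "PERS": "PRON",  # personal pronoun
    "PRP": "ADP",    # preposition
    "NUM": "NUM",
    "KC": "CONJ",    # coordinating conjunction
    "KS": "CONJ",    # subordinating conjunction
}


def translate_sentence(
    sentence: TaggedSentence,
    mapping: Dict[str, str] = PALAVRAS_TO_UNIVERSAL,
    unknown: str = "X",
) -> TaggedSentence:
    """Return a copy of the sentence with each tag translated.

    Tags missing from the mapping are replaced by `unknown`.
    """
    return [(word, mapping.get(tag, unknown)) for word, tag in sentence]


# A tiny in-memory example; "FOO" is a made-up tag to show the fallback.
sentence = [("O", "DET"), ("gato", "N"), ("dorme", "V"), ("hmm", "FOO")]
translated = translate_sentence(sentence)
print(translated)
```

Keeping the mapping as plain data means it can be reviewed and corrected independently of the code that applies it.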

Next Goals

  • Split the corpus (and the tagger) by Wikipedia Portal, so that we have a tagged corpus per subject.
  • Compare taggers (Palavras versus NLTK with our trained tagger).
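
One possible way to compare the two taggers is token-level agreement over the same text. The sketch below assumes both outputs are already available as lists of `(word, tag)` pairs with identical tokenization and a shared tagset; `compare_taggings` and the sample data are hypothetical and only illustrate the idea.

```python
from collections import Counter
from typing import List, Tuple

TaggedSentence = List[Tuple[str, str]]


def compare_taggings(
    reference: TaggedSentence, candidate: TaggedSentence
) -> Tuple[float, Counter]:
    """Return (agreement ratio, counts of (reference_tag, candidate_tag) mismatches).

    Both inputs must contain the same tokens in the same order.
    """
    if len(reference) != len(candidate):
        raise ValueError("taggings differ in length")
    matches = 0
    disagreements: Counter = Counter()
    for (ref_word, ref_tag), (cand_word, cand_tag) in zip(reference, candidate):
        if ref_word != cand_word:
            raise ValueError(f"token mismatch: {ref_word!r} vs {cand_word!r}")
        if ref_tag == cand_tag:
            matches += 1
        else:
            disagreements[(ref_tag, cand_tag)] += 1
    agreement = matches / len(reference) if reference else 0.0
    return agreement, disagreements


# Made-up outputs for illustration; not real Palavras or NLTK results.
palavras_output = [("O", "DET"), ("gato", "N"), ("dorme", "V"), ("bem", "ADV")]
trained_output = [("O", "DET"), ("gato", "N"), ("dorme", "N"), ("bem", "ADV")]

agreement, disagreements = compare_taggings(palavras_output, trained_output)
print(f"agreement: {agreement:.2f}")
print(disagreements.most_common())
```

The mismatch counts show which tag pairs the taggers confuse most often, which is usually more informative than the single agreement figure.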

Related Links
