GitHub - dkmfbk/biographies: Tools for biographies extracted from Wikipedia

Wikipedia is the largest collection of encyclopedic data ever written in the history of humanity. Thanks to its coverage and its availability in machine-readable format, it has become a primary resource for large-scale research in historical and cultural studies. In this work, we focus on the subset of pages describing persons, and we investigate the task of recognizing their biographical sections: given a person’s page, we identify the sections that contain information about that person’s life. We model this as a sequence classification problem and propose a supervised setting in which the training data are acquired automatically. We also show that six simple features, extracted only from the section titles, are very informative and yield results well above a strong baseline.

In this GitHub project, you'll find the material and source code for running our experiments.

The annotated material includes the training/dev/test sets of pages, the training data for CRFsuite, the results of the classification, and the annotation agreement.

To compile the source code of the project, just clone it with git and run mvn package from the shell.
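
The two steps above can be sketched as follows. This is a minimal example, assuming the repository is hosted at https://github.com/dkmfbk/biographies (a URL inferred from the project name, not stated in this README) and that git, a JDK, and Maven are already installed:

```shell
# Assumed clone URL, inferred from the project name dkmfbk/biographies.
REPO_URL="https://github.com/dkmfbk/biographies.git"

# Fetch the sources and move into the project root, where pom.xml lives.
git clone "$REPO_URL"
cd biographies

# Compile the project and package it with Maven.
mvn package
```

Unless pom.xml overrides Maven's defaults, the packaged artifacts should end up in the target/ directory.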

This software is released under the GPLv3 license.
