Splitting up TEI XML files for TagWorks #580

Open
jameshowison opened this issue Jan 9, 2019 · 2 comments

@jameshowison
Contributor

We're exploring using TagWorks to scale up tagging. They have a sensible user interface and handle recruitment via crowdsourcing, etc. Their interface lets the user reveal additional context before and after the sentence, ask about certainty, and highlight parts of the sentence (such as software name, version, etc.).

I'm looking at the TEI XML output from grobid, very cool stuff, I love the biblio recognition! My thinking is to have sentences from the <body> as codeable units. Any thoughts on how to break up the <body>?
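To make the "sentences as codeable units" idea concrete, here is a rough sketch that splits one paragraph of text into sentences and attaches the neighbouring sentences as context. The function names (`split_sentences`, `codeable_units`) and the `context` parameter are hypothetical, and the regex splitter is a naive stand-in: it will stumble on abbreviations such as "et al." or "Fig. 1", so a proper sentence segmenter would likely be needed for scientific text.

```python
import re

# Naive boundary (an assumption, not a robust segmenter): sentence-ending
# punctuation, then whitespace, then an uppercase letter.
_SENT_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(paragraph):
    """Split a paragraph string into a list of sentence strings."""
    return [s for s in _SENT_BOUNDARY.split(paragraph.strip()) if s]


def codeable_units(paragraph, context=1):
    """Return one unit per sentence, with up to `context` sentences
    before and after it that can be revealed to the annotator."""
    sentences = split_sentences(paragraph)
    units = []
    for i, sentence in enumerate(sentences):
        units.append({
            "before": sentences[max(0, i - context):i],
            "sentence": sentence,
            "after": sentences[i + 1:i + 1 + context],
        })
    return units
```

Each returned dictionary would correspond to one task, with `before` and `after` holding the context that can be revealed on request.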

@kermitt2
Member

kermitt2 commented Jan 9, 2019

For the machine learning sequence labelling, I am using the vague notion of "paragraph" as input (it's not sentence-based; I got better results by extending to a complete paragraph). More concretely, I use the following TEI sections:

  • title: <title level="a">
  • abstract: <abstract>
  • keywords: <keywords>
  • paragraph: <p>
  • item: <item> (if any, but normally always under <p> when generated by grobid)
  • figure/table caption: <figDesc>

I also process the content of the annex (if any), which is not under <body> in the TEI but under <back>.
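As an illustration only, the units listed above could be pulled out of a TEI file with Python's standard `xml.etree.ElementTree`. This is a minimal sketch, not how grobid does it internally: the function name `extract_units` is hypothetical, it assumes the standard TEI namespace, it takes only the first `<title level="a">` in the document (assumed to be the article title in the header), and it treats nested units such as an `<item>` inside a `<p>` as part of the enclosing unit.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Local tag names of the text units listed above (title is handled separately).
UNIT_TAGS = {"abstract", "keywords", "p", "item", "figDesc"}


def _text(elem):
    """All text under elem, with whitespace normalized."""
    return " ".join("".join(elem.itertext()).split())


def extract_units(tei_xml):
    """Return a list of (tag, text) units from a TEI XML string."""
    root = ET.fromstring(tei_xml)
    units = []

    # Assumption: the first <title level="a"> is the article title.
    title = root.find(f".//{TEI_NS}title[@level='a']")
    if title is not None and _text(title):
        units.append(("title", _text(title)))

    def walk(elem, inside_unit):
        local = elem.tag.replace(TEI_NS, "")
        is_unit = not inside_unit and local in UNIT_TAGS
        if is_unit:
            text = _text(elem)
            if text:
                units.append((local, text))
        # Nested units are skipped: their text is already in the parent.
        for child in elem:
            walk(child, inside_unit or is_unit)

    # Walking from the root covers both <body> and <back> (annex).
    walk(root, False)
    return units
```

Because the walk starts at the document root, paragraphs under `<back>` are picked up along with those under `<body>`.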

@kermitt2
Member

kermitt2 commented Jan 9, 2019

I don't actually know TagWorks, but it looks promising!

I have a small list of such tools, so I share it here for reference:
