SpenDM/Substance-Information-Extractor-FH

Substance-Information-Extractor-FH

Extract patient tobacco and alcohol usage information from medical documents.

Information Identified

Substance usage status:

  • Current user
  • Former user
  • History of use
  • User (details indeterminate)
  • Non-user / Never-user
  • Unknown

Usage attributes:

  • Type (e.g. cigarette, cigar)
  • Amount of Use (e.g. 1 pack per week)
  • Duration of Use
  • Quit Date
  • Time Passed since Quitting
  • Age at Time of Quitting
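
As an illustration only, the statuses and attributes listed above could be modeled as enumerations attached to a per-substance record. The names `UsageStatus`, `Attribute`, and `SubstanceRecord` are hypothetical and are not taken from this repository's code.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class UsageStatus(Enum):
    """Substance usage statuses listed in this README."""
    CURRENT = "current user"
    FORMER = "former user"
    HISTORY = "history of use"
    USER = "user (details indeterminate)"
    NON_USER = "non-user / never-user"
    UNKNOWN = "unknown"


class Attribute(Enum):
    """Usage attributes listed in this README."""
    TYPE = "type"
    AMOUNT = "amount of use"
    DURATION = "duration of use"
    QUIT_DATE = "quit date"
    TIME_SINCE_QUIT = "time passed since quitting"
    QUIT_AGE = "age at time of quitting"


@dataclass
class SubstanceRecord:
    """Hypothetical container for what is extracted about one substance."""
    substance: str  # e.g. "tobacco" or "alcohol"
    status: UsageStatus = UsageStatus.UNKNOWN
    attributes: Dict[Attribute, str] = field(default_factory=dict)


# Example: a former tobacco user with a known amount of use.
record = SubstanceRecord("tobacco", UsageStatus.FORMER)
record.attributes[Attribute.AMOUNT] = "1 pack per week"
```

Keeping the attribute values as raw text spans mirrors the fact that they are extracted from sentences; parsing them into dates or quantities would be a separate step.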

Method

The project uses supervised machine learning. Most classification steps use scikit-learn's LinearSVC, a linear support vector machine. The exception is the extraction of substance usage attributes, which uses Stanford NER.

For each substance:

  • Identify sentences that contain substance information
  • Classify usage status of patient
  • Identify usage attributes within identified sentences
  • Link attributes to the appropriate substance
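
The first two steps above can be sketched with scikit-learn as follows. This is a minimal sketch under stated assumptions, not the project's actual code: the training sentences and labels are invented, the helper names `make_classifier` and `classify_tobacco` are hypothetical, and attribute extraction with Stanford NER and attribute linking are not shown.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data, invented for illustration only.
sentences = [
    "Patient smokes one pack per day.",
    "He quit smoking five years ago.",
    "Denies any tobacco use.",
    "Blood pressure is well controlled.",
    "Follow up in two weeks.",
    "She has never smoked cigarettes.",
]
has_tobacco_info = [1, 1, 1, 0, 0, 1]

# Status labels for the tobacco-related sentences only.
tobacco_sentences = [s for s, y in zip(sentences, has_tobacco_info) if y]
tobacco_status = ["current", "former", "non-user", "non-user"]


def make_classifier():
    """Unigram and bigram counts fed into a linear SVM (hypothetical setup)."""
    return make_pipeline(CountVectorizer(ngram_range=(1, 2)), LinearSVC())


# Step 1: identify sentences that contain substance information.
sentence_filter = make_classifier().fit(sentences, has_tobacco_info)
# Step 2: classify usage status from the identified sentences.
status_classifier = make_classifier().fit(tobacco_sentences, tobacco_status)


def classify_tobacco(document_sentences):
    """Return (sentence, predicted status) pairs for flagged sentences."""
    flags = sentence_filter.predict(document_sentences)
    flagged = [s for s, flag in zip(document_sentences, flags) if flag == 1]
    if not flagged:
        return []
    return list(zip(flagged, status_classifier.predict(flagged)))


results = classify_tobacco(
    ["Patient quit smoking in 2010.", "Follow up in two weeks."]
)
```

With so little training data the predictions themselves are not meaningful; the point is the two-stage shape, where a per-substance sentence filter feeds a separate status classifier.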

Features

The project currently uses n-grams as features. The n-grams are normalized. To reduce dimensionality, unigrams of the following kinds are each replaced with a single standardized label (a word class):

  • integers
  • decimals
  • USD money values
  • percentages
