Extract patient tobacco and alcohol usage information from medical documents.
Substance usage status:
- Current user
- Former user
- History of use
- User (details indeterminate)
- Non-user / Never-user
- Unknown
Usage attributes:
- Type (e.g. cigarette, cigar, etc.)
- Amount of Use (e.g. 1 pack per week)
- Duration of Use
- Quit Date
- Time Passed since Quitting
- Age at Time of Quitting
The project uses machine learning. Scikit-learn's LinearSVC (a linear support vector machine) handles most classification tasks; the exception is substance usage attributes, which are extracted with Stanford NER.
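As a rough illustration of the classification side, the sketch below trains a LinearSVC on word n-gram counts to predict a usage status for a sentence. It is not the project's actual training code: the sentences, labels, n-gram range, and the name `status_classifier` are made up for this example, and a real model would need far more annotated data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy, invented training data; the real project trains on annotated documents.
train_sentences = [
    "Patient smokes one pack of cigarettes per day.",
    "He currently drinks two beers every night.",
    "She quit smoking ten years ago.",
    "Former drinker, stopped alcohol in 2005.",
    "Patient denies tobacco use.",
    "She has never used alcohol.",
]
train_labels = [
    "Current user",
    "Current user",
    "Former user",
    "Former user",
    "Non-user / Never-user",
    "Non-user / Never-user",
]

# Unigram and bigram counts feed a linear SVM (the n-gram range is an assumption).
status_classifier = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), lowercase=True),
    LinearSVC(),
)
status_classifier.fit(train_sentences, train_labels)

# Predict a status label for an unseen sentence.
predictions = status_classifier.predict(["She smokes cigarettes daily."])
```

With so few examples the predicted label is not meaningful; the point is only the shape of the pipeline, with a vectorizer followed by a linear classifier.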
For each substance:
- Identify sentences that contain substance information
- Classify usage status of patient
- Identify usage attributes within identified sentences
- Link attributes to the appropriate substance
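The four steps above can be sketched as a small data flow. Everything in this example is a hypothetical stand-in: keyword matching replaces the sentence classifier, a few hand-written rules replace the status classifier, and a single regular expression replaces Stanford NER. Names such as `SubstanceRecord`, `SUBSTANCE_KEYWORDS`, and `process` are invented for illustration.

```python
import re
from dataclasses import dataclass, field

# Hypothetical keyword stems; the project uses trained classifiers instead.
SUBSTANCE_KEYWORDS = {
    "tobacco": ("smok", "cigar", "tobacco"),
    "alcohol": ("alcohol", "drink", "beer", "wine"),
}

# Stand-in for NER: matches amounts such as "1 pack per day".
AMOUNT = re.compile(
    r"\b\d+(?:\.\d+)?\s+(?:packs?|drinks?|beers?)\s+(?:per|a)\s+(?:day|week)\b",
    re.IGNORECASE,
)


@dataclass
class SubstanceRecord:
    substance: str
    sentences: list = field(default_factory=list)
    status: str = "Unknown"
    attributes: list = field(default_factory=list)


def find_sentences(sentences, substance):
    """Step 1: keep sentences that mention the substance."""
    keys = SUBSTANCE_KEYWORDS[substance]
    return [s for s in sentences if any(k in s.lower() for k in keys)]


def classify_status(sentences):
    """Step 2: assign a usage status (toy rules, not the real classifier)."""
    if not sentences:
        return "Unknown"
    text = " ".join(sentences).lower()
    if "quit" in text or "former" in text:
        return "Former user"
    if "denies" in text or "never" in text:
        return "Non-user / Never-user"
    return "User (details indeterminate)"


def find_attributes(sentence):
    """Step 3: extract usage attributes from one sentence."""
    return [("Amount of Use", m.group(0)) for m in AMOUNT.finditer(sentence)]


def process(sentences):
    """Run all steps; step 4 links attributes via the sentence they came from."""
    records = {}
    for substance in SUBSTANCE_KEYWORDS:
        record = SubstanceRecord(substance)
        record.sentences = find_sentences(sentences, substance)
        record.status = classify_status(record.sentences)
        for sentence in record.sentences:
            record.attributes.extend(find_attributes(sentence))
        records[substance] = record
    return records


example = [
    "Patient quit smoking in 2010, previously 1 pack per day.",
    "Denies alcohol use.",
    "Blood pressure is normal.",
]
records = process(example)
```

One limitation of this sentence-level linking is that a sentence mentioning both substances would attach its attributes to both records, which is why a dedicated linking step is needed in practice.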
The project currently uses n-grams as features. The n-grams are normalized, and to reduce dimensionality, unigrams of the following types are each replaced with a single standardized label (word class):
- integers
- decimals
- USD money values
- percentages
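A minimal sketch of this kind of normalization follows. The label strings (`<INTEGER>`, `<DECIMAL>`, `<MONEY>`, `<PERCENT>`), the regular expressions, the lowercasing of other tokens, and the function names are assumptions made for illustration, not the project's actual rules.

```python
import re

# Hypothetical word classes, checked in order from most to least specific.
WORD_CLASSES = [
    ("<MONEY>", re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?")),
    ("<PERCENT>", re.compile(r"\d+(?:\.\d+)?%")),
    ("<DECIMAL>", re.compile(r"\d+\.\d+")),
    ("<INTEGER>", re.compile(r"\d+")),
]


def normalize_token(token):
    """Replace a numeric token with its class label; lowercase anything else."""
    for label, pattern in WORD_CLASSES:
        if pattern.fullmatch(token):
            return label
    return token.lower()


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "Drinks 2 beers nightly and spends $40 weekly".split()
normalized = [normalize_token(t) for t in tokens]
bigrams = ngrams(normalized, 2)
```

Because every integer maps to the same label, "2 beers" and "3 beers" produce the same bigram, which keeps the feature space small.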