DentonJC/cobs

COBS

Classification of biochemical sequences

The goal of this project is to develop a framework for classifying biochemical sequences. The focus is on sequence data such as FASTA files.

Models available:

  • KNN (knn)
  • Logistic regression (logreg)
  • RandomForestClassifier (rf)
  • SVC (svc)
  • Isolation Forest (if)
  • ResidualNN (residual)
  • Perceptron (perceptron)
  • Multilayer perceptron (mperceptron)

Models in progress:

  • LSTM
  • RNN

Use cobs/config.ini to configure the models.

  • rparams (type: dictionary) for basic configuration
  • gparams (type: dictionary) for randomized search configuration
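
The layout of cobs/config.ini is not shown here, so the following is only a minimal sketch of how two dictionary-valued options could be read. It assumes one section per model and dictionaries written as Python literals; the section name, keys, and values in EXAMPLE_INI are hypothetical, as is the helper read_model_params.

```python
import ast
import configparser

# Hypothetical config content: the section name and all values are
# illustrative and are not taken from the real cobs/config.ini.
EXAMPLE_INI = """
[logreg]
rparams = {"C": 1.0, "penalty": "l2"}
gparams = {"C": [0.01, 0.1, 1.0, 10.0]}
"""


def read_model_params(ini_text, model):
    """Return (rparams, gparams) dictionaries for one model section."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    section = parser[model]
    # literal_eval parses a Python literal without executing arbitrary code.
    rparams = ast.literal_eval(section["rparams"])
    gparams = ast.literal_eval(section["gparams"])
    return rparams, gparams


rparams, gparams = read_model_params(EXAMPLE_INI, "logreg")
print(rparams)
print(gparams)
```

In this sketch rparams holds one fixed value per parameter, while gparams holds a list of candidate values per parameter, which is the shape a randomized search expects.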

Known bug (Parallel): after a Keras model has been run from the experiments table, the script must be restarted.

Table of Contents

  1. Install
  2. Usage
  3. Input
  4. Output
  5. Datasets
  6. Results
  7. Resources

Install with Conda

Already installed for virtual_screening:

  • Python3: pip install configparser
  • Python2: pip install ConfigParser
  • pip install argparse

Usage

usage: Classification of biochemical sequences
              [-h] [--output OUTPUT]
              [--configs CONFIGS]
              [--n_iter N_ITER]
              [--n_jobs N_JOBS]
              [--patience PATIENCE]
              [--gridsearch]
              [--experiments_file EXPERIMENTS_FILE]
              [--length LENGTH]
              select_model [select_model ...]
              dataset_path [dataset_path ...]

positional arguments:
select_model          name of the model, select from list in README
dataset_path          path to dataset

optional arguments:
-h, --help            show this help message and exit
--output OUTPUT       path to output directory
--configs CONFIGS     path to config file
--n_iter N_ITER       number of iterations in RandomizedSearchCV
--n_jobs N_JOBS       number of jobs
--patience PATIENCE, -p PATIENCE
                      patience of fit
--gridsearch, -g      use RandomizedSearchCV
--experiments_file EXPERIMENTS_FILE, -e EXPERIMENTS_FILE
                      address where to write results of experiments
--length LENGTH, -l LENGTH
                      maximum length of sequences
--targets TARGETS, -t TARGETS
                      set number of target column
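
For readers who want to see how an interface like the one above fits together, here is a minimal argparse sketch that reproduces the documented flags. It is a reconstruction from the help text, not the project's actual parser. The integer types are assumptions, and defaults are left unset because the help text does not state them.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented command-line options."""
    parser = argparse.ArgumentParser(
        prog="Classification of biochemical sequences")
    parser.add_argument("select_model", nargs="+",
                        help="name of the model, select from list in README")
    parser.add_argument("dataset_path", nargs="+", help="path to dataset")
    parser.add_argument("--output", help="path to output directory")
    parser.add_argument("--configs", help="path to config file")
    # The int types below are assumptions; the help text gives no types.
    parser.add_argument("--n_iter", type=int,
                        help="number of iterations in RandomizedSearchCV")
    parser.add_argument("--n_jobs", type=int, help="number of jobs")
    parser.add_argument("--patience", "-p", type=int, help="patience of fit")
    parser.add_argument("--gridsearch", "-g", action="store_true",
                        help="use RandomizedSearchCV")
    parser.add_argument("--experiments_file", "-e",
                        help="address where to write results of experiments")
    parser.add_argument("--length", "-l", type=int,
                        help="maximum length of sequences")
    parser.add_argument("--targets", "-t", type=int,
                        help="set number of target column")
    return parser


# Parse the arguments of the single-experiment example from this README.
args = build_parser().parse_args(
    "logreg data/dataset.csv --n_jobs -1 --n_iter 6 --length 256 -g".split())
print(args.select_model, args.dataset_path, args.n_jobs, args.gridsearch)
```

Because both positional arguments accept one or more values, argparse gives the last positional value to dataset_path and the values before it to select_model.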

Example input

Single experiment:

python cobs/run_model.py logreg data/dataset.csv --n_jobs -1 --n_iter 6 --length 256 -g

Table of experiments:

  1. Fill in the table with the experiment parameters (examples are in /etc; an empty cell means False)
  2. Run python run.py
  3. The experiments run one by one, and their results are written into the result columns of the table

Example output

2018-01-05 19:55:57,028 [main] INFO: GRID SEARCH
2018-01-05 19:55:57,028 [main] INFO: FIT
Fitting 10 folds for each of 6 candidates, totalling 60 fits
...
[Parallel(n_jobs=-1)]: Done 60 out of 60 | elapsed: 5.7min finished
2018-01-05 20:01:53,124 [main] INFO: Accuracy test: 86.59%
2018-01-05 20:01:54,589 [main] INFO: 0:06:07.959393
Can't create history plot for this type of experiment
Report complete, you can see it in the results folder
2018-01-05 20:01:54,720 [main] INFO: Done
2018-01-05 20:01:54,720 [main] INFO: Results path: /cobs/tmp/2018-01-05 19:55:46.630191/

Datasets

Generate a dataset from local files:

  1. Put FASTA files into the data/ folder
  2. Run data/create_dataset.py

Download a dataset from the NCBI server:

  1. Configure search.ini: select the requests and the names of the labels
  2. Run data/load_dataset.py

Use a dataset from the "wild", which must have this layout:

  1. The first row contains the headers
  2. The first column contains the indexes
  3. The second column contains the sequences
  4. The third column contains the classes
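
The column layout expected of an external dataset can be checked with a short loader. The sketch below is not part of the project. It assumes a comma-separated file, and the sample rows and the function name read_dataset are purely illustrative.

```python
import csv
import io


def read_dataset(handle):
    """Read (header, sequences, labels) from a CSV laid out as
    index, sequence, class, with the header in the first row."""
    reader = csv.reader(handle)
    header = next(reader)  # first row is headers
    sequences, labels = [], []
    for row in reader:
        if not row:  # skip blank lines
            continue
        # Column 0 is the index, column 1 the sequence, column 2 the class.
        sequences.append(row[1].strip())
        labels.append(row[2].strip())
    return header, sequences, labels


# Made-up sample data, used only to show the expected layout.
SAMPLE = "index,sequence,class\n0,ACGTAC,positive\n1,TTGACA,negative\n"
header, sequences, labels = read_dataset(io.StringIO(SAMPLE))
print(header, sequences, labels)
```

For a file on disk, pass an open file object instead of the io.StringIO wrapper, for example open("data/dataset.csv", newline="").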

Results

DNA classification: Promoter Gene Sequences

Class Distribution:

  • positive instances: 53 (50%)
  • negative instances: 53 (50%)

Random split:

  • Train 70%
  • Val 9%
  • Test 21%
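
As an illustration of the 70/9/21 proportions listed above, the following sketch splits shuffled indices with the standard library only. It is not the project's splitting code: the seed, the rounding rule, and the function name split_indices are assumptions.

```python
import random


def split_indices(n, train=0.70, val=0.09, seed=0):
    """Shuffle range(n) and cut it into train, validation and test parts.
    The test part receives whatever remains after train and validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducibility
    n_train = round(n * train)
    n_val = round(n * val)
    train_idx = indices[:n_train]
    val_idx = indices[n_train:n_train + n_val]
    test_idx = indices[n_train + n_val:]
    return train_idx, val_idx, test_idx


# 106 is the number of instances in the promoter dataset (53 + 53).
train_idx, val_idx, test_idx = split_indices(106)
print(len(train_idx), len(val_idx), len(test_idx))
```

With only 106 instances the test part holds about 22 sequences, so each misclassified test sequence shifts the test accuracy by several percentage points.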

Model                 Train accuracy (%)   Test accuracy (%)
Logistic regression   89.56                88.34
Random forest         100                  93.27
SVC                   100                  89.38
Isolation Forest      17.73                20.62
KNN                   100                  87.44

DNA classification: Splice-junction Gene Sequences

Class Distribution:

  • EI: 767 (25%)
  • IE: 768 (25%)
  • Neither: 1655 (50%)

Random split:

  • Train 70%
  • Val 9%
  • Test 21%

Model                 Train accuracy (%)   Test accuracy (%)
Logistic regression   100                  77.27
Random forest         97.29                86.36
SVC                   100                  72.72
Isolation Forest      48.64                27.27
KNN                   100                  77.27

Resources

Used:

Tested:
