Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial PR #1

Merged
merged 27 commits into from
Aug 13, 2020
Merged

Initial PR #1

merged 27 commits into from
Aug 13, 2020

Conversation

Hephaestus12
Copy link
Contributor

Writing basic web app.

Copy link
Member

@karlnapf karlnapf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to see a first PR :)

@Hephaestus12
Copy link
Contributor Author

Also, what python version should I use?
Is there any requirement with respect to that?
I have used 3.7 here, but as far as I know, Shogun can only be used on version 3.5

So, I should switch to that right? Or is this okay?

@Hephaestus12
Copy link
Contributor Author

Hephaestus12 commented May 16, 2020

Also, for the web app, should I set up a database and use something like SQLAlchemy(a python wrapper for SQL) or simply use pandas and csv files?

influenza_project/web/app.py Outdated Show resolved Hide resolved
influenza_project/web/app.py Outdated Show resolved Hide resolved
influenza_project/web/requirements.txt Outdated Show resolved Hide resolved
@geektoni
Copy link

Also, what python version should I use?
Is there any requirement with respect to that?
I have used 3.7 here, but as far as I know, Shogun can only be used on version 3.5

So, I should switch to that right? Or is this okay?

I think I've answered these questions in the email I sent you :) python 3.5 would be okay for now.

@geektoni
Copy link

Also, for the web app, should I set up a database and use something like SQLAlchemy(a python wrapper for SQL) or simply use pandas and csv files?

mmh what do you plan to use the database for?

@Hephaestus12
Copy link
Contributor Author

I think I've answered these questions in the email I sent you :) python 3.5 would be okay for now.

Yes okay, I'm sorry I think I missed that email :')

@Hephaestus12
Copy link
Contributor Author

Also, for the web app, should I set up a database and use something like SQLAlchemy(a python wrapper for SQL) or simply use pandas and csv files?

mmh what do you plan to use the database for?

For displaying the data related to each day's estimated incidence, we need to send the API calls for every single keyword. Each API call takes about half a second so it doesn't make sense to recollect this everytime.
Therefore I was thinking of a database/CSV file storing the old data (which you have collected) as well as the new data everyday, so that you only need to send the api requests once.

@geektoni
Copy link

Also, for the web app, should I set up a database and use something like SQLAlchemy(a python wrapper for SQL) or simply use pandas and csv files?

mmh what do you plan to use the database for?

For displaying the data related to each day's estimated incidence, we need to send the API calls for every single keyword. Each API call takes about half a second so it doesn't make sense to recollect this everytime.
Therefore I was thinking of a database/CSV file storing the old data (which you have collected) as well as the new data everyday, so that you only need to send the api requests once.

Okay, it seems reasonable to have a database for this kind of task (you could also use it to store predicted influenza incidence, so you don't have to compute it every time).

@Hephaestus12
Copy link
Contributor Author

There are a few files that will be generated as you run the code(src package in the model directory) package in the model directory. (Combined data, pickled model, etc).
Should I add these to the .gitignore?
Or should I push these files to the repository as well?

@@ -0,0 +1,666 @@
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this notebook for?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exploratory data analysis ... I suggest you rename to make that obvious

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I'll do that.

@geektoni
Copy link

There are a few files that will be generated as you run the code(src package in the model directory) package in the model directory. (Combined data, pickled model, etc).
Should I add these to the .gitignore?
Or should I push these files to the repository as well?

It depends if those files are required to run the final pipeline. If they are not useful (e.g., __pycache__) then you should not include them in the repository.

@lgoetz
Copy link

lgoetz commented May 19, 2020

Also, what python version should I use?
Is there any requirement with respect to that?
I have used 3.7 here, but as far as I know, Shogun can only be used on version 3.5

So, I should switch to that right? Or is this okay?

Also, for the web app, should I set up a database and use something like SQLAlchemy(a python wrapper for SQL) or simply use pandas and csv files?

mmh what do you plan to use the database for?

we may ultimately want to do something like this, but for now I would concentrate on getting the pipeline up and running, i.e. use pandas if that's easier

@Hephaestus12
Copy link
Contributor Author

i guess i'm saying is that there should be a PredictMixin, that holds a general model that does model.apply/predict with the input features.... and in that case PredictMixin actually can simply be a wrapper around onnxruntime, so that totally decouples the whole SDK dependency of an ML library for deploying this for predicting things....

This seems like a really cool idea, I'll start reading about how to implement this. However, then this app will no longer be standalone. As in the idea of you pulling a docker image and running the web app out of the box may not work... if I understand you correctly.

@vigsterkr
Copy link
Member

ok here's a radical crazy thought: why do we tie this thing to be a specific model? why isn't it just model, that follows a specific interface (say in case of shogun train/apply, or in sklearn-style fit/predict), and then this becomes totally interchangeable for model type...

Yes, that's be great but afaik there are different kinds of data preprocessing to do before applying specific models? How will we deal with that?
We can do that with sklearn pipelines but coming to Shogun, if we use pipelines, we will be constrained to a limited set of preprocessing techniques which may not be enough at the moment right?

the preprocessing could be done outside of the model per se.... meaning that the model is just literally the specific ML model and how you end up having X,y data set is another separate procedure... you fix what the model should expect as input. this allows you to interchange the model under the hood...

@vigsterkr
Copy link
Member

i guess i'm saying is that there should be a PredictMixin, that holds a general model that does model.apply/predict with the input features.... and in that case PredictMixin actually can simply be a wrapper around onnxruntime, so that totally decouples the whole SDK dependency of an ML library for deploying this for predicting things....

This seems like a really cool idea, I'll start reading about how to implement this. However, then this app will no longer be standalone. As in the idea of you pulling a docker image and running the web app out of the box may not work... if I understand you correctly.

this would simply allow anybody to use the app for predicting and that you externalize how you end up getting the model. of course you would still be able to do everything within a docker image, but if you have a different model with different backend that you would like to serve with the app, that would still be possible. one wouldn't need to refactor the app and change this initial entanglement of a model to a specific sdk to be able to do predictions...

file_path = self.data_path / (country + '.csv')
self.df[country].to_csv(file_path, index=False)
# query data
print(self.df[country].columns.values)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These print statements must be gone to merge this. They are useful if you need to manual debug what you are doing, but you should not add them to the committed files :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about instead of removing start using logging https://flask.palletsprojects.com/en/1.1.x/logging/

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about instead of removing start using logging https://flask.palletsprojects.com/en/1.1.x/logging/

That's even better!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have implemented this.

@geektoni
Copy link

I've noticed that the process which the application uses to download the current pageviews is kind of slow. I guess this happens because we are using just one thread to call Wikipedia's APIs. Could it be possible to parallelize it a bit more such to make the download faster?

Copy link

@geektoni geektoni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have done some nitpicking and I have left some more comments. Hopefully, this will be last review round before merging this PR :)


def process_data(self):
for country in config.COUNTRIES:
# # separate numerical features from categorical ones

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this commented code? Otherwise, it would be better to remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really, now that everything else is running, I'll remove it.

@@ -0,0 +1,77 @@
# Influenza Estimator Web Tool

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to add a small paragraph about the Docker image and how to pull/run it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'll add that.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I try to pull the docker image I get the following error:

Using default tag: latest
Error response from daemon: manifest for tejsukhatme/influenza_estimator:latest not found: manifest unknown: manifest unknown

I guess you meant to write docker pull tejsukhatme/influenza_estimator:random_forest instead, right? :) It could be better to tag the docker image as latest so it will be downloaded automatically without the need to specify each time the tag.

@@ -0,0 +1 @@
## GSOC 2020 Influenza Project

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also here it should be nice to have some kind of brief description of what the project is about.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay fine.

make install

ENV LD_LIBRARY_PATH=/installed/shogun-install/lib
ENV PYTHONPATH=/installed/shogun-install/lib/python3.5/site-packages/shogun.py

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace with
ENV PYTHONPATH=/installed/shogun-install/lib/python3.5/site-packages/

It won't work if you leave it as it is now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yes, I had forgotten to push the updated Dockerfile.

line = line[:-1]
count = 0
try:
res = pageviewapi.per_article(project, line.strip(), start,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a quick question here, what happens if we are not able to reach Wikipedia (e.g., because maybe the website is down or the docker container cannot access to the internet)? Will the execution be caught or will the application fail?

I guess there should be some kind of safeguard/error which tells the user that we were not able to reach the website for whatever reason (or at least it should be logged somewhere).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, yes, this makes sense. I'll check which exception is thrown when that happens and log it.

for item in res['items']:
count += int(item['views'])
except ZeroOrDataNotLoadedException:
count = 0

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides catching the exception, we should write to the logs that something like this happened.

last_checked = datetime.strptime(last_checked, '%Y-%m-%d').date()
logging.info('\tlast checked at ' + str(last_checked))
if last_checked < yesterday:
logging.info('\tmaking API calls again.')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the log information saved somewhere or are they just printed on screen?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'll happen inside a file called information.log

"""
import pickle
import shogun as sg

This comment was marked as resolved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh okay, I'll add this.

## Problem statement
Reducing the impact of seasonal influenza epidemics and other pandemics such as the H1N1 is of paramount importance for public health authorities. Studies have shown that effective interventions can be taken to contain the epidemics if early detection can be made.

Seasonal influenza epidemics result in about three to five million cases of severe illness and about ​ 250,000 to 500,000 deaths worldwide each year. This is of utmost significance for public health agencies to reduce the effects of natural pandemics and epidemics such as the H1N1 influenza. Results have demonstrated that protective steps can be taken to suppress epidemics where there is early warning during outbreak germination. And monitoring and forecasting the occurrence and spread of flu in the community is critical.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Seasonal influenza epidemics result in about three to five million cases of severe illness and about ​ 250,000 to 500,000 deaths worldwide each year. This is of utmost significance for public health agencies to reduce the effects of natural pandemics and epidemics such as the H1N1 influenza. Results have demonstrated that protective steps can be taken to suppress epidemics where there is early warning during outbreak germination. And monitoring and forecasting the occurrence and spread of flu in the community is critical.
Seasonal influenza epidemics result in about three to five million cases of severe illness and about ​ 250,000 to 500,000 deaths worldwide each year. This is of utmost significance for public health agencies to reduce the effects of natural pandemics and epidemics such as H1N1 influenza. Results have demonstrated that protective steps can be taken to suppress epidemics where there is early warning during outbreak germination. Monitoring and forecasting the occurrence and spread of flu in the community is critical.


Seasonal influenza epidemics result in about three to five million cases of severe illness and about ​ 250,000 to 500,000 deaths worldwide each year. This is of utmost significance for public health agencies to reduce the effects of natural pandemics and epidemics such as the H1N1 influenza. Results have demonstrated that protective steps can be taken to suppress epidemics where there is early warning during outbreak germination. And monitoring and forecasting the occurrence and spread of flu in the community is critical.

In the EU, there are several government institutions which track incidents of influenza-like disease (ILI)​ by gathering statistics from sentinel care activities that provide virological(the study of viruses and the diseases) statistics as well as clinical details, such as physicians reporting on the number of patients observed presenting influenza-like disease, obtaining and releasing information on a weekly basis.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
In the EU, there are several government institutions which track incidents of influenza-like disease (ILI)​ by gathering statistics from sentinel care activities that provide virological(the study of viruses and the diseases) statistics as well as clinical details, such as physicians reporting on the number of patients observed presenting influenza-like disease, obtaining and releasing information on a weekly basis.
In the EU, there are several government institutions which track incidents of influenza-like disease (ILI)​ by gathering statistics from sentinel care activities that provide virological statistics as well as clinical details, such as physicians reporting on the number of patients observed presenting influenza-like disease, obtaining and releasing information on a weekly basis.

as well as state-level ILI activity

## Project Description
This project is majorly based on the findings of David J. McIver and John S. Brownstein. In their research paper titled Wikipedia Usage Estimates Prevalence of Influenza-Like Illness in the United States in Near Real-Time, they mention how it’s possible to use Wikipedia pageviews data to estimate the incidence of influenza related illnesses.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should add a link to the original paper for reference (e.g., https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003581)


To collect data the following sources were used:

Austria: FluNet surveillance tool

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should add link references for each of these services.

It has previously been shown that Wikipedia can be a useful tool to monitor the emergence of breaking news stories, to track what topics are ‘‘trending’’ in
the public sphere, and to develop tools for natural language processing. Furthermore, Wikipedia makes all of this information public and freely available, greatly increasing and expediting any potential research studies that aim to make use of their data.

In an attempt to use Wikipedia data to estimate ILI activity, some researchers compiled a list of Wikipedia articles that were likely to be related to influenza, influenza-like activity, or to health in general. These articles were selected based on previous knowledge of the subject area, previously published materials, and expert opinion. This data is all available in this zenodo dataset.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing link to zenodo?

@@ -0,0 +1,47 @@
import shogun as sg

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need this file here? Otherwise it could be wise to remove it.

@@ -0,0 +1,150 @@
from pathlib import Path

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of this file? Is it testing all the methods you wrote so far?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No no, it is the code that split the data into training and testing datasets for judging which model is better.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah okay, I was confused by the name of the file :)

@@ -0,0 +1,77 @@
# Influenza Estimator Web Tool

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I try to pull the docker image I get the following error:

Using default tag: latest
Error response from daemon: manifest for tejsukhatme/influenza_estimator:latest not found: manifest unknown: manifest unknown

I guess you meant to write docker pull tejsukhatme/influenza_estimator:random_forest instead, right? :) It could be better to tag the docker image as latest so it will be downloaded automatically without the need to specify each time the tag.


```commandline
docker pull tejsukhatme/influenza_estimator
docker run tejsukhatme/influenza_estimator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
docker run tejsukhatme/influenza_estimator
docker run -it -p 5000:5000 tejsukhatme/influenza_estimator

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of adding different models in the future so this nomenclature would help.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmh then I would update at least the pull command, since the current one doesn't work.

@@ -13,18 +13,20 @@ def load_features(path):
df = pd.read_csv(path)
features = df.drop(columns=['incidence'])
return features.values
return None

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good! However, since these methods can return None, what does happen where these methods are called? Are you throwing an error if they return None? Are there any checks?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I haven't implemented any such checks. Also, for now this code isn't being used as we are doing everything(training and applying) in the web directory. Should I throw the errors and make checks?

I don't think we need to train the model separately and put the serialized version as training on the go takes negligible time, right?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest you to update the model's code to do some checks when loading the features (e.g., if load_features returns None then print an error to the user and exit).

In general, it would be better to provide already serialized models for many good reasons. However, since we had problems with that, let's skip it for now. We could do it later on.

@Hephaestus12
Copy link
Contributor Author

Should I resolve the conversations? I have implemented most of the changes.

@karlnapf
Copy link
Member

I think it would be good to merge this huge thing soon, and then to work on smaller PRs
@Hephaestus12 I think it would be good if you could post a TODO list here (and in all bigger PRs in fact), and indicate what is done/missing

  • example
  • bla

@geektoni
Copy link

Should I resolve the conversations? I have implemented most of the changes.

You should resolve the conversations only if you implemented the related changes. I also second Heiko's idea. Please make a list of the things which have still to be done so to have an idea of the missing bits. Then I think we can merge.

@Hephaestus12
Copy link
Contributor Author

I made an entirely new install directory and installed from scratch, but I still get this error when trying to call GLM. Why might that be?

Class GLM with primitive type SGOBJECT does not exist.

@Hephaestus12
Copy link
Contributor Author

And when I go to the python terminal and type import shogun
I am getting this:

ImportError: /home/tejsukhatme/anaconda3/envs/python3.5/lib/python3.5/_shogun.so: undefined symbol: _ZNK6shogun8SGObject12shallow_copyEv

This is the entire error message: https://pastebin.com/fSNDrFsq

@gf712
Copy link
Member

gf712 commented Aug 5, 2020

And when I go to the python terminal and type import shogun
I am getting this:

ImportError: /home/tejsukhatme/anaconda3/envs/python3.5/lib/python3.5/_shogun.so: undefined symbol: _ZNK6shogun8SGObject12shallow_copyEv

This is the entire error message: https://pastebin.com/fSNDrFsq

Are you talking about the docker image?

@Hephaestus12
Copy link
Contributor Author

Hephaestus12 commented Aug 5, 2020

Are you talking about the docker image?

No, I was trying to get it to work on my machine first.

@gf712
Copy link
Member

gf712 commented Aug 5, 2020

That’s because the newly compiled shogun library is not in the library path

@karlnapf
Copy link
Member

@Hephaestus12 I think if you have those issues, either drop by irc and ask, or send an email to the list. That is a quicker way to get an answer than here in this huge PR :)

import pandas as pd


def load_features(path):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference between these methods and the ones in src/utils.py?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utils.py are the utility functions for the web app, this is just for the model. However, I've refactored the code a little and removed this for now.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay!

@gf712 gf712 merged commit 37a1c18 into shogun-toolbox:master Aug 13, 2020
@karlnapf
Copy link
Member

🥳

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants