-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial PR #1
Initial PR #1
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good to see a first PR :)
Also, what python version should I use? So, I should switch to that right? Or is this okay? |
Also, for the web app, should I set up a database and use something like SQLAlchemy(a python wrapper for SQL) or simply use pandas and csv files? |
I think I've answered these questions in the email I sent you :) python 3.5 would be okay for now. |
mmh what do you plan to use the database for? |
Yes okay, I'm sorry I think I missed that email :') |
For displaying the data related to each day's estimated incidence, we need to send the API calls for every single keyword. Each API call takes about half a second so it doesn't make sense to recollect this everytime. |
Okay, it seems reasonable to have a database for this kind of task (you could also use it to store predicted influenza incidence, so you don't have to compute it every time). |
There are a few files that will be generated as you run the code( |
@@ -0,0 +1,666 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this notebook for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exploratory data analysis ... I suggest you rename to make that obvious
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I'll do that.
It depends if those files are required to run the final pipeline. If they are not useful (e.g., |
we may ultimately want to do something like this, but for now I would concentrate on getting the pipeline up and running, i.e. use pandas if that's easier |
This seems like a really cool idea, I'll start reading about how to implement this. However, then this app will no longer be standalone. As in the idea of you pulling a docker image and running the web app out of the box may not work... if I understand you correctly. |
the preprocessing could be done outside of the model per se.... meaning that the model is just literally the specific ML model and how you end up having |
this would simply allow anybody to use the app for predicting and that you externalize how you end up getting the model. of course you would still be able to do everything within a docker image, but if you have a different model with different backend that you would like to serve with the app, that would still be possible. one wouldn't need to refactor the app and change this initial entanglement of a model to a specific sdk to be able to do predictions... |
file_path = self.data_path / (country + '.csv') | ||
self.df[country].to_csv(file_path, index=False) | ||
# query data | ||
print(self.df[country].columns.values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These print
statements must be gone to merge this. They are useful if you need to manual debug what you are doing, but you should not add them to the committed files :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about instead of removing start using logging https://flask.palletsprojects.com/en/1.1.x/logging/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about instead of removing start using logging https://flask.palletsprojects.com/en/1.1.x/logging/
That's even better!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have implemented this.
I've noticed that the process which the application uses to download the current pageviews is kind of slow. I guess this happens because we are using just one thread to call Wikipedia's APIs. Could it be possible to parallelize it a bit more such to make the download faster? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have done some nitpicking and I have left some more comments. Hopefully, this will be last review round before merging this PR :)
|
||
def process_data(self): | ||
for country in config.COUNTRIES: | ||
# # separate numerical features from categorical ones |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this commented code? Otherwise, it would be better to remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really, now that everything else is running, I'll remove it.
@@ -0,0 +1,77 @@ | |||
# Influenza Estimator Web Tool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to add a small paragraph about the Docker image and how to pull/run it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'll add that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I try to pull the docker image I get the following error:
Using default tag: latest
Error response from daemon: manifest for tejsukhatme/influenza_estimator:latest not found: manifest unknown: manifest unknown
I guess you meant to write docker pull tejsukhatme/influenza_estimator:random_forest
instead, right? :) It could be better to tag the docker image as latest
so it will be downloaded automatically without the need to specify each time the tag.
@@ -0,0 +1 @@ | |||
## GSOC 2020 Influenza Project |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also here it should be nice to have some kind of brief description of what the project is about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay fine.
make install | ||
|
||
ENV LD_LIBRARY_PATH=/installed/shogun-install/lib | ||
ENV PYTHONPATH=/installed/shogun-install/lib/python3.5/site-packages/shogun.py |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replace with
ENV PYTHONPATH=/installed/shogun-install/lib/python3.5/site-packages/
It won't work if you leave it as it is now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yes, I had forgotten to push the updated Dockerfile.
line = line[:-1] | ||
count = 0 | ||
try: | ||
res = pageviewapi.per_article(project, line.strip(), start, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a quick question here, what happens if we are not able to reach Wikipedia (e.g., because maybe the website is down or the docker container cannot access to the internet)? Will the execution be caught or will the application fail?
I guess there should be some kind of safeguard/error which tells the user that we were not able to reach the website for whatever reason (or at least it should be logged somewhere).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, yes, this makes sense. I'll check which exception is thrown when that happens and log it.
for item in res['items']: | ||
count += int(item['views']) | ||
except ZeroOrDataNotLoadedException: | ||
count = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Besides catching the exception, we should write to the logs that something like this happened.
last_checked = datetime.strptime(last_checked, '%Y-%m-%d').date() | ||
logging.info('\tlast checked at ' + str(last_checked)) | ||
if last_checked < yesterday: | ||
logging.info('\tmaking API calls again.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the log information saved somewhere or are they just printed on screen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'll happen inside a file called information.log
""" | ||
import pickle | ||
import shogun as sg | ||
|
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh okay, I'll add this.
## Problem statement | ||
Reducing the impact of seasonal influenza epidemics and other pandemics such as the H1N1 is of paramount importance for public health authorities. Studies have shown that effective interventions can be taken to contain the epidemics if early detection can be made. | ||
|
||
Seasonal influenza epidemics result in about three to five million cases of severe illness and about 250,000 to 500,000 deaths worldwide each year. This is of utmost significance for public health agencies to reduce the effects of natural pandemics and epidemics such as the H1N1 influenza. Results have demonstrated that protective steps can be taken to suppress epidemics where there is early warning during outbreak germination. And monitoring and forecasting the occurrence and spread of flu in the community is critical. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seasonal influenza epidemics result in about three to five million cases of severe illness and about 250,000 to 500,000 deaths worldwide each year. This is of utmost significance for public health agencies to reduce the effects of natural pandemics and epidemics such as the H1N1 influenza. Results have demonstrated that protective steps can be taken to suppress epidemics where there is early warning during outbreak germination. And monitoring and forecasting the occurrence and spread of flu in the community is critical. | |
Seasonal influenza epidemics result in about three to five million cases of severe illness and about 250,000 to 500,000 deaths worldwide each year. This is of utmost significance for public health agencies to reduce the effects of natural pandemics and epidemics such as H1N1 influenza. Results have demonstrated that protective steps can be taken to suppress epidemics where there is early warning during outbreak germination. Monitoring and forecasting the occurrence and spread of flu in the community is critical. |
|
||
Seasonal influenza epidemics result in about three to five million cases of severe illness and about 250,000 to 500,000 deaths worldwide each year. This is of utmost significance for public health agencies to reduce the effects of natural pandemics and epidemics such as the H1N1 influenza. Results have demonstrated that protective steps can be taken to suppress epidemics where there is early warning during outbreak germination. And monitoring and forecasting the occurrence and spread of flu in the community is critical. | ||
|
||
In the EU, there are several government institutions which track incidents of influenza-like disease (ILI) by gathering statistics from sentinel care activities that provide virological(the study of viruses and the diseases) statistics as well as clinical details, such as physicians reporting on the number of patients observed presenting influenza-like disease, obtaining and releasing information on a weekly basis. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the EU, there are several government institutions which track incidents of influenza-like disease (ILI) by gathering statistics from sentinel care activities that provide virological(the study of viruses and the diseases) statistics as well as clinical details, such as physicians reporting on the number of patients observed presenting influenza-like disease, obtaining and releasing information on a weekly basis. | |
In the EU, there are several government institutions which track incidents of influenza-like disease (ILI) by gathering statistics from sentinel care activities that provide virological statistics as well as clinical details, such as physicians reporting on the number of patients observed presenting influenza-like disease, obtaining and releasing information on a weekly basis. |
as well as state-level ILI activity | ||
|
||
## Project Description | ||
This project is majorly based on the findings of David J. McIver and John S. Brownstein. In their research paper titled Wikipedia Usage Estimates Prevalence of Influenza-Like Illness in the United States in Near Real-Time, they mention how it’s possible to use Wikipedia pageviews data to estimate the incidence of influenza related illnesses. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should add a link to the original paper for reference (e.g., https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003581)
|
||
To collect data the following sources were used: | ||
|
||
Austria: FluNet surveillance tool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should add link references for each of these services.
It has previously been shown that Wikipedia can be a useful tool to monitor the emergence of breaking news stories, to track what topics are ‘‘trending’’ in | ||
the public sphere, and to develop tools for natural language processing. Furthermore, Wikipedia makes all of this information public and freely available, greatly increasing and expediting any potential research studies that aim to make use of their data. | ||
|
||
In an attempt to use Wikipedia data to estimate ILI activity, some researchers compiled a list of Wikipedia articles that were likely to be related to influenza, influenza-like activity, or to health in general. These articles were selected based on previous knowledge of the subject area, previously published materials, and expert opinion. This data is all available in this zenodo dataset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing link to zenodo?
@@ -0,0 +1,47 @@ | |||
import shogun as sg |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need this file here? Otherwise it could be wise to remove it.
@@ -0,0 +1,150 @@ | |||
from pathlib import Path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the purpose of this file? Is it testing all the methods you wrote so far?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No no, it is the code that split the data into training and testing datasets for judging which model is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah okay, I was confused by the name of the file :)
@@ -0,0 +1,77 @@ | |||
# Influenza Estimator Web Tool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I try to pull the docker image I get the following error:
Using default tag: latest
Error response from daemon: manifest for tejsukhatme/influenza_estimator:latest not found: manifest unknown: manifest unknown
I guess you meant to write docker pull tejsukhatme/influenza_estimator:random_forest
instead, right? :) It could be better to tag the docker image as latest
so it will be downloaded automatically without the need to specify each time the tag.
|
||
```commandline | ||
docker pull tejsukhatme/influenza_estimator | ||
docker run tejsukhatme/influenza_estimator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docker run tejsukhatme/influenza_estimator | |
docker run -it -p 5000:5000 tejsukhatme/influenza_estimator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking of adding different models in the future so this nomenclature would help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mmh then I would update at least the pull command, since the current one doesn't work.
@@ -13,18 +13,20 @@ def load_features(path): | |||
df = pd.read_csv(path) | |||
features = df.drop(columns=['incidence']) | |||
return features.values | |||
return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good! However, since these methods can return None
, what does happen where these methods are called? Are you throwing an error if they return None
? Are there any checks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I haven't implemented any such checks. Also, for now this code isn't being used as we are doing everything(training and applying) in the web directory. Should I throw the errors and make checks?
I don't think we need to train the model separately and put the serialized version as training on the go takes negligible time, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest you to update the model's code to do some checks when loading the features (e.g., if load_features
returns None
then print an error to the user and exit).
In general, it would be better to provide already serialized models for many good reasons. However, since we had problems with that, let's skip it for now. We could do it later on.
Should I resolve the conversations? I have implemented most of the changes. |
I think it would be good to merge this huge thing soon, and then to work on smaller PRs
|
You should resolve the conversations only if you implemented the related changes. I also second Heiko's idea. Please make a list of the things which have still to be done so to have an idea of the missing bits. Then I think we can merge. |
I made an entirely new install directory and installed from scratch, but I still get this error when trying to call GLM. Why might that be?
|
And when I go to the python terminal and type
This is the entire error message: https://pastebin.com/fSNDrFsq |
Are you talking about the docker image? |
No, I was trying to get it to work on my machine first. |
That’s because the newly compiled shogun library is not in the library path |
@Hephaestus12 I think if you have those issues, either drop by irc and ask, or send an email to the list. That is a quicker way to get an answer than here in this huge PR :) |
import pandas as pd | ||
|
||
|
||
def load_features(path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the difference between these methods and the ones in src/utils.py
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
utils.py are the utility functions for the web app, this is just for the model. However, I've refactored the code a little and removed this for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay!
🥳 |
Writing basic web app.