Initial PR #1

Hephaestus12 · 2020-05-15T15:33:45Z

Writing basic web app.

influenza_project/.gitignore

influenza_project/README.md

influenza_project/web/keywords/keywords_belgium.txt

influenza_project/web/keywords/keywords_netherlands.txt

influenza_project/web/requirements.txt

influenza_project/web/app.py

karlnapf

Good to see a first PR :)

Hephaestus12 · 2020-05-16T02:08:16Z

Also, what python version should I use?
Is there any requirement with respect to that?
I have used 3.7 here, but as far as I know, Shogun can only be used on version 3.5

So, I should switch to that right? Or is this okay?

Hephaestus12 · 2020-05-16T04:54:41Z

Also, for the web app, should I set up a database and use something like SQLAlchemy(a python wrapper for SQL) or simply use pandas and csv files?

influenza_project/web/app.py

influenza_project/web/requirements.txt

geektoni · 2020-05-16T08:42:11Z

Also, what python version should I use?
Is there any requirement with respect to that?
I have used 3.7 here, but as far as I know, Shogun can only be used on version 3.5

So, I should switch to that right? Or is this okay?

I think I've answered these questions in the email I sent you :) python 3.5 would be okay for now.

geektoni · 2020-05-16T08:43:18Z

Also, for the web app, should I set up a database and use something like SQLAlchemy(a python wrapper for SQL) or simply use pandas and csv files?

mmh what do you plan to use the database for?

Hephaestus12 · 2020-05-16T09:21:02Z

I think I've answered these questions in the email I sent you :) python 3.5 would be okay for now.

Yes okay, I'm sorry I think I missed that email :')

Hephaestus12 · 2020-05-16T09:25:31Z

Also, for the web app, should I set up a database and use something like SQLAlchemy(a python wrapper for SQL) or simply use pandas and csv files?

mmh what do you plan to use the database for?

For displaying the data related to each day's estimated incidence, we need to send API calls for every single keyword. Each API call takes about half a second, so it doesn't make sense to recollect this every time.
Therefore I was thinking of a database/CSV file storing the old data (which you have collected) as well as the new data every day, so that you only need to send the API requests once.

geektoni · 2020-05-16T10:11:33Z

Also, for the web app, should I set up a database and use something like SQLAlchemy(a python wrapper for SQL) or simply use pandas and csv files?

mmh what do you plan to use the database for?

For displaying the data related to each day's estimated incidence, we need to send API calls for every single keyword. Each API call takes about half a second, so it doesn't make sense to recollect this every time.
Therefore I was thinking of a database/CSV file storing the old data (which you have collected) as well as the new data every day, so that you only need to send the API requests once.

Okay, it seems reasonable to have a database for this kind of task (you could also use it to store predicted influenza incidence, so you don't have to compute it every time).
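
For reference, a minimal sketch of the caching idea discussed above, using the standard-library sqlite3 module rather than SQLAlchemy. The table name `pageviews`, the function `get_views`, and the `fetch` callback are all assumptions made for illustration; `fetch` stands in for whatever actually calls the Wikipedia API.

```python
import sqlite3
from datetime import date


def get_views(conn, keyword, day, fetch):
    """Return cached page views for (keyword, day); call fetch only on a miss."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pageviews ("
        "keyword TEXT, day TEXT, views INTEGER, PRIMARY KEY (keyword, day))"
    )
    row = conn.execute(
        "SELECT views FROM pageviews WHERE keyword = ? AND day = ?",
        (keyword, day.isoformat()),
    ).fetchone()
    if row is not None:
        return row[0]  # cache hit: no API request needed
    views = fetch(keyword, day)  # cache miss: one slow API call
    conn.execute(
        "INSERT INTO pageviews (keyword, day, views) VALUES (?, ?, ?)",
        (keyword, day.isoformat(), views),
    )
    conn.commit()
    return views
```

With this shape, each (keyword, day) pair costs one API request the first time it is asked for, and later page loads read from the database. The same table could hold predicted incidence in another column or a second table.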

Hephaestus12 · 2020-05-19T07:45:31Z

A few files are generated when you run the code (the src package in the model directory), e.g. the combined data, the pickled model, etc.
Should I add these to the .gitignore?
Or should I push these files to the repository as well?

karlnapf · 2020-05-19T09:07:26Z

gsoc_application_projects/2020/influenza/model/notebooks/eda.ipynb

@@ -0,0 +1,666 @@
+{


what is this notebook for?

exploratory data analysis ... I suggest you rename to make that obvious

Okay, I'll do that.

gsoc_application_projects/2020/influenza/model/src/model.py

gsoc_application_projects/2020/influenza/model/src/util.py

geektoni · 2020-05-19T09:16:07Z

A few files are generated when you run the code (the src package in the model directory), e.g. the combined data, the pickled model, etc.
Should I add these to the .gitignore?
Or should I push these files to the repository as well?

It depends if those files are required to run the final pipeline. If they are not useful (e.g., __pycache__) then you should not include them in the repository.

lgoetz · 2020-05-19T09:45:08Z

Also, what python version should I use?
Is there any requirement with respect to that?
I have used 3.7 here, but as far as I know, Shogun can only be used on version 3.5

So, I should switch to that right? Or is this okay?

Also, for the web app, should I set up a database and use something like SQLAlchemy(a python wrapper for SQL) or simply use pandas and csv files?

mmh what do you plan to use the database for?

we may ultimately want to do something like this, but for now I would concentrate on getting the pipeline up and running, i.e. use pandas if that's easier

Hephaestus12 · 2020-06-09T08:45:29Z

i guess what i'm saying is that there should be a PredictMixin, that holds a general model that does model.apply/predict with the input features.... and in that case PredictMixin actually can simply be a wrapper around onnxruntime, so that totally decouples the whole SDK dependency of an ML library for deploying this for predicting things....

This seems like a really cool idea, I'll start reading about how to implement this. However, then this app will no longer be standalone. As in the idea of you pulling a docker image and running the web app out of the box may not work... if I understand you correctly.

vigsterkr · 2020-06-10T04:25:18Z

ok here's a radical crazy thought: why do we tie this thing to be a specific model? why isn't it just model, that follows a specific interface (say in case of shogun train/apply, or in sklearn-style fit/predict), and then this becomes totally interchangeable for model type...

Yes, that'd be great, but afaik there are different kinds of data preprocessing to do before applying specific models? How will we deal with that?
We can do that with sklearn pipelines, but with Shogun, if we use pipelines, we will be constrained to a limited set of preprocessing techniques, which may not be enough at the moment, right?

the preprocessing could be done outside of the model per se.... meaning that the model is just literally the specific ML model and how you end up having X,y data set is another separate procedure... you fix what the model should expect as input. this allows you to interchange the model under the hood...

vigsterkr · 2020-06-10T04:29:50Z

i guess what i'm saying is that there should be a PredictMixin, that holds a general model that does model.apply/predict with the input features.... and in that case PredictMixin actually can simply be a wrapper around onnxruntime, so that totally decouples the whole SDK dependency of an ML library for deploying this for predicting things....

This seems like a really cool idea, I'll start reading about how to implement this. However, then this app will no longer be standalone. As in the idea of you pulling a docker image and running the web app out of the box may not work... if I understand you correctly.

this would simply allow anybody to use the app for predicting and that you externalize how you end up getting the model. of course you would still be able to do everything within a docker image, but if you have a different model with different backend that you would like to serve with the app, that would still be possible. one wouldn't need to refactor the app and change this initial entanglement of a model to a specific sdk to be able to do predictions...
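
A rough sketch of what this decoupling could look like, assuming plain duck typing. Only the name `PredictMixin` comes from the discussion; `estimate`, `ApplyAdapter`, `MeanModel`, and `InfluenzaApp` are hypothetical names, and the toy model stands in for a real backend (Shogun, sklearn, or an onnxruntime session), none of which is called here.

```python
class PredictMixin:
    """Holds some model object and forwards predictions to it.

    The only assumption about the model is that it has a
    predict(features) method (sklearn-style).
    """

    model = None

    def estimate(self, features):
        return list(self.model.predict(features))


class ApplyAdapter:
    """Wraps a backend that exposes apply(features) so it looks like predict()."""

    def __init__(self, backend):
        self.backend = backend

    def predict(self, features):
        return self.backend.apply(features)


class MeanModel:
    """Toy stand-in for a trained model: predicts the mean of each feature row."""

    def predict(self, features):
        return [sum(row) / len(row) for row in features]


class InfluenzaApp(PredictMixin):
    """Hypothetical app class; it never imports an ML library itself."""

    def __init__(self, model):
        self.model = model
```

Preprocessing stays outside: whatever produces `features` has to deliver them in the shape the model expects, and swapping the model means passing a different object to `InfluenzaApp`.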

geektoni · 2020-06-10T08:25:13Z

gsoc_application_projects/2020/influenza/web/influenza_estimator/util.py

-            file_path = self.data_path / (country + '.csv')
-            self.df[country].to_csv(file_path, index=False)
+        # query data
+        print(self.df[country].columns.values)


These print statements must be gone to merge this. They are useful if you need to manually debug what you are doing, but you should not add them to the committed files :)

what about instead of removing start using logging https://flask.palletsprojects.com/en/1.1.x/logging/

what about instead of removing start using logging https://flask.palletsprojects.com/en/1.1.x/logging/

That's even better!

I have implemented this.
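
For anyone reading along, a small sketch of the print-to-logging swap using only the standard-library logging module (Flask's own logger builds on it). The logger name and the helper `describe_columns` are made up for this example and are not taken from the PR.

```python
import logging

# Hypothetical logger name; any module-level logger works the same way.
logger = logging.getLogger("influenza_estimator")


def describe_columns(columns):
    """Log the column names at DEBUG level instead of printing them."""
    logger.debug("columns: %s", list(columns))
    return len(columns)
```

Debug output can then be switched on or off through the logging level, so nothing has to be deleted before committing.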

geektoni · 2020-06-10T08:33:38Z

I've noticed that the process the application uses to download the current pageviews is kind of slow. I guess this happens because we are using just one thread to call Wikipedia's APIs. Would it be possible to parallelize it a bit more so as to make the download faster?
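
One possible way to do this, sketched with the standard-library thread pool. `fetch_all` and the `fetch_one` callback are assumed names; `fetch_one` stands in for the function that requests the page views of a single keyword. Threads fit here because the work is waiting on the network rather than computing.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(keywords, fetch_one, max_workers=8):
    """Call fetch_one(keyword) for every keyword on a pool of threads.

    Returns a dict mapping each keyword to its result. If fetch_one raises,
    the exception propagates to the caller when the results are collected.
    """
    keywords = list(keywords)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(fetch_one, keywords)  # map preserves input order
        return dict(zip(keywords, results))
```

The right value for `max_workers` is a guess here; it would have to respect whatever request rate the API tolerates.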

geektoni

I have done some nitpicking and I have left some more comments. Hopefully, this will be the last review round before merging this PR :)

geektoni · 2020-06-18T08:43:17Z

gsoc_application_projects/2020/influenza/model/src/process.py

+
+    def process_data(self):
+        for country in config.COUNTRIES:
+            # # separate numerical features from categorical ones


Do we need this commented code? Otherwise, it would be better to remove it.

Not really, now that everything else is running, I'll remove it.

geektoni · 2020-06-18T08:45:00Z

gsoc_application_projects/2020/influenza/web/README.md

@@ -0,0 +1,77 @@
+# Influenza Estimator Web Tool


It would be nice to add a small paragraph about the Docker image and how to pull/run it.

Yes, I'll add that.

If I try to pull the docker image I get the following error:

Using default tag: latest Error response from daemon: manifest for tejsukhatme/influenza_estimator:latest not found: manifest unknown: manifest unknown

I guess you meant to write docker pull tejsukhatme/influenza_estimator:random_forest instead, right? :) It could be better to tag the docker image as latest so it will be downloaded automatically without needing to specify the tag each time.

geektoni · 2020-06-18T08:48:09Z

gsoc_application_projects/2020/influenza/README.md

@@ -0,0 +1 @@
+## GSOC 2020 Influenza Project


Here too, it would be nice to have a brief description of what the project is about.

geektoni · 2020-06-18T08:51:50Z

gsoc_application_projects/2020/influenza/web/Dockerfile

+    make install
+
+ENV LD_LIBRARY_PATH=/installed/shogun-install/lib
+ENV PYTHONPATH=/installed/shogun-install/lib/python3.5/site-packages/shogun.py


Replace with
ENV PYTHONPATH=/installed/shogun-install/lib/python3.5/site-packages/

It won't work if you leave it as it is now.

Oh yes, I had forgotten to push the updated Dockerfile.

geektoni · 2020-06-18T08:57:12Z

gsoc_application_projects/2020/influenza/web/influenza_estimator/util.py

+                line = line[:-1]
+                count = 0
+                try:
+                    res = pageviewapi.per_article(project, line.strip(), start,


Just a quick question here: what happens if we are not able to reach Wikipedia (e.g., because the website is down or the docker container cannot access the internet)? Will the exception be caught or will the application fail?

I guess there should be some kind of safeguard/error which tells the user that we were not able to reach the website for whatever reason (or at least it should be logged somewhere).

Hmm, yes, this makes sense. I'll check which exception is thrown when that happens and log it.

geektoni · 2020-06-18T08:57:54Z

gsoc_application_projects/2020/influenza/web/influenza_estimator/util.py

+                    for item in res['items']:
+                        count += int(item['views'])
+                except ZeroOrDataNotLoadedException:
+                    count = 0


Besides catching the exception, we should write to the logs that something like this happened.
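
A sketch of what catch-and-log could look like, which would also cover the case where Wikipedia cannot be reached. The function `count_views`, the `fetch` callback, and the `errors` parameter are assumptions for illustration; the caller decides which exception types count as expected (for example the ZeroOrDataNotLoadedException from the diff plus whatever the HTTP client raises when the network is down, which still has to be checked).

```python
import logging

logger = logging.getLogger("influenza_estimator")  # hypothetical logger name


def count_views(article, fetch, errors):
    """Sum the views for one article; log and return 0 if the fetch fails.

    errors is a tuple of exception types that should be treated as
    "no data" rather than crashing the application.
    """
    try:
        res = fetch(article)
        return sum(int(item["views"]) for item in res["items"])
    except errors as exc:
        logger.warning("could not get page views for %r: %s", article, exc)
        return 0
```

Anything not listed in `errors` still propagates, so unexpected bugs are not silently turned into zero counts.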

geektoni · 2020-06-18T08:58:43Z

gsoc_application_projects/2020/influenza/web/influenza_estimator/util.py

+            last_checked = datetime.strptime(last_checked, '%Y-%m-%d').date()
+        logging.info('\tlast checked at ' + str(last_checked))
+        if last_checked < yesterday:
+            logging.info('\tmaking API calls again.')


Is the log information saved somewhere or is it just printed on screen?

It'll be saved to a file called information.log
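
For illustration, one way to route log records to such a file with the standard library. The helper `configure_logging` and the format string are assumptions; only the file name information.log comes from the reply above.

```python
import logging


def configure_logging(path="information.log", level=logging.INFO):
    """Attach a file handler with timestamps to the root logger."""
    handler = logging.FileHandler(path)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    root = logging.getLogger()
    root.setLevel(level)
    root.addHandler(handler)
    return handler
```

Inside a Docker container the file lives in the container's filesystem, so it would need a mounted volume (or a stream handler in addition) to be visible from outside.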

Hephaestus12 · 2020-06-19T06:03:45Z

gsoc_application_projects/2020/influenza/model/src/model.py

+"""
+import pickle
+import shogun as sg
+


Oh okay, I'll add this.

geektoni · 2020-06-20T14:14:40Z

gsoc_application_projects/2020/influenza/README.md

+## Problem statement
+Reducing the impact of seasonal influenza epidemics and other pandemics such as the H1N1 is of paramount importance for public health authorities. Studies have shown that effective interventions can be taken to contain the epidemics if early detection can be made.
+
+Seasonal influenza epidemics result in about three to five million cases of severe illness and about  250,000 to 500,000 deaths worldwide each year. This is of utmost significance for public health agencies to reduce the effects of natural pandemics and epidemics such as the H1N1 influenza. Results have demonstrated that protective steps can be taken to suppress epidemics where there is early warning during outbreak germination. And monitoring and forecasting the occurrence and spread of flu in the community is critical.


Suggested change

Seasonal influenza epidemics result in about three to five million cases of severe illness and about 250,000 to 500,000 deaths worldwide each year. This is of utmost significance for public health agencies to reduce the effects of natural pandemics and epidemics such as the H1N1 influenza. Results have demonstrated that protective steps can be taken to suppress epidemics where there is early warning during outbreak germination. And monitoring and forecasting the occurrence and spread of flu in the community is critical.

Seasonal influenza epidemics result in about three to five million cases of severe illness and about 250,000 to 500,000 deaths worldwide each year. This is of utmost significance for public health agencies to reduce the effects of natural pandemics and epidemics such as H1N1 influenza. Results have demonstrated that protective steps can be taken to suppress epidemics where there is early warning during outbreak germination. Monitoring and forecasting the occurrence and spread of flu in the community is critical.

geektoni · 2020-06-20T14:15:43Z

gsoc_application_projects/2020/influenza/README.md

+
+Seasonal influenza epidemics result in about three to five million cases of severe illness and about  250,000 to 500,000 deaths worldwide each year. This is of utmost significance for public health agencies to reduce the effects of natural pandemics and epidemics such as the H1N1 influenza. Results have demonstrated that protective steps can be taken to suppress epidemics where there is early warning during outbreak germination. And monitoring and forecasting the occurrence and spread of flu in the community is critical.
+
+In the EU, there are several government institutions which track incidents of influenza-like disease (ILI) by gathering statistics from sentinel care activities that provide virological(the study of viruses and the diseases) statistics as well as clinical details, such as physicians reporting on the number of patients observed presenting influenza-like disease, obtaining and releasing information on a weekly basis.


Suggested change

In the EU, there are several government institutions which track incidents of influenza-like disease (ILI) by gathering statistics from sentinel care activities that provide virological(the study of viruses and the diseases) statistics as well as clinical details, such as physicians reporting on the number of patients observed presenting influenza-like disease, obtaining and releasing information on a weekly basis.

In the EU, there are several government institutions which track incidents of influenza-like disease (ILI) by gathering statistics from sentinel care activities that provide virological statistics as well as clinical details, such as physicians reporting on the number of patients observed presenting influenza-like disease, obtaining and releasing information on a weekly basis.

geektoni · 2020-06-20T14:18:10Z

gsoc_application_projects/2020/influenza/README.md

+as well as state-level ILI activity
+
+## Project Description
+This project is majorly based on the findings of David J. McIver and John S. Brownstein. In their research paper titled Wikipedia Usage Estimates Prevalence of Influenza-Like Illness in the United States in Near Real-Time, they mention how it’s possible to use Wikipedia pageviews data to estimate the incidence of influenza related illnesses.


You should add a link to the original paper for reference (e.g., https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003581)

geektoni · 2020-06-20T14:18:55Z

gsoc_application_projects/2020/influenza/README.md

+
+To collect data the following sources were used:
+
+    Austria: FluNet surveillance tool


You should add link references for each of these services.

geektoni · 2020-06-20T14:19:17Z

gsoc_application_projects/2020/influenza/README.md

+It has previously been shown that Wikipedia can be a useful tool to monitor the emergence of breaking news stories, to track what topics are ‘‘trending’’ in
+the public sphere, and to develop tools for natural language processing. Furthermore, Wikipedia makes all of this information public and freely available, greatly increasing and expediting any potential research studies that aim to make use of their data.
+
+In an attempt to use Wikipedia data to estimate ILI activity, some researchers compiled a list of Wikipedia articles that were likely to be related to influenza, influenza-like activity, or to health in general. These articles were selected based on previous knowledge of the subject area, previously published materials, and expert opinion. This data is all available in this zenodo dataset.


Missing link to zenodo?

geektoni · 2020-06-20T14:24:33Z

gsoc_application_projects/2020/influenza/model/src/sample.py

@@ -0,0 +1,47 @@
+import shogun as sg


Do we still need this file here? Otherwise it could be wise to remove it.

geektoni · 2020-06-20T14:25:24Z

gsoc_application_projects/2020/influenza/model/src/test.py

@@ -0,0 +1,150 @@
+from pathlib import Path


What is the purpose of this file? Is it testing all the methods you wrote so far?

No no, it is the code that splits the data into training and testing datasets for judging which model is better.

Ah okay, I was confused by the name of the file :)

gsoc_application_projects/2020/influenza/model/src/util.py

geektoni · 2020-06-20T14:32:04Z

gsoc_application_projects/2020/influenza/web/README.md

@@ -0,0 +1,77 @@
+# Influenza Estimator Web Tool


If I try to pull the docker image I get the following error:

Using default tag: latest Error response from daemon: manifest for tejsukhatme/influenza_estimator:latest not found: manifest unknown: manifest unknown

I guess you meant to write docker pull tejsukhatme/influenza_estimator:random_forest instead, right? :) It could be better to tag the docker image as latest so it will be downloaded automatically without needing to specify the tag each time.

geektoni · 2020-06-20T14:33:32Z

gsoc_application_projects/2020/influenza/web/README.md

+
+```commandline
+docker pull tejsukhatme/influenza_estimator
+docker run tejsukhatme/influenza_estimator


Suggested change

docker run tejsukhatme/influenza_estimator

docker run -it -p 5000:5000 tejsukhatme/influenza_estimator

I was thinking of adding different models in the future so this nomenclature would help.

mmh then I would update at least the pull command, since the current one doesn't work.

geektoni · 2020-06-22T19:30:29Z

gsoc_application_projects/2020/influenza/model/src/util.py

@@ -13,18 +13,20 @@ def load_features(path):
        df = pd.read_csv(path)
        features = df.drop(columns=['incidence'])
        return features.values
+    return None


Good! However, since these methods can return None, what happens where these methods are called? Are you throwing an error if they return None? Are there any checks?

No, I haven't implemented any such checks. Also, for now this code isn't being used, as we are doing everything (training and applying) in the web directory. Should I throw the errors and make checks?

I don't think we need to train the model separately and put the serialized version as training on the go takes negligible time, right?

I would suggest updating the model's code to do some checks when loading the features (e.g., if load_features returns None then print an error to the user and exit).

In general, it would be better to provide already serialized models for many good reasons. However, since we had problems with that, let's skip it for now. We could do it later on.
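
A minimal sketch of the suggested check. `load_or_exit` is a hypothetical wrapper; `loader` would be something like the `load_features` from the diff, passed in so that the sketch does not depend on pandas. It uses str.format rather than f-strings so it also runs on Python 3.5.

```python
import sys


def load_or_exit(path, loader):
    """Call loader(path); print an error and exit if it returns None."""
    data = loader(path)
    if data is None:
        print("error: could not load features from {}".format(path),
              file=sys.stderr)
        sys.exit(1)
    return data
```

Raising a dedicated exception instead of exiting would be the alternative if the web app should stay up and show the error to the user.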

Hephaestus12 · 2020-07-16T23:52:41Z

Should I resolve the conversations? I have implemented most of the changes.

karlnapf · 2020-07-17T09:39:06Z

I think it would be good to merge this huge thing soon, and then to work on smaller PRs
@Hephaestus12 I think it would be good if you could post a TODO list here (and in all bigger PRs in fact), and indicate what is done/missing

example
bla

geektoni · 2020-07-17T14:43:13Z

Should I resolve the conversations? I have implemented most of the changes.

You should resolve the conversations only if you implemented the related changes. I also second Heiko's idea. Please make a list of the things which still have to be done, so that we have an idea of the missing bits. Then I think we can merge.

Hephaestus12 · 2020-08-05T10:02:03Z

I made an entirely new install directory and installed from scratch, but I still get this error when trying to call GLM. Why might that be?

Class GLM with primitive type SGOBJECT does not exist.

Hephaestus12 · 2020-08-05T10:07:30Z

And when I go to the python terminal and type import shogun
I am getting this:

ImportError: /home/tejsukhatme/anaconda3/envs/python3.5/lib/python3.5/_shogun.so: undefined symbol: _ZNK6shogun8SGObject12shallow_copyEv

This is the entire error message: https://pastebin.com/fSNDrFsq

gf712 · 2020-08-05T10:37:36Z

And when I go to the python terminal and type import shogun
I am getting this:
ImportError: /home/tejsukhatme/anaconda3/envs/python3.5/lib/python3.5/_shogun.so: undefined symbol: _ZNK6shogun8SGObject12shallow_copyEv
This is the entire error message: https://pastebin.com/fSNDrFsq

Are you talking about the docker image?

Hephaestus12 · 2020-08-05T13:13:08Z

Are you talking about the docker image?

No, I was trying to get it to work on my machine first.

gf712 · 2020-08-05T18:30:32Z

That’s because the newly compiled shogun library is not in the library path

karlnapf · 2020-08-10T17:24:17Z

@Hephaestus12 I think if you have those issues, either drop by irc and ask, or send an email to the list. That is a quicker way to get an answer than here in this huge PR :)

geektoni · 2020-08-13T08:02:52Z

gsoc_application_projects/2020/influenza/web/influenza_estimator/poisson/util.py

+import pandas as pd
+
+
+def load_features(path):


What is the difference between these methods and the ones in src/utils.py?

utils.py holds the utility functions for the web app; this one is just for the model. However, I've refactored the code a little and removed this for now.

karlnapf · 2020-08-13T11:15:44Z

🥳

Hephaestus12 added 2 commits May 15, 2020 21:01

make basic web app

b2df4de

add .gitignore

c29dd86