1.) Streams Twitter data.
2.) Filters the data based on the keywords in the first column of DiseaseHashtags.csv.
3.) Organizes the Twitter data into the following document structure:
## Data structure for tweets to pass into mongo collection
my_data = {
    'id': decoded['id'],
    'text': decoded['text'],
    'place': {'country': country,
              'full_name': full_name},
    'user': {'screen_name': decoded['user']['screen_name'],
             'location': decoded['user']['location']},
    'entities': {'hashtags': hashes}
}
4.) If a mongod.exe MongoDB instance is running, streams each tweet record into a collection called 'twitter_healthcare' in a database called 'twitter'; a sketch of the whole pipeline follows this list.
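
A minimal sketch of steps 1-4, assuming the tweepy 3.x StreamListener API and a local mongod instance; the credential placeholders, the HealthcareListener name, and the defensive guards are illustrative, not the script's actual code:

import csv
import json

import tweepy
from pymongo import MongoClient

# 1.) Keywords to filter by: first column of DiseaseHashtags.csv
with open('DiseaseHashtags.csv') as f:
    keywords = [row[0] for row in csv.reader(f) if row]

collection = MongoClient()['twitter']['twitter_healthcare']

class HealthcareListener(tweepy.StreamListener):  # hypothetical name
    def on_data(self, raw_data):
        decoded = json.loads(raw_data)
        if 'id' not in decoded:  # skip limit notices / keep-alives
            return True
        place = decoded.get('place') or {}
        # 3.) Reshape the raw tweet into the document structure above
        my_data = {
            'id': decoded['id'],
            'text': decoded['text'],
            'place': {'country': place.get('country'),
                      'full_name': place.get('full_name')},
            'user': {'screen_name': decoded['user']['screen_name'],
                     'location': decoded['user']['location']},
            'entities': {'hashtags': decoded['entities']['hashtags']}
        }
        # 4.) Stream the record into twitter.twitter_healthcare
        collection.insert_one(my_data)
        return True

auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')  # placeholders
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
tweepy.Stream(auth, HealthcareListener()).filter(track=keywords)  # 2.) keyword filter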
Some basic pymongo search queries on the data being streamed into the twitter.twitter_healthcare collection.
These queries can be run dynamically against the collection while tweets are still being streamed in.
1.) location_pipeline: builds an aggregation pipeline to find the top tweet locations in the dataset.
2.) hashtag_pipeline: builds an aggregation pipeline to find the top hashtags in the dataset.
3.) project_matches_pipeline: builds an aggregation pipeline that returns the tweets from a specific location.
4.) aggregate: runs the created pipelines against the collection (hedged sketches follow this list).
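
Hedged sketches of the three pipelines, assuming hashtags are stored as Twitter's list of {'text': ...} dicts and that location comes from user.location; the exact stages and limits in the script may differ:

from pymongo import MongoClient

collection = MongoClient()['twitter']['twitter_healthcare']

# 1.) Top tweet locations, most frequent first
location_pipeline = [
    {'$group': {'_id': '$user.location', 'count': {'$sum': 1}}},
    {'$sort': {'count': -1}},
    {'$limit': 10}
]

# 2.) Top hashtags: unwind the hashtag array, then count each tag
hashtag_pipeline = [
    {'$unwind': '$entities.hashtags'},
    {'$group': {'_id': '$entities.hashtags.text', 'count': {'$sum': 1}}},
    {'$sort': {'count': -1}},
    {'$limit': 10}
]

# 3.) Tweets from one location, projected down to a few fields
def project_matches_pipeline(location):
    return [
        {'$match': {'user.location': location}},
        {'$project': {'_id': 0, 'text': 1, 'user.screen_name': 1}}
    ]

# 4.) Run a pipeline on the live collection
for doc in collection.aggregate(hashtag_pipeline):
    print(doc)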
DiseaseHashtags.csv: CSV file containing the sets of keywords to filter tweets by; the first column is the one used.
fields.txt: list of field names used when exporting the Mongo collection to CSV.
Command-line invocation to export the Mongo collection as CSV with the field names specified in fields.txt (example below).
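
The repo's exact command isn't reproduced here; one standard way to do this export is MongoDB's mongoexport tool, whose --fieldFile option reads one field name per line. The fields.txt contents below are an assumption based on the document structure above, and the output filename is a placeholder:

fields.txt (one dotted field name per line):
id
text
place.country
place.full_name
user.screen_name
user.location
entities.hashtags

mongoexport --db twitter --collection twitter_healthcare --type=csv --fieldFile fields.txt --out twitter_healthcare.csv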