We will focus on building bots to detect greenwashing.
Description of directories
Web scrapers. See column-description.csv for a description of the columns.
Put your scraped data in the data directory.
Once everything has been scraped, run the data-merger.py
script to merge all the data into one file (normally done only once).
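As a rough illustration of the merge step, the sketch below concatenates every CSV found under a directory into one file. It is not the contents of data-merger.py: the function name `merge_csvs` and its parameters are hypothetical, and it assumes the scraped files are UTF-8 CSVs with a header row.

```python
import csv
from pathlib import Path


def merge_csvs(data_dir, output_path):
    """Concatenate every CSV under data_dir into one file and return the row count.

    The output columns are the union of all input headers; a value that a
    file does not provide is written as an empty string.
    """
    data_dir = Path(data_dir)
    output_path = Path(output_path)
    rows, fieldnames = [], []
    for path in sorted(data_dir.rglob("*.csv")):
        if path.resolve() == output_path.resolve():
            continue  # do not read a previous merge result back in
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            rows.extend(reader)
    with output_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=fieldnames, restval="", extrasaction="ignore"
        )
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Taking the union of headers keeps the merge tolerant of a source that lacks one of the agreed columns, at the cost of silently producing empty cells, so checking the result against column-description.csv is still worthwhile.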
Scripts by Alex to normalize data according to the agreed columns (see column-description.csv).
CSVs of scraped data, organised by data source.
Doccano has strict data requirements, particularly on the size of the CSV files.
This Python script:
- reads data from the main data folder
- shuffles the data
- makes sure the data is in the correct format
- splits it into N CSVs of 100 rows each (a size Doccano accepts)
- saves those CSVs in the doccano_data folder
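The shuffle-and-split steps above can be sketched roughly as follows. This is an illustration, not the repository's script: `split_for_doccano`, the `doccano_NNN.csv` file naming, and the `seed` parameter are assumptions, and the format check is left out because the required format is not described here.

```python
import csv
import random
from pathlib import Path


def split_for_doccano(rows, out_dir, chunk_size=100, seed=0):
    """Shuffle rows (a list of dicts) and write CSVs of at most chunk_size rows.

    Returns the list of files written. The file naming is hypothetical.
    """
    if not rows:
        return []
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shuffled = list(rows)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    fieldnames = list(shuffled[0].keys())
    written = []
    for index, start in enumerate(range(0, len(shuffled), chunk_size)):
        path = out_dir / f"doccano_{index:03d}.csv"
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(shuffled[start:start + chunk_size])
        written.append(path)
    return written
```

A fixed seed makes the split reproducible, which helps when several annotators need to refer to the same numbered file.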
Implementation of text cleaning and document embedding
The document embedding techniques implemented:
- word2vec + average
- TF-IDF
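To make the two techniques concrete, here is a minimal standard-library sketch. It is not the repository's implementation: the function names are hypothetical, documents are assumed to be already tokenized, `word_vectors` stands in for a lookup table from a trained word2vec model, and the TF-IDF weighting is the plain, unsmoothed form (library implementations often use smoothed or normalized variants).

```python
import math
from collections import Counter


def average_embedding(tokens, word_vectors, dim):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(values) / len(known) for values in zip(*known)]


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) using raw term count times log(N / df)."""
    vocabulary = sorted({t for doc in documents for t in doc})
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(t for doc in documents for t in set(doc))
    idf = {t: math.log(n_docs / doc_freq[t]) for t in vocabulary}
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append([counts[t] * idf[t] for t in vocabulary])
    return vocabulary, vectors
```

With this weighting, a term that occurs in every document gets a weight of zero, which is the intended effect: words common to the whole corpus do not help tell documents apart.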
Evaluation of models and document embedding techniques
You can contribute to this repository by solving an issue and by suggesting improvements or changes to the code, documentation and project organisation. Look for issues labeled "good-first-issue".
Before you start working on a change, please discuss it with the owners of this repository via GitHub Issues or our Slack channel.
If you are working on an existing issue, please assign it to yourself using GitHub Issues so that others can see what you are doing and duplicated work is avoided. When you finish working on an issue, unassign yourself.
To start contributing, follow the steps below:
Clone the repo
Create a branch using git checkout -b feature-branch
Make the required changes
Commit and push your changes using the commands below
git add --all
git commit -m "your commit message"
git push origin feature-branch
Go to the repository
Create a pull request against the master branch
Add a suitable title and description to the pull request, and reference the issue number in the description if the pull request relates to an issue logged under Issues
You're done. Wait for your code to get reviewed and merged.
If you need more information then read the CONTRIBUTING guide.
We do not yet need to deploy to a production environment many times per day, so we use the GitHub flow strategy to merge changes. It is described here: https://guides.github.com/introduction/flow/. The only GitHub Flow rule is: anything in the master branch is deployable. Each change is reviewed on a feature branch, then merged into master.