Automatically update webtrackers with adblockplus and privacy badger data #152

Open
marcosmenendez opened this issue May 4, 2015 · 1 comment

Comments

@marcosmenendez

marcosmenendez commented May 4, 2015

We should update the list of trackers to be blocked, while keeping our classification for the existing trackers.

Sources to consider are:
adblockplus: https://easylist-downloads.adblockplus.org/easyprivacy.txt
privacybadger: https://github.com/EFForg/privacybadgerchrome/blob/master/doc/sample_cookieblocklist.txt

The second, as you can read in the FAQ (https://www.eff.org/privacybadger), does not have a blacklist as such, but a yellow list of sites whose cookies it blocks, though it may allow their content if the site honors the Do Not Track messages sent by the extension.

Please delete the code written for #141 and try to build a routine that, based on the above, identifies trackers that are not on our list, proposes the most appropriate category (content, advertising, etc.) and lets the admin change that category. Check that the list does not become so big that it uses too much memory.

This routine should also identify trackers that are included in our list but not in ABP, asking the admin whether we should keep or delete them.
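
A minimal sketch of what such a routine could look like, assuming our list is available as a mapping of domain to category and the external list as plain domains. The names `diff_trackers`, `TrackerDiff` and `propose_category` are hypothetical, and the default proposal of `"unclassified"` is only a placeholder for whatever category logic we end up with:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class TrackerDiff:
    # Domains found upstream but not in our list, with a proposed category
    # that the admin can later change.
    new_trackers: Dict[str, str] = field(default_factory=dict)
    # Domains in our list that upstream does not contain; the admin decides
    # whether to keep or delete them.
    missing_upstream: List[str] = field(default_factory=list)


def diff_trackers(
    ours: Dict[str, str],
    upstream: Iterable[str],
    propose_category: Callable[[str], str] = lambda domain: "unclassified",
) -> TrackerDiff:
    """Compare our classified list against an external list of domains."""
    upstream_set = {d.strip().lower() for d in upstream if d.strip()}
    our_domains = {d.lower() for d in ours}

    diff = TrackerDiff()
    for domain in sorted(upstream_set - our_domains):
        diff.new_trackers[domain] = propose_category(domain)
    diff.missing_upstream = sorted(our_domains - upstream_set)
    return diff
```

Existing entries are never touched, so our current classification is preserved. Both result collections would be shown to the admin for review rather than applied automatically.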

Other lists we may look at are:

  • EasyList
  • Peter Lowe list
  • EasyPrivacy
  • Malware domains
@marcosmenendez marcosmenendez changed the title from "Automatically update webtrackers with adblockplus data" to "Automatically update webtrackers with adblockplus and privacy badger data" on Sep 8, 2015
@JorgelieHD
Contributor

@marcosmenendez, @atrandafir and I have been investigating this issue carefully.

We reached some conclusions:

  1. The sources you pointed out are not very useful. These sources are not just host-based; they also contain rules based on regular expressions and files. Detecting new domains from them would need different code, since the TGD extension is host-based only, and that represents a significant workload.
  2. Some entries use only a domain, but they provide no information for identifying categories.
  3. I've been looking into the disconnect.me list and it differs in size from TGD's. The TGD list is almost twice the size of the one at https://github.com/disconnectme/disconnect-tracking-protection. This makes me think that either that list is not up to date or there is another one. If we decide to work with disconnect.me, we have to be sure that it is a reliable source, because I haven't found another one that seems more up to date.
  4. Keep in mind that the webapp currently does not handle any of this; the list is a file in the TGD extension. Every time the extension detects a domain that is in this file, it sends the domain through the API controller, and the domain is then stored in the webapp's DB. So full automation is almost impossible; the only way would be to update this file through the webapp API, comparing TGD's list (previously stored in the DB) with disconnect.me's list.
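
To illustrate point 1, here is a rough sketch of how the purely host-based rules could be pulled out of an Adblock Plus style filter list such as EasyPrivacy, skipping everything else. It relies on the filter syntax where `||host^` anchors a rule to a domain, `!` starts a comment, `@@` marks an exception and `##` marks element hiding. The function name `extract_hosts` is hypothetical, and this is not code from the TGD extension:

```python
import re

# A rule of the form ||host^ optionally followed by $options.
# Rules with paths, wildcards or regular expressions do not match.
_HOST_RULE = re.compile(r"^\|\|([a-z0-9.-]+)\^(?:\$.*)?$")


def extract_hosts(filter_lines):
    """Return the set of hosts from host-only rules in an ABP-style list."""
    hosts = set()
    for raw in filter_lines:
        line = raw.strip().lower()
        if not line or line.startswith("!") or line.startswith("["):
            continue  # blank line, comment, or [Adblock Plus x.y] header
        if line.startswith("@@") or "##" in line:
            continue  # exception rule or element hiding rule
        match = _HOST_RULE.match(line)
        if match:
            hosts.add(match.group(1))
    return hosts
```

Note that this drops any `$` options (for example `$third-party`), so a rule that upstream applies only in some contexts would become an unconditional host entry. It also yields no category information, which is the problem described in point 2.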

We have some ideas on how to do this, but looking at the disconnect.me GitHub repository I realized that there haven't been any updates since two days after deploy (almost 3 months ago). So if we decide to update our list through the API by comparing it with the disconnect.me list, and that list is not updated regularly, this work might be a waste of time.
