Can we build a heuristic for browser attribute fingerprinting? #34

birdsarah · 2019-03-11T19:54:03Z

There are some scripts that we can pick out by name that are doing browser attribute fingerprinting:

The fingerprintjs scripts
Scripts with hs-analytics in the script_url
Scripts with /akam/ in the script_url

Can we build a heuristic for browser attribute fingerprinting that pulls out these scripts?


birdsarah · 2019-03-15T09:26:52Z

I uploaded a notebook with basic examples for finding each of the scripts here: https://github.com/mozilla/overscripted/blob/master/analyses/issue_34_setup_and_dask_tips.ipynb

I also gave additional information in the chat. Pasting here too:

hs-analytics:

To get the hs-analytics scripts, something like df[df.script_url.str.contains('hs-analytics')] will return all the call rows from hs-analytics scripts.
Here's a copy I grabbed and formatted of an hs-analytics script: https://gist.github.com/birdsarah/1d47ed38da7efcc258b388c1951a992e
See the function "Fingerprint" here: https://gist.github.com/birdsarah/1d47ed38da7efcc258b388c1951a992e#file-hs-analytics-2-js-L2533
Now see what you can find in the data that corresponds to it.

akam:

Here's an akam script I de-obfuscated: https://gist.github.com/birdsarah/3150ec8860ed736aabbedeaff8299153
To get the akam scripts, use df[df.script_url.str.contains('/akam/')]

fingerprintjs2:

This is the source for fingerprint2.js, although there are a number of variants out there: https://github.com/Valve/fingerprintjs2/blob/master/fingerprint2.js
To get the script you can search for fingerprint in the script_url field.
But see https://github.com/Valve/fingerprintjs2/blob/master/fingerprint2.js#L919
I consider that line something of a signature for fingerprintjs2, so I look for calls where that value is passed, df[df.argument_0.str.contains('Cwm fjordbank glyphs vext quiz')], and then get all the script URLs where that argument is used.
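
The three filters above can be combined into one helper. This is only a minimal sketch, assuming a pandas DataFrame with the script_url and argument_0 columns used in this comment; the function name is hypothetical and does not come from the notebook.

```python
import pandas as pd

# Signature string from the fingerprint2.js line linked above.
FP2_SIGNATURE = "Cwm fjordbank glyphs vext quiz"


def known_fingerprinting_script_urls(df: pd.DataFrame) -> set:
    """Return script URLs matched by any of the three filters (hypothetical helper)."""
    # Missing values are treated as empty strings so they never match.
    url = df["script_url"].fillna("")
    arg0 = df["argument_0"].fillna("").astype(str)

    # Plain substring matches, so regex=False avoids surprises with special characters.
    hs = url.str.contains("hs-analytics", regex=False)
    akam = url.str.contains("/akam/", regex=False)
    # fingerprintjs2: find calls that pass the signature argument, whatever the URL is.
    fp2 = arg0.str.contains(FP2_SIGNATURE, regex=False)

    return set(df.loc[hs | akam | fp2, "script_url"])
```

On the full dataset the same boolean masks should carry over to a dask DataFrame, but that is untested here.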

Victory17 · 2019-03-15T10:30:31Z

Hi
I am interested in this. I want to work on it.

muskankhedia · 2019-03-15T10:32:43Z

Hi @birdsarah, I was looking at this issue, and the notebook you uploaded already performs all three tasks mentioned in it. Could you please explain in more detail what further changes are required in the notebook in order to solve this issue?

srujana121 · 2019-03-16T00:13:47Z

Hi @birdsarah, I am applying for Outreachy.

Do you think it's a good idea to detect canvas fingerprinting? I am thinking along the lines of detecting unnecessary canvas elements, but I am not entirely sure how to detect which elements are not needed.

Generally, canvas fingerprinting is done by calling the toDataURL() method. I am assuming there is no real reason for genuine scripts to get the canvas image in data URL format. Do you have any suggestions for me?

birdsarah · 2019-03-16T21:56:53Z

@srujana121 @muskankhedia I will try to answer both your questions together. @srujana121 there is no need to develop a technique for detecting fingerprinting. This has already been developed and examples are in the literature. See "Online Tracking: A 1-million-site Measurement and Analysis" and "The Web's Sixth Sense" on the reading list: https://github.com/mozilla/overscripted/wiki/Reading-List-(WIP)

In particular, the code for detecting four types of fingerprinting we're interested in (canvas, font, audio, and webrtc) is available here: https://github.com/sensor-js/OpenWPM-mobile/blob/mobile_sensors/feature_extraction/extract_features.py

@willougr has done the work of applying these heuristics to our dataset and will be submitting his code shortly. Some of the results of his work are here: https://github.com/mozilla/overscripted/blob/master/analyses/2018_12_willoughr__fingerprinting_prevalence.txt

This issue is about developing code like that shown at https://github.com/sensor-js/OpenWPM-mobile/blob/mobile_sensors/feature_extraction/extract_features.py but finding a set of rules that detect browser attribute fingerprinting, that is, the type of fingerprinting that compiles together a series of browser attributes. Again, the reading list articles describe this type of fingerprinting in more detail.

@muskankhedia the notebook supplied does not solve this issue; it only provides the code to filter some relevant scripts out of the whole dataset. The hard work is then developing a "heuristic" that picks out these scripts and others like them. By "heuristic" I mean a rule-set encoded in code that selects for specific scripts and not others.

For example, in the case of canvas fingerprinting, the heuristic in extract_features.py looks for scripts that call toDataURL but do not call save, restore, or addEventListener (along with some other things).
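
To make "a rule-set encoded in code" concrete, here is a simplified sketch of only the canvas rule as described above. It is not the code from extract_features.py, which applies further conditions. It assumes a pandas DataFrame with script_url and symbol columns, and the fully qualified symbol names are assumptions about how calls are recorded in the data.

```python
import pandas as pd

# Assumed symbol names; check them against the values actually present in the dataset.
REQUIRED = "HTMLCanvasElement.toDataURL"
EXCLUDED = {
    "CanvasRenderingContext2D.save",
    "CanvasRenderingContext2D.restore",
    "HTMLCanvasElement.addEventListener",
}


def canvas_candidates(df: pd.DataFrame) -> set:
    """Scripts that call toDataURL and none of the excluded symbols (simplified rule)."""
    # Collect the set of symbols each script touches.
    symbols_by_script = df.groupby("script_url")["symbol"].apply(set)
    return {
        url
        for url, symbols in symbols_by_script.items()
        if REQUIRED in symbols and not (symbols & EXCLUDED)
    }
```

A heuristic for browser attribute fingerprinting would have the same shape: a set of conditions evaluated per script, but over a different set of symbols.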

muskankhedia · 2019-03-17T19:42:36Z

Hi @birdsarah,

I have a question about this: do we have to make a list of the scripts used for browser attribute fingerprinting and search for each of them individually in a loop, or do we have to create a function that automatically finds such scripts based on some parameters?

birdsarah · 2019-03-17T21:46:15Z

@muskankhedia have you reviewed "https://github.com/sensor-js/OpenWPM-mobile/blob/mobile_sensors/feature_extraction/extract_features.py"? Which of the papers in the reading list have you reviewed? What did you learn from them?

Please reformat your question "In ____ the authors do _____. When I tried to do _____, I was stuck by _____. As a result I have the following question ___________."

srujana121 · 2019-03-18T13:45:38Z

@birdsarah

In "https://sensor-js.xyz/webs-sixth-sense-ccs18.pdf" the authors find the trackers by clustering the scripts that use sensor information. Here is what I have understood; can you tell me if this is what I have to do?

I have to cluster the scripts in the dataset like they did in the paper.
My heuristic would be how close a script is to clusters that are predominantly trackers.

birdsarah · 2019-03-18T18:11:54Z

Hi @srujana121 ... I'm having a little trouble answering this, so I'm going to say up front that there are no right answers here. That's the hard part about data exploration and research. There's no hidden hint in the rest of what I write here about what I think is a "best" direction. The following is just notes and observations, not direction.

What you posted is not what you have to do, but it is an approach. There are multiple ways of approaching the problem of building a heuristic.

You could keep investigating other approaches, and document their differences, strengths, and weaknesses. Or you could pursue this approach.

If you pursue the approach you outlined I would be surprised if you were able to finish an undertaking like that in a couple of weeks. But that doesn't mean you shouldn't start. But given that it's a big job think about the interim outputs. Think about documenting your background research, your methodology, and how you will measure success. This preparation and thinking work alone can be a solid contribution. In addition, thinking through questions like how you will measure success will likely help you hone your methodology. If you're moving quickly, then post that preparation document early as a PR, get feedback and start a conversation about moving your analysis along.

birdsarah · 2019-03-18T22:16:52Z

The work from @willougr has been posted: https://github.com/mozilla/overscripted/tree/master/analyses/2019_03_willougr_fingerprinting_implementation_sixth_sense - it has a small bug in it so if you're trying to run it yourself you may need to fix up the variable names for the data file path - but other than that it's good. This applies the heuristics used for detecting audio, canvas, font, and webrtc by the Sixth Sense paper to the OverScripted dataset.

birdsarah · 2019-03-18T22:28:40Z

@Victory17 I missed your message before. Permission is not required to work on issues. Just dive right in.

14Richa · 2019-03-19T19:47:25Z

Adding for clarity: browser attribute fingerprinting is a kind of browser fingerprinting in which a set of browser-specific attributes is collected and used to uniquely identify a browser. For example, it could be a hash generated with a known algorithm that concatenates attributes such as screen size, resolution, and fonts into a string and hashes that string. This hash will most likely be unique to the browser that generated it. Relevant paper.
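
As an illustration of the concatenate-and-hash idea, here is a minimal sketch. The attribute names and the choice of SHA-256 are assumptions for the example only; real fingerprinting scripts run in the browser and may use different attributes and hash functions.

```python
import hashlib


def attribute_fingerprint(attributes: dict) -> str:
    """Concatenate attributes in a fixed order and hash the resulting string."""
    # Sorting the keys makes the result independent of collection order.
    joined = "|".join(f"{key}={attributes[key]}" for key in sorted(attributes))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Hypothetical attribute values, for illustration only.
example = {
    "screen_size": "1920x1080",
    "color_depth": 24,
    "fonts": "Arial,Helvetica,Times",
}
fingerprint = attribute_fingerprint(example)
```

Two browsers that differ in any one attribute get different hashes, which is why collecting many attributes together is what makes the result identifying.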

Tikwiza · 2019-03-27T07:50:37Z

I think this is a fantastic topic, as I work in the realm of GDPR in the UK and Europe and privacy laws here in the US. Just brainstorming, but might it be a good idea to look into fingerprinting in browsers or countries where internet privacy is quite strictly regulated? Although the GDPR stays quite clear of some technology, I think this might give us a good way to establish similarities in scripts that pull the necessary data, the major changes to track fingerprinting scripts, and also to look at what is really considered true fingerprinting that identifies an individual. I will continue to look for other angles and ways to create what is needed for this.

14Richa · 2019-04-02T09:03:35Z

I added some analysis in this PR. You can use this file as well.

birdsarah · 2019-04-04T08:18:44Z

@14Richa bit confused by your last comment - is it aimed at me or tikwiza? always good to use an @ someone.

14Richa · 2019-04-04T16:38:33Z

@birdsarah oops, apologies. I added it for general discussion. The file summarizes some threads I started chasing.

birdsarah added the labels "good first issue" (Good for newcomers) and "research question" (Outstanding questions that have not been investigated yet) on Mar 11, 2019

birdsarah mentioned this issue Mar 22, 2019

Analysis on #34, calculating percentage of scripts present in dataset #74

Closed
