This repository has been archived by the owner on Dec 22, 2021. It is now read-only.

Verification / evolution of "Internet Jones" paper #26 #93

Open
wants to merge 3 commits into
base: master

noahwalugembe

In this work I have evaluated the deductions from the "Internet Jones" paper (issue #26). I have also concluded that the research findings from this paper show that third-party web tracking is a rapidly growing practice in an increasingly complex ecosystem, suggesting that users’ and policymakers’ concerns about privacy require sustained, and perhaps increasing, attention.

Contributor

@birdsarah birdsarah left a comment

There are two major problems with this PR.

Firstly, a summary of the Internet Jones paper is not the intention of issue #26. The intention of issue #26 is to apply the methodologies in the Internet Jones paper to the OverScripted dataset and compare the results from OverScripted with the results presented in Internet Jones.

In pursuit of this goal, a summary of the Internet Jones paper, in particular one that pulls out the relevant methodological parts that would be applied to OverScripted, would certainly be appropriate. Your summary, however, is not focused on the topics in Internet Jones related to issue #26.

Secondly, this summary is largely direct copies from the paper. These must be presented as direct quotes in quotation marks. In addition, copying sections of the text does not meet your stated aim of "evaluating the contribution of the paper." I am happy to be corrected, but I am not seeing your evaluation in this text.

@@ -0,0 +1,32 @@
Verification / evolution of "Internet Jones" paper #26
Buy
Contributor

Suggested change
Buy
By

Verification / evolution of "Internet Jones" paper #26
Buy
Walugembe Francis Noah
[email protected]
Contributor

It's not necessary to put your contact information in the analysis. But I think it's okay if you do (ping @mlopatka to confirm)

Contributor

I have no strong preference either way.

The license applied to this particular work is inherited from the (parent) overscripted repo. However, if Walugembe Francis Noah would like to ensure visible attribution of this specific PR in this form, I see no problem with the inclusion of contact information here.

That said, it is NOT a requirement of the Outreachy application process to associate contact information with the PR like this, and it may be worth considering the public visibility of this information.


Introduction

Third-party web tracking is the practice by which entities (“trackers”) embedded in webpages re-identify users as they browse the web, collecting information about the websites that they visit. According to Lerner, Simpson, Kohno and Roesner (2016), web tracking is typically done for the purposes of website analytics, targeted advertising, and other forms of personalization (e.g., social media content). In this work I am evaluating the contribution of "Internet Jones" paper #26 starting with its insight on TrackingExcavator and a longitudinal measurement study of third-party cookie-based web tracking on the Wayback Machine. I will also show how the third-party web tracking ecosystem has evolved since its beginnings according to the "Internet Jones" paper.
Contributor

This is a great introduction and I appreciate you clearly laying out background and the goal of your analysis.

In this work I am evaluating the contribution of "Internet Jones" paper #26 starting with...

One small thing: #26 refers to the issue, not the paper, so joining the two together like this doesn't make sense. You could omit the #26 so it reads: In this work I am evaluating the contribution of "Internet Jones" paper starting with...

One bigger thing: evaluating the Internet Jones paper was not the intention of #26. The goal of #26 was to apply applicable methodologies from the Internet Jones paper to the OverScripted dataset and compare the results we see with the results that were presented in the Internet Jones paper.

Wayback Machine

According to Lerner, Simpson, Kohno and Roesner (2016), it was discovered that The Wayback Machine provides a unique and comprehensive source of historical web data. However, it was not created for the purpose of studying third-party web tracking and is thus imperfect for that use but they stated that Nevertheless, the only way to study web tracking prior to explicit measurements targeting it is to leverage materials previously archived for other purposes which is true because it is a better approach to start from something than to reinvent from scratch. At this point I am going to mention some of the failures identified by Lerner, Simpson, Kohno and Roesner (2016).
Contributor

When you are quoting work it is very important to use quotation marks and citations to make it clear that you have used the original authors' words. Alternatively, you can rewrite evidence / claims / conclusions in your own words. Here is the passage from the original paper that is too close, in my eyes, for you to claim as your own words.

To be fair, you do say 'they state that Nevertheless...', but this needs to be 'they state that "nevertheless..."'.

Additionally, your use of "they state that" may imply that earlier parts of the sentence are your own words.

Pg 7 - Sec 4

The Wayback Machine provides a unique and comprehensive source of historical web data. However, it was not created for the purpose of studying third-party web tracking and is thus imperfect for that use. Nevertheless, the only way to study web tracking prior to explicit measurements targeting it is to leverage materials previously archived for other purposes.

The researchers realized that the Wayback Machine may fail to archive resources for any number of reasons. For example, the domain serving a certain resource may have been unavailable at the time of the archive, or changes in the Wayback Machine’s crawler may result in different archiving behaviors over time. As shown in Table 2, missing archives are rare. The Wayback Machine’s archived pages execute the corresponding archived JavaScript within the browser when TrackingExcavator visits them, the Wayback Machine does not execute JavaScript during its archival crawls of the web. Instead, it attempts to statically extract URLs from HTML and JavaScript to find additional sites to archive. It then modifies the archived JavaScript, rewriting the URLs in the included script to point to the archived copy of the resource. This process may fail, particularly for dynamically generated URLs. As a result, when TrackingExcavator visits archived pages, dynamically generated URLs not properly redirected to their archived versions will cause the page to attempt to make a request to the live web, i.e., “escape” the archive. TrackingExcavator blocks such escapes (see Section 3). As a result, the script never runs on the archived site, never sets a cookie or leaks it, and thus TrackingExcavator does not witness the associated tracking behavior. Also embedded resources in a webpage archived by the Wayback Machine may occasionally have a timestamp far from the timestamp of the top-level page. Any of the above failures can lead to cascading failures, in that non-archived responses or blocked requests will result in the omission of any subsequent requests or cookie setting events that would have resulted from the success of the original request. The “wake” of a single failure cannot be measured within an archival dataset, because events following that failure are simply missing. 
To study the effect of these cascading failures, we must compare an archival run to a live run from the same time; we do so in the next subsection.

Contributor

Again, when you are quoting work it is very important to use quotation marks and citations to make it clear that you have used the original authors' words. Alternatively, you can rewrite conclusions in your own words. Here are passages from the original paper that are too close, in my eyes, for you to claim as your own words:

Page 8, Sec 4.1

The Wayback Machine may fail to archive resources for any number of reasons. For example, the domain serving a certain resource may have been unavailable at the time of the archive, or changes in the Wayback Machine’s crawler may result in different archiving behaviors over time. As shown in Table 2, missing archives are rare.

Though the Wayback Machine’s archived pages execute the corresponding archived JavaScript within the browser when TrackingExcavator visits them, the Wayback Machine does not execute JavaScript during its archival crawls of the web. Instead, it attempts to statically extract URLs from HTML and JavaScript to find additional sites to archive. It then modifies the archived JavaScript, rewriting the URLs in the included script to point to the archived copy of the resource. This process may fail, particularly for dynamically generated URLs. As a result, when TrackingExcavator visits archived pages, dynamically generated URLs not properly redirected to their archived versions will cause the page to attempt to make a request to the live web, i.e., “escape” the archive. TrackingExcavator blocks such escapes (see Section 3). As a result, the script never runs on the archived site, never sets a cookie or leaks it, and thus TrackingExcavator does not witness the associated tracking behavior.

As others have documented [10], embedded resources in a webpage archived by the Wayback Machine may occasionally have a timestamp far from the timestamp of the top-level page.

Any of the above failures can lead to cascading failures, in that non-archived responses or blocked requests will result in the omission of any subsequent requests or cookie setting events that would have resulted from the success of the original request. The “wake” of a single failure cannot be measured within an archival dataset, because events following that failure are simply missing. To study the effect of these cascading failures, we must compare an archival run to a live run from the same time; we do so in the next subsection.


longitudinal measurement study.

After evaluating the Wayback Machine’s view into the past and developing best practices for using its data, we use TrackingExcavator to conduct a longitudinal study of the third-party web tracking ecosystem from 1996-2016. The researchers explored how this ecosystem has changed over time, including the prevalence of different web tracking behaviors, the identities and scope of popular trackers, and the complexity of relationships within the ecosystem. Among their findings, they identified the earliest tracker in the dataset of 1996 and observed the rise and fall of important players in the ecosystem (e.g., the rise of Google Analytics to appear on over a third of all popular websites). They also found that websites contact an increasing number of third parties over time (about 5% of the 500 most popular sites contacted at least 5 separate third parties in early 2000s, whereas nearly 40% do so in 2016) and that the top trackers can track users across an increasing percentage of the web’s most popular sites. They also found out that tracking behaviors changed over time, e.g., that third-party popups peaked in the mid-2000s and that the fraction of trackers that rely on referrals from other trackers has recently risen.
Contributor

Again passages from the paper:

Page 3, Sec 1

After evaluating the Wayback Machine’s view into the past and developing best practices for using its data, we use TrackingExcavator to conduct a longitudinal study of the third-party web tracking ecosystem from 1996-2016 (Sections 5). We explore how this ecosystem has changed over time, including the prevalence of different web tracking behaviors, the identities and scope of popular trackers, and the complexity of relationships within the ecosystem. Among our findings, we identify the earliest tracker in our dataset in 1996 and observe the rise and fall of important players in the ecosystem (e.g., the rise of Google Analytics to appear on over a third of all popular websites). We find that websites contact an increasing number of third parties over time (about 5% of the 500 most popular sites contacted at least 5 separate third parties in early 2000s, whereas nearly 40% do so in 2016) and that the top trackers can track users across an increasing percentage of the web’s most popular sites. We also find that tracking behaviors changed over time, e.g., that third-party popups peaked in the mid-2000s and that the fraction of trackers that rely on referrals from other trackers has recently risen.

When taking such direct quotes, it's not appropriate to replace "we" with "they" or "the researchers", as it builds the impression that these words are your own.


Conclusion

Taken together, the Internet Jones" paper #26 research findings show that third-party web tracking is a rapidly growing practice in an increasingly complex ecosystem— suggesting that users’ and policymakers’ concerns about privacy require sustained, and perhaps increasing, attention. The Internet Jones" paper #26 research results also provide hitherto unavailable historical context for today’s technical and policy discussions. It is also stated in the Internet Jones" paper #26 research that Wayback Machine provides a unique and comprehensive source of historical web data. However, it was not created for the purpose of studying third-party webtracking and is thus imperfect for that use.
Contributor

Page 3, Sec 1

Taken together, our findings show that third-party web tracking is a rapidly growing practice in an increasingly complex ecosystem — suggesting that users’ and policy-makers’ concerns about privacy require sustained, and perhaps increasing, attention. Our results provide hitherto unavailable historical context for today’s technical and policy discussions.

Page 7, Sec 4

The Wayback Machine provides a unique and comprehensive source of historical web data. However, it was not created for the purpose of studying third-party web tracking and is thus imperfect for that use.

Again, in this case, you phrase it as: 'It is also stated in the Internet Jones" paper #26 research that Wayback Machine...'. This is getting closer to attribution, but it's not clear where the quote starts and ends. You could say: 'The Internet Jones paper notes that "the Wayback Machine provides a unique and comprehensive source of historical web data. However, it was not created for the purpose of studying third-party web-tracking and is thus imperfect for that use."'

A further improvement on this would be to use the convention of referring to the authors rather than the paper, e.g. 'Lerner et al. note that "the Wayback Machine...'

@birdsarah birdsarah requested a review from mlopatka April 4, 2019 06:42
Contributor

@mlopatka mlopatka left a comment

This PR does not close Issue #26. It does provide an excellent context/summarization of the work presented in the referenced paper. Looking back at the description of Issue #26, I do not see any ambiguity regarding the aim:

It contains a number of interesting metrics to describe tracking over time. While the OverScripted dataset does not have sufficient data to compare for all metrics, there may be some that we can reproduce and so continue the evolution of the data presented in the paper.

Closing out this issue requires an analysis that applies the methodology of the Internet Jones paper to some subset of the OverScripted dataset.
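
As a rough illustration of what such an analysis could look like, here is a minimal sketch of one metric mentioned in the paper: the fraction of sites that contact at least N distinct third parties. Everything in it is an assumption for illustration: the input is a plain list of (page URL, script URL) pairs rather than the actual OverScripted schema, `naive_site` is a hypothetical helper that approximates the registrable domain by the last two host labels (a real analysis would likely use the Public Suffix List), and script origins are only a proxy for the third-party requests measured in the paper.

```python
from collections import defaultdict
from urllib.parse import urlparse


def naive_site(url):
    """Approximate a registrable domain as the last two host labels.

    Assumption: this is a simplification that is wrong for suffixes
    such as "co.uk"; it is only meant to keep the sketch self-contained.
    """
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def third_parties_per_site(records):
    """Map each visited site to the set of third-party sites seen on it.

    `records` is an iterable of (page_url, script_url) pairs
    (a hypothetical input format, not the OverScripted schema).
    """
    third_parties = defaultdict(set)
    for page_url, script_url in records:
        page_site = naive_site(page_url)
        if not page_site:
            continue  # skip rows whose page URL cannot be parsed
        third_parties[page_site]  # register the site even with no third parties
        script_site = naive_site(script_url)
        if script_site and script_site != page_site:
            third_parties[page_site].add(script_site)
    return third_parties


def fraction_with_at_least(third_parties, threshold):
    """Fraction of sites that contact at least `threshold` third parties."""
    if not third_parties:
        return 0.0
    hits = sum(1 for parties in third_parties.values() if len(parties) >= threshold)
    return hits / len(third_parties)


if __name__ == "__main__":
    sample = [
        ("https://a.example.com/", "https://cdn.tracker.net/t.js"),
        ("https://a.example.com/x", "https://static.example.com/app.js"),
        ("https://a.example.com/", "https://ads.other.org/a.js"),
        ("https://b.test/", "https://b.test/main.js"),
    ]
    per_site = third_parties_per_site(sample)
    print(fraction_with_at_least(per_site, 2))
```

Computed on a subset of the dataset, a figure like this could then be set alongside the percentages reported in the paper, keeping in mind that the two measurements are not collected in the same way.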

There are also a number of issues regarding attribution of the original source material. My interpretation of the intention behind this PR is to summarize the relevance of this original research as it pertains to the current tracking ecosystem. This is certainly a valuable contribution to the overscripted repo! However, changes to the manner and style of attribution are required before this PR can be merged in.

As mentioned in several of @birdsarah's comments, the verbatim replication of large sections of the original work must be either correctly cited as a quotation or removed so that an interpretation or editorial perspective regarding that text can be presented.

This writing guide is a helpful resource in determining the appropriate and correct way to reference source materials. https://academicguides.waldenu.edu/writingcenter/evidence/citations

Please ensure that this PR is providing appropriate attribution to the original content and satisfies the 5-point guideline in the guide linked above.
