Target Privacy Threat Model

This document is at a very early stage. Many things in it are wrong and/or incomplete. Please take it as a rough shape for how we might document the target threat model, rather than as definite statements about what should be in the target threat model.

1. Introduction

As a threat model, this specification describes attacker capabilities and attacker goals, and says, for each possible set of capabilities an attacker might have, which goals we expect the attacker can achieve.

As a privacy threat model, the attacker goals compromise the privacy of users, rather than their security.

As a target threat model, it describes not the current state of the Web including all current maybe-unwise APIs, but rather an end state that we hope to migrate to, and that new APIs should be held to. This is meant to be a plausible threat model: it doesn’t expect to remove any APIs or browser behavior that is deemed essential to the viability of the Web.

Since people are likely to disagree about which APIs are essential to the Web, when saying that an attacker can achieve their goal, this document describes how the attacker achieves it using particular "essential" APIs, and it provides an index of those APIs so readers can point out ones that they don’t consider essential.

2. Terminology

[HTML] defines an origin as the tuple of a scheme, hostname, and port that provides the main security boundary on the web.

A site is a set of origins that are all same site with each other. Note that there are problems ([PSL-PROBLEMS]) with using registrable domains as a logical boundary.

A party is defined by [tracking-dnt] as "a natural person, a legal entity, or a set of legal entities that share common owner(s), common controller(s), and a group identity that is easily discoverable by a user."

The first party for a user action is the party that controls the origin of the top-level browsing context under which the action happened. Intuitively, this is the owner of the domain in the browser’s URL bar. This differs from Mozilla’s definition in that Mozilla defines other parties as first parties if the user can easily discover which party it is and intends to interact with that party, for example to allow sign-in widgets to be first-party.

A third party for a user action is any party that isn’t the first party or the user (the second party).

A user is a human or program that controls a user agent.

A user ID is a pair of a site and a (potentially-large) integer allocated by that site that is used to identify a user on that site. A single user will generally have many user IDs that refer to them, and a single site may or may not know that multiple user IDs refer to the same user.

A global identifier is a string that identifies a particular user independent of which site they’re visiting. Users generally have relatively few global identifiers and can usually list and recognize them. A goal of anti-tracking policy is to prevent user IDs from becoming global identifiers.

An attacker is any entity trying to get information that a user might not want them to get. Attackers are often entities that a user intends to interact with in other ways, as both first and third parties, and some users may not mind their collection of this information.

This document uses the terms publisher and tracker colloquially to refer to particular kinds of sites and the parties that operate them. They are not rigorously defined.

3. High-level threats

User agents should attempt to defend their users from a variety of high-level threats or attacker goals, described in this section. § 6 Attacker goals then describes the low-level steps an attacker would use to achieve these high-level goals.

[RFC6973] describes the following high-level privacy threats, which the TAG has adopted into Self-Review Questionnaire: Security and Privacy §threats:

Surveillance: Surveillance is the observation or monitoring of an individual’s communications or activities. See Privacy Considerations for Internet Protocols §section-5.1.1.
Stored Data Compromise: End systems that do not take adequate measures to secure stored data from unauthorized or inappropriate access. See Privacy Considerations for Internet Protocols §section-5.1.2.
Intrusion: Intrusion consists of invasive acts that disturb or interrupt one’s life or activities. See Privacy Considerations for Internet Protocols §section-5.1.3.
Misattribution: Misattribution occurs when data or communications related to one individual are attributed to another. See Privacy Considerations for Internet Protocols §section-5.1.4.
Correlation: Correlation is the combination of various pieces of information related to an individual or that obtain that characteristic when combined. See Privacy Considerations for Internet Protocols §section-5.2.1.
Identification: Identification is the linking of information to a particular individual to infer an individual’s identity or to allow the inference of an individual’s identity. See Privacy Considerations for Internet Protocols §section-5.2.2.
Secondary Use: Secondary use is the use of collected information about an individual without the individual’s consent for a purpose different from that for which the information was collected. See Privacy Considerations for Internet Protocols §section-5.2.3.
Disclosure: Disclosure is the revelation of information about an individual that affects the way others judge the individual. See Privacy Considerations for Internet Protocols §section-5.2.4.
Exclusion: Exclusion is the failure to allow individuals to know about the data that others have about them and to participate in its handling and use. See Privacy Considerations for Internet Protocols §section-5.2.5.

These threats combine into the particular concrete threats we want web specifications to defend against, described in subsections here:

3.1. Unwanted same-site recognition

Contributes to surveillance, correlation, and identification.

Users of most instantiations of the web platform expect that if they visit a site on one day, and then visit again the next day, the site will be able to recognize that they’re the same user. This allows sites to save the user’s preferences, shopping carts, etc. The web platform offers many mechanisms that are either intended to accomplish this recognition or that can be trivially used for it, including cookies, localStorage, indexedDB, CacheStorage, and other forms of storage.

A privacy harm only occurs if the user wants to break the association between two visits, but the site can still determine with high probability that the two visits came from the same user.

A user might expect that their two visits won’t be associated if they:

Use a browser that promises to avoid such correlation.
Use their browser’s private browsing mode. ([WHAT-DOES-PRIVATE-BROWSING-DO])
Use two different browser profiles between the two visits.
Explicitly clear the site’s cookies or storage.

This recognition is generally accomplished by either "supercookies" or browser fingerprinting.

Supercookies occur when a browser stores data for a site but makes that data more difficult to clear than other cookies or storage. Fingerprinting Guidance §5.4 Clearing all local state discusses how specifications can help browsers avoid this mistake.

Fingerprinting consists of using attributes of the user’s browser and platform that are consistent between the two visits and probabilistically unique to the user.

The attributes can be exposed as information about the user’s device that is otherwise benign (as opposed to § 3.3 Sensitive information disclosure). For example:

What are the user’s language and time zone?
What size is the user’s window?
What system preferences has the user set? Dark mode, serif font, etc...
...

See [fingerprinting-guidance] for how to mitigate this threat.

3.2. Unwanted cross-site recognition

Contributes to surveillance, correlation, and identification, usually more significantly than § 3.1 Unwanted same-site recognition.

This occurs if a site can determine with high probability that a visit to that site comes from the same user as another visit to a different site. This threat is discussed in § 4.1 Cross-site recognition.

3.3. Sensitive information disclosure

Contributes to correlation, identification, secondary use, and disclosure.

Many pieces of information about a user could cause privacy harms if disclosed. For example:

The user’s location.
Video or audio from the user’s camera or microphone.
The content of certain files on the user’s filesystem.
Financial data.
Contacts.
Calendar entries.
Whether the user is using assistive technology.
...

A particular piece of information may have different sensitivity for different users. Language preferences, for example, might typically seem innocent, but also can be an indicator of belonging to an ethnic minority. Precise location information can be extremely sensitive (because it’s identifying, because it allows for in-person intrusions, because it can reveal detailed information about a person’s life) but it might also be public and not sensitive at all, or it might be low-enough granularity that it is much less sensitive for many users.

When considering whether a class of information is likely to be sensitive to users, consider at least these factors:

whether it serves as a persistent identifier (see severity in Mitigating browser fingerprinting);
whether it discloses substantial (including intimate details or inferences) information about the user or other users;
whether it can be revoked (as in determining whether a permission is necessary);
whether it enables other threats, like intrusion.

This description of what makes information sensitive still needs to be refined. <https://github.com/w3cping/privacy-threat-model/issues/16>

3.4. Intrusive behavior

See intrusion.

Privacy harms don’t always come from a site learning things. For example it is intrusive for a site to

Display messages or notifications,
Play sounds,
Occupy the full screen,
etc.

if the user doesn’t intend for it to do so.

3.5. Powerful capabilities

Contributes to misattribution.

For example, a site that sends SMS without the user’s intent could cause them to be blamed for things they didn’t intend.

4. Threat Model

For each of the high-level threats, we describe a threat model: a description of what goals attackers with various capabilities should or should not be able to achieve. For simple threats, a model can be expressed in prose, while complex threat models use a grid to express the web platform’s target guarantees.

✘ indicates that the goal should be frustrated, while ✓ indicates that the attacker can achieve their goal.

Should we mark goals attackers can currently achieve, which we want to remove, differently from goals attackers already can’t achieve?

4.1. Cross-site recognition

	§ 6.2.1 Transfer user ID from publisher 1 to publisher 2 on navigation.	§ 6.2.2 Transfer user ID from tracker to that tracker running within a publisher, on navigation.	§ 6.2.3 Transfer user ID from tracker within publisher 1 to that tracker within publisher 2, on navigation.	§ 6.2.5 Probabilistically transfer user ID from publisher 1 to publisher 2 without navigation.	§ 6.2.4 Probabilistically transfer user ID from publisher 1 to publisher 2 on navigation.
§ 5.1 Load iframes	✘	✘	✘	✘	✘
§ 5.2 Run Javascript in a first-party context	✘	✘	✘	✘	✘
§ 5.4 Read server logs on other publishers	✘	✘	✘	✘	✓ Publisher 2 reads their own logs for the page load and publisher 1’s logs for the click tracking of the navigation click. User IDs that clicked on publisher 1 approximately the same time as that link’s target loaded on publisher 2 are probabilistically correlated.
§ 5.3 Modify server processing on the target publisher	✘	✓ The tracker adds a path segment, possibly encrypted, in their links to the publisher, encoding the user’s ID within the tracker. They convince the publisher to ignore that path segment in their server processing. The tracker running inside that publisher reads the URL, decodes the tracker site’s user ID, and sends that and the tracker-within-publisher user ID up to the tracker’s server.	✘	✘	✘
§ 5.2 Run Javascript in a first-party context or § 5.3 Modify server processing on the source site and § 5.3 Modify server processing on the target publisher	✘	✓ The tracker adds a path segment, possibly encrypted, in their links to the publisher, encoding the user’s ID within the tracker. They convince the publisher to ignore that path segment in their server processing. The tracker running inside that publisher reads the URL, decodes the tracker site’s user ID, and sends that and the tracker-within-publisher user ID up to the tracker’s server.	✓ The tracker adds a path segment, possibly encrypted, in the publisher 1’s links to publisher 2, encoding the user’s ID within publisher 1. They convince publisher 2 to ignore that path segment in their server processing. The tracker running inside that publisher reads the URL, decodes publisher 1’s user ID, and sends that and publisher 2’s user ID up to the tracker’s server.	✓	✓

Further cross-site recognition is available by combining capabilities with the ability to § 5.2 Run Javascript in a first-party context (or § 5.3 Modify server processing to add attacker-controlled javascript):

	§ 6.2.5 Probabilistically transfer user ID from publisher 1 to publisher 2 without navigation.
§ 5.5 Convince the user to type an identifier	✓ The tracker gets a report of the identifiers typed in both publisher sites, along with each publisher’s user id. If they’re equal, the user ids probably refer to the same user. The probability depends on the type of identifier: email addresses or credit card numbers are very high probability. Names or dates of birth are lower probability unless combined with other information like a zip code.
§ 5.6 Convince the user to open the same device in two cross-site tabs	✓ The tracker reads the two devices, and if they give the same output at approximately the same time, the user is probably the same. For devices like cameras and microphones, detection is likely to be very accurate. Others, like ambient light, might only give a few bits per sample, and so need a long period of overlap in order to provide a good correlation. For some devices, this transfer can be mitigated by turning off input when the site isn’t visible or isn’t focused, but user expectations limit where that mitigation is applicable.
§ 5.7 Convince the user to open the same read+write device in two different sites	✓ The tracker writes identifying content to the device and then reads it back when the other site is opened. This is visible to varying degrees depending on the device: an individual native file or a Bluetooth or USB device that isn’t designed to cooperate with this sort of tracking, is likely to break in obvious ways when the tracker tries to write an identifier. A native directory, on the other hand, has many available filenames that could hold identifying information without a user being likely to notice.
§ 5.8 Have two sites open when a browser-wide event happens	✘
§ 5.9 Have two sites visible when a browser-wide event happens	✓ Browser-wide events generally need to be visible immediately when a user is looking at a website, so the tracker just needs to notice that the event’s parameters are the same, and its timestamp across two sites is close together. The probability of identifying a single user goes up as more events are observed.

4.2. Sensitive-information

Attackers should only be able to get access to sensitive information from a user agent if the user expresses their intent that the attacker get access to this information at the time the attacker gets access to it. User agents vary in how they gather this expression of intent.

That a user intends an attacker to get a piece of information at one time, for example their location or their contact book, may be, but is not necessarily evidence that the user intends to give out the same piece of information at a later time. There is not consensus about how long it’s reasonable to infer continued intent, but there is consensus that intent doesn’t last for years without interaction.

Add more information about local attackers and confused users and how browsers can mitigate. <https://github.com/w3cping/privacy-threat-model/issues/19>

This threat model defines a kind of information as restricted sensitive information if the web platform currently blocks access to it by default or if we plan to evolve the web platform to block access to it by default because of the potential privacy harms from disclosure of that kind of information.

Other information is described as "not restricted sensitive information" even if some users in some situations would find it sensitive. Information in this category may have a lower risk of privacy harm to users or may not currently be restricted because of incompatibility with functionality of the Web. These categories are not static and it may become feasible to block access by default to more kinds of information as the platform develops.

There is consensus that some kinds of information are restricted sensitive information:

Location
Disability status
Microphone input
Etc.

There is consensus that some other kinds of information are not restricted sensitive information:

User agent
Language
A user’s preference for less motion.
Etc.

There is not consensus about the sensitivity of all kinds of information:

TODO: examples?

5. Attacker Capabilities

These are things some attackers can do. The attackers can use them to achieve their goals. All attackers are assumed to be able to buy domains and host sites on their domains.

5.1. Load iframes

The attacker can convince a publisher to load an iframe from the attacker’s site.

5.2. Run Javascript in a first-party context

When a publisher wants to show ads from an ad network, many/most ad networks require that publisher to use a <script src="adnetwork.js"> tag instead of an <iframe src="adnetwork.html">, which allows the ad network to see any publisher data that’s exposed to Javascript. This includes query parameters (e.g. fbclid) and any cookies that aren’t HttpOnly.

5.3. Modify server processing

Some attackers can convince some publishers (perhaps by paying them) to modify their server software. This could be used, for example, to receive user IDs passed in a path segment instead of a query parameter, without breaking the normal logic mapping a path to content.

5.4. Read server logs

Servers can keep logs of requests. The attacker may be able to convince some server operators to give them these logs or let them run queries over the logs.

5.5. Convince the user to type an identifier

The attacker can convince a user to type their email address, name, zip code, etc. into two publishers' sites.

5.6. Convince the user to open the same device in two cross-site tabs

The attacker can convince a user to open the same device on two publishers' sites that are both open at the same time in different tabs or windows.

5.7. Convince the user to open the same read+write device in two different sites

A read+write device is something like a native filesystem file or directory, or a bluetooth or USB device that can be configured to save data. The two sites need to be given access to the same device, but they don’t need to both be open at the same time.

5.8. Have two sites open when a browser-wide event happens

A browser-wide event is something that happens to the browser as a whole instead of to an individual site or web page. For example, the user going idle, sensor data changing, or the device’s time zone changing, would affect all tabs at the same time.

5.9. Have two sites visible when a browser-wide event happens

Unlike § 5.8 Have two sites open when a browser-wide event happens, this capability requires both sites to be visible (probably in two separate browser windows) at the same time during the event.

6. Attacker goals

These are things attackers want to accomplish.

6.1. User metrics within a single site

6.1.1. Click tracking

A site wants to know which of its links a user clicks on.

6.2. Tracking

[tracking-dnt] defines "tracking" as the collection of data regarding a particular user’s activity across multiple distinct contexts and the retention, use, or sharing of data derived from that activity outside the context in which it occurred. A context is a set of resources that are controlled by the same party or jointly controlled by a set of parties. The following are building blocks that allow a tracker to build such a log of a user’s activity.

6.2.1. Transfer user ID from publisher 1 to publisher 2 on navigation.

When the user clicks a link from publisher 1 that navigates to publisher 2, publisher 2’s server learns that a user ID on publisher 2 and a user ID on publisher 1 represent the same user.

6.2.2. Transfer user ID from tracker to that tracker running within a publisher, on navigation.

When the user clicks a link from tracker.example to publisher.example, that tracker’s server learns that a user ID for either publisher.example or the tracker running within publisher.example and a user ID for tracker.example represent the same user.

6.2.3. Transfer user ID from tracker within publisher 1 to that tracker within publisher 2, on navigation.

When the user clicks a link from publisher1.example (where a tracker was embedded within that site) to publisher2.example (which has the same tracker embedded), that tracker’s server learns that a user ID for either publisher1.example or the tracker running within publisher1.example and a user ID for either publisher2.example or the tracker running within publisher2.example represent the same user.

6.2.4. Probabilistically transfer user ID from publisher 1 to publisher 2 on navigation.

When the user clicks a link from publisher 1 that navigates to publisher 2, publisher 2’s server learns that a user ID on publisher 2 and a user ID on publisher 1 are more likely than chance to represent the same user.

6.2.5. Probabilistically transfer user ID from publisher 1 to publisher 2 without navigation.

While the user is visiting publisher 1, that publisher can tell publisher 2 that a user ID on publisher 2 and a user ID on publisher 1 are more likely than chance to represent the same user, without requiring the user to navigate from publisher 1 to publisher 2.

7. Essential Web APIs

These are Web APIs and features that would destroy the Web as we know it if they went away. The placement of some features here will be contentious, and those features are marked as I discover the contention.

7.1. Javascript can make requests to servers

This could perhaps be broken down into same-origin requests and cross-origin requests, with cross-origin requests being more contentious.

7.2. Servers can define the paths under which they host content

These paths can include opaque strings that neither a human nor a user agent can interpret without knowing how the server works, like /t/proposal-packaging-for-the-web-signed-and-indexed/1827/10 and /Moby-Dick-Herman-Melville/dp/1503280780/.

8. Acknowledgements

Safari did the first work to prove that a more privacy-preserving web was possible, by blocking third-party cookies by default and then shipping ITP 1.0, without breaking the world. They eventually published their policy for Tracking Prevention, which heavily influenced this document.

Mozilla wrote the first concrete anti-tracking policy, which inspired Safari’s policy.

Michael Kleber on the Chrome team proposed a Privacy Model for the Web, which suggests blocking the transfer of user IDs between top-level sites and suggests a few ways that information could flow between sites without compromising user privacy.