Legal:Data publication guidelines: Difference between revisions

From Wikimedia Foundation Governance Wiki
m →‎Data publication risk tiering grid: link updates, replaced: [[Policy: → [[Special:MyLanguage/Policy:
m spacing
Tag: 2017 source edit
(7 intermediate revisions by 2 users not shown)
Line 1: Line 1:
<languages />
<languages />
{{staff-policy}}<translate>
{{staff-policy}}
<translate>
The right to privacy is at the core of how communities contribute to Wikimedia projects and upholding this right is central to our [[<tvar name="1">Special:MyLanguage/Policy:Human Rights Policy</tvar>|human rights commitments]]. These data publication guidelines are the best practices at the Wikimedia Foundation for managing risk in data publication. They complement our [[<tvar name="2">Special:MyLanguage/Legal:Data retention guidelines</tvar>|Data retention guidelines]] and contribute to our commitment to protect users' data as elaborated in our [[<tvar name="3">Special:MyLanguage/Policy:Privacy policy</tvar>|privacy policy]].
The right to privacy is at the core of how communities contribute to Wikimedia projects and upholding this right is central to our [[<tvar name="1">Special:MyLanguage/Policy:Human Rights Policy</tvar>|human rights commitments]]. These data publication guidelines are the best practices at the Wikimedia Foundation for managing risk in data publication. They complement our [[<tvar name="2">Special:MyLanguage/Legal:Data retention guidelines</tvar>|Data retention guidelines]] and contribute to our commitment to protect users' data as elaborated in our [[<tvar name="3">Special:MyLanguage/Policy:Privacy policy</tvar>|privacy policy]].


Line 8: Line 9:
</translate>
</translate>
{| class="wikitable"
{| class="wikitable"
!'''[//office.wikimedia.org/wiki/Security/Policy/Data_Classification Data classification]'''
!'''[//office.wikimedia.org/wiki/Security/Policy/Data_Classification <translate>Data classification</translate>]'''
!'''Confidential'''
!'''<translate>Confidential</translate>'''
! colspan="2" |'''Restricted'''
! colspan="2" |'''<translate>Restricted</translate>'''
|-
|-
! scope="row" rowspan="2" |'''[//office.wikimedia.org/wiki/Security/Policy/Risk_Management Risk level]'''
! scope="row" rowspan="2" |'''[//office.wikimedia.org/wiki/Security/Policy/Risk_Management <translate>Risk level</translate>]'''
| style="background: #941616; color: white" |'''Tier 1: High risk'''
| style="background: #941616; color: white" |'''<translate>Tier 1: High risk</translate>'''
| style="background: #eb6c49; color: white" |'''Tier 2: Medium risk'''
| style="background: #eb6c49; color: white" |'''<translate>Tier 2: Medium risk</translate>'''
| style="background: #b071c7; color: white" |'''Tier 3: Low risk'''
| style="background: #b071c7; color: white" |'''<translate>Tier 3: Low risk</translate>'''
|-
|-
|Data that could ''certainly'' be used to cause harm
|<translate>Data that could ''certainly'' be used to cause harm</translate>
|Data that could ''likely'' or ''possibly'' be used to cause harm
|<translate>Data that could ''likely'' or ''possibly'' be used to cause harm</translate>
|Data that is unlikely to be used to cause harm or is private for administrative reasons
|<translate>Data that is unlikely to be used to cause harm or is private for administrative reasons</translate>
|-
|-
! scope="row" | Examples (non-exhaustive list)
! scope="row" | <translate>Examples (non-exhaustive list)</translate>
|<translate>
|
* Data containing PII
* Data containing PII
** see the [//office.wikimedia.org/wiki/Security/Policy/Data_Classification#Confidential data classification policy] and [[Special:MyLanguage/Policy:Privacy policy|privacy policy]]
** see the [<tvar name="1">//office.wikimedia.org/wiki/Security/Policy/Data_Classification#Confidential</tvar> data classification policy] and [[<tvar name="2">Special:MyLanguage/Policy:Privacy policy</tvar>|privacy policy]]</translate>
<translate>
* Granular analyses of
* Granular analyses of
** [[wikitech:Country protection list|country protection list]] countries
** [[<tvar name="1">wikitech:Country protection list</tvar>|country protection list]] countries
** fundraising data
** fundraising data
* Recurring data releases of medium risk data
* Recurring data releases of medium risk data</translate>
|<translate>
|
* High-level analyses of
* High-level analyses of
** [[wikitech:Country protection list|country protection list]] countries
** [[<tvar name="1">wikitech:Country protection list</tvar>|country protection list]] countries
** fundraising data
** fundraising data</translate>
<translate>
* Granular analyses of
* Granular analyses of
** '''''non'''''-country protection list countries
** '''''non'''''-country protection list countries
Line 39: Line 42:
** interaction data
** interaction data
** reading data
** reading data
* Recurring data releases of low risk data
* Recurring data releases of low risk data</translate>
|<translate>
|
* High-level analyses of
* High-level analyses of
** '''''non'''''-country protection list countries
** '''''non'''''-country protection list countries
Line 46: Line 49:
** editing data
** editing data
** interaction data
** interaction data
** reading data
** reading data</translate>
* Any analyses that utilize [//en.wikipedia.org/wiki/Differential%20privacy differential privacy]<ref>This process requires specialist help to ensure that the DP algorithm is correctly configured, as well as adequate documentation.</ref>
* <translate>Any analyses that utilize [[<tvar name="1">{{lwp|Differential privacy}}</tvar>|differential privacy]]</translate><ref><translate>This process requires specialist help to ensure that the DP algorithm is correctly configured, as well as adequate documentation.</translate></ref>
* Collations and combinations of already-public data that it may be inconvenient/difficult for external parties to access
* <translate>Collations and combinations of already-public data that it may be inconvenient/difficult for external parties to access</translate>
|-
|-
! scope="row |'''Response time goal'''
! scope="row |'''<translate>Response time goal</translate>'''
|3 work weeks
|<translate>3 work weeks</translate>
|5 work days
|<translate>5 work days</translate>
|<translate>N/A</translate>
|N/A
|-
|-
! scope="row |'''Expected % of requests (internal metric)'''
! scope="row |'''<translate>Expected % of requests (internal metric)</translate>'''
|15%
|{{formatnum:15}}%
|35%
|{{formatnum:35}}%
|50%
|{{formatnum:50}}%
|-
|-
! scope="row" colspan="4" |'''What this means for Wikimedia Foundation teams'''
! scope="row" colspan="4" |'''<translate>What this means for Wikimedia Foundation teams</translate>'''
|-
|-
! scope="row" |'''Follow-up actions'''
! scope="row" |'''<translate>Follow-up actions</translate>'''
|
|
* Do not upload this data to non-Wikimedia Foundation servers
* <translate>Do not upload this data to non-Wikimedia Foundation servers</translate>
* Clear outputs before committing code, even to [[:mw:GitLab/Hosting a project on GitLab#GitLab private (restricted) repos|private Gitlab repos]]
* <translate>Clear outputs before committing code, even to [[<tvar name="1">:mw:Special:MyLanguage/GitLab/Hosting a project on GitLab#GitLab_private_(restricted)_repos</tvar>|private Gitlab repos]]</translate>
* Legal and Security will consider publication of high risk data on a case-by-case basis after review and risk mitigation
* <translate>Legal and Security will consider publication of high risk data on a case-by-case basis after review and risk mitigation</translate>
|
|
* '''''Unsanitized data''''' can be uploaded to '''''private''''' servers outside of the Wikimedia Foundation ([[:mw:GitLab/Hosting a project on GitLab#GitLab private (restricted) repos|private Gitlab repos]], Slack, Drive, etc.)
* <translate>'''''Unsanitized data''''' can be uploaded to '''''private''''' servers outside of the Wikimedia Foundation ([[<tvar name="1">:mw:Special:MyLanguage/GitLab/Hosting a project on GitLab#GitLab_private_(restricted)_repos</tvar>|private Gitlab repos]], Slack, Drive, etc.)</translate>
<translate>
* '''''Sanitized data''''' is considered to be low risk, and can be uploaded to '''''public''''' servers outside of the Wikimedia Foundation (Gitlab, presentations, mailing lists, etc.). Data sanitization involves
* '''''Sanitized data''''' is considered to be low risk, and can be uploaded to '''''public''''' servers outside of the Wikimedia Foundation (Gitlab, presentations, mailing lists, etc.). Data sanitization involves
** clearing all outputs that display raw data
** clearing all outputs that display raw data
** filtering out or obfuscating granular analyses as defined by the threshold table below
** filtering out or obfuscating granular analyses as defined by the threshold table below</translate>
* Legal and Security will consider publication of medium risk data on a case-by-case basis after review and risk mitigation
* <translate>Legal and Security will consider publication of medium risk data on a case-by-case basis after review and risk mitigation</translate>
|
|
* This data can be uploaded to public servers outside of the Wikimedia Foundation (Gitlab, presentations, mailing lists, etc.)
* <translate>This data can be uploaded to public servers outside of the Wikimedia Foundation (Gitlab, presentations, mailing lists, etc.)</translate>
|}
|}
<translate>
<translate>
Line 81: Line 85:
=== {{int string|faq}} ===
=== {{int string|faq}} ===


* '''Q: What is the Risk Tiering Grid used for?''' The Risk Tiering Grid is to help Wikimedia Foundation teams that work with data know when their work requires privacy review by Legal and Security.
* <translate>'''Q: What is the Risk Tiering Grid used for?''' The Risk Tiering Grid is to help Wikimedia Foundation teams that work with data know when their work requires privacy review by Legal and Security.</translate>
* '''Q: What are the key risks the Tiering Grid measures?''' The key risks are on both the overuse and underuse of the spectrum. If this is used in such a way that too many things are being triaged to Legal and Security, then Legal and Security become the bottleneck for necessary workflow. On the other hand, if projects go live that would have been halted or mitigated under privacy review, that exposes the Foundation to privacy risks — including reputational, legal, and security risks.
* <translate>'''Q: What are the key risks the Tiering Grid measures?''' The key risks are on both the overuse and underuse of the spectrum. If this is used in such a way that too many things are being triaged to Legal and Security, then Legal and Security become the bottleneck for necessary workflow. On the other hand, if projects go live that would have been halted or mitigated under privacy review, that exposes the Foundation to privacy risks — including reputational, legal, and security risks.</translate>
* '''Q: Who are the intended audiences of the Tiering Grid?''' Teams that work with data in product and tech.
* <translate>'''Q: Who are the intended audiences of the Tiering Grid?''' Teams that work with data in product and tech.</translate>
* '''Q: What is changed from the existing risk review process?''' The existing review process required every single schema and data project to undergo Legal review. This both was not being followed, and was not practical to follow for either data teams or Legal.
* <translate>'''Q: What is changed from the existing risk review process?''' The existing review process required every single schema and data project to undergo Legal review. This both was not being followed, and was not practical to follow for either data teams or Legal.</translate>
* '''Q: What is the process for updating the Tiering Grid or resolving Tiering disagreements?'''
* <translate>'''Q: What is the process for updating the Tiering Grid or resolving Tiering disagreements?'''</translate>
<translate>
** Get Privacy approval
** Get Privacy approval
** Anyone can initiate an update/amendment but approval must be sought across the board before implementing
** Anyone can initiate an update/amendment but approval must be sought across the board before implementing
** Ongoing feedback immediately following launch, regular recalibration thereafter (say, every quarter or half)
** Ongoing feedback immediately following launch, regular recalibration thereafter (say, every quarter or half)</translate>
* '''Q: What should I do if I am unsure whether to reach out to the Legal and Security teams?''' When in doubt, it is better to err on the side of caution and submit a [//form.asana.com/?k=pkWf5vzxmgF4GrjpoH4hPQ&d=3758245663860 L3SC request].
* <translate>'''Q: What should I do if I am unsure whether to reach out to the Legal and Security teams?''' When in doubt, it is better to err on the side of caution and submit a [<tvar name="1">//form.asana.com/?k=pkWf5vzxmgF4GrjpoH4hPQ&d=3758245663860</tvar> L3SC request].</translate>
<translate>
<translate>
=== Threshold table ===
=== Threshold table ===
Line 96: Line 101:
</translate>
</translate>
{| class="wikitable" style="text-align: center;"
{| class="wikitable" style="text-align: center;"
! rowspan="2" |'''Data unit type'''
! rowspan="2" |'''<translate>Data unit type</translate>'''
! colspan="2" |'''Classification of analysis based on counts'''
! colspan="2" |'''<translate>Classification of analysis based on counts</translate>'''
|-
|-
!'''"Granular"'''
!'''<translate>"Granular"</translate>'''
!'''"High-level"'''
!'''<translate>"High-level"</translate>'''
|-
|-
! scope="row |'''Users (including unique devices)'''
! scope="row |'''<translate>Users (including unique devices)</translate>'''
|<25
|<{{formatnum:25}}
|≥{{formatnum:25}}
|≥25
|-
|-
! scope="row |'''Edits'''
! scope="row |'''<translate>Edits</translate>'''
|<50
|<{{formatnum:50}}
|≥{{formatnum:50}}
|≥50
|-
|-
! scope="row |'''App interactions'''
! scope="row |'''<translate>App interactions</translate>'''
|<100
|<{{formatnum:100}}
|≥{{formatnum:100}}
|≥100
|-
|-
! scope="row |'''Views'''
! scope="row |'''<translate>Views</translate>'''
|<250
|<{{formatnum:250}}
|≥{{formatnum:250}}
|≥250
|}
|}
<translate>
<translate>
Line 142: Line 147:


Before you post data publicly (which includes pushing a notebook to gerrit or gitlab), have you
Before you post data publicly (which includes pushing a notebook to gerrit or gitlab), have you

</translate>
* entered this data publication into the [//docs.google.com/forms/d/e/1FAIpQLSds6m1puVoWHUoeYOq-4IoVg81aqrdkuQjWX8BZTrTjdBh5Fg/viewform data publication log form]?
* entered this data publication into the [<tvar name="1">//docs.google.com/forms/d/e/1FAIpQLSds6m1puVoWHUoeYOq-4IoVg81aqrdkuQjWX8BZTrTjdBh5Fg/viewform</tvar> data publication log form]?
* cleared outputs that display raw data?
* cleared outputs that display raw data?</translate>
<translate>
* cleared outputs that display granular data (as defined in the threshold table above)?
* cleared outputs that display granular data (as defined in the threshold table above)?
* obfuscated rows that display granular data? For example:
* obfuscated rows that display granular data? For example:
</translate>

{| class="wikitable"
{| class="wikitable"
!Python
!<translate>Python</translate>
!<translate>R</translate>
!R
|-
|-
|
|
<syntaxhighlight lang="python">
<syntaxhighlight lang="python">
# imagine we are doing an analysis of the number of *users* to try a feature
# <translate>imagine we are doing an analysis of the number of *users* to try a feature</translate>


# set constants
# <translate>set constants</translate>
threshold = 25
threshold = 25
col = "num_users"
col = "num_users"


# obfuscate rows
# <translate>obfuscate rows</translate>
df.loc[df[col] < threshold, col] = f'<{threshold}'
df.loc[df[col] < threshold, col] = f'<{threshold}'
</syntaxhighlight>
</syntaxhighlight>
Line 175: Line 181:
</syntaxhighlight>
</syntaxhighlight>
|}
|}
* filtered out rows that display granular data? For example:
* <translate>filtered out rows that display granular data? For example:</translate>
{| class="wikitable"
{| class="wikitable"
!Python
!Python
Line 182: Line 188:
|
|
<syntaxhighlight lang="python">
<syntaxhighlight lang="python">
# imagine we are doing an analysis of *app interactions* on the users did
# <translate>imagine we are doing an analysis of *app interactions* on the users did</translate>


# set constants
# set constants
Line 188: Line 194:
col = "num_interactions"
col = "num_interactions"


# filter out rows below threshold
# <translate>filter out rows below threshold</translate>
df = df[df[col] >= threshold]
df = df[df[col] >= threshold]
</syntaxhighlight>
</syntaxhighlight>
Line 203: Line 209:
|}
|}
<translate>
<translate>

== General risk heuristics ==
== General risk heuristics ==


Line 228: Line 235:
** granular analysis > high-level analysis
** granular analysis > high-level analysis
</translate>
</translate>
=={{int string|Contact us}}==
== {{int string|Contact us}} ==
<translate>
<translate>
If you think that these guidelines have potentially been breached, or if you have questions or comments about compliance with the guidelines, please contact us at <tvar name="1">privacy{{@}}wikimedia.org</tvar>.
If you think that these guidelines have potentially been breached, or if you have questions or comments about compliance with the guidelines, please contact us at <tvar name="1">privacy{{@}}wikimedia.org</tvar>.
</translate>
</translate>


== {{int string|Notes}} ==
{{Template:Privacy policy navigation 2}}
<references />



[[Category:Policies not setup for translation]]
{{Template:Privacy policy navigation 2}}

Revision as of 05:52, 9 May 2024

The right to privacy is at the core of how communities contribute to Wikimedia projects and upholding this right is central to our human rights commitments. These data publication guidelines are the best practices at the Wikimedia Foundation for managing risk in data publication. They complement our Data retention guidelines and contribute to our commitment to protect users' data as elaborated in our privacy policy.

Similar guidelines pertaining to data collection are forthcoming, in order to more fully govern the entire lifecycle of data in Wikimedia Foundation systems.

Data publication risk tiering grid

Tier 1: High risk
  • Data classification: Confidential
  • Risk level: Data that could certainly be used to cause harm
  • Examples (non-exhaustive list):
    • Data containing PII (see the data classification policy and privacy policy)
    • Granular analyses of
      • country protection list countries
      • fundraising data
    • Recurring data releases of medium risk data
  • Response time goal: 3 work weeks
  • Expected % of requests (internal metric): 15%
  • What this means for Wikimedia Foundation teams (follow-up actions):
    • Do not upload this data to non-Wikimedia Foundation servers
    • Clear outputs before committing code, even to private Gitlab repos
    • Legal and Security will consider publication of high risk data on a case-by-case basis after review and risk mitigation

Tier 2: Medium risk
  • Data classification: Restricted
  • Risk level: Data that could likely or possibly be used to cause harm
  • Examples (non-exhaustive list):
    • High-level analyses of
      • country protection list countries
      • fundraising data
    • Granular analyses of
      • non-country protection list countries
      • projects
      • editing data
      • interaction data
      • reading data
    • Recurring data releases of low risk data
  • Response time goal: 5 work days
  • Expected % of requests (internal metric): 35%
  • What this means for Wikimedia Foundation teams (follow-up actions):
    • Unsanitized data can be uploaded to private servers outside of the Wikimedia Foundation (private Gitlab repos, Slack, Drive, etc.)
    • Sanitized data is considered to be low risk, and can be uploaded to public servers outside of the Wikimedia Foundation (Gitlab, presentations, mailing lists, etc.). Data sanitization involves
      • clearing all outputs that display raw data
      • filtering out or obfuscating granular analyses as defined by the threshold table below
    • Legal and Security will consider publication of medium risk data on a case-by-case basis after review and risk mitigation

Tier 3: Low risk
  • Data classification: Restricted
  • Risk level: Data that is unlikely to be used to cause harm or is private for administrative reasons
  • Examples (non-exhaustive list):
    • High-level analyses of
      • non-country protection list countries
      • projects
      • editing data
      • interaction data
      • reading data
    • Any analyses that utilize differential privacy[1]
    • Collations and combinations of already-public data that it may be inconvenient/difficult for external parties to access
  • Response time goal: N/A
  • Expected % of requests (internal metric): 50%
  • What this means for Wikimedia Foundation teams (follow-up actions):
    • This data can be uploaded to public servers outside of the Wikimedia Foundation (Gitlab, presentations, mailing lists, etc.)

Note: the country protection list is a reference guide to countries that are potentially dangerous for internet freedom; it is not indicative of the Foundation's working relationship with each country.
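As a rough illustration of how the examples in the grid map onto tiers, the sketch below encodes them as a small decision function. It is a simplified reading of the examples row and not an official classifier: the `Publication` fields, the `suggest_tier` name, and the precedence between rules (PII first, then differential privacy) are assumptions made for this example, and a real request may still need review by Legal and Security.

```python
from dataclasses import dataclass


@dataclass
class Publication:
    """Hypothetical description of a planned data release."""
    contains_pii: bool = False
    granular: bool = False              # below the counts in the threshold table
    protected_country: bool = False     # involves country protection list countries
    fundraising: bool = False
    recurring: bool = False
    differential_privacy: bool = False


def suggest_tier(p: Publication) -> int:
    """Return 1 (high), 2 (medium) or 3 (low), following the grid's examples."""
    if p.contains_pii:
        return 1
    if p.differential_privacy:
        # Assumption: a correctly configured DP analysis is treated as Tier 3.
        return 3
    sensitive = p.protected_country or p.fundraising
    if p.granular:
        tier = 1 if sensitive else 2
    else:
        tier = 2 if sensitive else 3
    # The grid lists recurring releases of medium/low risk data one tier higher.
    if p.recurring and tier > 1:
        tier -= 1
    return tier


print(suggest_tier(Publication(granular=True, fundraising=True)))
print(suggest_tier(Publication(protected_country=True)))
print(suggest_tier(Publication()))
```

The function returns a suggestion only; when the result is unclear or contested, the FAQ's advice to submit a request still applies.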

Frequently asked questions

  • Q: What is the Risk Tiering Grid used for? The Risk Tiering Grid helps Wikimedia Foundation teams that work with data know when their work requires privacy review by Legal and Security.
  • Q: What are the key risks the Tiering Grid measures? The key risks lie in both overuse and underuse of the grid. If too many things are triaged to Legal and Security, then Legal and Security become the bottleneck for necessary workflows. On the other hand, if projects go live that would have been halted or mitigated under privacy review, the Foundation is exposed to privacy risks, including reputational, legal, and security risks.
  • Q: Who are the intended audiences of the Tiering Grid? Teams that work with data in product and tech.
  • Q: What has changed from the existing risk review process? The existing review process required every single schema and data project to undergo Legal review. This requirement was not being followed, and it was not practical for either data teams or Legal to follow.
  • Q: What is the process for updating the Tiering Grid or resolving Tiering disagreements?
    • Get Privacy approval
    • Anyone can initiate an update/amendment but approval must be sought across the board before implementing
    • Ongoing feedback immediately following launch, regular recalibration thereafter (say, every quarter or half)
  • Q: What should I do if I am unsure whether to reach out to the Legal and Security teams? When in doubt, it is better to err on the side of caution and submit an L3SC request.

Threshold table

Use this table to determine whether your analysis is granular or high-level, which informs the tier/risk level assigned to the analysis. Note: thresholds are determined based solely on the statistics being released; that is, if you are only releasing information about edits, you do not need to account for how many editors generated the edits.

Classification of analysis based on counts

Data unit type                      "Granular"    "High-level"
Users (including unique devices)    <25           ≥25
Edits                               <50           ≥50
App interactions                    <100          ≥100
Views                               <250          ≥250
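The table can be read as a simple lookup: a count below the threshold for its data unit type is "granular", otherwise it is "high-level". A minimal sketch of that lookup follows; the dictionary keys and the `classify` function name are hypothetical labels chosen for this example, while the numeric thresholds are taken from the table above.

```python
# Thresholds from the threshold table; the key names are illustrative only.
THRESHOLDS = {
    "users": 25,             # includes unique devices
    "edits": 50,
    "app_interactions": 100,
    "views": 250,
}


def classify(unit_type: str, count: int) -> str:
    """Classify a released count as 'granular' or 'high-level'."""
    threshold = THRESHOLDS[unit_type]
    return "granular" if count < threshold else "high-level"


print(classify("users", 24))
print(classify("users", 25))
print(classify("views", 249))
```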

For reverts, report the rate and a rough total if either the reverted edit count or the total edit count is below the threshold. For example:

  • If 8 out of 49 edits were reverted:
    • "16.3% reverted (out of <50 edits)"
  • If 49 out of 49 edits were reverted:
    • "100% reverted (out of <50 edits)"
  • If 20 out of 580 edits were reverted:
    • "3.4% reverted (out of ~600 edits)"
    • "3.4% reverted (out of >500 edits)"
  • If 50 out of 50 edits were reverted:
    • OK to leave as-is (both counts meet threshold)

This guidance also applies to reporting below-threshold percentages for other data types.
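The revert examples above can be sketched as a small formatting helper. This is an illustration under stated assumptions, not a prescribed implementation: the `report_reverts` name is hypothetical, the rough total is produced by rounding to one significant figure (which yields the "~600" form shown above), and the wording of the as-is case is made up for this example.

```python
EDIT_THRESHOLD = 50  # "Edits" row of the threshold table


def report_reverts(reverted: int, total: int, threshold: int = EDIT_THRESHOLD) -> str:
    """Format a revert rate without disclosing below-threshold counts."""
    if total <= 0:
        raise ValueError("total must be positive")
    # One decimal place, dropping a trailing ".0" (so 100.0 becomes 100).
    rate = f"{100 * reverted / total:.1f}".rstrip("0").rstrip(".")
    if total < threshold:
        return f"{rate}% reverted (out of <{threshold} edits)"
    if reverted < threshold:
        # Assumption: round the total to one significant figure.
        digits = len(str(total)) - 1
        rough = round(total, -digits)
        return f"{rate}% reverted (out of ~{rough} edits)"
    # Both counts meet the threshold, so exact figures can be left as-is.
    return f"{reverted} out of {total} edits reverted ({rate}%)"


print(report_reverts(8, 49))
print(report_reverts(20, 580))
```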

Publication risk mitigation checklist

This self-service checklist is intended to help data scientists and analysts lower the risk of a high or medium risk data publication and reduce unintentional disclosures of private information.

Before you post data publicly (which includes pushing a notebook to Gerrit or GitLab), have you

  • entered this data publication into the data publication log form?
  • cleared outputs that display raw data?
  • cleared outputs that display granular data (as defined in the threshold table above)?
  • obfuscated rows that display granular data? For example:
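Python:
# imagine we are analyzing the number of *users* who tried a feature
# (assumes `df` is an existing pandas DataFrame)

# set constants
threshold = 25
col = "num_users"

# obfuscate rows: find the rows first, then cast the column to object
# so that it can hold the "<25" string alongside the remaining counts
mask = df[col] < threshold
df[col] = df[col].astype(object)
df.loc[mask, col] = f'<{threshold}'

R:
library(tidyverse)
library(glue)

# set constants
threshold <- 25

# obfuscate rows
df <- df |>
  mutate(num_users = ifelse(num_users < threshold, glue("<{threshold}"), num_users))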
  • filtered out rows that display granular data? For example:
Python:
# imagine we are analyzing the *app interactions* that users performed
# (assumes `df` is an existing pandas DataFrame)

# set constants
threshold = 100
col = "num_interactions"

# filter out rows below threshold
df = df[df[col] >= threshold]

R:
library(tidyverse)

# set constants
threshold <- 100

# filter out rows below threshold
df <- df |>
  filter(num_interactions >= threshold)
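
For the "cleared outputs" items in the checklist, one possible approach is to strip outputs from a notebook before it is committed. The sketch below relies only on the fact that a Jupyter notebook (nbformat 4) is a JSON document whose code cells carry `outputs` and `execution_count` fields; the `clear_outputs` helper and the sample notebook are hypothetical and written for illustration.

```python
import json


def clear_outputs(notebook: dict) -> dict:
    """Return a copy of an nbformat-4 notebook with code-cell outputs removed."""
    cleaned = json.loads(json.dumps(notebook))  # deep copy via JSON round-trip
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# Tiny in-memory notebook standing in for a file loaded with json.load().
sample = {
    "nbformat": 4,
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 3,
            "source": "df.head()",
            "outputs": [{"output_type": "stream", "text": "raw rows"}],
        },
        {"cell_type": "markdown", "source": "notes"},
    ],
}

cleaned = clear_outputs(sample)
print(cleaned["cells"][0]["outputs"])
```

Clearing outputs removes displayed data only; any raw or granular values written into the code or markdown cells themselves still need to be checked by hand.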

General risk heuristics

Below, "X > Y > Z" means that X is riskier than Y, which is in turn riskier than Z.

  • Data type:
    • Geography:
      • city > (sub-national) region > country > subcontinent > continent > global
      • country protection list > non-country protection list
    • Device details:
      • raw User-Agent > browser or OS type > device type
      • raw IP > partially-redacted IP range
    • Temporal:
      • dt > hourly > daily > monthly
    • Combinations of multiple keys > any key on its own (e.g. country + project > country or project)
  • User activity type:
    • fundraising activity > editing activity > interaction activity > reading activity
  • Wikimedia Foundation activity type:
    • data collection > data analysis
    • granular analysis > high-level analysis
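
The "X > Y > Z" orderings above can be represented as ordered lists, riskiest first, which makes it easy to compare two levels of the same dimension. This is a minimal sketch: the list names, the shortened labels, and the `riskier` helper are assumptions for illustration, and the heuristics do not define how to compare across different dimensions.

```python
# Each scale is ordered from riskiest to least risky, following the heuristics.
GEOGRAPHY = ["city", "region", "country", "subcontinent", "continent", "global"]
TEMPORAL = ["dt", "hourly", "daily", "monthly"]
USER_ACTIVITY = ["fundraising", "editing", "interaction", "reading"]


def riskier(scale: list, a: str, b: str) -> str:
    """Return whichever of a and b appears earlier (is riskier) on the scale."""
    return min(a, b, key=scale.index)


print(riskier(GEOGRAPHY, "country", "city"))
print(riskier(TEMPORAL, "monthly", "daily"))
```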

Contact us

If you think that these guidelines have potentially been breached, or if you have questions or comments about compliance with the guidelines, please contact us at privacy@wikimedia.org.

Notes

  1. This process requires specialist help to ensure that the DP algorithm is correctly configured, as well as adequate documentation.

