Legal:Data publication guidelines: Difference between revisions
<languages /> |
|||
⚫ | The right to privacy is at the core of how communities contribute to Wikimedia projects and upholding this right is central to our [ |
||
{{staff-policy}} |
|||
<translate> |
|||
⚫ | The right to privacy is at the core of how communities contribute to Wikimedia projects and upholding this right is central to our [[<tvar name="1">Special:MyLanguage/Policy:Human Rights Policy</tvar>|human rights commitments]]. These data publication guidelines are the best practices at the Wikimedia Foundation for managing risk in data publication. They complement our [[<tvar name="2">Special:MyLanguage/Legal:Data retention guidelines</tvar>|Data retention guidelines]] and contribute to our commitment to protect users' data as elaborated in our [[<tvar name="3">Special:MyLanguage/Policy:Privacy policy</tvar>|privacy policy]]. |
||
Similar guidelines pertaining to data collection are forthcoming, in order to more fully govern the entire lifecycle of data in |
Similar guidelines pertaining to data collection are forthcoming, in order to more fully govern the entire lifecycle of data in Wikimedia Foundation systems. |
||
== Data publication risk tiering grid == |
== Data publication risk tiering grid == |
||
</translate> |
|||
{| class="wikitable" |
{| class="wikitable" |
||
!'''[ |
!'''[//office.wikimedia.org/wiki/Security/Policy/Data_Classification <translate>Data classification</translate>]''' |
||
!'''Confidential''' |
!'''<translate>Confidential</translate>''' |
||
! colspan="2" |'''Restricted''' |
! colspan="2" |'''<translate>Restricted</translate>''' |
||
|- |
|- |
||
! scope="row" rowspan="2" |'''[ |
! scope="row" rowspan="2" |'''[//office.wikimedia.org/wiki/Security/Policy/Risk_Management <translate>Risk level</translate>]''' |
||
| style="background: #941616; color: white" |'''Tier 1: High risk''' |
| style="background: #941616; color: white" |'''<translate>Tier 1: High risk</translate>''' |
||
| style="background: #eb6c49; color: white" |'''Tier 2: Medium risk''' |
| style="background: #eb6c49; color: white" |'''<translate>Tier 2: Medium risk</translate>''' |
||
| style="background: #b071c7; color: white" |'''Tier 3: Low risk''' |
| style="background: #b071c7; color: white" |'''<translate>Tier 3: Low risk</translate>''' |
||
|- |
|- |
||
|Data that could ''certainly'' be used to cause harm |
|<translate>Data that could ''certainly'' be used to cause harm</translate> |
||
|Data that could ''likely'' or ''possibly'' be used to cause harm |
|<translate>Data that could ''likely'' or ''possibly'' be used to cause harm</translate> |
||
|Data that is unlikely to be used to cause harm or is private for administrative reasons |
|<translate>Data that is unlikely to be used to cause harm or is private for administrative reasons</translate> |
||
|- |
|- |
||
! scope="row" | Examples (non-exhaustive list) |
! scope="row" | <translate>Examples (non-exhaustive list)</translate> |
||
|<translate> |
|||
| |
|||
* Data containing PII |
* Data containing PII |
||
** see the [ |
** see the [<tvar name="1">//office.wikimedia.org/wiki/Security/Policy/Data_Classification#Confidential</tvar> data classification policy] and [[<tvar name="2">Special:MyLanguage/Policy:Privacy policy</tvar>|privacy policy]]</translate> |
||
<translate> |
|||
* Granular analyses of |
* Granular analyses of |
||
** [[wikitech:Country protection list|country protection list]] countries |
** [[<tvar name="1">wikitech:Country protection list</tvar>|country protection list]] countries |
||
** fundraising data |
** fundraising data |
||
* Recurring data releases of medium risk data |
* Recurring data releases of medium risk data</translate> |
||
|<translate> |
|||
| |
|||
* High-level analyses of |
* High-level analyses of |
||
** [[wikitech:Country protection list|country protection list]] countries |
** [[<tvar name="1">wikitech:Country protection list</tvar>|country protection list]] countries |
||
** fundraising data |
** fundraising data</translate> |
||
<translate> |
|||
* Granular analyses of |
* Granular analyses of |
||
** '''''non'''''-country protection list countries |
** '''''non'''''-country protection list countries |
||
Line 37: | Line 42: | ||
** interaction data |
** interaction data |
||
** reading data |
** reading data |
||
* Recurring data releases of low risk data |
* Recurring data releases of low risk data</translate> |
||
|<translate> |
|||
| |
|||
* High-level analyses of |
* High-level analyses of |
||
** '''''non'''''-country protection list countries |
** '''''non'''''-country protection list countries |
||
Line 44: | Line 49: | ||
** editing data |
** editing data |
||
** interaction data |
** interaction data |
||
** reading data |
** reading data</translate> |
||
* Any analyses that utilize [ |
* <translate>Any analyses that utilize [[<tvar name="1">{{lwp|Differential privacy}}</tvar>|differential privacy]]</translate><ref><translate>This process requires specialist help to ensure that the DP algorithm is correctly configured, as well as adequate documentation.</translate></ref> |
||
* Collations and combinations of already-public data that it may be inconvenient/difficult for external parties to access |
* <translate>Collations and combinations of already-public data that it may be inconvenient/difficult for external parties to access</translate> |
||
|- |
|- |
||
! scope="row |'''Response time goal''' |
! scope="row |'''<translate>Response time goal</translate>''' |
||
|3 work weeks |
|<translate>3 work weeks</translate> |
||
|5 work days |
|<translate>5 work days</translate> |
||
|<translate>N/A</translate> |
|||
|N/A |
|||
|- |
|- |
||
! scope="row |'''Expected % of requests (internal metric)''' |
! scope="row |'''<translate>Expected % of requests (internal metric)</translate>''' |
||
|15% |
|{{formatnum:15}}% |
||
|35% |
|{{formatnum:35}}% |
||
|50% |
|{{formatnum:50}}% |
||
|- |
|- |
||
! scope="row" colspan="4" |'''What this means for |
! scope="row" colspan="4" |'''<translate>What this means for Wikimedia Foundation teams</translate>''' |
||
|- |
|- |
||
! scope="row" |'''Follow-up actions''' |
! scope="row" |'''<translate>Follow-up actions</translate>''' |
||
| |
| |
||
* Do not upload this data to non- |
* <translate>Do not upload this data to non-Wikimedia Foundation servers</translate> |
||
* Clear outputs before committing code, even to [[mw:GitLab/ |
* <translate>Clear outputs before committing code, even to [[<tvar name="1">:mw:Special:MyLanguage/GitLab/Hosting a project on GitLab#GitLab_private_(restricted)_repos</tvar>|private Gitlab repos]]</translate> |
||
* Legal and Security will consider publication of high risk data on a case-by-case basis after review and risk mitigation |
* <translate>Legal and Security will consider publication of high risk data on a case-by-case basis after review and risk mitigation</translate> |
||
| |
| |
||
* '''''Unsanitized data''''' can be uploaded to '''''private''''' servers outside of |
* <translate>'''''Unsanitized data''''' can be uploaded to '''''private''''' servers outside of the Wikimedia Foundation ([[<tvar name="1">:mw:Special:MyLanguage/GitLab/Hosting a project on GitLab#GitLab_private_(restricted)_repos</tvar>|private Gitlab repos]], Slack, Drive, etc.)</translate> |
||
<translate> |
|||
* '''''Sanitized data''''' is considered to be low risk, and can be uploaded to '''''public''''' servers outside of |
* '''''Sanitized data''''' is considered to be low risk, and can be uploaded to '''''public''''' servers outside of the Wikimedia Foundation (Gitlab, presentations, mailing lists, etc.). Data sanitization involves |
||
** clearing all outputs that display raw data |
** clearing all outputs that display raw data |
||
** filtering out or obfuscating granular analyses as defined by the threshold table below |
** filtering out or obfuscating granular analyses as defined by the threshold table below</translate> |
||
* Legal and Security will consider publication of medium risk data on a case-by-case basis after review and risk mitigation |
* <translate>Legal and Security will consider publication of medium risk data on a case-by-case basis after review and risk mitigation</translate> |
||
| |
| |
||
* This data can be uploaded to public servers outside of |
* <translate>This data can be uploaded to public servers outside of the Wikimedia Foundation (Gitlab, presentations, mailing lists, etc.)</translate> |
||
|} |
|} |
||
<translate> |
|||
Note: the country protection list is a reference guide for countries potentially dangerous for internet freedom and not indicative of the Foundation's working relationship with each country |
Note: the country protection list is a reference guide for countries potentially dangerous for internet freedom and not indicative of the Foundation's working relationship with each country |
||
</translate> |
|||
=== {{int string|faq}} === |
|||
⚫ | |||
=== FAQs === |
|||
⚫ | * <translate>'''Q: What are the key risks the Tiering Grid measures?''' The key risks are on both the overuse and underuse of the spectrum. If this is used in such a way that too many things are being triaged to Legal and Security, then Legal and Security become the bottleneck for necessary workflow. On the other hand, if projects go live that would have been halted or mitigated under privacy review, that exposes the Foundation to privacy risks — including reputational, legal, and security risks.</translate> |
||
⚫ | |||
⚫ | |||
⚫ | * '''Q: What are the key risks the Tiering Grid measures?''' The key risks are on both the overuse and underuse of the spectrum. If this is used in such a way that too many things are being triaged to Legal and Security, then Legal and Security become the bottleneck for necessary workflow. On the other hand, if projects go live that would have been halted or mitigated under privacy review, that exposes the Foundation to privacy risks — including reputational, legal, and security risks. |
||
⚫ | |||
⚫ | |||
⚫ | |||
⚫ | |||
<translate> |
|||
⚫ | |||
** Get Privacy approval |
** Get Privacy approval |
||
** Anyone can initiate an update/amendment but approval must be sought across the board before implementing |
** Anyone can initiate an update/amendment but approval must be sought across the board before implementing |
||
** Ongoing feedback immediately following launch, regular recalibration thereafter (say, every quarter or half) |
** Ongoing feedback immediately following launch, regular recalibration thereafter (say, every quarter or half)</translate> |
||
* '''Q: What should I do if I am unsure whether to reach out to the Legal and Security teams?''' When in doubt, it is better to err on the side of caution and submit a [ |
* <translate>'''Q: What should I do if I am unsure whether to reach out to the Legal and Security teams?''' When in doubt, it is better to err on the side of caution and submit a [<tvar name="1">//form.asana.com/?k=pkWf5vzxmgF4GrjpoH4hPQ&d=3758245663860</tvar> L3SC request].</translate> |
||
<translate> |
|||
⚫ | |||
⚫ | |||
Use this table to determine whether your analysis is '''''granular''''' or '''''high-level''''', informing which tier/risk level the analysis is considered as. Note: thresholds are determined based solely on the statistics being released — i.e. if you are only releasing information about edits, you do not need to account for how many editors generated the edits. |
Use this table to determine whether your analysis is '''''granular''''' or '''''high-level''''', informing which tier/risk level the analysis is considered as. Note: thresholds are determined based solely on the statistics being released — i.e. if you are only releasing information about edits, you do not need to account for how many editors generated the edits. |
||
</translate> |
|||
{| class="wikitable" style="text-align: center;" |
{| class="wikitable" style="text-align: center;" |
||
! rowspan="2" |'''Data unit type''' |
! rowspan="2" |'''<translate>Data unit type</translate>''' |
||
! colspan="2" |'''Classification of analysis based on counts''' |
! colspan="2" |'''<translate>Classification of analysis based on counts</translate>''' |
||
|- |
|- |
||
!'''"Granular"''' |
!'''<translate>"Granular"</translate>''' |
||
!'''"High-level"''' |
!'''<translate>"High-level"</translate>''' |
||
|- |
|- |
||
! scope="row |'''Users (including unique devices)''' |
! scope="row |'''<translate>Users (including unique devices)</translate>''' |
||
|<25 |
|<{{formatnum:25}} |
||
|≥{{formatnum:25}} |
|||
|≥25 |
|||
|- |
|- |
||
! scope="row |'''Edits''' |
! scope="row |'''<translate>Edits</translate>''' |
||
|<50 |
|<{{formatnum:50}} |
||
|≥{{formatnum:50}} |
|||
|≥50 |
|||
|- |
|- |
||
! scope="row |'''App interactions''' |
! scope="row |'''<translate>App interactions</translate>''' |
||
|<100 |
|<{{formatnum:100}} |
||
|≥{{formatnum:100}} |
|||
|≥100 |
|||
|- |
|- |
||
! scope="row |'''Views''' |
! scope="row |'''<translate>Views</translate>''' |
||
|<250 |
|<{{formatnum:250}} |
||
|≥{{formatnum:250}} |
|||
|≥250 |
|||
|} |
|} |
||
<translate> |
|||
For reverts, report rate and a rough total if the reverted edit count or total edit count are less than the threshold. For example: |
For reverts, report rate and a rough total if the reverted edit count or total edit count are less than the threshold. For example: |
||
* If 8 out of 49 edits were reverted: |
* If 8 out of 49 edits were reverted: |
||
** |
** "16.3% reverted (out of <50 edits)"</translate> |
||
<translate> |
|||
* If 49 out of 49 edits were reverted: |
* If 49 out of 49 edits were reverted: |
||
** |
** "100% reverted (out of <50 edits)"</translate> |
||
<translate> |
|||
* If 20 out of 580 edits were reverted: |
* If 20 out of 580 edits were reverted: |
||
** |
** "3.4% reverted (out of ~600 edits)" |
||
** |
** "3.4% reverted (out of >500 edits)"</translate> |
||
<translate> |
|||
* If 50 out of 50 edits were reverted: |
* If 50 out of 50 edits were reverted: |
||
** OK to leave as-is (both counts meet threshold) |
** OK to leave as-is (both counts meet threshold) |
||
Line 131: | Line 143: | ||
== Publication risk mitigation checklist == |
== Publication risk mitigation checklist == |
||
This self-service checklist is intended to help data scientists and analysts lower the risk of a high or medium risk data publication and reduce unintentional disclosures of private information. |
This self-service checklist is intended to help data scientists and analysts lower the risk of a high or medium risk data publication and reduce unintentional disclosures of private information. |
||
Before you post data publicly (which includes pushing a notebook to gerrit or gitlab), have you |
Before you post data publicly (which includes pushing a notebook to gerrit or gitlab), have you |
||
* entered this data publication into the [ |
* entered this data publication into the [<tvar name="1">//docs.google.com/forms/d/e/1FAIpQLSds6m1puVoWHUoeYOq-4IoVg81aqrdkuQjWX8BZTrTjdBh5Fg/viewform</tvar> data publication log form]? |
||
* cleared outputs that display raw data? |
* cleared outputs that display raw data?</translate> |
||
<translate> |
|||
* cleared outputs that display granular data (as defined in the threshold table above)? |
* cleared outputs that display granular data (as defined in the threshold table above)? |
||
* obfuscated rows that display granular data? For example: |
* obfuscated rows that display granular data? For example: |
||
</translate> |
|||
{| class="wikitable" |
{| class="wikitable" |
||
!Python |
!<translate>Python</translate> |
||
!<translate>R</translate> |
|||
!R |
|||
|- |
|- |
||
| |
| |
||
<syntaxhighlight lang="python"> |
<syntaxhighlight lang="python"> |
||
# imagine we are doing an analysis of the number of *users* to try a feature |
# <translate>imagine we are doing an analysis of the number of *users* to try a feature</translate> |
||
# set constants |
# <translate>set constants</translate> |
||
threshold = 25 |
threshold = 25 |
||
col = "num_users" |
col = "num_users" |
||
# obfuscate rows |
# <translate>obfuscate rows</translate> |
||
df.loc[df[col] < threshold, col] = |
df.loc[df[col] < threshold, col] = f'<{threshold}' |
||
</syntaxhighlight> |
</syntaxhighlight> |
||
| |
| |
||
Line 165: | Line 181: | ||
</syntaxhighlight> |
</syntaxhighlight> |
||
|} |
|} |
||
* filtered out rows that display granular data? For example: |
* <translate>filtered out rows that display granular data? For example:</translate> |
||
{| class="wikitable" |
{| class="wikitable" |
||
!Python |
!Python |
||
Line 172: | Line 188: | ||
| |
| |
||
<syntaxhighlight lang="python"> |
<syntaxhighlight lang="python"> |
||
# imagine we are doing an analysis of *app interactions* on the users did |
# <translate>imagine we are doing an analysis of *app interactions* on the users did</translate> |
||
# set constants |
# set constants |
||
Line 178: | Line 194: | ||
col = "num_interactions" |
col = "num_interactions" |
||
# filter out rows below threshold |
# <translate>filter out rows below threshold</translate> |
||
df = df[df[col] >= threshold] |
df = df[df[col] >= threshold] |
||
</syntaxhighlight> |
</syntaxhighlight> |
||
Line 192: | Line 208: | ||
</syntaxhighlight> |
</syntaxhighlight> |
||
|} |
|} |
||
<translate> |
|||
== General risk heuristics == |
== General risk heuristics == |
||
''Below, |
''Below, "X > Y > Z" means that X is riskier than Y, which is in turn riskier than Z.'' |
||
⚫ | |||
⚫ | |||
<translate> |
|||
** Geography: |
** Geography: |
||
*** city > (sub-national) region > country > subcontinent > continent > global |
*** city > (sub-national) region > country > subcontinent > continent > global |
||
*** country protection list > non-country protection list |
*** country protection list > non-country protection list</translate> |
||
<translate> |
|||
** Device details: |
** Device details: |
||
*** raw User-Agent > browser or OS type > device type |
*** raw User-Agent > browser or OS type > device type |
||
*** raw IP > partially-redacted IP range |
*** raw IP > partially-redacted IP range</translate> |
||
<translate> |
|||
** Temporal: |
** Temporal: |
||
*** dt > hourly > daily > monthly |
*** dt > hourly > daily > monthly |
||
** Combos of multiple keys > any key on its own (i.e. country + project > country or project) |
** Combos of multiple keys > any key on its own (i.e. country + project > country or project)</translate> |
||
<translate> |
|||
* User activity type: |
* User activity type: |
||
** fundraising activity > editing activity > interaction activity > reading activity |
** fundraising activity > editing activity > interaction activity > reading activity</translate> |
||
<translate> |
|||
* |
* Wikimedia Foundation activity type: |
||
** data collection > data analysis |
** data collection > data analysis |
||
** granular analysis > high-level analysis |
** granular analysis > high-level analysis |
||
</translate> |
|||
⚫ | |||
<translate> |
|||
⚫ | |||
</translate> |
|||
== {{int string|Notes}} == |
|||
⚫ | |||
<references /> |
|||
⚫ | |||
⚫ | |||
⚫ | |||
[[Category:Content not setup for translation]] |
Revision as of 05:52, 9 May 2024
This policy or procedure is maintained by the Wikimedia Foundation. Please note that in the event of any differences in meaning or interpretation between the original English version of this content and a translation, the original English version takes precedence. |
The right to privacy is at the core of how communities contribute to Wikimedia projects and upholding this right is central to our human rights commitments. These data publication guidelines are the best practices at the Wikimedia Foundation for managing risk in data publication. They complement our Data retention guidelines and contribute to our commitment to protect users' data as elaborated in our privacy policy.
Similar guidelines pertaining to data collection are forthcoming, in order to more fully govern the entire lifecycle of data in Wikimedia Foundation systems.
Data publication risk tiering grid
Data classification | Confidential | Restricted | |
---|---|---|---|
Risk level | Tier 1: High risk | Tier 2: Medium risk | Tier 3: Low risk |
Data that could certainly be used to cause harm | Data that could likely or possibly be used to cause harm | Data that is unlikely to be used to cause harm or is private for administrative reasons | |
Examples (non-exhaustive list) | * Data containing PII
|
* High-level analyses of
|
* High-level analyses of
|
Response time goal | 3 work weeks | 5 work days | N/A |
Expected % of requests (internal metric) | 15% | 35% | 50% |
What this means for Wikimedia Foundation teams | |||
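Follow-up actions | * Do not upload this data to non-Wikimedia Foundation servers
* Clear outputs before committing code, even to private GitLab repos
* Legal and Security will consider publication of high risk data on a case-by-case basis after review and risk mitigation
|
* Unsanitized data can be uploaded to private servers outside of the Wikimedia Foundation (private GitLab repos, Slack, Drive, etc.)
* Sanitized data is considered to be low risk, and can be uploaded to public servers outside of the Wikimedia Foundation (GitLab, presentations, mailing lists, etc.). Data sanitization involves
** clearing all outputs that display raw data
** filtering out or obfuscating granular analyses as defined by the threshold table below
* Legal and Security will consider publication of medium risk data on a case-by-case basis after review and risk mitigation
|
* This data can be uploaded to public servers outside of the Wikimedia Foundation (GitLab, presentations, mailing lists, etc.)
|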
Note: the country protection list is a reference guide to countries that are potentially dangerous for internet freedom; it is not indicative of the Foundation's working relationship with each country.
Frequently asked questions
- Q: What is the Risk Tiering Grid used for? The Risk Tiering Grid helps Wikimedia Foundation teams that work with data know when their work requires privacy review by Legal and Security.
- Q: What are the key risks the Tiering Grid measures? The key risks lie in both overuse and underuse of the grid. If too many things are triaged to Legal and Security, those teams become a bottleneck for necessary work. On the other hand, if projects go live that would have been halted or mitigated under privacy review, the Foundation is exposed to privacy risks, including reputational, legal, and security risks.
- Q: Who are the intended audiences of the Tiering Grid? Teams that work with data in product and tech.
- Q: What has changed from the existing risk review process? The existing review process required every single schema and data project to undergo Legal review. That requirement was not being followed, and it was not practical for either data teams or Legal to follow.
- Q: What is the process for updating the Tiering Grid or resolving Tiering disagreements?
- Get Privacy approval
- Anyone can initiate an update/amendment but approval must be sought across the board before implementing
- Ongoing feedback immediately following launch, regular recalibration thereafter (say, every quarter or half)
- Q: What should I do if I am unsure whether to reach out to the Legal and Security teams? When in doubt, it is better to err on the side of caution and submit an L3SC request.
Threshold table
Use this table to determine whether your analysis is granular or high-level, which in turn determines the tier/risk level of the analysis. Note: thresholds are determined based solely on the statistics being released. For example, if you are only releasing information about edits, you do not need to account for how many editors generated the edits.
Data unit type | Classification of analysis based on counts | |
---|---|---|
"Granular" | "High-level" | |
Users (including unique devices) | <25 | ≥25 |
Edits | <50 | ≥50 |
App interactions | <100 | ≥100 |
Views | <250 | ≥250 |
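The threshold table can be expressed as a small lookup. The following is a minimal sketch, not an official tool: the dictionary keys and the function name `classify_analysis` are hypothetical names chosen for illustration, while the numeric thresholds are taken from the table above.

```python
# Thresholds from the table above; a count below the threshold is "granular".
# The key names are illustrative assumptions, not official identifiers.
THRESHOLDS = {
    "users": 25,             # includes unique devices
    "edits": 50,
    "app_interactions": 100,
    "views": 250,
}


def classify_analysis(unit_type: str, count: int) -> str:
    """Return "granular" or "high-level" for a released count of the given type."""
    threshold = THRESHOLDS[unit_type]
    return "granular" if count < threshold else "high-level"


print(classify_analysis("users", 24))   # below the users threshold
print(classify_analysis("views", 250))  # meets the views threshold
```

Because thresholds depend only on the statistic being released, the lookup takes the unit type of that statistic and nothing else.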
For reverts, report the rate and a rough total if either the reverted edit count or the total edit count is below the threshold. For example:
- If 8 out of 49 edits were reverted:
- "16.3% reverted (out of <50 edits)"
- If 49 out of 49 edits were reverted:
- "100% reverted (out of <50 edits)"
- If 20 out of 580 edits were reverted:
- "3.4% reverted (out of ~600 edits)"
- "3.4% reverted (out of >500 edits)"
- If 50 out of 50 edits were reverted:
- OK to leave as-is (both counts meet threshold)
This guidance also applies to reporting below-threshold percentages for other data types.
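The revert reporting rule can be sketched in code. This is one possible reading under stated assumptions: the function names are hypothetical, the rough total is produced by rounding to one significant figure (which reproduces the "~600" example but is not prescribed by the guidelines), and the wording of the exact, as-is case is illustrative only.

```python
EDIT_THRESHOLD = 50  # edits threshold from the threshold table


def format_percent(numerator: int, denominator: int) -> str:
    """Format a rate with one decimal place, dropping a trailing ".0"."""
    rate = 100 * numerator / denominator
    return f"{rate:.1f}".rstrip("0").rstrip(".") + "%"


def rough_total(total: int) -> int:
    """Round to one significant figure (an assumed convention), e.g. 580 -> 600."""
    magnitude = 10 ** (len(str(total)) - 1)
    return round(total / magnitude) * magnitude


def report_reverts(reverted: int, total: int, threshold: int = EDIT_THRESHOLD) -> str:
    """Describe a revert rate without exposing below-threshold counts."""
    if total <= 0 or not 0 <= reverted <= total:
        raise ValueError("expected 0 <= reverted <= total and total > 0")
    rate = format_percent(reverted, total)
    if total < threshold:
        # The total itself is below threshold: only state that it is small.
        return f"{rate} reverted (out of <{threshold} edits)"
    if reverted < threshold:
        # Only the reverted count is below threshold: give a rough total.
        return f"{rate} reverted (out of ~{rough_total(total)} edits)"
    # Both counts meet the threshold, so exact figures can be left as-is.
    return f"{reverted} out of {total} edits reverted ({rate})"


print(report_reverts(8, 49))
print(report_reverts(20, 580))
```

A lower bound such as ">500 edits" is an equally valid way to state the rough total; the rounding helper would need to round down instead.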
Publication risk mitigation checklist
This self-service checklist is intended to help data scientists and analysts lower the risk of a high- or medium-risk data publication and reduce unintentional disclosures of private information.
Before you post data publicly (which includes pushing a notebook to Gerrit or GitLab), have you
- entered this data publication into the data publication log form?
- cleared outputs that display raw data?
- cleared outputs that display granular data (as defined in the threshold table above)?
- obfuscated rows that display granular data? For example:
Python | R |
---|---|
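# imagine we are doing an analysis of the number of *users* who try a feature
# set constants
threshold = 25
col = "num_users"
# obfuscate rows (cast to object first so the numeric column can hold the "<25" label)
mask = df[col] < threshold
df[col] = df[col].astype(object)
df.loc[mask, col] = f'<{threshold}'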
|
library(tidyverse)
library(glue)
# set constants
threshold <- 25
df <- df |>
mutate(num_users = ifelse(num_users < threshold, glue("<{threshold}"), num_users))
|
- filtered out rows that display granular data? For example:
Python | R |
---|---|
# imagine we are doing an analysis of the *app interactions* that users performed
# set constants
threshold = 100
col = "num_interactions"
# filter out rows below threshold
df = df[df[col] >= threshold]
|
library(tidyverse)
# set constants
threshold <- 100
df <- df |>
filter(num_interactions >= threshold)
|
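For the checklist items about clearing outputs, the sketch below shows one way to strip outputs from a Jupyter notebook before committing it. It assumes the notebook follows the nbformat 4 JSON layout (a top-level `cells` list whose code cells carry `outputs` and `execution_count`); the function names are hypothetical. An established tool such as nbstripout or nbconvert is likely a better choice in practice, and this example only illustrates what "cleared outputs" means at the file level.

```python
import json


def clear_notebook_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with all code cell outputs removed."""
    cleaned = json.loads(json.dumps(notebook))  # deep copy via a JSON round trip
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop displayed tables, plots, prints
            cell["execution_count"] = None  # drop the execution counter as well
    return cleaned


def clear_notebook_file(path: str) -> None:
    """Rewrite the .ipynb file at `path` in place without any outputs."""
    with open(path, encoding="utf-8") as f:
        notebook = json.load(f)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(clear_notebook_outputs(notebook), f, indent=1)
        f.write("\n")
```

Clearing outputs removes what was displayed, not what the code can reproduce, so the obfuscation and filtering steps above still apply to any data file published alongside the notebook.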
General risk heuristics
Below, "X > Y > Z" means that X is riskier than Y, which is in turn riskier than Z.
- Data type:
- Geography:
- city > (sub-national) region > country > subcontinent > continent > global
- country protection list > non-country protection list
- Device details:
- raw User-Agent > browser or OS type > device type
- raw IP > partially-redacted IP range
- Temporal:
- dt > hourly > daily > monthly
- Combos of multiple keys > any key on its own (e.g. country + project > country or project)
- User activity type:
- fundraising activity > editing activity > interaction activity > reading activity
- Wikimedia Foundation activity type:
- data collection > data analysis
- granular analysis > high-level analysis
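The "X > Y > Z" orderings above can be encoded as ordered lists, from riskiest to least risky, which makes comparisons explicit. This is only an illustrative sketch: the list names, the short labels, and the helper `is_riskier` are assumptions, and the heuristics remain a guide for judgment, not a scoring system.

```python
# Each scale is ordered from riskiest (first) to least risky (last),
# mirroring the orderings listed above. Labels are shortened for illustration.
GEOGRAPHY = ["city", "region", "country", "subcontinent", "continent", "global"]
TEMPORAL = ["dt", "hourly", "daily", "monthly"]
USER_ACTIVITY = ["fundraising", "editing", "interaction", "reading"]


def is_riskier(scale: list[str], a: str, b: str) -> bool:
    """Return True if `a` appears before `b` on the scale, i.e. `a` is riskier."""
    return scale.index(a) < scale.index(b)


print(is_riskier(GEOGRAPHY, "city", "country"))
print(is_riskier(TEMPORAL, "monthly", "daily"))
```

Values from different scales are not comparable with this helper, and the rule that a combination of keys is riskier than any single key would need separate handling.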
Contact us
If you think that these guidelines have potentially been breached, or if you have questions or comments about compliance with the guidelines, please contact us at privacy@wikimedia.org.
Notes
- ↑ This process requires specialist help to ensure that the DP algorithm is correctly configured, as well as adequate documentation.