{"id":20066453,"url":"https://github.com/apparebit/diaphanous","last_synced_at":"2025-03-02T11:16:20.675Z","repository":{"id":211407574,"uuid":"646089718","full_name":"apparebit/diaphanous","owner":"apparebit","description":"Exploring the limits of social media transparency data","archived":false,"fork":false,"pushed_at":"2025-02-03T19:20:07.000Z","size":36955,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"boss","last_synced_at":"2025-02-03T20:26:42.487Z","etag":null,"topics":["csam","cybertipline","ncmec","social-media","transparency"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/apparebit.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-27T08:50:13.000Z","updated_at":"2025-02-03T19:20:10.000Z","dependencies_parsed_at":"2024-04-20T11:42:29.636Z","dependency_job_id":"25795b5a-b889-41b2-b104-13b7790cf4e4","html_url":"https://github.com/apparebit/diaphanous","commit_stats":null,"previous_names":["apparebit/intransparent","apparebit/diaphanous"],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apparebit%2Fdiaphanous","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apparebit%2Fdiaphanous/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apparebit%2Fdiaphanous/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apparebit
%2Fdiaphanous/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/apparebit","download_url":"https://codeload.github.com/apparebit/diaphanous/tar.gz/refs/heads/boss","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241494184,"owners_count":19971871,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["csam","cybertipline","ncmec","social-media","transparency"],"created_at":"2024-11-13T13:58:28.772Z","updated_at":"2025-03-02T11:16:20.647Z","avatar_url":"https://github.com/apparebit.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Diaphanous: Transparency Disclosures About the Sexual Exploitation of Minors\n\nThis repository curates quantitative transparency disclosures about the online\nsexual exploitation of minors, i.e., people under the age of eighteen, in\nmachine-readable form. It also includes a 4,400-line Python library for\nvalidating and tidying the data and Python as well as R notebooks with the\nanalysis for the corresponding report [Putting the Count Back Into\nAccountability: An Analysis of Transparency Data About the Sexual Exploitation\nof Minors](https://arxiv.org/abs/2402.14625), which is also available [through\nthis repository](technical-report.pdf).\n\nPlease cite as: Robert Grimm. Diaphanous: Transparency Disclosures About the\nSexual Exploitation of Minors. Zenodo, 12 Dec. 
2024,\n[![DOI](https://zenodo.org/badge/646089718.svg)](https://doi.org/10.5281/zenodo.13896437).\n\n\n## The Code\n\nTo run the code in this repository, you'll need the following tools:\n\n  * According to [vermin](https://github.com/netromdk/vermin), the minimum\n    required Python version is 3.11.\n  * The [analysis/platform.ipynb](analysis/platform.ipynb) notebook is written\n    in Python and R. The necessary bindings are provided by the\n    [rpy2](https://rpy2.github.io) Python package. The package is installed like\n    other Python packages as described in the next bullet point. But it does\n    require a working R installation (e.g.,\n    \u003ccode\u003ebrew\u0026nbsp;install\u0026nbsp;r\u003c/code\u003e).\n  * Required Python packages are listed in the repository's\n    [pyproject.toml](pyproject.toml). The simplest way to install the\n    project's dependencies is to create a local clone of this repository and\n    then install it as follows:\n    ```sh\n    $ python -m venv .venv   # Create virtual environment\n    $ . .venv/bin/activate   # Activate virtual environment\n    $ pip install -e .       # Install diaphanous as editable\n    ```\n    Thanks to the `-e` option, `pip install` creates a so-called editable\n    install, i.e., it makes the Python code in the `diaphanous` package\n    executable without copying it. It also installs all necessary dependencies.\n\nBuilding the report requires additional tools, i.e., a working LaTeX\ninstallation, though the necessary incantations [are scripted](report/build.sh).\n\n\n## The Data\n\nWhile a few CSV files contain [tidy\ndata](https://vita.had.co.nz/papers/tidy-data.pdf), others are decidedly untidy\nwith, for example, individual columns combining two variables. The organization\nof a dataset usually reflects that of the original disclosure and helps ensure\nthe correctness of data transcription. 
The Python package includes several\nexamples of how to tidy up such data.\n\n\n### Dataset 1: CyberTipline Reports per Year (1998 onward)\n\nThe [CyberTipline reports per year](data/ocse-reports-per-year.csv) dataset\ncaptures the number of reports NCMEC received on its CyberTipline since\ninception in March 1998, largely based on the table included in Appendix A of\nits\n[2022](https://www.missingkids.org/content/dam/missingkids/pdfs/OJJDP-NCMEC-Transparency_2022-Calendar-Year.pdf)\nand\n[2023](https://www.missingkids.org/content/dam/missingkids/pdfs/OJJDP-NCMEC-Transparency-CY-2023-Report.pdf)\ntransparency reports to the Office of Juvenile Justice and Delinquency\nPrevention at the Department of Justice.\n\nSimon Kemp's [Digital 2024: Global Overview\nReport](https://datareportal.com/reports/digital-2024-global-overview-report)\nincludes statistics on the [global number of social media user\nidentities](https://indd.adobe.com/embed/8892459e-f0f4-4cfd-bf47-f5da5728a5b5?startpage=207\u0026allowFullscreen=true).\nThey are an effective denominator for normalizing the CyberTipline reports per\nyear.\n\n\n### Dataset 2: CyberTipline Report Contents and Recipients (2020 onward)\n\nThe CyberTipline report [contents](data/ocse-report-contents.csv) and\n[recipients](data/ocse-report-recipients.csv) dataset breaks down the reports\nNCMEC received by:\n\n  * the category of sexual exploitation, e.g., whether a report concerns child\n    pornography, misleading words/images, online enticement, child sex\n    trafficking, obscene material sent to a child, misleading domain names,\n    child sexual molestation, or child sex tourism;\n  * the kind of attachments, e.g., photos, videos, or other;\n  * the uniqueness of attachments as determined by a precise hash (MD5) and a\n    perceptual hash (PhotoDNA, Videntifier);\n  * their level of detail, i.e., whether they are actionable or only\n    informational;\n  * their recipients in dedicated units, local, federal, or 
international law\n    enforcement.\n\nLabels for the uniqueness classification use \"unique\" for precisely hashed\nattachments and \"similar\" for perceptually hashed ones. The dataset combines\nseveral tables from NCMEC's\n[2022](https://www.missingkids.org/content/dam/missingkids/pdfs/OJJDP-NCMEC-Transparency_2022-Calendar-Year.pdf)\nand\n[2023](https://www.missingkids.org/content/dam/missingkids/pdfs/OJJDP-NCMEC-Transparency-CY-2023-Report.pdf)\ntransparency reports to the Office of Juvenile Justice and Delinquency\nPrevention at the Department of Justice.\n\n\n### Dataset 3: CyberTipline Reports per Platform (2019 onward)\n\nThe [CyberTipline reports per platform](data/ocse-reports-per-platform.json)\ndataset is the project's main dataset. It collects:\n\n  * disclosures about child sexual exploitation by major non-Chinese social\n    networks and other large service providers;\n  * corresponding disclosures about service providers' reporting by NCMEC.\n\nThe JSON file linked above is [automatically\ngenerated](diaphanous/platform/export.py) from a [Python\nmodule](diaphanous/platform/data.py). 
Both formats have the same structure\nand contain the same information.\n\nThe dataset incorporates information about these platforms:\n\n  * Amazon (owns Twitch)\n  * Apple\n  * Automattic (owns Tumblr and Wordpress)\n  * Aylo (née MindGeek)\n  * Discord\n  * Facebook (Meta)\n  * GitHub (Microsoft)\n  * Google (owns YouTube)\n  * Instagram (Meta)\n  * LinkedIn (Microsoft)\n  * Meta (owns Facebook, Instagram, and WhatsApp)\n  * Microsoft (owns GitHub and LinkedIn)\n  * MindGeek (now Aylo)\n  * Omegle\n  * Pinterest\n  * Pornhub (Aylo)\n  * Quora\n  * Reddit\n  * Snap\n  * Telegram\n  * TikTok\n  * Tumblr (Automattic)\n  * Twitch (Amazon)\n  * Twitter (now X)\n  * WhatsApp (Meta)\n  * Wikimedia\n  * Wordpress (Automattic)\n  * X (née Twitter)\n  * YouTube (Google)\n\nSurveyed organizations fall into at least one of the following categories:\n\n  * Social media based on Buffer's list of [top social media\n    sites](https://buffer.com/library/social-media-sites/),\n  * Popular platforms based on the European Commission's list of [very large\n    online\n    platforms](https://digital-strategy.ec.europa.eu/en/policies/list-designated-vlops-and-vloses),\n  * Platforms with considerable reported child sexual exploitation activity\n    based on NCMEC's transparency disclosures.\n\nA separate [codebook](codebook.md) documents the JSON and Python formats.\nBasically, they consist of a top-level object that maps organization names to an\nobject with the data about that organization. 
Since platforms vary widely in\nwhat metrics they disclose, the format necessarily is rather generic and\ncollects all of a platform's quantitative disclosures within one table:\n\n  * Since platforms make transparency disclosures for quarter, half, and full\n    years, each table also organizes metrics into time periods with the same\n    granularity.\n  * To faithfully capture disclosures, time periods may vary within a table.\n    They may also overlap, both to capture several partial disclosures and to\n    capture several redundant disclosures. A flag clearly marks the latter\n    entries.\n  * Where possible, the table uses standard labels for equivalent metrics:\n\n      * _reports_ tallies CyberTipline reports to NCMEC;\n      * _pieces_ tallies instances of CSAM such as pictures and videos;\n      * _accounts_ tallies user registrations implicated *and* terminated\n        for CSAM;\n\n    Instead of \"account termination,\" many platforms use a euphemism such as\n    \"permanent suspension.\" User registrations thusly impacted *are* included\n    under *accounts*. However, temporarily impacted registrations are not.\n\n[Comparable CyberTipline report counts](data/comparable-reports.csv) and\n[per-provider comparable CyberTipline report\ncounts](data/comparable-reports-by-provider.csv) are materialized views onto the\nsame data. Both views are in long format and only include rows for counts that\nwere disclosed by both electronic service provider and NCMEC.\n\nThe latter, more precise view has _year_, _observer_, _count_, and _topic_\ncolumns, with the topic column enabling the grouping of rows with service\nprovider and NCMEC as observers. 
The former, simplified view has only _id_,\n_observer_, and _count_ columns, with the ID column effectively combining the\nother view's _year_ and _topic_ columns and the _observer_ column only\ndistinguishing between a generic *ServiceProvider* and NCMEC.\n\n\n\n### Dataset 4: CyberTipline Reports per Country (2019 onward)\n\n[CyberTipline reports per country](data/ocse-reports-per-country.csv) collects\nNCMEC's per-country breakdown of CyberTipline reports for\n[2019](https://www.missingkids.org/content/dam/missingkids/pdfs/2019-cybertipline-reports-by-country.pdf),\n[2020](https://www.missingkids.org/content/dam/missingkids/pdfs/2020-reports-by-country.pdf),\n[2021](https://www.missingkids.org/content/dam/missingkids/pdfs/2021-reports-by-country.pdf),\n[2022](https://www.missingkids.org/content/dam/missingkids/pdfs/2022-reports-by-country.pdf),\nand\n[2023](https://www.missingkids.org/content/dam/missingkids/pdfs/2023-reports-by-country.pdf)\nin machine-readable form. The CSV table is mostly straightforward: Its first two\ncolumns comprise the country name and ISO three-letter code, followed by a\ncolumn per year from 2019 through 2022.\n\nTo preserve all information from NCMEC's disclosures, the table includes rows\nfor the Netherlands Antilles (ANT), \"Europe\" (EEE), Bouvet Island (BVT), and \"No\nCountry Listed\" (*no code*). NCMEC does not explain its inclusion of Europe in\naddition to individual European countries nor the Netherlands Antilles in\naddition to its 2010 successors Bonaire, Sint Eustatius, and Saba (BES), Curaçao\n(CUW), and Sint Maarten (SXM). Neither do they explain the inclusion of Bouvet\nIsland; the subantarctic dependency of Norway is an uninhabited nature reserve\nand hence rather unlikely to serve as actual location of internet users.\n\nThis repository's Python package includes [code that\nenriches](diaphanous/country.py) this dataset with population counts,\ngeometries, and region/continent information. 
It leverages the following data:\n\n  * Per-country population counts by the [United Nations Population\n    Division](https://population.un.org/dataportal/data/indicators/49/locations/4,8,12,16,20,24,660,28,32,51,533,36,40,31,44,48,50,52,112,56,84,204,60,64,68,535,70,72,76,92,96,100,854,108,132,116,120,124,136,140,148,152,156,344,446,158,170,174,178,184,188,384,191,192,531,196,203,408,180,208,262,212,214,218,818,222,226,232,233,748,231,238,234,242,246,250,254,258,266,270,268,276,288,292,300,304,308,312,316,320,831,324,624,328,332,336,340,348,352,356,360,364,368,372,833,376,380,388,392,832,400,398,404,296,412,414,417,418,428,422,426,430,434,438,440,442,450,454,458,462,466,470,584,474,478,480,175,484,583,492,496,499,500,504,508,104,516,520,524,528,540,554,558,562,566,570,807,580,578,512,586,585,591,598,600,604,608,616,620,630,634,410,498,638,642,643,646,652,654,659,662,663,666,670,882,674,678,682,686,688,690,694,702,534,703,705,90,706,710,728,724,144,275,729,740,752,756,760,762,764,626,768,772,776,780,788,792,795,796,798,800,804,784,826,834,840,850,858,860,548,862,704,876,732,887,894,716/start/2019/end/2022/table/pivotbylocation);\n  * Per-country internet user counts prepared by [Our World in\n    Data](https://ourworldindata.org/internet) from statistics released by the\n    International Telecommunication Union via WorldBank as well as the United\n    Nations;\n  * Administrative boundaries for countries by [Natural Earth, version\n    5.1.1](https://www.naturalearthdata.com/downloads/110m-cultural-vectors/);\n  * Per-country ISO 3166 Alpha-2 and Alpha-3 codes scraped from [ISO's\n    website](https://www.iso.org/obp/ui/#search/code/) and corresponding region\n    names based on [Luke Duncalfe's ISO-3166\n    dataset](https://github.com/lukes/ISO-3166-Countries-with-Regional-Codes).\n\nThe following choropleths using the Equal Earth projection visualize\nCyberTipline reports per year per country per capita:\n\n![CyberTipline reports per capita per country 
per\nyear](https://raw.githubusercontent.com/apparebit/diaphanous/boss/figure/countries.svg)\n\n\n### Dataset 5: Platform Data (2020 onward)\n\nDiscord, Meta, Microsoft, and TikTok have released (some) data in\nmachine-readable form. This dataset contains the corresponding files. Discord's\nand Meta's data is in CSV format, Microsoft's in Excel format, and TikTok's\nfirst in Excel and later in CSV format. Meta's and TikTok's files include historical data\nwhereas Discord's and Microsoft's do not. Since Meta re-uses [the same\nURL](https://transparency.fb.com/sr/community-standards/) every quarter, files\nreleased before Q2 2022 were retrieved from the Internet Archive's snapshots.\n\n\n### Dataset 6: Relationship between Offender and Victim\n\nThe [CSAM pieces by relationship to\nvictim](data/csam-pieces-by-relationship-to-victim.csv) dataset captures the\nrelationship between suspected offenders and victims as determined by law\nenforcement agencies and tabulated by NCMEC. It is included in NCMEC's\n[2022](https://www.missingkids.org/content/dam/missingkids/pdfs/OJJDP-NCMEC-Transparency_2022-Calendar-Year.pdf)\nand\n[2023](https://www.missingkids.org/content/dam/missingkids/pdfs/OJJDP-NCMEC-Transparency-CY-2023-Report.pdf)\ntransparency reports to the Office of Juvenile Justice and Delinquency\nPrevention at the Department of Justice.\n\nSince the number of victims in NCMEC's database seems to be very small, I pulled\nin two more datasets characterizing relationships as well. [The\nfirst](data/ojjdp-qa02403-2019.csv) stems from [OJJDP's Statistical Briefing\nBook](https://www.ojjdp.gov/ojstatbb/victims/qa02403.asp?qaDate=2019) and covers\nthe years 2018 and 2019. The data was originally extracted from the FBI's National\nIncident-Based Reporting System Master Files. Note that all counts are relative\nto \"typical 1,000 sexual assaults.\" [The\nsecond](data/learcat-relationship-2016.csv) stems from\n[LEARCAT](https://learcat.bjs.ojp.gov) and covers the year 2016. 
It also draws\non the FBI's National Incident-Based Reporting System. While the Briefing Book\ndata is helpful indeed, the choice of relationship bins for the LEARCAT data\nrenders it close to useless in this context.\n\n\n### Other Data\n\nThe `data` directory contains a few more tables, including one with [global\npopulation sizes](data/global-population.csv) also provided by the UN Population\nDivision and one with Meta's [daily and monthly active\npeople](data/meta/family-active-people.csv), which captures the number of users\nwho logged into Facebook, Instagram, Messenger, or WhatsApp at least once over a\nday or month. Both tables are used to calculate Meta's daily and monthly active\npeople as a fraction of the world population.\n\n\n## Repository Layout\n\nIn addition to the data, this repository also contains the Python code for\nanalyzing it as well as resulting figures. In particular:\n\n  - The `analysis` directory contains notebooks with the high-level analysis\n    code. The `index.ipynb` notebook includes almost all other notebooks.\n  - The `diaphanous` directory contains the Python library code used by the\n    notebooks.\n      - The remaining code in `diaphanous.main` should be refactored into\n        notebooks.\n      - The `show()` function in `diaphanous.show` is more generally useful.\n        Most of this functionality should be up-streamed to Pandas because it\n        significantly improves on the default table format.\n  - The `figure` directory contains SVG figures.\n  - The `stubs` directory contains typing stubs.\n  - The `report` directory contains the LaTeX sources for the article discussing\n    the work.\n\n\n## Acronyms\n\n  - **CSAM**: Child Sexual Abuse Material\n  - **CSE**: Child Sexual Exploitation\n  - **NCMEC**: National Center for Missing and Exploited Children\n  - **OCSE**: Online Child Sexual Exploitation\n  - **OJJDP**: Office of Juvenile Justice and Delinquency Prevention (at the US\n    Department of Justice)\n\n\n## 
Licensing\n\nThe code in this repository is ©️ 2023–2024 by Robert Grimm and has been\nreleased under the [Apache 2.0](LICENSE) open source license. The datasets in\nthis repository combine disclosures by electronic service providers as well as\nthe National Center for Missing and Exploited Children (NCMEC) and make this\ndata more easily accessible in machine-readable form. It has been released under\nthe [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) license.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapparebit%2Fdiaphanous","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fapparebit%2Fdiaphanous","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapparebit%2Fdiaphanous/lists"}