https://github.com/commoncrawl/ml-opt-out-experiments
A series of experiments into ML opt–out protocols
- Host: GitHub
- URL: https://github.com/commoncrawl/ml-opt-out-experiments
- Owner: commoncrawl
- Created: 2024-02-22T22:45:57.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-02-22T22:54:11.000Z (over 2 years ago)
- Last Synced: 2025-06-12T08:49:23.292Z (about 1 year ago)
- Language: Python
- Size: 2.93 KB
- Stars: 5
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
## What is this?
PySpark jobs for investigating the prevalence of ML opt-out protocols, written by [Alex Xue](https://commoncrawl.org/team/alex-xue) for the blog post [A Further Look Into the Prevalence of Various ML Opt–Out Protocols](https://commoncrawl.org/blog/a-further-look-into-the-prevalence-of-various-ml-opt-out-protocols).
## How Do I Run It?
Requires `sparkcc.py` from [commoncrawl/cc-pyspark](https://github.com/commoncrawl/cc-pyspark/blob/main/sparkcc.py).
Setup is otherwise the same as for [cc-pyspark](https://github.com/commoncrawl/cc-pyspark). Make sure an `./input` directory exists; the commands in this README read `./input/test_warc.txt` from it.
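If the `./input` directory does not exist yet, it can be created together with a listing file. The following is a minimal sketch, assuming the listing holds one input path per line as in cc-pyspark; the helper name `write_input_listing` and the example path are hypothetical and not part of this repository.

```python
from pathlib import Path


def write_input_listing(paths, listing="input/test_warc.txt"):
    """Write one input path per line to `listing`, creating its directory if needed.

    Hypothetical helper: the one-path-per-line layout is an assumption
    based on how cc-pyspark reads its input listing.
    """
    listing_path = Path(listing)
    listing_path.parent.mkdir(parents=True, exist_ok=True)
    listing_path.write_text("".join(f"{p}\n" for p in paths))
    return listing_path


# Example call (placeholder path; substitute the WARC files you actually have):
# write_input_listing(["file:/path/to/example.warc.gz"])
```

The default `listing` value matches the `./input/test_warc.txt` argument used by the `spark-submit` commands in this README.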
To run a job, replace `job_name.py` with the job script and `output_file_name` with an output name of your choice:
```
$SPARK_HOME/bin/spark-submit job_name.py \
--num_output_partitions 1 --log_level WARN \
./input/test_warc.txt output_file_name
```
To run `html_metatag_count.py`, which has a different output schema, also pass `--tuple_key_schema True`:
```
$SPARK_HOME/bin/spark-submit ./html_metatag_count.py \
--num_output_partitions 1 --log_level WARN --tuple_key_schema True \
./input/test_warc.txt output_file_name
```
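When running several jobs in a row, the two invocations above can be assembled from Python. This is a sketch only: `build_submit_command` and `run_job` are hypothetical helpers, not part of this repository, and they use only the flags shown in the commands above.

```python
import os
import subprocess


def build_submit_command(job, input_listing, output_name,
                         tuple_key_schema=False, spark_home=None):
    """Return the spark-submit argument list for one job.

    Hypothetical helper mirroring the README commands; set
    tuple_key_schema=True for html_metatag_count.py.
    """
    # Fall back to the SPARK_HOME environment variable, as the shell commands do.
    if spark_home is None:
        spark_home = os.environ.get("SPARK_HOME", "")
    cmd = [
        os.path.join(spark_home, "bin", "spark-submit"),
        job,
        "--num_output_partitions", "1",
        "--log_level", "WARN",
    ]
    if tuple_key_schema:
        cmd += ["--tuple_key_schema", "True"]
    # Positional arguments come last: input listing, then output name.
    cmd += [input_listing, output_name]
    return cmd


def run_job(job, input_listing, output_name, **kwargs):
    """Run one job and raise CalledProcessError if spark-submit fails."""
    subprocess.run(build_submit_command(job, input_listing, output_name, **kwargs),
                   check=True)
```

For example, `run_job("./html_metatag_count.py", "./input/test_warc.txt", "output_file_name", tuple_key_schema=True)` corresponds to the second command above.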