https://github.com/commoncrawl/ml-opt-out-experiments

A series of experiments into ML opt–out protocols

Host: GitHub
URL: https://github.com/commoncrawl/ml-opt-out-experiments
Owner: commoncrawl
Created: 2024-02-22T22:45:57.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-02-22T22:54:11.000Z (over 2 years ago)
Last Synced: 2025-06-12T08:49:23.292Z (about 1 year ago)
Language: Python
Size: 2.93 KB
Stars: 5
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

## What is this?

PySpark jobs for investigating the prevalence of ML Opt–Out Protocols, written by [Alex Xue](https://commoncrawl.org/team/alex-xue) as part of the blog post [A Further Look Into the Prevalence of Various ML Opt–Out Protocols](https://commoncrawl.org/blog/a-further-look-into-the-prevalence-of-various-ml-opt-out-protocols).

### How Do I Run It?

Requires `sparkcc.py` from [commoncrawl/cc-pyspark](https://github.com/commoncrawl/cc-pyspark/blob/main/sparkcc.py).

Setup is the same as [cc-pyspark](https://github.com/commoncrawl/cc-pyspark). Make sure you have an `./input` directory.
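
If you need to create the `./input` directory and its input file by hand, the sketch below is one way to do it. It assumes the input file is a plain-text list of WARC locations, one per line, as in the cc-pyspark setup. The helper name `write_input_manifest` and the example path in the usage note are hypothetical, not part of this repository.

```python
from pathlib import Path


def write_input_manifest(warc_paths, manifest_path="input/test_warc.txt"):
    """Write one WARC location per line, creating the parent directory if needed.

    Assumption: the jobs read a text file with one WARC path per line.
    """
    manifest = Path(manifest_path)
    # Create ./input (or whatever the parent directory is) if it is missing.
    manifest.parent.mkdir(parents=True, exist_ok=True)
    manifest.write_text("".join(f"{path}\n" for path in warc_paths), encoding="utf-8")
    return manifest
```

For example, `write_input_manifest(["file:/path/to/example.warc.gz"])` would write `./input/test_warc.txt` with a single placeholder entry. Replace the placeholder with real WARC locations.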

To run a job, replace `job_name.py` with the job's script and `output_file_name` with the output name you want:

```
$SPARK_HOME/bin/spark-submit job_name.py \
--num_output_partitions 1 --log_level WARN \
./input/test_warc.txt output_file_name
```

To run `html_metatag_count.py`, which has a different output schema, also pass `--tuple_key_schema True`:

```
$SPARK_HOME/bin/spark-submit ./html_metatag_count.py \
--num_output_partitions 1 --log_level WARN --tuple_key_schema True \
./input/test_warc.txt output_file_name
```
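
If you want to launch several jobs from a script, the two commands above can be assembled programmatically. The following is a minimal sketch that uses only the flags shown in this README. The function name `build_spark_submit_command` and its parameters are hypothetical, and the `spark-submit` path assumes a POSIX-style `SPARK_HOME`.

```python
import os


def build_spark_submit_command(job_script, input_manifest, output_name,
                               tuple_key_schema=False, spark_home=None):
    """Return the spark-submit argument list for one job (hypothetical helper)."""
    # Fall back to the SPARK_HOME environment variable, then to the bare command.
    spark_home = spark_home or os.environ.get("SPARK_HOME", "")
    spark_submit = f"{spark_home}/bin/spark-submit" if spark_home else "spark-submit"
    command = [
        spark_submit, job_script,
        "--num_output_partitions", "1",
        "--log_level", "WARN",
    ]
    # Only html_metatag_count.py is documented as needing this flag.
    if tuple_key_schema:
        command += ["--tuple_key_schema", "True"]
    command += [input_manifest, output_name]
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)`. Building a list rather than a shell string avoids quoting problems in paths.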
