https://github.com/haunt98/gill-cc3m
https://github.com/haunt98/gill-cc3m
Last synced: 7 months ago
JSON representation
- Host: GitHub
- URL: https://github.com/haunt98/gill-cc3m
- Owner: haunt98
- Created: 2024-03-23T11:57:41.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-03-23T12:07:32.000Z (over 1 year ago)
- Last Synced: 2024-04-17T18:06:12.264Z (over 1 year ago)
- Language: Python
- Size: 1.95 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# gill-cc3m
## Prepare data
Download from
[ConceptualCaptions](https://ai.google.com/research/ConceptualCaptions/download):
- `raw/Train_GCC-training.tsv`
- `raw/Validation_GCC-1.1.0-Validation.tsv`
Add header:
```sh
mkdir -p header
cp raw/Train_GCC-training.tsv header/training_00.tsv
sed -i '' -e '1s/^/caption\turl\n/' header/training_00.tsv
cp raw/Validation_GCC-1.1.0-Validation.tsv header/validation_00.tsv
sed -i '' -e '1s/^/caption\turl\n/' header/validation_00.tsv
```
Small dataset to test first:
```sh
cp header/training_00.tsv header/training_01.tsv
sed -i '' -e '100,$d' header/training_01.tsv
cp header/validation_00.tsv header/validation_01.tsv
sed -i '' -e '10,$d' header/validation_01.tsv
```
## Download
Require `img2dataset`, `wandb`:
```sh
uv venv
uv pip install img2dataset wandb
source .venv/bin/activate
# Create an account first
wandb login
img2dataset \
--url_list header/training_01.tsv \
--input_format "tsv" \
--url_col "url" \
--caption_col "caption" \
--output_format webdataset \
--output_folder output/training_01 \
--processes_count 1 \
--thread_count 4 \
--image_size 256 \
--retries 1 \
--enable_wandb True \
--wandb_project "gill-cc3m"
img2dataset \
--url_list header/validation_01.tsv \
--input_format "tsv" \
--url_col "url" \
--caption_col "caption" \
--output_format webdataset \
--output_folder output/validation_01 \
--processes_count 1 \
--thread_count 4 \
--image_size 256 \
--retries 1 \
--enable_wandb True \
--wandb_project "gill-cc3m"
```
## Process
```sh
cp scripts/untar.sh output/training_01/
cd output/training_01
./untar.sh
cp scripts/untar.sh output/validation_01/
cd output/validation_01
./untar.sh
mkdir -p processed
python3 scripts/gen_train_val_tsv.py
```
## References
- https://github.com/kohjingyu/gill
- https://github.com/ray-ruisun/CC3M-Helper
- https://github.com/rom1504/img2dataset
- https://stackoverflow.com/a/9533736
- https://stackoverflow.com/a/4247319