https://github.com/harveyslash/ms-celeb-extractor


# ms-celeb-extractor
Extraction tool to parse MS Celeb dataset

The [MS Celeb Dataset](https://github.com/EB-Dodo/C-MS-Celeb) is a database of faces with 6,464,018
images.

Due to some error, the original [dataset is gone](https://github.com/EB-Dodo/C-MS-Celeb/issues/1).
However, a torrent is available [here](https://academictorrents.com/details/9e67eb7cc23c9417f39778a8e06cca5e26196a97/tech&hit=1&filelist=1).
It contains a TSV file with the images encoded as base64 strings.
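
As a rough illustration of what "images encoded as base64 strings" means in practice, the sketch below decodes one tab-separated row and writes the image into a per-person folder. The column positions (`label_col`, `image_col`), the file naming scheme, and the function name `save_row_as_jpeg` are assumptions made for this example; they are not taken from `extractor.py`, so check them against the actual TSV before relying on them.

```python
import base64
import tempfile
from pathlib import Path


def save_row_as_jpeg(row: str, out_dir: Path, label_col: int = 0,
                     image_col: int = -1) -> Path:
    """Decode one tab-separated row and write its image to out_dir/<label>/.

    label_col and image_col are assumed column positions, not confirmed ones.
    """
    fields = row.rstrip("\n").split("\t")
    label = fields[label_col]
    image_bytes = base64.b64decode(fields[image_col])

    person_dir = out_dir / label
    person_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming: number files by how many the folder already holds.
    target = person_dir / f"{len(list(person_dir.iterdir()))}.jpg"
    target.write_bytes(image_bytes)
    return target


# Self-check with made-up data (this is not a real MS Celeb row).
fake_jpeg = b"\xff\xd8\xff\xe0fake-jpeg-bytes"
fake_row = "m.0abc\t" + base64.b64encode(fake_jpeg).decode("ascii") + "\n"
demo_dir = Path(tempfile.mkdtemp())
saved = save_row_as_jpeg(fake_row, demo_dir)
print(saved.read_bytes() == fake_jpeg)
```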

This extraction tool reads through the TSV and places images of the same person in their respective folders.
As it reads through the TSV file, it deletes the entries it has already read, so extracting the images requires no extra disk space beyond what the TSV already occupies.
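
One way to delete entries while reading, without rewriting the whole file, is to consume the file from its end and truncate it after each chunk. The sketch below shows that idea only; whether `extractor.py` works this way is an assumption, and `consume_lines_from_end` is a hypothetical name.

```python
import os
import tempfile


def consume_lines_from_end(path: str, chunk_size: int = 1 << 20):
    """Yield complete lines starting from the end of the file.

    After the lines of a chunk have been handed out, the file is truncated
    so that those lines no longer occupy disk space.
    """
    with open(path, "r+b") as f:
        end = f.seek(0, os.SEEK_END)
        window = chunk_size
        while end > 0:
            start = max(0, end - window)
            f.seek(start)
            chunk = f.read(end - start)
            if start == 0:
                cut = 0  # the chunk covers the rest of the file
            else:
                newline = chunk.find(b"\n")
                cut = start + newline + 1
                if newline == -1 or cut >= end:
                    # No complete line inside this window yet: read more.
                    window *= 2
                    continue
            for line in chunk[cut - start:].split(b"\n"):
                if line:
                    yield line
            f.truncate(cut)  # drop the lines that were just yielded
            end = cut
            window = chunk_size


# Self-check on a small temporary file.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"row1\nrow2\nrow3\n")
seen = list(consume_lines_from_end(tmp.name, chunk_size=4))
remaining_size = os.path.getsize(tmp.name)
os.remove(tmp.name)
print(seen, remaining_size)
```

Truncating only after a chunk's lines have been consumed means an interrupted run leaves the unprocessed part of the file intact, at the cost of possibly re-reading the last chunk.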

The reasons for extracting into one folder per person are:

1. Most libraries have built-in helper functions to parse such a structure, including [pytorch](https://pytorch.org/docs/stable/torchvision/datasets.html#datasetfolder) and [keras/tensorflow](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image_dataset_from_directory).
2. Many modern file systems index their directory entries (for example with hashes or trees), so if the path of a file is known, locating it takes close to O(1) time.
3. Storing the images as the original JPEG files reduces the size from 95 GB to 57 GB.
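
To make the first point concrete, here is a minimal standard-library sketch of how a folder-per-person layout can be turned into `(path, label)` pairs. It imitates the general idea behind such loaders and is not their actual implementation; `index_image_folders` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def index_image_folders(root: Path):
    """Map each sub-folder of root to an integer label and list its JPEGs."""
    classes = sorted(p.name for p in root.iterdir() if p.is_dir())
    class_to_idx = {name: i for i, name in enumerate(classes)}
    samples = [
        (path, class_to_idx[name])
        for name in classes
        for path in sorted((root / name).glob("*.jpg"))
    ]
    return samples, class_to_idx


# Self-check with an empty-file layout mimicking root/person_x/xxx.jpg.
root = Path(tempfile.mkdtemp())
for person, files in {"person_x": ["a.jpg", "b.jpg"], "person_y": ["c.jpg"]}.items():
    (root / person).mkdir()
    for name in files:
        (root / person / name).touch()

samples, class_to_idx = index_image_folders(root)
print(class_to_idx)
print([(p.name, label) for p, label in samples])
```

With torchvision installed, `torchvision.datasets.ImageFolder(root)` should give a comparable result directly from the extracted directory.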

## Installing
`pip install -r requirements.txt`

## Usage

```
Usage: extractor.py [OPTIONS] COMMAND [ARGS]...

  Utility to help extract MS Celeb data into manageable files.

Options:
  --help  Show this message and exit.

Commands:
  combine  Combine clean_list_128Vec_WT051_P010.txt and...
  process  Read lines from the MS Celeb TSV file and save into a directory...
```

First, use the `combine` command to merge the two text files provided with the dataset. The reason for
combining them is explained in the section "How to use C-MS-Celeb" at
https://github.com/EB-Dodo/C-MS-Celeb. Note that the two txt files are not included in the torrent; they are in
https://github.com/EB-Dodo/C-MS-Celeb/blob/master/clean_list.7z

```
Usage: extractor.py combine [OPTIONS]

  Combine clean_list_128Vec_WT051_P010.txt and relabel_list_128Vec_T058.txt
  together.

  The output of this file is used by the process command.

Options:
  --clean_list_128_path FILENAME    Path of clean_list_128Vec_WT051_P010.txt
                                    [required]

  --relabel_list_128_path FILENAME  Path of relabel_list_128Vec_T058
                                    [required]

  --output_path FILE                Path of output file  [required]
  --help                            Show this message and exit.
```
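
For intuition, combining the two lists can be pictured as merging two label mappings into one file. The sketch below assumes each line holds a label and an image identifier separated by whitespace, and that entries from the relabel list take precedence. Both the line format and the output format are assumptions for illustration, not a description of what `extractor.py combine` actually writes.

```python
import tempfile
from pathlib import Path


def read_label_list(path: Path) -> dict:
    """Read 'label image_id' lines (assumed format) into image_id -> label."""
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or unexpected lines
            label, image_id = parts
            mapping[image_id] = label
    return mapping


def combine_lists(clean_path: Path, relabel_path: Path, output_path: Path) -> int:
    """Merge both lists into one file; relabelled entries override clean ones."""
    combined = read_label_list(clean_path)
    combined.update(read_label_list(relabel_path))
    with open(output_path, "w", encoding="utf-8") as out:
        for image_id, label in sorted(combined.items()):
            out.write(f"{label} {image_id}\n")
    return len(combined)


# Self-check with made-up identifiers.
tmp_dir = Path(tempfile.mkdtemp())
(tmp_dir / "clean.txt").write_text("m.aaa img-1\nm.aaa img-2\n", encoding="utf-8")
(tmp_dir / "relabel.txt").write_text("m.bbb img-2\nm.ccc img-3\n", encoding="utf-8")
count = combine_lists(tmp_dir / "clean.txt", tmp_dir / "relabel.txt", tmp_dir / "combined.txt")
combined_text = (tmp_dir / "combined.txt").read_text(encoding="utf-8")
print(count)
print(combined_text)
```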

Then pass the generated combined txt file to the `process` command to start extracting
lines from the TSV and saving them as JPEG files.

```
Usage: extractor.py process [OPTIONS]

  Read lines from the MS Celeb TSV file and save into a directory structure.
  The files will be put in this format:

      root/person_x/xxx.jpg
      root/person_x/xxy.jpg
      root/person_x/xxz.jpg

      root/person_y/123.jpg
      root/person_y/817.jpg
      root/person_y/some.jpg

  !NOTE!: As this command reads the TSV, it will delete the lines already
  read.

Options:
  --tsv_location FILENAME    Location of the entire MS Celeb tsv file
                             [required]

  --output_dir PATH          Output directory for images  [required]
  --combined_file_path FILE  Location of the file generated by combine command
                             [required]

  --chunk_size INTEGER       Number of bytes to read from the tsv at once
  --num_threads INTEGER
  --help                     Show this message and exit.
```

Example:

```bash
python ms-celeb-extractor/extractor.py process --tsv_location=head.tsv --output_dir out --combined_file_path combined.txt
89it [00:03, 23.58it/s]
```

## Contributing
Feel free to open issues or pull requests.