https://github.com/ryanrudes/wikimedia

A dataset comprised of over 40 million images sourced from Wikimedia Commons

Host: GitHub
URL: https://github.com/ryanrudes/wikimedia
Owner: ryanrudes
License: mit
Created: 2021-06-14T13:39:12.000Z (about 5 years ago)
Default Branch: master
Last Pushed: 2023-10-22T22:02:10.000Z (over 2 years ago)
Last Synced: 2025-09-02T03:55:26.446Z (10 months ago)
Topics: computer-vision, data-science, data-scraping, dataset, datasets, deep-learning, gans, image, images, machine-learning, wikimedia, wikimedia-commons
Language: Python
Homepage: https://www.kaggle.com/ryanrudes/wikimedia
Size: 14.6 KB
Stars: 6
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt

README

          

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/gist/Ryan-Rudes/ad5bc9481ffb268e1cacaf3808d395e5/wikimedia-dataset-demo.ipynb)

### Introduction

The Wikimedia Commons Image Dataset consists of over 40 million URLs of Wikimedia Commons images.

### Requirements

These are required only if you plan to run the data scraper yourself, which is not necessary for using the dataset:

- [`tqdm`](https://github.com/tqdm/tqdm)

- [`bs4`](https://github.com/waylan/beautifulsoup)

These are the requirements for the PyTorch `DataLoader`:

- [`torch`](https://github.com/pytorch/pytorch)

- [`scikit-image`](https://github.com/scikit-image/scikit-image)

- [`pillow`](https://github.com/python-pillow/Pillow)

- [`opencv-python`](https://github.com/opencv/opencv)

- [`torchvision`](https://github.com/pytorch/vision)

### Data

The data is stored in a compressed text format, with one URL per line.

URLs to Wikimedia Commons images take one of two forms:

`https://upload.wikimedia.org/wikipedia/commons/thumb/<a>/<b>/<filename>`

or

`https://upload.wikimedia.org/wikipedia/commons/<a>/<b>/<filename>` (no `/thumb/`)

`<a>` is 1 character in length, and `<b>` is 2.
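As a minimal sketch of the URL structure just described, the following splits a URL into its optional `/thumb/` component, the 1-character and 2-character path components, and the rest of the path. The function name, the group names, and the example file name are illustrative assumptions and are not taken from the repository.

```python
import re

# Pattern for the two URL shapes described above. The group names
# (thumb, a, b, name) are labels chosen for this sketch only.
COMMONS_URL = re.compile(
    r"^https://upload\.wikimedia\.org/wikipedia/commons/"
    r"(?P<thumb>thumb/)?"   # optional /thumb/ component
    r"(?P<a>[^/])/"         # 1-character path component
    r"(?P<b>[^/]{2})/"      # 2-character path component
    r"(?P<name>.+)$"        # remainder of the path
)


def split_commons_url(url):
    """Return (has_thumb, a, b, name); raise ValueError if the URL does not match."""
    match = COMMONS_URL.match(url)
    if match is None:
        raise ValueError(f"not a recognised Commons image URL: {url!r}")
    return (match["thumb"] is not None, match["a"], match["b"], match["name"])


# Hypothetical example URL, used only to exercise the function.
print(split_commons_url(
    "https://upload.wikimedia.org/wikipedia/commons/thumb/a/ab/Example.jpg"
))
# (True, 'a', 'ab', 'Example.jpg')
```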

Each URL is stored in a compressed form. The stored record includes a binary integer (`0` or `1`) indicating whether `/thumb/` is a component of the path.
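To illustrate the idea of replacing the fixed URL prefix with a one-digit `/thumb/` flag, here is a round-trip sketch. The record layout used here (flag, then the 1-character component, then the 2-character component, then the rest of the path) is an assumption made for illustration and may not match the layout of the dataset file; `encode` and `decode` are hypothetical helper names.

```python
PREFIX = "https://upload.wikimedia.org/wikipedia/commons/"


def encode(url):
    """Pack a URL as flag + a + b + name (ASSUMED layout, for illustration only)."""
    if not url.startswith(PREFIX):
        raise ValueError(f"unexpected URL: {url!r}")
    path = url[len(PREFIX):]
    has_thumb = path.startswith("thumb/")
    if has_thumb:
        path = path[len("thumb/"):]
    a, b, name = path.split("/", 2)  # 1-char part, 2-char part, rest of path
    if len(a) != 1 or len(b) != 2:
        raise ValueError(f"unexpected path components in {url!r}")
    return f"{int(has_thumb)}{a}{b}{name}"


def decode(record):
    """Rebuild the full URL from a record produced by encode()."""
    flag, a, b, name = record[0], record[1], record[2:4], record[4:]
    middle = "thumb/" if flag == "1" else ""
    return f"{PREFIX}{middle}{a}/{b}/{name}"


# Hypothetical example URL.
url = PREFIX + "thumb/a/ab/Example.jpg"
record = encode(url)
print(record)                 # 1aabExample.jpg
print(decode(record) == url)  # True
```

The fixed-width fields make decoding a matter of slicing, which is why the sketch stores the flag and both short components at the start of the record.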

There are 41666578 URLs in total, equating to 4.73 GB.
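With this many lines, it is usually safer to stream the file than to read it into memory at once. The sketch below iterates over a newline-delimited file lazily and counts its records; `iter_records` is a hypothetical helper, UTF-8 encoding is an assumption, and the demo runs on a tiny temporary stand-in file rather than the real dataset.

```python
import os
import tempfile


def iter_records(path):
    """Yield one non-empty line at a time without loading the whole file."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                yield line


# Tiny stand-in file with three records and one blank line.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("first\nsecond\n\nthird\n")

record_count = sum(1 for _ in iter_records(tmp.name))
print(record_count)  # 3
os.remove(tmp.name)
```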

### Usage

Loaders (checked items are included):

- [x] PyTorch `Dataset` and `DataLoader`

- [ ] TensorFlow `Dataset`

#### PyTorch `DataLoader` Usage

To demo the PyTorch `DataLoader`, first `cd` to the main directory. Then, download the dataset:

```
kaggle datasets download -d ryanrudes/wikimedia --unzip
```

Then, run the script:

```
python loaders/pytorch.py
```

To use the dataset in your own code, import the `WikimediaCommonsLoader` class, for example:

```python
from loaders.pytorch import WikimediaCommonsLoader

loader = WikimediaCommonsLoader()

# Each iteration yields one batch of images as a tensor.
for batch in loader:
    print(batch.shape)

# Output:
# torch.Size([32, 3, 256, 256])
# torch.Size([32, 3, 256, 256])
# torch.Size([32, 3, 256, 256])
# torch.Size([32, 3, 256, 256])
# torch.Size([32, 3, 256, 256])
# ...
```
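Because a loop over the loader keeps producing batches, it can be handy to cap the number of batches when experimenting. The sketch below does this for any iterable; `first_batches` and `fake_loader` are hypothetical names, and the stand-in loader yields strings instead of tensors so the example runs without the dataset.

```python
from itertools import islice


def first_batches(loader, limit):
    """Yield at most `limit` batches from any iterable loader."""
    yield from islice(loader, limit)


def fake_loader():
    """Stand-in for a real loader: yields placeholder batches forever."""
    index = 0
    while True:
        yield f"batch-{index}"
        index += 1


for batch in first_batches(fake_loader(), 3):
    print(batch)
# batch-0
# batch-1
# batch-2
```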

You can modify the following arguments of `WikimediaCommonsLoader`. Their default values are given below:

```python
path        = 'filtered.txt'
verbose     = True
max_retries = None
timeout     = None
shuffle     = True
max_buffer  = 4096
workers     = 8
transform   = None
batch_size  = 32
resize_to   = 512
crop_to     = 256
```

Or, you can use the backbone `WikimediaCommonsDataset` class, which returns the raw images, one by one, without applying any transformations, whereas `WikimediaCommonsLoader` performs resizing and random cropping:

```python
from loaders.pytorch import WikimediaCommonsDataset

dataset = WikimediaCommonsDataset()

# Each iteration yields one raw image; sizes vary from image to image.
for image in dataset:
    print(image.shape)

# Output:
# (120, 100, 3)
# (80, 120, 3)
# (120, 80, 3)
# (98, 120, 3)
# (120, 97, 3)
# (120, 120, 3)
# ...
```

The `WikimediaCommonsDataset` class takes almost the same arguments as `WikimediaCommonsLoader`, excluding `batch_size`, `resize_to`, and `crop_to`.
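The resizing and random cropping attributed to `WikimediaCommonsLoader` can be pictured with a little geometry. The sketch below assumes that the shorter side is scaled to `resize_to` with the aspect ratio preserved, and that a square `crop_to` window is then chosen at a random offset; the loader's actual behaviour may differ, and both function names are hypothetical.

```python
import random


def resized_size(width, height, resize_to):
    """Scale so the shorter side equals `resize_to`, keeping the aspect ratio."""
    scale = resize_to / min(width, height)
    return (
        max(resize_to, round(width * scale)),
        max(resize_to, round(height * scale)),
    )


def random_crop_box(width, height, crop_to, rng=random):
    """Pick a random crop_to x crop_to window as (left, top, right, bottom)."""
    if crop_to > width or crop_to > height:
        raise ValueError("crop is larger than the image")
    left = rng.randint(0, width - crop_to)
    top = rng.randint(0, height - crop_to)
    return left, top, left + crop_to, top + crop_to


# A 120 x 100 image with the default resize_to and crop_to values.
width, height = resized_size(120, 100, resize_to=512)
print(width, height)  # 614 512

box = random_crop_box(width, height, crop_to=256, rng=random.Random(0))
print(box[2] - box[0], box[3] - box[1])  # 256 256
```

Whatever the input size, the crop window is always `crop_to` on each side, which is consistent with every batch having the same spatial dimensions.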

### Links and Further Info

The dataset is available on [Kaggle](https://www.kaggle.com/ryanrudes/wikimedia).

This project is licensed under the MIT license. Click [here](https://github.com/Ryan-Rudes/wikimedia/blob/master/LICENSE.txt) to learn more. All image links in this dataset are in the public domain. The only exception would be links to Wikimedia Foundation logos, which were filtered out before the data was uploaded.
