https://github.com/daisvke/arachnida

A suite of web scrapers and metadata editors designed for efficient web and image data processing.
https://github.com/daisvke/arachnida
exif-data-extraction exif-editor image-scraper metadata-extraction python python-scraper scraping-websites
Last synced: 8 months ago
JSON representation
A suite of web scrapers and metadata editors designed for efficient web and image data processing.
Host: GitHub
URL: https://github.com/daisvke/arachnida
Owner: daisvke
Created: 2024-10-13T17:43:19.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-03-01T16:17:07.000Z (over 1 year ago)
Last Synced: 2025-10-22T10:56:01.413Z (8 months ago)
Topics: exif-data-extraction, exif-editor, image-scraper, metadata-extraction, python, python-scraper, scraping-websites
Language: Python
Homepage:
Size: 379 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project

README

          # Arachnida

A suite of web scrapers and metadata editors designed for efficient web and image data processing:

- **`Harvestmen`**: A tool for searching and extracting strings from web pages.

- **`Spider`**: A scraper for finding images or specific strings within HTML image tags.

- **`Scorpion`**: A utility for viewing metadata from image files and searching strings in them.

- **`Scorpion Viewer`**: A more advanced tool for displaying, deleting, and modifying metadata in image files.

## Testing

You can run basic tests with:

```sh

# Harvestmen (find a string on a webpage)

./tests.sh -h

# Spider & Scorpion (find images on a webpage and open the image folder and the metadata editor once done)

./tests.sh -s

```

---

## Harvestmen (strings)

This module implements a web scraper that recursively searches for a specified string within the content of a given base URL and all reachable links from that URL. The script utilizes the `requests` library to fetch web pages and `BeautifulSoup` from the `bs4` library to parse HTML content. 

---

### Key Features:

- **Recursive Scraping**: The scraper navigates through all links found on the base URL and continues to scrape linked pages unless restricted by the user.

- **Search Functionality**: It checks for the presence of a user-defined search string in the text of each page, with an option for case-insensitive searching.

- **Visited URL Tracking**: The script maintains a list of visited URLs to avoid processing the same page multiple times.

- **Skip Limit**: Users can set a limit on the number of skipped links (either due to already visited pages or bad links) before the scraper terminates.

- **Command-Line Interface**: The script accepts command-line arguments for the base URL, search string, case sensitivity, single-page mode, and skip limit.

---

### Usage:

```

usage: harvestmen.py [-h] -s SEARCH_STRING [-i] [-r] [-l RECURSE_DEPTH] [-k KO_LIMIT] link

This program will search the given string on the provided link and on every link that can be reached from that link, recursively.

positional arguments:

  link                  the name of the base URL to access

options:

  -h, --help            show this help message and exit

  -s SEARCH_STRING, --search-string SEARCH_STRING

                        the string to search

  -i, --case-insensitive

                        Enable case-insensitive mode

  -r, --recursive       Enable recursive search mode

  -l RECURSE_DEPTH, --recurse-depth RECURSE_DEPTH

                        indicates the maximum depth level of the recursive download. If not indicated, it will be 5. (-r/--recursive has to be activated).

  -k KO_LIMIT, --ko-limit KO_LIMIT

                        Number of already visited/bad links that are allowed before we terminate the search. This is to ensure that we don't get stuck into a

                        loop.

  -v, --verbose         Enable verbose mode.

  -S, --sleep           Enable sleep between HTTP requests to mimic a human-like behavior

  -t MAX_SLEEP, --max-sleep MAX_SLEEP

                        Maximum duration of the random sleeps between HTTP requests. If not indicated, it will be 3. (-s/--search-string has to be activated).

```

---

## Spider (images, strings in image tags and filenames)

This module implements a web image scraper that recursively searches for images on a specified base URL and downloads them to a designated folder. 

---

### Key Features:

- **Image downloading**: The scraper identifies and downloads images from the base URL and any linked pages, saving them to a specified local directory. If no directory is specified, it defaults to `./data/`.

- **Search functionality**: Users can specify a search string to filter images based on their alt text/filename. The scraper supports both case-sensitive and case-insensitive modes.

- **Recursive scraping**: The script can perform recursive scraping through all links found on the base URL, with an option to set a maximum depth level for the recursion (default is 5).

- **Visited URL tracking**: It maintains a list of visited URLs to avoid processing the same page multiple times, with a configurable limit on the number of already visited or bad links allowed before termination (KO limit).

- **Open image folder option**: Users have the option to automatically open the image folder at the end of the program for easy access to downloaded images.

- **Memory limit**: Set a memory limit for downloaded images to a specified value in MB, with a default of 1000MB.

---

### Usage:

```

// Display help

python spider.py -h

usage: spider.py [-h] [-s SEARCH_STRING] [-p IMAGE_PATH] [-i] [-r]

                 [-l RECURSE_DEPTH] [-k KO_LIMIT] [-o]

                 link

This program will search the given string on the provided link and on every link

that can be reached from that link, recursively.

positional arguments:

  link                  the name of the base URL to access

options:

  -h, --help            show this help message and exit

  -s SEARCH_STRING, --search-string SEARCH_STRING

                        If not empty enables the string search mode: only images which 'alt' attribute contains the search string are saved

  -p IMAGE_PATH, --image-path IMAGE_PATH

                        indicates the path where the downloaded files will be saved. If not specified, ./data/ will be used.

  -i, --case-insensitive

                        Enable case-insensitive mode

  -r, --recursive       Enable recursive search mode

  -l RECURSE_DEPTH, --recurse-depth RECURSE_DEPTH

                        indicates the maximum depth level of the recursive download. If not indicated, it will be 5.

  -k KO_LIMIT, --ko-limit KO_LIMIT

                        Number of already visited/bad links that are allowed before we terminate the search. This is to ensure that we don't get stuck into a

                        loop.

  -o, --open            Open the image folder at the end of the program.

  -m MEMORY, --memory MEMORY

                        Set a limit to the memory occupied by the dowloaded images (in MB). Default is set to 1000MB.

  -v, --verbose         Enable verbose mode.

  -S, --sleep           Enable sleep between HTTP requests to mimic a human-like behavior

  -t MAX_SLEEP, --max-sleep MAX_SLEEP

                        Maximum duration of the random sleeps between HTTP requests. If not indicated, it will be 3. (-s/--search-string has to be activated).

// Ex. to scrap with a depth of 1 with a search string "42" with the open folder option on :

python3 spider.py "https://42.fr/le-campus-de-paris/diplome-informatique/expert-en-architecture-informatique" -r -l 1 -s "42" -o

```

---

## Scorpion (image file metadata)

### Description

This is the CLI for Scorpion. This program receives image files as parameters and parses them for EXIF and other metadata, displaying the information on the terminal.


It displays basic attributes such as the creation date, as well as EXIF, or PNG data.

---

### Usage

```

usage: scorpion.py [-h] [-f [FILE ...]] [-d [DIR ...]] [-v] [-s SEARCH_STRING] [-i]

Extract EXIF data and other data from image files.

options:

  -h, --help            show this help message and exit

  -f [FILE ...], --files [FILE ...]

                        one or more image files to process

  -d [DIR ...], --directory [DIR ...]

                        one or more folders containing image files to process

  -v, --verbose         Enable verbose mode.

  -s SEARCH_STRING, --search-string SEARCH_STRING

                        the string to search

  -i, --case-insensitive

                        Enable case-insensitive mode

```

---

## Scorpion Viewer

### Description

* This is the GUI for Scorpion. This program let us delete and modify some of the metadata from the image files.


* It can also search a specific string the the metadata.

* It uses `Tkinter` for the GUI and `Treeview` widget to present metadata in a structured, tabular format.

---

## Notes

### Mimicking human-like behavior

#### 1. **With User-Agent**

While attempting to download an image from Wikipedia, we encountered a failure due to the default User-Agent used by the `requests` library. The error message received was as follows:

![User-Agent Failure](screenshots/user-agent-failure.png)

To investigate further, we checked the default headers used by the `requests` library in Python:

```python

requests.utils.default_headers()

```

The output was:

```json

{

    'User-Agent': 'python-requests/2.31.0',

    'Accept-Encoding': 'gzip, deflate',

    'Accept': '*/*',

    'Connection': 'keep-alive'

}

```

As we can see, the default User-Agent was not compliant with Wikipedia's User-Agent policy. To resolve this issue, we added a custom User-Agent header to our requests:

```python

HEADER = {

    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'

}

```

By using this custom User-Agent, we can successfully download images from Wikipedia without encountering the 403 Forbidden error:

![Valid User-Agent](screenshots/user-agent-success.png)

#### 2. **With delay**

When building a web scraper, it's important to consider how our requests may be perceived by the target website. Rapid, consecutive requests can trigger anti-bot measures and may lead to your IP being blocked. To mitigate this risk and make our scraper behave more like a human user, we can implement delays between our HTTP requests.

#### **Why use delays?**

- Mimics human behavior: humans do not browse the web at lightning speed. By introducing random delays, our scraper can simulate more natural browsing patterns.

- Reduces server load: spacing out requests helps to reduce the load on the target server, which is considerate and can prevent our scraper from being flagged as abusive.

- Avoids rate limiting: many websites have rate limits in place. By pacing our requests, we can stay within these limits and avoid being temporarily or permanently banned.

---

#### **Implementation**

To further mimic human behavior, we used a randomized delay. This can be done using the `random` module, which can be utilized in conjunction with the `sleep` function:

```py

from time import sleep

from random import randint

def sleep_for_random_secs(min_sec: int = 1, max_sec: int = 3) -> None:

	"""

	Sleep for a random duration to mimic a human visiting a webpage.

	"""

	

	# Generate a random sleeping duration in seconds,

	# from a range = [min, max]

	sleeping_secs = randint(min_sec, max_sec)

	sleep(sleeping_secs)

```

---

### Understanding `robots.txt`

The `robots.txt` file is a standard used by websites to communicate with web crawlers and bots about which parts of the site should not be accessed or indexed.


According to the [official website](http://www.robotstxt.org/faq/legal.html) of the `robots.txt` standard, "There is no law stating that /robots.txt must be obeyed, nor does it constitute a binding contract between site owner and user." This means that while it is a widely accepted practice to respect the directives in a `robots.txt` file, there are no legal repercussions for ignoring it.

--- 

### Modifying extensions

* Sometimes, certain websites do not recognize your ID photo image file because they expect a 'PNG' extension instead of 'JPEG'. Simply changing the file extension manually may not be sufficient.

* We discovered that modifying the `Image.format` attribute using the `Pillow library (PIL)` effectively allows the file to be recognized with the desired extension and successfully passes the checks.

### Exif labels

* EXIF metadata uses numerical identifiers (integers) to represent specific tags, but these integers are not human-readable. To work effectively with EXIF data, you need a way to map these numerical codes to their corresponding tag names and descriptions. 

* We got the Exif Tags from: exiv2.org.

The original tags are in `standard_exif_tags.txt`.

Only the needed columns are stored in `exif_labels.py`.

```sh

# Make a dictionary from the data on the website in `exif_labels.py`

./generate_exif_labels_dict.sh

```

### Time related metadata

#### Creation Time 

Here’s a refined version of the README section for clarity and readability:

---

### Challenges Faced

1. **Linux Limitation on `Creation Time`**:

   - Linux does not natively support or store a `Creation Time` attribute in the same way as Windows. This limitation prevents direct modification of the `Creation Time` metadata on Linux systems.

2. **Behavior of `Image.save()` Method**:

   - The `Image.save()` method in Python creates a new image file and deletes the original one during the save process. As a result, all time-related metadata (`Creation Time`, `Access Time`, and `Modification Time`) are updated to reflect the time of the save operation, unintentionally overwriting the original timestamps.

3. **Attempted Workaround**:

   - To address the issue, we attempted a workaround where the `Image.save()` operation was performed on a temporary file. The temporary file was then copied to the destination path. However, even with this approach, the destination file's `Access Time` and `Modification Time` were updated because the file system treats the copy operation as an access and modification event.

```python

def save_image_without_time_update(img, file_path, info):

    with NamedTemporaryFile(delete=False) as temp_file:

        temp_path = temp_file.name + ".png"

        img.save(temp_path, exif=info)

    # Copy the temporary file to the original path

    shutil.copyfile(temp_path, file_path)

    # Remove the temporary file

    os.remove(temp_path)

save_image_without_time_update(img, file_path, exif_data)  

```

4. **Successful Modification of `Modification Time`**:

   - The only case where the result reflected our intent was when modifying the `Modification Time`. After the file was created (and its `Modification Time` was unintentionally updated), we explicitly updated the `Modification Time` value, effectively erasing the unintentional update and applying the desired value.

---

### Common Image (PIL) Methods

```

    img.show():

        This method displays the image using the default image viewer on your system.

    img.save(fp, format=None, **params):

        This method saves the image to a file. You can specify the file path and format (if different from the original).

    img.resize(size):

        This method resizes the image to the specified size (a tuple of width and height) and returns a new image object.

    img.convert(mode):

        This method converts the image to a different color mode (e.g., from 'RGB' to 'L' for grayscale) and returns a new image object.

    img.thumbnail(size: tuple[float, float]):

        This method modifies the image to contain a thumbnail version of itself, no larger than the given size. This method calculates an appropriate thumbnail size to preserve the aspect of the image, calls the draft() method to configure the file reader (where applicable), and finally resizes the image.

```

More [here](https://pillow.readthedocs.io/en/stable/reference/Image.html).

---

### Documentation

* [Pillow Doc on handled Image File Formats](https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html)

* [Examples of JPG files with EXIF data](https://github.com/ianare/exif-samples)
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/daisvke/arachnida

Awesome Lists containing this project

README