{"id":25812590,"url":"https://github.com/daisvke/arachnida","last_synced_at":"2025-10-22T10:56:01.964Z","repository":{"id":261985439,"uuid":"872068850","full_name":"daisvke/Arachnida","owner":"daisvke","description":"A suite of web scrapers and metadata editors designed for efficient web and image data processing.","archived":false,"fork":false,"pushed_at":"2025-03-01T16:17:07.000Z","size":388,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-22T10:56:01.413Z","etag":null,"topics":["exif-data-extraction","exif-editor","image-scraper","metadata-extraction","python","python-scraper","scraping-websites"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/daisvke.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-13T17:43:19.000Z","updated_at":"2025-03-01T16:17:10.000Z","dependencies_parsed_at":"2025-02-28T02:12:46.082Z","dependency_job_id":"94799a44-7fc0-4228-8e6d-d624a2878c13","html_url":"https://github.com/daisvke/Arachnida","commit_stats":null,"previous_names":["daisvke/python_web_scrappers","daisvke/arachnida","daisvke/python_web_scrapers"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/daisvke/Arachnida","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daisvke%2FArachnida","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daisvke%2FArachnida/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositori
es/daisvke%2FArachnida/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daisvke%2FArachnida/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/daisvke","download_url":"https://codeload.github.com/daisvke/Arachnida/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/daisvke%2FArachnida/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":280424213,"owners_count":26328462,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-22T02:00:06.515Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["exif-data-extraction","exif-editor","image-scraper","metadata-extraction","python","python-scraper","scraping-websites"],"created_at":"2025-02-28T01:54:34.575Z","updated_at":"2025-10-22T10:56:01.945Z","avatar_url":"https://github.com/daisvke.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Arachnida\nA suite of web scrapers and metadata editors designed for efficient web and image data processing:\n\n- **`Harvestmen`**: A tool for searching and extracting strings from web pages.\n- **`Spider`**: A scraper for finding images or specific strings within HTML image tags.\n- **`Scorpion`**: A utility for viewing metadata from image files and searching strings in them.\n- **`Scorpion Viewer`**: A more 
advanced tool for displaying, deleting, and modifying metadata in image files.\n\n## Testing\n\nYou can run basic tests with:\n```sh\n# Harvestmen (find a string on a webpage)\n./tests.sh -h\n\n# Spider \u0026 Scorpion (find images on a webpage and open the image folder and the metadata editor once done)\n./tests.sh -s\n```\n\n---\n## Harvestmen (strings)\n\nThis module implements a web scraper that recursively searches for a specified string within the content of a given base URL and all reachable links from that URL. The script utilizes the `requests` library to fetch web pages and `BeautifulSoup` from the `bs4` library to parse HTML content. \n\n---\n### Key Features:\n- **Recursive Scraping**: The scraper navigates through all links found on the base URL and continues to scrape linked pages unless restricted by the user.\n- **Search Functionality**: It checks for the presence of a user-defined search string in the text of each page, with an option for case-insensitive searching.\n- **Visited URL Tracking**: The script maintains a list of visited URLs to avoid processing the same page multiple times.\n- **Skip Limit**: Users can set a limit on the number of skipped links (either due to already visited pages or bad links) before the scraper terminates.\n- **Command-Line Interface**: The script accepts command-line arguments for the base URL, search string, case sensitivity, single-page mode, and skip limit.\n---\n### Usage:\n```\nusage: harvestmen.py [-h] -s SEARCH_STRING [-i] [-r] [-l RECURSE_DEPTH] [-k KO_LIMIT] link\n\nThis program will search the given string on the provided link and on every link that can be reached from that link, recursively.\n\npositional arguments:\n  link                  the name of the base URL to access\n\noptions:\n  -h, --help            show this help message and exit\n  -s SEARCH_STRING, --search-string SEARCH_STRING\n                        the string to search\n  -i, --case-insensitive\n                        Enable 
case-insensitive mode\n  -r, --recursive       Enable recursive search mode\n  -l RECURSE_DEPTH, --recurse-depth RECURSE_DEPTH\n                        indicates the maximum depth level of the recursive download. If not indicated, it will be 5. (-r/--recursive has to be activated).\n  -k KO_LIMIT, --ko-limit KO_LIMIT\n                        Number of already visited/bad links that are allowed before we terminate the search. This is to ensure that we don't get stuck into a\n                        loop.\n  -v, --verbose         Enable verbose mode.\n  -S, --sleep           Enable sleep between HTTP requests to mimic a human-like behavior\n  -t MAX_SLEEP, --max-sleep MAX_SLEEP\n                        Maximum duration of the random sleeps between HTTP requests. If not indicated, it will be 3. (-s/--search-string has to be activated).\n```\n---\n## Spider (images, strings in image tags and filenames)\n\nThis module implements a web image scraper that recursively searches for images on a specified base URL and downloads them to a designated folder. \n\n---\n### Key Features:\n\n- **Image downloading**: The scraper identifies and downloads images from the base URL and any linked pages, saving them to a specified local directory. If no directory is specified, it defaults to `./data/`.\n- **Search functionality**: Users can specify a search string to filter images based on their alt text/filename. 
The scraper supports both case-sensitive and case-insensitive modes.\n- **Recursive scraping**: The script can perform recursive scraping through all links found on the base URL, with an option to set a maximum depth level for the recursion (default is 5).\n- **Visited URL tracking**: It maintains a list of visited URLs to avoid processing the same page multiple times, with a configurable limit on the number of already visited or bad links allowed before termination (KO limit).\n- **Open image folder option**: Users have the option to automatically open the image folder at the end of the program for easy access to downloaded images.\n- **Memory limit**: Set a memory limit for downloaded images to a specified value in MB, with a default of 1000MB.\n---\n### Usage:\n```\n// Display help\npython spider.py -h\n\nusage: spider.py [-h] [-s SEARCH_STRING] [-p IMAGE_PATH] [-i] [-r]\n                 [-l RECURSE_DEPTH] [-k KO_LIMIT] [-o]\n                 link\n\nThis program will search the given string on the provided link and on every link\nthat can be reached from that link, recursively.\n\npositional arguments:\n  link                  the name of the base URL to access\n\noptions:\n  -h, --help            show this help message and exit\n  -s SEARCH_STRING, --search-string SEARCH_STRING\n                        If not empty enables the string search mode: only images which 'alt' attribute contains the search string are saved\n  -p IMAGE_PATH, --image-path IMAGE_PATH\n                        indicates the path where the downloaded files will be saved. If not specified, ./data/ will be used.\n  -i, --case-insensitive\n                        Enable case-insensitive mode\n  -r, --recursive       Enable recursive search mode\n  -l RECURSE_DEPTH, --recurse-depth RECURSE_DEPTH\n                        indicates the maximum depth level of the recursive download. 
If not indicated, it will be 5.\n  -k KO_LIMIT, --ko-limit KO_LIMIT\n                        Number of already visited/bad links that are allowed before we terminate the search. This is to ensure that we don't get stuck into a\n                        loop.\n  -o, --open            Open the image folder at the end of the program.\n  -m MEMORY, --memory MEMORY\n                        Set a limit to the memory occupied by the dowloaded images (in MB). Default is set to 1000MB.\n  -v, --verbose         Enable verbose mode.\n  -S, --sleep           Enable sleep between HTTP requests to mimic a human-like behavior\n  -t MAX_SLEEP, --max-sleep MAX_SLEEP\n                        Maximum duration of the random sleeps between HTTP requests. If not indicated, it will be 3. (-s/--search-string has to be activated).\n\n// Ex. to scrape with a depth of 1, a search string \"42\", and the open folder option on:\npython3 spider.py \"https://42.fr/le-campus-de-paris/diplome-informatique/expert-en-architecture-informatique\" -r -l 1 -s \"42\" -o\n```\n---\n\n## Scorpion (image file metadata)\n\n### Description\nThis is the CLI for Scorpion. 
This program receives image files as parameters and parses them for EXIF and other metadata, displaying the information on the terminal.\u003cbr /\u003e\nIt displays basic attributes such as the creation date, as well as EXIF or PNG data.\n\n---\n### Usage\n\n```\nusage: scorpion.py [-h] [-f [FILE ...]] [-d [DIR ...]] [-v] [-s SEARCH_STRING] [-i]\n\nExtract EXIF data and other data from image files.\n\noptions:\n  -h, --help            show this help message and exit\n  -f [FILE ...], --files [FILE ...]\n                        one or more image files to process\n  -d [DIR ...], --directory [DIR ...]\n                        one or more folders containing image files to process\n  -v, --verbose         Enable verbose mode.\n  -s SEARCH_STRING, --search-string SEARCH_STRING\n                        the string to search\n  -i, --case-insensitive\n                        Enable case-insensitive mode\n```\n---\n\n## Scorpion Viewer\n\n### Description\n* This is the GUI for Scorpion. This program lets us delete and modify some of the metadata from the image files.\u003cbr /\u003e\n* It can also search for a specific string in the metadata.\n* It uses `Tkinter` for the GUI and the `Treeview` widget to present metadata in a structured, tabular format.\n\n---\n## Notes\n\n### Mimicking human-like behavior\n#### 1. **With User-Agent**\nWhile attempting to download an image from Wikipedia, we encountered a failure due to the default User-Agent used by the `requests` library. 
The error message received was as follows:\n\n![User-Agent Failure](screenshots/user-agent-failure.png)\n\nTo investigate further, we checked the default headers used by the `requests` library in Python:\n\n```python\nrequests.utils.default_headers()\n```\n\nThe output was:\n\n```json\n{\n    'User-Agent': 'python-requests/2.31.0',\n    'Accept-Encoding': 'gzip, deflate',\n    'Accept': '*/*',\n    'Connection': 'keep-alive'\n}\n```\n\nAs we can see, the default User-Agent was not compliant with Wikipedia's User-Agent policy. To resolve this issue, we added a custom User-Agent header to our requests:\n\n```python\nHEADER = {\n    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'\n}\n```\n\nBy using this custom User-Agent, we can successfully download images from Wikipedia without encountering the 403 Forbidden error:\n\n![Valid User-Agent](screenshots/user-agent-success.png)\n\n#### 2. **With delay**\nWhen building a web scraper, it's important to consider how our requests may be perceived by the target website. Rapid, consecutive requests can trigger anti-bot measures and may lead to your IP being blocked. To mitigate this risk and make our scraper behave more like a human user, we can implement delays between our HTTP requests.\n\n#### **Why use delays?**\n- Mimics human behavior: humans do not browse the web at lightning speed. By introducing random delays, our scraper can simulate more natural browsing patterns.\n- Reduces server load: spacing out requests helps to reduce the load on the target server, which is considerate and can prevent our scraper from being flagged as abusive.\n- Avoids rate limiting: many websites have rate limits in place. By pacing our requests, we can stay within these limits and avoid being temporarily or permanently banned.\n---\n#### **Implementation**\nTo further mimic human behavior, we used a randomized delay. 
This can be done using the `random` module, which can be utilized in conjunction with the `sleep` function:\n\n```py\nfrom time import sleep\nfrom random import randint\n\ndef sleep_for_random_secs(min_sec: int = 1, max_sec: int = 3) -\u003e None:\n\t\"\"\"\n\tSleep for a random duration to mimic a human visiting a webpage.\n\t\"\"\"\n\t\n\t# Generate a random sleeping duration in seconds,\n\t# from a range = [min, max]\n\tsleeping_secs = randint(min_sec, max_sec)\n\tsleep(sleeping_secs)\n```\n---\n### Understanding `robots.txt`\n\nThe `robots.txt` file is a standard used by websites to communicate with web crawlers and bots about which parts of the site should not be accessed or indexed.\u003cbr /\u003e\nAccording to the [official website](http://www.robotstxt.org/faq/legal.html) of the `robots.txt` standard, \"There is no law stating that /robots.txt must be obeyed, nor does it constitute a binding contract between site owner and user.\" This means that while it is a widely accepted practice to respect the directives in a `robots.txt` file, there are no legal repercussions for ignoring it.\n\n--- \n\n### Modifying extensions\n* Sometimes, certain websites do not recognize your ID photo image file because they expect a 'PNG' extension instead of 'JPEG'. Simply changing the file extension manually may not be sufficient.\n\n* We discovered that modifying the `Image.format` attribute using the `Pillow library (PIL)` effectively allows the file to be recognized with the desired extension and successfully passes the checks.\n\n### Exif labels\n* EXIF metadata uses numerical identifiers (integers) to represent specific tags, but these integers are not human-readable. To work effectively with EXIF data, you need a way to map these numerical codes to their corresponding tag names and descriptions. 
\n\n* We got the Exif Tags from: \u003ca href=\"https://exiv2.org/tags.html\"\u003eexiv2.org\u003c/a\u003e.\nThe original tags are in `standard_exif_tags.txt`.\nOnly the needed columns are stored in `exif_labels.py`.\n\n```sh\n# Make a dictionary from the data on the website in `exif_labels.py`\n./generate_exif_labels_dict.sh\n```\n\n### Time related metadata\n#### Creation Time \n\n### Challenges Faced\n\n1. **Linux Limitation on `Creation Time`**:\n   - Linux does not natively support or store a `Creation Time` attribute in the same way as Windows. This limitation prevents direct modification of the `Creation Time` metadata on Linux systems.\n\n2. **Behavior of `Image.save()` Method**:\n   - The `Image.save()` method in Python creates a new image file and deletes the original one during the save process. As a result, all time-related metadata (`Creation Time`, `Access Time`, and `Modification Time`) are updated to reflect the time of the save operation, unintentionally overwriting the original timestamps.\n\n3. **Attempted Workaround**:\n   - To address the issue, we attempted a workaround where the `Image.save()` operation was performed on a temporary file. The temporary file was then copied to the destination path. However, even with this approach, the destination file's `Access Time` and `Modification Time` were updated because the file system treats the copy operation as an access and modification event.\n\n```python\nimport os\nimport shutil\nfrom tempfile import NamedTemporaryFile\n\ndef save_image_without_time_update(img, file_path, info):\n    # Reserve a temporary .png path; the file is closed when the block exits\n    with NamedTemporaryFile(suffix=\".png\", delete=False) as temp_file:\n        temp_path = temp_file.name\n\n    # Save the image (with its EXIF data) to the temporary path\n    img.save(temp_path, exif=info)\n\n    # Copy the temporary file to the original path\n    shutil.copyfile(temp_path, file_path)\n\n    # Remove the temporary file\n    os.remove(temp_path)\n\nsave_image_without_time_update(img, file_path, exif_data)  \n```\n4. 
**Successful Modification of `Modification Time`**:\n   - The only case where the result reflected our intent was when modifying the `Modification Time`. After the file was created (and its `Modification Time` was unintentionally updated), we explicitly updated the `Modification Time` value, effectively erasing the unintentional update and applying the desired value.\n\n---\n### Common Image (PIL) Methods\n```\n    img.show():\n        This method displays the image using the default image viewer on your system.\n\n    img.save(fp, format=None, **params):\n        This method saves the image to a file. You can specify the file path and format (if different from the original).\n\n    img.resize(size):\n        This method resizes the image to the specified size (a tuple of width and height) and returns a new image object.\n\n    img.convert(mode):\n        This method converts the image to a different color mode (e.g., from 'RGB' to 'L' for grayscale) and returns a new image object.\n\n    img.thumbnail(size: tuple[float, float]):\n        This method modifies the image to contain a thumbnail version of itself, no larger than the given size. This method calculates an appropriate thumbnail size to preserve the aspect of the image, calls the draft() method to configure the file reader (where applicable), and finally resizes the image.\n```\nMore [here](https://pillow.readthedocs.io/en/stable/reference/Image.html).\n\n---\n\n### Documentation\n* [Pillow Doc on handled Image File Formats](https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html)\n* [Examples of JPG files with EXIF data](https://github.com/ianare/exif-samples)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdaisvke%2Farachnida","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdaisvke%2Farachnida","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdaisvke%2Farachnida/lists"}