Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

Projects in Awesome Lists tagged with concurrent-crawling

A curated list of projects in awesome lists tagged with concurrent-crawling .

https://github.com/simonpierreboucher/crawler

A robust, modular web crawler built in Python for extracting and saving content from websites. This crawler is specifically designed to extract text content from both HTML and PDF files, saving them in a structured format with metadata.

concurrent-crawling content-extraction data-collection data-extraction-pipeline data-preservation-and-recovery data-scraping error-handling html-parsing http-requests metadata-storage modular-design pdf-text-extraction python-crawler rate-limiting structured-data-storage text-processing url-normalization web-crawling yaml-configuration

Last synced: 12 Dec 2024