{"id":26384594,"url":"https://github.com/mlibre/clean-web-scraper","last_synced_at":"2025-03-17T07:29:47.464Z","repository":{"id":271792823,"uuid":"914579243","full_name":"mlibre/Clean-Web-Scraper","owner":"mlibre","description":"A Node.js web scraper that extracts clean, readable content from websites - perfect for AI/LLM training datasets. Features smart crawling, Mozilla Readability integration, and organized content storage 🤖","archived":false,"fork":false,"pushed_at":"2025-03-14T21:23:25.000Z","size":170,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-14T22:26:22.551Z","etag":null,"topics":["ai","artificial-intelligence","clean","crawler","data-preprocessing","dataset","fine-tuning","llm","recursive-crawling","scraper","training"],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mlibre.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-09T21:52:16.000Z","updated_at":"2025-03-14T21:23:29.000Z","dependencies_parsed_at":"2025-01-30T09:23:47.532Z","dependency_job_id":"1a8f76a8-31d8-4575-855a-56aa48b215f0","html_url":"https://github.com/mlibre/Clean-Web-Scraper","commit_stats":null,"previous_names":["mlibre/clean-web-scraper"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mlibre%2FClean-Web-Scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mlibre%2FClean-Web-Scraper/tags","
releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mlibre%2FClean-Web-Scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mlibre%2FClean-Web-Scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mlibre","download_url":"https://codeload.github.com/mlibre/Clean-Web-Scraper/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243991794,"owners_count":20380053,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","artificial-intelligence","clean","crawler","data-preprocessing","dataset","fine-tuning","llm","recursive-crawling","scraper","training"],"created_at":"2025-03-17T07:29:47.080Z","updated_at":"2025-03-17T07:29:47.458Z","avatar_url":"https://github.com/mlibre.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Web Content Scraper\n\nA powerful Node.js web scraper that extracts clean, readable content from websites while keeping everything nicely organized. Perfect for creating AI training datasets! 
🤖\n\n## ✨ Features\n\n- 🌐 Smart web crawling of internal links\n- 🔄 Smart retry mechanism with proxy fallback\n- 📝 Clean content extraction using Mozilla's Readability\n- 🧹 Smart content processing and cleaning\n- 🗂️ Maintains original URL structure in saved files\n- 🚫 Excludes unwanted paths from scraping\n- 🚦 Configurable rate limiting and delays\n- 🤖 AI-friendly output formats (JSONL, CSV, clean text)\n- 📊 Rich metadata extraction\n- 📁 Combine results from multiple scrapers into a unified dataset\n\n## 🛠️ Prerequisites\n\n- Node.js (v18 or higher)\n- npm\n\n## 📦 Dependencies\n\n- **axios** - HTTP requests master\n- **jsdom** - DOM parsing wizard\n- **@mozilla/readability** - Content extraction genius\n\n## 🚀 Installation\n\n```bash\nnpm i clean-web-scraper\n\n# OR\n\ngit clone https://github.com/mlibre/Clean-Web-Scraper\ncd Clean-Web-Scraper\nsudo pacman -S extra/xorg-server-xvfb chromium\nnpm install\n\n# Skip chromium download during npm installation\n# npm i --ignore-scripts\n```\n\n## 💻 Usage\n\n```js\nconst WebScraper = require('clean-web-scraper');\n\nconst scraper = new WebScraper({\n  baseURL: 'https://example.com/news',          // Required: The website base url to scrape\n  startURL: 'https://example.com/blog',         // Optional: Custom starting URL\n  excludeList: ['/admin', '/private'],          // Optional: Paths to exclude\n  exactExcludeList: ['/specific-page',          // Optional: Exact URLs to exclude \n  /^https:\\/\\/host\\.com\\/\\d{4}\\/$/],            // Optional: Regex patterns to exclude. 
This will exclude URLs like https://host.com/2023/\n  scrapResultPath: './example.com/website',     // Required: Where to save the content\n  jsonlOutputPath: './example.com/train.jsonl', // Optional: Custom JSONL output path\n  textOutputPath: \"./example.com/texts\",        // Optional: Custom text output path\n  csvOutputPath: \"./example.com/train.csv\",     // Optional: Custom CSV output path\n  strictBaseURL: true,                          // Optional: Only scrape URLs from same domain\n  maxDepth: Infinity,                           // Optional: Maximum crawling depth\n  maxArticles: Infinity,                        // Optional: Maximum articles to scrape\n  crawlingDelay: 1000,                          // Optional: Delay between requests (ms)\n  batchSize: 5,                                 // Optional: Number of URLs to process concurrently\n\n  // Network options\n  axiosHeaders: {},                             // Optional: Custom HTTP headers\n  axiosProxy: {                                 // Optional: HTTP/HTTPS proxy\n   host: \"localhost\",\n   port: 2080,\n   protocol: \"http\"\n  },              \n  axiosMaxRetries: 5,                           // Optional: Max retry attempts\n  axiosRetryDelay: 40000,                       // Optional: Delay between retries (ms)\n  useProxyAsFallback: false,                    // Optional: Fallback to proxy on failure\n  \n  // Puppeteer options for handling dynamic content\n  usePuppeteer: false,                          // Optional: Enable Puppeteer browser\n});\nawait scraper.start();\n```\n\n## 💻 Advanced Usage: Multi-Site Scraping\n\n```js\nconst WebScraper = require('clean-web-scraper');\n\n// Scrape documentation website\nconst docsScraper = new WebScraper({\n  baseURL: 'https://docs.example.com',\n  scrapResultPath: './datasets/docs',\n  maxDepth: 3,                               // Optional: Maximum depth for recursive crawling\n  includeMetadata: true,                     // Optional: Include metadata 
in output files\n  metadataFields: [\"author\", \"articleTitle\", \"pageTitle\", \"description\", \"dateScrapedDate\"],\n   // Optional: Specify metadata fields to include\n});\n\n// Scrape blog website\nconst blogScraper = new WebScraper({\n  baseURL: 'https://blog.example.com',\n  scrapResultPath: './datasets/blog',\n  maxDepth: 3,                               // Optional: Maximum depth for recursive crawling\n  includeMetadata: true,                     // Optional: Include metadata in output files\n  metadataFields: [\"author\", \"articleTitle\", \"pageTitle\", \"description\", \"dateScrapedDate\"],\n   // Optional: Specify metadata fields to include\n});\n\n// Start scraping both sites\nawait docsScraper.start();\nawait blogScraper.start();\n\n// Combine all scraped content into a single dataset\nawait WebScraper.combineResults('./combined', [docsScraper, blogScraper]);\n```\n\n```bash\n# 8 GB RAM\nnode --max-old-space-size=8192 example-usage.js\n```\n\n## 📤 Output\n\nYour AI-ready content is saved in a clean, structured format:\n\n- 📁 Base folder: `./folderPath/example.com/`\n- 📑 Files preserve original URL paths\n- 🤖 No HTML, no noise - just clean, structured text (`.txt` files)\n- 📊 `JSONL` and `CSV` outputs, ready for AI consumption, model training and fine-tuning\n\n```bash\nexample.com/\n├── website/\n│   ├── page1.txt         # Clean text content\n│   ├── page1.json        # Full metadata\n│   └── blog/\n│       ├── post1.txt\n│       └── post1.json\n├── texts/                # Numbered text files\n│   ├── 1.txt\n│   └── 2.txt\n├── texts_with_metadata/  # When includeMetadata is true\n│   ├── 1.txt\n│   └── 2.txt\n├── train.jsonl           # Combined content\n├── train_with_metadata.jsonl  # When includeMetadata is true\n├── train.csv             # Clean text in CSV format\n└── train_with_metadata.csv    # When includeMetadata is true\n\ncombined/\n├── texts/                # Combined numbered text files\n│   ├── 1.txt\n│   ├── 2.txt\n│   └── 
n.txt\n├── texts_with_metadata/  # Combined metadata text files\n│   ├── 1.txt\n│   ├── 2.txt\n│   └── n.txt\n├── combined.jsonl        # Combined JSONL content\n├── combined_with_metadata.jsonl\n├── combined.csv         # Combined CSV content\n└── combined_with_metadata.csv\n```\n\n## 📄 Output File Formats\n\n### 📝 Text Files (*.txt)\n\n```text\nThe actual article content starts here. This is the clean, processed text of the article that was extracted from the webpage\n```\n\n### 📑 Text Files with Metadata (texts_with_metadata/*.txt)\n\n```text\narticleTitle: Palestine history\ndescription: This is a great article about Palestine history\nauthor: John Doe\nlanguage: en\ndateScraped: 2024-01-20T10:30:00Z\n\n---\n\nThe actual article content starts here. This is the clean, processed text of the article that was extracted from the webpage.\n```\n\n### 📊 JSONL Files (train.jsonl)\n\n```json\n{\"text\": \"Clean article content here\"}\n{\"text\": \"Another article content here\"}\n```\n\n### 📈 JSONL with Metadata (train_with_metadata.jsonl)\n\n```json\n{\"text\": \"Article content\", \"metadata\": {\"articleTitle\": \"Page Title\", \"author\": \"John Doe\"}}\n{\"text\": \"Another article\", \"metadata\": {\"articleTitle\": \"Second Page\", \"author\": \"Jane Smith\"}}\n```\n\n### 🗃️ JSON Files In Website Output  (*.json)\n\n```json\n{\n  \"url\": \"\u003chttps://example.com/page\u003e\",\n  \"title\": \"Page Title\",\n  \"description\": \"Page description\",\n  \"dateScraped\": \"2024-01-20T10:30:00Z\"\n}\n```\n\n### 📋 CSV Files (train.csv)\n\n```csv\ntext\n\"Clean article content here\"\n\"Another article content here\"\n```\n\n### 📊 CSV with Metadata (train_with_metadata.csv)\n\n```csv\ntext,articleTitle,author,description\n\"Article content\",\"Page Title\",\"John Doe\",\"Page description\"\n\"Another article\",\"Second Page\",\"Jane Smith\",\"Another description\"\n```\n\n## Standing with Palestine 🇵🇸\n\nThis project supports Palestinian rights and stands in 
solidarity with Palestine. We believe in the importance of documenting and preserving Palestinian narratives, history, and struggles for justice and liberation.\n\nFree Palestine 🇵🇸\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmlibre%2Fclean-web-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmlibre%2Fclean-web-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmlibre%2Fclean-web-scraper/lists"}