{"id":29422117,"url":"https://github.com/houssamouhra/server-url-extractor","last_synced_at":"2026-04-14T04:31:51.594Z","repository":{"id":303959035,"uuid":"1003211881","full_name":"houssamouhra/server-url-extractor","owner":"houssamouhra","description":"Server URL Extractor \u0026 Validator","archived":false,"fork":false,"pushed_at":"2025-08-26T21:19:29.000Z","size":314,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-08-27T02:44:02.625Z","etag":null,"topics":["e2e-testing","json","node","nodejs","playwright","scraping-server","testing","testing-automation","typescript"],"latest_commit_sha":null,"homepage":"","language":"Vue","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/houssamouhra.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-06-16T19:51:39.000Z","updated_at":"2025-08-26T21:19:32.000Z","dependencies_parsed_at":"2025-07-10T18:11:39.227Z","dependency_job_id":"d7396f55-77fd-4186-9d78-3e0e8e47ba55","html_url":"https://github.com/houssamouhra/server-url-extractor","commit_stats":null,"previous_names":["houssamouhra/server-url-extractor"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/houssamouhra/server-url-extractor","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/houssamouhra%2Fserver-url-extractor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/houssamouhra%2Fserver-url-extractor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub
/repositories/houssamouhra%2Fserver-url-extractor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/houssamouhra%2Fserver-url-extractor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/houssamouhra","download_url":"https://codeload.github.com/houssamouhra/server-url-extractor/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/houssamouhra%2Fserver-url-extractor/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31782736,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-14T02:24:21.117Z","status":"ssl_error","status_checked_at":"2026-04-14T02:24:20.627Z","response_time":153,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["e2e-testing","json","node","nodejs","playwright","scraping-server","testing","testing-automation","typescript"],"created_at":"2025-07-12T04:02:01.740Z","updated_at":"2026-04-14T04:31:51.582Z","avatar_url":"https://github.com/houssamouhra.png","language":"Vue","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n  \n# Server URL Extractor \u0026 Validator\n*Fast, reliable extraction \u0026 validation of URLs from dynamic pages — using curl-first with Playwright fallback.* \n\n[![SQLite 
branch](https://img.shields.io/badge/branch-SQLite-3399FF)](https://github.com/houssamouhra/server-url-extractor/tree/sqlite-version) \u0026nbsp;\n![Node.js](https://img.shields.io/badge/Node.js-339933?logo=node.js\u0026logoColor=white) \u0026nbsp;\n![TypeScript](https://img.shields.io/badge/typescript-3178c6?logo=typescript\u0026logoColor=white) \u0026nbsp;\n![curl](https://img.shields.io/badge/curl-005C9C?logo=curl\u0026logoColor=white) \u0026nbsp;\n![MIT License](https://img.shields.io/badge/license-MIT-green)\n\n\u003c/div\u003e\n\n\n\u003c!-- prettier-ignore-start --\u003e\n\u003e [!IMPORTANT]  \n\u003e This project was built as part of real-world work experience for a company.\n\u003c!-- prettier-ignore-end --\u003e\n\nDesigned for:\n\n- Efficient batch processing of server-side URL drops\n- Smart duplicate prevention\n- DNS-aware validation\n\nBuilt with resilience and scale in mind — perfect for processing large datasets without reprocessing the same work twice.\n\n\u003c!-- prettier-ignore-start --\u003e\n\u003e [!TIP]  \n\u003e A [SQLite-based version](../../tree/sqlite-version) is available in a dedicated branch for lightweight, persistent storage.\n\u003c!-- prettier-ignore-end --\u003e\n\n## 📚 Table of Contents\n\n- [🔧 Features](#-features)\n- [✅ Core Tasks Done](#-core-tasks-done)\n- [🚀 Usage](#-usage)\n- [📄 License](#-license)\n\n\n## 🔧 Features\n\n- **Automated navigation** across multiple `/md/xxxxx.html` pages, decrementing through URLs.\n- **Dual source extraction** from `\u003ctextarea\u003e` placeholders and valid anchor `\u003ca href=\"\"\u003e` tags within each drop.\n- **Robust regex filters** to exclude placeholder and anchor patterns, targeting only real URLs with allowed TLDs and excluding false positives.\n- **Smart skipping** logic:\n  - Skips scraping if a dropId is already present in `dropLinks.json`\n  - Skips validation if a batchId is fully present in `validatedLinks.json`\n- **Batch-based processing** saves links incrementally as 
`dropId_drop_N` batches to control memory and improve clarity.\n- **Duplicate-free batching**: avoids saving the same link twice within a batch.\n- **Status validation:**\n  - Uses `curl` for fast, lightweight URL status checking\n  - Automatically falls back to `Playwright` for rich browser-level checks if curl fails or gives uncertain output.\n- **Redirection detection** compares normalized final URLs to identify real redirects and capture redirected_url.\n- **DNS error detection** classifies failures like `ENOTFOUND`, `EAI_AGAIN`, and treats them distinctly with zero status.\n- **Secure credential injection** using `.env` variables for login automation\n- **Memory usage tracking** logs RAM snapshots after every 10 placeholder tabs processed.\n- **Detailed console logging** helps monitor:\n  - URL extraction steps\n  - Status checks\n  - Validation decisions (curl vs playwright)\n  - Skip reasons and timing\n- **Structured JSON output:**\n  - Scraped links → `data/dropLinks.json`\n  - Validated links → `data/validatedLinks.json`\n  - Grouped by `batchId`, each link contains:\n    - `original`: source URL\n    - `status`: HTTP status code\n    - `redirection`: true/false\n    - `redirected_url`: final URL if redirection happened\n    - `included`: boolean match for known target IDs\n    - `method`: `\"curl\"` or `\"playwright\"`\n    - `error`: if present (e.g. `\"DNS could not be resolved\"`)\n\n## ✅ Core Tasks Done\n\n### 1. Link Extraction\n\n- Extracted URLs from placeholders in textareas with regex, including `http(s)`, `www`, and protocol-relative URLs (`//...`).\n- Built a helper to extract both placeholder links and real anchor `\u003ca href=\"\"\u003e` links per drop.\n\n### 2. Duplicate Handling\n\n- Used `Set` logic to avoid duplicate URLs within each drop batch.\n- Skipped already saved drops (`dropLinks.json`) and already validated batches (`validatedLinks.json`) to prevent reprocessing.\n\n### 3. 
Batch Accumulation \u0026 Saving\n\n- Grouped links into drop-specific batches: `dropId_drop_N`.\n- Merged links from placeholders and anchors into a single batch.\n- Saved batches incrementally to JSON to avoid memory overflow.\n\n### 4. Navigation \u0026 Validation Loop\n\n- Decremented through `/md/{id}.html` pages in a loop using Playwright automation.\n- Validated extracted links using `curl` for speed.\n- Automatically fell back to Playwright for browser-level validation if curl failed or gave ambiguous results.\n- Captured and stored HTTP status, redirection info, final URL, and method used.\n\n### 5. Inclusion Mapping (Optional Analysis)\n\n- Compared resolved URLs against a predefined list of numeric target IDs.\n- Marked each validated link with `included: true/false` depending on match.\n- Enables later filtering and analysis based on external reference lists.\n\n### 6. Regex Improvements\n\n- Refined regex patterns to allow a wide variety of real URLs while filtering out false positives like `contact.first_name}}`.\n- Added support for extended TLDs and shorteners (`.me`, `.li`, `.in`, `.moe`, etc.).\n\n### 7. Memory Management \u0026 Debugging\n\n- Logged memory usage every 10 tabs to track performance.\n- Introduced async timeouts and batch size limits to keep Playwright stable during heavy runs.\n\n### 8. Environment Handling\n\n- Introduced `.env` config for secure credentials (`SERVER_EMAIL`, `SERVER_PASSWORD`).\n- Included `.env.example` for team usage without exposing secrets.\n- Uses `.env` credentials in Playwright login tests with strict TypeScript handling.\n\n## 🚀 Usage\n\nThis section covers everything you need to **set up, run the server, and execute the web scrape** for the JSON branch.\n\n\n### 1. Install dependencies\n\n```bash\nnpm install\n```\n\n### 2. Set up your environment\n\n```bash\ncp .env.example .env\n```\n\nThen define your credentials:\n\n```ini\nSERVER_EMAIL=your@email.com\nSERVER_PASSWORD=yourPassword\n```\n\n### 3. 
Prepare storage\n\nCopy the example JSON files before running:\n\n```bash\ncp data/dropLinks.example.json data/dropLinks.json\ncp data/validatedLinks.example.json data/validatedLinks.json\n```\n\n### 4. Run the web scrape\n\n```bash\n.\\run-tests.bat\n```\n\n- Executes the full web scraping and URL validation workflow\n- Saves results to the JSON files: `dropLinks.json` and `validatedLinks.json`\n\n\u003e ⚠️ Server runs on `localhost:3000` by default. Example endpoint: `/api/validated-links`\n\n## 📄 License\n\nThis project is licensed under the [MIT License](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhoussamouhra%2Fserver-url-extractor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhoussamouhra%2Fserver-url-extractor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhoussamouhra%2Fserver-url-extractor/lists"}