https://github.com/houssamouhra/server-url-extractor
Server URL Extractor & Validator
e2e-testing json node nodejs playwright scraping-server testing testing-automation typescript
- Host: GitHub
- URL: https://github.com/houssamouhra/server-url-extractor
- Owner: houssamouhra
- License: mit
- Created: 2025-06-16T19:51:39.000Z (about 1 year ago)
- Default Branch: master
- Last Pushed: 2025-08-26T21:19:29.000Z (10 months ago)
- Last Synced: 2025-08-27T02:44:02.625Z (10 months ago)
- Topics: e2e-testing, json, node, nodejs, playwright, scraping-server, testing, testing-automation, typescript
- Language: Vue
- Homepage:
- Size: 307 KB
- Stars: 2
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Server URL Extractor & Validator
*Fast, reliable extraction & validation of URLs from dynamic pages, using a curl-first approach with a Playwright fallback.*

> [!IMPORTANT]
> This project was built as part of real-world work experience for a company.

Designed for:
- Efficient batch processing of server-side URL drops
- Smart duplicate prevention
- DNS-aware validation
Built with resilience and scale in mind, so large datasets can be processed without repeating work that has already been done.
> [!TIP]
> A [SQLite-based version](../../tree/sqlite-version) is available in a dedicated branch for lightweight, persistent storage.
## 📚 Table of Contents
- [🔧 Features](#-features)
- [✅ Core Tasks Done](#-core-tasks-done)
- [🚀 Usage](#-usage)
- [📄 License](#-license)
## 🔧 Features
## ✅ Core Tasks Done
### 1. Link Extraction
- Extracted URLs from placeholders in textareas with regex, including `http(s)`, `www`, and protocol-relative URLs (`//...`).
- Built a helper to extract both placeholder links and real anchor (`<a>`) links per drop.
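
As a rough illustration of the extraction step, here is a minimal sketch. The function names (`extractUrls`, `extractAnchorHrefs`) and both patterns are hypothetical and much simpler than whatever the project actually uses; they only show the idea of matching `http(s)`, `www`, and protocol-relative URLs in free text and collecting `href` values from anchors.

```typescript
// Hypothetical helper: finds http(s)://, protocol-relative (//...) and www.
// URLs that start at the beginning of the text or after whitespace, a quote,
// an opening parenthesis, or ">".
export function extractUrls(text: string): string[] {
  const pattern = /(^|[\s"'(>])((?:https?:\/\/|\/\/|www\.)[^\s"'<>{}]+)/gi;
  const urls: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    // Drop trailing punctuation that usually belongs to the sentence, not the URL.
    urls.push(match[2].replace(/[.,;:!?)\]]+$/, ""));
  }
  return urls;
}

// Hypothetical helper: collects href values from <a> tags in an HTML string.
// A regex is enough for a sketch; a real page would normally be queried
// through the DOM instead.
export function extractAnchorHrefs(html: string): string[] {
  const pattern = /<a\b[^>]*?\bhref\s*=\s*["']([^"']+)["']/gi;
  const hrefs: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    hrefs.push(match[1]);
  }
  return hrefs;
}
```

Requiring a delimiter before the URL keeps the `//` in `https://` from being picked up a second time as a protocol-relative link.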
### 2. Duplicate Handling
- Used `Set` logic to avoid duplicate URLs within each drop batch.
- Skipped already saved drops (`dropLinks.json`) and already validated batches (`validatedLinks.json`) to prevent reprocessing.
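
The two ideas above can be sketched as follows. This assumes, purely for illustration, that `dropLinks.json` and `validatedLinks.json` are objects keyed by batch ID; the real file layout may differ, and `uniqueLinks` and `pendingBatchIds` are hypothetical names.

```typescript
// Assumed shape of the saved JSON files: an object keyed by batch ID.
type SavedBatches = Record<string, unknown>;

// Removes duplicate URLs within one batch. A Set keeps insertion order,
// so the first occurrence of each link wins.
export function uniqueLinks(links: string[]): string[] {
  return Array.from(new Set(links));
}

// Returns only the batch IDs that are not already present in a saved file,
// so finished work is not processed again.
export function pendingBatchIds(candidateIds: string[], alreadyDone: SavedBatches): string[] {
  return candidateIds.filter(
    (id) => !Object.prototype.hasOwnProperty.call(alreadyDone, id),
  );
}
```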
### 3. Batch Accumulation & Saving
- Grouped links into drop-specific batches: `dropId_drop_N`.
- Merged links from placeholders and anchors into a single batch.
- Saved batches incrementally to JSON to avoid memory overflow.
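
A minimal sketch of incremental saving, under two assumptions that are not confirmed by this README: that `dropId_drop_N` means the drop ID followed by a running batch number, and that the JSON file maps each batch key to an array of links. `batchKey` and `saveBatch` are hypothetical names.

```typescript
import * as fs from "node:fs";

// Assumed file layout: { "<dropId>_drop_<N>": ["https://...", ...] }
type BatchStore = Record<string, string[]>;

// Builds a batch key in the assumed "<dropId>_drop_<N>" form.
export function batchKey(dropId: number, index: number): string {
  return `${dropId}_drop_${index}`;
}

// Merges placeholder and anchor links into one de-duplicated batch and writes
// it to disk straight away, so completed batches do not pile up in memory.
export function saveBatch(
  filePath: string,
  key: string,
  placeholderLinks: string[],
  anchorLinks: string[],
): BatchStore {
  const store: BatchStore = fs.existsSync(filePath)
    ? (JSON.parse(fs.readFileSync(filePath, "utf8")) as BatchStore)
    : {};
  store[key] = Array.from(new Set(placeholderLinks.concat(anchorLinks)));
  fs.writeFileSync(filePath, JSON.stringify(store, null, 2));
  return store;
}
```

Re-reading the whole file on every save is simple but gets slower as the file grows, which is one reason a database-backed variant can be attractive for long runs.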
### 4. Navigation & Validation Loop
- Walked through `/md/{id}.html` pages in a loop, decrementing the ID on each step, using Playwright automation.
- Validated extracted links using `curl` for speed.
- Automatically fell back to Playwright for browser-level validation if curl failed or gave ambiguous results.
- Captured and stored HTTP status, redirection info, final URL, and method used.
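
The curl-first strategy could look roughly like the sketch below. Everything here is an assumption for illustration: the result shape, the curl flags chosen, and in particular the rule for what counts as "ambiguous" (no response, 403, 429, or 5xx). The browser fallback is passed in as a function so the sketch stays independent of Playwright; in practice it might wrap `page.goto(url)` and read the response status.

```typescript
import { execFile } from "node:child_process";
import { devNull } from "node:os";

export interface ValidationResult {
  url: string;
  status: number; // 0 means no HTTP response was obtained
  finalUrl: string;
  redirected: boolean;
  method: "curl" | "playwright";
}

export type Checker = (url: string) => Promise<ValidationResult>;

// Parses "<http_code> <num_redirects> <url_effective>" as written by --write-out below.
export function parseCurlOutput(url: string, output: string): ValidationResult {
  const [code = "0", redirects = "0", ...rest] = output.trim().split(" ");
  return {
    url,
    status: Number(code) || 0,
    finalUrl: rest.join(" ") || url,
    redirected: Number(redirects) > 0,
    method: "curl",
  };
}

// Runs curl silently, follows redirects, discards the body, and reports
// status, redirect count, and final URL. Errors are folded into status 0.
export const curlCheck: Checker = (url) =>
  new Promise((resolve) => {
    const args = [
      "-s", "-L", "-o", devNull, "--max-time", "15",
      "-w", "%{http_code} %{num_redirects} %{url_effective}", url,
    ];
    execFile("curl", args, (_error, stdout) => resolve(parseCurlOutput(url, String(stdout))));
  });

// Tries the fast checker first and only pays for the slow one when the
// first answer is missing or ambiguous (assumed rule, see lead-in).
export async function validateLink(url: string, primary: Checker, fallback: Checker): Promise<ValidationResult> {
  const first = await primary(url);
  const ambiguous =
    first.status === 0 || first.status === 403 || first.status === 429 || first.status >= 500;
  return ambiguous ? fallback(url) : first;
}
```

A call such as `validateLink(url, curlCheck, browserCheck)` would then return a record carrying the status, redirect information, final URL, and the method that produced it, where `browserCheck` is whatever Playwright-based checker the caller supplies.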
### 5. Inclusion Mapping (Optional Analysis)
- Compared resolved URLs against a predefined list of numeric target IDs.
- Marked each validated link with `included: true/false` depending on match.
- Enabled later filtering and analysis based on external reference lists.
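
One way to picture the inclusion flag is shown below. This README does not say how a resolved URL is matched against the target IDs, so the sketch simply assumes a link counts as included when any run of digits in its final URL equals one of the IDs. `markIncluded` is a hypothetical name.

```typescript
// Adds an `included` flag to a validated link.
// Assumption: a target ID "matches" when it appears as a complete run of
// digits somewhere in the final URL (path, query string, or host).
export function markIncluded<T extends { finalUrl: string }>(
  link: T,
  targetIds: ReadonlySet<string>,
): T & { included: boolean } {
  const numericTokens: string[] = link.finalUrl.match(/\d+/g) ?? [];
  const included = numericTokens.some((token) => targetIds.has(token));
  return { ...link, included };
}
```

Matching on whole digit runs rather than substrings means an ID such as `123` does not accidentally match a URL containing `12345`.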
### 6. Regex Improvements
- Refined regex patterns to allow a wide variety of real URLs while filtering out false positives like `contact.first_name}}`.
- Added support for extended TLDs and shorteners (`.me`, `.li`, `.in`, `.moe`, etc.).
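
To show the kind of filtering described above, here is a small sketch of a plausibility check applied to each regex candidate. The name `isPlausibleUrl` and both patterns are hypothetical; the project's own patterns are not shown in this README. Any alphabetic TLD of two or more letters is accepted, which covers `.me`, `.li`, `.in`, and `.moe` without a hard-coded list.

```typescript
// Leftover template syntax such as "}}" marks a candidate as a false positive.
const TEMPLATE_DEBRIS = /[{}]/;

// Optional scheme, one or more dot-separated labels, an alphabetic TLD, and
// then either the end of the string or a URL delimiter (: / ? #).
const HOST_WITH_TLD = /^(?:https?:\/\/|\/\/)?(?:[a-z0-9-]+\.)+[a-z]{2,}(?:[:\/?#]|$)/i;

// Returns true when a candidate looks like a real URL rather than a
// template fragment or a dotted identifier.
export function isPlausibleUrl(candidate: string): boolean {
  return !TEMPLATE_DEBRIS.test(candidate) && HOST_WITH_TLD.test(candidate);
}
```

Because the TLD must be followed by a delimiter or the end of the string, a dotted identifier like `contact.first_name` is rejected even without the trailing braces.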
### 7. Memory Management & Debugging
- Logged memory usage every 10 tabs to track performance.
- Introduced async timeouts and batch size limits to keep Playwright stable during heavy runs.
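
The two safeguards above might be sketched like this. The helper names (`memoryReport`, `withTimeout`) and the log format are hypothetical; only `process.memoryUsage()` and the standard timer and Promise APIs are relied on.

```typescript
// Returns a log line on every `every`-th tab (10 by default) and null otherwise.
export function memoryReport(tabsProcessed: number, every = 10): string | null {
  if (tabsProcessed === 0 || tabsProcessed % every !== 0) return null;
  const { rss, heapUsed } = process.memoryUsage();
  const toMb = (bytes: number): string => (bytes / 1024 / 1024).toFixed(1);
  return `[memory] tabs=${tabsProcessed} rss=${toMb(rss)}MB heapUsed=${toMb(heapUsed)}MB`;
}

// Rejects if `task` has not settled within `ms` milliseconds. The timer is
// cleared either way so it cannot keep the process alive after the task ends.
export function withTimeout<T>(task: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms} ms`)), ms);
  });
  return Promise.race([task, timeout]).then(
    (value) => {
      clearTimeout(timer);
      return value;
    },
    (error) => {
      clearTimeout(timer);
      throw error;
    },
  );
}
```

Note that `withTimeout` only stops waiting; it does not cancel the underlying work, so a stuck browser tab would still need to be closed by the caller.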
### 8. Environment Handling
- Introduced `.env` config for secure credentials (`SERVER_EMAIL`, `SERVER_PASSWORD`).
- Included `.env.example` for team usage without exposing secrets.
- Used `.env` credentials in Playwright login tests with strict TypeScript handling.
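
"Strict TypeScript handling" of credentials can be illustrated with a small guard that narrows `string | undefined` to `string` and fails early when a variable is missing. The helper names are hypothetical, and the sketch assumes something else (for example the `dotenv` package) has already loaded `.env` into `process.env`.

```typescript
type Env = Record<string, string | undefined>;

// Returns the variable's value or throws, so callers never handle `undefined`.
export function requireEnv(name: string, env: Env = process.env): string {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Reads the two credentials named in this README.
export function loadCredentials(env: Env = process.env): { email: string; password: string } {
  return {
    email: requireEnv("SERVER_EMAIL", env),
    password: requireEnv("SERVER_PASSWORD", env),
  };
}
```

Passing the environment as a parameter keeps the function easy to test without touching the real process environment.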
## 🚀 Usage
This section covers everything you need to **set up, run the server, and execute the web scrape** for the JSON branch.
### 1. Install dependencies
```bash
npm install
```
### 2. Set up your environment
```bash
cp .env.example .env
```
Then define your credentials:
```ini
SERVER_EMAIL=your@email.com
SERVER_PASSWORD=yourPassword
```
### 3. Prepare storage
Copy the example JSON files before running:
```bash
cp data/dropLinks.example.json data/dropLinks.json
cp data/validatedLinks.example.json data/validatedLinks.json
```
### 4. Run the Web Scrape
```bat
.\run-tests.bat
```
- Executes the full web scraping and URL validation workflow
- Saves results to the JSON files: `dropLinks.json` and `validatedLinks.json`
> ⚠️ Server runs on `localhost:3000` by default. Example endpoint: `/api/validated-links`
## 📄 License
This project is licensed under the [MIT License](LICENSE).