https://github.com/houssamouhra/server-url-extractor

Server URL Extractor & Validator

Host: GitHub
URL: https://github.com/houssamouhra/server-url-extractor
Owner: houssamouhra
License: mit
Created: 2025-06-16T19:51:39.000Z (about 1 year ago)
Default Branch: master
Last Pushed: 2025-08-26T21:19:29.000Z (11 months ago)
Last Synced: 2025-08-27T02:44:02.625Z (11 months ago)
Topics: e2e-testing, json, node, nodejs, playwright, scraping-server, testing, testing-automation, typescript
Language: Vue
Homepage:
Size: 307 KB
Stars: 2
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# Server URL Extractor & Validator
*Fast, reliable extraction & validation of URLs from dynamic pages — using curl-first with Playwright fallback.*

[![SQLite branch](https://img.shields.io/badge/branch-SQLite-3399FF)](https://github.com/houssamouhra/server-url-extractor/tree/sqlite-version)
![Node.js](https://img.shields.io/badge/Node.js-339933?logo=node.js&logoColor=white)
![TypeScript](https://img.shields.io/badge/typescript-3178c6?logo=typescript&logoColor=white)
![curl](https://img.shields.io/badge/curl-005C9C?logo=curl&logoColor=white)
![MIT License](https://img.shields.io/badge/license-MIT-green)

> [!IMPORTANT]
> This project was built as part of real-world work experience for a company.

Designed for:

- Efficient batch processing of server-side URL drops
- Smart duplicate prevention
- DNS-aware validation

Built with resilience and scale in mind, it suits large datasets and never processes the same drop or batch twice.

> [!TIP]
> A [SQLite-based version](../../tree/sqlite-version) is available in a dedicated branch for lightweight, persistent storage.

## 📚 Table of Contents

- [🔧 Features](#-features)
- [✅ Core Tasks Done](#-core-tasks-done)
- [🚀 Usage](#-usage)
- [📄 License](#-license)

## 🔧 Features

- **Automated navigation** across multiple `/md/xxxxx.html` pages, decrementing through URLs.
- **Dual source extraction** from `<textarea>` placeholders and valid anchor `<a>` tags within each drop.
- **Robust regex filters** to exclude placeholder and anchor patterns, targeting only real URLs with allowed TLDs and excluding false positives.
- **Smart skipping** logic:
  - Skips scraping if a dropId is already present in `dropLinks.json`.
  - Skips validation if a batchId is fully present in `validatedLinks.json`.
- **Batch-based processing** saves links incrementally as `dropId_drop_N` batches to control memory and improve clarity.
- **Duplicate-free batching**: avoids saving the same link twice within a batch.
- **Status validation:**
  - Uses `curl` for fast, lightweight URL status checking.
  - Automatically falls back to `Playwright` for rich browser-level checks if curl fails or gives uncertain output.
- **Redirection detection** compares the normalized original and final URLs to identify real redirects and capture `redirected_url`.
- **DNS error detection** classifies failures such as `ENOTFOUND` and `EAI_AGAIN` separately and records them with a status of `0`.
- **Secure credential injection** using `.env` variables for login automation.
- **Memory usage tracking** logs RAM snapshots after every 10 placeholder tabs processed.
- **Detailed console logging** helps monitor:
  - URL extraction steps
  - Status checks
  - Validation decisions (curl vs playwright)
  - Skip reasons and timing
- **Structured JSON output:**
  - Scraped links → `data/dropLinks.json`
  - Validated links → `data/validatedLinks.json`
  - Grouped by `batchId`, each link contains:
    - `original`: source URL
    - `status`: HTTP status code
    - `redirection`: true/false
    - `redirected_url`: final URL if redirection happened
    - `included`: boolean match for known target IDs
    - `method`: `"curl"` or `"playwright"`
    - `error`: if present (e.g. `"DNS could not be resolved"`)
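The output fields listed above can be written down as a TypeScript type. This is a minimal sketch: the field names come from the list, but the type names (`ValidatedLink`, `ValidatedLinksFile`), the choice of optional fields, the file layout (an object keyed by `batchId`), and the sample values are assumptions made for illustration.

```typescript
// Hypothetical type names; only the field names are taken from the README.
type ValidationMethod = "curl" | "playwright";

interface ValidatedLink {
  original: string; // source URL
  status: number; // HTTP status code (0 for DNS failures)
  redirection: boolean; // true when the final URL differs from the original
  redirected_url?: string; // final URL, present only when redirected (assumed optional)
  included: boolean; // matched a known target ID
  method: ValidationMethod; // which checker produced the result
  error?: string; // e.g. "DNS could not be resolved"
}

// Assumed layout: one array of links per batchId.
type ValidatedLinksFile = Record<string, ValidatedLink[]>;

// Illustrative data only; the IDs and URLs are made up.
const example: ValidatedLinksFile = {
  "12345_drop_1": [
    {
      original: "https://example.com/a",
      status: 200,
      redirection: true,
      redirected_url: "https://example.com/b",
      included: false,
      method: "curl",
    },
    {
      original: "https://no-such-host.invalid/",
      status: 0,
      redirection: false,
      included: false,
      method: "curl",
      error: "DNS could not be resolved",
    },
  ],
};
```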

## ✅ Core Tasks Done

### 1. Link Extraction

- Extracted URLs from placeholders in textareas with regex, including `http(s)`, `www`, and protocol-relative URLs (`//...`).
- Built a helper to extract both placeholder links and real anchor `<a>` links per drop.
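As a rough illustration of the extraction step, the sketch below pulls `http(s)`, `www`, and protocol-relative URLs out of a block of text such as a textarea placeholder. The pattern and the `extractUrls` name are assumptions for this example and are simpler than whatever the project actually uses.

```typescript
// Illustrative pattern: http(s)://, protocol-relative (//), or www. prefixes,
// followed by any run of characters that cannot end a URL in HTML text.
const URL_PATTERN = /(?:https?:\/\/|\/\/|www\.)[^\s"'<>]+/gi;

// Hypothetical helper: returns every URL-like match, minus trailing punctuation.
function extractUrls(text: string): string[] {
  const matches = text.match(URL_PATTERN) ?? [];
  return matches.map((url) => url.replace(/[.,;:!?)]+$/, ""));
}

const found = extractUrls(
  "Visit https://a.com/x, www.b.org and //cdn.c.net/lib.js now"
);
console.log(found);
```

Trimming trailing punctuation is a design choice here: it keeps a sentence-ending comma or period from being stored as part of the link.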

### 2. Duplicate Handling

- Used `Set` logic to avoid duplicate URLs within each drop batch.
- Skipped already saved drops (`dropLinks.json`) and already validated batches (`validatedLinks.json`) to prevent reprocessing.
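The duplicate and skip checks might look something like the sketch below. It assumes both JSON files are objects keyed by batch ID, and it reads "fully present" as "every link of the batch already has a validated entry". The function names are hypothetical.

```typescript
// Assumed layout of dropLinks.json: batchId -> list of links.
type DropLinks = Record<string, string[]>;

// Remove duplicate URLs while keeping first-seen order.
function dedupe(links: string[]): string[] {
  return Array.from(new Set(links));
}

// A drop counts as saved if any batch id starts with "<dropId>_drop_".
function isDropSaved(saved: DropLinks, dropId: string): boolean {
  const prefix = `${dropId}_drop_`;
  return Object.keys(saved).some((batchId) => batchId.startsWith(prefix));
}

// A batch counts as validated when every one of its links has a validated entry.
function isBatchValidated(links: string[], validatedOriginals: string[]): boolean {
  const seen = new Set(validatedOriginals);
  return links.every((link) => seen.has(link));
}
```

Matching on the full `_drop_` prefix avoids treating drop `10` as saved just because drop `100` is.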

### 3. Batch Accumulation & Saving

- Grouped links into drop-specific batches: `dropId_drop_N`.
- Merged links from placeholders and anchors into a single batch.
- Saved batches incrementally to JSON to avoid memory overflow.
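A minimal sketch of the batching steps above, assuming the JSON file is an object keyed by batch ID. `makeBatchId`, `mergeSources`, and `saveBatch` are hypothetical names, and the read-modify-write approach is one possible way to save incrementally rather than the project's confirmed implementation.

```typescript
import { existsSync, readFileSync, writeFileSync } from "node:fs";

// Assumed file layout: batchId -> list of links.
type BatchStore = Record<string, string[]>;

// Build a batch id in the `dropId_drop_N` form.
function makeBatchId(dropId: string | number, n: number): string {
  return `${dropId}_drop_${n}`;
}

// Merge placeholder and anchor links into one duplicate-free batch.
function mergeSources(placeholderLinks: string[], anchorLinks: string[]): string[] {
  return Array.from(new Set([...placeholderLinks, ...anchorLinks]));
}

// Write one batch at a time so links are not accumulated across drops in memory.
function saveBatch(filePath: string, batchId: string, links: string[]): void {
  const store: BatchStore = existsSync(filePath)
    ? (JSON.parse(readFileSync(filePath, "utf8")) as BatchStore)
    : {};
  store[batchId] = links;
  writeFileSync(filePath, JSON.stringify(store, null, 2));
}
```

A call such as `saveBatch("data/dropLinks.json", makeBatchId(4210, 1), mergeSources(placeholders, anchors))` would then persist one batch; the drop ID `4210` is made up.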

### 4. Navigation & Validation Loop

- Decremented through `/md/{id}.html` pages in a loop using Playwright automation.
- Validated extracted links using `curl` for speed.
- Automatically fell back to Playwright for browser-level validation if curl failed or gave ambiguous results.
- Captured and stored HTTP status, redirection info, final URL, and method used.
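The curl-first strategy with a browser fallback can be sketched as below. Everything here is an assumption about shape rather than the project's code: the `Checker` type, `validateUrl`, `isRedirect`, the 10-second timeout, and the normalization rule (ignore protocol, a leading `www.`, and trailing slashes) are illustrative. The browser checker is passed in as a function so the sketch does not depend on Playwright itself.

```typescript
import { execFile } from "node:child_process";
import { devNull } from "node:os";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

interface CheckResult { status: number; finalUrl: string; }
type Checker = (url: string) => Promise<CheckResult>;
type Outcome = CheckResult & { method: "curl" | "playwright"; error?: string };

// curl follows redirects (-L), discards the body, and prints "<status> <final url>".
const curlCheck: Checker = async (url) => {
  const { stdout } = await execFileAsync("curl", [
    "-s", "-L", "-o", devNull, "--max-time", "10",
    "-w", "%{http_code} %{url_effective}", url,
  ]);
  const [code = "0", finalUrl = url] = stdout.trim().split(" ");
  return { status: Number(code), finalUrl };
};

// DNS failures as reported through Node-style error codes.
const DNS_CODES = new Set(["ENOTFOUND", "EAI_AGAIN"]);
function isDnsError(err: unknown): boolean {
  const code = (err as { code?: unknown } | null)?.code;
  return typeof code === "string" && DNS_CODES.has(code);
}

// Try the fast checker first; use the browser checker if it throws or is inconclusive.
async function validateUrl(url: string, fast: Checker, browser: Checker): Promise<Outcome> {
  try {
    const result = await fast(url);
    if (result.status > 0) return { ...result, method: "curl" };
  } catch {
    // curl failed (non-zero exit, timeout, ...): fall through to the browser check
  }
  try {
    return { ...(await browser(url)), method: "playwright" };
  } catch (err) {
    const error = isDnsError(err) ? "DNS could not be resolved" : String(err);
    return { status: 0, finalUrl: url, method: "playwright", error };
  }
}

// Assumed normalization: compare host (without "www.") + path (without trailing "/") + query.
function normalize(raw: string): string {
  const u = new URL(raw);
  return u.hostname.replace(/^www\./, "") + u.pathname.replace(/\/+$/, "") + u.search;
}

function isRedirect(original: string, finalUrl: string): boolean {
  return normalize(original) !== normalize(finalUrl);
}
```

In real use the call would be `validateUrl(url, curlCheck, playwrightCheck)`, where `playwrightCheck` is a hypothetical function that opens the page in a browser and reports the response status and final URL.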

### 5. Inclusion Mapping (Optional Analysis)

- Compared resolved URLs against a predefined list of numeric target IDs.
- Marked each validated link with `included: true/false` depending on match.
- Enabled later filtering and analysis against external reference lists.
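The README does not say how a URL is matched against the target IDs. One plausible reading, sketched below, is that a link is included when any complete run of digits in the resolved URL equals a target ID. The `isIncluded` name and the sample IDs are hypothetical.

```typescript
// Assumption: a URL is "included" when one of its full digit runs is a target ID.
function isIncluded(url: string, targetIds: ReadonlySet<string>): boolean {
  const digitRuns = url.match(/\d+/g) ?? [];
  return digitRuns.some((run) => targetIds.has(run));
}

// Illustrative target list; real IDs would come from an external reference file.
const targets = new Set(["4821", "9930"]);
console.log(isIncluded("https://x.com/offer/4821?ref=1", targets));
console.log(isIncluded("https://x.com/offer/14821", targets));
```

Matching whole digit runs rather than substrings keeps `14821` from counting as a hit for `4821`.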

### 6. Regex Improvements

- Refined regex patterns to allow a wide variety of real URLs while filtering out false positives like `{{contact.first_name}}`.
- Added support for extended TLDs and shorteners (`.me`, `.li`, `.in`, `.moe`, etc.).
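A filter along these lines could reject template placeholders and accept only hosts with an allowed TLD. The allow-list below is a small illustrative subset, and `isRealUrl` is a hypothetical name; the project's actual patterns are regex-based and are not shown in the README.

```typescript
// Illustrative subset of allowed TLDs, including the shortener TLDs named above.
const ALLOWED_TLDS = new Set(["com", "net", "org", "io", "me", "li", "in", "moe"]);

function isRealUrl(candidate: string): boolean {
  // Template placeholders such as {{contact.first_name}} contain braces.
  if (/[{}]/.test(candidate)) return false;

  // Give the URL parser a protocol to work with.
  let withProtocol = candidate;
  if (candidate.startsWith("//")) withProtocol = `https:${candidate}`;
  else if (!/^https?:\/\//i.test(candidate)) withProtocol = `https://${candidate}`;

  let host: string;
  try {
    host = new URL(withProtocol).hostname;
  } catch {
    return false; // not parseable as a URL at all
  }

  const tld = host.split(".").pop() ?? "";
  return host.includes(".") && ALLOWED_TLDS.has(tld);
}
```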

### 7. Memory Management & Debugging

- Logged memory usage every 10 tabs to track performance.
- Introduced async timeouts and batch size limits to keep Playwright stable during heavy runs.
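Periodic memory logging can be done with Node's built-in `memoryUsage()`. The sketch below assumes a counter of processed tabs is available; the function names and the log format are made up for illustration.

```typescript
import { memoryUsage } from "node:process";

// Format a byte count as megabytes with one decimal place.
function formatMemory(bytes: number): string {
  return `${(bytes / 1024 / 1024).toFixed(1)} MB`;
}

// Log a RAM snapshot every `interval` tabs; returns true when a snapshot was logged.
function logMemoryEvery(processedTabs: number, interval = 10): boolean {
  if (processedTabs === 0 || processedTabs % interval !== 0) return false;
  const { rss, heapUsed } = memoryUsage();
  console.log(
    `[memory] after ${processedTabs} tabs: rss=${formatMemory(rss)} heapUsed=${formatMemory(heapUsed)}`
  );
  return true;
}
```

Calling `logMemoryEvery(count)` after each tab keeps the logging decision in one place, so the scraping loop only has to maintain the counter.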

### 8. Environment Handling

- Introduced `.env` config for secure credentials (`SERVER_EMAIL`, `SERVER_PASSWORD`).
- Included `.env.example` for team usage without exposing secrets.
- Used `.env` credentials in Playwright login tests with strict TypeScript handling.
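Under strict TypeScript, environment values are typed `string | undefined`, so credentials need an explicit check before use. A minimal sketch follows; the variable names `SERVER_EMAIL` and `SERVER_PASSWORD` come from the README, while `requireEnv` and `loadCredentials` are hypothetical helpers.

```typescript
type Env = Record<string, string | undefined>;

// Return the variable's value or fail fast with a clear message.
function requireEnv(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Narrow both credentials from `string | undefined` to `string`.
function loadCredentials(env: Env): { email: string; password: string } {
  return {
    email: requireEnv(env, "SERVER_EMAIL"),
    password: requireEnv(env, "SERVER_PASSWORD"),
  };
}
```

In a test this would be called as `loadCredentials(process.env)` once the `.env` file has been loaded, so a missing variable fails before the login step rather than during it.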

## 🚀 Usage

This section covers everything you need to **set up, run the server, and execute the web scrape** for the JSON branch.

### 1. Install dependencies

```bash
npm install
```

### 2. Set up your environment

```bash
cp .env.example .env
```

Then define your credentials:

```ini
SERVER_EMAIL=your@email.com
SERVER_PASSWORD=yourPassword
```

### 3. Prepare storage

Copy the example JSON files before running:

```bash
cp data/dropLinks.example.json data/dropLinks.json
cp data/validatedLinks.example.json data/validatedLinks.json
```

### 4. Run the Web Scrape

```bat
.\run-tests.bat
```

- Executes the full web scraping and URL validation workflow.
- Saves results to the JSON files `dropLinks.json` and `validatedLinks.json`.

> ⚠️ The server runs on `localhost:3000` by default. Example endpoint: `/api/validated-links`

## 📄 License

This project is licensed under the [MIT License](LICENSE).
