https://github.com/davodm/article-export
Simple serverless app to export article data from a URL, with the ability to bypass Cloudflare anti-bot measures
- Host: GitHub
- URL: https://github.com/davodm/article-export
- Owner: davodm
- License: other
- Created: 2023-06-06T15:35:20.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2025-10-18T15:13:41.000Z (8 months ago)
- Last Synced: 2025-10-19T09:18:00.748Z (8 months ago)
- Topics: cloudflare, javascript, nodejs, vercel
- Language: JavaScript
- Homepage:
- Size: 86.9 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Article Export - Serverless Content Extractor
A high-performance, serverless Node.js application that extracts article data from URLs while bypassing Cloudflare's anti-bot measures. Built with modern JavaScript (ES2022) and optimized for Node.js v18+ and Vercel deployment.
## Features
### Core Functionality
- **Dual Bypass Strategy**: Two-tier anti-bot bypass system
  - Primary: `humanoid-js` for basic-medium Cloudflare protection
  - Secondary: `impit` with browser fingerprint spoofing
  - Automatic fallback if primary method fails
- **Smart Caching**: Redis-based caching with configurable TTL (default: 10 days)
- **Content Extraction**: Extracts title, content, images, author, published date, and metadata
- **Quality Validation**: Automatic detection of cookie walls, paywalls, and invalid content
- **Dual HTTP Methods**: Supports both GET and POST requests
### Security & Reliability
- **Secret Key Authentication**: Multi-key support with comma-separated values
- **Input Validation**: URL format validation and sanitization
- **Redis Fallback**: Service continues without cache if Redis is unavailable
- **Timeout Handling**: 25-second timeout prevents hanging requests
- **Error Sanitization**: Production-safe error messages
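The multi-key authentication and URL validation described above could look roughly like the following. This is a minimal sketch, not the project's actual code: `parseKeys`, `isAuthorized`, and `isValidHttpUrl` are hypothetical helper names, and the only assumption taken from this README is that `SECRET_KEY` holds comma-separated keys.

```javascript
// Hypothetical helpers; the project's own implementation may differ.

// Split a comma-separated SECRET_KEY value into a set of trimmed, non-empty keys.
function parseKeys(secretKeyEnv = '') {
  return new Set(
    secretKeyEnv
      .split(',')
      .map((key) => key.trim())
      .filter((key) => key.length > 0)
  );
}

// Accept a request only if its key matches one of the configured keys.
function isAuthorized(requestKey, secretKeyEnv) {
  if (typeof requestKey !== 'string' || requestKey.length === 0) return false;
  return parseKeys(secretKeyEnv).has(requestKey);
}

// Basic URL validation: only absolute http(s) URLs are accepted.
function isValidHttpUrl(value) {
  try {
    const { protocol } = new URL(value);
    return protocol === 'http:' || protocol === 'https:';
  } catch {
    return false;
  }
}

console.log(isAuthorized('key2', 'key1, key2')); // true
console.log(isValidHttpUrl('ftp://example.com')); // false
```

Trimming each key means a value such as `key1, key2` (with a space after the comma) still works as expected.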
### Monitoring & Observability
- **Strategy Reporting**: Shows which bypass method succeeded (`humanoid` or `impit`)
- **Performance Tracking**: Response time measurement for every request
- **Content Validation**: Reports on article quality and detected blockers
- **Health Endpoint**: Service health and Redis connectivity monitoring
- **Cache Status**: Indicates if content was served from cache or freshly fetched
### Developer Experience
- **CORS Support**: Cross-origin requests enabled for all methods
- **RESTful API**: Clean, consistent JSON responses
- **Comprehensive Testing**: 7 automated checks for project integrity
- **Modern Tooling**: ESLint v9, Prettier, ES2022 features
- **Serverless Ready**: Optimized for Vercel free tier (<50MB)
## Quick Start
### Prerequisites
- **Node.js**: v18.0.0 or later (fully compatible with Node.js v22)
- **Vercel CLI**: Install globally with `npm i -g vercel`
- **Upstash Redis**: For caching (free tier available at [upstash.com](https://upstash.com))
### Installation
1. **Clone the repository**
```bash
git clone https://github.com/davodm/article-export.git
cd article-export
```
2. **Install dependencies**
```bash
npm install
```
3. **Set up environment variables**
Create a `.env.local` file:
```bash
UPSTASH_REDIS_REST_TOKEN=your_redis_token
UPSTASH_REDIS_REST_URL=your_redis_url
SECRET_KEY=your_secret_key1,your_secret_key2
REDIS_CACHE_DAYS=10
```
4. **Run tests to verify setup**
```bash
npm test
```
5. **Start local development**
```bash
vercel dev
```
## API Usage
### Endpoints
#### Main API: `GET /api` or `POST /api`
Extracts article content from a given URL. Supports both GET and POST methods.
#### Health Check: `GET /api/health`
Monitors service health and Redis connection status.
### Request Format
The API supports both GET and POST methods with the same parameters:
**GET Request (Query Parameters):**
```bash
GET /api?key=your_secret_key&url=https://example.com/article
```
**POST Request (JSON Body):**
```json
{
  "key": "your_secret_key",
  "url": "https://example.com/article"
}
```
### Response Format
**Success Response (200):**
```json
{
  "status": 0,
  "article": {
    "title": "Article Title",
    "content": "Article content...",
    "image": "https://example.com/image.jpg",
    "author": "Author Name",
    "publishedTime": "2024-01-01T00:00:00.000Z"
  },
  "cached": false,
  "strategy": "humanoid",
  "validation": {
    "isValid": true,
    "hasBlocker": false,
    "issues": [],
    "quality": {
      "hasValidTitle": true,
      "hasValidContent": true,
      "hasValidDescription": true,
      "contentLength": 2540
    }
  },
  "processingTime": "1250ms",
  "timestamp": "2024-01-01T00:00:00.000Z"
}
```
**Response Fields:**
- `status`: `0` for success, `-1` for error
- `article`: Extracted article data (title, content, author, etc.)
- `cached`: `true` if served from cache, `false` if freshly fetched
- `strategy`: Which fetch method was used (`"humanoid"` or `"impit"`), `null` if from cache
- `validation`: Content quality and blocker detection (see below)
- `processingTime`: Total processing time, returned as a string with an `ms` suffix (e.g. `"1250ms"`)
- `timestamp`: ISO timestamp of the response
**Validation Object:**
- `isValid`: `true` if content is valid, `false` if issues detected
- `hasBlocker`: `true` if cookie wall or paywall detected
- `issues`: Array of detected issues (cookie walls, paywalls, etc.)
- `quality`: Quality metrics (title, content, description validity)
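A client might interpret these fields along the following lines. This is a sketch under the assumption that responses match the shapes documented in this section; `classifyResponse` and its return values are hypothetical and not part of the API.

```javascript
// Hypothetical client-side helper: turn an API response body into a simple outcome.
function classifyResponse(body) {
  // status -1 (or anything other than 0) signals an error response.
  if (body.status !== 0) {
    return { ok: false, reason: body.error ?? 'unknown error', issues: [] };
  }

  const validation = body.validation ?? {};
  const issues = validation.issues ?? [];

  // A detected cookie wall or paywall means the "article" is probably not the real content.
  if (validation.hasBlocker) {
    return { ok: false, reason: 'blocked', issues };
  }
  if (validation.isValid === false) {
    return { ok: false, reason: 'low-quality', issues };
  }

  return {
    ok: true,
    article: body.article,
    fromCache: body.cached === true,
    strategy: body.strategy ?? null, // null when served from cache
  };
}

const sample = {
  status: 0,
  article: { title: 'Article Title' },
  cached: false,
  strategy: 'humanoid',
  validation: { isValid: true, hasBlocker: false, issues: [] },
};
console.log(classifyResponse(sample).ok); // true
```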
**Error Response (4xx/5xx):**
```json
{
  "status": -1,
  "error": "Error message",
  "timestamp": "2024-01-01T00:00:00.000Z"
}
```
**Health Check Response:**
```json
{
  "status": 0,
  "message": "Service is healthy",
  "timestamp": "2024-01-01T00:00:00.000Z",
  "environment": "production",
  "nodeVersion": "v22.15.1",
  "redis": "connected",
  "uptime": 123.456
}
```
### Example Usage
```bash
# Test health endpoint
curl https://your-app.vercel.app/api/health
# Extract article content (GET method - simple and easy)
curl "https://your-app.vercel.app/api?key=your_secret_key&url=https://example.com/article"
# Extract article content (POST method - recommended for long URLs)
curl -X POST https://your-app.vercel.app/api \
  -H "Content-Type: application/json" \
  -d '{
    "key": "your_secret_key",
    "url": "https://example.com/article"
  }'
```
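The same calls can be made from Node.js v18+ with the built-in `fetch`. The sketch below is illustrative: `buildExtractRequest` and `extractArticle` are hypothetical names, and `https://your-app.vercel.app` is the placeholder deployment URL used throughout this README. For GET requests, `URLSearchParams` percent-encodes the article URL, which matters when that URL carries its own query string.

```javascript
// Build the fetch() arguments for either HTTP method (hypothetical helper).
function buildExtractRequest(baseUrl, key, articleUrl, method = 'POST') {
  const endpoint = new URL('/api', baseUrl);

  if (method === 'GET') {
    // searchParams percent-encodes the nested article URL.
    endpoint.searchParams.set('key', key);
    endpoint.searchParams.set('url', articleUrl);
    return { url: endpoint.toString(), options: { method: 'GET' } };
  }

  return {
    url: endpoint.toString(),
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ key, url: articleUrl }),
    },
  };
}

// Send the request and return the parsed body, throwing on error responses.
async function extractArticle(baseUrl, key, articleUrl, method = 'POST') {
  const { url, options } = buildExtractRequest(baseUrl, key, articleUrl, method);
  const response = await fetch(url, options);
  const body = await response.json();
  if (body.status !== 0) {
    throw new Error(body.error ?? `Request failed with HTTP ${response.status}`);
  }
  return body;
}

// Usage (requires a deployed instance and a valid key):
// const result = await extractArticle('https://your-app.vercel.app', 'your_secret_key', 'https://example.com/article');
```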
## Development
### Available Scripts
- `vercel dev` - Start local development server
- `npm run build` - Build the project (creates public directory for Vercel)
- `npm run deploy` - Deploy to production
- `npm run deploy:staging` - Deploy to staging
- `npm run lint` - Run ESLint for code quality
- `npm run format` - Format code with Prettier
- `npm test` - Run project validation tests
- `npm run clean` - Clean Vercel build files
### Code Quality
The project uses modern development tools:
- **ESLint v9** with flat config for code linting
- **Prettier** for consistent code formatting
- **ES2022** features for modern JavaScript
- **Comprehensive testing** with automated validation
### Local Development
1. **Install Vercel CLI globally:**
```bash
npm i -g vercel
```
2. **Link your project:**
```bash
vercel link
```
3. **Run locally:**
```bash
vercel dev
```
## Dependencies
### Production Dependencies
| Package | Version | Status | Purpose |
|---------|---------|--------|---------|
| `@extractus/article-extractor` | ^8.0.20 | **Active** | Extracts article content, metadata, and structured data from HTML |
| `@upstash/redis` | ^1.35.6 | **Active** | Serverless Redis client for caching with REST API |
| `humanoid-js` | ^1.0.1 | **Deprecated** | Primary Cloudflare bypass (7 years old, but still functional) |
| `impit` | ^0.6.0 | **Active** | HTTP client with browser impersonation for secondary bypass |
### Development Dependencies
| Package | Version | Status | Purpose |
|---------|---------|--------|---------|
| `eslint` | ^9.38.0 | **Active** | Code linting with flat config support |
| `globals` | ^16.4.0 | **Active** | ESLint global variables for Node.js v24 compatibility |
| `prettier` | ^3.6.2 | **Active** | Code formatting |
### Dependency Notes
**humanoid-js (Unmaintained)**
- Last updated: 7 years ago (2018)
- Status: Works for basic-medium Cloudflare protection
- Why we keep it: Simple, lightweight, no browser needed
- Fallback: `impit` automatically used if humanoid-js fails
- Future: Will replace when it stops working or better alternatives emerge
**Why This Approach Works:**
- Two bypass strategies provide redundancy
- Automatic fallback ensures reliability
- All dependencies work on Vercel free tier
- No browser automation needed (keeps function size <50MB)
- Total package size: ~15MB (well under 50MB limit)
### Update Strategy
```bash
# Update all dependencies (safe - follows semver)
npm update
# Check for outdated packages
npm outdated
# Rebuild native modules after Node.js upgrade
npm rebuild
```
## Architecture
### Data Flow
```
Request → Validate Key & URL
    ↓
Check Redis Cache
    ↓
Cache Hit? → Return Cached Article
    ↓
Cache Miss? → Fetch with Bypass Strategy
    ↓
Try humanoid-js → Success? → Extract & Cache → Return
    ↓
Failed? → Try impit → Success? → Extract & Cache → Return
    ↓
Failed? → Return Error
```
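The cache step in this flow, together with the Redis fallback mentioned under Security & Reliability, follows a cache-aside pattern. The sketch below illustrates it with an in-memory stand-in; `MemoryCache`, `getArticle`, and the `get`/`set` interface are hypothetical and do not mirror the `@upstash/redis` client API.

```javascript
// Minimal in-memory stand-in for a cache client (hypothetical interface).
class MemoryCache {
  constructor() {
    this.store = new Map();
  }
  async get(key) {
    const entry = this.store.get(key);
    if (!entry || entry.expiresAt <= Date.now()) return null;
    return entry.value;
  }
  async set(key, value, ttlSeconds) {
    this.store.set(key, { value, expiresAt: Date.now() + ttlSeconds * 1000 });
  }
}

// Cache-aside lookup: serve from cache when possible, otherwise fetch and store.
// Cache failures are swallowed so the service keeps working without a cache.
async function getArticle(url, { cache, fetchArticle, ttlSeconds }) {
  try {
    const hit = await cache.get(url);
    if (hit) return { article: hit, cached: true };
  } catch {
    // Cache unavailable: fall through to a fresh fetch.
  }

  const article = await fetchArticle(url);

  try {
    await cache.set(url, article, ttlSeconds);
  } catch {
    // Failing to store the result must not fail the request.
  }
  return { article, cached: false };
}
```

Passing the cache and the fetcher in as arguments keeps the sketch testable without a real Redis instance.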
### Bypass Strategy Logic
```
Automatic fallback system
1. Try humanoid-js (fast, lightweight)
   Success → Cache & Return
   Fail → continue with step 2
2. Try impit (browser impersonation)
   Success → Cache & Return
   Fail → continue with step 3
3. Return error with details
```
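In code, the fallback logic amounts to trying each strategy in order and reporting which one succeeded. The following is a sketch with stand-in strategies: `fetchWithFallback` and the `{ name, fetchHtml }` shape are hypothetical, and the real strategies would wrap `humanoid-js` and `impit`, whose APIs are not shown here.

```javascript
// Try each strategy in order; return the first success along with its name.
async function fetchWithFallback(url, strategies) {
  const errors = [];
  for (const { name, fetchHtml } of strategies) {
    try {
      const html = await fetchHtml(url);
      return { html, strategy: name };
    } catch (error) {
      // Remember why this strategy failed, then move on to the next one.
      errors.push(`${name}: ${error.message}`);
    }
  }
  throw new Error(`All strategies failed (${errors.join('; ')})`);
}

// Stand-in strategies for demonstration only.
const strategies = [
  {
    name: 'humanoid',
    fetchHtml: async () => {
      throw new Error('challenge not solved');
    },
  },
  {
    name: 'impit',
    fetchHtml: async (url) => `<html><!-- fetched ${url} --></html>`,
  },
];

fetchWithFallback('https://example.com/article', strategies).then(({ strategy }) =>
  console.log(strategy)
); // logs "impit"
```

The returned `strategy` name corresponds to the `strategy` field in the API response.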
### Content Validation Flow
```
Extract Article → Validate Content
    ↓
Check for:
  - Cookie walls (40+ confidence threshold)
  - Paywalls (30+ confidence threshold)
  - Short content (< 200 chars)
  - Missing title (< 10 chars)
    ↓
Return validation object with:
  - isValid: boolean
  - hasBlocker: boolean
  - issues: array
  - quality: metrics
```
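The checks in this flow can be sketched as a single function. The thresholds come from the flow above; everything else is an assumption: `validateArticle`, the issue labels, and the idea that cookie-wall and paywall confidence scores are computed elsewhere and passed in. The `hasValidDescription` metric from the sample response is omitted for brevity.

```javascript
// Thresholds from the validation flow; confidence scoring is assumed to happen elsewhere.
const THRESHOLDS = {
  cookieWall: 40,
  paywall: 30,
  minContentLength: 200,
  minTitleLength: 10,
};

// Hypothetical validator producing an object shaped like the API's `validation` field.
function validateArticle(article, scores = {}) {
  const title = (article.title ?? '').trim();
  const content = (article.content ?? '').trim();
  const issues = [];

  // Blockers: a score at or above the threshold counts as detected.
  if ((scores.cookieWall ?? 0) >= THRESHOLDS.cookieWall) issues.push('cookie_wall');
  if ((scores.paywall ?? 0) >= THRESHOLDS.paywall) issues.push('paywall');
  const hasBlocker = issues.length > 0;

  // Quality: very short titles or bodies usually mean extraction failed.
  const hasValidTitle = title.length >= THRESHOLDS.minTitleLength;
  const hasValidContent = content.length >= THRESHOLDS.minContentLength;
  if (!hasValidTitle) issues.push('missing_title');
  if (!hasValidContent) issues.push('short_content');

  return {
    isValid: issues.length === 0,
    hasBlocker,
    issues,
    quality: { hasValidTitle, hasValidContent, contentLength: content.length },
  };
}
```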
## Use Cases
### **What This API Is Great For:**
- News aggregators
- RSS feed readers
- Bookmark managers with content preview
- Content analysis tools
- Research bots
- Article archiving services
- Content discovery platforms
### **Limitations:**
- **Cookie Walls**: Detects but cannot automatically accept (requires browser automation)
- **Paywalls**: Detects but cannot bypass (premium content protected)
- **JavaScript-heavy sites**: May return incomplete content
- **Rate limiting**: Subject to target site's rate limits
- **Dynamic content**: May miss content loaded via AJAX after initial render
### **Best Practices:**
- Cache aggressively (10-day default is reasonable for most content)
- Handle `validation.hasBlocker` in your client code
- Monitor `strategy` field to track bypass success rates
- Use POST for long URLs (avoid URL length limits)
- Implement retry logic with exponential backoff
- Check `cached` field to understand performance
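The retry recommendation can be implemented with a small wrapper. This is one possible sketch: `withRetry`, `backoffDelay`, and the default delays are illustrative choices, not values prescribed by the API.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Delay before the retry that follows attempt `attempt` (0-based): base * 2^attempt, capped.
function backoffDelay(attempt, baseMs = 500, maxMs = 8000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Run `task` until it succeeds or the retry budget is exhausted.
async function withRetry(task, { retries = 3, baseMs = 500, maxMs = 8000 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt += 1) {
    try {
      return await task(attempt);
    } catch (error) {
      lastError = error;
      // Wait longer after each failure, but not after the final one.
      if (attempt < retries) await sleep(backoffDelay(attempt, baseMs, maxMs));
    }
  }
  throw lastError;
}

console.log(backoffDelay(0), backoffDelay(1), backoffDelay(2)); // 500 1000 2000
```

With the defaults, a request is attempted up to four times, waiting 500 ms, 1 s, and 2 s between attempts. Retrying is most useful for transient failures; an error such as an invalid key will fail the same way every time.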
## Configuration
### Environment Variables
| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| `UPSTASH_REDIS_REST_TOKEN` | Yes | - | Your Upstash Redis REST token |
| `UPSTASH_REDIS_REST_URL` | Yes | - | Your Upstash Redis REST URL (https://...) |
| `SECRET_KEY` | Yes | - | Comma-separated API keys for authentication |
| `REDIS_CACHE_DAYS` | No | `10` | Cache duration in days (recommend 10-30) |
| `NODE_ENV` | No | `development` | Environment (`development`, `production`) |
### Example Configuration
**`.env.local` for local development:**
```bash
UPSTASH_REDIS_REST_TOKEN=xxxx...
UPSTASH_REDIS_REST_URL=https://frank-lizard-12345.upstash.io
SECRET_KEY=my_dev_key_123,another_key_456
REDIS_CACHE_DAYS=10
NODE_ENV=development
```
**Vercel Environment Variables:**
1. Go to your Vercel project → Settings → Environment Variables
2. Add each variable for Production, Preview, and Development
3. Vercel will automatically inject them during deployment
### Cache Configuration Recommendations
| Content Type | Recommended TTL | Setting |
|--------------|----------------|---------|
| News articles | 1-3 days | `REDIS_CACHE_DAYS=1` |
| Blog posts | 7-14 days | `REDIS_CACHE_DAYS=7` |
| Static content | 30+ days | `REDIS_CACHE_DAYS=30` |
| **General use (default)** | **10 days** | `REDIS_CACHE_DAYS=10` |
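Because Redis expirations are typically expressed in seconds, the day-based setting has to be converted somewhere. A minimal sketch of that conversion, assuming a fallback to the documented default of 10 days when the variable is missing or invalid (`cacheTtlSeconds` is a hypothetical helper name):

```javascript
const SECONDS_PER_DAY = 24 * 60 * 60;
const DEFAULT_CACHE_DAYS = 10;

// Read REDIS_CACHE_DAYS and convert it to a TTL in seconds.
function cacheTtlSeconds(env = process.env) {
  const days = Number.parseInt(env.REDIS_CACHE_DAYS ?? '', 10);
  const safeDays = Number.isFinite(days) && days > 0 ? days : DEFAULT_CACHE_DAYS;
  return safeDays * SECONDS_PER_DAY;
}

console.log(cacheTtlSeconds({ REDIS_CACHE_DAYS: '7' })); // 604800
console.log(cacheTtlSeconds({})); // 864000
```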
## Deployment
### Deploy to Vercel
**Quick Deploy:**
```bash
# Production deployment
npm run deploy
# Staging deployment
npm run deploy:staging
```
**First-time Setup:**
1. Install Vercel CLI: `npm i -g vercel`
2. Link project: `vercel link`
3. Add environment variables in Vercel dashboard
4. Deploy: `npm run deploy`
### Keep-Alive Configuration
Serverless functions can go "cold" after inactivity. To keep your function and Upstash Redis connection active, we've configured a daily cron job that pings the health endpoint.
**Built-in Solution (Vercel Cron Jobs):**
- Already configured in `vercel.json`
- Runs daily at 12:00 UTC
- Free on Vercel Pro plan (or use alternatives below)
- No external dependencies
The cron job is configured to call `/api/health` once per day, which:
- Keeps the serverless function warm
- Tests Redis connectivity
- Ensures the database stays active
**Alternative Free Solutions:**
If your Vercel plan does not include cron jobs, use one of these free external services:
1. **UptimeRobot** (Recommended - Free tier: 50 monitors)
   - URL: https://uptimerobot.com
   - Setup: Create a monitor → HTTP(s) → Your health endpoint URL
   - Interval: Set to check every 24 hours (or minimum 5 minutes)
   - Free tier: 50 monitors, 5-minute intervals
2. **Cron-Job.org** (Free)
   - URL: https://cron-job.org
   - Setup: Create job → HTTP Request → Your health endpoint URL
   - Schedule: `0 12 * * *` (daily at 12:00 UTC)
   - Free tier: Unlimited jobs, 1-minute minimum interval
3. **EasyCron** (Free tier available)
   - URL: https://www.easycron.com
   - Setup: Create cron job → HTTP GET → Your health endpoint URL
   - Schedule: Daily
   - Free tier: 1 job, 1-hour minimum interval
4. **GitHub Actions** (If your repo is public)
   - Create `.github/workflows/keep-alive.yml`:
```yaml
name: Keep Alive
on:
  schedule:
    - cron: '0 12 * * *' # Daily at 12:00 UTC
jobs:
  ping:
    runs-on: ubuntu-latest
    steps:
      - name: Ping health endpoint
        run: curl -f ${{ secrets.HEALTH_ENDPOINT_URL }} || exit 1
```
**Health Endpoint URL:**
```
https://your-app.vercel.app/api/health
```
Replace `your-app` with your actual Vercel deployment URL.
## Testing
### Automated Tests
The project includes 7 automated validation checks:
```bash
npm test
```
**What's tested:**
1. Project structure (all required files exist)
2. Code quality (ESLint passes)
3. Package scripts (deploy, test, lint, etc.)
4. Dependencies (all installed correctly)
5. Node.js compatibility (v18+)
6. Module exports (fetcher functions work)
7. Environment template (all variables documented)
## Contributing
We welcome contributions! Please follow these steps:
1. Fork the repository
2. Create a feature branch: `git checkout -b feature/amazing-feature`
3. Commit your changes: `git commit -m 'Add amazing feature'`
4. Push to the branch: `git push origin feature/amazing-feature`
5. Open a Pull Request
### Development Guidelines
- Follow ESLint rules (run `npm run lint`)
- Use Prettier for formatting (run `npm run format`)
- Write meaningful commit messages
- Test your changes locally before submitting
- Ensure all tests pass (`npm test`)
- Update README if adding new features
- Keep dependencies up to date
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
---
**Made with ❤️ by Davod Mozafari**