{"id":18756288,"url":"https://github.com/davodm/article-export","last_synced_at":"2026-05-09T07:02:59.593Z","repository":{"id":189379549,"uuid":"650215933","full_name":"davodm/article-export","owner":"davodm","description":"Simple serverless app to export article data from an URL with ability to bypass cloudflare ani-bot measures","archived":false,"fork":false,"pushed_at":"2025-10-18T15:13:41.000Z","size":89,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-19T09:18:00.748Z","etag":null,"topics":["cloudflare","javascript","nodejs","vercel"],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/davodm.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2023-06-06T15:35:20.000Z","updated_at":"2025-10-18T15:13:46.000Z","dependencies_parsed_at":"2025-10-18T19:25:08.312Z","dependency_job_id":null,"html_url":"https://github.com/davodm/article-export","commit_stats":null,"previous_names":["davodm/article-export"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/davodm/article-export","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davodm%2Farticle-export","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davodm%2Farticle-export/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davodm%2Farticle-export
/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davodm%2Farticle-export/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/davodm","download_url":"https://codeload.github.com/davodm/article-export/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davodm%2Farticle-export/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32810381,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-08T08:22:46.396Z","status":"online","status_checked_at":"2026-05-09T02:00:06.633Z","response_time":123,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cloudflare","javascript","nodejs","vercel"],"created_at":"2024-11-07T17:35:53.966Z","updated_at":"2026-05-09T07:02:59.581Z","avatar_url":"https://github.com/davodm.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Article Export - Serverless Content Extractor\n\nA high-performance, serverless Node.js application that extracts article data from URLs while bypassing Cloudflare's anti-bot measures. 
Built with modern JavaScript (ES2022) and optimized for Node.js v18+ and Vercel deployment.\n\n## ✨ Features\n\n### 🚀 Core Functionality\n- **Dual Bypass Strategy**: Two-tier anti-bot bypass system\n  - Primary: `humanoid-js` for basic-medium Cloudflare protection\n  - Secondary: `impit` with browser fingerprint spoofing\n  - Automatic fallback if primary method fails\n- **Smart Caching**: Redis-based caching with configurable TTL (default: 10 days)\n- **Content Extraction**: Extracts title, content, images, author, published date, and metadata\n- **Quality Validation**: Automatic detection of cookie walls, paywalls, and invalid content\n- **Dual HTTP Methods**: Supports both GET and POST requests\n\n### 🔒 Security \u0026 Reliability\n- **Secret Key Authentication**: Multi-key support with comma-separated values\n- **Input Validation**: URL format validation and sanitization\n- **Redis Fallback**: Service continues without cache if Redis is unavailable\n- **Timeout Handling**: 25-second timeout prevents hanging requests\n- **Error Sanitization**: Production-safe error messages\n\n### 📊 Monitoring \u0026 Observability\n- **Strategy Reporting**: Shows which bypass method succeeded (`humanoid` or `impit`)\n- **Performance Tracking**: Response time measurement for every request\n- **Content Validation**: Reports on article quality and detected blockers\n- **Health Endpoint**: Service health and Redis connectivity monitoring\n- **Cache Status**: Indicates if content was served from cache or freshly fetched\n\n### 🌐 Developer Experience\n- **CORS Support**: Cross-origin requests enabled for all methods\n- **RESTful API**: Clean, consistent JSON responses\n- **Comprehensive Testing**: 7 automated checks for project integrity\n- **Modern Tooling**: ESLint v9, Prettier, ES2022 features\n- **Serverless Ready**: Optimized for Vercel free tier (\u003c50MB)\n\n## 🚀 Quick Start\n\n### Prerequisites\n\n- **Node.js**: v18.0.0 or later (fully compatible with Node.js v22)\n- 
**Vercel CLI**: Install globally with `npm i -g vercel`\n- **Upstash Redis**: For caching (free tier available at [upstash.com](https://upstash.com))\n\n### Installation\n\n1. **Clone the repository**\n\n   ```bash\n   git clone https://github.com/davodm/article-export.git\n   cd article-export\n   ```\n\n2. **Install dependencies**\n\n   ```bash\n   npm install\n   ```\n\n3. **Set up environment variables**\n   Create a `.env.local` file:\n\n   ```bash\n   UPSTASH_REDIS_REST_TOKEN=your_redis_token\n   UPSTASH_REDIS_REST_URL=your_redis_url\n   SECRET_KEY=your_secret_key1,your_secret_key2\n   REDIS_CACHE_DAYS=10\n   ```\n\n4. **Run tests to verify setup**\n\n   ```bash\n   npm test\n   ```\n\n5. **Start local development**\n   ```bash\n   vercel dev\n   ```\n\n## 📡 API Usage\n\n### Endpoints\n\n#### Main API: `GET /api` or `POST /api`\n\nExtracts article content from a given URL. Supports both GET and POST methods.\n\n#### Health Check: `GET /api/health`\n\nMonitors service health and Redis connection status.\n\n### Request Format\n\nThe API supports both GET and POST methods with the same parameters:\n\n**GET Request (Query Parameters):**\n\n```bash\nGET /api?key=your_secret_key\u0026url=https://example.com/article\n```\n\n**POST Request (JSON Body):**\n\n```json\n{\n  \"key\": \"your_secret_key\",\n  \"url\": \"https://example.com/article\"\n}\n```\n\n### Response Format\n\n**Success Response (200):**\n\n```json\n{\n  \"status\": 0,\n  \"article\": {\n    \"title\": \"Article Title\",\n    \"content\": \"Article content...\",\n    \"image\": \"https://example.com/image.jpg\",\n    \"author\": \"Author Name\",\n    \"publishedTime\": \"2024-01-01T00:00:00.000Z\"\n  },\n  \"cached\": false,\n  \"strategy\": \"humanoid\",\n  \"validation\": {\n    \"isValid\": true,\n    \"hasBlocker\": false,\n    \"issues\": [],\n    \"quality\": {\n      \"hasValidTitle\": true,\n      \"hasValidContent\": true,\n      \"hasValidDescription\": true,\n      \"contentLength\": 
2540\n    }\n  },\n  \"processingTime\": \"1250ms\",\n  \"timestamp\": \"2024-01-01T00:00:00.000Z\"\n}\n```\n\n**Response Fields:**\n- `status`: `0` for success, `-1` for error\n- `article`: Extracted article data (title, content, author, etc.)\n- `cached`: `true` if served from cache, `false` if freshly fetched\n- `strategy`: Which fetch method was used (`\"humanoid\"` or `\"impit\"`), `null` if from cache\n- `validation`: Content quality and blocker detection (see below)\n- `processingTime`: Total processing time, as a string in milliseconds with an `ms` suffix (e.g. `\"1250ms\"`)\n- `timestamp`: ISO 8601 timestamp of the response\n\n**Validation Object:**\n- `isValid`: `true` if content is valid, `false` if issues detected\n- `hasBlocker`: `true` if cookie wall or paywall detected\n- `issues`: Array of detected issues (cookie walls, paywalls, etc.)\n- `quality`: Quality metrics (title, content, description validity)\n\n**Error Response (4xx/5xx):**\n\n```json\n{\n  \"status\": -1,\n  \"error\": \"Error message\",\n  \"timestamp\": \"2024-01-01T00:00:00.000Z\"\n}\n```\n\n**Health Check Response:**\n\n```json\n{\n  \"status\": 0,\n  \"message\": \"Service is healthy\",\n  \"timestamp\": \"2024-01-01T00:00:00.000Z\",\n  \"environment\": \"production\",\n  \"nodeVersion\": \"v22.15.1\",\n  \"redis\": \"connected\",\n  \"uptime\": 123.456\n}\n```\n\n### Example Usage\n\n```bash\n# Test health endpoint\ncurl https://your-app.vercel.app/api/health\n\n# Extract article content (GET method - simple and easy)\ncurl \"https://your-app.vercel.app/api?key=your_secret_key\u0026url=https://example.com/article\"\n\n# Extract article content (POST method - recommended for long URLs)\ncurl -X POST https://your-app.vercel.app/api \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"key\": \"your_secret_key\",\n    \"url\": \"https://example.com/article\"\n  }'\n```\n\n## 🛠️ Development\n\n### Available Scripts\n\n- `vercel dev` - Start local development server\n- `npm run build` - Build the project (creates public directory for 
Vercel)\n- `npm run deploy` - Deploy to production\n- `npm run deploy:staging` - Deploy to staging\n- `npm run lint` - Run ESLint for code quality\n- `npm run format` - Format code with Prettier\n- `npm test` - Run project validation tests\n- `npm run clean` - Clean Vercel build files\n\n### Code Quality\n\nThe project uses modern development tools:\n\n- **ESLint v9** with flat config for code linting\n- **Prettier** for consistent code formatting\n- **ES2022** features for modern JavaScript\n- **Comprehensive testing** with automated validation\n\n### Local Development\n\n1. **Install Vercel CLI globally:**\n\n   ```bash\n   npm i -g vercel\n   ```\n\n2. **Link your project:**\n\n   ```bash\n   vercel link\n   ```\n\n3. **Run locally:**\n   ```bash\n   vercel dev\n   ```\n\n## 📦 Dependencies\n\n### Production Dependencies\n\n| Package | Version | Status | Purpose |\n|---------|---------|--------|---------|\n| `@extractus/article-extractor` | ^8.0.20 | ✅ **Active** | Extracts article content, metadata, and structured data from HTML |\n| `@upstash/redis` | ^1.35.6 | ✅ **Active** | Serverless Redis client for caching with REST API |\n| `humanoid-js` | ^1.0.1 | ⚠️ **Deprecated** | Primary Cloudflare bypass (7 years old, but still functional) |\n| `impit` | ^0.6.0 | ✅ **Active** | HTTP client with browser impersonation for secondary bypass |\n\n### Development Dependencies\n\n| Package | Version | Status | Purpose |\n|---------|---------|--------|---------|\n| `eslint` | ^9.38.0 | ✅ **Active** | Code linting with flat config support |\n| `globals` | ^16.4.0 | ✅ **Active** | ESLint global variables for Node.js v24 compatibility |\n| `prettier` | ^3.6.2 | ✅ **Active** | Code formatting |\n\n### 📝 Dependency Notes\n\n**humanoid-js (⚠️ Unmaintained)**\n- Last updated: 7 years ago (2018)\n- Status: Works for basic-medium Cloudflare protection\n- Why we keep it: Simple, lightweight, no browser needed\n- Fallback: `impit` automatically used if humanoid-js fails\n- Future: 
Will replace when it stops working or better alternatives emerge\n\n**Why This Approach Works:**\n- ✅ Two bypass strategies provide redundancy\n- ✅ Automatic fallback ensures reliability\n- ✅ All dependencies work on Vercel free tier\n- ✅ No browser automation needed (keeps function size \u003c50MB)\n- ✅ Total package size: ~15MB (well under 50MB limit)\n\n### 🔄 Update Strategy\n\n```bash\n# Update all dependencies (safe - follows semver)\nnpm update\n\n# Check for outdated packages\nnpm outdated\n\n# Rebuild native modules after Node.js upgrade\nnpm rebuild\n```\n\n## 🏗️ Architecture\n\n### Data Flow\n\n```\nRequest → Validate Key \u0026 URL\n    ↓\nCheck Redis Cache\n    ↓\nCache Hit? → Return Cached Article ✅\n    ↓\nCache Miss? → Fetch with Bypass Strategy\n    ↓\nTry humanoid-js → Success? → Extract \u0026 Cache → Return ✅\n    ↓\nFailed? → Try impit → Success? → Extract \u0026 Cache → Return ✅\n    ↓\nFailed? → Return Error ❌\n```\n\n### Bypass Strategy Logic\n\n```\n// Automatic fallback system\n1. Try humanoid-js (fast, lightweight)\n   ↓ Success → Cache \u0026 Return\n   ↓ Fail\n2. Try impit (browser impersonation)\n   ↓ Success → Cache \u0026 Return\n   ↓ Fail\n3. 
Return error with details\n```\n\n### Content Validation Flow\n\n```\nExtract Article → Validate Content\n    ↓\nCheck for:\n- Cookie walls (40+ confidence threshold)\n- Paywalls (30+ confidence threshold)  \n- Short content (\u003c 200 chars)\n- Missing title (\u003c 10 chars)\n    ↓\nReturn validation object with:\n- isValid: boolean\n- hasBlocker: boolean\n- issues: array\n- quality: metrics\n```\n\n## 🎯 Use Cases\n\n### ✅ **What This API Is Great For:**\n- 📰 News aggregators\n- 📱 RSS feed readers\n- 🔖 Bookmark managers with content preview\n- 📊 Content analysis tools\n- 🤖 Research bots\n- 📚 Article archiving services\n- 🔍 Content discovery platforms\n\n### ⚠️ **Limitations:**\n- **Cookie Walls**: Detects but cannot automatically accept (requires browser automation)\n- **Paywalls**: Detects but cannot bypass (premium content protected)\n- **JavaScript-heavy sites**: May return incomplete content\n- **Rate limiting**: Subject to target site's rate limits\n- **Dynamic content**: May miss content loaded via AJAX after initial render\n\n### 💡 **Best Practices:**\n- Cache aggressively (10-day default is reasonable for most content)\n- Handle `validation.hasBlocker` in your client code\n- Monitor `strategy` field to track bypass success rates\n- Use POST for long URLs (avoid URL length limits)\n- Implement retry logic with exponential backoff\n- Check `cached` field to understand performance\n\n## 🔧 Configuration\n\n### Environment Variables\n\n| Variable | Required | Default | Description |\n|----------|----------|---------|-------------|\n| `UPSTASH_REDIS_REST_TOKEN` | ✅ Yes | - | Your Upstash Redis REST token |\n| `UPSTASH_REDIS_REST_URL` | ✅ Yes | - | Your Upstash Redis REST URL (https://...) 
|\n| `SECRET_KEY` | ✅ Yes | - | Comma-separated API keys for authentication |\n| `REDIS_CACHE_DAYS` | ❌ No | `10` | Cache duration in days (recommend 10-30) |\n| `NODE_ENV` | ❌ No | `development` | Environment (`development`, `production`) |\n\n### Example Configuration\n\n**`.env.local` for local development:**\n```bash\nUPSTASH_REDIS_REST_TOKEN=xxxx...\nUPSTASH_REDIS_REST_URL=https://frank-lizard-12345.upstash.io\nSECRET_KEY=my_dev_key_123,another_key_456\nREDIS_CACHE_DAYS=10\nNODE_ENV=development\n```\n\n**Vercel Environment Variables:**\n1. Go to your Vercel project → Settings → Environment Variables\n2. Add each variable for Production, Preview, and Development\n3. Vercel will automatically inject them during deployment\n\n### Cache Configuration Recommendations\n\n| Content Type | Recommended TTL | Setting |\n|--------------|----------------|---------|\n| News articles | 1-3 days | `REDIS_CACHE_DAYS=1` |\n| Blog posts | 7-14 days | `REDIS_CACHE_DAYS=7` |\n| Static content | 30+ days | `REDIS_CACHE_DAYS=30` |\n| **General use (default)** | **10 days** | `REDIS_CACHE_DAYS=10` |\n\n## 🚀 Deployment\n\n### Deploy to Vercel\n\n**Quick Deploy:**\n```bash\n# Production deployment\nnpm run deploy\n\n# Staging deployment\nnpm run deploy:staging\n```\n\n**First-time Setup:**\n1. Install Vercel CLI: `npm i -g vercel`\n2. Link project: `vercel link`\n3. Add environment variables in Vercel dashboard\n4. Deploy: `npm run deploy`\n\n### Keep-Alive Configuration\n\nServerless functions can go \"cold\" after inactivity. 
To keep your function and Upstash Redis connection active, we've configured a daily cron job that pings the health endpoint.\n\n**Built-in Solution (Vercel Cron Jobs):**\n- ✅ Already configured in `vercel.json`\n- ✅ Runs daily at 12:00 UTC\n- ✅ Included with the Vercel Pro plan (or use the alternatives below)\n- ✅ No external dependencies\n\nThe cron job is configured to call `/api/health` once per day, which:\n- Keeps the serverless function warm\n- Tests Redis connectivity\n- Ensures the database stays active\n\n**Alternative Free Solutions:**\n\nIf you're on Vercel's free tier (which doesn't include cron jobs), use one of these free external services:\n\n1. **UptimeRobot** (Recommended - Free tier: 50 monitors)\n   - URL: https://uptimerobot.com\n   - Setup: Create a monitor → HTTP(s) → Your health endpoint URL\n   - Interval: Every 24 hours is enough (the free tier allows intervals down to 5 minutes)\n   - Free tier: 50 monitors, 5-minute intervals\n\n2. **Cron-Job.org** (Free)\n   - URL: https://cron-job.org\n   - Setup: Create job → HTTP Request → Your health endpoint URL\n   - Schedule: `0 12 * * *` (daily at 12:00 UTC)\n   - Free tier: Unlimited jobs, 1-minute minimum interval\n\n3. **EasyCron** (Free tier available)\n   - URL: https://www.easycron.com\n   - Setup: Create cron job → HTTP GET → Your health endpoint URL\n   - Schedule: Daily\n   - Free tier: 1 job, 1-hour minimum interval\n\n4. 
**GitHub Actions** (If your repo is public)\n   - Create `.github/workflows/keep-alive.yml`:\n   ```yaml\n   name: Keep Alive\n   on:\n     schedule:\n       - cron: '0 12 * * *'  # Daily at 12:00 UTC\n   jobs:\n     ping:\n       runs-on: ubuntu-latest\n       steps:\n         - name: Ping health endpoint\n           run: curl -f ${{ secrets.HEALTH_ENDPOINT_URL }} || exit 1\n   ```\n\n**Health Endpoint URL:**\n```\nhttps://your-app.vercel.app/api/health\n```\n\nReplace `your-app` with your actual Vercel deployment URL.\n\n## 🧪 Testing\n\n### Automated Tests\n\nThe project includes 7 automated validation checks:\n\n```bash\nnpm test\n```\n\n**What's tested:**\n1. ✅ Project structure (all required files exist)\n2. ✅ Code quality (ESLint passes)\n3. ✅ Package scripts (deploy, test, lint, etc.)\n4. ✅ Dependencies (all installed correctly)\n5. ✅ Node.js compatibility (v18+)\n6. ✅ Module exports (fetcher functions work)\n7. ✅ Environment template (all variables documented)\n\n## 🤝 Contributing\n\nWe welcome contributions! Please follow these steps:\n\n1. Fork the repository\n2. Create a feature branch: `git checkout -b feature/amazing-feature`\n3. Commit your changes: `git commit -m 'Add amazing feature'`\n4. Push to the branch: `git push origin feature/amazing-feature`\n5. 
Open a Pull Request\n\n### Development Guidelines\n\n- Follow ESLint rules (run `npm run lint`)\n- Use Prettier for formatting (run `npm run format`)\n- Write meaningful commit messages\n- Test your changes locally before submitting\n- Ensure all tests pass (`npm test`)\n- Update README if adding new features\n- Keep dependencies up to date\n\n## 📝 License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n---\n\n**Made with ❤️ by Davod Mozafari**\n\n[![MIT License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)\n[![Node.js Version](https://img.shields.io/badge/node-%3E%3D18.0.0-brightgreen.svg)](https://nodejs.org/)\n[![Vercel](https://img.shields.io/badge/Deploy-Vercel-black.svg)](https://vercel.com/)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdavodm%2Farticle-export","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdavodm%2Farticle-export","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdavodm%2Farticle-export/lists"}