https://github.com/instagram-automations/instagram-web-scraper
instagram web scraper and automation toolkit
anti-detect automation bot cli docker instagram instagram-web-scraper nodejs proxy python rate-limits selenium srarper web
Last synced: 8 months ago
- Host: GitHub
- URL: https://github.com/instagram-automations/instagram-web-scraper
- Owner: Instagram-Automations
- Created: 2025-10-10T19:28:01.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-10-10T19:35:11.000Z (8 months ago)
- Last Synced: 2025-10-14T19:04:41.886Z (8 months ago)
- Topics: anti-detect, automation, bot, cli, docker, instagram, instagram-web-scraper, nodejs, proxy, python, rate-limits, selenium, srarper, web
- Homepage:
- Size: 1.37 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 3
Metadata Files:
- Readme: README.md
README
# instagram web scraper
A production-ready boilerplate to collect publicly available Instagram web data (profiles, posts, hashtags) using safe automation patterns, rotating proxies, and human-like delays. Built for agencies, researchers, and growth teams that want reliable scraping with lower block risk.
For discussion, queries, and freelance work — reach out 👆
---
## Introduction
> This repository provides a modular Instagram web scraping starter that focuses on resilience (anti-detect flows, rotating proxies, session reuse) and clarity (typed schema, storage adapters). It’s ideal for analysts, SaaS builders, and agencies that need compliant, rate-aware scraping of public pages.
### Key Benefits
1. Saves time with prebuilt Playwright/Selenium runners.
2. Scales from single run to distributed jobs.
3. Safer with proxy rotation, backoff, fingerprint & session logic.
---
## Features
| Feature | Details |
|---|---|
| Headless/Visible Browsers | Playwright or Selenium drivers with toggleable headless mode |
| Proxy Rotation | Supports residential/mobile proxies with per-request rotation |
| Session Persistence | Reuse cookies/storage to reduce challenges and CAPTCHAs |
| Human-like Throttling | Randomized delays, jitter, scrolling, and viewport variance |
| Target Modules | Profile, posts, hashtag pages (public data) with parsers |
| Output Formats | JSONL, CSV, SQLite/Postgres adapters |
| Error/Retry Logic | Exponential backoff, soft-fail queues, resumable runs |
| CLI Runner | `scrape profiles`, `scrape hashtag`, `resume` subcommands |
| Dockerized | Reproducible runs with one-line Docker start |
| Env-First Config | `.env` for proxies, rate limits, storage, headless flags |
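
The "Human-like Throttling" row can be illustrated with a minimal sketch of a randomized pause between page fetches. The function names `next_delay_ms` and `polite_sleep` are hypothetical and not taken from this repository; the default bounds mirror the example `RATE_MIN_MS` and `RATE_MAX_MS` values shown in the installation section.

```python
import random
import time
from typing import Optional


def next_delay_ms(min_ms: int = 800, max_ms: int = 2200,
                  rng: Optional[random.Random] = None) -> int:
    """Pick a random pause, in milliseconds, between two page fetches."""
    if min_ms < 0 or max_ms < min_ms:
        raise ValueError("expected 0 <= min_ms <= max_ms")
    rng = rng or random.Random()
    # randint is inclusive on both ends.
    return rng.randint(min_ms, max_ms)


def polite_sleep(min_ms: int = 800, max_ms: int = 2200) -> float:
    """Sleep for a randomized interval and return the seconds slept."""
    seconds = next_delay_ms(min_ms, max_ms) / 1000.0
    time.sleep(seconds)
    return seconds
```

Passing a seeded `random.Random` makes the delays reproducible in tests, while production runs can rely on the default generator.
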
---
## Use Cases
- Competitive research and trend tracking
- Social listening for public hashtags
- Creator discovery & lead lists (public info)
- Academic/market research on public engagement
---
## FAQs
**Q:** How do I avoid scraping warnings?
**A:** Scraping warnings (blocks or challenges) usually result from aggressive request rates, reused fingerprints, or poor IP reputation. Reduce concurrency, add randomized delays, persist sessions, rotate high-quality residential or mobile proxies, and lower fetch depth. Clearing cookies blindly can worsen flags; prefer stable sessions per account or profile, rotate user-agents only with consistent device signatures, and implement exponential backoff on 429 and other 4xx responses.
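
The exponential backoff mentioned above could look roughly like the following sketch, which uses the common "full jitter" variant. Everything here is an assumption for illustration: `fetch` is a hypothetical callable that returns an HTTP status code, and the set of retryable statuses should be adjusted to whatever the scraper actually treats as a soft failure.

```python
import random
import time
from typing import Callable

# Assumption: which statuses count as "back off and retry".
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)


def fetch_with_backoff(fetch: Callable[[], int], max_attempts: int = 5,
                       sleep: Callable[[float], None] = time.sleep,
                       base: float = 1.0, cap: float = 60.0) -> int:
    """Call fetch() until it returns a non-retryable status or attempts run out."""
    status = fetch()
    for attempt in range(max_attempts - 1):
        if status not in RETRYABLE_STATUSES:
            break
        # Wait longer (on average) after each consecutive failure.
        sleep(backoff_delay(attempt, base, cap))
        status = fetch()
    return status
```

The `sleep` parameter is injectable so the retry loop can be tested without real waiting.
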
**Q:** Does Instagram allow web scraping?
**A:** Accessing or collecting data is governed by Instagram’s Terms and your local laws. This boilerplate is for educational and compliance-oriented uses on publicly available pages. Always review and follow the platform’s terms and applicable regulations before running any scraper.
**Q:** Can web scraping be detected?
**A:** Yes. Platforms detect patterns like high request rates, identical fingerprints, datacenter IPs, and scripted navigation. Mitigate via residential/mobile proxies, realistic browser automation (Playwright/Selenium), randomized timings, scroll/viewport simulation, and consistent sessions. Even with safeguards, detection risk can’t be eliminated—only reduced.
---
## Results
> 10x faster posting schedules
> 80% engagement increase on group campaigns
> Fully automated lead response system
## Performance Metrics
Average Performance Benchmarks:
- **Speed:** 2x faster than manual posting
- **Stability:** 99.2% uptime
- **Ban Rate:** <0.5% with safe automation mode
- **Throughput:** 100+ posts/hour per session
---
## Do you have a custom project for us?
Contact Us
---
## Installation
### Pre-requisites
- Node.js or Python
- Git
- Docker (optional)
### Steps
```bash
# Clone the repo
git clone https://github.com/instagram-automations/instagram-web-scraper.git
cd instagram-web-scraper
# Install dependencies
# Node (Playwright)
npm install
npx playwright install
# or Python (Selenium/Playwright)
pip install -r requirements.txt
# Setup environment
cp .env.example .env
# then edit .env to set:
# PROXY_URL= # e.g. http://user:pass@host:port
# DRIVER=playwright # or selenium
# HEADLESS=true
# RATE_MIN_MS=800
# RATE_MAX_MS=2200
# STORAGE_DIR=.storage
# OUT_FORMAT=jsonl # csv|jsonl|sqlite|postgres
# Run (examples)
# Scrape a hashtag page (public)
npm run scrape:hashtag -- --tag "travel" --limit 50
# or
python main.py hashtag --tag "travel" --limit 50
```
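
As an illustration of how the `.env` keys above might be read and validated, here is a minimal sketch. The `Settings` class and `load_settings` function are hypothetical names, not the repository's actual code, and the defaults simply reuse the example values from the comments above. It assumes the variables are already present in the process environment (for example via Docker's `--env-file` option or a dotenv loader).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

VALID_DRIVERS = {"playwright", "selenium"}
VALID_FORMATS = {"csv", "jsonl", "sqlite", "postgres"}


@dataclass(frozen=True)
class Settings:
    proxy_url: Optional[str]
    driver: str
    headless: bool
    rate_min_ms: int
    rate_max_ms: int
    storage_dir: str
    out_format: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build a validated Settings object from environment-style key/value pairs."""
    env = os.environ if env is None else env

    driver = env.get("DRIVER", "playwright").strip().lower()
    if driver not in VALID_DRIVERS:
        raise ValueError(f"DRIVER must be one of {sorted(VALID_DRIVERS)}, got {driver!r}")

    out_format = env.get("OUT_FORMAT", "jsonl").strip().lower()
    if out_format not in VALID_FORMATS:
        raise ValueError(f"OUT_FORMAT must be one of {sorted(VALID_FORMATS)}, got {out_format!r}")

    rate_min_ms = int(env.get("RATE_MIN_MS", "800"))
    rate_max_ms = int(env.get("RATE_MAX_MS", "2200"))
    if not 0 <= rate_min_ms <= rate_max_ms:
        raise ValueError("expected 0 <= RATE_MIN_MS <= RATE_MAX_MS")

    return Settings(
        proxy_url=env.get("PROXY_URL") or None,  # empty string means "no proxy"
        driver=driver,
        headless=env.get("HEADLESS", "true").strip().lower() in {"1", "true", "yes"},
        rate_min_ms=rate_min_ms,
        rate_max_ms=rate_max_ms,
        storage_dir=env.get("STORAGE_DIR", ".storage"),
        out_format=out_format,
    )
```

Failing fast on an unknown `DRIVER` or an inverted rate range surfaces configuration mistakes before any browser session starts.
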
---
## Example Output
```json
{"type":"post","shortcode":"CxyZ12A","likes":1243,"comments":57,"caption":"Sunset shots #travel","timestamp":"2025-10-11T14:22:10Z","author":"@example"}
{"type":"profile","username":"example","followers":10422,"following":312,"posts":87,"bio":"Photographer | Traveler"}
```
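
Because the output is one JSON object per line (JSONL), it can be consumed with the standard library alone. The sketch below groups records by their `type` field; `read_records` is a hypothetical helper, and the sample lines are copied from the example output above.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List


def read_records(lines: Iterable[str]) -> Dict[str, List[dict]]:
    """Group JSONL records by their "type" field, skipping blank lines."""
    grouped = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        grouped[record.get("type", "unknown")].append(record)
    return dict(grouped)


SAMPLE = [
    '{"type":"post","shortcode":"CxyZ12A","likes":1243,"comments":57,'
    '"caption":"Sunset shots #travel","timestamp":"2025-10-11T14:22:10Z","author":"@example"}',
    '{"type":"profile","username":"example","followers":10422,"following":312,'
    '"posts":87,"bio":"Photographer | Traveler"}',
]

records = read_records(SAMPLE)
total_likes = sum(post["likes"] for post in records["post"])
```

Since `read_records` accepts any iterable of strings, an open file handle for the scraper's JSONL output can be passed in directly instead of `SAMPLE`.
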
---
## License
MIT License