https://github.com/instagram-automations/instagram-web-scraper

instagram web scraper and automation toolkit
https://github.com/instagram-automations/instagram-web-scraper

anti-detect automation bot cli docker instagram instagram-web-scraper nodejs proxy python rate-limits selenium srarper web

Last synced: 8 months ago
JSON representation

instagram web scraper and automation toolkit

Host: GitHub
URL: https://github.com/instagram-automations/instagram-web-scraper
Owner: Instagram-Automations
Created: 2025-10-10T19:28:01.000Z (9 months ago)
Default Branch: main
Last Pushed: 2025-10-10T19:35:11.000Z (9 months ago)
Last Synced: 2025-10-14T19:04:41.886Z (9 months ago)
Topics: anti-detect, automation, bot, cli, docker, instagram, instagram-web-scraper, nodejs, proxy, python, rate-limits, selenium, srarper, web
Homepage:
Size: 1.37 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 3
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

# instagram web scraper

A production-ready boilerplate to collect publicly available Instagram web data (profiles, posts, hashtags) using safe automation patterns, rotating proxies, and human-like delays. Built for agencies, researchers, and growth teams that want reliable scraping with lower block risk.

For discussion, queries, and freelance work — reach out 👆

---

## Introduction
> This repository provides a modular Instagram web scraping starter that focuses on resilience (anti-detect flows, rotating proxies, session reuse) and clarity (typed schema, storage adapters). It’s ideal for analysts, SaaS builders, and agencies that need compliant, rate-aware scraping of public pages.

### Key Benefits
1. Saves time with prebuilt Playwright/Selenium runners.
2. Scales from single run to distributed jobs.
3. Safer with proxy rotation, backoff, fingerprint & session logic.

---

## Features must be in table

| Feature | Details |
|---|---|
| Headless/Visible Browsers | Playwright or Selenium drivers with toggleable headless mode |
| Proxy Rotation | Supports residential/mobile proxies with per-request rotation |
| Session Persistence | Reuse cookies/storage to reduce challenges and CAPTCHAs |
| Human-like Throttling | Randomized delays, jitter, scrolling, and viewport variance |
| Target Modules | Profile, posts, hashtag pages (public data) with parsers |
| Output Formats | JSONL, CSV, SQLite/Postgres adapters |
| Error/Retry Logic | Exponential backoff, soft-fail queues, resumable runs |
| CLI Runner | `scrape profiles`, `scrape hashtag`, `resume` subcommands |
| Dockerized | Reproducible runs with one-line Docker start |
| Env-First Config | `.env` for proxies, rate limits, storage, headless flags |

---

## Use Cases
- Competitive research and trend tracking
- Social listening for public hashtags
- Creator discovery & lead lists (public info)
- Academic/market research on public engagement

---

## FAQs

**Q:** How to remove scraping warning?
**A:** Scraping warnings (blocks/challenges) often result from aggressive request rates, reused fingerprints, or IP reputation. Reduce concurrency, add randomized delays, persist sessions, rotate high-quality residential/mobile proxies, and lower fetch depth. Clearing cookies blindly can worsen flags—prefer stable sessions per account/profile, rotate user-agents with consistent device signatures, and implement exponential backoff on 4xx/429 responses.

**Q:** Does Instagram allow web scraping?
**A:** Accessing or collecting data is governed by Instagram’s Terms and your local laws. This boilerplate is for educational and compliance-oriented uses on publicly available pages. Always review and follow the platform’s terms and applicable regulations before running any scraper.

**Q:** Can web scraping be detected?
**A:** Yes. Platforms detect patterns like high request rates, identical fingerprints, datacenter IPs, and scripted navigation. Mitigate via residential/mobile proxies, realistic browser automation (Playwright/Selenium), randomized timings, scroll/viewport simulation, and consistent sessions. Even with safeguards, detection risk can’t be eliminated—only reduced.

---

## Results
-----------------------------------
> 10x faster posting schedules
> 80% engagement increase on group campaigns
> Fully automated lead response system

## Performance Metrics
-----------------------------------
Average Performance Benchmarks:
- **Speed:** 2x faster than manual posting
- **Stability:** 99.2% uptime
- **Ban Rate:** <0.5% with safe automation mode
- **Throughput:** 100+ posts/hour per session

---

##Do you have a customize project for us ?
Contact Us

support@appilot.app

┃

pilot

┃

zee#2655

┃

whatsapp

---

## Installation

### Pre-requisites
- Node.js or Python
- Git
- Docker (optional)

### Steps
```bash
# Clone the repo
git clone https://github.com/yourusername/instagram-web-scraper.git
cd instagram-web-scraper

# Install dependencies
# Node (Playwright)
npm install
npx playwright install

# or Python (Selenium/Playwright)
pip install -r requirements.txt

# Setup environment
cp .env.example .env
# then edit .env to set:
# PROXY_URL= # e.g. http://user:pass@host:port
# DRIVER=playwright # or selenium
# HEADLESS=true
# RATE_MIN_MS=800
# RATE_MAX_MS=2200
# STORAGE_DIR=.storage
# OUT_FORMAT=jsonl # csv|jsonl|sqlite|postgres

# Run (examples)
# Scrape a hashtag page (public)
npm run scrape:hashtag -- --tag "travel" --limit 50
# or
python main.py hashtag --tag "travel" --limit 50
```

---

## Example Output

```json
{"type":"post","shortcode":"CxyZ12A","likes":1243,"comments":57,"caption":"Sunset shots #travel","timestamp":"2025-10-11T14:22:10Z","author":"@example"}
{"type":"profile","username":"example","followers":10422,"following":312,"posts":87,"bio":"Photographer | Traveler"}
```

---

## License

MIT License

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/instagram-automations/instagram-web-scraper

Awesome Lists containing this project

README