# Instamancer

[![Quality](https://img.shields.io/codacy/grade/98066a13fa444845aa3902d180581b86.svg)](https://app.codacy.com/project/ScriptSmith/instamancer/dashboard)
[![Coverage](https://img.shields.io/codacy/coverage/98066a13fa444845aa3902d180581b86.svg)](https://app.codacy.com/project/ScriptSmith/instamancer/dashboard)
[![Speed](https://firebasestorage.googleapis.com/v0/b/instagram-speed-test.appspot.com/o/instamancer.svg?alt=media&token=dcc3e623-ee88-4d74-ae86-2d969a1cd8ad)](https://scriptsmith.github.io/instagram-speed-test)
[![NPM](https://img.shields.io/npm/v/instamancer.svg)](https://www.npmjs.com/package/instamancer)
[![Dependencies](https://david-dm.org/scriptsmith/instamancer/status.svg)](https://david-dm.org/scriptsmith/instamancer)
[![Chat](https://img.shields.io/gitter/room/instamancer/instamancer.svg)](https://gitter.im/instamancer)

Scrape Instagram's API with Puppeteer.

###### [Install](#Install) | [Usage](#Usage) | [Comparison](#Comparison) | [Website](https://scriptsmith.github.io/instamancer/) | [FAQ](FAQ.md) | [Examples](examples/README.md)


**Notice:** Instagram's Web UI and API now require users to be logged in to access hashtag and account endpoints through a browser. Because Instamancer is designed to access publicly available data, it currently does not work as intended. Given that this change is unlikely to be reversed, Instamancer will remain unsupported and unmaintained indefinitely. Please use [this pinned issue](https://github.com/ScriptSmith/instamancer/issues/58) to discuss.


Instamancer is a new type of scraping tool that leverages Puppeteer's ability to intercept requests made by a webpage to an API.

Read more about how Instamancer works [here](https://scriptsmith.github.io/instamancer/).
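
As a rough illustration of the interception idea (not Instamancer's actual implementation), the sketch below builds a response handler that keeps only JSON payloads from API-looking URLs. The `ResponseLike` interface, the `isApiUrl` and `collectApiResponses` helpers, and the `/graphql/query` URL fragment are assumptions made for this example. With Puppeteer, a handler of this shape could be attached with `page.on("response", handler)`.

```typescript
// Minimal structural stand-in for the parts of a Puppeteer response used here.
// Hypothetical interface for this sketch; Puppeteer responses expose url() and json().
interface ResponseLike {
    url(): string;
    json(): Promise<unknown>;
}

// Assumed marker for API traffic; the real endpoint paths may differ.
const API_FRAGMENT = "/graphql/query";

// Decides whether a response URL looks like an API call worth keeping.
function isApiUrl(url: string): boolean {
    return url.includes(API_FRAGMENT);
}

// Builds a handler that stores the parsed body of every matching response.
function collectApiResponses(sink: unknown[]): (response: ResponseLike) => Promise<void> {
    return async (response) => {
        if (!isApiUrl(response.url())) {
            return; // Ignore images, scripts, and other page traffic
        }
        try {
            sink.push(await response.json());
        } catch (_err) {
            // Body was not valid JSON; skip it
        }
    };
}

// With Puppeteer this could be wired up roughly as:
//   const captured: unknown[] = [];
//   page.on("response", collectApiResponses(captured));
```

The page itself performs the API requests while it loads and scrolls, so the scraper only has to observe the traffic rather than reproduce the request signing or headers.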

### Features
- Scrape hashtags, users' posts, and individual posts
- Download images, albums, and videos
- Output JSON, CSV
- Batch scraping
- Search hashtags, users, and locations
- API response validation
- Upload files to [S3](https://github.com/ScriptSmith/instamancer/blob/master/FAQ.md#how-do-i-use-the---bucket-flag-and-s3) and [depot](https://github.com/ScriptSmith/instamancer/blob/master/FAQ.md#how-do-i-use-the---depot-flag-and-depot)
- [Plugins](plugins)

### Data
Metadata that Instamancer is able to gather from posts:

- Text
- Timestamps
- Tagged users
- Accessibility captions
- Like counts
- Comment counts
- Images (Thumbnails, Dimensions, URLs)
- Videos (URL, View count, Duration)
- Comments (Timestamp, Text, Like count, User)
- User (Username, Full name, Profile picture, Profile privacy)
- Location (Name, Street, Zip code, City, Region, Country)
- Sponsored status
- Gating information
- Fact checking information

## Install

#### Linux
Enable user namespace cloning:
```
sysctl -w kernel.unprivileged_userns_clone=1
```

Or run without a sandbox:

```
# WARNING: unsafe
export NO_SANDBOX=true
```

See [Puppeteer troubleshooting](https://github.com/GoogleChrome/puppeteer/blob/master/docs/troubleshooting.md#chrome-headless-fails-due-to-sandbox-issues).

#### Without downloading Chromium
If you wish to install Instamancer without downloading Chromium, set the `PUPPETEER_SKIP_CHROMIUM_DOWNLOAD` environment variable before installation:

```
export PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true
```

### From NPM

```
npm install -g instamancer
```

If you're using root to install globally, use the following command to install the Puppeteer dependency:

```
sudo npm install -g instamancer --unsafe-perm=true
```

### From NPX

```
npx instamancer
```

### From this repository
```
git clone https://github.com/ScriptSmith/instamancer.git
cd instamancer
npm install
npm run build
npm install -g
```

## Usage

### Command Line
```
$ instamancer
Usage: instamancer [options]

Commands:
  instamancer hashtag [id]       Scrape a hashtag
  instamancer user [id]          Scrape a users posts
  instamancer post [ids]         Scrape a comma-separated list of posts
  instamancer search [query]     Perform a search of users, tags and places
  instamancer batch [batchfile]  Read newline-separated arguments from a file

Configuration
  --count, -c    Number of posts to download (0 for all)  [number] [default: 0]
  --full, -f     Retrieve full post data  [boolean] [default: false]
  --sleep, -s    Seconds to sleep between interactions  [number] [default: 2]
  --graft, -g    Enable grafting  [boolean] [default: true]
  --browser, -b  Browser path. Defaults to the puppeteer version  [string]
  --sameBrowser  Use a single browser when grafting  [boolean] [default: false]

Download
  --download, -d      Save images from posts  [boolean] [default: false]
  --downdir           Download path  [default: "downloads/[endpoint]/[id]"]
  --video, -v         Download videos (requires full)  [boolean] [default: false]
  --sync              Force download between requests  [boolean] [default: false]
  --threads, -k       Parallel download / depot threads  [number] [default: 4]
  --waitDownload, -w  Download media after scraping  [boolean] [default: false]

Upload
  --bucket  Upload files to an AWS S3 bucket  [string]
  --depot   Upload files to a URL with a PUT request (depot)  [string]

Output
  --file, -o       Output filename. '-' for stdout  [string] [default: "[id]"]
  --type, -t       Filetype  [choices: "csv", "json", "both"] [default: "json"]
  --mediaPath, -m  Add filepaths to _mediaPath  [boolean] [default: false]

Display
  --visible    Show browser on the screen  [boolean] [default: false]
  --quiet, -q  Disable progress output  [boolean] [default: false]

Logging
  --logging, -l  [choices: "none", "error", "info", "debug"] [default: "none"]
  --logfile      Log file name  [string] [default: "instamancer.log"]

Validation
  --strict  Throw an error on response type mismatch  [boolean] [default: false]

Plugins
  --plugin, -p  Use a plugin from the plugins directory  [array] [default: []]

Options:
  --help     Show help  [boolean]
  --version  Show version number  [boolean]

Examples:
  instamancer hashtag instagood -fvd        Download all the available posts,
                                            and their media from #instagood
  instamancer user arianagrande --type=csv  Download Ariana Grande's posts to a
  --logging=info --visible                  CSV file with a non-headless
                                            browser, and log all events

Source code available at https://github.com/ScriptSmith/instamancer

```
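
The `batch` command reads newline-separated arguments from a file. The sketch below generates such a file's contents, assuming each line holds the arguments that would otherwise follow `instamancer` on the command line. The `BatchJob` type and both helper functions are hypothetical names introduced for this example, and only commands and flags from the help text above are used.

```typescript
// One scraping job, using only commands and flags listed in the help text.
interface BatchJob {
    command: "hashtag" | "user" | "post" | "search";
    target: string;   // id, comma-separated ids, or search query
    count?: number;   // maps to --count
    full?: boolean;   // maps to --full
}

// Turns a job into one line of arguments (assumed batch file format).
function toBatchLine(job: BatchJob): string {
    const args = [job.command, job.target];
    if (job.count !== undefined) {
        args.push(`--count=${job.count}`);
    }
    if (job.full) {
        args.push("--full");
    }
    return args.join(" ");
}

// Joins all jobs with newlines, ready to be written to a batch file.
function buildBatchFile(jobs: BatchJob[]): string {
    return jobs.map(toBatchLine).join("\n");
}

const batch = buildBatchFile([
    {command: "hashtag", target: "instagood", count: 10},
    {command: "user", target: "arianagrande", full: true},
]);
console.log(batch);
```

The resulting text could be saved to a file and passed as `instamancer batch [batchfile]`. Targets containing spaces are not quoted by this sketch.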

### Module

ES2018 TypeScript example:
```typescript
import {createApi, IOptions} from "instamancer";

const options: IOptions = {
    total: 10
};
const hashtag = createApi("hashtag", "beach", options);

(async () => {
    for await (const post of hashtag.generator()) {
        console.log(post);
    }
})();
```
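
Because the example above consumes posts through `for await`, results can also be gathered into an array. The following is a minimal sketch of a generic helper that works with any async iterable. `take` and `fakePosts` are hypothetical names for this example, and `fakePosts` merely stands in for a real `hashtag.generator()` so the snippet runs without a browser.

```typescript
// Collects up to `limit` items from any async iterable, then stops iterating.
async function take<T>(source: AsyncIterable<T>, limit: number): Promise<T[]> {
    const items: T[] = [];
    if (limit <= 0) {
        return items;
    }
    for await (const item of source) {
        items.push(item);
        if (items.length >= limit) {
            break; // Leaving the loop early also closes the generator
        }
    }
    return items;
}

// Stand-in generator for demonstration; a real run would pass hashtag.generator().
async function* fakePosts(): AsyncGenerator<{id: number}> {
    let id = 0;
    while (true) {
        yield {id: id++};
    }
}

take(fakePosts(), 3).then((posts) => console.log(posts.length));
```

Stopping on the caller's side is independent of the `total` option, which limits how many posts the scraper itself retrieves.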

#### Generator functions

```typescript
import {createApi} from "instamancer"

createApi("hashtag", id, options);
createApi("user", id, options);
createApi("post", ids, options);
createApi("search", query, options);
```

#### Options
```typescript
const options: Instamancer.IOptions = {
    // Total posts to download. 0 for unlimited
    total: number,

    // Run Chrome in headless mode
    headless: boolean,

    // Logging events
    logger: winston.Logger,

    // Run without output to stdout
    silent: boolean,

    // Time to sleep between interactions with the page
    sleepTime: number,

    // Throw an error if type validation fails
    strict: boolean,

    // Time to sleep when rate-limited
    hibernationTime: number,

    // Enable the grafting process
    enableGrafting: boolean,

    // Extract the full amount of information from the API
    fullAPI: boolean,

    // Use a proxy in Chrome to connect to Instagram
    proxyURL: string,

    // Location of the chromium / chrome binary executable
    executablePath: string,

    // Custom io-ts validator
    validator: Type,

    // Custom plugins
    plugins: IPlugin[]
}
```
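
The listing above shows field types rather than values. As a hedged sketch of what a concrete configuration might look like, the snippet below overlays caller-supplied values on a set of defaults. `BasicOptions` is a local mirror of a few documented fields rather than the library's own type, and the default values are borrowed from the CLI help (`--count`, `--sleep`, `--graft`, `--full`, `--strict`, `--visible`); the module's own defaults may differ.

```typescript
// Local mirror of a few documented option fields (not imported from instamancer).
interface BasicOptions {
    total: number;
    headless: boolean;
    sleepTime: number;
    enableGrafting: boolean;
    fullAPI: boolean;
    strict: boolean;
}

// Illustrative defaults taken from the CLI help; the module's defaults may differ.
const cliLikeDefaults: BasicOptions = {
    total: 0,
    headless: true,
    sleepTime: 2,
    enableGrafting: true,
    fullAPI: false,
    strict: false,
};

// Overlays caller-supplied fields on top of the defaults.
function withDefaults(overrides: Partial<BasicOptions>): BasicOptions {
    return {...cliLikeDefaults, ...overrides};
}

const merged = withDefaults({total: 10, fullAPI: true});
console.log(merged.total, merged.sleepTime, merged.fullAPI);
```

An object of this shape could then be passed as the third argument to `createApi`, assuming it satisfies the library's `IOptions` type.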

## Comparison

A comparison of Instagram scraping tools. Please suggest more tools and criteria through a pull request.

To see a speed comparison, visit [this page](https://scriptsmith.github.io/instagram-speed-test).


| Tool | Hashtags | Users | Tagged posts | Locations | Posts | Stories | Login not required | Private feeds | Batch mode | Plugins | Command-line | Library/Module | Download media | Download metadata | Scraping method | Daily builds | Main language | Speed | License | Last commit | Open Issues | Closed Issues | Build status | Test coverage | Code quality |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Instamancer | :heavy_check_mark: | :heavy_check_mark: | :x: | :x: | :heavy_check_mark: | :x: | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | Web API request interception | :heavy_check_mark: | Typescript | | | | | | | | |
| Instaphyte | :heavy_check_mark: | :x: | :x: | :x: | :x: | :x: | :heavy_check_mark: | :x: | :x: | :x: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | Web API simulation | :heavy_check_mark: | Python | | | | | | | | |
| Instaloader | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :x: | :x: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | Web API simulation | :x: | Python | | | | | | | :question: | :question: |
| Instalooter | :heavy_check_mark: | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :x: | :x: | :heavy_check_mark: | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | Web API simulation | :x: | Python | | | | | | | | |
| Instagram crawler | :heavy_check_mark: | :heavy_check_mark: | :x: | :x: | :heavy_check_mark: | :x: | :heavy_check_mark: | :x: | :x: | :x: | :heavy_check_mark: | :heavy_check_mark: | :x: | :heavy_check_mark: | Web DOM reading | :x: | Python | :question: | | | | | | :question: | :question: |
| Instagram Scraper | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :x: | :heavy_check_mark: | :x: | :heavy_check_mark: | :x: | :x: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | Web API simulation | :x: | Python | | | | | | | :question: | :question: |
| Instagram Private API | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :x: | :x: | :x: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | App and Web API simulation | :x: | Python | :question: | | | | | | :question: | :question: |
| Instagram PHP Scraper | :heavy_check_mark: | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :x: | :heavy_check_mark: | :heavy_check_mark: | :x: | :x: | :x: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | Web API simulation | :x: | PHP | :question: | | | | | :question: | :question: | :question: |