---
title: "Instagram Metadata Scraper"
output: github_document
---

### OPEN QUESTIONS:

* How to obtain an exhaustive list of location IDs?
* What are the API call limits? How many calls can you make in a row? What is the reset time? Do IPs ever get blocked? (Running a network speed test to address this.)
* How to integrate proxy use with httr? 1) Obtain a list of proxies. 2) Build fault tolerance. (A sketch follows below.)
* Parallelize network requests? Perhaps launch multiple EC2 instances?
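On the proxy question, here is a minimal sketch of what httr integration might look like. It assumes a hand-maintained proxy list (the hosts and ports below are placeholders) and uses `httr::RETRY` plus `httr::use_proxy` for basic fault tolerance:

```{r, eval=FALSE}
library(httr)

# Placeholder proxy list; in practice this would come from a proxy provider.
proxies <- data.frame(
  host = c("203.0.113.10", "203.0.113.11"),
  port = c(8080, 8080)
)

request_via_proxy <- function(url) {
  for (i in seq_len(nrow(proxies))) {
    resp <- tryCatch(
      RETRY(
        "GET", url,
        use_proxy(proxies$host[i], proxies$port[i]),
        times      = 3,  # retry transient failures on each proxy
        pause_base = 2   # exponential backoff between attempts
      ),
      error = function(e) NULL
    )
    # Move on to the next proxy after a connection failure or error status
    if (!is.null(resp) && status_code(resp) < 400) return(resp)
  }
  stop("All proxies failed for: ", url)
}
```

Parallelizing across proxies (or EC2 instances) could then wrap this in `parallel::mclapply` or similar.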

# Instagram Scraper

This package focuses on metadata analysis of Instagram posts, with an emphasis on geo-location.

Unlike other Instagram scraping projects, the aim of this package is deliberately narrow:

* Search for Instagram/Facebook/Foursquare place IDs
* Retrieve metadata about the posts from that location

This package does not download the media (photos/videos) associated with Instagram posts.

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

The first function searches for location IDs matching a text string of interest. Let's search for posts around Williamsburg:

```{r, message=FALSE, warning=FALSE}
source("search-for-location-ids.R")

possible_locations <- search_for_location_ids(location = "Williamsburg, Brooklyn")

possible_locations
```

The `pk` column contains the location IDs we need to hit Instagram's GraphQL API. The correct location appears to be `242698464`. We can feed that to the subsequent function `get_posts_from_location` to retrieve metadata about the first batch of posts.

```{r}
source("scrape-posts-from-location.R")

posts <- get_posts_from_location(location_id = 242698464, after = 0)

posts
```

Pass the `end_cursor` value returned by `get_posts_from_location` as the `after` parameter in a subsequent call; that call will return the next batch of posts.
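
For example, a rough pagination loop might look like the following. This is a sketch: it assumes the return value exposes the cursor as `posts$end_cursor`, which may need adjusting to the actual return structure:

```{r, eval=FALSE}
# Sketch of a pagination loop over five batches of posts.
all_posts <- list()
cursor <- 0

for (i in 1:5) {
  batch <- get_posts_from_location(location_id = 242698464, after = cursor)
  all_posts[[i]] <- batch
  cursor <- batch$end_cursor  # assumed field name for the cursor
  Sys.sleep(2)                # small pause to stay under rate limits
}
```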