https://github.com/kern/fb_scrape

A simple scraper for Facebook Groups.

Host: GitHub
URL: https://github.com/kern/fb_scrape
Owner: kern
License: bsd-3-clause
Created: 2015-08-03T05:09:20.000Z (almost 10 years ago)
Default Branch: master
Last Pushed: 2015-10-20T20:40:17.000Z (over 9 years ago)
Last Synced: 2025-03-26T13:04:56.096Z (about 2 months ago)
Language: Ruby
Size: 105 KB
Stars: 27
Watchers: 3
Forks: 8
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

# fb_scrape

A simple scraper for Facebook Groups.

## Requirements

* Ruby 2.x.x
* A Facebook Graph API access token with the `user_managed_groups` permission

## Usage

There are two steps to scraping posts. First, every original post's ID is gathered off the group's feed. Second, each post and all of its comments/likes are fetched in parallel and assembled into a single CSV.

You'll need to export your Facebook Graph API access token as an environment variable. You can retrieve an access token with the `user_managed_groups` permission from the [Graph API Explorer](https://developers.facebook.com/tools/explorer/).

    $ export ACCESS_TOKEN=[ACCESS_TOKEN_GOES_HERE]

To get every original post ID in a group whose ID is `GROUP_ID`:

    $ ruby fb_scrape.rb post_ids GROUP_ID

The IDs will be fetched in large chunks and printed to `STDOUT`, one per line. You can write the IDs to a file for safe keeping:

    $ ruby fb_scrape.rb post_ids GROUP_ID > ids.txt
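
Fetching IDs "in large chunks" generally means following the pagination links the Graph API returns with each page of a feed. The sketch below illustrates that idea only; it is not taken from `fb_scrape.rb`. The method name `collect_post_ids` and the `fetch_page` callable are hypothetical, and the response shape (a `data` array plus an optional `paging.next` URL) is assumed from the Graph API's usual paging format.

```ruby
require "json"

# Hypothetical helper, not part of fb_scrape.rb.
# Walks a paginated feed and returns every post ID it finds.
# `fetch_page` is any callable that takes a URL and returns the
# response body as a JSON string (for example, a thin HTTP wrapper).
def collect_post_ids(first_url, fetch_page)
  ids = []
  url = first_url

  while url
    page = JSON.parse(fetch_page.call(url))

    # Each page is assumed to carry its posts under "data".
    (page["data"] || []).each { |post| ids << post["id"] }

    # Assumed paging format: a "next" URL is present until the last page.
    url = (page["paging"] || {})["next"]
  end

  ids
end
```

Passing the page fetcher in as a callable keeps the paging logic separate from the HTTP client, so it can be exercised with canned responses.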

Next, fetch the posts/comments/likes in parallel. Post IDs are accepted via `STDIN` and a CSV is printed to `STDOUT`, which makes it easy to use pipes:

    $ cat ids.txt | ruby fb_scrape.rb fetch > scraped_data.csv
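
One common way to fetch many items in parallel in Ruby is a small pool of threads pulling IDs off a shared queue. The sketch below shows that pattern under stated assumptions: `fetch_in_parallel`, the `fetch_post` callable, and the `workers` default are all hypothetical, and the actual script may be structured differently.

```ruby
require "thread"

# Hypothetical sketch, not the script's actual implementation.
# Runs `fetch_post` (any callable mapping a post ID to a result)
# across a fixed number of worker threads.
# Returns [id, result] pairs in completion order, not input order.
def fetch_in_parallel(post_ids, fetch_post, workers: 4)
  queue = Queue.new
  post_ids.each { |id| queue << id }
  results = Queue.new

  threads = Array.new(workers) do
    Thread.new do
      loop do
        id = begin
          queue.pop(true) # non-blocking pop raises ThreadError when empty
        rescue ThreadError
          break           # no work left, so this worker exits
        end
        results << [id, fetch_post.call(id)]
      end
    end
  end
  threads.each(&:join)

  Array.new(results.size) { results.pop }
end
```

Because results arrive in completion order, a caller that needs stable output would sort them or key them by ID.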

You can fetch posts from multiple groups simultaneously, and each row will be tagged with its group. If you don't care about saving the post IDs, you can chain both commands together for extra Unix-y goodness:

    $ ruby fb_scrape.rb post_ids GROUP_ID | ruby fb_scrape.rb fetch > scraped_data.csv

The CSV contains all of the scraped data in a consistent format. Data is written incrementally and failed requests (usually due to rate limiting) will be retried every 5 minutes. For very large groups, the CSV will be too big for Excel or other GUI programs to manipulate, so consider importing it into [Google BigQuery](https://cloud.google.com/bigquery/) or R.
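
The retry behavior described above can be sketched as a fixed-delay retry loop. This is an illustration only: the `RateLimited` error class, the `with_retry` method, and the injectable `sleeper` are hypothetical names, and how the script actually detects rate limiting is not shown in this README.

```ruby
# Hypothetical error class standing in for "the API rate limited us".
class RateLimited < StandardError; end

# Hypothetical sketch of retrying a request after a fixed delay.
# `delay` is in seconds (300 = 5 minutes). `max_attempts: nil` retries forever.
# `sleeper` is injectable so the wait can be replaced in tests.
def with_retry(delay: 300, max_attempts: nil, sleeper: ->(seconds) { sleep(seconds) })
  attempts = 0
  begin
    attempts += 1
    yield
  rescue RateLimited
    # Give up only if a limit was set and has been reached.
    raise if max_attempts && attempts >= max_attempts
    sleeper.call(delay)
    retry
  end
end
```

A call such as `with_retry { fetch_post.call(id) }` would then block until the request succeeds, where `fetch_post` is again a placeholder for whatever performs the request.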

## Filtering

If you'd like to filter data from the dataset, you can do so by piping into the `filter` command:

    $ cat scraped_data.csv | ruby fb_scrape.rb filter group_name "Hackathon Hackers" > hh_data.csv

The first argument is the field to use in the dataset for filtering. The second argument is a regular expression that is compared to the field's value for each row, and if matched, adds the row to the output dataset (printed to `STDOUT`).
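
The filtering rule just described can be sketched with Ruby's standard `csv` library. This is a minimal illustration, assuming the input has a header row; the `filter_csv` method name is hypothetical and the real `filter` command may differ in details such as header handling.

```ruby
require "csv"

# Hypothetical sketch of the filter step, not the script's actual code.
# Copies the header row plus every row whose `field` matches `pattern`
# from `input` to `output` (both IO-like objects, e.g. STDIN and STDOUT).
def filter_csv(input, output, field, pattern)
  regex = Regexp.new(pattern)
  reader = CSV.new(input, headers: true, return_headers: true)
  writer = CSV.new(output, row_sep: "\n")

  reader.each do |row|
    # Always keep the header; keep data rows only when the field matches.
    # `to_s` guards against missing (nil) values.
    writer << row if row.header_row? || regex =~ row[field].to_s
  end
end
```

Called as `filter_csv(STDIN, STDOUT, "group_name", "Hackathon Hackers")`, this would mirror the command above. Since the second argument is treated as a regular expression, characters such as `.` or `(` need escaping to match literally.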

## Precautions

This scraper will make a hell of a lot of requests using your access token. Be aware that Facebook will inevitably rate limit you, but this is temporary and resets within an hour. If the fetcher detects rate limiting, it will automatically retry the request in 5 minutes.

If a user has blocked you, you will not be able to retrieve their posts and an error will be printed. Such errors are safe to ignore.

Please use this software for good, not evil.

## License

[BSD 3-Clause](https://github.com/kern/fb_scrape/blob/master/LICENSE)
