https://github.com/kern/fb_scrape
A simple scraper for Facebook Groups.
https://github.com/kern/fb_scrape
Last synced: about 1 month ago
JSON representation
A simple scraper for Facebook Groups.
- Host: GitHub
- URL: https://github.com/kern/fb_scrape
- Owner: kern
- License: bsd-3-clause
- Created: 2015-08-03T05:09:20.000Z (almost 10 years ago)
- Default Branch: master
- Last Pushed: 2015-10-20T20:40:17.000Z (over 9 years ago)
- Last Synced: 2025-03-26T13:04:56.096Z (about 2 months ago)
- Language: Ruby
- Size: 105 KB
- Stars: 27
- Watchers: 3
- Forks: 8
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# fb_scrape
A simple scraper for Facebook Groups.
## Requirements
* Ruby 2.x.x
* A Facebook Graph API access token with the `user_managed_groups` permission## Usage
There are two steps to scraping posts. First, every original post's ID is gathered off the group's feed. Second, each post and all of its comments/likes are fetched in parallel and assembled into a single CSV.
You'll need to export your Facebook Graph API access token as an environment variable. You can retrieve an access token with the `user_managed_groups` permission from the [Graph API Explorer](https://developers.facebook.com/tools/explorer/).
$ export ACCESS_TOKEN=[ACCESS_TOKEN_GOES_HERE]
To get every original post ID in a group whose ID is `GROUP_ID`:
$ ruby fb_scrape.rb post_ids GROUP_ID
The IDs will be fetched in large chunks and printed to `STDOUT`, one per line. You can write the IDs to a file for safe keeping:
$ ruby fb_scrape.rb post_ids GROUP_ID > ids.txt
Next we fetch the posts/comments/likes in parallel. Post IDs are accepted via `STDIN` and a CSV is printed to `STDOUT` making it easy to use pipes:
$ cat ids.txt | ruby fb_scrape.rb fetch > scraped_data.csv
You can fetch posts from multiple groups simultaneously and they will be tagged
appropriately. If you don't care about saving the post IDs, you can chain both commands together for extra Unix-y goodness:$ ruby fb_scrape.rb post_ids GROUP_ID | ruby fb_scrap.rb fetch > scraped_data.csv
The CSV contains all of the scraped data in a consistent format. Data is written incrementally and failed requests (usually due to rate limiting) will be retried every 5 minutes. For very large groups, the CSV will be too big for Excel or other GUI programs to manipulate, so consider importing it into [Google BigQuery](https://cloud.google.com/bigquery/) or R.
## Filtering
If you'd like to filter data from the dataset, you can do so by piping into the `filter` command:
$ cat scraped_data.csv | ruby fb_scrape.rb filter group_name "Hackathon Hackers" > hh_data.csv
The first argument is the field to use in the dataset for filtering. The second argument is a regular expression that is compared to the field's value for each row, and if matched, adds the row to the output dataset (printed to `STDOUT`).
## Precautions
This scraper will make a hell of a lot of requests using your access token. Be wary that Facebook will inevitably rate limit you, but this is temporary and resets within an hour. If the fetcher detects rate limiting, it will automatically retry the request in 5 minutes.
If a user has blocked you, you will not be able to retrieve their posts and an error will be printed. These are safe to ignore.
Please use this software for good, not evil.
## License
[BSD 3-Clause](https://github.com/kern/fb_scrape/blob/master/LICENSE)