https://github.com/pushshift/big-data-scripts

Miscellaneous Python and Perl Scripts for working with Big Data files that are new-line delimited JSON Objects

Host: GitHub
URL: https://github.com/pushshift/big-data-scripts
Owner: pushshift
Created: 2015-12-08T08:04:51.000Z (about 10 years ago)
Default Branch: master
Last Pushed: 2015-12-10T04:40:07.000Z (about 10 years ago)
Last Synced: 2025-01-11T20:45:35.241Z (about 1 year ago)
Language: Python
Homepage:
Size: 9.13 MB
Stars: 4
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md

# Big-Data-Scripts
Miscellaneous Python and Perl scripts for working with big data files that contain newline-delimited JSON objects

**popular-subreddits.py:**

This is a simple Python program that reads a Reddit comment dump and prints the most popular subreddits, sorted by the number of comments made in each subreddit.

Example usage: ```python popular-subreddits.py sample.bz2```
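
The counting idea can be sketched in a few lines of Python 3. This is a minimal illustration, not the code in `popular-subreddits.py`: the function name `count_subreddits` is hypothetical, and it assumes each line of the dump is one JSON object with a `subreddit` field.

```python
import bz2
import json
from collections import Counter


def count_subreddits(path):
    """Count comments per subreddit in a bz2-compressed, newline-delimited JSON dump.

    Assumption: every JSON object carries a "subreddit" field.
    """
    counts = Counter()
    # Text mode ("rt") decompresses on the fly and yields one line at a time,
    # so the whole dump never has to fit in memory.
    with bz2.open(path, "rt", encoding="utf-8") as dump:
        for line in dump:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            comment = json.loads(line)
            counts[comment.get("subreddit", "<unknown>")] += 1
    return counts
```

Calling `count_subreddits("sample.bz2").most_common(25)` would then return the 25 busiest subreddits with their comment counts, largest first.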

**popular-words.pl:**

This is a Perl script to show the most popular words from a collection of comments.

Example usage: ```bzip2 -cd sample.bz2 | ./popular-words.pl```
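
For readers who prefer Python, the same idea might look like the sketch below. It is not a port of `popular-words.pl`: the name `count_words`, the word-splitting regular expression, and the assumption that the comment text lives in a `body` field are all illustrative choices.

```python
import json
import re
from collections import Counter

# Hypothetical tokenizer: runs of ASCII letters, optionally with one inner
# apostrophe (so "don't" stays a single word).
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def count_words(lines):
    """Count lowercase words across an iterable of JSON lines.

    Assumption: the comment text is stored under the "body" key.
    """
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        comment = json.loads(line)
        text = comment.get("body") or ""  # treat a missing or null body as empty
        counts.update(WORD_RE.findall(text.lower()))
    return counts
```

Because the function accepts any iterable of lines, it can be fed `sys.stdin` to consume the output of `bzip2 -cd sample.bz2`, mirroring the pipeline above.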

**Linux command line Kung-Fu Examples**

The following commands will work on most Linux operating systems. I am using Ubuntu 14.04 for these examples.

*Pretty print JSON and learn about the JSON object structure using Python*

    bzip2 -cd sample.bz2 | head -n1 | python -m json.tool
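
The same inspection can be done from inside a Python 3 script. The helper name `pretty_first_object` below is hypothetical, and sorting the keys is a convenience for scanning the structure rather than something the pipeline above does.

```python
import bz2
import json


def pretty_first_object(path):
    """Return the first JSON object of a bz2-compressed dump, pretty-printed."""
    with bz2.open(path, "rt", encoding="utf-8") as dump:
        first_line = dump.readline()  # only the first line is decompressed and read
    return json.dumps(json.loads(first_line), indent=4, sort_keys=True)
```

`print(pretty_first_object("sample.bz2"))` shows which fields each comment object carries.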

*Print the author from each JSON block using Perl*

Make sure Cpanel::JSON::XS is installed:

    cpan Cpanel::JSON::XS

Now you can do:

    bzip2 -cd sample.bz2 | perl -MCpanel::JSON::XS -lne 'print decode_json($_)->{author}'

*Sort these authors by number of times they have made a comment*

    bzip2 -cd sample.bz2 | perl -MCpanel::JSON::XS -lne 'print decode_json($_)->{author}' | sort | uniq -c | sort -n
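
A rough Python 3 equivalent of this pipeline is sketched below. The function name `authors_by_comment_count` is hypothetical. Like `sort -n`, it lists the least active authors first, but ties are broken alphabetically here, which may not match the shell output line for line.

```python
import bz2
import json
from collections import Counter


def authors_by_comment_count(path):
    """Return (author, comment_count) pairs, fewest comments first."""
    counts = Counter()
    with bz2.open(path, "rt", encoding="utf-8") as dump:
        for line in dump:
            if line.strip():
                counts[json.loads(line)["author"]] += 1
    # Ascending by count, then by name so the order is deterministic.
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))
```

The most prolific authors end up at the tail of the returned list, just as they appear at the bottom of the `sort -n` output.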

**Location of main Reddit Data**

You can download billions of Reddit comments from my main archive. If you use this data for research, I would kindly ask that you attribute my efforts in your publication. Thank you!

Location: http://pan.whatbox.ca:36975/reddit/comments/monthly/
