{"id":14067591,"url":"https://github.com/fdrennan/ndexr-platform","last_synced_at":"2025-07-30T02:30:57.272Z","repository":{"id":37640638,"uuid":"260098929","full_name":"fdrennan/ndexr-platform","owner":"fdrennan","description":"The NDEXR platform code","archived":false,"fork":false,"pushed_at":"2022-12-08T04:22:57.000Z","size":109703,"stargazers_count":24,"open_issues_count":15,"forks_count":8,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-12-04T08:36:13.724Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/fdrennan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-04-30T02:54:49.000Z","updated_at":"2024-03-12T12:46:08.000Z","dependencies_parsed_at":"2023-01-24T15:45:46.202Z","dependency_job_id":null,"html_url":"https://github.com/fdrennan/ndexr-platform","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/fdrennan/ndexr-platform","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fdrennan%2Fndexr-platform","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fdrennan%2Fndexr-platform/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fdrennan%2Fndexr-platform/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/fdrennan%2Fndexr-platform/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/fdrennan","download_url":"https://codeload.github.com/fdrennan/ndexr-platform/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/a
pi/v1/hosts/GitHub/repositories/fdrennan%2Fndexr-platform/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267798625,"owners_count":24145727,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-30T02:00:09.044Z","response_time":70,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-13T07:05:40.714Z","updated_at":"2025-07-30T02:30:56.278Z","avatar_url":"https://github.com/fdrennan.png","language":"R","funding_links":[],"categories":["R"],"sub_categories":[],"readme":"# NDEXR - Indexing the Reddit Platform\n\n## Visit the [Live Site](http://ndexr.com)\n\n## About This Project\nReddit has become one of the most frequently visited websites on the web. At the current time, according\nto [Ahrefs](https://ahrefs.com/blog/most-visited-websites/) Reddit is the 7th most popular website in the world. As a \ndecentralized platform, moderators are expected to do the policing of their subreddits. However, anyone can create\na subreddit and anyone can receive an API token - which to this date, allows access indistinguishable from a normal user.\n\nGiven that we have near exclusive access to Reddit data, what can we do with it? Some thoughts - \n\n1. Models can help determine bad actors on the platform\n2. The duration of high impact national events can potentially be determined - tend to follow a log normal distribution\n3. 
The API creates the opportunity to learn about data engineering best practices with real-time data\n4. A team can learn group collaboration\n5. The platform built translates to real world needs in other areas\n\n### Bots\n\nWe frequently encounter bots on the platform. Some of the most common ones are designed to make sure a post is following \nthe appropriate guidelines for the subreddit - i.e., did the post contain correct content, headers - did the author have \nkarma (internet points) to post, etc. This is done by looking at the submission data and the data about the author, which is \npublicly available.\n\nBots can do everything a human can do, including gilding, which awards a user a paid badge for good content and removes\nads for that user for a given time.\n\n### Trolls\nTrolls are human actors who intentionally work to mislead or aggravate people. A troll can change the tone of the conversation and,\ngenerally speaking, does not add context or nuance to otherwise well-intentioned discussions.\n\n## Observing a World Event\n\nHourly submissions to Reddit mentioning George Floyd - could distributions like this one help determine the\n duration of a political movement?\n\n![Viewing a World Event](images/georgefloyd.png)\n\nThis is a log-scale look at the number of different links submitted to Reddit vs the number of Subreddits observed for a single author.\nSampled 10,000 authors from approx 5 million authors.\n![Viewing Authors](images/authoractivity.png)\n\nAre some of these authors bots or not? Can we determine this? If we can, then what can we say about them?\n\n## The Network\n![This](images/ndexr_infra.png)\nAll incoming ports are blocked to external users except for 80 and 3000; the remaining ports are accessible only\n to approved IP addresses.\n\n## My Programming History\n\nI worked at a company called Digital First Media. I was hired on as a data optimization engineer. 
\nThe job was primarily working on their optimization code for online marketing campaigns. As the guy in-between, I worked\nwith qualified data engineers on one side of me and creative web developers on the other side.\n \n[Duffy](https://github.com/duffn) was definitely one of the talented ones and taught me quite a bit. While I was with \nthe company, one of my complaints was related to how much we were spending for tools we could easily make in house. One \nof those choices was whether to buy RSConnect or not. I found a way to build highly scalable R APIs using \ndocker-compose and NGINX. Duffy was the guy who knew what was needed for a solution, so he gave me quite a bit of \nguidance in understanding good infrastructure. \n\nSo that's where I learned about building really cool APIs so that I could share my output with non-R users. People used \nmy APIs, and Duffy was doing cool stuff in Data Engineering and was getting the data to me. \n\nI gravitated a bit out of the math into the tools Data Engineers used, and became interested in Python, \nSQL, Airflow etc. These guys spin up that stuff daily, so it's not impossible to learn! I started creating data \npipelines, which grew - and became difficult to maintain. I wanted to learn best practices in data engineering - because \nwhen things break, it's devastating, a time sink, and it kept me up nights.  \n\nAirflow is one of the tools for this job. It makes your scheduled jobs smooth like butter, is highly transparent \nabout the health of your network, and allows for push button runs of your code. This was far superior to cron jobs \nkicking off singular scripts.\n\n## What's Running it\n\n### Dell XPS and Lenovo Ideapad (hangin out in the kitchen)\n![](images/lenovo_xps.png)\n### Dell Poweredge (40 cores and 128gb)\n![](images/poweredge.jpeg)\n### Between the AC below and PowerEdge above, I have to choose one.... stupid\n![](images/air.jpeg)\n\nThe main components are \n\n1. 
An Airflow instance running scripts for data gathering.\n2. A Postgres database to store the data, with scheduled backups to AWS S3.\n3. An R Package I wrote for talking to AWS called `biggr` (using a Python backend - it's an R wrapper for\n `boto3` using Reticulate)\n4. An R Package I wrote for talking to Reddit called `redditor`  (using a Python backend - it's an R wrapper for `praw` using Reticulate)  \n5. An R API that serves the data generated by this pipeline to a front-end application for display\n6. A React Application which takes the data in the R API and displays it on the web.\n\n## You will Need\n1. Reddit API authentication\n2. AWS IAM Creds (Not always)\n3. Motivation to learn Docker, PostgreSQL, MongoDB, Airflow, R Packages using Reticulate, NGINX, \n and web design.\n4. Patience while Docker builds\n5. An interest in programming\n6. A pulse\n7. Oxygen\n8. Oreos\n\n## Getting Started \n1. Request permission from `fdrennan` in [NDEXR Slack](https://app.slack.com/client/TAS9MV5K2) for [RStudio](http://ndexr.com:8787) and Postgres access.\n2. Get your own set of Reddit API credentials from [Reddit](https://ssl.reddit.com/prefs/apps/) \n3. `FORK` this repository to your GitHub account\n4. Run `git clone https://github.com/YOUR_GITHUB_USERNAME/ndexr-platform.git`\n5. RUN `cd ndexr-platform`\n6. RUN `git remote add upstream https://github.com/fdrennan/ndexr-platform.git`\n7. RUN `docker build -t redditorapi --file ./DockerfileApi .`\n8. RUN `docker build -t rpy --file ./DockerfileRpy .`\n9. 
RUN `docker build -t redditorapp --file ./DockerfileShiny .`\n\n#### Once these steps are complete, contact me to see how to set your environment variables.\n```\nPOSTGRES_USER=yournewusername\nPOSTGRES_PASSWORD=yourstrongpassword\nPOSTGRES_HOST=ndexr.com\nPOSTGRES_PORT=5433\nPOSTGRES_DB=postgres\nREDDIT_CLIENT=yourclient\nREDDIT_AUTH=yourauth\nUSER_AGENT=\"datagather by /u/username\"\nUSERNAME=usernameforreddit\nPASSWORD=passwordforreddit\n```\n\n## About the Dockerfiles\nThere are three Dockerfiles that are needed: `DockerfileApi`, `DockerfileRpy`, and `DockerfileShiny`\n\n`DockerfileApi` is associated with the container needed to run an R [Plumber](https://www.rplumber.io/) API. \nStarting from the [trestletech](https://hub.docker.com/r/trestletech/plumber/) image, I add on some additional \nLinux binaries and R packages. There are two R packages in this project. One is called [biggr] and the other is called \n[redditor], which are located in `./bigger` and `./redditor-api` respectively. To build the container, run the \nfollowing:\n\n```\ndocker build -t redditorapi --file ./DockerfileApi .\n```\n\n`DockerfileRpy` is a container running both R and Python. It is based on the `python:3.7.6` image. I install R \non top of it, so I can run scheduled jobs. This container runs Airflow, which is set up in `airflower`. Original name, \nright? \n\n```\ndocker build -t rpy --file ./DockerfileRpy .\n```\n\n`DockerfileShiny` builds the container with the code and packages required to run the Shiny application.\n\n```\ndocker build -t redditorapp --file ./DockerfileShiny .\n```\n### The main DAG, man\n1. `set_up_aws`: Update AWS credentials on file for `biggr`\n2. `backup_postgres_to_s3`: Moves recent submission data from the XPS server to S3\n3. `transfer_subissions_from_s3_to_poweredge`: Grabs the data in S3 and stages it for long-term storage on the Poweredge \nin Postgres at `public.submissions`.\n4. 
`upload_submissions_to_elastic`: Takes all submissions not yet stored in Elasticsearch and saves them there; runs on the `Dell XPS` laptop\n5. `refresh_*`: Refresh the materialized views on the Poweredge once the data is received from the S3 bucket\n6. `poweredge_to_xps_meta_statistics`: Takes the submission, author, and subreddit counts and stores them in the `XPS` Postgres database. \nThis allows for updated statistics when the Poweredge server is off. \n7. `update_costs`: Once the ETL process is done, grab the latest costs from AWS and store them in the DB.\n\n![Daily Dag](images/the_daily_ndexr.png)\n\n### More Detail into the Airflow Process\n![](images/daily_ndexr.png)\n\n\n# Hop Into A Container\n\n```\ndocker exec -it  [container name]  bash\n```\n\n# Backing Up Your Data\n```\npsql -U airflow postgres \u003c postgres.bak\n```\n\n```\nscp -i \"~/ndexr.pem\" ubuntu@ndexr.com:/var/lib/postgresql/postgres/backups/postgres.bak postgres.bak\ndocker exec redditor_postgres_1 pg_restore -U airflow -d postgres /postgres.bak\n```\n\n\n# Don't run these unless you know what you are doing. I'm serious.\n```\ndocker stop $(docker ps -a -q)\ndocker rm $(docker ps -a -q)\ndocker volume prune\ndocker volume rm  redditor_postgres_data\n```\n\n# Restoring Postgres from Backup\n```\npg_dump -h db -p 5432 -Fc -o -U postgres postgres \u003e postgres.bak\nwget https://redditor-dumps.s3.us-east-2.amazonaws.com/postgres.tar.gz\ntar -xzvf postgres.tar.gz\n```\n\n\n## Restore Database\n1. Run Gathering Dag\n2. 
Run this\n```\n\n  redditor_postgres  /bin/bash \ntar -zxvf /data/postgres.tar.gz\npg_restore --clean --verbose -U postgres -d postgres /postgres.bak\n# /var/lib/postgresql/data\n```\n \n# Creating Connections from Local Servers to Remote Servers\n\n### To Kill a port\n`sudo fuser -k -n tcp 3000`\n\n\n### Allowing Port Forwarding\n```\nsudo systemctl restart ssh\nsudo vim /etc/ssh/sshd_config\n\n/var/log/secure\nAllowTcpForwarding yes\nGatewayPorts yes\n\n### Kill Autossh\n\npkill -3 autossh\nps aux | grep ssh\nkill -9 28186 14428\n\n```\n\n### LENOVO\n```\nautossh -f -nNT -i /home/fdrennan/ndexr.pem -R 61209:localhost:61208  ubuntu@ndexr.com -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ExitOnForwardFailure=yes\nautossh -f -nNT -i /home/fdrennan/ndexr.pem -R 2300:localhost:22  ubuntu@ndexr.com -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ExitOnForwardFailure=yes\nautossh -f -nNT -i /home/fdrennan/ndexr.pem -R 8999:localhost:8999  ubuntu@ndexr.com -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ExitOnForwardFailure=yes\nautossh -f -nNT -i /home/fdrennan/ndexr.pem -R 3000:localhost:3000 ubuntu@ndexr.com -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ExitOnForwardFailure=yes\nautossh -f -nNT -i /home/fdrennan/ndexr.pem -R 8005:localhost:8005 ubuntu@ndexr.com -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ExitOnForwardFailure=yes\nautossh -f -nNT -i /home/fdrennan/ndexr.pem -R 8002:localhost:8002 ubuntu@ndexr.com -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ExitOnForwardFailure=yes\nautossh -f -nNT -i /home/fdrennan/ndexr.pem -R 8003:localhost:8003 ubuntu@ndexr.com -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ExitOnForwardFailure=yes\nautossh -f -nNT -i /home/fdrennan/ndexr.pem -R 8004:localhost:8004 ubuntu@ndexr.com -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ExitOnForwardFailure=yes\nautossh -f -nNT -i /home/fdrennan/ndexr.pem -R 
8006:localhost:8006 ubuntu@ndexr.com -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ExitOnForwardFailure=yes\n```\n\n### DELL XPS\n```\nautossh -f -nNT -i /home/fdrennan/ndexr.pem -R 61210:localhost:61208 ubuntu@ndexr.com -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ExitOnForwardFailure=yes\nautossh -f -nNT -i /home/fdrennan/ndexr.pem -R 8080:localhost:8080 ubuntu@ndexr.com -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ExitOnForwardFailure=yes\nautossh -f -nNT -i /home/fdrennan/ndexr.pem -R 2500:localhost:22  ubuntu@ndexr.com -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ExitOnForwardFailure=yes\nautossh -f -nNT -i /home/fdrennan/ndexr.pem -R 9200:localhost:9200 ubuntu@ndexr.com -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ExitOnForwardFailure=yes\nautossh -f -nNT -i /home/fdrennan/ndexr.pem -R 8081:localhost:8081 ubuntu@ndexr.com -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ExitOnForwardFailure=yes\n```\n\n### POWEREDGE\n```\nautossh -f -nNT -i /home/fdrennan/ndexr.pem -R 61211:localhost:61208   ubuntu@ndexr.com -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ExitOnForwardFailure=yes\nautossh -f -nNT -i /home/fdrennan/ndexr.pem -R 8001:localhost:8001 ubuntu@ndexr.com -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ExitOnForwardFailure=yes\nautossh -f -nNT -i /home/fdrennan/ndexr.pem -R 8000:localhost:8000   ubuntu@ndexr.com -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ExitOnForwardFailure=yes\nautossh -f -nNT -i /home/fdrennan/ndexr.pem -R 2400:localhost:22  ubuntu@ndexr.com -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ExitOnForwardFailure=yes\nautossh -f -nNT -i /home/fdrennan/ndexr.pem -R 8787:localhost:8787 ubuntu@ndexr.com -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ExitOnForwardFailure=yes\nautossh -f -nNT -i /home/fdrennan/ndexr.pem -R 5433:localhost:5432 ubuntu@ndexr.com 
-o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ExitOnForwardFailure=yes\n```\n\n# Uploading to Docker\n```\ndocker image tag rpy:latest fdrennan/rpy:latest\ndocker push fdrennan/rpy:latest\n\ndocker image tag redditorapi:latest fdrennan/redditorapi:latest\ndocker push fdrennan/redditorapi:latest\n```\n\n# Reset Everything Docker\n```\ndocker stop $(docker ps -a -q)\ndocker rm $(docker ps -a -f status=exited -q)\ndocker rmi $(docker images -a -q)\ndocker volume prune\n```\n\n# Add a User\n```\nsudo adduser newuser\nsudo usermod -aG sudo newuser\n```\n# Useful Links\n## [Port Scanner](https://gf.dev/port-scanner)\n## [Install Elastic Search Plugins](https://serverfault.com/questions/973325/how-to-install-elasticsearch-plugins-with-docker-container)\n\n## [Monitoring Users](https://www.ostechnix.com/monitor-user-activity-linux/)\n\n# Technologies Used\n\n![](images/airflow.png)\n![](images/github.png)\n![](images/pgadmin.png)\n![](images/elasticsearch.jpeg)\n![](images/rstudio.png)\n\n\n# Docker Files\n```\ndocker build -t redditorapi --build-arg DUMMY={DUMMY} --file ./DockerfileApi .\ndocker build -t rpy --build-arg DUMMY={DUMMY} --file ./DockerfileRpy .\ndocker build -t redditorapp --build-arg DUMMY={DUMMY} --file ./DockerfileShiny .\n```\n\n# Delete a User\n```\nsudo userdel -r username\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffdrennan%2Fndexr-platform","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffdrennan%2Fndexr-platform","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffdrennan%2Fndexr-platform/lists"}