{"id":22672221,"url":"https://github.com/jimlynchcodes/super-scraper","last_synced_at":"2025-08-19T00:16:53.543Z","repository":{"id":37868771,"uuid":"255391771","full_name":"JimLynchCodes/Super-Scraper","owner":"JimLynchCodes","description":"An awesome setup for scraping data from websites and storing it in a database.  🤖 🦾 ➡️ 📖 ➡ 📦","archived":false,"fork":false,"pushed_at":"2023-01-07T04:38:33.000Z","size":4526,"stargazers_count":3,"open_issues_count":17,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-29T11:29:51.048Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/JimLynchCodes.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-04-13T17:09:32.000Z","updated_at":"2022-09-18T07:48:55.000Z","dependencies_parsed_at":"2023-02-06T12:01:47.259Z","dependency_job_id":null,"html_url":"https://github.com/JimLynchCodes/Super-Scraper","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/JimLynchCodes/Super-Scraper","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JimLynchCodes%2FSuper-Scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JimLynchCodes%2FSuper-Scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JimLynchCodes%2FSuper-Scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JimLynchCodes%2FSuper-Scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/JimLynchCodes","do
wnload_url":"https://codeload.github.com/JimLynchCodes/Super-Scraper/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JimLynchCodes%2FSuper-Scraper/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":271079039,"owners_count":24695559,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-18T02:00:08.743Z","response_time":89,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-09T16:18:23.362Z","updated_at":"2025-08-19T00:16:53.456Z","avatar_url":"https://github.com/JimLynchCodes.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n# The Super Scraper!\n\n\non travis-ci.org:\n\n[![Build Status](https://travis-ci.org/JimLynchCodes/Super-Scraper.svg?branch=master)](https://travis-ci.org/JimLynchCodes/Super-Scraper)\n\n\non travis-ci.com:\n\n[![Build Status](https://travis-ci.com/JimLynchCodes/Super-Scraper.svg?branch=master)](https://travis-ci.com/JimLynchCodes/Super-Scraper)\n\n\n\n\u003cimg src=\"screenshots/scraper-demo.gif\"\u003e\n\n\u003cbr/\u003e\n\n## Usage Guide\n\n### (Running The Google Theme Scraper)\n\n#### Please use node v13.13.0\n```\nnvm use\n```\n\n#### Install dependencies\n```\nnpm i\n```\n\n\u003cbr/\u003e\n\n### _Running The Google Theme Scraper_\n\n### Start A Local Mongo Instance\n\nThere are many ways to start a 
mongo server, but here is one way:\n```\nbrew services start mongodb-community\n```\n\nCheck if it's running by listing the running brew services:\n```\nbrew services list\n```\n\n\n\n# Running The Scraper\n- concurrently starts up the scraper's \"back-end\" and \"front-end\".\n- back-end is a local express node server that interacts with the database and takes REST calls from front-end.\n- front-end is a cypress-controlled browser process that interacts with web pages.\n- This is great for developing because cypress has hot reloading built in, and you can really get into the weeds using `cy.log` to see what's going on.\n```\nnpm start\n```\n\nThis should open the \"cypress test runner\" while concurrently starting up the super scraper \"backend\" that communicates with the database.\n\n\n#### Running In Headless Mode\n\n  - concurrently spins up \"back-end\" and \"front-end\" where the browser is a headless browser process.\n  - This is how you would run the scraper \"in production\" on a remote server.\n```\nnpm run scrape:headless\n```\n\n\n#### Start Backend Server\n```\nnpm run start:backend\n```\n\nYou can test it locally via `curl` or Postman by hitting the back-end endpoints with REST calls. \n\n\u003cbr/\u003e\n\n\n# Backend Endpoints\n_you shouldn't really need to change these much._\n\n### Health-Check\n\n-  `health-check` - GET \n\n   A convenient sanity check to see if the server is running. 
Returns \"Super Scraper backend is ready to accept scraped data \u0026 insert it into the database!\" with a 200 status code.\n\n   http://localhost:3000/health-check\n\n   Response: **200** - \"Super Scraper backend is ready to accept scraped data \u0026 insert it into the database!\"\n\n\n\u003cbr/\u003e\n\n### Save\n\n- `save` - POST\n\n  Saves the scraped data along with the current time as a new mongo document in the specified collection.\n\n  http://localhost:3000/save\n\n  body: \n  ```\n  {\n\t  \"scraped_data\": String | Object | Array,\n\t  \"collection\": String\n  }\n  ```\n\n  Response: **200** \n\n  responseBody:\n  ```\n  {\n    \"statusCode\": 200,\n    \"body\": \"{ \n      \"message\": \"Saved succesfully!\",\n      \"document_saved\\\": _id\n      \"date_scraped\\\": Date\n      \"data\": String | Object | Array \n  }\n  ```\n\n\n### Shutdown-Backend\n(Note: Actually, this may not be necessary...)\n\n  - `shutdown-backend` - POST\n\n    Shuts down the backend server and returns the string, \"Shutting down backend server...\"\n\n    http://localhost:3000/shutdown-backend\n\n    body: \n    ```\n    {}\n    ```\n\n    Response: **200** - \"Shutting down backend server...\"\n\n\n### Deploying The Scraper\nTo \"run this in prod\" we just run it on travis CI, using secret environment variables that point to a known database and provide credentials for logging into barchart.\n\n### Barchart username and password\n\nWhen running locally, set these environment variables using \"CYPRESS_\" as the prefix so cypress can see them:\n```\nexport CYPRESS_BARCHART_USER=$'jimbo@boofar.com'\nexport CYPRESS_BARCHART_PW='derpderp123'\n```\n\nWhen running on the build server, set the above two environment variables in the CI admin.\n\n\nAlso, be sure to set the values in `cypress.json` for `google_themes_mongo_collection` and `mongo_collection_bc_scraper` to reflect the mongo collections in which you'd like to save each scraper's data.\n\nSimilarly, set the value in `cypress.json` for `mongo_database_name` if you would like 
to use a database name other than `eon_data`.  \n\n_Be sure to have the collections with these names inside of the database with this name before running the script!_\n\n## Create Backend .Env\n\nCreate a `.env` file in the `backend` folder with the same structure as `./backend/.env_SAMPLE`, filling in your mongo connection information.\n\n```\n# Local MongoDb Example\nMONGO_URI=localhost:27017/db\n\n# MongoDb Atlas Example\nMONGO_URI=mongodb://username:password@cluster0-----.mongodb.net:27017/db?ssl=true\u0026replicaSet=Cluster0-shard-0\u0026authSource=admin\n```\n\nWhen running on the build server, set the `MONGO_URI` environment variable to point to the database you'd like to save the data to, **and be sure to put quotes around the value you use for the URI!**\n\n\n## Install Node Dependencies\n\nIn **BOTH** the project root and the `backend` folder, install node dependencies using `v13.13.0`:\n```\nnvm use\nnpm i\n```\n\n## Install libgconf-2-4\n\nFor cypress headless, you still need libgconf-2-4. The necessary dependencies for running cypress on linux can be found here. \n\nNotice this section of the `travis.yml` file:\n```\naddons:\n  apt:\n    packages:\n      # cypress tests\n      - libgconf-2-4\n```\n\n## (Optional) Run Locally\nAt this point you should be able to run the script by just executing the bash file, in this case `./run-scraper.sh`\n\n\n## Schedule Cron Job\n\nSet up a cron job on Ubuntu by editing the crontab:\n```\ncrontab -e\n```\n\nOnce editing your crontab, add a line that runs the scraper's bash file. 
This one runs it every weekday at 6:30pm:\n\n```\n30 18 * * 1-5 ~/Git-Projects/Super-Scraper/run-scraper.sh \u003e\u003e /home/ubuntu/Git-Projects/Super-Scraper/logs/`date +\\%Y-\\%m-\\%d`-cron.log 2\u003e\u00261\n```\n\nTo view your scheduled jobs:\n```\ncrontab -l\n```\n\n\n\n\n\u003cbr/\u003e\n\n## TODO\n\n- [ ✓ ] - Get Example Google Theme Scraper working\n\n- [ \u0026nbsp; ] - Implement data validation step\n\n- [ ✓ ] - Add optional text / email notifications on success and/or failures\n\n- [ ✓ ] - Setting Up the Cron Scheduling\n\n- [ \u0026nbsp; ] - Run on remote Ubuntu server.\n\n\u003cbr/\u003e\n\n# Developing Your Own Scraper\n\n## Add A Feature File For The New Scraper\n\n- put the file in `cypress/integration/scrape-scripts`.\n- use the Google-Theme-Scraper example as a guide, changing the \"Given\" and \"When\" conditions to grab the desired scrape data.\n\n\n## Implement the Feature File Steps \n\nCreate a new folder with the same name as the `.feature` file created for the new scraper.\n\nWithin it, put the three files `navigate`, `scrape`, and `store`, which correspond to the `Given`, `When`, and `Then` statements in the feature file, respectively.\n\n\n\u003cbr/\u003e\n\n## Contributing\nPlease contribute! 🙏\n\n\u003cbr/\u003e\n\n# How To Re-Create This Project From Scratch\n\n## 1. Create A Directory Attached to a Git Repo\nI just go on github, create the repo in the browser, and clone it to my computer.\n\n## 2. Choose A NodeJs Version\nDecide on a good (preferably LTS) version of Node (the latest version of v12 is a good choice at the time of this writing). \n\nIt is recommended to have [nvm](https://github.com/nvm-sh/nvm) installed and create a .nvmrc file:\n```\nnvm i v12\nnvm use v12\nnode -v \u003e .nvmrc\n```\n\n## 3. Install the Latest Version of Cypress\n\nInstall `cypress` as a dev dependency:\n```\nnpm i -D cypress\n```\n\n## 4. 
Run Cypress \nWhen you run cypress in a project with no cypress folder, it creates one with a bunch of boilerplate cypress stuff.\n\nAdd scripts in your package.json for Cypress's `open` and `run` commands (we recommend having `npm start` be an alias for a \"scrape\" command):\n\nHere is a sample snippet of the \"scripts\" section in package.json:\n```\n\"scripts\": {\n    \"start\": \"npm run scrape\",\n    \"scrape\": \"cypress open\",\n    \"scrape:headless\": \"cypress run\"\n  },\n```\n\n## 5. Install CucumberJS\ncucumber is an awesome plugin that we are using in a slightly unusual way since this is a slightly unusual project (using what is normally an e2e testing framework as _the application itself_).\n\n ### 5.a install the cucumber library:\n```\nnpm i -D cypress-cucumber-preprocessor\n```\n\n ### 5.b Add this library to your \"plugins\":\n\ncypress/plugins/index.js\n```\nconst cucumber = require('cypress-cucumber-preprocessor').default\n \nmodule.exports = (on, config) =\u003e {\n  on('file:preprocessor', cucumber())\n}\n```\n\n###  5.c Make Cypress Look For .feature Files \n\ncypress.json\n```\n{\n  \"testFiles\": \"**/*.feature\"\n}\n```\n\n### 5.d Create A Sample Feature\n\nCreate a feature file anywhere within the `cypress/integration` folder that follows proper [Gherkin]() syntax.\n\nHere's a sample feature file:\n```\nFeature: The Google Theme Scraper\n \n  I want to scrape the theme of google's home page image each day\n  \n  @focus\n  Scenario: Opening a social network page\n    Given I open Google search home page\n    When I scrape the day's theme of the day's google image\n    Then I save it in my database's Google-Theme-Scrapings collection\n```\n\n## 5.e Put \"Step Defs\" Near The feature Files\n\nThis isn't strictly necessary, as you could put the step definition files in the default path instead.\n\nAdd this to `package.json`:\n\n```\n\"cypress-cucumber-preprocessor\": {\n    \"step_definitions\": 
\"cypress/integration/\"\n  }\n```\n\nThen create a folder within this folder which has the same name as the .feature file.\n\n## 5.f Create Example Script \"Step Definition\" Files\n\nFor the example feature file, we can create these three step definition files:\n\ncypress/integration/Google-Theme-Scraper/navigate.js\n```\nimport { Given } from \"cypress-cucumber-preprocessor/steps\";\n \nconst url = 'https://google.com'\nGiven(`I open Google search home page`, () =\u003e {\n  cy.visit(url)\n})\n```\n\ncypress/integration/Google-Theme-Scraper/scrape.js\n```\n\n```\n\n\n## 5.g Add Data Storage Of Your Choice\n\nPut your favorite save / insert code in your \"Then\" step definition file:\n\ncypress/integration/Google-Theme-Scraper/store.js\n```\nconst MongoClient = require('mongodb').MongoClient;\n// Assumed dependency: the original snippet called moment() without importing it.\nconst moment = require('moment');\n\n// Inserts the scraped data, stamped with the current time, into the given collection.\nconst save = (data, collection) =\u003e {\n\n    return new Promise((resolve, reject) =\u003e {\n\n        MongoClient.connect(process.env.MONGO_URI, (err, client) =\u003e {\n\n            if (err)\n                return reject(err)\n\n            console.log('connected to mongo for saving results...')\n\n            // With no argument, db() uses the database named in MONGO_URI.\n            const dbo = client.db()\n\n            const currentTime = moment().format('MMMM Do YYYY, h:mm:ss a')\n\n            dbo.collection(collection).insertOne({\n                date_scraped: currentTime,\n                data: data\n            }, (err, res) =\u003e {\n                client.close()\n                if (err) return reject(err)\n                resolve(res.result)\n            })\n\n        })\n\n    })\n\n}\n\n```\n\n## 6 Create .env File And Load It During The Scraping\n\nInstall `dotenv`\n```\nnpm i dotenv -D\n```\n\nLoad the `.env` file during your scrape by adding this to `plugins/index.js`:\n```\nrequire('dotenv').config()\n```\n\nThen read the env variables with 
`process.env.MONGO_URI`\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjimlynchcodes%2Fsuper-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjimlynchcodes%2Fsuper-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjimlynchcodes%2Fsuper-scraper/lists"}