{"id":23291049,"url":"https://github.com/jplusplus/crawlertoolkit","last_synced_at":"2025-10-19T03:36:30.526Z","repository":{"id":27074454,"uuid":"108009371","full_name":"jplusplus/CrawlerToolkit","owner":"jplusplus","description":"Crawls and save articles that have a \"preservation\" meta element.","archived":false,"fork":false,"pushed_at":"2022-12-08T00:42:06.000Z","size":1484,"stargazers_count":2,"open_issues_count":9,"forks_count":2,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-06-09T13:52:44.939Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"http://www.offshorejournalism.com","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jplusplus.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-10-23T16:34:59.000Z","updated_at":"2022-07-20T22:17:39.000Z","dependencies_parsed_at":"2023-01-14T05:56:06.385Z","dependency_job_id":null,"html_url":"https://github.com/jplusplus/CrawlerToolkit","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/jplusplus/CrawlerToolkit","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jplusplus%2FCrawlerToolkit","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jplusplus%2FCrawlerToolkit/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jplusplus%2FCrawlerToolkit/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jplusplus%2FCrawlerToolkit/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jplusplus","download_url":"https://codeload.github.com/jplu
splus/CrawlerToolkit/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jplusplus%2FCrawlerToolkit/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279685279,"owners_count":26210591,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-19T02:00:07.647Z","response_time":64,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-20T05:13:58.514Z","updated_at":"2025-10-19T03:36:30.510Z","avatar_url":"https://github.com/jplusplus.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Offshore Journalism Toolkit\nThe crawler toolkit is part of the [Offshore Journalism](http://www.offshorejournalism.com) initiative. It is a proof of concept of the `preservation` meta tag. The project is divided into two parts. The first is the crawler, a Django application designed to crawl feeds (RSS, Atom or Twitter accounts) and preserve, when needed, articles tagged with a preservation meta tag. The second part (the test site) is dedicated to testing the preservation tags: it implements a simple version of the preservation meta tags and is based on Jekyll. 
\n\n## Summary\n- [How to install](#how-to-install)\n  - [Prerequisites](#prerequisites)\n  - [Get the sources](#get-the-sources)\n  - [Install](#install)\n- [Configuration](#configure-the-application)\n  - [The environnement variables](#the-environnement-variables)\n  - [Set up external services](#set-up-the-external-services)\n- [How to use](#how-to-use)\n  - [Run servers locally](#run-servers-locally)\n  - [Operations on the django application](#operations-on-the-django-application)\n  - [Operations on the jekyll application](#operations-on-the-jekyll-application)\n  - [Adding content on the test site](#adding-content-on-the-test-site)\n- [How to deploy](#how-to-deploy)\n  - [Deploy the crawler](#deploy-the-crawler)\n  - [Deploy the test site](#deploy-the-test-site)\n\n\n## How to install\n### Prerequisites\nTo install all dependencies (see [Dependencies]()) you must have the following programs installed on your computer:\n\n- python (\u003e= 3.5)\n- ruby (\u003e= 2.4)\n- homebrew is recommended if you're on Mac OS X\n- rvm is also recommended\n\n### Get the sources\n```sh\ngit clone https://github.com/jplusplus/CrawlerToolkit.git\ncd CrawlerToolkit\n\n```\n\n### Install\n#### 1. Install redis\n```sh\n# On Mac OS X (with homebrew)\nbrew install redis\n\n# On Ubuntu (16.04+) \nsudo apt-get install redis-server\n\n# On RedHat/Fedora distributions\nsudo dnf install redis\n```\n\n#### 2. Install the dependencies\n```sh\n./manage.sh install\n```\n\n## Configure the application\n### The environnement variables\nThis application relies on environment variables to run. \n\n| Name | Purpose |\n| ---- | ------- |\n| `DJANGO_SETTINGS_MODULE` | Selects the settings file used by the Django app (e.g. `settings_dev` or `settings_heroku`) |\n| `AWS_ACCESS_KEY_ID` | Amazon Web Services access key ID, required on Heroku to serve \u0026 upload static files. |\n| `AWS_SECRET_ACCESS_KEY` | As above, required for serving \u0026 uploading static files. 
|\n| `AWS_STORAGE_BUCKET_NAME` | The name of the S3 storage bucket |\n| `TWITTER_ACCESS_TOKEN` | Token to access [Twitter's API](#twitter) |\n| `TWITTER_ACCESS_SECRET`| Access token secret for [Twitter's API](#twitter) |\n| `TWITTER_CONSUMER_KEY`| Twitter consumer key for [Twitter's API](#twitter) |\n| `TWITTER_CONSUMER_SECRET`| Twitter consumer secret for [Twitter's API](#twitter) |\n\n#### On local\nThe local application is configured through a `.env` file. To create it, copy the `.env.template` file:\n```sh\ncp .env.template .env\n```\nThen edit `.env` and fill in the appropriate variables.\n\n#### On Heroku\nAll configuration variables can be edited from the Heroku dashboard or with the following commands:\n```sh\n# To set a variable\n./manage.sh set \u003cVARIABLE NAME\u003e \u003cvalue\u003e\n# To get a variable's value\n./manage.sh get \u003cVARIABLE NAME\u003e\n```\n\n### Set up the external services\nThis project is managed with simple commands (see [How to use](#how-to-use)). For these commands to work, certain services\nneed to be configured first.\n#### Surge.sh\nYou will need to install the surge npm package to deploy the test-site.\n```sh\n$ sudo npm install -g surge\n$ surge login\n```\n\n#### Heroku\nTo use the Heroku `manage.sh` commands you must have the heroku-cli package installed on your OS. Once this package\nis installed you must log in:\n```sh\n$ heroku login\n```\nThen add the proper `heroku` git remote with the following command:\n```sh\n# replace \u003capp\u003e with your Heroku application name\n$ heroku git:remote -a \u003capp\u003e\n```\n\n#### Twitter\nThis project uses Twitter's API to retrieve tweets from Twitter feeds. You'll therefore need to [create a Twitter app](https://apps.twitter.com/app/new) and generate a set of access tokens (in the Keys and Access Tokens tab). Then copy the keys, secrets and tokens into the appropriate [environment variables](#the-environnement-variables).\n\n## How to use\n### Run servers locally\n\n```sh\n# 1. 
Start the redis server\n./manage.sh start_redis \u003coptional port, default: 3000\u003e\n\n# 2. Run the crawler\n./manage.sh start_crawler \u003coptional port, default: 4000\u003e\n\n# 3. Run the test site\n./manage.sh start_test_site \u003coptional port, default: 5000\u003e\n``` \n\n### Operations on the django application\nIf you need to perform operations on the application, all Django commands are available through the following command:\n```sh\n./manage.sh django --help\n```\n\n### Operations on the jekyll application\n```sh\n./manage.sh jekyll --help\n```\n### Adding content on the test site\nCurrently, the test site is built with Jekyll and the minimal-mistakes theme.\nFor a new post to work properly, create it in the `tests-site/_posts`\nfolder (as in any Jekyll site) but use the `single` layout instead of the `post` layout you might expect.\n\nThe purpose of this site is to test the preservation meta tags (see the specs).\nTo add one or more preservation meta tags, add a `preservation` field to the post header as follows:\n\n```md\n---\nlayout: single\ntitle: \"The article title\"\ncategories: this is a test\npreservation:\n  - type: notfound_only\n    value: true\n  - type: release_date\n    value: 2018-01-01\n  - type: priority\n    value: true\n---\n```\n\n## How to deploy\n\n### Deploy the crawler\nThe crawler is configured to be deployed on Heroku with the following command:\n```sh\n# This helper function calls the following git command:\n# git subtree push --prefix crawl/ heroku master\n./manage.sh deploy\n```\n### Deploy the test site\nBy default, the `test-site` is configured to be deployed on [surge.sh](http://surge.sh).\n```sh\n$ ./manage.sh deploy_test_site\n``` 
\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjplusplus%2Fcrawlertoolkit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjplusplus%2Fcrawlertoolkit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjplusplus%2Fcrawlertoolkit/lists"}