{"id":13892856,"url":"https://github.com/DocNow/diffengine","last_synced_at":"2025-07-17T06:31:21.547Z","repository":{"id":66762452,"uuid":"77878289","full_name":"DocNow/diffengine","owner":"DocNow","description":"track changes to the news, where news is anything with an RSS feed","archived":false,"fork":false,"pushed_at":"2020-06-25T17:20:50.000Z","size":396,"stargazers_count":178,"open_issues_count":30,"forks_count":30,"subscribers_count":14,"default_branch":"master","last_synced_at":"2024-11-22T01:41:56.338Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DocNow.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2017-01-03T02:45:48.000Z","updated_at":"2024-10-13T20:56:51.000Z","dependencies_parsed_at":"2024-01-19T08:14:26.241Z","dependency_job_id":"72078877-e29a-4b0d-a28b-ed735a7a8b72","html_url":"https://github.com/DocNow/diffengine","commit_stats":null,"previous_names":[],"tags_count":22,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DocNow%2Fdiffengine","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DocNow%2Fdiffengine/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DocNow%2Fdiffengine/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DocNow%2Fdiffengine/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/DocNow","download_url":"https://codeload.github.com/DocNow/diffengine/tar.gz/refs/heads/master","ho
st":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":226226243,"owners_count":17592351,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-06T17:01:17.126Z","updated_at":"2024-11-24T20:31:29.376Z","avatar_url":"https://github.com/DocNow.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"\u003cdiv style=\"text: center;\"\u003e\n\u003cimg height=\"100\" src=\"https://github.com/DocNow/diffengine/blob/master/diffengine.png?raw=true\"\u003e\n\u003c/div\u003e\n\ndiffengine is a utility for watching RSS feeds to see when story content\nchanges. When new content is found a snapshot is saved at the Internet Archive,\nand a diff is generated for sending to social media. The hope is that it can\nhelp draw attention to the way news is being shaped on the web. It also creates\na database of changes over time that can be useful for research purposes.\n\ndiffengine draws heavily on the inspiration of [NYTDiff] and [NewsDiffs] which\n*almost* did what we wanted. [NYTdiff] is able to create presentable diff images\nand tweet them, but was designed to work specifically with the NYTimes API.\nNewsDiffs provides a comprehensive framework for watching changes on multiple\nsites (Washington Post, New York Times, CNN, BBC, etc) but you need to be a\nprogrammer to add a [parser\nmodule](https://github.com/ecprice/newsdiffs/tree/master/parsers) for a website\nthat you want to monitor. 
It is also a full-on website which involves some\ncommitment to install and run.\n\nWith the help of [feedparser], diffengine takes a different approach by working\nwith any site that publishes an RSS feed of changes. This covers many news\norganizations, but also personal blogs and organizational websites that put out\nregular updates. And with the [readability] module, diffengine is able to\nautomatically extract the primary content of pages, without requiring special\nparsing to remove boilerplate material. And like NYTDiff, instead of creating\nanother website for people to watch, diffengine pushes updates out to social\nmedia (via Twitter or email) where people are already, while also building a local database of diffs\nthat can be used for research purposes.\n\n## Install\n\n1. install [GeckoDriver]\n1. install [Python 3]\n1. `pip3 install diffengine`\n\n## Run\n\nIn order to run diffengine you need to pick a directory location where you can\nstore the diffengine configuration, database and diffs. For example I have a\ndirectory in my home directory, but you can use whatever location you want; you\njust need to be able to write to it.\n\nThe first time you run diffengine it will prompt you to enter an RSS or Atom\nfeed URL to monitor. You will then be asked to provide the credentials to\npublish the diffs on social media.\n\n\n```console\n% diffengine /home/ed/.diffengine\n\nWhat RSS/Atom feed would you like to monitor? https://inkdroid.org/feed.xml\n\nWould you like to set up tweeting edits? [Y/n] Y\n\nGo to https://apps.twitter.com and create an application.\n\nWhat is the consumer key? \u003cTWITTER_APP_KEY\u003e\n\nWhat is the consumer secret? 
\u003cTWITTER_APP_SECRET\u003e\n\nLog in to https://twitter.com as the user you want to tweet as and hit enter.\n\nVisit https://api.twitter.com/oauth/authorize?oauth_token=NRW9BQAAAAAAyqBnAAXXYYlCL8g\n\nWhat is your PIN: 8675309\n\nSaved your configuration in /home/ed/.diffengine/config.yaml\n\nWould you like to set up emailing edits with Sendgrid? [Y/n] y\n\nGo to https://app.sendgrid.com/ and get an API key.\n\nWhat is the API key? \u003cAPI_KEY\u003e\n\nWhat email address is sending the email? \u003cFROM_ADDRESS\u003e\n\nWho are the recipients of the emails?  \u003cRECEIVERS ADDRESSES_CSV\u003e\n\nFetching initial set of entries.\n\nDone!\n```\n\nAfter that you just need to put diffengine in your crontab to have it run\nregularly, or you can run it manually at your own intervals if you want. Here's\nmy crontab to run every 30 minutes to look for new content.\n\n    0,30 * * * * /usr/local/bin/diffengine /home/ed/.diffengine\n\nYou can examine your config file at any time and add/remove feeds as needed. 
It\nis the `config.yaml` file that is stored relative to the storage directory you\nchose, so in my case `/home/ed/.diffengine/config.yaml`.\n\nLogs can be found in `diffengine.log` in the storage directory, for example\n`/home/ed/.diffengine/diffengine.log`.\n\n## Examples\n\nCheck out [Ryan Baumann's \"diffengine\" Twitter list] for a list of known\ndiffengine Twitter accounts that are out there.\n\n## Config options\n\n### Database engine\n\nBy default the database is configured for SQLite and the file `./diffengine.db` through the `db` config property.\n\n```yaml\ndb: sqlite:///diffengine.db\n```\n\nThis value follows the [database URL connection string format](http://docs.peewee-orm.com/en/latest/peewee/playhouse.html#database-url).\n\nFor instance, you can connect to your PostgreSQL database using something like this.\n\n```yaml\ndb: postgresql://postgres:my_password@localhost:5432/my_database\n```\n\nIf you store your database connection URL in an environment variable, as on Heroku, you can simply do as follows.\n\n```yaml\ndb: \"${DATABASE_URL}\"\n```\n\n### Multiple Accounts \u0026 Feed Implementation Example\n\nIf you are setting up multiple accounts and multiple feeds, it may be helpful to set up a\ndirectory for each account. 
For example:\n\n- Toronto Sun `/home/nruest/.torontosun`\n- Toronto Star  `/home/nruest/.torontostar`\n- Globe \u0026 Mail `/home/nruest/.globemail`\n- Canadaland `/home/nruest/.canadaland`\n- CBC `/home/nruest/.cbc`\n\nThen you will configure a cron entry for each account:\n\n```\n0,15,30,45 * * * * /usr/bin/flock -xn /tmp/globemail.lock -c \"/usr/local/bin/diffengine /home/nruest/.globemail\"\n0,15,30,45 * * * * /usr/bin/flock -xn /tmp/torontosun.lock -c \"/usr/local/bin/diffengine /home/nruest/.torontosun\"\n0,15,30,45 * * * * /usr/bin/flock -xn /tmp/cbc.lock -c \"/usr/local/bin/diffengine /home/nruest/.cbc\"\n0,15,30,45 * * * * /usr/bin/flock -xn /tmp/lapresse.lock -c \"/usr/local/bin/diffengine /home/nruest/.lapresse\"\n0,15,30,45 * * * * /usr/bin/flock -xn /tmp/calgaryherald.lock -c \"/usr/local/bin/diffengine /home/nruest/.calgaryherald\"\n```\n\nIf there are multiple feeds for an account, you can set up the `config.yaml` like so:\n\n```yaml\nfeeds:\n- name: The Globe and Mail - Report on Business\n  twitter:\n    access_token: ACCESS_TOKEN\n    access_token_secret: ACCESS_TOKEN_SECRET\n  sendgrid:\n    sender: FROM_ADDRESS\n    recipients: TO_ADDRESS1, TO_ADDRESS2\n  url: http://www.theglobeandmail.com/report-on-business/?service=rss\n- name: The Globe and Mail - Opinion\n  twitter:\n    access_token: ACCESS_TOKEN\n    access_token_secret: ACCESS_TOKEN_SECRET\n  sendgrid:\n    sender: FROM_ADDRESS2\n    recipients: TO_ADDRESS3, TO_ADDRESS4\n  url: http://www.theglobeandmail.com/opinion/?service=rss\n- name: The Globe and Mail - News\n  twitter:\n    access_token: ACCESS_TOKEN\n    access_token_secret: ACCESS_TOKEN_SECRET\n  url: http://www.theglobeandmail.com/news/?service=rss\ntwitter:\n  consumer_key: CONSUMER_KEY\n  consumer_secret: CONSUMER_SECRET\nsendgrid:\n  api_token: API_TOKEN\n```\n\n### Skip entry\n\nYou can also skip an entry if it matches a regular expression pattern. 
This is useful for avoiding the \"subscribe now\" pages.\nThis is configured per feed like so:\n\n```yaml\n- name: The Globe and Mail - Report on Business\n  skip_pattern: \"you have access to only \\\\d+ articles\"\n  twitter:\n    access_token: ACCESS_TOKEN\n    access_token_secret: ACCESS_TOKEN_SECRET\n  url: http://www.theglobeandmail.com/report-on-business/?service=rss\n```\n\nIn this example, if the page contains the text \"you have access to only 10 articles\", diffengine will skip it. The same applies for any number of articles, as it's a regular expression.\nThe `skip_pattern` performs a `re.search` operation and uses the `case insensitive` and `multiline` flags.\n\nSee the docs for [more information about regular expressions and the search operation.](https://docs.python.org/3/library/re.html#search-vs-match)\n\n\n### Tweet content\n\nBy default, the tweeted diff will include the article's title and the archive diff url, [like this.](https://twitter.com/ld_diff/status/1267989297048817672)\n\nYou can change this to tweet what's changed: the url, the title and/or the summary. To do so, you need to specify **all** the following `lang` keys:\n\n```yaml\nlang:\n  change_in: \"Change in\"\n  the_url: \"the URL\"\n  the_title: \"the title\"\n  and: \"and\"\n  the_summary: \"the summary\"\n```\n\nOnly if all the keys are defined will the tweet include what's changed in its content, followed by the `diff.url`. Some examples:\n\n- \"Change in the title\"\n- \"Change in the summary\"\n- \"Change in the title and the summary\"\n\nAnd so on with all the possible combinations of url, title and summary.\n\n### Support for environment vars\n\nThe configuration file has support for [environment variables](https://medium.com/chingu/an-introduction-to-environment-variables-and-how-to-use-them-f602f66d15fa). 
This is useful if you want to keep your credentials secure when deploying to Heroku, Vercel (formerly ZEIT Now), AWS, Azure, Google Cloud or any other similar services. The environment variables are defined on the app of the platform you use or directly in a [dotenv file](https://12factor.net/config), which is the usual case when coding locally.\n\nFor instance, say you want to keep your Twitter credentials safe. You'd keep a reference to them in the `config.yaml` this way:\n\n```yaml\ntwitter:\n  consumer_key: \"${MY_CONSUMER_KEY_ENV_VAR}\"\n  consumer_secret: \"${MY_CONSUMER_SECRET_ENV_VAR}\"\n```\n\nThen you would define your environment variables `MY_CONSUMER_KEY_ENV_VAR` and `MY_CONSUMER_SECRET_ENV_VAR` in your `.env` file:\n\n```dotenv\nMY_CONSUMER_KEY_ENV_VAR=\"CONSUMER_KEY\"\nMY_CONSUMER_SECRET_ENV_VAR=\"CONSUMER_SECRET\"\n```\n\nDone! You can use diffengine as usual and keep your credentials safe.\n\n### Adding a Twitter account when the configuration file is already created\n\nYou can use the following command to add Twitter accounts to the config file.\n\n```shell\n$ diffengine --add\n\nLog in to https://twitter.com as the user you want to tweet as and hit enter.\nVisit https://api.twitter.com/oauth/authorize?oauth_token=QKGAqgAAAAABDsonAAABcbfQfFw in your browser and hit enter.\nWhat is your PIN: 1234567\n\nThese are your access token and secret.\nDO NOT SHARE THEM WITH ANYONE!\n\nACCESS_TOKEN\nxxxxxxxxxxx-yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy\n\nACCESS_TOKEN_SECRET\nzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz\n```\n\nThen you would use the `ACCESS_TOKEN` and the `ACCESS_TOKEN_SECRET` inside the config like this:\n\n```yaml\nfeeds:\n- name: My new feed\n  url: http://www.mynewfeed.com/feed/\n  twitter:\n    access_token: \"${ACCESS_TOKEN}\"\n    access_token_secret: \"${ACCESS_TOKEN_SECRET}\"\n```\n\n### Available webdriver engines\n\nDiffengine has support for `geckodriver` and `chromedriver`.\n\nYou can configure this in the 
`config.yaml`. The keys are the following ones.\n```yaml\nwebdriver:\n  engine:\n  executable_path:\n  binary_location:\n```\n\n#### Configuring geckodriver\n\nThe `geckodriver` is properly defined by default. In case you need to configure it, then:\n\n```yaml\nwebdriver:\n  engine: \"geckodriver\"\n  executable_path: null  # this config has no use with geckodriver\n  binary_location: null  # the same as above with this one\n```\n\n#### Configuring chromedriver\n\nIf you want to use `chromedriver` locally, then you should leave the config this way:\n\n```yaml\nwebdriver:\n  engine: \"chromedriver\"\n  executable_path: null  # \"chromedriver\" by default\n  binary_location: null  # \"\" by default\n```\n\n##### Using chromedriver in Heroku\n\nIf you use Heroku, then you have to add the [Heroku chromedriver buildpack](https://github.com/heroku/heroku-buildpack-chromedriver).\nAnd then use the environment vars provided automatically by it.\n\n```yaml\nwebdriver:\n  engine: \"chromedriver\"\n  executable_path: \"${CHROMEDRIVER_PATH}\"\n  binary_location: \"${GOOGLE_CHROME_BIN}\"\n```\n\n### Configuring the loggers\n\nBy default, the script will log everything to `./diffengine.log`.\nHowever, you can disable the file logger and/or enable the console logger as well.\nYou can modify the log filename, too.\n\nIf not present, the default values will be the following ones.\n```yaml\nlog: diffengine.log\nlogger:\n  file: true\n  console: false\n```\n\nLogging to the console could be useful to see what's happening if the app lives in services like Heroku.\n\n## Develop\n\n[![Build Status](https://travis-ci.org/DocNow/diffengine.svg)](http://travis-ci.org/DocNow/diffengine)\n\nHere's how to get started hacking on diffengine with [pipenv]:\n\n```console\n% git clone https://github.com/docnow/diffengine\n% cd diffengine\n% pipenv install\n% pytest\n============================= test session starts ==============================\nplatform linux -- Python 3.5.2, pytest-3.0.5, py-1.4.32, 
pluggy-0.4.0\nrootdir: /home/ed/Projects/diffengine, inifile:\ncollected 5 items\n\ntest_diffengine.py .....\n\n=========================== 5 passed in 8.09 seconds ===========================\n```\n\nFinally, install the pre-commit hooks so that they run before every commit:\n\n```\npre-commit install\n```\n\nThis way, the [Black](https://black.readthedocs.io/en/stable/) formatter will be executed every time.\n\nWe recommend that you also [configure it in your own IDE.](https://black.readthedocs.io/en/stable/editor_integration.html)\n\n\n[nyt_diff]: https://twitter.com/nyt_diff\n[NYTDiff]: https://github.com/j-e-d/NYTdiff\n[NewsDiffs]: http://newsdiffs.org/\n[feedparser]: https://pythonhosted.org/feedparser/\n[readability]: https://github.com/buriy/python-readability\n[GeckoDriver]: https://github.com/mozilla/geckodriver\n[Python 3]: https://python.org\n[create an issue]: https://github.com/DocNow/diffengine/issues\n[pipenv]: https://pipenv.readthedocs.io/en/latest/\n[Ryan Baumann's \"diffengine\" Twitter list]: https://twitter.com/ryanfb/lists/diffengine\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FDocNow%2Fdiffengine","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FDocNow%2Fdiffengine","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FDocNow%2Fdiffengine/lists"}