{"id":13558286,"url":"https://github.com/openaustralia/morph","last_synced_at":"2025-04-03T13:30:56.209Z","repository":{"id":12698570,"uuid":"15370964","full_name":"openaustralia/morph","owner":"openaustralia","description":"Take the hassle out of web scraping","archived":false,"fork":false,"pushed_at":"2023-01-26T23:56:45.000Z","size":9162,"stargazers_count":461,"open_issues_count":368,"forks_count":74,"subscribers_count":19,"default_branch":"master","last_synced_at":"2024-11-04T09:37:08.676Z","etag":null,"topics":["civictech","docker","webscraping"],"latest_commit_sha":null,"homepage":"https://morph.io","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/openaustralia.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2013-12-22T05:39:05.000Z","updated_at":"2024-10-13T21:01:59.000Z","dependencies_parsed_at":"2023-02-15T03:16:02.680Z","dependency_job_id":null,"html_url":"https://github.com/openaustralia/morph","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openaustralia%2Fmorph","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openaustralia%2Fmorph/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openaustralia%2Fmorph/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openaustralia%2Fmorph/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/openaustralia","download_url":"https://codeload.github.com/openaustralia/morph/tar.gz/refs/heads/master","host":{"nam
e":"GitHub","url":"https://github.com","kind":"github","repositories_count":247009496,"owners_count":20868564,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["civictech","docker","webscraping"],"created_at":"2024-08-01T12:04:51.693Z","updated_at":"2025-04-03T13:30:53.781Z","avatar_url":"https://github.com/openaustralia.png","language":"Ruby","funding_links":[],"categories":["Ruby","docker"],"sub_categories":[],"readme":"[![Build Status](https://travis-ci.com/openaustralia/morph.png?branch=master)](https://travis-ci.com/openaustralia/morph) [![Code Climate](https://codeclimate.com/github/openaustralia/morph.png)](https://codeclimate.com/github/openaustralia/morph)\n\n# morph.io: A scraping platform\n\n* A [Heroku](https://www.heroku.com/) for [Scrapers](https://en.wikipedia.org/wiki/Web_scraping)\n* All code and collaboration through [GitHub](https://github.com/)\n* Write your scrapers in Ruby, Python, PHP, Perl or JavaScript (NodeJS, PhantomJS)\n* Simple API to grab data\n* Schedule scrapers or run manually\n* Process isolation via [Docker](https://www.docker.com/)\n* Email alerts for broken scrapers\n\n## Dependencies\nRuby, Docker, MySQL, SQLite 3, Redis, mitmproxy.\n(See below for more details about installing Docker)\n\nDevelopment is supported on Linux (Ubuntu 20.04) and Mac OS X.\n\n## Repositories\n\nUser-facing:\n\n* [openaustralia/morph](https://github.com/openaustralia/morph) - Main application\n* [openaustralia/morph-cli](https://github.com/openaustralia/morph-cli) - Command-line morph.io tool\n* 
[openaustralia/scraperwiki-python](https://github.com/openaustralia/scraperwiki-python) - Fork of [scraperwiki/scraperwiki-python](https://github.com/scraperwiki/scraperwiki-python) updated to use morph.io naming conventions\n* [openaustralia/scraperwiki-ruby](https://github.com/openaustralia/scraperwiki-ruby) - Fork of [scraperwiki/scraperwiki-ruby](https://github.com/scraperwiki/scraperwiki-ruby) updated to use morph.io naming conventions\n\nDocker images:\n* [openaustralia/buildstep](https://github.com/openaustralia/buildstep) - Base image for running scrapers in containers\n\n## Installing Docker\n\n### On Linux\n\nJust follow the instructions on the [Docker site](https://docs.docker.com/engine/installation/linux/docker-ce/ubuntu/).\n\nYour user account should be able to manipulate Docker (just add your user to the `docker` group).\n\n### On Mac OS X\n\nInstall [Docker for Mac](https://docs.docker.com/docker-for-mac/install/).\n\n## Starting up Elasticsearch\n\nMorph needs Elasticsearch to run. We've made things easier for development by using docker\nto run Elasticsearch.\n\n    docker-compose up\n\n## To Install Morph\n\n    bundle install\n    cp config/database.yml.example config/database.yml\n    cp env-example .env\n\nEdit `config/database.yml` with your database settings.\n\n### Tunnel GitHub webhook traffic back to your local development machine\n\nWe use \"ngrok\", a tool that makes tunnelling internet traffic to a local development machine easy. First [download ngrok](https://ngrok.com/download) if you don't have it already. Then,\n\n    ngrok http 5100\n\nMake note of the `http://*.ngrok.io` forwarding URL.\n\n\u003c!-- TODO: Add instructions for debugging and working with callbacks for the GitHub app in development with https://webhook.site --\u003e\n\n### Creating a GitHub Application\n\nYou'll need to create an application on GitHub so that morph.io can talk to GitHub. 
We've pre-filled most of the important fields for a few different configurations below:\n\n* [Create GitHub application on your personal account for use in development](https://github.com/settings/apps/new?name=Morph.io+(development)\u0026description=Get+structured+data+out+of+the+web\u0026url=http://127.0.0.1:5100\u0026callback_urls[]=http://127.0.0.1:5100/users/auth/github/callback\u0026setup_url=http://127.0.0.1:5100\u0026setup_on_update=true\u0026public=true\u0026webhook_active=false\u0026webhook_url=http://127.0.0.1:5100/github/webhook\u0026administration=write\u0026contents=write\u0026emails=read)\n* [Create GitHub application on your personal account for use in production](https://github.com/settings/apps/new?name=Morph.io\u0026description=Get+structured+data+out+of+the+web\u0026url=https://morph.io\u0026callback_urls[]=https://morph.io/users/auth/github/callback\u0026setup_url=https://morph.io\u0026setup_on_update=true\u0026public=true\u0026webhook_active=false\u0026webhook_url=https://morph.io/github/webhook\u0026administration=write\u0026contents=write\u0026emails=read)\n* [Create GitHub application on the openaustralia organization for use in production](https://github.com/organizations/openaustralia/settings/apps/new?name=Morph.io\u0026description=Get+structured+data+out+of+the+web\u0026url=https://morph.io\u0026callback_urls[]=https://morph.io/users/auth/github/callback\u0026setup_url=https://morph.io\u0026setup_on_update=true\u0026public=true\u0026webhook_active=false\u0026webhook_url=https://morph.io/github/webhook\u0026administration=write\u0026contents=write\u0026emails=read)\n\nYou will need to add and change a few values manually:\n* Disable \"Expire user authorization tokens\"\n* Add an image - you can use the standard logo at `app/assets/images/logo.png` (you can add this after the app is created)\n* If the webhooks are active and being used in production (currently not the case) then\n  you'll also need to add a \"Webhook secret\" for 
security.\n\nNext you'll need to fill in some values in the `.env` file which come from the GitHub App that you've just created.\n\n* `GITHUB_APP_ID` - Look for \"App ID\" near the top of the page. This should be an integer.\n* `GITHUB_APP_NAME` - Look for \"Public link\". The name is what appears after \"https://github.com/apps/\". It's essentially a URL-friendly version of the name you gave the app.\n* `GITHUB_APP_CLIENT_ID` - Look for \"Client ID\" near the top of the page.\n* `GITHUB_APP_CLIENT_SECRET` - Go to \"Generate a new client secret\".\n\nAlso, a private key for the GitHub app is needed. This can be generated by clicking the \"Generate a private key\" button and will be automatically downloaded. Move and rename it to `config/morph-github-app.private-key.pem`.\n\nNow set up the databases:\n\n    bundle exec dotenv rake db:setup\n\nNow you can start the server\n\n    bundle exec dotenv foreman start\n\nand point your browser at [http://127.0.0.1:3000](http://127.0.0.1:3000)\n\nTo get started, log in with GitHub. There is a simple admin interface\naccessible at [http://127.0.0.1:3000/admin](http://127.0.0.1:3000/admin). To\naccess this, run the following to give your account admin rights:\n\n    bundle exec rake app:promote_to_admin\n\n## Running tests\n\nIf you're running Guard (see below) the tests will also automatically run when you change a file.\n\nBy default, RSpec will skip tests that have been tagged as being slow. To change this behaviour, add the following to your `.env`:\n\n    RUN_SLOW_TESTS=1\n\nBy default, RSpec will run certain tests against a running Docker server. These tests are quite slow, but have not been tagged as slow. To stop RSpec from running these tests, add the following to your `.env`:\n\n    DONT_RUN_DOCKER_TESTS=1\n\n### Guard Livereload\n\nWe use Guard and Livereload so that whenever you edit a view in development the web page gets automatically reloaded. 
It's a massive time saver when you're doing design or lots of work in the view. To make it work run\n\n    bundle exec guard\n\nGuard will also run tests when needed. Some tests do integration tests against a\nrunning docker server. These particular tests are very slow. If you want to\ndisable them,\n\n```\nDONT_RUN_DOCKER_TESTS=1 bundle exec guard\n```\n\n### Mail in development\n\nBy default in development mails are sent to [Mailcatcher](http://mailcatcher.me/). To install\n\n    gem install mailcatcher\n\n## Deploying to production\n\nThis section will not be relevant to most people. It will however be relevant if you're deploying to a production server.\n\n### Ansible Vault\n\nWe're using [Ansible Vault](https://docs.ansible.com/ansible/2.4/vault.html) to encrypt certain files, like the private key for the SSL certificate.\n\nTo make this work you will need to put the password in a\nfile at `~/.infrastructure_ansible_vault_pass.txt`. This is the same password as used in the [openaustralia/infrastructure](https://github.com/openaustralia/infrastructure) GitHub repository.\n\n## Restarting Discourse\n\nDiscourse runs in a container and should usually be restarted automatically by docker.\n\nHowever, if the container goes away for some reason, it can be restarted:\n\n```\nroot@morph:/var/discourse# ./launcher rebuild app\n```\n\nThis will pull down the latest docker image, rebuild, and restart the container.\n\n## Production devops development\n\n\u003e This method defaults to creating a 4Gb VirtualBox VM, which can strain an 8Gb Mac. 
We suggest tweaking the Vagrantfile to restrict RAM usage to 2Gb at first, or using a machine with at least 12Gb RAM.\n\nInstall [Vagrant](http://www.vagrantup.com/), [VirtualBox](https://www.virtualbox.org) and [Ansible](http://www.ansible.com/).\n\nInstall a couple of Vagrant plugins: `vagrant plugin install vagrant-hostsupdater vagrant-disksize`\n\nInstall [rbenv](https://github.com/rbenv/rbenv) and [ruby-build](https://github.com/rbenv/ruby-build#readme).\n\nIf on Ubuntu, install libreadline-dev and libsqlite3-dev: `sudo apt install libreadline-dev libsqlite3-dev`\n\nInstall the required Ruby version: `rbenv install`\n\nInstall Capistrano: `gem install capistrano`\n\nRun `make roles` to install some required Ansible roles.\n\nRun `vagrant up local`. This will build and provision a box that looks and acts like production at `dev.morph.io`.\n\nOnce the box is created and provisioned, deploy the application to your Vagrant box:\n\n    cap local deploy\n\nNow visit https://dev.morph.io/\n\n## Production provisioning and deployment\n\nTo deploy morph.io to production, normally you'll just want to deploy using Capistrano:\n\n    cap production deploy\n\nWhen you've changed the Ansible playbooks to modify the infrastructure you'll want to run:\n\n    make ansible\n\n## SSL certificates\n\nWe're using Let's Encrypt for SSL certificates. 
It's not 100% automated.\nOn a completely fresh install (with a new domain) as root:\n```\ncertbot --nginx certonly -m contact@oaf.org.au --agree-tos\n```\n\nIt should show something like this:\n```\nWhich names would you like to activate HTTPS for?\n-------------------------------------------------------------------------------\n1: morph.io\n2: api.morph.io\n3: faye.morph.io\n4: help.morph.io\n```\n\nLeave your answer blank, which will install the certificate for all of them.\n\n### Installing certificates for local vagrant build\n\n    sudo certbot certonly --manual -d dev.morph.io --preferred-challenges dns -d api.dev.morph.io -d faye.dev.morph.io -d help.dev.morph.io\n\n### Scraper\u003c-\u003emitmdump SSL\n\nScrapers talk out to the internet by being routed through the mitmdump2\nproxy container. The default container you'll get on a devops install\nhas no SSL certificates. This makes it easy for traffic to get out,\nbut means we can't replicate some problems that occur when the SSL\nvalidation fails.\n\nTo work around this, you'll have to rebuild the mitmdump container. Look in `/var/www/current/docker_images/morph-mitmdump`; there's a `Makefile` that will aid in building the new image.\n\nOnce that's done, you'll need to build a new version of the `openaustralia/buildstep` image:\n\n* `cd`\n* `git clone https://github.com/openaustralia/buildstep.git`\n* `cd buildstep`\n* `cp /var/www/current/docker_images/morph-mitmdump/mitmproxy/mitmproxy-ca-cert.pem .`\n* `docker image build -t openaustralia/buildstep:latest .`\n\nYou should now be able to see in `docker image list --all` that your new image is ready. 
The next time you run a scraper it will be rebuilt using the new buildstep image.\n\n# How to contribute\n\nIf you find what looks like a bug:\n\n* Check the [GitHub issue tracker](http://github.com/openaustralia/morph/issues/)\n  to see if anyone else has reported the issue.\n* If you don't see anything, create an issue with information on how to reproduce it.\n\nIf you want to contribute an enhancement or a fix:\n\n* Fork the project on GitHub.\n* Make your changes with tests.\n* Commit the changes without making changes to any files that aren't related to your enhancement or fix.\n* Send a pull request.\n\nWe maintain a list of [issues that are easy fixes](https://github.com/openaustralia/morph/issues?labels=easy+fix\u0026milestone=\u0026page=1\u0026state=open). Fixing\none of these is a great way to get started while you get familiar with the codebase.\n\n# Copyright \u0026 License\n\nCopyright OpenAustralia Foundation Limited. Licensed under the Affero GPL. See LICENSE file for more details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopenaustralia%2Fmorph","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopenaustralia%2Fmorph","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopenaustralia%2Fmorph/lists"}