{"id":40209598,"url":"https://github.com/meedan/pender","last_synced_at":"2026-01-19T21:04:17.152Z","repository":{"id":11446302,"uuid":"52640206","full_name":"meedan/pender","owner":"meedan","description":"URL parsing, archiving and rendering service for Meedan Check, a collaborative media annotation platform","archived":false,"fork":false,"pushed_at":"2025-11-04T20:54:41.000Z","size":5332,"stargazers_count":10,"open_issues_count":1,"forks_count":13,"subscribers_count":9,"default_branch":"main","last_synced_at":"2025-11-04T22:22:39.094Z","etag":null,"topics":["json-ld","metatags","oembed","opengraph-tags","schema-org","twitter-cards"],"latest_commit_sha":null,"homepage":"https://meedan.com/check","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/meedan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":"CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2016-02-26T23:44:04.000Z","updated_at":"2025-11-04T20:54:07.000Z","dependencies_parsed_at":"2023-10-13T08:15:35.241Z","dependency_job_id":"224c7ea0-e31a-42ad-8808-9f834735e006","html_url":"https://github.com/meedan/pender","commit_stats":null,"previous_names":[],"tags_count":206,"template":false,"template_full_name":null,"purl":"pkg:github/meedan/pender","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/meedan%2Fpender","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/meedan%2Fpender/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories
/meedan%2Fpender/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/meedan%2Fpender/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/meedan","download_url":"https://codeload.github.com/meedan/pender/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/meedan%2Fpender/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28585302,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-19T20:45:59.482Z","status":"ssl_error","status_checked_at":"2026-01-19T20:45:41.500Z","response_time":67,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["json-ld","metatags","oembed","opengraph-tags","schema-org","twitter-cards"],"created_at":"2026-01-19T21:04:16.996Z","updated_at":"2026-01-19T21:04:17.141Z","avatar_url":"https://github.com/meedan.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Pender\n\n[![Build and Run Integration Test](https://github.com/meedan/pender/actions/workflows/ci-test-pr.yml/badge.svg)](https://github.com/meedan/pender/actions/workflows/ci-test-pr.yml)\n\nPender is a service for link parsing, archiving and rendering. 
It is one of the services that supports [Check](https://meedan.com/check), an open source platform for collaborative fact-checking and media annotation.\n\n## Parsing\n\nThe url is visited, parsed and the data found is used to create a media and its attributes. The data can be obtained by API or parsing directly the HTML.\n\nIn addition to parsing any link with an oEmbed endpoint or metatags, Pender supports a few page-specific parsers.\n\n##### You can find a list of page-specific parsers in the [Pender wiki](https://github.com/meedan/pender/wiki/Supported-Page%E2%80%90Specific-Parsers).\n\n## Archiving\n\nWhen making a request to parse a URL, you can also request that the URL be archived. Currently, we support:\n\n* **Archive.org**\n  * This archiver requires `archive_org_access_key` and `archive_org_secret_key` to be set in `config/config.yml`. Get your account keys at [archive.org](https://archive.org/account/s3.php).\n\n* **Perma.cc**\n  * This archiver requires a `perma_cc_key` to be set in `config/config.yml`. Get your account key at [perma.cc](https://perma.cc).\n\n## Setup\n\nTo set Pender up locally:\n\n```\ngit clone https://github.com/meedan/pender.git\ncd pender\nfind -name '*.example' | while read f; do cp \"$f\" \"${f%%.example}\"; done\n```\n\nTo run Pender in development mode:\n\n```\n$ docker-compose build\n$ docker-compose up --abort-on-container-exit\n```\nOpen http://localhost:3200/api-docs/index.html to access Pender API directly.\n\n## Running tests as CI\nTo run the full test suite of Pender tests locally the way CI runs them:\n\n```\nbin/get_env_vars.sh\ndocker build . 
-t pender\ndocker compose -f docker-test.yml up pender\ndocker compose -f docker-test.yml exec pender test/setup-parallel\ndocker compose -f docker-test.yml exec pender bundle exec rake \"parallel:test[3]\"\ndocker compose -f docker-test.yml exec pender bundle exec rake \"parallel:spec\"\n```\n\n## Setting Cookies for Requests\n\nWe send cookies with certain requests that require logged-in users (e.g. Instagram, TikTok).\n\n**In development**\nTo provide these for development, log in on your browser and copy the cookie information to `config/cookies.txt`. The location of this file can also be configured as `cookies_file_path` in `config.yml`\n\nTo do this easily in Chrome:\n1. Install the [Get cookies.txt](https://chrome.google.com/webstore/detail/get-cookiestxt/bgaddhkoddajcdgocldbbfleckgcbcid) browser extension\n1. Log into the website (e.g. instagram.com)\n1. Using the browser extension, export cookies on the page you want to view\n1. Replace the entries in `config/cookies.txt` with the downloaded `cookies.txt`\n\n**Note**: If you do install this extension, consider doing it on a limited Chrome profile since it requires read and write permission for all websites.\n\n**In deployed environments**\nDeployed environment cookies are stored in S3. To update them, use steps 1-3 above and then update the remote file in AWS. The path to this file can be found for each environment in [SSM](https://meedan.atlassian.net/wiki/spaces/ENG/pages/1126694913/How+to+get+and+set+configuration+values+and+secrets+on+SSM).\n\n## API\n\nTo make requests to the API, you must set a request header with the value of the configuration option `authorization_header` – by default, this is `X-Pender-Token`. 
The value of that header should be the API key that you have generated using `bundle exec rake lapis:api_keys:create`, or any API key that was given to you.\n\n##### In the wiki you'll find examples of [requests and responses](https://github.com/meedan/pender/wiki/Requests-and-Responses).\n\n## Webhook Notification\n\nThe archiving feature uses asynchronous events. Pender can notify your application after it sends URLs for archiving.\n\nPender sends the `url`, `type` and the information associated with the event. The webhook endpoint should have an associated URL (e.g., http://api:3000/api/webhooks/keep) and a token. This information should be added to the API key's `application_settings`: `api_key.application_settings = {:webhook_url=\u003e\"http://api:3000/api/webhooks/keep\", :webhook_token=\u003e\"somethingsecret\"}`\n\n## Rake tasks\n\nThere are rake tasks for a few operations (besides Rails' default ones). Run them this way: `bundle exec rake \u003ctask name\u003e`\n\n* `test:coverage`: Run all tests and calculate test coverage\n* `application=\u003capplication name\u003e lapis:api_keys:create`: Create a new API key for an application\n* `lapis:api_keys:delete_expired`: Delete all expired keys\n* `lapis:error_codes`: List all error codes that this application can return\n* `lapis:licenses`: List the licenses of all libraries used by this project\n* `lapis:client:ruby`: Generate a client Ruby gem, that allows other applications to communicate and test this service\n* `lapis:client:php`: Generate a client PHP library, that allows other applications to communicate and test this service\n* `lapis:docs`: Generate the documentation for this API, including models and controllers diagrams, Swagger, API endpoints, licenses, etc.\n* `lapis:docker:run`: Run the application in Docker\n* `lapis:docker:shell`: Enter the Docker container\n\n## How to add a new parser\n\n* Add a new file at `app/models/concerns/parser/\u003cprovider\u003e_\u003ctype\u003e.rb` (example... 
`provider` could be `facebook` and type could be `post` or `profile`)\n* Include the class in the `PARSERS` array in `app/models/media.rb`\n* It should return at least `published_at`, `username`, `title`, `description` and `picture`\n* If `type` is `item`, it should also return the `author_url` and `author_picture`\n* The skeleton should look like this:\n\n```ruby\nmodule Parser\n  class \u003cProvider\u003e\u003cType\u003e \u003c Base\n    class \u003c\u003c self\n      def type\n        '\u003cprovider\u003e_\u003ctype\u003e'.freeze\n      end\n\n      def patterns\n        # A list of regex that tell us when we've landed on a URL for this parser, eg facebook.com\n        [\u003clist of URL patterns\u003e]\n      end\n\n      def ignored_urls\n        # Optional method to specify disallowed URLs. We generally use this to detect\n        # when we've been redirected to a dead end, like a login page.\n        #\n        # Should return an array in format:\n        # [\n        #   {\n        #     pattern: /^https:\\/\\/www\\.instagram\\.com\\/accounts\\/login/,\n        #     reason: :login_page\n        #   },\n        # ]\n      end\n    end\n\n    private    \n\n    def parse_data_for_parser(doc, original_url)\n      # Populate `@parsed_data` with information and return parsed_data at the end of the function\n      # `@parsed_data` is a hash whose key is the attribute and the value is... the value\n    end\n\n    def oembed_url(doc)\n      # Optional method to define an Oembed URL, will default to looking in HTML in Parser::Base\n      # Passed to OembedItem\n    end\n  end\nend\n```\n\nIf shared behavior is needed between parsers of the same provider, make a provider class as a concern and include it in the class.\nSee ProviderInstagram, ProviderYoutube, ProviderFacebook, ProviderTwitter, or ProviderTiktok for examples.\n\n### URL Parameters Normalization\n\nSome service providers include URL parameters for tracking purposes that can be safely removed. 
Pender parsers can define a list of such parameters to be removed during the URL normalization process.\n\nTo define URL parameters to be removed, a parser class should implement the `urls_parameters_to_remove` method, which returns an array of strings representing the parameters to be stripped. For example:\n\n```ruby\ndef urls_parameters_to_remove\n  ['ighs']\nend\n```\n\n## How to add a new archiver\n\n* Add a new file at `app/models/concerns/media_\u003cname\u003e_archiver.rb`\n* Include the class in `app/models/media.rb`\n* It should have a method `archive_to_\u003cname\u003e`\n* It should call method `Media.declare_archiver`, saying the URL patterns it supports (using the `only` modifier) or the URL patterns it doesn't support (using the `except` modifier)\n* The skeleton should look like this:\n\n```ruby\nmodule Media\u003cName\u003eArchiver\n  extend ActiveSupport::Concern\n\n  included do\n    Media.declare_archiver('\u003cname\u003e', [\u003clist of URL patterns as regular expressions\u003e], :only) # Or :except instead of :only\n  end\n\n  def archive_to_\u003cname\u003e\n    # Archive and then update cache (if needed) and call webhook (if needed)\n    Media.notify_webhook_and_update_cache(\u003cname\u003e, url, data, key_id)\n  end\nend\n```\n\n## Error reporting\n\nWe use Sentry for tracking exceptions in our application.\n\nBy default we unset `sentry_dsn` in the `config.yml`, which prevents\ninformation from being reported to Sentry. If you would like to see data reported from your local machine, set `sentry_dsn` to the value provided for Pender in the Sentry app.\n\n### Additional configuration\n\n**In config.yml**\n  * `sentry_dsn` - the secret that allows us to send information to Sentry, available in the Sentry web app. Scoped to a service (e.g. Pender)\n  * `sentry_environment` - the environment reported to Sentry (e.g. dev, QA, live)\n  * `sentry_traces_sample_rate` - not currently used, since we don't use Sentry for tracing. 
Set to 0 in config as result.\n\n**In `02_sentry.rb`**\n  * `config.excluded_exceptions` - a list of exception classes that we don't want to send to Sentry\n\n## Observability\n\nWe use Honeycomb for monitoring information about our application. It is currently configured to suppress Honeycomb reporting when the Open Telemetry required config is unset, which we would expect in development; however it is possible to report data from your local environment to either console or remotely to Honeycomb for troubleshooting purposes.\n\n### Enable reporting of Data from your local machine\nIf you would like to see data reported from your local machine, do the following:\n\n**Local console**\n1. Make sure that the `otlp_exporter` prefixed values are set in `config.yml` following `config.yml.example`. The values provided in `config.yml.example` can be used since we don't need a real API key.\n1. In `lib/pender/open_telemetry_config.rb`, uncomment the line setting exporter to 'console'. Warning: this is noisy!\n1. Restart the server\n1. View output in local server logs\n\n**On Honeycomb**\n1. Make sure that the `otlp_exporter` prefixed values are set in `config.yml` following `config.yml.example`\n1. In the config key `otel_exporter_otlp_headers`, set `x-honeycomb-team` to a Honeycomb API key for the Development environment (a sandbox where we put anything). This can be found in the [Honeycomb web interface](https://ui.honeycomb.io/meedan/environments/dev/api_keys). To track your own reported info, be sure to set the `otel_resource_attributes.developer.name` key in `config.yml` to your own name or unique identifier (e.g. `christa`). You will need this to filter information on Honeycomb.\n1. Restart the server\n1. 
See reported information in Development environment on Honeycomb\n\n### Configuring sampling\n\nTo enable sampling for Honeycomb, set the following configuration (either in `config.yml` locally, or via environment for deployments):\n\n* `otel_traces_sampler` to a supported sampler. See the Open Telemetry documentation for supported values.\n* `otel_custom_sampling_rate` to an integer value. This will be used to calculate and set `OTEL_TRACES_SAMPLER_ARG` (1 / `\u003csample_rate\u003e`) and to append a sampler-related value to `OTEL_RESOURCE_ATTRIBUTES` (as `SampleRate=\u003csample_rate\u003e`).\n\n**Note**: If sampling behavior is changed in Pender, we will also need to update the behavior to match in any other application reporting to Honeycomb. More [here](https://docs.honeycomb.io/getting-data-in/opentelemetry/ruby/#sampling)\n\n### Environment overrides\n\nOften for rake tasks or background jobs, we will either want none of the data (skip reporting) or all of the data (skip sampling). For these cases we can set specific environment variables:\n\n* To skip reporting to Honeycomb, set `PENDER_SKIP_HONEYCOMB` to `true`\n* To skip sampling data we want to report to Honeycomb, set `PENDER_SKIP_HONEYCOMB_SAMPLING` to `true`\n\n## Credits\n\nMeedan (hello@meedan.com)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmeedan%2Fpender","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmeedan%2Fpender","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmeedan%2Fpender/lists"}