{"id":20355828,"url":"https://github.com/zackproser/pageripper-v2","last_synced_at":"2025-03-04T17:31:21.242Z","repository":{"id":214414458,"uuid":"731205844","full_name":"zackproser/pageripper-v2","owner":"zackproser","description":"Enhanced version of the Pageripper API ","archived":false,"fork":false,"pushed_at":"2024-04-03T17:33:26.000Z","size":3475,"stargazers_count":1,"open_issues_count":3,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-01-15T01:07:55.420Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://www.zackproser.com/blog/introducing-pageripper-api","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zackproser.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2023-12-13T15:11:01.000Z","updated_at":"2023-12-30T16:11:53.000Z","dependencies_parsed_at":"2023-12-28T03:24:36.197Z","dependency_job_id":"6463bb9b-2bbd-4d89-af35-3e472c169d4d","html_url":"https://github.com/zackproser/pageripper-v2","commit_stats":null,"previous_names":["zackproser/pageripper-v2"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zackproser%2Fpageripper-v2","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zackproser%2Fpageripper-v2/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zackproser%2Fpageripper-v2/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zackproser%2Fpageripper-v2/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/h
osts/GitHub/owners/zackproser","download_url":"https://codeload.github.com/zackproser/pageripper-v2/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241889588,"owners_count":20037535,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-14T23:14:16.397Z","updated_at":"2025-03-04T17:31:21.224Z","avatar_url":"https://github.com/zackproser.png","language":"TypeScript","readme":"# Pageripper v2 \n\n[![Pageripper Tests](https://github.com/zackproser/pageripper-v2/actions/workflows/build-and-test.yml/badge.svg)](https://github.com/zackproser/pageripper-v2/actions/workflows/build-and-test.yml)\n[![OpenAPI spec current](https://github.com/zackproser/pageripper-v2/actions/workflows/openapi.yml/badge.svg)](https://github.com/zackproser/pageripper-v2/actions/workflows/openapi.yml)\n[![OpenAPI spec valid](https://github.com/zackproser/pageripper-v2/actions/workflows/validate-openapi.yml/badge.svg)](https://github.com/zackproser/pageripper-v2/actions/workflows/validate-openapi.yml)\n[![Pulumi Deploy](https://github.com/zackproser/pageripper-v2/actions/workflows/pulumi-deploy.yml/badge.svg)](https://github.com/zackproser/pageripper-v2/actions/workflows/pulumi-deploy.yml)\n\n![pageripperv2](./img/pageripper-v2.png)\n\nThe Pageripper API extracts data from webpages; even if they are single page applications (SPAs) that are rendered via Javascript.\n\nPageripper allows you to customize the behavior of its headless browser on a per-request basis. 
Pageripper uses [Puppeteer](https://github.com/puppeteer/puppeteer).\n\n## Example\n\nFor example, making a request to Pageripper to extract data from my portfolio site, which is a Next.js Single Page Application deployed to Vercel: \n\n```bash\ncurl -sX POST \\\n    -H 'Content-type: application/json' \\\n    -d '{\"url\": \"https://zackproser.com\", \"options\": {\"waitUntilEvent\":\"networkidle2\"}}' \\\n    https://api.pageripper.com/extracts\n\n```\n\nResults in: \n\n```javascript\n{\n  \"emails\": [],\n  \"twitterHandles\": [],\n  \"socialMediaLinks\": [\n    \"https://twitter.com/zackproser\",\n    \"https://instagram.com/zackproser\",\n    \"https://linkedin.com/in/zackproser\"\n  ],\n  \"mediaContentLinks\": [],\n  \"downloadLinks\": [],\n  \"ecommerceLinks\": [],\n  \"urls\": {\n    \"internal\": [\n      \"https://zackproser.com/\",\n      \"https://zackproser.com/about\",\n      \"https://zackproser.com/blog\",\n      \"https://zackproser.com/videos\",\n      \"https://zackproser.com/projects\",\n      \"https://zackproser.com/testimonials\",\n      \"https://zackproser.com/contact\",\n      \"https://zackproser.com/learn\",\n      \"https://zackproser.com/blog/comic-strip-long-day\",\n      \"https://zackproser.com/blog/pinecone-reference-architecture-launch\",\n      \"https://zackproser.com/blog/run-your-own-tech-blog\",\n      \"https://zackproser.com/blog/how-to-generate-images-with-ai\"\n    ],\n    \"external\": [\n      \"https://github.com/zackproser\",\n      \"https://twitter.com/zackproser\",\n      \"https://instagram.com/zackproser\",\n      \"https://linkedin.com/in/zackproser\"\n    ]\n  }\n}\n\n```\n\nLinks found are categorized into internal, external, and other useful groups such as download and ecommerce links.\n\nAdditional parsing and extraction capabilities will be added to Pageripper over time.\n\n## Features and Capabilities\n\n* Extracts various data types: emails, URLs, social media, and media links.\n* Supports SPA and 
JavaScript-heavy websites.\n* Customizable extraction options.\n\n## API Documentation \n\n[Read the Docs](https://zackproser.github.io/pageripper-v2/)\n\n## OpenAPI Spec integration \u0026 GitHub Pages\n\nPageripper publishes an OpenAPI spec in this repository at `spec/openapi.yml`. \n\nThis spec is programmatically validated, updated and published at [https://zackproser.github.io/pageripper-v2/](https://zackproser.github.io/pageripper-v2/)\nevery time it is modified via a pull request.\n\nThis ensures that API consumers have the latest information and that we're aware of any breaking changes early in the software development lifecycle.\n\n## Infrastructure as Code (IaC) with Pulumi \n\nPageripper is deployed on AWS and defined in Pulumi TypeScript, meaning that the entire architecture of the API, from its Docker container to its log groups, \nload balancer, and security groups, is defined as code. \n\nThis means that this repository contains both the application code (Node.js, TypeScript, Dockerfile, etc.) and the infrastructure code for the production API. 
\n\nDefining the cloud infrastructure as code enables tighter iterative loops between developing new features, \n\n## Continuous Integration and Delivery (CI/CD)\n\nThis repository is configured with GitHub Actions that run in response to lifecycle events: \n\n```mermaid\nsequenceDiagram\n    participant Dev as Developer\n    participant GH as GitHub Repo\n    participant PR as Pull Request\n    participant Tests as Tests\n    participant Pulumi as Pulumi\n    participant AWS as AWS Deployment\n\n    Dev-\u003e\u003eGH: Push code\n    GH-\u003e\u003ePR: Open/Update Pull Request\n    PR-\u003e\u003eTests: Run tests\n    Tests--\u003e\u003ePR: Test results\n    PR-\u003e\u003ePulumi: Trigger `pulumi preview`\n    Pulumi-\u003e\u003ePR: Post comment with changes\n    PR-\u003e\u003eGH: Merge to main (if tests pass)\n    GH-\u003e\u003ePulumi: Trigger `pulumi update`\n    Pulumi-\u003e\u003eAWS: Deploy to AWS\n```\n\nPhilosophically, this setup emerges from DevOps principles (as found in The Phoenix Project and elsewhere) about reducing the friction required to \nmaintain this code. 
\n\nWhen developers make changes and open a pull request, they get feedback in under 5 minutes via unit tests, test builds for the app and Docker, etc.\n\nWhen developers merge code that has passed all tests, it is automatically deployed to production, so that the `HEAD` of the `main` branch always represents what is deployed to production.\n\n### Automated testing \n\nEvery time a pull request is opened against this repository, the following tests are run automatically: \n\n* Pageripper unit tests with Jest \n* An `npm build` that compiles the application\n* A Docker build that bundles Puppeteer and the application code \n* An OpenAPI spec validation\n\n### Preview deployments on pull requests\n\nWhen a pull request is opened against this repository, a `pulumi preview` is run via CI/CD against the AWS account where Pageripper is deployed: \n\n![pulumi preview](./img/pulumi-preview.png)\n\n## How it works \n\nPageripper fetches data from the URLs you specify. You can configure Pageripper's behavior on a per-request basis. \n\n```mermaid\nsequenceDiagram\n    participant User\n    participant Pageripper API\n    participant Target URL\n\n    User-\u003e\u003ePageripper API: Request data extraction (URL \u0026 options)\n    Pageripper API-\u003e\u003eTarget URL: Fetch webpage content\n    Target URL--\u003e\u003ePageripper API: Webpage content\n    Pageripper API-\u003e\u003ePageripper API: Extract specified data\n    Pageripper API--\u003e\u003eUser: Return extracted data\n```\n\n\n## Usage and Examples\n\nTo use Pageripper, send a POST request to `/extracts` with the target URL and options. 
Example:\n\n```javascript\n// Example request using Node.js 18+ (built-in fetch), run as an ES module\nconst response = await fetch('https://api.pageripper.com/extracts', {\n  method: 'POST',\n  headers: { 'Content-Type': 'application/json' },\n  body: JSON.stringify({ url: 'https://example.com', options: { waitUntilEvent: 'networkidle2' } })\n});\n// Parse the JSON body containing the extracted data\nconst data = await response.json();\n```\n\nThe production instance of the Pageripper API is [up and available on RapidAPI](https://rapidapi.com/zackproser/api/pageripper).\n\n## License\n\nPageripper v2 is released under the MIT License. See the LICENSE file for more details.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzackproser%2Fpageripper-v2","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzackproser%2Fpageripper-v2","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzackproser%2Fpageripper-v2/lists"}