{"id":26296418,"url":"https://github.com/charliedigital/playwright-scrape-api","last_synced_at":"2025-05-09T00:51:54.125Z","repository":{"id":189219357,"uuid":"680275112","full_name":"CharlieDigital/playwright-scrape-api","owner":"CharlieDigital","description":"A dead simple REST API to use Playwright to scrape the text contents from any URL.","archived":false,"fork":false,"pushed_at":"2023-10-07T13:47:13.000Z","size":2615,"stargazers_count":26,"open_issues_count":0,"forks_count":2,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-03-31T19:51:15.031Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CharlieDigital.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-08-18T19:20:04.000Z","updated_at":"2024-11-07T08:17:34.000Z","dependencies_parsed_at":null,"dependency_job_id":"6ce733af-3dc0-4caf-9a14-27c910901dc6","html_url":"https://github.com/CharlieDigital/playwright-scrape-api","commit_stats":null,"previous_names":["charliedigital/playwright-scrape-api"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CharlieDigital%2Fplaywright-scrape-api","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CharlieDigital%2Fplaywright-scrape-api/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CharlieDigital%2Fplaywright-scrape-api/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CharlieDigital%2Fplaywright-scrape-api/manifests","owner_url":"https://re
pos.ecosyste.ms/api/v1/hosts/GitHub/owners/CharlieDigital","download_url":"https://codeload.github.com/CharlieDigital/playwright-scrape-api/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253171232,"owners_count":21865289,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-03-15T04:18:20.109Z","updated_at":"2025-05-09T00:51:54.097Z","avatar_url":"https://github.com/CharlieDigital.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Dead Simple Playwright Scraper\n\nFor working with LLMs like ChatGPT, you may not need the full structure of the page; you probably just need the text content on the page. 
This is the case, for example, when you’re summarizing a blog post or article.\n\nThe easiest way to extract this in your browser is to open up your devtools and type the following in the JavaScript console:\n\n```js\ndocument.body.innerText\n```\n\n![Example](/images/document-body-cap.gif)\n\nThis returns all of the rendered text in the HTML document.\n\n\u003e 💡 Note: if the page is \"noisy\", you can just select the containing node for the main content and grab the `innerText` of that node.\n\nTo do this as an API, we can either process the document as HTML/XML and manually parse out the text nodes or use a headless browser like Playwright.\n\nThe advantage of Playwright is that it will also work with single-page applications (SPAs), which only load the DOM after the page scripts are executed.\n\nThis makes it handy for more general-purpose scraping of text content from both server-side generated pages and SPAs.\n\nThe goal of this walkthrough is to build a performant, easy-to-use REST API that we can perhaps call from OpenAI Function Calling.\n\n## Using Playwright\n\n[Microsoft's Playwright](https://playwright.dev/) is a headless browser automation tool that provides both testing and automation SDKs.\n\nHere, we'll be using the automation SDK to interact with a target URL.  
The beauty of it is that it's _dead-simple_:\n\n```csharp\napp.MapGet(\"/\", async (\n  [FromQuery] string url\n) =\u003e {\n  url = HttpUtility.UrlDecode(url);\n  using var playwright = await Playwright.CreateAsync();\n  await using var browser = await playwright.Chromium.LaunchAsync(new() {\n    Headless = true\n  });\n\n  await using var context = await browser.NewContextAsync();\n  var page = await context.NewPageAsync();\n  await page.GotoAsync(url);\n  await page.WaitForLoadStateAsync(LoadState.DOMContentLoaded);\n  var text = await page.EvaluateAsync\u003cstring\u003e(\"document.body.innerText\");\n\n  return text;\n});\n```\n\nAnd in TypeScript:\n\n```ts\napp.get(\"/\", cors(corsOpts), async (req: Request, res: Response) =\u003e {\n  const url = decodeURIComponent(req.query.url as string)\n\n  const browser = await chromium.launch({\n    headless: true\n  });\n\n  const context = await browser.newContext();\n  const page = await context.newPage();\n  await page.goto(url);\n  await page.waitForLoadState(\"domcontentloaded\");\n  const text = await page.evaluate(\"document.body.innerText\");\n\n  // Close the browser so Chromium processes don't pile up across requests.\n  await browser.close();\n\n  res.status(200).send(text);\n})\n```\n\n## How Do I Use It?\n\nTo run the .NET example:\n\n```shell\ncd dotnet6\ndotnet run\n\ncurl http://localhost:5005\\?url\\=https://chrlschn.dev\n```\n\nAnd the TypeScript example:\n\n```shell\ncd typescript\ntsc\nnode dist/index.js\n\ncurl http://localhost:8080\\?url\\=https://chrlschn.dev\n```\n\n## Deploying\n\nReady to deploy this for your own use?  This codebase is ready to go!  You can deploy easily into either AWS using Copilot or into Google Cloud using Cloud Run (basically free).\n\nBoth the .NET and TypeScript versions build on top of the Microsoft Playwright container image.\n\n- .NET: https://hub.docker.com/_/microsoft-playwright-dotnet\n- Node: https://hub.docker.com/_/microsoft-playwright\n\nThis base image is quite hefty and includes installations of all three browsers (Chromium, Firefox, and WebKit).  
You can also consider using a third-party image or building your own to trim down the size of the image.\n\nOur `Dockerfile` for .NET:\n\n```dockerfile\n# (1) The build environment\nFROM mcr.microsoft.com/dotnet/sdk:6.0-jammy as build\nWORKDIR /app\n\n# (2) Copy the .csproj and restore; this caches these layers so they are not re-run when nothing has changed.\nCOPY ./playwright-scrape.csproj ./playwright-scrape.csproj\nRUN dotnet restore\n\n# (3) Copy the application files and build.\nCOPY ./Program.cs ./Program.cs\nRUN dotnet publish ./playwright-scrape.csproj -o /app/published-app --configuration Release\n\n# (4) The dotnet tagged Playwright environment includes .NET\nFROM mcr.microsoft.com/playwright/dotnet:v1.37.0-jammy as playwright\nWORKDIR /app\nCOPY --from=build /app/published-app /app\n\nENV IS_CONTAINER=true\n\n# (5) Start our app!\nENTRYPOINT [ \"dotnet\", \"/app/playwright-scrape.dll\" ]\n```\n\nAnd for Node:\n\n```dockerfile\nFROM mcr.microsoft.com/playwright:v1.34.0-jammy\n\n# Create app directory in the image\nWORKDIR /usr/src/app\n\n# Copy over assets\nCOPY package.json ./\nCOPY yarn.lock ./\n\n# Install dependencies.\nRUN yarn install --immutable --immutable-cache --check-cache\n\n# Copy source\nCOPY . .\n\n# Build the TypeScript\nRUN npx tsc\n\n# Start the server.\nEXPOSE 8080\nCMD [\"node\", \"dist/index.js\"]\n```\n\n### Google Cloud Run\n\nTo start with, enable the Google Cloud Run API in your Google Cloud account. For most normal use cases, this will be free since you’d have to run ***a lot of*** requests before consuming the free tier quota.\n\n![Billing](/images/google-calc.png)\n\nUse either `build-deploy-gcr.sh` to build and deploy via Artifact Registry or `build-deploy-gcr-src.sh` to build and deploy from source.\n\nThe former is faster to cycle with since you push smaller layers on changes.  The latter is perhaps more convenient.\n\n```shell\n# Deploy source into a Cloud Build pipeline\ngcloud run deploy $gcloud_svc \\\n  --source=. 
\\\n  --allow-unauthenticated \\\n  --port=8080 \\\n  --min-instances=0 \\\n  --max-instances=1 \\\n  --cpu-boost \\\n  --memory=1Gi\n```\n\nTweak these parameters for your needs, but note that this configuration will scale to zero, meaning that you'll incur no cost except when the application is handling a request.  Keep in mind, too, that the free tier quota is quite generous.  You'd need to consume quite a bit before incurring costs.\n\n### AWS via Copilot\n\nThe deployment to AWS is via [Copilot](https://aws.github.io/copilot-cli/), which will deploy the application as an ECS Fargate container.  This involves a few more hoops since it requires a lot more infrastructure on the AWS side.\n\nYou'll need to start with a one-time setup of the AWS infrastructure via Copilot:\n\n```shell\n# Initialize the application\ncopilot init\n\n# Initialize the environment\n# Don't need the AWS_PROFILE if your default profile is non-root admin\nAWS_PROFILE=profile_name copilot env init\n\n# Deploy the environment (need an admin, non-root account)\nAWS_PROFILE=profile_name copilot env deploy --name prod\n\n# Deploy the application\nAWS_PROFILE=profile_name copilot deploy --env prod\n```\n\nOnce you've got it all set up, use `build-deploy-aws.sh` to deploy into it.\n\n## IMPORTANT NOTES\n\nThis API endpoint isn't secured.  At a minimum -- depending on how you want to use it -- add a hard-coded API key and pass it in via a header.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcharliedigital%2Fplaywright-scrape-api","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcharliedigital%2Fplaywright-scrape-api","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcharliedigital%2Fplaywright-scrape-api/lists"}