{"id":23279944,"url":"https://github.com/gallolabs/web-surfer","last_synced_at":"2025-06-15T13:39:32.087Z","repository":{"id":259543727,"uuid":"876990164","full_name":"gallolabs/web-surfer","owner":"gallolabs","description":"Easy web Surfing (scraping) via API, firefox, chrome, webkit","archived":false,"fork":false,"pushed_at":"2025-01-04T15:57:09.000Z","size":287,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-01-04T16:32:17.045Z","etag":null,"topics":["api","bot","browser","chrome","firefox","scraping","web"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gallolabs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-22T22:32:48.000Z","updated_at":"2025-01-04T15:57:12.000Z","dependencies_parsed_at":"2024-10-26T11:29:08.280Z","dependency_job_id":"f232b8a5-cef9-4f17-8e9e-1d0556f54a77","html_url":"https://github.com/gallolabs/web-surfer","commit_stats":{"total_commits":41,"total_committers":2,"mean_commits":20.5,"dds":"0.024390243902439046","last_synced_commit":"d0b286b499fc3b331cff5cf148f1adcdf9777216"},"previous_names":["gallolabs/bobot"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gallolabs%2Fweb-surfer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gallolabs%2Fweb-surfer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gallolabs%2Fweb-surfer/rel
eases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gallolabs%2Fweb-surfer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gallolabs","download_url":"https://codeload.github.com/gallolabs/web-surfer/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":238533300,"owners_count":19488159,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["api","bot","browser","chrome","firefox","scraping","web"],"created_at":"2024-12-19T23:19:45.206Z","updated_at":"2025-06-15T13:39:32.043Z","avatar_url":"https://github.com/gallolabs.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n    \u003cimg height=\"300\" src=\"https://raw.githubusercontent.com/gallolabs/web-surfer/main/logo_w300.jpeg\"\u003e\n  \u003ch1 align=\"center\"\u003eWeb Surfer\u003c/h1\u003e\n\u003c/p\u003e\n\n## Description\n\nWeb-Surfer is a web service that automates web surfing (e.g. scraping).\n\n### Launch\n\n```sh\nnpm i\nsudo docker compose up\n```\n\n### Test\n\nYou can use the following command:\n\n```sh\n# Prerequisite: a running service (e.g. npm start)\nnpm run build # Build surf-cmd\nnpm run serve-lib # Start a server to serve libs (for import tests)\nnode dist/surf-cmd.js --surf-api 'http://localhost:3000' tests/doctolib.yaml --url 
'https://www.doctolib.fr/chirurgien-visceral-et-digestif/le-blanc-mesnil/nouredine-oukachbi/booking/availabilities?specialityId=179\u0026telehealth=false\u0026placeId=practice-5105\u0026motiveIds%5B%5D=860154\u0026pid=practice-5105'\n```\n\nThis returns your doctor's availabilities for the next 15 days:\n```javascript\n// Launch date 2024-12-22 21:20+01:00\n[\n  '2024-12-24 09:40',\n  '2024-12-24 11:30',\n  '2024-12-26 15:00',\n  '2024-12-26 15:10',\n  '2024-12-26 15:20',\n  '2024-12-26 15:50',\n  '2024-12-26 16:00',\n  '2024-12-26 16:30',\n  '2024-12-26 16:40'\n]\n```\n\n## POST /surf with SurfQL\n\n- High-level API with functions named after human actions (I go to, I click on, I fill, I read something, etc.)\n- (not available) Low-level API using the object returned by $startSurfing()\n\n### Example : Search on Google, extract text and take a screenshot\n\n```javascript\n{\n    expression: `\n\n        $goTo('https://www.google.fr');\n\n        $clickOn('button:has-text(\"Tout accepter\")');\n\n        $fill('textarea[aria-label=\"Rech.\"]', 'Trump', { 'pressEnter': true });\n\n        {\n            'description': $readText('[data-attrid=description] div \u003e span:nth-child(2)'),\n            'screenshot': $screenshot()\n        };\n\n    `\n}\n```\n\nThe response is a JSON object with a description (the extracted text) and a base64-encoded screenshot.\n\n### Example : Extract and transform gas consumption from GRDF\n\n```javascript\n{\n    input: {\n        compteur,\n        email,\n        _password\n    },\n    expression: `\n\n        $start := $date().subtract(10, 'days').format('YYYY-MM-DD');\n        $end := $date().format('YYYY-MM-DD');\n\n        $startSurfing({'session': {'id': 'grdf', 'ttl': 'P1D'}});\n\n        $goTo('https://monespace.grdf.fr/');\n\n        $login := function() {(\n            $debug('Login');\n            $fill('[name=\"identifier\"]', email, { 'pressEnter': true });\n            
$fill('[name=\"credentials.passcode\"]', _password, { 'pressEnter': true });\n        )};\n\n        $contains($readUrl(), 'connexion.grdf.fr') ? $login() : $debug('Already logged');\n\n        $goTo($buildUrl(\n            'https://monespace.grdf.fr/api/e-conso/pce/consommation/informatives?dateDebut={start}\u0026dateFin={end}\u0026pceList%5B%5D={compteur}',\n            { 'start': $start, 'end': $end, 'compteur': compteur }\n        ));\n\n        $resultConso := $eval($readText('body'));\n\n        $resultConso.*.releves.{'date': journeeGaziere, 'kwh': energieConsomme};\n    `\n}\n```\n\nWe explicitly create a surfing session valid for one day, log in to GRDF if needed, fetch the consumption data, and transform it to obtain exactly what we want.\n\nHere is an example output:\n```javascript\n[\n  { date: '2024-11-27', kwh: 12 },\n  { date: '2024-11-28', kwh: 2 },\n  { date: '2024-11-29', kwh: 6 },\n  { date: '2024-11-30', kwh: 12 },\n  { date: '2024-12-01', kwh: 14 },\n  { date: '2024-12-02', kwh: 10 },\n  { date: '2024-12-03', kwh: 13 },\n  { date: '2024-12-04', kwh: 15 }\n]\n```\n\n### Example : Use imports\n\nhttp://trusted.com/shared-surfs.json\n\n```javascript\n{\n    search: {\n        schemas: {\n            input: {\n                type: 'object',\n                properties: {\n                    url: {\n                        type: 'string'\n                    },\n                    query: {\n                        type: 'string'\n                    }\n                },\n                required: [\n                    'url',\n                    'query'\n                ]\n            },\n            output: {\n                type: 'object',\n                properties: {\n                    description: {\n                        type: 'string'\n                    },\n                    screenshot: {\n                        type: 'object'\n                    }\n                },\n                required: [\n                    
'description',\n                    'screenshot'\n                ]\n            }\n        },\n        input: {\n            url: 'https://www.google.com'\n        },\n        expression: `\n            $goTo(url);\n            $clickOn('button:has-text(\"Tout accepter\")');\n            $fill('textarea[aria-label=\"Rech.\"]', query, { 'pressEnter': true });\n\n            {\n              'description': $readText('[data-attrid=VisualDigestDescription] div:nth-child(2) \u003e span:nth-child(1)'),\n              'screenshot': $screenshot()\n            }\n\n        `\n    }\n}\n```\n\nOur surf:\n\n```javascript\n{\n    input: 'hello world',\n    expression: `\n\n        $call('http://trusted.com/shared-surfs.json#/search', {\n            'url': 'https://www.google.fr',\n            'query': $\n        }).description\n\n    `\n}\n```\n\nTadaaaa! We can reuse code. It is also possible to export functions, but the input/expression/output structure is recommended.\n\n## startSurfing\n\nHigh-level functions use the most recently found resources of the surf; if none are found, they are created. To set them explicitly, you can declare your surfing. For example (everything is optional):\n```\n$startSurfing({\n    'browser': 'firefox',\n    'session': {'id': 'abc', 'ttl': 'P1M'},\n    'timezone': 'Europe/Madrid',\n    'locale': 'es_ES'\n});\n```\n\ni18nPreset lets you pass a preset (an alias) for a set of internationalization parameters, including timezone, locale, proxy, etc.\n\n$startSurfing can be called several times in the same surf. A surfing context is created (and pages are created afterwards).\n\n## Notes\n\nSurfQL is built on top of JSONata (input -\u003e transformation -\u003e output). 
Browsers are managed by Browserless (beware of its licence), but it would be good to have an open-source alternative supporting at least Firefox and Chrome, with autostart and a garbage-collection system, driven by Playwright.\n\nFor output, Web Surfer chooses the content type (json/plain/image/etc.) depending on the returned value. To force the type, use the Accept HTTP header. To force binary encoding (in the case of JSON, for example), use an explicit method (e.g. $base64) (or a header?).\n\nCases:\n- Output is a string: text/plain\n- Output is a Buffer: the type is identified and the raw data is returned\n- Output is an object/boolean: application/json\n\nWhen the output is string-based (text/plain or application/json), binary data is represented as base64 by default.\n\n## Todo\n\n- Native YAML support\n- Global registry and/or user registry\n- Direct HTTP call without a browser\n- Ability to call with GET with CORS allowed -\u003e needs a URL token (JWT?) to execute it\n- Cache?\n- Resolve imports on $call call instead of at init, with the ability to refer to the same \"document\"\n- Add a URI SHA-1 check to ensure a resource has not changed? Or another way to manage contracts/trust?\n- Add Zod contracts for inputs/outputs, etc.\n- Use @gallolabs/application on top\n- Create a Browserless alternative for this need\n- Replace TypeBox with Zod?\n\n\n## Help\n\nGo to http://localhost:3000/doc for the OpenAPI documentation of the available SurfQL methods.\n\n![The doc preview](doc.png)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgallolabs%2Fweb-surfer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgallolabs%2Fweb-surfer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgallolabs%2Fweb-surfer/lists"}