{"id":18710700,"url":"https://github.com/apify/waw-file-specification","last_synced_at":"2026-03-19T06:01:27.206Z","repository":{"id":43943832,"uuid":"493257405","full_name":"apify/waw-file-specification","owner":"apify","description":"Contains specification of the Web Automation Workflow (WAW) file.","archived":false,"fork":false,"pushed_at":"2024-05-03T11:12:59.000Z","size":15,"stargazers_count":2,"open_issues_count":0,"forks_count":1,"subscribers_count":3,"default_branch":"main","last_synced_at":"2026-02-23T19:26:59.212Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/apify.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-05-17T13:13:01.000Z","updated_at":"2025-07-17T19:21:49.000Z","dependencies_parsed_at":"2024-11-07T12:50:47.214Z","dependency_job_id":null,"html_url":"https://github.com/apify/waw-file-specification","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/apify/waw-file-specification","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apify%2Fwaw-file-specification","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apify%2Fwaw-file-specification/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apify%2Fwaw-file-specification/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apify%2Fwaw-file-specification/manifests","owner_url":"https://repos.ecosyst
e.ms/api/v1/hosts/GitHub/owners/apify","download_url":"https://codeload.github.com/apify/waw-file-specification/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apify%2Fwaw-file-specification/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30695073,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-19T05:29:31.190Z","status":"ssl_error","status_checked_at":"2026-03-19T05:28:25.821Z","response_time":57,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-07T12:35:18.892Z","updated_at":"2026-03-19T06:01:27.190Z","avatar_url":"https://github.com/apify.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003e\n    \u003cpicture\u003e\n        \u003cimg alt=\"waw docs logo\" src=\"./logo.svg\" width=\"400\"\u003e\n    \u003c/picture\u003e\n\u003c/h1\u003e\n\n\u003e This document contains the official definition of the WAW (web automation workflow) format.\n\u003e\n\u003e Are you looking for a way of running files in this format? 
Check out the `wbr` project on [npm](https://www.npmjs.com/package/@wbr-project/wbr-interpret) or in its [GitHub repository](https://github.com/barjin/wbr).\n\n___\n\nThe WAW format is a declarative format for specifying web-related workflows. \nIt enables the user to control the automation flow with conditional expressions, so the workflow can make decisions based on the website's content.\nIt is also easily parsable (based on JSON), which greatly simplifies validation, visualization and third-party adoption. \n\n## Table of Contents\n\n1. [General overview](#general)\n1. [Meta Header](#meta-header)\n1. [Workflow](#workflow)\n1. [Where conditions](#the-where-clause)\n\t- [Basics](#where-conditions---the-basics)\n\t- [Boolean logic](#where-conditions---boolean-logic)\n\t- [Ordering](#ordering)\n\t- [State memory](#state-memory)\n1. [What actions](#the-what-clause)\n\t- [`wbr-interpret` custom functions](#wbr-interpret-custom-functions)\n1. [Miscellaneous](#extra-syntax)\n\t- [Regular Expressions](#regular-expressions)\n\t- [Parametrization](#Parametrization)\n1. [JSON Schema](#json-schema)\n\n## General\n\nThe `.waw` *(not to be confused with .wav)* file is a textual format used for quick, safe and declarative definition of web automation workflows.\n\nSyntactically, `.waw` should always be a valid `.json` file. If you are unsure what `.json` is, refer to the [official documentation](https://www.json.org). \n\n*Note: From now on, the .waw file will be considered a valid JSON file and all the terminology (object, array) will be used in this context.*\n\nAt the top level, the workflow file contains an object with two properties: `\"meta\"`, an object with the [workflow's metadata](#meta-header), and `\"workflow\"`, a **single array** of so-called \"where-what pairs\". Each pair contains three properties with the keys `id`, `where`, and `what`. 
\n\n\u003e The `id` property is used only for pair referencing (more in [State memory](#state-memory)) and can be omitted.\n\nHere is a top-level view of the workflow file:\n\n```javascript\n{\n\t\"meta\" : {\n        ...\n\t},\n\t\"workflow\": [\n\t\t{\n\t\t\t\"id\": \"login\",\n\t\t\t\"where\": {...},\n\t\t\t\"what\": [...]\n\t\t},\n\t\t{\n\t\t\t\"id\": \"signup\",\n\t\t\t\"where\": {...},\n\t\t\t\"what\": [...]\n\t\t},\n\t\t...\n\t]\n}\n```\n\n## Meta Header \nThe `meta` header of the file can contain two fields: \n- \"name\" - `string` - optional, name of the workflow (for easier management)\n- \"desc\" - `string` - optional, text description of the workflow.\nEven though all the metadata is optional, developers are strongly advised to use it for clarity and easier management of the workflows. \n\n### Example\n```json\n{\n\t\"name\": \"Google Maps Scraper\",\n\t\"desc\": \"A blazing fast scraper for Google Maps search results.\"\n}\n```\n## Workflow\nThe \"workflow\" part of the file is a single **array** consisting of the where-what pairs - objects describing desired behavior in different situations. \n\nFor example, let's say we want to click on a button with the label \"hello\" every time we get on the page \"https://example.com/\". This behavior is described with the following snippet:\n\n```json\n{\n\t\"where\": { \"url\": \"https://example.com\" },\n\t\"what\": [\n\t\t{\n\t\t\t\"action\": \"click\",\n\t\t\t\"args\": [\"button:text('hello')\"]\n\t\t}\n\t]\n}\n```\n\nNow, let's say we want to type \"Hello world!\" into an input field, whenever we see an input field on the \"https://example.com\" website:\n\n```json\n{\n\t\"where\": { \n\t\t\"url\": \"https://example.com\",\n\t\t\"selectors\": [\"input\"]\n\t},\n\t\"what\": [\n\t\t{\n\t\t\t\"action\": \"type\",\n\t\t\t\"args\": [\n\t\t\t\t\"input\",\n\t\t\t\t\"Hello world!\"\n\t\t\t]\n\t\t}\n\t]\n}\n```\n\nThis should be enough to give you some basic understanding of the WAW Smart Workflow format. 
The following sections describe the format and its features in more detail. \n\n## The Where Clause\nThe Where clause describes a **condition** required for the respective What clause to be executed. \n\nIn the basic version without the state memory (more later), we can rely on the Markov assumption, i.e. the Where clause always depends only on the current browser state and its \"applicability\" can be evaluated statically, knowing only the browser's state at the given point. \n\nFor this reason, the workflow can be executed on different tabs in parallel (any popup window opened from the first passed page is processed as well).\n\n### Where conditions - The Basics\n\nThe `where` clause is an object with various keys.\n\n\u003e The specific \"basic\" keys (like `url`, `cookies` etc.) are implementation-dependent and are not a part of the format specification. Keys shown here correspond to the `wbr-interpret` implementation.\n\nAs of now, three keys are recognized:\n- `url` *(string or [RegEx](#regular-expressions))*\n- `cookies` *(object with string keys and string/[RegEx](#regular-expressions) values)*\n- `selectors` *(array of CSS/[Playwright](https://playwright.dev/docs/selectors/) selectors - all of the targeted elements must be present in the page to match this clause)*\n\nAn example of a full (simple, flat) Where clause:\n\n```javascript\n\t\"where\": {\n\t\t\"url\": \"https://jindrich.bar/\",\n\t\t\"cookies\": {\n\t\t\t\"uid\": \"123456\"\n\t\t},\n\t\t\"selectors\": [\n\t\t\t\":text('My Profile')\",\n\t\t\t\"button.logout\"\n\t\t]\n\t}\n```\n\n### Where conditions - (Boolean) Logic\nFor a system operating with conditions, it is crucial to have a simple way to work with formal logic.\nThe WAW format takes inspiration from the [MongoDB query operators](https://docs.mongodb.com/manual/reference/operator/query/), as shown in the example below:\n\n```javascript\n\"where\": {\n    \"$and\": [\n        {\n        \"url\": 
\"https://jindrich.bar/\",\n        },\n        {\n            \"$or\": [\n                {\n                    \"cookies\": {\n                        \"uid\": \"123456\"\n                    }\n                },\n                {\n                    \"selectors\": [\n                        \":text('My Profile')\",\n                        \"button.logout\"\n                    ]\n                }\n            ]\n        }\n    ]\n}\n```\nThis notation describes a condition where the URL is `https://jindrich.bar/` **and** there is **either** the `uid` cookie set with the specified value, **or** there are the selectors present. Please note that the top-level `$and` condition is redundant, as the conjunction of the conditions is the implicit operation.\n\nAs of now, the format supports the following boolean operators: `$and`, `$or` and `$not`.\n\n### Ordering\n\nNote that the ordering of the rules in the file is crucial. Consider the following example:\n\n```json\n\t{\n\t\t\"where\": { \"url\": \"https://jindrich.bar\" },\n\t\t\"what\": [{ \"action A\" }]\n\t},\n\t{\n\t\t\"where\": { \"url\": \"https://jindrich.bar\" },\n\t\t\"what\": [{ \"action B\" }]\n\t},\n```\n\nThe `where` conditions in the displayed pairs are the same, i.e. when the interpreter gets to the webpage `https://jindrich.bar`, it has two possible action sequences to carry out. 
This situation makes little sense, as the workflow definition needs to be as strict as possible and cannot allow non-deterministic behaviour of the interpreter.\n\nFor this reason, the definition of the workflow file says that **only the first matching action** gets executed.\n\nEven though the colliding conditions were easy to spot in the example above, this problem can get a little more nuanced, for example:\n\n```json\n\t{\n\t\t\"where\": { \n\t\t\t\"selectors\": [\"h1\", \"ul\"]\n\t\t},\n\t\t\"what\": [{ \"action A\" }]\n\t},\n\t{\n\t\t\"where\": { \n\t\t\t\"selectors\": [\".large-heading\",\"#list\"]\n\t\t},\n\t\t\"what\": [{ \"action B\" }]\n\t},\n```\n\nWhile there is no visible collision in the described conditions, the interpreter behavior might be surprising on the following page:\n\n```html\n...\n\t\u003ch1 class=\"large-heading\"\u003eHeading\u003c/h1\u003e\n\t\u003cul id=\"list\"\u003e\n\t\t\u003cli\u003ea\u003c/li\u003e\n\t\t...\n\t\u003c/ul\u003e\n...\n```\n\nAgain, the interpreter will execute only `action A`, even though both conditions apply.\n\n\u003e Another way to think of this is \"put more specific conditions closer to the top\".\n\n### State memory\nAs mentioned earlier, the interpreter also has an internal memory which allows for more specific conditions. Some of those could be e.g.\n\n```javascript\n\"where\": {\n\t\"$after\": \"login\" // login being an \"id\" of another where-what pair\n}\n```\n\n```javascript\n\"where\": {\n\t\"$before\": \"signup\"\n}\n```\n\nAs of now, the metatags `$before` and `$after` are supported. The meaning behind those is to allow an action to be run only after (or before) another action has been executed.\n\nThe memory for actions used is **tab-scoped**, i.e. 
every new tab has its own memory of used actions (the tabs run the workflow independently of each other).\n\n**[Hacker's Tip]** : The `$before` condition specifically can be used to run an action only once (`\"id\": \"self\", ..., \"$before\" : \"self\"`).\n\n## The What Clause\nIn the most basic version, the What clause should contain a sequence of actions, which should be carried out in case the respective Where condition is satisfied.\n\n\u003e Note: While the interpreter `wbr-interpret` uses Playwright for its backend, the WAW format is suitable for use with any other backend. Just like with the Where clause basic keys, the action names and parameters are not a part of the format specification.\n\n### What actions - The Basics\nThe `what` clause is an array of \"function\" objects. These objects consist of the `action` field, describing the function called, and `args` - an optional array property, providing parameters for the specified function.\n\n```json\n\"what\":[\n\t{\n\t\t\"action\": \"functionAcceptingString\",\n\t\t\"args\": [\"theFirstParameter\"]\n\t},\n\t{\n\t\t\"action\":\"voidFunction\"\n\t},\n\t{\n\t\t\"action\":\"moreParameters\",\n\t\t\"args\": [\n\t\t\t1000,\n\t\t\t\"string parameter\",\n\t\t\t{\n\t\t\t\t\"option\": true\n\t\t\t}\n\t\t]\n\t}\n]\n```\n\nIn [`wbr-interpret`](https://github.com/barjin/wbr), these actions correspond to Playwright's [Page class methods](https://playwright.dev/docs/api/class-page/) (`goto`, `fill`, `click`...). On top of this, users can use dot notation to access the `Page`'s properties and call their methods (e.g. `page.keyboard.press`). All parameters passed must be JSON's native types, i.e. scalars, arrays, or objects (no functions etc.).\n\n### `wbr-interpret` custom functions \n\nOn top of Playwright's native methods/functions, the user can also use some **interpreter-specific** functions. 
\n\nAs of now, these are:\n- `screenshot` - overrides Playwright's `page.screenshot` method and saves the screenshot using the interpreter's *binary output callback*.\n- `scrape` - using a heuristic algorithm, the interpreter tries to find the most important items on the webpage, parses those into a table and pushes the table into the *serializable callback*.\n\t- The user can also specify the item from the webpage to be scraped (using a [Playwright-style selector](https://playwright.dev/docs/selectors)).\n- `scrapeSchema` - given a \"row schema definition\" with column names and selectors, the interpreter scrapes the data from a webpage into a \"curated\" table.\n\t- Example:\n\t```json\n\t{\n\t\t\"action\": \"scrapeSchema\",\n\t\t\"args\": [{\n\t\t\t\"name\": \".c-item-title\",\n\t\t\t\"price\": \".c-a-basic-info__price\",\n\t\t\t\"vin\": \".c-vin-info__vin\",\n\t\t\t\"desc\": \".c-car-properties__text\"\n\t\t}]\n\t}\n\t```\n- `scroll` - scrolls down the webpage a given number of times (default = `1`).\n- `script` - allows the user to run an arbitrary asynchronous function in the interpreter. The function's body is read as a string from the `args` field and evaluated on the server side (as opposed to in the browser). 
The function accepts one parameter named `page`, which is the current Playwright Page instance.\n\t- Example:\n\t```javascript\n\t{\n\t\t\"action\": \"script\",\n\t\t\"args\": [\"\\\n\t\tconst links = await page.evaluate(() =\u003e \\\n\t\t{\\\n\t\t\treturn Array.from(\\\n\t\t\t\tdocument.querySelectorAll('a.c-item__link.sds-surface--clickable')\\\n\t\t\t).map(a =\u003e a.href);\\\n\t\t});\\\n\t\t\\\n\t\tfor(let link of links){\\\n\t\t\tawait new Promise(res =\u003e setTimeout(res, 100));\\\n\t\t\tawait page.context().newPage().then(page =\u003e page.goto(link))\\\n\t\t}\\\n\t\t\"]\n\t},\n\t```\n\tThe example runs a server-side script opening all links on the current page in new tabs with a 100 ms delay *(Note: if you only want to open links on a page, see `enqueueLinks` below).*\n\t- Even though it is possible to write the whole workflow using one `script` field, we do not endorse this. The WAW format should allow developers to write comprehensible, easy-to-maintain workflow definitions.\n- `enqueueLinks` *(new in 0.4.0)*\n\t- Accepts a `selector` parameter. Reads elements targeted by the specified selector ([Playwright selectors](https://playwright.dev/docs/selectors)) and stores their links in a queue. \n\t- Those pages are then processed using the same workflow as the initial page (in parallel if the `maxConcurrency` interpreter parameter is greater than 1).\n\n## Extra Syntax \nApart from the mentioned syntax available for direct workflow specification, the WAW format contains more constructs for even better flexibility of the format.\n\n### Regular Expressions\n\nThe format supports usage of regular expressions, both in the conditions and the action parameters. The syntax is inspired by the MongoDB regex syntax and looks as follows:\n\n```json\n    \"url\": {\"$regex\": \"^https\"}\n```\n\nSuch a rule matches every URL on a secured website, i.e. 
starting with `https`.\n\n### Parametrization\n\nThe WAW format also allows the developer to parametrize the workflow - this can be particularly useful, e.g. for letting the user insert their login information, URL to be scraped etc.\n\n```json\n{\n\t\"action\": \"goto\",\n\t\"args\": [\n\t\t{\"$param\": \"startURL\"}\n    ]\n}\n```\n\nThe interpreter of the format should allow the user to include their own value to replace the entire parameter structure with the user-supplied value. \n\n## JSON Schema\n\nIn case you want to automatically check a workflow definition file for syntax correctness, use the official [JSON Schema](https://github.com/barjin/wbr/blob/main/json-schema.json).\n\nNote that this JSON schema validates the files only against the base WAW definition.\nTo validate the files against the `wbr-interpret` implementation of the WAW format, please use the `validateWorkflow` method of the `Preprocessor` class.\n___\n\nWant to see a real-world example of a workflow? Visit the [examples folder](https://github.com/barjin/wbr/tree/main/examples) with numerous example workflows.\n\nReady to automate? Read how to write [your first workflow](https://github.com/barjin/wbr/blob/main/docs/wbr-interpret/first_workflow.md) step-by-step.\n\n\u003cbr\u003e\n\u003cdiv align=\"center\"\u003e\n\u003cimg src=\"https://raw.githubusercontent.com/barjin/wbr/bf45528225e3b9fc05963d751bb254ee2b2a427c/docs/wbr-interpret/static/img/wikipedia_scraper.gif\" alt=\"Successful and speedy Wikipedia scraper\"/\u003e\n\u003c/div\u003e","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapify%2Fwaw-file-specification","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fapify%2Fwaw-file-specification","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapify%2Fwaw-file-specification/lists"}