{"id":15376530,"url":"https://github.com/tomayac/http-archive-progressive-web-apps","last_synced_at":"2025-04-15T16:34:42.703Z","repository":{"id":139926148,"uuid":"136486319","full_name":"tomayac/http-archive-progressive-web-apps","owner":"tomayac","description":"Different approaches to estimate the number of Progressive Web Apps in the HTTP Archive","archived":false,"fork":false,"pushed_at":"2018-07-09T15:00:34.000Z","size":352,"stargazers_count":7,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-28T22:23:19.774Z","etag":null,"topics":["bigquery","httparchive"],"latest_commit_sha":null,"homepage":"https://medium.com/dev-channel/progressive-web-apps-in-the-http-archive-614d4bcf81fe","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tomayac.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-06-07T14:07:02.000Z","updated_at":"2024-08-25T12:03:29.000Z","dependencies_parsed_at":null,"dependency_job_id":"f3e11377-8eaa-4e0f-ab65-a0c495002467","html_url":"https://github.com/tomayac/http-archive-progressive-web-apps","commit_stats":{"total_commits":31,"total_committers":1,"mean_commits":31.0,"dds":0.0,"last_synced_commit":"85186c7d78cdf82da7552797b39cad29836e18a1"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tomayac%2Fhttp-archive-progressive-web-apps","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tomayac%2Fhttp-archive-
progressive-web-apps/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tomayac%2Fhttp-archive-progressive-web-apps/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tomayac%2Fhttp-archive-progressive-web-apps/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tomayac","download_url":"https://codeload.github.com/tomayac/http-archive-progressive-web-apps/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249108958,"owners_count":21214090,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigquery","httparchive"],"created_at":"2024-10-01T14:08:05.489Z","updated_at":"2025-04-15T16:34:42.683Z","avatar_url":"https://github.com/tomayac.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# Progressive Web Apps in the HTTP Archive\n\n**Thomas Steiner**, Google Hamburg, Germany\n\n📧 [tomac@google.com](mailto:tomac@google.com) • 🐦 [@tomayac](https://twitter.com/tomayac) • 😸 [tomayac](https://github.com/tomayac)\n\n(Published at https://medium.com/dev-channel/progressive-web-apps-in-the-http-archive-614d4bcf81fe.)\n\n## *Abstract*\n\n*In this document, we present three different approaches and discuss their particular pros and cons for extracting data about Progressive Web Apps (PWA) from the HTTP Archive. 
Approach 1 is based on data that is tracked in the context of runs of the Lighthouse tool, Approach 2 is based on use counters in the Chrome browser to record per-page anonymous aggregated metrics on feature usage, and Approach 3 is based on parsing the source code of web pages for traces of service worker registrations and Web App Manifest references. We find that by all three approaches the popularity of PWAs increases roughly linearly over time and provide further research ideas based on the extracted data, whose underlying queries we share publicly.*\n\n## Introduction to Progressive Web Apps\n\nProgressive Web Apps (PWA) are a new class of web applications, enabled for the most part by the [Service Worker APIs](https://developer.mozilla.org/en/docs/Web/API/Service_Worker_API). Service workers allow apps to support *network-independent loading* by intercepting network requests to deliver programmatic or cached responses, service workers can receive *push notifications* and *synchronize* data in the background even when the corresponding app is not running, and service workers—together with [Web App Manifests](https://developer.mozilla.org/en-US/docs/Web/Manifest)—allow users to *install* PWAs to their devices’ home screens. 
Service workers were [first implemented in Chrome 40 Beta](https://blog.chromium.org/2014/12/chrome-40-beta-powerful-offline-and.html) released in December 2014, and the term *Progressive Web Apps* was [coined by Frances Berriman and Alex Russell](https://infrequently.org/2015/06/progressive-apps-escaping-tabs-without-losing-our-soul/) in 2015.\n\n## Research Questions and Problem Statement\n\nAs service workers are now finally [implemented in all major browsers](https://jakearchibald.github.io/isserviceworkerready/), we at the Google Web Developer Relations team were wondering *“how many PWAs are actually out there in the wild and how do they make use of these new technologies?”* Certain advanced APIs like [Background Sync](https://developers.google.com/web/updates/2015/12/background-sync) are currently still [only available on Chromium-based browsers](https://caniuse.com/#feat=background-sync), so as an additional question we looked into *“what features do these PWAs actually use—or in the sense of progressive enhancement—try to use?”*\nOur first idea was to check some of the curated PWA catalogues, for example, [PWA.rocks](https://pwa.rocks/), [PWA Directory](https://pwa-directory.appspot.com/), [Outweb](https://outweb.io/), or [PWA Stats](https://www.pwastats.com/). The problem with such catalogues is that they suffer from what we call *submission bias*. [Anecdotal](https://outweb.io/1506520224205) [evidence](https://www.pwastats.com/2017/06/forbes/) [shows](https://pwa-directory.appspot.com/pwas/5758305695694848) that authors of PWAs want to be included in as many catalogues as possible, but oftentimes the listed examples are not very representative of *the* web and rather longtail. 
For example, at the time of writing, the [first listed PWA](https://pwa-directory.appspot.com/pwas/4816176644358144) on *PWA Directory* is [feuerwehr-eisolzried.de](https://feuerwehr-eisolzried.de/), a PWA on the *\"latest news, dates and more from [the] fire department in Eisolzried, Bavaria.\"* Second, while *PWA Stats* offers tags, for example, on the [use of notifications](https://www.pwastats.com/tags/notifications), not all PWA features are classified in their tagging system. In short, PWA catalogues are not very well suited for answering our research questions.\n\n## The HTTP Archive to the Rescue\n\nThe [HTTP Archive](https://httparchive.org/) tracks how the web is built and provides historical data to quantitatively illustrate how the web is evolving. The archive’s crawlers process [500,000 URLs](https://httparchive.org/faq#how-does-the-http-archive-decide-which-urls-to-test) for both desktop and mobile twice a month. These URLs come from the most popular 500,000 sites in the [Alexa Top 1,000,000](http://www.alexa.com/topsites) list and are mostly homepages that may or may not be representative of the rest of the site. The data in the HTTP Archive can be [queried through BigQuery](https://github.com/HTTPArchive/legacy.httparchive.org/blob/master/docs/bigquery-gettingstarted.md), where multiple tables are available in the ```httparchive``` project. As these tables tend to get fairly big, they are partitioned, but multiple associated tables can be queried using the [wildcard symbol '*'](https://cloud.google.com/bigquery/docs/querying-wildcard-tables). 
For our purposes, three families of tables are relevant, leading to three different approaches:\n* ```httparchive.lighthouse.*```, which contains data about [Lighthouse](https://developers.google.com/web/tools/lighthouse/) runs.\n* ```httparchive.pages.*```, which contains the JSON-encoded parent documents’ [HAR](https://w3c.github.io/web-performance/specs/HAR/Overview.html) data.\n* ```httparchive.response_bodies.*```, which contains the raw response bodies of all resources and sub-resources of all sites in the archive.\n\nIn the following, we will discuss all three approaches and their particular pros and cons, as well as present the extractable data and ideas for further research. All [queries are also available on GitHub](https://github.com/tomayac/http-archive-progressive-web-apps) and are released under the terms of the Apache 2.0 license.\n\n**⚠️ Warning:** While BigQuery grants everyone a certain amount of [free quota per month](https://cloud.google.com/bigquery/pricing#free), on-demand pricing kicks in once the free quota is consumed. Currently, this is [$5 per terabyte](https://cloud.google.com/bigquery/pricing#on_demand_pricing). Some of the queries shown process 70+(!) terabytes! You can see the amount of data that will be processed by clicking on the *Validator* icon:\n\n![Notice of the amount of to-be-processed data](images/image_0.png)\n\n## Approach 1: ```httparchive.lighthouse.*``` Tables\n\n### Description\n\n[Lighthouse](https://developers.google.com/web/tools/lighthouse/) is an automated open-source tool for improving the quality of web pages. One can run it against any web page, public or requiring authentication. It has audits for *Performance*, *Accessibility*, *Progressive Web App*, and more. 
The ```httparchive.lighthouse.*``` tables contain JSON dumps ([example](https://gist.github.com/tomayac/05fed2d4bfa94fe066c705510a3c2103)) of past reports that can be extracted via BigQuery.\n\n### Cons\n\nThe biggest con is that obviously the tables only contain data of web pages that were ever run through the tool, so there is a blind spot. Additionally, while latest versions of Lighthouse process mobile *and* desktop pages, the currently used Lighthouse only processes mobile pages, so there are no results for desktop. One pitfall when working with these tables is that in a past version of Lighthouse *Progressive Web App* was the first category that was shown in the tool, however the [order was flipped](https://github.com/GoogleChrome/lighthouse/issues/3599) in the current version so that now *Performance* is first. In the query we need to take this corner case into account.\n\n### Pros\n\nOn the positive side, Lighthouse has clear scoring guidelines based on the [Baseline PWA Checklist](https://developers.google.com/web/progressive-web-apps/checklist#baseline) for each version of the tool ([v2](https://developers.google.com/web/tools/lighthouse/scoring#pwa), [v3](https://developers.google.com/web/tools/lighthouse/v3/scoring#pwa)), so by requiring a minimum *Progressive Web App* score of ≥75, we can, to some extent, determine what PWA features we want to have included, namely, we can require offline capabilities and make sure the app can be added to the home screen.\n\n### Query and Results\n\nRunning the query below and then selecting distinct PWA URLs returns [799 unique PWA results](https://docs.google.com/spreadsheets/d/1zxpfuEW06oG6wXWq96Zrs0FjzDVVDUTpRcNu9TkiyIw/edit?usp=sharing) that are known to work offline and to be installable to the user’s home screen.\n\n```sql\n#standardSQL\nCREATE TEMPORARY FUNCTION\n  getPWAScore(report STRING)\n  RETURNS FLOAT64\n  LANGUAGE js AS \"\"\"\n$=JSON.parse(report);\nreturn $.reportCategories.find(i =\u003e i.name === 
'Progressive Web App').score;\n\"\"\";\nCREATE TABLE IF NOT EXISTS\n  `progressive_web_apps.lighthouse_pwas` AS\nSELECT\n  DISTINCT url AS pwa_url,\n  IFNULL(rank,\n    1000000) AS rank,\n  date,\n  platform,\n  CAST(ROUND(score) AS INT64) AS lighthouse_pwa_score\nFROM (\n  SELECT\n    REGEXP_REPLACE(JSON_EXTRACT(report,\n        \"$.url\"), \"\\\"\", \"\") AS url,\n    getPWAScore(report) AS score,\n    REGEXP_REPLACE(REGEXP_EXTRACT(_TABLE_SUFFIX, \"\\\\d{4}(?:_\\\\d{2}){2}\"), \"_\", \"-\") AS date,\n    REGEXP_EXTRACT(_TABLE_SUFFIX, \".*_(\\\\w+)$\") AS platform\n  FROM\n    `httparchive.lighthouse.*`\n  WHERE\n    report IS NOT NULL\n    AND JSON_EXTRACT(report,\n      \"$.audits.service-worker.score\") = 'true' )\nLEFT JOIN (\n  SELECT\n    Alexa_rank AS rank,\n    Alexa_domain AS domain\n  FROM\n    # Hard-coded due to https://github.com/HTTPArchive/bigquery/issues/42\n    `httparchive.urls.20170315`\n  WHERE\n    Alexa_rank IS NOT NULL\n    AND Alexa_domain IS NOT NULL ) AS urls\nON\n  urls.domain = NET.REG_DOMAIN(url)\nWHERE\n  # Lighthouse \"Good\" threshold\n  score \u003e= 75\nGROUP BY\n  url,\n  date,\n  score,\n  platform,\n  date,\n  rank\nORDER BY\n  rank ASC,\n  url,\n  date DESC;\n```\n\n### Research Ideas\n\nAn interesting analysis we can run based on this data is the development of average Lighthouse PWA scores over time and the number of PWAs (note that the presented naive approach does not take the in relation also growing HTTP Archive into account, but purely counts absolute numbers).\n\n```sql\n#standardSQL\nSELECT\n  date,\n  count (DISTINCT pwa_url) AS total_pwas,\n  round(AVG(lighthouse_pwa_score), 1) AS avg_lighthouse_pwa_score\nFROM\n  `progressive_web_apps.lighthouse_pwas`\nGROUP BY\n  date\nORDER BY\n  date;\n```\n\n![Average PWA scores over time, the trend is going up from ~83 (of 100) in June 2017 to ~85 (of 100) in May 2018](images/image_1.png)\n\n![Number of PWAs over time, the trend is going up from ~100 in June 2017 to ~340 in 
May 2018](images/image_2.png)\n\n## Approach 2: ```httparchive.pages.*``` Tables\n\n### Description\n\nAnother straightforward way to estimate the number of PWAs (while completely neglecting Web App Manifests) is to look for so-called [use counters](https://cs.chromium.org/chromium/src/third_party/blink/public/platform/web_feature.mojom) in the ```httparchive.pages.*``` tables. Particularly interesting is the ```ServiceWorkerControlledPage``` use counter, which, [according to Chrome engineer Matt Falkenhagen](https://groups.google.com/a/chromium.org/d/msg/blink-api-owners-discuss/uxwEuxCRfGA/_1VdL4_EBAAJ), *“is counted whenever a page is controlled by a service worker, which typically happens only on subsequent loads.”*\n\n### Cons\n\nNothing qualitative can be extracted beyond the bare fact that a service worker controlled the loading of the page. More importantly, as the counter is typically triggered on subsequent loads only (and not on the first load that the crawler sees), this method undercounts and only captures sites that claim their clients (```self.clients.claim()```) on the first load.\n\n### Pros\n\nOn the bright side, the precision is high due to the browser-level tracking, so we can be sure the page actually registered a service worker. 
The query also covers both desktop and mobile.\n\n### Query and Results\n\nThis approach, at time of writing, turns up [5,368 unique results](https://docs.google.com/spreadsheets/d/16jJQF4ACqOKnypCC1jqUqpC3mO-guBZAaSA-P9tbJJg/edit?usp=sharing), however, as mentioned before, not all of these results *necessarily* qualify as PWA due to the potentially missing Web App Manifest that affects the installability of the app.\n\n```sql\n#standardSQL\nCREATE TABLE IF NOT EXISTS\n  `progressive_web_apps.usecounters_pwas` AS\nSELECT\n  DISTINCT REGEXP_REPLACE(url, \"^http:\", \"https:\") AS pwa_url,\n  IFNULL(rank,\n    1000000) AS rank,\n  date,\n  platform\nFROM (\n  SELECT\n    DISTINCT url,\n    REGEXP_REPLACE(REGEXP_EXTRACT(_TABLE_SUFFIX, \"\\\\d{4}(?:_\\\\d{2}){2}\"), \"_\", \"-\") AS date,\n    REGEXP_EXTRACT(_TABLE_SUFFIX, \".*_(\\\\w+)$\") AS platform\n  FROM\n    `httparchive.pages.*`\n  WHERE\n    # From https://cs.chromium.org/chromium/src/third_party/blink/public/platform/web_feature.mojom\n    JSON_EXTRACT(payload,\n      '$._blinkFeatureFirstUsed.Features.ServiceWorkerControlledPage') IS NOT NULL)\nLEFT JOIN (\n  SELECT\n    Alexa_domain AS domain,\n    Alexa_rank AS rank\n  FROM\n    # Hard-coded due to https://github.com/HTTPArchive/bigquery/issues/42\n    `httparchive.urls.20170315` AS urls\n  WHERE\n    Alexa_rank IS NOT NULL\n    AND Alexa_domain IS NOT NULL )\nON\n  domain = NET.REG_DOMAIN(url)\nORDER BY\n  rank ASC,\n  date DESC,\n  pwa_url;\n```\n\n### Research Ideas\n\nSimilar to the second query in *Approach 1* from above, we can also track the number of pages controlled by a service worker over time (the gap in the September 1, 2017 dataset is due to a parsing issue in the data collection pipeline).\n\n```sql\n#standardSQL\nSELECT\n  date,\n  count (DISTINCT pwa_url) AS total_pwas\nFROM\n  `progressive_web_apps.usecounters_pwas`\nGROUP BY\n  date\nORDER BY\n  date;\n```\n\n![Number of pages controlled by a service worker over time, the trend is going 
up from ~100 in December 2016 to ~2,000 in June 2018](images/image_3.png)\n\n## Approach 3: ```httparchive.response_bodies.*``` Tables\n\n### Description\n\nA third, less obvious way to answer our research questions is to look at actual response bodies. The ```httparchive.response_bodies.*``` tables contain raw data of all resources and sub-resources of all sites in the archive, so we can use full-text search to find patterns that indicate the presence of PWA features: variations of the string ```navigator.serviceWorker.register(\"``` suggest that the page might be registering a service worker, and variations of ```\u003clink rel=\"manifest\"``` point to a potential Web App Manifest.\n\n### Cons\n\nThe downside of this approach is that we are trying to parse HTML with regular expressions to begin with, which is [commonly known to be impossible](https://stackoverflow.com/a/1732454) and a [bad practice](https://www.reddit.com/r/ProgrammerHumor/comments/6ytfw5/parsing_html_using_regular_expressions/). For example, we might detect commented-out code or struggle with incorrectly nested code.\n\n### Pros\n\nDespite all challenges, as the service worker JavaScript files and the Web App Manifest JSON files are subresources of the page and therefore stored in the ```httparchive.response_bodies.*``` tables, we can still bravely attempt to examine their contents and try to gain an in-depth understanding of the PWAs’ capabilities. 
By checking the service worker JavaScript code for the events the service worker listens to, we can see if a PWA—at least in theory—deals with Web Push notifications, handles fetches, *etc.*, and by looking at the Web App Manifest JSON document, we can see if the PWA specifies a start URL, provides a name, and so on.\n\n### Query and Results\n\nWe have split the analysis of service workers and Web App Manifests, and use a common helper table to extract PWA candidates from the large response body tables. As references to service worker script files and Web App Manifest JSON files may be relative or absolute, we need a [User-Defined Function](https://cloud.google.com/bigquery/docs/reference/standard-sql/user-defined-functions) to resolve paths like ```../../manifest.json``` relative to their base URL. Our function is a hacky simplification based on [path.resolve([...paths])](https://nodejs.org/docs/latest/api/path.html#path_path_resolve_paths) in Node.js and not very elegant. We deliberately ignore references that would require executing JavaScript, for example, URLs like ```window.location.href + 'sw.js'```, so our regular expressions are a bit involved to make sure we exclude these cases.\n\n#### PWA Candidates Helper Table\n\n```sql\n#standardSQL\nCREATE TEMPORARY FUNCTION\n  pathResolve(path1 STRING,\n    path2 STRING)\n  RETURNS STRING\n  LANGUAGE js AS \"\"\"\n  function normalizeStringPosix(e,t){for(var n=\"\",r=-1,i=0,l=void 0,o=!1,h=0;h\u003c=e.length;++h){if(h\u003ce.length)l=e.charCodeAt(h);else{if(l===SLASH)break;l=SLASH}if(l===SLASH){if(r===h-1||1===i);else if(r!==h-1\u0026\u00262===i){if(n.length\u003c2||!o||n.charCodeAt(n.length-1)!==DOT||n.charCodeAt(n.length-2)!==DOT)if(n.length\u003e2){for(var g=n.length-1,a=g;a\u003e=0\u0026\u0026n.charCodeAt(a)!==SLASH;--a);if(a!==g){n=-1===a?\"\":n.slice(0,a),r=h,i=0,o=!1;continue}}else if(2===n.length||1===n.length){n=\"\",r=h,i=0,o=!1;continue}t\u0026\u0026(n.length\u003e0?n+=\"/..\":n=\"..\",o=!0)}else{var 
f=e.slice(r+1,h);n.length\u003e0?n+=\"/\"+f:n=f,o=!1}r=h,i=0}else l===DOT\u0026\u0026-1!==i?++i:i=-1}return n}function resolvePath(){for(var e=[],t=0;t\u003carguments.length;t++)e[t]=arguments[t];for(var n=\"\",r=!1,i=void 0,l=e.length-1;l\u003e=-1\u0026\u0026!r;l--){var o=void 0;l\u003e=0?o=e[l]:(void 0===i\u0026\u0026(i=getCWD()),o=i),0!==o.length\u0026\u0026(n=o+\"/\"+n,r=o.charCodeAt(0)===SLASH)}return n=normalizeStringPosix(n,!r),r?\"/\"+n:n.length\u003e0?n:\".\"}var SLASH=47,DOT=46,getCWD=function(){return\"\"};if(/^https?:/.test(path2)){return path2;}if(/^\\\\//.test(path2)){return path1+path2.substr(1);}return resolvePath(path1, path2).replace(/^(https?:\\\\/)/, '$1/');\n\"\"\";\nCREATE TABLE IF NOT EXISTS\n  `progressive_web_apps.pwa_candidates` AS\nSELECT\n  DISTINCT REGEXP_REPLACE(page, \"^http:\", \"https:\") AS pwa_url,\n  IFNULL(rank,\n    1000000) AS rank,\n  pathResolve(REGEXP_REPLACE(page, \"^http:\", \"https:\"),\n    REGEXP_EXTRACT(body, \"navigator\\\\.serviceWorker\\\\.register\\\\s*\\\\(\\\\s*[\\\"']([^\\\\),\\\\s\\\"']+)\")) AS sw_url,\n  pathResolve(REGEXP_REPLACE(page, \"^http:\", \"https:\"),\n    REGEXP_EXTRACT(REGEXP_EXTRACT(body, \"(\u003clink[^\u003e]+rel=[\\\"']?manifest[\\\"']?[^\u003e]+\u003e)\"), \"href=[\\\"']?([^\\\\s\\\"'\u003e]+)[\\\"']?\")) AS manifest_url\nFROM\n  `httparchive.response_bodies.*`\nLEFT JOIN (\n  SELECT\n    Alexa_domain AS domain,\n    Alexa_rank AS rank\n  FROM\n    # Hard-coded due to https://github.com/HTTPArchive/bigquery/issues/42\n    `httparchive.urls.20170315` AS urls\n  WHERE\n    Alexa_rank IS NOT NULL\n    AND Alexa_domain IS NOT NULL )\nON\n  domain = NET.REG_DOMAIN(page)\nWHERE\n  (REGEXP_EXTRACT(body, \"navigator\\\\.serviceWorker\\\\.register\\\\s*\\\\(\\\\s*[\\\"']([^\\\\),\\\\s\\\"']+)\") IS NOT NULL\n    AND REGEXP_EXTRACT(body, \"navigator\\\\.serviceWorker\\\\.register\\\\s*\\\\(\\\\s*[\\\"']([^\\\\),\\\\s\\\"']+)\") != \"/\")\n  AND (REGEXP_EXTRACT(REGEXP_EXTRACT(body, 
\"(\u003clink[^\u003e]+rel=[\\\"']?manifest[\\\"']?[^\u003e]+\u003e)\"), \"href=[\\\"']?([^\\\\s\\\"'\u003e]+)[\\\"']?\") IS NOT NULL\n    AND REGEXP_EXTRACT(REGEXP_EXTRACT(body, \"(\u003clink[^\u003e]+rel=[\\\"']?manifest[\\\"']?[^\u003e]+\u003e)\"), \"href=[\\\"']?([^\\\\s\\\"'\u003e]+)[\\\"']?\") != \"/\")\nORDER BY\n  rank ASC,\n  pwa_url;\n```\n\n#### Web App Manifests Analysis\n\nBased on this helper table, we can then run the analysis of the Web App Manifests. We check for the existence of properties defined in the [```WebAppManifest``` dictionary](https://www.w3.org/TR/appmanifest/#webappmanifest-dictionary) combined with non-standard, but well-known properties like ```\"gcm_sender_id\"``` from the deprecated [Google Cloud Messaging](https://developers.google.com/cloud-messaging/) or ```\"share_target\"``` from the currently [in flux Web Share Target API](https://wicg.github.io/web-share-target/#extension-to-the-web-app-manifest). Turns out, not many manifests are in the archive; from 2,823 candidate manifest URLs in the helper table we actually only find [30 unique Web App Manifests](https://docs.google.com/spreadsheets/d/1VE9hoj7Ag7E3kOG4BKc8NKISg1w0BZVQkGdCq4MJ6hw/edit?usp=sharing) and thus PWAs in the response bodies, but these at least archived in several versions.\n\n```sql\n#standardSQL\n  CREATE TABLE IF NOT EXISTS `progressive_web_apps.web_app_manifests` AS\nSELECT\n  pwa_url,\n  rank,\n  manifest_url,\n  date,\n  platform,\n  REGEXP_CONTAINS(manifest_code,\n    r\"\\\"dir\\\"\\s*:\") AS dir_property,\n  REGEXP_CONTAINS(manifest_code,\n    r\"\\\"lang\\\"\\s*:\") AS lang_property,\n  REGEXP_CONTAINS(manifest_code,\n    r\"\\\"name\\\"\\s*:\") AS name_property,\n  REGEXP_CONTAINS(manifest_code,\n    r\"\\\"short_name\\\"\\s*:\") AS short_name_property,\n  REGEXP_CONTAINS(manifest_code,\n    r\"\\\"description\\\"\\s*:\") AS description_property,\n  REGEXP_CONTAINS(manifest_code,\n    r\"\\\"scope\\\"\\s*:\") AS scope_property,\n  
REGEXP_CONTAINS(manifest_code,\n    r\"\\\"icons\\\"\\s*:\") AS icons_property,\n  REGEXP_CONTAINS(manifest_code,\n    r\"\\\"display\\\"\\s*:\") AS display_property,\n  REGEXP_CONTAINS(manifest_code,\n    r\"\\\"orientation\\\"\\s*:\") AS orientation_property,\n  REGEXP_CONTAINS(manifest_code,\n    r\"\\\"start_url\\\"\\s*:\") AS start_url_property,\n  REGEXP_CONTAINS(manifest_code,\n    r\"\\\"serviceworker\\\"\\s*:\") AS serviceworker_property,\n  REGEXP_CONTAINS(manifest_code,\n    r\"\\\"theme_color\\\"\\s*:\") AS theme_color_property,\n  REGEXP_CONTAINS(manifest_code,\n    r\"\\\"related_applications\\\"\\s*:\") AS related_applications_property,\n  REGEXP_CONTAINS(manifest_code,\n    r\"\\\"prefer_related_applications\\\"\\s*:\") AS prefer_related_applications_property,\n  REGEXP_CONTAINS(manifest_code,\n    r\"\\\"background_color\\\"\\s*:\") AS background_color_property,\n  REGEXP_CONTAINS(manifest_code,\n    r\"\\\"categories\\\"\\s*:\") AS categories_property,\n  REGEXP_CONTAINS(manifest_code,\n    r\"\\\"screenshots\\\"\\s*:\") AS screenshots_property,\n  REGEXP_CONTAINS(manifest_code,\n    r\"\\\"iarc_rating_id\\\"\\s*:\") AS iarc_rating_id_property,\n  REGEXP_CONTAINS(manifest_code,\n    r\"\\\"gcm_sender_id\\\"\\s*:\") AS gcm_sender_id_property,\n  REGEXP_CONTAINS(manifest_code,\n    r\"\\\"gcm_user_visible_only\\\"\\s*:\") AS gcm_user_visible_only_property,\n  REGEXP_CONTAINS(manifest_code,\n    r\"\\\"share_target\\\"\\s*:\") AS share_target_property,\n  REGEXP_CONTAINS(manifest_code,\n    r\"\\\"supports_share\\\"\\s*:\") AS supports_share_property\nFROM\n  `progressive_web_apps.pwa_candidates`\nJOIN (\n  SELECT\n    url,\n    body AS manifest_code,\n    REGEXP_REPLACE(REGEXP_EXTRACT(_TABLE_SUFFIX, \"\\\\d{4}(?:_\\\\d{2}){2}\"), \"_\", \"-\") AS date,\n    REGEXP_EXTRACT(_TABLE_SUFFIX, \".*_(\\\\w+)$\") AS platform\n  FROM\n    `httparchive.response_bodies.*`\n  WHERE\n    body IS NOT NULL\n    AND body != \"\"\n    AND url IN (\n    SELECT\n      
DISTINCT manifest_url\n    FROM\n      `progressive_web_apps.pwa_candidates`) ) AS manifest_bodies\nON\n  manifest_bodies.url = manifest_url\nORDER BY\n  rank ASC,\n  pwa_url,\n  date DESC,\n  platform,\n  manifest_url;\n```\n\n#### Research Ideas\n\nWith this data at hand, we can extract all (well, not really *all*, but all known according to our query) PWAs that still use the deprecated Google Cloud Messaging service.\n\n```sql\n#standardSQL\nSELECT\n  DISTINCT pwa_url,\n  manifest_url\nFROM\n  `progressive_web_apps.web_app_manifests`\nWHERE\n  gcm_sender_id_property;\n```\n\n#### Service Workers Analysis\n\nSimilar to the analysis of Web App Manifests, the analysis of the various [```ServiceWorkerGlobalScope``` events](https://www.w3.org/TR/service-workers-1/#execution-context-events) is based on regular expressions. Events can be listened to using two JavaScript syntaxes: *(i)* the property syntax (*e.g.*, ```self.oninstall = […]```) or *(ii)* the event listener syntax (*e.g.*, ```self.addEventListener('install', […])```). As an additional data point, we extract potential uses of the increasingly popular library [Workbox](https://developers.google.com/web/tools/workbox/) by looking for telling traces of various Workbox versions in the code. 
Running this query we obtain [1,151 unique service workers](https://docs.google.com/spreadsheets/d/1rrSh3tXje9WnySfX8oRafY7Aduunv6X0rq_jmcBicIM/edit?usp=sharing) and thus PWAs.\n\n```sql\n#standardSQL\nCREATE TABLE IF NOT EXISTS\n  `progressive_web_apps.service_workers` AS\nSELECT\n  pwa_url,\n  rank,\n  sw_url,\n  date,\n  platform,\n  REGEXP_CONTAINS(sw_code, r\"\\.oninstall\\s*=|addEventListener\\(\\s*[\\\"']install[\\\"']\") AS install_event,\n  REGEXP_CONTAINS(sw_code, r\"\\.onactivate\\s*=|addEventListener\\(\\s*[\\\"']activate[\\\"']\") AS activate_event,\n  REGEXP_CONTAINS(sw_code, r\"\\.onfetch\\s*=|addEventListener\\(\\s*[\\\"']fetch[\\\"']\") AS fetch_event,\n  REGEXP_CONTAINS(sw_code, r\"\\.onpush\\s*=|addEventListener\\(\\s*[\\\"']push[\\\"']\") AS push_event,\n  REGEXP_CONTAINS(sw_code, r\"\\.onnotificationclick\\s*=|addEventListener\\(\\s*[\\\"']notificationclick[\\\"']\") AS notificationclick_event,\n  REGEXP_CONTAINS(sw_code, r\"\\.onnotificationclose\\s*=|addEventListener\\(\\s*[\\\"']notificationclose[\\\"']\") AS notificationclose_event,\n  REGEXP_CONTAINS(sw_code, r\"\\.onsync\\s*=|addEventListener\\(\\s*[\\\"']sync[\\\"']\") AS sync_event,\n  REGEXP_CONTAINS(sw_code, r\"\\.oncanmakepayment\\s*=|addEventListener\\(\\s*[\\\"']canmakepayment[\\\"']\") AS canmakepayment_event,\n  REGEXP_CONTAINS(sw_code, r\"\\.onpaymentrequest\\s*=|addEventListener\\(\\s*[\\\"']paymentrequest[\\\"']\") AS paymentrequest_event,\n  REGEXP_CONTAINS(sw_code, r\"\\.onmessage\\s*=|addEventListener\\(\\s*[\\\"']message[\\\"']\") AS message_event,\n  REGEXP_CONTAINS(sw_code, r\"\\.onmessageerror\\s*=|addEventListener\\(\\s*[\\\"']messageerror[\\\"']\") AS messageerror_event,\n  REGEXP_CONTAINS(sw_code, r\"new Workbox|new workbox|workbox\\.precaching\\.|workbox\\.strategies\\.\") AS uses_workboxjs\nFROM\n  `progressive_web_apps.pwa_candidates`\nJOIN (\n  SELECT\n    url,\n    body AS sw_code,\n    REGEXP_REPLACE(REGEXP_EXTRACT(_TABLE_SUFFIX, \"\\\\d{4}(?:_\\\\d{2}){2}\"), 
\"_\", \"-\") AS date,\n    REGEXP_EXTRACT(_TABLE_SUFFIX, \".*_(\\\\w+)$\") AS platform\n  FROM\n    `httparchive.response_bodies.*`\n  WHERE\n    body IS NOT NULL\n    AND body != \"\"\n    AND url IN (\n    SELECT\n      DISTINCT sw_url\n    FROM\n      `progressive_web_apps.pwa_candidates`) ) AS sw_bodies\nON\n  sw_bodies.url = sw_url\nORDER BY\n  rank ASC,\n  pwa_url,\n  date DESC,\n  platform,\n  sw_url;\n```\n\n#### Research Ideas\n\nHaving detailed service worker data allows for interesting analyses. For example, we can use this data to track Workbox usage over time.\n\n```sql\n#standardSQL\nSELECT\n  date,\n  count (uses_workboxjs) AS total_uses_workbox\nFROM\n  `progressive_web_apps.service_workers`\nWHERE\n  uses_workboxjs\n  AND platform = 'mobile'\nGROUP BY\n  date\nORDER BY\n  date;\n```\n\n![Workbox usage over time, the trend is going up from ~1 in August 2017 to ~46 in June 2018](images/image_4.png)\n\nLines of code (LOC) is a great metric ([not](https://en.wikipedia.org/wiki/Source_lines_of_code#Utility)) to estimate a team’s productivity and to predict a task’s complexity. Let’s analyze the development of a given site’s service worker in terms of string length. 
Seems like the team deserves a raise… 😉\n\n```sql\n#standardSQL\nSELECT\n  DISTINCT pwa_url,\n  sw_url,\n  date,\n  CHAR_LENGTH(body) AS sw_length\nFROM\n  `progressive_web_apps.service_workers`\nJOIN\n  `httparchive.response_bodies.*`\nON\n  sw_url = url\n  AND date = REGEXP_REPLACE(REGEXP_EXTRACT(_TABLE_SUFFIX, \"\\\\d{4}(?:_\\\\d{2}){2}\"), \"_\", \"-\")\n  AND platform = REGEXP_EXTRACT(_TABLE_SUFFIX, \".*_(\\\\w+)$\")\nWHERE\n  # Redacted\n  pwa_url = \"https://example.com/\"\n  AND platform = \"mobile\"\nORDER BY\n  date ASC;\n```\n\n![String length of an anonymized site's service worker over time, the trend is going up from ~16,000 characters in March 2016 to ~28,000 characters in June 2018](images/image_5.png)\n\nA final idea is to examine service worker events over time and see if there are interesting developments. Something that stands out in the analysis is how increasingly the ```fetch``` event is being listened to as well as the ```message``` event. Both are an indicator for more complex offline handling scenarios.\n\n```sql\n#standardSQL\nSELECT\n  date,\n  COUNT(IF (install_event,\n      TRUE,\n      NULL)) AS install_events,\n  COUNT(IF ( activate_event,\n      TRUE,\n      NULL)) AS activate_events,\n  COUNT(IF ( fetch_event,\n      TRUE,\n      NULL)) AS fetch_events,\n  COUNT(IF ( push_event,\n      TRUE,\n      NULL)) AS push_events,\n  COUNT(IF ( notificationclick_event,\n      TRUE,\n      NULL)) AS notificationclick_events,\n  COUNT(IF ( notificationclose_event,\n      TRUE,\n      NULL)) AS notificationclose_events,\n  COUNT(IF ( sync_event,\n      TRUE,\n      NULL)) AS sync_events,\n  COUNT(IF ( canmakepayment_event,\n      TRUE,\n      NULL)) AS canmakepayment_events,\n  COUNT(IF ( paymentrequest_event,\n      TRUE,\n      NULL)) AS paymentrequest_events,\n  COUNT(IF ( message_event,\n      TRUE,\n      NULL)) AS message_events,\n  COUNT(IF ( messageerror_event,\n      TRUE,\n      NULL)) AS messageerror_events\nFROM\n  
`progressive_web_apps.service_workers`\nWHERE\n  NOT uses_workboxjs\n  AND date LIKE \"2018-%\"\nGROUP BY\n  date\nORDER BY\n  date;\n```\n\n![Service worker events over time, showing an increasing usage of the fetch and the message event from February to June 2018](images/image_6.png)\n\n## Meta Approach: *Approaches 1–3* Combined\n\nAn interesting meta-analysis is to combine all approaches to get a feeling for the overall landscape of PWAs in the HTTP Archive (with all aforementioned pros and cons regarding precision and recall applied). If we run the query below, we find exactly [6,647 unique PWAs](https://docs.google.com/spreadsheets/d/1XcSa59AwZZiqz7QdEH6CncU6lnSgzAEAOMbyI-aHBow/edit?usp=sharing). They may not necessarily still be PWAs today; some of the previously very prominent PWA lighthouse cases are known to have regressed, and some were only very briefly experimenting with the technologies, but in the HTTP Archive we have evidence of the moment of glory in history when each of these pages fulfilled the criteria of at least one of our three approaches for being counted as a PWA.\n\n```sql\n#standardSQL\nSELECT\n  DISTINCT pwa_url,\n  rank\nFROM (\n  SELECT\n    DISTINCT pwa_url,\n    rank\n  FROM\n    `progressive_web_apps.lighthouse_pwas`\n  UNION ALL\n  SELECT\n    DISTINCT pwa_url,\n    rank\n  FROM\n    `progressive_web_apps.service_workers`\n  UNION ALL\n  SELECT\n    DISTINCT pwa_url,\n    rank\n  FROM\n    `progressive_web_apps.usecounters_pwas`)\nORDER BY\n  rank ASC;\n```\n\nIf we aggregate by date and ignore some outliers, we can see linear growth in the total number of PWAs, with a slight decline at the end of our observation period that we will keep an eye on in future research.\n\n```sql\n#standardSQL\nSELECT\n  DISTINCT date,\n  COUNT(pwa_url) AS pwas\nFROM (\n  SELECT\n    DISTINCT date,\n    pwa_url\n  FROM\n    `progressive_web_apps.lighthouse_pwas`\n  UNION ALL\n  SELECT\n    DISTINCT date,\n    pwa_url\n  FROM\n    
`progressive_web_apps.service_workers`\n  UNION ALL\n  SELECT\n    DISTINCT date,\n    pwa_url\n  FROM\n    `progressive_web_apps.usecounters_pwas`)\nGROUP BY\n  date\nORDER BY\n  date;\n```\n\n![PWAs over time showing linear growth from February 2017 to June 2018, with a slight decline in May and June 2018](images/image_7.png)\n\n## Future Work and Conclusions\n\nIn this document, we have presented three different approaches to extracting PWA data from the HTTP Archive. Each has its individual pros and cons, but *Approach 3* in particular has proven very interesting as a basis for further analyses. All presented queries are “evergreen” in the sense that they are not tied to a particular crawl’s tables, allowing for ongoing analyses in the future. Depending on people’s interest, we will see to what extent the data can be made generally available as part of the HTTP Archive’s public tables. There are likewise interesting research opportunities in combining our results with the [Chrome User Experience Report](https://developers.google.com/web/tools/chrome-user-experience-report/), which is also [accessible with BigQuery](https://developers.google.com/web/tools/chrome-user-experience-report/getting-started#query-dataset).\n\nIn conclusion, the overall trends point in the right direction. More and more pages are controlled by a service worker, leading to PWAs with a generally increasing Lighthouse PWA score. Something to watch out for is the decline in PWAs observed in the *Meta Approach*, which, however, is not reflected in the most precise and neutral *Approach 2*, where the opposite is the case. 
We look forward to learning about new ways people make use of our research and to PWAs becoming more and more mainstream.\n\n## Acknowledgements\n\nIn no particular order we would like to thank [Mathias Bynens](https://twitter.com/mathias) for help with shaping one of the initial queries, [Kenji Baheux](https://twitter.com/kenjibaheux) for pointers that led to *Approach 2*, [Rick Viscomi](https://twitter.com/rick_viscomi) and [Patrick Meenan](https://twitter.com/patmeenan?lang=en) for general HTTP Archive help and the video series, [Jeff Posnick](https://twitter.com/jeffposnick), [Ade Oshineye](https://twitter.com/ade_oshineye), [Ilya Grigorik](https://twitter.com/igrigorik), [John Mueller](https://twitter.com/JohnMu), [Cheney Tsai](https://twitter.com/cheneytsai?lang=en), [Miguel Carlos Martínez Díaz](https://twitter.com/mcmd), and [Eric Bidelman](https://twitter.com/ebidel) for editorial comments, as well as [Matt Falkenhagen](https://twitter.com/falkenmatto?lang=en) and [Matt Giuca](https://twitter.com/mgiuca) for providing technical background on use counters.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftomayac%2Fhttp-archive-progressive-web-apps","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftomayac%2Fhttp-archive-progressive-web-apps","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftomayac%2Fhttp-archive-progressive-web-apps/lists"}