{"id":18118042,"url":"https://github.com/mozilla/probe-scraper","last_synced_at":"2025-04-06T18:14:39.696Z","repository":{"id":37397053,"uuid":"85344923","full_name":"mozilla/probe-scraper","owner":"mozilla","description":"Scrape and publish Telemetry probe data from Firefox","archived":false,"fork":false,"pushed_at":"2024-10-25T07:48:23.000Z","size":5359,"stargazers_count":22,"open_issues_count":73,"forks_count":53,"subscribers_count":27,"default_branch":"main","last_synced_at":"2024-10-29T11:36:48.471Z","etag":null,"topics":["firefox","telemetry"],"latest_commit_sha":null,"homepage":"https://mozilla.github.io/probe-scraper/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mpl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mozilla.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-03-17T18:52:08.000Z","updated_at":"2024-10-25T07:48:26.000Z","dependencies_parsed_at":"2023-11-21T05:34:22.140Z","dependency_job_id":"6d85f18a-0797-4a28-8d03-348506b0e25c","html_url":"https://github.com/mozilla/probe-scraper","commit_stats":{"total_commits":836,"total_committers":74,"mean_commits":"11.297297297297296","dds":0.8480861244019139,"last_synced_commit":"ec250081152fd354608e2fdb1b10fc8fb25e12f2"},"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mozilla%2Fprobe-scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mozilla%2Fprobe-scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/host
s/GitHub/repositories/mozilla%2Fprobe-scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mozilla%2Fprobe-scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mozilla","download_url":"https://codeload.github.com/mozilla/probe-scraper/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247526763,"owners_count":20953143,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["firefox","telemetry"],"created_at":"2024-11-01T05:08:09.316Z","updated_at":"2025-04-06T18:14:39.673Z","avatar_url":"https://github.com/mozilla.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# probe-scraper\nScrape Telemetry probe data from Firefox repositories.\n\nThis extracts per-version Telemetry probe data for Firefox and other Mozilla products from registry files like Histograms.json and Scalars.yaml.\nThe data allows answering questions like \"which Firefox versions is this Telemetry probe in anyway?\".\nAlso, probes outside of Histograms.json - like the CSS use counters - are included in the output data.\n\nThe data is pulled from two different sources:\n- From [`hg.mozilla.org`](https://hg.mozilla.org) for Firefox data.\n- From a [configurable set of Github repositories](repositories.yaml) that use [Glean](https://github.com/mozilla-mobile/android-components/tree/master/components/service/glean).\n\nProbe Scraper outputs JSON to https://probeinfo.telemetry.mozilla.org.\nEffectively, this creates a REST API which can be used by 
downstream tools like\n[mozilla-schema-generator](https://github.com/mozilla/mozilla-schema-generator)\nand various data dictionary type applications (see below).\n\nAn [OpenAPI reference](https://mozilla.github.io/probe-scraper/) to this API is available:\n\n\u003ca href=\"https://mozilla.github.io/probe-scraper/\" rel=\"probeinfo API docs\"\u003e![probeinfo API docs](docs.png)\u003c/a\u003e\n\nA web tool to explore the Firefox-related data is available at [probes.telemetry.mozilla.org](https://probes.telemetry.mozilla.org/). A project to develop a similar view for Glean-based data\nis under development in the [Glean Dictionary](https://github.com/mozilla/glean-dictionary).\n\n## Deprecation\n\nDeprecation is an important step in an application lifecycle. Because of the backwards-compatible nature of our pipeline, we do not\nremove Glean apps or variants from the `repositories.yaml` file - instead, we mark them as deprecated.\n\n### Marking an App Variant as deprecated\n\nWhen an app variant is marked as deprecated (see this [example from Fenix](https://github.com/mozilla/probe-scraper/blob/213055f967b4903933667002ec376cd69cdf5a77/repositories.yaml#L415-L431)), the following happens:\n- It shows as `[Deprecated]` in the Glean Dictionary, in the `Access` section (see e.g. [Fenix's client_id metric](https://dictionary.telemetry.mozilla.org/apps/fenix/metrics/client_id)).\n\n### Marking an App as deprecated\n\nWhen an app is marked as deprecated (see this [example of Firefox for Fire TV](https://github.com/mozilla/probe-scraper/blob/213055f967b4903933667002ec376cd69cdf5a77/repositories.yaml#L501-L504)), the following happens:\n- It no longer shows by default in the Glean Dictionary. 
(Deprecated apps can be viewed by clicking the `Show deprecated applications` checkbox.)\n\n## Adding a New Glean Repository\n\nTo scrape a git repository for probe definitions, an entry needs to be added in `repositories.yaml`.\nThe exact format of the entry depends on whether you are adding an application or a library. See below for details.\n\n### Adding an application\n\nFor a given application, Glean metrics are emitted by the application itself, any libraries it uses\nthat also use Glean, as well as the Glean library proper. Probe scraper therefore needs a way to\nfind all of the dependencies to determine all of the metrics emitted by\nthat application.\n\nTo that end, each application should specify a `dependencies` parameter, which is a list of Glean-using libraries used by the application. Each entry should be a library name as specified by the library's `library_names` parameter.\n\nFor Android applications, if you're not sure what the dependencies of the application are, you can run the following command at the root of the project folder:\n\n```bash\n$ ./gradlew :app:dependencies\n```\n\nSee the full [application schema documentation](https://mozilla.github.io/probe-scraper/#tag/application)\nfor descriptions of all the available parameters.\n\n### Adding a library\n\nProbe scraper also needs a way to map dependencies back to an entry in the\n`repositories.yaml` file. Therefore, any libraries defined should also include\ntheir build-system-specific library names in the `library_names` parameter.\n\nSee the full [library schema documentation](https://mozilla.github.io/probe-scraper/#tag/library)\nfor descriptions of all the available parameters.\n\n## Developing the probe-scraper\n\nYou can choose to develop using the container, or locally. 
Using the container will be slower, since changes will trigger a rebuild of the container,\nbut it helps ensure that your PR passes the CircleCI build and test phases.\n\n### Local development\n\nInstead of installing all these requirements in your global Python environment,\nyou may wish to start by generating and activating a\n[Python virtual environment](https://docs.python.org/3/library/venv.html).\nThe `.gitignore` expects it to be called `ENV` or `venv`:\n```console\npython -m venv venv\n. venv/bin/activate\n```\n\nInstall the requirements:\n```\npip install -r requirements.txt\npip install -r test_requirements.txt\npython setup.py develop\n```\n\nRun the tests. By default, this skips tests that require a web connection:\n```\npytest tests/\n```\n\nTo run all tests, including those that require a web connection:\n```\npytest tests/ --run-web-tests\n```\n\nTo test whether the code conforms to the style rules, you can run:\n```\npython -m black --check probe_scraper tests ./*.py\nflake8 --max-line-length 100 probe_scraper tests ./*.py\nyamllint repositories.yaml .circleci\npython -m isort --profile black --check-only probe_scraper tests ./*.py\n```\n\nTo render API documentation locally to `index.html`:\n```\nmake apidoc\n```\n\n### Developing using the container\n\nRun the tests in the container. This skips tests that require a web connection:\n```\nexport COMMAND='pytest tests/'\nmake run\n```\n\nTo run all tests, including those that require a web connection:\n```\nmake test\n```\n\nTo test whether the code conforms to the style rules, you can run:\n```\nmake lint\n```\n\n### Tests with Web Dependencies\n\nAny tests that require a web connection to run should be marked with `@pytest.mark.web_dependency`.\n\nThese will not run by default, but will run on CI.\n\n### Performing a Dry-Run\n\nBefore opening a PR, it's a good idea to test your code against the production data. 
You can choose a specific Firefox\nversion to run on by using `--firefox-version`:\n```\nexport COMMAND='python -m probe_scraper.runner --firefox-version 65 --dry-run'\nmake run\n```\nor locally via:\n```\npython -m probe_scraper.runner --firefox-version 65 --dry-run\n```\n\nIncluding `--dry-run` means emails will not be sent.\n\nAdditionally, you can test just on Glean repositories:\n```\nexport COMMAND='python -m probe_scraper.runner --glean --dry-run'\nmake run\n```\n\nBy default, that tests against every Glean repository, which might take a while. To test against only a few (e.g. a new repository you're adding and its dependencies), pass the `--glean-repo` argument once for each repository you care about:\n```\nexport COMMAND='python -m probe_scraper.runner --glean --glean-repo glean-core --glean-repo glean-android --glean-repo burnham --dry-run'\nmake run\n```\n\nReplace `burnham` in the example above with your repository and its dependencies.\n\nYou can also do the dry-run locally:\n\n```\npython -m probe_scraper.runner --glean --glean-repo glean-core --glean-repo glean-android --glean-repo burnham --dry-run\n```\n\n## Module overview\n\nThe module is built around the following data flow:\n\n- scrape registry files from mozilla-central, clone files from repositories directory\n- extract probe data from the files\n- transform probe data into output formats\n- save to disk\n\nThe code layout consists mainly of:\n\n- `probe_scraper`\n  - `runner.py` - the central script, ties the other pieces together\n  - `scrapers`\n     - `buildhub.py` - pull build info from the [BuildHub](https://buildhub.moz.tools) service\n     - `moz_central_scraper.py` - loads probe registry files for multiple versions from mozilla-central\n     - `git_scraper.py` - loads probe registry files from a git repository (no version or channel support yet, just per-commit)\n  - `parsers/` - extract probe data from the registry files\n     - `third_party` - these are imported parser scripts from 
[mozilla-central](https://dxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/)\n   - `transform_*.py` - transform the extracted raw data into output formats\n- `tests/` - the unit tests\n\n## Accessing the data files\nThe processed probe data is serialized to the disk in a directory hierarchy starting from the provided output directory. The directory layout resembles a REST-friendly structure.\n\n    |-- product\n        |-- general\n        |-- revisions\n        |-- channel (or \"all\")\n            |-- ping type\n                |-- probe type (or \"all_probes\")\n\nFor example, all the JSON probe data in the [main ping]() for the *Firefox Nightly* channel can be accessed with the following path: `firefox/nightly/main/all_probes`. The probe data for all the channels (same product and ping) can be accessed instead using `firefox/all/main/all_probes`.\n\nThe root directory for the output generated from the scheduled job can be found at \u003chttps://probeinfo.telemetry.mozilla.org/\u003e.\nAll the probe data for Firefox coming from the main ping can be found at \u003chttps://probeinfo.telemetry.mozilla.org/firefox/all/main/all_probes\u003e.\n\n## Accessing `Glean` metrics data\nGlean data is generally laid out as follows:\n\n```\n| -- glean\n    | -- repositories\n    | -- general\n    | -- repository-name\n        | -- general\n        | -- metrics\n```\n\nFor example, the data for a repository called `fenix` would be found at [`/glean/fenix/metrics`](https://probeinfo.telemetry.mozilla.org/glean/fenix/metrics). 
The time the data was last updated for that project can be found at [`glean/fenix/general`](https://probeinfo.telemetry.mozilla.org/glean/fenix/general).\n\nA list of available repositories is at [`/glean/repositories`](https://probeinfo.telemetry.mozilla.org/glean/repositories).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmozilla%2Fprobe-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmozilla%2Fprobe-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmozilla%2Fprobe-scraper/lists"}