{"id":24571053,"url":"https://github.com/pavlozt/transparency-gdata","last_synced_at":"2025-03-17T08:26:20.025Z","repository":{"id":273908710,"uuid":"907932420","full_name":"pavlozt/transparency-gdata","owner":"pavlozt","description":"Transparency Report Downloader ","archived":false,"fork":false,"pushed_at":"2024-12-24T16:27:25.000Z","size":6,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-01-23T17:59:10.544Z","etag":null,"topics":["data-mining","google","transparency"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pavlozt.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-24T16:25:56.000Z","updated_at":"2024-12-24T16:27:54.000Z","dependencies_parsed_at":"2025-01-23T17:59:11.819Z","dependency_job_id":"13eb8ffb-1204-43db-b14a-95f3c00a259e","html_url":"https://github.com/pavlozt/transparency-gdata","commit_stats":null,"previous_names":["pavlozt/transparency-gdata"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pavlozt%2Ftransparency-gdata","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pavlozt%2Ftransparency-gdata/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pavlozt%2Ftransparency-gdata/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pavlozt%2Ftransparency-gdata/manifests","owner_url":"https://repos.ecos
yste.ms/api/v1/hosts/GitHub/owners/pavlozt","download_url":"https://codeload.github.com/pavlozt/transparency-gdata/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243997845,"owners_count":20381106,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-mining","google","transparency"],"created_at":"2025-01-23T17:59:09.841Z","updated_at":"2025-03-17T08:26:20.016Z","avatar_url":"https://github.com/pavlozt.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Transparency Report Downloader\r\n\r\n[Google's Public Data Program](https://transparencyreport.google.com/traffic/overview) provides researchers, data journalists, and community activists with access to Google's array of structured data about the amount of traffic. This data is not available as downloadable files. 
To obtain live data, an additional container with a [Selenium Browser](https://github.com/SeleniumHQ/selenium) is used.\r\n\r\n\r\n## Table of Contents\r\n\r\n- [Requirements](#requirements)\r\n- [Usage](#usage)\r\n- [First run](#first-run)\r\n- [Periodical data updates](#periodical-data-updates)\r\n- [Debugging](#debugging)\r\n- [Publishing data](#publishing-data)\r\n- [License](#license)\r\n\r\n## Requirements\r\n\r\nYou need to install `docker` with the `docker compose` plugin.\r\nBecause a browser runs inside the container, normal operation requires about 1-1.5 GB of RAM.\r\n\r\n## Usage\r\n\r\nThe parser can be run without any command line arguments, but several arguments are useful:\r\n\r\n-   `--loop`: Start the special loop mode (used for downloading historical data).\r\n-   `--start`: Start time as a Unix timestamp * 1000 (milliseconds).\r\n-   `--end`: End time as a Unix timestamp * 1000 (milliseconds).\r\n-   `--product`: Product identifier. Default is `21` (YouTube).\r\n-   `--region`: Region identifier. Default is `RU`.\r\n-   `--step`: Step interval in milliseconds. Default is 30 days (`60*60*24*30*1000`).\r\n-   `--pause`: Number of seconds to wait between fetches in loop mode. Default is `30`.\r\n-   `--filename`: Output filename. Default is `data.xlsx`.\r\n\r\nData is written to the `data` directory using the `openpyxl` Python module.\r\n\r\n## First run\r\n\r\nFirst, build the container images:\r\n```\r\ndocker compose build\r\n```\r\n\r\nFor the first run, you will most likely want to download historical data.\r\nTo do this, run the container with the **--loop** parameter:\r\n```\r\ndocker compose run --rm parser --loop --start 1643587200000 --end 1734825599999 --pause 10 --product 21 --region RU --filename data-RU.xlsx\r\n```\r\n\r\nThe `--start` and `--end` values can be taken from the URL of the Google Transparency Report page. They are Unix timestamps in milliseconds. 
That is, take the Unix time in seconds and multiply it by 1000.\r\n\r\n\r\n## Periodical data updates\r\n\r\n\r\nThe program is designed to run under Docker Compose. To update the data periodically, add the following command to your crontab (this example runs daily at 02:00):\r\n\r\n```\r\n0 2 * * * /usr/bin/docker compose -f /path/to/project/docker-compose.yaml run --rm parser --product 21 --region RU\r\n```\r\n\r\nTo save resources, you can additionally shut down the `geckodriver` container:\r\n`/usr/bin/docker compose -f /path/to/project/docker-compose.yaml down geckodriver`\r\n\r\n## Debugging\r\n\r\nThe parser may occasionally fail. Here are some debugging tips:\r\n\r\n- Read about debugging Selenium: https://github.com/SeleniumHQ/docker-selenium#debugging\r\n- Uncomment port 7900 in `docker-compose.yaml` and check the browser state at http://localhost:7900/?autoconnect=1\u0026resize=scale\r\n\r\n\r\n## Publishing data\r\n\r\nI suggest publishing the data to Google Drive using [rclone](https://github.com/rclone/rclone).\r\n\r\n## License\r\n\r\nThis project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpavlozt%2Ftransparency-gdata","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpavlozt%2Ftransparency-gdata","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpavlozt%2Ftransparency-gdata/lists"}