{"id":13541179,"url":"https://github.com/dcaribou/transfermarkt-datasets","last_synced_at":"2025-04-02T08:30:58.011Z","repository":{"id":38007809,"uuid":"324604176","full_name":"dcaribou/transfermarkt-datasets","owner":"dcaribou","description":"⚽️ Extract, prepare and publish Transfermarkt datasets.","archived":false,"fork":false,"pushed_at":"2024-09-17T05:08:16.000Z","size":3937,"stargazers_count":219,"open_issues_count":23,"forks_count":54,"subscribers_count":10,"default_branch":"master","last_synced_at":"2024-09-17T07:57:08.877Z","etag":null,"topics":["analytics","dataset","dbt","football","football-data","soccer-analytics"],"latest_commit_sha":null,"homepage":"https://www.kaggle.com/datasets/davidcariboo/player-scores","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc0-1.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dcaribou.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":["dcaribou"]}},"created_at":"2020-12-26T17:33:52.000Z","updated_at":"2024-09-17T05:08:19.000Z","dependencies_parsed_at":"2023-10-05T10:31:07.663Z","dependency_job_id":"be3d7207-3c7d-4745-9656-81d0444e3dfa","html_url":"https://github.com/dcaribou/transfermarkt-datasets","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dcaribou%2Ftransfermarkt-datasets","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dcaribou%2Ftransfermarkt-datasets/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dcaribo
u%2Ftransfermarkt-datasets/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dcaribou%2Ftransfermarkt-datasets/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dcaribou","download_url":"https://codeload.github.com/dcaribou/transfermarkt-datasets/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246781924,"owners_count":20832934,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analytics","dataset","dbt","football","football-data","soccer-analytics"],"created_at":"2024-08-01T10:00:40.580Z","updated_at":"2025-04-02T08:30:57.660Z","avatar_url":"https://github.com/dcaribou.png","language":"Python","funding_links":["https://github.com/sponsors/dcaribou"],"categories":["Sample Projects","Projects Powered by DuckDB","V2 -  What's News in 2022?","Datasets"],"sub_categories":["Web Clients (WebAssembly)"],"readme":"![Build Status](https://github.com/dcaribou/transfermarkt-datasets/actions/workflows/build.yml/badge.svg)\n![Scraper Pipeline Status](https://github.com/dcaribou/transfermarkt-datasets/actions/workflows/acquire-transfermarkt-scraper.yml/badge.svg)\n![API Pipeline Status](https://github.com/dcaribou/transfermarkt-datasets/actions/workflows/acquire-transfermarkt-api.yml/badge.svg)\n![dbt Version](https://img.shields.io/static/v1?logo=dbt\u0026label=dbt-version\u0026message=1.7.3\u0026color=orange)\n\n# transfermarkt-datasets\n\nIn a nutshell, this project aims for three things:\n\n1. 
Acquiring data from the transfermarkt website using the [transfermarkt-scraper](https://github.com/dcaribou/transfermarkt-scraper).\n2. Building a **clean, public football (soccer) dataset** using data in 1.\n3. Automating 1 and 2 to **keep assets up to date** and publicly available on some well-known data catalogs.\n\n[![Open in GitHub Codespaces](https://github.com/codespaces/badge.svg)](https://codespaces.new/dcaribou/transfermarkt-datasets/tree/master?quickstart=1)\n[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://www.kaggle.com/datasets/davidcariboo/player-scores)\n[![data.world](https://img.shields.io/badge/-Open%20in%20data.world-blue?style=appveyor)](https://data.world/dcereijo/player-scores)\n\n------\n```mermaid\nclassDiagram\ndirection LR\ncompetitions --|\u003e games : competition_id\ncompetitions --|\u003e clubs : domestic_competition_id\nclubs --|\u003e players : current_club_id\nclubs --|\u003e club_games : opponent/club_id\nclubs --|\u003e game_events : club_id\nplayers --|\u003e appearances : player_id\nplayers --|\u003e game_events : player_id\nplayers --|\u003e player_valuations : player_id\ngames --|\u003e appearances : game_id\ngames --|\u003e game_events : game_id\ngames --|\u003e clubs : home/away_club_id\ngames --|\u003e club_games : game_id\nclass competitions {\n competition_id\n}\nclass games {\n    game_id\n    home/away_club_id\n    competition_id\n}\nclass game_events {\n    game_id\n    player_id\n}\nclass clubs {\n    club_id\n    domestic_competition_id\n}\nclass club_games {\n    club_id\n    opponent_club_id\n    game_id\n}\nclass players {\n    player_id\n    current_club_id\n}\nclass player_valuations{\n    player_id\n}\nclass appearances {\n    appearance_id\n    player_id\n    game_id\n}\n```\n------\n\n- [transfermarkt-datasets](#transfermarkt-datasets)\n  - [📥 setup](#-setup)\n    - [make](#make)\n  - [💾 data storage](#-data-storage)\n  - [🕸️ data acquisition](#️-data-acquisition)\n    - 
[acquirers](#acquirers)\n  - [🔨 data preparation](#-data-preparation)\n    - [python api](#python-api)\n  - [👁️ frontends](#️-frontends)\n    - [🎈 streamlit](#-streamlit)\n  - [🏗️ infra](#️-infra)\n  - [🎼 orchestration](#-orchestration)\n  - [💬 community](#-community)\n    - [📞 getting in touch](#-getting-in-touch)\n    - [🫶 sponsoring](#-sponsoring)\n    - [👨‍💻 contributing](#-contributing)\n\n------\n\n## 📥 setup\n\n\u003e **🔈 New!** \u0026rarr; Thanks to [GitHub Codespaces](https://github.com/features/codespaces) you can now spin up a working dev environment in your browser with just a click, **no local setup required**.\n\u003e\n\u003e [![Open in GitHub Codespaces](https://github.com/codespaces/badge.svg)](https://codespaces.new/dcaribou/transfermarkt-datasets/tree/master?quickstart=1)\n\nSet up your local environment to run the project with `poetry`.\n1. Install [poetry](https://python-poetry.org/docs/)\n2. Install python dependencies (poetry will create a virtual environment for you)\n```console\ncd transfermarkt-datasets\npoetry install\n```\nRemember to activate the virtual environment once poetry has finished installing the dependencies by running `poetry shell`.\n\n### make\nThe `Makefile` in the root defines a set of useful targets that will help you run the different parts of the project. Some examples are\n```console\ndvc_pull                       pull data from the cloud\ndocker_build                   build the project docker image and tag accordingly\nacquire_local                  run the acquiring process locally (refreshes data/raw/\u003cacquirer\u003e)\nprepare_local                  run the prep process locally (refreshes data/prep)\nsync                           run the sync process (refreshes data frontends)\nstreamlit_local                run streamlit app locally\n```\nRun `make help` to see the full list. 
Once you've completed the setup, you should be able to run most of these from your machine.\n\n## 💾 data storage\nAll project data assets are kept inside the [`data`](data) folder. This is a [DVC](https://dvc.org/) repository, so all files can be pulled from remote storage by running `dvc pull`.\n\n| path        | description                                                                                                                                                                     |\n| ----------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| `data/raw`  | contains raw data for [different acquirers](https://github.com/dcaribou/transfermarkt-datasets/discussions/202#discussioncomment-7142557) (check the data acquisition section below) |\n| `data/prep` | contains prepared datasets as produced by dbt (check [data preparation](#-data-preparation))                                                                                             |\n\n## 🕸️ data acquisition\nIn the scope of this project, \"acquiring\" is the process of collecting data from a specific source via an acquiring script. Acquired data lives in the `data/raw` folder.\n\n### acquirers\nAn acquirer is just a script that collects data from somewhere and puts it in `data/raw`. Acquirers are defined in the [`scripts/acquiring`](scripts/acquiring) folder and run using the `acquire_local` make target.\nFor example, to run the `transfermarkt-api` acquirer with a set of parameters, you can run\n```console\nmake acquire_local ACQUIRER=transfermarkt-api ARGS=\"--season 2024\"\n```\nwhich will populate `data/raw/transfermarkt-api` with the data it collected. 
You can also run [the script](scripts/acquiring/transfermarkt-api.py) directly if you prefer.\n```console\ncd scripts/acquiring \u0026\u0026 python transfermarkt-api.py --season 2024\n```\n\n\n## 🔨 data preparation\nIn the scope of this project, \"preparing\" is the process of transforming raw data to create a high quality dataset that can be conveniently consumed by analysts of all kinds.\n\nData preparation is done in SQL using [dbt](https://docs.getdbt.com/) and [DuckDB](https://duckdb.org/). You can trigger a run of the preparation task using the `prepare_local` make target or work with the dbt CLI directly if you prefer.\n\n* `cd dbt` \u0026rarr; The [dbt](dbt) folder contains the dbt project for data preparation\n* `dbt deps` \u0026rarr; Install dbt packages. This is only required the first time you run dbt.\n* `dbt run -m +appearances` \u0026rarr; Refresh the assets by running the corresponding model in dbt.\n\ndbt runs will populate a `dbt/duck.db` file on your local machine, which you can \"connect to\" using the DuckDB CLI and query using SQL.\n```console\nduckdb dbt/duck.db -c 'select * from dev.games'\n```\n\n![dbt](resources/dbt.png)\n\n\u003e :warning: Make sure that you are using a DuckDB version that matches [the one used in the project](.devcontainer/devcontainer.json).\n\n\n### python api\nA thin python wrapper is provided as a convenience utility to help with loading and inspecting the dataset (for example, from a notebook).\n\n```python\n# import the module\nfrom transfermarkt_datasets.core.dataset import Dataset\n\n# instantiate the datasets handler\ntd = Dataset()\n\n# load all assets into memory as pandas dataframes\ntd.load_assets()\n\n# inspect assets\ntd.asset_names # [\"games\", \"players\", ...]\ntd.assets[\"games\"].prep_df # get the built asset in a dataframe\n\n# get raw data in a dataframe\ntd.assets[\"games\"].load_raw()\ntd.assets[\"games\"].raw_df \n```\n\nThe module code lives in the `transfermarkt_datasets` folder 
with the structure below.\n\n| path                            | description                                                   |\n| ------------------------------- | ------------------------------------------------------------- |\n| `transfermarkt_datasets/core`   | core classes and utils that are used to work with the dataset |\n| `transfermarkt_datasets/tests`  | unit tests for core classes                                   |\n| `transfermarkt_datasets/assets` | prepared asset definitions: one python file per asset         |\n\nFor more examples of using `transfermarkt_datasets`, check out the sample [notebooks](notebooks).\n\n## 👁️ frontends\nPrepared data is published to a couple of popular dataset websites. This is done by running `make sync`, which runs weekly as part of the [data pipeline](#-orchestration).\n\n* [Kaggle](https://www.kaggle.com/datasets/davidcariboo/player-scores)\n* [data.world](https://data.world/dcereijo/player-scores)\n\n### 🎈 streamlit\nThere is a [streamlit](https://streamlit.io/) app for the project with documentation, a data catalog and sample analysis. The app ~~is currently hosted in fly.io, you can check it out [here](https://transfermarkt-datasets.fly.dev/)~~ deployment is currently disabled until [this](https://github.com/dcaribou/transfermarkt-datasets/issues/297) is resolved.\n\nFor local development, you can also run the app on your machine. Provided you've done the [setup](#-setup), run the following to spin up a local instance of the app\n```console\nmake streamlit_local\n```\n\u003e :warning: Note that the app expects prepared data to exist in `data/prep`. Check out [data storage](#-data-storage) for instructions about how to populate that folder.\n\n## 🏗️ [infra](infra)\nDefine all the necessary infrastructure for the project in the cloud with Terraform.\n\n## 🎼 orchestration\nThe data pipeline is orchestrated as a series of GitHub Actions workflows. 
They are defined in the [`.github/workflows`](.github/workflows) folder and are triggered by different events.\n\n| workflow name            | triggers on                                                  | description                                                                                                   |\n| ------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------- |\n| `build`*                  | Every push to the `master` branch or to an open pull request | It runs the [data preparation](#-data-preparation) step, and tests and commits a new version of the prepared data if there are any changes |\n| `acquire-\u003cacquirer\u003e.yml` | Schedule                                                     | It runs the acquirer and commits the acquired data to the corresponding raw location                                                                      |\n| `sync-\u003cfrontend\u003e.yml`    | Every change on prepared data                                | It syncs the prepared data to the corresponding frontend                                                                                |\n\n*`build-contribution` is the same as `build` but without committing any data.\n\n\u003e 💡 Debugging workflows remotely is a pain. 
I recommend using [act](https://github.com/nektos/act) to run them locally to the extent that is possible.\n\n## 💬 community\n\n### 📞 getting in touch\nIn order to keep things tidy, there are two simple guidelines:\n* Keep the conversation centralised and public by getting in touch via the [Discussions](https://github.com/dcaribou/transfermarkt-datasets/discussions) tab.\n* Avoid topic duplication by having a quick look at the [FAQs](https://github.com/dcaribou/transfermarkt-datasets/discussions/175)\n\n### 🫶 sponsoring\nMaintenance of this project is made possible by \u003ca href=\"https://github.com/sponsors/dcaribou\"\u003esponsors\u003c/a\u003e. If you'd like to sponsor this project you can use the `Sponsor` button at the top.\n\n\u0026rarr; I would like to express my gratitude to [@mortgad](https://github.com/mortgad) for becoming the first sponsor of this project.\n\n### 👨‍💻 contributing\nContributions to `transfermarkt-datasets` are most welcome. If you want to contribute new fields or assets to this dataset, the instructions are quite simple:\n1. [Fork the repo](https://github.com/dcaribou/transfermarkt-datasets/fork)\n2. Set up your [local environment](#-setup)\n3. [Populate `data/raw` directory](#-data-storage)\n4. Start modifying assets or creating new ones in [the dbt project](#-data-preparation)\n5. If it's all looking good, create a pull request with your changes :rocket:\n\n\u003e ℹ️ In case you face any issue following the instructions above please [get in touch](#-getting-in-touch)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdcaribou%2Ftransfermarkt-datasets","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdcaribou%2Ftransfermarkt-datasets","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdcaribou%2Ftransfermarkt-datasets/lists"}