{"id":47911286,"url":"https://github.com/etalab-ia/mediatech","last_synced_at":"2026-04-04T05:19:46.262Z","repository":{"id":310857699,"uuid":"953919119","full_name":"etalab-ia/mediatech","owner":"etalab-ia","description":"Collection of public datasets from the French administration, vectorized and ready to use in AI projects. ","archived":false,"fork":false,"pushed_at":"2026-01-22T18:02:48.000Z","size":479,"stargazers_count":6,"open_issues_count":22,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-01-23T10:38:56.663Z","etag":null,"topics":["datasets","embeddings","huggingface","public-data"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/etalab-ia.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-03-24T09:38:06.000Z","updated_at":"2026-01-22T18:02:55.000Z","dependencies_parsed_at":"2026-01-08T20:08:56.410Z","dependency_job_id":null,"html_url":"https://github.com/etalab-ia/mediatech","commit_stats":null,"previous_names":["etalab-ia/mediatech"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/etalab-ia/mediatech","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/etalab-ia%2Fmediatech","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/etalab-ia%2Fmediatech/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/etalab-ia%2Fmediatech/releas
es","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/etalab-ia%2Fmediatech/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/etalab-ia","download_url":"https://codeload.github.com/etalab-ia/mediatech/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/etalab-ia%2Fmediatech/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31388507,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-04T04:26:24.776Z","status":"ssl_error","status_checked_at":"2026-04-04T04:23:34.147Z","response_time":60,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["datasets","embeddings","huggingface","public-data"],"created_at":"2026-04-04T05:19:44.541Z","updated_at":"2026-04-04T05:19:46.256Z","avatar_url":"https://github.com/etalab-ia.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# MEDIATECH\n\n[![License](https://img.shields.io/github/license/etalab-ia/mediatech?label=licence\u0026color=red)](https://github.com/etalab-ia/mediatech/blob/main/LICENSE)\n[![French version](https://img.shields.io/badge/🇫🇷-French%20version-blue)](./docs/README_fr.md)\n[![Hugging Face collection](https://img.shields.io/badge/🤗-Hugging%20Face%20collection-yellow)](https://huggingface.co/collections/AgentPublic/mediatech-68309e15729011f49ef505e8)\n\n\n## 📝 
Description\n\nThis project processes public data made available by various administrations in order to facilitate access to vectorized and ready-to-use public data for AI applications in the public sector.\nIt includes scripts for downloading, processing, embedding, and inserting this data into a PostgreSQL database, and facilitates its export via various means.\n\n## 💡 Get Started\n\n### 𖣘 Method 1 : Airflow\n\n#### Installing and configuring dependencies\n\n1. Run the initial deployment script:\n   ```bash\n   sudo chmod +x ./scripts/initial_deployment.sh\n   ./scripts/initial_deployment.sh\n   ```\n  \n2. Set up the environment variables in a [`.env`](.env) file based on the example in [`.env.example`](.env.example).\n   \u003e The `AIRFLOW_UID` variable must be obtained by executing:\n    ```bash\n    echo $(id -u)\n    ```\n   \u003e The `JWT_TOKEN` variable will be obtained later by using the Airflow API. Just leave it for now.\n\n#### Initialize Airflow and PostgreSQL (PgVector) containers\n\n1. Run the [`containers_deployment`](./scripts/containers_deployment) script :\n   ```bash\n   sudo chmod +x ./scripts/containers_deployment.sh\n   ./scripts/containers_deployment.sh\n   ```\n2. Set up the environment variables in a [`.env`](.env) file based on the example in [`.env.example`](.env.example).\n\n3. Export [`.env`](.env) variables :\n   ```bash\n   export $(grep -v '^#' .env | xargs)\n   ```\n\n4. Make sure to remove the PostgreSQL (PgVector) volume:\n   ```bash\n   docker compose down -v\n   ```\n   \u003e ⚠️ This operation will delete all volumes ! \n\n5. Use the Airflow API to obtain the `JWT_TOKEN` variable:\n   ```bash\n   curl -X 'POST' \\\n   'http://localhost:8080/auth/token' \\\n   -H 'Content-Type: application/json' \\\n   -d \"{\\\"username\\\": \\\"${_AIRFLOW_WWW_USER_USERNAME}\\\", \\\"password\\\": \\\"${_AIRFLOW_WWW_USER_PASSWORD}\\\"}\"\n   ```\n\n6. 
Define the `JWT_TOKEN` variable in the [`.env`](.env) file with the obtained `access_token`.\n\n7. Define the `full_pipeline_schedule` Airflow variable to set the execution schedule for the [full_pipeline DAG](full_pipeline.py), in one of two ways:\n\n- By running the following bash command:\n   ```bash\n   docker exec -it airflow-scheduler airflow variables set full_pipeline_schedule \"0 19 * * 5\"\n   ```\n   \u003e The cron expression \"0 19 * * 5\" schedules the DAG to run every Friday at 19:00 (7:00 PM). Replace the cron expression with your desired schedule or `None`.\n- From the Airflow UI: `Admin` \u003e `Variable` \u003e `+ Add Variable`\n\n#### Optional: Configure Tchap logging\n\nTo receive real-time notifications about DAG execution (start, success, failure) in a Tchap room, you need to configure an Apprise connection in Airflow.\n\u003e If you don't want these notifications, remove the following lines from each DAG located in [`airflow_config/dags/`](airflow_config/dags/):\n\n      on_execute_callback=get_start_notifier(),\n      on_success_callback=get_success_notifier(),\n      on_failure_callback=get_failure_notifier(),\n\nOtherwise:\n\n1.  Navigate to the Airflow UI (usually `http://localhost:8080`).\n2.  Go to **Admin \u003e Connections**.\n3.  Click the **`+`** icon to add a new record.\n4.  
Fill in the connection form with the following details:\n    *   **Connection Id**: `TchapNotifier`\n    *   **Connection Type**: `Apprise`\n    *   **Extra fields \u003e config**: Construct the Apprise URL for Matrix using your environment variables, following this format:\n        ```\n        {\"path\": \"matrixs://\u003cTCHAP_ACCOUNT_TOKEN\u003e@\u003cTCHAP_SERVER\u003e/\u003cTCHAP_ROOM_TOKEN\u003e/?format=markdown\", \"tag\": \"alerts\"}\n        ```\n        -   Replace `\u003cTCHAP_ACCOUNT_TOKEN\u003e` with the value from your `.env` file.\n        -   Replace `\u003cTCHAP_SERVER\u003e` with the server hostname from your `.env` file (e.g., `matrix.agent.dinum.tchap.gouv.fr`, **without** the `https://` prefix).\n        -   Replace `\u003cTCHAP_ROOM_TOKEN\u003e` with the room ID from your `.env` file.\n\n5.  Click **Save**.\n\nAirflow will now use this connection to send formatted notifications to your specified Tchap room.\n\n#### Downloading, Processing and Uploading Data\n\nYou are now ready to use Airflow and execute DAGs that are available.\nEach dataset has its own DAG and the DAG [`FULL_PIPELINE`](./airflow_config/dags/full_pipeline.py) is defined to manage all datasets DAGs and their execution order.\n\n### \u003c/\u003e Method 2 : Use local CLI\n\n#### Installing Dependencies\n\n1. Install the required apt dependencies:\n   ```bash\n   sudo apt-get update\n   sudo apt-get install -y $(cat config/requirements-apt-container.txt)\n   ```\n\n2. Create and activate a virtual environment:\n   ```bash\n   python3 -m venv .venv  # Create the virtual environment\n   source .venv/bin/activate  # Activate the virtual environment\n   ```\n\n3. 
Install the required python dependencies:\n   ```bash\n   pip install -e .\n   ```\n\n\u003e Installing in development mode (`-e`) allows you to use the `mediatech` command and modify the code without reinstalling.\n\n\u003e **Note:** Make sure your environment is properly configured before continuing.\n\n#### PostgreSQL (PgVector) Database Configuration\n\n1. Set up the environment variables in a [`.env`](.env) file based on the example in [`.env.example`](.env.example).\n\n2. Export [`.env`](.env) variables :\n   ```bash\n   export $(grep -v '^#' .env | xargs)\n   ```\n\n3. Start the PostgreSQL container with Docker:\n   ```bash\n   docker compose up -d postgres\n   ```\n\n4. Check that the `pgvector_container` container is running:\n   ```bash\n   docker ps\n   ```\n\n#### Downloading, Processing and Uploading Data\n\n##### Using the `mediatech` Command\n\nAfter installation, the `mediatech` command is available globally and replaces `python main.py`:\n\n\u003e If you encounter issues with the `mediatech` command, you can still use `python main.py` instead.\n\nThe [`main.py`](main.py) file is the main entry point of the project and provides a command-line interface (CLI) to run each step of the pipeline separately.  
\nYou can use it as follows:\n\n```bash\nmediatech \u003ccommand\u003e [options]\n```\nor \n\n```bash\npython main.py \u003ccommand\u003e [options]\n```\n\nCommand examples:\n- View help:\n  ```bash\n  mediatech --help\n  ```\n- Create PostgreSQL tables:  \n  ```bash\n  mediatech create_tables --model BAAI/bge-m3\n  ```\n- Download all files listed in [`data_config.json`](config/data_config.json):  \n  ```bash\n  mediatech download_files --all\n  ```\n- Download files from the `service_public` source:  \n  ```bash\n  mediatech download_files --source service_public\n  ```\n- Download and process all files listed in [`data_config.json`](config/data_config.json):  \n  ```bash\n  mediatech download_and_process_files --all --model BAAI/bge-m3\n  ```\n- Process all data:  \n  ```bash\n  mediatech process_files --all --model BAAI/bge-m3\n  ```\n- Split a table into subtables based on different criteria (see [`main.py`](main.py)):  \n  ```bash\n  mediatech split_table --source legi\n  ```\n- Export PostgreSQL tables to parquet files:  \n  ```bash\n  mediatech export_tables --output data/parquet\n  ```\n- Upload parquet datasets to the Hugging Face repository:\n  ```bash\n  mediatech upload_dataset --input data/parquet/service_public.parquet --dataset-name service-public\n  ```\n\n\nRun `mediatech --help` in your terminal to see all available options, or check the code directly in [`main.py`](main.py).\n\n\n##### Alternative Usage with `python main.py`\n\nIf you prefer to use the Python script directly, you can always use:\n\n```bash\npython main.py \u003ccommand\u003e [options]\n```\n\nExamples:\n```bash\npython main.py download_files\npython main.py create_tables --model BAAI/bge-m3\npython main.py process_files --all --model BAAI/bge-m3\n```\n##### Using the [`update.sh`](update.sh) Script\n\nThe [`update.sh`](update.sh) script allows you to run the entire data processing pipeline: downloading, table creation, vectorization, and export.  
\nTo run it, execute the following command from the project root:\n\n```bash\n./scripts/update.sh\n```\n\nThis script will:\n- Wait for the PostgreSQL database to be available,\n- Create or update the necessary tables in the PostgreSQL database,\n- Download public files listed in [`data_config.json`](config/data_config.json),\n- Process and vectorize the data,\n- Export the tables in Parquet format,\n- Upload the Parquet files to [Hugging Face](https://huggingface.co/AgentPublic).\n\n## 🗂️ Project Structure\n\n- **[`main.py`](main.py)**: Main entry point to run the complete pipeline via CLI.\n- **[`pyproject.toml`](pyproject.toml)**: Python project and dependency configuration.\n- **[`Dockerfile`](Dockerfile)**: Defines the instructions to build the custom Docker image for Airflow, installing system dependencies, Python packages, and setting up the project environment.\n- **[`docker-compose.yml`](docker-compose.yml)**: Orchestrates the multi-container setup, defining Airflow services and the PostgreSQL (PgVector) database.\n- **[`.github/`](.github/)**: Contains GitHub Actions workflows for Continuous Integration and Continuous Deployment (CI/CD), automating testing and deployment processes.\n- **[`download_and_processing/`](download_and_processing/)**: Contains scripts to download and extract files.\n- **[`database/`](database/)**: Contains scripts to manage the database (table creation, data insertion).\n- **[`docs/`](/docs/)**: Contains various documentation resources and tutorials.\n  - **[`docs/hugging_face_rag_tutorial.ipynb`](/docs/hugging_face_rag_tutorial.ipynb)**: RAG Tutorial: How to load MediaTech's datasets from Hugging Face and use them in a RAG pipeline ?\n  - **[`docs/reconstruct_vector_database.ipynb`](/docs/reconstruct_vector_database.ipynb)**: Tutorial: How to reconstruct a dataset without chunking and embedding from MediaTech parquet files uploaded to Hugging Face?\n  - **[`docs/fr/`](/docs/fr/)**: Contains all documentation resources and 
tutorials translated into French.\n- **[`utils/`](utils/)**: Contains utility functions shared across modules.\n- **[`config/`](config/)**: Contains project configuration scripts.\n- **[`logs/`](logs/)**: Contains log files to track [scripts](scripts/) execution.\n- **[`scripts/`](scripts/)**: Contains all shell scripts, executed either automatically or manually in some cases.\n  - **[`scripts/update.sh`](scripts/update.sh)**: Shell script to run the entire data processing pipeline.\n  - **[`scripts/periodic_update.sh`](scripts/periodic_update.sh)**: Shell script to automate the pipeline on the virtual machine. This script is executed periodically by [`cron_config.txt`](cron_config.txt).\n  - **[`scripts/backup.sh`](scripts/backup.sh)**: Shell script to back up the Pgvector (PostgreSQL) volume and some configuration files. This script is executed periodically by [`cron_config.txt`](cron_config.txt).\n  - **[`scripts/restore.sh`](scripts/restore.sh)**: Shell script to restore the Pgvector (PostgreSQL) volume and configuration files if needed.\n  - **[`scripts/initial_deployment.sh`](scripts/initial_deployment.sh)**: Sets up a new server environment by installing Docker, Docker Compose, and other system dependencies.\n  - **[`scripts/containers_deployment.sh`](scripts/containers_deployment.sh)**:  Manages the application's lifecycle by building, initializing, and deploying the Docker containers as defined in [docker-compose.yml](docker-compose.yml). 
It must be executed after each update of the Mediatech CLI or other script not shared with the Airflow container, as defined in [docker-compose.yml](docker-compose.yml).\n  - **[`scripts/check_running_dags.sh`](scripts/check_running_dags.sh)**: Checks the Airflow API to see if any data pipelines (DAGs) are currently running, used to safely lock the deployment process.\n  - **[`scripts/delete_old_files.sh`](scripts/delete_old_files.sh)**: Shell script to automatically delete old files from several folders such as [logs/](logs/), [airflow_config/logs](airflow_config/logs) and [backups/](backups/). It keeps files from the last X days and deletes older ones. This script can be run manually or scheduled via cron to keep the folders clean.\n  - **[`scripts/manage_checkpoint.sh`](scripts/manage_checkpoint.sh)**: Shell script to manage the various checkpoint files used for file processing.\n  - **[`scripts/write_tchap_message.sh`](scripts/write_tchap_message.sh)**: Sends a formatted message to a specified Tchap room. It takes the message content as an argument and uses environment variables for authentication and destination.\n- **[`airflow_config`](airflow_config/)**: Contains all files related to Apache Airflow, including DAG definitions (`dags/`), configuration (`config/`), logs (`logs/`), and plugins (`plugins/`). This is where the data orchestration pipelines are defined and managed.\n\n## ⚖️ License\n\nThis project is licensed under the [MIT License](./LICENSE).","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fetalab-ia%2Fmediatech","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fetalab-ia%2Fmediatech","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fetalab-ia%2Fmediatech/lists"}