https://github.com/inseefrlab/codif-ape-nace-revision
Centralized repo on NACE revision
- Host: GitHub
- URL: https://github.com/inseefrlab/codif-ape-nace-revision
- Owner: InseeFrLab
- License: apache-2.0
- Created: 2024-03-26T10:43:18.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-14T12:48:39.000Z (4 months ago)
- Last Synced: 2025-03-14T13:38:47.469Z (4 months ago)
- Language: Python
- Size: 111 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# CAG vs RAG: Centralized Repository for NACE Revision
## Table of Contents

- [CAG vs RAG: Centralized Repository for NACE Revision](#cag-vs-rag-centralized-repository-for-nace-revision)
- [Table of Contents](#table-of-contents)
- [Overview](#overview)
- [Getting Started](#getting-started)
- [Installation](#installation)
- [Pre-commit Setup](#pre-commit-setup)
- [Cache LLM model from S3 Bucket (Optional)](#cache-llm-model-from-s3-bucket-optional)
- [Running the Scripts](#running-the-scripts)
- [1. Build Vector Database (if you are in the RAG case)](#1-build-vector-database-if-you-are-in-the-rag-case)
- [2. Encode Business Activity Codes](#2-encode-business-activity-codes)
- [3. Evaluate Classification Strategies](#3-evaluate-classification-strategies)
- [4. Build the NACE 2025 Dataset](#4-build-the-nace-2025-dataset)
- [LLM Integration](#llm-integration)
- [Argo Workflows](#argo-workflows)
- [License](#license)

## Overview
This repository is dedicated to the revision of the **Nomenclature statistique des Activités économiques dans la Communauté Européenne (NACE)**. It provides tools for **automated classification and evaluation of business activity codes** using **Large Language Models (LLMs)** and vector-based retrieval systems.
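To illustrate the retrieval side, a vector database maps each NACE 2025 code description to an embedding and returns the codes closest to a free-text query. The sketch below uses a toy bag-of-words "embedding" and made-up code descriptions; the real pipeline presumably uses a proper sentence encoder and the full taxonomy.

```python
from collections import Counter
import math

# Toy NACE 2025 code descriptions (illustrative, not the official taxonomy).
NACE_2025 = {
    "62.10": "computer programming activities",
    "62.20": "computer consultancy activities",
    "47.91": "retail sale via internet",
}

def embed(text: str) -> Counter:
    """Bag-of-words 'embedding'; stands in for a real sentence encoder."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k NACE codes whose descriptions are closest to the query."""
    q = embed(query)
    ranked = sorted(NACE_2025, key=lambda c: cosine(q, embed(NACE_2025[c])), reverse=True)
    return ranked[:k]

print(retrieve("custom computer programming services"))  # → ['62.10', '62.20']
```

In the RAG setting, the retrieved candidates are then handed to the LLM; in the CAG setting, the relevant codes sit directly in the model's context instead.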
## Getting Started
### Installation
Ensure you have **Python 3.12+** and **uv** or **pip** installed, then install the required dependencies:

```bash
uv pip install -r requirements.txt
```

or

```bash
uv pip install -r pyproject.toml
```

### Pre-commit Setup
Set up linting and formatting checks using `pre-commit`:

```bash
pre-commit install
```

### Cache LLM model from S3 Bucket (Optional)
If you want to use a model available on the SSPCloud, you can execute this command:

```bash
MODEL_NAME=mistralai/Ministral-8B-Instruct-2410
LOCAL_PATH=~/.cache/huggingface/hub
./bash/fetch_model_s3.sh $MODEL_NAME $LOCAL_PATH
```

## Running the Scripts
### 1. Build Vector Database (if you are in the RAG case)
To create a searchable database of NACE 2025 codes:
```bash
python build-vector-db.py
```

### 2. Encode Business Activity Codes
For **unambiguous** classification:
```bash
python encode-univoque.py
```

For **ambiguous** classification using an LLM:

```bash
python encode-multivoque.py --experiment_name NACE2025_DATASET --llm_name mistralai/Mistral-7B-Instruct
```

### 3. Evaluate Classification Strategies
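In essence, this step measures each strategy's agreement with gold NACE 2025 labels on a sample. A toy sketch with hypothetical predictions (the actual evaluation lives in `evaluate_strategies.py`):

```python
# Toy gold labels and two hypothetical strategies' predictions (illustrative only).
gold = ["62.10", "47.91", "62.20", "62.10"]
strategies = {
    "rag": ["62.10", "47.91", "62.10", "62.10"],  # 3/4 correct
    "cag": ["62.10", "47.91", "62.20", "62.10"],  # 4/4 correct
}

def accuracy(pred: list[str], truth: list[str]) -> float:
    """Share of predictions matching the gold NACE 2025 labels."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

best = max(strategies, key=lambda name: accuracy(strategies[name], gold))
print(best)  # → cag
```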
Compare different classification models:
```bash
python evaluate_strategies.py
```

### 4. Build the NACE 2025 Dataset
Once all unique ambiguous cases have been recoded using the best strategy, you can rebuild the entire dataset with NACE 2025 labels:
```bash
python build_nace2025_sirene4.py
```

---
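Conceptually, the rebuild in step 4 applies the final recoding table (univocal mappings plus the LLM-chosen codes for ambiguous cases) to every record. A minimal sketch with hypothetical field names and values:

```python
# Hypothetical Sirene-like records carrying old NAF 2008 codes (illustrative data).
records = [
    {"siren": "123456789", "naf08": "62.01Z"},
    {"siren": "987654321", "naf08": "70.22Z"},
]
# Combined recoding table: univocal mappings plus LLM-resolved ambiguous cases.
recoding = {"62.01Z": "62.10", "70.22Z": "70.20"}

def rebuild(rows: list[dict], table: dict[str, str]) -> list[dict]:
    """Attach a NACE 2025 label to every record using the recoding table."""
    return [{**row, "nace25": table[row["naf08"]]} for row in rows]

for row in rebuild(records, recoding):
    print(row["siren"], row["nace25"])
```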
## LLM Integration
This repository leverages **Large Language Models (LLMs)** to assist in classifying business activities. The supported models include:

- `Qwen/Qwen2.5-32B-Instruct`
- `mistralai/Mistral-Small-Instruct-2409`
- `hugging-quants/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4`

These models help improve classification accuracy for ambiguous business activity cases.
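For an ambiguous case, such a model is typically given the activity description together with the candidate codes and asked to pick exactly one. A sketch of what the prompt construction might look like; the candidate labels and wording here are hypothetical, not the repository's actual prompts:

```python
# Sketch of building a classification prompt for an ambiguous activity description.
def build_prompt(activity: str, candidates: dict[str, str]) -> str:
    """Assemble an instruction asking the LLM to pick one candidate NACE code."""
    lines = [
        "You are a NACE 2025 classification assistant.",
        f"Activity description: {activity}",
        "Choose exactly one code from the candidates below and answer with the code only.",
    ]
    lines += [f"- {code}: {label}" for code, label in candidates.items()]
    return "\n".join(lines)

prompt = build_prompt(
    "custom software development",
    {"62.10": "computer programming activities", "62.20": "computer consultancy activities"},
)
print(prompt)
```

In the RAG case the candidates would come from the vector database; in the CAG case the relevant slice of the taxonomy is placed directly in the context.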
## Argo Workflows
This project supports **automated workflows** via [Argo Workflows](https://argoproj.github.io/argo-workflows/).
To trigger a workflow, execute:

```bash
argo submit argo-workflows/relabel-naf08-to-naf25.yaml
```

Or use the **Argo Workflows UI**.
## License
This project is licensed under the **Apache License 2.0**. See the [LICENSE](LICENSE) file for details.