https://github.com/inseefrlab/codif-ape-nace-revision
Centralized repo on NACE revision
- Host: GitHub
- URL: https://github.com/inseefrlab/codif-ape-nace-revision
- Owner: InseeFrLab
- License: apache-2.0
- Created: 2024-03-26T10:43:18.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-14T12:48:39.000Z (4 months ago)
- Last Synced: 2025-03-14T13:38:47.469Z (4 months ago)
- Language: Python
- Size: 111 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# CAG vs RAG: Centralized Repository for NACE Revision
## Table of Contents

- [CAG vs RAG: Centralized Repository for NACE Revision](#cag-vs-rag-centralized-repository-for-nace-revision)
- [Table of Contents](#table-of-contents)
- [Overview](#overview)
- [Getting Started](#getting-started)
- [Installation](#installation)
- [Pre-commit Setup](#pre-commit-setup)
- [Cache LLM model from S3 Bucket (Optional)](#cache-llm-model-from-s3-bucket-optional)
- [Running the Scripts](#running-the-scripts)
- [1. Build Vector Database (if you are in the RAG case)](#1-build-vector-database-if-you-are-in-the-rag-case)
- [2. Encode Business Activity Codes](#2-encode-business-activity-codes)
- [3. Evaluate Classification Strategies](#3-evaluate-classification-strategies)
- [4. Build the NACE 2025 Dataset](#4-build-the-nace-2025-dataset)
- [LLM Integration](#llm-integration)
- [Argo Workflows](#argo-workflows)
- [License](#license)

## Overview
This repository is dedicated to the revision of the **Nomenclature statistique des Activités économiques dans la Communauté Européenne (NACE)**. It provides tools for **automated classification and evaluation of business activity codes** using **Large Language Models (LLMs)** and vector-based retrieval systems.
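To illustrate the retrieval side, a vector database maps each NACE 2025 code description to an embedding and returns the codes closest to a free-text query. The sketch below uses a toy bag-of-words "embedding" and made-up code descriptions; the real pipeline presumably uses a proper sentence encoder and the full taxonomy.

```python
from collections import Counter
import math

# Toy NACE 2025 code descriptions (illustrative, not the official taxonomy).
NACE_2025 = {
    "62.10": "computer programming activities",
    "62.20": "computer consultancy activities",
    "47.91": "retail sale via internet",
}

def embed(text: str) -> Counter:
    """Bag-of-words 'embedding'; stands in for a real sentence encoder."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k NACE codes whose descriptions are closest to the query."""
    q = embed(query)
    ranked = sorted(NACE_2025, key=lambda c: cosine(q, embed(NACE_2025[c])), reverse=True)
    return ranked[:k]

print(retrieve("custom computer programming services"))  # → ['62.10', '62.20']
```

In the RAG setting, the retrieved candidates are then handed to the LLM; in the CAG setting, the relevant codes sit directly in the model's context instead.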
## Getting Started
### Installation
Ensure you have **Python 3.12+** and **uv** or **pip** installed, then install the required dependencies:

```bash
uv pip install -r requirements.txt
```

or

```bash
uv pip install -r pyproject.toml
```

### Pre-commit Setup
Set up linting and formatting checks using `pre-commit`:

```bash
pre-commit install
```

### Cache LLM model from S3 Bucket (Optional)
If you want to use a model available on the SSPCloud, you can execute this command:

```bash
MODEL_NAME=mistralai/Ministral-8B-Instruct-2410
LOCAL_PATH=~/.cache/huggingface/hub
./bash/fetch_model_s3.sh $MODEL_NAME $LOCAL_PATH
```

## Running the Scripts
### 1. Build Vector Database (if you are in the RAG case)
To create a searchable database of NACE 2025 codes:
```bash
python build-vector-db.py
```

### 2. Encode Business Activity Codes
For **unambiguous** classification:
```bash
python encode-univoque.py
```

For **ambiguous** classification using an LLM:

```bash
python encode-multivoque.py --experiment_name NACE2025_DATASET --llm_name mistralai/Mistral-7B-Instruct
```

### 3. Evaluate Classification Strategies
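In essence, this step measures each strategy's agreement with gold NACE 2025 labels on a sample. A toy sketch with hypothetical predictions (the actual evaluation lives in `evaluate_strategies.py`):

```python
# Toy gold labels and two hypothetical strategies' predictions (illustrative only).
gold = ["62.10", "47.91", "62.20", "62.10"]
strategies = {
    "rag": ["62.10", "47.91", "62.10", "62.10"],  # 3/4 correct
    "cag": ["62.10", "47.91", "62.20", "62.10"],  # 4/4 correct
}

def accuracy(pred: list[str], truth: list[str]) -> float:
    """Share of predictions matching the gold NACE 2025 labels."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

best = max(strategies, key=lambda name: accuracy(strategies[name], gold))
print(best)  # → cag
```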
Compare different classification models:
```bash
python evaluate_strategies.py
```

### 4. Build the NACE 2025 Dataset
Once all unique ambiguous cases have been recoded using the best strategy, you can rebuild the entire dataset with NACE 2025 labels:
```bash
python build_nace2025_sirene4.py
```

---
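Conceptually, the rebuild in step 4 applies the final recoding table (univocal mappings plus the LLM-chosen codes for ambiguous cases) to every record. A minimal sketch with hypothetical field names and values:

```python
# Hypothetical Sirene-like records carrying old NAF 2008 codes (illustrative data).
records = [
    {"siren": "123456789", "naf08": "62.01Z"},
    {"siren": "987654321", "naf08": "70.22Z"},
]
# Combined recoding table: univocal mappings plus LLM-resolved ambiguous cases.
recoding = {"62.01Z": "62.10", "70.22Z": "70.20"}

def rebuild(rows: list[dict], table: dict[str, str]) -> list[dict]:
    """Attach a NACE 2025 label to every record using the recoding table."""
    return [{**row, "nace25": table[row["naf08"]]} for row in rows]

for row in rebuild(records, recoding):
    print(row["siren"], row["nace25"])
```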
## LLM Integration
This repository leverages **Large Language Models (LLMs)** to assist in classifying business activities. The supported models include:

- `Qwen/Qwen2.5-32B-Instruct`
- `mistralai/Mistral-Small-Instruct-2409`
- `hugging-quants/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4`

These models help improve classification accuracy for ambiguous business activity cases.
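For an ambiguous case, such a model is typically given the activity description together with the candidate codes and asked to pick exactly one. A sketch of what the prompt construction might look like; the candidate labels and wording here are hypothetical, not the repository's actual prompts:

```python
# Sketch of building a classification prompt for an ambiguous activity description.
def build_prompt(activity: str, candidates: dict[str, str]) -> str:
    """Assemble an instruction asking the LLM to pick one candidate NACE code."""
    lines = [
        "You are a NACE 2025 classification assistant.",
        f"Activity description: {activity}",
        "Choose exactly one code from the candidates below and answer with the code only.",
    ]
    lines += [f"- {code}: {label}" for code, label in candidates.items()]
    return "\n".join(lines)

prompt = build_prompt(
    "custom software development",
    {"62.10": "computer programming activities", "62.20": "computer consultancy activities"},
)
print(prompt)
```

In the RAG case the candidates would come from the vector database; in the CAG case the relevant slice of the taxonomy is placed directly in the context.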
## Argo Workflows
This project supports **automated workflows** via [Argo Workflows](https://argoproj.github.io/argo-workflows/).
To trigger a workflow, execute:

```bash
argo submit argo-workflows/relabel-naf08-to-naf25.yaml
```

Or use the **Argo Workflows UI**.
## License
This project is licensed under the **Apache License 2.0**. See the [LICENSE](LICENSE) file for details.