An open API service indexing awesome lists of open source software.

https://github.com/inseefrlab/codif-ape-nace-revision

Centralized repo on nace revision
https://github.com/inseefrlab/codif-ape-nace-revision

Last synced: 3 months ago
JSON representation

Centralized repo on nace revision

Awesome Lists containing this project

README

        

# πŸ“Œ CAG vs RAG: Centralized Repository for NACE Revision

## πŸ“‘ Table of Contents

- [πŸ“Œ CAG vs RAG: Centralized Repository for NACE Revision](#-cag-vs-rag-centralized-repository-for-nace-revision)
- [πŸ“‘ Table of Contents](#-table-of-contents)
- [πŸ“– Overview](#-overview)
- [πŸš€ Getting Started](#-getting-started)
- [πŸ›  Installation](#-installation)
- [πŸ— Pre-commit Setup](#-pre-commit-setup)
- [Cache LLM model from S3 Bucket (Optionnal)](#cache-llm-model-from-s3-bucket-optionnal)
- [πŸ“œ Running the Scripts](#-running-the-scripts)
- [βœ… 1. Build Vector Database (if you are in the RAG case)](#-1-build-vector-database-if-you-are-in-the-rag-case)
- [🏷 2. Encode Business Activity Codes](#-2-encode-business-activity-codes)
- [πŸ”¬ 3. Evaluate Classification Strategies](#-3-evaluate-classification-strategies)
- [πŸ“Š 4. Build the NACE 2025 Dataset](#-4-build-the-nace-2025-dataset)
- [πŸ“‘ LLM Integration](#-llm-integration)
- [πŸ— Argo Workflows](#-argo-workflows)
- [πŸ“„ License](#-license)

## πŸ“– Overview
This repository is dedicated to the revision of the **Nomenclature statistique des ActivitΓ©s Γ©conomiques dans la CommunautΓ© EuropΓ©enne (NACE)**.

It provides tools for **automated classification and evaluation of business activity codes** using **Large Language Models (LLMs)** and vector-based retrieval systems.

## πŸš€ Getting Started

### πŸ›  Installation
Ensure you have **Python 3.12+** and **uv** or **pip** installed, then install the required dependencies:

```bash
uv pip install -r requirements.txt
```

or

```bash
uv pip install -r pyproject.toml
```

### πŸ— Pre-commit Setup
Set up linting and formatting checks using `pre-commit`:

```bash
pre-commit install
```

### Cache LLM model from S3 Bucket (Optionnal)
If you want to use a model available in the SSPCloud you can execute this command:

```bash
MODEL_NAME=mistralai/Ministral-8B-Instruct-2410
LOCAL_PATH=~/.cache/huggingface/hub

./bash/fetch_model_s3.sh $MODEL_NAME $LOCAL_PATH
```

## πŸ“œ Running the Scripts

### βœ… 1. Build Vector Database (if you are in the RAG case)

To create a searchable database of NACE 2025 codes:

```bash
python build-vector-db.py
```

### 🏷 2. Encode Business Activity Codes

For **unambiguous** classification:

```bash
python encode-univoque.py
```

For **ambiguous** classification using an LLM:

```bash
python encode-multivoque.py --experiment_name NACE2025_DATASET --llm_name mistralai/Mistral-7B-Instruct
```

### πŸ”¬ 3. Evaluate Classification Strategies

Compare different classification models:

```bash
python evaluate_strategies.py
```

### πŸ“Š 4. Build the NACE 2025 Dataset

Once all unique ambiguous cases have been recoded using the best strategy, you can rebuild the entire dataset with NACE 2025 labels:

```bash
python build_nace2025_sirene4.py
```

---

## πŸ“‘ LLM Integration
This repository leverages **Large Language Models (LLMs)** to assist in classifying business activities. The supported models include:

- `Qwen/Qwen2.5-32B-Instruct`
- `mistralai/Mistral-Small-Instruct-2409`
- `hugging-quants/Meta-Llama-3.1-70B-Instruct-GPTQ-INT4`

These models help improve classification accuracy for ambiguous business activity cases.

## πŸ— Argo Workflows
This project supports **automated workflows** via [Argo Workflows](https://argoproj.github.io/argo-workflows/).
To trigger a workflow, execute:

```yaml
argo submit argo-workflows/relabel-naf08-to-naf25.yaml
```

Or use the **Argo Workflow UI**.

## πŸ“„ License
This project is licensed under the **Apache License 2.0**. See the [LICENSE](LICENSE) file for details.