https://github.com/scorpia2004/lazarus

poc for Erdemir

# Idea Verification System — RAG Double-Checker for BERT Classifier

> **POC Project** | Erdemir | Timeline: 6–8 person-days

---

## Overview

Erdemir is a steel production company with ~10,000 employees. As part of a long-standing company tradition, every worker submits ideas for improving work processes and production. To date, over **300,000 ideas** have been collected.

An existing **BERT-based NLP classifier** automatically evaluates incoming ideas and outputs an `Approve` / `Reject` decision with **>70% accuracy**. This project builds a **RAG (Retrieval-Augmented Generation) system** that acts as a **secondary verification layer** on top of that classifier.

---

## Problem Statement

The BERT classifier processes a plain-text idea submission and outputs a binary decision:

```
Input: Plain-text idea (submitted by employee)
Output: Approve | Reject
```

With accuracy above 70%, the BERT classifier still misclassifies a meaningful share of ideas. A second-opinion layer is needed to reduce those misclassifications and increase confidence in each decision. This project builds that layer as a RAG system.
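The input/output contract above can be pinned down as a small data type that the verification layer consumes. This is a minimal sketch: the names `Decision`, `BertResult`, and `parse_bert_output` are hypothetical and not taken from the existing classifier, and the assumption that the classifier emits its label as a plain string is ours.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(str, Enum):
    """The two labels the BERT classifier can emit."""
    APPROVE = "Approve"
    REJECT = "Reject"


@dataclass(frozen=True)
class BertResult:
    """One classified submission, as handed to the RAG layer (hypothetical shape)."""
    idea_text: str
    decision: Decision


def parse_bert_output(idea_text: str, raw_label: str) -> BertResult:
    """Normalize a raw label string; raises ValueError for anything but Approve/Reject."""
    label = raw_label.strip().capitalize()
    return BertResult(idea_text=idea_text, decision=Decision(label))
```

Rejecting unknown labels early keeps malformed classifier output from silently reaching the verification step.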

---

## Solution Architecture

```
Employee Idea (plain text)
       │
       ▼
┌─────────────┐
│    BERT     │ ──── Approve / Reject
│ Classifier  │
└─────────────┘
       │
       ▼  (BERT output + original idea)
┌─────────────┐
│ RAG System  │ ──── Confirm / Override
│ (this repo) │
└─────────────┘
       │
       ▼
Final Decision
```

The RAG system receives both the **original idea text** and the **BERT classification output**, retrieves relevant context from the historical idea dataset, and produces a verification judgment — either confirming or challenging the BERT result.
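The retrieve-then-verify flow described above can be sketched with the standard library alone. Two simplifications are assumptions of this sketch, not the project's design: bag-of-words cosine similarity stands in for a real embedding model, and a majority vote over the retrieved neighbours stands in for the LLM generation step. All names (`retrieve`, `verify`, `TOY_HISTORY`) are hypothetical, and the toy history is made up for illustration.

```python
import math
import re
from collections import Counter

# Made-up examples standing in for the labeled historical idea dataset.
TOY_HISTORY = [
    ("Install guard rails on the rolling mill walkway", "Approve"),
    ("Add guard rails near the furnace walkway", "Approve"),
    ("Replace all office chairs with gold chairs", "Reject"),
    ("Paint the cafeteria walls every week", "Reject"),
]


def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, history: list, k: int = 3) -> list:
    """Return the k most similar (score, text, label) entries from history."""
    q = tokenize(query)
    scored = [(cosine(q, tokenize(text)), text, label) for text, label in history]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


def verify(idea: str, bert_label: str, history: list, k: int = 3) -> dict:
    """Confirm or override the BERT label using the labels of retrieved neighbours."""
    neighbours = [n for n in retrieve(idea, history, k) if n[0] > 0]
    votes = Counter(label for _, _, label in neighbours)
    if not votes:
        # No usable evidence: keep the BERT decision.
        return {"verdict": "Confirm", "final": bert_label, "evidence": []}
    top_label, top_count = votes.most_common(1)[0]
    # Override only when a strict majority of neighbours disagrees with BERT.
    override = top_label != bert_label and top_count > len(neighbours) / 2
    return {
        "verdict": "Override" if override else "Confirm",
        "final": top_label if override else bert_label,
        "evidence": [text for _, text, _ in neighbours],
    }
```

Returning the retrieved texts as `evidence` keeps each verdict traceable to the historical ideas that produced it, which is useful when reviewing overrides by hand.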

---

## Scope

This repository covers the **POC (Proof of Concept)** phase:

- Set up and configure the RAG system for local machine deployment (RTX 5090 or equivalent, targeting models like QwQ 3.5 or similar)
- Integrate with the existing BERT classifier output
- Evaluate RAG verification accuracy against the labeled dataset
- Determine whether fine-tuning is required based on data characteristics
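For the evaluation item above, one possible measurement is to compare accuracy before and after the verification layer on the labeled dataset, and to count how many overrides fixed a BERT error versus introduced a new one. The sketch below assumes labels are plain strings in three aligned lists; `accuracy` and `compare_layers` are hypothetical helper names.

```python
def accuracy(predictions: list, gold: list) -> float:
    """Fraction of predictions that match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def compare_layers(gold: list, bert: list, final: list) -> dict:
    """Compare BERT-only decisions with the final decisions after verification."""
    # An override "fixes" a case when BERT was wrong and the final label is right.
    fixed = sum(b != g and f == g for g, b, f in zip(gold, bert, final))
    # An override "breaks" a case when BERT was right and the final label is wrong.
    broken = sum(b == g and f != g for g, b, f in zip(gold, bert, final))
    return {
        "bert_accuracy": accuracy(bert, gold),
        "final_accuracy": accuracy(final, gold),
        "overrides_fixed": fixed,
        "overrides_broken": broken,
    }
```

Tracking fixed and broken overrides separately shows whether the second layer adds value or merely shuffles errors, which a single accuracy number can hide.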

---

## Current Status

| Component | Status |
|---|---|
| BERT classifier | ✅ Exists, working |
| Labeled idea dataset | ✅ Available |
| RAG system | 🔧 In development (this repo) |
| Fine-tuning | ❓ TBD — pending data review |

---

## Project Details

- **Client:** Erdemir
- **Scale:** 300,000+ historical ideas, 10,000 employees
- **Deployment target:** Local machine (RTX 5090 / QwQ 3.5 or similar)
- **POC estimate:** 6–8 person-days
- **Full build timeline:** 5–6 months

---

## Action Items

- [ ] Obtain project proposal and technical documentation from Eren
- [ ] Review dataset to determine fine-tuning requirements
- [ ] Set up RAG pipeline and connect to BERT output
- [ ] Evaluate end-to-end system performance

---

## Open Questions

- Is fine-tuning required for the RAG model? (Pending data review)
- What retrieval strategy best fits the idea domain? (semantic search, keyword, hybrid?)
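If the hybrid option is explored, one common way to merge a semantic ranking and a keyword ranking is reciprocal rank fusion (RRF). The sketch below is illustrative only and does not reflect a decision made in this project. The function name is hypothetical, the document IDs are placeholders, and the constant `k = 60` is the value conventionally used with RRF rather than something tuned on this dataset.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores the sum of 1 / (k + rank) over every list it appears in,
    with rank starting at 1. Ties are broken by document ID for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Placeholder rankings, as if produced by two separate retrievers.
semantic_ranking = ["a", "b", "c"]
keyword_ranking = ["b", "d", "a"]
fused = reciprocal_rank_fusion([semantic_ranking, keyword_ranking])
```

Because RRF uses only rank positions, it avoids having to calibrate similarity scores from different retrievers against each other.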