https://github.com/S1M0N38/dspy-arxiv

Explore the use of DSPy for extracting features from PDFs 🔎

arxiv classification dspy nlp prompt-engineering


Host: GitHub
URL: https://github.com/S1M0N38/dspy-arxiv
Owner: S1M0N38
License: mit
Created: 2024-02-16T20:50:36.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-03-01T13:13:04.000Z (over 1 year ago)
Last Synced: 2025-03-24T18:50:41.157Z (3 months ago)
Topics: arxiv, classification, dspy, nlp, prompt-engineering
Language: HTML
Homepage: https://s1m0n38.github.io/dspy-arxiv/
Size: 184 KB
Stars: 39
Watchers: 1
Forks: 4
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

awesome-dspy - dspy-arxiv - Explore the use of DSPy for extracting features from PDFs ([DSPy](https://github.com/stanfordnlp/dspy) - A library for compiling declarative language model calls into self-improving pipelines. / Projects)

README

# dspy-arxiv

Explore the use of [DSPy](https://github.com/stanfordnlp/dspy) for extracting features from PDFs.
This repository provides a simple example of how to use this framework to predict the sub-category of a Computer Science paper from arXiv.

## Suggested Installation

1. Clone this repository.
2. Create a virtual environment.
3. Install dependencies from *requirements.txt*.
4. Install the virtual environment as a Jupyter kernel.

## Build Dataset & Database

The **dataset** is a selection of 150 arXiv papers (metadata + PDF) from the computer science category.

To build the database:

1. Download the JSON file from [Kaggle](https://www.kaggle.com/datasets/Cornell-University/arxiv) into the `dspy-arxiv` directory.
2. Rename the file to `arxiv.json`.
3. Run the notebook `data.ipynb` from top to bottom.
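
The filtering part of this step can be sketched as follows. This is not the code from `data.ipynb`: it assumes the Kaggle file stores one JSON object per line with a space-separated `categories` string, and it treats a paper as computer science when its first-listed category starts with `cs.`. The function names and the output path are hypothetical.

```python
import json
from pathlib import Path


def is_computer_science(record: dict) -> bool:
    """Return True when the first-listed category is a cs.* one (assumed criterion)."""
    categories = record.get("categories", "").split()
    return bool(categories) and categories[0].startswith("cs.")


def filter_cs(source: Path, target: Path) -> int:
    """Copy cs.* records from `source` to `target`, one JSON object per line.

    Returns the number of records kept.
    """
    target.parent.mkdir(parents=True, exist_ok=True)
    kept = 0
    with source.open(encoding="utf-8") as src, target.open("w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            if is_computer_science(record):
                dst.write(json.dumps(record) + "\n")
                kept += 1
    return kept
```

Reading line by line keeps memory use flat, which matters because the full metadata file is large. A call might look like `filter_cs(Path("arxiv.json"), Path("database/arxiv.json"))`.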

At the end, you should have two directories:
- *dspy-arxiv/database*
  - *arxiv.json* - the original JSON file with only the computer science category
- *dspy-arxiv/dataset*
  - *trainset* - 50 JSON files with metadata + text used for "training"
  - *valset* - 50 JSON files with metadata + text used for "validation"
  - *testset* - 50 JSON files with metadata + text used for "testing"
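
Once the directories exist, the three splits could be loaded with something like the sketch below. It assumes each split directory contains one `.json` file per paper; the helper names are hypothetical and the keys inside each file are not inspected.

```python
import json
from pathlib import Path

SPLITS = ("trainset", "valset", "testset")


def load_split(dataset_dir: Path, split: str) -> list[dict]:
    """Read every JSON file in dataset_dir/split, sorted by file name for a stable order."""
    files = sorted((dataset_dir / split).glob("*.json"))
    return [json.loads(f.read_text(encoding="utf-8")) for f in files]


def load_dataset(dataset_dir: Path) -> dict[str, list[dict]]:
    """Return a mapping from split name to the list of papers in that split."""
    return {split: load_split(dataset_dir, split) for split in SPLITS}
```

After `data = load_dataset(Path("dataset"))`, checking that `len(data[split]) == 50` for each split is a quick way to confirm the notebook ran to completion.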

> If you want to add RAG to the pipeline, it's handy to have the data in a vector database for fast retrieval.
> Check out *database.py* for an example script to set up [chromadb](https://docs.trychroma.com/) and populate it with arXiv metadata.
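
As a rough illustration of the populate step (not the contents of *database.py*), the sketch below adds records to a collection in batches. It assumes `collection` exposes a chromadb-style `add(ids=..., documents=..., metadatas=...)` method and that each record carries the `id`, `title`, `abstract`, and `categories` fields of the arXiv metadata. The function names and the batch size are hypothetical.

```python
from typing import Any, Iterable, Iterator


def batched(records: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Yield consecutive lists of at most `size` records."""
    batch: list[dict] = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter, batch


def populate(collection: Any, records: Iterable[dict], batch_size: int = 1000) -> int:
    """Add records to the collection in batches; returns how many were added.

    The abstract is stored as the document to embed; title and categories
    are stored as plain-string metadata.
    """
    added = 0
    for batch in batched(records, batch_size):
        collection.add(
            ids=[r["id"] for r in batch],
            documents=[r["abstract"] for r in batch],
            metadatas=[{"title": r["title"], "categories": r["categories"]} for r in batch],
        )
        added += len(batch)
    return added
```

Batching keeps each call small, which helps when the filtered metadata holds many thousands of papers. With chromadb installed, a collection can typically be obtained with `chromadb.PersistentClient(path=...).get_or_create_collection(...)`; the path and collection name are left to the reader.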

## Feature Extraction

The notebook *features.ipynb* is a short tutorial on using DSPy to programmatically prompt an LLM for feature extraction (in this case, predicting the sub-category of a Computer Science paper from arXiv).

You can also take a look at the [slides](https://s1m0n38.github.io/dspy-arxiv/#/) generated from this notebook.
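
To judge predicted sub-categories against the validation or test split, an exact-match score is one simple option. The sketch below is an assumption, not the metric used in *features.ipynb*: it follows the `(example, pred, trace=None)` argument convention that DSPy metrics use, and the `category` attribute name is hypothetical. It runs without DSPy by using `SimpleNamespace` stand-ins.

```python
from types import SimpleNamespace


def normalize(label: str) -> str:
    """Trim and lower-case a label so 'cs.CV ' and 'CS.cv' compare equal."""
    return label.strip().lower()


def subcategory_match(example, pred, trace=None) -> bool:
    """True when the predicted sub-category equals the gold one after normalization."""
    return normalize(example.category) == normalize(pred.category)


def accuracy(examples, predictions) -> float:
    """Fraction of exact matches; 0.0 for an empty input."""
    pairs = list(zip(examples, predictions))
    if not pairs:
        return 0.0
    return sum(subcategory_match(e, p) for e, p in pairs) / len(pairs)


# Stand-in objects; in practice these would be dataset examples and model outputs.
gold = [SimpleNamespace(category="cs.CV"), SimpleNamespace(category="cs.CL")]
guess = [SimpleNamespace(category="cs.cv "), SimpleNamespace(category="cs.LG")]
print(accuracy(gold, guess))  # 0.5
```

Normalizing before comparing avoids penalizing a model for casing or stray whitespace in an otherwise correct label.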
