https://github.com/HazyResearch/evaporate

This repo contains data and code for the paper "Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes"

Host: GitHub
URL: https://github.com/HazyResearch/evaporate
Owner: HazyResearch
Created: 2023-03-25T18:24:33.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2024-03-26T04:34:06.000Z (11 months ago)
Last Synced: 2024-08-02T15:33:11.168Z (7 months ago)
Language: Python
Homepage:
Size: 10.1 MB
Stars: 473
Watchers: 18
Forks: 45
Open Issues: 11
Metadata Files:
- Readme: README.md

README

# Evaporate

Code, datasets, and extended writeup for the paper [Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes](https://www.vldb.org/pvldb/vol17/p92-arora.pdf).

## Setup

We encourage the use of conda environments:
```bash
conda create --name evaporate python=3.8
conda activate evaporate
```

Clone as follows:
```bash
# Evaporate code
git clone git@github.com:HazyResearch/evaporate.git
cd evaporate
pip install -e .

# Weak supervision code
cd metal-evap
git submodule init
git submodule update
pip install -e .

# Manifest (optional): install from source if you want to modify the set of
# supported models. Otherwise, ``setup.py`` installs ``manifest-ml``.
git clone git@github.com:HazyResearch/manifest.git
cd manifest
pip install -e .
```

## Datasets
The data used in the paper is hosted on Hugging Face's datasets platform: https://huggingface.co/datasets/hazyresearch/evaporate.

To download the datasets, run the following commands in your terminal:
```bash
git lfs install
git clone https://huggingface.co/datasets/hazyresearch/evaporate
```

Or download them via Python:
```python
from datasets import load_dataset
dataset = load_dataset("hazyresearch/evaporate")
```

By default, the code expects the data to be stored at ``/data/evaporate/``, as specified in ``CONSTANTS`` in ``constants.py``. Change that value if you store the data elsewhere.
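
If the data lives somewhere other than the default, a quick check before launching a run can catch a wrong path early. The sketch below is a hypothetical helper and not part of this repository: the `resolve_data_dir` function and the `EVAPORATE_DATA_DIR` environment variable are assumptions made for illustration. The repository itself reads the path from ``constants.py``.

```python
import os
from pathlib import Path
from typing import Optional

# Default location named in this README.
DEFAULT_DATA_DIR = "/data/evaporate/"


def resolve_data_dir(override: Optional[str] = None) -> Path:
    """Return the data directory to use, failing early if it is missing.

    Precedence: explicit override, then the (hypothetical)
    EVAPORATE_DATA_DIR environment variable, then the default.
    """
    raw = override or os.environ.get("EVAPORATE_DATA_DIR") or DEFAULT_DATA_DIR
    path = Path(raw).expanduser()
    if not path.is_dir():
        raise FileNotFoundError(f"Data directory not found: {path}")
    return path
```

Calling `resolve_data_dir("~/datasets/evaporate")` would return the expanded path if the directory exists, and raise `FileNotFoundError` otherwise. The resolved path still has to be written into ``constants.py`` for the code to pick it up.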

## Running the code
Run closed and open information extraction (IE) using the commands:

```bash
cd src/
bash run.sh
```

The ``keys`` in ``run.sh`` are obtained by registering with the LLM provider. For instance, to run inference with the OpenAI API models, create an account [here](https://openai.com/api/).

The script includes commands for both closed and open IE runs. To walk through the code, look at ``run_profiler.py``. For open IE, the code first uses ``schema_identification.py`` to generate a list of attributes for the schema. Next, the code iterates through this list to perform extraction using ``profiler.py``. As functions are generated in ``profiler.py``, ``evaluate_profiler.py`` is used to score the function outputs against the outputs of directly prompting the LM on the sample documents.
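
The open IE flow described above can be illustrated with a toy sketch. It is not the repository's code: `identify_schema`, `candidate_functions`, `score`, and `build_extractors` are hypothetical stand-ins, and the "direct LM" outputs are hard-coded so the example runs offline. It only shows the control flow: derive a list of attributes, then for each attribute score candidate extraction functions against the direct outputs on sample documents. Keeping a single top-scoring function is a simplification chosen for brevity.

```python
import re
from typing import Callable, Dict, List, Optional

Extractor = Callable[[str], Optional[str]]

# Sample documents standing in for files from a data lake.
SAMPLE_DOCS: List[str] = [
    "Name: Ada Lovelace\nYear: 1815",
    "Name: Alan Turing\nYear: 1912",
]

# Hard-coded stand-in for directly prompting an LM on each sample document.
DIRECT_OUTPUTS: Dict[str, List[str]] = {
    "name": ["Ada Lovelace", "Alan Turing"],
    "year": ["1815", "1912"],
}


def identify_schema(docs: List[str]) -> List[str]:
    """Hypothetical schema identification: collect 'key:' line prefixes."""
    attributes: List[str] = []
    for doc in docs:
        for line in doc.splitlines():
            key = line.split(":", 1)[0].strip().lower()
            if key and key not in attributes:
                attributes.append(key)
    return attributes


def candidate_functions(attribute: str) -> List[Extractor]:
    """Hypothetical stand-in for extraction functions generated by an LM."""

    def by_label(doc: str) -> Optional[str]:
        # Look for a line of the form "<attribute>: <value>".
        pattern = rf"^{re.escape(attribute)}:\s*(.+)$"
        match = re.search(pattern, doc, re.I | re.M)
        return match.group(1).strip() if match else None

    def first_line(doc: str) -> Optional[str]:
        # Deliberately naive: always returns the value on the first line.
        return doc.splitlines()[0].split(":", 1)[-1].strip()

    return [first_line, by_label]


def score(fn: Extractor, docs: List[str], reference: List[str]) -> float:
    """Fraction of sample documents where fn agrees with the direct output."""
    hits = sum(fn(doc) == ref for doc, ref in zip(docs, reference))
    return hits / len(docs)


def build_extractors(docs: List[str]) -> Dict[str, Extractor]:
    """Keep the top-scoring candidate function for each attribute."""
    best: Dict[str, Extractor] = {}
    for attribute in identify_schema(docs):
        reference = DIRECT_OUTPUTS[attribute]
        best[attribute] = max(
            candidate_functions(attribute),
            key=lambda fn: score(fn, docs, reference),
        )
    return best


extractors = build_extractors(SAMPLE_DOCS)
table = [{attr: fn(doc) for attr, fn in extractors.items()} for doc in SAMPLE_DOCS]
```

Here `table` holds one row per document with one column per identified attribute. The naive `first_line` candidate scores zero on `year` and is discarded in favor of `by_label`, which is the role the scoring step plays in the description above.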

## Citation
If you use this codebase, or otherwise find our work valuable, please cite:
```bibtex
@article{arora2023evaporate,
title={Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes},
author={Arora, Simran and Yang, Brandon and Eyuboglu, Sabri and Narayan, Avanika and Hojel, Andrew and Trummer, Immanuel and R\'e, Christopher},
journal={arXiv:2304.09433},
year={2023}
}
```