https://github.com/HazyResearch/evaporate
This repo contains data and code for the paper "Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes"
- Host: GitHub
- URL: https://github.com/HazyResearch/evaporate
- Owner: HazyResearch
- Created: 2023-03-25T18:24:33.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-03-26T04:34:06.000Z (8 months ago)
- Last Synced: 2024-08-02T15:33:11.168Z (3 months ago)
- Language: Python
- Homepage:
- Size: 10.1 MB
- Stars: 473
- Watchers: 18
- Forks: 45
- Open Issues: 11
Metadata Files:
- Readme: README.md
# Evaporate
Code, datasets, and extended writeup for the paper [Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes](https://www.vldb.org/pvldb/vol17/p92-arora.pdf).
## Setup
We encourage the use of conda environments:
```
conda create --name evaporate python=3.8
conda activate evaporate
```

Clone as follows:
```bash
# Evaporate code
git clone git@github.com:HazyResearch/evaporate.git
cd evaporate
pip install -e .

# Weak supervision code
cd metal-evap
git submodule init
git submodule update
pip install -e .

# Manifest (to install from source, which helps you modify the set of supported models. Otherwise, ``setup.py`` installs ``manifest-ml``)
git clone git@github.com:HazyResearch/manifest.git
cd manifest
pip install -e .
```

## Datasets
The data used in the paper is hosted on Hugging Face's datasets platform: https://huggingface.co/datasets/hazyresearch/evaporate.

To download the datasets, run the following commands in your terminal:
```bash
git lfs install
git clone https://huggingface.co/datasets/hazyresearch/evaporate
```

Or download it via Python:
```python
from datasets import load_dataset
dataset = load_dataset("hazyresearch/evaporate")
```

The code expects the data to be stored at ``/data/evaporate/``, as specified in ``constants.py`` CONSTANTS, though this path can be modified.
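Before running anything, it can help to confirm that the data actually sits where the code will look for it. The snippet below is a minimal sketch, not part of the repository: `describe_data_dir` and `DEFAULT_DATA_DIR` are hypothetical names, and the default path is simply the one quoted above. If you changed the path in ``constants.py``, pass your own path instead.

```python
from pathlib import Path

# Default location quoted in this README; an assumption if constants.py was edited.
DEFAULT_DATA_DIR = Path("/data/evaporate")


def describe_data_dir(data_dir=DEFAULT_DATA_DIR):
    """Return the sorted entry names in data_dir, or None if the directory is missing."""
    data_dir = Path(data_dir)
    if not data_dir.is_dir():
        return None
    return sorted(entry.name for entry in data_dir.iterdir())


if __name__ == "__main__":
    entries = describe_data_dir()
    if entries is None:
        print(f"{DEFAULT_DATA_DIR} not found; clone the dataset there "
              "or update the path in constants.py")
    else:
        print(f"Found {len(entries)} entries in {DEFAULT_DATA_DIR}")
```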
## Running the code
Run closed IE and open IE using the commands:

```bash
cd src/
bash run.sh
```

The ``keys`` in run.sh can be obtained by registering with the LLM provider. For instance, if you want to run inference with the OpenAI API models, create an account [here](https://openai.com/api/).
The script includes commands for both closed and open IE runs. To walk through the code, look at ``run_profiler.py``. For open IE, the code first uses ``schema_identification.py`` to generate a list of attributes for the schema. Next, the code iterates through this list to perform extraction using ``profiler.py``. As functions are generated in ``profiler.py``, ``evaluate_profiler.py`` is used to score the function outputs against the outputs of directly prompting the LM on the sample documents.
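The scoring step described above can be illustrated with a toy example. This is a sketch under stated assumptions, not the repository's implementation: the sample documents, the hard-coded "direct LM outputs", the regex-based candidate functions, and the names `score_function` and `select_functions` are all hypothetical, and exact-match agreement is used as the metric purely for simplicity.

```python
import re

# Toy sample documents standing in for files from a data lake.
SAMPLE_DOCS = [
    "Name: Ada Lovelace\nBorn: 1815",
    "Name: Alan Turing\nBorn: 1912",
    "Name: Grace Hopper\nBorn: 1906",
]

# Stand-in for the values obtained by prompting the LM directly on each
# sample document (hard-coded here; a real run would call an LLM).
LM_DIRECT_OUTPUTS = {
    "name": ["Ada Lovelace", "Alan Turing", "Grace Hopper"],
    "born": ["1815", "1912", "1906"],
}


def first_match(pattern, doc):
    """Return the first capture group of pattern in doc, or '' if absent."""
    match = re.search(pattern, doc)
    return match.group(1) if match else ""


# Stand-in for LM-generated extraction functions: several candidates per attribute.
CANDIDATE_FUNCTIONS = {
    "name": [
        lambda doc: first_match(r"Name: (\w+)", doc),  # captures the first word only
        lambda doc: first_match(r"Name: (.+)", doc),   # captures the rest of the line
    ],
    "born": [
        lambda doc: first_match(r"Born: (\d{4})", doc),  # four-digit year after "Born: "
        lambda doc: first_match(r"(\d{2})", doc),        # first two digits anywhere
    ],
}


def score_function(fn, docs, reference):
    """Fraction of sample documents where fn agrees exactly with the reference."""
    hits = sum(fn(doc) == ref for doc, ref in zip(docs, reference))
    return hits / len(docs)


def select_functions(attributes):
    """For each attribute, return (index, score) of the best-scoring candidate."""
    best = {}
    for attr in attributes:
        scored = [
            (score_function(fn, SAMPLE_DOCS, LM_DIRECT_OUTPUTS[attr]), idx)
            for idx, fn in enumerate(CANDIDATE_FUNCTIONS[attr])
        ]
        best_score, best_idx = max(scored)  # ties resolve to the higher index
        best[attr] = (best_idx, best_score)
    return best


if __name__ == "__main__":
    print(select_functions(["name", "born"]))
```

The point of the design is cost: the LM is queried directly only on a small sample, and the cheap generated functions that agree with it can then be run over the full set of documents.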
## Citation
If you use this codebase, or otherwise find our work valuable, please cite:
```
@article{arora2023evaporate,
title={Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes},
author={Arora, Simran and Yang, Brandon and Eyuboglu, Sabri and Narayan, Avanika and Hojel, Andrew and Trummer, Immanuel and R\'e, Christopher},
journal={arXiv:2304.09433},
year={2023}
}
```