https://github.com/cablelabs/llmdata

Companion data sets for LLM projects

Last synced: 6 months ago

Host: GitHub
URL: https://github.com/cablelabs/llmdata
Owner: cablelabs
License: apache-2.0
Created: 2024-01-24T15:57:01.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-02-13T15:23:07.000Z (over 2 years ago)
Last Synced: 2024-04-15T06:12:54.745Z (over 2 years ago)
Size: 2.16 MB
Stars: 0
Watchers: 5
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md

README

# LLM Data Companion
This repository contains companion data sets for LLM
publications.

## AI-Assisted Ideation
Companion data to
[Randomness Is All You Need: Semantic Traversal of Problem-Solution Spaces with Large Language Models](https://arxiv.org/abs/2402.06053)

Please cite that paper if you make use of this data, e.g. with the following BibTeX snippet:
```Latex
@article{sandholm2024,
title={{Randomness Is All You Need: Semantic Traversal of Problem-Solution Spaces with Large Language Models}},
author={Thomas Sandholm and Sayandev Mukherjee and Bernardo A. Huberman},
journal={arXiv preprint arXiv:2402.06053},
year={2024}
}
```

The generated data dumps are available in [/aidea](aidea/).

They are organized by the original problem statement that generated
them:

| Data | Problem Statement |
|:---|:---|
| timeline | Software project timelines are often underestimated, which leads to high costs. |
| employee | It is difficult to measure employee satisfaction in an unbiased way. |
| startup | It is not easy for early startups to find a customer base willing to try new technology. |
| data | Companies struggle with gaining insights from large volumes and high velocity of data. |
| satisfaction | It is hard to track and measure customer satisfaction across large geographies. |
| invest | It is difficult to plan investments in an uncertain economy. |
| innovation | It is difficult to create innovation opportunities without introducing too much process and hampering creativity. |
| talent | Retaining high-performing talent is hard in competitive emerging markets. |
| ml | Large machine learning models are expensive and time consuming to train. |
| privacy | Ensuring privacy of customers is difficult while leveraging their data for business insights. |

Problems generated from solutions have the prefix `gprob`.
The solution they are generated from has the prefix `gsol`.

The file naming convention is:
```
<type>.<index>.<temperature>.txt
```
where `type` is one of `gsol`, `gprob`, `sol`, or `prob`, denoting solutions to
generated problems, generated problems, solutions, and problems, respectively. The
`index` denotes the order in which the solution was generated, starting
with solution 1, which is the solution to the original problem. The ordering
is determined by a depth-first search of the tree of related and generated
problems. The `temperature` is the LLM temperature set when generating solutions
and problems from a prompt. The temperatures used include `0.5`, `0.6`, `0.7`, `0.8`, `0.9`,
`1.0`, and `1.1`. The actual temperature fed into the LLM is a uniform random number
in the interval `[temp, temp + 0.1]`.
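
As a minimal sketch of the two conventions described above, the following Python snippet draws an actual temperature from a base value and splits a file name into its parts. It is illustrative only and not taken from the paper's code. The function names `sample_temperature` and `parse_name` are hypothetical. The exact file name layout (`type`, `index` and `temperature` separated by dots, e.g. `gsol.3.0.7.txt`) is an assumption. So is the reading that the file name carries the base temperature rather than the sampled one.

```python
import random
import re

# Base temperatures listed in this README.
BASE_TEMPERATURES = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1]

# ASSUMPTION: file names look like "<type>.<index>.<temperature>.txt",
# for example "gsol.3.0.7.txt". Adjust the pattern if the files differ.
NAME_RE = re.compile(r"^(gsol|gprob|sol|prob)\.(\d+)\.(\d+\.\d+)\.txt$")


def sample_temperature(base, rng=random):
    """Draw a temperature uniformly from the interval [base, base + 0.1]."""
    return rng.uniform(base, base + 0.1)


def parse_name(filename):
    """Return (type, index, temperature) for a dump file name, or None if it does not match."""
    match = NAME_RE.match(filename)
    if match is None:
        return None
    kind, index, temperature = match.groups()
    return kind, int(index), float(temperature)


if __name__ == "__main__":
    # Show one sampled temperature per base value.
    for base in BASE_TEMPERATURES:
        print(base, round(sample_temperature(base), 3))
    # Parse an example name that follows the assumed layout.
    print(parse_name("gsol.3.0.7.txt"))
```

Passing a seeded `random.Random` instance as `rng` makes the sampled temperatures reproducible.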
