https://github.com/cablelabs/llmdata
Companion data sets for LLM projects
- Host: GitHub
- URL: https://github.com/cablelabs/llmdata
- Owner: cablelabs
- License: apache-2.0
- Created: 2024-01-24T15:57:01.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-02-13T15:23:07.000Z (over 2 years ago)
- Last Synced: 2024-04-15T06:12:54.745Z (about 2 years ago)
- Size: 2.16 MB
- Stars: 0
- Watchers: 5
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
# LLM Data Companion
This repository contains companion data sets for LLM
publications.
## AI-Assisted Ideation
Companion data to
[Randomness Is All You Need: Semantic Traversal of Problem-Solution Spaces with Large Language Models](https://arxiv.org/abs/2402.06053)
Please cite that paper if you make use of this data, for example with the following BibTeX snippet:
```Latex
@article{sandholm2024,
title={{Randomness Is All You Need: Semantic Traversal of Problem-Solution Spaces with Large Language Models}},
author={Thomas Sandholm and Sayandev Mukherjee and Bernardo A. Huberman},
journal={arXiv preprint arXiv:2402.06053},
year={2024}
}
```
The generated data dumps are available in [/aidea](aidea/).
They are organized by the original problem statement that generated
them:
| Data | Problem Statement |
|:---|:---|
| timeline | Software project timelines are often underestimated, which leads to high costs. |
| employee | It is difficult to measure employee satisfaction in an unbiased way. |
| startup | It is not easy for early startups to find a customer base willing to try new technology. |
| data | Companies struggle with gaining insights from large volumes and high velocity of data. |
| satisfaction | It is hard to track and measure customer satisfaction across large geographies. |
| invest | It is difficult to plan investments in an uncertain economy. |
| innovation | It is difficult to create innovation opportunities without introducing too much process and hampering creativity. |
| talent | Retaining high-performing talent is hard in competitive emerging markets. |
| ml | Large machine learning models are expensive and time consuming to train. |
| privacy | Ensuring privacy of customers is difficult while leveraging their data for business insights. |
Problems generated from solutions have the prefix `gprob`.
The solution they are generated from has the prefix `gsol`.
The file naming convention is:
```
<type>.<index>.<temperature>.txt
```
where `type` is one of `gsol`, `gprob`, `sol`, or `prob`, denoting solutions to generated
problems, generated problems, solutions, and problems, respectively. The
`index` denotes the order in which the solution was generated, starting
with solution 1, which is the solution to the original problem. The ordering
is determined by a depth-first search of the related problem and generated
problem tree. The `temperature` is the LLM temperature set during solution
and problem generation from a prompt. The temperatures used include `0.5`, `0.6`, `0.7`, `0.8`, `0.9`,
`1.0`, and `1.1`. The actual temperature fed into the LLM is a uniform random number
in the interval `[temp, temp + 0.1]`, where `temp` is the `temperature` value.
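
The temperature jitter described above can be sketched in Python as follows. This is a minimal illustration, not the code used to produce the data: the function name `sample_temperature` and the list `NOMINAL_TEMPERATURES` are hypothetical names introduced here, and Python's `random.uniform` stands in for whatever random source was actually used.

```python
import random

# Nominal temperatures listed in this README (the values that appear in file names).
NOMINAL_TEMPERATURES = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1]


def sample_temperature(temp, rng=random):
    """Draw an effective LLM temperature uniformly from [temp, temp + 0.1].

    `temp` is the nominal temperature; `rng` is any object with a
    `uniform(a, b)` method, such as the `random` module or a
    `random.Random` instance (an assumption made for this sketch).
    """
    return rng.uniform(temp, temp + 0.1)


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so repeated runs give the same draws
    for temp in NOMINAL_TEMPERATURES:
        print(f"nominal={temp:.1f} effective={sample_temperature(temp, rng):.4f}")
```

Passing a seeded `random.Random` instance, as in the usage lines above, makes the draws reproducible.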