https://github.com/stanfordnlp/plot-data
datasets for plotting
- Host: GitHub
- URL: https://github.com/stanfordnlp/plot-data
- Owner: stanfordnlp
- Created: 2017-07-21T23:16:20.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2018-09-23T02:13:54.000Z (almost 7 years ago)
- Language: Jupyter Notebook
- Size: 20.8 MB
- Stars: 6
- Watchers: 12
- Forks: 4
- Open Issues: 3
Metadata Files:
- Readme: README.md
README
# plot-data
This repo contains data for plot formatting actions in VegaLite, which you can browse in the [viewer](http://plot.sidaw.xyz/#/viewer?url=https://raw.githubusercontent.com/stanfordnlp/plot-data/master/data/query.jsonl).
You can find the processed data in the releases. To see the last five lines of processed data, try `jq . plot-data.sample.jsonl`.
### Processed data
The processed data is inside `./data`, which is generated by `make` from the content of `hits`.
URL of processed data: https://raw.githubusercontent.com/stanfordnlp/plot-data/master/data/plot-data.jsonl
* The `contextId` field of `plot-data.jsonl` corresponds to items in `contexts.json`, where 47 different context plots from VegaLite examples are used.
* contexts: https://raw.githubusercontent.com/stanfordnlp/plot-data/master/data/contexts.json
* statistics: https://raw.githubusercontent.com/stanfordnlp/plot-data/master/data/stats.json
* query log: https://raw.githubusercontent.com/stanfordnlp/plot-data/master/data/query.jsonl
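To work with these files programmatically, a minimal Python sketch like the one below should suffice, assuming `contexts.json` is a JSON array indexed by `contextId` (the other fields of each `plot-data.jsonl` record are not described here):

```python
import json
import urllib.request

BASE = "https://raw.githubusercontent.com/stanfordnlp/plot-data/master/data"

# Load the context plots (assumed here to be a JSON array indexed by contextId).
with urllib.request.urlopen(f"{BASE}/contexts.json") as f:
    contexts = json.load(f)

# plot-data.jsonl has one JSON object per line; read the first few records.
records = []
with urllib.request.urlopen(f"{BASE}/plot-data.jsonl") as f:
    for i, line in enumerate(f):
        if i >= 5:
            break
        records.append(json.loads(line))

# Link each record back to its context plot via the contextId field.
for r in records:
    ctx = contexts[r["contextId"]]
    print(r["contextId"], sorted(r.keys()))
```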
### Collecting data
* First, deploy speaker HITs: `python mturk/create_speaker_hit.py --num-hit 10 --num-assignment 5`, optionally `--is-sandbox`
* This creates `hits/timestamp/speaker.HITs.txt` and `speaker.sample_hit`, and deploys the HITs
* Note that assignment IDs are only available once someone has worked on the HIT
* Run `make speaker.assignments` to check whether these are completed
* In `Makefile` set the `SPEAKER_EXEC` variable to correspond to where the server log is located
* Run `make speaker.jsonl` to filter and process the data, and `make speaker.review` to approve or reject HITs
* Restart the server and set `VegaResources.examplesPath` to the previous speaker data; this selects randomly from the specified examples for the listener HITs
* Run `python mturk/create_listener_hit.py hits/SPEAKER_HIT --num-hit 10 --num-assignment 5` optionally `--is-sandbox`
* Wait for these HITs to complete; run `make listener.assignments` to check and `make listener.review` to approve
* Set `LISTENER_EXEC` as well, and run `make speaker.listener.jsonl` to process the data
* Alternatively, wait for both speaker and listener hits to complete, and run `make visualize`
* There seems to be some need to inspect `speaker.status` to make sure there are no incorrect rejections and no new weird spam before deploying the listener HITs (a rough sketch of this review step follows this list). This prevents the process from being fully automated.
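The `make *.assignments` and `make *.review` targets wrap the repo's MTurk scripts; purely as an illustration of what checking and reviewing assignments involves, here is a generic `boto3` sketch (not the repo's actual code; the HIT-id file path follows the pattern above, and the acceptance check is a placeholder):

```python
import boto3

# Generic MTurk client; the repo's own scripts live under mturk/ and are driven by make.
mturk = boto3.client("mturk", region_name="us-east-1")

def check_and_review(hit_ids, looks_ok):
    """List submitted assignments for each HIT and approve those that pass a
    (manual or heuristic) check, roughly what `make speaker.review` automates."""
    for hit_id in hit_ids:
        resp = mturk.list_assignments_for_hit(
            HITId=hit_id, AssignmentStatuses=["Submitted"]
        )
        for a in resp["Assignments"]:
            if looks_ok(a["Answer"]):  # Answer holds the raw answer XML
                mturk.approve_assignment(AssignmentId=a["AssignmentId"])
            else:
                mturk.reject_assignment(
                    AssignmentId=a["AssignmentId"],
                    RequesterFeedback="Response did not follow the instructions.",
                )

# HIT ids are written by create_speaker_hit.py to hits/<timestamp>/speaker.HITs.txt.
with open("hits/TIMESTAMP/speaker.HITs.txt") as f:
    hit_ids = [line.strip() for line in f if line.strip()]
check_and_review(hit_ids, looks_ok=lambda answer_xml: len(answer_xml) > 0)
```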
### Useful commands
```
jq -c 'if .q[0]=="accept" then .q[1] else empty end' speaker.raw.jsonl
```

```
cat data/query.jsonl | jq -c '.q[1].utterance'
```
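The same filters, expressed in Python for convenience (a sketch based only on the line layout implied by the jq commands above: each line carries a `q` array whose first element is the action):

```python
import json

def accepted_queries(path="speaker.raw.jsonl"):
    """Yield q[1] for lines where q[0] == "accept" (the first jq filter above)."""
    with open(path) as f:
        for line in f:
            q = json.loads(line)["q"]
            if q[0] == "accept":
                yield q[1]

def utterances(path="data/query.jsonl"):
    """Yield q[1].utterance for every line (the second jq filter above)."""
    with open(path) as f:
        for line in f:
            q = json.loads(line)["q"]
            yield q[1].get("utterance")
```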
#### Generating splits
Use `split_data.py` to split the data into train/test (there is no dev split, since all the Turk data is dev data):
```
python split_data.py randomWithNoCanon.jsonl randomWithNoCanon_splitIndep   # split each example separately
python split_data.py -s randomWithNoCanon.jsonl randomWithNoCanon_splitSess  # split by sessionId == MTurk ID
```
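For reference, here is a rough sketch of what the session-level split does, assuming each example carries a `sessionId` field; `split_data.py` is the authoritative implementation, and the output names and test fraction below are made up:

```python
import json
import random

def split_by_session(in_path, out_prefix, test_frac=0.2, seed=0):
    """Keep all examples from one sessionId (MTurk ID) in the same split,
    so no worker appears in both train and test."""
    examples = [json.loads(line) for line in open(in_path)]
    sessions = sorted({ex["sessionId"] for ex in examples})
    random.Random(seed).shuffle(sessions)
    test_sessions = set(sessions[: int(len(sessions) * test_frac)])
    with open(f"{out_prefix}.train.jsonl", "w") as tr, \
         open(f"{out_prefix}.test.jsonl", "w") as te:
        for ex in examples:
            out = te if ex["sessionId"] in test_sessions else tr
            out.write(json.dumps(ex) + "\n")

split_by_session("randomWithNoCanon.jsonl", "randomWithNoCanon_splitSess")
```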