https://github.com/stanfordnlp/plot-data
datasets for plotting
- Host: GitHub
- URL: https://github.com/stanfordnlp/plot-data
- Owner: stanfordnlp
- Created: 2017-07-21T23:16:20.000Z (almost 8 years ago)
- Default Branch: master
- Last Pushed: 2018-09-23T02:13:54.000Z (almost 7 years ago)
- Language: Jupyter Notebook
- Size: 20.8 MB
- Stars: 6
- Watchers: 12
- Forks: 4
- Open Issues: 3
Metadata Files:
- Readme: README.md
README
# plot-data
This repo contains data for plot formatting actions in VegaLite, which you can browse in the [viewer](http://plot.sidaw.xyz/#/viewer?url=https://raw.githubusercontent.com/stanfordnlp/plot-data/master/data/query.jsonl).
You can find the processed data in the releases. To see the last five lines of processed data, try `jq . plot-data.sample.jsonl`.
### Processed data
The processed data is inside `./data`, which is generated by `make` from the content of `hits`.
URL of processed data: https://raw.githubusercontent.com/stanfordnlp/plot-data/master/data/plot-data.jsonl
* The `contextId` field of `plot-data.jsonl` corresponds to items in `contexts.json`, where 47 different context plots from VegaLite examples are used.
* contexts: https://raw.githubusercontent.com/stanfordnlp/plot-data/master/data/contexts.json
* statistics: https://raw.githubusercontent.com/stanfordnlp/plot-data/master/data/stats.json
* query log: https://raw.githubusercontent.com/stanfordnlp/plot-data/master/data/query.jsonl
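To work with these files programmatically, a minimal Python sketch like the one below should suffice, assuming `contexts.json` is a JSON array indexed by `contextId` (the other fields of each `plot-data.jsonl` record are not described here):

```python
import json
import urllib.request

BASE = "https://raw.githubusercontent.com/stanfordnlp/plot-data/master/data"

# Load the context plots (assumed here to be a JSON array indexed by contextId).
with urllib.request.urlopen(f"{BASE}/contexts.json") as f:
    contexts = json.load(f)

# plot-data.jsonl has one JSON object per line; read the first few records.
records = []
with urllib.request.urlopen(f"{BASE}/plot-data.jsonl") as f:
    for i, line in enumerate(f):
        if i >= 5:
            break
        records.append(json.loads(line))

# Link each record back to its context plot via the contextId field.
for r in records:
    ctx = contexts[r["contextId"]]
    print(r["contextId"], sorted(r.keys()))
```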
### Collecting data
* First, deploy speaker HITs: `python mturk/create_speaker_hit.py --num-hit 10 --num-assignment 5`, optionally `--is-sandbox`
* This creates `hits/timestamp/speaker.HITs.txt` and `speaker.sample_hit`, and deploys the HITs
* Note that assignment IDs are only available once someone has worked on the HIT
* Run `make speaker.assignments` to check whether these are completed
* In `Makefile` set the `SPEAKER_EXEC` variable to correspond to where the server log is located
* Run `make speaker.jsonl` to filter and process the data, and `make speaker.review` to approve or reject HITs
* Restart the server and set `VegaResources.examplesPath` to the previous speaker data; this selects randomly from the specified examples for the listener HITs
* Run `python mturk/create_listener_hit.py hits/SPEAKER_HIT --num-hit 10 --num-assignment 5` optionally `--is-sandbox`
* Wait for these HITs to complete; run `make listener.assignments` to check and `make listener.review` to approve
* Set `LISTENER_EXEC` as well, and run `make speaker.listener.jsonl` to process the data
* Alternatively, wait for both speaker and listener hits to complete, and run `make visualize`
* There seems to be some need to inspect `speaker.status` to make sure there are no incorrect rejections and no new weird spam before deploying the listener HITs (a rough sketch of this review step follows this list). This prevents the process from being fully automated.
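The `make *.assignments` and `make *.review` targets wrap the repo's MTurk scripts; purely as an illustration of what checking and reviewing assignments involves, here is a generic `boto3` sketch (not the repo's actual code; the HIT-id file path follows the pattern above, and the acceptance check is a placeholder):

```python
import boto3

# Generic MTurk client; the repo's own scripts live under mturk/ and are driven by make.
mturk = boto3.client("mturk", region_name="us-east-1")

def check_and_review(hit_ids, looks_ok):
    """List submitted assignments for each HIT and approve those that pass a
    (manual or heuristic) check, roughly what `make speaker.review` automates."""
    for hit_id in hit_ids:
        resp = mturk.list_assignments_for_hit(
            HITId=hit_id, AssignmentStatuses=["Submitted"]
        )
        for a in resp["Assignments"]:
            if looks_ok(a["Answer"]):  # Answer holds the raw answer XML
                mturk.approve_assignment(AssignmentId=a["AssignmentId"])
            else:
                mturk.reject_assignment(
                    AssignmentId=a["AssignmentId"],
                    RequesterFeedback="Response did not follow the instructions.",
                )

# HIT ids are written by create_speaker_hit.py to hits/<timestamp>/speaker.HITs.txt.
with open("hits/TIMESTAMP/speaker.HITs.txt") as f:
    hit_ids = [line.strip() for line in f if line.strip()]
check_and_review(hit_ids, looks_ok=lambda answer_xml: len(answer_xml) > 0)
```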
### Useful commands
```
jq -c 'if .q[0]=="accept" then .q[1] else empty end' speaker.raw.jsonl
```

```
cat data/query.jsonl | jq -c '.q[1].utterance'
```
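The same filters, expressed in Python for convenience (a sketch based only on the line layout implied by the jq commands above: each line carries a `q` array whose first element is the action):

```python
import json

def accepted_queries(path="speaker.raw.jsonl"):
    """Yield q[1] for lines where q[0] == "accept" (the first jq filter above)."""
    with open(path) as f:
        for line in f:
            q = json.loads(line)["q"]
            if q[0] == "accept":
                yield q[1]

def utterances(path="data/query.jsonl"):
    """Yield q[1].utterance for every line (the second jq filter above)."""
    with open(path) as f:
        for line in f:
            q = json.loads(line)["q"]
            yield q[1].get("utterance")
```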
#### Generating splits
Use `split_data.py` to split the data into train/test (there is no dev split, since all the Turk data is dev data):
```
python split_data.py randomWithNoCanon.jsonl randomWithNoCanon_splitIndep   # split each example separately
python split_data.py -s randomWithNoCanon.jsonl randomWithNoCanon_splitSess  # split by sessionId == MTurk ID
```
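For reference, here is a rough sketch of what the session-level split does, assuming each example carries a `sessionId` field; `split_data.py` is the authoritative implementation, and the output names and test fraction below are made up:

```python
import json
import random

def split_by_session(in_path, out_prefix, test_frac=0.2, seed=0):
    """Keep all examples from one sessionId (MTurk ID) in the same split,
    so no worker appears in both train and test."""
    examples = [json.loads(line) for line in open(in_path)]
    sessions = sorted({ex["sessionId"] for ex in examples})
    random.Random(seed).shuffle(sessions)
    test_sessions = set(sessions[: int(len(sessions) * test_frac)])
    with open(f"{out_prefix}.train.jsonl", "w") as tr, \
         open(f"{out_prefix}.test.jsonl", "w") as te:
        for ex in examples:
            out = te if ex["sessionId"] in test_sessions else tr
            out.write(json.dumps(ex) + "\n")

split_by_session("randomWithNoCanon.jsonl", "randomWithNoCanon_splitSess")
```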