https://github.com/ipyflow/nbsafety-experiments
- Host: GitHub
- URL: https://github.com/ipyflow/nbsafety-experiments
- Owner: ipyflow
- Created: 2020-07-03T05:47:11.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2021-05-11T00:46:47.000Z (about 4 years ago)
- Last Synced: 2025-01-01T12:28:03.162Z (6 months ago)
- Language: Python
- Size: 146 KB
- Stars: 2
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Readme
This repository has two main pieces of functionality: scraping and replaying. You
don’t need to deal with scraping if you use the already-scraped
[traces.sqlite](https://drive.google.com/file/d/1Yf_17-SESPVdyawfNbxF54VIbe4om2eX/view?usp=sharing)
SQLite database. Beyond that, there are two scripts worth
noting: one that replays a single notebook session, and one that replays
a whole set of sessions after filtering based on some criteria.

# Replaying a single session
`replay-session.py` replays a single notebook session (given `trace_id` and
`session_id`, basically ids for the repository and per-repository session),
handling things like timeouts, figuring out packages that need installation,
converting Python 2 to Python 3 using the `2to3` tool,
etc. It also counts the number of exceptions that occurred during replay;
it is probably worth filtering out sessions where more than ~5-10% of the cell
executions raise an exception. There’s also a bunch of ancillary stuff in there
that’s specific to nbsafety, like counting how often the user picks a stale
cell for re-execution or a refresher cell; if just using the replay functionality
and not replicating nbsafety results, this can just be deleted. Note that the script assumes
that the tables `replay_stats` and `replay_exception_stats` exist in the `traces.sqlite`
database; their schemas must be created manually, and the PyCharm sqlite connector is
pretty good for this.

# Replaying all sessions satisfying filtering criteria
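One filter that is easy to apply once replay statistics are available is the exception-rate cutoff suggested above for single-session replay: drop any session where more than roughly 5-10% of cell executions raised an exception. A minimal sketch follows; the function names and the 10% default are illustrative choices, not something taken from `replay-session.py`.

```python
def exception_rate(num_exceptions: int, num_cell_executions: int) -> float:
    """Fraction of replayed cell executions that raised an exception."""
    if num_cell_executions <= 0:
        # Nothing was executed, so treat the session as fully failed.
        return 1.0
    return num_exceptions / num_cell_executions


def keep_session(num_exceptions: int, num_cell_executions: int,
                 max_rate: float = 0.10) -> bool:
    """Return True if the session's exception rate is at or below max_rate."""
    return exception_rate(num_exceptions, num_cell_executions) <= max_rate


if __name__ == "__main__":
    print(keep_session(num_exceptions=3, num_cell_executions=100))   # True (3%)
    print(keep_session(num_exceptions=20, num_cell_executions=100))  # False (20%)
```

Lowering `max_rate` to `0.05` gives the stricter end of the suggested range.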
`run-replay-experiments.py` runs every session through a set of filters
and replays the sessions that pass. A bunch of the filtering criteria
were specified manually after nonsensical sessions showed up in replays.
The script also accepts a `--version` argument: if you specify a version that was
used in an earlier run, it skips sessions that were already replayed; if you specify
a new version, it starts from scratch. There are also some nbsafety-specific parameters:
- `--naive-refresher-computation`: a baseline used in the paper.
- `--forward-only-propagation`: used to measure the utility of highlights where new
  ones are only created in later cells (spatially relative to the currently
  executed one) instead of both earlier and later cells.
- `--no-nbsafety`: used to determine how much faster non-nbsafety replay was (to
  see what nbsafety overhead was like).
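The `--version` behavior described above amounts to checking, for each candidate session, whether a result for that version has already been recorded. The sketch below shows one way this could work against a `replay_stats` table. The column names (`trace_id`, `session_id`, `version`), their types, and the helper functions are assumptions made for illustration; the real table needs whatever columns the scripts actually write.

```python
import sqlite3
from typing import Iterable, List, Tuple


def ensure_replay_stats(conn: sqlite3.Connection) -> None:
    """Create a hypothetical, minimal replay_stats table if it is missing."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS replay_stats (
            trace_id   INTEGER NOT NULL,
            session_id INTEGER NOT NULL,
            version    INTEGER NOT NULL,
            PRIMARY KEY (trace_id, session_id, version)
        )
        """
    )


def record_replay(conn: sqlite3.Connection, trace_id: int,
                  session_id: int, version: int) -> None:
    """Mark a session as replayed under the given version."""
    conn.execute(
        "INSERT OR IGNORE INTO replay_stats (trace_id, session_id, version) "
        "VALUES (?, ?, ?)",
        (trace_id, session_id, version),
    )
    conn.commit()


def sessions_to_replay(conn: sqlite3.Connection,
                       candidates: Iterable[Tuple[int, int]],
                       version: int) -> List[Tuple[int, int]]:
    """Keep only the (trace_id, session_id) pairs not yet replayed for version."""
    done = set(conn.execute(
        "SELECT trace_id, session_id FROM replay_stats WHERE version = ?",
        (version,),
    ))
    return [pair for pair in candidates if pair not in done]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stand-in for traces.sqlite
    ensure_replay_stats(conn)
    record_replay(conn, trace_id=1, session_id=0, version=1)
    candidates = [(1, 0), (1, 1)]
    print(sessions_to_replay(conn, candidates, version=1))  # [(1, 1)]
    print(sessions_to_replay(conn, candidates, version=2))  # [(1, 0), (1, 1)]
```

Keying the table on the version is what makes a new version start from scratch: no rows match, so every candidate is replayed again.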
When replaying these sessions, it is probably a good idea to do so in a chrooted environment
or Docker container, since the sessions are untrusted code that sometimes does some fairly strange things.
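As a rough illustration of the container approach, the sketch below builds a `docker run` command that mounts a checkout of this repository and runs one of the scripts inside it. The image name, the `/work` mount point, the memory limit, and both helper functions are assumptions; the scripts' own command-line arguments are deliberately left to the caller rather than guessed here.

```python
import subprocess
from pathlib import Path
from typing import List, Sequence


def build_docker_command(
    repo_dir: str,
    script_args: Sequence[str],
    image: str = "python:3.8-slim",  # assumed image; pick one that suits the sessions
    network: str = "none",           # no network access by default
) -> List[str]:
    """Build a `docker run` invocation that runs a script from repo_dir."""
    host_dir = Path(repo_dir).resolve()  # bind mounts need an absolute path
    return [
        "docker", "run", "--rm",
        "--network", network,
        "--memory", "2g",            # assumed cap on container memory
        "-v", f"{host_dir}:/work",
        "-w", "/work",
        image,
        "python", *script_args,
    ]


def run_in_container(command: Sequence[str], timeout_s: float = 3600.0) -> int:
    """Run the command and return its exit code.

    Raises subprocess.TimeoutExpired if the docker client does not exit in
    time; the container itself may then need separate cleanup.
    """
    completed = subprocess.run(list(command), timeout=timeout_s, check=False)
    return completed.returncode


if __name__ == "__main__":
    # Append the script's own arguments after its name.
    cmd = build_docker_command(".", ["replay-session.py"])
    print(" ".join(cmd))
```

Since replay may need to install packages, `network="none"` can be too strict in practice; passing `network="bridge"` restores outbound access at the cost of weaker isolation.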