Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/bataeves/isparkcache
A Jupyter module for caching Spark DataFrames produced by cell execution
Last synced: 7 days ago
JSON representation
- Host: GitHub
- URL: https://github.com/bataeves/isparkcache
- Owner: bataeves
- License: gpl-3.0
- Created: 2017-06-29T14:44:12.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2017-08-09T08:29:56.000Z (over 7 years ago)
- Last Synced: 2025-01-14T15:00:13.149Z (29 days ago)
- Topics: cache, ipython, jupyter, pyspark, spark
- Language: Python
- Size: 28.3 KB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
Defines a **%%sparkcache** cell magic in the IPython notebook that caches
DataFrames and the outputs of long-running computations in persistent
Parquet files on Hadoop. Useful when some computations in a notebook are
slow and you want to save their results easily. Based on the
[ipycache](https://github.com/rossant/ipycache) module.
Installation
------------

``pip install isparkcache``
Usage
-----

In IPython/Jupyter:
%load_ext isparkcache
- Then, create a cell with:
%%sparkcache df1 df2
df = ...
df1 = sql.createDataFrame(df)
df2 = sql.createDataFrame(df)

- When you execute this cell the first time, the code runs and the
  dataframes ``df1`` and ``df2`` are saved as Parquet under
  ``/user/$USER/sparkcache/mysparkapplication/df1`` and
  ``/user/$USER/sparkcache/mysparkapplication/df2``.
  When you execute the cell again, the code is skipped, the dataframes are
  loaded from the Parquet files and injected into the namespace, and the
  outputs are restored in the notebook.
- Use the ``--force`` or ``-f`` option to force the cell's execution and
  overwrite the existing cache.
- Use the ``--read`` or ``-r`` option to prevent the cell's execution and
  always load the variables from the cache. An exception is raised if the
  cache does not exist.
- Use the ``--cachedir`` or ``-d`` option to specify the cache directory
  (see the examples after this list). Default directory:
  ``/user/$USER/sparkcache``.
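For illustration, here is how these options might look on the magic line.
This is a sketch: placing options before the variable names is assumed here
from the convention of the ipycache module that isparkcache is based on,
and ``/user/alice/adhoc`` is a hypothetical path:

%%sparkcache --force df1 df2
%%sparkcache --read df1 df2
%%sparkcache --cachedir /user/alice/adhoc df1 df2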
You can specify a default cache directory in the IPython configuration
file in your profile (typically
``~/.ipython/profile_default/ipython_config.py``) by adding the following
line:

c.SparkCacheMagics.cachedir = "/path/to/mycache"
If both a default cache directory and the ``--cachedir`` option are
given, the latter is used.
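For example (a sketch; both paths below are illustrative), with this
default set in the profile:

c.SparkCacheMagics.cachedir = "/user/alice/defaultcache"

a cell can still redirect its own cache, and the option wins over the
configured default:

%%sparkcache --cachedir /user/alice/adhoc df1
df1 = sql.createDataFrame(df)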