Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/bataeves/isparkcache
A Jupyter module for caching Spark DataFrames produced by cell execution
Last synced: 7 days ago
JSON representation
- Host: GitHub
- URL: https://github.com/bataeves/isparkcache
- Owner: bataeves
- License: gpl-3.0
- Created: 2017-06-29T14:44:12.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2017-08-09T08:29:56.000Z (over 7 years ago)
- Last Synced: 2025-01-14T15:00:13.149Z (29 days ago)
- Topics: cache, ipython, jupyter, pyspark, spark
- Language: Python
- Size: 28.3 KB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
Defines a **%%sparkcache** cell magic in the IPython notebook that caches
DataFrames and the outputs of long-running computations in persistent
Parquet files on Hadoop. Useful when some computations in a notebook are
slow and you want to save their results easily. Based on the
[ipycache](https://github.com/rossant/ipycache) module.
Installation
------------

``pip install isparkcache``
Usage
-----

In IPython/Jupyter:
%load_ext isparkcache
- Then, create a cell with:
%%sparkcache df1 df2
df = ...
df1 = sql.createDataFrame(df)
df2 = sql.createDataFrame(df)

- When you execute this cell the first time, the code runs and the
  dataframes ``df1`` and ``df2`` are saved as Parquet under
  ``/user/$USER/sparkcache/mysparkapplication/df1`` and
  ``/user/$USER/sparkcache/mysparkapplication/df2``.
  When you execute the cell again, the code is skipped, the dataframes are
  loaded from the Parquet files and injected into the namespace, and the
  outputs are restored in the notebook.
- Use the ``--force`` or ``-f`` option to force the cell's execution and
  overwrite the existing cache.
- Use the ``--read`` or ``-r`` option to prevent the cell's execution and
  always load the variables from the cache. An exception is raised if the
  cache does not exist.
- Use the ``--cachedir`` or ``-d`` option to specify the cache directory
  (see the examples after this list). Default directory:
  ``/user/$USER/sparkcache``.
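For illustration, here is how these options might look on the magic line.
This is a sketch: placing options before the variable names is assumed here
from the convention of the ipycache module that isparkcache is based on,
and ``/user/alice/adhoc`` is a hypothetical path:

%%sparkcache --force df1 df2
%%sparkcache --read df1 df2
%%sparkcache --cachedir /user/alice/adhoc df1 df2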
You can specify a default cache directory in the IPython configuration
file in your profile (typically
``~/.ipython/profile_default/ipython_config.py``) by adding the following
line:

c.SparkCacheMagics.cachedir = "/path/to/mycache"
If both a default cache directory and the ``--cachedir`` option are
given, the latter is used.
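For example (a sketch; both paths below are illustrative), with this
default set in the profile:

c.SparkCacheMagics.cachedir = "/user/alice/defaultcache"

a cell can still redirect its own cache, and the option wins over the
configured default:

%%sparkcache --cachedir /user/alice/adhoc df1
df1 = sql.createDataFrame(df)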