{"id":14977400,"url":"https://github.com/natbusa/datafaucet","last_synced_at":"2025-10-28T03:32:09.908Z","repository":{"id":62566613,"uuid":"89775431","full_name":"natbusa/datafaucet","owner":"natbusa","description":"Productivity Utilities for Data Science with Python Notebooks","archived":false,"fork":false,"pushed_at":"2020-02-05T08:47:12.000Z","size":13746,"stargazers_count":6,"open_issues_count":2,"forks_count":3,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-02-06T21:42:03.880Z","etag":null,"topics":["framework","jupyter","package","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/natbusa.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-04-29T09:22:16.000Z","updated_at":"2025-01-11T11:46:40.000Z","dependencies_parsed_at":"2022-11-03T16:15:57.580Z","dependency_job_id":null,"html_url":"https://github.com/natbusa/datafaucet","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/natbusa%2Fdatafaucet","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/natbusa%2Fdatafaucet/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/natbusa%2Fdatafaucet/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/natbusa%2Fdatafaucet/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/natbusa","download_url":"https://codeload.github.com/natbusa/datafaucet/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://gith
ub.com","kind":"github","repositories_count":238400780,"owners_count":19465704,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["framework","jupyter","package","python"],"created_at":"2024-09-24T13:55:35.586Z","updated_at":"2025-10-28T03:32:09.471Z","avatar_url":"https://github.com/natbusa.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n# Datafaucet\n\nBasic example and directory structure.\n\n## Elements\n\nThis ETL/Data Science scaffolding works using a number of elements:\n\n  - The introductory python notebook you are reading now (main.ipynb)\n  - A directory structure for code and data processing (data)\n  - The datafaucet python package (datafaucet)\n  - Configuration files (metadata.yml, \\__main__.py, Makefile)\n\n## Principles ##\n\n- ** Both notebooks and code are first-class citizens **\n\nAll source code lives in the source directory `src`. In particular, both notebooks and code files are treated as source files. Source code is further partitioned and scaffolded in several directories to simplify and organize the data science project. Following python package conventions, the root of the project is tagged by a `__main__.py` file, and each directory contains an `__init__.py` file. By doing so, python and notebook files can reference each other.\n\nPython notebooks and Python code can be mixed and matched, and are interoperable with each other. You can import functions from a notebook into python code, and you can include python files in a notebook. 
\n\n- ** Data Directories should not contain logic code **\n\nData can be located anywhere: on remote HDFS clusters, on object store services exposed via the S3 protocol, or on the local file system. For illustration purposes, this demo uses a local directory for data scaffolding. \n\nData and code are separated by moving all configuration to metadata files. Metadata files make it possible to define aliases for data resources, data services and Spark configurations, which keeps the ETL and ML code tidy, with no hardcoded parameters.\n\n- ** Decouple Code from Configuration **\n\nCode, whether stored as notebooks or as python files, should be decoupled from both engine configurations and data locations. All configuration is kept in `metadata.yml` YAML files. Multiple setups for test, exploration and production can be described in the same `metadata.yml` file or in separate files using __profiles__. All profiles inherit from a default profile, to reduce duplication of configuration settings across profiles.\n\n- ** Declarative Configuration **\n\nMetadata files are responsible for binding data and engine configurations to the code. For instance, all data in the code should be referenced by an alias, and storage and retrieval of data objects and files should happen via a common API. The metadata YAML file describes the providers for each data source as well as the mapping of data aliases to their corresponding data objects. \n\n\n\n## Project Template\n\nThe data science project is structured to facilitate the deployment of artifacts and to switch from batch processing to live experimentation. 
The top level project is composed of the following items:\n\n### Top level Structure\n\n```\n├── binder\n├── ci\n├── data\n├── resources\n├── src\n├── test\n│\n├── main.ipynb\n├── versions.ipynb\n│\n├── __main__.py\n├── metadata.yml\n│\n└── Makefile\n\n```\n\n## datafaucet\n\n\n```python\nimport datafaucet as dfc\n```\n\n### Package things\nThe package version is exposed by the package variables `version_info` and `__version__`.\n\n\n```python\ndfc.version_info\n```\n\n\n\n\n    (0, 7, 1)\n\n\n\n\n```python\ndfc.__version__\n```\n\n\n\n\n    '0.7.1'\n\n\n\nCheck whether datafaucet is loaded in the current python context:\n\n\n```python\ntry:\n    __DATALOOF__\n    print(\"the datafaucet is loaded\")\nexcept NameError:\n    print(\"the datafaucet is not loaded\")\n```\n\n    the datafaucet is loaded\n\n\n\n```python\n#list of modules loaded as `from datafaucet import * ` \ndfc.__all__\n```\n\n\n\n\n    ['logging', 'project']\n\n\n\n### Modules: project\n\nThe project module is all about setting the correct working directories in which to run and find your notebooks, python files and configuration files. When datafaucet is imported, it starts by searching for a `__main__.py` file, according to python module file naming conventions. All modules and alias paths are relative to this project root path.\n\n#### Load a project profile\n\nLoading the profile can be done with the `datafaucet.project.load` function call. It will look for files ending with `metadata.yml`. The function can optionally set the current working directory and import the key=value pairs of a .env file into the python os environment. 
if no parameters are specified, the default profile is loaded.\n\n\n```python\nhelp(dfc.project.load)\n```\n\n    Help on function load in module datafaucet.project:\n    \n    load(profile='default', rootdir_path=None, search_parent_dirs=True, dotenv=True, factory_defaults=True)\n        Performs the following steps:\n            - set rootdir for the given project\n            - perform .env env variable exporting,\n            - load the given `profile` from the metadata files,\n            - setup and start the data engine\n        \n        :param profile: load the given metadata profile (default: 'default')\n        :param rootdir_path: root directory for loaded project (default: current working directory)\n        :param search_parent_dirs: search parent dirs to detect rootdir by looking for a '__main__.py' or 'main.ipynb' file (default: True)\n        :param factory_defaults: add preset default configuration. project provided metadata file can override this default values (default: True)\n        :param dotenv: load variable from a dotenv file, if the file exists and readable (default 'True' looks for the file \u003crootdir\u003e/.env)\n        :return: None\n        \n        Note that:\n        \n        1)  Metadata files are merged up, so you can split the information in multiple files as long as they end with `metadata.yml`\n            For example: `metadata.yml`, `abc.metadata.yaml`, `abc_metadata.yml` are all valid metadata file names.\n        \n        2)  All metadata files in all subdirectories from the project root directory are loaded,\n            unless the directory contains a file `metadata.ignore.yml`\n        \n        3)  Metadata files can provide multiple profile configurations,\n            by separating each profile configuration with a Document Marker ( a line with `---`)\n            (see https://yaml.org/spec/1.2/spec.html#YAML)\n        \n        4)  Each metadata profile, can be broken down in multiple yaml files,\n            
When loading the files all configuration belonging to the same profile with be merged.\n        \n        5)  All metadata profiles inherit the settings from profile 'default'\n        \n        6)  If `factory_defaults` is set to true, \n            the provided metadata profiles will inherits from a factory defaults file set as:\n             ```\n                %YAML 1.2\n                ---\n                profile: default\n                variables: {}\n                engine:\n                    type: spark\n                    master: local[*]\n                providers: {}\n                resources: {}\n                loggers:\n                    root:\n                        severity: info\n                    datafaucet:\n                        name: dfc\n                        stream:\n                            enable: true\n                            severity: notice\n                ---\n                profile: prod\n                ---\n                profile: stage\n                ---\n                profile: test\n                ---\n                profile: dev\n        \n             ```\n        \n        Metadata files are composed of 6 sections:\n            - profile\n            - variables\n            - providers\n            - resources\n            - engine\n            - loggers\n        \n        For more information about metadata configuration,\n        type `help(datafaucet.project.metadata)`\n    \n\n\n\n```python\ndfc.project.load()\n```\n\n### Metadata profiles\n\n#### Metadata files\n\n     1) Metadata files are merged up, so you can split the information in multiple files as long as they end with `metadata.yml`\n        For example: `metadata.yml`, `abc.metadata.yaml`, `abc_metadata.yml` are all valid metadata file names.\n    \n     2) All metadata files in all subdirectories from the project root directory are loaded,\n        unless the directory contains a file `metadata.ignore.yml`\n    \n     3) Metadata 
files can provide multiple profile configurations, \n        by separating each profile configuration with a Document Marker ( a line with `---`) \n        (see https://yaml.org/spec/1.2/spec.html#YAML)\n     \n     4) Each metadata profile, can be broken down in multiple yaml files,\n        When loading the files all configuration belonging to the same profile with be merged. \n     \n     5) All metadata profiles inherit the settings from profile 'default'\n     \n     6) An empty metadata profile inherits from a factory default set as:\n         \"\"\"\n            profile: default\n            variables: {}\n            engine:\n                type: spark\n                master: local[*]\n            providers: {}\n            resources: {}\n            loggers:\n                root:\n                    severity: info\n                datafaucet:\n                    name: dfc\n                    stream:\n                        enable: true\n                        severity: notice\n         \"\"\"\n    \n\n     Metadata files are composed of 6 sections:\n         - profile \n         - variables\n         - providers\n         - resources\n         - engine\n         - loggers\n\n - jinja templates for variable substitution \n - environment variables as jinja template function __env('MY_ENV_VARIABLE', my_default_value)__\n - current timestamp as a jinja template function __now(timezone='UTC', format='%Y-%m-%d %H:%M:%S')__\n - multiple profiles, inheriting from the __default__ profile\n\n\n```python\nmd = dfc.project.metadata()\nmd\n```\n\n\n\n\n    profile: default\n    variables:\n        my_concat_var: hello spark running at (local[*])\n        my_env_var: guest\n        my_nested_var: 'hello spark running at (local[*]): the current date is 2019-03-25'\n        my_date_var: '2019-03-25'\n        my_string_var: hello\n    engine:\n        type: spark\n        master: local[*]\n    providers:\n        localfs:\n            service: file\n            
format: csv\n            path: data\n    resources:\n        ascombe:\n            path: ascombe.csv\n            provider: localfs\n        correlation:\n            path: correlation.csv\n            provider: localfs\n    loggers:\n        root:\n            severity: info\n        datafaucet:\n            name: dfc\n            stream:\n                enable: true\n                severity: notice\n            kafka:\n                enable: false\n                severity: info\n                topic: dfc\n                hosts: kafka-node1:9092 kafka-node2:9092\n\n\n\n## Inspect current project configuration\n\nYou can inspect the current project configuration by calling the `datafaucet.project.config` function.\n\n\n```python\nhelp(dfc.project.config)\n```\n\n    Help on function config in module datafaucet.project:\n    \n    config()\n        Returns the current project configuration\n        :return: a dictionary with project configuration data\n    \n\n\n#### Project configuration\n\nThe currently loaded project configuration can be inspected with a `datafaucet.project.config()` function call. 
\nThe following information is available in the returned dictionary:\n\n| key                      | explanation                                                                                 | example value                                     |\n| :---                     | :----                                                                                            |--------------------------------------------------:|\n| python_version           | version of python running the current script                                                | 3.6.7                                             |\n| session_id               | session unique for this particular script run                                               | 0xf3df202e4c6f11e9                                |\n| profile                  | name of the metadata profile loaded                                                         | default                                           |\n| filename                 | name of the current script (works both for .py and ipynb files )                            | main.ipynb                                        |\n| rootdir                  | The root directory of the project (marked by an empty __main__.py or __main__.ipynb file)   | /home/jovyan/work/basic                           |\n| workdir                  | The current working directory                                                               | /home/jovyan/work/basic                           |\n| username                 | User running the script                                                                     | jovyan                                            |\n| repository               | Information about the current git repository (if available)                                 |                                                   |\n| repository.type          | The type of revision system (supports only git currently)                                   | git                                     
          |\n| repository.committer     | Last committer on this repository                                                           | Natalino Busa                                     |\n| repository.hash          | Last commit short hash (only 7 chars)                                                       | 5e43848                                           |\n| repository.commit        | Last commit full hash                                                                       | 5e4384853398941f4b52cb4102145ee98bdeafa6          |\n| repository.branch        | Repository branch name                                                                      | master                                            |\n| repository.url           | URL of the repository                                                                       | https://github.com/natbusa/datafaucet.git   |\n| repository.name          | Repository name                                                                             | datafaucet.git                              |\n| repository.date          | Date of last commit                                                                         | 2019-03-22T04:21:07+00:00                         |\n| repository.clean         | Repository contains no modified files with respect to committed data                        | False                                             |\n| files                    | python, notebooks and configuration files in this project                                   |                                                   |\n| files.notebooks          | notebook files (*.ipynb) in all subdirectories starting from rootdir                        | main.ipynb                                        |\n| files.python             | python files (*.py) in all subdirectories starting from rootdir                             | __main__.py                                       |\n| files.metadata           | metadata files (*.yml) in all 
subdirectories starting from rootdir                          | metadata.yml                                      |\n| files.dotenv             | filename with variables definitions, unix style                                             | .env                                              |\n| engine                   | data engine configuration                                                                   |                                                   |\n| engine.type              | data engine type                                                                            | spark                                             |\n| engine.name              | name (generated using git repo name and metadata profile)                                   | default                                           |\n| engine.version           | engine version                                                                              | 2.4.0                                             |\n| engine.config            | engine configuration (key-values list)                                                      | spark.master: local[*]                            |\n| engine.env               | engine environment variables (key-values list)                                              | PYSPARK_SUBMIT_ARGS: ' pyspark-shell'             |\n| engine.rootdir           | engine rootdir (same as above)                                                              | /home/jovyan/work/basic                           |\n| engine.timezone          | engine timezone configuration                                                               | UTC                                               |\n\n\n\n```python\ndfc.project.config()\n```\n\n\n\n\n    dfc_version: 0.7.1\n    python_version: 3.6.7\n    session_id: '0x8aa0374e4ee811e9'\n    profile: default\n    filename: main.ipynb\n    rootdir: /home/jovyan/work/basic\n    workdir: /home/jovyan/work/basic\n    username: jovyan\n    
repository:\n        type:\n        committer: ''\n        hash: 0\n        commit: 0\n        branch: ''\n        url: ''\n        name: ''\n        date: ''\n        clean: false\n    files:\n        notebooks:\n          - main.ipynb\n          - versions.ipynb\n          - hello.ipynb\n        python:\n          - __main__.py\n        metadata:\n          - metadata.yml\n        dotenv: .env\n    engine:\n        type: spark\n        name: default\n        version: 2.4.0\n        config:\n            spark.driver.port: '46739'\n            spark.rdd.compress: 'True'\n            spark.app.name: default\n            spark.serializer.objectStreamReset: '100'\n            spark.master: local[*]\n            spark.executor.id: driver\n            spark.submit.deployMode: client\n            spark.app.id: local-1553509627579\n            spark.ui.showConsoleProgress: 'true'\n            spark.driver.host: 36594ccded11\n        env:\n            PYSPARK_SUBMIT_ARGS: ' pyspark-shell'\n        rootdir: /home/jovyan/work/basic\n        timezone:\n\n\n\nData resources are relative to the `rootpath`. \n\n### Resources\n\nData binding works with the metadata files. 
It is good practice to declare the actual binding in the metadata and to avoid hardcoding paths in the notebooks and python source files.\n\n\n```python\nmd = dfc.project.resource('ascombe')\nmd\n```\n\n\n\n\n    {'hash': '0x80a539b9fc17d1c4',\n     'url': '/home/jovyan/work/basic/data/ascombe.csv',\n     'service': 'file',\n     'format': 'csv',\n     'host': '127.0.0.1',\n     'port': None,\n     'driver': None,\n     'database': None,\n     'username': None,\n     'password': None,\n     'resource_path': 'ascombe.csv',\n     'provider_path': '/home/jovyan/work/basic/data',\n     'provider_alias': 'localfs',\n     'resource_alias': 'ascombe',\n     'cache': None,\n     'date_column': None,\n     'date_start': None,\n     'date_end': None,\n     'date_window': None,\n     'date_partition': None,\n     'update_column': None,\n     'hash_column': None,\n     'state_column': None,\n     'options': {},\n     'mapping': {}}\n\n\n\n### Modules: Engines\n\nThis submodule allows you to start a context from the configuration described in the metadata. It also provides basic load/store data functions that work with the aliases defined in the configuration.\n\nLet's start by listing the aliases and the configuration of the engines declared in `metadata.yml`.\n\n\n__Context: Spark__  \nLet's start the engine session by selecting a spark context from the list. 
You can have many spark contexts declared, for instance for single node \n\n\n```python\nimport datafaucet as dfc\nengine = dfc.project.engine()\nengine.config()\n```\n\n\n\n\n    type: spark\n    name: default\n    version: 2.4.0\n    config:\n        spark.driver.port: '46739'\n        spark.rdd.compress: 'True'\n        spark.app.name: default\n        spark.serializer.objectStreamReset: '100'\n        spark.master: local[*]\n        spark.executor.id: driver\n        spark.submit.deployMode: client\n        spark.app.id: local-1553509627579\n        spark.ui.showConsoleProgress: 'true'\n        spark.driver.host: 36594ccded11\n    env:\n        PYSPARK_SUBMIT_ARGS: ' pyspark-shell'\n    rootdir: /home/jovyan/work/basic\n    timezone:\n\n\n\nYou can quickly inspect the properties of the context by calling the `info()` function.\n\nBy calling the `context` method, you access the Spark SQL Context directly. The rest of your spark python code is not affected by the initialization of your session with datafaucet.\n\n\n```python\nengine = dfc.project.engine()\nspark = engine.context()\n```\n\nLet's read the csv data again, this time using the spark context. 
First using the engine `load` utility with the resource alias, then passing it the resource metadata returned by `dfc.project.resource`.\n\n\n```python\n#read using the engine utility (directly using the load function)\ndf = engine.load('ascombe', header=True, inferSchema=True)\n```\n\n\n```python\n#read using the engine utility (also from resource metadata)\nmd = dfc.project.resource('ascombe')\ndf = engine.load(md, header=True, inferSchema=True)\n```\n\n\n```python\ndf.printSchema()\n```\n\n    root\n     |-- idx: integer (nullable = true)\n     |-- Ix: double (nullable = true)\n     |-- Iy: double (nullable = true)\n     |-- IIx: double (nullable = true)\n     |-- IIy: double (nullable = true)\n     |-- IIIx: double (nullable = true)\n     |-- IIIy: double (nullable = true)\n     |-- IVx: double (nullable = true)\n     |-- IVy: double (nullable = true)\n    \n\n\n\n```python\ndf.show()\n```\n\n    +---+----+-----+----+----+----+-----+----+----+\n    |idx|  Ix|   Iy| IIx| IIy|IIIx| IIIy| IVx| IVy|\n    +---+----+-----+----+----+----+-----+----+----+\n    |  0|10.0| 8.04|10.0|9.14|10.0| 7.46| 8.0|6.58|\n    |  1| 8.0| 6.95| 8.0|8.14| 8.0| 6.77| 8.0|5.76|\n    |  2|13.0| 7.58|13.0|8.74|13.0|12.74| 8.0|7.71|\n    |  3| 9.0| 8.81| 9.0|8.77| 9.0| 7.11| 8.0|8.84|\n    |  4|11.0| 8.33|11.0|9.26|11.0| 7.81| 8.0|8.47|\n    |  5|14.0| 9.96|14.0| 8.1|14.0| 8.84| 8.0|7.04|\n    |  6| 6.0| 7.24| 6.0|6.13| 6.0| 6.08| 8.0|5.25|\n    |  7| 4.0| 4.26| 4.0| 3.1| 4.0| 5.39|19.0|12.5|\n    |  8|12.0|10.84|12.0|9.13|12.0| 8.15| 8.0|5.56|\n    |  9| 7.0| 4.82| 7.0|7.26| 7.0| 6.42| 8.0|7.91|\n    | 10| 5.0| 5.68| 5.0|4.74| 5.0| 5.73| 8.0|6.89|\n    +---+----+-----+----+----+----+-----+----+----+\n    \n\n\nFinally, let's calculate the correlation between the `x` and `y` columns for each of the sets I, II, III and IV, and save the result in a separate dataset.\n\n\n```python\nfrom pyspark.ml.feature import VectorAssembler\n\nfor s in ['I', 'II', 'III', 'IV']:\n    va = 
VectorAssembler(inputCols=[s+'x', s+'y'], outputCol=s)\n    df = va.transform(df)\n    df = df.drop(s+'x', s+'y')\n    \ndf.show()\n```\n\n    +---+------------+-----------+------------+-----------+\n    |idx|           I|         II|         III|         IV|\n    +---+------------+-----------+------------+-----------+\n    |  0| [10.0,8.04]|[10.0,9.14]| [10.0,7.46]| [8.0,6.58]|\n    |  1|  [8.0,6.95]| [8.0,8.14]|  [8.0,6.77]| [8.0,5.76]|\n    |  2| [13.0,7.58]|[13.0,8.74]|[13.0,12.74]| [8.0,7.71]|\n    |  3|  [9.0,8.81]| [9.0,8.77]|  [9.0,7.11]| [8.0,8.84]|\n    |  4| [11.0,8.33]|[11.0,9.26]| [11.0,7.81]| [8.0,8.47]|\n    |  5| [14.0,9.96]| [14.0,8.1]| [14.0,8.84]| [8.0,7.04]|\n    |  6|  [6.0,7.24]| [6.0,6.13]|  [6.0,6.08]| [8.0,5.25]|\n    |  7|  [4.0,4.26]|  [4.0,3.1]|  [4.0,5.39]|[19.0,12.5]|\n    |  8|[12.0,10.84]|[12.0,9.13]| [12.0,8.15]| [8.0,5.56]|\n    |  9|  [7.0,4.82]| [7.0,7.26]|  [7.0,6.42]| [8.0,7.91]|\n    | 10|  [5.0,5.68]| [5.0,4.74]|  [5.0,5.73]| [8.0,6.89]|\n    +---+------------+-----------+------------+-----------+\n    \n\n\nAfter assembling the dataframe into four sets of 2D vectors, let's calculate the pearson correlation for each set. 
In the case of the ascombe sets (Anscombe's quartet), all four sets should have nearly the same Pearson correlation.\n\n\n```python\nfrom pyspark.ml.stat import Correlation\nfrom pyspark.sql.types import StructType, StructField, FloatType\n\ncorr = {}\ncols = ['I', 'II', 'III', 'IV']\n\n# calculate pearson correlations: corr() returns a one-row dataframe holding\n# the 2x2 correlation matrix, and element [0,1] is the x/y coefficient\nfor s in cols:\n    corr[s] = Correlation.corr(df, s, 'pearson').collect()[0][0][0,1].item()\n\n# declare schema\nschema = StructType([StructField(s, FloatType(), True) for s in cols])\n\n# create output dataframe\ncorr_df = spark.createDataFrame(data=[corr], schema=schema)\n```\n\n\n```python\nimport pyspark.sql.functions as f\ncorr_df.select([f.round(f.avg(c), 3).alias(c) for c in cols]).show()\n```\n\n    +-----+-----+-----+-----+\n    |    I|   II|  III|   IV|\n    +-----+-----+-----+-----+\n    |0.816|0.816|0.816|0.817|\n    +-----+-----+-----+-----+\n    \n\n\nSave the results. It's a very small data frame; however, when saving csv files Spark assumes large data sets and partitions the files inside an object (a directory) with the name of the target file. See below:\n\n\n\n```python\nengine.save(corr_df,'correlation')\n```\n\n\n\n\n    True\n\n\n\nWe read it back to check that all went fine.\n\n\n```python\nengine.load('correlation', header=True, inferSchema=True).show()\n```\n\n    +---+---------+---------+----------+----------+\n    |_c0|        I|       II|       III|        IV|\n    +---+---------+---------+----------+----------+\n    |  0|0.8164205|0.8162365|0.81628674|0.81652147|\n    +---+---------+---------+----------+----------+\n    \n\n\n### Modules: Export\n\nThis submodule allows you to export cells and import them in other notebooks as python packages. 
Check the notebook [versions.ipynb](versions.ipynb), where you will see how to export the notebook, then follow the code here below to check it really works!\n\n\n\n```python\nimport datafaucet as dfc\ndfc.project.load()\n\nfrom hello import python_version\n```\n\n    importing Jupyter notebook from hello.ipynb\n\n\n\n```python\npython_version()\n```\n\n    Hello world: python 3.6.7\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnatbusa%2Fdatafaucet","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnatbusa%2Fdatafaucet","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnatbusa%2Fdatafaucet/lists"}