{"id":18886533,"url":"https://github.com/facultyai/boltzmannclean","last_synced_at":"2025-06-27T13:33:18.684Z","repository":{"id":60721955,"uuid":"127295908","full_name":"facultyai/boltzmannclean","owner":"facultyai","description":"Fill missing values in Pandas DataFrames using Restricted Boltzmann Machines","archived":false,"fork":false,"pushed_at":"2020-05-18T09:56:19.000Z","size":22,"stargazers_count":23,"open_issues_count":1,"forks_count":9,"subscribers_count":18,"default_branch":"master","last_synced_at":"2025-03-28T10:21:17.352Z","etag":null,"topics":["data-cleaning","data-science","dataframe","pandas","restricted-boltzmann-machine"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/boltzmannclean/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/facultyai.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-03-29T13:32:41.000Z","updated_at":"2022-08-25T11:58:21.000Z","dependencies_parsed_at":"2022-10-03T20:31:27.970Z","dependency_job_id":null,"html_url":"https://github.com/facultyai/boltzmannclean","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facultyai%2Fboltzmannclean","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facultyai%2Fboltzmannclean/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facultyai%2Fboltzmannclean/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/facultyai%2Fboltzmannclean/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/owners/facultyai","download_url":"https://codeload.github.com/facultyai/boltzmannclean/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248631686,"owners_count":21136562,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-cleaning","data-science","dataframe","pandas","restricted-boltzmann-machine"],"created_at":"2024-11-08T07:28:15.220Z","updated_at":"2025-04-14T21:31:09.693Z","avatar_url":"https://github.com/facultyai.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"boltzmannclean\n==============\n\nFill missing values in a pandas DataFrame using a Restricted Boltzmann Machine.\n\nProvides a class implementing the scikit-learn transformer interface for\ncreating and training a Restricted Boltzmann Machine. This can then be sampled\nfrom to fill in missing values in training data or new data of the same format.\nUtility functions for applying the transformations to a pandas DataFrame are\nprovided, with the option to treat columns as either continuous numerical or\ncategorical features.\n\nInstallation\n------------\n\n.. code-block:: bash\n\n    pip install boltzmannclean\n\nUsage\n-----\n\nTo fill in missing values from a DataFrame with the minimum of fuss, a cleaning\nfunction is provided.\n\n.. 
code-block:: python\n\n    import boltzmannclean\n\n    my_clean_dataframe = boltzmannclean.clean(\n        dataframe=my_dataframe,\n        numerical_columns=['Height', 'Weight'],\n        categorical_columns=['Colour', 'Shape'],\n        tune_rbm=True  # tune RBM hyperparameters for my data\n    )\n\nTo create and use the underlying scikit-learn transformer directly:\n\n.. code-block:: python\n\n    my_rbm = boltzmannclean.RestrictedBoltzmannMachine(\n        n_hidden=100, learn_rate=0.01,\n        batchsize=10, dropout_fraction=0.5, max_epochs=1,\n        adagrad=True\n    )\n\n    my_rbm.fit_transform(a_numpy_array)\n\nThe values shown above are the default RBM hyperparameters, and the numpy\narray passed in must consist entirely of numbers in the range [0,1] or\nnp.nan/None. The hyperparameters are:\n\n- *n_hidden*: the size of the hidden layer\n- *learn_rate*: learning rate for stochastic gradient descent\n- *batchsize*: batch size for stochastic gradient descent\n- *dropout_fraction*: fraction of hidden nodes to be dropped out on each\n  backward pass during training\n- *max_epochs*: maximum number of passes over the training data\n- *adagrad*: whether to use the Adagrad update rules for stochastic gradient\n  descent\n\nExample\n-------\n\n.. 
code-block:: python\n\n    import boltzmannclean\n    import numpy as np\n    import pandas as pd\n    from sklearn import datasets\n\n    iris = datasets.load_iris()\n\n    df_iris = pd.DataFrame(iris.data,columns=iris.feature_names)\n    df_iris['target'] = pd.Series(iris.target, dtype=str)\n\n    df_iris.head()\n\n=   =================   ================    =================   ================    ======\n_   sepal length (cm)   sepal width (cm)    petal length (cm)   petal width (cm)    target\n=   =================   ================    =================   ================    ======\n0   5.1                  3.5                  1.4                  0.2                  0\n1   4.9                  3.0                  1.4                  0.2                  0\n2   4.7                  3.2                  1.3                  0.2                  0\n3   4.6                  3.1                  1.5                  0.2                  0\n4   5.0                  3.6                  1.4                  0.2                  0\n=   =================   ================    =================   ================    ======\n\nAdd some noise:\n\n.. 
code-block:: python\n\n    noise = [(0,1),(2,0),(0,4)]\n\n    for noisy in noise:\n        df_iris.iloc[noisy] = None\n\n    df_iris.head()\n\n=   =================   ================    =================   ================    ======\n_   sepal length (cm)   sepal width (cm)    petal length (cm)   petal width (cm)    target\n=   =================   ================    =================   ================    ======\n0   5.1                  NaN                  1.4                  0.2               None\n1   4.9                  3.0                  1.4                  0.2                  0\n2   NaN                  3.2                  1.3                  0.2                  0\n3   4.6                  3.1                  1.5                  0.2                  0\n4   5.0                  3.6                  1.4                  0.2                  0\n=   =================   ================    =================   ================    ======\n\nClean the DataFrame:\n\n.. code-block:: python\n\n    df_iris_cleaned = boltzmannclean.clean(\n        dataframe=df_iris,\n        numerical_columns=[\n            'sepal length (cm)', 'sepal width (cm)',\n            'petal length (cm)', 'petal width (cm)'\n        ],\n        categorical_columns=['target'],\n        tune_rbm=True\n    )\n\n    df_iris_cleaned.round(1).head()\n\n=   =================   ================    =================   ================    ======\n_   sepal length (cm)   sepal width (cm)    petal length (cm)   petal width (cm)    target\n=   =================   ================    =================   ================    ======\n0   5.1                  3.3                  1.4                  0.2                  0\n1   4.9                  3.0                  1.4                  0.2                  0\n2   6.3                  3.2                  1.3                  0.2                  0\n3   4.6                  3.1                  1.5                  0.2                  0\n4   5.0  
                3.6                  1.4                  0.2                  0\n=   =================   ================    =================   ================    ======\n\nThe larger and more correlated the dataset is, the better the imputed values\nwill be.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffacultyai%2Fboltzmannclean","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffacultyai%2Fboltzmannclean","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffacultyai%2Fboltzmannclean/lists"}