{"id":13856675,"url":"https://github.com/jianzhnie/AutoTabular","last_synced_at":"2025-07-13T19:32:20.848Z","repository":{"id":57412984,"uuid":"365207112","full_name":"jianzhnie/AutoTabular","owner":"jianzhnie","description":"Automatic machine learning for tabular data. ⚡🔥⚡","archived":false,"fork":false,"pushed_at":"2021-12-11T12:04:34.000Z","size":6193,"stargazers_count":66,"open_issues_count":0,"forks_count":10,"subscribers_count":4,"default_branch":"main","last_synced_at":"2024-05-15T04:34:05.862Z","etag":null,"topics":["automl","catboost","data-science","deep-learning","feature-engineering","hpo","lightgbm","machine-learning","pytorch-lightning","scikit-learn","structured-data","tabular-data","xgboost"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jianzhnie.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-05-07T11:09:51.000Z","updated_at":"2024-02-15T15:33:56.000Z","dependencies_parsed_at":"2022-08-29T16:52:35.329Z","dependency_job_id":null,"html_url":"https://github.com/jianzhnie/AutoTabular","commit_stats":null,"previous_names":[],"tags_count":0,"template":true,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jianzhnie%2FAutoTabular","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jianzhnie%2FAutoTabular/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jianzhnie%2FAutoTabular/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jianzhnie%2FAutoTabular/manifests","owner_url":"https://repos.ecosyste.ms/api/
v1/hosts/GitHub/owners/jianzhnie","download_url":"https://codeload.github.com/jianzhnie/AutoTabular/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225912212,"owners_count":17544124,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automl","catboost","data-science","deep-learning","feature-engineering","hpo","lightgbm","machine-learning","pytorch-lightning","scikit-learn","structured-data","tabular-data","xgboost"],"created_at":"2024-08-05T03:01:08.453Z","updated_at":"2024-11-22T14:30:31.716Z","avatar_url":"https://github.com/jianzhnie.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# AutoTabular\n\n[![Paper](http://img.shields.io/badge/paper-arxiv.1001.2234-B31B1B.svg)](https://www.nature.com/articles/nature14539)\n[![Conference](http://img.shields.io/badge/NeurIPS-2019-4b44ce.svg)](https://papers.nips.cc/book/advances-in-neural-information-processing-systems-31-2018)\n[![Conference](http://img.shields.io/badge/ICLR-2019-4b44ce.svg)](https://papers.nips.cc/book/advances-in-neural-information-processing-systems-31-2018)\n[![Conference](http://img.shields.io/badge/AnyConference-year-4b44ce.svg)](https://papers.nips.cc/book/advances-in-neural-information-processing-systems-31-2018)\n\n\nAutoTabular automates machine learning tasks enabling you to easily achieve strong predictive performance in your applications.  
With just a few lines of code, you can train and deploy high-accuracy machine learning and deep learning models on tabular data.\n\n![autotabular](./docs/autotabular.png)\n\n[Toc]\n## What's good in it?\n\n- It uses RAPIDS as back-end support, which lets you run end-to-end data science and analytics pipelines entirely on GPUs.\n- It supports many anomaly detection models.\n- It uses meta-learning to accelerate model selection and parameter tuning.\n- It uses many deep learning models for tabular data: `Wide\u0026Deep`, `DCN(Deep \u0026 Cross Network)`, `FM`, `DeepFM`, `PNN` ...\n- It uses many machine learning algorithms: `Baseline`, `Linear`, `Random Forest`, `Extra Trees`, `LightGBM`, `Xgboost`, `CatBoost`, and `Nearest Neighbors`.\n- It can compute an ensemble with the greedy algorithm from the [Caruana paper](http://www.cs.cornell.edu/~alexn/papers/shotgun.icml04.revised.rev2.pdf).\n- It can stack models to build a level 2 ensemble (available in `Compete` mode or after setting the `stack_models` parameter).\n- It can do feature preprocessing, such as missing value imputation and converting categoricals. 
What is more, it can also preprocess target values.\n- It can do advanced feature engineering, such as [Golden Features](https://supervised.mljar.com/features/golden_features/), [Features Selection](https://supervised.mljar.com/features/features_selection/), and Text and Time Transformations.\n- It can tune hyper-parameters with the `not-so-random-search` algorithm (random search over a defined set of values) and hill climbing to fine-tune final models.\n\n## Installation\n\nThe sources for AutoTabular can be downloaded from the `GitHub repo`.\n\nYou can clone the public repository:\n\n```bash\n# clone project\ngit clone https://apulis-gitlab.apulis.cn/apulis/AutoTabular/autotabular.git\ncd autotabular\n# First, install dependencies\npip install -r requirements.txt\n```\n\nOnce you have a copy of the source, you can install it with:\n\n```bash\npython setup.py install\n```\n## Example\nNext, navigate to the `example` folder and run any script in it.\n```bash\n# example folder\ncd example\n\n# run an example (here: binary classification on the Titanic dataset)\npython binary_classifier_Titanic.py\n```\n\n### Auto Feature Generation \u0026 Selection\n```python\nTODO\n```\n### Deep Feature Synthesis\n```python\nimport featuretools as ft\nimport pandas as pd\nfrom sklearn.datasets import load_iris\n\n# Load data and put into dataframe\niris = load_iris()\ndf = pd.DataFrame(iris.data, columns=iris.feature_names)\ndf['species'] = iris.target\ndf['species'] = df['species'].map({\n    0: 'setosa',\n    1: 'versicolor',\n    2: 'virginica'\n})\n# Make an entityset and add the entity\nes = ft.EntitySet()\nes.add_dataframe(\n    dataframe_name='data', dataframe=df, make_index=True, index='index')\n# Run deep feature synthesis with transformation primitives\nfeature_matrix, feature_defs = ft.dfs(\n    entityset=es,\n    max_depth=3,\n    target_dataframe_name='data',\n    agg_primitives=['mode', 'mean', 'max', 'count'],\n    trans_primitives=[\n        'add_numeric', 'multiply_numeric', 'cum_min', 'cum_mean', 'cum_max'\n    
],\n    groupby_trans_primitives=['cum_sum'])\n\nprint(feature_defs)\nprint(feature_matrix.head())\nprint(feature_matrix.ww)\n```\n### GBDT Feature Generation\n```python\nimport numpy as np\nimport pandas as pd\nfrom sklearn.preprocessing import LabelEncoder\n\nfrom autofe.feature_engineering.gbdt_feature import (\n    CatboostFeatureTransformer, GBDTFeatureTransformer,\n    LightGBMFeatureTransformer, XGBoostFeatureTransformer)\n\ntitanic = pd.read_csv('autotabular/datasets/data/Titanic.csv')\n# 'Embarked' is stored as letters, so fit a label encoder on it (missing values become 'Null')\nembarked_encoder = LabelEncoder()\nembarked_encoder.fit(titanic['Embarked'].fillna('Null'))\n# Record anyone travelling alone\ntitanic['Alone'] = (titanic['SibSp'] == 0) \u0026 (titanic['Parch'] == 0)\n# Transform 'Embarked'\ntitanic['Embarked'] = titanic['Embarked'].fillna('Null')\ntitanic['Embarked'] = embarked_encoder.transform(titanic['Embarked'])\n# Transform 'Sex'\ntitanic.loc[titanic['Sex'] == 'female', 'Sex'] = 0\ntitanic.loc[titanic['Sex'] == 'male', 'Sex'] = 1\ntitanic['Sex'] = titanic['Sex'].astype('int8')\n# Drop features that seem unusable. 
Save passenger ids if test\ntitanic.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)\n\ntrainMeans = titanic.groupby(['Pclass', 'Sex'])['Age'].mean()\n\ndef f(x):\n    if not np.isnan(x['Age']):  # not NaN\n        return x['Age']\n    return trainMeans[x['Pclass'], x['Sex']]\n\n# Fill missing ages with the mean age of the same class and sex\ntitanic['Age'] = titanic.apply(f, axis=1)\nrows = titanic.shape[0]\nn_train = int(rows * 0.77)\ntrain_data = titanic.iloc[:n_train, :]\ntest_data = titanic.iloc[n_train:, :]\n\nX_train = titanic.drop(['Survived'], axis=1)\ny_train = titanic['Survived']\n\nclf = XGBoostFeatureTransformer(task='classification')\nclf.fit(X_train, y_train)\nresult = clf.concate_transform(X_train)\nprint(result)\n\nclf = LightGBMFeatureTransformer(task='classification')\nclf.fit(X_train, y_train)\nresult = clf.concate_transform(X_train)\nprint(result)\n\nclf = GBDTFeatureTransformer(task='classification')\nclf.fit(X_train, y_train)\nresult = clf.concate_transform(X_train)\nprint(result)\n\nclf = CatboostFeatureTransformer(task='classification')\nclf.fit(X_train, y_train)\nresult = clf.concate_transform(X_train)\nprint(result)\n\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.metrics import roc_auc_score\n\nlr = LogisticRegression()\nx_train_gb, x_test_gb, y_train_gb, y_test_gb = train_test_split(\n    result, y_train)\nx_train, x_test, y_train, y_test = train_test_split(X_train, y_train)\n\n# Baseline: logistic regression on the original features\nlr.fit(x_train, y_train)\nscore = roc_auc_score(y_test, lr.predict_proba(x_test)[:, 1])\nprint('LR with original data, train data shape : {0}  auc: {1}'.format(\n    x_train.shape, score))\n\n# Logistic regression on the output of the last transformer fitted above (CatBoost)\nlr = LogisticRegression()\nlr.fit(x_train_gb, y_train_gb)\nscore = roc_auc_score(y_test_gb, lr.predict_proba(x_test_gb)[:, 1])\nprint('LR with GBDT apply data, train data shape : {0}  auc: {1}'.format(\n    x_train_gb.shape, score))\n```\n### Golden Feature Generation\n```python\nimport numpy as np\nimport pandas as pd\nfrom sklearn.preprocessing import LabelEncoder\n\nfrom autofe import GoldenFeatureTransform\n\ntitanic = 
pd.read_csv('autotabular/datasets/data/Titanic.csv')\nembarked_encoder = LabelEncoder()\nembarked_encoder.fit(titanic['Embarked'].fillna('Null'))\n# Record anyone travelling alone\ntitanic['Alone'] = (titanic['SibSp'] == 0) \u0026 (titanic['Parch'] == 0)\n# Transform 'Embarked'\ntitanic['Embarked'] = titanic['Embarked'].fillna('Null')\ntitanic['Embarked'] = embarked_encoder.transform(titanic['Embarked'])\n# Transform 'Sex'\ntitanic.loc[titanic['Sex'] == 'female', 'Sex'] = 0\ntitanic.loc[titanic['Sex'] == 'male', 'Sex'] = 1\ntitanic['Sex'] = titanic['Sex'].astype('int8')\n# Drop features that seem unusable. Save passenger ids if test\ntitanic.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)\n\ntrainMeans = titanic.groupby(['Pclass', 'Sex'])['Age'].mean()\n\ndef f(x):\n    if not np.isnan(x['Age']):  # not NaN\n        return x['Age']\n    return trainMeans[x['Pclass'], x['Sex']]\n\ntitanic['Age'] = titanic.apply(f, axis=1)\n\nX_train = titanic.drop(['Survived'], axis=1)\ny_train = titanic['Survived']\nprint(X_train)\ngbdt_model = GoldenFeatureTransform(\n    results_path='./', ml_task='BINARY_CLASSIFICATION')\ngbdt_model.fit(X_train, y_train)\nresults = gbdt_model.transform(X_train)\nprint(results)\n```\n### Neural Network Embeddings\n```python\nimport pandas as pd\nimport torch\nfrom sklearn.preprocessing import LabelEncoder\nfrom torch import nn\nfrom torch.utils.data import DataLoader\n\n# TabularDataset and FeedForwardNN are not defined in this snippet;\n# import them from the module that defines them.\n\n# Data: https://www.kaggle.com/c/house-prices-advanced-regression-techniques\n# Path to the downloaded train.csv (adjust to your machine)\ndata_dir = '/media/robin/DATA/datatsets/structure_data/house_price/train.csv'\ndata = pd.read_csv(\n    data_dir,\n    usecols=[\n        'SalePrice', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea',\n        'Street', 'YearBuilt', 'LotShape', '1stFlrSF', '2ndFlrSF'\n    ]).dropna()\n\ncategorical_features = [\n    'MSSubClass', 'MSZoning', 'Street', 'LotShape', 'YearBuilt'\n]\noutput_feature = 'SalePrice'\n# Encode each categorical column as integer ids for the embedding layers\nlabel_encoders = {}\nfor cat_col in categorical_features:\n    label_encoders[cat_col] = LabelEncoder()\n    data[cat_col] = label_encoders[cat_col].fit_transform(data[cat_col])\n\ndataset = TabularDataset(\n    data=data, 
cat_cols=categorical_features, output_col=output_feature)\n\nbatchsize = 64\ndataloader = DataLoader(dataset, batchsize, shuffle=True, num_workers=1)\n\ncat_dims = [int(data[col].nunique()) for col in categorical_features]\nemb_dims = [(x, min(50, (x + 1) // 2)) for x in cat_dims]\ndevice = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\nmodel = FeedForwardNN(\n    emb_dims,\n    no_of_cont=4,\n    lin_layer_sizes=[50, 100],\n    output_size=1,\n    emb_dropout=0.04,\n    lin_layer_dropouts=[0.001, 0.01]).to(device)\nprint(model)\nnum_epochs = 100\ncriterion = nn.MSELoss()\noptimizer = torch.optim.Adam(model.parameters(), lr=0.1)\nfor epoch in range(num_epochs):\n    for y, cont_x, cat_x in dataloader:\n        cat_x = cat_x.to(device)\n        cont_x = cont_x.to(device)\n        y = y.to(device)\n        # Forward Pass\n        preds = model(cont_x, cat_x)\n        loss = criterion(preds, y)\n        # Backward Pass and Optimization\n        optimizer.zero_grad()\n        loss.backward()\n        optimizer.step()\n    print('loss:', loss)\n```\n\n## License\n\nThis library is licensed under the Apache 2.0 License.\n\n## Contributing to AutoTabular\n\nWe are actively accepting code contributions to the AutoTabular project. If you are interested in contributing to AutoTabular, please contact me.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjianzhnie%2FAutoTabular","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjianzhnie%2FAutoTabular","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjianzhnie%2FAutoTabular/lists"}