{"id":25019254,"url":"https://github.com/noahho/caafe","last_synced_at":"2025-04-04T19:10:21.775Z","repository":{"id":167370891,"uuid":"635388216","full_name":"noahho/CAAFE","owner":"noahho","description":"Semi-automatic feature engineering process using Language Models and your dataset descriptions. Based on the paper \"LLMs for Semi-Automated Data Science: Introducing CAAFE for Context-Aware Automated Feature Engineering\" by Hollmann, Müller, and Hutter (2023).","archived":false,"fork":false,"pushed_at":"2024-12-20T14:27:45.000Z","size":477,"stargazers_count":155,"open_issues_count":4,"forks_count":25,"subscribers_count":9,"default_branch":"main","last_synced_at":"2025-03-28T18:17:11.079Z","etag":null,"topics":["automl","data-science","deep-learning","feature-engineering","machine-learning","tabpfn"],"latest_commit_sha":null,"homepage":"http://priorlabs.ai","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/noahho.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-02T15:31:14.000Z","updated_at":"2025-03-25T03:38:54.000Z","dependencies_parsed_at":"2023-11-29T12:30:56.022Z","dependency_job_id":"cb8c16ad-21d2-41c1-8754-4aacd5bb1508","html_url":"https://github.com/noahho/CAAFE","commit_stats":null,"previous_names":["automl/caafe"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/noahho%2FCAAFE","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/noahho%2FCAAFE/tags","releases_url":"https://repos.ecosys
te.ms/api/v1/hosts/GitHub/repositories/noahho%2FCAAFE/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/noahho%2FCAAFE/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/noahho","download_url":"https://codeload.github.com/noahho/CAAFE/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247234921,"owners_count":20905854,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automl","data-science","deep-learning","feature-engineering","machine-learning","tabpfn"],"created_at":"2025-02-05T11:39:25.776Z","updated_at":"2025-04-04T19:10:21.757Z","avatar_url":"https://github.com/noahho.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CAAFE\nCAAFE lets you semi-automate your feature engineering process based on your description of the dataset and with the help of language models.\nIt is based on the paper [\"LLMs for Semi-Automated Data Science: Introducing CAAFE for Context-Aware Automated Feature Engineering\" by Hollmann, Müller, and Hutter (2023)](https://arxiv.org/pdf/2305.03403.pdf).\nCAAFE is developed as part of [Prior Labs](http://priorlabs.ai).\nCAAFE systematically verifies the generated features to ensure that only features that are actually useful are added to the dataset.\n\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://www.youtube.com/watch?v=6zCD48d3kNU\"\u003e\n        \u003cimg src=\"https://i.makeagif.com/media/5-20-2023/E4RfRM.gif\" alt=\"CAAFE demo\"/\u003e\n    
\u003c/a\u003e\n\u003c/p\u003e\n\n### Usage\nTo use CAAFE, first create a `CAAFEClassifier` object specifying your sklearn base classifier (`clf_no_feat_eng`; e.g. a random forest or [`TabPFN`](https://github.com/automl/TabPFN))\nand the language model you want to use (e.g. gpt-4):\n\n```python\nclf_no_feat_eng = ...\ncaafe_clf = CAAFEClassifier(\n    base_classifier=clf_no_feat_eng,\n    llm_model=\"gpt-4\",\n    iterations=2\n)\n```\n\nThen, fit the CAAFE-enhanced classifier to your training data:\n```python\ncaafe_clf.fit_pandas(\n    df_train,\n    target_column_name=target_column_name,\n    dataset_description=dataset_description\n)\n```\nFinally, use the classifier to make predictions on your test data:\n\n```python\npred = caafe_clf.predict(df_test)\n```\n\nView generated features:\n```python\nprint(caafe_clf.code)\n```\n\n#### Why not let GPT generate your features directly (or use Code Interpreter)?\nGPT-4 is a powerful language model that can generate code.\nHowever, it is not designed to generate code that is useful for machine learning.\nCAAFE uses a systematic verification process to ensure that the generated features are actually useful for the machine learning task at hand by iteratively creating new code, verifying its performance using cross-validation, and providing feedback to the language model.\nCAAFE makes sure that cross-validation is correctly applied and formalizes the verification process.\nAlso, CAAFE uses a whitelist of allowed operations to ensure that the generated code is safe(r) to execute.\nThere are inherent risks in executing AI-generated code, however; please see [Important Usage Considerations](#important-usage-considerations).\n\n#### Downstream Classifiers\nDownstream classifiers should be fast and need no specific hyperparameter tuning since they are called repeatedly.\nBy default, we use [`TabPFN`](https://github.com/automl/TabPFN) as the base classifier, which is a fast automated machine learning method for small 
tabular datasets.\n\n```python\nfrom functools import partial\n\nimport torch\nfrom tabpfn import TabPFNClassifier # Fast Automated Machine Learning method for small tabular datasets\n\nclf_no_feat_eng = TabPFNClassifier(\n    device=('cuda' if torch.cuda.is_available() else 'cpu'),\n    N_ensemble_configurations=4\n)\nclf_no_feat_eng.fit = partial(clf_no_feat_eng.fit, overwrite_warning=True)\n```\n\nHowever, [`TabPFN`](https://github.com/automl/TabPFN) only works for small datasets. You can use any other sklearn classifier as the base classifier.\nFor example, you can use a [`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html):\n```python\nfrom sklearn.ensemble import RandomForestClassifier\n\nclf_no_feat_eng = RandomForestClassifier(n_estimators=100, max_depth=2)\n```\n\n#### Demo\nTry out the demo at: [https://colab.research.google.com/drive/1mCA8xOAJZ4MaB_alZvyARTMjhl6RZf0a](https://colab.research.google.com/drive/1mCA8xOAJZ4MaB_alZvyARTMjhl6RZf0a)\n\n### Important Usage Considerations\n\n#### Code Execution\nExecuting AI-generated code automatically poses inherent risks.\nThese include potential misuse by bad actors or unforeseen outcomes when AI systems operate outside of their typical, controlled environments.\nIn developing our approach, we have taken insights from research on AI code generation and cybersecurity into account.\nWe scrutinize the syntax of the Python code generated by the AI and employ a whitelist of operations allowed for execution.\nCertain operations, such as imports and arbitrary function calls, are not permitted.\nWhile this increases security, it's not a complete solution – for example, it does not prevent operations that could result in infinite loops or excessive resource usage, like loops and list comprehensions.\nWe continually work to address these limitations.\n\n#### Replication of Biases\nIt's important to note that AI algorithms can often replicate and even perpetuate biases found in their 
training data.\nCAAFE, which is built on GPT-4, is not exempt from this issue.\nThe model has been trained on a vast array of web-crawled data, which inevitably contains biases inherent in society.\nThis implies that the generated features may also reflect these biases.\nIf the data contains demographic information or other sensitive variables that could potentially be used to discriminate against certain groups,\nwe strongly advise against using CAAFE or urge users to proceed with great caution, ensuring rigorous examination of the generated features.\n\n#### Cost of Running CAAFE\nCAAFE uses OpenAI's GPT-4 or GPT-3.5 as an endpoint, and OpenAI charges for API usage.\nThe cost of running CAAFE depends on the number of iterations, the number of features in the dataset, and the lengths of the dataset description and of the generated code.\nFor example, for a dataset with 1000 rows and 10 columns, 10 iterations cost about 0.50\\$ for GPT-4 and 0.05\\$ for GPT-3.5.\n\n### Paper\nRead our [paper](https://arxiv.org/abs/2305.03403) for more information about the setup (or contact us ☺️).\nIf you use our method, please cite us using\n\n```bibtex\n@misc{hollmann2023llms,\n      title={LLMs for Semi-Automated Data Science: Introducing CAAFE for Context-Aware Automated Feature Engineering}, \n      author={Noah Hollmann and Samuel Müller and Frank Hutter},\n      year={2023},\n      eprint={2305.03403},\n      archivePrefix={arXiv},\n      primaryClass={cs.AI}\n}\n```\n\n### License\nCopyright by Noah Hollmann, Samuel Müller and Frank Hutter.\n\nLicensed under the Apache License, Version 2.0 (the \"License\");\nyou may not use this file except in compliance with the License.\nYou may obtain a copy of the License at\n\n    http://www.apache.org/licenses/LICENSE-2.0\n\nUnless required by applicable law or agreed to in writing, software\ndistributed under the License is distributed on an \"AS IS\" BASIS,\nWITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\nSee the License for the 
specific language governing permissions and\nlimitations under the License.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnoahho%2Fcaafe","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnoahho%2Fcaafe","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnoahho%2Fcaafe/lists"}