{"id":22848355,"url":"https://github.com/feedzai/fifar-dataset","last_synced_at":"2025-07-08T15:03:36.336Z","repository":{"id":213516344,"uuid":"721082623","full_name":"feedzai/fifar-dataset","owner":"feedzai","description":"IT-48832","archived":false,"fork":false,"pushed_at":"2024-03-08T13:20:09.000Z","size":1405,"stargazers_count":12,"open_issues_count":1,"forks_count":2,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-30T04:48:57.845Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/feedzai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-11-20T10:19:14.000Z","updated_at":"2025-02-16T03:30:24.000Z","dependencies_parsed_at":"2025-04-30T04:48:58.933Z","dependency_job_id":"327f0845-8f83-4eee-b31b-377c3954d926","html_url":"https://github.com/feedzai/fifar-dataset","commit_stats":null,"previous_names":["feedzai/fifar-dataset"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/feedzai/fifar-dataset","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/feedzai%2Ffifar-dataset","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/feedzai%2Ffifar-dataset/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/feedzai%2Ffifar-dataset/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/feedzai%2Ffifar-dataset/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/f
eedzai","download_url":"https://codeload.github.com/feedzai/fifar-dataset/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/feedzai%2Ffifar-dataset/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264292908,"owners_count":23586059,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-13T04:11:30.117Z","updated_at":"2025-07-08T15:03:36.301Z","avatar_url":"https://github.com/feedzai.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# **F**i**FAR** - A Fraud Detection Dataset for Learning to Defer\n\n## Abstract\n\nPublic dataset limitations have significantly hindered the development and benchmarking of *learning to defer* (L2D) algorithms, which aim to optimally combine human and AI capabilities in hybrid decision-making systems. In such systems, human availability and domain-specific concerns introduce difficulties, while obtaining human predictions for training and evaluation is costly. Financial fraud detection is a high-stakes setting where algorithms and human experts often work in tandem; however, there are no publicly available datasets for L2D concerning this important application of human-AI teaming. To fill this gap in L2D research, we introduce the *Financial Fraud Alert Review* Dataset (FiFAR), a synthetic bank account fraud detection dataset, containing the predictions of a team of 50 highly complex and varied synthetic fraud analysts, with varied bias and feature dependence. 
We also provide a realistic definition of human work capacity constraints, an aspect of L2D systems that is often overlooked, allowing for extensive testing of assignment systems under real-world conditions.\nWe use our dataset to develop a capacity-aware L2D method and rejection learning approach under realistic data availability conditions, and benchmark these baselines under an array of 300 distinct testing scenarios. We believe that this dataset will serve as a pivotal instrument in facilitating a systematic, rigorous, reproducible, and transparent evaluation and comparison of L2D methods, thereby fostering the development of more synergistic human-AI collaboration in decision-making systems. The public dataset and detailed synthetic expert information are available [here](https://anonymous.4open.science/r/fifar-1245/).\n\n## Overview\n\n* [Dataset Download](#Dataset-Download)\n* [Using the FiFAR Dataset](#Using-the-FiFAR-Dataset)\n* [Installing Necessary Dependencies](#Installing-Necessary-Dependencies)\n* [Replicating our Experiments](#Replicating-our-experiments)\n\n## Dataset Download\n\n* [FiFAR and models used for our benchmark](https://www.kaggle.com/datasets/leonardovalves/fifar-financial-fraud-alert-review-dataset/data).\n\nThe submitted version of the paper and the datasheet are available in the following links:\n\n* [Paper](https://arxiv.org/abs/2312.13218)\n* [Datasheet](Documents/datasheet.pdf)\n\n\n## Using the FiFAR Dataset\n\n![alt text](Images/dataset_diagram.png)\n\nThe dataset is comprised of:\n\n* An Input Dataset.\n* Synthetic Expert prediction table.\n* Dataset with limited expert predictions.\n* Sets of capacity constraint tables.\n\nFor more information on each of these components, please consult the provided [datasheet](Documents/datasheet.pdf).\n\n* ### Step 1: Download the Code in this repo:\nFor easy use of our dataset and available notebooks, we encourage users to download the repo in its entirety.\n\n* ### Step 2: Download the 
Input Dataset\nOur input dataset is the base variant of the Bank Account Fraud Tabular Dataset, available [here](https://www.kaggle.com/datasets/sgpjesus/bank-account-fraud-dataset-neurips-2022?resource=download\u0026select=Base.csv). This dataset should then be placed in the folder [Code/data](Code/data).\n\n* ### Step 3: Download the Models and FiFAR.\nThe models used in our experiments and the dataset with limited expert predictions are available [here](https://www.kaggle.com/datasets/leonardovalves/fifar-financial-fraud-alert-review-dataset/data).\n\nWithin the provided folder you will find:\n\n* Expertise Models - Folder containing the models used for deferral\n* ML Model - Folder containing the ML model used in the task\n* Experts - Folder containing the expert information, including the generated probabilities of error and the resulting predictions\n* Testbed - Folder containing the dataset with limited expert predictions and the test capacity constraints\n\nOur methods can be trained on the dataset with limited expert predictions, which simulates a realistic scenario.\n\n* ### Step 4: Load data into correct directories.\nTo place all the necessary data in the correct directories, the user needs to run \"[load\\_data.py](load_data.py)\". The script only requires the user to specify the directory of the datasets downloaded in Step 3. The expert prediction table is split according to the expert preprocessing and deployment splits.\n\n\n\n### Uses of the FiFAR Dataset\n\nThis dataset can be used to develop L2D methods under realistic conditions. Our dataset poses realistic challenges, such as:\n\n* Limited expert prediction availability\n* Developing algorithms under dynamic environments\n* Human work capacity constraints\n\nThe dataset with limited expert predictions can be used to train assignment systems under realistic human data availability. 
Our expert prediction table contains the predictions of 50 synthetic fraud analysts for each of the 1M instances of the BAF dataset. It can be used to train more data-demanding algorithms, or to generate different training scenarios with the use of new capacity constraints. Our capacity constraint tables are also available, and are useful to test capacity-aware assignment under a vast array of expert team configurations.\n\n\n## Installing Necessary Dependencies\n\n### Creating the Python Environment\n\nRequirements:\n* anaconda3\n  \nBefore using any of the provided code, please create and activate the provided Python environment by running:\n\n```\nconda env create -f fifar-environment.yml\nconda activate fifar-env\n```\n\nThen, please install the package available in the folder [Dependencies](Dependencies).\n\n```\npip install Dependencies/autodefer-0.0.1-py3-none-any.whl \n```\n\n## Replicating our experiments\n\n### L2D Baseline Results\nAfter following the steps to obtain the **FiFAR Dataset**, detailed in the previous section, the user must run the file \"[Code/testbed/run_tests.py](Code/testbed/run_tests.py)\". This script produces the test split assignments for each testing scenario. These assignments are obtained by using each of our 3 baseline models, detailed in the [paper](Documents/Paper.pdf), resulting in a total of 900 sets of assignments.\n\n### ML Model evaluation\n\nThe plots, numerical results, and hyperparameter choices relating to our ML model are obtained using the script [Code/ml_model/training_and_predicting.py](Code/ml_model/training_and_predicting.py). \n\n### Synthetic experts' decision evaluation\n\nThe plots and numerical results regarding our synthetic experts' generation process and decision properties are obtained using the notebook [Code/experts/expert_properties.ipynb](Code/experts/expert_properties.ipynb). 
\n\n\n\n### How to Cite FiFAR\n\n```\n@inproceedings{\nalves2023fifar,\ntitle={Fi{FAR}: A Fraud Detection Dataset for Learning to Defer},\nauthor={Jean Vieira Alves and Diogo Leit{\\~a}o and S{\\'e}rgio Jesus and Marco O. P. Sampaio and Pedro Saleiro and Mario A. T. Figueiredo and Pedro Bizarro},\nbooktitle={2nd Workshop on Synthetic Data for AI in Finance},\nyear={2023},\nurl={https://openreview.net/forum?id=oyBm9bRNMK}\n}\n```\nThe paper is publicly available at this [arXiv link](https://arxiv.org/abs/2312.13218)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffeedzai%2Ffifar-dataset","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffeedzai%2Ffifar-dataset","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffeedzai%2Ffifar-dataset/lists"}