{"id":23647154,"url":"https://github.com/wey-gu/fraud-detection-datagen","last_synced_at":"2025-08-31T22:32:39.097Z","repository":{"id":39755873,"uuid":"490929961","full_name":"wey-gu/fraud-detection-datagen","owner":"wey-gu","description":"Fraud detection data generation with configurable degree distribution\u0026 community structure, ready for NebulaGraph.","archived":false,"fork":false,"pushed_at":"2024-04-19T13:45:00.000Z","size":178577,"stargazers_count":25,"open_issues_count":2,"forks_count":9,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-12-20T12:48:06.427Z","etag":null,"topics":["community-detection","dataset-generation","fraud-detection","nebula-graph","nebulagraph"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/wey-gu.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2022-05-11T02:29:40.000Z","updated_at":"2024-12-12T11:46:12.000Z","dependencies_parsed_at":"2024-04-19T15:07:24.301Z","dependency_job_id":null,"html_url":"https://github.com/wey-gu/fraud-detection-datagen","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wey-gu%2Ffraud-detection-datagen","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wey-gu%2Ffraud-detection-datagen/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wey-gu%2Ffraud-detection-datagen/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wey-gu%2
Ffraud-detection-datagen/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/wey-gu","download_url":"https://codeload.github.com/wey-gu/fraud-detection-datagen/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":231633141,"owners_count":18403402,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["community-detection","dataset-generation","fraud-detection","nebula-graph","nebulagraph"],"created_at":"2024-12-28T13:49:36.537Z","updated_at":"2024-12-28T13:49:37.132Z","avatar_url":"https://github.com/wey-gu.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## How to use the data\n\nFirst, bootstrap a Nebula Graph cluster. For a single-server playground, try [nebula-up](https://github.com/wey-gu/nebula-up/).\n\nThen, assuming you have a Nebula Graph cluster running in Docker on the network `nebula-net`, use the following command to call Nebula Graph Importer, which imports the data into Nebula Graph using the configuration in `nebula_graph_importer.yaml`:\n\n```bash\n# If we are using the sample data:\n# cp -r data_sample_numerical_vertex_id data\n\n# do this only once: remove the header line from data/*.csv\nsed -i '1d' data/*.csv\n\ndocker run --rm -ti \\\n    --network=nebula-net \\\n    -v ${PWD}:/root/ \\\n    -v ${PWD}/data/:/data \\\n    vesoft/nebula-importer:v3.1.0 \\\n    --config /root/nebula_graph_importer.yaml\n```\n\n\u003e Note: to leverage the data in [NebulaGraph 
Algorithm](https://github.com/vesoft-inc/nebula-algorithm/), it's recommended to configure `vertex_id_format` as `numerical`.\n\u003e This is an example of running the Louvain algorithm:\n```bash\ncd ~/.nebula-up/nebula-up/spark\n\ndocker exec -it sparkmaster /spark/bin/spark-submit \\\n    --master \"local\" --conf spark.rpc.askTimeout=6000s \\\n    --class com.vesoft.nebula.algorithm.Main \\\n    --driver-memory 4g /root/download/nebula-algo.jar \\\n    -p /root/louvain.conf\n```\n\n## How to generate data\n\nInstall Python 3 and [Julia](https://www.google.com/search?q=how+to+install+julia) first, then install the dependencies with:\n\n```bash\n# python dependencies\npython3 -m pip install -r requirements.txt\n\n# julia dependencies, please install julia before running this line\njulia install.ji\n```\n\nConfigure `config.toml` as you wish (its options are documented inline), then run:\n\n```bash\npython3 data_generator.py\n```\n\nData is written to the `data` folder. The files under `data_sample` can be used instead if they fit your needs. 
The process should look like this:\n\nhttps://user-images.githubusercontent.com/1651790/168299297-83b232a1-23b4-44e0-b569-595b70a2b0da.mp4\n\n## Graph Model\n\ntags (vertex labels)\n\n- contact\n  - properties: name, gender, birthday\n- device\n- loan_applicant\n  - properties: address, degree, occupation, salary, is_risky, risk_profile, name, gender, birthday\n- loan_application\n  - properties: apply_agent_id, apply_date, application_id, approval_status, application_type, rejection_reason\n- phone_number\n  - properties: phone_num\n- corporation\n  - properties: name, is_risky, risk_profile\n\nedge types\n\n- with_phone_num()\n- applied_for_loan(start_time)\n- used_device(start_time)\n- worked_for(start_time)\n- is_related_to(degree)\n\n![fraud_detection_graph_model](images/fraud_detection_graph_model.svg)\n\n## Data Generation Process\n\nWe use [py-Faker](https://github.com/joke2k/faker) to generate plausible properties, and [ABCDGraphGenerator.jl](https://github.com/bkamins/ABCDGraphGenerator.jl) to generate relationships with a defined community structure.\n\nBecause the relationships must be of different types (e.g., `shared_phone`, `shared_employer`, `shared_device`), the generation process follows the diagram below.\n\n![fraud_detection_data_gen_process](images/fraud_detection_data_gen_process.svg)\n\nThe steps are:\n\n0. Generate contacts (persons) as vertices\n\n1. Generate relationships/edges with configurable factors (degrees, community size, etc.)\n\n2. Distribute the relationships across different patterns:\n\n   `shared a phone number`, `shared an employer`, `shared a device`, `with a phone number of one's employer`, `is a relative of`\n\n3. 
Generate loan applications. This adds the `Loan Applicant` vertex tag to the contact (person) and an `applied_for_loan` edge from that person to the `Loan Application`.\n\n## Data Explanation\n\nSee the inline comments in the file tree below. For example, the comment for `shared_via_employer_phone_num_relationship.csv` is:\n```cypher\n(:p)-[:worked_for]-\u003e(:corp)-[:with_phone_num]-\u003e(:phone_num)\u003c-[:with_phone_num]-(:p)\n```\nThis means each record (line) in the CSV file contains three edges and four vertices:\n- `:p` is a person vertex tag\n- `worked_for` is an edge type between `:p` and `:corp`\n- `:corp` is a corporation vertex tag\n- `with_phone_num` is an edge type between corporation and phone_num\n- `:phone_num` is a phone number vertex tag\n- `with_phone_num` is an edge type between phone_num and person\n- `:p` is another person vertex tag\n\n```bash\n$ tree data\ndata\n├── abcd             # raw data with ABCD Sampler, reference only\n│   ├── com.dat      # vertex -\u003e community\n│   ├── cs.dat       # community size\n│   ├── deg.dat      # vertex degree\n│   └── edge.dat     # edges (which form the communities)\n├── applicant_application_with_is_related_to.csv\n│                    # (loan_applicant:appliant)-[:is_related_to]-\u003e(contact:person)\n│                    # (loan_applicant:appliant)-[:applied_for_loan]-\u003e(app:loan_application)\n├── applicant_application_with_shared_device.csv\n│                    # (loan_applicant_0:appliant)-[:used_dev]-\u003e(:dev)\u003c-[:used_dev]-(loan_applicant_1:appliant)\n│                    # (loan_applicant_0:appliant)-[:applied_for_loan]-\u003e(app_0:loan_application)\n│                    # (loan_applicant_1:appliant)-[:applied_for_loan]-\u003e(app_1:loan_application)\n├── applicant_application_with_shared_phone_num.csv\n│                    # (loan_applicant_0:appliant)-[:with_phone_num]-\u003e(:phone_num)\u003c-[:with_phone_num]-(loan_applicant_1:appliant)\n│                    # 
(loan_applicant_0:appliant)-[:applied_for_loan]-\u003e(app_0:loan_application)\n│                    # (loan_applicant_1:appliant)-[:applied_for_loan]-\u003e(app_1:loan_application)\n├── applicant_application_with_shared_employer.csv\n│                    # (loan_applicant_0:appliant)-[:worked_for]-\u003e(:corp)\u003c-[:worked_for]-(loan_applicant_1:appliant)\n│                    # (loan_applicant_0:appliant)-[:applied_for_loan]-\u003e(app_0:loan_application)\n│                    # (loan_applicant_1:appliant)-[:applied_for_loan]-\u003e(app_1:loan_application)\n├── applicant_application_connected_with_employer_and_phone_num.csv\n│                    # (loan_applicant_0:appliant)-[:worked_for]-\u003e(:corp)-[:with_phone_num]-\u003e(:phone_num)\u003c-[:with_phone_num]-(loan_applicant_1:appliant)\n│                    # (loan_applicant_0:appliant)-[:applied_for_loan]-\u003e(app_0:loan_application)\n│                    # (loan_applicant_1:appliant)-[:applied_for_loan]-\u003e(app_1:loan_application)\n├── corporation.csv  # corporation vertex\n├── device.csv       # device vertex\n├── is_relative_relationship.csv\n│                    # is_relative (:p)-[:is_related_to]-\u003e(:p)\n├── person.csv       # contact vertex\n├── phone_number.csv # phone number vertex\n├── shared_device_relationship.csv\n│                    # (:p)-[:used_dev]-\u003e(:dev)\u003c-[:used_dev]-(:p)\n├── shared_employer_relationship.csv\n│                    # (:p)-[:worked_for]-\u003e(:corp)\u003c-[:worked_for]-(:p)\n├── shared_phone_num_relationship.csv\n│                    # (:p)-[:with_phone_num]-\u003e(:phone_num)\u003c-[:with_phone_num]-(:p)\n└── shared_via_employer_phone_num_relationship.csv\n    # (:p)-[:worked_for]-\u003e(:corp)-[:with_phone_num]-\u003e(:phone_num)\u003c-[:with_phone_num]-(:p)\n```\n\n### How to play with the data:\n\nhttps://github.com/wey-gu/fraud-detection-datagen/assets/1651790/95781467-7454-4763-b0c2-17e7b33e639a\n\nSee here: 
https://siwei.io/en/fraud-detection-with-nebulagraph/ for more!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwey-gu%2Ffraud-detection-datagen","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwey-gu%2Ffraud-detection-datagen","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwey-gu%2Ffraud-detection-datagen/lists"}