https://github.com/wey-gu/fraud-detection-datagen
Fraud detection data generation with configurable degree distribution& community structure, ready for NebulaGraph.
https://github.com/wey-gu/fraud-detection-datagen
community-detection dataset-generation fraud-detection nebula-graph nebulagraph
Last synced: about 1 month ago
JSON representation
Fraud detection data generation with configurable degree distribution& community structure, ready for NebulaGraph.
- Host: GitHub
- URL: https://github.com/wey-gu/fraud-detection-datagen
- Owner: wey-gu
- Created: 2022-05-11T02:29:40.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2024-04-19T13:45:00.000Z (over 1 year ago)
- Last Synced: 2024-12-20T12:48:06.427Z (10 months ago)
- Topics: community-detection, dataset-generation, fraud-detection, nebula-graph, nebulagraph
- Language: Python
- Homepage:
- Size: 170 MB
- Stars: 25
- Watchers: 1
- Forks: 9
- Open Issues: 2
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
## How to use the data
First, to bootstrap your Nebula Graph cluster, here, for a single server playground, try with [nebula-up](https://github.com/wey-gu/nebula-up/).
Then, assuming you have a Nebula Graph cluster running in docker with network namespace: `nebula-net`, you can use the following command call Nebula Graph Importer to import data into Nebula Graph, with its configuration from `nebula_graph_importer.yaml`:
```bash
# If we are using the sample data:
# cp -r data_sample_numerical_vertex_id data# only do this for once, remove header line from data/*.csv
sed -i '1d' data/*.csvdocker run --rm -ti \
--network=nebula-net \
-v ${PWD}:/root/ \
-v ${PWD}/data/:/data \
vesoft/nebula-importer:v3.1.0 \
--config /root/nebula_graph_importer.yaml
```> Note, to leverage the data in [NebulaGraph Algorithm](https://github.com/vesoft-inc/nebula-algorithm/), it's recommended to configure `vertex_id_format` as `numerical`.
> This is an example to run the Louvain algorithm:
```bash
cd ~/.nebula-up/nebula-up/sparkdocker exec -it sparkmaster /spark/bin/spark-submit \
--master "local" --conf spark.rpc.askTimeout=6000s \
--class com.vesoft.nebula.algorithm.Main \
--driver-memory 4g /root/download/nebula-algo.jar \
-p /root/louvain.conf
```## How to generate data
Install Python3 and [Julia](https://www.google.com/search?q=how+to+install+julia) first, then install dependencies with:
```bash
# python dependencies
python3 -m pip install -r requirements.txt# julia dependencies, please install julia before running this line
julia install.ji
```Configure the `config.toml` as you wish, where options were documented inline, then just run:
```bash
python3 data_generator.py
```Data will be output under the `data` folder, the files under `data_sample` could be used if it fits your needs. The process should looks like:
https://user-images.githubusercontent.com/1651790/168299297-83b232a1-23b4-44e0-b569-595b70a2b0da.mp4
## Graph Model
tags(vertex label)
- contact
- properties: name, gender, birthday
- device
- loan_applicant
- properties: address, degree, occupation, salary, is_risky, risk_profile, name, gender, birthday
- loan_application
- properties: apply_agent_id, apply_date, application_id, approval_status, application_type, rejection_reason
- phone_number
- properties: phone_num
- corporation
- properties: name, is_risky, risk_profileedge types
- with_phone_num()
- applied_for_loan(start_time)
- used_device(start_time)
- worked_for(start_time)
- is_related_to(degree)
## Data Generation Process
We will be leveraging [py-Faker](https://github.com/joke2k/faker) to generate relatively reasonable properties, and [ABCDGraphGenerator.jl](https://github.com/bkamins/ABCDGraphGenerator.jl) to generate relationships with defined community structure.
As the relationship should be typed differently, i.e, `shared_phone`, `shared_employer`, `shared_device`, etc, the generation process would be as the following diagram.

The steps are:
0. Generate contacts(person) as vertices
1. Generate relationships/edges with configurable factors(degrees, community size, etc.)
2. Distribute relationships to different patterns:
`shared a phone number`, `shared an employer`, `shared a device`, `with a phone number of one's employer`, `is a relative of`
3. Generate Loan Applications, thus adding contact(person) a vertex tag of `Loan Applicant` and an `applied_for_loan` edge from the given person to the `Loan Application`.
## Data Explanation
See comments inline, i.e., for `shared_via_employer_phone_num_relationship.csv`, its comment is:
```cypher
(:p)-[:worked_for]->(:corp)-[:with_phone_num]->(:phone_num)<-[:with_phone_num]-(:p)
```
This means one line of records in the CSV file contains three edges and four vertices:
- `:p` is a person vertex tag
- `worked_for` is a edge type between `:p` and `:corp`
- `:corp` is a corporation vertex tag
- `with_phone_num` is a edge type between corporation and phone_num
- `:phone_num` is a phone number vertex tag
- `with_phone_num` is a edge type between phone_num and person
- `:p` is another person vertex tag```bash
$tree data
data
├── abcd # raw data with ABCD Sampler, reference only
│ ├── com.dat # vertex -> community
│ ├── cs.dat # community size
│ ├── deg.dat # vertex degree
│ └── edge.dat # edges(which construct the community)
├── applicant_application_with_is_related_to.csv
│ # (loan_applicant:appliant)-[:is_related_to]->(contact:person)
│ # (loan_applicant:appliant)-[:applied_for_loan]->(app:loan_application)
├── applicant_application_with_shared_device.csv
│ # (loan_applicant_0:appliant)-[:used_dev]->(:dev)<-[:used_dev]-(loan_applicant_1:appliant)
│ # (loan_applicant_0:appliant)-[:applied_for_loan]->(app_0:loan_application)
│ # (loan_applicant_1:appliant)-[:applied_for_loan]->(app_1:loan_application)
├── applicant_application_with_shared_phone_num.csv
│ # (loan_applicant_0:appliant)-[:with_phone_num]->(:phone_num)<-[:with_phone_num]-(loan_applicant_1:appliant)
│ # (loan_applicant_0:appliant)-[:applied_for_loan]->(app_0:loan_application)
│ # (loan_applicant_1:appliant)-[:applied_for_loan]->(app_1:loan_application)
├── applicant_application_with_shared_employer.csv
│ # (loan_applicant_0:appliant)-[:worked_for]->(:corp)<-[:worked_for]-(loan_applicant_1:appliant)
│ # (loan_applicant_0:appliant)-[:applied_for_loan]->(app_0:loan_application)
│ # (loan_applicant_1:appliant)-[:applied_for_loan]->(app_1:loan_application)
├── applicant_application_connected_with_employer_and_phone_num.csv
│ # (loan_applicant_0:appliant)-[:worked_for]->(:corp)-[:with_phone_num]->(:phone_num)<-[:with_phone_num]-(loan_applicant_1:appliant)
│ # (loan_applicant_0:appliant)-[:applied_for_loan]->(app_0:loan_application)
│ # (loan_applicant_1:appliant)-[:applied_for_loan]->(app_1:loan_application)
├── corporation.csv # corporation vertex
├── device.csv # device vertex
├── is_relative_relationship.csv
│ # is_relative (:p)-[:is_related_to]->(:p)
├── person.csv # contact vertex
├── phone_number.csv # phone number vertex
├── shared_device_relationship.csv
│ # (:p)-[:used_dev]->(:dev)<-[:used_dev]-(:p)
├── shared_employer_relationship.csv
│ # (:p)-[:worked_for]->(:corp)<-[:worked_for]-(:p)
├── shared_phone_num_relationship.csv
│ # (:p)-[:with_phone_num]->(:phone_num)<-[:with_phone_num]-(:p)
└── shared_via_employer_phone_num_relationship.csv
# (:p)-[:worked_for]->(:corp)-[:with_phone_num]->(:phone_num)<-[:with_phone_num]-(:p)
```### How to play with the data:
https://github.com/wey-gu/fraud-detection-datagen/assets/1651790/95781467-7454-4763-b0c2-17e7b33e639a
See here: https://siwei.io/en/fraud-detection-with-nebulagraph/ for more!