{"id":21514839,"url":"https://github.com/getindata/doge-datagen","last_synced_at":"2025-06-18T19:38:27.118Z","repository":{"id":40429795,"uuid":"457769685","full_name":"getindata/doge-datagen","owner":"getindata","description":null,"archived":false,"fork":false,"pushed_at":"2022-12-16T19:06:02.000Z","size":183,"stargazers_count":19,"open_issues_count":0,"forks_count":4,"subscribers_count":7,"default_branch":"develop","last_synced_at":"2025-04-09T20:11:30.230Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/getindata.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-02-10T12:23:54.000Z","updated_at":"2025-04-02T13:20:50.000Z","dependencies_parsed_at":"2023-01-29T15:46:02.159Z","dependency_job_id":null,"html_url":"https://github.com/getindata/doge-datagen","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/getindata%2Fdoge-datagen","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/getindata%2Fdoge-datagen/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/getindata%2Fdoge-datagen/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/getindata%2Fdoge-datagen/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/getindata","download_url":"https://codeload.github.com/getindata/doge-datagen/tar.gz/refs/heads/develop","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_co
unt":248103872,"owners_count":21048245,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-23T23:53:11.401Z","updated_at":"2025-04-09T20:11:34.996Z","avatar_url":"https://github.com/getindata.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data Online Generator (doge_datagen)\n\n[![Python Version](https://img.shields.io/badge/python-3.8-blue.svg)](https://github.com/getindata/doge-datagen)\n[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)\n[![SemVer](https://img.shields.io/badge/semver-2.0.0-green)](https://semver.org/)\n[![PyPI version](https://badge.fury.io/py/doge-datagen.svg)](https://pypi.org/project/doge-datagen/)\n[![Downloads](https://pepy.tech/badge/doge-datagen)](https://pepy.tech/badge/doge-datagen)\n\n## Description\nDataOnlineGenerator can be used to simulate user, system or other actor behaviour based on a probabilistic model. \nIt is a state machine which is traversed by multiple subjects automatically, based on the probability defined for each \npossible transition. \n\nEach Subject instance can hold additional attributes that can be modified during a transition with the use of action_callback.\nAn action callback can also be used to make a transition fail, so that the subject remains in its current state.\n\nIn a transition definition, event_sinks can be passed; they are called on a successful transition to log, generate an event,\nor perform other actions.\n\nThe state machine works in ticks whose length and number are defined in the constructor. 
In each tick DataOnlineGenerator\nevaluates each subject and makes a transition based on the provided probabilities. The sum of the probabilities of the \ntransitions from a given state must be less than or equal to 100. If they sum to less than 100, \nthe remaining value is treated as the probability of staying in the same state in the given tick.\n\n## Installation\n\nThe package is available on PyPI: https://pypi.org/project/doge-datagen/\n\n```shell\npip install -U doge-datagen\n```\n\n## Usage\n\n### DataOnlineGenerator\n*Please refer to examples for full reference.*\n\nLet's consider a user of a banking application who might take or consider taking a loan. Such a user has an\naccount balance and a loan balance, and can receive income and spend money. The user can also open the\nbanking application, open a loan screen, and possibly take a loan or exit the application. \n\nWe can model this behaviour as states and transitions:\n![State machine](doc/states.png)\n\nAnd define it as code:\n\n```python\ndatagen = DataOnlineGenerator(['offline', 'online', 'loan_screen'], 'offline', UserFactory(), 10, 60000, 1000)\ndatagen.add_transition('income', 'offline', 'offline', 0.01,\n                       action_callback=income_callback, event_sinks=[balance_sink])\ndatagen.add_transition('spending', 'offline', 'offline', 0.1,\n                       action_callback=spending_callback, event_sinks=[trx_sink, balance_sink])\ndatagen.add_transition('login', 'offline', 'online', 0.1, event_sinks=[clickstream_sink])\ndatagen.add_transition('logout', 'online', 'offline', 70, event_sinks=[])\ndatagen.add_transition('open_loan_screen', 'online', 'loan_screen', 30, event_sinks=[clickstream_sink])\ndatagen.add_transition('close_loan_screen', 'loan_screen', 'online', 40, event_sinks=[clickstream_sink])\ndatagen.add_transition('take_loan', 'loan_screen', 'online', 10,\n                       action_callback=take_loan_callback, event_sinks=[clickstream_sink, loan_sink, 
balance_sink])\n```\n\nBesides defining the state machine, we also provide a factory that is called to generate Subjects (Users in the \nexample above). We also provide the tick length (1 min) and the number of ticks (1000).\n\nA login probability of 0.1 [%] can be interpreted as follows: on average, 1/1000 of all users log in each minute, \nor a specific User logs in to the app on average once every 1000 minutes (a little more often than once a day).\n\n### Generator state graph\nData Online Generator provides a convenient way to present the created data model: the\ngraph renderer module. This module parses the state machine definition and renders its graph. \nThe nodes are labelled with the state names, while the edges are labelled with the trigger name and transition probability.\n\n#### Usage\nTo render the generator's state graph, define and use the graph renderer:\n\n```python\ngraph = DataOnlineGeneratorGrapher(datagen)\ngraph.render()\n```\n\nThis gives us the following result:\n![Grapher output](doc/grapher.png)\n\n### Sink factories\n\n#### Printing sink\nA simple sink that can be used to print transition results on screen. It requires a format function that converts \ntransition details into a string.\n\n```python\ndef format_function(timestamp: int, user: Subject, transition: Transition) -\u003e str:\n    return '[{}] User id: {}, balance: {}, loan_balance: {} made a transition {} from {} to {}'\\\n        .format(timestamp,\n                user.user_id,\n                user.balance,\n                user.loan_balance,\n                transition.trigger,\n                transition.from_state,\n                transition.to_state)\n\nsink = PrintingSink(format_function)\n```\n\n#### Kafka sink\nThe Kafka sink allows emitting events into Kafka topics. By default, it uses string serializers, which can be overridden by\npassing different serializers to the `create` method. 
It requires `key_function` and `value_function` to be provided;\nthey convert transition details into a format that is serializable by the configured serializers.\n\n```python\nimport json\n\ndef key_function(subject: Subject, transition: Transition) -\u003e str:\n    return str(subject.user_id)\n\ndef value_function(timestamp: int, subject: Subject, transition: Transition) -\u003e str:\n    value = {\n        'timestamp': timestamp,\n        'user': {\n            'user_id': subject.user_id,\n            'balance': subject.balance,\n            'loan_balance': subject.loan_balance\n        },\n        'event': transition.trigger\n    }\n    return json.dumps(value)\n\nfactory = KafkaSinkFactory(['localhost:9092'], 'doge-kafka-example')\nsink = factory.create('test_topic', key_function, value_function)\n```\n\n#### Kafka Avro sink\n`KafkaAvroSinkFactory` is a convenience factory that wraps the regular `KafkaSinkFactory` and hides the creation\ndetails of the classes needed to push Avro events into Kafka. It requires functions that convert transition details into a\nformat suitable for `AvroSerializer` (typically a structure of nested dicts, sometimes with tuples or type\nhints in case of type unions in the schema. 
\nSee the [fastavro documentation](https://fastavro.readthedocs.io/en/latest/writer.html#using-the-tuple-notation-to-specify-which-branch-of-a-union-to-take))\nand Avro schemas for key and value to be provided.\n\n```python\nfrom typing import Any, Dict\n\ndef key_function(subject: Subject, transition: Transition) -\u003e Dict[str, Any]:\n    return {'key': str(subject.user_id)}\n\ndef value_function(timestamp: int, subject: Subject, transition: Transition) -\u003e Dict[str, Any]:\n    value = {\n        'timestamp': timestamp,\n        'user': {\n            'userId': str(subject.user_id),\n            'balance': str(subject.balance),\n            'loanBalance': str(subject.loan_balance)\n        },\n        'event': transition.trigger\n    }\n    return value\n\ndef get_schema(schema_path):\n    with open(schema_path) as f:\n        return f.read()\n\nkey_schema = get_schema('./avro/Key.avsc')\nevent_schema = get_schema('./avro/Event.avsc')\n\nfactory = KafkaAvroSinkFactory(['localhost:9092'], 'http://localhost:8081', 'doge-kafka-example')\nsink = factory.create('test_avro_topic', key_function, key_schema, value_function, event_schema)\n```\n\n#### DB sink\n`DbSinkFactory` uses SQLAlchemy Core. 
It requires a DB URL in a format accepted by SQLAlchemy and a function that converts\ntransition details into a flat dict of values.\n\n```python\nfrom typing import Any, Dict\n\ndef row_mapper_function(timestamp: int, subject: Subject, transition: Transition) -\u003e Dict[str, Any]:\n    row = {\n        'timestamp': timestamp,\n        'user_id': subject.user_id,\n        'balance': subject.balance,\n        'loan_balance': subject.loan_balance,\n        'event': transition.trigger\n    }\n    return row\n\nfactory = DbSinkFactory('postgresql://postgres:postgres@localhost:5432/postgres')\nsink = factory.create('events', row_mapper_function)\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgetindata%2Fdoge-datagen","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgetindata%2Fdoge-datagen","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgetindata%2Fdoge-datagen/lists"}