https://github.com/aanastasiou/dgen

An object oriented approach to synthetic clinical data generation
Host: GitHub
URL: https://github.com/aanastasiou/dgen
Owner: aanastasiou
Created: 2017-04-20T06:53:17.000Z (about 8 years ago)
Default Branch: master
Last Pushed: 2017-04-23T18:02:17.000Z (about 8 years ago)
Last Synced: 2025-04-03T16:50:33.589Z (26 days ago)
Language: Python
Size: 36.1 KB
Stars: 2
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# DGen

DGen is a Python module for the generation (and degeneration) of synthetic
clinical data.

It achieves this by establishing a very basic algebra of random data generators
(and degenerators) and enabling the user to create more complex re-usable and
"replugable" data generators via inheritance.

To an extent, the syntax and overall design of DGen were influenced by
[PyParsing](http://pyparsing.wikispaces.com/), the Python module that enables
its users to build parsers.

Just as a PyParsing parser is a model for the strings it can recognise,
so a DGen generator is a model of *patient data*.

In fact, DGen ships with a very basic `Person` model that represents the
data most commonly encountered in epidemiology. DGen users can extend this
`Person` to create more complex participants with data representing their
[*Book Of Life*](https://books.google.co.uk/books?id=1iI5rHtCT-4C&lpg=PA180&ots=2kMSde3-kq&dq=%22book%20of%20life%22%20epidemiology&pg=PA180#v=onepage&q=%22book%20of%20life%22%20epidemiology&f=false) events.

## Quickstart

This quickstart covers the two basic packages of DGen:

1. Data Generators

2. Data Perturbators

### Data generation

To create a fictional postcode variable that conforms loosely to the UK
standard of postcodes:

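    from DGen.datagenerator import *

    # A model for postcodes: 2 letters, 2 digits (1-9), 2 letters
    postCode = revRegexGenerator("[A-Z][A-Z][1-9][1-9][A-Z][A-Z]").setVarName("Postcode")
    # Calling the model returns one generated postcode
    postCode()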

    

Note that at this point `postCode` is **NOT** an instance of a postcode but
rather a **model** for postcodes composed of 2 letters, 2 digits and 2 letters.

To obtain **an instance** of the model, the model is simply *called* (`postCode()`).
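
The distinction between a model and its instances can be sketched in plain Python, without DGen. The `PostcodeModel` class and its `_pick` helper below are hypothetical names chosen for illustration and are not part of DGen; the sketch only assumes that a model is a callable object that returns a fresh value on every call.

```python
import random
import re
import string

# Pattern the generated values are expected to follow (same shape as the example above).
POSTCODE_PATTERN = re.compile(r"[A-Z][A-Z][1-9][1-9][A-Z][A-Z]")


class PostcodeModel:
    """Hypothetical stand-in for a generator model; not DGen's own class."""

    def __init__(self, rng=None):
        # A dedicated random source keeps the model reproducible when seeded.
        self.rng = rng if rng is not None else random.Random()

    def _pick(self, alphabet, count):
        # Draw `count` characters independently from `alphabet`.
        return "".join(self.rng.choice(alphabet) for _ in range(count))

    def __call__(self):
        # Calling the model draws one instance: 2 letters, 2 digits (1-9), 2 letters.
        return (self._pick(string.ascii_uppercase, 2)
                + self._pick("123456789", 2)
                + self._pick(string.ascii_uppercase, 2))


post_code_model = PostcodeModel(random.Random(42))  # the model
first = post_code_model()                           # one instance
second = post_code_model()                          # another instance
```

The model object itself never holds a postcode; each call produces a new string that follows the pattern.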

#### Other generators

At the moment, the following generators have been defined:

* `optionGenerator`

    * `P = optionGenerator(["Male", "Female"]) # Generates strings Male, Female with equal probabilities`

    * `P = optionGenerator([(0.1,"Male"),(0.9,"Female")]) # Attaches a discrete probability to each event`

* `condProbOptionGenerator`

    * `P = condProbOptionGenerator({"Male":"Victor", "Female":"Victoria"}) #Given the gender, generate a name`

        * *Note*, the specific mechanics of this generator will become apparent further below

        

* `archivedOptionGenerator`

    * Exactly like an `optionGenerator` but reads options from an archive.

    

* `revRegexGenerator`

    * `P = revRegexGenerator("[0-9A-F][0-9A-F][0-9A-F][0-9A-F][0-9A-F][0-9A-F]") # Generates a random 6-digit number in hex`

    * *Note:* Reverse Regular Expressions provided by the excellent Python module [`rstr`](https://pypi.python.org/pypi/rstr/2.1.3).

* `uidGenerator`

    * `P = uidGenerator() # Generates universal identifiers`

    * *Note:* Essentially a repackaging of the excellent [uuid](https://docs.python.org/2/library/uuid.html) module.

    

* `seqGenerator`

    * `P = seqGenerator("ABCD*EFG", maxNum=8) # Generates a sequence of length 8 given an iterable of options`

* `dateGenerator`

    * `P = dateGenerator(datetime.datetime.now()-datetime.timedelta(weeks=4), datetime.datetime.now()) # Generates a date within the last four weeks`
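
Two of the ideas in the list above, weighted options and dates drawn from an interval, can be sketched with the standard library alone. The function names `weighted_option` and `random_date_between` are assumptions made for this illustration; they are not DGen functions and DGen may implement these generators differently.

```python
import datetime
import random


def weighted_option(pairs, rng=random):
    """Pick one value from (probability, value) pairs, mirroring the tuple syntax."""
    weights = [p for p, _ in pairs]
    values = [v for _, v in pairs]
    return rng.choices(values, weights=weights, k=1)[0]


def random_date_between(start, end, rng=random):
    """Return a datetime drawn uniformly from the interval between start and end."""
    span_seconds = (end - start).total_seconds()
    return start + datetime.timedelta(seconds=rng.uniform(0, span_seconds))


rng = random.Random(0)
now = datetime.datetime.now()
four_weeks_ago = now - datetime.timedelta(weeks=4)

gender = weighted_option([(0.1, "Male"), (0.9, "Female")], rng)
recent = random_date_between(four_weeks_ago, now, rng)
```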

    

#### Combining generators

All of the above generators can also be combined with each other, either 

as parameters to other generators or through the use of operator overloading. 

For example:

`P = optionGenerator([optionGenerator(["A","B"]), revRegexGenerator("[0-9][0-9]")])`

`P` is now a model which will generate **either** a letter (**A** or **B**) **or** a two-digit string between `00` and `99`.
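
Nesting one generator inside another can be sketched as follows. This is a minimal illustration that assumes a model is any zero-argument callable; `make_option_model` is a hypothetical helper, not a DGen function.

```python
import random


def make_option_model(options, rng):
    """Return a zero-argument model; options may be plain values or other models."""
    def model():
        chosen = rng.choice(options)
        # A nested model is itself callable, so call it to obtain a concrete value.
        return chosen() if callable(chosen) else chosen
    return model


rng = random.Random(1)
letter = make_option_model(["A", "B"], rng)

def two_digits():
    # Zero-padded two-digit string, comparable to the pattern [0-9][0-9].
    return "{:02d}".format(rng.randrange(100))

combined = make_option_model([letter, two_digits], rng)
samples = [combined() for _ in range(20)]
```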

To generate the cartesian product between sets `U = {"Male", "Female"}` and
`V = {"Prostate", "Pregnant"}` with specific probabilities, we might define
something like:

    P = optionGenerator(["Male","Female"]).setVarName("Gender")

    Q = optionGenerator(["Prostate", "Pregnant"]).setVarName("Condition")

    K = (P * Q).setVarName("GenderCondition")

    

`K` is now a model, a composite data generator, created through the
multiplication operator. It will generate all possible combinations with equal
chances, that is:
`("MaleProstate", "MalePregnant", "FemaleProstate", "FemalePregnant")`.
There are 4 events and therefore a probability of 0.25 is assigned to each.
The probabilities can be altered by using the tuple syntax of
`optionGenerator`'s constructor.
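
The arithmetic behind the product can be worked through in plain Python. The sketch assumes the two variables are independent, so each joint probability is the product of the marginals; the non-uniform gender weights are illustrative values, and whether DGen's `*` operator combines weights in exactly this way is an assumption.

```python
from itertools import product

# Marginal probabilities for each variable (illustrative values, not taken from DGen).
gender = {"Male": 0.1, "Female": 0.9}
condition = {"Prostate": 0.5, "Pregnant": 0.5}

# Under independence, each joint event has probability P(gender) * P(condition).
joint = {
    g + c: pg * pc
    for (g, pg), (c, pc) in product(gender.items(), condition.items())
}
```

With uniform marginals of 0.5 each, the same computation gives 0.25 for every one of the four events.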

Obviously, persons of a male gender at birth are more likely to suffer from
a prostate-related condition and, similarly, persons of a female gender at
birth are more likely to become pregnant.

To represent this properly, we need a conditional generator, like this:

    P = optionGenerator(["Male","Female"]).setVarName("Gender")

    Q = condProbOptionGenerator({"Male":optionGenerator(["Prostate", "Hairloss"]), "Female":optionGenerator(["Pregnant", "Menstruation"])})

    K = (Q | P).setVarName("GenderCondition")

    

`K` is now a model that represents the conditional probability of **Q given P**.

In this case, the final generator creates the eventualities:

`("MaleProstate", "MaleHairloss", "FemalePregnant", "FemaleMenstruation")`
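
Conditional generation can be sketched as a two-step draw: first the conditioning variable, then an option that depends on it. The function names below are hypothetical and the code is not DGen's implementation; it only illustrates why combinations such as `MalePregnant` cannot occur.

```python
import random

# Options available for each value of the conditioning variable (based on the example above).
conditions_given_gender = {
    "Male": ["Prostate", "Hairloss"],
    "Female": ["Pregnant", "Menstruation"],
}


def sample_gender_condition(rng):
    """Draw the gender first, then a condition from that gender's own options."""
    gender = rng.choice(sorted(conditions_given_gender))
    condition = rng.choice(conditions_given_gender[gender])
    return gender + condition


def all_outcomes():
    """Enumerate every event the conditional model can produce."""
    return [g + c for g, options in conditions_given_gender.items() for c in options]


rng = random.Random(7)
draws = [sample_gender_condition(rng) for _ in range(50)]
```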

The difference between the cartesian product and the conditional probability
generator is that between a [clique](https://en.wikipedia.org/wiki/Clique_(graph_theory)) and a [tree](https://en.wikipedia.org/wiki/Tree_(graph_theory)).

Finally, data generators can be *XORed* together:

    P = optionGenerator(["Alpha", "Beta"])

    Q = optionGenerator(["Gamma", "Delta"])

    K = (P^Q).setVarName("Combined")

    

`K` is now a model that creates the eventualities of `P XOR Q` or, more
generally, `P1 XOR P2 XOR P3 ... XOR Pn`.
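
One plausible reading of the XOR combination, by analogy with the `^` alternation operator in PyParsing, is that exactly one of the operand models contributes to each generated instance. The sketch below follows that reading; it is an assumption rather than a description of DGen's behaviour, and `xor_model` is a hypothetical helper.

```python
import random


def xor_model(models, rng):
    """Return a model that emits an instance from exactly one of the given models."""
    def model():
        # Exactly one operand contributes to each generated instance.
        return rng.choice(models)()
    return model


rng = random.Random(3)

def p():
    return rng.choice(["Alpha", "Beta"])

def q():
    return rng.choice(["Gamma", "Delta"])

k = xor_model([p, q], rng)
samples = [k() for _ in range(30)]
```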

### Data degeneration

Similarly to the above examples, let's create a fictional postcode variable
that suffers from punctuation errors:

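    from DGen.datagenerator import *
    from DGen.dataperturbator import *

    # A model for postcodes and a perturbator that triggers with probability 0.8
    postCode = revRegexGenerator("[A-Z][A-Z][1-9][1-9][A-Z][A-Z]").setVarName("Postcode")
    postCodeDemon = punctuationPerturbator(prob = 0.8)

    P = postCode()        # a pristine postcode instance
    Q = postCodeDemon(P)  # the same instance after passing through the perturbator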

    

In the above example, `P` is a pristine **instance** of a postcode, while `Q`
is a perturbed version of `P` that may suffer a punctuation error.

The common element of all perturbators is a *probability of occurrence*
parameter, which determines how often the error is supposed to appear. In this
example, `prob = 0.8`, which means that out of 100 generated instances of
`postCode`, about 80 would be expected to suffer a punctuation error.
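
The role of the probability parameter can be sketched with a small stand-alone class. `ProbabilisticPerturbator` and `insert_full_stop` are hypothetical names, and the specific error (a stray full stop in the middle of the value) is an assumption made for illustration; DGen's `punctuationPerturbator` may corrupt values differently.

```python
import random


class ProbabilisticPerturbator:
    """Hypothetical perturbator: applies `corrupt` with probability `prob`."""

    def __init__(self, corrupt, prob, rng=None):
        if not 0.0 <= prob <= 1.0:
            raise ValueError("prob must lie in [0, 1]")
        self.corrupt = corrupt
        self.prob = prob
        self.rng = rng if rng is not None else random.Random()

    def __call__(self, value):
        # Trigger the error on a fraction `prob` of calls; otherwise pass the value through.
        if self.rng.random() < self.prob:
            return self.corrupt(value)
        return value


def insert_full_stop(value):
    """Illustrative punctuation error: a stray full stop in the middle of the value."""
    middle = len(value) // 2
    return value[:middle] + "." + value[middle:]


demon = ProbabilisticPerturbator(insert_full_stop, prob=0.8, rng=random.Random(11))
outputs = [demon("AB12CD") for _ in range(10000)]
error_rate = sum(out != "AB12CD" for out in outputs) / len(outputs)
```

Over many calls the observed `error_rate` settles close to `prob`, while any single call either returns the value untouched or corrupted.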

For more information on the specific data perturbation scenarios modeled by
DGen, please see [Linking Data for Health Services Research: A Framework and Instructional Guide](https://www.ncbi.nlm.nih.gov/books/NBK253312/).

#### Other degenerators

At the moment, the following degenerators have been defined:

* `subsPerturbator`

    * `P = subsPerturbator([("Avenue", "Avn"),("Robert", "Bob"),("William", "Bill")]) #When triggered, substitutes (from,to)`

* `prefixPerturbator`

    * `P = prefixPerturbator(["Mr", "Sir", "Dr", "Baron"]) #When triggered, adds one of the prefixes to its output`

    

* `suffixPerturbator`

    * Self explanatory, given the operation of `prefixPerturbator`

* `missingDataPerturbator`

    * `P = missingDataPerturbator("-") #When triggered, outputs a predefined missing data symbol instead of the value.`
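
The transformations these degenerators apply once triggered can be sketched as plain functions. The function names, the choice to replace every occurrence, and the space used as a separator after a prefix are all assumptions made for this illustration, not DGen's documented behaviour.

```python
def substitute(value, pairs):
    """Replace each `source` with its `target`, as in ("Avenue", "Avn")."""
    for source, target in pairs:
        value = value.replace(source, target)
    return value


def add_prefix(value, prefix):
    """Prepend a title-like prefix, separated by a space."""
    return prefix + " " + value


def mark_missing(_value, symbol="-"):
    """Discard the value entirely and return the missing-data symbol."""
    return symbol


address = substitute("Robert Smith, 4 Park Avenue", [("Avenue", "Avn"), ("Robert", "Bob")])
titled = add_prefix("William Jones", "Dr")
blank = mark_missing("William Jones")
```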

## Creating more complex data generators

To create more complex data generators, one generally derives from
`randomDataGenerator`, the abstract class that defines all behaviour expected
of a data generator. However, it is up to the user of DGen to further refine
the algebra of derived `randomDataGenerator`s.

A very simple example of this is the `Person` class, available from `epi`.
A more extensive example of how DGen can be used to piece together more
complex generators is available in the `examples/` folder.
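
The general pattern of deriving composite generators from an abstract base class, with operators overloaded to combine them, can be sketched as follows. Every class name here (`BaseGenerator`, `Option`, `Concatenation`, `SimplePerson`) is hypothetical, as are the variables in the record; the sketch does not reproduce `randomDataGenerator` or the `Person` class shipped with DGen.

```python
import random
from abc import ABC, abstractmethod


class BaseGenerator(ABC):
    """Hypothetical base class; stands in for the role `randomDataGenerator` plays."""

    @abstractmethod
    def __call__(self):
        """Return one generated instance."""

    def __mul__(self, other):
        # Overloading `*` lets two models be composed into a third one.
        return Concatenation(self, other)


class Option(BaseGenerator):
    """Chooses uniformly among a fixed list of options."""

    def __init__(self, options, rng):
        self.options = options
        self.rng = rng

    def __call__(self):
        return self.rng.choice(self.options)


class Concatenation(BaseGenerator):
    """Joins one instance from each of two models."""

    def __init__(self, left, right):
        self.left = left
        self.right = right

    def __call__(self):
        return self.left() + self.right()


class SimplePerson(BaseGenerator):
    """A derived generator that bundles several variables into one record."""

    def __init__(self, rng):
        self.gender = Option(["Male", "Female"], rng)
        self.blood_group = Option(["A", "B", "AB", "O"], rng) * Option(["+", "-"], rng)

    def __call__(self):
        return {"Gender": self.gender(), "BloodGroup": self.blood_group()}


person_model = SimplePerson(random.Random(5))
record = person_model()
```

A further subclass could add more fields or override `__call__` to introduce dependencies between them, which is the kind of refinement the text leaves to the user.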

## Where to go from here

The module is extensively documented in `doc/`, including a draft TODO list.