https://github.com/adamrossnelson/fictionalsurveyresponses

Generates synthetic survey data with control over how the underlying factors are specified. Lets researchers, data scientists, and others create survey datasets with known underlying factor structures.
Host: GitHub
URL: https://github.com/adamrossnelson/fictionalsurveyresponses
Owner: adamrossnelson
License: mit
Created: 2025-04-06T18:28:00.000Z (about 1 month ago)
Default Branch: main
Last Pushed: 2025-04-06T19:27:20.000Z (about 1 month ago)
Last Synced: 2025-04-06T19:35:16.022Z (about 1 month ago)
Language: Jupyter Notebook
Size: 788 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# Fictional Survey Data Generator

Generates synthetic survey data with control over how the underlying factors are specified. Lets researchers, data scientists, and others create survey datasets with known underlying factor structures. Useful for testing factor analysis methods, developing data visualization techniques, or teaching statistics and psychometrics.

## Features

- Generate data with any number of latent factors

- Control the distribution of each factor (normal distribution or custom probability distribution)

- Specify the number of survey items (questions) for each factor

- Add noise to create variability in responses

- Customize response scales (e.g., 1-5 Likert scale or other ranges)

- Reproducible results via random seed specification

## Installation

The project is in beta. To install it, download the module file into your working directory:

### Windows

```powershell
# Download the Python file
Invoke-WebRequest -Uri "https://raw.githubusercontent.com/adamrossnelson/FictionalSurveyResponses/main/FictionalDataGenerator.py" -OutFile "FictionalDataGenerator.py"
```

### macOS/Linux

```bash
# Download the Python file
curl -O https://raw.githubusercontent.com/adamrossnelson/FictionalSurveyResponses/main/FictionalDataGenerator.py
```

## Quick Start

```python
from FictionalDataGenerator import MakeData

# Create a data generator with 1000 respondents
maker = MakeData(n_subjects=1000, seed=42)

# Add factors with their respective items
maker.add_factor(name="satisfaction", n_items=5)
maker.add_factor(name="engagement", n_items=5)
maker.add_factor(name="leadership", n_items=5)

# Generate the data
df = maker.run()

# View the first few rows
print(df.head())
```

## Detailed Usage Guide

### Core Concepts

#### Factors and Items

In survey analysis:

- **Factors** are latent (unobserved) variables that explain patterns in responses across multiple survey items

- **Items** are individual survey questions that load onto one or more factors

This library creates datasets where each factor influences a specific set of items, with added noise to simulate survey responses.
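As a rough illustration of this idea, the sketch below derives item responses from a single latent value by adding noise and clamping the result to the response scale. It uses only the standard library and is not the library's actual implementation; the function name `simulate_items` and the uniform draw from `noise_range` are assumptions made for this example.

```python
import random


def simulate_items(factor_value, n_items, noise_range, min_val=1, max_val=5, rng=None):
    """Return n_items responses scattered around one latent factor value.

    Hypothetical helper for illustration; not part of FictionalDataGenerator.
    """
    if rng is None:
        rng = random.Random()
    responses = []
    for _ in range(n_items):
        noisy = factor_value + rng.choice(noise_range)  # assumed: uniform pick from the list
        responses.append(max(min_val, min(max_val, noisy)))  # clamp to the response scale
    return responses


rng = random.Random(42)
items = simulate_items(factor_value=4, n_items=5, noise_range=[-1, 0, 0, 0, 1], rng=rng)
print(items)  # five values between 3 and 5
```

Because every item starts from the same latent value, items belonging to one factor end up correlated with each other, which is the structure a factor analysis is expected to recover.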

#### Distribution Types

You can generate factors using two distribution types:

1. **Normal distribution** - Factor values are drawn from a normal distribution with specified mean and standard deviation

2. **Custom probability distribution** - Specify the probabilities for each possible value
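The two options can be pictured with the standard library alone. This sketch is an assumption about the general approach, not the library's code: the normal draw is rounded and clamped to a 1-5 scale, and the custom draw uses `random.choices` with the probabilities as weights. The helper names `draw_normal` and `draw_custom` are hypothetical.

```python
import random

rng = random.Random(0)
scale = [1, 2, 3, 4, 5]


def draw_normal(mean=3.0, std=1.0):
    """Draw one value from a normal distribution, rounded and clamped to the scale."""
    value = round(rng.gauss(mean, std))
    return max(scale[0], min(scale[-1], value))


def draw_custom(probabilities):
    """Draw one value from the scale using explicit per-value probabilities."""
    return rng.choices(scale, weights=probabilities, k=1)[0]


normal_sample = [draw_normal(mean=4.2, std=0.7) for _ in range(1000)]
custom_sample = [draw_custom([0.05, 0.15, 0.30, 0.35, 0.15]) for _ in range(1000)]

print(sum(normal_sample) / len(normal_sample))  # roughly 4.2
print(sum(custom_sample) / len(custom_sample))  # roughly 3.4, the expected value of the list
```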

### Creating a MakeData Object

```python
from FictionalDataGenerator import MakeData

# Basic initialization
maker = MakeData()

# With custom parameters
maker = MakeData(
    n_subjects=500,      # Number of survey respondents
    seed=123             # Random seed for reproducibility
)
```
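The role of `seed` can be illustrated without the library. In the sketch below (standard library only; `generate` is a hypothetical stand-in, not the `MakeData` implementation), two runs with the same seed produce identical responses.

```python
import random


def generate(n_subjects, seed=None):
    """Hypothetical stand-in for a seeded generator: one 1-5 response per subject."""
    rng = random.Random(seed)  # with seed=None the generator is seeded unpredictably
    return [rng.randint(1, 5) for _ in range(n_subjects)]


first = generate(500, seed=123)
second = generate(500, seed=123)
print(first == second)  # True: same seed, same sequence
```

Fixing the seed is useful when a dataset needs to be regenerated exactly, for example in teaching material or regression tests.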

### Adding Factors

The `add_factor()` method adds a new factor with specified properties:

```python
# Adding a factor with default parameters (normal distribution)
maker.add_factor(
    name="satisfaction",  # Name of the factor
    n_items=4             # Number of survey items for this factor
)

# Adding a factor with a custom probability distribution
# This creates a left-skewed distribution (5 values with corresponding probabilities)
maker.add_factor(
    name="difficulty",
    n_items=3,
    distribution=[0.05, 0.15, 0.30, 0.35, 0.15]  # Probabilities for values 1-5
)

# Adding a factor with a custom normal distribution
maker.add_factor(
    name="engagement",
    n_items=5,
    distribution="normal",
    mean=4.2,             # Higher mean value
    std=0.7               # Custom standard deviation
)

# Adding a factor with custom response range and noise
maker.add_factor(
    name="agreement",
    n_items=4,
    min_val=0,             # Minimum value (default is 1)
    max_val=6,             # Maximum value (default is 5)
    noise_range=[-1, 0, 0, 0, 1]  # Custom noise distribution
)
```
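When writing a custom probability list, it can help to check that the probabilities sum to 1 and to confirm the direction of skew. The helper below is a hypothetical standard-library sketch (`describe_distribution` is not part of the library); it assumes the list maps onto consecutive integer values starting at `min_val`.

```python
import math


def describe_distribution(probabilities, min_val=1):
    """Return the mean and third central moment of a discrete distribution."""
    if not math.isclose(sum(probabilities), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    values = range(min_val, min_val + len(probabilities))
    mean = sum(v * p for v, p in zip(values, probabilities))
    third_moment = sum(p * (v - mean) ** 3 for v, p in zip(values, probabilities))
    return mean, third_moment


mean, third_moment = describe_distribution([0.05, 0.15, 0.30, 0.35, 0.15])
print(round(mean, 2))    # 3.4
print(third_moment < 0)  # True: a negative third moment indicates left (negative) skew
```

A positive third moment would indicate the opposite case, where most of the probability sits on low values and the tail extends toward the high end of the scale.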

### Method Chaining

Because `add_factor()` returns the `MakeData` instance, calls can be chained for more compact code:

```python
maker = MakeData(n_subjects=1000).add_factor(
    name="quality",
    n_items=3
).add_factor(
    name="usefulness",
    n_items=4
)
```
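Chaining works when each configuration method returns the object it was called on. The minimal class below shows that pattern; `SurveyBuilder` is a hypothetical stand-in for illustration, not the `MakeData` source.

```python
class SurveyBuilder:
    """Hypothetical builder that records factor definitions."""

    def __init__(self, n_subjects=1000):
        self.n_subjects = n_subjects
        self.factors = {}

    def add_factor(self, name, n_items=4):
        self.factors[name] = n_items
        return self  # returning self is what makes chaining possible


builder = (
    SurveyBuilder(n_subjects=1000)
    .add_factor("quality", n_items=3)
    .add_factor("usefulness", n_items=4)
)
print(builder.factors)  # {'quality': 3, 'usefulness': 4}
```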

### Generating Data

Once configured, call the `run()` method to generate the data:

```python
# Generate the data
df = maker.run()

# The dataframe contains:
# - subject_id column
# - One column for each factor (e.g., "satisfaction", "engagement")
# - Multiple item columns for each factor (e.g., "satisfaction_1", "satisfaction_2")
```
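Given that layout, columns can be sorted into factor, item, and other groups by name. The sketch below works on a plain list of column names (in practice, `list(df.columns)`); `split_columns` is a hypothetical helper, and the `<factor>_<number>` naming is taken from the comments in the block above. A simple `'_' in col` test is not enough on its own, because it also matches `subject_id`.

```python
def split_columns(columns, factor_names):
    """Group column names into factor, item, and other columns."""
    groups = {"factors": [], "items": [], "other": []}
    for col in columns:
        prefix, _, suffix = col.rpartition("_")
        if col in factor_names:
            groups["factors"].append(col)
        elif prefix in factor_names and suffix.isdigit():
            groups["items"].append(col)  # e.g. "satisfaction_1"
        else:
            groups["other"].append(col)  # e.g. "subject_id"
    return groups


columns = ["subject_id", "satisfaction", "satisfaction_1", "satisfaction_2",
           "engagement", "engagement_1"]
groups = split_columns(columns, factor_names={"satisfaction", "engagement"})
print(groups["items"])  # ['satisfaction_1', 'satisfaction_2', 'engagement_1']
```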

### Examples

#### Creating a 3-Factor Personality Survey

```python
from FictionalDataGenerator import MakeData
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set Seaborn style
sns.set_theme(style="whitegrid")

# Create a personality survey with 3 factors
maker = MakeData(n_subjects=2000, seed=42)

# Add three personality factors
maker.add_factor(
    name="extraversion",
    n_items=5,
    distribution="normal",
    mean=3.0,
    std=1.2
).add_factor(
    name="agreeableness",
    n_items=5,
    distribution="normal",
    mean=3.5,
    std=0.9
).add_factor(
    name="conscientiousness",
    n_items=5,
    distribution="normal",
    mean=3.8,
    std=1.0
)

# Generate the data
df = maker.run()

# Compute correlations between items
# (subject_id also contains an underscore, so exclude it explicitly)
item_cols = [col for col in df.columns if '_' in col and col != 'subject_id']
corr_matrix = df[item_cols].corr()

# Visualize correlation matrix using Seaborn
plt.figure(figsize=(8, 7))
sns.heatmap(corr_matrix, annot=False, cmap='coolwarm', vmin=-1, vmax=1, center=0)
plt.title('Correlation Matrix of Survey Items', fontsize=16)
plt.tight_layout()
plt.show()

# Visualize the distribution of factor values
plt.figure(figsize=(8, 3))
factor_cols = ['extraversion', 'agreeableness', 'conscientiousness']
for i, factor in enumerate(factor_cols, 1):
    plt.subplot(1, 3, i)
    sns.histplot(df[factor], kde=True, color=sns.color_palette("husl", 3)[i-1])
    plt.title(f'{factor.capitalize()} Distribution', fontsize=14)
    plt.xlabel('Score', fontsize=12)
plt.tight_layout()
plt.show()
```

#### Customer Satisfaction Survey

```python
from FictionalDataGenerator import MakeData
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set Seaborn style
sns.set_theme(style="ticks")

# Create a customer satisfaction survey
maker = MakeData(n_subjects=1000)

# Product quality factor (normally distributed)
maker.add_factor(
    name="quality",
    n_items=4,
    distribution="normal",
    mean=3.8,  # People generally rate product quality well
    std=0.9
)

# Customer service factor (bimodal - people either love it or hate it)
maker.add_factor(
    name="service",
    n_items=3,
    distribution=[0.30, 0.10, 0.05, 0.15, 0.40]  # U-shaped distribution
)

# Value for money factor (positively skewed: most ratings are low, with a tail toward 5)
maker.add_factor(
    name="value",
    n_items=3,
    distribution=[0.25, 0.30, 0.25, 0.15, 0.05]
)

# Generate the data
df = maker.run()

# Calculate mean scores for each factor's items
for factor in ['quality', 'service', 'value']:
    item_cols = [col for col in df.columns if col.startswith(f"{factor}_")]
    df[f"{factor}_score"] = df[item_cols].mean(axis=1)

# Create a pairplot of the factor scores
sns.pairplot(
    df[["quality_score", "service_score", "value_score"]],
    kind='scatter',
    diag_kind='kde',
    plot_kws={'alpha': 0.6, 's': 20, 'edgecolor': 'k', 'linewidth': 0.5},
    corner=True
)
plt.suptitle('Relationships Between Factor Scores', y=1.02, fontsize=16)
plt.tight_layout()
plt.show()

# Create a violin plot showing the distribution of each factor
plt.figure(figsize=(10, 6))
factor_scores = pd.melt(
    df[["quality_score", "service_score", "value_score"]],
    var_name='Factor',
    value_name='Score'
)
sns.violinplot(
    x="Factor",
    y="Score",
    data=factor_scores,
    palette="Set2",
    inner="quartile"
)
plt.title('Distribution of Customer Satisfaction Factors', fontsize=16)
plt.ylabel('Average Score', fontsize=14)
plt.xlabel('', fontsize=14)
plt.xticks(plt.xticks()[0], ['Product Quality', 'Customer Service', 'Value for Money'], fontsize=12)
plt.ylim(0.5, 5.5)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
```

## Advanced Usage

### Customizing Noise Patterns

The `noise_range` parameter controls how much random variation is added to each item:

```python
# Items that closely follow the factor (less noise)
maker.add_factor(
    name="accuracy",
    n_items=3,
    noise_range=[-1, 0, 0, 0, 0, 1]  # Mostly zeros = less noise
)

# Items with more variability (more noise)
maker.add_factor(
    name="relevance",
    n_items=3,
    noise_range=[-2, -1, -1, 0, 1, 1, 2]  # More non-zero values = more noise
)
```
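One way to compare noise settings is to look at the spread of the values in each list. Assuming each noise value is drawn uniformly from the list (an assumption about the library's behavior, not something confirmed here), the population variance of the list equals the variance of the noise added to each item.

```python
import statistics

low_noise = [-1, 0, 0, 0, 0, 1]
high_noise = [-2, -1, -1, 0, 1, 1, 2]

# Population variance of a uniform draw from each list.
low_var = statistics.pvariance(low_noise)
high_var = statistics.pvariance(high_noise)

print(round(low_var, 3))   # 0.333
print(round(high_var, 3))  # 1.714
```

Under that assumption, the second list adds roughly five times as much noise variance as the first, so items generated with it should correlate less strongly with their factor.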

### Working with Generated Data

The `run()` method returns the generated data. After calling `run()`, you can retrieve the same data again with the `get_data()` method.

```python
# Generate data
df = maker.run()

# Save to CSV
df.to_csv("survey_data.csv", index=False)

# Get just the survey items (excluding factor columns and subject_id)
# subject_id also contains an underscore, so it must be excluded explicitly
item_cols = [col for col in df.columns if '_' in col and col != 'subject_id']
item_data = df[item_cols]

# Calculate summary statistics
print(item_data.describe())

# Get the data later without regenerating
same_df = maker.get_data()  # Raises an error if run() has not been called
```
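The relationship between `run()` and `get_data()` follows a common generate-then-cache pattern. The class below is a hypothetical sketch of that pattern using the standard library; the `RuntimeError` type and the internal `_data` attribute are assumptions, since the text above only states that an error is raised.

```python
import random


class CachedGenerator:
    """Hypothetical generator that caches the result of run()."""

    def __init__(self, n_subjects=1000, seed=None):
        self.n_subjects = n_subjects
        self._rng = random.Random(seed)
        self._data = None  # filled in by run()

    def run(self):
        self._data = [self._rng.randint(1, 5) for _ in range(self.n_subjects)]
        return self._data

    def get_data(self):
        if self._data is None:
            raise RuntimeError("call run() before get_data()")
        return self._data


gen = CachedGenerator(n_subjects=10, seed=1)
data = gen.run()
print(gen.get_data() is data)  # True: get_data() returns the cached result
```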

## API Reference

### MakeData Class

```python
MakeData(n_subjects=1000, seed=None)
```

**Parameters:**

- `n_subjects` (int): Number of subjects/respondents to generate (default: 1000)

- `seed` (Optional[int]): Random seed for reproducibility (default: None)

**Methods:**

#### add_factor

```python
add_factor(name, n_items=4, distribution="normal", mean=3, std=1, min_val=1, max_val=5, noise_range=[-2, -1, 0, 0, 0, 1, 1, 2])
```

**Parameters:**

- `name` (str): Name of the factor

- `n_items` (int): Number of survey items to generate for this factor (default: 4)

- `distribution` (Union[List[float], str]): Either a list of probabilities for values 1-5, or "normal" for normal distribution (default: "normal")

- `mean` (float): Mean value if using normal distribution (default: 3)

- `std` (float): Standard deviation if using normal distribution (default: 1)

- `min_val` (int): Minimum value for survey responses (default: 1)

- `max_val` (int): Maximum value for survey responses (default: 5)

- `noise_range` (List): Possible noise values to add to the base factor (default: [-2, -1, 0, 0, 0, 1, 1, 2])

#### run

```python
run() -> pd.DataFrame
```

**Returns:**

- pandas.DataFrame: Generated survey data

#### get_data

```python
get_data() -> pd.DataFrame
```

**Returns:**

- pandas.DataFrame: Previously generated survey data (must call run() first)

## 📚 Citation

### BibTeX

```bibtex
@software{nelson2025fictionaldatagenerator,
  author       = {Nelson, Adam Ross},
  title        = {FictionalDataGenerator: Generate synthetic survey responses using random number generation},
  year         = 2025,
  publisher    = {Up Level Data, LLC},
  version      = {1.0},
  url          = {https://github.com/adamrossnelson/FictionalSurveyResponses}
}
```

### APA Format

Nelson, A. R. (2025). *FictionalDataGenerator: Generate synthetic survey responses using random number generation* (Version 1.0) [Computer software]. Up Level Data, LLC. https://github.com/adamrossnelson/FictionalSurveyResponses

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.