{"id":50391061,"url":"https://github.com/abhay557/fakedata","last_synced_at":"2026-05-30T18:01:39.825Z","repository":{"id":353422467,"uuid":"1215206777","full_name":"Abhay557/fakedata","owner":"Abhay557","description":"The fakedata package generates realistic synthetic user profiles for machine learning, deep learning, data analysis, and data science workflows.","archived":false,"fork":false,"pushed_at":"2026-05-08T16:50:10.000Z","size":5806,"stargazers_count":9,"open_issues_count":3,"forks_count":2,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-08T18:34:12.761Z","etag":null,"topics":["abhay557","anime","data","data-analysis","data-science","deep-learning","fake","fake-data","generator","joke","machine-learning","mock","mock-data"],"latest_commit_sha":null,"homepage":"https://abhay557.github.io/fakadata/","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Abhay557.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-19T16:10:37.000Z","updated_at":"2026-05-08T16:36:38.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/Abhay557/fakedata","commit_stats":null,"previous_names":["abhay557/fakedata"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/Abhay557/fakedata","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Abhay557%2Ffakedata","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repo
sitories/Abhay557%2Ffakedata/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Abhay557%2Ffakedata/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Abhay557%2Ffakedata/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Abhay557","download_url":"https://codeload.github.com/Abhay557/fakedata/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Abhay557%2Ffakedata/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33703065,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-30T02:00:06.278Z","response_time":92,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["abhay557","anime","data","data-analysis","data-science","deep-learning","fake","fake-data","generator","joke","machine-learning","mock","mock-data"],"created_at":"2026-05-30T18:01:37.482Z","updated_at":"2026-05-30T18:01:39.820Z","avatar_url":"https://github.com/Abhay557.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# fakedata \n\n[![NPM Version](https://img.shields.io/npm/v/@abhay557/fakedata?color=red\u0026label=npm)](https://www.npmjs.com/package/@abhay557/fakedata)\n[![PyPI 
Version](https://img.shields.io/pypi/v/fakedata-python?color=blue\u0026label=pypi)](https://pypi.org/project/fakedata-python/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/16N9x1YCOVVvIF8rl7IQxKRkK4en_g3Gi?usp=sharing)\n[![PyPI Downloads](https://static.pepy.tech/personalized-badge/fakedata-python?period=total\u0026units=INTERNATIONAL_SYSTEM\u0026left_color=BLACK\u0026right_color=GREEN\u0026left_text=downloads)](https://pepy.tech/projects/fakedata-python)\n\nA high-performance, **zero-dependency** synthetic data generation engine, available for both **Node.js** and **Python**. Designed specifically for machine learning, data science, and analytics workflows, providing 100% data parity across platforms.\n\n\n## Overview\n\n`fakedata` has been completely rebuilt from the ground up to serve as an **ML-ready synthetic data engine**. It generates deeply interconnected user profiles with **112 flat columns across 13 domains** (Health, Financial, Employment, Digital Footprint, etc.), making it the perfect tool for training models, benchmarking pipelines, or simulating realistic databases.\n\n###  Machine Learning Power Features:\n- **Behavioral Personas**: Orchestrate correlations through 6 distinct personas (e.g., Executive, Student, Tech Pro) to ensure realistic socio-economic patterns.\n- **Seed Reproducibility**: Generate byte-for-byte identical datasets across runs (and languages!) 
using `seed`.\n- **Schema Overrides**: Force specific distributions (e.g., age ranges, income brackets, genders) using `schema`.\n- **Locale-Aware Generation**: Support for 8 culture-specific name sets and phone formats (`en`, `in`, `jp`, `kr`, `de`, `br`, `ar`, `fr`).\n- **Missing Data Simulation**: Automatically inject realistic nulls using `missing_rate` to test your data imputation pipelines.\n- **Anomaly Injection**: Inject fraud/outlier profiles (e.g., impossible geography, credit fraud, income spikes) using `anomaly_rate`.\n- **Time-Series Data**: Generate chronological activity logs (logins, page views, purchases) per user for behavioral modeling.\n- **Pipeline Ready**: Export directly to CSV, JSON, or Flat objects (perfect for `pandas.DataFrame`).\n- **CLI Tool**: Generate and export datasets directly from your terminal — no scripting required.\n- **Streaming Generation**: Files are written one record at a time — constant RAM usage regardless of dataset size. Generate 10M+ rows without running out of memory.\n- **Standalone Generators**: Generate modular, domain-specific data without full user profiles using `data.company()`, `data.job()`, `data.medicalRecord()`, `data.university()`, and `data.transaction()`.\n- **Enriched High-Fidelity Data**: Powered by aggregated datasets, user profiles now include structured `health.medicalHistory` arrays, `employment.companyDetails` with revenue and net income, and `employment.skills` arrays correlated to real job titles.\n\n---\n\n##  Node.js / TypeScript Implementation\n\n### Installation\n```bash\nnpm install @abhay557/fakedata\n```\n\n### Quick Start\n```javascript\nconst fakedata = require('@abhay557/fakedata');\n\n// Generate deterministic users with a 5% missing data rate (null injection)\nconst users = fakedata.data.users(1000, { seed: 42, missing_rate: 0.05 });\n\n// Export directly to CSV format\nconst csvString = fakedata.data.usersToCSV(1000, { seed: 42 });\n\n// Time-series activity data\nconst ts = 
fakedata.userTimeSeries({ days: 30, eventsPerDay: 8 });\nconsole.log(`Generated ${ts.activity.length} events for ${ts.user.fullName}`);\n```\n\n### Streaming API \u0026 Custom Correlations\nGenerate unlimited data directly to disk while keeping memory at O(1), and force mathematical relationships between fields using the Pearson Correlation API:\n\n```javascript\nconst fs = require('fs');\nconst fakedata = require('@abhay557/fakedata');\n\n// Create a stream that emits 1 million users as CSV\nconst stream = fakedata.data.generateStream(1000000, { \n    format: 'csv',\n    correlations: [\n        { fieldA: 'education.level', fieldB: 'financial.annualIncome', pearson_coeff: 0.85 },\n        { fieldA: 'health.bmi', fieldB: 'health.bloodPressure.systolic', pearson_coeff: 0.60 }\n    ]\n});\n\n// Pipe directly to file (constant RAM usage)\nstream.pipe(fs.createWriteStream('1m_dataset.csv'));\n```\n\n---\n\n##  Python Implementation\n\n### Installation\n```bash\npip install fakedata-python\n```\n\n### Quick Start\n```python\nimport fakedata\nimport pandas as pd\n\n# Generate 10,000 highly correlated users deterministically\nusers = fakedata.data.users(10000, {\"seed\": 42})\n\n# Or export directly to a Pandas DataFrame\ndf = pd.DataFrame(fakedata.data.users_flat(10000, {\"seed\": 42}))\nprint(df.head())\n\n# Create time-series activity data\nts = fakedata.data.user_time_series({\"days\": 30, \"events_per_day\": 8})\nprint(f\"Generated {len(ts['activity'])} events for {ts['user']['fullName']}\")\n```\n\n### Streaming API \u0026 Custom Correlations\nGenerate unlimited data lazily, keeping memory footprint at O(1), and force mathematical relationships between fields using the Pearson Correlation API:\n\n```python\nimport fakedata\n\n# Create a lazy generator that yields 1 million users\nstream = fakedata.generate_stream(1000000, {\n    \"correlations\": [\n        {\"fieldA\": \"education.level\", \"fieldB\": \"financial.annualIncome\", \"pearson_coeff\": 0.85},\n        
{\"fieldA\": \"health.bmi\", \"fieldB\": \"health.bloodPressure.systolic\", \"pearson_coeff\": 0.60}\n    ]\n})\n\n# Process users one by one without blowing up RAM\nfor user in stream:\n    # write to DB, serialize to file, or process\n    pass\n```\n\n---\n\n##  CLI — Command Line Interface\n\nAfter installing, use `fakedata` directly from your terminal. No scripts needed!\n\n### Node.js (global install)\n```bash\nnpm install -g @abhay557/fakedata\n```\n\n### Python (global install)\n```bash\npip install fakedata-python\n```\n\n### CLI Commands\n\n| Command | Description |\n|:---|:---|\n| `fakedata generate` | Generate synthetic user data |\n| `fakedata preview` | Print a single user profile to the console |\n| `fakedata help` | Show all available options |\n\n### CLI Options\n\n| Flag | Default | Description |\n|:---|:---|:---|\n| `-T`, `--type` | `users` | Type of data: `users` \\| `companies` \\| `jobs` \\| `universities` \\| `transactions` \\| `medical_records` |\n| `-n`, `--count` | `10` | Number of records to generate |\n| `-f`, `--format` | `json` | Output format: `json` \\| `csv` \\| `flat` |\n| `-o`, `--output` | stdout | Output file path |\n| `-s`, `--seed` | none | Random seed for reproducibility |\n| `-l`, `--locale` | `en` | Locale: `en` \\| `in` \\| `jp` \\| `kr` \\| `de` \\| `br` \\| `ar` \\| `fr` |\n| `-a`, `--anomaly-rate` | `0` | Fraction of anomalous users (0–1) |\n| `-m`, `--missing-rate` | `0` | Fraction of null fields (0–1) |\n| `-t`, `--timeseries` | — | Include time-series activity logs |\n| `--days` | `30` | Days of activity for time-series |\n| `--pretty` | — | Pretty-print JSON output |\n\n### Examples\n\n```bash\n# Generate 1000 users and save as CSV\nfakedata generate -n 1000 -f csv -o dataset.csv\n\n# Generate 500 standalone company profiles (v2.1)\nfakedata generate --type companies -n 500 -o companies.json\n\n# Generate 100,000 medical records directly to a file (v2.1)\nfakedata generate -T medical_records -n 100000 -o 
hospitals.json\n\n# Generate 500 deterministic Indian users\nfakedata generate -n 500 -l in --seed 42 -o india.json\n\n# Fraud detection dataset with 5% anomalies\nfakedata generate -n 10000 -a 0.05 -f csv -o fraud_data.csv\n\n# Generate 1 million rows without running out of memory (streaming)\nfakedata generate -n 1000000 -f csv -o big_dataset.csv\n\n# Preview a single user in the console\nfakedata preview\n\n# Time-series activity logs for 100 users\nfakedata generate -n 100 --timeseries --days 60 -o activity.json\n```\n\n### Streaming Architecture\n\nWhen writing to a file (`-o`), the CLI uses a **streaming write** strategy:\n\n- The output file is **created first**, before any data is generated.\n- Each user is generated **one at a time** and written immediately to disk.\n- The generated object is then **discarded** — it is never held in a large array.\n- **RAM usage stays constant** (O(1)) regardless of how many records you generate.\n- A live progress counter is printed every 10,000 records for large jobs.\n\nThis means you can generate **tens of millions of rows** without hitting Node.js heap limits or Python memory errors.\n\n---\n\n## Advanced Features Reference\n\nBoth Python and JS/TS expose the same underlying engine options.\n\n### 1. 
Configuration Options\nPass an `options` dictionary/object to `data.user(options)` or `data.users(n, options)`:\n\n```javascript\nconst options = {\n    seed: 42,              // Number: Ensures deterministic, byte-for-byte identical output\n    missing_rate: 0.05,    // Float (0-1): 5% chance of any leaf field being null\n    locale: 'jp',          // String: 'en', 'in', 'jp', 'kr', 'de', 'br', 'ar', 'fr'\n    anomaly_rate: 0.05,    // Float (0-1): 5% of users will have injected fraud anomalies\n    days: 30,              // Number: Days of time-series activity to generate\n    eventsPerDay: 8,       // Number: Average events per day for time-series logs\n    \n    // Schema Constraints (force specific data distributions)\n    schema: {\n        age: { min: 25, max: 40 },           // Can also use { exact: 30 }\n        gender: \"female\",                    // \"male\", \"female\", or \"non-binary\"\n        employment: { status: \"employed\" }, \n        education: { level: \"Master's\" },\n        financial: { annualIncome: { min: 60000, max: 120000 } },\n        health: { medicalCondition: \"Diabetes\" },\n        address: { country: \"Japan\" },\n        height: { min: 160, max: 180 },\n        weight: { min: 50, max: 80 }\n    }\n}\n```\n\n### 2. Supported API Methods\n\n| Method (JS) | Method (Python) | Description |\n| :--- | :--- | :--- |\n| `data.user(opts?)` | `data.user(opts=None)` | Generate a single complex user profile. |\n| `data.users(n, opts?)` | `data.users(n, opts=None)` | Generate an array/list of `n` users. |\n| `data.userTimeSeries(opts)` | `data.user_time_series(opts)`| Returns `{ user, activity }` containing chronological event logs. |\n| `data.usersFlat(n, opts?)` | `data.users_flat(n, opts=None)`| Returns flat dicts/objects, perfect for `pandas.DataFrame` ingestion. |\n| `data.usersToCSV(n, opts?)` | `data.users_to_csv(n, opts=None)`| Returns a fully formatted CSV string (112 columns). 
|\n| `data.usersToJSON(n, opts?)`| `data.users_to_json(n, opts=None)`| Returns a pretty-printed JSON string. |\n\n### 3. Behavioral Personas (Statistical Modeling)\nTo ensure the data is useful for **Clustering** and **Regression** analysis, `fakedata` uses a **Persona-driven engine**. Every user is assigned one of 6 personas that orchestrate their life outcomes:\n\n- **Executive**: High income, high education (Master's/PhD), premium Apple devices, luxury lifestyle.\n- **Tech Professional**: High income, high-end hardware, heavy social media use, remote work bias.\n- **Student**: Low income, high student debt, budget/mid-range tech, high social media footprint.\n- **Manual Laborer / Service Worker**: Budget-conscious, steady income, consistent employment patterns.\n- **Freelancer**: Flexible work modes, variable income ranges, mid-range tech profile.\n\nThese personas ensure that an analyst looking at your synthetic data will find **statistically significant clusters** rather than just a uniform cloud of random values.\n\n---\n\n## Data Structure Highlights (112 Columns)\n\n### 3. 
v2.1 High-Fidelity Data Injections\nVersion 2.1 completely revamps the `user()` profile by injecting rich, deeply nested real-world data distributions for Employment, Health, and Education.\n\n```json\n{\n  \"employment\": {\n    \"status\": \"employed\",\n    \"jobTitle\": \"Data Scientist\",\n    \"jobCategory\": \"Engineering\",\n    \"skills\": [\"Python\", \"SQL\", \"Machine Learning\", \"PyTorch\"],\n    \"companyDetails\": {\n      \"country\": \"United States\",\n      \"industry\": \"Technology\",\n      \"yearFounded\": 1998,\n      \"revenue\": 182300000000,\n      \"netIncome\": 46200000000\n    }\n  },\n  \"health\": {\n    \"medicalHistory\": [\n      {\n        \"condition\": \"Hypertension\",\n        \"hospital\": \"UCLA Medical Center\",\n        \"admissionType\": \"Urgent\",\n        \"billingAmount\": 18560.50,\n        \"medication\": \"Lisinopril\",\n        \"testResult\": \"Abnormal\"\n      }\n    ]\n  },\n  \"education\": {\n    \"institution\": \"Massachusetts Institute of Technology\",\n    \"institutionDomain\": \"mit.edu\",\n    \"institutionState\": \"Massachusetts\"\n  }\n}\n```\n\n### 4. Locale-Aware Name Generation\nSupports 8 locales with culturally accurate first names, last names, and country/phone codes:\n- `'in'`: Aarav Sharma, Priya Patel (+91, India)\n- `'jp'`: Haruto Tanaka, Sakura Sato (+81, Japan)\n- `'kr'`: Minjun Kim, Seo-yeon Park (+82, South Korea)\n- `'de'`: Lukas Müller, Mia Schmidt (+49, Germany)\n- `'br'`: Miguel Silva, Alice Santos (+55, Brazil)\n- `'ar'`: Mohammed Al-Ahmed, Fatima Khalil (+966, Saudi Arabia)\n- `'fr'`: Gabriel Martin, Emma Dubois (+33, France)\n- `'en'`: James Smith, Mary Johnson (+1, United States)\n\n### 5. Time-Series Activity Data\nGenerate chronological behavioral logs for users. 
Event types include `login`, `page_view`, `purchase`, `search`, `click`, `logout`, `api_call`, `upload`, `download`, and `comment`.\n\n```javascript\nconst ts = data.userTimeSeries({ seed: 42, days: 30, eventsPerDay: 8 });\n// ts.user → Full user profile\n// ts.activity → [{ timestamp, type, page, duration, device, ip, success, amount?, query? }]\n```\n\n### 6. Anomaly Injection Engine (Fraud Detection)\nWhen `anomaly_rate` is \u003e 0, `fakedata` injects ML-detectable fraud patterns into the dataset. Affected users receive a special `_anomaly` flag object indicating the fraud type.\n\n| Anomaly Type | Effect |\n|:---|:---|\n| `income_spike` | Income multiplied 5-15x |\n| `credit_fraud` | Credit score = 100-200 or 850-999, DTI = 10-60 |\n| `session_anomaly` | Sessions/week = 200-700, avg session = 500-1500 min |\n| `age_outlier` | Age = 1, 2, 3, 115, 120, or 130 |\n| `geo_impossible` | Coordinates = (0,0), IP = 0.0.0.0 |\n| `velocity_attack` | Total sessions = 50k-150k, last login = now |\n| `data_mismatch` | Age=12 + employed + 30yr experience + $500k income |\n| `health_outlier` | BMI = 8-9 or 75-80, BP = extreme values |\n\n### 7. The User Profile Schema (109 Correlated Fields)\nEach generated user contains highly realistic, correlated data. For example, age determines education graduation year, which impacts employment salary, which impacts credit score, which impacts housing status and health/BMI metrics.\n\n```text\nidentity(9) → personal(6) → network(3) → address(7) → demographics(5)\n→ education(7) → employment(10) → financial(8) → health(16)\n→ social(9) → digitalFootprint(15) → bank(5) → lifestyle(9)\n```\n\n---\n\n## License\n\nDistributed under the **MIT License**. See `LICENSE` for more information.\n\n**Maintainer**: [abhay557](https://github.com/abhay557)\n\n- Project Commit History - `https://github.com/abhay557/random-api.xyz`\n\n---\n## Contributing\n\nContributions are welcome! 
Whether it's a bug fix, a new feature, or improved docs, every bit helps.\n\n- Read the [Contributing Guide](./CONTRIBUTING.md) before submitting a PR.\n- Use the [Bug Report](https://github.com/abhay557/fakedata/issues/new?template=bug_report.md) template to report issues.\n- Use the [Feature Request](https://github.com/abhay557/fakedata/issues/new?template=feature_request.md) template to suggest ideas.\n- Please follow our [Code of Conduct](./CODE_OF_CONDUCT.md) in all interactions.\n\n```bash\n# Fork the repo, then clone your fork:\ngit clone https://github.com/YOUR_USERNAME/fakedata.git\n# Enter the cloned directory before creating a branch\ncd fakedata\ngit checkout -b feature/my-improvement\n# Make your changes, then open a Pull Request!\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabhay557%2Ffakedata","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fabhay557%2Ffakedata","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fabhay557%2Ffakedata/lists"}