{"id":44325878,"url":"https://github.com/zafrem/pattern-engine","last_synced_at":"2026-02-11T07:34:50.086Z","repository":{"id":330931273,"uuid":"1124159798","full_name":"zafrem/pattern-engine","owner":"zafrem","description":"A comprehensive collection of regex patterns and verification functions for detecting and validating PII and sensitive data across multiple global regions and languages.","archived":false,"fork":false,"pushed_at":"2026-01-03T17:10:27.000Z","size":82,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-01-09T05:21:27.046Z","etag":null,"topics":["compliance","data-privacy","multi-region","pii-detection","regex-patterns","security-tooling","sensitive-data","verification-functions"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zafrem.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-12-28T13:16:04.000Z","updated_at":"2026-01-06T10:11:17.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/zafrem/pattern-engine","commit_stats":null,"previous_names":["zafrem/pattern-engine"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/zafrem/pattern-engine","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zafrem%2Fpattern-engine","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/
zafrem%2Fpattern-engine/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zafrem%2Fpattern-engine/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zafrem%2Fpattern-engine/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zafrem","download_url":"https://codeload.github.com/zafrem/pattern-engine/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zafrem%2Fpattern-engine/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29329493,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-11T06:13:03.264Z","status":"ssl_error","status_checked_at":"2026-02-11T06:12:55.843Z","response_time":97,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["compliance","data-privacy","multi-region","pii-detection","regex-patterns","security-tooling","sensitive-data","verification-functions"],"created_at":"2026-02-11T07:34:49.892Z","updated_at":"2026-02-11T07:34:50.080Z","avatar_url":"https://github.com/zafrem.png","language":"Python","readme":"# Pattern Engine\n\nA comprehensive collection of regular expression patterns and verification functions for detecting personally identifiable information (PII) and sensitive data across multiple languages and regions.\n\n## Overview\n\nPattern Engine provides:\n- **Regex Patterns**: 160+ patterns 
for PII detection across US, Korea, Japan, China, Taiwan, India, EU, and common international formats\n- **Verification Functions**: 32 Python verification functions for checksum validation and data quality checks\n- **Keyword Mappings**: Context-aware keywords in multiple languages for improved accuracy\n- **Comprehensive Tests**: 1,965+ automated tests ensuring pattern accuracy and reliability\n\n## Project Structure\n\n```\npattern-engine/\n├── regex/                    # Regex pattern definitions (YAML)\n│   ├── pii/                 # Personally Identifiable Information patterns\n│   │   ├── us/             # United States (SSN, driver's license, passport, phone, etc.)\n│   │   ├── kr/             # South Korea (RRN, bank accounts, phone, etc.)\n│   │   ├── jp/             # Japan (My Number, bank accounts, phone, etc.)\n│   │   ├── cn/             # China (ID cards, bank accounts, phone, etc.)\n│   │   ├── tw/             # Taiwan (ID cards, bank accounts, phone, etc.)\n│   │   ├── in/             # India (Aadhaar, PAN, phone, etc.)\n│   │   ├── eu/             # European Union (VAT, national IDs, passports, etc.)\n│   │   ├── common/         # International formats (email, IP, credit cards, etc.)\n│   │   └── iban.yml        # IBAN patterns with mod-97 validation\n│   └── hash/                # High-entropy tokens and secrets\n│       └── tokens.yml       # API keys, JWT, AWS keys, GitHub tokens, etc.\n│\n├── verification/            # Verification functions for pattern validation\n│   ├── python/             # Python implementation\n│   │   ├── verification.py # 32 verification functions\n│   │   └── __init__.py\n│   └── README.md\n│\n├── keyword/                 # Context-aware keywords (multi-language)\n│   ├── financial.yml       # Bank, credit card, IBAN keywords\n│   ├── identification.yml  # SSN, passport, national ID keywords\n│   ├── contact.yml         # Email, phone, address keywords\n│   ├── network.yml         # IP, URL, MAC address keywords\n│   
├── personal.yml        # Name, DOB, gender keywords\n│   └── README.md\n│\n├── tests/                   # Comprehensive test suite\n│   ├── test_verification.py # Verification function tests (129 tests)\n│   ├── test_patterns.py     # Pattern validation tests (1,836+ tests)\n│   ├── requirements.txt     # Test dependencies\n│   ├── run_tests.sh        # Test runner script\n│   └── README.md\n│\n├── pytest.ini              # Pytest configuration\n└── README.md               # This file\n```\n\n## Quick Start\n\n### Installation\n\n```bash\n# Clone the repository\ngit clone https://github.com/zafrem/pattern-engine.git\ncd pattern-engine\n\n# Install test dependencies\npip install -r tests/requirements.txt\n```\n\n### Running Tests\n\n```bash\n# Run all tests\npytest\n\n# Run a specific test file\npytest tests/test_verification.py\npytest tests/test_patterns.py\n\n# Run with verbose output\npytest -v\n\n# Run with coverage\npytest --cov=verification --cov-report=html\n```\n\n### Using Patterns\n\n```python\nimport re\nimport yaml\nfrom verification.python.verification import luhn\n\n# Load a pattern file\nwith open('regex/pii/common/credit-cards.yml') as f:\n    patterns = yaml.safe_load(f)\n\n# Use the first pattern in the file (assumed here to be the Visa pattern)\nvisa_pattern = patterns['patterns'][0]['pattern']\ntext = \"My card is 4111111111111111\"\n\n# A regex match alone is not enough; confirm the Luhn checksum as well\nmatch = re.search(visa_pattern, text)\nif match and luhn(match.group(0)):\n    print(\"Valid Visa card detected!\")\n```\n\n### Using Verification Functions\n\n```python\nfrom verification.python.verification import (\n    iban_mod97,\n    luhn,\n    high_entropy_token,\n    us_ssn_valid,\n    kr_rrn_valid,\n    cn_national_id_valid,\n    jp_my_number_valid\n)\n\n# Verify IBAN\nif iban_mod97(\"GB82WEST12345698765432\"):\n    print(\"Valid IBAN\")\n\n# Verify credit card using Luhn algorithm\nif luhn(\"4111111111111111\"):\n    print(\"Valid card number\")\n\n# Detect high-entropy tokens (API keys, secrets)\nif 
high_entropy_token(\"ghp_1a2B3c4D5e6F7g8H9i0J1k2L3m4N5o6P7q8R9s0T\"):\n    print(\"High-entropy token detected\")\n\n# Validate US SSN\nif us_ssn_valid(\"123-45-6789\"):\n    print(\"Valid SSN format\")\n\n# Validate Korean RRN\nif kr_rrn_valid(\"850101-1234567\"):\n    print(\"Valid Korean RRN\")\n\n# Validate Chinese national ID\nif cn_national_id_valid(\"11010519491231002X\"):\n    print(\"Valid Chinese national ID\")\n\n# Validate Japanese My Number\nif jp_my_number_valid(\"123456789012\"):\n    print(\"Valid Japanese My Number\")\n```\n\n## Pattern Categories\n\n### Geographic Coverage\n\n| Region | Patterns | Description |\n|--------|----------|-------------|\n| **US** | 8 | SSN, ITIN, passport, driver's license, phone, zipcode |\n| **Korea** | 28 | RRN, alien registration, bank accounts, phone, zipcode |\n| **Japan** | 17 | My Number, bank accounts, phone, zipcode |\n| **China** | 12 | ID cards, bank accounts, phone |\n| **Taiwan** | 20 | ID cards, bank accounts, phone |\n| **India** | 10 | Aadhaar, PAN, phone |\n| **EU** | 38 | VAT, national IDs, passports (FR, DE, IT, ES, NL, BE, etc.) 
|\n| **Common** | 22 | Email, IP, credit cards, URLs, IBAN, tokens/secrets |\n\n### Data Types\n\n- **Financial**: Bank accounts, credit cards, IBAN, routing numbers\n- **Identification**: National IDs, SSN, passports, driver's licenses\n- **Contact**: Email, phone numbers, addresses, zipcodes\n- **Network**: IPv4, IPv6, URLs, MAC addresses\n- **Tokens/Secrets**: API keys, JWT, AWS keys, GitHub tokens, private keys\n\n## Verification Functions\n\nThe project includes 32 verification functions for advanced validation:\n\n### Core Validators\n\n| Function | Purpose | Example |\n|----------|---------|---------|\n| `iban_mod97` | IBAN checksum validation | GB82WEST12345698765432 |\n| `luhn` | Credit card validation | 4111111111111111 |\n| `credit_card_bin_valid` | Credit card BIN validation | Valid issuer prefix check |\n| `dms_coordinate` | GPS coordinate validation | 37°46′29.7″N |\n| `high_entropy_token` | API key/secret detection | High randomness check |\n| `not_timestamp` | Reject timestamp-like numbers | Filter false positives |\n| `generic_number_not_timestamp` | Generic timestamp filtering | Flexible validation |\n| `not_repeating_pattern` | Reject repeating digits | Filter 111-1111-1111 |\n| `contains_letter` | Check for alphabetic characters | A1234567 |\n| `ipv4_public` | Validate public IPv4 address | Non-private IP check |\n| `cjk_name_standalone` | Validate CJK name format | Standalone name detection |\n\n### US Validators\n\n| Function | Purpose |\n|----------|---------|\n| `us_ssn_valid` | US SSN validation (area/group/serial checks) |\n| `us_zipcode_valid` | US ZIP code validation |\n\n### Korea Validators\n\n| Function | Purpose |\n|----------|---------|\n| `korean_zipcode_valid` | Korean postal code validation |\n| `korean_bank_account_valid` | Korean bank account validation with prefix checking |\n| `kr_rrn_valid` | Korean Resident Registration Number validation |\n| `kr_alien_registration_valid` | Korean alien registration number validation |\n| 
`kr_corporate_registration_valid` | Korean corporate registration number validation |\n| `kr_business_registration_valid` | Korean business registration number validation |\n\n### Japan Validators\n\n| Function | Purpose |\n|----------|---------|\n| `jp_my_number_valid` | Japanese My Number validation with check digit |\n\n### China/Taiwan Validators\n\n| Function | Purpose |\n|----------|---------|\n| `cn_national_id_valid` | Chinese national ID validation with checksum |\n| `tw_national_id_valid` | Taiwanese national ID validation |\n\n### India Validators\n\n| Function | Purpose |\n|----------|---------|\n| `india_aadhaar_valid` | Indian Aadhaar number validation |\n| `india_pan_valid` | Indian PAN validation |\n\n### European Validators\n\n| Function | Purpose |\n|----------|---------|\n| `spain_dni_valid` | Spanish DNI validation |\n| `spain_nie_valid` | Spanish NIE validation |\n| `netherlands_bsn_valid` | Dutch BSN validation |\n| `poland_pesel_valid` | Polish PESEL validation |\n| `sweden_personnummer_valid` | Swedish personal number validation |\n| `france_insee_valid` | French INSEE number validation |\n| `belgium_rrn_valid` | Belgian national register number validation |\n| `finland_hetu_valid` | Finnish HETU validation |\n\n## Pattern File Format\n\nEach pattern file follows this YAML structure:\n\n```yaml\nnamespace: \u003cnamespace\u003e\ndescription: \u003cdescription\u003e\n\npatterns:\n  - id: \u003cunique_id\u003e\n    location: \u003ccountry_code\u003e\n    category: \u003ccategory\u003e\n    description: \u003chuman_readable_description\u003e\n    pattern: '\u003cregex_pattern\u003e'\n    mask: \"\u003cmask_format\u003e\"\n    verification: \u003cverification_function\u003e  # Optional\n    flags:                                  # Optional\n      - IGNORECASE\n    examples:\n      match:\n        - \"example1\"\n        - \"example2\"\n      nomatch:\n        - \"invalid1\"\n        - \"invalid2\"\n    policy:\n      store_raw: false\n      
action_on_match: redact\n      severity: critical\n    metadata:\n      note: \"Additional information\"\n    priority: 100  # Optional\n```\n\n### Required Fields\n\n- `id`: Unique identifier for the pattern\n- `location`: Country/region code (us, kr, jp, cn, tw, in, eu, co)\n- `category`: Pattern category (ssn, bank, credit_card, phone, email, etc.)\n- `description`: Human-readable description\n- `pattern`: Regular expression pattern\n- `mask`: Masking format for redaction\n- `examples`: Match and nomatch examples\n- `policy`: Data handling policy\n\n### Optional Fields\n\n- `verification`: Name of verification function to apply\n- `flags`: Regex flags (IGNORECASE, MULTILINE, DOTALL, VERBOSE)\n- `priority`: Pattern matching priority (lower = higher priority)\n- `metadata`: Additional information\n\n## Test Suite\n\n### Coverage\n\n- **1,965+ tests collected** ✓\n- **100% verification function coverage**\n\n### Test Categories\n\n1. **Verification Tests** (`test_verification.py` - 129 tests):\n   - IBAN mod-97 validation\n   - Luhn algorithm\n   - Coordinate validation\n   - Token entropy detection\n   - Timestamp detection\n   - Zipcode validation (US, Korea)\n   - National ID validation (US, KR, CN, TW, JP, IN, EU)\n   - Function registry\n\n2. 
**Pattern Tests** (`test_patterns.py` - 1,836+ tests):\n   - YAML structure validation\n   - Regex compilation\n   - Match/nomatch examples\n   - Verification function integration\n   - Metadata validation\n   - Pattern coverage\n   - Flag support (IGNORECASE, etc.)\n\n### Running Specific Tests\n\n```bash\n# Run verification tests\npytest tests/test_verification.py\n\n# Run pattern structure tests\npytest tests/test_patterns.py::TestPatternStructure\n\n# Run pattern matching tests\npytest tests/test_patterns.py::TestPatternMatching\n\n# Run only the tests whose names match a keyword\npytest tests/test_patterns.py -k \"credit_card\"\n```\n\n## Keywords\n\nThe `keyword/` directory contains context-aware keywords in multiple languages:\n\n- **English**: Primary keywords\n- **Korean (한국어)**: 계좌번호, 주민등록번호, etc.\n- **Japanese (日本語)**: 口座番号, マイナンバー, etc.\n- **Chinese (中文)**: 银行账号, 身份证, etc.\n\nUse keywords to:\n- Reduce false positives\n- Provide context for matches\n- Enable multi-language detection\n- Improve pattern selection\n\nSee [keyword/README.md](keyword/README.md) for details.\n\n## Development\n\n### Adding New Patterns\n\n1. Choose the appropriate directory in `regex/pii/` (or `regex/hash/` for tokens and secrets)\n2. Add the pattern to an existing YAML file or create a new file\n3. Include all required fields\n4. Add match and nomatch examples\n5. Run tests: `pytest tests/test_patterns.py`\n\n### Adding Verification Functions\n\n1. Add the function to `verification/python/verification.py`\n2. Register it in the `VERIFICATION_FUNCTIONS` dict\n3. Add comprehensive tests to `tests/test_verification.py`\n4. Update the documentation\n5. Run tests: `pytest tests/test_verification.py`\n\n### Code Style\n\n- Follow PEP 8 for Python code\n- Use YAML 1.2 for pattern files\n- Include docstrings for all functions\n- Add type hints where applicable\n\n## Contributing\n\n1. Fork the repository\n2. Create a feature branch\n3. Add patterns/functions with tests\n4. Ensure all tests pass\n5. 
Submit a pull request\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## Resources\n\n- **Documentation**:\n  - [Verification Functions](verification/README.md)\n  - [Keywords](keyword/README.md)\n  - [Tests](tests/README.md)\n\n- **Pattern Files**:\n  - [US Patterns](regex/pii/us/)\n  - [Korea Patterns](regex/pii/kr/)\n  - [Japan Patterns](regex/pii/jp/)\n  - [Common Patterns](regex/pii/common/)\n  - [Token/Secret Patterns](regex/hash/)\n\n## Statistics\n\n- **Total Patterns**: 160+\n- **Regions Covered**: 7+ (US, KR, JP, CN, TW, IN, EU)\n- **Verification Functions**: 32\n- **Test Cases**: 1,965+\n- **Implementation Languages**: Python (more coming)\n- **Pattern Categories**: 15+\n\n## Support\n\nFor issues, questions, or contributions:\n- Open an issue on GitHub\n- Submit a pull request\n- Contact the maintainers\n\n---\n\n**Note**: This project is intended for legitimate use cases such as data protection, compliance, and security. Use it responsibly and in accordance with applicable laws and regulations.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzafrem%2Fpattern-engine","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzafrem%2Fpattern-engine","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzafrem%2Fpattern-engine/lists"}