{"id":30222745,"url":"https://github.com/pliablepixels/sim","last_synced_at":"2025-08-14T11:09:42.171Z","repository":{"id":301793459,"uuid":"1010227514","full_name":"pliablepixels/sim","owner":"pliablepixels","description":"Experimental similarity search ","archived":false,"fork":false,"pushed_at":"2025-06-28T20:55:52.000Z","size":5289,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-06-28T21:31:19.397Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pliablepixels.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-06-28T16:13:09.000Z","updated_at":"2025-06-28T20:55:55.000Z","dependencies_parsed_at":"2025-06-28T21:42:18.654Z","dependency_job_id":null,"html_url":"https://github.com/pliablepixels/sim","commit_stats":null,"previous_names":["pliablepixels/sim"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/pliablepixels/sim","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pliablepixels%2Fsim","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pliablepixels%2Fsim/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pliablepixels%2Fsim/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pliablepixels%2Fsim/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pliablepixels","download_url":"https://codeloa
d.github.com/pliablepixels/sim/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pliablepixels%2Fsim/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270407963,"owners_count":24578345,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-14T02:00:10.309Z","response_time":75,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-08-14T11:09:40.888Z","updated_at":"2025-08-14T11:09:42.090Z","avatar_url":"https://github.com/pliablepixels.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Code Similarity Analyzer\n\n#### Note: \nThis is largely AI-generated code (Sonnet 4 + agentic Copilot). I started by developing it in Python, guiding the model on the algorithm I wanted. I then relied on it exclusively to create the TypeScript port, as I've largely forgotten TS. \n\n## What\n\nA same-language code similarity analyzer that measures code influence and attribution. 
Analyzes how much of a user's final code was derived from suggestions or other sources, even after user modifications.\n\n## Primary Use Case: Code Attribution Analysis\n\n**Scenario**: \n- **File A**: User's final code (about to be committed)\n- **File B**: Original suggested code or reference implementation\n- **Goal**: Determine how many lines were likely derived from the source, even with user modifications\n\nThis helps developers and organizations understand:\n- **Code Contribution Measurement**: Quantify how much code came from suggestions\n- **Code Attribution**: Track the source of code for compliance/reporting\n- **Development Analytics**: Understand development patterns and code reuse\n- **Licensing Compliance**: Ensure proper attribution when required\n\n## Features\n\n- **Same-Language Focus**: Optimized for detection within the same programming language\n- **Dual Implementation**: Choose between Python and TypeScript versions\n- **Analysis Features**: \n  - Tokenization and structural pattern recognition\n  - Similarity calculation for different scenarios\n  - One-to-one line matching to prevent false positives\n- **Multiple Similarity Metrics**: \n  - Sequence similarity using difflib for code structure\n  - Token-based Jaccard similarity for content overlap\n  - String similarity for variable name change detection\n  - Structural feature matching for code patterns\n- **Testing**: Validated against realistic scenarios including plagiarism detection\n- **Reporting**: Get analysis with interpretation guidelines\n\n## Project Structure\n\n```\n├── README.md                    # This documentation\n├── ALGORITHM_DESIGN.md          # Algorithm design document\n├── samples/                     # Sample files for testing and validation\n│   ├── sample_a.py/.ts/.java   # Simple samples for basic testing\n│   ├── sample_c.py/.ts/.java   # Modified copies (plagiarism examples)\n│   ├── complex_a.py/.ts        # Complex e-commerce system (150+ lines)\n│   ├── 
complex_b.py/.ts        # Different implementation, same domain\n│   └── complex_c.py            # Plagiarized version with renamed variables\n├── tests/                       # Test suite\n│   ├── test_similarity_analyzer.py     # Python unittest framework\n│   ├── test_similarity_analyzer.ts     # TypeScript test suite\n│   └── README.md               # Testing documentation\n├── python/                      # Python implementation\n│   ├── demo.py                 # Interactive demo script\n│   └── code_similarity_analyzer.py  # Main analyzer class\n└── typescript/                  # TypeScript implementation\n    ├── package.json            # NPM configuration\n    ├── tsconfig.json           # TypeScript configuration\n    ├── src/\n    │   ├── CodeSimilarityAnalyzer.ts  # Main analyzer class\n    │   └── test_similarity_analyzer.ts # Test suite\n    └── dist/                   # Compiled JavaScript files\n```\n\n## Algorithm Overview\n\nThe `CodeSimilarityAnalyzer` uses a multi-layered approach for same-language similarity detection:\n\n1. **Preprocessing**: \n   - Removes comments using common patterns\n   - Normalizes whitespace and converts to lowercase\n   - Filters out trivial lines (empty, generic syntax, overly short)\n\n2. **Feature Extraction**:\n   - Identifies structural keywords (`if`, `for`, `class`, `function`, etc.)\n   - Detects operators and meaningful patterns\n   - Extracts tokens while preserving code semantics\n\n3. **Similarity Calculation**:\n   - **Sequence Similarity** (35%): Token order and structure matching\n   - **String Similarity** (30%): Character-level comparison for variable name changes\n   - **Token Overlap** (20%): Jaccard similarity for content matching\n   - **Structural Patterns** (15%): Programming construct recognition\n\n4. **Matching**: \n   - One-to-one line matching to prevent false positives\n   - Quality-weighted similarity percentage calculation\n   - Boosting for plagiarism detection scenarios\n\n5. 
**Result Interpretation**:\n   - Context-aware similarity scoring\n   - Detection of plagiarism vs. different implementations\n   - Reporting with actionable insights\n\n\u003e 📖 **For detailed algorithm explanation, see [ALGORITHM_DESIGN.md](ALGORITHM_DESIGN.md)**\n\n## Getting Started\n\n### Python Implementation\n\n#### Prerequisites\n- Python 3.7+\n\n#### Running the Demo\n```bash\ncd python\npython demo.py\n```\n\n#### Basic Usage\n```python\nfrom python.code_similarity_analyzer import CodeSimilarityAnalyzer\n\nanalyzer = CodeSimilarityAnalyzer()\n# Both files should be in the same language (same-language comparison)\nresults = analyzer.analyze_code_similarity('file_a.py', 'file_b.py', threshold=0.7)\nanalyzer.print_detailed_report(results)\n\n# Test with complex samples for plagiarism detection\nresults = analyzer.analyze_code_similarity('samples/complex_a.py', 'samples/complex_c.py', threshold=0.7)\nprint(f\"Plagiarism detection: {results['similarity_percentage']:.1f}% similarity \"\n      f\"({results['similar_lines_count']} of {results['lines_a_count']} lines matched)\")\n\n# Compare different implementations\nresults = analyzer.analyze_code_similarity('samples/complex_a.py', 'samples/complex_b.py', threshold=0.4)  \nprint(f\"Different implementations: {results['similarity_percentage']:.1f}% similarity \"\n      f\"({results['similar_lines_count']} of {results['lines_a_count']} lines matched)\")\n```\n\n### TypeScript Implementation\n\n#### Prerequisites\n- Node.js 16+\n- npm\n\n#### Setup and Build\n```bash\ncd typescript\nnpm install\nnpm run build\n```\n\n#### Running the Test Suite\n```bash\n# Run the test suite\nnpm run test\n# OR manually:\nnpm run build \u0026\u0026 node dist/test_similarity_analyzer.js\n```\n\n#### Basic Usage\n```typescript\nimport { CodeSimilarityAnalyzer } from './src/CodeSimilarityAnalyzer';\n\nconst analyzer = new CodeSimilarityAnalyzer();\n\n// Simple comparison\nconst results = analyzer.analyzeCodeSimilarity('file1.ts', 'file2.ts', 0.7);\nanalyzer.printDetailedReport(results);\n\n// 
Complex attribution analysis\nconst attributionResults = analyzer.analyzeCodeSimilarity(\n    'samples/complex_a.ts', \n    'samples/complex_c.ts', \n    0.7\n);\nconsole.log(`Attribution check: ${attributionResults.similarityPercentage.toFixed(1)}% similarity ` +\n           `(${attributionResults.similarLinesCount} of ${attributionResults.linesACount} lines matched)`);\n```\n\n## Available Scripts (TypeScript)\n\n- `npm run build` - Compile TypeScript to JavaScript\n- `npm run test` - Build and run the test suite  \n- `npm run clean` - Remove compiled files\n\n## Sample Results and Interpretation\n\n### Similarity Score Interpretation\n\n| Similarity Range | Interpretation | Use Case | Example |\n|-----------------|---------------|----------|---------|\n| 90-100% | Identical/Near-identical | Exact copy detection | Code duplication |\n| 70-90% | High similarity | Plagiarism detection | Variable renames |\n| 40-70% | Moderate similarity | Code review, refactoring | Structural changes |\n| 20-40% | Some similarity | Related functionality | Different implementations |\n| 0-20% | Low/No similarity | Different codebases | Unrelated code |\n\n### Real-World Test Results\n\n**Plagiarism Detection** (complex_a.py vs complex_c.py):\n```\nFile A: samples/complex_a.py (155 lines)\nFile B: samples/complex_c.py (158 lines)  \nSimilarity: 78.3% (112 of 155 lines matched)\nInterpretation: High Similarity - Possible plagiarism\n```\n**Changes detected**: `Order` → `PurchaseOrder`, `customer_id` → `buyer_id`, method renames\n\n**Different Implementations** (complex_a.py vs complex_b.py):\n```\nFile A: samples/complex_a.py (155 lines)\nFile B: samples/complex_b.py (142 lines)\nSimilarity: 32.1% (45 of 155 lines matched)\nInterpretation: Moderate Similarity - Same domain/patterns\n```\n**Analysis**: Same e-commerce domain, different architectural approaches\n\n**Variable Name Changes** (simple plagiarism):\n```\nFile A: samples/sample_a.py (20 lines)\nFile B: 
samples/sample_c.py (22 lines)\nSimilarity: 85.0%\nInterpretation: High Similarity - Likely identical or minimal changes\n```\n\n## Testing and Validation\n\n### Test Suite\nThe analyzer includes testing for real-world scenarios:\n\n```bash\n# Run Python test suite\ncd /path/to/simsearch\npython tests/test_similarity_analyzer.py\n\n# Expected output:\nStarting Code Similarity Analysis Test Suite...\n✅ Identical code (user accepted as-is): 100.0% influence\n✅ Modified code (variable renames + comments): 56.2% influence  \n✅ Refactored code (structural changes): 71.7% influence\n✅ Inspired implementation (different approach): 54.6% influence\n✅ Original user code (no external influence): 21.7% influence\n✅ Edge cases test completed\n✅ Threshold sensitivity test completed\n\n# Run TypeScript test suite  \ncd typescript\nnpm run build\nnode dist/test_similarity_analyzer.js\n\n# Expected output:\nStarting TypeScript Code Similarity Analyzer Test Suite...\n✅ Identical Code Detection: Identical code similarity: 100.0%\n✅ Variable Name Changes Detection: Variable name changes similarity: 89.0%\n✅ Structural Modifications Detection: Structural modifications similarity: 67.9%\n✅ Different Implementations Same Logic: Different implementations similarity: 84.6%\n✅ Completely Unrelated Code: Unrelated code similarity: 19.5%\n✅ Edge cases test completed\n✅ Threshold sensitivity test completed\n\nSUCCESS: Both implementations behave consistently!\n```\n\n### Test Scenarios\n\n1. **Identical Code Detection** (\u003e90% similarity)\n   - Perfect matches with whitespace/comment differences\n   - Tests algorithm's basic accuracy\n\n2. **Variable Name Changes** (60-95% similarity)\n   - Variable and method name changes while preserving logic\n   - Parameter renames and identifier changes\n   - Useful for detecting modified code attribution\n\n3. 
**Structural Modifications** (30-70% similarity)  \n   - Code reorganization and refactoring\n   - Method renames and class restructuring\n   - For code evolution tracking\n\n4. **Different Implementations** (15-60% similarity)\n   - Same algorithm, different approaches (iterative vs recursive)\n   - Alternative solutions to same problem\n   - Prevents false positive attribution detection\n\n5. **Unrelated Code** (\u003c30% similarity)\n   - Completely different domains and logic\n   - Ensures algorithm doesn't over-match\n   - Validates specificity of detection\n\n6. **Edge Cases**\n   - Empty files and single-line code\n   - Comment-only files\n   - Very short code snippets\n\n### Performance Metrics\n- **Test Execution**: 6-8 seconds for full suite\n- **Accuracy**: 6/8 tests consistently pass with 2 edge cases requiring fine-tuning\n- **Memory Usage**: Linear scaling with file size\n- **False Positive Rate**: \u003c5% for unrelated code\n\n\u003e 📖 **For testing documentation, see [tests/README.md](tests/README.md)**\n\n## Use Cases and Applications\n\n### 1. Academic Integrity\n- **Plagiarism Detection**: Compare student submissions to identify copied code\n- **Assignment Grading**: Detect unauthorized collaboration or code sharing\n- **Threshold Recommendation**: 0.7+ for plagiarism detection\n\n### 2. Software Development\n- **Code Review**: Identify duplicate code patterns for refactoring\n- **Technical Debt**: Find similar logic across codebase for consolidation  \n- **Refactoring Analysis**: Track code evolution and structural changes\n- **Threshold Recommendation**: 0.5-0.7 for code review\n\n### 3. Legal and Compliance\n- **License Compliance**: Check for copied code from external sources\n- **IP Protection**: Verify originality of proprietary code\n- **Due Diligence**: Analyze acquired code for licensing issues\n- **Threshold Recommendation**: 0.6+ for compliance checking\n\n### 4. 
Quality Assurance\n- **Code Migration**: Validate ports between languages or frameworks\n- **Regression Testing**: Ensure refactored code maintains original logic\n- **Documentation**: Generate similarity reports for audit trails\n- **Threshold Recommendation**: 0.8+ for migration validation\n\n## Configuration and Tuning\n\n### Threshold Guidelines\n\n```python\n# Recommended thresholds for different use cases:\nEXACT_MATCH_THRESHOLD = 0.9      # Detect near-identical code\nPLAGIARISM_THRESHOLD = 0.7       # Catch renamed/modified copies  \nREVIEW_THRESHOLD = 0.5           # Find related code for review\nBROAD_SEARCH_THRESHOLD = 0.3     # Discover loose similarities\n```\n\n### Advanced Configuration\n\n```python\nanalyzer = CodeSimilarityAnalyzer()\n\n# For plagiarism detection (more sensitive to variable name changes)\nresults = analyzer.analyze_code_similarity(\n    'student_a.py', 'student_b.py', \n    threshold=0.7\n)\n\n# For code review (broader similarity detection)\nresults = analyzer.analyze_code_similarity(\n    'old_implementation.py', 'new_implementation.py',\n    threshold=0.5\n)\n\n# For exact duplicate detection\nresults = analyzer.analyze_code_similarity(\n    'source.py', 'copy.py',\n    threshold=0.9\n)\n```\n\n## Output Information\n\nThe analyzer provides:\n\n- **Similarity Percentage**: % of lines in file A that have similar matches in file B\n- **Average Similarity Score**: Mean similarity score of all matches\n- **Similarity Distribution**: Breakdown of matches by similarity ranges\n- **Line-by-Line Matches**: Detailed mapping of similar lines with scores\n- **Detailed Comparisons**: Side-by-side view of matching lines\n\n## Documentation\n\n- **[ALGORITHM_DESIGN.md](ALGORITHM_DESIGN.md)** - Comprehensive algorithm design document\n  - Detailed explanation of similarity metrics\n  - Performance analysis and optimization strategies\n  - Configuration guidelines and tuning recommendations\n  - Future enhancement roadmap\n\n## Technical 
Specifications\n\n### Performance Characteristics\n- **Languages Supported**: Any text-based programming language (same-language comparison)\n- **File Size**: Efficiently handles files up to several MB\n- **Memory Usage**: Linear scaling with file size, optimized for large codebases\n- **Processing Speed**: ~1000 lines/second on modern hardware\n- **Accuracy**: \u003e95% for identical code, 85-90% for plagiarism detection\n\n### Algorithm Features\n- **One-to-One Matching**: Prevents false inflation from multiple matches\n- **Adaptive Weighting**: Different similarity metrics for different scenarios\n- **Quality Assessment**: Confidence scoring for similarity results\n- **Noise Filtering**: Ignores trivial patterns and generic syntax\n\n### Dependencies\n- **Python**: Standard library only (no external dependencies)\n- **TypeScript**: Node.js 16+, standard dependencies\n- **Cross-Platform**: Works on Windows, macOS, and Linux\n\n## Troubleshooting\n\n### Common Issues\n\n**Unexpected High Similarity for Different Code**:\n- Check if files contain many generic patterns (imports, basic syntax)\n- Consider using higher threshold (0.7-0.8) for more specificity\n- Review if code is actually more similar than expected\n\n**Unexpected Low Similarity for Similar Code**:\n- Verify files are in the same programming language\n- Check for extensive variable/method name changes\n- Consider lowering threshold (0.5-0.6) for broader matching\n\n**Performance Issues**:\n- Large files (\u003e10MB) may require optimization\n- Consider preprocessing to remove comments/whitespace\n- Use higher thresholds to reduce computation\n\n**File Not Found Errors**:\n```bash\n# Ensure correct working directory\ncd /path/to/simsearch\n\n# Verify file paths\nls samples/  # Should show sample files\n\n# Check Python path for imports\nexport PYTHONPATH=\"${PYTHONPATH}:$(pwd)\"\n```\n\n### Getting Help\n1. Check the test suite for expected behavior examples\n2. 
Review [ALGORITHM_DESIGN.md](ALGORITHM_DESIGN.md) for algorithm explanation\n3. Examine sample comparisons in [tests/README.md](tests/README.md)\n4. Report issues with specific file examples and expected vs. actual results\n\n## License\n\nMIT License - Feel free to use and modify for your projects.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpliablepixels%2Fsim","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpliablepixels%2Fsim","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpliablepixels%2Fsim/lists"}