https://github.com/gitstq/magika-sdk-python
Enhanced Python SDK for AI-powered file type detection with async batch processing and enterprise security features
- Host: GitHub
- URL: https://github.com/gitstq/magika-sdk-python
- Owner: gitstq
- License: mit
- Created: 2026-04-15T13:56:16.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2026-04-15T13:59:39.000Z (2 months ago)
- Last Synced: 2026-04-15T15:38:41.372Z (2 months ago)
- Language: Python
- Homepage: https://github.com/gitstq/magika-sdk-python
- Size: 93.8 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.en.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
# Magika SDK Python
简体中文 | 繁體中文 | English | 日本語
---
## Project Introduction
**Magika SDK Python** is an enhanced Python SDK for AI-powered file type detection, based on Google's Magika AI engine. It provides out-of-the-box deep learning file identification capabilities.
### Core Value
| Feature | Description |
|---------|-------------|
| **AI Powered** | Built on a deep learning model, with 99%+ accuracy |
| **Lightning Fast** | ~5 ms inference per file, regardless of file size |
| **Feature Rich** | Detects 200+ file types |
| **Security Scanning** | Built-in enterprise security threat detection |
| **Async Processing** | Concurrent async scanning of large directories |
| **Chinese Documentation** | Full Chinese docs, developer friendly |
### Inspiration
This project is inspired by [google/magika](https://github.com/google/magika), with deep enhancements and additional features to provide simpler and more powerful file type detection for Python developers.
### Differentiation Highlights
1. **Simpler Python API** - One line of code for file type detection
2. **Async Batch Processing** - High-concurrency scanning with progress bar
3. **Enterprise Security** - Built-in threat detection, misnamed file identification, security reports
4. **Chinese Localization** - Full Chinese documentation and error messages
5. **Enhanced Filtering** - Filter results by type, group, extension
---
## ⨠Core Features
### đ Feature List
| Feature | Description | Status |
|---------|-------------|--------|
| đ Single File Detection | bytes / stream / path input methods | â
|
| đ Batch Directory Scan | Recursive scan, extension filtering | â
|
| ⥠Async Concurrent Processing | Large-scale parallel scanning with progress | â
|
| đ Security Threat Detection | Identify malware, executables, scripts | â
|
| đ¨ Misnamed File Detection | Detect extension/content mismatch | â
|
| đ Security Report Generation | Generate structured security audit reports | â
|
| đ¯ Multiple Detection Modes | High/Medium/Best confidence modes | â
|
| đ¤ JSON Export | Export results as JSON | â
|
### Tech Stack
```
┌─────────────────────────────────────────────┐
│              Magika SDK Python              │
├─────────────────────────────────────────────┤
│  Magika Core (Google)  │  Python 3.8+       │
│  aiofiles              │  tqdm              │
│  asyncio               │  concurrent.futures│
├─────────────────────────────────────────────┤
│  Supported: Windows / macOS / Linux         │
└─────────────────────────────────────────────┘
```
---
## Quick Start
### Installation
```bash
# Install from PyPI (recommended)
pip install magika-sdk-python
# Or install dev version
pip install git+https://github.com/gitstq/magika-sdk-python.git
# Install dependencies
pip install magika aiofiles tqdm
```
### Requirements
| Environment | Requirement |
|-------------|-------------|
| Python | 3.8+ |
| OS | Windows / macOS / Linux |
| Memory | 4GB+ recommended |
| Disk | ~100MB (including Magika model) |
### Quick Usage
#### 1. Basic Detection
```python
from magika_sdk import MagikaSDK, DetectionMode
# Initialize SDK
sdk = MagikaSDK(mode=DetectionMode.BEST_GUESS)
# Detect single file
result = sdk.detect_file("document.pdf")
print(f"File type: {result.label}") # pdf
print(f"Description: {result.description}") # PDF document
print(f"Confidence: {result.score:.2%}") # 99.50%
# Detect from bytes
bytes_result = sdk.detect_bytes(b'print("Hello")')
print(f"Type: {bytes_result.label}") # python
```
#### 2. Batch Directory Scan
```python
from magika_sdk import MagikaSDK
sdk = MagikaSDK()
# Scan directory
result = sdk.scan_directory("./my_folder")
# Print statistics
print(f"Total files: {result.total_count}")
print(f"Successful: {result.success_count}")
# Filter by type
python_files = result.get_by_label("python")
json_files = result.get_by_group("data")
# Print summary
for label, count in result.summary().items():
    print(f"{label}: {count}")
```
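The per-label summary and group filter used above can be pictured with a small standard-library sketch. The `Detection` dataclass, the `summarize` and `by_group` helpers, and the group names are hypothetical stand-ins for illustration; they are not the SDK's actual result types.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    """Hypothetical stand-in for one per-file detection result."""
    path: str
    label: str
    group: str


def summarize(results: List[Detection]) -> Counter:
    """Count how many files were detected under each label."""
    return Counter(r.label for r in results)


def by_group(results: List[Detection], group: str) -> List[Detection]:
    """Keep only the results that belong to the given group."""
    return [r for r in results if r.group == group]


# Sample data; labels and groups are made up for the example.
results = [
    Detection("a.py", "python", "code"),
    Detection("b.py", "python", "code"),
    Detection("c.json", "json", "data"),
]
summary = summarize(results)
data_files = by_group(results, "data")
```

A `Counter` keeps the summary cheap to build and supports `most_common()` when the output should be sorted by frequency.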
#### 3. Async Batch Processing
```python
import asyncio
from magika_sdk import AsyncMagikaScanner

async def scan():
    scanner = AsyncMagikaScanner(max_workers=20)
    result = await scanner.scan_directory_async(
        "./large_folder",
        recursive=True,
        # Called with (files done, total files) as the scan advances
        progress_callback=lambda done, total: print(f"\rProgress: {done}/{total}", end=""),
    )
    print(f"\nScan complete! {result.total_count} files processed")
    scanner.close()


asyncio.run(scan())
```
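For readers curious how a worker-limited async scan with a progress callback can be structured, here is a minimal standard-library sketch. It is not the SDK's implementation: `detect_stub` is a placeholder that only looks at the extension, and `scan_paths` is a hypothetical helper name.

```python
import asyncio
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional


def detect_stub(path: str) -> str:
    """Placeholder detector: returns the lowercase extension, not a content check."""
    return os.path.splitext(path)[1].lstrip(".").lower() or "unknown"


async def scan_paths(
    paths: List[str],
    max_workers: int = 4,
    progress_callback: Optional[Callable[[int, int], None]] = None,
) -> Dict[str, str]:
    """Run the blocking detector in a bounded thread pool and report progress."""
    loop = asyncio.get_running_loop()
    results: Dict[str, str] = {}
    done = 0

    with ThreadPoolExecutor(max_workers=max_workers) as pool:

        async def worker(path: str) -> None:
            nonlocal done
            # The pool size caps how many detections run at the same time.
            results[path] = await loop.run_in_executor(pool, detect_stub, path)
            done += 1
            if progress_callback:
                progress_callback(done, len(paths))

        await asyncio.gather(*(worker(p) for p in paths))
    return results


progress = []
demo = asyncio.run(
    scan_paths(
        ["a.py", "b.JSON", "README"],
        progress_callback=lambda d, t: progress.append((d, t)),
    )
)
```

The callback runs on the event loop thread, so the counter needs no lock; only the detector itself runs in worker threads.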
#### 4. Security Scanning
```python
from magika_sdk import SecurityScanner
scanner = SecurityScanner()
# Scan directory for threats
report = scanner.scan_directory("./uploads")
# Generate security report
print(scanner.generate_summary(report))
# Get critical threats
critical = report.get_critical_findings()
high_risk = report.get_high_findings()
misnamed = report.get_misnamed_files()
# Export JSON report
import json
print(json.dumps(report.export_report(), indent=2, ensure_ascii=False))
```
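As a rough illustration of what a structured report like the one above might contain, the sketch below maps detected labels to threat levels and serializes the result. The `THREAT_LEVELS` mapping, the `classify` and `build_report` helpers, and the report keys are assumptions made for this example, not the SDK's actual rules or schema.

```python
import json
from collections import Counter
from typing import Dict

# Hypothetical label-to-level mapping, for illustration only.
THREAT_LEVELS: Dict[str, str] = {
    "pebin": "high",      # Windows executable
    "elf": "high",        # Linux executable
    "batch": "high",      # Batch script
    "shell": "medium",    # Shell script
    "javascript": "low",  # Script that may run in a browser
}


def classify(label: str) -> str:
    """Return the assumed threat level for a label, defaulting to 'safe'."""
    return THREAT_LEVELS.get(label, "safe")


def build_report(detections: Dict[str, str]) -> dict:
    """Build a JSON-serializable report from a {path: label} mapping."""
    findings = [
        {"path": path, "label": label, "level": classify(label)}
        for path, label in sorted(detections.items())
    ]
    distribution = Counter(f["level"] for f in findings)
    return {
        "total_files": len(findings),
        "distribution": dict(distribution),
        "findings": findings,
    }


report = build_report({"uploads/backup.bat": "batch", "uploads/notes.txt": "txt"})
report_json = json.dumps(report, indent=2, ensure_ascii=False)
```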
---
## Detailed Usage Guide
### Detection Modes
```python
from magika_sdk import MagikaSDK, DetectionMode
# High confidence mode - only return high confidence results
sdk = MagikaSDK(mode=DetectionMode.HIGH_CONFIDENCE)
# Medium confidence mode - include medium confidence results
sdk = MagikaSDK(mode=DetectionMode.MEDIUM_CONFIDENCE)
# Best guess mode - always return best guess
sdk = MagikaSDK(mode=DetectionMode.BEST_GUESS)
```
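One way to think about the three modes is as score thresholds below which a generic fallback label is returned. The sketch below illustrates that idea only; the threshold values, the `Mode` enum, and the `resolve_label` helper are assumptions for the example and do not reflect the SDK's or Magika's actual values.

```python
from enum import Enum


class Mode(Enum):
    """Hypothetical mirror of the three detection modes."""
    HIGH_CONFIDENCE = "high"
    MEDIUM_CONFIDENCE = "medium"
    BEST_GUESS = "best_guess"


# Illustrative thresholds only; real values may differ per content type.
THRESHOLDS = {
    Mode.HIGH_CONFIDENCE: 0.9,
    Mode.MEDIUM_CONFIDENCE: 0.5,
    Mode.BEST_GUESS: 0.0,
}


def resolve_label(label: str, score: float, mode: Mode, fallback: str = "unknown") -> str:
    """Return the predicted label if its score clears the mode's threshold."""
    return label if score >= THRESHOLDS[mode] else fallback


strict = resolve_label("python", 0.7, Mode.HIGH_CONFIDENCE)
medium = resolve_label("python", 0.7, Mode.MEDIUM_CONFIDENCE)
guess = resolve_label("python", 0.1, Mode.BEST_GUESS)
```

Under these assumed thresholds, a 0.7 score is rejected in the strict mode but accepted in the other two.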
### File Type Filters
```python
from magika_sdk import MagikaSDK

sdk = MagikaSDK()

# Scan only specific extensions
result = sdk.scan_directory(
    "./folder",
    extensions_filter=[".py", ".js", ".json"],
)

# Exclude specific patterns
result = sdk.scan_directory(
    "./folder",
    exclude_patterns=["*.test.*", "node_modules/*"],
)
```
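The two filters can be approximated with the standard library, which may help when reasoning about which files a scan will include. This is a sketch under the assumption that patterns are shell-style globs matched against the whole relative path; `select_paths` is a hypothetical helper, and the SDK's matching rules may differ.

```python
import fnmatch
from pathlib import PurePosixPath
from typing import Iterable, List, Optional


def select_paths(
    paths: Iterable[str],
    extensions_filter: Optional[List[str]] = None,
    exclude_patterns: Optional[List[str]] = None,
) -> List[str]:
    """Keep paths with an allowed extension that match no exclude pattern."""
    allowed = {ext.lower() for ext in extensions_filter} if extensions_filter else None
    selected = []
    for path in paths:
        if allowed is not None and PurePosixPath(path).suffix.lower() not in allowed:
            continue
        # fnmatchcase: '*' also matches '/', so "node_modules/*" covers nested files.
        if any(fnmatch.fnmatchcase(path, pat) for pat in (exclude_patterns or [])):
            continue
        selected.append(path)
    return selected


candidates = [
    "src/app.py",
    "src/app.test.py",
    "node_modules/lib/index.js",
    "data/config.json",
    "README.md",
]
kept = select_paths(
    candidates,
    extensions_filter=[".py", ".js", ".json"],
    exclude_patterns=["*.test.*", "node_modules/*"],
)
```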
### Async Multi-Directory Scan
```python
import asyncio

from magika_sdk import AsyncMagikaScanner


async def scan_multiple():
    scanner = AsyncMagikaScanner(max_workers=10)
    # Scan multiple directories simultaneously
    result = await scanner.scan_multiple_directories([
        "./src",
        "./lib",
        "./tests",
    ])
    print(f"Scanned {result.total_count} files in total")
    scanner.close()


asyncio.run(scan_multiple())
```
### Security Scanner Configuration
```python
from magika_sdk import SecurityScanner
# Enable misnamed file detection
scanner = SecurityScanner(check_misnamed=True)
# Strict mode (stricter threat detection)
scanner = SecurityScanner(strict_mode=True)
```
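Misnamed-file detection boils down to comparing what the extension implies with what the content actually is. The sketch below shows the comparison with a handful of well-known magic-byte signatures standing in for the AI model; `SIGNATURES`, `EXTENSION_LABELS`, `sniff`, and `is_misnamed` are hypothetical names, and the SDK's own logic is likely more thorough.

```python
import os
from typing import Optional

# A few well-known file signatures; a stand-in for model-based detection.
SIGNATURES = {
    b"%PDF-": "pdf",
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"MZ": "pebin",
}

# Hypothetical mapping from extension to the label it implies.
EXTENSION_LABELS = {
    ".pdf": "pdf",
    ".png": "png",
    ".jpg": "jpeg",
    ".jpeg": "jpeg",
    ".exe": "pebin",
}


def sniff(content: bytes) -> Optional[str]:
    """Return a label based on leading magic bytes, or None if unrecognized."""
    for magic, label in SIGNATURES.items():
        if content.startswith(magic):
            return label
    return None


def is_misnamed(filename: str, content: bytes) -> bool:
    """True when both extension and content are recognized and they disagree."""
    expected = EXTENSION_LABELS.get(os.path.splitext(filename)[1].lower())
    actual = sniff(content)
    return expected is not None and actual is not None and expected != actual


disguised = is_misnamed("image.jpg", b"MZ\x90\x00")
genuine = is_misnamed("doc.pdf", b"%PDF-1.7")
unknown_ext = is_misnamed("notes.txt", b"MZ\x90\x00")
```

Files with an unrecognized extension or content are not flagged here, which keeps false positives down at the cost of missing some cases.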
---
## Example Output
### File Detection Result
```
File path: ./samples/document.pdf
File type: pdf
Description: PDF document
MIME type: application/pdf
Confidence: 99.85%
Is text: False
File group: document
```
### Security Scan Report
```
============================================================
Security Scan Report Summary
============================================================
Scan time: 2024-01-15 14:30:00
Total files scanned: 150
Threat Level Distribution:
  Safe: 120
  Low: 15
  Medium: 10
  High: 5
  Critical: 0
High Risk Threats (Immediate Action Required):
  • uploads/backup.bat
    Reason: Batch script file detected
    Recommendation: Check script content for malicious code
Misnamed Files (Extension/Content Mismatch):
  • uploads/image.jpg.exe
    Extension suggests image, but content is executable
============================================================
```
---
## Design Philosophy & Roadmap
### Design Principles
1. **Simplicity First** - One line of code for complex features
2. **Type Safety** - Complete type annotations and type checking
3. **Error Handling** - Robust exception handling and friendly error messages
4. **Performance** - Async processing and concurrency control
### Tech Choices
| Component | Reason |
|-----------|--------|
| Magika Core | Google's production engine; mature and stable |
| asyncio | Python native async support, no extra deps |
| tqdm | Mature progress bar library, great UX |
| aiofiles | Async file I/O for better large file handling |
### Roadmap
- [ ] v1.1.0 - Add file content hashing (MD5/SHA256)
- [ ] v1.2.0 - Support custom model loading
- [ ] v1.3.0 - Add Web service interface (FastAPI)
- [ ] v2.0.0 - CLI tool redesign with better UX
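
The hashing item planned for v1.1.0 is not part of the SDK yet. As a rough idea of what such a feature could look like, here is a sketch using only `hashlib`; the `hash_stream` helper and its return shape are assumptions, not a committed design.

```python
import hashlib
import io
from typing import BinaryIO, Dict


def hash_stream(stream: BinaryIO, chunk_size: int = 65536) -> Dict[str, str]:
    """Compute MD5 and SHA-256 in one pass, reading the stream in chunks."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    # iter() with a sentinel stops at the first empty read (end of stream).
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        md5.update(chunk)
        sha256.update(chunk)
    return {"md5": md5.hexdigest(), "sha256": sha256.hexdigest()}


digests = hash_stream(io.BytesIO(b"hello"), chunk_size=2)
```

Reading in chunks keeps memory use flat for large files, and computing both digests in the same loop avoids reading the file twice.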
---
## Packaging & Deployment
### Build Distribution
```bash
# Clone repository
git clone https://github.com/gitstq/magika-sdk-python.git
cd magika-sdk-python
# Install build and upload dependencies
pip install build twine
# Build wheel and tarball
python -m build
# Upload to PyPI
twine upload dist/*
```
### One-Click Build Script
```bash
# Linux/macOS
./build.sh
# Windows
./build.bat
```
### Publish to GitHub Release
```bash
# Create tag
git tag -a v1.0.0 -m "Release v1.0.0"
# Push tag
git push origin v1.0.0
```
---
## Contributing
Issues and Pull Requests are welcome!
### Commit Convention
```
feat: New feature
fix: Bug fix
docs: Documentation update
refactor: Code refactoring
test: Test cases
chore: Build/tool changes
```
### Development Workflow
1. Fork this repository
2. Create feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to branch (`git push origin feature/AmazingFeature`)
5. Create Pull Request
---
## License
This project is open source under the [MIT License](LICENSE).
---
## Acknowledgments
- [Google Magika](https://github.com/google/magika) - AI file type detection engine
- [aiofiles](https://github.com/Tinche/aiofiles) - Async file I/O
- [tqdm](https://github.com/tqdm/tqdm) - Progress bar component
---
If you find this project helpful, please give it a ⭐ Star!