{"id":31274168,"url":"https://github.com/notfaad/msl-engine","last_synced_at":"2025-09-23T22:40:56.476Z","repository":{"id":308626603,"uuid":"1033496798","full_name":"notFaad/msl-engine","owner":"notFaad","description":"a powerful web scraping engine that uses a custom Domain Specific Language (DSL) to define scraping pipelines.","archived":false,"fork":false,"pushed_at":"2025-08-06T23:07:35.000Z","size":20,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-08-07T01:05:47.955Z","etag":null,"topics":["engine","rust","rust-lang","scraper-api","scraper-engine","scraping-websites","webscraping"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/notFaad.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-08-06T23:00:56.000Z","updated_at":"2025-08-06T23:07:39.000Z","dependencies_parsed_at":"2025-08-07T01:05:49.428Z","dependency_job_id":"b0936ee1-3aad-461c-abbf-53061f0090fa","html_url":"https://github.com/notFaad/msl-engine","commit_stats":null,"previous_names":["notfaad/msl-engine"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/notFaad/msl-engine","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/notFaad%2Fmsl-engine","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/notFaad%2Fmsl-engine/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/notFaad%2Fmsl-engine/releases","manifests_url":"http
s://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/notFaad%2Fmsl-engine/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/notFaad","download_url":"https://codeload.github.com/notFaad/msl-engine/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/notFaad%2Fmsl-engine/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":276662401,"owners_count":25682029,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-23T02:00:09.130Z","response_time":73,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["engine","rust","rust-lang","scraper-api","scraper-engine","scraping-websites","webscraping"],"created_at":"2025-09-23T22:40:55.452Z","updated_at":"2025-09-23T22:40:56.464Z","avatar_url":"https://github.com/notFaad.png","language":"Rust","readme":"## (EXPERIMENTAL!!)\n# MediaScrapeLang Engine (MSL Engine)\n\nA Rust-based web scraping engine with a custom DSL (Domain Specific Language) for defining scraping pipelines.\n\n## 🚀 Features\n\n- **Custom DSL**: Write scraping scripts in a minimal, readable language\n- **Link Traversal**: Follow links and extract data from multiple pages\n- **Variable Extraction**: Extract text and attributes from HTML elements\n- **Media Discovery**: Find and download images, videos, and audio files\n- **Filtering**: Filter media by source URL patterns and file 
extensions\n- **Async Processing**: Built with async Rust for efficient concurrent scraping\n- **CLI Interface**: Easy-to-use command-line tool\n\n## 📝 DSL Syntax\n\nThe MediaScrapeLang (MSL) DSL is designed to be minimal and readable:\n\n```msl\nopen \"https://example.com/users\"\n\nclick \".user-card a\"\n  set user = text\n\n  click \".post-list a\"\n    set post = attr(\"href\").split(\"/\")[-1]\n\n    media\n      image\n        where src ~ \"cdn.example.com\"\n        extensions jpg, png\n\n      video\n        where src ~ \"cdn.example.com\"\n        extensions mp4, webm\n\n    save to \"./media/{user}/{post}\"\n```\n\n### Commands\n\n- `open \"url\"` - Navigate to a URL\n- `click \"selector\"` - Click/follow links matching a CSS selector\n- `set variable = value` - Extract and store a value\n- `media` - Define media extraction blocks\n- `save to \"path\"` - Save extracted media to a path\n\n### Values\n\n- `text` - Extract text content\n- `attr(\"name\")` - Extract attribute value\n- `attr(\"name\").split(\"/\")[-1]` - Extract and process attribute\n\n### Media Filters\n\n- `where src ~ \"pattern\"` - Filter by source URL pattern\n- `extensions jpg, png` - Filter by file extensions\n\n## 🛠️ Installation\n\n```bash\n# Clone the repository\ngit clone \u003crepository-url\u003e\ncd msl-engine\n\n# Build the project\ncargo build --release\n\n# Install globally (optional)\ncargo install --path .\n```\n\n## 📖 Usage\n\n### Command Line Interface\n\n```bash\n# Run a script\nmsl run script.msl\n\n# Parse and validate a script without executing\nmsl parse script.msl\n\n# Enable verbose output\nmsl run script.msl --verbose\n```\n\n### Programmatic Usage\n\n```rust\nuse msl_engine::{run_script, MslEngine};\n\n#[tokio::main]\nasync fn main() -\u003e anyhow::Result\u003c()\u003e {\n    let script = r#\"\n        open \"https://example.com\"\n        click \".user-card a\"\n          set user = text\n        media\n          image\n            where src ~ 
\"cdn.example.com\"\n            extensions jpg, png\n          save to \"./media/{user}\"\n    \"#;\n    \n    run_script(script).await?;\n    Ok(())\n}\n```\n\n## 🏗️ Architecture\n\nThe MSL Engine is built with a modular architecture:\n\n- **Parser** (`src/parser/`): Parses MSL scripts into structured AST\n- **Scraper** (`src/scraper/`): Handles HTTP requests and HTML parsing\n- **Engine** (`src/engine/`): Orchestrates the scraping process\n- **CLI** (`src/cli/`): Command-line interface\n\n### Key Components\n\n- **MslScript**: Represents a parsed MSL script\n- **MslEngine**: Main execution engine\n- **Scraper**: HTTP client and HTML parser\n- **MediaItem**: Represents discovered media files\n\n## 🧪 Testing\n\n```bash\n# Run all tests\ncargo test\n\n# Run with output\ncargo test -- --nocapture\n```\n\n## 📦 Dependencies\n\n- **reqwest**: HTTP client\n- **scraper**: HTML parsing\n- **nom**: Parser combinator library\n- **tokio**: Async runtime\n- **clap**: CLI argument parsing\n- **anyhow**: Error handling\n- **tracing**: Logging\n\n## 🚧 Development Status\n\n- ✅ Parser implementation\n- ✅ Basic scraper functionality\n- ✅ Engine orchestration\n- ✅ CLI interface\n- 🔄 Variable templating in save paths\n- 🔄 Advanced media filtering\n- 🔄 Parallel processing\n- 🔄 Headless browser support\n\n## 🤝 Contributing\n\n1. Fork the repository\n2. Create a feature branch\n3. Make your changes\n4. Add tests\n5. 
Submit a pull request\n\n## 📄 License\n\nThis project is licensed under the MIT License - see the LICENSE file for details.\n\n## 🎯 Roadmap\n\n- [ ] Variable templating in save paths\n- [ ] Advanced media filtering options\n- [ ] Parallel processing for multiple pages\n- [ ] Headless browser support (JavaScript rendering)\n- [ ] Retry logic and error handling\n- [ ] Rate limiting and polite scraping\n- [ ] Export to different formats (JSON, CSV)\n- [ ] Web interface for script editing\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnotfaad%2Fmsl-engine","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnotfaad%2Fmsl-engine","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnotfaad%2Fmsl-engine/lists"}