{"id":28717444,"url":"https://github.com/ziadea/smartwebscraper-cv","last_synced_at":"2026-03-15T14:03:11.806Z","repository":{"id":285600476,"uuid":"955285527","full_name":"ZIADEA/SmartWebScraper-CV","owner":"ZIADEA","description":"SmartWebScraper-CV – AI-Powered Web Page Zone Detection. SmartWebScraper-CV is an advanced Computer Vision and NLP project that combines visual scraping of web pages, automatic zone detection (headers, footers, ads, content, etc.), OCR, and an interactive NLP module, along with the integration of two LLMs, Mistral and Gemini.","archived":false,"fork":false,"pushed_at":"2025-06-13T16:03:20.000Z","size":2780,"stargazers_count":1,"open_issues_count":3,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-13T16:59:16.053Z","etag":null,"topics":["ai","annotations","application","computer-vision","detectron2","gemini","mistral","nlp","ocr","ollama","padde","paddelocr","roboflow","roboflow-dataset","spacy","web","web-application","web-scraping","workflow"],"latest_commit_sha":null,"homepage":"","language":"Jupyter 
Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ZIADEA.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-03-26T12:04:34.000Z","updated_at":"2025-06-13T16:03:23.000Z","dependencies_parsed_at":"2025-04-01T17:43:01.818Z","dependency_job_id":"279b5beb-411d-4cd1-b70b-910d30741493","html_url":"https://github.com/ZIADEA/SmartWebScraper-CV","commit_stats":null,"previous_names":["ziadea/smartwebscraper-cv"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ZIADEA/SmartWebScraper-CV","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZIADEA%2FSmartWebScraper-CV","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZIADEA%2FSmartWebScraper-CV/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZIADEA%2FSmartWebScraper-CV/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZIADEA%2FSmartWebScraper-CV/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ZIADEA","download_url":"https://codeload.github.com/ZIADEA/SmartWebScraper-CV/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ZIADEA%2FSmartWebScraper-CV/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259919293,"owners_count":22932067,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/G
itHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","annotations","application","computer-vision","detectron2","gemini","mistral","nlp","ocr","ollama","padde","paddelocr","roboflow","roboflow-dataset","spacy","web","web-application","web-scraping","workflow"],"created_at":"2025-06-15T04:00:33.388Z","updated_at":"2026-03-15T14:03:11.799Z","avatar_url":"https://github.com/ZIADEA.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"## 🎥 Demo\n\nWatch a complete demonstration of the **SmartWebScraper-CV** application in action:\n\n[![Watch the demo](https://img.youtube.com/vi/TIGsxCHhcps/0.jpg)](https://youtu.be/TIGsxCHhcps)\n\n\n## 🚀 Quick Installation\n\n### Prerequisites\n- Python 3.9 or later\n- 8+ GB RAM (16 GB recommended)\n- NVIDIA GPU optional (improves performance)\n\n### Installation\n\n```bash\n# Clone the project\ngit clone https://github.com/ZIADEA/SmartWebScraper-CV.git\ncd SmartWebScraper-CV/LocalApp/SMARTWEBSCRAPPER-CV\n\n# Virtual environment (optional)\npython -m venv venv\nsource venv/bin/activate  # Linux/Mac\n# or venv\\Scripts\\activate  # Windows\n\n# Dependencies\npip install -r requirements.txt\n\n# NLP models\npython -c \"import nltk; nltk.download('punkt'); nltk.download('stopwords')\"\npython -m spacy download fr_core_news_sm\n\n# Detectron2 (see the OS-specific guides)\npip install 'git+https://github.com/facebookresearch/detectron2.git'\n```\n\n### Configuration\n\n1. **Create the `.env` file with your Gemini API key:**\n```bash\nGEMINI_API_KEY=your-gemini-api-key\nADMIN_EMAIL=admin@example.com\nADMIN_PASSWORD=your_password\n```\n\n2. **Install and start Ollama (optional):**\n```bash\n# Download from https://ollama.com/\nollama run mistral\n```\n\n3. 
**Launch the application:**\n```bash\npython run.py\n```\n\n4. **Open the interfaces:**\n- User interface: http://localhost:5000\n- Admin interface: http://localhost:5000/admin/login\n\n## 📖 Full Documentation\n\nThe full documentation is available on ReadTheDocs:\n\n[![Documentation](https://img.shields.io/badge/docs-ReadTheDocs-blue)](https://smartwebscraper-cv.readthedocs.io/)\n\n### 📚 Main Sections\n\n- [⚙️ Local Installation](docs/source/installation/local.rst)\n- [🚀 Distribution and Support](docs/source/deployment/distribution.rst)\n- [🎯 Introduction and Context](docs/source/introduction/contexte.rst)\n- [📊 Data Acquisition](docs/source/data/constitution.rst)\n- [🏷️ Annotation and Dataset](docs/source/annotation/objectifs.rst)\n- [🤖 Computer Vision Modeling](docs/source/modeling/detection.rst)\n- [📝 NLP Processing](docs/source/nlp/traitement.rst)\n- [🏗️ Application Architecture](docs/source/architecture/structure.rst)\n\n# SmartWebScraper-CV\n\n[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![Documentation Status](https://readthedocs.org/projects/smartwebscraper-cv/badge/?version=latest)](https://smartwebscraper-cv.readthedocs.io/fr/latest/?badge=latest)\n\n\u003e **Intelligent web page annotation application using Computer Vision, OCR, NLP, and LLMs**\n\nEnd-of-year project - ENSAM Meknès, IATD-SI program  \n**Authors:** DJERI-ALASSANI OUBENOUPOU \u0026 EL MAJDI WALID  \n**Supervisor:** Professor Tawfik MASROUR  \n**Date:** 16 June 2025\n\n## 🎯 Project Overview\n\nSmartWebScraper-CV extracts web content by combining several AI technologies to analyze pages visually, which sidesteps the limitations of traditional scraping (HTML obfuscation, JavaScript, dynamic content).\n\n### 🚀 Key Features\n\n- **🖼️ Smart Capture**: Automatic screenshots with handling of dynamic content\n- **🎯 Automatic Detection**: 18 functional zones detected (header, content, ads, etc.)\n- **📝 Advanced OCR**: Accurate text extraction with PaddleOCR\n- **🧠 NLP Analysis**: Summarization, Q\u0026A, and entity extraction with NLTK/spaCy\n- **🤖 Generative AI**: Gemini API and Mistral (Ollama) integration\n- **👨‍💼 Dual Interface**: End user + administrator for validation\n\n## 🏗️ Technical Architecture\n\n```mermaid\nflowchart TD\n    A[Web URL] --\u003e B[Selenium/Playwright Capture]\n    B --\u003e C[Computer Vision Detection]\n    C --\u003e D[Annotated Zones]\n    D --\u003e E[OCR Extraction]\n    E --\u003e F[NLP Processing]\n    F --\u003e G[User Interface]\n    G --\u003e H[Feedback \u0026 Improvement]\n    H --\u003e I[Model Fine-tuning]\n```\n\n### 🛠️ Technologies Used\n\n| Domain | Technologies |\n|---------|-------------|\n| **Computer Vision** | Detectron2, Faster R-CNN, COCO annotations |\n| **OCR** | PaddleOCR, OpenCV preprocessing |\n| **NLP** | NLTK, spaCy, TF-IDF, Word2Vec, clustering |\n| **LLM** | Gemini API, Mistral via Ollama |\n| **Web Framework** | Flask, HTML5 Canvas, responsive interface |\n| **Web Scraping** | Selenium, undetected-chromedriver, Playwright |\n\n## 📊 Performance\n\n| Metric | Score | Description |\n|----------|-------|-------------|\n| **Detection mAP** | 41.6% | Mean average precision for object detection |\n| **OCR Quality** | \u003e90% | Extraction rate on sharp text |\n| **Processing Time** | 4-6s | Capture + detection + OCR + NLP |\n| **Detected Classes** | 18+1 | Functional web zones |\n\n## 🚀 Quick Installation\n\n### Prerequisites\n- Python 3.9-3.10\n- 8+ GB RAM (16 GB recommended)\n- NVIDIA GPU optional (improves performance)\n\n### Automatic Installation\n\n```bash\n# Clone the project\ngit clone https://github.com/ZIADEA/SmartWebScraper-CV.git\ncd 
SmartWebScraper-CV/LocalApp/SMARTWEBSCRAPPER-CV\n\n# Automatic installation\npython setup.py\n```\n\n### Manual Installation\n\n```bash\n# Virtual environment\npython -m venv venv\nsource venv/bin/activate  # Linux/Mac\n# or venv\\Scripts\\activate  # Windows\n\n# Dependencies\npip install -r requirements.txt\n\n# NLP models\npython -c \"import nltk; nltk.download('punkt'); nltk.download('stopwords')\"\npython -m spacy download fr_core_news_sm\n\n# Detectron2\npip install 'git+https://github.com/facebookresearch/detectron2.git'\n```\n\n### Configuration\n\n1. **Copy the configuration file:**\n```bash\ncp .env.example .env\n```\n\n2. **Set the API keys in `.env`:**\n```bash\nGEMINI_API_KEY=your-gemini-api-key\nSERPAPI_KEY=your-serpapi-key\nOLLAMA_BASE_URL=http://localhost:11434\n```\n\n3. **Launch the application:**\n```bash\npython run.py\n```\n\n4. **Open the interfaces:**\n- User interface: http://localhost:5000\n- Admin interface: http://localhost:5000/admin/login\n\n## 📖 Full Documentation\n\nThe full documentation is available on ReadTheDocs:\n\n[![Documentation](https://img.shields.io/badge/docs-ReadTheDocs-blue)](https://smartwebscraper-cv.readthedocs.io/)\n\n### 📚 Main Sections\n\n- [🎯 Introduction and Context](docs/source/introduction/contexte.rst)\n- [📊 Data Acquisition](docs/source/data/constitution.rst)\n- [🏷️ Annotation and Dataset](docs/source/annotation/objectifs.rst)\n- [🤖 Computer Vision Modeling](docs/source/modeling/detection.rst)\n- [📝 NLP Processing](docs/source/nlp/traitement.rst)\n- [🏗️ Application Architecture](docs/source/architecture/structure.rst)\n- [⚙️ Local Installation Guide](docs/source/installation/local.rst)\n- [🔄 Complete Workflow](docs/source/usage/workflow.rst)\n- [🚀 Distribution and Support](docs/source/deployment/distribution.rst)\n\n## 🎮 Usage\n\n### User Interface\n\n1. **URL Submission**: Enter the URL of the page to analyze\n2. 
**Automatic Capture**: Screenshot and zone detection\n3. **Zone Selection**: Choose the elements to analyze\n4. **Content Extraction**: Automatic OCR and NLP processing\n5. **Intelligent Interaction**: Questions, summaries, analyses\n\n### Administrator Interface\n\n1. **Annotation Validation**: Quality control of predictions\n2. **Manual Correction**: Improvement of the training data\n3. **Fine-tuning**: Relaunch training with the new data\n4. **Metrics**: Monitoring of system performance\n\n## 🔧 API and Integration\n\n### REST API\n\n```python\n# Example API usage\nimport requests\n\n# Capture and analyze a page\nresponse = requests.post(\n    'http://localhost:5000/api/analyze',\n    json={'url': 'https://example.com'},\n)\nresponse.raise_for_status()  # fail early on HTTP errors\n\nresult = response.json()\n# result contains: detected zones, extracted text, metadata\n```\n\n### Docker\n\n```bash\n# Build the image\ndocker build -t smartwebscraper .\n\n# Run with persistent volumes\ndocker run -p 5000:5000 -v $(pwd)/data:/app/data smartwebscraper\n```\n\n## 🎯 Use Cases\n\n- **🔍 UX/UI Analysis**: Automatic detection of advertising zones\n- **📚 Academic Research**: Building annotated text corpora\n- **🤖 AI Training**: COCO dataset of real web pages\n- **📊 Competitive Intelligence**: Automated content extraction\n- **♿ Accessibility**: Improved navigation for visually impaired users\n\n## 🛡️ Troubleshooting\n\n### Common Errors\n\n| Error | Solution |\n|--------|----------|\n| `CUDA out of memory` | Set `FORCE_CPU_MODE=True` in `.env` |\n| `ModuleNotFoundError: detectron2` | Reinstall from GitHub: `pip install 'git+https://github.com/facebookresearch/detectron2.git'` |\n| `spaCy model not found` | `python -m spacy download fr_core_news_sm` |\n| `PaddleOCR download failed` | Check your Internet connection and retry |\n\n### Performance\n\n```bash\n# Forced CPU mode (low-resource machines)\nexport FORCE_CPU_MODE=True\n\n# Verbose logs for debugging\nexport FLASK_DEBUG=1\npython run.py 2\u003e\u00261 | tee logs/debug.log\n```\n\n## 🤝 Contributing\n\nContributions are welcome! Here is how to take part:\n\n1. **Fork** the project\n2. **Create** a branch (`git checkout -b feature/amélioration`)\n3. **Commit** your changes (`git commit -am 'Add new feature'`)\n4. **Push** to the branch (`git push origin feature/amélioration`)\n5. **Open** a Pull Request\n\n### 🐛 Reporting a Bug\n\nUse [GitHub Issues](https://github.com/ZIADEA/SmartWebScraper-CV/issues) and include:\n- A detailed description of the problem\n- Steps to reproduce\n- Error logs\n- System configuration\n\n## 📈 Roadmap\n\n### Version 2.0 (Planned)\n- [ ] Full multilingual support\n- [ ] Public REST API\n- [ ] Responsive mobile interface\n- [ ] Reinforcement learning (RLHF)\n- [ ] Real-time analytics dashboard\n\n### Version 2.1 (Future)\n- [ ] Cloud integration (AWS, GCP)\n- [ ] Batch processing mode\n- [ ] Browser plugin\n- [ ] Web video support\n\n## 📄 License\n\nThis project is released under the MIT License. 
See the [LICENSE](LICENSE) file for details.\n\n## 🙏 Acknowledgments\n\n- **ENSAM Meknès** - Academic setting and infrastructure\n- **Professor Tawfik MASROUR** - Supervision and advice\n- **Facebook AI Research** - Detectron2 framework\n- **Google** - Gemini API and base models\n- **Open Source Community** - Tools and libraries\n\n## 📞 Contact\n\n- **DJERI-ALASSANI OUBENOUPOU** - djeryala@gmail.com\n- **Documentation** - [ReadTheDocs](https://smartwebscraper-cv.readthedocs.io/)\n\n---\n\n\u003cdiv align=\"center\"\u003e\n\n**🎓 ENSAM Meknès - IATD-SI 2025**  \n*Ingénierie de l'Intelligence Artificielle et des Technologies de la Donnée pour les Systèmes Industriels*\n\n\u003c/div\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fziadea%2Fsmartwebscraper-cv","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fziadea%2Fsmartwebscraper-cv","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fziadea%2Fsmartwebscraper-cv/lists"}