{"id":30598300,"url":"https://github.com/leopardslab/crawlerx","last_synced_at":"2025-08-29T22:13:20.597Z","repository":{"id":37059522,"uuid":"263531267","full_name":"leopardslab/CrawlerX","owner":"leopardslab","description":"CrawlerX - Develop Extensible, Distributed, Scalable Crawler System which is a web platform that can be used to crawl URLs in different kind of protocols in a distributed way.","archived":false,"fork":false,"pushed_at":"2023-02-14T21:54:53.000Z","size":12413,"stargazers_count":18,"open_issues_count":41,"forks_count":13,"subscribers_count":4,"default_branch":"master","last_synced_at":"2023-03-04T00:18:45.335Z","etag":null,"topics":["django-backend","elasticsearch","firebase-auth","message-broker","mongodb-server","vuejs","web-crawling"],"latest_commit_sha":null,"homepage":"","language":"SCSS","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/leopardslab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-05-13T05:11:14.000Z","updated_at":"2023-02-11T19:29:55.000Z","dependencies_parsed_at":"2023-02-08T08:15:52.775Z","dependency_job_id":null,"html_url":"https://github.com/leopardslab/CrawlerX","commit_stats":null,"previous_names":[],"tags_count":null,"template":null,"template_full_name":null,"purl":"pkg:github/leopardslab/CrawlerX","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/leopardslab%2FCrawlerX","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/leopardslab%2FCrawlerX/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/leopardslab%2FCrawlerX/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/r
epositories/leopardslab%2FCrawlerX/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/leopardslab","download_url":"https://codeload.github.com/leopardslab/CrawlerX/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/leopardslab%2FCrawlerX/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":272772763,"owners_count":24990514,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-29T02:00:10.610Z","response_time":87,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["django-backend","elasticsearch","firebase-auth","message-broker","mongodb-server","vuejs","web-crawling"],"created_at":"2025-08-29T22:13:20.145Z","updated_at":"2025-08-29T22:13:20.592Z","avatar_url":"https://github.com/leopardslab.png","language":"SCSS","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n# CrawlerX - Develop Extensible, Distributed, Scalable Crawler System\n\nCrawlerX is a platform for crawling web URLs over different kinds of protocols in a distributed way. Web crawling, often called web scraping, is a method of programmatically going over a collection of web pages and extracting data that is useful for analysis of web-based data. 
With a web scraper, you can mine data about a set of products, get a large corpus of text or quantitative data to play around with, get data from a site without an official API, or just satisfy your own personal curiosity.\n\n![Architecture Diagram](resources/abstract_architecture.jpg)\n\nCrawlerX includes the following runtimes to do the crawling jobs for you.\n\n- **VueJS Frontend** - Dashboard that users interact with\n- **Firebase** - User authorization \u0026 authentication\n- **Django Backend Server** - Exposes API endpoints for the frontend\n- **RabbitMQ Server** - Message broker\n- **Celery Beat and Workers** - Job scheduler and executor\n- **Scrapy Server** - Extracts the data you need from websites\n- **MongoDB Server** - Stores crawled data\n- **ElasticSearch** - Job and query search\n\n## CrawlerX Dashboard\n\nThe CrawlerX dashboard gives you an overview of completed and in-progress crawl projects and jobs, along with their status.\n\n![Dashboard](resources/dashboard.png)\n\n## Crawl Job Scheduling\n\nIn CrawlerX, you can schedule crawl jobs in three ways. 
\n\n- **Instant Scheduler** - The crawl job runs immediately\n- **Interval Scheduler** - The crawl job runs at a fixed interval\n- **Cron Scheduler** - The crawl job runs on a cron schedule\n\n![Job Scheduler](resources/job_schedule.png)\n\n### Prerequisites\nFirst, edit the `.env` file in the `crawlerx_app` root directory with your web app's Firebase configuration details.\n```\nVUE_APP_FIREBASE_API_KEY = \"\u003cyour-api-key\u003e\"\nVUE_APP_FIREBASE_AUTH_DOMAIN = \"\u003cyour-auth-domain\u003e\"\nVUE_APP_FIREBASE_DB_DOMAIN= \"\u003cyour-db-domain\u003e\"\nVUE_APP_FIREBASE_PROJECT_ID = \"\u003cyour-project-id\u003e\"\nVUE_APP_FIREBASE_STORAGE_BUCKET = \"\u003cyour-storage-bucket\u003e\"\nVUE_APP_FIREBASE_MESSAGING_SENDER_ID= \"\u003cyour-messaging-sender-id\u003e\"\nVUE_APP_FIREBASE_APP_ID = \"\u003cyour-app-id\u003e\"\nVUE_APP_FIREBASE_MEASURMENT_ID = \"\u003cyour-measurementId\u003e\"\n```\n\n### Setup on Container-Based Environments\n\n#### Kubernetes Helm Deployment\n\n[See the helm deployment documentation](crawlerx_helm/README.md)\n\n#### Docker Compose\n\nRun the following command to set up CrawlerX in a container environment.\n\n```sh\ndocker-compose up --build\n```\n\nOpen http://localhost:8080 to view the CrawlerX web UI in the browser.\n\n### Setup on the VM-Based Environment\n\nFollow the steps below to set up CrawlerX in your VM-based environment.\n\nStart RabbitMQ broker\n\n```sh\n$ docker run -d --hostname my-rabbit --name some-rabbit -p 8080:15672 rabbitmq:3-management\n```\n\nStart MongoDB Server\n\n```sh\n$ docker run -d -p 27017:27017 --name some-mongo \\\n    -e MONGO_INITDB_ROOT_USERNAME=\u003cusername\u003e \\\n    -e MONGO_INITDB_ROOT_PASSWORD=\u003cpassword\u003e \\\n    mongo\n```\n\nStart Scrapy Daemon (after installing Scrapyd)\n\n```sh\n$ cd scrapy_app\n$ scrapyd\n```\n\nStart ElasticSearch\n```sh\n$ docker run -p 9200:9200 -p 9300:9300 -e 
\"discovery.type=single-node\" elasticsearch:7.8.1\n```\n\nStart Celery Beat\n```sh\n$ cd crawlerx_server\n$ celery -A crawlerx_server beat -l INFO\n```\n\nStart Celery Worker\n```sh\n$ cd crawlerx_server\n$ celery -A crawlerx_server worker -l INFO\n```\n\nStart the Django backend :\n\n```sh\n$ pip install django\n$ cd crawlerx_server\n$ python3 manage.py runserver\n```\n\nStart the frontend :\n\n```sh\n$ cd crawlerx_app\n$ npm install\n$ npm start\n```\n\n### Todos\n\n- Tor URL crawler\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fleopardslab%2Fcrawlerx","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fleopardslab%2Fcrawlerx","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fleopardslab%2Fcrawlerx/lists"}