{"id":24910235,"url":"https://github.com/priyakdey/github-api-crawler","last_synced_at":"2026-05-04T03:34:38.774Z","repository":{"id":108392134,"uuid":"415330966","full_name":"priyakdey/github-api-crawler","owner":"priyakdey","description":"A crawler to crawl and save the APIs  found in the Public APIs github repo - https://github.com/public-apis/public-apis. Visit README for details.","archived":false,"fork":false,"pushed_at":"2021-10-13T20:08:23.000Z","size":43,"stargazers_count":1,"open_issues_count":1,"forks_count":2,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-28T01:57:49.094Z","etag":null,"topics":["api","crawler","mongo","python3"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/priyakdey.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-10-09T14:24:30.000Z","updated_at":"2021-11-25T18:52:01.000Z","dependencies_parsed_at":"2023-04-15T20:16:13.228Z","dependency_job_id":null,"html_url":"https://github.com/priyakdey/github-api-crawler","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/priyakdey/github-api-crawler","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/priyakdey%2Fgithub-api-crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/priyakdey%2Fgithub-api-crawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/priyakdey%2Fgithub-api-crawler/releases","manifests_url":"https://repos.e
cosyste.ms/api/v1/hosts/GitHub/repositories/priyakdey%2Fgithub-api-crawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/priyakdey","download_url":"https://codeload.github.com/priyakdey/github-api-crawler/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/priyakdey%2Fgithub-api-crawler/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32593944,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-03T22:12:39.696Z","status":"online","status_checked_at":"2026-05-04T02:00:06.625Z","response_time":58,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["api","crawler","mongo","python3"],"created_at":"2025-02-02T03:34:58.571Z","updated_at":"2026-05-04T03:34:38.769Z","avatar_url":"https://github.com/priyakdey.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Public APIs List Crawler\n\n---\n\n**github-api-crawler** is a console based application which crawls a github repository to get the api data for each category\nand store it in a database.\n\n\n###### Problem Statement\n\n---\n\nPublic APIs [github repo](https://github.com/public-apis/public-apis) is a collective list of free APIs for use in\nsoftware and web development.\n\nOn the landing page of the repo, there are some list of categories for e.g. 
Animals, Art \u0026 Design,\nBusiness etc.\nEach category has some API Details, e.g. for Animals:\n\n```json\n{\n  \"API\": \"Cat Facts\",\n  \"Link\": \"https://alexwohlbruck.github.io/cat-facts/\",\n  \"Description\": \"Daily cat facts\", \n  \"Auth\": \"No\", \n  \"HTTPS\": \"Yes\",\n  \"CORS\": \"No\"\n}\n```\n\nThe application should crawl each category, fetch the API Details and store them in a database.\n\n- **Rate Limiting**  - All requests to the above hosts are limited to **10 requests/minute**.\n- **Authentication** - Each request needs a Bearer Token for authentication. Each token expires after 5 minutes.\n- **Get Token**      - GET https://public-apis-api.herokuapp.com/api/v1/auth/token\n- **Get categories** - GET https://public-apis-api.herokuapp.com/api/v1/apis/categories?page=1\n- **Get api data**   - GET https://public-apis-api.herokuapp.com/api/v1/apis/entry?page=1\u0026category=Animals\n\n\n**Complete detailed documentation can be found in the \n[Postman documentation](https://documenter.getpostman.com/view/4796420/SzmZczsh?version=latest).**\n\n**NOTE**: Do not use any other APIs or scraping method to get the data.\n\n\n###### Points to achieve\n\n---\n\n- Code should follow OOP concepts\n- Support for handling the server's authentication requirements \u0026 token expiration\n- Support for pagination to get all data\n- Develop a workaround for the rate-limited server\n- Crawl all API entries for all categories and store them in a database\n\n\n###### Steps to run code\n\n---\n\nThe application is built using:\n- python 3.9.0\n- docker (version: 20.10.8)\n- docker-compose (version 1.29.2)\n\nFor a local run, docker and docker-compose are prerequisites. 
Documentation for installation can be found \n[here](https://www.docker.com/get-started) \n\n\nOnce you have docker and docker-compose installed, cd into the project directory (assuming you cloned this project locally)\nand run `docker-compose up` to start the complete stack.\n\nThis command will run a mongo-db container, a mongo-express server (to visually see the data) and the application. \nYou can check the logs to understand the flow of the application.\n\nOnce completed, you can check the data visually by visiting `localhost:8081`, where mongo-express is running;\nit provides a UI for browsing the mongodb data.\n\n\nRefer to [Dockerfile](https://github.com/priyakdey/github-api-crawler/blob/master/Dockerfile) to understand how the image is created.\n*The image is currently under my personal namespace (for obvious reasons), so in case you are building the image locally\nand trying it out, do change the namespace in the docker-compose file as well. Later, I shall change the compose file to build\nthe image from the Dockerfile itself.*\n\n\n###### Steps to run/debug code locally\n\n---\n\nYou can run the services using docker-compose-local.yaml - `docker-compose -f docker-compose-local.yaml up`\nwhich will run a mongo db database and a mongo-express server to check the data.\nOnce the setup is done, you can run the code using your favourite IDE.\n\n\nAlso, change line [number 5](https://github.com/priyakdey/github-api-crawler/blob/master/crawler/constants.py#L5) to: \n`DB_CONN_STRING = \"mongodb://admin:password@localhost:27017/\"`\n\n- Create and activate your virtual env\n- Install the dependencies by running - `pip install -r requirements-dev.txt`\n\n**NOTE** - Ignore the linux dev requirements file; it contains lots of stuff needed specifically for my wsl setup and vim\nto work. 
So ignore that!\n\n###### Improvements\n\n---\n\nSince this was a project asked for an interview review to a friend, I am not going to post the complete question,\nbut I followed the instructions and am adding improvements which I think can be done given more time. (A weekend was given for this.)\n\n1. **Configuration Driven** - The database URLs will differ between environments. \nOne example is changing the URL in the constants.py file when running locally and not as a stack, which I do not like.\nSo the URLs need to be config driven. Open issue [#21](https://github.com/priyakdey/github-api-crawler/issues/21)\nis there for this.\n2. **Performance** - Though python is not a good multithreaded platform, I can leverage multiprocessing and implement \na Pub-Sub model to speed up the process: the producer pushes the data to a Pipe (I can use SQS maybe and in that case\nswitch to Dynamo or still use a Mongo service) while the consumer keeps pushing the data to the DB. This might not give a huge\nperformance benefit, especially on a local db, since currently, after the data fetch, collection.insert_many takes a few ms\nto load the complete data, but in a real deployment with cloud services and geo-location, this might be an advantage.\n3. **Design Patterns** - I need to revisit the complete design and check for more pythonic code and optimisations that can be done.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpriyakdey%2Fgithub-api-crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpriyakdey%2Fgithub-api-crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpriyakdey%2Fgithub-api-crawler/lists"}