# snakesearch
a very small search engine

## Why would you build a small search engine in python?

Search engines are fundamental pieces of software. Fortunes have been built and lost with them.
They are also the kind of system that lets one understand the dynamics of key internet tech.
Analyzing these kinds of systems makes you think more deeply about What and How these systems are built.

## What would you build a small search engine in python?

Google says a search engine has three phases: __Crawl, Index and Search__.

### Crawl

The first step to building a search engine is to have data to search. Depending on your use case, you can crawl existing data (as Google does). What do we mean by crawl?

Build a program that

- fetches the HTML of some web page
- puts it all in a database
- repeats

Oh, and while we're at it, let's make it use some kind of concurrency (threads, multiple processes, or async/await).

### Index

You build an index, an inverted index, from the stuff you collected during the crawl.

An inverted index is a data structure that maps keywords to documents. This data structure makes it trivial to find documents where a certain word appears.
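That mapping can be sketched in a few lines of Python. This is a minimal illustration, not the engine's actual code; the tokenizer (lowercase, split on whitespace) and the sample documents are invented for the example:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of document ids it appears in."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

# Two tiny example "documents"
docs = {
    "a.html": "snakes are long",
    "b.html": "search engines index documents",
}
index = build_inverted_index(docs)
print(index["snakes"])  # {'a.html'}
```

Looking up a word is now a single dictionary access instead of a scan over every document.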
When a user searches for some query, the inverted index is used to retrieve all the documents that match the keywords in the query.

### Search

Maybe a simple website that shows the results of searching the index we've built.
You know, some HTML pages and a little bit of API.

## How would you build a small search engine in python?

Say we want to take a search engine, modify it, and try out these ideas.
We should start with a search engine, get it running, and come back here.

Here is one now... [SnakeSearch](snakesearch/README.md)
It fetches a bunch of URLs of blog feeds (RSS), and indexes them.
How could we make it more general, by downloading general web pages?
And more specifically, what are the answers to these questions?

- What does the `engine/SearchEngine` do?
- What about `download_content.py`?
- Finally, where's all the data being kept?

Another way to look at the problem... [Building a full-text search engine in 150 lines of Python code](https://bart.degoe.de/building-a-full-text-search-engine-150-lines-of-code/)
This one doesn't have any UI associated with it.
But it does some interesting things with `wikipedia` data.

## What's important?

How do we decide what's important in these collections of documents?

Well now... there is this Thing that gets used to decide on the importance of documents.

![idf inverse doc frequency](idf-inverse-doc-freq.png)

What the hell is that?

Turns out, it's very important. [Inverse Document Frequency and the Importance of Uniqueness](https://moz.com/blog/inverse-document-frequency-and-the-importance-of-uniqueness)

There is also [BM25](https://en.wikipedia.org/wiki/Okapi_BM25).
Is it used in either of our engine examples?

If you had thought of it first, you'd be the billionaire: there is [PageRank](https://en.wikipedia.org/wiki/PageRank), which made Google, well, Google.
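As a closing sketch, the classic form of that inverse-document-frequency idea (the log of total documents over documents containing the term) is a one-liner. The toy corpus below is invented for illustration; real engines use smoothed variants of this formula:

```python
import math

def idf(term, docs):
    """Inverse document frequency: log(N / df), where df is the
    number of documents containing the term."""
    df = sum(1 for text in docs if term in text.lower().split())
    return math.log(len(docs) / df) if df else 0.0

docs = [
    "snakes are long",
    "snakes search for food",
    "search engines index documents",
]
# "snakes" appears in 2 of 3 docs; "engines" in only 1 of 3,
# so "engines" is rarer and scores higher
print(idf("snakes", docs) < idf("engines", docs))  # True
```

Rare words carry more information about which documents matter, which is exactly the uniqueness argument the Moz post makes.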