{"id":23431033,"url":"https://github.com/jroakes/tech-seo-crawler","last_synced_at":"2025-04-12T23:20:23.755Z","repository":{"id":37229681,"uuid":"221697713","full_name":"jroakes/tech-seo-crawler","owner":"jroakes","description":"Build a small, 3 domain internet using Github pages and Wikipedia and construct a crawler to crawl, render, and index.","archived":false,"fork":false,"pushed_at":"2023-02-11T01:22:08.000Z","size":6481,"stargazers_count":73,"open_issues_count":17,"forks_count":11,"subscribers_count":12,"default_branch":"master","last_synced_at":"2025-04-05T00:46:52.709Z","etag":null,"topics":["crawling","github-pages","rendering","seo","wikipedia"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jroakes.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-11-14T12:55:35.000Z","updated_at":"2025-03-02T22:43:38.000Z","dependencies_parsed_at":"2023-01-24T19:00:23.584Z","dependency_job_id":null,"html_url":"https://github.com/jroakes/tech-seo-crawler","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jroakes%2Ftech-seo-crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jroakes%2Ftech-seo-crawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jroakes%2Ftech-seo-crawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jroakes%2Ftech-seo-crawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jroakes","download_u
rl":"https://codeload.github.com/jroakes/tech-seo-crawler/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248643874,"owners_count":21138518,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawling","github-pages","rendering","seo","wikipedia"],"created_at":"2024-12-23T09:49:12.876Z","updated_at":"2025-04-12T23:20:23.728Z","avatar_url":"https://github.com/jroakes.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# TechSEO Crawler\n\n\nBuild a small, 3-domain internet using GitHub Pages and Wikipedia, and construct a crawler to crawl, render, and index it.\n\n![TechSEO Screenshot](https://raw.githubusercontent.com/jroakes/tech-seo-crawler/master/etc/images/screenshot.png \"TechSEO Screenshot\")\n\nPlay with the results here: [Simple Search Engine](http://ec2-34-233-22-11.compute-1.amazonaws.com:8501)\n\n**Please Note**: The link above is hosted on a small AWS box, so if you have issues loading it, try again later.\n\nThe SlideShare deck is here: [Building a Simple Crawler on a Toy Internet](https://www.slideshare.net/jroakes/building-a-simple-crawler-on-a-toy-internet)\n\n## Description\n\n### Web Folder\nTo crawl a small internet of sites, we first have to create it.  This tool creates 3 small sites from Wikipedia data and hosts them on GitHub Pages.  The sites link to each other, but not to any other site on the internet.\n\n### Main function\n\nThis tool attempts to implement a small ecosystem of 3 websites, along with a simple crawler, renderer, and indexer.  
While the author did research to construct the repo, preferring simplicity over complexity was a deliberate design choice.  Items that are part of large crawling infrastructures, most notably disparate systems and highly efficient code and data storage, are not part of this repo.  We focus on simple representations of these items so that the repo is more accessible to newer developers.\n\n#### Parts:\n* PageRank\n* Chrome Headless Rendering\n* Text NLP Normalization\n* BERT Embeddings\n* Robots\n* Duplicate Content Shingling\n* URL Hashing\n* Document Frequency Functions (BM25 and TF-IDF)\n\n\nMade for a presentation at [Tech SEO Boost](https://www.catalystdigital.com/techseoboost/)\n\n\n\n## Getting Started\n\n### Get the repo\n```\ngit clone https://github.com/jroakes/tech-seo-crawler.git\n```\n\n\n### Dependencies\n\n* Please see the requirements.txt file for a list of dependencies.\n\nIt is strongly suggested that you first do the following in a new, clean environment.\n\n* If you are on Windows, you may need to install [Microsoft Build Tools](http://go.microsoft.com/fwlink/?LinkId=691126\u0026fixForIE=.exe.) and upgrade setuptools with `pip install --upgrade setuptools`.\n* Install PyTorch with `pip install torch==1.3.1+cpu -f https://download.pytorch.org/whl/torch_stable.html`\n* See the requirements-libraries.txt file for the remaining library requirements.  To install the frozen requirements this was developed with, use ```pip install -r requirements.txt```\n\nInstall with:\n```\npip install -r requirements.txt\n```\n\n\n### Executing program\n\n1. Make sure you've created your three sites first. See the README file in the web folder. Alternatively, if you just want to use the crawler/renderer, you can run with the premade sites and skip to step 3.\n2. After creating your three sites, go to the config file and add the crawler_seed URL. This will be the organization name you created on github.io. For example: myorganization.github.io/\n3. Run `streamlit run main.py` in the terminal or command prompt.  
A new browser window should open.\n4. The tool can also be run interactively with the `Run.ipynb` notebook in Jupyter.\n\n\n### Sharing\nIf you want to share your search engine with others, you can use Streamlit and Localtunnel.\n1. Install Localtunnel with `npm install -g localtunnel`\n2. Start the tunnel with `lt --port 80 --subdomain \u003ccreate a unique subdomain name\u003e`\n3. Start the Streamlit server with `streamlit run main.py --server.port 80 --global.logLevel 'warning' --server.headless true --server.enableCORS false --browser.serverAddress \u003cthe unique subdomain from step 2\u003e.localtunnel.me`\n4. Navigate to `https://\u003cthe unique subdomain from step 2\u003e.localtunnel.me` in your browser, or share the link with a friend.\n\n#### Complete example:\nIn a new terminal:\n```\nnpm install -g localtunnel\nlt --port 80 --subdomain tech-seo-crawler\n```\n\nIn another terminal:\n```\ncd /tech-seo-crawler/\nactivate techseo\nstreamlit run main.py --server.port 80 --global.logLevel 'warning' --server.headless true --server.enableCORS false --browser.serverAddress tech-seo-crawler.localtunnel.me\n```\n\n\n## Troubleshooting\n* When running in Streamlit, we experienced a few connection-closed errors during the rendering process. 
If you experience this error, rerun the script by opening the top-right menu in Streamlit and clicking Rerun.\n\n\n## Contributors\n\nContributor names and contact info\n* JR Oakes [@jroakes](https://twitter.com/jroakes)\n* Robert Padgett [@robertcpadgett](https://twitter.com/robertcpadgett)\n\n\n## Version History\n\n* 0.1 - Alpha\n    * Initial Release\n\n\n## License\n\nThis project is licensed under the MIT License. See the LICENSE.md file for details.\n\n## Acknowledgments\n\n### Libraries\n* [ghPublish](https://github.com/oxalorg/ghPublish)\n* [pandas](https://github.com/pandas-dev/pandas) # What would we all do without Pandas?\n* [gensim](https://github.com/RaRe-Technologies/gensim)\n* [pyppeteer](https://github.com/miyakogi/pyppeteer)\n* [scikit-learn](https://github.com/scikit-learn/scikit-learn)\n* [streamlit](https://github.com/streamlit/streamlit)\n* [DIP](https://github.com/dipanjanS) # I don't know who you are, but thanks for my go-to text normalization pipeline.\n\n### Topics\n* https://github.com/kish1/PoliteCrawler/blob/master/polite_crawler.py\n* https://bitbucket.org/mchaput/whoosh/src/default/\n* https://www.ijarcce.com/upload/2016/january-16/IJARCCE%2052.pdf\n* https://www.seltzer.com/margo/publications\n* https://github.com/sidco0014/Search-Engine\n* https://github.com/valerio94w/ADM-Hw3-Group4\n* https://github.com/rw1993/hupubxj_search\n* https://github.com/mitishagd/Information-Retrieval-System\n* https://medium.com/startup-grind/what-every-software-engineer-should-know-about-search-27d1df99f80d\n* http://web.stanford.edu/class/cs276/\n* https://github.com/wuyi1405/brianspiering-nlp-course\n* 
https://www.cs.toronto.edu/~muuo/blog/build-yourself-a-mini-search-engine/\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjroakes%2Ftech-seo-crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjroakes%2Ftech-seo-crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjroakes%2Ftech-seo-crawler/lists"}