{"id":20957106,"url":"https://github.com/zefrenchwan/monocle","last_synced_at":"2026-05-18T13:07:10.254Z","repository":{"id":255526517,"uuid":"849926756","full_name":"zefrenchwan/monocle","owner":"zefrenchwan","description":"Tool to list what a website is about","archived":false,"fork":false,"pushed_at":"2024-09-04T16:34:58.000Z","size":49,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-12-30T22:35:49.353Z","etag":null,"topics":["information-gathering","nlp","osint","python3","spacy-nlp","webscraping-data","webscrapping"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zefrenchwan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-30T14:28:51.000Z","updated_at":"2024-09-09T06:10:03.000Z","dependencies_parsed_at":"2024-09-05T23:43:23.366Z","dependency_job_id":null,"html_url":"https://github.com/zefrenchwan/monocle","commit_stats":null,"previous_names":["zefrenchwan/monocle"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/zefrenchwan/monocle","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zefrenchwan%2Fmonocle","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zefrenchwan%2Fmonocle/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zefrenchwan%2Fmonocle/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zefrenchwan%2Fmonocle/manifests","owner_url":"
https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zefrenchwan","download_url":"https://codeload.github.com/zefrenchwan/monocle/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zefrenchwan%2Fmonocle/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33178758,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-18T09:27:30.708Z","status":"ssl_error","status_checked_at":"2026-05-18T09:27:28.300Z","response_time":71,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["information-gathering","nlp","osint","python3","spacy-nlp","webscraping-data","webscrapping"],"created_at":"2024-11-19T01:29:32.987Z","updated_at":"2026-05-18T13:07:10.237Z","avatar_url":"https://github.com/zefrenchwan.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# monocle\n\nTool to list what a website is about. \n\nThe name comes from \"monocle\", a French word. \nIt is, basically, a piece of glass that corrects the vision of one eye (just one eye). \nThis code is really basic web scraping; it will not give you a definitive view of what a website is about. \n\n## TLDR\n\n1. Provide a website and a file path: the tool scrapes the website, finds named entities and writes them all to a file\n2. Internals use an NLP model. 
The default model targets French so far, so results are not amazing. Give it a better model if you need one\n3. Typical use is to find what a website is about (to place ads on it after the third-party cookie era, or to automate searches on a person, a thing, a company, and so on)\n4. The licence is MIT: basically, you may use it, sell it, whatever. Just mention this licence and this copyright (zefrenchwan, 2024)\n5. Use it, but do not break other people's websites by overloading them \n\n\n## What does it produce ? \n\nGiven a website, it scrapes every page of that website (not the full web). \nIt produces a JSON file that contains:\n* lang (fr so far): the language of the website\n* date: the date the extraction ended\n* url: the first URL (in general, the root of the website) \n* entities: named entities and their types\n\nLet us zoom in on entities. \nThese are named entities, that is PER (persons), LOC (locations), ORG (organizations), or MISC (things that may be an entity, but whose classification is unclear). \nFor each label (PER, LOC, etc.), you have: \n* the number of appearances of named entities\n* said named entities\n\nFor instance \n```\n      \"LOC\": {\n            \"4\": [\n                \"Rome\",\n```\n\nmeans that `Rome` is a named entity that is a location (LOC) and that it appeared `4` times in the website. \n\n## How do I use it ? \n\nFirst, *pay attention to your usage*. \nThis tool scrapes a website starting at a given page. \nBe sure that the website allows it and that the load is acceptable for that website. \nFor instance, Wikipedia states it explicitly: they provide their full content as a database, so there is no need to scrape them. \n\nSecond, *this code handles the NLP model itself, no need to install anything else*.\nDuring the first launch, the code will not find the right model, so it will download it.  \nThe reason is that spaCy is a \"technical detail\" of sorts that an end user (you) should not have to bother with. 
\nThis comes with a downside: if your website is not in French, you need to change the code to download the correct spaCy model. \n\n\nThen, assume you want to scrape the website `https://iamawebsite.fr` and write the result into a `result.json` file. \nUsage would be `pipenv run .\\main.py https://iamawebsite.fr result.json` \n\n## How does it work ?\n\nThe algorithm is a map-reduce (to extract web page content and aggregate stats into a single global result). \nTo find pages to explore, it walks the link graph without revisiting pages (no cycles). \nIt starts from the provided URL and explores each page to find outgoing links. \nThen it picks the next unprocessed page, loads it, and so on. \n\nInteresting parts are:\n* data cleaning, to go from HTML to plain text. Handled with BS4, just a response.text\n* picking links. Handled with a BS4 search for a href. Basic, you may want to change this part\n* NLP pipeline. This is where the core added value lies. \n\n### NLP \n\nThis section details the models and frameworks used. \n\n* The NLP library is [spaCy](https://spacy.io/) \n* Supported languages are ... French. To change it, include a new model in `initializations.py`\n* The spaCy model is the largest one; it focuses on accuracy. To use a better model for your use case, change `initializations.py`\n\n\n## Some common comments / FAQ\n\n### I am not happy with the results\n\nI used spaCy's largest model. \nFeel free to change this code or use a better model. \n\n### Your code may face encoding issues\n\nA website may provide its charset (or not). \nThe default is UTF-8, because it is the Python default. \nCharset recognition is really painful, costs a lot and provides no benefit for open source code. \n\n### I have so many ideas to...\n\nClone this code, do your stuff, and if you want, share it. \nThe code is under the MIT licence for that reason. \n\n### I want to use your code for commercial use\n\nThe code is under the MIT licence, so you may do it. 
\nHowever, there are better web scraping techniques: \n* to find URLs\n* to find web page content\n* to list named entities\n\n### Any thoughts on a better architecture ?\n\nSure ! \n* split web scraping and NLP analysis, use Apache Kafka to sync them (typical producer - consumer)\n* save web pages, they may change\n* use some reference data to improve named entity recognition \n\n### Why the MIT licence ? You could be rich and famous and...\n\nOf course...\nFirst of all, because you may use my code with no limit except the law, as long as you mention me. \nI wanted to share this simple project (less than a week of work) to help anyone who likes web scraping. \n\n### What changes will you make on this project ? \n\nNot sure I will. \nThe code is sufficient for its purpose: a base for other developers to use. ","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzefrenchwan%2Fmonocle","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzefrenchwan%2Fmonocle","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzefrenchwan%2Fmonocle/lists"}