{"id":24247170,"url":"https://github.com/someshdiwan/information-retrieval","last_synced_at":"2025-07-07T11:16:04.261Z","repository":{"id":271344431,"uuid":"913141044","full_name":"Someshdiwan/Information-Retrieval","owner":"Someshdiwan","description":"Demonstrating techniques for text document processing, including vector space modeling, cosine similarity computation, and other information retrieval methods. Ideal for learning and implementing basic IR concepts.","archived":false,"fork":false,"pushed_at":"2025-01-12T13:18:06.000Z","size":122,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-04T15:48:15.230Z","etag":null,"topics":["grammar-parser","information","information-extraction","information-retrieval"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Someshdiwan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-07T05:39:37.000Z","updated_at":"2025-01-12T13:19:38.000Z","dependencies_parsed_at":"2025-03-04T15:56:28.459Z","dependency_job_id":null,"html_url":"https://github.com/Someshdiwan/Information-Retrieval","commit_stats":null,"previous_names":["someshdiwan/information-retrieval"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Someshdiwan/Information-Retrieval","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Someshdiwan%2FInformation-Retrieval","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Some
shdiwan%2FInformation-Retrieval/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Someshdiwan%2FInformation-Retrieval/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Someshdiwan%2FInformation-Retrieval/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Someshdiwan","download_url":"https://codeload.github.com/Someshdiwan/Information-Retrieval/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Someshdiwan%2FInformation-Retrieval/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264067243,"owners_count":23552161,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["grammar-parser","information","information-extraction","information-retrieval"],"created_at":"2025-01-14T23:18:21.792Z","updated_at":"2025-07-07T11:16:04.256Z","avatar_url":"https://github.com/Someshdiwan.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Text Document Processing\n\nA collection of scripts and examples demonstrating techniques for text document processing, including vector space modeling, cosine similarity computation, and other information retrieval (IR) methods. 
\nThis repository is ideal for learning and implementing basic IR concepts, text classification, web crawling, and document preprocessing.\n\n![GitHub License](https://img.shields.io/github/license/Someshdiwan/Information-Retrieval)\n![GitHub stars](https://img.shields.io/github/stars/Someshdiwan/Information-Retrieval)\n\n---\n\n## 🚀 Overview\n\nThis repository showcases several fundamental and advanced techniques in **text document processing** and **information retrieval (IR)**, including methods for text classification, vector space modeling, similarity computation, and web crawling.\n\n### Key Techniques:\n\n- **Text Preprocessing**: Text cleaning, stop word removal, stemming, and lemmatization.\n- **Vector Space Model (VSM)**: Representing documents as vectors in a high-dimensional space for processing.\n- **Cosine Similarity**: Computing the similarity between documents using the cosine similarity measure.\n- **Naive Bayes Classifier**: Text classification using the Naive Bayes algorithm (GaussianNB).\n- **Web Crawling**: Crawling websites to extract news stories with domain filtering.\n\n![Text Processing](https://cdn.dribbble.com/users/19894/screenshots/3359384/grammerly-keyboard.gif)\n\n---\n\n## 🔧 Features\n\n- **Text Classification**: Naive Bayes classifier for text classification and prediction tasks.\n- **Document Preprocessing**: Techniques for cleaning and preparing text documents for analysis.\n- **Cosine Similarity**: Implementation of cosine similarity to compare and measure the similarity between documents.\n- **Web Crawling**: Scripts for crawling news websites and collecting relevant text content.\n- **XML Parsing**: Basic example of parsing and modifying XML documents in Python.\n\n---\n\n## 🌐 Demo\n\nYou can try out the various techniques demonstrated in this repository by running the provided Python scripts or Jupyter notebooks. 
The projects include:\n- **Text classification** using Naive Bayes (GaussianNB)\n- **Cosine similarity computation** for document comparison\n- **Web crawling** to extract news stories from websites\n- **XML document processing** for parsing and modification\n\n### Dependencies:\n\nTo run the examples, you will need the following:\n- Python 3.x\n- scikit-learn (for Naive Bayes and vectorizer)\n- pandas\n- numpy\n- requests\n- BeautifulSoup (for web scraping)\n- nltk (for text preprocessing)\n- lxml (for XML parsing)\n\nInstall them using pip:\n\npip install scikit-learn pandas numpy requests beautifulsoup4 nltk lxml\n\n---\n\n## 🛠️ Technologies Used\n\n- Python 3.x\n- scikit-learn (for machine learning and vector space modeling)\n- pandas\n- numpy\n- nltk (for natural language processing)\n- BeautifulSoup (for web scraping)\n- lxml (for XML parsing)\n- Jupyter Notebooks (for interactive demos)\n\n## 📂 Project Structure\n\n```plaintext\nText-Document-Processing/\n├── notebooks/               # Jupyter notebooks for each technique\n├── data/                    # Datasets for testing and training models\n├── README.md                # Project documentation\n```\n\n## Running the Code\n\nClone the repository:\n\ngit clone https://github.com/Someshdiwan/Information-Retrieval\n\n---\n\n## 🌟 Show Your Support\n\nIf you like this project, please consider giving it a ⭐ on GitHub!\n\n## 🤝 Contributing\n\nContributions are welcome. If you have enhancements, bug fixes, or new project ideas, fork the repository, make your changes, and submit a pull request.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsomeshdiwan%2Finformation-retrieval","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsomeshdiwan%2Finformation-retrieval","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsomeshdiwan%2Finformation-retrieval/lists"}