{"id":18792348,"url":"https://github.com/kplanisphere/vocabulario-processing","last_synced_at":"2025-10-14T17:16:19.291Z","repository":{"id":243553019,"uuid":"812742906","full_name":"KPlanisphere/vocabulario-processing","owner":"KPlanisphere","description":"Laboratory 4 - Retrieval Information","archived":false,"fork":false,"pushed_at":"2024-06-09T18:49:06.000Z","size":973,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-14T17:16:18.432Z","etag":null,"topics":["data-preprocessing","educational-project","information-retrieval","lowercase-conversion","punctuation-removal","python","short-words-filter","text-processing","tokenization","vocabulary-optimization"],"latest_commit_sha":null,"homepage":"https://linktr.ee/planisphere.kgz","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/KPlanisphere.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-09T18:47:57.000Z","updated_at":"2024-06-09T18:50:42.000Z","dependencies_parsed_at":"2024-06-09T20:34:39.578Z","dependency_job_id":"5cd4a6b4-47af-451d-8472-9531ec7027e7","html_url":"https://github.com/KPlanisphere/vocabulario-processing","commit_stats":null,"previous_names":["kplanisphere/vocabulario-processing"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/KPlanisphere/vocabulario-processing","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KPlanisphere%2Fvocabulario-processing","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KPlanisphere%2Fvocabulario-processing/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KPlanisphere%2Fvocabulario-processing/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KPlanisphere%2Fvocabulario-processing/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/KPlanisphere","download_url":"https://codeload.github.com/KPlanisphere/vocabulario-processing/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KPlanisphere%2Fvocabulario-processing/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279020072,"owners_count":26086805,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-14T02:00:06.444Z","response_time":60,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-preprocessing","educational-project","information-retrieval","lowercase-conversion","punctuation-removal","python","short-words-filter","text-processing","tokenization","vocabulary-optimization"],"created_at":"2024-11-07T21:19:35.515Z","updated_at":"2025-10-14T17:16:19.264Z","avatar_url":"https://github.com/KPlanisphere.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Vocabulario Processing Project\n\n## Description\nThis project focuses on processing text files to extract and optimize vocabulary using Python and regular expressions. The main goal is to prepare the vocabulary for information retrieval tasks by performing several preprocessing steps, including tokenization, punctuation removal, conversion to lowercase, and filtering out short words. The project is built upon the results of previous laboratories and demonstrates advanced text processing techniques to enhance the efficiency of information retrieval within large text datasets.\n\n### Files Included\n- **lab4.py**: A Python script for processing text files to extract and optimize vocabulary.\n- **Laboratorio 4 Vocabulario.pdf**: Official documentation detailing the objectives, methodology, and results of the project.\n- **vocabularioReducidoT.txt**: A text file containing the reduced vocabulary.\n- **vocabularioTruncado.txt**: A text file containing the truncated vocabulary.\n- **documentos/documento-TRUNCADO.txt**: A sample processed document used in this project.\n\n### Notable Code Snippets\n\n#### 1. Function to Obtain Vocabulary\nThis function reads a text file and extracts unique words, filtering out numbers and converting the text to lowercase.\n\n```python\nimport re\n\ndef obtener_vocabulario(archivo):\n    with open(archivo, 'r') as f:\n        texto = f.read()\n        palabras = re.findall(r'\\b(?![0-9]+\\b)\\w+\\b', texto.lower())\n        return set(palabras)\n```\n\n#### 2. Processing Text Files\n\nThis snippet processes all text files in the specified directory to extract and optimize vocabulary.\n\n```python\nimport os\n\n# Directory containing text files\ndirectorio = r'C:\\Users\\mini_\\OneDrive\\Documentos\\Code Test\\TEST 1\\lab4\\documentos'\noutput_file_final = r'C:\\Users\\mini_\\OneDrive\\Documentos\\Code Test\\TEST 1\\lab4\\vocabularioReducidoT.txt'\n\n# Set to store the total vocabulary\nvocabulario_total = set()\n\n# Process each text file in the directory\nfor archivo in os.listdir(directorio):\n    if archivo.endswith('.txt'):\n        ruta_archivo = os.path.join(directorio, archivo)\n        vocabulario_archivo = obtener_vocabulario(ruta_archivo)\n        vocabulario_total.update(vocabulario_archivo)\n\n# Sort the vocabulary alphabetically\nvocabulario_ordenado = sorted(vocabulario_total)\n\n# Filter out words with 2 characters or less\nvocabulario_filtrado = [palabra for palabra in vocabulario_ordenado if len(palabra) \u003e 2]\n\n# Write the filtered vocabulary to the output file\nwith open(output_file_final, 'w') as f:\n    for palabra in vocabulario_filtrado:\n        f.write(palabra + '\\n')\n```\n\n### Official Documentation Summary\n\nThe official documentation provided in \"Laboratorio 4 Vocabulario.pdf\" outlines the following key points:\n\n#### Objectives\n\n-   Develop a Python script to extract and optimize the vocabulary of a given document.\n-   Perform preprocessing steps including tokenization, punctuation removal, conversion to lowercase, and truncation of words.\n-   Create a reduced vocabulary by filtering out terms with two or fewer letters.\n\n#### Methodology\n\n1.  **Vocabulary Extraction**: Extract unique words from text files, excluding numbers and converting text to lowercase.\n2.  **Vocabulary Optimization**: Filter out words with two characters or fewer to create a reduced vocabulary.\n3.  **Alphabetical Sorting**: Sort the vocabulary alphabetically for better organization and readability.\n\n#### Results and Discussion\n\n-   The initial vocabulary extracted contains many terms, including short words with only two letters.\n-   The reduced vocabulary, which excludes words with two letters or fewer, shows a significant decrease in the number of terms.\n-   The reduced vocabulary improves the efficiency of information retrieval tasks by focusing on more meaningful terms.\n\n#### Conclusion\n\nThe project successfully demonstrates advanced text processing techniques to optimize vocabulary for information retrieval tasks. The reduction in vocabulary size by filtering out short words enhances the efficiency and relevance of the retrieved information.\n\n### Installation and Usage\n\n1.  Clone the repository to your local machine.\n2.  Ensure you have Python installed.\n3.  Run the `lab4.py` script to process the text files and extract the optimized vocabulary.\n\n```bash\ngit clone https://github.com/KPlanisphere/vocabulario-processing.git\ncd vocabulario-processing\npython lab4.py\n```\n\n### Dependencies\n\n-   Python\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkplanisphere%2Fvocabulario-processing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkplanisphere%2Fvocabulario-processing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkplanisphere%2Fvocabulario-processing/lists"}