{"id":18792347,"url":"https://github.com/kplanisphere/grimm-text-processor","last_synced_at":"2025-06-24T20:08:15.018Z","repository":{"id":243458269,"uuid":"812491942","full_name":"KPlanisphere/grimm-text-processor","owner":"KPlanisphere","description":"Laboratory 1 - Retrieval Information","archived":false,"fork":false,"pushed_at":"2024-06-09T03:45:55.000Z","size":788,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-21T15:14:54.347Z","etag":null,"topics":["data-preprocessing","educational-project","information-retrieval","lowercase-conversion","nltk","punctuation-removal","python","stopwords","text-processing","tokenization"],"latest_commit_sha":null,"homepage":"https://linktr.ee/planisphere.kgz","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/KPlanisphere.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-09T03:40:20.000Z","updated_at":"2024-06-09T03:48:18.000Z","dependencies_parsed_at":"2024-06-09T04:38:57.082Z","dependency_job_id":"f3477f42-fe66-4da7-ba0f-4a9aea9207d4","html_url":"https://github.com/KPlanisphere/grimm-text-processor","commit_stats":null,"previous_names":["kplanisphere/grimm-text-processor"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/KPlanisphere/grimm-text-processor","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KPlanisphere%2Fgrimm-text-processor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KPlanisphere%2Fgrimm-text-proc
essor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KPlanisphere%2Fgrimm-text-processor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KPlanisphere%2Fgrimm-text-processor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/KPlanisphere","download_url":"https://codeload.github.com/KPlanisphere/grimm-text-processor/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/KPlanisphere%2Fgrimm-text-processor/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261749216,"owners_count":23203990,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-preprocessing","educational-project","information-retrieval","lowercase-conversion","nltk","punctuation-removal","python","stopwords","text-processing","tokenization"],"created_at":"2024-11-07T21:19:35.301Z","updated_at":"2025-06-24T20:08:14.999Z","avatar_url":"https://github.com/KPlanisphere.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Grimm Text Processor\n\nA comprehensive text processing project using Python and NLTK, focused on processing text from Grimms' Fairy Tales. The project involves reading text files, tokenizing the content, removing punctuation, converting text to lowercase, and filtering out stopwords. 
This project was developed as part of a laboratory exercise to understand the basics of text preprocessing for information retrieval.\n\n## Features\n\n### Overview\n\nThe project involves several key tasks:\n\n-   **Tokenization**: Splitting the text into individual words or tokens.\n-   **Punctuation Removal**: Removing punctuation marks from the tokens.\n-   **Lowercase Conversion**: Converting all text to lowercase.\n-   **Stopwords Removal**: Filtering out common words that do not contribute much meaning (e.g., \"and\", \"the\").\n\n### Scripts\n\n#### `lab1.1.py`\n\n-   **Read Text File**: Reads content from `lab1.txt`.\n-   **Tokenization**: Calls NLTK's `word_tokenize`, then overwrites the result with a whitespace split (`str.split`), which is what the later steps use.\n-   **Remove Punctuation**: Removes ASCII punctuation and the typographic apostrophe (`’`) from each word.\n-   **Print Output**: Prints the first 100 words after processing.\n\n#### Code Snippet\n\n```python\nimport re\nimport string\nimport nltk\nimport os\n\nnltk.download('punkt')\n\n# Read the external file content\narchivo_path = os.path.join(os.path.dirname(__file__), 'lab1.txt')\nwith open(archivo_path, 'r', encoding='utf-8') as archivo:\n    texto = archivo.read()\n\n# Tokenization using NLTK (this result is overwritten by the split below)\npalabras = nltk.word_tokenize(texto)\n\n# Split the text into words on whitespace; the later steps use this list\npalabras = texto.split()\n\n# Print the punctuation characters that will be removed\nprint(string.punctuation)\n\n# Define a regex matching ASCII punctuation plus the typographic apostrophe\nsimbolos_extra = '’'\nre_punc = re.compile('[%s%s]' % (re.escape(string.punctuation), re.escape(simbolos_extra)))\n\n# Remove punctuation from each word\nstripped = [re_punc.sub('', w) for w in palabras]\n\n# Print the first 100 words\nprint(stripped[:100])\n```\n\n#### `lab1.2.py`\n\n-   **Read Text File**: Reads content from `prueba.txt`.\n-   **Tokenization**: Calls NLTK's `word_tokenize`, then overwrites the result with a whitespace split of the lowercased text.\n-   **Convert to Lowercase**: Converts all text to lowercase.\n-   **Remove Vowels**: Removes the vowels `a`, `e`, `i`, `o`, and `u` from each word.\n-   **Print Output**: Prints the first 100 words after processing.\n\n#### Code Snippet\n\n```python\nimport re\nimport 
string\nimport nltk\nimport os\n\nnltk.download('punkt')\n\n# Read the external file content\narchivo_path = os.path.join(os.path.dirname(__file__), 'prueba.txt')\nwith open(archivo_path, 'r', encoding='utf-8') as archivo:\n    texto = archivo.read()\n\n# Tokenization using NLTK (this result is overwritten by the split below)\npalabras = nltk.word_tokenize(texto)\n\n# Convert text to lowercase\ntexto = texto.lower()\n\n# Split the lowercased text into words on whitespace; the later steps use this list\npalabras = texto.split()\n\n# Print the ASCII punctuation characters (informational; the regex below does not use them)\nprint(string.punctuation)\n\n# Define regex to remove vowels\nsimbolos_extra = 'aeiou'\nre_punc = re.compile('[%s]' % (re.escape(simbolos_extra)))\n\n# Remove vowels from each word\nstripped = [re_punc.sub('', w) for w in palabras]\n\n# Print the first 100 words\nprint(stripped[:100])\n```\n\n#### `lab1.3.py`\n\n-   **Read Text File**: Reads content from `lab1.txt`.\n-   **Tokenization**: Calls NLTK's `word_tokenize`, then overwrites the result with a whitespace split of the lowercased text.\n-   **Convert to Lowercase**: Converts all text to lowercase.\n-   **Remove Punctuation**: Removes ASCII punctuation and the typographic apostrophe (`’`) from each word.\n-   **Print Output**: Prints the first 100 words after processing.\n\n#### Code Snippet\n\n```python\nimport re\nimport string\nimport nltk\nimport os\n\nnltk.download('punkt')\n\n# Read the external file content\narchivo_path = os.path.join(os.path.dirname(__file__), 'lab1.txt')\nwith open(archivo_path, 'r', encoding='utf-8') as archivo:\n    texto = archivo.read()\n\n# Tokenization using NLTK (this result is overwritten by the split below)\npalabras = nltk.word_tokenize(texto)\n\n# Convert text to lowercase\ntexto = texto.lower()\n\n# Split the lowercased text into words on whitespace; the later steps use this list\npalabras = texto.split()\n\n# Print the punctuation characters that will be removed\nprint(string.punctuation)\n\n# Define a regex matching ASCII punctuation plus the typographic apostrophe\nsimbolos_extra = '’'\nre_punc = re.compile('[%s%s]' % (re.escape(string.punctuation), re.escape(simbolos_extra)))\n\n# Remove punctuation from each word\nstripped = [re_punc.sub('', w) for w in palabras]\n\n# Print the first 100 words\nprint(stripped[:100])\n```\n\n#### `lab1.4.py`\n\n-   **Read Text File**: Reads content from 
`lab1.txt`.\n-   **Tokenization**: Calls NLTK's `word_tokenize`, then overwrites the result with a whitespace split (`str.split`), which is what the later steps use.\n-   **Remove Punctuation**: Removes ASCII punctuation and the typographic apostrophe (`’`) from each word.\n-   **Convert to Lowercase**: Converts each word to lowercase.\n-   **Remove Stopwords**: Filters out English stopwords from the word list.\n-   **Print Output**: Prints the first 100 words after each step.\n\n#### Code Snippet\n\n```python\nimport re\nimport string\nimport nltk\nimport os\nfrom nltk.corpus import stopwords\n\nnltk.download('punkt')\nnltk.download('stopwords')\n\n# Read the external file content\narchivo_path = os.path.join(os.path.dirname(__file__), 'lab1.txt')\nwith open(archivo_path, 'r', encoding='utf-8') as archivo:\n    texto = archivo.read()\n\n# Tokenization using NLTK (this result is overwritten by the split below)\npalabras = nltk.word_tokenize(texto)\n\n# Split the text into words on whitespace; the later steps use this list\npalabras = texto.split()\n\n# Print the first 100 words (Spanish label: text without spaces)\nprint(\"\\n///// TEXTO SIN ESPACIOS\\n\")\nprint(palabras[:100])\n\n# Print the punctuation characters that will be removed (Spanish label: punctuation marks)\nprint(\"\\n///// SIGNOS DE PUNTUACION\\n\")\nprint(string.punctuation)\n\n# Define a regex matching ASCII punctuation plus the typographic apostrophe\nsimbolos_extra = '’'\nre_punc = re.compile('[%s%s]' % (re.escape(string.punctuation), re.escape(simbolos_extra)))\n\n# Remove punctuation from each word\nstripped = [re_punc.sub('', w) for w in palabras]\n\n# Print the first 100 words (Spanish label: text without punctuation marks)\nprint(\"\\n///// TEXTO SIN SIGNOS DE PUNTUACION\\n\")\nprint(stripped[:100])\n\n# Convert each word to lowercase\nstripped = [w.lower() for w in stripped]\n\n# Print the first 100 words (Spanish label: text in lowercase)\nprint(\"\\n///// TEXTO EN MINUSCULAS\\n\")\nprint(stripped[:100])\n\n# Get the English stopwords as a set for fast membership tests\nstopwords_english = set(stopwords.words('english'))\n\n# Convert the set of stopwords to a list so it can be sliced\nstopwords_english_list = list(stopwords_english)\n\n# Print 5 stopwords (Spanish label: first 5 stopwords); set order is arbitrary\nprint(\"\\n///// PRIMERAS 5 PALABRAS VACÍAS\\n\")\nprint(stopwords_english_list[:5])\n\n# Remove stopwords\nfiltered_words = [word for word in stripped if word not in stopwords_english]\n\n# Print the first 100 words without stopwords\nprint(\"\\n///// TEXTO SIN PALABRAS 
VACIAS\\n\")\nprint(filtered_words[:100])\n```\n\n## Getting Started\n\nTo get a local copy up and running, follow these steps.\n\n### Prerequisites\n\n-   Python 3.x\n-   NLTK library\n\n### Installation\n\n1.  Clone the repo\n    \n    ```sh\n    git clone https://github.com/KPlanisphere/grimm-text-processor.git\n    ```\n    \n2.  Install the required Python packages\n    \n    ```sh\n    pip install nltk\n    ```\n    \n\n## Usage\n\nRun each script using Python. Each script reads its input file (`lab1.txt` or `prueba.txt`) from the script's own directory:\n\n```sh\npython lab1.1.py\npython lab1.2.py\npython lab1.3.py\npython lab1.4.py\n```\n\n## Project Documentation\n\n### Introduction\n\nIn this lab, we used Python to write code that takes a given text, splits it into tokens, removes punctuation marks, converts it to lowercase, and removes stopwords. We rely on Python's standard tools (`str.split`, `string.punctuation`, and the `re` module) together with the NLTK library, which provides the `stopwords` corpus.\n\n### Objective\n\nLearn to prepare texts for use in information retrieval processes. This involves separating the text into tokens, removing tokens that carry little information (punctuation marks, numbers, stopwords), and converting to lowercase.\n\n### Experimental Development\n\n1.  **Tokenization**: Using whitespace as the delimiter, split the text into words and print the first 100 words.\n2.  **Punctuation Removal**: Print the punctuation marks, use `re` to remove them from each word, and then print the first 100 words after removing punctuation.\n3.  **Lowercase Conversion**: Convert the text to lowercase and print the first 100 words.\n4.  
**Stopwords Removal**: Import the English stopword list, print the first 5 stopwords, and then print the first 100 words after removing stopwords.\n\n### Discussion and Results\n\n-   Tokenization separates the text into individual words.\n-   Punctuation removal cleans the text by removing punctuation marks.\n-   Lowercase conversion standardizes the text.\n-   Stopwords removal filters out common words that do not add much meaning.\n\n### Conclusion\n\nWorking through these exercises gave us a firmer grasp of Python's text-processing functions and a stronger foundation for preparing and manipulating text for information retrieval with Python and the NLTK library.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkplanisphere%2Fgrimm-text-processor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkplanisphere%2Fgrimm-text-processor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkplanisphere%2Fgrimm-text-processor/lists"}