https://github.com/kplanisphere/grimm-text-processor

Laboratory 1 - Retrieval Information

data-preprocessing educational-project information-retrieval lowercase-conversion nltk punctuation-removal python stopwords text-processing tokenization

Last synced: 12 months ago

Host: GitHub
URL: https://github.com/kplanisphere/grimm-text-processor
Owner: KPlanisphere
Created: 2024-06-09T03:40:20.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2024-06-09T03:45:55.000Z (about 2 years ago)
Last Synced: 2025-05-21T15:14:54.347Z (about 1 year ago)
Topics: data-preprocessing, educational-project, information-retrieval, lowercase-conversion, nltk, punctuation-removal, python, stopwords, text-processing, tokenization
Language: Python
Homepage: https://linktr.ee/planisphere.kgz
Size: 770 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Grimm Text Processor

A text preprocessing project built with Python and NLTK that works on text from Grimms' Fairy Tales. The scripts read text files, tokenize the content, remove punctuation, convert text to lowercase, and filter out stopwords. It was developed as a laboratory exercise on the basics of text preprocessing for information retrieval.

## Features

### Overview

The project involves several key tasks:

-   **Tokenization**: Splitting the text into individual words or tokens.

-   **Punctuation Removal**: Removing punctuation marks from the tokens.

-   **Lowercase Conversion**: Converting all text to lowercase.

-   **Stopwords Removal**: Filtering out common words that do not contribute much meaning (e.g., "and", "the").
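
The four tasks above can be chained into a single function. The following is a minimal sketch, not code from the repository: the function name `preprocess` and the small `TOY_STOPWORDS` set are assumptions made for illustration, and the project itself takes its stopwords from NLTK's English corpus.

```python
import re
import string

# Tiny illustrative stopword list (an assumption for this sketch);
# the project itself uses NLTK's English stopword corpus.
TOY_STOPWORDS = {"a", "an", "and", "in", "of", "the", "there", "was"}

# Matches one ASCII punctuation character or the typographic apostrophe.
PUNCT_RE = re.compile("[%s’]" % re.escape(string.punctuation))


def preprocess(text, stopwords=TOY_STOPWORDS):
    """Tokenize on whitespace, strip punctuation, lowercase, drop stopwords."""
    tokens = text.split()                           # tokenization
    tokens = [PUNCT_RE.sub("", t) for t in tokens]  # punctuation removal
    tokens = [t.lower() for t in tokens]            # lowercase conversion
    return [t for t in tokens if t not in stopwords]  # stopwords removal


print(preprocess("There was once a King, and in the forest lived a Frog."))
```

Lowercasing before the stopword check matters here: "There" would not match the lowercase entry "there" otherwise.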

### Scripts

#### `lab1.1.py`

-   **Read Text File**: Reads content from `lab1.txt`.

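-   **Tokenization**: Calls NLTK's `word_tokenize`, then overwrites the result with a whitespace split (`texto.split()`), so the later steps operate on whitespace-separated words.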

-   **Remove Punctuation**: Removes punctuation and specific symbols from the text.

-   **Print Output**: Prints the first 100 words after processing.

#### Code Snippet

```python
import re
import string
import nltk
import os

nltk.download('punkt')

# Read the external file content
archivo_path = os.path.join(os.path.dirname(__file__), 'lab1.txt')
with open(archivo_path, 'r', encoding='utf-8') as archivo:
    texto = archivo.read()

# Tokenization using NLTK
# (note: this result is overwritten by the whitespace split below)
palabras = nltk.word_tokenize(texto)

# Split the text into words on whitespace
palabras = texto.split()

# Print the ASCII punctuation characters
print(string.punctuation)

# Define regex to remove punctuation, plus the typographic apostrophe
simbolos_extra = '’'
re_punc = re.compile('[%s%s]' % (re.escape(string.punctuation), re.escape(simbolos_extra)))

# Remove punctuation from each word
stripped = [re_punc.sub('', w) for w in palabras]

# Print the first 100 words
print(stripped[:100])
```
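
As an alternative to the regular expression, the same characters can be removed with a translation table. This is a sketch for comparison only and is not part of `lab1.1.py`; the sample word list and the names `EXTRA`, `table`, `by_regex`, and `by_table` are made up for illustration.

```python
import re
import string

EXTRA = "’"  # typographic apostrophe, the extra symbol used by the script

# Regex approach, built the same way as in lab1.1.py
re_punc = re.compile("[%s%s]" % (re.escape(string.punctuation), re.escape(EXTRA)))

# Translation-table approach: every listed character is mapped to None (deleted)
table = str.maketrans("", "", string.punctuation + EXTRA)

words = ["Hello,", "said", "the", "King’s", "daughter."]
by_regex = [re_punc.sub("", w) for w in words]
by_table = [w.translate(table) for w in words]

print(by_regex)
print(by_table)
```

Both lists come out identical. Characters outside `string.punctuation` and `EXTRA`, such as curly double quotes, are left untouched by either approach unless they are added explicitly.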

#### `lab1.2.py`

-   **Read Text File**: Reads content from `prueba.txt`.

-   **Tokenization**: Calls NLTK's `word_tokenize`, then overwrites the result with a whitespace split of the lowercased text (`texto.split()`).

-   **Convert to Lowercase**: Converts all text to lowercase.

-   **Remove Vowels**: Removes vowels from the text.

-   **Print Output**: Prints the first 100 words after processing.

#### Code Snippet

```python
import re
import string
import nltk
import os

nltk.download('punkt')

# Read the external file content
archivo_path = os.path.join(os.path.dirname(__file__), 'prueba.txt')
with open(archivo_path, 'r', encoding='utf-8') as archivo:
    texto = archivo.read()

# Tokenization using NLTK
# (note: this result is overwritten by the whitespace split below)
palabras = nltk.word_tokenize(texto)

# Convert text to lowercase
texto = texto.lower()

# Split the lowercased text into words on whitespace
palabras = texto.split()

# Print the ASCII punctuation characters (not used by the vowel regex)
print(string.punctuation)

# Define regex to remove vowels (lowercase only; the text is already lowercased)
simbolos_extra = 'aeiou'
re_punc = re.compile('[%s]' % (re.escape(simbolos_extra)))

# Remove vowels from each word
stripped = [re_punc.sub('', w) for w in palabras]

# Print the first 100 words
print(stripped[:100])
```
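
The character class `[aeiou]` matches lowercase vowels only, so the script depends on the text having been lowercased first. The sketch below illustrates this with a made-up word list; the `re.IGNORECASE` variant is an alternative shown for comparison, not something the script uses.

```python
import re

vowels_lower = re.compile("[aeiou]")                    # pattern used by lab1.2.py
vowels_any_case = re.compile("[aeiou]", re.IGNORECASE)  # hypothetical alternative

words = ["Once", "Upon", "A", "Time"]

# Without lowercasing, uppercase vowels survive
kept_upper = [vowels_lower.sub("", w) for w in words]

# Lowercasing first, as the script does, removes every vowel
lowered_first = [vowels_lower.sub("", w.lower()) for w in words]

# Case-insensitive matching removes every vowel but keeps consonant case
any_case = [vowels_any_case.sub("", w) for w in words]

print(kept_upper)
print(lowered_first)
print(any_case)
```

A word that consists only of vowels, such as "A", becomes an empty string in the last two lists.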

#### `lab1.3.py`

-   **Read Text File**: Reads content from `lab1.txt`.

-   **Tokenization**: Calls NLTK's `word_tokenize`, then overwrites the result with a whitespace split of the lowercased text (`texto.split()`).

-   **Convert to Lowercase**: Converts all text to lowercase.

-   **Remove Punctuation**: Removes punctuation and specific symbols from the text.

-   **Print Output**: Prints the first 100 words after processing.

#### Code Snippet

```python
import re
import string
import nltk
import os

nltk.download('punkt')

# Read the external file content
archivo_path = os.path.join(os.path.dirname(__file__), 'lab1.txt')
with open(archivo_path, 'r', encoding='utf-8') as archivo:
    texto = archivo.read()

# Tokenization using NLTK
# (note: this result is overwritten by the whitespace split below)
palabras = nltk.word_tokenize(texto)

# Convert text to lowercase
texto = texto.lower()

# Split the lowercased text into words on whitespace
palabras = texto.split()

# Print the ASCII punctuation characters
print(string.punctuation)

# Define regex to remove punctuation, plus the typographic apostrophe
simbolos_extra = '’'
re_punc = re.compile('[%s%s]' % (re.escape(string.punctuation), re.escape(simbolos_extra)))

# Remove punctuation from each word
stripped = [re_punc.sub('', w) for w in palabras]

# Print the first 100 words
print(stripped[:100])
```

#### `lab1.4.py`

-   **Read Text File**: Reads content from `lab1.txt`.

-   **Tokenization**: Calls NLTK's `word_tokenize`, then overwrites the result with a whitespace split (`texto.split()`).

-   **Remove Punctuation**: Removes punctuation and specific symbols from each word.

-   **Convert to Lowercase**: Converts each word to lowercase.

-   **Remove Stopwords**: Filters out English stopwords from the word list.

-   **Print Output**: Prints the first 100 words after each stage, along with the punctuation characters and 5 of the stopwords.

#### Code Snippet

```python
import re
import string
import nltk
import os
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')

# Read the external file content
archivo_path = os.path.join(os.path.dirname(__file__), 'lab1.txt')
with open(archivo_path, 'r', encoding='utf-8') as archivo:
    texto = archivo.read()

# Tokenization using NLTK
# (note: this result is overwritten by the whitespace split below)
palabras = nltk.word_tokenize(texto)

# Split the text into words on whitespace
palabras = texto.split()
print("\n///// TEXTO SIN ESPACIOS\n")  # "text without spaces"
print(palabras[:100])

# Print the ASCII punctuation characters
print("\n///// SIGNOS DE PUNTUACION\n")  # "punctuation marks"
print(string.punctuation)

# Define regex to remove punctuation, plus the typographic apostrophe
simbolos_extra = '’'
re_punc = re.compile('[%s%s]' % (re.escape(string.punctuation), re.escape(simbolos_extra)))

# Remove punctuation from each word
stripped = [re_punc.sub('', w) for w in palabras]
print("\n///// TEXTO SIN SIGNOS DE PUNTUACION\n")  # "text without punctuation marks"
print(stripped[:100])

# Convert each word to lowercase
stripped = [w.lower() for w in stripped]
print("\n///// TEXTO EN MINUSCULAS\n")  # "text in lowercase"
print(stripped[:100])

# Get the English stopwords as a set for fast membership tests
stopwords_english = set(stopwords.words('english'))

# Convert the set of stopwords to a list (set order is arbitrary)
stopwords_english_list = list(stopwords_english)

# Print 5 of the stopwords
print("\n///// PRIMERAS 5 PALABRAS VACÍAS\n")  # "first 5 stopwords"
print(stopwords_english_list[:5])

# Remove stopwords
filtered_words = [word for word in stripped if word not in stopwords_english]

# Print the first 100 words without stopwords
print("\n///// TEXTO SIN PALABRAS VACIAS\n")  # "text without stopwords"
print(filtered_words[:100])
```
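
In `lab1.4.py`, a token made up only of punctuation (for example `--`) becomes an empty string after the regex substitution. The empty string is not a stopword, so it remains in `filtered_words`. The sketch below shows one possible way to drop such tokens; the sample sentence and the `toy_stopwords` set are assumptions standing in for the real text and for `stopwords.words('english')`.

```python
import re
import string

re_punc = re.compile("[%s’]" % re.escape(string.punctuation))
toy_stopwords = {"the", "and", "a"}  # stand-in for NLTK's English stopwords

words = "The King -- and the Queen ...".split()

# Same two steps as the script: strip punctuation, then lowercase
stripped = [re_punc.sub("", w).lower() for w in words]

# Filter as in lab1.4.py: empty strings pass through
as_in_script = [w for w in stripped if w not in toy_stopwords]

# Adding a truthiness check also discards the empty strings
without_empties = [w for w in stripped if w and w not in toy_stopwords]

print(as_in_script)
print(without_empties)
```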

## Getting Started

To get a local copy up and running, follow these steps.

### Prerequisites

-   Python 3.x

-   NLTK library

### Installation

1.  Clone the repo

    ```sh
    git clone https://github.com/KPlanisphere/grimm-text-processor.git
    ```

2.  Install the required Python packages

    ```sh
    pip install nltk
    ```

## Usage

Run each script using Python:

```sh
python lab1.1.py
python lab1.2.py
python lab1.3.py
python lab1.4.py
```

## Project Documentation

### Introduction

In this practice, Python was used to write code that takes a given text, separates it into tokens, removes punctuation marks, converts the text to lowercase, and removes stopwords. The scripts rely on Python's built-in `str.split`, the standard-library `string` (for `string.punctuation`) and `re` modules, and the `stopwords` corpus from the NLTK library.

### Objective

Learn to prepare texts for use in information retrieval processes. This involves separating the text into tokens, removing useless tokens (punctuation marks, numbers, stopwords), and converting to lowercase.

### Experimental Development

1.  **Tokenization**: Using spaces as delimiters, split the text into words and print the first 100 words.

2.  **Punctuation Removal**: Print the punctuation marks, use `re` to remove them from each word, and then print the first 100 words after removing punctuation.

3.  **Lowercase Conversion**: Convert the text to lowercase and print the first 100 words.

4.  **Stopwords Removal**: Import stopwords in English, print the first 5 stopwords, and then print the first 100 words after removing stopwords.

### Discussion and Results

-   Tokenization separates the text into individual words.

-   Punctuation removal cleans the text by removing punctuation marks.

-   Lowercase conversion standardizes the text.

-   Stopwords removal filters out common words that do not add much meaning.
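
The effect of stopword removal can be seen by counting token frequencies before and after filtering. The following sketch uses a made-up sentence and a small `toy_stopwords` set (both assumptions for illustration, not data or results from the lab) together with `collections.Counter`.

```python
from collections import Counter

toy_stopwords = {"the", "and", "a", "to", "of", "in", "was"}  # illustrative only

text = ("the king was in the castle and the queen was in the garden "
        "and the frog sat in the well")
tokens = text.split()

# Frequency of every token, then of the tokens that survive filtering
before = Counter(tokens)
after = Counter(t for t in tokens if t not in toy_stopwords)

print(before.most_common(3))
print(after.most_common(3))
print(len(tokens), sum(after.values()))
```

In this toy sentence the most frequent tokens are all stopwords, and removing them leaves only the content words.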

### Conclusion

Through constant practice, we have gained a deeper understanding of handling various functions related to text processing in Python. This process has allowed us to solidify our comprehension, providing a stronger foundation for text preparation and manipulation for information retrieval using Python and the NLTK library.
