https://github.com/arkadiuszkaros/nlp-book-pos-extractor




Host: GitHub
URL: https://github.com/arkadiuszkaros/nlp-book-pos-extractor
Owner: ArkadiuszKaros
License: mit
Created: 2024-10-16T17:33:03.000Z (3 months ago)
Default Branch: main
Last Pushed: 2024-10-18T18:20:07.000Z (3 months ago)
Last Synced: 2024-10-20T02:07:06.144Z (3 months ago)
Topics: extractor, nlp, part-of-speech-tagging, python, spacy
Language: Python
Homepage:
Size: 18.9 MB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


# NLP Book POS Extractor

## Overview
This project focuses on extracting sentences from the text of two popular book series: *Harry Potter* and *Game of Thrones*. Using Natural Language Processing (NLP) techniques powered by [spaCy](https://spacy.io/), the project aims to identify and analyze the parts of speech (POS) for each word in a sentence.

## Data Sources
The text data is sourced from the following publicly available datasets on Kaggle:
- **Harry Potter Books**: [Kaggle Dataset](https://www.kaggle.com/datasets/shubhammaindola/harry-potter-books)
- **Game of Thrones Books**: [Kaggle Dataset](https://www.kaggle.com/datasets/saurabhbadole/game-of-thrones-book-dataset)

## Project Features
- Extracts sentences from the books.
- Processes each sentence using spaCy to identify parts of speech (nouns, adjectives, verbs, etc.).
- Supports multiple NLP models for POS tagging.
- Can be used for language learning, text-based games, or further linguistic analysis.
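
As a rough illustration of how tagged output could feed the uses listed above, the sketch below groups already-tagged words by part of speech. The function name `vocabulary_by_pos` and the hand-written sample are hypothetical and not taken from the repository; the tag names (`NOUN`, `VERB`, `ADJ`) are the Universal POS labels that spaCy exposes as `token.pos_`.

```python
from collections import Counter, defaultdict

def vocabulary_by_pos(tagged_tokens, keep=("NOUN", "VERB", "ADJ")):
    """Count lower-cased words per POS tag, keeping only the tags in `keep`."""
    vocab = defaultdict(Counter)
    for word, pos in tagged_tokens:
        if pos in keep:
            vocab[pos][word.lower()] += 1
    # Return a plain dict so looking up a missing tag raises KeyError.
    return dict(vocab)

# Hand-written (word, POS) pairs for illustration only; not real model output.
sample = [
    ("The", "DET"), ("old", "ADJ"), ("wizard", "NOUN"),
    ("raised", "VERB"), ("his", "PRON"), ("wand", "NOUN"), (".", "PUNCT"),
]

vocab = vocabulary_by_pos(sample)
for pos, counts in vocab.items():
    print(pos, counts.most_common())
```

A word list grouped this way could, for example, back a vocabulary exercise or a word pool for a text-based game.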

## How It Works
1. **Sentence Extraction**: The script reads the text files of both series and extracts individual sentences.
2. **POS Tagging**: Each sentence is processed with spaCy to identify various parts of speech, such as nouns, verbs, adjectives, and more.
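
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual script: the function names `load_pipeline` and `tag_sentences` are hypothetical, and `en_core_web_sm` is an assumed model name, since the README does not say which spaCy models are used.

```python
from typing import Callable, Iterator, List, Tuple

TaggedSentence = Tuple[str, List[Tuple[str, str]]]

def load_pipeline(model_name: str = "en_core_web_sm"):
    """Load a trained spaCy pipeline (the model name is an assumption)."""
    import spacy  # imported here so the helper below does not need spaCy at import time
    return spacy.load(model_name)

def tag_sentences(text: str, nlp: Callable) -> Iterator[TaggedSentence]:
    """Yield each sentence with its (token, POS) pairs.

    `nlp` is expected to be a spaCy pipeline that sets sentence
    boundaries (for example via its parser or a sentencizer).
    """
    doc = nlp(text)
    for sent in doc.sents:
        # Skip whitespace-only tokens such as line breaks in the raw text.
        pairs = [(token.text, token.pos_) for token in sent if not token.is_space]
        if pairs:
            yield sent.text.strip(), pairs
```

With spaCy and a model installed, `for sentence, tags in tag_sentences(text, load_pipeline()): print(sentence, tags)` would print each sentence followed by its tags. spaCy's `nlp.max_length` defaults to 1,000,000 characters, so a full book may need to be split (for instance by chapter) before processing.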

## Installation

1. Clone the repository:
```bash
git clone https://github.com/arkadiuszkaros/nlp-book-pos-extractor.git
cd nlp-book-pos-extractor
```

2. Install the required dependencies:
```bash
pip install -r requirements.txt
```

3. Feel free to use 😉
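
After installing, a quick check like the one below can confirm that spaCy and a language model are importable. spaCy models are installed separately from the library itself (typically with `python -m spacy download <model>`), and it is not stated here whether `requirements.txt` already covers one. The model name `en_core_web_sm` and the helper `missing_modules` are assumptions for illustration.

```python
import importlib.util
from typing import Iterable, List

# Assumption: "en_core_web_sm" is only an example; the README does not
# name the models the project uses.
REQUIRED_MODULES = ("spacy", "en_core_web_sm")

def missing_modules(names: Iterable[str] = REQUIRED_MODULES) -> List[str]:
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_modules()
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("spaCy and the example model are importable.")
```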