https://github.com/harshstats/natural-language-processing-in-practice


Host: GitHub
URL: https://github.com/harshstats/natural-language-processing-in-practice
Owner: HarshStats
Created: 2024-03-08T11:46:41.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-03-10T14:44:19.000Z (over 2 years ago)
Last Synced: 2025-01-24T20:38:02.645Z (over 1 year ago)
Language: Jupyter Notebook
Size: 38.1 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Natural-Language-Processing-in-Practice

## Topic: Natural Language Processing and Text as Data Exercises

## Overview

During the Winter Semester of 2023/24 at Technische Universität Dortmund, under the guidance of Dr. J. Rieger and M. Sc. K.-R. Lange, I worked through a series of exercises on Natural Language Processing (NLP) and Text as Data. This repository contains my solutions, which cover language detection, sentiment analysis, text clustering, and topic modeling, among other topics.

## Exercises and Solutions

### Language Detection
I identified the language of given texts using pretrained transformer models from Hugging Face, applied as-is with no labeled training data of my own. This exercise showed how open-source transformers can process and classify text efficiently without the need for extensive training.

### Movie Genre Identification
Given summaries of different movies with no genre labels, I ran an unsupervised analysis to infer their likely genres. With no predefined labels or models to fall back on, this exercise left room for creative approaches to the task.

### Analyzing "Star Trek" Character Relationships
Utilizing Word2Vec, I examined how varying window sizes affect the relationships mapped by the model among characters in the "Star Trek" franchise. This task deepened my understanding of embedding models and their sensitivities to contextual window sizes.
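
To make the role of the window size concrete, here is a minimal sketch that counts which tokens fall inside the context window of a target word. It is not Word2Vec itself and not the code from the notebooks: it only shows the raw context a window-based model gets to see, using plain co-occurrence counts. The function name `cooccurrences` and the example sentence are made up for illustration. In gensim's `Word2Vec`, the corresponding setting is the `window` parameter.

```python
from collections import Counter


def cooccurrences(tokens: list[str], target: str, window: int) -> Counter:
    """Count tokens within `window` positions of each occurrence of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the target position itself
                counts[tokens[j]] += 1
    return counts


# Made-up sentence, for illustration only.
tokens = "kirk gave the order and spock raised an eyebrow at kirk".split()

narrow = cooccurrences(tokens, "spock", window=1)
wide = cooccurrences(tokens, "spock", window=5)

print(sorted(narrow))                      # immediate neighbours only
print("kirk" in narrow, "kirk" in wide)    # the other character appears only in the wide window
```

With a narrow window the target is linked mostly to function words and immediate neighbours; a wider window pulls in other character names from the same sentence, which is one plausible reason character relationships shift as the window grows.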

### Text Preprocessing and Frequency Analysis
I performed basic preprocessing steps on movie reviews, including removing special characters, case normalization, and tokenization. Additionally, I analyzed the frequency of words to understand their significance in the corpus.
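
The three preprocessing steps and the frequency count can be sketched with the standard library alone. This is an illustrative version under simple assumptions (ASCII text, whitespace tokenization), not the notebook's actual code; the helper names `preprocess` and `word_frequencies` and the sample reviews are hypothetical.

```python
import re
from collections import Counter


def preprocess(text: str) -> list[str]:
    """Lowercase, replace special characters with spaces, split on whitespace."""
    text = text.lower()                        # case normalization
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # remove special characters
    return text.split()                        # whitespace tokenization


def word_frequencies(reviews: list[str]) -> Counter:
    """Count token occurrences across all reviews."""
    counts = Counter()
    for review in reviews:
        counts.update(preprocess(review))
    return counts


# Made-up reviews, for illustration only.
reviews = [
    "Great movie! Great cast, great score.",
    "The plot was thin, but the cast was great.",
]
freqs = word_frequencies(reviews)
print(freqs.most_common(3))
```

Note that the character filter above drops accented and non-Latin letters as well; a real corpus may need a Unicode-aware pattern and a proper tokenizer.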

### Fake News Detection
This challenging task involved differentiating between factual and fake news articles. I trained a document embedding model and utilized logistic regression for classification, which underscored the potential of embedding models in identifying misinformation.

### Academic Paper Categorization
I built a pipeline to predict the categories of academic papers using various models discussed in our lectures and exercises. This task showcased the versatility of NLP techniques in organizing and classifying extensive scholarly data.

### Exploring the "Harry Potter" Series
Through textual analysis of the "Harry Potter" books, I applied preprocessing, topic modeling, and tf-idf calculations to uncover underlying themes, demonstrating the power of NLP in literary analysis.
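
As a reference for the tf-idf step, the sketch below computes the textbook weighting, tf = count / document length and idf = log(N / document frequency), on pre-tokenized documents. It is an assumption that this plain variant matches what the notebooks use; library implementations often apply smoothing or normalization on top. The function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: weight} dict per document (unsmoothed tf-idf)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document

    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Made-up token lists standing in for preprocessed chapters.
docs = [
    "the wand chose the wizard".split(),
    "the owl carried the letter".split(),
    "the wizard read the letter".split(),
]
scores = tf_idf(docs)
print(sorted(scores[0].items(), key=lambda kv: kv[1], reverse=True))
```

A term that occurs in every document, such as "the" here, gets weight zero, while terms unique to one document rank highest. That is what makes tf-idf useful for surfacing the vocabulary that distinguishes one book or chapter from the others.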

### Sentiment Analysis on Reddit Comments
Analyzing sentiment in Reddit comments related to Bitcoin allowed me to correlate public sentiment with Bitcoin's price fluctuations, illustrating NLP's utility in financial market analysis.
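
One simple way to quantify such a relationship is the Pearson correlation between a daily sentiment score and the daily price change. The sketch below assumes both series are already aggregated per day; whether the notebooks use this exact measure is an assumption. The function name `pearson` is hypothetical and the numbers are placeholders, not real Reddit or Bitcoin data.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Placeholder numbers, NOT real data:
# mean daily sentiment score and daily price change in percent.
daily_sentiment = [0.10, 0.30, -0.20, 0.05, 0.40]
daily_price_change = [0.5, 1.2, -0.8, 0.1, 1.5]

r = pearson(daily_sentiment, daily_price_change)
print(f"Pearson r = {r:.3f}")
```

A coefficient near 1 or -1 indicates that the two series move together or in opposition; it does not by itself show that sentiment drives the price or the other way round.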

### News Category Clustering and Classification
I clustered short news descriptions with LDA and k-means and checked how well the resulting clusters matched the true categories, which highlighted how difficult text clustering is on real-world datasets.
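
Because cluster ids from k-means or LDA are arbitrary, comparing clusters with known categories needs some mapping between the two. A common, simple choice is to assign each cluster its majority category and report purity (the share of items that agree with their cluster's majority). The sketch below illustrates that idea; it is an assumption that the notebooks evaluate the match this way, and the function name `cluster_purity` and the labels are made up.

```python
from collections import Counter, defaultdict


def cluster_purity(
    cluster_ids: list[int], categories: list[str]
) -> tuple[dict[int, str], float]:
    """Map each cluster to its majority category; return (mapping, purity)."""
    if len(cluster_ids) != len(categories) or not cluster_ids:
        raise ValueError("need two non-empty lists of equal length")

    # Tally the true categories observed inside each cluster.
    members: dict[int, Counter] = defaultdict(Counter)
    for cid, cat in zip(cluster_ids, categories):
        members[cid][cat] += 1

    mapping = {cid: counts.most_common(1)[0][0] for cid, counts in members.items()}
    correct = sum(counts[mapping[cid]] for cid, counts in members.items())
    return mapping, correct / len(cluster_ids)


# Made-up cluster assignments and categories, for illustration only.
cluster_ids = [0, 0, 0, 1, 1, 2, 2, 2]
categories = ["sports", "sports", "politics", "politics",
              "politics", "tech", "tech", "sports"]

mapping, purity = cluster_purity(cluster_ids, categories)
print(mapping, purity)
```

Purity rewards many small clusters, so it is best compared across runs with the same number of clusters.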

### Advanced Topics
Further exercises included analyzing "A Song of Ice and Fire" books, leveraging LDA models to trace thematic developments, and employing seeded LDA for emotion classification in tweets. These tasks pushed the boundaries of traditional topic modeling, integrating novel approaches to cater to specific analytical needs.

## Reflections

Throughout these exercises, I've not only developed a deeper understanding of NLP's theoretical underpinnings but also honed my practical skills in applying these concepts to diverse and complex datasets. Each task was an opportunity to explore the multifaceted nature of text as data, navigating through its challenges and reveling in the insights it provided.

This repository is a testament to my growth in the field of NLP and embodies my continuous pursuit of knowledge and proficiency in handling text data. I hope it serves as a valuable resource for others embarking on their journey in NLP.

## Acknowledgments

I extend my deepest gratitude to Dr. J. Rieger and M. Sc. K.-R. Lange for their expert guidance and support throughout the semester. Their commitment to fostering a deep understanding of NLP among their students has been truly inspirational.
