Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/artzaragozagithub/nlp--sentiment_analysis_and_summarization_of_stock_news
A Natural Language Processing (NLP) sentiment analysis system that automatically processes and analyzes news articles to gauge market sentiment and summarizes the news at a weekly level, with the aim of improving the accuracy of stock price predictions and optimizing investment strategies.
classifier-training confusion-matrix decisiontreeclassifier eda glove-embeddings gridsearchcv keyedvectors llama mistral-7b myplot nlp-keywords-extraction numpy-library pandas-library prompt-engineering sentiment-analysis sklearn-library text-processing text-summarization transformers-models word2vec
Last synced: 8 days ago
- Host: GitHub
- URL: https://github.com/artzaragozagithub/nlp--sentiment_analysis_and_summarization_of_stock_news
- Owner: ArtZaragozaGitHub
- License: apache-2.0
- Created: 2025-02-03T19:53:19.000Z (17 days ago)
- Default Branch: main
- Last Pushed: 2025-02-12T05:01:22.000Z (9 days ago)
- Last Synced: 2025-02-12T06:27:51.052Z (9 days ago)
- Topics: classifier-training, confusion-matrix, decisiontreeclassifier, eda, glove-embeddings, gridsearchcv, keyedvectors, llama, mistral-7b, myplot, nlp-keywords-extraction, numpy-library, pandas-library, prompt-engineering, sentiment-analysis, sklearn-library, text-processing, text-summarization, transformers-models, word2vec
- Language: Jupyter Notebook
- Homepage:
- Size: 5.27 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# NLP--Sentiment_Analysis_and_Summarization_of_Stock_News
This notebook builds an NLP-driven sentiment analysis solution that processes and analyzes news articles to gauge market sentiment, with the aim of predicting stock price and volume. It also summarizes lengthy news at a weekly level to improve the accuracy of stock price predictions and help optimize investment strategies. GloVe, Word2Vec, and Transformer models are compared for accuracy and fine-tuned. After training the models with different classifiers, I experimented with specific configurations to optimize each model's initial tuning parameters.
Additionally, I used the different available TPUs and GPUs to determine the processing trade-offs between cost and speed.

I used the following libraries:
* 1-To manipulate and analyze data: pandas, numpy.
* 2-To visualize data: matplotlib.pyplot, seaborn.
* 3-To parse JSON data: json.
* 4-To build, tune, and evaluate ML models:
  * sklearn.ensemble: GradientBoostingClassifier, RandomForestClassifier.
  * sklearn.tree: DecisionTreeClassifier.
  * sklearn.model_selection: GridSearchCV.
  * sklearn.metrics: confusion_matrix, accuracy_score, f1_score, precision_score, recall_score.
* 5-To load/create word embeddings: gensim.models (Word2Vec, KeyedVectors), gensim.scripts.glove2word2vec (glove2word2vec).
* 6-To work with transformer models: torch, sentence_transformers.
* 7-To summarize with NLP models: Llama and Mistral-7B, configured through max_tokens, temperature, top_p, and top_k.

[Image: Models TradeOffs]
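Word-level embeddings such as Word2Vec or GloVe have to be turned into one fixed-length vector per news item before a classifier can use them. The README does not say how the notebook does this; a common approach is mean pooling, sketched below. The embedding table, its values, and the function name `document_vector` are hypothetical stand-ins for vectors that would normally be loaded through gensim.

```python
import numpy as np

# Toy 3-dimensional embedding table (hypothetical values). In practice the
# vectors would come from a trained Word2Vec model or a converted GloVe file.
EMBEDDINGS = {
    "stock": np.array([0.2, 0.1, 0.4]),
    "rises": np.array([0.6, 0.3, 0.0]),
    "falls": np.array([-0.6, -0.3, 0.0]),
}
DIM = 3


def document_vector(text, embeddings=EMBEDDINGS, dim=DIM):
    """Average the vectors of known tokens; return zeros if none are known."""
    tokens = text.lower().split()
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# "sharply" is not in the table, so only "stock" and "rises" are averaged.
vec = document_vector("Stock rises sharply")
print(vec)
```

Out-of-vocabulary tokens are simply skipped here; other strategies (for example a shared "unknown" vector) are equally plausible.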
Best Model Selection:
Tuning the Word2Vec model with a Decision Tree Classifier gave performance metrics comparable to those of a non-tuned Sentence Transformer model:
* ->Model: Tuned Word2Vec ->Accuracy: 0.48 ->F1-Score: 0.48
* ->Model: Non-Tuned Sentence Transformer ->Accuracy: 0.52 ->F1-Score: 0.48

However, the Sentence Transformer requires much more TPU/GPU processing, so for cost considerations the tuned Word2Vec model may be more economical.
[Image: final_model_selection]
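As a rough illustration of what tuning a Decision Tree Classifier with GridSearchCV can look like, here is a minimal sketch. The features are random synthetic data standing in for document embeddings, and the parameter grid, the `f1_weighted` scoring choice, and the three-class label setup are all assumptions, not the notebook's actual configuration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

# Synthetic stand-in for 20-dimensional document embeddings.
X = rng.normal(size=(300, 20))
# Three hypothetical sentiment classes (0, 1, 2), loosely tied to feature 0.
y = np.digitize(X[:, 0] + 0.5 * rng.normal(size=300), [-0.5, 0.5])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Illustrative grid; the notebook's real grid is not given in the README.
param_grid = {"max_depth": [3, 5, 7], "min_samples_split": [2, 5, 10]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    scoring="f1_weighted",
    cv=5,
)
search.fit(X_train, y_train)

# GridSearchCV refits the best estimator on the full training split.
y_pred = search.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average="weighted")
print(search.best_params_, round(accuracy, 2), round(f1, 2))
```

Scoring on a held-out test split, rather than on the cross-validation folds used for the search, keeps the reported accuracy and F1 from being optimistically biased.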
For the second part of this project, I used Llama and Mistral-7B. Note: Llama models come in various sizes, including larger ones like Llama 2 70B. Mistral-7B has been reported to outperform some larger Llama models, such as Llama 2 13B, on certain benchmarks despite its smaller size.
For the news summarization, I had to monitor performance and GPU utilization with the following configuration:

[Image: Llama Model Setup]
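The sampling parameters named in this README (temperature, top_p, top_k) all reshape the next-token probability distribution before a token is drawn. The sketch below shows one simplified way these filters can be applied to a vector of logits. It is not the implementation used by Llama or Mistral runtimes, which may apply the filters in a different order; the function name `filter_distribution` and the example logits are hypothetical, and `temperature` is assumed to be greater than zero.

```python
import numpy as np


def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return next-token probabilities after temperature, top-k and top-p filtering."""
    # Lower temperature sharpens the distribution, higher flattens it.
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]  # token ids, most likely first
    keep = np.ones(len(probs), dtype=bool)

    if top_k > 0:
        # Drop every token beyond the k most likely ones.
        keep[order[top_k:]] = False

    if top_p < 1.0:
        # Keep the smallest prefix whose cumulative mass reaches top_p.
        cumulative = np.cumsum(probs[order])
        cutoff = int(np.searchsorted(cumulative, top_p)) + 1
        keep[order[cutoff:]] = False

    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()  # renormalize the surviving tokens


logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical scores for four candidate tokens
print(filter_distribution(logits, temperature=0.7, top_k=3, top_p=0.9))
```

A separate `max_tokens` setting only caps how many tokens are generated and does not change this distribution.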
This is an example of the input to be processed by the model:
[Image: input]

These are examples of the resulting outputs (summary extraction, keywords, topics, stock value, and price) produced from the news input above when you enter a specific date (interactive mode):

[Image: output-1]
And this is an example when you just need a summary for the top 3 positive/negative events per week:
[Image: output-2]
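Producing a weekly summary implies grouping the news items by week before they are handed to the language model. The sketch below shows one way to do that grouping and assemble a prompt per week using only the standard library. The records, field names, prompt wording, and helper names are all hypothetical; the notebook's actual data schema and prompts are not given in this README.

```python
from collections import defaultdict
from datetime import date

# Hypothetical records; field names are illustrative, not the notebook's schema.
NEWS = [
    {"date": date(2019, 1, 2), "headline": "Company cuts revenue forecast"},
    {"date": date(2019, 1, 4), "headline": "Shares rebound after jobs report"},
    {"date": date(2019, 1, 8), "headline": "New product line announced"},
]


def group_by_week(records):
    """Map (ISO year, ISO week) to the list of headlines from that week."""
    weeks = defaultdict(list)
    for rec in records:
        iso_year, iso_week, _ = rec["date"].isocalendar()
        weeks[(iso_year, iso_week)].append(rec["headline"])
    return dict(weeks)


def build_prompt(headlines):
    """Assemble an illustrative summarization prompt for one week of news."""
    bullet_list = "\n".join(f"- {h}" for h in headlines)
    return (
        "Summarize the top 3 positive and top 3 negative events "
        "from this week's news:\n" + bullet_list
    )


weekly = group_by_week(NEWS)
prompts = {week: build_prompt(items) for week, items in weekly.items()}
for week, prompt in sorted(prompts.items()):
    print(week)
    print(prompt)
```

Keying on the ISO year together with the ISO week avoids merging the first week of one year with the same week number of another.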
Summary of my learnings:
Model Development and Hardware Optimization:
- Explored various classifier configurations and tuning parameters
- Evaluated performance trade-offs across different TPUs and GPUs, considering both cost and processing speed

Data Processing and Analysis:
- Pandas and NumPy for data manipulation and numerical operations
- JSON parsing for structured data handling
- Matplotlib and Seaborn for data visualization

Machine Learning Framework (scikit-learn):
- Tree-Based and Ensemble Classifiers:
  - Gradient Boosting Classifier
  - Random Forest Classifier
  - Decision Tree Classifier
- Model Optimization:
  - GridSearchCV for hyperparameter tuning
- Evaluation Metrics:
  - Confusion Matrix
  - Accuracy Score
  - F1 Score
  - Precision Score
  - Recall Score

Natural Language Processing Tools:
- Word Embeddings:
  - Gensim's Word2Vec
  - GloVe (with glove2word2vec conversion)
  - KeyedVectors for embedding management
- Transformer Models:
  - PyTorch
  - Sentence Transformers
- Large Language Models:
  - Llama
  - Mistral-7B with configurable parameters:
    - Maximum tokens
    - Temperature
    - Top-p sampling
    - Top-k sampling
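
To make the evaluation metrics listed in this summary concrete, and to show why accuracy and F1 can differ for the same predictions, here is a small pure-Python sketch. The labels are made up for illustration, and the support-weighted F1 average is an assumption; the README does not say which averaging the notebook used.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true, y_pred, label):
    """Harmonic mean of precision and recall for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by each class's support."""
    support = Counter(y_true)
    total = sum(per_class_f1(y_true, y_pred, lab) * n for lab, n in support.items())
    return total / len(y_true)


# Made-up labels for illustration only.
y_true = ["pos", "pos", "neg", "neg", "neu", "neu"]
y_pred = ["pos", "neg", "neg", "neg", "neu", "pos"]

print(confusion_counts(y_true, y_pred))
print(round(accuracy(y_true, y_pred), 3), round(weighted_f1(y_true, y_pred), 3))
```

Accuracy counts only exact matches, while F1 also penalizes classes that attract false positives, so the two scores generally diverge when errors are spread unevenly across classes.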