https://github.com/tomhalloin/Springboard-Berkshire
Topic model analysis of Berkshire Hathaway annual letters (Completed Capstone Project #2)
gensim nlp spacy springboard textacy topic-modeling
- Host: GitHub
- URL: https://github.com/tomhalloin/Springboard-Berkshire
- Owner: tomhalloin
- License: mit
- Created: 2020-03-31T20:55:44.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2022-12-08T07:26:54.000Z (about 2 years ago)
- Last Synced: 2024-08-14T22:31:48.837Z (4 months ago)
- Topics: gensim, nlp, spacy, springboard, textacy, topic-modeling
- Language: Java
- Homepage:
- Size: 25.7 MB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 22
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-springboard-capstones - GitHub
README
# Springboard Capstone Project II: Topic Model Analysis of Berkshire Hathaway's Shareholder Letters
This project analyzes Berkshire Hathaway's annual shareholder letters using Natural Language Processing in Python. It applies three extractive summarization methods: [LexRank](https://raw.githubusercontent.com/toshimelonhead/Springboard-Berkshire/master/Outputs/Summaries/LexRank_Summaries_summaries.txt), [TextRank](https://raw.githubusercontent.com/toshimelonhead/Springboard-Berkshire/master/Outputs/Summaries/TextRank_Summaries_summaries.txt), and [Latent Semantic Analysis](https://raw.githubusercontent.com/toshimelonhead/Springboard-Berkshire/master/Outputs/Summaries/LSA_Summaries_summaries.txt). It also fits topic models with the Mallet wrapper from Gensim and with a Java version of Mallet LDA.
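For readers unfamiliar with extractive summarization, here is a minimal, standard-library-only sketch of the TextRank idea: score sentences by running PageRank over a sentence-similarity graph, then keep the top-scoring ones in their original order. It is not the code used in the notebooks, which may rely on a summarization library; the function names, the naive sentence splitter, and the parameter defaults are all assumptions made for illustration.

```python
import math
import re


def split_sentences(text):
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Return the set of lowercase word tokens in a sentence."""
    return set(re.findall(r"[a-z']+", sentence.lower()))


def similarity(a, b):
    """Word overlap normalized by log sentence sizes, in the style of TextRank."""
    if len(a) < 2 or len(b) < 2:
        return 0.0  # avoids log(1) = 0 in the denominator
    return len(a & b) / (math.log(len(a)) + math.log(len(b)))


def textrank_summary(text, n_sentences=2, damping=0.85, iterations=50):
    """Pick the n_sentences highest-ranked sentences, kept in document order."""
    sentences = split_sentences(text)
    n = len(sentences)
    if n <= n_sentences:
        return sentences
    tokens = [tokenize(s) for s in sentences]
    # Weighted, undirected similarity graph between sentences.
    weights = [
        [similarity(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sums = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    # Fixed number of power-iteration steps of weighted PageRank.
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n)
                if out_sums[j] > 0
            )
            for i in range(n)
        ]
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:n_sentences]
    return [sentences[i] for i in sorted(top)]
```

Calling `textrank_summary(letter_text, n_sentences=5)` on the text of one letter would return five of its sentences. LexRank follows the same graph-ranking pattern with a different similarity measure, while Latent Semantic Analysis ranks sentences through a matrix decomposition instead.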
If you plan to run this code, first set the file locations and the Mallet paths to match your own machine; otherwise the code will not run. I recommend skipping the scraping notebook and using the letters included in this repository, because Berkshire's website usually denies access when I try to scrape multiple letters at once.
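One way to keep machine-specific Mallet paths out of the notebook is to resolve them once and fail early with a clear message. The sketch below is a suggestion rather than what the notebooks do: the helper name `resolve_mallet_binary` is hypothetical, and reading the install location from a `MALLET_HOME` environment variable is an assumption of this example. It relies only on the fact that a Mallet install ships its launcher under `bin/` (`mallet`, or `mallet.bat` on Windows).

```python
import os
import sys
from pathlib import Path


def resolve_mallet_binary(mallet_home=None):
    """Return the path to the Mallet launcher, or raise if it cannot be found.

    mallet_home: root folder of the Mallet install. If omitted, the
    MALLET_HOME environment variable is used (an assumption of this sketch).
    """
    home = mallet_home or os.environ.get("MALLET_HOME")
    if not home:
        raise FileNotFoundError("Set MALLET_HOME or pass mallet_home explicitly.")
    launcher = "mallet.bat" if sys.platform.startswith("win") else "mallet"
    binary = Path(home) / "bin" / launcher
    if not binary.is_file():
        raise FileNotFoundError(f"No Mallet launcher found at {binary}")
    return binary
```

The returned path can then be passed, as a string, wherever the notebook currently has a hard-coded Mallet location.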
Also note that the final topics change from run to run, even with the same random seed, so the final topic distributions in the notebook and the report differ slightly.
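If you want to gauge how far the topics drift between two runs, one simple option is to match topics by the overlap of their top words. The sketch below is not part of the project; `jaccard` and `match_topics` are hypothetical helpers, and the word lists in the usage note are invented for illustration rather than taken from the project's outputs.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections of words."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_topics(run_a, run_b):
    """For each topic in run_a, find its closest topic in run_b.

    Each run is a list of topics, and each topic is a list of top words.
    Returns a list of (index_in_a, index_in_b, overlap_score) tuples.
    """
    matches = []
    for i, words_a in enumerate(run_a):
        scores = [jaccard(words_a, words_b) for words_b in run_b]
        best = max(range(len(run_b)), key=scores.__getitem__)
        matches.append((i, best, scores[best]))
    return matches
```

For example, with made-up topics `run_a = [["insurance", "float", "premium", "underwriting"], ["earnings", "share", "stock", "value"]]` and `run_b = [["stock", "share", "value", "price"], ["insurance", "float", "premium", "loss"]]`, `match_topics(run_a, run_b)` pairs each topic with its reordered counterpart. Low overlap scores across the board would suggest the two runs found substantially different topics, not just the same topics in a different order.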
[Notebook to Scrape the Letters](https://github.com/toshimelonhead/Springboard-Berkshire/blob/master/Notebooks/Final%20Version/Scraping_Letters.ipynb)
[Notebook for Everything Else](https://nbviewer.jupyter.org/github/toshimelonhead/Springboard-Berkshire/blob/e0c3270166722a21765e415b4de800396537ec99/Notebooks/Final%20Version/Final_Version.ipynb)
[Final Writeup](https://github.com/toshimelonhead/Springboard-Berkshire/blob/master/Reports/Final%20Paper%202.0.pdf)
[Final Presentation](https://github.com/toshimelonhead/Springboard-Berkshire/blob/master/Reports/Final%20Presentation.pdf)