Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ototot/ytpoop
A lyric based recommendation system.
- Host: GitHub
- URL: https://github.com/ototot/ytpoop
- Owner: oToToT
- Created: 2020-03-14T19:43:51.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2020-03-14T19:44:51.000Z (almost 5 years ago)
- Last Synced: 2024-10-19T08:14:49.960Z (2 months ago)
- Language: Python
- Homepage: https://ytp.dve.tw/
- Size: 4.88 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
README
# YTPoop #
## Intro ##
A song recommendation system with lyrics data.
- Members
- [@minson123](https://github.com/minson123-github)
- [@harry900831](https://github.com/harry900831)
- [@oToToT](https://github.com/oToToT)
- Instructor: Pu-Jen Cheng
- Slides: [https://dve.tw/V5B](https://dve.tw/V5B)
## HowTo ##
### Build up song database ###
You could use [@harry900831/mojim_lyrics_crawler](https://github.com/harry900831/mojim_lyrics_crawler) to crawl song data from [mojim.com](http://mojim.com/).
After that, use `load_lyrics.py` to segment the lyrics. Note that we use [ckiptagger](https://github.com/ckiplab/ckiptagger) for word segmentation, so you should download its model data to `./ckiptagger/`.
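A minimal sketch of what the segmentation step might look like, assuming the ckiptagger model data lives under `./ckiptagger/` and that lyrics end up as whitespace-separated tokens, one song per line (the exact input/output handled by `load_lyrics.py` may differ):

```python
from ckiptagger import WS  # pip install ckiptagger tensorflow

# Load the word-segmentation model from the downloaded data directory
# (assumption: the model files live under ./ckiptagger/).
ws = WS("./ckiptagger")

lyrics = [
    "這是一句歌詞",      # hypothetical lyric lines
    "這是另外一句歌詞",
]

# WS takes a list of texts and returns a list of token lists.
segmented = ws(lyrics)

# doc2vecC expects whitespace-separated tokens, one document per line.
with open("lyrics.txt", "w", encoding="utf-8") as f:
    for tokens in segmented:
        f.write(" ".join(tokens) + "\n")
```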
Then we use [doc2vecC](https://github.com/mchen24/iclr2017) to compute document vectors. Since a larger corpus helps training, you may also download the Chinese Wikipedia [dump](https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2) and use `wikiseg.py` to extract and segment the dumped data.
You may also want to concatenate and shuffle the Wikipedia corpus and the lyrics data.
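For reference, a hedged sketch of what the Wikipedia extraction plus the concatenate-and-shuffle step could look like, using gensim's `WikiCorpus` to stream articles out of the dump; the file names and the exact pipeline of `wikiseg.py` are assumptions, not the repository's actual implementation:

```python
import random
from gensim.corpora.wikicorpus import WikiCorpus
from ckiptagger import WS

ws = WS("./ckiptagger")  # assumption: same model directory as above

# Stream article texts out of the compressed dump and segment them.
wiki = WikiCorpus("zhwiki-latest-pages-articles.xml.bz2", dictionary={})
with open("wiki.txt", "w", encoding="utf-8") as out:
    for tokens in wiki.get_texts():
        segmented = ws([" ".join(tokens)])[0]
        out.write(" ".join(segmented) + "\n")

# Concatenate the Wikipedia corpus with the lyrics, shuffle the lines,
# and write the combined training corpus for doc2vecC.
lines = []
for path in ("wiki.txt", "lyrics.txt"):
    with open(path, encoding="utf-8") as f:
        lines.extend(f.readlines())
random.shuffle(lines)
with open("corpus.txt", "w", encoding="utf-8") as f:
    f.writelines(lines)
```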
Here are some notes on using doc2vecC (assuming the corpus is in `corpus.txt` and the lyrics data is in `lyrics.txt`):
1. `wget https://raw.githubusercontent.com/mchen24/iclr2017/master/doc2vecc.c`
2. `gcc doc2vecc.c -o doc2vecc -lm -pthread -O3 -march=native -funroll-loops`
3. `./doc2vecc -train corpus.txt -word wordvectors.txt -output docvectors.txt -cbow 1 -size 100 -window 10 -negative 5 -hs 0 -sample 0 -threads 4 -binary 0 -iter 20 -min-count 10 -test lyrics.txt -sentence-sample 0.1 -save-vocab alldata.vocab`

With the docvectors generated, you may want to merge them back into the original data: run `parsevec.py` with the log generated by `load_lyrics.py`. It will generate two files, `data/songs.json` without the docvectors and `web/songs.json` with the docvectors.
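A rough sketch of that merge step, assuming one docvector per line in `docvectors.txt`, aligned with the song order recorded by `load_lyrics.py`; the `songs_order.json` file and the JSON field names here are hypothetical stand-ins for whatever `parsevec.py` actually reads:

```python
import json

# Songs in the same order as the lines doc2vecC was tested on
# (assumption: this order comes from the load_lyrics.py log).
with open("songs_order.json", encoding="utf-8") as f:  # hypothetical file
    songs = json.load(f)

# Each line of docvectors.txt is a whitespace-separated float vector.
with open("docvectors.txt", encoding="utf-8") as f:
    vectors = [[float(x) for x in line.split()] for line in f if line.strip()]

for song, vec in zip(songs, vectors):
    song["docvector"] = vec

# web/songs.json keeps the vectors; data/songs.json drops them.
with open("web/songs.json", "w", encoding="utf-8") as f:
    json.dump(songs, f, ensure_ascii=False)

stripped = [{k: v for k, v in s.items() if k != "docvector"} for s in songs]
with open("data/songs.json", "w", encoding="utf-8") as f:
    json.dump(stripped, f, ensure_ascii=False)
```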
Then you need to add a YouTube link to every song; you can use `youtubeid.py` to crawl YouTube data by brute force.
P.S. [KKBOX](https://www.kkbox.com/) provides a great [API](https://github.com/KKBOX/OpenAPI-Python) for song data, but we did not know that at the time.
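One hedged way such a crawl could work is to query the YouTube search results page for each song and take the first `videoId` that appears in the embedded JSON; this is only an illustration of the idea (it scrapes an unofficial page that can change at any time, and the `artist`/`title` field names are assumptions), not necessarily what `youtubeid.py` does:

```python
import json
import re
import urllib.parse
from typing import Optional

import requests

def first_youtube_id(query: str) -> Optional[str]:
    """Return the first video id found on the YouTube search results page."""
    url = "https://www.youtube.com/results?search_query=" + urllib.parse.quote(query)
    html = requests.get(url, timeout=10).text
    match = re.search(r'"videoId":"([\w-]{11})"', html)
    return match.group(1) if match else None

with open("data/songs.json", encoding="utf-8") as f:
    songs = json.load(f)

for song in songs:
    # Assumption: each song record has "artist" and "title" fields.
    song["youtube_id"] = first_youtube_id(f"{song['artist']} {song['title']}")

with open("data/songs.json", "w", encoding="utf-8") as f:
    json.dump(songs, f, ensure_ascii=False)
```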
### Launch Elasticsearch service ###
You can use the Dockerfile inside `data` to launch an Elasticsearch service.
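Once the service is up (by default on `localhost:9200`), the song records could be indexed with the official Python client; a minimal sketch, assuming `data/songs.json` holds a list of song objects and the 8.x `elasticsearch` client is installed:

```python
import json
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

with open("data/songs.json", encoding="utf-8") as f:
    songs = json.load(f)

# Bulk-index every song record into a "songs" index.
actions = (
    {"_index": "songs", "_id": i, "_source": song}
    for i, song in enumerate(songs)
)
helpers.bulk(es, actions)
print(es.count(index="songs"))
```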
### Launch Web Server ###
You can use the Dockerfile inside `web` to launch a server from [@oToToT/YTPoop-Server](https://github.com/oToToT/YTPoop-Server). Note that you may want to change the path to the SQLite database or use another database, and you can also change the `session_secret` inside it.
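The actual server lives in the YTPoop-Server repository; purely for illustration, here is a hedged sketch of how a lyric-based recommendation could rank songs by cosine similarity of the doc2vecC vectors stored in `web/songs.json` (the `docvector` and `title` field names are assumptions):

```python
import json
import numpy as np

with open("web/songs.json", encoding="utf-8") as f:
    songs = json.load(f)

# Assumption: each record carries a "docvector" list and a "title" field.
vectors = np.array([s["docvector"] for s in songs], dtype=np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

def recommend(index: int, top_k: int = 5):
    """Return the titles of the top_k songs most similar to songs[index]."""
    scores = vectors @ vectors[index]
    order = [i for i in np.argsort(-scores) if i != index][:top_k]
    return [songs[i]["title"] for i in order]

print(recommend(0))
```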