https://github.com/AlbertSuarez/azlyrics-scraper
🎵 AZLyrics scraper for getting song lyrics and publishing them to Box
azlyrics box cloud-storage dataset scraper songs
- Host: GitHub
- URL: https://github.com/AlbertSuarez/azlyrics-scraper
- Owner: AlbertSuarez
- License: mit
- Created: 2019-07-06T08:38:37.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2020-01-31T10:21:12.000Z (over 5 years ago)
- Last Synced: 2024-07-31T20:50:04.558Z (about 1 year ago)
- Topics: azlyrics, box, cloud-storage, dataset, scraper, songs
- Language: Python
- Homepage: https://app.box.com/s/vats4n6slxtknuaxz58mxlo6ry8v04pd?sortColumn=name&sortDirection=ASC
- Size: 33.2 KB
- Stars: 18
- Watchers: 2
- Forks: 7
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
# AZLyrics scraper
[Box folder URL](https://app.box.com/s/vats4n6slxtknuaxz58mxlo6ry8v04pd) | [Static repo website](https://asuarez.dev/azlyrics-scraper/) | [Kaggle dataset](https://www.kaggle.com/albertsuarez/azlyrics)
🎵 AZLyrics scraper for getting all the song lyrics and publishing them to Box.
## Python requirements
This project uses Python 3. All requirements are specified in the `requirements.lock` file.
1. [Requests](https://2.python-requests.org/en/master/): used for retrieving the HTML content of a website.
2. [BeautifulSoup](https://pypi.org/project/beautifulsoup4/): used for parsing and scraping HTML content.
3. [Tor](https://2019.www.torproject.org/docs/debian.html.en): used for anonymizing requests by routing them through other IPs.
4. [Stem](https://stem.torproject.org/): used for authenticating with Tor so that every request can go out from a different IP.
5. [Fake User-Agent](https://pypi.org/project/fake-useragent/): used for sending a random User-Agent with every request.
6. [Unidecode](https://pypi.org/project/Unidecode/): used for cleaning strings by transliterating non-ASCII characters.
7. [Box SDK](https://github.com/box/box-python-sdk): used for uploading/downloading files to/from Box Cloud Storage.
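As a rough illustration of how the first few libraries fit together, the sketch below fetches a page, extracts the text of one element and cleans it. It is not the repository's actual code: the function names and the `lyrics` CSS class are hypothetical, and the real AZLyrics markup may differ.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace with a single space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def clean_text(text: str) -> str:
    """Transliterate to ASCII with Unidecode, then normalize whitespace."""
    # Imported lazily so collapse_whitespace() works without the package installed.
    from unidecode import unidecode

    return collapse_whitespace(unidecode(text))


def extract_text(html: str, css_class: str = "lyrics") -> str:
    """Return the text of the first <div> with the given class, or ''.

    The "lyrics" class is a hypothetical selector, not AZLyrics' real markup.
    """
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    node = soup.find("div", class_=css_class)
    return node.get_text(separator="\n", strip=True) if node else ""


def fetch_html(url: str, timeout: float = 30.0) -> str:
    """Download a page with Requests and return its HTML as text."""
    import requests

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text
```

Under these assumptions, a full pass over one page would be `clean_text(extract_text(fetch_html(url)))`.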
## Recommendations
Usage of [virtualenv](https://realpython.com/blog/python/python-virtual-environments-a-primer/) is recommended for package library / runtime isolation.
## Usage
To run this script, please execute the following from the root directory:
1. Set up a virtual environment
2. Install dependencies
```bash
pip3 install -r requirements.lock
```
3. Move the [JWT configuration](#jwt-configuration) file obtained from the Box API into the `data` folder
4. Install [Tor](https://2019.www.torproject.org/docs/debian.html.en)
5. Configure Tor IP renewal by editing the `/etc/tor/torrc` file
```
ControlPort 9051
CookieAuthentication 1
```
6. Restart the Tor service
```bash
sudo service tor restart
```
7. Run the script
```bash
python3 -m src
```
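The `torrc` settings from step 5 open Tor's control port, which Stem can use to request a new circuit (and usually a new exit IP) between requests. The following is a minimal sketch of that idea, not the project's own implementation. The function names are hypothetical, port 9050 is Tor's default SOCKS port rather than a value taken from this repository, and the `socks5h://` scheme assumes Requests is installed with SOCKS support (`requests[socks]`).

```python
def tor_proxies(host: str = "127.0.0.1", socks_port: int = 9050) -> dict:
    """Build a Requests proxy mapping that routes traffic through Tor.

    "socks5h" makes DNS lookups go through the proxy as well.
    """
    proxy_url = f"socks5h://{host}:{socks_port}"
    return {"http": proxy_url, "https": proxy_url}


def renew_tor_identity(control_port: int = 9051) -> None:
    """Ask Tor for a new circuit through the control port set in torrc."""
    # Imported lazily so tor_proxies() works without Stem installed.
    from stem import Signal
    from stem.control import Controller

    with Controller.from_port(port=control_port) as controller:
        # With "CookieAuthentication 1", Stem authenticates using Tor's cookie file.
        controller.authenticate()
        controller.signal(Signal.NEWNYM)


def fetch_via_tor(url: str, timeout: float = 30.0) -> str:
    """GET a URL through Tor with a random User-Agent header."""
    import requests
    from fake_useragent import UserAgent

    headers = {"User-Agent": UserAgent().random}
    response = requests.get(
        url, headers=headers, proxies=tor_proxies(), timeout=timeout
    )
    response.raise_for_status()
    return response.text
```

A scraper built this way might call `renew_tor_identity()` before each `fetch_via_tor(url)`. Tor may rate-limit new-identity requests, so the exit IP is not guaranteed to change on every call.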
## JWT configuration
To use the Box Cloud Storage API securely, this project authenticates with JWT. Following the [tutorial](https://developer.box.com/docs/construct-jwt-claim-manually) produces a configuration file, which must be placed in the `data` folder with the name `jwt_config.json`, matching the path set in the `__init__.py` configuration file:
```python
# Box integration
BOX_CONFIG_FILE_PATH = 'data/jwt_config.json'
```
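For illustration, a client could be built from that file with the Box SDK roughly as follows. This is a sketch under assumptions, not the repository's code: `missing_config_keys` and `build_box_client` are hypothetical helpers, the two required keys reflect the layout of the configuration file Box typically generates, and JWT support assumes the SDK is installed with its `jwt` extra.

```python
import json
from pathlib import Path

# Same path as in the project's __init__.py configuration
BOX_CONFIG_FILE_PATH = 'data/jwt_config.json'

# Assumed top-level keys of a Box-generated JWT configuration file
REQUIRED_KEYS = ('boxAppSettings', 'enterpriseID')


def missing_config_keys(config: dict) -> list:
    """Return the required top-level keys that are absent from the config."""
    return [key for key in REQUIRED_KEYS if key not in config]


def build_box_client(config_path: str = BOX_CONFIG_FILE_PATH):
    """Validate the JWT config file and return an authenticated Box client."""
    # Imported lazily so missing_config_keys() works without the SDK installed.
    from boxsdk import Client, JWTAuth

    path = Path(config_path)
    if not path.is_file():
        raise FileNotFoundError(f'Box JWT config not found: {path}')

    missing = missing_config_keys(json.loads(path.read_text()))
    if missing:
        raise ValueError(f'Box JWT config is missing keys: {missing}')

    auth = JWTAuth.from_settings_file(str(path))
    return Client(auth)
```

Checking the file before handing it to the SDK is a design choice made here so that a misplaced or incomplete `jwt_config.json` produces a clear error message.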
## Authors
- [Albert Suàrez](https://github.com/AlbertSuarez)
## License
MIT © AZLyrics scraper