https://github.com/nikolapeja6/psz_proj
School project for the PSZ (Pronalazenje Skrivenog Znanja, en. Data Mining and Semantic Web) course at the School of Electrical Engineering, University of Belgrade.
- Host: GitHub
- URL: https://github.com/nikolapeja6/psz_proj
- Owner: nikolapeja6
- Created: 2019-05-22T18:44:48.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2019-06-19T22:43:31.000Z (almost 7 years ago)
- Last Synced: 2025-10-04T23:30:36.621Z (8 months ago)
- Topics: data-mining, discogs, school-project, web-crawler
- Language: HTML
- Homepage: https://www.kaggle.com/nikolapeja6/etf-psz-discogs-albums-published-in-sr-and-yu
- Size: 8.15 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# PSZ proj
This school project was created for the [PSZ (Pronalaženje Skrivenog Znanja, en. Data Mining and Semantic Web) course][psz],
which is part of the Master's studies at the [School of Electrical Engineering][school], [University of Belgrade][uni].
The project consisted of crawling the [Discogs][discogs] website to gather data on albums, artists and songs.
The raw data was then pre-processed and stored in a SQLite database (the ```psz_database.db``` file
in the ```data``` folder); this was the first task of the project. The remaining 4 tasks centered on processing
the data, visualizing it, and running unsupervised learning algorithms (in my case, only clustering algorithms).
The whole [project statement][statement] (in Serbian) is located in the ```docs``` folder.
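
As a rough illustration of the storage step, the sketch below writes one album and its songs into SQLite using Python's built-in `sqlite3` module. The table names, column names, and the `store_album` helper are assumptions made for this example; they are not taken from the actual ```psz_database.db``` schema.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only,
# not the layout of the project's real database.
SCHEMA = """
CREATE TABLE IF NOT EXISTS album (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    year INTEGER
);
CREATE TABLE IF NOT EXISTS song (
    id INTEGER PRIMARY KEY,
    album_id INTEGER NOT NULL REFERENCES album(id),
    title TEXT NOT NULL,
    duration_seconds INTEGER
);
"""


def store_album(conn, title, year, songs):
    """Insert one album and its (title, duration) songs; return the album id."""
    with conn:  # commits on success, rolls back if an insert fails
        cursor = conn.execute(
            "INSERT INTO album (title, year) VALUES (?, ?)", (title, year)
        )
        album_id = cursor.lastrowid
        conn.executemany(
            "INSERT INTO song (album_id, title, duration_seconds) VALUES (?, ?, ?)",
            [(album_id, song_title, seconds) for song_title, seconds in songs],
        )
    return album_id


# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
album_id = store_album(
    conn, "Example Album", 1985, [("First Song", 200), ("Second Song", None)]
)
song_count = conn.execute(
    "SELECT COUNT(*) FROM song WHERE album_id = ?", (album_id,)
).fetchone()[0]
```

Allowing `NULL` in `duration_seconds` reflects the fact that some scraped tracklists carry no times.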
## Requirements
In order to run the code, you need to have [Python 3.x][python3] installed.
You will also need the following Python packages:
- requests
- beautifulsoup4
- fuzzywuzzy
- python-Levenshtein
- regex
- matplotlib
- numpy
- cyrtranslit
- scikit-learn
- bokeh
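
One way to check that everything in the list above is available is a small script like the one below. It is only a sketch: the `REQUIRED` mapping and the `missing_packages` helper are names introduced here, and the mapping pairs each pip package name with the module name it is normally imported under (for example, `beautifulsoup4` is imported as `bs4`).

```python
from importlib.util import find_spec

# pip package name -> top-level module name used in "import ..."
REQUIRED = {
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "fuzzywuzzy": "fuzzywuzzy",
    "python-Levenshtein": "Levenshtein",
    "regex": "regex",
    "matplotlib": "matplotlib",
    "numpy": "numpy",
    "cyrtranslit": "cyrtranslit",
    "scikit-learn": "sklearn",
    "bokeh": "bokeh",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, module_name in required.items()
        if find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages; install with: pip install " + " ".join(missing))
    else:
        print("All required packages are installed.")
```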
## Scraping
Discogs pages are not structured uniformly, which made scraping them difficult.
Below are some examples of pages with different structures.
- Albums
  - [versions; tracklist with times but no credits](https://www.discogs.com/Bob-Dylan-Self-Portrait/master/28188)
  - [tracklist with some additional labels but no times](https://www.discogs.com//Mile-Delija-%D0%9E%D1%98-%D0%A1%D1%80%D0%B1%D0%B8%D1%98%D0%BE-%D0%9C%D0%B0%D1%98%D0%BA%D0%BE-%D0%9D%D0%B5-%D0%9F%D0%BB%D0%B0%D1%88%D0%B8-%D0%A1%D0%B5-%D0%A0%D0%B0%D1%82%D0%B0/release/7347664)
  - [tracklist with credits but no times; album credits](https://www.discogs.com/Various-4-Uspjeha-Sa-Festivala-Sanremo-1960/release/3722600)
  - [tracklist with credits and times; album credits](https://www.discogs.com//Various-Radio-Utopia-4-Belgrade-Coffee-Shop/release/142792)
  - [tracklist with no additional data; album credits](https://www.discogs.com//Radioaktivni-Radioaktivni/release/4009346)
  - [tracklist with no link to songs but with credits](https://www.discogs.com//To%C5%A1e-Proeski-Pratim-Te/release/5006269)
  - [additional separators in the tracklist](https://www.discogs.com/Dragi-Domi%C4%87-Boem-Grada/release/13675687)
- Artists
  - [artist with aliases, sites and metrics](https://www.discogs.com/artist/59792-Bob-Dylan)
  - [artist with no sites, some metrics](https://www.discogs.com/artist/2984842-Dragi-Domi%C4%87)
- Songs
  - [album with no credits for song](https://www.discogs.com//Radioaktivni-Radioaktivni/release/4009346) → [song with credits](https://www.discogs.com/composition/2bf9dd96-7837-415f-9318-495cee3d9fe4-Ne-Pri%C4%8Daj)
  - [album with some credits for song](https://www.discogs.com//Various-Radio-Utopia-4-Belgrade-Coffee-Shop/release/142792) → [song with different credits](https://www.discogs.com/composition/0604c137-ac24-4f15-9d4b-61b551912a93-Svitac)
  - [album with credits for songs](https://www.discogs.com/To%C5%A1e-Proeski-Secret-Place/release/12749465) → [song with no credits](https://www.discogs.com/composition/c61866a0-b099-4d84-9a78-399a68ef4844-Light-The-Flame)
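
The album examples above differ mainly in which tracklist fields are present (times, credits, separator rows). The sketch below shows one tolerant way to handle that: every track becomes a record with the same keys, and missing fields become `None`. It is an illustration rather than the project's scraper. It uses the standard library's `html.parser` so that it runs without extra packages (the project itself lists beautifulsoup4), and the sample markup and the `title`, `duration`, and `credits` class names are made up for this example; they are not Discogs's real markup.

```python
from html.parser import HTMLParser


class TracklistParser(HTMLParser):
    """Collect one record per table row, tolerating missing cells."""

    # Hypothetical CSS class names, assumed for this example only.
    FIELDS = {"title", "duration", "credits"}

    def __init__(self):
        super().__init__()
        self.tracks = []
        self._row = None    # cells of the <tr> being read, or None
        self._field = None  # field of the <td> being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = {}
        elif tag == "td" and self._row is not None:
            css_class = dict(attrs).get("class")
            self._field = css_class if css_class in self.FIELDS else None

    def handle_data(self, data):
        text = data.strip()
        if text and self._row is not None and self._field is not None:
            previous = self._row.get(self._field, "")
            self._row[self._field] = (previous + " " + text).strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None
        elif tag == "tr" and self._row is not None:
            # Rows without a title cell (e.g. separators) are skipped.
            if "title" in self._row:
                self.tracks.append({
                    "title": self._row["title"],
                    "duration": self._row.get("duration"),
                    "credits": self._row.get("credits"),
                })
            self._row = None


def parse_tracklist(html):
    """Return a list of uniform track records parsed from the given HTML."""
    parser = TracklistParser()
    parser.feed(html)
    parser.close()
    return parser.tracks


# Made-up sample covering the cases listed above.
SAMPLE = """
<table>
  <tr><td class="title">Song A</td><td class="duration">3:45</td></tr>
  <tr><td class="heading">Side B</td></tr>
  <tr><td class="title">Song B</td><td class="credits">Written by X</td></tr>
  <tr><td class="title">Song C</td></tr>
</table>
"""
tracks = parse_tracklist(SAMPLE)
```

Producing the same keys for every track keeps the later pre-processing and database steps free of per-page special cases.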
[psz]: http://rti.etf.bg.ac.rs/rti/ms1psz/
[school]: https://www.etf.bg.ac.rs/
[uni]: https://www.bg.ac.rs/
[discogs]: https://www.discogs.com/
[statement]: docs/PSZ_Projekat_2019_v1.0.pdf
[python3]: https://www.python.org/downloads/