https://github.com/natlee/myanimelist-comment-crawler

Crawl all reviews and information of Anime works on MyAnimeList. ;)

Host: GitHub
URL: https://github.com/natlee/myanimelist-comment-crawler
Owner: NatLee
License: mit
Created: 2019-08-03T19:51:46.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2023-06-21T02:23:46.000Z (over 1 year ago)
Last Synced: 2024-11-21T03:39:41.642Z (2 months ago)
Topics: anime, crawler, data-analysis, data-mining, data-science, kaggle, kaggle-dataset, myanimelist, python, requests, scrapy-crawler, sqlite
Language: Python
Homepage:
Size: 50.8 KB
Stars: 4
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# MyAnimeList Crawler
This crawler uses `Scrapy` to collect the information and reviews of all anime works on `MyAnimeList`.

## Setup

1. Check your settings in `./setting.ini`.
2. Install the required dependencies.

```bash
pip install -r requirements.txt
```

## Usage

### Crawl Anime Information

```bash
scrapy runspider info_spider.py --nolog
```

- Example of crawled data
```json
{
"workId": "11979",
"url": "https://myanimelist.net/anime/11979/Mahou_Shoujo_Madoka%E2%98%85Magica_Movie_2__Eien_no_Monogatari",
"jpName": "劇場版魔法少女まどか☆マギカ永遠の物語",
"engName": "Puella Magi Madoka Magica the Movie Part 2: Eternal",
"synonymsName": "Mahou Shoujo Madoka Magika Movie 2, Magical Girl Madoka Magica Movie 2",
"workType": "Movie",
"episodes": "1",
"status": "Finished Airing",
"aired": "Oct 13, 2012",
"premiered": "",
"producer": "Aniplex, Mainichi Broadcasting System, Movic, Nitroplus, Houbunsha",
"broadcast": "",
"licensors": "Aniplex of America",
"studios": "Shaft",
"genres": "Drama",
"source": "Original",
"duration": "1 hr. 49 min.",
"rating": "PG-13 - Teens 13 or older",
"score": "8.37",
"allRank": "#197",
"popularityRank": "#1132",
"members": "195,001",
"favorites": "1,026",
"scoredByUser": "97097",
"lastUpdate": "2023-05-31 13:39:02"
}
```
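
Every value in the record above is stored as a string, including counts with thousands separators (`"195,001"`) and ranks with a leading `#` (`"#197"`). A minimal sketch of converting such a record to typed values before analysis is shown below. The helper names `to_int` and `normalize` are hypothetical and not part of this repository, and only a subset of the fields is used.

```python
import json
from datetime import datetime

# A few fields copied from the example record above.
raw = '''{
  "workId": "11979",
  "episodes": "1",
  "score": "8.37",
  "allRank": "#197",
  "popularityRank": "#1132",
  "members": "195,001",
  "favorites": "1,026",
  "scoredByUser": "97097",
  "lastUpdate": "2023-05-31 13:39:02"
}'''


def to_int(text):
    """Convert strings such as "195,001" or "#197" to int.

    Returns None for empty or non-numeric values.
    """
    cleaned = text.replace(",", "").lstrip("#").strip()
    return int(cleaned) if cleaned.isdigit() else None


def normalize(record):
    """Return a copy of the record with numeric and date fields typed."""
    typed = dict(record)
    for key in ("workId", "episodes", "allRank", "popularityRank",
                "members", "favorites", "scoredByUser"):
        typed[key] = to_int(record[key])
    typed["score"] = float(record["score"]) if record["score"] else None
    typed["lastUpdate"] = datetime.strptime(
        record["lastUpdate"], "%Y-%m-%d %H:%M:%S")
    return typed


anime = normalize(json.loads(raw))
print(anime["members"], anime["allRank"], anime["score"])  # 195001 197 8.37
```

Empty fields such as `"premiered": ""` are left as strings here; whether to map them to `None` depends on the analysis.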

### Crawl Reviews

> Crawl the anime information first, because the review spider needs the list of anime works from `myanimelist`.

```bash
scrapy runspider review_spider.py --nolog
```

## Link

The datasets are published on Kaggle.

- Version 1 (2006/11 to 2019/06)

[MALCoD](https://www.kaggle.com/natlee/myanimelist-comment-dataset)

Contains 130K comments.

After many years, the site was updated, so I refactored this code to fit the new version of MyAnimeList.

- Version 2 (2006/11 to 2023/06)

[MALCoDv2](https://www.kaggle.com/natlee/myanimelist-comment-dataset-v2)

Contains 220K comments.

You can obtain the data from your SQLite database with the following command.
```bash
python db_to_kaggle.py
```
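
The contents of `db_to_kaggle.py` are not shown here. As a rough illustration of the general idea only, dumping a SQLite table to CSV with the standard library could look like the sketch below. The function `export_table`, the table name `anime`, and its columns are assumptions made for the example, not the script's actual behavior or the real schema.

```python
import csv
import io
import sqlite3


def export_table(conn, table, out):
    """Write every row of `table` to `out` as CSV, with a header row."""
    # Table names cannot be bound as SQL parameters, so check the name
    # against the tables that actually exist before building the query.
    known = {row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")}
    if table not in known:
        raise ValueError(f"unknown table: {table}")
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cursor.description])
    writer.writerows(cursor)
    return out


# Demo on an in-memory database with a hypothetical `anime` table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE anime (workId TEXT, engName TEXT, score TEXT)")
conn.execute(
    "INSERT INTO anime VALUES (?, ?, ?)",
    ("11979", "Puella Magi Madoka Magica the Movie Part 2: Eternal", "8.37"),
)
buffer = io.StringIO()
export_table(conn, "anime", buffer)
print(buffer.getvalue())
```

To write a file instead of a string buffer, pass a handle opened with `open(path, "w", newline="", encoding="utf-8")`.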

## Misc

I recommend using [VisiData](https://www.visidata.org/) to see the SQLite database and the CSV files.

![](https://d33wubrfki0l68.cloudfront.net/a2039fda848c76b90ee0270854cd417a82bbd60e/0b350/img/woq9dm5llq-590.webp)

It lets you inspect structured data directly in the terminal.

On Debian or Ubuntu, use `sudo apt install visidata` to get the package.

In our case, you can use `vd anime.db` to open the SQLite database.
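
If VisiData is not available, a quick look at the database is also possible from Python. The sketch below lists each table with its row count. The helper `summarize` and the demo tables `anime` and `review` are hypothetical; the real table names in `anime.db` may differ.

```python
import sqlite3


def summarize(conn):
    """Return {table_name: row_count} for every table in the database."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    return {
        name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in names
    }


# For the real database, connect with: sqlite3.connect("anime.db")
# The in-memory tables below are stand-ins so the example runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE anime (workId TEXT)")
conn.execute("CREATE TABLE review (workId TEXT, body TEXT)")
conn.executemany("INSERT INTO anime VALUES (?)", [("11979",), ("1",)])

summary = summarize(conn)
print(summary)  # {'anime': 2, 'review': 0}
```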

## Contributor

Nat Lee

## LICENSE

[MIT](LICENSE)