Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/mrpudn/animefreq
(mirror) An analysis of the frequency of words appearing in Japanese anime.
- Host: GitHub
- URL: https://github.com/mrpudn/animefreq
- Owner: mrpudn
- Created: 2021-09-01T20:39:48.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2022-08-24T07:18:52.000Z (over 2 years ago)
- Last Synced: 2024-11-05T09:48:40.779Z (about 2 months ago)
- Topics: anime, frequency, japanese, words
- Language: Python
- Homepage: https://gitlab.com/mrpudn/animefreq
- Size: 32.4 MB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# animefreq
An analysis of the frequency of words appearing in Japanese anime.
## Overview
Japanese subtitles were obtained from an export of [Kitsunekko] dated 4/7/2020.
Only `.srt` files were used in the analysis, and all non-anime titles (e.g.,
tokusatsu shows) were filtered out. These subtitles were run through [SudachiPy], a
Japanese morphological analyzer, to produce morphological information about each
subtitle. This morphological information is aggregated for each show and stored
in the files in the `shows/` directory. All of this morphological information is
further aggregated for all shows and stored in the `animefreq.csv` file.

Some additional filtering and processing was also performed. Some of the tokens
contained half-width kana characters - these were converted to their full-width
equivalents. Tokens containing characters other than kanji, kana, and a few
other characters were discarded.

Each show is assigned a UUID. Each file in the `shows/` directory is named using
this UUID. The Japanese and English names for each show were collected by hand
using [MyAnimeList]. All of this show information is stored in `shows.csv`.

## Field Information
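As a rough illustration of how the fields described in this section might be consumed, the sketch below ranks words by `Count`. It assumes that `animefreq.csv` is comma-delimited, UTF-8 encoded, and starts with a header row whose column names match the field names listed here; none of that is stated by this README. The `top_words` helper and the sample rows are hypothetical and exist only for this example.

```python
import csv
import io
from typing import List, TextIO, Tuple

# Invented sample rows for illustration only; not taken from the real dataset.
SAMPLE = (
    "Word,DictionaryForm,NormalizedForm,PartOfSpeech,Dependencies,Count,Files\n"
    "の,の,の,助詞,,5,2\n"
    "見た,見る,見る,動詞,,3,1\n"
    "アニメ,アニメ,アニメ,名詞,,4,2\n"
)


def top_words(csv_file: TextIO, limit: int = 10, min_files: int = 1) -> List[Tuple[str, int]]:
    """Return up to `limit` (Word, Count) pairs, most frequent first.

    Rows whose `Files` value is below `min_files` are skipped, which is one
    way to ignore words that only appear in a handful of subtitle files.
    """
    reader = csv.DictReader(csv_file)  # assumes a header row is present
    pairs = [
        (row["Word"], int(row["Count"]))
        for row in reader
        if int(row["Files"]) >= min_files
    ]
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return pairs[:limit]


if __name__ == "__main__":
    # With the real data, something like this would replace the sample:
    #   open("animefreq.csv", encoding="utf-8", newline="")
    for word, count in top_words(io.StringIO(SAMPLE), limit=2):
        print(word, count)
```

The same function should work for a per-show file in `shows/`, since those files share the `Word`, `Count`, and `Files` columns.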
### `animefreq.csv`
- `Word`: A Japanese word.
- `DictionaryForm`: The dictionary/root form of the word.
- `NormalizedForm`: The normalized root form of the word.
- `PartOfSpeech`: Part of speech information about the word.
- `Dependencies`: A space-separated list of smaller words that compose the word.
- `Count`: The number of times the word appeared.
- `Files`: The number of subtitle files in which the word appeared.

### `shows.csv`
- `UUID`: A UUID assigned to the show.
- `Japanese`: The Japanese name of the show (source: [MyAnimeList]).
- `English`: The English name of the show (source: [MyAnimeList]).
- `KitsunekkoID`: The ID (or name) of the show in [Kitsunekko].
- `MyAnimeListID`: The ID of the show in [MyAnimeList].

### `shows/<UUID>.csv`
- `Word`: A Japanese word.
- `DictionaryForm`: The dictionary/root form of the word.
- `NormalizedForm`: The normalized root form of the word.
- `PartOfSpeech`: Part of speech information about the word.
- `Count`: The number of times the word appeared.
- `Files`: The number of subtitle files in which the word appeared.

[Kitsunekko]: https://kitsunekko.net
[SudachiPy]: https://github.com/WorksApplications/SudachiPy
[MyAnimeList]: https://myanimelist.net/
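
The Overview mentions that half-width kana were converted to full-width and that tokens containing characters other than kanji, kana, and a few other characters were discarded. The sketch below shows one way this could be done and is not necessarily what the project does. It uses Unicode NFKC normalization, which folds half-width katakana into full-width forms but also rewrites other compatibility characters (for example, full-width Latin letters become ASCII). The allowed character set is an assumption: hiragana, katakana, the prolonged sound mark `ー`, CJK unified ideographs, and the iteration mark `々`. The name `normalize_token` is hypothetical.

```python
import re
import unicodedata
from typing import Optional

# Assumed whitelist (the README does not say which "other characters" were kept):
#   U+3041-U+3096  hiragana
#   U+30A1-U+30FA  katakana
#   U+30FC         prolonged sound mark (ー)
#   U+4E00-U+9FFF  CJK unified ideographs (kanji)
#   U+3005         iteration mark (々)
_ALLOWED = re.compile(r"[\u3041-\u3096\u30A1-\u30FA\u30FC\u4E00-\u9FFF\u3005]+")


def normalize_token(token: str) -> Optional[str]:
    """Return the token with half-width kana widened, or None if it should be discarded."""
    # NFKC maps half-width katakana (U+FF66-U+FF9F) to full-width katakana and
    # composes a kana with a following half-width voiced mark (ｶﾞ becomes ガ).
    normalized = unicodedata.normalize("NFKC", token)
    # Keep the token only if every character is in the assumed whitelist.
    return normalized if _ALLOWED.fullmatch(normalized) else None


if __name__ == "__main__":
    for raw in ["ｱﾆﾒ", "食べる", "OK", "ｶﾞ"]:
        print(repr(raw), "->", repr(normalize_token(raw)))
```

Returning `None` for rejected tokens keeps the conversion and the filter in a single pass, so a caller can drop discarded tokens with a simple `is not None` check.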