https://github.com/mrpudn/animefreq

(mirror) An analysis of the frequency of words appearing in Japanese anime.

anime frequency japanese words


Host: GitHub
URL: https://github.com/mrpudn/animefreq
Owner: mrpudn
Created: 2021-09-01T20:39:48.000Z (over 3 years ago)
Default Branch: main
Last Pushed: 2022-08-24T07:18:52.000Z (over 2 years ago)
Last Synced: 2024-11-05T09:48:40.779Z (3 months ago)
Topics: anime, frequency, japanese, words
Language: Python
Homepage: https://gitlab.com/mrpudn/animefreq
Size: 32.4 MB
Stars: 1
Watchers: 2
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# animefreq

An analysis of the frequency of words appearing in Japanese anime.

## Overview

Japanese subtitles were obtained from an export of [Kitsunekko] dated 4/7/2020.
Only `.srt` files were used in the analysis, and all non-anime titles (e.g.,
tokusatsu shows) were filtered out. The subtitles were run through [SudachiPy], a
Japanese morphological analyzer, to produce morphological information about each
subtitle. This information is aggregated per show and stored in the files in the
`shows/` directory, then further aggregated across all shows and stored in the
`animefreq.csv` file.
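
The cross-show aggregation step can be sketched as follows. This is a minimal illustration, not the project's actual script: it assumes that a word entry is identified by its `Word`, `DictionaryForm`, `NormalizedForm`, and `PartOfSpeech` values, and that `Count` and `Files` are simply summed across shows. The `aggregate` function and `KEY_FIELDS` name are hypothetical.

```python
from collections import defaultdict

# Assumption: these four columns together identify one word entry.
KEY_FIELDS = ("Word", "DictionaryForm", "NormalizedForm", "PartOfSpeech")


def aggregate(per_show_rows):
    """Sum Count and Files across shows for each distinct word entry.

    per_show_rows is an iterable of row lists, one list per show, where
    each row is a dict keyed by the column names described in this README.
    """
    totals = defaultdict(lambda: {"Count": 0, "Files": 0})
    for rows in per_show_rows:
        for row in rows:
            key = tuple(row[field] for field in KEY_FIELDS)
            totals[key]["Count"] += int(row["Count"])
            totals[key]["Files"] += int(row["Files"])
    return dict(totals)
```

Rows read with `csv.DictReader` from each per-show file could be passed in directly, since the function converts the counts from strings itself.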

Some additional filtering and processing was also performed. Some of the tokens
contained half-width kana characters; these were converted to their full-width
equivalents. Tokens containing characters other than kanji, kana, and a few
other characters were discarded.
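
One way to reproduce this kind of cleanup is sketched below, using only the Python standard library. It is an approximation under stated assumptions, not the project's actual code: Unicode NFKC normalization stands in for the half-width conversion (it also rewrites other compatibility characters, such as full-width Latin letters), and the allow-list of character ranges is a guess, since the "few other characters" are not listed here. The `clean_token` name is hypothetical.

```python
import re
import unicodedata

# Assumed allow-list: the Hiragana block, the Katakana block (which includes
# the long-vowel mark), the CJK Unified Ideographs block, and the kanji
# iteration mark (U+3005). The project's real allow-list may differ.
ALLOWED = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF\u3005]+")


def clean_token(token):
    """Return the normalized token, or None if it should be discarded."""
    # NFKC maps half-width katakana to full-width and composes voiced marks.
    normalized = unicodedata.normalize("NFKC", token)
    return normalized if ALLOWED.fullmatch(normalized) else None
```

With this sketch, a half-width token such as `ｱﾆﾒ` is returned as `アニメ`, while a token made of Latin letters is discarded.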

Each show is assigned a UUID. Each file in the `shows/` directory is named using
this UUID. The Japanese and English names for each show were collected by hand
using [MyAnimeList]. All of this show information is stored in `shows.csv`.

## Field Information

### `animefreq.csv`

- `Word`: A Japanese word.
- `DictionaryForm`: The dictionary/root form of the word.
- `NormalizedForm`: The normalized root form of the word.
- `PartOfSpeech`: Part of speech information about the word.
- `Dependencies`: A space-separated list of smaller words that compose the word.
- `Count`: The number of times the word appeared.
- `Files`: The number of subtitle files in which the word appeared.
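
As a rough illustration of how these fields might be consumed, the sketch below lists the most frequent entries. It assumes the file is comma-delimited with a header row that uses the field names above; the `top_words` function is hypothetical and the sample rows are made up for the example, not taken from the data set.

```python
import csv
import io

# Made-up sample rows in the assumed layout (header row plus comma-delimited data).
SAMPLE = """Word,DictionaryForm,NormalizedForm,PartOfSpeech,Dependencies,Count,Files
食べた,食べる,食べる,動詞,食べ た,12,3
猫,猫,猫,名詞,,40,7
"""


def top_words(csv_file, n=10):
    """Return (word, count, dependencies) for the n most frequent rows."""
    rows = list(csv.DictReader(csv_file))
    rows.sort(key=lambda row: int(row["Count"]), reverse=True)
    # Dependencies is space-separated; an empty field yields an empty list.
    return [
        (row["Word"], int(row["Count"]), row["Dependencies"].split())
        for row in rows[:n]
    ]


if __name__ == "__main__":
    for word, count, parts in top_words(io.StringIO(SAMPLE), n=2):
        print(word, count, parts)
```

For the real file, an open handle could be passed instead of the sample, for example one returned by `open("animefreq.csv", encoding="utf-8", newline="")`.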

### `shows.csv`

- `UUID`: A UUID assigned to the show.
- `Japanese`: The Japanese name of the show (source: [MyAnimeList]).
- `English`: The English name of the show (source: [MyAnimeList]).
- `KitsunekkoID`: The ID (or name) of the show in [Kitsunekko].
- `MyAnimeListID`: The ID of the show in [MyAnimeList].

### `shows/<UUID>.csv`

- `Word`: A Japanese word.
- `DictionaryForm`: The dictionary/root form of the word.
- `NormalizedForm`: The normalized root form of the word.
- `PartOfSpeech`: Part of speech information about the word.
- `Count`: The number of times the word appeared.
- `Files`: The number of subtitle files in which the word appeared.
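
Because each per-show file is named after the show's UUID, `shows.csv` can act as an index into the `shows/` directory. The sketch below looks up a show by its English name and builds the matching path. It assumes a comma-delimited `shows.csv` with a header row and a `.csv` extension on the per-show files; the `show_path` function, the sample show, and its UUID are placeholders invented for the example.

```python
import csv
import io
from pathlib import Path

# Placeholder index content; the show name, IDs, and UUID are not real entries.
SAMPLE_INDEX = """UUID,Japanese,English,KitsunekkoID,MyAnimeListID
00000000-0000-0000-0000-000000000000,サンプル,Example Show,example-show,0
"""


def show_path(shows_csv, english_name, shows_dir="shows"):
    """Return the per-show CSV path for a show, or None if it is not listed."""
    for row in csv.DictReader(shows_csv):
        if row["English"] == english_name:
            return Path(shows_dir) / f"{row['UUID']}.csv"
    return None


if __name__ == "__main__":
    print(show_path(io.StringIO(SAMPLE_INDEX), "Example Show"))
```

The returned path can then be opened and read with `csv.DictReader` in the same way as `animefreq.csv`, keeping in mind that the per-show files have no `Dependencies` column.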

[Kitsunekko]: https://kitsunekko.net
[SudachiPy]: https://github.com/WorksApplications/SudachiPy
[MyAnimeList]: https://myanimelist.net/