https://github.com/danielpancake/soviet-recipes-data-wrangling-and-visualization
Soviet recipes visualization project for Data Wrangling and Visualization course at Innopolis University
- Host: GitHub
- URL: https://github.com/danielpancake/soviet-recipes-data-wrangling-and-visualization
- Owner: danielpancake
- Archived: true
- Created: 2023-07-11T14:33:34.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2023-08-06T14:19:44.000Z (over 1 year ago)
- Last Synced: 2024-10-08T19:05:13.546Z (about 1 month ago)
- Language: Jupyter Notebook
- Homepage: https://danielpancake.github.io/soviet-recipes-data-wrangling-and-visualization/visualization
- Size: 1.64 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Soviet recipes data wrangling and visualization project
## Project Description
1. Data Scraping
1. Target websites
- [Main source for Soviet recipes](https://sov-obshchepit.ru/)
- Not used yet:
  - [Website "1000.menu"](https://1000.menu/catalog/recepty-sovetskix-vremen)
  - [Website "webspoon"](https://webspoon.ru/cuisine/kuhnja-sssr)
  - [Website "povarenok"](https://www.povarenok.ru/recipes/kitchen/101/?sort=date_create_asc&order=desc)
2. Raw data format
At first, I found it hard to implement a nested JSON structure, so I decided to use a flat structure for the raw data instead. The structure is as follows:
```json
{
"category": "category",
"subcategory": "subcategory",
"recipe_name": "recipe_name",
"ingredients": [
"ingredient_1", "ingredient_2", "ingredient_3"
]
}
```
Each ingredient string combines the ingredient's name, its quantity, and its unit of measurement.
To scrape raw data, run:
```bash
cd ./scrapping
scrapy crawl sov-obshchepit -O ../data/raw_data.json
```
3. Nested raw data format
Eventually, I figured out a way to implement a nested JSON structure. I store the scraped data in a nested structure (`nested_index` in `sov_obshchepit.py`) and write it to a JSON file when the spider is closed. The structure of this file is as follows:
```json
{
"category_name": {
"subcategory_name": {
"recipe_name": {
"ingredients": [
"ingredient_1", "ingredient_2", "ingredient_3"
]
}
}
}
}
```
As in the flat format, each ingredient string combines the ingredient's name, its quantity, and its unit of measurement.
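The relationship between the flat and nested layouts can be illustrated with a short Python sketch. This is not the spider's actual code (which builds `nested_index` while crawling); the function name `nest_records` and the sample records are hypothetical and only show how flat records map onto the nested shape.

```python
import json
from collections import defaultdict


def nest_records(flat_records):
    """Group flat recipe records by category, then subcategory, then recipe name."""
    nested = defaultdict(lambda: defaultdict(dict))
    for record in flat_records:
        category = record["category"]
        subcategory = record["subcategory"]
        nested[category][subcategory][record["recipe_name"]] = {
            "ingredients": list(record["ingredients"]),
        }
    # Convert the defaultdicts back to plain dicts before returning.
    return {cat: dict(subcats) for cat, subcats in nested.items()}


# Hypothetical sample records in the flat format (not taken from the real data).
flat = [
    {
        "category": "Soups",
        "subcategory": "Hot soups",
        "recipe_name": "Borscht",
        "ingredients": ["beetroot 100 g", "cabbage 80 g"],
    },
    {
        "category": "Soups",
        "subcategory": "Cold soups",
        "recipe_name": "Okroshka",
        "ingredients": ["kvass 300 ml"],
    },
]

nested = nest_records(flat)
print(json.dumps(nested, ensure_ascii=False, indent=2))
```

If two flat records share the same category, subcategory, and recipe name, the later one overwrites the earlier one in this sketch.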
To scrape raw data with nested structure, run:
```bash
cd ./scrapping
scrapy crawl sov-obshchepit -a nested_output=../data/raw_nested_data.json
```
4. Sorted and prettified raw data format
You might want to sort the raw data by category, subcategory, and recipe name. To do so, run:
```bash
cat ./data/raw_data.json | jq 'sort_by(.category, .subcategory, .recipe_name)' > ./data/raw_data_sorted.json
```
2. Data Wrangling
Part of the data cleaning happens during scraping: trailing whitespace, empty strings, and invisible characters are removed from the scraped data. The rest of the cleaning is done in the `data_wrangling.ipynb` notebook.
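A minimal sketch of this kind of scrape-time cleaning is shown below. It is an assumption about how such a step could look, not the project's actual code: the helper names `clean_text` and `clean_strings` are hypothetical, "invisible characters" is interpreted here as Unicode format characters (category `Cf`, such as zero-width spaces), and whitespace is trimmed on both sides rather than only at the end.

```python
import unicodedata


def clean_text(value):
    """Drop invisible format characters, then trim surrounding whitespace."""
    # Category "Cf" covers zero-width spaces, byte order marks, soft hyphens, etc.
    visible = "".join(ch for ch in value if unicodedata.category(ch) != "Cf")
    return visible.strip()


def clean_strings(values):
    """Clean every string and discard the ones that end up empty."""
    cleaned = (clean_text(v) for v in values)
    return [v for v in cleaned if v]


# Hypothetical scraped strings containing padding and invisible characters.
raw = ["  beetroot 100 g\u00a0", "\u200b", "", "cabbage\ufeff 80 g  "]
print(clean_strings(raw))  # ['beetroot 100 g', 'cabbage 80 g']
```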
1. Structured cleaned data
I used the Claude AI assistant to convert the raw ingredient strings into structured data. The resulting data is stored in `data/structured_data.json`. The structure of this file is as follows:
```json
{
"category_name": {
"subcategory_name": {
"recipe_name": {
"ingredients": [
"ingredient_1", "ingredient_2", "ingredient_3"
],
"parsed_ingredients": [
["ingredient_name", "quantity", "measure units"],
["ingredient_name", "quantity", "measure units"],
["ingredient_name", "quantity", "measure units"]
]
}
}
}
}
```
3. Data Visualization
The visualization consists of two major parts: **static** charts built with `plotly` (Python), and **dynamic** charts built with `d3.js` (CoffeeScript) and `plotly` (Python, exported to HTML+JS).
The `data_visualization.ipynb` notebook contains all the code for generating the SVG and HTML files used on the website.
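Charts that show the number of recipes per subcategory rest on simple counts over the nested data. The sketch below shows one way such counts could be derived; it is not taken from the notebook, and the function names `count_recipes` and `to_rows` as well as the sample data are hypothetical.

```python
def count_recipes(nested):
    """Return {category: {subcategory: number of recipes}} for a nested recipe index."""
    return {
        category: {sub: len(recipes) for sub, recipes in subcategories.items()}
        for category, subcategories in nested.items()
    }


def to_rows(counts):
    """Flatten the counts into (category, subcategory, count) rows for plotting."""
    return [
        (category, sub, n)
        for category, subcategories in counts.items()
        for sub, n in subcategories.items()
    ]


# Hypothetical sample in the nested format (not taken from the real data).
sample = {
    "Soups": {
        "Hot soups": {"Borscht": {"ingredients": []}, "Shchi": {"ingredients": []}},
        "Cold soups": {"Okroshka": {"ingredients": []}},
    },
    "Desserts": {
        "Cakes": {"Medovik": {"ingredients": []}},
    },
}

counts = count_recipes(sample)
rows = to_rows(counts)
print(rows)
```

Rows of this shape can be handed to a plotting library as a table with one column each for category, subcategory, and count.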
Three types of charts are used:
1. Bar charts. Each shows the number of recipes per subcategory in the specified category.
2. Sunburst chart. Similar to the bar charts, it shows the number of recipes per subcategory for every category.
3. Networks. For the specified category, each shows the connections between recipes and the ingredients they use.

## Misc
Visual inspiration: [everyday Soviet food](https://trip-for-the-soul.ru/foto/chto-gotovili-v-sssr-na-kazhdyj-den.html).
(I intended to use these in the final design but ultimately did not.)