https://github.com/alberto-abarzua/cc7220-project
Convert a csv based dataset to rdf
- Host: GitHub
- URL: https://github.com/alberto-abarzua/cc7220-project
- Owner: alberto-abarzua
- Created: 2022-11-13T16:46:44.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2022-11-16T21:00:15.000Z (about 2 years ago)
- Last Synced: 2024-11-15T19:40:30.329Z (3 months ago)
- Topics: apache-jena-fuseki, rdf, sparql, sparql-query, tarql
- Language: Jupyter Notebook
- Homepage:
- Size: 71.3 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# CC7220-project
## Data used
- [Trending YouTube Video Statistics](https://www.kaggle.com/datasets/datasnaek/youtube-new). Place the downloaded data in a folder named `raw_dataset`.
## Project structure
- `pre_processing.py`: Processes the raw dataset downloaded from Kaggle. Running this script creates a directory named `clean_dataset` containing the cleaned `.csv` files. The script takes two arguments: `frac`, the fraction of the raw data to keep, and `threshold`, which removes tags that appear fewer than `threshold` times.
- `create_rdf.py`: Reads the cleaned data and uses Tarql to convert the `.csv` files into RDF triples. The SPARQL `CONSTRUCT` queries live in the `sparql` folder. Running this script writes `.ttl` files to the `rdf_dataset` directory; these files can then be loaded into a SPARQL endpoint to run queries.
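
As a rough illustration of what the `frac` and `threshold` arguments control, here is a minimal standard-library sketch. It is not the repository's `pre_processing.py`: the function names, the `tags` column name, and the `|` tag separator are assumptions made for the example.

```python
import random
from collections import Counter


def sample_rows(rows, frac, seed=0):
    """Keep a random fraction `frac` (0 < frac <= 1) of the rows."""
    rng = random.Random(seed)
    k = min(len(rows), max(1, round(len(rows) * frac)))
    return rng.sample(rows, k)


def split_tags(raw):
    """Split a raw tag string into clean tags (the '|' separator is assumed)."""
    return [t.strip().strip('"').lower() for t in raw.split("|") if t.strip()]


def drop_rare_tags(rows, threshold):
    """Remove tags that appear fewer than `threshold` times across all rows."""
    counts = Counter(tag for row in rows for tag in split_tags(row["tags"]))
    cleaned = []
    for row in rows:
        kept = [t for t in split_tags(row["tags"]) if counts[t] >= threshold]
        cleaned.append({**row, "tags": "|".join(kept)})
    return cleaned


# Made-up rows, only to show the effect of `threshold`.
rows = [
    {"video_id": "a", "tags": 'funny|"music"'},
    {"video_id": "b", "tags": "funny|news"},
    {"video_id": "c", "tags": "music|funny"},
]
cleaned = drop_rare_tags(rows, threshold=2)
```

With `threshold=2`, the tag `news` (seen once) is dropped while `funny` and `music` survive.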
## Prefixes
```sql
PREFIX tag:
PREFIX rdf:
PREFIX rdfs:
PREFIX ex:
PREFIX ct:
PREFIX tg:
PREFIX cat:
PREFIX ch:
```
## Queries
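
Once the `.ttl` files are loaded, the queries in this section can be sent to the endpoint over the standard SPARQL protocol. The sketch below uses only the Python standard library. The endpoint URL (a local Fuseki dataset named `videos`) and the helper names are assumptions, not part of the repository.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint: a local Fuseki server with a dataset called "videos".
ENDPOINT = "http://localhost:3030/videos/sparql"


def build_request(query, endpoint=ENDPOINT):
    """Build a SPARQL-protocol POST request that asks for JSON results."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    headers = {"Accept": "application/sparql-results+json"}
    return urllib.request.Request(endpoint, data=data, headers=headers)


def parse_bindings(payload):
    """Flatten a SPARQL JSON results document into a list of plain dicts."""
    doc = json.loads(payload)
    return [
        {var: cell["value"] for var, cell in row.items()}
        for row in doc["results"]["bindings"]
    ]


def run_query(query, endpoint=ENDPOINT):
    """Send the query and return its rows (needs a running endpoint)."""
    with urllib.request.urlopen(build_request(query, endpoint)) as resp:
        return parse_bindings(resp.read().decode("utf-8"))


# Offline check of the parser with a hand-written results document.
sample = (
    '{"head": {"vars": ["n"]}, '
    '"results": {"bindings": [{"n": {"type": "literal", "value": "3"}}]}}'
)
rows = parse_bindings(sample)
```

Each query must be prefixed with the `PREFIX` declarations from the previous section before it is passed to `run_query`.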
## 1. Get the average views by country:
```SPARQL
SELECT ?country (AVG(?views) as ?avg) WHERE {
?country a ex:Country .
?video a ex:Video .
?video ex:country ?country .
?video ex:views ?views
} GROUP BY(?country)
ORDER BY DESC(?avg)
```
### Results:
| | country | avg |
| ---- | ------- | --- |
| 1 | | 6105793.23950035224624722267382 |
| 2 | | 2394536.7338001725275282894403 |
| 3 | | 1177444.463009143807148794679967 |
| 4 | | 1081487.720469266909454838619999 |
| 5 | | 622963.930254701261604379909545 |
| 6 | | 478380.517857142857142857142857 |
| 7 | | 453443.666403162055335968379447 |
| 8 | | 390053.163736800341817737899042 |
| 9 | | 272906.586734399353460716965884 |
| 10 | | 249568.381662358862739763297511 |

## 2. Highest and lowest viewcount for each channel
```SPARQL
SELECT ?channel (MIN(?views) as ?min_video) (MAX(?views) as ?max_video) WHERE{
?channel a ex:Channel .
?video ex:channel_title ?channel .
?video ex:views ?views .
?video a ex:Video .
} GROUP BY ?channel HAVING(COUNT(*)>10)
ORDER BY DESC(?min_video)
LIMIT 10
```
### Results:
| | channel | min_video | max_video |
| ---- | ------- | --------- | --------- |
| 1 | | 24412837 | 35463645 |
| 2 | | 20921796 | 62338362 |
| 3 | | 19878085 | 54087829 |
| 4 | | 18825555 | 49178073 |
| 5 | | 16991901 | 337621571 |
| 6 | | 12672730 | 19063679 |
| 7 | | 12552714 | 19234147 |
| 8 | | 12305311 | 31371692 |
| 9 | | 12229386 | 12229386 |
| 10 | | 12118553 | 12118553 |

## 3. Get the most common tag for every category.
```sql
SELECT ?category ?tag ?cnt WHERE {
  # Highest tag count within each category
  {
    SELECT DISTINCT ?category (MAX(?cnt) as ?MaxCount) WHERE {
      {
        SELECT DISTINCT ?category ?tag (COUNT(?tag) as ?cnt) WHERE {
          ?category a ex:Category .
          ?video ex:category ?category .
          ?video ex:hasTag ?tag
        }
        GROUP BY ?category ?tag
        ORDER BY DESC(?cnt)
      }
    }
    GROUP BY ?category
    ORDER BY DESC(?MaxCount)
  }
  # Tag counts per category, kept only where they equal the maximum
  {
    SELECT DISTINCT ?category ?tag (COUNT(?tag) as ?cnt) WHERE {
      ?category a ex:Category .
      ?video ex:category ?category .
      ?video ex:hasTag ?tag
    }
    GROUP BY ?category ?tag
    ORDER BY DESC(?cnt)
  }
  FILTER(?cnt = ?MaxCount)
}
```
### Results:
| | category | tag | cnt |
| ---- | -------- | --- | --- |
| 1 | | | 6222 |
| 2 | | | 4832 |
| 3 | | | 3901 |
| 4 | | | 3175 |
| 5 | | | 2180 |
| 6 | | | 1844 |
| 7 | | | 1298 |
| 8 | | | 1260 |
| 9 | | | 1073 |
| 10 | | | 973 |
| 11 | | | 924 |
| 12 | | | 634 |
| 13 | | <http://ex.org/tag/%C3%90%C2%B0%C3%90%C2%B2%C3%91%C2%82%C3%90%C2%BE> | 373 |
| 15 | | | 270 |
| 19 | | | 246 |
| 20 | | <http://ex.org/tag/%C3%90%C2%BF%C3%91%C2%83%C3%91%C2%82%C3%90%C2%B8%C3%90%C2%BD> | 224 |
| 21 | | | 25 |
| 22 | | | 9 |

## 4. Find the categories, and the channels with at least 15 videos, whose videos become trending fastest (time_to_trending = trending_date - publish_date).
```sql
# Query for categories
SELECT ?category ?time_to_trending_average_hours WHERE {
  {
    SELECT ?category (AVG(?time_in_seconds) as ?average_time_to_trending_seconds) WHERE {
      ?video a ex:Video .
      ?video ex:category ?category .
      ?video ex:title ?title .
      ?video ex:publish_timestamp ?publish_timestamp .
      ?video ex:trending_timestamp ?trending_timestamp
      # Duration between publication and trending, split into fields
      BIND(xsd:dateTime(?trending_timestamp) - xsd:dateTime(?publish_timestamp) AS ?time2trending)
      BIND(day(?time2trending) AS ?days)
      BIND(hours(?time2trending) AS ?hours)
      BIND(minutes(?time2trending) AS ?minutes)
      BIND(seconds(?time2trending) AS ?seconds)
      BIND((?days*86400 + ?hours*3600 + ?minutes*60 + ?seconds) AS ?time_in_seconds)
    }
    GROUP BY ?category
  }
  # Convert the average to whole hours, rounding up
  BIND(ceil(?average_time_to_trending_seconds/3600) AS ?time_to_trending_average_hours)
}
ORDER BY ASC(?time_to_trending_average_hours)

# Query for channels
SELECT ?channel_title ?time_to_trending_average_hours WHERE {
  {
    SELECT ?channel_title (AVG(?time_in_seconds) as ?average_time_to_trending_seconds) WHERE {
      ?video a ex:Video .
      ?video ex:channel_title ?channel_title .
      ?video ex:publish_timestamp ?publish_timestamp .
      ?video ex:trending_timestamp ?trending_timestamp
      BIND(xsd:dateTime(?trending_timestamp) - xsd:dateTime(?publish_timestamp) AS ?time2trending)
      BIND(day(?time2trending) AS ?days)
      BIND(hours(?time2trending) AS ?hours)
      BIND(minutes(?time2trending) AS ?minutes)
      BIND(seconds(?time2trending) AS ?seconds)
      BIND((?days*86400 + ?hours*3600 + ?minutes*60 + ?seconds) AS ?time_in_seconds)
    }
    GROUP BY ?channel_title
    # COUNT(*) > 4 keeps channels with at least 5 rows; use COUNT(*) >= 15 to require at least 15 videos
    HAVING(COUNT(*) > 4)
  }
  BIND(ceil(?average_time_to_trending_seconds/3600) AS ?time_to_trending_average_hours)
}
ORDER BY ASC(?time_to_trending_average_hours)
LIMIT 10
```
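
The arithmetic in both queries can be checked outside the endpoint. This sketch reproduces the `time_in_seconds` and hourly ceiling steps in Python. The timestamps are made-up values, and the function names are illustrative rather than taken from the repository.

```python
import math
from datetime import datetime


def time_in_seconds(publish_timestamp, trending_timestamp):
    """Mirror the query: split the duration into fields, then recombine."""
    delta = datetime.fromisoformat(trending_timestamp) - datetime.fromisoformat(publish_timestamp)
    days = delta.days
    hours, rest = divmod(delta.seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return days * 86400 + hours * 3600 + minutes * 60 + seconds


def average_hours(durations_in_seconds):
    """Average the durations and round up to whole hours, like ceil(avg / 3600)."""
    avg = sum(durations_in_seconds) / len(durations_in_seconds)
    return math.ceil(avg / 3600)


# Made-up timestamps for two videos of one hypothetical category.
durations = [
    time_in_seconds("2017-11-13T17:00:00", "2017-11-14T05:30:00"),  # 12.5 hours
    time_in_seconds("2017-11-10T08:00:00", "2017-11-12T08:00:00"),  # 48 hours
]
hours = average_hours(durations)
```

The two durations average 30.25 hours, which the ceiling step reports as 31.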
### Results - Channels
| | channel_title | time_to_trending_average_hours |
| ---- | ------------- | ------------------------------ |
| 1 | | 1.0 |
| 2 | | 1.0 |
| 3 | <http://ex.org/channel/TOQUE_Y_SAZ%C3%83%C2%93N> | 2.0 |
| 4 | <http://ex.org/channel/Kiwilim%C3%83%C2%B3n> | 2.0 |
| 5 | | 3.0 |
| 6 | | 3.0 |
| 7 | | 3.0 |
| 8 | | 3.0 |
| 9 | | 3.0 |
| 10 | | 3.0 |

### Results - Categories
| | category | time_to_trending_average_hours |
| ---- | -------- | ------------------------------ |
| 1 | | 38.0 |
| 2 | | 39.0 |
| 3 | | 47.0 |
| 4 | | 73.0 |
| 5 | | 90.0 |
| 6 | | 101.0 |
| 7 | | 119.0 |
| 8 | | 120.0 |
| 9 | | 121.0 |
| 10 | | 139.0 |
| 11 | | 156.0 |
| 12 | | 160.0 |
| 13 | | 169.0 |
| 14 | | 199.0 |
| 15 | | 251.0 |
| 16 | | 293.0 |
| 17 | | 390.0 |
| 18 | | 431.0 |

## 5. Find the best performing ((likes + comments) / views) categories for a given tag (top 10)
```SPARQL
SELECT ?cat (MAX(?performance) as ?m_p) WHERE {
  ?tag a tg:Funny .
  ?video a ex:Video .
  ?video ex:hasTag ?tag .
  ?video ex:category ?cat .
  ?video ex:views ?views .
  ?video ex:likes ?likes .
  ?video ex:dislikes ?dislikes .
  ?video ex:comment_count ?comments .
  # Engagement ratio per video
  BIND(((?likes + ?comments)/?views) as ?performance)
}
GROUP BY ?tag ?cat
ORDER BY DESC(?m_p)
LIMIT 10
```
### Results
| | cat | performance |
| ---- | --- | ----------- |
| 1 | | 0.039612354693054087808627 |
| 2 | | 0.043338299457273597956795 |
| 3 | | 0.111380816977599001317706 |
| 4 | | 0.122075526700499472784748 |
| 5 | | 0.12962962962962962962963 |
| 6 | | 0.131175432700386489665602 |
| 7 | | 0.155303641713952246737605 |
| 8 | | 0.186055620838229533881708 |
| 9 | | 0.198130841121495327102804 |
| 10 | | 0.198834825763297011543856 |
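
For reference, the grouping in query 5 can be mirrored in a few lines of Python: compute `(likes + comments) / views` per video, keep the maximum per category, and sort in descending order. The records and category names below are invented for illustration only, and `best_categories` is a hypothetical helper.

```python
def best_categories(videos, tag, limit=10):
    """Return (category, max performance) pairs for videos carrying `tag`."""
    best = {}
    for v in videos:
        if tag not in v["tags"] or v["views"] == 0:
            continue  # skip other tags and avoid dividing by zero
        performance = (v["likes"] + v["comments"]) / v["views"]
        best[v["category"]] = max(performance, best.get(v["category"], 0.0))
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


# Invented sample records (not taken from the dataset).
videos = [
    {"category": "Comedy", "tags": ["funny"], "views": 1000, "likes": 100, "comments": 50},
    {"category": "Comedy", "tags": ["funny"], "views": 2000, "likes": 100, "comments": 100},
    {"category": "Music", "tags": ["funny", "live"], "views": 500, "likes": 20, "comments": 5},
    {"category": "News", "tags": ["politics"], "views": 800, "likes": 10, "comments": 10},
]
top = best_categories(videos, "funny")
```

Here `Comedy` ranks first because its best video reaches 0.15, ahead of `Music` at 0.05; `News` is excluded because none of its videos carry the tag.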