https://github.com/alberto-abarzua/cc7220-project

Convert a csv based dataset to rdf
apache-jena-fuseki rdf sparql sparql-query tarql

# CC7220-project

## Data used

- [Trending YouTube Video Statistics](https://www.kaggle.com/datasets/datasnaek/youtube-new). The data should be placed in a folder named `raw_dataset`.

## Project structure

- `pre_processing.py`: Script that processes the raw dataset downloaded from Kaggle. Running it creates a directory named `clean_dataset` containing the cleaned `.csv` files. The script takes two arguments: `frac`, the fraction of the raw data to keep, and `threshold`, the minimum number of occurrences a tag needs in order to be kept (tags that appear fewer than `threshold` times are removed).

- `create_rdf.py`: Reads the cleaned data and uses Tarql to convert the `.csv` files into RDF triples. The SPARQL `CONSTRUCT` queries it uses are in the `sparql` folder. Running this script writes `.ttl` files to the `rdf_dataset` directory; these files can then be loaded into a SPARQL endpoint to run queries.
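
As a rough illustration of what `frac` and `threshold` control, the sketch below samples a fraction of the rows and drops tags that occur fewer than `threshold` times. The row layout (a dict with a `tags` list), the function names, and the fixed seed are assumptions made for this example; the actual `pre_processing.py` may work differently.

```python
import random
from collections import Counter


def sample_rows(rows, frac, seed=0):
    """Return a random sample containing roughly `frac` of the rows."""
    k = int(len(rows) * frac)
    return random.Random(seed).sample(rows, k)


def drop_rare_tags(rows, threshold):
    """Remove tags that appear fewer than `threshold` times across all rows."""
    counts = Counter(tag for row in rows for tag in row["tags"])
    return [
        {**row, "tags": [t for t in row["tags"] if counts[t] >= threshold]}
        for row in rows
    ]


# Hypothetical rows; the real dataset has many more columns.
rows = [
    {"video_id": "a", "tags": ["funny", "cats"]},
    {"video_id": "b", "tags": ["funny", "news"]},
    {"video_id": "c", "tags": ["funny"]},
    {"video_id": "d", "tags": ["cats"]},
]

cleaned = drop_rare_tags(rows, threshold=2)  # "news" occurs once, so it is dropped
half = sample_rows(rows, frac=0.5)           # two of the four rows
```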

## Prefixes

```sql
PREFIX tag:
PREFIX rdf:
PREFIX rdfs:
PREFIX ex:
PREFIX ct:
PREFIX tg:
PREFIX cat:
PREFIX ch:
```

## Queries
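
One way to run the queries in this section is to send them to the endpoint over HTTP, following the SPARQL 1.1 Protocol. The sketch below only builds the request. The endpoint URL (a local Fuseki server with a dataset named `youtube`) and the helper name are assumptions, not values taken from this project.

```python
import urllib.parse
import urllib.request

# Assumed endpoint: a local Fuseki server with a dataset named "youtube".
ENDPOINT = "http://localhost:3030/youtube/sparql"


def build_sparql_request(query, endpoint=ENDPOINT):
    """Build a POST request for a SPARQL SELECT query with JSON results."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+json",
        },
        method="POST",
    )


request = build_sparql_request("SELECT * WHERE { ?s ?p ?o } LIMIT 5")
# To execute it against a running endpoint:
#   import json
#   with urllib.request.urlopen(request) as response:
#       bindings = json.load(response)["results"]["bindings"]
```

When sending the queries below this way, the `PREFIX` declarations need to be prepended to the query text.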

## 1. Get the average views by country:

```SPARQL
SELECT ?country (AVG(?views) as ?avg) WHERE {
?country a ex:Country .
?video a ex:Video .
?video ex:country ?country .
?video ex:views ?views
} GROUP BY(?country)
ORDER BY DESC(?avg)
```

### Results:

| | country | avg |
| ---- | -------------------------- | ------------------------------------------------------------ |
| 1 | | "6105793.23950035224624722267382"^^ |
| 2 | | "2394536.7338001725275282894403"^^ |
| 3 | | "1177444.463009143807148794679967"^^ |
| 4 | | "1081487.720469266909454838619999"^^ |
| 5 | | "622963.930254701261604379909545"^^ |
| 6 | | "478380.517857142857142857142857"^^ |
| 7 | | "453443.666403162055335968379447"^^ |
| 8 | | "390053.163736800341817737899042"^^ |
| 9 | | "272906.586734399353460716965884"^^ |
| 10 | | "249568.381662358862739763297511"^^ |

## 2. Highest and lowest viewcount for each channel

```SPARQL
SELECT ?channel (MIN(?views) as ?min_video) (MAX(?views) as ?max_video) WHERE{
?channel a ex:Channel .
?video ex:channel_title ?channel .
?video ex:views ?views .
?video a ex:Video .
} GROUP BY ?channel HAVING(COUNT(*)>10)
ORDER BY DESC(?min_video)
LIMIT 10
```

### Results:

| | channel | min_video | max_video |
| ---- | -------------------------------------------- | ------------------------------------------------------ | ------------------------------------------------------- |
| 1 | | "24412837"^^ | "35463645"^^ |
| 2 | | "20921796"^^ | "62338362"^^ |
| 3 | | "19878085"^^ | "54087829"^^ |
| 4 | | "18825555"^^ | "49178073"^^ |
| 5 | | "16991901"^^ | "337621571"^^ |
| 6 | | "12672730"^^ | "19063679"^^ |
| 7 | | "12552714"^^ | "19234147"^^ |
| 8 | | "12305311"^^ | "31371692"^^ |
| 9 | | "12229386"^^ | "12229386"^^ |
| 10 | | "12118553"^^ | "12118553"^^ |

## 3. Get the most common tag for every category.

```sql
SELECT ?category ?tag ?cnt WHERE {
  {
    SELECT DISTINCT ?category (MAX(?cnt) as ?MaxCount) WHERE {
      {
        SELECT DISTINCT ?category ?tag (COUNT(?tag) as ?cnt) WHERE {
          ?category a ex:Category .
          ?video ex:category ?category .
          ?video ex:hasTag ?tag
        }
        GROUP BY ?category ?tag
        ORDER BY DESC(?cnt)
      }
    }
    GROUP BY ?category
    ORDER BY DESC(?MaxCount)
  }

  {
    SELECT DISTINCT ?category ?tag (COUNT(?tag) as ?cnt) WHERE {
      ?category a ex:Category .
      ?video ex:category ?category .
      ?video ex:hasTag ?tag
    }
    GROUP BY ?category ?tag
    ORDER BY DESC(?cnt)
  }

  FILTER(?cnt = ?MaxCount)
}
```
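
The nested subqueries above amount to three steps: count each (category, tag) pair, take the maximum count per category, then keep the pairs that reach that maximum. A minimal Python sketch of the same logic, using invented in-memory pairs rather than the project's data:

```python
from collections import Counter

# Hypothetical (category, tag) pairs, one per ex:hasTag triple of a video.
pairs = [
    ("Music", "pop"), ("Music", "pop"), ("Music", "live"),
    ("Comedy", "funny"), ("Comedy", "prank"),
]

# Inner subquery: COUNT(?tag) grouped by ?category and ?tag.
counts = Counter(pairs)

# Middle subquery: MAX(?cnt) grouped by ?category.
max_count = {}
for (category, _tag), cnt in counts.items():
    max_count[category] = max(max_count.get(category, 0), cnt)

# Outer FILTER(?cnt = ?MaxCount): ties yield several rows per category.
most_common = sorted(
    (category, tag, cnt)
    for (category, tag), cnt in counts.items()
    if cnt == max_count[category]
)
```

Note that a category whose top tags are tied appears more than once in the output, which is also how the SPARQL query behaves.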

### Results:

| | category | tag | cnt |
| ---- | ---------------------------------------------- | ------------------------------------------------------------ | -------------------------------------------------- |
| 1 | | | "6222"^^ |
| 2 | | | "4832"^^ |
| 3 | | | "3901"^^ |
| 4 | | | "3175"^^ |
| 5 | | | "2180"^^ |
| 6 | | | "1844"^^ |
| 7 | | | "1298"^^ |
| 8 | | | "1260"^^ |
| 9 | | | "1073"^^ |
| 10 | | | "973"^^ |
| 11 | | | "924"^^ |
| 12 | | | "634"^^ |
| 13 | | <http://ex.org/tag/авто> | "373"^^ |
| 15 | | | "270"^^ |
| 19 | | | "246"^^ |
| 20 | | <http://ex.org/tag/путин> | "224"^^ |
| 21 | | | "25"^^ |
| 22 | | | "9"^^ |

## 4. Find the categories, and the channels with at least 15 videos, whose videos become trending fastest (time_to_trending = trending_date - publish_date).

```sql
## Query for categories
SELECT ?category ?time_to_trending_average_hours WHERE{
{SELECT ?category (AVG(?time_in_seconds) as ?average_time_to_trending_seconds) WHERE {

?video a ex:Video .
?video ex:category ?category .
?video ex:title ?title .
?video ex:publish_timestamp ?publish_timestamp .
?video ex:trending_timestamp ?trending_timestamp
BIND(xsd:dateTime(?trending_timestamp) - xsd:dateTime(?publish_timestamp) AS ?time2trending)
BIND(day(?time2trending) AS ?days)
BIND(hours(?time2trending) AS ?hours)
BIND(minutes(?time2trending) AS ?minutes)
BIND(seconds(?time2trending) AS ?seconds)

BIND( (?days*86400 + ?hours*3600 + ?minutes*60 + ?seconds) AS ?time_in_seconds)

}GROUP BY ?category}

BIND(ceil(?average_time_to_trending_seconds/3600) AS ?time_to_trending_average_hours)

}ORDER BY ASC(?time_to_trending_average_hours)

## Query for channels with at least 15 videos
SELECT ?channel_title ?time_to_trending_average_hours WHERE{
{SELECT ?channel_title (AVG(?time_in_seconds) as ?average_time_to_trending_seconds) WHERE {

?video a ex:Video .
?video ex:channel_title ?channel_title .
?video ex:publish_timestamp ?publish_timestamp .
?video ex:trending_timestamp ?trending_timestamp
BIND(xsd:dateTime(?trending_timestamp) - xsd:dateTime(?publish_timestamp) AS ?time2trending)
BIND(year(?time2trending) AS ?years)
BIND(month(?time2trending) AS ?months)
BIND(day(?time2trending) AS ?days)
BIND(hours(?time2trending) AS ?hours)
BIND(minutes(?time2trending) AS ?minutes)
BIND(seconds(?time2trending) AS ?seconds)

BIND( (?days*86400 + ?hours*3600 + ?minutes*60 + ?seconds) AS ?time_in_seconds)

}GROUP BY ?channel_title HAVING(COUNT(*)>=15)}

BIND(ceil(?average_time_to_trending_seconds/3600) AS ?time_to_trending_average_hours)

}ORDER BY ASC(?time_to_trending_average_hours) LIMIT 10

```
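
The duration arithmetic in both queries converts each video's time to trending into seconds, averages the values per group, and rounds the average up to whole hours. A minimal Python sketch of that calculation follows; the function names and timestamps are invented for illustration.

```python
import math
from datetime import datetime


def seconds_to_trending(publish_timestamp, trending_timestamp):
    """Seconds between publication and trending, from ISO 8601 timestamps."""
    delta = datetime.fromisoformat(trending_timestamp) - datetime.fromisoformat(
        publish_timestamp
    )
    # timedelta stores days and leftover seconds separately, much like the
    # day/hours/minutes/seconds components recombined in the query.
    return delta.days * 86400 + delta.seconds


def average_hours(durations_in_seconds):
    """Average the durations and round up to whole hours: ceil(avg / 3600)."""
    average = sum(durations_in_seconds) / len(durations_in_seconds)
    return math.ceil(average / 3600)


# Hypothetical timestamps for two videos in one category.
durations = [
    seconds_to_trending("2017-11-13T17:13:01", "2017-11-14T00:00:00"),
    seconds_to_trending("2017-11-12T08:00:00", "2017-11-14T00:00:00"),
]
hours = average_hours(durations)
```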

### Results - Channels

| | channel_title | time_to_trending_average_hours |
| ---- | ------------------------------------------------------------ | ------------------------------------------------- |
| 1 | | "1.0"^^ |
| 2 | | "1.0"^^ |
| 3 | <http://ex.org/channel/TOQUE_Y_SAZÓN> | "2.0"^^ |
| 4 | <http://ex.org/channel/Kiwilimón> | "2.0"^^ |
| 5 | | "3.0"^^ |
| 6 | | "3.0"^^ |
| 7 | | "3.0"^^ |
| 8 | | "3.0"^^ |
| 9 | | "3.0"^^ |
| 10 | | "3.0"^^ |

### Results - Categories

| | category | time_to_trending_average_hours |
| ---- | ---------------------------------------------- | --------------------------------------------------- |
| 1 | | "38.0"^^ |
| 2 | | "39.0"^^ |
| 3 | | "47.0"^^ |
| 4 | | "73.0"^^ |
| 5 | | "90.0"^^ |
| 6 | | "101.0"^^ |
| 7 | | "119.0"^^ |
| 8 | | "120.0"^^ |
| 9 | | "121.0"^^ |
| 10 | | "139.0"^^ |
| 11 | | "156.0"^^ |
| 12 | | "160.0"^^ |
| 13 | | "169.0"^^ |
| 14 | | "199.0"^^ |
| 15 | | "251.0"^^ |
| 16 | | "293.0"^^ |
| 17 | | "390.0"^^ |
| 18 | | "431.0"^^ |

## 5. Find the best performing categories for a given tag, where performance = (likes + comments) / num_views (top 10).

```SPARQL
SELECT ?cat (MAX(?performance) as ?m_p) WHERE {
?tag a tg:Funny .
?video a ex:Video .
?video ex:hasTag ?tag .
?video ex:category ?cat .
?video ex:views ?views .
?video ex:likes ?likes .
?video ex:dislikes ?dislikes .
?video ex:comment_count ?comments .

BIND(((?likes + ?comments)/?views) as ?performance) .
}GROUP BY ?tag ?cat
ORDER BY DESC(?m_p)
LIMIT 10
```
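
The performance metric and the per-category maximum can be checked by hand on a small example. The sketch below mirrors the query's `BIND`, `GROUP BY`, and `ORDER BY ... LIMIT` steps in plain Python; the video records are hypothetical and stand in for videos that all carry the chosen tag.

```python
def performance(likes, comments, views):
    """Engagement ratio: (likes + comments) / views."""
    return (likes + comments) / views


# Hypothetical videos that all carry the tag of interest.
videos = [
    {"category": "Comedy", "likes": 90, "comments": 10, "views": 1000},
    {"category": "Comedy", "likes": 40, "comments": 10, "views": 1000},
    {"category": "Music", "likes": 150, "comments": 50, "views": 1000},
]

# MAX(?performance) grouped by category.
best = {}
for video in videos:
    score = performance(video["likes"], video["comments"], video["views"])
    best[video["category"]] = max(best.get(video["category"], 0.0), score)

# ORDER BY DESC(?m_p) LIMIT 10.
top = sorted(best.items(), key=lambda item: item[1], reverse=True)[:10]
```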

### Results

| | cat | performance |
| ---- | --------------------------------------------- | ------------------------------------------------------------ |
| 1 | | "0.039612354693054087808627"^^ |
| 2 | | "0.043338299457273597956795"^^ |
| 3 | | "0.111380816977599001317706"^^ |
| 4 | | "0.122075526700499472784748"^^ |
| 5 | | "0.12962962962962962962963"^^ |
| 6 | | "0.131175432700386489665602"^^ |
| 7 | | "0.155303641713952246737605"^^ |
| 8 | | "0.186055620838229533881708"^^ |
| 9 | | "0.198130841121495327102804"^^ |
| 10 | | "0.198834825763297011543856"^^ |