https://github.com/alberto-abarzua/cc7220-project

Convert a csv based dataset to rdf
Host: GitHub
URL: https://github.com/alberto-abarzua/cc7220-project
Owner: alberto-abarzua
Created: 2022-11-13T16:46:44.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2022-11-16T21:00:15.000Z (over 2 years ago)
Last Synced: 2025-01-16T07:55:19.511Z (4 months ago)
Topics: apache-jena-fuseki, rdf, sparql, sparql-query, tarql
Language: Jupyter Notebook
Homepage:
Size: 71.3 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# CC7220-project

## Data used

- [Trending YouTube Video Statistics](https://www.kaggle.com/datasets/datasnaek/youtube-new). The data should be placed in a folder named `raw_dataset`.

## Project structure

- `pre_processing.py`: Processes the raw dataset downloaded from Kaggle. Running this script creates a directory named `clean_dataset`, which contains the cleaned `.csv` files. The script takes two arguments: `frac`, the fraction of the raw dataset to keep, and `threshold`, the minimum number of occurrences a tag needs in order to be kept (tags that appear fewer than `threshold` times are removed).

- `create_rdf.py`: Reads the cleaned data and uses Tarql to convert the `.csv` files into RDF triples. The SPARQL `CONSTRUCT` queries it uses are in the `sparql` folder. Running this script creates `.ttl` files in the `rdf_dataset` directory. These files can then be loaded into a SPARQL endpoint to run queries.
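
The two pre-processing arguments can be illustrated with a small standard-library sketch. This is not the project's code: the helper names `sample_rows` and `filter_rare_tags` are hypothetical, and the sketch assumes each row already carries its tags as a Python list. The two steps are shown independently, since the order in which the script applies them is not stated here.

```python
import random
from collections import Counter


def sample_rows(rows, frac, seed=0):
    """Keep a random fraction `frac` of the rows (hypothetical helper)."""
    k = round(len(rows) * frac)
    return random.Random(seed).sample(rows, k)


def filter_rare_tags(rows, threshold):
    """Drop tags that appear fewer than `threshold` times across all rows."""
    counts = Counter(tag for row in rows for tag in row["tags"])
    return [
        {**row, "tags": [t for t in row["tags"] if counts[t] >= threshold]}
        for row in rows
    ]


# Toy rows standing in for the raw dataset (made-up values).
rows = [
    {"video_id": "a", "tags": ["music", "live"]},
    {"video_id": "b", "tags": ["music", "comedy"]},
    {"video_id": "c", "tags": ["music", "live"]},
    {"video_id": "d", "tags": ["news"]},
]

sampled = sample_rows(rows, frac=0.5)        # keeps 2 of the 4 rows
cleaned = filter_rare_tags(rows, threshold=2)  # "comedy" and "news" are dropped
```

With `threshold=2`, only `music` (3 occurrences) and `live` (2 occurrences) survive, so a video whose tags are all rare ends up with an empty tag list.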

## Prefixes

```sql
PREFIX tag: 
PREFIX rdf: 
PREFIX rdfs: 
PREFIX ex: 
PREFIX ct: 
PREFIX tg: 
PREFIX cat: 
PREFIX ch: 
```

## Queries

## 1. Get the average views by country:

```SPARQL
SELECT ?country (AVG(?views) as ?avg) WHERE {
  ?country a ex:Country .
  ?video a ex:Video .
  ?video ex:country ?country .
  ?video ex:views ?views
} GROUP BY(?country)
ORDER BY DESC(?avg)
```
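A query like the one above can be sent to any endpoint that implements the SPARQL 1.1 Protocol, such as Fuseki. The sketch below is an illustration rather than part of the repository: the endpoint URL and dataset name are assumptions, and the helper names `build_sparql_request` and `run_select` are hypothetical. It uses only the Python standard library.

```python
import json
import urllib.parse
import urllib.request


def build_sparql_request(endpoint, query):
    """Build a SPARQL 1.1 Protocol POST request (query sent as a form field)."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+json",
        },
        method="POST",
    )


def run_select(endpoint, query):
    """Send the query and return the result bindings (needs a running endpoint)."""
    with urllib.request.urlopen(build_sparql_request(endpoint, query)) as resp:
        return json.load(resp)["results"]["bindings"]


# Assumed endpoint URL; adjust host, port and dataset name to your setup.
ENDPOINT = "http://localhost:3030/youtube/sparql"
QUERY = "SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"

# Building the request does not touch the network; run_select() does.
request = build_sparql_request(ENDPOINT, QUERY)
```

Calling `run_select(ENDPOINT, QUERY)` returns one dictionary per result row, keyed by variable name, once the `.ttl` files have been loaded into the endpoint.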

### Results:

|      | country                    | avg                              |
| ---- | -------------------------- | -------------------------------- |
| 1    |                            | 6105793.23950035224624722267382  |
| 2    |                            | 2394536.7338001725275282894403   |
| 3    |                            | 1177444.463009143807148794679967 |
| 4    |                            | 1081487.720469266909454838619999 |
| 5    |                            | 622963.930254701261604379909545  |
| 6    |                            | 478380.517857142857142857142857  |
| 7    |                            | 453443.666403162055335968379447  |
| 8    |                            | 390053.163736800341817737899042  |
| 9    |                            | 272906.586734399353460716965884  |
| 10   |                            | 249568.381662358862739763297511  |

## 2. Highest and lowest viewcount for each channel

```SPARQL
SELECT ?channel (MIN(?views) as ?min_video) (MAX(?views) as ?max_video) WHERE {
  ?channel a ex:Channel .
  ?video ex:channel_title ?channel .
  ?video ex:views ?views .
  ?video a ex:Video .
} GROUP BY ?channel HAVING(COUNT(*)>10)
ORDER BY DESC(?min_video)
LIMIT 10
```

### Results:

|      | channel | min_video | max_video |
| ---- | ------- | --------- | --------- |
| 1    |         | 24412837  | 35463645  |
| 2    |         | 20921796  | 62338362  |
| 3    |         | 19878085  | 54087829  |
| 4    |         | 18825555  | 49178073  |
| 5    |         | 16991901  | 337621571 |
| 6    |         | 12672730  | 19063679  |
| 7    |         | 12552714  | 19234147  |
| 8    |         | 12305311  | 31371692  |
| 9    |         | 12229386  | 12229386  |
| 10   |         | 12118553  | 12118553  |

## 3. Get the most common tag for every category.

```sql
SELECT ?category ?tag ?cnt WHERE {
    {SELECT DISTINCT ?category (MAX(?cnt) as ?MaxCount) WHERE {
      {SELECT DISTINCT ?category ?tag (COUNT(?tag) as ?cnt) WHERE {
        ?category a ex:Category .
        ?video ex:category ?category .
        ?video ex:hasTag ?tag
      }
      GROUP BY ?category ?tag
      ORDER BY DESC(?cnt)}
    } GROUP BY ?category
    ORDER BY DESC(?MaxCount)}

    {SELECT DISTINCT ?category ?tag (COUNT(?tag) as ?cnt) WHERE {
        ?category a ex:Category .
        ?video ex:category ?category .
        ?video ex:hasTag ?tag
      }
      GROUP BY ?category ?tag
      ORDER BY DESC(?cnt)}
    FILTER(?cnt = ?MaxCount)
}
```
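The query above works in three steps: count every (category, tag) pair, take the maximum count per category, then keep the pairs whose count equals that maximum. A rough Python equivalent on in-memory data may make the logic easier to follow. The function name `most_common_tags` and the sample pairs are made up for illustration.

```python
from collections import Counter, defaultdict


def most_common_tags(pairs):
    """Return {category: (tags_with_max_count, max_count)}; ties are all kept."""
    # Step 1: count each (category, tag) pair, like the inner GROUP BY.
    counts = defaultdict(Counter)
    for category, tag in pairs:
        counts[category][tag] += 1

    result = {}
    for category, tag_counts in counts.items():
        # Step 2: maximum count per category, like MAX(?cnt).
        max_count = max(tag_counts.values())
        # Step 3: keep every tag that reaches it, like FILTER(?cnt = ?MaxCount).
        winners = sorted(t for t, c in tag_counts.items() if c == max_count)
        result[category] = (winners, max_count)
    return result


# One (category, tag) pair per video/tag combination (made-up values).
pairs = [
    ("Music", "live"), ("Music", "live"), ("Music", "pop"),
    ("Comedy", "funny"), ("Comedy", "sketch"),
]
top = most_common_tags(pairs)
```

As in the SPARQL version, a category can return more than one tag when several tags share the maximum count.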

### Results:

|      | category | tag | cnt  |
| ---- | -------- | --- | ---- |
| 1    |          |     | 6222 |
| 2    |          |     | 4832 |
| 3    |          |     | 3901 |
| 4    |          |     | 3175 |
| 5    |          |     | 2180 |
| 6    |          |     | 1844 |
| 7    |          |     | 1298 |
| 8    |          |     | 1260 |
| 9    |          |     | 1073 |
| 10   |          |     | 973  |
| 11   |          |     | 924  |
| 12   |          |     | 634  |
| 13   |          | <http://ex.org/tag/%C3%90%C2%B0%C3%90%C2%B2%C3%91%C2%82%C3%90%C2%BE> | 373  |
| 15   |          |     | 270  |
| 19   |          |     | 246  |
| 20   |          | <http://ex.org/tag/%C3%90%C2%BF%C3%91%C2%83%C3%91%C2%82%C3%90%C2%B8%C3%90%C2%BD> | 224  |
| 21   |          |     | 25   |
| 22   |          |     | 9    |

## 4. Find the categories, and the channels with at least 15 videos, whose videos become trending fastest (time_to_trending = trending_date - publish_date).

```sql
## Query for categories
SELECT ?category ?time_to_trending_average_hours WHERE {
  {SELECT ?category (AVG(?time_in_seconds) as ?average_time_to_trending_seconds) WHERE {
    ?video a ex:Video .
    ?video ex:category ?category .
    ?video ex:title ?title .
    ?video ex:publish_timestamp ?publish_timestamp .
    ?video ex:trending_timestamp ?trending_timestamp
    BIND(xsd:dateTime(?trending_timestamp) - xsd:dateTime(?publish_timestamp) AS ?time2trending)
    BIND(day(?time2trending) AS ?days)
    BIND(hours(?time2trending) AS ?hours)
    BIND(minutes(?time2trending) AS ?minutes)
    BIND(seconds(?time2trending) AS ?seconds)

    BIND( (?days*86400 + ?hours*3600 + ?minutes*60 + ?seconds) AS ?time_in_seconds)
  } GROUP BY ?category}
  BIND(ceil(?average_time_to_trending_seconds/3600) AS ?time_to_trending_average_hours)
} ORDER BY ASC(?time_to_trending_average_hours)

## Query for channels with at least 15 videos
SELECT ?channel_title ?time_to_trending_average_hours WHERE {
  {SELECT ?channel_title (AVG(?time_in_seconds) as ?average_time_to_trending_seconds) WHERE {
    ?video a ex:Video .
    ?video ex:channel_title ?channel_title .
    ?video ex:publish_timestamp ?publish_timestamp .
    ?video ex:trending_timestamp ?trending_timestamp
    BIND(xsd:dateTime(?trending_timestamp) - xsd:dateTime(?publish_timestamp) AS ?time2trending)
    BIND(year(?time2trending) AS ?years)
    BIND(month(?time2trending) AS ?months)
    BIND(day(?time2trending) AS ?days)
    BIND(hours(?time2trending) AS ?hours)
    BIND(minutes(?time2trending) AS ?minutes)
    BIND(seconds(?time2trending) AS ?seconds)

    BIND( (?days*86400 + ?hours*3600 + ?minutes*60 + ?seconds) AS ?time_in_seconds)
  } GROUP BY ?channel_title HAVING(COUNT(*)>4)}
  BIND(ceil(?average_time_to_trending_seconds/3600) AS ?time_to_trending_average_hours)
} ORDER BY ASC(?time_to_trending_average_hours) LIMIT 10
```
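The arithmetic in these queries (split the duration into days, hours, minutes and seconds, convert to seconds, average, then round up to whole hours) can be checked with a short Python sketch. The timestamps below are made-up examples, and the function names `seconds_to_trending` and `average_hours` are hypothetical.

```python
import math
from datetime import datetime


def seconds_to_trending(publish, trending):
    """Rebuild the elapsed time from day/hour/minute/second components."""
    delta = datetime.fromisoformat(trending) - datetime.fromisoformat(publish)
    days = delta.days
    hours, rest = divmod(delta.seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    # Same formula as the BIND that computes ?time_in_seconds.
    return days * 86400 + hours * 3600 + minutes * 60 + seconds


def average_hours(pairs):
    """Average the durations in seconds, then round up to whole hours."""
    avg = sum(seconds_to_trending(p, t) for p, t in pairs) / len(pairs)
    return math.ceil(avg / 3600)


# (publish_timestamp, trending_timestamp) pairs, invented for the example.
videos = [
    ("2017-11-13T17:13:01", "2017-11-14T00:00:00"),  # 6 h 46 min 59 s
    ("2017-11-12T05:37:17", "2017-11-14T00:00:00"),  # 1 d 18 h 22 min 43 s
]

first = seconds_to_trending(*videos[0])  # 24419 seconds
hours = average_hours(videos)            # ceil(88491 / 3600) = 25
```

Because of the final `ceil`, any average that is not an exact number of hours is rounded up, which is why the results are whole numbers.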

### Results - Channels

|      | channel_title | time_to_trending_average_hours |
| ---- | ------------- | ------------------------------ |
| 1    |               | 1.0 |
| 2    |               | 1.0 |
| 3    | <http://ex.org/channel/TOQUE_Y_SAZ%C3%83%C2%93N> | 2.0 |
| 4    | <http://ex.org/channel/Kiwilim%C3%83%C2%B3n> | 2.0 |
| 5    |               | 3.0 |
| 6    |               | 3.0 |
| 7    |               | 3.0 |
| 8    |               | 3.0 |
| 9    |               | 3.0 |
| 10   |               | 3.0 |

### Results - Categories

|      | category | time_to_trending_average_hours |
| ---- | -------- | ------------------------------ |
| 1    |          | 38.0  |
| 2    |          | 39.0  |
| 3    |          | 47.0  |
| 4    |          | 73.0  |
| 5    |          | 90.0  |
| 6    |          | 101.0 |
| 7    |          | 119.0 |
| 8    |          | 120.0 |
| 9    |          | 121.0 |
| 10   |          | 139.0 |
| 11   |          | 156.0 |
| 12   |          | 160.0 |
| 13   |          | 169.0 |
| 14   |          | 199.0 |
| 15   |          | 251.0 |
| 16   |          | 293.0 |
| 17   |          | 390.0 |
| 18   |          | 431.0 |

## 5. Find the best performing categories for a given tag, where performance = (likes + comments) / num_views (top 10)

```SPARQL
SELECT ?cat (MAX(?performance) as ?m_p) WHERE {
  ?tag a tg:Funny .
  ?video a ex:Video .
  ?video ex:hasTag ?tag .
  ?video ex:category ?cat .
  ?video ex:views ?views .
  ?video ex:likes ?likes .
  ?video ex:dislikes ?dislikes .
  ?video ex:comment_count ?comments .
  BIND(((?likes + ?comments)/?views) as ?performance) .
} GROUP BY ?tag ?cat
ORDER BY DESC(?m_p)
LIMIT 10
```
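The performance metric groups the sum before dividing: `(likes + comments) / views`. A small Python sketch of the same aggregation (best ratio per category, sorted in descending order) is shown below. The sample videos and the function name `best_performance_by_category` are invented for illustration, and the zero-views guard is an addition that the query does not spell out.

```python
from collections import defaultdict


def best_performance_by_category(videos):
    """Highest (likes + comments) / views ratio seen in each category."""
    best = defaultdict(float)
    for v in videos:
        if v["views"] == 0:
            continue  # added guard: avoid dividing by zero
        performance = (v["likes"] + v["comments"]) / v["views"]
        best[v["category"]] = max(best[v["category"]], performance)
    # Mirror ORDER BY DESC(?m_p).
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Videos that all carry the tag of interest (made-up values).
videos = [
    {"category": "Comedy", "views": 1000, "likes": 90, "comments": 10},
    {"category": "Comedy", "views": 2000, "likes": 50, "comments": 50},
    {"category": "Music", "views": 500, "likes": 20, "comments": 5},
]
ranking = best_performance_by_category(videos)
```

Here the first Comedy video scores 100 / 1000 and the second 100 / 2000, so the category keeps the higher of the two ratios.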

### Results

|      | cat | performance                |
| ---- | --- | -------------------------- |
| 1    |     | 0.039612354693054087808627 |
| 2    |     | 0.043338299457273597956795 |
| 3    |     | 0.111380816977599001317706 |
| 4    |     | 0.122075526700499472784748 |
| 5    |     | 0.12962962962962962962963  |
| 6    |     | 0.131175432700386489665602 |
| 7    |     | 0.155303641713952246737605 |
| 8    |     | 0.186055620838229533881708 |
| 9    |     | 0.198130841121495327102804 |
| 10   |     | 0.198834825763297011543856 |