https://github.com/alexander-matsievsky/tv-recommendations



neo4j spark


        

{
"cells": [
{
"cell_type": "markdown",
"id": "2b74eff6-b200-4c4a-a84b-a777cabac8be",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"# TV Recommendations (Using IMDb Dataset)\n",
"\n",
"## Setup\n",
"\n",
"1. Install `conda` (see the [installation guide](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html)).\n",
"2. Create and activate the environment, then start JupyterLab:\n",
"\n",
" ```shell\n",
" conda env create -f environment.yml\n",
" conda activate tv-recommendations\n",
" jupyter lab\n",
" ```\n",
"  \n",
"\n",
"3. Regenerate [README.md](README.md) from this notebook:\n",
"\n",
" ```shell\n",
" jupyter nbconvert --to markdown README.ipynb\n",
" ```\n",
"\n",
"## Extract, Transform, Load\n",
"\n",
"### Extract"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "86b5fd70-4e38-4213-b8c8-e27b93a71937",
"metadata": {
"execution": {
"iopub.execute_input": "2022-01-02T11:35:57.204725Z",
"iopub.status.busy": "2022-01-02T11:35:57.204452Z",
"iopub.status.idle": "2022-01-02T11:37:53.629553Z",
"shell.execute_reply": "2022-01-02T11:37:53.627315Z",
"shell.execute_reply.started": "2022-01-02T11:35:57.204618Z"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2022-01-02 14:35:57-- https://datasets.imdbws.com/\n",
"Resolving datasets.imdbws.com (datasets.imdbws.com)... 143.204.98.32, 143.204.98.111, 143.204.98.41, ...\n",
"Connecting to datasets.imdbws.com (datasets.imdbws.com)|143.204.98.32|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 945 [text/html]\n",
"Saving to: ‘/tmp/datasets.imdbws.com/index.html.tmp’\n",
"\n",
"datasets.imdbws.com 100%[===================>] 945 --.-KB/s in 0s \n",
"\n",
"2022-01-02 14:35:57 (6.23 MB/s) - ‘/tmp/datasets.imdbws.com/index.html.tmp’ saved [945/945]\n",
"\n",
"Loading robots.txt; please ignore errors.\n",
"--2022-01-02 14:35:57-- https://datasets.imdbws.com/robots.txt\n",
"Reusing existing connection to datasets.imdbws.com:443.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 945 [text/html]\n",
"Saving to: ‘/tmp/datasets.imdbws.com/robots.txt.tmp’\n",
"\n",
"datasets.imdbws.com 100%[===================>] 945 --.-KB/s in 0s \n",
"\n",
"2022-01-02 14:35:58 (19.7 MB/s) - ‘/tmp/datasets.imdbws.com/robots.txt.tmp’ saved [945/945]\n",
"\n",
"Removing /tmp/datasets.imdbws.com/index.html.tmp since it should be rejected.\n",
"\n",
"--2022-01-02 14:35:58-- https://datasets.imdbws.com/name.basics.tsv.gz\n",
"Reusing existing connection to datasets.imdbws.com:443.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 221263899 (211M) [binary/octet-stream]\n",
"Saving to: ‘/tmp/datasets.imdbws.com/name.basics.tsv.gz’\n",
"\n",
"datasets.imdbws.com 100%[===================>] 211.01M 9.61MB/s in 21s \n",
"\n",
"2022-01-02 14:36:21 (10.0 MB/s) - ‘/tmp/datasets.imdbws.com/name.basics.tsv.gz’ saved [221263899/221263899]\n",
"\n",
"--2022-01-02 14:36:21-- https://datasets.imdbws.com/title.akas.tsv.gz\n",
"Reusing existing connection to datasets.imdbws.com:443.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 258943798 (247M) [binary/octet-stream]\n",
"Saving to: ‘/tmp/datasets.imdbws.com/title.akas.tsv.gz’\n",
"\n",
"datasets.imdbws.com 100%[===================>] 246.95M 10.3MB/s in 24s \n",
"\n",
"2022-01-02 14:36:49 (10.1 MB/s) - ‘/tmp/datasets.imdbws.com/title.akas.tsv.gz’ saved [258943798/258943798]\n",
"\n",
"--2022-01-02 14:36:49-- https://datasets.imdbws.com/title.basics.tsv.gz\n",
"Reusing existing connection to datasets.imdbws.com:443.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 150032645 (143M) [binary/octet-stream]\n",
"Saving to: ‘/tmp/datasets.imdbws.com/title.basics.tsv.gz’\n",
"\n",
"datasets.imdbws.com 100%[===================>] 143.08M 10.3MB/s in 14s \n",
"\n",
"2022-01-02 14:37:05 (10.0 MB/s) - ‘/tmp/datasets.imdbws.com/title.basics.tsv.gz’ saved [150032645/150032645]\n",
"\n",
"--2022-01-02 14:37:05-- https://datasets.imdbws.com/title.crew.tsv.gz\n",
"Reusing existing connection to datasets.imdbws.com:443.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 58444351 (56M) [binary/octet-stream]\n",
"Saving to: ‘/tmp/datasets.imdbws.com/title.crew.tsv.gz’\n",
"\n",
"datasets.imdbws.com 100%[===================>] 55.74M 9.72MB/s in 5.6s \n",
"\n",
"2022-01-02 14:37:11 (9.95 MB/s) - ‘/tmp/datasets.imdbws.com/title.crew.tsv.gz’ saved [58444351/58444351]\n",
"\n",
"--2022-01-02 14:37:11-- https://datasets.imdbws.com/title.episode.tsv.gz\n",
"Reusing existing connection to datasets.imdbws.com:443.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 34630107 (33M) [binary/octet-stream]\n",
"Saving to: ‘/tmp/datasets.imdbws.com/title.episode.tsv.gz’\n",
"\n",
"datasets.imdbws.com 100%[===================>] 33.03M 10.3MB/s in 3.2s \n",
"\n",
"2022-01-02 14:37:15 (10.3 MB/s) - ‘/tmp/datasets.imdbws.com/title.episode.tsv.gz’ saved [34630107/34630107]\n",
"\n",
"--2022-01-02 14:37:15-- https://datasets.imdbws.com/title.principals.tsv.gz\n",
"Reusing existing connection to datasets.imdbws.com:443.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 383673006 (366M) [binary/octet-stream]\n",
"Saving to: ‘/tmp/datasets.imdbws.com/title.principals.tsv.gz’\n",
"\n",
"datasets.imdbws.com 100%[===================>] 365.90M 9.62MB/s in 36s \n",
"\n",
"2022-01-02 14:37:52 (10.1 MB/s) - ‘/tmp/datasets.imdbws.com/title.principals.tsv.gz’ saved [383673006/383673006]\n",
"\n",
"--2022-01-02 14:37:52-- https://datasets.imdbws.com/title.ratings.tsv.gz\n",
"Reusing existing connection to datasets.imdbws.com:443.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 5989985 (5.7M) [binary/octet-stream]\n",
"Saving to: ‘/tmp/datasets.imdbws.com/title.ratings.tsv.gz’\n",
"\n",
"datasets.imdbws.com 100%[===================>] 5.71M 10.0MB/s in 0.6s \n",
"\n",
"2022-01-02 14:37:53 (10.0 MB/s) - ‘/tmp/datasets.imdbws.com/title.ratings.tsv.gz’ saved [5989985/5989985]\n",
"\n",
"FINISHED --2022-01-02 14:37:53--\n",
"Total wall clock time: 1m 56s\n",
"Downloaded: 9 files, 1.0G in 1m 45s (10.1 MB/s)\n",
"wget -xP'/tmp' --accept '.tsv.gz' --no-parent --recursive 3.73s user 8.09s system 10% cpu 1:56.19 total\n",
"[4.0K] \u001B[01;34m/tmp/datasets.imdbws.com\u001B[0m\n",
"├── [211M] \u001B[01;31mname.basics.tsv.gz\u001B[0m\n",
"├── [ 945] robots.txt.tmp\n",
"├── [247M] \u001B[01;31mtitle.akas.tsv.gz\u001B[0m\n",
"├── [143M] \u001B[01;31mtitle.basics.tsv.gz\u001B[0m\n",
"├── [ 56M] \u001B[01;31mtitle.crew.tsv.gz\u001B[0m\n",
"├── [ 33M] \u001B[01;31mtitle.episode.tsv.gz\u001B[0m\n",
"├── [366M] \u001B[01;31mtitle.principals.tsv.gz\u001B[0m\n",
"└── [5.7M] \u001B[01;31mtitle.ratings.tsv.gz\u001B[0m\n",
"\n",
"0 directories, 8 files\n"
]
}
],
"source": [
"!wget -xP'/tmp' --accept '.tsv.gz' --no-parent --recursive 'https://datasets.imdbws.com/'\n",
"!tree -h '/tmp/datasets.imdbws.com'"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "2ba974b2-1941-4d6f-bf51-666831eab338",
"metadata": {
"execution": {
"iopub.execute_input": "2022-01-02T11:37:53.636926Z",
"iopub.status.busy": "2022-01-02T11:37:53.636238Z",
"iopub.status.idle": "2022-01-02T11:37:53.982262Z",
"shell.execute_reply": "2022-01-02T11:37:53.979946Z",
"shell.execute_reply.started": "2022-01-02T11:37:53.636829Z"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001B[01;34m/tmp/neo4j_staging/datasets.imdbws.com\u001B[0m\n",
"├── \u001B[01;34mdata\u001B[0m\n",
"├── \u001B[01;34mimport\u001B[0m\n",
"└── \u001B[01;34mlogs\u001B[0m\n",
"\n",
"3 directories, 0 files\n"
]
}
],
"source": [
"neo4j_staging = \"/tmp/neo4j_staging/datasets.imdbws.com\"\n",
"!rm -fr {neo4j_staging}\n",
"!mkdir -p '{neo4j_staging}/data' '{neo4j_staging}/import' '{neo4j_staging}/logs'\n",
"!tree {neo4j_staging}"
]
},
{
"cell_type": "markdown",
"id": "089aa784",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"### Transform"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "a19ce862-3b93-40be-a713-6d1bad57e303",
"metadata": {
"collapsed": false,
"execution": {
"iopub.execute_input": "2022-01-02T11:37:53.986750Z",
"iopub.status.busy": "2022-01-02T11:37:53.986139Z",
"iopub.status.idle": "2022-01-02T11:37:58.336856Z",
"shell.execute_reply": "2022-01-02T11:37:58.336298Z",
"shell.execute_reply.started": "2022-01-02T11:37:53.986662Z"
},
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
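},
"outputs": [],
"source": [
"%load_ext lab_black\n",
"\n",
"import pyspark\n",
"from IPython.display import Markdown\n",
"\n",
"# IMDb dumps are tab-separated, unquoted, and use \\N for missing values.\n",
"kwargs_read_csv = dict(header=True, nullValue=r\"\\N\", sep=\"\\t\", quote=\"\")\n",
"# Staged CSVs are gzipped with a header row; embedded quotes are escaped by doubling.\n",
"kwargs_write_csv = dict(compression=\"gzip\", escape='\"', header=True, mode=\"overwrite\")\n",
"\n",
"# Local Spark session that uses all available cores.\n",
"spark = pyspark.sql.SparkSession.builder.master(\"local[*]\").getOrCreate()"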
]
},
{
"cell_type": "markdown",
"id": "30e6d717",
"metadata": {},
"source": [
"#### https://datasets.imdbws.com/name.basics.tsv.gz\n",
"\n",
"**name.basics.tsv.gz** - Contains the following information for names:\n",
"- nconst (string) - alphanumeric unique identifier of the name/person\n",
"- primaryName (string) - name by which the person is most often credited\n",
"- birthYear - in YYYY format\n",
"- deathYear - in YYYY format if applicable, else '\\N'\n",
"- primaryProfession (array of strings) - the top-3 professions of the person\n",
"- knownForTitles (array of tconsts) - titles the person is known for"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "5f46612e",
"metadata": {
"collapsed": false,
"execution": {
"iopub.execute_input": "2022-01-02T11:37:58.337806Z",
"iopub.status.busy": "2022-01-02T11:37:58.337661Z",
"iopub.status.idle": "2022-01-02T11:38:01.193893Z",
"shell.execute_reply": "2022-01-02T11:38:01.193280Z",
"shell.execute_reply.started": "2022-01-02T11:37:58.337786Z"
},
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
},
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
" 0 \\\n",
"nconst nm0000001 \n",
"primaryName Fred Astaire \n",
"birthYear 1899 \n",
"deathYear 1987.0 \n",
"primaryProfession soundtrack,actor,miscellaneous \n",
"knownForTitles tt0050419,tt0072308,tt0053137,tt0031983 \n",
"\n",
" 1 \\\n",
"nconst nm0000002 \n",
"primaryName Lauren Bacall \n",
"birthYear 1924 \n",
"deathYear 2014.0 \n",
"primaryProfession actress,soundtrack \n",
"knownForTitles tt0075213,tt0038355,tt0117057,tt0037382 \n",
"\n",
" 2 \\\n",
"nconst nm0000003 \n",
"primaryName Brigitte Bardot \n",
"birthYear 1934 \n",
"deathYear NaN \n",
"primaryProfession actress,soundtrack,music_department \n",
"knownForTitles tt0054452,tt0049189,tt0057345,tt0056404 \n",
"\n",
" 3 \\\n",
"nconst nm0000004 \n",
"primaryName John Belushi \n",
"birthYear 1949 \n",
"deathYear 1982.0 \n",
"primaryProfession actor,soundtrack,writer \n",
"knownForTitles tt0072562,tt0077975,tt0080455,tt0078723 \n",
"\n",
" 4 \\\n",
"nconst nm0000005 \n",
"primaryName Ingmar Bergman \n",
"birthYear 1918 \n",
"deathYear 2007.0 \n",
"primaryProfession writer,director,actor \n",
"knownForTitles tt0083922,tt0060827,tt0050986,tt0050976 \n",
"\n",
" 5 \\\n",
"nconst nm0000006 \n",
"primaryName Ingrid Bergman \n",
"birthYear 1915 \n",
"deathYear 1982.0 \n",
"primaryProfession actress,soundtrack,producer \n",
"knownForTitles tt0077711,tt0038109,tt0036855,tt0034583 \n",
"\n",
" 6 \\\n",
"nconst nm0000007 \n",
"primaryName Humphrey Bogart \n",
"birthYear 1899 \n",
"deathYear 1957.0 \n",
"primaryProfession actor,soundtrack,producer \n",
"knownForTitles tt0033870,tt0034583,tt0037382,tt0043265 \n",
"\n",
" 7 \\\n",
"nconst nm0000008 \n",
"primaryName Marlon Brando \n",
"birthYear 1924 \n",
"deathYear 2004.0 \n",
"primaryProfession actor,soundtrack,director \n",
"knownForTitles tt0078788,tt0047296,tt0070849,tt0068646 \n",
"\n",
" 8 \\\n",
"nconst nm0000009 \n",
"primaryName Richard Burton \n",
"birthYear 1925 \n",
"deathYear 1984.0 \n",
"primaryProfession actor,soundtrack,producer \n",
"knownForTitles tt0087803,tt0057877,tt0059749,tt0061184 \n",
"\n",
" 9 \n",
"nconst nm0000010 \n",
"primaryName James Cagney \n",
"birthYear 1899 \n",
"deathYear 1986.0 \n",
"primaryProfession actor,soundtrack,director \n",
"knownForTitles tt0035575,tt0031867,tt0029870,tt0042041 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"spark.read.csv(\n",
" \"/tmp/datasets.imdbws.com/name.basics.tsv.gz\",\n",
" schema=\"\"\"\n",
" nconst string,\n",
" primaryName string,\n",
" birthYear integer,\n",
" deathYear integer,\n",
" primaryProfession string,\n",
" knownForTitles string\n",
" \"\"\",\n",
" **kwargs_read_csv\n",
").createOrReplaceTempView(\"`name.basics`\")\n",
"spark.table(\"`name.basics`\").limit(10).toPandas().T"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "a8f07481",
"metadata": {
"execution": {
"iopub.execute_input": "2022-01-02T11:38:01.194790Z",
"iopub.status.busy": "2022-01-02T11:38:01.194626Z",
"iopub.status.idle": "2022-01-02T11:38:52.128205Z",
"shell.execute_reply": "2022-01-02T11:38:52.127773Z",
"shell.execute_reply.started": "2022-01-02T11:38:01.194767Z"
},
"pycharm": {
"name": "#%%\n"
},
"tags": []
},
"outputs": [],
"source": [
"spark.sql(\n",
" \"\"\"\n",
" select nconst as `nconst:ID(Name)`,\n",
" primaryName as `primaryName`,\n",
" birthYear as `birthYear:long`,\n",
" deathYear as `deathYear:long`,\n",
" array_join(array('Name') ||\n",
" ifnull(transform(split(primaryProfession, ','), `_` -> 'primaryProfession=' || `_`),\n",
" array()),\n",
" ';') as `:LABEL`\n",
" from `name.basics`\n",
" \"\"\"\n",
").coalesce(1).write.csv(f\"{neo4j_staging}/import/name.basics\", **kwargs_write_csv)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "f1a037b5",
"metadata": {
"collapsed": false,
"execution": {
"iopub.execute_input": "2022-01-02T11:38:52.128980Z",
"iopub.status.busy": "2022-01-02T11:38:52.128850Z",
"iopub.status.idle": "2022-01-02T11:39:34.958361Z",
"shell.execute_reply": "2022-01-02T11:39:34.957890Z",
"shell.execute_reply.started": "2022-01-02T11:38:52.128961Z"
},
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"spark.sql(\n",
" \"\"\"\n",
" select nconst as `:START_ID(Name)`,\n",
" 'knownForTitles' as `:TYPE`,\n",
" explode(split(knownForTitles, ',')) as `:END_ID(Title)`\n",
" from `name.basics`\n",
" \"\"\"\n",
").coalesce(1).write.csv(\n",
" f\"{neo4j_staging}/import/name.basics.knownForTitles\", **kwargs_write_csv\n",
")"
]
},
{
"cell_type": "markdown",
"id": "825b4d71",
"metadata": {},
"source": [
"#### https://datasets.imdbws.com/title.akas.tsv.gz\n",
"\n",
"**title.akas.tsv.gz** - Contains the following information for titles:\n",
"\n",
"- titleId (string) - a tconst, an alphanumeric unique identifier of the title\n",
"- ordering (integer) - a number to uniquely identify rows for a given titleId\n",
"- title (string) - the localized title\n",
"- region (string) - the region for this version of the title\n",
"- language (string) - the language of the title\n",
"- types (array) - enumerated set of attributes for this alternative title. One or more of the following: \"alternative\", \"dvd\", \"festival\", \"tv\", \"video\", \"working\", \"original\", \"imdbDisplay\". New values may be added in the future without warning\n",
"- attributes (array) - additional terms to describe this alternative title, not enumerated\n",
"- isOriginalTitle (boolean) - 0: not original title; 1: original title"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "fddd9fda",
"metadata": {
"collapsed": false,
"execution": {
"iopub.execute_input": "2022-01-02T11:39:34.959194Z",
"iopub.status.busy": "2022-01-02T11:39:34.959055Z",
"iopub.status.idle": "2022-01-02T11:39:35.088213Z",
"shell.execute_reply": "2022-01-02T11:39:35.087696Z",
"shell.execute_reply.started": "2022-01-02T11:39:34.959174Z"
},
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": [
" 0 1 2 \\\n",
"titleId tt0000001 tt0000001 tt0000001 \n",
"ordering 1 2 3 \n",
"title Карменсіта Carmencita Carmencita - spanyol tánc \n",
"region UA DE HU \n",
"language None None None \n",
"types imdbDisplay None imdbDisplay \n",
"attributes None literal title None \n",
"isOriginalTitle 0 0 0 \n",
"\n",
" 3 4 5 6 \\\n",
"titleId tt0000001 tt0000001 tt0000001 tt0000001 \n",
"ordering 4 5 6 7 \n",
"title Καρμενσίτα Карменсита Carmencita Carmencita \n",
"region GR RU US None \n",
"language None None None None \n",
"types imdbDisplay imdbDisplay imdbDisplay original \n",
"attributes None None None None \n",
"isOriginalTitle 0 0 0 1 \n",
"\n",
" 7 8 9 \n",
"titleId tt0000001 tt0000002 tt0000002 \n",
"ordering 8 1 2 \n",
"title カルメンチータ Le clown et ses chiens Le clown et ses chiens \n",
"region JP None FR \n",
"language ja None None \n",
"types imdbDisplay original imdbDisplay \n",
"attributes None None None \n",
"isOriginalTitle 0 1 0 "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"spark.read.csv(\n",
" \"/tmp/datasets.imdbws.com/title.akas.tsv.gz\",\n",
" schema=\"\"\"\n",
" titleId string,\n",
" ordering integer,\n",
" title string,\n",
" region string,\n",
" language string,\n",
" types string,\n",
" attributes string,\n",
" isOriginalTitle integer\n",
" \"\"\",\n",
" **kwargs_read_csv\n",
").createOrReplaceTempView(\"`title.akas`\")\n",
"spark.table(\"`title.akas`\").limit(10).toPandas().T"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "2b5a8f50",
"metadata": {
"collapsed": false,
"execution": {
"iopub.execute_input": "2022-01-02T11:39:35.089918Z",
"iopub.status.busy": "2022-01-02T11:39:35.089741Z",
"iopub.status.idle": "2022-01-02T11:41:31.213440Z",
"shell.execute_reply": "2022-01-02T11:41:31.212873Z",
"shell.execute_reply.started": "2022-01-02T11:39:35.089891Z"
},
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"spark.sql(\n",
" \"\"\"\n",
" select titleId || '#' || ordering as `tconst:ID(TitleAka)`,\n",
" title as `title`,\n",
" region as `region`,\n",
" language as `language`,\n",
" boolean(isOriginalTitle) as `isOriginalTitle:boolean`,\n",
" array_join(array('TitleAka') ||\n",
" ifnull(transform(split(attributes, ','), `_` -> 'attributes=' || `_`),\n",
" array()) ||\n",
" ifnull(transform(split(types, ','), `_` -> 'types=' || `_`),\n",
" array()),\n",
" ';') as `:LABEL`\n",
" from `title.akas`\n",
" \"\"\"\n",
").coalesce(1).write.csv(f\"{neo4j_staging}/import/title.akas\", **kwargs_write_csv)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "0b9803f3",
"metadata": {
"collapsed": false,
"execution": {
"iopub.execute_input": "2022-01-02T11:41:31.214840Z",
"iopub.status.busy": "2022-01-02T11:41:31.214630Z",
"iopub.status.idle": "2022-01-02T11:42:54.636656Z",
"shell.execute_reply": "2022-01-02T11:42:54.636076Z",
"shell.execute_reply.started": "2022-01-02T11:41:31.214801Z"
},
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"spark.sql(\n",
" \"\"\"\n",
" select titleId as `:START_ID(Title)`,\n",
" 'akas' as `:TYPE`,\n",
" titleId || '#' || ordering as `:END_ID(TitleAka)`,\n",
" ordering as `ordering:long`\n",
" from `title.akas`\n",
" \"\"\"\n",
").coalesce(1).write.csv(f\"{neo4j_staging}/import/title.akas.akas\", **kwargs_write_csv)"
]
},
{
"cell_type": "markdown",
"id": "f0f344c1",
"metadata": {},
"source": [
"#### https://datasets.imdbws.com/title.basics.tsv.gz + https://datasets.imdbws.com/title.ratings.tsv.gz\n",
"\n",
"**title.basics.tsv.gz** - Contains the following information for titles:\n",
"- tconst (string) - alphanumeric unique identifier of the title\n",
"- titleType (string) - the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc.)\n",
"- primaryTitle (string) - the more popular title / the title used by the filmmakers on promotional materials at the point of release\n",
"- originalTitle (string) - original title, in the original language\n",
"- isAdult (boolean) - 0: non-adult title; 1: adult title\n",
"- startYear (YYYY) - represents the release year of a title. In the case of TV Series, it is the series start year\n",
"- endYear (YYYY) - TV Series end year. '\\N' for all other title types\n",
"- runtimeMinutes - primary runtime of the title, in minutes\n",
"- genres (string array) - includes up to three genres associated with the title"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "d7ae1aa0",
"metadata": {
"collapsed": false,
"execution": {
"iopub.execute_input": "2022-01-02T11:42:54.637505Z",
"iopub.status.busy": "2022-01-02T11:42:54.637361Z",
"iopub.status.idle": "2022-01-02T11:42:54.775198Z",
"shell.execute_reply": "2022-01-02T11:42:54.774481Z",
"shell.execute_reply.started": "2022-01-02T11:42:54.637484Z"
},
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
},
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
" 0 1 \\\n",
"tconst tt0000001 tt0000002 \n",
"titleType short short \n",
"primaryTitle Carmencita Le clown et ses chiens \n",
"originalTitle Carmencita Le clown et ses chiens \n",
"isAdult 0 0 \n",
"startYear 1894 1892 \n",
"endYear NaN NaN \n",
"runtimeMinutes 1 5 \n",
"genres Documentary,Short Animation,Short \n",
"\n",
" 2 3 4 \\\n",
"tconst tt0000003 tt0000004 tt0000005 \n",
"titleType short short short \n",
"primaryTitle Pauvre Pierrot Un bon bock Blacksmith Scene \n",
"originalTitle Pauvre Pierrot Un bon bock Blacksmith Scene \n",
"isAdult 0 0 0 \n",
"startYear 1892 1892 1893 \n",
"endYear NaN NaN NaN \n",
"runtimeMinutes 4 12 1 \n",
"genres Animation,Comedy,Romance Animation,Short Comedy,Short \n",
"\n",
" 5 \\\n",
"tconst tt0000006 \n",
"titleType short \n",
"primaryTitle Chinese Opium Den \n",
"originalTitle Chinese Opium Den \n",
"isAdult 0 \n",
"startYear 1894 \n",
"endYear NaN \n",
"runtimeMinutes 1 \n",
"genres Short \n",
"\n",
" 6 \\\n",
"tconst tt0000007 \n",
"titleType short \n",
"primaryTitle Corbett and Courtney Before the Kinetograph \n",
"originalTitle Corbett and Courtney Before the Kinetograph \n",
"isAdult 0 \n",
"startYear 1894 \n",
"endYear NaN \n",
"runtimeMinutes 1 \n",
"genres Short,Sport \n",
"\n",
" 7 8 \\\n",
"tconst tt0000008 tt0000009 \n",
"titleType short short \n",
"primaryTitle Edison Kinetoscopic Record of a Sneeze Miss Jerry \n",
"originalTitle Edison Kinetoscopic Record of a Sneeze Miss Jerry \n",
"isAdult 0 0 \n",
"startYear 1894 1894 \n",
"endYear NaN NaN \n",
"runtimeMinutes 1 40 \n",
"genres Documentary,Short Romance,Short \n",
"\n",
" 9 \n",
"tconst tt0000010 \n",
"titleType short \n",
"primaryTitle Leaving the Factory \n",
"originalTitle La sortie de l'usine Lumière à Lyon \n",
"isAdult 0 \n",
"startYear 1895 \n",
"endYear NaN \n",
"runtimeMinutes 1 \n",
"genres Documentary,Short "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"spark.read.csv(\n",
" \"/tmp/datasets.imdbws.com/title.basics.tsv.gz\",\n",
" schema=\"\"\"\n",
" tconst string,\n",
" titleType string,\n",
" primaryTitle string,\n",
" originalTitle string,\n",
" isAdult integer,\n",
" startYear integer,\n",
" endYear integer,\n",
" runtimeMinutes integer,\n",
" genres string\n",
" \"\"\",\n",
" **kwargs_read_csv\n",
").createOrReplaceTempView(\"`title.basics`\")\n",
"spark.table(\"`title.basics`\").limit(10).toPandas().T"
]
},
{
"cell_type": "markdown",
"id": "33b815d5-fa72-45ad-88ac-2783d4ff97fc",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"**title.ratings.tsv.gz** - Contains the IMDb rating and votes information for titles:\n",
"- tconst (string) - alphanumeric unique identifier of the title\n",
"- averageRating - weighted average of all the individual user ratings\n",
"- numVotes - number of votes the title has received"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "d8dd6b4a-420a-4ae6-995d-f64c5ee0713b",
"metadata": {
"execution": {
"iopub.execute_input": "2022-01-02T11:42:54.776350Z",
"iopub.status.busy": "2022-01-02T11:42:54.776116Z",
"iopub.status.idle": "2022-01-02T11:42:54.899764Z",
"shell.execute_reply": "2022-01-02T11:42:54.899270Z",
"shell.execute_reply.started": "2022-01-02T11:42:54.776318Z"
},
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
" 0 1 2 3 4 \\\n",
"tconst tt0000001 tt0000002 tt0000003 tt0000004 tt0000005 \n",
"averageRating 5.7 6.0 6.5 6.1 6.2 \n",
"numVotes 1847 237 1609 154 2432 \n",
"\n",
" 5 6 7 8 9 \n",
"tconst tt0000006 tt0000007 tt0000008 tt0000009 tt0000010 \n",
"averageRating 5.2 5.4 5.5 5.9 6.9 \n",
"numVotes 160 760 1992 192 6651 "
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"spark.read.csv(\n",
" \"/tmp/datasets.imdbws.com/title.ratings.tsv.gz\",\n",
" schema=\"\"\"\n",
" tconst string,\n",
" averageRating float,\n",
" numVotes integer\n",
" \"\"\",\n",
" **kwargs_read_csv\n",
").createOrReplaceTempView(\"`title.ratings`\")\n",
"spark.table(\"`title.ratings`\").limit(10).toPandas().T"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "a58c8b4c",
"metadata": {
"collapsed": false,
"execution": {
"iopub.execute_input": "2022-01-02T11:42:54.900571Z",
"iopub.status.busy": "2022-01-02T11:42:54.900438Z",
"iopub.status.idle": "2022-01-02T11:44:12.281864Z",
"shell.execute_reply": "2022-01-02T11:44:12.281424Z",
"shell.execute_reply.started": "2022-01-02T11:42:54.900553Z"
},
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
},
"tags": []
},
"outputs": [],
"source": [
"spark.sql(\n",
" \"\"\"\n",
" select tconst as `tconst:ID(Title)`,\n",
" primaryTitle as `primaryTitle`,\n",
" originalTitle as `originalTitle`,\n",
" boolean(isAdult) as `isAdult:boolean`,\n",
" startYear as `startYear:long`,\n",
" endYear as `endYear:long`,\n",
" runtimeMinutes as `runtimeMinutes:long`,\n",
" averageRating as `averageRating:double`,\n",
" numVotes as `numVotes:long`,\n",
" array_join(array('Title') ||\n",
" ifnull(transform(array(titleType), `_` -> 'titleType=' || `_`), array()) ||\n",
" ifnull(transform(split(genres, ','), `_` -> 'genres=' || `_`), array()),\n",
" ';') as `:LABEL`\n",
" from `title.basics`\n",
" left join `title.ratings` using (tconst)\n",
" \"\"\"\n",
").coalesce(1).write.csv(f\"{neo4j_staging}/import/title.basics\", **kwargs_write_csv)"
]
},
{
"cell_type": "markdown",
"id": "3bd223f0",
"metadata": {},
"source": [
"#### https://datasets.imdbws.com/title.crew.tsv.gz\n",
"\n",
"**title.crew.tsv.gz** - Contains the director and writer information for all the titles in IMDb. Fields include:\n",
"- tconst (string) - alphanumeric unique identifier of the title\n",
"- directors (array of nconsts) - director(s) of the given title\n",
"- writers (array of nconsts) - writer(s) of the given title"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "f91729e4",
"metadata": {
"collapsed": false,
"execution": {
"iopub.execute_input": "2022-01-02T11:44:12.282634Z",
"iopub.status.busy": "2022-01-02T11:44:12.282511Z",
"iopub.status.idle": "2022-01-02T11:44:12.395229Z",
"shell.execute_reply": "2022-01-02T11:44:12.394535Z",
"shell.execute_reply.started": "2022-01-02T11:44:12.282616Z"
},
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": [
" 0 1 2 3 4 5 \\\n",
"tconst tt0000001 tt0000002 tt0000003 tt0000004 tt0000005 tt0000006 \n",
"directors nm0005690 nm0721526 nm0721526 nm0721526 nm0005690 nm0005690 \n",
"writers None None None None None None \n",
"\n",
" 6 7 8 9 \n",
"tconst tt0000007 tt0000008 tt0000009 tt0000010 \n",
"directors nm0005690,nm0374658 nm0005690 nm0085156 nm0525910 \n",
"writers None None nm0085156 None "
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"spark.read.csv(\n",
" \"/tmp/datasets.imdbws.com/title.crew.tsv.gz\",\n",
" schema=\"\"\"\n",
" tconst string,\n",
" directors string,\n",
" writers string\n",
" \"\"\",\n",
" **kwargs_read_csv\n",
").createOrReplaceTempView(\"`title.crew`\")\n",
"spark.table(\"`title.crew`\").limit(10).toPandas().T"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "64eb0014",
"metadata": {
"collapsed": false,
"execution": {
"iopub.execute_input": "2022-01-02T11:44:12.396458Z",
"iopub.status.busy": "2022-01-02T11:44:12.396222Z",
"iopub.status.idle": "2022-01-02T11:45:14.714146Z",
"shell.execute_reply": "2022-01-02T11:45:14.713529Z",
"shell.execute_reply.started": "2022-01-02T11:44:12.396421Z"
},
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"spark.sql(\n",
" \"\"\"\n",
" select tconst as `:START_ID(Title)`,\n",
" 'directors' as `:TYPE`,\n",
" explode(split(directors, ',')) as `:END_ID(Name)`\n",
" from `title.crew`\n",
" union\n",
" select tconst as `:START_ID(Title)`,\n",
" 'writers' as `:TYPE`,\n",
" explode(split(writers, ',')) as `:END_ID(Name)`\n",
" from `title.crew`\n",
" \"\"\"\n",
").coalesce(1).write.csv(f\"{neo4j_staging}/import/title.crew\", **kwargs_write_csv)"
]
},
{
"cell_type": "markdown",
"id": "6e0b9995",
"metadata": {},
"source": [
"#### https://datasets.imdbws.com/title.episode.tsv.gz\n",
"\n",
"**title.episode.tsv.gz** - Contains the TV episode information. Fields include:\n",
"- tconst (string) - alphanumeric identifier of the episode\n",
"- parentTconst (string) - alphanumeric identifier of the parent TV series\n",
"- seasonNumber (integer) - season number the episode belongs to\n",
"- episodeNumber (integer) - episode number of the tconst in the TV series"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "945ee461",
"metadata": {
"collapsed": false,
"execution": {
"iopub.execute_input": "2022-01-02T11:45:14.715043Z",
"iopub.status.busy": "2022-01-02T11:45:14.714903Z",
"iopub.status.idle": "2022-01-02T11:45:14.824468Z",
"shell.execute_reply": "2022-01-02T11:45:14.823974Z",
"shell.execute_reply.started": "2022-01-02T11:45:14.715023Z"
},
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": [
" 0 1 2 3 4 \\\n",
"tconst tt0020666 tt0020829 tt0021166 tt0021612 tt0021655 \n",
"parentTconst tt15180956 tt15180956 tt15180956 tt15180956 tt15180956 \n",
"seasonNumber 1 1 1 2 2 \n",
"episodeNumber 2 1 3 2 5 \n",
"\n",
" 5 6 7 8 9 \n",
"tconst tt0021663 tt0021664 tt0021701 tt0021802 tt0022009 \n",
"parentTconst tt15180956 tt15180956 tt15180956 tt15180956 tt15180956 \n",
"seasonNumber 2 2 2 2 2 \n",
"episodeNumber 6 4 1 11 10 "
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"spark.read.csv(\n",
" \"/tmp/datasets.imdbws.com/title.episode.tsv.gz\",\n",
" schema=\"\"\"\n",
" tconst string,\n",
" parentTconst string,\n",
" seasonNumber integer,\n",
" episodeNumber integer\n",
" \"\"\",\n",
" **kwargs_read_csv\n",
").createOrReplaceTempView(\"`title.episode`\")\n",
"spark.table(\"`title.episode`\").limit(10).toPandas().T"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "956d89cd",
"metadata": {
"collapsed": false,
"execution": {
"iopub.execute_input": "2022-01-02T11:45:14.825272Z",
"iopub.status.busy": "2022-01-02T11:45:14.825129Z",
"iopub.status.idle": "2022-01-02T11:45:32.546360Z",
"shell.execute_reply": "2022-01-02T11:45:32.545886Z",
"shell.execute_reply.started": "2022-01-02T11:45:14.825253Z"
},
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"spark.sql(\n",
" \"\"\"\n",
" select parentTconst as `:START_ID(Title)`,\n",
" 'episodes' as `:TYPE`,\n",
" tconst as `:END_ID(Title)`,\n",
" seasonNumber as `seasonNumber:long`,\n",
" episodeNumber as `episodeNumber:long`\n",
" from `title.episode`\n",
" \"\"\"\n",
").coalesce(1).write.csv(f\"{neo4j_staging}/import/title.episode\", **kwargs_write_csv)"
]
},
{
"cell_type": "markdown",
"id": "368ac807",
"metadata": {},
"source": [
"#### https://datasets.imdbws.com/title.principals.tsv.gz\n",
"\n",
"**title.principals.tsv.gz** - Contains the principal cast/crew for titles. Fields include:\n",
"- tconst (string) - alphanumeric unique identifier of the title\n",
"- ordering (integer) - a number to uniquely identify rows for a given title (tconst)\n",
"- nconst (string) - alphanumeric unique identifier of the name/person\n",
"- category (string) - the category of job that person was in\n",
"- job (string) - the specific job title if applicable, else '\\N'\n",
"- characters (string) - the name of the character played if applicable, else '\\N'"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "7c7bedf4",
"metadata": {
"collapsed": false,
"execution": {
"iopub.execute_input": "2022-01-02T11:45:32.547214Z",
"iopub.status.busy": "2022-01-02T11:45:32.547056Z",
"iopub.status.idle": "2022-01-02T11:45:32.658140Z",
"shell.execute_reply": "2022-01-02T11:45:32.657482Z",
"shell.execute_reply.started": "2022-01-02T11:45:32.547194Z"
},
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"data": {
"text/plain": [
" 0 1 2 3 \\\n",
"tconst tt0000001 tt0000001 tt0000001 tt0000002 \n",
"ordering 1 2 3 1 \n",
"nconst nm1588970 nm0005690 nm0374658 nm0721526 \n",
"category self director cinematographer director \n",
"job None None director of photography None \n",
"characters [\"Self\"] None None None \n",
"\n",
" 4 5 6 7 8 9 \n",
"tconst tt0000002 tt0000003 tt0000003 tt0000003 tt0000003 tt0000004 \n",
"ordering 2 1 2 3 4 1 \n",
"nconst nm1335271 nm0721526 nm1770680 nm1335271 nm5442200 nm0721526 \n",
"category composer director producer composer editor director \n",
"job None None producer None None None \n",
"characters None None None None None None "
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"spark.read.csv(\n",
" \"/tmp/datasets.imdbws.com/title.principals.tsv.gz\",\n",
" schema=\"\"\"\n",
" tconst string,\n",
" ordering integer,\n",
" nconst string,\n",
" category string,\n",
" job string,\n",
" characters string\n",
" \"\"\",\n",
" **kwargs_read_csv\n",
").createOrReplaceTempView(\"`title.principals`\")\n",
"spark.table(\"`title.principals`\").limit(10).toPandas().T"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "eea4088d",
"metadata": {
"collapsed": false,
"execution": {
"iopub.execute_input": "2022-01-02T11:45:32.659248Z",
"iopub.status.busy": "2022-01-02T11:45:32.659019Z",
"iopub.status.idle": "2022-01-02T11:48:45.011871Z",
"shell.execute_reply": "2022-01-02T11:48:45.011387Z",
"shell.execute_reply.started": "2022-01-02T11:45:32.659219Z"
},
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"spark.sql(\n",
" \"\"\"\n",
" select tconst as `:START_ID(Title)`,\n",
" 'principals' as `:TYPE`,\n",
" nconst as `:END_ID(Name)`,\n",
" ordering as `ordering:long`,\n",
" category as `category`,\n",
" job as `job`,\n",
" array_join(from_json(characters, 'array<string>'),\n",
" ';') as `characters`\n",
" from `title.principals`\n",
" \"\"\"\n",
").coalesce(1).write.csv(f\"{neo4j_staging}/import/title.principals\", **kwargs_write_csv)"
]
},
{
"cell_type": "markdown",
"id": "4cfb4c17",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"### Load"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "727b00fb",
"metadata": {
"collapsed": false,
"execution": {
"iopub.execute_input": "2022-01-02T11:48:45.012615Z",
"iopub.status.busy": "2022-01-02T11:48:45.012474Z",
"iopub.status.idle": "2022-01-02T11:48:45.129421Z",
"shell.execute_reply": "2022-01-02T11:48:45.128815Z",
"shell.execute_reply.started": "2022-01-02T11:48:45.012598Z"
},
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[4.0K] \u001B[01;34m/tmp/neo4j_staging/datasets.imdbws.com/import\u001B[0m\n",
"├── [4.0K] \u001B[01;34mname.basics\u001B[0m\n",
"│   ├── [148M] \u001B[01;31mpart-00000-4adc0db8-e1be-42b1-8009-093744e4c1a9-c000.csv.gz\u001B[0m\n",
"│   └── [ 0] _SUCCESS\n",
"├── [4.0K] \u001B[01;34mname.basics.knownForTitles\u001B[0m\n",
"│   ├── [ 98M] \u001B[01;31mpart-00000-0fc6773a-b8ea-472d-afe7-843875c4197b-c000.csv.gz\u001B[0m\n",
"│   └── [ 0] _SUCCESS\n",
"├── [4.0K] \u001B[01;34mtitle.akas\u001B[0m\n",
"│   ├── [252M] \u001B[01;31mpart-00000-c812541e-2d10-4815-91e7-7facc74186e5-c000.csv.gz\u001B[0m\n",
"│   └── [ 0] _SUCCESS\n",
"├── [4.0K] \u001B[01;34mtitle.akas.akas\u001B[0m\n",
"│   ├── [ 89M] \u001B[01;31mpart-00000-d57d1d81-5587-4717-98ed-f1af47f10328-c000.csv.gz\u001B[0m\n",
"│   └── [ 0] _SUCCESS\n",
"├── [4.0K] \u001B[01;34mtitle.basics\u001B[0m\n",
"│   ├── [153M] \u001B[01;31mpart-00000-2a4fa35f-fdc7-4596-b2b1-3b9c52bd1ead-c000.csv.gz\u001B[0m\n",
"│   └── [ 0] _SUCCESS\n",
"├── [4.0K] \u001B[01;34mtitle.crew\u001B[0m\n",
"│   ├── [117M] \u001B[01;31mpart-00000-ddc7238b-bd6c-41e4-875a-c110a35cc530-c000.csv.gz\u001B[0m\n",
"│   └── [ 0] _SUCCESS\n",
"├── [4.0K] \u001B[01;34mtitle.episode\u001B[0m\n",
"│   ├── [ 35M] \u001B[01;31mpart-00000-c3653084-841d-4972-a927-66d263d57cec-c000.csv.gz\u001B[0m\n",
"│   └── [ 0] _SUCCESS\n",
"└── [4.0K] \u001B[01;34mtitle.principals\u001B[0m\n",
" ├── [373M] \u001B[01;31mpart-00000-65eb0487-4aa4-4d26-b3f9-a3e89e919142-c000.csv.gz\u001B[0m\n",
" └── [ 0] _SUCCESS\n",
"\n",
"8 directories, 16 files\n"
]
}
],
"source": [
"!tree -h '{neo4j_staging}/import'"
]
},
{
"cell_type": "markdown",
"id": "0d7e1440",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Run the following command to ingest data into Neo4j:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "32c3135e-1e2a-48d3-a445-6f91df9fddea",
"metadata": {
"execution": {
"iopub.execute_input": "2022-01-02T11:48:45.130689Z",
"iopub.status.busy": "2022-01-02T11:48:45.130477Z",
"iopub.status.idle": "2022-01-02T11:48:45.139473Z",
"shell.execute_reply": "2022-01-02T11:48:45.138750Z",
"shell.execute_reply.started": "2022-01-02T11:48:45.130649Z"
},
"tags": []
},
"outputs": [
{
"data": {
"text/markdown": [
"\n",
"```shell\n",
"docker pull neo4j:4.1.4-community\n",
"\n",
"docker run \\\n",
" --rm \\\n",
" -e NEO4J_AUTH=none \\\n",
" -p 7474:7474 \\\n",
" -p 7687:7687 \\\n",
" -v /tmp/neo4j_staging/datasets.imdbws.com/data:/data \\\n",
" -v /tmp/neo4j_staging/datasets.imdbws.com/logs:/logs \\\n",
" -v /tmp/neo4j_staging/datasets.imdbws.com/import:/var/lib/neo4j/import \\\n",
" neo4j:4.1.4-community bin/neo4j-admin import \\\n",
" --database=imdb \\\n",
" --high-io=true \\\n",
" --max-memory=2G \\\n",
" --nodes='import/name.basics/.+.csv.gz' \\\n",
" --nodes='import/title.akas/.+.csv.gz' \\\n",
" --nodes='import/title.basics/.+.csv.gz' \\\n",
" --relationships='import/name.basics.knownForTitles/.+.csv.gz' \\\n",
" --relationships='import/title.akas.akas/.+.csv.gz' \\\n",
" --relationships='import/title.crew/.+.csv.gz' \\\n",
" --relationships='import/title.episode/.+.csv.gz' \\\n",
" --relationships='import/title.principals/.+.csv.gz' \\\n",
" --skip-bad-relationships=true \\\n",
" --skip-duplicate-nodes=true\n",
"```\n",
" "
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Markdown(\n",
" fr\"\"\"\n",
"```shell\n",
"docker pull neo4j:4.1.4-community\n",
"\n",
"docker run \\\n",
" --rm \\\n",
" -e NEO4J_AUTH=none \\\n",
" -p 7474:7474 \\\n",
" -p 7687:7687 \\\n",
" -v {neo4j_staging}/data:/data \\\n",
" -v {neo4j_staging}/logs:/logs \\\n",
" -v {neo4j_staging}/import:/var/lib/neo4j/import \\\n",
" neo4j:4.1.4-community bin/neo4j-admin import \\\n",
" --database=imdb \\\n",
" --high-io=true \\\n",
" --max-memory=2G \\\n",
" --nodes='import/name.basics/.+.csv.gz' \\\n",
" --nodes='import/title.akas/.+.csv.gz' \\\n",
" --nodes='import/title.basics/.+.csv.gz' \\\n",
" --relationships='import/name.basics.knownForTitles/.+.csv.gz' \\\n",
" --relationships='import/title.akas.akas/.+.csv.gz' \\\n",
" --relationships='import/title.crew/.+.csv.gz' \\\n",
" --relationships='import/title.episode/.+.csv.gz' \\\n",
" --relationships='import/title.principals/.+.csv.gz' \\\n",
" --skip-bad-relationships=true \\\n",
" --skip-duplicate-nodes=true\n",
"```\n",
" \"\"\"\n",
")"
]
},
{
"cell_type": "markdown",
"id": "0e9a9b6a-482d-4151-9ab8-c3b559eaa8b6",
"metadata": {
"execution": {
"iopub.execute_input": "2022-01-02T09:45:39.955749Z",
"iopub.status.busy": "2022-01-02T09:45:39.955521Z",
"iopub.status.idle": "2022-01-02T09:45:40.026859Z",
"shell.execute_reply": "2022-01-02T09:45:40.025871Z",
"shell.execute_reply.started": "2022-01-02T09:45:39.955723Z"
}
},
"source": [
"```\n",
"IMPORT DONE in 9m 34s 920ms. \n",
"Imported:\n",
" 50354453 nodes\n",
" 119199226 relationships\n",
" 455432935 properties\n",
"Peak memory usage: 748.0MiB\n",
"```\n",
"\n",
"Run the following command to boot up Neo4j:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "cdf4b77b-f419-4436-92b3-3a09d0df0774",
"metadata": {
"execution": {
"iopub.execute_input": "2022-01-02T11:48:45.140649Z",
"iopub.status.busy": "2022-01-02T11:48:45.140437Z",
"iopub.status.idle": "2022-01-02T11:48:45.151877Z",
"shell.execute_reply": "2022-01-02T11:48:45.151273Z",
"shell.execute_reply.started": "2022-01-02T11:48:45.140615Z"
},
"tags": []
},
"outputs": [
{
"data": {
"text/markdown": [
"\n",
"```shell\n",
"docker run \\\n",
" --rm \\\n",
" -e NEO4J_AUTH=none \\\n",
" -e NEO4J_dbms_default__database=imdb \\\n",
" -p 7474:7474 \\\n",
" -p 7687:7687 \\\n",
" -v /tmp/neo4j_staging/datasets.imdbws.com/data:/data \\\n",
" -v /tmp/neo4j_staging/datasets.imdbws.com/logs:/logs \\\n",
" -v /tmp/neo4j_staging/datasets.imdbws.com/import:/var/lib/neo4j/import \\\n",
" neo4j:4.1.4-community\n",
"```\n",
" "
],
"text/plain": [
"<IPython.core.display.Markdown object>"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Markdown(\n",
" fr\"\"\"\n",
"```shell\n",
"docker run \\\n",
" --rm \\\n",
" -e NEO4J_AUTH=none \\\n",
" -e NEO4J_dbms_default__database=imdb \\\n",
" -p 7474:7474 \\\n",
" -p 7687:7687 \\\n",
" -v {neo4j_staging}/data:/data \\\n",
" -v {neo4j_staging}/logs:/logs \\\n",
" -v {neo4j_staging}/import:/var/lib/neo4j/import \\\n",
" neo4j:4.1.4-community\n",
"```\n",
" \"\"\"\n",
")"
]
},
{
"cell_type": "markdown",
"id": "801dae3c",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"## Exploratory Data Analysis\n",
"\n",
"Find one shortest path of at most 10 hops between Alex Garland and Denis Villeneuve:\n",
"\n",
"```cypher\n",
"MATCH (alex_garland:Name {nconst: 'nm0307497'}),\n",
" (denis_villeneuve:Name {nconst: 'nm0898288'})\n",
"RETURN shortestPath((alex_garland)-[*..10]-(denis_villeneuve));\n",
"```\n",
"\n",
"Find all shortest paths of at most 5 hops between the same two people:\n",
"\n",
"```cypher\n",
"MATCH (alex_garland:Name {nconst: 'nm0307497'}),\n",
" (denis_villeneuve:Name {nconst: 'nm0898288'})\n",
"RETURN allShortestPaths((alex_garland)-[*..5]-(denis_villeneuve));\n",
"```\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
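
The `title.crew` export in the notebook turns each comma-separated `directors` / `writers` field into one relationship row per person, using `explode(split(...))` and a deduplicating `union`. As a rough illustration only, the same reshaping can be sketched in plain Python without Spark. The function name `crew_relationships` and the sample rows are hypothetical and not part of the notebook; the sketch assumes missing values arrive as `None`, as in the notebook's DataFrame previews.

```python
from typing import Iterable, Optional, Set, Tuple

# (tconst, directors, writers) as read from title.crew; None marks a missing value.
CrewRow = Tuple[str, Optional[str], Optional[str]]
# (:START_ID(Title), :TYPE, :END_ID(Name)), matching the exported CSV columns.
Edge = Tuple[str, str, str]


def crew_relationships(rows: Iterable[CrewRow]) -> Set[Edge]:
    """Explode comma-separated crew fields into one edge per person.

    A set is used so duplicate edges collapse, similar to SQL `union`.
    A None field produces no edges, like exploding a null array.
    """
    edges: Set[Edge] = set()
    for tconst, directors, writers in rows:
        for rel_type, field in (("directors", directors), ("writers", writers)):
            if field is None:
                continue
            for nconst in field.split(","):
                edges.add((tconst, rel_type, nconst))
    return edges


# Hypothetical sample shaped like the preview rows shown in the notebook.
sample = [
    ("tt0000007", "nm0005690,nm0374658", None),
    ("tt0000009", "nm0085156", "nm0085156"),
]
edges = crew_relationships(sample)
for edge in sorted(edges):
    print(edge)
```

Each resulting tuple corresponds to one line of the relationship CSV that `neo4j-admin import` consumes via the `:START_ID(Title)`, `:TYPE`, and `:END_ID(Name)` header columns.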