{"id":26323231,"url":"https://github.com/traceypooh/mozfest17","last_synced_at":"2025-03-15T17:16:54.763Z","repository":{"id":146364686,"uuid":"108297903","full_name":"traceypooh/mozfest17","owner":"traceypooh","description":"TV Archives cracked open - \"AI for IA\" Artificial Intelligence for Internet Archive.  Talk at MozFest, London Oct 2017.  VIEW SLIDES: https://traceypooh.github.io/mozfest17","archived":false,"fork":false,"pushed_at":"2023-10-05T01:07:20.000Z","size":5034,"stargazers_count":1,"open_issues_count":1,"forks_count":1,"subscribers_count":6,"default_branch":"master","last_synced_at":"2023-10-05T10:44:49.318Z","etag":null,"topics":["api","archives","artificial-intelligence","artificial-neural-networks","news","rest-api","tv"],"latest_commit_sha":null,"homepage":"https://archive.org/tv","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/traceypooh.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2017-10-25T16:44:34.000Z","updated_at":"2023-10-05T01:07:24.000Z","dependencies_parsed_at":null,"dependency_job_id":"c5c32c48-027a-4d70-af4f-c5732a96e301","html_url":"https://github.com/traceypooh/mozfest17","commit_stats":null,"previous_names":[],"tags_count":0,"template":null,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/traceypooh%2Fmozfest17","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/traceypooh%2Fmozfest17/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/traceypooh%2Fmozfest17/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trac
eypooh%2Fmozfest17/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/traceypooh","download_url":"https://codeload.github.com/traceypooh/mozfest17/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243762248,"owners_count":20343979,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["api","archives","artificial-intelligence","artificial-neural-networks","news","rest-api","tv"],"created_at":"2025-03-15T17:16:54.127Z","updated_at":"2025-03-15T17:16:54.749Z","avatar_url":"https://github.com/traceypooh.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003c!doctype html\u003e\u003chead\u003e\u003cscript src=\"eveal.js\"\u003e\u003c/script\u003e\u003c/head\u003e\u003cbody\u003e\n\n# TV Archives cracked Open \"AI for IA\"\n## Artificial Intelligence for Internet Archive\n### MozFest, London Oct 2017\n\n\u003csmall\u003e\n  by\n  [traceypooh](https://twitter.com/tracey_pooh)\n  \u003ca href=\"https://github.com/traceypooh\"\u003e\u003cimg style=\"margin:0\" src=\"git.png\"/\u003e\u003c/a\u003e\n  \u003cbr/\u003e\n\u003c/small\u003e\n\n\u003csmall\u003e\n  https://traceypooh.github.io/mozfest17\n _?_ for key shortcuts\n\u003c/small\u003e\n\n```bash\ngit clone https://github.com/traceypooh/mozfest17; open mozfest17/index.html\n```\n\n\n\u003ca href=\"https://archive.org\"\u003e\n  \u003cimg src=\"https://archive.org/images/glogo.png\" style=\"position:fixed; bottom:0; left:10%;\"/\u003e\n\u003c/a\u003e\n\u003ca href=\"https://archive.org/tv\"\u003e\n  
\u003cimg src=\"tvlogo-quarter.png\" style=\"position:fixed; bottom:0; left:80%;\"/\u003e\n\u003c/a\u003e\n\n\n---\n\n# Gist\n_decentralized research and AI \u003cbr/\u003e\nbuilt on top of \u003cbr/\u003e\na library of stable, untampered worldwide TV recordings_\n\n---\n# Intro to archive.org\n- WayBack Machine\n  - past copies of 300B+ pages\n  - 15M books, lendable\n  - ~4M videos, ~4M audio \u0026 live concerts\n  - 3M images\n  - 200K software items \u0026 emulation (in JS!)\n---\n\u003ca href=\"https://web.archive.org/web/19961219202222/http://www.apple.com:80/\"\u003e\n  \u003cimg style=\"height:700px\" src=\"wayback-apple.png\"/\u003e\n\u003c/a\u003e\n---\n\u003ca href=\"https://archive.org/details/goodytwoshoes00newyiala\"\u003e\n  \u003cimg style=\"height:700px\" src=\"bookreader.png\"/\u003e\n\u003c/a\u003e\n---\n\u003ca href=\"https://archive.org/details/msdos_Pac-Man_1983\"\u003e\n  \u003cimg style=\"height:700px\" src=\"software.png\"/\u003e\n\u003c/a\u003e\n---\n\u003ca href=\"https://archive.org/details/Sita_Sings_the_Blues\"\u003e\n  \u003cimg style=\"height:700px\" src=\"av.png\"/\u003e\n\u003c/a\u003e\n\n---\n# Library!\n- Absolute browser Privacy\n  - no personal data or IP addresses extracted\n- Validation \u0026 nontampering\n  - keep original versions with 2+ checksums and logs\n```xml\n\u003cfile name=\"commute.mp4\" source=\"derivative\"\u003e\n\u003ctitle\u003ecommute\u003c/title\u003e\n\u003cformat\u003eh.264\u003c/format\u003e\n\u003coriginal\u003ecommute.avi\u003c/original\u003e\n\u003cmtime\u003e1325973601\u003c/mtime\u003e\n\u003csize\u003e11919082\u003c/size\u003e\n\u003cmd5\u003eff17ed66e7db5693dd208dd6ac488ff8\u003c/md5\u003e\n\u003ccrc32\u003ead1df03a\u003c/crc32\u003e\n\u003csha1\u003ee9f9de8379cd25653d487ab30d198fc61a050091\u003c/sha1\u003e\n\u003clength\u003e115.61\u003c/length\u003e\n\u003cheight\u003e480\u003c/height\u003e\n\u003cwidth\u003e640\u003c/width\u003e\n\u003c/file\u003e\n```\n---\n## External Blockchain of 
Proofs\n### of file mod times / checksums\n- \u003ca href=\"https://opentimestamps.org/internet-archive\"\u003e\n  OpenTimestamps\u003c/a\u003e\n- uses SHA-1 and Merkle trees\n- by Peter Todd - \u003ca href=\"https://petertodd.org/2017/carbon-dating-the-internet-archive-with-opentimestamps\"\u003e\n  blog\u003c/a\u003e\n- _brand new!_\n\n---\n# archive.org/tv\n- recording 50 - 100 channels\n  - 24 x 7\n  - around the world\n  - since 2000\n- _2 million+ news shows_\n- \u003ca href=\"https://archive.org/tv?q=mozilla+firefox\"\u003esearch captions\u003c/a\u003e/metadata\n- new Trump Administration and Congress subsets\n- citable reference clips\n- \u003ca href=\"https://archive.org/pop\"\u003ePopcorn\u003c/a\u003e editing/mashup clips\n- for AI experiments\n\n---\n# Artificial Intelligence\n- _text_:\n  - chyron (\"lower third\") scanning OCR (Third Eye)\n  - caption alignment\n  - OCR captions from DVB-S\n    - BBC News\n  - speech to text (VoiceBase)\n    - Al Jazeera English\n    - Deutsche Welle English\n- _image_:\n  - public officials facial detection\u003cbr/\u003e (Faceomatic \u003c-- Matroid \u003c-- FaceNet)\n\n---\n# Artificial Intelligence\n- _audio_:\n  - fingerprinting\n    - \u003ca href=\"https://github.com/dpwe/audfprint\"\u003eaudfprint\u003c/a\u003e - free/open like shazam\n    - \u003ca href=\"https://politicalAdArchive.org\"\u003epolitical Ad tracking\u003c/a\u003e\n    - \u003ca href=\"https://github.com/slifty/tvarchive-duplitron\"\u003eDuplitron 5000\u003c/a\u003e\n\n---\n# Public Feeds\n- twitter bots \u0026 TSV\n  - \u003ca href=\"https://archive.org/services/third-eye.php\"\u003eThird Eye\u003c/a\u003e\n- slack bot\n  - \u003ca href=\"https://blog.archive.org/2017/07/19/introducing-face-o-matic/\"\u003eFaceomatic\u003c/a\u003e\n- continuous captions feed from CSPAN\n  - https://openedcaptions.com\n  - https://pietropassarelli.gitbooks.io/textav/projects/opened-captions-service.html\n\n\n---\n\u003cimg style=\"height:155px\" 
src=\"https://archive.org/download/third-eye/third-eye.png\"/\u003e\n- OCR 'lower third'\n  - chyrons\n    - overlaid text on broadcasts\n    - not captions or descriptive text\n    - editorial / summarizing in nature\n- 4 TV channels, 24x7, ~1 min from realtime\n  - CNN\n  - MSNBC\n  - Fox News\n  - BBC News\n\n---\n\u003cdiv style=\"max-width:500px; margin:auto\"\u003e\n  \u003cimg src=\"https://archive.org/download/third-eye/xmsn-full.png\"/\u003e\n  \u003cimg src=\"down.png\"/\u003e\n  \u003cimg src=\"https://archive.org/download/third-eye/xmsn.png\"/\u003e\n  \u003cimg src=\"down.png\"/\u003e\n  \u003cpre\u003e\n  AFTER WH MEETING, SCHUMER DISHES\n  WHEN HE THOUGHT NIC WAS OFF\n  \u003c/pre\u003e\n\u003c/div\u003e\n---\n# bots  \n- twitter bots\n  - https://twitter.com/tvThirdEye\n    - https://twitter.com/tvThirdEyeB\n    - https://twitter.com/tvThirdEyeF\n    - https://twitter.com/tvThirdEyeM\n  - https://twitter.com/tvThirdEye/lists/all\n\n---\n\u003cimg src=\"https://archive.org/download/third-eye/tweet.png\"/\u003e\n\n---\n# API\n- Tab Separated Values\n- https://archive.org/services/third-eye.php\n  - nice for command-line\n  - import to google and excel spreadsheets\n  - filtered\n  - raw (~25MB / day)\n    - more errors\n    - 3rd-party filtering possible\n  - TSV files uploaded to https://archive.org/details/third-eye\n\n---\n# Chyron filtering\n- tesseract OCR\n  - free; errors\n- \u003ca href=\"http://manpages.ubuntu.com/manpages/man1/simhash.1.html\"\u003esimhash\u003c/a\u003e\n  - groups 'nearly the same'\n    - character flips\n    - word off in time\n- look for vowels\n- pick 'most seen' group every minute\n  - and tweet\n\n---\n# TV AI Examples\n- Vox determined Puerto Rico was paid little attention by Fox News\n  - https://vox.com/2017/10/2/16401614/fox-news-puerto-rico-charts\n- audio fingerprints\n  - presented keynote paper on\u003cbr/\u003e \u003ca href=\"http://www.brycejdietrich.com/files/dietrich_schultz_television.pdf\"\u003eCSPAN 
floor speeches and vocal pitch\u003c/a\u003e \u003cbr/\u003eBryce Dietrich, UIowa\n  - discovered 375K political Ads\n  - find sound bites of speeches\n\n---\n# clips\n- little JSON annotations\n- associate metadata to program start/end time range\n- auto expands each clip to a \"synthetic\" document\n  - to elastic search\n- JSONPatch for changes\n- track play counts, some referers\n- allows for _decentralized_ annotations to other IA / research\n\n---\n# clip\n```json\n{\n    \"268.1|269.1\": {\n        \"subject\": [\n            \"Criminal Activity\",\n            \"Crime\"\n        ],\n        \"factcheck\": [\n            \"http://www.factcheck.org/2016/07/factchecking-trumps-big-speech/\"\n        ]\n    },\n    \"266.7|267.2\": {\n        \"ad_id\": \"PolAd_DonaldTrump_d9dsn\",\n        \"type\": \"campaign\",\n        \"race\": \"PRES\",\n        \"cycle\": \"2016\",\n        \"message\": \"pro\",\n        \"sponsor\": [\n            \"Republican National Cmte\"\n        ],\n        \"sponsor_type\": \"PAC\",\n        \"subject\": [\n            \"Job Accomplishments\"\n        ],\n        \"person\": [\n            \"Donald Trump\"\n        ]\n    },\n    \"268.1|269.1\": {\n        \"collection\": [\n            \"nancy_pelosi_archive\"\n        ],\n        \"subject\": [\n            \"Voting\"\n        ]\n    }\n}\n```\n\n---\n# Where We're Going\n- https://archive.org/details/TVNewsKitchen\n- want to serve journalists, researchers, librarians \u0026 more\n- responsible behavior and access to data\n- non-consumptive use\n\n\n---\n## [Part 2] \"There Goes 2 Weeks\"\n## deep dive into Image Matching and\u003cbr/\u003e Facial Recognition\n\n\u003cbr/\u003e\n\u003cbr/\u003e\n\u003csmall\u003e\u003ci\u003eAn imposter does not have Imposter Syndrome\u003c/i\u003e\u003c/small\u003e\n\n---\n# CNNs\n- Convolutional Neural Network\n  - filtered neural network\n- each layer uses output from prior layer as input\n- instead of rule-based learning, use classified 
datasets to learn\n- multi-node connections (but not \"fully connected\")\n- \"data squashers\"\n\n---\n# CNN Example\n- feed in image\n- node looking for eyelash\n- node looking for iris\n  - could feed to node looking for eye\n- meanwhile... nose node\n  - all feed to face recognizer node\n  - could feed to \"is this Barack Obama?\"\n\n\n---\n# Guru\nRik Heijdens from jwplayer\n- \u003ca href=\"https://www.youtube.com/watch?v=oGP7TfaRVlM\"\u003eDemuxed 2017 talk\u003c/a\u003e\n- feed in video - for each _shot_, make 3 vectors:\n  - _image_ Inception CNN (tensorflow)\n  - _audio_ CNN spectrogram\n  - _text_ transcripts/STT into Word2Vec\n- concat vectors, compare (cosine similarity), and graph\n- ... yields _scene detection_\n- all just for ideal Ad insertion!\n\n---\n# Image Matching\n- pixel diff algorithms (MAE, RMSE, MSE)\n- perceptual hashing pHash.org\n  - image =\u003e _8x8 grayscale_\n  - convolve to 8x8 image with DCT\n  - reduce to _64bit_ number\n  - hamming distance Int64 pairs\n\n---\n### pHash - to gray 8x8\n\u003cstyle\u003e .hashes img { width:150px; } \u003c/style\u003e\n\u003cdiv class=\"hashes\"\u003e\n  \u003cimg src=\"hash-images/AaronSwartz.png\"\u003e\n  \u003cimg src=\"hash-images/AaronXimm.png\"\u003e\n  \u003cimg src=\"hash-images/AlexisRossi.png\"\u003e\n  \u003cimg src=\"hash-images/BrewsterKahle.png\"\u003e\n  \u003cimg src=\"hash-images/JudeCohelo.png\"\u003e\n  \u003cimg src=\"hash-images/TraceyJaquith.png\"\u003e\n  \u003cbr/\u003e\n  \u003cimg src=\"hash-images/AaronSwartz.png.jpg\"\u003e\n  \u003cimg src=\"hash-images/AaronXimm.png.jpg\"\u003e\n  \u003cimg src=\"hash-images/AlexisRossi.png.jpg\"\u003e\n  \u003cimg src=\"hash-images/BrewsterKahle.png.jpg\"\u003e\n  \u003cimg src=\"hash-images/JudeCohelo.png.jpg\"\u003e\n  \u003cimg src=\"hash-images/TraceyJaquith.png.jpg\"\u003e\n  \u003cbr/\u003e\n  \u003cimg src=\"hash-images/AaronSwartz-gray.png\"\u003e\n  \u003cimg src=\"hash-images/AaronXimm-gray.png\"\u003e\n  \u003cimg 
src=\"hash-images/AlexisRossi-gray.png\"\u003e\n  \u003cimg src=\"hash-images/BrewsterKahle-gray.png\"\u003e\n  \u003cimg src=\"hash-images/JudeCohelo-gray.png\"\u003e\n  \u003cimg src=\"hash-images/TraceyJaquith-gray.png\"\u003e\n\u003c/div\u003e\n\n---\n# TensorFlow \u0026 Training\n- https://www.tensorflow.org/tutorials/image_recognition\n- trained CNNs, locally run\n- GoogLeNet Inception general classifier\n- retrainable / customizable\n  - redo 'top layer' (Rik idea)\n  - https://www.tensorflow.org/tutorials/image_retraining\n- 2048 multi-byte vectors (floats)\n- iOS smaller single-byte vectors\n- cosine distance comparisons\n- can just compare vectors (and ignore readable classification labels (Rik idea))\n\n---\n# OpenFace\n- implementation of \u003ca href=\"https://arxiv.org/abs/1503.03832\"\u003eFaceNet\u003c/a\u003e\n- https://cmusatyalab.github.io/openface/demo-3-classifier\n- similar to tensorflow (Torch..)\n\n---\n# OpenFace Training\n- 3+ images per person/face\n- avoid 'overfit'\n- align eyes + nose (nostrils?)\n\u003cbr/\u003e\n\u003cimg src=\"https://archive.org/~tracey/train/__aligned/BrewsterKahle/a.png\"/\u003e\n\u003cimg src=\"https://archive.org/~tracey/train/__aligned/BrewsterKahle/d.png\"/\u003e\n\u003cimg src=\"https://archive.org/~tracey/train/__aligned/BrewsterKahle/e.png\"/\u003e\n\u003cimg src=\"https://archive.org/~tracey/train/__aligned/BrewsterKahle/f.png\"/\u003e\n\u003cbr/\u003e\n\u003cimg src=\"https://archive.org/~tracey/train/__aligned/TraceyJaquith/a.png\"/\u003e\n\u003cimg src=\"https://archive.org/~tracey/train/__aligned/TraceyJaquith/b.png\"/\u003e\n\u003cimg src=\"https://archive.org/~tracey/train/__aligned/TraceyJaquith/c.png\"/\u003e\n\u003cimg src=\"https://archive.org/~tracey/train/__aligned/TraceyJaquith/f.png\"/\u003e\n\n---\n# Siamese \"one shot\" CNN recognizers\n- Rik idea\n- _differentiate_ instead of _classify_\n- learns similarity of 2 inputs\n\u003cimg 
src=\"https://cdn-images-1.medium.com/max/1600/1*tzGB6D97tHWR_-NJ8FKknw.jpeg\"/\u003e\n- \u003ca href=\"https://github.com/traceypooh/Facial-Similarity-with-Siamese-Networks-in-Pytorch\"\u003erepo / py notebook\u003c/a\u003e\n---\n\u003cimg style=\"max-height:650px\" src=\"https://cdn-images-1.medium.com/max/1200/1*XzVUiq-3lYFtZEW3XfmKqg.jpeg\"/\u003e\n\n---\n# AI Ethics\n- face tracking only public figures\n- https://www.itic.org/resources/AI-Policy-Principles-FullReport2.pdf\n  - min. government regulation \u0026 access\n  - public/private partner;  diversity/inclusion++\n  - preserve human dignity, rights, freedoms\n  - min. risk to humans;  human control\n  - large datasets -- avoid harmful bias\n- open discussion\n\n\n---\n# Demo Time\n\u003c!-- .slide: data-background=\"https://media.giphy.com/media/F3RKa517VBRg4/giphy.gif\" --\u003e\n\n---\n# Demo Time\n- Siamese network\n- \u003ca href=\"https://www.youtube.com/watch?v=Y94pLw9X2yY\"\u003eminiARchive\u003c/a\u003e\n- \u003ca href=\"https://www.tensorflow.org/mobile/\"\u003etensorflow\u003c/a\u003e\n- google translate\n\n\n---\n# help Shape US with YOUR Thoughts\n- extend/shape our APIs\n- AI ideas\n- research, visualizations\n- tag clips with AI metadata or pointers to Decentralized metadata\n- more!\n\n---\n# Ergo\n_decentralized research and AI \u003cbr/\u003e\nbuilt on top of \u003cbr/\u003e\na library of stable, untampered worldwide TV recordings_\n\n---\n\u003c!-- .slide: data-background=\"https://media.giphy.com/media/q4ICE9wYvOwBG/giphy.gif\" --\u003e\n# The End\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftraceypooh%2Fmozfest17","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftraceypooh%2Fmozfest17","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftraceypooh%2Fmozfest17/lists"}