{"id":22317240,"url":"https://github.com/boringppl/boringppl-meeting-summarization","last_synced_at":"2025-08-22T08:12:29.454Z","repository":{"id":112162707,"uuid":"288485062","full_name":"boringPpl/boringppl-meeting-summarization","owner":"boringPpl","description":null,"archived":false,"fork":false,"pushed_at":"2020-08-24T19:03:47.000Z","size":29319,"stargazers_count":3,"open_issues_count":2,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-06-06T05:02:36.088Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/boringPpl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-08-18T14:53:04.000Z","updated_at":"2021-01-23T15:22:30.000Z","dependencies_parsed_at":"2023-03-26T14:48:59.236Z","dependency_job_id":null,"html_url":"https://github.com/boringPpl/boringppl-meeting-summarization","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/boringPpl/boringppl-meeting-summarization","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/boringPpl%2Fboringppl-meeting-summarization","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/boringPpl%2Fboringppl-meeting-summarization/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/boringPpl%2Fboringppl-meeting-summarization/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/boringPpl%2Fboringppl-meeting-sum
marization/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/boringPpl","download_url":"https://codeload.github.com/boringPpl/boringppl-meeting-summarization/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/boringPpl%2Fboringppl-meeting-summarization/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":271606595,"owners_count":24788979,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-22T02:00:08.480Z","response_time":65,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-03T23:09:10.728Z","updated_at":"2025-08-22T08:12:29.444Z","avatar_url":"https://github.com/boringPpl.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Meeting Summarization\n\n## Environment\n- \u003ePython3.6\n- pipenv\n\n## Installation (Voice Identity)\n```shell\ngit clone https://github.com/free-soellingeraj/boringppl-meeting-summarization.git\ncd boringppl-meeting-summarization\npipenv install\nrclone config # follow the prompts to configure your Google Drive remote\nrclone sync \"your-rclone-drive-name:voice-identity-models/dataset/oyez/full_audios/\" \"full_audios\" --drive-shared-with-me\nrclone sync \"your-rclone-drive-name:voice-identity-models/dataset/oyez/transcripts/\" \"transcripts\" --drive-shared-with-me\n```\n\n## Voice 
identity modeling\nThe task is to develop a voice identification capability.  The most successful voice identity model, or system of models, will associate each \"utterance\" in the audio with the person who made it (the \"speaker\").  \nPlease see the `voice-identity-modeling/Design` document.\n\n### Oyez Dataset\nThe \"Oyez dataset\" comprises thousands of recordings of US Supreme Court oral proceedings.  This dataset enables a variety of experiments in automated conversational analysis in a somewhat controlled environment.  For more information, see: https://www.oyez.org/about  \n\nSee `voice-identity-modeling/oyez`  \n\n#### Data Access\nFirst, download the `transcripts` and `full_audios` folders to a directory of your choosing.  I recommend using rclone (https://rclone.org/) to interact with Google Drive.  See the file called `rclone_access_examples`  \n\nNote: the transcripts folder is currently \u003e2GB compressed.  It will take some time to download.\n\nSee `example_data_access.ipynb`  \nThat example expects a directory structure as follows:\n```\nroot\n    |- example_data_access.ipynb\n    |- full_audios/\n    |- transcripts/\n```\nwhere seed data for the directories can be found in the Google Drive folder.\n\n#### Sourcing the Oyez dataset\nWhen sourcing the dataset, we found that the transcripts were embedded in dynamically populated HTML documents.  A Selenium script was constructed (see: path/to/script) to load each page and extract the transcripts.  
These documents are minimally structured and indexed in files called \"{}_tranny.pickle\"  \nThe script explores the URLs from  \nhttps://apps.oyez.org/player/#/roberts10/oral_argument_audio/12989  \nto  \nhttps://apps.oyez.org/player/#/roberts10/oral_argument_audio/25052  \n\n##### Characteristics of transcript files \n1) filename: {OyezRecordingId}_tranny.{compression}  \n2) file contents\n  - failures: list of failed TranscriptSections (see the definition below)\n  - transcript: chronologically ordered OrderedDict (key: md5(TranscriptSection.transcript+TranscriptSection.start_time+TranscriptSection.stop_time), values: TranscriptSection)\n  - files: list of audio files for audio sourcing\n\n3) TranscriptSection: a data object recorded each time a new speaker speaks in the conversation\n  semantic key: md5(TranscriptSection.transcript+TranscriptSection.start_time+TranscriptSection.stop_time)\n  - raw: HTML from which the data was extracted\n  - case_name: String, the title of the court case associated with the overall recording\n  - conv_type: String, the type of oral proceeding \n  - conv_date: Date, on which the proceeding took place\n  - speaker: String, the individual who is speaking (the \"speaker\")\n  - start_time: String, a floating-point number that locates the beginning of the TranscriptSection.transcript in the audio\n  - stop_time: String, a floating-point number that locates the end of the TranscriptSection.transcript in the audio\n\n#### Gaps in Dataset\nThe Oyez dataset has limitations compared to the kinds of challenges that exist in the real world (e.g. Zoom calls):\n1) Highly structured conversation, which eliminates interruptions, cross-talk, and most filler utterances (huh, uh huh, ya, and, ...)\n2) Recordings were post-processed by Oyez with enhanced digital filtering to improve sound quality\n3) The field conv_date is not precise; it states only the day.  
Therefore, \"real-time\" conversation linking to external data sources is not realistically possible.\n\n\n### Using Selenium to scrape\n\n#### Setup\n1) Download and install selenium with pip.\n2) Download chromedriver from https://sites.google.com/a/chromium.org/chromedriver/home\n3) Run `xattr -d com.apple.quarantine /Users/free-soellingeraj/Downloads/chromedriver` (on macOS, this removes the quarantine attribute from the downloaded binary).\n4) Pass that chromedriver path into the MuncherySpider.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fboringppl%2Fboringppl-meeting-summarization","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fboringppl%2Fboringppl-meeting-summarization","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fboringppl%2Fboringppl-meeting-summarization/lists"}