https://github.com/mafiatun/un-webcast-analyzer
AI-powered platform for analyzing UN WebTV sessions with automated transcription, speaker diarization, entity extraction (speakers, countries, SDGs), semantic search, and RAG-based chat interface. Built with Azure OpenAI, Cosmos DB, and Streamlit.
azure azureopenai entity-extraction international-relations rag semantic-search speaker-diarization streamlit un unwebtv
- Host: GitHub
- URL: https://github.com/mafiatun/un-webcast-analyzer
- Owner: MafiAtUN
- License: mit
- Created: 2025-10-22T16:42:47.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2025-10-23T14:03:27.000Z (8 months ago)
- Last Synced: 2025-10-23T16:07:25.236Z (8 months ago)
- Topics: azure, azureopenai, entity-extraction, international-relations, rag, semantic-search, speaker-diarization, streamlit, un, unwebtv
- Language: Python
- Homepage:
- Size: 82 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
# UN WebTV Analysis Platform
AI-powered toolkit for turning United Nations WebTV sessions into structured, research-ready knowledge with automated transcription, entity extraction, analytics, and an interactive chat surface.
## Features
- **UN WebTV ingestion & session catalog**: capture metadata from public session URLs and keep analyses searchable.
- **Transcription with diarization**: leverage Azure OpenAI (GPT-4o Transcribe & Whisper) for high-fidelity, speaker-aware transcripts.
- **Entity & SDG extraction**: identify speakers, countries, organizations, themes, treaties, SDGs, sentiment, and key decisions.
- **Vector-powered semantic search**: index transcript segments in Azure AI Search so they can be retrieved by meaning rather than exact keywords.
- **AI research copilot**: RAG-style chat UI grounded in transcript segments with citations and source timestamps.
- **Analytics & visualizations**: Streamlit dashboards surface speaker participation, topic trends, and geographic coverage.
- **Export & collaboration**: download transcripts, summaries, and analysis artifacts to share with research teams.
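
To make the semantic search and citation features above more concrete, here is a minimal in-memory sketch of the underlying idea: rank transcript segments by cosine similarity to a query vector, then render each hit with its speaker and timestamp. It is an illustration only and not the platform's implementation, which delegates indexing and retrieval to Azure AI Search. The `Segment` class, the helper names, and the hand-written three-dimensional vectors are all assumptions made for this example; real embeddings would come from an embedding model.

```python
import math
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical transcript segment; field names are assumptions."""
    speaker: str
    start_seconds: float
    text: str
    embedding: list[float]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_segments(query_embedding, segments, k=2):
    """Return the k segments most similar to the query, best first."""
    ranked = sorted(
        segments,
        key=lambda s: cosine_similarity(query_embedding, s.embedding),
        reverse=True,
    )
    return ranked[:k]


def format_citation(segment):
    """Render a segment as '[speaker @ HH:MM:SS] text'."""
    minutes, seconds = divmod(int(segment.start_seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"[{segment.speaker} @ {hours:02d}:{minutes:02d}:{seconds:02d}] {segment.text}"


# Toy data: placeholder speakers and made-up vectors, for illustration only.
segments = [
    Segment("Speaker A", 75.0, "Remarks on climate finance.", [0.9, 0.1, 0.0]),
    Segment("Speaker B", 3725.0, "Remarks on peacekeeping budgets.", [0.0, 0.2, 0.9]),
    Segment("Speaker C", 410.0, "Remarks on adaptation funding.", [0.8, 0.3, 0.1]),
]
query = [1.0, 0.2, 0.0]  # stands in for an embedded user question

for hit in top_segments(query, segments):
    print(format_citation(hit))
```

In a RAG-style chat, the top-ranked segments would be passed to the language model as context, and the formatted citations would let a reader jump back to the matching point in the session.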
## Installation & Setup
### Prerequisites
- Python 3.11+
- FFmpeg and ffprobe (e.g., `brew install ffmpeg` on macOS or `sudo apt install ffmpeg` on Ubuntu)
- Azure subscription with access to Azure OpenAI, Speech Services, Cosmos DB, AI Search, and Blob Storage
- Git
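
If you want to verify the local prerequisites before continuing, a small preflight script along these lines may help. It is a sketch that is not shipped with the repository: the function name `check_prerequisites` and the tool list are assumptions based on the bullets above, and it does not check Azure access.

```python
import shutil
import sys

REQUIRED_PYTHON = (3, 11)
REQUIRED_TOOLS = ("ffmpeg", "ffprobe", "git")


def check_prerequisites(python_version=None, which=shutil.which):
    """Return a list of problems; an empty list means every check passed.

    `python_version` and `which` are injectable so the function can be
    exercised without depending on the machine it runs on.
    """
    if python_version is None:
        python_version = tuple(sys.version_info[:2])
    problems = []
    if tuple(python_version[:2]) < REQUIRED_PYTHON:
        problems.append(
            f"Python {REQUIRED_PYTHON[0]}.{REQUIRED_PYTHON[1]}+ required, "
            f"found {python_version[0]}.{python_version[1]}"
        )
    for tool in REQUIRED_TOOLS:
        # shutil.which returns None when the executable is not on PATH.
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems


for issue in check_prerequisites():
    print(f"MISSING: {issue}")
```

The script prints nothing when all local prerequisites are met.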
### 1. Clone the repository
```bash
git clone https://github.com/mafiatun/un-webcast-analyzer.git
cd un-webcast-analyzer
```
### 2. Create and activate a virtual environment
```bash
python3 -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
```
### 3. Install Python dependencies
```bash
pip install --upgrade pip
pip install -r requirements.txt
```
### 4. Configure environment variables
Create a `.env` file (or use your secret manager of choice) with the configuration keys expected by `config/settings.py`. A minimal example:
```bash
APP_NAME="UN WebTV Analysis Platform"
AZURE_OPENAI_API_KEY="..."
AZURE_OPENAI_ENDPOINT="https://<your-openai-resource>.openai.azure.com/"
AZURE_OPENAI_DEPLOYMENT_NAME="gpt-4o-unga"
AZURE_TRANSCRIBE_DIARIZE_DEPLOYMENT_NAME="gpt-4o-transcribe-diarize"
AZURE_SPEECH_KEY="..."
AZURE_SPEECH_REGION="eastus2"
COSMOS_ENDPOINT="https://<your-cosmos-account>.documents.azure.com:443/"
COSMOS_KEY="..."
COSMOS_DATABASE_NAME="untv_analysis"
BLOB_CONNECTION_STRING="DefaultEndpointsProtocol=...;"
BLOB_CONTAINER_AUDIO="audio-temp"
BLOB_CONTAINER_TRANSCRIPTS="transcripts"
SEARCH_ENDPOINT="https://<your-search-service>.search.windows.net"
SEARCH_API_KEY="..."
SEARCH_INDEX_NAME="untv-segments"
```
Refer to `config/settings.py` for the full list of configurable options (deployment names, rate limits, logging paths, etc.).
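
To catch missing keys early, you could validate the environment before starting the app. The sketch below is not the contents of `config/settings.py` (which uses Pydantic settings); it is a standard-library stand-in so it runs without extra dependencies. Which keys count as required, the `MinimalSettings` class, and the fallback defaults (copied from the example values above) are all assumptions.

```python
import os
from dataclasses import dataclass

# Assumption: these keys are treated as mandatory for this illustration.
REQUIRED_KEYS = (
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "COSMOS_ENDPOINT",
    "COSMOS_KEY",
    "BLOB_CONNECTION_STRING",
    "SEARCH_ENDPOINT",
    "SEARCH_API_KEY",
)


@dataclass(frozen=True)
class MinimalSettings:
    """Hypothetical subset of the configuration, for illustration only."""
    azure_openai_endpoint: str
    cosmos_database_name: str
    search_index_name: str


def load_settings(env=None):
    """Build settings from a mapping (defaults to os.environ).

    Raises RuntimeError listing every missing or empty required key.
    """
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing configuration keys: " + ", ".join(missing))
    return MinimalSettings(
        azure_openai_endpoint=env["AZURE_OPENAI_ENDPOINT"],
        # Fallbacks mirror the example .env values; they are assumptions.
        cosmos_database_name=env.get("COSMOS_DATABASE_NAME", "untv_analysis"),
        search_index_name=env.get("SEARCH_INDEX_NAME", "untv-segments"),
    )
```

Note that this reads variables already present in the process environment. It does not parse the `.env` file itself, so the values need to be exported or loaded first (for example with a dotenv loader).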
### 5. Run the Streamlit application
```bash
streamlit run app.py
```
Optional: if you split the API backend and the UI, expose any FastAPI routes with Uvicorn (e.g., `uvicorn backend.api:app --reload`) before launching the UI.
## Project Structure
```
un-webcast-analyzer/
├── app.py # Streamlit entry point
├── pages/ # Additional Streamlit pages (visualizations, catalog, etc.)
├── backend/
│ ├── services/ # Ingestion, audio processing, OpenAI, database helpers
│ ├── models/ # Pydantic data models
│ └── api/ # FastAPI surface (coming soon)
├── config/ # Pydantic settings and configuration helpers
├── scripts/ # Operational scripts (maintenance, utilities)
├── tests/ # Automated test suite
├── docs/ # Architecture and deployment docs (extend as needed)
└── requirements.txt # Python dependencies
```
## Testing & Quality Checks
```bash
pytest # run unit/integration tests
pytest --cov # include coverage reporting
black . # format code
flake8 # lint
mypy . # static type checking
```
Manual diagnostic scripts for Azure integrations live in `scripts/manual/`. Run them directly with `python scripts/manual/<script_name>.py` once your environment is configured.
## Documentation
- [Architecture](ARCHITECTURE.md) – system design and processing pipeline
- Add API specs, deployment runbooks, and contributor guidelines before public release.
## Contributing
Issues and pull requests are welcome. Please open a discussion if you plan significant changes so we can align on direction and Azure resource usage. See [CONTRIBUTING.md](CONTRIBUTING.md) and follow the [Code of Conduct](CODE_OF_CONDUCT.md).
## License
Distributed under the [MIT License](LICENSE).