An open API service indexing awesome lists of open source software.

https://github.com/brian-hepler-phd/mathresearchcompass

An interactive dashboard for exploring mathematical research trends on arXiv
https://github.com/brian-hepler-phd/mathresearchcompass

arxiv bertopic dashboard data-science data-visualization knowledge-discovery llm-applications machine-learning mathematics nlp plotly python research-trends shiny-python topic-modeling

Last synced: 20 days ago
JSON representation

An interactive dashboard for exploring mathematical research trends on arXiv

Awesome Lists containing this project

README

          

# Math Research Compass

![Math Research](https://img.shields.io/badge/Research-Mathematics-blue)
![Topic Modeling](https://img.shields.io/badge/NLP-Topic%20Modeling-green)
![Shiny App](https://img.shields.io/badge/App-Shiny-red)

## Overview

Math Research Compass analyzes arXiv preprints to identify trending research topics across mathematical subfields. This interactive dashboard visualizes topic modeling results from over 121,000 recent mathematics papers, helping researchers and students discover emerging areas and popular research directions.

The application uses advanced natural language processing to cluster semantically related papers and identify coherent research themes. Recent optimizations have improved performance dramatically, reducing loading times from 15-20 seconds to under 5 seconds through database architecture improvements.

**Live Dashboard**: [Math Research Compass](https://brian-hepler-phd.shinyapps.io/mathresearchcompass1/)

## Project Structure

### Core Applications
- `app_v2.py` - Optimized Shiny dashboard with database integration
- `optimized_data_manager.py` - High-performance data layer with caching and connection pooling
- `create_database.py` - Database migration script for converting CSV data to SQLite

### Data Processing Pipeline
- `topic_trends_analyzer.py` - Performs topic modeling analysis on arXiv papers using BERTopic
- `topic_labeling.py` - Enhances topic labels using Claude AI for better readability
- `category_distribution.py` - Analyzes distribution of arXiv categories across topics
- `combined_network_analysis.py` - Collaboration network analysis (in development)

### Configuration Files
- `Procfile` - Heroku deployment configuration
- `requirements.txt` - Minimal production dependencies
- `runtime.txt` - Python version specification

## Data Processing Workflow

### 1. Data Collection and Filtering

The project uses data from the [Kaggle ArXiv dataset](https://www.kaggle.com/datasets/Cornell-University/arxiv), containing approximately 2.7 million arXiv papers. We filter this to focus on mathematics papers from 2020-2025, resulting in 121,391 papers across 31 mathematical subfields.

The dataset includes standard arXiv metadata: paper IDs, titles, abstracts, author lists, publication dates, and category classifications.

### 2. Topic Modeling with BERTopic

The topic modeling pipeline combines several state-of-the-art techniques:

1. Text preprocessing combines paper titles and abstracts
2. Sentence-BERT generates semantic embeddings
3. UMAP reduces dimensionality for efficient clustering
4. HDBSCAN performs density-based clustering to discover topics
5. TF-IDF extraction identifies representative keywords

This process discovered 1,938 distinct topics across the mathematics corpus, with each paper assigned to its most relevant topic.

### 3. AI-Enhanced Topic Labeling

Raw topic keywords are processed through Claude AI to generate human-readable topic descriptions. For example, a topic with keywords like "homotopy", "spectral", "cohomology" becomes "Algebraic Topology - Homotopy Theory and Spectral Sequences".

### 4. Database Architecture

The application migrated from CSV file processing to an optimized SQLite database. Key tables include:

- `topics` - Topic metadata with counts and category classifications
- `papers` - Paper information with pre-processed author formatting
- `topic_keywords` - Ranked keywords for each topic
- `topic_category_distribution` - Category breakdowns within topics
- `topic_top_authors` - Author rankings by paper count per topic

This migration reduced initial loading time by 4-5x and optimized memory usage significantly.

### 5. Category Distribution Analysis

Each topic is analyzed to determine its primary mathematical subfield by calculating the frequency of arXiv categories within that topic's papers. This enables filtering and visualization by mathematical area.

## Dashboard Features

### Overview Page

The main dashboard provides a high-level view of mathematical research topics:

- Summary statistics showing total papers and topics
- Category filtering across 31 math subfields
- Interactive bar chart of top research topics
- Dynamic content that updates based on selected category

### Topic Explorer

The explorer page offers detailed analysis of individual topics:

- Topic selection filtered by mathematical category
- Author rankings showing most prolific contributors
- Category distribution charts showing topic spread across subfields
- Representative paper samples with metadata and arXiv links

All interactions are optimized for sub-second response times through database indexing and intelligent caching.

## Performance Optimizations

The application implements several performance improvements:

- **Database queries** replace CSV file loading, reducing response times to under 0.1 seconds
- **LRU caching** stores frequently accessed data in memory
- **Connection pooling** manages database connections efficiently
- **Lazy loading** only retrieves data when needed by users
- **Indexed queries** on frequently filtered columns

These optimizations support 50+ concurrent users while using less than 1GB of memory.

## Installation and Usage

### Quick Start

```bash
git clone https://github.com/brian-hepler-phd/MathResearchCompass.git
cd MathResearchCompass
pip install -r requirements.txt
python app_v3.py
```

### Database Setup

To recreate the database from raw data:

```bash
python src/create_database.py
python src/optimized_data_manager.py # Test performance
```

## Deployment

The application currently runs on shinyapps.io with plans to migrate to Heroku for improved performance and reliability. The Heroku deployment will provide:

- Professional hosting with 99.95% uptime
- SSL certificates and custom domain support
- Auto-scaling for traffic spikes
- Continuous deployment from GitHub

Migration files are included (`Procfile`, `runtime.txt`) for straightforward deployment.

## Future Development

### Collaboration Network Analysis

Development is underway for comprehensive author collaboration analysis:

- Network graphs showing research partnerships within topics
- Temporal analysis of how collaborations evolve over time
- Author influence metrics and centrality calculations
- Cross-topic collaboration discovery

This will analyze collaboration patterns across all 1,938 topics, providing insights into mathematical research communities.

### Additional Planned Features

- Predictive modeling to forecast emerging research areas
- Citation analysis integration for impact metrics
- Geographic mapping of research activity
- Real-time updates as new papers are published
- API access for programmatic data retrieval

## Technologies

The application is built with:

- **Python 3.11** with optimized dependencies
- **Shiny for Python** for the interactive web interface
- **SQLite** for high-performance data storage
- **BERTopic** for advanced topic modeling
- **Sentence-BERT** for semantic text embeddings
- **UMAP and HDBSCAN** for dimensionality reduction and clustering
- **Plotly** for interactive visualizations
- **NetworkX** for upcoming collaboration analysis

## Performance Metrics

| Metric | Before Optimization | After Optimization | Improvement |
|--------|-------------------|-------------------|-------------|
| Initial Load Time | 15-20 seconds | 2-5 seconds | 4-5x faster |
| Memory Usage | Reduced | <1 GB | Optimized |
| Query Response | N/A | <0.1 seconds | New capability |
| Concurrent Users | 1-2 | 50+ | 25x increase |

## Research Applications

The dashboard serves multiple research use cases:

- **Trend Discovery**: Identify emerging areas within mathematical subfields
- **Literature Review**: Find representative papers and related topics
- **Collaboration Planning**: Discover active researchers in specific areas
- **Academic Planning**: Understand research landscapes for students and early-career researchers
- **Institutional Strategy**: Inform hiring and resource allocation decisions

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Acknowledgments

- ArXiv for providing access to research paper metadata
- Kaggle for hosting the ArXiv dataset
- Anthropic for the Claude API used in topic labeling

## Links

- **Live Dashboard**: [Math Research Compass](https://brian-hepler-phd.shinyapps.io/mathresearchcompass/)
- **Creator's Website**: [bhepler.com](https://bhepler.com)
- **GitHub Repository**: [MathResearchCompass](https://github.com/brian-hepler-phd/MathResearchCompass)