https://github.com/nirantk/textexplorer
Modern Text Exploration Library
Last synced: 4 days ago
- Host: GitHub
- URL: https://github.com/nirantk/textexplorer
- Owner: NirantK
- License: apache-2.0
- Created: 2024-10-29T11:31:53.000Z (15 days ago)
- Default Branch: main
- Last Pushed: 2024-10-30T14:19:25.000Z (14 days ago)
- Last Synced: 2024-10-30T15:26:28.991Z (14 days ago)
- Language: Python
- Size: 5.89 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
README
# textexplorer
Modern Text Exploration Library

A Python library for advanced text analysis and exploration, primarily focused on PDF documents. This library combines various NLP techniques to extract insights from text data.
> ⚠️ This is a work in progress and should not be used in production.
## Features
- PDF text extraction
- Advanced text analysis using spaCy
- Text statistics (Flesch-Kincaid grade level, type-token ratio)
- Extraction of:
  - Out of vocabulary words
  - Noun phrases
  - Common words
  - Keywords using TextRank
- Text clustering capabilities (with OpenAI embeddings)
- Visualization of text clusters

## Installation
```bash
uv pip install text_explorer
```

## Dependencies
- spaCy (with `en_core_web_lg` model)
- OpenAI API key (for embeddings)
- Other dependencies will be installed automatically

## Usage
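Before running the examples, it can help to confirm that the two prerequisites listed under Dependencies are in place. The following is a minimal sketch, not part of `text_explorer`: the helper name `missing_prerequisites` is hypothetical, and it assumes the spaCy model is installed as an importable package and that the OpenAI key is supplied through the `OPENAI_API_KEY` environment variable.

```python
import importlib.util
import os


def missing_prerequisites(env=None, model_name="en_core_web_lg"):
    """Return a list of human-readable descriptions of missing prerequisites.

    Hypothetical helper for illustration only; it is not provided by text_explorer.
    """
    env = os.environ if env is None else env
    missing = []

    # spaCy models are installed as regular Python packages, so an import
    # lookup is enough to tell whether the model is available.
    if importlib.util.find_spec(model_name) is None:
        missing.append(f"spaCy model '{model_name}'")

    # Assumption: the OpenAI key is read from this environment variable.
    if not env.get("OPENAI_API_KEY"):
        missing.append("OPENAI_API_KEY environment variable")

    return missing
```

If the spaCy model is reported as missing, it can usually be installed with `python -m spacy download en_core_web_lg`.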
### Basic Text Analysis
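For readers unfamiliar with the two text statistics used in this section, here is a rough, dependency-free sketch of what they measure. It is not the library's implementation: the function names only mirror the method names for readability, the tokenization is a simple regular expression, and the syllable count is a crude vowel-group heuristic, so the numbers will likely differ from what `TextProcessor` reports.

```python
import re


def count_syllables(word: str) -> int:
    """Crude syllable estimate: count runs of consecutive vowels (at least 1)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


def type_token_ratio(text: str) -> float:
    """Share of distinct words (types) among all words (tokens), case-insensitive."""
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    return len(set(words)) / len(words) if words else 0.0
```

A higher grade level suggests harder-to-read text, and a type-token ratio closer to 1 suggests a more varied vocabulary.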
```python
from text_explorer import PDFExtractor, TextProcessor

# Extract text from PDF
pdf_extractor = PDFExtractor()
text = pdf_extractor.extract_text("your_document.pdf")

# Initialize text processor
text_processor = TextProcessor(model_name="en_core_web_lg", text=text)

# Get out of vocabulary words (useful for finding PDF parsing errors)
oov_words = text_processor.extract_top_oov_words(k=10)

# Extract noun phrases
noun_phrases = text_processor.extract_top_noun_phrases(k=10)

# Get most common words
common_words = text_processor.extract_top_words(k=10)

# Extract keywords using TextRank
keywords = text_processor.extract_keywords(k=10)

# Get readability metrics
grade_level = text_processor.flesch_kincaid_grade()
token_ratio = text_processor.type_token_ratio()
```

### Clustering (Work in Progress)
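The clustering example in this section passes chunk embeddings to scikit-learn's `KMeans`. As a rough illustration of what that step does, the sketch below implements plain k-means (Lloyd's algorithm) in pure Python on toy 2D vectors that stand in for embeddings. The `kmeans` function and the sample points are assumptions made for illustration; they are not part of `text_explorer` or scikit-learn.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=42):
    """Cluster equal-length float vectors with Lloyd's algorithm.

    Returns one integer label per input point.
    """
    rng = random.Random(seed)
    # Start from k distinct input points as initial centroids.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)

    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]

    return labels


# Toy vectors standing in for chunk embeddings: two well-separated groups.
toy_vectors = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
               [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
toy_labels = kmeans(toy_vectors, k=2)
```

With real embeddings the vectors have hundreds or thousands of dimensions, but the assignment and update steps are the same.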
```python
from text_explorer import ClusterVisualizer
from sklearn.cluster import KMeans
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

# Initialize components
# Assumption: `embeddings` is the embedding model instance reused by the calls below.
embeddings = OpenAIEmbeddings()
text_splitter = SemanticChunker(embeddings)
clustering_function = KMeans(n_clusters=5, random_state=42)
cluster_visualizer = ClusterVisualizer(clustering_function=clustering_function)

# Split text into chunks (`text` comes from the Basic Text Analysis example)
docs = text_splitter.create_documents([text])

# Cluster and visualize
df_clusters = cluster_visualizer.cluster_texts(docs, embeddings)
df_labeled = cluster_visualizer.label_clusters(df_clusters)
cluster_visualizer.visualize_clusters(df_labeled, embeddings)
```

## License & Contributing
See [LICENSE](LICENSE) (Apache 2.0) and [CONTRIBUTING.md](CONTRIBUTING.md) for more details.
Contributions are welcome!