Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/kimtth/pyspark-tika-text-extraction
Data lake performance tuning for text extraction from a large number of files.
apache-spark apache-tika data-pipeline datalake multithreading pyspark spark tika-python
Last synced: 1 day ago
- Host: GitHub
- URL: https://github.com/kimtth/pyspark-tika-text-extraction
- Owner: kimtth
- Created: 2021-08-28T05:37:16.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2021-11-15T03:51:24.000Z (about 3 years ago)
- Last Synced: 2024-04-16T14:10:22.272Z (8 months ago)
- Topics: apache-spark, apache-tika, data-pipeline, datalake, multithreading, pyspark, spark, tika-python
- Language: Python
- Homepage:
- Size: 261 MB
- Stars: 5
- Watchers: 2
- Forks: 0
- Open Issues: 0