https://github.com/vforteli/azuredatalakeindexer
Azure Datalake gen2 search indexer fiddling
- Host: GitHub
- URL: https://github.com/vforteli/azuredatalakeindexer
- Owner: vforteli
- Created: 2023-09-01T14:01:51.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-06-18T19:14:40.000Z (5 months ago)
- Last Synced: 2024-06-19T07:14:00.689Z (5 months ago)
- Language: C#
- Size: 427 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# AzureDataLakeIndexer
Azure Datalake gen2 search indexer fiddling
This project exists because the built-in Data Lake indexers in Azure Search are slow... mind-numbingly slow.
The indexers seem to spend most of their time listing paths in the data lake. This project solves that by using a helper index for paths: querying this index for modified files is much faster than listing paths in the data lake.
The built-in indexers also have a habit of forgetting to renew their access tokens, even when using RBAC, which causes indexing to fail.

## Overview
```mermaid
flowchart LR
    datalake[(DataLake)]

    datalake-->|BlobCreated|pathfunc
    datalake-->|BlobDeleted|pathfunc
    pathfunc-->|UpsertPaths|pathindex
    pathfunc-->|UpsertPaths|deletedpathindex

    subgraph FuncHost
        pathfunc{{PathFunc}}
        indexerfunc{{IndexerFunc}}
    end

    subgraph AzureSearch
        pathindex[(Path index)]
        deletedpathindex[(Deleted Path index)]
        dataindex[(Data index)]
    end

    pathindex-->|ListPaths|indexerfunc
    datalake-->|Readdocs|indexerfunc
    indexerfunc-->|UpsertData|dataindex
```
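
The flow in the diagram can be modelled with a small in-memory sketch. This is not the project's code (which is C# and talks to Azure): it is a Python illustration in which plain dictionaries stand in for the three search indexes and for the data lake. The names `PathEntry`, `InMemorySearch`, `path_func` and `indexer_func` are hypothetical, as are the assumptions that `BlobCreated` events go to the path index, that `BlobDeleted` events go to the deleted path index, and that the indexer keeps a "modified since" checkpoint.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class PathEntry:
    """One row in the (hypothetical) helper path index."""
    path: str
    last_modified: datetime


@dataclass
class InMemorySearch:
    """Dictionaries standing in for the three indexes in the diagram."""
    path_index: dict = field(default_factory=dict)          # path -> PathEntry
    deleted_path_index: dict = field(default_factory=dict)  # path -> PathEntry
    data_index: dict = field(default_factory=dict)          # path -> document


def path_func(search, event_type, path, timestamp):
    """PathFunc: record a storage event in the matching path index.

    Assumption: created blobs go to the path index and deleted blobs
    go to the deleted path index.
    """
    entry = PathEntry(path, timestamp)
    if event_type == "BlobCreated":
        search.path_index[path] = entry
    elif event_type == "BlobDeleted":
        search.deleted_path_index[path] = entry
    else:
        raise ValueError(f"unsupported event type: {event_type}")


def indexer_func(search, datalake, modified_since):
    """IndexerFunc: find modified paths via the path index, read, upsert.

    Only the path index is queried; the data lake is never listed.
    Returns the new checkpoint (latest modification time seen).
    """
    modified = [
        entry for entry in search.path_index.values()
        if entry.last_modified > modified_since
    ]
    for entry in modified:
        # Assumption: skip paths that no longer exist in the data lake.
        if entry.path in datalake:
            search.data_index[entry.path] = datalake[entry.path]
    return max((e.last_modified for e in modified), default=modified_since)


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
    search = InMemorySearch()
    datalake = {"a.json": "alpha", "b.json": "beta"}

    path_func(search, "BlobCreated", "a.json", t0 + timedelta(hours=1))
    path_func(search, "BlobCreated", "b.json", t0 + timedelta(hours=2))

    # Only b.json was modified after the checkpoint, so only it is indexed.
    checkpoint = indexer_func(search, datalake, t0 + timedelta(hours=1))
    print(sorted(search.data_index))  # prints ['b.json']
```

The point of the sketch is the filter in `indexer_func`: finding changed files becomes a query against the path index rather than a walk over every path in the data lake, which is where the README says the built-in indexers lose most of their time.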