https://github.com/astrolabsoftware/fink-cutout-api
Internal API to retrieve cutouts from the Fink data lake on HDFS
- Host: GitHub
- URL: https://github.com/astrolabsoftware/fink-cutout-api
- Owner: astrolabsoftware
- License: apache-2.0
- Created: 2024-11-22T08:37:06.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-12-19T07:13:07.000Z (over 1 year ago)
- Last Synced: 2025-01-24T14:47:46.427Z (over 1 year ago)
- Language: Python
- Size: 96.7 KB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Fink cutout API

This API is used internally by Fink web components to retrieve cutouts from the data lake on HDFS. We take advantage of the pyarrow connector to read parquet files to efficiently extract required cutouts from an HDFS block.
## Motivation
Each night, Fink processes alerts and stores them in the datalake as parquet files. Additionally, we transferred the data to HBase tables to support various services. This approach allowed efficient data exposure once the data was in the tables.
However, this has led to numerous issues that outweigh the benefits. One of the most critical tasks during data writing is populating the main table, which includes both lightcurve and cutout data. HBase frequently crashes for extended periods, sometimes lasting minutes or even hours, resulting in downtime for all services. Furthermore, cutout data constitutes a significant portion of our storage needs (XX%, equating to YYTB), yet most of it is rarely accessed.
Since cutouts are already stored in the datalake, if we can develop a quick access method, we can eliminate the need to transfer them to tables. This change would also lead to substantial savings in storage space and costs. This observation has led to the development of this API.
The fundamental mechanism is to store cutout metadata in HBase and leverage the capabilities of `pyarrow` to read parquet files stored in HDFS. This is slower than accessing cutouts from the HBase table:
| Backend | Time per cutout (s) |
|--------|--------------|
| HBase | 0.5 |
| pyarrow/HDFS | 1.2 |
But being slower by a factor of about 2.4 is acceptable compared to the hundreds of TB saved (an estimated 30k EUR saved over the duration of the project!).
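To illustrate the mechanism, here is a minimal sketch of reading one cutout from a parquet file with `pyarrow`, not the code used by the API itself. It assumes the ZTF alert layout, where each `cutout<Kind>` column is a struct with a `stampData` field; the names `cutout_column` and `read_stamp` are hypothetical.

```python
CUTOUT_KINDS = ("Science", "Template", "Difference")


def cutout_column(kind):
    """Map a cutout kind to its parquet column name (assumed ZTF layout)."""
    if kind not in CUTOUT_KINDS:
        raise ValueError(f"kind must be one of {CUTOUT_KINDS}, got {kind!r}")
    return "cutout" + kind


def read_stamp(path, object_id, kind="Science", filesystem=None):
    """Return the raw stamp bytes of the first alert matching object_id.

    Only the two required columns are read, and the filter is pushed
    down to the parquet reader so that non-matching rows are skipped.
    """
    # Imported here so the helper above stays usable without pyarrow.
    import pyarrow.parquet as pq

    column = cutout_column(kind)
    table = pq.read_table(
        path,
        columns=["objectId", column],
        filters=[("objectId", "==", object_id)],
        filesystem=filesystem,
    )
    if table.num_rows == 0:
        raise LookupError(f"no alert found for {object_id!r} in {path!r}")
    # Assumption: the struct exposes the image bytes under "stampData".
    return table.column(column)[0].as_py()["stampData"]
```

With a local file, `read_stamp("alerts.parquet", "ZTF24abssjsb")` is enough. To read from HDFS, one would presumably pass a `pyarrow.fs.HadoopFileSystem` instance as `filesystem`, which requires the Hadoop and Java setup described in the requirements.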
## Requirements and installation
You will need Python (>=3.11) with the requirements listed in `requirements.txt`. You will also need Hadoop installed on the machine, and Java 11 or later. For the full installation and deployment, refer to the [procedure](install/README.md).
## Usage
To deploy the API, you need access to the Fink HDFS cluster. Once `config.yml` is filled in, you can test the API using:
```bash
python app.py
```
In production, we deploy using gunicorn (see the [procedure](install/README.md)).
## Accessing 2D cutout
In Python, you can simply retrieve the 2D cutout stored in the alert using:
```python
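import json
import requests

# URL (base address of the deployed API) and HDFS_PATH (parquet file
# holding the alert) are placeholders to be defined beforehand.
r = requests.post(
    "{}/api/v1/cutouts".format(URL),
    json={
        "hdfsPath": HDFS_PATH,
        "kind": "Science",
        "objectId": "ZTF24abssjsb",
        "return_type": "array",
    },
)

# The response body is a JSON list; the first element is the 2D cutout
cutout = json.loads(r.content)[0]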
```
Note that `HDFS_PATH` should be a path relative to the user home folder on HDFS, e.g.:
```diff
- NO: hdfs://IP:PORT/user/toto/somefolder/myparquet.parquet
+ YES: somefolder/myparquet.parquet
```
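If you only have the full URI at hand, a small helper can strip the prefix. This is a sketch under the assumption that the home folder is `/user/<username>`, as in the example above; the function name `to_relative_hdfs_path` is hypothetical and not part of the API.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def to_relative_hdfs_path(uri, user):
    """Convert a full HDFS URI into a path relative to the user home.

    Paths that are already relative are returned unchanged.
    """
    parsed = urlparse(uri)
    if parsed.scheme not in ("", "hdfs"):
        raise ValueError(f"unsupported scheme in {uri!r}")

    path = PurePosixPath(parsed.path)
    if not path.is_absolute():
        return str(path)

    # Assumption: the HDFS home folder follows the /user/<name> layout.
    home = PurePosixPath("/user") / user
    try:
        return str(path.relative_to(home))
    except ValueError:
        raise ValueError(f"{uri!r} is not located under {home}") from None
```

For instance, `to_relative_hdfs_path("hdfs://IP:PORT/user/toto/somefolder/myparquet.parquet", "toto")` gives `somefolder/myparquet.parquet`.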
If you choose `kind=All`, all 3 cutouts (`Science`, `Template`, `Difference`, in that order) will be returned.
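Since the order is fixed, the three cutouts can be labelled on the client side. The sketch below assumes that the decoded response is a list of three 2D arrays in the order given above; `label_cutouts` is a hypothetical helper, not something provided by the API.

```python
CUTOUT_ORDER = ("Science", "Template", "Difference")


def label_cutouts(payload):
    """Map a decoded `kind=All` response to a {kind: cutout} dictionary."""
    if len(payload) != len(CUTOUT_ORDER):
        raise ValueError(
            f"expected {len(CUTOUT_ORDER)} cutouts, got {len(payload)}"
        )
    # Pair each cutout with its name, relying on the documented order
    return dict(zip(CUTOUT_ORDER, payload))
```

With a response `r` obtained as above using `"kind": "All"`, `label_cutouts(json.loads(r.content))["Template"]` would then give the template cutout.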
## Original FITS
In addition to `array`, we support the `FITS` file format. Note that in this case, `kind` should be a single flavor (i.e. cannot be `All`).
You can easily retrieve the original FITS file stored in the alert from your terminal:
```bash
curl -H "Content-Type: application/json" \
    -X POST -d \
    '{"hdfsPath": "HDFS_PATH", "objectId": "ZTF24abssjsb", "kind": "Science", "return_type": "FITS"}' \
    URL/api/v1/cutouts -o cutoutScience.fits
```
or in Python:
```python
import io
import requests
from astropy.io import fits

r = requests.post(
    "{}/api/v1/cutouts".format(URL),
    json={
        "hdfsPath": HDFS_PATH,
        "kind": "Science",
        "objectId": "ZTF24abssjsb",
        "return_type": "FITS",
    },
)
cutout = fits.open(io.BytesIO(r.content), ignore_missing_simple=True)
```
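Because `FITS` cannot be combined with `All`, it can be convenient to check the request before sending it. The following is a client-side sketch only: `build_cutout_payload` is a hypothetical helper, and the server may perform its own validation differently.

```python
VALID_KINDS = ("Science", "Template", "Difference", "All")
VALID_RETURN_TYPES = ("array", "FITS")


def build_cutout_payload(hdfs_path, object_id, kind="Science", return_type="array"):
    """Build the JSON body for /api/v1/cutouts, rejecting invalid combinations."""
    if kind not in VALID_KINDS:
        raise ValueError(f"kind must be one of {VALID_KINDS}, got {kind!r}")
    if return_type not in VALID_RETURN_TYPES:
        raise ValueError(
            f"return_type must be one of {VALID_RETURN_TYPES}, got {return_type!r}"
        )
    # A FITS response holds a single image, so `All` is not allowed here
    if return_type == "FITS" and kind == "All":
        raise ValueError("return_type 'FITS' requires a single kind, not 'All'")
    return {
        "hdfsPath": hdfs_path,
        "objectId": object_id,
        "kind": kind,
        "return_type": return_type,
    }
```

The returned dictionary can be passed directly as the `json` argument of `requests.post`.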
## Tests
Once the app is deployed (credentials are required), simply run:
```bash
python tests/api_test.py URL
```
## Performance
For a single object, we obtain about 0.17 +/- 0.02 seconds for a single cutout, and 0.31 +/- 0.02 seconds for the 3 cutouts. Note that this depends highly on the block size on HDFS and on the load of the file system. For ZTF, the block size in the datalake is about 100 MB.
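To reproduce this kind of measurement on your own deployment, a simple timing loop is enough. The sketch below is one possible approach, not necessarily how the numbers above were obtained; `time_calls` is a hypothetical helper that reports the mean and sample standard deviation over repeated calls.

```python
import statistics
import time


def time_calls(func, repeats=10):
    """Call func() `repeats` times; return (mean, stdev) of durations in seconds."""
    if repeats < 2:
        raise ValueError("at least 2 repeats are needed for a standard deviation")
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.stdev(durations)
```

Here `func` would typically wrap a single `requests.post` call to `/api/v1/cutouts`, so that network latency is included in the timing.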