{"id":13575343,"url":"https://github.com/ucbrise/flor","last_synced_at":"2025-04-05T22:06:32.504Z","repository":{"id":47376309,"uuid":"129949110","full_name":"ucbrise/flor","owner":"ucbrise","description":"🌻 FlorFlow: Flor, now with Dataflow","archived":false,"fork":false,"pushed_at":"2024-03-27T15:01:04.000Z","size":46852,"stargazers_count":146,"open_issues_count":4,"forks_count":17,"subscribers_count":13,"default_branch":"main","last_synced_at":"2024-05-29T13:35:48.942Z","etag":null,"topics":["airflow","build","dag","deep-learning","flor","hindsight","logger","logging","machine-learning","ml","pytorch","tensorboard","vldb"],"latest_commit_sha":null,"homepage":"https://rlnsanz.github.io","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ucbrise.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-04-17T18:35:37.000Z","updated_at":"2024-06-25T16:16:16.274Z","dependencies_parsed_at":"2023-10-15T14:50:48.888Z","dependency_job_id":"6315bcce-4570-46bd-a89a-a55815377b04","html_url":"https://github.com/ucbrise/flor","commit_stats":{"total_commits":840,"total_committers":17,"mean_commits":"49.411764705882355","dds":"0.27380952380952384","last_synced_commit":"f0996e27e5383fc6023a647f7fd3a500e255beed"},"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ucbrise%2Fflor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ucbrise%2Fflor/tags","releases_url":"https://repos.ecosyste.ms/api/
v1/hosts/GitHub/repositories/ucbrise%2Fflor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ucbrise%2Fflor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ucbrise","download_url":"https://codeload.github.com/ucbrise/flor/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247406087,"owners_count":20933803,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["airflow","build","dag","deep-learning","flor","hindsight","logger","logging","machine-learning","ml","pytorch","tensorboard","vldb"],"created_at":"2024-08-01T15:01:00.203Z","updated_at":"2025-04-05T22:06:32.488Z","avatar_url":"https://github.com/ucbrise.png","language":"Jupyter Notebook","funding_links":[],"categories":["Jupyter Notebook","Python","Model, Data and Experiment Tracking"],"sub_categories":[],"readme":"Flow with FlorDB\n================================\n[![PyPI](https://img.shields.io/pypi/v/flordb.svg?nocache=1)](https://pypi.org/project/flordb/)\n\nFlorDB is a nimble hindsight logging database that simplifies how we manage context in the AI and machine learning lifecycle.\nWe center our approach on the developer-favored technique for generating metadata — log statements — leveraging the fact that logging creates context. \nFlorDB is designed to integrate seamlessly with your existing workflow. 
\nWhether you're using Make for basic automation, Airflow for complex pipelines, MLFlow for experiment tracking, or Slurm for cluster management – FlorDB works alongside all of them.\nThe goals of FlorDB are as follows:\n\n1. **Faster, More Flexible Experimentation:** users can quickly iterate on model training, and track hyper-parameters without worrying about missing something, thanks to hindsight logging.\n1. **Better Reproducibility and Provenance:** by capturing the full history and lineage (from code changes to model checkpoints and build DAGs), FlorDB ensures that every step in your workflow is traceable and versioned, making it easy to replicate experiments, validate outcomes, and maintain end-to-end transparency over the entire AI/ML lifecycle.\n1. **Long Term Maintainability:** FlorDB provides a single robust system for logging, storing, and retrieving all the context/metadata needed for anyone to manage AI/ML projects over their full lifecycle.\n\n## Quick Start Video\nFor a walkthrough of FlorDB's features and how to get started, check out our tutorial video:\n\n[▶️ Watch the FlorDB Tutorial on YouTube](https://youtu.be/mKENSkk3S4Y?si=urRHD6wk9PawsYqQ)\n\n## Installation\nTo install the latest stable version of FlorDB, run:\n\n```bash\npip install flordb\n```\n\n### Development Installation\n\nFor developers who want to contribute, are co-authors on a FlorDB manuscript and plan to run experiments, or need the latest features, install directly from the source:\n\n```bash\ngit clone https://github.com/ucbrise/flor.git\ncd flor\npip install -e .\n```\n\nTo keep your local copy up-to-date with the latest changes, remember to regularly pull updates from the repository (from within the `flor` directory):\n\n```bash\ngit pull origin\n```\n\n## Just start logging\n\nFlorDB is designed to be easy to use. 
\nYou don't need to define a schema, or set up a database.\nJust start logging your runs with a single line of code:\n\n```python\nimport flor\nflor.log(\"msg\", \"Hello, World!\")\n```\n```\nmsg: Hello, World!\nChanges committed successfully\n```\n\nYou can read your logs with a Flor Dataframe:\n\n```python\nimport flor\nflor.dataframe(\"msg\")\n```\n![msg dataframe](img/just_start.png)\n\n## Logging your experiments\nFlorDB has a low floor, but a high ceiling. \nYou can start logging with a single line of code, but you can also log complex experiments with many hyper-parameters and metrics.\n\nHere's how you can modify your existing PyTorch training script to incorporate FlorDB logging:\n\n\n```python\nimport flor\nimport torch\n\n# Define and log hyper-parameters\nhidden_size = flor.arg(\"hidden\", default=500)\nbatch_size = flor.arg(\"batch_size\", 32)\nlearning_rate = flor.arg(\"lr\", 1e-3)\n...\n\n# Initialize your data loaders, model, optimizer, and loss function\ntrainloader: torch.utils.data.DataLoader\ntestloader:  torch.utils.data.DataLoader\noptimizer:   torch.optim.Optimizer\nnet:         torch.nn.Module\ncriterion:   torch.nn.modules.loss._Loss\n\n# Use FlorDB's checkpointing to manage model states\nwith flor.checkpointing(model=net, optimizer=optimizer):\n    for epoch in flor.loop(\"epoch\", range(num_epochs)):\n        for data in flor.loop(\"step\", trainloader):\n            inputs, labels = data\n            optimizer.zero_grad()\n            outputs = net(inputs)\n            loss = criterion(outputs, labels)\n            loss.backward()\n            optimizer.step()\n\n            # Log the loss value for each step\n            flor.log(\"loss\", loss.item())\n\n        # Evaluate the model on the test set\n        eval(net, testloader)\n```\n\n### Logging hyper-parameters\nAs shown above, you can log hyper-parameters with `flor.arg`:\n\n```python\n# Define and log hyper-parameters\n\nhidden_size = flor.arg(\"hidden\", default=500)\nbatch_size = 
flor.arg(\"batch_size\", 32)\nlearning_rate = flor.arg(\"lr\", 1e-3)\n...\nseed = flor.arg(\"seed\", default=randint(1, 10000))\n\n# Set the random seed for reproducibility\ntorch.manual_seed(seed)\n```\n\nWhen the experiment is run, the hyper-parameters are logged, and their values are stored in FlorDB.\n\nDuring replay, `flor.arg` reads the values from the database, so you can easily reproduce the experiment.\n\n### Setting hyper-parameters from the command line\nYou can set the value of any `flor.arg` from the command line:\n```bash \npython train.py --kwargs hidden=250 lr=5e-4\n```\n\n### Viewing your experiment history\nTo view the hyper-parameters and metrics logged during training, you can use the `flor.dataframe` function:\n\n```python\nimport flor\nflor.dataframe(\"hidden\", \"batch_size\", \"lr\", \"loss\")\n```\n![loss dataframe](img/loss_df.png)\n\n\n\n## Hindsight Logging for when you miss something\nHindsight logging is a post-hoc analysis practice that involves adding logging statements *after* encountering a surprise, and efficiently re-training with more logging as needed. FlorDB supports hindsight logging across multiple versions with its record-replay sub-system.\n\n### Clone a sample repository\nTo demonstrate hindsight logging, we will use a sample repository that contains a simple PyTorch training script. 
Let's clone the repository and install the requirements:\n\n```bash\ngit clone https://github.com/rlnsanz/ml_tutorial.git\ncd ml_tutorial\nmake install\n```\n\n### Record the first two runs\nOnce you have the repository cloned, and the dependencies installed, you can record the first run with FlorDB:\n\n```bash\npython train.py\n```\n```bash\nCreated and switched to new branch: flor.shadow\ndevice: cuda\nseed: 9288\nhidden: 500\nepochs: 5\nbatch_size: 32\nlr: 0.001\nprint_every: 500\nepoch: 0, step: 500, loss: 0.5111837387084961\nepoch: 0, step: 1000, loss: 0.33876052498817444\n...\nepoch: 4, step: 1500, loss: 0.5777633786201477\nepoch: 4, val_acc: 90.95  \n5it [00:23,  4.68s/it]    \naccuracy: 90.9\ncorrect: 9090\nChanges committed successfully.\n```\nNotice that the `train.py` script logs the loss and accuracy during training. The loss is logged for each step, and the accuracy is logged at the end of each epoch.\n\nNext, you'll want to run training with different hyper-parameters. You can do this by setting the hyper-parameters from the command line:\n\n```bash\npython train.py --kwargs epochs=3 batch_size=64 lr=0.0005\n```\n```bash\ndevice: cuda\nseed: 2470\nhidden: 500\nepochs: 3\nbatch_size: 64\nlr: 0.0005\nprint_every: 500\nepoch: 0, step: 500, loss: 0.847846508026123\nepoch: 0, val_acc: 65.65 \nepoch: 1, step: 500, loss: 0.9502124786376953\nepoch: 1, val_acc: 65.05 \nepoch: 2, step: 500, loss: 0.834592342376709\nepoch: 2, val_acc: 66.65 \n3it [00:11,  3.98s/it]   \naccuracy: 65.72\ncorrect: 6572\nChanges committed successfully.\n```\n\nNow, you have two runs recorded in FlorDB. 
You can view the hyper-parameters and metrics logged during training with the `flor.dataframe` function:\n\n```python\nimport flor\nflor.dataframe(\"device\", \"seed\", \"epochs\", \"batch_size\", \"lr\", \"accuracy\")\n```\n![alt text](img/two_runs.png)\n\n### Replay the previous runs\n\nWhenever something looks wrong during training, you can use FlorDB to replay the previous runs and log additional information, like the gradient norm. To log the gradient norm, you can add the following statement to the training script:\n\n```python\nflor.log(\"gradient_norm\", \n    torch.nn.utils.clip_grad_norm_(\n        net.parameters(), max_norm=float('inf')\n    ).item()\n)\n```\n\nWe add the `flor.log` statement to the training script, inside the loop that iterates over the epochs:\n\n```python\nwith flor.checkpointing(model=net, optimizer=optimizer):\n    for epoch in flor.loop(\"epoch\", range(num_epochs)):\n        \n        # hindsight logging: gradient norm\n        flor.log(\"gradient_norm\", \n            torch.nn.utils.clip_grad_norm_(\n                net.parameters(), max_norm=float('inf')\n            ).item()\n        )\n\n        for data in flor.loop(\"step\", trainloader):\n            inputs, labels = data\n            optimizer.zero_grad()\n            outputs = net(inputs)\n            loss = criterion(outputs, labels)\n            loss.backward()\n            optimizer.step()\n            flor.log(\"loss\", loss.item())\n\n        # Evaluate the model on the test set\n        eval(net, testloader)\n```\n\nWe call the Flor Replay function with the name of the (comma-separated) variable(s) we want to hindsight log. In this case, we want to hindsight log the gradient norm at the start of each epoch, so we pass the variable name `gradient_norm`. From the command line:\n\n```bash\npython -m flor replay gradient_norm\n```\n```\nChanges committed successfully.\nlog level outer loop without suffix.\n\n        projid              tstamp  filename  ...        
delta::prefix       delta::suffix composite\n0  ml_tutorial 2024-12-06 11:06:58  train.py  ...   0.4068293860000267  0.5810907259983651  6.632383\n1  ml_tutorial 2024-12-06 11:08:05  train.py  ...  0.35641806300009193  0.5474109189999581  4.340672\n\n[2 rows x 17 columns]\n\nContinue replay estimated to finish in under 2 minutes [y/N]? y\n```\nThe replay command prints a schedule of past versions to be replayed, including timing data and intermediate metrics. Columns containing `::` are profiling columns that Flor uses to estimate the replay’s runtime. The phrase \"log level outer loop without suffix\" names the replay strategy that Flor will pursue on each version, which in this case means skipping the nested loop and the code that follows the main epoch loop.\n\nWhen you confirm the replay, Flor will replay the past versions shown in the schedule, and hindsight log the gradient norm for each epoch. You can view the new metrics logged during replay with the `flor.dataframe` function:\n\n```python\nimport flor\nflor.dataframe(\"seed\", \"batch_size\", \"lr\", \"accuracy\", \"gradient_norm\")\n```\n![alt text](img/gradient_norm.png)\n\n## Building AI/ML Applications with FlorDB\nFlorDB is more than a logging tool: it's a versatile system that can be used to manage the entire AI/ML lifecycle. \nFlorDB takes a metadata-centric approach (i.e., based on \"[context](https://rlnsanz.github.io/dat/Flor_CMI_18_CameraReady.pdf)\") to managing AI/ML workflows, allowing you to store and query metadata about your experiments, models, and datasets. As demonstrated in the Document Parser, FlorDB can be used to easily define **model registries**, **feature stores**, **feedback loops**, and other ML System views common in AI/ML applications.\n\n![alt text](img/doc_parser.png)\n\n### Document Parser\n\nThe Document Parser is a Flask-based web application designed to process PDF documents. 
It allows users to split PDFs, extract text, and prepare data for analysis using NLP techniques. The parser converts PDF documents into text and images. It then performs featurization, which includes extracting text and inferring page features (such as headings, page numbers, etc). This process transforms raw PDF data into a structured format suitable for machine learning applications. FlorDB enhances this process by serving as a **feature store** during featurization and a **model registry** during inference. It automates the selection of optimal model checkpoints, streamlines document and image processing, and facilitates debugging with hindsight logging. FlorDB also manages training data and model repositories for training pipelines. The application's core functionalities are structured around Flask routes that handle web requests, including displaying and manipulating PDFs, and incorporating human-in-the-loop feedback (i.e. whether two contiguous pages belong in the same document, for document segmentation). This feedback loop allows domain experts to review and correct model predictions, which are then used to iteratively improve model performance and maintain data quality.\n\n**Try it yourself!**\n\nA working implementation of the Document Parser, along with example usage, can be found in the [Document Parser repository](https://github.com/rlnsanz/document_parser). This repository provides a template for getting started with FlorDB and demonstrates how it can be integrated into a real-world machine learning application.\n\n## Model Training Examples\nAI/ML applications typically make use of a variety of models, each part of a larger ecosystem with its own hyper-parameters and training data. FlorDB can be used to manage these models, store their metadata, and track their evolution throughout the development lifecycle. 
The following PyTorch examples use HuggingFace and should be a good starting point for people looking to get started training or fine-tuning models with FlorDB.\n\n\n| Model        | Model Size | Data          | Data Size | Objective                  | Evaluation   | Application              |\n|--------------|------------|---------------|-----------|---------------------------|--------------|--------------------------|\n| [ResNet-152](https://github.com/rlnsanz/xp-resnet152)   | 242 MB     | ImageNet-1k   | 156 GB    | image classification      | accuracy     | computer vision          |\n| [BERT](https://github.com/rlnsanz/xp-BERT)         | 440 MB     | Wikipedia     | 40.8 GB   | masked language modeling  | accuracy     | natural language processing |\n| [GPT-2](https://github.com/rlnsanz/xp-gpt2)        | 548 MB     | Wikipedia     | 40.8 GB   | text generation           | perplexity   | natural language processing |\n| [LayoutLMv3](https://github.com/rlnsanz/xp-layoutlmv3)   | 501 MB     | FUNSD         | 36 MB     | form understanding        | F1-score     | document intelligence    |\n| [DETR](https://github.com/rlnsanz/xp-DETR)         | 167 MB     | CPPE-5        | 234 MB    | object detection          | μ-precision  | computer vision          |\n| [TAPAS](https://github.com/rlnsanz/xp-tapas-base)        | 443 MB     | WTQ           | 429 MB    | table question answering  | accuracy     | document intelligence    |\n\n\n## Publications\nFlorDB is software developed at UC Berkeley's [RISE](https://rise.cs.berkeley.edu/) Lab (2017 - 2024). It is actively maintained by [Rolando Garcia](https://rlnsanz.github.io) (rolando.garcia@asu.edu) at ASU's School of Computing \u0026 Augmented Intelligence (SCAI).\n\nTo cite this work, please refer to [Flow with FlorDB: Incremental Context Maintenance for the Machine Learning Lifecycle ](https://vldb.org/cidrdb/papers/2025/p33-garcia.pdf). 
It was published at the 15th Annual Conference on Innovative Data Systems Research (CIDR ’25). Building on Ground's foundational work on data context services ([Hellerstein et al., 2017](https://www.cidrdb.org/cidr2017/papers/p111-hellerstein-cidr17.pdf)), FlorDB extends comprehensive context management to the ML lifecycle.\n\nFlorDB is open source software developed at UC Berkeley. \nFlorDB has been the subject of study by Eric Liu and Anusha Dandamudi for their master's degrees.\nThe publications resulting from our work are listed below:\n\n* [Flow with FlorDB: Incremental Context Maintenance for the Machine Learning Lifecycle](https://vldb.org/cidrdb/papers/2025/p33-garcia.pdf). _R Garcia, P Kallanagoudar, C Anand, SE Chasins, JM Hellerstein, EMT Kerrison, AG Parameswaran_. CIDR, 2025.\n* [The Management of Context in the Machine Learning Lifecycle](https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-142.html). _R Garcia_. EECS Department, University of California, Berkeley, 2024. UCB/EECS-2024-142.\n* [Multiversion Hindsight Logging for Continuous Training](https://arxiv.org/abs/2310.07898). _R Garcia, A Dandamudi, G Matute, L Wan, JE Gonzalez, JM Hellerstein, K Sen_. Preprint on arXiv, 2023.\n* [Hindsight Logging for Model Training](http://www.vldb.org/pvldb/vol14/p682-garcia.pdf). _R Garcia, E Liu, V Sreekanti, B Yan, A Dandamudi, JE Gonzalez, JM Hellerstein, K Sen_. The VLDB Journal, 2021.\n* [Fast Low-Overhead Logging Extending Time](https://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-117.html). _A Dandamudi_. EECS Department, UC Berkeley Technical Report, 2021.\n* [Low Overhead Materialization with FLOR](https://www2.eecs.berkeley.edu/Pubs/TechRpts/2020/EECS-2020-79.html). _E Liu_. EECS Department, UC Berkeley Technical Report, 2020. \n* [Context: The Missing Piece in the Machine Learning Lifecycle](https://rlnsanz.github.io/dat/Flor_CMI_18_CameraReady.pdf). _R Garcia, V Sreekanti, N Yadwadkar, D Crankshaw, JE Gonzalez, JM Hellerstein_. 
CMI, 2018.\n\n\n## License\n\nFlorDB is licensed under the [Apache v2 License](https://www.apache.org/licenses/LICENSE-2.0), which allows you to freely use, modify, and distribute the software for any purpose, with attribution.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fucbrise%2Fflor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fucbrise%2Fflor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fucbrise%2Fflor/lists"}