{"id":15988662,"url":"https://github.com/d-krupke/aemeasure","last_synced_at":"2025-05-07T07:01:41.232Z","repository":{"id":57408352,"uuid":"509999951","full_name":"d-krupke/AeMeasure","owner":"d-krupke","description":"A macro-benchmarking tool with a serverless database","archived":false,"fork":false,"pushed_at":"2023-07-03T20:04:34.000Z","size":93,"stargazers_count":1,"open_issues_count":4,"forks_count":2,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-05-01T07:08:15.167Z","etag":null,"topics":["benchmark","database","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/d-krupke.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-07-03T11:10:33.000Z","updated_at":"2022-08-25T08:21:03.000Z","dependencies_parsed_at":"2024-10-08T04:20:41.036Z","dependency_job_id":null,"html_url":"https://github.com/d-krupke/AeMeasure","commit_stats":{"total_commits":29,"total_committers":1,"mean_commits":29.0,"dds":0.0,"last_synced_commit":"f08b5ae184f18c39352f4e5f41d085d0e2246965"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/d-krupke%2FAeMeasure","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/d-krupke%2FAeMeasure/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/d-krupke%2FAeMeasure/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/d-krupke%2FAeMeasure/manifests","ow
ner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/d-krupke","download_url":"https://codeload.github.com/d-krupke/AeMeasure/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252831253,"owners_count":21810783,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","database","python"],"created_at":"2024-10-08T04:20:31.477Z","updated_at":"2025-05-07T07:01:41.187Z","avatar_url":"https://github.com/d-krupke.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# AeMeasure - A macro-benchmarking tool with a serverless database\n\n**Consider [AlgBench](https://github.com/d-krupke/AlgBench) as a more modern alternative to AeMeasure. It requires less boilerplate code, has better documentation, and offers direct support for logging.**\n\nThis module has been developed to save (macro-)benchmarks of algorithms in a simple and\ndynamic way. The primary features are:\n* Saving metadata such as Git-Revision, hostname, etc.\n* No data schema is required. This is important because you often add features in later iterations but still want to compare the results to the original data.\n* Compatibility with distributed execution, e.g., via Slurm. 
If you want to use multi-processing, use separate `MeasurementSeries`.\n* Easy data rescue in case of data corruption (or non-availability of this library) as well as compression.\n  * Data is saved in multiple JSON files (a single coherent JSON document is not efficient) and compressed as zip.\n* Optional capturing of stdout and stderr.\n\nYou can also consider this tool as a simple serverless but NFS-compatible object database with helpers for benchmarks.\n\nThe motivation for this tool came from the need to **quickly compare different optimization models (MIP, CP, SAT, ...)**\nand analyze their performance. Here it is more important to save the context (parameters, revision, ...) than to\nbe precise to a millisecond. If you need very precise measurements, you need to look for a micro-benchmarking tool.\nThis is a **macro-benchmarking tool** with a **file-based database**.\n\n## When to use AeMeasure?\n\n\u003e *\"They say the workman is only as good as his tools; in experimental algorithmics the workman must often build his tools.\"* - Catherine McGeoch, A Guide to Experimental Algorithmics\n\nAeMeasure is designed for flexibility and simplicity. If you don't have changing\nrequirements every few weeks, you may be better off\nusing a proper database. If you are somewhere in between, you could take a look at, e.g.,\n[MongoDB](https://www.mongodb.com/), which is more flexible regarding the schema but\nstill provides a proper database. 
If you want a very simple and flexible solution and the data\nin the repository (compressed of course, but still human-readable),\nAeMeasure may be the right tool for you.\n\n## Installation\n\nThe easiest installation is via pip\n```shell\npip install -U aemeasure\n```\n\n## Usage\n\nA simple application that runs an algorithm for a set of instances and saves the results to `./results` could look like this:\n\n```python\nfrom aemeasure import MeasurementSeries\n\nwith MeasurementSeries(\"./results\") as ms:\n    # By default, stdout, stderr, and metadata (git revision, hostname, etc) will\n    # automatically be added to each measurement.\n    for instance in instances:\n        with ms.measurement() as m:\n            m[\"instance\"] = str(instance)\n            m[\"size\"] = len(instance)\n            m[\"algorithm\"] = \"fancy_algorithm\"\n            m[\"parameters\"] = \"asdgfdfgsdgf\"\n            solution = run_algorithm(instance)\n            m[\"solution\"] = solution.as_json()\n            m[\"objective\"] = 42\n            m[\"lower_bound\"] = 13\n```\n\nYou can then parse the database as a pandas table via\n```python\nfrom aemeasure import read_as_pandas_table\n\ntable = read_as_pandas_table(\"./results\", defaults={\"uses_later_added_special_feature\": False})\n```\n\n## Metadata and stdout/stderr\n\nIf you are using MeasurementSeries, all possible information is automatically\nadded to the measurements by default. You can deactivate this easily in\nthe constructor. However, it is often useful to have this data (especially stderr\nand git revision) at hand when you notice some oddities in your results. 
This\ncan take up a lot of space, but the compression option of the database should\nhelp.\n\nThe following data is saved:\n* Runtime (enter and exit of Measurement)\n* stdout/stderr\n* Git Revision\n* Timestamp of start\n* Hostname\n* Arguments\n* Python-File\n* Current working directory\n\nYou can also activate individual metadata by just calling the corresponding member\nfunction of the measurement.\n\n## Usage with Slurminade\n\nThis tool is excellent in combination with [Slurminade](https://github.com/d-krupke/slurminade) to automatically distribute\nyour experiments to Slurm nodes. This also allows you to schedule the missing instances.\n\nAn example could look like this:\n\n```python\nimport slurminade\nfrom aemeasure import MeasurementSeries, read_as_pandas_table, Database\n\n# your supervisor/admin will tell you the necessary configuration.\nslurminade.update_default_configuration(partition=\"alg\", constraint=\"alggen03\")\nslurminade.set_dispatch_limit(200)  # just a safety measure in case you messed up\n\n# Experiment parameters\nresult_folder = \"./results\"\ntimelimit = 300\n\n# The part to be distributed\n@slurminade.slurmify()\ndef run_for_instance(instance_name, timelimit):\n    \"\"\"\n    Solve instance with different solvers.\n    \"\"\"\n    instances = load_instances()\n    instance = instances[instance_name]\n    with MeasurementSeries(result_folder) as ms:\n        models = (Model1(instance), Model2(instance))\n        for model in models:\n            with ms.measurement() as m:\n                ub, lb = model.optimize(timelimit)\n                m[\"instance\"] = instance_name\n                m[\"timelimit\"] = timelimit\n                m[\"ub\"] = ub\n                m[\"lb\"] = lb\n                m[\"n\"] = len(instance)\n                m[\"Method\"] = str(model)\n                m.save_seconds()\n\nif __name__ == \"__main__\":\n    # Read data\n    instances = load_instances()\n    Database(result_folder).compress()  # compress 
prior results to make space\n    t = read_as_pandas_table(result_folder)  # read prior results to check which instances are still missing\n\n    # find missing instances (skip already solved instances)\n    finished_instances = t[\"instance\"].to_list() if not t.empty else []\n    print(\"Already finished instances:\", finished_instances)\n    missing_instances = [i for i in instances if i not in finished_instances]\n    if finished_instances and missing_instances:\n        assert isinstance(missing_instances[0], type(finished_instances[0]))\n    print(\"Still missing instances:\", missing_instances)\n\n    # distribute\n    for instance in missing_instances:\n        run_for_instance.distribute(instance, timelimit)\n```\n\nIf you have a lot of instances, you may want to use `slurminade.AutoBatch` to automatically\nbatch multiple instances into a single task.\n\nImportant points of this example:\n* Extract the parameters as variables and put them at the top so you can easily copy and adapt such a template.\n* The `run_for_instance` function reads the instance itself, as this is more efficient than distributing it via Slurm as an argument.\n* We compress the results at the beginning. As this is executed before distribution, it is threadsafe.\n* We quickly check which instances are already solved and only distribute the missing ones.\n* To compress the final results, simply run this script again (it will also check whether you have missed some instances due to an error).\n\nI have often seen scripts that are simply started on each node that either use a complicated\nmanual instance distribution, require an additional server, or need a lot of additional\ndata collection in the end. 
The approach above seems much more elegant to me if\nyou already have Slurm and an NFS.\n\n## Serverless Database\n\nThe serverless database allows you to dump unstructured JSON in a relatively threadsafe way (the focus is on Slurm nodes with NFS).\n```python\nfrom aemeasure import Database\n# Writing\ndb = Database(\"./db_folder\")  # We use a folder, not a file, to make it NFS-safe.\ndb.add({\"key\": \"value\"}, flush=False)  # save simple dict, do not write directly.\ndb.flush()  # save to disk\ndb.compress()  # compress data on disk via zip\n\n# Reading\ndb2 = Database(\"./db_folder\")\ndata = db2.load()  # load all entries as a list of dicts.\n\n# Clear\ndb2.clear()\ndb2.dump([e for e in data if e.get(\"feature_x\")])  # write back only entries with a truthy 'feature_x'\n```\n\nThe primary points of the database are:\n* No server is needed, synchronization possible via NFS.\n* We are using a folder instead of a single file. Otherwise, the synchronization of different nodes via NFS would be difficult.\n* Every node uses a separate, unique file to prevent conflicts.\n* Every entry is a new line in JSON format appended to the current database file of the node. As this allows simply appending, this is much more efficient than keeping the whole structure in JSON. If something goes wrong, you can still easily repair it with a text editor and some basic JSON skills.\n* The database has a very simple format, such that it can also be read without this tool.\n* As the native JSON format can need a significant amount of disk space, a compression option allows you to significantly reduce the size via ZIP compression.\n\n**This database is made for frequent writing, infrequent reading. Currently, there are no query options aside from list comprehensions. Use `clear` and `dump` for selective deletion.**\n\n## Changelog\n\n* 0.2.9: Added pyproject.toml for PEP compliance.\n* 0.2.8: Saving Python-environment, too.\n* 0.2.7: Robust JSON serialization. 
It will save the data but print an error if the data is not JSON-serializable.\n* 0.2.6: Extended logging and exception if data could not be written.\n* 0.2.5: Skipping on zero size (probably not yet written, can be a problem with NFS)\n* 0.2.4: Added some logging.\n* 0.2.3: Setting LZMA as compression standard.\n* For some reason, the default keys for 'stdin' and 'stdout' were wrong. Fixed.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fd-krupke%2Faemeasure","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fd-krupke%2Faemeasure","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fd-krupke%2Faemeasure/lists"}