{"id":28255162,"url":"https://github.com/pandas-dev/pandas-benchmarks","last_synced_at":"2025-06-16T06:31:48.594Z","repository":{"id":220426717,"uuid":"708336897","full_name":"pandas-dev/pandas-benchmarks","owner":"pandas-dev","description":"Environment to run the pandas benchmarks suite","archived":false,"fork":false,"pushed_at":"2024-01-26T18:45:14.000Z","size":3,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-06-04T05:46:35.549Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pandas-dev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null},"funding":{"custom":"https://pandas.pydata.org/donate.html","github":["numfocus"],"tidelift":"pypi/pandas"}},"created_at":"2023-10-22T08:58:31.000Z","updated_at":"2024-06-14T04:45:15.000Z","dependencies_parsed_at":"2024-02-02T01:49:13.152Z","dependency_job_id":null,"html_url":"https://github.com/pandas-dev/pandas-benchmarks","commit_stats":null,"previous_names":["pandas-dev/pandas-benchmarks"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/pandas-dev/pandas-benchmarks","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pandas-dev%2Fpandas-benchmarks","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pandas-dev%2Fpandas-benchmarks/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pandas-dev%2Fpandas-benchmarks/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repos
itories/pandas-dev%2Fpandas-benchmarks/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pandas-dev","download_url":"https://codeload.github.com/pandas-dev/pandas-benchmarks/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pandas-dev%2Fpandas-benchmarks/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259520065,"owners_count":22870389,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-05-19T21:12:48.793Z","updated_at":"2025-06-16T06:31:48.565Z","avatar_url":"https://github.com/pandas-dev.png","language":null,"funding_links":["https://pandas.pydata.org/donate.html","https://github.com/sponsors/numfocus","https://tidelift.com/funding/github/pypi/pandas"],"categories":[],"sub_categories":[],"readme":"# pandas benchmark\n\n## Set up instructions\n\nInstall the compilers needed to build pandas in the system:\n\n```shell\napt install gcc g++\n```\n\nCreate a user to run the benchmarks, and clone this repository in its home.\n\nInstall [pixi](https://prefix.dev), which we use to manage the environment that runs\nasv. 
Note that the environment to run the benchmarks is managed by asv and it is\ndifferent from the pixi environment:\n\n```shell\ncurl -fsSL https://pixi.sh/install.sh | bash\n```\n\nClone the pandas repository inside the `pandas-benchmarks` directory:\n\n```shell\ncd pandas-benchmarks\ngit clone https://github.com/pandas-dev/pandas.git\n```\n\n## Run benchmarks\n\nWe use [pixi](https://prefix.dev) to manage the environment and run the benchmarks:\n\n```shell\npixi run bench\n```\n\nWe may want to implement a script that runs benchmarks continually (a new run starts\nwhen the previous finishes, indefinitely). But for now we are using cron.\n\nTo set up cron to run the benchmarks automatically we can use:\n\n```\n0 */3 * * * cd pandas-benchmarks \u0026\u0026 /home/bench/.pixi/bin/pixi run bench \u003e\u003e bench.log 2\u003e\u00261\n```\n\nNote that the frequency should avoid starting a new job when the previous\nhas not finished, so if the benchmarks take 2.5 hours to complete, we should\nschedule the runs, for example, every 3 hours.\n\nTo view the log of cron executions we can run:\n\n```shell\ngrep CRON /var/log/syslog | grep \"(bench)\"\n```\n\n## System stability\n\nEverything that happens in the system while running the benchmarks has an\nimpact, meaning that benchmarks will run faster when there is not much noise,\nand will run slower when there is. For example, if the core running the benchmarks\nhas to handle an operating system interrupt, this will cause a context switch,\nmay evict data from the CPU caches, and the benchmark will take longer. Even if every\nbenchmark is run multiple times, this variance makes our results worse and likely\nto cause false positives. This section is about trying to make the system more\nstable and reduce the variance of the execution time of benchmarks.\n\n### CPU isolation\n\nThe first thing we can do is to isolate the CPUs where the benchmarks run. 
This means\nthat the operating system won't use the CPU unless a process is explicitly started\nwith a CPU affinity to that core.\n\nFirst, to check the cores available in the system we can run:\n\n```shell\n$ lscpu --all --extended\nCPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE    MAXMHZ   MINMHZ       MHZ\n  0    0      0    0 0:0:0:0          yes 4900.0000 800.0000 4798.3130\n  1    0      0    1 1:1:1:0          yes 4900.0000 800.0000 4603.2891\n  2    0      0    2 2:2:2:0          yes 4900.0000 800.0000 4000.0000\n  3    0      0    3 3:3:3:0          yes 4900.0000 800.0000 4000.0000\n  4    0      0    0 0:0:0:0          yes 4900.0000 800.0000 4000.0000\n  5    0      0    1 1:1:1:0          yes 4900.0000 800.0000 4000.0000\n  6    0      0    2 2:2:2:0          yes 4900.0000 800.0000 4782.7388\n  7    0      0    3 3:3:3:0          yes 4900.0000 800.0000 4000.0000\n```\n\nThe `CPU` column shows that the benchmarks server has 8 logical CPUs, and the `CORE`\ncolumn shows that those are using 4 different physical cores (every physical\ncore is used by two separate pipelines or virtual cores, referred to by Intel\nas hyperthreads). We need to isolate physical cores, so the OS does not\nexecute anything in the other pipeline either, which would also slow down\nthe benchmark execution.\n\nTo isolate CPUs we need to add parameters to the kernel command line. 
To do so, we edit\nthe file `/etc/default/grub` and make these changes:\n\n```\n# Find this line:\nGRUB_CMDLINE_LINUX_DEFAULT=\"quiet splash\"\n\n# Replace it with this line (add the parameters at the end):\nGRUB_CMDLINE_LINUX_DEFAULT=\"quiet splash isolcpus=3,7 nohz_full=3,7\"\n```\n\nThis will isolate the physical core 3, via its two virtual cores 3 and 7.\nIt will also remove these cores from the operating system scheduler ticks.\nWe could isolate more cores, but for now we start with one for simplicity.\n\nFor the changes to have an effect we first need to update the actual grub\nconfiguration with the changes in `/etc/default/grub.d/50-cloudimg-settings.cfg`.\nIn general `/etc/default/grub` is used for grub settings, but OVH overwrites the\ncontent of that file with `50-cloudimg-settings.cfg`. Note that grub does not read\ndirectly from those files, so we need to run `update-grub` or `grub-mkconfig`,\nwhich parse these files and write to `/boot/grub/grub.cfg`, the file actually used\nat boot. After running one of those commands we need to restart\nthe system so the running kernel picks up the new parameters. In practice this is as\nsimple as running the following commands:\n\n```shell\n$ sudo vim /etc/default/grub.d/50-cloudimg-settings.cfg  # and make changes above\n$ sudo update-grub\n$ sudo reboot\n```\n\nOnce the system is restarted we should check that the CPUs are indeed\nisolated as expected. This can be done by checking the contents of the\nfollowing file:\n\n```shell\n$ cat /sys/devices/system/cpu/isolated\n3,7\n```\n\nWe can also see that the operating system is not running tasks in the isolated CPUs\nby generating load and checking CPU usage with htop:\n\n```shell\n$ apt install stress\n$ stress --cpu 8\n$ htop # in a different terminal\n```\n\nIsolation works for processes running in user space, but not in kernel space.\nIdeally, we would like to avoid interrupts running on our isolated cores. 
While\nthis is a complex topic, and not every interrupt can run on any core, the following command\nrestricts, in a general way, the cores each interrupt is allowed to run on:\n\n```shell\nfor IRQ_AFFINITY_FILE in $(find /proc/irq -name smp_affinity); do echo 77 | sudo tee $IRQ_AFFINITY_FILE; done\n```\n\nNote that for some interrupts the command will fail. Also note that `77` is a bit\nmask written in hexadecimal, representing `0111 0111` (the 4th and 8th CPUs are not allowed to\nhandle the interrupt).\n\n## CPU frequency\n\nModern CPUs are able to scale their frequency depending on workload or temperature. When a CPU\nis idle it will decrease its frequency to save energy. Also, when a CPU is busy and its temperature\nincreases, it will eventually decrease its frequency so the temperature goes back to a safe level.\n\nMost of these frequency scaling technologies can be disabled via the system BIOS, but we do not\nhave control of it in the servers in a data center, and disabling them may leave the CPU at a low\nfrequency, making the benchmark suite take much longer to run (something like double the time based on past tests).\n\nThere are some things we have control of at runtime. 
We should be able to disable TurboBoost via:\n\n```shell\necho 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo\n```\n\nWe can also install `cpufreq`, which gives information about and allows control of certain features, with:\n\n```shell\nsudo apt install linux-tools-generic\n```\n\n## Benchmarks variance\n\nWhile the system introduces noise due to CPU scaling or our benchmark process being interrupted\nby other processes and interrupts, there are other sources of noise that cause variance in the\nresults of our benchmarks.\n\nThe main ones identified are:\n- I/O operations\n- Unpredictable CPU cache misses\n- Randomness (for example, our benchmarks on functions that check duplicates are affected by the\n  randomness in the hashing functions for the used hash tables).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpandas-dev%2Fpandas-benchmarks","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpandas-dev%2Fpandas-benchmarks","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpandas-dev%2Fpandas-benchmarks/lists"}