{"id":51093651,"url":"https://github.com/yablokolabs/bendkernels","last_synced_at":"2026-06-24T04:02:50.238Z","repository":{"id":355199274,"uuid":"1227165926","full_name":"yablokolabs/bendkernels","owner":"yablokolabs","description":"Pure Bend parallel algorithm kernels and GPU-scaling examples","archived":false,"fork":false,"pushed_at":"2026-05-02T10:25:01.000Z","size":9,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-02T12:22:47.067Z","etag":null,"topics":["algorithms","bend","cuda","gpu","hvm","parallel-computing"],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/yablokolabs.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-02T09:52:33.000Z","updated_at":"2026-05-02T10:25:05.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/yablokolabs/bendkernels","commit_stats":null,"previous_names":["yablokolabs/bendkernels"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/yablokolabs/bendkernels","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yablokolabs%2Fbendkernels","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yablokolabs%2Fbendkernels/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yablokolabs%2Fbendkernels/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yablokolabs%2Fbendkernels/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/yablokolabs","download_url":"https://codeload.github.com/yablokolabs/bendkernels/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yablokolabs%2Fbendkernels/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34716326,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-24T02:00:07.484Z","response_time":106,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["algorithms","bend","cuda","gpu","hvm","parallel-computing"],"created_at":"2026-06-24T04:02:47.011Z","updated_at":"2026-06-24T04:02:50.230Z","avatar_url":"https://github.com/yablokolabs.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# BendKernels\n\nPure Bend examples for algorithms that want thousands of lightweight threads.\n\nBendKernels is a small benchmark and learning repo for [Bend](https://github.com/HigherOrderCO/Bend), a high-level language that exposes parallelism from ordinary recursive structure. The goal is not to replace Rust, C, or CUDA for every workload. The goal is to show the kinds of programs where Bend's compiler/runtime can find a lot of independent work without explicit threads, locks, atomics, or GPU kernels.\n\n## Why this exists\n\nMost parallel examples hide the interesting part behind a framework. Bend makes the parallel shape visible in the program itself:\n\n- balanced recursion\n- tree folds\n- divide-and-conquer algorithms\n- sorting networks\n- independent map/reduce kernels\n\nThe exact same `.bend` file can be run on the reference runtime, parallel CPU runtime, or CUDA runtime when supported by the machine.\n\n## Examples\n\n| Example | Pattern | What it demonstrates |\n| --- | --- | --- |\n| `examples/parallel_sum.bend` | tree generation + fold | balanced reduction over independent branches |\n| `examples/tree_map_reduce.bend` | map/reduce | independent leaf kernels with a final checksum |\n| `examples/bitonic_sort.bend` | sorting network | compare/swap structure that maps well to parallel execution |\n| `examples/fibonacci_modes.bend` | branching vs sequential recursion | parallelism is not the same as algorithmic efficiency |\n| `examples/ai/tiny_mlp.bend` | neural-network inference | independent hidden neurons and batch samples |\n| `examples/ai/minimax_tree.bend` | game-tree search | independent child-position evaluation |\n| `examples/ai/ensemble_vote.bend` | ensemble inference | independent weak-learner votes over a batch |\n\nSee [`docs/ai-kernels.md`](docs/ai-kernels.md) for the AI-focused examples.\n\n## Install Bend\n\n```bash\ncargo install hvm\ncargo install bend-lang\nbend --version\n```\n\nFor GPU execution, install the NVIDIA CUDA toolkit supported by Bend/HVM, then use `bend run-cu`.\n\n## Run\n\nReference runtime:\n\n```bash\nbend run-rs examples/parallel_sum.bend -s\n```\n\nParallel CPU runtime:\n\n```bash\nbend run-c examples/parallel_sum.bend -s\n```\n\nCUDA runtime, on supported NVIDIA systems:\n\n```bash\nbend run-cu examples/parallel_sum.bend -s\n```\n\nRun the full smoke set locally:\n\n```bash\nbend run-rs examples/parallel_sum.bend\nbend run-rs examples/tree_map_reduce.bend\nbend run-rs examples/bitonic_sort.bend\nbend run-rs examples/fibonacci_modes.bend\nbend run-rs examples/ai/tiny_mlp.bend\nbend run-rs examples/ai/minimax_tree.bend\nbend run-rs examples/ai/ensemble_vote.bend\nbend run-c examples/parallel_sum.bend\n```\n\n## Usage examples with output\n\n### Parallel tree sum\n\n```bash\nbend run-rs examples/parallel_sum.bend\n```\n\n```text\nResult: 8386560\n```\n\nRun the same source on the parallel CPU runtime:\n\n```bash\nbend run-c examples/parallel_sum.bend\n```\n\n```text\nResult: 8386560\n```\n\nWith Bend's stats flag, you can compare runtime behavior. Numbers vary by machine, but the checksum should match:\n\n```bash\nbend run-rs examples/parallel_sum.bend -s\nbend run-c examples/parallel_sum.bend -s\n```\n\nExample output from one local run:\n\n```text\nResult: 8386560\n- ITRS: 360395\n- TIME: 0.01s\n- MIPS: 28.19\n\nResult: 8386560\n- ITRS: 352203\n- TIME: 0.01s\n- MIPS: 55.38\n```\n\n### Tree map/reduce\n\n```bash\nbend run-rs examples/tree_map_reduce.bend\n```\n\n```text\nResult: 2064384\n```\n\n### Bitonic sort checksum\n\n```bash\nbend run-rs examples/bitonic_sort.bend\n```\n\n```text\nResult: 32640\n```\n\n### Fibonacci modes\n\n```bash\nbend run-rs examples/fibonacci_modes.bend\n```\n\n```text\nResult: 13530\n```\n\nThis example intentionally includes both a branching recursive Fibonacci and a tail-recursive Fibonacci. It is a reminder that more parallel structure does not automatically mean a better algorithm.\n\n### Tiny MLP inference\n\n```bash\nbend run-rs examples/ai/tiny_mlp.bend\n```\n\n```text\nResult: 426312\n```\n\nThis runs a fixed integer neural-network forward pass over a synthetic batch and returns the batch-score checksum.\n\n### Minimax tree search\n\n```bash\nbend run-rs examples/ai/minimax_tree.bend\n```\n\n```text\nResult: 745\n```\n\nThis evaluates a synthetic ternary game tree with alternating min/max layers.\n\n### Ensemble vote\n\n```bash\nbend run-rs examples/ai/ensemble_vote.bend\n```\n\n```text\nResult: 2024\n```\n\nThis evaluates several independent decision stumps across a synthetic batch and returns the positive-class count.\n\n## Interpreting results\n\nBend currently uses 24-bit numeric types, so large integer examples may wrap modulo `2^24`. The examples return checksums rather than printing arrays or images, because Bend's IO ecosystem is still young and benchmark-style programs are easier to compare across runtimes.\n\nWhen comparing runtimes, look at Bend's `-s` output:\n\n```bash\nbend run-rs examples/bitonic_sort.bend -s\nbend run-c  examples/bitonic_sort.bend -s\nbend run-cu examples/bitonic_sort.bend -s\n```\n\nGood Bend candidates usually have lots of independent subproblems. Bad candidates are \"helplessly sequential\": each step depends on the exact result of the previous step.\n\n## Roadmap\n\n- matrix multiplication checksum\n- Mandelbrot / fractal checksum\n- N-body simulation checksum\n- prefix scan\n- larger AI inference/search kernels\n- benchmark result tables across CPU and GPU machines\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyablokolabs%2Fbendkernels","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyablokolabs%2Fbendkernels","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyablokolabs%2Fbendkernels/lists"}