{"id":27918774,"url":"https://github.com/juliaparallel/blocks.jl","last_synced_at":"2025-06-19T22:06:49.661Z","repository":{"id":9330344,"uuid":"11176164","full_name":"JuliaParallel/Blocks.jl","owner":"JuliaParallel","description":"A framework to represent chunks of entities and parallel methods on them.","archived":false,"fork":false,"pushed_at":"2023-03-14T12:45:31.000Z","size":95,"stargazers_count":30,"open_issues_count":7,"forks_count":6,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-06-19T22:05:17.368Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Julia","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"romanbsd/ember-cli-sass","license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/JuliaParallel.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2013-07-04T12:25:17.000Z","updated_at":"2023-04-07T02:25:42.000Z","dependencies_parsed_at":"2025-05-06T18:39:33.486Z","dependency_job_id":null,"html_url":"https://github.com/JuliaParallel/Blocks.jl","commit_stats":null,"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/JuliaParallel/Blocks.jl","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JuliaParallel%2FBlocks.jl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JuliaParallel%2FBlocks.jl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JuliaParallel%2FBlocks.jl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JuliaParallel%2FBlocks.jl/manifests","owner_url":"http
s://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/JuliaParallel","download_url":"https://codeload.github.com/JuliaParallel/Blocks.jl/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JuliaParallel%2FBlocks.jl/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260838512,"owners_count":23070603,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-05-06T18:25:30.578Z","updated_at":"2025-06-19T22:06:44.620Z","avatar_url":"https://github.com/JuliaParallel.png","language":"Julia","funding_links":[],"categories":[],"sub_categories":[],"readme":"Blocks\n======\nA framework to:\n- represent chunks of an entity\n- represent processor affinities of the chunks (if any)\n- compose actions (both local and remote) on chunks by chaining functions\n- do map and reduce operations with the above\n\nIt represents a typical pattern observed across several types of parallel processing tasks. The Blocks framework can be leveraged to build convenience APIs for parallelizing such tasks. The composability of Blocks lends to a convenient and compact syntax.\n\nAs examples of its utility, it has been used to implement chunked and distributed operations on disk files, HDFS files, IO streams, arrays, matrices, and dataframes. 
Some of them are included in the Blocks module while the rest are available as sub modules of Blocks:\n- Blocks.MatOp\n\n[![Build Status](https://travis-ci.org/JuliaParallel/Blocks.jl.png)](https://travis-ci.org/JuliaParallel/Blocks.jl)\n\n### Creating Blocks\n#### Disk Files\n\n````\nusing Blocks\n\nBlock(file::File, nblocks::Int=0)\n    Where nblocks is the number of chunks to divide the file into.\n    Number of chunks (nblocks) defaults to number of worker processes.\n    Each chunk is represented as the file and the byte range.\n    Assumes that the file is available at all processors and chunks can be processed anywhere.\n````\n\n#### HDFS Files\n\n````\nusing Blocks\nusing HDFS\n\nBlock(file::HdfsURL)\n    Each chunk is a block in HDFS.\n    Processor affinity of each chunk is set to machines where this block has been replicated by HDFS.\n````\n\n#### Arrays:\n\n````\nusing Blocks\n\nBlock(A::Array, dims::Array)\n    Chunks created across dimensions specified in dims.\n    Chunks are not pre-distributed and any chunk can be processed at any processor.\n\nBlock(A::Array, dim::Int, nblocks::Int)\n    Chunked to nblocks chunks on dimension dim.\n    Chunks are not pre-distributed and any chunk can be processed at any processor.\n````\n\n#### Matrix Operations\n\nParallelized operations on matrices can be represented and executed using Blocks. 
Module `Blocks.MatOp` provides a set of convenience APIs using the `MatOpBlock` object.\n\n````\njulia\u003e using Blocks\n\njulia\u003e using Blocks.MatOp\n\njulia\u003e # create two matrices\n\njulia\u003e m1 = rand(Int, 6, 10);\n\njulia\u003e m2 = rand(Int, 10, 6);\n\njulia\u003e # create a parallel matrix operation using the two, multiplication in this case\n\njulia\u003e mb = MatOpBlock(m1, m2, :*, 3);\n\njulia\u003e # represent that in blocks\n\njulia\u003e blk = Block(mb);\n\njulia\u003e # execute the operation\n\njulia\u003e result = op(blk);\n\njulia\u003e # verify the result\n\njulia\u003e tr = m1*m2;\n\njulia\u003e all(tr .== result)\ntrue\n````\n\n`Blocks.MatOp` can be made to work on any `AbstractMatrix` implementation, as long as there is:\n\n- a function `Blocks(A, splits)`, where `A` is the matrix and `splits` is a `Tuple` of ranges (as returned from `mat_split_ranges`)\n- a function `matrixpart(blk)`, which returns a chunk of `A` that the block `blk` represents\n\n\n#### Streams:\n\n````\nusing Blocks\n\nBlock(stream::Union(IOStream,AsyncStream,IOBuffer,BlockIO), maxsize::Int)\n    Iterating on the block thus created would read a chunk of data from `stream`.\n    Each chunk will represent a `maxsize` sized data block read from `stream`.\n\nBlock(stream::Union(IOStream,AsyncStream,IOBuffer,BlockIO), approxsize::Int, dlm::Char)\n    Iterating on the block thus created would read a chunk of data from `stream`.\n    Each chunk is approximately of size `approxsize` and ends with the `dlm` character.\n````\n\n#### Distributed DataFrames (discontinued):\n\nBlocks introduces a distributed `DataFrame` type named `DDataFrame`. It holds references to multiple remote data frames on multiple processors. A large table can be read in parallel into a DDataFrame by using the special `dreadtable` method. 
\n\n````\nusing Blocks\nusing DataFrames\n\ndreadtable(filename::String; kwargs...)\ndreadtable(blocks::Block; kwargs...)\n    Where blocks are created from disk or HDFS files or from streams as described in the sections above.\ndreadtable(ios::Union(AsyncStream,IOStream), chunk_sz::Int, merge_chunks::Bool=true; kwargs...)\n    Where \n        ios is a stream of data\n        chunk_sz is the approximate number of bytes to chunk the data into\n        merge_chunks indicates whether all chunks on a single processor should be merged. \n        Merging discards positional information but makes the dataframe efficient by having fewer chunks.\n````\n\nA `DDataFrame` is easily represented as Blocks. `DDataFrame` has been used with `Blocks` to implement most `DataFrame` operations in a distributed manner. Most methods defined on a DataFrame also work on DDataFrames in a distributed manner using `pmap` and `reduce` to operate on chunks in parallel.\n\n````\njulia\u003e using Blocks\n\njulia\u003e using DataFrames\n\njulia\u003e dt = dreadtable(\"test.csv\")\n100x10 DDataFrame. 
2 blocks over 2 processors\n\njulia\u003e head(dt)\n6x10 DataFrame:\n               x1        x2        x3        x4        x5        x6        x7       x8       x9      x10\n[1,]     0.105518  0.173988  0.244224 0.0174508 0.0969595   0.12792  0.316974 0.852373 0.165014 0.886957\n[2,]     0.319401 0.0719447 0.0019209  0.285511  0.945343  0.926718  0.162048 0.118748 0.361014 0.611316\n[3,]     0.516926  0.473779  0.867099  0.408605  0.579969  0.111174 0.0790296 0.263822 0.073827 0.187637\n[4,]     0.579538  0.319672  0.600223  0.707782  0.806437  0.402244  0.670792  0.10981 0.518356 0.604807\n[5,]     0.660944  0.648076  0.611529  0.885457  0.550101 0.0634721  0.152263 0.855182 0.408393 0.473676\n[6,]    0.0324734   0.22839  0.812387   0.59965  0.143703    0.1337  0.945763 0.296137 0.875762 0.989037\n\njulia\u003e colsums(dt)\n1x10 DataFrame:\n             x1      x2      x3      x4      x5      x6     x7      x8      x9     x10\n[1,]    46.1597 41.9286 51.4197 50.1906 48.2623 44.5622 50.914 50.7266 44.1346 51.1001\n\njulia\u003e all(dt+dt .== 2*dt)\ntrue\n````\n\n### Composing Actions on Blocks\n\nFunctions can be chained and then applied on to chunks in a block with a `pmap` or `pmapreduce`. The Julia notation `|\u003e` is used to indicate chaining. For example to read a block of DataFrame from a chunk of a disk file:\n\n````\nb = Block(File(filename)) |\u003e as_io |\u003e as_recordio |\u003e as_dataframe\n````\n\nEach function in the chain works on the output of the previous function. \n\nSometimes it is necessary to separate some of the actions that must be applied locally and serially (e.g. reading from an IO stream), from the remaining that can be distributed to remote processors (e.g. creating a dataframe out of the data chunk). Such actions can be chained by prepending the chain of functions with a `@prepare` macro. 
\n\n````\nb = Block(File(filename))\nb = @prepare b |\u003e as_io |\u003e as_recordio |\u003e as_bytearray\nb = b |\u003e as_dataframe |\u003e nrows\n````\n\nFollowing is a list of functions provided in the package. User specified functions can be chained in as well:\n- `as_io`: creates an `IO` instance from streams or files\n- `as_recordio`: creates an `IO` instance from streams or files where begin and end positions are adjusted to the boundaries of delimited records\n- `as_lines`: creates an array of lines from `IO`\n- `as_bufferedio`: creates buffered `IO` from any other `IO`\n- `as_bytearray`: creates bytearray from any `IO`\n- `as_dataframe`: creates a dataframe from any `IO`\n\n\n### Map and Reduces on Blocks\n\nRegular Julia map-reduce methods can be used on blocks. The map methods receive the chunks as they have been processed by the chain of actions composed into the Blocks.\n\n````\njulia\u003e ba = Block([1:100], 1, 10);\n\njulia\u003e pmap(x-\u003esum(x), ba)\n10-element Any Array:\n  55\n 155\n 255\n 355\n 455\n 555\n 655\n 755\n 855\n 955\n\njulia\u003e pmapreduce(x-\u003esum(x), +, ba)\n5050\n\njulia\u003e ba = Block([1:100], 1, 10);\n\njulia\u003e map(x-\u003esum(x), ba)\n10-element Any Array:\n  55\n 155\n 255\n 355\n 455\n 555\n 655\n 755\n 855\n 955\n\njulia\u003e mapreduce(x-\u003esum(x), +, ba)\n5050\n````\n\n### Defining Blocks for a new type\n\nIt is easy to define Blocks for a new type. 
The minimum requirement is just the constructor: `Block(data::T,...)`.\nIn the `Block{T}` structure returned,\n- elements of `block` define a chunk of `T` with enough information that can be serialized to a remote node and recreated\n- elements of `affinity` define one or more processors where the corresponding element of `block` can be accessed\n- the `prepare` function pre-processes `block` elements on the master node, before they are serialized to the remote node\n- the `filter` function processes `block` elements on the remote node\n\nBoth `prepare` and `filter` functions can be chained after construction.\n\nIn addition to that, you may also override the default implementations of the following:\n\n- `blocks{T}(b::Block{T})`: return an iterator over the chunk definitions\n- `affinities{T}(b::Block{T})`: return an iterator over the chunk affinities\n- `localpart(blk::Block)`: return only the blocks that are local to the current processor\n\n### Sample Use Cases (TODO)\n- Sorting Disk Files\n- Distributed DataFrame from streaming data\n- Continuous summarization of streaming data using DataFrames\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjuliaparallel%2Fblocks.jl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjuliaparallel%2Fblocks.jl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjuliaparallel%2Fblocks.jl/lists"}