{"id":13463059,"url":"https://github.com/afair/postgresql_cursor","last_synced_at":"2025-03-25T06:31:30.079Z","repository":{"id":908223,"uuid":"666958","full_name":"afair/postgresql_cursor","owner":"afair","description":"ActiveRecord PostgreSQL Adapter extension for using a cursor to return a large result set","archived":false,"fork":false,"pushed_at":"2024-06-03T11:56:20.000Z","size":150,"stargazers_count":621,"open_issues_count":8,"forks_count":47,"subscribers_count":11,"default_branch":"master","last_synced_at":"2025-03-02T10:36:11.414Z","etag":null,"topics":["activerecord","batch","cursor","for-update","large-scale","postgresql","postgresql-cursor","ruby","ruby-gem"],"latest_commit_sha":null,"homepage":"","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/afair.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2010-05-14T18:17:58.000Z","updated_at":"2025-02-14T15:49:27.000Z","dependencies_parsed_at":"2024-06-03T13:33:04.453Z","dependency_job_id":"14790ef6-f256-4411-83dd-722089b07332","html_url":"https://github.com/afair/postgresql_cursor","commit_stats":{"total_commits":123,"total_committers":21,"mean_commits":5.857142857142857,"dds":"0.33333333333333337","last_synced_commit":"14735ca64c4fd5cddef351efb510c4d69b940a8c"},"previous_names":[],"tags_count":17,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/afair%2Fpostgresql_cursor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/afair%2Fpostgresql_cursor/tags","releases_url":
"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/afair%2Fpostgresql_cursor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/afair%2Fpostgresql_cursor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/afair","download_url":"https://codeload.github.com/afair/postgresql_cursor/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245413831,"owners_count":20611353,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["activerecord","batch","cursor","for-update","large-scale","postgresql","postgresql-cursor","ruby","ruby-gem"],"created_at":"2024-07-31T13:00:45.216Z","updated_at":"2025-03-25T06:31:29.797Z","avatar_url":"https://github.com/afair.png","language":"Ruby","readme":"# PostgreSQLCursor for handling large Result Sets\n\n[![Gem Version](https://badge.fury.io/rb/postgresql_cursor.svg)](http://badge.fury.io/rb/postgresql_cursor)\n\nPostgreSQLCursor extends ActiveRecord to allow for efficient processing of queries\nreturning a large number of rows, and allows you to sort your result set.\n\nIn PostgreSQL, a\n[cursor](http://www.postgresql.org/docs/9.4/static/plpgsql-cursors.html)\nruns a query, from which you fetch a block of\n(say 1000) rows, process them, and continue fetching until the result\nset is exhausted. 
Fetching a smaller chunk of data at a time reduces the\namount of memory your application uses and prevents the potential crash\nof running out of memory.\n\nSupports Rails/ActiveRecord v3.1 (v3.2 recommended) and higher (including\nv5.0) and Ruby 1.9 and higher. Not all features work in ActiveRecord v3.1.\nSupport for this gem will only be for officially supported versions of\nActiveRecord and Ruby; others can try older versions of the gem.\n\n## Using Cursors\n\nPostgreSQLCursor was developed to take advantage of PostgreSQL's cursors. Cursors allow the program\nto declare a cursor for a given query, returning \"chunks\" of rows to the application program while\nretaining the position of the full result set in the database. This overcomes all the disadvantages\nof using find_each and find_in_batches.\n\nAlso, with PostgreSQL, you have the option to have raw hashes of the rows returned instead of the\ninstantiated models. An informal benchmark showed that returning instances is a factor of 4 times\nslower than returning hashes. If you can work with the data in this form, you will find better\nperformance.\n\nWith PostgreSQL, you can work with cursors as follows:\n\n```ruby\nProduct.where(\"id\u003e0\").order(\"name\").each_row { |hash| Product.process(hash) }\n\nProduct.where(\"id\u003e0\").each_instance { |product| product.process! 
}\nProduct.where(\"id\u003e0\").each_instance(block_size:100_000) { |product| product.process }\n\nProduct.each_row { |hash| Product.process(hash) }\nProduct.each_instance { |product| product.process }\n\nProduct.each_row_by_sql(\"select * from products\") { |hash| Product.process(hash) }\nProduct.each_instance_by_sql(\"select * from products\") { |product| product.process }\n```\n\nCursors must be run in a transaction if you need to fetch each row yourself:\n\n```ruby\nProduct.transaction do\n  cursor = Product.all.each_row\n  row = cursor.fetch                       #=\u003e {\"id\"=\u003e\"1\"}\n  row = cursor.fetch(symbolize_keys:true)  #=\u003e {:id =\u003e\"2\"}\n  cursor.close\nend\n```\n\nAll these methods take an options hash for finer control:\n\n    block_size:n      The number of rows to fetch from the database each time (default 1000)\n    while:value       Continue looping as long as the block returns this value\n    until:value       Continue looping until the block returns this value\n    connection:conn   Use this connection instead of the current Product connection\n    fraction:float    A value to set for the cursor_tuple_fraction variable.\n                      PostgreSQL uses 0.1 (optimize for 10% of result set)\n                      This library uses 1.0 (optimize for 100% of the result set)\n                      Do not override this value unless you understand it.\n    with_hold:boolean Keep the cursor \"open\" even after a commit.\n    cursor_name:string Give your cursor a name.\n\nNotes:\n\n* Use cursors *only* for large result sets. They have more overhead with the database\n  than ActiveRecord selecting all matching records.\n* Aliases each_hash and each_hash_by_sql are provided for each_row and each_row_by_sql\n  if you prefer to express what types are being returned.\n\n### PostgreSQLCursor is an Enumerable\n\nIf you do not pass in a block, the cursor is returned, which mixes in the Enumerable\nmodule. 
With that, you can pass it around, or chain in the awesome enumerable things\nlike `map` and `reduce`. Furthermore, the cursors already act as `lazy`, but you can\nalso chain in `lazy` when you want to keep the memory footprint small for the rest of the process.\n\n```ruby\nProduct.each_row.map {|r| r[\"id\"].to_i } #=\u003e [1, 2, 3, ...]\nProduct.each_instance.map {|r| r.id }.each {|id| p id } #=\u003e [1, 2, 3, ...]\nProduct.each_instance.lazy.inject(0) {|sum,r| sum + r.quantity } #=\u003e 499500\n```\n\n### PostgreSQLCursor and collection rendering\n\nYou can render a cursor collection by passing the enumerator as the collection attribute.\n\n```ruby\nrender partial: \"some_partial\", collection: Product.each_instance\nrender partial: \"some_partial\", collection: Product.each_row\nrender partial: \"some_partial\", collection: Product.each_hash\n```\n\n### Hashes vs. Instances\n\nThe each_row method returns a Hash of String values for speed (as this allows you to process a lot of rows).\nBecause the values are Strings, you must take care of any type conversion.\n\nWhen you use each_instance, ActiveRecord lazily casts these strings into\nRuby types (Time, Fixnum, etc.) only when you read the attribute.\n\nIf you find you need the types cast for your attributes, consider using each_instance\ninstead. ActiveRecord's read casting algorithm will only cast the values you need and\nhas become more efficient over time.\n\n### Select and Pluck\n\nTo limit the columns returned to just those you need, use the `.select(:id, :name)`\nquery method.\n\n```ruby\nProduct.select(:id, :name).each_row { |hash| Product.process(hash) }\n```\n\nPluck is a great alternative to using a cursor. It does not instantiate\nthe rows; it builds an array of result values and translates them into Ruby\nvalues (numbers, Timestamps, etc.). 
Using the cursor would still allow you to lazily\nload them in batches for very large sets.\n\nYou can also use `pluck_rows` or `pluck_instances` if the results\nwon't eat up too much memory.\n\n```ruby\nProduct.newly_arrived.pluck(:id) #=\u003e [1, 2, 3, ...]\nProduct.newly_arrived.each_row { |hash| }\nProduct.select(:id).each_row.map {|r| r[\"id\"].to_i } # cursor instead of pluck\nProduct.pluck_rows(:id) #=\u003e [\"1\", \"2\", ...]\nProduct.pluck_instances(:id, :quantity) #=\u003e [[1, 503], [2, 932], ...]\n```\n\n### Associations and Eager Loading\n\nActiveRecord performs some magic when eager-loading associated rows. It\nwill usually not join the tables, and prefers to load the data in\nseparate queries.\n\nThis library hooks onto the `to_sql` feature of the query builder. As a\nresult, it can't do the join if ActiveRecord decided not to join, nor\ncan it construct the association objects eagerly.\n\n## Locking and Updating Each Row (FOR UPDATE Queries)\n\nWhen you use the AREL `lock` method, a \"FOR UPDATE\" clause is added to\nthe query. This causes the block of rows returned from each FETCH\noperation (see the `block_size` option) to be locked for you to update.\nThe lock is released on those rows once the block is exhausted and the\nnext FETCH or CLOSE statement is executed.\n\nThis example will run through a large table and potentially update each\nrow, locking only a set of rows at a time to allow concurrent use.\n\n```ruby\nProduct.lock.each_instance(block_size:100) do |p|\n  p.update(price: p.price * 1.05)\nend\n```\n\nAlso, pay attention to the `block_size` you request. Locking large\nblocks of rows for an extended time can cause deadlocks or other\nperformance issues in your application. 
On a busy table, or if the\nprocessing of each row consumes a lot of time or resources, try a\n`block_size` \u003c= 10.\n\nSee the [PostgreSQL Select Documentation](https://www.postgresql.org/docs/current/static/sql-select.html)\nfor more information and limitations when using \"FOR UPDATE\" locking.\n\n## Background: Why PostgreSQL Cursors?\n\nActiveRecord is designed and optimized for web performance. In a web transaction, only a \"page\" of\naround 20 rows is returned to the user. When you do this:\n\n```ruby\nProduct.where(\"id\u003e0\").each { |product| product.process }\n```\n\nThe database returns all matching result set rows to ActiveRecord, which instantiates each row with\nthe data returned. This method returns an array of all these rows to the caller.\n\nAsynchronous, background, or offline processing may require processing a large amount of data.\nWhen there is a very large number of rows, this requires a lot more memory to hold the data. Ruby\ndoes not return that memory after processing the array, and this causes your process to \"bloat\". 
If you\ndon't have enough memory, it will cause an exception.\n\n### ActiveRecord.find_each and find_in_batches\n\nTo solve this problem, ActiveRecord gives us two alternative methods that work in \"chunks\" of your data:\n\n```ruby\nProduct.where(\"id\u003e0\").find_each { |model| model.process }\n\nProduct.where(\"id\u003e0\").find_in_batches do |batch|\n  batch.each { |model| model.process }\nend\n```\n\nOptionally, you can specify a :batch_size option as the size of the \"chunk\"; it defaults to 1000.\n\nThere are drawbacks with these methods:\n\n* You cannot specify the order; results are ordered by the primary key (usually id)\n* The primary key must be numeric\n* The query is rerun for each chunk (1000 rows), starting after the last id of the previous chunk.\n* You cannot use overly complex queries, as they will be rerun for each chunk and incur more overhead.\n\n### How it works\n\nUnder the covers, the library calls the PostgreSQL cursor operations\nwith this pseudo-code:\n\n    SET cursor_tuple_fraction TO 1.0;\n    DECLARE cursor_1 CURSOR WITH HOLD FOR select * from widgets;\n    loop\n      rows = FETCH 100 FROM cursor_1;\n      rows.each {|row| yield row}\n    until rows.size \u003c 100;\n    CLOSE cursor_1;\n\n## Meta\n### Author\nAllen Fair, [@allenfair](https://twitter.com/allenfair), [github://afair](https://github.com/afair)\n\n### Note on Patches/Pull Requests\n\n* Fork the project.\n* Make your feature addition or bug fix.\n* Add tests for it. This is important so I don't break it in a\n  future version unintentionally.\n* Commit; do not mess with the rakefile, version, or history.\n  (If you want to have your own version, that is fine, but bump the version in a commit by itself that I can ignore when I pull.)\n* Send me a pull request. 
Bonus points for topic branches.\n\n### Code of Conduct\n\nThis project adheres to the [Open Code of Conduct](http://todogroup.org/opencodeofconduct/#postgresql_cursor/2016@allenfair.com).\nBy participating, you are expected to honor this code.\n\n### Copyright\n\nCopyright (c) 2010-2017 Allen Fair. See (MIT) LICENSE for details.\n","funding_links":[],"categories":["Data Persistence","Ruby","Gems"],"sub_categories":["SQL Database Adapters","Performance Optimization"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fafair%2Fpostgresql_cursor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fafair%2Fpostgresql_cursor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fafair%2Fpostgresql_cursor/lists"}