{"id":22443175,"url":"https://github.com/riversun/bigdoc","last_synced_at":"2025-08-01T18:34:07.848Z","repository":{"id":57740083,"uuid":"76413201","full_name":"riversun/bigdoc","owner":"riversun","description":"This library allows you to handle gigabyte order huge files easily with high performance. You can search bytes or words / read data/text from huge files.","archived":false,"fork":false,"pushed_at":"2023-04-11T02:45:13.000Z","size":138,"stargazers_count":12,"open_issues_count":1,"forks_count":4,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-04-24T12:42:44.047Z","etag":null,"topics":["file-search","huge-files","search-algorithm","search-binary","search-bytes"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/riversun.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-12-14T01:28:07.000Z","updated_at":"2024-03-10T08:14:15.000Z","dependencies_parsed_at":"2022-08-30T10:42:24.027Z","dependency_job_id":null,"html_url":"https://github.com/riversun/bigdoc","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/riversun%2Fbigdoc","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/riversun%2Fbigdoc/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/riversun%2Fbigdoc/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/riversun%2Fbigdoc/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/riversun","download_url":"https://c
odeload.github.com/riversun/bigdoc/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":228398931,"owners_count":17913696,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["file-search","huge-files","search-algorithm","search-binary","search-bytes"],"created_at":"2024-12-06T02:22:56.274Z","updated_at":"2024-12-06T02:22:57.738Z","avatar_url":"https://github.com/riversun.png","language":"Java","readme":"# Overview\r\n'bigdoc' allows you to handle gigabyte order files easily with high performance.\r\nYou can search bytes or words / read data/text from huge files.\r\n\r\nIt is licensed under [MIT license](https://opensource.org/licenses/MIT).\r\n\r\n[![Maven Central](https://maven-badges.herokuapp.com/maven-central/org.riversun/bigdoc/badge.svg)](https://maven-badges.herokuapp.com/maven-central/org.riversun/bigdoc)\r\n\r\n\r\n# Quick start\r\n## Search sequence of bytes from a big file quickly.\r\n\r\nSearch mega-bytes,giga-bytes order file.\r\n\r\n```java\r\npackage org.example;\r\n\r\nimport java.io.File;\r\nimport java.util.List;\r\n\r\nimport org.riversun.bigdoc.bin.BigFileSearcher;\r\n\r\npublic class Example {\r\n\r\n\tpublic static void main(String[] args) throws Exception {\r\n\r\n\t\tbyte[] searchBytes = \"hello world.\".getBytes(\"UTF-8\");\r\n\r\n\t\tFile file = new File(\"/var/tmp/yourBigfile.bin\");\r\n\r\n\t\tBigFileSearcher searcher = new BigFileSearcher();\r\n\r\n\t\tList\u003cLong\u003e findList = searcher.searchBigFile(file, searchBytes);\r\n\r\n\t\tSystem.out.println(\"positions = \" + 
findList);\r\n\t}\r\n}\r\n```\r\n\r\n## Example code for canceling a search in progress\r\n\r\nWhen used asynchronously, #cancel can be used to stop the process in the middle of a search.\r\n\r\n```java\r\npackage org.riversun.bigdoc.bin;\r\n\r\nimport java.io.File;\r\nimport java.io.UnsupportedEncodingException;\r\nimport java.util.List;\r\n\r\nimport org.riversun.bigdoc.bin.BigFileSearcher.OnRealtimeResultListener;\r\n\r\npublic class Example {\r\n\r\n  public static void main(String[] args) throws UnsupportedEncodingException, InterruptedException {\r\n    byte[] searchBytes = \"sometext\".getBytes(\"UTF-8\");\r\n    \r\n    File file = new File(\"path/to/file\");\r\n\r\n    final BigFileSearcher searcher = new BigFileSearcher();\r\n\r\n    searcher.setUseOptimization(true);\r\n    searcher.setSubBufferSize(256);\r\n    searcher.setSubThreadSize(Runtime.getRuntime().availableProcessors());\r\n\r\n    final SearchCondition sc = new SearchCondition();\r\n    \r\n    sc.srcFile = file;\r\n    sc.startPosition = 0;\r\n    sc.searchBytes = searchBytes;\r\n\r\n    sc.onRealtimeResultListener = new OnRealtimeResultListener() {\r\n\r\n      @Override\r\n      public void onRealtimeResultListener(float progress, List\u003cLong\u003e pointerList) {\r\n        System.out.println(\"progress:\" + progress + \" pointerList:\" + pointerList);\r\n      }\r\n    };\r\n\r\n    final Thread th = new Thread(new Runnable() {\r\n\r\n      @Override\r\n      public void run() {\r\n        List\u003cLong\u003e searchBigFileRealtime = searcher.searchBigFile(sc);\r\n      }\r\n    });\r\n\r\n    th.start();\r\n\r\n    Thread.sleep(1500);\r\n\r\n    searcher.cancel();\r\n\r\n    th.join();\r\n\r\n  }\r\n}\r\n\r\n```\r\n\r\n## Performance Test\r\nSearch sequence of bytes from big file\r\n\r\n### Environment\r\nTested on AWS t2.*\u003cbr\u003e\r\n\r\n### Results\r\n\u003ctable\u003e\r\n\u003ctr\u003e\u003ctd\u003eCPU Instance\u003c/td\u003e \u003ctd\u003eEC2 t2.2xlarge\u003cbr\u003evCPU x 
8,32GiB\u003c/td\u003e  \u003ctd\u003eEC2 t2.xlarge\u003cbr\u003evCPU x 4,16GiB\u003c/td\u003e\u003ctd\u003eEC2 t2.large\u003cbr\u003evCPU x 2,8GiB\u003c/td\u003e\u003ctd\u003eEC2 t2.medium\u003cbr\u003evCPU x 2,4GiB\u003c/td\u003e         \u003c/tr\u003e\r\n\u003ctr\u003e\u003ctd\u003eFile Size\u003c/td\u003e    \u003ctd\u003eTime(sec)\u003c/td\u003e                              \u003ctd\u003eTime(sec)\u003c/td\u003e                           \u003ctd\u003eTime(sec)\u003c/td\u003e                         \u003ctd\u003eTime(sec)\u003c/td\u003e                                    \u003c/tr\u003e\r\n\u003ctr\u003e\u003ctd\u003e10MB\u003c/td\u003e         \u003ctd\u003e0.5s\u003c/td\u003e                              \u003ctd\u003e0.6s\u003c/td\u003e                           \u003ctd\u003e0.8s\u003c/td\u003e                         \u003ctd\u003e0.8s\u003c/td\u003e                                     \u003c/tr\u003e\r\n\u003ctr\u003e\u003ctd\u003e50MB\u003c/td\u003e         \u003ctd\u003e2.8s\u003c/td\u003e                              \u003ctd\u003e5.9s\u003c/td\u003e                           \u003ctd\u003e13.4s\u003c/td\u003e                        \u003ctd\u003e12.8s\u003c/td\u003e                                       \u003c/tr\u003e\r\n\u003ctr\u003e\u003ctd\u003e100MB\u003c/td\u003e        \u003ctd\u003e5.4s\u003c/td\u003e                              \u003ctd\u003e10.7s\u003c/td\u003e                          \u003ctd\u003e25.9s\u003c/td\u003e                        \u003ctd\u003e25.1s\u003c/td\u003e                                        \u003c/tr\u003e\r\n\u003ctr\u003e\u003ctd\u003e250MB\u003c/td\u003e        \u003ctd\u003e15.7s\u003c/td\u003e                             \u003ctd\u003e32.6s\u003c/td\u003e                          \u003ctd\u003e77.1s\u003c/td\u003e                        \u003ctd\u003e74.8s\u003c/td\u003e                                          \u003c/tr\u003e\r\n\u003ctr\u003e\u003ctd\u003e1GB\u003c/td\u003e          
\u003ctd\u003e55.9s\u003c/td\u003e                             \u003ctd\u003e120.5s\u003c/td\u003e                         \u003ctd\u003e286.1s\u003c/td\u003e                            \u003ctd\u003e-\u003c/td\u003e                                       \u003c/tr\u003e\r\n\u003ctr\u003e\u003ctd\u003e5GB\u003c/td\u003e          \u003ctd\u003e259.6s\u003c/td\u003e                            \u003ctd\u003e566.1s\u003c/td\u003e                         \u003ctd\u003e-\u003c/td\u003e                            \u003ctd\u003e-\u003c/td\u003e                                         \u003c/tr\u003e\r\n\u003ctr\u003e\u003ctd\u003e10GB\u003c/td\u003e         \u003ctd\u003e507.0s\u003c/td\u003e                            \u003ctd\u003e1081.7s\u003c/td\u003e                        \u003ctd\u003e-\u003c/td\u003e                            \u003ctd\u003e-\u003c/td\u003e                                          \u003c/tr\u003e\r\n\u003c/table\u003e\r\n\r\nPlease Note\r\n\r\n- Processing speed depends on the number of CPU cores (including hyper-threading), not on memory capacity.\r\n- Results vary with the Java environment, the Java version, and compiler or runtime optimizations.\r\n\r\n# Architecture and Tuning\r\n\r\n![architecture](https://riversun.github.io/img/bigdoc_how_to_tune.png\r\n \"architecture\")\r\n\r\nYou can tune the performance using the following methods.\r\nThey can be adjusted according to the number of CPU cores and the memory capacity.\r\n\r\n- BigFileSearcher#setBlockSize\r\n- BigFileSearcher#setMaxNumOfThreads\r\n- BigFileSearcher#setBufferSizePerWorker\r\n- BigFileSearcher#setBufferSize\r\n- BigFileSearcher#setSubThreadSize\r\n\r\nBigFileSearcher searches for a sequence of bytes by dividing a big file into multiple blocks.\r\nMultiple workers search multiple blocks concurrently.\r\nEach worker thread searches one block sequentially.\r\nThe number of workers is specified by #setMaxNumOfThreads.\r\nWithin a single worker thread, it 
reads data into memory in chunks of the size specified by #setBufferSize and searches each chunk.\r\nA small area, used to compare sequences of bytes when searching, is called a window, and the size of that window is specified by #setSubBufferSize.\r\nMultiple windows can be processed concurrently, and the number of concurrent operations in a worker is specified by #setSubThreadSize.\r\n\r\n\r\n\r\n# More Details\r\nSee the Javadoc:\r\n\r\nhttps://riversun.github.io/javadoc/bigdoc/\r\n\r\n# Downloads\r\n## maven\r\n- Add the following dependency to your Maven pom.xml file.\r\n```xml\r\n\r\n\u003cdependency\u003e\r\n    \u003cgroupId\u003eorg.riversun\u003c/groupId\u003e\r\n    \u003cartifactId\u003ebigdoc\u003c/artifactId\u003e\r\n    \u003cversion\u003e0.4.0\u003c/version\u003e\r\n\u003c/dependency\u003e\r\n```\r\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Friversun%2Fbigdoc","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Friversun%2Fbigdoc","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Friversun%2Fbigdoc/lists"}