{"id":18573013,"url":"https://github.com/buganini/cluto","last_synced_at":"2025-05-15T23:12:00.289Z","repository":{"id":2312589,"uuid":"3272329","full_name":"buganini/cluto","owner":"buganini","description":"Data extractor for PDF/Image datasets","archived":false,"fork":false,"pushed_at":"2023-02-24T16:55:18.000Z","size":441,"stargazers_count":1,"open_issues_count":3,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2024-12-26T14:27:57.663Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/buganini.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2012-01-26T08:48:16.000Z","updated_at":"2022-02-12T15:57:37.000Z","dependencies_parsed_at":"2023-07-06T17:32:17.785Z","dependency_job_id":null,"html_url":"https://github.com/buganini/cluto","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/buganini%2Fcluto","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/buganini%2Fcluto/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/buganini%2Fcluto/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/buganini%2Fcluto/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/buganini","download_url":"https://codeload.github.com/buganini/cluto/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":239309238,"owners_count":19617848,"icon_
url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-06T23:07:49.777Z","updated_at":"2025-02-17T14:40:28.684Z","avatar_url":"https://github.com/buganini.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# macOS\n```\nbrew install sip-install\npip install -r requirements.txt\n```\n\n# Dependencies\n```\n\tPython3\n\tPyQT5\n\tPyQT5-Poppler\n\tPIL\n\txpdfimport from LibreOffice/OpenOffice.org\n```\n\n# Query Language\n```\n\t${KEY}: local variable\n\t@{KEY}: lexical variable (inherited from parent)\n\t#{KEY}: list of children's local variables\n\t%{KEY}: item properties:\n\t\tCONTENT: content (image or text)\n\t\tTYPE: area type\n\t\tWIDTH: image width\n\t\tHEIGHT: image height\n\t\tFILE: original file path\n\t\tLEVEL:\tlevel\n\t\tCOUNT: number of children\n\t\tORDER: children are well ordered\n\t\tPAGE: page number in pdf file, always 0 for other file types\n\t\tINDEX: index in siblings, starting from 1\n\t\tHASH: assigned hash string\n\t\tTITLE: title (filename and page number, if available)\n\t\tTHIS: item itself\n\t\tPARENT: item's parent\n\tOperators:\n\t\t[]\n\t\t,\n\t\t+\n\t\t-\n\t\t==\n\t\t!=\n\t\t\u0026\u0026\n\t\t||\n\t\t!\n\t\t\u003e=\n\t\t\u003e\n\t\t\u003c=\n\t\t\u003c\n\tData Types:\n\t\tString: \"string\"\n\t\tNon-negative Integer: 1\n\t\tList: \"a\", 1\n\t\tSingle-item list: 1,\n\tFunctions:\n\t\tCHILD\n\t\t\tchild(%{this}, 1) -\u003e first child of this item\n\t\tABS\n\t\t\tabs(-1) -\u003e 1\n\t\tCONCAT\n\t\t\tconcat(\"abc\", \"def\") -\u003e \"abcdef\"\n\t\t\tconcat([\"abc\", \"def\"], [\"123\", \"456\"]) 
-\u003e [\"abc123\", \"def456\"]\n\t\tJOIN\n\t\t\tjoin(\" \",(\"a\",\"b\")) -\u003e \"a b\"\n\t\tBASENAME\n\t\t\tbasename(\"a/b/c.d\") -\u003e \"c.d\"\n\t\tDIRNAME\n\t\t\tdirname(\"a/b/c.d\") -\u003e \"a/b\"\n\t\tPREFIX\n\t\t\tprefix(\"c.d\") -\u003e \"c\"\n\t\tSUFFIX\n\t\t\tsuffix(\"c.d\") -\u003e \"d\"\n\t\tPATHJOIN\n\t\t\tpathjoin(\"a/b\",\"c.d\") -\u003e \"a/b/c.d\"\n\t\tLEN\n\t\t\tlen(\"a\",\"b\",\"abc\") -\u003e 3\n\t\tLIST_HAS\n\t\t\tlist_has((\"a\",\"b\"),\"a\") -\u003e true\n\t\t\tlist_has((\"a\",\"b\"),\"c\") -\u003e false\n\t\tSPLIT\n\t\t\tsplit(\";\",\"a;b\") -\u003e (\"a\",\"b\")\n\t\tSLICE\n\t\t\tslice((\"a\",\"b\",\"c\",\"d\"),1) -\u003e (\"b\",\"c\",\"d\")\n\t\t\tslice((\"a\",\"b\",\"c\",\"d\"),1,2) -\u003e (\"b\",\"c\")\n\t\tREPLACE\n\t\t\treplace(\"a\",\"b\",\"abc\") -\u003e \"bbc\"\n\t\tREGEX_REPLACE\n\t\t\tregex_replace(\"[bc]+\",\"z\",\"abcdcbe\") -\u003e \"azdze\"\n\t\tTEXT\n\t\t\tTEXT(content) -\u003e text\n\t\tCOALESCE\n\t\t\tcoalesce((\"\", \"a\",\"b\", \"\", \"c\")) -\u003e \"a\"\n\t\tFIELD\n\t\t\tfield(\"foo\", \"name:blah\\nfoo: bar\\nhoho\") -\u003e \"bar\"\n\t\tREJECT_FIELD\n\t\t\treject_field(\"aa\", \"aa:bb\\ncc:dd\") -\u003e \"cc:dd\"\n\t\t\treject_field((\"aa\",\"ee\"), \"aa:bb\\ncc:dd\\nee:ff\") -\u003e \"cc:dd\"\n\t\tREJECT\n\t\t\treject(\"aa\", \"aa\\nbb\\ncc\\ndd\") -\u003e \"bb\\ncc\\ndd\"\n\t\t\treject((\"aa\",\"cc\"), \"aa\\nbb\\ncc\\ndd\") -\u003e \"bb\\ndd\"\n\t\tLOOKUP\n\t\t\tlookup(\"k,v.csv\", k) -\u003e v\n\t\tENSURE\n\t\t\tensure(\"foo\", \"a\\nb\", 1) -\u003e \"a\\nfoo b\"\n\t\t\tensure(\"foo\", \"foo a\\nb\", 1) -\u003e \"foo a\\nb\"\n\t\tPRINT\n\t\t\tprint(x) -\u003e x # output to stdout\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbuganini%2Fcluto","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbuganini%2Fcluto","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbuganini%2Fcluto/lists"}