{"id":13442707,"url":"https://github.com/defunkydrummer/auto-text","last_synced_at":"2025-03-20T15:30:23.823Z","repository":{"id":91185922,"uuid":"175478102","full_name":"defunkydrummer/auto-text","owner":"defunkydrummer","description":"Automatic  (encoding, end of line, column width, etc) detection for text files. 100% Common Lisp.","archived":false,"fork":false,"pushed_at":"2019-04-15T16:46:44.000Z","size":90,"stargazers_count":10,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2024-10-28T05:59:37.853Z","etag":null,"topics":["common","common-lisp","csv","encoding","eol","text"],"latest_commit_sha":null,"homepage":"","language":"Common Lisp","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/defunkydrummer.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-03-13T18:34:24.000Z","updated_at":"2021-12-13T23:19:26.000Z","dependencies_parsed_at":null,"dependency_job_id":"e077bf07-e1a2-457e-a6b5-68477ee54f27","html_url":"https://github.com/defunkydrummer/auto-text","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/defunkydrummer%2Fauto-text","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/defunkydrummer%2Fauto-text/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/defunkydrummer%2Fauto-text/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/defunkydrummer%2Fauto-text/manif
ests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/defunkydrummer","download_url":"https://codeload.github.com/defunkydrummer/auto-text/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244639815,"owners_count":20485932,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["common","common-lisp","csv","encoding","eol","text"],"created_at":"2024-07-31T03:01:49.461Z","updated_at":"2025-03-20T15:30:23.810Z","avatar_url":"https://github.com/defunkydrummer.png","language":"Common Lisp","funding_links":[],"categories":["Common Lisp","Expert Systems"],"sub_categories":[],"readme":"## AUTO-TEXT\n\nThis is a Common Lisp library for working with unknown \"real-life\" text files that hopefully hold data in table form (e.g. CSV files, fixed-width files, etc.). \n\nGiven such a text file, it is useful to find out the following:\n\n- Which file encoding can we use? UTF-8, 16, 32? ISO-8859-1? Windows-1252?\n\n- Is the file delimited? Which is the delimiter? (e.g. tab, comma, etc.)\n\n- Which is the line ending? CR, LF? CR and LF? Does it have mixed line endings? 
(Files with more than one kind of end-of-line indicator!) \n\n- Given the above, can we read the file as CSV and get the same number of columns in every row?\n\n- Can we get how many times each different character appears throughout a file?\n\n- Can we sample some rows, without having to hold the whole file in memory?\n\n**Plus**\n\n- It does not hold the whole file in memory, so it should work for files of any size.\n\n- It works reasonably fast.\n\n ## Caveats\n \n The file encoding detection is able to discriminate between:\n \n - Files with BOM: UTF-8, 16LE, 16BE, and 32.\n \n - ISO-8859-1, Windows-1252 (Latin European encodings)\n \n - Neither of the above.\n \n This library is intended for working with English or \"latin-1\" (Spanish, French, Portuguese, Italian) data, thus the choice of encodings.\n \n However, more encoding detection rules can be added or edited easily; see `encoding.lisp`, whose code is simple to understand. \n\nFor detection of encodings used by **other languages** such as Chinese, Japanese, Korean, Russian, Arabic, and Greek, see the [inquisitor](https://github.com/t-sin/inquisitor) lib.  Inquisitor doesn't work with Latin or Western European languages.\n\n## Usage\n\n**Main usage:**\n\n```lisp\n(ql:quickload \"auto-text\")\n\n(auto-text:analyze \"my-file.txt\") \n```\n\nSample output:\n\n```\nCL-USER\u003e (auto-text:analyze #P\"my.txt\" :silent nil)\nReading file for analysis... my.txt\nEol-type: CRLF\nLikely delimiter? #\\,  \nBOM: NIL \nPossible encodings:  UTF-8 \nSampling 10 rows as CSV for checking width...\n(:SAME-NUMBER-OF-COLUMNS T :DELIMITER #\\, :EOL-TYPE :CRLF :BOM-TYPE NIL\n :ENCODING :UTF-8)\n \nCL-USER\u003e (auto-text:analyze #P\"file.txt\" :silent nil)\nReading file for analysis... file.txt\nEol-type: CRLF\nLikely delimiter? 
#\\Return  \nBOM: NIL \nPossible encodings:  WINDOWS-1252 \n(:DELIMITER #\\Return :EOL-TYPE :CRLF :BOM-TYPE NIL :ENCODING :WINDOWS-1252)\n```\nSilence messages by passing `:silent T`.\n\nThe `analyze` function returns a property list (plist) with the following keys:\n\n- :same-number-of-columns : True if the file appears to have the same number of columns in every row when read as a CSV.\n\n- :delimiter : column separator char\n\n- :eol-type : type of end-of-line char (:CR, :LF, :CRLF, :MIXED, :NO-LINE-ENDING)\n\n- :bom-type : if a BOM is detected, the UTF type it reports.\n\n- :encoding : detected encoding.\n\n\n**Sample an arbitrary number of rows (lines) from an arbitrarily-sized file:** See `sample-rows-string`:\n\n```lisp\n(defun sample-rows-string (path \u0026key (eol-type :crlf)\n                                     (encoding :utf-8)\n                                     (sample-size 10))\n  \"Sample some rows from path, return as list of strings.\nEach line does not include the EOL\"\n  (loop for rows in (sample-rows-bytes path :eol-type eol-type\n                                            :sample-size sample-size)\n        collecting \n        (bytes-to-string rows encoding)))\n```\n\n**Obtain a histogram** or frequency hash-table that tells you how many times each character appears in the whole file. (NOTE: This reads the file in byte mode): See `histogram-binary-file` in `histogram.lisp`\n\n## Config\n\nSee `config.lisp` to configure:\n\n- Buffer sizes (important: line buffer size, default 32 KB)\n\n- CSV delimiters to look for.\n\n## Hacking\n\n- Feel free to improve the encoding detection rules in `encoding.lisp` and send me a PR.\n\n- There are more features inside the source code that will later be polished, exported, and 'surfaced' to this readme. \n\n## Implementations\n\nSo far it works on SBCL, CCL, CLISP, and ABCL.\n\nIt should run on any implementation where BABEL and CL-CSV work fine!\n\n## Author\n\nFlavio E. 
also known as *defunkydrummer*\n\n## License\n\nMIT\n\n## See also\n\nThe aforementioned [inquisitor](https://github.com/t-sin/inquisitor) lib. This library shares no code with inquisitor. \n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdefunkydrummer%2Fauto-text","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdefunkydrummer%2Fauto-text","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdefunkydrummer%2Fauto-text/lists"}