{"id":40615365,"url":"https://github.com/x1angli/cvt2utf","last_synced_at":"2026-01-21T06:02:29.072Z","repository":{"id":26336172,"uuid":"29784886","full_name":"x1angli/cvt2utf","owner":"x1angli","description":"This lightweight tool converts non-UTF-encoded (such as GB2312, GBK, BIG5 encoded) files to UTF-8 encoding. ","archived":false,"fork":false,"pushed_at":"2024-03-22T09:58:50.000Z","size":86,"stargazers_count":104,"open_issues_count":0,"forks_count":28,"subscribers_count":5,"default_branch":"master","last_synced_at":"2026-01-14T16:06:26.802Z","etag":null,"topics":["byte-order-mark","text-encodings","utf8"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/x1angli.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2015-01-24T17:29:17.000Z","updated_at":"2025-12-09T08:55:09.000Z","dependencies_parsed_at":"2024-06-21T05:46:00.879Z","dependency_job_id":"87c8ee4d-75af-4d36-a2d9-c60b16e8844f","html_url":"https://github.com/x1angli/cvt2utf","commit_stats":{"total_commits":71,"total_committers":6,"mean_commits":"11.833333333333334","dds":0.5633802816901409,"last_synced_commit":"785cf0dfd249441aa91afcd32dee8e4240b55d19"},"previous_names":["x1angli/convert2utf","x1angli/covert-to-utf8","x1angli/convert_to_utf-8"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/x1angli/cvt2utf","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/x1angli%2Fcvt2utf","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/x1angli%2Fcvt2
utf/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/x1angli%2Fcvt2utf/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/x1angli%2Fcvt2utf/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/x1angli","download_url":"https://codeload.github.com/x1angli/cvt2utf/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/x1angli%2Fcvt2utf/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28628701,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-21T04:47:28.174Z","status":"ssl_error","status_checked_at":"2026-01-21T04:47:22.943Z","response_time":86,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["byte-order-mark","text-encodings","utf8"],"created_at":"2026-01-21T06:02:28.158Z","updated_at":"2026-01-21T06:02:29.055Z","avatar_url":"https://github.com/x1angli.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Converts text files or source code files into UTF-8 encoding\n\nA lightweight tool that converts txt and source code files into UTF-8 encodings.\nIt can either be executed from command line interface(a.k.a \"CLI\" or \"console\"), or imported into your own Python code.\n\n## Installation\n\n1. 
Make sure Python 3 (preferably 3.7 or above) is properly installed.\n   2. [Optional] Dependency management tools such as [Poetry](https://python-poetry.org/) are also recommended.\n1. Install Dependencies\n   2. In your console, execute `pip3 install cvt2utf`\n   2. Or, `pip3 install -r \"./requirements.txt\"`\n   2. Or, for Poetry users, run `poetry install`\n1. After installation, make sure `cvt2utf` is in your PATH environment variable.\n    \n## Usage\nThere is only one mandatory argument: filename, which can be either a directory or a file name. \n* ___Directory mode___: If the input is a directory, all text files underneath it that meet the criteria will be converted to UTF-8.\n* ___Single file mode___: If the input is an individual file, it is converted directly to UTF-8. \n\n___Examples:___\n\n* Changes all .txt files to UTF-8 encoding. Additionally, **removes BOMs** from utf_8_sig-encoded files: \n\n    `cvt2utf convert \"/path/to/your/repo\" `\n\n* Changes all .php files to UTF-8 encoding, but skips PHP files that are utf_8_sig-encoded: \n\n    `cvt2utf convert \"/path/to/your/repo\" -ext php --skiputf`\n\n* Changes all .csv files to UTF-8-SIG encoding.\n\n     Since BOMs are used by some applications (such as Microsoft Excel), we want to add a BOM.\n\n    `cvt2utf convert \"/path/to/your/repo\" -bom -ext csv`\n\n    \n* Converts all .c and .cpp files to UTF-8 with BOMs. \n\n    This action will also __add__ BOMs to existing UTF-encoded files. \n    \n    Visual Studio may mandate BOM in source files. 
If BOMs are missing, then Visual Studio may be unable to compile them.\n\n    `cvt2utf convert \"/path/to/your/repo\" -bom -ext c cpp`\n    \n* Converts an individual file \n\n    `cvt2utf convert \"/path/to/your/repo/a.txt\"`\n\n* After manually verifying that the new UTF-8 files are correct, you can remove all .bak files\n\n    `cvt2utf cleanbak \"/path/to/your/repo\" `\n\n\n* Alternatively, if you are extremely confident about everything, you can simply convert files without creating backups in the first place.\n    \n    Use the `--nobak` option with **extra caution**!\n\n    `cvt2utf convert \"/path/to/your/repo\" --nobak`\n\n* Display help information\n\n    `cvt2utf -h`\n\n* Show version information\n\n    `cvt2utf -v`\n\n## Usage Note\n\n### 1. About BOM\n\nBy default, the converted output text files will __NOT__ contain BOM (byte order mark). \n\nHowever, you can use the switch `-b` or `--addbom` to explicitly include BOM in the output text files. \n\n### 2. About file extensions\n\nYou should only feed text-like files to cvt2utf, while binary files (such as .exe files) **should be** left untouched. \nHow do we tell them apart? By file extension. By default, files with the extension `txt` will be processed.\nFeel free to customize this list either by editing the source code or with command line arguments.\n\n### 3. About file size limits\n\nEmpty files are ignored, as are files larger than 10MB. This is a reasonable limit. If you really want to change it, feel free to do so.\n\n## Trivial knowledge\n\n### 1. About BOM\nTo learn more about byte-order-mark (BOM), please check: https://en.wikipedia.org/wiki/Byte_order_mark \n\n#### 1.1 When should we remove BOM?\nBelow is a list of places where BOM might cause a problem. To make your life easy and smooth, we advise removing BOMs from these files.\n* __Jekyll__ : Jekyll is a Ruby-based CMS that generates static websites. Please remove BOMs in your source files. 
Also, remove them in your CSS if you are SASSifying.\n* __PHP__: BOMs in `*.php` files should be stripped.\n* __JSP__: BOMs in `*.jsp` files should be stripped. \n* (to be added...)\n\n#### 1.2 When should we add BOM?\nBOMs in these files are not necessary, but it is recommended to add them.\n\n* __Source Code in Visual Studio Projects__: \n    MSDN recommends: \"Always prefix a Unicode plain text file with a byte order mark\" [Link](https://msdn.microsoft.com/en-us/library/windows/desktop/dd374101(v=vs.85).aspx). \n    Visual Studio may mandate BOM in source files. If BOMs are missing, then Visual Studio may not be able to compile them.\n\n* __CSV__: \n    BOMs in CSV files might be useful and necessary, especially if the file is opened in Excel.\n\n### 2. About UTF \u0026 Unicode\n\n![img.png](https://ask.qcloudimg.com/draft/1300884/xmwux3k6z4.jpg)\n* **ASCII**: Just 1 byte. 1st byte: `00`~`7F`\n* **Latin-1**: Just 1 byte. ASCII charset + (`80`~`FF`)\n* **GB2312**: 2 bytes. ASCII charset + (1st byte: `A1`~`FE` (or more restrictively, `A1`~`F7`) with 2nd byte: `A1`~`FE`).\n* **GBK**: 2 bytes. ASCII charset + (1st byte: `81`~`FE` with 2nd byte: `40`~`FE`, excluding `7F`).\n* **UTF-8**: Variable length, 1 to 4 bytes, for code points `0x00`~`0x7F`; `0x80`~`0x7FF`; `0x800`~`0xFFFF`; `0x10000`~`0x10FFFF` respectively\n\n#### See Also\n* [其实你并不懂 Unicode by 纤夫张](https://zhuanlan.zhihu.com/p/53714077)\n* [UTF-8 编码及检查其完整性](https://github.com/hsiaosiyuan0/blog/blob/master/%2Fposts%2Fos%2FUTF-8%20%E7%BC%96%E7%A0%81%E5%8F%8A%E6%A3%80%E6%9F%A5%E5%85%B6%E5%AE%8C%E6%95%B4%E6%80%A7.md)\n\n\n## FAQ\n\n#### Why do we choose UTF-8 among all charsets? \n\nIt is the de facto standard for i18n.\n\nCompared with UTF-16, UTF-8 is usually more compact and \"with full fidelity\". It also doesn't suffer from the endianness issue of UTF-16. \n\n#### Why do we need this tool?\n\nIndeed, there are a bunch of text editors with stunning text encoding capabilities. Yet for users who want to do __batch conversions__, this tool can be handy. 
\n\nAdditionally, some users have pointed me to Linux commands such as `sed`, `iconv`, and `enca`. All of them share the limitation that they are Linux-only commands and are not applicable to other operating systems. \n* __`iconv`__ requires you to explicitly specify the \"from-encoding\" of the file. Moreover, it converts a single file at a time, so you have to write a bash script for batch conversion. Worst of all, it cannot adapt per file, so all files in a batch have to be encoded in the same character set. See [here](https://www.tecmint.com/convert-files-to-utf-8-encoding-in-linux/) for more information.\n* __`recode`__ is a nice and powerful tool. It goes further by supporting CR-LF conversion and Base64. See [here](https://stackoverflow.com/questions/64860/best-way-to-convert-text-files-between-character-sets) and [here](https://github.com/rrthomas/recode/).\n* __`sed`__ can be used to add or remove BOM. It can also be used in combination with `iconv`. \n* __`enca`__ is used to detect the current encoding of a file.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fx1angli%2Fcvt2utf","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fx1angli%2Fcvt2utf","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fx1angli%2Fcvt2utf/lists"}