{"id":18382349,"url":"https://github.com/li-plus/chat4u","last_synced_at":"2025-08-20T14:30:39.432Z","repository":{"id":176008934,"uuid":"639948521","full_name":"li-plus/chat4u","owner":"li-plus","description":"用微信聊天记录训练一个你专属的聊天机器人","archived":false,"fork":false,"pushed_at":"2023-06-17T06:02:37.000Z","size":233,"stargazers_count":185,"open_issues_count":2,"forks_count":22,"subscribers_count":4,"default_branch":"main","last_synced_at":"2024-12-07T00:59:51.288Z","etag":null,"topics":["alpaca","chatbot","llama","wechat","wechaty"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/li-plus.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-12T15:40:41.000Z","updated_at":"2024-12-03T07:25:39.000Z","dependencies_parsed_at":null,"dependency_job_id":"6278be01-3c59-4b9f-8585-cc17afdf0833","html_url":"https://github.com/li-plus/chat4u","commit_stats":{"total_commits":4,"total_committers":1,"mean_commits":4.0,"dds":0.0,"last_synced_commit":"72c36dac855ca4dcb7fa33cc3677a41de4a69fdd"},"previous_names":["li-plus/chat4u"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/li-plus%2Fchat4u","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/li-plus%2Fchat4u/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/li-plus%2Fchat4u/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/li-plus%2Fchat4u/manifests","owner_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/owners/li-plus","download_url":"https://codeload.github.com/li-plus/chat4u/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":230431100,"owners_count":18224655,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["alpaca","chatbot","llama","wechat","wechaty"],"created_at":"2024-11-06T01:04:36.557Z","updated_at":"2024-12-19T12:08:58.418Z","avatar_url":"https://github.com/li-plus.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Chat4U\n\nTrain your own personal chatbot on your WeChat chat history.\n\n## Obtaining the Database Key\n\nWeChat stores chat history encrypted in sqlite databases, so the first step is to obtain the database key. You need a macOS laptop; the phone can be either Android or iPhone. Follow these steps:\n\n1. Sync the WeChat chat history to the Mac: in WeChat on the phone, go to Settings \u003e Chats \u003e Chat History Migration \u0026 Backup \u003e Migrate \u003e Migrate to WeChat for PC, and select the chats to sync.\n2. Download the dtrace script provided by [nalzok/wechat-decipher-macos](https://github.com/nalzok/wechat-decipher-macos).\n```sh\ngit clone https://github.com/nalzok/wechat-decipher-macos\n```\n3. Launch WeChat on the computer, stay on the login screen, and run the following command.\n```sh\nsudo ./wechat-decipher-macos/macos/dbcracker.d -p $(pgrep WeChat) | tee dbtrace.log\n```\n4. If you do not have dtrace permission, temporarily disable SIP (see [here](https://apple.stackexchange.com/questions/208762/now-that-el-capitan-is-rootless-is-there-any-way-to-get-dtrace-working)), then run the command above again.\n5. 
Log in to WeChat on the computer, then quit WeChat. The keys for all WeChat databases should now have been extracted and saved in `dbtrace.log`, as in the sample below.\n```\nsqlcipher '/Users/\u003cuser\u003e/Library/Containers/com.tencent.xinWeChat/Data/Library/Application Support/com.tencent.xinWeChat/2.0b4.0.9/5976edc4b2ac64741cacc525f229c5fe/Message/msg_0.db'\n--------------------------------------------------------------------------------\nPRAGMA key = \"x'\u003c384_bit_key\u003e'\";\nPRAGMA cipher_compatibility = 3;\nPRAGMA kdf_iter = 64000;\nPRAGMA cipher_page_size = 1024;\n........................................\n```\n\nUsers of other operating systems can try the approaches below. I have only researched them, not verified them, so treat them as pointers:\n* Android:\n  * Export `EnMicroMsg.db` with root access: https://github.com/ppwwyyxx/wechat-dump\n  * Brute-force the `EnMicroMsg.db` key: https://github.com/chg-hou/EnMicroMsg.db-Password-Cracker\n* iPhone:\n  * Back up with iTunes, then export: https://github.com/BlueMatthew/WechatExporter\n* Windows:\n  * https://bbs.kanxue.com/thread-251303.htm\n  * https://github.com/AdminTest0/SharpWxDump\n\n## Decrypting the Databases\n\nOn my macOS laptop, the WeChat chat history is stored in `msg_0.db` through `msg_9.db`, so only these databases need to be decrypted.\n\nDecryption requires [sqlcipher](https://github.com/sqlcipher/sqlcipher). On macOS, install it with:\n```sh\nbrew install sqlcipher\n```\n\nRun the following script, which parses `dbtrace.log` automatically, decrypts each `msg_x.db`, and exports it to `plain_msg_x.db`.\n```sh\npython3 decrypt.py\n```\n\n## Data Processing\n\nOpen a decrypted database `plain_msg_x.db` with https://sqliteviewer.app/ and find the table that holds the chat history you want. Fill in the database and table names in `prepare_data.py`, then run the script below to generate the training data `train.json`. The current strategy is fairly simple: it handles only single-turn dialogue and merges consecutive messages sent within 5 minutes of each other.\n```sh\npython3 prepare_data.py\n```\n\nA sample of the training data:\n```json\n[\n    {\"instruction\": \"你好\", \"output\": \"你好\"},\n    {\"instruction\": \"你是谁\", \"output\": \"你猜猜\"}\n]\n```\n\n## Model Training\n\nPrepare a Linux machine with GPUs and scp `train.json` to it.\n\nI used [stanford_alpaca](https://github.com/tatsu-lab/stanford_alpaca) for full fine-tuning (all parameters) of LLaMA-7B. Training on 90k samples for 3 epochs on 8 V100-SXM2-32GB GPUs took only 1 hour.\n```sh\n# clone the alpaca repo\ngit clone https://github.com/tatsu-lab/stanford_alpaca.git \u0026\u0026 cd stanford_alpaca\n# adjust deepspeed config ... 
such as disabling offloading\nvim ./configs/default_offload_opt_param.json\n# train with deepspeed zero3\ntorchrun --nproc_per_node=8 --master_port=23456 train.py \\\n    --model_name_or_path huggyllama/llama-7b \\\n    --data_path ../train.json \\\n    --model_max_length 128 \\\n    --fp16 True \\\n    --output_dir ../llama-wechat \\\n    --num_train_epochs 3 \\\n    --per_device_train_batch_size 8 \\\n    --per_device_eval_batch_size 8 \\\n    --gradient_accumulation_steps 1 \\\n    --evaluation_strategy \"no\" \\\n    --save_strategy \"epoch\" \\\n    --save_total_limit 1 \\\n    --learning_rate 2e-5 \\\n    --weight_decay 0. \\\n    --warmup_ratio 0.03 \\\n    --lr_scheduler_type \"cosine\" \\\n    --logging_steps 10 \\\n    --deepspeed \"./configs/default_offload_opt_param.json\" \\\n    --tf32 False\n```\n\nDeepSpeed ZeRO-3 saves the weights in shards, which need to be merged into a single PyTorch checkpoint file:\n```sh\ncd llama-wechat\npython3 zero_to_fp32.py . pytorch_model.bin\n```\n\nOn consumer GPUs, you can try [alpaca-lora](https://github.com/tloen/alpaca-lora) to fine-tune only the LoRA weights, which significantly reduces GPU memory usage and training cost.\n\n## Model Deployment\n\n### Frontend for Debugging\n\nYou can use [alpaca-lora](https://github.com/tloen/alpaca-lora) to deploy a gradio frontend for debugging. If you did full fine-tuning, comment out the peft-related code so that only the base model is loaded.\n```sh\ngit clone https://github.com/tloen/alpaca-lora.git \u0026\u0026 cd alpaca-lora\nCUDA_VISIBLE_DEVICES=0 python3 generate.py --base_model ../llama-wechat\n```\n\nResult:\n\n![](docs/alpaca-lora-ui.jpg)\n\n### Connecting to WeChat\n\nYou need to deploy a model service compatible with the OpenAI API. Here I lightly adapted [llama4openai-api.py](https://gist.github.com/kinoc/8a042d8c5683725aa8c372274c02ea2f); see [llama4openai-api.py](llama4openai-api.py) in this repository. Start the service:\n```sh\nCUDA_VISIBLE_DEVICES=0 python3 llama4openai-api.py\n```\n\nTest that the endpoint works:\n```sh\ncurl http://127.0.0.1:5000/chat/completions -v -H \"Content-Type: application/json\" -H \"Authorization: Bearer $OPENAI_API_KEY\" --data '{\"model\":\"llama-wechat\",\"max_tokens\":128,\"temperature\":0.95,\"messages\":[{\"role\":\"user\",\"content\":\"你好\"}]}'\n```\n\nUse 
[wechat-chatgpt](https://github.com/fuergaosi233/wechat-chatgpt) to connect to WeChat, setting the API address to that of your local model service:\n```sh\ndocker run -it --rm --name wechat-chatgpt \\\n    -e API=http://127.0.0.1:5000 \\\n    -e OPENAI_API_KEY=$OPENAI_API_KEY \\\n    -e MODEL=\"gpt-3.5-turbo\" \\\n    -e CHAT_PRIVATE_TRIGGER_KEYWORD=\"\" \\\n    -v $(pwd)/data:/app/data/wechat-assistant.memory-card.json \\\n    holegots/wechat-chatgpt:latest\n```\n\nResult:\n\n| ![](docs/chat1.jpg) | ![](docs/chat2.jpg) |\n|---------------------|---------------------|\n\n\"刚接入\" (\"just connected\") is the first thing the bot said, and the other person never figured it out, even at the end.\n\nOverall, a bot trained on chat history inevitably makes some common-sense errors, but it already imitates the chat style fairly well.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fli-plus%2Fchat4u","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fli-plus%2Fchat4u","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fli-plus%2Fchat4u/lists"}