{"id":13421892,"url":"https://github.com/harveyaot/DianJing","last_synced_at":"2025-03-15T10:31:35.414Z","repository":{"id":45156993,"uuid":"90511288","full_name":"harveyaot/DianJing","owner":"harveyaot","description":"点睛 - 头条号文章标题生成工具（Dianjing, AI to write Title for Articles）","archived":false,"fork":false,"pushed_at":"2018-02-28T05:03:54.000Z","size":13656,"stargazers_count":241,"open_issues_count":2,"forks_count":62,"subscribers_count":12,"default_branch":"master","last_synced_at":"2024-10-27T22:29:02.745Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/harveyaot.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-05-07T05:35:24.000Z","updated_at":"2024-08-01T15:56:12.000Z","dependencies_parsed_at":"2022-07-13T16:50:31.169Z","dependency_job_id":null,"html_url":"https://github.com/harveyaot/DianJing","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/harveyaot%2FDianJing","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/harveyaot%2FDianJing/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/harveyaot%2FDianJing/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/harveyaot%2FDianJing/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/harveyaot","download_url":"https://codeload.github.com/harveyaot/DianJing/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github"
,"repositories_count":243719070,"owners_count":20336591,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-30T23:00:33.537Z","updated_at":"2025-03-15T10:31:30.404Z","avatar_url":"https://github.com/harveyaot.png","language":"Jupyter Notebook","funding_links":[],"categories":["内容运营","Jupyter Notebook"],"sub_categories":[],"readme":"# DianJing\n点睛 (DianJing) - a title generation tool for Toutiao articles\n\n1. Function:\n Automatically generate a list of candidate titles, in Toutiao style, for an article.\n2. Delivery form:\n Initially a Linux client; later, a front-end web page or a Chrome extension.\n3. Core technique:\n An encoder-decoder model trained on Toutiao abstract-title pairs.\n4. Data source:\n Mainly the Toutiao data API, used to crawl on the order of tens of thousands of training samples.\n\n## Data Usage and Crawl\n1. Available training data: `./data/basic_data_80k_v2.pkl.tgz` contains about 61K (abstract, title) pairs. After extracting it with `tar -xzvf basic_data_80k_v2.pkl.tgz`, use `data_utils.py` under `./scripts` to check the number of records and display sample entries.\n2. A larger data set with about 700K training samples will be released later (ETA Mar. 2018).\n3. Use `./scripts/crawl.py` to crawl Toutiao data. You must supply the `as` and `cp` parameters from the Toutiao feed URL; they are best refreshed every three days. To obtain them:\n Read the two parameters from the latest feed request URL in the Network tab of the Chrome browser.\n ![](./image/ascp.png)\n\n## Experiment Log\n1. 2017/05/27 Used about 30K abstract-title training pairs, with a 100-dimensional embedding per Chinese character, a CNN encoder, and an RNN decoder with GRU units. Results after one day of training (500 epochs):\n * ![](./image/train_res_20170527.png)\n * Analysis:\n   * The model can mostly extract the key semantics of the description\n   * But the readability of the generated text is poor\n * Directions for improvement:\n   * The training samples may be insufficient\n   * Work at the word level using Chinese word segmentation rather than at the character level\n   * LSTMs are not good at generating long text; consider a language model trained on a large corpus\n2. 
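2017/06/01\n * Improvements:\n   * Crawled 600K articles for 8,000 keywords (dict/keywords.select) via the search API\n   * Used jieba for word segmentation\n   * Readability improved on the 30K training samples\n * Problem:\n   * Severe OOV (out-of-vocabulary) problem on the 600K data set; the model struggles to converge\n * Solutions:\n   * Increase the vocabulary size\n   * First train an RNN language model so that the output reads fluently\n   * Then generate text conditioned on the input\n3. 2017/06/20\n * ![](./image/train_res_20170620_2.png)\n * The predicted titles are convincing enough to pass for real ones; without checking, you really would be fooled\n * ![](./image/train_res_20170620.png)\n * Analysis:\n   * Increased the vocabulary size to 212K\n   * Pre-training the RNN decoder greatly improved title readability\n   * With 700K training samples, semantic understanding became more accurate\n * Next improvements:\n   * Lower the learning rate and batch_size to increase exploration\n   * Consider how to solve the OOV problem\n   * Main priority: push the three current versions (search, abstract-embedding, AI) online\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fharveyaot%2FDianJing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fharveyaot%2FDianJing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fharveyaot%2FDianJing/lists"}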