{"id":16269276,"url":"https://github.com/thinkwee/ir_ie_work","last_synced_at":"2025-04-08T15:17:41.339Z","repository":{"id":79907404,"uuid":"141010137","full_name":"thinkwee/IR_IE_Work","owner":"thinkwee","description":"a simple search system for date matching","archived":false,"fork":false,"pushed_at":"2019-02-25T14:55:28.000Z","size":19138,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-02-14T11:33:36.980Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/thinkwee.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-07-15T08:17:25.000Z","updated_at":"2024-10-11T08:34:19.000Z","dependencies_parsed_at":"2023-05-13T05:46:00.968Z","dependency_job_id":null,"html_url":"https://github.com/thinkwee/IR_IE_Work","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thinkwee%2FIR_IE_Work","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thinkwee%2FIR_IE_Work/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thinkwee%2FIR_IE_Work/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thinkwee%2FIR_IE_Work/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/thinkwee","download_url":"https://codeload.github.com/thinkwee/IR_IE_Work/tar.gz/refs/heads/master","host
":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247867365,"owners_count":21009240,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-10T18:07:54.180Z","updated_at":"2025-04-08T15:17:41.309Z","avatar_url":"https://github.com/thinkwee.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"#   信息抽取与信息检索大作业\n\n##  项目环境\n-\t大作业构建为一个django项目，后端使用python，前端使用html+css，前后端数据交互采用django，数据多记录成词典并用pickle存取。\n-\tPython3 、Django2.0.7、结巴分词、Ubuntu16.04\n\n## 项目文件\n-\t主要在mysite文件夹下\n-\tmysite\n\t-\tmysite:主站点配置\n\t-\tdb.sqlite3:默认数据库，未使用\n\t-\tmanage.py:工作脚本，通过此脚本启动本地服务器\n\t-\tsearch:大作业搜索引擎网页app\n\t\t-\tcontent:归一化tfidf检索所用语料，处理成编号形式\n\t\t-\tcorpora:结巴tfidf和词嵌入检索所用语料，与content相同，但保留原标题\n\t\t-\tkeywords:语料关键词及其tfidf值\n\t\t-\tmigrations:Django项目自带目录\n\t\t-\tstatic:Django项目静态文件项目\n\t\t-\ttemplates:Django项目模板文件夹，包含搜索和搜索结果两个主要的html文件\n\t\t-\ttfidf:pickle保存的所有语料关键词词典及其tfidf值\n\t\t-\tadmin.py、apps.py、models.py、tests.py、urls.py:Django项目自带文件\n\t\t-\tbbs.model:归一化检索所用数据\n\t\t-\tbyr.py:爬虫脚本\n\t\t-\tcreate_embedding.py:载入预训练词嵌入，并只保留语料中存在的单词\n\t\t-\tcreate_idf.py:计算idf表\n\t\t-\tcreate_raw.py:整理爬取后的文件\n\t\t-\tdict_idf.pickle、time.pickle、sender.pickle、embedding.pickle:pickle保存的idf词典、发帖时间和发帖人词典、词嵌入模型\n\t\t-\tforms.py:创建搜索框\n\t\t-\tindex.txt:归一化tfidf检索所用编号语料索引\n\t\t-\tir.py:词嵌入和结巴tfidf索引主程序\n\t\t-  preprocessing.py:创建tfidf表，信息抽取程序\n\t\t-  pretreat:语料预处理\n\t\t-  segment.py:归一化tfidf检索主程序\n\t\t-  segment.txt:分词文件\n\t\t-  stopwords.txt:停用词表\n\t\t-  
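views.py: main program for data exchange between the Django front end and back end\n\n##\tHow to Use\nOnce the environment is set up, run the crawler script and the preprocessing scripts to obtain the data, change to the directory containing manage.py, and enter on the command line:\n```python\npython manage.py runserver\n```\nThen enter the following in a browser:\n```\nhttp://127.0.0.1:8000\n```\nThis loads the search page, which is then ready to use.\n\n## Screenshots\n-\t![image1](https://github.com/thinkwee/IR_IE_Work/blob/master/image/1.jpg)\n-\t![image2](https://github.com/thinkwee/IR_IE_Work/blob/master/image/2.jpg)\n-\t![image3](https://github.com/thinkwee/IR_IE_Work/blob/master/image/3.jpg)\n\n##\tKnown Issues\nNormalized TF-IDF is slightly incompatible with how titles are handled, so you may run into dictionary lookup errors. If so, remove the corresponding parts from views.py and response.html, then modify and run segment.py directly to perform retrieval.\n\n##\tData\nThe data was crawled from the 缘来如此 board of the BYR forum (北邮人论坛). Anyone who needs it can adapt the crawler and collect it themselves; the post data is not provided for now. Note that you must log in with your own username and password.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthinkwee%2Fir_ie_work","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthinkwee%2Fir_ie_work","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthinkwee%2Fir_ie_work/lists"}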