{"id":13672166,"url":"https://github.com/RimoChan/internet-dataset","last_synced_at":"2025-04-27T21:32:23.508Z","repository":{"id":41162667,"uuid":"497230448","full_name":"RimoChan/internet-dataset","owner":"RimoChan","description":"【数据集】好耶，是互联网数据集！","archived":false,"fork":false,"pushed_at":"2023-07-02T04:51:57.000Z","size":14,"stargazers_count":202,"open_issues_count":4,"forks_count":21,"subscribers_count":8,"default_branch":"slave","last_synced_at":"2024-10-29T15:49:39.359Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/RimoChan.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-05-28T06:10:32.000Z","updated_at":"2024-10-16T13:25:05.000Z","dependencies_parsed_at":"2024-10-29T14:32:15.521Z","dependency_job_id":null,"html_url":"https://github.com/RimoChan/internet-dataset","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RimoChan%2Finternet-dataset","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RimoChan%2Finternet-dataset/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RimoChan%2Finternet-dataset/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RimoChan%2Finternet-dataset/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/RimoChan","download_url":"https://codeload.github.com/RimoChan/internet-dataset/tar.gz/ref
s/heads/slave","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224087166,"owners_count":17253514,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T09:01:28.118Z","updated_at":"2024-11-11T10:30:26.531Z","avatar_url":"https://github.com/RimoChan.png","language":null,"readme":"# Internet Dataset\n\nMy [search engine](https://github.com/RimoChan/sese-engine) has been running for more than a year and has collected quite a lot of useful data.\n\nThis data seems fairly valuable, so I decided to share it with everyone.\n\n\n## Data volume\n\nThe dataset keeps growing. As of May 2023 it holds about 130 GB of data, made up of:\n\n- Domain data\u003csub\u003e(5.8 GB)\u003c/sub\u003e: 14,000,000 domains, drawn from 4,000,000 first-level (registrable) domains.\n\n- Page data\u003csub\u003e(24.9 GB)\u003c/sub\u003e: 115,000,000 web pages.\n\n- Inverted index data\u003csub\u003e(99.7 GB)\u003c/sub\u003e: 22,000,000 words, each mapped to between 1 and 30,000 pages.\n\n\n## Download\n\nPick whichever source you prefer: \n\n- [GitHub Release](https://github.com/RimoChan/internet-dataset/releases)\n\n- [OneDrive](https://v0vxj-my.sharepoint.com/:f:/g/personal/rimochan_v0vxj_onmicrosoft_com/EqRakuQVVjBDqMyU8xd7NnEB3MZrDZxDwPTVXK7tNv5Rqw?e=cXQMod )\u003csub\u003e (thanks to [@skoqaq](https://github.com/skoqaq) for uploading it to OneDrive, but note that this copy is the 2022 data) \u003c/sub\u003e\n\n\n## Data contents\n\n- Domain level\n  - ip\n  - 最后访问时间 (last access time)\n  - 访问次数 (visit count) \u003csub\u003e(number of times the search engine has crawled pages under this domain; the higher the value, the more reliable the other fields)\u003c/sub\u003e\n  - 语种 (language) \u003csub\u003e(fasttext language identification results, as a moving average over all pages under this domain)\u003c/sub\u003e\n  - 链接 (links) \u003csub\u003e(links from any page under this domain to other domains, a rolling sample of about 200)\u003c/sub\u003e\n  - 重定向 (redirects) \u003csub\u003e(redirects observed for pages under this domain, a rolling sample of about 40)\u003c/sub\u003e\n  - ======The attributes below apply only to the home page of the domain======\n  - https可用 (https available)\n  - 关键词 (keywords) \u003csub\u003e(the most frequent words)\u003c/sub\u003e\n  - 结构 (structure) 
\u003csub\u003e(the HTML structure mapped to a string, used to filter out the large numbers of domains generated from templates)\u003c/sub\u003e\n\n- Page level\n  - Page title\n  - Page description \u003csub\u003e(meta description, truncated to 256 characters)\u003c/sub\u003e\n  - Page text \u003csub\u003e(truncated to 256 characters)\u003c/sub\u003e\n  - Last access time\n\nNotes: \n\n- The ip field is incomplete; the values are mostly DNS resolution results obtained from East Asia.\n\n- If a site is dynamic, its keywords are unreliable\u003csub\u003e(for example, the home page of a news site)\u003c/sub\u003e. Keywords for sites in languages other than Chinese and English are also unreliable.\n\n- All language-related fields are biased. When crawl targets are selected, domains containing more Chinese pages get a higher weight, and those domains are also more likely to link to Chinese pages on other domains, so the share of Chinese in the moving average comes out too high.\n\n- Some domains are missing fields; this is normal. Several fields were added gradually over time, and older records were either not backfilled or only partly backfilled.\n\nThere are also some unlisted fields. Most of them are meaningless or deprecated, so you can simply ignore them.\n\n\n## Samples\n\n- Page information for `https://github.com/`\n\n```json\n[\n  \"GitHub: Let’s build from here · GitHub\",\n  \"GitHub is where over 100 million developers shape the future of software, together. Contribute to the open source community, manage your Git repositories, review code like a pro, track bugs and features, power your CI/CD and DevOps workflows, and secure co\",\n  \"Skip to content Toggle navigation Sign up Product Actions Automate any workflow Packages Host and manage packages Security Find and fix vulnerabilities Codespaces Instant dev environments Copilot Write better code with AI Code review Manage code changes Is\",\n  1683401341\n]\n```\n\n- Domain information for `github.com`\n\n```json\n{\n  \"https可用\": true,\n  \"ip\": [\"20.205.243.166\"],\n  \"关键词\": [\"github\", \"your\", \"you\", \"code\", \"octocat\", \"classifier\", \"all\", \"actions\", \"s\", \"build\", \"open\", \"sign\", \"pull\", \"up\", \"learn\", \"world\", \"more\", \"community\", \"review\", \"readme\", \"requests\", \"security\", \"merge\", \"software\", \"git\", \"https\", \"source\", \"developers\", \"jump\", \"can\", \"repositories\", \"added\", \"packages\", \"cli\", \"codespaces\", \"million\", \"developer\", \"available\", \"windows\", \"cloud\"],\n  \"成功率\": 0.9450735409551977,\n  \"最后访问时间\": 1683429300,\n  \"结构\": 
\"[[20,[[18,[1,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,21,1,1,1,1,1,1,1,1,1,1,1,1,1,1,3,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,3,3,3,3,1,1,1,3,3,3,1,3]],[19,[[0,[[32,[[0,[[0,[0,[0,[14]]]],[0,[[0,[14]],11,[0,[[0,[[0,[[0,[[16,[[10,[7,7,7,0]]]]]]]]]],0]]]]]]]]]],0,[0,[[38,[[0,[[0,[14,0]]]]]]]],\\\"include-fragment\\\",[0,[[44,[[0,[[0,[[0,[[0,[[0,[[0,[30,[16,[[0,[[17,[[15,[10]],[9,[7]]]],7,14]]]],[0,[[0,[[0,[[0,[[0,[4]],[0,[4]],[0,[4]],[0,[5]]]]]]\",\n  \"访问次数\": 556694.9000000316,\n  \"语种\": {\n    \"ar\": 0.00012560604452730657,\n    \"bg\": 5.067841852608503e-11,\n    \"cs\": 2.5e-323,\n    \"de\": 3.7523861649975535e-10,\n    \"el\": 1.0593663757620937e-07,\n    \"en\": 0.9744931624632813,\n    \"eo\": 2.5e-323,\n    \"es\": 4.333235886036386e-05,\n    \"eu\": 2.5e-323,\n    \"fa\": 1.0508114267205955e-12,\n    \"fr\": 7.0824680275658e-05,\n    \"hu\": 0.0004203567032961541,\n    \"id\": 2.5e-323,\n    \"it\": 3.063911176630927e-15,\n    \"ja\": 0.0017088828225574997,\n    \"ko\": 0.0015900679303152115,\n    \"ne\": 2.5e-323,\n    \"nl\": 0.0005624524911557625,\n    \"pl\": 1.717507063606e-312,\n    \"pt\": 4.0744963011521734e-06,\n    \"ru\": 0.0009116465126009326,\n    \"sk\": 2.5e-323,\n    \"th\": 0.000237629139099449,\n    \"tr\": 2.830707100472314e-05,\n    \"uk\": 1.9260313777921886e-12,\n    \"vi\": 2.5e-323,\n    \"zh\": 0.019803550921189992\n  },\n  \"重定向\": {\"http://github.com/5iux/\": \"https://github.com/5iux/\", \"http://github.com/brianmario/yajl-ruby\": \"https://github.com/brianmario/yajl-ruby\", \"http://github.com/cch123\": \"https://github.com/cch123\", \"http://github.com/composer/packagist\": \"https://github.com/composer/packagist\", \"https://github.com/FriendsOfPHP/PHP-CS-Fixer\": \"https://github.com/PHP-CS-Fixer/PHP-CS-Fixer\", \"https://github.com/Homebrew/homebrew\": \"https://github.com/Homebrew/legacy-homebrew\", 
\"https://github.com/Laravelium/laravel-sitemap\": \"https://github.com/LaraPalCom/laravel-sitemap\", \"https://github.com/Nikschavan/header-footer-elementor\": \"https://github.com/brainstormforce/header-footer-elementor\", \"https://github.com/Redsmin/redsmin\": \"https://github.com/Redsmin/proxy\", \"https://github.com/TralahM/pympesa/pull/\": \"https://github.com/TralahM/pympesa/pulls\", \"https://github.com/UsersWP/userswp/\": \"https://github.com/AyeCode/userswp\", \"https://github.com/Vtrois/Kratos\": \"https://github.com/seatonjiang/kratos\", \"https://github.com/Xhofe/alist\": \"https://github.com/alist-org/alist\", \"https://github.com/apps/dependabot\": \"https://docs.github.com/github/managing-security-vulnerabilities/configuring-dependabot-security-updates\", \"https://github.com/apps/github-actions\": \"https://github.com/features/actions\", \"https://github.com/bollnh/hexo-theme-material\": \"https://github.com/iblh/hexo-theme-material\", \"https://github.com/bryan31/tlog-homepage/edit/master/docs/10.%E6%96%87%E6%A1%A3/210.%E5%AF%B9Soul%E7%BD%91%E5%85%B3%E7%9A%84%E6%94%AF%E6%8C%81.md\": \"https://github.com/dromara/tlog-homepage/edit/master/docs/10.%E6%96%87%E6%A1%A3/210.%E5%AF%B9Soul%E7%BD%91%E5%85%B3%E7%9A%84%E6%94%AF%E6%8C%81.md\", \"https://github.com/business\": \"https://github.com/enterprise\", \"https://github.com/contact\": \"https://support.github.com?tags=dotcom-direct\", \"https://github.com/creationix/nvm\": \"https://github.com/nvm-sh/nvm\", \"https://github.com/easydigitaldownloads/easy-digital-downloads/issues/7130\": \"https://github.com/awesomemotive/easy-digital-downloads/issues/7130\", \"https://github.com/github/feedback\": \"https://github.com/community/community\", \"https://github.com/gojek/feast\": \"https://github.com/feast-dev/feast\", \"https://github.com/greensock/GreenSock-JS/\": \"https://github.com/greensock/GSAP\", \"https://github.com/hackmdio/hackmd/issues/720\": \"https://github.com/hackmdio/codimd/issues/720\", 
\"https://github.com/iceb0y/winjudge\": \"https://github.com/iceboy233/winjudge\", \"https://github.com/indyplanets/flexnav\": \"https://github.com/mrjasonweaver/flexnav\", \"https://github.com/iteufel/nwjs-ffmpeg-prebuilt/releases\": \"https://github.com/nwjs-ffmpeg-prebuilt/nwjs-ffmpeg-prebuilt/releases\", \"https://github.com/kubernetes-sigs/federation-v2/blob/master/docs/userguide.md\": \"https://github.com/kubernetes-retired/kubefed/blob/master/docs/userguide.md\", \"https://github.com/kubernetes/test-infra/blob/master/images\": \"https://github.com/kubernetes/test-infra/tree/master/images\", \"https://github.com/mereithhh/van-blog\": \"https://github.com/Mereithhh/vanblog\", \"https://github.com/mperham/sidekiq\": \"https://github.com/sidekiq/sidekiq\", \"https://github.com/nuxt-community/i18n-module\": \"https://github.com/nuxt-modules/i18n\", \"https://github.com/pinggod/hexo-theme-apollo\": \"https://github.com/chongshengsun/hexo-theme-apollo\", \"https://github.com/rtfd/readthedocs.org\": \"https://github.com/readthedocs/readthedocs.org\", \"https://github.com/rtfd/sphinx_rtd_theme\": \"https://github.com/readthedocs/sphinx_rtd_theme\", \"https://github.com/rx-ts/prettier\": \"https://github.com/un-ts/prettier\", \"https://github.com/site/privacy\": \"https://docs.github.com/site-policy/privacy-policies/github-privacy-statement\", \"https://github.com/sourcemeta/json-size-benchmark/blob/master/benchmark/esmrc\": \"https://github.com/sourcemeta/json-size-benchmark/tree/master/benchmark/esmrc\", \"https://github.com/spatie/larabank-event-projector-aggregates\": \"https://github.com/spatie/larabank-aggregates\", \"https://github.com/uptrace/bun/blob/v1.1.12/example\": \"https://github.com/uptrace/bun/tree/v1.1.12/example\", \"https://github.com/users/hzoo/sponsorship\": \"https://github.com/sponsors/hzoo\", \"https://github.com/wikimedia/php-excimer\": \"https://github.com/wikimedia/mediawiki-php-excimer\", 
\"https://github.com/wowchemy/wowchemy-hugo-modules\": \"https://github.com/wowchemy/wowchemy-hugo-themes\", \"https://github.com/xb2016/kratos\": \"https://github.com/seatonjiang/kratos\", \"https://github.com/xingrz/bs-map-editor\": \"https://github.com/xingrz/rdt-editor\", \"https://github.com/xuegao-tzx/2048-HarmonyOS\": \"https://github.com/xuegao-tzx/2048-HarmonyOS-Lite\", \"https://github.com/xuperchain/xuperunion/wiki\": \"https://github.com/xuperchain/xuperchain/wiki\", \"https://github.com/yarnpkg/berry/blob/master/packages/yarnpkg-core\": \"https://github.com/yarnpkg/berry/tree/master/packages/yarnpkg-core\"},\n  \"链接\": [\"https://github.blog\", \"https://avatars.githubusercontent.com/u/7133698?v=4\", \"https://neko-dev.github.io/material-theme-docs/\", \"https://github.blog\", \"https://twitter.com/palkan_tula\", \"https://github.blog\", \"http://blog.cocoapods.org/CocoaPods.org-Two-point-Five/\", \"https://twitter.com/github\", \"https://packagist.org/packages/brunocfalcao/larapush\", \"https://leeoniya.github.io/uPlot/demos/zoom-wheel.html\", \"https://developer.apple.com/xcode/\", \"https://www.githubstatus.com/\", \"https://github.blog\", \"https://www.githubstatus.com/\", \"https://github.blog\", \"https://usthe.com/sureness\", \"https://github.blog\", \"https://github.co/hiddenchars\", \"https://www.githubstatus.com/\", \"https://formbold.com/templates\", \"https://nodejs.org/en/\", \"https://cloudburstmc.org\", \"https://formbold.com/\", \"https://www.apache.org/licenses/LICENSE-2.0.html\", \"https://www.githubstatus.com/\", \"https://www.githubstatus.com/\", \"https://github.blog\", \"https://www.githubstatus.com/\", \"https://github.blog\", \"https://anarchyinstaller.org/\", \"https://evilmartians.com/chronicles/system-of-a-test-setting-up-end-to-end-rails-testing\", \"https://github.co/hiddenchars\", \"https://evilmartians.com/chronicles/hotwire-reactive-rails-with-no-javascript\", \"https://github.blog\", \"https://www.githubstatus.com/\", 
\"https://www.npmjs.com/package/errorhandler\", \"https://github.blog\", \"https://github.blog\", \"https://github.blog\", \"https://noti.st/palkan/jWy57U/weaving-seaming-mocks\", \"https://www.npmjs.com/package/raw-body\", \"https://git.sr.ht/~emersion/gamja/\", \"https://www.githubstatus.com/\", \"https://actions.github.io/humans.txt\", \"https://github.blog\", \"https://www.busybox.net/\", \"https://github.blog\", \"https://laravel.com/docs/collections\", \"https://github.blog\", \"https://wordpress.org/plugins/userswp-recaptcha/\", \"https://developer.apple.com/xcode/\", \"https://github.blog\", \"https://www.bitlbee.org/\", \"https://github.blog\", \"https://github.co/hiddenchars\", \"https://github.blog\", \"https://www.mdpi.com/2078-2489/11/2/125\", \"https://github.blog\", \"https://github.co/hiddenchars\", \"https://github.blog\", \"https://gyoogle.dev/\", \"https://www.githubstatus.com/\", \"https://developer.apple.com/xcode/\", \"https://jsonapi.org\", \"https://www.irccloud.com/\", \"https://formbold.com/\", \"https://github.blog\", \"https://userswp.io/downloads/verified-users/\", \"https://www.githubstatus.com/\", \"https://www.electronjs.org\", \"https://laravel.com/docs/eloquent-relationships\", \"https://www.archlinux.org\", \"https://xxxx.com/pc/js/manifest.e90b779b12a4f25606f0.js\", \"https://github.blog\", \"https://www.githubstatus.com/\", \"https://github.blog\", \"https://github.blog\", \"https://www.npmjs.com/package/chartjs-adapter-spacetime\", \"https://userswp.io/downloads/wp-job-manager/\", \"https://www.githubstatus.com/\", \"https://github.blog\", \"https://github.blog\", \"https://www.buymeacoffee.com/gyoogle\", \"https://github.blog\", \"https://demos.ayecode.io/userswp/\", \"https://github.blog\", \"https://github.blog\", \"https://link.springer.com/chapter/10.1007/978-3-030-55789-8_60\", \"https://github.blog\", \"https://www.githubstatus.com/\", \"https://jsonapi.org/format/\", \"https://en.wikipedia.org/wiki/Open_source\", 
\"https://www.npmjs.com/package/spacetime\", \"https://www.githubstatus.com/\", \"https://github.blog\", \"https://github.blog\", \"https://www.githubstatus.com/\", \"http://issues.npcap.org/226\", \"https://github.co/hiddenchars\", \"https://userswp.io/downloads/multisite-creator/\", \"https://github.blog\", \"https://openjsf.org/\", \"https://packagist.org/packages/tightenco/collect\", \"https://www.githubstatus.com/\", \"https://userswp.io/downloads/mailchimp/\", \"https://www.githubstatus.com/\", \"https://www.githubstatus.com/\", \"https://github.blog\", \"https://github.blog\", \"https://www.githubstatus.com/\", \"https://github.co/hiddenchars\", \"https://berlin.social/@lutki95\", \"https://github.blog\", \"https://www.producthunt.com/posts/github-metrics?utm_source=badge-featured\u0026utm_medium=badge\u0026utm_source=badge-github-metrics\", \"https://www.npmjs.com/package/cookie-session\", \"https://github.blog\", \"https://github.blog\", \"https://github.co/hiddenchars\", \"https://www.johnsmith.com\", \"https://www.githubstatus.com/\", \"https://www.githubstatus.com/\", \"https://github.blog\", \"https://github.co/hiddenchars\", \"https://www.githubstatus.com/\", \"https://weechat.org/\", \"https://www.githubstatus.com/\", \"https://albumentations.ai/docs/\", \"https://www.npmjs.com/package/serve-favicon\", \"https://github.blog\", \"https://github.blog\", \"https://npmjs.org/package/connect\", \"https://github.blog\", \"https://twitter.com/thephpleague\", \"https://userswp.io/downloads/private-messages/\", \"https://github.blog\", \"https://sdrausty.github.io/TermuxArch/docs/install\", \"https://github.blog\", \"https://www.apache.org/licenses/LICENSE-2.0.html\", \"https://github.blog\", \"https://developer.apple.com/xcode/\", \"https://search.maven.org/artifact/com.usthe.sureness/sureness-core\", 
\"https://camo.githubusercontent.com/559879cbb14f47ebeb2da8aeaa8a4c61db62b9d484fd94e569e2e75e4af51469/68747470733a2f2f6769746875622d726561646d652d73746174732e76657263656c2e6170702f6170692f746f702d6c616e67732f3f757365726e616d653d70696e757373696c76657374727573266c61796f75743d636f6d7061637426686964655f7469746c653d3126636172645f77696474683d333030\", \"https://github.blog\", \"https://www.githubstatus.com/\", \"https://github.co/hiddenchars\", \"https://beta-metrics.lecoq.io\", \"https://www.githubstatus.com/\", \"https://lineicons.com/\", \"https://www.githubstatus.com/\", \"https://packagist.org/packages/brunocfalcao/larapush\", \"https://noti.st/palkan/VWPOSd/between-monoliths-and-microservices\", \"https://github.blog\", \"https://github.blog\", \"https://userswp.io/downloads/advanced-search/\", \"https://github.blog\", \"https://www.npmjs.com/package/chart.js\", \"https://flight-manual.atom.io/hacking-atom/sections/debugging/\", \"https://twitter.com/lutki95\", \"https://github.blog\", \"https://www.npmjs.com/package/cookie-parser\", \"https://www.npmjs.com/package/lineicons\", \"https://www.githubstatus.com/\", \"https://avatars.githubusercontent.com/u/113136203?v=4\", \"https://avatars.githubusercontent.com/u/21816?v=4\", \"https://www.githubstatus.com/\", \"https://twitter.com/feelinglucky\", \"https://github.blog\", \"https://github.blog\", \"https://github.blog\", \"https://www.githubstatus.com/\", \"https://actions.github.io/humans.txt\", \"https://github.blog\", \"https://www.githubstatus.com/\", \"https://t.me/pornbaike\", \"https://github.co/hiddenchars\", \"https://github.blog\", \"https://spec.matrix.org/v1.4/client-server-api/\", \"https://stackoverflow.com/questions/tagged/chart.js\", \"https://devblogs.microsoft.com/dotnet/dotnet-maui-dotnet-7/\", \"https://brew.sh\", 
\"https://camo.githubusercontent.com/626c2f3093027bc493aa25c73cf7f445e87e3e35f33885abf5407be37c862a94/68747470733a2f2f696d67732e786b63642e636f6d2f636f6d6963732f7465616d5f636861742e706e67\", \"https://github.blog\", \"https://coveralls.io/r/senchalabs/connect?branch=master\", \"https://github.blog\", \"https://github.blog\", \"https://github.blog\", \"https://discuss.atom.io/c/faq\", \"https://www.linkedin.com/in/niklas-kiefer-9249341a2/\", \"https://www.githubstatus.com/\", \"https://github.blog\", \"https://github.blog\", \"https://github.blog\", \"https://huww98.github.io/TimeChart/docs/performance\", \"https://developer.apple.com/xcode/\", \"https://www.githubstatus.com/\", \"https://www.githubstatus.com/\", \"https://json-schema.org\", \"https://flight-manual.atom.io/hacking-atom/sections/debugging/\", \"https://github.blog\", \"https://www.deviantart.com/dartty/art/Kanaya-364815353\", \"https://github.blog\", \"https://www.githubstatus.com/\", \"https://www.facebook.com/whitehat\", \"https://github.blog\", \"https://github.blog\", \"https://github.blog\", \"https://www.githubstatus.com/\", \"https://github.blog\", \"https://demo.stack.jimmycai.com\", \"https://dev.stack.jimmycai.com\", \"https://user-images.githubusercontent.com/5889006/190859553-5b229b4f-c476-4cbd-928f-890f5265ca4c.png\", \"https://stack.jimmycai.com\", \"https://developer.apple.com/xcode/\", \"https://www.githubstatus.com/\", \"https://ko-fi.com/jimmycai\", \"https://stack.jimmycai.com\", \"https://user-images.githubusercontent.com/5889006/190859441-141b5f81-8483-40d2-bd96-ebf85616a46d.png\", \"https://cssnano.co/\", \"https://gitter.im/postcss/postcss\", \"https://vk.com/postcss\", \"https://plugins.jetbrains.com/plugin/8578-postcss\", \"https://postcss.org/api/\", \"https://parceljs.org\", \"https://www.postcss.parts/\", \"https://evilmartians.com/?utm_source=postcss\", \"https://postcss.org\", \"https://atom.io/packages/source-preview-postcss\", \"https://github.blog\", 
\"https://www.githubstatus.com/\", \"https://github.blog\", \"https://github.co/hiddenchars\", \"https://github.co/hiddenchars\", \"https://www.githubstatus.com/\", \"https://github.blog\", \"https://github.blog\", \"https://www.githubstatus.com/\", \"https://github.blog\", \"https://github.blog\", \"https://router.vuejs.org/guide/advanced/meta.html\", \"https://developer.mozilla.org/en-US/docs/Web/API/ScrollToOptions\", \"https://developer.mozilla.org/en-US/docs/Web/API/History/state\", \"https://www.githubstatus.com/\", \"https://router.vuejs.org/api/interfaces/routelocationoptions.html\", \"https://github.blog\", \"https://mathiasbynens.be/notes/css-escapes\", \"https://developer.mozilla.org/en-US/docs/Web/API/ScrollToOptions/behavior\", \"https://pinia.vuejs.org\", \"https://github.blog\"]\n}\n```\n\n\n## Data format\n\nFirst, unzip the archive.\n\nThe page data and the domain data are both JSON with simple compression applied, so they can be read like this:\n\n```python\nimport json\nimport brotli\n\n# path is a placeholder for the location of one extracted data file\nprint(json.loads(brotli.decompress(open(path, 'rb').read())))\n```\n\nThe inverted index data uses a rather odd binary format\u003csub\u003e(and, for historical reasons, there are two variants of it)\u003c/sub\u003e, so the reading code is more involved, along these lines:\n\n```python\nimport json\nimport struct\nimport brotli\n\n\ndef _load1(b: bytes):\n    # Layout without a header: count n, n int16 string lengths, n float16 values, then the UTF-8 strings back to back\n    n = struct.unpack('i', b[:4])[0]\n    字符串长度 = struct.unpack(f'{n}h', b[4:4+n*2])\n    吸0 = struct.unpack(f'{n}e', b[4+n*2:4+n*4])\n    文hint = ''.join([f'{x}s' for x in 字符串长度])\n    吸1 = struct.unpack(文hint, b[4+n*4:])\n    吸1 = [x.decode('utf8') for x in 吸1]\n    return [*zip(吸0, 吸1)]\n\n\ndef _load2(b: bytes):\n    # Layout tagged with the b'yn0001' header: count n, n float16 values, then a JSON array of strings\n    assert b[:6] == b'yn0001', 'unexpected version'\n    n = struct.unpack('i', b[6:10])[0]\n    吸0 = struct.unpack(f'{n}e', b[10:10+n*2])\n    吸1 = json.loads(b[10+n*2:])\n    assert len(吸0) == len(吸1), 'incomplete data'\n    return [*zip(吸0, 吸1)]\n\n\ndef load(b: bytes):\n    # Pick the parser based on the header bytes\n    if b.startswith(b'yn0001'):\n        return _load2(b)\n    return _load1(b)\n\n\n# path is a placeholder for the location of one extracted index file\nprint(load(brotli.decompress(open(path, 'rb').read())))\n```\n\n## 
Sponsorship\n\nIf you find the internet dataset helpful for your work or studies, you are welcome to become my girlfriend.\n\nCute ones only, preferably a white-haired, flat-chested tsundere with twin tails.\n","funding_links":["https://github.com/sponsors/hzoo","https://www.buymeacoffee.com/gyoogle","https://ko-fi.com/jimmycai"],"categories":["Others"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FRimoChan%2Finternet-dataset","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FRimoChan%2Finternet-dataset","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FRimoChan%2Finternet-dataset/lists"}