{"id":15307946,"url":"https://github.com/wscats/intersect","last_synced_at":"2025-04-14T23:13:55.637Z","repository":{"id":72736072,"uuid":"215337156","full_name":"Wscats/intersect","owner":"Wscats","description":"一道面试题的思考 - 6000万数据包和300万数据包在50M内存使用环境中求交集","archived":false,"fork":false,"pushed_at":"2020-12-31T03:27:47.000Z","size":4949,"stargazers_count":98,"open_issues_count":0,"forks_count":3,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-14T23:13:43.585Z","etag":null,"topics":["bigdata","intersect","memory","nodejs","stream"],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Wscats.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2019-10-15T15:48:01.000Z","updated_at":"2025-02-23T08:24:18.000Z","dependencies_parsed_at":"2023-09-18T01:34:56.547Z","dependency_job_id":null,"html_url":"https://github.com/Wscats/intersect","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Wscats%2Fintersect","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Wscats%2Fintersect/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Wscats%2Fintersect/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Wscats%2Fintersect/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Wscats","download_url":"https://codeload.github.com/Wscats/intersect/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248975329,"owners_count":21192210,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigdata","intersect","memory","nodejs","stream"],"created_at":"2024-10-01T08:13:03.251Z","updated_at":"2025-04-14T23:13:55.608Z","avatar_url":"https://github.com/Wscats.png","language":"JavaScript","readme":"# 下载 \u0026\u0026 运行\n\n下载源代码：\n```bash\ngit clone https://github.com/Wscats/intersect\n```\n\n使用以下命令运行测试，运行成功后结果会在`result.txt`中展现结果：\n```bash\n# 运行\nnpm start\n# 查看结果\nnpm run dev\n# 生成新的大数据\nnpm run build\n```\n\n# 目录结构\n\n- database\n    - data-3M.txt - 模拟的3百万数据包\n    - data-60M.txt - 模拟的6千万数据包\n- library\n    - data-3M.js - 处理3百万数据包的逻辑\n    - data-60M.js - 处理6千万数据包的逻辑\n    - intersect.js - 处理数据包的交集\n    - create-60M.js - 生成大数据的文件\n- result.txt 最终数据包的交集结果\n- index.js 主逻辑文件\n\n理想数据包的数据结构如下：\n```\nQQ:40645253 地址：xxx 年龄：xxx\nQQ:49844525 地址：xxx 年龄：xxx\nQQ:51053984 地址：xxx 年龄：xxx\nQQ:15692967 地址：xxx 年龄：xxx\nQQ:39211026 地址：xxx 年龄：xxx\n...\n```\n理想数据包的内存占用如下：\n\n|数据量|内存占用|\n|-|-|\n|6000条数据|\u003e=30KB|\n|6000万条数据|\u003e=300.000KB\u003e=300MB|\n|300条数据|\u003e=15KB|\n|300万条数据|\u003e=150.000KB\u003e=15MB|\n\n在50MB的内存限制下，我们可以把300万条约15MB的数据完全放入内存中，剩余大概35MB空间是不允许我们完全放入6000万条约300MB的数据，所以我们需要把数据切割成10块左右，大概每块控制在30MB，然后分别读取出来跟内存中的300万条数据进行比对并求出交集。\n\n在 Node 中要满足上面的要求，我们分别需要用到两个关键的内置模块：\n\n- fs - 文件系统\n- readline - 逐行读取\n\n`fs.createReadStream(path[, options])`方法中，其中 options 可以包括 start 和 end 值，以从文件中读取一定范围的字节而不是整个文件。 start 和 end 都包含在内并从 0 开始计数，这种是方法方便我们分段读取6000万条数据。\n\n示例，从一个大小为 100 个字节的文件中读取最后 10 个字节：\n```js\nfs.createReadStream('data3M.txt', { start: 90, end: 99 });\n```\n\n除此之外还可以使用，`fs.createReadStream()` 提供 highWaterMark 选项，它允许我们将以大小等于 highWaterMark 选项的块读取流，highWaterMark 的默认值为: 64 * 1024(即64KB)，我们可以根据需要进行调整，当内部的可读缓冲的总大小达到 highWaterMark 设置的阈值时，流会暂时停止从底层资源读取数据，直到当前缓冲的数据被消费，我们就可以触发`readline.pause()`暂停流，处理完之后继续触发`readline.resume()`恢复流，然后不断重复以上步骤，将6000万数据分别处理完。\n\nreadline 模块提供了一个接口，用于一次一行地读取可读流中的数据。 它可以使用以下方式访问，并且我们的数据包，每条数据之间是使用`\\n、\\r 或 \\r\\n`隔开，所以这样方便我们使用`readline.on('line', (input) =\u003e {})`来接受每一行数据包的字符串。\n\n# data-60M.js\n\n该文件用于专门处理6000万数据，我们使用`readline`和`createReadStream`两者配合，将数据按一定条数分别缓存在内存中，由于提交的代码不适合太大(Git传上去太慢)，所以把数据量减少到6000条，那么分成10份的话，每份缓存就需要读600条左右，读完每份数据之后调用`intersect`函数求交集，并存入硬盘`result.txt`文件中，然后释放内存：\n\n```js\n// 写入结果\nconst writeResult = (element) =\u003e {\n    appendFile('./result.txt', `${element}\\n`, (err) =\u003e {\n        err ? () =\u003e console.log('写入成功') : () =\u003e console.log('写入失败');\n    })\n}\n```\n这里最关键是要定义一个空的容器`lineCount`来存放每段数据，并且使用`if (lineCount === 600) {}`语句判断内存超过限制的空间后做释放内存的处理：\n```js\nconst { createReadStream, appendFile } = require('fs');\nconst readline = require('readline');\nconst intersect = require('./intersect');\n\nmodule.exports = (smallData) =\u003e {\n    return new Promise((resolve) =\u003e {\n        const rl = readline.createInterface({\n            // 6000条数据流\n            input: createReadStream('./database/data60M.txt', {\n                // 节流阀\n                highWaterMark: 50\n            }),\n            // 处理分隔符\n            crlfDelay: Infinity\n        });\n        // 缓存次数\n        let lineCount = 0;\n        // 缓存容器\n        let rawData = [];\n        // 逐行读取\n        rl.on('line', (line) =\u003e {\n            rawData.push(line);\n            lineCount++;\n            // 限制每一次读取600条数据，分十次读取\n            if (lineCount === 600) {\n                // 释放内存\n                // ...\n            }\n        );\n        rl.on('close', () =\u003e {\n            resolve('结束');\n        })\n    })\n}\n```\n释放内存后前需要使用`rl.pause()`暂停流，然后做两步逻辑:\n\n- 求交集结果\n- 写入每段交集结果到硬盘\n\n然后需要使用`rl.resume()`重启流：\n```js\nif (lineCount === 600) {\n    // 暂停流\n    rl.pause();\n    // 获取交集\n    let intersectResult = intersect(rawData, smallData);\n    // 遍历交集并写入结果\n    intersectResult.forEach(element =\u003e {\n        writeResult(element)\n    });\n    // 释放缓存\n    rawData = null;\n    intersectResult = null;\n    rawData = [];\n    // 重置读取次数\n    lineCount = 0;\n    // 重启流\n    rl.resume();\n}\n```\n\n# data-3M.js\n\n这里的数据由于是3百万，所以可以把全部数据放入内存，这里用Promise封装，方便在外部配合`async`和`await`使用：\n```js\nconst fs = require('fs');\nconst readline = require('readline');\nmodule.exports = () =\u003e {\n    return new Promise((resolve) =\u003e {\n        const rl = readline.createInterface({\n            input: fs.createReadStream('./database/data-3M.txt'),\n            crlfDelay: Infinity\n        });\n        let check = [];\n        rl.on('line', (line) =\u003e {\n            check.push(line);\n        });\n        rl.on('close', () =\u003e {\n            resolve(check)\n        })\n    })\n}\n```\n\n# intersect.js\n\n这里简单的使用`Set`和`filter`方法来求交集：\n```js\n// 交集方法\nmodule.exports = (a, b) =\u003e {\n    return a.filter(x =\u003e new Set(b).has(x));\n}\n```\n\n# index.js\n\n这里分别把上面两份处理关键数据的逻辑引入，然后执行逻辑：\n```js\nconst data3M = require('./library/data-3M');\nconst data60M = require('./library/data-60M');\n(async () =\u003e {\n    let smallData = await data3M();\n    let result = await data60M(smallData);\n    console.log(result);\n})();\n```\n\n# create-60M.js\n\n生成全新的大数据，用于测试：\n```js\nconst fs = require(\"fs\");\nconst path = require('path');\nconst writer = fs.createWriteStream(path.resolve(__dirname, '../database/data-60M.txt'), { highWaterMark: 1 });\n\nconst writeSixtyMillionTimes = (writer) =\u003e {\n    const write = () =\u003e {\n        let data = Buffer.from(`${parseInt(Math.random() * 60000000)}\\n`)\n        let ok = true;\n        do {\n            i--;\n            if (i === 0) {\n                // 最后一次写入。\n                writer.write(data);\n            } else {\n                // 检查是否可以继续写入。 \n                // 不要传入回调，因为写入还没有结束。\n                ok = writer.write(data);\n            }\n        } while (i \u003e 0 \u0026\u0026 ok);\n        if (i \u003e 0) {\n            // 被提前中止。\n            // 当触发 'drain' 事件时继续写入。\n            writer.once('drain', write);\n        }\n    }\n    // 初始化6000万数据\n    let i = 600000;\n    write();\n}\n\nwriteSixtyMillionTimes(writer)\n```\n\n# 后记\n\n15 个月后再回顾，发现跟 VSCode 的这个思路很相似，具体如下\n\n- [Text Buffer Reimplementation](https://code.visualstudio.com/blogs/2018/03/23/text-buffer-reimplementation)\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwscats%2Fintersect","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwscats%2Fintersect","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwscats%2Fintersect/lists"}