{"id":16732666,"url":"https://github.com/igorescobar/line-server","last_synced_at":"2025-06-30T23:03:17.340Z","repository":{"id":66331153,"uuid":"172401925","full_name":"igorescobar/line-server","owner":"igorescobar","description":"Get the content of any line within a file regardless its size in a few milliseconds.","archived":false,"fork":false,"pushed_at":"2019-02-24T23:15:30.000Z","size":4709,"stargazers_count":2,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-15T19:44:25.883Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/igorescobar.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-02-24T23:14:46.000Z","updated_at":"2023-03-10T10:47:26.000Z","dependencies_parsed_at":null,"dependency_job_id":"c844cd04-f5c9-4b1a-aa73-0d888c37d003","html_url":"https://github.com/igorescobar/line-server","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/igorescobar/line-server","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/igorescobar%2Fline-server","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/igorescobar%2Fline-server/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/igorescobar%2Fline-server/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/igorescobar%2Fline-server/manifests","owner_url":"https://repos.ecosyst
e.ms/api/v1/hosts/GitHub/owners/igorescobar","download_url":"https://codeload.github.com/igorescobar/line-server/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/igorescobar%2Fline-server/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262863687,"owners_count":23376452,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-12T23:46:04.834Z","updated_at":"2025-06-30T23:03:17.295Z","avatar_url":"https://github.com/igorescobar.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Line Server\nhttps://salsify.github.io/line-server.html\n\n## Assumptions\n\n* Each line is terminated with a newline (\"\\n\").\n* Any given line will fit into memory.\n* The line is valid ASCII (e.g. not Unicode).\n* The files can be as small as 1MB or as big as 100GB+.\n* It should be scalable and handle multiple clients.\n* It should be a REST API.\n  * 200 for success\n  * 413 (!?!) 
for invalid line index.\n\n## Stack\n### Node.js\nI picked Node.js because it uses an event-driven, non-blocking I/O model that makes it lightweight and efficient, perfect for data-intensive real-time applications that run across distributed devices.\n\n### Express\nI'm using Express to have a more elegant mapping between routes and the actual implementation.\n\n### Byline\nThis library wraps a file read stream so that I can bind an event to each line of the file.\n\n### Docker\nI used Docker for fast deployment and also to make the build less reliant on the `host` that the service is running on. Just download and install:\nhttps://docs.docker.com/v17.12/docker-for-mac/install/\n\n## Implementation\nFirst and most importantly, this solution relies entirely on disk (don't run away just yet! :B).\n\nFor this to work, I'm using a tool called `split` to create chunks of 10,000 lines from a given file. This is of course a time-consuming process and in an ideal world we would have a mechanism in place that could do it faster after each file is uploaded. The `chunks` approach gives me a predictable response time and memory/CPU consumption regardless of the size of the file. Because each chunk is read sequentially from its start, the fastest response comes when you request a line at the beginning of a chunk, and the slowest when you request one near the end of a chunk. So the smaller the chunks, the faster the response time.\n\nAfter creating the chunks inside of the `static/chunks/` folder the system creates a *map table* to have an in-memory link between `chunk range` and `file` so when you request a `line-number` the system knows exactly which chunk to look in.\n\nAfter this, instead of loading the entire file into memory, `byline` streams the chunk file in smaller pieces and flushes each piece after it is consumed. 
This ensures minimal memory usage.\n\nThe API documentation can be found at `src/api.yml` or you can open it [online](https://petstore.swagger.io/?url=https://bitbucket.org/igorescobar/line-server/raw/65b5918072827f5aede29e13f2e126b9ef6f8394/src/api.yml).\n\n### Performance\n#### Beginning of the chunk (best case scenario):\n*** Memory: 60mb (idle) max: 250mb ***\n```\nConcurrency Level:      80\nTime taken for tests:   1.473 seconds\nComplete requests:      1000\nFailed requests:        0\nTotal transferred:      246000 bytes\nHTML transferred:       39000 bytes\nRequests per second:    678.99 [#/sec] (mean)\nTime per request:       117.822 [ms] (mean)\nTime per request:       1.473 [ms] (mean, across all concurrent requests)\nTransfer rate:          163.12 [Kbytes/sec] received\n\nConnection Times (ms)\n              min  mean[+/-sd] median   max\nConnect:        0    1   1.1      0       6\nProcessing:    14  114  17.6    114     165\nWaiting:       12   95  16.6     97     148\nTotal:         20  115  17.2    114     165\n```\n\n#### Worst case scenario (end of chunk):\n*** Memory: 60mb (idle) max: 130mb ***\n```\nConcurrency Level:      80\nTime taken for tests:   13.134 seconds\nComplete requests:      1000\nFailed requests:        0\nTotal transferred:      292000 bytes\nHTML transferred:       85000 bytes\nRequests per second:    76.14 [#/sec] (mean)\nTime per request:       1050.686 [ms] (mean)\nTime per request:       13.134 [ms] (mean, across all concurrent requests)\nTransfer rate:          21.71 [Kbytes/sec] received\n\nConnection Times (ms)\n              min  mean[+/-sd] median   max\nConnect:        0    1   1.2      0       7\nProcessing:   778 1031  82.5   1032    1331\nWaiting:      778 1008  80.1   1013    1321\nTotal:        779 1031  82.8   1033    1331\n```\n\n### Possible improvements\n1) In a real world case I would never use my own machine to store/process those chunks. 
I would do it directly on `Amazon S3 + CloudWatch Triggers` which would allow me to scale the chunk processing regardless of the amount of traffic that we may have and would also allow us to handle this much faster.\n\n2) Using Amazon S3 (or any storage as a service) would allow us to get past the file system limitations regarding maximum open files (varies according to OS), disk space and also access speed to the chunks.\n\n3) To make the application scale in terms of requests, we could just deploy the application using AWS `ECS + Fargate` in order to scale the number of containers we have running based on CPU usage, network traffic or any other metric we might see fit.\n\n4) Since we are working with only one file it didn't matter much how we store the chunks. To ensure faster access to the files regardless of the number of chunks we might have, I would pick a different storage strategy, something like `chunks/$chunk_code/$filename`, instead of throwing everything inside of the `chunks` folder.\n\n5) Since I don't rely much on memory to solve this, I could use a small amount of memory to do some caching on my end to optimize sequential reading and avoid unnecessary I/O, if this was a use case.\n\n6) I don't really like the use of the status code `413`. IMHO it can be improved. By definition `/lines` is the resource and if I'm looking for a resource `/:line_number` that doesn't exist, the correct status should be `404 - Not Found` since it is an immutable file and this line will NEVER exist. The correct way to represent a resource that \"existed\" before but doesn't exist anymore is the status code `410 - Gone` for example. 
The API design could also be more extensible, like `/files/:name/:line_number`, which would allow me to serve multiple files and look up their lines.\n\n7) Also, if this was a real product the files would have a better organisation with `controllers`, `models`, etc.\n\n## Explored possibilities\n\n### Database implementation\nProbably the fastest solution would be to create a model on a database where I could just insert each line of the file into a `table` and instantly have a relation between `line -\u003e content`. Adding an index on the file and line columns would allow instant searches inside of a file regardless of the number of lines it has. I didn't use this because then it would be too obvious and you wouldn't have much code to review. The only problem I see with this solution is that at a super high level of concurrency we could hit `active connection` limitations on this database. Using something like `DynamoDB` could also be an alternative solution.\n\n### In-memory implementation\nWorking with memory is always faster, but memory is, as always, a very limited resource. 
It would allow me to work only with a very limited number of files, and the size of those files would also be very limited, since a 1GB file would occupy way more than its on-disk size once serialized and loaded into the application's memory.\n\n## How to build it\n```sh\n./build.sh small.txt\n```\n\n## How to run it\n```sh\n./run.sh small.txt\n```\n\n### Testing with different files\nMake sure you add your files to the `src/public/` folder.\nIf you add more files make sure to run `./chunks.sh small.txt` to generate the chunks of the new files before running the service.\n\n## Running tests\n```sh\ndocker-compose run --rm web npm run test\n```\n\n### If you are trying to make this all blow up, you might need to increase the OS max number of open files per process. On macOS it is done like this:\n```sh\necho kern.maxfiles=65536 | sudo tee -a /etc/sysctl.conf\necho kern.maxfilesperproc=65536 | sudo tee -a /etc/sysctl.conf\nsudo sysctl -w kern.maxfiles=65536\nsudo sysctl -w kern.maxfilesperproc=65536\nulimit -n 65536\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Figorescobar%2Fline-server","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Figorescobar%2Fline-server","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Figorescobar%2Fline-server/lists"}