{"id":17680347,"url":"https://github.com/docelic/article_date","last_synced_at":"2025-08-17T13:36:01.569Z","repository":{"id":66956507,"uuid":"255967216","full_name":"docelic/article_date","owner":"docelic","description":"Extracts date from URLs","archived":false,"fork":false,"pushed_at":"2020-09-23T09:27:02.000Z","size":697,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-30T18:48:39.733Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Crystal","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/docelic.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-04-15T15:57:27.000Z","updated_at":"2020-09-23T09:27:05.000Z","dependencies_parsed_at":"2023-02-28T02:31:20.877Z","dependency_job_id":null,"html_url":"https://github.com/docelic/article_date","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/docelic/article_date","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/docelic%2Farticle_date","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/docelic%2Farticle_date/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/docelic%2Farticle_date/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/docelic%2Farticle_date/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/docelic","download_url":"https://codeload.github.com/docelic/article_
date/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/docelic%2Farticle_date/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260536895,"owners_count":23024501,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-24T09:06:40.395Z","updated_at":"2025-06-18T10:40:02.215Z","avatar_url":"https://github.com/docelic.png","language":"Crystal","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Introduction\n\nThis repository contains a small, self-contained application for retrieving\nweb pages, parsing their creation and modification dates, and printing various\naccompanying statistics.\n\nThe user interface consists of a simple HTML form for specifying the URLs\nand displaying statistics. 
It is available at\n[http://localhost:3000/](http://localhost:3000/) once the application is\nrunning.\n\nThe general idea for the project was to try to create a quick parser that is\ntext-based (recognizing only minimal parts of HTML), and observe the\nperformance and accuracy it would achieve when parsing HTML content.\n\n## Running the app\n\nTo run the application, install Crystal, run `shards` to install dependencies, and then:\n\n```bash\ncrystal [run --release] src/app.cr [-- options]\n\nOR\n\nshards build [--release]\n./bin/app [options]\n```\n\nThe application supports the following options:\n\n```\n    -c, --connections    (50) - Concurrent connections per host\n    -d, --downloaders   (200) - Number of downloading threads\n    -h, --help                - This help\n    -p, --parsers        (10) - Number of date parsing threads\n    -v, --verbose     (false) - Print a dot (.) on each download\n                                Print a plus (+) on each parse\n\n    -C, --connect-timeout (5) - Set connect timeout (s)\n    -D, --dns-timeout     (5) - Set DNS timeout (s)\n    -R, --read-timeout    (5) - Set read timeout (s)\n    -W, --write-timeout   (5) - Set write timeout (s)\n    -T, --timeout         (5) - Set all timeouts (s)\n```\n\n**Complete example:**\n\n```bash\ngit clone https://github.com/docelic/article_date\ncd article_date\nshards build --release\nbin/app -c 50 -d 200\n```\n\n## Usage\n\nOnce the app is started with the desired options, visit\n[http://localhost:3000/](http://localhost:3000/) in the browser.\n\nThe minimal HTML user interface is provided by Kemal and it provides\na textarea for entering page URLs, one per line.\n\nClicking \"Submit\" will process the data and display the results.\nA sort of real-time update is achieved by printing data to the\nresponse IO in real-time, allowing the browser to display result\nrows in incremental chunks instead of having to wait for all data\nto be processed.\n\n## Runtime and results\n\nWhen the 
application starts, it will print a summary of the running\nconfiguration to STDOUT. Also, if option -v is provided,\nit will print '.' and '+' for each downloaded and parsed\nfile.\n\nAs URLs are processed, each result row in the browser\nwill display the following values:\n\n1. Sequential number following the ordering of URLs in the list, starting from 0\n2. Page URL (in case of a download error also with a copy of the error text)\n3. Parsed creation/modification date. If no date is determined, the value is empty\n4. Elapsed time for parsing the date (this value includes all Fiber\nwait times, but as methods invoked are generally non-blocking and execute without\nreleasing the CPU, this value is considered to be close to real algorithm execution time)\n5. Elapsed time for downloading the page (this value includes all Fiber\nwait times, e.g. times waiting for web servers to respond as well as\nfibers to be scheduled on the CPU. As such it is regularly\nhigher than the amount of time spent in actual execution)\n6. HTTP response status code\n7. Name of program method which determined the date\n8. The corresponding confidence score (0.0 - 1.0)\n\nThe footer of the table also contains 3 summarized values:\n\n1. Total real (wallclock) time\n2. Sum of all parsing times\n3. Sum of all download times\n\nWallclock time is useful for determining general/overall performance.\n\nParsing times report the actual times spent parsing and are useful for\nidentifying potential needed improvements in the algorithms or on particular\ntypes of pages.\n\nDownload times, if very high, are useful for identifying\nthat the thread settings (options -d and -p) may be suboptimal\nand could be adjusted. Alternatively if very low, the\nnumber of threads could be increased.\n\nWhen processing of the URL list is complete, all open download connections\nand Fibers terminate. 
They are re-created on every request.\n\n## App design\n\nThe app is based on Fibers and Channels.\n\nA group of N (--downloaders) Fibers works on the input URLs, processing\neach one while taking advantage of a basic HTTP Keep-Alive implementation\nand maintaining at most N (--connections) HTTP::Clients open for each\nindividual domain.\n\nParallel connections to the same host are not created up-front, but\nare instantiated only if needed to crawl multiple pages from the same\ndomain simultaneously.\n\nThe app is intended to run N (--downloaders) download\nfibers in parallel. However, if the input list is heavily sorted by\ndomain the performance may be reduced to N (--connections).\nIn such cases, either set options -d and -c to the same value or\nrandomize the input list (e.g. `sort -R \u003cfile\u003e`).\n\nAs each downloader downloads its page, it sends the intermediate data\nover the appropriate Channel to the parser processes, and then\nwaits for the next page to download.\n\nThe parser processes receive downloaded data and try to determine the\npage creation or modification date using various parsing strategies.\nThe current design of the parsing and extraction system is documented\nin the file *PARSING.md*.\n\nAs each parser finishes scanning through the page, it sends the final\nresults and statistics to the results Channel and then waits for another\npage to parse.\n\n### In more general terms\n\nThe implemented design based on Channels, \"downloaders\" and \"parsers\"\nwas chosen based on the idea that a real-world, larger system could use\na similar architecture on a larger scale.\n\nFor example, the downloader processes might be advanced clients capable\nof downloading JavaScript-heavy/SPA pages, re-parsing stored content instead\nof downloading it again, and/or using various APIs instead of getting data\nthrough crawling (e.g. 
search engines get data from Wikipedia via\nAPI, not downloading HTML).\n\nThese processes would then send contents via message passing or\nqueueing systems for further processes down the line, of which date\nparsers could be just one type of consumer.\n\n### Improvements\n\nIn a more complete, non-prototype implementation, a couple of improvements\ncould be added:\n\n- More per-domain crawling limits and/or bandwidth caps\n\n- Keeping track of which parsing strategies had the best success rate on\nparticular domains and/or subdirectories within domains. The order in\nwhich the parsing strategies are run could then be dynamically adjusted\nfor best performance.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdocelic%2Farticle_date","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdocelic%2Farticle_date","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdocelic%2Farticle_date/lists"}