{"id":19316650,"url":"https://github.com/wisskirchenj/filetypeanalyzer","last_synced_at":"2026-06-20T14:32:59.146Z","repository":{"id":110253325,"uuid":"530818599","full_name":"wisskirchenj/FileTypeAnalyzer","owner":"wisskirchenj","description":"Toy project of JetBrains JavaCore track for analyzing file types","archived":false,"fork":false,"pushed_at":"2023-02-06T14:30:16.000Z","size":169,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-24T04:42:19.581Z","etag":null,"topics":["algorithm","java19","knuth-morris-pratt","multithreading","rabin-karp"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/wisskirchenj.png","metadata":{"files":{"readme":"README.MD","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-08-30T20:25:05.000Z","updated_at":"2022-12-06T21:41:58.000Z","dependencies_parsed_at":null,"dependency_job_id":"e8cc2ff1-9813-4050-8af2-4f46815b4458","html_url":"https://github.com/wisskirchenj/FileTypeAnalyzer","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/wisskirchenj/FileTypeAnalyzer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wisskirchenj%2FFileTypeAnalyzer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wisskirchenj%2FFileTypeAnalyzer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wisskirchenj%2FFileTypeAnalyzer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitH
ub/repositories/wisskirchenj%2FFileTypeAnalyzer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/wisskirchenj","download_url":"https://codeload.github.com/wisskirchenj/FileTypeAnalyzer/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wisskirchenj%2FFileTypeAnalyzer/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34573744,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-20T02:00:06.407Z","response_time":98,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["algorithm","java19","knuth-morris-pratt","multithreading","rabin-karp"],"created_at":"2024-11-10T01:12:06.612Z","updated_at":"2026-06-20T14:32:59.129Z","avatar_url":"https://github.com/wisskirchenj.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# IDEA EDU Course ...\n\nImplemented in the Java Core Track of hyperskill.org's JetBrain Academy.\n\nPurpose of doing this project, is to further practise core java topics as multi-threading, file handling, applying\nnew algorithms as Knuth-Morris-Pratt algorithm and Rabin-Karp algorithm for data hashing, searching and sorting - and just some more POJO java.\n\nAlso: This is the graduate project of the Java Core Track, which covers the most yet undone learning topics and thus\nwill serve to finish this last 
unfinished JetBrains Java track.\n\n## Technology / External Libraries\n\n- POJO Java 18,\n- multi-threading with Java Executor-Service (java.util.concurrent)\n- speed-streaming searches of arbitrarily large files - e.g. \u003e65 GB docker files... (in \u003c60 sec.)\n- prefix, hashing and search algorithms such as Knuth-Morris-Pratt (best performance!) and Rabin-Karp.\n- with Lombok annotation processors,\n- Apache Log4j SLF4J API binding to Log4j 2 logging and\n- JUnit 5 with\n- Mockito (mockito-inline) testing.\n\n## Repository Contents\n\nThe sources of the main project tasks (5 stages) and unit tests with Mockito.\n\n## Program description\n\nA tool that extracts info from unknown (also binary) files\nto determine the type of the file. Different algorithms are used to solve this problem, and will\nbe compared regarding the speed (of search and sorting).\n\nCL-Command overview:\n\n\u003e usage: java fileanalyzer.Main \u0026lt;directory-path\u0026gt; \u0026lt;patterns-csv-path\u0026gt;\n\nAll files in the directory path given are \"file-type-analyzed\" using the patterns-csv-file given as\na dataset for prioritized search patterns to detect file types... \nPrioritization means that e.g. when we find the identifying character sequences for a zip-File AND a\nMS PowerPoint-file in one file, then we know it is a PPT, because all PPT are stored as Zips.\nThis is realized by giving the identifying PPT-sequence a higher priority than the zip-sequence.\n\nHave fun!\n\n## Project completion\n\nProject was completed on 20.09.22.\n\n## Progress\n\n30.08.22 Project started - just git repo and gradle setup.\n\n02.09.22 Stage 1 completed - analyze arbitrary (also binary) file given as argument for occurrence of a (type defining)\nstring - also given as argument. 
Implemented in two versions with Files.readAllBytes and FileInputStream for huge files.\n\n08.09.22 Stage 2 completed - Knuth-Morris-Pratt (KMP) algorithm implemented, accelerating the string search by about a factor of 4-5.\nUsed strategy pattern in controller class to dynamically choose a SearchAlgorithm (naive vs KMP).\n\n12.09.22 Stage 3 completed - parallelize the search on all files of a directory for a search pattern by use of an executor\nservice with KMP-algorithm - use of DirectoryStream.\n\n15.09.22 Stage 4 completed - Read in a CL-argument specified CSV-file with prioritized search patterns to identify the file\ntype of an arbitrary (binary) file. Perform parallelized multi-threaded search on all files of a directory for all the patterns.\n\n20.09.22 Final Stage 5 completed - Implemented a multi-pattern search with variable length patterns (not ideal) using the\nRabin-Karp algorithm and polynomial rolling hash functions. Though carefully implemented and optimized for speed, the performance\nis about a factor of 3 to 5 worse than with the Knuth-Morris-Pratt algorithm.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwisskirchenj%2Ffiletypeanalyzer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwisskirchenj%2Ffiletypeanalyzer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwisskirchenj%2Ffiletypeanalyzer/lists"}