{"id":20828801,"url":"https://github.com/perty/log-file-analysis","last_synced_at":"2025-07-19T10:38:04.617Z","repository":{"id":137744270,"uuid":"232439794","full_name":"perty/log-file-analysis","owner":"perty","description":"Using KMeans to analyze text entries and classify them for studying changes in logging over time.","archived":false,"fork":false,"pushed_at":"2020-01-12T15:12:38.000Z","size":13,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-03-12T07:44:34.945Z","etag":null,"topics":["machine-learning"],"latest_commit_sha":null,"homepage":null,"language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/perty.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-01-07T23:54:44.000Z","updated_at":"2020-01-12T15:12:40.000Z","dependencies_parsed_at":null,"dependency_job_id":"34cd8939-f597-464d-af2c-606e9a457ca4","html_url":"https://github.com/perty/log-file-analysis","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/perty/log-file-analysis","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/perty%2Flog-file-analysis","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/perty%2Flog-file-analysis/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/perty%2Flog-file-analysis/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/perty%2Flog-file-analysis/manifests","owner_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/owners/perty","download_url":"https://codeload.github.com/perty/log-file-analysis/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/perty%2Flog-file-analysis/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265918978,"owners_count":23849273,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["machine-learning"],"created_at":"2024-11-17T23:18:28.813Z","updated_at":"2025-07-19T10:38:04.575Z","avatar_url":"https://github.com/perty.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# log-file-analysis\nUsing K-Means to analyze text entries and classify them for studying changes in logging over time.\n\nK-Means is an unsupervised machine learning algorithm.\n\nThe idea is to classify log entries. First, learn categories from a sample of log files. \nThen apply the model continuously to monitor logs for anomalies, such as a sudden increase in entries of one category\nor entries that are not classifiable, i.e. new events that differ by more than a threshold from every \nknown category.\n\n## K-Means\nK-Means works by fixing the number of categories and then iterating until samples with similar features end up in the \nsame category. The categories are found by starting with randomly placed centroids that represent the first assumption \nabout the feature set of each category. Each sample is associated with the centroid it is most similar to. 
Then, \neach centroid is moved in feature space so that it represents the most typical feature set of \nits category. \n\nThe next iteration will associate some samples with a different centroid because the centroids have moved in feature space. \nWhen no associations change, the classification is stable and learning is considered done.\n\nStabilization is not guaranteed: samples may oscillate between centroids, so there needs to be a maximum number of \niterations or a time budget.\n\nThe right number of categories depends heavily on the domain and how it is modeled. In the case of log files, is a log \nentry just a string, or should the log level be one feature and the content another? The size could be a feature, too. \n\nWe could also be interested in a series of connected log entries, not just isolated entries. That would create categories \naround use cases, which would help us monitor for changes in usage.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fperty%2Flog-file-analysis","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fperty%2Flog-file-analysis","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fperty%2Flog-file-analysis/lists"}