{"id":17998205,"url":"https://github.com/primaryobjects/tfidf","last_synced_at":"2025-03-26T04:31:49.757Z","repository":{"id":10562713,"uuid":"12765607","full_name":"primaryobjects/TFIDF","owner":"primaryobjects","description":"TF*IDF Term Frequency Inverse Document Frequency in C# .NET","archived":false,"fork":false,"pushed_at":"2021-06-02T18:47:46.000Z","size":144,"stargazers_count":61,"open_issues_count":0,"forks_count":27,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-03-21T06:41:34.731Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C#","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"evermeer/EVReflection","license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/primaryobjects.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2013-09-11T19:32:12.000Z","updated_at":"2025-01-28T19:47:53.000Z","dependencies_parsed_at":"2022-09-10T08:23:00.081Z","dependency_job_id":null,"html_url":"https://github.com/primaryobjects/TFIDF","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/primaryobjects%2FTFIDF","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/primaryobjects%2FTFIDF/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/primaryobjects%2FTFIDF/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/primaryobjects%2FTFIDF/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/primaryobjects","download_url":"https://codeload.github.com/primaryobjects/TFIDF/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https:
//github.com","kind":"github","repositories_count":245589266,"owners_count":20640254,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-29T21:24:35.018Z","updated_at":"2025-03-26T04:31:49.493Z","avatar_url":"https://github.com/primaryobjects.png","language":"C#","readme":"TF*IDF in C# .NET\n=========\n\nTutorial\nhttp://www.primaryobjects.com/CMS/Article157.aspx\n\n## Motivation\n\nDuring a recent machine learning competition, I struggled to find an example of working code in C# .NET that performed a TF*IDF transformation on a set of documents. For such a relatively simple mathematical formula, I had hoped there would be a library available for easy importing into a project. After a bit of searching, I decided to create my own TF*IDF transformation class, modeled after the Python scikit-learn library's method, TfidfVectorizer(). This class is based upon the formula described on Wikipedia and through the sources listed below.\n\n## Description\n\nTF*IDF = Term Frequency * Inverse Document Frequency\n\nTFIDF is a way of representing a document, based upon its keywords holding values that represent their importance within the document. For a complete description of TFIDF, see http://en.wikipedia.org/wiki/Tf%E2%80%93idf\n\nThis example project includes the class TFIDF.cs for performing TF*IDF transformations on a set of documents in C# .NET, and returning the resulting list of vectors as a multi-dimensional array of doubles. 
Upon providing the class with a list of documents as strings, the class builds a vocabulary (skipping stop words and stemming the remaining terms) and calculates the inverse document frequency (IDF) for each vocabulary term against the total number of documents. The class then takes each word in the vocabulary and calculates its term frequency in each document. Finally, each term frequency is multiplied by the term's inverse document frequency to provide the TF*IDF score.\n\nThe class returns a matrix of doubles. Each row in the matrix represents a vectorized document (converted from string to TF*IDF values for each vocabulary term). Each column in the matrix represents a feature/term from the list of vocabulary words.\n\nFor example:\n\n```\nThe sun in the sky is bright.\n-0.405465108108164, 0, -0.405465108108164, 0, 0,\n\nWe can see the shining sun, the bright sun.\n-0.810930216216329, 0, -0.405465108108164, 0, 0,\n```\n\nIn the example above, the vocabulary consists of 5 terms (resulting in 2 rows and 5 columns in the matrix).\n\nThe class includes an optional method for normalizing the resulting vectors, using L2-norm: Xi = Xi / Sqrt(X0^2 + X1^2 + .. + Xn^2). 
Applying normalization to the above example produces the following result:\n\n```\nThe sun in the sky is bright.\n-0.707106781186547, 0, -0.707106781186547, 0, 0\n\nWe can see the shining sun, the bright sun.\n-0.894427190999916, 0, -0.447213595499958, 0, 0\n```\n\n## Usage\n\n```\nstring[] documents = LoadYourDocumentsHere();\n\ndouble[][] inputs = TFIDF.Transform(documents);\ninputs = TFIDF.Normalize(inputs);\n```\n\n## Sources\n\nOverall TF*IDF description\nhttp://en.wikipedia.org/wiki/Tf%E2%80%93idf\n\nSimple description of TF and IDF math formulas with examples\nhttp://r3dux.org/2012/10/how-to-count-word-occurences-in-a-string-or-file-using-csharp/\n\nTF*IDF normalization via L2-Norm\nhttp://pyevolve.sourceforge.net/wordpress/?tag=inverse-document-frequency\n\n## License\nCopyright (c) 2013 Kory Becker http://www.primaryobjects.com/kory-becker.aspx\n\nPermission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.\n\n## Author\n\nKory Becker\nhttp://www.primaryobjects.com/kory-becker.aspx\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprimaryobjects%2Ftfidf","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fprimaryobjects%2Ftfidf","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprimaryobjects%2Ftfidf/lists"}