{"id":31363859,"url":"https://github.com/innernull/feather","last_synced_at":"2025-09-27T05:26:31.740Z","repository":{"id":46805275,"uuid":"397510594","full_name":"innerNULL/feather","owner":"innerNULL","description":"FEATure HashER ","archived":false,"fork":false,"pushed_at":"2022-10-19T03:45:12.000Z","size":83,"stargazers_count":3,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-09-25T13:36:14.434Z","etag":null,"topics":["cpp","feature-engineering","feature-extraction","feature-hash","feature-hashing","hash-embedding","libsvm","libsvm-data","libsvm-format","machine-learning","python","recommendation-system","recommender-system","recsys","sparse-data","sparse-representations"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/innerNULL.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2021-08-18T07:23:40.000Z","updated_at":"2024-05-12T01:54:14.000Z","dependencies_parsed_at":"2025-04-12T08:57:33.293Z","dependency_job_id":"2db024df-79de-48c2-b14d-5adfd782d8f1","html_url":"https://github.com/innerNULL/feather","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/innerNULL/feather","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/innerNULL%2Ffeather","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/innerNULL%2Ffeather/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub
/repositories/innerNULL%2Ffeather/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/innerNULL%2Ffeather/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/innerNULL","download_url":"https://codeload.github.com/innerNULL/feather/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/innerNULL%2Ffeather/sbom","scorecard":{"id":489115,"data":{"date":"2025-08-11","repo":{"name":"github.com/innerNULL/feather","commit":"b0b3719701fed6a026be499edbc86a19d3a55490"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":3.4,"checks":[{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Dangerous-Workflow","score":10,"reason":"no dangerous workflow patterns detected","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Code-Review","score":0,"reason":"Found 0/30 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) 
are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Pinned-Dependencies","score":0,"reason":"dependency not pinned by hash detected -- score normalized to 0","details":["Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/cmake.yml:22: update your workflow using https://app.stepsecurity.io/secureworkflow/innerNULL/feather/cmake.yml/main?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/python-package.yml:23: update your workflow using https://app.stepsecurity.io/secureworkflow/innerNULL/feather/python-package.yml/main?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/python-package.yml:25: update your workflow using https://app.stepsecurity.io/secureworkflow/innerNULL/feather/python-package.yml/main?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/python-publish.yml:18: update your workflow using https://app.stepsecurity.io/secureworkflow/innerNULL/feather/python-publish.yml/main?enable=pin","Warn: GitHub-owned GitHubAction not pinned by hash: .github/workflows/python-publish.yml:20: update your workflow using https://app.stepsecurity.io/secureworkflow/innerNULL/feather/python-publish.yml/main?enable=pin","Warn: third-party GitHubAction not pinned by hash: .github/workflows/python-publish.yml:34: update your workflow using https://app.stepsecurity.io/secureworkflow/innerNULL/feather/python-publish.yml/main?enable=pin","Warn: pipCommand not pinned by hash: .github/workflows/python-package.yml:30","Warn: pipCommand not pinned by hash: 
.github/workflows/python-package.yml:31","Warn: pipCommand not pinned by hash: .github/workflows/python-package.yml:32","Warn: pipCommand not pinned by hash: .github/workflows/python-package.yml:33","Warn: pipCommand not pinned by hash: .github/workflows/python-publish.yml:25","Warn: pipCommand not pinned by hash: .github/workflows/python-publish.yml:26","Warn: pipCommand not pinned by hash: .github/workflows/python-publish.yml:29","Warn: pipCommand not pinned by hash: .github/workflows/python-publish.yml:53","Info:   0 out of   5 GitHub-owned GitHubAction dependencies pinned","Info:   0 out of   1 third-party GitHubAction dependencies pinned","Info:   0 out of   8 pipCommand dependencies pinned"],"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"Token-Permissions","score":0,"reason":"detected GitHub workflow tokens with excessive permissions","details":["Warn: no topLevel permission defined: .github/workflows/cmake.yml:1","Warn: no topLevel permission defined: .github/workflows/python-package.yml:1","Warn: no topLevel permission defined: .github/workflows/python-publish.yml:1","Info: no jobLevel write permissions found"],"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Security-Policy","score":0,"reason":"security policy file not 
detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Info: FSF or OSI recognized license: Apache License 2.0: LICENSE:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Vulnerabilities","score":10,"reason":"0 existing vulnerabilities detected","details":null,"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'main'","Warn: branch protection not enabled for branch 'dev_dbg'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's 
branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"SAST","score":0,"reason":"SAST tool is not run on all commits -- score normalized to 0","details":["Warn: 0 commits out of 1 are checked with a SAST tool"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}}]},"last_synced_at":"2025-08-19T18:35:48.552Z","repository_id":46805275,"created_at":"2025-08-19T18:35:48.553Z","updated_at":"2025-08-19T18:35:48.553Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":277184148,"owners_count":25775286,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-27T02:00:08.978Z","response_time":73,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cpp","feature-engineering","feature-extraction","feature-hash","feature-hashing","hash-embedding","libsvm","libsvm-data","libsvm-format","machine-learning","python","recommendation-system","recommender-system","recsys","sparse-data","sparse-representations"],"created_at":"2025-09-27T05:26:30.159Z","updated_at":"2025-09-27T05:26:31.734Z","avatar_url":"https://github.com/innerNULL.png","language":"C++","readme":"# feather\nFEATure HashER \n\n\n## Build \u0026 Install\n* **CPP**  \n```bash\ncd  
PATH/TO/FEATHER\nmkdir build \u0026\u0026 cd build\n\n\n# Build with unit tests, without the Python binding\ncmake ../ -DFEATHER_BUILD_TESTS=ON -DFEATHER_BUILD_PY_BINDER=OFF\n# Build without unit tests and without the Python binding\ncmake ../ -DFEATHER_BUILD_TESTS=OFF -DFEATHER_BUILD_PY_BINDER=OFF\n# Build the Python binding\ncmake ../ -DFEATHER_BUILD_TESTS=OFF -DFEATHER_BUILD_PY_BINDER=ON -DPYTHON_EXECUTABLE=/usr/bin/python3.7\n\n\nmake -j12\n```\n\n* **Python with pip**    \n```bash\npython -m pip install git+https://github.com/innerNULL/feather.git -vvv\n# or\npython -m pip install https://github.com/innerNULL/feather/archive/refs/heads/main.zip -vvv\n# or\npython -m pip install pyfeather\n```\nHere is what you may see:  \n```\nProcessing /Path/To/feather\nBuilding wheels for collected packages: pyfeather\nBuilding wheel for pyfeather (setup.py) ... done\nCreated wheel for pyfeather:\nfilename=pyfeather-0.0.1-cp37-cp37m-macosx_10_15_x86_64.whl size=1284474 sha256=e3f9d0be1e7578274f3fcecb854c1e66336a24985b8e6ff4213375d76463299e\nStored in directory: /private/var/folders/4q/50_2647d1yb47jt9j6plwx2r0000gq/T/pip-ephem-wheel-cache-996awbes/wheels/0f/bd/93/b6936ec0c1169201de264147e21ae7e2bb894720b34bcdce79\nSuccessfully built pyfeather\nInstalling collected packages: pyfeather\nSuccessfully installed pyfeather-0.0.1\n```\n\n## How to Use\n### Feature-Hash\nHere is a simple example:\n```python\nimport pyfeather\nfrom typing import List\n\n# Load the pre-defined feature schema.\nfeahash = pyfeather.FeaHash(\"./conf/feather.conf\")\n\n# Hash the values 2 and '2' of 'fea1', which is a discrete\n# feature; both hash results should be the same.\nfea1_hash_str2: List[int] = feahash.GetFeaHash(\"fea1\", \"2\")\nfea1_hash_int2: List[int] = feahash.GetFeaHash(\"fea1\", 2)\n# [10100070] [10100070]\nprint(fea1_hash_str2, fea1_hash_int2)\n\n# Hash the values 3.14, '3.14' and 5.12 of 
\"fea10\", which\n# is a continuous feature. The hash bucket of any value of\n# this feature is always 0, so all values share the same\n# feature hash.\nfea10_hash_float3p14: List[int] = feahash.GetFeaHash(\"fea10\", 3.14)\nfea10_hash_str3p14: List[int] = feahash.GetFeaHash(\"fea10\", '3.14')\nfea10_hash_float5p12: List[int] = feahash.GetFeaHash(\"fea10\", 5.12)\n# [11000000] [11000000] [11000000]\nprint(fea10_hash_float3p14, fea10_hash_str3p14, fea10_hash_float5p12)\n\n# Hash the values [4.0, 3.0, 2.0, 1.0] and [1.0, 2.0, 3.0, 4.0]\n# of \"fea11\", which is a vector feature of dimension 4. The hash\n# buckets of any 4-dim vector of this feature are always\n# [0, 1, 2, 3], so, as with a continuous feature, all values of\n# this feature share the same feature hash.\nfea11_hash_4to1: List[int] = feahash.GetFeaHash(\"fea11\", [4.0, 3.0, 2.0, 1.0])\nfea11_hash_1to4: List[int] = feahash.GetFeaHash(\"fea11\", [1.0, 2.0, 3.0, 4.0])\n# [11100000, 11100001, 11100002, 11100003]\n# [11100000, 11100001, 11100002, 11100003]\nprint(fea11_hash_4to1, fea11_hash_1to4)\n```\n\n\n## Feature Hashing\n### Notions\n* **Feature Value**:  \nDefined by `FeaValue`, a unified wrapper around feature data. Feature data falls into three types: discrete, continuous, and vector features.  \n    * **Discrete Feature**:  \n    The input can be `std::string`, `int32_t`, `float`, or `double`. Whatever the input type, it is cast to `std::string` and saved as `FeaValue::discrete_val`.  \n    * **Continuous Feature**:  \n    The input can be `std::string`, `int32_t`, `float`, or `double`. Whatever the input type, it is cast to `float` and saved as `FeaValue::continuous_val`.  
\n    * **Vector Feature**:  \n    The input can be `std::vector\u003cstd::string\u003e`, `std::vector\u003cint\u003e`, or `std::vector\u003cfloat\u003e`. Whatever the input type, it is cast to `std::vector\u003cfloat\u003e` and saved as `FeaValue::vec_val`.  \n\nA `FeaValue` instance also records some feature metadata, such as the feature type: 0 for a discrete feature, 1 for a continuous feature, 2 for a vector feature.  \n\nIn addition, `FeaValue` can transform a feature value into its hash ID with `FeaValue::GetHash`, according to the feature type:  \n* **Discrete Feature Hash**:  \nThe result of calling `std::hash` on `FeaValue::discrete_val`.  \n* **Continuous Feature Hash**:  \nAlways returns 1 as the feature hash. A continuous feature does not need feature hashing, so the bucket size of its **Feature-Slot** should always be 1. With the hash ID fixed at 1, taking the hash ID modulo the slot bucket size (`1 % 1 = 0`) always assigns the continuous feature value to bucket 0 of its slot.   \n* **Vector Feature Hash**:  \nSimilar to the continuous-feature case, a vector feature does not need feature hashing either. Instead, each element of the vector is assigned to the slot bucket whose ID equals the element's index.  \nTo do this, each element's hash ID is set to `element-index + slot-bucket-size`. Taking the hash ID modulo the slot bucket size then always yields `element-index` as the element's slot bucket.   \nFor example, for the 3-dim vector [3.14, 5.21, 6.79], the slot bucket size must be 3 (the same as its dimension). The elements' hash IDs are [3 + 0, 3 + 1, 3 + 2], so their slot bucket IDs are [3 % 3, 4 % 3, 5 % 3] = [0, 1, 2].   \n\n* **Feature Slot**  \nDefined in `FeaSlot`. 
Each feature corresponds to a \"slot\". The feature slot maps each feature value's hash to a bucket by taking the hash modulo the slot's bucket size.\n* **Feature Hash**  \nDefined in `FeaHash`. A config file defines the schema of the target features: the 1st column is the feature name, the 2nd column is the feature-slot ID, the 3rd column is the number of hash buckets in the feature slot, and the 4th column is the feature type (0 for discrete feature, 1 for continuous feature, 2 for vector feature). `FeaHash` registers all slots in its constructor. \n\n* **Feature Indexer** (TODO)\n\n\n* **Feature Extractor**  \nThe base class is `FeaExtractor`. The main use case is mapping a feature's hash ID/index and value (for continuous and vector features) to libsvm format, which is done by `LibSVMExtractor`.\n\n* **Bucket-ID and Bucket-Code**  \nBriefly, a **Bucket-ID** is an `int32_t`, and a **Bucket-Code** is a `std::string` with a fixed number of digits. Here are some examples of mapping a bucket-id to a 5-digit bucket-code:  \n    * bucket-id: 5 -\u003e bucket-code: '00005'  \n    * bucket-id: 54234 -\u003e bucket-code: '54234'  \n    * bucket-id: 568 -\u003e bucket-code: '00568'   \n\n### Algorithm\nEach feature has a slot and a hash-bucket size. The final hash of a feature is an `int64` in the format `${SLOT}${HASH-BUCKET-CODE}`. Because the leading digits of this int64 are controlled by the slot, the final hash values of different features are far apart from each other. The second part is only **part of** the hash of the feature value, reduced according to the feature's hash-bucket size: the hash of the feature value is itself an int64 and the final hash is also an int64, so simply concatenating the slot and the full hash of the feature value could overflow the int64 range.\n\nIn addition, to allow adjusting each feature slot's hash-bucket size, we can maintain a hash ring (consistent hashing) for each slot.\n\n## TODO\n* 
`FeaHash::Hash2IndexDictBuild` should support rebuild mode.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Finnernull%2Ffeather","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Finnernull%2Ffeather","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Finnernull%2Ffeather/lists"}