{"id":18684991,"url":"https://github.com/h1alexbel/samples-filter","last_synced_at":"2025-04-12T04:32:45.417Z","repository":{"id":233486889,"uuid":"786791311","full_name":"h1alexbel/samples-filter","owner":"h1alexbel","description":"Command-line filter for GitHub repositories that contain \"samples\", instead of real project or framework or library","archived":false,"fork":false,"pushed_at":"2025-04-05T22:25:33.000Z","size":6362,"stargazers_count":6,"open_issues_count":16,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-05T23:23:18.419Z","etag":null,"topics":["dataset-filtering","github","machine-learning","research-project"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/h1alexbel.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-04-15T10:05:14.000Z","updated_at":"2024-12-24T12:04:14.000Z","dependencies_parsed_at":"2024-05-29T22:44:57.984Z","dependency_job_id":"c5382a1b-a91b-4fee-87a4-f4d3de7a7410","html_url":"https://github.com/h1alexbel/samples-filter","commit_stats":null,"previous_names":["h1alexbel/samples-filter"],"tags_count":9,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/h1alexbel%2Fsamples-filter","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/h1alexbel%2Fsamples-filter/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/h1alexbel%2Fsamples-filter/releases","manifests_url":"https://repos.eco
syste.ms/api/v1/hosts/GitHub/repositories/h1alexbel%2Fsamples-filter/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/h1alexbel","download_url":"https://codeload.github.com/h1alexbel/samples-filter/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248517348,"owners_count":21117436,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataset-filtering","github","machine-learning","research-project"],"created_at":"2024-11-07T10:19:57.956Z","updated_at":"2025-04-12T04:32:40.399Z","avatar_url":"https://github.com/h1alexbel.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# samples-filter\n\n[![EO principles respected here](https://www.elegantobjects.org/badge.svg)](https://www.elegantobjects.org)\n[![DevOps By Rultor.com](http://www.rultor.com/b/h1alexbel/samples-filter)](http://www.rultor.com/p/h1alexbel/samples-filter)\n[![We recommend IntelliJ IDEA](https://www.elegantobjects.org/intellij-idea.svg)](https://www.jetbrains.com/idea/)\n\n[![py](https://github.com/h1alexbel/samples-filter/actions/workflows/py.yml/badge.svg)](https://github.com/h1alexbel/samples-filter/actions/workflows/py.yml)\n[![PyPI - Version](https://img.shields.io/pypi/v/samples-filter)](https://pypi.org/project/samples-filter)\n[![codecov](https://codecov.io/gh/h1alexbel/samples-filter/graph/badge.svg?token=lVkWRVIqfE)](https://codecov.io/gh/h1alexbel/samples-filter)\n[![PDD 
status](http://www.0pdd.com/svg?name=h1alexbel/samples-filter)](http://www.0pdd.com/p?name=h1alexbel/samples-filter)\n[![Hits-of-Code](https://hitsofcode.com/github/h1alexbel/samples-filter)](https://hitsofcode.com/view/github/h1alexbel/samples-filter)\n[![License](https://img.shields.io/badge/license-MIT-green.svg)](https://github.com/h1alexbel/samples-filter/blob/master/LICENSE.txt)\n[![Known Vulnerabilities](https://snyk.io/test/github/h1alexbel/samples-filter/badge.svg)](https://snyk.io/test/github/h1alexbel/samples-filter)\n\nSamples-filter is a command-line tool\nthat filters out GitHub sample repositories (SR):\nrepositories that mostly contain educational or demonstration materials meant to be copied\nrather than reused as a dependency, the way a framework or library is.\nFor example, [leeowenowen/rxjava-examples](https://github.com/leeowenowen/rxjava-examples),\n[streaming-with-flink/examples-java](https://github.com/streaming-with-flink/examples-java),\n[redisson/redisson-examples](https://github.com/redisson/redisson-examples).\n\n**Motivation**. While working on the [CaM] project,\nwhere we're building datasets of open source Java programs,\nwe [discovered](https://github.com/yegor256/cam/issues/227)\nthe need to filter out repositories that contain samples, tutorials, or\nexamples. This repository is a portable command-line tool that filters out those\nrepositories.\n\n## How to use\n\nFirst, install it from [PyPI](https://pypi.org/project/samples-filter) like this:\n\n```bash\npip install samples-filter\n```\n\nThen, execute:\n\n```bash\nsamples-filter filter --repositories=repos.csv --out=filtered.csv\n```\n\nFor `--repositories`, provide the name of an **existing** [CSV] dataset\nwith GitHub repositories, and for `--out`, the name of the output file\n(it will be created automatically). 
If you feel lost, try `--help` and the tool\nwill explain what to do.\n\nOptionally, you can choose which [model](/models/README.md) to use for\nfiltering via `--model`. You can pass either `transformer` (the default) or\n`ml`.\n\n**Warning!**\nVersions `\u003c=0.5.1` used models based on supervised learning algorithms,\nsuch as [Random-Forest] and a [fine-tuned] transformer model based on\n[DistilBERT]. Those models could handle [binary classification]\nonly. In contrast, the latest versions use models that are based on\n[unsupervised learning](https://en.wikipedia.org/wiki/Unsupervised_learning),\nand can output a `rating` of how similar the input repository is to an SR.\n\n## How to contribute\n\nFork the repository, make changes, and send us a [pull request](https://www.yegor256.com/2014/04/15/github-guidelines.html).\nWe will review your changes and apply them to the `master` branch shortly,\nprovided they don't violate our quality standards. To avoid frustration,\nbefore sending us your pull request, please run the full build:\n\n```bash\nmake install cov check\n```\n\nTo set up a virtual environment, use these commands:\n\n```bash\npython3 -m venv venv\nsource $(pwd)/venv/bin/activate\n```\n\nYou will need [Python 3.11+]\ninstalled.\n\n[CaM]: https://github.com/yegor256/cam\n[Random-Forest]: https://en.wikipedia.org/wiki/Random_forest\n[fine-tuned]: https://huggingface.co/docs/transformers/en/tasks/sequence_classification\n[DistilBERT]: https://huggingface.co/distilbert/distilbert-base-uncased\n[binary classification]: https://en.wikipedia.org/wiki/Binary_classification\n[CSV]: https://en.wikipedia.org/wiki/Comma-separated_values\n[Python 3.11+]: 
https://www.python.org/downloads/release/python-3110\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fh1alexbel%2Fsamples-filter","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fh1alexbel%2Fsamples-filter","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fh1alexbel%2Fsamples-filter/lists"}