{"id":31650461,"url":"https://github.com/open-technology-foundation/stopwords.bash","last_synced_at":"2026-05-18T19:04:27.558Z","repository":{"id":317937757,"uuid":"1069439049","full_name":"Open-Technology-Foundation/stopwords.bash","owner":"Open-Technology-Foundation","description":"Pure Bash stopwords filter from input text. Faster than python for texts \u003c 2000 words","archived":false,"fork":false,"pushed_at":"2025-10-04T00:18:33.000Z","size":55,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-10-04T02:37:06.096Z","etag":null,"topics":["bash","nltk","stopwords"],"latest_commit_sha":null,"homepage":"https://yatti.id/","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Open-Technology-Foundation.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-10-04T00:09:35.000Z","updated_at":"2025-10-04T00:18:36.000Z","dependencies_parsed_at":"2025-10-04T02:37:26.757Z","dependency_job_id":"e1d890f7-bebf-4ef2-bdd8-55fb0f6a9da6","html_url":"https://github.com/Open-Technology-Foundation/stopwords.bash","commit_stats":null,"previous_names":["open-technology-foundation/stopwords.bash"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/Open-Technology-Foundation/stopwords.bash","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Open-Technology-Foundation%2Fstopwords.bash","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Open-Technology-Foundation%2Fstopwords.bash/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Open-Technology-Foundation%2Fstopwords.bash/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Open-Technology-Foundation%2Fstopwords.bash/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Open-Technology-Foundation","download_url":"https://codeload.github.com/Open-Technology-Foundation/stopwords.bash/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Open-Technology-Foundation%2Fstopwords.bash/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278742889,"owners_count":26037915,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-07T02:00:06.786Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bash","nltk","stopwords"],"created_at":"2025-10-07T08:29:50.768Z","updated_at":"2026-05-18T19:04:27.549Z","avatar_url":"https://github.com/Open-Technology-Foundation.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Stopwords Filter\n\nA fast, multilingual text processing utility that filters stopwords from input text. Supports 33 languages with efficient O(1) lookup using Bash associative arrays.\n\n\u003e **Note:** For documents \u003e 2,000 words, consider the **[Python implementation](https://github.com/Open-Technology-Foundation/stopwords)** which offers superior performance on larger datasets. Both use the same NLTK stopwords data.\n\n## Features\n\n- **Multilingual Support**: Filter stopwords in 33 different languages\n- **Multiple Output Formats**: Single-line, list, or word frequency counts\n- **Flexible Input**: Accept text via command-line arguments or stdin\n- **Punctuation Control**: Optionally preserve or remove punctuation marks\n- **Case-Insensitive**: Matches stopwords regardless of case\n- **Fast Performance**: O(1) stopword lookup using associative arrays\n- **Dual Usage**: Use as a standalone script or source as a Bash function\n\n## Installation\n\n### Quick Install\n\n```bash\ncurl -fsSL https://raw.githubusercontent.com/Open-Technology-Foundation/stopwords.bash/main/install.sh | sudo bash\n```\n\n### Standard Install\n\n**System-wide** (recommended):\n```bash\ngit clone https://github.com/Open-Technology-Foundation/stopwords.bash\ncd stopwords.bash\nsudo ./install.sh install\n```\n\n**User-local** (no sudo):\n```bash\nPREFIX=$HOME/.local ./install.sh install\n```\n\nThis installs the script to `$PREFIX/bin/stopwords` and stopwords data to `/usr/share/stopwords/` (33 languages, ~170KB). If Python NLTK stopwords are already installed, data installation is automatically skipped.\n\n### Verify \u0026 Uninstall\n\n```bash\n# Verify installation\n./install.sh check\n\n# Uninstall (system)\nsudo ./install.sh uninstall\n\n# Uninstall (user)\nPREFIX=$HOME/.local ./install.sh uninstall\n```\n\n## Usage\n\n### Basic Filtering\n\n```bash\n./stopwords 'the quick brown fox jumps over the lazy dog'\n# Output: quick brown fox jumps lazy dog\n```\n\n### Reading from stdin\n\n```bash\necho 'the quick brown fox' | ./stopwords\ncat document.txt | ./stopwords\n```\n\n### Language Selection (`-l`)\n\n```bash\n./stopwords -l spanish 'el rápido zorro marrón salta sobre el perro perezoso'\n# Output: rápido zorro marrón salta perro perezoso\n```\n\n### Punctuation Preservation (`-p`)\n\n```bash\n./stopwords 'Hello, world!'      # Output: hello world\n./stopwords -p 'Hello, world!'   # Output: hello, world!\n```\n\n### List Output (`-w`)\n\n```bash\n./stopwords -w 'the quick brown fox'\n# Output:\n# quick\n# brown\n# fox\n```\n\n### Word Frequency Counting (`-c`)\n\n```bash\n./stopwords -c 'the fox jumps and the fox runs'\n# Output:\n# 1 jumps\n# 1 runs\n# 2 fox\n\n./stopwords -c \u003c document.txt\n```\n\n## Supported Languages\n\nalbanian, arabic, azerbaijani, basque, belarusian, bengali, catalan, chinese, danish, dutch, english, finnish, french, german, greek, hebrew, hinglish, hungarian, indonesian, italian, kazakh, nepali, norwegian, portuguese, romanian, russian, slovene, spanish, swedish, tajik, tamil, turkish\n\n## Command-Line Options\n\n| Option | Long Form | Description |\n|--------|-----------|-------------|\n| `-l LANG` | `--language LANG` | Set the language for stopwords (default: english) |\n| `-p` | `--keep-punctuation` | Keep punctuation marks (default: remove) |\n| `-w` | `--list-words` | Output filtered words as a list (one per line) |\n| `-c` | `--count` | Output word frequency counts (sorted ascending) |\n| `-V` | `--version` | Show version information |\n| `-h` | `--help` | Show help message |\n\n## Using as a Sourced Function\n\n```bash\nsource stopwords\nstopwords 'the quick brown fox'           # Output: quick brown fox\nstopwords -l spanish 'el rápido zorro'    # Output: rápido zorro\n```\n\n## Practical Examples\n\n```bash\n# Extract keywords from a document\ncat article.txt | ./stopwords -w | sort | uniq\n\n# Find most common words\n./stopwords -c \u003c article.txt | tail -20\n\n# Clean search queries\necho \"how to install python on ubuntu\" | ./stopwords\n# Output: install python ubuntu\n\n# Batch preprocessing\nfor file in corpus/*.txt; do\n  ./stopwords \u003c \"$file\" \u003e \"processed/$(basename \"$file\")\"\ndone\n```\n\n## Exit Codes\n\n- `0`: Success\n- `1`: Data directory or stopwords file not found\n- `2`: Missing argument for option\n- `22`: Invalid option\n\n## Troubleshooting\n\n**Stopwords data not found?**\n\nThe script searches these locations in order:\n1. `$NLTK_DATA/corpora/stopwords/` (custom NLTK path)\n2. `/usr/share/nltk_data/corpora/stopwords/` (system NLTK)\n3. `/usr/share/stopwords/` (bundled fallback)\n\nSolutions:\n```bash\n# Install this package\nsudo ./install.sh install\n\n# OR use Python NLTK\npip install nltk \u0026\u0026 python -m nltk.downloader stopwords\n\n# OR set NLTK_DATA manually\nexport NLTK_DATA=/path/to/your/nltk_data\n```\n\n**User-local install not in PATH?**\n```bash\n# Add to ~/.bashrc\nexport PATH=\"$HOME/.local/bin:$PATH\"\n```\n\n## License\n\nGPL-3. See [LICENSE](LICENSE)\n\n## Contributing\n\nContributions welcome! Submit issues or pull requests on GitHub.\n\n## Acknowledgments\n\nStopword lists sourced from the [NLTK corpus](https://www.nltk.org/).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopen-technology-foundation%2Fstopwords.bash","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopen-technology-foundation%2Fstopwords.bash","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopen-technology-foundation%2Fstopwords.bash/lists"}