{"id":28944735,"url":"https://github.com/john-hawkins/texturizer","last_synced_at":"2025-07-12T16:38:51.071Z","repository":{"id":62584674,"uuid":"231017915","full_name":"john-hawkins/texturizer","owner":"john-hawkins","description":"A library and command line application for adding different kinds of features derived from columns of raw text.","archived":false,"fork":false,"pushed_at":"2022-02-26T04:05:39.000Z","size":3663,"stargazers_count":4,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-06-23T06:07:20.415Z","etag":null,"topics":["feature-engineering","feature-extraction","language-processing","natural-language-processing","text-mining","text-processing"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/john-hawkins.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-12-31T03:19:21.000Z","updated_at":"2023-12-31T02:50:48.000Z","dependencies_parsed_at":"2022-11-03T22:00:45.893Z","dependency_job_id":null,"html_url":"https://github.com/john-hawkins/texturizer","commit_stats":null,"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/john-hawkins/texturizer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/john-hawkins%2Ftexturizer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/john-hawkins%2Ftexturizer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/john-hawkins%2Ftexturizer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/john-hawkins%2Ftexturizer
/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/john-hawkins","download_url":"https://codeload.github.com/john-hawkins/texturizer/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/john-hawkins%2Ftexturizer/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265024280,"owners_count":23699589,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["feature-engineering","feature-extraction","language-processing","natural-language-processing","text-mining","text-processing"],"created_at":"2025-06-23T06:07:18.799Z","updated_at":"2025-07-12T16:38:51.065Z","avatar_url":"https://github.com/john-hawkins.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"texturizer\n----------\n\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![Tests](https://github.com/john-hawkins/texturizer/actions/workflows/python-package.yml/badge.svg)](https://github.com/john-hawkins/texturizer/actions/workflows/python-package.yml)\n[![PyPI](https://img.shields.io/pypi/v/texturizer.svg)](https://pypi.org/project/texturizer)\n[![Documentation Status](https://readthedocs.org/projects/texturizer/badge/?version=latest)](https://texturizer.readthedocs.io/en/latest/?badge=latest)\n\n```\nStatus - Functional\n```\n\nThis is an application to add features to a dataset that are derived from processing\nthe content of existing columns of text data. 
It is specifically designed for adding\nsomewhat bespoke and unusual features that are not particularly well identified by\nn-gram or word embedding approaches.\n\nIt will accept a CSV, TSV or XLS file and output an extended version of\nthe dataset with additional columns appended. When run with default settings\nit will produce a small number of very simple numerical summaries.\n\nAdditional feature flags unlock features that are more computationally intensive and\ngenerally domain specific.\n\nReleased and distributed via setuptools/PyPI/pip for Python 3.\n\nAdditional detail is available in the [documentation](https://texturizer.readthedocs.io).\n\n### TODO\n\n```\nCurrent features are all derived from single records. Future development will add these\nin some sense relative to a corpus.\n\n* Add capacity to generate features relative to corpus averages\n* Add capacity for comparison features to be generated relative to reference text(s)\n* Investigate functionality for working with unix shell pipes and streams\n\n```\n\n### Resources \u0026 Dependencies\n\nFor part-of-speech tagging and word embeddings we use [spaCy](https://spacy.io/usage/spacy-101).\n\nNote: After installation you will need to have spaCy download the English model.\n```\nsudo python3 -m spacy download en\n```\nFor string-based text comparisons we use [jellyfish](https://pypi.org/project/jellyfish/) and\n[textdistance](https://pypi.org/project/textdistance/).\n\n## Features\n\nEach type of feature can be unlocked through the use of a specific command-line switch:\n\n```\n  -topics            Default: False. Indicators for words from common topics.\n  -topics=count                      Count matching words from common topics.\n  -topics=normalize                  Count matching topic key words and normalize over topics.\n  -traits            Default: False. 
Word usage for personality traits.\n  -reason            Default: False. Word usage for reasoning: premises, conclusions.\n  -rhetoric          Default: False. Word usage for rhetorical devices.\n  -pos               Default: False. Part of speech proportions.\n  -literacy          Default: False. Checks for simple literacy markers.\n  -profanity         Default: False. Profanity check flags.\n  -sentiment         Default: False. Words counts for positive and negative sentiment.\n  -scarcity          Default: False. Word scarcity scores.\n  -emoticons         Default: False. Emoticon check flags.\n  -embedding         Default: False. Aggregate of Word Embedding Vectors.\n  -embedding=normalize               Normalised Aggregate of Word Embedding Vectors.\n  -comparison        Default: False. Cross-column comparisons using edit distances.\n\n```\n\n## Usage\n\nYou can use this application in multiple ways:\n\n### Runner\n\nUse the runner without installing the application.\nThe following example will generate all features on the test data.\n\n```\n./texturizer-runner.py -columns=question,answer -pos -literacy -traits -reason -rhetoric -profanity -emoticons -embedding -sentiment -scarcity -comparison -topics=count data/test.csv \u003e data/output.csv\n```\n\nThis will send the time performance profile to STDERR as shown below:\n```\nComputation Time Profile for each Feature Set\n---------------------------------------------\nsimple               0:00:00.498634\ncomparison           0:00:00.536637\nprofanity            0:00:00.496018\nsentiment            0:00:03.310224\nscarcity             0:00:00.523863\nemoticons            0:00:00.219341\nembedding            0:00:43.456939\ntopics               0:00:05.285120\ntraits               0:00:00.298902\nreason               0:00:00.305391\nrhetoric             0:00:02.988197\npos                  0:00:40.981175\nliteracy             0:00:00.371007\n```\n\nAs you can see, the part-of-speech (POS) features and word 
embeddings\nare the most time-consuming to generate. Both rely on the\nspaCy package to process the text block. For now, avoid using them on very large\ndatasets.\n\nTODO: improve performance of these feature generators.\n\n### Directory as package\n\nAlternatively, you can invoke the directory as a package:\n\n```\npython -m texturizer -columns=question,answer data/test.csv \u003e data/output.csv\n```\n\n### From Install\n\nOr you can simply install the package and use the command-line application directly:\n\n```\ntexturizer -h\n```\nThis prints the help text.\n\n\n# Installation\nInstallation from the source tree:\n\n```\npython setup.py install\n```\n\nOr via pip from PyPI:\n\n```\npip install texturizer\n```\n\nYou will then need to run the [post-install script](https://github.com/john-hawkins/texturizer/blob/master/POST_INSTALL.sh) to install the required spaCy model (otherwise the POS features cannot be calculated).\n\n\nNow the `texturizer` command is available:\n\n```\ntexturizer -columns=question,answer -topics data/test.csv \u003e data/output.csv\n```\n\nThis will take the input CSV, calculate some simple summary features, and\nproduce an output CSV with the features appended as new columns.\n\nFor more complicated features see the additional options (outlined above).\n\n# Acknowledgements\n\nPython package built using the\n[bootstrap cmdline template](https://github.com/jgehrcke/python-cmdline-bootstrap)\nby [jgehrcke](https://github.com/jgehrcke)\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjohn-hawkins%2Ftexturizer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjohn-hawkins%2Ftexturizer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjohn-hawkins%2Ftexturizer/lists"}