{"id":37065246,"url":"https://github.com/tsproisl/textcomplexity","last_synced_at":"2026-01-14T07:38:16.805Z","repository":{"id":57474631,"uuid":"111522851","full_name":"tsproisl/textcomplexity","owner":"tsproisl","description":"Linguistic and stylistic complexity measures for (literary) texts","archived":false,"fork":false,"pushed_at":"2024-01-22T20:00:48.000Z","size":27740,"stargazers_count":84,"open_issues_count":5,"forks_count":13,"subscribers_count":10,"default_branch":"master","last_synced_at":"2025-09-21T05:29:01.647Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tsproisl.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGES.md","contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-11-21T08:48:09.000Z","updated_at":"2025-09-13T03:01:07.000Z","dependencies_parsed_at":"2022-09-10T02:21:41.884Z","dependency_job_id":null,"html_url":"https://github.com/tsproisl/textcomplexity","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/tsproisl/textcomplexity","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tsproisl%2Ftextcomplexity","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tsproisl%2Ftextcomplexity/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tsproisl%2Ftextcomplexity/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tsproisl%2Ftextcomplexity/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tsproisl","download_url":"https://codeload.github.com/tsproisl/tex
tcomplexity/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tsproisl%2Ftextcomplexity/sbom","scorecard":{"id":901002,"data":{"date":"2025-08-11","repo":{"name":"github.com/tsproisl/textcomplexity","commit":"940fceaa0f98e85305793604070a3a6b79dc60c9"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":3,"checks":[{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Code-Review","score":0,"reason":"Found 0/30 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the 
project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"SAST","score":0,"reason":"no SAST tool detected","details":["Warn: no pull requests merged into dev branch"],"documentation":{"short":"Determines if the project uses static code analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Vulnerabilities","score":10,"reason":"0 existing vulnerabilities detected","details":null,"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no 
fuzzer integrations found"],"documentation":{"short":"Determines if the project uses fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE.txt:0","Info: FSF or OSI recognized license: GNU General Public License v3.0: LICENSE.txt:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Pinned-Dependencies","score":-1,"reason":"no dependencies found","details":null,"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'master'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection 
settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}}]},"last_synced_at":"2025-08-24T15:33:39.773Z","repository_id":57474631,"created_at":"2025-08-24T15:33:39.773Z","updated_at":"2025-08-24T15:33:39.773Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28413461,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-14T05:26:33.345Z","status":"ssl_error","status_checked_at":"2026-01-14T05:21:57.251Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-01-14T07:38:16.269Z","updated_at":"2026-01-14T07:38:16.795Z","avatar_url":"https://github.com/tsproisl.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Linguistic and Stylistic Complexity\n\n[![PyPI](https://img.shields.io/pypi/v/textcomplexity)](https://pypi.org/project/textcomplexity/)\n\nThis project implements various measures that assess the linguistic\nand stylistic complexity of (literary) texts. There are\n[surface-based](#surface-based-measures),\n[sentence-based](#sentence-based-measures),\n[pos-based](#pos-based-measures),\n[dependency-based](#dependency-based-measures) and\n[constituency-based](#constituency-based-measures) measures. 
Most of\nthe measures are language independent, but some of them rely on\nlanguage-specific information (see [language definition\nfiles](#language-definition-files)) or are only defined for German\n(this affects some of the constituency-based measures).\n\n## Installation\n\nThe easiest way to install the toolbox is via pip (pip3 in some\ndistributions):\n\n    pip install textcomplexity\n\nAlternatively, you can download and decompress the [latest\nrelease](https://github.com/tsproisl/textcomplexity/releases/latest)\nor clone the git repository:\n\n    git clone https://github.com/tsproisl/textcomplexity.git\n\nIn the new directory, run the following command:\n\n    python3 setup.py install\n\n\n## Usage\n\nYou can use the script `bin/txtcomplexity` to compute (a sensible\nsubset of) all implemented complexity measures from the command line.\nThe script currently supports two input formats: The widely used\n[CoNLL-U format](https://universaldependencies.org/format.html)\n(`--input-format conllu`) and a custom tab-separated input format\n(`--input-format tsv`).\n\nThe **CoNLL-U format** consists of ten tab-separated columns that encode,\namong other things, the dependency structure of the sentence. Missing\nvalues can be represented by an underscore (`_`). 
Here is an example:\n\n    # sent_id = hdt-s469\n    # text = Netscape hatte den Browser-Markt noch 1994 zu fast 90 Prozent beherrscht .\n    1\tNetscape\tNetscape\tPROPN\tNE\t_\t11\tnsubj\t_\t_\n    2\thatte\thaben\tAUX\tVAFIN\t_\t11\taux\t_\t_\n    3\tden\tden\tDET\tART\t_\t4\tdet\t_\t_\n    4\tBrowser-Markt\tMarkt\tNOUN\tNN\t_\t11\tobj\t_\t_\n    5\tnoch\tnoch\tADV\tADV\t_\t6\tadvmod\t_\t_\n    6\t1994\t1994\tNUM\tCARD\t_\t11\tobl\t_\t_\n    7\tzu\tzu\tADP\tAPPR\t_\t10\tcase\t_\t_\n    8\tfast\tfast\tADV\tADV\t_\t9\tadvmod\t_\t_\n    9\t90\t90\tNUM\tCARD\t_\t10\tnummod\t_\t_\n    10\tProzent\tProzent\tNOUN\tNN\t_\t11\tobl\t_\t_\n    11\tbeherrscht\tbeherrschen\tVERB\tVVPP\t_\t0\troot\t_\t_\n    12\t.\t.\tPUNCT\t$.\t_\t11\tpunct\t_\t_\n\nIf you want to compute the constituency-based complexity measures, the\ninput should be in a **custom tab-separated format** with six\ntab-separated columns and an empty line after each sentence. The six\ncolumns are: word index, word, part-of-speech tag, index of dependency\nhead, dependency relation, phrase structure tree. Missing values can\nbe represented by an underscore (`_`). Here is a short example with\ntwo sentences:\n\n    1\tDas\tART\t3\tNK\t(TOP(S(NP*\n    2\tfremde\tADJA\t3\tNK\t*\n    3\tSchiff\tNN\t4\tSB\t*)\n    4\twar\tVAFIN\t-1\t--\t*\n    5\tnicht\tPTKNEG\t6\tNG\t(AVP*\n    6\tallein\tADV\t4\tMO\t*)\n    7\t.\t$.\t6\t--\t*))\n\n    1\tSieben\tCARD\t2\tNK\t(TOP(S(NP*\n    2\tweitere\tADJA\t3\tMO\t*)\n    3\tbegleiteten\tVVFIN\t-1\t--\t*\n    4\tes\tPPER\t3\tOA\t*\n    5\t.\t$.\t4\t--\t*))\n\nWithout any further options, the script computes a sensible subset of\nall applicable measures (see below):\n\n    txtcomplexity --input-format conllu \u003cfile\u003e\n\nThe script automatically includes measures that rely on\nlanguage-specific information if you specify the input language. If\nyour texts are in German or English, you can use `--lang de` or\n`--lang en`. 
If your texts are in another language, use `--lang other\n--lang-def \u003cfile\u003e` to provide a custom [language definition\nfile](#language-definition-files).\n\nIf you want to compute more (or fewer) measures, indicate one of the\npredefined sets of measures (via `--preset`). You can choose to ignore\npunctuation (`--ignore-punct`) or case (`--ignore-case`) and set the\nwindow size for the surface-based measures (`--window-size`). By\ndefault, the script formats its output as JSON, but you can also\nrequest tab-separated values suitable for import into a spreadsheet\n(`--output-format tsv`). More detailed usage information is available\nvia:\n\n    txtcomplexity -h\n\n### Utility script: From raw text to CoNLL-U\n\nGetting the input format right can sometimes be a bit tricky.\nTherefore, the `utils/` subdirectory of this repository contains a\nsimple wrapper script around\n[stanza](https://stanfordnlp.github.io/stanza/), a state-of-the-art\nNLP pipeline.\n\nFirst, you need to install stanza:\n\n    pip install stanza\n\nNow you can use the wrapper script to parse your text files:\n\n    run_stanza.py --language \u003clanguage\u003e --output-dir \u003cdirectory\u003e \u003cfile\u003e …\n\n## Complexity measures\n\n### Core measures of lexical complexity\n\nIn our article on lexical complexity (currently in preparation) we\nargue that there are several distinct aspects (or dimensions) of\nlexical complexity and we propose a single measure for each of the\ndimensions. Most of them are implemented here.\n\n  - *Variability*: How large is the vocabulary? Measured via type-token ratio.\n  - *Evenness*: How evenly are the tokens distributed among the\n    different types? Measured via normalized entropy.\n  - *Rarity*: How many rare words are used? 
Measured with the help of\n    a reference frequency list.\n      - General rarity: Rarity with respect to a representative sample\n        of the language.\n      - Genre rarity: Rarity with respect to a specific genre.\n  - *Dispersion*: How evenly are the tokens of a type distributed\n    throughout the text? Measured via Gini-based dispersion (without\n    hapax legomena).\n  - *Lexical density*: How many content words are used? Measured with\n    the help of part-of-speech tags.\n  - *Surprise*: How unexpected are word choices in the text? Not\n    implemented here.\n  - *Disparity*: How semantically dissimilar are the words? Not\n    implemented here.\n\n### Surface-based measures\n\n#### Measures that use sample size and vocabulary size\n\n  - Type-token ratio\n  - Brunet's (1978) W\n  - Carroll's (1964) CTTR\n  - Dugast's (1978, 1979) U\n  - Dugast's (1979) k\n  - Guiraud's (1954) R\n  - Herdan's (1960, 1964) C\n  - Maas' (1972) a\u003csup\u003e2\u003c/sup\u003e\n  - Summer's S\n  - Tuldava's (1977) LN\n\nAll of these measures correlate perfectly.\n\n\u003c!-- Therefore, the default --\u003e\n\u003c!-- setting is to only compute the type-token ratio. If you want to --\u003e\n\u003c!-- compute all of these measures, use the option `--all-measures`. --\u003e\n\n#### Measures that use part of the frequency spectrum\n\n  - Honoré's (1979) H\n  - Michéa's (1969, 1971) M\n  - Sichel's (1975) S\n\nMichéa's M is the reciprocal of Sichel's S.\n\n\u003c!-- , therefore we only --\u003e\n\u003c!-- compute Sichel's S by default. If you want to compute Michéa's M as --\u003e\n\u003c!-- well, use the option `--all-measures`. 
--\u003e\n\n#### Measures that use the whole frequency spectrum\n\n  - Entropy (Shannon 1948)\n  - Evenness (= [normalized entropy](https://en.wikipedia.org/wiki/Entropy_(information_theory)#Efficiency_(normalized_entropy)))\n  - Herdan's (1955) V\u003csub\u003em\u003c/sub\u003e\n  - Jarvis' (2013) evenness (standard deviation of tokens per type)\n  - McCarthy and Jarvis' (2010) HD-D\n  - Simpson's (1949) D\n  - Yule's (1944) K\n\nYule's K, Simpson's D and Herdan's V\u003csub\u003em\u003c/sub\u003e correlate perfectly.\nSimpson's D is perhaps the most intuitive of the three measures and\ncan be interpreted as the probability of two randomly drawn tokens\nfrom the text being identical.\n\n\u003c!-- Therefore, the default setting is to only compute Simpson's D (which --\u003e\n\u003c!-- can be interpreted as the probability of two randomly drawn tokens --\u003e\n\u003c!-- from the text being identical). If you also want to compute Yule's K --\u003e\n\u003c!-- and Herdan's V\u003csub\u003em\u003c/sub\u003e, use the option `--all-measures`. 
--\u003e\n\n#### Parameters of probabilistic models\n\n  - Orlov's (1983) Z\n\n#### Measures that use the whole text\n\n  - Average token length\n  - Covington and McFall's (2010) MATTR\n  - Kubát and Milička's (2013) STTR\n  - Log10 text length in characters\n  - Log10 text length in tokens\n  - MTLD (McCarthy and Jarvis 2010)\n\nMeasures of dispersion:\n  - Evenness-based dispersion\n  - Gini-based dispersion\n  - Gries' DP and DP\u003csub\u003enorm\u003c/sub\u003e (Gries 2008, Lijffijt and Gries 2012)\n  - Kullback-Leibler divergence (Kullback and Leibler 1951)\n\nDP/DP\u003csub\u003enorm\u003c/sub\u003e and KL-divergence require an additional parameter\n(the number of parts in which to split the text), therefore they are\nnot computed in the command-line script.\n\n### Sentence-based measures\n\n  - Sentence length in characters\n  - Sentence length in tokens\n\nLanguage-specific measures relying on a list of part-of-speech tags\nthat indicate punctuation, see [language definition\nfiles](#language-definition-files):\n  - Punctuation per sentence\n  - Punctuation per token\n  - Sentence length in words\n\n### POS-based measures\n\n  - Lexical density (Ure 1971)\n  - Rarity (requires a reference frequency list)\n\nThese measures rely on language-specific information (lists of\npart-of-speech tags that indicate open word classes and proper names\nand lists of the most common word-tag pairs in a reference corpus),\nsee [language definition files](#language-definition-files).\n\n### Dependency-based measures\n\n  - Average dependency distance (Oya 2011)\n  - Closeness centrality\n  - Closeness centralization (Freeman 1978)\n  - Dependents per token\n  - Longest shortest path\n  - Outdegree centralization (Freeman 1978)\n\n### Constituency-based measures\n\nLanguage-independent measures:\n  - Constituents per sentence\n  - Height of the parse trees\n  - Non-terminal constituents per sentence\n\nLanguage-dependent measures (defined for the German [NEGRA 
parsing\nscheme](http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/knoten.html)):\n  - Clauses per sentence\n  - Complex t-units per sentence\n  - Coordinate phrases per sentence\n  - Dependent clauses per sentence\n  - Noun phrases per sentence\n  - Prepositional phrases per sentence\n  - Verb phrases per sentence\n  - t-units per sentence\n\n## Language definition files\n\nSome complexity measures (e.g. lexical density and rarity) require\nlanguage-specific information that needs to be provided by *language\ndefinition files*. For German and English, the built-in language\ndefinition files will be used automatically (as long as you indicate\nthe language via the `--lang` option). For other languages (`--lang\nother`), you need to provide the language definition files yourself.\nLanguage definition files are in JSON format and contain the following\ninformation:\n\n  - `language`: Language code\n  - `punctuation`: List of language-specific part-of-speech tags used\n    for punctuation (column XPOS in CoNLL-U format)\n  - `proper_names`: List of language-specific part-of-speech tags used\n    for proper names\n  - `open_classes`: List of language-specific part-of-speech tags used\n    for open word classes (including proper names)\n  - `most_common`: List of the most frequent content words (excluding\n    proper names) and their part-of-speech tags; for German and\n    English, we use the 5,000 most frequent words according to the\n    [COW frequency\n    lists](https://www.webcorpora.org/opendata/frequencies/)\n\nHere is an excerpt from the [German language definition\nfile](textcomplexity/de.json) (omitting most of the 5,000 most common\ncontent words):\n\n```json\n{\"language\": \"de\",\n \"punctuation\": [\"$.\", \"$,\", \"$(\"],\n \"proper_names\": [\"NE\"],\n \"open_classes\": [\"ADJA\", \"ADJD\", \"ITJ\", \"NE\", \"NN\", \"TRUNC\", \"VVFIN\", \"VVIMP\", \"VVINF\", \"VVIZU\", \"VVPP\"],\n \"most_common\": [[\"gibt\", \"VVFIN\"],\n                 
[\"gut\", \"ADJD\"],\n                 [\"Zeit\", \"NN\"],\n                 …\n                 [\"Fahrzeugen\", \"NN\"],\n                 [\"Kopie\", \"NN\"],\n                 [\"Merkmale\", \"NN\"]\n                ]\n}\n```\n\nHere is an excerpt from the [English language definition\nfile](textcomplexity/en.json) (omitting most of the 5,000 most common\ncontent words). Note that the part-of-speech tags for punctuation look\nlike punctuation symbols – but we list pos tags, not punctuation\nsymbols:\n\n```json\n{\"language\": \"en\",\n \"punctuation\": [\".\", \",\", \":\", \"\\\"\", \"``\", \"(\", \")\", \"-LRB-\", \"-RRB-\"],\n \"proper_names\": [\"NNP\", \"NNPS\"],\n \"open_classes\": [\"AFX\", \"JJ\", \"JJR\", \"JJS\", \"NN\", \"NNS\", \"RB\", \"RBR\", \"RBS\", \"UH\", \"VB\", \"VBD\", \"VBG\", \"VBN\", \"VBP\", \"VBZ\"],\n \"most_common\": [[\"is\", \"VBZ\"],\n                 [\"be\", \"VB\"],\n                 [\"was\", \"VBD\"],\n                 …\n                 [\"statistical\", \"JJ\"],\n                 [\"appearing\", \"VBG\"],\n                 [\"recipes\", \"NNS\"]\n                ]\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftsproisl%2Ftextcomplexity","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftsproisl%2Ftextcomplexity","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftsproisl%2Ftextcomplexity/lists"}