# PyPlexity

<p align="center">
    <a href="https://pepy.tech/project/pyplexity/"><img alt="Downloads" src="https://img.shields.io/badge/dynamic/json?style=flat-square&maxAge=3600&label=downloads&query=$.total_downloads&url=https://api.pepy.tech/api/projects/pyplexity"></a>
    <a href="https://pypi.python.org/pypi/pyplexity/"><img alt="PyPi" src="https://img.shields.io/pypi/v/pyplexity.svg?style=flat-square"></a>
</p>


This package provides a simple interface for applying perplexity filters to any document. A typical use case is the removal of boilerplate (sentences with a high perplexity score).
It also includes a rough HTML tag cleaner and a bulk processor for WARC and HTML files, with distributed computing capabilities.

![](imgs/perpl.PNG)

## Cite

If you use this tool, please cite:

Fernández-Pichel, M., Prada-Corral, M., Losada, D. E., Pichel, J. C., & Gamallo, P. (2023). [An unsupervised perplexity-based method for boilerplate removal. 
Natural Language Engineering](https://www.cambridge.org/core/journals/natural-language-engineering/article/an-unsupervised-perplexitybased-method-for-boilerplate-removal/5E589D838F1D1E0736B4F52001150339), 1-18.


## Models

### English language

The models are memory-intensive, but the computation does not scale on CPU.

| Model | RAM usage | Download size |
| --- | --- | --- |
| bigrams-cord19 | 2 GB | 230 MB |
| bigrams-bnc | 5 GB | 660 MB |
| trigrams-cord19 | 6.6 GB | 1 GB |
| trigrams-bnc | 14 GB | 2.2 GB |

Two different datasets were selected to build the background language model (LM): the CORD-19 dataset [1] and the British National Corpus (BNC) [2].

[1] Wang, L. L., Lo, K., Chandrasekhar, Y., Reas, R., Yang, J., Eide, D., ... & Kohlmeier, S. (2020). CORD-19: The COVID-19 open research dataset. ArXiv.

[2] BNC Consortium. (2007). British National Corpus. Oxford Text Archive Core Collection.

### Galician language

COMING SOON: support for minority languages as part of the [Nós project](http://nos.gal/es/proxecto-nos). In the meantime, you can download a Galician bigram model from [here](https://fegalaz.usc.es/~gamallo/bigrams_modelo-gl-bigramas-merged.st).

### Build and use custom models

If you want to build your own models, see [LanguageModel](https://github.com/gamallo/LanguageModel). You can then load a local model with the **--model PATH** parameter.

## Installation process

The package is available in the [PyPI](https://pypi.org/project/pyplexity/) repository and can be installed with pip:

```
pip install pyplexity
```

## Examples of usage options

### Compute perplexity from console

The "perplexity" command computes the perplexity score of a given sentence under the background LM (by default, bigrams-bnc; the "--model" argument selects a different model).
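As background, the perplexity score reflects how improbable a sentence is under the n-gram LM. The following is a minimal, self-contained sketch of bigram perplexity with add-one smoothing, for illustration only: pyplexity's real models and smoothing scheme differ, and `train_bigram_model`/`bigram_perplexity` are hypothetical names, not part of the package.

```python
import math
from collections import Counter

def train_bigram_model(corpus):
    """Count unigrams and bigrams over tokenized sentences with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_perplexity(sentence, unigrams, bigrams):
    """Perplexity under an add-one-smoothed bigram model: lower = more 'normal' text."""
    tokens = ["<s>"] + sentence + ["</s>"]
    vocab_size = len(unigrams)
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(tokens) - 1))

# Tiny toy corpus standing in for a background LM such as bigrams-bnc.
corpus = [s.split() for s in ["this is normal text", "this is a sentence", "text is normal"]]
unigrams, bigrams = train_bigram_model(corpus)
normal = bigram_perplexity("this is normal text".split(), unigrams, bigrams)
garbled = bigram_perplexity("BOI%& 678346 NOR".split(), unigrams, bigrams)
assert garbled > normal  # malformed text gets a higher perplexity
```

The same inequality is what the thresholding below exploits: boilerplate and garbled fragments sit far above the perplexity of well-formed sentences.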
*Documentation*:
```
citius@pc:~$ pyplexity perplexity --help
Usage: pyplexity perplexity [OPTIONS] TEXT

Arguments:
  TEXT  [required]

Options:
  --model TEXT  [default: bigrams-bnc]
  --help        Show this message and exit.
```
By default, models are stored in ~/.cache/cached_path/, as per the cached-path package documentation. *Example*:
```
citius@pc:~$ pyplexity perplexity "this is normal text"
downloading: 100%|##########| 660M/660M [00:11<00:00, 59.0MiB/s]
Loading model... Done.
1844.85540669094
citius@pc:~$ pyplexity perplexity "this is normal HTML PAGE BOI%& 678346 NOR  text"
Loading model... Done.
44787.99199563819
```
As can be seen, malformed sentences obtain a much higher score.

### Bulk perplexity computation and cleaning of a directory

The previous command is a toy example: in real applications, we usually want to score complete datasets in order to clean them. This is where the bulk-perplexity command, which supports WARC or HTML directories, comes in.

*Documentation*:
```
citius@pc:~$ pyplexity bulk-perplexity --help
Usage: pyplexity bulk-perplexity [OPTIONS] INPUT_DIR

Arguments:
  INPUT_DIR  [required]

Options:
  --output-dir TEXT                [default: out_dir]
  --model TEXT                     [default: bigrams-bnc]
  --perpl-limit FLOAT              [default: 8000.0]
  --warc-input / --no-warc-input   [default: no-warc-input]
Distributed computing options:
  --distributed / --no-distributed [default: no-distributed]
  --n-workers INTEGER              [default: 1]
  --node INTEGER                   [default: 1]
  --port INTEGER                   [default: 8866]
  --help                           Show this message and exit.
```
The distributed computing options are explained later. The input directory may contain nested subdirectories with files. WARC containers and HTML files should be tag-cleaned beforehand with the command described below. 
*Example*:
```
citius@pc:~$ pyplexity bulk-perplexity ./out_dir/ --output-dir cleaned_files --model bigrams-cord19
downloading: 100%|##########| 233M/233M [00:03<00:00, 63.3MiB/s] 
Loading model... Done.
Computed 1124 files in 0:00:01.905390.
```

**NOTE**: In this version, malformed sentences are no longer removed. They are only tagged with **ppl**, giving end users more control.

### Perform HTML tag cleaning of a directory

The perplexity method does not remove HTML tags by itself, and leaving them in can degrade its overall performance. We therefore recommend removing HTML tags first, with the tag-remover command included in the package.

*Documentation*:
```
citius@pc:~$ pyplexity tag-remover --help
Usage: pyplexity tag-remover [OPTIONS] BASE_DIR

Arguments:
  BASE_DIR  [required]

Options:
  --output-dir TEXT                [default: out_dir]
  --warc-input / --no-warc-input   [default: no-warc-input]
Distributed computing options:
  --distributed / --no-distributed [default: no-distributed]
  --n-workers INTEGER              [default: 1]
  --node INTEGER                   [default: 1]
  --port INTEGER                   [default: 8866]
  --help                           Show this message and exit.

```
The distributed computing options are explained later. The input directory may contain nested subdirectories with files, either HTML or WARC. For WARC input, the container is efficiently recompressed after all tags have been stripped out. *Example*:
```
citius@pc:~$ pyplexity tag-remover ./html_source --output-dir ./output
Computed 1124 files in 0:00:00.543175.
```
## Parallel mode (cluster)
As the documentation above shows, the commands have integrated distributed computing capabilities. In cluster mode, all nodes must be interconnected on a local network, with access to the same files mounted via SSHFS or another shared filesystem. 
A master node recursively loads the folder of files to be processed:
```
pyplexity fileserver /mnt/input_dir --port 8866
```
Clients on the worker nodes then connect to the master and request file names to process. This mechanism distributes the load, since clients pull queued files from the master as they become available. For example, on a node:
```
pyplexity bulk-perplexity /mnt/input_dir --output-dir /mnt/output_dir --warc-input --distributed --n-workers 10 --node 2 --url master.local --port 8866
```
This command should be executed on every machine in the cluster. The node argument identifies the machine for logging purposes only and has no functional effect. The n-workers argument controls the number of worker threads per machine that concurrently query the master node for files. Once the master has served all files, the worker processes shut down accordingly. In our experiments, we used this feature to run HTML tag removal and perplexity computation across 20 threads on each of 15 machines.

## Interfacing from Python

*pyplexity* can also be used directly from Python code. As an example, we provide an API that serves a web app for small experiments on cleaning texts or raw files.

Example: computing the perplexity score of a sentence:
```
from pyplexity import PerplexityModel

model = PerplexityModel.from_str("bigrams-cord19")
perpl = model.compute_sentence("this is normal text")
```
Example 2: cleaning sentences from a text:
```
from pyplexity import PerplexityModel, PerplexityProcessor

model = PerplexityModel.from_str("bigrams-cord19")
text_processor = PerplexityProcessor(perpl_model=model, perpl_limit=8000.0)
clean_text = text_processor.process("This is a normal sentence. 
Meanwhile, hjldfuia HTML BODY this one will be deleted LINK URL COUISUDOANLHJWQKEJK")
```
Example 3: removing HTML tags from a website:
```
import requests
from pyplexity.tag_remover import HTMLTagRemover

html = requests.get("https://example.com").text
text = HTMLTagRemover().process(html)
```
## Web Demo
We also provide a [web demo](https://tec.citius.usc.es/pyplexity/) as a simple showcase of the tool. Screenshot:
<p align="center">
  <img src="https://user-images.githubusercontent.com/6536835/158210142-c0b04512-f482-49fc-9261-adb15628984f.png" alt="screenshot" width="600"/>
</p>


## Building the package

You can also build the same package version that is currently deployed in the PyPI repository:

```
git clone https://github.com/citiususc/pyplexity && cd pyplexity
curl -sSL https://install.python-poetry.org | python3 -
source $HOME/.poetry/env
poetry build
pip3 install dist/pyplexity-X.X.X-py3-none-any.whl
```

## General Advice

Because this is an unsupervised method, the model and threshold must be chosen appropriately. From our [experimentation](https://www.cambridge.org/core/journals/natural-language-engineering/article/an-unsupervised-perplexitybased-method-for-boilerplate-removal/5E589D838F1D1E0736B4F52001150339), we conclude that using the bigrams-bnc model and removing sentences with a perplexity above 8,000 is a robust strategy for both an IR search task and a text classification task.
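To close, the overall pipeline (tag removal followed by perplexity thresholding) can be sketched with the standard library alone. Everything here is a hypothetical stand-in: `TextExtractor`, `clean_document`, and the toy scorer are not pyplexity's `HTMLTagRemover` or `PerplexityProcessor`, which use real LM scores and more careful sentence handling.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping script/style contents (stand-in for HTMLTagRemover)."""
    def __init__(self):
        super().__init__()
        self.parts, self._skip = [], 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def clean_document(html, score_sentence, perpl_limit=8000.0):
    """Strip tags, split into rough sentences, keep those at or under the threshold."""
    parser = TextExtractor()
    parser.feed(html)
    text = " ".join(parser.parts)
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return " ".join(s for s in sentences if score_sentence(s) <= perpl_limit)

# Toy scorer: pretend mostly-uppercase (boilerplate-like) sentences score above the limit.
toy_score = lambda s: 9999.0 if sum(c.isupper() for c in s) > len(s) / 2 else 100.0
html = "<html><body><p>This is normal text.</p><p>MENU LINK FOOTER</p></body></html>"
print(clean_document(html, toy_score))  # keeps only "This is normal text."
```

In practice, `score_sentence` would be a call into a real background LM, and the 8,000 cutoff is the bigrams-bnc threshold recommended in the General Advice above.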