{"id":20693441,"url":"https://github.com/nanxstats/pdf-word-extraction","last_synced_at":"2025-04-22T17:43:31.333Z","repository":{"id":206871260,"uuid":"656951911","full_name":"nanxstats/pdf-word-extraction","owner":"nanxstats","description":"Extract meaningful words from a collection of PDF documents and count their frequencies","archived":false,"fork":false,"pushed_at":"2024-06-14T01:15:48.000Z","size":4,"stargazers_count":2,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-06-15T02:24:41.332Z","etag":null,"topics":["ftfy","natural-language-processing","pypdf","research-paper","spacy","wordcloud"],"latest_commit_sha":null,"homepage":"https://nanx.me/blog/post/research-word-cloud/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nanxstats.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-06-22T01:58:38.000Z","updated_at":"2024-06-15T02:24:41.333Z","dependencies_parsed_at":"2023-11-12T22:30:40.458Z","dependency_job_id":"e26b04ea-2778-40cb-bd15-3733107b36f6","html_url":"https://github.com/nanxstats/pdf-word-extraction","commit_stats":null,"previous_names":["nanxstats/pdf-word-extraction"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nanxstats%2Fpdf-word-extraction","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nanxstats%2Fpdf-word-extraction/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nanxstats%2Fpdf-word-extraction/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/repositories/nanxstats%2Fpdf-word-extraction/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nanxstats","download_url":"https://codeload.github.com/nanxstats/pdf-word-extraction/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224982113,"owners_count":17402315,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ftfy","natural-language-processing","pypdf","research-paper","spacy","wordcloud"],"created_at":"2024-11-16T23:26:42.411Z","updated_at":"2024-11-16T23:26:42.998Z","avatar_url":"https://github.com/nanxstats.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PDF Word Extraction\n\nThis tool is designed to extract meaningful words from a collection of PDF\ndocuments. 
The extracted words are processed and their frequencies are counted.\nThis frequency data can be used for various text analysis and visualization\ntasks, such as generating word clouds or identifying common themes in the\ndocument collection.\n\nThe tool leverages the modern text data toolchain in Python:\n\n- pypdf: for reading PDFs.\n- ftfy: for text cleaning.\n- spaCy: for natural language processing such as\n  tokenization, lemmatization, and stop-word removal.\n\nThe tool also provides customizable features such as the ability to specify\nwords for removal or replacement.\n\n## Workflow\n\nClone the repository:\n\n```bash\ngit clone https://github.com/nanxstats/pdf-word-extraction.git\n```\n\nCreate a [virtual environment](https://docs.python.org/3/library/venv.html)\ninside the cloned repository, activate it, and install the required Python\npackages into the virtual environment:\n\n```bash\ncd pdf-word-extraction\npython3 -m venv venv\nsource venv/bin/activate\npip install -r requirements.txt\n```\n\nPut the PDF files under `pdf/`, then run:\n\n```bash\npython3 pdf_word_extraction.py\n```\n\nIf you use VS Code, open the project and select the recommended \"venv\"\nPython interpreter. Edit the list of words to remove and replace in\n`pdf_word_extraction.py`, save the file, and run it again in the terminal.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnanxstats%2Fpdf-word-extraction","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnanxstats%2Fpdf-word-extraction","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnanxstats%2Fpdf-word-extraction/lists"}