{"id":27021711,"url":"https://github.com/mciccale/scholarvista","last_synced_at":"2025-04-04T19:52:07.450Z","repository":{"id":221831466,"uuid":"755475575","full_name":"mciccale/ScholarVista","owner":"mciccale","description":"ScholarVista analyses research papers and extracts/plots information about them. It uses Grobid to extract all the content of the research papers. Then all this data is plotted and displayed using Python.","archived":false,"fork":false,"pushed_at":"2024-03-06T19:12:37.000Z","size":3402,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-03-06T20:33:17.324Z","etag":null,"topics":["keyword-cloud","keyword-extraction","machine-learning","python3","text-extraction"],"latest_commit_sha":null,"homepage":"https://scholarvista.readthedocs.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mciccale.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2024-02-10T10:25:21.000Z","updated_at":"2024-03-06T19:13:36.000Z","dependencies_parsed_at":"2024-03-08T10:15:39.309Z","dependency_job_id":null,"html_url":"https://github.com/mciccale/ScholarVista","commit_stats":null,"previous_names":["mciccale/text-extraction-project"],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mciccale%2FScholarVista","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mciccale%2FScholarVista/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub
/repositories/mciccale%2FScholarVista/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mciccale%2FScholarVista/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mciccale","download_url":"https://codeload.github.com/mciccale/ScholarVista/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247242607,"owners_count":20907128,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["keyword-cloud","keyword-extraction","machine-learning","python3","text-extraction"],"created_at":"2025-04-04T19:52:06.878Z","updated_at":"2025-04-04T19:52:07.442Z","avatar_url":"https://github.com/mciccale.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Documentation Status](https://readthedocs.org/projects/scholarvista/badge/?version=latest)](https://scholarvista.readthedocs.io/en/latest/?badge=latest)\n[![zenodo](https://zenodo.org/badge/DOI/10.5281/zenodo.10654761.svg)](https://zenodo.org/doi/10.5281/zenodo.10654760)\n![test workflow](https://github.com/mciccale/ScholarVista/actions/workflows/test.yml/badge.svg)\n![lint workflow](https://github.com/mciccale/ScholarVista/actions/workflows/lint.yml/badge.svg)\n\n# ScholarVista\n\n**ScholarVista** is a tool that extracts and plots information from a set of Academic Research Papers in PDF / TEI XML format. 
To process PDFs, it uses [Grobid](https://github.com/kermitt2/grobid/) to generate the TEI XML files; **ScholarVista** then extracts the relevant information from the TEI XML files and generates the following data:\n\n1. **Keyword Cloud** for each paper's abstract and for all abstracts combined.\n2. **Links List** with the links found in each paper.\n3. **Figures Histogram** comparing the number of figures per paper.\n\n## Table of Contents:\n\n- [Requirements](#requirements)\n- [Install ScholarVista](#install-scholarvista)\n- [Execution Instructions](#execution-instructions)\n- [License](#license)\n- [Where to Get Help](#where-to-get-help)\n\n## Requirements\n\n**Python \u003e=3.12** is required to install the **ScholarVista** package; it is not required when using the **Docker Image**.\n\nIf you want to generate the results from a set of PDF academic papers, you must ensure that the **Grobid Service** is installed and running on your machine. See the Grobid installation instructions [here](https://grobid.readthedocs.io/en/latest/Run-Grobid/).\n\nThe most straightforward way to start and run the **Grobid Service** is with a _Docker_ image. Make sure you have _Docker_ installed on your system.\n\n```bash\ndocker run --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:0.8.0\n```\n\nThis command will run **Grobid** and expose a web client on port 8070.\n\nIf you already have the TEI XML files generated by Grobid saved in a folder, you can generate the information directly from them.\n\n_Note: The TEI XML files **MUST** be obtained using Grobid, as this tool is intended to work only with Grobid-generated TEI XML files._\n\n\n## Install ScholarVista\n\n### From Source\n\nTo install **ScholarVista** from source, you can clone the repository and install the package using **_pip_**. When using **_pip_**, it is good practice to use virtual environments. 
Check out the official documentation on virtual environments [here](https://docs.python.org/3/library/venv.html).\n\n#### Conda\n\n```bash\ngit clone https://github.com/mciccale/ScholarVista\ncd ScholarVista\nconda create -n scholarvista-env-3.12 python=3.12\nconda activate scholarvista-env-3.12\npip install .\n```\n\n_Note: You can also use **PyEnv** to create a virtual environment, but since **ScholarVista** needs Python \u003e=3.12, **Conda** is more suitable because it lets you select the **Python** version to use._\n\n### Docker Container\n\nIf you prefer running **ScholarVista** from a Docker Container, you can build the Docker Image with the following commands.\n\n```bash\ngit clone https://github.com/mciccale/ScholarVista\ncd ScholarVista\ndocker build -t scholarvista-app .\n```\n\nThis will create an image called **scholarvista-app**.\n\n## Execution Instructions\n\n### From Source\n\n#### CLI Tool\n\nThe most convenient way of using **ScholarVista** is through its CLI.\n\nThe CLI tool generates and saves to a directory a **keyword cloud** of each paper's abstract and a **list of URLs** for each PDF analyzed, together with a **histogram** comparing the number of figures in each PDF and a general **keyword cloud** of all abstracts.\n\n```\nUsage: scholarvista [OPTIONS] COMMAND [ARGS]...\n\n  ScholarVista's CLI main entry point.\n\nOptions:\n  --input-dir PATH   Directory containing PDF files.  [required]\n  --output-dir PATH  Directory to save results. Defaults to current directory.\n  --help             Show this message and exit.\n\nCommands:\n  process-pdfs  Process all PDFs in the given directory.\n  process-xmls  Process all TEI XMLs in the given directory.\n```\n\n##### Example\n\n1. Start **Grobid** service using the container.\n\n```bash\ndocker run --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:0.8.0\n```\n\n2. 
Run **ScholarVista's** CLI to process all the PDFs in a given directory and leave the results in another directory.\n\n```bash\n# Process PDF files and save the results to a specified directory\nscholarvista --input-dir ./pdfs --output-dir ./output process-pdfs\n```\n\n#### Python Modules\n\n**ScholarVista** provides a set of classes and modules that let you leverage all its functionality from your Python code. For an example, see `example.py`.\n\n### Docker Container\n\nIf you prefer running **ScholarVista** with Docker, you can use the **ScholarVista** CLI directly from the Docker Image you created following [these instructions](#docker-container).\n\n1. Start **Grobid** service using the container.\n\n```bash\ndocker run --rm --init --ulimit core=0 -p 8070:8070 lfoppiano/grobid:0.8.0\n```\n\n2. Run **ScholarVista's** container with 2 mounted volumes for the input and output directories, connected to the host network.\n\n```bash\ndocker run -it --rm --network=host -v /path/to/input/dir:/input -v /path/to/output/dir:/output scholarvista-app\n```\n\n*Note: By default, ScholarVista's Docker Image processes PDF files; you can override this by providing the `process-xmls` argument after the image name.*\n\n#### Example\n\nHere's an example that processes a set of PDFs contained in the `foo` directory and leaves the results in `bar` using the Docker Image, assuming the **Grobid** service is running at `localhost:8070`.\n\n```bash\ndocker run -it --rm --network=host -v foo:/input -v bar:/output scholarvista-app process-pdfs\n```\n\n### Docker Compose (Experimental)\n\nYou can try to run **ScholarVista** through **Docker Compose**. However, this feature is still in development and may not work as expected. **ScholarVista** may try to connect to **Grobid** before it has started, in which case it is restarted until the **Grobid** service is up and running. 
You can try it as follows:\n\n#### sh-like Shells\n\n```bash\nINPUT_DIR=/path/to/input/dir OUTPUT_DIR=/path/to/output/dir COMMAND='process-pdfs' docker-compose up\n```\n\n#### PowerShell\n\n```powershell\n$env:INPUT_DIR=\"/path/to/input/dir\"; $env:OUTPUT_DIR=\"/path/to/output/dir\"; $env:COMMAND=\"process-pdfs\"; docker-compose up\n```\n\n_Note: The **COMMAND** variable can be either `process-pdfs` or `process-xmls`. The directories are the host machine directories from which the input files are read and in which the results are left, respectively._\n\n## License\n\nPlease refer to the `LICENSE` file.\n\n## Where to Get Help\n\nFor further assistance or to contribute to the project, please refer to the `CONTRIBUTING.md` file.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmciccale%2Fscholarvista","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmciccale%2Fscholarvista","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmciccale%2Fscholarvista/lists"}