{"id":13585630,"url":"https://github.com/recite/autosum","last_synced_at":"2025-12-30T01:48:37.261Z","repository":{"id":36034466,"uuid":"40331118","full_name":"recite/autosum","owner":"recite","description":"Summarize Publications Automatically","archived":false,"fork":false,"pushed_at":"2023-02-13T22:57:58.000Z","size":14135,"stargazers_count":36,"open_issues_count":0,"forks_count":10,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-05-01T20:27:01.057Z","etag":null,"topics":["arxiv","citation","google-scholar"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/recite.png","metadata":{"files":{"readme":"Readme.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"License.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"Citation.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":["soodoku"],"patreon":null,"open_collective":null,"ko_fi":null,"tidelift":null,"custom":null}},"created_at":"2015-08-06T23:19:35.000Z","updated_at":"2024-08-01T16:31:38.372Z","dependencies_parsed_at":"2024-08-01T16:31:37.461Z","dependency_job_id":"edd890fd-70fd-43a3-a450-9042b74a9fff","html_url":"https://github.com/recite/autosum","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/recite%2Fautosum","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/recite%2Fautosum/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/recite%2Fautosum/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/recite%2Fautosum/manif
ests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/recite","download_url":"https://codeload.github.com/recite/autosum/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247636387,"owners_count":20970922,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["arxiv","citation","google-scholar"],"created_at":"2024-08-01T15:05:03.100Z","updated_at":"2025-12-30T01:48:37.220Z","avatar_url":"https://github.com/recite.png","language":"Python","funding_links":["https://github.com/sponsors/soodoku"],"categories":["Python"],"sub_categories":[],"readme":"### AutoSum: Summarize Publications Automatically\n\nThe tool exploits the labor already expended by scholars in summarizing articles. It scrapes words next to citations across all openly available research citing a publication, and collates the output. The result is a summary, along with data in a format that makes potential miscitations easy to discover. \n\n--------------------\n\n#### Table of Contents\n\n* [Get the Data](#get-the-data)  \n  Scrapes all openly accessible research citing a particular publication using links provided by [Google Scholar](https://scholar.google.com).\n  **Note:** Google monitors scraping on Google Scholar. 
\n\n* [Parse the Data](#parse-the-data)  \n  Iterates through a directory of articles citing a particular research article and, using regular expressions, picks up sentences near a citation.\n\n* [Example from Social Science](#example-from-social-science)\n\n-----------------------\n\n#### Get the Data\n\nTo search for openly accessible pdfs citing the original research article on Google Scholar, use [Scholar.py](scripts/scholar.py). \n\n1. Input: URL to Google Scholar Page of an article.\n2. What the script does:\n   * Goes to 'Cited By..'\n   * Downloads a user-specified number of publicly available papers (pdfs only for now) that cite the paper to a user-specified directory. \n   * Creates a csv that tracks basic characteristics of each downloaded paper -- title, url, author names, journal, etc. It also records the relative path to the downloaded file.\n3. [Sample output](testout/einstein_search_200.csv)\n\n##### Usage\n\n```\nusage: scholar.py [-h] [-u USER] [-p PASSWORD] [-a AUTHOR] [-d DIR]\n                  [-o OUTPUT] [-n N_CITES] [-v] [--version]\n                  keyword [keyword ...]\n\npositional arguments:\n  keyword               Keyword to be searched\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -u USER, --user USER  Google account e-mail\n  -p PASSWORD, --password PASSWORD\n                        Google account password\n  -a AUTHOR, --author AUTHOR\n                        Author to be filtered\n  -d DIR, --dir DIR     Output directory for PDF files\n  -o OUTPUT, --output OUTPUT\n                        CSV output filename\n  -n N_CITES, --n-cites N_CITES\n                        Number of cites to be download\n  -v, --verbose\n  --version             show program's version number and exit\n```\n\n**Example**  \n```\npython scholar.py -v -d pdfs -o output.csv -n 100 -a \"A Einstein\" \\\n\"Can quantum-mechanical description of physical reality be considered 
complete?\"\n```\n\n-----------------------\n\n#### Parse the Data \n\nTo scrape the text next to the relevant citations within the pdfs, use [autosumpdf.py](scripts/autosumpdf.py):\n\n1. The script iterates through the pdfs using the csv generated above. \n2. Using citation information or a custom regexp, it extracts the text and writes it to the same csv. If multiple regexes match, everything is concatenated with a line space.\n3. [Sample output](testout/einstein_cites_100.csv)\n\n```\nusage: searchpdf.py [-h] [-i INPUT] [-o OUTPUT] [-v] [--version]\n                    regex [regex ...]\n\noptional arguments:\n   -h, --help            show this help message and exit\n  -i INPUT, --input INPUT\n                        CSV input filename\n  -o OUTPUT, --output OUTPUT\n                        CSV output filename\n  -t TXT_DIR, --text TXT_DIR\n                        extract to specific directory\n  -f, --force           force extract text file if exists\n  -v, --verbose\n  -a1 AUTHOR1, --author-1-lastname AUTHOR1\n                        1st author of citation\n  -a2 AUTHOR2, --author-2-lastname AUTHOR2\n                        2nd author of citation\n  -y YEAR, --year YEAR  Year of publication\n  --version             show program's version number and exit\n  -r REGEX, --regex REGEX\n                        specify custom regex to filter citations.\n```\n\n**Example**  \n```\npython searchpdf.py -v -i output.csv -o search-output.csv -r \"\\.\\s(.{5,100}[\\[\\(]?Einstein.{2,30}\\d+[\\]\\)])\"\n```\n\nThe custom regular expression (-r switch) matches a sentence (5 to 100 chars) followed by the author name \"Einstein\", any text (2 to 30 chars), and a number with a closing bracket at the end.\n\nDepending on the command line arguments (-a1, -a2, -y), the following citation patterns are used automatically to find matching sentences:\n* Author1_Last_Name Year\n* Author1_Last_Name et al.\n* Author1_Last_Name et al. 
Year\n* Author1_Last_Name et al., Year\n* Author1_Last_Name and Author2_Last_Name\n* Author1_Last_Name and Author2_Last_Name Year\n* Author1_Last_Name, and Author2_Last_Name Year\n* Author1_Last_Name and Author2_Last_Name, Year\n* Author1_Last_Name \u0026 Author2_Last_Name Year\n* Author1_Last_Name \u0026 Author2_Last_Name, Year\n\n-----------------------\n\n#### Example from Social Science\n\n* [What to search for?](social_science_citations.md)\n  * **Example with Google Scholar**  \n    Download 500 articles from Google Scholar:\n    ```\n    python scholar.py -v -d pdfs -o iyengar-output.csv -n 500 -a \"S Iyengar\" \"Is anyone responsible?: How television frames political issues.\"\n    ```\n\n* **Searching in the Test Data**\n  * [Sample input data](testdat/)\n  * Use [autosumpdf.py](scripts/autosumpdf.py) to filter citations to Iyengar et al. 2012:\n    ```\n    python autosumpdf.py -v -i testdata.csv -o search-testdata-new.csv -a1 \"Iyengar\" -y \"2012\"\n    ```\n\n\n* **Miscitations**    \n  Social scientists hold that few truths are self-evident. But some truths become obvious to all social scientists after some years of experience, including: a) [Peer review is a mess](http://gbytes.gsood.com/2015/07/24/reviewing-the-peer-review-with-reviews-as-data/), b) Faculty hiring is idiosyncratic, and c) Research is often miscited. Here we quantify the last portion.  \n\n#### License\n\nReleased under the [MIT License](License.md)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frecite%2Fautosum","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frecite%2Fautosum","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frecite%2Fautosum/lists"}