{"id":20070298,"url":"https://github.com/kallewesterling/process-entertainment-archive","last_synced_at":"2026-05-03T07:43:03.389Z","repository":{"id":169404371,"uuid":"154404406","full_name":"kallewesterling/process-entertainment-archive","owner":"kallewesterling","description":"A python script to traverse through HTML files with ProQuest results to generate an easily navigable CSV file (and Pandas DataFrame).","archived":false,"fork":false,"pushed_at":"2021-03-05T15:21:47.000Z","size":17,"stargazers_count":2,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-06-02T00:24:47.475Z","etag":null,"topics":["csv","html-files","pandas","pandas-dataframe","proquest","python-script"],"latest_commit_sha":null,"homepage":"http://www.westerling.nu/digital-projects","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kallewesterling.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-10-23T22:23:03.000Z","updated_at":"2023-02-13T14:55:43.000Z","dependencies_parsed_at":null,"dependency_job_id":"faeb04ae-4b65-4672-8f4c-37046dff3ece","html_url":"https://github.com/kallewesterling/process-entertainment-archive","commit_stats":null,"previous_names":["kallewesterling/process-entertainment-archive"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/kallewesterling/process-entertainment-archive","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kallewesterling%2Fprocess-entertainment-archive","tags_url":"https://repos.ecosyste.ms/
api/v1/hosts/GitHub/repositories/kallewesterling%2Fprocess-entertainment-archive/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kallewesterling%2Fprocess-entertainment-archive/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kallewesterling%2Fprocess-entertainment-archive/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kallewesterling","download_url":"https://codeload.github.com/kallewesterling/process-entertainment-archive/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kallewesterling%2Fprocess-entertainment-archive/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32562118,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-03T06:36:36.687Z","status":"ssl_error","status_checked_at":"2026-05-03T06:36:09.306Z","response_time":103,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["csv","html-files","pandas","pandas-dataframe","proquest","python-script"],"created_at":"2024-11-13T14:21:46.202Z","updated_at":"2026-05-03T07:43:03.371Z","avatar_url":"https://github.com/kallewesterling.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Process ProQuest Entertainment Archive files\n\nA python script to traverse through HTML files with ProQuest results to generate 
an easily navigable CSV file (and Pandas DataFrame).\n\n\n## How to Install\n\nThis package requires two other packages to run: `pandas` and `BeautifulSoup`. Install them by running these two commands in your command line:\n\n```sh\npip install pandas\n```\n\n```sh\npip install beautifulsoup4\n```\n\nDrop the `ProQuestResult.py` file into your project folder. Then add the following import to your project, whether it is a Python file or a Jupyter Notebook:\n\n```python\nfrom ProQuestResult import *\n```\n\n\n## Set Up the Program\n\nThe program allows you to define two optional settings. Open `ProQuestResult.py` and find the two lines that define the variables `STOPFILES` and `CACHE_RAW_IN_OBJECT`.\n\n`STOPFILES` needs to be a list of strings. It determines which file names the program will skip when reading a directory. By default it contains only one element, Mac OS X's annoyingly present `.DS_Store` files:\n\n```python\nSTOPFILES = ['.DS_Store']\n```\n\n`CACHE_RAW_IN_OBJECT` needs to be a boolean. It determines whether each `ProQuestResult` will contain an instance variable (`ProQuestResult._raw`) that holds the raw HTML from its file. By default, this variable is set to `False` in order to save memory. Switch it to `True` if you need to access the HTML from your search result file.\n\n\n## How to Run\n\nYou have two options when creating an object containing your search results: `ProQuestResult` (1) and `ProQuestResults` (2). The subtle difference is in the plural.\n\n\n### (1) ProQuestResult\n\nIf you have a single HTML file with ProQuest search results, this is the object you want to invoke. 
It provides a list of dictionaries (`ProQuestResult.results`) and a DataFrame object (`ProQuestResult.df`) with all the details for the search results.\n\n\n#### Setting up the object\n\nTo set up an object, pass it the path to your file through the `file` parameter:\n\n```python\nparsed_results = ProQuestResult(file = './my_search_results/the_file_with_results.html')\n```\n\nThe `file` parameter should be a string but can also be a PosixPath (see [pathlib's documentation for reference](https://docs.python.org/3/library/pathlib.html)).\n\n\n#### Accessing search results\n\nOnce the object has been set up, you can easily access the search results as a list of dictionaries:\n\n```python\nprint(parsed_results.results)\n```\n\nIf you'd rather see the search results as a pandas DataFrame, you can do so by calling:\n\n```python\nparsed_results.df\n```\n\nThis also provides an easy way to export the DataFrame to a CSV, by calling:\n\n```python\nparsed_results.df.to_csv('xxx.csv')\n```\n\n*Note: The instance variables `results` and `df` are both generated on demand when you access them. That means that the script, depending on the number of search results in each file, can take some time to run.*\n\nThe object also gives you easy access to the search query as a string:\n\n```python\nprint(parsed_results.query)\n```\n\nIf you call `len()` on the object, it returns the number of search results in the file:\n\n```python\nlen(parsed_results)\n```\n\n\n### (2) ProQuestResults\n\nIf you have a directory or a list of files containing search results from ProQuest and you want to collect all of them in one object, use `ProQuestResults` instead of `ProQuestResult`.\n\n\n#### Setting up the object\n\nThe program is flexible and can ingest a number of variations through the two parameters it accepts: `files` and `directory`.\n\n**`files`** needs to be provided as a list of file names as strings (or PosixPaths). 
For example:\n\n```python\nparsed_results = ProQuestResults(files = ['./first_file.html', './second_file.html', './third_file.html', './fourth_file.html'])\n```\n\n**`directory`** can be provided as either *(i)* a string (or a PosixPath) with a path to a directory containing the search result files you want to work with, or *(ii)* a list of strings (or PosixPaths) that refer to any number of directories containing search result files.\n\n*(i)* For example, if you work with a single directory, you would call:\n\n```python\nparsed_results = ProQuestResults(directory = './my_search_results/')\n```\n\n*(ii)* If you have a number of directories you need to summarize in one object, you would call the same object but set it up with a list of directories:\n\n```python\nparsed_results = ProQuestResults(directory = ['./my_first_search_result_directory/', './my_second_search_result_directory/'])\n```\n\n\n#### Accessing search results\n\nOnce the object has been set up, you can easily access the search results in the same manner as the examples under `ProQuestResult` above:\n\nTo access all the search results as a list of dictionaries:\n```python\nprint(parsed_results.results)\n```\n\nTo access all the search results as a DataFrame:\n```python\nparsed_results.df\n```\n\n*Note: As is the case with `ProQuestResult`, the instance variables `results` and `df` are both generated on demand when you access them. That means that the script, depending on the number of search results in each file, can take some time to run.*\n\n\n#### Accessing queries for the search result files and vice versa\n\nSince the `ProQuestResults` object is built from several files, each of which contains *one* search query, there are two ways to access search query information. 
The program can provide the search query for each file (through requesting `ProQuestResults.files_to_query`) and a list of the files that contain each search query (through requesting `ProQuestResults.query_to_files`).\n\n**`files_to_query`** is accessible as a native Python dictionary of the key-value structure `{Path(file): 'search term'}`:\n\n```python\ndict_object_with_files_to_query = parsed_results.files_to_query\n```\n\n**`query_to_files`** is accessible in the same way as a native Python dictionary but with the inverse key-value structure `{'search term': [Path(file), ...]}`:\n\n```python\ndict_object_with_query_to_files = parsed_results.query_to_files\n```\n\nSince both of these attributes provide you with a native dictionary, you can use any of the native functions built in to the dictionary type with these results, such as *key lookup*:\n\n```python\nfrom pathlib import Path\n\nfile = Path('./my_search_results/the_file_with_results.html')\ndict_object_with_files_to_query[file]\n```\n\nYou can also iterate through the results through the dictionary type's native method `items()`:\n\n```python\nfor search_term, list_of_files in dict_object_with_query_to_files.items():\n    print(\"The search term\", search_term, \"was used to generate these files:\", list_of_files)\n\nfor file, search_term in dict_object_with_files_to_query.items():\n    print(\"The file\", file, \"was generated from this search term:\", search_term)\n```\n\n\n## Future features\n\nNo future features are planned. If you would like to request a feature, feel free to do so by opening [an Issue on GitHub](https://github.com/kallewesterling/process-entertainment-archive/issues).","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkallewesterling%2Fprocess-entertainment-archive","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkallewesterling%2Fprocess-entertainment-archive","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkallewesterling%2Fprocess-entertainment-archive/lists"}