https://github.com/alisonmitchell/document-metadata-extract
An application that imports a list of PDF and Excel documents hosted on a website, opens each one, extracts its metadata and outputs the results to identify missing metadata fields.
- Host: GitHub
- URL: https://github.com/alisonmitchell/document-metadata-extract
- Owner: alisonmitchell
- Created: 2021-01-20T20:17:26.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2021-01-20T20:57:57.000Z (over 4 years ago)
- Last Synced: 2025-01-15T14:00:39.159Z (9 months ago)
- Topics: openpyxl, pypdf2, python, requests
- Language: Python
- Homepage:
- Size: 4.88 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Document Metadata Extract
## Project description
The challenge was to identify which of several hundred documents on a public website had incomplete metadata fields.
The solution was to automate the process with a Python application that reads the dataset from a CSV file, iterates over the URLs of the PDF and Excel documents, opens each document, extracts its metadata and writes the results to a spreadsheet so that missing fields can be identified.
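The workflow described above could be sketched roughly as follows. This is an illustration only, not the code in the repository: the `url` column name, the function names, the report layout and the `extract` callback are all assumptions made for the example. `requests` and `openpyxl` are imported inside the functions that use them so that the CSV and report logic can be tried without network access.

```python
import csv
import io
from typing import Callable, Dict, Iterable, List, Optional

# Fields to check, taken from the project description.
FIELDS = ("Author", "Title", "Subject")


def read_urls(csv_text: str, column: str = "url") -> List[str]:
    """Return the non-empty values of the URL column (column name is assumed)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row.get(column) or "").strip()
            for row in reader
            if (row.get(column) or "").strip()]


def fetch(url: str) -> bytes:
    """Download one document and return its raw bytes."""
    import requests  # imported lazily so the rest of the sketch runs offline

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.content


def build_report(
    urls: Iterable[str],
    extract: Callable[[str, bytes], Dict[str, Optional[str]]],
    download: Callable[[str], bytes] = fetch,
) -> List[Dict[str, str]]:
    """Build one report row per URL, listing which fields are empty.

    `extract` is a hypothetical callback that takes (url, data) and
    returns a dict keyed by the names in FIELDS.
    """
    rows = []
    for url in urls:
        try:
            metadata = extract(url, download(url))
            error = ""
        except Exception as exc:  # keep going if a single document fails
            metadata, error = {}, str(exc)
        row = {"url": url}
        for field in FIELDS:
            row[field] = metadata.get(field) or ""
        row["missing"] = ", ".join(f for f in FIELDS if not row[f])
        row["error"] = error
        rows.append(row)
    return rows


def write_report(rows: List[Dict[str, str]], path: str) -> None:
    """Write the report rows to an Excel workbook."""
    from openpyxl import Workbook

    header = ["url", *FIELDS, "missing", "error"]
    workbook = Workbook()
    sheet = workbook.active
    sheet.append(header)
    for row in rows:
        sheet.append([row[name] for name in header])
    workbook.save(path)
```

Passing the extractor and the downloader in as arguments keeps the loop independent of any particular file format and makes it possible to exercise the report logic with stand-in functions.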
Populating document fields such as Author, Title (equivalent to a web page title tag) and Subject (equivalent to a web page meta description) assists with Search Engine Optimisation: search engines crawl this metadata to create listings for documents and to help determine their search rankings.
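As a rough sketch of how these three fields might be read and checked, assuming a recent PyPDF2 release that provides `PdfReader` and its `metadata` property (older releases use `PdfFileReader` and `getDocumentInfo()` instead) and openpyxl's workbook `properties`, where the author is stored as `creator`. The function names are hypothetical and are not taken from the repository.

```python
from io import BytesIO
from typing import Dict, List, Optional

FIELDS = ("Author", "Title", "Subject")


def pdf_metadata(data: bytes) -> Dict[str, Optional[str]]:
    """Read Author, Title and Subject from the bytes of a PDF document."""
    from PyPDF2 import PdfReader  # assumption: PyPDF2 with the PdfReader API

    info = PdfReader(BytesIO(data)).metadata
    if info is None:  # the PDF has no document information at all
        return dict.fromkeys(FIELDS)
    return {"Author": info.author, "Title": info.title, "Subject": info.subject}


def excel_metadata(data: bytes) -> Dict[str, Optional[str]]:
    """Read Author, Title and Subject from the bytes of an .xlsx workbook."""
    from openpyxl import load_workbook

    props = load_workbook(BytesIO(data)).properties
    # openpyxl exposes the author as the `creator` property.
    return {"Author": props.creator, "Title": props.title, "Subject": props.subject}


def missing_fields(metadata: Dict[str, Optional[str]]) -> List[str]:
    """Return the names of fields that are absent, None or blank."""
    return [f for f in FIELDS if not (metadata.get(f) or "").strip()]
```

For example, `missing_fields({"Author": "Jane", "Title": "  ", "Subject": None})` returns `["Title", "Subject"]`, because a whitespace-only value is treated as missing.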
## Data source
A test dataset was created for the purpose of testing the application.
## Requirements
* Python 3.8.x
* requests: a Python HTTP library
* PyPDF2: a Python library built as a PDF toolkit
* openpyxl: a Python library to read/write Excel files