https://github.com/alisonmitchell/document-metadata-extract

An application that imports a list of PDF and Excel documents hosted on a website, opens each one, extracts its metadata and outputs the results so that missing metadata fields can be identified.

Host: GitHub
URL: https://github.com/alisonmitchell/document-metadata-extract
Owner: alisonmitchell
Created: 2021-01-20T20:17:26.000Z (over 4 years ago)
Default Branch: main
Last Pushed: 2021-01-20T20:57:57.000Z (over 4 years ago)
Last Synced: 2025-01-15T14:00:39.159Z (9 months ago)
Topics: openpyxl, pypdf2, python, requests
Language: Python
Homepage:
Size: 4.88 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Document Metadata Extract

## Project description

The challenge was to identify which of several hundred documents on a public website had incomplete metadata fields.

The solution was to automate the process with a Python application that reads in the dataset as a CSV file, iterates over the URLs of PDF and Excel documents, opens each document, extracts its metadata and writes the results to a spreadsheet so that missing fields can be identified.
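The steps described above could be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the `url` column name, the function names and the `.xlsx`/`.xlsm` extension check are assumptions, and the PDF branch assumes PyPDF2 2.0 or later (older releases spell the same call `PdfFileReader(...).getDocumentInfo()`).

```python
import csv
import io
from urllib.parse import urlparse

FIELDS = ("Author", "Title", "Subject")


def read_urls(csv_text, column="url"):
    """Return the non-empty values of the URL column (assumed name) from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[column].strip() for row in reader if (row.get(column) or "").strip()]


def document_kind(url):
    """Classify a URL as 'pdf', 'excel' or None from its file extension."""
    path = urlparse(url).path.lower()
    if path.endswith(".pdf"):
        return "pdf"
    if path.endswith((".xlsx", ".xlsm")):
        return "excel"
    return None


def extract_metadata(url, timeout=30):
    """Download one document and return its Author, Title and Subject."""
    # Third-party imports are kept local so the helpers above run without them.
    import requests
    from openpyxl import load_workbook
    from PyPDF2 import PdfReader

    kind = document_kind(url)
    if kind is None:
        raise ValueError(f"Unsupported document type: {url}")

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    stream = io.BytesIO(response.content)

    if kind == "pdf":
        # metadata is None or empty when the PDF carries no information dictionary
        info = PdfReader(stream).metadata
        values = (info.author, info.title, info.subject) if info else (None, None, None)
    else:
        # openpyxl exposes the workbook author as the "creator" property
        props = load_workbook(stream, read_only=True).properties
        values = (props.creator, props.title, props.subject)
    return dict(zip(FIELDS, values))
```

With these helpers, a loop such as `for url in read_urls(text): row = extract_metadata(url)` would collect one dictionary per document, ready to be written out to a spreadsheet.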

Populating document fields such as Author, Title (equivalent to a web page title tag) and Subject (equivalent to a web page meta description) assists with Search Engine Optimisation by providing metadata for search engines to crawl and create listings for documents, and to determine their search rankings.
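As an illustration of how incomplete fields might be flagged once the metadata has been extracted, the sketch below treats a field as missing when it is absent, `None` or blank. The function name `missing_fields` and the dictionary layout are assumptions made for this example, not taken from the repository.

```python
FIELDS = ("Author", "Title", "Subject")


def missing_fields(metadata, fields=FIELDS):
    """Return the names of fields that are absent, None or whitespace-only."""
    return [
        name for name in fields
        if not str(metadata.get(name) or "").strip()
    ]


# Hypothetical record for one document
record = {"Author": "A. Writer", "Title": "  ", "Subject": None}
print(missing_fields(record))  # ['Title', 'Subject']
```

Writing this list into an extra column of the output spreadsheet would make documents with incomplete metadata easy to filter.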

## Data source

A test dataset was created for the purpose of testing the algorithm.

## Requirements

* Python 3.8.x
* requests: a Python HTTP library
* PyPDF2: a Python library built as a PDF toolkit
* openpyxl: a Python library to read/write Excel files
