https://github.com/shamim-akhtar/extract-pdf-text-images
A sample code to extract the images and text from a PDF file.
https://github.com/shamim-akhtar/extract-pdf-text-images
Last synced: about 2 months ago
JSON representation
A sample code to extract the images and text from a PDF file.
- Host: GitHub
- URL: https://github.com/shamim-akhtar/extract-pdf-text-images
- Owner: shamim-akhtar
- License: mit
- Created: 2023-04-01T16:09:40.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-03-29T14:28:14.000Z (about 1 year ago)
- Last Synced: 2025-01-17T22:25:32.053Z (3 months ago)
- Language: Jupyter Notebook
- Size: 1.34 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Extract Text and Images from a PDF File
## Code Explanation
The code is a Python script (Jupyter Notebook) that extracts text and images from a PDF file. It imports three libraries:
* fitz,
* os, and
* PyPDF2.It then sets the input and output paths, opens the PDF file and reads it using PyPDF2, and extracts the text from the first page of the PDF. The extracted text is saved to a file in the specified output folder. The script then opens the PDF file again using fitz, gets a list of images on the first page of the PDF, saves each image to a file in the specified output folder, and prints the number of images detected on the first page of the PDF.
To use the code, you must replace the **input_path** variable with the PDF file path you want to extract text and images. You should also set the **output_path** variable to the folder where you want the output files to be saved. After running the Python script, it will extract the text and images from the PDF and save them to files in the specified output folder.
## Usage Instructions
### To use the code, follow these instructions:
1. Install the required libraries: fitz, os, and PyPDF2. You can install them using pip in your command prompt or terminal.
2. Save the code as a Python script in a folder on your computer.
3. Replace the input_path variable with the PDF file path you want to extract text and images.
4. Set the output_path variable to the folder where you want the output files to be saved.
5. Run the script using Python. You can do this by navigating to the folder where the script is saved in your command prompt or terminal and running python scriptname.py. > Replace scriptname.py with the name of the script you saved in step 2.
6. Check the output folder to verify the text and images were extracted successfully.**Note** that the code is only designed to extract text and images from the first page of the PDF. To extract text and images from other pages, you must modify the code accordingly.