https://github.com/dongju93/extract-ti-from-reports
Convert PDFs to text, then transform that text into structured JSON objects for Threat Intelligence.
- Host: GitHub
- URL: https://github.com/dongju93/extract-ti-from-reports
- Owner: dongju93
- Created: 2023-08-22T02:11:02.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-03-24T09:18:08.000Z (about 1 year ago)
- Last Synced: 2025-01-07T12:44:38.450Z (5 months ago)
- Topics: json, jupyter-notebook, pdf, pdf-to-text, python, regex, text-to-json, threat-intelligence
- Language: Jupyter Notebook
- Homepage:
- Size: 134 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
pdf_to_text
-
Uses the [pdfminer.six](https://github.com/pdfminer/pdfminer.six) library to convert .PDF files to .TXT files.
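
A minimal sketch of what this step could look like, assuming pdfminer.six is already installed with the command given in this section. The function names `txt_path_for` and `pdf_to_text` and the directory layout are hypothetical and are not taken from the repository's notebooks; only `pdfminer.high_level.extract_text` is part of the library.

```python
from pathlib import Path


def txt_path_for(pdf_path, out_dir):
    """Return the .txt path that mirrors a .pdf file name (hypothetical helper)."""
    return Path(out_dir) / (Path(pdf_path).stem + ".txt")


def pdf_to_text(pdf_path, out_dir):
    """Extract the text of one PDF and write it next to the other outputs."""
    # Imported lazily so the path helper above works without pdfminer.six.
    from pdfminer.high_level import extract_text

    text = extract_text(str(pdf_path))
    out_path = txt_path_for(pdf_path, out_dir)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(text, encoding="utf-8")
    return out_path
```

Calling `pdf_to_text("reports/sample.pdf", "text")` would write `text/sample.txt`, assuming a file with that name exists.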
```
pip install pdfminer.six
```

text_to_json_ti
-
Converts .TXT to .JSON, using regular expressions to split the text into JSON items for a set of predetermined fields.
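
The regex-based split can be sketched as follows. The field names (`url`, `md5`, `filename`) and the patterns are illustrative assumptions, not the fields or expressions the repository actually uses.

```python
import json
import re

# Hypothetical field patterns; the real project defines its own set of fields.
FIELD_PATTERNS = {
    "url": re.compile(r"https?://[^\s\"'<>]+"),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "filename": re.compile(r"\b[\w.-]+\.(?:exe|dll|doc|docx|zip)\b", re.IGNORECASE),
}


def text_to_json(text):
    """Collect the matches for each field into a JSON-serializable dict."""
    record = {}
    for field, pattern in FIELD_PATTERNS.items():
        # dict.fromkeys removes duplicates while keeping first-seen order.
        record[field] = list(dict.fromkeys(pattern.findall(text)))
    return record


sample = "Dropper invoice.exe fetched http://bad.example/payload.zip"
print(json.dumps(text_to_json(sample), indent=2))
```

Keeping the patterns in a single dictionary means a new field can be added without touching the extraction loop.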
URL and Filename items are extracted together with any false positives (values that are not malicious); those false positives are collected into a whitelist array that is used for filtering.

field_to_excel
-
Reads the data for specific fields from the .JSON files, loads it into a DataFrame, and saves it to an .XLSX file, with statistics as needed.
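
A rough sketch of this step, assuming each .JSON file holds one object whose fields are either a single value or a list of values, and that pandas plus an Excel engine such as openpyxl are installed. The function names and the choice of a per-value count as the "statistics" are assumptions made for illustration.

```python
import json
from collections import Counter
from pathlib import Path


def collect_field(json_dir, field):
    """Gather every value of one field across the .json files in a directory."""
    values = []
    for path in sorted(Path(json_dir).glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        item = record.get(field, [])
        values.extend(item if isinstance(item, list) else [item])
    return values


def field_to_excel(json_dir, field, xlsx_path):
    """Count how often each value occurs and save the table as .xlsx."""
    # Imported lazily: pandas is only needed for the Excel export.
    import pandas as pd

    counts = Counter(collect_field(json_dir, field))
    df = pd.DataFrame(counts.most_common(), columns=[field, "count"])
    df.to_excel(xlsx_path, index=False)
    return df
```

For example, `field_to_excel("json", "url", "url_stats.xlsx")` would write one row per distinct URL with its occurrence count.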