https://github.com/ortanav2/data-scraper
A data-scraper that makes it possible to filter out the most important information from huge amounts of text based data.
- Host: GitHub
- URL: https://github.com/ortanav2/data-scraper
- Owner: ortanaV2
- Created: 2023-10-30T13:13:16.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-11-01T22:54:46.000Z (over 1 year ago)
- Last Synced: 2025-03-17T22:08:01.473Z (3 months ago)
- Topics: data, data-scraper, file, file-scraper, scraper, search, searching
- Language: Python
- Homepage:
- Size: 6.84 KB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Data-Scraper
> A data-scraper that makes it possible to filter out the most important information from huge amounts of text-based data.

The script asks for a ***keyword*** to search for and compares it with each ***file name*** and its ***contents***. As soon as the keyword is found in either one, the file is listed as a match, and all matches are output at the end.
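
The matching rule described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the function name `is_match`, the sample data, and the case-insensitive comparison are all assumptions.

```python
def is_match(keyword: str, file_name: str, contents: str) -> bool:
    """Return True if the keyword occurs in the file name or in the contents."""
    needle = keyword.lower()  # assumption: the comparison ignores case
    return needle in file_name.lower() or needle in contents.lower()


# Hypothetical sample data: (file name, contents) pairs.
samples = [
    ("invoice_2023.txt", "Total due: 40 EUR"),
    ("notes.txt", "Remember to send the invoice."),
    ("todo.txt", "Buy milk"),
]

# Collect every match first and output the list at the end.
matches = [name for name, text in samples if is_match("invoice", name, text)]
print(matches)
```

The first file matches on its name and the second on its contents, so both are listed; the third matches neither way and is skipped.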
## File Content Read
> The scraper can read only the following text-based file types:
- .docx
- .txt
## Usage
The scraper searches the `./DATA` **directory** by default. To change that, edit the **variable** `directory` on _line 9_: `directory = "./DATA"`
> [!NOTE]
> It iterates through every file in the directory. To speed up the process, it is recommended to limit the number of files.
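
As a rough sketch of the directory scan and of one way to limit the number of files, the snippet below walks a folder and checks each `.txt` file. It is not the script's own code: the function `find_txt_matches` and its `max_files` parameter are hypothetical, it assumes only the top level of the directory is scanned, and it leaves out `.docx` handling to stay within the standard library.

```python
import os
from typing import List, Optional

directory = "./DATA"  # same default value as in the script


def find_txt_matches(keyword: str, root: str,
                     max_files: Optional[int] = None) -> List[str]:
    """Return names of .txt files in root whose name or contents contain keyword."""
    needle = keyword.lower()  # assumption: case-insensitive search
    matches = []
    checked = 0
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        # Skip subdirectories and anything that is not a .txt file.
        if not os.path.isfile(path) or not name.lower().endswith(".txt"):
            continue
        # Hypothetical cap: stop once max_files files have been checked.
        if max_files is not None and checked >= max_files:
            break
        checked += 1
        with open(path, encoding="utf-8", errors="ignore") as handle:
            contents = handle.read()
        if needle in name.lower() or needle in contents.lower():
            matches.append(name)
    return matches


if os.path.isdir(directory):
    print(find_txt_matches("invoice", directory, max_files=100))
```

Capping the number of checked files trades completeness for speed; moving only the relevant files into a smaller directory achieves the same effect without changing any code.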
## Requirements
> How to install the required libraries.
```
pip install pdfplumber
```
```
pip install docx
```

## Improving
Suggestions for improvements are welcome.