https://github.com/ortanav2/data-scraper

A data-scraper that makes it possible to filter out the most important information from huge amounts of text based data.

data data-scraper file file-scraper scraper search searching

Last synced: 3 months ago

Host: GitHub
URL: https://github.com/ortanav2/data-scraper
Owner: ortanaV2
Created: 2023-10-30T13:13:16.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2023-11-01T22:54:46.000Z (over 1 year ago)
Last Synced: 2025-03-17T22:08:01.473Z (3 months ago)
Topics: data, data-scraper, file, file-scraper, scraper, search, searching
Language: Python
Homepage:
Size: 6.84 KB
Stars: 3
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Data-Scraper
> A data scraper that filters the most important information out of large amounts of text-based data.

The script prompts for a ***keyword*** and compares it with each file's ***file-name*** and ***contents***. Every file in which the keyword is found is recorded as a match, and all matches are printed at the end.
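
The matching step described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the function names `is_match` and `find_matches` are hypothetical, and the case-insensitive comparison is an assumption.

```python
def is_match(keyword: str, file_name: str, contents: str) -> bool:
    """Return True if the keyword occurs in the file name or in the contents."""
    needle = keyword.lower()  # assumption: the comparison ignores case
    return needle in file_name.lower() or needle in contents.lower()


def find_matches(keyword: str, files: dict[str, str]) -> list[str]:
    """Collect the names of all matching files so they can be listed at the end.

    `files` maps a file name to its already extracted text (hypothetical shape).
    """
    return [name for name, text in files.items() if is_match(keyword, name, text)]


# Usage with in-memory sample data instead of real files
sample = {
    "annual_report.txt": "",
    "notes.txt": "see the report from last year",
    "todo.txt": "buy milk",
}
for name in find_matches("report", sample):
    print("Match:", name)
```

Collecting the matches in a list first keeps the output in one place, which mirrors the "output at the end" behaviour the README describes.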
## File Content Read
> The scraper can read only the following text-based file types:
- .docx
- .pdf
- .txt
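
Reading these three file types could look roughly like the sketch below. It assumes that `pdfplumber` handles `.pdf` files and that the `docx` module comes from the `python-docx` package. The helper name `read_text` is hypothetical, and the repository's own code may differ.

```python
from pathlib import Path


def read_text(path: Path) -> str:
    """Return the text of a .txt, .pdf, or .docx file, or "" for other types."""
    suffix = path.suffix.lower()
    if suffix == ".txt":
        # assumption: UTF-8 text; undecodable bytes are skipped
        return path.read_text(encoding="utf-8", errors="ignore")
    if suffix == ".pdf":
        import pdfplumber  # imported lazily so .txt files work without it

        with pdfplumber.open(path) as pdf:
            # extract_text() can return None for pages without a text layer
            return "\n".join(page.extract_text() or "" for page in pdf.pages)
    if suffix == ".docx":
        import docx  # assumption: provided by the python-docx package

        document = docx.Document(str(path))
        return "\n".join(paragraph.text for paragraph in document.paragraphs)
    return ""  # unsupported file type
```

Returning an empty string for unsupported types means such files can still match on their file name without raising an error.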
## Usage
The scraper searches the `./DATA` **directory** by default. To search a different location, edit the **variable** `directory`.

_Line 9_: `directory = "./DATA"`
> [!NOTE]
> It iterates through every file in the directory. To speed up the process, it is recommended to limit the number of files.
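
One way to limit the number of files, sketched under a few assumptions: the helper `candidate_files` and its `max_files` parameter are hypothetical and not part of the script, and the search is assumed to cover only the top level of the directory, not subdirectories.

```python
from itertools import islice
from pathlib import Path
from typing import Iterator, Optional

directory = "./DATA"  # same default as the script's `directory` variable
SUPPORTED = {".docx", ".pdf", ".txt"}


def candidate_files(directory: str, max_files: Optional[int] = None) -> Iterator[Path]:
    """Yield supported files in `directory`, stopping after `max_files` if given."""
    root = Path(directory)
    files = (
        path
        for path in sorted(root.iterdir())  # sorted for a stable order
        if path.is_file() and path.suffix.lower() in SUPPORTED
    )
    # islice with None as the limit yields every file
    return islice(files, max_files)
```

For example, `candidate_files(directory, max_files=100)` would skip unsupported file types and stop after the first 100 supported files.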
## Requirements
> How to install the required libraries.
```
pip install pdfplumber
```
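```
pip install python-docx
```
> [!NOTE]
> On Python 3, the `docx` module is provided by the `python-docx` package. The package named `docx` on PyPI is an older, separate project.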

## Improving
Suggestions for improvements are welcome.
