https://github.com/ythecombinator/spyck

Extensible framework for data mining

Host: GitHub
URL: https://github.com/ythecombinator/spyck
Owner: ythecombinator
License: mit
Created: 2016-04-13T21:02:35.000Z (about 9 years ago)
Default Branch: master
Last Pushed: 2016-04-24T22:03:18.000Z (about 9 years ago)
Last Synced: 2025-01-18T10:28:45.732Z (5 months ago)
Language: Python
Homepage: http://zetaresearch.github.io/projects/spyck
Size: 304 KB
Stars: 0
Watchers: 3
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.md

An extensible framework for data mining.

## Table of Contents

- [Purpose](#purpose)
- [Concepts](#concepts)
- [Requirements](#requirements)
- [Other Resources](#other-resources)
- [Roadmap](#roadmap)
- [Contributing](#contributing)
- [History](#history)
- [License](#license)

## Purpose

*spyck* is a framework that aims to make it easy to develop crawlers and
integrate the data they collect, regardless of its type and origin. It is easily
**expandable** and **adaptable**, and it aims to be easy to use, even for
beginners.

It can be useful in a wide variety of cases, e.g.:

- Journalistic investigations to uncover corruption cases, like [this one](http://g1.globo.com/jornal-nacional/noticia/2016/01/hospital-do-rj-tem-medico-no-plantao-que-nao-aparece-para-trabalhar.html);
- Researching the population of a particular group;
- Getting to know a job candidate better before hiring them;
- *etc.*

## Concepts

During the framework's development, some words took on specific meanings:

- **Crawler**: The data collector.
- **Harvest**: The execution of a crawler.
- **Dependencies**: Previously collected data that a crawler requires.

> Also, each crawler has its *possible-to-achieve* **crop** after the
> **harvest**. Each crawler works in one or more different **entities**, where it
> contextualizes and stores the collected data.
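
To make the vocabulary above concrete, here is a minimal sketch of how these concepts could fit together. It is not spyck's actual API: the class names, the `dependencies` and `crop` attributes, the `run_harvests` helper, and the sample data are all hypothetical and chosen only for illustration. An entity is modeled as a plain dictionary, and a crawler is only harvested once the data it depends on is present.

```python
class Crawler:
    """Hypothetical data collector (not spyck's real base class)."""

    dependencies = frozenset()  # data that must already exist on the entity
    crop = frozenset()          # data this crawler can produce when harvested

    def harvest(self, entity):
        """Run the crawler on one entity and return the collected data."""
        raise NotImplementedError


class NameCrawler(Crawler):
    crop = frozenset({"name"})

    def harvest(self, entity):
        # Stand-in for a real lookup against some data source.
        return {"name": "Jane Doe"}


class EmployerCrawler(Crawler):
    dependencies = frozenset({"name"})
    crop = frozenset({"employer"})

    def harvest(self, entity):
        # Uses data gathered by an earlier harvest.
        return {"employer": "Employer of " + entity["name"]}


def run_harvests(entity, crawlers):
    """Harvest each crawler once its dependencies are present on the entity.

    Returns the class names of the crawlers in the order they ran. Crawlers
    whose dependencies can never be satisfied are left out.
    """
    pending = list(crawlers)
    order = []
    progress = True
    while pending and progress:
        progress = False
        for crawler in list(pending):
            if crawler.dependencies.issubset(entity):
                entity.update(crawler.harvest(entity))
                pending.remove(crawler)
                order.append(type(crawler).__name__)
                progress = True
    return order
```

In this sketch, passing the crawlers in the "wrong" order still works: `run_harvests` skips `EmployerCrawler` until `NameCrawler` has added `name` to the entity.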

## Requirements

> Everything below can be easily installed via
> [setuptools](https://pypi.python.org/pypi/setuptools).

- Python 3.x
- requests
- PyPDF2
- selenium
- pyslibtesseract
- aylien-apiclient

Then you need to install:

- PhantomJS

```sh
sudo apt-get install phantomjs
```
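
After installing, a quick check like the one below can confirm that the requirements are visible to Python. This is a sketch, not part of spyck: the helper names are hypothetical, and the import names in the list (in particular `aylienapiclient` and `pyslibtesseract`) are assumptions about how each package is imported.

```python
import importlib.util
import shutil


def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


def has_executable(name):
    """Return True if an executable with this name is on the PATH."""
    return shutil.which(name) is not None


if __name__ == "__main__":
    # Assumed import names for the packages listed under Requirements.
    modules = ["requests", "PyPDF2", "selenium",
               "pyslibtesseract", "aylienapiclient"]
    print("Missing modules:", missing_modules(modules) or "none")
    print("phantomjs on PATH:", has_executable("phantomjs"))
```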

## Other Resources

> Relax, some better docs will come soon.

You can find more information about the framework, and some updates on its
development, in [this blog post](http://macalogs.com.br/spyck-apresentacao-do-framework-de-mineracao-de-dados/).

You can also check [the slides](http://zetaresearch.github.io/talks/spyck.pdf)
from a presentation about the framework given at [XI Pylestras](http://pylestras.org/evento/xi-pylestras/).

## Roadmap

- [ ] Simplify the code to make it easier to work on the framework itself.
- [ ] Create a graphical interface (*GUI*) to make it more accessible to
  beginners.
- [ ] Implement analysis of, and inference from, the collected data.

## Contributing

Contributions are very welcome! If you'd like to contribute,
[these guidelines](CONTRIBUTING.md) may help you.

## History

See [Releases](https://github.com/zetaresearch/spyck/releases) for a detailed changelog.

## License

[MIT License](LICENSE.md) © ZETA Research.
