https://github.com/tubone24/askfm-qa-crawler
Crawl Ask.fm QA lists and create corpus for ML.
askfm chromedriver corpus-builder crawler selenium
- Host: GitHub
- URL: https://github.com/tubone24/askfm-qa-crawler
- Owner: tubone24
- License: mit
- Created: 2019-10-27T07:44:10.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2023-12-15T17:34:02.000Z (about 1 year ago)
- Last Synced: 2024-04-13T04:55:35.042Z (9 months ago)
- Topics: askfm, chromedriver, corpus-builder, crawler, selenium
- Language: Python
- Homepage:
- Size: 95.7 KB
- Stars: 1
- Watchers: 3
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
# askfm-qa-crawler
![img](./docs/image/header.png)
[![license](https://img.shields.io/github/license/tubone24/askfm-qa-clawler.svg)](LICENSE)
[![standard-readme compliant](https://img.shields.io/badge/readme%20style-standard-brightgreen.svg?style=flat-square)](https://github.com/RichardLitt/standard-readme)
[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat-square)](http://makeapullrequest.com)

> Crawl Ask.fm QA lists and create corpus for ML.

This is a set of Selenium tasks that crawl Ask.fm to collect QA lists for machine learning.

## Table of Contents
- [Background](#background)
- [Install](#install)
- [Usage](#usage)
- [Contributing](#contributing)
- [License](#license)

## Background
One of my machine learning tasks was to build a bot that responds to natural language using an LSTM (a kind of RNN).
That task needs a large conversation corpus. Since I could not find a good one, I decided to build one by crawling Ask.fm question and answer lists with Selenium (Google Chrome).
I'm using Selenium for Python because my favorite programming language is Python.
## Install
### Prerequisites
- Python 3.6+
- Google Chrome
- [Google Chrome WebDriver](https://sites.google.com/a/chromium.org/chromedriver/downloads)
  - Check your Chrome version and install the matching driver version.

### PIP
Install dependencies.
```
pip install -r requirements.txt
```

## Usage
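
Before running the scripts, it can help to confirm that the installed ChromeDriver matches your Chrome version, as the install notes ask. The following is a minimal sketch, not part of this repository: the helper names `major_version` and `versions_match` are hypothetical, and it assumes you paste in the version strings that your Chrome and ChromeDriver binaries report (for example from `chromedriver --version`).

```python
import re
from typing import Optional


def major_version(version_output: str) -> Optional[int]:
    """Return the first major version number found in a version string, or None."""
    match = re.search(r"(\d+)\.\d+", version_output)
    return int(match.group(1)) if match else None


def versions_match(chrome_output: str, driver_output: str) -> bool:
    """Return True when both strings report the same major version."""
    chrome_major = major_version(chrome_output)
    driver_major = major_version(driver_output)
    return chrome_major is not None and chrome_major == driver_major


if __name__ == "__main__":
    # Example strings only; replace them with the output from your own machine.
    chrome = "Google Chrome 120.0.6099.109"
    driver = "ChromeDriver 120.0.6099.71"
    print("match" if versions_match(chrome, driver) else "mismatch")
```

Comparing only the major version is a simplification chosen for this sketch; check the ChromeDriver download page for the exact pairing rules.
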
### Create face list (account list)
Before creating the conversation corpus, create a `face list`: the list of accounts whose QA pages will be crawled.
The first argument is the loop count.
```
python src/get_faces.py 100
```

After running the script, the face list is written to `data/face_list.txt`.
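
To inspect the result, something like the following sketch could be used. It is not taken from this repository: `parse_face_list` is a hypothetical helper, and it assumes the file holds one account name per line, which this README does not confirm.

```python
from typing import Iterable, List


def parse_face_list(lines: Iterable[str]) -> List[str]:
    """Return account names in order, skipping blank lines and duplicates.

    Assumes one account name per line (an assumption, not a documented format).
    """
    seen = set()
    accounts = []
    for line in lines:
        name = line.strip()
        if name and name not in seen:
            seen.add(name)
            accounts.append(name)
    return accounts


if __name__ == "__main__":
    # Path taken from the README; run this after src/get_faces.py has finished.
    with open("data/face_list.txt", encoding="utf-8") as f:
        accounts = parse_face_list(f)
    print(f"{len(accounts)} unique accounts")
```

Removing duplicates here is a choice made for this sketch, since a looped crawl may plausibly collect the same account more than once.
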
### Create conversation corpus
```
python src/main.py
```

After running the script, the conversation corpus is written to `data/askfm_data/foobar.txt`.
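
The corpus can then be loaded as question and answer pairs for training. The sketch below is illustrative only: this README does not document the file layout, so it assumes one pair per line with the question and answer separated by a tab, and `parse_qa_pairs` is a hypothetical helper. Adjust the parsing to the layout you actually find in the output file.

```python
from typing import Iterable, List, Tuple


def parse_qa_pairs(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse (question, answer) pairs from tab-separated lines.

    The tab-separated layout is an assumption made for this sketch.
    Lines without a tab, or with an empty question or answer, are skipped.
    """
    pairs = []
    for line in lines:
        question, sep, answer = line.rstrip("\n").partition("\t")
        question, answer = question.strip(), answer.strip()
        if sep and question and answer:
            pairs.append((question, answer))
    return pairs


if __name__ == "__main__":
    # "foobar.txt" is the placeholder file name used in the README.
    with open("data/askfm_data/foobar.txt", encoding="utf-8") as f:
        pairs = parse_qa_pairs(f)
    print(f"{len(pairs)} QA pairs loaded")
```
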
## Contributing
See [the contributing file](CONTRIBUTING.md)!
PRs accepted.
Small note: If editing the Readme, please conform to the [standard-readme](https://github.com/RichardLitt/standard-readme) specification.
## License
[MIT © tubone.](LICENSE)