https://github.com/tubone24/askfm-qa-crawler

Crawl Ask.fm QA lists and create corpus for ML.
https://github.com/tubone24/askfm-qa-crawler

askfm chromedriver corpus-builder crawler selenium

Last synced: about 2 months ago
JSON representation

Crawl Ask.fm QA lists and create corpus for ML.

Host: GitHub
URL: https://github.com/tubone24/askfm-qa-crawler
Owner: tubone24
License: mit
Created: 2019-10-27T07:44:10.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2023-12-15T17:34:02.000Z (over 1 year ago)
Last Synced: 2024-12-25T17:12:39.401Z (6 months ago)
Topics: askfm, chromedriver, corpus-builder, crawler, selenium
Language: Python
Homepage:
Size: 95.7 KB
Stars: 1
Watchers: 3
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE

Awesome Lists containing this project

README

        # askfm-qa-crawler

![img](./docs/image/header.png)

[![license](https://img.shields.io/github/license/tubone24/askfm-qa-clawler.svg)](LICENSE)

[![standard-readme compliant](https://img.shields.io/badge/readme%20style-standard-brightgreen.svg?style=flat-square)](https://github.com/RichardLitt/standard-readme)

[![PRs Welcome](https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=flat-square)](http://makeapullrequest.com)

> Crawl Ask.fm QA lists and create corpus for ML.

This is a selenium tasks to crawl Ask.fm because of correcting QA list for Machine Learning.

## Table of Contents

- [Background](#background)

- [Install](#install)

- [Usage](#usage)

- [Contributing](#contributing)

- [License](#license)

## Background

Among machine learning, there was a task to create a bot that responds to natural language using LSTM (a kind of RNN).

At that time, a large amount of conversation corpus is required, but since I did not get a good conversation corpus, I decided to make a conversation corpus by crawling the Ask.fm question answer list with Selenium (Google Chrome) Did.

I'm using Selenium for Python because my favorite programming language is Python.

## Install

### Precondition

- Python 3.6+

- Google Chrome

- [Google Chrome WebDriver](https://sites.google.com/a/chromium.org/chromedriver/downloads)

  - Check your Chrome version and install suitable driver version.

### PIP

Install dependencies.

```

pip install -r requirements.txt

```

## Usage

### Create faces list (Account list)

Before create conversation corpus, create `face list` because of crawling QA.

First args, number of loop count.

```

python src/get_faces.py 100

```

After run script, get face list into `data/face_list.txt`

### Create conversation corpus

```

python src/main.py

```

After run script, get conversation corpus into `data/askfm_data/foobar.txt`

## Contributing

See [the contributing file](CONTRIBUTING.md)!

PRs accepted.

Small note: If editing the Readme, please conform to the [standard-readme](https://github.com/RichardLitt/standard-readme) specification.

## License

[MIT © tubone.](LICENSE)

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/tubone24/askfm-qa-crawler

Awesome Lists containing this project

README