Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/banglakit/corpus-builder
toolkit for compiling corpus from various sources
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/banglakit/corpus-builder
- Owner: banglakit
- License: MIT
- Created: 2016-06-28T14:09:53.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2018-08-24T14:00:29.000Z (almost 6 years ago)
- Last Synced: 2024-03-15T02:07:55.892Z (4 months ago)
- Language: Python
- Homepage: https://github.com/banglakit/corpus-builder/wiki/Status
- Size: 53.7 KB
- Stars: 42
- Watchers: 4
- Forks: 15
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
Lists
- awesome-bangla - Corpus Builder
README
# banglakit/corpus-builder
A sufficiently large body of text is essential for NLP tasks; this tool is designed solely to build large collections of text documents from the web.
A practical understanding of Python and [Scrapy](http://www.scrapy.org) is essential for using the tool.
### Example Usage
```bash
scrapy crawl bangladesh_pratidin -a start_date='2016-06-01' -a end_date='2016-06-05' -o test3.csv
```
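In the command above, each `-a name=value` pair is a Scrapy spider argument, which Scrapy passes to the spider as a string keyword argument, and `-o test3.csv` writes the scraped items to a CSV feed. The sketch below illustrates one way a spider could expand `start_date` and `end_date` into individual days to crawl. It is not taken from this repository: the helper names `parse_day` and `days_between` are hypothetical, and treating the range as inclusive of both endpoints is an assumption.

```python
from datetime import date, datetime, timedelta
from typing import Iterator


def parse_day(value: str) -> date:
    """Parse a 'YYYY-MM-DD' string, the format used with -a in the example."""
    return datetime.strptime(value, "%Y-%m-%d").date()


def days_between(start_date: str, end_date: str) -> Iterator[date]:
    """Yield every day from start_date to end_date.

    Assumption: both endpoints are included. The real spider may differ.
    """
    start, end = parse_day(start_date), parse_day(end_date)
    if start > end:
        raise ValueError("start_date must not be after end_date")
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


if __name__ == "__main__":
    # Same date range as the example command.
    for day in days_between("2016-06-01", "2016-06-05"):
        print(day.isoformat())
```

A spider written along these lines would typically build one archive request per yielded day, so a narrow date range is a convenient way to run a small test crawl before a longer one.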