https://github.com/mykhode/data_mining_py

Simple data scraping with Python
ai scrabe-data training-data


Data Mining with Python

Description



This project scrapes a Khmer-language Q&A website to generate intents for a conversational AI system. It uses web scraping, natural language processing, and data structuring to build a dataset of tagged intents for training language models.

Features




  • Web Scraping: Utilizes requests and BeautifulSoup for data extraction from the website.


  • NLP Tagging: Uses the khmernltk library for part-of-speech tagging.


  • Intent Generation: Gathers unique nouns from questions to form intent patterns and extracts corresponding answers.
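
The intent-generation step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `unique_nouns` and `build_intent` are hypothetical helpers, the input is assumed to be a list of `(word, tag)` pairs such as a part-of-speech tagger might return, and the noun tag `"n"` is an assumption that should be checked against the tagger's real tag set. The sample data uses English placeholders for readability.

```python
def unique_nouns(tagged_tokens, noun_tags=("n",)):
    """Return the distinct nouns from (word, tag) pairs, in first-seen order."""
    seen = set()
    nouns = []
    for word, tag in tagged_tokens:
        word = word.strip()
        if tag in noun_tags and word and word not in seen:
            seen.add(word)
            nouns.append(word)
    return nouns


def build_intent(index, question, tagged_tokens, answer):
    """Build one intent dict: the question plus its unique nouns as patterns."""
    patterns = [question]
    for noun in unique_nouns(tagged_tokens):
        if noun not in patterns:
            patterns.append(noun)
    return {"tag": f"id_{index}", "patterns": patterns, "responses": [answer]}


# Hypothetical, pre-tagged sample data; the tag names are assumptions.
tagged = [("what", "o"), ("capital", "n"), ("of", "o"),
          ("Cambodia", "n"), ("capital", "n")]
intent = build_intent(1, "what is the capital of Cambodia", tagged, "Phnom Penh")
print(intent)
```

Keeping the nouns in first-seen order makes the generated patterns deterministic, which helps when comparing output between runs.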

Installation



  1. Clone the repository:
    git clone https://github.com/mykhode/data_mining_py.git
    cd data_mining_py

  2. Install dependencies:
    pip install -r requirements.txt

Usage



  1. Run the Python script generate_intents.py.

  2. The script will scrape the Q&A website and generate a JSON file (data_intents.json) containing intents for conversational AI systems.

Example


python generate_intents.py

Dependencies



  • requests

  • beautifulsoup4

  • khmernltk

Data Structure


The generated JSON file (data_intents.json) follows the structure:




{
  "intents": [
    {
      "tag": "id_1",
      "patterns": ["Question Pattern 1", "Noun Pattern 1"],
      "responses": ["Answer 1"]
    }
  ]
}

Each further intent in the "intents" list follows the same structure.
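
A file with this shape can be checked and serialized with the standard `json` module. The sketch below is illustrative only: `validate_intents` is a hypothetical helper, not part of the project, and `ensure_ascii=False` is a suggestion for keeping Khmer text readable in the output rather than something the project is known to do.

```python
import json


def validate_intents(data):
    """Raise ValueError if data does not match the documented intents layout."""
    intents = data.get("intents") if isinstance(data, dict) else None
    if not isinstance(intents, list):
        raise ValueError("top-level 'intents' must be a list")
    for i, intent in enumerate(intents):
        if not isinstance(intent, dict) or not isinstance(intent.get("tag"), str):
            raise ValueError(f"intent {i}: 'tag' must be a string")
        for key in ("patterns", "responses"):
            value = intent.get(key)
            if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
                raise ValueError(f"intent {i}: '{key}' must be a list of strings")
    return data


sample = {
    "intents": [
        {
            "tag": "id_1",
            "patterns": ["Question Pattern 1", "Noun Pattern 1"],
            "responses": ["Answer 1"],
        }
    ]
}

# ensure_ascii=False keeps non-ASCII text (such as Khmer) as-is instead of \u escapes.
text = json.dumps(validate_intents(sample), ensure_ascii=False, indent=2)
round_trip = json.loads(text)
```

Running a check like this on data_intents.json before training can catch an intent with a missing or mistyped field early.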

Contribution



  1. Fork the repository.

  2. Create a new branch (git checkout -b feature/new-feature).

  3. Make your changes and commit (git commit -am 'Add new feature').

  4. Push to the branch (git push origin feature/new-feature).

  5. Create a new Pull Request.

License


This project is licensed under the MIT License.

Credits