
https://github.com/e9t/google-keywords

Get keywords from Google blog, news search.

Host: GitHub
URL: https://github.com/e9t/google-keywords
Owner: e9t
Created: 2013-02-06T06:00:36.000Z (almost 12 years ago)
Default Branch: master
Last Pushed: 2013-02-13T08:31:20.000Z (almost 12 years ago)
Last Synced: 2024-11-07T17:14:04.260Z (about 2 months ago)
Language: Python
Size: 133 KB
Stars: 2
Watchers: 2
Forks: 3
Open Issues: 0
Metadata Files:
- Readme: README.md


Get Google Keywords
=================================

### Preparation

- Install the Python packages `lxml`, `html5lib`, and `nltk`. (Linux users also need to install `python-dev` and `python-lxml` with apt-get.)

        $ pip install lxml html5lib nltk

- Run `nltk.download()`, then select and install `corpus/stopwords`

        $ python
        >>> import nltk
        >>> nltk.download()

### Configure settings

    vi settings.py

- TARGET: Either 'news' or 'blog'

- QUERYLIST: List of queries

- HTMLPATH: Path to save html files. (Paths should end with a slash)

- KEYWORDPATH: Path to save keyword files, e.g. 'data/keywords/'. (Paths should end with a slash)

- NCRAWLPAGES: Number of search pages to crawl from Google

- DELIMS: Delimiters for parsing words in HTML page

- TODAY: Date for analysis
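
As a rough illustration, a `settings.py` along these lines would cover the options listed above. Every value is an assumption chosen for the example, not the project's default; in particular the delimiter set is made up, and the date format is only inferred from the file names shown under Results.

```python
# settings.py (illustrative values only; not the project's defaults)
from datetime import date

TARGET = 'news'                    # either 'news' or 'blog'
QUERYLIST = ['data', 'mining']     # list of queries
HTMLPATH = 'data/html/'            # path for HTML files, ends with a slash
KEYWORDPATH = 'data/keywords/'     # path for keyword files, ends with a slash
NCRAWLPAGES = 3                    # assumed number of search pages to crawl
DELIMS = ' \t\n.,:;!?()[]"\''      # assumed delimiter characters
TODAY = date.today().strftime('%Y%m%d')  # assumed format, e.g. '20120907'
```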

### Run

In order to get search results for `data mining`, run

    python main.py data mining

or set `QUERYLIST=['data', 'mining']` in `settings.py`, and run

    python main.py

### Results

If `HTMLPATH='data/html/'` and `KEYWORDPATH='data/keywords/'` in `settings.py`, the search results and keywords are stored in the `data` folder as shown below.

    data/
    ├── html/
    │   ├── data_mining/
    │   └── data_mining-20120907.json
    └── keywords/
        └── keywords-data_mining.json
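
The layout above suggests that file names are built from the query words and the analysis date. The sketch below reproduces those paths under that assumption. `result_paths` is a hypothetical helper, not a function from the project, and plain string concatenation is assumed only because the settings ask for paths that end with a slash.

```python
def result_paths(query_words, htmlpath, keywordpath, today):
    """Build the output paths shown in the tree (hypothetical helper)."""
    name = '_'.join(query_words)  # ['data', 'mining'] -> 'data_mining'
    return {
        'html_dir': htmlpath + name + '/',
        'results': htmlpath + name + '-' + today + '.json',
        'keywords': keywordpath + 'keywords-' + name + '.json',
    }


paths = result_paths(['data', 'mining'], 'data/html/', 'data/keywords/', '20120907')
for label, path in sorted(paths.items()):
    print(label, path)
```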

- **data/html/data_mining/**: This folder contains the raw HTML files. File names are marked with a timestamp.

- **data/html/data_mining-20120907.json**: This file contains the `url`, `desc` (description), `crawled_time`, `page_no`, and `title` extracted from the raw HTML files. Below is an example.

        [
          {
            "url": "http://smartdatacollective.com/timoelliott/101486/analytics-world-news-big-data-cool-3d-analytics",
            "desc": "Themos Kalafatis has worked as a consultant for , Text Mining, Information Extraction and Data Quality for over a decade. More \u00bb ",
            "crawled_time": "20120907_192648",
            "page_no": 1,
            "title": "Scary Big Data, Cool 3D Analytics and More"
          },
          ...
        ]

- **data/keywords/keywords-data_mining.json**: This file contains the most frequent keywords. An example is shown below.

        ["data", 23],
        ["mining", 19],
        ["analytics", 3],
        ["app", 3],
        ["big", 2],
        ["mayo", 2],
        ["companies", 2],
        ["3d", 2],
        ["ehr", 2],
        ["datamining", 2],
        ["partner", 2],
        ["nlp", 1],
        ["desktops", 1],
        ["office", 1],
        ["advisory", 1]
        ...
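
To make the relationship between the two files concrete, here is a minimal sketch of how `[word, count]` pairs like these could be derived from the `title` and `desc` fields of the search results. It is not the project's code: `count_keywords`, the regular-expression tokenizer, and the tiny inline stopword set are all assumptions that stand in for the `DELIMS` setting and the NLTK stopwords corpus mentioned under Preparation. The sample record is abbreviated from the example above.

```python
import re
from collections import Counter

# Stand-in for the NLTK stopwords corpus (assumption, kept tiny on purpose).
STOPWORDS = {'a', 'and', 'as', 'for', 'has', 'more', 'over', 'the'}


def count_keywords(records, delims=r'[^0-9a-z]+'):
    """Count non-stopword tokens in the title and desc of each record."""
    counts = Counter()
    for record in records:
        text = (record['title'] + ' ' + record['desc']).lower()
        tokens = re.split(delims, text)  # split on runs of non-alphanumerics
        counts.update(t for t in tokens if t and t not in STOPWORDS)
    # Same shape as the keywords file: [word, count] pairs, most frequent first.
    return [[word, n] for word, n in counts.most_common()]


records = [
    {'title': 'Scary Big Data, Cool 3D Analytics and More',
     'desc': 'Text Mining, Information Extraction and Data Quality'},
]
keywords = count_keywords(records)
print(keywords[:3])
```

Swapping the inline set for `nltk.corpus.stopwords.words('english')` would bring the sketch closer to the setup described in Preparation, at the cost of requiring the corpus to be downloaded first.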

### Authors

2012 LG-SNU Smart TV Project Team

(Created Sep. 2012)