https://github.com/e9t/google-keywords
Get keywords from Google blog, news search.
- Host: GitHub
- URL: https://github.com/e9t/google-keywords
- Owner: e9t
- Created: 2013-02-06T06:00:36.000Z (almost 12 years ago)
- Default Branch: master
- Last Pushed: 2013-02-13T08:31:20.000Z (almost 12 years ago)
- Last Synced: 2024-11-07T17:14:04.260Z (about 2 months ago)
- Language: Python
- Size: 133 KB
- Stars: 2
- Watchers: 2
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
Get Google Keywords
=================================

### Preparation
- Install the Python packages `lxml`, `html5lib`, and `nltk`. (Linux users also need to `apt-get` `python-dev` and `python-lxml`.)

        $ pip install lxml html5lib nltk

- From `nltk.download()`, select and install `corpus/stopwords`.

        $ python
        >>> import nltk
        >>> nltk.download()

### Configure settings
    vi settings.py

- TARGET: Either 'news' or 'blog'
- QUERYLIST: List of queries
- HTMLPATH: Path to save HTML files. (Paths should end with a slash)
- KEYWORDPATH: Path to save keyword files (e.g. `'data/keywords/'`)
- NCRAWLPAGES: Number of search pages to crawl from Google
- DELIMS: Delimiters for parsing words in an HTML page
- TODAY: Date for analysis

### Run
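Before running, `settings.py` needs a value for each option listed under "Configure settings". The project's real defaults are not shown in this README, so the following is only a sketch: the values for `NCRAWLPAGES`, `DELIMS`, and `TODAY` (and the type of `DELIMS`) are assumptions chosen for illustration.

```python
# settings.py (illustrative values only, not the project's actual defaults)
from datetime import date

TARGET = 'news'                  # either 'news' or 'blog'
QUERYLIST = ['data', 'mining']   # queries used when running `python main.py` without arguments
HTMLPATH = 'data/html/'          # where HTML files are saved; ends with a slash
KEYWORDPATH = 'data/keywords/'   # where keyword files are saved; ends with a slash
NCRAWLPAGES = 3                  # assumed value: number of Google search pages to crawl
DELIMS = ' \t\n.,;:!?()[]"\''    # assumed value and type: characters that separate words
# Assumed format, chosen to match file names such as data_mining-20120907.json
TODAY = date.today().strftime('%Y%m%d')
```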
In order to get search results for `data mining`, run

    python main.py data mining

or set `QUERYLIST=['data', 'mining']` in `settings.py`, and run

    python main.py

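This README does not spell out how keywords are derived from the crawled pages. A minimal sketch of one plausible approach, assuming text is split on the `DELIMS` characters, stopwords are dropped, and the remaining words are tallied, could look like this. The `count_keywords` helper, the `DELIMS` value, and the tiny `STOPWORDS` set (a stand-in for the NLTK stopwords corpus, so the sketch runs without downloads) are all hypothetical and not taken from the project.

```python
import re
from collections import Counter

# Assumed delimiter set; the project's actual DELIMS value is not shown in the README.
DELIMS = ' \t\n.,;:!?()[]"\''

# Small stand-in for NLTK's English stopword list (hypothetical subset).
STOPWORDS = {'a', 'an', 'and', 'for', 'in', 'is', 'of', 'the', 'to'}


def count_keywords(texts, delims=DELIMS, stopwords=STOPWORDS):
    """Return [word, count] pairs sorted by descending count (hypothetical helper)."""
    # Build a character class from the delimiters; '+' collapses runs such as ', '.
    splitter = re.compile('[' + re.escape(delims) + ']+')
    counts = Counter()
    for text in texts:
        for word in splitter.split(text.lower()):
            if word and word not in stopwords:
                counts[word] += 1
    return [[word, n] for word, n in counts.most_common()]


titles = [
    'Scary Big Data, Cool 3D Analytics and More',
    'Data mining for the big companies',
]
print(count_keywords(titles))
```

The result is a list of `[word, count]` pairs, which has the same shape as the entries in the keywords file described under Results.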
### Results
If `HTMLPATH='data/html/'` and `KEYWORDPATH='data/keywords/'` in `settings.py`, the search results and keywords are stored in the `data` folder as shown below.

    data/
    ├── html/
    │   ├── data_mining/
    │   └── data_mining-20120907.json
    └── keywords/
        └── keywords-data_mining.json

- **data/html/data_mining/**: This folder contains the raw HTML files. File names are marked with a timestamp.
- **data/html/data_mining-20120907.json**: This file contains the `url`, `desc` (description), `crawled_time`, `page_no`, and `title` extracted from the raw HTML files. Below is an example.

        [
            {
                "url": "http://smartdatacollective.com/timoelliott/101486/analytics-world-news-big-data-cool-3d-analytics",
                "desc": "Themos Kalafatis has worked as a consultant for , Text Mining, Information Extraction and Data Quality for over a decade. More \u00bb ",
                "crawled_time": "20120907_192648",
                "page_no": 1,
                "title": "Scary Big Data, Cool 3D Analytics and More"
            },
            ...
        ]

- **data/keywords/keywords-data_mining.json**: This file contains the most frequent keywords. An example is shown below.

        ["data", 23],
        ["mining", 19],
        ["analytics", 3],
        ["app", 3],
        ["big", 2],
        ["mayo", 2],
        ["companies", 2],
        ["3d", 2],
        ["ehr", 2],
        ["datamining", 2],
        ["partner", 2],
        ["nlp", 1],
        ["desktops", 1],
        ["office", 1],
        ["advisory", 1]
        ...

### Authors
2012 LG-SNU Smart TV Project Team
(Created Sep. 2012)