https://github.com/skyguy126/python-frequencyanalysis
A simple script to prove Zipf's law.
- Host: GitHub
- URL: https://github.com/skyguy126/python-frequencyanalysis
- Owner: skyguy126
- License: gpl-3.0
- Created: 2016-05-17T02:16:34.000Z (about 10 years ago)
- Default Branch: master
- Last Pushed: 2016-09-28T20:51:32.000Z (over 9 years ago)
- Last Synced: 2025-07-06T00:01:48.448Z (12 months ago)
- Language: Python
- Homepage:
- Size: 23.4 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# **Python_FrequencyAnalysis**
**A simple script to test Zipf's law empirically on books from Project Gutenberg.**
## Usage
### Step 1: Setup
Clone this repository and create two directories inside the `src` folder named `books` and `temp`.
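
If you prefer to script this step, the two directories can be created from Python. This is a minimal sketch, not part of the repository: the helper name `make_data_dirs` is hypothetical, and it assumes you run it from the repository root so that `src` resolves correctly.

```python
from pathlib import Path


def make_data_dirs(src_dir="src"):
    """Create the `books` and `temp` directories inside src_dir if missing."""
    created = []
    for name in ("books", "temp"):
        path = Path(src_dir) / name
        # exist_ok=True makes the call safe to repeat.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `make_data_dirs()` a second time leaves existing directories and their contents untouched.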
### Step 2: Download
Run the downloader to fetch plain-text books from Project Gutenberg. Headers are stripped automatically.
```sh
python Downloader.py
```
Specify the number of passes (one pass downloads roughly 20-50 books).
Press any key at any time to exit; the program exits only after the current pass completes.
The files are downloaded to the `temp` directory.
*Requires BeautifulSoup.*
#### Options
You may edit the Project Gutenberg offset in `Downloader_Config.ini`. The script updates this value automatically.
### Step 3: Analyze
This generates a set of confidence intervals, each computed from a sample of `NUM_OF_SAMPLES` books.
```sh
python Analyze_Multicore.py
```
Make sure the books to analyze are in the `books` directory; the downloader saves to `temp`, so move or copy them over first.
Output is saved to `conf.txt`.
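
Because downloads land in `temp` while the analysis reads from `books`, a small helper can move the files across. The sketch below is not part of the repository: the name `move_downloads` is hypothetical, and the `*.txt` pattern is an assumption about how the downloader names its files, so adjust it to match what you actually see in `temp`.

```python
import shutil
from pathlib import Path


def move_downloads(src_dir="src", pattern="*.txt"):
    """Move files matching pattern from src_dir/temp to src_dir/books."""
    temp_dir = Path(src_dir) / "temp"
    books_dir = Path(src_dir) / "books"
    books_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    # The "*.txt" default is an assumption, not a documented file name format.
    for path in sorted(temp_dir.glob(pattern)):
        target = books_dir / path.name
        shutil.move(str(path), str(target))
        moved.append(target)
    return moved
```

The function returns the new paths, which makes it easy to check that `books` holds more files than `NUM_OF_SAMPLES` before starting the analysis.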
#### Options
```python
'''
Number of books to sample for one confidence interval.
Make sure value is lower than number of books in the directory.
'''
NUM_OF_SAMPLES = 300
'''
Number of words to include in the data set for generating the regression line.
Set to -1 to use all words (not recommended), 1000 works best.
'''
NUM_TOP_WORDS = 1000
'''
Number of confidence intervals to generate.
'''
NUM_INTERVALS = 100
'''
Number of processes.
'''
NUM_PROCESSES = 8
'''
Alpha value for confidence interval.
0.05 = 95% confidence
'''
ALPHA_VALUE = 0.05
```
*Requires statsmodels, numpy, and matplotlib.*
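
To illustrate how `NUM_TOP_WORDS` and `ALPHA_VALUE` could enter such an analysis, here is a standard-library sketch. Zipf's law says a word's frequency is roughly inversely proportional to its rank, so a regression of log frequency on log rank should give a slope near -1. This is not the code in `Analyze_Multicore.py`. The function name `zipf_slope_interval`, the tokenizing regex, the choice of the slope as the quantity of interest, and the normal approximation for the interval are all assumptions made for illustration.

```python
import math
import re
from collections import Counter
from statistics import NormalDist


def zipf_slope_interval(text, num_top_words=1000, alpha=0.05):
    """Fit log(frequency) = a + b*log(rank); return (b, low, high)."""
    # Simple tokenizer (an assumption): lowercase runs of letters/apostrophes.
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    # -1 mirrors the option above: use every distinct word.
    limit = None if num_top_words == -1 else num_top_words
    freqs = [count for _, count in counts.most_common(limit)]

    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    n = len(xs)
    if n < 3:
        raise ValueError("need at least three distinct words")

    # Ordinary least squares for a single predictor.
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Standard error of the slope from the residual variance.
    residual_ss = sum((y - (intercept + slope * x)) ** 2
                      for x, y in zip(xs, ys))
    std_err = math.sqrt(residual_ss / (n - 2) / sxx)

    # Normal approximation (an assumption); a t-quantile is more exact
    # when only a few words are used.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return slope, slope - z * std_err, slope + z * std_err
```

With `alpha=0.05` the returned bounds form an approximate 95% interval for the slope. Repeating the fit on different samples of books would yield a set of intervals comparable in spirit to what `NUM_INTERVALS` controls.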
### Presentation
[Google Slides](https://docs.google.com/presentation/d/17swH8eZlLc9Ui8GqVKJfDptICacmxzMZ78Bg92nCtw4/edit?usp=sharing)