https://github.com/stef/pywik

webserver log analyzer

* pywik
Welcome to pywik, the static webserver CSV logfile analyzer.
** Features
pywik can filter and list:
- visited pages,
- search queries,
- server errors,
- unknown files (not in the good set),
- bot visits,
- 404 pages (malware shows up here),
- visits from Tor users (and 404s from Tor users),
- external referers
** Setup
Install the dependencies (you will need MongoDB):
#+BEGIN_SRC
virtualenv --no-site-packages env
source env/bin/activate
pip install -r deps.txt
#+END_SRC
Set up the environment for pywik:
#+BEGIN_SRC
mkdir logs
./updatelists.sh
mkdir data/myhost
touch data/myhost/goodpaths
touch data/myhost/ignoremissing
touch data/myhost/ignorepaths
touch data/myhost/ownhosts
touch data/myhost/classes
#+END_SRC
** Host specific files
pywik uses a few host-specific files, which improve the output
considerably. Create a directory under data named after your
hostname and populate the following files accordingly.
*** ownhosts
A list of hostnames that are considered part of your
infrastructure. Any log entry whose referer points to a host
not on this list is considered an external hit.
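As a rough illustration of the rule above, the following sketch treats a referer as external when its hostname is not in ownhosts. The function names, the one-hostname-per-line layout, and the example hostnames are assumptions made for illustration, not pywik's actual code.

#+BEGIN_SRC python
from urllib.parse import urlparse

def load_ownhosts(lines):
    """Return the set of non-empty hostnames (assumed: one per line)."""
    return {line.strip().lower() for line in lines if line.strip()}

def is_external(referer, ownhosts):
    """True if the referer has a hostname that is not one of our own hosts."""
    host = urlparse(referer).hostname  # None for empty or '-' referers
    return host is not None and host not in ownhosts

# Hypothetical contents of data/myhost/ownhosts
ownhosts = load_ownhosts(["example.org\n", "www.example.org\n"])
#+END_SRC

Entries without a usable referer (empty or "-") are not counted as external in this sketch; whether pywik treats them the same way is not stated here.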
*** goodpaths
Paths that count as page visits. Each line is a regexp.
*** ignoremissing
Paths that regularly generate 404 responses. Each line is a regexp.
*** ignorepaths
Paths that are uninteresting for tracking pageviews, such as page
requisites (e.g. .css and .js files). Each line is a regexp.
*** rss
Each line is a regexp for an rss/atom feed.
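The one-regexp-per-line files described so far (goodpaths, ignoremissing, ignorepaths, rss) could be loaded and applied roughly as follows. This is a minimal sketch: the helper names and example patterns are invented, and matching from the start of the path with re.match is an assumption about how pywik applies the patterns.

#+BEGIN_SRC python
import re

def load_patterns(lines):
    """Compile one regexp per non-empty line."""
    return [re.compile(line.strip()) for line in lines if line.strip()]

def matches_any(path, patterns):
    """True if any pattern matches at the start of the path (assumed semantics)."""
    return any(p.match(path) for p in patterns)

# Hypothetical file contents
ignorepaths = load_patterns([r".*\.css$", r".*\.js$"])
goodpaths = load_patterns([r"/blog/", r"/$"])
#+END_SRC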
*** classes
This file allows you to categorise the entries. The format is as
follows: each class starts with its name, followed by pairs of
lines, a field name and then a regexp. Classes are separated by
empty lines.
#+BEGIN_SRC
Users
path
/user/\?id=

Indexed Products
path
/products/\?id=
http_user_agent
.*Googlebot/2\.1
#+END_SRC
The above example defines two new classes:
- Users: any entry whose path starts with "/user/?id="
- Indexed Products: entries whose path starts with "/products/?id="
  and that were requested by Googlebot. Note the two rules, one for
  the path and one for the user agent.
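To make the file format concrete, here is a small parser sketch for the example above. The names parse_classes and classify are hypothetical, and the sketch assumes that an entry belongs to a class only when all of its rules match from the start of the field value; pywik's own implementation may differ.

#+BEGIN_SRC python
import re

def parse_classes(text):
    """Parse class blocks: a name line, then alternating field/regexp lines."""
    classes = {}
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [l.strip() for l in block.splitlines() if l.strip()]
        if not lines:
            continue
        name, rest = lines[0], lines[1:]
        # Pair up the lines: (field, regexp), (field, regexp), ...
        classes[name] = [(f, re.compile(r)) for f, r in zip(rest[0::2], rest[1::2])]
    return classes

def classify(entry, classes):
    """Return the names of classes whose every rule matches the entry."""
    return [name for name, rules in classes.items()
            if all(rx.match(str(entry.get(field, ""))) for field, rx in rules)]

EXAMPLE = r"""Users
path
/user/\?id=

Indexed Products
path
/products/\?id=
http_user_agent
.*Googlebot/2\.1
"""
classes = parse_classes(EXAMPLE)
#+END_SRC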
** Web-server logformat
Set your webserver to use the following logformats, or use:
#+BEGIN_SRC
./ncsa2csv.py
#+END_SRC
** Plugins
Plugins live in data/plugins. There are two kinds of plugins:
- those that generate queries for filtered listings in the output,
- and those that enrich the database while parsing the logfile
For examples look into data/plugins, **addrapp** and **tor** are
good candidates for starting off.
*** Plugin Initialization
Plugins that provide an init(ctx) function can initialize
themselves. The ctx parameter is a dictionary that currently
has only one key, 'host'.
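A plugin might use init(ctx) along these lines. This is only a sketch: the module-level variables and the idea of deriving a per-host data directory from ctx['host'] are assumptions for illustration, not something pywik prescribes.

#+BEGIN_SRC python
import os

HOST = None
DATA_DIR = None

def init(ctx):
    """Remember which host is being analyzed; ctx only carries 'host'."""
    global HOST, DATA_DIR
    HOST = ctx['host']
    # Hypothetical: locate the host-specific files under data/<host>
    DATA_DIR = os.path.join('data', HOST)

# Simulate the call that pywik would make when loading the plugin
init({'host': 'myhost'})
#+END_SRC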
*** query plugins
Query plugins implement a queries() function that returns a list of:
#+BEGIN_SRC
('title', {'field1': value1, 'field2': value2},['displayfield1', 'displayfield2'])
#+END_SRC
- where 'title' is the title to be displayed,
- the second element is a dict containing a mongodb filter expression,
- the final element is a list of fieldnames to be returned by mongo
  for each matching entry

This can be as simple as:

#+BEGIN_SRC python
def queries():
    # Each tuple: (title, mongodb filter, fields to display)
    return [('tor', {'tags': ['tor', 'page']}, ['path', 'hostname', 'http_user_agent']),
            ('tor404', {'tags': ['tor'], 'status': 404}, ['path', 'hostname', 'http_user_agent'])]
#+END_SRC
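How such tuples might be consumed can be sketched as follows. The run_queries helper is hypothetical; it assumes a pymongo-style collection whose find() takes a filter dict and a list of field names as projection. FakeCollection is a stand-in that only emulates exact-equality filters, so the sketch runs without a MongoDB server.

#+BEGIN_SRC python
def run_queries(collection, queries):
    """Yield (title, rows) for each (title, filter, fields) tuple."""
    for title, flt, fields in queries:
        # find(filter, projection): projection is the list of fields to return
        yield title, list(collection.find(flt, fields))

class FakeCollection:
    """Minimal stand-in for a MongoDB collection (exact-equality filters only)."""
    def __init__(self, docs):
        self.docs = docs

    def find(self, flt, fields):
        for doc in self.docs:
            if all(doc.get(k) == v for k, v in flt.items()):
                yield {f: doc.get(f) for f in fields}

# Made-up log entries
docs = [{'path': '/x', 'status': 404, 'tags': ['tor']},
        {'path': '/y', 'status': 200, 'tags': ['tor', 'page']}]
queries = [('tor404', {'tags': ['tor'], 'status': 404}, ['path'])]
results = dict(run_queries(FakeCollection(docs), queries))
#+END_SRC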
*** loader plugins
Loader plugins enrich the information in each log entry during
database import. A loader plugin implements a process(entry)
interface, that returns the changed entry.

#+BEGIN_SRC python
def process(entry):
    # Add a custom field to entries for one specific path.
    if entry['path'] == '/foo':
        entry['foo'] = 'bar'
    return entry
#+END_SRC

Here's a more advanced example (you can find more in data/plugins):
#+BEGIN_SRC python
from load import basepath

# Load the list of Tor exit addresses once, when the plugin is imported.
with open('%s/data/torexits.csv' % basepath, 'r') as fp:
    torexits = [x.strip() for x in fp]
#print '[tor plugin]', len(torexits), 'torexits loaded'

def process(entry):
    # Tag entries whose client address is a known Tor exit.
    if entry['remote_addr'] in torexits:
        entry['tags'].append('tor')
    return entry
#+END_SRC
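Since every loader takes an entry and returns it, several loaders can be chained. The sketch below shows one way this could work; apply_loaders, the two example loaders, and the exit address are invented for illustration, and the order in which pywik actually invokes its plugins is not specified here.

#+BEGIN_SRC python
def apply_loaders(entry, loaders):
    """Pass the entry through each loader's process() function in order."""
    for process in loaders:
        entry = process(entry)
    return entry

def tag_tor(entry):
    # Made-up exit address from a documentation-only IP range
    if entry['remote_addr'] in {'203.0.113.7'}:
        entry['tags'].append('tor')
    return entry

def tag_foo(entry):
    if entry['path'] == '/foo':
        entry['foo'] = 'bar'
    return entry

entry = apply_loaders({'remote_addr': '203.0.113.7', 'path': '/foo', 'tags': []},
                      [tag_tor, tag_foo])
#+END_SRC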
** Bugs
Many. Reporting them is encouraged, and fixing them is very welcome.