{"id":15395476,"url":"https://github.com/stef/pywik","last_synced_at":"2025-04-16T00:11:28.132Z","repository":{"id":3387771,"uuid":"4436288","full_name":"stef/pywik","owner":"stef","description":"webserver log analyzer","archived":false,"fork":false,"pushed_at":"2012-12-01T16:37:16.000Z","size":236,"stargazers_count":6,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-16T00:11:24.190Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/stef.png","metadata":{"files":{"readme":"README.org","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2012-05-24T18:27:06.000Z","updated_at":"2023-10-27T18:21:42.000Z","dependencies_parsed_at":"2022-09-05T14:51:17.709Z","dependency_job_id":null,"html_url":"https://github.com/stef/pywik","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stef%2Fpywik","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stef%2Fpywik/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stef%2Fpywik/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/stef%2Fpywik/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/stef","download_url":"https://codeload.github.com/stef/pywik/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249173086,"owners_count":21224483,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30
T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-01T15:28:28.316Z","updated_at":"2025-04-16T00:11:28.110Z","avatar_url":"https://github.com/stef.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"* pywik\nWelcome to pywik, the static webserver CSV logfile analyzer.\n** Features\n   pywik can filter and list:\n   - Visited Pages,\n   - Search Queries,\n   - Server errors,\n   - unknown files (not in the good set),\n   - bot visits,\n   - 404 pages (malware shows up here),\n   - visits from tor users (and 404s from tor users),\n   - external referers\n** Setup\n   Install the dependencies (you will need MongoDB):\n   #+BEGIN_SRC\n   virtualenv --no-site-packages env\n   source env/bin/activate\n   pip install -r deps.txt\n   #+END_SRC\n   Set up the environment for pywik:\n   #+BEGIN_SRC\n   mkdir logs\n   ./updatelists.sh\n   mkdir data/myhost\n   touch data/myhost/goodpaths\n   touch data/myhost/ignoremissing\n   touch data/myhost/ignorepaths\n   touch data/myhost/ownhosts\n   touch data/myhost/classes\n   #+END_SRC\n** Host specific files\n   pywik uses a few host specific files, which improve the output\n   considerably. Create a directory under data with your hostname as the\n   name and populate the following files accordingly.\n*** ownhosts\n    A list of hostnames that are considered part of your\n    infrastructure. 
Any log entry whose referer is not one of\n    these hosts is considered an external hit.\n*** goodpaths\n    Any path considered a page visit, each line is a regexp.\n*** ignoremissing\n    Any path that is regularly generating 404 responses, each line is a regexp.\n*** ignorepaths\n    Any path that is uninteresting for tracking pageviews, like all\n    requisites for pages (e.g. .css, .js, etc files), each line is a\n    regexp.\n*** rss\n    Each line is a regexp for an rss/atom feed.\n*** classes\n    This file allows you to categorise the entries. The format is the\n    following: Each class starts with its name, then pairwise\n    fieldnames and regexps. Classes are separated with empty lines.\n   #+BEGIN_SRC\n    Users\n    path\n    /user/\\?id=\n\n    Indexed Products\n    path\n    /products/\\?id=\n    http_user_agent\n    .*Googlebot/2\\.1\n   #+END_SRC\n    The above example defines two new classes:\n    - Users: any entries whose path starts with \"/user/?id=\"\n    - Indexed Products: entries whose path starts with \"/products...\" and\n      that are hit by googlebot - notice the double rule, one for the path,\n      the other for the user agent\n** Web-server logformat\n   Set your webserver to use the following logformats, or use:\n   #+BEGIN_SRC\n   ./ncsa2csv.py \u003caccess.log | ./load.py mysite\n   #+END_SRC\n   to convert from NCSA logs to csv format - note however that this is\n   missing some data that the csv based format provides.\n*** Apache\n   For Apache the following should work:\n   #+BEGIN_SRC\n   LogFormat \"%{%Y-%m-%dT%H:%M:%S%z}t;x;%h;0;%v;%R;%s;%I;%O;%D;%{Referer}i;%u;%{User-agent}i;%{X-Forwarded-For}i;x\" csv-http\n   LogFormat \"%{%Y-%m-%dT%H:%M:%S%z}t;x;%h;1;%v;%R;%s;%I;%O;%D;%{Referer}i;%u;%{User-agent}i;%{X-Forwarded-For}i;x\" csv-https\n   #+END_SRC\n*** nginx\n   For nginx the following should work:\n   #+BEGIN_SRC\n   log_format csv-http  '\"$time_local\";$connection;\"$remote_addr\";0;\"$http_host\";\"$request\";'\n    
  '$status;$request_length;$body_bytes_sent;$request_time;\"$http_referer\";\"$remote_user\";'\n      '\"$http_user_agent\";\"$http_x_forwarded_for\";$msec';\n   log_format csv-https '\"$time_local\";$connection;\"$remote_addr\";1;\"$http_host\";\"$request\";'\n      '$status;$request_length;$body_bytes_sent;$request_time;\"$http_referer\";\"$remote_user\";'\n      '\"$http_user_agent\";\"$http_x_forwarded_for\";$msec';\n   #+END_SRC\n   and for your hosts use them for logging:\n   #+BEGIN_SRC\n    access_log /var/log/nginx/access.csv csv-http;\n   #+END_SRC\n   or\n   #+BEGIN_SRC\n    access_log /var/log/nginx/access.csv csv-https;\n   #+END_SRC\n   respectively, for https host stanzas.\n** Running pywik\n   #+BEGIN_SRC\n   ./fetchlogs.sh myhost.net\n   ./pywik.py month myhost | less\n   #+END_SRC\n   If you find anything interesting, you can extract all log entries\n   matching certain fields:\n   #+BEGIN_SRC\n   ./getentries.py logs/access.csv myhost path 'cart.php?a=asdf\u0026templatefile=../../../configuration.php'\n   #+END_SRC\n   Alternatively you can also run pywik as a Flask webapp:\n   #+BEGIN_SRC\n   ./webapp.py\n   #+END_SRC\n   Point your browser at http://localhost:5002/myhost/today\n   and start clicking around.\n** Plugins\n   You can easily extend the functionality of pywik using\n   plugins. Plugins can be\n   - global if you put them into data/plugins\n   - or site-specific if you put them in data/\u003csite\u003e/plugins\n   There are two kinds of plugins:\n   - those that generate queries for filtered listings for output,\n   - and those that enrich the database while parsing the logfile\n   For examples look into data/plugins; **addrapp** and **tor** are\n   good candidates for starting off.\n*** Plugin Initialization\n    Plugins providing an init(ctx) function will be able to\n    initialize themselves. 
The ctx parameter is a dictionary that\n    currently has only one key, 'host'.\n*** query plugins\n    Query plugins implement a queries() function that returns a list of:\n   #+BEGIN_SRC\n    ('title', {'field1': value1, 'field2': value2},['displayfield1', 'displayfield2'])\n   #+END_SRC\n    - Where 'title' is the title to be displayed,\n    - the second elem is a dict containing a mongodb filter expression,\n    - the final elem is a list of fieldnames to be returned by mongo\n      for each matching element\n\n    This can be as simple as:\n\n   #+BEGIN_SRC python\ndef queries():\n    return [('tor', {'tags': ['tor', 'page'], },['path', 'hostname', 'http_user_agent']),\n            ('tor404', {'tags': ['tor'], 'status': 404 },['path', 'hostname', 'http_user_agent'])]\n   #+END_SRC\n*** loader plugins\n    Loader plugins enrich the information in each log entry during\n    database import. A loader plugin implements a process(entry)\n    interface that returns the changed entry.\n\n   #+BEGIN_SRC python\ndef process(entry):\n   if entry['path']=='/foo': entry['foo']='bar'\n   return entry\n   #+END_SRC\n\n   Here's a more advanced example (you can find more in data/plugins)\n   #+BEGIN_SRC python\nfrom load import basepath\nwith open('%s/data/torexits.csv' % basepath,'r') as fp:\n   torexits=[x.strip() for x in fp]\n#print '[tor plugin]', len(torexits), 'torexits loaded'\n\ndef process(entry):\n   if entry['remote_addr'] in torexits:\n      entry['tags'].append('tor')\n   return entry\n   #+END_SRC\n** Bugs\n   Many; reporting them is encouraged, fixing them very welcome.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstef%2Fpywik","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fstef%2Fpywik","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fstef%2Fpywik/lists"}