{"id":20066436,"url":"https://github.com/apparebit/analog","last_synced_at":"2025-10-11T10:11:58.122Z","repository":{"id":88952804,"uuid":"526020954","full_name":"apparebit/analog","owner":"apparebit","description":"ana(lyze) log(s)","archived":false,"fork":false,"pushed_at":"2025-01-26T05:40:43.000Z","size":728,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-01-26T06:19:53.708Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/apparebit.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-08-18T02:07:53.000Z","updated_at":"2025-01-26T05:40:46.000Z","dependencies_parsed_at":"2025-01-26T06:19:27.056Z","dependency_job_id":"c0ca0dfb-123a-4dda-aab6-73cee3af6f13","html_url":"https://github.com/apparebit/analog","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apparebit%2Fanalog","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apparebit%2Fanalog/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apparebit%2Fanalog/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apparebit%2Fanalog/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/apparebit","download_url":"https://codeload.github.com/apparebit/analog/tar.gz/refs/heads/master","ho
st":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241494184,"owners_count":19971871,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-13T13:58:21.056Z","updated_at":"2025-10-11T10:11:53.087Z","avatar_url":"https://github.com/apparebit.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Ana(lyze) Log(s)\n\nA modern approach to analyzing webserver access logs!\n\n  * Keep on reading for a detailed, top-down description of analog's features.\n  * Peruse [this\n    notebook](https://github.com/apparebit/analog/blob/master/docs/hands-on.ipynb)\n    for a hands-on introduction using [my website's](https://apparebit.com) logs\n    as example.\n  * Consult [this\n    grammar](https://github.com/apparebit/analog/blob/master/docs/grammar.md)\n    for the concise summary of analog's fluent interface.\n\n\n## Overview\n\nAnalog builds on two technologies that have become ubiquitous when it comes to\ndata processing:\n\n  * [Notebooks](https://jupyter.org), which provide an effective graphical\n    read-eval-print-loop (REPL);\n  * [Pandas](https://pandas.pydata.org), which handles the low-level aspects\n    of data wrangling with its *dataframe* abstraction.\n\nAnalog then adds:\n\n  * Parsing and enriching the raw, textual access logs;\n  * File management to automatically ingest monthly logs and combine them\n    into a single dataframe;\n  * A convenient, fluent interface that makes common analysis tasks easy,\n    while seamlessly falling back onto Pandas for more complex tasks.\n\n\n## 
Motivation\n\nMany websites have switched to client analytics as a service. While certainly\nconvenient and often free, these services also have a terrible track record when\nit comes to privacy and hence are entirely exploitative of website visitors.\nEven when they are self-hosted, the necessary client code adds unnecessary bloat\nto webpages. It also is far from guaranteed to produce meaningful results\nbecause, by the time the code might run, users have already moved on or because\nthey have blocked the client code or JavaScript altogether.\n\nBeing more respectful of website visitors and hence removing invasive client\nanalytics is easy enough. But we'd still like to have *some* insight into how\nvisitors use our websites. Well, there are server access logs! Alas, in most\nenterprises, those logs feed into larger log analytics and monitoring solutions,\nwhich are overkill for an individual or small business using shared hosting.\nThen there are the ancient [AWStats](https://awstats.sourceforge.io) and\n[Webalizer](https://webalizer.net), typically included with the equally ancient\ncPanel. Finally, there is the actively maintained\n[GoAccess](https://goaccess.io). While pretty nifty, even that tool shows its\nage: It's written in C and not exactly designed for extensibility or answering\nad-hoc queries.\n\n\n## Analog\n\nAnalog relies on [notebooks](https://jupyter.org) for graphical REPL and\n[Pandas](https://pandas.pydata.org) for low-level data wrangling. It then\nadds a convenient, fluent interface that makes common analysis tasks easy.\nIt also manages monthly log files, parsing and enriching the raw access\nlogs as needed and automatically combining them into a single dataframe.\n\n\n### Storage Management\n\nAnalog stores all data for a website in a dedicated directory. 
It uses three\nsubdirectories:\n\n  * `access-logs` stores monthly access logs in files named like\n    `apparebit.com-Aug-2022.gz`.\n\n  * `enriched-logs` stores parsed and enriched monthly logs as\n    [Parquet](https://parquet.apache.org) files named like\n    `apparebit.com-2022-08.parquet`.\n\n  * `location-db` stores IP location databases in GeoLite2 format named like\n    `city-2022-07-26.mmdb`. Analog uses the most recent one.\n\nAnalog creates three files in its data directory:\n\n  * The combined dataframe, again in Parquet format, is named like\n    `apparebit.com-2018-07-2022-07.parquet`.\n\n  * The metadata sidecar file in JSON format has the same name but with a\n    `.json` file extension.\n\n  * `hostnames.json` caches previous DNS lookups of IP addresses, which are by\n    far the slowest part of ingesting raw access logs.\n\nWhen running analog from the command line or invoking `analog.latest()`, analog\nfirst ingests raw monthly logs that have no corresponding enriched log files.\nThen, if there is no combined log covering all monthly log files or one of those\nfiles was just updated, analog creates a new combined log and its metadata\nsidecar file.\n\nWhen using the `--clean` command line option or invoking `analog.latest()` with\na truthy `clean` keyword argument, analog starts by deleting all monthly log\nfiles stored in `enriched-logs`, which causes both monthly and combined log\nfiles to be re-generated. You can also delete these files manually. But\n*please*, do *not* delete `access-logs` or `hostnames.json`.\n\n\n### Log Schema\n\nAnalog combines properties parsed from the raw access logs, derived from the\noriginal data, and derived from external databases for domain names, IP\nlocations, and user agents. The `SCHEMA` mapping in the\n[`analog.schema`](https://github.com/apparebit/analog/blob/master/analog/schema.py)\nmodule defines the Pandas schema for the resulting dataframes. 
It makes use of\nseveral enumerations defined in the\n[`analog.label`](https://github.com/apparebit/analog/blob/master/analog/label.py)\nmodule.\n\nNote that analog uses *two* independent databases of user agents to detect bots\n— [matomo](https://matomo.org) and [ua-parser](https://github.com/ua-parser).\nEach project detects a good number of bots not detected by the other. Hence,\nanalog's `only.bots()` and `only.humans()` filters take both into account.\nanalog also fixes a minor misclassification made by ua-parser.\n\nAs of July 13, 2023, the latest version of the `ua-parser` package is 0.18.0. It\nwas released five days before, on July 8, 2023. Since that package saw only two\nupdates between 2018 and 2022, I did use a forked version, `ua-parser-up2date`.\nIts latest version is 0.16.1, which was released on December 16, 2022. Looking\nat the two packages' update histories for the last couple of years, the original\n`ua-parser` seems preferable again.\n\n\n### Fluent Grammar\n\nAnalog's fluent interface makes use of computed properties as well as methods.\nProperties typically distinguish between different types of clauses whereas\nmethods terminate the clauses. In the grammar below, property and method names\nare double quoted. The attribute selector's period is written as `\u003cdot\u003e` and\nmethods are followed by `()`, with parameters listed in between.\n\nThe following grammar summarizes the fluent interface. At the top-level, a\n***sentence*** consists of terms to specify (1) selection, (2) grouping and\naggregation, as well as (3) display:\n\n    sentence -\u003e selection grouping-and-aggregation display\n\nThe ***selection*** extracts rows that meet certain criteria. 
It distinguishes\nbetween three kinds of criteria, namely (1) terms that start with the `.only`\nproperty and filter based on attributes of the HTTP protocol, (2) terms that\nstart with the `.over` property and filter based on datetime, and (3) terms that\ninvoke `.select()` or `.map()` and thus serve as extension points. You can track\nthe impact of these filters with the `.count_rows()` method, which appends the\nnumber of rows to the context's list inside a `with analog.fresh_counts()`\nblock. It is an error to call this method outside such a block. Square brackets\ncontaining a slice select rows by their numbers.\n\n    selection -\u003e\n        | \u003cdot\u003e \"only\" \u003cdot\u003e protocol  selection\n        | \u003cdot\u003e \"over\" \u003cdot\u003e datetime  selection\n        | \u003cdot\u003e \"filter\" (predicate)   selection\n        | \u003cdot\u003e \"map\" (mapper)         selection\n        | \u003cdot\u003e \"count_rows\" ()        selection\n        | [ \u003cslice\u003e ]                  selection\n        | 𝜀\n\nThe ***protocol*** criterion contains several convenience methods that filter\ncommon protocol values. The `.has()` method is more general and can filter on\nthe `content_type`, `method`, `protocol`, and `status_class` columns. Since the\nvarious enumeration constants defined in [the `label`\nmodule](https://github.com/apparebit/analog/blob/master/analog/label.py)\nuniquely identify the column, there is no need to also specify the column\nname. In contrast, the `.equals()` method generalizes `.has()` for columns that\ndo not have a categorical type and therefore requires the column name. 
Finally,\nthe `.contains()` method implements a common operation on string-valued data.\n\n    protocol -\u003e\n        | \"bots\" ()\n        | \"humans\" ()\n        | \"GET\" ()\n        | \"POST\" ()\n        | ...\n        | \"markup\" ()\n        | ...\n        | \"successful\" ()\n        | \"redirection\" ()\n        | \"client_error\" ()\n        | \"server_error\" ()\n        |\n        | \"not_found\" ()\n        | \"equals\" (column, value)\n        | \"one_of\" (column, value, value, ...)\n        | \"contains\" (column, value)\n\nThe `.bots()` and `.humans()` methods categorize requests based on the `is_bot`\nand `is_bot2` properties. They concisely capture two different third-party\nclassifications of the user agent header. Also see the [hands-on\nnotebook](https://github.com/apparebit/analog/blob/master/workbook.ipynb).\n\nIn contrast to Pandas' expressive and complex operations on times and dates,\nanalog's ***datetime*** criterion is much simpler and more limited. It selects\neither the day, week, or year ending with the last entry in the log or an\narbitrary range specified by two Python datetimes or Pandas timestamps. If your\nanalysis focuses on calendar months, you may find that the `monthly_slice()` and\n`monthly_range()` functions in [the `month_in_year`\nmodule](https://github.com/apparebit/analog/blob/master/analog/month_in_year.py)\ncome in handy. Note that all datetimes and timestamps must have a valid\ntimezone. It defaults to UTC in analog's own code.\n\n    datetime -\u003e\n        | \"last_day\" ()\n        | \"last_week\" ()\n        | \"last_year\" ()\n        | \"range\" (begin, end_inclusive)\n\n**About extensibility**: Analog is designed to make common log analysis\nsteps simple and thereby reduce the barrier to entry when using Pandas for log\nanalysis. 
But for implementing uncommon analysis steps, you still need to use\nPandas. In particular, you access the wrapped Pandas dataframe or series through\nthe `.data` property.\n\nSince unwrapping a dataframe, invoking a Pandas method, and then rewrapping the\nresult is a bit tedious, analog has two extension methods that apply an\narbitrary callable to the wrapped dataframe while also wrapping the result. The\n`.select()` method takes a predicate producing a boolean series and the `.map()`\nmethod takes a transformation producing another dataframe.\n\nThere are three options for ***grouping and aggregation***: a rate and metric,\njust a metric by itself, or an explicit bypass of metrics with the `.just`\nproperty. Requiring explicit bypass arguably is less elegant than just omitting\nunnecessary clauses. But it also keeps the implementation simpler and hence won\nout.\n\n    grouping-and-aggregation -\u003e\n        | rate \u003cdot\u003e metric\n        | \u003cdot\u003e metric\n        | \u003cdot\u003e \"just\"\n\nA ***rate*** is indicated by the `.monthly` property. So far, I haven't seen the\nneed to add more options.\n\n    rate -\u003e \u003cdot\u003e \"monthly\"\n\nCurrently supported ***metrics*** are (1) the number of requests, (2) the value\ncounts for a given column, and (3) the unique values for a given column. The\n`.status_classes()` and `.content_types()` methods are convenient aliases for\nspecific value counts. The `.unique_values()` method makes little sense as a rate\nand hence is only supported without a preceding `.monthly` property.\n\n    metric -\u003e\n        | \u003cdot\u003e \"requests\" ()\n        | \u003cdot\u003e \"content_types\" ()\n        | \u003cdot\u003e \"status_classes\" ()\n        | \u003cdot\u003e \"value_counts\" (column)\n        | \u003cdot\u003e \"unique_values\" (column)\n\n**About result types**: The result of a selection always is another wrapped\nPandas dataframe. 
However, if the grouping and aggregation is just a metric\n*without* rate, the result of `.requests()` is an integer value that terminates\nthe fluent expression. Other metrics *without* rate such as `.value_counts()`\nand `.unique_values()` produce a wrapped Pandas *series*. If the grouping and\naggregation *includes the rate*, the result of `.requests()` is a wrapped Pandas\n*series*. Other metrics *with* rate produce a wrapped Pandas dataframe.\n\nThe ***display*** formats, prints, or plots the data. The `.format()` method\nconverts the wrapped series or dataframe into lines of text. It terminates the\nfluent sentence to return the result. `.count_rows()` appends the number of rows\nto the context inside a `with analog.fresh_counts()` block, whereas square\nbrackets containing a slice pick rows by their numbers. `.show()` displays the\ndata as text and  `.plot()` as a graph.\n\n    display -\u003e\n        | \u003cdot\u003e \"format\" ()\n        | \u003cdot\u003e \"count_rows\" ()        display\n        | [ \u003cslice\u003e ]                  display\n        | \u003cdot\u003e \"show\" (rows = None)   display\n        | \u003cdot\u003e \"plot\" (**kwargs)      display\n        | \u003cdot\u003e \"also\" ()              sentence\n        | \u003cdot\u003e \"done\" ()\n        | 𝜀\n\nFinally, `.also()` starts another sentence, as long as the wrapped data is a\ndataframe, and `.done()` terminates the sentence. Since it returns `None`, the\nlatter method suppresses the display of a series or dataframe in Jupyter\nnotebooks.\n\n\n### Fluent Implementation\n\nThe implementation generally follows the grammar. A class implementing a clause\ntypically has the same name as the corresponding nonterminal, though the name is\nCamelCased and prefixed with `Fluent`. All classes representing nonterminals\ninherit from the same abstract base class `FluentTerm`, which holds the wrapped\nstate and provides convenient, private methods for creating new subclass\ninstances. 
Since, as described above, some statistics result in series instead\nof dataframes, that base class and `FluentDisplay` are generic.\n\n\n#### *Cool Features*\n\nThree features of the implementation [stand out, especially in a\nnotebook](https://github.com/apparebit/analog/blob/master/docs/hands-on.ipynb):\n\n  * Wrapped series and dataframes display as HTML tables in Jupyter, when\n    invoking `.show()` and when becoming a cell's value.\n  * When the fluent grammar generates new wrapped series, it makes sure that the\n    series have meaningful index and data names.\n  * Wrapped series and dataframes support slicing by row numbers, so you can\n    throttle the amount of data displayed in a notebook or interactive shell,\n    even when relying on the notebook to display it.\n\n\n#### *Entry Points*\n\nThe main entry point for fluent analysis is:\n\n    def analyze(frame: pd.DataFrame) -\u003e FluentSentence: ...\n\nIt returns an instance of `FluentSentence`. A second function recombines several\nwrapped or unwrapped series into a dataframe again, notably for plotting:\n\n    def merge(\n      *series: FluentTerm[pd.Series] | pd.Series,\n      **named_series: FluentTerm[pd.Series] | pd.Series,\n    ) -\u003e FluentSentence: ...\n\nThe function returns a wrapped dataframe that combines all series given as\narguments. For series passed with keyword arguments, it also renames the series\nto the keywords.\n\nThe `count_rows()` method supported by `FluentSentence` and `FluentDisplay`\nrequires a context that provides a list for those counts. You create the context\nthrough a `with fresh_counts() as counts` statement.\n\nHappy, happy, joy, joy! 
😎\n\n---\n\n© 2022 [Robert Grimm](https://apparebit.com).\n[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) license.\n[GitHub](https://github.com/apparebit/analog).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapparebit%2Fanalog","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fapparebit%2Fanalog","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapparebit%2Fanalog/lists"}