{"id":13937173,"url":"https://github.com/xdevplatform/Gnip-Trend-Detection","last_synced_at":"2025-07-19T23:31:02.906Z","repository":{"id":26427453,"uuid":"29877910","full_name":"xdevplatform/Gnip-Trend-Detection","owner":"xdevplatform","description":"Trend detection algorithms for Twitter time series data","archived":true,"fork":false,"pushed_at":"2017-02-28T17:55:17.000Z","size":12716,"stargazers_count":192,"open_issues_count":3,"forks_count":48,"subscribers_count":18,"default_branch":"master","last_synced_at":"2025-05-14T02:48:05.870Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/xdevplatform.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-01-26T19:37:03.000Z","updated_at":"2024-08-07T02:58:06.000Z","dependencies_parsed_at":"2022-07-07T22:34:36.794Z","dependency_job_id":null,"html_url":"https://github.com/xdevplatform/Gnip-Trend-Detection","commit_stats":null,"previous_names":["xdevplatform/gnip-trend-detection"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/xdevplatform/Gnip-Trend-Detection","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xdevplatform%2FGnip-Trend-Detection","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xdevplatform%2FGnip-Trend-Detection/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xdevplatform%2FGnip-Trend-Detection/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xdevplatform%2FGnip-Trend-Detection/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/
GitHub/owners/xdevplatform","download_url":"https://codeload.github.com/xdevplatform/Gnip-Trend-Detection/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xdevplatform%2FGnip-Trend-Detection/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266041686,"owners_count":23867944,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-07T23:03:21.419Z","updated_at":"2025-07-19T23:31:02.455Z","avatar_url":"https://github.com/xdevplatform.png","language":"Python","readme":"# Introduction\n\nThis repository contains the \"Trend Detection in Social Data\" whitepaper,\nalong with software that implements a variety of models for trend detection.\n\nWe focus on trend detection in social data time series. A time series is\ndefined by the presence of a word, a phrase, a hashtag, a mention, or any\nother characteristic of a social media event that can be counted in a\nseries of time intervals. To do trend detection, we quantify \nthe degree to which each count in the time series is atypical. We refer to\nthis figure of merit by the Greek letter *eta*, and we say that a \ntime series and its associated topic are \"trending\" if the figure of merit\nexceeds a pre-defined threshold denoted by the Greek letter *theta*. \n\n# Whitepaper\n\nThe trends whitepaper source can be found in the `paper` directory, which\nalso includes a subdirectory for figures, `figs`. A PDF version of the \npaper is included but it is not guaranteed to be up-to-date. 
A new version can\nbe generated from the source by running:\n\n`pdflatex paper/trends.tex`\n\nInstallation of `pdflatex` and/or additional .sty files may be required.\n\n# Gnip-Trend-Detection Software\n\n## Input Data\n\nThe input data consists of CSV records, and is expected to contain data for one\nquantity (\"counter\") and one time interval on each line, in the following format:\n\n| interval start time | interval duration in sec. | count | counter name |\n| ------------------- | --------- | ---------- | ------------------- |\n| 2015-01-01 00:03:25.0  | 195 | 201 | TweetCounter |\n| 2015-01-01 00:03:25.0  | 195 | 13 | ReTweetCounter |\n|2015-01-01 00:06:40.0| 195 | 191 | TweetCounter |\n|2015-01-01 00:06:40.0| 195 | 10 | ReTweetCounter |\n\nThe format of the interval start time can be any of the large number of standard\nformats recognized by Python's [dateutil](https://dateutil.readthedocs.io/en/stable/) package. \n\nThe recommended way to produce time series data in the correct format is to use\nthe [Gnip-Analysis-Pipeline](https://github.com/jeffakolb/Gnip-Analysis-Pipeline) package. \nWith this package, you can enrich and aggregate Tweet data from the Gnip APIs.\nYou can find a set of dummy data in `example/example.csv`.\n\n## Installation\n\nThe package can be pip-installed. The 'plotting' extra includes matplotlib,\nand can be ignored if plotting is not important. Note that the examples below\nrequire plotting.\n\n`$ pip install gnip_trend_detection[plotting]` \n\nThe scripts and library in the repository can also be pip-installed locally. \n\n`[REPOSITORY] $ pip install -e .[plotting]`\n\nDepending on your operating system, you may need to set a \n[Matplotlib backend](http://matplotlib.org/faq/usage_faq.html#what-is-a-backend), \noften done in `~/.matplotlib/matplotlibrc`. 
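To make the input format described above concrete, here is a short sketch (not part of this package) that parses records of that form using only the standard library. The timestamp format is hard-coded to match `example/example.csv`; the package itself relies on `dateutil`, which accepts many more formats.

```python
import csv
import io
from datetime import datetime

def parse_records(csv_text):
    """Yield (start_time, duration_sec, count, counter_name) tuples from
    CSV text in the input format described above. This sketch hard-codes
    the timestamp format used in example/example.csv; the package uses
    dateutil, which recognizes many timestamp formats."""
    for row in csv.reader(io.StringIO(csv_text)):
        start, duration, count, name = (field.strip() for field in row)
        yield (datetime.strptime(start, "%Y-%m-%d %H:%M:%S.%f"),
               int(duration), int(count), name)

# Two records in the documented format: one interval, two counters.
sample = ("2015-01-01 00:03:25.0,195,201,TweetCounter\n"
          "2015-01-01 00:03:25.0,195,13,ReTweetCounter\n")
records = list(parse_records(sample))
```

Note that several counters can share the same interval start time; each line is an independent (counter, interval) observation.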
\n\n## Key functionalities\n\nThe software in this package provides three scripts that perform the three main tasks:\n* `trend_rebin.py` - resize the time intervals of the input data\n* `trend_analyze.py` - calculate a figure-of-merit (trend score) at each point\n* `trend_plot.py` - plot the counts and the figure-of-merit \n\nThese scripts act on and deliver CSV data.\n\nA fourth script, `trend_analyze_many.py`, performs these steps sequentially,\nwith re-binning and analysis done in parallel. To manage the (potentially) \nlarge number of time series, this script uses JSON-formatted intermediate \nand final data structures.  \n\nTwo final scripts provide extra analysis information:\n* `trend_detection.py`\n    * return information about time series data sets with trend figures-of-merit that exceed a threshold. \nThis script is intended \nto be used on the analyzed output of the `trend_analyze_many.py` script.\n* `time_series_correlations.py` \n    * calculate a correlation coefficient between\nall pairs of time series in a CSV data set (BUGS BE HERE).\n\n## Configuration\n\nAll the scripts mentioned in the previous sections assume the presence of a configuration\nfile. By default, its name is `config.cfg`. You can find a template at `config.cfg.example`.\nA few parameters can be set with command-line arguments. Use the scripts' `-h` option\nfor more details.\n\n## Example\n\nA full example has been provided in the `example` directory. In it, you will find\nformatted time series data for mentions of the \"#scotus\" hashtag in August-September 2014.\nThis file is `example/example.csv`. In the same directory, there is a configuration file, \nwhich specifies what the software will do, including the size of the final time buckets \nand the trend detection technique and parameter values. This example assumes that you\nhave installed the package, but are working from the repo directory. 
The example will run\nfrom any location, but the paths to input and configuration files would have to change. \n\nThe first step is to use the rebin script to get appropriately and evenly sized time buckets.\nLet's use 2-hour buckets and put the output back in the example directory. \n\n`cat example/example.csv | trend_rebin.py -c example/config.cfg \u003e example/scotus_rebinned.csv` \n\nNext, we will run the analysis script on the re-binned data.\nRemember, all the modeling specification is in the config file.\n\n`cat example/scotus_rebinned.csv | trend_analyze.py -c example/config.cfg \u003e example/scotus_analyzed.csv`\n\nTo view the results, run the plotting script on the analyzed output:\n\n`cat example/scotus_analyzed.csv | trend_plot.py -c example/config.cfg` \n\nThe configuration specifies that the output PNG should be in the example directory.\nIt will look like:\n\n![scotus](https://github.com/jeffakolb/Gnip-Trend-Detection/blob/master/example/scotus.png?raw=true) \n\nThis analysis is based on a point-by-point Poisson model, with the previous point \ndefining the expectation for the current point. You must still choose the cutoff value of *eta* (the threshold *theta*)\nthat defines the presence of a trend. It is clear that, if you wish to flag the large\nspike as a trend, almost any choice for *theta* will lead to lots of false positives.\n\nA more robust background model can be used by changing the `mode` parameter in the `Poisson_model`\nsection of `example/config.cfg` from `lc` (last count) to `a` (average). The `period_list`\nparameter determines the time interval over which the average is taken.  \n\nThe output PNG for this model should look like:\n\n![scotus](https://github.com/jeffakolb/Gnip-Trend-Detection/blob/master/example/scotus_averaged.png?raw=true) \n\nThere is less noise in this result, but we can do better. 
Choose the data-derived template method\nin `example/config.cfg` by uncommenting `model_name=WeightedDataTemplates`. In this model, *eta* quantifies the\nextent to which the test series looks more like a set of known trending time series than like a set of\ntime series known _not_ to be trending. \n\nThe output PNG for this model should look like:\n\n![scotus](https://github.com/jeffakolb/Gnip-Trend-Detection/blob/master/example/scotus_data.png?raw=true) \n\nIn this result, there is virtually no noise, but the *eta* curve lags the data because of the data\nsmoothing procedure. Nevertheless, this model provides the most robust performance, at the cost\nof additional complexity and CPU time. The ROC curve for this model looks like:\n\n![roc](https://github.com/jeffakolb/Gnip-Trend-Detection/blob/master/example/roc.png?raw=true)  \n\nThe previous methods focus on identifying sudden increases, or spikes, in the time series.\nTo identify trends characterized by constant growth over time, you can use \na linear regression. Choose the `LinearRegressionModel` in the config file,\nand the output PNG should look like:\n\n![scotus](https://github.com/jeffakolb/Gnip-Trend-Detection/blob/master/example/scotus_linear.png?raw=true)  \n\n\n## Analysis Model Details\n\nThe various trend detection techniques are implemented as classes in `gnip_trend_detection/models.py`.\nThe idea is for each model to be updated point-by-point with the time series data,\nand to store internally whatever data is needed to calculate the figure of merit for\nthe latest point.\n\nEach class must define:\n\n*  a constructor that accepts one argument, which is a dictionary containing \nconfiguration name/value pairs. \n*  an `update` method that accepts at least a keyword argument \"counts\",\nrepresenting the latest data point to be analyzed. No return value.\n*  a `get_results` method, which takes no arguments and returns\nthe figure of merit for the most recent update. 
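As an illustration of this interface, here is a minimal, hypothetical model class. It is not one of the classes in `gnip_trend_detection/models.py`: in the spirit of the `lc` Poisson mode above, it treats the previous count as a Poisson expectation and reports *eta* as the Poisson CDF at the latest count, so *eta* approaches 1 when a count is surprisingly large. The `min_count` parameter is invented for this sketch to avoid a zero expectation.

```python
import math

class LastCountPoissonModel:
    """Toy model satisfying the interface described above: a constructor
    taking a config dict, an update(counts=...) method with no return
    value, and get_results() returning the latest figure of merit."""

    def __init__(self, config):
        # 'min_count' is a hypothetical config parameter for this sketch.
        self.min_count = float(config.get("min_count", 1))
        self.last_count = None
        self.eta = 0.0

    def update(self, counts=0, **kwargs):
        if self.last_count is not None:
            mu = max(self.last_count, self.min_count)
            # eta = P(X <= counts) for X ~ Poisson(mu), summed term by term.
            k, term, cdf = 0, math.exp(-mu), 0.0
            while k <= counts:
                cdf += term
                k += 1
                term *= mu / k
            self.eta = cdf
        self.last_count = counts

    def get_results(self):
        return self.eta

# Drive the model point-by-point; the final count is a clear spike.
model = LastCountPoissonModel({})
for c in [10, 11, 9, 60]:
    model.update(counts=c)
eta = model.get_results()
```

A real model would use a smoothed or averaged background rather than the single previous point, but the update/get_results flow is the same.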
\n","funding_links":[],"categories":["Python"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxdevplatform%2FGnip-Trend-Detection","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fxdevplatform%2FGnip-Trend-Detection","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxdevplatform%2FGnip-Trend-Detection/lists"}