{"id":21359355,"url":"https://github.com/trendmicro/defplorex","last_synced_at":"2025-07-13T01:31:10.780Z","repository":{"id":143692416,"uuid":"96397145","full_name":"trendmicro/defplorex","owner":"trendmicro","description":"defplorex for BlackHat Arsenal","archived":false,"fork":false,"pushed_at":"2017-07-27T03:14:35.000Z","size":1521,"stargazers_count":112,"open_issues_count":0,"forks_count":30,"subscribers_count":29,"default_branch":"master","last_synced_at":"2024-05-18T11:33:15.619Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/trendmicro.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2017-07-06T06:39:57.000Z","updated_at":"2024-05-13T19:49:20.000Z","dependencies_parsed_at":null,"dependency_job_id":"350b81d2-af07-4bdc-afff-738fe8e34c97","html_url":"https://github.com/trendmicro/defplorex","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trendmicro%2Fdefplorex","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trendmicro%2Fdefplorex/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trendmicro%2Fdefplorex/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/trendmicro%2Fdefplorex/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/trendmicro","download_url":"https://codeload.github.com/trendmicro/defplorex/tar.gz/refs/heads/master","host":{"n
ame":"GitHub","url":"https://github.com","kind":"github","repositories_count":225849897,"owners_count":17534058,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-22T05:27:54.882Z","updated_at":"2024-11-22T05:27:55.485Z","avatar_url":"https://github.com/trendmicro.png","language":"Python","funding_links":[],"categories":["\u003ca id=\"ecb63dfb62722feb6d43a9506515b4e3\"\u003e\u003c/a\u003e新添加"],"sub_categories":[],"readme":"# DefPloreX (Public Release)\nAt [BlackHat USA 2017's Arsenal we've showcased\nDefPloreX](https://www.blackhat.com/us-17/arsenal/schedule/index.html#defplorex-a-machine-learning-toolkit-for-large-scale-ecrime-forensics-8065),\nan Elasticsearch-based toolkit that our team uses for large-scale processing,\nanalysis and visualization of e-crime records. 
In particular, we've\nsuccessfully been applying DefPloreX to the analysis of deface records (e.g., from web compromises);\nhence its name, Def(acement) eXPlorer (DefPloreX).\n\n![DefPloreX Visualization](i/dpx-clusters-viz.png?raw=true \"DefPloreX Visualization\")\n\nDefPloreX automatically organizes deface records by web pages' content and format (what we call \"template pages\").\nThis allows an analyst to easily investigate campaigns,\nfor example by discovering websites targeted by the same campaign or\nattributing one or more actors to the same hacking group.\nAll of this comes without sacrificing the interactivity of the investigation.\n\n![Overview of DefPloreX](i/dpx-overall.png?raw=true \"Overview of DefPloreX\")\n\nThe full version of DefPloreX includes:\n\n  * A thin wrapper to interact with an Elasticsearch backend (included in this release)\n  * A distributed data-processing pipeline based on Celery (example included in this release)\n  * An analysis component to extract information from deface web pages\n  * A feature-extraction component to produce a compact, numerical and categorical representation of each web page\n  * A statistical machine-learning component to automatically find groups of similar web pages\n\nThe input to DefPloreX is a feed of URLs describing the deface web pages,\nincluding metadata such as the (declared) attacker name, timestamp, reason\nfor hacking that page, and so on. Separately, we also have a mirror of the\nweb pages at the time of compromise.\n\n## Code Release\nThis repository contains the public release of DefPloreX. Technically speaking,\nwe're releasing an example use of the DefPloreX approach to distributed data\nprocessing using Elasticsearch (ES). 
This is not meant to be a ready-to-use,\nplug-n-play solution, but rather a framework that you can reuse, extend and\nimprove to adapt it to your needs.\n\nWhat led us to implement DefPloreX was the need to efficiently\nanalyze a large number of records (pages) for common aspects, recurrent attackers,\nor groups of organized attackers. In other words, a typical e-crime\nforensics task.\n\nHere, the core challenge was to visit and analyze over 13 million web pages,\nparse their source code, analyze their resources (e.g.,\nimages, scripts), extract visual information, store the extracted data in\na database, and query it to answer the typical questions that arise during\na post-mortem investigation. Given its popularity and scalability,\nwe've chosen Elasticsearch as our data storage solution. Since we wanted our\nsolution to be scalable, and given that visiting a web page (with an automated,\nheadless browser) takes at least 5 seconds, the only option was to distribute\nthe workload across several worker machines.\n\n## Distributed Data Processing\n\nNormally, to take full advantage of Elasticsearch's distributed\ndata-processing functionality, you need to resort to\n[scripting](https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-scripting.html).\nAlthough scripting is quite powerful and handy for small data-manipulation\ntasks, it's a bit cumbersome to deploy and handle; in addition, it\nrequires full access to Elasticsearch's client nodes. For example, if you\nneed to process all the documents in an Elastic index (e.g., to enrich them by\ncomputing additional fields), you will have to choose one of the scripting\nlanguages supported by Elastic, write a script, deploy it and run it. Needless\nto say, your script will run within the context of the ES runtime,\nwith all the limitations that this implies. 
For example, should you need to use\nPython, you're forced to use the Jython Java implementation of Python, which is\nnot the same as pure Python. For instance, some of the libraries that you may\nwant to use may not be supported, and so on. In other words, we don't want to depend\non Elastic's scripting subsystem in our work :)\n\nInstead, we take a more \"detached\" approach. We decouple the data-processing\npart, making it independent from the Elasticsearch runtime and architecture,\nand rely on ES exclusively as a data back-end to store, retrieve and\nmodify JSON documents. The coordination of the distributed computation is\ndelegated to a well-known and widely used distributed task queue:\n[Celery](http://www.celeryproject.org/). The friendliness of Celery is\nastonishing: from the programmer's perspective, all it requires is to write\nyour data-processing code as a function, and Celery will take care of\noffloading the (expensive and long-running) computation to one of the available\nworkers.\n\n![DefPloreX distributed data processing via Celery](i/dpx-celery.png?raw=true \"DefPloreX distributed data processing via Celery\")\n\nFor example, if you need to visit a web page with an automated headless browser,\nall you need to do is to wrap your code into a function, let's say `visit_page`,\nand decorate it with `@app.task` to inform Celery that this is a task:\n\n```\n@app.task\ndef visit_page(url):\n    result = long_running_process(url)\n\n    return result\n```\n\nLater on in your code, all you need to do is to call the function (almost) as\nyou would normally do:\n\n```\nvisit_page.delay(url)\n```\n\nThe `.delay()` method indicates that the function call will not execute\nimmediately, but instead will be \"pushed\" into a task list, from which an\navailable worker will pull it and do the work.\n\nOn the other end of the task list, you can launch as many workers as you need,\nby simply keeping the Celery daemon active:\n\n```\n$ celery worker 
--autoscale=64,6\n```\n\nAssuming a 64-core machine, this command keeps at least 6 concurrent processes and scales up\nto 64 when more workload comes in (Celery's `--autoscale` option takes the maximum first, then the minimum). And of course you can add as many workers as\nneeded, from a single computer with a few tens of cores, to a full rack\ndistributed across the globe. In our deployment, we have 5 machines, with\na total of 128 cores. With these modest resources, we were able to visit the\nentire collection of over 13 million web pages in a week. Adding more cores would have\nmade the analysis even faster.\n\n# Document Transformations\nFrom this moment on, we have a solid foundation to efficiently transform JSON\ndocuments stored in the Elastic index. Therefore, we \"encode\" any operation\nthat we need to perform in DefPloreX by means of a few lines of Python code. For\nexample, we often need to \"tag\" JSON documents to mark those that have been\nprocessed. To this end, as exemplified in this repository, we use the\n`TagTransformer` transformation. Like any other transform, this function receives one JSON\ndocument and returns the newly added fields, or the modified fields.\n\n```\nclass TagTransformer(Transformer):\n    \"\"\"\n    Example transform to append tag to a record.\n    \"\"\"\n    _name = 'tag'                   # unique name\n\n    def __call__(self, doc, *args, **kwargs):\n        doc = super(TagTransformer, self).__call__(\n                doc, *args, **kwargs)\n\n        tag = kwargs.get('tag')     # tag that we want to apply to the JSON\n\n        if not tag:\n            log.debug('No tags supplied, skipping')\n            return []\n\n        tags = doc.get('tags', [])  # get the 'tags' field from the existing JSON doc\n\n        if tags:\n            log.debug('Found tags: %s', tags)\n\n        tags.append(tag)            # append the new tag\n        tags = list(set(tags))      # remove duplicates\n\n        log.debug('Updated tags: %s', tags)\n\n        return dict(tags=tags)      # return the enriched 
JSON\n```\n\nThe output of this transformation is automatically handled by our Elasticsearch\nwrapper (see `backend.elastic.ESStorer`) and the\n`transformer.Pipeline` class, which merges the new (partial) document with the\noriginal one and saves it into the ES index. In practice, this is\nperformed in bulk: that is, every worker consumes and processes a given number\nof documents at each round (default is 1000). To summarize: given a query, we\nenqueue all the IDs of the documents that match that query. The queue consumers\nwill pull 1000 IDs at a time, query Elastic for the respective documents,\ntransform them, and push them back to Elastic as update operations.\n\nOther transformations that we have implemented (briefly explained in\nthe following) include, for example, visiting the web pages with an automated,\nheadless browser, extracting information from the visited web pages,\ncalculating numerical features, and so on. Every task is expressed by means of\na subclass of `Transformer`, which takes a document as input and returns the\nenriched or modified fields.\n\n## Extracted Information\nFrom each web page, we were interested in collecting two \"sides\" of the same\nstory: a \"static\" view of the page (e.g., non-interpreted resources, scripts,\ntext) and a \"dynamic\" view of the same page (e.g., rendered page with DOM\nmodifications and so on). Concretely, the full version of DefPloreX can\nextract URLs, e-mail addresses, social-network nicknames and handles, hashtags,\nimages, file metadata, summarized text, and so on. This information captures the\nmain characteristics of a defaced web page.\n\n![Extracted data from each web page](i/dpx-extraction.png?raw=true \"Extracted data from each page\")\n\n## Scalable Data Clustering\nWe approach the problem of finding groups of related deface pages\n(e.g., hacktivism campaigns) as a typical data-mining problem. 
We assume that\nthere are recurring and similar characteristics among these pages that we can\ncapture and use as clustering features. For example, we assume that the same\nattacker will reuse the same web snippets or templates (albeit with minimal variations)\nwithin the same campaign. We capture this and other aspects by extracting\nnumerical and categorical features from the data that we obtained by analyzing\neach page (static and dynamic view). To this end, we express the following\ntask by means of a transform function.\n\nFor example, here's an excerpt of the features that we compute from\neach of our documents:\n\n```\n{\n  \"n_urls\": 135,\n  \"n_object\": 0,\n  \"n_embed\": 0,\n  \"n_telephone\": 8,\n  \"n_email\": 1,\n  \"n_img\": 18,\n  \"n_link\": 0,\n  \"n_sound_urls\": 0,\n  \"n_anchor\": 60,\n  \"n_meta\": 4,\n  \"n_resource\": 0,\n  \"n_iframe\": 0,\n  \"n_script\": 34,\n  \"n_hashtag\": 0,\n  \"n_style\": 9,\n  \"n_twitter\": 1,\n  \"avg_color\": \"#000000\",\n  \"frac_letters_in_title\": 0.6979166666666666,\n  \"frac_punct_in_title\": 0.17708333333333334,\n  \"frac_whitespace_in_title\": 0.0625,\n  \"frac_digits_in_title\": 0.0625\n}\n```\n\n![Feature extraction](i/dpx-features.png?raw=true \"Feature extraction\")\n\nAt this point we could use any clustering algorithm to find groups. However,\nthis would not be the most efficient solution, at least in general, because\nwe would need to compare all pairs of our collection of 13 million records, \ncalculate \"some\" form of distance (e.g., ssdeep), and then start forming groups by\nmeans of such distance.\n\nWe take a different approach, which is approximate but way faster. As a result,\nwe're able to cluster our entire collection of 13 million documents in less than a\nminute, and we dynamically configure the clustering features on demand (i.e., at\neach clustering execution).\n\nIntuitively, we would like to be able to find logical groups of web pages that\nshare \"similar\" feature values. 
Instead of approaching this problem as\na distance-metric calculation task, we use the concept of \"feature binning\" or\n\"feature quantization\". In simple words, we want all the web pages with a \"low\nnumber of URLs\" to fall in the same cluster. At the same time, we want all the\nweb pages with a \"high number of URLs\" to fall in another cluster. And so on,\nfor all the features. In other words, the clustering task becomes a \"group-by\"\ntask, which is natively well supported by all database engines. In the case of\nElastic, it's efficiently implemented in a map-reduce fashion, effectively distributing\nthe workload across all the available nodes.\n\nThe missing piece is how we obtain these \"low, medium, high\" values from the\noriginal, numerical feature values. For instance, is \"42 URLs\" considered low,\nhigh, or medium? To this end, we look at the statistical distribution of each feature,\nand divide its space into intervals according to estimated percentiles. For instance,\nthe values below the 25th percentile are considered low, those between the 25th and 50th percentiles\nare medium, and those between the 50th and 75th percentiles are high. Those above the 75th percentile\nare outliers. This is just an example, of course.\n\n![Feature quantization and clustering](i/dpx-binning.png?raw=true \"Feature quantization and clustering\")\n\nIt turns out that Elasticsearch already supports the calculation of a few\nstatistical metrics, among which we happily found the percentiles. So all we need\nto do is ask Elastic to compute the percentiles of each feature, which is done in a matter\nof a few seconds. 
Then, we store these percentiles\nand use them as thresholds to quantize the numerical features.\n\nFor example, here's an excerpt of four equally-spaced percentiles (from 1%\nto 99%) that we obtained from our collection:\n\n```\n\"features\": {\n\t\"n_style\": [\n\t  0,\n\t  2,\n\t  5,\n\t  10\n\t],\n\t\"n_anchor\": [\n\t  0,\n\t  10,\n\t  34,\n\t  284.78097304328793\n\t],\n\t\"n_urls\": [\n\t  0,\n\t  6.999999999999999,\n\t  19.575392097264313,\n\t  201.65553368092415\n\t],\n\t\"n_hashtag\": [\n\t  0,\n\t  2.2336270350272462,\n\t  5,\n\t  16\n\t],\n\t\"n_script\": [\n\t  0,\n\t  4,\n\t  12,\n\t  45\n\t],\n\t\"n_sound_urls\": [\n\t  0,\n\t  1,\n\t  2.4871283217280453,\n\t  7\n\t],\n...\n}\n```\n\nOverall, for each page, we obtain a vector like the following, which we store in ES.\n\n```\n{\n  \"n_urls\": \"H\",\n  \"n_object\": \"L\",\n  \"n_embed\": \"L\",\n  \"n_telephone\": \"M\",\n  \"n_email\": \"L\",\n  \"n_img\": \"M\",\n  \"n_link\": \"L\",\n  \"n_sound_urls\": \"L\",\n  \"n_anchor\": \"M\",\n  \"n_meta\": \"L\",\n  \"n_resource\": \"L\",\n  \"n_iframe\": \"L\",\n  \"n_script\": \"M\",\n  \"n_hashtag\": \"L\",\n  \"n_style\": \"L\",\n  \"n_twitter\": \"L\",\n  \"avg_color\": \"#000000\",\n  \"frac_letters_in_title\": \"M\",\n  \"frac_punct_in_title\": \"L\",\n  \"frac_whitespace_in_title\": \"L\",\n  \"frac_digits_in_title\": \"L\"\n}\n```\n\nAt this point, the web operator (the analyst) simply chooses the features for data pivoting, and\nruns an Elasticsearch aggregation query, which is natively supported.\n\nIn the remainder of this page you can see some example results.\n\n![Feature quantization and clustering (visualized)](i/dpx-binning-viz.png?raw=true \"Feature quantization and clustering (visualized)\")\n\n![Feature quantization and clustering (visualized)](i/dpx-binned-records-viz.png?raw=true \"Feature quantization and clustering (visualized)\")\n\n# License\n```\nCopyright (c) 2017, Trend Micro Incorporated\nAll rights reserved.\n\nRedistribution and use in source and binary forms, with or without\nmodification, are permitted 
provided that the following conditions are met:\n\n1. Redistributions of source code must retain the above copyright notice,\nthis list of conditions and the following disclaimer.\n2. Redistributions in binary form must reproduce the above copyright notice,\nthis list of conditions and the following disclaimer in the documentation\nand/or other materials provided with the distribution.\n\nTHIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"AS IS\"\nAND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE\nIMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE\nARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE\nLIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR\nCONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF\nSUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS\nINTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN\nCONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)\nARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE\nPOSSIBILITY OF SUCH DAMAGE.\n\nThe views and conclusions contained in the software and documentation are\nthose of the authors and should not be interpreted as representing official\npolicies, either expressed or implied, of the FreeBSD Project.\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftrendmicro%2Fdefplorex","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftrendmicro%2Fdefplorex","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftrendmicro%2Fdefplorex/lists"}