# Spindle

**Spindle is [Brandon Amos'](http://github.com/bamos)
2014 summer internship project with Adobe Research
and is not under active development.**

---

![](https://github.com/adobe-research/spindle/raw/master/images/architecture.png)

Analytics platforms such as [Adobe Analytics][adobe-analytics]
are growing to process petabytes of data in real-time.
Delivering responsive interfaces that query this amount of data is difficult,
and there are many distributed data processing technologies such
as [Hadoop MapReduce][mapreduce], [Apache Spark][spark],
[Apache Drill][drill], and [Cloudera Impala][impala]
for building low-latency query systems.

Spark is part of the [Apache Software Foundation][apache]
and claims speedups of up to 100x over Hadoop for in-memory
processing.
Spark is shifting from a research project to a production-ready library,
and academic publications and presentations from
the [2014 Spark Summit][2014-spark-summit]
archive several use cases of Spark and related technologies.
For example,
[NBC Universal][nbc] presents their use of Spark to query [HBase][hbase]
tables and analyze an international cable TV video distribution [here][nbc-pres].
Telefonica presents their use of
Spark with [Cassandra][cassandra]
for cyber security analytics [here][telefonica-pres].
[ADAM][adam] is an open source data storage format and processing
pipeline for genomics data built on Spark and [Parquet][parquet].

Even though people are publishing use cases of Spark,
few have published
experiences of building and tuning production-ready Spark systems.
Thorough knowledge of Spark internals
and of libraries that interoperate well with Spark is necessary
to achieve optimal performance from Spark applications.

**Spindle is a prototype Spark-based web analytics query engine designed
around the requirements of production workloads.**
Spindle exposes query requests through a multi-threaded
HTTP interface implemented with [Spray][spray].
Queries are processed by loading data from the [Apache Parquet][parquet]
columnar storage format on the
[Hadoop distributed filesystem][hdfs].

This repo contains the Spindle implementation and benchmarking scripts
to observe Spindle's performance while exploring Spark's tuning options.
Spindle's goal is to process petabytes of data on thousands of nodes,
but the current implementation has not yet been tested at this scale.
Our current experimental results use six nodes,
each with 24 cores and 21GB of Spark memory, to query 13.1GB of analytics data.
The trends show that further Spark tuning and optimizations should
be investigated before attempting larger scale deployments.

# Demo
We used Spindle to generate static webpages that are hosted
[here][demo].
Unfortunately, the demo is only for illustrative purposes and
is not running Spindle in real-time.

![](https://github.com/adobe-research/spindle/raw/master/images/top-pages-by-browser.png)
![](https://github.com/adobe-research/spindle/raw/master/images/adhoc.png)

[Grunt][grunt] is used to deploy `demo` to [Github pages][ghp]
in the [gh-pages][ghp] branch with the [grunt-build-control][gbc] plugin.
The [npm][npm] dependencies are managed in [package.json][pjson]
and can be installed with `npm install`.

# Loading Sample Data
The `load-sample-data` directory contains a Scala program
to load the following sample data into [HDFS][hdfs],
modeled after
[adobe-research/spark-parquet-thrift-example][spark-parquet-thrift-example].
See [adobe-research/spark-parquet-thrift-example][spark-parquet-thrift-example]
for more information on running this application
with [adobe-research/spark-cluster-deployment][spark-cluster-deployment].

### hdfs://hdfs_server_address:8020/spindle-sample-data/2014-08-14
| post_pagename | user_agent | visit_referrer | post_visid_high | post_visid_low | visit_num | hit_time_gmt | post_purchaseid | post_product_list | first_hit_referrer |
|---|---|---|---|---|---|---|---|---|---|
| Page A | Chrome | http://facebook.com | 111 | 111 | 1 | 1408007374 | | | http://google.com
| Page B | Chrome | http://facebook.com | 111 | 111 | 1 | 1408007377 | | | http://google.com
| Page C | Chrome | http://facebook.com | 111 | 111 | 1 | 1408007380 | purchase1 | ;ProductID1;1;40;,;ProductID2;1;20; | http://google.com
| Page B | Chrome | http://google.com | 222 | 222 | 1 | 1408007379 | | | http://google.com
| Page C | Chrome | http://google.com | 222 | 222 | 1 | 1408007381 | | | http://google.com
| Page A | Firefox | http://google.com | 222 | 222 | 1 | 1408007382 | | | http://google.com
| Page A | Safari | http://google.com | 333 | 333 | 1 | 1408007383 | | | http://facebook.com
| Page B | Safari | http://google.com | 333 | 333 | 1 | 1408007386 | | | http://facebook.com

### hdfs://hdfs_server_address:8020/spindle-sample-data/2014-08-15
| post_pagename | user_agent | visit_referrer | post_visid_high | post_visid_low | visit_num | hit_time_gmt | post_purchaseid | post_product_list | first_hit_referrer |
|---|---|---|---|---|---|---|---|---|---|
| Page A | Chrome | http://facebook.com | 111 | 111 | 1 | 1408097374 | | | http://google.com
| Page B | Chrome | http://facebook.com | 111 | 111 | 1 | 1408097377 | | | http://google.com
| Page C | Chrome | http://facebook.com | 111 | 111 | 1 | 1408097380 | purchase1 | ;ProductID1;1;60;,;ProductID2;1;100; | http://google.com
| Page B | Chrome | http://google.com | 222 | 222 | 1 | 1408097379 | | | http://google.com
| Page A | Safari | http://google.com | 333 | 333 | 1 | 1408097383 | | | http://facebook.com
| Page B | Safari | http://google.com | 333 | 333 | 1 | 1408097386 | | | http://facebook.com

### hdfs://hdfs_server_address:8020/spindle-sample-data/2014-08-16
| post_pagename | user_agent | visit_referrer | post_visid_high | post_visid_low | visit_num | hit_time_gmt | post_purchaseid | post_product_list | first_hit_referrer |
|---|---|---|---|---|---|---|---|---|---|
| Page A | Chrome | http://facebook.com | 111 | 111 | 1 | 1408187380 | purchase1 | ;ProductID1;1;60;,;ProductID2;1;100; | http://google.com
| Page B | Chrome | http://facebook.com | 111 | 111 | 1 | 1408187380 | purchase1 | ;ProductID1;1;200; | http://google.com
| Page D | Chrome | http://google.com | 222 | 222 | 1 | 1408187379 | | | http://google.com
| Page A | Safari | http://google.com | 333 | 333 | 1 | 1408187383 | | | http://facebook.com
| Page B | Safari | http://google.com | 333 | 333 | 1 | 1408187386 | | | http://facebook.com
| Page C | Safari | http://google.com | 333 | 333 | 1 | 1408187388 | | | http://facebook.com
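The directories above are partitioned by day, so a query's date range maps directly onto a list of HDFS paths. A minimal sketch of that mapping (our illustration; `sampleDataPaths` is not a function in the Spindle sources):

```scala
import java.time.LocalDate

// Hypothetical helper: expand an inclusive date range into the
// day-partitioned HDFS directories shown above.
def sampleDataPaths(start: LocalDate, end: LocalDate): Seq[String] = {
  val base = "hdfs://hdfs_server_address:8020/spindle-sample-data"
  Iterator.iterate(start)(_.plusDays(1))
    .takeWhile(!_.isAfter(end))
    .map(d => s"$base/$d") // LocalDate.toString prints YYYY-MM-DD
    .toSeq
}
```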
# Queries.
Spindle includes eight queries that are representative of
the data sets and computations of real queries the
Adobe Marketing Cloud processes.
All collect statements refer to the combined filter and map operation,
not the operation that gathers an RDD as a local Scala object.

+ *Q0* (**Pageviews**)
  is a breakdown of the number of pages viewed
  each day in the specified range.
+ *Q1* (**Revenue**) is the overall revenue for each day in
  the specified range.
+ *Q2* (**RevenueFromTopReferringDomains**) obtains the top referring
  domains for each visit and breaks down the revenue by day.
  The `visit_referrer` field is preprocessed into each record in
  the raw data.
+ *Q3* (**RevenueFromTopReferringDomainsFirstVisitGoogle**) is
  the same as RevenueFromTopReferringDomains, but restricted to
  visitors whose absolute first referrer is Google.
  The `first_hit_referrer` field is preprocessed into each record in
  the raw data.
+ *Q4* (**TopPages**) is a breakdown of the top pages for the
  entire date range, not per day.
+ *Q5* (**TopPagesByBrowser**) is a breakdown of the browsers
  used for TopPages.
+ *Q6* (**TopPagesByPreviousTopPages**) breaks down the top previous
  pages a visitor was at for TopPages.
+ *Q7* (**TopReferringDomains**) is the top referring domains for
  the entire date range, not per day.

The following table shows the columnar subset
each query utilizes.

![](https://github.com/adobe-research/spindle/raw/master/images/columns-needed.png)

The following table shows the operations each query performs
and is intended as a summary rather than a full description of
the implementations.
The bold text indicates operations in which the target
partition size is specified, which is further described in the
"Partitioning" section below.

![](https://github.com/adobe-research/spindle/raw/master/images/query-operations.png)
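For the revenue queries (Q1-Q3), revenue is carried in `post_product_list` entries such as `;ProductID1;1;40;,;ProductID2;1;20;` from the sample data above. A minimal parsing sketch, assuming each comma-separated entry has the shape `;productID;quantity;revenue;` (our reading of the sample data, not code from the Spindle sources):

```scala
// Hypothetical parser for the post_product_list format seen in the
// sample data: comma-separated entries of ";productID;quantity;revenue;".
def parseRevenue(postProductList: String): Double =
  if (postProductList.trim.isEmpty) 0.0
  else postProductList.split(",").map { entry =>
    val fields = entry.split(";").filter(_.nonEmpty)
    fields(2).toDouble // third field: revenue for this product
  }.sum
```

On the 2014-08-14 sample row with purchase `purchase1`, this sums 40 and 20 to 60.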
# Spindle Architecture
The query engine provides a request and response interface to
interact with the application layer, and Spindle's goal is to
benchmark a realistic low-latency web analytics query engine.

Spindle serves query requests and reports over HTTP with the
[Spray][spray] library, which is multi-threaded and provides a
REST/HTTP-based integration layer in Scala for queries and parameters,
as illustrated in the figure below.

![](https://github.com/adobe-research/spindle/raw/master/images/architecture.png)

When a user requests to execute a query over HTTP,
Spray allocates a thread to process the HTTP request and converts
it into a Spray request.
The Spray request follows a route defined in the `QueryService` Actor,
and queries are processed with the `QueryProcessor` singleton object.
The `QueryProcessor` interacts with a global Spark context,
which connects the Scala application to the Spark cluster.

The Spark context supports multi-threading and offers
`FIFO` and `FAIR` scheduling options for concurrent queries.
Spindle uses Spark's `FAIR` scheduling option to minimize overall latency.

## Future Work - Utilizing Spark job servers or resource managers.
Spindle's architecture can likely be improved on larger clusters by
utilizing a job server or resource manager to
maintain a pool of Spark contexts for query execution.
[Ooyala's spark-jobserver][spark-jobserver] provides
a RESTful interface for submitting Spark jobs that Spindle could
interface with instead of interfacing with Spark directly.
[YARN][yarn] can also be used to manage Spark's
resources on a cluster, as described in [this article][spark-yarn-article].

However, allocating resources on the cluster raises additional
questions and engineering work that Spindle can address in future work.
Spindle's current architecture colocates HDFS and Spark workers
on the same nodes, minimizing the network traffic required
to load data.
How much will the performance degrade if the resource manager
allocates some subset of Spark workers that isn't colocated
with any of the HDFS data being accessed?

Furthermore, how would a production-ready caching policy
on a pool of Spark contexts look?
What if many queries are being submitted and executed on
different Spark contexts that use the same data?
Scheduling the queries on the same Spark context and
caching the data between query executions would substantially
improve the performance, but how should the scheduler
be informed of this information?

## Data Format
Adobe Analytics event data has at least 250 columns,
and sometimes significantly more.
Most queries use fewer than 7 columns, and loading all of the
columns into memory to use only 7 is inefficient.
Spindle stores event data in the [Parquet][parquet] columnar format
on the [Hadoop Distributed File System][hdfs] (HDFS) with
[Kryo][kryo] serialization enabled
so that only the subset of columns each query requires is loaded.

[Cassandra][cassandra] is a NoSQL database that we considered
as an alternative to Parquet.
However, Spindle also utilizes [Spark SQL][spark-sql],
which supports Parquet, but not Cassandra.

Parquet can be used with [Avro][avro] or [Thrift][thrift] schemas.
[Matt Massie's article][spark-parquet-avro] provides an example of
using Parquet with Avro.
[adobe-research/spark-parquet-thrift-example][spark-parquet-thrift-example]
is a complete [Scala][scala]/[sbt][sbt] project
using Thrift for data serialization and shows how to load only the
specified columnar subset.
For a more detailed introduction to Thrift,
see [Thrift: The Missing Guide][thrift-guide].

The entire Adobe Analytics schema cannot be published.
The open source release of Spindle uses
[AnalyticsData.thrift][AnalyticsData.thrift],
which contains 10 non-proprietary fields for web analytics.

Columns postprocessed into the data after collection carry the `post_`
prefix; `visit_referrer` and `first_hit_referrer` are also postprocessed.
Visitors are categorized by concatenating the strings
`post_visid_high` and `post_visid_low`.
A visitor has visits, which are numbered by `visit_num`,
and a visit has hits that occur at `hit_time_gmt`.
If the hit is a webpage hit from a browser, the `post_pagename` and
`user_agent` fields are used, and the revenue from a hit
is denoted in `post_purchaseid` and `post_product_list`.

```Thrift
struct AnalyticsData {
  1: string post_pagename;
  2: string user_agent;
  3: string visit_referrer;
  4: string post_visid_high;
  5: string post_visid_low;
  6: string visit_num;
  7: string hit_time_gmt;
  8: string post_purchaseid;
  9: string post_product_list;
  10: string first_hit_referrer;
}
```

The data is separated on disk by day, in directories named `YYYY-MM-DD`.
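The visitor and visit model above can be sketched in plain Scala (an illustrative stand-in for the Thrift-generated class, not the generated code itself):

```scala
// Illustrative analogue of the Thrift-generated AnalyticsData class,
// reduced to the fields used here.
case class Hit(postVisidHigh: String, postVisidLow: String,
               visitNum: String, hitTimeGmt: String)

// Visitors are categorized by concatenating post_visid_high and post_visid_low.
def visitorId(h: Hit): String = h.postVisidHigh + h.postVisidLow

// Group hits into visits: one visit per (visitor, visit_num),
// with hits ordered by hit_time_gmt.
def visits(hits: Seq[Hit]): Map[(String, String), Seq[Hit]] =
  hits.groupBy(h => (visitorId(h), h.visitNum))
    .map { case (key, hs) => key -> hs.sortBy(_.hitTimeGmt.toLong) }
```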
## Caching Data
Spindle provides a caching option that caches the loaded
Spark data in memory between query requests to show the
maximum speedup caching provides.
Caching introduces a number of interesting questions when dealing
with sparse data.
For example, two queries could be submitted on the same date range
that request overlapping, but not identical, column subsets.
How should these data sets with partially overlapping values be
cached in the application?
What if one of the queries is called substantially more often than
the other? How should the caching policy ensure these columns are
not evicted?
We will explore these questions in future work.

## Partitioning
Spark allows operations
such as `distinct`, `reduceByKey`, and `groupByKey` to specify the
minimum number of resulting partitions when distributing data across nodes.

Counting the number of records in an RDD is
expensive, and the optimal number of partitions
for an operation depends highly on the data and operations.
For optimal partitioning, applications should estimate the
number of records to process and ensure the partitions contain
some minimum number of records.

Spindle puts a target number of records in each partition
by estimating the total number of records to be processed
from Parquet's metadata.
However, most queries filter out approximately 50% of the records
in our data before performing the operations that
affect the partitioning.
For example, an empty `post_pagename` field indicates that the
analytics hit is from an event other than a user visiting a page,
and the first Spark operation in TopPages is to obtain only
the page visit hits by filtering out records with empty `post_pagename`
fields.
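The target-partition-size approach described above reduces to simple arithmetic. A reconstruction from the description (the helper name is ours, not from the Spindle sources):

```scala
// Derive a partition count from an estimated record count (taken from
// Parquet's metadata) and a target number of records per partition.
// Integer ceiling division, with a floor of one partition.
def numPartitions(estimatedRecords: Long, targetPerPartition: Long): Int =
  math.max(1L, (estimatedRecords + targetPerPartition - 1) / targetPerPartition).toInt
```

A query that filters out roughly half its records before the partitioned operation could halve `estimatedRecords` before calling this.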
# Installing Spark and HDFS on a cluster.
| ![](https://github.com/adobe-research/spark-cluster-deployment/raw/master/images/initial-deployment-2.png) | ![](https://github.com/adobe-research/spark-cluster-deployment/raw/master/images/application-deployment-1.png) |
|---|---|

Spark 1.0.0 can be deployed to traditional cloud and job management services
such as [EC2][spark-ec2], [Mesos][spark-mesos], or
[Yarn][spark-yarn].
Further, Spark's [standalone cluster][spark-standalone] mode enables
Spark to run on other servers without installing other
job management services.

However, configuring and submitting applications to a Spark 1.0.0 standalone
cluster currently requires files to be synchronized across the entire cluster,
including the Spark installation directory.
These problems have motivated our
[adobe-research/spark-cluster-deployment][spark-cluster-deployment] project,
which utilizes [Fabric][fabric] and [Puppet][puppet] to further automate
Spark standalone cluster deployments.

# Building

Ensure you have the following software on the server.
Spindle has been developed on CentOS 6.5 with
sbt 0.13.5, Spark 1.0.0, Hadoop 2.0.0-cdh4.7.0,
and parquet-thrift 1.5.0.

| Command | Output |
|---|---|
| `cat /etc/centos-release` | CentOS release 6.5 (Final) |
| `sbt --version` | sbt launcher version 0.13.5 |
| `thrift --version` | Thrift version 0.9.1 |
| `hadoop version` | Hadoop 2.0.0-cdh4.7.0 |
| `cat /usr/lib/spark/RELEASE` | Spark 1.0.0 built for Hadoop 2.0.0-cdh4.7.0 |

Spindle uses [sbt][sbt] and the [sbt-assembly][sbt-assembly] plugin
to build the application into a fat JAR to be deployed to the Spark cluster.
Using [adobe-research/spark-cluster-deployment][spark-cluster-deployment],
modify `config.yaml` to have your server configurations,
then build the application with `ss-a`, send the JAR to your cluster
with `ss-sy`, and start Spindle with `ss-st`.

# Experimental Results
All experiments leverage a homogeneous six node production cluster
of HP ProLiant DL360p Gen8 blades.
Each node has 32GB of DDR3 memory at 1333MHz,
(2) 6 core Intel Xeon 0 processors at 2.30GHz and 1066MHz FSB,
and (10) 15K SAS 146GB, RAID 5 hard disks.
Furthermore, each node has CentOS 6.5, Hadoop 2.0.0-cdh4.7.0,
Spark 1.0.0, sbt 0.13.5, and Thrift 0.9.1.
The Spark workers each utilize 21GB of memory.

These experiments benchmark Spindle's queries
on a week's worth of data consuming 13.1GB as serialized Thrift objects
in Parquet.

The YAML formatted results, scripts, and resulting figures
are in the [benchmark-scripts][benchmark-scripts] directory.
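The experiments below report means over repeated trials (four per configuration), and the concurrency experiment reports slowdown factors relative to serial execution. A minimal sketch of that aggregation (helper names are ours, not from `benchmark-scripts`):

```scala
// Mean execution time over a set of trials (the experiments below use four).
def meanTime(trials: Seq[Double]): Double = trials.sum / trials.size

// Slowdown factor of a concurrent mean relative to the serial mean.
def slowdown(concurrentMean: Double, serialMean: Double): Double =
  concurrentMean / serialMean
```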
## Scaling HDFS and Spark workers.
Predicting the optimal resource allocation to minimize query latency for
distributed applications is difficult. No production software can accurately
predict the optimal number of Spark and HDFS nodes for a given application.
This experiment observes the execution time of queries as the number of Spark
and HDFS workers is increased. We manually scale and rebalance the HDFS data.

The following figure shows the time to load all columns the queries
use for the week of data as the Spark and HDFS workers are scaled. The data is
loaded by caching the Spark RDDs and performing a null operation on them, such
as `rdd.cache.foreach{x => {}}`. The downward trend of the data load times
indicates that using more Spark or HDFS workers decreases the time to load
data.

![](https://raw.githubusercontent.com/adobe-research/spindle/master/benchmark-scripts/scaling/dataLoad.png)

The following table and plot show the execution time of the queries
with cached data when scaling the HDFS and Spark workers.
The bold data indicates where adding a
Spark and HDFS worker hurts performance. The surprising results show that
adding a single Spark or HDFS worker commonly hurts query performance, and
interestingly, no query achieves its minimal execution time when using all 6
workers. Our future work is to further experiment by tuning Spark to understand
the performance degradation, which might be caused by network traffic or
imbalanced workloads.

Q2 and Q3 are similar queries and consequently have similar performance when
scaling the Spark and HDFS workers, but have an anomaly when using 3 workers,
where Q2 executes in 17.10s and Q3 executes in 55.15s.
Q6's execution time
increases by 10.67 seconds between three and six Spark and HDFS workers.

![](https://github.com/adobe-research/spindle/raw/master/images/scaling-spark-hdfs.png)
![](https://raw.githubusercontent.com/adobe-research/spindle/master/benchmark-scripts/scaling/scalingWorkers.png)

## Intermediate data partitioning.
Spark cannot optimize the number of records in the partitions
because counting the number of records in the initial and
intermediate data sets is expensive, so the
Spark application has to provide the number of partitions
to use for certain computations.
This experiment fully utilizes all six nodes with Spark (144 cores)
and HDFS workers.

Averaging four execution times for each point between
target partition sizes of 10,000 and 1,500,000 for every query
results in performance similar to the TopPages query (Q4) shown below.

![](https://github.com/adobe-research/spindle/raw/master/benchmark-scripts/partitions/png/TopPages.png)

Targeting 10,000 records per partition results in poor performance,
which we suspect is due to the Spark overhead of creating an execution
environment for each task, and the execution time monotonically decreases
and levels off at a target partition size of 1,500,000.

The table below summarizes the results from all queries
by showing the best average execution times across all partition sizes
and the execution time at a target partition size of 1,500,000.
Q2 and Q3 have nearly identical performance because Q3
only adds a filter to Q2.

| Query | Best Execution Time (s) | Final Execution Time (s) |
|---|---|---|
| TopPages | 3.31 | 3.37 |
| TopPagesByBrowser | 15.41 | 15.58 |
| TopPagesByPreviousTopPages | 34.70 | 36.89 |
| TopReferringDomains | 5.68 | 5.68 |
| RevenueFromTopReferringDomains | 16.66 | 16.661 |
| RevenueFromTopReferringDomainsFirstVisitGoogle | 16.89 | 16.89 |
The remaining experiments use a target partition size of 1,500,000,
and the performance is the best observed for the operations with partitioning.
We expect that support for specifying partitioning when
loading Parquet data from HDFS would yield further performance improvements.

## Impact of caching on query execution time.
This experiment shows the ideal speedups from having
all the data in memory as RDDs.
Furthermore, the speedups from caching in this experiment
are better than those from caching only the raw data in memory because
the RDD itself is cached, and the time to load raw data
into an RDD is non-negligible.

The figure below shows the average execution times from four trials
of every query with and without caching.
Caching the data substantially improves performance, but
reveals that Spindle has further performance bottlenecks inhibiting
subsecond query execution times.
These bottlenecks can be partially overcome by preprocessing the data
and further analyzing Spark internals.

![](https://github.com/adobe-research/spindle/raw/master/benchmark-scripts/caching/caching.png)

## Query execution time for concurrent queries.
Spindle can process concurrent queries with multi-threading, since
many users will use the analytics application concurrently.
Users will request different queries concurrently,
but for simplicity, this experiment shows the performance
degradation as the same query is called with an increasing
number of threads, with in-memory caching.

This experiment spawns a number of threads that continuously
execute the same query.
Each thread remains loaded and continues processing
queries until all threads have processed four queries,
and the average execution time of the first four queries
from every thread is used as a metric to estimate the
slowdowns.

The performance of the TopPages query below
is indicative of the performance of most queries.
TopPages appears to underutilize the Spark system when
processing in serial, and the Spark scheduler is able to process
two queries concurrently and return them within a factor of 1.32 of
the serial execution time.
![](https://github.com/adobe-research/spindle/raw/master/benchmark-scripts/concurrent/png/TopPages.png)

The slowdown factors from serial execution are shown in
the table below for two and eight concurrent queries.

| Query | Serial Time (s) | 2 Concurrent Slowdown | 8 Concurrent Slowdown |
|---|---|---|---|
| Pageviews | 2.70 | 1.63 | 5.98 |
| TopPages | 3.37 | 1.32 | 5.66 |
| TopPagesByBrowser | 15.93 | 2.02 | 7.58 |
| TopPagesByPreviousTopPages | 37.49 | 1.24 | 4.15 |
| Revenue | 2.74 | 1.53 | 5.82 |
| TopReferringDomains | 5.75 | 1.19 | 4.45 |
| RevenueFromTopReferringDomains | 17.79 | 1.55 | 5.91 |
| RevenueFromTopReferringDomainsFirstVisitGoogle | 16.35 | 1.68 | 7.29 |

This experiment shows the ability of Spark's scheduler at the
small scale of six nodes.
The slowdowns for two concurrent queries indicate that further query
optimizations could better balance the work between all Spark workers and
likely result in better query execution times.

# Contributing and Development Status
Spindle is not currently under active development by Adobe.
However, we are happy to review and respond to issues,
questions, and pull requests.

# License
Bundled applications are copyright their respective owners.
[Twitter Bootstrap][bootstrap] and
[dangrossman/bootstrap-daterangepicker][bootstrap-daterangepicker]
are Apache 2.0 licensed,
and [rlamana/Terminus][terminus] is MIT licensed.
Diagrams are available in the public domain from
[bamos/beamer-snippets][beamer-snippets].

All other portions are copyright 2014 Adobe Systems Incorporated
under the Apache 2 license, and a copy is provided in `LICENSE`.
[adobe-analytics]: http://www.adobe.com/solutions/digital-analytics.html

[mapreduce]: http://wiki.apache.org/hadoop/MapReduce
[drill]: http://incubator.apache.org/drill/
[impala]: http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html
[spark]: http://spark.apache.org/
[spark-sql]: https://spark.apache.org/sql/
[spark-ec2]: http://spark.apache.org/docs/1.0.0/ec2-scripts.html
[spark-mesos]: http://spark.apache.org/docs/1.0.0/running-on-mesos.html
[spark-yarn]: http://spark.apache.org/docs/1.0.0/running-on-yarn.html
[spark-standalone]: http://spark.apache.org/docs/1.0.0/spark-standalone.html

[apache]: http://www.apache.org/
[hbase]: http://hbase.apache.org/
[cassandra]: http://cassandra.apache.org
[parquet]: http://parquet.io/
[hdfs]: http://hadoop.apache.org/
[thrift]: https://thrift.apache.org/
[thrift-guide]: http://diwakergupta.github.io/thrift-missing-guide/
[avro]: http://avro.apache.org/
[spark-parquet-avro]: http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/
[spray]: http://spray.io
[kryo]: https://github.com/EsotericSoftware/kryo
[fabric]: http://www.fabfile.org/
[puppet]: http://puppetlabs.com/puppet/puppet-open-source

[2014-spark-summit]: http://spark-summit.org/2014
[nbc]: http://www.nbcuni.com/
[nbc-pres]: http://spark-summit.org/wp-content/uploads/2014/06/Using-Spark-to-Generate-Analytics-for-International-Cable-TV-Video-Distribution-Christopher-Burdorf.pdf
[telefonica-pres]: http://spark-summit.org/wp-content/uploads/2014/07/Spark-use-case-at-Telefonica-CBS-Fran-Gomez.pdf
[adam]: https://github.com/bigdatagenomics/adam

[grunt]: http://gruntjs.com/
[ghp]: https://pages.github.com/
[gbc]: https://github.com/robwierzbowski/grunt-build-control
[npm]: https://www.npmjs.org/

[scala]: http://scala-lang.org
[rdd]: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD

[sbt]: http://www.scala-sbt.org/
[sbt-thrift]: https://github.com/bigtoast/sbt-thrift
[sbt-assembly]: https://github.com/sbt/sbt-assembly

[pjson]: https://github.com/adobe-research/spindle/blob/master/package.json
[AnalyticsData.thrift]: https://github.com/adobe-research/spindle/blob/master/src/main/thrift/AnalyticsData.thrift
[benchmark-scripts]: https://github.com/adobe-research/spindle/tree/master/benchmark-scripts

[demo]: http://adobe-research.github.io/spindle/
[spark-parquet-thrift-example]: https://github.com/adobe-research/spark-parquet-thrift-example
[spark-cluster-deployment]: https://github.com/adobe-research/spark-cluster-deployment

[bootstrap]: http://getbootstrap.com/
[terminus]: https://github.com/rlamana/Terminus
[beamer-snippets]: https://github.com/bamos/beamer-snippets
[bootstrap-daterangepicker]: https://github.com/dangrossman/bootstrap-daterangepicker

[spark-jobserver]: https://github.com/ooyala/spark-jobserver
[yarn]: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
[spark-yarn-article]: http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/