{"id":20629727,"url":"https://github.com/agile-lab-dev/sparksearchengine","last_synced_at":"2025-04-15T18:18:15.961Z","repository":{"id":70732230,"uuid":"86149837","full_name":"agile-lab-dev/sparksearchengine","owner":"agile-lab-dev","description":"Big Data search with Spark and Lucene","archived":false,"fork":false,"pushed_at":"2023-12-15T20:18:14.000Z","size":1032,"stargazers_count":17,"open_issues_count":3,"forks_count":1,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-15T18:18:06.926Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/agile-lab-dev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2017-03-25T10:36:21.000Z","updated_at":"2024-03-17T09:52:29.000Z","dependencies_parsed_at":"2023-07-11T10:36:12.383Z","dependency_job_id":null,"html_url":"https://github.com/agile-lab-dev/sparksearchengine","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/agile-lab-dev%2Fsparksearchengine","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/agile-lab-dev%2Fsparksearchengine/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/agile-lab-dev%2Fsparksearchengine/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/agile-lab-dev%2Fsparksearchengine/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/agile-lab-dev","download_url":"https://codeloa
d.github.com/agile-lab-dev/sparksearchengine/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249125998,"owners_count":21216705,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-16T14:05:48.206Z","updated_at":"2025-04-15T18:18:15.942Z","avatar_url":"https://github.com/agile-lab-dev.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Gitter chat](https://badges.gitter.im/spark-search.png)](https://gitter.im/spark-search/)\n\n# SearchableRDD for Apache Spark\n#### Big Data search with Spark and Lucene\n\n**spark-search** is an open source library for [Apache Spark](http://spark.apache.org/) that allows you to easily index and search your Spark datasets with similar functionality to that of a dedicated search engine like Elasticsearch or Solr.\n\nWith spark-search you can leverage information retrieval functionality to analyze and explore your Spark datasets without having to set up an external search engine, lowering the effort needed. Without external systems there are no deployment, administration or resource costs associated with them; everything needed for information retrieval is handled inside your Spark application.\n \nUnstructured information like text is easy to leverage by using the standard query types for full-text search; filters for efficient interrogations are provided for non-textual data types. 
Queries and filters can be mixed together to express complex information retrieval needs.\n\nWith a transparent integration with Spark's `RDD`s and a domain specific language for queries and filters the effort needed to leverage information retrieval from Spark is brought to a minimum.\n\n## Setup\n\n#### SBT\n\nAdd the repository to your resolvers:\n\n```sbtshell\nresolvers += Resolver.bintrayRepo(\"agile-lab-dev\", \"SparkSearchEngine\")\n```\n\nAdd the dependency:\n\n```sbtshell\nlibraryDependencies += \"it.agilelab\" %% \"spark-search\" % \"0.1\"\n```\n\n#### Maven\n\nAdd the repository:\n\n```xml\n\u003crepositories\u003e\n    \u003crepository\u003e\n        \u003cid\u003espark-search\u003c/id\u003e\n        \u003curl\u003ehttps://dl.bintray.com/agile-lab-dev/SparkSearchEngine/\u003c/url\u003e\n    \u003c/repository\u003e\n\u003c/repositories\u003e\n```\n\nAdd the dependency:\n\n```xml\n\u003cdependencies\u003e\n    \u003cdependency\u003e\n        \u003cgroupId\u003eit.agilelab\u003c/groupId\u003e\n        \u003cartifactId\u003espark-search_2.11\u003c/artifactId\u003e\n        \u003cversion\u003e0.1\u003c/version\u003e\n    \u003c/dependency\u003e\n\u003c/dependencies\u003e\n```\n\nScala 2.10:\n\n```xml\n\u003cdependencies\u003e\n    \u003cdependency\u003e\n        \u003cgroupId\u003eit.agilelab\u003c/groupId\u003e\n        \u003cartifactId\u003espark-search_2.10\u003c/artifactId\u003e\n        \u003cversion\u003e0.1\u003c/version\u003e\n    \u003c/dependency\u003e\n\u003c/dependencies\u003e\n```\n\n## Documentation\n\nThe scaladoc is available at:\n- [https://agile-lab-dev.github.io/sparksearchengine/scaladoc/0.1/scala_2.11] for Scala 2.11\n- [https://agile-lab-dev.github.io/sparksearchengine/scaladoc/0.1/scala_2.10] for Scala 2.10\n\n## How it works\n\nPowered by [Apache Lucene](http://lucene.apache.org/), `spark-search` enables you to run queries on `RDD`s by building Lucene indices for the elements in your input `RDD`s, creating `SearchableRDD`s which you 
can then execute queries on.\n\nThe only requirement is that elements in the input `RDD` must implement the `Indexable` trait. There is an experimental automatic conversion feature which allows you to transparently use your case classes without any work, which currently only works in the Scala 2.11 build. When used, spark-search will automatically add the functionality needed to implement the trait by using reflection and runtime code-generation. See the scaladoc for `it.agilelab.bigdata.spark.search.Indexable` and `it.agilelab.bigdata.spark.search.Indexable.ProductAsIndexable` for further information.\n\nQueries can be specified either with the Lucene syntax or with `spark-search`'s own domain specific language; to explore the DSL, check out the scaladoc for the `it.agilelab.bigdata.spark.search.dsl.QueryBuilder` class.\n\n\n## Example: indexing and searching a Wikipedia dump\n\nAs a usage example, let's index and search a Wikipedia dump; let's start with the Simple English Wikipedia, as it is small enough to be readily downloadable and usable on less powerful hardware.\n\nHead over to [https://dumps.wikimedia.org/simplewiki/] and grab the latest dump; choose the one marked as \"Articles, templates, media/file descriptions, and primary meta-pages, in multiple bz2 streams, 100 pages per stream\" - it should be named something like `simplewiki-20170820-pages-articles-multistream.xml.bz2`. 
Download it and decompress it somewhere.\n\nFirst, we parse the XML dump into an `RDD[wikipage]`:\n\n```scala\nimport it.agilelab.bigdata.spark.search.utils.WikipediaXmlDumpParser.xmlDumpToRdd\nimport it.agilelab.bigdata.spark.search.utils.wikipage\n\n// path to xml dump\nval xmlPath = \"/path/to/simplewiki-20170820-pages-articles-multistream.xml\"\n\n// read xml dump into an rdd of wikipages\nval wikipages = xmlDumpToRdd(sc, xmlPath).cache()\n```\n\nWe now check how many pages we got:\n\n```scala\nprintln(s\"Number of pages: ${wikipages.count()}\")\n```\n\nLet's make it a `SearchableRDD`:\n\n```scala\nimport it.agilelab.bigdata.spark.search.SearchableRDD\nimport it.agilelab.bigdata.spark.search.dsl._\nimport it.agilelab.bigdata.spark.search.impl.analyzers.EnglishWikipediaAnalyzer\nimport it.agilelab.bigdata.spark.search.impl.queries.DefaultQueryConstructor\nimport it.agilelab.bigdata.spark.search.impl.{DistributedIndexLuceneRDD, LuceneConfig}\n\n// define a configuration to use english analyzers for wikipedia and the default query constructor\nval luceneConfig = LuceneConfig(classOf[EnglishWikipediaAnalyzer],\n                                classOf[EnglishWikipediaAnalyzer],\n                                classOf[DefaultQueryConstructor])\n\n// index using DistributedIndexLuceneRDD implementation with 2 indices\nval searchable: SearchableRDD[wikipage] = DistributedIndexLuceneRDD(wikipages, 2, luceneConfig).cache()\n```\n\nWe can now do queries:\n\n```scala\n// define a query using the DSL\nval query = \"text\" matchAll termSet(\"island\")\n\n// run it against the searchable rdd\nval queryResults = searchable.aggregatingSearch(query, 10)\n\n// print results\nprintln(s\"Results for query $query:\")\nqueryResults foreach { result =\u003e println(f\"\\tscore: ${result._2}%6.3f title: ${result._1.title}\") }\n```\n\nGet information about the indices that were built:\n\n```scala\nval indicesInfo = searchable.getIndicesInfo\n\n// print 
it\nprintln(indicesInfo.prettyToString())\n```\n\nGet information about the terms:\n\n```scala\nval termInfo = searchable.getTermCounts\n\n// print top 10 terms for \"title\" field\nval topTenTerms = termInfo(\"title\").toList.sortBy(_._2).reverse.take(10)\nprintln(\"Top 10 terms for \\\"title\\\" field:\")\ntopTenTerms foreach { case (term, count) =\u003e println(s\"\\tterm: $term count: $count\") }\n```\n\nOr do a query join to find similar pages:\n\n```scala\n// define query generator where we simply use the title and the first few characters of the text as a query\nval queryGenerator: wikipage =\u003e DslQuery = (wp) =\u003e \"text\" matchText (wp.title + wp.text.take(200))\n\n// do a query join on itself\nval join = searchable.queryJoin(searchable, queryGenerator, 5) map {\n    case (wp, results) =\u003e (wp, results map { case (wp2, score) =\u003e (wp2.title, score) })\n}\nval queryJoinResults = join.take(5)\n\n// print first five elements and corresponding matches\nprintln(\"Results for query join:\")\nqueryJoinResults foreach {\n    case (wp, results) =\u003e\n        println(s\"title: ${wp.title}\")\n        results foreach { result =\u003e println(f\"\\tscore: ${result._2}%6.3f title: ${result._1}\") }\n}\n```\n\nYou can find this example in `it.agilelab.bigdata.spark.search.examples.SearchableRDDExamples`, ready to be run with spark-submit.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fagile-lab-dev%2Fsparksearchengine","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fagile-lab-dev%2Fsparksearchengine","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fagile-lab-dev%2Fsparksearchengine/lists"}