{"id":26933161,"url":"https://github.com/xiaohan2012/der","last_synced_at":"2025-04-02T09:17:42.793Z","repository":{"id":140896236,"uuid":"46448050","full_name":"xiaohan2012/der","owner":"xiaohan2012","description":"Disambiguation Entity R****","archived":false,"fork":false,"pushed_at":"2016-01-03T20:11:35.000Z","size":10843,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2024-04-14T18:06:57.846Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/xiaohan2012.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-11-18T21:17:24.000Z","updated_at":"2015-11-19T21:20:30.000Z","dependencies_parsed_at":"2023-06-06T15:15:43.506Z","dependency_job_id":null,"html_url":"https://github.com/xiaohan2012/der","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xiaohan2012%2Fder","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xiaohan2012%2Fder/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xiaohan2012%2Fder/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xiaohan2012%2Fder/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/xiaohan2012","download_url":"https://codeload.github.com/xiaohan2012/der/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246785479,"owners_count":20833498,"icon
_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-04-02T09:17:42.161Z","updated_at":"2025-04-02T09:17:42.774Z","avatar_url":"https://github.com/xiaohan2012.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Wikipedia related\n\n- [Wikipedia dump link](https://dumps.wikimedia.org/enwiki/20151102/enwiki-20151102-pages-articles-multistream.xml.bz2)\n\n\n# Popular\n\n- [API: Spark 1.5.2 for Scala](http://spark.apache.org/docs/latest/api/scala/index.html)\n- [ByKey* functions in Spark](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions)\n\n# Learning resources\n\n- [Good tutorial on Scala XML](https://bcomposes.wordpress.com/2012/05/04/basic-xml-processing-with-scala/): read, query, iterate, convert to/from object\n- [Get method names for Scala object](http://stackoverflow.com/questions/2886446/how-to-get-methods-list-in-scala)\n- [Scala interpreter print longer traces](http://stackoverflow.com/questions/3767808/how-to-force-interpreter-show-complete-stack-trace/3769827#3769827)\n- [Regular expression with lazy matching and non-capturing group](http://stackoverflow.com/questions/8213837/optional-grouping-in-scala-regular-expressions), [regular expression web tool](http://regexr.com/2v8m4) and my work: [regexp on Wiki markup freelink](http://regexr.com/3c87k)\n- [Parse huge xml files](http://www.lucasallan.com/2014/12/23/parsing-huge-xml-files-in-scala.html)\n- [Emacs as Scala IDE](http://www.troikatech.com/blog/2014/11/26/ensime-and-emacs-as-a-scala-ide)\n- [Get test resource 
path](http://stackoverflow.com/questions/23831768/scala-get-file-path-of-file-in-resources-folder)\n- [ScalaTest](http://www.scalatest.org/user_guide/writing_your_first_test)\n- [Tail recursion](http://oldfashionedsoftware.com/2008/09/27/tail-recursion-basics-in-scala/): very good example\n  - way to tell if a function is tail recursive: if it computes first and the recursive call is the last step, it is; otherwise, it is not.\n- [foldLeft and foldRight](http://oldfashionedsoftware.com/2009/07/10/scala-code-review-foldleft-and-foldright/) plus a reminder on tail recursion\n- [Chop newline character](http://alvinalexander.com/scala/scala-string-chomp-chop-function-newline-characters)\n- [BufferedWriter and FileWriter in Java](http://stackoverflow.com/questions/12350248/java-difference-between-filewriter-and-bufferedwriter)\n- [Scala run jar](http://stackoverflow.com/questions/2930146/running-scala-apps-with-java-jar): `scala -cp  target/scala-2.11/der_2.11-0.1.0.jar  org.hxiao.der.util.PageXMLFlattener` \n- [Spark unit testing example](http://mkuthan.github.io/blog/2015/03/01/spark-unit-testing/)\n- [Tuples to Map](http://stackoverflow.com/questions/6522459/scala-map-from-tuple-iterable)\n- [Spark reduceByKey and groupByKey performance issue](https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html)\n- [Spark foldByKey](http://blog.madhukaraphatak.com/spark-rdd-fold/)\n- [Overloading case class](http://stackoverflow.com/questions/2400794/overload-constructor-for-scalas-case-classes)\n- [typedef](http://stackoverflow.com/questions/21223051/typedef-in-scala) and [placing type in package object](http://stackoverflow.com/questions/7441277/scala-type-keyword-how-best-to-use-it-across-multiple-classes)\n- [flatMap](http://stackoverflow.com/questions/23138352/how-to-flatten-a-collection-with-spark-scala)\n- [sorted and sortWith](http://alvinalexander.com/scala/how-sort-scala-sequences-seq-list-array-buffer-vector-ordering-ordered) and 
[implicit ordering](http://stackoverflow.com/questions/19345030/easy-idiomatic-way-to-define-ordering-for-a-simple-case-class)\n- [ScalaTest should DSL, Matchers](http://www.scalatest.org/user_guide/using_matchers#checkingEqualityWithMatchers)\n- [create a class instance while modifying its definition](http://stackoverflow.com/questions/3648870/scala-using-hashmap-with-a-default-value)\n- [Spark groupBy](http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html#groupBy)\n- [mapValues and frequency calculation](http://stackoverflow.com/questions/12105130/generating-a-frequency-map-for-a-string-in-scala)\n- [Chunker/NER using Solr](http://sujitpal.blogspot.fi/2013/07/dictionary-backed-named-entity.html)\n- Solr\n  - [API: add documents programmatically](https://wiki.apache.org/solr/Solrj)\n  - [query syntax](https://wiki.apache.org/solr/CommonQueryParameters#fl)\n- [Option type](http://danielwestheide.com/blog/2012/12/19/the-neophytes-guide-to-scala-part-5-the-option-type.html)\n- [Exception handling and pattern matching](http://danielwestheide.com/blog/2012/12/26/the-neophytes-guide-to-scala-part-6-error-handling-with-try.html)\n- [When to use the new operator?](https://stackoverflow.com/questions/9727637/new-keyword-in-scala/9727784#9727784)\n- [split into chunks (grouped)](http://stackoverflow.com/questions/7459174/split-list-into-multiple-lists-with-fixed-number-of-elements)\n- If `XX is not member of XXX` keeps appearing even after trying several package versions, use `jar tf XX.jar | grep` to check whether the jar actually contains the member.\n- [JavaConversions](http://www.scala-lang.org/api/current/index.html#scala.collection.JavaConversions$)\n- [testOnly in sbt](http://stackoverflow.com/questions/11159953/scalatest-in-sbt-is-there-a-way-to-run-a-single-test-without-tags)\n- [Core definition (new) in Solr](https://cwiki.apache.org/confluence/display/solr/Defining+core.properties)\n- [Solr schema data types](https://cwiki.apache.org/confluence/display/solr/Field+Types+Included+with+Solr)\n- 
[ScalaTest almost equal](http://stackoverflow.com/questions/29938653/scalatest-check-for-almost-equal-for-floats-and-objects-containing-floats/29940436#29940436)\n- [Spark partition operations (quite a lot)](https://spark.apache.org/docs/1.0.0/api/java/org/apache/spark/rdd/RDD.html)\n- [RDD.toLocalIterator](https://spark.apache.org/docs/1.0.0/api/java/org/apache/spark/rdd/RDD.html#toLocalIterator%28%29)\n- [sequential test execution in SBT (create one Spark context at a time)](http://stackoverflow.com/questions/15145987/how-to-run-specifications-sequentially)\n- [Solr, OverlappingFileLockException -\u003e core not available](http://stackoverflow.com/questions/5898977/solr-overlappingfilelockexception-when-concurrent-commits)\n- [Solr Query Syntax](http://www.solrtutorial.com/solr-query-syntax.html)\n- [Submit Spark application](https://spark.apache.org/docs/1.1.0/submitting-applications.html)\n- [SBT Assembly: create fat jar](https://github.com/sbt/sbt-assembly) and [solution for Scala 2.11](http://stackoverflow.com/questions/28459333/how-to-build-an-uber-jar-fat-jar-using-sbt-within-intellij-idea)\n- [classpath in sbt](http://stackoverflow.com/questions/21698205/how-to-display-classpath-used-for-run-task)\n- [Spark test in standalone cluster](http://eugenezhulenev.com/blog/2014/10/18/run-tests-in-standalone-spark-cluster/)\n- [Spark for Scala 2.11](http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211)\n- [Unmanaged JARs in sbt](http://www.scala-sbt.org/release/tutorial/Library-Dependencies.html)\n- [\"Object is not a value\" error](http://stackoverflow.com/questions/9079129/object-is-not-a-value-error-in-scala)\n- Suppress Solr update log: `log4j.logger.org.apache.solr.core=WARN` and `log4j.logger.org.apache.solr.update.processor=WARN` in `log4j.properties`\n- [kill -9](http://unix.stackexchange.com/questions/5642/what-if-kill-9-does-not-work)\n- [Column based storage](https://en.wikipedia.org/wiki/Column-oriented_DBMS) and 
[Parquet](https://parquet.apache.org/)\n- [double definition after compiler type erasure](http://stackoverflow.com/questions/3307427/scala-double-definition-2-methods-have-the-same-type-erasure)\n- [object memory measurement](https://github.com/jbellis/jamm/)\n- [Spark architecture: workers, master](http://0x0fff.com/spark-architecture/): concepts covered:\n  - JVM heap memory allocation (storage, unroll, shuffle, etc.)\n  - worker, executor, task (degree of parallelism = core-per-executor x \\#executor x \\#worker)\n- [Run main in jar](http://stackoverflow.com/questions/8064699/export-scala-application-to-runnable-jar)\n- [Solr: index locked for write for core](http://stackoverflow.com/questions/17444493/caused-by-org-apache-solr-common-solrexception-index-locked-for-write-for-core)\n- [Talk: Dictionary Based Annotation at Scale with Spark SolrTextTagger and OpenNLP](https://www.youtube.com/watch?v=gOe0aYAS8Do): more on optimization\n- [ClassCastException](http://stackoverflow.com/questions/3511169/java-lang-classcastexception): check the software dependency versions\n- [Solr add classpath](https://cwiki.apache.org/confluence/display/solr/Lib+Directives+in+SolrConfig)\n- [Adding Play Json to sbt](http://stackoverflow.com/questions/19436069/adding-play-json-library-to-sbt)\n- [Play Json example](https://www.playframework.com/documentation/2.1.1/ScalaJson)\n- [Five ways to create Scala List](http://alvinalexander.com/scala/how-create-scala-list-range-fill-tabulate-constructors)\n- [Scala Set](http://www.scala-lang.org/docu/files/collections-api/collections_7.html)\n- [FileUtils](https://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/FileUtils.html)\n- [File/Directory exists in Java](http://stackoverflow.com/questions/1816673/how-do-i-check-if-a-file-exists-in-java)\n- Lucene -- format version not supported: remove the `data` directory under `resources/solr/{core_name}`\n- [Solr: automatically generate unique 
key](http://solr.pl/en/2013/07/08/automatically-generate-document-identifiers-solr-4-x/)\n- [Bash: XML pretty print](http://stackoverflow.com/questions/16090869/how-to-pretty-print-xml-from-the-command-line)\n- [scala sys](http://www.scala-lang.org/api/2.11.5/index.html#scala.sys.process.package)\n\n\n# TODO\n\n- should surface names be lowercased?\n- collect titles?\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxiaohan2012%2Fder","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fxiaohan2012%2Fder","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxiaohan2012%2Fder/lists"}