{"id":13564957,"url":"https://github.com/thucdx/news-trending","last_synced_at":"2025-04-12T06:05:13.093Z","repository":{"id":71060860,"uuid":"228062618","full_name":"thucdx/news-trending","owner":"thucdx","description":"Finding trends in news article with Spark (MLLIB, LDA), Spark-Solr, Solr","archived":false,"fork":false,"pushed_at":"2019-12-16T18:24:06.000Z","size":503,"stargazers_count":7,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-12T06:04:43.939Z","etag":null,"topics":["lda","spark-solr","topicmodelling"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/thucdx.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2019-12-14T17:35:24.000Z","updated_at":"2023-12-05T03:35:14.000Z","dependencies_parsed_at":"2023-02-23T04:01:08.771Z","dependency_job_id":null,"html_url":"https://github.com/thucdx/news-trending","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thucdx%2Fnews-trending","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thucdx%2Fnews-trending/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thucdx%2Fnews-trending/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thucdx%2Fnews-trending/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/thucdx","download_url":"https://codeload.github.com/thucdx/news-trending/tar.gz/ref
s/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248525144,"owners_count":21118618,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["lda","spark-solr","topicmodelling"],"created_at":"2024-08-01T13:01:38.526Z","updated_at":"2025-04-12T06:05:13.036Z","avatar_url":"https://github.com/thucdx.png","language":"Scala","funding_links":[],"categories":["Scala"],"sub_categories":[],"readme":"# Finding trends in news\n\n\n*CS410 Final Project, Fall 2019*\n\nTotal members: `1`\n\n| Member | Email   | Role                      |\n|--------|-------|---------------------------|\n| Thuc Dinh | thucd2@illinois.edu| Team leader \u0026 Member  |\n\n\n## What this tool can do\nThe tool helps find popular topics in the news during a specified period.\nThe user specifies the start time, the end time, and the number of topics they expect to see.\nThe tool then shows each topic along with its keywords, each word's weight, and the articles related to that topic.\n\n\n## Architecture\n\n![General architecture](imgs/FindingTrends.jpg)\n\n+ *[Apache Solr](https://lucene.apache.org/solr/)*: Used to store articles and support full-text search\n+ *[Apache Spark](https://spark.apache.org/)*: Runs the LDA algorithm to discover topics and then saves the discovered topics back to *Solr* (if necessary). 
Apache Spark communicates with Apache Solr through the *Spark-Solr* connector.\nThis connector was developed by Lucidworks and can be found at [https://github.com/lucidworks/spark-solr](https://github.com/lucidworks/spark-solr)\n+ *Crawler*: In a real deployment, the tool should have a crawling component to fetch updates from online news sites. \nHowever, this project uses a [public Kaggle dataset of Guardian news](https://www.kaggle.com/sameedhayat/guardian-news-dataset) containing about 53,000 articles from the beginning of 2016 to the end of 2018.\n\n## Download and install requirements\nBelow are instructions for installing Apache Solr and the other requirements on `Ubuntu 19.10`. Other operating systems or Linux distributions may differ slightly.\nThe project uses:\n + `Apache Solr 8.2.0`\n + `Java 1.8`\n + `Apache Maven 3.6.1`\n + `Scala 2.11.12`\n + `Apache Spark 2.4.4`\n\n+ Install Apache Solr 8.2.0\n\n   + Download Apache Solr 8.2.0\n\n```sh\ncd $YOUR_WORKING_DIR\nwget https://archive.apache.org/dist/lucene/solr/8.2.0/solr-8.2.0.tgz\ntar -xzvf solr-8.2.0.tgz\n```\n\n   + Start Solr in Cloud mode with 2 nodes (running locally and listening on different ports: `8983` (default) and `7574`):\n   \n```sh\ncd $YOUR_WORKING_DIR/solr-8.2.0\n\n# start Solr running in SolrCloud mode on default port (8983)\nbin/solr start -c\n\n# start Solr running in SolrCloud mode on port 7574 and using localhost:9983 to connect to zookeeper\nbin/solr start -c -p 7574 -z localhost:9983\n```\n\nTo check that everything is OK, go to [Solr Admin](http://localhost:8983/solr/#/~cloud)\n\nYou should see something like the following:\n![Solr Admin](imgs/solr_admin_run_up.png)\n\nIn the `node` column, there are two nodes: `7574_solr`, `8983_solr`. 
These indicate two nodes running and listening on the two specified ports.\nThat means we are good to go.\n\n   + Create the collection in Solr \n\nWe need to create the `news` collection to store news articles \n\n```sh\n# create news collection with 2 shards and replication factor = 2\nbin/solr create -c news  -s 2 -rf 2\n```\n\n+ Install Java 8 and Maven 3.6.1\n\nFollow this guide: https://linuxize.com/post/how-to-install-apache-maven-on-ubuntu-18-04/\n\n\n+ Build the project from source\n\nAfter installing Maven successfully, we can build the project from source\n\n```sh\ncd $PROJECT_DIR\n\n# package with maven\nmvn clean package\n```\n\nIf the build succeeds, there will be a file named `news_topic-1.0.jar` in `$PROJECT_DIR/target`\n\n+ Download the Guardian news dataset\n\n    1. Download from Kaggle public dataset: https://www.kaggle.com/sameedhayat/guardian-news-dataset\n([another link](https://drive.google.com/open?id=1QwE3VqnCMjFeiRYT6NV1rs6AC8FnnvzW))\n\n    2. Unzip, rename and place it in `$PROJECT_DIR/input/the_guardian_articles.csv`\n\n\n## How to use\n\nThe tool can be run from the command line with arguments\n\n```\ncd $PROJECT_DIR\n\njava -cp target/news_topic-1.0.jar Main [options]\n```\n\nThe full list of arguments is shown below:\n```sh\nFinding trends in news v1.0\nUsage: news_topic-VERSION.jar Main [options]\n\n  -m, --mode \u003cvalue\u003e       Mode to run: extract/trend\n  -p, --inputPath \u003cvalue\u003e  Path of csv file containing articles to index. Default = input/the_guardian_articles.csv\n  -o, --outputPath \u003cvalue\u003e\n                           Path to store extract articles. Default = output/\n  -c, --newsCollection \u003cvalue\u003e\n                           Name of collection to index to Solr. Default: news\n  --extractStartDate \u003cvalue\u003e\n                           Extracting start date, format: yyyy-MM-dd. 
Default 2018-01-01\n  --extractEndDate \u003cvalue\u003e\n                           Extracting end date, format: yyyy-MM-dd. Default 2019-01-01\n  --trendStartDate \u003cvalue\u003e\n                           Trend start date, format: yyyy-MM-dd. Default 2018-11-01\n  --trendEndDate \u003cvalue\u003e   Trend end date, format: yyyy-MM-dd. Default 2018-12-01\n  -z, --zookeeper \u003cvalue\u003e  Zookeeper url, default: localhost:9983\n  -t, --topics \u003cvalue\u003e     Number of topics. Default = 5\n  -w, --words \u003cvalue\u003e      Number of words per topic to show. Default = 7\n  -a, --articles \u003cvalue\u003e   Number of related articles to show for each topic. Default = 5\n```\n\n*You don't need* to remember all these options; you only need to know that the tool has two main features:\n1. Index (`--mode extract`): Extracts all articles published during `[extractStartDate, extractEndDate)` from the input source,\ncleans them, and saves them to a CSV file before indexing to Solr. We then use Solr's `post` tool to index this CSV file.\n\nRelated options:\n   + `--inputPath \u003cvalue\u003e` option: the path of the file containing news (CSV file)\n   + `--outputPath \u003cvalue\u003e` option: the output path for extracted articles (CSV file) \n   + `--extractStartDate \u003cvalue\u003e`, `--extractEndDate \u003cvalue\u003e` define the period whose articles are indexed to Solr (articles published outside this range are left untouched)\n\n2. Find trends (`--mode trend`): Finds trends / discovers topics in any given period of time and shows the articles related to these topics.\nRelated options:\n   + `--trendStartDate \u003cvalue\u003e`\n   + `--trendEndDate \u003cvalue\u003e` \n   + `--topics \u003cvalue\u003e`\n   + `--words \u003cvalue\u003e`\n   + `--articles \u003cvalue\u003e`\n\nTL;DR:\n----\n\n\n1. 
Extract the Guardian News data set, and cleaning before indexing to Solr\n\n```sh\njava -cp target/news_topic-1.0.jar Main --mode extract --inputPath input/the_guardian_articles.csv\n```\n\nIf extracting process was OK, you should see file named `part-*.csv` in `output/` folder\n\nIndexing to Solr with `post` tool\n```sh\ncd $SOLR_DIR\nbin/post -c news $PROJECT_DIR/output/*.csv\n```\n\nIf indexing process was OK, you should see something like this in [http://127.0.1.1:8983/solr/#/news/query](http://127.0.1.1:8983/solr/#/news/query)\n\n![Indexing successfully](imgs/index_news_ok.png)\n\n\nYou can try different values for `--extractStartDate` and `--extractEndDate` to index more articles to Solr. By default, we indexed only articles published in `2018`.\n\n2. View trends in any given time range\n\nDiscover `5` topics in articles published in `May 2018`, each topics show `8` words and `6` related articles\n\n```sh\njava -cp target/news_topic-1.0.jar Main --mode trend --trendStartDate 2018-05-01 --trendEndDate 2018-06-01 --topics 5 --words 8 --articles 6\n```\nThe result of console is something like below:\n\n```sh\n======================\nFINDING TRENDS\n\t Solr's news collection: news\n\t\t startDate = 2018-05-01 \n\t\t endDate = 2018-06-01\n\t\t number of topics: 5\n\t\t words per topic: 8\n\t\t related articles: 6\nTotal article in date range [2018-05-01, 2018-06-01) : 1260\nFinished training LDA model.\nTraining time: 7.130969228 secs\nShowing 5 topics and related articles: \n#################################\nTopic 1 / 5\nTopic word with its weight:\nList((brexit,0.0044), (labour,0.0039), (customs,0.0032), (eu,0.0032), (trade,0.0028), (party,0.0027), (growth,0.0025), (uk,0.0023))\nRelated article: \n+-----------------------------------------------------------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------+--------+\n|title                                     
                             |publishedDate|url                                                                                                                            |section |\n+-----------------------------------------------------------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------+--------+\n|Brexit weekly briefing crunch time on customs union approaches         |2018-05-01   |https://www.theguardian.com/politics/2018/may/01/brexit-weekly-briefing-crunch-time-on-customs-union-approaches                |Politics|\n|Brexit vote has cost each UK household 900 says Mark Carney            |2018-05-22   |https://www.theguardian.com/politics/2018/may/22/brexit-vote-cost-uk-mark-carney-bank-of-england                               |Politics|\n|Brexit weekly briefing Boris Johnson launches customs union broadside  |2018-05-08   |https://www.theguardian.com/politics/2018/may/08/brexit-weekly-briefing-boris-johnson-launches-customs-union-broadside         |Politics|\n|Brexit weekly briefing Irish border problem dominates debate           |2018-05-22   |https://www.theguardian.com/politics/2018/may/22/brexit-weekly-briefing-irish-border-problem-dominates-debate                  |Politics|\n|Local elections haunted by Brexit offer little comfort to right or left|2018-05-06   |https://www.theguardian.com/politics/2018/may/05/local-elections-brexit-little-comfort-right-or-left                           |Politics|\n|Labours choice to fight Lewisham East may be decided by Brexit views   |2018-05-18   
|https://www.theguardian.com/politics/2018/may/18/labour-choice-to-fight-lewisham-east-byelection-may-be-decided-by-brexit-views|Politics|\n+-----------------------------------------------------------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------+--------+\n\n\n..... MANY TEXT ....\n\n\n#################################\nTopic 5 / 5\nTopic word with its weight:\nList((rugby,0.0028), (players,0.0024), (season,0.0020), (cup,0.0019), (min,0.0019), (game,0.0018), (saracens,0.0018), (exeter,0.0017))\nRelated article: \n+------------------------------------------------------------------------------+-------------+---------------------------------------------------------------------------------------------------------------+-------+\n|title                                                                         |publishedDate|url                                                                                                            |section|\n+------------------------------------------------------------------------------+-------------+---------------------------------------------------------------------------------------------------------------+-------+\n|Super Rugby is gravely ill but its last breath is yet to be taken  Bret Harris|2018-05-22   |https://www.theguardian.com/sport/2018/may/22/super-rugby-is-gravely-ill-but-its-last-breath-is-yet-to-be-taken|Sport  |\n|HR McMaster on rugby The warrior ethos is what a good team has                |2018-05-28   |https://www.theguardian.com/sport/2018/may/28/hr-mcmaster-rugby-warrior-ethos                                  |Sport  |\n|Its in my blood how rugby managed to unite Americas elite                     |2018-06-01   |https://www.theguardian.com/sport/blog/2018/jun/01/famous-american-rugby-players-wales-v-south-africa          |Sport  |\n|Eddie Joness England training methods to come under scrutiny               
   |2018-05-31   |https://www.theguardian.com/sport/2018/may/31/eddie-jones-england-training-scrutiny                            |Sport  |\n|Super Rugby player drain looms as New Zealands biggest foe  Bret Harris       |2018-05-14   |https://www.theguardian.com/sport/2018/may/15/super-rugby-player-drain-looms-as-new-zealands-biggest-foe       |Sport  |\n|Russia handed World Cup place as Romania penalised for ineligible player      |2018-05-15   |https://www.theguardian.com/sport/2018/may/15/russia-romania-rugby-world-cup-2019-ineligible-player            |Sport  |\n+------------------------------------------------------------------------------+-------------+---------------------------------------------------------------------------------------------------------------+-------+\n```\n## How the tool was developed\n\nThe tool uses the Spark-Solr connector to read from and write to Solr. On the Spark side, it uses Spark ML to run the LDA algorithm and discover topics.\n\nThe key features are described below:\n\n+ Read data from Solr to Spark\n\nWith the Spark-Solr connector, Solr can be thought of as a data source for Spark SQL. 
Reading from and writing to Solr from Spark is as easy as with any other data source.\n```scala\ndef loadArticleFromSolr(ss: SparkSession, zkHost: String, newsCollection: String): Dataset[Article] = {\n    import ss.implicits._\n\n    val options = Map(\n      \"zkhost\" -\u003e zkHost,\n      \"collection\" -\u003e newsCollection\n    )\n\n    val ds = ss.read.format(\"solr\")\n      .options(options)\n      .load\n      .flatMap(rowToArticle)\n\n    ds\n  }\n```\n\n+ Save data from Spark to Solr\n```scala\ndef saveToSolr(ss: SparkSession, zkHost: String, collection: String, dataDF: DataFrame): Unit = {\n    val options = Map(\n      \"zkhost\" -\u003e zkHost,\n      \"collection\" -\u003e collection,\n      \"gen_uniq_key\" -\u003e \"true\",\n      \"soft_commit_secs\" -\u003e \"5\"\n    )\n\n    dataDF\n      .write\n      .format(\"solr\")\n      .options(options)\n      .mode(org.apache.spark.sql.SaveMode.Overwrite)\n      .save\n  }\n```\n\n+ Run the LDA algorithm to find topics\n\nWe apply some basic transformations first.\n\nTokenize the article body:\n```scala\n// TOKENIZER\n    val tokenizer = new Tokenizer().setInputCol(\"bodyText\").setOutputCol(\"words\")\n    val newsWithTokenizer = tokenizer.transform(newsDataset)\n\n    val countNullWords = newsWithTokenizer\n      .filter($\"words\".isNull)\n      .count()\n```\n\nThen remove stop words, currently using the default English stop-word list from Spark's MLlib.\n```scala\n    // REMOVE STOPWORDS\n    val stopWords = new StopWordsRemover()\n      .setInputCol(tokenizer.getOutputCol)\n      .setOutputCol(\"filtered_words\")\n\n\n    val filteredStopwords = stopWords.transform(newsWithTokenizer)\n```\n\nConvert each document to a sparse vector of token counts:\n```scala\n    // VECTORISED\n    val cvModel: CountVectorizerModel = new CountVectorizer()\n      .setInputCol(\"filtered_words\")\n      .setOutputCol(\"features\")\n      .setMinDF(2)\n      .fit(filteredStopwords)\n\n    val afterPreprocessed = 
cvModel.transform(filteredStopwords) // transform the same stop-word-filtered dataset the model was fit on\n```\n\nDown-weight common terms using inverse document frequency (IDF):\n```scala\n    //  IDF\n    val idf = new IDF()\n      .setInputCol(cvModel.getOutputCol)\n      .setOutputCol(\"features_tfidf\")\n\n    val rescaled = idf.fit(afterPreprocessed).transform(afterPreprocessed)\n    rescaled.persist()\n\n    val vocabArray = cvModel.vocabulary\n\n    val documents = rescaled\n      .select(\"features_tfidf\")\n      .rdd\n      .map {\n        case Row(features: MLVector) =\u003e Vectors.fromML(features)\n      }\n      .zipWithIndex()\n      .map(_.swap)\n```\n\nRun the LDA algorithm:\n```scala\n    val lda = new LDA()\n    lda.setK(nTopic)\n\n    val ldaModel = lda.run(documents)\n    val topicIndices = ldaModel.describeTopics(maxTermsPerTopic = nWord)\n```\n\n+ Finding `top-n` related articles for a given topic\n\nWe take the top words of a topic and use them to search for articles, then rank the matching articles by score and retrieve the top ones.\nThis step leverages Solr's search capabilities.\n\n```scala\nval words = rangedTopic.words.mkString(\" \")\n val relatedArticles = ss.read.format(\"solr\")\n        .[....]\n      .option(\"query\", s\"bodyText: $words\")     //  \u003c= using topic's words to search with Solr\n      .option(\"solr.params\", \"sort=score desc\") //  \u003c= ranking articles by score in descending order \n      .option(\"max_rows\", maxArticle)           //  \u003c= retrieving some top related articles    \n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthucdx%2Fnews-trending","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthucdx%2Fnews-trending","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthucdx%2Fnews-trending/lists"}