{"id":18842833,"url":"https://github.com/aamend/texata-r2-2017","last_synced_at":"2026-04-22T23:35:04.520Z","repository":{"id":72653747,"uuid":"106953423","full_name":"aamend/texata-r2-2017","owner":"aamend","description":"This project has been created in a 4h time for the purpose of the Texata Big Data world championship. ","archived":false,"fork":false,"pushed_at":"2023-07-16T07:28:06.000Z","size":3430,"stargazers_count":2,"open_issues_count":1,"forks_count":1,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-06-30T01:38:33.843Z","etag":null,"topics":["bigdata","gdelt","hackathon","spark","texata"],"latest_commit_sha":null,"homepage":"http://www.texata.com/","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aamend.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2017-10-14T18:42:19.000Z","updated_at":"2020-07-08T06:39:04.000Z","dependencies_parsed_at":null,"dependency_job_id":"8bdbf8ce-72be-4a3f-bc5a-5c28ac61d546","html_url":"https://github.com/aamend/texata-r2-2017","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/aamend/texata-r2-2017","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aamend%2Ftexata-r2-2017","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aamend%2Ftexata-r2-2017/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aamend%2Ftexata-r2-2017/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/
repositories/aamend%2Ftexata-r2-2017/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aamend","download_url":"https://codeload.github.com/aamend/texata-r2-2017/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aamend%2Ftexata-r2-2017/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32159959,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-22T17:06:48.269Z","status":"ssl_error","status_checked_at":"2026-04-22T17:06:19.037Z","response_time":58,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigdata","gdelt","hackathon","spark","texata"],"created_at":"2024-11-08T02:55:50.225Z","updated_at":"2026-04-22T23:35:04.470Z","avatar_url":"https://github.com/aamend.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Texata 2017 - Round 2\n\n![HEADER](/images/header.png)\n\n## The questions\n\n*You are the technical co-founder of a new start-up called texata.ai. \nThis business was founded to leverage the vast source of news information available on \nthe internet in order to better predict geo political instability in exporting countries, \nturning these events into actionable insights that can be used as financial instruments \nin the oil and gas markets. 
Using the GDELT (global database of events language and tone) \ndataset provided, conduct analysis relevant to the three business situations below.*\n\n- [Time Series analysis](#TIMESERIES): *Using the GDELT event dataset, can you train a computer to detect arising conflicts \nin a particular region of the globe at the early stage of a political instability?*\n- [Network analysis](#NETWORK): *Using the GDELT global knowledge graph database and the game of alliances that \nexists between different regimes and world leaders, can you identify the main \ninfluencers in the different oil and gas markets?*\n- [Inference](#INFERENCE): *Given your newly acquired domain expertise together with the provided benchmarks \nin crude oil (BRENT and OPEC), can you define the influence of a series of successive \npolitical events in the oil market?*\n\n\u003ca name=\"TIMESERIES\"\u003e\u003c/a\u003e\n## Time Series analysis\n\nMy approach consists of the following steps:\n\n- Extract all event IDs from GKG that relate to oil or gas (the `ENV_OIL` and `ENV_GAS` cameo codes, respectively)\n- Retrieve all events from EVENT related to the above by joining the two sets on `eventId` (a massive `JOIN` operation)\n- Extract the media coverage\n- Plot both the Goldstein scale and the coverage over time, grouped by country\n\n### Media coverage\n\nThe example below shows the (normalized) media coverage for both France and the United Kingdom with regard to oil and gas.\n\n![EVENT](/images/FR_UK_OIL-events.png)\n\nThat way, I can quickly eyeball any potential outbreak related to the oil and gas markets. \nProgrammatically, I define the coverage as the z-score of the number of articles per country. \nI should define a threshold above which an event is considered a major outbreak, but for now, let's just take \nthe top 1,000 country / date tuples (i.e. the 1,000 most heavily covered events). 
\n\nThe idea is then to enrich the full data with the actual events that took place on those dates, at those places. \n\n```\n+-----------+-------+------------------+\n|       date|country|          coverage|\n+-----------+-------+------------------+\n|2016-03-03 |     NC|13.662072684496462|\n|2016-04-14 |     MJ|13.414666979122172|\n|2016-02-25 |     NC|12.405574670546713|\n|2014-09-25 |     SU|  9.48863214818537|\n|2015-07-30 |     GA|  9.29919393241244|\n|2016-01-21 |     SZ| 8.868822058237168|\n|2016-07-07 |     FJ| 8.484678646900855|\n|2017-08-03 |     VE|  8.31103934678676|\n|2017-07-27 |     VE|  8.02805049313106|\n|2017-02-23 |     MY| 8.022447901213992|\n|2015-03-12 |     MP| 8.000639172377282|\n|2016-04-07 |     MJ| 7.972090458950071|\n|2016-02-25 |     IV| 7.945543203860656|\n|2015-07-09 |     EC|7.7897667107480775|\n|2016-05-05 |     CA|  7.56970466851033|\n|2015-01-15 |     MC| 7.471498021926732|\n|2016-01-14 |     IR|7.2391919471156125|\n|2016-07-14 |     CH| 7.145798060874983|\n|2016-07-14 |     RP| 7.072568219684949|\n|2015-07-23 |     GA| 6.950294549668176|\n+-----------+-------+------------------+\n```\n\n- I extracted only the top 1,000 (in order to limit the number of articles to fetch in a 4h competition)\n- I can safely fetch all articles from online websites using the URLs provided in the GDELT data model. \n- The list of URLs I get back contains around 30K entries. On my 10-node cluster, I reckon it should take around 30mn to scrape all of those. 
\n- I only fetch the first third and build an efficient web scraper that I distribute across my 10 nodes (took 15mn overall)\n\n### Fetching HTML content\n\nFor that purpose, I'm using a version of [Goose](https://github.com/GravityLabs/goose/wiki) that I recompiled for Scala 2.11\n\n```\n+-------+---------+---------------------------------------------------------------------------------------------------------+----------+\n|country|goldstein|title                                                                                                    |date      |\n+-------+---------+---------------------------------------------------------------------------------------------------------+----------+\n|BL     |-10.0    |Pope's 'homecoming' tour moves from Ecuador to Bolivia                                                   |2015-07-08|\n|IV     |-10.0    |One pirate killed, four arrested after raid on hijacked ship                                             |2016-02-22|\n|NO     |10.0     |Copter fuselage retrieved, search still on for missing                                                   |2016-04-30|\n|VE     |-10.0    |Several nations see Venezuela vote as a sham--Aleteia                                                    |2017-07-31|\n|MY     |-10.0    |North Korean diplomat warned to cooperate in Kim Jong Un’s alleged assassination investigation - National|2017-02-25|\n|SZ     |-10.0    |22,000 Islamic State jihadists have been killed by coalition, France claims                              |2016-01-22|\n|EC     |-10.0    |Pope's 'homecoming' tour moves from Ecuador to Bolivia                                                   |2015-07-08|\n|IR     |-10.0    |The Other News: Ayatollah Ali Khamenei                                                                   |2016-01-11|\n|MP     |8.0      |India to fund key Mauritian infrastructure projects                                                      |2015-03-12|\n|AQ     |7.0      |Southern California monuments 
would be spared, six others would be reduced – Daily News                  |2017-09-19|\n|GA     |7.0      |Morris grateful for LDM help in fire rescue                                                              |2015-07-31|\n|FM     |4.0      |Africa and Asia forge stronger alliances                                                                 |2016-08-30|\n|NC     |2.8      |Marquesas Islands, French Polynesia: How to get to the world's most remote islands                       |2016-03-04|\n|SU     |1.9      |S. Korea holds send-off ceremony for U.N. mission to South Sudan                                         |2014-09-23|\n+-------+---------+---------------------------------------------------------------------------------------------------------+----------+\n```\n\nThose are the news events that happened on those dates, at those places, \nand that were identified as breaking news articles with regard to either `ENV_OIL` or `ENV_GAS`. \nI reckon I should de-noise this data by looking at the text content, applying some NLP and topic modeling perhaps, \nbut the second article in the list is already of great value: clearly, piracy off the Ivory Coast should have \na strong impact on the oil and gas markets. Let's rely on the taxonomy provided by GDELT and assume all of those were actual\noil and gas events.\n\n![PIRACY](/images/piracy.jpg)\n\nI have now enriched my raw data with articles I know could have a serious impact on the markets. \nI will (hopefully) be using this information later when inferring series of events that could affect oil and gas prices.\n\n\u003ca name=\"NETWORK\"\u003e\u003c/a\u003e\n## Network Analysis\n\nThe idea here is to look at the possible connections in the oil and gas markets, \nturning GKG into a social graph that can be analysed further. \nMy end goal here is to extract relations, infer communities, and find out the common denominator among communities. 
\nTogether with the list of raw articles I managed to extract earlier, \nI should be able to see the influence a particular event may have in this network graph.\n\nFirst, I extract all GKG events related to `ENV_OIL` or `ENV_GAS`.\n\nIn terms of community detection, due to the scale of the problem (see the figures below), this must be done in parallel. \nI have two possible alternatives:\n\n- WCC detection: [http://arxiv.org/pdf/1411.0557.pdf](http://arxiv.org/pdf/1411.0557.pdf)\n- Louvain modularity: [https://arxiv.org/pdf/0803.0476.pdf](https://arxiv.org/pdf/0803.0476.pdf)\n\n### Processing graph\n\nMy graph contains around 2,000,000 vertices and 78,000,000 edges, with each node having 70 connections on average. \nAlthough I'm not concerned about processing this graph as such, \nI am concerned about processing it in the remaining 1h and 40mn. \nFor the sake of the competition, I'll remove all edges with 100 or fewer articles in common between 2 different vertices (persons). \nThis can be achieved by filtering the edges first, then collecting the degree of each node and removing the nodes left without any connection.\n\n\n```scala\n  import org.apache.spark.graphx.EdgeTriplet\n\n  // Keep only the edges backed by more than 100 shared articles\n  val subgraph = graph.subgraph(\n    (et: EdgeTriplet[String, Long]) =\u003e et.attr \u003e 100,\n    (_, vData: String) =\u003e true\n  )\n\n  // Attach the degree to each vertex, then drop the vertices left with no edge\n  val subGraphDeg = subgraph.outerJoinVertices(subgraph.degrees)((vId, vData, vDeg) =\u003e {\n    (vData, vDeg.getOrElse(0))\n  }).subgraph(\n    (et: EdgeTriplet[(String, Int), Long]) =\u003e et.srcAttr._2 \u003e 0 \u0026\u0026 et.dstAttr._2 \u003e 0,\n    (_, vData: (String, Int)) =\u003e vData._2 \u003e 0\n  ).mapVertices({ case (vId, (vData, vDeg)) =\u003e\n    // Restore the original String vertex attribute\n    vData\n  })\n```\n\nThis now reduces my dimensions down to ~18,000 vertices, 330,000 edges and an \naverage of 11 connections per node. Executing WCC brings back 54 communities that \ncan be investigated further.\nIn addition to the communities, I run a simple PageRank as a direct\n measure of the \"influencer\" score. 
\n\n#### Extracting communities\n\nHere are a few examples of the communities I managed to extract through my implementation of the WCC algorithm (download the [pdf](/images/graph.pdf) for a more detailed picture)\n\n![GRAPH](/images/graph.png)\n\nNot surprisingly, Donald Trump is a big player in our graph and sits close to the center of the most important community (20 random members displayed below)\n\n```\n+-----------------+\n|           person|\n+-----------------+\n|    igor shuvalov|\n|       mary barra|\n|      gerald ford|\n|viktor yanukovych|\n|      harold hamm|\n|     steve bannon|\n|arkady dvorkovich|\n|     gary johnson|\n|    bernie sander|\n|      igor sechin|\n|    mick mulvaney|\n|    sergei lavrov|\n|  katya golubkova|\n|     ernest moniz|\n|   hilary clinton|\n|   david petraeus|\n| alexander korzun|\n|     neil gorsuch|\n|   lincoln chafee|\n|   laurent fabius|\n+-----------------+\n```\n\nInterestingly, we have some Russian / Ukrainian politician names in here. \nAlso Laurent Fabius, who was minister of foreign affairs in France at that time. \nWhilst the main community is around the big players (Donald Trump, Vladimir Putin, Barack Obama, John Kerry, Bashar Al Assad, etc.), \nthe second most important community seems to be about European and African countries (first 20 records below).\n\n```\n+------------------+\n|            person|\n+------------------+\n|     dolly edwards|\n|     umaru yaradua|\n|       olisa metuh|\n|      ibe kachikwu|\n|       garba shehu|\n|    steven sotloff|\n|      lenin moreno|\n|    stephen harper|\n| goodluck jonathan|\n|       nnamdi kanu|\n|   paolo gentiloni|\n|  muhammadu buhari|\n|    pierre trudeau|\n|  yanis varoufakis|\n|enrique pena nieto|\n|     sylvie corbet|\n|    michael fallon|\n|   patrick hodgins|\n|federica mogherini|\n|     darren palmer|\n+------------------+\n```\n\nThe first observation is that oil and gas do not seem to be one single market, but several. 
\nI'm not an expert, but I know at least 3 indices for benchmarking crude oil:\n\n- **WTI**: Refers to oil extracted from wells in the U.S. and sent via pipeline to Cushing, Oklahoma\n- **BRENT**: Produced by various entities in the North Sea\n- **OPEC**: Produced by members of OPEC (Algeria, Angola, Ecuador, Gabon, Iran, Iraq, Kuwait, Libya, Nigeria, etc.)\n\n![crude_oil_globe](/images/crude_oil_globe.jpg)\n\nIt seems that those defined communities (US + Russia, Europe + Africa, etc.) \ncould be seen as a definition of those different markets. The fact that Goodluck Jonathan and Muhammadu Buhari\n (respectively the former and current presidents of Nigeria) are \"close\" to Angela Merkel, \n David Cameron and Francois Hollande supports my theory (Nigeria is part of OPEC by the way).\n\n\u003ca name=\"INFERENCE\"\u003e\u003c/a\u003e\n## Detecting trends in the oil and gas market\n\nNow comes the last bit needed for a successful startup. \n\nWe've been able to extract major news articles around oil and gas, \nwe know which groups of people are connected together \nand the different markets these oil \u0026 gas influencers are dealing with. \nIt is time to get to the heart of the subject and look at the crude oil price. \n\nI use the Brent index provided by [QUANDL](https://www.quandl.com/collections/markets/crude-oil).\n\n![BRENT1](/images/brent.png)\n\nThe technique I am using to detect trends was invented by a friend of mine, Andrew Morgan: [TrendCalculus](https://bitbucket.org/bytesumo/trendcalculus-public). \n\nThe concept is to find all the highs and lows in my time series data, \nfinding the highest high and lowest low occurring in each moving window. I use a window of 30 days, expecting to find 36 highs and lows  \nbetween 2014 and 2017, transforming my raw series into a series of trends.\n\nOnce the trends are identified, I extract the reversals, i.e. the highest high and lowest low that were observed \nbefore a flip of a trend (moving from rising to falling, or the reverse). 
I report a few dates below.\n\n```\n+-----+--------------------+-----+\n|trend|                   x|    y|\n+-----+--------------------+-----+\n|  LOW|2015-01-13 00:00:...|45.13|\n| HIGH|2015-05-13 00:00:...|66.33|\n|  LOW|2015-08-24 00:00:...|41.59|\n| HIGH|2015-10-08 00:00:...|52.13|\n|  LOW|2016-01-20 00:00:...|26.01|\n| HIGH|2016-06-08 00:00:...|50.73|\n|  LOW|2016-08-02 00:00:...| 40.0|\n| HIGH|2016-08-26 00:00:...|49.66|\n|  LOW|2016-09-27 00:00:...|44.95|\n| HIGH|2016-10-19 00:00:...|51.85|\n|  LOW|2016-11-13 00:00:...|41.61|\n+-----+--------------------+-----+\n```\n\nMy hypothesis is the following: \n- *There might be a breaking news event captured in GDELT that could explain those trend reversals*\n\n![BRENT2](/images/brent_H_L.png)\n\n## Inference\n\nThe rest is pure theory here, as I'm not able to progress much further in the remaining 20mn, but here is my idea:\n\n- Enrich my initial data with the trend reversals detected from the BRENT series\n- Retrieve all the articles I scraped, covering most of the breaking news events\n- Hopefully the dates line up, so that I now have plenty of articles that could have caused the market to rise or fall\n- I deduplicate those articles, group them into \"stories\" (i.e. covered by many articles) and find the ones that are contextually close to Person, Organisation, Theme, etc.\n- Thanks to GKG (though I could also extract those with a simple NER tagger), I know who is mentioned in those stories\n- I know who's connected to whom, and who's dealing with which market\n- I know the influence an event may have on a community\n- I should be able to build a labeled data set in order to train a simple classifier. 
\n\n## Conclusion\n\nWith enough time, I could train a computer to understand what event happened in what country, \nwhat its impact was on which community, \nand to predict the positive or negative effect on the crude oil markets.\n\nFinally, by exporting my model, I can apply the same in near real time (GDELT data is published every 15mn) \nso that I will be able to detect rises and falls as the events unfold.\n\nThis is my product, this is texata.ai!\n\nThank you!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faamend%2Ftexata-r2-2017","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faamend%2Ftexata-r2-2017","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faamend%2Ftexata-r2-2017/lists"}