{"id":14982363,"url":"https://github.com/rjurney/agile_data_code_2","last_synced_at":"2025-04-12T17:40:54.395Z","repository":{"id":41512209,"uuid":"54056503","full_name":"rjurney/Agile_Data_Code_2","owner":"rjurney","description":"Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition","archived":false,"fork":false,"pushed_at":"2024-06-18T01:39:38.000Z","size":24291,"stargazers_count":458,"open_issues_count":8,"forks_count":307,"subscribers_count":43,"default_branch":"master","last_synced_at":"2025-04-03T18:15:31.856Z","etag":null,"topics":["agile-data","agile-data-science","airflow","amazon-ec2","amazon-web-services","analytics","apache-kafka","apache-spark","data","data-science","data-syndrome","kafka","machine-learning","machine-learning-algorithms","predictive-analytics","python","python-3","python3","spark","vagrant"],"latest_commit_sha":null,"homepage":"http://bit.ly/agile_data_science","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rjurney.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-03-16T18:25:28.000Z","updated_at":"2024-11-11T13:21:52.000Z","dependencies_parsed_at":"2023-01-25T13:16:11.920Z","dependency_job_id":"2c7cd8e9-e555-4adb-9f0f-991fdd725aeb","html_url":"https://github.com/rjurney/Agile_Data_Code_2","commit_stats":{"total_commits":980,"total_committers":22,"mean_commits":44.54545454545455,"dds":0.07551020408163267,"last_synced_commit":"862b4959adeeb0d5d4e6f9bc24452a47b0b9c70a"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_ur
l":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rjurney%2FAgile_Data_Code_2","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rjurney%2FAgile_Data_Code_2/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rjurney%2FAgile_Data_Code_2/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rjurney%2FAgile_Data_Code_2/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rjurney","download_url":"https://codeload.github.com/rjurney/Agile_Data_Code_2/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248607608,"owners_count":21132572,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agile-data","agile-data-science","airflow","amazon-ec2","amazon-web-services","analytics","apache-kafka","apache-spark","data","data-science","data-syndrome","kafka","machine-learning","machine-learning-algorithms","predictive-analytics","python","python-3","python3","spark","vagrant"],"created_at":"2024-09-24T14:05:16.581Z","updated_at":"2025-04-12T17:40:54.359Z","avatar_url":"https://github.com/rjurney.png","language":"Jupyter Notebook","readme":"# Agile Data Science 2.0 (O'Reilly, 2017)\n\nThis repository contains the updated source code for [Agile Data Science 2.0](http://shop.oreilly.com/product/0636920051619.do), O'Reilly 2017. 
Now available at the [O'Reilly Store](http://shop.oreilly.com/product/0636920051619.do), on [Amazon](https://www.amazon.com/Agile-Data-Science-2-0-Applications/dp/1491960116) (in Paperback and Kindle) and on [O'Reilly Safari](https://www.safaribooksonline.com/library/view/agile-data-science/9781491960103/). Also available anywhere technical books are sold!\n\nNOTE: THE BOOK'S CODE IS OLD, BUT THE CODE IS MAINTAINED. USE DOCKER COMPOSE AND THE NOTEBOOKS IN THIS REPOSITORY.\n\nYou should refer to the Jupyter Notebooks in this repository rather than the book's source code, which is badly outdated and will no longer work for you.\n\nHave problems? Please [file an issue](https://github.com/rjurney/Agile_Data_Code_2/issues)!\n\n## Deep Discovery\n\nLike my work? Connect with me [on LinkedIn](https://linkedin.com/in/russelljurney)!\n\n## Installation and Execution\n\nThere is now only ONE version of the install: Docker via the [docker-compose.yml](docker-compose.yml). It is MUCH EASIER than the old methods.\n\nTo build the `agile` Docker image, run this:\n\n```bash\ndocker-compose build agile\n```\n\nTo run the `agile` Docker image, defined by the [`docker-compose.yml`](docker-compose.yml) and [`Dockerfile`](Dockerfile), run:\n\n```bash\ndocker-compose up -d\n```\n\nNow visit: [http://localhost:8888](http://localhost:8888)\n\n## Other Images\n\nTo manage the `mongo` image with Mongo Express, visit: [http://localhost:8081](http://localhost:8081)\n\n## Downloading Data\n\nOnce the server comes up, download the data and you are ready to go. First, open a shell in Jupyter Lab. The working directory corresponds to this folder.\n\nNow download the data:\n\n```bash\n./download.sh\n```\n\n## Running Examples\n\nAll scripts run from the base directory, except the web app, which runs from its chapter directory, e.g. `ch08/web/`. 
Open [Welcome.ipynb](Welcome.ipynb) and get started.\n\n### Jupyter Notebooks\n\nAll notebooks assume you have run the `jupyter notebook` command from the project root directory `Agile_Data_Code_2`. If you are using a virtual machine image (Vagrant/Virtualbox or EC2), `jupyter notebook` is already running. See directions on port mapping to proceed.\n\n# The Data Value Pyramid\n\nOriginally by Pete Warden, the data value pyramid is how the book is organized and structured. We climb it as we move forward through each chapter.\n\n![Data Value Pyramid](images/climbing_the_pyramid_chapter_intro.png)\n\n# System Architecture\n\nThe following diagrams are pulled from the book, and express the basic concepts in the system architecture. The front and back end architectures work together to make a complete predictive system.\n\n## Front End Architecture\n\nThis diagram shows how the front end architecture works in our flight delay prediction application. The user fills out a form with some basic information on a web page, which is submitted to the server. The server fills out some necessary fields derived from those in the form, like \"day of year\", and emits a Kafka message containing a prediction request. Spark Streaming is listening on a Kafka queue for these requests, and makes the prediction, storing the result in MongoDB. Meanwhile, the client has received a UUID in the form's response, and has been polling another endpoint every second. Once the data is available in Mongo, the client's next request picks it up. Finally, the client displays the result of the prediction to the user! \n\nThis setup is extremely fun to set up, operate, and watch. Check out chapters 7 and 8 for more information!\n\n![Front End Architecture](images/front_end_realtime_architecture.png)\n\n## Back End Architecture\n\nThe back end architecture diagram shows how we train a classifier model using historical data (all flights from 2015) on disk (HDFS or Amazon S3, etc.) 
to predict flight delays in batch in Spark. We save the model to disk when it is ready. Next, we launch Zookeeper and a Kafka queue. We use Spark Streaming to load the classifier model, and then listen for prediction requests on that queue. When a prediction request arrives, Spark Streaming makes the prediction, storing the result in MongoDB, where the web application can pick it up.\n\nThis architecture is extremely powerful, and it is a huge benefit that we get to use the same code in batch and in realtime with PySpark Streaming.\n\n![Backend Architecture](images/back_end_realtime_architecture.png)\n\n# Screenshots\n\nBelow are some examples of parts of the application we build in this book and in this repo. Check out the book for more!\n\n## Airline Entity Page\n\nEach airline gets its own entity page, complete with a summary of its fleet and a description pulled from Wikipedia.\n\n![Airline Page](images/airline_page_enriched_wikipedia.png)\n\n## Airplane Fleet Page\n\nWe demonstrate summarizing an entity with an airplane fleet page that describes the entire fleet.\n\n![Airplane Fleet Page](images/airplanes_page_chart_v1_v2.png)\n\n## Flight Delay Prediction UI\n\nWe create an entire realtime predictive system with a web front-end to submit prediction requests.\n\n![Predicting Flight Delays UI](images/predicting_flight_kafka_waiting.png)\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frjurney%2Fagile_data_code_2","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frjurney%2Fagile_data_code_2","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frjurney%2Fagile_data_code_2/lists"}