{"id":18604813,"url":"https://github.com/repnot/coursera_capstone","last_synced_at":"2025-04-10T20:30:49.796Z","repository":{"id":41463795,"uuid":"164512781","full_name":"REPNOT/Coursera_Capstone","owner":"REPNOT","description":"Capstone project repository for the IBM Data Science program offered on Coursera","archived":true,"fork":false,"pushed_at":"2023-09-22T07:20:21.000Z","size":474,"stargazers_count":2,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-04-04T05:42:42.174Z","etag":null,"topics":["api-rest","data-science","database","ibm-db2","ibm-watson","python","sql"],"latest_commit_sha":null,"homepage":"https://derekevansonnotion.notion.site/f1fb169970794ab3abf9c6fbfea11962?v=61fb5c3c0edf450787ee81dc842ec71d\u0026pvs=4","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/REPNOT.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-01-07T23:18:41.000Z","updated_at":"2024-06-03T01:48:05.000Z","dependencies_parsed_at":"2022-09-05T03:40:23.851Z","dependency_job_id":null,"html_url":"https://github.com/REPNOT/Coursera_Capstone","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/REPNOT%2FCoursera_Capstone","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/REPNOT%2FCoursera_Capstone/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/REPNOT%2FCoursera_Capstone/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/REPNOT%2FCoursera_Capstone/manife
sts","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/REPNOT","download_url":"https://codeload.github.com/REPNOT/Coursera_Capstone/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248290019,"owners_count":21078923,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["api-rest","data-science","database","ibm-db2","ibm-watson","python","sql"],"created_at":"2024-11-07T02:19:04.767Z","updated_at":"2025-04-10T20:30:49.433Z","avatar_url":"https://github.com/REPNOT.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003eFinal Report\u003cbr\u003eAssignment By: Derek Evans\u003cbr\u003eSubmitted On: Jan. 20th, 2019\u003c/h1\u003e\n\n\u003cp\u003eThe main scope of this project is to deploy an automated data collection solution using the \u003ca href=\"https://www.ibm.com/watson/index.html\" target=\"_blank\"\u003eWatson Studio\u003c/a\u003e data science platform in conjunction with a \u003ca href=\"https://www.ibm.com/cloud/db2-on-cloud\" target=\"_blank\"\u003eIBM DB2 on Cloud\u003c/a\u003e database.  In developing this solution, my intention was to solve a problem that was mentioned by one of the instructors from the Data Science Methodology course of this program.  During one of the course presentations, the instructor had placed an emphasis on mentioning that one of the most difficult and time-consuming aspects of the data science process is data collection, processing, and generation of new data.  
While I'm not a data scientist by any means, I can relate to this and agree that the work involved in preparing data for any kind of analysis or research is not only difficult but exhausting.  This is the type of work that often has little to no reportable production output and, depending on the circumstances, can come under a great deal of criticism because allocating human capital to such efforts is costly. \u003c/p\u003e \n\n\u003cp\u003eBy using a series of \u003ca href=\"https://jupyter.org/\" target=\"_blank\"\u003eJupyter\u003c/a\u003e notebooks and a cloud database, I was able not only to automate the process of retrieving data from the Foursquare API, but also to generate new data.  When a user performs a location-based trending search using any of the \u003ca href=\"https://foursquare.com/\" target=\"_blank\"\u003eFoursquare\u003c/a\u003e applications or API calls, the search results only return trending data based on the number of users checked in at a location at the time the search is executed.  Once the browser or application is closed, the user no longer has access to the information provided in that search.  This is one of the issues this project solves: it executes an API call that performs a trending search and stores the search results in a database for permanent storage.  In doing this, the database accumulates historical time-series trending venue data that can be used in near real time, subject to a preestablished time delay. \u003c/p\u003e\n\n\u003cp\u003eNow, I'm sure Foursquare already has this data and offers it to customers through one of its premium subscription services, which start at around $600 a month, but I couldn’t find solid historical data of this nature through any of the channels offered with the unpaid developer subscriptions.  
The closest thing I found was the \u003ca href=\"https://developer.foursquare.com/docs/api/venues/details\" target=\"_blank\"\u003e\u003cstrong\u003e\u003cem\u003ePopular\u003c/em\u003e\u003c/strong\u003e\u003c/a\u003e response field under the \u003ca href=\"https://developer.foursquare.com/docs/api/venues/details\" target=\"_blank\"\u003e\u003cstrong\u003e\u003cem\u003eGet Venue Details\u003c/em\u003e\u003c/strong\u003e\u003c/a\u003e endpoint API call.  This endpoint gives users the range of hours during which the searched location has the most traffic.  This information doesn't give me a weighted value of any kind for comparing how popular the venue is at that time relative to other venues. \u003c/p\u003e\n\n\u003cp\u003eI'll be forthcoming in pointing out that the intent of this project was to develop a solution that accelerates the data collection and processing stages of the data science methodology.  Most of my time was focused on developing a solution to aid future data science projects.  I was able to generate new time-series data using the trending venue API call.  This data can be viewed in the \u003ca href=\"https://github.com/REPNOT/Coursera_Capstone/blob/master/Final%20Assignment%20-%20Analysis%20and%20Report.ipynb\" target=\"_blank\"\u003eAnalysis and Report\u003c/a\u003e notebook stored in my GitHub repository, which is one of the 15 notebooks utilized for this project.  The list below contains links to the notebooks that were utilized throughout the course of the project, not including the numerous testing files generated throughout the troubleshooting process. 
\u003c/p\u003e\n\n\u003cul\u003e\n    \u003cli\u003e\u003cstrong\u003e\u003ca href=\"https://github.com/REPNOT/Coursera_Capstone/blob/master/Final%20Assignment%20-%20Web%20Scraper%20%20.ipynb\" target=\"_blank\"\u003eWeb Scraper Notebook\u003c/a\u003e\u003c/strong\u003e\u003c/li\u003e\n    \u003cli\u003e\u003cstrong\u003e\u003ca href=\"https://github.com/REPNOT/Coursera_Capstone/blob/master/Final%20Assignment%20-%20Data%20Retrieval.ipynb\" target=\"_blank\"\u003eData Retrieval Notebook\u003c/a\u003e\u003c/strong\u003e\u003c/li\u003e\n    \u003cli\u003e\u003cstrong\u003e\u003ca href=\"https://github.com/REPNOT/Coursera_Capstone/blob/master/Final%20Assignment%20-%20Analysis%20and%20Report.ipynb\" target=\"_blank\"\u003eAnalysis \u0026amp; Report Notebook\u003c/a\u003e\u003c/strong\u003e\u003c/li\u003e\n    \u003cli\u003e\u003cstrong\u003e\u003ca href=\"https://github.com/REPNOT/Coursera_Capstone/blob/master/Final%20Assignment%20-%20Data%20Collector%201.ipynb\" target=\"_blank\"\u003eData Collector 1\u003c/a\u003e\u003c/strong\u003e\u003c/li\u003e\n    \u003cli\u003e\u003cstrong\u003e\u003ca href=\"https://github.com/REPNOT/Coursera_Capstone/blob/master/Final%20Assignment%20-%20Data%20Collector%202.ipynb\" target=\"_blank\"\u003eData Collector 2\u003c/a\u003e\u003c/strong\u003e\u003c/li\u003e\n    \u003cli\u003e\u003cstrong\u003e\u003ca href=\"https://github.com/REPNOT/Coursera_Capstone/blob/master/Final%20Assignment%20-%20Data%20Collector%203.ipynb\" target=\"_blank\"\u003eData Collector 3\u003c/a\u003e\u003c/strong\u003e\u003c/li\u003e\n    \u003cli\u003e\u003cstrong\u003e\u003ca href=\"https://github.com/REPNOT/Coursera_Capstone/blob/master/Final%20Assignment%20-%20Data%20Collector%204.ipynb\" target=\"_blank\"\u003eData Collector 4\u003c/a\u003e\u003c/strong\u003e\u003c/li\u003e\n    \u003cli\u003e\u003cstrong\u003e\u003ca href=\"https://github.com/REPNOT/Coursera_Capstone/blob/master/Final%20Assignment%20-%20Data%20Collector%205.ipynb\" 
target=\"_blank\"\u003eData Collector 5\u003c/a\u003e\u003c/strong\u003e\u003c/li\u003e\n    \u003cli\u003e\u003cstrong\u003e\u003ca href=\"https://github.com/REPNOT/Coursera_Capstone/blob/master/Final%20Assignment%20-%20Data%20Collector%206.ipynb\" target=\"_blank\"\u003eData Collector 6\u003c/a\u003e\u003c/strong\u003e\u003c/li\u003e\n    \u003cli\u003e\u003cstrong\u003e\u003ca href=\"https://github.com/REPNOT/Coursera_Capstone/blob/master/Final%20Assignment%20-%20Data%20Collector%207.ipynb\" target=\"_blank\"\u003eData Collector 7\u003c/a\u003e\u003c/strong\u003e\u003c/li\u003e\n    \u003cli\u003e\u003cstrong\u003e\u003ca href=\"https://github.com/REPNOT/Coursera_Capstone/blob/master/Final%20Assignment%20-%20Data%20Collector%208.ipynb\" target=\"_blank\"\u003eData Collector 8\u003c/a\u003e\u003c/strong\u003e\u003c/li\u003e\n    \u003cli\u003e\u003cstrong\u003e\u003ca href=\"https://github.com/REPNOT/Coursera_Capstone/blob/master/Final%20Assignment%20-%20Data%20Collector%209.ipynb\" target=\"_blank\"\u003eData Collector 9\u003c/a\u003e\u003c/strong\u003e\u003c/li\u003e\n    \u003cli\u003e\u003cstrong\u003e\u003ca href=\"https://github.com/REPNOT/Coursera_Capstone/blob/master/Final%20Assignment%20-%20Data%20Collector%2010.ipynb\" target=\"_blank\"\u003eData Collector 10\u003c/a\u003e\u003c/strong\u003e\u003c/li\u003e\n    \u003cli\u003e\u003cstrong\u003e\u003ca href=\"https://github.com/REPNOT/Coursera_Capstone/blob/master/Final%20Assignment%20-%20Data%20Collector%2011.ipynb\" target=\"_blank\"\u003eData Collector 11\u003c/a\u003e\u003c/strong\u003e\u003c/li\u003e\n    \u003cli\u003e\u003cstrong\u003e\u003ca href=\"https://github.com/REPNOT/Coursera_Capstone/blob/master/Final%20Assignment%20-%20Data%20Collector%2012.ipynb\" target=\"_blank\"\u003eData Collector 12\u003c/a\u003e\u003c/strong\u003e\u003c/li\u003e\n\u003c/ul\u003e\n\n\n\u003cp\u003eThe data contained in the Analysis and Report notebook is a combination of data generated by the Web Scraper notebook 
that was used in the data collectors, along with new data produced by the data collectors.  For the actual data that was used, or can be used, to perform an analysis, I was able to collect trending venue data for various cities in five-minute intervals spanning a period of 5 hours and 36 minutes on Sunday, Jan. 20th, 2019.  While this is not a significant amount of data, it was generated autonomously using the structure shown in the diagram below. \u003c/p\u003e\n\n\u003ch3 align=\"center\"\u003e\u003ca href=\"https://share.mindmanager.com/#publish/5Q7GMjJ-1gXEDz4VOR68-SEZmFcPAlOZZmjFhl2R\" target=\"_blank\"\u003eProject Diagram\u003c/a\u003e\u003c/h3\u003e\n\n\u003chr\u003e\n\n\u003cimg src=\"https://fgy7oa.dm.files.1drv.com/y4mOktvWwvwzHpJl87YCdPKjhoA2hEen9Jv71BP-u4Lgn59TxcB6JvEaY7z4W90SEq5TPeza21zVu-JlOWn_KXn7szHSOZMSVvN45jZ3ITV6daRgtMOSuisIrQEcPIxTzwt57kzXfmNrVSvI8mmcNorkzY6LOSocYQBryPJAMtyES1Ut6r8ip9jN6tW8I2OmQkxY1jKYVZYvhJfBSn0FT27ow?width=1388\u0026height=803\u0026cropmode=none\" width=\"1024\" height=\"592\" /\u003e\n\n\u003chr\u003e\n\n\u003cp\u003eThe Data Retrieval notebook was created as a workaround and will eventually be eliminated unless another purpose for it is found.  It was added to the scope of the project as a temporary solution for exceeding my maximum daily API call limit, which is capped at 950 calls per day.  One of the problems I've been having throughout the course of the project is inconsistent data being returned when performing an API call to the Foursquare database.  Based on my observations, the search results vary based on the city and venue identified in the search.  While I'm not completely sure, I believe venue plays a more significant role in the variations; however, the differences tend to be somewhat consistent from one city to the next.  
In addition to this, I've noticed that since I began working on this project, the incoming data has been changing, leading me to believe that the records contained in the database are being updated simultaneously.  This becomes an issue because I've been working to program exceptions into the data collectors in order to produce a consistent data frame prior to submitting it to the DataSciDB database. \u003c/p\u003e\n\n\u003cp\u003eIf you review the results generated by the data collectors in the \u003cstrong\u003e\u003ca href=\"https://github.com/REPNOT/Coursera_Capstone/blob/master/Final%20Assignment%20-%20Analysis%20and%20Report.ipynb\" target=\"_blank\"\u003e\u003cem\u003eAnalysis \u0026amp; Report Notebook\u003c/em\u003e\u003c/a\u003e\u003c/strong\u003e you'll notice that the data isn't uniform in the dataframe.  In order to produce an auto-generated data sample for this project, I had to start letting some of the unprocessed data through to the final dataframe in order to get results sent to the DataSciDB database for storage.  From there the data is retrieved and processed in the Data Retrieval notebook.\u003c/p\u003e\n\n\u003cp\u003eThe most critical aspect of the project is the data collectors, which handle all of the hard work by performing API calls, retrieving the search results, processing the incoming data, and sending it to the database for retrieval.  Originally, I only intended to use four data collectors that would be scheduled to run every hour, 15 minutes apart.  As a means of producing a large enough data set for finishing this project, I increased that number to 12 and scheduled each one to run hourly, five minutes apart from one another.  This resulted in a lot of overlapping data that probably isn't necessary, but it does provide a good example of the system capabilities available with the IBM Watson Studio platform.  Once I started letting unprocessed data through, the program worked very well.  
I monitored the Trending Data table in the DataSciDB database during the process, and the table was automatically refreshing as new data came in.\u003c/p\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frepnot%2Fcoursera_capstone","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frepnot%2Fcoursera_capstone","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frepnot%2Fcoursera_capstone/lists"}