{"id":20285457,"url":"https://github.com/mramshaw/ml_at_scale","last_synced_at":"2026-02-02T05:42:17.013Z","repository":{"id":92905287,"uuid":"162717929","full_name":"mramshaw/ML_at_Scale","owner":"mramshaw","description":"An operational description of ML at Scale","archived":false,"fork":false,"pushed_at":"2019-11-16T03:22:48.000Z","size":5,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-06-22T13:07:03.688Z","etag":null,"topics":["business-analyst","data-engineer","data-scientist","etl","ml","production-engineer"],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mramshaw.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-12-21T13:33:55.000Z","updated_at":"2025-05-01T14:55:07.000Z","dependencies_parsed_at":"2023-04-29T00:53:41.338Z","dependency_job_id":null,"html_url":"https://github.com/mramshaw/ML_at_Scale","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/mramshaw/ML_at_Scale","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mramshaw%2FML_at_Scale","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mramshaw%2FML_at_Scale/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mramshaw%2FML_at_Scale/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mramshaw%2FML_at_Scale/manifests","owner_url":"https://repos.ecosyste.ms/api/v
1/hosts/GitHub/owners/mramshaw","download_url":"https://codeload.github.com/mramshaw/ML_at_Scale/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mramshaw%2FML_at_Scale/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261296912,"owners_count":23137218,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["business-analyst","data-engineer","data-scientist","etl","ml","production-engineer"],"created_at":"2024-11-14T14:26:49.687Z","updated_at":"2026-02-02T05:42:16.952Z","avatar_url":"https://github.com/mramshaw.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# ML_at_Scale\n\nA quick description of the roles found in organizations that use ML at Scale.\n\n## Overview\n\n\u003e ___\"Moore's Law is dead.\"___ - Bill Dally, NVIDIA Chief Scientist\n\nThe above quote was in reference to increasing hardware performance to train\nand test models on large datasets. But to increase the performance and scale of human\noperations, the standard approach is to create increasingly specialized roles.\n\nML is all well and good, but in a business context it must be able to deliver\nvalue to the business. For that reason, it makes sense to create specialist\nroles within a business's ML department.\n\n![ML at Scale](images/ML.svg)\n\n## Roles\n\nAs shown above, there are four roles in Production ML departments:\n\n1. [Business Analyst](#business-analyst)\n2. [Data Engineer](#data-engineer)\n3. [Data Scientist](#data-scientist)\n4. 
[Production Engineer](#production-engineer)\n\nAll of the above roles need some understanding and experience of ML practices.\n\n#### Business Analyst\n\nThis is more of an outward-facing role. It is over-arching and is\nabout re-framing ML practices into a business framework in terms of business\nvalue (as defined by deliverables).\n\nIt is probably the least technical role, although technology and ML\nexpertise still play a large part in how valuable this person may be.\n\n#### Data Engineer\n\nThis role is largely technical and mainly involves [ETL](#etl); however,\ndata cleaning and transformation play a large part. Populating and/or\nenriching sparse data may also be a requirement. Database knowledge\nand expertise (both of traditional relational databases [SQL] and of NoSQL [CQL,\n GraphQL] databases) are critical for this role.\n\n#### Data Scientist\n\nThis is the most purely technical role. Experienced data scientists\nare very hard to find and generally have very specific (and deep)\nexpertise and training. For this reason, it is best that they are\nprovided with clean data so that they can focus their energies on\ntasks that they have been trained to do.\n\nMathematics, statistics, probability theory and generally either\nR or Python expertise seem to be the main requirements for this role.\n\n#### Production Engineer\n\nThis role calls for a generalist. Once an ML model has been trained,\nit must then be deployed in order to become operational and deliver\nbusiness benefit.\n\nThe how and why of this deployment can be very technical, but generally\nwill involve a number of separate disciplines - mostly involving more\ntraditional IT practices.\n\nGenerally, this is the role that a traditional data scientist is\n__least__ qualified to perform. For this reason, it is a good idea\nto source a candidate with a much broader skill-set. 
What these\nskills will need to be largely depends upon the pre-existing\nproduction systems.\n\n## Rolling Deployment\n\nTrained models can degrade quickly as new data diverges from the data they were\ntrained on - which means that models must be re-trained\nfrom time to time (as new data becomes available) and likewise re-deployed.\n\nThis largely involves the data scientist and the production engineer - who\nmust have knowledge of staged deployments as well as - ideally - expertise\nwith CI/CD practices.\n\n## References\n\n#### ETL\n\n    http://en.wikipedia.org/wiki/Extract%2C_transform%2C_load\n\n[Well worth reading for some historical insight into this process.]\n\n#### Moore's Law\n\n    http://en.wikipedia.org/wiki/Moore's_law\n\nStates (a little sniffily) that:\n\n\u003e __Moore's law__ is the ___observation___\n\n#### Moore's Law is dead\n\n    https://changelog.com/practicalai/15\n\n[A fairly interesting overview of recent work from NVIDIA.]\n\n## Credits\n\nInspired by this podcast on behavioral economics and AI-driven decision making:\n\n    http://changelog.com/practicalai/9\n\n[Practical AI is an excellent podcast, some great stuff and worth following.]\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmramshaw%2Fml_at_scale","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmramshaw%2Fml_at_scale","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmramshaw%2Fml_at_scale/lists"}