{"id":532,"url":"https://github.com/mmcgrana/services-engineering","last_synced_at":"2026-01-26T18:53:33.616Z","repository":{"id":11016953,"uuid":"13344997","full_name":"mmcgrana/services-engineering","owner":"mmcgrana","description":"A reading list for services engineering, with a focus on cloud infrastructure services","archived":false,"fork":false,"pushed_at":"2022-10-02T14:18:29.000Z","size":49,"stargazers_count":3583,"open_issues_count":48,"forks_count":306,"subscribers_count":211,"default_branch":"master","last_synced_at":"2024-05-23T07:26:09.940Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mmcgrana.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2013-10-05T12:49:03.000Z","updated_at":"2024-05-22T10:24:45.000Z","dependencies_parsed_at":"2023-01-11T19:31:25.734Z","dependency_job_id":null,"html_url":"https://github.com/mmcgrana/services-engineering","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mmcgrana%2Fservices-engineering","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mmcgrana%2Fservices-engineering/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mmcgrana%2Fservices-engineering/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mmcgrana%2Fservices-engineering/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mmcgrana","download_url":"https://codeload.github.com/mmcgrana/services-
engineering/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244959445,"owners_count":20538628,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-01-05T20:12:57.469Z","updated_at":"2026-01-26T18:53:28.570Z","avatar_url":"https://github.com/mmcgrana.png","language":null,"readme":"## Services Engineering Reading List\n\nA reading list for services engineering, with a focus on cloud\ninfrastructure services.\n\nWe welcome [suggestions](CONTRIBUTING.md).\n\n#### Papers\n\n* [Fault Injection in Production](http://queue.acm.org/detail.cfm?id=2353017) (Allspaw)\n* [Making Reliable Distributed Systems in the Presence of Software Errors](http://www.erlang.org/download/armstrong_thesis_2003.pdf) (Armstrong)\n* [Highly Available Transactions: Virtues and Limitations](http://www.bailis.org/papers/hat-vldb2014.pdf) (Bailis et al.)\n* [The Incident Command System](http://www.high-reliability.org/files/The_Incident_Command_System.pdf) (Bigley and Roberts)\n* [The Chubby Lock Service for Loosely Coupled Distributed Systems](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/chubby-osdi06.pdf) (Burrows)\n* [Bigtable: a Distributed Storage System for Structured Data](http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/chang06bigtable.pdf) (Chang et al.)\n* [Spanner: Google’s Globally-Distributed Database](http://research.google.com/archive/spanner-osdi2012.pdf) (Corbett et al.)\n* [Dynamo: Amazon’s Highly Available Key-Value 
Store](http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/decandia07dynamo.pdf) (DeCandia et al.)\n* [MapReduce: Simplified Data Processing on Large Clusters](http://research.google.com/archive/mapreduce-osdi04.pdf) (Dean and Ghemawat)\n* [The Google File System](http://research.google.com/archive/gfs-sosp2003.pdf) (Ghemawat et al.)\n* [On Designing and Deploying Internet Scale Services](http://mvdirona.com/jrh/talksAndPapers/JamesRH_Lisa.pdf) (Hamilton)\n* [Kafka: A Distributed Messaging System for Log Processing](http://research.microsoft.com/en-us/UM/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf) (Kreps et al.)\n* [Weathering the Unexpected](http://queue.acm.org/detail.cfm?id=2371516) (Krishnan)\n* [The Unified Logging Infrastructure for Data Analytics at Twitter](http://vldb.org/pvldb/vol5/p1771_georgelee_vldb2012.pdf) (Lee et al.)\n* [Automatic Management of Partitioned, Replicated Search Services](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.222.1862\u0026rep=rep1\u0026type=pdf) (Leibert et al.)\n* [Learning to Embrace Failure](http://best.dtu.dk/SC13/p20-casestudy.pdf) (Limoncelli et al.)\n* [Scaling Big Data Mining Infrastructure: The Twitter Experience](http://www.kdd.org/sites/default/files/issues/14-2-2012-12/V14-02-02-Lin.pdf) (Lin and Ryaboy)\n* [Dremel: Interactive Analysis of Web-Scale Datasets](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/36632.pdf) (Melnik et al.)\n* [Out of the Tar Pit](http://shaffner.us/cs/papers/tarpit.pdf) (Moseley and Marks)\n* [The Log-Structured Merge-Tree](http://www.cs.umb.edu/~poneil/lsmtree.pdf) (O'Neil et al.)\n* [In Search of an Understandable Consensus Algorithm](https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf) (Ongaro and Ousterhout)\n* [Failure Trends in a Large Disk Drive 
Population](http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/disk_failures.pdf) (Pinheiro et al.)\n* [Fallacies of Distributed Computing Explained](http://www.rgoarchitects.com/Files/fallacies.pdf) (Rotem-Gal-Oz)\n* [F1 - The Fault-Tolerant Distributed RDBMS Supporting Google’s Ad Business](http://research.google.com/pubs/archive/38125.pdf) (Shute et al.)\n* [Dapper, A Large Scale Distributed Systems Tracing Infrastructure](http://research.google.com/pubs/archive/36356.pdf) (Sigelman et al.)\n* [Resilient Distributed Datasets: a Fault-Tolerant Abstraction for In-Memory Cluster Computing](https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf) (Zaharia et al.)\n* [The Human Side of Postmortems](https://docs.google.com/file/d/0Byl4UKRYLErDVlJMNDNjaThiR2M/edit) (Zwieback)\n* [Crew Resource Management: a Positive Change for the Fire Service](http://www.iaff.org/06news/NearMissKit/6.%20Crew%20Resource%20Management/CRM.pdf)\n\n\n#### Posts\n\n* [Resilience Engineering: Part I](http://www.kitchensoap.com/2011/04/07/resilience-engineering-part-i/), [Part II](http://www.kitchensoap.com/2012/06/18/resilience-engineering-part-ii-lenses/) (Allspaw)\n* [Systems Engineering: a Great Definition](http://www.kitchensoap.com/2011/07/18/systems-engineering-great-definition/) (Allspaw)\n* [Chaos Monkey Released Into The Wild](http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html) (Bennett and Tseitlin)\n* [Some Rules for Engineering and Operations](http://blog.b3k.us/2012/01/24/some-rules.html) (Black)\n* [Service Level Disagreements Part I](http://blog.b3k.us/2009/07/15/service-level-disagreements.html), [Part II](http://blog.b3k.us/2009/07/16/service-level-disagreements-2.html) (Black)\n* [Incuriosity Will Kill Your Infrastructure](http://yellerapp.com/posts/2015-03-16-incuriosity-killed-the-infrastructure.html) (Crayford)\n* [My Philosophy on 
Alerting](https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit#heading=h.whsaboyw21nk) (Ewaschuk)\n* [You Can’t Sacrifice Partition Tolerance](http://codahale.com/you-cant-sacrifice-partition-tolerance/) (Hale)\n* [Customer Trust](http://perspectives.mvdirona.com/2013/01/15/CustomerTrust.aspx) (Hamilton)\n* [Observations on Errors, Corrections, \u0026 Trust of Dependent Systems](http://perspectives.mvdirona.com/2012/02/26/ObservationsOnErrorsCorrectionsTrustOfDependentSystems.aspx) (Hamilton)\n* [Game Day Exercises at Stripe: Learning from `kill -9`](https://stripe.com/blog/game-day-exercises-at-stripe) (Hedlund)\n* [Life Beyond Distributed Transactions: An Apostate’s Opinion](http://cs.brown.edu/courses/cs227/archives/2012/papers/weaker/cidr07p15.pdf) (Helland)\n* [Notes on Distributed Systems for Young Bloods](http://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/) (Hodges)\n* [The Network is Reliable](http://aphyr.com/posts/288-the-network-is-reliable) (Kingsbury)\n* [The Trouble with Clocks](http://aphyr.com/posts/299-the-trouble-with-timestamps) (Kingsbury)\n* [Call Me Maybe: Final Thoughts](http://aphyr.com/posts/286-call-me-maybe-final-thoughts) (Kingsbury)\n* [Getting Real About Distributed Systems Reliability](http://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability) (Kreps)\n* [The Log: What every software engineer should know about real-time data's unifying abstraction](http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying) (Kreps)\n* [Incident Response at Heroku](https://blog.heroku.com/archives/2014/5/9/incident-response-at-heroku) (McGranaghan)\n* [On HTTP Load Testing](http://www.mnot.net/blog/2011/05/18/http_benchmark_rules) (Nottingham)\n* [Observability at Twitter](https://blog.twitter.com/2013/observability-at-twitter) (Watson)\n* [Stevey’s Google Platforms 
Rant](https://gist.github.com/chitchcock/1281611) (Yegge)\n\n#### Presentations\n\n* [Design, Lessons, and Advice from Building Distributed Systems at Google](http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf) (Dean)\n* [Service Design Best Practices](http://www.mvdirona.com/jrh/TalksAndPapers/JamesHamilton_POA20090226.pdf) (Hamilton)\n\n#### Books\n\n* [The Field Guide To Understanding Human Error](http://www.amazon.com/Field-Guide-Understanding-Human-Error/dp/0754648265) (Dekker)\n* [Agile Retrospectives: Making Good Teams Great](http://www.amazon.com/Agile-Retrospectives-Making-Teams-Great/dp/0977616649) (Derby et al.)\n* [Better: A Surgeon’s Notes on Performance](http://www.amazon.com/dp/0312427654) (Gawande)\n* [The Checklist Manifesto: How to Get Things Right](http://www.amazon.com/The-Checklist-Manifesto-ebook/dp/B0030V0PEW) (Gawande)\n* [High Performance Browser Networking](http://chimera.labs.oreilly.com/books/1230000000545/index.html) (Grigorik)\n* [Resilience Engineering in Practice](http://www.amazon.com/Resilience-Engineering-Practice-Ashgate-Studies/dp/1409410358/) (Hollnagel et al.)\n* [Effective Monitoring and Alerting](http://www.amazon.com/Effective-Monitoring-Alerting-For-Operations/dp/1449333524) (Ligus)\n* [Release It!: Design and Deploy Production-Ready Software](http://www.amazon.com/Release-It-Production-Ready-Pragmatic-Programmers/dp/0978739213) (Nygard)\n* [The Challenger Launch Decision](http://www.amazon.com/The-Challenger-Launch-Decision-Technology/dp/0226851761) (Vaughan)\n* [Managing the Unexpected](http://www.amazon.com/gp/product/B004IK9U4U) (Weick and Sutcliffe)\n\n#### Research Groups\n\n* [Berkeley AMP Lab](https://amplab.cs.berkeley.edu/)\n* [Berkeley Database Group](http://db.cs.berkeley.edu/w/)\n* [Google Research](http://research.google.com/)\n* [Microsoft Systems Research](http://research.microsoft.com/en-US/groups/sr/default.aspx)\n* [Twitter 
Research](https://engineering.twitter.com/research)\n\n#### Conferences\n\n* [Monitorama](http://monitorama.com/)\n* [Ricon](http://ricon.io/)\n* [Surge](http://surge.omniti.com/)\n* [Velocity](http://velocityconf.com/)\n","funding_links":[],"categories":["Miscellaneous","Operations","Technical","Others","Papers","Uncategorized","General","其他","Live Site:   [searchAwesome](https://search-awesome.vercel.app/)","others","Other Lists","杂项","Info","Themed Directories","Backend"],"sub_categories":["Burn Iso","Uncategorized","ramanihiteshc@gmail.com","TeX Lists","Other Good Places to Find Papers"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmmcgrana%2Fservices-engineering","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmmcgrana%2Fservices-engineering","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmmcgrana%2Fservices-engineering/lists"}