Come listen to experts in building infrastructure and enabling development and deployment processes discuss the ideas and technologies involved in DevOps.
Similar Podcasts

Flutter 101 Podcast
Weekly podcast focusing on software development with Flutter and Dart. Hosted by Vince Varga.

Views on Vue
Vue is a growing front-end framework for web development. Hear experts cover technologies and movements within the Vue community by talking to members of the open-source and development community.

React Round Up
Stay current on the latest innovations and technologies in the React community by listening to our panel of React and Web Development Experts.
How to Build in Observability at Petabyte Scale
We welcome guest Ang Li and dive into the immense challenge of observability at scale, where some customers generate petabytes of data per day. Ang explains that instead of building a database from scratch—a decision he says went "against all the instincts" of a founding engineer—Observe chose to build its platform on top of Snowflake, leveraging its separation of compute and storage on EC2 and S3.

The discussion delves into the technical stack and architectural decisions, including the use of Kafka to absorb large bursts of incoming customer data and smooth them out for Snowflake's batch-based engine. Ang notes this choice was also strategic for avoiding tight coupling with a single cloud provider's service, like AWS Kinesis, which would hinder future multi-cloud deployments on GCP or Azure. The discussion also covers their unique pricing model, which avoids surprising customers with high bills by charging less for data ingestion and applying a usage-based model to queries. This is contrasted with Warren's experience with his company's user-based pricing, which can lead to negative customer experiences when limits are exceeded.

The episode also explores Observe's "love-hate relationship" with Snowflake: Observe's usage accounts for over 2% of Snowflake's compute, which has helped them discover a lot of bugs but also caused sleepless nights for Snowflake's on-call engineers. Ang discusses hedging their bets for the future by leveraging open data formats like Iceberg, which can be stored directly in customer S3 buckets to enable true data ownership and portability. The episode concludes with a deep dive into the security challenges of providing multi-account access to customer data using IAM trust policies, and a look at the personal picks from the hosts.

Notable Links
Fact - Passkeys: Phishing on Google's own domain and It isn't even new
Episode: All About OTEL
Episode: Self Healing Systems

Picks:
Warren - The Shadow (1994 film)
Ang - Xreal Pro AR Glasses
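The buffering pattern Ang describes—a queue absorbing bursty ingestion in front of a batch-oriented engine—can be sketched in a few lines. This is a dependency-free illustration of the idea, not Observe's actual pipeline; the batch size and event shapes are made up.

```python
# Sketch of using a buffer (the "Kafka" role) to smooth bursts of events
# into fixed-size batches suited to a batch-based engine like Snowflake.
# Thresholds and names are illustrative only.

class BatchBuffer:
    """Accumulates incoming events and emits fixed-size batches downstream."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.pending = []   # events absorbed but not yet handed downstream
        self.batches = []   # stand-in for "loaded into the batch engine"

    def ingest(self, event):
        # Absorb events at whatever rate they arrive.
        self.pending.append(event)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        # Hand a complete batch to the downstream engine.
        if self.pending:
            self.batches.append(self.pending)
            self.pending = []

buf = BatchBuffer(batch_size=3)
for event in range(8):   # a burst of 8 events arrives at once
    buf.ingest(event)
buf.flush()              # drain the remainder
print([len(b) for b in buf.batches])  # → [3, 3, 2]
```

The burst arrives all at once, but the downstream engine only ever sees evenly sized batches—the same decoupling that also avoids lock-in to a provider-specific queue like Kinesis.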
The Open-Source Product Leader Challenge: Navigating Community, Code, and Collaboration Chaos
In a special solo flight, Warren welcomes Meagan Cojocar, General Manager at Pulumi and a self-proclaimed graduate of "PM school" at AWS. They dive into what it's like to own an entire product line and why giving up that startup hustle for the big leagues sometimes means you miss the direct signal from your users. The conversation goes deep on the paradox of open source, where direct feedback is gold but dealing with license-shifting competitors can make you wary. From the notorious HashiCorp kerfuffle to the rise of OpenTofu, they explore how Pulumi maintains its commitment to the community amidst a wave of customer distrust.

Meagan highlights the invaluable feedback loop provided by the community, allowing for direct interaction between users and the engineering team. This contrasts with the "telephone game" that can happen in proprietary product development. The conversation also addresses the recent industry shift away from open-source licenses and the immediate back-pedaling that followed, discussing the subsequent customer distrust and how Pulumi maintains its commitment to the open-source model.

And finally, the duo tackles the elephant in the cloud: LLMs, building on the earlier MCP episode. They debate the great code quality vs. speed trade-off, the risk of a "botched" infrastructure deployment, and whether these models can solve anything more than a glorified statistical guessing game. It's a candid look at the future of DevOps, where the real chaos isn't the code, but the tools that write it. The conversation concludes with a philosophical debate on the fundamental capabilities of LLMs, questioning whether they can truly solve "hard problems" or are merely powerful statistical next-word predictors.

Notable Facts
Veritasium - The Math That Predicts Everything
Fact - Don't outsource your customer support: Clorox sues Cognizant
CloudFlare uses an LLM to generate an OAuth2 Library

Picks:
Warren - Rands Leadership Community
Meagan - The Manager's Path by Camille Fournier
FinOps: Holding engineering teams accountable for spend
In this episode of Adventures in DevOps, we dive into the world of FinOps, a concept that aims to apply the DevOps mindset to financial accountability. Yasmin Rajabi, Chief Strategy Officer at CloudBolt, joins us to demystify the term, as we acknowledge the critical challenge of bringing together financial accountability and engineering teams who often are not paying attention to the business.

The discussion explores the practicalities of FinOps in the context of cloud spending and Kubernetes. Yasmin highlights that a significant amount of waste in organizations comes from simply not turning off unused systems and not right-sizing resources. She explains how tools like the Horizontal Pod Autoscaler (HPA) and Vertical Pod Autoscaler (VPA) can help, but also points out the complexities of optimizing across horizontal and vertical scaling behaviors. The conversation touches on "shame-back reporting" as a way to give engineering teams visibility into costs, although it emphasizes that providing tooling and insights is more effective than simply telling developers to change configurations.

The episode also delves into the evolving mindset around cloud costs, especially with the rise of AI and machine learning workloads. While historically engineering salaries eclipsed cloud spending, the increasing hardware requirements for ML and data workloads are making cost optimization a more pressing concern. Spending-conscious teams are increasingly asking about GPU optimization, even if AI/ML teams are still largely focused on limitless spending to drive unjustified "innovation". They conclude by discussing the challenges of on-premise versus cloud deployments and the importance of addressing "day two problems" regardless of the infrastructure choice.

Picks
Warren - Lions and Dolphins cannot make babies
Aimee - The Equip Protein Powder and Protein Bar
Yasmin - Bone Broth drink by 1990 Snacks
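The right-sizing idea behind recommenders like the VPA can be boiled down to a small calculation: derive a resource request from observed usage percentiles plus headroom, rather than from guesses. This toy sketch is not the VPA's actual algorithm; the percentile, headroom factor, and sample values are illustrative.

```python
# Toy right-sizing recommendation: set the request near a high percentile
# of observed usage, with a little headroom, instead of over-provisioning.

def recommend_request(samples, percentile=0.95, headroom=1.2):
    """Recommend a CPU request (millicores) from observed usage samples."""
    ordered = sorted(samples)
    # Index of the chosen percentile, clamped to the last sample.
    idx = min(int(percentile * len(ordered)), len(ordered) - 1)
    return round(ordered[idx] * headroom)

# Hypothetical per-minute CPU usage samples for one pod, in millicores.
usage_millicores = [120, 90, 200, 180, 150, 110, 95, 210, 130, 140]
print(recommend_request(usage_millicores))
```

If this pod had been given a flat 1000m request, the gap between that and the recommendation is exactly the "waste from not right-sizing" Yasmin describes.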
The Auth Showdown: Single tenant versus Multitenant Architectures
Get ready for a lively debate on this episode of Adventures in DevOps. We're joined by Brian Pontarelli, founder of FusionAuth and CleanSpeak. Warren and Brian face off by diving into the controversial topic of multitenant versus single-tenant architecture, with expert co-host Aimee Knight moderating the discussion. Ever wondered how someone becomes an "auth expert"? Warren spills the beans on his journey, explaining it's less about a direct path and more about figuring out what it means for yourself. Brian chimes in with his own "random chance" story, revealing how he fell into it after his forum-based product didn't pan out.

Aimee confesses her "alarm bells" start ringing whenever multitenant architecture is mentioned, jokingly demanding "details" and admitting her preference for more separation when it comes to reliability. Brian makes a compelling case for his company's chosen path, explaining how their high-performance, downloadable single-tenant profanity filter, CleanSpeak, handles billions of chat messages a month with extremely low latency. This architectural choice became a competitive advantage, attracting companies that couldn't use cloud-based multitenant competitors because they need to run solutions in their own data centers.

We critique cloud providers' tendency to push users towards their most profitable services, citing AWS Cognito as an example of a cost-effective solution for small-scale use that becomes cost-prohibitive with scaling and feature enablement. The challenges of integrating with Cognito, including its reliance on numerous other AWS services and the need for custom Lambda functions for configuration, are also a point of contention. The conversation extends to the frustrations of managing upgrades and breaking changes in both multitenant and single-tenant systems and the inherent difficulties of ensuring compatibility across different software versions and integrations.
The episode concludes with a humorous take on the current state and perceived limitations of AI in software development, particularly concerning security.

Picks
Warren - Scarpa Hiking shoes - Planet Mojito Suede
Aimee - Peloton Tread
Brian - Searchcraft and Fight or Flight
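The "custom Lambda functions for configuration" complaint about Cognito refers to its Lambda triggers: to customize behavior like sign-up policy, you attach your own function. Here is a minimal sketch of a pre-sign-up trigger handler; the event and response fields follow Cognito's documented trigger shape, but the domain allow-list is a hypothetical policy for illustration.

```python
# Minimal sketch of an AWS Cognito pre-sign-up Lambda trigger:
# auto-confirm users from an allowed email domain, reject everyone else.
# The allow-list below is hypothetical.

ALLOWED_DOMAINS = {"example.com"}

def pre_sign_up_handler(event, context=None):
    """Cognito pre-sign-up trigger: gate registration by email domain."""
    email = event["request"]["userAttributes"].get("email", "")
    domain = email.rsplit("@", 1)[-1].lower()
    if domain not in ALLOWED_DOMAINS:
        # Raising causes Cognito to reject the sign-up attempt.
        raise ValueError(f"Sign-up not allowed for domain: {domain}")
    event["response"]["autoConfirmUser"] = True
    return event

# Simulated trigger invocation with a Cognito-shaped event payload.
event = {
    "request": {"userAttributes": {"email": "dev@example.com"}},
    "response": {},
}
result = pre_sign_up_handler(event)
print(result["response"]["autoConfirmUser"])  # → True
```

Even this trivial policy requires deploying and wiring up a function, which is the integration overhead the hosts push back on.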
Should We Be Using Kubernetes: Did the Best Product Win? With Omer Hamerman
Episode Sponsor: PagerDuty - Check out the features in their official feature release.

This episode dives into a fundamental question facing the DevOps world: Did Kubernetes truly win the infrastructure race because it was the best technology, or were there other, perhaps less obvious, factors at play? Omer Hamerman joins Will and Warren to take a hard look at it. Despite the rise of serverless solutions promising to abstract away infrastructure management, Omer shares that Kubernetes has seen a surge in adoption, with potentially 70-75% of corporations now using or migrating to it. We explore the theory that human nature's preference for incremental "step changes" (Kaizen) over disruptive "giant leaps" (Kaikaku) might explain why a solution perceived by some as "worse" or more complex has gained such widespread traction.

The discussion unpacks the undeniable strengths of Kubernetes, including its thriving community, its remarkable extensibility through APIs, and how it inadvertently created "job security" for engineers who "nerd out" on its intricacies. We also challenge the narrative by examining why serverless options like AWS Fargate could often be a more efficient and less burdensome choice for many organizations, especially those not requiring deep control or specialized hardware like GPUs. The conversation highlights that the perceived "need" for Kubernetes often emerges from something other than technical superiority.

Finally, we consider the disruptive influence of AI and "vibe coding" on this landscape, how could we not? As LLMs are adopted to accelerate development, they tend to favor serverless deployment models, implicitly suggesting that for rapid product creation, Kubernetes might not be the optimal fit. This shift raises crucial questions about the trade-offs between development speed and code quality, the evolving role of software engineers towards code review, and the long-term maintainability of AI-generated code.
We close by pondering the broader societal and environmental implications of these technological shifts, including AI's massive energy consumption and the ongoing debate about centralizing versus decentralizing infrastructure for efficiency.

Links:
Comparison: Linux versus E. coli

Picks
Warren - Surveys are great, and also fill in the Podcast Survey
Will - Katana.network
Omer - Mobland and JJ (Jujutsu)
Mastering SRE: Insights in Scale and at Capacity with Aimee Knight
In this episode, Aimee Knight, an expert in Site Reliability Engineering (SRE) with experience at Paramount and NPM, joins the podcast to discuss her journey into SRE, the challenges she faced, and the strategies she employed to succeed. Aimee shares her transition from a non-traditional background in JavaScript development to SRE, highlighting the importance of understanding both the programming and infrastructure sides of engineering. She also delves into the complexities of SRE at different scales, the role of playbooks in incident management, and the balance between speed and quality in software development.

Aimee discusses the impact of AI and machine learning on SRE, emphasizing the need for responsible use of these tools. She touches on the importance of understanding business needs and how it affects decision-making in SRE roles. The conversation also covers the trade-offs in system design, the challenges of scaling applications, and the importance of resilience in distributed systems. Aimee provides valuable insights into the pros and cons of a career in SRE, including the importance of self-care and the satisfaction of mentoring others.

The episode concludes with a discussion of some of the hard problems, such as the on-call burden for large teams and the technical expertise an org needs to maintain higher-complexity systems. Is the average tenure in tech decreasing? We discuss it and do a deep dive on the consequences in the SRE world.

Picks
The Adventures In DevOps: Survey
Warren's Technical Blog
Warren: The Fifth Discipline by Peter Senge
Aimee: Sleep Token (Band) - Caramel, Granite
Will: The Bear Grylls Celebrity Hunt on Netflix
Jillian: Horizon Zero Dawn Video Game
Exploring MCP Servers and Agent Interactions with Gil Feig
In this episode, we delve into the concept of MCP (Model Context Protocol) servers and their role in enabling agent interactions. Gil Feig, the co-founder and CTO of Merge, shares insights on how MCP servers facilitate efficient and secure integration between various services and APIs.

The discussion covers the benefits and challenges of using MCP servers, including their stateful nature, security considerations, and the importance of understanding real-world use cases. Gil emphasizes the need for thorough testing and evaluation to ensure that MCP servers effectively meet user needs.

Additionally, we explore the implications of MCP servers on data security, scaling, and the evolving landscape of API interactions. Warren chimes in with experiences integrating AI with auth. Will stuns us with some nuclear fission history. And finally, we also touch on the balance between short-term innovation and long-term stability in technology, reflecting on how different generations approach problem-solving and knowledge sharing.

Picks:
The Adventures In DevOps: Survey
Warren: The Magicians by Lev Grossman
Gil: Constant Escapement in Watchmaking
Will: Dungeon Crawler Carl & Atmos Clock
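At its core, an MCP server exposes "tools" that an agent can discover and then invoke with arguments. This dependency-free sketch mimics that request/response shape only; the real protocol is JSON-RPC over stdio or HTTP, and the tool name and fields here are invented for illustration.

```python
# Schematic of the MCP tool-discovery and tool-call flow:
# an agent first lists tools, then calls one by name with arguments.
# Not the real wire protocol; shapes are simplified for illustration.

TOOLS = {
    "get_ticket": {  # hypothetical tool exposed by an integration server
        "description": "Fetch a ticket by id from the integrated service",
        "handler": lambda args: {"id": args["id"], "status": "open"},
    },
}

def handle(request):
    """Dispatch a tools/list or tools/call style request."""
    if request["method"] == "tools/list":
        return {"tools": [
            {"name": name, "description": tool["description"]}
            for name, tool in TOOLS.items()
        ]}
    if request["method"] == "tools/call":
        tool = TOOLS[request["params"]["name"]]
        return {"result": tool["handler"](request["params"]["arguments"])}
    raise ValueError("unknown method")

listing = handle({"method": "tools/list"})
call = handle({"method": "tools/call",
               "params": {"name": "get_ticket", "arguments": {"id": "T-1"}}})
print(listing["tools"][0]["name"], call["result"]["status"])
```

The security questions discussed in the episode fall out of this shape: the server decides which tools to expose and with whose credentials each call runs.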
No Lag: Building the Future of High-Performance Cloud with Nathan Goulding
Warren talks with Nathan Goulding, SVP of Engineering at Vultr, about what it actually takes to run a high-performance cloud platform. They cover everything from global game server latency and hybrid models to bare metal provisioning and the power and cooling constraints that come with modern GPU clusters.

The discussion gets into real-world deployment challenges like scaling across 32 data centers, edge use cases that actually matter, and how to design systems for location-sensitive customers, whether that's due to regulation or performance. They also discuss where the hyperscalers have overcomplicated pricing, and where a flatter pricing model and optimized defaults are better for everyone.

There's a section on nuclear energy (yes, really), including SMRs, power procurement, and what it means to keep scaling compute with limited resources. If you're wondering whether your app actually needs high-performance compute or just better visibility into your costs, this is the episode.

Picks
The Adventures In DevOps: Survey
Warren: Jetlag: The Game
Nathan: Money Heist (La Casa de Papel)
Ground Truth & Guided Journeys: Rethinking Data for AI with Inna Tokarev Sela
Inna Tokarev Sela, CEO and founder of Illumex, joins the crew to break down what it really means to make your data "AI-ready." This isn't just about clean tables: it's about semantic fabric, business ontologies, and grounding agents in your company's context to prevent the dreaded LLM hallucination. We dive into how modern enterprises simply cannot build a single source of truth, no matter how hard they try, all while knowing that one is required to build effective agents that utilize the available knowledge graphs.

The conversation unpacks democratizing data access and avoiding analytics anarchy. Inna explains how automation and graph modeling are used to extract semantic meaning from disconnected data stores, and how to resolve conflicting definitions. And yes, Warren finally coughs up what's so wrong with most dashboards.

Lastly, we get to the core philosophical questions of agentic systems and AGI, including why intuition is the real differentiator between humans and machines. Plus: storage cost regrets, spiritual journeys disguised as inference pipelines, and a very healthy fear of subscription-based sleep wearables.

Picks
The Adventures In DevOps: Survey
Warren: The Non-Computability of Intuition
Will: The Arc Browser
Inna: Healthy GenAI skepticism
AI, SREs & The Future of Self-Healing Systems with Sylvain Kalache from Rootly - DEVOPS 242
In this episode, we sat down with Sylvain Kalache, Head of Developer Relations at Rootly, and wow—what a ride. We dove headfirst into the world of self-healing systems, LLMs in incident management, and why "incident vibing" (yep, that's a thing now) might be our collective future.Sylvain’s journey started with on-call SRE work at SlideShare and LinkedIn, where he noticed recurring failures and dreamed of a system that could fix itself. This episode is the story of that dream maturing over the years—culminating in AI-powered incident response tools that don’t just detect problems but actively hypothesize root causes and suggest resolutions.We came away from this chat with one big takeaway: The future of incident response is fast, AI-assisted, and increasingly human-centric. But to get there, we need to treat LLMs like tools, not oracles—and be very intentional about how we use them.Want to learn more? Check out Rootly AI Labs and Sylvain’s spicy piece on "Incident Vibing" (linked in the show notes 🔥).
Breaking Web3: Node Ops at Scale, Hard Fork Havoc, and Bare-Metal Mastery - DEVOPS 241
In this episode of Adventures in DevOps, we welcomed Paul Marston from Anchor (yes, the Web3 infrastructure powerhouse) to dive deep into the world of blockchain node operations — and wow, what a ride it was!

We kicked things off with some podcast housekeeping (hey, did you fill out the listener survey yet? There are AWS credits on the line! 👀) before diving headfirst into Paul's fascinating journey from underwriting loans on green-screen terminals to managing sprawling, bare-metal blockchain infrastructure.

Key Takeaways & Highlights:

1. Web2 vs. Web3 Infra: We loved Paul's analogy-rich walkthrough of moving from financial services into Web3. Turns out, while the fundamentals of system resilience and scale remain, Web3 brings a whole new tempo — no weekends, no downtime, just constant evolution. He put it best: "We don't have bank holidays in Web3."

2. What Anchor Actually Does: Paul explained it beautifully — imagine AWS for Web3, but on high-performance bare metal. They're running 100+ chains, handling everything from provisioning nodes to ensuring high availability and ultra-low latency. And yes, that includes crazy challenges like scaling archive nodes with terabytes of blockchain history.

3. Hard Fork Hysteria: Ever wonder what it's like to be on the front lines of a hard fork? Paul's firsthand stories (like the Ethereum Pectra upgrade chaos) showed us just how critical real-time response and coordination are. These aren't just code releases — they're make-or-break moments for decentralized networks.

4. Load Balancing in Web3 is... Intense: Forget simple health checks. Their load balancer routes traffic based on node sync status, archive availability, and even request type. If you're into distributed systems, this was catnip.

5. AI's Role in DevOps? It's Getting Real: Anchor's bringing AI (shout-out to "Monica") into their ops stack. From diagnosing node issues to answering internal questions, it's early days, but the impact is promising — and growing fast.

6. The Vinyl Vibe: In the spirit of "tech meets tactile," Paul's pick of the week? A quirky vertical turntable from the Netherlands that scans and indexes your records like a CD. 🎶 Old-school meets innovation — just like this episode.

Tune in, nerd out, and don't forget to leave us that survey feedback (preferably helpful, but we'll take weird too): adventuresindevops.com/survey
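The "smarter than a health check" routing described for Web3 load balancing can be sketched as a selection function over node state: filter by sync status and archive capability for the request type, then prefer the most caught-up node. Field names and node data here are illustrative, not Anchor's actual implementation.

```python
# Sketch of state-aware routing for blockchain nodes: eligibility depends
# on sync status and archive capability, preference on chain height.
# All field names and values are illustrative.

def route(nodes, request):
    """Return an eligible node for the request, preferring the most synced."""
    eligible = [
        n for n in nodes
        if n["synced"] and (not request["needs_archive"] or n["archive"])
    ]
    if not eligible:
        raise RuntimeError("no eligible node for request")
    # Prefer the node with the highest block height (most caught up).
    return max(eligible, key=lambda n: n["height"])

nodes = [
    {"name": "node-a", "synced": True,  "archive": False, "height": 1000},
    {"name": "node-b", "synced": True,  "archive": True,  "height": 998},
    {"name": "node-c", "synced": False, "archive": True,  "height": 990},
]
print(route(nodes, {"needs_archive": False})["name"])  # → node-a
print(route(nodes, {"needs_archive": True})["name"])   # → node-b
```

A plain round-robin balancer would happily send archive queries to node-a or traffic to the out-of-sync node-c, which is exactly why simple health checks aren't enough here.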
Observability in the CI/CD Pipeline with Adriana Villela - DEVOPS 240
In this episode, we sat down with the delightful Adriana Villela—principal developer advocate at Dynatrace, CNCF ambassador, and host of the "Geeking Out" podcast (featuring a capybara logo designed by her daughter, no less!). Adriana brought not just deep insights into observability, but also a refreshingly human and humorous perspective on the ever-evolving world of DevOps. 💡

Here's what we dove into:

- The Heart of Observability: We explored how observability is so much more than just a postmortem tool for SREs. Adriana reminded us it's a team sport—from developers writing telemetry to QA teams using trace data to debug pre-prod bugs.
- CI/CD Pipelines Need Love Too: When your build pipeline mysteriously breaks down, what do you do? Adriana championed the idea of bringing observability to our pipelines, arguing they're production systems in their own right. Metrics like build times, failure rates, and even stage-by-stage breakdowns can be goldmines for improving dev efficiency.
- OpenTelemetry (OTEL) FTW: We got a crash course on OTEL's architecture—API, SDK, and the mighty Collector—and how it's revolutionized telemetry standardization. There's even OTEL for Bash! (Regex lovers, rejoice... or run.)
- Beyond Engineering: Adriana blew our minds by suggesting observability principles could—and should—be applied outside of tech: recruiting pipelines, hospital ER wait times, sales cycles... basically, OTEL all the things.
- Sustainability in Observability: As a longtime environmentalist, Adriana is now researching how to make observability greener. Spoiler alert: she's taking that message global with upcoming talks at Observability Day in London and KubeCon Japan.
- The Human Side of Tech: From learning Rust at 72 (shoutout to her awesome dad!) to tales of whiplash from headbanging at metal concerts (yes, there's an "Iron Neck" involved), this episode was packed with personality.

Key Takeaways:

- Observability is shifting left—developers and QA should be just as invested as SREs.
- OpenTelemetry is the lingua franca of modern observability—and the ecosystem around it is growing fast.
- Treat your CI/CD pipeline like a product: monitor, trace, and optimize it.
- We're only scratching the surface of how observability can improve every system—not just tech stacks.

"Observability allows us to ask meaningful questions, get useful answers, and act effectively on the information that we get." – Hazel Weakly (quoted by Adriana)

Tune in for tech insights, capybara love, open-source advocacy, environmental passion, and a whole lot of laughs.
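Treating the pipeline as a production system means wrapping each stage in a timed "span" so build times and failure rates fall out for free. In practice you'd use the OpenTelemetry SDK and export to a Collector; this dependency-free sketch only shows the shape, and the stage names are placeholders.

```python
# Sketch of span-style instrumentation for CI/CD stages: record each
# stage's duration and outcome, the raw material for build-time and
# failure-rate metrics. Real code would use the OpenTelemetry SDK.

import time
from contextlib import contextmanager

SPANS = []  # stand-in for an exporter / Collector

@contextmanager
def span(stage):
    """Record a stage's duration and outcome, like a tracing span would."""
    start = time.monotonic()
    status = "ok"
    try:
        yield
    except Exception:
        status = "error"
        raise
    finally:
        SPANS.append({"stage": stage,
                      "seconds": time.monotonic() - start,
                      "status": status})

with span("checkout"):
    pass  # e.g. git clone ...
with span("build"):
    pass  # e.g. compile / docker build ...
with span("test"):
    pass  # e.g. run the suite ...

print([(s["stage"], s["status"]) for s in SPANS])
```

Once stages emit spans, a "build suddenly got slow" investigation becomes a query over stage durations instead of guesswork.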
Building Engineering Excellence with Ganesh Datta of Cortex - DEVOPS 239
In this episode, I (flying solo today!) sat down with Ganesh Datta, the CTO and co-founder of Cortex, to explore what it really means to drive engineering excellence at scale. And spoiler: it's not just about better dashboards or fancy developer tools—it's about treating software development like the competitive advantage it is.

We went deep into the why behind internal developer portals (IDPs) and how they're transforming platform engineering, developer experience, and organizational maturity. Ganesh shares how Cortex came to life—from being paged at 2am for a mystery Game of Thrones-named microservice (yep, we've all been there), to realizing that every other business function had a system of record—except engineering.

Key Takeaways:

- IDPs are like CRMs for Engineering: Just as sales teams wouldn't function without a CRM, modern engineering orgs shouldn't be flying blind without a structured, centralized developer portal.
- Engineering Excellence = Business Outcomes: Whether it's reliability, security, or platform efficiency, IDPs help codify best practices and align teams toward measurable goals.
- Start Small to Win Big: You don't need to overhaul everything on day one. Start with a pain point you already know—like production readiness—and improve it incrementally.
- SREs and Platform Engineers Love IDPs: Because they get the data, ownership visibility, and real-time checks they need, without the honor-system chaos.
- Developer Experience is Just the Beginning: Tools like Cortex aren't just about dev productivity—they're about creating resilient, aligned, scalable engineering orgs.

We also geeked out about everything from naming services ("Brewer" for a feature extraction tool? Chef's kiss.) to the surprising power of reading 15 minutes before bed to improve sleep quality—yep, we went there!

If you're part of an engineering team (or leading one) and want to know how to move faster and smarter, this is the episode for you.
Modern DevOps Challenges: Automation, AI, and Scaling in 2025 - DevOps 238
In this episode of DevOps 238, we sat down with Zach Lloyd to dive into what's really happening in the world of modern DevOps—from automation and AI to scaling systems and maintaining team culture in fast-paced environments.

We talked about the evolving role of DevOps engineers, the shift toward platform engineering, and why tool sprawl is becoming a bigger issue than ever. Zach shared some powerful real-world lessons on implementing CI/CD pipelines, avoiding burnout in high-pressure environments, and how teams can stay aligned without drowning in Slack notifications or endless dashboards.

One of our favorite takeaways? The idea that simplicity and communication still beat out fancy tooling—every time. We also touched on emerging trends like AI-assisted deployments, observability, and what DevOps might look like in 2026 and beyond.

If you're navigating legacy systems, scaling rapidly, or just trying to keep your team sane and productive, this episode's packed with insights you won't want to miss.

Links:
Dungeon Crawler Carl (Book)
Dabble of DevOps AI Data Discovery Tool
The Impact of Generative AI on Critical Thinking (Research Paper)
Granola AI Meeting Notes
A Travel Guide to the Middle Ages (Book)
Matt Lee Discusses Cloud War Games and Elevating Everyday DevOps - DevOps 237
Welcome to another exciting episode of Top End Devs! In this installment of "Adventures in DevOps," we dive into the world of cloud architecture and engineering with a fascinating discussion led by our hosts Warren Parad and Will Button, and joined by our special guest, Matt Lee. Matt, hailing from Wisconsin, is the driving force behind innovative projects like CloudWarGames.com, a platform designed to enhance DevOps training and hiring through engaging problem-solving scenarios. As we explore his journey, from coaching gymnastics to developing digital training ecosystems, you'll discover how Matt's experiences shape his unique perspectives on technical challenges, team dynamics, and the ever-evolving landscape of cloud solutions. Whether you're curious about the technical intricacies of infrastructure or seeking inspiration for your own career path, this episode offers a captivating look at the intersection of technology, creativity, and human connections. So, sit back, relax, and get ready to explore the world of DevOps in a whole new way!Become a supporter of this podcast: https://www.spreaker.com/podcast/adventures-in-devops--6102036/support.