309: Microsoft tries to give away cloud services for free, sadly, it's only SQL

Welcome to episode 308 of The Cloud Pod – where the forecast is always cloudy! Justin and Matt are on hand and ready to bring you an action packed episode. Unfortunately, this one is also lullaby free. Apologies. This week we’re talking about Databricks and Lakebridge, Cedar Analysis, Amazon Q, Google’s little hiccup, and updates to SQL – plus so much more! Thanks for joining us.

Titles we almost went with this week:

KV Phone Home: When Your Key-Value Store Goes AWOL
When Your Coreless Service Finds Its Core Problem
Oracle’s Vanity Fair: Pretty URLs for Pretty Penny
From Warehouse to Lakehouse: Your Free Ticket to Cloud Town
1⃣Databricks Uno: Because One is the Loneliest Number
Free as in Beer, Smart as in Data Science
Cedar Analysis: Because Your Authorization Policies Wood Never Lie
Cedar Analysis: Teaching Old Policies New Proofs
Amazon Q Finally Learns to Talk to Other Apps
Tomorrow: Visual Studio’s Predictive Edit Revolution
The Ghost of Edits Future: AI Haunts Your Code Before You Write It
IAM What IAM: Google’s Identity Crisis Breaks the Internet
Permission Denied: The Day Google Forgot Who Everyone Was
403 Forbidden: When Google’s Bouncer Called in Sick
AWS Brings the Heat to Fusion Research
Larry’s Cloud Nine: Oracle Stock Soars on Forecast Raise
OCI You Later: Oracle Bets Big on Cloud Growth
Oracle’s Crystal Ball Shows 40% Cloud Growth Ahead
Meta Scales Up Its AI Ambitions with $14 Billion Investment
From FAIR to Scale: Meta’s $14 Billion AI Makeover
Congratulations Databricks one, you are now the new low code solution.
AWS burns power to figure out how power works

AI Is Going Great – Or How ML Makes Money

02:12 Zuckerberg makes Meta’s biggest bet on AI, $14 billion Scale AI deal

Meta is finalizing a $14 billion investment for a 49% stake in Scale AI, with CEO Alexandr Wang joining to lead a new AI research lab at Meta.
This follows similar moves by Google and Microsoft acquiring AI talent through investments rather than direct acquisitions to avoid regulatory scrutiny.
Scale AI specializes in data labeling and annotation services critical for training AI models, serving major clients including OpenAI, Google, Microsoft, and Meta.
The company’s expertise covers approximately 70% of all AI models being built, providing Meta with valuable intelligence on competitor approaches to model development.
The deal reflects Meta’s struggles with its Llama AI models, particularly the underwhelming reception of Llama 4 and delays in releasing the more powerful “Behemoth” model due to concerns about competitiveness with OpenAI and DeepSeek. Meta recently reorganized its GenAI unit into two divisions following these setbacks.
Wang brings both technical AI expertise and business acumen, having built Scale AI from a 2016 startup to a $14 billion valuation. His experience includes defense contracts and the recent Defense Llama collaboration with Meta for national security applications.
For cloud providers and developers, this consolidation signals increased competition in AI infrastructure and services, as Meta seeks to strengthen its position against OpenAI’s consumer applications and model capabilities through enhanced data preparation and training methodologies.

03:29 Matt – “It’s interesting, especially the first part of this where companies are trying to acquire AI talent through investments rather than directly hiring people – and hiring them away from other companies. It’s going to be an interesting trend to see if it continues on in the industry where they just keep doing it this way. They just acquire small companies and medium (or large in this case) in order to continue to grow their teams or to at least augment their teams in that way. Or if they’re going to try to build their own in-house units too.”

07:50 Introducing Databricks Free Edition | Databricks Blog

Databricks Free Edition provides access to the same data and AI tools used by enterprise customers, removing the cost barrier for students and hobbyists to gain hands-on experience with production-grade platforms.
The offering addresses the growing skills gap in AI/ML roles, where job postings have increased 74% annually over four years and 66% of business leaders require AI skills for new hires.
Free Edition includes access to Databricks’ training resources and industry-recognized certifications, allowing users to validate their skills on the same platform used by major companies.
Universities like Texas A&M are already integrating Free Edition into their curriculum, enabling students to gain practical experience with enterprise data tools before entering the workforce.
This move positions Databricks to capture mindshare among future data professionals while competing with other cloud providers’ free tiers and educational offerings.
Want to try it out? You can do that here.

08:28 Introducing Databricks One | Databricks Blog

Databricks One creates a simplified interface specifically for business users to access data insights without needing technical expertise in clusters, queries, or notebooks.
The consumer access entitlement is available now, with the full experience entering beta later this summer.
The platform provides three key capabilities for non-technical users: AI/BI Dashboards, Genie for natural language data queries, and interaction with Databricks Apps through a streamlined interface designed to minimize complexity.
Security and governance remain centralized through Unity Catalog, allowing administrators to expand access to business users while maintaining existing compliance and auditing controls without changing their governance strategy.
The service will be included at no additional license fee for existing Databricks Intelligence Platform customers, potentially expanding data access across organizations without requiring additional technical training or resources.
Future roadmap includes expanding from single workspace access to account-wide asset visibility, positioning Databricks One as a centralized hub for business intelligence across the entire Databricks ecosystem.

08:42 Justin – “I think the Databricks Free Edition is a really strong move on their part… I can play with it, see what it does and kick the tires on it and be interested in it as a hobbyist. And then I can bring it back to my day job and say, hey, I was using Databricks over the weekend and I did a thing and I think it could work for us at work. Being able to get access to these tools and these types of capabilities to play with, I think it’s a huge advantage. Everything’s moving so fast right now, that unless you have access to these tools, you feel like you’re left behind.”

AWS

10:45 AWS And National Lab Team Up To Deploy AI Tools In Pursuit Of Fusion Energy

AWS is partnering with Lawrence Livermore National Laboratory to apply machine learning to fusion energy research, specifically to predict and prevent plasma disruptions that can damage tokamak reactors.
The collaboration uses AWS cloud infrastructure to process massive datasets from fusion experiments.
The project leverages AWS SageMaker and high-performance computing resources to analyze terabytes of sensor data from fusion reactors, training models that can predict plasma instabilities milliseconds before they occur. This predictive capability could prevent costly reactor damage and accelerate fusion development timelines.
Cloud computing enables fusion researchers to scale their computational workloads dynamically, running complex simulations and ML training jobs that would be prohibitively expensive with on-premises infrastructure.
AWS provides the elastic compute needed to process years of experimental data from multiple fusion facilities worldwide.
The partnership demonstrates how cloud-based AI/ML services are becoming essential for scientific computing applications that require massive parallel processing and real-time analysis.
Fusion researchers can now iterate on models faster and share findings globally through cloud collaboration tools.
This application of cloud AI to fusion energy could accelerate the path to commercial fusion power by reducing experimental downtime and improving reactor designs through better predictive models. Success here would validate cloud platforms as critical infrastructure for next-generation energy research.

12:34 Use Model Context Protocol with Amazon Q Developer for context-aware IDE workflows | AWS DevOps & Developer Productivity Blog

Amazon Q Developer now supports Model Context Protocol (MCP) in VS Code and JetBrains IDEs, enabling developers to connect external tools like Jira and Figma directly into their coding workflow.
This eliminates manual context switching between browser tabs and allows Q Developer to automatically fetch project requirements, design specs, and update task statuses.
MCP provides a standardized way for LLMs to integrate with applications, share context, and interact with APIs. Developers can configure MCP servers with either Global scope (across all projects) or Workspace scope (current IDE only), with granular permissions for individual tools including Ask, Always Allow, or Deny options.
The practical implementation shown demonstrates fetching Jira issues, moving tickets to “In Progress”, analyzing Figma designs for technical requirements, and implementing code changes based on combined context from both tools. This integration allows Q Developer to generate more accurate code by understanding both business requirements and design specifications simultaneously.
This feature builds on Q Developer’s existing agentic coding capabilities which already included executing shell commands and reading local files. The addition of MCP support extends these capabilities to any tool that implements the protocol, with AWS providing an open-source MCP Servers repository on GitHub for additional integrations.
For AWS customers, this reduces development friction by keeping developers in their IDE while maintaining full context from project management and design tools. The feature is available now in Q Developer’s IDE plugins with no additional cost beyond standard Q Developer pricing.

13:26 Justin – “I mean, if you think Q Developer is the best tool for you, then more power to you, and I’m not going to stop you. But I am glad to see this get added to one more place.”

14:08 AWS WAF now supports automatic application layer distributed denial of service (DDoS) protection – AWS

AWS WAF now includes automatic Layer 7 DDoS protection that detects and mitigates attacks within seconds, using machine learning to establish traffic baselines in minutes and identify anomalies without manual rule configuration.
The managed rule group works across CloudFront, ALB, and other WAF-supported services, reducing operational overhead for security teams who previously had to manually configure and tune DDoS protection rules.
Available to all AWS WAF and Shield Advanced subscribers in most regions, the service automatically applies mitigation rules when traffic deviates from normal patterns, with configurable responses including challenges or blocks.
This addresses a critical gap in application-layer protection where traditional network-layer DDoS defenses fall short, particularly important as L7 attacks become more sophisticated and frequent.
Pricing follows standard AWS WAF managed rule group costs, making enterprise-grade DDoS protection accessible without requiring dedicated security infrastructure or expertise.

14:56 Justin – “I have say that I’ve used the WAF now quite a bit – as well as Shield and CloudFront. Compared to using CloudFlare, they’re so limited what you can do on these things. I so much prefer CloudFlare over trying to tune AWS WAF properly.”

19:27 Powertools for AWS Lambda introduces Bedrock Agents Function utility – AWS

Powertools for AWS Lambda now includes a Bedrock Agents Function utility that eliminates boilerplate code when building Lambda functions that respond to Amazon Bedrock Agent action requests.
The utility handles parameter injection and response formatting automatically, letting developers focus on business logic instead of integration complexity.
This utility integrates seamlessly with existing Powertools features like Logger and Metrics, providing a production-ready foundation for AI applications. Available for Python, TypeScript, and .NET, it standardizes how Lambda functions interact with Bedrock Agents across different programming languages.
For organizations building agent-based AI solutions, this reduces development time and potential errors in the Lambda-to-Bedrock integration layer. The utility abstracts away the complex request/response patterns required for agent actions, making it easier to build and maintain serverless AI applications.
Developers can get started by updating to the latest version of Powertools for AWS Lambda in their preferred language. Since this is an open-source utility addition, there are no additional costs beyond standard Lambda and Bedrock usage fees.
This release signals AWS’s continued investment in simplifying AI application development by providing purpose-built utilities that handle common integration patterns. It addresses a specific pain point for developers who previously had to write custom code to properly format Lambda responses for Bedrock Agents.

20:21 Matt – “It’s great to see them making these more accessible to *not* subject matter experts and to the general developer. So would I want to take my full app and go to full production leveraging power tools? No, but it’s good to let the standard developer that just wants to play with something and learn and figure out how to do it. Get something up and running decently easily.”

20:53 Introducing Cedar Analysis: Open Source Tools for Verifying Authorization Policies | AWS Open Source Blog

AWS releases Cedar Analysis as open source tools for verifying authorization policies, addressing the challenge of ensuring fine-grained access controls work correctly across all scenarios rather than just test cases. The toolkit includes a Cedar Symbolic Compiler that translates policies into mathematical formulas and a CLI tool for policy comparison and conflict detection.
The technology uses SMT (Satisfiability Modulo Theories) solvers and formal verification with Lean to provide mathematically proven soundness and completeness, ensuring analysis results accurately reflect production behavior.
This approach can answer questions like whether two policies are equivalent, if changes grant unintended permissions, or if policies contain conflicts or redundancies.
Cedar itself has gained significant traction with 1.17 million downloads and production use by companies like MongoDB and StrongDM, making robust analysis tools increasingly important as applications scale. The open source release under Apache 2.0 license allows developers to independently verify policies and researchers to build upon the formal methods foundation.
The practical example demonstrates how subtle policy refactoring errors can be caught – splitting a single policy into multiple policies accidentally restricted owner access to private photos, which the analysis tool identified before production deployment. This capability helps prevent authorization bugs that could lead to security incidents or access disruptions.
For AWS customers using services like Verified Permissions (which uses Cedar), this provides additional confidence in policy correctness and a path for building custom analysis tools tailored to specific organizational needs. The formal verification aspect also positions Cedar as a research platform for advancing authorization system design.

22:57 Justin – “We’re using strong DM in the day jo0,b and it is very nice to see Cedar getting used in lots of different ways, particularly the mathematical proofs to be used in policies.”

GCP

23:51 Identity and access management failure in Google Cloud causes widespread internet service disruptions – SiliconANGLE

A misconfiguration in Google Cloud’s IAM systems caused widespread outages affecting App Engine, Firestore, Cloud SQL, BigQuery, and Memorystore, demonstrating how a single identity management failure can cascade across multiple cloud services and impact thousands of businesses globally.
The incident highlighted the interconnected nature of modern cloud infrastructure as services like Cloudflare Workers, Spotify, Discord, Shopify, and UPS experienced partial or complete downtime due to their dependencies on Google Cloud components.
Google Workspace applications including Gmail, Drive, Docs, Calendar, and Meet all experienced failures, showing how IAM issues can affect both infrastructure services and end-user applications simultaneously.
The outage underscores the critical importance of IAM redundancy and configuration management in cloud environments, as even major providers like Google can experience service-wide disruptions from a single misconfiguration.
While AWS appeared largely unaffected, Amazon’s Twitch service may have experienced issues due to network-level interdependencies, illustrating how cloud outages can have ripple effects across provider boundaries through shared DNS, CDN, or authentication services.
FULL RCA is available here.

26:11 Matt – “For the SRE team at Google, within 2 minutes was already triaging, in 10 minutes it identified the root cause – that’s an impressive response time.”

28:28 Cloudflare service outage June 12, 2025

Cloudflare experienced a 2 hour 28 minute global outage on June 12, 2025 affecting Workers KV, WARP, Access, Gateway, Images, Stream, Workers AI, Turnstile, and other critical services due to a third-party storage provider failure that exposed architectural vulnerabilities in their infrastructure.
The incident revealed a critical single point of failure in Workers KV’s central data store, which depends on many Cloudflare products despite being designed as a “coreless” service that should run independently across all locations.
During the outage window, 91% of Workers KV requests failed, cascading failures across dependent services while core services like DNS, Cache, proxy, and WAF remained operational, highlighting the blast radius of shared infrastructure dependencies.
Cloudflare is accelerating migration of Workers KV to their own R2 storage infrastructure and implementing progressive namespace re-enablement tooling to prevent future cascading failures and reduce reliance on third-party providers.
This marks at least the third significant R2-related outage in recent months (March 21 and February 6, 2025 also mentioned), raising questions about the stability of Cloudflare’s storage infrastructure during their architectural transition period.

29:31 Justin – “I think the failure here is they’re running an entire KV on top of GCS or GCP in a way that they were impacted by this word that should be blast radiuses out to multiple clouds. Cloudflare is a partner of AWS, GCP, and Azure. They should be able to make things redundant – because I don’t necessarily know that their infrastructure is going to be better than anyone else’s infrastructure.”

32:53 Securing open-source credentials at scale | Google Cloud Blog

Google Cloud has developed an automated tool that scans open-source packages and Docker images for exposed GCP credentials like API keys and service account keys, processing over 5 billion files across hundreds of millions of artifacts from repositories like PyPI, Maven Central, and DockerHub.
The system detects and reports leaked credentials within minutes of publication, matching the speed at which malicious actors typically exploit them, with automatic remediation options including disabling compromised service account keys based on customer-configured policies.
Unlike GitHub and GitLab’s source code scanning, this tool specifically targets built packages and container images where credentials often hide in configuration files, compiled binaries, and build scripts – areas traditionally overlooked in security scanning.
Google plans to expand beyond GCP credentials to include third-party credential scanning later this year, positioning this as part of their broader deps.dev ecosystem for open-source security analysis.
For GCP customers publishing open-source software, this provides free automated protection against credential exposure without requiring additional tooling or workflow changes, addressing what Mandiant reports as the second-highest cloud attack vector at 16% of investigations.
The moral of the story? Please patch. We know it’s a pain. But please, patch.

33:55 Matt – “I feel like AWS has had this, where they scan the GIthub commits for years – so I appreciate them doing it, don’t get me wrong, but also, I feel like this has been done before?”

35:48 Google’s Cloud Location Finder unifies multi-cloud location data | Google Cloud Blog

Google Cloud Location Finder provides a unified API for accessing location data across Google Cloud, AWS, Azure, and Oracle Cloud Infrastructure, eliminating the need to manually track region information across multiple providers. The service is available at no cost via REST APIs and gcloud CLI.
The API returns rich metadata including region proximity data (currently only for GCP regions), territory codes for compliance requirements, and carbon footprint information to support sustainability initiatives.
Data freshness is maintained at 24 hours for active regions with automatic removal of deprecated locations.
Key use cases include optimizing multi-cloud deployments by identifying the nearest GCP region to existing AWS/Azure/OCI infrastructure, ensuring data residency compliance by filtering regions by territory, and automating location selection in multi-cloud applications. This addresses a common pain point where organizations maintain hard-coded lists of cloud regions across providers.
While AWS and Azure offer their own region discovery APIs, Google’s approach of providing cross-cloud visibility in a single service is unique among major cloud providers. The inclusion of sustainability metrics like carbon footprint data aligns with Google’s broader environmental commitments.

37:39 C4D VMs: Unparalleled performance for business workloads | Google Cloud Blog

Google’s C4D VMs are now generally available, powered by 5th Gen AMD EPYC processors (Turin) and delivering up to 80% higher throughput for web serving and 30% better performance for general computing workloads compared to C3D.
The new instances scale up to 384 vCPUs and 3TB of DDR5 memory, with support for Hyperdisk storage offering up to 500K IOPS.
C4D introduces Google’s first AMD-based Bare Metal instances (coming in weeks), providing direct server access for workloads requiring custom hypervisors or specialized licensing needs. The instances also feature next-gen Titanium Local SSD with 35% lower read latency than previous generations.
Performance benchmarks show C4D delivers 25% better price-performance than C3D for general computing and up to 20% better than comparable offerings from other cloud providers. For database workloads like MySQL and Redis, C4D shows 35% better price-performance than competitive VMs, with MySQL seeing up to 55% faster query processing.
The new VMs support AVX-512 with a 512-bit datapath and 50% more memory channels, making them well-suited for CPU-based AI inference workloads with up to 75% price-performance improvement for recommendation inference. C4D also includes confidential computing support via AMD SEV for regulated workloads.
C4D is available in 12 regions and 28 zones at launch, with a 30-day uptime window between planned maintenance events. Early adopters like AppLovin report 40% performance improvements, while Verve Group sees 191% faster ad serving compared to N2D instances.

38:18 Introducing G4 VM with NVIDIA RTX PRO 6000 | Google Cloud Blog

Google Cloud is first to market with G4 VMs featuring NVIDIA RTX PRO 6000 Blackwell GPUs, combining 8 GPUs with AMD Turin CPUs (up to 384 vCPUs) and delivering 4x compute/memory and 6x memory bandwidth compared to G2 VMs. This positions GCP ahead of AWS and Azure in offering Blackwell-based instances for diverse workloads beyond just AI training.
The G4 instances target a broader range of use cases than typical AI-focused GPUs, including cost-efficient inference, robotics simulations, generative AI content creation, and next-generation game rendering with 2x ray-tracing performance. Key customers include Snap for LLM inference, WPP for robotics simulation, and major gaming companies for next-gen rendering.
With 768GB GDDR7 memory, 12 TiB local SSD, and support for Multi-Instance GPU (MIG), G4 VMs enable running multiple workloads per GPU for better cost efficiency. The instances integrate with Vertex AI, GKE, and Hyperdisk (500K IOPS, 10GB/s throughput) for complete AI inference pipelines.
G4 supports NVIDIA Omniverse workloads natively, opening opportunities in manufacturing, automotive, and logistics for digital twins and real-time simulation. The combination of high CPU-to-GPU ratio (48:1) and Titanium’s 400 Gbps networking makes it suitable for complex simulations where CPUs orchestrate graphics workloads.
Currently in preview with global availability by year-end through Google Cloud Sales representatives. Pricing not disclosed, but positioning suggests premium pricing for specialized workloads requiring both AI and graphics capabilities.

Azure

39:40 Public Preview: Encrypt Premium SSD v2 and Ultra Disks with Cross Tenant Customer Managed Keys

Cross-Tenant customer-managed Keys (CMK) for Premium SSD v2 and Ultra disk are now in preview in select regions.
Encrypting managed disks with cross-tenant CMK enables encrypting the disk with a CMK hosted in an Azure Key Vault in a different Microsoft Entra tenant than the disk.
This will allow customers leveraging SaaS solutions that support CMK to use cross-tenant CMK with Premium SSD v2 and Ultra Disks without ever giving up complete control. (i have doubts)

40:31 Justin – “The only was this makes sense to me is if you have a SaaS application where you’re getting single servers or small cluster of servers per tenant; which I don’t want to manage. But if that’s what you have, then this may make sense to you. But this has a pretty limited use case, in my opinion.”

42:10 Microsoft Cost Management updates—May 2025 (summary) | Microsoft Community Hub

Azure Carbon Optimization reaches general availability, allowing organizations to track and reduce their cloud carbon footprint alongside cost optimization efforts.
This positions Azure competitively with AWS’s Customer Carbon Footprint Tool and GCP’s Carbon Footprint reporting.
Export to Microsoft Fabric enters limited preview, enabling direct integration of Azure cost data into Microsoft’s unified analytics platform.
This streamlines FinOps workflows by eliminating manual data transfers between Cost Management and analytics tools.
Free Azure SQL Managed Instance offer launches in GA, providing a no-cost entry point for database migrations.
This directly challenges AWS RDS Free Tier and could accelerate enterprise SQL Server migrations to Azure.
Network Optimized Azure Virtual Machines enter preview, promising reduced network latency and improved throughput for data-intensive workloads. These specialized VMs target high-performance computing and real-time analytics scenarios.
Smart VM Defaults in AKS reaches GA, automatically selecting cost-optimized VM sizes for Kubernetes workloads.
This feature reduces overprovisioning and helps organizations avoid common AKS sizing mistakes that inflate costs.

42:49 Matt – “I doubt they’re giving you Enterprise SQL. I assume it’s SQL Express or SQL standard – but they’re not giving you Enterprise SQL.”

44:20 Next edit suggestions available in Visual Studio – Visual Studio Blog

GitHub Copilot’s Next Edit Suggestions (NES) in Visual Studio 2022 17.14 predicts and suggests your next code edit anywhere in the file, not just at cursor location, using AI to analyze previous edits and suggest insertions, deletions, or mixed changes.
The feature goes beyond simple code completion by understanding logical patterns in your editing flow, such as refactoring a 2D Point class to 3D or updating legacy C++ syntax to modern STL, making it particularly useful for systematic code transformations.
NES presents suggestions as inline diffs with red/green highlighting and provides navigation hints with arrows when the suggested edit is on a different line, allowing developers to Tab through related changes across the file.
Early user feedback indicates accuracy issues with less common frameworks like Pulumi in C# and outdated training data for rapidly evolving APIs, highlighting the challenge of AI suggestions for niche or fast-changing technologies.
While this enhances Visual Studio’s AI-assisted development capabilities, the feature currently appears limited to Visual Studio users rather than being a cloud-based service accessible across platforms or IDEs.

45:36 Matt – “It’s a pretty cool feature and I like the premise of it, especially when you are refactoring legacy code or anything along those lines where it’s like, hey, don’t forget this thing over here – because on the flip side, while it’s distracting, it also would be fairly nice to not run everything, compile it, and then have the error because I forgot to refactor this one section out.”

Oracle

46:25 Oracle soars after raising annual forecast on robust cloud services demand | Reuters

Oracle raised its fiscal 2026 revenue forecast to $67 billion, projecting 16.7% annual growth driven by cloud services demand, with total cloud growth expected to accelerate from 24% to over 40%.
Oracle Cloud Infrastructure (OCI) is gaining traction through multi-cloud strategies and integration with Oracle’s enterprise applications, though this growth primarily benefits existing Oracle customers rather than attracting new cloud-native workloads.
The company’s approach of embedding generative AI capabilities into its cloud applications at no additional cost contrasts with AWS, Azure, and GCP’s usage-based AI pricing models, potentially lowering adoption barriers for Oracle’s enterprise customer base.
Fourth quarter cloud services revenue reached $11.70 billion with 14% year-over-year growth, suggesting Oracle is capturing market share but still trails the big three cloud providers who report quarterly cloud revenues of $25+ billion.
Oracle’s growth story depends heavily on enterprises already invested in Oracle databases and applications migrating to OCI, making it less relevant for organizations without existing Oracle dependencies.

48:18 Justin – “Oracle is actually a really simple cloud. It is just Solaris boxes, as a cloud service to you. It’s all very server-based. That’s why they have iSCSI and they have fiber channels and they have all these things that are very data center centric. So if you love the data center, and you just want a cloud version of it, Oracle cloud is not bad for you. Or if you have a ton of egress traffic, the cost advantages of their networking is far superior to any of the other cloud providers. So there are benefits as much as I hate to say it.”

49:38 Oracle and AMD Collaborate to Help Customers Deliver Breakthrough Performance for Large-Scale AI and Agentic Workloads

Oracle announces AMD Instinct MI355X GPUs on OCI, claiming 2X better price-performance than previous generation and offering zettascale AI clusters with up to 131,072 GPUs for large-scale AI training and inference workloads.
This positions Oracle as one of the first hyperscalers to offer AMD’s latest AI accelerators, though AWS, Azure, and GCP already have established GPU offerings from NVIDIA and their own custom silicon, making Oracle’s differentiation primarily about AMD partnership and pricing.
The MI355X delivers triple the compute power and 50% more high-bandwidth memory than its predecessor, with OCI’s RDMA cluster network architecture supporting the massive 131,072 GPU configuration for customers needing extreme scale.
Oracle emphasizes open-source compatibility and flexibility, which could appeal to customers wanting alternatives to NVIDIA’s CUDA ecosystem, though the real test will be whether the price-performance claims hold up against established solutions.
The announcement targets customers running large language models and agentic AI workloads, but adoption will likely depend on actual benchmarks, software ecosystem maturity, and whether Oracle can deliver on the promised cost advantages.

50:52 Introducing Vanity Urls On Autonomous DB

Oracle now allows custom domain names for APEX applications on Autonomous Database, eliminating the need for awkward database-specific URLs like apex.oraclecloud.com/ords/f?p=12345 in favor of cleaner addresses like myapp.company.com.
This vanity URL feature requires configuring DNS CNAME records and SSL certificates through Oracle’s Certificate Service, adding operational complexity compared to AWS CloudFront or Azure Front Door which handle SSL automatically.
The feature is limited to paid Autonomous Database instances only, excluding Always Free tier users, which may restrict adoption for developers testing or running small applications.
While this brings Oracle closer to parity with other cloud providers’ application hosting capabilities, the implementation requires manual certificate management and DNS configuration that competitors have largely automated.
The primary benefit targets enterprises already invested in Oracle’s ecosystem who need professional-looking URLs for customer-facing APEX applications without exposing underlying database infrastructure details.

Closing

And that is the week in the cloud! Visit our website, the home of the Cloud Pod where you can join our newsletter, slack team, send feedback or ask questions at theCloud Pod.net or tweet at us with hashtag #theCloudPod

[00:00:00] Speaker A: Foreign. [00:00:06] Speaker B: Welcome to the Cloud Pod where the forecast is always cloudy. We talk weekly about all things aws, GCP and Azure. [00:00:14] Speaker A: We are your hosts, Justin, Jonathan, Ryan and Matthew. Episode 309 recorded for June 17, 2025 Microsoft tries to give away cloud services for free. Sadly, it's only sequel Good evening, Matt. How's it going? [00:00:30] Speaker C: Better than using SQL every day. [00:00:33] Speaker A: Yeah, well that's unfortunately I have to do and you have to do so. [00:00:36] Speaker C: Just don't tell anyone. [00:00:38] Speaker A: I know, I know. Unfortunately we don't have Ryan tonight. He informed us that his children are at camp and he's going on a date with his wife and I think that's worthy. So, you know, he should, he should enjoy that time. So we we instead listen to your child going to sleep and try to get a third who failed us. But we'll be hopefully have them on here very soon to join us to talk about some QA stuff, a couple other things, but let's get into it. There's two of us. This will make this episode maybe a little less long than last week. Since we were like an hour and 30 minutes last week. [00:01:08] Speaker C: Every time we think it's going to be short, it's always long. [00:01:11] Speaker A: So, you know, maybe, you know, we'll see. We'll see how we do. First up, Meta is finalizing a $14 billion investment for a 49% stake in Scale AI with CEO Alexander Wang join to lead a new AI research lab at Meta. This follows similar moves by Google and Microsoft acquiring AI talent through investments rather than direct acquisitions to avoid regulatory scrutiny. Scale AI specializes in data labeling and annotation services critical for training AI models serving major clients including OpenAI, Google, Microsoft and Meta. Company's expertise covers approximately 70% of all AI models being built, providing Meta with valuable intelligence on competitor approaches to model development. Deal reflects Meta struggles with Llama AI models, particularly the underwhelming reception of LLAMA 4 and delays in releasing the more powerful behemoth model due to concerns about competitiveness with OpenAI and deep seq. Meta recently recognized his gen unit into two divisions following these setbacks. Wang brings both technical AI expertise and business acumen, having built Scale AI from a 2016 startup to a 14 billion valuation today. His experience includes defense contracts and recent defense llama collaboration with Meta for national security applications. [00:02:18] Speaker C: 49% of $14 billion a lot of money. Something, right? [00:02:23] Speaker A: You sure did. [00:02:25] Speaker C: No, it's interesting, especially the first part of this where companies are trying to acquire AI talent through investments rather than the, you know, directly hiring people and hiring them away from other companies. It's going to be an interesting trend to see if it continues on in the industry where they just keep doing it this way. They just acquire small companies and medium or large, in this case companies, in order to continue to grow their teams or to at least augment their teams in that way. Or if they're going to try to build their own in house units too. [00:02:56] Speaker A: Yeah, it'll be curious to see how many more of these are left. Like how many startup companies are really out there that have this valuation that you can even ProCure into an AI aqua hire. I mean, Claude is probably the last big one I can think of. Or Anthropic. Yeah, but Amazon, I guess is technically Aqua Hire. I mean they're the bigger and biggest investor in that company. [00:03:17] Speaker C: Yeah, they took on Claude or Anthropic and Microsoft took on OpenAI, so. Really? And Google just said, hey, we got this, we can build our own models. We don't need anybody else with Gemini. And their models are pretty good, at least from what I played with. I definitely use their deep research. [00:03:35] Speaker A: I mean the Bard models were pretty bad and I think the early, like Gemini 1.5 I think it was, was okay. The 2.0 was a good step forward and the 2.5 has been quite, quite good. [00:03:47] Speaker C: The 2.5 I thought was pretty good. 2.0 I thought was decent. I like the deep research where it actually searches the web and everything else. That's really where I use the Google model I think the most. I know a lot of the other ones have that feature now, but it's still kind of like the first one I played with. So it's kind of like my default to go to. And I'm like, hey, I have this random home project that I need at least a starting point on. Someone go in and tell me the top four trackers for trucks or something completely random. [00:04:21] Speaker A: Yeah, I think it's evolving so quickly that you just have to continue to keep on track of all of these different models and what they're doing. And this one may not work well today and then tomorrow it does. So I mean that's been kind of my experience. Even with some of the minor updates to Claude or Toyota Gemini. I think we're on the fourth version of 2.5 actually. Uh, you know, like one, I think that was the third version, it was a little weak compared to the second one. And then the fourth one's kind of come back and brought it back into power where I thought it should be then. ChatGPT has been kind of interesting too, watching it recently with its struggles with some of their newer models. Not really, you know, being huge upgrades over previous ones. [00:04:57] Speaker C: Yeah, I feel like there's a lot of, you know, we're kind of, and maybe I'm wrong, but I feel like we're starting to hit the point where they're going to be more incremental updates and these massive power shifts. But we might see a few of them here and there. But at this point it's incremental and or leveraging the correct model in the correct location at the right time. So, you know, leveraging the models that best suit your need and it's playing with them as the application developer to figure out, hey, this model works better for me with these problems versus this other one over here. [00:05:27] Speaker A: Yeah, I mean I think the next big area you'll probably continue to see is a lot more consolidation of multimodals where they're basically able to handle video, audio and text. You know, you start getting into like different areas there. There's some interesting. You have the language support I think is a big area still where some of them lack, unless you're a mistral, they went big. So yeah, there's, there's definitely some, you know, key areas where you're seeing investment. But yeah, I don't know that you're going to can you're going to see a major leap forward. Although, I mean, I think it was Sam Altman was just saying, you know, by next year we're going to have AI making fundamental discoveries, which I'm like, I don't, I mean, unless you're using a model that I don't see, I don't know how you make that speculation or then it's good marketing for you. [00:06:08] Speaker C: I wonder it's if it's like going to be, hey, there's something here human, go look at it. AKA like hey, Hydrogen. It was a. Hydrogen was discovered in the sun before it was discovered on Earth, you know, so they were able to see, understand it, what it was before they figured out where it was. So I wonder if it's going to be that same thing. Like AI is going to be like something funky's over here that we don't quite understand. And the humans, you have to go in and kind of do that. Last 15, 20%. [00:06:38] Speaker A: Databricks dropped us three new toys this week. First up is the Databricks free edition, which provides access to the same data and AI tools used by enterprise customers, removing the cost barrier for students and hobbyists to gain hands on experience with production grade platforms. In addition to that, they also gave you databricks linkbridge as a free open source tool that automates up to 80% of enterprise data warehouse migration tasks including SQL conversion, validation and reconciliation. Supporting over 10 legacy data warehouses and converts proprietary SQL dialects like BTech T, SQL, PL SQL into ANSI compliant SQL. And then the third tool they gave us is the Databricks one, which creates a simplified interface specifically for business users to access data insights without needing technical expertise in clusters, queries or notebooks. Consumer access Entitlement is available now with the full experience of entering beta later this summer. So overall I think the Databricks free edition is a really strong move on their part. Being able to now I have access to it while I'm at home, I can play with it, I can see what it does and kick the tires on it and be interested in it as a hobbyist and then I can bring it back to my day job and say, hey, I was using databricks over the weekend and I did a thing and I think it could work for us at work being able to get access to these tools and these type of capabilities to play with. I think it's a huge advantage for the industry is you're trying to, you know, everything's moving so fast right now that unless you have access to these tools, you feel like you're left behind. [00:08:00] Speaker C: Yeah, I mean, and getting it in the hands of students and universities that are going to come out and say, hey, I use this tool, let me go get this company that I work for to now buy it in the same way as hey, I'm on the weekend. But targeting it at students is I think is another great area. And Microsoft, I know, did that years ago when they were like, hey, as a student you can get all of Microsoft products for free. And really getting the getting into the hands of people early enabled them to leverage that in the long term. So it's really a long term play here that they're doing and I think it's great. I mean the more tools you get to more people like this, the better off we all are. Because someone's going to try something that nobody ever thought of and it will come up with a brand new thing. [00:08:48] Speaker A: Yeah. You have any good use cases you might want to use some of these databricks tools for just to play with? [00:08:53] Speaker C: No, I've always just looked at the price tag and said, yeah, I'm good. So I now have to like take that out of my mind to Be like, oh, it's free now. I can go play with this. But also my life is currently consumed with a infant and newborn, so any spare time I get. And my life is currently devoted to sleeping. [00:09:11] Speaker A: So yeah, you'll eventually get out of that phase. Then you can, you can remember databricks is there, or next time you take some vacation. Yeah, whenever that is. [00:09:20] Speaker C: Or when I want to sleep at night, I'll go play with databricks. One of the two. [00:09:23] Speaker A: Yeah, there you go. I mean, I'm sure, I'm sure it's riveting. [00:09:26] Speaker C: Yeah. [00:09:28] Speaker A: All right, let's move to aws, who is partnering this week with Lawrence Livermore National Laboratory to apply machine learning to fusion energy research, specifically to predict and prevent plasma disruptions that can damage tokamak reactors. The collaboration uses AWS cloud infrastructure to process massive datasets from fusion experiments. And the project leverages AWS, SageMaker and high performance computing resources to analyze terabytes of sensor data from fusion reactors, training models that can predict plasma instabilities milliseconds before they occur. And this predictive capability could prevent costly reactor damage and accelerate fusion development timelines. Cloud computing enables fusion researchers to scale their computational workloads dynamically, running complex simulations and ML training jobs that could be prohibitively expensive. With on premise infrastructure, AWS provides the elastic compute needed to process years of experimental data from multiple fusion facilities around the world. Partnership demonstrates how cloud based AI ML services are becoming essential for scientific computing applications that require massive parallel processing. Really the benefit of this is Amazon, because if they can make this work and they can get cheaper power for their data centers, that's a huge benefit to their margins. [00:10:31] Speaker C: Oh yeah, that's going to be massive. But to me, look at the same thing 15 years ago. Hey, use the cloud for HPC computing. You know, we have all this, this extra compute that you're able to spin up and turn off as needed. Go run your, you know, spark job and your HPC compute workload and turn it off. So to me, this is really the same thing that the cloud was built for, just spun a little bit differently and targeting, you know, GPUs versus CPUs things along those lines. So it's good to see the cloud. The premise of the cloud is still there at the fundamentals. [00:11:09] Speaker A: I agree. It is kind of cute though how they always try to make it sound like some new amazing thing. It's existed for a while. Amazon Q Developer is now supporting Model Context Protocol, or MCP for short, in Visual Studio code and JetBrain IDEs, enabling developers to connect external tools like Jira and FIGMA directly into their coding workflow. This eliminates manual context switching between browser tabs and allows Q Developer to automatically fetch project requirements, design specs, and update task statuses. MCP provides a standardized way for LLMs to integrate with application, share context, integrate with APIs, and developers can configure MCP servers with either Global Scope or Workspace Scope with grant tools like Ask, Always Allow and Deny options. The prior limitation shown demonstrates fetching JIRA issues, moving tickets to in progress, analyzing FIGMA designs for technical requirements, and implementing code changes based on combined context for both tools. This integration allows Q Developer to generate more accurate code by understanding both businesses requirements and design specifications simultaneously. I mean, if you think that Q Developer is the best tool for you, more power to you. I'm not going to stop you, but I'm just glad you're using a cool and I'm glad to see this get added to one more place. [00:12:17] Speaker C: Have you played with Q Developer? I have not. [00:12:20] Speaker A: I have not because why would I, right? [00:12:22] Speaker C: That's kind of where I am. Between GitHub, Copilot and Claude code and all the other stuff I'm like yeah, I don't really even need to go into Q Developer and it just feels like I'm trying to get their hands in the game a little bit too late. I feel like when they finally were like oh wait, we have to do. [00:12:42] Speaker A: This thing AWS WAF is now including automatic layer 7 DDoS protection that detects and mitigates attacks within seconds, using machine learning to establish traffic baselines and minutes identify anomalies about manual rule configuration. The managed rule works across cloudfront, ALB and other WAF supported services, reducing operational overhead for security teams who previously had to manually configure and tune DDoS protections available to all AWS, WAF and Shield Advanced subscribers in most regions. The service automatically applies mitigation rules when traffic deviates from neural patterns with configurable responses including challenges or blocks. This addresses a critical gap in application layer protection where traditional network layer DDoS defenses fall short. Particularly important as layer 7 attacks become more sophisticated and frequently pricing follows standard AWS WAF manage rule group costs, making enterprise grade ddos production accessible without requiring dedicated security infrastructure or expertise. I had to say that I've used the WAF now quite a bit as well as SHIELD and Cloudfront and compared to using Cloudflare they're so limited what you can do on these things. I so much prefer cloudflare over trying to tune AWS WAF properly and they have these really big bulky selections like turn on bot protection. Cool. Okay, what type of bots? Oh no, you can't get to that level of granular unless you write custom rules. And then you're in this mess of writing custom rules to do everything. It's kind of a terrible experience in my experience with it, but I'm glad that's now easier because I think people probably have thought they had DDoS configured and they probably didn't. And if this helps them in a time of need, I'm okay with that. [00:14:14] Speaker C: I mean, it goes into two pieces, you know, based on what you were saying before. It's that you build it, you buy it. Cloudflare is going to be more expensive than AWS or Azure waf, you know, so. Or you buy, you know, and you have to build it, you have to build it to your specific specifications and everything else. Or you're going to go out and buy a tool like Cloudflare and perva any of them, you know, whatever it is, you know, these are good in my opinion for small and medium sized businesses that are just starting and needs to check that box. Specifically if you're on like a soccer or ISO compliance that you kind of have to say you have this done. Or if you're just trying to get, you know, hey, I have this product and you know, it is kind of a, you know, known entity and just giving yourselves a little bit of that, hey, I'm trying to do my best. But you're right, it's not the best thing out there. For that reason, you know, if you want something good, you go to one of the other solutions, but you're also going to pay for it. I did. Look, it's only 50. 50, whatever. WCU, because I don't remember those stands for WAF compute units probably. So it is something that I do remember from last time I dealt with the waf, which was each WAF set can only have so many rules associated with it and it's all set up based on the number of compute units that it's done too. So you might have to like turn around this and turn off other ones. You kind of get your rule set right where you want it because 50 was a good amount if I remember correctly. [00:15:47] Speaker A: Yep. They do have a rule ordering process that is a little bit opaque and definitely a little more trickier than you think it should be. [00:15:56] Speaker C: You can override the defaults in the manage rules, but then you gotta like figure it out so you're in CloudWatch logs to figure out which rule it disabled. Like it's fun. [00:16:05] Speaker A: Well then like, you know, they have block and they have challenge and they have all these things. And like there's like secret magic handshake stuff happening in the web back of the browser that you have to make sure you're not blocking with other things like csp. Like there's all kinds of weird edge cases that you run into trying to use all the different blocking methods. And it's like, well, I don't. I really want you to give me a breakdown of like this particular category, how many of this thing is happening and from where. And then I'll make a decision based on what it is if I want to do something. Because like, maybe I don't want to allow some crawler from, you know, Russia that looks like a search engine to crawl at all versus, you know, this thing, which is maybe Claude. And I want Claude to access my website because people are doing deep research on what I am selling on my website and I want them to get the access to that data so they can use my website to buy stuff. You know, like there's things like that that make sense, but again, I think it's just a blunt instrument that's hard to wield properly. [00:16:54] Speaker C: It's the old good bot versus bad bot. [00:16:57] Speaker A: Correct. And without the level of granularity. Unless you're willing to take your logs and then basically pipe them through Athena, that's the only way you can really get that level of granularity and detail. But then again, still customizing the rules per bot is impossible. [00:17:11] Speaker C: Right, which shows that into you're going to build your own and have your own team to manage, run, update this waf. Or do you just say whatever, I'm going to just go buy the tool because the WAF isn't that cheap. If I remember correctly, it can add up decently quick. If I remember correctly, it was like $15 just to open the door per month. It's not a ton, but it's not nothing. [00:17:36] Speaker A: For a small business, it's money well spent. But again, like if you have nothing, it's better than nothing. Just be aware that if you have a need, it might limit you. Then that's where I recommend Cloudflare over that. [00:17:52] Speaker C: So. [00:17:52] Speaker A: So, all right. Power tools for AWS Lambda now includes a Bedrock Agents function utility that eliminates boilerplate code when building Lambda functions that respond to Amazon Bedrock Agent action requests. The utility handles parameter injection and response formatting automatically, letting developers focus on business logic. Instead of integration complexity, the utility integrates seamlessly with existing power tools features like logger and metrics, providing a production ready foundation for AI apps available for Python, TypeScript and. NET. It standardizes how lambda functions interact with Bedrock agents across different programming languages. Organization is building agent based AI solutions. This reduces development time and potential errors in the Lambda to Bedrock integration layer. This usually extracts away the complex request response patterns required for agent actions, making it easier to build and maintain serverless AI apps. Developers can now start get started by updating to latest version of Power Tools or AWS Lambda in their preferred language. And since this is an open source utility edition, there are no additional costs beyond standard Lambda and Bedrock usage fees. [00:18:48] Speaker C: It's great to see them making these more accessible to not subject matter experts into the general developer that's just learning this to start off. So would I want to take my full app and go to full production leveraging power tools? No, but it's good to let the standard developer that just wants to play with something and learn and figure out how to do it get something up and running decently easy. [00:19:13] Speaker A: AWS is releasing Cedar Analysis as open source tools for verifying authorization policies, addressing the challenge of ensuring fine grain access controls work correctly across all scenarios rather than just test cases. The toolkit includes a CDER symbolic compiler that translates policies into mathematical functions and a CLI tool for policy comparison and conflict detection. The technology uses satisfiably modulo theories or SMT to solvers and formal verification with Lean to provide mathematically proven soundness and completeness, ensuring analysis for results accurately reflected production behavior. This approach can answer questions like whether two policies are equivalent if changes granted unintended permissions or policies contain conflicts or redundancy. Cedar itself has gained significant traction with 1.17 million downloads and production use by companies like MongoDB and Strong DM, making robust analysis tools increasingly important as their application scale. Open source Release under Apache 2.0 License allows developers to independently verify policies and researchers to build upon the formal methods foundation. Practical example demonstrates how subtle policy refactoring errors can be caught. Splitting a single policy into multiple policies accidentally restricted owner access to private photos with analysis tool identified before production deployment. This capability helps prevent authorization bugs that could lead to security incidents or access disruptions. So pretty good little example. [00:20:28] Speaker C: Yeah, specifically like the splitting into multiple policies and things along those lines. I've definitely shot myself in the foot before and it's great to see a tool out there that will actually help you debug and verify these because half the time with IAM policies it's like hey, we test in prod yolo, let's go see what's going on. So it's great to kind of see that there's a tool for that. I mean I've forgotten about this in our pre read, Justin mentioned that this was one of the few things that was released during the first reinforced, which I think is actually going on today. So it's great to see that they are continuing the development of this. Agreed. [00:21:09] Speaker A: We're using strong DM in the day job and it is really nice to see Cedar getting used in lots of different ways, particularly the mathematical proofs to be used in policies. So if you want to start factoring in like hey, you're on the corporate network or the corporate VPN and you're on a trusted device and you're on these things, then you can mathematically prove all those things. It's, you know, fantastic once you explain to an auditor, it's a little tricky at first, but it's pretty nice once you get there. [00:21:36] Speaker C: Explaining a lot of technical stuff to auditors is always a fun battle. [00:21:40] Speaker A: Yes, as long as you they can understand math you can get them there. But if they don't come from a mathematical background, it's a little bit trickier, is my experience. [00:21:49] Speaker C: Yeah, things will just get me in trouble if we continue this conversation. Yeah. [00:21:59] Speaker B: There are a lot of cloud cost management tools out there, but only Archera provides cloud commitment insurance. It sounds fancy, but it's really simple. Archera gives you the cost savings of a one or three year AWS savings plan with a commitment as short as 30 days. If you don't use all the cloud resources you've committed to, they will literally put the money back in your bank account to cover the difference. Other cost management tools may say they offer commitment insurance, but remember to ask will you actually give me my money back? Achero will click the link in the show notes to check them out on the AWS Marketplace. [00:22:39] Speaker A: All right. GCP A misconfiguration of Google's cloud IAM system caused widespread outages affecting App Engine, Firestore, Cloud SQL, BigQuery and Memory Store, demonstrating how a single identity management failure can cascade across multiple cloud services and impact thousands of businesses globally for two and a half hours. In fact, incident highlighted the interconnected nature of modern cloud infrastructure as services like cloudflare, workers, Spotify, Discord, Shopify and UPS experience partial or complete downtime due to their dependencies on Google Cloud components. Google Workspace applications including Gmail, Drive Docs Calendar Meet all experience some failures, some of showing how IAM issues can affect both infrastructure services and end user applications simultaneously, and the OUT underscores the critical importance of IAM redundancy and configuration management. They have now completed their full RCA which you can read on their incident report directly and basically they talked about In May they released a new feature to their Service control for additional quota policy checks. This code change and binary release went through their region by region rollout, but the code path that failed was never exercised during this rollout due to needing a policy change that would trigger the code as a safety precaution. The code changes came with a red button to turn off that particular policy serving path and the issue with this change was that it did not have appropriate error handling nor was it feature flag protected. Then on June 12th when the ad occurred last week a policy change was inserted into the regional spanner tables that Service Control uses for policies. Given the global nature of quota management, this metadata was replicated globally within seconds and this policy data contained unintended blank fields. Service Control then regionally exercised quota checks on policies in each regional data store and this pulled in those blank fields for this respective policy change and exercise the code path that hit the null pointer causing the binaries to go into a crash loop. This occurred globally given each regional deployment within two minutes their SRE team was triaged during the incident and within 10 minutes the root cause was identified and the red button to disable the serving path was being put into place. The red button was apparently rolled out within 25 minutes from the start of the incident and within 40 minutes of the incidents the red button rollout was completed and they started seeing recovery across regions, starting with the smaller ones first. Within some of their larger regions, particularly the one I was impacted by, US Central one who the service control task restarted. It created a thundering herd effect on the underlying infrastructure. It depends on I. E. The spanner table, overloading the infrastructure in general. So not a great adage, not a great day for Google. But you know Hugops, I mean for. [00:24:58] Speaker C: The SRE team at Google, within two minutes was already triaging and 10 minutes had identified the root cause. That's an impressive response time. Like that's really impressive to get your alerts, triage, get a solution out the door within 25 minutes. I mean while it was a global outage and affected many large companies out there, like you can't really be mad at anyone. Bugs are going to happen. It's going to happen. It's really how you respond to it. So, you know, getting that fixed and out there, the one piece of it that I didn't like as much, you know, granted I wasn't directly impacted at my day job about it was it felt like it took them about an hour after the outage started to acknowledge it. And it was kind of that old issue that AWS had years ago where they hosted their cloud status on the same infrastructure as the outage occurred. So they couldn't actually notify. [00:25:57] Speaker A: Yeah. And they called that out and one of their actions they took was that they'll resolve that, make sure their infrastructure for the Google cloud is not the same as the monitoring products. It was interesting because most of Google's actual services, Google Search, Google Ads, all those businesses were not impacted on this because they don't run on gcp. [00:26:15] Speaker C: Right. [00:26:16] Speaker A: Which, you know, it's sort of interesting you don't eat your own dog food, but also interesting that, you know, it's like this monitoring service probably live in that other infrastructure maybe. [00:26:24] Speaker C: Well, AWS had that years ago. I think with their first cloud watch out, the RCA was moving to its own dedicated infrastructure. So to me that aspect of it's kind of one of those things of like, hey, we learn from what the other people ran into 10 years ago. Don't make the same mistakes. Cloud status page was same thing as the S3 outage where Amazon couldn't notify. That was like 2015. Bravo to the SRE team at Google. Hug ops all around, but pat yourself on the back for a job well done. [00:26:59] Speaker A: Yep. Cloudflare also posted their RCA on the issue. They were impacted for 2 hours and 28 minutes on June 12, affecting their workers Key Value Store, Warp, Access, Gateway Images, Stream Workers, AI and Turnstile and other critical services due to a third party storage provider's failure that exposed architectural vulnerabilities in their infrastructure. Google the incident revealed a critical single point of failure and workers KV Central Data store, which depends on many Cloudflare products. Despite being designed as a coreless service that should run independently across all locations during the outage window, 91% of workers KV requests failed cascading failures across independent services while core services like DNS, Cache, Proxy and WAF remained operational, highlighting the blast radius of a shared infrastructure dependency. Cloudflare is accelerating migration of workers KV to their own R2 storage infrastructure and implementing progressive namespace RE enablement tooling to prevent future cascading failures and reduce reliance on third party providers. This marks at least the third significant R2 related outage in recent months, raising questions about the stability of Cloudflare storage infrastructure during the artificial transitional period. I mean, I Ops is hard, infrastructure is hard. I think the failure here is that they are running an entire KV on top of GCS or GCP in a way that they were impacted by this, where that should be blast radius out to multiple clouds. And again, Cloudflare is a partner of AWS and GCP and Azure. They should be able to make things redundant because I don't necessarily know that their infrastructure is going to be better than anyone else's infrastructure. They're going to have the same type of failures. And you know, this outage is very familiar to an outage that you mentioned in aws back in 2015 or 2016. So when you get to this type of scale, these outages happen. And so Amazon has definitely fixed a lot of their issues. We don't joke about tire fire one as much anymore in the US east one region. Google is going through its growing pains. Azure's gone through its. Maybe still going through it, still going through them. [00:28:46] Speaker C: Still going through them. [00:28:48] Speaker A: And so moving to their own infrastructure doesn't necessarily make me feel more confident that they're going to have less problems. It just means that they're going to experience problems in a bunch of different ways. And we talked about Cloudflare outages in the past where they realized they had too much dependency on one region or their software and not enough backoffs and they did a bunch of work around their core, this coreless architecture thing. But again, it just shows you that if you're not testing this stuff, often things have a sneaky way of becoming not coreless. [00:29:19] Speaker C: Yeah. And even if you're going to build your own, you're going to build it resilient to what you know, the cloud providers all try to build it to what they, they try to build out to an extra level and they have a lot more data to base it on. They have hundreds of customers, thousands of customers, millions of customers across all different workloads, so that they can see a lot more. To me, having this just be a one cloud provider just shows a failure. And even when it's in their own data store, it's not going to magically just always work. There's still somebody that has to maintain it. It's still someone's server somewhere. You know, it reminds me of, I think it was one of the first prime days AWS had their, had their website or sorry, Amazon had their website fronted, but not, not just by aws. You know, cdn, cloudfront but they also had done by other providers. If it's such a critical service, you need to make sure it's up. So anything that's in this P0 massive criticality, you know, whatever it is sphere of your business, you need to take a look at it. Not just like on a yearly basis, probably at least a quarterly basis, say, okay, let's talk about, let's tabletop this out. If this one component goes down, what happens? And at least attempt to tabletop it. If you can't fully exercise it, you can't do true chaos engineering. Because most places can't do true chaos engineering. [00:30:48] Speaker A: Yeah. Or they try. It's a disaster and they say we're never doing that again. [00:30:51] Speaker C: Right. [00:30:52] Speaker A: It's a good idea, but with sharp edges. Good idea with sharp edges. Do it in debt first. That's always my recommendation. [00:30:59] Speaker C: Tomod, you're a fud. [00:31:02] Speaker A: Well, I mean, if you want to have, you know, have a career, a limited career, production is the way to go. [00:31:06] Speaker C: That's what you do Right. Before you retire, get your severance, you know, call a day. [00:31:12] Speaker A: Yep, exactly. All right. Google is or has developed an automated tool that scans open source packages and docker images for exposed GCP credentials like API keys and service account keys, processing over 5 billion files across hundreds of millions of artifacts from repositories like PyPi, Maven Central and Docker Hub. System detects and reports leaked credentials within minutes of publication, matching the speed at which malicious actors typically exploit them with automatic remediation options, including disabling compromised service account keys based on customer configured policies. Unlike GitHub and GitLab source code scanning, this tool specifically targets built packages and container images where credentials often hide in configuration files, compiled binaries and build scripts, areas traditionally overlooked in security scanning. Google plans to expand beyond GCP credentials to include third party credential scanning later this year, positioning this as part of their broader deps.dev ecosystem for open source security analysis. For GCP customers publishing open source software, this provides free automated protection against credential exposure without requiring additional tooling or workflow changes, addressing what Mandy reports as the second highest cloud attack vector at 16% of their total investigations. [00:32:14] Speaker C: I feel like AWS has had this where they scan the GitHub commits for years. So great. I appreciate them doing it. Don't get me wrong, it's amazing and they need to keep doing it, but also feel like this has been done before. Just saying. [00:32:31] Speaker A: I mean AWS did it years ago. Yeah, where they automatically when people Commit code to GitHub it tells you that AWS secrets are in there as well. There's a bunch of open source tools that can do things. This unique vend here is that they're looking at the compiled objects and seeing if they're in the compiled item, which is probably the differentiator that's key here. But again, I'm glad to see this. I wish the cloud providers, knowing how big of a security issue this is, would just combine forces and work on this together as part of a unified strategy versus relying on each cloud provider to kind of solve this on their own way. [00:33:05] Speaker C: I'm curious, this is the second highest attack vector. What's the highest in the article that I did not read? [00:33:11] Speaker A: Probably misconfiguration. [00:33:12] Speaker C: Yeah, okay, well that makes sense because this was only 16%. [00:33:16] Speaker A: So yeah, let's see if it says. [00:33:18] Speaker C: In here just the second highest. [00:33:20] Speaker A: Yeah. [00:33:21] Speaker C: Oh it looks like. [00:33:22] Speaker A: Oh it's in the M Trends report. Let me go. Exploits continue to be the most common intentional vector infection vector at 33% and for the first time stolen credentials rose the second most common in 2024 at 16%. So yeah, exploits not patching. That's your number one cause to pay the butt. [00:33:42] Speaker C: Please patch. [00:33:43] Speaker A: Please patch. It's awful. I know where we oh yes, here we are. Google Cloud Location Finder is providing a unified API for accessing location data across Google Cloud, AWS and Azure and Oracle cloud infrastructure, eliminating the need to manually track region information across multiple providers. This new service from Google is available at no cost via REST APIs and the G Cloud CLI. The API will return you rich metadata including region and proximity data, currently only for GCP regions, territory codes for compliance requirements and carbon footprint information to support sustainability initiatives. Data freshness is maintained at 24 hours for active regions with automatic removal of deprecated locations and key use cases include optimizing multicloud deployments by identifying the nearest GCP region to existing AWS and Azure or OCI infrastructure, ensuring data residency compliance by filtering regions by territory and automating location selection in multiple cloud applications. This addresses a common pain point where organizations maintain hard coded lists of cloud regions across providers, while AWS and Azure offer their own region discovery APIs. Google's approach of providing cross cloud visibility to a single service is unique among the cloud providers and the inclusion of sustainability metric carbon footprint data lines of Google's broader environmental commitment. So take a look at this Google Cloud Location Finder. If you are doing multi cloud and care where your data has to be. [00:34:58] Speaker C: I feel like this is them just saying, hey, you did it. We're just going to do it slightly better so that we can say we did it slightly better. Like most of the regions you know, are all pretty well defined. They don't spin up out of nowhere and you're not surprised by them. Maybe the local zones and you know, the pops. If you're at the edge, maybe that's where this comes into play. But I don't see this being a massive issue. That is a problem. You need to software. [00:35:26] Speaker A: Yeah, I mean the people who have it, it's probably a big issue for them and for everybody else who doesn't care. Google's got a couple of new VM types for us this week. They have the new C4D VMs which are generally available, powered by the 5th gen AMD EPYC processor, the Turing processor, and delivers up to 80% higher throughput for web serving and 30% better performance for general computing workloads compared to the C3D predecessor. The new instances scale up to 384 VCPUs and 3 TB of DDR5 memory stored for hybrid disk storage offering up to 500,000 IOPS per second. C4D introduces Google's first AMD based bare metal instances coming in a few weeks, providing direct server access for workloads requiring custom hypervisors or specialized licensing needs. The other instance coming out is The Google Cloud G4VMS featuring Nvidia RTX Pro 6000 Blackwell GPUs combining 8 GPUs with AMD's turn CPUs delivering 4x memory and 6x memory bandwidth compared to G2VMS positions used to be ahead of AWS and Azure and offering Blackwell based instances for diverse workloads beyond just AI training. The G4 instances target a broader range of use cases that typically AI focused GPUs including cost efficient inference, robotic simulations and generative AI content creation and next generation game rendering with 2x ray tracing performance. Key customers include Snap for LM inference, WPP for robot simulation, and major gaming companies for next gen rendering. So yeah, if you need some new servers, some faster, better hardware, Here you go, C4DS and G4S. [00:36:55] Speaker C: Every time we talk about these I look at the metrics and I'm like I really want to know what I need to be, what I need to do to use 400 gigabit networking with 500,000 IOPS. Like I just don't design that way I guess, you know, for most things I like, I totally understand the use cases for them. But at the same point I'm like it's just so much. [00:37:21] Speaker A: So, so much for sure. Moving on to Azure Cross Tenant Customer managed keys or CMKs for premium SSD v2 and ultra disk is now in preview and select regions. Encrypting managed disks with Cross tenant CMDK enables encrypting the disk with a CMK hosted in Azure Key Vault and a different Microsoft Entre tenant than the disk itself. This will allow customers leveraging SaaS solutions that support CMK to use cross dense CMK with premium SSD v2 and ultra disks without ever giving up complete control. And my first comment on this is I have major doubts because for most SaaS companies to implement CMK they're typically not doing that at the SSD level or the ultra disk level because they're typically in multi tenant architecture and typically their CMK deployment will occur in the database layer and or in the object store layer where they're storing custom data or customers data typically. So only way this makes sense for me is if you have a SaaS application where you're getting single servers or a small cluster of servers per tenant which I don't want to manage. But if that's what you have, then this may make sense to you. But this has a pretty limited use case in my opinion. [00:38:29] Speaker C: Yeah, I've run across this in the day job where a customer wanted the servers themselves and like you said at that point it's you're losing your multi tenant, you're losing your economies of scale as a SaaS provider. But if customers are willing to pay for it, here's the dollar amount you decide if you want to do it and you're 100% correct. That's where I've kind of seen it be done. The only other place I've seen something like this be useful was on aws when a organization had such stringent security requirements where they weren't allowed to make their own KMS keys, they had to use the centralized managed KMS keys for disks on boot up. And you want to talk about debugging hell on aws. That was a really good way to put yourself in a hole and kill many by many hours of oh the auto scaling service doesn't have access to boot the EVS volume. You know the reason why I threw this Arcoid here was just more of a this doesn't feel like something that should have taken this long for volumes to be able to do it. I understand it's for premium SSD2 which are fairly recent ultradisc but at the same point like this is not something that if you're leveraging that infrastructure you probably already have these requirements. [00:39:45] Speaker A: Yeah, that'd be my take on it as well. Microsoft Cost Management has several updates for this week including Azure Carbon Optimization which reaches gnavability, allowing organizations to track and reduce their cloud carbon footprints alongside cost cost optimization efforts. Exports to Microsoft fabric enters limited preview enabling direct integration of Azure cost data into Microsoft's unified analytics platforms. You can build all those nice Power BI dashboards. Free Azure SQL managed instances offer launches and general availability providing a no cost entry point for database migrations. This directly challenges is AWS's RDS's free tier and could accelerate enterprise SQL server migrations to Azure if all it took was one free server. Shame on Microsoft for taking that long. [00:40:26] Speaker C: Also, I doubt that they are giving you enterprise SQL. I assume it's SQL Express or SQL Standard. They're not giving you enterprise SQL. I would be shocked. [00:40:35] Speaker A: I mean or if they are giving you enterprise that's on a very limited hardware footprint. Yeah, yeah, that doesn't actually do anything that you could use. [00:40:41] Speaker C: And this isn't SaaS, this is managed instances. So you still have a running server in your account. I don't know how much Chelsea, you're familiar but like there's 12 different ways to run SQL inside of Azure. This is not the Azure SQL where you just say give me a SQL Server. This is hey give me a server. And then it has SQL on it and they do some special magic to help manage that instance along the way. [00:41:09] Speaker A: Gotcha. That sounds terrible. [00:41:11] Speaker C: Yeah. [00:41:12] Speaker A: In addition to free Azure SQL manage instances, you also get network optimized Azure virtual machines entering preview promising reduced network latency and improved throughput for for data intensive workloads. And then Smart VM defaults in Azure Kubernetes service reaches general availability automatically selecting cost optimized VM sizes for your Kubernetes workloads. This feature reduces over provisioning and helps organizations avoid common kubernetes sizing mistakes that inflate your costs. I mean Kubernetes is number one reason why your costs were inflated. [00:41:40] Speaker C: I thought I was going to say it's the number one way to cause outages. It's going to automatically change your provisioning for you. [00:41:48] Speaker A: Who doesn't Love that? Your VMs changing underneath the hood without letting you know it's always appreciated. And then GitHub Copilot's next edit suggestions or NES in Visual Studio 2022.17.14 predicts and suggests your next code edit anywhere in the file, not just at cursor location. Using AI to analyze previous edits and suggest insertions, deletions, or mixed changes. This feature goes beyond simple code completion by understanding logical patterns in your editing flow, such as refactoring a 2D point class to a 3D for updating legacy C syntax and modern STL. I don't want to do either one of those two things, by the way, so hard pass, making it particularly useful for systematic code transformations. NES presents suggestions as inline diffs with red green highlighting and provides navigation hints with arrows when the suggested edit is on a different line, allowing developers to tab through related changes across the file. Early user feedback indicates accuracy issues with less common frameworks like Pulumi and C, and outdated train data for rapidly evolving APIs, highlighting the challenges of AI suggestions for niche or fast changing technologies. While this enhances Visual Studio's AI assisted development capabilities, the feature currently appears limited to Visual Studio users rather than being a cloud based service accessible across platforms or IDEs. [00:42:55] Speaker C: This is a really good way to get me never to finish a project because it's just going to move the cursor and notify me or something and my add is going to be like ooh, shiny object over here. I'm never going to finish anything. [00:43:06] Speaker A: Click. [00:43:07] Speaker C: Yeah, so I mean it's a pretty cool feature. You know, I like the premise of it, especially when you are refactoring legacy code or anything along those lines where it's like hey, don't forget this thing over here. Because on the flip side, while it's distracting, it also would be fairly nice to not be just to not run everything, compile it and then have the error, you know, because I forgot to refactor this one section out. So you know a little bit of your damned if you do, damned if you don't. But this feels like my ADD is going to take a really, really big distraction on this for sure. [00:43:41] Speaker A: Well, do you feel a cold chill in your spine Matt? [00:43:44] Speaker C: Well, we're still at Azure or we're moving to Oracle, so obviously the answer. [00:43:48] Speaker A: Is yes, we are moving to Oracle. You are correct. And Oracle had some surprises this week. First of all, they had their earnings call where it raised its fiscal 2026 revenue forecast to 67 billion, predicting a 16.7 billion percent annual growth driven by cloud services demand, which with total cloud growth expected to accelerate from 24% to over 40%, it's not the law of large numbers anymore. [00:44:14] Speaker C: They're still small numbers. [00:44:15] Speaker A: Guys, I Mean, it's still the law of large number problems, but the fact that they are this aggressive now means that they are succeeding. Getting those lawyer salespeople out there to lawsuit the hell out of you in this middle of this pandemic. [00:44:28] Speaker C: No, no, you're wrong. They sold one of the SAP Sahana instances that we talked about last week. All they needed was one. It's that much. [00:44:35] Speaker A: Yeah, that's true. Yeah, yeah. They said in the earnings call. Oracle cloud infrastructure is gaining traction through multi cloud strategies integration with Oracle's enterprise applications. So some of the revenue they're talking about is shared with the other cloud providers. Just to cut to that such a chase though this growth primarily benefits existing oral customers rather than attracting new cloud data workloads. The company's approach of embedding generative AI capabilities into its cloud applications at no additional costs contrasts with AWS, Azure and GCP's usage based AI pricing model. I mean this is you get it as long as you build it on top of Oracle. So there's that licensing part that you got it to you for free. But you know, okay. This potentially could lower adoption barriers for Oracle's enterprise customer base. Fourth quarter cloud service revenue reached 11.7 billion. With 14% year over year growth searching Oracle is capturing market share but still trails the big three cloud providers who reported quarterly cloud revenues of 25/plus billion dollars. Oracle's growth story depends heavily on enterprises already invested in Oracle databases and applications migrating to oci, making it less relevant for organizations without existing Oracle dependencies. Want, want. [00:45:36] Speaker C: The only thing that I would dislike more than running my infrastructure on Azure would be to run it on Oracle. [00:45:44] Speaker A: I mean Oracle is actually a really simple cloud. It is just Solaris boxes as a cloud service to you. It's all very server based. That's why they have ISCSI and they have Fibre Channel and they have all these things that are very data center centric. So if you love the data center and you just want a cloud version of it, Oracle cloud is not bad for you. Or if you have a ton of egress traffic, the cost advantages of their networking is far superior to any of the other cloud providers. So there are benefits. As much as I hate to say it in using Oracle, I think I. [00:46:15] Speaker C: Will put on my prediction for every year for re invent going forward is going to be AWS see a lower egress cost by at least one set. [00:46:23] Speaker A: Yeah, we keep trying but that hasn't happened yet. Yeah, I mean I do think the being able to take credit for part of the revenue of your @ partner Oracle databases is a clever move. I do give Oracle props for that one. Very, very Darth Vader ish. [00:46:42] Speaker C: So all those RDS instances running Oracle that I've had to deal with in prior lives now account for them. So it's always nice indeed. [00:46:53] Speaker A: Then Oracle announces a new AMD Instinct Mi355X GPU on OCI, claiming 2x better price performance than previous generation and offering Zetascale AI clusters with up to 131,000 GPUs for large scale AI training and inference workloads. This position's Oracle is one of the first hyperscalers to offer AMD's latest AI accelerators through AWS. Azure and GP already have established GPU offerings from Nvidia and their own custom silicone, making Oracle's division primarily about AMD partnership and pricing. So yeah, if you need access to some non Nvidia GPUs, Oracle's got your back. But if you're building out, have a Cuda, this doesn't help you out. [00:47:32] Speaker C: It's so many GPUs. I'm still just flabbergasted by that number. And also why that number 131,072 I don't know. [00:47:43] Speaker A: It's a bit of a mystery to me on that one too. It seems sort of strange that you would have that be your the number. It's got to be a silicon limitation of some kind. I just don't know what it is. [00:47:54] Speaker C: This is where we need Jonathan would be like this is the reason why. [00:47:58] Speaker A: Yeah Jonathan like duh. Come on. All right, and then our final Oracle story. Oracle is now allowing custom domain names for Apex applications on autonomous database, eliminating the need for Awkward database specific URLs like apex.oraclecloud.com Ords F? P equals 12345 in favor of clean addresses like myapp.company.com this van URL provides feature requires configuring DNS scene records and SL certificates through Oracle Certificate service, adding operational complexity compared to Amazon Cloudfront or Azure Front Door which handle SSL automatically. The feature is limited to paid autonomous database instances only, excluding always free tier users. So my two databases just sit there doing nothing will not get this, which may restrict adoption for developers. Nope, we'll still wouldn't have used it either way or running small applications. While this brings Oracle closer to parity with other cloud providers, application hosted capabilities implementation requires manual certificate management and DNS configuration that competitors have largely automated the Primary benefit targets enterprises already invested in Oracle's ecosystem who need professional looking URLs for customer facing Apex apps without exposing underlying database infrastructure details. Thanks Oracle. [00:49:06] Speaker C: Welcome to 1997. It just felt like something that should already be there. [00:49:13] Speaker A: Yeah, you would have thought so, but apparently not. When your cloud is eight years behind everyone else, then that's what you get. [00:49:20] Speaker C: I mean, there's always those features that are out there. The KMS for Ultra disk. Everything else felt like a feature that's just out there. The WAF DDoS protection we talked about earlier for AWS again felt like something that was sort of there, but not really. And they just kind of, you know, finally finished it off. You know, I feel like because we live in these worlds every day we forget what's 100% there. And because we deal in multiple clouds, you know, what is the exact thing on each cloud versus what's just we know from elsewhere, which is always like, you know, my problem is like making sure I'm thinking about the right cloud with the right limitations for the right service. [00:49:59] Speaker A: Agreed. Well, that is it. We have reached the end of the show. Once again, sub one hour. [00:50:05] Speaker C: Good job, Justin. [00:50:07] Speaker A: Yeah, you're welcome. The editing I did to the show note bot was definitely. Yeah, a little less awkward too in a couple of spots. So good. [00:50:17] Speaker C: Good job. [00:50:18] Speaker A: All right, we'll see you maybe next week, maybe not. I know we'll definitely see Ryan back next week, but I know you're maybe traveling, so we'll see how it goes. [00:50:26] Speaker C: Yep, I'll do my best. [00:50:27] Speaker A: Have a great one. [00:50:27] Speaker C: Bye, everybody. [00:50:32] Speaker B: And that's all for this week in Cloud. We'd like to thank our sponsor, Archera. Be sure to click the link in our show notes to learn more about their services. While you're at it, head over to our [email protected] where you can subscribe to our newsletter, join our Slack community, send us your feedback and ask any questions you might have. Thanks for listening and we'll catch you on the next episode. [00:50:55] Speaker A: Sam.

Show Notes

Titles we almost went with this week:

AI Is Going Great – Or How ML Makes Money

AWS

GCP

Azure

Oracle

Closing

Chapters

Episode Transcript

Other Episodes

Episode 200

200: Now you can make bad cloud decisions like running EKS on SNOW

Episode 318

318: One Extension to Rule Them All (And in the VS Code Bind Them)

Episode 242

242: DoH: DNS over HTTPS – or One More Way For It To be DNS Fault