315: EC2's New Shutdown Shortcut: Because Sometimes You Just Need to Pull the Plug

[00:00:00] Speaker A: Foreign. Welcome to the Cloud pod, where the forecast is always cloudy. We talk weekly about all things aws, GCP and Azure. [00:00:14] Speaker B: We are your hosts, Justin, Jonathan, Ryan and Matthew. [00:00:18] Speaker A: Episode 315 recorded for July 29, 2025. EC2's new shutdown shortcut. Because sometimes you just need to pull the plug. Good evening, Matt. How's it going? [00:00:29] Speaker B: Good, how are you doing? [00:00:31] Speaker A: Good, Good. We're probably doing better than Ryan, who's in his RV in the middle of the woods. Well, he might be doing better than us. [00:00:37] Speaker B: I think he's happy about that. [00:00:39] Speaker A: I think he's probably happy, yeah. There's probably beer involved and his kids are probably driving him crazy. So it's, you know, pluses and minuses. [00:00:45] Speaker B: But pros and cons of life. Yeah, mine just yells at me all the time, but, you know, it's all they do right now for me. [00:00:52] Speaker A: Yeah. All right, well, fair enough. Well, we've got a bunch of news. You guys had a lengthy episode last week, so thank you for doing that. While I was in lovely Bangalore, India, enjoying the lovely Indian food, which is just food there. Up first, hackers are exploiting SharePoint servers and a zero day targeting government agencies. This is a CVE2025 5370 with initial tax primary targeting government agencies, universities and energy companies, according to security researchers. Like why would you put your SharePoint SharePoint site on the web? The vulnerability affects on premise SharePoint installations only, not the cloud versions, with researchers identifying 9,000 to 10,000 vulnerable instances accessible from the Internet that require immediate patching or disconnection. Initial exploitation Exploitation appears limited and targeted sourcing advanced persistent threat actors likely backed by nation states. Though broader expectation by other threat actors is expected as attack methods become more public. Organizations running local SharePoint deployments face immediate risk as Microsoft has not yet released a complete patch requiring manual mitigation steps. They do have a patch out now, by the way. The incident highlights the ongoing security challenges of maintaining on premise infrastructure versus cloud services where patches and security updates are managed centrally by the provider. So it's a little weird that the cloud version was patched already or was not vulnerable to this. But yeah, they didn't have a patch right away. Sort of strange situation there. But in general, if you're still running SharePoint on prem, why my condolences to number one and then number two, cloud is definitely a much better option for running SharePoint. And then number three, anything but SharePoint is better. [00:02:28] Speaker B: Yeah, it's been years since I've run SharePoint and I did not want to do it originally, but we were a small business and it was decently free at the time, so we ended up doing it. And I hated my life. [00:02:40] Speaker A: Was it free because it came out of an MSDN package on your desk? Okay, so it wasn't really free. [00:02:46] Speaker B: No, there was some promotion through CW where if we bought a bunch of stuff through them or it was Dell, they paid for the license for us or something like that. It was something a long time ago. [00:02:57] Speaker A: So, dude, you got a dell and a SharePoint license? [00:03:00] Speaker B: Yeah, pretty much. [00:03:00] Speaker A: Pretty impressive. [00:03:01] Speaker B: Probably was about the time where that. Dude, you got Dell was a thing. [00:03:05] Speaker A: Yeah, I mean, MSDN CDs one thing. I've been around for a long time now. I remember getting the Notebook every year. I'm like, oh, I get to try out the new Windows Server on my little test box back in the day. [00:03:16] Speaker B: I think the reason is this was only vulnerable. And maybe I'm misremembering because I was reading about the day it was released, not two weeks later, where it was. It was older versions of SharePoint that were affected, not the newer one. I thought like, I thought like 2022 or something else like that wasn't affected. [00:03:35] Speaker A: I mean, it would tell me that you, you had figured out that there was an issue somehow that you had patched at some point, wouldn't you think? [00:03:43] Speaker B: Or it was like a section of old code they had deleted. If they ever delete anything. [00:03:48] Speaker A: Yeah, I don't know. SharePoint Server 2019 and SharePoint Server Enterprise 2016 are the two vulnerable versions. It appears, according to the CVE from Microsoft. Yeah, there's definitely a lot of noises out there in the community about this attack and nation state actors in general. I mean, one of the things is if you are an Office365 customer, you have SharePoint if you want it or not. So this is something to be aware of from a security perspective. Is that all of Microsoft stuff online, including teams. [00:04:21] Speaker B: I was gonna say teams is really. [00:04:23] Speaker A: A lot more built on top of. [00:04:24] Speaker B: SharePoint, a lot more than I ever understood until I was talking with one of my people and they were, they were telling me how their past life, they were a SharePoint admin and got into a teams admin because it's so intertwined. And then I slowly now see it, you see how they're linked and replicated in groups and I'm like, oh, God, oh no. [00:04:44] Speaker A: Or how when you have a teams room and you start attaching documents, then all of a sudden, oh, those documents are actually SharePoint links. Yeah, it's SharePoint all the way down on teams, so definitely it's their version. [00:04:57] Speaker B: Of S3 back everything by SharePoint. [00:05:00] Speaker A: Wow, that's a terrible connotation in my brain. I'll remember that for a while now. Thanks for that, Matt. I appreciate it. [00:05:08] Speaker B: Anytime. [00:05:09] Speaker A: Well, since there's only two of us, you suggested that we try the way that you and Ryan do it when I'm not here and we're going to alternate topics, which is sort of fun. So I'm going to do this for the first time. I don't think I've ever done it with you guys, so we're just going. [00:05:22] Speaker B: To keep you on your toes so we'll see what happens. Then you also get to see me fumble through it in real time versus once it's edited and when you listen to it. So you know a little bit different. [00:05:31] Speaker A: This might be the one time like okay, this is why I do it. [00:05:33] Speaker B: Yeah now in AI is How ML Makes Money the White House AI Action Plan A new chapter in the US AI Policy, the White House AI Action Plan outlined three pillars focusing on accelerated AI innovation through open source models, building secure AI infrastructure and leading industry, international AI diplomacy with balanced export controls and global technological distribution. Cloudflare has emphasized that these distribution edge computing networks are essential for AI inference, offering over 50 open source models through AI worker AI, enabling developers to build AI applications without relying on closed source or centralized infrastructure. The plan endorses AI powered cybersecurity for critical infrastructure with Cloudflare demonstrating how this can be done by blocking 247 billion daily cyber attacks using predictive AI and developing AI labyrinths which trap AI in mazes and puzzles to make you really spend all your money. Federal agencies are accelerating AI adoption, as is every company in this country and in the world, with chief AI officers across departments and Cloudflare Fedramp moderate authorization positions them to provide secure and scalable infrastructure for government AI initiatives with plans for Fedramp high certification. And they every way doing that program really hates everyone. The tension between promoting AI exports with allies while restricting compute and semiconductor exports to adversaries creates implementation challenges that could impact global AI deployments and innovation if export controls become too broadly or imprecise. [00:07:16] Speaker A: So I really just linked to this article because Cloudflare had their kind of take on it. But in general the AI policy has been updated quite a bit this year. It's sort of like I use AI every day now and I love it and it's great And I also know how bad it is in certain tasks. And so to think that great, they're using AI to like fix the tax code or to write legislation or sort of freaks me out a little bit. [00:07:49] Speaker B: It's both good and bad. [00:07:50] Speaker A: Yeah, it's like a double edged sword, right? [00:07:52] Speaker B: Unit test for policies. Come on, that could be interesting, right? [00:07:55] Speaker A: It could be. But I do like the idea that, you know, federal agencies are accelerating AI with getting chief AI officers. That's good federate, moderate, fedramp high will definitely be a thing for AI. But in general it's sort of. I'm just glad to see there's something happening because I think it is something the US government in particular needs to get good at to be successful. So appreciate Cloudflare's perspective on this. [00:08:18] Speaker B: It's interesting them talking about the export controls. I know originally there was a bunch of exports to China around was the Nvidia CPUs and other places. So obviously that's all scan to go against Iran and other countries that the US isn't friendly with. But it's going to be interesting to see if they eventually do it to other countries. And this is where like are you going to start impacting it or are countries going to start developing their own chipsets and everything else? Or are they going to say hey, this, you know, these models can only be used within the U.S. if somebody develops a really good model. So I'm kind of curious to see how some of that plays out over time too. [00:08:55] Speaker A: Well, you bring an interesting point about the export controls. So technically I can't export certain codecs or certain encryption standards and algorithms to third world countries like Iran for data sovereignty reasons and such, then America says we can't give it to it. But if I use an AI to create a new algorithm that understands the science and the math behind it and basically creates a new algorithm looks exactly like the one that would have been exported to them. Is that export? It's an interesting definition that you had to think through. [00:09:28] Speaker B: Yeah, because I mean because the AI. [00:09:30] Speaker A: Models are probably built on the same research and the same science, mathematical formulas. [00:09:35] Speaker B: Code base like literally redevelop me OpenSSL and find holes in it is going to be a pretty easy way or write unit tests for OpenSSL could provide an avatar, pretty easy access to find a bug and stuff. [00:09:51] Speaker A: Yeah. So I don't know. That's interesting perspective I think. Food for thought. Well, the government giveth with the AI policy and then they may be taking it away. In this next story, Trump's anti woke AI order could reshape how US tech companies train their models this article comes from TechCrunch. Trump's executive order bans woke AI from federal contracts, requiring AI models to be ideologically neutral and avoid DEI related content potentially affecting companies like like OpenAI, Anthropic and Google, who recently signed a $200 million defense contract. The order defines truth seeking AI as prioritizing historical accuracy and objectivity while ideologically neutrality, specifically excluding DI concepts, creating vague standards that could pressure AG companies to align model outputs, the administration rhetoric to secure federal funding. X's AI GROK appears best positioned under the new rules despite documented anti Semitic outputs, as it already on the GSA schedule for government procurement and Musk has positioned it as anti woke and less biased. It also requires Musk's opinion. Experts warn the order could lead to AI companies actively reworking training data sets to comply with political priority priorities, with Musk stating XAI plans to rewrite the entire corpus of human knowledge using GROX for reasoning capabilities. The technical challenge is that achieving truly neutral AI is impossible since all language and data inherently contains bias, and determining what constitutes objective truth on politicized topics like climate science becomes a subjective judgment call. And yeah, like I don't like this at all. I mean, I didn't like the story we covered a couple weeks ago when we talked about how the GROK was, you know, checking to see what Musk had tweeted about it as some type of, you know, authority on certain topics, which I don't necessarily agree with. And this feels like a terrible choice as well. Like again like you know, who the victor of war is the typically the creator of history and history books. And so typically as you get distance from a international conflict or a war or things like that, you get more perspective and you start realizing that maybe the people who won the war are not the people who should have won it, or that there's actually more nuance to why things are. There's lots of things that make me concerned about this and so who's the judge of what's anti woke or less biased? And this is one of those opportunities where we don't like to be political here at the cloud pod. This isn't really what we like to do. But unfortunately this is one of those stories that is very much in that camp where we have to kind of talk a little bit about politics. But you know, I don't know that I want this in my AI. [00:12:29] Speaker B: I don't know that anybody should be, you know, stating how and what federal contract requirements should be like that at that level feels too broad. Like, forget what it is. And trying to bring us out of the politics of this individual thing and bring us to, like, the higher level, like, stating what can be used for the models feels very specific versus, you know, if you wanted to say something general like, you know, can't be trained on, you know, specific sections of history or something like that. [00:13:06] Speaker A: Like, that feels, I mean, it's like, right, Deep Seek. If you ask Deep Seek about Tiananmen Square, it's blocked. It doesn't tell you anything because it was trained specifically not to tell you anything about it. Right. [00:13:15] Speaker B: So if that was a requirement, okay, at least it's obvious than what it is. But saying what the sources can be, to me, kind of defeats a lot of the purpose of the models, which is here's a massive summation of data and here is the information out of it. So, like, once you start to, like, now you're taking, okay, not just the information it's trained on, but the, you know, balance of what in, like, what it's prioritizing. And that, to me is the part I dislike even more. [00:13:45] Speaker A: I mean, I, I wanted to just give me data and give me the sources of the data and then I can make my opinion about the data. Right. And like, but it's a thing I see with Grok all the time on Twitter where people, you know, there'll be like a news article or there'll be some tweet and like, some people at Grok, like, what do you think about this Grok? I'm like, I don't care what Grok thinks because it's not a real person doesn't have an opinion. It's just going to take its sentiment analysis and it's going to figure out based on what it thinks and its biases and all that, what its opinion of the topic is. So especially right now, there's a lot of commentary about Jeffrey Epstein. So there's a lot of people saying, asking questions about that kind of stuff. And so it's giving its opinions about the thing. But again, I don't know that AI should give opinions or have opinions or beyond. Here's the data, maybe here's the perspectives. Like, hey, here's what, you know, one political side of the spectrum says, here's what the other political side says. You know, make your own judgment call. But this is what we know and here's what they say, and here's what we. Here's what we know is fact, and here's why we source that fact. And if you disagree with this as a fact because you don't believe that that particular piece of media or science, then fine, but that's your choice to do so. I don't want it force fed on me by the AI. [00:14:56] Speaker B: It's almost like going back to Wikipedia. It's like, cool, anything can be in there, but you at least have sources for a chunk of them. I'm not going to say everything, but a good section of them. And from there you can look at how the sources were identified and kind of track that back to decide if you think the source was valid or not. And here it's just like, don't look at these sources. You're taking out too much data to have it just be a data engine at that point. Onto NASA. Another interesting conundrum of the administration, but we'll bypass that. NASA's AI satellite just made a decision without humans in 90 seconds. NASA's Dynamic Targeting System enables satellites to autonomously detect satellite clouds and determine whether to capture the image or not. They were able to do it in about 60 to 90 seconds. The purpose of this was to eliminate the need for ground control intervention and reducing wasted bandwidth and unusable imagery due to cloudiness. The technology runs on Cogstat 6, a briefcase sized cubesat equipped with an AI processor demonstrating that edge computing can now handle complex image analysis and decision making in orbit at 17,000 mph. The future applications include real time detection of wildfire fires, volcano eruptions, severe weather systems with plans for the federal autonomous measurement where multiple satellites collaborate by sharing target data across a constellation. This shift represents a shift towards AI and satellite operations, reducing the dependency on ground based processing, enabling faster response time for Earth's observation and enhance benefits with disaster recovery and climate monitoring application. I just think that this was a pretty cool article, which is why I threw it in here because it's kind of showing these real life edge cases of not just edge computing, but now leveraging AI and ML models on the edge to actually solve real world problems. And not, you know, Justin and I coding or Justin making the bot that we've talked about in the past, or Matt making the hacky thing that he did at his day job, but something that's actually going to save a good chunk of money in processing power. Sending information to and from satellites is not either cheap and if you can do that compute in pretty low power consumption, which satellites you have to deal with power consumption, it's going to save a lot of time and effort. So to me this was just a nice story, the flip side of everything that's going on, of how AI is actually solving real world problems and making things be better for people. [00:17:47] Speaker A: Yeah, yeah, it's definitely a good use case of AI. Like, yeah, we're setting a satellite up to take photos of the Earth. If there's clouds in the way, it doesn't really help us. And so don't take the photo and don't waste the bandwidth sending us the photo of the cloud. And we don't have to have a human operator do that. So that's a great use case. I was actually impressed that there was a satellite up in space that already had AI chips in it and GPUs. And so I was just looking up when Cognizat 6 was launched. It was launched in March of last year. So it's been up there for just a little bit over a year. And so they've probably been testing this for quite a while before they allowed it to just do it on its own. But these are type of decisions that do make sense. And then also this is where Skynet started. So started out as a weather satellite, then it got it then, you know, got its mind of its own and all went down from there. [00:18:39] Speaker B: Yeah, I mean NPUs, which is what this is all based on. I thought it been around for a decent amount of time, at least the intel ones. So yeah, 2024. [00:18:51] Speaker A: Oh yeah, this, this launched in 2024, so yeah, yeah, it shipped in March. So they possibly had access to early, early technology. Yeah. And he's from NASA's Jet Propulsion Lab was involved. So. Yeah, it's a CubeSat as well. So it's pretty small. [00:19:06] Speaker B: Yeah. So getting it up and operational, I. [00:19:09] Speaker A: Mean, they sent out a Vanderburg, so it's out of the San Diego area basically. But yeah, that's really cool. I'm sure there's a lot of really interesting use cases with edge AI in space, even more so than edge compute in general on Earth, which is where a lot of the edge compute use cases are. But going into space and then being able to do that and using AI to make decisions, especially things around like orbital telemetry. And does it need to do a burn, you know, to stay in orbit or to adjust its orbit path? Or does it see, you know, it's, it's heading towards space junk in some kind of way and needs to make, you know, those are things that AI can do much faster than humans can. And you know, the math involved is so complicated in space, you know, that it's very helpful. [00:19:52] Speaker B: I was thinking more even like Mars and other planetary stuff, it takes 90, I don't remember how long to get to Mars. You know, if they can start to like the. What was the little helicopter they had on Mars? [00:20:06] Speaker A: Part of the rover? [00:20:06] Speaker B: Yeah, they had the rover and then they actually had a helicopter with the last Mars robot there. But if they can start to leverage that, they can do a lot more and flying out of their planets now they've proven as possible they can even, you know, throw something like this in and you can make real time decisions where it was doing very basic ones before. Because obviously you can't control a helicopter from Earth to Mars, but it can probably do a much better use case on a lot of these things now. [00:20:34] Speaker A: Yeah, I just look at the time on radio signals. So three minutes to 22 minutes depending on orbital rotation of Mars to Earth, etc. So for a round trip, you're talking about six, six minutes to 44 minutes. So yeah, being able to use AI to make decisions, you know, that need to be done much faster, that is probably very, very good. [00:20:50] Speaker B: Ingenuity. Was the little helicopter. [00:20:53] Speaker A: Yes, ingenuity. But it was, it was attached to the rover in some way or came with the rover, wasn't it flew there. [00:20:58] Speaker B: With the rover in like the undercarriage or something. But it was its own thing. It used like the, it used the ro, the robot to actually like transmit data back because it only weighed like 4 pounds here, according to this. [00:21:12] Speaker A: Yeah. [00:21:13] Speaker B: Okay. So. [00:21:16] Speaker A: Well, GitHub is announcing their AI powered tool that lets developers create micro apps using natural language descriptions without writing or deploying code. Featuring a managed runtime with data storage, theming and an LLM integration. This is GitHub Spark. The platform uses a NL based natural language based editor with interactive previews, revision variants, automatic history tracking and model selection from Cloud. Sonnet 3.5 GPT4O, O Preview and O mini apps are automatically deployed as PWAs associated from desktop and mobile devices with built in persistent key value storage and GitHub models integrated for AI features. This solves the problems of developers having ideas for personal tools but finding them too time consuming to build. Nearing rapid creation of single purpose apps tailored to specific workflows. Collaboration features allows sharing sparks with read only or read write permissions and users can remix other apps to customize them further creating a potential ecosystem of personalized micro applications all built on top of AI. [00:22:12] Speaker B: Nice to see them show up to the party. [00:22:14] Speaker A: Yeah, I mean it's an interesting use case. The idea of creating a bunch of these small little building blocks and you can stitch them together into these tool chains. It's a very interesting approach and I'm glad to see them doing something because they definitely feel like they're a little bit behind in the AI app space. [00:22:32] Speaker B: Yeah, I mean they were so far ahead with the Copilot and everything else when they first launched it. I feel like they kind of got stuck there and now, you know, hopefully they can pick back up some speed with some of these other things onto the world of AWS and hackers trying to take over the world. Hackers plant computer Wiping commands in Amazon AI Code Coding Agent A hacker compromised Amazon QAI Coding Assistant by submitting malicious pull requests to the GitHub repo injecting commands that could wipe users computers and delete file systems cloud resources. The breach occurred when Amazon included an unauthorized update in a public release of Q exception. Though the actual risk they say is pretty low, the incident highlights the emerging security risk of AI powered development tools as hackers increasingly target these to steal data, gain unauthorized access and demonstrate vulnerabilities. The ease of the compromise through a simple pull request raises questions about code based process security controls for AI coding assistance that have direct file system access. Organizations using AI tools need to reassess their security posture, as you should be doing either way, particularly around code review workflows and permitting and granting AI assistance access development environments. So I agree with the fundamentals here where you should be reviewing these tools on a regular basis. The part of hey, a pull request could cause a problem. It's the same thing that happens either way. There was the XC XC vulnerability. What was the one that was around the compression algorithm that they were able to slip like an SSH in that Microsoft Research analyst found that was going to be. [00:24:15] Speaker A: Yeah, what was that called? I know what you're talking about. [00:24:17] Speaker B: Yeah, but like that was done through a pull request in a long term attack. This is if you're not doing proper peer review for pull requests, which I understand is tedious and painful, but if you're not doing it you're always going to be susceptible to these things and hopefully you have enough checks and balances out there with your SCA stas, whatever the letters are, your static code analysis and your. You know that you're able to have that checks and balance of a human reads it does the analysis and hopefully everything slowly comes to light. [00:24:54] Speaker A: Yeah, I mean AI by itself can be dangerous. I mean I mess up my infrastructure using it, so it's possible you can definitely do things in a way that is not safe. And so you need to think through proper code review and you think through linting. You need to think about doing validations before you make changes. All the common things we've always talked about as part of software development. And so it's bad that this was a situation where you were asking a perfectly legitimate benign command and it then tried to delete your computer, which is not great. But still this is something that you have to take into consideration. It's a risk and it's something that'll always be there with any computing software. And either you have an intern who doesn't know what he's doing, who deletes production database because he didn't have good controls, or you have an AI bot that went rogue. So both scenarios are very likely and very possible and I think are very legitimate concerns that you should always be concerned about and thinking through. Cross Optimization Hub is now supporting account names and their optimization opportunities. This account names are alongside the optimization recommendations, replacing the need to cross reference account IDs when reviewing cost savings opportunities across multiple AWS accounts. The update addresses a key pain point for enterprises and AWS partners managing dozens or hundreds of accounts by enabling faster identification of which teams or projects own and specific cost optimization opportunities. And the feature integrates with existing cost optimization hub filtering and consolidation capabilities, allowing you to group recommendations by account name and prioritize actions based on business units or departments available in all regions where cost observation hubs is supported at no additional cost. This enhancement reduces the administrative overhead of translating account IDs to meaningful business context. And yeah, thank you. Goodness, yeah. [00:26:46] Speaker B: I mean I use account names in so many ways and knowing that I count 1, 2, 3, 4, 5, 6 is the dev account to, you know, account number 4, 5, 6, 7. That's all associated with business user X isn't plausible. So but if you have a good naming convention of stuff, you kind of group it up that way either way. So this is one of those like this didn't feel like a big lift. I don't know why it took so long, but I'm sure there were some internal like cross team API that they had to build out and get set up to get the names from the organization team and whatnot. [00:27:21] Speaker A: I mean I think I had an elasticsearch lookup table for this at one point or like a simple then I think I moved it to Dynamo at one point where it was just like here's the account name and here's the thing and then in the report you would just do a quick lookup. But yeah, thank you for finally just doing what you should have done all along or referencing the Alias would also been lovely. But yeah, I can't believe that one took so long. I think pretty sure I asked for that feature at one point as a feature request enhancement. [00:27:50] Speaker B: On features I didn't know that I wanted but I really wanted now. Amazon EC2 now supports skipping up operating system shutdown when stopping or terminating EC2 instances. EC2 instances are now allowed to skip graceful OS shutdown when stopping or terminating instances, enabling faster instance state transition for scenarios where data preservation isn't critical. Scale sets auto scaling this feature targets highly available architectures where instance data is replicated somewhere else, allowing for failover operations to complete more quickly and bypassing normal shutdown sequences. Customers can enable this through the AWS, CLI or EC2 console, giving them control over trade offs between data integrity and speed of instance termination. This represents a shift in the EC2 approach to instance lifecycle management, acknowledging that not all workloads requires the same shutdown guarantees and allowing customers to optimize for their own specific reliability patterns. The other piece of this that they don't really touch on as much is the cost savings from this. If you do have auto scaling set up and you're on Linux, you're already paying for by minute or by second. [00:29:04] Speaker A: Second, second. [00:29:05] Speaker B: I think one of the two. So if you're able to get shave off a couple seconds and you're turning off and on. If you have a, you know, a Monday through Friday 9 to 5 app that you know, extra minute or two that takes windows or Linux to shut down, you don't care about who cares just let it die quickly. [00:29:25] Speaker A: I mean I, I know there's been many times where I've like trying to do a, like a service refresh right where you just want to replace servers and you're like waiting patiently and you're like oh, you shut down. It's got to Dr. The stuff like et cetera. So I guess it's nice for that. And there are certain times, maybe when the operating system has actually crashed, where you just need it to die. I thought they had something like this before. Ish. But I guess not. But I mean, cool. I'm glad this exists if you need it and the normal process and maybe it saves you some money at scale like you know, thousands of nodes scaling up and down every hour. You know, those seconds add up, I suppose, but I don't think it's going to help the cloud server. [00:30:13] Speaker B: No, I don't think it's going to help us too much. [00:30:16] Speaker A: No, probably not. [00:30:17] Speaker B: I was actually thinking on Windows to like shut down and it goes hey, we're going to go patch this. And I'm like no, no, no. I'm replacing you automatically with a new. [00:30:27] Speaker A: Server, a new already patched box. [00:30:29] Speaker B: I don't want you to shut down. But that's also because my mind is in Windows world a lot more than I care to admit. [00:30:37] Speaker A: That makes sense. Just one of those features like oh yeah, I guess that makes sense. Amazon SQS is introducing fair queues to automatically mitigate noisy neighbor problems in multi tenant systems by detecting when one tenant consumes disproportionate resources and prioritizes messages from other tenants. This eliminates the need for custom solutions or over provisioning while maintaining overall queue throughput. The feature works transparently to add a message group ID to messages, no consumer code changes required and no impact on API latency or throughput. Limits SQS monitors in flight messages distribution and adjusts delivery order when it detects imbalances. New CloudWatch metrics specifically track noisy versus quiet groups, including approximate number of noisy groups and metrics within quiet group. Suffix to monitor non noisy tenant performance separately. CloudWatch contributor insights can identify specific problematic tenants among thousands. This addresses a common pain point in SaaS and multi tenant architectures where one customer's traffic spike or slow processing creates backlogs that impact all other tenants. Message dwell time Fair queues maintain low latency for well behaved tenants even during those scenarios. The feature is available now in all standard SQS queues at no additional cost. Just add message group ID to enable fairness behavior and AWS provides a sample application on GitHub to test the behavior with varying message volumes. So I guess this explains why they were so busy trying to get first in, first out on SQS queues for so long and then now they're like oh wait, that's not actually a problem. The problem is noisy neighbors. So if we'll give you this solution. [00:32:06] Speaker B: Too, we solved the wrong problem. I mean, I guess maybe I haven't been in a SaaS where this has become that big of a problem for me. I feel like for most customers that I've helped design, I just built an SQS queue per customer, so I don't really know how that would affect truly. [00:32:26] Speaker A: Multi tenant database and app tier. I could see that becoming a problem. [00:32:31] Speaker B: But yes, I like data segregation and. [00:32:35] Speaker A: All Data segregation is good and helpful in any ways. I mean, I'm glad to have it. I mean, I'm not going to complain about this feature, but it does feel like apparently there are still new tricks that SQs can learn. [00:32:47] Speaker B: It's like the new tricks to S3, they come out with them weekly, dumbfounded. [00:32:53] Speaker A: Every time they announced Vertex last week I missed that story, but I'm sure you guys talked about it. [00:32:59] Speaker B: I think that was our name of. We was like S3 is a new thing or something like that. [00:33:05] Speaker A: But I mean the Vertex stuff, I wouldn't ever even thought about that. I'm like, that's a great feature for Vertex and it's metadata at the end of the day. Which is now sort of explains why they probably built the S3 metadata service as well. But yeah, it's always fun to watch. [00:33:22] Speaker B: Them tier up services. [00:33:24] Speaker A: Yeah, you're like, oh, this makes sense now. The building block. Because the metadata service on its own, I'm like, okay, this is a cool feature. I can see why this is important. And you're like, but the use cases are not that great. Not that many people have this problem. And now you're like, oh yeah, okay, Vertex endpoints. That makes a lot more sense. There are a lot of cloud cost management tools out there, but only Archera provides cloud commitment insurance. It sounds fancy, but it's really simple. Archera gives you the cost savings of a one or three year AWS savings plan with a commitment as short as 30 days. [00:33:58] Speaker B: If you don't use all the cloud resources you've committed to, they will literally put the money back in your bank. [00:34:03] Speaker A: Account to cover the difference. Other cost management tools may say they. [00:34:06] Speaker B: Offer commitment insurance, but remember to ask. [00:34:09] Speaker A: Will you actually give me my money back? Achero will click the link in the show notes to check them out on the AWS Marketplace. [00:34:22] Speaker B: Launching Amazon CloudWatch generative AI observability in preview CloudWatch now offers purpose built monitoring for generative AI with automatic instrumentation via aws Distro for Opentelemetry ADOT capturing telemetry from LLMs, agents, knowledge bases and tools without code changes. Working with open frameworks like Strand Agents, Lang Graph and Crewai. I'm not sure I know what any of those three are. I'm just saying these services provide end to end tracing across AI components whether running on Amazon Bedrock, Agent Core, eks, ECS or on prem. With dedicated dashboards showing model invocations, token uses, error rates and agent performance in a single dashboard. Integrations with existing CloudWatch features such as application signals, logs, alarms, enable correlation between AI application behaviors and underlying infrastructure metrics Helping identify bottleneck inches, troubleshoot issues across the stack. This is definitely one of those things that I can see as useful. I've kind of done pieces of this, you know, monitoring token usage and, you know, monitoring all the kind of different pieces of my component, but getting that true end to end observability, which is always like that gold standard that everyone strives for, is something I haven't been able to fully do in the AI stack that I've dealt with. So, like, it's kind of a nice feature. They kind of, you know, prepackaged it for you. [00:35:54] Speaker A: So yeah, it's definitely interesting. When I first saw this, I was thinking maybe they were doing something for actual model training. But no, this is actually more. It's for usage, more usage based for inference. Other than I had to use Bedrock other than that little minor inconvenience, I'm sort of intrigued with this. I actually might try to set this up. Not because, like, Bolt doesn't really need end to end visibility of its AI use cases, but I've already been toying with the idea of moving off of Claude's API and just moving to Bedrock's version of the API that we don't have to go, I don't have to use my NAT gateway, not that it's a lot of traffic, but it's just one of those, like, it's an optimization I can make and I could divide up the billing between my personal API keys and my AWS ones. So it'd be kind of cool just to see what this looks like. Because I'm intrigued. A couple of joints. So maybe I'll, maybe I'll take this action soon. I don't know. We'll see. Although the visuals aren't very impressive. So you know, it's like normal tracing visuals. Like, oh yeah, it's a really simple block graph. [00:36:55] Speaker B: Like, okay, it's one of those things that's not useful until you're in the middle of an outage and everyone's complaining something. [00:37:01] Speaker A: It's really cool. [00:37:01] Speaker B: And then you're like, oh, I could see exactly at this point where the world is on fire. And this is what caused it. Oh, we ran out of tokens because we're using one thing for everything. [00:37:10] Speaker A: I mean, honestly, if, if Bolt was down, would anyone care but me and you? You did discover the hidden feature that you can at Bolt in our Slack channel and just talk to it and it'll respond to you. [00:37:22] Speaker B: Yeah, yeah, I was just giving a compliment. [00:37:24] Speaker A: You just stumbled across it. It was like, thanks. [00:37:27] Speaker B: No I like the feature that I didn't realize was doing multiple links in one. Surprisingly useful. [00:37:33] Speaker A: Oh yeah, yeah, definitely a good one. All right, let's move on to GCP. I am officially very old because GKE has celebrating 10 years with a new ebook highlighting customer success stories including Signify scaling from 200 million to 3.5 billion day transactions and Niantic's Pokemon Go launch that stress tested GK's capabilities at unprecedented scales. The ebook emphasizes GK's evolution from container orchestration to AI workload management. Mm. With GKE autopilot now fully automated, optimization for AI deployments, reduce infrastructure overhead and improve cost efficiency. Google's positioning GKE is the foundation of AI native applications, leveraging it as decades of Kubernetes expertise and 1 million open source contributions to support complex AI training and inference workloads. That is the most hey, we have GKE thing ever. Like if it's, you know, all I got is GKE and BigQuery, how do I make AI work? [00:38:26] Speaker B: GKE. [00:38:27] Speaker A: That's the solution. The key to Trader is GKE's integration of Google's AI ecosystem and infrastructure, allowing customers to focus on model development rather than cluster management. Of course. Well, anyways, Happy birthday to GKE. 10 years. Which means I'm too darn old. But congratulations. And it's a new ebook to hear how everyone else has been abusing the crap out of Kubernetes for a long time. [00:38:51] Speaker B: Been crashing Kubernetes for a long time. [00:38:54] Speaker A: They probably crashed at first. Then how they made it work like any good tool. [00:38:58] Speaker B: No, I mean like you said, holy crap, I'm old and I slowly watch all these services. When they were like S3 hit 15, I was like oh no. And now it's like GKE hit 10 and I'm like oh God. So it's amazing how far the clouds have evolved in such a short timeline. [00:39:17] Speaker A: I mean, it doesn't feel that short, but it really is. And when you zoom out on the time horizon, you're like, wow, it's only been 10 years. It feels like clouds around forever now in my career. [00:39:26] Speaker B: So to be fair, I can't tell you day to day if it's been a day or a week right now in my life. So you know, I have different SKUs in my life of issues. [00:39:36] Speaker A: I downloaded the ebook just over there while you were talking and I chuckle because slide three Kubernetes contributions by major cloud providers Google Cloud over 1.2 million lines of code. Then Microsoft in a far second and just over 225,000. And then Amazon, it doesn't even like barely hit the scale at all. It's like maybe 10,000. It's crazy low. It makes me laugh. But yeah, this is interesting set of use cases and they actually have a timeline from 2015 all the way to 2025 of major features they released in GKE. So that's fun. Yeah, it's short, it's only 15 pages. So yeah, ebook is definitely the definition I would use. But so they for 10 years they are announcing a gross glossy marketing pamphlet. Celebrate 10 years. That's what it is. [00:40:32] Speaker B: Some of their marketing is like what do we do to celebrate this? We bash AWS and show how much we've done. [00:40:40] Speaker A: Yeah. And then we'll talk about all these other customers who have tremendous success and then here's all the things that we did to make it work. Yeah, perfect. [00:40:46] Speaker B: I do like the timelines because it's just kind of fun like seeing how they slowly, you know, have grown over the years like the First Kubecon in 2015. And yeah, they just kept growing Dynamic Workload Scheduler Calendar mode for reservations in of GPUs and TPUs the Google Dynamic Workload Scheduler Calendar mode, it's really difficult to say enables short term GPU and TPU reservations for up to 90 days without long term commitments. Addressing the challenge of bursty ML workloads that need flexible capacity planning. This feature works like a hotel users specify a resource type, instance count, start date and duration to instantly see and reserve available capacity which can then be consumed Through Compute Engine, GKE, Vertex AI Custom Training and Google batch. This positions Google against AWS EC2 capacity reservations and Azure capacity reservations. I don't know how much it does. Maybe there's a feature of capacity reservations I don't know about. By offering a more user friendly interface and short term flexibility. Specifying The Optimized for ML workloads currently in preview for GPUs and GPUs does require contacting your account team. The integration with Google AI Hyper Compute Hyper Computer ecosystem and extend existing compute engine future reservations capacity for co located accelerator capacity I only understand about half the features of the Google of that last sentence I read but I really do appreciate this because so much of like if you're writing your own, you know, rag models or anything on top of that is you're only doing it for a period of time. If you're building your own model from scratch, then you probably have a longer term commitment with Microsoft with Microsoft or Google or AWS or wherever you're running these to handle that because you're planning to be around for a long time and really scale that. So I think this really makes sense for the average business that is looking to do just either rag model or something really basic where they can just grab that capacity for a short period of time, do what they need, but make sure they have it so they can hit their timelines. [00:43:08] Speaker A: I'm mostly disappointed there's not a calendar view like the screenshots they showed. Like okay, I can see how I create it. I see the reservation period I'm asking for and then at the end there's a list of all your reservations. Just a list. It's not even a calendar. Come on, come on Google, get this together. But yeah, I mean in general this is a great feature. I'm really happy about this one. As you know. Again, if you're trying to focus on low cost target windows, you're like hey spot instances are cheaper between 2 and 5. Like cool. I want to run this job during 2 and 5 so I pay less money for TPUs. It's good. Again, I also agree with you. I don't know how different it is from capacity reservations and Azure's capacity stuff, but it's nice to have. I'll take it I thought. [00:43:52] Speaker B: I know with the Bedrock and on AWS they released some of the more like block capacities for like four hour chunks and stuff like that, which I thought was kind of cool. [00:44:01] Speaker A: Yeah they've done quite a bit there to help try to get stuff working through that. All right. BigQuery or Google's introducing first party BigQuery tools for AI agents through the Agent Development Kit and the Model Context Protocol, or mcps, eliminating the need for developers to build custom integrations for authentication, error handling and query execution. The tool set includes five core functions including list data set IDs, get data set info, list table IDs, get table info, and execute SQL providing agents with secure access to BigQuery metadata and query capabilities without custom code maintenance. There's two deployment options available, ADK's built in toolset for direct integration or the MCP toolbox for databases, which centralizes tool management across multiple agents, reducing maintenance overhead when updating tool logic or authentication methods. This positions Google competitively against AWS Bedrock and Azure OpenAI service by offering native data warehouse integration for enterprise AI agents, particularly valuable for organizations already invested in BigQuery for analytical workloads. And the solution addresses enterprise concerns about secure data across access for AI agents while supporting natural language business queries like what are top selling products? [00:45:08] Speaker B: I mean anything with BigQuery and making it be easier to use I feel like makes my life easier. [00:45:15] Speaker A: I just keep hoping that Athena starts copying all these features someday. [00:45:20] Speaker B: I feel like we haven't had a good Athena update in a while. [00:45:23] Speaker A: We haven't. I haven't seen much on the Athena space. I mean everything's been so Sagemaker focused. [00:45:28] Speaker B: On Amazon World, but there's definitely a lot more that they can do there so it'll be interesting to see if they eventually kind of catch up. [00:45:37] Speaker A: Will be definitely interesting Global Endpoints for. [00:45:40] Speaker B: Cloud Model Generally available on Vertex AI Google Cloud now offers a Global endpoint for the anthropic cloud model on Vertex AI that dynamically routes requests to any region with available capacity, improving uptime, reducing region capacity issues. For Claude Opus 4 Sonnet 4 Sonnet 3.7 Sonnet 3.5 the global endpoint maintains the same pay as you go pricing as regional models and is fully support prompt caching, automatically routing requests to a region holding the cache for optimal latency while failing back to other regions if needed. This is a great feature, but you have to be very careful with any data sovereignty laws that you have, because if it's going to start routing to different locations, I know Azure does this by default. It definitely can cause a problem where you have to keep your capacity and keep your compute and everything in that same region. [00:46:42] Speaker A: And you'd be worried about that because of the caching that would be occurring. [00:46:46] Speaker B: No. In Germany, for example, you know, it depends on what your contracts say. But I definitely have seen contracts where it says all compute and everything has to live in Germany or in the eu. If all of a sudden your request is getting routed from Germany to the us, you're going to have data sovereignty issues. So it's more like cross country or cross, you know, non EU to EU type stuff. And the German Work Council does not appreciate that. Those types of things. [00:47:15] Speaker A: Yeah, I mean I think this would be more valuable potentially for your vibe coding project or 100% like hey, I'm working on stuff and I don't care where the development works, I'm just getting code completions or I'm using it to generate code that isn't customer data. So I get your point on the risk, but I think this is really not for that use case. But I can see people being confused. So you're right. It's good to know it's the thing. [00:47:41] Speaker B: To keep in your back pocket. I would definitely use it in many Places and I could foresee them especially with all the, you know, the eu. Was it like Google Europe or Google like GCP EU that they're working on for all the data severity thing that they end up with a new. I don't want to call it sku but location variable that says, you know, global to EU or whatever to kind of help this in the future because you can normally route within those types of time frames. So like in Azure because I know those better. Like it's in East 2 but they don't even have capacity in Central. But it's also in East I think so I don't care as long as it stays in the US Would be a nice one because there's enough regions there that it could fail over. But it doesn't look like this has that next level yet. But I'd say for most things like it's talking about like chatbots, content delivery, things like that like that's most likely fine. [00:48:33] Speaker A: Yeah, definitely. If you look in the actual article there's a table and says use case highly availability applications without data residency needs. [00:48:41] Speaker B: Yeah. [00:48:43] Speaker A: It'S interesting. It has both an independent global quota as well as your regional endpoints have regional quotas. So I wonder if you could burn your regional quota and then just move it to the global and then still get it served by the same region. It's kind of interesting. [00:49:00] Speaker B: Have different class calls that connect to different places. If you get 429 errors, route over here. [00:49:07] Speaker A: Well, if you use cloud code with Vertex or Bedrock, which you can do both, you can actually say based on the model which region you want it to go to. So you actually have that flexibility when you use cloud code with Vertex or Bedrock, which is kind of crazy. And then they already sort of had. So it's interesting too because in the vertex one at least the documentation says you're supposed to use US East 5. US East 5 doesn't exist. That is not a public Google cloud region. So I don't know if that's a specific AI region that Google built specifically for Claude and other LLMs or what. But you could also specify the regions that you do know like US East 2 and US East 1, et cetera. [00:49:52] Speaker B: It looks like that's in columbus according to northlink.com Nice. [00:49:56] Speaker A: I mean that's definitely. But again it's not a public zone for anybody but Claude, so it's sort of interesting. [00:50:03] Speaker B: I wonder if it's like a new zone, new region they're building. [00:50:07] Speaker A: Could be. [00:50:07] Speaker B: And like they're just, you know, they're using it right now for capacity until they can get it in other places. [00:50:12] Speaker A: Or Anthropic said, I need this much compute. And Google goes, oh, we'll build you your own region. We'll build you your own. You're the Netflix of LLMs. All right, and then our final Google story. NotebookLM is introducing video overviews that generate narrated slide presentations with AI created visuals, pulling diagrams and data from uploaded documents to explain complex concepts. Particularly useful for technical documentation and data visualization in cloud environments. The Studio panel redesign allows users to create multiple outputs for the same type of. Same type per notebook. Alien teams generate role specific audio and video overviews from shared documentation, a practice practical feature for cloud teams managing technical knowledge bases. Video overviews support customization through natural language prompts, allowing users to specify expertise levels and focus areas, which could streamline onboarding and knowledge transfer for cloud engineering teams. The multitasking capabilities lets users consume different content formats simultaneously within the Studio panel, potentially improving productivity for developers reviewing technical documentation while working. Currently available in English only, with multi language support coming very soon. Positioning NotebookLM as a knowledgeable management tool that could complement existing cloud documentation and training workflows. The little video that they actually put into the article to show this off, it was pretty neat. It definitely shows that you generated the world of surrealism, PowerPoint presentation basically, and video that they walked through. And this goes nicely with the generated podcast that you can only get with LM Studio, meaning that, you know, everyone who is rushing off to replace us with a podcast thing can now replace us with a video of dynamically generated PowerPoint slides and then be put right to sleep. So. Or you just listen to us, you choose. [00:51:47] Speaker B: So we're gonna get. So what's gonna happen is we're gonna use. [00:51:50] Speaker A: Well, we were being replaced, but now we're no longer being replaced because now, now they're replacing YouTubers. So that's, that's how we see it. We're back in Matt. [00:51:59] Speaker B: Until they replace us again with the video that they generate from the, from the news stories that we already talked about. [00:52:05] Speaker A: Yeah, exactly. [00:52:06] Speaker B: We'll get there. We'll get replaced finally. Don't worry, we've already tried. We have AI Bot. That's really what Bolt is. I think it's just Jonathan really behind the scenes is what I just said. [00:52:16] Speaker A: I mean, it moves a lot faster than Jonathan does. [00:52:22] Speaker B: On to Azure News. [00:52:24] Speaker A: Oh, yay. [00:52:25] Speaker B: Don't sound so excited. God. Project Flash updates Advanced Azure Virtual Machine Availability monitoring. Project Flash now includes a user and platform dimension in the VM availability metrics, allowing customers to distinguish whether the issue was caused by Azure infrastructure issues, most likely yes or user initiated issues, probably pretty high. This addresses key pain points for enterprises who need to precise attribution for servicing interruptions. The new event grid integration with Azure monitoring enables near real time notifications with SMS email push notifications when a VM availability changes, providing faster incident response compared to traditional monitoring approaches. Flash publishes detail, VM availability state and resource health annotations that help with the root cause analysis, including information about degraded nodes, service healing events and hardware issues, giving operators terms action and data for troubleshooting. Future enhancements include expanding monitoring to include top of the rack switch failures, accelerated network failures, and predictive hardware failure detection. I think that a lot of these things are pretty cool, you know, getting down to that level, but I also feel like this is a lot more for stateless systems and I try very hard to not have stateless VMs as much as I can in my life, even though they do exist. And honestly for massive enterprises running full stateful instances, I guess you're running your SAP hana somewhere on, you know, Azure. Like yeah, these things would definitely are useful, but in a world where I try to not care about my servers as much as humanly possible and I try to just do the old school delete and replace, it's a really cool feature. I hope to never have to deal with it in my day job. [00:54:16] Speaker A: Yeah, I mean this has been an area that they've been focusing on for quite a while. We talked about Tardigrade back in the day. Azure's very focused on server failure, mostly because Windows fails often and quite often. So I mean I appreciate this. But again, it does, it does sort of feel strange. Like why do I care this much if I have proper cloud architecture, proper scale sets? I don't know that this is as important as all the amount of blog posts they've written about this topic require. [00:54:47] Speaker B: What I would like to see is more around the actual Azure services and that's kind of where I wonder if this came out of. If this was more like they use it internally. [00:54:56] Speaker A: I think that's what it is. I mean Tardigrade was the same thing. It was like we're looking at the actual VMs that run underneath your load. And so I think this is kind of similar. Again, they're nerding out, which I always appreciate. A good nerd story, but definitely interesting. [00:55:13] Speaker B: Yeah, I would like to see this for like, hey, why did one of my servers in my load balancer fail because in Azure you have to specify the number of servers in your load balancer, you know. So I would like to see it more for that. I would almost feel like which I assume they're seeing under the hood. But to bubble some of those things up to us as the consumers so that you know that they know there's an issue. Not hey, the resource health has gone down because that's not at all frustrating. [00:55:39] Speaker A: Microsoft 365 copilot search is now generally available as a dedicated module within the Microsoft 365 copilot app, providing AI powered unified search across SharePoint, OneDrive, Outlook and over 150 external data sources through copilot connectors including Salesforce, ServiceNow, Workday, and SAP. The service uses AI to understand query context and deliver relevant documents, emails and meeting notes without requiring any setup. Users with eligible Microsoft 365 copilot licenses automatically see a search tab alongside chat and other copilot experiences across desktop, web and mobile platforms. This positions Microsoft against Google's enterprise search capabilities and AWS Kendra by leveraging existing Microsoft 365 infrastructure and licensing with no additional cost beyond the standard Microsoft 365 copilot license, which runs $30 per user per month. The key differentiator is the instant query predictions feature that Surfaces recently worked. Documents, colleague collaborations and documents for users are mentioned addressing the common enterprise pain point of information scattered across disconnected data silos. Target customers or enterprises already invested in Microsoft 365 who need to break down information barriers between Microsoft and third party systems, particularly those using multiple SaaS platforms that can now be searched through a single interface. So this is a Microsoft's answer to Q Business and to Agent Spaces. So that's nice. [00:56:58] Speaker B: Yeah, I mean it's, it's a nice feature. I'll play with it, you know, on my day job a little bit. It'll be interesting to see how useful it is. I've already found kind of that Copilot search. I'm like hey, tell me about our SLOs or whatever they are. And it pulls up everything from chat conversations to SharePoint. Like I found that useful, so it'll be nice to see that powered across other external sources. [00:57:20] Speaker A: Also I still expected them to lower the price of this though. I mean 30 bucks a user is still pretty steep. [00:57:27] Speaker B: Yeah, it adds up quickly. So it's per month. So you know, times 12 times number of users. We are no longer at the assumption of it's $10 per person per month. It's slowly going up. Important Changes to App Service Managed Certificates so let's start out with this article was released on July 21, 2025 Azure App Service managed certificates must meet new industry wide multi multi prospective insurer collaborations mpic requirements by 07-28-2025 so somebody can do the math. [00:58:07] Speaker A: Here is a whole yesterday yes when they release this when the 21st. [00:58:12] Speaker B: Yeah, yeah. A whole seven days notice. [00:58:15] Speaker A: Rough. [00:58:16] Speaker B: Yeah. We'll break certificate renewals for apps that aren't public facing using Traffic Manager nested endpoints or rely on starter trafficmanager. NET domains. This change impacts organization using Azure managed certificates. Think AWS ACM with private endpoints, IP restrictions or client certificate requirements or authenticated gateways forcing them to purchase and manage their own SSL certs instead of using the free managed options. Microsoft provides a resource a Azure resource graph query to help identify the resources. But queries doesn't capture all ed cases such as requiring manual review of traffic manager configurations and custom access policies that might block digit serve validation. Now you know where they do their certificates under the hood. Unlike AWS Certificate Manager which supports private certificate authorities and internal resources, Azure managed Search will only work for public publicly accessible apps, potentially increasing operational overhead. And by potentially they mean definitely increasing operational overhead and costing enterprises with strict security requirements a six month grace period before existing certificates expire, giving organizations time to migrate. But those relying on free certificates for internal or restricted apps will need to move to a budget or budget for commercial sorry will need to budget for commercial SSL certs and manually renew process so this is one of the few places in the in Azure where they have these and now they're limiting the use limiting a lot the use case around it. You know of where you can actually get these ACM cert where you can get these search certificates for you it's painful. [01:00:00] Speaker A: That's really crappy. Why do they do this? [01:00:02] Speaker B: It sounds like from the first thing it's that multi perspective insurer. [01:00:06] Speaker A: Yeah I mean again like if it's external sure. I guess I don't know enough about multi specific insurance corroboration but like seven. [01:00:17] Speaker B: Days is also a pretty short timeline and I feel like somebody miss got you know completely missed this one it was saying with someone's email box and that went out on sabbatical or something because like this was a big mix miss in my opinion and you know providing people seven days to make this and even at least six months is enough time before they do it for existing it's a short window here. [01:00:43] Speaker A: Yeah so Digicert has a whole article on it. So basically when attackers manipulate Internet routing, they sometimes trick certificate authorities into issuing certificates for domains they don't actually control. That's what multi perspective issuance corroboration was designed to prevent. By verifying domain control from multiple points in the Internet, MPIC adds a crucial layer of defense against network level attacks like border gateway hijacking does that protection also changes changes how domain validation is performed, changes that every organization using digital certificates needs to understand implement plan for With MPIC already in effect and enforcement deadlines approaching, organizations need to understand how it works and how to prepare. So basically the internal certificates. So yeah, they need to be able to provide an internal private certificate authority and then they'd be fine. Or you just say look, I am okay to issue a non MPIC compliant certificate for my internal services. Would also be a way to solve this, would it not? [01:01:38] Speaker B: Yeah, there's not a lot of services that support the Azure managing your SSL cert for you, which is painful in and among itself. So you already have to manage that yourself. If you're using application gateways, for example load balancers, you already have to do it. So I would assume that this is a smaller subset of people that are affected by this, but maybe I'm wrong. [01:02:07] Speaker A: We'll see. Are you impacted by this? [01:02:10] Speaker B: We were not. We do use app services, but we don't use the managed one because we already have to pay for a certificate for our load balancer for application gateway. So we leverage that everywhere so it's the same date so we don't have different things to manage right now. [01:02:27] Speaker A: Makes sense. Azure Firewall now supports a draft and deploy feature in Preview that allows administrators to stage policy changes and temporary draft environment before applying them automatically to production. Addressing the challenges were even small changes previous took several minutes to deploy. The two phase model separates editing from deployment. Sorry, that's atomically not automatic. The two phase model separates editing from deployment. Users clone the activity policy into a draft, make multiple changes without affecting live traffic, collaborate with reviewers, and then validate and deploy all changes in a single operation that replaces the active policy. This feature targets enterprises with strict change management and governance requirements who need formal approval workflows for firewall policy updates, reducing configuration risks and minimizing the chance of accidentally blocking critical traffic or exposing workloads. The preview is currently limited to Azure Firewall policies only and does not support Classic Rules or Firewall Manager with deployments available through Azure Portal or CLI commands for organizations looking to streamline their security operations. So yes, this is A firewall thing. You used to have rule sets on Cisco. You know, Cisco firewalls back in the day, remember checkpoints also had this where you could do active, you know, active editing and then you basically push the policy to production when you're ready to do it. I appreciate this, but like, the reason why I read that atomically as atomic or sorry, atomic as automatically was because it would make sense to me that you would maybe draft and queue up all of your changes for the day and then have it automatically get applied at 11 o' clock at night or at a time when your system isn't being used. But that wasn't their thought process. They just thought, well, we want to make sure you can do all the updates at one time, but still in the same operation. So it's not quite what I would have hoped for in this feature. But it's also weird that it's limited to not include the classic rules or the firewall manager. [01:04:11] Speaker B: But I've noticed they don't have a lot in the firewall manager. It's kind of weird. It definitely feels like something they built. But even like a lot of the stuff I've seen where they're talking about managing multiple firewalls, don't talk about leveraging it, talking about leveraging IAC or some other tool for this. So it's kind of more. I feel like the firewall manager is not a key thing that they are looking to continue to develop. Maybe I'm wrong and they're going to all of a sudden drop a bunch of new features. But just from looking at it, there hasn't been a lot of updates there. I mean, I do kind of like it. I get what you're saying, like you could schedule and whatnot. But the flip side of it is it's kind of nice. They took the whole, you know, cicd, you know, or sorry, like git git model where somebody has to approve, somebody has to, you know, review and gives you that verification so one person can't just go do something. So how do you like flipping back and forth? Justin Puts you a little bit on the spot, a little bit more. [01:05:18] Speaker A: It does a little bit. But you're. You're up though, I think. [01:05:25] Speaker B: Onto Cloud Journey. [01:05:27] Speaker A: Yes, onto Cloud Journeys. [01:05:29] Speaker B: I always forget if we do that before or after. [01:05:32] Speaker A: After show is after the show. Journey is during the show. [01:05:36] Speaker B: So yeah, yeah, I guess it makes sense, but tangential. So we have two Cloud Journey stories tonight. We're going to try to tackle both of them. [01:05:44] Speaker A: Yeah, I think so. [01:05:45] Speaker B: All right. First one Beyond IAM Access Keys Monitoring Authentication Approaches for AWS Security Blog AWS is pushing developers away from long term IAM access keys towards temporary credentials solutions like Cloud Shell IAM Identity Manager in IAM Roles to reduce security risk from credential exposure and unauthorized sharing. I feel like they've been doing this for a long time. Cloud Shell provides a browser shell based CLI that eliminates local credentials while IAM Identity center integrates with AWS CLI2 add centralized user management and seamless MFA I will tell you that the IAM Identity Center I still want to call it AWS SSO every time because because it's so much easier to say the CLI integration is phenomenal and just makes life so much easier. For CICD workflows and pipelines and third party services, AWX recommends using IM Role anywhere for on premise workloads and ODBC integrations for services like GitHub Actions. And instead of static access keys, modern IDEs like VS Code Support secure authentication through IAM Identity center via AWS toolkit, removing the need for developers to store access keys locally. AWS emphasizes implementing least privileged policies and offering automated policy generation based on CloudTrail logs, helping to create permission templates from actual uses versus what your developers tell you that they need. I mean everything they're saying in here is true, right? Like having that key out there. It's not like, you know, a lot of systems let you put expiration dates on keys and would be a nice feature. They eventually do. You know, is definitely reduces your security risk. You know that key is no longer out there. You know, I think they've had forever in cloudtrail or sorry, Trust Advisor where it pulls up and says hey, this key is older than 180 days. You know they've been trying for a long time to provide people options to get rid of those keys and I think they're at the point when for most things they can do it. I mean there's definitely people out there that you know are stuck in their ways and you know, want their keys or have legacy reasons for it. But I do feel like, you know, between Cloud Shell, if you just want to run something quickly, which I've definitely done and you know, AWS SSO Identity center integrations, you know, it is much, it's much easier to get temporary keys and everything else along those lines get the sessions, you get this, the you know, the three things and you're done. The one thing I've done, ODBC I don't know about You Justin, have you played with IAM role anywheres? I haven't done that one much. [01:08:30] Speaker A: I have not done that one much. I have done the OIDC thing for you especially for GitHub actions going into AWS. [01:08:39] Speaker B: That's the only place I've done it honestly. [01:08:42] Speaker A: But yeah, the IAM roles anywhere I haven't done. And then for workplaces I use the old legacy single sign on integration but then you still had access keys with that model and then IAM Identity center kind of came out after that and so we were moving to that when I left that company. But at Google we use something very similar to AEOS iam so Identity Center I personally like for my personal AWS account I'm keeping my keys mostly because I don't like the temporary nature of the authentication and all that. And so the idea of I would probably use AWS toolkit to integrate with my Visual Studio code. It's probably the most likely scenario I would go with. But again if I had to reauthorize re authenticate multiple times, it's like I don't know that I want to do that. [01:09:35] Speaker B: You can do a 12 hour token and that's what I've done for a lot of things is I do a 12 hour. [01:09:40] Speaker A: Yeah, I mean like the right answer is I should. I just haven't committed myself to that level of involvement yet. But I do my friend's website, you know he. We use Google Auth for all of his main like you know, backend office stuff like that. But the Amazon account is still not integrated with that yet. But it's on my list to do. [01:09:59] Speaker B: That's not bad. It's pretty easy. [01:10:01] Speaker A: Yeah, I know it's pretty easy. It's just one of those like so that may be my first experience with it so maybe I'll try that soon. But yeah, I mean the days of keys and giving around access keys are really behind us and are really bad security practice. Although like I said, I'm an old dog, can't learn new tricks so I'm not super into it. [01:10:16] Speaker B: Don't look at my local AWS CLI then. I definitely have a few in there. [01:10:21] Speaker A: Yeah, no, I do too. I have a couple tools so they're not sitting in plain text like you know, like through one integration with one password that I don't have to. That's really. They're in the file in plain text but it's all on my laptop and I don't have to reissue them and that's much better for me. But again, I should do it. I'm just, I'm a bad for work stuff all the time. Use it all the time. How about this? Embracing the cloud shell. Like, I forget the cloud shell exists pretty much all the time and it's not. I mean I use it a lot in Google. Like the G cloud shell I use all the time. But the AWS one doesn't come to mind. [01:10:57] Speaker B: I use the Amazon, sorry, the Azure one probably once every couple weeks. I always forget the AWS one exists. Every now and then I'm like, you know, helping somebody out and I'm logging into their account, they gave me a user, I was like, I'm like, oh, I need to run this one command because I know how to get it this way. And I'm like, oh wait, how do I do this? And then I remember versus like you said, setting up credentials and or SSO and authenticating. I'm like, oh, it's here. But like, I swear that's maybe once a year when I have this vain memory. I'm like, oh, this very much simplified what I was gonna do. And you know, then I forget about it for a year. Cause most of my AWS stuff is just set up. That old dog teaching us new tricks is difficult at times. [01:11:39] Speaker A: Yeah, I mean what I, what I have embraced and I'm very happy with is enabling passkeys for my MFA on AWS accounts. So, so, so nice to not have to go get my two factor token out and just like, oh yeah, I can just use my passkey. I don't have to do that. So that's been a life changer. [01:11:58] Speaker B: I'm worried for the day that my yubikey dies. [01:12:01] Speaker A: I mean, the passkey is as good as the Yubikey in my opinion, without having the burden of having to have a key that you have to physically move around because passkeys are basically a virtual yubikey that lives inside of one password. So it's great. Now if I could use the passkeys on the command line, maybe then I would be sort of into it more, but. [01:12:19] Speaker B: Oh, that'd be interesting. [01:12:20] Speaker A: That'd be cool. So again, like I, I should do this. I will do this because it's the right thing to do. And like I've, I've slowly been moving to Secrets Manager on old legacy stuff that I've been working on. Like I need to get off of Even, even the WordPress site that runs the cloud pod is now using Secrets Manager on the back end. And yes, I should have used parameters because they're cheaper, but I forgot and then I don't want to change it now. It works, but, you know, it's nice. [01:12:48] Speaker B: There's a conversation that I think we should pull, Ryan, and maybe if we have any friends on the show that are. Security. Security people of. Is a passkey, which is a second form of authentication, useful if you're. If it's all stored in the same place as your password. [01:13:08] Speaker A: Hey, now, let's not get into that conversation, shall we? [01:13:11] Speaker B: I mean, an after show one day. How about that? [01:13:14] Speaker A: Okay, sure. [01:13:14] Speaker B: We need Ryan and maybe at least one other security person, because I just have questions around all of it. [01:13:22] Speaker A: So, I mean, I do require. When I go into 1Password on my computer, I have to use a fingerprint to unlock it and it locks after a minute or two. So, yes, technically, if you took my laptop, you would have my vault for my 1Password vault locally, but you had to be able to get into it with my fingerprint or the password, which is not a password I use anywhere else. So I feel sort of okay about this. But actually what we should do is let's table that and we'll bring this topic back up with Ryan next week because I want to know if he's an old dog who hasn't moved off of access keys as well. [01:13:56] Speaker B: There's just those legacy accounts. I have a few of them hanging out that I'm like, yeah, so check. [01:14:01] Speaker A: Out next week's after show. We'll talk about that. Definitely. After the show, we'll bring this up with Ryan and see what he thinks. [01:14:07] Speaker B: But then don't ever try to find our accounts, guys. That's all I have to say. [01:14:12] Speaker A: Right? All right, well, our next cloud journey is really a reflection by our friends over at CrowdStrike. Basically, they have introduced granular content control features allowing customers to pin specific security configuration versions and set different deployment schedules across test systems, workstations and critical infrastructure through host group policies. Because they crashed the entire Internet a year ago by changing that. [01:14:36] Speaker B: Whatever happened with the lawsuit with Oman. [01:14:38] Speaker A: Delta, it's still happening. It hasn't. I'm keeping an eye on that. Okay. The company established a dedicated digital operation center to unify monitoring and incident response capabilities across millions of sensors worldwide. Processing telemetry at exabyte scale from endpoints, cloud containers and other systems. A new Falcon Superlab test test thousands of OS kernels, hardware and third party application combinations with plans to add customer profile testing that validate products and specific deployment environments. And Crosby is creating a Chief Resilience Officer role Reporting directly to the CEO and launching Project Ascent to explore security capabilities outside kernel space while maintaining effectiveness against kernel level threats. The platform now provides real time visibility through Content Quality Dashboard, showing release progression across early access and general availability phases, all with automated deployment adjustments via file confusion, soar and workflows. And really they came down to this is three pillars of resilient design for them. Foundational, adaptive and continuous. And then they also go into basically how that then drove key improvements across these different areas, which resulted in what I just talked about. So in general this is cool. I'm really happy they're being transparent about what they're doing. I wish they'd been more vocal about what they were going to do on some of this stuff over the last year instead of waiting a year. But I do appreciate the effort. I wish they would do like I like. I love to see a picture of this thing or is it not a thing in physical space? It's only a digital thing around like the Operation Digital Operations Center. Like is it really digital or does that just mean it's got digital displays? Like what is that? [01:16:14] Speaker B: I think it's digital digital. [01:16:15] Speaker A: Is it a knock? You know, I'm curious. [01:16:18] Speaker B: I assume it's a global knock. That's 24, 7 around the, like around the world. [01:16:22] Speaker A: Yeah, okay, for sure. But yeah, again, the amount of data they're trying to suck up from all these sensors and you are familiar with large scale sensor deployments and picking up telemetry from those systems, I mean there's a lot of challenges, some of that. So there's a lot of actually really interesting technical talks I'd love to hear about behind this solution. But in general, I appreciate this new commitment to resilience that they had, which I wish they'd had all along, but they had not had prior. [01:16:52] Speaker B: Yeah, I mean, and all these things are just foundational things that you should be doing. And if you can get the basics right, which they call here, like the fundamental, like in the article, they talk about them. But like if you get those fundamentals correct, getting to that next level, getting a lot of those next things isn't that difficult. But if you have that core strength set up, your life's going to be easier. And they do in this article obviously talk a lot about like CrowdStrike specific things. But. But a lot of these things are things that probably people should have been complaining about earlier. But until they had the incident, no one cared. [01:17:29] Speaker A: Yep. I mean someone cared, but there wasn't much you could do because you're paying for a SaaS service and you're like following on an SLA, like, well, I have an SLA and so we're going to trust that CrowdStrike doesn't want to embarrass themselves publicly in front of all their customers. And this SLA will protect me. And in this case, neither one of those things was true, apparently. But now here we're in a much better place. [01:17:53] Speaker B: I also like a year after they are adding a. They're adding the Chief Resilient Officer and a Chief Technology Innovation Officer. Like it felt like the Resiliency Officer probably should have been, you know, working towards probably earlier on. And maybe it's, you know, they didn't want to wait for the blog post. I don't really know. [01:18:16] Speaker A: Maybe it's one of these things where they were like, look, we don't want to add a new leader in. We have very clear things we have to execute on. And so that's where we're focusing on building that out first. And now that we feel like we're in a good place now, we want somebody who just thinks about what are the future opportunities for resilience. And so it's more greenfield type position now versus what the last year is very tactical and maybe that CEO directly was driving that tactical execution. I don't know. It is sort of interesting. [01:18:43] Speaker B: Yeah. To announce two C level positions. [01:18:46] Speaker A: Well, so Alex Ionescu has been there for a while. I think they just changed his role to Chief Technology Innovation Officer. But he's been at the company for quite a while and then they're giving him this project ascent to see if they can figure out how to do stuff outside of kernel space, which is what Linux and other things have been pushing for as well. And I think it's a good idea to. If you can do security without accessing kernel, then your security tool doesn't become a major threat to the operating system, which is what we've learned Crowdstrike can be hackers dreamed of doing the damage that they did. [01:19:20] Speaker B: Yeah. It's amazing what one minor problem caused. [01:19:24] Speaker A: You know, so one one extra carriage space. [01:19:28] Speaker B: Not at all a big deal. Don't worry about it. Still wonder their whole testing procedure they did originally, but we'll bypass that. [01:19:34] Speaker A: Yeah. Again, make sure you're doing your FMEAs. Make sure you're doing chaos engineering and testing all of your cloud systems for resilience. And if you think you have it protected, always good to test it because if you have too much hubris in this area, you can eat crow pretty hard all right, Matt, I think it's another fantastic week here at the Cloud Pod. [01:19:56] Speaker B: It has been. We'll talk to you later, Justin. [01:19:59] Speaker A: See you next week. [01:20:00] Speaker B: Hey. [01:20:03] Speaker A: And that's all for this week in Cloud. We'd like to thank our sponsor, Archera. Be sure to click the link in our show notes to learn more about their services. While you're at it, head over to our [email protected] where you can subscribe to our newsletter, join our Slack community, send us your feedback, and ask any questions you might have. Thanks for listening, and we'll catch you. [01:20:22] Speaker B: On the next episode. [01:20:27] Speaker A: Sam.

Show Notes

Titles we almost went with this week:

General News

AI Is Going Great – or How ML Makes Its Money

Cloud Tools

AWS

GCP

Azure

Cloud Journey

Closing

Chapters

Episode Transcript

Other Episodes

Episode

Autonomous Cloud Pod – Ep 40

Episode 143

143: It’s Chaos in the Cloud Pod Studio

Episode 208

208: Azure AI Lost in Space