269: CrowdStrike: Does Anyone Know the Graviton of this Situation?

[00:00:07] Speaker A: Welcome to The Cloud Pod, where the forecast is always cloudy. We talk weekly about all things AWS, GCP, and Azure. We are your hosts, Justin, Jonathan, Ryan, and Matthew. [00:00:18] Speaker B: Episode 269, recorded for the week of July 23, 2024. CrowdStrike: Does anyone know the Graviton of this situation? Good evening. Ryan and Matt, good evening. [00:00:31] Speaker A: Good evening. I don't know where we are. [00:00:35] Speaker B: Yeah, it's maybe been a bit of a long weekend for some of us who've had to deal with CrowdStrike. So that's our first topic of the night. So we'll just jump into it because I went to India, I came back, you guys did a great show. I listened to it, I loved it. We were supposed to record on Friday because of some other scheduling snafus, and we try not to record on Fridays if we can help it. And then Thursday night, CrowdStrike said, nope, nope, you're not going to record tomorrow because we're going to cause the world's largest IT outage in history. [00:01:06] Speaker A: Yeah. [00:01:07] Speaker B: And so that's why we did not get an episode out last week. Apologies for that. But for those of you who lived under a rock, CrowdStrike, a popular EDR solution, caused major disruptions to the world's IT systems with an errant update to their software that caused Windows servers to blue screen of death, disrupting travel, airplanes, trains, governments, news organizations, and more, including our day jobs. CrowdStrike removed the errant file quickly. But still, the damage is done, with tons of systems requiring manual intervention to be recovered because this little bad boy corrupted things in kernel memory, in kernel space with privileged access.
And so as soon as Windows hits this little bad boy, it crashes with a blue screen of death and doesn't allow you to run any automation or anything to fix it, unless you get kind of fancy. In some cool things we saw people do, apparently you could sometimes reboot up to 15 times and that would recover the file somehow magically, which was just insanity as you're waiting for servers to boot up 15 times. You could also swing the hard drives from broken servers to working servers and manually remove the files, put them back, and then of course deal with trusted compute issues, because now your kernel space has magically changed that authorization. So good times there. Then, apparently before this, the Azure Central region also decided to have a big massive outage. So when we all got called, when the Sev 1s went off, we all thought it was Azure and the Internet being hacked. Because of course, earlier that day, Trump gave a great speech at the RNC giving us doom and gloom. And I was like, a hacker taking on the Internet at this point in time would make complete sense. Going into this, Azure had a major outage. The Internet's on fire. We don't know what's going on. And so, yeah, I thought maybe Trump caused it, which would have been sort of funny. CrowdStrike CEO George Kurtz, who also happened to be the CTO at McAfee during the 2010 update fiasco, which I learned this last weekend, stated that he was deeply sorry for the CrowdStrike outage and has vowed to make sure that every customer is fully recovered and this never happens again. So I would hope by Monday most people are fully back up and running, although Delta Air Lines is still having some problems. But that's unrelated, I think, to CrowdStrike at this point; they have some other software problems I saw on the interwebs. But Ryan and Matt, how's your CrowdStrike? [00:03:21] Speaker C: Well, I've got to start with the Azure outage, which was loads of fun.
When a couple of our production regions are located in that Azure Central region, the best message I've gotten during my day job was when one region set off an alert, and then the next one, and someone just sends a message going, "Microsoft break things again?" That was a great thing to read as I was sitting on a plane while everything's going down. So that was fun. They released the PIR today, which I guess we'll talk about next week, because I think it was like mid morning. Well, I saw it mid morning. I think it might have been out earlier. But yeah, they did a good job of just decapitating themselves at Azure. That was a lot of fun. [00:04:06] Speaker A: Nice. Yeah, I got a little bit of leeway because I didn't have any workloads on the Azure outage. And then by the time we were really affected, all of our core systems were down. And so most people couldn't even log in and do things. So I got pulled in late, and I got to sleep in on Friday, which was nice. But I did have to spend the majority of Friday and then most of Sunday or Saturday automating solutions to apply them at scale. Which is fun. You know, a little part of me, I don't know, enjoys that a little bit. The firefighting aspect of it. I also found it funny, you know, just the type of solutions when you have to get super creative. I don't know, I find that fun. And this bug, where it was and how it manifested, it's in a rough place. Right. There's not a lot of options. [00:05:01] Speaker B: Well, I mean, it's really an Achilles heel of the cloud. I mean, to fix this, you need to be able to boot a server into safe mode or into recovery mode and then remove this file manually, which requires that you have console access, which Amazon just added a couple of years ago. I remember we talked about it like, oh my God, you get console access through a serial port. That's amazing. I can't believe we have that.
[00:05:24] Speaker A: But I don't think you can actually act on it, can you? I was really pleasantly surprised with Google's setup, which isn't common. I haven't used the AWS one in depth, but with the Google one it was all built into the tooling, and the command sequences all worked. It actually worked out pretty well, where we were able to, while it's still a physical terminal, and then emulated through like nine layers. So it's pretty resistant to most automation. [00:06:03] Speaker B: I mean, when I got on the call with you guys, I think it was Friday morning, you even called in, I'm taking a nap. And I came onto the bridge and you guys were like, yeah, we're talking about different paths we're going to go down. One is this thing, and then one's TCL Expect. I'm like, yeah, excuse me? Expect? TCL Expect? I'm like, what? And then someone else is off on a Golang project. [00:06:25] Speaker A: Yeah, we went old school and new school at the same time. [00:06:27] Speaker B: Yeah, that was kind of great. Three different streams of people trying different things to solve this problem at scale. And yeah, that was fun to hear you guys geek out about TCL for a little bit. But, yeah, I'll give. [00:06:42] Speaker A: GitHub Copilot a lot of credit for coding our Expect script, because it's been far too long since I've dealt with any of that. And I remember general patterns, but syntax and specifics, I was asking a lot of questions of it. And I will say, you know, using it in a firefighting mode, it was very effective and it was very helpful. And, you know, it's not the prettiest thing that ever existed, but it was effective in the end and did get us out of the bind, and we could relieve the people who were literally doing it by hand in serial console, which is oh so painful. [00:07:25] Speaker B: Yeah. But it was a lot faster to watch.
We were basically able to watch the VMware console for on-premises things. And then the other side of it was just fine with GCP. You had to do the gcloud commands and all that stuff. So it was fun. The old school, new school even in that, like, oh, yeah, this is a whole different animal. [00:07:47] Speaker A: Yeah. Just comparing our traditional data center and GCP, it is kind of interesting. [00:07:54] Speaker C: It's always fun when you're like, okay, everyone sit down. No stupid ideas. These crazy ideas that you have end up being the ones that work, but you would never realistically have the opportunity to try them because, you know, one, how often, and God, I hope in your day job you're not actively logging into the serial port for fun or to automate your deployments. It just sounds like you're doing something horribly wrong at that point. [00:08:19] Speaker B: I mean, I did it once when they announced the feature. I mean, that was the extent of it. I was like, this feature works great. I can do it. And, like, cool, I hope to never use it. And I did. [00:08:27] Speaker A: And unlike CrowdStrike, you never get to test at that kind of scale, because I don't like to update everything in the world all at once. [00:08:35] Speaker B: Oh, come on. [00:08:35] Speaker C: It was only like 85 million Windows machines. Come on. Oh, sorry. 8.5. 8.5 million. [00:08:41] Speaker B: 8.5 million Windows PCs were impacted, allegedly, per Microsoft, which I don't know how they know that, but, you know, Windows updates maybe. I was very curious. [00:08:51] Speaker A: CrowdStrike was producing a lot of tooling to track and identify nodes. [00:08:55] Speaker B: I mean, CrowdStrike had a bunch. [00:08:57] Speaker A: Yeah, yeah, no, so, I mean, I can see them getting the numbers, but. [00:09:00] Speaker B: It was sort of like, the part that flabbergasted me about that mostly, though, is that it's 8.5 million, which is only 1% of the entire Windows PC ecosystem.
I was like, what? Like, that's a crazy number to me. [00:09:12] Speaker A: That is a crazy number. [00:09:14] Speaker C: Oh, don't do the math. I don't want to know what that number is. Why are there so many Windows machines out there? [00:09:19] Speaker B: All the IoT devices, man. [00:09:21] Speaker A: Yeah. And it's weird to find, like, you know, you saw that blue screen on, like, and some of it was Photoshopped, and so I don't even know what's real. But it always surprises me when there's something I wouldn't expect, like, you know, whether it's a digital display or some sort of automated vending machine, you get the blue screen of death. You're like, Windows? [00:09:44] Speaker B: Really? [00:09:45] Speaker C: Well, then I'm like, why are you running an EDR solution on it? Why not hardened Windows, like CE or whatever the hell? [00:09:51] Speaker B: Well, I think a lot of it is CE, but then, you know, security comes in and goes, well, if it's Windows, we have to have CrowdStrike. [00:09:57] Speaker A: You have to do it. Yeah. Otherwise we can't say we're covered by our EDR solution. [00:10:03] Speaker C: Yeah, yeah. Insurance companies are picky like that. [00:10:07] Speaker B: They can be, yeah. It's interesting because, you know, I happened to be at San Francisco airport and I was flying on United and there happened to be a power issue where, I think, they lost one of the three phases. It dropped out. So, you know, check-in computers were all down, but the airport wasn't completely down. Baggage belts were working and things, but they just couldn't print anything out. And so we were in this massive line of humans while they tried to figure out their DR process for when the power's out in the building, which they were not good at. But the fun thing about that was as you went through, things came back online, the power came back and things could boot back up.
And I was watching the IT crew walk behind the check-in desks and turn on the TVs and they're all booting Windows. And then all they display is Premier Access, check bag. They're literally one whole screen with just two words that are related to United baggage check-in. Like First Class, Premier 1K, bag drop. And you're running a complete Windows system for that. Like, why? And I saw them even someone posted. [00:11:16] Speaker A: You guys know about Raspberry Pis? [00:11:17] Speaker B: Like, or have one Windows box that has an HDMI multiplier on it. You just multiply out the same signal to all these TVs. Just crazy. So yeah, there's an amazing amount of Windows IoT devices that you don't think about, I think is what. [00:11:32] Speaker C: Is that really considered IoT, though, if it's just for TV output? [00:11:36] Speaker B: I mean, I don't know how much is in it, but if it's connected. [00:11:39] Speaker A: To the Internet, I say it counts. But, yeah, I don't know if they. [00:11:42] Speaker B: Are, but yeah, I don't know how this happens. [00:11:47] Speaker A: Right. [00:11:47] Speaker B: This is one of the big questions we still have, because it's the Monday after the disaster. I was hoping CrowdStrike would have something out before we recorded today about the root cause, but they haven't fully disclosed what happened. But they have written some good posts on how to fix it. They have lots of instructions, with AWS and Azure helping their customers resolve this thing. So is GCP. I don't want to undermine them. As well as third-party vendors you could be using to help you recover your systems. So there's lots of good documentation they've done, but they've not yet provided a full root cause. But they have provided some technical explanation, which is basically that they pushed what they call a channel file update out to the system and that had some invalid data in it of some kind.
And that basically caused the CrowdStrike .sys kernel driver to error, which, because it's in privileged memory space, causes the blue screen. There are questions that come out of that, like, okay, so do you test this? Because it seems pretty clear that when you upgrade this particular thing, it's going to cause an outage. And so I would assume it would repro pretty quickly in test. So how did that happen? And then you really updated 8.5 million Windows hosts within like 18 minutes of this file being available? That's kind of crazy, but sort of also makes sense because it's a security tool and you're trying to prevent zero days and you're doing all these things. [00:13:09] Speaker A: And you want to be able to react, right? When there's a new vulnerability that's actively gaining access to systems, you want to be able to fix it quickly. So I understand why the mechanism exists, but yeah, there's going to be some. [00:13:23] Speaker C: Hard questions. And, like, this file gets updated multiple times and you can't control this file too. It's not. [00:13:31] Speaker B: Well, that was one of the things that we were all shocked about, because you can configure CrowdStrike to be at different levels, so you can be N minus one, N minus two, N minus three, which applies to the CrowdStrike version that you run on the machine. That does not apply to the. [00:13:44] Speaker C: Channel file updates, which, yeah, we learned that one too. [00:13:47] Speaker B: Yeah. And sort of just like, but why? It's me as a system admin, I should be able to choose how much risk I'm willing to take, and I'm willing to wait two or three hours for a zero day to be protected in my system, after you release it, to not blue screen my entire production environment. [00:14:06] Speaker A: Yeah, yeah, I mean, I get the automatic updates, but I do want them phased out, right? So that not everything goes down all at once.
And so yeah, I suspect their entire product roadmap is just thrown away for the rest of the year. They'll be working on giving the options on. [00:14:21] Speaker B: We're going to be working on our CI/CD pipeline. [00:14:24] Speaker A: Yeah, well, CI/CD, plus also you build it into the product. [00:14:27] Speaker C: So, QA check. [00:14:28] Speaker A: Yeah, you have to have those controls for the customers now because you have to rebuild all that trust. And so you have to give customers the knobs to pick and choose when those updates go on and give them the ability to phase it out and do all that n plus one at the channel level. All of that has to happen. [00:14:47] Speaker C: Well, then another thing I read today a little bit, I didn't go into too much detail, was why does the tool need to be running in kernel mode, and why CrowdStrike has this ability on Windows. Because the EU said that they had to open up kernel mode in 2009. But I think it was a couple of years ago, Mac told the EU to go away, we're not enabling kernel mode for anyone else. [00:15:11] Speaker A: Yeah. Before they had released stuff, I assumed it was part of the sensor suite for scanning, analyzing memory for patterns that look like. And it's not. It's not. It's an update channel. And that is horrifying because it shouldn't be in that space. I agree. [00:15:31] Speaker B: Well, but, you know, so I saw that article this morning about how the EU might be the reason why Microsoft doesn't protect the kernel more. And I think that's a cop-out. Basically the EU is saying we want fair and equal competition. And basically what Mac did, or Apple did, was they created a custom API that does what CrowdStrike needs to do in the kernel and provides that to CrowdStrike and other vendors. They're all on equal footing. They all get access to the same API, they can all implement the same features, but Mac controls it at the API.
[00:16:02] Speaker C: I read the whole article. [00:16:03] Speaker B: There's no reason that Microsoft couldn't do the exact same thing, as long as Defender, which is their EDR solution, doesn't have access to the kernel in a way that CrowdStrike and other vendors can't get. But if you want to put them all through an API to the kernel for this type of data, Microsoft could do that, I think. Yeah, I saw the article this morning too, and I was thinking about it. I was like, that's just a cop-out. Microsoft should be protecting the kernel more than they are. You think about all the security vendors that exist and how much low-level crap they do to the kernel. We rolled out CrowdStrike in the last year, unfortunately, but we replaced Sophos, which was even worse, and trying to get that out of the kernel and get it to uninstall properly and all things was a major undertaking in a lot of ways. And these things in the same way. [00:16:53] Speaker A: It corrupted stuff, and then the boot loader and all your signature validations, it's terrible. [00:17:01] Speaker B: So in real-time follow-up, by the way, I did enable serial console on my server on EC2 and I can log into it and I can put in commands, et cetera, at the command line, in the serial console. So yes, you could do the same automation we did on gcloud that way. But then I was also curious, like, what were the actual instructions for AWS? Because I wasn't doing rec. They built an SSM package, an automated runbook called Start EC2 Rescue Workflow, that basically did everything you needed to do as part of SSM. Going into an encrypted volume, reattaching it, doing the drive change-out, and then basically going through the whole process of reattaching it, which is kind of cool. [00:17:36] Speaker A: Well, how would that. Oh, so you're running SSM on like a rescue node then? [00:17:39] Speaker C: Yeah, yeah. [00:17:40] Speaker B: Yep. [00:17:40] Speaker C: That was the way I saw to do it. [00:17:42] Speaker B: That's great.
[00:17:43] Speaker C: Think that was method three or two of the ways to do it. I read through them the other day. I mean, I will say what is nice is everyone is out there trying to help. [00:17:55] Speaker B: Yes. [00:17:55] Speaker C: You know, every IT department that's affected by this needs a hug right now, because everyone has had a long weekend. [00:18:04] Speaker B: One of my favorite methods was someone created a PXE boot script. So basically if you were already booting your Windows boxes with PXE boot, you reboot the box. It would load this PXE boot loader that would run a small Linux kernel, change the file out, and then put down a file saying I've already done this. And then when it rebooted again, it would pass from the PXE booter back into Windows and it would come back up. I'm like, that's brilliant. [00:18:27] Speaker C: Yeah, that's awesome. [00:18:29] Speaker B: But that only works if you're enabled. [00:18:30] Speaker A: For PXE boot, if you've configured it that way. Right. And for security reasons, a lot of people don't do that. [00:18:37] Speaker B: Right. [00:18:37] Speaker A: Because, you know, you can also, what could possibly go wrong? Yeah, I know, but it is super cool. And I can imagine, especially for user workstations, that was super critical. Yeah. [00:18:55] Speaker C: The only interesting part is now you can kind of see what EDR solution a lot of these companies use. You can go look online and see who was down during the last 48 hours. And it's interesting data that a threat actor could use in the future. [00:19:11] Speaker B: Well, I mean, you thought SolarWinds was bad and what they were able to accomplish there, it's like, okay, so now we've just shown that security vendors have extra access to the kernel.
And if I'm going to be a nation-state actor, I'm going to go after security vendors who have this access. So they all need to be prepared for more attacks because you just showed, you know, like, Iran wishes they could cause the kind of damage in a hack to corporations that CrowdStrike did to the world on Friday. So it's going to be interesting. I'm very curious. We'll keep you posted here on The Cloud Pod as we hear root cause data. Hopefully they don't keep it all behind paywalls, so we can actually talk about it, but we'll see. [00:19:52] Speaker A: Yeah, I think that they will have to come out big in public because of the size and scope of this outage. They have to rebuild a lot of trust if they want to maintain their customers and have any hope of attracting new customers in the future. [00:20:08] Speaker C: Yeah, yeah. I really think they need to do a full public RCA. What happened, how they're fixing it, you know. Otherwise, I think people are just going to get mad. [00:20:17] Speaker B: Yeah, well, this is always the challenge, right? Because you have, you know, people like Okta, right? I never really felt they had a great response to what happened. They were very dismissive. They were very, like, it's a third party's fault, et cetera. And then you had SolarWinds, where originally people started out with SolarWinds as a victim, but within weeks they were the villain in the whole process because they had known about it and lied and, you know, got called in front of Congress. All kinds of things happened. [00:20:52] Speaker A: Conviction, I believe. [00:20:53] Speaker B: I think so, too. Yeah. [00:20:54] Speaker A: Yeah. [00:20:55] Speaker B: But it's definitely a risk. And how they respond is really what makes them successful or not successful going forward.
That McAfee outage that happened in 2010 when they had that bad update resulted in them getting bought by private equity. A couple of years later, they never really recovered. So it'll be interesting to see how things go. Um, I did just see here that CrowdStrike has already been called to Congress, so hopefully they don't think that's going to be the public vote, because congressional hearings are the worst. [00:21:26] Speaker A: Oh, they're so. They so are. [00:21:28] Speaker B: Yeah, but we'll see you, Matt. [00:21:30] Speaker A: Oh, that senator is trying to get explained to how security systems work in kernel space. [00:21:36] Speaker B: Like, what's the difference between XDR and Kaspersky? Oh, it'll be great. [00:21:42] Speaker C: I mean, this is just going to be fun to watch with some popcorn, I think, just to see how they break the technical level down for that, for sure. [00:21:49] Speaker A: I don't know, man. Watching some of the Facebook stuff, it's made me really just disheartened. It's cool if senators don't know specifics, but they have staff and they can research. Some of the questions that get asked are pretty offensive. Anyway, I digress. [00:22:09] Speaker B: Well, talk about other security companies that have a lot of value and a lot of trust in your organization. Wiz, who we've talked about many times here, is apparently in talks with Google to be acquired for $23 billion. This deal is not done yet. It's been a week since this rumor came out, and it still hasn't closed. I'm sure Friday's CrowdStrike thing has no impact on this deal. [00:22:30] Speaker A: Oh, I think the price just went up. [00:22:31] Speaker B: Oh, I think so, too. So we'll see what happens. I imagine Google's a little bit concerned about regulatory pressure, like, they bought Mandiant, for God's sakes. You're buying Wiz too, and other things. So I'll be curious if this actually happens or comes to fruition, or if someone comes in and tries to outbid them for Wiz.
Perhaps now, but we'll keep an eye on that, see if Wiz gets snapped up by the Google or by someone else. [00:22:56] Speaker A: Yeah, I still haven't played firsthand with Wiz, and I hear nothing but good things. And so I'm very conflicted on this, because I wonder if Wiz is going to be more exposed in Google products like Mandiant has become, or is it going to be sort of behind-the-scenes integration? And so we'll see. I'm just curious how things shake down. [00:23:25] Speaker B: Yeah. Moving on to AI Is Going Great. Definitely going better than security. Databricks is announcing the general availability of serverless compute for notebooks, jobs, and Delta Live Tables on AWS and Azure. Databricks customers already enjoy fast, simple, and reliable serverless compute for Databricks SQL and Databricks Model Serving. The same capability is now available for all ETL workloads on the Data Intelligence Platform, including Apache Spark and Delta Live Tables. You write the code, and Databricks provides rapid workload startup, automatic infrastructure scaling, and seamless version upgrades of the Databricks Runtime. And importantly, with serverless compute you are only billed for what you use and the work done. Databricks is currently offering an introductory promotional discount on serverless compute, available now until October 31. [00:24:08] Speaker A: I can't remember offhand if the data plane has already been made serverless by Databricks. I know, if you're a customer, you can have the data plane be in your own environment and have it automated. I just don't know if you could do that in, like, sort of leveraging Databricks directly to leverage that infrastructure. But this is the first product I know of, I guess other than Google, because with Vertex you can have them host the notebooks. So I mean, this is pretty cool. And I hope I'm right about Databricks already having the data plane, because then you really don't need any servers as part of this.
That's the way it should be. [00:24:48] Speaker B: Moving to AWS. S3 Express One Zone now supports AWS CloudTrail data event logging, which allows you to monitor all object-level operations like PutObject, GetObject, and DeleteObject, in addition to bucket-level actions like create and delete bucket. This enables auditing for governance and compliance, and can help you take advantage of S3 Express One Zone's 50% lower request costs compared to S3 Standard. I haven't tried to use this because I like data durability, but the fact that this wasn't in CloudTrail kind of blows my mind. And for shame, Amazon, that you would not have had this in your logging already, because just because it's a single zone doesn't mean you shouldn't provide audit logging for it. [00:25:31] Speaker A: Yeah, I was shocked because, again, my use of S3 is typically, I want that data to be very resilient to outage. And so it was a learning, you know, and it's like not everyone's just that cheap of a bastard, you know. AWS a little bit, sure. But. [00:25:52] Speaker C: Yeah, well, this was the Express One Zone too, not just the IA One Zone, which I thought was interesting. So are there other types that are not fully logged? And like you said, how do you not just have this, like, I. [00:26:07] Speaker B: Mean, it seems like table stakes in my mind now at this point, that if you're doing something on Amazon, it's going to be in the CloudTrail log. And if it isn't in the CloudTrail log, that's a really big miss. [00:26:18] Speaker A: I mean, I get how S3 is sort of an edge case in the sense that you're not logging against the S3 APIs. Those exist. They're logging, but it's access to the bucket via HTTP or what have you. All the different things that you can make S3 the backend for. But still it should be part of every release, like day one, not a feature how many years later? [00:26:48] Speaker C: This is like a.
It was just a year and a half. [00:26:50] Speaker B: Year and a half. [00:26:51] Speaker C: Yeah, two re:Invents. There's a lot of things I think that are table stakes, but this is like one of the core ones that. [00:27:00] Speaker B: 100% should be. Now, because this is a little bit past. We're going back before New York Summit. The New York Amazon Summit was, you know, two weeks ago. And you know, I remember when the summit included cool things like, you know, the fact that Graviton4 was made generally available, or the R8g instances of the Graviton4 are generally available. That would have been a summit thing. Didn't even make summit. Blog post only. So sad. Graviton4-based EC2 R8g instances are now generally available. AWS has built more than 2 million Graviton processors and has more than 50,000 customers using AWS Graviton-based instances to achieve the best price performance for their applications. R8g instances offer larger instance sizes, with up to three times more vCPUs, which gets you up to 48xlarge, three times the memory, which is up to 1.5 terabytes, 75% more memory bandwidth, and two times more L2 cache over the R7g instances. Early benchmarking data for the Graviton4 was showing about 30% faster performance. [00:28:01] Speaker A: You know, because it's only indirectly related to AI. That's why it didn't make the summit. [00:28:05] Speaker B: Yeah, exactly. [00:28:08] Speaker C: Every time I hear these benchmarks for these newer and, you know, bigger boxes, no matter if you're Graviton or Intel, I'm just like, if I ever told somebody that they actually need this to run their production workload, I feel like somebody would have laughed at me, like, you're not architecting for the cloud correctly. [00:28:25] Speaker A: I feel like, yeah, I spend most of my time trying to talk people out of these large instances.
[00:28:31] Speaker B: I mean, the problem is things like SAP HANA and Oracle and MySQL and Postgres exist and they don't care about your cost savings. They want the biggest and best. [00:28:40] Speaker C: All databases at that point, I want to run as a managed service. I'm just saying. Yeah, I guess SAP HANA. Yeah, yeah, but they have their own SAP HANA series, I thought. I don't remember what the letters are anymore. [00:28:53] Speaker B: There is a special X series and some Z series too. But yeah, those are real expensive. [00:29:00] Speaker C: No, the Z I thought was the CPU overclocked ones. [00:29:03] Speaker B: No, that might be high frequency, which. [00:29:05] Speaker C: Those actually I found use cases for and they're not that much more expensive. [00:29:09] Speaker B: At one point in the existence of The Cloud Pod I thought that I could remember all of the instance variations that exist. I used to know them and I have given up on that dream because there's just no possible way that I will ever remember them all, because, A, they have weird names on all of the cloud providers that make no sense, and B, they change so freaking often I can't keep track of them anymore. So yeah, I do remember what the A versus the G versus the I is. That's still a struggle and that's been around now for at least two years. [00:29:42] Speaker C: But I do like Amazon's, where at least it's like the A, G, and I, like, at least tell me the processor. I never found that same thing. It might be there, I just haven't figured it out. [00:29:54] Speaker B: Yeah, it's there in Google. It's definitely not there in Azure. [00:29:59] Speaker C: I gave up on Azure. There's like 18 letters to describe the damn thing. I can't figure out why. [00:30:05] Speaker B: Yeah, when you need a key to solve this problem, it starts to get a little concerning. Well, if you also want a different type of instance type, here again, the complexity.
Amazon is introducing its new inference capability, delivering up to 2x higher throughput while reducing cost by up to 50% for generative AI models such as Llama 3, Mistral and Mixtral. For example, with the Llama 3 70B model, you can achieve up to 2400 tokens per second on an ml.p5.48xlarge instance versus 1200 tokens per second previously without optimization. This allows customers to choose from several options such as speculative decoding, quantization and compilation, and apply them to their generative AI models as you wish. See, I don't know what that is. [00:30:47] Speaker A: Cool. Yeah. [00:30:49] Speaker C: Yay. AI. [00:30:51] Speaker A: I was kind of learning that Mistral and Mixtral are two separate models. [00:30:55] Speaker B: Yeah, that was a little bit of a thing too. Yeah. But yeah, already you're into the, oh, the ml.p5 instance types. I thought the P5s were IBM, you know, weird Motorola chips. [00:31:09] Speaker C: No, it's the P8, which is the old database. [00:31:12] Speaker B: Right, now that's a different thing too. Yeah. [00:31:14] Speaker C: Yeah. [00:31:15] Speaker B: They also gave us three new features for Amazon FSx for NetApp ONTAP, which they used three blog posts to do. So we'll combine them into one because we can't talk about this seriously. [00:31:24] Speaker A: Yeah. [00:31:25] Speaker B: First up, they can now provide higher scalability and flexibility compared to previous generations. Previously, the system consisted of a single high availability pair of file servers with up to four gigabits of throughput. Now, the next gen file system can be created or expanded with up to twelve HA pairs, allowing you to scale up to 72 gigabits per second of total throughput, or six gigabits per pair, giving you the flexibility to scale performance and storage to meet the needs of your most demanding workloads. You can also now leverage an NVMe over TCP block storage protocol with NetApp ONTAP.
Using NVMe over TCP, you can accelerate your block storage workloads such as databases and VDI with lower latency compared to traditional iSCSI block storage, simplifying multipath configurations relative to iSCSI. And having this in Amazon is a first that I'm aware of. So thank you for this capability. And you can now read data from a volume while it is being restored from a backup. This feature provides read access during backup restores, allowing you to improve your RTO by up to 17 times for read only workloads. [00:32:21] Speaker A: That's cool. I'm not sure having iSCSI is cool. Like, I get why people want it. I don't know that just because you can. I don't know if you should. [00:32:34] Speaker B: Well, I mean, if you want to do SQL clustering, it's a great way to do it without having to do a lot of availability group magic. [00:32:42] Speaker A: Yeah, yeah, no, I mean it's, but it's that, you know, it's the same thing that we're talking about with the large instance sizes. Like, we spend all our time trying to architect away from anti patterns like this in the cloud. Yeah, I mean, I get it. I mean it's a much more performant transfer. And so like it's, it is, you know, it's been used in data centers where you have direct access to the storage infrastructure. So you know, there's a lot of patterns. But again, I would use some of these sparingly because at this level is where you get really into tailoring your cloud environment, and then it goes back to the same data center thing where you require specialized resources with specialized knowledge in order to construct it, and you're going to end up with a lot of data center flaws that we've been trying to get away from. And you're going to go through the bottleneck of having these very highly trained engineers, who understand these storage operations at a super low level, have to provision your storage pools and hopefully automate it. You never know. Yeah. [00:33:49] Speaker B: So NVMe over TCP is not iSCSI, just to be clear.
But it's basically iSCSI. [00:33:56] Speaker C: Yeah. [00:33:56] Speaker A: Okay. [00:33:56] Speaker B: It's basically in kernel, it's much more performant than iSCSI is, and it is the new hotness to replace iSCSI, but it is not technically iSCSI. Don't correct us. [00:34:08] Speaker A: Okay. [00:34:08] Speaker C: Yeah, I was reading about it a little bit as Ryan was talking there. I was like, oh, this is kind of cool. But yeah, it's just the new version. [00:34:17] Speaker A: Yeah. [00:34:17] Speaker B: And it's much faster, has a bunch of multiplexing capabilities that iSCSI never had. It is a worthwhile upgrade if you have the choice to do it. [00:34:27] Speaker A: Maybe we'll just edit out my entire rant. [00:34:29] Speaker B: No, you're still, you're still correct. Your rant is right. iSCSI versus NVMe over TCP, which is a much more complicated acronym. I just wish they called it iSCSI 2. I would have been fine. [00:34:39] Speaker C: Well, I mean, it helps when you're improving. iSCSI was made a standard, a draft standard, in March 2020. Sorry, March 2000. Oh yeah, just straight 2000. [00:34:50] Speaker B: Yeah. [00:34:50] Speaker C: So like, these protocols were not meant to be running at these higher levels. The Internet was probably the size of what they're talking about now that we've upgraded to here, like 14 terabytes of. [00:35:02] Speaker B: Yeah. [00:35:02] Speaker C: And transferring at 72 gigabits per second, the Internet would have been downloaded in 3 seconds. [00:35:11] Speaker B: Back then, I remember the days when we said, who's going to use more than 100 megabits to the desktop? It's crazy talk. Now, we're now running like ten gig to compute stations in some cases. I mean, most people don't need more than a gig, but there are use cases for ten gig. Crazy.
Amazon ECS will now enforce software version consistency for your containerized applications, helping you ensure all tasks in your application are identical and that all code changes go through safeguards defined in your deployment pipeline. Basically, image tags are not immutable, but images are, and so therefore there is no standard mechanism to prevent different versions from being unintentionally deployed when you configure a containerized application using image tags. And so now ECS resolves this by the wonderful world of digest hashes: it resolves the tag to the SHA-256 hash of the image manifest when you deploy an update to an ECS service and enforces that all tasks in the service are identical and launch with the image digest that matches. This means even if you use a mutable image tag like latest in your task definition and your service scales out after the deployment, the correct image, which was used when deploying the service, is used for launching the new tasks, which will be fun to troubleshoot someday when you think you're using latest, but you're not. [00:36:19] Speaker A: Well, the interesting part about this is, because I actually really like this change because it is sort of mathematically guaranteeing the workload is what you said the workload is. But it's funny because it is going to be a mixed bag, because the ability to tag an image with a shared tag that you refresh and change the image out from underneath has been something that's been used and pretty much been called out as an anti pattern, using environment specific labels or latest, and so it's sort of this weird thing. And I've used this to get myself out of binds for sure, actually, specifically in ECS, like using latest to update stuff as part of like the underlying platform. [00:37:07] Speaker B: So like, or even just retagging a new image as the old image version number. I've done it once. [00:37:16] Speaker A: Oh yeah. Oh absolutely. [00:37:17] Speaker C: Definitely done that.
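The tag-versus-digest behavior described above can be illustrated with a minimal Python sketch. Everything in it is hypothetical and for illustration only: `ToyRegistry` and `ToyService` are made-up stand-ins, not the ECS or ECR API, and the digest is computed over a canonical JSON dump, whereas real registries hash the exact manifest bytes. The idea it shows is simply that a service that resolves a tag to a digest once, at deploy time, keeps launching the same image even after the tag is moved.

```python
import hashlib
import json


def manifest_digest(manifest: dict) -> str:
    """Return a sha256 digest for a manifest (toy stand-in: hashes canonical JSON)."""
    payload = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return "sha256:" + hashlib.sha256(payload).hexdigest()


class ToyRegistry:
    """Hypothetical in-memory registry: tags are mutable pointers, digests are not."""

    def __init__(self) -> None:
        self.tags = {}    # tag -> digest (can be overwritten)
        self.images = {}  # digest -> manifest (content addressed)

    def push(self, tag: str, manifest: dict) -> str:
        digest = manifest_digest(manifest)
        self.images[digest] = manifest
        self.tags[tag] = digest  # re-pushing a tag silently moves it
        return digest

    def resolve(self, tag: str) -> str:
        return self.tags[tag]


class ToyService:
    """Hypothetical service that pins the digest once, when it is deployed."""

    def __init__(self, registry: ToyRegistry, tag: str) -> None:
        self.registry = registry
        self.pinned_digest = registry.resolve(tag)

    def launch_task(self) -> dict:
        # Scale-out uses the pinned digest, not whatever the tag points at now.
        return self.registry.images[self.pinned_digest]
```

In this sketch, pushing a second image under `latest` after the service is created changes what `registry.resolve("latest")` returns but not what `launch_task()` starts, which is the "you think you're using latest, but you're not" situation the hosts joke about.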
[00:37:18] Speaker B: So, yeah, like, I mean, like yes, you can. Latest is just one example of the abuse you can do, but even the version image tags, you could reapply multiple times, and actually, I've seen a CI/CD pipeline that does that. Not at any place I worked, but I've seen it at a place that I helped out with and I was like, this is a really bad practice. Well, I'm glad to see it. Although it will be fun to troubleshoot because I'll forget about it until it burns me. [00:37:41] Speaker A: So that's how it'll work. Why don't you go? [00:37:44] Speaker B: All right, so let's move to the summit, the official summit, which, uh, I knew I was in trouble when Doctor, uh, Doctor whatever his name showed up. [00:37:56] Speaker A: Uh, wow. [00:37:59] Speaker B: Matt. [00:38:00] Speaker A: Doctor Matt Wood. [00:38:01] Speaker B: Yes. Doctor Matt Wood. Thank you. Wow. [00:38:02] Speaker A: There we go. Geez, that took too long to recall. [00:38:05] Speaker B: This is, this is a problem post CrowdStrike. Yeah, my brain is fried. Uh, I knew I was in trouble when Doctor Matt Wood was the main keynote speaker. And I knew I was even more in trouble when one of the screenshots I saw come across my Twitter feed was this bad boy that I shared with both of you, that'll be in the show notes as well, that Amazon says they've released 326 generative AI features into general availability since 2023. And they're saying that is more than twice as many ML and generative AI features launched into general availability than any other leading provider combined in the last 18 months. And they show on this slide AWS with the bar going far to the right, with other bars below it, in a lovely rainbow color scheme: a company one with a very small green box and a company two in a telltale blue color, I can only. [00:38:54] Speaker A: Apply to one vendor, very specific blue color. [00:38:57] Speaker B: Yeah. As trailing far behind.
And I would like to know from all of you, do you feel that Amazon is leading at the level of this chart? Does it not make you just immediately angry? Because it made me angry. I was angry about it. I was like, this is crap. [00:39:11] Speaker A: I mean, this just makes me want to resign from the podcast because we're going to talk about this many features. Yeah, like, I'm over it. [00:39:18] Speaker C: I mean, we had to make a whole section for it at one point that Ryan aptly named. [00:39:23] Speaker A: Yeah, I mean, no, there's no way that this feels realistic to me. [00:39:27] Speaker B: Like, this is the bullshit that Amazon's always done. They count region expansion. Like every region they expand to is a release. I'm like, no, it's not. It's just you making it more available. So yeah, you guys have, you know, 40 some odd regions or whatever and you've now deployed it. So every feature gets basically times 40. So I mean, if you divided that by 40, the number is probably more reasonable and then it would compare just fine to Azure and to GCP. But like this is like the most ridiculous slide they could possibly have put up on stage, in my opinion. Like, yes, Doctor Matt Wood, I'm sure this was very helpful in your performance review where you got your bonus, where you showed how much you've done. But this is ridiculous. [00:40:07] Speaker C: It's just the way Amazon always does stuff. Azure goes the opposite way and puts like 17 things in one post, which is linked to 16 other posts to find anything. And AWS does, oh, we added GPT-3.5, we added Mistral, we added Claude. Like everything is its own, multiplied by regions. It's just, it's just the difference in. [00:40:29] Speaker B: The way that we used to have a lightning round because we're like, we don't want to talk about all these things. Now we just don't talk about them at all because, like, you realize it's just stupid. But yeah, anyways, that made me mad.
So yes, so that now can tell you that the summit is going to be all AI, nothing but AI. And they're going to add to the 326 that they did. So first up, vector search for Amazon MemoryDB is now generally available. Vector search for MemoryDB is a new capability that allows you to store, index, retrieve and search vectors to develop real time machine learning and generative AI apps with in-memory performance and multi-AZ durability. With this launch, Amazon MemoryDB delivers the fastest vector search performance at the highest recall rates among popular vector databases on AWS. And you no longer have to make trade offs around throughput, recall and latency, which are traditionally in tension with other solutions. You can also now use one MemoryDB database to store your app data and millions of vectors with single digit millisecond query and update response times at the highest levels of recall. [00:41:24] Speaker A: I mean this sounds expensive. Yeah, but I think it's cool though, right? Vector search in general is just a new paradigm. I haven't quite got my head all the way around it yet, to be honest. [00:41:35] Speaker B: Like I understand some of it but I don't understand all of it for sure. [00:41:38] Speaker A: Yeah, I need to use it and then I, you know, cause that's the only way I learn, but, and then having that all in memory, I need. [00:41:47] Speaker C: Jonathan to explain it to me. [00:41:49] Speaker A: Yeah, Jonathan's the only way I know anything. [00:41:53] Speaker C: Like maybe the 6th time I'll figure it out, most likely the 7th. [00:41:56] Speaker A: Yeah, but yeah, having it in memory, it's just the performance of that, being able to return it as quick as possible and all the caching and availability you get at that layer. It's pretty neat, but I imagine this is not cheap. [00:42:13] Speaker B: Yeah, it sounds expensive. Amazon is releasing a new no code solution with a public preview of AWS
App Studio. App Studio is a generative AI powered service that uses natural language to create enterprise grade applications in minutes without requiring software development skills. It's as easy as creating a new app using the new generative AI assistant and building and launching it out there. May you live longer than Honeycode. [00:42:37] Speaker A: Yeah, well, I mean if Honeycode was just, you know, trying to build apps around Excel, and if their limitation of the AI is based around just AI powered Excel apps, then yeah, no, it just won't be great. I do like the idea of just describing. [00:42:57] Speaker C: What you want. Enterprise grade. It has to be better, guys. [00:43:02] Speaker A: Yeah, yeah. [00:43:03] Speaker B: I started to go down this path of like, hey, I should play with this, and I got to the page where you can sign up for the trial and preview and all that. Then like there's all kinds of like sign up for the admin groups and builder groups and like all these check boxes you have to acknowledge and check. And I was looking at that and I was like, hmm, I'm going to create a new Amazon account because this seems like something that's going to cost me a lot of money if I don't know how to turn it off properly. And I prefer to use a privacy credit card that does not live if I don't want it to. So it's like I'm going to, I'm going to create that somewhere else. I just haven't had a chance to do that yet. But it sounded scary and I was like, this is going to get expensive. [00:43:40] Speaker A: It also hearkens back to the days of Honeycode where you had to have your own Honeycode account and they were trying to build a whole community around it. [00:43:47] Speaker B: Yeah, I mean, like when you have three acknowledgments that you have to check just to get that created. I was like, yeah, lots of CloudFormation code. I don't know what it does. Like, just everything about it feels dirty and I don't want to find out the hard way that it cost me $3,000.
[00:44:04] Speaker A: And you just want to run an experiment, right? Like I don't need to. Now I'm going to end up with all this infrastructure and stuff that I didn't know I deployed, like, just to click some buttons. Exactly, and be like, yeah, this is still that. [00:44:19] Speaker B: Amazon Q. [00:44:20] Speaker C: No code. [00:44:21] Speaker B: It's no code. It's dumb though. [00:44:23] Speaker A: Now I know it's dumb. [00:44:25] Speaker B: Amazon Q Apps are now generally available with some new capabilities that were not available during the preview, such as an API for Amazon Q Apps and the ability to specify data sources at the individual card level. New features include specifying data sources at the card level so you can specify data sources for outputs to be generated from. And the Amazon Q Apps API allows you to now create and manage Q Apps programmatically with APIs for managing apps, the app library and app sessions. No one cares about that, so we're going to keep on going. The Amazon Q Developer customization capability is now generally available for your inline code completion and IDE needs. They're also launching a preview of customization for the chat. You can now customize Amazon Q to generate specific code recommendations from private code repositories in the IDE code editor and the chat. Amazon Q is an AI coding companion. Of course, it helps developers accelerate application development by offering code recommendations in their integrated development environment, derived from existing components and code. And the new chat capability allows you to ask the chatbot questions about the code in the project you currently have open in the IDE. My biggest question about this is, does it pay attention to my git commits? Because that bot might get real surly with me real quick. [00:45:32] Speaker A: You know, I think Doctor Matt Wood should probably, you know, think about his slide instead of doing bar charts.
Maybe, maybe do a little time based chart, because these are, you know, features that ChatGPT was announcing like 18 months ago, two years ago. [00:45:49] Speaker B: Yeah. [00:45:50] Speaker A: So it's like, yeah, this is why it feels dishonest. This is why that slide makes you angry. It's that you can, oh, you can ask Amazon Q questions in your IDE. [00:46:01] Speaker B: Well, I mean, maybe in Matt Wood's world, he only knows of Amazon. Maybe he doesn't know Azure and Google exist. They truly are company one and company two, fictitious companies. And so if you don't know what they're doing, you can't say you're copying them because you didn't know they were doing that. [00:46:18] Speaker A: Yeah, maybe Matt Wood is actually AI. [00:46:23] Speaker B: He's definitely been AI for a long time. All right, we're moving into more agent territory. So Agents for Bedrock now support memory retention and code interpretation. Retaining memory across multiple interactions allows you to retain a summary of the conversations with each user and be able to provide a smooth, adaptive experience, especially for complex multi step tasks such as user facing interactions and enterprise automation solutions like booking flights or processing insurance claims. This also supports code interpretation: agents can now dynamically generate and run code snippets within a secure sandbox environment and be able to address complex use cases such as data analysis, data visualization, text processing, solving equations and optimization problems. And this is Westworld territory, so cool. But we have a sandbox. That code can't get out of the sandbox. That's what CrowdStrike said, too. [00:47:13] Speaker C: It's amazing. Sometimes I look at what these companies are announcing, I'm like, holy hell. How did we get from just a simple LLM to where all this is going? And where it's all going?
It's just so fascinating to me, like, how much more there still is in this world that we can still do with the same technology and incremental improvements now on it. [00:47:37] Speaker B: I was reading this morning that one of the ways that ChatGPT is testing their LLMs now is to make them chat with each other. Like, okay, apparently this cannot end well. I'm like, please don't do this, this can't be good. [00:47:49] Speaker A: Yeah, yeah, yeah. At what point do they figure out that humans are the weak point in this link? [00:47:57] Speaker B: Why are they asking this dumb question they've answered a thousand times? Well, if you're worried about AI safety, there's Guardrails for Amazon Bedrock, which can now detect hallucinations and safeguard apps built using custom or third party foundational models. This can help prevent undesirable content, block prompt attacks, and remove sensitive information for privacy reasons. Guardrails for Amazon Bedrock provides additional customizable safeguards on top of the native protections offered by the foundational model, delivering the best safety features in the industry, which, they say, basically blocks as much as 85% more harmful content, allows customers to customize and apply safety, privacy and truthfulness protections within a single solution, and filters over 75% of hallucinated responses for RAG and summarization workloads. [00:48:40] Speaker C: The Cloud Pod hallucinates over CrowdStrike trying to destroy the world. [00:48:44] Speaker B: Mm hmm. [00:48:45] Speaker C: I feel like that should have been our show title. [00:48:47] Speaker B: Well, you weren't, uh, you weren't on top of the game when you. [00:48:50] Speaker C: Not on top of my game. [00:48:51] Speaker A: Yeah. I don't know why we do the show titles at the beginning. [00:48:54] Speaker C: Yeah, we really got to do both. Yes. [00:48:56] Speaker B: So by the time we get to the end of the show, we're tired and we don't want to do. [00:49:01] Speaker A: No, you're right.
I'm surprised at that, that Bedrock can do this. I mean, this feels like, I don't, I don't know, like, detecting hallucinations based off of the third party models, like, that seems crazy to me and just highlights how little I know about how these models work and how a platform like Bedrock operates. Because in my head, it's just, you ask the model a question, you get an answer back. And so with guardrails, clearly, there's more information being exchanged at a different level to detect and protect against hallucinations. So neat. [00:49:39] Speaker C: I wonder if over time they're going to have to build these guardrails into, like, Control Tower for your multi account strategies where different teams have different Bedrock implementations, so that, hey, I don't even know, your security, compliance, or AI governance committee can start to help build these guardrails for your organization. So different development teams don't, like, go completely left because of a hallucination. They put the guardrails in at the more global level, because I assume a lot of these are still either behind the scenes or by account or by deployment. [00:50:21] Speaker A: Yeah, hopefully it turns into one of those things where you're really just querying the Bedrock API for what's enabled for it, because then it'd be easy to roll into something like GuardDuty or something along those lines. [00:50:36] Speaker C: Or Config, where you can set up your own Lambda and check. [00:50:40] Speaker A: Well you're right. When you think about models and what they're set for and all those things, like trying to manage those as an organization, I think most security people have their fingers in their ears, because there's already so much to do and this is a completely new type of problem and there's nothing you can build upon in your existing stack to try to fix things at that level. And so like they're just barely getting it into the platform right now. So it's like, yeah, crazy.
[00:51:16] Speaker C: And then over time you're going to have, like, I'm waiting for different LLM models to have like different, oh, what's it called? Like, like Firefox's controls or like, oh, my brain's dying. [00:51:32] Speaker A: Oh, just like isolation and sandboxing and that kind of thing. [00:51:35] Speaker C: No, I was thinking like, you know, like the different open source frameworks have like different licensing associated with them. Yeah. Where like some companies won't use certain ones because of the way that they're written and, you know, the implications of it. So I'm kind of waiting over time for that to happen, and then you're going to have, you know, your compliance department, like you said, say only these specific models can be used because otherwise you're required to do X, Y and Z due to the licensing model. [00:52:02] Speaker A: Yep. [00:52:03] Speaker B: We got two more to get through and then we're done with summit. [00:52:08] Speaker A: Sorry, we'll stop talking. Yeah. [00:52:10] Speaker B: With Knowledge Bases for Bedrock, foundational models and agents can retrieve contextual information from your company's private data sources for RAG. RAG helps foundational models deliver more relevant, accurate and customized responses. Now you can connect, in addition to S3, your web domains, Confluence, Salesforce and SharePoint as data sources in your RAG applications, and hope that these applications don't leak all your data. [00:52:33] Speaker A: I always think about connecting all these data sources and how is that, basically, just inducing hallucination, in my mind. [00:52:41] Speaker C: And, you know, an area to exclude, like, a specific SharePoint site. [00:52:46] Speaker A: Exactly. [00:52:49] Speaker B: And the last one, SageMaker Studio can now simplify and accelerate the machine learning development lifecycle with Amazon
Q Developer. Amazon Q Developer in SageMaker Studio is a GenAI powered assistant built natively into the SageMaker JupyterLab experience. This uses natural language inputs and crafts a tailored execution plan for your machine learning development lifecycle by recommending the best tools for each task, providing step by step guidance, generating code to get started, and offering troubleshooting assistance when you encounter errors. [00:53:16] Speaker A: It's going to need to be one hell of an AI bot if it's going to get me to successfully run a Spark job. [00:53:25] Speaker B: I mean, if I can run a Spark job with AI, that'd be great. Jira added AI, and my JQL has gotten great. I ask questions all the time. I get queries. I'm like, these are awesome. [00:53:41] Speaker A: I suspect your JQL is actually getting worse, but your prompt engineering, your prompt engineering, excellent. [00:53:49] Speaker B: Actually, there's been some things about JQL I learned that I did not know, because I was like, how do I, like, I asked a question and, like, it gives me the query. I'm like, oh, I didn't know I could set that before. Yeah, yeah, I think it's actually helping. [00:54:00] Speaker A: I admit to the same thing. Like, there's a lot of functions in JQL I wasn't aware of. [00:54:07] Speaker B: All right, moving on to GCP. Google is announcing an automated Cloud SQL upgrade tool for major versions and Enterprise Plus, for customers of MySQL and PostgreSQL. The tool provides automated upgrade assessments, scripts to resolve issues, and in place major version upgrades as well as Enterprise Plus edition upgrades, all in one go. It's particularly useful for organizations that want to avoid the extended support fees associated with Cloud SQL extended support. So you start charging for extended support, and then you build a tool to help you move faster. It's an interesting order of operations, but sure.
Key features of this include automated pre upgrade assessments, where checks are curated based on recommendations available from MySQL and Postgres; detailed assessment reports; automated scripts to resolve issues; and in place major version and Enterprise Plus upgrades leveraging Cloud SQL's in place major version upgrade feature. That's a lot of upgrades in majors. [00:54:54] Speaker A: Yeah, especially since enterprises still won't upgrade to the modern version. [00:55:02] Speaker C: It's like they haven't actually talked to any customers on the extended support thing to realize that they're there for a reason. [00:55:09] Speaker A: Well, I don't know. I mean, the reality is that, you know, whenever they reach out, you know, I'm sure the enterprise is like, but you need to make a tool so that we can upgrade, and they make the tool, and then they still don't upgrade. It's like, but then we have to change all the other stuff. But I do think that these types of things are awesome just because it is really scary, especially if you're an SRE org who doesn't really fully understand the data structure, but you know you're going to have to implement these types of changes. So I like that pre upgrade assessment. But you know, always test, roll this through your software development lifecycle, because it's not foolproof. [00:55:49] Speaker B: I mean, you don't have to test. You can be a bajillion dollar company and just push channel updates out, man. [00:55:55] Speaker C: Yeah, don't worry about that. [00:55:56] Speaker A: Yeah, right now I'm feeling very conservative. [00:55:58] Speaker B: I wonder why. [00:56:00] Speaker C: I feel like I am doing a phenomenal job with the quad testing I do with any software code I write. I might actually compile it locally, which is better than some places.
[00:56:13] Speaker B: The compute flexible CUD has been expanded to cover Cloud Run on demand resources, most GKE Autopilot pods, and the premiums for Autopilot performance and accelerator compute classes. With one CUD purchase, you can now cover eligible spend on all three of those products. Since the new expanded compute flexible CUD has a higher discount than the previous GKE Autopilot CUD and greater overall flexibility, they will be retiring the GKE Autopilot CUD. So you should talk to your account team. If you are using GKE Autopilot CUDs, you might be able to get a one time translation fee depending on your contract. [00:56:45] Speaker C: I love when single things support multiple, so I don't have to think about it. It's like, how much money do you want? Divide by four. So I give you a little bit. So I recall, refresh as needed once a quarter, and here you go. Now I don't need to manage 16 different, like, things. [00:57:00] Speaker B: Yeah, I mean, I just, I wish the flexible CUDs, this should be automatic. Like, why would you even create a GKE Autopilot CUD to begin with? Why didn't you just make that part of your flexible CUDs to begin with? Like, I feel the CUD thing is an area where Google, like, saw savings plans and was like, that's a really cool idea. We should do that too. And then the implementation is a bit lacking. [00:57:22] Speaker C: I still just want savings plans. [00:57:24] Speaker A: Yeah, no, that's true. I would rather have flexibility there. [00:57:28] Speaker C: I just, like, want savings plans to support databases. Azure doesn't do it, and AWS doesn't do it. I don't understand why. [00:57:36] Speaker A: And it's such a stateful workload. It seems obvious that it should. So I don't know. The wording about most pods is funny to me. Like, I'm like, what pod would you not cover? [00:57:54] Speaker B: The AI ones, the ones that are using ephemeral, you know, like spot instances maybe, I guess.
[00:58:01] Speaker A: Yeah, that would make sense, I suppose. Why not? [00:58:05] Speaker C: Well, spot normally isn't covered by anything, or CUDs. [00:58:10] Speaker B: That's the only time I could think of that you wouldn't cover. Well, Ryan, I know for your aspiring infosec career, and all of us are now infosec as we troubleshoot CrowdStrike issues, Google has given us a SecOps masterclass. This is a six week, platform agnostic education program for modern SecOps. The course leverages the Autonomic Security Operations framework and continuous detection, continuous response methodology, and you can become an expert in as little as six weeks, available to you on Coursera. [00:58:40] Speaker A: And this will be, I'm sure the content won't be heavily geared towards Security Command Center and the enterprise operator. [00:58:47] Speaker B: They say it's not vendor specific, but. [00:58:51] Speaker A: Yeah, I'm just saying they've been hyping that platform for a while. I think they dumped a ton of money into it hoping for the Mandiant integration to really pay. [00:59:01] Speaker B: I mean, I'm sure they'll mention that right here in the blog post. Our platform can help reduce the complexity of SecOps and enhance the productivity of security operations centers, and features innovations such as frontline threat intelligence, Gemini investigative assistant, playbook assistant, and autonomous parsers, which are all part of our security tools. So I'm sure it'd be great. [00:59:20] Speaker A: I mean, that's it. I'm going to take the class. [00:59:23] Speaker C: I'm definitely not registering for it right now, and then I'll probably forget to actually do any part of it. [00:59:29] Speaker B: Well, that's par for the course, yeah. Dataplex Catalog, Google Cloud's next generation data asset inventory platform, provides a unified inventory for all your metadata, whether your sources are in Google Cloud or in your data center, and is generally available today.
Dataplex Catalog allows you to search and discover your data across the organization, understand its context to better assess its suitability for data consumption needs, enable data governance over your data assets, and further enrich it with additional business and technical metadata to capture the context and knowledge about your data realm. Benefits of the Dataplex Catalog: a wide range of metadata types, self-configuring the metadata structure for your custom resources, interacting with all metadata associated with an entry through a single atomic CRUD operation, and fetching multiple metadata annotations associated with search or list responses. And there are no charges for basic API operations of the Dataplex Catalog. [01:00:22] Speaker A: I really like that this is supporting both data on GCP and off GCP, because the reality is almost always that you have data in multiple places, and you're trying to catalog everything so that you have a place to search and understand where your data is, and its sensitivity and the metadata around it. If you have three different versions of that catalog, it's worse than just having one. So this is cool. [01:00:53] Speaker B: And our last Google story. Google has enhanced the ability to get HA and still meet your data residency requirements with Google Cloud Spanner. For those of you who know, you can get five nines of availability with Spanner, but that typically required you to be in a multi-region setup which included a third region outside of a geographic area. You could achieve 99.99% with only two cloud regions in the same country. So like Australia or Japan, where there's a couple of different regions, you get 99.99%. But now they're allowing you to get up to five nines of availability with the new Spanner dual-region configuration, available only in Australia, Germany, India and Japan. To solve this, it takes advantage of countries that have multiple regions in a single geography, including places like Delhi and Mumbai.
The whole goal here is that you're getting basically six Cloud Spanner nodes in two regions. That gives you the proper amount of availability that you would have had to get in the three regions, which would have been two, two and two, I believe. And that's how they're giving you that additional nine of availability. [01:01:52] Speaker C: I mean, with the data residency requirements, something like this is slowly going to be required more and more. It's interesting that, you know, before you only got four nines with it, but also at one point, compliance always wins. And keeping the data in the correct country, it keeps you out of, you know, compliance hell, you know, it's kind of important. So, you know, it's nice to be able to get that extra nine. Is this something I would go out of my way to probably do? Pretty probably. What's the difference between four nines and five nines? [01:02:27] Speaker B: It's like minutes. Only the craziest people need that kind of availability, that they would care between these two options, in my opinion. [01:02:37] Speaker C: So 52 minutes a year versus five minutes a year. [01:02:40] Speaker A: Yeah. [01:02:42] Speaker C: And I rounded there. Go to six nines, you go down to 31 seconds a year. [01:02:48] Speaker B: Yeah. I don't know anybody who's offering six nines. I mean, durability, Amazon does, but not on availability. [01:02:53] Speaker A: Yeah. [01:02:54] Speaker C: Is it like eleven nines of durability? [01:02:56] Speaker B: I think so. [01:02:57] Speaker C: Which at this point they probably need increases. It's probably more than losing one file a year, probably. [01:03:03] Speaker B: Well, Azure has woken up, but it's. [01:03:05] Speaker C: AI and they crashed a region. Don't worry about that. [01:03:11] Speaker B: First up, OpenAI's fastest model ever, GPT-4o mini, is now available on Azure AI.
GPT-4o mini now allows customers to deliver stunning results at lower costs and blazing speed. It's smarter than GPT-3.5 Turbo, scoring 82% on Measuring Massive Multitask Language Understanding (MMLU) compared to 70% for GPT-3.5, and it's 60% cheaper than the GPT-3.5 model. It delivers an expanded 128K context window and integrates the improved multilingual capabilities of GPT-4o. GPT-4o mini, announced by OpenAI at the same time, is available simultaneously on Azure AI, supporting text processing capabilities with excellent speed, with image, audio and video coming later. All I can think about this at this point is you had GPT-3.5 Turbo and now you have GPT-4o mini, and then you had GPT-3.5, then you had GPT-4o. Can you just pick a naming convention already and just stick with it? Because I can't keep track. [01:04:08] Speaker C: Grab the turbos in there. [01:04:09] Speaker B: Yeah, I like the turbo. The turbos, I mean, didn't really make sense as the smaller, faster one. I guess that's maybe turbo-ish, but not as capable. I don't know, it's sort of a. [01:04:19] Speaker C: So unlike AWS, with Azure, be careful that this is not available in many of the regions, which definitely makes Azure a little bit harder at times, but you get fewer press announcements because of it, and you. [01:04:34] Speaker B: Won't get, as they announce some new regions, they won't tell you about it every day. Yeah, and they did give us also new storage capabilities. So not all bad news from Azure. This week they're announcing the latest advancements to Premium SSD v2 and Ultra Disks, the next generation of Azure disk storage. First up, they now support incremental snapshots of Pv2 and Ultra Disk, which are reliable and cost-effective point-in-time backups of your disks that store only the changes made since the last snapshot.
Azure-native fully managed backup and recovery for Pv2 and Ultra Disks allows you to protect your VM with a single click, as well as, in preview, support for Azure Site Recovery to automate all of your Pv2 recovery needs. It also now allows application-consistent VM restore points for Pv2 and Ultra Disk, and third-party app support for backup and DR, in case you're using things like Veeam or Rubrik to do your backups. That's now available to you, as well as encryption at host for Pv2 and Ultra Disk, and trusted launch support rounds out the new features and capabilities. [01:05:31] Speaker C: A lot of these just feel like good quality-of-life improvements that they really need to get out there, like the incremental snapshot support for Pv2. Also, Ultra Disks go to decently large sizes, so you probably don't really need to be snapping the whole drive if you're just handling little bits and pieces of change. [01:05:51] Speaker B: I like this idea of encryption at host. Is that pretty common in Azure these days? Because normally when you have EBS encryption, that's still encryption on the storage server side, but encryption at host means that it's being encrypted by the actual server itself. [01:06:06] Speaker C: I don't know enough about that, so I will have to do a little homework and research more about that. Yes, don't grade me next week. I'll probably forget. [01:06:17] Speaker B: I will probably forget too. And I have some Oracle stories for you as well. First of all, Elon Musk has gotten impatient with Oracle and has now said they're no longer going to get the LLM business for his startup, xAI. Elon has pivoted and decided to build his own AI data center. Because who doesn't love to burn money? Musk explained, when your fate depends on being the fastest, by far, we must have our own hands on the steering wheel. Apparently things stalled over Musk's demand that the facility be built faster than Oracle thought possible.
Only a month ago, Ellison trumpeted xAI as one of several big and successful companies choosing Oracle Cloud, and Musk clarified that they're using Oracle for a smaller AI use case at the moment, but it will not be their big AI data center. [01:07:00] Speaker A: Yeah, my money is totally on the fact that, I bet you it's going to take them longer to get this set up than whatever date they're looking at for Oracle, because, you know, everyone's pulling from the same pool of resources to build AI models, so it's difficult to get inventory. Precisely. And then all of the work to set up and sort of maintain your. [01:07:27] Speaker B: Own. [01:07:29] Speaker A: If you've got, you know, a well-oiled project. But I mean, even OCI has got to be a little bit more refined than most people's management of cloud infrastructure. [01:07:38] Speaker B: I mean, Elon's the guy who said during the CrowdStrike outage that he's ripping it off of his Linux boxes as we speak, or as I tweet. Yeah, so, I mean, like, this guy has no patience for anything. So, you know, Oracle telling him it's going to take 45 days is probably 44 days too long, right? [01:07:52] Speaker A: Like, they're, no, it's chaos. I'm glad I don't work for any of his companies, just because it seems like there's always some sort of announcement about something that they're doing right this second. I'm like, if you or someone else in that company wanted to get something done, you'd be constantly just getting your plans thrown away. [01:08:11] Speaker B: Yeah, I definitely would not want to have been at Twitter and then had that get bought by him, and what he's done to that. I would have found another job. That's about it. Done.
So, and Oracle is announcing one other thing today, that is the Exadata Exascale, which I like to call ex-expensive: an intelligent data architecture for the cloud that provides extreme performance for all Oracle Database workloads, including AI, vector processing, analytics and transactions at any scale. Exadata Database Service on Exascale Infrastructure is the most flexible database environment we have ever worked with, said Luis Madeiro, director of cloud and data solutions at Quistor. Its ability to scale efficiently will allow us to move all workloads to high-performance environments with minimal migration time. Because it leverages Exadata technology, we also have confidence in our data resiliency and security, something that has proven difficult to achieve in other environments. In addition, Exascale's scalability will enable us to grow resources quickly and with minimal costs as our business expands. No, it's not minimal cost, but sure. [01:09:08] Speaker A: Okay. [01:09:09] Speaker B: Several benefits of Exadata Exascale include elastic pay-per-use resources. With Exascale, resources are completely elastic and pay-per-use with no extra charges for IOPS. Users only specify the number of database server ECPUs and storage capacity they need, and every database is spread across pooled storage servers for high performance and availability, eliminating the need to provision dedicated database and storage servers. This reduces the cost of entry of the infrastructure for Exadata Database Service by up to 95% and enables flexible and granular online scaling of resources. The intelligent storage cloud: with Exascale, Oracle delivers the world's only RDMA-capable storage cloud. I don't want to talk about that more, so we're going to move on to the next one: intelligent AI.
Of course, Exascale uses AI Smart Scan, a unique way to offload data- and compute-intensive AI vector search operations to the Exascale intelligent storage cloud. Intelligent OLTP, which again, I don't care about that. And intelligent analytics: unique data intelligence automatically offloads intensive SQL queries to the intelligent storage cloud. Again, magic. And database-aware intelligent clones, because the clones will solve all problems. Users can instantly create full copies or thin clones using the Exascale intelligent storage cloud and its redirect-on-write technology. And all of that is very expensive, no matter how cheap it might appear on anything. [01:10:23] Speaker C: I think that somebody needs to re-listen to that and see how many times you had to say "exa" in it. And just out of curiosity, like, over. [01:10:29] Speaker B: Under 75, I don't know, it was a lot, but. [01:10:36] Speaker A: I mean, like a lot of Oracle announcements, this sounds, you know, not exactly as far-fetched as some of them, but this is a lot all in one announcement, and typically when you get to the details of it, it's a little bit less available than they say, you know, so we'll see. Like I'm, you know, or we won't actually, because I will never play with it. Never mind. [01:11:12] Speaker C: Just looking at it, I think, would be more expensive than my credit card bill. I was willing to. [01:11:16] Speaker A: Yeah, my credit limit. Yeah, exactly. Just log into the console and max it out. [01:11:21] Speaker C: It's like, we see that you only can spend $100,000. Above that. You are too low of a customer for us. [01:11:27] Speaker B: They definitely won't take your business. Well, that's it. That was a lot of news for the last two weeks. Plus a CrowdStrike global outage of epic proportions. And there was a president who said he's not running for re-election. So, I mean, end of the RNC, there was a lot going on last week.
I'm tired and I'm still dealing with the jet lag from India, so it's been good. But you guys did a great job on the show while I was gone. [01:11:54] Speaker A: Thank you. [01:11:54] Speaker B: I listened to it when I got back and checked it out. I thought you did well. So thank you for all, for getting that done. I don't have to chastise you guys as much anymore. You guys have delivered multiple times. [01:12:05] Speaker A: No, we've been shamed. [01:12:10] Speaker B: Anyways. All right, guys, well, I'll talk to you next week here at the Cloud Pod. [01:12:14] Speaker C: All right, see ya. [01:12:15] Speaker A: Bye, everybody. [01:12:19] Speaker B: And that is the week in cloud. Check out our website, the home of the Cloud Pod, where you can join our newsletter, Slack team, send feedback, or ask [email protected]. or tweet us with the hashtag #theCloudPod.
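
The four-nines versus five-nines numbers quoted around 01:02:37 can be checked with a few lines of arithmetic. This is only a rough sketch: the function names are made up for illustration, and it assumes a 365-day year and that "N nines" means an availability of 1 - 10^-N.

```python
# Hypothetical helpers (not from any cloud SDK). Assumes a 365-day year
# and that "N nines" means availability = 1 - 10**-N.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def downtime_seconds_per_year(nines: int) -> float:
    """Maximum yearly downtime permitted by an availability of N nines."""
    return SECONDS_PER_YEAR / (10 ** nines)


def describe(nines: int) -> str:
    """Render the downtime budget in minutes, or seconds when under a minute."""
    seconds = downtime_seconds_per_year(nines)
    if seconds >= 60:
        return f"{nines} nines: about {seconds / 60:.1f} minutes per year"
    return f"{nines} nines: about {seconds:.1f} seconds per year"


if __name__ == "__main__":
    for n in (4, 5, 6):
        print(describe(n))
```

Under those assumptions the results land at roughly 52.6 minutes, 5.3 minutes and 31.5 seconds per year, which lines up with the rounded "52 minutes, five minutes, 31 seconds" figures from the conversation.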

