274: The Cloud Pod is Still Not Open Source

Episode 274 September 11, 2024 01:08:02

Show Notes

Welcome to episode 274 of The Cloud Pod, where the forecast is always cloudy! Justin, Ryan, and Matthew are your hosts this week as we explore the world of Snapshots, Maia, Open Source, and VMware – just to name a few of the topics. And stay tuned for an installment of our continuing Cloud Journey Series, where we explore ways to decrease tech debt – all this week on The Cloud Pod.

Titles we almost went with this week:

A big thanks to this week’s sponsor:

We’re sponsorless! Want to get your brand, company, or service in front of a very enthusiastic group of cloud news seekers? You’ve come to the right place! Send us an email or hit us up on our Slack channel for more info.

General News

00:32 Elasticsearch is Open Source, Again

03:03 Ryan – “I have a hard time thinking that this has nothing to do with performance. And, you know, there was quite the reputation hit when they changed the license before. Since you can do OpenSearch now, which is truly open source, I imagine there’s a lot of people that are sort of adopting that instead.”

AI Is Going Great – Or How ML Makes All Its Money

06:28 Nvidia H100 now available on DigitalOcean Kubernetes (EA) 

06:51 Ryan – “I wonder how many people, because of the capacity constraints, are actually having to utilize multiple clouds for this. It’s kind of crazy if you think about, you know, people using capacity across DigitalOcean, GCP, Azure, and AWS trying to get model training done, but it’s possible.”
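
If you do end up chasing GPUs across clouds, the scheduling side is at least portable: on DOKS (or GKE, EKS, AKS) you request the nvidia.com/gpu resource and let the scheduler find a GPU node. Here’s a minimal sketch using the official Kubernetes Python client; the node selector label, container image, and train.py entrypoint are hypothetical placeholders for your own cluster:

    # Minimal sketch: request a single GPU for a training pod on a managed
    # Kubernetes cluster. The nvidia.com/gpu resource name is the standard
    # NVIDIA device plugin convention; the node selector label is hypothetical.
    from kubernetes import client, config

    config.load_kube_config()  # uses your current kubeconfig context

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="h100-train"),
        spec=client.V1PodSpec(
            restart_policy="Never",
            node_selector={"gpu-node-pool": "h100"},  # hypothetical label
            containers=[
                client.V1Container(
                    name="trainer",
                    image="nvcr.io/nvidia/pytorch:24.01-py3",  # example image
                    command=["python", "train.py"],            # hypothetical entrypoint
                    resources=client.V1ResourceRequirements(
                        limits={"nvidia.com/gpu": "1"}  # one H100; DO offers 1 or 8
                    ),
                )
            ],
        ),
    )

    client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)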

AWS

08:06 How AWS powered Prime Day 2024 for record-breaking sales

13:47 Matthew – “I would love to be at a company where I’m running something at this scale. I feel like, you know, they’re like, cool, come have us do it. But the amount of companies that run stuff at this insane scale is going to be in the single digits.”

16:48 Announcing AWS Parallel Computing Service to run HPC workloads at virtually any scale & AWS Parallel Computing Service is Now Generally Available, Designed to Accelerate Scientific Discovery

18:31 Exclusive: Inside the mind of AWS CEO Matt Garman and how he aims to shape the future of cloud and AI   

21:15 Justin – “Well, I feel like we’re reaching the point where AI has already been shoved in at the low hanging fruit for things. We were like, cool. You know, EBS is AI. Cool. That doesn’t really help me, and I don’t really care about it. I feel like now you’re starting to hit those higher level services. You’ve done the building blocks, and now hopefully they can start to piece things together to be useful AI, versus just everyone raising their hands and saying, I have AI in things, you know. And I think that’s what’s going to be interesting: will they bring AI to those higher level services the same way they’ve done with S3 & EC2?”

23:55 Amazon EC2 status checks now support the reachability health of attached EBS volumes 

24:37 Justin – “And this one’s like, I get it. It’s nice that this is there. It seems straightforward that you’d want to know that your EBS volume is attached. But really the reason why people typically don’t like an EBS volume is because of its performance, not because of its attachment status. So they do their own set of custom checks typically on the EBS volume to make sure it’s getting the expected IO throughput, which I do not believe is part of this particular status check.”
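
For anyone who wants to poke at this, the new check surfaces through the same DescribeInstanceStatus API as the existing instance and system checks. A rough boto3 sketch; the AttachedEbsStatus field name follows AWS’s announcement wording, so verify it against your SDK version:

    # Rough sketch: read all three EC2 status checks, including the new
    # attached EBS reachability check. "AttachedEbsStatus" is taken from
    # the announcement; confirm the exact field in your boto3 version.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    resp = ec2.describe_instance_status(IncludeAllInstances=True)
    for s in resp["InstanceStatuses"]:
        print(
            s["InstanceId"],
            "instance:", s["InstanceStatus"]["Status"],
            "system:", s["SystemStatus"]["Status"],
            # .get() because older SDKs won't return the EBS check at all
            "ebs:", s.get("AttachedEbsStatus", {}).get("Status", "n/a"),
        )

As the quote says, none of this covers throughput; an I/O performance canary or the EBS CloudWatch metrics would still be on you.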

29:16 Organizational Units in AWS Control Tower can now contain up to 1,000 accounts 

30:07 Justin – “Every time I see things that support this number of accounts, I’m like, okay, it’s great. Despite what everybody wants to say, there is a base cost for an AWS account by the time you implement CloudTrail and GuardDuty and Config and all those, and you have to enable some of those services here. And I’m like, okay, the base costs of just running those are going to be a lot. But then again, if you have a thousand accounts, you probably don’t care about a couple hundred dollars.”
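
At a thousand accounts per OU you’re presumably scripting Control Tower rather than clicking through it. A hedged sketch of enabling a control on an OU with boto3’s controltower client; both ARNs are hypothetical placeholders:

    # Sketch: enable a Control Tower control (guardrail) on an OU via the
    # EnableControl API. Both ARNs below are hypothetical placeholders.
    import boto3

    ct = boto3.client("controltower", region_name="us-east-1")

    resp = ct.enable_control(
        controlIdentifier="arn:aws:controltower:us-east-1::control/AWS-GR_ENCRYPTED_VOLUMES",
        targetIdentifier="arn:aws:organizations::111111111111:ou/o-example/ou-abcd-11111111",
    )

    # EnableControl is asynchronous; poll the returned operation.
    status = ct.get_control_operation(
        operationIdentifier=resp["operationIdentifier"]
    )["controlOperation"]["status"]
    print("operation status:", status)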

GCP

31:33 Get started with the new generally available features of Gemini in BigQuery 

32:44 Ryan – “Yeah, I mean, that’s the cool thing about BigQuery and Gemini is that they’ve just built it right into the console.”
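
The Gemini SQL generation itself lives in the BigQuery Studio console, but whatever it writes runs like any other query. A small sketch with the google-cloud-bigquery client, using a churn-style query of the sort the announcement describes; the dataset, table, and column names are hypothetical:

    # Small sketch: run a query of the kind Gemini in BigQuery might
    # generate from a natural-language prompt. Dataset, table, and column
    # names below are hypothetical stand-ins for your own schema.
    from google.cloud import bigquery

    client = bigquery.Client()  # uses application-default credentials

    sql = """
        SELECT customer_segment,
               COUNTIF(churned) / COUNT(*) AS churn_rate
        FROM `my_project.analytics.customer_churn`
        GROUP BY customer_segment
        ORDER BY churn_rate DESC
    """

    for row in client.query(sql).result():
        print(row.customer_segment, f"{row.churn_rate:.1%}")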

34:07 New in Gemini: Custom Gems & improved image generation with Imagen 3 

35:00 Matthew – “Yeah, it’s kind of cool. I was wondering if I could get all of those pre-made gems at the same time. Like, I’m going to do a brainstorming session with a career coach and the coding partner and the brainstormer. And then the career guide’s like, you should really think about getting a new job. I like to use SQL Server on Kubernetes, and it’s like, yeah, I think you should update your resume. That’s what that should be.”

39:11 Instant snapshots: protect Compute Engine workloads from errors and corruption

40:22 Justin – “Ryan, I’d like you to get this set up on all of our operating system drives for CrowdStrike as soon as possible.”
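
A sketch of the create-and-restore flow with the google-cloud-compute library; the client and field names here assume the generated surface for the instantSnapshots API, and the project, zone, and disk names are hypothetical, so check the library docs before leaning on it:

    # Sketch: create an instant snapshot of a zonal disk, then restore it
    # by creating a new disk from it. Client and field names assume the
    # generated google-cloud-compute surface; verify against the docs.
    from google.cloud import compute_v1

    PROJECT, ZONE = "my-project", "us-central1-a"  # hypothetical

    snaps = compute_v1.InstantSnapshotsClient()
    snaps.insert(
        project=PROJECT,
        zone=ZONE,
        instant_snapshot_resource=compute_v1.InstantSnapshot(
            name="pre-upgrade-checkpoint",
            source_disk=f"zones/{ZONE}/disks/os-disk",  # hypothetical disk
        ),
    ).result()  # blocks until the (fast) operation completes

    # Restoring is just creating a new disk from the instant snapshot;
    # the source_instant_snapshot field name is assumed here as well.
    disks = compute_v1.DisksClient()
    disks.insert(
        project=PROJECT,
        zone=ZONE,
        disk_resource=compute_v1.Disk(
            name="os-disk-restored",
            source_instant_snapshot=f"zones/{ZONE}/instantSnapshots/pre-upgrade-checkpoint",
        ),
    ).result()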

44:29 Google Cloud launches Memorystore for Valkey, a 100% open-source key-value service  

45:12 Justin – “I haven’t heard much about Valkey since they forked. I assume people are adopting it, but I didn’t hear much about OpenTofu for quite a while. Then everyone started talking about OpenTofu, so I assume it’s one of those things that happens as the cloud providers get support for it. I do think Valkey was already supported on AWS ElastiCache, and I think Microsoft was supporting it earlier as well. So I think Google is kind of late to the party on supporting Valkey, but we’ll see.”
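
Part of why adoption is quiet is that there’s almost nothing to migrate: Valkey kept the Redis wire protocol, so existing Redis clients work unchanged. A minimal sketch with redis-py against a Memorystore for Valkey endpoint; the private IP is a hypothetical placeholder:

    # Minimal sketch: talk to a Memorystore for Valkey instance with a
    # standard Redis client; Valkey kept wire compatibility with Redis.
    # The endpoint below is a hypothetical private-IP placeholder.
    import redis

    r = redis.Redis(host="10.0.0.5", port=6379, decode_responses=True)

    r.set("greeting", "hello from valkey", ex=60)  # 60-second TTL
    print(r.get("greeting"))
    # Valkey builds report their own version in INFO; .get() hedges the key name.
    print(r.info("server").get("valkey_version", "n/a"))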

45:46 A radically simpler way to plan and manage block storage performance 

46:18 Justin – “I mean, it’s just basically taking a pool of IOPS and you’re allocating it to different disks dynamically through ML or AI, similar to what you’re doing for the capacity of your disk. It makes it nice, I appreciate it. I don’t know that I use it, but I like that it’s there.”
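
The arithmetic behind that pooling is the whole pitch: instead of provisioning every disk for its own worst-case peak, you provision the pool for a combined peak, which is usually far smaller because disks rarely spike at the same time. A toy illustration with made-up numbers:

    # Toy illustration (made-up numbers): why pooled IOPS provisioning
    # beats per-disk provisioning when disks don't peak simultaneously.
    disks = {
        "db-primary": {"avg_iops": 8_000, "peak_iops": 30_000},
        "db-replica": {"avg_iops": 5_000, "peak_iops": 25_000},
        "batch-etl":  {"avg_iops": 2_000, "peak_iops": 20_000},
    }

    # Per-disk model: every disk must be provisioned for its own peak.
    per_disk = sum(d["peak_iops"] for d in disks.values())

    # Pool model: provision for the largest single peak riding on top of
    # everyone else's average (a back-of-envelope sizing heuristic).
    pool = max(d["peak_iops"] for d in disks.values()) + sum(
        d["avg_iops"] for d in disks.values()
    ) - max(d["avg_iops"] for d in disks.values())

    print(f"per-disk provisioning: {per_disk:,} IOPS")  # 75,000
    print(f"pooled provisioning:   {pool:,} IOPS")      # 37,000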

Azure

47:07 Inside Maia 100: Revolutionizing AI Workloads with Microsoft’s Custom AI Accelerator

49:05 Ryan – “I’m just not sure whether or not I’m just too far gone into the managed services part, where I don’t really want this level of detail anymore. Like, just do the thing; I’m paying you to do the thing, and all the ‘this type of processor with this type of chip,’ you know, these types of things are irrelevant. But also, maybe in that space, if you’re deep in it, you need that performance. It’s really hard to say.”

50:29 Introducing Simplified Subscription Limits for SQL Database and Synapse Analytics Dedicated SQL Pool

51:23 Matthew – “They went from one metric, which was their original metric of a weird combination of memory and CPU and maximum storage allocation, to the newer one, which is supposed to simplify it.”

54:21 Check out what’s new in Azure VMware Solution

55:04 Matthew – “Can I translate this? How to burn all your capital and piss off your CFO in 15 minutes or less.”

Cloud Journey Series

55:52 4 ways to pay down tech debt by ruthlessly removing stuff from your architecture 

58:39 Justin – “I was thinking about how this is a great pitch for Google, because I don’t think I could do this on AWS, because all the data storage is separate for every product because of their isolation model. Where on GCP I can do these things, because they have one data layer.”

1:00:08 Ryan – “I’m sort of like… the trick of this is the replacing it, right? This is still identification of tech debt. I actually don’t know if that’s really the problem to be solved. I think the problem is like, how do you prioritize and change these? And I thought that, you know, the article, it sort of references offhand, but you know, the reality is you have to be constantly making changes.”

1:03:23 Matthew – “Yeah; most of the time you don’t need that extra performance that you’re squeezing out of it, but you’re adding complexity – and honestly, that’s most likely the cause of many underlying outages, whether you want to believe it or not.”

1:04:54 Ryan – “ It’s a common fallacy that you want to develop everything as a microservice so that you can manage them and update them separately. But really, if you only have a single customer of that API or your microservice, it shouldn’t be separate. And so it’s really about understanding the contracts and ins and outs and who needs to use the service.”

Closing

And that is the week in the cloud! Visit our website, the home of The Cloud Pod, where you can join our newsletter and Slack team, send feedback, or ask questions at theCloudPod.net, or tweet at us with the hashtag #theCloudPod.


Episode Transcript

[00:00:07] Speaker A: Welcome to The Cloud Pod, where the forecast is always cloudy. We talk weekly about all things AWS, GCP, and Azure. We are your hosts, Justin, Jonathan, Ryan, and Matthew. [00:00:18] Speaker B: Episode 274, recorded for the week of September 3, 2024: The Cloud Pod is still not open source. Good evening, Ryan and Matt. How are you guys doing? [00:00:28] Speaker A: Doing good. [00:00:29] Speaker C: Doing well. [00:00:30] Speaker B: This is a short week, which means that this Tuesday is actually a Monday, which makes me mad because I had to do show notes on a holiday. But it's okay, I got them done, and we're here to record once again. All right, well, let's get into some general news. Shay Banon, who's one of the founders of Elasticsearch and Kibana, is pleased to blog that Elasticsearch and Kibana are now open source again. He says everyone at Elastic is ecstatic to be open source again, and it's part of his and Elastic's DNA. They're going to be open source by adding the AGPL as another license option next to ELv2 and SSPL in the coming weeks. They never stopped believing or behaving like an open source company after they changed the license, per the blog post. But being able to use the term open source by using AGPL, an OSI-approved license, removes any question or FUD people might have used against them. Shay says that the change three years ago was because they had issues with AWS and the market confusion their offering was causing. So after trying all the other options, changing their license, knowing it would result in a fork with a different name, was the only path they could take. While it was painful, he said it worked. Three years later, Amazon has fully invested in their OpenSearch fork, the market confusion has mostly gone, and their partnership with AWS is stronger than ever, even being named partner of the year with AWS. They want to make life for their users as simple as possible, so if you're okay with the ELv2 or the SSPL, then you can keep using that license. They aren't removing anything, just giving you another option with the AGPL. As he calls out, there are trolls and people who will pick at this announcement, so he tried to address those in advance. First up: "changing the license was a mistake, and Elastic is now backtracking from it." They say they did not backtrack from it; they're giving this as an additional option. They aren't living in the past; they want to build a better future for their users. Next: "AGPL is not a true open source license; license X is." They like to point out that AGPL is an OSI-approved license and is widely adopted; MongoDB, for example, uses the AGPL, as does Grafana. Now, the AGPL is a copyleft license, if I recall, which requires you to return your source code upstream if you're taking advantage of the code. And the last one: "Elastic changed its license because they are not doing well." And Shay says, "I will start by saying that I'm excited today, as ever, about the future of Elastic. I am tremendously proud of our products and our teams' execution," blah, blah, blah, blah, shipped a bunch of features, customers don't have any idea what they're doing, yada, yada, yada. And we'll wait for earnings, I guess, to see how they're really doing. But that's it. Elasticsearch is now open source again. And for those who don't have video of this, that was said with quotation marks.
[00:03:05] Speaker A: Yeah, I have a hard time thinking that this has nothing to do with performance. You know, there was quite the reputation hit when they changed the license before. And since you can do OpenSearch now, which is truly open source, I imagine there's a lot of people that are sort of adopting that instead. I know there were a lot, a lot of looks that way at my day job. [00:03:27] Speaker C: It's also a lot of the features that are in OpenSearch that, you know, to me, it's a lot of that base-level security stuff that you have to pay Elasticsearch the premium for, which is just in the base OpenSearch feature set. So while I think it's great that they are moving back to open source, they still need to open source the free features that... sorry, the paid features that OpenSearch has built in for free. Otherwise, I don't think people are going to start swinging back the other way just because, well, there's a company behind this one versus AWS is sort of involved in running the... [00:04:06] Speaker A: Foundation for it. Especially the security-related features. Right. Like, that was the big no-no in my mind for Elasticsearch; encryption and single sign-on and all that were paid. [00:04:18] Speaker C: Have you met Azure? [00:04:22] Speaker B: Interestingly enough, they actually did just announce earnings five days ago; I was looking it up while you guys were chattering. So they had a disappointing Q1 FY25 and have lowered full-year guidance, with weakness pinned on a sales reorganization. Of course, that was on 8/29. Their stock closed at $103.64 a share, and then on the 30th, the day after, they dropped to $75 a share. Yeah. So I'm going to say that market conditions are not great in the macro climate for Elasticsearch. And if not being open source anymore was one of the reasons why they were getting some pressure, this does feel like maybe a reaction to market pressure they were getting, even if they say it isn't. [00:05:03] Speaker A: Yeah. [00:05:04] Speaker C: I mean, the real question is, are people actually going to pivot back, or are new people to the market getting a look at them now versus before, when they weren't? [00:05:13] Speaker B: I don't know. I stopped buying Elasticsearch after my last company and all the scars that Ryan and Jonathan and I got, and why we hate Elasticsearch, so I try not to pay attention to them too much these days. [00:05:25] Speaker A: He even scowls at me when I tell him about my older open source cluster that I can't upgrade because... well, I guess now I can upgrade. [00:05:33] Speaker B: So we'll see. Just move it to OpenSearch. [00:05:36] Speaker A: I could have done that too. [00:05:38] Speaker B: Yeah, that's what you should. [00:05:39] Speaker C: He's just a masochist. Someone likes pain, but we knew that by now. [00:05:42] Speaker A: No, no, no. I'm using Elasticsearch the way it's intended to be used, which is for specific applications where I control both the producer and the consumer. [00:05:51] Speaker C: Nice. I feel like that's cheating. [00:05:54] Speaker B: I was thinking about doing something with APM for a side project, and I was like, I could use New Relic or I could use Datadog. Elastic has an APM. And then I kicked myself and I said, why would I even attempt to use OpenSearch for this? I woke myself up pretty quickly, but I almost regressed for a second. It was bad. All right, let's move on to "AI is how ML makes money," and DigitalOcean is hoping to make lots of money with their new Nvidia
H100 GPUs, now available on DigitalOcean Kubernetes, which is called DOKS. Early access customers have a choice of one H100 chip or eight H100 chips. Sorry, you can't have four. One or eight. That's all you get. H100 nodes are, of course, in high demand for building and training all of your AI workloads, so this is a great alternative option to other cloud providers like AWS, GCP, and Azure, who don't have the capacity for you anyways, most likely. [00:06:48] Speaker A: I wonder how many people, because of the capacity constraints, are actually having to utilize multiple clouds for this. It's kind of crazy if you think about people using capacity across DigitalOcean, GCP, Azure, and AWS trying to get model training done, but it's possible. [00:07:11] Speaker B: Yeah. [00:07:13] Speaker C: Sounds like a use case for multi-cloud. No, it's for the GPUs. That's it. [00:07:20] Speaker B: It's the killer use case of multi-cloud. GPUs. [00:07:22] Speaker C: Yes. [00:07:23] Speaker B: Who knew? [00:07:25] Speaker C: That's how multi-cloud is going to become a thing. We are going to containerize our workload and make it work on Kubernetes cross-cloud, across every cloud, so you can consume GPUs on every single cloud. Oracle, DigitalOcean, AWS, Azure, GCP. Good luck. [00:07:42] Speaker A: I mean, that's been the story for like ten years, and there's still a lot of people who haven't done that. [00:07:50] Speaker B: All right, well, I'm gonna take you back to July, guys. It's been a few weeks now since this blog post was announced, but this is about Amazon Prime Day. Typically we talk about all the things that happened on Amazon Prime Day and all the amazing scaling they do, and I was trying to save it for a week when all three of us plus Jonathan were on the show. But it's been a little hectic this summer, with summer vacations. And so because it's getting a little long in the tooth, I felt like we should probably talk about it before it becomes so old that no one really cares. So Amazon Prime Day, of course, was July 17 through 18th. I think this article came out right at the beginning of August, if I recall correctly. And this was not written by Jeff, but it was supported by Jeff and one of the other authors. And so Amazon services such as Rufus and search use AWS artificial intelligence chips under the hood, and Amazon deployed a cluster of over 80,000 Inferentia and Trainium chips for Prime Day. And they used over 250,000 AWS Graviton chips to power more than 5,800 distinct Amazon.com services, double that of 2023. I guess that's why I couldn't get any GPUs in July. Jesus. EBS used 264 petabytes of storage, 62% more than a year before, with 5.6 trillion read/write operations. And they transferred 444 petabytes of data during the event, an 81% increase over the prior year. I see Ryan shaking his head in disappointment. [00:09:09] Speaker A: Oh no. I mean, sorry, I was waiting for you to sort of get through it, but I realized that there's no way. Yeah, no, I'm... you know, it's crazy. Deploying a cluster of 80,000 Inferentia and Trainium chips for Prime Day doesn't make sense unless you're constantly retraining models. [00:09:28] Speaker B: Well, Inferentia is to respond to AI queries, correct? Yeah. [00:09:33] Speaker A: So I mean, that makes sense. Yeah. [00:09:36] Speaker B: So people coming in searching for a product, asking questions to the AI about the product, you know, all the different ways they've embedded AI into the Amazon store. Yeah.
[00:09:44] Speaker A: You see it in the app; like, there's definitely ways to ask AI questions about the thing you're looking at. It's just... that's crazy. It's a huge number, as usual. [00:09:55] Speaker B: Yeah, well, there's big numbers here too, so I can give you some more. [00:09:58] Speaker A: Sure. [00:09:59] Speaker B: Aurora had 6,311 database instances running PostgreSQL and MySQL compatible editions, processing over 376 billion transactions, storing 2.9 petabytes of data, and transferring 913 terabytes of data. That's a lot of database queries. [00:10:14] Speaker A: That's a lot of queries. [00:10:15] Speaker C: I'm actually impressed that the transfer is smaller than the actual raw storage. [00:10:22] Speaker B: You're sending data back and forth from the database, and read is typically heavier than write on a database, so it makes some sense to me. [00:10:29] Speaker C: Well, yeah, but they're storing three petabytes, approximately, and they only transferred under one. Yeah, I just would have expected it to be a little bit more, just due to reading it and everything else they need to do with it. [00:10:44] Speaker B: Maybe they used DynamoDB to solve some of these problems. DynamoDB powered many things, including Alexa, Amazon.com sites, and Amazon fulfillment centers. Over the course of Prime Day, they made tens of trillions of calls to the DynamoDB API, and DynamoDB maintained high availability while delivering single-digit millisecond responses and a peak of 146 million requests per second. [00:11:08] Speaker C: Wow, that's a lot of requests per second. [00:11:11] Speaker B: It is. [00:11:13] Speaker A: I love NoSQL for a lot of that stuff, like single-table calls. It's so easy. [00:11:20] Speaker B: ElastiCache served more than a quadrillion requests in a single day, with a peak of over 1 trillion requests per minute. That's maybe your database. [00:11:29] Speaker A: There you go. [00:11:30] Speaker C: Yeah, they did a really good job of caching. Got it. [00:11:34] Speaker A: Yeah. [00:11:35] Speaker B: Yep. QuickSight dashboards saw 107,000 unique hits with 1,300 unique visitors and delivered over 1.6 million queries. SageMaker processed 145 billion inference requests. SES sent 30% more email than the prior year. I notice they didn't tell us how many emails that is, because it's probably in kapajillions. [00:11:55] Speaker A: The amount of Prime Day emails I got was a ridiculous amount. [00:11:59] Speaker C: Yeah, yeah. I was just saying, my wife and I definitely ordered things independently, and we got quite a few too. [00:12:06] Speaker A: The QuickSight one is strange. Like, it's not in the product, I don't think. [00:12:11] Speaker B: Especially for Amazon fulfillment centers, Amazon management, people watching how the sales are going. Those are the executive dashboards. [00:12:19] Speaker C: Yeah. It's only 1,300 unique visitors, so it definitely is subsetted out in this way. [00:12:26] Speaker B: Yeah. GuardDuty monitored nearly 6 trillion log events per hour, a 31.9% increase. CloudTrail got 976 billion events, and CloudFront had a peak load of over 500 million HTTP requests per minute, for a total of over 1.3 trillion HTTP requests during Prime Day, 30% more than the year before. Rigorous preparation, of course, is the key to these kinds of numbers.
And for example, they used 733 AWS Fault Injection Service experiments, which were run to test resilience and ensure Amazon.com remained highly available through the whole event. With the rebranded AWS Countdown support program, your organization can handle those types of events as well, using tried and true methods. I did not know they renamed the event service you get with premium support to AWS Countdown, which is a much better name. Yes. [00:13:10] Speaker A: Yeah. I can't even remember what the service used... [00:13:13] Speaker B: To be called, but it was something terrible. [00:13:14] Speaker A: Yeah, it was terrible. So bad I just flushed it from my memory. [00:13:20] Speaker B: Infrastructure Event Management, that's what these are called. [00:13:22] Speaker C: That's right. [00:13:24] Speaker A: That would be why I didn't remember. [00:13:26] Speaker C: I would love to be at a company where I'm running something at this scale. I feel like, you know, they're like, cool, come have us do it. But the amount of companies that run stuff at this insane scale is going to be in the single digits. [00:13:38] Speaker B: I feel like you're talking about FAANG companies, mostly. [00:13:42] Speaker C: Yeah. Like, okay, you're in a single-digit number of companies here, guys. [00:13:46] Speaker A: Yeah, I'm excited about those 733, you know, fault experiments. That's pretty rad. [00:13:52] Speaker B: Yeah, it's pretty awesome. [00:13:53] Speaker C: I still love that service, you know. Like, I've only played with it once and never really used it heavily, but it's really great to simulate, and I love the concept behind it. I just wish I had more exposure to actually leveraging it. [00:14:06] Speaker A: Yeah, I want to use it to, you know, certify disaster recovery, but trying to get someone who has to audit the disaster recovery process to understand it is a little tricky. [00:14:20] Speaker C: I think that's more for HA, though. [00:14:23] Speaker B: Yeah, it's more HA testing than DR. [00:14:25] Speaker C: Yeah, than DR. [00:14:26] Speaker A: No, but that's how I do DR. It's just my HA; DR is dumb. [00:14:33] Speaker C: I mean, you still have to check that box, but you still have to... [00:14:38] Speaker B: Fail the traffic from one zone to the other for the DR test. [00:14:40] Speaker A: If I can destroy an entire region and it still stays up, that's good DR, as far as I'm concerned. [00:14:46] Speaker B: I mean, you could do that on GCP, but then they wouldn't have capacity for you. [00:14:49] Speaker C: So... Microsoft just took down an entire region the same day that CrowdStrike took down the entire world, about 6 hours before. But you know, it's not impossible to take down a region. Just use us-east-1. [00:15:05] Speaker A: I've definitely had regions go out, but that's what I mean. Like, I've never had a DR event, because I usually have active-active services. [00:15:12] Speaker C: Yeah, yeah. [00:15:15] Speaker B: Well, AWS is announcing AWS Parallel Computing Service, or AWS PCS, a new managed service that helps customers set up and manage HPC clusters so they can seamlessly run their simulations at virtually any scale on AWS. Using the Slurm scheduler, you can work in a familiar HPC environment, accelerating your time to results instead of worrying about your infrastructure. This is a managed service version of an open source tool that they provided way back in November 2018 with the original launch of AWS ParallelCluster.
This open source tool allowed you to build and deploy POCs and production HPC environments, and you could take advantage of a CLI, API, and Python libraries, but you were responsible for the updates, as well as tearing down and redeploying your cluster. The managed service makes everything available via the AWS Management Console, AWS SDKs, and the AWS CLI. Your system administrators can create a managed Slurm cluster that uses the compute and storage configs they previously used. Identity and job allocation services are also usable. A couple quotes here, one from Ian Colle, director of advanced compute and simulation at AWS: "Developing a cure for catastrophic disease, designing novel materials, advancing renewable energy, and revolutionizing transportation are problems that we just can't afford to have waiting in a queue. Managing HPC workloads, particularly the most complex and challenging extreme-scale workloads, is extraordinarily difficult. Our aim is that every scientist and engineer using AWS Parallel Computing Service, regardless of organization size, is the most productive person in their field, because they have the same top-tier HPC capabilities as large enterprises to solve the world's toughest challenges anytime they need to, and at scale." And Travis Hartman, director of weather and climate at Maxar Intelligence, had this to say: "As a longtime user of AWS's HPC services, we are excited to test the service-driven approach from AWS Parallel Computing Service. We found great potential for AWS Parallel Computing Service to bring better cluster visibility, compute provisioning, and service integration to Maxar Intelligence's WeatherDesk platform, which would enable the team to make their time-sensitive HPC clusters more resilient and easier to manage." [00:17:07] Speaker C: Big servers, and a lot of them running at once. [00:17:11] Speaker A: It's funny, because in my head I'm always trying to figure out... I haven't done a lot of this in many years, but I have by other means: state machines, step functions, parallel execution of Lambdas. But it is funny, the managed service aspect. I hope this scales to zero and you're just maintaining a cluster configuration with... [00:17:38] Speaker B: A Slurm scheduler. You can scale it to zero, which is one of the benefits of it, even though it has a dumb name. [00:17:44] Speaker A: Well, yeah, but are you still paying for it? [00:17:47] Speaker B: Yeah, I mean, I assume you might be paying for some of the configurations that you're storing, but I assume the compute that you're not using, you're not paying for, just like any other Amazon service. Yeah, I mean, I don't have a use case for it, but if I did, I'd be super happy. [00:18:00] Speaker A: Yeah. [00:18:03] Speaker B: SiliconANGLE's John Furrier got an exclusive interview with new AWS CEO Matt Garman on how he plans to shape the future of cloud and AI. Garman was a key architect of Amazon's EC2 compute services. Now, as the new CEO, he faces leading AWS into the future. And this is a future dominated, apparently, by generative AI. On generative AI, Garman says that their job at AWS is to help customers and companies take advantage of AI with a secure, reliable, and performant platform that allows them to innovate in ways they never imagined before. Garman says he sees AI as a transformative force that could redefine AWS's trajectory, and asserts that they never obsess about their competitors. Instead, they only obsess about their customers.
Well, as a customer, I don't really care about AI that much, and I'd like you to give me other cool cloud things besides AI. But apparently that's not the case at the moment, as they're continuing to focus on the future, not dwelling on the past, which apparently has those pesky EC2 instances. Garman in the interview stressed the importance of inference, which is leveraging the knowledge of the AI to generate insights or perform tasks, as the true killer app of generative AI. He says all the money and effort that people are spending on building these large training models doesn't make sense if there isn't a huge amount of inference on the backend to build interesting things. He sees inference not just as a function, but as an integral building block that will be embedded in every app; that, he says, is where the real value is. And Garman believes generative AI could unlock new dimensions for AWS, enabling it to maintain its dominance while expanding into new areas of growth. Garman views developers as the lifeblood of AWS, even though he's killing them with AI, and AWS as not just a cloud provider but an enabler of innovation on all levels, from the small startup to the largest enterprise. If you're doing AI things, Garman isn't just investing in silicon with Trainium and Inferentia chips, but in the whole ecosystem, by betting on open, scalable technologies. And he points to their investments in Ethernet networking, for example, which has allowed them to outperform traditional InfiniBand networks in terms of scalability and reliability. Garman apparently is confident that what wins in AI and cloud innovation for AWS is not just the best technology, but a partnership that is focused on helping customers succeed. Please stop talking about AI. [00:20:02] Speaker A: Yeah, I don't think that's gonna ever happen. [00:20:05] Speaker B: No, I sadly don't think so. I do feel like we're sort of nearing the trough of disillusionment. Wall Street is starting to ask questions about returns on investment in AI. Like, Nvidia's stock is a little bit sketchy right now because they're the only ones making money on AI, and can that continue on forever? So I do feel like we're at, or maybe near the end of, the super hype cycle, but I still think it's going to be a big thing for many years to come. [00:20:30] Speaker A: Yeah, there's too many things that you can inject AI into in a way that any app can take advantage of it in some shape or form. You know, it's going to be something we hear about forever, but yeah, it would be nice if everything wasn't an AI announcement. [00:20:49] Speaker C: Well, I feel like we're reaching the point where AI has already been shoved in at the low-hanging fruit for things, where you're like, cool, you know, EBS is AI. Cool. That doesn't really help me, and I don't really care about it. And I feel like now you're starting to hit those higher-level services. You've done the building blocks, and now hopefully they can start to piece things together to be useful AI, versus just everyone raising their hands and saying, I have AI in things, you know. And I think that's what's going to be interesting: is it one of those higher-level services, the same way they've done with S3 and EC2? We have these base things here, now we have the building blocks, now we're able to build those higher-level services. You're hoping that that's where we're going now, which is less "we've checked the AI box on a thing that we now support" and more "okay, there's a cool feature that we can actually do something interesting with." Maybe I'm wrong. [00:21:46] Speaker A: I mean, I don't think there's going to be a user interface that doesn't have AI on it. Yeah, I agree we're going to default to it on the stuff, the low-hanging fruit, but it's going to be just ubiquitous. [00:22:01] Speaker C: It's, you know, hey, there is an app, there's a thing, you know, it's a default checkbox at this point: okay, you are building a new service, it by default has AI in it. [00:22:11] Speaker A: Oh, for sure. [00:22:12] Speaker C: Everything. [00:22:12] Speaker B: Like, I mean, if that were true... I mean, they don't even add encryption by default on Amazon services. So I don't mind if you add AI as part of the day one features, but can you add support for CloudTrail, encryption, some other key things that would be nice to have day one? It still confuses me. [00:22:30] Speaker C: I really feel like that needs to be a day zero feature. Yeah, really a day negative 30 feature. But that's just me. [00:22:37] Speaker B: Like, how do you even ship without supporting CloudTrail? I'm like, it makes zero sense to me. Yeah. I'm actually now very intrigued about re:Invent this year, and what Garman will have to say about AI, apparently, and Inferentia and generative AI, because, I mean, we're not that far away from re:Invent. [00:22:58] Speaker A: Yeah, but I mean, it's gotta be more of the same, I think, right? That's what's in this article. [00:23:05] Speaker B: Yeah, that's my fear, which I think is really a disservice to their customers, because I think there are so many other cool things they could be building right now that have AI attached to them. That's fine. I don't care if you sprinkle a little AI into it, but give me managed services and give me things that I want. Amazon EC2 status checks now support the reachability health of your attached EBS volumes. This allows your instance to be queried to see if it can make and complete I/O operations. Use the new status check to quickly detect attachment issues or volume impairments that may impact the scaling of your apps running on EC2. You can further integrate these status checks with auto scaling groups to monitor the health of your EC2 instances and replace impacted instances to ensure high availability and reliability of your apps. Attached EBS status checks can be used along with the instance status and system status checks to monitor the health of your instances. And this one's like, I get it. It's nice that this is there. It seems straightforward that you'd want to know that your EBS volume is attached. But really, the reason why people typically don't like an EBS volume is because of its performance, not because of its attachment status. So they do their own set of custom checks, typically, on the EBS volume to make sure it's getting the expected IO throughput, which I do not believe is part of this particular status check. [00:24:20] Speaker C: I think the idea of this is cool, like you've said, but I don't think I can think of a real use case for a lot of this. Most of the time, at that point, you're doing a custom health check because you want to make sure that a service or something very specific is running; the assumption is Amazon has at least made the EBS volume available. Like, that's why I feel like this is.
We've checked the box AI thing that we now support and, okay, there's a cool feature that we can actually do something interesting with. Maybe I'm wrong. [00:21:46] Speaker A: I mean, I don't think there's going to be a user interface that doesn't have AI on it. Yeah, it's like, it's, yeah, we're, I mean, yeah, I agree we're going to default service on the, on the stuff, the low hanging fruit, but it's going to be just ubiquitous. [00:22:01] Speaker C: It's, you know, hey, there is an app, there's a thing, you know, it's a default checkbox at this point of, okay, you are building a new service. It by default has AI in it. [00:22:11] Speaker A: Oh, for sure. [00:22:12] Speaker C: Everything. [00:22:12] Speaker B: Like, I mean, if that was true, I mean, they don't even add encryption by default on Amazon services. So I don't mind if you add AI as part of the day one feature, but can you add support for cloudtrail encryption, some other key things that would be nice to have day one still confuses me. [00:22:30] Speaker C: I really feel like that needs to be a day zero feature. Yeah, really a day negative 30 feature. But that's just me. [00:22:37] Speaker B: Like, how do you even ship without being supporting cloudtrail? I'm like, it makes zero sense to me. Yeah. I'm actually now very intrigued about reinvent this year. And what does Garvin have to say about AI, apparently and inferentia and generative AI? Because, I mean, we're not that far away from re invent. [00:22:58] Speaker A: Yeah, but I mean, it's gotta be more of the same, I think. Right. Is what, what's in this article. [00:23:05] Speaker B: Yeah, that's my fear. Which I think is really disservice to their customers because I think there's so many other cool things they could be building right now that have AI attached to them. That's fine. I don't care if you sprinkle a little AI into it, but give me managed services and give me things that I want. Amazon easy to status checks now support the reachability health of your attached ebs. Volume. This allows your instance to be queried. See if it can make and complete I o operations. Use the new status check to quickly detect attachment issues or volume impairments that may impact the scaling of your apps running on EC two. You can further integrate these task checks with auto scaling groups to monitor the health of your EC two instance and replace impacted instances to ensure high availability and reliability of your apps. Attach EBS SAS checks to be used along with the instance status and system status checks to monitor the health of your instances. And this one's like, I get it. It's nice that this is there. Seems straightforward that you'd want to know that your EBS volume attached. But really, the reason why people typically don't like an EBS volume is because of its performance, not because of its attachment status. They do their own set of custom checks, typically on the EBS volume, to make sure it's getting the expected IO throughput, which I do not believe is part of this particular status check. [00:24:20] Speaker C: I think the idea of this is cool, like you've said, but I don't think I ever can think of a real use case for a lot of this. Most time at that point, you're doing a custom health check because you want to make sure that a service or something very specific is running not. The assumption is Amazon has at least made the EBS volume available. Like, that's why I feel like this is. 
[00:24:44] Speaker A: Yeah, I mean, it is, it's, you know, it's the same option as being able to configure a liveliness of health check as a target. And so, yeah, typically you're going to have a custom one that's going to actually look at several different parameters of if your app is functional, but if you didn't need that, you just want. I think this is a nice sort of easy button. So use each c, two health checks and enable the EBS volume. [00:25:13] Speaker B: I don't think it's bad. I'm going to use it. I just, you know, it's not typically what I'm worried about with EBS volumes, which is typically more around is the EBS volume performing to my specification what I'm expecting? [00:25:25] Speaker A: I mean, this would be really good for like an AZ failure, right? Like that's, that type of thing is where I think it would have a lot of value because you probably have, if you have a custom health check, your app is probably going to fail if it can't read GBS, but for other reasons, like logging or something. [00:25:44] Speaker C: Does this mean after however many years I've been on AWS, which I don't want to try to do that math right now because that terrifies me a little bit, that there's no longer two health checks per node. It's the networking and the vm. I mean, they've actually added a third. [00:25:58] Speaker B: They've added a third. They've added a third. [00:25:59] Speaker C: It's both terrifying and interesting. That was like, the. There were so many questions on those AWS exams that were like, there are two health checks. You know, one has failed. You know, things along those lines that now all are no longer valid. [00:26:15] Speaker A: And when you have a real world scenario where only one has failed, you're like, what the. [00:26:19] Speaker B: Right. It's always a fun scenario. You're like, what? How is one failing and one. [00:26:24] Speaker C: How does that even happen? Need that a few times. It's normally the networking one that fails. [00:26:29] Speaker B: I learned. I don't learn of a new one this weekend, an auto scaling group. In the. In the auto scaling monitoring of the nodes in the target group, it can tell you that this node is abnormal, but then it doesn't tell you why it's abnormal, other than it has deviations in its metrics. You can set up a routing policy that basically has it not route as much traffic to a server that's reporting abnormal or potentially even replace it if it's abnormal. But I'm sort of like, it'd be nice if you told me why it's abnormal, because then I at least have something to go look at to go figure out why I have a problem. But, yeah, it was a new one I learned this week. [00:27:07] Speaker A: I mean, I don't know, if it kills it, the abnormality goes away. I don't care. [00:27:10] Speaker B: The fan. I don't care either, but it doesn't. It can still serve traffic to a server that's saying it's abnormal. Unless you're in a certain weighted target group routing, then it'll potentially route out of that normal. If it says a certain number of transactions are abnormal, it'll fail it. But, yeah, it's a little strange. [00:27:28] Speaker A: I mean, how many times have I. You know, when you look at the cluster and you're looking at it and one's outside, you're just like, kill that one. You know, and then see if your issues go away. A lot. [00:27:40] Speaker B: A lot. [00:27:40] Speaker A: Yeah. [00:27:41] Speaker C: Definitely didn't tell my team to do that. 
The other day on my day job, when we got it alert, I was like, yeah, this one, just kill it. It's fine. [00:27:48] Speaker B: Oh, no, I think it's good practice. I do it all the time. Like, I'm having a problem with these servers. That server has been replaced in three weeks, dead and destroyed. [00:27:56] Speaker C: It's actually one of my favorite features. Is like there's like the EC two instance refresh cycle or whatever it's called. That like is like said to be more than seven days and just replace it. Like keep it on a short timeline, just replace them often. Make your life easy. And like Azure doesn't have, I don't know if Google has a corresponding feature or not, but that feature I would love in life. [00:28:18] Speaker A: Yeah, I mean, I use that for deployments. Do a node base. It was the best way to do it. You could do a b or green, you could do a phase rollout. It was pretty awesome. [00:28:31] Speaker B: That's pretty great. Well, if you are an organization that has over 1000 accounts, you can now implement governance, best practices and standardized configurations across the account in your OU at a greater scale in your control tower. When registering an OU or enabling AWS control tower baseline on an OU member, accounts receive best practice configurations, controls, baseline resources such as IM roles, cloudtrail at config and Identity center. Previously you can only register ous with 300 or less accounts. So this is a three x increase over what they used to be able to do. And I have supported 400, 500 accounts in AWS at one time, and that was a lot. 1000 in multiple ous. Sounds like a freak ton that I dont want to deal with. [00:29:14] Speaker A: I mean, it depends on what management, you know, operations. If you have a pretty sophisticated configuration in control tower, and it just does everything you need, then sweet. [00:29:25] Speaker B: As long as it does it. That's the key thing. [00:29:27] Speaker A: Yeah, yeah. [00:29:28] Speaker C: Every time I see things that support this number of accounts, I'm like, okay, despite what everybody wants to say, the base cost for, there is a base cost for an AWS account. By the time you implemented cloudtrail and guard duty and config and all those, and you have to enable some of those services here. And I'm like, okay, the base costs of just writing those are going to be a lot. But then again, if you have a thousand accounts, you probably don't care about the $300. [00:29:55] Speaker A: Yeah, yeah, I mean, but there's also, you know, those are configurable per ou, so you're not bound to a single config across all thousands. [00:30:04] Speaker C: Config is required. There are some of these services that require for control tower tower. It runs itself. [00:30:11] Speaker A: Oh really? You have to have config on? [00:30:13] Speaker C: Yes, I believe config has to be turned on because that's actually how it automates, fixing itself and setting itself up. [00:30:19] Speaker A: I did not know that like config. [00:30:20] Speaker C: Is one of them in cloud trail. I don't know cloud trails required, but I know definitely no config is, I. [00:30:28] Speaker B: Mean, cloud trail, not, I mean, cloud trail is the best practice. You should have. Cloud trail. [00:30:30] Speaker C: Yeah, well, you could do the organizational cloud trail. You don't necessarily need it at the per account level. [00:30:38] Speaker B: All right, let's move on to GCP. 
They've announced several new BigQuery Gemini features this week, including SQL code generation and explanation with Gemini, Python code generation, data canvases, data insights, and partitioning and clustering recommendations, all driven by Gemini in BigQuery. Data insights starts with data discovery and assessing which insights you can get from your data assets. Imagine having a library of insightful questions tailored specifically to your data that you didn't even know you should have asked, which is pretty much everything I've learned about AI: oh, I didn't think to ask that question. Data insights eliminates the guesswork with pre-validated, ready-to-run queries offering immediate insights to you. For instance, if you are working with a table containing customer churn data, data insights might prompt you to explore the factors contributing to churn within specific customer segments. Assumes that that data is in the table, but minor details. Gemini for BigQuery now helps you write and modify SQL or Python code using straightforward, natural language prompts, referencing relevant schemas and metadata. And this helps reduce errors and inconsistencies in your code while empowering users to craft complex, accurate queries, even if they have limited coding experience, which, with BigQuery, since I'm still learning it, I appreciate. [00:31:41] Speaker A: Yeah, yeah. I do not like writing SQL queries at all. And so asking in plain language? Awesome. [00:31:51] Speaker B: I mean, if I could just get Amazon Athena to add AI to help me write Athena queries, I'd be really happy, because that's where I get stuck every time. Like, the syntax breaks me for some reason I don't quite understand. [00:32:04] Speaker A: That's the cool thing about BigQuery and Gemini, is that they just built it right into the console. [00:32:10] Speaker B: Yep. Well, I mean, Amazon could do that. [00:32:11] Speaker A: With Q. Amazon could. Oh, Q. Okay, sure. I dare you. I dare you to. [00:32:19] Speaker B: But I mean, it is kind of nice, because even when I was at the FinOps conference a couple months ago, I was talking to a couple of vendors, and they were showing me this terrible YAML or JSON configuration file. And I'm like, oh, so you mean the FinOps guy has to write all that? And he's like, well, yeah, or you can just use this AI feature we have built in, and you just tell it what you want it to do, and we generate it for you automatically in the correct schema. I'm like, this is amazing. You went from "this is a no" to "this is a yes" that easily. [00:32:49] Speaker A: Yeah, as long as you can export that YAML config so that you can version control it. [00:32:55] Speaker C: Unlikely. [00:32:55] Speaker A: That is one of those things where, using AI, it's like, well, do I have to ask the same prompts in the same order in the same words? [00:33:02] Speaker C: How's this going to work? And the odds are you'll get different answers. [00:33:06] Speaker B: Yeah, most likely. Google is rolling out Gems, first previewed at Google I/O. Gems is a new feature that lets you customize Gemini to create your own personal AI expert on any topic you want. These are now available for Gemini Advanced, Business, and Enterprise users, and their new image generation model, Imagen 3, will be rolling out across Gemini, Gemini Advanced, Business, and Enterprise in the coming days.
Gems allow you to create a team of experts to help you think through a challenging project, brainstorm ideas for an upcoming event, or write the perfect caption for a social media post. Some of the pre-made Gems available for you include the Learning Coach, the Brainstormer, the Career Guide, the Writing Editor, and the Coding Partner. And Imagen 3 sets a new high watermark for image quality, but with built-in safeguards and adherence to product design principles, because they don't want to go woke again. Across a wide range of benchmarks, Imagen 3 performs favorably compared to other image generation models available. I've yet to have a chance to go test and see if it makes an African American George Washington or not, but I assume they fixed that basic issue. [00:34:04] Speaker A: I'd be kind of curious to use that, the whole team of experts thing, because that's kind of crazy. [00:34:12] Speaker B: That's kind of cool. I was wondering if I could get all of those pre-made Gems at the same time. Like, I'm going to do a brainstorming session with the career coach and the coding partner and the Brainstormer, and then the career guide's like, you should really think about getting a new job. I like to use SQL Server on Kubernetes, and it's like, yeah, I think you should update your resume. That's what that should be. [00:34:32] Speaker A: Just stop. [00:34:34] Speaker C: The pre-made Gems I think are one of the coolest features here, where it's like, I think that it gets people in the door to start to play with it, which is nice. [00:34:42] Speaker B: Yeah, I'm curious if this is going to roll into Gemini, just gemini.google.com, so that I could try these things out too, because conceptually I get it. It's kind of nice. And you can ask Gemini to brainstorm with you or to help you with your career or whatever, so making this available to you in a different way is kind of cool, like a chat-type service. [00:35:01] Speaker A: Yeah, I'll be trying this out, because I want to see if I can do project planning or, like, you know, far-out forecasting with something like this. [00:35:13] Speaker B: It will also disagree with your estimates, Ryan. It's okay. [00:35:16] Speaker A: No, it probably will back me up for once. Unlike someone. [00:35:22] Speaker C: Whatever you think it is, right, just add a zero to it and then you're probably accurate. [00:35:27] Speaker A: No, it's the other way around. I go through all the details and then Justin's like, eh, half of it. [00:35:33] Speaker C: Well, that's because he's trying to rationalize it, where you are trying to estimate what the reality of the cost is. See, different life goals here. [00:35:42] Speaker B: Ryan wants the project to be successful. I want the project to live. If you give a five-year time horizon, the project never gets funded. But if you want the project to get funded, it's like, okay, how do you get me something valuable in a year? And then we can ship that, and then we can build more cool things onto the thing. It's like, no, no, it's five years or nothing. I'm like, yeah, okay, so it's gonna take us a year and a half, and everyone's like, cool, we'll fund it. I'm like, thanks, appreciate it. And Ryan's like, what do you mean it's a year and a half? It's a fun game we play. [00:36:12] Speaker A: I've gotten a lot better with understanding it, you know, delivering iterative value over time, as long as you have a five-year plan, but then you fund it little by little.
[00:36:23] Speaker B: Yeah, exactly. Getting there. [00:36:25] Speaker A: I'm getting there. [00:36:25] Speaker B: You're getting there. It's called phases or I still am. [00:36:29] Speaker C: Very much like a MVP person than the iterative, like, what is the minimum available product to get this out the door, which is like phase one, and then iterate over that because otherwise you get bogged down in detail. [00:36:41] Speaker B: Yeah, see, Ryan sees that as MVP is all it's going to get done and then we're never going to invest in it, and then it's just going to be tech debt sitting there forever as an MVP. That's Ryan's fear. I mean, he came from Yahoo. [00:36:50] Speaker C: He's not wrong. I'm just saying. [00:36:52] Speaker A: How many times have you seen that? A lot. [00:36:54] Speaker B: Yeah. [00:36:55] Speaker C: I mean, I'm pretty sure I've seen that many many upon many times that I might have caused some of that at my day job. I get where he's coming from and the fear is reality because, you know, him and I both have to maintain these things. For Justin moves on to the next. [00:37:12] Speaker B: Shiny object of, I mean, Ryan doesn't let me forget. I mean, to be fair, it's years of grief. I have to hear about it. [00:37:19] Speaker A: But it's true. Like all the way back. [00:37:22] Speaker B: Remember that thing you did to me back at that job five times ago? Like, I'm so bitter about that. Talked to a guy there yesterday. Still broken. [00:37:32] Speaker C: Sometimes they wonder the services I've stood up for customers, like a couple of years ago or like at this point, like five, six years ago, like if they still run. I know at one point it was an old customer I worked on. It was like four years later and I wrote a really simple python script that just figured EBS backups well before there was any managed service or anything along the line. So it was like five years later and they still were using it. I was like, how is this still running? [00:37:58] Speaker B: Because Amazon never deprecates anything, so it'll still run Amazon or Python two X, no problem. Yeah, but what we see would be real fun is if you added your personal email address to it, so that way every time it ran, it would email you so you could track its lifecycle. [00:38:11] Speaker C: No, I'm good. I don't think I want to know that information. [00:38:14] Speaker B: Nice. All right, let's move on to Google's new Instant snapshots for compute engine, which provides near instantaneous, high frequency point in time checkpoints of a disk that can be rapidly restored as needed. Instant snapshots have a RPO of seconds and RTO in the tens of seconds. Google Cloud is the only hyperscaler to provide high performance check pointing that allows you to recover in seconds. Common use cases for this feature include enable rapid recovery from user error, application software failures, and file system corruptions backup verification workflows, such as database workloads that create periodic snapshots and immediately restore them to run data consistency checks, taking restore points before an application upgrade to enable rollback in the event that the maintenance goes terribly, terribly wrong, which will allow you to improve your developer activity, verify the state before your backup, or increase backup frequencies. Some additional benefits over traditional snapshots include in place backups at the zonal and regional disk level. 
They're fast, they fast-restore, and they're convertible to backup or archive, which basically means you're replicating them out to a second point of presence for long-term geo-redundant storage. And Ryan, I'd like you to get this set up on all of our operating system drives for CrowdStrike as soon as possible. [00:39:18] Speaker A: No kidding? Yeah, that was my first question when reading through this. It's like, okay, I would use this in a second. Can you coordinate with the file system, you know, like in the case of Windows, so that you lock everything down and take the backup? Or is that even needed with this technology? Maybe they're doing sort of those sidecar EBS-style snapshots, and so it doesn't really need that. Kind of interesting to see. [00:39:46] Speaker B: They couldn't be using the sidecar, because that wouldn't be instantaneous. If they use the sidecar, that requires you to write it out to another volume, and then it does a thing, and it takes forever. So I doubt that it is ACID compliant, let's put it that way. [00:40:02] Speaker C: Is this like the fast restore feature that AWS has? [00:40:08] Speaker B: So they talk about other cloud providers having this, but theirs is even faster than the fast restore process, they say, because they'll be able to be performant within tens of seconds. They did compare it to AWS at some point, where they mentioned, not by name, that other cloud providers have this, but it takes blah, blah, blah time. I don't know where it's at in the article; I can't find it. [00:40:31] Speaker C: I mean, while I appreciate it, the only case I could really think of where the speed of it is extremely useful is a failed maintenance window. A lot of the others are verification. Like, what do I care if it takes five minutes versus 30 seconds to verify my backup at that point? Even if, especially if, it's an automated process, what do I care? [00:40:52] Speaker B: I could obviously see it for test boxes where I'm doing development. Like, hey, I'm going to push a code build I'm not sure about, I'm going to snapshot it, then when my code blows up the operating system, I can just revert. Sorry. There are definitely some good use cases where being able to do it quickly and efficiently, and then being able to recover quickly, are nice. It's interesting, they don't mention Windows in here. They don't mention anything about minutes. They don't mention SQL or anything like that. There are definitely some limitations, I think, on this one that you probably need to read the fine print on, which is... [00:41:24] Speaker A: Why I'm making this face, because I'm trying to get rid of it. [00:41:29] Speaker C: With CrowdStrike, it wouldn't have helped you, because they would just push the update again to you. [00:41:33] Speaker B: Well, for the hour, I could just keep restoring for the hour. And then when they told me, hey, we've rolled it back, I'd be like, okay, now do a restore. I mean, from an operating system perspective, the only thing that is really sensitive, typically, if you are segmenting your C drive properly... which, this is the time for a quick diatribe: when you have a Windows server and you like to install IIS or other applications on it, you should always put the other applications and IIS on the D drive.
All right. Google Cloud is launching Memorystore for Valkey, a preview of Valkey 7.2 support. Memorystore for Valkey joins Memorystore for Redis Cluster and Memorystore for Redis. And it does raise the question: does Redis go back to open source too, now that there's Valkey 7.2? [00:43:43] Speaker A: Yeah, we'll see if that license type, which I forget the name of already, GPL? AGPL? [00:43:54] Speaker B: If it. [00:43:54] Speaker A: Is really working out for Mongo and Grafana and now Elastic. Maybe we'll find out. [00:44:05] Speaker B: I haven't heard much about Valkey since they forked. I assume people are adopting it. I didn't hear much about OpenTofu for quite a while, then everyone started talking about OpenTofu, so I assume it's one of those things that picks up as the cloud providers get support for it. I do think Valkey is already supported on AWS ElastiCache, and I think Microsoft was supporting it earlier as well. I think Google is late to the party on supporting Valkey, but we'll see. [00:44:28] Speaker A: I'm actually surprised to see managed service offerings this fast. [00:44:34] Speaker B: And then the last story for Google this week. Earlier this year, Google announced the general availability of Hyperdisk Storage Pools with Advanced Capacity, which help you simplify management and lower the total cost of ownership of your block storage capacity. Today they're bringing the same innovation to block storage performance with Hyperdisk Storage Pools with Advanced Performance. You now provision IOPS and throughput in aggregate, and the storage pool dynamically allocates them as your apps read and write data, allowing you to increase resource utilization and radically simplify performance planning and management. [00:45:03] Speaker C: Right, at the Hyperdisk level. [00:45:05] Speaker B: Yeah, it's basically taking a pool of IOPS and allocating it to different disks dynamically, through ML or AI, similar to what they're doing for the capacity of your disks. It's nice. I appreciate it. I don't know that I'd use it.
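To make the aggregate-provisioning idea concrete, here is a rough sketch of creating a pool with shared IOPS and throughput via the discovery-based Python client. The storagePools collection and its body fields follow our reading of the public Compute Engine REST reference, so treat every name, field, and number here as an assumption to verify against the docs.

```python
# Sketch: create a Hyperdisk storage pool with aggregate provisioned
# performance, then create a disk that draws from it. The capacity, IOPS,
# and throughput values, and all resource names, are made-up placeholders.
from googleapiclient import discovery

PROJECT, ZONE = "my-project", "us-central1-a"

compute = discovery.build("compute", "v1")

compute.storagePools().insert(
    project=PROJECT,
    zone=ZONE,
    body={
        "name": "shared-perf-pool",
        "storagePoolType": f"zones/{ZONE}/storagePoolTypes/hyperdisk-balanced",
        "poolProvisionedCapacityGb": 10240,  # capacity shared by all disks
        "poolProvisionedIops": 50000,        # IOPS provisioned in aggregate
        "poolProvisionedThroughput": 1024,   # MBps, also pooled
    },
).execute()

# Disks created inside the pool pull performance from the shared allocation
# instead of each being provisioned individually.
compute.disks().insert(
    project=PROJECT,
    zone=ZONE,
    body={
        "name": "app-disk-1",
        "sizeGb": "512",
        "type": f"zones/{ZONE}/diskTypes/hyperdisk-balanced",
        "storagePool": f"zones/{ZONE}/storagePools/shared-perf-pool",
    },
).execute()
```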
[00:45:19] Speaker A: Yeah, I mean, it's funny, because it's been so long since I used block storage that I'm trying to dredge up old memories of what the problems were. I've moved on to where I don't use this a whole lot, other than when managed providers are providing it on the backend. [00:45:42] Speaker B: Moving on to Azure. At Hot Chips 2024 (I am excited there's a conference called Hot Chips), Microsoft initially shared some specs on the Maia 100, Microsoft's first-gen custom AI accelerator designed for large-scale AI workloads deployed in Azure. And in a sign that this chip is getting closer to reality, they're now releasing more details to us this week. The Maia 100 accelerator is purpose-built for a wide range of cloud-based AI workloads and utilizes TSMC's N5 process with CoWoS-S interposer technology, whatever the hell that is. It's equipped with a large on-die SRAM, and the Maia 100 is a reticle-size SoC, or system-on-chip, die. Combined with four HBM2E dies, that provides a total of 1.8 terabytes per second of bandwidth and 64 gigabytes of capacity to accommodate AI-scale data handling requirements. The chip architecture includes a high-speed tensor unit for training and inference that supports a wide range of data types, including low-precision data types such as the MX data format; a vector processor, a loosely coupled superscalar engine built with a custom instruction set architecture (ISA) to support data types including FP32 and BF16; and a direct memory access engine that supports different tensor sharding schemes. A hardware semaphore enables asynchronous programming on the Maia system. Maia 100 also supports up to 4,800 Gbps of all-gather and scatter-reduce bandwidth and 1,200 Gbps of all-to-all bandwidth. None of that really made any sense to me, but someone out there is super excited. [00:47:08] Speaker A: Yeah. [00:47:11] Speaker C: The numbers are a lot, and they sound good. [00:47:13] Speaker A: I just worry that I'm getting out of touch with some of these AI things, because this really does seem like a whole different language to me, and I don't know what any of it means. I can't even infer much, other than capacity. [00:47:27] Speaker C: I heard "rectal size." Reticle-size system. [00:47:31] Speaker B: On-a-chip die. [00:47:32] Speaker C: Yeah, I need a translator for that sentence. [00:47:36] Speaker A: Yeah, yeah. I'm just not sure whether I'm too far gone into the managed services world, where I don't really want this level of detail anymore. Like, just do the thing; I'm paying you to do the thing. And all the "this type of processor with this type of chip" stuff is irrelevant. But also, maybe if you're deep in that space, you need that performance. It's really hard to say. [00:48:03] Speaker B: Yeah. So the reticle-size system-on-chip is basically the maximum field size, 26 mm by 33 mm, or 858 mm squared total, basically how much area the processing bits can occupy on the system-on-chip. [00:48:18] Speaker A: Okay. I don't care about any of that. [00:48:20] Speaker C: So I think you tried to explain it, but I don't think that actually answered anything.
[00:48:25] Speaker A: This is like going really deep into CPU architecture. [00:48:28] Speaker B: This is silicon-level stuff. Thanks, Microsoft, for sharing the details, but we're going to move right along to the next story. [00:48:35] Speaker C: Yeah, to be fair, it was at Hot Chips 2024. [00:48:38] Speaker B: Yeah, where people are nerding out on this. These are the Ryans of the chip world. [00:48:42] Speaker A: I get it now. [00:48:42] Speaker B: Hot chips. [00:48:43] Speaker A: I get it now. [00:48:44] Speaker B: Processor chips. [00:48:46] Speaker C: This is why. This is the problem with doing a podcast at night. Sometimes we're a little slow. [00:48:50] Speaker B: All right. [00:48:51] Speaker A: Yeah, because it's at night. That's why. [00:48:54] Speaker B: I don't think we would have gotten that even during the day. [00:48:56] Speaker A: No, there's no amount of coffee that would have made that make sense to me. [00:49:01] Speaker B: Azure is announcing interesting new and simplified subscription limits for Azure SQL Database and Azure Synapse Analytics dedicated SQL pools, formerly SQL Data Warehouse. The new features include vCore-based limits that are directly equivalent to the old DTU and DWU limits, and they're eliminating the DTU, whatever that means. Default logical server limits have changed, configurable vCore limits have increased, and there's a new portal experience, because every Azure product needs a new portal experience. All subscriptions will now default to a limit of 250 logical servers. This is where I go to Matt and say: Matt, DTUs, DWUs, and vCores. Why do I care? [00:49:38] Speaker A: Oh, I can translate this one. [00:49:41] Speaker C: Go. Yeah, this will be fun. [00:49:43] Speaker A: Our SQL database product wasn't making any money, and what we want to do is change the pricing model so that you're paying for compute units instead of vCores, so it's time-based instead of just number-based. No? I'm so close. [00:50:01] Speaker C: They went from one metric, their original metric, which was a weird combination of memory, CPU, and maximum storage allocation, to the newer one, which is supposed to simplify it. [00:50:15] Speaker B: Yeah. So apparently one vCore is approximately equal to 100 to 125 DTUs. [00:50:21] Speaker C: Depending on whether you're on Business Critical or on Standard. Which is how you know I hate my life, that I know that. [00:50:27] Speaker B: And the fact that it's a range is bullshit. We'll flat-out call that BS. [00:50:31] Speaker C: Well, no. Business Critical, I think, is the 125, and the Standard, general purpose, tier is the 100. That's why it's the range. It also depends on your workload. [00:50:41] Speaker B: If it depends on your workload, that is absolute baloney. How they built that out would just make me mad. Like, literally every meeting with my Azure rep, I'd be like, have you fixed that BS? And if they told me no, I'd be like, get the f out.
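To put numbers on the range the guys are describing, here is a trivial sketch of the conversion. The 100 and 125 DTU-per-vCore factors come straight from the discussion above (they're the hosts' approximation, not an official table), and rounding up is our assumption about how you'd size conservatively.

```python
import math

# DTUs per vCore as discussed: Standard / General Purpose is roughly 100,
# Business Critical is roughly 125. Approximate factors, not official ones.
DTUS_PER_VCORE = {"general_purpose": 100, "business_critical": 125}

def dtus_to_vcores(dtus: int, tier: str) -> int:
    """Estimate the vCore count for a DTU workload, rounding up so the
    estimate never under-provisions."""
    return math.ceil(dtus / DTUS_PER_VCORE[tier])

print(dtus_to_vcores(800, "general_purpose"))    # 800 / 100 -> 8 vCores
print(dtus_to_vcores(800, "business_critical"))  # 800 / 125 = 6.4 -> 7 vCores
```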
[00:50:58] Speaker C: So the other advantage of them moving this way is you get a better ratio of memory to storage per unit. However, they bundle that together with storage, and I've seen it at my day job: we had a problem where we had to increase the memory and CPU because we needed more underlying storage. At certain thresholds you can only have so much storage; I think it's something like twelve vCores before you can get two terabytes of storage. We needed more storage, we didn't really need more CPU, so we had to scale up. So that's the other benefit. And really, this is them trying to deprecate the DTU model as they move everything to vCores. Hyperscale, which is their Aurora equivalent, gets rid of even more of those. Under the hood, this announcement is really "we're deprecating the old thing that the rest of the business has already deprecated, and we're just showing up at the door saying we've done it now." [00:51:57] Speaker B: That's nice, with all this announcement. But I also love some of the frequently asked questions here, like: do quotas apply equally to my serverless DBs? Yes, they do. Currently, quotas are determined by overall usage, meaning that serverless usage is considered in the same manner as provisioned usage. Because that's not confusing. So your serverless stuff can't spin up because you've used all your provisioned usage? That's super annoying. And then: is there a way to get notified before I hit my quota limit? Not at this time. However, as a workaround, you can leverage the subscription usages API with the usage name parameter set to the regional vCore quota for SQL DB and DW. Which tells me what the quota is, but not... oh no, you can get the number in use too. [00:52:36] Speaker A: And so you get your quota and the number in use, and you just poll that every minute or so, forever. [00:52:44] Speaker B: Okay. I'm so glad that you're the Azure guy, Matt, and not me, because I. [00:52:49] Speaker C: Somehow still blame you for it. But don't worry about that; we'll get to that later.
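In that spirit, the polling workaround might look something like the sketch below, which calls the ARM subscription usages endpoint for SQL. The endpoint shape follows the documented Microsoft.Sql locations/usages API, but the api-version and the exact usage name are assumptions taken from the discussion, so verify both against the current reference.

```python
# Sketch: poll the Azure subscription usages API for the regional vCore
# quota and warn when usage crosses a threshold. Requires
# `pip install azure-identity requests`; the subscription ID, location,
# api-version, and usage name are placeholders or assumptions.
import time

import requests
from azure.identity import DefaultAzureCredential

SUBSCRIPTION = "00000000-0000-0000-0000-000000000000"
LOCATION = "eastus"
USAGE_NAME = "RegionalVCoreQuotaForSqlDbAndDw"  # name as described on the show
URL = (
    f"https://management.azure.com/subscriptions/{SUBSCRIPTION}"
    f"/providers/Microsoft.Sql/locations/{LOCATION}/usages"
    "?api-version=2021-11-01"
)

credential = DefaultAzureCredential()

while True:
    token = credential.get_token("https://management.azure.com/.default").token
    resp = requests.get(URL, headers={"Authorization": f"Bearer {token}"})
    resp.raise_for_status()
    for usage in resp.json().get("value", []):
        if usage.get("name", {}).get("value") == USAGE_NAME:
            current, limit = usage["currentValue"], usage["limit"]
            if limit and current / limit > 0.8:
                print(f"warning: {current}/{limit} regional vCores used")
    time.sleep(60)  # "poll that every minute or so, forever"
```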
[00:52:52] Speaker B: You can blame me all day long. I do feel bad about it; I've talked about this recently. Azure is pleased to announce several enhancements to their VMware solution for Azure, because you like to burn your Azure money and your Broadcom money at the same time. You now have access to Azure VMware Solution in 33 regions, including a DoD SRG Impact Level 4 region in the Azure Government cloud. There's expanded support for VCF with NetApp, with VMware customers able to simplify their VCF hybrid environment by leveraging NetApp ONTAP software, so you can pay even more money for your VMware with NetApp licenses. You can now leverage Spot Eco by NetApp with your vSphere VMs in the cloud, if you'd like to use spot instances on your VMs for some reason. And there's a new collaboration with JetStream that enhances DR and ransomware protection, with JetStream delivering advanced DR that offers near-zero RPO and instant RTO. None of this made any sense to me either, but if you're a VMware guy, I suppose this is great. [00:53:44] Speaker C: Can I translate this? [00:53:45] Speaker B: Sure. [00:53:46] Speaker C: How to burn all your capital and piss off your CFO in 15 minutes or less. [00:53:54] Speaker A: It's just one-stop shopping for your VMware environment, the storage to support it, and then you're backing it all up in case of DR, which you'll never use, but you'll pay. [00:54:04] Speaker C: Through the nose for it. And with near-zero RPO, you know it's really high-end DR, so it's even more expensive. Nothing about this sounds cheap. [00:54:15] Speaker B: Nope, nothing's cheap, except for Spot Eco. [00:54:19] Speaker A: They're trying to make it sound cheap. [00:54:21] Speaker B: Trying to help you out with that. [00:54:23] Speaker C: If there's capacity in the region, right? [00:54:26] Speaker B: If there's capacity. All right, well, let's go to a Cloud Journey, where I have a great article I thought I'd talk to you about from Richard Seroter. He works at Google; I think he's in DevRel. He had a pretty great blog post this week in his newsletter, which he sends out every week and where I check out all the articles, because he finds good stuff that I like to read. This one is four ways to pay down tech debt by reasonably removing stuff from your architecture. Maybe first start with VMware, based on our last conversation. [00:54:52] Speaker C: Followed by NetApp. [00:54:54] Speaker B: Followed by NetApp. Yep. So he starts out covering debt, really architectural debt: carrying things like eight products that do the same thing in every category; brittle automation that only partially works or still requires manual workarounds and black magic; uniquely customized packaged software that prevents upgrades to modern versions; half-finished ivory-tower designs where the complex distributed system isn't fully in place and may never be; too much coupling; too little coupling; unsupported frameworks; and on and on. All things he considers tech debt. And he has four ways for you to think about climbing out of it. Number one, you ready? [00:55:27] Speaker C: We're ready. [00:55:28] Speaker B: You look ready. Number one: stop moving so much gosh-darn data around. How many components do you have that get data from point A to point B? How many ETL pipelines to consolidate or hydrate your data? How much messaging and event processing to send this data around, or even API calls that suck data from system A into system B? Can you get rid of some of this? Some examples of things that might help: you can perform analytics queries against data sitting in different places by leveraging something like BigQuery Omni, which runs in AWS, Azure, and GCP, so instead of shipping all of your data to BigQuery, you can run the query locally in that cloud. You can enrich your data from outside the database: you might have an ETL job in place to bring reference data into your data warehouse to supplement what's already there, but with BigQuery federated queries you can reach live into PostgreSQL, MySQL, Spanner, or even SAP to get that data. Or perform complex SQL analytics against log data in place, instead of copying and sending logs to online systems like Elasticsearch. That's number one. What do you guys think?
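For a flavor of what "reach live into PostgreSQL" looks like, here is a minimal sketch using BigQuery's EXTERNAL_QUERY federated-query function from the Python client. The project, connection ID, and table names are all hypothetical.

```python
# Sketch: join warehouse data against a live Cloud SQL Postgres table with a
# BigQuery federated query, instead of ETL-ing the reference data into the
# warehouse first. Requires `pip install google-cloud-bigquery`.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

sql = """
SELECT o.order_id, o.total, c.segment
FROM `my-project.sales.orders` AS o
JOIN EXTERNAL_QUERY(
  'my-project.us.postgres-connection',          -- Cloud SQL connection resource
  'SELECT customer_id, segment FROM customers'  -- runs in Postgres, not BigQuery
) AS c
ON o.customer_id = c.customer_id
"""

for row in client.query(sql).result():
    print(row.order_id, row.segment)
```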
[00:56:25] Speaker A: I mean, yeah, these are the things that you want to do but never have time for. So that makes sense. And I do agree that looking at things to identify improvements, this is a pretty good starting place, or a generalization of "here's the things." It is funny, because I'm reading through it and, like, yeah, it makes a lot of sense, but I'm also like... [00:56:52] Speaker B: Well, I was thinking this is a great pitch for Google, because I don't think I could do this on AWS, where all the data storage is separate for every product because of their isolation model. On GCP I can do these kinds of things because they have one data layer. [00:57:05] Speaker C: I mean, I think the other interesting thing about this is querying the data where it lives. So for his whole comment of query SQL here and pull data from there, the piece it doesn't account for is whether there's a cost associated with those queries. If you're querying a SQL database in AWS, a Postgres in Azure (and I messed that one up), and something in GCP, are you going to get killed with egress costs, or latency, where the response time isn't usable for your company? So I think there are a few other things to also consider. But if you can just query where it is and not have to shove it all into your data lake or do ETLs in other places, I think it does make sense. [00:57:51] Speaker B: All right, number two: compress the stack by removing duplicative components. This is time to break out the chainsaw and kill duplicated products, or the too many best-of-breed solutions you might have added to your process. He quotes a rule of thumb from one of his colleagues, Josh McKenty: if it's emerging, buy a few; if it's mature, no more than two. Which is a clever little saying. You don't need multiple database platforms or project management solutions; leverage multi-purpose services and embrace "good enough" in your architecture. Do you have multiple databases? Maybe you should wait 15 days before you buy a specialized vector database; just use Postgres or any number of existing databases that now support vectors. Multiple messaging buses and stream processors? Consolidate to Pub/Sub, for example. Really, at the end, he's just saying leverage managed services and get rid of the rest. I like this one quite a bit.
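The "just use Postgres" point is concrete enough to sketch. With the pgvector extension you get nearest-neighbor search in a database you already run, without standing up a separate vector product. A minimal example, assuming pgvector is installable on your instance and with a placeholder connection string:

```python
# Sketch: vector similarity search in plain Postgres via pgvector, instead of
# adding a dedicated vector database to the stack. Requires
# `pip install psycopg2-binary` and a Postgres with the pgvector extension.
import psycopg2

conn = psycopg2.connect("dbname=app user=app password=secret host=localhost")
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute(
    "CREATE TABLE IF NOT EXISTS docs ("
    " id serial PRIMARY KEY,"
    " body text,"
    " embedding vector(3))"  # tiny dimension, just for the sketch
)
cur.execute(
    "INSERT INTO docs (body, embedding) VALUES (%s, %s), (%s, %s)",
    ("about cats", "[1,0,0]", "about clouds", "[0,1,0]"),
)
conn.commit()

# `<->` is pgvector's L2 distance operator; order by it for nearest neighbors.
cur.execute(
    "SELECT body FROM docs ORDER BY embedding <-> %s LIMIT 1",
    ("[0,0.9,0.1]",),
)
print(cur.fetchone()[0])  # prints "about clouds"
conn.close()
```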
[00:58:38] Speaker A: I do too. The trick to this is the replacing, right? This is still identification of technology, and I actually don't know if that's really the problem to be solved. I think the problem is how you prioritize and make these changes. The article sort of references opinions, but the reality is you have to be constantly making changes. That's the only thing, and maybe you make changes using this identification model, or maybe you just try new tools. [00:59:14] Speaker B: Again, I think it's a question of, if you have three schedulers at your company, do you need three schedulers, or can you consolidate to one scheduler, for example? Or do you need Kinesis and Pub/Sub and Kafka in your architecture, when one would do just fine? And maybe you choose the managed service, or maybe you choose Kafka, or Confluent Cloud so you don't even have to manage it. Yes, there's a cost of switching, there's a cost of change, but ultimately you're removing duplicative components that cost you money and cause debt. Because when you have Kinesis and Kafka and these things, which one do you choose for the data you want? The answer is you might be using multiple, because you need data feeds from two different solutions: one data feed is on Kinesis from that publisher, and the other is on Pub/Sub from the other. Now you have the complexity of managing multiple connectors and multiple queueing logics, and that can become problematic for you too. Number three: replace hyper-customized software and automation with managed services and vanilla infrastructure. And he points out that you are not Google, or that special. Your company likely does a few things that are secret sauce; the rest is identical to every other company in the freaking world. Fit the team to the software, not the other way around. This customization leads to lock-in, and you get stuck in upgrade purgatory forever. A lot of SAP admins out there are saying, "yes, I understand." No one gets rewarded for their super-highly-customized Kubernetes cluster; use things like Google Kubernetes Engine Autopilot, or Fargate pods, or some other way to not have to manage something highly customized to your organization, because nothing but pain will come from that customization. [01:00:46] Speaker A: Yeah, I can't agree with this one more. I've been at a few companies now where the highly customized workload always blows up eventually. Like, at Yahoo we had a tendency to build our own tools, because we could, and we tricked ourselves: "oh, well, nothing will support the scale that we need," and the whole thing. But then at a certain point you couldn't hire anyone, because the skill set had moved on and everyone else in the market had been using these commonplace tools. With tight customization, learn when you can make those compromises, because most of the time it's fine. [01:01:31] Speaker C: Yeah, most of the time you don't need that extra performance you're squeezing out of it for the added complexity. And honestly, whether you want to believe it or not, the most likely cause of many underlying outages is going to be that thing that's so customized to your niche use case that you didn't need. You know, if you add a millisecond or two here to your product, is that really the end-all, be-all? Maybe if you're a trading platform, but for a lot of companies, those couple of milliseconds won't actually matter.
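As a sketch of the "vanilla over customized" option, creating a GKE Autopilot cluster is basically one API call with almost no knobs, which is the point. This uses the discovery-based Python client against the GKE v1 API; the project and region are placeholders, and the autopilot flag reflects our reading of the Cluster resource, so treat it as an assumption.

```python
# Sketch: create a GKE Autopilot cluster, where node pools, sizing, and most
# day-2 tuning are managed for you instead of hand-customized.
# Requires `pip install google-api-python-client` and ADC credentials.
from googleapiclient import discovery

PROJECT, REGION = "my-project", "us-central1"

container = discovery.build("container", "v1")

container.projects().locations().clusters().create(
    parent=f"projects/{PROJECT}/locations/{REGION}",
    body={
        "cluster": {
            "name": "boring-vanilla-cluster",
            "autopilot": {"enabled": True},  # no custom node pools to babysit
        }
    },
).execute()
```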
[01:02:03] Speaker B: Number four is tone it down on the microservices and distributed systems. People have gone overkill on microservices. You don't need dozens of serverless functions to serve a static web app, or a big, complex JavaScript framework for two pages. Tech debt often comes from over-engineering a system when you'd be better off smashing it back into an app hosted in Cloud Run, for example, where there'd be fewer moving parts and all the agility you want with something like Cloud Run. He's not saying go full DHH and go full monolith, but most folks would be better off defaulting to more monolithic systems running on a server or two than a plethora of microservices running on containers. [01:02:38] Speaker C: I mean, one of the things they've always said is use the right tool at the right time in the right place. So design for what you're going to need for the next six months to a year, or longer depending on whether you're expecting a lot more growth, but design for that point and reevaluate from there. Say, okay, we've designed it, it's working for our current scale; if we want to go to the next tier and we're seeing that growth, okay, then we rearchitect that one piece. [01:03:06] Speaker A: Yeah. It's a common fallacy that you want to develop everything as a microservice so that you can manage and update them separately. But really, if you only have a single customer of that API or microservice, it shouldn't be separate. It's really about understanding the contracts, the ins and outs, who needs to use the service, and what features you need. Because in a lot of cases I've seen, it doesn't make any sense to develop a new microservice when it should just be an enhancement to the existing service. [01:03:40] Speaker B: Well, that's the fourth from him. Do you guys have a fifth suggestion for getting rid of tech debt in your architecture? [01:03:45] Speaker A: Yeah, mine would just be making changes in general. When you make changes, rip stuff out, play around with new technologies, and architect so that you can make frequent changes without a complete disaster. It's one of those things where I see people, instead of removing duplicative components, just build another one on top. Whereas if you're making changes all the time, and there's risk, I get that, you'll be constantly applying new logic to your application, which by its inherent nature will remove tech debt. [01:04:27] Speaker C: Yeah. What I tell people is that on every single task you work on, 10% of it should be cleaning stuff up as you go through it. So if you're working on something, and this is less architecture-related, but in general, clean stuff up as you go. If you spend a little bit of time all the time doing it, then things will by nature be a little bit cleaner. Like the old saying: if you try to clean your room every time you walk by, you pick up one thing, and then the major cleanup at the end of the day isn't a big deal. Try explaining that to myself in real life, though; the odds are not high. [01:05:05] Speaker B: Yeah. So the one architectural area that I would add is: stop stressing about lock-in so much. So many companies built so much abstraction into their code base, which they then have to maintain and manage, because of vendor lock-in fears. And again, I think if you have a reasonable amount of time to engineer out of a lock-in situation, just take the lock-in. If it takes you a year, cool. If you have a problem that needs you to get out of the solution, a year is probably a reasonable amount of time to take to go do it right, to make sure you don't repeat the sins of the past and do it a little better this time than you did last time, as Matt and Ryan pointed out. That's probably my big one: I sometimes see shops writing so much abstraction code, trying to avoid lock-in to the database or the cloud provider, in fear of some massive cost increase. The reality is, if the cost increase is coming, it's coming at renewal time anyway, and you're already screwed. You're going to bite the bullet, pay that money regardless, and then move forward, just like the guys who are all being burned by VMware right now. Would you have been right to build abstractions for VMware? Well, yeah, if you'd done it four years ago. But now you're just eating the pain, and you're going to pay the Broadcom tax, and next year, when your renewal comes up, you're going to tell Broadcom to go f themselves, because you've moved to something else that's more affordable. But again, managing all that abstraction for lock-in up front, I think, is just a path to tech debt that's going to burn you. [01:06:29] Speaker A: Yeah. The funny thing is, VMware was the argument, right? You build it out on VMware and it can run in any cloud. It's like, wow, man, that didn't work out so well, did it? You probably would have been better off just going EC2 or GCP compute directly. It is a balance.
As someone who builds and maintains developer productivity tools, it's one of those things where a certain layer of abstraction is a good thing, but being very aware of where that should stop is important. And it's subjective, and your customers are going to demand customization, and you're going to have to push back against it. So it's this really interesting push-and-pull that happens. But I do agree: it doesn't make any sense to avoid the lock-in and just build it all custom. [01:07:26] Speaker B: All right, guys, well, that's another fantastic week in the cloud. We'll see you next week. [01:07:31] Speaker A: Bye, everybody. [01:07:32] Speaker C: Bye, everyone. [01:07:37] Speaker B: And that is the week in cloud. Check out our website, the home of The Cloud Pod, where you can join our newsletter and Slack team, send feedback, or ask questions. Or tweet us with the hashtag #thecloudpod.
