CloudPro #59: OpenAI scaled Kubernetes to 7,500 nodes

AWS Discontinues Cloud9, CodeCommit, CloudSearch, and more

Packt

Aug 09, 2024

Welcome to the 59th edition of CloudPro! Today, we’ll talk about:

⭐Masterclass:

🔍Secret Knowledge:

⚡Techwave:

🛠️HackHub: Best Tools for the Cloud

Cheers,

Shreyans Singh

Editor-in-Chief

⭐MasterClass: Tutorials & Guides

⭐OpenAI scaled Kubernetes to 7,500 nodes

In January 2021, OpenAI scaled their Kubernetes clusters to 7,500 nodes to support large-scale machine learning workloads, such as training models like GPT-3, CLIP, and DALL·E. This was a significant technical achievement, as it allowed for a unified and scalable infrastructure that could handle both massive jobs and rapid, small-scale research iterations without requiring changes to the code. The scaling process involved overcoming various challenges related to networking, API server stress, and resource management, leading to the development of specialized solutions to maintain cluster stability and performance at such a large scale.

⭐70% reduction in storage costs with s3-batch-object-store

Embrace reduced its storage costs by 70% by creating an open-source module called s3-batch-object-store. This module efficiently stores and retrieves multiple objects within a single file on Amazon S3, minimizing storage and operational expenses. Previously, Embrace stored large objects in Cassandra, which was expensive and inefficient due to frequent data compactions and high EBS storage costs. The new approach batches multiple objects into one file before uploading to S3, significantly reducing the number of PUT operations and storage costs.
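To make the batching idea concrete, here is a minimal in-memory sketch, not the module's actual API. It packs several objects into one blob and keeps an index of offsets so a single object can be read back later. The function names (`pack_objects`, `read_object`, `byte_range_header`) are hypothetical, and the byte-range retrieval is an assumption about how one object might be fetched from a batch file; the summary above does not describe the retrieval mechanism.

```python
import io
from typing import Dict, Tuple

# Maps an object key to its (offset, length) inside the batch blob.
Index = Dict[str, Tuple[int, int]]


def pack_objects(objects: Dict[str, bytes]) -> Tuple[bytes, Index]:
    """Concatenate objects into one blob and record where each one lives."""
    buffer = io.BytesIO()
    index: Index = {}
    for key, payload in objects.items():
        offset = buffer.tell()
        buffer.write(payload)
        index[key] = (offset, len(payload))
    return buffer.getvalue(), index


def read_object(blob: bytes, index: Index, key: str) -> bytes:
    """Slice a single object back out of the batch blob."""
    offset, length = index[key]
    return blob[offset:offset + length]


def byte_range_header(index: Index, key: str) -> str:
    """Build an HTTP Range header value covering only this object.

    Assumption: a ranged GET against the uploaded batch file would
    return just these bytes, so the whole file need not be downloaded.
    """
    offset, length = index[key]
    return f"bytes={offset}-{offset + length - 1}"
```

With this layout, one upload carries many objects, which is where the drop in PUT operations comes from; the index would need to be stored somewhere durable alongside the batch file.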

⭐Stop using TCP health checks for Kubernetes applications

TCP health checks in Kubernetes are often inadequate for monitoring the true health of an application because they only verify whether a TCP connection can be established, so they miss critical issues such as failed internal processes or unavailable dependencies. Using higher-level protocols like HTTP or gRPC for readiness and liveness probes instead provides more accurate, observable, and actionable insight into an application's state, improving reliability and reducing downtime.
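As a rough illustration of what an HTTP probe can report that a TCP probe cannot, the sketch below uses only the Python standard library. The `/healthz` path, port 8080, the `CHECKS` registry, and the individual check names are all assumptions made for this example, not details from the article.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple

# Hypothetical registry: each entry returns True when that dependency is usable.
CHECKS: Dict[str, Callable[[], bool]] = {}


def evaluate_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, str]]:
    """Run every check and map the outcome to an HTTP status code."""
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception as exc:  # a crashing check counts as unhealthy
            results[name] = f"error: {exc}"
    status = 200 if all(value == "ok" for value in results.values()) else 503
    return status, results


class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET /healthz with per-check detail in the response body."""

    def do_GET(self) -> None:
        if self.path != "/healthz":
            self.send_error(404)
            return
        status, results = evaluate_health(CHECKS)
        body = json.dumps(results).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Start the health endpoint; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

A TCP probe would succeed as soon as this server accepts connections, whereas an HTTP probe pointed at `/healthz` fails with a 503 when any registered check fails, and the body says which one.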

⭐Set up a complete CI/CD pipeline for a Java application using Jenkins, Docker, and Kubernetes

This guide walks you through setting up a complete CI/CD pipeline for a Java application using Jenkins, Docker, and Kubernetes, with deployment managed by Argo CD. It starts by setting up a Jenkins server on an AWS EC2 instance, configuring Jenkins with plugins for Docker and SonarQube, and creating a pipeline that builds, tests, and deploys the application. The project includes static code analysis with SonarQube, Docker image management, and automated deployment to a Kubernetes cluster. The guide also covers setting up Argo CD for GitOps-based deployments and using Lens as a Kubernetes IDE to manage the cluster. Finally, it explains how to automate pipeline triggers using GitHub webhooks.

⭐4 things wrong with most Kafka installations — and how to avoid them

Many Kafka installations suffer from common issues like low replication factors, which can lead to data loss during broker failures; ignoring client library changes, risking system instability without proper testing; deploying infrastructure in a single data center, making it vulnerable to disasters; and lacking robust failure-handling strategies, which can cause unprocessed or corrupted data to disrupt operations. To avoid these pitfalls, set adequate replication, rigorously test updates, distribute infrastructure across multiple locations, and implement strong monitoring and error-handling mechanisms.
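The replication point lends itself to a small automated check. The sketch below is one possible pre-deployment audit, not something taken from the article: the thresholds (replication factor of at least 3, `min.insync.replicas` of at least 2) reflect commonly cited guidance and are assumptions here, and `audit_topic` is a hypothetical helper that works on plain numbers rather than talking to a broker.

```python
from typing import List


def audit_topic(
    name: str,
    replication_factor: int,
    min_insync_replicas: int,
    required_replication: int = 3,  # assumed threshold, adjust to your policy
    required_min_isr: int = 2,      # assumed threshold, adjust to your policy
) -> List[str]:
    """Return human-readable warnings for risky topic settings."""
    warnings: List[str] = []
    if replication_factor < required_replication:
        warnings.append(
            f"{name}: replication factor {replication_factor} is below "
            f"{required_replication}; a broker failure may lose data"
        )
    if min_insync_replicas < required_min_isr:
        warnings.append(
            f"{name}: min.insync.replicas {min_insync_replicas} is below "
            f"{required_min_isr}; acknowledged writes may exist on one broker only"
        )
    if min_insync_replicas >= replication_factor:
        warnings.append(
            f"{name}: min.insync.replicas equals or exceeds the replication "
            f"factor; losing any single replica blocks acks=all producers"
        )
    return warnings
```

For example, `audit_topic("orders", 3, 2)` returns an empty list, while a topic created with a replication factor of 1 is flagged on all three counts.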

🔍Secret Knowledge: Learning Resources

🔍Delivering Millions of Notifications Within Seconds During the Super Bowl

Zhen Zhou, a senior software engineer at Duolingo, shared how his team built a notification system capable of delivering millions of notifications within seconds during the Super Bowl. Faced with the daunting challenge of sending 4 million notifications in just 5 seconds, they had to ensure the system was fast, scalable, and resilient while avoiding issues like self-inflicted DDoS attacks. Through a combination of careful planning, asynchronous processing, and strategic use of AWS services, they successfully met the demands of the Super Bowl ad campaign.

🔍How Binance built a 100PB log service with Quickwit

Binance, the world's leading cryptocurrency exchange, faced challenges with managing massive amounts of log data using Elasticsearch. To improve efficiency, they migrated to Quickwit, an open-source distributed search engine. Over six months, Binance scaled Quickwit to handle 100 petabytes of logs, indexing 1.6 petabytes per day. The migration brought significant savings, cutting compute costs by 80% and storage costs by a factor of 20.
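To put those figures in perspective, here is a quick back-of-the-envelope calculation. It uses only the numbers quoted above and assumes decimal units (1 PB = 10^15 bytes) and a perfectly even ingest rate, neither of which the summary states.

```python
# All inputs come from the summary above; decimal units are an assumption.
TOTAL_LOG_BYTES = 100 * 10**15       # 100 PB of logs
DAILY_INDEXED_BYTES = 1600 * 10**12  # 1.6 PB indexed per day
SECONDS_PER_DAY = 24 * 60 * 60

# Average sustained ingest rate, if indexing were spread evenly over the day.
avg_ingest_gb_per_s = DAILY_INDEXED_BYTES / SECONDS_PER_DAY / 10**9

# Days of ingest at that rate needed to accumulate the full 100 PB.
days_to_reach_total = TOTAL_LOG_BYTES / DAILY_INDEXED_BYTES

# "80% lower" and "a factor of 20" expressed as the share of the old cost that remains.
compute_cost_remaining = 1 - 0.80
storage_cost_remaining = 1 / 20

print(f"Average ingest: {avg_ingest_gb_per_s:.1f} GB/s")
print(f"Days of ingest to reach 100 PB: {days_to_reach_total}")
print(f"Compute cost remaining: {compute_cost_remaining:.0%}")
print(f"Storage cost remaining: {storage_cost_remaining:.0%}")
```

In other words, 1.6 PB per day works out to roughly 18.5 GB every second on average, and a 20-fold storage reduction leaves about 5% of the original storage bill.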

🔍Can Postgres replace Redis as a cache?

Traditionally, Redis is favored for caching due to its speed and specialized features like expiration and eviction policies. However, the author discusses how PostgreSQL, specifically with its unlogged tables and stored procedures for data expiration, could potentially replace Redis in some cases. While PostgreSQL can offer cost savings and simplify tech stacks by consolidating databases, it doesn't match Redis's performance in read and write speeds. Redis remains superior for dedicated caching needs due to its optimized design for fast data retrieval and specialized caching features.
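The table-plus-expiry pattern can be sketched in a few lines. To keep the example runnable anywhere, SQLite from the Python standard library stands in for PostgreSQL; in PostgreSQL the table would be declared with `CREATE UNLOGGED TABLE`, and the purge step could live in a stored procedure as the article describes. The `SqlCache` class, its table name, and its columns are hypothetical.

```python
import sqlite3
import time
from typing import Optional


class SqlCache:
    """Key-value cache with per-entry expiry, backed by a SQL table."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        # Stand-in for an UNLOGGED table in PostgreSQL.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "key TEXT PRIMARY KEY, value TEXT NOT NULL, expires_at REAL NOT NULL)"
        )

    def set(self, key: str, value: str, ttl_seconds: float,
            now: Optional[float] = None) -> None:
        """Insert or overwrite an entry that expires ttl_seconds from now."""
        now = time.time() if now is None else now
        self.conn.execute(
            "INSERT OR REPLACE INTO cache (key, value, expires_at) VALUES (?, ?, ?)",
            (key, value, now + ttl_seconds),
        )

    def get(self, key: str, now: Optional[float] = None) -> Optional[str]:
        """Return the value if present and not yet expired, else None."""
        now = time.time() if now is None else now
        row = self.conn.execute(
            "SELECT value FROM cache WHERE key = ? AND expires_at > ?",
            (key, now),
        ).fetchone()
        return row[0] if row else None

    def purge_expired(self, now: Optional[float] = None) -> int:
        """Delete expired rows and return how many were removed."""
        now = time.time() if now is None else now
        cursor = self.conn.execute("DELETE FROM cache WHERE expires_at <= ?", (now,))
        return cursor.rowcount
```

Reads filter on `expires_at`, so stale entries are never served even before a purge runs; the periodic purge only reclaims space. Unlike Redis, nothing here evicts entries under memory pressure, which is part of why the article still favors Redis for dedicated caching.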

🔍What Alternatives to Rancher in 2024?

In 2024, if you're exploring alternatives to Rancher, consider Syself for its user-friendly Kubernetes-as-a-Service platform that integrates Cluster API for smooth cluster management and offers strong security and compliance. Talos is another option, known for its minimal OS designed specifically for Kubernetes, simplifying operations but potentially facing limitations with certain security policies. OpenShift has also evolved significantly, providing robust multi-cluster management and improved scalability with its latest versions. Additionally, SpectroCloud Palette stands out with its extensive multi-cluster management capabilities, user-friendly interface, and integrations like Terraform and Crossplane.

🔍Optimizing Docker Images for Python Production Services

This guide explains how to build efficient Docker images using techniques such as multi-stage builds and effective caching. It outlines how multi-stage builds separate the build environment from the runtime environment, significantly reducing the final image size by excluding unnecessary build tools. It also covers caching strategies that speed up builds and minimize resource usage by reusing unchanged layers. The guide provides practical tips for creating lean Dockerfiles for Python services and compares optimized Docker images with unoptimized ones, showing a substantial reduction in size and build time.

⚡ TechWave: Cloud News & Analysis

⚡AWS Discontinues Cloud9, CodeCommit, CloudSearch, and more

Amazon Web Services (AWS) is quietly discontinuing several services, including Cloud9, SimpleDB, and CodeCommit. This decision, revealed through a tweet by AWS's Chief Evangelist Jeff Barr, reflects a broader strategy shift as AWS acknowledges it cannot be a one-stop solution for every need. By deprecating these services, AWS aims to focus more on its core strengths in infrastructure while partnering with specialized vendors for other needs.

⚡Announcing Advanced Container Networking Services for your Azure Kubernetes Service clusters

Microsoft Azure has launched Advanced Container Networking Services for Azure Kubernetes Service (AKS) clusters, which includes a new feature called Advanced Network Observability. This service enhances AKS by improving observability, security, and compliance with advanced tools for monitoring network traffic and performance. The suite leverages technologies like eBPF for real-time metrics and logs, allowing detailed tracking of network flows, traffic issues, and DNS metrics. It offers integrated visualization through Azure's managed Prometheus and Grafana, or custom setups, and supports cross-node flow tracking to better understand and resolve complex networking issues.

⚡Kubernetes 1.31 - What’s new?

Kubernetes 1.31 introduces several key enhancements: it adds support for AppArmor profiles, which improve container security by enforcing specific security policies; improves ingress connectivity reliability for kube-proxy, making node terminations smoother; and introduces new features like Pod-level resource limits and refined job success policies. Significant changes include the removal of in-tree cloud provider code for a more vendor-neutral platform and updates to kubectl, such as separating user preferences from cluster configurations. Overall, this release focuses on security, reliability, and user experience improvements.

⚡Our Audit of Homebrew

The audit of Homebrew, conducted last summer and sponsored by the Open Tech Fund, examined the security of Homebrew's core codebase and related repositories. The review identified several issues that, while not critical, could potentially be exploited to undermine Homebrew’s security, such as unauthorized code execution or manipulation of build processes. These issues could affect the integrity of both Homebrew’s package installations and its continuous integration/continuous deployment (CI/CD) workflows. The audit highlighted vulnerabilities in Homebrew's sandboxing mechanisms, CI/CD configurations, and package handling processes.

⚡OpenTofu 1.8.0 is Out

OpenTofu 1.8.0 introduces several new features and improvements, including the ability to use variables and locals in more areas of code, support for a new .tofu file extension for backward compatibility, and enhanced testing with provider mocking. The update also brings a new RFC process for better documentation and introduces Go libraries for easier interoperability. The registry usage has grown significantly, and a new web-based user interface for the OpenTofu Registry is coming soon.

🛠️HackHub: Best Tools for the Cloud

🛠️CtrlSpice/otel-desktop-viewer

`otel-desktop-viewer` is a CLI tool that lets you visualize and explore OpenTelemetry traces locally on your machine, without needing to send data to an external telemetry service.

🛠️ajayd-san/gomanagedocker

`goManageDocker` is a fast terminal-based tool built with Go that simplifies Docker management tasks, letting you easily run, exec, delete, and view Docker objects using a keyboard-driven interface.

🛠️evanrolfe/trayce_gui

TrayceGUI is a cross-platform desktop app for monitoring network requests in Docker containers using the TrayceAgent.

🛠️m-adawi/swarm-cd

SwarmCD is a GitOps and Continuous Deployment tool for Docker Swarm that automates stack updates by monitoring Git repositories.

🛠️mostlycloudysky/cloudysetup

CloudySetup CLI is a tool that uses AI to generate, manage, and apply AWS resource configurations via the AWS Cloud Control API.