Ultimate Technical Guide - System Design Mastery

Complete Guide to
Software System Design

A comprehensive deep-dive from fundamentals to advanced architecture strategies. Master the art of building scalable, maintainable, and reliable software systems that can serve millions of users.

Introduction

Software system design is the discipline of building scalable, maintainable, reliable, and secure software systems. It represents the bridge between business requirements and technical implementation, transforming abstract ideas into concrete, deployable architectures that can serve millions of users while maintaining high performance, availability, and user satisfaction.

In today's digital landscape, where applications need to handle massive scale, process petabytes of data, and provide real-time responses across the globe, understanding system design has become not just important—it's absolutely critical. Whether you're building a simple web application or architecting the next social media platform, the principles of good system design apply universally.

Why System Design Matters

  • Scalability: Handle growth from 100 to 100 million users without complete rewrites
  • Reliability: Ensure your system stays available even when components fail
  • Performance: Deliver sub-second response times to users worldwide
  • Maintainability: Make changes and add features without breaking existing functionality
  • Cost Efficiency: Optimize resource usage to reduce operational expenses
  • Security: Protect user data and prevent security breaches
"If coding builds a product, system design builds the ecosystem in which that product thrives. It's the difference between a working prototype and a production-ready system that serves millions."

Complete Table of Contents

01. Types of System Design: functional, non-functional, architectural, data & storage, and deployment design
02. Functional vs Non-Functional: understanding requirements, constraints, and quality attributes
03. System Architecture Basics: layers, patterns, logical and physical architecture
04. High-Level vs Low-Level Design: HLD components, LLD implementation details
05. Monolith vs Microservices vs Serverless: architectural patterns, pros, cons, and when to use each
06. API Design & Communication: REST, GraphQL, gRPC, WebSockets, sync/async patterns
07. Databases & Data Models: SQL, NoSQL types, indexing, normalization, sharding
08. Storage Concepts: block, object, file storage, CDN, distributed systems
09. Caching & Performance: caching strategies, Redis, Memcached, CDN caching
10. Scalability Strategies: horizontal, vertical scaling, replication, partitioning
11. Load Balancing: algorithms, L4/L7 balancing, health checks, failover
12. Message Queues & Streaming: RabbitMQ, Kafka, event-driven architecture
13. CAP Theorem & Consistency: consistency, availability, partition tolerance, BASE vs ACID
14. Security & Authentication: OAuth, JWT, encryption, zero-trust architecture
15. CI/CD & DevOps: pipelines, Docker, Kubernetes, infrastructure as code
16. Monitoring & Observability: logs, metrics, traces, Prometheus, Grafana, ELK
17. Design Patterns: MVC, Repository, Circuit Breaker, CQRS, Saga
18. UML Diagrams: use case, class, sequence, deployment diagrams
19. Cloud Architecture: VPC, subnets, auto-scaling, multi-region deployment
20. Cost Optimization: resource optimization, spot instances, caching strategies
21. Real-World System Design: YouTube, WhatsApp, Uber, Twitter architecture case studies

Section 01

Types of Software System Design

System design is a multi-faceted discipline that requires understanding various design types, each focusing on different aspects of building robust software systems. Let's explore the five primary categories in comprehensive detail.

1. Functional Design

Functional design defines "What the system should do" - the features, behaviors, and business logic that deliver value to users. This encompasses user requirements, use cases, business workflows, and feature specifications.

Core Components:

  • User Stories: Narrative descriptions from user perspective (As a [role], I want [feature] so that [benefit])
  • Use Cases: Detailed scenarios with preconditions, steps, and postconditions
  • Business Rules: Logic governing system behavior and data processing
  • Data Flow: How information moves through the system

Example: E-commerce Cart

  • ✓ Users can add/remove items to cart
  • ✓ Cart persists across sessions
  • ✓ Apply discount codes and coupons
  • ✓ Calculate total with taxes and shipping
  • ✓ Show real-time inventory status
  • ✓ Save items for later purchase
  • ✓ Share cart with others via link
  • ✓ Suggest related products

Functional Requirements Example

Requirement ID: FR-001
Title: User Authentication System

Description:
The system shall provide secure user authentication with multiple methods

Functional Requirements:
1. Email/Password Authentication
   - Users can register with email and password
   - Password must be 8+ characters with special chars
   - Email verification required before login
   
2. Social Authentication
   - Support Google OAuth 2.0
   - Support Facebook Login
   - Auto-create user profile from social data
   
3. Two-Factor Authentication (Optional)
   - SMS-based OTP
   - Authenticator app support (TOTP)
   - Backup codes generation
   
4. Session Management
   - JWT tokens with 24-hour expiry
   - Refresh token mechanism
   - Logout from all devices option
   
5. Password Recovery
   - Email-based reset link (valid 1 hour)
   - Security questions as backup
   - Lock account after 5 failed attempts
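
Below is a minimal TypeScript sketch of the session and lockout rules from FR-001 (requirements 4 and 5): password check, lock after five failed attempts, and issuing a 24-hour JWT plus a refresh token. The user record shape, bcrypt hashing, the 30-minute lock window, and the secret handling are illustrative assumptions rather than part of the specification above.

// Hedged sketch of FR-001 sections 4-5; names and lock duration are assumptions
import jwt from "jsonwebtoken";
import bcrypt from "bcrypt";

const MAX_FAILED_ATTEMPTS = 5;      // lock account after 5 failed attempts
const ACCESS_TOKEN_TTL = "24h";     // JWT expiry per requirement 4

interface UserRecord {
  id: string;
  passwordHash: string;
  failedAttempts: number;
  lockedUntil?: Date;
}

async function login(user: UserRecord, password: string, jwtSecret: string) {
  if (user.lockedUntil && user.lockedUntil > new Date()) {
    throw new Error("Account locked, try again later");
  }
  const valid = await bcrypt.compare(password, user.passwordHash);
  if (!valid) {
    user.failedAttempts += 1;
    if (user.failedAttempts >= MAX_FAILED_ATTEMPTS) {
      user.lockedUntil = new Date(Date.now() + 30 * 60 * 1000); // assumed 30-minute lock
    }
    throw new Error("Invalid credentials");
  }
  user.failedAttempts = 0;
  // Short-lived access token plus a longer-lived refresh token (rotation not shown)
  const accessToken = jwt.sign({ sub: user.id }, jwtSecret, { expiresIn: ACCESS_TOKEN_TTL });
  const refreshToken = jwt.sign({ sub: user.id, type: "refresh" }, jwtSecret, { expiresIn: "30d" });
  return { accessToken, refreshToken };
}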

2. Non-Functional Design

Non-functional design specifies "How well the system performs" - the quality attributes that determine user satisfaction, system reliability, and operational excellence.

| Category | Attributes | Specific Requirements | Measurement Metrics |
|---|---|---|---|
| Performance | Response time, throughput, latency | API responds in <200ms (p95), page load <2s | Percentiles (p50, p95, p99), RPS, TPS |
| Scalability | Horizontal/vertical scaling, elasticity | Support 1M concurrent users, 10K RPS | Users/second, data volume, auto-scale time |
| Availability | Uptime, fault tolerance, redundancy | 99.99% SLA (~52 min downtime/year) | Uptime %, MTBF, MTTR |
| Reliability | Error rate, data integrity, recovery | Error rate <0.1%, zero data loss, RPO 1 hr | Error rate %, RPO, RTO |
| Security | Authentication, encryption, compliance | OAuth 2.0, TLS 1.3, GDPR compliant | Vulnerabilities, compliance score |
| Maintainability | Code quality, modularity, documentation | 80% test coverage, modular architecture | Code complexity, test coverage % |
| Usability | User experience, accessibility | WCAG 2.1 AA, mobile responsive | Task completion time, error rate |
| Cost Efficiency | Resource optimization, cloud costs | Monthly cloud cost <$10K for 100K users | Cost per user, infrastructure cost |
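
As a small illustration of the Performance row, the percentile metrics (p50/p95/p99) are just order statistics over raw request timings; a monitoring system normally computes them, but the arithmetic is simple. The sketch below uses the nearest-rank method and made-up sample latencies.

// Nearest-rank percentile over raw latency samples (illustrative values)
function percentile(samplesMs: number[], p: number): number {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

const latencies = [120, 95, 180, 140, 110, 220, 130, 90, 105, 160];
console.log(percentile(latencies, 50)); // 120 -> p50
console.log(percentile(latencies, 95)); // 220 -> p95, so a "<200ms (p95)" target would be missed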

Trade-off Example: Performance vs Consistency

Scenario: Social media "likes" counter

Option A: Eventual Consistency
  • ✓ Ultra-fast response (<10ms)
  • ✓ Handle millions of likes/second
  • ✗ Count may be slightly inaccurate temporarily
  • Best for: Social media, analytics
Option B: Strong Consistency
  • ✓ Always accurate count
  • ✓ Reliable for critical operations
  • ✗ Slower response (50-100ms)
  • Best for: Banking, inventory
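
A minimal sketch of Option A, assuming Redis and the node-redis client: likes are counted with an in-memory increment on the hot path, and a background job periodically folds the pending counts into the database of record, so reads can briefly lag. The key names, the dirty-set trick, and the db helper are illustrative assumptions.

// Eventually consistent likes counter (sketch); requires `await redis.connect()` at startup
import { createClient } from "redis";

const redis = createClient();

// Hot path: O(1) in-memory increment, returns in single-digit milliseconds
async function likePost(postId: string): Promise<void> {
  await redis.incr(`likes:pending:${postId}`);
  await redis.sAdd("likes:dirty", postId);          // remember which counters changed
}

// Background job: fold pending counts into the durable store every N seconds
async function flushLikes(db: { addLikes(postId: string, n: number): Promise<void> }) {
  for (const postId of await redis.sMembers("likes:dirty")) {
    await redis.sRem("likes:dirty", postId);        // removed first so new likes re-mark it
    const pending = Number(await redis.getDel(`likes:pending:${postId}`)); // read-and-reset
    if (pending > 0) await db.addLikes(postId, pending);
  }
}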

3. Architectural Design

Architectural design determines the overall structure, components, and their interactions. It's the blueprint that guides implementation and ensures all requirements are met efficiently.

🏢 Layered Architecture

Organizes the system into horizontal layers (Presentation → Business Logic → Data Access → Database)

Best for: Traditional enterprise apps, monoliths

🔌 Client-Server

Separates concerns: clients handle UI, servers manage data and business logic

Best for: Web apps, mobile backends

Event-Driven

Components communicate via events; producers emit, consumers react asynchronously

Best for: Real-time systems, IoT

📦 Microservices

Independent services organized around business capabilities, each with its own database

Best for: Large-scale apps, multiple teams

🔗 Service-Oriented

Services communicate via protocols (SOAP, REST), with an ESB for integration

Best for: Enterprise integration

🧩 Hexagonal (Ports & Adapters)

Core business logic isolated from external concerns via ports and adapters

Best for: Domain-driven design

4. Data & Storage Design

Data design focuses on how data is structured, stored, accessed, and maintained. Poor data design leads to performance bottlenecks, data inconsistencies, and scalability issues.

Key Considerations:

  • Schema Design: Normalize for consistency or denormalize for performance
  • Indexing Strategy: B-tree, hash, full-text indexes for query optimization
  • Partitioning: Horizontal (sharding) or vertical data splitting (see the sketch after this list)
  • Replication: Master-slave, multi-master for high availability
  • Data Migration: ETL processes for data movement
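
To make the partitioning bullet concrete, here is a minimal hash-based sharding sketch: a stable hash of the partition key decides which shard owns a row, so all rows for one user live together. The Shard interface and query helper are assumptions; real systems usually prefer consistent hashing so that adding shards does not remap every key.

// Hash-based shard routing (illustrative); modulo hashing is simple but reshards poorly
import { createHash } from "crypto";

interface Shard {
  name: string;
  query(sql: string, params: unknown[]): Promise<unknown[]>;
}

function pickShard(shards: Shard[], partitionKey: string): Shard {
  const digest = createHash("md5").update(partitionKey).digest();
  const bucket = digest.readUInt32BE(0) % shards.length;  // stable key -> shard index
  return shards[bucket];
}

// Usage: every query for a given user is routed to that user's shard
async function findOrders(shards: Shard[], userId: string) {
  return pickShard(shards, userId).query("SELECT * FROM orders WHERE user_id = $1", [userId]);
}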

Database Selection:

  • Relational (PostgreSQL, MySQL): Structured data, ACID transactions
  • Document (MongoDB): JSON documents, flexible schema
  • Key-Value (Redis): Fast caching, session storage
  • Column-Family (Cassandra): Time-series, write-heavy workloads
  • Graph (Neo4j): Social networks, relationships
  • Search (Elasticsearch): Full-text search, analytics

5. Deployment & Infrastructure Design

Deployment design defines how the system is packaged, deployed, and operated in production. Modern deployment strategies enable rapid releases with minimal downtime.

| Component | Purpose | Technologies | Best Practices |
|---|---|---|---|
| CI/CD Pipeline | Automate build, test, deploy | Jenkins, GitLab CI, GitHub Actions | Automated tests, staged rollouts |
| Containerization | Package with dependencies | Docker, containerd | Multi-stage builds, small images |
| Orchestration | Manage container lifecycle | Kubernetes, Docker Swarm | Auto-scaling, health checks |
| Infrastructure as Code | Version-control infrastructure | Terraform, CloudFormation, Ansible | Declarative configs, state management |
| Monitoring | Observe system health | Prometheus, Datadog, New Relic | SLIs, SLOs, alerts |

Section 02

Functional vs Non-Functional Requirements

Requirements form the foundation of system design. Functional requirements define what the system does, while non-functional requirements specify how well it does it. Both are equally critical for project success.

Functional Requirements

Define specific behaviors, features, and functions the system must provide to satisfy business needs and user goals.

Example: Video Streaming Platform

  1. Video Upload: Users can upload videos (MP4, AVI, MOV) up to 5GB. System supports drag-and-drop and multi-file selection.
  2. Transcoding: Automatic conversion to multiple resolutions (2160p, 1080p, 720p, 480p, 360p) with adaptive bitrate streaming.
  3. Thumbnail Generation: System creates 10 thumbnails at different timestamps; creator selects default.
  4. Playback Controls: Play, pause, seek, volume, speed (0.25x to 2x), quality selector, fullscreen, picture-in-picture.
  5. Social Features: Like, dislike, comment, share, subscribe, create playlists, save to watch later.
  6. Search & Discovery: Full-text search, filters (duration, upload date, views), recommendations based on history.
  7. Analytics: Creators view watch time, demographics, traffic sources, revenue reports.
  8. Content Moderation: AI-based flagging of inappropriate content; manual review queue for admins.

Characteristics of Good Functional Requirements:
  • ✓ Clear and unambiguous
  • ✓ Testable and verifiable
  • ✓ Traceable to business goals
  • ✓ Feasible within constraints
  • ✓ Prioritized by importance

Non-Functional Requirements

Specify quality attributes, performance characteristics, and constraints that determine system success and user satisfaction.

Example: Video Streaming Platform

  1. Performance: Video buffering starts <2 seconds. Seek operations <500ms. Upload speed 10MB/s minimum.
  2. Scalability: Support 50M concurrent viewers. Handle 100K uploads/hour. Store 500PB of video content.
  3. Availability: 99.99% uptime SLA (52.6 min downtime/year). Multi-region deployment with automatic failover.
  4. Reliability: Zero data loss guarantee. Automatic retry for failed uploads. Self-healing infrastructure.
  5. Security: AES-256 encryption for stored videos. TLS 1.3 for transmission. DRM for premium content.
  6. Usability: Support 25 languages. WCAG 2.1 AA accessibility. Works on iOS 14+, Android 10+, modern browsers.
  7. Cost Efficiency: CDN bandwidth cost <$0.02/GB. Storage costs optimized with tiered storage.
  8. Compliance: COPPA compliance for users under 13. GDPR for EU users. Content ID system for copyright.

Characteristics of Good Non-Functional Requirements:
  • ✓ Quantifiable and measurable
  • ✓ Realistic and achievable
  • ✓ Aligned with business priorities
  • ✓ Trade-offs documented
  • ✓ Testable through benchmarks

Understanding Trade-offs in Requirements

Real-world system design requires balancing conflicting requirements. Here are common trade-offs you'll encounter:

Performance vs Security

Conflict: Encryption, authentication checks, and security validations add latency

Solution: Use hardware acceleration for encryption, implement smart caching for auth tokens, optimize security checks on critical paths

Consistency vs Availability

Conflict: CAP theorem - can't have perfect consistency and availability during network partitions

Solution: Use eventual consistency for non-critical data (social feeds), strong consistency for financial transactions

Scalability vs Complexity

Conflict: Microservices enable infinite scaling but increase operational complexity

Solution: Start with modular monolith, extract services when specific scaling needs emerge

Cost vs Performance

Conflict: High performance requires premium infrastructure (faster CPUs, more RAM, global CDN)

Solution: Implement intelligent caching, auto-scaling, database query optimization, use CDN strategically

Time-to-Market vs Quality

Conflict: Rushing features can compromise code quality, testing, and architecture

Solution: MVP approach with quality gates, technical debt tracking, regular refactoring sprints

Flexibility vs Security

Conflict: Open APIs and integrations increase attack surface

Solution: API gateway with rate limiting, OAuth 2.0, input validation, regular security audits

Section 03

System Architecture Basics

System architecture is the fundamental organization of a software system, including its components, their relationships, and the principles governing their design and evolution. A well-designed architecture enables scalability, maintainability, and resilience.

System Layers Visualization

┌────────────────────────────────────────────────────────┐
│                    [CLIENT LAYER]                      │
│          Web Browser | Mobile App | Desktop            │
│         (React, Vue, Angular, Flutter, SwiftUI)        │
└────────────────────────────────────────────────────────┘
                           ↓ HTTPS/WSS
┌────────────────────────────────────────────────────────┐
│                   [CDN / EDGE LAYER]                   │
│        Cloudflare, AWS CloudFront, Akamai              │
│       (Static Assets, DDoS Protection, Caching)        │
└────────────────────────────────────────────────────────┘
                           ↓
┌────────────────────────────────────────────────────────┐
│                   [LOAD BALANCER]                      │
│         NGINX, HAProxy, AWS ALB/NLB, F5                │
│    (Traffic Distribution, SSL Termination, Health)     │
└────────────────────────────────────────────────────────┘
                           ↓
┌────────────────────────────────────────────────────────┐
│              [APPLICATION LAYER - API TIER]            │
│        Node.js | Django | Spring Boot | Go             │
│         Multiple Instances (Horizontal Scaling)        │
│              Stateless for Easy Scaling                │
└────────────────────────────────────────────────────────┘
         ↓                  ↓                  ↓
┌──────────────────┐ ┌────────────┐ ┌─────────────────┐
│  [CACHE LAYER]   │ │ [MESSAGE]  │ │ [SEARCH INDEX]  │
│  Redis/Memcached │ │   QUEUE    │ │  Elasticsearch  │
│  (Session, Data) │ │ Kafka/RMQ  │ │  (Full-text)    │
└──────────────────┘ └────────────┘ └─────────────────┘
                           ↓
┌────────────────────────────────────────────────────────┐
│            [BUSINESS LOGIC LAYER - SERVICES]           │
│     Microservices / Service-Oriented Architecture      │
│   User Service | Payment | Notification | Analytics    │
└────────────────────────────────────────────────────────┘
                           ↓
┌────────────────────────────────────────────────────────┐
│              [DATA ACCESS LAYER - DAL]                 │
│          ORM (Hibernate, TypeORM, SQLAlchemy)          │
│         Repository Pattern, Query Builders             │
└────────────────────────────────────────────────────────┘
                           ↓
┌────────────────────────────────────────────────────────┐
│                [DATABASE / STORAGE LAYER]              │
│  Primary DB: PostgreSQL/MySQL (Write Operations)       │
│  Read Replicas: Multiple copies (Read Operations)      │
│  NoSQL: MongoDB, Cassandra (Specific Use Cases)        │
│  Object Storage: S3, GCS (Files, Images, Videos)       │
└────────────────────────────────────────────────────────┘
| Layer | Responsibilities | Technologies | Scaling Strategy |
|---|---|---|---|
| Presentation | UI rendering, user input, client-side validation, responsive design | React, Vue, Angular, Flutter | CDN caching, lazy loading |
| Application | Request routing, authentication, rate limiting, API endpoints | Express, FastAPI, Spring Boot | Horizontal scaling, stateless |
| Business Logic | Core functionality, business rules, workflows, orchestration | Domain models, services | Microservices, async processing |
| Data Access | CRUD operations, transactions, query optimization | Repository, DAO patterns | Connection pooling, caching |
| Database | Persistent storage, data integrity, consistency | PostgreSQL, MongoDB, Redis | Replication, sharding |

Architectural Principles

  • Separation of Concerns: Each layer has distinct responsibilities
  • Loose Coupling: Minimize dependencies between components
  • High Cohesion: Related functionality grouped together
  • Abstraction: Hide implementation details behind interfaces
  • Scalability: Design for horizontal and vertical scaling
  • Resilience: Graceful degradation and fault tolerance
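
A small TypeScript sketch of two of these principles, loose coupling and abstraction: the business layer depends on a repository interface rather than a concrete driver, so the storage technology can change without touching business code. The interface, class, and SQL helper names are illustrative.

// Business logic depends on an interface, not on PostgreSQL directly
interface UserRepository {
  findById(id: string): Promise<{ id: string; email: string } | null>;
}

class PostgresUserRepository implements UserRepository {
  constructor(private sql: (query: string, params: unknown[]) => Promise<any[]>) {}
  async findById(id: string) {
    const rows = await this.sql("SELECT id, email FROM users WHERE id = $1", [id]);
    return rows[0] ?? null;
  }
}

// The service layer only sees the abstraction, so swapping storage is a one-line change
class UserService {
  constructor(private repo: UserRepository) {}
  async getProfile(id: string) {
    const user = await this.repo.findById(id);
    if (!user) throw new Error("User not found");
    return user;
  }
}
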
Section 04

High-Level Design vs Low-Level Design

System design is typically divided into two phases: High-Level Design (HLD) focuses on the overall architecture and component relationships, while Low-Level Design (LLD) dives into implementation details, algorithms, and data structures.

HLD vs LLD: Understanding the Difference

High-Level Design (HLD)

Focus: System architecture, component interaction, data flow

What HLD Covers:
  • System architecture overview
  • Component identification
  • Database design (ER diagrams)
  • API contracts
  • Technology stack selection
  • Scalability strategy
  • Security architecture

Audience:

Architects, Product Managers, Stakeholders, Engineering Teams

Deliverables:

  • Architecture diagrams
  • Component diagrams
  • Data flow diagrams
  • Infrastructure layout

Low-Level Design (LLD)

Focus: Implementation details, algorithms, class design

What LLD Covers:
  • Class diagrams
  • Sequence diagrams
  • Database schema (tables, indexes)
  • Algorithms and data structures
  • Method signatures
  • Error handling logic
  • Design patterns

Audience:

Developers, QA Engineers, Technical Leads

Deliverables:

  • Class diagrams
  • Sequence diagrams
  • Pseudocode
  • Database schemas
| Aspect | High-Level Design (HLD) | Low-Level Design (LLD) |
|---|---|---|
| Abstraction Level | High (30,000 ft view) | Low (ground-level view) |
| Created By | Solution architects, system designers | Software engineers, developers |
| Phase | Early design phase | Implementation phase |
| Questions Answered | "What components?", "How do they interact?" | "How to implement?", "What algorithm?" |
| Time Investment | Days to weeks | Weeks to months |
| Changes Impact | High (affects entire system) | Low (affects specific modules) |

Example: URL Shortener (like bit.ly)

High-Level Design

Requirements Gathering:

Functional:

  • Generate short URL from long URL
  • Redirect short URL to original URL
  • Custom short URLs (optional)
  • Analytics (click count, geography)
  • URL expiration

Non-Functional:

  • High availability (99.99%)
  • Low latency (<100ms redirect)
  • Scale: 100M URLs created/month
  • Read heavy (100:1 read/write ratio)
  • Prevent collisions
Capacity Estimation:
// Traffic Estimates
Write requests: 100M URLs/month
  = 100M / (30 days × 24 hrs × 3600 sec)
  ≈ 40 URLs/second

Read requests: 100:1 ratio
  = 40 × 100 = 4,000 reads/second

// Storage Estimates
Average URL size: 500 bytes
100M URLs/month × 500 bytes = 50 GB/month
Over 5 years: 50 GB × 12 × 5 = 3 TB

// Bandwidth
Write: 40 req/s × 500 bytes = 20 KB/s
Read: 4,000 req/s × 500 bytes = 2 MB/s
HLD Architecture Diagram:

            ┌─────────────────────────────────────────────────────────────────┐
            │                    URL Shortener System                          │
            └─────────────────────────────────────────────────────────────────┘

                                    ┌──────────────┐
                                    │    Client    │
                                    │  (Browser)   │
                                    └──────┬───────┘
                                            │
                                    HTTPS  │
                                            ↓
                                ┌───────────────────────┐
                                │   Load Balancer       │
                                │   (NGINX/HAProxy)     │
                                └───────────┬───────────┘
                                            │
                            ┌───────────────┼───────────────┐
                            ↓               ↓               ↓
                    ┌───────────┐   ┌───────────┐   ┌───────────┐
                    │   App     │   │   App     │   │   App     │
                    │ Server 1  │   │ Server 2  │   │ Server 3  │
                    └─────┬─────┘   └─────┬─────┘   └─────┬─────┘
                        │               │               │
                        └───────────────┼───────────────┘
                                        ↓
                                ┌─────────────────┐
                                │  Cache (Redis)  │
                                │  - URL mappings │
                                │  - Hot data     │
                                └────────┬────────┘
                                        │ Cache miss
                                        ↓
                            ┌──────────────────────────┐
                            │   Database (PostgreSQL)  │
                            │   Primary + Read Replicas│
                            │                          │
                            │   Tables:                │
                            │   - urls (id, short,     │
                            │           long, created) │
                            │   - analytics (clicks,   │
                            │                location) │
                            └──────────────────────────┘

            API Endpoints:
            POST /api/shorten
            Body: { "url": "https://example.com/very-long-url" }
            Response: { "short_url": "https://short.ly/abc123" }

            GET /{short_code}
            Redirect to original URL (302)
            
            GET /api/stats/{short_code}
            Returns analytics data
Technology Stack (HLD Level):

  • Frontend: React.js, Next.js
  • Backend: Node.js / Go / Python
  • Database: PostgreSQL + Redis
  • Load Balancer: AWS ALB / NGINX
  • Hosting: AWS EC2 / Kubernetes
  • Monitoring: Prometheus + Grafana

Low-Level Design

URL Encoding Algorithm:
// Approach 1: Base62 Encoding
class URLShortener {
  private static BASE62 =
    "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

  constructor(
    private db: any,          // URL repository
    private cache: any,       // Redis-backed cache service
    private analyticsDB: any  // click analytics store
  ) {}

  // Convert auto-incrementing ID to Base62
  encodeID(id: number): string {
    if (id === 0) return URLShortener.BASE62[0];

    let shortURL = '';
    while (id > 0) {
      shortURL = URLShortener.BASE62[id % 62] + shortURL;
      id = Math.floor(id / 62);
    }
    return shortURL;
  }

  // Decode short URL back to ID
  decodeURL(shortURL: string): number {
    let id = 0;
    for (const char of shortURL) {
      id = id * 62 + URLShortener.BASE62.indexOf(char);
    }
    return id;
  }

  async createShortURL(longURL: string): Promise<string> {
    // 1. Check if URL already exists (avoid duplicates)
    const existing = await this.db.findByLongURL(longURL);
    if (existing) return `https://short.ly/${existing.shortCode}`;

    // 2. Generate new ID (auto-increment or UUID)
    const id = await this.db.getNextID();

    // 3. Encode ID to Base62
    const shortCode = this.encodeID(id);

    // 4. Store mapping
    await this.db.insert({
      id: id,
      shortCode: shortCode,
      longURL: longURL,
      createdAt: new Date(),
      expiresAt: new Date(Date.now() + 365 * 24 * 60 * 60 * 1000) // 1 year
    });

    // 5. Cache for fast lookups
    await this.cache.set(`short:${shortCode}`, longURL, 3600);

    return `https://short.ly/${shortCode}`;
  }

  async redirect(shortCode: string): Promise<string> {
    // 1. Check cache first
    let longURL = await this.cache.get(`short:${shortCode}`);

    if (!longURL) {
      // 2. Cache miss - query database
      const record = await this.db.findByShortCode(shortCode);

      if (!record) {
        throw new Error('URL not found');
      }

      if (record.expiresAt < new Date()) {
        throw new Error('URL expired');
      }

      longURL = record.longURL;

      // 3. Update cache
      await this.cache.set(`short:${shortCode}`, longURL, 3600);
    }

    // 4. Track analytics (async, non-blocking)
    this.trackClick(shortCode).catch(err => console.error(err));

    return longURL;
  }

  private async trackClick(shortCode: string): Promise<void> {
    await this.db.incrementClickCount(shortCode);
    // Store detailed analytics in separate table
    await this.analyticsDB.insert({
      shortCode,
      timestamp: new Date(),
      // ip, userAgent, referer, etc.
    });
  }
}

// Example Usage:
const shortener = new URLShortener(db, cache, analyticsDB);

// Shorten
const shortURL = await shortener.createShortURL(
  'https://example.com/very-long-url-with-params?id=123&source=email'
);
// Returns: https://short.ly/a1B2c3

// Redirect
const originalURL = await shortener.redirect('a1B2c3');
// Returns: https://example.com/very-long-url-with-params?id=123&source=email

Database Schema (LLD):
-- PostgreSQL Schema
CREATE TABLE urls (
    id BIGSERIAL PRIMARY KEY,
    short_code VARCHAR(10) UNIQUE NOT NULL,   -- UNIQUE constraint also creates its index
    long_url TEXT NOT NULL,
    user_id BIGINT,
    created_at TIMESTAMP DEFAULT NOW(),
    expires_at TIMESTAMP,
    click_count INTEGER DEFAULT 0,
    is_custom BOOLEAN DEFAULT FALSE
);

-- PostgreSQL declares secondary indexes outside CREATE TABLE
CREATE INDEX idx_long_url_hash ON urls (MD5(long_url));
CREATE INDEX idx_user_id ON urls (user_id);
CREATE INDEX idx_created_at ON urls (created_at);

-- Clicks are partitioned by month for scale; the partition key (clicked_at)
-- must be part of the primary key
CREATE TABLE clicks (
    id BIGSERIAL,
    short_code VARCHAR(10) NOT NULL REFERENCES urls(short_code),
    clicked_at TIMESTAMP NOT NULL DEFAULT NOW(),
    ip_address INET,
    user_agent TEXT,
    referer TEXT,
    country VARCHAR(2),
    city VARCHAR(100),
    PRIMARY KEY (id, clicked_at)
) PARTITION BY RANGE (clicked_at);

CREATE INDEX idx_short_code_time ON clicks (short_code, clicked_at);

CREATE TABLE clicks_2025_11 PARTITION OF clicks
    FOR VALUES FROM ('2025-11-01') TO ('2025-12-01');

CREATE TABLE users (
    id BIGSERIAL PRIMARY KEY,
    email VARCHAR(255) UNIQUE NOT NULL,
    password_hash VARCHAR(255) NOT NULL,
    created_at TIMESTAMP DEFAULT NOW(),
    plan VARCHAR(20) DEFAULT 'free'
);

Class Diagram (LLD):
            ┌─────────────────────────────────────────┐
            │           URLShortenerService           │
            ├─────────────────────────────────────────┤
            │ - db: Database                          │
            │ - cache: CacheService                   │
            │ - analytics: AnalyticsService           │
            ├─────────────────────────────────────────┤
            │ + createShortURL(longURL): string      │
            │ + redirect(shortCode): string          │
            │ + getAnalytics(shortCode): Stats       │
            │ - encodeID(id): string                  │
            │ - decodeURL(shortURL): number          │
            │ - trackClick(shortCode): void          │
            └─────────────────────────────────────────┘
                            │
                            │ uses
                            ↓
            ┌─────────────────────────────────────────┐
            │          URLRepository                  │
            ├─────────────────────────────────────────┤
            │ - connection: DatabaseConnection        │
            ├─────────────────────────────────────────┤
            │ + findByShortCode(code): URL            │
            │ + findByLongURL(url): URL               │
            │ + insert(url: URL): boolean             │
            │ + incrementClickCount(code): void       │
            │ + getNextID(): number                   │
            └─────────────────────────────────────────┘

            ┌─────────────────────────────────────────┐
            │              CacheService               │
            ├─────────────────────────────────────────┤
            │ - redis: RedisClient                    │
            ├─────────────────────────────────────────┤
            │ + get(key): Promise<string | null>      │
            │ + set(key, value, ttl): Promise<void>   │
            │ + delete(key): Promise<void>            │
            └─────────────────────────────────────────┘

            ┌─────────────────────────────────────────┐
            │              URL (Model)                │
            ├─────────────────────────────────────────┤
            │ + id: number                            │
            │ + shortCode: string                     │
            │ + longURL: string                       │
            │ + userId: number                        │
            │ + createdAt: Date                       │
            │ + expiresAt: Date                       │
            │ + clickCount: number                    │
            └─────────────────────────────────────────┘

When to Focus on HLD vs LLD

Focus on HLD When:

  • Starting a new project
    Define architecture before coding
  • System design interviews
    45-60 min focus on HLD, not implementation
  • Scaling existing system
    Rearchitect for performance/scale
  • Communication with stakeholders
    Explain system without technical jargon

Focus on LLD When:

  • Ready to implement
    HLD approved, time to write code
  • Optimizing algorithms
    Improve time/space complexity
  • Bug fixing
    Debug specific implementation issues
  • Code reviews
    Ensure clean, maintainable code

Section 05

Monolith vs Microservices vs Serverless

Choosing the right architectural pattern is one of the most important decisions in system design. Each approach has its own trade-offs in terms of complexity, scalability, deployment, and team structure. Let's dive deep into each.

Architectural Patterns Comparison

| Aspect | Monolith | Microservices | Serverless |
|---|---|---|---|
| Deployment | Single unit | Independent services | Functions |
| Scalability | Scale entire app | Scale services independently | Auto-scale per function |
| Development Speed | Fast initially | Slower (coordination) | Fast (isolated functions) |
| Complexity | Low | High | Medium |
| Technology Stack | Single stack | Polyglot (multiple languages) | Limited (platform-specific) |
| Team Structure | Single team | Multiple teams per service | Small teams |
| Cost (Small Scale) | Low (1-2 servers) | High (many services) | Very low (pay per use) |
| Cost (Large Scale) | High (scale everything) | Optimized (scale what's needed) | Can be expensive |
| Best For | MVPs, small teams, simple apps | Large teams, complex apps, scale | Event-driven, variable load |

Monolithic Architecture

All components of the application run in a single process. Simpler to develop, test, and deploy initially.

Architecture Diagram

            ┌─────────────────────────────────────┐
            │      Monolithic Application         │
            ├─────────────────────────────────────┤
            │                                     │
            │  ┌───────────────────────────────┐ │
            │  │    Presentation Layer         │ │
            │  │    (UI/Controllers)           │ │
            │  └───────────────────────────────┘ │
            │                                     │
            │  ┌───────────────────────────────┐ │
            │  │    Business Logic Layer       │ │
            │  │  • User Management            │ │
            │  │  • Order Processing           │ │
            │  │  • Payment Handling           │ │
            │  │  • Inventory Management       │ │
            │  │  • Shipping Logic             │ │
            │  └───────────────────────────────┘ │
            │                                     │
            │  ┌───────────────────────────────┐ │
            │  │    Data Access Layer          │ │
            │  │    (Database Access)          │ │
            │  └───────────────────────────────┘ │
            │                                     │
            └──────────────┬──────────────────────┘
                        ↓
                    ┌──────────────┐
                    │   Database   │
                    │  (Postgres)  │
                    └──────────────┘

            Deployed as: Single JAR/WAR/executable
            Scaled by: Running multiple instances
            behind load balancer

Code Example

// Express.js Monolith Structure
project/
├── src/
│   ├── controllers/
│   │   ├── userController.js
│   │   ├── orderController.js
│   │   ├── productController.js
│   │   └── paymentController.js
│   ├── services/
│   │   ├── userService.js
│   │   ├── orderService.js
│   │   └── paymentService.js
│   ├── models/
│   │   ├── User.js
│   │   ├── Order.js
│   │   └── Product.js
│   ├── routes/
│   │   └── index.js
│   └── app.js
└── package.json

// app.js - Single entry point
const express = require('express');
const { userRoutes, orderRoutes, productRoutes, paymentRoutes } = require('./routes');

const app = express();

app.use('/api/users', userRoutes);
app.use('/api/orders', orderRoutes);
app.use('/api/products', productRoutes);
app.use('/api/payments', paymentRoutes);

app.listen(3000, () => {
  console.log('Monolith running on port 3000');
});

// All features share the same:
// - Codebase
// - Database connection
// - Dependencies
// - Deployment pipeline
Advantages
  • Simple Development: One codebase, easy to understand
  • Easy Testing: End-to-end tests straightforward
  • Simple Deployment: One deployment artifact
  • Performance: In-process calls (no network overhead)
  • Easy Debugging: Single application to debug
  • Transaction Management: ACID transactions easy
  • Lower Operational Cost: Few servers initially
Disadvantages
  • Scaling: Must scale entire app (wasteful)
  • Deployment Risk: Small change requires full redeploy
  • Technology Lock-in: Stuck with initial tech choice
  • Large Codebase: Becomes hard to maintain
  • Long Build Times: As app grows, builds slow down
  • Team Coordination: Merge conflicts, blocking each other
  • Single Point of Failure: Bug can crash entire app
When to Choose Monolith
  • MVP/Startup: Need to move fast, validate idea
  • Small Team: < 10 developers
  • Simple Domain: Not too many business capabilities
  • Limited Traffic: Can scale vertically initially
  • Cost-Conscious: Keep infrastructure simple

Pro tip: Start with a well-structured monolith. You can always extract microservices later if needed (modular monolith approach).
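
One hedged sketch of what a "modular monolith" can look like in TypeScript: everything still ships as one deployable, but each module exposes only a narrow interface, so extracting it into a microservice later means swapping the in-process implementation for a network client rather than rewriting call sites. Module and class names here are illustrative.

// orders module: the rest of the app only ever sees this interface
export interface OrdersModule {
  createOrder(userId: string, items: { productId: string; qty: number }[]): Promise<string>;
}

// Today: an in-process implementation inside the monolith
export class InProcessOrdersModule implements OrdersModule {
  async createOrder(userId: string, items: { productId: string; qty: number }[]): Promise<string> {
    // validation, persistence, and event publishing stay behind the module boundary
    return "order-123"; // placeholder id for the sketch
  }
}

// Later: extraction replaces only this binding, not every caller
// const orders: OrdersModule = new HttpOrdersClient("http://order-service");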

Microservices Architecture

Application broken into small, independent services that communicate via APIs. Each service owns its data and can be deployed independently.

Microservices Architecture:

                                    ┌──────────────────┐
                                    │   API Gateway    │
                                    │  (Authentication,│
                                    │   Rate Limiting) │
                                    └────────┬─────────┘
                                            │
                        ┌────────────────────┼────────────────────┐
                        │                    │                    │
                        ↓                    ↓                    ↓
            ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐
            │  User Service    │  │  Order Service   │  │ Product Service  │
            ├──────────────────┤  ├──────────────────┤  ├──────────────────┤
            │ • Registration   │  │ • Create Order   │  │ • Catalog        │
            │ • Authentication │  │ • Order Status   │  │ • Search         │
            │ • Profile        │  │ • Order History  │  │ • Inventory      │
            ├──────────────────┤  ├──────────────────┤  ├──────────────────┤
            │  REST/gRPC API   │  │  REST/gRPC API   │  │  REST/gRPC API   │
            └────────┬─────────┘  └────────┬─────────┘  └────────┬─────────┘
                    │                     │                     │
                    ↓                     ↓                     ↓
            ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
            │   User DB       │  │   Order DB      │  │  Product DB     │
            │  (PostgreSQL)   │  │  (MongoDB)      │  │ (Elasticsearch) │
            └─────────────────┘  └─────────────────┘  └─────────────────┘

                        ↓                    ↓                    ↓
            ┌──────────────────────────────────────────────────────────┐
            │         Message Bus (Kafka / RabbitMQ)                   │
            │  Events: UserCreated, OrderPlaced, InventoryUpdated     │
            └──────────────────────────────────────────────────────────┘
                        ↑                    ↑                    ↑
                        │                    │                    │
            ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐
            │ Payment Service  │  │ Shipping Service │  │ Notification Svc │
            └──────────────────┘  └──────────────────┘  └──────────────────┘

            Each service:
            • Independent deployment
            • Own database (database per service pattern)
            • Technology freedom (polyglot)
            • Team ownership

Service Structure

// Order Service (Node.js)
            order-service/
            ├── src/
            │   ├── api/
            │   │   └── orderController.js
            │   ├── services/
            │   │   └── orderService.js
            │   ├── models/
            │   │   └── Order.js
            │   ├── events/
            │   │   ├── publisher.js
            │   │   └── subscriber.js
            │   └── server.js
            ├── package.json
            ├── Dockerfile
            └── kubernetes/
                └── deployment.yaml

            // orderService.js
            class OrderService {
            async createOrder(orderData) {
                // 1. Validate order
                const validationResult = await this.validateOrder(orderData);
                
                // 2. Save order
                const order = await Order.create(orderData);
                
                // 3. Publish event (async)
                await eventBus.publish('OrderCreated', {
                orderId: order.id,
                userId: order.userId,
                items: order.items,
                total: order.total
                });
                
                return order;
            }
            }

            // Event Listener in Inventory Service
            eventBus.subscribe('OrderCreated', async (event) => {
            // Reduce inventory for ordered items
            for (const item of event.items) {
                await Inventory.decrement(item.productId, item.quantity);
            }
            });

Inter-Service Communication

Synchronous (REST/gRPC)
// Order service calls User service
            async function createOrder(userId, items) {
            // Sync call to verify user
            const user = await fetch(
                `http://user-service/api/users/${userId}`
            );
            
            if (!user.ok) {
                throw new Error('User not found');
            }
            
            // Continue with order creation
            }

Use when: Need immediate response, simple request-reply

Asynchronous (Events)
// Order service publishes event
            await kafka.send({
            topic: 'order-events',
            messages: [{
                key: orderId,
                value: JSON.stringify({
                type: 'OrderCreated',
                orderId,
                userId,
                items
                })
            }]
            });

            // Multiple services react
            // - Inventory: Reduce stock
            // - Payment: Charge card
            // - Shipping: Create shipment
            // - Notification: Send email

Use when: Broadcast to multiple services, eventual consistency OK

Advantages
  • Independent Scaling: Scale services based on load
  • Technology Diversity: Use best tool for each job
  • Fault Isolation: Service failure doesn't crash all
  • Independent Deployment: Deploy services separately
  • Team Autonomy: Teams own services end-to-end
  • Easier Understanding: Smaller, focused codebases
  • Reusability: Services can be reused across apps
Disadvantages
  • Complexity: Distributed system challenges
  • Network Latency: Inter-service calls are slower
  • Data Consistency: Eventual consistency, no ACID
  • Testing: Integration testing is complex
  • Monitoring: Need distributed tracing
  • DevOps Overhead: More services to manage
  • Higher Cost: More infrastructure initially
When to Choose Microservices
  • Large Team: > 20 developers, need team autonomy
  • Complex Domain: Many distinct business capabilities
  • Different Scaling Needs: Some services need more resources
  • Technology Diversity: Need different languages/frameworks
  • High Availability: Can't afford full system downtime

Warning: Don't start with microservices! The complexity is not worth it until you have the scale and team size to justify it.

Serverless Architecture

Run code without managing servers. Functions execute in response to events, auto-scale, and you pay only for execution time.

Serverless Architecture (AWS Lambda Example):

            ┌─────────────────────────────────────────────────────────────┐
            │                      Event Sources                          │
            ├─────────────────────────────────────────────────────────────┤
            │  API Gateway │ S3 Upload │ DynamoDB │ SNS/SQS │ CloudWatch │
            └──────┬───────┴─────┬─────┴────┬─────┴────┬────┴──────┬─────┘
                │             │          │          │           │
                ↓             ↓          ↓          ↓           ↓
            ┌──────────────────────────────────────────────────────────────┐
            │              AWS Lambda Functions (Auto-scale)               │
            ├──────────────────────────────────────────────────────────────┤
            │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐    │
            │  │ Create   │  │ Process  │  │ Generate │  │ Send     │    │
            │  │ User     │  │ Image    │  │ Report   │  │ Email    │    │
            │  │ Function │  │ Function │  │ Function │  │ Function │    │
            │  └──────────┘  └──────────┘  └──────────┘  └──────────┘    │
            └──────┬───────────────┬──────────────┬──────────────┬────────┘
                │               │              │              │
                ↓               ↓              ↓              ↓
            ┌──────────────────────────────────────────────────────────────┐
            │                    Backend Services                          │
            ├──────────────────────────────────────────────────────────────┤
            │  DynamoDB  │  S3 Storage  │  SES (Email)  │  CloudWatch     │
            └──────────────────────────────────────────────────────────────┘

            Characteristics:
            • No server management
            • Auto-scaling (0 to 1000s of instances)
            • Pay per request (not per hour)
            • Event-driven execution
            • Stateless functions

Lambda Function Example

// Image Processing Lambda (Node.js)
            const AWS = require('aws-sdk');
            const sharp = require('sharp');
            const s3 = new AWS.S3();

            exports.handler = async (event) => {
            // Triggered by S3 upload
            const bucket = event.Records[0].s3.bucket.name;
            const key = event.Records[0].s3.object.key;
            
            // Download image from S3
            const image = await s3.getObject({
                Bucket: bucket,
                Key: key
            }).promise();
            
            // Resize image (3 sizes)
            const sizes = [
                { name: 'thumbnail', width: 150 },
                { name: 'medium', width: 500 },
                { name: 'large', width: 1200 }
            ];
            
            for (const size of sizes) {
                const resized = await sharp(image.Body)
                .resize(size.width)
                .toBuffer();
                
                // Upload resized image
                await s3.putObject({
                Bucket: bucket,
                Key: `${size.name}/${key}`,
                Body: resized,
                ContentType: 'image/jpeg'
                }).promise();
            }
            
            return {
                statusCode: 200,
                body: 'Images processed'
            };
            };

            // Auto-scales: 1 upload = 1 execution
            // 1000 uploads = 1000 parallel executions
            // Cost: Pay only for execution time

API Endpoint Example

// API Gateway + Lambda
// GET /api/users/{id}
// handler.js
const AWS = require('aws-sdk');
const dynamodb = new AWS.DynamoDB.DocumentClient();

module.exports.getUser = async (event) => {
  const userId = event.pathParameters.id;

  // Query DynamoDB
  const params = {
    TableName: 'Users',
    Key: { userId }
  };

  const result = await dynamodb.get(params).promise();

  if (!result.Item) {
    return {
      statusCode: 404,
      body: JSON.stringify({ error: 'User not found' })
    };
  }

  return {
    statusCode: 200,
    headers: {
      'Content-Type': 'application/json',
      'Access-Control-Allow-Origin': '*'
    },
    body: JSON.stringify(result.Item)
  };
};

// Serverless Framework config
// serverless.yml
service: user-api

provider:
  name: aws
  runtime: nodejs18.x
  region: us-east-1

functions:
  getUser:
    handler: handler.getUser
    events:
      - http:
          path: users/{id}
          method: get
          cors: true
Advantages
  • No Server Management: Provider handles infrastructure
  • Auto-Scaling: Scales to 0 and to 1000s automatically
  • Cost-Effective: Pay per execution (not idle time)
  • Fast Development: Focus on code, not infrastructure
  • Event-Driven: Perfect for async workflows
  • High Availability: Built-in redundancy
  • Quick Deployments: Deploy functions in seconds
Disadvantages
  • Cold Start: First request to an idle function is slower (100-1000ms); see the mitigation sketch after this list
  • Vendor Lock-in: AWS Lambda != Azure Functions
  • Execution Limits: 15min timeout, memory limits
  • Local Testing: Hard to replicate cloud environment
  • Debugging: Harder than traditional apps
  • Cost at Scale: Can be expensive for high-traffic
  • Stateless: Can't maintain connections
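
The usual first-line mitigation for cold starts and the stateless model is to create expensive objects (SDK clients, connection pools) outside the handler, so they are reused across warm invocations of the same container. A minimal sketch, assuming the aws-sdk v2 client used elsewhere in this guide and a hypothetical Users table:

// Initialized once per container (paid on cold start), reused on warm invocations
import { DynamoDB } from "aws-sdk";
const dynamodb = new DynamoDB.DocumentClient();

export const handler = async (event: { pathParameters: { id: string } }) => {
  const result = await dynamodb
    .get({ TableName: "Users", Key: { userId: event.pathParameters.id } })
    .promise();
  return {
    statusCode: result.Item ? 200 : 404,
    body: JSON.stringify(result.Item ?? { error: "User not found" }),
  };
};
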
When to Choose Serverless
  • Variable Load: Traffic spikes unpredictably
  • Event-Driven: Triggered by uploads, messages, schedules
  • Rapid Prototyping: Want to move fast without ops
  • Background Jobs: Image processing, data transforms
  • Low Traffic: Don't want to pay for idle servers

Best Practice: Hybrid approach - Use serverless for specific workloads (APIs, background jobs) alongside traditional servers for core services.

Decision Matrix: Which Architecture?

Choose Monolith If:

  • ✓ Building MVP
  • ✓ Team < 10 people
  • ✓ Simple domain
  • ✓ Need speed to market
  • ✓ Budget constraints
  • ✓ Predictable load

Examples:

Basecamp, StackOverflow (started monolith)

Choose Microservices If:

  • ✓ Team > 20 people
  • ✓ Complex domain
  • ✓ Different scaling needs
  • ✓ Need tech diversity
  • ✓ Independent deployments
  • ✓ High availability critical

Examples:

Netflix, Uber, Amazon

Choose Serverless If:

  • ✓ Variable/unpredictable load
  • ✓ Event-driven workflows
  • ✓ Want zero ops
  • ✓ Short-running tasks
  • ✓ Cost optimization
  • ✓ Rapid experimentation

Examples:

Coca-Cola, iRobot, Nordstrom

Golden Rule

"Start simple (monolith), evolve as needed. Architecture should solve real problems, not theoretical ones."

Most successful companies started with monoliths and evolved. Premature microservices is a common mistake.

Section 06

API Design & Communication

APIs (Application Programming Interfaces) are the contracts that define how different components of your system communicate. Well-designed APIs enable seamless integration, scalability, and maintainability. Poor API design leads to technical debt, integration issues, and frustrated developers.

API Communication Paradigms

| API Type | Protocol | Data Format | Best Use Cases | Performance |
|---|---|---|---|---|
| REST | HTTP/HTTPS | JSON, XML | Web APIs, CRUD operations, public APIs | Good (stateless, cacheable) |
| GraphQL | HTTP/HTTPS | JSON | Complex queries, mobile apps, flexible data needs | Excellent (fetch exactly what you need) |
| gRPC | HTTP/2 | Protocol Buffers | Microservices, high-performance, real-time | Excellent (binary, multiplexing) |
| WebSocket | WS/WSS | JSON, Binary | Real-time chat, gaming, live updates | Excellent (persistent connection) |
| SOAP | HTTP, SMTP | XML | Enterprise, banking, legacy systems | Lower (verbose XML) |

REST API Design Principles

Resource-Based URLs
✅ Good REST URLs:
        GET    /api/v1/users
        GET    /api/v1/users/123
        POST   /api/v1/users
        PUT    /api/v1/users/123
        DELETE /api/v1/users/123
        GET    /api/v1/users/123/orders

        ❌ Bad REST URLs:
        GET    /api/v1/getAllUsers
        POST   /api/v1/createUser
        GET    /api/v1/user?action=delete&id=123
HTTP Methods & Status Codes
  • GET: Retrieve resources (200 OK, 404 Not Found)
  • POST: Create resources (201 Created, 400 Bad Request)
  • PUT: Update/Replace (200 OK, 204 No Content)
  • PATCH: Partial update (200 OK)
  • DELETE: Remove resources (204 No Content)
  • 401: Unauthorized (missing/invalid auth)
  • 403: Forbidden (authenticated but no permission)
  • 429: Too Many Requests (rate limit exceeded)
  • 500: Internal Server Error
  • 503: Service Unavailable
REST API Best Practices
  • Versioning: Use URL versioning (/api/v1/) or header versioning
  • Pagination: Use limit/offset or cursor-based pagination
  • Filtering & Sorting: /users?status=active&sort=created_at:desc
  • HATEOAS: Include links to related resources in responses
  • Rate Limiting: Return X-RateLimit headers
  • Idempotency: PUT and DELETE should be idempotent
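
To ground a few of these practices, here is a hedged Express sketch of a versioned, paginated collection endpoint that also returns rate-limit headers. The db helper, the quota numbers, and the header values are illustrative assumptions; in production the remaining-quota value would come from a real rate limiter.

// GET /api/v1/users?limit=20&offset=40 (sketch)
import express from "express";

declare const db: { listUsers(opts: { limit: number; offset: number }): Promise<unknown[]> };

const app = express();

app.get("/api/v1/users", async (req, res) => {
  const limit = Math.min(Number(req.query.limit ?? 20), 100); // cap the page size
  const offset = Number(req.query.offset ?? 0);
  const users = await db.listUsers({ limit, offset });

  res.set({
    "X-RateLimit-Limit": "1000",     // advertised quota for this client
    "X-RateLimit-Remaining": "997",  // would be supplied by the rate limiter
  });
  res.json({
    data: users,
    paging: { limit, offset, next: `/api/v1/users?limit=${limit}&offset=${offset + limit}` },
  });
});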

GraphQL: Query Exactly What You Need

GraphQL Query
query {
        user(id: "123") {
            name
            email
            posts(limit: 5) {
            title
            createdAt
            comments {
                author
                text
            }
            }
        }
        }

Advantage: Fetch nested data in one request, no over-fetching

Equivalent REST Calls
// Multiple requests needed:
        GET /api/users/123
        GET /api/users/123/posts?limit=5
        GET /api/posts/1/comments
        GET /api/posts/2/comments
        GET /api/posts/3/comments
        ...

        // 5+ API calls vs 1 GraphQL query
        // More network overhead
        // Higher latency

Disadvantage: Can cause N+1 query problems, harder to cache
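
The standard mitigation for that N+1 problem is request-scoped batching, for example with the dataloader package: each post resolver asks for its comments individually, but DataLoader coalesces those calls into one batched database query. The comment-store shape below is an assumption for illustration.

// Batching comment lookups for a GraphQL resolver (sketch)
import DataLoader from "dataloader";

declare const db: {
  commentsForPosts(postIds: readonly string[]): Promise<{ postId: string; author: string; text: string }[]>;
};

const commentLoader = new DataLoader(async (postIds: readonly string[]) => {
  const rows = await db.commentsForPosts(postIds);               // single batched query
  return postIds.map(id => rows.filter(r => r.postId === id));   // results in input order
});

// Resolver: many load() calls per request collapse into one batch
const resolvers = {
  Post: {
    comments: (post: { id: string }) => commentLoader.load(post.id),
  },
};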

gRPC: High-Performance Microservices Communication

Protocol Buffer Definition (.proto)
syntax = "proto3";

        service UserService {
        rpc GetUser (GetUserRequest) returns (User);
        rpc ListUsers (ListUsersRequest) returns (stream User);
        rpc CreateUser (CreateUserRequest) returns (User);
        }

        message User {
        int32 id = 1;
        string name = 2;
        string email = 3;
        int64 created_at = 4;
        }

        message GetUserRequest {
        int32 id = 1;
        }
Advantages
  • Binary protocol (faster)
  • HTTP/2 multiplexing
  • Strongly typed
  • Bidirectional streaming
  • Auto code generation
Disadvantages
  • Not human-readable
  • Limited browser support
  • Steeper learning curve
  • Debugging harder
Best For
  • Internal microservices
  • Real-time systems
  • High-throughput APIs
  • Mobile backends

Synchronous vs Asynchronous Communication

Synchronous (Request-Response)

Client waits for server response before proceeding. Direct, immediate feedback.

Client                    Server
        |                         |
        |-------- Request ------->|
        |                         |
        |                    (Processing)
        |                         |
        |<------- Response -------|
        |                         |
        (Continue execution)

Use Cases:

  • User authentication
  • Payment processing
  • Data retrieval (CRUD)
  • Real-time validation

Technologies:

HTTP/REST, gRPC, GraphQL

Asynchronous (Event-Driven)

Client sends message and continues. Server processes independently and may respond later.

Producer              Queue              Consumer
        |                     |                   |
        |---- Publish Event ->|                   |
        |                     |                   |
        (Continue immediately)  |---- Deliver ----->|
        |                     |                   |
        |                     |              (Processing)
        |                     |                   |
        |<------- (Optional Callback) ------------|

Use Cases:

  • Email notifications
  • Order processing
  • Video transcoding
  • Background jobs

Technologies:

RabbitMQ, Kafka, AWS SQS/SNS

API Security Best Practices

1. Authentication & Authorization

  • ✓ Use OAuth 2.0 for third-party access
  • ✓ JWT tokens with short expiry (15-30 min)
  • ✓ Refresh token rotation
  • ✓ API keys for server-to-server

2. Rate Limiting

  • ✓ Token bucket algorithm
  • ✓ Per-user and per-IP limits
  • ✓ Return 429 with Retry-After header
  • ✓ Different tiers (free: 100/hr, paid: 10k/hr)

3. Input Validation

  • ✓ Validate all input parameters
  • ✓ Use schema validation (JSON Schema)
  • ✓ Sanitize SQL/NoSQL queries
  • ✓ Prevent injection attacks

4. HTTPS & Encryption

  • ✓ TLS 1.3 for all communications
  • ✓ Certificate pinning for mobile apps
  • ✓ Encrypt sensitive data at rest
  • ✓ HSTS headers
Section 07

Databases & Data Models

Choosing the right database and designing an efficient data model are among the most critical decisions in system design. The wrong choice can lead to performance bottlenecks, scalability issues, and costly migrations later.

SQL vs NoSQL: When to Use Each

Aspect          | SQL (Relational)                                      | NoSQL (Non-Relational)
Data Model      | Tables with rows and columns, fixed schema            | Documents, key-value, column-family, graph
Schema          | Strict, predefined schema (DDL)                       | Flexible, schema-less or dynamic schema
Transactions    | ACID compliant (strong consistency)                   | BASE model (eventual consistency)
Scaling         | Vertical scaling (scale up)                           | Horizontal scaling (scale out)
Query Language  | SQL (standardized)                                    | Varies by database (MongoDB Query, CQL)
Best For        | Complex queries, joins, transactions, financial data  | Flexible data, high write throughput, rapid development
Examples        | PostgreSQL, MySQL, Oracle, SQL Server                 | MongoDB, Cassandra, Redis, Neo4j, DynamoDB

Relational Databases (SQL)

Normalization Forms
1NF (First Normal Form)

Each column contains atomic values, no repeating groups

Example:
❌ orders: customer_phones: "123, 456"
✅ phone_numbers: table with customer_id
2NF (Second Normal Form)

1NF + no partial dependencies on composite keys

Example:
❌ order_items: (order_id, product_id, customer_name)
✅ customer_name in orders table only
3NF (Third Normal Form)

2NF + no transitive dependencies

Example:
❌ orders: (id, customer_id, customer_city, city_zip)
✅ city_zip in separate cities table
When to Denormalize

While normalization reduces redundancy, denormalization can improve read performance:

  • Read-heavy workloads where joins are expensive
  • Reporting and analytics databases
  • Frequently accessed data that rarely changes
  • When caching isn't sufficient

Database Indexing Strategies

Index Type       | How It Works                              | Best For                                    | Trade-offs
B-Tree Index     | Balanced tree structure, sorted data      | Range queries, sorting, unique constraints  | Write overhead, storage space
Hash Index       | Hash function maps keys to locations      | Exact match lookups (WHERE id = 123)        | No range queries, no sorting
Full-Text Index  | Inverted index for text search            | Search in text fields, fuzzy matching       | Large storage, complex queries
Composite Index  | Index on multiple columns (a, b, c)       | Queries filtering on multiple columns       | Column order matters for performance
Covering Index   | Index includes all queried columns        | Avoid table lookups, fast reads             | Larger index size
Partial Index    | Index only subset of rows (WHERE clause)  | Frequent queries on specific data           | Smaller size, faster writes
-- Creating indexes in PostgreSQL

        -- Simple B-tree index
        CREATE INDEX idx_users_email ON users(email);

        -- Composite index (order matters!)
        CREATE INDEX idx_orders_customer_date ON orders(customer_id, created_at);

        -- Partial index (only active users)
        CREATE INDEX idx_active_users ON users(email) WHERE status = 'active';

        -- Covering index (includes columns in SELECT)
        CREATE INDEX idx_users_covering ON users(email) INCLUDE (name, created_at);

        -- Full-text search index
        CREATE INDEX idx_posts_search ON posts USING GIN(to_tsvector('english', content));

        -- Unique index
        CREATE UNIQUE INDEX idx_users_email_unique ON users(email);

NoSQL Database Categories

📄 Document Databases

MongoDB, CouchDB, Firestore

Store data as JSON-like documents with flexible schemas.

{
  "_id": "user123",
  "name": "John Doe",
  "email": "john@example.com",
  "addresses": [
    {
      "type": "home",
      "street": "123 Main St",
      "city": "NYC"
    }
  ],
  "preferences": {
    "theme": "dark",
    "notifications": true
  }
}

Best For:

  • Content management systems
  • User profiles and catalogs
  • Real-time analytics
  • Rapid prototyping
🔑 Key-Value Stores

Redis, Memcached, DynamoDB

Simple key-value pairs, extremely fast lookups.

// Redis commands
        SET user:123:session "abc-def-789"
        GET user:123:session
        // Returns: "abc-def-789"

        HSET user:123 name "John" email "j@example.com"
        HGET user:123 name
        // Returns: "John"

        EXPIRE user:123:session 3600  // TTL: 1 hour

Best For:

  • Caching layer
  • Session management
  • Real-time leaderboards
  • Rate limiting counters
📊 Column-Family Stores

Cassandra, HBase, ScyllaDB

Store data in column families, optimized for write-heavy workloads.

-- CQL (Cassandra Query Language)
CREATE TABLE user_activity (
  user_id UUID,
  timestamp TIMESTAMP,
  action TEXT,
  metadata MAP<TEXT, TEXT>,
  PRIMARY KEY (user_id, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);

-- Write-optimized: append-only log structure

Best For:

  • Time-series data
  • IoT sensor data
  • Event logging
  • High write throughput
🕸️ Graph Databases

Neo4j, ArangoDB, Amazon Neptune

Store nodes and relationships, optimized for connected data.

// Cypher Query (Neo4j)
        MATCH (user:User {name: "John"})-[:FRIENDS_WITH]->(friend)
        -[:LIKES]->(post:Post)
        WHERE post.created_at > date('2025-01-01')
        RETURN friend.name, post.title

        // Find friends' posts efficiently

Best For:

  • Social networks
  • Recommendation engines
  • Fraud detection
  • Knowledge graphs

Database Sharding & Partitioning

Sharding splits your database horizontally across multiple servers to handle massive scale.

Hash-Based Sharding

hash(user_id) % num_shards = shard_id

Pro: Even distribution
Con: Resharding is hard

Range-Based Sharding

user_id 1-1M → Shard1, 1M-2M → Shard2

Pro: Range queries fast
Con: Hotspots possible

Geo-Based Sharding

US users → US shard, EU → EU shard

Pro: Low latency, compliance
Con: Uneven loads

Sharding Challenges:

  • Cross-shard queries are expensive
  • No distributed transactions (need eventual consistency)
  • Resharding requires downtime or complex migration
  • Application logic must be shard-aware
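As a rough sketch of what hash-based shard routing looks like in application code (the shard hosts and hashing choice are illustrative; real systems often use consistent hashing precisely to make resharding cheaper):

const crypto = require('crypto');

// Illustrative shard hosts
const SHARDS = ['db-shard-0.internal', 'db-shard-1.internal', 'db-shard-2.internal', 'db-shard-3.internal'];

// Stable hash of the shard key → shard index
function shardFor(userId) {
  const digest = crypto.createHash('md5').update(String(userId)).digest();
  return digest.readUInt32BE(0) % SHARDS.length;
}

console.log(SHARDS[shardFor(123)]); // every query for user 123 goes to the same shard

// Caveat: changing SHARDS.length remaps most keys, which is why naive modulo
// sharding makes adding or removing shards an expensive migration.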
Section 08

Storage Concepts

Modern applications require different storage solutions for different types of data. Understanding the characteristics, performance, and cost implications of each storage type is crucial for system design.

Storage Types Comparison

Storage Type    | Characteristics                               | Use Cases                                       | Examples                                   | Cost
Block Storage   | Raw storage volumes, low latency, high IOPS   | Databases, boot volumes, transactional data     | AWS EBS, Azure Disk, SAN                   | $$$$
Object Storage  | Unlimited scale, HTTP access, metadata-rich   | Media files, backups, static assets, archives   | AWS S3, Google Cloud Storage, Azure Blob   | $
File Storage    | Hierarchical, shared access, POSIX compliant  | Shared documents, home directories, CMS         | AWS EFS, Azure Files, NFS, SMB             | $$$
Cold Storage    | Archival, rare access, retrieval delays       | Compliance archives, backups, logs              | AWS Glacier, Google Coldline               | ¢

Block Storage

Provides raw storage blocks that can be formatted with any file system. Attached to VMs/containers as volumes.

Characteristics:
  • Low Latency: 1-5ms read/write
  • High IOPS: 16,000-64,000 IOPS for SSDs
  • Snapshots: Point-in-time backups
  • Encryption: At-rest and in-transit
AWS EBS Volume Types:
  • gp3 (General Purpose SSD): 3,000-16,000 IOPS
  • io2 (Provisioned IOPS SSD): Up to 64,000 IOPS
  • st1 (Throughput Optimized HDD): Big data, logs
  • sc1 (Cold HDD): Infrequent access

Object Storage

Stores data as objects with unique identifiers, accessible via HTTP/HTTPS APIs. Highly scalable and durable.

// AWS S3 Structure
        s3://my-bucket/users/avatars/user-123.jpg

        Object Components:
        - Key: users/avatars/user-123.jpg
        - Value: Binary data
        - Metadata: Content-Type, Cache-Control, custom tags
        - Version ID: (if versioning enabled)
        - Access Control: IAM policies, bucket policies

        // Upload with metadata
        PUT /users/avatars/user-123.jpg HTTP/1.1
        Host: my-bucket.s3.amazonaws.com
        Content-Type: image/jpeg
        x-amz-storage-class: INTELLIGENT_TIERING
        x-amz-server-side-encryption: AES256
S3 Storage Classes
  • S3 Standard: $0.023/GB, frequent access
  • S3 Intelligent-Tiering: Auto-moves between tiers
  • S3 Infrequent Access: $0.0125/GB, monthly access
  • S3 Glacier: $0.004/GB, archival (hours retrieval)
  • S3 Glacier Deep Archive: $0.00099/GB (12hrs retrieval)
Best Practices
  • ✓ Use CloudFront CDN for static assets
  • ✓ Enable versioning for critical data
  • ✓ Lifecycle policies for cost optimization
  • ✓ Multipart upload for files > 100MB
  • ✓ Use presigned URLs for temporary access
  • ✓ Enable server access logging
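As an example of the presigned-URL practice above, here is a minimal sketch using the AWS SDK for JavaScript v3; the bucket name is a placeholder and the key layout follows the example earlier in this section.

const { S3Client, GetObjectCommand } = require('@aws-sdk/client-s3');
const { getSignedUrl } = require('@aws-sdk/s3-request-presigner');

const s3 = new S3Client({ region: 'us-east-1' });

// Returns a URL that grants read access to one object for 15 minutes
async function avatarDownloadUrl(userId) {
  const command = new GetObjectCommand({
    Bucket: 'my-bucket',                      // placeholder bucket
    Key: `users/avatars/user-${userId}.jpg`,  // key layout from the example above
  });
  return getSignedUrl(s3, command, { expiresIn: 900 }); // 900 seconds = 15 minutes
}

avatarDownloadUrl(123).then((url) => console.log(url));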

Content Delivery Network (CDN)

CDNs distribute content to edge locations worldwide, reducing latency and improving user experience.

User in Japan                    User in USA                 User in Europe
            |                              |                              |
            |                              |                              |
        [Tokyo Edge]                 [LA Edge]                    [London Edge]
            |                              |                              |
            ↓                              ↓                              ↓
        ←------ Cache Miss ----→  [Origin Server (US-East)]  ←----- Cache Miss -----→
                                    (Your application/S3)

        Cache Hit: ~20-50ms latency
        Origin Request: 200-500ms latency

        CDN Benefits:
        ✓ 80-95% requests served from edge (cache hits)
        ✓ Reduced origin server load
        ✓ DDoS protection (CloudFlare, Akamai)
        ✓ SSL/TLS termination at edge
        ✓ Image optimization & compression

Cache Strategies

  • Cache-Control: max-age=3600
    Browser caches for 1 hour
  • ETag / If-None-Match
    Conditional requests
  • Vary: Accept-Encoding
    Separate cache for gzip
  • Immutable assets
    main.abc123.js (versioned)

Popular CDNs

  • Cloudflare: 275+ locations, free tier
  • AWS CloudFront: 450+ PoPs, AWS integration
  • Akamai: Enterprise, 4000+ servers
  • Fastly: Real-time purging, VCL

What to Cache

  • ✓ Static assets (JS, CSS, images)
  • ✓ Product catalog pages
  • ✓ API responses (with short TTL)
  • ✗ User-specific content
  • ✗ Checkout/payment pages

Distributed File Systems

HDFS (Hadoop Distributed File System)

Store massive datasets across clusters, fault-tolerant with replication.

  • Block size: 128MB default
  • Replication factor: 3x
  • Write once, read many
  • Best for batch processing

GFS (Google File System)

Inspiration for HDFS, powers Google's infrastructure.

  • Chunk size: 64MB
  • Master-slave architecture
  • Optimized for large files
  • Used by BigTable, MapReduce

Amazon EFS (Elastic File System)

Managed NFS for AWS, scales automatically.

  • POSIX-compliant
  • Multi-AZ replication
  • Pay for what you use
  • Works with EC2, Lambda, ECS

Ceph

Open-source, unified storage (object + block + file).

  • CRUSH algorithm for data placement
  • Self-healing and self-managing
  • Used by OpenStack
  • Highly scalable
Section 09

Caching & Performance

Caching is one of the most powerful techniques to improve system performance. By storing frequently accessed data in fast storage, you can reduce database load, lower latency, and handle more traffic with the same infrastructure.

Multi-Layer Caching Architecture

┌─────────────────────────────────────────────────────────┐
│              Layer 1: Browser Cache                     │
│         (localStorage, sessionStorage, Cache API)       │
│              Latency: ~1ms | Hit Rate: 60-70%           │
└─────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────┐
│              Layer 2: CDN / Edge Cache                  │
│         (CloudFlare, CloudFront, Akamai)                │
│             Latency: ~20-50ms | Hit Rate: 80-90%        │
└─────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────┐
│         Layer 3: Application Cache (In-Memory)          │
│              (Node cache, Guava, Caffeine)              │
│              Latency: <1ms | Hit Rate: 70-85%           │
└─────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────┐
│          Layer 4: Distributed Cache (Redis)             │
│         (Redis, Memcached, Hazelcast)                   │
│             Latency: 1-5ms | Hit Rate: 85-95%           │
└─────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────┐
│              Layer 5: Database Query Cache              │
│         (MySQL Query Cache, PostgreSQL)                 │
│             Latency: 5-20ms | Hit Rate: 60-80%          │
└─────────────────────────────────────────────────────────┘
                            ↓
┌─────────────────────────────────────────────────────────┐
│                 Database (Disk Storage)                 │
│              Latency: 50-200ms (Cache Miss)             │
└─────────────────────────────────────────────────────────┘

Cache Hit Rate Impact

Example: Database query takes 100ms, cache lookup takes 1ms

  • 90% hit rate: Average: (0.9 × 1ms) + (0.1 × 100ms) = 10.9ms
  • 95% hit rate: Average: (0.95 × 1ms) + (0.05 × 100ms) = 5.95ms
  • 99% hit rate: Average: (0.99 × 1ms) + (0.01 × 100ms) = 1.99ms

Takeaway: Going from 90% to 99% hit rate gives you 5x improvement!

Caching Strategies & Patterns

Cache-Aside (Lazy Loading)

Application Logic:
1. Check cache
2. If HIT → return cached data
3. If MISS:
   a. Query database
   b. Store in cache
   c. Return data

[App] → [Cache] (miss)
  ↓        ↓
  └──→ [Database]
       ↓
    Store in cache
    Return to app
// Cache-Aside Pattern
async function getUser(userId) {
  // 1. Try cache first
  const cached = await redis.get(`user:${userId}`);
  if (cached) {
    return JSON.parse(cached);
  }
  
  // 2. Cache miss - query DB
  const user = await db.query(
    'SELECT * FROM users WHERE id = ?', 
    [userId]
  );
  
  // 3. Store in cache (TTL: 1 hour)
  await redis.setex(
    `user:${userId}`, 
    3600, 
    JSON.stringify(user)
  );
  
  return user;
}

Pros & Cons:

✓ Only cache what's needed

✓ Cache failures don't break app

✗ Initial request is slow (cache miss)

✗ Cache stampede risk

Write-Through

Write Operation:
1. Write to cache
2. Cache writes to DB
3. Return success

[App] → [Cache] → [Database]
          ↓
    Always in sync

Read Operation:
[App] → [Cache] (always HIT)
// Write-Through Pattern
async function updateUser(userId, data) {
  // 1. Update cache
  await redis.set(
    `user:${userId}`, 
    JSON.stringify(data)
  );
  
  // 2. Write to database
  await db.query(
    'UPDATE users SET ? WHERE id = ?',
    [data, userId]
  );
  
  return data;
}

// Reads are always fast
async function getUser(userId) {
  const cached = await redis.get(`user:${userId}`);
  return JSON.parse(cached);
}

Pros & Cons:

✓ Cache always fresh

✓ Predictable read performance

✗ Write latency increased

✗ Unused data cached

Write-Behind (Write-Back)

Write Operation:
1. Write to cache (fast)
2. Return success immediately
3. Async batch write to DB

[App] → [Cache] → Queue
          ↓         ↓
    Return fast   [Worker]
                     ↓
                [Database]
// Write-Behind Pattern
async function updateUser(userId, data) {
  // 1. Write to cache immediately
  await redis.set(`user:${userId}`, JSON.stringify(data));
  
  // 2. Add to write queue (async)
  await redis.lpush('db_write_queue', JSON.stringify({
    type: 'UPDATE_USER',
    userId,
    data,
    timestamp: Date.now()
  }));
  
  return data; // Return immediately
}

// Background worker processes queue
setInterval(async () => {
  const writes = await redis.lrange('db_write_queue', 0, 99);
  // Batch write to DB (entries were queued as JSON strings)
  if (writes.length) {
    await db.batchUpdate(writes.map((w) => JSON.parse(w)));
    await redis.ltrim('db_write_queue', writes.length, -1);
  }
}, 5000); // Every 5 seconds

Pros & Cons:

✓ Fastest write performance

✓ Can batch DB operations

✗ Risk of data loss

✗ Complex consistency

Read-Through

Read Operation:
1. App requests from cache
2. Cache fetches from DB if miss
3. Cache stores and returns

[App] → [Cache]
          ↓ (miss)
       [Database]
          ↓
    Auto-populate cache
// Cache library handles DB fetch
const cacheConfig = {
  store: redisStore,
  ttl: 3600,
  // Cache fetches from DB automatically
  refresh: async (key) => {
    const userId = key.split(':')[1];
    return await db.query(
      'SELECT * FROM users WHERE id = ?',
      [userId]
    );
  }
};

// Application code is simple
const user = await cache.get(`user:${userId}`);

Pros & Cons:

✓ Simplified app logic

✓ Automatic cache population

✗ Initial request slow

✗ Less control

Redis: Advanced Caching Techniques

Redis Data Structures

// 1. Strings (simple cache)
SET user:123 "John Doe" EX 3600
GET user:123

// 2. Hash (structured data)
HSET user:123 name "John" email "j@ex.com"
HGETALL user:123

// 3. Lists (queues, activity feeds)
LPUSH notifications:user:123 "New message"
LRANGE notifications:user:123 0 9

// 4. Sets (unique items, tags)
SADD user:123:tags "developer" "golang"
SISMEMBER user:123:tags "developer"

// 5. Sorted Sets (leaderboards)
ZADD leaderboard 1000 "player1" 950 "player2"
ZREVRANGE leaderboard 0 9 WITHSCORES

// 6. HyperLogLog (count unique)
PFADD unique_visitors user1 user2 user3
PFCOUNT unique_visitors

// 7. Bitmap (boolean flags)
SETBIT active_users:2025-11-08 123 1
BITCOUNT active_users:2025-11-08

Redis Use Cases

  • Session Store:
    SETEX session:abc123 1800 '{"user_id":123}'
  • Rate Limiting:
    INCR api:user:123:count
    EXPIRE api:user:123:count 3600
    // Allow 1000 requests per hour
  • Real-time Leaderboard:
    ZINCRBY game:scores 10 "player123"
    ZREVRANK game:scores "player123"
  • Pub/Sub Messaging:
    PUBLISH notifications "New order #123"
    SUBSCRIBE notifications
  • Distributed Lock:
    SET lock:resource:123 "uuid" NX EX 30

Cache Stampede Problem & Solution

Problem: Cache Stampede

When cache expires, multiple requests simultaneously hit the database, causing overload.

Cache expires at 10:00:00
10:00:00.001 - Request 1 → DB
10:00:00.002 - Request 2 → DB
10:00:00.003 - Request 3 → DB
... 1000 requests hit DB!

Solution: Locking

async function getWithLock(key) {
  let data = await redis.get(key);
  if (data) return JSON.parse(data);
  
  // Try to acquire lock
  const lockKey = `lock:${key}`;
  const acquired = await redis.set(
    lockKey, '1', 'NX', 'EX', 10
  );
  
  if (acquired) {
    // This request fetches from DB
    data = await db.query(...);
    await redis.setex(key, 3600, JSON.stringify(data));
    await redis.del(lockKey);
  } else {
    // Wait and retry
    await sleep(100);
    return getWithLock(key);
  }
  
  return data;
}

Cache Eviction Policies

Policy                       | Description                                 | Best For                                     | Implementation
LRU (Least Recently Used)    | Evict items not accessed for longest time   | General purpose, access patterns vary        | Doubly linked list + hash map
LFU (Least Frequently Used)  | Evict items with lowest access count        | When popularity matters (trending content)   | Min-heap + hash map
FIFO (First In First Out)    | Evict oldest items first                    | Time-based data, logs, simple scenarios      | Queue
TTL (Time To Live)           | Items expire after set duration             | Session data, temporary data                 | Expiration timestamps
Random                       | Evict random items                          | Fast, when access patterns unpredictable     | Random selection
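As a concrete illustration of LRU, here is a minimal sketch using a JavaScript Map (which preserves insertion order) instead of the classic doubly linked list + hash map; this is fine for small in-process caches.

class LRUCache {
  constructor(capacity) {
    this.capacity = capacity;
    this.map = new Map(); // insertion order: first entry = least recently used
  }

  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    this.map.delete(key);      // re-insert to mark as most recently used
    this.map.set(key, value);
    return value;
  }

  set(key, value) {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.capacity) {
      this.map.delete(this.map.keys().next().value); // evict least recently used
    }
  }
}

const cache = new LRUCache(2);
cache.set('a', 1);
cache.set('b', 2);
cache.get('a');              // 'a' becomes most recently used
cache.set('c', 3);           // evicts 'b'
console.log(cache.get('b')); // undefined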
Section 10

Scalability Strategies

Scalability is the ability of a system to handle increased load by adding resources. It's not just about handling more users—it's about doing so efficiently while maintaining performance and reliability.

Vertical vs Horizontal Scaling

Vertical Scaling (Scale Up)

┌─────────────────┐      ┌─────────────────┐
│   4 CPU Cores   │      │  16 CPU Cores   │
│     16 GB RAM   │  →   │    64 GB RAM    │
│    500 GB SSD   │      │   2 TB NVMe     │
└─────────────────┘      └─────────────────┘
 Current Server           Upgraded Server

Advantages:

  • ✓ Simple - no code changes needed
  • ✓ No distributed system complexity
  • ✓ Better for single-threaded apps
  • ✓ ACID transactions easier
  • ✓ Lower network latency

Disadvantages:

  • ✗ Hardware limits (max CPU/RAM)
  • ✗ Single point of failure
  • ✗ Downtime during upgrades
  • ✗ Expensive (non-linear cost)
  • ✗ Limited by single machine

Best For:

Databases, monolithic apps, early-stage startups, quick wins

Horizontal Scaling (Scale Out)

┌──────────┐              ┌──────────┐
│ Server 1 │              │ Server 1 │
│ 4 CPU    │      →       │ 4 CPU    │
│ 16 GB    │              │ 16 GB    │
└──────────┘              └──────────┘
                          ┌──────────┐
                          │ Server 2 │
                          │ 4 CPU    │
                          │ 16 GB    │
                          └──────────┘
                          ┌──────────┐
                          │ Server 3 │
                          │ 4 CPU    │
                          │ 16 GB    │
                          └──────────┘

Advantages:

  • ✓ Infinite scalability (add more servers)
  • ✓ High availability (redundancy)
  • ✓ No downtime for scaling
  • ✓ Cost-effective (commodity hardware)
  • ✓ Geographic distribution

Disadvantages:

  • ✗ Complex architecture
  • ✗ Data consistency challenges
  • ✗ Network latency between nodes
  • ✗ Load balancing needed
  • ✗ More operational overhead

Best For:

Web servers, microservices, stateless applications, cloud-native apps

When to Use Each Strategy

Start with Vertical Scaling: It's simpler and faster to implement, and it works well as long as your load fits within the limits of a single powerful machine.

Move to Horizontal Scaling: When you hit hardware limits, need high availability, or when cost becomes prohibitive. Most modern systems eventually require horizontal scaling.

Database Scaling Techniques

Database Replication

Master-Slave (Primary-Replica)
              ┌─────────────┐
              │   MASTER    │
              │  (Writes)   │
              └──────┬──────┘
                     │ Replicate
         ┌───────────┼───────────┐
         ↓           ↓           ↓
    ┌────────┐  ┌────────┐  ┌────────┐
    │ SLAVE1 │  │ SLAVE2 │  │ SLAVE3 │
    │(Reads) │  │(Reads) │  │(Reads) │
    └────────┘  └────────┘  └────────┘

Write: 1 server  |  Read: 3+ servers

Use Case: Read-heavy workloads (90% reads, 10% writes)

Replication Lag: Async replication may cause 0.1-1s delay

Example: E-commerce product catalog

Multi-Master (Active-Active)
    ┌──────────┐  ←→  ┌──────────┐
    │ MASTER 1 │  ←→  │ MASTER 2 │
    │ (R/W)    │  ←→  │ (R/W)    │
    └──────────┘      └──────────┘
         ↕                  ↕
    Bi-directional Replication

Both accept writes simultaneously

Use Case: Multi-region, high availability

Challenge: Conflict resolution needed

Example: Global CRM system

// MySQL Master-Slave Configuration
// Master (my.cnf):
server-id = 1
log-bin = /var/log/mysql/mysql-bin.log
binlog_do_db = production_db

// Slave (my.cnf):
server-id = 2
relay-log = /var/log/mysql/relay-bin
read_only = 1

// Application Code (using read replicas)
const dbConfig = {
  master: { host: 'master.db.com', port: 3306 },
  slaves: [
    { host: 'slave1.db.com', port: 3306 },
    { host: 'slave2.db.com', port: 3306 }
  ]
};

// Write to master
await masterDB.query('INSERT INTO users VALUES (...)');

// Read from slaves (pick one at random; round-robin also works)
const slave = dbConfig.slaves[Math.floor(Math.random() * dbConfig.slaves.length)];
await connectTo(slave).query('SELECT * FROM users WHERE ...'); // connectTo: your connection/pool helper

Database Sharding (Horizontal Partitioning)

Split data across multiple databases based on a shard key. Each shard is a complete database with subset of data.

Hash-Based
shard = hash(user_id) % num_shards

user_id: 123
hash(123) = 8472
8472 % 4 = 0 → Shard 0

Pro: Even distribution
Con: Hard to add shards

Range-Based
user_id 1-1M    → Shard 0
user_id 1M-2M   → Shard 1
user_id 2M-3M   → Shard 2

Pro: Easy range queries
Con: Hotspots

Directory-Based
Lookup table:
user_id: 123 → Shard 2
user_id: 456 → Shard 1

Pro: Flexible
Con: Extra lookup

Sharding Challenges

  • No JOINs across shards: Need application-level joins
  • No foreign keys across shards: Referential integrity at app level
  • Resharding is complex: Requires data migration
  • Uneven data distribution: Some shards may be larger

Table Partitioning (Vertical Partitioning)

Split table columns into multiple tables based on access patterns.

Before Partitioning
users table:
┌────────┬────────┬──────────┬──────────┬──────────┐
│  id    │  name  │  email   │  avatar  │  bio     │
├────────┼────────┼──────────┼──────────┼──────────┤
│  123   │  John  │ j@ex.com │ (1MB)    │ (text)   │
└────────┴────────┴──────────┴──────────┴──────────┘

Problem: Large rows, slow queries
After Partitioning
users_core (frequently accessed):
┌────────┬────────┬──────────┐
│  id    │  name  │  email   │
└────────┴────────┴──────────┘

users_profile (rarely accessed):
┌────────┬──────────┬──────────┐
│  id    │  avatar  │  bio     │
└────────┴──────────┴──────────┘

Benefit: Faster queries, better caching

Auto-Scaling Strategies

Reactive Scaling

Scale based on current metrics (CPU, memory, request rate)

Example:
If CPU > 70% for 5 minutes → Add 2 instances
If CPU < 30% for 10 minutes → Remove 1 instance

Scheduled Scaling

Scale based on predictable patterns

Example:
8 AM - 10 AM: Scale to 20 instances
10 PM - 6 AM: Scale to 5 instances
Black Friday: Scale to 100 instances

Predictive Scaling

Use ML to predict load and scale proactively

Example:
Analyze 30 days of traffic
Predict: Monday 3 PM will have 50% more traffic
Scale up 15 minutes before
// AWS Auto Scaling Configuration Example
{
  "MinSize": 2,
  "MaxSize": 50,
  "DesiredCapacity": 5,
  "TargetTrackingConfiguration": {
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ASGAverageCPUUtilization",
      "TargetValue": 70.0
    },
    "ScaleOutCooldown": 300,  // Wait 5 min before scaling out again
    "ScaleInCooldown": 600    // Wait 10 min before scaling in
  },
  "StepScalingPolicy": {
    "AdjustmentType": "PercentChangeInCapacity",
    "StepAdjustments": [
      { "MetricIntervalLowerBound": 0, "MetricIntervalUpperBound": 10, "ScalingAdjustment": 10 },
      { "MetricIntervalLowerBound": 10, "MetricIntervalUpperBound": 20, "ScalingAdjustment": 20 },
      { "MetricIntervalLowerBound": 20, "ScalingAdjustment": 30 }
    ]
  }
}
Section 11

Load Balancing

Load balancers distribute incoming traffic across multiple servers, ensuring no single server becomes overwhelmed. They're critical for high availability, scalability, and performance.

Load Balancing Algorithms

Round Robin

Request 1 → Server A
Request 2 → Server B
Request 3 → Server C
Request 4 → Server A
Request 5 → Server B

Distributes requests sequentially to each server in turn.

Pros: Simple, fair distribution

Cons: Ignores server capacity

Best for: Uniform server specs

Weighted Round Robin

Server A (weight: 3)
Server B (weight: 2)
Server C (weight: 1)

Pattern:
A, A, A, B, B, C (repeat)

Assigns more requests to servers with higher weights.

Pros: Handles different capacities

Cons: Static weights

Best for: Mixed server specs

Least Connections

Server A: 5 connections
Server B: 8 connections
Server C: 3 connections

New request → Server C
(has least connections)

Routes to server with fewest active connections.

Pros: Accounts for load

Cons: Overhead tracking

Best for: Long connections

Least Response Time

Server A: 50ms avg
Server B: 120ms avg
Server C: 30ms avg

New request → Server C
(fastest response)

Routes to server with lowest response time.

Pros: Best user experience

Cons: Complex calculation

Best for: Performance-critical

IP Hash

Client IP: 192.168.1.100
hash(192.168.1.100) % 3 = 1

Always routes to Server B
(session persistence)

Routes based on client IP hash.

Pros: Session persistence

Cons: Uneven distribution

Best for: Stateful apps

Random

Request 1 → Server B
Request 2 → Server A
Request 3 → Server B
Request 4 → Server C
(randomly selected)

Selects server randomly.

Pros: Simple, fast

Cons: May be uneven

Best for: Stateless, uniform load
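As a small sketch of how two of these algorithms can be expressed in code (the server list and connection counts are illustrative):

const servers = [
  { name: 'A', weight: 3, activeConnections: 5 },
  { name: 'B', weight: 2, activeConnections: 8 },
  { name: 'C', weight: 1, activeConnections: 3 },
];

// Round robin: cycle through servers in order
let rrIndex = 0;
function roundRobin() {
  const server = servers[rrIndex % servers.length];
  rrIndex++;
  return server;
}

// Least connections: pick the server with the fewest active connections
function leastConnections() {
  return servers.reduce((best, s) =>
    s.activeConnections < best.activeConnections ? s : best
  );
}

console.log(roundRobin().name);        // A, then B, then C, then A, ...
console.log(leastConnections().name);  // C (3 active connections)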

Layer 4 vs Layer 7 Load Balancing

Layer 4 (Transport Layer)

Routes based on IP address and TCP/UDP port. Doesn't inspect packet content.

Client Request
    ↓
[L4 Load Balancer]
    │
    ├→ Looks at: IP + Port
    ├→ Decision: TCP 192.168.1.100:443
    ↓
Routes to Server B

Characteristics:

  • ✓ Very fast (no content inspection)
  • ✓ Low latency (1-2ms overhead)
  • ✓ Handles any protocol (HTTP, WebSocket, database)
  • ✓ High throughput (millions of connections)
  • ✗ No content-based routing
  • ✗ No SSL termination

Examples:

AWS NLB, HAProxy (TCP mode), NGINX stream module

Use Case: Database load balancing, TCP applications, ultra-low latency

Layer 7 (Application Layer)

Routes based on HTTP headers, URLs, cookies. Full content inspection.

Client Request: GET /api/users
    ↓
[L7 Load Balancer]
    │
    ├→ Looks at: URL, Headers, Cookies
    ├→ Decision: /api/* → API servers
    ↓
Routes to API Server Pool

Characteristics:

  • ✓ Content-based routing
  • ✓ SSL/TLS termination
  • ✓ URL rewriting, redirects
  • ✓ Authentication, rate limiting
  • ✗ Higher latency (10-50ms)
  • ✗ More resource intensive

Examples:

AWS ALB, NGINX, HAProxy (HTTP mode), Traefik

Use Case: Web applications, microservices routing, API gateways

// NGINX Layer 7 Configuration
http {
  upstream api_servers {
    server api1.example.com:8080;
    server api2.example.com:8080;
    server api3.example.com:8080;
  }
  
  upstream web_servers {
    server web1.example.com:3000;
    server web2.example.com:3000;
  }
  
  server {
    listen 80;
    
    # Route API requests to API servers
    location /api/ {
      proxy_pass http://api_servers;
      proxy_set_header X-Real-IP $remote_addr;
      proxy_set_header Host $host;
    }
    
    # Route static content to web servers
    location / {
      proxy_pass http://web_servers;
      proxy_cache my_cache;
      proxy_cache_valid 200 1h;
    }
    
    # Route admin to specific server
    location /admin {
      proxy_pass http://admin.example.com:9000;
      auth_basic "Admin Area";
      auth_basic_user_file /etc/nginx/.htpasswd;
    }
  }
}

Health Checks & Failover

Health checks ensure traffic only goes to healthy servers. When a server fails, the load balancer automatically routes traffic elsewhere.

Types of Health Checks

1. TCP Health Check

Check if server accepts TCP connections

Every 10 seconds:
  Try to connect to server:80
  If connection successful → Healthy
  If 3 consecutive failures → Unhealthy
2. HTTP Health Check

Send HTTP request, check response code

GET /health HTTP/1.1
Expected: 200 OK
If 2xx/3xx → Healthy
If 4xx/5xx or timeout → Unhealthy
3. Custom Health Check

Check specific conditions (DB connection, disk space)

// /health endpoint
{
  "status": "healthy",
  "database": "connected",
  "disk_space": "85%",
  "response_time": "50ms"
}
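Here is a minimal sketch of such a custom /health endpoint in Express; the database check is a stand-in for whatever dependencies your service actually needs to verify.

const express = require('express');
const app = express();

// Stand-in dependency check; replace with a real ping to your database
async function checkDatabase() {
  try {
    // e.g. await db.query('SELECT 1');
    return 'connected';
  } catch (err) {
    return 'unreachable';
  }
}

app.get('/health', async (req, res) => {
  const started = Date.now();
  const database = await checkDatabase();
  const healthy = database === 'connected';

  // Load balancers treat non-2xx responses as unhealthy
  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'healthy' : 'unhealthy',
    database,
    response_time: `${Date.now() - started}ms`,
  });
});

app.listen(3000);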

Health Check Configuration

// AWS ALB Health Check
{
  "HealthCheckProtocol": "HTTP",
  "HealthCheckPath": "/health",
  "HealthCheckIntervalSeconds": 30,
  "HealthCheckTimeoutSeconds": 5,
  "HealthyThresholdCount": 2,
  "UnhealthyThresholdCount": 3,
  "Matcher": {
    "HttpCode": "200-299"
  }
}

// Scenario:
// Check every 30 seconds
// Wait 5 seconds for response
// 2 consecutive successes → Healthy
// 3 consecutive failures → Unhealthy

Timeline:
00:00 - Check failed (1/3)
00:30 - Check failed (2/3)
01:00 - Check failed (3/3) → UNHEALTHY
       (Remove from pool)
01:30 - Check success (1/2)
02:00 - Check success (2/2) → HEALTHY
       (Add back to pool)

Failover Strategies

Active-Active

All servers handle traffic simultaneously

  • ✓ Maximum resource utilization
  • ✓ No idle servers
  • ✓ Better performance

Active-Passive

Standby servers activated only on failure

  • ✓ Simple failover
  • ✓ Clear primary/backup
  • ✗ Resource waste (idle servers)

Global Load Balancing (GSLB)

Route users to the nearest data center based on geography, reducing latency.

User in Japan                User in USA                User in Europe
     ↓                           ↓                           ↓
[DNS Query: api.example.com]
     ↓                           ↓                           ↓
[Global Load Balancer / GeoDNS]
     ↓                           ↓                           ↓
Returns:              Returns:              Returns:
Tokyo DC              Virginia DC           London DC
13.230.10.50          54.88.45.20          18.130.70.30
     ↓                           ↓                           ↓
Latency: 20ms         Latency: 30ms         Latency: 25ms

GeoDNS Routing

Route based on user's geographic location

AWS Route 53, Cloudflare, NS1

Latency-Based Routing

Route to region with lowest latency

Measure actual latency, not just distance

Failover Routing

Redirect to backup region if primary fails

Automatic disaster recovery

Section 12

Message Queues & Streaming

Message queues and streaming platforms enable asynchronous communication between services, decoupling components and improving system resilience. They're essential for building scalable, event-driven architectures.

Message Queue vs Event Streaming

Aspect             | Message Queue                                 | Event Streaming
Purpose            | Task distribution, job processing             | Real-time data pipelines, event sourcing
Message Retention  | Deleted after consumption                     | Retained for days/weeks (configurable)
Consumption Model  | Competitive (one consumer per message)        | Publish-subscribe (multiple consumers)
Ordering           | FIFO within queue (optional)                  | Strict ordering within partition
Performance        | Thousands of messages/second                  | Millions of messages/second
Examples           | RabbitMQ, AWS SQS, Azure Service Bus          | Apache Kafka, AWS Kinesis, Pulsar
Use Cases          | Email sending, image processing, task queues  | Activity tracking, log aggregation, analytics

RabbitMQ: Message Queue Deep Dive

┌────────────┐         ┌──────────────┐         ┌────────────┐         ┌────────────┐
│  Producer  │────────→│   Exchange   │────────→│   Queue    │────────→│  Consumer  │
│  (App)     │  Publish│  (Routing)   │  Bind   │  (Buffer)  │  Consume│  (Worker)  │
└────────────┘         └──────────────┘         └────────────┘         └────────────┘

Exchange Types:
1. Direct    - Route by exact routing key match
2. Fanout    - Broadcast to all queues
3. Topic     - Route by pattern matching (*.error, logs.#)
4. Headers   - Route by message headers

RabbitMQ Example: Task Queue

// Producer (Send tasks)
const amqp = require('amqplib');

async function sendTask(taskData) {
  const connection = await amqp.connect('amqp://localhost');
  const channel = await connection.createChannel();
  
  const queue = 'task_queue';
  await channel.assertQueue(queue, {
    durable: true  // Survive broker restart
  });
  
  channel.sendToQueue(
    queue,
    Buffer.from(JSON.stringify(taskData)),
    { persistent: true }  // Survive queue restart
  );
  
  console.log('Task sent:', taskData);
  await channel.close();
  await connection.close();
}

// Usage
await sendTask({
  type: 'send_email',
  to: 'user@example.com',
  subject: 'Welcome'
});

Worker (Process tasks)

// Consumer (Process tasks)
async function startWorker() {
  const connection = await amqp.connect('amqp://localhost');
  const channel = await connection.createChannel();
  
  const queue = 'task_queue';
  await channel.assertQueue(queue, { durable: true });
  
  // Fair dispatch: don't give worker more than 1 task
  channel.prefetch(1);
  
  console.log('Worker waiting for tasks...');
  
  channel.consume(queue, async (msg) => {
    const task = JSON.parse(msg.content.toString());
    console.log('Processing:', task);
    
    try {
      // Process task (e.g., send email)
      await processTask(task);
      
      // Acknowledge success
      channel.ack(msg);
    } catch (error) {
      // Reject and requeue if failed
      channel.nack(msg, false, true);
    }
  });
}

startWorker();

RabbitMQ Patterns

Work Queue

Distribute tasks among workers

Use: Background jobs, image processing

Pub/Sub

Broadcast messages to subscribers

Use: Notifications, logging

RPC

Request-reply pattern

Use: Async API calls

Apache Kafka: Event Streaming Platform

Kafka Cluster Architecture:

┌─────────────────────────────────────────────────────────────────┐
│                        Kafka Cluster                             │
│  ┌────────────┐    ┌────────────┐    ┌────────────┐            │
│  │ Broker 1   │    │ Broker 2   │    │ Broker 3   │            │
│  │            │    │            │    │            │            │
│  │ Topic: logs│    │ Topic: logs│    │ Topic: logs│            │
│  │ Partition 0│    │ Partition 1│    │ Partition 2│            │
│  │ (Leader)   │    │ (Replica)  │    │ (Replica)  │            │
│  └────────────┘    └────────────┘    └────────────┘            │
└─────────────────────────────────────────────────────────────────┘
        ↑                   ↑                   ↑
        │                   │                   │
┌───────┴───────┐   ┌───────┴───────┐   ┌───────┴───────┐
│  Producer 1   │   │  Producer 2   │   │  Producer 3   │
│  (Web Server) │   │  (Mobile App) │   │  (IoT Device) │
└───────────────┘   └───────────────┘   └───────────────┘

        ↓                   ↓                   ↓
┌───────────────┐   ┌───────────────┐   ┌───────────────┐
│  Consumer     │   │  Consumer     │   │  Consumer     │
│  Group 1      │   │  Group 2      │   │  Group 3      │
│  (Analytics)  │   │  (Monitoring) │   │  (Backup)     │
└───────────────┘   └───────────────┘   └───────────────┘

Each consumer group independently reads all events

Kafka Producer

// Kafka Producer
const { Kafka } = require('kafkajs');

const kafka = new Kafka({
  clientId: 'my-app',
  brokers: ['kafka1:9092', 'kafka2:9092', 'kafka3:9092']
});

const producer = kafka.producer();

async function produceEvents() {
  await producer.connect();
  
  // Send single event
  await producer.send({
    topic: 'user-events',
    messages: [
      {
        key: 'user-123',  // Partition key
        value: JSON.stringify({
          event: 'user_login',
          userId: 123,
          timestamp: Date.now()
        }),
        headers: {
          'correlation-id': 'abc-123'
        }
      }
    ]
  });
  
  // Batch send for better performance
  await producer.sendBatch({
    topicMessages: [
      {
        topic: 'user-events',
        messages: Array.from({ length: 100 }, (_, i) => ({
          key: `user-${i}`,
          value: JSON.stringify({ event: 'page_view', userId: i })
        }))
      }
    ]
  });
}

produceEvents();

Kafka Consumer

// Kafka Consumer
const consumer = kafka.consumer({
  groupId: 'analytics-group'
});

async function consumeEvents() {
  await consumer.connect();

  await consumer.subscribe({
    topics: ['user-events', 'order-events'],
    // fromBeginning is a subscribe-time option: start from the earliest
    // offset when this consumer group has no committed offsets yet
    fromBeginning: true
  });
  
  await consumer.run({
    eachMessage: async ({ topic, partition, message }) => {
      const event = JSON.parse(message.value.toString());
      
      console.log({
        topic,
        partition,
        offset: message.offset,
        key: message.key.toString(),
        event
      });
      
      // Process event
      await handleEvent(event);
      
      // Kafka auto-commits offset periodically
      // Or manual: await consumer.commitOffsets([...])
    }
  });
}

consumeEvents();

Kafka Key Concepts

Topics & Partitions

  • Topic: Category of messages (e.g., "orders")
  • Partition: Ordered, immutable sequence of records
  • Offset: Unique ID of each record in partition
  • Replication: Each partition replicated across brokers

Consumer Groups

  • Each consumer in group reads from different partitions
  • Enables parallel processing
  • Automatic rebalancing on consumer add/remove
  • Each group maintains its own offset

Event-Driven Architecture Patterns

Event Sourcing

Store all changes as sequence of events instead of current state.

Example: Bank Account

// Traditional (store current state)
account_balance: $1000

// Event Sourcing (store events)
1. AccountCreated    → $0
2. DepositMade       → +$1500
3. WithdrawalMade    → -$300
4. WithdrawalMade    → -$200
Current Balance = $1000

Advantages:
✓ Full audit trail
✓ Replay events to rebuild state
✓ Time travel debugging
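Rebuilding the current state is then just a replay (fold) over the event log. A minimal sketch using the account events from the example above:

const events = [
  { type: 'AccountCreated', amount: 0 },
  { type: 'DepositMade', amount: 1500 },
  { type: 'WithdrawalMade', amount: 300 },
  { type: 'WithdrawalMade', amount: 200 },
];

// Replay: apply each event to the running state
function replay(events) {
  return events.reduce((balance, event) => {
    switch (event.type) {
      case 'AccountCreated': return 0;
      case 'DepositMade':    return balance + event.amount;
      case 'WithdrawalMade': return balance - event.amount;
      default:               return balance; // ignore unknown event types
    }
  }, 0);
}

console.log(replay(events)); // 1000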

CQRS (Command Query Responsibility Segregation)

Separate read and write models for better scalability.

Commands (Write)       Events        Queries (Read)
     ↓                     ↓                ↓
[Write Model]  →  [Event Store]  →  [Read Model]
   (Postgres)           (Kafka)      (Elasticsearch)
                                     (Redis cache)

Examples:
- Write optimized DB for commands
- Denormalized views for queries
- Event bus synchronizes them

Advantages:
✓ Scale reads/writes independently
✓ Optimize each model separately

Saga Pattern (Distributed Transactions)

Manage distributed transactions across microservices using compensating actions.

Choreography-Based Saga
// Order Processing Saga

1. Order Service
   → Create Order (PENDING)
   → Emit: OrderCreated event

2. Payment Service
   ← Listen: OrderCreated
   → Charge Payment
   → Emit: PaymentCompleted event

3. Inventory Service
   ← Listen: PaymentCompleted
   → Reserve Items
   → Emit: ItemsReserved event

4. Shipping Service
   ← Listen: ItemsReserved
   → Schedule Delivery
   → Emit: ShipmentScheduled

5. Order Service
   ← Listen: ShipmentScheduled
   → Update Order (CONFIRMED)

// If any step fails:
PaymentFailed event
→ Cancel Order (compensation)
Orchestration-Based Saga
// Saga Orchestrator

Orchestrator controls flow:

try {
  // Step 1
  order = await orderService.create();
  
  // Step 2
  payment = await paymentService.charge();
  
  // Step 3
  await inventoryService.reserve();
  
  // Step 4
  await shippingService.schedule();
  
  // Complete
  await orderService.confirm(order.id);
  
} catch (error) {
  // Compensate in reverse order
  await shippingService.cancel();
  await inventoryService.release();
  await paymentService.refund();
  await orderService.cancel();
}

Real-World Use Cases

Email Notification System

Architecture: RabbitMQ + Worker Pool

[User Action] → [RabbitMQ Queue]
                         ↓
              ┌──────────┼──────────┐
              ↓          ↓          ↓
         [Worker 1] [Worker 2] [Worker 3]
              ↓          ↓          ↓
         [SMTP Server (SendGrid)]
Benefits:
  • Async - user doesn't wait for email
  • Retry failed emails automatically
  • Scale workers based on queue depth
  • Rate limiting to avoid spam filters

Real-Time Analytics Pipeline

Architecture: Kafka + Stream Processing

[Web/Mobile] → [Kafka: events topic]
                         ↓
              [Kafka Streams / Flink]
                   ↓          ↓
          [Aggregation]  [Filtering]
                   ↓          ↓
         [Kafka: analytics] [Kafka: alerts]
                   ↓          ↓
           [Dashboard]  [Alert Service]
Benefits:
  • Real-time insights (seconds, not hours)
  • Handle millions of events/second
  • Replay for debugging/reprocessing
  • Multiple consumers for different purposes

Video Processing Pipeline

Architecture: SQS + Lambda + S3

[Upload] → [S3 Bucket]
              ↓ (trigger)
         [SQS Queue]
              ↓
    [Lambda: Transcode Job]
              ↓
    [Generate: 1080p, 720p, 480p]
              ↓
         [S3: Output Bucket]
              ↓
    [Update: DB with URLs]
Benefits:
  • Serverless - no server management
  • Auto-scaling based on uploads
  • Pay only for processing time
  • Handle spiky traffic (viral videos)

E-Commerce Order Processing

Architecture: Kafka + Microservices

[Order Created] → [Kafka]
                       ↓
        ┌──────────────┼──────────────┐
        ↓              ↓              ↓
  [Payment Svc] [Inventory Svc] [Email Svc]
        ↓              ↓              ↓
  Charge Card    Reserve Items    Send Confirm
        ↓              ↓              ↓
        └──────────────┼──────────────┘
                       ↓
              [Shipping Service]
Benefits:
  • Loose coupling between services
  • Each service can scale independently
  • Failed steps can retry without affecting others
  • Easy to add new consumers (analytics, fraud detection)
Section 13

CAP Theorem & Consistency Models

The CAP theorem is a fundamental principle in distributed systems that states you can only guarantee two out of three properties: Consistency, Availability, and Partition Tolerance. Understanding these trade-offs is crucial for designing distributed systems.

Understanding CAP Theorem

You Can Only Pick Two!

C
Consistency

Every read receives the most recent write or an error

Example:
Update user balance: $100 → $150
All nodes immediately see $150
No stale reads
A
Availability

Every request receives a response (success or failure)

Example:
System always responds
Even if some nodes are down
May return stale data
P
Partition Tolerance

System continues despite network partitions

Example:
Network splits cluster
Nodes can't communicate
System still operates

CP Systems

Consistency + Partition Tolerance

Network Partition Occurs:
Node A ←─X─→ Node B

Write to Node A → Success
Read from Node B → ERROR
(B can't confirm latest state)

System blocks until partition heals

Characteristics:

  • Strong consistency guaranteed
  • May become unavailable during partitions
  • Blocks operations until sync
  • Returns errors rather than stale data

Examples:

MongoDB, HBase, Redis (single instance), ZooKeeper

Use Cases:

Banking systems, inventory management, configuration management

AP Systems

Availability + Partition Tolerance

Network Partition Occurs:
Node A ←─X─→ Node B

Write to Node A → Success
Read from Node B → OLD DATA
(Returns last known state)

Both nodes accept requests

Characteristics:

  • Always available for reads/writes
  • May return stale data
  • Eventual consistency model
  • Continues during network issues

Examples:

Cassandra, DynamoDB, CouchDB, Riak

Use Cases:

Social media feeds, shopping carts, caching, analytics

CA Systems

Consistency + Availability

No Network Partition:
Node A ←───→ Node B

Write to Node A → Synced to B
Read from Node B → LATEST DATA

System fails if network partitions

Characteristics:

  • Strong consistency + high availability
  • Only works in single-node or LAN
  • Not viable for distributed systems
  • Network partitions will occur!

Examples:

Traditional RDBMS (single server), PostgreSQL (single node)

Reality:

In distributed systems, partitions are inevitable. CA is theoretical.

Important Reality: Partition Tolerance is Not Optional

In real distributed systems, network partitions will happen (hardware failures, network issues, datacenter problems). Therefore, the practical choice is between:

  • CP: Sacrifice availability to maintain consistency during partitions
  • AP: Sacrifice consistency to maintain availability during partitions

The real question: "When network fails, do you want your system to be unavailable but correct (CP), or available but potentially incorrect (AP)?"

ACID vs BASE: Transaction Models

ACID

Traditional database transaction model (CP systems)

A: Atomicity

All operations in transaction succeed or all fail. No partial updates.

BEGIN TRANSACTION
  UPDATE accounts SET balance = balance - 100 WHERE id = 1;
  UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;

If step 2 fails → step 1 is rolled back
All or nothing!
C: Consistency

Database remains in valid state. All constraints, triggers, and rules are enforced.

// Example: Foreign key constraint
accounts (id, balance CHECK balance >= 0)

// Transaction that violates constraint fails
UPDATE accounts SET balance = -50 WHERE id = 1;
// ERROR: Check constraint violated
I: Isolation

Concurrent transactions don't interfere. Appears as if executed serially.

Transaction 1: Read balance (100) → Write (150)
Transaction 2: Read balance (100) → Write (120)

With proper isolation:
T1 completes → balance = 150
T2 reads 150 → balance = 170

Without isolation:
Both read 100, both write → lost update!
D: Durability

Committed transactions survive crashes. Persisted to non-volatile storage.

COMMIT TRANSACTION;
// Success response sent to client

[POWER FAILURE OCCURS]

After restart:
Data is still there (written to disk)
Write-ahead log ensures durability

Best For:

Financial systems, e-commerce orders, inventory management, any system where correctness is critical

Examples:

PostgreSQL, MySQL, Oracle, SQL Server

BASE

NoSQL/distributed system model (AP systems)

B: Basically Available

System guarantees availability. May return stale data but always responds.

Request to read user profile:

Node A is down
↓
Redirect to Node B
↓
Return data (may be slightly stale)
↓
User gets response (not error)
S: Soft State

State of system may change without input (due to eventual consistency).

Write to Node A: likes = 100
Read from Node B: likes = 95 (stale)

Wait a few seconds...

Read from Node B: likes = 100 (synced)

State changed without new writes!
E: Eventual Consistency

System will become consistent over time if no new updates are made.

Time 0: Write "status=active" to Node A
Time 1: Node B reads → "status=inactive" (stale)
Time 2: Replication completes
Time 3: Node B reads → "status=active" (consistent)

No strict timing guarantee
Eventually all nodes agree

Best For:

Social media, recommendations, analytics, shopping carts, any system where availability > consistency

Examples:

Cassandra, DynamoDB, Riak, CouchDB

Consistency Levels Spectrum

Different systems offer tunable consistency levels. You can choose the right trade-off for each operation.

Strong Consistency
(Slower, More Consistent)
Eventual Consistency
(Faster, Less Consistent)
Linearizable

Strongest guarantee. Appears as single copy.

Latency: Highest

Sequential

All operations in some total order.

Latency: High

Causal

Causally related ops ordered.

Latency: Medium

Read-your-writes

See own writes immediately.

Latency: Low

Eventual

Becomes consistent eventually.

Latency: Lowest

Tunable Consistency: Cassandra Example

Write Consistency Levels
// Replication Factor = 3

// ONE: Fast, low consistency
Write succeeds after 1 replica ACK
Latency: ~5ms

// QUORUM: Balanced (most common)
Write succeeds after 2/3 replicas ACK
Latency: ~15ms

// ALL: Strong consistency
Write succeeds after all 3 replicas ACK
Latency: ~50ms
(Fails if any replica down)
Read Consistency Levels
// ONE: Fastest, may be stale
Read from 1 replica
Latency: ~5ms

// QUORUM: Strong consistency
Read from 2/3 replicas, return latest
Latency: ~15ms

// ALL: Highest consistency
Read from all 3 replicas
Latency: ~50ms

Formula for strong consistency:
W + R > N (replication factor)
Example: Write QUORUM + Read QUORUM
2 + 2 > 3 ✓ (strong consistency)
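For illustration, this is roughly how per-query consistency is chosen with the Node.js cassandra-driver; the keyspace, table, and column names are made up for this sketch.

const cassandra = require('cassandra-driver');

const client = new cassandra.Client({
  contactPoints: ['cassandra1', 'cassandra2', 'cassandra3'],
  localDataCenter: 'dc1',
  keyspace: 'shop',
});

const { consistencies } = cassandra.types;

async function updateAndReadStock() {
  // Write at QUORUM: 2 of 3 replicas must acknowledge
  await client.execute(
    'UPDATE inventory SET stock = ? WHERE product_id = ?',
    [42, 'sku-123'],
    { prepare: true, consistency: consistencies.quorum }
  );

  // Read at QUORUM: W + R > N, so this read sees the acknowledged write
  const result = await client.execute(
    'SELECT stock FROM inventory WHERE product_id = ?',
    ['sku-123'],
    { prepare: true, consistency: consistencies.quorum }
  );
  console.log(result.rows[0].stock);
}

updateAndReadStock();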

Real-World Consistency Trade-offs

Banking System (CP - Strong Consistency)

Account balance must always be accurate. Correctness > Availability.

Design Choices:

  • Use ACID transactions (PostgreSQL, MySQL)
  • Synchronous replication
  • 2-phase commit for distributed transactions
  • Block operations during network partition
  • Better to show error than wrong balance

"I'd rather ATM says 'unavailable' than dispense $1000 when I only have $100"

Facebook News Feed (AP - Eventual Consistency)

Must always be available. Slight inconsistency acceptable. Availability > Consistency.

Design Choices:

  • Use Cassandra (AP system)
  • Eventual consistency model
  • Multiple data centers, async replication
  • Continue serving during network issues
  • OK if like count slightly delayed

"Better to show 99 likes instead of 100 than to show error page"

E-Commerce (Hybrid - Different Operations)

Mix strong and eventual consistency based on operation criticality.

Design Choices:

  • Shopping Cart: Eventual (AP) - OK if items appear delayed
  • Inventory: Strong (CP) - Prevent overselling
  • Payment: ACID (CP) - Must be exact
  • Product Reviews: Eventual (AP) - Slight delay OK
  • Order Status: Eventual (AP) - Updates can lag

"Use the right consistency model for each use case"

Ticket Booking (CP - Strong Consistency)

Cannot oversell seats. Must prevent double booking. Correctness critical.

Design Choices:

  • Use distributed locks (Redis, ZooKeeper)
  • Optimistic locking with version numbers
  • Reserve seats with timeout (10 min hold)
  • Strong consistency for seat availability
  • Queue system for high demand events

"Better to queue users than to oversell concert tickets"

Section 14

Security & Authentication

Security is not optional—it's fundamental. A single security breach can destroy user trust, result in massive fines, and damage your reputation permanently. Understanding authentication, authorization, encryption, and security best practices is crucial for every system architect.

Authentication vs Authorization

Authentication

"Who are you?"

Verifying the identity of a user or system. Proving you are who you claim to be.

Methods:

  • Something you know: Password, PIN
  • Something you have: Phone, security key
  • Something you are: Fingerprint, face
  • Multi-Factor Auth (MFA): Combine 2+ methods
// Authentication Example
POST /api/auth/login
{
  "email": "user@example.com",
  "password": "********"
}

Response:
{
  "token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...",
  "user": { "id": 123, "email": "user@example.com" }
}

✓ Identity verified

Authorization

"What can you do?"

Determining what an authenticated user is allowed to access or perform.

Models:

  • RBAC: Role-Based (admin, user, guest)
  • ABAC: Attribute-Based (user.age > 18)
  • ACL: Access Control Lists
  • PBAC: Policy-Based Access Control
// Authorization Example
GET /api/admin/users
Authorization: Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...

Check: Does user have 'admin' role?
  ✓ Yes → Return data
  ✗ No  → 403 Forbidden

Permission checked after authentication

Remember:

Authentication happens first (verify identity), then Authorization (check permissions). You can be authenticated but not authorized. Example: You're logged in (authenticated) but can't access admin panel (not authorized).
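A minimal sketch of this two-step check as Express middleware, using the jsonwebtoken package; the secret, role names, and route are illustrative.

const express = require('express');
const jwt = require('jsonwebtoken');

const app = express();
const JWT_SECRET = process.env.JWT_SECRET || 'dev-only-secret'; // illustrative only

// 1. Authentication: verify the token and attach the user to the request
function authenticate(req, res, next) {
  const header = req.headers.authorization || '';
  const token = header.startsWith('Bearer ') ? header.slice(7) : null;
  if (!token) return res.status(401).json({ error: 'Missing token' });

  try {
    req.user = jwt.verify(token, JWT_SECRET); // throws if invalid or expired
    next();
  } catch (err) {
    return res.status(401).json({ error: 'Invalid or expired token' });
  }
}

// 2. Authorization: check permissions of the already-authenticated user
function requireRole(role) {
  return (req, res, next) => {
    if (req.user.role !== role) {
      return res.status(403).json({ error: 'Forbidden' }); // authenticated but not allowed
    }
    next();
  };
}

app.get('/api/admin/users', authenticate, requireRole('admin'), (req, res) => {
  res.json([{ id: 123, email: 'user@example.com' }]);
});

app.listen(3000);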

OAuth 2.0 & JWT Tokens

OAuth 2.0 Flow

OAuth 2.0 is an authorization framework that enables applications to obtain limited access to user accounts. Most common for "Login with Google/Facebook/GitHub".

Authorization Code Flow (Most Secure):

┌────────────┐                                      ┌──────────────┐
│   User     │                                      │   Your App   │
└─────┬──────┘                                      └──────┬───────┘
      │                                                    │
      │ 1. Click "Login with Google"                      │
      │───────────────────────────────────────────────────>│
      │                                                    │
      │ 2. Redirect to Google                             │
      │<───────────────────────────────────────────────────│
      │                                                    │
      │         ┌─────────────────────┐                   │
      │────────>│  Google Auth Server │                   │
      │         └──────────┬──────────┘                   │
      │ 3. Enter credentials & authorize                  │
      │                    │                               │
      │ 4. Auth Code       │                               │
      │<───────────────────┘                               │
      │                                                    │
      │ 5. Send auth code to Your App                     │
      │───────────────────────────────────────────────────>│
      │                                                    │
      │                    ┌──────────────────────────────>│
      │                    │ 6. Exchange code for token    │
      │         ┌──────────┴──────────┐                   │
      │         │  Google Auth Server │                   │
      │         └──────────┬──────────┘                   │
      │                    │ 7. Access Token               │
      │                    │<──────────────────────────────│
      │                                                    │
      │ 8. Access Token stored, user logged in            │
      │<───────────────────────────────────────────────────│
      │                                                    │
OAuth 2.0 Grant Types

Authorization Code

Most secure. For server-side apps. Uses secret key.

Use: Web applications

Implicit Flow

Token returned directly. Less secure. Deprecated.

Use: Legacy SPAs only; avoid for new apps

Client Credentials

Machine-to-machine auth. No user involved.

Use: API-to-API, microservices

PKCE (for mobile/SPA)

Secure for public clients. No client secret needed.

Use: Mobile apps, SPAs

Implementation Example
// Node.js with Passport.js
const passport = require('passport');
const GoogleStrategy = require('passport-google-oauth20');

passport.use(new GoogleStrategy({
    clientID: process.env.GOOGLE_CLIENT_ID,
    clientSecret: process.env.GOOGLE_CLIENT_SECRET,
    callbackURL: "https://app.com/auth/google/callback"
  },
  function(accessToken, refreshToken, profile, done) {
    // Find or create user in your database
    User.findOrCreate({ 
      googleId: profile.id,
      email: profile.emails[0].value,
      name: profile.displayName
    }, function (err, user) {
      return done(err, user);
    });
  }
));

// Routes
app.get('/auth/google',
  passport.authenticate('google', { 
    scope: ['profile', 'email'] 
  })
);

app.get('/auth/google/callback', 
  passport.authenticate('google', { 
    failureRedirect: '/login' 
  }),
  function(req, res) {
    res.redirect('/dashboard');
  }
);

JWT (JSON Web Tokens)

JWT is a compact, self-contained way to securely transmit information between parties as a JSON object. Commonly used for authentication and information exchange.

JWT Structure: header.payload.signature

┌─────────────────────────────────────────────────────────────────┐
│                         JWT TOKEN                                │
├─────────────────────────────────────────────────────────────────┤
│ eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.                           │
│ eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiaWF0IjoxNTE2MjM5MDIyfQ. │
│ SflKxwRJSMeKKF2QT4fwpMeJf36POk6yJV_adQssw5c                     │
└─────────────────────────────────────────────────────────────────┘
      │                          │                      │
      ▼                          ▼                      ▼
   HEADER                    PAYLOAD               SIGNATURE
   
HEADER (Base64):              PAYLOAD (Base64):         SIGNATURE:
{                             {                         HMACSHA256(
  "alg": "HS256",               "sub": "1234567890",      base64UrlEncode(header) + "." +
  "typ": "JWT"                  "name": "John Doe",       base64UrlEncode(payload),
}                               "iat": 1516239022,        secret
                                "exp": 1516242622       )
                              }
Creating & Verifying JWT
// Creating JWT (Node.js)
const jwt = require('jsonwebtoken');

// Sign token
const token = jwt.sign(
  { 
    userId: 123,
    email: 'user@example.com',
    role: 'admin'
  },
  process.env.JWT_SECRET,  // Secret key
  { 
    expiresIn: '24h',      // Token expires in 24 hours
    issuer: 'myapp.com',
    audience: 'myapp-users'
  }
);

// Token: eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...

// Verify token
try {
  const decoded = jwt.verify(
    token, 
    process.env.JWT_SECRET
  );
  console.log(decoded);
  // { userId: 123, email: '...', exp: ... }
  
} catch(error) {
  // Invalid token, expired, or wrong signature
  console.error('Invalid token:', error.message);
}
Middleware Protection
// Express.js authentication middleware
function authenticateToken(req, res, next) {
  // Get token from header
  const authHeader = req.headers['authorization'];
  const token = authHeader && authHeader.split(' ')[1];
  
  if (!token) {
    return res.status(401).json({ 
      error: 'Access token required' 
    });
  }
  
  jwt.verify(token, process.env.JWT_SECRET, (err, user) => {
    if (err) {
      return res.status(403).json({ 
        error: 'Invalid or expired token' 
      });
    }
    
    // Attach user to request
    req.user = user;
    next();
  });
}

// Protected route
app.get('/api/profile', 
  authenticateToken,  // Middleware
  (req, res) => {
    res.json({ 
      user: req.user,
      message: 'Protected data'
    });
  }
);

JWT Security Best Practices

  • ✓ Use strong secret key (minimum 256 bits)
  • ✓ Set short expiration times (15-60 minutes)
  • ✓ Use refresh tokens for longer sessions (see the sketch after this list)
  • ✓ Store tokens in httpOnly cookies (not localStorage)
  • ✓ Validate token on every request
  • ✓ Don't store sensitive data in payload (it's just Base64, not encrypted)
  • ✓ Implement token blacklisting for logout
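
A minimal sketch of the refresh-token item above (the db.findRefreshToken lookup and the token lifetimes are illustrative assumptions):

// Exchange a long-lived refresh token for a new short-lived access token
app.post('/api/auth/refresh', async (req, res) => {
  const { refreshToken } = req.body;

  // Refresh tokens are opaque, stored server-side, and revocable (logout / blacklist)
  const stored = await db.findRefreshToken(refreshToken);  // hypothetical lookup
  if (!stored || stored.revoked || stored.expiresAt < Date.now()) {
    return res.status(401).json({ error: 'Invalid or expired refresh token' });
  }

  // Issue a fresh access token with a short lifetime
  const accessToken = jwt.sign(
    { userId: stored.userId },
    process.env.JWT_SECRET,
    { expiresIn: '15m' }
  );

  res.json({ accessToken });
});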

Encryption: Protecting Data

Encryption at Rest

Protecting data stored on disk (databases, files, backups)

Technologies:

  • AES-256: Symmetric encryption standard
  • Database TDE: Transparent Data Encryption
  • AWS KMS: Key Management Service
  • Disk encryption: LUKS, BitLocker
// Node.js - Encrypt sensitive data
const crypto = require('crypto');

const algorithm = 'aes-256-gcm';
const key = crypto.randomBytes(32);
const iv = crypto.randomBytes(16);

function encrypt(text) {
  const cipher = crypto.createCipheriv(algorithm, key, iv);
  let encrypted = cipher.update(text, 'utf8', 'hex');
  encrypted += cipher.final('hex');
  const authTag = cipher.getAuthTag();
  return {
    encrypted,
    authTag: authTag.toString('hex'),
    iv: iv.toString('hex')
  };
}

// Encrypt sensitive data (e.g., an API key) before storing
const sensitive = encrypt('api-key-abc123');
// Store: sensitive.encrypted + authTag + iv

Encryption in Transit

Protecting data moving between systems (HTTPS, VPN)

Technologies:

  • TLS 1.3: Latest protocol for HTTPS
  • SSL Certificates: Let's Encrypt, DigiCert
  • VPN: WireGuard, OpenVPN
  • mTLS: Mutual TLS for service-to-service
// NGINX TLS Configuration
server {
    listen 443 ssl http2;
    server_name api.example.com;
    
    # TLS 1.3 only
    ssl_protocols TLSv1.3;
    ssl_certificate /etc/ssl/cert.pem;
    ssl_certificate_key /etc/ssl/key.pem;
    
    # Strong ciphers
    ssl_ciphers 'ECDHE-ECDSA-AES128-GCM-SHA256';
    ssl_prefer_server_ciphers off;
    
    # HSTS (force HTTPS)
    add_header Strict-Transport-Security 
      "max-age=63072000" always;
    
    # Certificate stapling
    ssl_stapling on;
    ssl_stapling_verify on;
}

Password Hashing (NOT Encryption)

Never encrypt passwords—hash them with one-way algorithms. Hashes cannot be reversed.

// ❌ WRONG - Never do this!
password = encrypt("mypassword123", secretKey);
// Can be decrypted if key is compromised

// ✅ CORRECT - Use bcrypt
const bcrypt = require('bcrypt');

// Hash password (with salt)
const saltRounds = 12;
const hash = await bcrypt.hash('mypassword123', saltRounds);
// $2b$12$K5Q9LwZJFqjPMZ3lA3vhO.rT7...

// Verify password
const match = await bcrypt.compare(
  'user-input-password',
  storedHash
);
if (match) {
  console.log('Password correct!');
}

Hashing Algorithms:

  • ✓ bcrypt: Adaptive, slow by design, industry standard
  • ✓ Argon2: Modern, won password hashing competition
  • ✓ scrypt: Memory-hard, resistant to ASICs
  • ✗ MD5: Broken, never use
  • ✗ SHA1: Deprecated, vulnerable
  • ⚠ SHA256: Too fast for passwords (OK for data integrity)

Zero Trust Architecture

"Never trust, always verify" - Assume breach, verify everything, grant least privilege access.

Traditional Security (Perimeter-Based)

┌─────────────────────────────────┐
│       FIREWALL (Castle Wall)    │
├─────────────────────────────────┤
│     ✓ Inside = Trusted          │
│     ✗ Outside = Untrusted       │
│                                  │
│  Problem: Once inside,           │
│  lateral movement is easy        │
│  (Attackers own the castle)      │
└─────────────────────────────────┘
  • ✗ Trust internal network
  • ✗ VPN = full access
  • ✗ Single point of failure

Zero Trust (Identity-Based)

┌─────────────────────────────────┐
│    Every Request Verified        │
├─────────────────────────────────┤
│  1. Authenticate (Who are you?)  │
│  2. Authorize (What can you do?) │
│  3. Encrypt (Secure channel)     │
│  4. Monitor (Detect anomalies)   │
│                                  │
│  Every service, every time       │
│  No implicit trust               │
└─────────────────────────────────┘
  • ✓ Verify every request
  • ✓ Least privilege access
  • ✓ Micro-segmentation

Zero Trust Principles

1. Verify Explicitly

Authenticate and authorize based on all available data points (user, device, location, behavior)

2. Least Privilege

Limit user access with Just-In-Time and Just-Enough-Access (JIT/JEA)

3. Assume Breach

Minimize blast radius, segment access, verify end-to-end encryption, use analytics

4. Identity as Perimeter

User identity becomes the security boundary, not network location

5. Device Compliance

Verify device health before granting access (OS version, patches, antivirus)

6. Continuous Monitoring

Real-time threat detection, anomaly detection, automated response

Section 15

CI/CD & DevOps Architecture

CI/CD (Continuous Integration/Continuous Deployment) and DevOps practices enable teams to deliver software faster, more reliably, and with higher quality. Automation, collaboration, and feedback loops are key to modern software delivery.

Complete CI/CD Pipeline

┌──────────────────────────────────────────────────────────────────────────────┐
│                         CI/CD PIPELINE STAGES                                 │
└──────────────────────────────────────────────────────────────────────────────┘

1. CODE              2. BUILD            3. TEST              4. DEPLOY
   ↓                    ↓                   ↓                    ↓
┌─────────┐         ┌─────────┐        ┌─────────┐        ┌─────────┐
│  Git    │  Push   │ Compile │        │  Unit   │        │  Dev    │
│  Commit │────────>│ Install │───────>│ Tests   │───────>│  Env    │
│         │         │ Deps    │        │         │        │         │
└─────────┘         └─────────┘        └─────────┘        └─────────┘
                         ↓                   ↓                  ↓
                    ┌─────────┐        ┌─────────┐        ┌─────────┐
                    │ Docker  │        │ Integr. │        │ Staging │
                    │ Build   │        │ Tests   │        │  Env    │
                    └─────────┘        └─────────┘        └─────────┘
                         ↓                   ↓                  ↓
                    ┌─────────┐        ┌─────────┐        ┌─────────┐
                    │ Push to │        │ E2E     │        │ Prod    │
                    │Registry │        │ Tests   │        │  Env    │
                    └─────────┘        └─────────┘        └─────────┘
                         ↓                   ↓                  ↓
                    ┌─────────┐        ┌─────────┐        ┌─────────┐
                    │Security │        │ Perform.│        │ Monitor │
                    │  Scan   │        │ Tests   │        │ & Alert │
                    └─────────┘        └─────────┘        └─────────┘

                    If ANY stage fails → Pipeline stops, team notified

Continuous Integration (CI)

Automatically build and test code changes frequently (multiple times per day)

Key Practices:

  • ✓ Commit code to main branch frequently
  • ✓ Automated build on every commit
  • ✓ Run automated tests
  • ✓ Fast feedback (< 10 minutes)
  • ✓ Fix broken builds immediately

Benefits:

• Catch bugs early • Reduce integration problems • Faster development • Higher code quality

Continuous Deployment (CD)

Automatically deploy every change that passes tests to production

Key Practices:

  • ✓ Automated deployment pipeline
  • ✓ Blue-green or canary deployments
  • ✓ Automated rollback on failure
  • ✓ Feature flags for risk mitigation
  • ✓ Comprehensive monitoring

Benefits:

• Faster time to market • Reduced deployment risk • Smaller, safer changes • Quick feedback

GitHub Actions Pipeline Example

# .github/workflows/ci-cd.yml
name: CI/CD Pipeline

on:
  push:
    branches: [ main, develop ]
  pull_request:
    branches: [ main ]

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    
    steps:
    - name: Checkout code
      uses: actions/checkout@v3
    
    - name: Setup Node.js
      uses: actions/setup-node@v3
      with:
        node-version: '18'
        cache: 'npm'
    
    - name: Install dependencies
      run: npm ci
    
    - name: Run linter
      run: npm run lint
    
    - name: Run unit tests
      run: npm test -- --coverage
    
    - name: Run integration tests
      run: npm run test:integration
    
    - name: Build application
      run: npm run build
    
    - name: Security scan
      run: npm audit --audit-level=high
    
    - name: Build Docker image
      run: |
        docker build -t myapp:${{ github.sha }} .
        docker tag myapp:${{ github.sha }} myapp:latest
    
    - name: Run container security scan
      uses: aquasecurity/trivy-action@master
      with:
        image-ref: 'myapp:latest'
    
  deploy-staging:
    needs: build-and-test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/develop'
    
    steps:
    - name: Deploy to staging
      run: |
        kubectl set image deployment/myapp \
          myapp=myapp:${{ github.sha }} \
          -n staging
    
    - name: Run smoke tests
      run: npm run test:smoke -- --env=staging
    
  deploy-production:
    needs: build-and-test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    
    steps:
    - name: Deploy to production (Blue-Green)
      run: |
        # Deploy to green environment
        kubectl set image deployment/myapp-green \
          myapp=myapp:${{ github.sha }} \
          -n production
        
        # Wait for deployment
        kubectl rollout status deployment/myapp-green -n production
        
        # Switch traffic (update service selector)
        kubectl patch service myapp -n production \
          -p '{"spec":{"selector":{"version":"green"}}}'
    
    - name: Monitor deployment
      run: |
        # Check error rate for 10 minutes
        npm run monitor:production
        
    - name: Rollback on failure
      if: failure()
      run: |
        kubectl patch service myapp -n production \
          -p '{"spec":{"selector":{"version":"blue"}}}'

Docker: Containerization

Containers package application code with dependencies, ensuring consistent behavior across environments.

Dockerfile Example

# Multi-stage build for optimized image
FROM node:18-alpine AS builder

WORKDIR /app

# Copy package files
COPY package*.json ./

# Install dependencies
RUN npm ci --only=production

# Copy source code
COPY . .

# Build application
RUN npm run build

# Production stage
FROM node:18-alpine

WORKDIR /app

# Create non-root user
RUN addgroup -g 1001 -S nodejs && \
    adduser -S nodejs -u 1001

# Copy built app from builder
COPY --from=builder --chown=nodejs:nodejs /app/dist ./dist
COPY --from=builder --chown=nodejs:nodejs /app/node_modules ./node_modules
COPY --from=builder --chown=nodejs:nodejs /app/package.json ./

# Switch to non-root user
USER nodejs

# Expose port
EXPOSE 3000

# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=40s \
  CMD node healthcheck.js || exit 1

# Start application
CMD ["node", "dist/server.js"]

Docker Compose (Local Development)

# docker-compose.yml
version: '3.8'

services:
  app:
    build: .
    ports:
      - "3000:3000"
    environment:
      - NODE_ENV=development
      - DB_HOST=postgres
      - REDIS_HOST=redis
    depends_on:
      - postgres
      - redis
    volumes:
      - ./src:/app/src  # Hot reload
    networks:
      - app-network
  
  postgres:
    image: postgres:15-alpine
    environment:
      - POSTGRES_DB=myapp
      - POSTGRES_USER=user
      - POSTGRES_PASSWORD=password
    volumes:
      - postgres-data:/var/lib/postgresql/data
    networks:
      - app-network
  
  redis:
    image: redis:7-alpine
    networks:
      - app-network
  
  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - app
    networks:
      - app-network

volumes:
  postgres-data:

networks:
  app-network:
    driver: bridge

Docker Best Practices

  • Use multi-stage builds (smaller images)
  • Use specific tags, not :latest
  • Run as non-root user
  • Minimize layers (combine RUN commands)
  • Use .dockerignore file
  • Scan images for vulnerabilities
  • Order layers by change frequency
  • Use health checks

Kubernetes: Container Orchestration

Kubernetes automates deployment, scaling, and management of containerized applications across clusters of machines.

Kubernetes Cluster Architecture:

┌─────────────────────────────────────────────────────────────────┐
│                        CONTROL PLANE                             │
│  ┌──────────────┐  ┌──────────────┐  ┌─────────────────────┐   │
│  │ API Server   │  │  Scheduler   │  │ Controller Manager  │   │
│  │ (Entry point)│  │ (Pod placing)│  │ (Desired state)     │   │
│  └──────────────┘  └──────────────┘  └─────────────────────┘   │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  etcd (Cluster state storage)                            │   │
│  └──────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│                         WORKER NODES                             │
├──────────────────────┬──────────────────────┬───────────────────┤
│   NODE 1             │   NODE 2             │   NODE 3          │
│  ┌────────────────┐  │  ┌────────────────┐  │  ┌──────────────┐│
│  │ kubelet        │  │  │ kubelet        │  │  │ kubelet      ││
│  │ (Node agent)   │  │  │ (Node agent)   │  │  │ (Node agent) ││
│  └────────────────┘  │  └────────────────┘  │  └──────────────┘│
│  ┌────────────────┐  │  ┌────────────────┐  │  ┌──────────────┐│
│  │ POD 1          │  │  │ POD 3          │  │  │ POD 5        ││
│  │ ┌────────────┐ │  │  │ ┌────────────┐ │  │  │┌───────────┐││
│  │ │ Container  │ │  │  │ │ Container  │ │  │  ││Container  │││
│  │ └────────────┘ │  │  │ └────────────┘ │  │  │└───────────┘││
│  └────────────────┘  │  └────────────────┘  │  └──────────────┘│
│  ┌────────────────┐  │  ┌────────────────┐  │                  │
│  │ POD 2          │  │  │ POD 4          │  │                  │
│  └────────────────┘  │  └────────────────┘  │                  │
└──────────────────────┴──────────────────────┴───────────────────┘

Kubernetes Deployment

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: production
spec:
  replicas: 3  # Run 3 pods
  selector:
    matchLabels:
      app: myapp
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: myapp
        version: v1.0.0
    spec:
      containers:
      - name: myapp
        image: myapp:1.0.0
        ports:
        - containerPort: 3000
        env:
        - name: NODE_ENV
          value: "production"
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: db-secret
              key: password
        resources:
          requests:
            memory: "128Mi"
            cpu: "100m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /health
            port: 3000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 3000
          initialDelaySeconds: 5
          periodSeconds: 5
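
The liveness and readiness probes above assume the application exposes /health and /ready endpoints. A minimal Express sketch (the db handle used for the dependency check is hypothetical):

// Liveness: the process is up and the event loop responds
app.get('/health', (req, res) => {
  res.status(200).send('ok');
});

// Readiness: only receive traffic once dependencies are reachable
app.get('/ready', async (req, res) => {
  try {
    await db.query('SELECT 1');  // hypothetical dependency check
    res.status(200).send('ready');
  } catch (err) {
    res.status(503).send('not ready');
  }
});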

Service & Ingress

# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp-service
spec:
  selector:
    app: myapp
  ports:
  - protocol: TCP
    port: 80
    targetPort: 3000
  type: ClusterIP

---
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-ingress
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt"
    nginx.ingress.kubernetes.io/rate-limit: "100"
spec:
  tls:
  - hosts:
    - api.example.com
    secretName: myapp-tls
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp-service
            port:
              number: 80

---
# horizontal-pod-autoscaler.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

Infrastructure as Code (IaC)

Manage infrastructure through code instead of manual processes. Version control, automation, and repeatability.

Terraform Example

# main.tf
provider "aws" {
  region = "us-east-1"
}

# VPC
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
  
  tags = {
    Name = "production-vpc"
  }
}

# EKS Cluster
resource "aws_eks_cluster" "main" {
  name     = "production-cluster"
  role_arn = aws_iam_role.eks_cluster.arn
  
  vpc_config {
    subnet_ids = aws_subnet.private[*].id
  }
}

# RDS Database
resource "aws_db_instance" "postgres" {
  identifier        = "production-db"
  engine            = "postgres"
  engine_version    = "15.3"
  instance_class    = "db.t3.medium"
  allocated_storage = 100
  
  db_name  = "myapp"
  username = "admin"
  password = var.db_password
  
  backup_retention_period = 7
  multi_az               = true
  
  tags = {
    Environment = "production"
  }
}

# S3 Bucket
resource "aws_s3_bucket" "uploads" {
  bucket = "myapp-uploads-production"
  
  versioning {
    enabled = true
  }
  
  server_side_encryption_configuration {
    rule {
      apply_server_side_encryption_by_default {
        sse_algorithm = "AES256"
      }
    }
  }
}

Benefits & Tools

Benefits:

  • ✓ Version control infrastructure
  • ✓ Repeatable deployments
  • ✓ Disaster recovery (rebuild from code)
  • ✓ Documentation (code is docs)
  • ✓ Cost tracking & optimization

Popular Tools:

  • Terraform: Multi-cloud, declarative
  • CloudFormation: AWS native
  • Pulumi: Use real programming languages
  • Ansible: Configuration management
  • CDK: AWS Cloud Development Kit
Section 16

Monitoring & Observability

"You can't improve what you can't measure." Monitoring and observability provide visibility into your system's health, performance, and behavior. The three pillars of observability are Logs, Metrics, and Traces.

The Three Pillars of Observability

📝

Logs

Discrete events with timestamps describing what happened

[2025-11-08 23:30:15] INFO: User login
  userId: 123
  ip: 192.168.1.100
  userAgent: Chrome/120

[2025-11-08 23:30:16] ERROR: DB query failed
  query: SELECT * FROM users
  error: Connection timeout
  duration: 5000ms

Best for: Debugging, audit trails, error analysis

Tools: ELK Stack, Loki, Splunk, Papertrail

📊

Metrics

Numerical measurements over time (counters, gauges, histograms)

http_requests_total 1547
http_request_duration_seconds 0.234
cpu_usage_percent 67.5
memory_used_bytes 2147483648
active_users 1250
db_connections_active 45
cache_hit_rate 0.95

Best for: Dashboards, alerts, trends, capacity planning

Tools: Prometheus, Grafana, Datadog, CloudWatch

🔍

Traces

Journey of a request through distributed system

TraceID: abc123
 
API Gateway  [20ms]
    ↓
Auth Service [15ms]
    ↓
User Service [45ms]
    ├─ DB Query [30ms]
    └─ Cache  [5ms]
    ↓
Response [Total: 80ms]

Best for: Performance bottlenecks, latency analysis

Tools: Jaeger, Zipkin, OpenTelemetry, AWS X-Ray

When to Use Each

Logs:

"What error occurred at 3 AM?"

"Who accessed this resource?"

Metrics:

"Is CPU usage trending up?"

"What's our 99th percentile latency?"

Traces:

"Why is this request slow?"

"Which service is the bottleneck?"

Prometheus + Grafana Stack

The industry-standard open-source monitoring solution. Prometheus collects metrics, Grafana visualizes them.

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'api-servers'
    static_configs:
      - targets: 
        - 'api1:9090'
        - 'api2:9090'
        - 'api3:9090'
    
  - job_name: 'databases'
    static_configs:
      - targets: ['postgres:9187']
      
  - job_name: 'redis'
    static_configs:
      - targets: ['redis:9121']

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

# Alert rules
rule_files:
  - 'alerts.yml'

alerting:
  alertmanagers:
    - static_configs:
      - targets: ['alertmanager:9093']

Instrumenting Application

// Node.js with prom-client
const promClient = require('prom-client');

// Create metrics
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.1, 0.5, 1, 2, 5]
});

const httpRequestTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

const activeConnections = new promClient.Gauge({
  name: 'active_connections',
  help: 'Number of active connections'
});

// Middleware to track metrics
app.use((req, res, next) => {
  const start = Date.now();
  
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    
    httpRequestDuration
      .labels(req.method, req.route?.path || req.path, res.statusCode)
      .observe(duration);
    
    httpRequestTotal
      .labels(req.method, req.route?.path || req.path, res.statusCode)
      .inc();
  });
  
  next();
});

// Expose metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', promClient.register.contentType);
  res.end(await promClient.register.metrics());
});

PromQL Queries (Prometheus Query Language)

# Request rate (requests per second)
rate(http_requests_total[5m])

# 95th percentile latency
histogram_quantile(0.95, 
  rate(http_request_duration_seconds_bucket[5m])
)

# Error rate
rate(http_requests_total{status_code=~"5.."}[5m])
/
rate(http_requests_total[5m])

# CPU usage average across pods
avg(cpu_usage_percent) by (pod)

# Memory usage > 80%
memory_used_bytes / memory_total_bytes > 0.8
# Alert Rules (alerts.yml)
groups:
- name: example_alerts
  rules:
  - alert: HighErrorRate
    expr: |
      rate(http_requests_total{status_code=~"5.."}[5m])
      / rate(http_requests_total[5m]) > 0.05
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High error rate detected"
      description: "Error rate is {{ $value }}%"
  
  - alert: HighLatency
    expr: |
      histogram_quantile(0.95,
        rate(http_request_duration_seconds_bucket[5m])
      ) > 1
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High latency detected"

ELK Stack (Elasticsearch, Logstash, Kibana)

Centralized logging solution for collecting, processing, storing, and visualizing log data.

Log Flow in ELK Stack:

┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│ Application  │────>│   Filebeat   │────>│   Logstash   │────>│Elasticsearch │
│   Logs       │     │  (Shipper)   │     │  (Process)   │     │   (Store)    │
└──────────────┘     └──────────────┘     └──────────────┘     └──────────────┘
                                                                        │
                                                                        ↓
                                                                ┌──────────────┐
                                                                │    Kibana    │
                                                                │ (Visualize)  │
                                                                └──────────────┘

Components:
- Filebeat: Lightweight log shipper
- Logstash: Parse, transform, enrich logs
- Elasticsearch: Store and index logs
- Kibana: Search, visualize, analyze

Logstash Configuration

# logstash.conf
input {
  beats {
    port => 5044
  }
  
  file {
    path => "/var/log/app/*.log"
    start_position => "beginning"
  }
}

filter {
  # Parse JSON logs
  json {
    source => "message"
  }
  
  # Parse timestamps
  date {
    match => ["timestamp", "ISO8601"]
    target => "@timestamp"
  }
  
  # Add geo location from IP
  geoip {
    source => "client_ip"
  }
  
  # Extract fields
  grok {
    match => { 
      "message" => "%{COMBINEDAPACHELOG}" 
    }
  }
  
  # Remove sensitive data
  mutate {
    remove_field => ["password", "credit_card"]
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "app-logs-%{+YYYY.MM.dd}"
  }
  
  # Also output to stdout for debugging
  stdout {
    codec => rubydebug
  }
}

Structured Logging Best Practices

// Bad logging
console.log('User logged in');
console.log('Error: ' + error);

// Good structured logging (winston)
const winston = require('winston');

const logger = winston.createLogger({
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  transports: [
    new winston.transports.File({ 
      filename: 'app.log' 
    })
  ]
});

logger.info('User logged in', {
  userId: 123,
  email: 'user@example.com',
  ip: '192.168.1.100',
  userAgent: req.headers['user-agent'],
  timestamp: new Date().toISOString()
});

logger.error('Database query failed', {
  error: error.message,
  stack: error.stack,
  query: 'SELECT * FROM users',
  duration: 5000,
  correlationId: req.id
});

Log Levels:

  • ERROR: Something failed
  • WARN: Something unexpected
  • INFO: Important events
  • DEBUG: Detailed info (dev only)
  • TRACE: Very detailed (never in prod)

Distributed Tracing with Jaeger

Track requests as they flow through microservices, identifying bottlenecks and dependencies.

OpenTelemetry Implementation

// Node.js with OpenTelemetry
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { SpanStatusCode } = require('@opentelemetry/api');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');

// Initialize tracer
const provider = new NodeTracerProvider();

// Configure Jaeger exporter
const exporter = new JaegerExporter({
  endpoint: 'http://jaeger:14268/api/traces',
  serviceName: 'api-service'
});

provider.addSpanProcessor(
  new BatchSpanProcessor(exporter)
);

provider.register();

// Auto-instrument HTTP and Express
registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation()
  ]
});

// Manual span creation
const tracer = provider.getTracer('api-service');

app.get('/api/users/:id', async (req, res) => {
  const span = tracer.startSpan('get-user-handler');
  
  try {
    // Child span for DB query
    const dbSpan = tracer.startSpan('db-query', {
      parent: span
    });
    const user = await db.query('SELECT * FROM users WHERE id = ?', [req.params.id]);
    dbSpan.end();
    
    // Child span for cache
    const cacheSpan = tracer.startSpan('cache-set');
    await redis.set(`user:${req.params.id}`, JSON.stringify(user));
    cacheSpan.end();
    
    res.json(user);
  } catch (error) {
    span.setStatus({ code: SpanStatusCode.ERROR });
    span.recordException(error);
    res.status(500).json({ error: error.message });
  } finally {
    span.end();
  }
});

Trace Visualization

Trace: Order Checkout (TraceID: abc123)
Total Duration: 856ms

API Gateway           [15ms] ────────┐
                                     ↓
Auth Service          [25ms]        └─→ [40ms]
                                         ↓
Order Service        [450ms] ←──────────┘
  ├─ Validate Items   [50ms]
  ├─ Check Inventory [120ms]
  │   └─ DB Query    [110ms] ← SLOW!
  ├─ Calculate Tax    [30ms]
  └─ Create Order    [250ms]
      ├─ DB Insert   [200ms] ← SLOW!
      └─ Cache Set    [40ms]
                        ↓
Payment Service      [300ms]
  ├─ Stripe API     [280ms]
  └─ Update Order    [20ms]
                        ↓
Notification        [51ms]
  └─ Send Email     [45ms]

Insights:
- DB queries are slow (need indexing?)
- Stripe API adds most latency
- Overall: 95% time in downstream services

Tracing Benefits:

  • ✓ Identify slow services/queries
  • ✓ Understand service dependencies
  • ✓ Debug complex failures
  • ✓ Optimize critical paths
  • ✓ Measure end-to-end latency
Section 17

Design Patterns

Design patterns are reusable solutions to common problems in software design. They represent best practices refined over years of experience and provide a shared vocabulary for developers to communicate complex ideas efficiently.

Architectural Design Patterns

MVC (Model-View-Controller)

Separates application into three interconnected components: Model (data), View (UI), Controller (logic).

MVC Architecture Flow:

      ┌─────────────────────────────────────┐
      │           USER                      │
      └───────────┬─────────────────────────┘
                  │ HTTP Request
                  ↓
      ┌─────────────────────────────────────┐
      │        CONTROLLER                   │
      │  • Handles requests                 │
      │  • Updates model                    │
      │  • Selects view                     │
      └────┬──────────────────────┬─────────┘
           │                      │
           ↓                      ↓
┌──────────────────┐    ┌─────────────────┐
│     MODEL        │    │      VIEW       │
│  • Business      │←───│  • Presentation │
│    logic         │    │  • Templates    │
│  • Data access   │    │  • HTML         │
│  • Validation    │    └─────────────────┘
└──────────────────┘
           │
           ↓
┌──────────────────┐
│    DATABASE      │
└──────────────────┘
Express.js MVC Example
// Model (models/User.js)
class User {
  constructor(db) {
    this.db = db;
  }
  
  async findById(id) {
    return await this.db.query(
      'SELECT * FROM users WHERE id = ?', 
      [id]
    );
  }
  
  async create(userData) {
    const { name, email, password } = userData;
    return await this.db.query(
      'INSERT INTO users (name, email, password) VALUES (?, ?, ?)',
      [name, email, password]
    );
  }
  
  async update(id, data) {
    return await this.db.query(
      'UPDATE users SET ? WHERE id = ?',
      [data, id]
    );
  }
}

// Controller (controllers/UserController.js)
class UserController {
  constructor(userModel) {
    this.userModel = userModel;
  }
  
  async getUser(req, res) {
    try {
      const user = await this.userModel.findById(req.params.id);
      res.render('user/profile', { user });
    } catch (error) {
      res.status(500).render('error', { error });
    }
  }
  
  async createUser(req, res) {
    try {
      const user = await this.userModel.create(req.body);
      res.redirect(`/users/${user.id}`);
    } catch (error) {
      res.status(400).render('user/new', { 
        error: error.message 
      });
    }
  }
}

// View (views/user/profile.ejs)
<!-- HTML template with user data -->
<h1><%= user.name %></h1>
<p>Email: <%= user.email %></p>

// Routes (routes/users.js)
const router = express.Router();
const controller = new UserController(userModel);

router.get('/users/:id', (req, res) => 
  controller.getUser(req, res)
);
router.post('/users', (req, res) => 
  controller.createUser(req, res)
);

Benefits:

  • ✓ Separation of concerns
  • ✓ Easier to test
  • ✓ Parallel development
  • ✓ Code reusability
  • ✓ Multiple views for same data

Use Cases:

  • Web applications
  • Desktop applications
  • Mobile apps
  • Any UI-driven application

Repository Pattern

Abstracts data access logic, providing a collection-like interface for accessing domain objects.

Without Repository (Direct DB Access)
// ❌ Tight coupling to database
class UserService {
  async getUser(id) {
    // Direct SQL in service
    const result = await db.query(
      'SELECT * FROM users WHERE id = ?',
      [id]
    );
    return result[0];
  }
  
  async createUser(data) {
    // Direct SQL
    const result = await db.query(
      'INSERT INTO users SET ?',
      [data]
    );
    return result.insertId;
  }
}

// Problems:
// - Hard to test
// - Hard to switch databases
// - SQL scattered everywhere
// - No abstraction
With Repository Pattern
// ✓ Clean abstraction
interface IUserRepository {
  findById(id: number): Promise<User>;
  findByEmail(email: string): Promise<User>;
  create(user: User): Promise<User>;
  update(id: number, user: User): Promise<User>;
  delete(id: number): Promise<void>;
}

class UserRepository implements IUserRepository {
  constructor(private db: Database) {}
  
  async findById(id: number): Promise<User> {
    const row = await this.db.query(
      'SELECT * FROM users WHERE id = ?',
      [id]
    );
    return this.mapToEntity(row);
  }
  
  async create(user: User): Promise<User> {
    const result = await this.db.query(
      'INSERT INTO users SET ?',
      [user]
    );
    return { ...user, id: result.insertId };
  }
  
  private mapToEntity(row: any): User {
    return new User(row.id, row.name, row.email);
  }
}

// Service uses repository
class UserService {
  constructor(private userRepo: IUserRepository) {}
  
  async getUser(id: number) {
    return await this.userRepo.findById(id);
  }
}

// Benefits:
// - Easy to mock for testing
// - Can swap implementations
// - Centralized data access
// - Clean separation

CQRS (Command Query Responsibility Segregation)

Separates read and write operations into different models. Commands modify data, Queries retrieve data.

Traditional (Single Model):
┌──────────┐
│  Client  │
└─────┬────┘
      ↓
┌─────────────┐
│  Service    │  Both reads and writes
│  (CRUD)     │  use same model
└──────┬──────┘
       ↓
┌──────────────┐
│  Database    │
└──────────────┘

CQRS (Separate Models):
                    
┌──────────┐         ┌──────────┐
│  Client  │         │  Client  │
└─────┬────┘         └─────┬────┘
      │ Write              │ Read
      ↓                    ↓
┌─────────────┐      ┌─────────────┐
│  Command    │      │   Query     │
│  Handler    │      │   Handler   │
└──────┬──────┘      └──────┬──────┘
       ↓                    ↓
┌──────────────┐      ┌──────────────┐
│ Write Model  │      │  Read Model  │
│ (Normalized) │─────→│(Denormalized)│
│  PostgreSQL  │ Sync │ Elasticsearch│
└──────────────┘      └──────────────┘
Command Side (Write)
// Command
class CreateOrderCommand {
  constructor(
    public userId: number,
    public items: OrderItem[],
    public total: number
  ) {}
}

// Command Handler
class CreateOrderHandler {
  constructor(
    private orderRepo: OrderRepository,
    private eventBus: EventBus
  ) {}
  
  async handle(command: CreateOrderCommand) {
    // Validate
    this.validateOrder(command);
    
    // Create order (write model)
    const order = new Order({
      userId: command.userId,
      items: command.items,
      total: command.total,
      status: 'pending'
    });
    
    await this.orderRepo.save(order);
    
    // Publish event
    await this.eventBus.publish(
      new OrderCreatedEvent(order)
    );
    
    return order.id;
  }
  
  private validateOrder(command: CreateOrderCommand) {
    if (command.items.length === 0) {
      throw new Error('Order must have items');
    }
    if (command.total <= 0) {
      throw new Error('Invalid total');
    }
  }
}
Query Side (Read)
// Query
class GetOrderQuery {
  constructor(public orderId: number) {}
}

// Query Handler
class GetOrderHandler {
  constructor(private readDb: ReadDatabase) {}
  
  async handle(query: GetOrderQuery) {
    // Read from optimized read model
    return await this.readDb.query(`
      SELECT 
        o.id,
        o.total,
        o.status,
        u.name as customer_name,
        u.email,
        GROUP_CONCAT(
          CONCAT(i.name, ' x', oi.quantity)
        ) as items
      FROM orders o
      JOIN users u ON o.user_id = u.id
      JOIN order_items oi ON o.id = oi.order_id
      JOIN items i ON oi.item_id = i.id
      WHERE o.id = ?
      GROUP BY o.id
    `, [query.orderId]);
  }
}

// Event Handler (Sync read model)
class OrderCreatedEventHandler {
  constructor(private readDb: ReadDatabase) {}
  
  async handle(event: OrderCreatedEvent) {
    // Update denormalized read model
    await this.readDb.upsert('order_views', {
      id: event.orderId,
      customer_name: event.customerName,
      total: event.total,
      items_count: event.items.length,
      created_at: event.timestamp
    });
  }
}

When to Use CQRS:

✓ Good For:

  • Complex business logic
  • Different read/write requirements
  • High read:write ratio
  • Need for multiple read models
  • Event sourcing architecture

✗ Overkill For:

  • Simple CRUD applications
  • Small systems
  • When consistency is critical
  • Limited team experience
  • Tight deadlines

Resilience Design Patterns

Circuit Breaker Pattern

Prevents cascading failures by stopping calls to a failing service, allowing it time to recover.

Circuit Breaker States:

┌─────────────────┐
│     CLOSED      │ Normal operation
│  (Allow calls)  │ Track failures
└────────┬────────┘
         │ Threshold exceeded
         │ (e.g., 5 failures)
         ↓
┌─────────────────┐
│      OPEN       │ Block all calls
│  (Fail fast)    │ Return error immediately
└────────┬────────┘
         │ After timeout
         │ (e.g., 30 seconds)
         ↓
┌─────────────────┐
│   HALF-OPEN     │ Allow test request
│  (Try one call) │
└────────┬────────┘
         │
    ┌────┴─────┐
    │          │
Success?    Failure?
    │          │
    ↓          ↓
  CLOSED     OPEN
Implementation
class CircuitBreaker {
  constructor(options = {}) {
    this.failureThreshold = options.threshold || 5;
    this.timeout = options.timeout || 30000;
    this.resetTimeout = null;
    this.state = 'CLOSED';
    this.failureCount = 0;
    this.nextAttempt = Date.now();
  }
  
  async call(fn) {
    if (this.state === 'OPEN') {
      if (Date.now() < this.nextAttempt) {
        throw new Error('Circuit breaker is OPEN');
      }
      this.state = 'HALF-OPEN';
    }
    
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }
  
  onSuccess() {
    this.failureCount = 0;
    this.state = 'CLOSED';
  }
  
  onFailure() {
    this.failureCount++;
    
    if (this.failureCount >= this.failureThreshold) {
      this.state = 'OPEN';
      this.nextAttempt = Date.now() + this.timeout;
      console.log('Circuit breaker opened');
    }
  }
}

// Usage
const breaker = new CircuitBreaker({
  threshold: 5,
  timeout: 30000
});

async function callExternalAPI() {
  try {
    return await breaker.call(async () => {
      return await fetch('https://api.example.com/data');
    });
  } catch (error) {
    console.log('Circuit breaker prevented call');
    return getCachedData(); // Fallback
  }
}

Benefits:

  • ✓ Prevents cascade failures
  • ✓ Fast fail (no waiting)
  • ✓ Automatic recovery
  • ✓ Protects downstream services
  • ✓ Reduces resource waste

Saga Pattern

Manages distributed transactions across microservices using compensating transactions.

// Order Saga Implementation
class OrderSaga {
  constructor(services) {
    this.orderService = services.orderService;
    this.paymentService = services.paymentService;
    this.inventoryService = services.inventoryService;
    this.shippingService = services.shippingService;
  }
  
  async execute(orderData) {
    const saga = {
      orderId: null,
      paymentId: null,
      reservationId: null,
      shipmentId: null
    };
    
    try {
      // Step 1: Create Order
      saga.orderId = await this.orderService.create(orderData);
      console.log('✓ Order created:', saga.orderId);
      
      // Step 2: Process Payment
      saga.paymentId = await this.paymentService.charge({
        orderId: saga.orderId,
        amount: orderData.total,
        cardToken: orderData.paymentToken
      });
      console.log('✓ Payment processed:', saga.paymentId);
      
      // Step 3: Reserve Inventory
      saga.reservationId = await this.inventoryService.reserve({
        orderId: saga.orderId,
        items: orderData.items
      });
      console.log('✓ Inventory reserved:', saga.reservationId);
      
      // Step 4: Schedule Shipping
      saga.shipmentId = await this.shippingService.schedule({
        orderId: saga.orderId,
        address: orderData.shippingAddress
      });
      console.log('✓ Shipping scheduled:', saga.shipmentId);
      
      // Success - Update order status
      await this.orderService.confirm(saga.orderId);
      
      return { success: true, orderId: saga.orderId };
      
    } catch (error) {
      console.error('Saga failed:', error.message);
      
      // Compensate in reverse order
      await this.compensate(saga, error);
      
      throw new Error('Order processing failed: ' + error.message);
    }
  }
  
  async compensate(saga, error) {
    console.log('Starting compensation...');
    
    // Reverse operations in opposite order
    if (saga.shipmentId) {
      await this.shippingService.cancel(saga.shipmentId);
      console.log('✓ Shipping cancelled');
    }
    
    if (saga.reservationId) {
      await this.inventoryService.release(saga.reservationId);
      console.log('✓ Inventory released');
    }
    
    if (saga.paymentId) {
      await this.paymentService.refund(saga.paymentId);
      console.log('✓ Payment refunded');
    }
    
    if (saga.orderId) {
      await this.orderService.cancel(saga.orderId, error.message);
      console.log('✓ Order cancelled');
    }
  }
}

// Usage
const saga = new OrderSaga(services);

try {
  const result = await saga.execute({
    userId: 123,
    items: [{ id: 1, quantity: 2 }],
    total: 99.99,
    paymentToken: 'tok_123',
    shippingAddress: { /* ... */ }
  });
  console.log('Order completed:', result.orderId);
} catch (error) {
  console.error('Order failed:', error.message);
}

Choreography-Based Saga:

  • Services listen to events
  • No central coordinator
  • More decoupled
  • Harder to track

Orchestration-Based Saga:

  • Central orchestrator
  • Explicit flow control
  • Easier to understand
  • Single point of failure

Other Essential Patterns

Retry Pattern

Automatically retry failed operations with exponential backoff

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function retry(fn, retries = 3) {
  for (let i = 0; i < retries; i++) {
    try {
      return await fn();
    } catch (error) {
      if (i === retries - 1) throw error;
      await sleep(Math.pow(2, i) * 1000); // 1s, 2s, 4s back-off
    }
  }
}

Bulkhead Pattern

Isolate resources to prevent cascade failures

Example: Separate thread pools for different services
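
A minimal in-process bulkhead sketch (the concurrency limits are illustrative):

// Bulkhead: cap concurrent calls so one slow dependency cannot exhaust all resources
class Bulkhead {
  constructor(maxConcurrent = 10) {
    this.maxConcurrent = maxConcurrent;
    this.active = 0;
  }

  async run(fn) {
    if (this.active >= this.maxConcurrent) {
      throw new Error('Bulkhead full: rejecting call');
    }
    this.active++;
    try {
      return await fn();
    } finally {
      this.active--;
    }
  }
}

// Separate bulkheads isolate dependencies from each other
const paymentBulkhead = new Bulkhead(20);
const reportingBulkhead = new Bulkhead(5);
// Usage: paymentBulkhead.run(() => callPaymentAPI())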

Adapter Pattern

Convert interface to match client expectations

Example: Wrapper for third-party APIs
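
A minimal adapter sketch: two payment providers exposed behind one charge() interface (the provider calls are illustrative shapes, not exact SDK signatures):

// Adapter: translate each provider's API into the interface our code expects
class StripeAdapter {
  constructor(stripeClient) { this.stripe = stripeClient; }
  async charge(amountCents, token) {
    return this.stripe.charges.create({ amount: amountCents, currency: 'usd', source: token });
  }
}

class PayPalAdapter {
  constructor(paypalClient) { this.paypal = paypalClient; }
  async charge(amountCents, token) {
    return this.paypal.createPayment({ total: amountCents / 100, token });
  }
}

// Callers depend only on adapter.charge(amountCents, token)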

Factory Pattern

Create objects without specifying exact class

Example: Database connection factory
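
A minimal factory sketch for the database-connection example (the concrete client classes are hypothetical):

// Factory: callers get a database client without knowing the concrete class
function createDatabase(type, config) {
  switch (type) {
    case 'postgres': return new PostgresClient(config);  // hypothetical classes
    case 'mysql':    return new MySqlClient(config);
    case 'sqlite':   return new SqliteClient(config);
    default: throw new Error(`Unsupported database type: ${type}`);
  }
}

const db = createDatabase(process.env.DB_TYPE || 'postgres', { host: 'localhost' });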

Observer Pattern

Subscribe to events and get notified of changes

Example: Event emitters, pub/sub
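
A minimal observer sketch using Node's built-in EventEmitter (the handler functions are hypothetical):

// Observer: subscribers react to events without the publisher knowing about them
const EventEmitter = require('events');
const orderEvents = new EventEmitter();

// Subscribers
orderEvents.on('order:created', (order) => sendConfirmationEmail(order));  // hypothetical handler
orderEvents.on('order:created', (order) => updateAnalytics(order));        // hypothetical handler

// Publisher
orderEvents.emit('order:created', { id: 42, total: 99.99 });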

Singleton Pattern

Ensure only one instance exists

Example: Database connection pool
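
A minimal singleton sketch for the connection-pool example (assumes the pg package; the pool size is illustrative):

// Singleton: the whole process shares one pool instead of opening connections everywhere
const { Pool } = require('pg');

let pool;
function getPool() {
  if (!pool) {
    pool = new Pool({
      connectionString: process.env.DATABASE_URL,
      max: 10  // illustrative pool size
    });
  }
  return pool;
}

module.exports = { getPool };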

Section 18

UML Diagrams

UML (Unified Modeling Language) provides a standardized way to visualize system design. Different diagram types serve different purposes in documenting and communicating system architecture.

Use Case Diagram

Shows how users (actors) interact with the system. Focuses on functionality from user's perspective.

E-Commerce Use Case Example

┌─────────────────────────────────────────────────────┐
│            E-Commerce System                        │
│                                                     │
│   ┌──────────────────────────────────────────┐    │
│   │  Browse Products                         │    │
│   └───────┬──────────────────────────────────┘    │
│           │                                        │
│   ┌───────▼──────────────────────────────────┐    │
│   │  Search Products                         │    │
│   └───────┬──────────────────────────────────┘    │
│           │                                        │
│   ┌───────▼──────────────────────────────────┐    │
│   │  Add to Cart                             │    │
│   └───────┬──────────────────────────────────┘    │
│           │                                        │
│   ┌───────▼──────────────────────────────────┐    │
│   │  Checkout                                │    │
│   │    ├─ Select Shipping                    │    │
│   │    ├─ Apply Coupon                       │    │
│   │    └─ Make Payment  ──────┐              │    │
│   └──────────────────────────┬┘              │    │
│                              │                │    │
│   ┌──────────────────────────▼──────────┐    │    │
│   │  Track Order                        │    │    │
│   └──────────────────────────┬──────────┘    │    │
│                              │                │    │
│   ┌──────────────────────────▼──────────┐    │    │
│   │  Write Review                       │    │    │
│   └─────────────────────────────────────┘    │    │
│                                               │    │
└───────────────────────────────────────────────┘    │
         ↑                              ↑             
         │                              │             
    ┌────┴────┐                  ┌─────┴──────┐     
    │ Customer│                  │  Payment   │     
    │ (Actor) │                  │  Gateway   │     
    └─────────┘                  │  (System)  │     
                                 └────────────┘     

Legend:
- Oval: Use Case
- Stick figure: Actor
- Line: Association
- Arrow: Include/Extend relationship

Key Components

Actors
  • Primary: Customer, Admin
  • Secondary: Payment gateway, Email service
  • System: Inventory management
Use Cases
  • Verb phrases (Browse, Search, Buy)
  • User goals (Complete order)
  • System functions (Process payment)
Relationships
  • Include: Always performed (Login ← Checkout)
  • Extend: Optional (Apply coupon → Checkout)
  • Generalization: Inheritance
When to Use
  • ✓ Requirements gathering
  • ✓ Stakeholder communication
  • ✓ Feature planning
  • ✓ Test case generation

Class Diagram

Shows system structure: classes, attributes, methods, and relationships between classes.

┌────────────────────────────────────────────────────────────────────┐
│                     Order Management System                        │
└────────────────────────────────────────────────────────────────────┘

┌───────────────────┐               ┌───────────────────┐
│      User         │               │      Order        │
├───────────────────┤               ├───────────────────┤
│ - id: number      │1            * │ - id: number      │
│ - name: string    │───────────────│ - userId: number  │
│ - email: string   │  places       │ - total: decimal  │
│ - password: string│               │ - status: string  │
├───────────────────┤               │ - createdAt: Date │
│ + register()      │               ├───────────────────┤
│ + login()         │               │ + calculate()     │
│ + updateProfile() │               │ + confirm()       │
└───────────────────┘               │ + cancel()        │
        △                           └─────────┬─────────┘
        │                                     │
        │ inherits                            │1
        │                                     │
┌───────┴───────┐                            │
│               │                             │contains
│               │                             │
┌───────────────┴─────┐            ┌─────────▼─────────┐
│   PremiumUser       │            │    OrderItem      │
├─────────────────────┤          * ├───────────────────┤
│ - membershipLevel   │            │ - orderId: number │
│ - discountRate      │            │ - productId: int  │
├─────────────────────┤            │ - quantity: int   │
│ + applyDiscount()   │            │ - price: decimal  │
└─────────────────────┘            ├───────────────────┤
                                   │ + getSubtotal()   │
                                   └─────────┬─────────┘
                                             │
                                             │references
                                             │
                                   ┌─────────▼─────────┐
                                   │     Product       │
                                   ├───────────────────┤
                                   │ - id: number      │
                                   │ - name: string    │
                                   │ - price: decimal  │
                                   │ - stock: int      │
                                   ├───────────────────┤
                                   │ + updateStock()   │
                                   │ + checkAvailability()│
                                   └───────────────────┘

Relationships:
─────── Association (uses)
───────● Composition (contains, strong ownership)
───────○ Aggregation (has-a, weak ownership)
───────△ Inheritance (is-a)

Visibility:
+ Public
- Private
# Protected
Relationships
  • Association: User places Order (1 to many)
  • Composition: Order contains OrderItems (strong)
  • Aggregation: Shopping cart has Products (weak)
  • Inheritance: PremiumUser extends User
  • Dependency: Uses temporarily
Multiplicity
  • 1: Exactly one
  • 0..1: Zero or one
  • *: Zero or more
  • 1..*: One or more
  • m..n: Range (e.g., 2..5)
Best Practices
  • ✓ Show only relevant details
  • ✓ Use clear naming
  • ✓ Group related classes
  • ✓ Show key relationships
  • ✓ Include multiplicity

Sequence Diagram

Shows how objects interact over time. Focuses on message flow and order of operations.

User Login Sequence

User      Browser    API Server   Auth Service   Database
 │           │            │             │            │
 │  Enter    │            │             │            │
 │ Password  │            │             │            │
 ├──────────>│            │             │            │
 │           │            │             │            │
 │           │POST /login │             │            │
 │           ├───────────>│             │            │
 │           │            │             │            │
 │           │            │ Credentials │            │
 │           │            ├────────────>│            │
 │           │            │             │            │
 │           │            │             │ Query user │
 │           │            │             ├───────────>│
 │           │            │             │            │
 │           │            │             │ User data  │
 │           │            │             │<───────────┤
 │           │            │             │            │
 │           │            │ Pwd valid   │            │
 │           │            │<────────────┤            │
 │           │            │             │            │
 │           │            │ Generate    │            │
 │           │            │   Token     │            │
 │           │            ├────────────>│            │
 │           │            │             │            │
 │           │            │   JWT       │            │
 │           │            │<────────────┤            │
 │           │            │             │            │
 │           │  200 OK    │             │            │
 │           │  + Token   │             │            │
 │           │<───────────┤             │            │
 │           │            │             │            │
 │  Success  │            │             │            │
 │<──────────┤            │             │            │
 │           │            │             │            │

Alternative flow (failure):
 │           │            │             │            │
 │           │            │ Invalid pwd │            │
 │           │            │<────────────┤            │
 │           │            │             │            │
 │           │ 401 Error  │             │            │
 │           │<───────────┤             │            │
 │           │            │             │            │
 │  Error    │            │             │            │
 │<──────────┤            │             │            │
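
The same flow, written as a request handler, might look roughly like the sketch below. It assumes Express, bcrypt, and jsonwebtoken; the findUserByEmail helper is a placeholder for your own user store, not a real library call.

// Minimal sketch of the login sequence above (Express, bcrypt, jsonwebtoken assumed).
const express = require('express');
const bcrypt = require('bcrypt');
const jwt = require('jsonwebtoken');

const app = express();
app.use(express.json());

// Placeholder for the "Query user" step; wire this to your own user store.
async function findUserByEmail(email) {
  return null; // e.g. SELECT * FROM users WHERE email = $1
}

app.post('/login', async (req, res) => {
  const { email, password } = req.body;

  // Query user (Auth Service -> Database in the diagram)
  const user = await findUserByEmail(email);

  // Verify credentials; respond 401 on the failure branch
  if (!user || !(await bcrypt.compare(password, user.passwordHash))) {
    return res.status(401).json({ error: 'Invalid credentials' });
  }

  // Generate token, then return 200 OK + token
  const token = jwt.sign({ sub: user.id }, process.env.JWT_SECRET, { expiresIn: '1h' });
  res.json({ token });
});

app.listen(3000);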

Components

Participants
  • Actors (User, Admin)
  • Systems (API Server, Database)
  • Objects (Controller, Service)
Messages
  • Synchronous: ──────> (wait for response)
  • Asynchronous: ─────-> (no wait)
  • Return: <────── (dashed)
  • Self-call: ↻ (to itself)
Control Structures
  • alt: if-else conditions
  • opt: optional flow
  • loop: iteration
  • par: parallel execution
Use Cases
  • ✓ API flow documentation
  • ✓ Debugging complex flows
  • ✓ Understanding timing
  • ✓ Integration testing

Deployment Diagram

Shows physical deployment of artifacts on hardware nodes. Represents infrastructure and deployment architecture.

┌──────────────────────────────────────────────────────────────────────┐
│                   Production Deployment Architecture                  │
└──────────────────────────────────────────────────────────────────────┘

                              ┌─────────────────┐
                              │  <<device>>     │
                              │  CloudFlare CDN │
                              │  (Global Edge)  │
                              └────────┬────────┘
                                       │ HTTPS
                                       ↓
                              ┌─────────────────┐
                              │  <<device>>     │
                              │  Load Balancer  │
                              │  (AWS ALB)      │
                              └────────┬────────┘
                                       │
                ┌──────────────────────┼──────────────────────┐
                │                      │                      │
                ↓                      ↓                      ↓
    ┌───────────────────┐  ┌───────────────────┐  ┌───────────────────┐
    │  <<device>>       │  │  <<device>>       │  │  <<device>>       │
    │  Web Server 1     │  │  Web Server 2     │  │  Web Server 3     │
    │  (EC2 t3.medium)  │  │  (EC2 t3.medium)  │  │  (EC2 t3.medium)  │
    ├───────────────────┤  ├───────────────────┤  ├───────────────────┤
    │ <<artifact>>      │  │ <<artifact>>      │  │ <<artifact>>      │
    │ Node.js App       │  │ Node.js App       │  │ Node.js App       │
    │ Docker Container  │  │ Docker Container  │  │ Docker Container  │
    └─────────┬─────────┘  └─────────┬─────────┘  └─────────┬─────────┘
              │                       │                       │
              └───────────────────────┼───────────────────────┘
                                      │
                                      ↓
                          ┌───────────────────────┐
                          │  <<database>>         │
                          │  Redis Cluster        │
                          │  (ElastiCache)        │
                          │  - Cache              │
                          │  - Session Store      │
                          └───────────────────────┘
                                      │
                                      ↓
                          ┌───────────────────────┐
                          │  <<database>>         │
                          │  PostgreSQL Primary   │
                          │  (RDS db.r5.large)    │
                          │  - Multi-AZ           │
                          └──────────┬────────────┘
                                     │ Replication
                      ┌──────────────┼──────────────┐
                      ↓              ↓              ↓
          ┌──────────────┐  ┌──────────────┐  ┌──────────────┐
          │ Read Replica │  │ Read Replica │  │ Read Replica │
          │  (us-east)   │  │  (us-west)   │  │  (eu-west)   │
          └──────────────┘  └──────────────┘  └──────────────┘

                          ┌───────────────────────┐
                          │  <<storage>>          │
                          │  S3 Bucket            │
                          │  - Static Assets      │
                          │  - User Uploads       │
                          │  - Backups            │
                          └───────────────────────┘

Network Configuration:
- VPC: 10.0.0.0/16
- Public Subnet: 10.0.1.0/24 (Load Balancer)
- Private Subnet: 10.0.2.0/24 (App Servers)
- Private Subnet: 10.0.3.0/24 (Databases)
- Security Groups: Configured per tier

Nodes
  • Physical servers
  • Virtual machines
  • Containers
  • Cloud services
  • Network devices
Artifacts
  • Application code
  • Docker images
  • JAR/WAR files
  • Static files
  • Configuration
Communication
  • HTTP/HTTPS
  • TCP/UDP
  • Database protocols
  • Message queues
  • RPC calls

Section 19

Cloud Architecture

Cloud architecture leverages cloud computing services to build scalable, resilient, and cost-effective systems. Understanding networking, compute, and storage services is essential for modern system design.

VPC & Network Architecture

Virtual Private Cloud (VPC) is your isolated network in the cloud. It provides complete control over networking configuration.

AWS VPC Architecture (Multi-AZ High Availability):

┌────────────────────────────────────────────────────────────────────────┐
│                    VPC: 10.0.0.0/16 (us-east-1)                       │
│                                                                        │
│  ┌──────────────────────────────────────────────────────────────────┐ │
│  │                      Internet Gateway                             │ │
│  └────────────────────────────┬─────────────────────────────────────┘ │
│                               │                                        │
│  ┌────────────────────────────┴─────────────────────────────────────┐ │
│  │                     NAT Gateway (Public Subnet)                   │ │
│  └────────────────────────────┬─────────────────────────────────────┘ │
│                               │                                        │
├───────────────────────────────┼────────────────────────────────────────┤
│    Availability Zone A        │       Availability Zone B              │
├───────────────────────────────┼────────────────────────────────────────┤
│  ┌─────────────────────────┐  │  ┌─────────────────────────┐          │
│  │ Public Subnet           │  │  │ Public Subnet           │          │
│  │ 10.0.1.0/24             │  │  │ 10.0.2.0/24             │          │
│  ├─────────────────────────┤  │  ├─────────────────────────┤          │
│  │ • Load Balancer         │  │  │ • Load Balancer         │          │
│  │ • Bastion Host          │  │  │ • NAT Gateway           │          │
│  └──────────┬──────────────┘  │  └──────────┬──────────────┘          │
│             │                 │             │                         │
│  ┌──────────▼──────────────┐  │  ┌──────────▼──────────────┐          │
│  │ Private Subnet          │  │  │ Private Subnet          │          │
│  │ 10.0.3.0/24             │  │  │ 10.0.4.0/24             │          │
│  ├─────────────────────────┤  │  ├─────────────────────────┤          │
│  │ • EC2 (App Servers)     │  │  │ • EC2 (App Servers)     │          │
│  │ • ECS/EKS (Containers)  │  │  │ • ECS/EKS (Containers)  │          │
│  │ • Lambda Functions      │  │  │ • Lambda Functions      │          │
│  └──────────┬──────────────┘  │  └──────────┬──────────────┘          │
│             │                 │             │                         │
│  ┌──────────▼──────────────┐  │  ┌──────────▼──────────────┐          │
│  │ Private Subnet          │  │  │ Private Subnet          │          │
│  │ 10.0.5.0/24             │  │  │ 10.0.6.0/24             │          │
│  ├─────────────────────────┤  │  ├─────────────────────────┤          │
│  │ • RDS Primary           │  │  │ • RDS Standby           │          │
│  │ • ElastiCache           │  │  │ • ElastiCache           │          │
│  │ • Internal Load Balancer│  │  │ • Internal Load Balancer│          │
│  └─────────────────────────┘  │  └─────────────────────────┘          │
│                               │                                        │
└───────────────────────────────┴────────────────────────────────────────┘

Security Groups:
- Web Tier: Allow 80/443 from Internet
- App Tier: Allow traffic from Web Tier only
- DB Tier: Allow 5432 from App Tier only

Network ACLs: Subnet-level firewall rules
Route Tables: Control traffic routing

Terraform VPC Configuration

# main.tf

# Look up the AZs in the current region (referenced by the subnets below)
data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
  enable_dns_support   = true
  
  tags = {
    Name = "production-vpc"
  }
}

# Internet Gateway
resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
}

# Public Subnets
resource "aws_subnet" "public" {
  count             = 2
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.${count.index + 1}.0/24"
  availability_zone = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true
  
  tags = {
    Name = "public-subnet-${count.index + 1}"
    Tier = "Public"
  }
}

# Private Subnets (App Tier)
resource "aws_subnet" "private_app" {
  count             = 2
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.${count.index + 3}.0/24"
  availability_zone = data.aws_availability_zones.available.names[count.index]
  
  tags = {
    Name = "private-app-subnet-${count.index + 1}"
    Tier = "Application"
  }
}

# Private Subnets (Data Tier)
resource "aws_subnet" "private_data" {
  count             = 2
  vpc_id            = aws_vpc.main.id
  cidr_block        = "10.0.${count.index + 5}.0/24"
  availability_zone = data.aws_availability_zones.available.names[count.index]
  
  tags = {
    Name = "private-data-subnet-${count.index + 1}"
    Tier = "Database"
  }
}

# NAT Gateway
resource "aws_eip" "nat" {
  count  = 2
  domain = "vpc"
}

resource "aws_nat_gateway" "main" {
  count         = 2
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index].id
}

# Route Tables
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id
  
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

resource "aws_route_table" "private" {
  count  = 2
  vpc_id = aws_vpc.main.id
  
  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main[count.index].id
  }
}

Security Groups

# security_groups.tf

# Load Balancer Security Group
resource "aws_security_group" "alb" {
  name        = "alb-sg"
  description = "Security group for ALB"
  vpc_id      = aws_vpc.main.id
  
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
    description = "HTTPS from Internet"
  }
  
  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
    description = "HTTP from Internet"
  }
  
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# Application Security Group
resource "aws_security_group" "app" {
  name        = "app-sg"
  description = "Security group for app servers"
  vpc_id      = aws_vpc.main.id
  
  ingress {
    from_port       = 3000
    to_port         = 3000
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
    description     = "Allow from ALB"
  }
  
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# Database Security Group
resource "aws_security_group" "db" {
  name        = "db-sg"
  description = "Security group for database"
  vpc_id      = aws_vpc.main.id
  
  ingress {
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [aws_security_group.app.id]
    description     = "PostgreSQL from app tier"
  }
}
Network Best Practices:
  • ✓ Use multiple AZs for high availability
  • ✓ Separate public/private subnets
  • ✓ NAT Gateway for outbound internet from private subnets
  • ✓ Security groups over NACLs (stateful vs stateless)
  • ✓ VPC Flow Logs for network monitoring
  • ✓ VPC Peering for multi-VPC communication

Auto Scaling Architecture

Automatically adjust compute capacity to maintain performance and optimize costs.

Auto Scaling Group Configuration

# auto_scaling.tf
resource "aws_launch_template" "app" {
  name_prefix   = "app-server-"
  image_id      = data.aws_ami.amazon_linux_2.id
  instance_type = "t3.medium"
  
  vpc_security_group_ids = [aws_security_group.app.id]
  
  user_data = base64encode(<<-EOF
    #!/bin/bash
    yum update -y
    yum install -y docker
    systemctl start docker
    systemctl enable docker
    
    # Pull and run application
    docker pull myapp:latest
    docker run -d -p 3000:3000 \
      -e DB_HOST=${aws_db_instance.main.address} \
      -e REDIS_HOST=${aws_elasticache_cluster.main.cache_nodes[0].address} \
      myapp:latest
  EOF
  )
  
  iam_instance_profile {
    name = aws_iam_instance_profile.app.name
  }
  
  monitoring {
    enabled = true
  }
}

resource "aws_autoscaling_group" "app" {
  name                = "app-asg"
  min_size            = 2
  max_size            = 10
  desired_capacity    = 3
  health_check_type   = "ELB"
  health_check_grace_period = 300
  
  vpc_zone_identifier = aws_subnet.private_app[*].id
  target_group_arns   = [aws_lb_target_group.app.arn]
  
  launch_template {
    id      = aws_launch_template.app.id
    version = "$Latest"
  }
  
  tag {
    key                 = "Name"
    value               = "app-server"
    propagate_at_launch = true
  }
}

# Scale up policy
resource "aws_autoscaling_policy" "scale_up" {
  name                   = "scale-up"
  autoscaling_group_name = aws_autoscaling_group.app.name
  adjustment_type        = "ChangeInCapacity"
  scaling_adjustment     = 2
  cooldown               = 300
}

# Scale down policy
resource "aws_autoscaling_policy" "scale_down" {
  name                   = "scale-down"
  autoscaling_group_name = aws_autoscaling_group.app.name
  adjustment_type        = "ChangeInCapacity"
  scaling_adjustment     = -1
  cooldown               = 300
}

# CloudWatch Alarms
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "high-cpu"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = "2"
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = "120"
  statistic           = "Average"
  threshold           = "70"
  alarm_actions       = [aws_autoscaling_policy.scale_up.arn]
  
  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.app.name
  }
}

resource "aws_cloudwatch_metric_alarm" "low_cpu" {
  alarm_name          = "low-cpu"
  comparison_operator = "LessThanThreshold"
  evaluation_periods  = "2"
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = "120"
  statistic           = "Average"
  threshold           = "30"
  alarm_actions       = [aws_autoscaling_policy.scale_down.arn]
  
  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.app.name
  }
}

Scaling Strategies

Target Tracking

Maintain a target metric value (e.g., 70% CPU)

resource "aws_autoscaling_policy" "target_tracking" {
  name                   = "target-tracking-cpu"
  autoscaling_group_name = aws_autoscaling_group.app.name
  policy_type            = "TargetTrackingScaling"
  
  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 70.0
  }
}
Step Scaling

Different scaling amounts based on alarm severity

  • CPU 50-70%: Add 1 instance
  • CPU 70-90%: Add 2 instances
  • CPU > 90%: Add 3 instances
Scheduled Scaling

Scale based on predictable patterns

resource "aws_autoscaling_schedule" "morning_scale_up" {
  autoscaling_group_name = aws_autoscaling_group.app.name
  scheduled_action_name  = "morning-scale-up"
  min_size               = 5
  max_size               = 10
  desired_capacity       = 8
  recurrence             = "0 8 * * MON-FRI"
}

resource "aws_autoscaling_schedule" "evening_scale_down" {
  autoscaling_group_name = aws_autoscaling_group.app.name
  scheduled_action_name  = "evening-scale-down"
  min_size               = 2
  max_size               = 10
  desired_capacity       = 3
  recurrence             = "0 20 * * *"
}

Multi-Region Deployment

Deploy across multiple geographic regions for global reach, disaster recovery, and compliance.

Multi-Region Active-Active Architecture:

                    ┌──────────────────────┐
                    │   Route 53 (DNS)     │
                    │  Latency-based       │
                    │  Routing             │
                    └────┬────────────┬────┘
                         │            │
           ┌─────────────┘            └─────────────┐
           │                                        │
           ↓                                        ↓
  ┌────────────────────┐                  ┌────────────────────┐
  │   US-EAST-1        │                  │   EU-WEST-1        │
  │   (Primary)        │                  │   (Secondary)      │
  ├────────────────────┤                  ├────────────────────┤
  │ • CloudFront CDN   │                  │ • CloudFront CDN   │
  │ • Load Balancer    │                  │ • Load Balancer    │
  │ • EC2 ASG (3-10)   │                  │ • EC2 ASG (3-10)   │
  │ • ElastiCache      │                  │ • ElastiCache      │
  │ • RDS Primary      │───Replication───►│ • RDS Read Replica │
  │ • S3 (Primary)     │───Replication───►│ • S3 (Replica)     │
  └────────────────────┘                  └────────────────────┘
           │                                        │
           └──────────────┬──────────────┬──────────┘
                          ↓              ↓
                  ┌───────────────────────────┐
                  │  Global DynamoDB Tables   │
                  │  (Multi-region)           │
                  └───────────────────────────┘

Benefits:
✓ Low latency (users routed to nearest region)
✓ Disaster recovery (automatic failover)
✓ Compliance (data residency requirements)
✓ Load distribution (reduce single region load)

Deployment Strategies

  • Active-Active: All regions serve traffic simultaneously. Better utilization, more complex.
  • Active-Passive: One region active, others on standby. Simpler, but wasted capacity.
  • Hot-Warm-Cold: Primary (hot), backup ready (warm), cold backup (archives).

Data Replication

  • Database: RDS cross-region replicas, DynamoDB global tables
  • Storage: S3 cross-region replication (CRR)
  • Cache: Redis Global Datastore
  • Consistency: Eventual consistency trade-off

Challenges

  • Data consistency across regions
  • Increased infrastructure costs
  • Complex deployment pipelines
  • Cross-region network latency
  • Testing disaster recovery

Best Practices

  • ✓ Use managed services (RDS, DynamoDB)
  • ✓ Automate everything (IaC)
  • ✓ Monitor all regions
  • ✓ Regular DR drills
  • ✓ Plan for region failure

Section 20

Cost Optimization

Cloud costs can spiral out of control if not managed properly. A well-architected system balances performance, reliability, and cost. Smart optimization can reduce cloud bills by 40-70% without sacrificing quality.

Cloud Cost Optimization Strategies

Compute Optimization

Right-Sizing Instances

Match instance types to actual workload requirements

# Analyze CloudWatch metrics
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time 2025-10-01T00:00:00Z \
  --end-time 2025-11-01T00:00:00Z \
  --period 3600 \
  --statistics Average

# Results show average CPU: 15%
# Current: t3.xlarge (4 vCPU, $0.1664/hr)
# Optimized: t3.medium (2 vCPU, $0.0416/hr)
# Savings: 75% lower hourly cost ≈ $91/month per instance
Spot Instances

Use spare AWS capacity at up to 90% discount

Best For: Batch jobs, data processing, CI/CD, stateless apps

Not For: Databases, critical services with strict SLAs

resource "aws_spot_fleet_request" "batch" {
  allocation_strategy      = "lowestPrice"
  target_capacity         = 10
  valid_until             = "2025-12-31T23:59:59Z"
  terminate_instances_with_expiration = true
  
  launch_specification {
    instance_type = "c5.large"
    ami           = "ami-12345678"
    spot_price    = "0.05"  # Max price
  }
  
  # Fallback to on-demand if no spot
  launch_specification {
    instance_type = "c5.large"
    ami           = "ami-12345678"
  }
}

# Savings: $0.085/hr (on-demand) → $0.025/hr (spot)
# 70% cost reduction!
Reserved Instances / Savings Plans

Commit to 1-3 years for predictable workloads

  • 1-year: 30-40% savings
  • 3-year: 50-72% savings
  • Mix: Reserved for baseline + Spot for peaks

Storage Optimization

S3 Storage Classes
Class            Cost/GB      Use Case
Standard         $0.023       Frequent access
IA               $0.0125      Monthly access
Glacier          $0.004       Archival
Deep Archive     $0.00099     Long-term
# S3 Lifecycle Policy
resource "aws_s3_bucket_lifecycle_configuration" "optimize" {
  bucket = aws_s3_bucket.main.id
  
  rule {
    id     = "move-to-ia"
    status = "Enabled"
    
    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }
    
    transition {
      days          = 90
      storage_class = "GLACIER"
    }
    
    transition {
      days          = 365
      storage_class = "DEEP_ARCHIVE"
    }
    
    expiration {
      days = 2555  # 7 years
    }
  }
}

# Example: 1TB data
# Month 1: $23 (Standard)
# Month 2-3: $12.50 (IA)
# Month 4-12: $4 (Glacier)
# Year 2+: $0.99 (Deep Archive)
# Total savings: 95% over 2 years
Database Optimization
  • Read Replicas: Offload read traffic (cheaper than scaling primary)
  • Aurora Serverless: Auto-scale based on load
  • Compression: Reduce storage by 60-80%
  • Archiving: Move old data to cheaper storage
Delete Unused Resources
# Find unattached EBS volumes
aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[*].[VolumeId,Size,VolumeType]'

# Find old snapshots (> 30 days)
aws ec2 describe-snapshots --owner-ids self \
  --query 'Snapshots[?StartTime<=`2024-10-01`].[SnapshotId,StartTime]'

# Unused Elastic IPs (charges apply!)
aws ec2 describe-addresses \
  --query 'Addresses[?AssociationId==null].[PublicIp,AllocationId]'

# Common waste: 30% of resources unused!

Network & Data Transfer Optimization

Data transfer costs are often overlooked but can be 10-20% of total cloud bill.

Data Transfer Pricing (AWS)

Transfer Type                  Cost
Data IN (Internet → AWS)       FREE
Data OUT (AWS → Internet)      $0.09/GB
Same AZ                        FREE
Cross-AZ (same region)         $0.01/GB each way
Cross-Region                   $0.02/GB
Via CloudFront                 $0.085/GB (cheaper!)
Cost Example:

API serving 10TB/month to users:

  • ❌ Direct from EC2: 10,000 GB × $0.09 = $900/month
  • ✅ Via CloudFront: 10,000 GB × $0.085 = $850/month
  • ✅ With caching (90% hit): 1,000 GB × $0.085 = $85/month
  • Savings: $815/month (90%!)

Optimization Strategies

1. Use CDN/CloudFront
  • ✓ Cache static content at edge
  • ✓ Reduce origin data transfer
  • ✓ Lower latency for users
  • ✓ Cheaper than direct transfer
2. Compress Data
# Enable compression in NGINX
gzip on;
gzip_types text/plain text/css application/json
           application/javascript text/xml
           application/xml image/svg+xml;
gzip_min_length 1000;

# Result: a 1MB response compresses to roughly 200KB (80% reduction)
# Effective transfer cost drops in proportion: $0.09/GB → about $0.018/GB
3. Regional Architecture
  • ✓ Deploy close to users
  • ✓ Minimize cross-region transfers
  • ✓ Use VPC endpoints (free transfer)
  • ✓ Keep services in same AZ when possible

Cost Monitoring & Governance

AWS Cost Management Tools

AWS Cost Explorer
  • Visualize spending trends
  • Forecast future costs
  • Filter by service, tag, region
  • Identify cost spikes
AWS Budgets
resource "aws_budgets_budget" "monthly" {
  name              = "monthly-budget"
  budget_type       = "COST"
  limit_amount      = "1000"
  limit_unit        = "USD"
  time_unit         = "MONTHLY"
  
  notification {
    comparison_operator = "GREATER_THAN"
    threshold          = 80
    threshold_type     = "PERCENTAGE"
    notification_type  = "ACTUAL"
    
    subscriber_email_addresses = [
      "team@example.com"
    ]
  }
}
Cost Allocation Tags
  • Tag by: Environment, Team, Project
  • Track costs per application
  • Chargeback to departments
  • Identify waste quickly

Cost Optimization Checklist

  • Right-size instances
    Review CPU/memory usage monthly
  • Use auto-scaling
    Scale down during off-hours
  • Implement caching
    Redis/Memcached, CloudFront CDN
  • Use Spot/Reserved instances
    Mix for optimal cost-performance
  • Delete unused resources
    Old snapshots, unattached volumes, test servers
  • Optimize storage
    S3 lifecycle policies, database cleanup
  • Monitor data transfer
    Use VPC endpoints, regional deployment
  • Set up alerts
    Budget alerts, anomaly detection
  • Review monthly
    Cost reports, optimization opportunities

Section 21

Real-World System Design Case Studies

Let's examine how tech giants architect their systems to serve billions of users. These case studies reveal practical applications of the concepts we've covered.

YouTube Architecture

2+ billion users, 500+ hours uploaded per minute

System Requirements

Functional:
  • Upload videos (any format, up to 256GB)
  • Stream videos (adaptive bitrate)
  • Search videos (full-text, filters)
  • User engagement (likes, comments, subscribe)
  • Recommendations (personalized)
Non-Functional:
  • 1 billion hours watched daily
  • Video starts in < 2 seconds
  • 99.9% availability
  • Global CDN for low latency
  • Handle viral videos (10M+ views)

High-Level Architecture

┌─────────────────────────────────────────┐
│          YouTube Architecture            │
└─────────────────────────────────────────┘

User Upload Flow:
[User] → [Load Balancer] → [Upload Service]
             ↓
    [Google Cloud Storage]
             ↓
    [Transcoding Pipeline]
      (Parallel workers)
             ↓
    Multiple formats:
    • 2160p (4K)
    • 1080p (Full HD)
    • 720p, 480p, 360p
    • Various codecs (VP9, H.264)
             ↓
    [CDN Distribution]
    (1000+ edge locations)

Video Playback Flow:
[User] → [CDN Edge Server] → [HLS/DASH Stream]
             ↓ (Cache Miss)
    [Origin Servers]
             ↓
    [Video Metadata DB]
    (MySQL/Bigtable)

Search & Recommendations:
[User Query] → [Search Service]
                   ↓
            [Elasticsearch]
                   ↓
            [ML Model]
         (TensorFlow recommendations)
                   ↓
            [Result Ranking]
Storage Strategy
  • Original Videos: Google Cloud Storage (cold storage for old videos)
  • Processed Videos: Multiple formats cached at edge
  • Metadata: MySQL + Bigtable (views, likes, comments)
  • Total: Exabytes (millions of TB)
Scalability Techniques
  • CDN: Cache popular videos at edge (90%+ hit rate)
  • Sharding: Partition by video ID
  • Read Replicas: Scale database reads
  • Async Processing: Transcoding in background
Key Optimizations
  • Adaptive Streaming: Adjust quality based on bandwidth (see the sketch after this list)
  • Prefetching: Load next segments ahead
  • Compression: VP9 codec (50% smaller)
  • ML: Predict popular videos, pre-cache
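
As a rough illustration of the adaptive-streaming idea above, a client can pick the highest rendition that fits the measured bandwidth. The bitrates below are illustrative, and real players (including YouTube's) use far more sophisticated, buffer-aware algorithms.

// Toy adaptive-bitrate selection: choose the best rendition that fits within a
// safety margin of the measured bandwidth. Illustrative only.
const RENDITIONS = [
  { name: '2160p', bitrateKbps: 20000 },
  { name: '1080p', bitrateKbps: 8000 },
  { name: '720p',  bitrateKbps: 5000 },
  { name: '480p',  bitrateKbps: 2500 },
  { name: '360p',  bitrateKbps: 1000 },
];

function pickRendition(measuredKbps, safetyFactor = 0.8) {
  const budget = measuredKbps * safetyFactor;
  // Renditions are ordered high to low, so the first fit is the best fit
  return RENDITIONS.find(r => r.bitrateKbps <= budget) || RENDITIONS[RENDITIONS.length - 1];
}

console.log(pickRendition(12000).name); // "1080p"
console.log(pickRendition(3500).name);  // "480p"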

WhatsApp Architecture

2+ billion users, 100+ billion messages daily

Architecture Highlights

WhatsApp Messaging Flow:

[User A]
   ↓ Send message
[WhatsApp Client]
   ↓ XMPP Protocol
[Load Balancer]
   ↓
[Chat Server (Erlang)]
   ↓
┌──────────────────────────┐
│ 1. Store message         │
│    (Mnesia DB - in-memory)│
│                          │
│ 2. Check User B online?  │
│    (Redis presence)      │
│                          │
│ 3. If online: Push       │
│    If offline: Queue     │
└──────────────────────────┘
   ↓
[User B receives message]
   ↓
[Send ACK (double check)]
   ↓
[Delete from server]

Key: Messages deleted after delivery!
End-to-end encrypted (Signal Protocol)
Tech Stack:
  • Language: Erlang (2M+ connections per server)
  • Database: Mnesia (in-memory), RocksDB
  • Caching: Redis (presence, online status; sketched below)
  • Media: S3-like object storage
  • Push: FCM (Android), APNs (iOS)
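
The "is User B online?" check in the flow above can be sketched with Redis. The snippet assumes the ioredis client; the key names, the 30-second TTL, and the pushToConnection helper are illustrative, not WhatsApp's actual implementation.

// Hedged sketch of presence tracking with Redis (ioredis assumed).
// A short-TTL key per user acts as an "online" flag; delivery falls back
// to an offline queue when the flag is absent.
const Redis = require('ioredis');
const redis = new Redis();

// Placeholder: deliver over the recipient's live connection (transport not shown)
function pushToConnection(userId, message) { /* e.g. write to their socket */ }

// Called on every client heartbeat; the flag expires if the connection drops
async function markOnline(userId) {
  await redis.set(`presence:${userId}`, '1', 'EX', 30);
}

async function deliverMessage(toUserId, message) {
  const online = await redis.exists(`presence:${toUserId}`);
  if (online) {
    pushToConnection(toUserId, message);                                // online: push now
  } else {
    await redis.rpush(`offline:${toUserId}`, JSON.stringify(message));  // offline: queue
  }
}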

Design Decisions

Why Erlang?
  • ✓ Built for telecom (millions of concurrent connections)
  • ✓ Hot code reloading (no downtime)
  • ✓ Lightweight processes (millions per server)
  • ✓ Built-in fault tolerance
Minimal Data Storage
  • Messages deleted after delivery
  • Only metadata stored (who sent what when)
  • Media stored temporarily (30 days)
  • Privacy-focused design
Scalability Approach
  • Sharding: By phone number hash
  • Load Balancing: Consistent hashing
  • Presence: Redis pub/sub
  • Team Size: ~50 engineers (incredibly efficient!)

End-to-End Encryption

WhatsApp uses the Signal Protocol for E2EE, so even WhatsApp's own servers cannot read message content. The steps below outline the core public-key idea; a minimal code sketch follows them.

How it works:

  1. Device generates a key pair (public/private)
  2. Public keys exchanged via server
  3. Message encrypted with recipient's public key
  4. Only recipient's private key can decrypt
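
A minimal sketch of the public/private-key idea in steps 3 and 4, using Node's built-in crypto module. This is not the Signal Protocol (which adds identity keys, prekeys, and double-ratchet key rotation); it only illustrates asymmetric encryption.

// Minimal sketch of asymmetric encryption with Node's built-in crypto module.
// NOT the Signal Protocol; purely to illustrate the steps listed above.
const crypto = require('crypto');

// Step 1: each device generates its own key pair
const { publicKey, privateKey } = crypto.generateKeyPairSync('rsa', {
  modulusLength: 2048,
});

// Step 2: the public key is shared via the server; the private key never leaves the device

// Step 3: the sender encrypts with the recipient's public key
const ciphertext = crypto.publicEncrypt(publicKey, Buffer.from('hello from Alice'));

// Step 4: only the recipient's private key can decrypt
const plaintext = crypto.privateDecrypt(privateKey, ciphertext);
console.log(plaintext.toString()); // "hello from Alice"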

Benefits:

  • ✓ Complete privacy
  • ✓ WhatsApp cannot access content
  • ✓ Forward secrecy (keys rotate)
  • ✓ Verification via QR code

Uber Architecture

Real-time ride matching, 10,000+ cities

Ride Matching Algorithm

Uber Ride Request Flow:

[User requests ride]
   ↓
[Location: lat=40.7589, lng=-73.9851]
   ↓
[Geohashing / QuadTree Indexing]
   ↓
Find nearby drivers:
┌─────────────────────────────┐
│ QuadTree (spatial index)    │
│                             │
│   Search radius: 2 miles    │
│   Found: 15 drivers         │
│                             │
│   Filter:                   │
│   • Available               │
│   • High rating (>4.7)      │
│   • Car type matches        │
│   Result: 8 drivers         │
└─────────────────────────────┘
   ↓
[Score & Rank drivers]
  Factors:
  • Distance to pickup
  • Driver rating
  • Acceptance rate
  • Estimated wait time
   ↓
[Send request to top driver]
   ↓
Driver accepts? 
  ✓ Yes → Match confirmed
  ✗ No → Try next driver (15s timeout)
   ↓
[Real-time tracking]
  WebSocket connection
  Location updates every 4s
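
The real-time tracking step above can be sketched as a WebSocket endpoint that ingests driver pings and writes them into a Redis geo index, mirroring the GEOADD/GEORADIUS commands further below. The ws and ioredis packages and the message shape are assumptions.

// Hedged sketch: ingest driver location pings over WebSocket (ws package) and
// index them in Redis GEO, so nearby-driver lookups hit a spatial index
// instead of scanning every driver.
const WebSocket = require('ws');
const Redis = require('ioredis');

const redis = new Redis();
const wss = new WebSocket.Server({ port: 8080 });

wss.on('connection', (socket) => {
  socket.on('message', async (raw) => {
    // Illustrative payload: {"driverId":"driver123","lat":40.7589,"lng":-73.9851}
    const { driverId, lat, lng } = JSON.parse(raw);

    // GEOADD takes longitude before latitude
    await redis.geoadd('drivers', lng, lat, driverId);
  });
});

// Matching side: candidate drivers within 2 km of a pickup point, nearest first
async function nearbyDrivers(lat, lng, radiusKm = 2) {
  return redis.georadius('drivers', lng, lat, radiusKm, 'km', 'ASC');
}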

Technical Components

Geospatial Indexing
// Geohash encoding
Location: 40.7589, -73.9851
Geohash: dr5ru6
Precision: 6 chars = ~1.2km area

// QuadTree for efficient search
class QuadTree {
  search(lat, lng, radius) {
    // O(log n) instead of O(n)
    // Check only relevant quadrants
    return nearby_drivers;
  }
}

// Redis Geo commands
GEOADD drivers -73.9851 40.7589 "driver123"
GEORADIUS drivers -73.9851 40.7589 2 km
Surge Pricing
  • Algorithm: Supply vs Demand in area
  • Real-time: Calculate every 1-2 minutes
  • Factors: Request rate, driver availability, events
  • Implementation: Kafka streams + ML (a toy illustration follows this list)
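
As a toy illustration of the supply/demand idea only (the production pipeline is streaming plus ML, as noted above), a per-area multiplier could be computed like this:

// Toy surge multiplier: ratio of open requests to available drivers in an area,
// clamped to a sane range. Purely illustrative; not Uber's actual algorithm.
function surgeMultiplier(openRequests, availableDrivers, { min = 1.0, max = 3.0 } = {}) {
  if (availableDrivers === 0) return max;          // no supply: cap the surge
  const ratio = openRequests / availableDrivers;   // demand pressure
  return Math.min(max, Math.max(min, ratio));
}

console.log(surgeMultiplier(24, 10)); // 2.4 (demand spike)
console.log(surgeMultiplier(5, 10));  // 1   (no surge when supply exceeds demand)
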
Schemaless Storage

Uber's custom MySQL-based storage (handles 100K+ writes/sec)

  • Stores trip data, driver info, payments
  • Auto-sharding by ID
  • Fallback to another shard on failure
Microservices
  • Dispatch Service
  • Pricing Service
  • Payment Service
  • Notification Service
  • Maps Service
  • ETA Service
  • 2,200+ microservices!
Tech Stack
  • Backend: Go, Python, Node.js
  • DB: MySQL, PostgreSQL, Cassandra
  • Cache: Redis
  • Messaging: Kafka
  • Orchestration: Kubernetes
Challenges Solved
  • ✓ Real-time matching (< 1s)
  • ✓ GPS accuracy issues
  • ✓ Network reliability
  • ✓ Peak hour scaling (10x load)
  • ✓ Global consistency

Twitter (X) Architecture

500M tweets/day, real-time timeline delivery

Timeline Delivery Challenge

Problem:

Celebrity with 100M followers tweets. How to deliver to all followers' timelines instantly?

❌ Bad Approach: Pull on Read

When user opens app:
1. Find people they follow (1000 users)
2. Get latest tweets from each (1000 queries!)
3. Merge and sort by timestamp
4. Return top 50

Problem: Too slow! 1000+ DB queries

✓ Good Approach: Fan-out on Write

When celebrity tweets:
1. Insert tweet into timeline of all followers
   (Async, background workers)
2. Pre-compute timelines

When user opens app:
1. Read from their pre-computed timeline (1 query!)
2. Fast! O(1) instead of O(n)

Implementation: Redis sorted sets
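
A hedged sketch of fan-out on write with Redis sorted sets, assuming the ioredis client; the key names and the 800-entry cap are illustrative.

// Fan-out on write with Redis sorted sets (ioredis assumed).
// One sorted set per user; the score is the tweet timestamp, so ZREVRANGE
// returns the newest tweets first.
const Redis = require('ioredis');
const redis = new Redis();

async function fanOutTweet(tweetId, followerIds, createdAt) {
  const pipeline = redis.pipeline();
  for (const followerId of followerIds) {
    // Push the tweet id into each follower's pre-computed timeline
    pipeline.zadd(`timeline:${followerId}`, createdAt, tweetId);
    // Keep only the most recent ~800 entries per timeline
    pipeline.zremrangebyrank(`timeline:${followerId}`, 0, -801);
  }
  await pipeline.exec();
}

async function readTimeline(userId, count = 50) {
  // One Redis call instead of fanning out 1000 DB queries at read time
  return redis.zrevrange(`timeline:${userId}`, 0, count - 1);
}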

Hybrid Approach

For Normal Users (< 10K followers):

Fan-out on write - Push to all followers' timelines

  • ✓ Fast reads
  • ✓ Acceptable write cost
For Celebrities (100M+ followers):

Pull on read - Fetch their tweets when user reads timeline

  • ✓ Avoid 100M writes per tweet
  • ✓ Cache aggressively
Timeline Assembly:
// Merge timelines
timeline = get_precomputed_timeline(user_id);
celebrity_tweets = get_celebrity_tweets(user_id);
final_timeline = merge(timeline, celebrity_tweets);
return top(final_timeline, 50);
Tech Stack Highlights:
  • Backend: Scala, Java
  • Timeline: Manhattan (distributed DB)
  • Cache: Redis, Memcached
  • Search: Elasticsearch
  • Messaging: Kafka
  • Storage: MySQL, Manhattan

Conclusion & Final Notes

Congratulations!

You've completed an extensive journey through Software System Design. From fundamentals to advanced architecture patterns, you now have the knowledge to design scalable, reliable, and maintainable systems.

📚

21 Chapters

Comprehensive coverage of all system design topics

💻

100+ Examples

Real code and architecture diagrams

🚀

Production-Ready

Battle-tested patterns and best practices

"The best system design is one that solves real problems elegantly, scales gracefully, and can be maintained by your team."

Key Takeaways

Design Principles

  • Start Simple: Don't over-engineer. Add complexity only when needed.
  • Think Trade-offs: Every decision has pros and cons. Choose wisely.
  • Design for Failure: Assume everything will fail. Plan accordingly.
  • Measure Everything: You can't improve what you don't measure.

Best Practices

  • Automate: CI/CD, testing, infrastructure, monitoring.
  • Document: Architecture decisions, APIs, runbooks.
  • Iterate: Build → Measure → Learn → Improve.
  • Communicate: Share knowledge, review designs, collaborate.

Scalability Mindset

  • Horizontal > Vertical: Scale out, not just up.
  • Stateless Services: Easier to scale and deploy.
  • Cache Aggressively: Every layer can benefit.
  • Async When Possible: Don't block user requests.

Common Pitfalls to Avoid

  • Premature Optimization: Build for now, not hypothetical scale.
  • Single Points of Failure: Always have redundancy.
  • Ignoring Monitoring: You're flying blind without observability.
  • Technical Debt: Pay it down regularly or it compounds.

Continue Your Learning Journey

Books

  • Designing Data-Intensive Applications by Martin Kleppmann
  • System Design Interview by Alex Xu
  • Building Microservices by Sam Newman
  • Site Reliability Engineering by Google
  • Clean Architecture by Robert Martin

Online Courses

  • Grokking System Design (Educative)
  • System Design Primer (GitHub)
  • Microservices Architecture (Udemy)
  • AWS Solutions Architect
  • Kubernetes Deep Dive

Websites & Blogs

  • High Scalability (highscalability.com)
  • Martin Fowler's Blog
  • Engineering blogs: Netflix, Uber, Airbnb
  • AWS Architecture Blog
  • InfoQ, Hacker News

Communities

  • r/systemdesign (Reddit)
  • System Design Discord servers
  • Dev.to, Hashnode
  • Local meetups & conferences
  • Twitter: #systemdesign

Hands-On Practice

  • Build side projects
  • Contribute to open source
  • LeetCode system design problems
  • Mock interviews (Pramp, Interviewing.io)
  • Study real systems on GitHub

Certifications

  • AWS Solutions Architect
  • Google Cloud Architect
  • Azure Solutions Architect
  • Kubernetes (CKA, CKAD)
  • TOGAF (Enterprise Architecture)

Your Journey Starts Now

System design is not just about passing interviews—it's about building systems that solve real problems, delight users, and stand the test of time. Start small, iterate often, and never stop learning.

Remember:

"Build. Measure. Learn."

"The best time to start was yesterday. The second best time is now."

🎯

Set Goals

💪

Practice Daily

🚀

Ship It!