Non-Functional System Characteristics | System Design

Introduction

When architecting systems, functional requirements define what the application does (e.g., "A user can add an item to the shopping cart" or "A client can post a text message").

Conversely, non-functional characteristics—often called the "-ilities" of software architecture—define how the system behaves under heavy traffic, during data-center outages, and across extended operational lifespans. These structural traits separate a fragile prototype running on a local machine from an enterprise ecosystem capable of serving millions of concurrent global requests.

1. Scalability vs. Elasticity

While closely related, a system can be highly scalable without necessarily being elastic.

Scalability: The capacity of an architecture to handle growing data volume or traffic by adding resources, either to a single machine or across many machines.
Elasticity: The system's ability to add and release resources automatically as demand fluctuates (e.g., an e-commerce platform automatically spinning up 50 nodes on Black Friday and tearing them down at midnight).
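The distinction can be made concrete with a small sketch of the decision an elastic system makes on every evaluation cycle. This is an illustrative calculation only, not any cloud provider's autoscaling API; the function name `desired_node_count`, the per-node capacity, and the node bounds are all assumptions made for the example.

```python
import math


def desired_node_count(observed_rps: float, rps_per_node: float,
                       min_nodes: int = 2, max_nodes: int = 50) -> int:
    """Return how many nodes the observed load calls for, clamped to bounds.

    All parameters are hypothetical: rps_per_node is an assumed capacity
    figure that would normally come from load testing.
    """
    if rps_per_node <= 0:
        raise ValueError("rps_per_node must be positive")
    needed = math.ceil(observed_rps / rps_per_node)
    # Never drop below the redundancy floor or exceed the budget ceiling.
    return max(min_nodes, min(max_nodes, needed))


# Hypothetical traffic samples (requests per second) across a sale day.
for rps in (800, 12_000, 48_000, 900):
    print(rps, "rps ->", desired_node_count(rps, rps_per_node=1_000), "nodes")
```

A scalable but non-elastic system can also run 50 nodes, but someone has to provision them by hand; elasticity is the automation of this calculation plus the provisioning that follows it.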

Vertical Scaling (Scale Up)

Increasing the capacity of a single computing node by provisioning larger components (more vCPU cores, more RAM, faster NVMe SSDs).

Pros: Little or no architectural refactoring required; latency stays low because communication between threads and processes happens in memory on a single machine without crossing network boundaries.
Cons: Bound by the physical limits of a single machine. The node remains a Single Point of Failure (SPOF), resizing often requires maintenance downtime, and cost tends to rise much faster than capacity at the high end of the hardware range.

Horizontal Scaling (Scale Out)

Expanding the system's runtime capability by adding extra computing machines into the active pool.

Pros: A much higher ceiling on total compute capacity than any single machine offers. Provides built-in redundancy and can reduce cost by running workloads across commodity server fleets.
Cons: Significantly raises architectural complexity. Application servers should be stateless (or keep session state in a shared store) so that any node can safely handle any request, which in turn requires reverse proxies or load balancers and, typically, a clustered or distributed data layer.

2. Availability vs. Reliability

These terms represent fundamentally distinct facets of architectural resilience. A system can be continually available while being deeply unreliable.

Reliability

The probability that a system or component performs its intended function under specified conditions for a specified period of time without failure. It is commonly tracked using MTBF (Mean Time Between Failures).

The Analogy: A commercial passenger jet is reliable if it completes an entire scheduled flight without mid-air failures. If an engine fails halfway through a flight, the aircraft is unreliable.

Availability

The percentage of time a platform is responsive and reachable to accept and fulfill client requests. It is calculated from the relationship between how often failures occur and how long recovery takes:

$$\text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}$$

Where MTTR is the Mean Time to Repair (or Restore): how quickly engineers can patch, reboot, or replace a failed node and return it to production.
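As a quick worked example of the formula, the sketch below plugs in made-up figures. The function name `availability` and the MTBF/MTTR values are illustrative assumptions, not measurements from a real system.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability as MTBF / (MTBF + MTTR)."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical node: fails on average every 999 hours, takes 1 hour to restore.
baseline = availability(mtbf_hours=999.0, mttr_hours=1.0)
print(f"{baseline:.3%}")  # 99.900%

# Halving the repair time raises availability without touching MTBF.
faster_repair = availability(mtbf_hours=999.0, mttr_hours=0.5)
print(faster_repair > baseline)  # True
```

The second call illustrates why the formula matters in practice: availability can be improved either by failing less often (raising MTBF) or by recovering faster (lowering MTTR).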

The Analogy: An airplane is available if it sits at the airport gate fueled and ready for passenger boarding. Technically, it can register as available even if its navigation computers are experiencing a latent bug that will trigger a crash mid-flight (making it available but deeply unreliable).

The "Nines" of Availability

High Availability (HA) is conventionally expressed in "Nines." Every extra nine of uptime significantly changes the backend design and engineering cost:

Availability	Allowed Downtime per Year	Allowed Downtime per Day	Class of Production System
99% ("Two Nines")	3.65 days	14.4 minutes	Internal administrative scripts, sandbox tools.
99.9% ("Three Nines")	8.76 hours	1.44 minutes	Standard digital consumer web storefronts.
99.99% ("Four Nines")	52.6 minutes	8.64 seconds	Scaled cloud data layer clusters, global payment pipelines.
99.999% ("Five Nines")	5.26 minutes	0.86 seconds	Telecom core carrier networks, critical healthcare grids.
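The downtime figures in the table follow directly from the availability percentage. The sketch below reproduces them, assuming a 365-day year (which is what the table's numbers imply); the function and constant names are made up for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY  # assumption: 365-day year


def allowed_downtime_seconds(availability_percent: float,
                             period_seconds: float) -> float:
    """Seconds of downtime permitted in a period at the given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return period_seconds * (100 - availability_percent) / 100


for percent in (99.0, 99.9, 99.99, 99.999):
    per_year_min = allowed_downtime_seconds(percent, SECONDS_PER_YEAR) / 60
    per_day_sec = allowed_downtime_seconds(percent, SECONDS_PER_DAY)
    print(f"{percent}%: {per_year_min:.2f} min/year, {per_day_sec:.2f} s/day")
```

Each additional nine divides the downtime budget by ten, which is why the step from four to five nines leaves only about five minutes per year for every incident, deployment problem, and failover combined.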

[!CAUTION] Designing for 99.999% ("Five Nines") availability typically requires multi-region active-active database configurations and automated traffic rerouting that completes within seconds. This sharply increases platform engineering complexity and monthly cloud spend, in part because of the continuous data synchronization traffic between regions.

3. Fault Tolerance & Redundancy

Fault tolerance is the ability of a system to keep operating correctly even when some of its underlying hardware or software components fail. Achieving this requires eliminating Single Points of Failure (SPOFs) through redundancy.

Redundancy Strategies

Active-Passive (Failover): A primary server handles 100% of the workload while an identical secondary server waits as a hot or warm standby. A cluster coordinator continuously monitors a heartbeat signal from the primary server. If the heartbeat stops, the coordinator promotes the standby and reroutes traffic to it (failover), usually within seconds rather than instantly.
Active-Active (Dynamic Load-Balancing): All nodes within the computing cluster actively ingest and process a fractional division of global production traffic concurrently. If an active instance crashes, the fronting network load balancer isolates the dead node from its rotation registry and smoothly redistributes the traffic strain across the remaining healthy machines.
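The active-active behavior described above can be sketched as a round-robin balancer that skips nodes marked unhealthy. This is a toy in-memory model under stated assumptions: the class `RoundRobinBalancer` and its methods are hypothetical, and a real load balancer would discover failures through health checks rather than explicit `mark_down` calls.

```python
class RoundRobinBalancer:
    """Toy active-active balancer: rotates over nodes, skipping unhealthy ones."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)
        self._next = 0  # index of the next node to try

    def mark_down(self, node):
        # In practice a failed health check would trigger this.
        self.healthy.discard(node)

    def mark_up(self, node):
        if node in self.nodes:
            self.healthy.add(node)

    def pick(self):
        """Return the next healthy node, or raise if none remain."""
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next % len(self.nodes)]
            self._next += 1
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes available")


balancer = RoundRobinBalancer(["node-a", "node-b", "node-c"])
print([balancer.pick() for _ in range(3)])
balancer.mark_down("node-b")  # simulate a crash
print([balancer.pick() for _ in range(4)])  # node-b no longer appears
```

Note that the surviving nodes absorb the failed node's share of traffic, so an active-active cluster needs enough spare capacity to tolerate the loss of at least one member.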

Resiliency Patterns (Beyond Redundant Hardware)

True fault tolerance requires software-level protective guardrails to keep a single bad component from causing a cascading failure across your entire microservices mesh:

Circuit Breakers: If a downstream billing service becomes unresponsive, the upstream caller trips a local circuit breaker. Subsequent calls fail fast instead of tying up threads and connections while waiting for timeouts, which protects the caller's thread pool.
Rate Limiting & Throttling: Caps the rate of incoming requests so that traffic bursts cannot exhaust system threads, preserving baseline performance for legitimate users during sudden surges.
Graceful Degradation: The design choice to switch off secondary, heavy cosmetic features (such as user recommendations or real-time typing indicators) during peak traffic spikes to prioritize vital transactional paths (such as purchases or messages).
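Of the patterns above, the circuit breaker is the easiest to show in a few lines. The following is a minimal single-threaded sketch, not a production implementation or any particular library's API: the class names, the `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are all assumptions chosen for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to wait before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: fail fast instead of waiting on a dead dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            half_open = self._opened_at is not None
            if half_open or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(charge_customer, order)` where `charge_customer` is a hypothetical billing client function, and treat `CircuitOpenError` as a signal to return a fallback response. A real implementation would also need thread safety and per-dependency breaker instances.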

4. SLA, SLO, and SLI

Engineering organizations quantify, monitor, and legally commit to non-functional system attributes by aligning three core service metrics:

SLI (Service Level Indicator): A specific, atomic, quantitative metric measuring raw runtime system performance in production.
- Example: "The HTTP response latency of our /checkout API endpoint over a rolling 5-minute interval."
SLO (Service Level Objective): A target value or range established as the goal for a specific SLI.
- Example: "99.5% of valid HTTP requests to our /checkout API endpoint must return a response in less than 200ms over any given 30-day billing window."
SLA (Service Level Agreement): The binding contract that states the operational commitments made to paying customers. It references SLO benchmarks and specifies the financial penalties, account credits, or refunds owed if the service fails to meet them.
- Example: "If our core payment platform availability falls below our 99.9% SLO within any single calendar month, our corporate clients will receive a 15% billing credit on their next subscription cycle."
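To show how the SLI feeds the SLO, the sketch below computes a latency SLI from a list of response times and checks it against the 200ms threshold and 99.5% objective used in the /checkout example. The function names and the sample latencies are hypothetical; a real pipeline would read these values from a metrics system.

```python
def latency_sli(latencies_ms, threshold_ms=200.0):
    """SLI: fraction of requests that completed faster than threshold_ms."""
    if not latencies_ms:
        raise ValueError("need at least one request to compute an SLI")
    fast = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return fast / len(latencies_ms)


def meets_slo(sli, objective=0.995):
    """SLO check: is the measured SLI at or above the objective?"""
    return sli >= objective


# Made-up window: 996 fast requests and 4 slow ones.
window = [120.0] * 996 + [450.0] * 4
sli = latency_sli(window)
print(f"SLI = {sli:.1%}, SLO met: {meets_slo(sli)}")
```

The SLA sits one level above this check: it is the contract clause that says what the provider owes when the comparison comes out false for a billing period.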