API Governance for Engineering Organizations

How to organize and manage microservice APIs at scale.

Description of this system for implementers, leaders, or executives.

Application Plan: Build from Composition

Table of Contents

  1. Overview
  2. Core Concepts
  3. Component Specifications
  4. Integration Contracts
  5. Implementation Guide

1. Overview

Design Philosophy

This architecture balances off-the-shelf components with custom governance logic, centered on subscriptions as the source of truth for all policy enforcement.

Key Principles:

Why Subscriptions Matter (The “Aha!” Moment)

The Problem with Traditional Governance: In most organizations, “governance” is a PDF document that nobody reads. Security rules are hardcoded in gateway configs, rate limits are guessed, and nobody knows who is using which API. When you need to deprecate an API, you send a mass email and hope for the best.

The Subscription-Centric Solution: By making the Subscription the atomic unit of governance, we solve multiple problems with one elegant concept:

  1. It’s Elegant: A subscription connects a Consumer to a specific API Version in an Environment. This single record holds the “contract” between the two parties.
  2. It Simplifies Everything:
    • Security? Check the subscription scope.
    • Rate Limiting? Check the subscription tier.
    • Auditing? Log the subscription ID.
    • Deprecation? Notify the subscription owners.
  3. It Makes Governance Real: Instead of a policy document saying “You must have approval,” the gateway denies any request that lacks an active subscription. Governance becomes a runtime reality, not a paperwork exercise.

High-Level Architecture

%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#e8f4f8','primaryTextColor':'#000','primaryBorderColor':'#2c5aa0','lineColor':'#2c5aa0','edgeLabelBackground':'#fff','fontSize':'14px'}}}%%
graph TD
    A[Backstage Developer Portal] --> B[API Registry & Governance Core]
    B --> C[Kong Gateway]
    B --> D[Auditor]
    C --> D
    
    style A fill:#e1f5ff,stroke:#0288d1,stroke-width:2px
    style B fill:#fff3e0,stroke:#f57c00,stroke-width:3px
    style C fill:#e1f5ff,stroke:#0288d1,stroke-width:2px
    style D fill:#fff3e0,stroke:#f57c00,stroke-width:3px

Component Roles:

Data Ownership Boundaries

Backstage Owns (Native Catalog):

Registry Owns (Governance Data):

Integration Principle:

Backstage is the UX layer; Registry is the policy engine.

Developers discover APIs in Backstage, but all governance actions (request subscription, approve access, deprecate version) flow through Registry APIs. Backstage plugins make Registry data visible and actionable, but never bypass Registry business logic.


2. Core Concepts

Subscriptions as First-Class Citizens

Subscriptions are the source of truth for policy enforcement. Rather than encoding rules at the API or Gateway level, all authorization and governance decisions derive from subscription metadata.

Subscription Schema
{
  "subscription_id": "uuid",
  
  // Natural composite key—UNIQUE constraint on these four fields
  "consumer_app_id": "uuid",     // WHO is consuming
  "api_id": "uuid",               // WHAT they're consuming  
  "api_version": "2.1.0",         // WHICH version
  "environment": "production",    // WHERE (dev/staging/prod)
  
  "protocol": "openapi",          // openapi | graphql | asyncapi | grpc
  
  "status": "approved|pending|revoked|expired",
  
  "scope": {
    // REST (OpenAPI): endpoints and HTTP methods
    "endpoints": ["/users/{id}", "/users/{id}/orders"],
    "operations": ["GET", "POST"],
    // GraphQL: allowed operations and fields
    "graphql_operations": ["query GetUser", "mutation UpdateUser"],
    "graphql_fields": ["User.id", "User.name", "User.email"],
    // AsyncAPI: channels and message types
    "channels": ["orders.created", "orders.updated"],
    "message_types": ["OrderCreatedEvent", "OrderUpdatedEvent"],
    // Common: field-level access control
    "fields": ["id", "name", "email"]
  },
  
  "purpose": "Customer support dashboard needs read-only access to user profile and order history for troubleshooting",
  
  "data_classification": {
    "max_sensitivity": "confidential",
    "pii_access": true,
    "phi_access": false,
    "pci_access": false
  },
  
  "sla_tier": "gold",
  "rate_limits": {
    "requests_per_second": 100,
    "daily_quota": 1000000,
    "burst_allowance": 150
  },
  
  "throughput_estimate": {
    "peak_rps": 75,
    "avg_rps": 20,
    "justification": "500K active support users, 5% concurrency"
  },
  
  "lifecycle": {
    "approved_by": "api.owner@company.com",
    "approved_at": "2025-11-01T10:00:00Z",
    "expires_at": "2026-11-01T10:00:00Z",
    "auto_renew": true
  },
  
  "compliance": {
    "attestation_required": true,
    "last_review": "2025-10-15",
    "reviewer": "security.team@company.com"
  }
}

Subscription Uniqueness:

The combination of (consumer_app_id, api_id, api_version, environment) forms a natural composite key and must be unique, so at most one subscription exists for a given consumer, API version, and environment.

Gateway Lookup Example:

GET /subscriptions?app_id={app_id}&api_id={api_id}&version={version}&env={env}
→ Returns single subscription record with all policy metadata
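
The uniqueness rule on the four-field key can be sketched with a database constraint. This is a minimal illustration using SQLite; the table layout, the helper names (create_store, add_subscription, find_subscription), and the choice of storage engine are assumptions, not the Registry's actual schema.

```python
import sqlite3

# Illustrative schema: column names mirror the subscription fields above,
# but the DDL itself is an assumption.
DDL = """
CREATE TABLE subscriptions (
    subscription_id TEXT PRIMARY KEY,
    consumer_app_id TEXT NOT NULL,
    api_id          TEXT NOT NULL,
    api_version     TEXT NOT NULL,
    environment     TEXT NOT NULL,
    status          TEXT NOT NULL,
    UNIQUE (consumer_app_id, api_id, api_version, environment)
)
"""

def create_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(DDL)
    return conn

def add_subscription(conn, sub_id, app_id, api_id, version, env, status="pending") -> bool:
    """Insert a subscription; return False if a uniqueness constraint rejects it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO subscriptions VALUES (?, ?, ?, ?, ?, ?)",
                (sub_id, app_id, api_id, version, env, status),
            )
        return True
    except sqlite3.IntegrityError:
        return False

def find_subscription(conn, app_id, api_id, version, env):
    """Return (subscription_id, status) for a composite key, or None."""
    return conn.execute(
        "SELECT subscription_id, status FROM subscriptions "
        "WHERE consumer_app_id = ? AND api_id = ? AND api_version = ? AND environment = ?",
        (app_id, api_id, version, env),
    ).fetchone()
```

With the constraint in place, a second request for the same consumer, API version, and environment is rejected at write time, and the gateway lookup can rely on getting at most one row back.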

The subscription_id (UUID) remains useful as:

Why Subscriptions Drive Policy:

  1. Scoped Access Control: Endpoint/operation scope defines precisely what the consumer can call—Gateway enforces this without hardcoded rules
  2. Purpose-Based Auditing: Purpose string enables audit review (“Is this subscription still being used for its stated intent?”)
  3. Data Classification Enforcement: Gateway/OPA checks subscription’s max_sensitivity against API endpoint data classification
  4. Dynamic Rate Limiting: SLA tier + throughput estimate inform Gateway rate limiting without manual Kong configuration per consumer
  5. Compliance Traceability: Every API call logs subscription_id, linking traffic to approved purpose and data classification
  6. Time-Bounded Authorization: expires_at enables automatic subscription expiration without manual revocation
  7. Chargeback Accuracy: Throughput estimate + SLA tier + actual usage = chargeback calculation

Policy Derivation Examples:

# OPA Policy: Check if endpoint is in subscription scope
allow {
  input.subscription.scope.endpoints[_] == input.request.path
  input.subscription.scope.operations[_] == input.request.method
  input.subscription.status == "approved"
  time.now_ns() < time.parse_rfc3339_ns(input.subscription.lifecycle.expires_at)
}

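# OPA Policy: Data classification check
# classification_level is a helper (not shown) that maps a label to a numeric rank.
# If this check and the scope check share a package, put both in one rule body:
# separate `allow` rules are OR'd, so either one alone would grant access.
allow {
  api_classification := data.apis[input.subscription.api_id].classification
  subscription_clearance := input.subscription.data_classification.max_sensitivity
  classification_level(api_classification) <= classification_level(subscription_clearance)
}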
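
The classification comparison depends on an ordering of sensitivity labels that the policy does not spell out. A minimal sketch of that ordering follows; only "internal" and "confidential" appear in this document, so the other labels and the numeric ranks are assumptions.

```python
# Assumed ordering of sensitivity levels; "public" and "restricted" are placeholders.
CLASSIFICATION_LEVELS = {"public": 0, "internal": 1, "confidential": 2, "restricted": 3}

def classification_level(name: str) -> int:
    """Map a classification label to a comparable rank; unknown labels raise."""
    try:
        return CLASSIFICATION_LEVELS[name]
    except KeyError:
        raise ValueError(f"unknown classification: {name!r}")

def clearance_allows(api_classification: str, max_sensitivity: str) -> bool:
    """True when the subscription's clearance is at least the API's classification."""
    return classification_level(api_classification) <= classification_level(max_sensitivity)
```

Raising on an unknown label, rather than defaulting to a rank, keeps a typo in the catalog from silently widening access.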

Policy Engine Boundary

Externalize Policy Decisions (OPA/Cedar):

To avoid hard-coding business rules in Kong plugins, all policy decisions are delegated to an external policy engine:

Policy Engine Integration:

Architecture: Open Source Kong + Custom Policy Service

We’re using Open Source Kong with a custom policy engine service to maximize flexibility and cost-effectiveness:

Policy Engine Service:

Authorization Check Flow:

  1. Request arrives at Kong Gateway with API key or JWT
  2. Custom Kong plugin extracts: consumer_app_id, api_id, version, environment, request.method, request.path
  3. Plugin calls Registry API: POST /subscriptions/check with extracted metadata
  4. Registry queries database for subscription using composite key (consumer_app_id, api_id, version, environment)
  5. Registry calls OPA with subscription metadata + request context:
    {
      "input": {
        "subscription": {...subscription_record...},
        "request": {"method": "GET", "path": "/users/123", "headers": {...}},
        "api": {...api_metadata...}
      }
    }
    
  6. OPA evaluates policies, returns decision with reason:
    {
      "allow": true,
      "reason": "subscription_active_and_scoped",
      "headers": {"X-RateLimit-Remaining": "950"},
      "ttl": 30
    }
    
  7. Registry returns decision to Kong plugin with cache TTL
  8. Kong allows/denies request, adds custom headers if specified
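
Steps 4 through 7 of this flow can be sketched as a single function on the Registry side. The in-memory dictionary stands in for the database, the policy callback stands in for the OPA call, and the reason codes "subscription_not_found" and "subscription_active" are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

CompositeKey = Tuple[str, str, str, str]  # (consumer_app_id, api_id, version, environment)

@dataclass
class Decision:
    allow: bool
    reason: str
    subscription_id: Optional[str] = None
    ttl: int = 30  # seconds the gateway may cache the decision (value from the example)

def check_subscription(
    store: Dict[CompositeKey, dict],
    policy: Callable[[dict, dict], Tuple[bool, str]],
    check_request: dict,
) -> Decision:
    """Look up the subscription by composite key, evaluate policy, return a decision."""
    key = (
        check_request["consumer_app_id"],
        check_request["api_id"],
        check_request["version"],
        check_request["environment"],
    )
    subscription = store.get(key)
    if subscription is None:
        return Decision(allow=False, reason="subscription_not_found")
    allow, reason = policy(subscription, check_request["request"])
    return Decision(allow, reason, subscription["subscription_id"])

def status_only_policy(subscription: dict, request: dict) -> Tuple[bool, str]:
    """Stand-in for the policy engine; it checks status only."""
    if subscription["status"] != "approved":
        return False, "subscription_not_approved"
    return True, "subscription_active"
```

Passing the policy as a callback keeps the lookup logic independent of whichever engine (OPA, Cedar, or a test stub) makes the final decision.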

Custom Kong Plugin Requirements:

Build a Lua plugin (kong/plugins/registry-auth) that:

Policy Management Workflow:

Example Policy (Rego):

package authz

default allow = false

# Allow if subscription is active and not expired
allow {
  input.subscription.status == "approved"
  time.now_ns() < time.parse_rfc3339_ns(input.subscription.lifecycle.expires_at)
  endpoint_in_scope
  method_in_scope
}

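# Exact string comparison: a templated scope entry such as /users/{id} only
# matches a concrete path like /users/123 if the path is normalized first.
endpoint_in_scope {
  input.subscription.scope.endpoints[_] == input.request.path
}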

method_in_scope {
  input.subscription.scope.operations[_] == input.request.method
}

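# Generate reason for deny. The else chain keeps `reason` single-valued when a
# subscription is both unapproved and expired; two independent complete rules
# would produce conflicting values in that case.
reason = "subscription_not_approved" {
  not input.subscription.status == "approved"
} else = "subscription_expired" {
  not time.now_ns() < time.parse_rfc3339_ns(input.subscription.lifecycle.expires_at)
}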
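
The policy compares the request path to scope entries by exact equality, so a templated entry such as /users/{id} would not match the concrete path /users/123 from the earlier example input. The sketch below mirrors the same checks in Python and adds template matching; the helper names and the reason codes for scope failures are hypothetical, and this is an illustration of the logic rather than a replacement for the Rego policy.

```python
import re
from datetime import datetime, timezone
from typing import Optional, Tuple

def template_to_regex(template: str) -> "re.Pattern[str]":
    """Turn '/users/{id}' into a regex where each {param} matches one path segment."""
    parts = re.split(r"\{[^/{}]+\}", template)
    return re.compile("^" + "[^/]+".join(re.escape(p) for p in parts) + "$")

def endpoint_in_scope(path: str, endpoints: list) -> bool:
    return any(template_to_regex(t).match(path) for t in endpoints)

def evaluate(subscription: dict, request: dict,
             now: Optional[datetime] = None) -> Tuple[bool, str]:
    """Check status, expiry, endpoint scope, and method scope, in that order."""
    now = now or datetime.now(timezone.utc)
    if subscription["status"] != "approved":
        return False, "subscription_not_approved"
    expires_at = datetime.fromisoformat(
        subscription["lifecycle"]["expires_at"].replace("Z", "+00:00")
    )
    if now >= expires_at:
        return False, "subscription_expired"
    scope = subscription["scope"]
    if not endpoint_in_scope(request["path"], scope["endpoints"]):
        return False, "endpoint_out_of_scope"  # hypothetical reason code
    if request["method"] not in scope["operations"]:
        return False, "method_out_of_scope"    # hypothetical reason code
    return True, "subscription_active_and_scoped"
```

Returning the first failing reason gives the caller exactly one explanation per denial, which matches the single `reason` field in the decision payload.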

Caching Strategy:

To minimize latency impact while maintaining security:

Failure Mode Policy:

Gateway behavior when policy engine is unreachable:


3. Component Specifications

3.1 API Registry & Governance Core (Custom Build)

Responsibilities:

Key APIs:

| Endpoint | Purpose |
| --- | --- |
| POST /subscriptions/check | Authorization check called by Gateway on every request |
| GET /apis/{id}/versions | Retrieve all versions of an API |
| POST /subscriptions | Request new subscription |
| PATCH /subscriptions/{id} | Approve/revoke subscription |
| PATCH /apis/{id}/versions/{version}/lifecycle | Update lifecycle state (deprecate, retire) |
| GET /subscriptions?consumer_app_id={id} | List subscriptions for a consumer |

Technology Stack:

3.2 Kong Gateway (Configure)

Responsibilities:

URL Structure by Protocol

URL patterns vary by API protocol. The gateway handles all three first-class protocols:

REST APIs (OpenAPI):

https://{environment}-gateway.company.com/{api-slug}/{version}/{resource-path}

Examples:
GET  https://prod-gateway.company.com/users-api/v2/users/{user_id}
POST https://prod-gateway.company.com/orders-api/v3/orders
GET  https://staging-gateway.company.com/products-api/v1/products?category=electronics

GraphQL APIs:

https://{environment}-gateway.company.com/{api-slug}/{version}/graphql

Examples:
POST https://prod-gateway.company.com/orders-graphql/v2/graphql
     Body: {"query": "query GetOrder($id: ID!) { order(id: $id) { ... } }"}

GraphQL APIs expose a single endpoint; versioning applies to the schema, not individual endpoints.

AsyncAPI (Event-Driven):

wss://{environment}-gateway.company.com/{api-slug}/{version}/events
kafka://{environment}-broker.company.com/{channel-name}

Examples:
wss://prod-gateway.company.com/order-events/v2/events
kafka://prod-broker.company.com/orders.created.v2

Event-driven APIs use WebSocket connections or message broker topics. The gateway validates subscriptions before allowing channel subscriptions.
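
For GraphQL and event-driven APIs, the scope check works on operations, fields, channels, and message types instead of URL paths. A minimal sketch using the scope field names from the subscription schema; the function names are hypothetical, and how the gateway extracts the operation and field list from a query is outside this example.

```python
def graphql_in_scope(scope: dict, operation: str, fields: list) -> bool:
    """Allow a GraphQL request only if the operation and every field are in scope."""
    return (
        operation in scope.get("graphql_operations", [])
        and set(fields) <= set(scope.get("graphql_fields", []))
    )

def channel_in_scope(scope: dict, channel: str, message_type: str) -> bool:
    """Allow a channel subscription only if channel and message type are in scope."""
    return (
        channel in scope.get("channels", [])
        and message_type in scope.get("message_types", [])
    )
```

A missing scope list is treated as empty, so a subscription that never declared GraphQL or channel access is denied by default.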

Four-Field Composite Key Extraction:

  1. Environment: Derived from gateway cluster hostname (prod-gateway → production, staging-gateway → staging)
  2. API ID: Mapped from URL slug (/users-api/) to Registry UUID via Kong route configuration
  3. Version: Explicit in URL path (/v2/) or topic name (.v2)
  4. Consumer App ID: Extracted from JWT bearer token claims (app_id field)
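
The four extraction rules can be sketched as one function. The lookup tables below are hypothetical stand-ins for what Kong route configuration would provide, and the JWT claims are assumed to be already verified by the time this runs.

```python
from typing import NamedTuple
from urllib.parse import urlsplit

class CompositeKey(NamedTuple):
    consumer_app_id: str
    api_id: str
    version: str
    environment: str

# Hypothetical lookup tables; in the design above these live in Kong route config.
ENVIRONMENTS = {"prod-gateway": "production", "staging-gateway": "staging"}
API_IDS = {"users-api": "550e8400-e29b-41d4-a716-446655440000"}

def extract_key(url: str, jwt_claims: dict) -> CompositeKey:
    """Derive the four key fields from a request URL and verified JWT claims."""
    parts = urlsplit(url)
    host_prefix = parts.hostname.split(".", 1)[0]         # e.g. "prod-gateway"
    slug, version = parts.path.strip("/").split("/")[:2]  # e.g. "users-api", "v2"
    return CompositeKey(
        consumer_app_id=jwt_claims["app_id"],
        api_id=API_IDS[slug],
        version=version,
        environment=ENVIRONMENTS[host_prefix],
    )
```

Note that the URL carries a major version label such as v2, while subscriptions store a full version string such as 2.1.0, so a real implementation needs a mapping between the two.
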

Kong Route Configuration

Each API version gets its own Kong service and route with registry metadata:

REST API Example:

services:
  - name: users-api-v2-prod
    url: http://users-service-v2.internal:8080
    routes:
      - name: users-api-v2-route
        paths:
          - /users-api/v2
        strip_path: true
    plugins:
      - name: registry-auth
        config:
          api_id: "550e8400-e29b-41d4-a716-446655440000"
          version: "2.0"
          protocol: "openapi"
          environment: "production"
          registry_url: "http://registry-service:8080"

GraphQL API Example:

services:
  - name: orders-graphql-v2-prod
    url: http://orders-graphql-service-v2.internal:8080/graphql
    routes:
      - name: orders-graphql-v2-route
        paths:
          - /orders-graphql/v2/graphql
        methods:
          - POST
    plugins:
      - name: registry-auth
        config:
          api_id: "661e8400-e29b-41d4-a716-446655440001"
          version: "2.0"
          protocol: "graphql"
          environment: "production"
          registry_url: "http://registry-service:8080"
      - name: graphql-rate-limiting  # assumed custom plugin; not bundled with open source Kong
        config:
          max_cost_per_request: 1000  # Limit query complexity

GraphQL-Specific Considerations:

Custom Plugin Implementation

Build a Lua plugin (kong/plugins/registry-auth) with the following behavior:

  1. Validate JWT: Verify bearer token signature and extract app_id claim
  2. Build composite key: Combine consumer_app_id (from JWT), api_id (from route config), version (from route config), environment (from route config)
  3. Check cache: Look up authorization decision in Kong shared memory using composite key
  4. Call Registry: If cache miss, call POST /subscriptions/check with request metadata
  5. Cache response: Store decision for TTL seconds (returned by Registry)
  6. Enforce policy: Allow/deny request based on decision, inject headers for rate limits and subscription ID
  7. Log decision: Emit structured log with subscription_id, policy_decision, and request metadata for Auditor

Failure handling: Default to fail-closed (deny) if Registry is unreachable, with per-API override for fail-open in non-production environments.
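
The cache lookup (steps 3 to 5) and the failure-handling rule can be sketched together. The class below is a minimal illustration, not the Lua plugin: the fetch callback stands in for the Registry call, ConnectionError stands in for "Registry unreachable", and the reason code "registry_unreachable" is hypothetical.

```python
import time
from typing import Callable, Dict, Tuple

class DecisionCache:
    """TTL cache for authorization decisions with a configurable failure mode."""

    def __init__(self, fetch: Callable[[tuple], dict], fail_open: bool = False,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch          # calls the Registry; raises ConnectionError when unreachable
        self._fail_open = fail_open  # assumption: enabled only for non-production APIs
        self._clock = clock
        self._entries: Dict[tuple, Tuple[float, dict]] = {}

    def decide(self, key: tuple) -> dict:
        now = self._clock()
        cached = self._entries.get(key)
        if cached and cached[0] > now:
            return cached[1]  # cache hit, still within TTL
        try:
            decision = self._fetch(key)
        except ConnectionError:
            # Registry unreachable: deny unless this API opted into fail-open
            return {"allow": self._fail_open, "reason": "registry_unreachable"}
        self._entries[key] = (now + decision.get("ttl", 0), decision)
        return decision
```

One design point to check: if decisions depend on the request method and path, those values need to be part of the cache key as well, otherwise a decision cached for one endpoint would be reused for another.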

Advanced Optimization: In-Memory Subscription Store

The “Autonomous Gateway” Pattern:

Instead of calling the Registry on every request (or relying on short-lived caches), we can load the entire active subscription dataset into Kong’s shared memory. Keeping authorization data local to the proxy is a common pattern for latency-sensitive gateways, because each lookup avoids a network round trip.

Why This Works:

Implementation Pattern:

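-- 1. Kong configuration (kong.conf) declares the shared dictionary:
--    nginx_http_lua_shared_dict = subscriptions 512m

-- 2. Plugin initialization (init_worker phase)
local http = require "resty.http"
local cjson = require "cjson.safe"

local EXPORT_URL = "http://registry:8080/subscriptions/export"

local function sync_subscriptions(premature)
  if premature then return end -- worker is shutting down

  local res, err = http.new():request_uri(EXPORT_URL, { method = "GET" })
  if not res or res.status ~= 200 then
    ngx.log(ngx.ERR, "subscription sync failed: ", err or res.status)
    return -- keep serving the last good snapshot
  end

  local data = cjson.decode(res.body)
  if not data or type(data.subscriptions) ~= "table" then
    ngx.log(ngx.ERR, "subscription sync: malformed export payload")
    return
  end

  -- Compare the export's own version field; a hash of the raw body would
  -- change on every poll because the body includes exported_at
  local store = ngx.shared.subscriptions
  if data.version == store:get("version_hash") then return end

  local seen = {}
  for _, sub in ipairs(data.subscriptions) do
    local key = sub.consumer_app_id .. ":" .. sub.api_id .. ":" .. sub.api_version .. ":" .. sub.environment
    store:set(key, cjson.encode(sub))
    seen[key] = true
  end

  -- Remove entries missing from the export (revoked or expired subscriptions)
  for _, key in ipairs(store:get_keys(0)) do
    if key ~= "version_hash" and not seen[key] then
      store:delete(key)
    end
  end

  store:set("version_hash", data.version)
  ngx.log(ngx.INFO, "Synced ", #data.subscriptions, " subscriptions")
end

-- The dictionary is shared by all workers, so a single worker runs the sync
if ngx.worker.id() == 0 then
  ngx.timer.at(0, sync_subscriptions)      -- initial load
  ngx.timer.every(30, sync_subscriptions)  -- then poll every 30 seconds
end

-- 3. Authorization check (access phase)
local key = consumer_app_id .. ":" .. api_id .. ":" .. version .. ":" .. environment
local sub_json = ngx.shared.subscriptions:get(key)

if not sub_json then
  return kong.response.exit(403, { message = "No subscription found" })
end

local subscription = cjson.decode(sub_json)
-- Enforce policy using in-memory data (no network round trip)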

Registry Support:

Add a new endpoint to Registry:

GET /subscriptions/export
Returns:
{
  "version": "abc123hash",
  "exported_at": "2025-11-22T10:00:00Z",
  "subscriptions": [
    {...full subscription record...},
    {...full subscription record...}
  ]
}
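
Because the payload carries exported_at, a hash of the raw response body would change on every export even when no subscription changed. One way around that is to derive the version field from the records alone, as in the sketch below; the function name, the sort order, and the choice of SHA-256 are assumptions.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Iterable

def build_export(subscriptions: Iterable[dict]) -> dict:
    """Build the export payload; `version` is a hash of the canonical record list."""
    records = sorted(subscriptions, key=lambda s: s["subscription_id"])
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return {
        "version": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
        "exported_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "subscriptions": records,
    }
```

Sorting the records and serializing with sorted keys makes the hash independent of database row order, so the gateway can skip a sync whenever the version is unchanged.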

Push Invalidation (Optional Enhancement):

For critical operations (like emergency subscription revocation), Registry can push invalidation webhooks to Kong:

POST https://gateway.company.com/_admin/invalidate
{
  "subscription_id": "uuid",
  "action": "revoke"
}

Gateway immediately removes that subscription from shared memory, without waiting for the next sync cycle.
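
The webhook identifies a subscription by subscription_id, while the gateway's store is keyed by the four-field composite key, so the store needs a secondary index to find the entry to drop. A minimal sketch of such a store, written in Python for illustration; the class and method names are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple

Key = Tuple[str, str, str, str]  # (consumer_app_id, api_id, api_version, environment)

class SubscriptionStore:
    """In-memory snapshot keyed by composite key, with an index by subscription_id."""

    def __init__(self) -> None:
        self._by_key: Dict[Key, dict] = {}
        self._key_by_id: Dict[str, Key] = {}

    def load_snapshot(self, subscriptions: Iterable[dict]) -> None:
        """Replace the whole dataset, so records absent from the export disappear."""
        by_key, key_by_id = {}, {}
        for sub in subscriptions:
            key = (sub["consumer_app_id"], sub["api_id"],
                   sub["api_version"], sub["environment"])
            by_key[key] = sub
            key_by_id[sub["subscription_id"]] = key
        self._by_key, self._key_by_id = by_key, key_by_id

    def get(self, key: Key) -> Optional[dict]:
        return self._by_key.get(key)

    def invalidate(self, subscription_id: str) -> bool:
        """Handle a revoke webhook; return True if a record was removed."""
        key = self._key_by_id.pop(subscription_id, None)
        if key is None:
            return False
        self._by_key.pop(key, None)
        return True
```

Building the new maps first and swapping them in at the end means readers never observe a half-loaded snapshot.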

Tradeoffs:

When to Use This:

3.3 Backstage Developer Portal (Customize/Extend)

Responsibilities:

Backstage’s Native Entity Model

Core Entities:

What Backstage Lacks (Registry Must Provide):

  1. API Versioning Model: Backstage treats users-api-v2 and users-api-v3 as separate entities
  2. Subscription Management: No concept of Component → API subscriptions with approval workflows
  3. Environment Isolation: No distinction between dev/staging/production
  4. Scope/Authorization Metadata: No fields for endpoint scopes, rate limits, data classification
  5. Deprecation Workflows: Lifecycle is just a label; no enforcement or transition timelines
  6. Compliance/Purpose Tracking: No purpose strings, attestations, or compliance metadata

Custom Backstage Plugins Required

Build these React components to surface Registry data:

  1. <ApiVersionsCard>: Displays version table with maturity badges
  2. <SubscriptionRequestButton>: One-click subscription workflow
  3. <MySubscriptionsTab>: Component-level view of API dependencies
  4. <DeprecationTimeline>: Visual countdown for deprecated APIs
  5. <ComplianceWidget>: Dashboard for security/compliance teams

Backstage Catalog YAML Examples

The platform supports three first-class API protocols. Each uses the same governance workflow but with protocol-appropriate specifications:

REST API (OpenAPI):

apiVersion: backstage.io/v1alpha1
kind: API
metadata:
  name: users-api
  description: User management and authentication
  annotations:
    registry.io/api-id: "550e8400-e29b-41d4-a716-446655440000"
    registry.io/classification: "confidential"
    registry.io/protocol: "openapi"
spec:
  type: openapi
  lifecycle: production
  owner: platform-team
  definition:
    $text: https://github.com/company/users-api/blob/main/openapi.yaml

GraphQL API:

apiVersion: backstage.io/v1alpha1
kind: API
metadata:
  name: orders-graphql
  description: Order queries and mutations for mobile and web clients
  annotations:
    registry.io/api-id: "661e8400-e29b-41d4-a716-446655440001"
    registry.io/classification: "internal"
    registry.io/protocol: "graphql"
spec:
  type: graphql
  lifecycle: production
  owner: commerce-team
  definition:
    $text: https://github.com/company/orders-graphql/blob/main/schema.graphql

Event-Driven API (AsyncAPI):

apiVersion: backstage.io/v1alpha1
kind: API
metadata:
  name: order-events
  description: Order lifecycle events for downstream consumers
  annotations:
    registry.io/api-id: "772e8400-e29b-41d4-a716-446655440002"
    registry.io/classification: "internal"
    registry.io/protocol: "asyncapi"
spec:
  type: asyncapi
  lifecycle: production
  owner: commerce-team
  definition:
    $text: https://github.com/company/order-events/blob/main/asyncapi.yaml

3.4 Auditor (Custom Build)

Responsibilities:

Data Sources:

Key Metrics:


4. Integration Contracts

4.1 Registry ↔ Gateway

Authorization Check API:

POST /subscriptions/check
Request:
{
  "consumer_app_id": "uuid",
  "api_id": "uuid",
  "version": "2.1.0",
  "environment": "production",
  "request": {
    "method": "GET",
    "path": "/users/123",
    "headers": {...}
  }
}

Response:
{
  "allow": true,
  "reason": "subscription_active_and_scoped",
  "subscription_id": "uuid",
  "scopes": ["users:read"],
  "rate_limits": {
    "requests_per_second": 100,
    "burst_allowance": 150
  },
  "headers": {
    "X-RateLimit-Remaining": "950"
  },
  "ttl": 30
}

Event Hooks (Registry → Gateway):

| Event | Trigger | Gateway Action |
| --- | --- | --- |
| subscription.revoked | Subscription revoked | Immediately invalidate cache for that subscription |
| version.published | New API version available | Update routing configuration |
| api.deprecated | API marked deprecated | Inject deprecation warnings in response headers |
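
On the gateway side, these hooks amount to a small dispatch table from event name to action. In the sketch below the event payload fields (type, subscription_id, api_id, version) are assumptions, and each action only records what it would do.

```python
from typing import Callable, Dict, List

class GatewayEventHandler:
    """Routes Registry events to gateway actions; actions here only record intent."""

    def __init__(self) -> None:
        self.actions: List[str] = []
        self._handlers: Dict[str, Callable[[dict], None]] = {
            "subscription.revoked": self._invalidate_cache,
            "version.published": self._update_routing,
            "api.deprecated": self._add_deprecation_header,
        }

    def handle(self, event: dict) -> bool:
        """Dispatch one event; unknown event types are ignored and reported as False."""
        handler = self._handlers.get(event.get("type", ""))
        if handler is None:
            return False
        handler(event)
        return True

    def _invalidate_cache(self, event: dict) -> None:
        self.actions.append(f"invalidate:{event['subscription_id']}")

    def _update_routing(self, event: dict) -> None:
        self.actions.append(f"route:{event['api_id']}:{event['version']}")

    def _add_deprecation_header(self, event: dict) -> None:
        self.actions.append(f"deprecation-header:{event['api_id']}")
```

Reporting unknown event types instead of raising lets the Registry add new events without breaking gateways that have not been updated yet.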

4.2 Gateway ↔ Auditor

Golden Log Envelope:

Gateway emits structured log for every API call. The schema adapts to the protocol:

REST (OpenAPI) Log:

{
  "trace_id": "uuid",
  "subscription_id": "uuid",
  "api_id": "uuid",
  "version": "2.1.0",
  "protocol": "openapi",
  "environment": "production",
  "route": "/users/{id}",
  "verb": "GET",
  "status": 200,
  "latency_ms": 45,
  "size_bytes": 1024,
  "policy_decision": "allow",
  "error_class": null
}

GraphQL Log:

{
  "trace_id": "uuid",
  "subscription_id": "uuid",
  "api_id": "uuid",
  "version": "2.1.0",
  "protocol": "graphql",
  "environment": "production",
  "operation_name": "GetOrder",
  "operation_type": "query",
  "fields_accessed": ["order.id", "order.status", "order.lineItems"],
  "query_complexity": 42,
  "status": 200,
  "latency_ms": 78,
  "size_bytes": 2048,
  "policy_decision": "allow",
  "error_class": null
}

AsyncAPI (Event) Log:

{
  "trace_id": "uuid",
  "subscription_id": "uuid",
  "api_id": "uuid",
  "version": "2.1.0",
  "protocol": "asyncapi",
  "environment": "production",
  "channel": "orders.created",
  "message_type": "OrderCreatedEvent",
  "action": "publish",
  "size_bytes": 512,
  "policy_decision": "allow",
  "error_class": null
}
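
Since the envelope varies by protocol, the Auditor (or a gateway-side test) can check each entry against the field set for its protocol. A minimal sketch, with the field sets read off the three examples above; the function name and the split into common and protocol-specific fields are assumptions.

```python
COMMON_FIELDS = {
    "trace_id", "subscription_id", "api_id", "version", "protocol",
    "environment", "size_bytes", "policy_decision", "error_class",
}
PROTOCOL_FIELDS = {
    "openapi": {"route", "verb", "status", "latency_ms"},
    "graphql": {"operation_name", "operation_type", "fields_accessed",
                "query_complexity", "status", "latency_ms"},
    "asyncapi": {"channel", "message_type", "action"},
}

def missing_fields(entry: dict) -> set:
    """Return the field names the envelope lacks for its protocol."""
    protocol = entry.get("protocol")
    if protocol not in PROTOCOL_FIELDS:
        raise ValueError(f"unsupported protocol: {protocol!r}")
    return (COMMON_FIELDS | PROTOCOL_FIELDS[protocol]) - set(entry)
```

An empty result means the entry is complete; extra fields are tolerated so the schema can grow without rejecting older consumers.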

Auditor ingests these logs for:

4.3 Backstage ↔ Registry

Read Operations

| Backstage View | Registry API Call | Purpose |
| --- | --- | --- |
| API Catalog Page | GET /apis/{id}/versions | Display available versions, maturity, lifecycle state |
| API Version Detail | GET /apis/{id}/versions/{version} | Show version metadata, deprecation status |
| My Subscriptions | GET /subscriptions?consumer_app_id={id} | List all APIs this app consumes |
| Subscription Detail | GET /subscriptions/{id} | Show approval status, scope, rate limits |
| Deprecation Dashboard | GET /apis?lifecycle=deprecated | List deprecated APIs with sunset timelines |
| Compliance Report | GET /subscriptions?data_classification=pii | Audit PII access |

Write Operations

| User Action in Backstage | Registry API Call | Outcome |
| --- | --- | --- |
| “Request Subscription” | POST /subscriptions | Creates pending subscription, notifies API owner |
| “Approve Subscription” | PATCH /subscriptions/{id} | Updates status to approved, triggers Gateway sync |
| “Mark API Deprecated” | PATCH /apis/{id}/versions/{version}/lifecycle | Sets state to deprecated, starts sunset countdown |
| “Publish New Version” | POST /apis/{id}/versions | Creates new version, triggers CI/CD |
| “Revoke Subscription” | DELETE /subscriptions/{id} | Soft deletes subscription, invalidates Gateway cache |

Caching Strategy

To minimize Registry API calls:

Security/Authorization

5. Implementation Guide

Implementation Phases

Phase 1 (Months 1-3): Foundation

Phase 2 (Months 4-6): Governance

Phase 3 (Months 7-9): Analytics & Polish


Appendix: Entity Mapping

Backstage ↔ Registry Entity Mapping

| Backstage Concept | Registry Equivalent | Notes |
| --- | --- | --- |
| Component (e.g., checkout-service) | Consumer App | 1:1 mapping via metadata.annotations['registry.io/app-id'] |
| API entity | API + all its versions | Backstage API = umbrella; Registry API = container for versions |
| N/A (separate API entities) | API Version | Key difference: Registry has first-class versioning |
| spec.lifecycle (label only) | lifecycle_state enum | Registry enforces transitions; Backstage just displays |
| N/A | Subscription | Backstage has no native concept - purely Registry domain |
| metadata.annotations | Custom governance metadata | Store registry.io/api-id, registry.io/classification, etc. |
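
The annotation-based mapping can be sketched as a small helper that pulls the registry.io/* values out of a Backstage entity. The entity is represented here as a plain dictionary (as parsed from catalog YAML), and the function name is hypothetical.

```python
def registry_refs(entity: dict) -> dict:
    """Extract registry.io/* annotations from a Backstage entity, keyed without the prefix."""
    annotations = entity.get("metadata", {}).get("annotations") or {}
    prefix = "registry.io/"
    return {
        name[len(prefix):]: value
        for name, value in annotations.items()
        if name.startswith(prefix)
    }
```

A plugin could use the resulting api-id or app-id to call the Registry, leaving all other annotations untouched.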