# ADR-014: SecretumVault Integration for Secrets Management
## Status
**Accepted** - 2025-01-08
## Context
The provisioning system manages sensitive data across multiple infrastructure layers: cloud provider credentials, database passwords, API keys, SSH keys, encryption keys, and service tokens. The current security architecture (ADR-009) includes SOPS for encrypted config files and Age for key management, but lacks a centralized secrets management solution with dynamic secrets, access control, and audit logging.
### Current Secrets Management Challenges
**Existing Approach**:
1. **SOPS + Age**: Static secrets encrypted in config files
   - Good: Version-controlled, gitops-friendly
   - Limited: Manual rotation, no audit trail, manual key distribution
2. **Nickel Configuration**: Declarative secrets references
   - Good: Type-safe configuration
   - Limited: Cannot generate dynamic secrets, no lifecycle management
3. **Manual Secret Injection**: Environment variables, CLI flags
   - Good: Simple for development
   - Limited: No security guarantees, prone to leakage
### Problems Without Centralized Secrets Management
**Security Issues**:
- ❌ No centralized audit trail (who accessed which secret when)
- ❌ No automatic secret rotation policies
- ❌ No fine-grained access control (Cedar policies not enforced on secrets)
- ❌ Secrets scattered across: SOPS files, env vars, config files, K8s secrets
- ❌ No detection of secret sprawl or leaked credentials
**Operational Issues**:
- ❌ Manual secret rotation (error-prone, often neglected)
- ❌ No secret versioning (cannot rollback to previous credentials)
- ❌ Difficult onboarding (manual key distribution)
- ❌ No dynamic secrets (credentials exist indefinitely)
**Compliance Issues**:
- ❌ Cannot prove compliance with secret access policies
- ❌ No audit logs for regulatory requirements
- ❌ Cannot enforce secret expiration policies
- ❌ Difficult to demonstrate least-privilege access
### Use Cases Requiring Centralized Secrets Management
1. **Dynamic Database Credentials**:
   - Generate short-lived DB credentials for applications
   - Automatic rotation based on policies
   - Revocation on application termination
2. **Cloud Provider API Keys**:
   - Centralized storage with access control
   - Audit trail of credential usage
   - Automatic rotation schedules
3. **Service-to-Service Authentication**:
   - Dynamic tokens for microservices
   - Short-lived certificates for mTLS
   - Automatic renewal before expiration
4. **SSH Key Management**:
   - Temporal SSH keys (ADR-009 SSH integration)
   - Centralized certificate authority
   - Audit trail of SSH access
5. **Encryption Key Management**:
   - Master encryption keys for data at rest
   - Key rotation and versioning
   - Integration with KMS systems
### Requirements for Secrets Management System
- **Dynamic Secrets**: Generate credentials on-demand with TTL
- **Access Control**: Integration with Cedar authorization policies
- **Audit Logging**: Complete trail of secret access and modifications
- **Secret Rotation**: Automatic and manual rotation policies
- **Versioning**: Track secret versions, enable rollback
- **High Availability**: Distributed, fault-tolerant architecture
- **Encryption at Rest**: AES-256-GCM for stored secrets
- **API-First**: RESTful API for integration
- **Plugin Ecosystem**: Extensible backends (AWS, Azure, databases)
- **Open Source**: Self-hosted, no vendor lock-in
## Decision
Integrate **SecretumVault** as the centralized secrets management system for the provisioning platform.
### Architecture Diagram
```
┌─────────────────────────────────────────────────────────────┐
│  Provisioning CLI / Orchestrator / Services                 │
│                                                             │
│  - Workspace initialization (credentials)                   │
│  - Infrastructure deployment (cloud API keys)               │
│  - Service configuration (database passwords)               │
│  - SSH temporal keys (certificate generation)               │
└────────────┬────────────────────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────────────────────┐
│  SecretumVault Client Library (Rust)                        │
│  (provisioning/core/libs/secretum-client/)                  │
│                                                             │
│  - Authentication (token, mTLS)                             │
│  - Secret CRUD operations                                   │
│  - Dynamic secret generation                                │
│  - Lease renewal and revocation                             │
│  - Policy enforcement                                       │
└────────────┬────────────────────────────────────────────────┘
             │ HTTPS + mTLS
             ▼
┌─────────────────────────────────────────────────────────────┐
│  SecretumVault Server                                       │
│  (Rust-based Vault implementation)                          │
│                                                             │
│    ┌───────────────────────────────────────────────────┐    │
│    │  API Layer (REST + gRPC)                          │    │
│    ├───────────────────────────────────────────────────┤    │
│    │  Authentication & Authorization                   │    │
│    │  - Token auth, mTLS, OIDC integration             │    │
│    │  - Cedar policy enforcement                       │    │
│    ├───────────────────────────────────────────────────┤    │
│    │  Secret Engines                                   │    │
│    │  - KV (key-value v2 with versioning)              │    │
│    │  - Database (dynamic credentials)                 │    │
│    │  - SSH (certificate authority)                    │    │
│    │  - PKI (X.509 certificates)                       │    │
│    │  - Cloud Providers (AWS/Azure/OCI)                │    │
│    ├───────────────────────────────────────────────────┤    │
│    │  Storage Backend                                  │    │
│    │  - Encrypted storage (AES-256-GCM)                │    │
│    │  - PostgreSQL / Raft cluster                      │    │
│    ├───────────────────────────────────────────────────┤    │
│    │  Audit Backend                                    │    │
│    │  - Structured logging (JSON)                      │    │
│    │  - Syslog, file, database sinks                   │    │
│    └───────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────────────────────┐
│  Backends (Dynamic Secret Generation)                       │
│                                                             │
│  - PostgreSQL/MySQL (database credentials)                  │
│  - AWS IAM (temporary access keys)                          │
│  - Azure AD (service principals)                            │
│  - SSH CA (signed certificates)                             │
│  - PKI (X.509 certificates)                                 │
└─────────────────────────────────────────────────────────────┘
```
### Implementation Characteristics
**SecretumVault Provides**:
- ✅ Dynamic secret generation with configurable TTL
- ✅ Secret versioning and rollback capabilities
- ✅ Fine-grained access control (Cedar policies)
- ✅ Complete audit trail (all operations logged)
- ✅ Automatic secret rotation policies
- ✅ High availability (Raft consensus)
- ✅ Encryption at rest (AES-256-GCM)
- ✅ Plugin architecture for secret backends
- ✅ RESTful and gRPC APIs
- ✅ Rust implementation (performance, safety)
**Integration with Provisioning System**:
- ✅ Rust client library (native integration)
- ✅ Nushell commands via CLI wrapper
- ✅ Nickel configuration references secrets
- ✅ Cedar policies control secret access
- ✅ Orchestrator manages secret lifecycle
- ✅ SSH integration for temporal keys
- ✅ KMS integration for encryption keys
## Rationale
### Why SecretumVault Is Required
| Aspect | SOPS + Age (current) | HashiCorp Vault | SecretumVault (chosen) |
|--------|----------------------|-----------------|------------------------|
| **Dynamic Secrets** | ❌ Static only | ✅ Full support | ✅ Full support |
| **Rust Native** | ⚠️ External CLI | ❌ Go binary | ✅ Pure Rust |
| **Cedar Integration** | ❌ None | ❌ Custom policies | ✅ Native Cedar |
| **Audit Trail** | ❌ Git only | ✅ Comprehensive | ✅ Comprehensive |
| **Secret Rotation** | ❌ Manual | ✅ Automatic | ✅ Automatic |
| **Open Source** | ✅ Yes | ⚠️ No longer (MPL 2.0 until 2023, now BSL 1.1) | ✅ Yes |
| **Self-Hosted** | ✅ Yes | ✅ Yes | ✅ Yes |
| **License** | ✅ Permissive | ⚠️ BSL 1.1 (source-available) | ✅ Permissive |
| **Versioning** | ⚠️ Git commits | ✅ Built-in | ✅ Built-in |
| **High Availability** | ❌ Single file | ✅ Raft cluster | ✅ Raft cluster |
| **Performance** | ✅ Fast (local) | ⚠️ Network latency | ✅ Rust performance |
### Why Not Continue with SOPS Alone?
SOPS is excellent for **static secrets in git**, but inadequate for:
1. **Dynamic Credentials**: Cannot generate temporary DB passwords
2. **Audit Trail**: Git commits are insufficient for compliance
3. **Rotation Policies**: Manual rotation is error-prone
4. **Access Control**: No runtime policy enforcement
5. **Secret Lifecycle**: Cannot track usage or revoke access
6. **Multi-System Integration**: Limited to files, not API-accessible
**Complementary Approach**:
- SOPS: Configuration files with long-lived secrets (gitops workflow)
- SecretumVault: Runtime dynamic secrets, short-lived credentials, audit trail
### Why SecretumVault Over HashiCorp Vault?
**HashiCorp Vault Limitations**:
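1. **License Change**: Moved from MPL 2.0 to the Business Source License (BSL 1.1) in 2023; source-available rather than open source, with restrictions on competing commercial offerings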
2. **Not Rust Native**: Go binary, subprocess overhead
3. **Custom Policy Language**: HCL policies, not Cedar (provisioning standard)
4. **Complex Deployment**: Heavy operational burden
5. **Vendor Lock-In**: HashiCorp ecosystem dependency
**SecretumVault Advantages**:
1. **Rust Native**: Zero-cost integration, no subprocess spawning
2. **Cedar Policies**: Consistent with ADR-008 authorization model
3. **Lightweight**: Smaller binary, lower resource usage
4. **Open Source**: Permissive license, community-driven
5. **Provisioning-First**: Designed for IaC workflows
### Integration with Existing Security Architecture
**ADR-009 (Security System)**:
- SOPS: Static config encryption (unchanged)
- Age: Key management for SOPS (unchanged)
- SecretumVault: Dynamic secrets, runtime access control (new)
**ADR-008 (Cedar Authorization)**:
- Cedar policies control SecretumVault secret access
- Fine-grained permissions: `read:secret:database/prod/password`
- Audit trail records Cedar policy decisions
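The permission string above reads as three colon-separated parts: action, resource type, and secret path. The following is a minimal sketch of splitting such a string before it is handed to a policy check. The `Permission` struct and `parse_permission` function are hypothetical names for illustration, and the three-part layout is inferred from the single example `read:secret:database/prod/password`, not from a documented grammar.

```rust
/// Parsed form of a permission string such as
/// `read:secret:database/prod/password` (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct Permission {
    pub action: String,
    pub resource_type: String,
    pub path: String,
}

/// Splits `action:resource_type:path` into its three parts.
/// Returns `None` when a part is missing or empty.
pub fn parse_permission(s: &str) -> Option<Permission> {
    // `splitn(3, ..)` keeps any further colons inside the path component.
    let mut parts = s.splitn(3, ':');
    let action = parts.next()?;
    let resource_type = parts.next()?;
    let path = parts.next()?;
    if action.is_empty() || resource_type.is_empty() || path.is_empty() {
        return None;
    }
    Some(Permission {
        action: action.to_string(),
        resource_type: resource_type.to_string(),
        path: path.to_string(),
    })
}
```

Rejecting malformed strings early keeps an incomplete permission from being interpreted as a broader grant.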
**SSH Temporal Keys**:
- SecretumVault SSH CA signs user certificates
- Short-lived certificates (1-24 hours)
- Audit trail of SSH access
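The 1-24 hour window for SSH certificates can be enforced on the client side before a signing request is sent. This is a sketch only: `clamp_cert_ttl` and the two constants are hypothetical names, and whether out-of-range requests should be clamped or rejected outright is a design decision this ADR does not settle.

```rust
use std::time::Duration;

/// Lower and upper bounds taken from the 1-24 hour range stated above.
const MIN_CERT_TTL: Duration = Duration::from_secs(60 * 60);
const MAX_CERT_TTL: Duration = Duration::from_secs(24 * 60 * 60);

/// Clamps a requested certificate lifetime into the allowed window
/// (hypothetical helper; a stricter variant could return an error instead).
pub fn clamp_cert_ttl(requested: Duration) -> Duration {
    requested.clamp(MIN_CERT_TTL, MAX_CERT_TTL)
}
```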
## Consequences
### Positive
- **Security Posture**: Centralized secrets with audit trail and rotation
- **Compliance**: Complete audit logs for regulatory requirements
- **Operational Excellence**: Automatic rotation, dynamic credentials
- **Developer Experience**: Simple API for secret access
- **Performance**: Rust implementation, zero-cost abstractions
- **Consistency**: Cedar policies across entire system (auth + secrets)
- **Observability**: Metrics, logs, traces for secret access
- **Disaster Recovery**: Secret versioning enables rollback
### Negative
- **Infrastructure Complexity**: Additional service to deploy and operate
- **High Availability Requirements**: Raft cluster needs 3+ nodes
- **Migration Effort**: Existing SOPS secrets need migration path
- **Learning Curve**: Operators must learn vault concepts
- **Dependency Risk**: Critical path service (secrets unavailable = system down)
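The "3+ nodes" requirement follows from Raft's majority rule: a cluster of `n` nodes needs `n / 2 + 1` of them to agree, so it survives the loss of the remainder. The sketch below shows the arithmetic; the function names are illustrative and not part of any SecretumVault API.

```rust
/// Number of nodes that must agree for a Raft cluster of `nodes` members
/// to make progress (a strict majority).
pub fn quorum(nodes: usize) -> usize {
    nodes / 2 + 1
}

/// Number of node failures the cluster can absorb while keeping a quorum.
pub fn tolerated_failures(nodes: usize) -> usize {
    nodes.saturating_sub(quorum(nodes))
}
```

Three nodes tolerate one failure and five tolerate two; a fourth node adds cost without adding fault tolerance, which is why odd cluster sizes are the usual choice.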
### Mitigation Strategies
**High Availability**:
```bash
# Deploy SecretumVault cluster (3 nodes)
provisioning deploy secretum-vault --ha --replicas 3
# Automatic leader election via Raft
# Clients auto-reconnect to leader
```
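How clients reconnect after a leader change is not specified here. One common approach is capped exponential backoff, sketched below under that assumption. `backoff_delay` and its parameters are hypothetical and do not describe the actual client library behaviour.

```rust
use std::time::Duration;

/// Delay before reconnect attempt number `attempt` (0-based):
/// `base * 2^attempt`, capped at `max`. Hypothetical helper.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // Saturate the multiplier instead of overflowing for large attempt counts.
    let factor = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(max).min(max)
}
```

A production client would normally add random jitter so that many clients do not retry in lockstep after the same failover.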
**Migration from SOPS**:
```bash
# Phase 1: Import existing SOPS secrets into SecretumVault
provisioning secrets migrate --from-sops config/secrets.yaml
# Phase 2: Update Nickel configs to reference vault paths
# Phase 3: Deprecate SOPS for runtime secrets (keep for config files)
```
**Fallback Strategy**:
```rust
// Graceful degradation if vault unavailable
let secret = match vault_client.get_secret("database/password").await {
    Ok(s) => s,
    Err(VaultError::Unavailable) => {
        // Fallback to SOPS for read-only operations
        warn!("Vault unavailable, using SOPS fallback");
        sops_decrypt("config/secrets.yaml", "database.password")?
    },
    Err(e) => return Err(e),
};
```
**Operational Monitoring**:
```text
# prometheus metrics
secretum_vault_request_duration_seconds
secretum_vault_secret_lease_expiry
secretum_vault_auth_failures_total
secretum_vault_raft_leader_changes
# Alerts: Vault unavailable, high auth failure rate, lease expiry
```
## Alternatives Considered
### Alternative 1: Continue with SOPS Only
**Pros**: No new infrastructure, simple
**Cons**: No dynamic secrets, no audit trail, manual rotation
**Decision**: REJECTED - Insufficient for production security
### Alternative 2: HashiCorp Vault
**Pros**: Mature, feature-rich, widely adopted
**Cons**: BSL license, Go binary, HCL policies (not Cedar), complex deployment
**Decision**: REJECTED - License and integration concerns
### Alternative 3: Cloud Provider Native (AWS Secrets Manager, Azure Key Vault)
**Pros**: Fully managed, high availability
**Cons**: Vendor lock-in, multi-cloud complexity, cost at scale
**Decision**: REJECTED - Against open-source and multi-cloud principles
### Alternative 4: CyberArk, 1Password, etc.
**Pros**: Enterprise features
**Cons**: Proprietary, expensive, poor API integration
**Decision**: REJECTED - Not suitable for IaC automation
### Alternative 5: Build Custom Secrets Manager
**Pros**: Full control, tailored to needs
**Cons**: High maintenance burden, security risk, reinventing wheel
**Decision**: REJECTED - SecretumVault provides this already
## Implementation Details
### SecretumVault Deployment
```bash
# Deploy via provisioning system
provisioning deploy secretum-vault \
  --ha \
  --replicas 3 \
  --storage postgres \
  --tls-cert /path/to/cert.pem \
  --tls-key /path/to/key.pem
# Initialize and unseal
provisioning vault init
provisioning vault unseal --key-shares 5 --key-threshold 3
```
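The `--key-shares 5 --key-threshold 3` flags resemble the Shamir secret sharing scheme that HashiCorp Vault uses for unsealing. This ADR does not state which scheme SecretumVault uses, so the following is only an illustration of the threshold idea, assuming a Shamir-style split over a small prime field. The coefficients are fixed so the example is reproducible; a real implementation would draw them from a cryptographically secure random source and rely on a vetted library.

```rust
/// Field modulus: the Mersenne prime 2^31 - 1 (illustrative choice).
const P: u64 = 2_147_483_647;

/// Modular exponentiation by squaring.
fn mod_pow(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1;
    base %= P;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % P;
        }
        base = base * base % P;
        exp >>= 1;
    }
    acc
}

/// Modular inverse via Fermat's little theorem (P is prime).
fn mod_inv(x: u64) -> u64 {
    mod_pow(x, P - 2)
}

/// Evaluates `coeffs[0] + coeffs[1]*x + ...` at `x` using Horner's rule.
fn eval(coeffs: &[u64], x: u64) -> u64 {
    coeffs.iter().rev().fold(0, |acc, &c| (acc * x + c) % P)
}

/// Splits `secret` into `shares` points on a polynomial whose degree is
/// `coeffs.len()`, so the threshold is `coeffs.len() + 1`.
pub fn split(secret: u64, coeffs: &[u64], shares: u64) -> Vec<(u64, u64)> {
    let mut poly = vec![secret % P];
    poly.extend(coeffs.iter().map(|c| c % P));
    (1..=shares).map(|x| (x, eval(&poly, x))).collect()
}

/// Recovers the secret by Lagrange interpolation at x = 0.
pub fn combine(shares: &[(u64, u64)]) -> u64 {
    let mut secret = 0;
    for (i, &(xi, yi)) in shares.iter().enumerate() {
        let (mut num, mut den) = (1, 1);
        for (j, &(xj, _)) in shares.iter().enumerate() {
            if i != j {
                num = num * (P - xj) % P; // (0 - xj) mod P
                den = den * ((xi + P - xj) % P) % P; // (xi - xj) mod P
            }
        }
        secret = (secret + yi * num % P * mod_inv(den)) % P;
    }
    secret
}
```

With two coefficients the threshold is three: any three of the five shares reconstruct the secret, while two shares interpolate to an unrelated value.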
### Rust Client Library
```rust
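// provisioning/core/libs/secretum-client/src/lib.rs
use std::time::Duration;

// Assumption: `Result`, `TlsConfig`, and the response types are exported by
// the `secretum_vault` crate alongside `Client`, `SecretEngine`, and `Auth`.
use secretum_vault::{
    Auth, Certificate, Client, DbCredentials, Result, Secret, SecretEngine, TlsConfig,
};

pub struct VaultClient {
    client: Client,
}

impl VaultClient {
    /// Builds a client that authenticates with a token over mutual TLS.
    pub async fn new(addr: &str, token: &str) -> Result<Self> {
        let client = Client::new(addr)
            .auth(Auth::Token(token))
            .tls_config(TlsConfig::from_files("ca.pem", "cert.pem", "key.pem"))?
            .build()?;
        Ok(Self { client })
    }

    /// Reads a static secret from the KV v2 engine.
    pub async fn get_secret(&self, path: &str) -> Result<Secret> {
        self.client.kv2().get(path).await
    }

    /// Requests short-lived database credentials for the given role.
    pub async fn create_dynamic_db_credentials(&self, role: &str) -> Result<DbCredentials> {
        self.client.database().generate_credentials(role).await
    }

    /// Asks the SSH CA to sign a public key with the given lifetime.
    pub async fn sign_ssh_key(&self, public_key: &str, ttl: Duration) -> Result<Certificate> {
        self.client.ssh().sign_key(public_key, ttl).await
    }
}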
```
### Nushell Integration
```nushell
# Nushell commands via Rust CLI wrapper
provisioning secrets get database/prod/password
provisioning secrets set api/keys/stripe --value "sk_live_xyz"
provisioning secrets rotate database/prod/password
provisioning secrets lease renew lease_id_12345
provisioning secrets list database/
```
### Nickel Configuration Integration
```nickel
# provisioning/schemas/database.ncl
{
  database = {
    host = "postgres.example.com",
    port = 5432,
    username = secrets.get "database/prod/username",
    password = secrets.get "database/prod/password",
  }
}
# Nickel function: secrets.get resolves to SecretumVault API call
```
### Cedar Policy for Secret Access
```cedar
// policy: developers can read dev secrets, not prod
permit(
  principal in Group::"developers",
  action == Action::"read",
  resource in Secret::"database/dev"
);
forbid(
  principal in Group::"developers",
  action == Action::"read",
  resource in Secret::"database/prod"
);
// policy: CI/CD can generate dynamic DB credentials
permit(
  principal == Service::"github-actions",
  action == Action::"generate",
  resource in Secret::"database/dynamic"
) when {
  context.ttl <= duration("1h")
};
```
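Cedar combines matching policies with two rules: a request is denied unless at least one `permit` matches, and any matching `forbid` overrides every `permit`. The sketch below models only that combination step, with hypothetical `Effect` and `Decision` types. It is not the Cedar evaluator, which also resolves scopes, entity hierarchies, and `when` conditions.

```rust
/// Effect of a single policy whose scope and conditions matched the request.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Effect {
    Permit,
    Forbid,
}

/// Final authorization outcome.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Decision {
    Allow,
    Deny,
}

/// Combines the effects of all matching policies:
/// default deny, and forbid always wins over permit.
pub fn decide(matched: &[Effect]) -> Decision {
    let any_permit = matched.contains(&Effect::Permit);
    let any_forbid = matched.contains(&Effect::Forbid);
    if any_permit && !any_forbid {
        Decision::Allow
    } else {
        Decision::Deny
    }
}
```

Under these rules the explicit `forbid` on `database/prod` is a guard rail: developers are already denied by default, but the `forbid` keeps a later, broader `permit` from silently granting access.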
### Dynamic Database Credentials
```rust
// Application requests temporary DB credentials
let creds = vault_client
    .database()
    .generate_credentials("postgres-readonly")
    .await?;
println!("Username: {}", creds.username); // v-app-abcd1234
println!("Password: {}", creds.password); // random-secure-password
// `Duration` has no `Display` impl, so use Debug formatting
println!("TTL: {:?}", creds.lease_duration); // 1h (printed as 3600s)
// Credentials automatically revoked after TTL
// No manual cleanup needed
```
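On the client side, a lease can be tracked as an issue time plus a TTL. The sketch below shows one way to decide when credentials have expired and when to renew them. The `Lease` type is hypothetical, and renewing once two thirds of the TTL has elapsed is an assumed policy, not one defined by this ADR.

```rust
use std::time::{Duration, Instant};

/// Client-side view of a credential lease (hypothetical type).
pub struct Lease {
    issued_at: Instant,
    ttl: Duration,
}

impl Lease {
    pub fn new(issued_at: Instant, ttl: Duration) -> Self {
        Self { issued_at, ttl }
    }

    /// True once the full TTL has elapsed.
    pub fn is_expired(&self, now: Instant) -> bool {
        now.saturating_duration_since(self.issued_at) >= self.ttl
    }

    /// True once two thirds of the TTL has elapsed (assumed renewal policy).
    pub fn should_renew(&self, now: Instant) -> bool {
        now.saturating_duration_since(self.issued_at) >= self.ttl * 2 / 3
    }
}
```

Passing `now` as an argument instead of calling `Instant::now()` inside the methods keeps the logic deterministic under test.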
### Secret Rotation Automation
```toml
# secretum-vault config
[[rotation_policies]]
path = "database/prod/password"
schedule = "0 0 * * 0" # Weekly on Sunday midnight
max_age = "30d"
[[rotation_policies]]
path = "api/keys/stripe"
schedule = "0 0 1 * *" # Monthly on 1st
max_age = "90d"
```
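The `max_age` values above use a number followed by a unit suffix. The sketch below parses that form and checks whether a secret of a given age is due for rotation. It assumes only the suffixes `s`, `m`, `h`, and `d` are valid; the actual duration grammar accepted by SecretumVault is not specified here, and both function names are hypothetical.

```rust
use std::time::Duration;

/// Parses strings such as "30d", "12h", "5m", or "45s" into a `Duration`.
/// Returns `None` for unknown units, missing numbers, or overflow.
pub fn parse_age(s: &str) -> Option<Duration> {
    let s = s.trim();
    let unit = s.chars().last()?;
    let number = &s[..s.len() - unit.len_utf8()];
    let value: u64 = number.parse().ok()?;
    let secs_per_unit = match unit {
        's' => 1,
        'm' => 60,
        'h' => 3_600,
        'd' => 86_400,
        _ => return None,
    };
    value.checked_mul(secs_per_unit).map(Duration::from_secs)
}

/// Returns `Some(true)` when `age` has reached `max_age`,
/// or `None` when `max_age` cannot be parsed.
pub fn rotation_due(age: Duration, max_age: &str) -> Option<bool> {
    parse_age(max_age).map(|limit| age >= limit)
}
```

In this reading, `max_age` acts as a backstop: even if a scheduled run is missed, a secret older than the limit is flagged for rotation.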
### Audit Log Format
```json
{
  "timestamp": "2025-01-08T12:34:56Z",
  "type": "request",
  "auth": {
    "client_token": "sha256:abc123...",
    "accessor": "hmac:def456...",
    "display_name": "service-orchestrator",
    "policies": ["default", "service-policy"]
  },
  "request": {
    "operation": "read",
    "path": "secret/data/database/prod/password",
    "remote_address": "10.0.1.5"
  },
  "response": {
    "status": 200
  },
  "cedar_policy": {
    "decision": "permit",
    "policy_id": "allow-orchestrator-read-secrets"
  }
}
```
## Testing Strategy
**Unit Tests**:
```rust
#[tokio::test]
async fn test_get_secret() {
    let vault = mock_vault_client();
    let secret = vault.get_secret("test/secret").await.unwrap();
    assert_eq!(secret.value, "expected-value");
}

#[tokio::test]
async fn test_dynamic_credentials_generation() {
    let vault = mock_vault_client();
    let creds = vault.create_dynamic_db_credentials("postgres-readonly").await.unwrap();
    assert!(creds.username.starts_with("v-"));
    assert_eq!(creds.lease_duration, Duration::from_secs(3600));
}
```
**Integration Tests**:
```bash
# Test vault deployment
provisioning deploy secretum-vault --test-mode
provisioning vault init
provisioning vault unseal
# Test secret operations (assumes `secrets get` prints the bare value)
provisioning secrets set test/secret --value "test-value"
test "$(provisioning secrets get test/secret)" = "test-value"
# Test dynamic credentials (assumes JSON output with a `username` field)
provisioning secrets db-creds postgres-readonly | jq -e '.username | startswith("v-")'
# Test rotation
provisioning secrets rotate test/secret
```
**Security Tests**:
```rust
#[tokio::test]
async fn test_unauthorized_access_denied() {
    let vault = vault_client_with_limited_token();
    let result = vault.get_secret("database/prod/password").await;
    assert!(matches!(result, Err(VaultError::PermissionDenied)));
}
```
## Configuration Integration
**Provisioning Config**:
```toml
# provisioning/config/config.defaults.toml
[secrets]
provider = "secretum-vault" # "secretum-vault" | "sops" | "env"
vault_addr = "https://vault.example.com:8200"
vault_namespace = "provisioning"
vault_mount = "secret"
[secrets.tls]
ca_cert = "/etc/provisioning/vault-ca.pem"
client_cert = "/etc/provisioning/vault-client.pem"
client_key = "/etc/provisioning/vault-client-key.pem"
[secrets.cache]
enabled = true
ttl = "5m"
max_size = "100MB"
```
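The `[secrets.cache]` settings imply a client-side cache whose entries expire after `ttl`. A minimal sketch of such a cache follows, assuming string values keyed by secret path. `SecretCache` is a hypothetical type, and the `max_size` limit is not modelled.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// In-memory secret cache with a fixed time-to-live per entry (hypothetical).
pub struct SecretCache {
    ttl: Duration,
    entries: HashMap<String, (String, Instant)>,
}

impl SecretCache {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Stores `value` under `path`, stamped with the time it was fetched.
    pub fn insert(&mut self, path: &str, value: &str, now: Instant) {
        self.entries.insert(path.to_string(), (value.to_string(), now));
    }

    /// Returns the cached value, evicting it first if the TTL has elapsed.
    pub fn get(&mut self, path: &str, now: Instant) -> Option<&str> {
        let expired = match self.entries.get(path) {
            Some((_, stored_at)) => now.saturating_duration_since(*stored_at) >= self.ttl,
            None => return None,
        };
        if expired {
            self.entries.remove(path);
            return None;
        }
        self.entries.get(path).map(|(value, _)| value.as_str())
    }
}
```

Caching trades freshness for availability: a rotated or revoked secret can still be served from the cache for up to `ttl`, so the cache TTL should stay well below the shortest lease duration in use.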
**Environment Variables**:
```bash
export VAULT_ADDR="https://vault.example.com:8200"
export VAULT_TOKEN="s.abc123def456..."
export VAULT_NAMESPACE="provisioning"
export VAULT_CACERT="/etc/provisioning/vault-ca.pem"
```
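These variables can be read into a typed structure at startup. The sketch below takes the lookup function as a parameter so the logic can be tested without touching the process environment. `VaultEnv` and `load_vault_env` are hypothetical names, and treating `VAULT_ADDR` and `VAULT_TOKEN` as required while the other two are optional is an assumption.

```rust
/// Connection settings read from the environment (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct VaultEnv {
    pub addr: String,
    pub token: String,
    pub namespace: Option<String>,
    pub ca_cert: Option<String>,
}

/// Builds a `VaultEnv` from a variable lookup function.
/// Fails with a message naming the first missing required variable.
pub fn load_vault_env<F>(lookup: F) -> Result<VaultEnv, String>
where
    F: Fn(&str) -> Option<String>,
{
    // Treat unset and empty variables the same way.
    let required = |name: &str| {
        lookup(name)
            .filter(|value| !value.is_empty())
            .ok_or_else(|| format!("missing required variable {name}"))
    };
    Ok(VaultEnv {
        addr: required("VAULT_ADDR")?,
        token: required("VAULT_TOKEN")?,
        namespace: lookup("VAULT_NAMESPACE"),
        ca_cert: lookup("VAULT_CACERT"),
    })
}
```

In application code the lookup would typically be `|name| std::env::var(name).ok()`.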
## Migration Path
**Phase 1: Deploy SecretumVault**
- Deploy vault cluster in HA mode
- Initialize and configure backends
- Set up Cedar policies
**Phase 2: Migrate Static Secrets**
- Import SOPS secrets into vault KV store
- Update Nickel configs to reference vault paths
- Verify secret access via new API
**Phase 3: Enable Dynamic Secrets**
- Configure database secret engine
- Configure SSH CA secret engine
- Update applications to use dynamic credentials
**Phase 4: Deprecate SOPS for Runtime**
- SOPS remains for gitops config files
- Runtime secrets exclusively from vault
- Audit trail enforcement
**Phase 5: Automation**
- Automatic rotation policies
- Lease renewal automation
- Monitoring and alerting
## Documentation Requirements
**User Guides**:
- `docs/user/secrets-management.md` - Using SecretumVault
- `docs/user/dynamic-credentials.md` - Dynamic secret workflows
- `docs/user/secret-rotation.md` - Rotation policies and procedures
**Operations Documentation**:
- `docs/operations/vault-deployment.md` - Deploying and configuring vault
- `docs/operations/vault-backup-restore.md` - Backup and disaster recovery
- `docs/operations/vault-monitoring.md` - Metrics, logs, alerts
**Developer Documentation**:
- `docs/development/secrets-api.md` - Rust client library usage
- `docs/development/cedar-secret-policies.md` - Writing Cedar policies for secrets
- Secret engine development guide
**Security Documentation**:
- `docs/security/secrets-architecture.md` - Security architecture overview
- `docs/security/audit-logging.md` - Audit trail and compliance
- Threat model and risk assessment
## References
- [SecretumVault GitHub](https://github.com/secretum-vault/secretum) (hypothetical, replace with actual)
- [HashiCorp Vault Documentation](https://www.vaultproject.io/docs) (for comparison)
- ADR-008: Cedar Authorization (policy integration)
- ADR-009: Security System Complete (current security architecture)
- [Raft Consensus Algorithm](https://raft.github.io/)
- [Cedar Policy Language](https://www.cedarpolicy.com/)
- [SOPS](https://github.com/getsops/sops)
- [Age Encryption](https://age-encryption.org/)
---
**Status**: Accepted
**Last Updated**: 2025-01-08
**Implementation**: Planned
**Priority**: High (Security and compliance)
**Estimated Complexity**: Complex