# ADR-007: KMS Service Simplification to Age and Cosmian Backends
**Status**: Accepted

**Date**: 2025-10-08

**Deciders**: Architecture Team

**Related**: ADR-006 (KMS Service Integration)

## Context
The KMS service initially supported 4 backends: HashiCorp Vault, AWS KMS, Age, and Cosmian KMS. This created unnecessary complexity and unclear guidance about which backend to use for different environments.
### Problems with 4-Backend Approach
1. **Complexity**: Supporting 4 different backends increased maintenance burden
2. **Dependencies**: AWS SDK added significant compile time (~30 s) and binary size
3. **Confusion**: No clear guidance on which backend to use when
4. **Cloud Lock-in**: AWS KMS dependency limited infrastructure flexibility
5. **Operational Overhead**: Vault requires server setup even for simple dev environments
6. **Code Duplication**: Similar logic implemented 4 different ways
### Key Insights
- Most development work doesn't need server-based KMS
- Production deployments need enterprise-grade security features
- Age provides fast, offline encryption perfect for development
- Cosmian KMS offers confidential computing and zero-knowledge architecture
- Supporting Vault AND Cosmian is redundant (both are server-based KMS)
- AWS KMS locks us into AWS infrastructure
## Decision
Simplify the KMS service to support only 2 backends:
1. **Age**: For development and local testing
   - Fast, offline, no server required
   - Simple key generation with `age-keygen`
   - X25519 encryption (modern, secure)
   - Perfect for dev/test environments
2. **Cosmian KMS**: For production deployments
   - Enterprise-grade key management
   - Confidential computing support (SGX/SEV)
   - Zero-knowledge architecture
   - Server-side key rotation
   - Audit logging and compliance
   - Multi-tenant support

Remove support for:
- ❌ HashiCorp Vault (redundant with Cosmian)
- ❌ AWS KMS (cloud lock-in, complexity)
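
The two-backend split can be pictured as a small configuration enum. The following is a minimal sketch, not the service's actual types: `KmsBackendConfig`, `Environment`, the field names, and `is_allowed_in` are hypothetical names chosen for illustration. It encodes the rule that Age is limited to development while Cosmian KMS serves production.

```rust
/// Hypothetical config enum for illustration; the real one in
/// `src/types.rs` may use different names and fields.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum KmsBackendConfig {
    /// Local, offline backend (field names are assumptions).
    Age {
        public_key_path: String,
        private_key_path: String,
    },
    /// Server-based backend (field names are assumptions).
    Cosmian { server_url: String, key_id: String },
}

/// Hypothetical deployment environment marker.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Environment {
    Development,
    Production,
}

impl KmsBackendConfig {
    /// Short identifier for the selected backend.
    pub fn name(&self) -> &'static str {
        match self {
            KmsBackendConfig::Age { .. } => "age",
            KmsBackendConfig::Cosmian { .. } => "cosmian",
        }
    }

    /// Age must never protect production secrets; every other
    /// combination is permitted in this sketch.
    pub fn is_allowed_in(&self, env: Environment) -> bool {
        !matches!(
            (self, env),
            (KmsBackendConfig::Age { .. }, Environment::Production)
        )
    }
}
```

A startup check could call `is_allowed_in` once and refuse to boot when an Age configuration is paired with a production environment.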
## Consequences
### Positive
1. **Simpler Code**: 2 backends instead of 4 halves the number of backend implementations to maintain
2. **Faster Compilation**: Removing AWS SDK saves ~30 seconds compile time
3. **Clear Guidance**: Age = dev, Cosmian = prod (no confusion)
4. **Offline Development**: Age works without network connectivity
5. **Better Security**: Cosmian provides confidential computing (TEE)
6. **No Cloud Lock-in**: Not dependent on AWS infrastructure
7. **Easier Testing**: Age backend requires no setup
8. **Reduced Dependencies**: Fewer external crates to maintain
### Negative
1. **Migration Required**: Existing Vault/AWS KMS users must migrate
2. **Learning Curve**: Teams must learn Age and Cosmian
3. **Cosmian Dependency**: Production depends on Cosmian availability
4. **Cost**: Cosmian may have licensing costs (cloud or self-hosted)
### Neutral
1. **Feature Parity**: Cosmian provides all features Vault/AWS had
2. **API Compatibility**: Encrypt/decrypt API remains largely the same
3. **Configuration Change**: TOML config structure updated but similar
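
To illustrate why the encrypt/decrypt API can stay largely stable while backends change, here is a hedged sketch of a backend-agnostic trait. `KmsBackend`, `KmsError`, `FakeBackend`, and `round_trip` are hypothetical names, not the service's real interface, and the trait is synchronous for brevity even though the actual service is built on `tokio`. `FakeBackend` is a test double that only tags and reverses bytes; it is not encryption.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub struct KmsError(pub String);

/// Hypothetical backend-agnostic interface (synchronous for brevity).
pub trait KmsBackend {
    fn name(&self) -> &'static str;
    fn encrypt(&self, plaintext: &[u8]) -> Result<Vec<u8>, KmsError>;
    fn decrypt(&self, ciphertext: &[u8]) -> Result<Vec<u8>, KmsError>;
}

/// Test double: prefixes a tag and reverses the bytes. NOT encryption.
pub struct FakeBackend;

const FAKE_TAG: &[u8] = b"fake:";

impl KmsBackend for FakeBackend {
    fn name(&self) -> &'static str {
        "fake"
    }

    fn encrypt(&self, plaintext: &[u8]) -> Result<Vec<u8>, KmsError> {
        let mut out = FAKE_TAG.to_vec();
        out.extend(plaintext.iter().rev());
        Ok(out)
    }

    fn decrypt(&self, ciphertext: &[u8]) -> Result<Vec<u8>, KmsError> {
        // Reject input that was not produced by `encrypt` above.
        let body = ciphertext
            .strip_prefix(FAKE_TAG)
            .ok_or_else(|| KmsError("missing fake tag".to_string()))?;
        Ok(body.iter().rev().copied().collect())
    }
}

/// Caller code depends only on the trait, not on a concrete backend.
pub fn round_trip(backend: &dyn KmsBackend, secret: &[u8]) -> Result<Vec<u8>, KmsError> {
    let ciphertext = backend.encrypt(secret)?;
    backend.decrypt(&ciphertext)
}
```

Under this shape, swapping Vault or AWS KMS for Age or Cosmian changes which type implements the trait, while callers such as `round_trip` stay untouched.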
## Implementation
### Files Created
1. `src/age/client.rs` (167 lines) - Age encryption client
2. `src/age/mod.rs` (3 lines) - Age module exports
3. `src/cosmian/client.rs` (294 lines) - Cosmian KMS client
4. `src/cosmian/mod.rs` (3 lines) - Cosmian module exports
5. `docs/migration/KMS_SIMPLIFICATION.md` (500+ lines) - Migration guide
### Files Modified
1. `src/lib.rs` - Updated exports (age, cosmian instead of aws, vault)
2. `src/types.rs` - Updated error types and config enum
3. `src/service.rs` - Simplified to 2 backends (180 lines, was 213)
4. `Cargo.toml` - Removed AWS deps, added `age = "0.10"`
5. `README.md` - Complete rewrite for new backends
6. `provisioning/config/kms.toml` - Simplified configuration
### Files Deleted
1. `src/aws/client.rs` - AWS KMS client
2. `src/aws/envelope.rs` - Envelope encryption helpers
3. `src/aws/mod.rs` - AWS module
4. `src/vault/client.rs` - Vault client
5. `src/vault/mod.rs` - Vault module
### Dependencies Changed
**Removed**:
- `aws-sdk-kms = "1"`
- `aws-config = "1"`
- `aws-credential-types = "1"`
- `aes-gcm = "0.10"` (was only for AWS envelope encryption)

**Added**:
- `age = "0.10"`
- `tempfile = "3"` (dev dependency for tests)

**Kept**:
- All Axum web framework deps
- `reqwest` (for Cosmian HTTP API)
- `base64`, `serde`, `tokio`, etc.
## Migration Path
### For Development
```bash
# 1. Install Age
brew install age   # or: apt install age
# 2. Generate keys (make sure the key directory exists first)
mkdir -p ~/.config/provisioning/age
age-keygen -o ~/.config/provisioning/age/private_key.txt
# Derive the public key (recipient) from the private key file
age-keygen -y ~/.config/provisioning/age/private_key.txt > ~/.config/provisioning/age/public_key.txt
# 3. Update config to use Age backend
# 4. Re-encrypt development secrets
```
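
After generating keys, a quick sanity check on the two files can catch a swapped or truncated key before secrets are re-encrypted. The sketch below assumes the usual `age-keygen` output: `#` comment lines followed by an `AGE-SECRET-KEY-1...` line in the identity file, and an `age1...` recipient string in the public key file. The function names are hypothetical, and the recipient check looks at the shape only; it does not verify the Bech32 checksum.

```rust
/// Returns the secret-key line of an age identity file,
/// skipping blank lines and `#` comment lines.
pub fn find_identity_line(contents: &str) -> Option<&str> {
    contents
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        .find(|line| line.starts_with("AGE-SECRET-KEY-1"))
}

/// Cheap plausibility check for an X25519 recipient (public key) string.
/// Shape only: `age1` prefix followed by lowercase letters and digits.
pub fn looks_like_recipient(s: &str) -> bool {
    let s = s.trim();
    s.starts_with("age1")
        && s.len() > 4
        && s.chars().all(|c| c.is_ascii_lowercase() || c.is_ascii_digit())
}
```

Both helpers take the file contents as a string, so they can be exercised in tests without touching the real key directory.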
### For Production
```bash
# 1. Set up Cosmian KMS (cloud or self-hosted)
# 2. Create master key in Cosmian
# 3. Migrate secrets from Vault/AWS to Cosmian
# 4. Update production config
# 5. Deploy new KMS service
```
See `docs/migration/KMS_SIMPLIFICATION.md` for detailed steps.
## Alternatives Considered
### Alternative 1: Keep All 4 Backends
**Pros**:
- No migration required
- Maximum flexibility

**Cons**:
- Continued complexity
- Maintenance burden
- Unclear guidance

**Rejected**: Complexity outweighs benefits
### Alternative 2: Only Cosmian (No Age)
**Pros**:
- Single backend
- Enterprise-grade everywhere

**Cons**:
- Requires Cosmian server for development
- Slower dev iteration
- Network dependency for local dev

**Rejected**: Development experience matters
### Alternative 3: Only Age (No Production Backend)
**Pros**:
- Simplest solution
- No server required

**Cons**:
- Not suitable for production
- No audit logging
- No key rotation
- No multi-tenant support

**Rejected**: Production needs enterprise features
### Alternative 4: Age + HashiCorp Vault
**Pros**:
- Vault is widely known
- No Cosmian dependency

**Cons**:
- Vault lacks confidential computing
- Vault server still required
- No zero-knowledge architecture

**Rejected**: Cosmian provides better security features
## Metrics
### Code Reduction
- **Total Lines Removed**: ~800 lines (AWS + Vault implementations)
- **Total Lines Added**: ~470 lines (Age + Cosmian implementations; the 500+ line migration guide is not counted)
- **Net Reduction**: ~330 lines
### Dependency Reduction
- **Crates Removed**: 4 (aws-sdk-kms, aws-config, aws-credential-types, aes-gcm)
- **Crates Added**: 1 (age); `tempfile` is added as a dev dependency only
- **Net Reduction**: 3 crates
### Compilation Time
- **Before**: ~90 seconds (with AWS SDK)
- **After**: ~60 seconds (without AWS SDK)
- **Improvement**: ~33% less compile time (about 30 seconds saved, a 1.5x speedup)
## Compliance
### Security Considerations
1. **Age Security**: X25519 (Curve25519) key agreement with ChaCha20-Poly1305 payload encryption, modern and secure
2. **Cosmian Security**: Confidential computing, zero-knowledge, enterprise-grade
3. **No Regression**: Security features maintained or improved
4. **Clear Separation**: Dev (Age) never used for production secrets
### Testing Requirements
1. **Unit Tests**: Both backends have comprehensive test coverage
2. **Integration Tests**: Age tests run without external deps
3. **Cosmian Tests**: Require test server (marked as `#[ignore]`)
4. **Migration Tests**: Verify old configs fail gracefully
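
One way to make old configs "fail gracefully" is to recognize the removed backend names explicitly and return an actionable error instead of a generic parse failure. The sketch below is an assumption about how that could look: `Backend`, `ConfigError`, `parse_backend`, and the config values `"vault"` and `"aws"` are hypothetical and may not match the real config schema.

```rust
/// Hypothetical set of supported backends after this ADR.
#[derive(Debug, PartialEq, Eq)]
pub enum Backend {
    Age,
    Cosmian,
}

/// Hypothetical config error distinguishing removed from unknown backends.
#[derive(Debug, PartialEq, Eq)]
pub enum ConfigError {
    RemovedBackend(String),
    UnknownBackend(String),
}

impl std::fmt::Display for ConfigError {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        match self {
            ConfigError::RemovedBackend(name) => write!(
                f,
                "backend '{}' was removed by ADR-007; use 'age' (development) or 'cosmian' (production)",
                name
            ),
            ConfigError::UnknownBackend(name) => write!(f, "unknown backend '{}'", name),
        }
    }
}

/// Maps a backend name from config to a supported backend,
/// with a dedicated error for the backends this ADR removed.
pub fn parse_backend(name: &str) -> Result<Backend, ConfigError> {
    match name.trim().to_ascii_lowercase().as_str() {
        "age" => Ok(Backend::Age),
        "cosmian" => Ok(Backend::Cosmian),
        removed @ ("vault" | "aws") => Err(ConfigError::RemovedBackend(removed.to_string())),
        other => Err(ConfigError::UnknownBackend(other.to_string())),
    }
}
```

A migration test can then assert that a legacy value such as `"vault"` yields `ConfigError::RemovedBackend` and that the message points the operator at the supported backends.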
## References
- [Age Encryption](https://github.com/FiloSottile/age) - Modern encryption tool
- [Cosmian KMS](https://cosmian.com/kms/) - Enterprise KMS with confidential computing
- [ADR-006](ADR-006-provisioning-cli-refactoring.md) - Previous KMS integration
- [Migration Guide](../migration/KMS_SIMPLIFICATION.md) - Detailed migration steps
## Notes
- Age was designed by Filippo Valsorda (formerly of Google's Go security team) and Ben Cartwright-Cox
- Cosmian provides FIPS 140-2 Level 3 compliance (when using certified hardware)
- This decision aligns with project goal of reducing cloud provider dependencies
- Migration timeline: 6 weeks for full adoption