
🎛️ Production Management and Orchestration Strategy

Date: 2025-11-20 | Level: Architecture and Operations | Focus: Multi-project, scalable, production-grade


📋 Table of Contents

  1. Centralized Management Model
  2. Multi-Project Orchestration
  3. Service Lifecycle
  4. Change Control
  5. Monitoring and Observability
  6. Disaster Recovery

🏛️ Centralized Management Model

Central Control Architecture

┌─────────────────────────────────────────────────────────────┐
│              CONTROL CENTER (Central Repository)            │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Git Repository (Single Source of Truth)                   │
│  ├─ /services/                                             │
│  │  ├─ catalog.toml          (Global service registry)     │
│  │  ├─ patterns.toml         (Deployment patterns)         │
│  │  └─ versions.toml         (Version tracking)            │
│  │                                                         │
│  ├─ /projects/               (Multi-tenant definitions)    │
│  │  ├─ project-a/            (Project-specific configs)    │
│  │  │  ├─ services.toml                                    │
│  │  │  ├─ deployment.toml                                  │
│  │  │  └─ monitoring.toml                                  │
│  │  ├─ project-b/                                          │
│  │  └─ project-c/                                          │
│  │                                                         │
│  ├─ /infrastructure/         (KCL cluster definitions)     │
│  │  ├─ staging.k             (Staging cluster)             │
│  │  └─ production.k          (Production cluster)          │
│  │                                                         │
│  ├─ /policies/               (Governance rules)            │
│  │  ├─ security.toml         (Security policies)           │
│  │  ├─ compliance.toml       (Compliance rules)            │
│  │  └─ sla.toml              (SLA definitions)             │
│  │                                                         │
│  └─ /documentation/          (Autogenerated docs)          │
│     ├─ SERVICES.md           (Service inventory)           │
│     ├─ TOPOLOGY.md           (Dependency map)              │
│     └─ RUNBOOK.md            (Operational procedures)      │
│                                                             │
└─────────────────────────────────────────────────────────────┘
         ↑                              ↓
         │                              │
    CI/CD Pipeline             GitOps Agent (ArgoCD/Flux)
    - Validate                 - Sync configs
    - Test                     - Apply changes
    - Generate                 - Monitor drift
    - Publish

Controlled Change Flow

1. CHANGE PROPOSAL
   Developer → PR with changes to catalog.toml

2. AUTOMATED VALIDATION
   ├─ Schema validation
   ├─ Dependency analysis
   ├─ Security scanning
   ├─ Generate preview outputs
   └─ Run integration tests

3. HUMAN REVIEW
   Code review + deployment review

4. MERGE AND PUBLISH
   Merge to main → trigger CI/CD

5. GENERATION
   ├─ Generate Docker images
   ├─ Generate K8s manifests
   ├─ Generate Terraform code
   └─ Generate KCL schemas

6. AUTOMATED DEPLOYMENT
   ArgoCD/Flux sync to:
   ├─ Staging (auto)
   ├─ Production (manual approval)
   └─ Multi-region (canary)
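
Step 6 treats each target differently: staging syncs automatically, production waits for a manual approval, and multi-region rollouts go out as canaries. A minimal sketch of that gating rule follows. `Target`, `Release`, and `release_mode` are hypothetical names used only for illustration; they are not part of ArgoCD or Flux.

```rust
/// Deployment targets listed in step 6 of the flow.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Target {
    Staging,
    Production,
    MultiRegion,
}

/// How a sync to a target is released (illustrative names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Release {
    /// Applied as soon as the GitOps agent sees the change.
    Automatic,
    /// Held until a human approves the rollout.
    AwaitingApproval,
    /// Rolled out to a small share of traffic first.
    Canary,
}

/// Decide how a change is released to `target`.
/// `approved` is only consulted for production.
pub fn release_mode(target: Target, approved: bool) -> Release {
    match target {
        Target::Staging => Release::Automatic,
        Target::Production if approved => Release::Automatic,
        Target::Production => Release::AwaitingApproval,
        Target::MultiRegion => Release::Canary,
    }
}
```

Under this sketch, `release_mode(Target::Production, false)` stays at `Release::AwaitingApproval` until someone approves the change.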

🌍 Multi-Project Orchestration

Multi-Tenant Architecture

┌────────────────────────────────────────────────────────┐
│         ServiceRegistry Central (Shared)               │
│  - Core service definitions                           │
│  - Global patterns                                    │
│  - Shared infrastructure config                       │
└────────────────────────────────────────────────────────┘
                ↓    ↓    ↓    ↓
    ┌───────────┴────┴────┴────┴──────────┐
    │                                      │
    ↓                ↓                ↓
┌──────────┐   ┌──────────┐   ┌──────────┐
│ Project  │   │ Project  │   │ Project  │
│    A     │   │    B     │   │    C     │
├──────────┤   ├──────────┤   ├──────────┤
│ Services │   │ Services │   │ Services │
│ (subset) │   │ (subset) │   │ (subset) │
│          │   │          │   │          │
│ Custom:  │   │ Custom:  │   │ Custom:  │
│ - Repos  │   │ - Repos  │   │ - Repos  │
│ - Env    │   │ - Env    │   │ - Env    │
│ - Policy │   │ - Policy │   │ - Policy │
└──────────┘   └──────────┘   └──────────┘
    │              │              │
    └──────────────┴──────────────┘
              ↓
    ┌─────────────────────┐
    │  Orchestrator       │
    │  (Central Control)  │
    │                     │
    │ validate()          │
    │ generate()          │
    │ deploy()            │
    │ monitor()           │
    └─────────────────────┘
              ↓
    ┌─────────────────────────────────┐
    │  Infrastructure Targets         │
    │  ├─ Docker (local dev)         │
    │  ├─ Kubernetes (staging)       │
    │  ├─ Kubernetes (production)    │
    │  └─ KCL (cluster mgmt)         │
    └─────────────────────────────────┘

Cross-Project Dependency Management

# /services/catalog.toml (Global)

[service.shared-auth]
name = "shared-auth"
type = "microservice"
owner = "platform-team"
version = "1.2.3"
compat_version_min = "1.0.0"
deprecation_date = "2026-06-30"

[service.shared-auth.consumers]
# Which projects use this service
projects = ["project-a", "project-b", "project-c"]
required_by = ["api-gateway"]

[service.shared-auth.governance]
# Who can change this
approvers = ["@platform-team", "@security-team"]
change_log = "link-to-changelog"
break_changes_notice = 30  # days of notice

---

# /projects/project-a/services.toml (Project-specific)

[project-specific.project-a]
name = "E-Commerce Platform"
environment = "production"
tier = "critical"

[service.project-a-frontend]
name = "project-a-frontend"
type = "web"
description = "E-commerce UI"
version = "2.1.0"

[service.project-a-frontend.dependencies]
requires = ["project-a-api", "shared-auth"]
optional = ["analytics-service"]

# Validation: shared-auth must be at version >= 1.0.0
# Automatic: if shared-auth changes, project-a is re-validated
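
The rule that `shared-auth` must be at least `compat_version_min` comes down to comparing two version strings. Here is a minimal sketch of that comparison, assuming plain `major.minor.patch` versions. The function names are hypothetical, and pre-release tags such as `1.0.0-beta.1` are deliberately rejected rather than ordered.

```rust
/// Parse a plain `major.minor.patch` version string.
/// Anything else (missing parts, pre-release suffixes) yields `None`.
pub fn parse_version(s: &str) -> Option<(u64, u64, u64)> {
    let mut parts = s.split('.');
    let major = parts.next()?.parse::<u64>().ok()?;
    let minor = parts.next()?.parse::<u64>().ok()?;
    let patch = parts.next()?.parse::<u64>().ok()?;
    if parts.next().is_some() {
        return None;
    }
    Some((major, minor, patch))
}

/// Check that `version` is at least `min`.
/// Returns `None` when either string cannot be parsed.
pub fn satisfies_min(version: &str, min: &str) -> Option<bool> {
    // Tuples compare lexicographically: major first, then minor, then patch.
    Some(parse_version(version)? >= parse_version(min)?)
}
```

With the catalog values above, `satisfies_min("1.2.3", "1.0.0")` returns `Some(true)`. Comparing the parts as numbers rather than text keeps `1.10.0` above `1.9.0`.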

Cross-Project Change Validation

// In provisioning/src/orchestrator.rs

pub struct OrchestratorValidator;

impl OrchestratorValidator {
    /// Validate changes to global dependencies
    pub async fn validate_cross_project_impact(
        &self,
        changed_service: &Service,
        registry: &ServiceRegistry,
    ) -> Result<CrossProjectImpact> {
        let affected_projects = registry
            .find_consumers(changed_service.id())
            .await?;

        let mut impact = CrossProjectImpact::new();

        for project in affected_projects {
            // `project` is cloned into `impact` so it stays usable for the
            // later checks (assumes the project type implements `Clone`).

            // Check version compatibility
            if !self.check_version_compat(changed_service, &project)? {
                impact.add_breaking_change(project.clone());
            }

            // Check dependency graph
            if self.would_create_cycle(changed_service, &project)? {
                impact.add_circular_dependency(project.clone());
            }

            // Check SLA compliance
            if !self.check_sla_impact(changed_service, &project)? {
                impact.add_sla_violation(project.clone());
            }
        }

        Ok(impact)
    }

    /// Notify affected projects
    pub async fn notify_affected_projects(
        &self,
        impact: &CrossProjectImpact,
    ) -> Result<()> {
        for (project, issues) in impact.iter() {
            // Send notification to project owners
            self.notify_slack(
                &format!("⚠️ Service change impacts {}: {:?}",
                    project.name, issues)
            ).await?;

            // Create automatic issue in project
            self.create_github_issue(project, issues).await?;
        }
        Ok(())
    }
}
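
The validator above calls a `would_create_cycle` helper without showing it. One possible way to implement such a check is a reachability search over the dependency graph. The sketch below is an assumption about how it could work, not the project's actual code; it uses a simplified signature and a hypothetical `Graph` alias keyed by service name.

```rust
use std::collections::{HashMap, HashSet};

/// Dependency graph: service name -> names of the services it requires.
pub type Graph = HashMap<String, Vec<String>>;

/// Returns true if adding the edge `from -> to` would create a cycle,
/// which is the case exactly when `from` is already reachable from `to`.
pub fn would_create_cycle(graph: &Graph, from: &str, to: &str) -> bool {
    let mut stack = vec![to.to_string()];
    let mut seen: HashSet<String> = HashSet::new();
    while let Some(node) = stack.pop() {
        if node == from {
            return true;
        }
        // Skip nodes that were already expanded.
        if !seen.insert(node.clone()) {
            continue;
        }
        if let Some(deps) = graph.get(&node) {
            stack.extend(deps.iter().cloned());
        }
    }
    false
}
```

The search is iterative, so deep dependency chains do not risk a stack overflow, and the `seen` set keeps it linear in the size of the graph.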

🔄 Service Lifecycle

States and Transitions

DEVELOPMENT
    ↓
    [Versioning: 0.x.y]
    [Stability: Experimental]
    ├─ Code review required
    ├─ Tests required
    └─ Only in staging
    ↓
BETA
    ↓
    [Versioning: 1.0.0-beta.x]
    [Stability: Unstable]
    ├─ Limited production use (opt-in)
    ├─ 30-day notice for changes
    └─ Changelog required
    ↓
GA (General Availability)
    ↓
    [Versioning: 1.x.y]
    [Stability: Stable]
    ├─ Backward compatibility guaranteed
    ├─ SLA: 99.9% availability
    └─ Deprecation path required for breaking changes
    ↓
MAINTENANCE
    ↓
    [Versioning: 1.x.y LTS]
    [Stability: Mature]
    ├─ Security fixes only
    ├─ 12-month support window
    └─ No new features
    ↓
DEPRECATED
    ↓
    [Versioning: old]
    [Stability: EOL]
    ├─ Replacement service recommended
    ├─ 6-month migration window
    └─ No support
    ↓
RETIRED
    ↓
    [Removed from catalog]
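
The diagram reads as a small state machine. A minimal sketch follows, assuming transitions are strictly linear as drawn, so no stage can be skipped. The `Lifecycle` type and its methods are hypothetical names.

```rust
/// Lifecycle stages in the order shown in the diagram.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Lifecycle {
    Development,
    Beta,
    Ga,
    Maintenance,
    Deprecated,
    Retired,
}

impl Lifecycle {
    /// The only stage a service may move to next, or `None` once retired.
    pub fn next(self) -> Option<Lifecycle> {
        use Lifecycle::*;
        match self {
            Development => Some(Beta),
            Beta => Some(Ga),
            Ga => Some(Maintenance),
            Maintenance => Some(Deprecated),
            Deprecated => Some(Retired),
            Retired => None,
        }
    }

    /// Assumption: a transition is valid only if it follows the diagram exactly.
    pub fn can_transition_to(self, target: Lifecycle) -> bool {
        self.next() == Some(target)
    }
}
```

A catalog validator could reject a change that moves a service from `Development` straight to `Ga`, since `can_transition_to` returns `false` for that pair.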

Controlled Transitions

# services-catalog.toml with lifecycle fields

[service.old-api]
name = "old-api"
status = "deprecated"
deprecation_date = "2025-11-20"
sunset_date = "2026-05-20"      # 6 months after deprecation
replacement_service = "new-api"

[service.old-api.migration]
guide_url = "https://wiki.example.com/old-api-migration"
support_until = "2026-05-20"
migration_difficulty = "easy"    # easy, moderate, hard
estimated_effort_hours = 2

[service.new-api]
name = "new-api"
status = "ga"
version = "2.0.0"
compat_with = []                 # No backward compatibility

[service.new-api.sla]
availability = "99.9%"
response_time_p99 = "100ms"
break_handling = "graceful"
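
In the example, `sunset_date` is `deprecation_date` plus the 6-month migration window. A validator could recompute it to check that the two fields agree. Here is a minimal sketch with a hypothetical `add_months` helper. To avoid end-of-month clamping it only accepts days 1 to 28 and returns `None` otherwise.

```rust
/// Add `months` to an ISO `YYYY-MM-DD` date, keeping the day of month.
/// Simplification: days above 28 are rejected instead of being clamped.
pub fn add_months(date: &str, months: u32) -> Option<String> {
    let mut parts = date.split('-');
    let year: u32 = parts.next()?.parse().ok()?;
    let month: u32 = parts.next()?.parse().ok()?;
    let day: u32 = parts.next()?.parse().ok()?;
    let well_formed = parts.next().is_none()
        && year <= 9999
        && (1..=12).contains(&month)
        && (1..=28).contains(&day);
    if !well_formed || months > 1200 {
        return None;
    }
    // Count months since year 0, add the offset, then split again.
    let total = year * 12 + (month - 1) + months;
    Some(format!("{:04}-{:02}-{:02}", total / 12, total % 12 + 1, day))
}
```

For the `old-api` entry, `add_months("2025-11-20", 6)` returns `Some("2026-05-20".to_string())`, which matches the `sunset_date` in the catalog.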

🔐 Change Control

Service Change Process

┌──────────────────────────────────────┐
│   1. IDENTIFY THE NEED FOR CHANGE    │
├──────────────────────────────────────┤
│                                      │
│  Change types:                       │
│  ├─ Bugfix (patch, low risk)        │
│  ├─ Feature (minor, medium risk)    │
│  ├─ Breaking Change (major, high)   │
│  └─ Deprecation (migration needed)  │
│                                      │
└──────────────────────────────────────┘
                ↓
┌──────────────────────────────────────┐
│   2. CREATE CHANGE PROPOSAL          │
├──────────────────────────────────────┤
│                                      │
│  Branch: feature/service-update      │
│  Edit: services/catalog.toml         │
│  Add: CHANGELOG entry                │
│  Docs: Migration guide (if breaking) │
│                                      │
└──────────────────────────────────────┘
                ↓
┌──────────────────────────────────────┐
│   3. AUTOMATED VALIDATION            │
├──────────────────────────────────────┤
│                                      │
│  Schema validation                   │
│  ├─ ✓ TOML syntax                    │
│  ├─ ✓ All required fields            │
│  └─ ✓ Semantic constraints           │
│                                      │
│  Dependency validation               │
│  ├─ ✓ No circular dependencies       │
│  ├─ ✓ All required services exist    │
│  └─ ✓ Version compatibility          │
│                                      │
│  Cross-project impact                │
│  ├─ ✓ No SLA violations              │
│  ├─ ✓ Consumer notification          │
│  └─ ✓ Breaking change policy         │
│                                      │
│  Security scanning                   │
│  ├─ ✓ No exposed credentials         │
│  ├─ ✓ Compliance rules OK            │
│  └─ ✓ Network policies valid         │
│                                      │
│  Preview generation                  │
│  ├─ ✓ Docker Compose preview         │
│  ├─ ✓ K8s manifests preview          │
│  ├─ ✓ Terraform preview              │
│  └─ ✓ Show diff from current         │
│                                      │
└──────────────────────────────────────┘
                ↓
┌──────────────────────────────────────┐
│   4. CHANGE REVIEW (CODE REVIEW)     │
├──────────────────────────────────────┤
│                                      │
│  Reviewer role: @platform-team       │
│  Questions to answer:                │
│  ├─ Why this change?                 │
│  ├─ Does it affect other projects?   │
│  ├─ Does it need documentation?      │
│  ├─ Is it a breaking change?         │
│  └─ Are there tests?                 │
│                                      │
│  Approval gates:                     │
│  ├─ Code review (required)           │
│  ├─ Security review (if sensitive)   │
│  ├─ Architecture review (if major)   │
│  └─ Compliance check (if regulated)  │
│                                      │
└──────────────────────────────────────┘
                ↓
┌──────────────────────────────────────┐
│   5. MERGE TO MAIN                   │
├──────────────────────────────────────┤
│                                      │
│  Merge strategy: Squash + sign       │
│  Commit message: Include change type │
│  Tag: version-x.y.z                  │
│  CHANGELOG: Auto-updated             │
│                                      │
└──────────────────────────────────────┘
                ↓
┌──────────────────────────────────────┐
│   6. CI/CD PIPELINE TRIGGERED        │
├──────────────────────────────────────┤
│                                      │
│  ├─ Generate Docker images           │
│  ├─ Push to registry                 │
│  ├─ Update K8s manifests             │
│  ├─ Update Terraform modules         │
│  ├─ Generate KCL schemas             │
│  ├─ Create release notes             │
│  └─ Publish documentation            │
│                                      │
└──────────────────────────────────────┘
                ↓
┌──────────────────────────────────────┐
│   7. DEPLOYMENT TO STAGING           │
├──────────────────────────────────────┤
│                                      │
│  Automated (ArgoCD sync):            │
│  ├─ Apply K8s manifests              │
│  ├─ Health check (5 min)             │
│  ├─ Integration test (10 min)        │
│  ├─ Performance test (5 min)         │
│  └─ Rollback if failed               │
│                                      │
└──────────────────────────────────────┘
                ↓
┌──────────────────────────────────────┐
│   8. MANUAL APPROVAL FOR PRODUCTION  │
├──────────────────────────────────────┤
│                                      │
│  Approval gates:                     │
│  ├─ Staging tests passed             │
│  ├─ Product owner approval           │
│  ├─ Security approval (if sensitive) │
│  └─ Operations approval              │
│                                      │
│  Deployment options:                 │
│  ├─ Immediate (low-risk changes)     │
│  ├─ Canary (5% traffic)              │
│  ├─ Blue-Green (full switch)         │
│  └─ Scheduled (off-hours)            │
│                                      │
└──────────────────────────────────────┘
                ↓
┌──────────────────────────────────────┐
│   9. DEPLOYMENT TO PRODUCTION        │
├──────────────────────────────────────┤
│                                      │
│  Deployment flow:                    │
│  ├─ Health check pre-deployment      │
│  ├─ Canary deployment (if enabled)   │
│  ├─ Gradual rollout                  │
│  ├─ Health check post-deployment     │
│  └─ Monitoring & alerts              │
│                                      │
│  Rollback readiness:                 │
│  ├─ Previous version tagged          │
│  ├─ Rollback automated if needed     │
│  ├─ Automatic incident creation      │
│  └─ Notification to team             │
│                                      │
└──────────────────────────────────────┘
                ↓
┌──────────────────────────────────────┐
│   10. POST-DEPLOYMENT MONITORING     │
├──────────────────────────────────────┤
│                                      │
│  SLA monitoring (24h):               │
│  ├─ Availability > 99.9%             │
│  ├─ Error rate < 0.1%                │
│  ├─ P99 latency < threshold          │
│  └─ No critical incidents            │
│                                      │
│  On failure:                         │
│  ├─ Automatic rollback               │
│  ├─ Incident created                 │
│  ├─ RCA scheduled                    │
│  └─ Fix deployed                     │
│                                      │
└──────────────────────────────────────┘
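
Step 10 is a pass/fail gate over four checks. A minimal sketch of that gate follows, using the thresholds from the diagram (availability above 99.9%, error rate below 0.1%, P99 latency below a limit, no critical incidents). The type and function names are hypothetical, and the P99 limit is a parameter because the diagram leaves it unspecified.

```rust
/// Metrics observed during the 24h window after a production deployment.
#[derive(Debug, Clone, PartialEq)]
pub struct PostDeployWindow {
    pub availability_pct: f64,
    pub error_rate_pct: f64,
    pub p99_latency_ms: u64,
    pub critical_incidents: u32,
}

#[derive(Debug, Clone, PartialEq)]
pub enum Decision {
    Keep,
    /// Roll back, listing every check that failed.
    Rollback(Vec<&'static str>),
}

/// Apply the four step-10 checks and collect the ones that fail.
pub fn evaluate(window: &PostDeployWindow, p99_limit_ms: u64) -> Decision {
    let mut failed = Vec::new();
    if window.availability_pct <= 99.9 {
        failed.push("availability");
    }
    if window.error_rate_pct >= 0.1 {
        failed.push("error_rate");
    }
    if window.p99_latency_ms >= p99_limit_ms {
        failed.push("p99_latency");
    }
    if window.critical_incidents > 0 {
        failed.push("critical_incidents");
    }
    if failed.is_empty() {
        Decision::Keep
    } else {
        Decision::Rollback(failed)
    }
}
```

Returning the list of failed checks, rather than a bare boolean, gives the automatically created incident something concrete to report.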

📊 Monitoring and Observability

Service Metrics

# monitoring.toml for each project

[monitoring.project-a]
level = "critical"  # critical, high, medium, low

[service.project-a-api.metrics]
# Availability
availability_threshold = 99.9
availability_check_interval = "60s"

# Performance
response_time_p50 = "50ms"
response_time_p95 = "100ms"
response_time_p99 = "200ms"

# Error rates
error_rate_threshold = 0.1  # 0.1%
error_type_alerts = ["5xx", "timeout"]

# Business metrics
active_users_threshold = 1000
transaction_success_rate = 99.5

# Resource utilization
cpu_threshold = "80%"
memory_threshold = "85%"
disk_threshold = "90%"

[service.project-a-api.alerts]
critical = ["slack#critical-alerts"]
warning = ["slack#warnings", "email@ops"]
info = ["slack#info"]

[service.project-a-api.alerts.escalation]
time_to_escalate = "5m"
escalate_to = ["@on-call"]
page_if = "critical"
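
The configuration stores intervals as strings such as `"60s"`, `"5m"`, and `"50ms"`, so whatever consumes it has to turn them into durations before comparing. Here is a minimal sketch, assuming only those three suffixes are in use. `parse_duration` and `should_escalate` are hypothetical helpers.

```rust
use std::time::Duration;

/// Parse duration strings such as "60s", "5m" or "50ms".
/// Only these three suffixes are supported in this sketch.
pub fn parse_duration(raw: &str) -> Option<Duration> {
    let raw = raw.trim();
    // Check "ms" first, because "50ms" also ends in "s".
    if let Some(n) = raw.strip_suffix("ms") {
        return n.parse::<u64>().ok().map(Duration::from_millis);
    }
    if let Some(n) = raw.strip_suffix('s') {
        return n.parse::<u64>().ok().map(Duration::from_secs);
    }
    if let Some(n) = raw.strip_suffix('m') {
        return n
            .parse::<u64>()
            .ok()
            .and_then(|minutes| minutes.checked_mul(60))
            .map(Duration::from_secs);
    }
    None
}

/// True once an alert has waited at least `time_to_escalate`.
/// Returns `None` if the configured value cannot be parsed.
pub fn should_escalate(waited: Duration, time_to_escalate: &str) -> Option<bool> {
    parse_duration(time_to_escalate).map(|limit| waited >= limit)
}
```

With `time_to_escalate = "5m"`, an alert that has been open for six minutes yields `Some(true)`.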

Control Dashboard

# monitoring-dashboard.yml

apiVersion: v1
kind: ServiceMonitoringDashboard
metadata:
  name: platform-control-center
  namespace: monitoring

sections:
  - name: Service Inventory
    widgets:
      - type: service_list
        filter: status == "active"
        columns: [name, version, status, owner, sla]

      - type: dependency_graph
        show_cross_project: true
        highlight_breaking_changes: true

      - type: deployment_timeline
        time_range: "7d"
        show_rollbacks: true

  - name: Change Management
    widgets:
      - type: pending_changes
        approval_required: true

      - type: change_pipeline_status
        stages: [validation, review, deploy-staging, deploy-prod]

      - type: rollback_frequency
        time_range: "30d"
        by_service: true

      - type: change_calendar
        show_maintenance_windows: true

  - name: SLA Compliance
    widgets:
      - type: sla_status
        group_by: project
        highlight_violations: true

      - type: mttr_trends
        time_range: "30d"

      - type: incident_frequency
        by_severity: true

  - name: Cross-Project Impact
    widgets:
      - type: shared_service_usage
        show_deprecations: true

      - type: version_compliance
        check_min_versions: true

      - type: migration_status
        for_deprecated_services: true

🚨 Disaster Recovery

Recovery Strategy

SCENARIO 1: Service Definition Corrupted
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Detection: Validation pipeline fails
├─ Automatic rollback of commit
├─ Revert PR automatically
├─ Notify team on Slack
└─ Create incident

Recovery: 30 seconds
├─ Last known good state restored
├─ Previous manifests still deployed
└─ No downtime

---

SCENARIO 2: Breaking Change Not Caught
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Detection: Project fails integration tests
├─ Pre-deployment checks catch it
├─ Automatic canary rollback (5% traffic)
├─ Full rollback if P99 latency increases
└─ Incident auto-created

Prevention:
├─ Cross-project validation (before deploy)
├─ SLA monitoring (during canary)
└─ Automated rollback thresholds

---

SCENARIO 3: Infrastructure Failure
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Detection: KCL cluster schema fails
├─ Terraform apply fails validation
├─ No deployment attempted
└─ Configuration stays in valid state

Recovery: Manual remediation
├─ Infrastructure team alerted
├─ Re-generate KCL from service definitions
├─ Terraform plan reviewed
└─ Apply with approval

---

SCENARIO 4: Accidental Service Deletion
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Detection: CI detects that the deleted service is still in use
├─ Validation fails (service required but not in registry)
├─ Commit cannot merge
└─ PR blocked until restored

Recovery:
├─ Git history shows deleted service definition
├─ Restore from last commit
├─ Re-validate dependencies
└─ Redeploy
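
Scenario 4 relies on CI noticing that a service is still required after it has been removed from the registry. A minimal sketch of that check follows, assuming the registry is a set of service names and each consumer lists the names it requires. `missing_dependencies` is a hypothetical function name.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Return every `(consumer, missing_service)` pair where a consumer requires
/// a service that is not present in the registry.
pub fn missing_dependencies(
    registry: &BTreeSet<String>,
    requires: &BTreeMap<String, Vec<String>>,
) -> Vec<(String, String)> {
    let mut missing = Vec::new();
    // BTreeMap iterates in key order, so the report is deterministic.
    for (consumer, deps) in requires {
        for dep in deps {
            if !registry.contains(dep) {
                missing.push((consumer.clone(), dep.clone()));
            }
        }
    }
    missing
}
```

A non-empty result would block the PR. For example, deleting `shared-auth` while `project-a-frontend` still requires it yields the pair `("project-a-frontend", "shared-auth")`.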

Backup and Versioning

Git Repository (Auto-backed up)
├─ services/catalog.toml (all history)
├─ projects/*/services.toml (all history)
├─ infrastructure/*.k (all history)
└─ Tags: v1.0.0, v1.1.0, etc. (releases)

Artifact Repository (Docker Registry)
├─ syntaxis/syntaxis-api:v2.1.0
├─ syntaxis/syntaxis-api:v2.0.0 (previous)
├─ syntaxis/syntaxis-api:v1.9.0 (previous)
└─ Latest: syntaxis/syntaxis-api:latest

Kubernetes (etcd backup)
├─ Daily snapshots
├─ 30-day retention
└─ Cross-region replication

Disaster Recovery SLA
├─ RTO (Recovery Time Objective): < 1 hour
├─ RPO (Recovery Point Objective): < 5 minutes
└─ Backup test: Weekly

🎯 Operational Summary

Daily Management Checklist

☐ Morning Briefing (10 min)
  ├─ Check pending changes
  ├─ Review deployment status
  ├─ Check SLA compliance
  └─ Review critical incidents

☐ Change Review (30 min/change)
  ├─ Validate change request
  ├─ Review impact analysis
  ├─ Approve or request changes
  └─ Monitor deployment

☐ Incident Management (as needed)
  ├─ Create incident ticket
  ├─ Assess impact
  ├─ Initiate rollback if needed
  └─ Post-mortem within 24h

☐ End of Day Reporting (5 min)
  ├─ Document any incidents
  ├─ Update SLA tracking
  ├─ Prepare handoff for next shift
  └─ Archive deployment logs

Escalation Matrix

SEVERITY │ RESPONSE │ ESCALATION │ NOTIFICATION
─────────┼──────────┼────────────┼──────────────
Critical │ 5 min    │ 15 min     │ Page on-call
High     │ 15 min   │ 30 min     │ Slack #high
Medium   │ 30 min   │ 1 hour     │ Slack #med
Low      │ 1 hour   │ 4 hours    │ Slack #low
Info     │ 4 hours  │ None       │ Daily digest
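
The matrix can be encoded as a lookup so that alerting code and documentation share one source. Here is a minimal sketch with hypothetical `Severity` and `EscalationPolicy` types. The values are copied from the table above, and `None` represents the "None" escalation for informational events.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity {
    Critical,
    High,
    Medium,
    Low,
    Info,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct EscalationPolicy {
    pub response: Duration,
    /// `None` means the event is never escalated.
    pub escalation: Option<Duration>,
    pub notification: &'static str,
}

fn minutes(n: u64) -> Duration {
    Duration::from_secs(n * 60)
}

/// Look up the row of the escalation matrix for `severity`.
pub fn policy_for(severity: Severity) -> EscalationPolicy {
    let (response, escalation, notification) = match severity {
        Severity::Critical => (minutes(5), Some(minutes(15)), "Page on-call"),
        Severity::High => (minutes(15), Some(minutes(30)), "Slack #high"),
        Severity::Medium => (minutes(30), Some(minutes(60)), "Slack #med"),
        Severity::Low => (minutes(60), Some(minutes(240)), "Slack #low"),
        Severity::Info => (minutes(240), None, "Daily digest"),
    };
    EscalationPolicy { response, escalation, notification }
}
```

Because the `match` is exhaustive, adding a new severity level fails to compile until a matrix row is defined for it.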

📚 Automatic Documentation

# Automatically generated from catalog.toml

docs/
├─ SERVICES.md
│  ├─ Service listing with metadata
│  ├─ Port assignments
│  ├─ Dependencies
│  └─ Health check endpoints
│
├─ TOPOLOGY.md
│  ├─ Dependency graph (text)
│  ├─ Cross-project impact
│  ├─ Version compatibility matrix
│  └─ Deprecation timeline
│
├─ RUNBOOK.md
│  ├─ How to add new service
│  ├─ How to update service
│  ├─ How to deprecate service
│  ├─ How to handle incidents
│  └─ How to perform maintenance
│
├─ DEPLOYMENT.md
│  ├─ Deployment procedure
│  ├─ Rollback procedure
│  ├─ Canary deployment
│  └─ Blue-green deployment
│
└─ TROUBLESHOOTING.md
   ├─ Common issues
   ├─ Root causes
   ├─ Solutions
   └─ Prevention measures
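
Generating a file such as SERVICES.md amounts to rendering catalog entries into text. Here is a minimal sketch of the service listing, assuming the catalog has already been parsed into a hypothetical `ServiceEntry` struct. The field names mirror `name`, `version`, and `owner` from catalog.toml, and the table layout is an illustrative choice.

```rust
/// One catalog entry, reduced to the fields shown in the listing.
pub struct ServiceEntry {
    pub name: String,
    pub version: String,
    pub owner: String,
}

/// Render the service inventory as a Markdown table, sorted by name so the
/// generated file is stable across runs.
pub fn render_services_md(services: &[ServiceEntry]) -> String {
    let mut out = String::from("# Services\n\n");
    out.push_str("| Name | Version | Owner |\n");
    out.push_str("|------|---------|-------|\n");
    let mut sorted: Vec<&ServiceEntry> = services.iter().collect();
    sorted.sort_by(|a, b| a.name.cmp(&b.name));
    for s in sorted {
        out.push_str(&format!("| {} | {} | {} |\n", s.name, s.version, s.owner));
    }
    out
}
```

Sorting before rendering keeps diffs of the generated documentation limited to real catalog changes.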

Conclusion: Centralized management with automated validation, strict change control, and full observability makes it possible to scale to multiple projects without sacrificing security or reliability.