# Sentinel Secure X

Repository overview, key capabilities, quick start notes, and the companion documentation map.

Sentinel Secure X is a production-oriented device trust and control platform with:
- mTLS-ready device heartbeats
- JWT-protected admin APIs
- Device certificate pinning tied to `device_id`
- A lightweight live dashboard
- At-least-once queued command delivery with agent acknowledgements
- Audit logging for auth, enrollment, command dispatch, and rejected identity events
- Admin approval flow for new devices and certificate rotations
- Distinct approver tracking with optional multi-admin approval for sensitive devices
- Certificate validity and expiry visibility for rotation planning
- Explicit certificate revocation with backend enforcement
- CA-driven certificate revocation checks via CRLs on the device mTLS path
- Role-based admin authorization for viewing, operations, identity review, and certificate security actions
- Optional TOTP multi-factor authentication for admin accounts
- WebAuthn/passkey enrollment and phishing-resistant admin sign-in
- Server-side admin session tracking, logout, and live token revocation
- Persistent login throttling and temporary lockout for repeated auth failures
- Idle session expiry and active-session caps for admin accounts
- Monotonic heartbeat counters with backend anti-replay enforcement
- Signed device attestation evidence with nonce freshness and PCR baseline validation
- Signed policy bundle enforcement for command authorization
- Time-bound policy exception workflow with dual approval and expiry
- Signed agent update staging, artifact verification, install hooks, and rollback hooks
- Encrypted artifact storage for sensitive device outputs with retention windows
- Workload-authenticated trust-feed endpoints for downstream integrations
- Signed outbound webhook delivery with durable retry queue
- First-class Splunk HEC delivery targets with encrypted connector tokens
- NAC-style trust enforcement connectors for allow/quarantine/block decisions
- IdP/conditional-access connectors for allow/step-up/block session policy
- Append-only audit sink replication for off-box evidence retention
- Signed short-lived device trust assertions for downstream enforcement
- External signer/HSM-ready signing hooks for evidence export, trust assertions, and webhook delivery
- TPM quote attestation provider integration via pluggable collector/verifier commands
- Hardware-binding continuity for TPM attestation, with optional AK rotation allowance on the same bound device
- Protected command review workflow for sensitive remote actions
- Tamper-evident audit chain verification
- Signed audit evidence exports for external verification
- Standard service deployment artifacts for Linux, Windows, and macOS
This repository intentionally avoids stealthy persistence and automatic endpoint lockout logic. In production, use OS-native service recovery, explicit policy approval, and audited enforcement workflows.
## Project layout

- `server/`: Flask backend, database models, admin auth, and static dashboard
- `agent/`: certificate-authenticated heartbeat agent and service wrappers
- `deploy/`: certificate generation, signing helpers, and reverse-proxy/service assets for API, delivery, and maintenance workers
## Companion docs

- Installation guide: `docs/installation_guide.md`
- User manual: `docs/user_manual.md`
- Production baseline: `docs/production_profile.md`
- Operator API workflows: `docs/operator_api_workflows.md`
- NGINX mTLS proxy: `deploy/nginx/README.md`
- Linux systemd services: `deploy/systemd/README.md`
- Windows agent service: `deploy/windows/README.md`
- macOS launchd agent: `deploy/macos/README.md`
## Testing

- Validate the control-plane environment with `make server-preflight ENV_FILE=.env` or `python3 -m server.preflight --env-file .env`.
- Validate reverse-proxy deployments with `make server-preflight-proxy ENV_FILE=.env` or `python3 -m server.preflight --env-file .env --expect-proxy-cert-headers`.
- Smoke-check an NGINX proxy deployment with `make nginx-proxy-smoke NGINX_PROXY_SMOKE_ARGS='--base-url https://sentinel.example.com --ca-file /path/to/proxy-ca.pem'`.
- Smoke-check a Linux control-plane host with `make systemd-smoke` or `python3 deploy/systemd/smoke_test_services.py --base-url http://127.0.0.1:8000`.
- Rootless Linux hosts can deploy the control plane with `python3 deploy/systemd/install_services.py --user-mode ...` and then verify it with `python3 deploy/systemd/smoke_test_services.py --user-mode`.
- Preview or render a macOS launchd plist with `make macos-launchd-preview` or `make macos-launchd-render`, smoke-check a loaded macOS agent service with `make macos-launchd-smoke`, and remove it cleanly with `make macos-launchd-uninstall`.
- Validate a Windows agent config before service installation with `.\deploy\windows\preflight-agent-config.ps1` from PowerShell, or see `deploy/windows/README.md` for the full Windows service flow.
- Build release-style device-agent bundles for Linux, macOS, and Windows with `make build-agent-bundles` or `python3 deploy/build_agent_bundles.py`.
- Render the browser-friendly docs microsite with `make build-docs-site` or `python3 deploy/build_docs_site.py`.
- Run the curated suite in stable module-sized chunks with `make test` or `python3 -m tests.run_suite`.
- Run a faster smoke pass with `make test-fast`.
- Refresh the cached security-flow schema and then run the full suite with `make test-ci` or `python3 -m tests.run_suite --refresh-security-flow-schema`.
- Run an individual named shard with `make test-shard SHARD=agent-core`.
- Regenerate the cached security-flow schema fixture with `make refresh-schema` or `python3 -m tests.regenerate_security_flow_schema` after model or migration changes.
- List the named CI shards with `make list-shards` or `python3 -m tests.run_suite --list-shards`.
## Quick start

1. Create a virtual environment and install dependencies.
2. Copy `.env.example` to `.env` and adjust values.
3. Generate development certificates with `deploy/certs/generate_dev_certs.sh`.
4. Create an admin password hash:
```bash
python3 -c "from werkzeug.security import generate_password_hash; print(generate_password_hash('change-me'))"
```
Optional: configure multiple admin accounts with `SENTINEL_ADMIN_USERS_JSON`, for example:
```json
{
"admin": {"password_hash":"<hash>","roles":["admin"]},
"operator1": {"password_hash":"<hash>","roles":["viewer","operator"]},
"reviewer2": {"password_hash":"<hash>","roles":["viewer","approver"]},
"security1": {"password_hash":"<hash>","roles":["viewer","security"]},
"mfaadmin": {"password_hash":"<hash>","roles":["viewer","approver"],"mfa_totp_secret":"<base32-secret>"},
"passkeyadmin": {"password_hash":"<hash>","roles":["viewer","approver"],"passkey_required":true}
}
```
You can generate a TOTP secret with:
```bash
python3 -c "import base64, secrets; print(base64.b32encode(secrets.token_bytes(20)).decode().rstrip('='))"
```
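For reference, a code for a secret like that can be computed with only the standard library. This is a generic RFC 6238 sketch for checking your secret against an authenticator app, not Sentinel's verification code; the 6-digit/30-second defaults are the common TOTP convention, which the `SENTINEL_ADMIN_TOTP_*` settings can change:

```python
import base64
import hashlib
import hmac
import struct
import time


def totp(secret_b32, at_time=None, digits=6, period=30):
    """Generic RFC 6238 TOTP over HMAC-SHA1; illustrative only."""
    # Re-pad the base32 secret (generators often strip the '=' padding).
    pad = "=" * (-len(secret_b32) % 8)
    key = base64.b32decode(secret_b32.upper() + pad)
    counter = int((time.time() if at_time is None else at_time) // period)
    digest = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F
    code = (int.from_bytes(digest[offset:offset + 4], "big") & 0x7FFFFFFF) % (10 ** digits)
    return str(code).zfill(digits)


# RFC 6238 test secret (ASCII "12345678901234567890" in base32) at T=59.
print(totp("GEZDGNBVGY3TQOJQGEZDGNBVGY3TQOJQ", at_time=59))  # prints 287082
```

The printed value matches the RFC 6238 Appendix B test vector truncated to six digits, which is a quick way to sanity-check any TOTP implementation.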
5. Generate local development secrets and a data-protection key, then paste the printed values into `.env`:
```bash
python3 - <<'PY'
from cryptography.fernet import Fernet
import secrets
print("SENTINEL_SECRET_KEY=" + secrets.token_urlsafe(48))
print("SENTINEL_JWT_SECRET=" + secrets.token_urlsafe(48))
print('SENTINEL_DATA_ENCRYPTION_KEYS_JSON={"v1":"' + Fernet.generate_key().decode() + '"}')
print("SENTINEL_DATA_ACTIVE_KEY_ID=v1")
print("SENTINEL_MONITORING_BEARER_TOKEN=" + secrets.token_urlsafe(32))
PY
```
`make server-preflight ENV_FILE=.env` will fail until those values stop using placeholders and at least one active data-protection key is configured.
6. Apply database migrations before starting the server on updated code:
```bash
make db-upgrade
```
This is required for local SQLite too. Newer health, metrics, recovery, scheduler, and integration paths expect the latest Alembic revision.
If you have an older dev SQLite database from before the Alembic workflow was introduced, back it up and let Sentinel create a fresh migrated database rather than trying to reuse the stale file in place.
7. Start the development mTLS server with:
```bash
export SENTINEL_TLS_CERT="$(pwd)/deploy/certs/out/server.crt"
export SENTINEL_TLS_KEY="$(pwd)/deploy/certs/out/server.key"
export SENTINEL_TLS_CA="$(pwd)/deploy/certs/out/ca.crt"
python3 -m server.dev_tls
```
Keep `SENTINEL_TRUST_PROXY_CERT_HEADERS=false` in `.env` for this direct local TLS path.
If you also want to validate the provided NGINX proxy flow locally, copy `.env` to a separate env file such as `.env.proxy`, set `SENTINEL_TRUST_PROXY_CERT_HEADERS=true` there, and run proxy-specific checks against that file.
8. Copy `agent/config.example.json` to `agent/config.json`, or start from `agent/config.windows.example.json` / `agent/config.macos.example.json` on those platforms, update paths if needed, and run:
```bash
python3 -m agent.agent_secure
```
The sample config uses `__CURRENT_PYTHON__` for the TPM collector and updater hook commands so those helper scripts run with the same interpreter as the agent on Linux, Windows, or macOS.
It intentionally leaves `attestation_private_key_path`, `attestation_public_key_path`, and `update_signing_public_key_path` empty so the basic agent preflight passes before you enable signed attestation or signed update manifests.
If you want device attestation enabled, generate an Ed25519 attestation keypair for the agent, point those two config fields at the generated files, and set the PCR baselines on the server:
```bash
openssl genpkey -algorithm Ed25519 -out deploy/certs/out/device_attest.key
openssl pkey -in deploy/certs/out/device_attest.key -pubout -out deploy/certs/out/device_attest.pub
```
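Assuming the project's existing `cryptography` dependency, signing and verifying an evidence payload with such a keypair looks roughly like this. It is an illustrative sketch only: the key is generated in memory rather than loaded from the files above, and the evidence payload shape is a made-up placeholder, not the agent's real format:

```python
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Illustrative only: generate in-memory instead of loading
# deploy/certs/out/device_attest.key from disk.
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

# Placeholder evidence payload; the agent defines the real structure.
evidence = b'{"pcrs": {"0": "..."}, "nonce": "server-issued-nonce"}'
signature = private_key.sign(evidence)

# Verification raises InvalidSignature if evidence or signature is tampered.
public_key.verify(signature, evidence)

# The PEM form mirrors what the openssl commands above write to disk.
pem = private_key.private_bytes(
    serialization.Encoding.PEM,
    serialization.PrivateFormat.PKCS8,
    serialization.NoEncryption(),
)
```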
Optional: sign a policy bundle for command governance:
```bash
openssl genpkey -algorithm Ed25519 -out deploy/policy/policy_signing.key
openssl pkey -in deploy/policy/policy_signing.key -pubout -out deploy/policy/policy_signing.pub
python3 deploy/policy/sign_policy_bundle.py deploy/policy/example_policy.json deploy/policy/policy_signing.key
```
Optional: sign an agent update manifest for the staged updater path and set `update_signing_public_key_path` to `deploy/updates/update_signing.pub`:
```bash
openssl genpkey -algorithm Ed25519 -out deploy/updates/update_signing.key
openssl pkey -in deploy/updates/update_signing.key -pubout -out deploy/updates/update_signing.pub
python3 deploy/updates/sign_update_manifest.py deploy/updates/example_manifest.json deploy/updates/update_signing.key
```
The agent config also includes reference local hook commands in `deploy/updates/example_installer.py` and
`deploy/updates/example_rollback.py`. Replace those with your package manager, software distribution, or
OS-native installer workflow before production use.
9. Open `https://localhost:8443` for the landing page, then use `https://localhost:8443/dashboard` for the operator console and sign in with the admin credentials you configured.
10. After a bootstrap password login, you can enroll a passkey from the dashboard and then use `Use Passkey` for phishing-resistant sign-in.
## Production notes
- Terminate TLS and enforce client certificates at a reverse proxy such as NGINX.
- Allow normal browser access for the admin dashboard, but require a trusted client certificate for `/api/heartbeat`.
- Only allow the Flask app to trust client-certificate headers from that proxy.
- Issue device certificates with a common name that matches the device's `device_id`.
- On the first accepted heartbeat, the backend pins the presented certificate fingerprint to that device.
- If `SENTINEL_REQUIRE_NEW_DEVICE_APPROVAL=true`, newly seen devices stay pending until an admin reviews the identity request.
- Certificate rotations create a pending review request that an admin can approve or reject from the dashboard or `POST /api/devices/<device_id>/identity-review`.
- `SENTINEL_IDENTITY_REVIEW_TTL_SECONDS` controls how long enrollment and rotation reviews stay pending before they expire.
- `SENTINEL_SENSITIVE_TAGS` and `SENTINEL_SENSITIVE_IDENTITY_APPROVALS` let you require distinct approvers for tagged devices.
- `SENTINEL_CERT_EXPIRING_SOON_DAYS` marks certificates that are approaching expiry in the dashboard and API.
- `SENTINEL_ROTATION_RECOMMENDED_DAYS` and `SENTINEL_ROTATION_REQUIRED_DAYS` derive device rotation urgency from the real certificate expiry date.
- `POST /api/devices/<device_id>/certificate/revoke` revokes the currently bound certificate and prevents further heartbeats from that certificate fingerprint.
- `SENTINEL_DEVICE_CERTIFICATE_CRL_PATHS` accepts one or more PEM/DER CRL files for upstream CA revocation enforcement on device certificates.
- When CRL checking is enabled, Sentinel expects the client-certificate serial number to be forwarded from TLS termination. The dev TLS entrypoint now provides that automatically, and the provided NGINX config forwards `X-SSL-Client-Serial` and `X-SSL-Client-Issuer-DN`.
- CRL validation is fail-closed: if a configured CRL is unavailable, stale, or the certificate serial number is missing, Sentinel rejects the heartbeat rather than silently skipping revocation checks.
- `SENTINEL_DEVICE_CERTIFICATE_OCSP_REQUIRED=true` enables fail-closed OCSP checking on the device mTLS path. Configure `SENTINEL_DEVICE_CERTIFICATE_ISSUER_PATH` with the issuing CA certificate; without it, Sentinel cannot validate OCSP responses.
- `SENTINEL_DEVICE_CERTIFICATE_OCSP_URL` overrides the responder URL. If unset, Sentinel uses the device certificate's AIA OCSP URL. The TLS layer must forward the full client certificate PEM so the backend can build and verify the OCSP request.
- `SENTINEL_DEVICE_CERTIFICATE_OCSP_TIMEOUT_SECONDS` and `SENTINEL_DEVICE_CERTIFICATE_OCSP_MAX_AGE_SECONDS` control responder timeout and freshness policy. OCSP validation is fail-closed: unknown, stale, invalid, or unavailable responses cause Sentinel to reject the heartbeat.
- Use `SENTINEL_ADMIN_USERS_JSON` when you need more than one admin identity for approval separation or role separation; the legacy single-admin username/hash fields still work.
- Built-in roles are `viewer`, `operator`, `approver`, `security`, and `admin`. You can also attach explicit `permissions` arrays per user for exceptions.
- Structured admin accounts can include `mfa_totp_secret` to require a TOTP code at login.
- Structured admin accounts can also include `passkey_required: true`. Once that account has at least one enrolled passkey, password-only login is refused and the operator must use the WebAuthn sign-in flow.
- `SENTINEL_ADMIN_TOTP_PERIOD_SECONDS`, `SENTINEL_ADMIN_TOTP_DIGITS`, and `SENTINEL_ADMIN_TOTP_WINDOW` tune the accepted TOTP policy.
- `SENTINEL_WEBAUTHN_RP_ID`, `SENTINEL_WEBAUTHN_RP_NAME`, `SENTINEL_WEBAUTHN_ALLOWED_ORIGINS`, `SENTINEL_WEBAUTHN_TIMEOUT_MS`, `SENTINEL_WEBAUTHN_CHALLENGE_TTL_SECONDS`, and `SENTINEL_WEBAUTHN_USER_VERIFICATION` control WebAuthn/passkey policy.
- `SENTINEL_ADMIN_LOGIN_FAILURE_WINDOW_SECONDS`, `SENTINEL_ADMIN_LOGIN_MAX_FAILURES`, and `SENTINEL_ADMIN_LOGIN_LOCKOUT_SECONDS` control persistent login throttling by username and source IP.
- `SENTINEL_ADMIN_SESSION_IDLE_TIMEOUT_SECONDS` expires inactive admin sessions before their JWT lifetime ends.
- `SENTINEL_ADMIN_MAX_ACTIVE_SESSIONS_PER_USER` caps concurrent active sessions and automatically prunes older ones on new login.
- `SENTINEL_REQUIRE_HEARTBEAT_COUNTERS` requires a positive monotonic `heartbeat_counter` on every device heartbeat.
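The anti-replay rule is simple to state: each accepted heartbeat must carry a positive counter strictly greater than the last accepted one. A minimal sketch of that check (illustrative only; the backend enforces this per device against persisted state):

```python
def accept_heartbeat(last_counter, reported_counter):
    """Illustrative anti-replay check, not the backend's actual code.

    Counters must be positive integers and strictly increasing per device;
    anything else is treated as a replay or rollback and rejected.
    """
    if not isinstance(reported_counter, int) or reported_counter <= 0:
        return False
    if last_counter is not None and reported_counter <= last_counter:
        return False  # replayed or rolled-back counter
    return True


assert accept_heartbeat(None, 1)      # first heartbeat
assert accept_heartbeat(41, 42)       # normal increment
assert not accept_heartbeat(42, 42)   # replay
assert not accept_heartbeat(42, 7)    # rollback
```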
- `SENTINEL_ATTESTATION_REQUIRED` requires continuous attestation evidence on the device channel. On the first heartbeat the backend issues a nonce challenge and advertises the configured provider in the response.
- `SENTINEL_ATTESTATION_PROVIDER=signed_measurement_v1` keeps the original Ed25519 signed-measurement path.
- `SENTINEL_ATTESTATION_PROVIDER=tpm_quote_v1` switches the backend to TPM quote verification mode. In that mode the server passes the attestation bundle to `SENTINEL_TPM_QUOTE_VERIFIER_COMMAND`, which should validate the quote and return a JSON verdict.
- `SENTINEL_ATTESTATION_CHALLENGE_TTL_SECONDS`, `SENTINEL_ATTESTATION_MAX_EVIDENCE_AGE_SECONDS`, and `SENTINEL_ATTESTATION_MAX_CLOCK_SKEW_SECONDS` control attestation freshness policy.
- `SENTINEL_ATTESTATION_REQUIRE_HARDWARE_BINDING=true` tells the TPM verifier path to require a stable hardware binding identifier, such as an endorsement-key fingerprint or another verifier-defined binding id.
- `SENTINEL_ATTESTATION_ALLOW_KEY_ROTATION_WITH_STABLE_BINDING=true` allows an attestation key to rotate without breaking trust as long as the verifier proves the same underlying hardware binding. If the binding changes, Sentinel rejects the heartbeat and records a hardware-identity mismatch event.
- `SENTINEL_ATTESTATION_REQUIRE_ENDORSEMENT_CHAIN=true` requires TPM attestations to include an endorsement certificate chain anchored in `SENTINEL_ATTESTATION_TRUSTED_ENDORSEMENT_ROOT_PATHS`. Sentinel validates that chain server-side and can derive the stable hardware binding from the EK certificate fingerprint.
- `SENTINEL_ATTESTATION_ALLOWED_TPM_MANUFACTURERS` optionally allowlists TPM manufacturers. Sentinel compares that policy against the reported `tpm_manufacturer` and, when an endorsement chain is present, the EK certificate subject organization.
- `SENTINEL_ATTESTATION_PCR_BASELINES_JSON` defines the approved PCR values that the signed measurement payload must match. A mismatch moves the device out of the trusted tier and rejects the heartbeat.
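The comparison semantics amount to: every baselined PCR must be present in the signed measurement and match exactly. The sketch below assumes a simple `index -> hex digest` JSON shape for illustration; the server defines the actual schema of `SENTINEL_ATTESTATION_PCR_BASELINES_JSON`:

```python
import json

# Hypothetical baseline shape (PCR index -> expected hex digest); the real
# schema is defined by the server, not this README.
baselines = json.loads('{"0": "a1b2", "7": "c3d4"}')


def pcrs_match(baseline, reported):
    # Every baselined PCR must be reported and match, case-insensitively.
    # Extra reported PCRs outside the baseline are ignored here.
    return all(
        str(reported.get(index, "")).lower() == value.lower()
        for index, value in baseline.items()
    )


assert pcrs_match(baselines, {"0": "A1B2", "7": "c3d4", "9": "ffff"})
assert not pcrs_match(baselines, {"0": "a1b2"})  # missing PCR 7 -> reject
```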
- `SENTINEL_TPM_QUOTE_VERIFIER_TIMEOUT_SECONDS` controls the verifier subprocess timeout. Reference adapter scripts live in `deploy/tpm/` and are meant to be replaced with real `tpm2-tools` wrappers or an enterprise verifier.
- The TPM verifier contract can now return `hardware_identity`, including a stable `binding_id` or `endorsement_key_id`. It may also forward an `endorsement_chain` so Sentinel can apply hardware-root trust policy itself. Sentinel persists the resulting binding and exposes it in the dashboard and device API so operators can spot unexpected TPM identity changes.
- The device dashboard now exposes `trust_tier` and attestation state so you can distinguish merely observed devices from cryptographically verified ones.
- `SENTINEL_POLICY_BUNDLE_PATH` points to a JSON policy bundle, `SENTINEL_POLICY_SIGNING_PUBLIC_KEY_PATH` points to the Ed25519 public key used to verify it, and `SENTINEL_POLICY_REQUIRE_VERIFIED_BUNDLE=true` forces the platform to fall back to the built-in baseline if that bundle is missing or unverifiable.
- `GET /api/policy/status` shows the active policy version, source, verification state, and effective command rules.
- Command authorization is now policy-driven: the active bundle can require a minimum `trust_tier`, require verified attestation, and force review approvals per action before a command is queued or dispatched.
- `POST /api/policy/exceptions` creates a narrowly scoped break-glass request for a specific `device_id` and `action` when policy blocks that action because of trust tier or attestation state. These exceptions are time-bound and require explicit review through `POST /api/policy/exceptions/<exception_id>/review`.
- `SENTINEL_POLICY_EXCEPTION_REVIEW_TTL_SECONDS`, `SENTINEL_POLICY_EXCEPTION_MAX_DURATION_SECONDS`, `SENTINEL_POLICY_EXCEPTION_APPROVALS`, `SENTINEL_SENSITIVE_POLICY_EXCEPTION_APPROVALS`, and `SENTINEL_POLICY_EXCEPTION_REQUIRE_SEPARATE_APPROVER` control exception governance.
- The agent now supports a three-step signed updater path. `stage_update` stores a signed manifest locally after verifying its Ed25519 signature. `apply_update` downloads the artifact, verifies its SHA-256 digest, and then invokes the configured local installer hook. `rollback_update` invokes the configured rollback hook when the installer reported that rollback is available.
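The digest step in `apply_update` can be pictured as a streamed SHA-256 comparison. This is a hypothetical helper, not the agent's actual implementation:

```python
import hashlib
import hmac


def artifact_digest_ok(path, expected_sha256_hex):
    # Stream the downloaded artifact so large updates are not read into memory.
    digest = hashlib.sha256()
    with open(path, "rb") as artifact:
        for chunk in iter(lambda: artifact.read(65536), b""):
            digest.update(chunk)
    # Constant-time compare, tolerant of upper/lowercase hex in the manifest.
    return hmac.compare_digest(digest.hexdigest(), expected_sha256_hex.lower())
```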
- `update_download_dir`, `update_apply_command`, `update_rollback_command`, `update_download_timeout_seconds`, `update_apply_timeout_seconds`, and `update_rollback_timeout_seconds` live in `agent/config.json` and control the local updater contract.
- The shipped sample agent config uses `__CURRENT_PYTHON__` in those helper command arrays so Python-based hooks run with the agent's current interpreter instead of assuming `python3` exists on the host.
- The reference updater scripts in `deploy/updates/example_installer.py` and `deploy/updates/example_rollback.py` demonstrate the JSON stdin/stdout contract. In production, point those commands at your signed package deployment tool, endpoint management agent, or OS-native installer wrapper.
- The agent also supports `attestation_provider: "tpm_quote_v1"` plus `tpm_quote_command` in `agent/config.json`. That command receives a JSON request on stdin and should emit a TPM quote bundle on stdout. The reference script in `deploy/tpm/example_quote_collector.py` shows the expected contract.
- Device command results can now be ingested as encrypted artifacts. `collect_diagnostics` returns a structured diagnostics snapshot from the agent, which the backend encrypts at rest before storing.
- `SENTINEL_DATA_ENCRYPTION_KEYS_JSON` defines the available Fernet keys, `SENTINEL_DATA_ACTIVE_KEY_ID` selects the active write key, and `SENTINEL_DATA_RETENTION_DAYS` controls artifact retention.
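Assuming the project's `cryptography` dependency, that keyed-map shape corresponds naturally to `MultiFernet`: decrypt with any listed key, encrypt with the active one. A hedged sketch of the rotation idea, not the server's actual code:

```python
from cryptography.fernet import Fernet, MultiFernet

# Illustrative key map in the SENTINEL_DATA_ENCRYPTION_KEYS_JSON shape.
keys = {"v1": Fernet.generate_key().decode(), "v2": Fernet.generate_key().decode()}
active_key_id = "v2"  # plays the role of SENTINEL_DATA_ACTIVE_KEY_ID

# Active key first, so MultiFernet encrypts with it; older keys still decrypt.
ordered = [Fernet(keys[active_key_id])] + [
    Fernet(value) for key_id, value in keys.items() if key_id != active_key_id
]
codec = MultiFernet(ordered)

token = Fernet(keys["v1"]).encrypt(b"old artifact")  # written under the old key
assert codec.decrypt(token) == b"old artifact"        # still readable
rotated = codec.rotate(token)                         # re-encrypted under v2
```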
- `GET /api/artifacts` and `GET /api/artifacts/<artifact_id>` expose encrypted artifact metadata and decrypted payloads to operators with `artifact.read`. In the default role map that means `security` and `admin`.
- `SENTINEL_WORKLOAD_CLIENTS_JSON` defines machine clients for downstream integrations. Those clients authenticate with `Authorization: Bearer <token>` plus `X-Sentinel-Workload-Id`.
- Workload clients can also enforce signed inbound requests. Set `request_hmac_secret` and `require_signed_requests=true` on a client entry, then send `X-Sentinel-Workload-Timestamp` plus `X-Sentinel-Workload-Signature: sha256=<hex>`, where the HMAC input is `client_id + "\n" + method + "\n" + path + "\n" + timestamp + "\n" + sha256(body)`. `SENTINEL_WORKLOAD_SIGNATURE_MAX_AGE_SECONDS` defines the freshness window.
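That signing recipe can be reproduced client-side in a few lines. This is a hypothetical client helper; the field order follows the documented formula, and details like HTTP-method casing must match whatever the server canonicalizes:

```python
import hashlib
import hmac
import time


def sign_workload_request(client_id, secret, method, path, body, timestamp=None):
    # HMAC input per the documented recipe:
    #   client_id \n method \n path \n timestamp \n sha256(body)
    ts = str(int(time.time()) if timestamp is None else timestamp)
    body_digest = hashlib.sha256(body).hexdigest()
    message = "\n".join([client_id, method, path, ts, body_digest])
    mac = hmac.new(secret.encode(), message.encode(), hashlib.sha256).hexdigest()
    return {
        "X-Sentinel-Workload-Timestamp": ts,
        "X-Sentinel-Workload-Signature": "sha256=" + mac,
    }
```

Send these headers together with the usual `Authorization: Bearer <token>` and `X-Sentinel-Workload-Id` headers.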
- Workload clients can now move beyond shared bearer tokens entirely. Set `require_workload_assertion=true` on a client entry and Sentinel will reject the static token path for that client, accepting only short-lived Sentinel-signed workload assertions on inbound workload routes.
- `POST /api/integrations/workload-assertions/<client_id>` issues a short-lived workload assertion for a named client, `GET /api/integrations/workload-assertions` lists recent issued assertions, and `POST /api/integrations/workload-assertions/<assertion_id>/revoke` lets operators revoke one before expiry.
- `SENTINEL_WORKLOAD_ASSERTION_MAX_LIFETIME_SECONDS` caps any per-client assertion lifetime override, `SENTINEL_WORKLOAD_ASSERTION_REQUIRE_TRACKED_ISSUANCE` forces inbound assertions to have been minted and recorded by Sentinel first, `SENTINEL_WORKLOAD_ASSERTION_ONE_TIME_USE` rejects a second use of the same assertion, and `SENTINEL_WORKLOAD_ASSERTION_BIND_SOURCE_IP` binds the assertion to the first source IP that presents it.
- Workload clients can also require inbound mTLS. Set `require_client_certificate=true` and optionally pin `required_certificate_common_name`, `required_certificate_issuer`, and `allowed_certificate_fingerprints` on a client entry. Sentinel will then require a verified client certificate on workload-authenticated routes before it accepts the request.
- `GET /api/integrations/trust-feed` exposes a machine-readable device trust snapshot for SIEM/NAC-style consumers, and `GET /api/integrations/events` exposes recent audit events to workload clients with the right permissions.
- `POST /api/integrations/webhooks` registers an HTTPS webhook destination for outbound trust events, `GET /api/integrations/webhooks` and `GET /api/integrations/deliveries` expose queue health, and `POST /api/integrations/webhooks/flush` processes the current retry queue.
- `POST /api/integrations/splunk-hec` registers a typed Splunk HEC connector that uses the same durable queue, signatures, retries, and audit trail as generic webhooks.
- `POST /api/integrations/slack-webhooks` registers a typed Slack incoming-webhook connector for operational notifications. By default it subscribes to `audit.integration.microsoft_connector_unhealthy`, `audit.integration.microsoft_connector_suppressed`, `audit.integration.microsoft_connector_recovered`, `audit.operations.background_service_unhealthy`, `audit.operations.background_service_recovered`, and `audit.operations.maintenance_worker_leader_failover`, then posts a human-readable message with `text` and `blocks` to the Slack webhook URL you provide.
- `POST /api/integrations/pagerduty-events` registers a typed PagerDuty Events API v2 connector for operational notifications. Its default policy is now escalation-oriented: it subscribes to `audit.integration.microsoft_connector_escalated`, `audit.integration.microsoft_connector_suppression_escalated`, `audit.integration.microsoft_connector_recovered`, `audit.operations.background_service_unhealthy`, `audit.operations.background_service_recovered`, and `audit.operations.maintenance_worker_leader_failover`, so PagerDuty can page on sustained Microsoft incidents, required background-service outages, and maintenance failovers.
- Slack, PagerDuty, and any other audit-subscribing connector can now carry alert-routing policy inside `delivery_config`: `minimum_severity` (`info|low|medium|high|critical`), `connector_kinds` (for example `["microsoft_sentinel"]`), and `include_suppressed` (`true` by default). Those filters are applied before Sentinel queues an audit delivery, so low-signal transitions never enter the outbox for connectors that do not want them.
- Audit-subscribing connectors can also use `incident_policy_template` for opinionated defaults. `chatops_immediate` is tuned for human-readable chat notifications plus maintenance failovers and background-service health changes, `paging_escalation` is tuned for sustained incidents, recovery, background-service outages, and maintenance failovers, and `siem_comprehensive` subscribes to Microsoft connector states plus maintenance and background-service transitions for archival or correlation sinks. Explicit `subscribed_events`, `minimum_severity`, `connector_kinds`, and `include_suppressed` fields still override the template when you need a custom variant.
- `GET /api/integrations/incident-policy-templates` lists the built-in and custom incident-routing templates Sentinel knows about, and `POST` plus `PATCH /api/integrations/incident-policy-templates/<template_id>` let operators create and update custom templates without editing code. Custom templates are versioned, audited, and can be referenced from any audit-subscribing connector through `incident_policy_template`.
- When a connector is created from a template without explicit event/routing overrides, Sentinel marks those fields as template-managed. That means later template updates can roll out automatically to existing connectors for subscriptions and alert-routing behavior, while explicit per-connector overrides still pin that connector to its own local policy.
- `GET /api/integrations/incident-policy-templates/bundle-status` shows whether Sentinel currently has signing and verification material for incident-policy bundle export/import, plus receipt-signing readiness for reviewed promotions.
- The same status endpoint now exposes verification policy: trusted issuers, maximum accepted bundle age, and allowed future clock skew for incoming bundles.
- `GET /api/integrations/incident-policy-templates/export` returns a signed incident-policy bundle envelope with `payload` plus `signature`. Use `include_builtin`, `include_disabled`, `template_ids=...`, and `target_environments=prod,dr` to scope what gets exported and where it is allowed to land.
- `POST /api/integrations/incident-policy-templates/import` verifies a signed bundle and upserts custom templates from it. Add `?dry_run=true` or `{"dry_run": true}` to preview `create`, `update`, `unchanged`, and `skipped_builtin` actions plus projected versions before you change live routing. When direct import is disabled, the same endpoint stays available for preview but rejects live apply and points operators to the reviewed import-request path.
- `GET /api/integrations/incident-policy-templates/import-requests` lists pending and completed reviewed template-import requests, and `POST /api/integrations/incident-policy-templates/import-requests` submits a signed bundle for governed promotion instead of immediate application.
- `POST /api/integrations/incident-policy-templates/import-requests/<request_id>/review` records an approval or rejection. Once the required number of distinct approvals is reached, Sentinel applies the bundle immediately and stores the final apply summary on the request record.
- `POST /api/integrations/incident-policy-templates/import-requests/<request_id>/refresh` recomputes the reviewed preview from current state, clears stale approvals, and restarts the review TTL. This is the supported recovery path after drift detection.
- `POST /api/integrations/incident-policy-templates/import-requests/<request_id>/cancel` withdraws a pending reviewed import, records the cancellation reason in the audit trail, and frees the same bundle hash to be submitted again later.
- `GET /api/integrations/incident-policy-templates/import-requests/<request_id>/receipt` exports the signed promotion receipt for an applied reviewed import. The receipt includes bundle provenance, approvals, and the exact apply summary used at promotion time.
- Reviewed imports now fail closed on preview drift by default: if the current import result no longer matches the stored reviewed preview at final approval time, Sentinel returns a drift report instead of silently applying changed state.
- `SENTINEL_INCIDENT_POLICY_TEMPLATE_SIGNING_PRIVATE_KEY_PATH`, `SENTINEL_INCIDENT_POLICY_TEMPLATE_SIGNING_PUBLIC_KEY_PATH`, `SENTINEL_INCIDENT_POLICY_TEMPLATE_SIGNING_COMMAND`, `SENTINEL_INCIDENT_POLICY_TEMPLATE_SIGNING_TIMEOUT_SECONDS`, and `SENTINEL_INCIDENT_POLICY_TEMPLATE_SIGNING_ISSUER` control bundle export/import provenance. The same Ed25519 key pair can be shared across environments or replaced with an external signing command.
- `SENTINEL_INCIDENT_POLICY_TEMPLATE_ENVIRONMENT_ID` labels the current deployment environment, and `SENTINEL_INCIDENT_POLICY_TEMPLATE_REQUIRE_TARGET_ENVIRONMENT=true` makes Sentinel reject incoming bundles that do not explicitly name this environment in `target_environments`.
- `SENTINEL_INCIDENT_POLICY_TEMPLATE_ALLOWED_ISSUERS`, `SENTINEL_INCIDENT_POLICY_TEMPLATE_MAX_BUNDLE_AGE_SECONDS`, and `SENTINEL_INCIDENT_POLICY_TEMPLATE_MAX_CLOCK_SKEW_SECONDS` let you pin trusted promotion issuers and reject stale or future-dated bundles before review or import.
- `SENTINEL_INCIDENT_POLICY_TEMPLATE_DIRECT_IMPORT_ENABLED=false` turns off direct apply on the import endpoint for higher-assurance environments, while still allowing `dry_run` preview and the reviewed promotion flow through import requests.
- Exported bundles can now carry both `source_environment` and `target_environments`, and the offline helper in `deploy/incident_policies/sign_incident_policy_template_bundle.py` supports `--source-environment` and `--target-environments` so offline promotion workflows keep the same environment-scoping rules as the API path.
- `SENTINEL_INCIDENT_POLICY_TEMPLATE_IMPORT_REVIEW_TTL_SECONDS`, `SENTINEL_INCIDENT_POLICY_TEMPLATE_IMPORT_REQUIRED_APPROVALS`, and `SENTINEL_INCIDENT_POLICY_TEMPLATE_IMPORT_REQUIRE_SEPARATE_APPROVER` control how long a promotion request stays reviewable, how many approvals it needs, and whether the requester is blocked from self-approval.
- `SENTINEL_INCIDENT_POLICY_TEMPLATE_IMPORT_REQUIRE_PREVIEW_MATCH=true` makes final approval re-check the live import plan against the stored reviewed preview. If the underlying template state drifted, Sentinel blocks apply and returns both the expected and current plans for operator review.
- Reviewed import requests are now replay-resistant at the queue level: Sentinel rejects a second request for the same verified bundle hash while an earlier request is still pending or has already been applied.
- `deploy/incident_policies/sign_incident_policy_template_bundle.py` wraps an unsigned bundle payload in the same signed envelope shape Sentinel exports, and `deploy/incident_policies/example_incident_policy_bundle_payload.json` defines the expected payload structure for offline promotion between environments.
- `POST /api/integrations/microsoft-sentinel` registers a typed Microsoft Sentinel / Azure Monitor Logs Ingestion connector. It uses a Microsoft Entra application client ID and secret plus a DCR immutable ID and stream name, acquires a bearer token for `https://monitor.azure.com/.default`, and sends normalized Sentinel Secure X trust records to `{endpoint}/dataCollectionRules/{dcr}/streams/{stream}?api-version=2023-01-01`.
- `POST /api/integrations/entra-group-sync` registers a typed Microsoft Entra group-sync connector. It uses Microsoft Graph app-only credentials to move a device's Entra directory object into trust-tier-mapped groups such as `trusted_group_id`, `restricted_group_id`, and `blocked_group_id`, which lets existing Conditional Access policies key off Sentinel trust state.
- Entra group sync now supports Microsoft Graph batch mode. By default Sentinel will batch add/remove membership operations through `POST /$batch` after it checks current group membership, which reduces API chatter when a trust change requires multiple group updates. Set `graph_batch_enabled=false` on the connector if you need the simpler direct-call path instead.
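The batch path above folds multiple membership changes into one Microsoft Graph round trip. A minimal sketch of the `$batch` payload shape, using the documented Graph JSON-batching format with hypothetical group and device object IDs (the real connector derives these from the trust-tier mapping after checking current membership):

```python
import json

def build_membership_batch(device_object_id, add_group_ids, remove_group_ids):
    """Build a Microsoft Graph JSON-batch body that adds a device to some
    groups and removes it from others in a single POST /$batch call."""
    requests = []
    for i, group_id in enumerate(add_group_ids, start=1):
        requests.append({
            "id": str(i),
            "method": "POST",
            "url": f"/groups/{group_id}/members/$ref",
            "headers": {"Content-Type": "application/json"},
            "body": {
                "@odata.id": f"https://graph.microsoft.com/v1.0/directoryObjects/{device_object_id}"
            },
        })
    offset = len(requests)
    for i, group_id in enumerate(remove_group_ids, start=1):
        requests.append({
            "id": str(offset + i),
            "method": "DELETE",
            "url": f"/groups/{group_id}/members/{device_object_id}/$ref",
        })
    return {"requests": requests}

# Move a device into the trusted group and out of the restricted one.
batch = build_membership_batch("dev-obj-123", ["trusted-group"], ["restricted-group"])
print(json.dumps(batch, indent=2))
```

Each sub-request carries its own `id`, so a partial batch failure can be retried per-operation rather than re-running the whole trust transition.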
- Microsoft-aware delivery paths now reuse Entra and Azure Monitor access tokens until shortly before expiry, and Sentinel honors `Retry-After` when Microsoft APIs throttle a delivery so the outbox backs off instead of immediately retrying.
- `GET /api/integrations/webhooks` now includes a Microsoft delivery observability summary: live token-cache status from the integration worker heartbeat when available, plus active Microsoft delivery failures classified as `auth`, `throttled`, `network`, `identity`, `config`, or `upstream`.
- Microsoft connector health is now budget-driven. `SENTINEL_MICROSOFT_CONNECTOR_THRESHOLD_AUTH`, `..._CONFIG`, `..._IDENTITY`, `..._NETWORK`, `..._THROTTLED`, `..._UPSTREAM`, and `..._UNKNOWN` define how many consecutive failures Sentinel tolerates before a connector flips from `degraded` to `unhealthy`; `SENTINEL_MICROSOFT_CONNECTOR_QUEUE_BACKLOG_THRESHOLD`, `SENTINEL_MICROSOFT_CONNECTOR_DEAD_LETTER_THRESHOLD`, and `SENTINEL_MICROSOFT_CONNECTOR_STALE_SUCCESS_SECONDS` add queue and freshness SLOs on top.
- Sentinel can now optionally suppress deliveries to unhealthy Microsoft connectors between probe attempts. `SENTINEL_MICROSOFT_CONNECTOR_SUPPRESS_UNHEALTHY_DELIVERIES=true` enables that behavior and `SENTINEL_MICROSOFT_CONNECTOR_SUPPRESSION_SECONDS` defines the cooldown window before Sentinel allows another probe delivery through. Suppressed deliveries are re-queued without consuming retry budget.
- Sentinel can also escalate sustained Microsoft connector incidents into second-stage operational alerts. `SENTINEL_MICROSOFT_CONNECTOR_ESCALATE_UNHEALTHY_SECONDS` defines how long a connector may stay `unhealthy` before Sentinel emits `microsoft_connector_escalated`, and `SENTINEL_MICROSOFT_CONNECTOR_ESCALATE_SUPPRESSED_SECONDS` does the same for long-lived suppression via `microsoft_connector_suppression_escalated`.
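Taken together, a conservative health-budget posture for the knobs above might look like the following. The numeric values here are illustrative examples, not shipped defaults:

```shell
# Consecutive-failure budgets per classification before a connector
# flips from degraded to unhealthy; tolerate more throttling than auth errors.
export SENTINEL_MICROSOFT_CONNECTOR_THRESHOLD_AUTH=3
export SENTINEL_MICROSOFT_CONNECTOR_THRESHOLD_THROTTLED=10

# Queue and freshness SLOs layered on top of the failure budgets.
export SENTINEL_MICROSOFT_CONNECTOR_QUEUE_BACKLOG_THRESHOLD=500
export SENTINEL_MICROSOFT_CONNECTOR_STALE_SUCCESS_SECONDS=3600

# Hold traffic to unhealthy connectors, probing again every 5 minutes.
export SENTINEL_MICROSOFT_CONNECTOR_SUPPRESS_UNHEALTHY_DELIVERIES=true
export SENTINEL_MICROSOFT_CONNECTOR_SUPPRESSION_SECONDS=300

# Escalate to a second-stage alert after 30 minutes of sustained unhealth.
export SENTINEL_MICROSOFT_CONNECTOR_ESCALATE_UNHEALTHY_SECONDS=1800
```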
- `POST /api/integrations/nac-connectors` registers a NAC-style enforcement target. Sentinel emits signed `allow`, `quarantine`, or `block` decisions based on certificate state, approval state, and current trust tier so downstream network-control systems can act on device posture.
- `POST /api/integrations/idp-connectors` registers an IdP or conditional-access target. Sentinel emits signed `allow`, `step_up`, or `block` session-policy decisions so identity systems can require stronger auth or deny access when the device falls out of trust.
- `POST /api/integrations/audit-sinks` registers an append-only audit replication target. Every new audit event is mirrored into the durable integration outbox as `audit.<category>.<action>` so you can preserve evidence off-box in near real time.
- `POST /api/integrations/recovery-runners` registers a typed recovery execution target. Sentinel emits signed `recovery.job.queued` deliveries containing the queued backup or restore-drill job so an external runner can pick it up without polling the admin API.
- Splunk HEC tokens are stored encrypted with Sentinel's data-protection keys and are never returned by the integration listing API. Keep `SENTINEL_DATA_ENCRYPTION_KEYS_JSON` set to production-grade key material before using connector secrets in a real environment.
- Microsoft Sentinel client secrets are also stored encrypted and are never returned by the listing API. The target URL should be the DCR or DCE ingestion endpoint base, while the connector config stores the DCR immutable ID and stream name for the final Logs Ingestion API path.
- Microsoft Entra group-sync client secrets are also stored encrypted and are never returned by the listing API. Devices can publish `directory.entra_device_id` when they already know the Entra device object ID, or `directory.azure_ad_device_id` when they only know the Entra registered device ID. Sentinel will resolve and cache the Graph object ID before syncing group membership.
- NAC connector bearer tokens are also stored encrypted and never returned by the listing API. By default Sentinel treats `trusted` and `monitored` devices as `allow`, `observed` and `restricted` devices as `quarantine`, and revoked devices as `block`, with per-connector overrides available for those decisions.
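The default tier-to-decision mapping with per-connector overrides can be sketched as a simple lookup. The tier and decision names follow the defaults described above; the override-dict shape is illustrative, not Sentinel's actual config schema:

```python
# Default trust-tier -> NAC enforcement decision mapping.
DEFAULT_NAC_DECISIONS = {
    "trusted": "allow",
    "monitored": "allow",
    "observed": "quarantine",
    "restricted": "quarantine",
    "revoked": "block",
}

def nac_decision(trust_tier, overrides=None):
    """Resolve the enforcement decision for a device's trust tier,
    preferring a per-connector override when one is configured.
    Unknown tiers fail closed to "block"."""
    overrides = overrides or {}
    return overrides.get(trust_tier, DEFAULT_NAC_DECISIONS.get(trust_tier, "block"))

print(nac_decision("observed"))                         # quarantine by default
print(nac_decision("observed", {"observed": "allow"}))  # connector override wins
```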
- IdP connector bearer tokens are stored encrypted and never returned by the listing API. By default Sentinel treats healthy devices as `allow`, pending or merely observed devices as `step_up`, and restricted or revoked devices as `block`, with per-connector overrides available for those policies.
- Audit sink bearer tokens are also stored encrypted and never returned by the listing API. The mirrored payload includes the full audit event, sequence number, previous hash, and event hash so downstream storage can preserve the existing tamper-evident chain.
- PagerDuty routing keys are also stored encrypted and never returned by the listing API. Sentinel uses stable dedup keys such as `sentinel:microsoft-connector:<connector_webhook_id>` for Microsoft connector incidents and `sentinel:maintenance-worker:<lease_name>` for maintenance leadership alerts.
- Slack incoming webhook URLs are also stored encrypted and never returned by the listing API. Because the webhook URL itself is the shared secret, Sentinel stores only a non-sensitive base target URL in the visible connector record and uses the encrypted webhook URL at delivery time.
- PagerDuty payload severity is normalized from Sentinel alert severity, so `critical` connector failures still page urgently while lower-severity transitions can be filtered out or downgraded before delivery.
- Outbound deliveries are signed with the Ed25519 key at `SENTINEL_INTEGRATION_SIGNING_PRIVATE_KEY_PATH` and carried in an outbox model with retry/backoff controls from `SENTINEL_INTEGRATION_WEBHOOK_BATCH_SIZE`, `SENTINEL_INTEGRATION_WEBHOOK_MAX_RETRIES`, and `SENTINEL_INTEGRATION_WEBHOOK_RETRY_BACKOFF_SECONDS`.
- `SENTINEL_INTEGRATION_DELIVERY_LEASE_SECONDS` defines how long a worker may hold a delivery lease before another worker reclaims it after a crash or hang. `SENTINEL_INTEGRATION_WORKER_POLL_SECONDS` controls how often the standalone worker polls when the queue is empty.
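The lease-reclaim rule amounts to a simple age check against the lease window. A minimal sketch of that decision, assuming a hypothetical `leased_at` column on the delivery row:

```python
from datetime import datetime, timedelta, timezone

LEASE_SECONDS = 60  # stand-in for SENTINEL_INTEGRATION_DELIVERY_LEASE_SECONDS

def lease_is_reclaimable(leased_at, now=None, lease_seconds=LEASE_SECONDS):
    """A delivery row is reclaimable once its lease has outlived the window,
    so a crashed or hung worker cannot hold it forever."""
    if leased_at is None:
        return True  # never leased: free for any worker
    now = now or datetime.now(timezone.utc)
    return now - leased_at > timedelta(seconds=lease_seconds)

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
fresh = now - timedelta(seconds=30)
stale = now - timedelta(seconds=120)
print(lease_is_reclaimable(fresh, now))  # False: lease still held
print(lease_is_reclaimable(stale, now))  # True: another worker may reclaim it
```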
- `python3 -m server.worker` runs a standalone outbox worker, and `python3 -m server.worker --once` drains a single batch for smoke testing. The provided systemd unit at `deploy/systemd/sentinel-secure-x-worker.service` is the production path; the `/api/integrations/webhooks/flush` endpoint remains useful for manual operator-triggered retries.
- `python3 -m server.maintenance` runs the control-plane maintenance worker, and `python3 -m server.maintenance --once` runs a single expiration/cleanup sweep. The provided systemd unit at `deploy/systemd/sentinel-secure-x-maintenance.service` is the production path for scheduled maintenance.
- `python3 -m server.update_campaign_scheduler` runs the update-campaign scheduler, and `python3 -m server.update_campaign_scheduler --once` runs a single rollout scheduling pass. The provided systemd unit at `deploy/systemd/sentinel-secure-x-update-campaign-scheduler.service` is the production path for automatic rollout progression across approved or active campaigns.
- `python3 -m server.scheduled_job_coordinator` runs a lease-aware coordinator that can own multiple registered scheduled jobs from one loop, and `python3 -m server.scheduled_job_coordinator --once` runs a single coordination pass. The provided systemd unit at `deploy/systemd/sentinel-secure-x-scheduled-job-coordinator.service` is the production path for central job ownership when you want one singleton service to drive maintenance and rollout scheduling together.
- Registered jobs now resolve through explicit runner hooks instead of a hardcoded `if job_name == ...` dispatcher. Each job definition carries a `handler_import`, and the coordinator lazily imports the owning module to pick up the decorated runner when it needs to execute that job.
- `SENTINEL_SCHEDULED_JOB_MODULES` controls which modules may self-register scheduled jobs. That means new periodic duties can be added through module-level registration instead of editing the central scheduler catalog directly.
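The module-level registration pattern can be sketched with a registry and a decorator. Names here (`scheduled_job`, `SCHEDULED_JOB_REGISTRY`) are hypothetical; the real catalog also lazily imports each job's `handler_import` via `importlib` when the coordinator needs to execute it:

```python
SCHEDULED_JOB_REGISTRY = {}  # job_name -> job definition

def scheduled_job(name, handler_import, interval_seconds):
    """Decorator-style self-registration: modules listed in
    SENTINEL_SCHEDULED_JOB_MODULES run this at import time, so new periodic
    duties never touch a central dispatcher."""
    def register(runner):
        SCHEDULED_JOB_REGISTRY[name] = {
            "handler_import": handler_import,
            "interval_seconds": interval_seconds,
            "runner": runner,
        }
        return runner
    return register

@scheduled_job("demo_sweep", handler_import="demo.module", interval_seconds=300)
def run_demo_sweep():
    return "swept"

job = SCHEDULED_JOB_REGISTRY["demo_sweep"]
print(job["runner"]())  # swept
```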
- The delivery worker, maintenance worker, and update-campaign scheduler all record durable service heartbeats in the database. `GET /api/admin/services` lets authorized operators inspect those heartbeats, and `GET /api/health/ready` can fail when a required background service heartbeat is missing or stale.
- Sentinel also tracks durable scheduled-job observations for the maintenance sweep and update-campaign dispatch loops. `GET /api/admin/services/status` includes per-job last-run state, missed-run health, and recovery status, and the audit stream emits `scheduled_job_unhealthy` and `scheduled_job_recovered` when those periodic duties drift or recover.
- `GET /api/admin/scheduled-jobs` exposes the registered job catalog directly, including each job's description, owning service, required/optional posture, configured interval, due-time metadata, and merged observed state even before the job has ever reported a run.
- Registered jobs now also declare explicit execution semantics in that catalog: `execution_guarantee`, `idempotent`, `logical_key_mode`, and `dedupe_window_seconds`. The current built-ins are `maintenance_sweep` as `at_least_once` and `update_campaign_dispatch` as `at_most_once` with a schedule-window logical run key.
- `GET /api/admin/scheduled-jobs/<job_name>/history` returns durable execution history for a registered job, including trigger type, actor, outcome, reason, timestamps, and captured run details for manual, coordinator, and service-loop executions.
- Execution history now stores a stable `execution_id`, the job's `logical_job_key` when dedupe is in play, plus a snapshot of the declared execution guarantee and idempotency posture for that run.
- `POST /api/admin/scheduled-jobs/<job_name>/run` lets authorized operators trigger a lease-safe manual run for a registered job. The underlying job still honors its own leader lease, so a manual run returns `standby` instead of duplicating work when another instance already owns that singleton duty.
- At-most-once jobs now claim a logical execution window before running. If the same window is requested again after a failover or a manual retry, Sentinel blocks the duplicate attempt, returns `deduplicated`, and records `scheduled_job_duplicate_execution_blocked` in the audit trail instead of running the side effects twice.
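The schedule-window logical run key can be derived by flooring the current time to the job's interval, so every attempt inside one window hits the same key. A sketch of that dedupe step, with an in-memory set standing in for the durable uniqueness constraint:

```python
claimed_windows = set()  # stand-in for a durable uniqueness constraint

def logical_window_key(job_name, now, interval_seconds):
    """Floor the timestamp to the schedule interval so every attempt inside
    the same window derives the same logical run key."""
    window_start = int(now // interval_seconds) * interval_seconds
    return f"{job_name}:{window_start}"

def claim_window(job_name, now, interval_seconds=300):
    key = logical_window_key(job_name, now, interval_seconds)
    if key in claimed_windows:
        return "deduplicated"  # duplicate attempt blocked, side effects skipped
    claimed_windows.add(key)
    return "claimed"

t = 1_700_000_000
print(claim_window("update_campaign_dispatch", t))        # claimed
print(claim_window("update_campaign_dispatch", t + 60))   # deduplicated (same window)
print(claim_window("update_campaign_dispatch", t + 400))  # claimed (next window)
```

A failed-over or manually retried run inside the same window finds the key already claimed and backs off instead of running the side effects twice.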
- `POST /api/admin/scheduled-jobs/<job_name>/suppress` and `POST /api/admin/scheduled-jobs/<job_name>/resume` let authorized operators place a registered job into a time-bounded suppression window or return it to active scheduling. Suppressed jobs remain visible in the catalog, stop showing as runnable to the coordinator, and block manual runs until resumed or expired.
- `GET /api/admin/services/status` returns an operator-focused summary of API and background-service health. It covers API heartbeat instances; leader/standby state for the maintenance worker, update-campaign scheduler, and scheduled-job coordinator; lease activity; stale/error counts; durable observed background-service health state; registered-job due counts; and the current readiness checks alongside the raw service list.
- The maintenance worker now uses a database-backed leader lease, so multiple nodes can run the maintenance service without duplicating expiration sweeps. Non-leader instances enter `standby` and keep reporting heartbeat state until the active leader lease expires.
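The leader/standby behavior reduces to compare-and-swap semantics over a lease row. A minimal in-memory sketch (the real implementation uses a database row so the swap is atomic across nodes):

```python
from datetime import datetime, timedelta, timezone

lease_table = {}  # stand-in for the database-backed lease row

def try_acquire_lease(lease_name, instance_id, now, lease_seconds=90):
    """Take the lease if it is free or expired, renew it if we already hold
    it, otherwise report standby and keep heartbeating."""
    row = lease_table.get(lease_name)
    if row is None or row["expires_at"] <= now or row["holder"] == instance_id:
        lease_table[lease_name] = {
            "holder": instance_id,
            "expires_at": now + timedelta(seconds=lease_seconds),
        }
        return "leader"
    return "standby"

t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(try_acquire_lease("maintenance", "node-a", t0))                           # leader
print(try_acquire_lease("maintenance", "node-b", t0 + timedelta(seconds=30)))   # standby
print(try_acquire_lease("maintenance", "node-b", t0 + timedelta(seconds=200)))  # leader: old lease expired
```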
- Sentinel now tracks required background-service health transitions as durable audit events. When a required worker or maintenance leader becomes unavailable, stale, or unhealthy, Sentinel emits `background_service_unhealthy`; when it returns to a healthy state, Sentinel emits `background_service_recovered`. Those transitions flow through the same alert-routing and audit-sink paths as connector incidents.
- `GET /api/health/live` is a lightweight liveness probe, `GET /api/health/ready` checks database reachability, migration state, delivery-worker freshness, and optional maintenance-worker freshness, and `GET /api/metrics` exposes a small Prometheus-style metrics surface. Set `SENTINEL_MONITORING_BEARER_TOKEN` to require `Authorization: Bearer ...` on the metrics endpoint.
- The metrics surface now also exports current observed background-service state as `sentinel_background_services_total{observed_state=...}` plus per-service gauges like `sentinel_background_service_observed_state{service_name=...,service_check=...,observed_state=...}` and age via `sentinel_background_service_observation_age_seconds`.
- Scheduled-job visibility is exported too, including `sentinel_scheduled_jobs_total{observed_state=...}`, `sentinel_scheduled_job_observed_state{job_name=...,service_name=...,observed_state=...}`, and last-run timing metrics for maintenance and update-campaign scheduling.
- Registered-job scheduling policy is exported as metrics too, including `sentinel_scheduled_job_required{job_name=...,service_name=...}`, `sentinel_scheduled_job_coordinator_managed{job_name=...,service_name=...}`, `sentinel_scheduled_job_runnable_now{job_name=...,service_name=...}`, and `sentinel_scheduled_job_due_now{job_name=...,service_name=...}` so operators can distinguish optional jobs from required ones and alert when a scheduler loop is actively runnable or due.
- `SENTINEL_REQUIRE_SCHEDULED_JOB_COORDINATOR=true` makes readiness require the central job coordinator, while `SENTINEL_SCHEDULED_JOB_COORDINATOR_INSTANCE_ID`, `SENTINEL_SCHEDULED_JOB_COORDINATOR_POLL_SECONDS`, `SENTINEL_SCHEDULED_JOB_COORDINATOR_LEASE_SECONDS`, `SENTINEL_SCHEDULED_JOB_COORDINATOR_MANAGED_JOBS`, `SENTINEL_SCHEDULED_JOB_MODULES`, and `SENTINEL_SCHEDULED_JOB_MAX_SUPPRESSION_SECONDS` control its identity, cadence, leader lease, which registered jobs it owns, which modules may populate the registered-job catalog, and the maximum operator suppression window for a job.
- `SENTINEL_ENABLE_API_SERVICE_HEARTBEAT=true` enables a throttled per-request API heartbeat so `sentinel-api` instances appear in service status and heartbeat metrics. `SENTINEL_API_INSTANCE_ID` pins a stable API instance label, `SENTINEL_HEALTH_API_STALE_SECONDS` defines its freshness window, and `SENTINEL_API_HEARTBEAT_MIN_INTERVAL_SECONDS` limits write frequency under steady request load.
- The metrics surface now includes Microsoft connector telemetry such as `sentinel_microsoft_token_cache_entries{provider=...,state=...}` and `sentinel_microsoft_connector_active_failures_total{delivery_kind=...,classification=...}` so alerting can distinguish auth, throttling, and upstream-health problems.
- The same metrics surface now exports Microsoft connector health counts as `sentinel_microsoft_connectors_total{delivery_kind=...,health_state=...}` plus configured error budgets as `sentinel_microsoft_connector_error_budget_threshold{classification=...}`. That makes it easier to alert on “connector is unhealthy” instead of only raw retry volume.
- Suppressed Microsoft deliveries are also exported as `sentinel_microsoft_connectors_suppressed_total{delivery_kind=...,reason=...}` so operators can tell when Sentinel is intentionally holding traffic until the next probe window.
- Microsoft connectors now persist observed health and suppression state on the connector record itself, which lets Sentinel emit audit transitions only once per actual state change instead of on every retry loop. The listing API includes those observed fields, and the audit trail will record `microsoft_connector_unhealthy`, `microsoft_connector_suppressed`, and `microsoft_connector_recovered` transitions as the connector degrades and later recovers.
- `SENTINEL_HEALTH_WORKER_STALE_SECONDS` controls how old the delivery-worker heartbeat can get before readiness turns unhealthy. `SENTINEL_WORKER_INSTANCE_ID` lets you pin a stable delivery-worker instance label for metrics and service-heartbeat inspection.
- `SENTINEL_REQUIRE_MAINTENANCE_WORKER=true` makes readiness require the maintenance worker, `SENTINEL_HEALTH_MAINTENANCE_STALE_SECONDS` defines that worker's heartbeat freshness window, and `SENTINEL_MAINTENANCE_INSTANCE_ID` plus `SENTINEL_MAINTENANCE_POLL_SECONDS` control its identity and sweep cadence.
- `SENTINEL_MAINTENANCE_LEASE_SECONDS` controls how long the maintenance leader lease stays valid between renewals. Size this longer than the poll interval so standby nodes only take over when the active leader stops renewing.
- `SENTINEL_REQUIRE_UPDATE_CAMPAIGN_SCHEDULER=true` makes readiness require the update-campaign scheduler, `SENTINEL_HEALTH_UPDATE_CAMPAIGN_SCHEDULER_STALE_SECONDS` defines its freshness window, and `SENTINEL_UPDATE_CAMPAIGN_SCHEDULER_INSTANCE_ID`, `SENTINEL_UPDATE_CAMPAIGN_SCHEDULER_POLL_SECONDS`, and `SENTINEL_UPDATE_CAMPAIGN_SCHEDULER_LEASE_SECONDS` control its identity, cadence, and leader-lease duration.
- `POST /api/recovery/backups` records a signed backup manifest, `POST /api/recovery/drills` records a restore drill, and `GET /api/recovery/status` summarizes whether the latest verified backup plus the latest successful drill meet your freshness windows.
- `GET /api/recovery/backups`, `GET /api/recovery/drills`, and `GET /api/recovery/environments` expose the recent recovery record history to operators with `recovery.read`.
- `POST /api/recovery/jobs` queues a backup or restore-drill execution request, `GET /api/recovery/jobs` lists those jobs, `POST /api/recovery/jobs/<job_id>/review` approves or rejects protected restore jobs, and `POST /api/recovery/jobs/<job_id>/result` lets a workload-authenticated recovery runner submit the result back to Sentinel.
- Restore-drill targets that match `SENTINEL_RECOVERY_PROTECTED_TARGET_KEYWORDS` or explicitly request `protected_target=true` pause in `pending_approval` until `SENTINEL_RECOVERY_PROTECTED_APPROVALS` distinct reviewers approve them. `SENTINEL_RECOVERY_REVIEW_TTL_SECONDS` sets the review window and `SENTINEL_RECOVERY_REVIEW_REQUIRE_SEPARATE_APPROVER` blocks requester self-approval.
- Recovery runners authenticate with `SENTINEL_WORKLOAD_CLIENTS_JSON` using the `recovery_runner` role, which grants `recovery.job.submit`. A successful backup result can include a signed backup manifest envelope, and a restore-drill result can also report a restore environment object so Sentinel tracks sandbox lifecycle and signs a validation bundle for the recorded drill.
- `SENTINEL_RECOVERY_SIGNING_PUBLIC_KEY_PATH` pins the Ed25519 public key used to verify backup manifests and sign restore-validation bundles. `SENTINEL_RECOVERY_BACKUP_STALE_SECONDS`, `SENTINEL_RECOVERY_DRILL_STALE_SECONDS`, and `SENTINEL_RECOVERY_ENVIRONMENT_TTL_SECONDS` define when recovery state falls out of a ready posture or an unattended restore sandbox expires.
- `POST /api/update-campaigns` creates a signed agent-update rollout, `GET /api/update-campaigns` lists recent campaigns, `POST /api/update-campaigns/<campaign_id>/review` approves or rejects a pending rollout, and `POST /api/update-campaigns/<campaign_id>/dispatch` advances the active ring by queueing `stage_update` and `apply_update` commands for eligible devices.
- `GET /api/update-campaigns/<campaign_id>/signals` lists recorded rollout-governance signals, `POST /api/update-campaigns/<campaign_id>/signals` lets a workload-authenticated security integration submit an external `pause` or `halt` signal, and `POST /api/update-campaigns/<campaign_id>/signals/<signal_id>/clear` lets an admin clear that signal once the upstream incident is resolved.
- `POST /api/integrations/rollout-signals/<provider>` is a connector-facing normalization endpoint for rollout governance. Sentinel currently understands `splunk_hec`, `edr_alert`, `nac_enforcement`, and `idp_conditional_access` payloads, resolves the target campaign from `campaign_id` or `device_id`, ignores healthy `allow` decisions, and deduplicates repeated upstream events by their external event ID.
- The rollout-signal normalization path also understands Microsoft-native providers. `microsoft_defender_xdr` can convert serious Defender XDR incidents into `pause` or `halt` rollout signals, and `microsoft_sentinel_incident` can ingest Sentinel incident payloads while ignoring already-closed or false-positive incidents.
- The rollout-signal routes can now be layered as `assertion + mTLS + request HMAC` for high-assurance writers. In that mode a connector needs a valid Sentinel-signed workload assertion, a matching client certificate, and a fresh HMAC over the request body before Sentinel will accept the signal.
- Update campaigns snapshot target devices from `target_device_ids` and/or `target_tags`, then assign them to ordered rollout rings. Each ring supports `batch_size`, `match_tags`, `auto_apply`, and `soak_seconds`, so you can do pilot-first staging before broader deployment and force soak time before the next ring opens.
- Rollout windows now support both absolute `start_at` / `stop_at` bounds and recurring schedule gates with `allowed_weekdays`, `daily_start_time_utc`, and `daily_end_time_utc`. Outside the recurring schedule Sentinel defers dispatch with the next eligible time instead of permanently halting the campaign.
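The recurring schedule gate is a weekday-plus-time-of-day check in UTC. A sketch of that evaluation, assuming weekday numbering where Monday is 0 (Python's convention):

```python
from datetime import datetime, time, timezone

def dispatch_allowed(now_utc, allowed_weekdays, daily_start, daily_end):
    """Recurring schedule gate: dispatch only on allowed weekdays and inside
    the daily UTC window; callers defer (not halt) outside it."""
    if now_utc.weekday() not in allowed_weekdays:
        return False
    return daily_start <= now_utc.time() < daily_end

# Mon-Fri, 09:00-17:00 UTC window.
window = (set(range(5)), time(9, 0), time(17, 0))
tuesday = datetime(2024, 1, 2, 10, 30, tzinfo=timezone.utc)
saturday = datetime(2024, 1, 6, 10, 30, tzinfo=timezone.utc)
print(dispatch_allowed(tuesday, *window))   # True
print(dispatch_allowed(saturday, *window))  # False: deferred, not halted
```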
- `SENTINEL_UPDATE_CAMPAIGN_SIGNING_PUBLIC_KEY_PATH` pins the Ed25519 public key used to validate submitted update manifests before Sentinel accepts a campaign. `SENTINEL_UPDATE_CAMPAIGN_REVIEW_TTL_SECONDS`, `SENTINEL_UPDATE_CAMPAIGN_REQUIRED_APPROVALS`, and `SENTINEL_UPDATE_CAMPAIGN_REQUIRE_SEPARATE_APPROVER` enforce dual control at the campaign level instead of per-device ad hoc approval.
- `POST /api/update-campaigns/<campaign_id>/pause` and `/resume` let operators stop or continue rollout progression without deleting the campaign. Sentinel also records `pause_reason` for automatic or manual pauses so operators can distinguish a canary gate from an explicit human stop.
- `SENTINEL_UPDATE_CAMPAIGN_DEFAULT_RING_SOAK_SECONDS`, `SENTINEL_UPDATE_CAMPAIGN_DEFAULT_MAX_FAILED_DEVICES`, `SENTINEL_UPDATE_CAMPAIGN_DEFAULT_MAX_FAILURE_RATE_PERCENT`, `SENTINEL_UPDATE_CAMPAIGN_DEFAULT_MIN_CANARY_HEALTH_SCORE`, `SENTINEL_UPDATE_CAMPAIGN_DEFAULT_CANARY_LOW_HEALTH_ACTION`, `SENTINEL_UPDATE_CAMPAIGN_DEFAULT_HALT_ON_ROLLBACK`, and `SENTINEL_UPDATE_CAMPAIGN_DEFAULT_HALT_ON_TRUST_REGRESSION` define the default rollout safety posture. Sentinel can now auto-pause or auto-halt before opening the next ring when the previous ring's health score drops below policy.
- `SENTINEL_UPDATE_CAMPAIGN_EXTERNAL_SIGNAL_DEFAULT_TTL_SECONDS` controls how long an external pause/halt signal remains active by default. While an active external signal exists, Sentinel will refuse to resume the campaign and will continue enforcing that upstream pause/halt recommendation.
- When Sentinel pauses or halts a campaign, outstanding undelivered campaign commands are cancelled automatically and the state change is audited and emitted as `update.campaign.paused` or `update.campaign.halted`.
- Workload clients that need to influence rollout governance should be registered in `SENTINEL_WORKLOAD_CLIENTS_JSON` with the `rollout_signal_writer` role, which grants `update.campaign.signal.submit`. The same role can call either the direct signal route or the normalized integration adapter route, and you can require HMAC-signed requests for those routes on a per-client basis.
- The dashboard now shows a lightweight update-campaign status summary, and `/api/metrics` exports rollout counts by campaign state and per-device assignment state.
- The signing helper in `deploy/recovery/sign_backup_manifest.py` produces the envelope format expected by `POST /api/recovery/backups`, and `deploy/recovery/example_backup_manifest.json` defines the manifest shape.
- For higher-assurance key custody, you can replace in-process private-key use with external signer commands. `SENTINEL_EVIDENCE_SIGNING_COMMAND`, `SENTINEL_TRUST_ASSERTION_SIGNING_COMMAND`, and `SENTINEL_INTEGRATION_SIGNING_COMMAND` receive canonical bytes on stdin as base64url JSON and return a signature. Pair them with the matching `*_SIGNING_PUBLIC_KEY_PATH` so Sentinel can verify the returned signature before use. `SENTINEL_SIGNING_COMMAND` is available as a shared fallback.
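The stdin contract above moves canonical bytes to the external signer as base64url inside JSON. A sketch of the encode/decode halves of that handoff; the field names (`key_id`, `payload_b64url`) are illustrative, not Sentinel's actual wire format:

```python
import base64
import json

def signing_request_bytes(canonical_bytes, key_id="primary"):
    """Encode a signing request: the canonical bytes travel base64url-encoded
    inside a JSON object written to the signer command's stdin."""
    request = {
        "key_id": key_id,
        "payload_b64url": base64.urlsafe_b64encode(canonical_bytes).decode("ascii"),
    }
    return json.dumps(request).encode("utf-8")

def decode_signing_request(stdin_bytes):
    """Signer side: recover the exact canonical bytes to sign."""
    request = json.loads(stdin_bytes)
    return base64.urlsafe_b64decode(request["payload_b64url"])

payload = b'{"event":"device.trust.updated"}'
roundtrip = decode_signing_request(signing_request_bytes(payload))
print(roundtrip == payload)  # True: signer sees byte-identical input
```

Byte-identical round-tripping matters because Sentinel verifies the returned signature against the pinned public key; any canonicalization drift between the two sides makes verification fail closed.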
- The current event surface includes `device.trust.updated`, `device.identity.pending`, `device.identity.approved`, `device.identity.rejected`, `device.certificate.revoked`, `policy.exception.approved`, `microsoft_connector_unhealthy`, `microsoft_connector_escalated`, `microsoft_connector_suppressed`, `microsoft_connector_suppression_escalated`, `microsoft_connector_recovered`, `background_service_unhealthy`, `background_service_recovered`, `maintenance_worker_leader_elected`, `maintenance_worker_leader_reacquired`, `maintenance_worker_leader_failover`, `scheduled_job_coordinator_leader_elected`, `scheduled_job_coordinator_leader_reacquired`, `scheduled_job_coordinator_leader_failover`, `update_campaign_scheduler_leader_elected`, `update_campaign_scheduler_leader_reacquired`, `update_campaign_scheduler_leader_failover`, `scheduled_job_unhealthy`, `scheduled_job_recovered`, `recovery.job.review_requested`, `recovery.job.approved`, `recovery.job.rejected`, `recovery.job.queued`, `recovery.job.completed`, `update.campaign.requested`, `update.campaign.approved`, `update.campaign.rejected`, `update.campaign.signal.recorded`, `update.campaign.signal.updated`, `update.campaign.signal.cleared`, `update.campaign.paused`, `update.campaign.halted`, and `update.campaign.dispatched`.
- `GET /api/integrations/device-assertions/<device_id>` lets a workload client fetch a signed EdDSA device-trust assertion for itself, and `GET /api/devices/<device_id>/trust-assertion?audience=<name>` lets an authorized admin mint one for an allowed downstream audience.
- `SENTINEL_TRUST_ASSERTION_SIGNING_PRIVATE_KEY_PATH`, `SENTINEL_TRUST_ASSERTION_ISSUER`, `SENTINEL_TRUST_ASSERTION_LIFETIME_SECONDS`, and `SENTINEL_TRUST_ASSERTION_ALLOWED_AUDIENCES` control trust-assertion issuance policy.
- `POST /api/integrations/workload-assertions/<client_id>` lets an admin mint a short-lived EdDSA workload assertion for a configured machine client, and `GET /api/integrations/workload-assertions/status` shows whether Sentinel is ready to issue and verify those assertions.
- `SENTINEL_WORKLOAD_ASSERTION_SIGNING_PRIVATE_KEY_PATH`, `SENTINEL_WORKLOAD_ASSERTION_ISSUER`, `SENTINEL_WORKLOAD_ASSERTION_LIFETIME_SECONDS`, and `SENTINEL_WORKLOAD_ASSERTION_AUDIENCE` control workload-assertion issuance and verification policy.
- `GET /api/audit/export` produces a signed JSON evidence package containing current audit-chain integrity, effective policy metadata, recent audit events, and a fleet trust snapshot. Configure `SENTINEL_EVIDENCE_SIGNING_PRIVATE_KEY_PATH` with an Ed25519 private key before enabling this in production.
- Reference adapter code for the external signer contract lives in `deploy/signing/example_external_signer.py`. Replace that script with your HSM, KMS, or hardware-token wrapper and keep the corresponding public key pinned in Sentinel.
- Admin tokens are checked against the live admin directory on every request, so disabling an account or changing its roles takes effect without waiting for token expiry.
- Admin JWTs are also bound to persisted server-side sessions, so `POST /api/admin/logout` and `POST /api/admin/sessions/<session_id>/revoke` invalidate tokens immediately.
- `GET /api/admin/sessions` lists the current operator's sessions for review and cleanup.
- `GET /api/admin/passkeys` lists the current operator's active passkeys, `POST /api/admin/passkeys/register/options` and `/verify` enroll a new one, and `POST /api/admin/passkeys/<credential_id>/revoke` revokes a lost or replaced credential.
- The backend now uses Alembic migrations instead of silently mutating schema on boot. Run `python3 -m flask --app server.app db upgrade` before starting new code, or use the provided service unit so migrations run automatically during service start.
- `SENTINEL_DATABASE_REQUIRE_POSTGRESQL=true` blocks the control plane from starting against SQLite, which is the recommended production posture. `SENTINEL_AUTO_CREATE_SCHEMA_ON_BOOT` remains available for temporary dev bootstrapping only and should stay `false` in real environments.
- The generated baseline migration lives under `migrations/`, so future schema changes can be added as new Alembic revisions instead of relying on ad hoc `db.create_all()` behavior.
- The current migration chain includes service-heartbeat support for operational monitoring. After pulling new code, run `python3 -m flask --app server.app db upgrade` before expecting health and metrics endpoints to report worker state.
- The current migration chain also includes `recovery_backup` and `recovery_drill`, so the same `db upgrade` step is required before backup manifests and restore drills can be recorded.
- `SENTINEL_PROTECTED_COMMANDS` lists command actions that are held for dual-control review instead of being dispatched immediately.
- `SENTINEL_PROTECTED_COMMAND_APPROVALS`, `SENTINEL_COMMAND_REVIEW_TTL_SECONDS`, and `SENTINEL_COMMAND_REVIEW_REQUIRE_SEPARATE_APPROVER` control dual-control review for those protected actions.
- Protected actions are requested through `POST /api/command` and reviewed through `POST /api/commands/<command_id>/review`.
- The dashboard defaults to `collect_diagnostics` as a safe protected action.
- Audit events are hash-chained in sequence order. `GET /api/audit/integrity` verifies that the stored event hashes and previous-hash links still match the recorded event contents.
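The chain check can be sketched as follows. This shows the concept only; Sentinel's actual canonicalization and hash layout may differ:

```python
import hashlib
import json

def event_hash(previous_hash, event):
    """Hash-chain step: each hash commits to the previous hash plus the
    canonicalized event content, so edits or reordering break the chain."""
    canonical = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((previous_hash + canonical).encode("utf-8")).hexdigest()

def verify_chain(events):
    prev = "0" * 64  # genesis value for the first event
    for row in events:
        if row["hash"] != event_hash(prev, row["event"]):
            return False
        prev = row["hash"]
    return True

# Build a tiny two-event chain, then tamper with the first event.
chain, prev = [], "0" * 64
for ev in [{"action": "login"}, {"action": "command_dispatch"}]:
    h = event_hash(prev, ev)
    chain.append({"event": ev, "hash": h})
    prev = h
print(verify_chain(chain))  # True
chain[0]["event"]["action"] = "forged"
print(verify_chain(chain))  # False: tamper detected
```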
- The agent persists its next heartbeat counter in `state_path` so counters keep increasing across restarts. If that state is lost on an already-enrolled device, the backend will reject stale/replayed counters until the agent state is restored or the device record is reset.
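The backend's side of the counter check is a strict greater-than comparison per device. A minimal sketch of that anti-replay enforcement:

```python
last_seen = {}  # device_id -> highest accepted heartbeat counter

def accept_heartbeat(device_id, counter):
    """Anti-replay check: accept a heartbeat only if its counter is strictly
    greater than the highest counter seen for that device."""
    if counter <= last_seen.get(device_id, -1):
        return False  # stale or replayed counter rejected
    last_seen[device_id] = counter
    return True

print(accept_heartbeat("dev-1", 10))  # True
print(accept_heartbeat("dev-1", 11))  # True
print(accept_heartbeat("dev-1", 11))  # False: replay
print(accept_heartbeat("dev-1", 5))   # False: what counter loss looks like
```

The last case is exactly the post-state-loss scenario: an agent that restarts from a lower counter is rejected until its state is restored or the device record is reset.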
- The agent also persists the most recent attestation nonce in `state_path`. If that nonce is lost, the backend will issue a fresh challenge and the device will temporarily drop out of the trusted tier until it responds with a new signed attestation.
- Store the JWT signing secret and admin password hashes in environment variables.
- Keep PostgreSQL as the production system of record and run Alembic upgrades as part of deploys. SQLite should stay limited to local development and isolated test runs.
- Keep remote actions idempotent, because queued commands are re-sent until the agent acknowledges them.
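Keying side effects on the command ID is the usual way to make at-least-once delivery harmless. A sketch of that pattern on the agent side, with an in-memory set standing in for persisted state:

```python
applied_commands = set()  # command IDs the agent has already executed

def handle_command(command_id, action):
    """At-least-once delivery means the same queued command can arrive more
    than once; dedupe on command_id so re-delivery only re-acknowledges."""
    if command_id in applied_commands:
        return "already_applied"  # ack again without re-running side effects
    applied_commands.add(command_id)
    return f"applied:{action}"

print(handle_command("cmd-42", "collect_diagnostics"))  # applied:collect_diagnostics
print(handle_command("cmd-42", "collect_diagnostics"))  # already_applied
```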
- For backend-only smoke tests without a TLS proxy, you can set `SENTINEL_REQUIRE_DEVICE_MTLS=false` and run `python3 -m server.app`.