Merge pull request #82 from input-output-hk/next-2026-04-16
Node `10.7.1`, Dbsync `13.7.0.4`, Mithril `2617.0`
* opsTf: add a lifecycle rule to expire orphaned delete markers. The Mimir and Loki compactors call DeleteObject under versioning, which creates markers instead of purging. Without `expired_object_delete_marker` the markers accumulate forever once their underlying versions expire.
* profile-monitoring: drop the unwired Prometheus server. Alloy is the universal scrape-and-forward agent in this stack and, on the monitoring node, remote_writes directly into local Mimir, so the standalone Prometheus server (no scrape_configs, no remote_write) was dead code. Blackbox-exporter stays: it is reached on demand via Caddy `/blackbox/*` for ad-hoc HTTPS probes; continuous probing belongs in a follow-up that adds a prometheus.scrape block to profile-grafana-alloy.
* monitoringOauthGoogleSubmodule: convert allowedDomains (`listOf str`) to allowedDomain (`nullOr str`). Google's `hd` OAuth parameter is single-valued; the list type was a fiction the code resolved with `head`, leaving extra entries half-bound. Cross-tenant access is intentionally unsupported by this profile.
* opsTf: comment why DeleteObject is granted under Object Lock and why DeleteObjectVersion is intentionally omitted.
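The delete-marker rule corresponds to a standard S3 lifecycle setting; a terranix sketch (resource names and the bucket reference are illustrative, only the `expired_object_delete_marker` mechanism is from the text):

```nix
{
  # Illustrative fragment; the real rule is generated inside opsTf.
  resource.aws_s3_bucket_lifecycle_configuration.mimir = {
    bucket = "\${aws_s3_bucket.mimir.id}";
    rule = [
      {
        id = "expire-orphaned-delete-markers";
        status = "Enabled";
        # Compactors call DeleteObject under versioning, leaving a marker.
        # Once every noncurrent version beneath it expires, this setting
        # lets S3 remove the now-orphaned marker as well.
        expiration.expired_object_delete_marker = true;
      }
    ];
  };
}
```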
- Cap the mimir start limit at 5 restarts per 600 s (matching mimir-rules-sync) so a permanently broken config fails the unit instead of boot-looping the journal indefinitely. The operator must run `reset-failed` afterwards.
- Comment the Caddy handle exclusivity so future edits don't collapse the write-hash `/api/v1/push` routes into the broader admin-hash routes.
- Note that opsTf's local dashToSnake exists to keep the helper callable from any terranix workspace without a cardano-parts flake closure.
- Note that bootstrap.nix unmanagedBuckets scopes to rain_artifacts only; mimir/loki flow through mkMonitoringBucketResources.
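The start-limit cap maps onto standard systemd unit settings; a NixOS-style sketch (the exact attribute placement in profile-monitoring is an assumption):

```nix
{
  systemd.services.mimir = {
    # Fail the unit after 5 failed starts within 600 s instead of
    # restarting indefinitely and flooding the journal.
    unitConfig = {
      StartLimitBurst = 5;
      StartLimitIntervalSec = 600;
    };
    serviceConfig.Restart = "on-failure";
  };
}
```

After fixing the config, the unit stays failed until `systemctl reset-failed mimir.service` is run.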
Adds an opt-in monitoring profile running Grafana + Mimir + Loki +
Prometheus + blackbox-exporter on a single Colmena machine, fronted by
Caddy with ACME-issued TLS and Google OAuth on Grafana. Cluster-wide
configuration lives under flake.cardano-parts.cluster.infra.monitoring,
read by the bootstrap opentofu workspace, profile-grafana-alloy, and
profile-monitoring itself so all three stay in lockstep.
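A downstream cluster.nix stanza might look like the following; only the option path and `enable` appear in the text, the rest is a hypothetical placeholder:

```nix
{
  flake.cardano-parts.cluster.infra.monitoring = {
    enable = true;
    # Further attributes (e.g. the monitoring FQDN and the Google OAuth
    # domain) would be declared here once, then read by the bootstrap
    # opentofu workspace, profile-grafana-alloy, and profile-monitoring.
  };
}
```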
Storage: cardano-parts.lib.opsTf.mkMonitoringBucketResources provisions
per-cluster Mimir and Loki S3 buckets with Object Lock + lifecycle
wired together so app-level retention and storage-level retention
cannot drift. objectLockMode picks between a 1-day soft lock (default,
~1x storage) and a full-retention hard lock (~2x storage during the
retention window). Both modes use GOVERNANCE locks so a separately-
permissioned operator role holding s3:BypassGovernanceRetention can
break-glass; the EC2 role attached to the monitoring node gets
least-privilege data-plane access only (no DeleteBucket /
PutBucketPolicy / governance bypass) via
cardano-parts.lib.opsTf.mkMonitoringIamPolicy.
Both opsTf builders are pure helpers (no pkgs dependency) so existing
downstream repos that don't use this template's bootstrap workspace
can call them from any terranix-driven workspace without copying
code; the template's bootstrap.nix and cluster.nix are now the
canonical callers.
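Because the builders take no pkgs, a downstream terranix workspace can call them directly; a sketch under stated assumptions (the argument set, the `objectLockMode` value, and the merge shape are illustrative, only the function names are from the text):

```nix
# Hypothetical terranix workspace module.
{ inputs, ... }:
let
  opsTf = inputs.cardano-parts.lib.opsTf;

  # objectLockMode selects the 1-day soft lock (default, ~1x storage)
  # or the full-retention hard lock (~2x during the retention window).
  buckets = opsTf.mkMonitoringBucketResources {
    cluster = "example";
    objectLockMode = "soft";
  };

  # Least-privilege data-plane policy for the monitoring node's EC2 role.
  policy = opsTf.mkMonitoringIamPolicy { cluster = "example"; };
in
  buckets // policy
```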
profile-grafana-alloy: when infra.monitoring.enable = true, alloy
auto-targets the in-cluster monitoring node, making the
grafana-alloy-{metrics,loki}-url sops secrets optional.
opsLib: adds parseDir and readNixImport so monitoring rule files and
the existing tofu grafana workspace can share the same .nix-import
corpus without duplicate helpers.
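Usage could look roughly like this; the signatures are assumptions, since the source only names the helpers and their purpose:

```nix
let
  opsLib = inputs.cardano-parts.lib.opsLib; # assumed attribute path
  # parseDir: assumed to enumerate the .nix-import corpus in a directory;
  # readNixImport: assumed to evaluate one such file.
  ruleFiles = opsLib.parseDir ./monitoring/rules;
  rules = map opsLib.readNixImport ruleFiles;
in
  rules
```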
profile-monitoring + profile-grafana-alloy read cluster infra through
groupFlake.config (the consuming flake's self) rather than
flake.config — the latter closes over cardano-parts' own flake-parts
evaluation where these options carry their declared null defaults,
not the consumer's values.
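In module terms the distinction looks like the following sketch (the cfg shape and enable guard are illustrative):

```nix
# Inside profile-monitoring (sketch):
{ lib, groupFlake, ... }:
let
  # groupFlake.config is the consuming flake's evaluated config, so the
  # consumer's declared values are visible here; flake.config would
  # resolve against cardano-parts' own flake-parts evaluation, where
  # these options still hold their declared null defaults.
  cfg = groupFlake.config.flake.cardano-parts.cluster.infra.monitoring;
in
{
  config = lib.mkIf cfg.enable {
    # ...profile configuration derived from cfg...
  };
}
```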
Template additions wire all of the above into a downstream-consumable
shape: optional monitoring host in colmena.nix, infra.monitoring stanza
in cluster.nix, S3 bucket + IAM policy blocks in opentofu, and a
README section covering retention, lock modes, and the required sops
secrets.