| 1 | +# Production Deployment Analysis: SeiNode CRD vs sei-infra Snapshotter |
| 2 | + |
| 3 | +## Current State: sei-infra Snapshotter (EC2/Terraform) |
| 4 | + |
| 5 | +The existing snapshotter is entirely EC2-based with no Kubernetes involvement: |
| 6 | + |
| 7 | +- **3 instances** (`m7i.8xlarge`) in `eu-central-1` running `seid v6.3.0` |
| 8 | +- **32 TiB EBS** storage (RAID0 across 5 disks) |
| 9 | +- **Weekly AMI snapshots** (Mondays 16:00 UTC via cron + `aws ec2 create-image`) |
| 10 | +- **S3 bucket**: `pacific-1-snapshots/state-sync/` for Tendermint snapshots |
| 11 | +- **DynamoDB** metadata table: `pacific-1_snapshot_metadata` |
| 12 | +- **ALB** with Route53 DNS (`rpc.sei-archive.pacific-1.seinetwork.io`) |
| 13 | +- **WAF** rate limiting (300 requests per 5 minutes per IP) |
| 14 | +- **IAM role**: `pacific-1-snapshot-iam-role` with S3 + DynamoDB permissions |
| 15 | + |
| 16 | +### Bootstrap Flow |
| 17 | + |
| 18 | +1. `ec2_init.sh` downloads sei-infra from S3, runs `generate_configs.py` |
| 19 | +2. `mount_ebs_volumes.sh` sets up RAID0 or JBOD |
| 20 | +3. `post_init.sh` installs seid binary, configures, starts systemd service |
| 21 | +4. `init_configure.sh` runs `seid init`, fetches peers, sets pruning to `nothing`, enables SeiDB + OCC |
| 22 | +5. Crontab schedules `snapshot.sh` for weekly AMI creation |
| 23 | + |
| 24 | +### Snapshot Generation |
| 25 | + |
| 26 | +| Setting | Value | |
| 27 | +|---------|-------| |
| 28 | +| Cron | `0 16 * * 1` (Mondays 16:00 UTC) | |
| 29 | +| Mechanism | `snapshot.sh` → `aws ec2 create-image` (no-reboot) | |
| 30 | +| Retention | AMIs older than 30 days are removed | |
| 31 | +| Metadata | DynamoDB with `last_update`, `height`, `imageName` | |
| 32 | + |
| 33 | +### State-Syncer (Snapshot Source) |
| 34 | + |
| 35 | +- 3 instances in `eu-central-1` |
| 36 | +- Snapshot interval: 4000 blocks |
| 37 | +- Script: `create_snapshot.sh` (halt-height loop, `seid tendermint snapshot`, keep 5 recent) |
| 38 | +- Peers: `primaryEndpoint: https://sei-rpc.polkachu.com` |
| 39 | + |
| 40 | +--- |
| 41 | + |
| 42 | +## SeiNode CRD: Production Snapshotter |
| 43 | + |
| 44 | +### Example Manifest (Archive Mode) |
| 45 | + |
| 46 | +```yaml |
| 47 | +apiVersion: sei.io/v1alpha1 |
| 48 | +kind: SeiNode |
| 49 | +metadata: |
| 50 | + name: pacific-1-snapshotter |
| 51 | + namespace: sei-nodes |
| 52 | +spec: |
| 53 | + chainId: pacific-1 |
| 54 | + image: "ghcr.io/sei-protocol/sei:v6.3.0" |
| 55 | + sidecar: |
| 56 | + image: ghcr.io/sei-protocol/seictl@sha256:78acbf33cc62c41f65766eef10698af2656b3a169eef3be19f707af6f6f51d62 |
| 57 | + resources: |
| 58 | + requests: |
| 59 | + cpu: "500m" |
| 60 | + memory: "256Mi" |
| 61 | + entrypoint: |
| 62 | + command: ["seid"] |
| 63 | + args: ["start", "--home", "/sei"] |
| 64 | + storage: |
| 65 | + retainOnDelete: true |
| 66 | + archive: |
| 67 | + peers: |
| 68 | + - ec2Tags: |
| 69 | + region: eu-central-1 |
| 70 | + tags: |
| 71 | + ChainIdentifier: pacific-1 |
| 72 | + Component: state-syncer |
| 73 | + snapshotGeneration: |
| 74 | + keepRecent: 5 |
| 75 | + destination: |
| 76 | + s3: |
| 77 | + bucket: pacific-1-snapshots |
| 78 | + prefix: state-sync/ |
| 79 | + region: eu-central-1 |
| 80 | +``` |
| 81 | + |
| 82 | +### Generated Kubernetes Resources |
| 83 | + |
| 84 | +| Resource | Details | |
| 85 | +|----------|---------| |
| 86 | +| StatefulSet | 1 replica, `seid` + `seictl` sidecar | |
| 87 | +| Service | Headless (`ClusterIP: None`), `PublishNotReadyAddresses: true` | |
| 88 | +| PVC | `data-{nodeName}`, StorageClass `gp3-10k-750`, 2000Gi for archive | |
| 89 | +| Init containers | `seid-init` (chain init), `sei-sidecar` (restartable) | |
| 90 | + |
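The StorageClass name `gp3-10k-750` suggests a gp3 volume provisioned at 10,000 IOPS and 750 MiB/s, but the document does not define the class itself. A minimal sketch of such a class for the AWS EBS CSI driver follows; the IOPS, throughput, binding mode, and expansion settings are assumptions inferred from the name, not the cluster's actual definition.

```yaml
# Hypothetical definition: only the class name comes from the document.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-10k-750
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "10000"       # assumed from "10k" in the class name
  throughput: "750"   # MiB/s, assumed from "750" in the class name
volumeBindingMode: WaitForFirstConsumer   # assumption: bind in the zone where the pod is scheduled
allowVolumeExpansion: true                # assumption: allows growing the archive PVC later
```

If expansion is enabled on the real class, an undersized archive PVC could be grown in place rather than recreated.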
| 91 | +### PlatformConfig (Controller Environment Variables) |
| 92 | + |
| 93 | +| Env Var | Default | Purpose | |
| 94 | +|---------|---------|---------| |
| 95 | +| `SEI_NODEPOOL_NAME` | `sei-node` | Karpenter NodePool for scheduling | |
| 96 | +| `SEI_TOLERATION_KEY` | `sei.io/workload` | Taint key to tolerate | |
| 97 | +| `SEI_TOLERATION_VALUE` | `sei-node` | Taint value | |
| 98 | +| `SEI_SERVICE_ACCOUNT` | `seid-node` | ServiceAccount for node pods | |
| 99 | +| `SEI_STORAGE_CLASS_PERF` | `gp3-10k-750` | StorageClass for full/validator/archive | |
| 100 | +| `SEI_STORAGE_CLASS_DEFAULT` | `gp3` | StorageClass for other modes | |
| 101 | +| `SEI_STORAGE_SIZE_DEFAULT` | `1000Gi` | PVC size for full/validator | |
| 102 | +| `SEI_STORAGE_SIZE_ARCHIVE` | `2000Gi` | PVC size for archive | |
| 103 | +| `SEI_RESOURCE_CPU_ARCHIVE` | `8` | CPU request for archive | |
| 104 | +| `SEI_RESOURCE_MEM_ARCHIVE` | `48Gi` | Memory request for archive | |
| 105 | +| `SEI_RESOURCE_CPU_DEFAULT` | `4` | CPU request for full/validator | |
| 106 | +| `SEI_RESOURCE_MEM_DEFAULT` | `32Gi` | Memory request for full/validator | |
| 107 | +| `SEI_SNAPSHOT_REGION` | `eu-central-1` | Default S3 region for snapshots | |
| 108 | + |
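Because these settings are plain environment variables on the controller, a production override could be expressed as a patch on the controller Deployment. The sketch below is illustrative only: the Deployment name, namespace, container name, and all three values are placeholders, not taken from the actual controller manifests.

```yaml
# Strategic-merge patch sketch. Names and values are hypothetical placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sei-controller     # hypothetical Deployment name
  namespace: sei-system    # hypothetical namespace
spec:
  template:
    spec:
      containers:
        - name: manager    # hypothetical container name
          env:
            - name: SEI_STORAGE_SIZE_ARCHIVE
              value: "4000Gi"   # placeholder; size to the real archive footprint
            - name: SEI_RESOURCE_CPU_ARCHIVE
              value: "16"       # placeholder
            - name: SEI_RESOURCE_MEM_ARCHIVE
              value: "64Gi"     # placeholder
```

Variables that are not listed in the patch keep the defaults from the table above.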
| 109 | +### Resource Sizing by Mode |
| 110 | + |
| 111 | +| Mode | StorageClass | Size | CPU | Memory | |
| 112 | +|------|-------------|------|-----|--------| |
| 113 | +| archive | gp3-10k-750 | 2000Gi | 8 | 48Gi | |
| 114 | +| full, validator | gp3-10k-750 | 1000Gi | 4 | 32Gi | |
| 115 | +| replayer | gp3 | 1000Gi | 4 | 32Gi | |
| 116 | + |
| 117 | +### Hardcoded Values |
| 118 | + |
| 119 | +| Setting | Value | |
| 120 | +|---------|-------| |
| 121 | +| Data dir | `/sei` | |
| 122 | +| Default sidecar image | `ghcr.io/sei-protocol/seictl@sha256:...` | |
| 123 | +| Default sidecar port | 7777 | |
| 124 | +| Snapshot upload cron | `0 0 * * *` (daily midnight) | |
| 125 | +| Snapshot interval | 2000 blocks (in config-apply) | |
| 126 | + |
| 127 | +--- |
| 128 | + |
| 129 | +## Gap Analysis: sei-infra vs SeiNode CRD |
| 130 | + |
| 131 | +| Aspect | sei-infra (EC2) | SeiNode CRD | Gap? | |
| 132 | +|--------|-----------------|-------------|------| |
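| 133 | +| Compute | m7i.8xlarge (32 vCPU, 128 GiB) | 8 CPU, 48Gi requested (archive defaults) | Defaults are lower; tunable via `SEI_RESOURCE_CPU_ARCHIVE` and `SEI_RESOURCE_MEM_ARCHIVE` | |
| 134 | +| Storage | 32 TiB EBS RAID0 | 2000Gi PVC (gp3-10k-750) | Gap: default is roughly 16x smaller; raise `SEI_STORAGE_SIZE_ARCHIVE` for a full archive | |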
| 135 | +| S3 bucket | `pacific-1-snapshots` | Configurable per-node | No gap | |
| 136 | +| ALB/DNS | ALB + Route53 + WAF | Not managed by controller | External concern (Ingress/Gateway API) | |
| 137 | +| DynamoDB metadata | Snapshot metadata tracking | Not implemented | Gap -- could add to seictl | |
| 138 | +| AMI snapshots | EC2 AMI creation | N/A (Tendermint snapshots to S3) | Different mechanism (chain-level snapshots instead of machine images); arguably better | |
| 139 | +| IAM | Instance profile | ServiceAccount + IRSA | Via PlatformConfig `SEI_SERVICE_ACCOUNT` | |
| 140 | +| Monitoring | Prometheus EC2 SD + alerts | Need ServiceMonitor/PodMonitor | External concern | |
| 141 | +| Snapshot verification | Dedicated verifier EC2 | Not implemented | Gap | |
| 142 | +| Multi-instance | 3 EC2 instances | 1 replica StatefulSet | Achievable by creating multiple SeiNode resources | |
| 143 | + |
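The multi-instance row could be covered without multi-replica support by declaring one SeiNode per instance. A rough sketch, assuming the resources differ only in `metadata.name` (the names below are hypothetical); whether several nodes can safely upload to the same S3 prefix is not stated in this document and would need checking first.

```yaml
# Two SeiNode resources in one multi-document file. Names are hypothetical.
apiVersion: sei.io/v1alpha1
kind: SeiNode
metadata:
  name: pacific-1-snapshotter-a
  namespace: sei-nodes
spec:
  chainId: pacific-1
  image: "ghcr.io/sei-protocol/sei:v6.3.0"
  # remaining fields as in the example manifest
---
apiVersion: sei.io/v1alpha1
kind: SeiNode
metadata:
  name: pacific-1-snapshotter-b
  namespace: sei-nodes
spec:
  chainId: pacific-1
  image: "ghcr.io/sei-protocol/sei:v6.3.0"
  # remaining fields as in the example manifest
```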
| 144 | +## What's Needed for Production |
| 145 | + |
| 146 | +### Already handled by the controller |
| 147 | +- Node lifecycle (bootstrap, init, running) |
| 148 | +- Snapshot generation and S3 upload |
| 149 | +- Peer discovery via EC2 tags |
| 150 | +- Genesis configuration (embedded or S3) |
| 151 | +- Config generation (pruning, state-sync intervals, etc.) |
| 152 | +- PVC provisioning and cleanup |
| 153 | +- Tolerations, affinity, service account assignment |
| 154 | + |
| 155 | +### Needs external setup |
| 156 | +- **Controller in prod Flux kustomization**: add the controller to prod (currently deployed to dev only) |
| 157 | +- **Prod PlatformConfig tuning** (storage sizes and CPU/memory requests for prod workloads) |
| 158 | +- **IRSA ServiceAccount** with S3 permissions in the node namespace |
| 159 | +- **Networking** (ALB/Ingress, DNS, WAF -- separate from controller) |
| 160 | +- **Monitoring** (ServiceMonitor for Prometheus, alerts ported from sei-infra) |
| 161 | +- **Storage sizing** -- the 2000Gi default for `SEI_STORAGE_SIZE_ARCHIVE` is far below the 32 TiB used on EC2 and likely needs to be raised for a full archive |
| 162 | + |
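For the IRSA item, the ServiceAccount is tied to an IAM role through the standard `eks.amazonaws.com/role-arn` annotation. A minimal sketch, assuming the default `SEI_SERVICE_ACCOUNT` name and the namespace from the example manifest; the account ID and role name are placeholders.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: seid-node        # default value of SEI_SERVICE_ACCOUNT
  namespace: sei-nodes   # namespace used in the example manifest
  annotations:
    # Placeholder account ID and role name; substitute the real IAM role.
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/example-seid-snapshot-role
```

The referenced role would also need S3 permissions on the snapshot bucket and a trust policy for the cluster's OIDC provider, both of which are configured outside Kubernetes.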
| 163 | +### Future enhancements |
| 164 | +- Snapshot verification (automated post-upload check) |
| 165 | +- DynamoDB metadata tracking for snapshot catalog |
| 166 | +- Multi-replica support (if needed for HA/load distribution) |