This operations manual provides step-by-step instructions for performing backups before a Pharia AI version upgrade and restoring to the previous version if issues are encountered with the new version.
- Overview
- Prerequisites
- Maintenance Window Planning
- Scaling Down Application Deployments
- Pre-Upgrade Backup Procedures
- Performing the Upgrade
- Post-Upgrade Verification and Scale-Up
- Restore Procedures
- Troubleshooting
- Appendix: Quick Reference
This manual covers the backup and restore workflow for Pharia AI deployments during version upgrades. The process involves:
- Pre-Upgrade: Backing up PostgreSQL databases, Kubernetes secrets, and Qdrant vector databases
- Upgrade: Performing your Helm upgrade to the new version (using your organization's standard procedures)
- Verification: Testing the upgraded deployment
- Restore (if needed): Restoring all components to the previous version
⚠️ Important: Always perform backups before any upgrade. The restore process requires all backups to be completed successfully.

Note: This manual focuses on backup and restore procedures. Helm upgrade/rollback operations should be performed according to your organization's change management procedures.
Before starting the upgrade process, ensure you have:
- kubectl - Configured with access to your Kubernetes cluster
- helm - Version 3.x or later
- PostgreSQL client tools - `psql`, `pg_dump`, `pg_restore`
- bash - Version 4.0 or later
- jq - JSON processor
- curl - HTTP client
- mc (MinIO client) - For S3 operations (Qdrant restore only)
- Kubernetes cluster access with appropriate permissions (also port-forward permission)
- Network access to PostgreSQL databases
- Network access to Qdrant instances
- S3 bucket access (for Qdrant backups)
- Helm chart repository access
Ensure you have the backup/restore scripts from this repository:
- `pharia-ai-backup-restore/` - PostgreSQL and Kubernetes secrets backup/restore
- `qdrant-backup-restore/` - Qdrant vector database backup/restore
Before starting, collect the following information:
- Current Helm Release Name: The name of your Pharia AI Helm release
- Kubernetes Namespace: The namespace where Pharia AI is deployed (typically `pharia-ai`)
- Current Version: The current Pharia AI version (check with `helm list -n <namespace>`)
- Target Version: The version you're upgrading to
- PostgreSQL Connection Details: Host, port, database names, credentials
- Qdrant Connection Details: Service endpoints, API keys (if configured)
- S3 Configuration: Endpoint, bucket name, access credentials (for Qdrant)
- Small deployments (< 10GB data): 2-4 hours
- Medium deployments (10-100GB data): 4-8 hours
- Large deployments (> 100GB data): 8-12 hours
- Notify stakeholders of the maintenance window
- Schedule during low-traffic periods
- Ensure team members are available for support
- Prepare rollback plan and communicate it
- Verify backup storage has sufficient space
- Test backup scripts in a non-production environment (if possible)
```bash
# Verify cluster connectivity
kubectl cluster-info

# Verify Helm access
helm list -n <namespace>

# Check pod status
kubectl get pods -n <namespace>

# Verify PostgreSQL connectivity (if accessible)
psql -h <postgres-host> -p <port> -U <user> -d <database> -c "SELECT version();"
```

⏱️ Estimated Time: 5-10 minutes
⚠️ Important: Scaling down application deployments before backup ensures data consistency across all systems and prevents in-flight transactions during the backup process.
Scaling down provides several critical benefits:
- Data Consistency: Guarantees point-in-time consistency across PostgreSQL, Qdrant, and application state
- No In-Flight Transactions: Prevents partial transactions or data changes during backup
- Clean Restore State: Ensures a known, consistent state for rollback scenarios
- Safer Operations: Eliminates risk of data corruption during backup/restore
```bash
# List all deployments in the namespace
kubectl get deployments -n <namespace>

# You can also list all Pharia AI deployments using the default label
kubectl get deployment -l pharia.ai/edition=1 -n <namespace>

# Get detailed deployment information
kubectl get deployments -n <namespace> -o wide
```

📝 Critical: Document current replica counts before scaling down. You'll need this information to restore normal operations.
```bash
# Get current replica counts for all deployments
kubectl get deployments -n <namespace> -o custom-columns=NAME:.metadata.name,REPLICAS:.spec.replicas,READY:.status.readyReplicas

# Compare with pharia-ai application deployments
kubectl get deployments -n <namespace> -l pharia.ai/edition=1 -o custom-columns=NAME:.metadata.name,REPLICAS:.spec.replicas,READY:.status.readyReplicas

# Save to a file for reference
kubectl get deployments -n <namespace> -o json | jq -r '.items[] | "\(.metadata.name): \(.spec.replicas)"' > deployment-replicas-backup.txt

# Add any missed application deployments to deployment-replicas-backup.txt

# Display the saved replica counts
cat deployment-replicas-backup.txt
```

Document the replica counts:
Deployment Name | Current Replicas
--------------------------------|------------------
<deployment-1> | _____________
<deployment-2> | _____________
<deployment-3> | _____________
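Since the scale-up loop later in this manual parses `deployment-replicas-backup.txt` line by line, it can be worth validating the file before relying on it. A minimal sketch; the `validate_replica_file` helper is illustrative and not part of the repository scripts:

```bash
#!/usr/bin/env bash
# Validate that every line of the replica backup file matches the
# "deployment-name: replica-count" format the scale-up loop expects.
validate_replica_file() {
  local file="$1" line rc=0
  local re='^[A-Za-z0-9.-]+: [0-9]+$'
  while IFS= read -r line; do
    [ -z "$line" ] && continue            # skip blank lines
    if ! [[ "$line" =~ $re ]]; then
      echo "Malformed line: $line" >&2
      rc=1
    fi
  done < "$file"
  return $rc
}

# Example:
# validate_replica_file deployment-replicas-backup.txt && echo "file looks good"
```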
```bash
# Scale down a specific application deployment
kubectl scale deployment <deployment-name> --replicas=0 -n <namespace>

# Or scale down all application deployments at once (exclude database/infrastructure pods)
# Example: scale down all deployments carrying the pharia.ai/edition=1 label
kubectl scale deployment -l pharia.ai/edition=1 --replicas=0 -n <namespace>

# Wait for pods to terminate
kubectl wait --for=delete pod -l app=<app-label> -n <namespace> --timeout=300s
```

Note: Do NOT scale down database deployments (PostgreSQL, Qdrant). Only scale down application workloads that interact with the databases.
```bash
# Verify application pods are terminated
kubectl get pods -n <namespace>

# Check deployment replica counts
kubectl get deployments -n <namespace>

# Verify only database and infrastructure pods remain
kubectl get pods -n <namespace> --field-selector=status.phase=Running
```

Expected state:
- Application deployments should show 0/0 ready replicas
- Database pods (PostgreSQL, Qdrant) should remain running
- No application pods should be in Running state
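This expected state can also be checked mechanically. A minimal sketch that inspects `kubectl` output piped in on stdin; the `assert_scaled_down` helper is illustrative, and assumes the `NAME REPLICAS` custom-columns format shown earlier (with `--no-headers`):

```bash
#!/usr/bin/env bash
# Read "NAME REPLICAS" pairs on stdin and fail if any application
# deployment still has more than 0 desired replicas.
assert_scaled_down() {
  local name replicas rc=0
  while read -r name replicas; do
    [ -z "$name" ] && continue
    if [ "$replicas" != "0" ]; then
      echo "still scaled up: $name ($replicas replicas)" >&2
      rc=1
    fi
  done
  return $rc
}

# Example:
# kubectl get deployments -n <namespace> -l pharia.ai/edition=1 \
#   -o custom-columns=NAME:.metadata.name,REPLICAS:.spec.replicas --no-headers \
#   | assert_scaled_down && echo "all application deployments scaled down"
```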
✅ Scale Down Checklist:
- Current replica counts documented
- Replica backup file saved (`deployment-replicas-backup.txt`)
- Application deployments scaled to 0 replicas
- All application pods terminated
- Database pods still running
- Cluster in stable state
⏱️ Estimated Time: 30-60 minutes (depending on data size)
Perform all backups before starting the upgrade. Document the backup timestamps for reference during restore.
```bash
# Navigate to the support repository
cd /path/to/support

# Create backup directories (if they don't exist)
mkdir -p pharia-ai-backup-restore/database-backups
mkdir -p pharia-ai-backup-restore/secrets-backups
mkdir -p qdrant-backup-restore/backups

cd pharia-ai-backup-restore
```
```bash
# Copy the example configuration
cp config.yaml.example config.yaml

# Edit config.yaml with your database details
nano config.yaml  # or use your preferred editor
```

Example config.yaml:
```yaml
backup_dir: "./database-backups"
databases:
  - name: pharia_chat_db
    host: postgres-chat-db.pharia-ai.svc.cluster.local
    port: 5432
    user: pharia_chat_user
    password: your_password_here
  - name: pharia_assistant_db
    host: postgres-assistant-db.pharia-ai.svc.cluster.local
    port: 5432
    user: pharia_assistant_user
    password: your_password_here
  # ...
```

Note: Ensure you have your database connection details for all databases (host, port, database name, username, password) available before proceeding.
```bash
# You may need to port-forward your database services if they run inside the Kubernetes cluster
# Test the connection to verify credentials
psql -h <host> -p <port> -U <user> -d <database> -c "SELECT version();"

# Backup all databases configured in config.yaml
./bin/pharia-backup.sh db backup
```

Expected Output:
[2025-01-XX XX:XX:XX] Starting database backup...
[2025-01-XX XX:XX:XX] Backing up database: pharia_ai_db
[2025-01-XX XX:XX:XX] Backup completed: database-backups/pharia_ai_db_2025-01-XX_XXXXXX.sql
[2025-01-XX XX:XX:XX] All backups completed successfully
Verify Backup Files:
```bash
# List backup files with timestamps
ls -lth database-backups/

# Note the backup filenames and timestamps for restore reference
```

Record the following information:
- Backup timestamp: _____________
- Backup file locations: _____________
- Database names backed up: _____________
```bash
# Backup all secrets in the Pharia AI namespace
./bin/pharia-backup.sh secrets backup pharia-ai
```

Expected Output:
[2025-01-XX XX:XX:XX] Starting secrets backup...
[2025-01-XX XX:XX:XX] Backing up secrets from namespace: pharia-ai
[2025-01-XX XX:XX:XX] Backup completed: secrets-backups/pharia-ai_2025-01-XX_XXXXXX.tar.gz
[2025-01-XX XX:XX:XX] All backups completed successfully
```bash
# List backup files
ls -lth secrets-backups/

# Verify backup archive integrity
tar -tzf secrets-backups/pharia-ai_<timestamp>.tar.gz | head -5
```

Record the following information:
- Backup timestamp: _____________
- Backup file location: _____________
- Number of secrets backed up: _____________
```bash
cd ../qdrant-backup-restore

# Create environment file
cp .env.sample .env  # if .env.sample exists; otherwise create .env manually
nano .env            # or use your preferred editor
```

Example .env file for backup:
```bash
# Qdrant API key (leave as is if none exists)
export QDRANT_API_KEY="your-api-key-here"

# Source Qdrant hosts (comma-separated)
# You may need to port-forward your Qdrant service if running in Kubernetes
export QDRANT_SOURCE_HOSTS="http://document-index-qdrant-headless.pharia-ai.svc.cluster.local:6333"

# Restore hosts (empty for backup)
export QDRANT_RESTORE_HOSTS=""

# Auto-discover peers from cluster info (true for Kubernetes)
export GET_PEERS_FROM_CLUSTER_INFO="true"

# Timeout settings
export CURL_TIMEOUT="1800"  # 30 minutes

# Wait for tasks to complete
export QDRANT_WAIT_ON_TASK="true"
```

Note: Ensure you have your Qdrant connection details (service endpoint, API key if configured) available before proceeding.
```bash
# Source the environment file
source .env

# Verify environment variables are set
echo "Source hosts: $QDRANT_SOURCE_HOSTS"
echo "API key set: $([ -n "$QDRANT_API_KEY" ] && echo "Yes" || echo "No")"

# Create snapshots of all collections
./qdrant_backup_recovery.sh create_snap
```

Expected Output:
[2025-01-XX XX:XX:XX] fetching collections!
[2025-01-XX XX:XX:XX] collections file updated, found X collection(s)!
[2025-01-XX XX:XX:XX] Creating snapshot for collection: collection_name
[2025-01-XX XX:XX:XX] Snapshot created successfully: snapshot_name
...
[2025-01-XX XX:XX:XX] All snapshots created successfully
Important: Collection aliases are not included in snapshots and must be backed up separately.
```bash
# Fetch and backup collection aliases
./qdrant_backup_recovery.sh get_colla
```

Expected Output:
[2025-01-XX XX:XX:XX] Fetching collection aliases...
[2025-01-XX XX:XX:XX] Found X alias(es)
[2025-01-XX XX:XX:XX] Aliases saved to collection_aliases file
```bash
# Check state files
ls -lh collections snapshots collection_aliases

# View snapshots created
cat snapshots

# View collections backed up
cat collections

# View aliases backed up
cat collection_aliases
```

Record the following information:
- Backup timestamp: _____________
- Number of collections: _____________
- Number of snapshots: _____________
- Number of aliases: _____________
- S3 bucket location: _____________
Before proceeding with the upgrade, verify all backups:
```bash
# PostgreSQL backups
echo "=== PostgreSQL Backups ==="
ls -lh pharia-ai-backup-restore/database-backups/

# Kubernetes secrets backups
echo "=== Kubernetes Secrets Backups ==="
ls -lh pharia-ai-backup-restore/secrets-backups/

# Qdrant backups
echo "=== Qdrant Backups ==="
ls -lh qdrant-backup-restore/*.csv qdrant-backup-restore/collections qdrant-backup-restore/snapshots qdrant-backup-restore/collection_aliases 2>/dev/null
```

✅ Backup Checklist:
- Application deployments scaled down
- Replica counts documented
- PostgreSQL database backup completed successfully
- Kubernetes secrets backup completed successfully
- Qdrant snapshots created successfully
- Qdrant collection aliases backed up
- All backup files verified and accessible
- Backup timestamps documented
- Backup file locations documented
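Part of this checklist can be automated. A minimal sketch that fails if an expected backup artifact is missing or empty; the `check_backups` helper is illustrative, and the paths assume the repository layout used above:

```bash
#!/usr/bin/env bash
# Fail fast if any expected backup artifact is missing or empty.
check_backups() {
  local base="${1:-.}" rc=0 dir f
  # At least one non-empty file must exist in each backup directory.
  for dir in \
    "$base/pharia-ai-backup-restore/database-backups" \
    "$base/pharia-ai-backup-restore/secrets-backups"; do
    if ! find "$dir" -type f -size +0c 2>/dev/null | grep -q .; then
      echo "no non-empty backup files in: $dir" >&2
      rc=1
    fi
  done
  # Qdrant state files must exist and be non-empty.
  for f in collections snapshots collection_aliases; do
    if [ ! -s "$base/qdrant-backup-restore/$f" ]; then
      echo "missing or empty Qdrant state file: $f" >&2
      rc=1
    fi
  done
  return $rc
}

# Example: run from the support repository root
# check_backups . && echo "all backups present"
```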
After completing all backups, perform your Helm upgrade according to your organization's standard change management procedures.
Record the following for reference:
- Upgrade timestamp: _____________
- Previous version: _____________
- New version: _____________
- Helm release name: _____________
- Namespace: _____________
Note: Application deployments remain scaled down during the upgrade, but may be scaled back up afterwards, depending on your organization's standards.
⏱️ Estimated Time: 30-60 minutes
```bash
# Check all pods are running (databases and infrastructure)
kubectl get pods -n <namespace>
```

### Step 2: Verify Database Connectivity

```bash
# Test PostgreSQL connectivity
psql -h <host> -p <port> -U <user> -d <database> -c "SELECT version();"

# Verify Qdrant connectivity
# You will need to port-forward the Qdrant service if running on Kubernetes
kubectl port-forward svc/document-index-qdrant-headless 6333:6333 -n <namespace>
curl -s http://localhost:6333/collections
```

Note: If your pods have already scaled back up, you can skip this step.
Once infrastructure verification is complete, restore application deployments to their original replica counts.
```bash
# Restore replica counts from backup file
cat deployment-replicas-backup.txt

# Scale up a specific deployment to its original replica count
kubectl scale deployment <deployment-name> --replicas=<original-count> -n <namespace>

# Or use a script to restore all deployments
while IFS=': ' read -r name replicas; do
  echo "Scaling $name to $replicas replicas..."
  kubectl scale deployment "$name" --replicas="$replicas" -n <namespace>
done < deployment-replicas-backup.txt
```

```bash
# Watch pods coming online
kubectl get pods -n <namespace> -w

# Wait for all pods to be ready
kubectl wait --for=condition=ready pod --all -n <namespace> --timeout=600s

# Check for any pod errors
kubectl get pods -n <namespace> | grep -v Running | grep -v Completed
```

```bash
# Check application endpoints
kubectl get ingress -n <namespace>

# Test application connectivity (adjust endpoint as needed)
curl -k https://<your-app-url>/health

# Check application logs for startup errors
kubectl logs -n <namespace> -l app=<app-label> --tail=50 --since=5m
```

Perform your standard functional tests:
- Application login/authentication works
- Core application features operational
- Database queries executing correctly
- Vector search functionality working (if applicable)
- API endpoints responding correctly
- Integration points functioning
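For the accessibility check, a simple retry loop avoids declaring failure while pods are still warming up. A minimal sketch; `retry_probe` is an illustrative helper that retries any probe command (for example, the `curl` health check above) with a fixed delay:

```bash
#!/usr/bin/env bash
# Retry a probe command up to $1 times, sleeping $2 seconds between
# attempts. Returns 0 as soon as the probe succeeds, 1 if all fail.
retry_probe() {
  local attempts="$1" delay="$2" i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    if (( i < attempts )); then
      echo "probe failed (attempt $i/$attempts), retrying in ${delay}s..." >&2
      sleep "$delay"
    fi
  done
  return 1
}

# Example:
# retry_probe 10 30 curl -fsk https://<your-app-url>/health
```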
```bash
# Check resource usage
kubectl top pods -n <namespace>

# Check for any resource constraints
kubectl describe pods -n <namespace> | grep -A 5 "Limits\|Requests"

# Monitor for any crash loops or restarts
kubectl get pods -n <namespace> -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount
```

✅ Post-Upgrade Verification Checklist:
- All infrastructure pods running
- Database connectivity verified
- Qdrant connectivity verified
- Application deployments scaled up to original replica counts
- All application pods running and ready
- Application is accessible
- Functional tests passed
- No critical errors in logs
- No excessive pod restarts
- Resource usage is normal
Success: If all checks pass, the upgrade is complete. Document the successful upgrade and notify stakeholders.
⏱️ Estimated Time: 1-2 hours (depending on data size)
If issues are encountered after the upgrade, follow this restore procedure to revert to the previous version.
⚠️ Critical:
- Perform rollback steps in the exact order specified
- Do not skip any steps
- Application deployments should remain scaled down during restore
- Postgres expectation: provide a new PostgreSQL instance running the same version as the old instance; the restore will be performed into this new instance
- Qdrant expectation: Qdrant supports overwriting existing collections, so you can restore into the same Qdrant instance, or provide a new instance running the same Qdrant version
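Because the restore target must run the same PostgreSQL major version as the instance the dump came from, comparing versions up front avoids a failed restore later. A minimal sketch; the `same_pg_major` helper is illustrative, and in practice the version strings would come from `psql -t -A -c "SHOW server_version;"` against each instance:

```bash
#!/usr/bin/env bash
# Compare the major component of two PostgreSQL version strings
# (e.g. "15.4" vs "15.6" -> same major; "15.4" vs "16.1" -> mismatch).
same_pg_major() {
  local old_major="${1%%.*}" new_major="${2%%.*}"
  if [ "$old_major" = "$new_major" ]; then
    echo "PostgreSQL major versions match: $old_major"
    return 0
  fi
  echo "major version mismatch: source=$old_major target=$new_major" >&2
  return 1
}

# Example (illustrative hosts):
# old=$(psql -h <old-host> -U <user> -t -A -c "SHOW server_version;")
# new=$(psql -h <new-host> -U <user> -t -A -c "SHOW server_version;")
# same_pg_major "$old" "$new" || echo "provision a matching PG version first"
```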
Before starting the restore process, document:
- Issues encountered: _____________
- Error messages: _____________
- Affected components: _____________
- Restore timestamp: _____________
Follow the same steps as in "Scaling Down Application Deployments" earlier in this document.
Use your organization's standard Helm rollback procedures to revert to the previous release version.
Note: This manual focuses on backup and restore procedures. Perform Helm rollback according to your organization's change management procedures.
```bash
cd pharia-ai-backup-restore

# List available backups
./bin/pharia-backup.sh db restore -l all

# Or manually list
ls -lth database-backups/

# Restore all databases from backup
./bin/pharia-backup.sh db restore all
```

Expected Output:
[2025-01-XX XX:XX:XX] Starting database restore...
[2025-01-XX XX:XX:XX] Restoring database: pharia_ai_db
[2025-01-XX XX:XX:XX] Restore completed successfully
[2025-01-XX XX:XX:XX] All databases restored
Or restore specific database:
```bash
# Restore a specific database
./bin/pharia-backup.sh db restore <database-name>

# Or restore from a specific backup file
./bin/pharia-backup.sh db restore -f database-backups/<backup-file> <database-name>
```

```bash
# Verify database connectivity
psql -h <host> -p <port> -U <user> -d <database> -c "SELECT COUNT(*) FROM information_schema.tables;"

# Check database content (adjust query based on your schema)
psql -h <host> -p <port> -U <user> -d <database> -c "SELECT * FROM <your-table> LIMIT 5;"
```

```bash
# List available secret backups
./bin/pharia-backup.sh secrets restore -l

# Or manually list
ls -lth secrets-backups/

# Restore from the latest backup (with force to overwrite existing secrets)
./bin/pharia-backup.sh secrets restore --latest -f -n pharia-ai
```

Expected Output:
[2025-01-XX XX:XX:XX] Starting secrets restore...
[2025-01-XX XX:XX:XX] Restoring secrets to namespace: pharia-ai
[2025-01-XX XX:XX:XX] Restored secret: secret-name-1
[2025-01-XX XX:XX:XX] Restored secret: secret-name-2
...
[2025-01-XX XX:XX:XX] All secrets restored successfully
# Verify secrets exist
kubectl get secrets -n pharia-ai
# Verify specific secret (example)
kubectl get secret <secret-name> -n pharia-ai -o yaml
```bash
cd ../qdrant-backup-restore

# Update .env file for restore
nano .env
```

Example .env file for restore:
```bash
# Qdrant API key
export QDRANT_API_KEY="your-api-key-here"

# Source hosts (where snapshots are stored)
# Port-forward the Qdrant service to local
export QDRANT_SOURCE_HOSTS="http://localhost:6333"

# Restore hosts (target Qdrant instances)
export QDRANT_RESTORE_HOSTS="http://localhost:6334"

# S3 configuration (for fetching snapshots)
export QDRANT_S3_ENDPOINT_URL="http://minio.default.svc.cluster.local:9000"
export QDRANT_S3_ACCESS_KEY_ID="your-access-key"
export QDRANT_S3_SECRET_ACCESS_KEY="your-secret-key"
export QDRANT_S3_BUCKET_NAME="qdrant-snapshots"

# Auto-discover peers
export GET_PEERS_FROM_CLUSTER_INFO="true"

# Timeout settings
export CURL_TIMEOUT="1800"

# Wait for tasks
export QDRANT_WAIT_ON_TASK="true"

# Optional: filter snapshots by the datetime at which they were taken
# (leave empty to restore all collection snapshots, i.e. a full restore).
# Format: yyyy-mm-dd-hh-mm-ss; a partial prefix such as "2026-01-29" or
# "2026-01-29-08-55" also works. The filter can be extended to match the
# entire snapshot name, e.g.
# prefix-cache-398093660832563-2026-01-29-08-55-49.snapshot, but
# filtering by datetime is usually more sensible.
export QDRANT_SNAPSHOT_DATETIME_FILTER=""
```

```bash
# Source the environment file
source .env

# Verify environment variables
echo "Source hosts: $QDRANT_SOURCE_HOSTS"
echo "Restore hosts: $QDRANT_RESTORE_HOSTS"
echo "S3 bucket: $QDRANT_S3_BUCKET_NAME"

# Restore all snapshots
./qdrant_backup_recovery.sh recover_snap
```

Expected Output:
[2025-01-XX XX:XX:XX] Starting snapshot recovery...
[2025-01-XX XX:XX:XX] Recovering collection: collection_name
[2025-01-XX XX:XX:XX] Snapshot recovery completed successfully
...
[2025-01-XX XX:XX:XX] All snapshots recovered
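To preview what a `QDRANT_SNAPSHOT_DATETIME_FILTER` value would select before running a restore, the filter can be tried against the recorded snapshot names directly, since the timestamp is embedded in each filename. A minimal sketch; the `filter_snapshots` helper is illustrative and assumes the `<collection>-<id>-yyyy-mm-dd-hh-mm-ss.snapshot` naming pattern shown in the .env comments:

```bash
#!/usr/bin/env bash
# Print only the snapshot names whose embedded timestamp starts with the
# given filter (a date prefix such as "2026-01-29" or "2026-01-29-08-55").
filter_snapshots() {
  local filter="$1"
  # Match the filter inside the timestamped tail of the filename.
  grep -E -- "-${filter}[0-9-]*\.snapshot$"
}

# Example: preview which recorded snapshots would match
# filter_snapshots "2026-01-29" < snapshots
```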
```bash
# Restore collection aliases
./qdrant_backup_recovery.sh recover_colla
```

Expected Output:
[2025-01-XX XX:XX:XX] Starting alias recovery...
[2025-01-XX XX:XX:XX] Recovered alias: alias_name -> collection_name
...
[2025-01-XX XX:XX:XX] All aliases recovered
```bash
# Verify collections exist
curl -s http://localhost:6333/collections | jq

# Verify a specific collection
curl -s http://localhost:6333/collections/<collection-name> | jq

# Check collection aliases
curl -s http://localhost:6333/aliases | jq
```

```bash
# Verify all database pods are running
kubectl get pods -n <namespace>

# Verify database connectivity
psql -h <host> -p <port> -U <user> -d <database> -c "SELECT version();"

# Verify Qdrant connectivity
curl -s http://localhost:6333/collections
```

After verifying all data is restored, scale up application deployments:
Follow the same scale-up steps described earlier in this document (see "Post-Upgrade Verification and Scale-Up").
```bash
# Verify all pods are running
kubectl get pods -n <namespace>

# Verify application is accessible
curl -k https://<your-app-url>/health

# Verify database connectivity
kubectl exec -n <namespace> <postgres-pod> -- psql -U <user> -d <database> -c "SELECT version();"

# Verify Qdrant connectivity
kubectl exec -n <namespace> <qdrant-pod> -- curl -s http://localhost:6333/collections
```

✅ Restore Checklist:
- Issues documented
- Application deployments scaled down
- Helm rollback completed (using your organization's procedures)
- PostgreSQL databases restored
- Kubernetes secrets restored
- Qdrant snapshots restored
- Qdrant collection aliases restored
- Infrastructure verified before scale-up
- Application deployments scaled up to original replica counts
- All pods running and ready
- Application is accessible
- Functional tests passed
Symptoms:
Error: connection to server failed
Solutions:
- Verify network connectivity to PostgreSQL host
- Check firewall rules
- Verify credentials in `config.yaml`
- Test connection manually: `psql -h <host> -p <port> -U <user> -d <database>`
Symptoms:
Error: timeout waiting for snapshot creation
Solutions:
- Increase `CURL_TIMEOUT` in the `.env` file (e.g., `3600` for 1 hour)
- Check Qdrant pod resources (CPU/memory)
- Verify S3 connectivity from Qdrant pods
- Check Qdrant logs: `kubectl logs -n <namespace> <qdrant-pod>`
Symptoms:
Error: permission denied
Solutions:
- Verify `kubectl` has appropriate permissions
- Check RBAC rules for the service account
- Verify namespace access: `kubectl auth can-i get secrets -n <namespace>`
Symptoms:
Error: database restore failed
Solutions:
- Verify the backup file exists and is not corrupted
- Check database connection details
- Ensure the database exists: `CREATE DATABASE <name>;`
- Check disk space on the database server
- Review PostgreSQL logs
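A quick way to rule out a truncated or corrupted plain-format dump before retrying: `pg_dump` plain-text output begins with a `-- PostgreSQL database dump` header and ends with a `-- PostgreSQL database dump complete` footer. A minimal sketch; the `check_dump` helper is illustrative and a heuristic only, not proof of integrity:

```bash
#!/usr/bin/env bash
# Heuristic integrity check for a plain-format pg_dump file: non-empty,
# standard header present, completion footer present.
check_dump() {
  local file="$1"
  [ -s "$file" ] || { echo "empty or missing: $file" >&2; return 1; }
  head -n 5 "$file" | grep -q -- "-- PostgreSQL database dump" \
    || { echo "missing pg_dump header: $file" >&2; return 1; }
  tail -n 5 "$file" | grep -q "database dump complete" \
    || { echo "missing completion footer (file truncated?): $file" >&2; return 1; }
  echo "dump looks structurally intact: $file"
}

# Example:
# check_dump database-backups/pharia_ai_db_<timestamp>.sql
```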
Symptoms:
Error: secret already exists
Solutions:
- Use the force flag: `--latest -f`
- Delete the existing secret first: `kubectl delete secret <name> -n <namespace>`
- Verify the namespace is correct
Symptoms:
Error: snapshot recovery failed
Solutions:
- Verify S3 credentials are correct
- Check S3 bucket accessibility
- Verify the snapshot exists in S3
- Check Qdrant logs: `kubectl logs -n <namespace> <qdrant-pod>`
- Verify the collection doesn't already exist (you may need to delete it first)
If you encounter issues not covered in this manual:
- Check Logs:

  ```bash
  # Application logs
  kubectl logs -n <namespace> <pod-name> --tail=100

  # Helm release history
  helm history <release-name> -n <namespace>

  # Kubernetes events
  kubectl get events -n <namespace> --sort-by='.lastTimestamp'
  ```

- Review Documentation
- Contact Support:
- Provide error messages and logs
- Include backup/restore timestamps
- Share relevant configuration (sanitized)
- Deployment replica backup: `deployment-replicas-backup.txt` (created during scale-down)
- PostgreSQL backups: `pharia-ai-backup-restore/database-backups/`
- Kubernetes secrets backups: `pharia-ai-backup-restore/secrets-backups/`
- Qdrant state files: `qdrant-backup-restore/collections`, `snapshots`, `collection_aliases`
- PostgreSQL config: `pharia-ai-backup-restore/config.yaml`
- Qdrant config: `qdrant-backup-restore/.env`
Document Version: 2.0 Last Updated: 2026-02-05 Maintained By: Aleph Alpha Support Team