scotttyso/Cisco-AI-Pods

Cisco AI Pods Runbook Index

Overview

This directory contains comprehensive documentation for the Cisco AI Pods infrastructure. The runbook is organized into focused guides for each component and operational procedure. This is a living document.

Best Practices

Documentation

  • Keep runbooks updated with any customizations
  • Document all deviations from standard procedures
  • Maintain change logs for configuration modifications
  • Review and validate procedures regularly

Security

  • Rotate API tokens and credentials regularly
  • Use encrypted connections for all management
  • Implement proper access controls
  • Perform regular security audits and apply updates

Operations

  • Regular health checks and monitoring
  • Scheduled maintenance windows
  • Automated backup procedures
  • Performance monitoring and optimization

Runbook Components

Network infrastructure deployment guide

  • Cisco Nexus switch configuration
  • VLAN and routing setup
  • Integration with compute and storage
  • Monitoring and maintenance

Automation for C845A/C880A/C885A GPU server deployment

  • Cisco C8XX M8 GPU server deployment
  • Redfish API configuration procedures
  • GPU-optimized BIOS settings
  • Integration with main AI Pods infrastructure
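
The Redfish-based procedures for these servers include checks such as the DateTime sync validation called out in the Quick Start Workflow. The following is a minimal sketch of that one check, not the repository's implementation. It assumes the BMC follows the standard DMTF Redfish Manager schema, where the Manager resource carries a `DateTime` property in ISO 8601 form. The function name, the sample payload, and the reference time are hypothetical.

```python
import json
from datetime import datetime, timezone


def bmc_clock_skew_seconds(manager_json: str, now: datetime) -> float:
    """Return the absolute difference between the BMC clock and `now`, in seconds."""
    manager = json.loads(manager_json)
    # Redfish reports DateTime as an ISO 8601 string with a UTC offset.
    # Older Python versions do not parse a trailing "Z", so normalize it.
    bmc_time = datetime.fromisoformat(manager["DateTime"].replace("Z", "+00:00"))
    return abs((bmc_time - now).total_seconds())


# Hypothetical payload, shaped like a Redfish Manager resource.
sample = json.dumps({"Id": "bmc", "DateTime": "2025-07-12T10:00:05+00:00"})
reference = datetime(2025, 7, 12, 10, 0, 0, tzinfo=timezone.utc)
print(bmc_clock_skew_seconds(sample, reference))  # 5.0
```

In practice the payload would come from a GET against the Manager resource under `/redfish/v1/Managers/`, and the acceptable skew is a site decision.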

Automation for Cisco Intersight deployment

  • Essential setup steps
  • Key configuration files
  • Common troubleshooting
  • Verification procedures

Complete deployment guide covering all components

  • Prerequisites and planning
  • End-to-end deployment procedures
  • Integration steps
  • Post-deployment validation

Automation for Pure Storage deployment

  • FlashArray/FlashBlade configuration
  • Ansible automation procedures
  • Host integration steps
  • Performance optimization

Comprehensive troubleshooting reference

  • Common issues and resolutions
  • Performance troubleshooting
  • Recovery procedures
  • Escalation processes

⚠️ CRITICAL: Prepare the Environment

Follow the steps in Prepare the Environment.

Quick Start Workflow

⚠️ CRITICAL: Always follow the Deployment Execution Order defined in the main runbook.

For New Deployments

  1. Planning Phase:

  2. Phase 1 - Foundation Setup (FIRST):

    • Follow Network Configuration - Phase 1 for management network
    • Establish basic network connectivity before any automation
    • Checkpoint: Verify management network connectivity
  3. Phase 2 - Infrastructure Deployment:

    • Use Intersight Automation for compute infrastructure
    • Checkpoint: Validate Intersight deployment before proceeding
  4. Phase 3 - C845A/C880A/C885A M8 GPU Servers:

    • Follow C845A/C880A/C885A Configuration Guide for additional GPU infrastructure
    • Configure using Redfish API for GPU-specific features
    • Checkpoint: Validate GPU functionality and DateTime sync
  5. Phase 4 - Storage Configuration:

  6. Phase 5 - OpenShift Deployment:

  7. Phase 6 - Integration & Validation:

    • Run verification procedures from each guide
    • Perform end-to-end testing
    • Document any customizations
  8. Phase 7 - Application Platform:

    • Deploy OpenShift/Kubernetes
    • Configure container orchestration
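
The phase ordering and checkpoints above can be sketched as a simple gate: each phase has a checkpoint, and a failed checkpoint stops everything after it. This is an illustration only. The `run_phases` helper and the lambda checkpoints are hypothetical stand-ins for the real verification procedures in each guide.

```python
from typing import Callable, List, Tuple

Phase = Tuple[str, Callable[[], bool]]


def run_phases(phases: List[Phase]) -> List[str]:
    """Evaluate checkpoints in order; stop at the first failure.

    Returns the names of the phases whose checkpoints passed.
    """
    completed = []
    for name, checkpoint in phases:
        if not checkpoint():
            print(f"Checkpoint failed: {name}; stopping before later phases")
            break
        completed.append(name)
    return completed


# Hypothetical checkpoints; real ones would run the checks from each guide.
phases = [
    ("Phase 1 - Foundation Setup", lambda: True),
    ("Phase 2 - Infrastructure Deployment", lambda: False),  # simulated failure
    ("Phase 3 - C845A/C880A/C885A M8 GPU Servers", lambda: True),
]
print(run_phases(phases))
```

Because Phase 2 fails in this example, Phase 3 is never attempted, which mirrors the rule that each checkpoint must pass before proceeding.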

Troubleshooting

  1. Identify Component:

  2. Follow Procedures:

    • Use Troubleshooting Guide for comprehensive procedures
    • Check component-specific guides for detailed steps
    • Escalate following documented procedures

Repository Structure Reference

Cisco-AI-Pods
├── c885/                        # Cisco C885A automation
│   └── main.fsai.yaml           # C885 configuration data model
├── intersight/                  # Cisco Intersight automation
│   ├── global_settings.ezi.yaml # Global Parameters
│   ├── main.tf                  # Main Terraform module
│   ├── organizations/           # Organization data model
│   ├── policies/                # Policy data model
│   ├── pools/                   # Pool data model
│   ├── provider.tf              # Provider Attributes
│   ├── templates/               # Templates data model
│   └── variables.tf             # Terraform sensitive variables
├── network/                     # Network Device Configurations
│   └── *.txt                    # Switch configuration templates
├── openshift/                   # Cisco Intersight automation
│   ├── global_settings.ezi.yaml # Global Parameters
│   ├── main.tf                  # Main Terraform module
│   ├── organizations/           # Organization data model
│   └── policies/                # Policy data model
└── pure_storage/                # Pure Storage automation
    ├── tasks/                   # Ansible playbooks
    ├── vars/                    # Ansible vars
    ├── configure_pure_storage_arrays.yaml  # Top-level Ansible Playbook
    └── requirements.yaml        # Ansible Requirements
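
As a quick sanity check before running any automation, a clone can be compared against the layout above. The sketch below is an assumption-laden helper, not part of the repository: `EXPECTED_PATHS` simply lists the files named in the tree, and `missing_paths` is a hypothetical function name.

```python
from pathlib import Path
from typing import List, Sequence

# Files named in the repository structure reference.
EXPECTED_PATHS = (
    "c885/main.fsai.yaml",
    "intersight/global_settings.ezi.yaml",
    "intersight/main.tf",
    "intersight/provider.tf",
    "intersight/variables.tf",
    "openshift/global_settings.ezi.yaml",
    "openshift/main.tf",
    "pure_storage/configure_pure_storage_arrays.yaml",
    "pure_storage/requirements.yaml",
)


def missing_paths(repo_root: str, expected: Sequence[str] = EXPECTED_PATHS) -> List[str]:
    """Return the expected relative paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in expected if not (root / rel).exists()]
```

Calling `missing_paths(".")` from the top of a clone returns an empty list when every listed file is present.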

Key Configuration Files

Intersight Global Settings

  • Location: Cisco-AI-Pods/intersight/global_settings.ezi.yaml
  • Purpose: Central configuration for Intersight deployment
  • Key Settings: Intersight FQDN, tags, global parameters

Pure Storage Inventory Files

  • Location: Cisco-AI-Pods/pure_storage/vars/main.fsai.yaml
  • Purpose: Ansible inventory for storage automation
  • Content: Storage array IPs, credentials, connection details

Network Templates

  • Location: Cisco-AI-Pods/network/*.txt
  • Purpose: Switch configuration templates
  • Usage: Customize and apply to network devices

Support and Escalation

Internal Support

  • Level 1: Infrastructure team → Component-specific guides
  • Level 2: Senior engineers → Troubleshooting Guide
  • Level 3: Vendor support → Escalation procedures

Vendor Support

  • Cisco TAC: Intersight, UCS, and network issues
  • Pure Storage: Storage array and performance issues
  • DevNet Community: Terraform and Ansible community support

Quick Commands Reference

Terraform Operations

terraform init          # Initialize working directory
terraform validate      # Validate configuration
terraform plan          # Preview changes
terraform apply         # Apply changes
terraform destroy       # Destroy infrastructure
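
The non-destructive commands above are normally run in a fixed order, with the reviewed plan saved to a file so that `apply` executes exactly what was previewed. The wrapper below is a minimal sketch of that sequence. The helper names, the `tfplan` file name, and the `intersight` working directory are assumptions. It defaults to a dry run that only prints the commands.

```python
import subprocess
from typing import List


def terraform_steps(plan_file: str = "tfplan") -> List[List[str]]:
    """Build the init -> validate -> plan -> apply sequence as argv lists."""
    return [
        ["terraform", "init"],
        ["terraform", "validate"],
        ["terraform", "plan", f"-out={plan_file}"],  # save the plan for review
        ["terraform", "apply", plan_file],           # apply only the saved plan
    ]


def run_steps(steps: List[List[str]], workdir: str, dry_run: bool = True) -> None:
    """Print each command; execute it in workdir unless dry_run is set."""
    for argv in steps:
        print("+", " ".join(argv))
        if not dry_run:
            # check=True aborts the sequence on the first non-zero exit code.
            subprocess.run(argv, cwd=workdir, check=True)


run_steps(terraform_steps(), workdir="intersight")
```

Passing `dry_run=False` executes the commands for real. `terraform destroy` is deliberately left out of the sequence.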

Ansible Operations

ansible-galaxy collection install -r requirements.yaml
ansible-playbook configure_pure_storage_arrays.yaml    # Run Pure Storage setup

For full environment and dependency setup, see Prepare the Environment.

Network Operations

copy running-config startup-config  # Save configuration
ping                                # Basic connectivity tests
show bgp ipv4 unicast               # BGP IPv4 unicast table
show interface brief                # Interface status
show ip route                       # IP Routing Table
show vlan brief                     # VLAN configuration
show version                        # Software version
traceroute                          # Path validation
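
The connectivity checks above are run on the switches themselves. From an automation host, a comparable check for the management network checkpoint is a plain TCP connection test. The sketch below uses only the Python standard library. The function name and default timeout are assumptions, and a successful TCP connect shows only that the port is reachable, not that the device is healthy.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, `tcp_reachable("192.0.2.10", 22)` would test SSH reachability of a device at a placeholder documentation address. Substitute the real management IPs for your environment.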

Updates and Maintenance

Runbook Updates

  • Review quarterly for accuracy
  • Update after major infrastructure changes
  • Validate procedures after software updates
  • Incorporate lessons learned from incidents

Infrastructure Updates

  • Follow change management procedures
  • Test in non-production first
  • Document all changes
  • Update runbooks accordingly

Document Information

  • Created: June 13, 2025
  • Version: 1.2
  • Last Updated: July 12, 2025
  • Maintained By: Infrastructure Automation Team

For questions or updates to this runbook, please contact the Infrastructure team or submit an issue in the repository.
