- Introduction
- What is Dataset Manager?
- Key Features
- Architecture Overview
- Prerequisites
- Installation
- Getting Started
- Working with Datasets
- Best Practices
- Troubleshooting
- API Reference
The Dataset Manager is a powerful feature of the NetApp DataOps Toolkit that provides a simplified, intuitive interface for managing datasets backed by NetApp ONTAP storage. It abstracts away the complexities of ONTAP volume management and presents data scientists, data engineers, and developers with a familiar, filesystem-based approach to working with large datasets.
Dataset Manager transforms ONTAP volumes into easy-to-use "datasets" that appear as simple directories on your local filesystem. Each dataset is actually an ONTAP volume with all the enterprise-grade capabilities of NetApp storage—including instant cloning, snapshots, and space efficiency—but accessed through a familiar directory structure.
Traditional approach (without Dataset Manager):
# Manually manage ONTAP volumes
from netapp_dataops.traditional import create_volume, clone_volume, mount_volume
# Create volume
create_volume(volume_name="my_data", volume_size="500GB", junction="/my_data")
# Mount it
mount_volume(volume_name="my_data", mountpoint="/mnt/my_data")
# Work with data...
# Later: unmount, manage snapshots separately, etc.
Dataset Manager approach:
# Simple, intuitive dataset operations
from netapp_dataops.traditional.datasets import Dataset
# Create dataset - automatically creates volume and makes it accessible
dataset = Dataset(name="my_data", max_size="500GB")
# Access files immediately through a simple path
print(f"Dataset location: {dataset.local_file_path}")
# Work with data using standard Python file operations
with open(f"{dataset.local_file_path}/experiment1.csv", "w") as f:
    f.write("data,label\n1,a\n2,b")
- No manual mounting required: Datasets are immediately accessible through a pre-mounted root volume
- Standard file operations: Use Python's built-in file I/O, pandas, or any other tools
- Automatic namespace management: New datasets appear instantly in your local filesystem
- Space-efficient clones: Create full copies in seconds that share data with the source
- Instant snapshots: Capture point-in-time copies with zero performance impact
- Zero data movement: Clones use NetApp FlexClone technology and snapshots use ONTAP Snapshot copies, so neither copies data up front
- Experiment tracking: Snapshot datasets before and after training runs
- Model versioning: Clone datasets to preserve data states for specific model versions
- Reproducibility: Restore exact data states from snapshots
- Collaboration: Share datasets instantly through clones
- Powered by NetApp ONTAP: Benefit from enterprise reliability, performance, and efficiency
- NFS-based: Works with Linux and macOS hosts
- Scalable: Manage datasets from gigabytes to petabytes
- Data protection: Built-in snapshot and replication capabilities
Dataset Manager uses a hierarchical structure with a single "root volume" that serves as the parent for all datasets:
Root Volume (e.g., "dataset_mgr_root")
└── Mounted at: /mnt/datasets (or your chosen location)
    ├── training_data_v1/      ← Dataset 1 (ONTAP volume)
    │   ├── images/
    │   └── labels.csv
    ├── training_data_v2/      ← Dataset 2 (ONTAP volume)
    │   ├── images/
    │   └── labels.csv
    └── inference_data/        ← Dataset 3 (ONTAP volume)
        └── input_data.parquet
How it works:
- Root Volume: A single ONTAP volume mounted permanently on your local system
- Datasets as Volumes: Each dataset is a separate ONTAP volume, automatically junctioned under the root volume
- Transparent Access: The NFS client sees one continuous directory tree
- Automatic Discovery: New datasets appear immediately without manual mounting
Technical Details:
- Uses ONTAP junction paths to create the hierarchy: /<root_volume>/<dataset_name>
- Inherits export policy from root volume for consistent NFS access permissions
- Supports all ONTAP volume features: snapshots, clones, efficiency features
- Compatible with FlexVol and FlexGroup volumes
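The junction-path layout described above can be sketched in a few lines. This is only an illustration of the naming scheme, not the toolkit's code: `junction_path` and `local_path` are hypothetical helper names, and the root volume name and mountpoint are the example values used in this guide.

```python
import posixpath


def junction_path(root_volume: str, dataset_name: str) -> str:
    """Return the ONTAP-side junction path /<root_volume>/<dataset_name>."""
    return posixpath.join("/", root_volume, dataset_name)


def local_path(root_mountpoint: str, dataset_name: str) -> str:
    """Return the client-side directory under the mounted root volume."""
    return posixpath.join(root_mountpoint, dataset_name)


# Example values taken from this guide; substitute your own configuration.
print(junction_path("dataset_mgr_root", "training_data_v1"))
print(local_path("/mnt/datasets", "training_data_v1"))
```

Because each dataset volume is junctioned beneath the root volume, the path on the ONTAP side and the path on the client differ only in their prefix.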
Operating Systems:
- Linux (RHEL, CentOS, Ubuntu, Debian, etc.)
- macOS
Storage System:
- NetApp AFF, FAS, or ONTAP Select (ONTAP 9.7+)
- NetApp Cloud Volumes ONTAP (ONTAP 9.7+)
- Amazon FSx for NetApp ONTAP
Python:
- Python 3.8–3.13
- pip (usually bundled with Python; verify with pip --version)
Required Utilities:
- mount (for checking mount status)
- mountpoint (for validating mount points)
- On Linux: nfs-common (Debian/Ubuntu) or nfs-utils (RHEL/CentOS)
ONTAP Permissions: Your ONTAP user account needs permissions to:
- Create and delete volumes
- Create and delete snapshots
- Clone volumes
- Modify junction paths
Local System Permissions:
- root or sudo access for:
  - Adding entries to /etc/fstab (during setup)
  - Initial mounting operations (if not using fstab)
- Read/write access to the root volume mountpoint
- Network connectivity from your host to the ONTAP data LIF
- NFS protocol enabled on the ONTAP SVM
- Appropriate export policy rules for your host's IP address
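Some of the local prerequisites above can be checked from Python before installing. The following is a minimal sketch using only the standard library; `preflight` is a hypothetical helper, not part of the toolkit, and it does not check ONTAP permissions, network connectivity, or export policies.

```python
import shutil
import sys

# Supported Python range as stated in this guide.
SUPPORTED_MIN = (3, 8)
SUPPORTED_MAX = (3, 13)


def preflight() -> dict:
    """Check the Python version and look for the required utilities on PATH."""
    version = sys.version_info[:2]
    return {
        "python_supported": SUPPORTED_MIN <= version <= SUPPORTED_MAX,
        "mount_found": shutil.which("mount") is not None,
        "mountpoint_found": shutil.which("mountpoint") is not None,
    }


for check, passed in preflight().items():
    print(f"{check}: {'ok' if passed else 'MISSING'}")
```

A failed check here only indicates that something is worth investigating; a utility may be installed in a location that is not on your PATH.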
Dataset Manager mounts ONTAP volumes via NFS, so the NFS client utilities must be installed on your system.
Ubuntu / Debian:
sudo apt-get update && sudo apt-get install -y nfs-common
RHEL / CentOS / Fedora:
sudo dnf install -y nfs-utils
macOS:
NFS client support is built in to macOS — no additional installation is required.
It is recommended to install the toolkit inside a Python virtual environment to keep dependencies isolated.
Create and activate a virtual environment:
python3 -m venv ~/netapp-dataops-venv
source ~/netapp-dataops-venv/bin/activate
Install the toolkit:
pip install netapp-dataops-traditional
This installs the package with support for NetApp ONTAP (AFF, FAS, Cloud Volumes ONTAP, Amazon FSx for NetApp ONTAP, and ONTAP Select).
Tip: Add source ~/netapp-dataops-venv/bin/activate to your shell's startup file (e.g., ~/.bashrc or ~/.zshrc) so the environment is activated automatically in new terminal sessions.
Confirm that the toolkit was installed correctly:
netapp_dataops_cli.py --help
You should see the toolkit's help output. If the command is not found, ensure the virtual environment is activated and that its bin directory is on your PATH.
You can also verify the Python library is importable:
from netapp_dataops.traditional.datasets import Dataset
print("Installation successful!")
Note: Python 3.8–3.13 is required.
Dataset Manager is configured as part of the NetApp DataOps Toolkit setup process. If you haven't configured the toolkit yet, run:
netapp_dataops_cli.py config
During configuration, you'll be asked about Dataset Manager setup. You have two options:
If you don't have an existing root volume, the toolkit can create one for you:
=== Dataset Manager Configuration ===
Do you have a pre-existing Dataset Manager 'root' volume? (yes/no) [no]: no
Would you like to create a new Dataset Manager 'root' volume now? (yes/no) [yes]: yes
Enter desired Dataset Manager "root" volume name [dataset_mgr_root]: dataset_mgr_root
Enter desired local mountpoint for Dataset Manager "root" volume: /mnt/datasets
Creating Dataset Manager root volume 'dataset_mgr_root'...
Root volume 'dataset_mgr_root' created successfully
Would you like to add your Dataset Manager "root" volume to /etc/fstab now? (yes/no) [yes]: yes
Elevated privileges required to modify /etc/fstab...
[sudo] password for user:
Successfully added entry to /etc/fstab
What happens:
- A small (1GB) ONTAP volume named dataset_mgr_root is created
- The volume is junctioned at /dataset_mgr_root
- An entry is added to /etc/fstab for automatic mounting on reboot
- The volume is mounted at your specified mountpoint
If you already have a root volume created:
=== Dataset Manager Configuration ===
Do you have a pre-existing Dataset Manager 'root' volume? (yes/no) [no]: yes
Enter Dataset Manager "root" volume name (or 'abort' to cancel): my_existing_root
Enter desired local mountpoint for Dataset Manager "root" volume: /mnt/datasets
Would you like to add your Dataset Manager "root" volume to /etc/fstab now? (yes/no) [yes]: yes
Requirements for existing volumes:
- Volume must exist in ONTAP
- Junction path should be /<volume_name> (e.g., /my_existing_root)
- Volume should have an appropriate export policy for your host
After configuration, verify that Dataset Manager is ready:
from netapp_dataops.traditional.datasets import Dataset, get_datasets
# This should work without errors if configuration is correct
datasets = get_datasets()
print(f"Dataset Manager is configured. Found {len(datasets)} existing datasets.")
You can also check that the root volume is mounted:
# Check mount status
mount | grep dataset_mgr_root
# Check directory exists and is accessible
ls -la /mnt/datasets
Once configured, creating a dataset is simple:
from netapp_dataops.traditional.datasets import Dataset
# Create a new dataset with 100GB maximum size
my_dataset = Dataset(name="my_first_dataset", max_size="100GB")
print("Dataset created!")
print(f"Location: {my_dataset.local_file_path}")
print(f"Size: {my_dataset.max_size}")
# Start working with your data immediately
import pandas as pd
# Create a sample CSV file
df = pd.DataFrame({
'feature1': [1, 2, 3],
'feature2': [4, 5, 6],
'label': ['a', 'b', 'c']
})
# Save to your dataset
df.to_csv(f"{my_dataset.local_file_path}/training_data.csv", index=False)
print("Data saved to dataset!")
What just happened:
- An ONTAP volume named my_first_dataset was created with 100GB capacity
- The volume was automatically junctioned at /dataset_mgr_root/my_first_dataset
- The dataset is immediately accessible at /mnt/datasets/my_first_dataset
- You can use standard file operations to work with the data
Create a new dataset by specifying a name and maximum size:
from netapp_dataops.traditional.datasets import Dataset
# Create dataset - various size formats supported
dataset = Dataset(name="training_data_v1", max_size="500GB")
# Size can be specified in different units
dataset2 = Dataset(name="small_dataset", max_size="10GB")
dataset3 = Dataset(name="large_dataset", max_size="5TB")
dataset4 = Dataset(name="tiny_dataset", max_size="100MB")
Size format options:
- GB - Gigabytes (e.g., "100GB")
- TB - Terabytes (e.g., "5TB")
- MB - Megabytes (e.g., "500MB")
- PB - Petabytes (e.g., "2PB")
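To compare sizes or plan capacity, it can help to convert these strings to bytes. The sketch below is an illustration only: `size_to_bytes` is a hypothetical helper, it accepts only whole numbers followed by one of the four units listed above, and it assumes binary multiples (1GB = 1024^3 bytes), which may not match how the toolkit or ONTAP interprets the suffix.

```python
import re

# Assumption: binary multiples. Adjust if your environment uses decimal units.
_UNITS = {"MB": 1024**2, "GB": 1024**3, "TB": 1024**4, "PB": 1024**5}
_SIZE_RE = re.compile(r"^(\d+)(MB|GB|TB|PB)$")


def size_to_bytes(size: str) -> int:
    """Convert a size string such as '100GB' to a byte count."""
    match = _SIZE_RE.match(size.strip().upper())
    if match is None:
        raise ValueError(f"Unrecognized size format: {size!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


print(size_to_bytes("100MB"))
print(size_to_bytes("5TB"))
```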
Important notes:
- Dataset names must be unique within your ONTAP system
- Names follow ONTAP volume naming conventions
- The max_size is the thin-provisioned maximum; actual space consumption grows as you add data
- Use print_output=True to see detailed operation messages: dataset = Dataset(name="my_data", max_size="100GB", print_output=True)
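Because dataset names are ONTAP volume names, a quick local check can catch invalid names before any call to the storage system. The sketch below assumes the commonly documented ONTAP rules (begin with a letter or underscore, contain only letters, digits, and underscores, at most 203 characters); confirm them against the documentation for your ONTAP version. `is_valid_dataset_name` is a hypothetical helper, not part of the toolkit.

```python
import re

# Assumed ONTAP volume naming rules; verify for your ONTAP release.
_NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,202}")


def is_valid_dataset_name(name: str) -> bool:
    """Return True if the name satisfies the assumed volume naming rules."""
    return _NAME_RE.fullmatch(name) is not None


for candidate in ["training_data_v1", "my-dataset", "1st_dataset"]:
    print(candidate, is_valid_dataset_name(candidate))
```

This check does not verify uniqueness; a name that passes can still collide with an existing volume.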
To work with an existing dataset, simply create a Dataset instance without specifying max_size:
from netapp_dataops.traditional.datasets import Dataset
# Bind to existing dataset - no max_size needed
dataset = Dataset(name="training_data_v1")
print(f"Dataset: {dataset.name}")
print(f"Size: {dataset.max_size}")
print(f"Location: {dataset.local_file_path}")
print(f"Is Clone: {dataset.is_clone}")
if dataset.is_clone:
    print(f"Source: {dataset.source_dataset_name}")
Attributes available:
- name - Dataset name
- max_size - Maximum provisioned size (e.g., "500GB")
- local_file_path - Full path to the dataset directory
- is_clone - Boolean indicating if this is a cloned dataset
- source_dataset_name - Original dataset name (for clones)
Retrieve all existing datasets with a single function call:
from netapp_dataops.traditional.datasets import get_datasets
# Get all datasets
datasets = get_datasets()
print(f"Found {len(datasets)} datasets:")
for ds in datasets:
    print(f" - {ds.name} ({ds.max_size})")
    if ds.is_clone:
        print(f" └─ Clone of: {ds.source_dataset_name}")
# With verbose output
datasets = get_datasets(print_output=True)
Example output:
Found 3 datasets:
- training_data_v1 (500GB)
- inference_data (50GB)
- training_data_v2 (500GB)
└─ Clone of: training_data_v1
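When many clones exist, grouping them by source makes the lineage easier to read. The sketch below works on any objects that expose the name, is_clone, and source_dataset_name attributes described above. `DatasetInfo` is a hypothetical stand-in used so the example runs without a storage system; in practice you would pass the list returned by get_datasets().

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class DatasetInfo:
    """Hypothetical stand-in exposing the attributes used below."""
    name: str
    is_clone: bool = False
    source_dataset_name: Optional[str] = None


def clones_by_source(datasets) -> Dict[str, List[str]]:
    """Map each source dataset name to the names of its clones."""
    lineage = defaultdict(list)
    for ds in datasets:
        if ds.is_clone and ds.source_dataset_name:
            lineage[ds.source_dataset_name].append(ds.name)
    return dict(lineage)


sample = [
    DatasetInfo("training_data_v1"),
    DatasetInfo("inference_data"),
    DatasetInfo("training_data_v2", is_clone=True, source_dataset_name="training_data_v1"),
]
print(clones_by_source(sample))
```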
Datasets appear as regular directories, so you can use standard file operations:
from netapp_dataops.traditional.datasets import Dataset
import os
import shutil
dataset = Dataset(name="my_data")
# Create directories
os.makedirs(f"{dataset.local_file_path}/raw", exist_ok=True)
os.makedirs(f"{dataset.local_file_path}/processed", exist_ok=True)
# Write files
with open(f"{dataset.local_file_path}/raw/data.txt", "w") as f:
    f.write("Important data")
# Copy files
shutil.copy("local_file.csv", f"{dataset.local_file_path}/raw/")
# Use with pandas
import pandas as pd
df = pd.read_csv("source.csv")
df.to_parquet(f"{dataset.local_file_path}/processed/data.parquet")
# Use with any library
import numpy as np
some_array = np.arange(10)  # Placeholder array for illustration
os.makedirs(f"{dataset.local_file_path}/arrays", exist_ok=True)  # np.save does not create directories
np.save(f"{dataset.local_file_path}/arrays/features.npy", some_array)
Use the get_files() method to retrieve information about all files in a dataset:
dataset = Dataset(name="my_data")
# Get list of all files with metadata
files = dataset.get_files()
print(f"Dataset contains {len(files)} files:")
for file_info in files:
    print(f" {file_info['filename']}")
    print(f" Path: {file_info['filepath']}")
    print(f" Size: {file_info['size_human']}")
    print(f" Bytes: {file_info['size']}")
Example output:
Dataset contains 3 files:
training_data.csv
Path: /mnt/datasets/my_data/training_data.csv
Size: 245.3 MB
Bytes: 257234567
model_config.json
Path: /mnt/datasets/my_data/model_config.json
Size: 2.1 KB
Bytes: 2148
features.npy
Path: /mnt/datasets/my_data/arrays/features.npy
Size: 1.2 GB
Bytes: 1288490188
Use cases:
- Auditing dataset contents
- Calculating total data size
- Finding specific files
- Generating reports
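The use cases above reduce to simple operations on the list of dictionaries. The sketch below uses sample data shaped like the example output so it runs without a dataset; `total_bytes` and `files_with_suffix` are hypothetical helper names, and in practice you would pass the result of dataset.get_files().

```python
from pathlib import PurePosixPath

# Sample entries using the keys documented above (size_human omitted for brevity).
sample_files = [
    {"filename": "training_data.csv",
     "filepath": "/mnt/datasets/my_data/training_data.csv", "size": 257234567},
    {"filename": "model_config.json",
     "filepath": "/mnt/datasets/my_data/model_config.json", "size": 2148},
    {"filename": "features.npy",
     "filepath": "/mnt/datasets/my_data/arrays/features.npy", "size": 1288490188},
]


def total_bytes(files) -> int:
    """Sum the size of every file entry, in bytes."""
    return sum(f["size"] for f in files)


def files_with_suffix(files, suffix: str) -> list:
    """Return the full paths of files whose name ends with the given suffix."""
    return [f["filepath"] for f in files
            if PurePosixPath(f["filename"]).suffix == suffix]


print(total_bytes(sample_files))
print(files_with_suffix(sample_files, ".csv"))
```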
Snapshots provide point-in-time copies of your dataset. They're instant, consume minimal space initially, and are perfect for tracking dataset versions.
dataset = Dataset(name="training_data")
# Create snapshot with automatic timestamp-based name
snapshot_name = dataset.snapshot()
print(f"Created snapshot: {snapshot_name}")
# Output: Created snapshot: training_data_20240212_143022
# Create snapshot with custom name
snapshot_name = dataset.snapshot(name="before_cleaning")
print(f"Created snapshot: {snapshot_name}")
# Output: Created snapshot: before_cleaning
List all snapshots for a dataset:
dataset = Dataset(name="training_data")
# Get all snapshots
snapshots = dataset.get_snapshots()
print(f"Dataset has {len(snapshots)} snapshots:")
for snap in snapshots:
    print(f" - {snap['name']}")
    print(f" Created: {snap['create_time']}")
Example output:
Dataset has 3 snapshots:
- before_cleaning
Created: 2024-02-12 14:30:22
- after_cleaning
Created: 2024-02-12 15:45:10
- final_version
Created: 2024-02-12 18:20:00
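A snapshot listing like this can drive a simple retention policy. The sketch below selects snapshots older than a given age, oldest first. It assumes create_time is a string in the "YYYY-MM-DD HH:MM:SS" format shown in the example output, which may differ in your environment; `snapshots_older_than` is a hypothetical helper, and it only selects candidates without deleting anything.

```python
from datetime import datetime, timedelta

# Assumed timestamp format, taken from the example output above.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def snapshots_older_than(snapshots, days: int, now: datetime) -> list:
    """Return snapshots created more than `days` before `now`, oldest first."""
    cutoff = now - timedelta(days=days)
    dated = [(datetime.strptime(s["create_time"], TIME_FORMAT), s) for s in snapshots]
    old = [(created, s) for created, s in dated if created < cutoff]
    return [s for _, s in sorted(old, key=lambda pair: pair[0])]


sample_snapshots = [
    {"name": "before_cleaning", "create_time": "2024-02-12 14:30:22"},
    {"name": "after_cleaning", "create_time": "2024-02-12 15:45:10"},
    {"name": "final_version", "create_time": "2024-02-12 18:20:00"},
]
reference_time = datetime(2024, 2, 13, 15, 0, 0)
for snap in snapshots_older_than(sample_snapshots, days=1, now=reference_time):
    print(snap["name"])
```

Passing the reference time explicitly keeps the selection reproducible, which is useful when testing a retention rule before acting on it.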
1. Before/After Snapshots:
# Before making changes
dataset.snapshot(name="before_transform")
# Make your changes
transform_data(dataset.local_file_path)
# After changes complete
dataset.snapshot(name="after_transform")
2. Experiment Tracking:
# Snapshot before each training run
experiment_id = "exp_001"
dataset.snapshot(name=f"before_{experiment_id}")
# Train your model
train_model(dataset.local_file_path, experiment_id)
# Snapshot after training
dataset.snapshot(name=f"after_{experiment_id}")
3. Daily/Periodic Snapshots:
from datetime import datetime
# Daily snapshot
date_str = datetime.now().strftime("%Y%m%d")
dataset.snapshot(name=f"daily_{date_str}")
Cloning creates a new dataset that's an exact copy of the source dataset. Thanks to NetApp FlexClone technology, clones:
- Are created in seconds (regardless of dataset size)
- Initially consume almost no additional storage space
- Share unchanged data blocks with the source
- Become independent as you modify data
source_dataset = Dataset(name="training_data_v1")
# Create a clone
cloned_dataset = source_dataset.clone(name="training_data_v2")
print(f"Clone created: {cloned_dataset.name}")
print(f"Location: {cloned_dataset.local_file_path}")
print(f"Is Clone: {cloned_dataset.is_clone}")
print(f"Source: {cloned_dataset.source_dataset_name}")
# Modify clone without affecting source
with open(f"{cloned_dataset.local_file_path}/new_file.txt", "w") as f:
    f.write("This only exists in the clone")
Common cloning scenarios:
1. Experimentation:
# Clone production data for testing
prod_data = Dataset(name="production_dataset")
test_data = prod_data.clone(name="test_dataset")
# Experiment freely without risk
run_experimental_pipeline(test_data.local_file_path)
# Delete test dataset when done
test_data.delete()
2. Team Collaboration:
# Each team member gets their own copy
base_dataset = Dataset(name="shared_data")
alice_data = base_dataset.clone(name="alice_working_copy")
bob_data = base_dataset.clone(name="bob_working_copy")
# Each person can work independently
3. Model Training Variants:
# Create clones for different preprocessing approaches
raw_data = Dataset(name="raw_training_data")
# Clone for each variant
normalized = raw_data.clone(name="normalized_training")
standardized = raw_data.clone(name="standardized_training")
augmented = raw_data.clone(name="augmented_training")
# Apply different transformations to each
apply_normalization(normalized.local_file_path)
apply_standardization(standardized.local_file_path)
apply_augmentation(augmented.local_file_path)
4. Clone from Snapshot: While the Dataset API doesn't directly support cloning from a snapshot, you can use the underlying volume operations:
from netapp_dataops.traditional import clone_volume
# Clone from a specific snapshot
clone_volume(
new_volume_name="restored_data",
source_volume_name="training_data",
source_snapshot_name="before_cleaning",
junction="/dataset_mgr_root/restored_data"
)
# Access as a dataset
restored = Dataset(name="restored_data")
Permanently delete datasets when they're no longer needed:
dataset = Dataset(name="temporary_data")
# Delete the dataset (both clones and originals can be deleted by default)
dataset.delete()
Safety features:
By default, delete() allows deletion of both clones and original datasets. To restrict deletion to clones only, set delete_non_clone=False:
# This will work (both clones and originals can be deleted by default)
clone = Dataset(name="my_clone")
clone.delete() # ✓ Works
original = Dataset(name="my_original_data")
original.delete() # ✓ Also works (delete_non_clone defaults to True)
# To restrict deletion to clones only:
original.delete(delete_non_clone=False) # ✗ Will raise error if not a clone
Complete cleanup example:
from netapp_dataops.traditional.datasets import Dataset, get_datasets
# Find and delete all datasets matching a pattern
all_datasets = get_datasets()
for ds in all_datasets:
    if ds.name.startswith("temp_"):
        print(f"Deleting temporary dataset: {ds.name}")
        ds.delete(delete_non_clone=True)
Important warnings:
- ⚠️ Deletion is permanent and cannot be undone
- ⚠️ All data in the dataset will be lost
- ⚠️ Snapshots of the dataset will also be deleted
- ⚠️ Ensure you have backups if data is important
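Given that deletion cannot be undone, one cautious pattern is to compute and review the list of candidates before calling delete() on anything. The sketch below is a dry run only. `DatasetInfo` is a hypothetical stand-in for the Dataset objects returned by get_datasets(), and `deletion_candidates` is an illustrative helper, not a toolkit function.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DatasetInfo:
    """Hypothetical stand-in exposing the attributes used below."""
    name: str
    is_clone: bool


def deletion_candidates(datasets, prefix: str, clones_only: bool = True) -> List[str]:
    """Return names matching the prefix, optionally restricted to clones."""
    return [
        ds.name
        for ds in datasets
        if ds.name.startswith(prefix) and (ds.is_clone or not clones_only)
    ]


sample = [
    DatasetInfo("temp_scratch", is_clone=True),
    DatasetInfo("temp_import", is_clone=False),
    DatasetInfo("production_data", is_clone=False),
]
# Review this output before deleting anything.
print(deletion_candidates(sample, "temp_"))
print(deletion_candidates(sample, "temp_", clones_only=False))
```

Restricting the default to clones mirrors the intent of delete_non_clone=False: original datasets are only included when explicitly requested.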
Use clear, descriptive names for datasets:
# ✓ Good (imports omitted for brevity - use: from netapp_dataops.traditional.datasets import Dataset)
Dataset(name="customer_data_2024", max_size="1TB")
Dataset(name="training_images_v1", max_size="500GB")
Dataset(name="model_inference_cache", max_size="50GB")
# ✗ Avoid
Dataset(name="data", max_size="1TB") # Too generic
Dataset(name="test123", max_size="1TB") # Not descriptive
Dataset(name="my-dataset", max_size="1TB") # Hyphens may cause issues
Recommended patterns:
- {project}_{purpose}_{version} - e.g., "fraud_detection_training_v3"
- {team}_{dataset_type} - e.g., "ml_team_feature_store"
- {description}_{timestamp} - e.g., "experiment_data_20240212" (put the timestamp last, since ONTAP volume names cannot begin with a digit)
Develop a consistent snapshotting strategy:
# Assuming: from netapp_dataops.traditional.datasets import Dataset
# Before major operations
dataset.snapshot(name="before_preprocessing")
preprocess_data(dataset.local_file_path)
# After major milestones
dataset.snapshot(name="after_preprocessing")
dataset.snapshot(name="before_training")
train_model(dataset.local_file_path)
dataset.snapshot(name="after_training")
# For experiments
dataset.snapshot(name=f"exp_{experiment_id}_start")
# ... experiment ...
dataset.snapshot(name=f"exp_{experiment_id}_end")
Keep clones organized and clean up when finished:
from netapp_dataops.traditional.datasets import Dataset, get_datasets
# Prefix clones with purpose
base_data = Dataset(name="production_data")
test_clone = base_data.clone(name="test_experiment_1")
# Clean up temporary clones regularly
all_datasets = get_datasets()
for ds in all_datasets:
    if ds.name.startswith("test_") and ds.is_clone:
        print(f"Cleaning up test clone: {ds.name}")
        ds.delete() # Safe - clones can be deleted by default
Choose appropriate sizes based on your data growth:
# Assuming: from netapp_dataops.traditional.datasets import Dataset
# Add headroom for growth
current_data_size = "50GB"
max_size = "200GB" # 4x current size for growth
dataset = Dataset(name="my_data", max_size=max_size)
# For dynamic workloads, go larger
dynamic_dataset = Dataset(
    name="streaming_data",
    max_size="2TB" # Room for accumulation
)
Implement proper error handling:
from netapp_dataops.traditional.datasets import (
Dataset,
DatasetError,
DatasetExistsError,
DatasetConfigError,
DatasetVolumeError
)
try:
    dataset = Dataset(name="my_data", max_size="100GB")
except DatasetExistsError:
    print("Dataset already exists, binding to existing...")
    dataset = Dataset(name="my_data")
except DatasetConfigError as e:
    print(f"Configuration error: {e}")
    print("Please run: netapp_dataops_cli.py config")
except DatasetVolumeError as e:
    print(f"Volume operation failed: {e}")
except DatasetError as e:
    print(f"Dataset error: {e}")
Track your datasets and their usage:
from netapp_dataops.traditional.datasets import get_datasets
def audit_datasets():
    """Generate a report of all datasets, their file counts, and bytes used."""
    datasets = get_datasets()
    print(f"\n{'Dataset Name':<30} {'Size':<15} {'Files':<10} {'Used (bytes)':<15} {'Type':<10}")
    print("-" * 85)
    for ds in datasets:
        files = ds.get_files()
        total_size = sum(f['size'] for f in files)  # Bytes actually stored in files
        ds_type = "Clone" if ds.is_clone else "Original"
        print(f"{ds.name:<30} {ds.max_size:<15} {len(files):<10} {total_size:<15} {ds_type:<10}")
# Run regularly
audit_datasets()
Error message:
DatasetConfigError: Dataset manager is not enabled. Run 'netapp_dataops_cli.py config' to configure.
Solution:
# Run configuration wizard
netapp_dataops_cli.py config
# Follow prompts to enable Dataset Manager
Error message:
DatasetConfigError: Root mountpoint '/mnt/datasets' is not accessible.
Diagnosis:
# Check if mountpoint exists
ls -la /mnt/datasets
# Check if it's mounted
mount | grep datasets
# Check fstab entry
cat /etc/fstab | grep dataset_mgr_root
Solution:
# If not mounted, mount it manually
sudo mount -a
# Or mount specifically
sudo mount -t nfs data_lif:/dataset_mgr_root /mnt/datasets
# Verify
df -h /mnt/datasets
Error message:
Error: Junction path '/dataset_mgr_root/my_data' is already in use by another volume.
Solution: Use a different dataset name or investigate the conflicting volume:
from netapp_dataops.traditional import list_volumes
# List all volumes to find the conflict
volumes = list_volumes(print_output=True)
# Choose a different name
dataset = Dataset(name="my_data_v2", max_size="100GB")
Error message:
Volume 'my_data' already exists but is not managed by the Dataset Manager
Explanation: The volume exists but isn't junctioned under the root volume.
Solution: Either:
- Use a different name for your dataset
- Manually fix the existing volume's junction path in ONTAP
- Delete the conflicting volume (if safe to do so)
# Option 1: Different name
dataset = Dataset(name="my_data_new", max_size="100GB")
# Option 3: Delete conflicting volume (careful!)
from netapp_dataops.traditional import delete_volume
delete_volume(volume_name="my_data", delete_non_clone=True)
Symptoms:
- Commands hang when accessing dataset
- "Stale file handle" errors
Diagnosis:
# Check mount status
mount | grep dataset_mgr_root
# Try to access
ls /mnt/datasets # May hang
Solution:
# Force unmount (may require root)
sudo umount -f /mnt/datasets
# Remount
sudo mount -a
# Verify
ls /mnt/datasets
Symptoms:
PermissionError: [Errno 13] Permission denied: '/mnt/datasets/my_data/file.txt'
Diagnosis:
# Check permissions
ls -la /mnt/datasets/my_data/
# Check NFS mount options
mount | grep datasets
Solution:
- Check ONTAP export policy rules
- Verify NFS mount options
- Check local file permissions
# Check current user ID
id
# Verify export policy in ONTAP allows your host
# Update export policy if needed (in ONTAP CLI):
# vserver export-policy rule create -vserver svm_name -policyname default -clientmatch your_ip -rorule sys -rwrule sys -superuser sys
Error message:
DatasetVolumeError: Failed to create snapshot for dataset 'my_data'
Common causes:
- Maximum snapshot count reached
- Snapshot with same name already exists
- Insufficient ONTAP permissions
Solution:
# List existing snapshots
dataset = Dataset(name="my_data")
snapshots = dataset.get_snapshots()
print(f"Current snapshot count: {len(snapshots)}")
# Delete old snapshots if needed
from netapp_dataops.traditional.ontap import delete_snapshot
for snap in snapshots[:10]: # Deletes the first 10 returned; confirm the list is ordered oldest-first before relying on this
    delete_snapshot(volume_name="my_data", snapshot_name=snap['name'])
# Use unique snapshot names
from datetime import datetime
unique_name = f"snapshot_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
dataset.snapshot(name=unique_name)
Get detailed information about operations:
# Enable output for all operations
dataset = Dataset(name="my_data", max_size="100GB", print_output=True)
dataset.snapshot(name="debug_snap")
cloned = dataset.clone(name="debug_clone")
Verify your configuration file:
from netapp_dataops.traditional.core import _retrieve_config
config = _retrieve_config(print_output=True)
print("Dataset Manager enabled:", config.get('datasetManagerEnabled'))
print("Root volume:", config.get('datasetManagerRootVolume'))
print("Root mountpoint:", config.get('datasetManagerRootMountpoint'))
Test connection to ONTAP:
from netapp_dataops.traditional import list_volumes
# This will fail if ONTAP connection is broken
try:
    volumes = list_volumes(print_output=True)
    print("ONTAP connection successful")
except Exception as e:
    print(f"ONTAP connection failed: {e}")
Dataset(name: str, max_size: Optional[str] = None, print_output: bool = False)
Create or access a dataset.
Parameters:
- name (str): Dataset name (ONTAP volume name)
- max_size (str, optional): Maximum size for new datasets (e.g., "100GB"). Required when creating new datasets; omit when accessing existing ones.
- print_output (bool): Whether to print detailed operation messages. Default: False
Returns:
- Dataset instance
Raises:
- DatasetExistsError: Dataset name already exists (when creating new)
- DatasetConfigError: Dataset Manager not configured or configuration invalid
- DatasetVolumeError: ONTAP volume operation failed
- DatasetError: General dataset error
Examples:
# Create new dataset
dataset = Dataset(name="my_data", max_size="500GB")
# Access existing dataset
dataset = Dataset(name="my_data")
# With verbose output
dataset = Dataset(name="my_data", max_size="100GB", print_output=True)
dataset.name # str: Dataset name
dataset.max_size # str: Maximum provisioned size (e.g., "500GB")
dataset.local_file_path # str: Full path to dataset directory
dataset.is_clone # bool: True if dataset is a clone
dataset.source_dataset_name # str: Source dataset name (if clone), else None
dataset.get_files() -> List[Dict[str, Any]]
Get list of all files in the dataset.
Returns: List of dictionaries with keys:
- filename (str): File name
- filepath (str): Full path to file
- size (int): File size in bytes
- size_human (str): Human-readable size (e.g., "1.2 GB")
Raises:
DatasetError: Failed to list files
Example:
files = dataset.get_files()
for f in files:
    print(f"{f['filename']}: {f['size_human']}")
dataset.snapshot(name: Optional[str] = None) -> str
Create a snapshot of the dataset.
Parameters:
- name (str, optional): Snapshot name. If not provided, an automatic timestamp-based name is used.
Returns:
- str: Name of created snapshot
Raises:
DatasetVolumeError: Snapshot creation failed
Example:
# Auto-generated name
snap_name = dataset.snapshot()
print(snap_name) # "my_data_20240212_143022"
# Custom name
snap_name = dataset.snapshot(name="before_training")
dataset.get_snapshots() -> List[Dict[str, Any]]
Get list of all snapshots for the dataset.
Returns: List of dictionaries with keys:
- name (str): Snapshot name
- create_time (str): Snapshot creation timestamp
Raises:
DatasetVolumeError: Failed to list snapshots
Example:
snapshots = dataset.get_snapshots()
for snap in snapshots:
    print(f"{snap['name']} created at {snap['create_time']}")
dataset.clone(name: str) -> Dataset
Create a space-efficient clone of the dataset.
Parameters:
- name (str): Name for the cloned dataset
Returns:
- Dataset: New Dataset instance representing the clone
Raises:
- DatasetExistsError: Clone name already exists
- DatasetVolumeError: Clone operation failed
Example:
source = Dataset(name="training_data")
clone = source.clone(name="test_data")
print(f"Clone created at: {clone.local_file_path}")
dataset.delete(delete_non_clone: bool = True)
Permanently delete the dataset.
Parameters:
- delete_non_clone (bool): If True, allows deletion of non-clone datasets. If False, only clones can be deleted. Default: True
Raises:
DatasetVolumeError: Deletion failed
Warning: Deletion is permanent and cannot be undone. All data and snapshots will be lost.
Example:
# Delete a clone
clone_dataset.delete()
# Delete an original dataset (allowed by default; the flag is shown explicitly here)
original_dataset.delete(delete_non_clone=True)
get_datasets(print_output: bool = False) -> List[Dataset]
Get all existing datasets managed by Dataset Manager.
Parameters:
- print_output (bool): Whether to print detailed information. Default: False
Returns:
- List[Dataset]: List of all Dataset instances
Raises:
- DatasetConfigError: Dataset Manager not configured
- DatasetVolumeError: Failed to retrieve datasets
Example:
from netapp_dataops.traditional.datasets import get_datasets
# Get all datasets
datasets = get_datasets()
print(f"Found {len(datasets)} datasets")
# With verbose output
datasets = get_datasets(print_output=True)
- DatasetError: Base exception for all Dataset Manager errors.
- DatasetExistsError: Raised when attempting to create a dataset that already exists.
- DatasetConfigError: Raised when Dataset Manager configuration is invalid or missing.
- DatasetVolumeError: Raised when an ONTAP volume operation fails.
Example usage:
from netapp_dataops.traditional.datasets import (
    Dataset,
    DatasetError,
    DatasetExistsError,
    DatasetConfigError,
    DatasetVolumeError
)
try:
    dataset = Dataset(name="my_data", max_size="100GB")
except DatasetExistsError:
    print("Dataset already exists")
except DatasetConfigError:
    print("Please run: netapp_dataops_cli.py config")
except DatasetVolumeError as e:
    print(f"ONTAP operation failed: {e}")
except DatasetError as e:
    # Catch the base class last so the specific handlers above run first
    print(f"General error: {e}")
Here's a complete workflow demonstrating Dataset Manager capabilities:
#!/usr/bin/env python3
"""
Complete Dataset Manager workflow example.
Demonstrates dataset creation, file management, snapshots, cloning, and cleanup.
"""
from netapp_dataops.traditional.datasets import Dataset, get_datasets, DatasetError
import pandas as pd
import numpy as np
def main():
    print("=== Dataset Manager Workflow Demo ===\n")
    # Step 1: Create initial dataset
    print("Step 1: Creating initial dataset...")
    try:
        training_data = Dataset(
            name="ml_training_data_v1",
            max_size="200GB",
            print_output=True
        )
        print(f"✓ Dataset created at: {training_data.local_file_path}\n")
    except DatasetError as e:
        print(f"✗ Error: {e}\n")
        return
    # Step 2: Add some data
    print("Step 2: Adding training data...")
    df = pd.DataFrame({
        'feature1': np.random.rand(1000),
        'feature2': np.random.rand(1000),
        'label': np.random.randint(0, 2, 1000)
    })
    df.to_csv(f"{training_data.local_file_path}/training_data.csv", index=False)
    print("✓ Data added\n")
    # Step 3: Create snapshot before training
    print("Step 3: Creating snapshot before training...")
    snap_name = training_data.snapshot(name="before_training")
    print(f"✓ Snapshot created: {snap_name}\n")
    # Step 4: Simulate model training and add results
    print("Step 4: Simulating training and adding results...")
    results = pd.DataFrame({
        'epoch': range(10),
        'loss': np.random.rand(10),
        'accuracy': np.random.rand(10)
    })
    results.to_csv(f"{training_data.local_file_path}/training_results.csv", index=False)
    print("✓ Training results added\n")
    # Step 5: Create post-training snapshot
    print("Step 5: Creating snapshot after training...")
    snap_name = training_data.snapshot(name="after_training")
    print(f"✓ Snapshot created: {snap_name}\n")
    # Step 6: Clone for experimentation
    print("Step 6: Creating clone for experimentation...")
    experiment_data = training_data.clone(name="ml_experiment_v1")
    print(f"✓ Clone created at: {experiment_data.local_file_path}\n")
    # Step 7: Modify clone
    print("Step 7: Modifying cloned dataset...")
    with open(f"{experiment_data.local_file_path}/experiment_notes.txt", "w") as f:
        f.write("Experiment with modified hyperparameters\n")
    print("✓ Experiment data modified\n")
    # Step 8: List all files in datasets
    print("Step 8: Listing files in original dataset...")
    files = training_data.get_files()
    print(f"Original dataset has {len(files)} files:")
    for f in files:
        print(f" - {f['filename']}: {f['size_human']}")
    print()
    print("Listing files in cloned dataset...")
    files = experiment_data.get_files()
    print(f"Cloned dataset has {len(files)} files:")
    for f in files:
        print(f" - {f['filename']}: {f['size_human']}")
    print()
    # Step 9: List all snapshots
    print("Step 9: Listing all snapshots...")
    snapshots = training_data.get_snapshots()
    print(f"Dataset has {len(snapshots)} snapshots:")
    for snap in snapshots:
        print(f" - {snap['name']} ({snap['create_time']})")
    print()
    # Step 10: List all datasets
    print("Step 10: Listing all datasets...")
    all_datasets = get_datasets()
    print(f"Found {len(all_datasets)} total datasets:")
    for ds in all_datasets:
        ds_type = "clone" if ds.is_clone else "original"
        print(f" - {ds.name} ({ds.max_size}) [{ds_type}]")
        if ds.is_clone:
            print(f" └─ Source: {ds.source_dataset_name}")
    print()
    # Step 11: Cleanup (optional)
    print("Step 11: Cleanup...")
    response = input("Delete experiment clone? (yes/no): ")
    if response.lower() == 'yes':
        experiment_data.delete()
        print("✓ Experiment clone deleted\n")
    else:
        print("✓ Experiment clone preserved\n")
    print("=== Workflow Complete ===")
if __name__ == "__main__":
    main()
Save this script and run it to see Dataset Manager in action!
Happy dataset managing! 🚀