diff --git a/kernel-agentic/Makefile b/kernel-agentic/Makefile index 0b3857b..f6f4975 100644 --- a/kernel-agentic/Makefile +++ b/kernel-agentic/Makefile @@ -2,7 +2,7 @@ TORCH_FILE_DIR ?= datasets/torch_example_kernels KERNEL_FILE_DIR ?= kernel_codes KERNEL_LANG ?= hip -MODE ?= multimodal +MODE ?= text ACTIVATE_VENV_CMD = exec bash -c "source .venv/bin/activate && exec bash" diff --git a/kernel-agentic/docs/hip/AMD Instinct MI300 CDNA3 Architecture Guide for High-Quality HIP Kernel Development.md b/kernel-agentic/docs/hip/AMD Instinct MI300 CDNA3 Architecture Guide for High-Quality HIP Kernel Development.md new file mode 100644 index 0000000..4cb3de0 --- /dev/null +++ b/kernel-agentic/docs/hip/AMD Instinct MI300 CDNA3 Architecture Guide for High-Quality HIP Kernel Development.md @@ -0,0 +1,414 @@ +# AMD Instinct MI300 CDNA3 Architecture Guide for High-Quality HIP Kernel Development + + +## Abstract + +This comprehensive guide provides essential knowledge for developing high-performance HIP kernels specifically optimized for the AMD Instinct MI300 CDNA3 architecture. The document focuses on unique architectural features and programming considerations that differentiate MI300 from NVIDIA AI accelerators, enabling developers to leverage the full potential of AMD's latest compute architecture. Key areas covered include the revolutionary Matrix Fused Multiply-Add (MFMA) instructions, novel data formats like FP8 and BF8, structured sparse matrix support, and the dual register file architecture that sets MI300 apart from competing solutions. + +## Table of Contents + +1. [Introduction](#introduction) +2. [Matrix Arithmetic Architecture](#matrix-arithmetic-architecture) +3. [Advanced Data Format Support](#advanced-data-format-support) +4. [Sparse Matrix Acceleration](#sparse-matrix-acceleration) +5. [Memory Hierarchy and Operations](#memory-hierarchy-and-operations) +6. [Register Architecture and Management](#register-architecture-and-management) +7. 
[Execution Model and Control Flow](#execution-model-and-control-flow) +8. [Performance Optimization Strategies](#performance-optimization-strategies) +9. [Key Differences from NVIDIA Architectures](#key-differences-from-nvidia-architectures) +10. [Best Practices for HIP Kernel Development](#best-practices-for-hip-kernel-development) +11. [Conclusion](#conclusion) +12. [References](#references) + +## Introduction + +The AMD Instinct MI300 represents a significant advancement in compute architecture, introducing the CDNA3 instruction set architecture specifically designed for artificial intelligence and high-performance computing workloads. Unlike traditional GPU architectures that evolved from graphics processing, CDNA3 was purpose-built for compute-intensive applications, resulting in unique architectural decisions that require specialized knowledge for optimal kernel development. + +The MI300's architecture introduces several groundbreaking features that distinguish it from both previous AMD architectures and competing NVIDIA solutions. Most notably, the introduction of dedicated Matrix Arithmetic Instructions (MAI) with a separate accumulation register file, native support for emerging data formats like FP8 and BF8, and hardware-accelerated structured sparse matrix operations represent paradigm shifts in how developers should approach kernel optimization. + +This guide synthesizes critical information from the official AMD Instinct MI300 CDNA3 Instruction Set Architecture Reference Guide, focusing specifically on aspects that impact HIP kernel development. Rather than covering general GPU programming concepts that large language models already understand, this document concentrates on MI300-specific features, architectural nuances, and programming patterns that enable developers to write high-performance kernels that fully exploit the hardware's capabilities. + +Understanding these architectural details is crucial for several reasons. 
First, the MI300's dual register file system requires careful management of data movement between architectural and accumulation registers. Second, the extensive MFMA instruction family offers numerous variants optimized for different matrix dimensions and data types, requiring informed selection based on workload characteristics. Third, the hardware's native support for structured sparsity and novel data formats opens new optimization opportunities that don't exist on other platforms. + +## Matrix Arithmetic Architecture + +The cornerstone of MI300's compute capabilities lies in its Matrix Arithmetic Instructions (MAI), chiefly the Matrix Fused Multiply-Add (MFMA) family, which represent a fundamental departure from traditional vector processing approaches. The MFMA subsystem is built around a dedicated Matrix Core unit that operates independently from the standard SIMD execution units, providing specialized hardware optimized for the matrix operations that dominate modern AI and scientific computing workloads. + +### Dual Register File Architecture + +The most distinctive aspect of MI300's matrix architecture is its implementation of dual register files. Unlike conventional GPU architectures that utilize a single, unified register file, MI300 maintains separate Architectural VGPRs (Arch VGPRs) and Accumulation VGPRs (AccVGPRs). This architectural decision enables several critical optimizations that directly impact kernel performance. + +The Architectural VGPRs serve as the primary register file for standard vector operations and data movement, maintaining compatibility with existing shader instruction sets. These registers handle input data preparation, intermediate computations, and results that don't require matrix-specific processing. In contrast, the AccVGPRs are exclusively dedicated to matrix operations, providing optimized storage and access patterns for matrix data that remains within the matrix computation pipeline. 
+ +Data movement between these register files occurs through explicit V_ACCVGPR_READ and V_ACCVGPR_WRITE instructions, giving developers precise control over when and how matrix data transitions between different processing domains. This explicit model requires careful planning but enables sophisticated optimization strategies, such as keeping frequently accessed matrix data resident in AccVGPRs while using Arch VGPRs for auxiliary computations. + +The separation also enables concurrent operations, where standard vector instructions can execute on Arch VGPRs while matrix operations proceed on AccVGPRs, effectively increasing the overall computational throughput when workloads can be appropriately decomposed. This parallelism is particularly valuable in complex kernels that combine matrix operations with element-wise processing, memory management, or control flow logic. + +### Fundamental Matrix Operations + +The Matrix Core unit's fundamental computational primitive is the 4×1 × 1×4 outer product operation, which produces 16 output values in a single operation. This design choice reflects careful analysis of common matrix operation patterns in AI workloads, where outer products serve as building blocks for larger matrix multiplications. By optimizing the hardware for this specific operation size, AMD achieved an optimal balance between hardware complexity and computational efficiency. + +The outer product primitive supports both dense and structured sparse inputs, with the sparse variant implementing 4:2 structured sparsity where exactly two out of every four values along the reduction dimension are zero. This flexibility allows the same hardware to efficiently process both dense matrices common in fully-connected layers and sparse matrices increasingly used in modern neural network architectures for improved efficiency. + +MFMA instructions combine multiple outer product operations, both in parallel and in series, to implement larger matrix operations. 
For example, a 32×32×1 MFMA instruction orchestrates 64 parallel 4×1 × 1×4 outer products to compute the full result matrix: the 32×32 output holds 1,024 values, and each outer product contributes 16 of them. This hierarchical approach enables the hardware to scale efficiently across different matrix sizes while keeping the underlying computational units well utilized. + +### MFMA Instruction Variants + +The MI300 provides an extensive family of MFMA instructions, each optimized for specific matrix dimensions and data types. The instruction naming convention follows the pattern V_MFMA_[output_type]_[M]X[N]X[K][_[B]B]_[input_type], where M and N give the output tile dimensions, K gives the reduction depth, B indicates the number of matrix blocks processed simultaneously, and the type specifications define the precision of inputs and outputs. + +For single-precision floating-point operations, the V_MFMA_F32_*_F32 family provides options ranging from small 4×4×1 matrices processed in 16-block batches to large 32×32×2 single-block operations. The 4×4×1_16B variant completes in just 8 cycles while processing 16 separate matrix operations, which suits scenarios with many small matrices. Conversely, the 32×32×2 variant requires 64 cycles but processes much larger matrices, which suits scenarios with fewer, larger computational blocks. + +Half-precision operations through the V_MFMA_F32_*_F16 family offer increased throughput by processing more data per instruction. The 32×32×8 variant processes 8 elements along the K dimension in 32 cycles. Compared with the 32×32×2 single-precision variant, that is four times the reduction depth in half the cycles, or roughly eight times the multiply-accumulate work per cycle. This capability is particularly valuable for inference workloads where the reduced precision of FP16 is acceptable. + +The integer instruction family V_MFMA_I32_*_I8 targets quantized neural network workloads, processing 8-bit integer inputs to produce 32-bit integer outputs. 
These instructions implement the multiply-accumulate pattern common in quantized inference, where 8-bit weights and activations are multiplied and accumulated into higher-precision results to prevent overflow. + +Double-precision support through V_MFMA_F64_*_F64 instructions addresses scientific computing workloads that require maximum numerical precision. While these instructions have lower throughput due to the increased data size and computational complexity, they provide essential capabilities for applications where numerical accuracy is paramount. + +### Broadcasting and Data Permutation + +MFMA instructions support sophisticated broadcasting and data permutation capabilities that enable efficient implementation of various matrix operation patterns. The CBSZ (Broadcast Size) field controls how data is broadcast within matrix blocks, allowing a single input value to be used across multiple matrix elements. This capability is essential for implementing operations like matrix-vector multiplication or bias addition where one operand has reduced dimensionality. + +The ABID (Broadcast ID) field specifies which block should serve as the source for broadcasting operations when multiple blocks are processed simultaneously. This feature enables efficient implementation of operations where one matrix operand is shared across multiple independent matrix operations, reducing memory bandwidth requirements and improving cache efficiency. + +The BLGP (Lane Group Permutation) field provides eight different permutation patterns that control how data is distributed across the 64 lanes of a wavefront. These permutations enable efficient mapping of various matrix layouts to the hardware's execution model, allowing developers to optimize data organization for specific access patterns. The permutations include no broadcast, broadcasting from different 32-lane or 16-lane groups, and rotation operations that shift data across lanes. 
+ +For double-precision MFMA instructions, the BLGP field serves a different purpose, controlling the implicit negation of input matrices A, B, and C. This repurposing reflects the different optimization priorities for double-precision operations, where numerical precision often takes precedence over complex data permutation patterns. + +## Advanced Data Format Support + +MI300's support for advanced data formats represents one of its most significant differentiators from competing architectures. The hardware provides native support for emerging 8-bit floating-point formats that are becoming increasingly important in AI workloads, along with sophisticated conversion and rounding capabilities that enable efficient mixed-precision computing. + +### FP8 and BF8 Formats + +The introduction of 8-bit floating-point formats addresses the growing demand for efficient AI inference while maintaining acceptable numerical accuracy. MI300 supports two distinct 8-bit formats: FP8 (E4M3) and BF8 (E5M2), each optimized for different use cases and numerical requirements. + +FP8 (E4M3) utilizes a 4-bit exponent and 3-bit mantissa configuration with a bias of 8, providing a dynamic range suitable for many neural network activation patterns. Notably, FP8 does not support infinity or NaN representations, instead using the maximum representable value to indicate overflow conditions. This design choice reflects the format's optimization for inference workloads where such special values are rarely encountered and the simplified handling can improve performance. + +The format's range extends from ±2^(-10) for the smallest denormalized values to 240 for the maximum normalized values, with a minimum normalized value of ±2^(-7). This range covers the typical activation distributions found in many neural network layers, making FP8 particularly suitable for activation storage and computation in inference scenarios. 
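The FP8 range figures quoted above can be checked with a small host-side decoder. The sketch below is written in Python rather than HIP so that it runs without a GPU. `decode_minifloat` and `decode_fp8_e4m3` are hypothetical helper names, not part of any AMD library. The sketch assumes that every exponent code encodes a finite value, as the text describes, and it does not model any special-value encodings the hardware may reserve.

```python
def decode_minifloat(byte, exp_bits, man_bits, bias):
    """Decode an 8-bit sign/exponent/mantissa pattern to a Python float.

    Assumption: all exponent codes are treated as finite values; special
    encodings (if any) are deliberately not modelled.
    """
    assert 0 <= byte <= 0xFF and 1 + exp_bits + man_bits == 8
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    if exp == 0:
        # Denormal: no implicit leading one, exponent fixed at 1 - bias.
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1.0 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


def decode_fp8_e4m3(byte):
    # FP8 (E4M3) with the bias of 8 quoted in the text.
    return decode_minifloat(byte, exp_bits=4, man_bits=3, bias=8)


fp8_max = decode_fp8_e4m3(0x7F)           # exponent 1111, mantissa 111
fp8_min_normal = decode_fp8_e4m3(0x08)    # exponent 0001, mantissa 000
fp8_min_denormal = decode_fp8_e4m3(0x01)  # exponent 0000, mantissa 001
print(fp8_max, fp8_min_normal, fp8_min_denormal)
```

Under the same all-codes-finite assumption, calling `decode_minifloat` with `exp_bits=5, man_bits=2, bias=16` gives a comparable model of the BF8 (E5M2) layout.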
+ +BF8 (E5M2) employs a 5-bit exponent and 2-bit mantissa with a bias of 16, providing a different trade-off between range and precision. The extended exponent range allows BF8 to represent much larger values, with a maximum of 57,344 compared to FP8's 240. BF8 also supports infinity representations, making it more suitable for scenarios where numerical robustness is important. + +The mantissa precision in BF8 is reduced compared to FP8, but the extended range makes it particularly suitable for weight storage in neural networks, where the distribution of values often spans a wider range than activations. The minimum denormalized value is 2^(-17), and the minimum normalized value is ±2^(-15), providing coverage for very small weight values that might be important for model accuracy. + +### Conversion and Rounding Operations + +MI300 provides comprehensive conversion capabilities between different precision formats, with particular attention to the rounding modes that can significantly impact numerical accuracy in iterative computations. The CVT_PK_FP8_F32 and CVT_PK_BF8_F32 instructions convert pairs of 32-bit floating-point values to packed 8-bit formats, enabling efficient data compression for storage and transmission. + +These conversion instructions support standard IEEE rounding modes and provide control over input modifications through absolute value and negation operations. The Op_Sel[3] field controls various aspects of the conversion process, allowing fine-tuned control over how the conversion handles edge cases and precision loss. + +The CVT_SR_FP8_F32 instruction introduces stochastic rounding, a technique that has gained attention in machine learning research for its ability to maintain numerical accuracy during training with reduced precision. Instead of always rounding to the nearest representable value, stochastic rounding probabilistically chooses between the two nearest values based on the fractional part of the original value. 
This approach helps prevent systematic bias that can accumulate during iterative training processes. + +Stochastic rounding requires a random number source, which is provided through the second operand of the CVT_SR_FP8_F32 instruction. The Op_Sel[3:2] field controls various aspects of the stochastic rounding process, including how the random bits are interpreted and applied. This capability enables MI300 to support cutting-edge training techniques that rely on stochastic quantization for maintaining model quality while reducing computational and memory requirements. + +### Configuration Requirements + +Proper utilization of FP8 and BF8 formats requires specific hardware configuration to ensure correct operation. The SH_MEM_CONFIG register's bit[8] must be set to 1 to enable the correct behavior for BF8 and FP8 operations. This configuration bit affects various aspects of the floating-point processing pipeline, ensuring that the specialized handling required for these formats is properly enabled. + +The configuration also affects how overflow and underflow conditions are handled during conversions. When FP16_OVFL is set to 1, values that exceed the representable range of the target format are clamped to the maximum representable value rather than being converted to infinity or NaN. This behavior is often preferred in AI workloads where maintaining finite values is more important than preserving the mathematical properties of IEEE floating-point arithmetic. + +The interaction between these configuration settings and the various MFMA instructions creates a complex optimization space where developers must carefully balance numerical accuracy, performance, and memory efficiency. Understanding these trade-offs is crucial for developing kernels that effectively leverage the advanced data format capabilities of MI300. 
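The stochastic rounding behaviour described under Conversion and Rounding Operations can be illustrated with a host-side numerical model. The sketch below is Python, not HIP, and it is not a bit-exact model of CVT_SR_FP8_F32. It rounds onto a uniform grid of spacing `step` instead of the FP8 value grid. The helper name `stochastic_round` and the 16-bit width of the random operand are assumptions made for illustration.

```python
import math
import random


def stochastic_round(x, step, rand_bits, n_bits=16):
    """Round x to a multiple of step, rounding up with probability equal
    to the fractional distance above the lower grid point.

    rand_bits is an n_bits-wide unsigned integer supplied by the caller,
    mirroring the text's point that the random source is an explicit operand.
    """
    lower = math.floor(x / step)
    frac = x / step - lower                # in [0, 1)
    threshold = rand_bits / (1 << n_bits)  # uniform in [0, 1)
    return (lower + (1 if threshold < frac else 0)) * step


rng = random.Random(0)  # fixed seed so the demo is repeatable
x, step, trials = 0.3, 1.0, 20000
samples = [stochastic_round(x, step, rng.getrandbits(16)) for _ in range(trials)]
sr_mean = sum(samples) / trials
nearest = round(x / step) * step
print(f"round-to-nearest: {nearest}, stochastic mean: {sr_mean:.3f}")
```

Round-to-nearest maps every occurrence of 0.3 to 0, so the error never averages out. The stochastic version returns 0 or 1 on each call, but its mean stays close to 0.3, which is the property that limits accumulated bias over many iterations.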
+ +## Sparse Matrix Acceleration + +MI300's hardware support for structured sparse matrices represents a significant advancement in accelerating the sparse neural networks that are becoming increasingly important for efficient AI deployment. The V_SMFMAC (Sparse Matrix Fused Multiply-ACcumulate) instruction family provides native hardware acceleration for 4:2 structured sparsity, enabling significant performance improvements for appropriately structured workloads. + +### 4:2 Structured Sparsity Pattern + +The 4:2 structured sparsity pattern requires that exactly two out of every four consecutive elements along the matrix K-dimension are zero. This constraint might initially seem restrictive, but it provides several important advantages that make it practical for many AI workloads. The regular structure enables efficient hardware implementation while still providing substantial memory and computational savings compared to dense operations. + +The sparsity pattern is enforced at the granularity of groups of four elements, meaning that within each group of four consecutive values along the reduction dimension, exactly two positions must contain zeros. The positions of the non-zero elements can vary between groups, providing flexibility in representing various sparsity patterns that arise naturally in neural networks or can be induced through structured pruning techniques. + +This structured approach contrasts with unstructured sparsity, where zeros can appear at arbitrary positions. While unstructured sparsity can achieve higher compression ratios, it requires complex indexing schemes and irregular memory access patterns that are difficult to accelerate efficiently in hardware. The 4:2 structure provides a sweet spot between compression efficiency and hardware implementation complexity. + +The 2:1 compression ratio achieved by 4:2 sparsity is significant in practical applications. 
For large neural networks, this compression translates directly to reduced memory bandwidth requirements, smaller model storage, and improved cache efficiency. When combined with the computational savings from skipping zero multiplications, the overall performance improvement can be substantial for workloads that can be structured to match this sparsity pattern. + +### Index Encoding and Reconstruction + +The sparse matrix representation uses a compact index encoding scheme where pairs of 2-bit values indicate which two positions within each group of four contain non-zero values. This encoding requires only 4 bits to represent the sparsity pattern for each group of four elements, resulting in minimal overhead for the index information. + +The index values are stored in separate VGPRs from the non-zero data values, allowing the hardware to process the sparsity pattern and data values through different pathways optimized for their respective characteristics. The index processing logic reconstructs the full matrix structure by inserting zeros at the appropriate positions based on the encoded pattern, enabling the matrix multiplication hardware to operate on the reconstructed dense representation. + +This reconstruction process occurs transparently within the hardware, meaning that software developers work with the compressed representation while the execution units operate on appropriately structured data. The hardware manages the complexity of coordinating between the sparse data, index information, and dense operands to produce correct results. + +The index encoding supports all possible combinations of two non-zero positions within groups of four, providing complete flexibility in representing any 4:2 sparse pattern. The encoding is designed to be efficiently processed by the hardware's index reconstruction logic, minimizing the overhead associated with sparse processing. 
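The compression and reconstruction steps above can be sketched on the host as follows. This is a Python model, not HIP, and the function names are hypothetical. The text only states that each group of four uses a pair of 2-bit position values. The bit order chosen here (first kept position in bits [1:0], second in bits [3:2]) is an assumption, as is the way values and indices are packed into lists rather than VGPRs.

```python
def compress_4to2(row):
    """Split a 4:2-sparse row into kept values and per-group 4-bit indices.

    Assumed bit layout (illustrative only): the position of the first kept
    element sits in bits [1:0], the second in bits [3:2].
    """
    assert len(row) % 4 == 0
    values, indices = [], []
    for g in range(0, len(row), 4):
        group = row[g:g + 4]
        keep = [i for i, v in enumerate(group) if v != 0]
        assert len(keep) <= 2, "group violates 4:2 structured sparsity"
        # Pad with unused positions so exactly two slots are always stored.
        for i in range(4):
            if len(keep) == 2:
                break
            if i not in keep:
                keep.append(i)
        keep.sort()
        values.extend(group[i] for i in keep)
        indices.append(keep[0] | (keep[1] << 2))
    return values, indices


def reconstruct_4to2(values, indices):
    """Re-insert zeros to recover the dense row, as the text describes."""
    row = []
    for g, idx in enumerate(indices):
        group = [0] * 4
        group[idx & 0b11] = values[2 * g]
        group[(idx >> 2) & 0b11] = values[2 * g + 1]
        row.extend(group)
    return row


dense = [0, 5, 0, 7, 3, 0, 0, 9]
vals, idx = compress_4to2(dense)
restored = reconstruct_4to2(vals, idx)
print(vals, [format(i, "04b") for i in idx], restored)
```

The stored values take half the space of the dense row, and each group adds only one 4-bit index. That is the 2:1 compression with small index overhead described in this section.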
+ +### SMFMAC Instruction Characteristics + +The V_SMFMAC instructions implement accumulate-style operations where the output matrix serves as both an input (for accumulation) and the destination for results. This design pattern is common in neural network computations where results are accumulated across multiple matrix operations, such as in attention mechanisms or recurrent neural network implementations. + +The instruction format repurposes the C operand input field to hold the index data offset, reflecting the different data flow requirements of sparse operations compared to dense MFMA instructions. This design choice enables efficient encoding while maintaining consistency with the overall instruction format architecture. + +Only the A matrix can be sparse in SMFMAC operations; the B and C matrices must be dense. This limitation reflects both hardware complexity considerations and the common usage patterns in neural networks, where weight matrices (typically the A operand) are often sparse while activation matrices (typically the B operand) remain dense. + +The performance characteristics of SMFMAC instructions depend heavily on the actual sparsity pattern and data layout. When the sparsity pattern is well-matched to the hardware's expectations and the data is properly organized in memory, SMFMAC operations can provide substantial performance improvements over equivalent dense operations. However, poorly organized sparse data or sparsity patterns that don't align well with the 4:2 structure can result in performance degradation compared to dense alternatives. + +## Memory Hierarchy and Operations + +MI300's memory hierarchy incorporates several unique features that distinguish it from both previous AMD architectures and competing solutions. Understanding these features is crucial for developing high-performance kernels that effectively utilize the available memory bandwidth and minimize latency through optimal data placement and access patterns. 
+ +### Local Data Share (LDS) Architecture + +The Local Data Share (LDS) serves as MI300's implementation of fast on-chip shared memory, providing high-bandwidth, low-latency storage that can be shared among all threads within a workgroup. The LDS architecture in MI300 includes several enhancements and unique characteristics that impact kernel design and optimization strategies. + +LDS memory is organized as a banked structure that enables concurrent access from multiple threads when accesses target different banks. The banking scheme is designed to support common access patterns found in matrix operations and data sharing scenarios, but developers must understand the banking rules to avoid conflicts that can serialize memory accesses and reduce performance. + +The DS_* instruction family provides comprehensive support for LDS operations, including standard loads and stores as well as atomic operations that enable sophisticated synchronization and data sharing patterns. The atomic operations include support for various data types and operation modes, including compare-and-swap operations that enable lock-free algorithms and advanced synchronization primitives. + +One unique capability of MI300's LDS implementation is the support for direct loading from global memory buffers to LDS without intermediate storage in VGPRs. The BUFFER_LOAD_* instructions can specify LDS as the destination, enabling efficient data staging operations that bypass the register file. This capability is particularly valuable for kernels that need to load large amounts of data into shared memory for subsequent processing by multiple threads. + +The LDS address calculation follows the pattern CalcDsAddr(ADDR, OFFSET0, OFFSET1), where multiple offset components can be combined to support complex addressing patterns. This flexibility enables efficient implementation of multi-dimensional array accesses and other complex data structures that are common in scientific computing and AI workloads. 
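The effect of LDS banking on an access pattern can be estimated with a simple host-side model. The sketch below is Python, not HIP. The bank count of 32 and the 4-byte bank width are assumptions chosen for illustration, since the text does not state them, and should be checked against the hardware documentation. The model also assumes that lanes reading the same dword are served by a single broadcast, and it looks at only 32 lanes at a time.

```python
from collections import defaultdict

NUM_BANKS = 32   # assumption for illustration; check the hardware documentation
BANK_WIDTH = 4   # bytes per bank entry, also an assumption


def worst_bank_conflict(byte_addresses):
    """Return the largest number of distinct dwords that map to one bank.

    1 means conflict-free; N means the access is serialized roughly N ways
    under this simplified model. Lanes reading the same dword are counted
    once, modelling a broadcast.
    """
    per_bank = defaultdict(set)
    for addr in byte_addresses:
        dword = addr // BANK_WIDTH
        per_bank[dword % NUM_BANKS].add(dword)
    return max(len(dwords) for dwords in per_bank.values())


lanes = range(32)
unit_stride = [lane * 4 for lane in lanes]       # consecutive floats
column_walk = [lane * 32 * 4 for lane in lanes]  # column of a 32-wide float tile
padded_walk = [lane * 33 * 4 for lane in lanes]  # same tile padded to 33 columns
print(worst_bank_conflict(unit_stride),
      worst_bank_conflict(column_walk),
      worst_bank_conflict(padded_walk))
```

Under these assumptions, walking a column of a 32-wide tile sends every lane to the same bank, while padding each row by one element spreads the same walk across all banks.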
+ +### Global Wave Sync (GWS) Capabilities + +Global Wave Sync represents a unique synchronization primitive that enables coordination between different workgroups executing on the same compute unit. This capability extends beyond the traditional shared memory model where synchronization is limited to threads within a single workgroup, enabling new algorithmic approaches that can improve efficiency for certain classes of problems. + +GWS operations use similar instruction patterns to LDS operations but operate at a different scope, allowing workgroups to coordinate their execution and share intermediate results. This capability is particularly valuable for algorithms that have global dependencies or require coordination across large numbers of threads that exceed the capacity of a single workgroup. + +The implementation of GWS requires careful consideration of memory consistency and ordering guarantees, as operations that span multiple workgroups must maintain coherent views of shared data. The hardware provides appropriate synchronization mechanisms to ensure that GWS operations complete in a well-defined order and that all participants observe consistent results. + +Applications that can effectively utilize GWS include certain types of reductions, prefix scans, and other collective operations where the natural decomposition exceeds workgroup boundaries. However, the use of GWS requires careful algorithm design to ensure that the synchronization overhead doesn't outweigh the benefits of the increased parallelism. + +### Buffer Memory Operations + +MI300's buffer memory operations provide flexible and efficient mechanisms for accessing global memory through the MUBUF (Memory Untyped Buffer) instruction family. These instructions support various addressing modes and data formats that enable efficient implementation of common memory access patterns found in compute kernels. 
+ +Buffer addressing combines multiple components including base addresses, indices, offsets, and strides to support complex data layouts. The addressing calculation can incorporate thread IDs automatically, enabling efficient implementation of per-thread data access patterns without requiring explicit address computation in the kernel code. + +The buffer resource descriptor provides comprehensive control over memory access behavior, including stride information, bounds checking, and cache control policies. The stride field supports up to 18 bits for certain instruction types, enabling efficient access to large data structures with regular layouts. + +Swizzling support in buffer operations enables optimized memory access patterns that can improve cache efficiency and reduce bank conflicts. The swizzling parameters control how linear addresses are mapped to physical memory locations, allowing developers to optimize for specific access patterns that are common in their workloads. + +The buffer operations also support direct loading to LDS memory, as mentioned previously, which enables efficient data staging operations. This capability is particularly valuable for kernels that implement tiling strategies where data is loaded into shared memory for processing by multiple threads. + +## Register Architecture and Management + +MI300's register architecture presents unique challenges and opportunities that significantly impact kernel performance and resource utilization. The combination of scalar and vector registers, along with the specialized accumulation registers for matrix operations, requires careful management to achieve optimal performance. + +### Scalar General Purpose Registers (SGPRs) + +SGPRs serve as the primary storage for scalar values, addresses, and control information that is shared across all threads in a wavefront. MI300 provides 102 SGPRs (SGPR0 through SGPR101) plus several special-purpose registers that serve specific architectural functions. 
+ +The most significant constraint on SGPR usage is the limitation that at most one SGPR can be read per VALU (Vector ALU) instruction. This restriction requires careful instruction scheduling and register allocation to avoid pipeline stalls. When multiple SGPR values are needed for a single operation, they must be loaded in separate instructions or combined through scalar operations before being used in vector computations. + +SGPR alignment requirements are strict for multi-word operations. 64-bit operations require even-aligned SGPR pairs, while larger operations require alignment to multiples of four. These alignment constraints must be considered during register allocation to avoid wasted register space and ensure efficient instruction encoding. + +The special-purpose registers include VCC (Vector Condition Code), EXEC (Execution mask), M0 (Memory descriptor), and various trap and system registers. These registers serve specific architectural functions and have unique usage patterns that must be understood for effective kernel development. + +VCC serves as the default destination for vector comparison operations and as a source for conditional operations. The 64-bit VCC register provides one bit per thread in the wavefront, enabling efficient implementation of conditional execution patterns. + +The EXEC register controls which threads in a wavefront are active for each instruction. Understanding EXEC mask management is crucial for implementing control flow and ensuring that inactive threads don't interfere with computation or memory operations. + +M0 serves as a memory descriptor register that provides addressing information for certain memory operations. Its usage is particularly important for LDS operations and other specialized memory access patterns. + +### Vector General Purpose Registers (VGPRs) + +VGPRs provide per-thread storage for vector operations, with each VGPR containing one value per thread in the 64-thread wavefront. 
The VGPR file serves standard vector operations and also acts as the interface to the specialized AccVGPR file used for matrix operations. + +VGPR allocation and alignment follow similar rules to SGPRs, with even alignment required for 64-bit operations and higher alignment requirements for larger data types. The alignment requirements can impact register utilization efficiency, particularly in kernels that mix different data types or operation sizes. + +VGPR indexing provides dynamic access to the register file using the M0 register as an index. This capability enables implementation of algorithms that require indirect register access, such as certain types of data permutation or dynamic data structure access. However, indexed access typically has higher latency than direct register access and should be reserved for cases that genuinely need it. + +The interaction between VGPRs and AccVGPRs requires explicit management through V_ACCVGPR_READ and V_ACCVGPR_WRITE instructions. These data movement operations have their own latency and throughput costs, which must be considered when scheduling matrix operations and other computations. + +### Accumulation Vector General Purpose Registers (AccVGPRs) + +AccVGPRs represent MI300's most distinctive register architecture feature, providing dedicated storage optimized for matrix operations. These registers are physically separate from the standard VGPR file and are accessed exclusively through matrix instructions and explicit data movement operations. + +The AccVGPR file is designed to support the data flow patterns common in matrix operations, with optimized connectivity to the matrix execution units. Data stored in AccVGPRs can remain resident across multiple matrix operations, enabling efficient implementation of complex matrix computations without intermediate transfers to the standard register file. 
+ +AccVGPR allocation follows the same alignment rules as VGPRs, but the usage patterns are typically different due to the nature of matrix operations. Matrix instructions often require contiguous blocks of registers to store matrix data, leading to different optimization considerations compared to scalar or simple vector operations. + +The capacity and organization of the AccVGPR file impact the types of matrix operations that can be efficiently supported. Large matrix operations may require careful blocking and data movement strategies to fit within the available AccVGPR space while maintaining high utilization of the matrix execution units. + +Data movement between VGPRs and AccVGPRs introduces latency and throughput considerations that must be balanced against the benefits of keeping data in the specialized register file. Optimal kernel design requires understanding when to keep data in AccVGPRs versus when to move it to the standard register file for other operations. + +## Execution Model and Control Flow + +MI300's execution model builds upon the traditional SIMD (Single Instruction, Multiple Data) approach while incorporating enhancements that improve efficiency for modern compute workloads. Understanding the execution model is crucial for writing kernels that effectively utilize the hardware's capabilities while avoiding performance pitfalls. + +### Wavefront Execution Characteristics + +MI300 executes instructions using 64-thread wavefronts, where all active threads in a wavefront execute the same instruction simultaneously. This wavefront size represents a key architectural decision that impacts memory access patterns, synchronization requirements, and overall kernel design strategies. + +The 64-thread wavefront size affects memory coalescing requirements, as optimal memory access patterns must align with the 64-thread execution width. 
Memory operations that can be coalesced across all 64 threads achieve maximum bandwidth utilization, while non-coalesced accesses may result in multiple memory transactions and reduced performance. + +Thread divergence within a wavefront is handled through the EXEC mask, which controls which threads participate in each instruction. When threads follow different execution paths due to conditional branches, the hardware executes both paths sequentially while masking inactive threads. This approach ensures correctness but can reduce effective utilization when divergence is frequent or long-lasting. + +The wavefront execution model interacts with the memory hierarchy in important ways. LDS operations are shared across all threads in a workgroup, which may span multiple wavefronts. Understanding how wavefronts within a workgroup coordinate their LDS usage is important for avoiding conflicts and ensuring efficient data sharing. + +### Arbitrary Divergent Control Flow + +MI300 provides hardware support for arbitrary divergent control flow, enabling efficient execution of complex branching patterns that are common in many compute workloads. This capability goes beyond simple conditional execution to support nested loops, function calls, and other complex control structures. + +The hardware maintains a stack-based mechanism for tracking divergent execution paths, allowing threads within a wavefront to follow different control flow paths while maintaining the ability to reconverge when the paths merge. This approach provides flexibility for implementing complex algorithms while maintaining reasonable execution efficiency. + +The efficiency of divergent control flow depends on the specific branching patterns and the degree of divergence. When most threads follow the same path, the overhead is minimal. However, when threads frequently diverge into many different paths, the sequential execution of different paths can significantly reduce effective throughput. 
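The cost of divergence described above can be estimated with a small host-side model. The sketch below treats a 64-bit integer as an EXEC-style lane mask and computes the fraction of issued lane-cycles that do useful work for a single if/else region. The function name `divergent_utilization` and the cost units are assumptions made for illustration; real hardware behavior depends on factors this model ignores.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Host-side model: a 64-bit mask with one bit per lane, like EXEC.
constexpr int kWaveSize = 64;

// Fraction of issued lane-cycles doing useful work for one if/else region.
// taken_mask has bit i set when lane i takes the "then" path.
// then_cost / else_cost are instruction counts of each path (illustrative units).
inline double divergent_utilization(std::uint64_t taken_mask,
                                    int then_cost, int else_cost) {
    const int then_lanes = static_cast<int>(std::bitset<64>(taken_mask).count());
    const int else_lanes = kWaveSize - then_lanes;

    // A path is issued only if at least one lane needs it.
    const int issued_cycles = (then_lanes > 0 ? then_cost : 0) +
                              (else_lanes > 0 ? else_cost : 0);
    if (issued_cycles == 0) return 1.0;

    const double useful = static_cast<double>(then_lanes) * then_cost +
                          static_cast<double>(else_lanes) * else_cost;
    return useful / (static_cast<double>(kWaveSize) * issued_cycles);
}
```

In this model a wavefront whose lanes all take the same path scores 1.0, while a single lane taking an expensive path drags the whole wavefront through that path with 63 lanes masked off.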
+ +Kernel developers can optimize for divergent control flow by organizing algorithms to minimize divergence, using techniques such as data reorganization to group threads with similar execution paths, or restructuring algorithms to reduce the complexity of branching patterns. + +### Dependency Resolution and Scheduling + +MI300's instruction scheduling and dependency resolution mechanisms are designed to hide latency and maximize throughput for typical compute workloads. Understanding these mechanisms enables developers to write kernels that achieve high instruction-level parallelism and efficient resource utilization. + +Matrix operations have specific dependency requirements that must be satisfied to ensure correct execution. The hardware requires a certain number of independent instructions between the issuance of a matrix instruction and subsequent accesses to its results or modifications of its input registers. These dependency requirements are documented for each instruction type and must be carefully managed in hand-optimized kernels. + +Scalar memory operations use the LGKM_CNT counter to track outstanding memory requests and provide synchronization points for dependent operations. The counter is incremented when memory operations are issued and decremented when they complete, allowing software to determine when data is available for use. + +The S_WAITCNT instruction provides comprehensive synchronization capabilities with separate counters for different types of operations. Understanding how to use S_WAITCNT effectively is crucial for ensuring correct execution while minimizing unnecessary stalls that can reduce performance. + +## Performance Optimization Strategies + +Achieving optimal performance on MI300 requires understanding the unique characteristics of its architecture and applying optimization strategies that are specifically tailored to its capabilities. 
This section outlines key optimization approaches that can significantly impact kernel performance. + +### Matrix Operation Optimization + +Optimizing matrix operations on MI300 requires careful consideration of the MFMA instruction variants, data layout, and register management strategies. The choice of MFMA instruction should be based on the specific matrix dimensions, data types, and throughput requirements of the target workload. + +For workloads with many small matrices, using MFMA instructions with higher block counts (such as 4x4x1_16B) can provide better throughput by processing multiple independent operations simultaneously. Conversely, workloads with larger matrices may benefit from instructions that process larger matrix dimensions in fewer operations. + +Data layout optimization is crucial for achieving optimal memory bandwidth utilization. Matrix data should be organized to enable coalesced memory accesses across the 64-thread wavefront, and the layout should be compatible with the input and output patterns expected by the chosen MFMA instructions. + +AccVGPR management strategies can significantly impact performance by minimizing unnecessary data movement between register files. Keeping frequently accessed matrix data in AccVGPRs while using standard VGPRs for auxiliary computations can improve overall efficiency. + +The broadcasting and permutation capabilities of MFMA instructions can be leveraged to implement complex matrix operation patterns efficiently. Understanding how to use the CBSZ, ABID, and BLGP fields enables optimization of operations like matrix-vector multiplication, bias addition, and other common neural network primitives. + +### Memory Access Optimization + +Memory access optimization on MI300 requires understanding the memory hierarchy, cache behavior, and access pattern requirements for different types of operations. 
The goal is to maximize memory bandwidth utilization while minimizing latency through effective use of the memory hierarchy. + +LDS optimization involves understanding the banking structure and organizing data access patterns to avoid bank conflicts. Data should be laid out to enable concurrent access from multiple threads, and algorithms should be structured to take advantage of the high bandwidth and low latency characteristics of LDS memory. + +Global memory access optimization focuses on achieving coalesced access patterns that can be efficiently serviced by the memory system. The buffer addressing capabilities should be used to implement efficient strided access patterns, and cache control bits should be used to optimize cache behavior for specific access patterns. + +The direct LDS loading capability can be used to implement efficient data staging strategies where global memory data is loaded directly into LDS for subsequent processing. This approach can reduce register pressure and improve memory bandwidth utilization for certain types of algorithms. + +### Register Allocation and Management + +Effective register allocation on MI300 requires balancing the competing demands of different instruction types while respecting alignment requirements and usage constraints. The dual register file architecture adds complexity but also provides optimization opportunities when properly managed. + +SGPR allocation should minimize the number of different SGPRs accessed within individual VALU instructions to avoid violating the single-SGPR-read constraint. Scalar computations should be organized to pre-compute values that will be used multiple times in vector operations. + +VGPR allocation should consider the alignment requirements of different operation types and organize data to minimize wasted register space due to alignment padding. The allocation should also consider the data flow between standard vector operations and matrix operations that use AccVGPRs. 
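The effect of alignment padding on register usage can be illustrated with a simple bump allocator on the host. The sketch assumes the rule stated earlier in this guide (even alignment for 64-bit operands, multiples of four for wider ones) and applies it to a list of operand widths. `required_alignment` and `registers_consumed` are hypothetical helper names, and a real compiler's allocator is considerably more sophisticated.

```cpp
#include <cassert>
#include <vector>

// Alignment (in 32-bit registers) assumed for an operand of the given width:
// 1 dword -> 1, 2 dwords -> even-aligned, wider -> multiple of four.
inline int required_alignment(int width_dwords) {
    if (width_dwords <= 1) return 1;
    if (width_dwords == 2) return 2;
    return 4;
}

// Place operands in the given order with a simple bump allocator and
// return the number of registers consumed, padding included.
inline int registers_consumed(const std::vector<int>& widths_dwords) {
    int next_free = 0;
    for (int width : widths_dwords) {
        const int align = required_alignment(width);
        next_free = (next_free + align - 1) / align * align;  // round up to alignment
        next_free += width;
    }
    return next_free;
}
```

With this model, placing operands of widths 1, 2, 1, 2 in that order consumes eight registers, while placing the two 64-bit operands first consumes six. Grouping operands by width is one way to reduce padding.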
+ +AccVGPR allocation should focus on keeping frequently accessed matrix data resident while minimizing unnecessary data movement. The allocation strategy should consider the matrix dimensions and operation patterns to ensure efficient utilization of the specialized register file. + +### Instruction Scheduling and Latency Hiding + +Instruction scheduling on MI300 should focus on maximizing instruction-level parallelism while respecting dependency constraints and resource limitations. The goal is to keep all execution units busy while minimizing pipeline stalls and resource conflicts. + +Matrix instruction scheduling requires careful attention to the dependency requirements between matrix operations and other instruction types. The required number of independent instructions between dependent matrix operations must be maintained to ensure correct execution. + +Memory instruction scheduling should balance the need to issue memory operations early (to hide latency) with the need to avoid excessive resource consumption that could limit other operations. The LGKM_CNT counter should be monitored to ensure that memory operations complete in a timely manner. + +Mixed instruction scheduling involves coordinating between different instruction types (scalar, vector, matrix, memory) to achieve optimal overall throughput. Understanding the execution unit capabilities and resource requirements of different instruction types enables effective scheduling strategies. + +## Key Differences from NVIDIA Architectures + +Understanding the differences between MI300 and NVIDIA AI accelerators is crucial for developers who need to port kernels between platforms or optimize for specific architectural characteristics. This section highlights the most significant differences that impact kernel development and performance optimization. + +### Matrix Operation Architecture Differences + +The most fundamental difference lies in the matrix operation architecture. 
MI300's dual register file system with separate AccVGPRs contrasts sharply with NVIDIA's unified register file approach. This architectural difference requires different optimization strategies and affects how matrix data is managed throughout kernel execution. + +MI300's 4×1 × 1×4 outer product primitive differs from NVIDIA's typical 4×4 × 4×4 matrix operation building blocks. This difference affects how larger matrix operations are decomposed and can influence the optimal blocking strategies for different matrix sizes and shapes. + +The explicit data movement between register files in MI300 (via V_ACCVGPR_READ/WRITE) contrasts with NVIDIA's more implicit register management. This difference provides more control but requires more explicit management of data flow in kernel code. + +MFMA instruction variants in MI300 offer different trade-offs compared to NVIDIA's WMMA instructions. The extensive family of MFMA instructions provides more granular control over matrix dimensions and block counts, enabling fine-tuned optimization for specific workload characteristics. + +### Data Format and Precision Differences + +MI300's native support for FP8 and BF8 formats represents a significant advantage for certain AI workloads, particularly when these formats align well with the numerical requirements of the target application. The hardware support for stochastic rounding in FP8 conversions is a distinctive feature that is valuable for training applications. + +The specific format definitions (E4M3 for FP8, E5M2 for BF8) may differ from NVIDIA's implementations, requiring careful attention to numerical behavior when porting applications between platforms. The range and precision characteristics of these formats can affect algorithm behavior and numerical stability. + +XF32 support in MI300 provides a middle ground between FP16 and FP32, filling the same role that TF32 fills on NVIDIA architectures.
This format can be valuable for applications that need more precision than FP16 but can accept less than full FP32 precision. + +### Sparse Matrix Support Differences + +MI300's 4:2 structured sparsity support through V_SMFMAC instructions provides hardware acceleration for a pattern in which two of every four consecutive values are nonzero. NVIDIA documents the same pattern as 2:4 sparsity, so the pattern itself ports directly; the differences lie in how the compressed values and their indices are encoded and fed to the hardware. + +The index encoding scheme and reconstruction process in MI300 may require different data preparation and layout strategies compared to NVIDIA's sparse matrix implementations. Understanding these differences is crucial for achieving optimal performance with sparse workloads. + +### Memory Hierarchy Differences + +The LDS implementation in MI300 may have different banking schemes, capacity, and access patterns compared to NVIDIA's shared memory. These differences can affect optimal data layout strategies and access pattern optimization. + +MI300's Global Wave Sync capability provides cross-workgroup synchronization primitives that may not have direct equivalents in NVIDIA architectures. This capability enables different algorithmic approaches that may be more or less suitable depending on the target application. + +Buffer memory operations and addressing modes in MI300 may differ from NVIDIA's global memory access patterns, requiring different optimization strategies for memory bandwidth utilization and cache efficiency. + +### Execution Model Differences + +The 64-thread wavefront size in MI300 contrasts with NVIDIA's 32-thread warp size, affecting memory coalescing requirements, synchronization patterns, and optimal workgroup organization strategies. + +Divergent control flow handling may differ between the architectures, with different performance characteristics for various branching patterns.
Understanding these differences is important for optimizing kernels with complex control flow. + +The instruction scheduling and dependency resolution mechanisms may have different characteristics, requiring different approaches to instruction-level parallelism and latency hiding. + +## Best Practices for HIP Kernel Development + +Based on the architectural characteristics and optimization opportunities outlined in previous sections, this section provides concrete best practices for developing high-performance HIP kernels on MI300. + +### Matrix Operation Best Practices + +When implementing matrix operations, choose MFMA instruction variants based on the specific requirements of your workload. For applications with many small matrices, prefer higher block count variants (like 4x4x1_16B) to maximize throughput. For applications with larger matrices, use variants that process larger dimensions efficiently. + +Organize matrix data layout to align with the input and output patterns expected by your chosen MFMA instructions. Ensure that data can be loaded efficiently into the appropriate register files and that memory access patterns are coalesced across the wavefront. + +Minimize data movement between VGPRs and AccVGPRs by keeping frequently accessed matrix data in AccVGPRs when possible. Plan your algorithm to batch matrix operations and minimize the frequency of register file transfers. + +Leverage the broadcasting and permutation capabilities of MFMA instructions to implement complex operations efficiently. Use CBSZ for operations that require broadcasting within matrix blocks, and use BLGP to optimize data distribution patterns. + +### Memory Access Best Practices + +Design LDS usage patterns to avoid bank conflicts by organizing data layout and access patterns appropriately. Use the direct LDS loading capability to stage data efficiently from global memory when implementing tiling strategies. 
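A quick way to reason about LDS bank conflicts is to count how many lanes land on the same bank for a given access stride. The host-side sketch below does that for a strided pattern where lane `i` reads the dword at index `i * stride_dwords`. The bank count, the 4-byte bank width implied by dword indexing, and the number of lanes serviced together are all modelling assumptions passed in by the caller, and `worst_bank_conflict` is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Worst-case number of lanes that land on the same bank when lane i reads
// the dword at index i * stride_dwords. Assumes stride_dwords > 0 so every
// lane reads a distinct address. All parameters are modelling assumptions.
inline int worst_bank_conflict(int stride_dwords, int lanes, int banks) {
    std::vector<int> hits(banks, 0);
    for (int lane = 0; lane < lanes; ++lane)
        ++hits[(lane * stride_dwords) % banks];
    return *std::max_element(hits.begin(), hits.end());
}
```

Under an assumed 32-bank layout, reading down a column of a 32-wide float tile (stride 32) puts all 32 lanes on one bank, while padding each row by one element (stride 33) spreads them across all banks. This is the usual motivation for padded tile layouts.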
+ +Optimize global memory access patterns for coalescing by ensuring that consecutive threads access consecutive memory locations when possible. Use the buffer addressing capabilities to implement efficient strided access patterns for multi-dimensional data structures. + +Use the cache control bits on memory instructions (SC0, SC1, and NT for vector memory operations on CDNA3) to optimize cache behavior for your specific access patterns. Consider the temporal and spatial locality characteristics of your memory accesses when choosing cache policies. + +Plan memory access scheduling to hide latency by issuing memory operations early and overlapping them with computation when possible. Use S_WAITCNT appropriately to synchronize memory operations without introducing unnecessary stalls. + +### Register Management Best Practices + +Allocate SGPRs carefully to respect the single-SGPR-read constraint for VALU instructions. Pre-compute scalar values that will be used multiple times in vector operations, and organize scalar computations to minimize SGPR pressure. + +Plan VGPR allocation to respect alignment requirements while minimizing wasted register space. Consider the data flow between different operation types and organize register usage to support efficient data movement. + +Manage AccVGPR allocation to support your matrix operation patterns while minimizing unnecessary data movement. Consider the matrix dimensions and operation sequences when planning AccVGPR usage. + +Monitor overall register pressure to ensure that your kernel can achieve good occupancy. Balance register usage against other resource requirements to find the optimal operating point for your specific workload. + +### Performance Optimization Best Practices + +Profile your kernels to identify performance bottlenecks and optimization opportunities. Use appropriate profiling tools to understand instruction throughput, memory bandwidth utilization, and resource utilization characteristics.
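When profiling points to low occupancy, a rough estimate of the register-limited wave count can help decide whether register pressure is the cause. The sketch below is a host-side calculation only. Every constant in `SimdLimits` (register budget, allocation granule, wave cap) is an assumption for illustration and should be checked against the documentation for your target. `register_limited_waves` is a hypothetical helper, and other limits such as LDS usage are not modelled.

```cpp
#include <algorithm>
#include <cassert>

// Every constant here is an assumption for illustration; verify the values
// for your target and toolchain before relying on them.
struct SimdLimits {
    int vgprs_per_lane = 512;  // assumed combined VGPR + AccVGPR budget per SIMD lane
    int alloc_granule  = 8;    // assumed allocation granularity in registers
    int max_waves      = 8;    // assumed cap on resident waves per SIMD
};

// Waves per SIMD permitted by vector register usage alone.
inline int register_limited_waves(int vgprs_per_thread, SimdLimits lim = SimdLimits{}) {
    if (vgprs_per_thread <= 0) return lim.max_waves;
    const int g = lim.alloc_granule;
    const int allocated = (vgprs_per_thread + g - 1) / g * g;  // round up to granule
    return std::min(lim.max_waves, lim.vgprs_per_lane / allocated);
}
```

Under these assumed limits, a kernel using 128 registers per thread fits four waves per SIMD, and a single extra register drops that to three because of the allocation granule.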
+ +Optimize instruction scheduling to maximize instruction-level parallelism while respecting dependency constraints. Pay particular attention to matrix instruction dependencies and memory operation synchronization requirements. + +Consider algorithmic optimizations that can take advantage of MI300's unique capabilities, such as structured sparsity support or advanced data format capabilities. Evaluate whether algorithm modifications can improve performance by better matching the hardware characteristics. + +Validate numerical accuracy when using reduced precision formats or optimization techniques that may affect numerical behavior. Ensure that performance optimizations don't compromise the correctness or accuracy requirements of your application. + +## Conclusion + +The AMD Instinct MI300 CDNA3 architecture represents a significant advancement in compute acceleration, introducing unique features that require specialized knowledge for optimal utilization. The dual register file architecture, extensive MFMA instruction family, native support for emerging data formats, and hardware-accelerated structured sparsity create new optimization opportunities while requiring different approaches compared to traditional GPU architectures. + +Success in developing high-performance HIP kernels for MI300 requires understanding these architectural innovations and applying optimization strategies that are specifically tailored to the hardware's capabilities. The matrix arithmetic instructions provide powerful tools for accelerating AI and scientific computing workloads, but they require careful attention to data layout, register management, and instruction scheduling to achieve optimal performance. + +The advanced data format support enables efficient implementation of mixed-precision algorithms that can provide significant performance and memory efficiency improvements for appropriate workloads. 
The sparse matrix acceleration capabilities open new possibilities for deploying efficient neural network models that leverage structured sparsity for improved performance. + +Memory hierarchy optimization remains crucial, with the LDS and buffer memory systems providing high-performance data access capabilities when used appropriately. The unique features like Global Wave Sync and direct LDS loading enable algorithmic approaches that may not be possible or efficient on other architectures. + +As AI and scientific computing workloads continue to evolve, the architectural innovations in MI300 position it well for emerging requirements around efficiency, precision flexibility, and sparse computation. Developers who master these architectural features will be well-positioned to create high-performance applications that fully leverage the capabilities of this advanced compute platform. + +The investment in understanding MI300's unique characteristics pays dividends not only in immediate performance improvements but also in preparing for future architectural developments that will likely build upon these foundational innovations. The principles and techniques outlined in this guide provide a foundation for continued optimization and adaptation as both hardware and software ecosystems evolve. + +## References + +[1] AMD Instinct MI300 CDNA3 Instruction Set Architecture Reference Guide, Advanced Micro Devices, Inc., June 2025. 
+ +[2] AMD GPUOpen Blog: AMD Lab Notes - Matrix Cores README, https://gpuopen.com/learn/amd-lab-notes/amd-lab-notes-matrix-cores-README/ + +[3] AMD Matrix Instruction Calculator, RadeonOpenCompute, https://github.com/RadeonOpenCompute/amd_matrix_instruction_calculator + diff --git a/kernel-agentic/docs/hip/AMD MI300 HIP Kernel Programming Guide_ CDNA3 Architecture Insights.md b/kernel-agentic/docs/hip/AMD MI300 HIP Kernel Programming Guide_ CDNA3 Architecture Insights.md new file mode 100644 index 0000000..35ed08a --- /dev/null +++ b/kernel-agentic/docs/hip/AMD MI300 HIP Kernel Programming Guide_ CDNA3 Architecture Insights.md @@ -0,0 +1,331 @@ +# AMD MI300 HIP Kernel Programming Guide: CDNA3 Architecture Insights + + +## Executive Summary + +The AMD CDNA3 architecture, embodied in the MI300 series accelerators, represents a paradigmatic shift in GPU design philosophy that fundamentally impacts how high-performance HIP kernels should be written and optimized. Unlike traditional monolithic GPU designs, CDNA3 embraces a heterogeneous chiplet architecture that introduces unique programming considerations, memory hierarchy optimizations, and performance characteristics that differ significantly from NVIDIA's AI accelerators. + +This guide synthesizes critical architectural insights from the AMD CDNA3 white paper to provide large language models and developers with the specialized knowledge necessary to generate high-quality HIP kernels optimized for MI300 hardware. The focus is on architectural features that are either unique to AMD or implemented differently from NVIDIA solutions, as general GPU programming concepts are assumed to be well-understood. + +The MI300 series introduces revolutionary concepts including memory-side caching through AMD Infinity Cache, 2:4 structured sparsity support, novel data types like TF32 and OCP-compliant FP8, and a relaxed memory coherency model that requires explicit synchronization. 
These features, combined with the chiplet-based design and enhanced matrix processing capabilities, create both opportunities and challenges for kernel optimization that are distinct from CUDA programming paradigms. + + + + +## 1. CDNA3 Architecture Overview: Chiplet-Based Design Implications + +The AMD CDNA3 architecture fundamentally departs from traditional monolithic GPU designs by implementing a heterogeneous chiplet approach that has profound implications for kernel programming and optimization strategies. Understanding this architectural foundation is crucial for writing efficient HIP kernels that can fully exploit the hardware capabilities. + +### 1.1 Heterogeneous Chiplet Organization + +The MI300 series processors are constructed using up to 8 Accelerator Complex Dies (XCDs) and 4 I/O Dies (IODs), each fabricated on different process nodes and optimized for specific functions. The XCDs, manufactured on TSMC's 5nm process, contain the computational elements and lower-level cache hierarchy, while the IODs, built on TSMC's 6nm process, house the memory controllers, AMD Infinity Cache, and system interconnects. This separation allows for specialized optimization of each component while enabling vertical 3D stacking through advanced packaging technologies. + +Each XCD contains exactly 40 Compute Units (CUs), with 38 active units and 2 disabled for yield management purposes. This yields a total of 304 active CUs across the full MI300X configuration, representing approximately 40% more computational resources than the previous generation MI250X. The consistent 38-CU configuration per XCD creates predictable resource allocation patterns that kernel developers can exploit for load balancing and work distribution strategies. + +The chiplet design introduces unique considerations for memory access patterns and inter-CU communication. 
Unlike monolithic designs where all CUs share uniform access to memory controllers, the CDNA3 architecture creates a hierarchical access pattern where CUs within the same XCD have lower latency access to the local L2 cache, while cross-XCD communication must traverse the AMD Infinity Fabric network. This architectural characteristic suggests that kernel designs should prioritize data locality within XCD boundaries when possible, and carefully consider the cost of cross-XCD data sharing. + +### 1.2 Asynchronous Compute Engine Architecture + +Each XCD incorporates 4 Asynchronous Compute Engines (ACEs) that serve as the primary work distribution mechanism for compute shader workgroups. The ACEs dispatch workgroups to the XCD's 40 CUs, of which 38 are active due to yield management. This 4-ACE configuration provides fine-grained control over work distribution and enables sophisticated load balancing strategies that can adapt to varying computational workloads. + +The ACE architecture differs significantly from NVIDIA's GigaThread Engine approach by providing multiple independent scheduling domains within each XCD. This design enables better isolation between concurrent kernels and can reduce scheduling overhead for workloads that can be effectively partitioned across the available ACEs. Kernel developers should consider designing workgroup distributions that align with the 4-ACE structure to minimize scheduling conflicts and maximize throughput. + +The hardware scheduler (HWS) coordinates work distribution across all ACEs and manages the hardware queues (HQD0-7) that feed work to the compute accelerators. Understanding this scheduling hierarchy is important for optimizing kernel launch patterns and minimizing dispatch overhead, particularly for workloads that involve frequent kernel launches or complex dependency chains.
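One way to act on the XCD locality advice is to remap workgroup ids so that each XCD processes a contiguous range of logical tiles. The host-side sketch below rests on an assumption that this guide does not state: that workgroups are handed to XCDs round-robin by linear workgroup id. Verify that behavior for your driver and partition mode before using the idea. The helper names `assumed_xcd_of` and `logical_tile_for` are hypothetical.

```cpp
#include <cassert>

// Assumption (not stated in this guide): workgroups are handed to XCDs
// round-robin by linear workgroup id, so id w runs on XCD (w % num_xcds).
inline int assumed_xcd_of(int wg_id, int num_xcds) { return wg_id % num_xcds; }

// Remap a hardware workgroup id to a logical tile index so that each XCD
// receives one contiguous range of tiles. Requires num_wgs % num_xcds == 0.
inline int logical_tile_for(int wg_id, int num_wgs, int num_xcds) {
    const int tiles_per_xcd = num_wgs / num_xcds;
    const int xcd  = wg_id % num_xcds;  // XCD this workgroup is assumed to run on
    const int slot = wg_id / num_xcds;  // its position among that XCD's workgroups
    return xcd * tiles_per_xcd + slot;
}
```

With 16 workgroups and 8 XCDs, workgroups 0 and 8 are both assumed to run on XCD 0 and are given logical tiles 0 and 1, so neighbouring tiles that share data are more likely to hit the same L2.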
+ +### 1.3 Compute Unit Internal Architecture + +The CDNA3 Compute Units represent a comprehensive redesign that doubles or quadruples performance per CU for vector and matrix workloads compared to the previous generation. Each CU functions as a complete, highly threaded parallel processor core that includes instruction fetching and scheduling, execution units for scalar, vector, and matrix operations, and load/store pipelines with integrated L1 cache and Local Data Share (LDS). + +A critical architectural innovation is the shared 64KB instruction cache between pairs of CUs, which doubles the capacity from the previous generation while maintaining nearly constant die area. This design exploits the common pattern where adjacent CUs execute identical instruction streams, effectively increasing the cacheable instruction window and improving hit rates. Kernel developers should be aware that instruction cache efficiency is maximized when neighboring CUs execute similar code paths, suggesting that workgroup assignment strategies should consider instruction locality alongside data locality. + +The enhanced source caching mechanism provides improved register reuse and bandwidth amplification, allowing each vector register read to support multiple downstream vector or matrix operations. This architectural feature rewards kernel designs that maximize register reuse and minimize redundant memory accesses, particularly for computationally intensive operations where the same data elements are used across multiple computational stages. + + +## 2. Memory Hierarchy and Caching Strategy: The Infinity Cache Revolution + +The CDNA3 memory hierarchy represents one of the most significant departures from conventional GPU memory systems and introduces programming considerations that are fundamentally different from NVIDIA architectures. Understanding these differences is crucial for optimizing memory access patterns and achieving peak performance in HIP kernels. 
+ +### 2.1 Three-Tier Cache Hierarchy with Memory-Side Caching + +The CDNA3 architecture implements a unique three-tier cache hierarchy consisting of L1 vector data cache, L2 cache, and the revolutionary AMD Infinity Cache. This design differs markedly from traditional two-tier GPU cache hierarchies and introduces novel optimization opportunities that kernel developers must understand to achieve optimal performance. + +The L1 vector data cache has been substantially enhanced with a doubled cache line size of 128 bytes and doubled capacity to 32KB per CU. This larger cache line size is particularly beneficial for streaming workloads and vectorized operations that access contiguous memory regions. The increased line size also doubles the bandwidth between the L1 cache and the core, providing improved data delivery rates for bandwidth-intensive kernels. However, the larger cache lines also mean that memory access patterns with poor spatial locality may suffer from increased cache pollution, making careful attention to data layout and access patterns even more critical. + +The L2 cache serves as a 4MB, 16-way set-associative cache shared by all 38 CUs within an XCD. The L2 is organized into 16 parallel channels of 256KB each, enabling massive parallelism with the ability to sustain four requests from different CUs per cycle. This design provides a combined throughput of 2KB per clock per XCD, with aggregate read bandwidth across all XCDs reaching up to 34.4 TB/s. The L2 cache plays a critical role as the lowest level where hardware coherency is automatically maintained, making it the boundary between coherent and non-coherent memory operations. + +### 2.2 AMD Infinity Cache: Memory-Side Cache Innovation + +The AMD Infinity Cache represents a paradigm shift in GPU cache design, implementing a memory-side cache architecture that fundamentally differs from traditional cache hierarchies. 
Unlike conventional caches that can hold dirty data evicted from lower levels, the Infinity Cache is designed as a shared memory-side cache that exclusively caches the contents of memory and cannot hold dirty data. + +This design choice provides two significant advantages that impact kernel programming strategies. First, the Infinity Cache does not participate in coherency protocols and does not need to handle snoop traffic, which significantly improves efficiency and reduces latency for coherency operations from lower-level caches. Second, the cache can hold nominally uncacheable memory such as I/O buffers, providing performance benefits for kernels that work with mixed data types or perform I/O operations alongside computation. + +The Infinity Cache is organized around 128 parallel channels across 8 HBM stacks, with each channel being 64 bytes wide and connected to 2MB of data arrays. The total capacity of 256MB provides substantial caching capability, while the peak bandwidth of 17.2 TB/s approaches the aggregate bandwidth of previous generation L2 caches. This massive bandwidth makes the Infinity Cache particularly effective for workloads with good temporal locality but poor spatial locality, as it can efficiently serve repeated accesses to scattered memory locations. + +### 2.3 Relaxed Coherency Model and Synchronization Requirements + +A critical difference from NVIDIA architectures is the CDNA3's relaxed coherency model, which requires explicit synchronization to provide strong coherency and ordering guarantees. The L1 vector data cache operates with very relaxed coherency semantics, meaning that kernel developers must explicitly manage cache coherency through appropriate synchronization primitives and memory fence operations. + +This relaxed coherency model provides performance benefits by eliminating the overhead of automatic coherency maintenance, but it places additional responsibility on kernel developers to ensure correct memory ordering. 
Kernels that share data between workgroups, or that depend on a specific memory ordering, must use explicit synchronization such as memory fences, atomic operations, or barriers to remain correct. + +Because the L2 is the coherency boundary, data shared within a single XCD is kept coherent by hardware at the L2 level, although the L1 vector caches in front of it still follow the relaxed semantics described above. Data shared across XCDs requires explicit synchronization. Kernel designs should therefore minimize cross-XCD data sharing where possible, or structure such sharing around the appropriate synchronization mechanisms. + +### 2.4 HBM3/HBM3E Memory Interface Optimization + +The CDNA3 architecture uses HBM3 on the MI300X and MI300A and HBM3E on the MI325X. The MI300X provides 192GB of HBM3 with 5.3 TB/s peak bandwidth, while the MI325X offers 256GB of HBM3E with 6.0 TB/s peak bandwidth. Both are significant increases over previous generations and make larger memory-bound working sets practical. + +The memory controllers are distributed across the IODs and drive the memory at 5.2 Gbps per pin for HBM3 and 6.0 Gbps per pin for HBM3E. With eight 1024-bit stacks, the HBM3 rate works out to 8 × 1024 × 5.2 / 8 ≈ 5,325 GB/s, which matches the quoted 5.3 TB/s. Each IOD manages two HBM stacks, so peak bandwidth is only reachable when memory accesses are spread across all stacks. Kernel developers should favor access patterns that keep every memory controller busy. + +The channel-based organization extends from the L2 cache through the Infinity Cache to the HBM interface, with each HBM stack associated with 16 parallel channels. This consistent channel organization gives predictable performance and lets data placement be aligned with the underlying hardware organization. + + +## 3.
Matrix Core Technology and Advanced Data Type Support + +The CDNA3 Matrix Cores represent a substantial evolution in specialized compute capabilities, introducing new data types and computational paradigms that are specifically optimized for modern AI and machine learning workloads. Understanding these capabilities and their optimal usage patterns is essential for developing high-performance HIP kernels for AI applications. + +### 3.1 Enhanced Matrix Core Architecture + +The Matrix Cores in CDNA3 have been comprehensively redesigned to provide dramatic performance improvements across all supported data types. The architecture delivers generational improvements ranging from 1.7x for FP64 operations to 6.8x for INT8 operations compared to the previous CDNA2 generation. These improvements are achieved through a combination of increased parallelism, enhanced data path widths, and optimized instruction scheduling. + +Each Compute Unit contains integrated Matrix Core functionality that can execute matrix operations in parallel with vector operations, enabling sophisticated kernel designs that can overlap different types of computation. The Matrix Cores support a wide range of data types with varying throughput characteristics, allowing kernel developers to choose the optimal precision for their specific workload requirements while maximizing computational throughput. + +The peak theoretical performance for matrix operations reaches impressive levels: 163.4 TFLOP/s for FP32 matrix operations, 1,307.4 TFLOP/s for FP16/BF16 operations, and an extraordinary 2,614.9 TFLOP/s for FP8 operations on the MI300X. These performance levels represent substantial improvements over previous generations and enable new classes of computationally intensive applications that were previously impractical. 
+ +### 3.2 Novel Data Type Support: TF32 and FP8 + +The CDNA3 architecture introduces support for two critical new data types that are becoming increasingly important in modern AI workloads: TF32 and FP8. These data types provide different trade-offs between precision, performance, and memory efficiency, enabling kernel developers to optimize for specific application requirements. + +TF32 is a 19-bit hybrid data format that combines the 10-bit mantissa precision of FP16 with the 8-bit exponent range of BF16, plus a sign bit. Despite its name suggesting a 32-bit format, TF32 is actually more compact while providing a precision and range combination that can effectively replace FP32 in most machine learning applications without accuracy degradation. The Matrix Cores provide full-rate support for TF32 operations at 1,024 FLOPS per clock per CU, offering a compelling balance between performance and precision for training workloads that require higher precision than FP16 but don't need full FP32 precision. + +FP8 support follows the OCP 8-bit Floating Point Specification, providing two variants optimized for different use cases. The E5M2 variant, with a 5-bit exponent and 2-bit mantissa, is optimized for training workloads where the extended range is more important than mantissa precision. The E4M3 variant, with a 4-bit exponent and 3-bit mantissa, is optimized for inference workloads where mantissa precision is more critical than extended range. The Matrix Cores can achieve 4,096 operations per clock per CU for FP8 operations, representing 16x the throughput of FP32 operations while using only 1/4 the memory bandwidth. + +### 3.3 Structured Sparsity Support and 2:4 Sparse Operations + +One of the most innovative features of the CDNA3 Matrix Cores is native support for structured sparsity, specifically the 2:4 sparse pattern where at least two values within every group of four input values are zero. 
This sparsity support is available for matrix operations using INT8, FP8, FP16, and BF16 data types, enabling up to double the computational throughput for workloads that can exploit this sparsity pattern. + +The sparse matrix support is implemented through a compact representation in which the non-zero data is stored in dense form, with additional metadata recording where the zero values were. The dense representation fits directly into the Matrix Core pipeline, and the hardware skips the computations that involve zero values. When the sparsity requirements are met, the Matrix Cores can achieve up to 8,192 operations per clock per CU, twice the dense FP8 and INT8 rate of 4,096. + +The 2:4 sparsity pattern is particularly well-suited to many neural network architectures, especially attention mechanisms in transformer-based models and convolution-based networks. Kernel developers working with these models should check whether their data can be structured to exploit this support, as the performance benefits can be substantial. The hardware can only exploit sparsity that follows the exact 2:4 pattern; unstructured sparsity gains nothing. + +### 3.4 Matrix Core Programming Considerations + +Effective utilization of the Matrix Cores requires careful attention to data layout, operation scheduling, and memory access patterns. The Matrix Cores work most efficiently with data that is aligned and organized to match the hardware's internal data paths. Kernel developers should ensure that matrix data is laid out in memory with appropriate alignment and that matrix dimensions are chosen to maximize hardware utilization. + +The integration of Matrix Cores within the Compute Units enables sophisticated kernel designs that can overlap matrix operations with vector operations and memory accesses.
This capability allows for the development of fused kernels that can perform complex operations without intermediate memory round-trips, potentially providing significant performance improvements for workloads that can exploit this parallelism. + +Memory bandwidth considerations are particularly important when working with the Matrix Cores, as the high computational throughput can quickly become memory-bound if data access patterns are not optimized. The enhanced cache hierarchy, including the Infinity Cache, can help mitigate memory bandwidth limitations for workloads with good temporal locality, but kernel developers must still carefully consider data reuse patterns and memory access optimization. + +### 3.5 Performance Optimization Strategies + +Achieving optimal performance with the Matrix Cores requires a holistic approach that considers data types, sparsity patterns, memory access patterns, and operation scheduling. Kernel developers should start by selecting the most appropriate data type for their precision requirements, considering the substantial performance benefits available with lower-precision formats when accuracy requirements permit. + +For workloads that can exploit sparsity, restructuring data to match the 2:4 sparse pattern can provide dramatic performance improvements. This may require preprocessing steps to identify and reorganize sparse data, but the computational benefits can justify this overhead for many applications. The sparse support is particularly valuable for inference workloads where the sparsity patterns can be determined offline and optimized for the specific hardware capabilities. + +Memory access optimization becomes even more critical when working with the high-throughput Matrix Cores. Kernel designs should prioritize data reuse, minimize memory round-trips, and structure memory accesses to take advantage of the cache hierarchy. 
The large cache line sizes and substantial cache capacities in CDNA3 can provide significant benefits for workloads that can maintain good spatial and temporal locality. + + +## 4. Key Differences from NVIDIA AI Accelerators + +Understanding the fundamental differences between AMD CDNA3 and NVIDIA AI accelerators is crucial for developers transitioning between platforms or optimizing kernels for cross-platform compatibility. These differences span architectural philosophy, memory systems, programming models, and performance characteristics. + +### 4.1 Architectural Philosophy: Chiplets vs. Monolithic Design + +The most fundamental difference between CDNA3 and NVIDIA architectures lies in the basic design philosophy. NVIDIA's H100 and A100 accelerators follow a monolithic die approach where all computational and memory control functions are integrated onto a single large die. This design provides uniform access patterns and simplified programming models but is limited by the maximum practical die size and manufacturing yield considerations. + +In contrast, CDNA3 embraces a heterogeneous chiplet architecture that separates computational functions (XCDs) from memory and I/O functions (IODs). This approach enables specialized optimization of each chiplet type and allows for more flexible scaling through the addition of more chiplets. However, it also introduces hierarchical access patterns and requires more sophisticated programming strategies to achieve optimal performance. + +The chiplet approach provides several advantages that impact kernel programming. The ability to disable individual CUs for yield management (2 per XCD) provides more predictable performance characteristics compared to monolithic designs where yield issues might affect larger functional blocks. The separation of compute and memory functions also enables independent optimization of each subsystem, potentially providing better performance for specific workload types. 
+ +### 4.2 Memory Hierarchy Differences: Memory-Side Cache vs. Traditional Caching + +The memory hierarchy represents one of the most significant differences between CDNA3 and NVIDIA architectures. NVIDIA accelerators typically implement a traditional two-level cache hierarchy (L1 and L2) with write-through L1 caches and hardware-managed coherency. This approach provides predictable behavior and simplified programming models but may not be optimal for all workload types. + +CDNA3's three-tier hierarchy with the memory-side Infinity Cache introduces novel optimization opportunities that don't exist in NVIDIA architectures. The memory-side cache design means that the Infinity Cache can hold data that would be uncacheable in traditional architectures, such as I/O buffers or streaming data. This capability can provide significant performance benefits for kernels that work with mixed data types or perform complex memory access patterns. + +The relaxed coherency model in CDNA3 contrasts sharply with NVIDIA's hardware-managed coherency. While NVIDIA's approach simplifies programming by automatically maintaining cache coherency, it also introduces overhead that may not be necessary for all workloads. CDNA3's explicit synchronization requirements provide more control over coherency operations but require more sophisticated programming to ensure correctness. + +### 4.3 Compute Unit Organization and Scheduling Differences + +The organization of computational resources differs significantly between the two architectures. NVIDIA's Streaming Multiprocessors (SMs) typically contain 64-128 CUDA cores along with specialized Tensor Cores, with a single GigaThread Engine managing work distribution across all SMs. This centralized scheduling approach provides good load balancing but may introduce bottlenecks for certain workload types. 
+ +CDNA3's approach with 4 Asynchronous Compute Engines per XCD provides more distributed scheduling and can offer better isolation between concurrent workloads. Each ACE manages a subset of the available CUs, enabling more fine-grained control over work distribution and potentially reducing scheduling overhead for workloads that can be effectively partitioned. + +The shared instruction cache between pairs of CUs in CDNA3 is another unique feature that doesn't have a direct equivalent in NVIDIA architectures. This design can provide significant benefits for workloads where adjacent CUs execute similar instruction streams, but it also requires careful consideration of workgroup assignment strategies to maximize cache efficiency. + +### 4.4 Data Type and Precision Support Variations + +While both architectures support a range of data types for AI workloads, there are important differences in implementation and performance characteristics. NVIDIA's Tensor Cores have evolved through multiple generations with different capabilities, and the specific data types and operations supported can vary significantly between different GPU models. + +CDNA3's support for TF32 as a native data type represents a unique approach to balancing precision and performance. While NVIDIA accelerators can perform TF32 operations, the implementation details and performance characteristics may differ. The OCP-compliant FP8 support in CDNA3 also follows industry standards that may not be directly compatible with NVIDIA's FP8 implementations. + +The structured sparsity support in CDNA3 follows the 2:4 pattern that is also supported by NVIDIA architectures, but the implementation details and performance characteristics can differ significantly. Kernel developers need to understand these differences to optimize sparsity exploitation for each platform. 
+ +### 4.5 Programming Model and Software Stack Differences + +The programming model differences between HIP and CUDA represent both opportunities and challenges for kernel developers. HIP is designed to provide CUDA-like syntax while enabling cross-platform compatibility, but there are subtle differences in semantics and capabilities that can impact kernel performance and correctness. + +The ROCm software stack's open-source nature provides greater visibility into the underlying implementation compared to NVIDIA's closed-source approach. This transparency can enable more sophisticated optimization strategies but also requires developers to have a deeper understanding of the software stack internals. + +Memory management approaches also differ between the platforms. NVIDIA's Unified Memory system provides automatic data migration between CPU and GPU memory spaces, while AMD's approach typically requires more explicit memory management. The MI300A APU variant provides true unified memory that eliminates the need for data copies, but this capability is unique to the APU configuration. + +### 4.6 Virtualization and Multi-Tenancy Approaches + +The virtualization capabilities of CDNA3 and NVIDIA architectures follow different philosophies that impact how kernels can be deployed in multi-tenant environments. NVIDIA's Multi-Instance GPU (MIG) technology provides fixed partition sizes with strong isolation guarantees, but limited flexibility in partition configuration. + +CDNA3's spatial partitioning approach based on XCDs provides more flexible partition sizes and can be combined with NUMA memory partitioning for sophisticated resource allocation strategies. The SR-IOV support also provides hardware-level isolation that can be valuable for certain deployment scenarios. + +These virtualization differences can impact kernel design strategies, particularly for applications that need to run in multi-tenant environments or that require specific resource allocation patterns. 
Understanding the capabilities and limitations of each approach is important for developing kernels that can effectively utilize the available hardware resources. + +### 4.7 Interconnect and Scaling Characteristics + +The interconnect technologies used for multi-GPU scaling also differ between the platforms. NVIDIA's NVLink technology has evolved through multiple generations with varying bandwidth and topology capabilities, while AMD's Infinity Fabric provides a different approach to inter-GPU communication. + +The fully connected 8-GPU topologies enabled by CDNA3's Infinity Fabric can provide advantages for certain communication patterns, particularly all-reduce and all-gather operations that are common in distributed machine learning workloads. However, the specific performance characteristics and optimal usage patterns can differ from NVIDIA's NVLink-based solutions. + +Understanding these interconnect differences is crucial for developing kernels that will be used in multi-GPU configurations, as the optimal communication strategies and data distribution patterns can vary significantly between platforms. + + +## 5. HIP Kernel Programming Best Practices for CDNA3 + +Developing high-performance HIP kernels for CDNA3 requires understanding the unique architectural characteristics and optimizing for the specific capabilities and constraints of the platform. This section provides concrete guidance for kernel developers to achieve optimal performance on MI300 hardware. + +### 5.1 Memory Access Pattern Optimization + +The CDNA3 memory hierarchy with its three-tier cache system and relaxed coherency model requires careful attention to memory access patterns. The doubled cache line size of 128 bytes means that kernels should be designed to maximize spatial locality within these larger cache lines. Sequential memory accesses that can fill entire cache lines will achieve better bandwidth utilization than scattered access patterns. 
+ +The memory-side Infinity Cache provides unique optimization opportunities that don't exist in traditional GPU architectures. Kernels that can maintain good temporal locality across large working sets can benefit significantly from the 256MB cache capacity and 17.2 TB/s bandwidth. This is particularly valuable for iterative algorithms or kernels that process the same data multiple times with different operations. + +The relaxed coherency model requires explicit synchronization for cross-workgroup communication or when specific memory ordering is required. Kernel developers should use appropriate memory fence operations, atomic operations, or barrier synchronization to ensure correctness. The coherency boundary at the L2 cache level means that operations within a single XCD can rely on hardware coherency, while cross-XCD operations require explicit synchronization. + +### 5.2 Workgroup and Thread Block Organization + +The 4-ACE architecture within each XCD suggests that workgroup organization should consider the scheduling hierarchy to minimize conflicts and maximize throughput. Workgroups should be sized and distributed to enable effective utilization of all available ACEs while maintaining good load balance across the 38 active CUs per XCD. + +The shared instruction cache between pairs of CUs rewards kernel designs where adjacent CUs execute similar instruction streams. This suggests that workgroup assignment strategies should consider instruction locality alongside data locality. Kernels with divergent control flow should be structured to minimize the impact on instruction cache efficiency. + +The Local Data Share (LDS) remains at 64KB per CU, consistent with previous generations. Effective utilization of LDS for data sharing between threads within a workgroup can reduce memory traffic and improve performance. The enhanced L1 cache capacity and bandwidth can also reduce the pressure on LDS for certain access patterns. 
+ +### 5.3 Matrix Core Utilization Strategies + +Achieving optimal performance with the Matrix Cores requires careful attention to data layout, operation scheduling, and precision selection. Matrix data should be organized in memory with appropriate alignment to match the hardware's internal data paths. The specific alignment requirements may vary depending on the data type and operation being performed. + +The integration of Matrix Cores within the Compute Units enables sophisticated kernel designs that can overlap matrix operations with vector operations and memory accesses. Kernels should be structured to take advantage of this parallelism by organizing computations to minimize dependencies and enable concurrent execution of different operation types. + +Data type selection can have dramatic performance implications. FP8 operations can achieve 16x the throughput of FP32 operations while using only 1/4 the memory bandwidth. TF32 provides a good balance between precision and performance for many applications. Kernel developers should carefully evaluate their precision requirements and select the most appropriate data type to maximize performance. + +### 5.4 Sparsity Exploitation Techniques + +The 2:4 structured sparsity support in the Matrix Cores can provide up to 2x performance improvements for compatible workloads. However, exploiting this capability requires that data be structured in the specific 2:4 pattern where at least two values in every group of four are zero. This may require preprocessing steps to identify and reorganize sparse data. + +Kernels that work with naturally sparse data, such as attention mechanisms in transformer models or certain types of convolution operations, should be evaluated for sparsity exploitation potential. The performance benefits can be substantial, but the overhead of data reorganization must be considered in the overall performance analysis. 
+ +The sparse support is available for INT8, FP8, FP16, and BF16 data types, providing flexibility in precision selection while maintaining sparsity benefits. Kernel developers should consider whether lower precision formats can be used to enable both sparsity and precision optimizations simultaneously. + +### 5.5 Cross-Platform Compatibility Considerations + +When developing kernels that need to run on both AMD and NVIDIA platforms, careful attention to programming model differences is essential. While HIP provides CUDA-like syntax, there are semantic differences that can impact performance and correctness. Memory management approaches, synchronization semantics, and performance characteristics can all differ between platforms. + +The relaxed coherency model in CDNA3 may require additional synchronization compared to NVIDIA platforms with hardware-managed coherency. Kernels should be designed with explicit synchronization that ensures correctness on both platforms, even if some synchronization operations may be redundant on certain platforms. + +Data type support and performance characteristics can vary significantly between platforms. Kernels should be designed with fallback strategies for data types or features that may not be available on all target platforms. Performance tuning may need to be platform-specific to achieve optimal results on each architecture. + +### 5.6 Debugging and Profiling Strategies + +The ROCm software stack provides comprehensive debugging and profiling tools that can help identify performance bottlenecks and correctness issues. The open-source nature of the stack provides greater visibility into the underlying implementation compared to closed-source alternatives, enabling more sophisticated debugging strategies. + +Memory access pattern analysis is particularly important for CDNA3 kernels due to the complex cache hierarchy and relaxed coherency model. 
Profiling tools can help identify cache miss patterns, memory bandwidth utilization, and synchronization overhead that may not be apparent from source code analysis alone. + +The chiplet architecture can introduce performance variations that may not be present in monolithic designs. Profiling should consider the distribution of work across XCDs and the impact of cross-XCD communication on overall performance. Load balancing strategies may need to be adjusted based on profiling results to achieve optimal performance. + +### 5.7 Performance Tuning and Optimization Workflow + +Developing high-performance CDNA3 kernels requires an iterative optimization workflow that considers the unique architectural characteristics. Initial kernel development should focus on correctness and basic functionality, followed by systematic optimization of memory access patterns, compute utilization, and synchronization overhead. + +Memory hierarchy optimization should be prioritized early in the development process, as the three-tier cache system can have significant impact on performance. Cache-friendly data layouts and access patterns should be established before focusing on computational optimizations. + +Matrix Core utilization should be evaluated for any kernels that perform matrix or tensor operations. The substantial performance benefits available through optimal Matrix Core usage can justify significant restructuring of computational algorithms to take advantage of these capabilities. + +The iterative nature of performance optimization means that profiling and measurement should be integrated throughout the development process. Performance characteristics can change significantly as kernels are optimized, and continuous measurement ensures that optimizations are providing the expected benefits. + + +## 6. 
Technical Specifications and Performance Characteristics + +### 6.1 MI300 Series Specifications Comparison + +| Specification | MI300A APU | MI300X GPU | MI325X GPU | +|---------------|------------|------------|------------| +| **Architecture** | AMD CDNA 3 | AMD CDNA 3 | AMD CDNA 3 | +| **Accelerator Complex Dies (XCD)** | 6 | 8 | 8 | +| **Active Compute Units** | 228 | 304 | 304 | +| **Stream Processors** | 14,592 | 19,456 | 19,456 | +| **Matrix Cores** | 912 | 1,216 | 1,216 | +| **Max Engine Clock** | 2,100 MHz | 2,100 MHz | 2,100 MHz | +| **CPU Cores (Zen 4)** | 24 | N/A | N/A | +| **Memory Capacity** | 128GB HBM3 | 192GB HBM3 | 256GB HBM3E | +| **Memory Bandwidth** | 5.3 TB/s | 5.3 TB/s | 6.0 TB/s | +| **Memory Interface** | 1024-bit x 8 | 1024-bit x 8 | 1024-bit x 8 | +| **L1 Cache per CU** | 32KB | 32KB | 32KB | +| **L2 Cache per XCD** | 4MB | 4MB | 4MB | +| **Infinity Cache Total** | 256MB | 256MB | 256MB | + +### 6.2 Matrix Core Performance Characteristics + +| Data Type | Operations per Clock per CU | MI300X Peak Performance | MI325X Peak Performance | Generational Improvement | +|-----------|----------------------------|------------------------|------------------------|-------------------------| +| **FP64 Matrix** | 256 | 163.4 TFLOP/s | 163.4 TFLOP/s | 1.7x | +| **FP32 Matrix** | 256 | 163.4 TFLOP/s | 163.4 TFLOP/s | 1.7x | +| **TF32 Matrix** | 1,024 | 653.7 TFLOP/s | 653.7 TFLOP/s | New | +| **FP16 Matrix** | 2,048 | 1,307.4 TFLOP/s | 1,307.4 TFLOP/s | 3.4x | +| **BF16 Matrix** | 2,048 | 1,307.4 TFLOP/s | 1,307.4 TFLOP/s | 3.4x | +| **FP8 Matrix** | 4,096 | 2,614.9 TFLOP/s | 2,614.9 TFLOP/s | New | +| **INT8 Matrix** | 4,096 | 2,614.9 TOPs | 2,614.9 TOPs | 6.8x | +| **Sparse (2:4) Performance** | Up to 8,192 | Up to 5,229.8 TFLOP/s | Up to 5,229.8 TFLOP/s | 2x with sparsity | + +### 6.3 Memory Hierarchy Performance Characteristics + +| Memory Level | Capacity | Bandwidth | Latency Characteristics | Key Features | 
+|--------------|----------|-----------|------------------------|--------------| +| **L1 Vector Cache** | 32KB per CU | 2KB/clock per CU | Lowest latency | 128-byte cache lines, relaxed coherency | +| **L2 Cache** | 4MB per XCD | 2KB/clock per XCD | Low latency | 16-way associative, coherency boundary | +| **Infinity Cache** | 256MB total | 17.2 TB/s aggregate | Medium latency | Memory-side cache, no dirty data | +| **HBM3/HBM3E** | 128-256GB | 5.3-6.0 TB/s | Highest latency | 8 stacks, 128 channels total | + +## 7. Conclusion and Future Considerations + +The AMD CDNA3 architecture is a substantial change in GPU design that brings both opportunities and challenges for HIP kernel developers. The heterogeneous chiplet approach, the memory hierarchy built around the Infinity Cache, and the Matrix Core capabilities offer substantial performance potential to applications that can exploit them. + +### 7.1 Key Takeaways for Kernel Developers + +The most critical insight for kernel developers is that CDNA3 requires a different optimization mindset from traditional GPU architectures. The memory-side Infinity Cache, relaxed coherency model, and chiplet-based organization create optimization opportunities that do not exist in monolithic designs, but they also require more deliberate programming strategies. + +The Matrix Core enhancements, particularly the support for TF32 and FP8 data types along with structured sparsity, provide large performance improvements for AI workloads. Achieving these benefits requires careful attention to data layout, precision selection, and sparsity structuring, and may require significant algorithmic modifications. + +The three-tier cache hierarchy demands careful consideration of memory access patterns and explicit synchronization strategies.
Kernel developers must understand the coherency boundaries and design their algorithms to work effectively within the relaxed coherency model while taking advantage of the substantial cache bandwidth and capacity. + +### 7.2 Architectural Advantages and Unique Capabilities + +The CDNA3 architecture provides several unique advantages that distinguish it from competing solutions. The memory-side Infinity Cache design enables caching of data types that would be uncacheable in traditional architectures, potentially providing performance benefits for complex workloads with mixed data types. The chiplet approach enables more flexible scaling and specialized optimization of different functional units. + +The unified memory capability in the MI300A APU represents a particularly compelling advantage for certain workload types, eliminating the overhead of host-device data transfers and enabling new programming paradigms that can exploit true CPU-GPU memory sharing. This capability is unique in the current market and provides opportunities for innovative algorithm designs. + +The open-source ROCm software stack provides transparency and customization opportunities that are not available with closed-source alternatives. This openness enables more sophisticated optimization strategies and provides developers with greater control over the software stack behavior. + +### 7.3 Challenges and Considerations + +The complexity of the CDNA3 architecture also introduces challenges that kernel developers must navigate. The relaxed coherency model requires more explicit synchronization management, which can increase development complexity and the potential for subtle correctness issues. The chiplet-based design creates hierarchical access patterns that must be understood and optimized for optimal performance. 
+ +Cross-platform compatibility considerations become more complex when targeting both AMD and NVIDIA platforms, as the architectural differences require platform-specific optimization strategies. Kernel developers must balance the benefits of platform-specific optimizations against the complexity of maintaining multiple code paths. + +### 7.4 Future Evolution and Ecosystem Development + +The CDNA3 architecture represents a significant step forward in GPU design, but it also establishes a foundation for future evolution. The chiplet approach provides a scalable framework for adding new capabilities and increasing computational resources in future generations. The software ecosystem around ROCm and HIP continues to mature, providing increasingly sophisticated tools and libraries for kernel development. + +The industry trend toward lower precision data types and structured sparsity is well-supported by CDNA3's capabilities, positioning it well for future AI workload evolution. The architectural innovations in memory hierarchy and compute organization provide a foundation for continued performance improvements as manufacturing processes and packaging technologies advance. + +Understanding and effectively utilizing the CDNA3 architecture requires a comprehensive approach that considers the unique architectural characteristics, programming model differences, and optimization opportunities. Kernel developers who invest in understanding these aspects will be well-positioned to achieve exceptional performance on MI300 hardware and contribute to the continued evolution of the AMD GPU computing ecosystem. + +The architectural innovations in CDNA3 represent more than incremental improvements; they constitute a new paradigm for GPU design that will likely influence future developments across the industry. 
Kernel developers who master these concepts will be prepared not only for current MI300 optimization but also for the continued evolution of heterogeneous computing architectures. + +--- + +*This guide represents a comprehensive analysis of the AMD CDNA3 architecture based on official documentation and technical specifications. Kernel developers should consult the latest ROCm documentation and AMD developer resources for the most current programming guidelines and optimization recommendations.* + diff --git a/kernel-agentic/docs/hip/HIP Kernel Programming Guide for MI300_ Key Differences from NVIDIA AI Chips.md b/kernel-agentic/docs/hip/HIP Kernel Programming Guide for MI300_ Key Differences from NVIDIA AI Chips.md new file mode 100644 index 0000000..9905d70 --- /dev/null +++ b/kernel-agentic/docs/hip/HIP Kernel Programming Guide for MI300_ Key Differences from NVIDIA AI Chips.md @@ -0,0 +1,433 @@ +# HIP Kernel Programming Guide for MI300: Key Differences from NVIDIA AI Chips + + +## Abstract + +This document provides a comprehensive guide for writing high-quality HIP kernels specifically optimized for AMD MI300 accelerators. It focuses on the unique architectural features and programming considerations that differ from NVIDIA AI chips, enabling developers to leverage the full potential of AMD's CDNA architecture. The guide covers essential topics including wavefront execution models, memory hierarchy optimization, matrix acceleration units, and performance tuning strategies specific to MI300 hardware. + +## Table of Contents + +1. [Introduction](#introduction) +2. [Architecture Overview](#architecture-overview) +3. [Wavefront vs Warp Execution Model](#wavefront-vs-warp-execution-model) +4. [Memory Hierarchy and Access Patterns](#memory-hierarchy-and-access-patterns) +5. [CDNA Matrix Acceleration Units](#cdna-matrix-acceleration-units) +6. [Synchronization and Atomic Operations](#synchronization-and-atomic-operations) +7. 
[Compiler Directives and Architecture Detection](#compiler-directives-and-architecture-detection) +8. [Performance Optimization Strategies](#performance-optimization-strategies) +9. [Debugging and Profiling](#debugging-and-profiling) +10. [Best Practices Summary](#best-practices-summary) +11. [References](#references) + +## Introduction + +The AMD MI300 series represents a significant advancement in accelerated computing, featuring the CDNA3 architecture specifically designed for high-performance computing and artificial intelligence workloads. Unlike NVIDIA's CUDA-based AI chips, MI300 accelerators utilize AMD's HIP (Heterogeneous-compute Interface for Portability) programming model, which provides both portability and performance optimization opportunities unique to AMD's hardware architecture. + +Understanding the fundamental differences between AMD's CDNA architecture and NVIDIA's GPU architectures is crucial for developers seeking to maximize performance on MI300 systems. This guide focuses on the less commonly known aspects of HIP programming that are specific to AMD hardware, particularly those features that distinguish MI300 from NVIDIA AI chips in terms of execution models, memory systems, and optimization strategies. + +The MI300 architecture introduces several key innovations including enhanced matrix acceleration units, optimized memory hierarchies, and unique wavefront execution patterns that require specific programming approaches to achieve optimal performance. These architectural differences necessitate a deep understanding of HIP-specific programming techniques that go beyond general GPU programming knowledge. + + + +## Architecture Overview + +### CDNA3 Compute Unit Structure + +The MI300 series is built on AMD's CDNA3 architecture, which represents a fundamental departure from traditional GPU designs optimized primarily for graphics workloads. 
The CDNA (Compute DNA) architecture is purpose-built for compute-intensive applications, particularly those involving machine learning, scientific computing, and data analytics. + +Each CDNA3 compute unit (CU) contains several key components that distinguish it from NVIDIA's streaming multiprocessors (SMs). The most significant architectural difference lies in the inclusion of dedicated matrix acceleration units alongside traditional vector arithmetic logic units (VALUs). This hybrid approach allows MI300 to excel at both traditional parallel computing tasks and modern AI workloads that heavily utilize matrix operations. + +The compute unit structure includes four Single Instruction Multiple Data (SIMD) units, each capable of executing 16 operations per cycle. This design choice directly impacts the wavefront size, which is typically 64 threads on AMD hardware compared to NVIDIA's 32-thread warps. The larger wavefront size can provide better memory bandwidth utilization and improved occupancy for memory-bound kernels, but requires careful consideration of control flow divergence patterns. + +### Memory Hierarchy Differences + +The CDNA3 memory hierarchy introduces several unique features that differentiate it from NVIDIA architectures. The local data share (LDS) serves as the equivalent to NVIDIA's shared memory but with distinct performance characteristics and access patterns. The LDS provides high-bandwidth, low-latency storage accessible to all threads within a workgroup, with a capacity that varies by specific MI300 model but typically exceeds comparable NVIDIA offerings. + +The vector cache system in CDNA3 operates differently from NVIDIA's L1 cache, with specific optimizations for coalesced memory access patterns common in HPC and AI workloads. Understanding these differences is crucial for optimizing memory access patterns and achieving peak performance on MI300 hardware. 
+ +### Shader Engine Organization + +MI300 accelerators organize compute units into shader engines, which serve as the primary scheduling and resource management units. Each shader engine contains multiple compute units and shares certain fixed-function resources, including memory controllers and cache hierarchies. This organization affects how workloads are distributed across the device and influences optimal kernel launch configurations. + +The shader engine design also impacts the effectiveness of different synchronization strategies and inter-workgroup communication patterns. Developers must consider shader engine boundaries when designing algorithms that require coordination between different parts of the computation. + + +## Wavefront vs Warp Execution Model + +### Fundamental Execution Differences + +The most critical difference between AMD's HIP and NVIDIA's CUDA lies in the basic execution unit size. While NVIDIA GPUs execute threads in groups of 32 (warps), AMD GPUs traditionally use wavefronts of 64 threads. This difference has profound implications for kernel design, memory access patterns, and performance optimization strategies. + +On MI300 and other CDNA architectures, the wavefront size remains 64 threads, which means that the SIMD units execute instructions for 64 threads simultaneously. This larger execution group can provide several advantages, including better memory bandwidth utilization when accessing contiguous memory regions and improved arithmetic intensity for compute-bound kernels. + +However, the larger wavefront size also introduces unique challenges. Control flow divergence within a wavefront can be more costly than in NVIDIA's smaller warps, as more threads may be masked out during conditional execution. Developers must carefully structure conditional code to minimize divergence within 64-thread boundaries rather than the 32-thread boundaries familiar to CUDA programmers. 
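As a rough illustration of the 64-thread boundary discussed above, the host-side C++ sketch below computes how many wavefronts a block occupies, what fraction of the allocated lanes carry real work, and whether two threads share a wavefront. The helper names (`wavefronts_per_block`, `lane_utilization`, `same_wavefront`) are hypothetical and not part of HIP, and the wavefront size is passed in as a parameter rather than queried from a device.

```cpp
#include <cassert>

// Number of wavefronts needed to hold `block_threads` threads (ceiling division).
// Illustrative helper; not part of the HIP API.
constexpr int wavefronts_per_block(int block_threads, int wave_size) {
    return (block_threads + wave_size - 1) / wave_size;
}

// Fraction of allocated lanes that map to real threads in the block.
constexpr double lane_utilization(int block_threads, int wave_size) {
    return static_cast<double>(block_threads) /
           (wavefronts_per_block(block_threads, wave_size) * wave_size);
}

// True when threads `a` and `b` (linear indices within the block) execute in the
// same wavefront, so a branch that separates them diverges inside that wavefront.
constexpr bool same_wavefront(int a, int b, int wave_size) {
    return a / wave_size == b / wave_size;
}

static_assert(wavefronts_per_block(96, 64) == 2, "96 threads occupy two 64-wide wavefronts");
static_assert(wavefronts_per_block(96, 32) == 3, "the same block is three 32-wide warps");
```

A branch such as `threadIdx.x < 32` is uniform per warp on a 32-wide machine but splits the first wavefront in half on a 64-wide one, which is what `same_wavefront(31, 32, 64)` returning true reflects. Likewise, a 96-thread block fills only three quarters of its lanes at width 64.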
+ +### Wavefront Scheduling and Occupancy + +The wavefront scheduling mechanism on MI300 differs significantly from NVIDIA's warp scheduling. Each compute unit can accommodate multiple wavefronts simultaneously, with the exact number depending on register usage and local data share consumption. The larger wavefront size means that each wavefront consumes more resources, potentially reducing the total number of concurrent wavefronts per compute unit. + +Occupancy calculations for AMD hardware must account for the 64-thread wavefront size when determining optimal block sizes and resource usage. A kernel that achieves high occupancy on NVIDIA hardware with 32-thread warps may require adjustment to achieve similar occupancy on AMD hardware with 64-thread wavefronts. + +The wavefront scheduler prioritizes ready wavefronts based on instruction availability and resource constraints. Understanding this scheduling behavior is crucial for optimizing instruction-level parallelism and hiding memory latency through effective wavefront interleaving. + +### RDNA Dual-Mode Execution + +While MI300 primarily uses CDNA architecture, it's important to note that some AMD GPUs support dual-mode execution where wavefronts can operate in either 32-thread or 64-thread modes. This flexibility, primarily found in RDNA architectures, allows for better compatibility with code originally designed for NVIDIA hardware while maintaining the performance benefits of larger wavefronts when appropriate. + +For MI300 specifically, the CDNA3 architecture maintains the traditional 64-thread wavefront size, providing consistency and predictability for HPC and AI workloads. This design choice reflects AMD's focus on compute performance over graphics compatibility in the CDNA product line. + +### Programming Implications + +When writing kernels for MI300, developers must consider the wavefront size in several key areas. 
Thread block dimensions should be chosen to align with 64-thread boundaries to maximize hardware utilization. Memory access patterns should be designed to take advantage of the larger wavefront size for improved coalescing efficiency. + +Reduction operations and other collective algorithms must be adapted for 64-thread wavefronts rather than 32-thread warps. HIP provides wavefront-aware intrinsics and functions that automatically adapt to the hardware's native wavefront size, but understanding the underlying execution model is essential for optimal performance. + +The larger wavefront size also affects shared memory usage patterns and synchronization requirements. Algorithms that rely on fine-grained synchronization within small thread groups may need restructuring to work efficiently with 64-thread wavefronts. + + +## Memory Hierarchy and Access Patterns + +### Local Data Share (LDS) Optimization + +The Local Data Share (LDS) in CDNA3 architecture serves as the primary on-chip memory for inter-thread communication within a workgroup, analogous to NVIDIA's shared memory but with distinct characteristics. The LDS on MI300 provides high bandwidth and low latency access, but its optimal usage patterns differ from NVIDIA shared memory due to architectural differences in banking and access scheduling. + +LDS memory is organized into banks that can be accessed simultaneously by different threads within a wavefront. However, the banking structure and conflict resolution mechanisms differ from NVIDIA's implementation. Bank conflicts occur when multiple threads within a wavefront attempt to access the same bank simultaneously, leading to serialized access and reduced throughput. + +To optimize LDS usage on MI300, developers should structure data layouts to minimize bank conflicts while maximizing memory bandwidth utilization. 
This often involves careful consideration of stride patterns and data alignment, particularly when implementing algorithms that require frequent data sharing between threads. + +The LDS capacity on MI300 varies by specific model but generally provides substantial on-chip storage for complex algorithms. Effective utilization of this capacity can significantly reduce global memory traffic and improve overall kernel performance, particularly for algorithms with high data reuse patterns. + +### Global Memory Access Optimization + +Global memory access patterns on MI300 require specific optimization strategies that differ from NVIDIA hardware. The memory controllers and cache hierarchy are optimized for different access patterns, with particular emphasis on supporting the larger wavefront size and the specific memory access patterns common in HPC and AI workloads. + +Coalesced memory access remains crucial for performance, but the definition of optimal coalescing differs due to the 64-thread wavefront size. Memory transactions are optimized for 64-thread access patterns rather than 32-thread patterns, which can affect the optimal stride and alignment requirements for peak memory bandwidth. + +The vector cache system in CDNA3 provides automatic caching of global memory accesses, but its effectiveness depends on access locality and pattern predictability. Understanding the cache line sizes and replacement policies can help developers structure memory access patterns to maximize cache hit rates and minimize memory latency. + +### Memory Coalescing Strategies + +Effective memory coalescing on MI300 requires understanding the relationship between wavefront execution and memory transaction generation. With 64-thread wavefronts, the memory system can generate larger, more efficient transactions when threads access contiguous memory regions. + +The optimal memory access pattern involves having consecutive threads within a wavefront access consecutive memory locations. 
This pattern allows the memory controller to combine multiple thread requests into fewer, larger memory transactions, maximizing memory bandwidth utilization. + +When perfect coalescing is not possible due to algorithm constraints, developers should strive to minimize the number of memory transactions required per wavefront. This may involve restructuring data layouts, using appropriate data types, or implementing software-managed caching strategies using LDS memory. + +### Cache Hierarchy Utilization + +The CDNA3 cache hierarchy includes multiple levels of caching, each optimized for different access patterns and data types. The L0 vector cache provides the fastest access to recently used data, while higher-level caches provide larger capacity with slightly increased latency. + +Understanding the cache hierarchy is crucial for optimizing algorithms with complex memory access patterns. Temporal locality can be exploited by structuring algorithms to reuse data within cache-friendly time windows, while spatial locality can be improved by organizing data structures to maximize cache line utilization. + +The cache replacement policies and associativity characteristics of MI300 caches are optimized for compute workloads rather than graphics workloads, which can affect the optimal strategies for data management and algorithm structuring. + +### Texture and Surface Memory + +While less commonly used in compute kernels, texture and surface memory on MI300 provide specialized access patterns and data filtering capabilities that can be beneficial for certain algorithms. These memory types offer hardware-accelerated interpolation and boundary handling, which can be particularly useful for image processing and scientific computing applications. + +The texture cache hierarchy on CDNA3 is optimized for 2D spatial locality, making it effective for algorithms that exhibit spatial access patterns. 
Understanding when and how to use texture memory can provide performance benefits for appropriate workloads, particularly those involving regular grid-based computations. + + +## CDNA Matrix Acceleration Units + +### Matrix Core Architecture + +One of the most significant differentiators of the CDNA3 architecture in MI300 is the inclusion of dedicated matrix acceleration units, also known as Matrix Cores or MFMA (Matrix Fused Multiply-Add) units. These specialized processing units are designed specifically for the matrix operations that dominate modern AI and machine learning workloads, providing substantial performance advantages over traditional vector arithmetic units for these operations. + +The Matrix Cores in MI300 support FP64, FP32, TF32, FP16, BF16, FP8, and INT8 inputs, allowing for flexible precision trade-offs based on application requirements. A single MFMA instruction performs a multiply-accumulate over an entire matrix tile, dramatically reducing the instruction count and improving throughput for matrix-heavy computations. + +Unlike NVIDIA's Tensor Cores, which are integrated into the streaming multiprocessors, AMD's Matrix Cores are separate functional units within each compute unit. This architectural choice allows for concurrent execution of matrix operations and traditional vector operations, enabling more sophisticated kernel designs that can overlap different types of computations. + +### MFMA Instruction Set + +The MFMA instruction set provides direct access to matrix acceleration capabilities through compiler builtins (the `__builtin_amdgcn_mfma_*` family) and inline assembly. These instructions operate on matrix tiles of various sizes, typically ranging from 4x4 to 32x32 elements, depending on the data type and specific operation requirements. + +Programming with MFMA instructions requires careful consideration of data layout and memory access patterns. 
Matrix data must be organized in specific formats to maximize the efficiency of matrix operations, often requiring data reorganization or specialized loading patterns to achieve optimal performance. + +All MFMA instructions compute a fused multiply-accumulate of the form D = A x B + C on a tile; the variants differ in tile shape, input data type, and the number of independent blocks processed. Higher-level operations such as convolutions are not separate instructions and are built from these tile operations in software. Understanding the capabilities and limitations of each instruction variant is crucial for effective utilization of the matrix acceleration hardware. + +### Data Type Optimization + +The Matrix Cores support multiple precision modes, each with different performance characteristics and accuracy trade-offs. FP8 and INT8 offer the highest peak throughput on MI300, roughly twice that of FP16 and BF16, while FP16 and BF16 remain common choices for training and inference workloads where reduced precision is acceptable. + +Mixed-precision programming techniques can leverage the Matrix Cores' ability to perform computations in lower precision while maintaining higher precision for accumulation operations. This approach can significantly improve performance while maintaining numerical accuracy for many AI and scientific computing applications. + +The choice of data type affects not only computational throughput but also memory bandwidth requirements and cache utilization. Lower precision data types allow for more data to be stored in on-chip memory and reduce memory traffic, but require careful management of numerical precision throughout the computation. + +### Integration with Traditional Compute + +Effective utilization of Matrix Cores often requires hybrid kernel designs that combine matrix operations with traditional vector computations. This integration allows for complex algorithms that leverage the strengths of both processing units while maintaining high overall utilization. + +Scheduling and resource management become more complex when using both Matrix Cores and vector units simultaneously. 
Developers must consider the resource requirements and execution latencies of both types of operations to achieve optimal overlap and minimize idle time. + +The memory hierarchy must be carefully managed when using Matrix Cores, as these units typically require large amounts of data and can quickly saturate memory bandwidth if not properly optimized. Effective use of LDS memory and cache-friendly access patterns becomes even more critical in matrix-accelerated kernels. + +### Performance Considerations + +Matrix Core utilization requires specific kernel design patterns to achieve peak performance. The matrix operations must be large enough to fully utilize the hardware capabilities while being small enough to fit within the available on-chip memory resources. + +Tiling strategies become crucial for large matrix operations that exceed the capacity of individual Matrix Core instructions. Effective tiling must balance computational efficiency with memory access overhead, often requiring sophisticated blocking algorithms and data movement optimization. + +The interaction between Matrix Cores and the wavefront execution model requires careful consideration. Matrix operations typically involve multiple wavefronts working cooperatively on different portions of the computation, requiring coordination and synchronization strategies that differ from traditional vector-only kernels. + + +## Synchronization and Atomic Operations + +### Wavefront-Level Synchronization + +Synchronization mechanisms in HIP differ from CUDA due to the larger wavefront size and different hardware architecture. The fundamental synchronization primitive `__syncthreads()` operates at the workgroup level, ensuring that all threads within a workgroup reach the synchronization point before any thread proceeds. + +The larger wavefront size in AMD hardware affects the granularity and cost of synchronization operations. 
With 64-thread wavefronts, synchronization barriers may involve more threads and potentially more complex coordination mechanisms compared to NVIDIA's 32-thread warps. + +HIP provides additional synchronization primitives that are specific to AMD hardware, including wavefront-level synchronization functions that operate within the 64-thread wavefront boundary. These functions can provide more efficient synchronization for algorithms that require coordination only within wavefront boundaries rather than across entire workgroups. + +### Atomic Operation Support + +The atomic operation capabilities of MI300 include support for various data types and operation modes that may differ from NVIDIA hardware. HIP provides a comprehensive set of atomic functions for both global and shared memory, with specific optimizations for the CDNA architecture. + +32-bit and 64-bit integer atomic operations are fully supported across global and LDS memory spaces. The performance characteristics of these operations depend on memory location, access patterns, and contention levels. Understanding the hardware implementation of atomic operations is crucial for designing efficient algorithms that rely on atomic updates. + +Floating-point atomic operations, including atomic add operations for both single and double precision, are supported with specific performance characteristics. The atomic floating-point operations on AMD hardware may have different performance profiles compared to NVIDIA implementations, requiring benchmarking and optimization for specific use cases. + +### Memory Ordering and Consistency + +Memory ordering semantics in HIP follow specific rules that ensure correct behavior across the complex memory hierarchy of MI300. The memory consistency model defines how memory operations are ordered and when they become visible to other threads, which is crucial for correct implementation of synchronization algorithms. 
+ +The `__threadfence()` and `__threadfence_block()` functions provide memory ordering guarantees at different scopes, ensuring that memory operations complete before subsequent operations proceed. The implementation and performance of these functions on AMD hardware may differ from NVIDIA implementations. + +System-level memory fencing with `__threadfence_system()` provides the strongest ordering guarantees but with potentially higher performance costs. Understanding when each level of memory fencing is required is essential for correct and efficient synchronization in complex algorithms. + +### Cooperative Groups Integration + +HIP supports cooperative groups, which provide a more flexible and powerful synchronization model compared to traditional block-level synchronization. Cooperative groups allow for dynamic thread grouping and specialized synchronization patterns that can be more efficient for certain algorithms. + +The cooperative groups API in HIP provides thread block groups, grid groups, and multi-grid groups, each with different synchronization capabilities and performance characteristics. These groups can be particularly useful for implementing complex algorithms that require hierarchical synchronization patterns. + +Wavefront-level cooperative groups provide fine-grained synchronization within the 64-thread wavefront boundary, allowing for efficient implementation of algorithms that require frequent coordination between small groups of threads. + +### Performance Optimization Strategies + +Synchronization overhead can significantly impact kernel performance, particularly for algorithms with frequent synchronization requirements. Minimizing synchronization frequency and scope is crucial for maintaining high performance on MI300 hardware. + +Asynchronous execution patterns can help hide synchronization latency by overlapping computation with synchronization operations. 
This approach requires careful kernel design to ensure that useful work can be performed while waiting for synchronization to complete. + +The interaction between synchronization operations and the memory hierarchy requires careful consideration. Synchronization operations may flush caches or invalidate cached data, affecting subsequent memory access performance. Understanding these interactions is crucial for optimizing algorithms with complex synchronization patterns. + + +## Compiler Directives and Architecture Detection + +### HIP-Specific Preprocessor Macros + +HIP provides a set of preprocessor macros for detecting compilation context and target architecture, which differ significantly from CUDA's macro system. The `__HIP_PLATFORM_AMD__` macro indicates compilation for AMD hardware, while `__HIP_PLATFORM_NVIDIA__` indicates NVIDIA targets, allowing for platform-specific code paths within the same source file. + +The `__HIP_DEVICE_COMPILE__` macro distinguishes between host and device compilation passes, enabling conditional compilation of device-specific code. This is particularly important for MI300 kernels that may use AMD-specific intrinsics or optimization techniques not available on other platforms. + +Architecture-specific feature detection uses the `__HIP_ARCH_*` macro family, which provides fine-grained capability queries. For example, `__HIP_ARCH_HAS_WARP_SHUFFLE__` indicates support for wavefront shuffle operations, while `__HIP_ARCH_HAS_FLOAT_ATOMIC_ADD__` indicates support for floating-point atomic add. + +### Runtime Architecture Detection + +Runtime architecture detection allows host code to select kernels and launch parameters based on the actual hardware capabilities discovered at execution time. The `hipGetDeviceProperties()` function fills a `hipDeviceProp_t` structure containing detailed information about the target device, including compute capability, memory sizes, and supported features. 
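The sketch below shows one way queried properties could drive launch parameters. To stay compilable without the HIP headers it uses a small stand-in struct, `DeviceInfo`, whose two fields mirror `warpSize` and `sharedMemPerBlock` from `hipDeviceProp_t`. The struct, the function `pick_block_size`, and the sample values are assumptions made for illustration; in a real program the values would come from `hipGetDeviceProperties()`.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for the two hipDeviceProp_t fields used here (hypothetical struct).
struct DeviceInfo {
    int warpSize;                   // wavefront width reported by the runtime
    std::size_t sharedMemPerBlock;  // LDS bytes available to one block
};

// Round a requested block size up to a whole number of wavefronts, then shrink
// it one wavefront at a time until the block's total LDS demand fits.
inline int pick_block_size(const DeviceInfo& dev, int requested_threads,
                           std::size_t lds_bytes_per_thread) {
    int waves = (requested_threads + dev.warpSize - 1) / dev.warpSize;
    while (waves > 1 &&
           static_cast<std::size_t>(waves) * dev.warpSize * lds_bytes_per_thread >
               dev.sharedMemPerBlock) {
        --waves;
    }
    return waves * dev.warpSize;
}
```

With a 64-wide device and 64 KB of LDS (illustrative values), a request for 200 threads rounds up to 256, while a request for 1024 threads at 128 bytes of LDS each is trimmed to 512 so the block still fits.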
+ +For MI300 specifically, the device properties include the architecture name (`gcnArchName`), from which Matrix Core availability can be inferred, as well as the LDS capacity per block (`sharedMemPerBlock`) and the wavefront size (`warpSize`). This information can be used to select optimal algorithm variants or adjust kernel launch parameters for maximum performance. + +The `arch` member of the properties structure includes boolean flags for various hardware features, such as `hasSharedInt32Atomics`, `hasWarpVote`, and `hasDoubles`. These flags provide a portable way to query hardware capabilities without relying on specific architecture version numbers. + +### Compiler Optimization Directives + +The HIP-Clang compiler provides several AMD-specific optimization directives that can significantly impact kernel performance on MI300. The `__attribute__((amdgpu_flat_work_group_size(min, max)))` attribute allows specification of workgroup size ranges, enabling more aggressive compiler optimizations. + +The `--offload-arch` compiler flag specifies the target GPU architecture. MI300A and MI300X are targeted with `gfx942`; `gfx90a` is the previous-generation MI200 series and does not enable CDNA3-specific instructions. Proper architecture targeting enables the compiler to generate optimized code that takes advantage of architecture-specific features and instruction sets. + +Optimization level selection with `-O2` or `-O3` can significantly impact performance, but the optimal level may depend on the specific kernel characteristics and target workload. The compiler's ability to optimize for MI300's unique architecture features requires appropriate optimization settings. + +### Feature-Specific Compilation + +Conditional compilation based on architecture features allows for optimized code paths that take advantage of MI300's unique capabilities. For example, Matrix Core utilization can be conditionally compiled based on the availability of MFMA instructions. 
+ +```cpp +#if defined(__gfx942__) + // Device pass targeting gfx942 (MI300A/MI300X): use Matrix Core acceleration + // MFMA-optimized implementation +#else + // Fallback to traditional vector operations + // Standard implementation +#endif +``` + +HIP does not define a dedicated MFMA feature macro, so the check above relies on the per-target macro that Clang defines during device compilation. + +The wavefront size is available in device code through the built-in `warpSize`, and architecture-specific macros expose it at compile time, allowing algorithms to exploit the 64-thread wavefront size on AMD hardware while maintaining compatibility with other architectures. + +### Debug and Profiling Support + +The HIP compiler supports debugging and profiling through compiler flags and runtime options. The `-g` flag enables debug information generation, while `-ggdb` tunes that information for GDB-based tools such as ROCm's debugger. + +The `--save-temps` compiler flag preserves intermediate compilation files, which is useful for inspecting the generated assembly and identifying optimization opportunities. This is particularly valuable when optimizing for MI300's specific instruction set and execution model. + +Runtime logging can be enabled through the `AMD_LOG_LEVEL` environment variable, and `HIP_VISIBLE_DEVICES` restricts which devices the runtime exposes to the application. Older HIP releases also honored `HIP_TRACE_API` for API tracing. + +### Cross-Platform Compatibility + +Writing portable HIP code that performs well on both AMD and NVIDIA hardware requires careful use of conditional compilation and runtime detection. The HIP programming model is designed to support this portability while allowing for platform-specific optimizations. + +Platform-specific optimizations can be implemented using the HIP macro system while maintaining a common code base. This approach allows developers to take advantage of MI300's unique features while preserving compatibility with other GPU architectures. 
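A minimal sketch of the platform-selection pattern, assuming only the `__HIP_PLATFORM_AMD__` and `__HIP_PLATFORM_NVIDIA__` macros named earlier. The function names and block sizes are hypothetical placeholders for illustration, not tuned values. With a plain host compiler neither macro is defined and the generic branch is taken.

```cpp
#include <cassert>
#include <string>

// Label for the platform this translation unit was compiled for.
// The two macros are set by the HIP toolchain; a plain host compiler
// defines neither, so the last branch is used.
inline std::string hip_platform_name() {
#if defined(__HIP_PLATFORM_AMD__)
    return "amd";
#elif defined(__HIP_PLATFORM_NVIDIA__)
    return "nvidia";
#else
    return "host-only";
#endif
}

// Hypothetical default threads per block (an assumption, not a tuned value).
inline int default_block_size() {
#if defined(__HIP_PLATFORM_AMD__)
    return 256;  // four 64-lane wavefronts
#else
    return 128;  // four 32-lane warps
#endif
}

// Number of blocks needed to cover n elements with the default block size.
inline int grid_size_for(int n) {
    assert(n >= 0);
    const int block = default_block_size();
    return (n + block - 1) / block;
}
```

Keeping the platform split inside small helpers like these leaves the kernel launch code identical on every platform, which is one way to hold a common code base together.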
+ +The HIP runtime automatically handles many platform differences, such as memory management and kernel launch mechanisms, but performance-critical code paths may require platform-specific implementations to achieve optimal performance on MI300 hardware. + + +## Performance Optimization Strategies + +### Occupancy Optimization for 64-Thread Wavefronts + +Achieving optimal occupancy on MI300 requires understanding the relationship between wavefront size, register usage, and LDS consumption. With 64-thread wavefronts, each wavefront consumes more resources than NVIDIA's 32-thread warps, which affects the maximum number of concurrent wavefronts per compute unit. + +Register pressure becomes more significant with larger wavefronts, as each wavefront requires 64 times the per-thread register allocation. Careful register usage optimization, including register spilling strategies and algorithm restructuring, can significantly impact occupancy and overall performance. + +The LDS usage per workgroup must be balanced against the desired occupancy level. Since LDS is shared among all wavefronts within a workgroup, excessive LDS usage can limit the number of concurrent workgroups and reduce overall hardware utilization. + +### Memory Access Pattern Optimization + +Memory access optimization for MI300 requires specific attention to the 64-thread wavefront size and the CDNA memory hierarchy. Coalesced access patterns should be designed for 64-thread groups rather than 32-thread groups, which may require different stride patterns and data organization strategies. + +The vector cache system in CDNA3 is optimized for specific access patterns that may differ from NVIDIA's cache hierarchy. Understanding cache line sizes, associativity, and replacement policies can help optimize memory access patterns for maximum cache utilization. + +Bandwidth optimization requires consideration of the memory controller architecture and the specific memory subsystem configuration of MI300. 
The high-bandwidth memory (HBM) subsystem provides substantial memory bandwidth, but achieving peak utilization requires careful attention to access patterns and memory controller load balancing. + +### Instruction-Level Optimization + +The CDNA3 instruction set provides several optimization opportunities that are specific to AMD hardware. Vector ALU utilization can be maximized through careful instruction scheduling and by avoiding instruction dependencies that could cause pipeline stalls. + +The scalar unit in CDNA3 compute units can handle uniform computations across wavefronts, reducing pressure on vector resources. Identifying and optimizing scalar operations can improve overall instruction throughput and resource utilization. + +Mixed-precision arithmetic can provide significant performance benefits on MI300, particularly when using the Matrix Cores for AI workloads. The ability to perform computations in lower precision while maintaining higher precision for accumulation can dramatically improve throughput for appropriate algorithms. + +### Workload Distribution Strategies + +Effective workload distribution across MI300's compute units requires understanding the shader engine organization and compute unit capabilities. Load balancing strategies should account for the hierarchical nature of the hardware and the potential for workload imbalances between different parts of the device. + +Dynamic load balancing techniques can help address workload irregularities that are common in real-world applications. These techniques may involve work stealing, dynamic work distribution, or adaptive algorithm selection based on runtime characteristics. + +The interaction between multiple kernels executing concurrently on MI300 requires careful resource management and scheduling. Understanding the hardware's ability to overlap different types of operations can enable more sophisticated execution strategies. 
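One simple form of dynamic work distribution is a shared counter from which each worker claims the next chunk of work. The sketch below shows the idea on the host with `std::atomic`, purely as an analogue: in a HIP kernel the counter would typically live in global memory and be advanced with `atomicAdd`. The `ChunkQueue` type, its members, and `drain` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>

// Hands out [begin, end) ranges of `chunk` items until `total` items are claimed.
struct ChunkQueue {
    std::atomic<int> next{0};
    int total;
    int chunk;

    ChunkQueue(int total_items, int chunk_size)
        : total(total_items), chunk(chunk_size) {
        assert(total_items >= 0 && chunk_size > 0);
    }

    // Claims the next chunk; returns false when no work is left.
    bool claim(int& begin, int& end) {
        begin = next.fetch_add(chunk, std::memory_order_relaxed);
        if (begin >= total) return false;
        end = (begin + chunk < total) ? begin + chunk : total;
        return true;
    }
};

// Drains the queue from one worker and returns the number of items it processed.
inline int drain(ChunkQueue& q) {
    int processed = 0, begin = 0, end = 0;
    while (q.claim(begin, end)) processed += end - begin;
    return processed;
}
```

Workers that finish early simply claim more chunks, so irregular per-item cost is absorbed without a static partition. The chunk size trades contention on the counter against balance.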
+ +### Algorithm-Specific Optimizations + +Matrix-heavy algorithms can benefit significantly from MI300's Matrix Cores, but effective utilization requires algorithm restructuring to match the hardware capabilities. Tiling strategies, data layout optimization, and mixed-precision techniques are crucial for achieving peak performance. + +Reduction operations and other collective algorithms must be adapted for 64-thread wavefronts and the specific characteristics of the CDNA architecture. Wavefront-level primitives and hierarchical reduction strategies can provide better performance than direct ports from NVIDIA-optimized algorithms. + +Memory-bound algorithms require specific optimization strategies that account for the CDNA memory hierarchy and bandwidth characteristics. Techniques such as software-managed caching, prefetching, and memory access reordering can significantly improve performance for these workloads. + +### Profiling and Performance Analysis + +Effective performance optimization requires comprehensive profiling and analysis tools that understand the unique characteristics of MI300 hardware. ROCm's profiling tools provide detailed insights into wavefront execution, memory access patterns, and resource utilization. + +Instruction-level profiling can reveal optimization opportunities that are specific to the CDNA instruction set and execution model. Understanding instruction latencies, throughput characteristics, and resource dependencies is crucial for fine-tuning kernel performance. + +Memory hierarchy analysis tools can help identify cache utilization patterns, memory bandwidth bottlenecks, and opportunities for access pattern optimization. These tools are essential for understanding the complex interactions between different levels of the memory hierarchy on MI300. + + +## Debugging and Profiling + +### ROCm Debugging Tools + +The ROCm ecosystem provides specialized debugging tools designed specifically for AMD GPU architectures, including MI300. 
The ROCgdb debugger extends the standard GDB interface with GPU-specific capabilities, allowing developers to debug kernels at the wavefront and thread level. + +Setting breakpoints in HIP kernels requires understanding the wavefront execution model and the potential for divergent execution paths. The debugger can display wavefront state, register contents, and memory values, but interpreting this information requires knowledge of the CDNA execution model. + +The debugging experience on MI300 differs from NVIDIA's debugging tools in several key ways. The larger wavefront size affects how thread state is displayed and managed, while the unique memory hierarchy requires different approaches to memory inspection and analysis. + +### Performance Profiling with ROCProfiler + +ROCProfiler provides comprehensive performance analysis capabilities specifically designed for AMD GPU architectures. The profiler can collect detailed metrics about wavefront execution, memory access patterns, instruction throughput, and resource utilization on MI300 hardware. + +Wavefront occupancy analysis reveals how effectively the hardware resources are being utilized and can identify opportunities for optimization. The profiler can show occupancy levels across different compute units and identify bottlenecks that limit overall performance. + +Memory access profiling provides insights into cache hit rates, memory bandwidth utilization, and access pattern efficiency. This information is crucial for optimizing memory-bound kernels and understanding the performance characteristics of different memory hierarchy levels. + +### Kernel Launch Configuration Analysis + +Optimal kernel launch configurations for MI300 require careful analysis of workgroup size, grid dimensions, and resource usage. The profiling tools can help identify the impact of different launch configurations on occupancy, memory access efficiency, and overall performance. 
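Part of this analysis is plain arithmetic on the 64-thread wavefront size. The host-side sketch below, with hypothetical helper names, shows how a candidate workgroup size maps onto wavefronts and what fraction of lanes does useful work.

```cpp
#include <cassert>

const int kWavefrontSize = 64;  // wavefront width on MI300, as stated in this guide

// Wavefronts needed to hold a workgroup of `block_threads` threads.
inline int wavefronts_per_block(int block_threads) {
    assert(block_threads > 0);
    return (block_threads + kWavefrontSize - 1) / kWavefrontSize;
}

// Smallest multiple of the wavefront size that is >= block_threads.
inline int round_up_to_wavefront(int block_threads) {
    return wavefronts_per_block(block_threads) * kWavefrontSize;
}

// Fraction of allocated lanes that carry a real thread.
inline double lane_utilization(int block_threads) {
    return static_cast<double>(block_threads) / round_up_to_wavefront(block_threads);
}
```

For example, a 96-thread workgroup needs two wavefronts and fills 75% of their lanes, while a 256-thread workgroup fills four wavefronts completely.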
+ +The relationship between workgroup size and wavefront utilization is particularly important on AMD hardware. Since wavefronts contain 64 threads, a workgroup size that is not a multiple of 64 leaves the last wavefront partially filled, and its idle lanes still occupy hardware resources. + +Dynamic shared memory allocation and register usage analysis can reveal opportunities for resource optimization. Understanding how these resources are allocated and used across different wavefronts is crucial for achieving optimal performance. + +### Matrix Core Utilization Analysis + +Profiling Matrix Core utilization requires specialized tools and metrics that understand the unique characteristics of these acceleration units. The profiler can show Matrix Core occupancy, instruction throughput, and the effectiveness of data movement between Matrix Cores and other compute resources. + +Understanding the interaction between Matrix Cores and traditional vector units is crucial for optimizing hybrid kernels that use both types of processing resources. The profiler can reveal resource conflicts, scheduling inefficiencies, and opportunities for better resource utilization. + +Data layout analysis for Matrix Core operations can identify opportunities for improved memory access patterns and reduced data movement overhead. The profiler can show how effectively data is being supplied to the Matrix Cores and identify potential bottlenecks. + +### Environment Variables and Runtime Configuration + +The HIP runtime provides numerous environment variables for controlling debugging and profiling behavior. `HIP_VISIBLE_DEVICES` controls device visibility and can be used to isolate specific MI300 devices for testing. API call tracing was enabled with `HIP_TRACE_API` in older HIP releases; current releases expose it through the logging variables. + +Logging levels can be controlled through `AMD_LOG_LEVEL`, with `AMD_LOG_MASK` selecting which categories are logged, providing detailed information about kernel execution, memory operations, and runtime behavior. 
This information can be invaluable for debugging complex performance issues. + +The `HSA_TOOLS_LIB` environment variable enables integration with external profiling and analysis tools, allowing for more sophisticated performance analysis workflows that combine multiple tools and data sources. + +### Common Performance Pitfalls + +Several common performance pitfalls are specific to MI300 and the CDNA architecture. Wavefront divergence can be more costly than on NVIDIA hardware due to the larger wavefront size, making control flow optimization particularly important. + +Memory access patterns that work well on NVIDIA hardware may not be optimal for MI300 due to differences in cache hierarchy and memory controller organization. Understanding these differences is crucial for achieving optimal performance. + +Resource allocation imbalances between different types of compute resources (vector units, Matrix Cores, memory bandwidth) can limit overall performance. Effective profiling can identify these imbalances and guide optimization efforts. + + +## Best Practices Summary + +### Architecture-Specific Considerations + +When developing HIP kernels for MI300, always consider the 64-thread wavefront size in algorithm design and memory access patterns. This fundamental difference from NVIDIA's 32-thread warps affects occupancy calculations, synchronization strategies, and optimal workgroup sizes. + +Leverage the Matrix Cores for AI and linear algebra workloads by restructuring algorithms to use matrix operations where possible. The dedicated matrix acceleration units can provide substantial performance improvements for appropriate workloads, but require careful data layout and algorithm design. + +Optimize memory access patterns for the CDNA memory hierarchy, which differs significantly from NVIDIA architectures. Pay particular attention to LDS usage, cache-friendly access patterns, and the interaction between different memory hierarchy levels. 
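The occupancy effect of these choices can be approximated before profiling. The following is a rough host-side model, not a replacement for measured occupancy. Every hardware limit in `CuLimits` is an assumed default for illustration rather than a value taken from this guide, and the struct and function names are hypothetical. It also ignores register allocation granularity and scalar registers.

```cpp
#include <algorithm>
#include <cassert>

// Illustrative per-compute-unit limits. These defaults are assumptions for
// the sketch; check them against the documentation for your device.
struct CuLimits {
    int vgprs_per_lane = 512;         // vector registers available to one lane
    int max_waves_per_simd = 8;       // scheduler limit per SIMD
    int simds_per_cu = 4;             // SIMD units in one compute unit
    int lds_bytes_per_cu = 64 * 1024; // LDS capacity per compute unit
    int wavefront_size = 64;
};

// Rough upper bound on resident wavefronts per SIMD for one kernel.
inline int estimate_waves_per_simd(int vgprs_per_thread, int lds_bytes_per_block,
                                   int threads_per_block,
                                   const CuLimits& hw = CuLimits()) {
    assert(vgprs_per_thread > 0 && threads_per_block > 0 && lds_bytes_per_block >= 0);
    const int by_registers = hw.vgprs_per_lane / vgprs_per_thread;
    int by_lds = hw.max_waves_per_simd;  // no LDS use means no LDS limit
    if (lds_bytes_per_block > 0) {
        const int waves_per_block =
            (threads_per_block + hw.wavefront_size - 1) / hw.wavefront_size;
        const int blocks_per_cu = hw.lds_bytes_per_cu / lds_bytes_per_block;
        by_lds = blocks_per_cu * waves_per_block / hw.simds_per_cu;
    }
    return std::min(hw.max_waves_per_simd, std::min(by_registers, by_lds));
}
```

Under these assumed limits, a kernel using 128 vector registers per thread is capped at 4 wavefronts per SIMD by registers alone, and a 256-thread workgroup using 32 KB of LDS is capped at 2 by LDS.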
+ +### Performance Optimization Guidelines + +Design kernels with occupancy optimization in mind, considering the larger resource requirements of 64-thread wavefronts. Balance register usage, LDS consumption, and workgroup size to achieve optimal hardware utilization. + +Use HIP's architecture detection capabilities to implement platform-specific optimizations while maintaining code portability. This allows for optimal performance on MI300 while preserving compatibility with other GPU architectures. + +Profile extensively using ROCm's specialized tools to understand performance characteristics and identify optimization opportunities. The unique architecture of MI300 requires specific profiling approaches and metrics that differ from NVIDIA-focused tools. + +### Code Organization Strategies + +Structure code to take advantage of both vector units and Matrix Cores when appropriate, using hybrid approaches that maximize overall hardware utilization. This may require algorithm restructuring and careful resource management. + +Implement robust error handling and debugging support using HIP's debugging capabilities and ROCm tools. The complexity of MI300's architecture makes comprehensive debugging support essential for development productivity. + +Use conditional compilation and runtime detection to create portable code that performs optimally across different GPU architectures while taking full advantage of MI300's unique capabilities. + +### Development Workflow Recommendations + +Establish a development workflow that includes regular profiling and performance analysis using ROCm tools. The unique characteristics of MI300 make performance analysis an integral part of the development process rather than an afterthought. + +Test kernels across different workload sizes and input characteristics to ensure robust performance across the range of expected use cases. MI300's architecture may exhibit different performance characteristics for different workload patterns. 
+ +Maintain awareness of ROCm ecosystem updates and new optimization techniques as the toolchain and hardware capabilities continue to evolve. The rapidly advancing nature of GPU computing makes continuous learning essential. + +## References + +[1] AMD HIP Documentation, Release 6.1.40092. Advanced Micro Devices, Inc., September 2024. Available at: https://rocm.docs.amd.com/projects/HIP/en/latest/ + +[2] AMD CDNA3 Architecture Overview. Advanced Micro Devices, Inc. Available at: https://www.amd.com/en/products/accelerators/instinct/mi300 + +[3] ROCm Documentation. Advanced Micro Devices, Inc. Available at: https://rocm.docs.amd.com/ + +[4] HIP Programming Guide. Advanced Micro Devices, Inc. Available at: https://rocm.docs.amd.com/projects/HIP/en/latest/user_guide/programming_manual.html + +[5] AMD GPU Hardware Specifications. Available at: https://rocm.docs.amd.com/en/latest/reference/gpu-arch-specs.html + +[6] HIP API Reference. Advanced Micro Devices, Inc. Available at: https://rocm.docs.amd.com/projects/HIP/en/latest/doxygen/html/ + +[7] ROCProfiler User Guide. Advanced Micro Devices, Inc. Available at: https://rocm.docs.amd.com/projects/rocprofiler/en/latest/ + +[8] AMD Matrix Instruction Calculator. Available at: https://github.com/ROCmSoftwarePlatform/amd-matrix-instruction-calculator + +--- + +*This document serves as a comprehensive guide for HIP kernel development on AMD MI300 accelerators. 
For the most current information and updates, please refer to the official AMD ROCm documentation and release notes.* + diff --git a/kernel-agentic/docs/hip/hip_spec.pdf b/kernel-agentic/docs/hip/hip_spec.pdf deleted file mode 100644 index 4a2a6a4..0000000 Binary files a/kernel-agentic/docs/hip/hip_spec.pdf and /dev/null differ diff --git a/kernel-agentic/docs/hip/hip_spec_mm/hip_spec-with-image-refs-enhanced.html b/kernel-agentic/docs/hip/hip_spec_mm/hip_spec-with-image-refs-enhanced.html deleted file mode 100644 index a633576..0000000 --- a/kernel-agentic/docs/hip/hip_spec_mm/hip_spec-with-image-refs-enhanced.html +++ /dev/null @@ -1,8151 +0,0 @@ - - -
- -Advanced Micro Devices, Inc.
-Sep 13, 2024
-**Following table contains:** The table appears to represent a structured outline or table of contents for a document or guide related to installing and building HIP (Heterogeneous-Compute Interface for Portability). Each row corresponds to a section or subsection of the document. - -- The first column seems to indicate the main section number or title. -- The second column provides a more detailed section or subsection number. -- The third column contains the title or description of the section or subsection. -- The fourth column appears to indicate a page number where the section or subsection can be found. - -Noteworthy values include: -- The main sections are "1 Overview" and "3 Build HIP from source." -- Subsections under "Install HIP" include "Prerequisites," "Installation," and "Verify your installation." -- The page numbers suggest that the "Overview" section spans pages 3 to 6, while "Build HIP from source" starts on page 7.
| Section | Title | Page |
|---|---|---|
| 1 | Overview | 3 |
| 2 | Install HIP | 5 |
| 2.1 | Prerequisites | 5 |
| 2.2 | Installation | 5 |
| 2.3 | Verify your installation | 6 |
| 3 | Build HIP from source | 7 |
| 3.1 | Prerequisites | 7 |
| 3.2 | Building the HIP runtime | 7 |
| 3.3 | Build HIP tests | 10 |
| 3.4 | Run HIP | 11 |
| 4 | HIP programming model | 13 |
| 4.1 | RDNA & CDNA architecture summary | 13 |
| 4.2 | Heterogeneous Programming | 14 |
| 4.3 | Single instruction multiple threads (SIMT) | 14 |
| 4.4 | Inherent thread model | 15 |
| 4.4.1 | Cooperative groups thread | 16 |
| 4.5 | Memory model | 16 |
| 4.6 | Execution model | 17 |
| 4.6.1 | Host-side execution | 17 |
| 4.6.2 | Device-side execution | 17 |
| 4.6.3 | Kernel launch | 18 |
| 5 | Hardware implementation | 19 |
| 5.1 | Compute units | 19 |
| 5.1.1 | SIMD | 20 |
| 5.1.2 | Vector cache | 20 |
| 5.1.3 | Local data share | 20 |
| 5.1.4 | Scalar Unit | 20 |
| 5.2 | CDNA architecture | 20 |
| 5.3 | RDNA architecture | 21 |
| 5.4 | Shader engines | 21 |
| 6 | AMD common language runtimes (CLR) | 23 |
| 6.1 | Project organization | 23 |
| 6.2 | How to build/install | 23 |
| 6.2.1 | Prerequisites | 23 |
| 6.2.2 | Linux | 23 |
| 6.2.3 | Test | 24 |
**Following table contains:** The table appears to represent a structured outline or index of a document, likely a technical manual or guide related to HIP (Heterogeneous-computing Interface for Portability) programming. Each row corresponds to a section or subsection of the document. - -- **Column 0**: This column seems to represent the section numbers or identifiers, which help in organizing the document into a hierarchical structure. For example, "7 7.1" indicates a main section and its subsection. - -- **Column 1**: This column contains the titles or headings of the sections and subsections. It provides a brief description of the content covered in each part of the document, such as "Host Memory" or "Coherency Controls." - -- **Column 2**: This column appears to contain additional descriptive text or continuation of the section titles, often represented by ellipses, which might indicate omitted or summarized content. - -- **Column 3**: This column likely represents page numbers where each section or subsection can be found in the document, aiding in navigation. - -Noteworthy values include the presence of ellipses in columns 1 and 2, suggesting that the full text or titles might be truncated or summarized in this preview. Additionally, the consistent page numbers (mostly 25 and 26) suggest that these sections are closely located within the document, possibly indicating a detailed discussion on a specific topic within a few pages.
| Section | Title | Page |
|---|---|---|
| 6.2.4 | Release notes | 24 |
| 7 | HIP programming manual | 25 |
| 7.1 | Host Memory | 25 |
| 7.1.1 | Introduction | 25 |
| 7.1.2 | Memory allocation flags | 25 |
| 7.1.3 | Numa-aware host memory allocation | 26 |
| 7.1.4 | Coherency Controls | 26 |
| 7.1.5 | Visibility of Zero-Copy Host Memory | 27 |
| 7.1.6 | hipEventSynchronize | 27 |
| 7.1.7 | Summary and Recommendations | 27 |
| 7.1.8 | Managed memory allocation | 28 |
| 7.1.9 | HIP Stream Memory Operations | 28 |
| 7.2 | Direct Dispatch | 28 |
| 7.3 | HIP Runtime Compilation | 29 |
| 7.4 | HIP Graph | 29 |
| 7.5 | Device-Side Malloc | 29 |
| 7.6 | Use of Per-thread default stream | 29 |
| 7.7 | Use of Long Double Type | 30 |
| 7.8 | Use of _Float16 Type | 30 |
| 7.9 | FMA and contractions | 30 |
| 7.10 | Math functions with special rounding modes | 30 |
| 7.11 | Creating Static Libraries | 30 |
| 8 | HIP porting guide | 33 |
| 8.1 | Porting a New CUDA Project | 33 |
| 8.1.1 | General Tips | 33 |
| 8.1.2 | Scanning existing CUDA code to scope the porting effort | 33 |
| 8.1.3 | Converting a project 'in-place' | 34 |
| 8.1.4 | Library Equivalents | 35 |
| 8.2 | Distinguishing Compiler Modes | 35 |
| 8.2.1 | Identifying HIP Target Platform | 35 |
| 8.2.2 | Identifying the Compiler: hip-clang or NVCC | 36 |
| 8.2.3 | Identifying Current Compilation Pass: Host or Device | 36 |
| 8.2.4 | Compiler Defines: Summary | 37 |
| 8.3 | Identifying Architecture Features | 37 |
| 8.3.1 | HIP_ARCH Defines | 37 |
| 8.3.2 | Device-Architecture Properties | 38 |
| 8.3.3 | Table of Architecture Properties | 38 |
| 8.4 | Finding HIP | 39 |
| 8.5 | Identifying HIP Runtime | 40 |
| 8.6 | hipLaunchKernelGGL | 40 |
| 8.7 | Compiler Options | 40 |
| 8.7.1 | Compiler options supported on AMD platforms | 40 |
| 8.8 | Linking Issues | 41 |
| 8.8.1 | Linking With hipcc | 41 |
| 8.8.2 | -lm Option | 41 |
| 8.9 | Linking Code With Other Compilers | 41 |
| 8.9.1 | libc++ and libstdc++ | 41 |
| 8.9.2 | HIP Headers (hip_runtime.h, hip_runtime_api.h) | 42 |
| 8.9.3 | Using a Standard C++ Compiler | 42 |
| 8.9.3.1 | cuda.h | 42 |
| 8.9.4 | Choosing HIP File Extensions | 42 |
| 8.10 | Workarounds | 43 |
8.10.1 . . . 43
-**Following table contains:** The table represents a comparison of different HIP (Heterogeneous-computing Interface for Portability) API synchronization functions. Each row corresponds to a specific HIP API function and describes its synchronization behavior and memory visibility characteristics. - -The columns are as follows: -1. **HIP API Synchronization Effect**: Describes the synchronization action performed by the HIP API function. -2. **Fence**: Indicates the type of memory release operation associated with the synchronization (e.g., system-scope release, device-scope release, or none). -3. **Coherent Memory ity**: Specifies whether coherent memory is supported (notably, there seems to be a typo or truncation in the column name). -4. **Host Visibil-**: Indicates whether the host has visibility into the memory operations (the column name appears truncated). -5. **Non-Coherent Host Memory Visi- bility**: Specifies whether non-coherent host memory visibility is supported (the column name appears truncated). - -Noteworthy values: -- All functions listed support coherent memory. -- The `hipEventSynchronize` function has a "depends - see below" note for non-coherent host memory visibility, suggesting conditional behavior or additional context not provided in the table. -- The `hipStreamWaitEvent` function does not support non-coherent host memory visibility.
| Section | Title | Page |
|---|---|---|
| | warpSize | |
| 8.10.2 | Kernel launch with group size > 256 | 43 |
| 8.11 | memcpyToSymbol | 43 |
| 8.12 | CU_POINTER_ATTRIBUTE_MEMORY_TYPE | 44 |
| 8.13 | threadfence_system | 45 |
| 8.13.1 | Textures and Cache Control | 45 |
| 8.14 | More Tips | 46 |
| 8.14.1 | HIP Logging | 46 |
| 8.14.2 | Debugging hipcc | 47 |
| 8.14.3 | Editor Highlighting | 47 |
| 9 | Porting CUDA driver API | 49 |
| 9.1 | Introduction to the CUDA Driver and Runtime APIs | 49 |
| 9.1.1 | cuModule API | 49 |
| 9.1.2 | cuCtx API | 50 |
| 9.2 | HIP Module and Ctx APIs | 50 |
| 9.2.1 | hipModule API | 50 |
| 9.2.2 | hipCtx API | 51 |
| 9.2.3 | hipify translation of CUDA Driver API | 51 |
| 9.2.3.1 | Address Spaces | 51 |
| 9.2.3.2 | Using hipModuleLaunchKernel | 51 |
| 9.2.3.3 | Additional Information | 51 |
| 9.2.4 | hip-clang Implementation Notes | 51 |
| 9.2.4.1 | .hip_fatbin | 51 |
| 9.2.4.2 | Initialization and Termination Functions | 52 |
| 9.2.4.3 | Kernel Launching | 52 |
| 9.2.5 | NVCC Implementation Notes | 52 |
| 9.2.5.1 | Interoperation between HIP and CUDA Driver | 52 |
| 9.2.5.2 | Compilation Options | 53 |
| 9.3 | HIP Module and Texture Driver API | 55 |
| 10 | Programming for HIP runtime compiler (RTC) | 57 |
| 10.1 | Example | 57 |
| 10.2 | HIPRTC specific options | 61 |
| 10.2.1 | Bitcode | 62 |
| 10.2.2 | CU Mode vs WGP mode | 62 |
| 10.3 | Linker APIs | 62 |
| 10.3.1 | Example | 63 |
| 10.3.1.1 | Note | 63 |
| 10.3.2 | Input Types | 64 |
| 10.3.3 | Backward Compatibility of LLVM Bitcode/IR | 64 |
| 10.3.4 | Link Options | 64 |
| 10.4 | Error Handling | 65 |
| 10.5 | HIPRTC General APIs | 65 |
| 10.6 | Lowered Names (Mangled Names) | 66 |
| 10.6.1 | Note | 66 |
| 10.6.2 | Example | 66 |
| 10.7 | Versioning | 67 |
| 10.8 | HIP header support | 68 |
| 10.9 | Deprecation notice | 68 |
| 11 | Performance guidelines | 69 |
| 11.1 | Parallel execution | 69 |
11.1.2 Device level . . . 69
-**Following table contains:** The table represents a comparison of different libraries available in CUDA, HIP, and ROCm ecosystems, which are used for various computational tasks in high-performance computing. Each row corresponds to a specific library or set of functionalities, and the columns indicate the equivalent library in each ecosystem along with a brief comment describing the library's purpose. - -Columns: -- "CUDA Library": Lists the libraries available in the CUDA ecosystem. -- "HIP Library": Lists the equivalent libraries available in the HIP ecosystem. -- "ROCm Library": Lists the equivalent libraries available in the ROCm ecosystem. -- "Comment": Provides a brief description of the library's functionality or purpose. - -Noteworthy values: -- Some libraries do not have equivalents in all ecosystems, as indicated by "N/A" (e.g., AmgX in the HIP Library column). -- The "cuBLASLt" library has a "N/A" entry in the ROCm Library column, suggesting there is no direct equivalent in ROCm for this lightweight and flexible API version of cuBLAS. -- The "AmgX" library is only available in the CUDA ecosystem, with no equivalents in HIP or ROCm, highlighting a unique offering in CUDA for sparse iterative solvers and preconditioners with algebraic multigrid.
| Section | Title |
|---|---|
| 11.1.3 | Multiprocessor level |
| 11.2 | Memory optimization |
| 11.2.1 | Data Transfer |
| 11.2.2 | Device Memory Access |
| 11.3 | Optimization for maximum instruction throughput |
| 11.3.1 | Arithmetic instructions |
| 11.3.2 | Control flow instructions |
| 11.3.3 | Synchronization |
| 11.4 | Minimizing memory thrashing |
| 12 | Debugging with HIP |
| 12.1 | Tracing |
| 12.2 | Debugging |
| 12.2.1 | Debugging HIP applications |
| 12.3 | Useful environment variables |
| 12.3.1 | Kernel enqueue serialization |
| 12.3.2 | Making device visible |
| 12.3.3 | Dump code object |
| 12.3.4 | HSA-related environment variables (Linux) |
| 12.3.5 | HIP environment variable summary |
| 12.4 | General debugging tips |
| 13 | Logging HIP activity |
| 13.1 | Logging level |
| 13.2 | Logging mask |
| 13.3 | Logging command |
| 13.4 | Logging examples |
| 14 | Cooperative groups |
| 14.1 | Cooperative groups thread model |
| 14.2 | Group types |
| 14.2.1 | Thread-block group |
| 14.2.2 | Grid group |
| 14.2.3 | Multi-grid group |
| 14.2.4 | Thread-block tile |
| 14.2.5 | Coalesced groups |
| 14.3 | Cooperative groups simple example |
| 14.4 | Synchronization |
| 14.5 | Unsupported NVIDIA CUDA features |
| 15 | Unified memory |
| 15.1 | Unified memory |
| 15.2 | System requirements |
| 15.3 | Unified memory programming models |
| 15.3.1 | Checking unified memory management support |
| 15.3.2 | Example for unified memory management |
| 15.4 | Using unified memory management (UMM) |
| 15.5 | Unified memory HIP runtime hints for the better performance |
| 15.5.1 | Data prefetching |
| 15.5.2 | Memory advice |
| 15.5.3 | Memory range attributes |
| 15.5.4 | Asynchronously attach memory to a stream |
| 16 | Virtual memory management |

**Following table contains:** The table represents a comparison of preprocessor defines related to HIP (Heterogeneous-Compute Interface for Portability) and NVCC (NVIDIA CUDA Compiler) across different compiler environments. Each row corresponds to a specific preprocessor define, and the columns indicate the status or value of these defines when using different compilers: HIP-Clang, NVCC, and other compilers like GCC, ICC, or Clang.

- **Columns:**
  - "Define": Lists the specific preprocessor defines being compared.
  - "HIP-Clang": Indicates the status or value of each define when using the HIP-Clang compiler.
  - "NVCC": Indicates the status or value of each define when using the NVCC compiler.
  - "Other (GCC, ICC, Clang, etc.)": Indicates the status or value of each define when using other compilers such as GCC, ICC, or Clang.

- **Noteworthy Values:**
  - `__HIP_PLATFORM_AMD__` is defined in HIP-Clang and other compilers if targeting the AMD platform, but undefined in NVCC.
  - `__HIP_PLATFORM_NVIDIA__` is defined in NVCC and other compilers if targeting the NVIDIA platform, but undefined in HIP-Clang.
  - `__HIP_DEVICE_COMPILE__` is defined as 1 if compiling for the device in both HIP-Clang and NVCC, but is undefined in other compilers.
  - `__HIPCC__` is consistently defined in both HIP-Clang and NVCC, but undefined in other compilers.
  - `__HIP_ARCH_*` can be 0 or 1 depending on feature support in HIP-Clang and NVCC, but is always 0 in other compilers.
  - `__CUDACC__` is defined if the source code is compiled by NVCC, but undefined in HIP-Clang and other compilers.

This table provides a clear overview of how different compilers handle specific preprocessor defines related to HIP and NVCC, highlighting the differences in platform targeting and compilation contexts.

| Section | Title |
|---|---|
| 16.1 | Memory allocation |
| 16.1.1 | Allocate physical memory |
| 16.1.2 | Reserve virtual address range |
| 16.1.3 | Set memory access |
| 16.1.4 | Free virtual memory |
| 16.2 | Memory usage |
| 16.2.1 | Dynamically increase allocation size |
| 17 | Frequently asked questions |
| 17.1 | What APIs and features does HIP support? |
| 17.2 | What is not supported? |
| 17.2.1 | Runtime/Driver API features |
| 17.2.2 | Kernel language features |
| 17.3 | Is HIP a drop-in replacement for CUDA? |
| 17.4 | What specific version of CUDA does HIP support? |
| 17.5 | What libraries does HIP support? |
| 17.6 | How does HIP compare with OpenCL? |
| 17.7 | How does porting CUDA to HIP compare to porting CUDA to OpenCL? |
| 17.8 | What hardware does HIP support? |
| 17.9 | Do HIPIFY tools automatically convert all source code? |
| 17.10 | What is NVCC? |
| 17.11 | What is HIP-Clang? |
| 17.12 | Why use HIP rather than supporting CUDA directly? |
| 17.13 | Can I develop HIP code on an NVIDIA CUDA platform? |
| 17.14 | Can I develop HIP code on an AMD HIP-Clang platform? |
| 17.15 | How to use HIP-Clang to build HIP programs? |
| 17.16 | What is AMD clr? |
| 17.17 | What is hipother? |
| 17.18 | Can I get HIP open source repository for Windows? |
| 17.19 | Can a HIP binary run on both AMD and NVIDIA platforms? |
| 17.20 | On HIP-Clang, can I link HIP code with host code compiled with another compiler such as icc or clang? |
| 17.21 | Can HIP API support C style application? What is the difference between C and C++? |
| 17.22 | Can I install both CUDA SDK and HIP-Clang on the same machine? |
| 17.23 | HIP detected my platform (HIP-Clang vs NVCC) incorrectly - what should I do? |
| 17.24 | On CUDA, can I mix CUDA code with HIP code? |
| 17.25 | How do I trace HIP application flow? |
| 17.26 | What are the maximum limits of kernel launch parameters? |
| 17.27 | Are `__shfl_*_sync` functions supported on HIP platform? |
| 17.28 | How to create a guard for code that is specific to the host or the GPU? |
| 17.29 | Why `_OpenMP` is undefined when compiling with `-fopenmp`? |
| 17.30 | Does the HIP-Clang compiler support extern shared declarations? |
| 17.31 | I have multiple HIP enabled devices and I am getting an error code hipErrorSharedObjectInitFailed with the message 'Error: shared object initialization failed'? |
| 17.32 | How to use per-thread default stream in HIP? |
| 17.33 | How to use complex multiplication and division operations? |
| 17.34 | Can I develop applications with HIP APIs on Windows the same as on Linux? |
| 17.35 | Does HIP support LUID? |
| 17.36 | How can I know the version of HIP? |
| 18 | HIP Runtime API Reference |
| 18.1 | Related Pages |
| 18.3 | Namespaces |

**Following table contains:** The table represents a set of device capabilities and properties related to atomic operations and other features in a computing architecture, likely related to GPU programming with HIP (Heterogeneous-Compute Interface for Portability).

- **Rows**: Each row corresponds to a specific feature or set of features that can be queried or defined in the device code. These features are related to atomic operations, double-precision support, and warp-level operations.

- **Columns**:
  1. **Define (use only in device code)**: This column lists preprocessor macros that can be used in device code to check for the availability of certain features.
  2. **Device Property (run-time query)**: This column provides the runtime query names that can be used to check if the device supports the corresponding features.
  3. **Comment**: This column gives a brief description of what each feature or set of features does, such as supporting 32-bit or 64-bit atomic operations, double-precision floating point operations, or warp-level operations.

- **Noteworthy Values**:
  - The table distinguishes between 32-bit and 64-bit atomic operations, indicating different levels of precision and memory scope (global vs. shared).
  - The presence of features like `__HIP_ARCH_HAS_DOUBLES__` suggests support for double-precision floating-point operations.
  - Warp-level operations such as vote, ballot, shuffle, and funnel shift are highlighted, indicating advanced parallel processing capabilities.

Overall, the table provides a concise overview of the capabilities that can be queried or defined in device code for optimizing and utilizing specific hardware features.

| Section | Title |
|---|---|
| 18.3.1 | Namespace List |
| 18.3.2 | Namespace Members |
| 18.3.2.1 | Namespace Members |
| 18.3.2.2 | Namespace Members |
| 18.4 | Data Structures |
| 18.4.1 | Data Structures |
| 18.4.2 | Data Structure Index |
| 18.4.3 | Class Hierarchy |
| 18.4.4 | Data Fields |
| 18.4.4.1 | All |
| 18.4.4.1.1 | Data Fields |
| 18.4.4.1.2 | Data Fields |
| 18.4.4.1.3 | Data Fields |
| 18.4.4.1.4 | Data Fields |
| 18.4.4.1.5 | Data Fields |
| 18.4.4.1.6 | Data Fields |
| 18.4.4.1.7 | Data Fields |
| 18.4.4.1.8 | Data Fields |
| 18.4.4.1.9 | Data Fields |
| 18.4.4.1.10 | Data Fields |
| 18.4.4.1.11 | Data Fields |
| 18.4.4.1.12 | Data Fields |
| 18.4.4.1.13 | Data Fields |
| 18.4.4.1.14 | Data Fields |
| 18.4.4.1.15 | Data Fields |
| 18.4.4.1.16 | Data Fields |
| 18.4.4.1.17 | Data Fields |
| 18.4.4.1.18 | Data Fields |
| 18.4.4.1.19 | Data Fields |
| 18.4.4.1.20 | Data Fields |
| 18.4.4.1.21 | Data Fields |
| 18.4.4.1.22 | Data Fields |
| 18.4.4.1.23 | Data Fields |
| 18.4.4.1.24 | Data Fields |
| 18.4.4.1.25 | Data Fields |
| 18.4.4.2 | Data Fields - Functions |
| 18.4.4.3 | Variables |
| 18.4.4.3.1 | Data Fields - Variables |
| 18.4.4.3.2 | Data Fields - Variables |
| 18.4.4.3.3 | Data Fields - Variables |
| 18.4.4.3.4 | Data Fields - Variables |
| 18.4.4.3.5 | Data Fields - Variables |
| 18.4.4.3.6 | Data Fields - Variables |
| 18.4.4.3.7 | Data Fields - Variables |
| 18.4.4.3.8 | Data Fields - Variables |
| 18.4.4.3.9 | Data Fields - Variables |
| 18.4.4.3.10 | Data Fields - Variables |
| 18.4.4.3.11 | Data Fields - Variables |
| 18.4.4.3.12 | Data Fields - Variables |
| 18.4.4.3.13 | Data Fields - Variables |
| 18.4.4.3.14 | Data Fields - Variables |
| 18.4.4.3.15 | Data Fields - Variables |
| 18.4.4.3.16 | Data Fields - Variables |
| 18.4.4.3.17 | Data Fields - Variables |

**Following table contains:** The table represents a list of compiler options for GPU code generation, specifically related to AMD GPUs. Each row corresponds to a different compiler option that can be used when compiling code for AMD GPUs.

The columns in the table are:
- "Option": This column lists the command-line options that can be used with the compiler.
- "Description": This column provides a detailed explanation of what each option does.

Noteworthy values include:
- The option "--amdgpu-target=

| Section | Title |
|---|---|
| 18.4.4.3.18 | Data Fields - Variables |
| 18.4.4.3.19 | Data Fields - Variables |
| 18.4.4.3.20 | Data Fields - Variables |
| 18.4.4.3.21 | Data Fields - Variables |
| 18.4.4.3.22 | Data Fields - Variables |
| 18.4.4.3.23 | Data Fields - Variables |
| 18.4.4.3.24 | Data Fields - Variables |
| 18.4.4.3.25 | Data Fields - Variables |
| 18.4.4.4 | Data Fields - Related Symbols |
| 18.5 | Files |
| 18.5.1 | File List |
| 18.5.2 | Globals |
| 18.5.2.1 | All |
| 18.5.2.1.1 | Globals |
| 18.5.2.1.2 | Globals |
| 18.5.2.1.3 | Globals |
| 18.5.2.1.4 | Globals |
| 18.5.2.1.5 | Globals |
| 18.5.2.1.6 | Globals |
| 18.5.2.1.7 | Globals |
| 18.5.2.1.8 | Globals |
| 18.5.2.1.9 | Globals |
| 18.5.2.2 | Functions |
| 18.5.2.2.1 | Globals |
| 18.5.2.2.2 | Globals |
| 18.5.2.2.3 | Globals |
| 18.5.2.3 | Globals |
| 18.5.2.4 | Globals |
| 18.5.2.5 | Globals |
| 18.5.2.6 | Enumerator |
| 18.5.2.6.1 | Globals |
| 18.5.2.6.2 | Globals |
| 18.5.2.7 | Globals |
| 19 | C++ language extensions |
| 19.1 | Function-type qualifiers |
| 19.1.1 | `__device__` |
| 19.1.2 | `__global__` |
| 19.1.3 | `__host__` |
| 19.2 | Calling `__global__` functions |
| 19.3 | Kernel launch example |
| 19.4 | Variable type qualifiers |
| 19.4.1 | `__constant__` |
| 19.4.2 | `__shared__` |
| 19.4.3 | `__managed__` |
| 19.4.4 | `__restrict__` |
| 19.5 | Built-in variables |
| 19.5.1 | Coordinate built-ins |
| 19.5.2 | warpSize |
| 19.6 | Vector types |
| 19.6.1 | Short vector types |
| 19.6.2 | dim3 |
| 19.7 | Memory fence instructions |
| 19.8 | Synchronization functions |

**Following table contains:** The table appears to represent a comparison of different formats used in GPU programming or compilation, specifically focusing on how code objects are handled in different environments or compilers. Each row represents a specific format or method of handling code objects.

- **Columns:**
  - **Format:** This column lists the type of code object or binary format being discussed.
  - **APIs:** This column describes the API functions associated with loading or handling the code objects in the specified format.
  - **NVCC:** This column indicates the file types or formats used when compiling with NVCC (NVIDIA's CUDA Compiler), such as `.cubin` or PTX text, and `.fatbin`.
  - **HIP-CLANG:** This column specifies the file types or formats used when compiling with HIP-CLANG, such as `.hsaco` and `.hip_fatbin`.

- **Noteworthy Values:**
  - The table highlights that different compilers or environments (NVCC vs. HIP-CLANG) use different file formats for code objects, with NVCC using `.cubin` or PTX text and `.fatbin`, while HIP-CLANG uses `.hsaco` and `.hip_fatbin`.
  - The APIs column lists specific functions (`hipModuleLoad`, `hipModuleLoadData`, `hipModuleLoadFatBin`) that are relevant for loading these code objects, indicating a focus on HIP (Heterogeneous-Compute Interface for Portability) API functions.

| Section | Title |
|---|---|
| 19.9 | Math functions |
| 19.10 | Texture functions |
| 19.11 | Surface functions |
| 19.12 | Timer functions |
| 19.13 | Atomic functions |
| 19.13.1 | Unsafe floating-point atomic RMW operations |
| 19.14 | Warp cross-lane functions |
| 19.14.1 | Warp vote and ballot functions |
| 19.14.2 | Warp match functions |
| 19.14.3 | Warp shuffle functions |
| 19.15 | Cooperative groups functions |
| 19.16 | Warp matrix functions |
| 19.17 | Independent thread scheduling |
| 19.18 | Profiler Counter Function |
| 19.19 | Assert |
| 19.20 | printf |
| 19.21 | Device-Side Dynamic Global Memory Allocation |
| 19.22 | `__launch_bounds__` |
| 19.22.1 | Compiler Impact |
| 19.22.2 | CU and EU Definitions |
| 19.22.3 | Porting from CUDA `__launch_bounds` |
| 19.22.4 | maxregcount |
| 19.23 | Asynchronous Functions |
| 19.23.1 | Memory stream |
| 19.23.2 | Peer to peer |
| 19.23.3 | Memory management |
| 19.23.4 | External Resource Interoperability |
| 19.24 | Register Keyword |
| 19.25 | Pragma Unroll |
| 19.26 | In-Line Assembly |
| 19.27 | Kernel Compilation |
| 19.28 | gfx-arch-specific-kernel |
| 20 | C++ language |
| 20.1 | Modern C++ support |
| 20.1.1 | C++11 support |
| 20.1.2 | C++14 support |
| 20.1.3 | C++17 support |
| 20.1.4 | C++20 support |
| 20.2 | Extensions and restrictions |
| 20.2.1 | Global functions |
| 20.2.2 | Device space memory specifiers |
| 20.2.3 | Exception handling |
| 20.2.4 | Kernel parameters |
| 20.2.5 | Classes |
| 20.2.6 | Polymorphic function wrappers |
| 20.2.7 | Extended lambdas |
| 20.2.8 | Inline namespaces |
| 21 | HIP math API |
| 21.1 | Single precision mathematical functions |
| 21.2 | Double precision mathematical functions |
| 21.3 | Integer intrinsics |

**Following table contains:** The table represents a mapping between different types used in HIP (Heterogeneous-Compute Interface for Portability), CUDA Driver API, and CUDA Runtime API. Each row corresponds to a specific type in HIP and its equivalent in the CUDA Driver and CUDA Runtime APIs.

- **Columns:**
  - **HIP Type:** Lists the types used in the HIP API.
  - **CU Driver Type:** Lists the corresponding types in the CUDA Driver API.
  - **CUDA Runtime Type:** Lists the corresponding types in the CUDA Runtime API, if applicable.

- **Noteworthy Values:**
  - Some HIP types have corresponding types in both the CUDA Driver and CUDA Runtime APIs, such as `hipStream_t` which maps to `CUstream` in the CUDA Driver API and `cudaStream_t` in the CUDA Runtime API.
  - Other HIP types, like `hipModule_t`, have a corresponding type only in the CUDA Driver API (`CUmodule`) and no equivalent listed in the CUDA Runtime API.
  - The absence of a CUDA Runtime Type for some HIP types suggests that not all HIP types have direct equivalents in the CUDA Runtime API.

| Section | Title |
|---|---|
| 21.4 | Floating-point Intrinsics |
| 22 | Table comparing syntax for different compute APIs |
| 22.1 | Notes |
| 23 | HIP Cooperative groups API |
| 23.1 | Cooperative kernel launches |
| 23.2 | Cooperative groups classes |
| 23.3 | Cooperative groups construct functions |
| 23.4 | Cooperative groups exposed API functions |
| 24 | HSA runtime API for ROCm |
| 25 | HIP managed memory allocation API |
| 26 | HIP virtual memory management API |
| 27 | HIP deprecated runtime API functions |
| 27.1 | Context management |
| 27.2 | Memory management |
| 27.3 | Profiler control |
| 27.4 | Texture management |
| 28 | SAXPY - Hello, HIP |
| 28.1 | Prerequisites |
| 28.2 | Heterogeneous programming |
| 28.3 | Your first lines of HIP code |
| 28.4 | Compiling on the command line |
| 28.4.1 | Setting up the command line |
| 28.4.2 | Invoking the compiler manually |
| 29 | Reduction |
| 29.1 | The algorithm |
| 29.2 | Reduction on GPUs |
| 29.2.1 | Naive shared reduction |
| 29.2.2 | Reducing thread divergence |
| 29.2.3 | Resolving bank conflicts |
| 29.2.4 | Utilize upper half of the block |
| 29.2.5 | Unroll all loops |
| 29.2.6 | Communicate using warp-collective functions |
| 29.2.7 | Prefer warp communication over shared |
| 29.2.8 | Amortize bookkeeping variable overhead |
| 29.2.8.1 | Reading ItemsPerThread |
| 29.2.8.2 | Processing ItemsPerThread |
| 29.2.9 | Two-pass reduction |
| 29.2.10 | Global data share |
| 29.3 | Conclusion |
| 30 | Cooperative groups |
| 30.1 | Prerequisites |
| 30.2 | Simple HIP Code |
| 30.3 | Tiled partition |
| 30.3.1 | Device-side code |
| 30.3.1.1 | 1. Initialization of the reduction function variables |
| 30.3.1.2 | 2. The reduction of thread block |

**Following table contains:** The table represents a list of environment variables related to AMD's HIP (Heterogeneous-compute Interface for Portability) and their configurations. Each row corresponds to a specific environment variable, detailing its default value and usage instructions.

- **Columns:**
  - **Environment variable:** The name of the environment variable.
  - **Default value:** The default setting or value assigned to the variable.
  - **Usage:** A description of how the variable can be used, including possible values and their meanings.

- **Noteworthy Values:**
  - The `AMD_LOG_LEVEL` and `AMD_LOG_MASK` variables provide detailed logging options, with various levels and masks to control the granularity of the logs.
  - `HIP_LAUNCH_BLOCKING` and `AMD_SERIALIZE_KERNEL` offer options for controlling kernel execution serialization, which can impact performance and debugging.
  - `HIP_VISIBLE_DEVICES` allows for the selection of specific devices to be visible to HIP, which is useful in multi-GPU systems.
  - `GPU_DUMP_CODE_OBJECT` and `HIP_HOST_COHERENT` control debugging and memory coherence settings, respectively.

Overall, the table provides a comprehensive guide for configuring and debugging HIP applications by adjusting environment variables.

| Section | Title |
|---|---|
| 30.3.1.3 | 3. The reduction of custom partition |
| 30.3.2 | Host-side code |
| 30.3.2.1 | 1. Confirm the cooperative group support on AMD GPUs |
| 30.3.2.2 | 2. Initialize the cooperative group configuration |
| 30.3.2.3 | 4. Launch the kernel |
| 30.4 | Conclusion |
| 31 | License |
| | Index |
-The Heterogeneous-computing Interface for Portability (HIP) API is a C++ runtime API and kernel language that lets developers create portable applications for AMD and NVIDIA GPUs from single source code.
-For HIP supported AMD GPUs on multiple operating systems, see:
-The CUDA enabled NVIDIA GPUs are supported by HIP. For more information, see GPU Compute Capability.
-On the AMD ROCm platform, HIP provides header files and runtime library built on top of HIP-Clang compiler in the repository Common Language Runtimes (CLR) , which contains source codes for AMD's compute languages runtimes as follows,
-On non-AMD platforms, like NVIDIA, HIP provides header files required to support non-AMD specific back-end implementation in the repository 'hipother', which translates from the HIP runtime APIs to CUDA runtime APIs.
-Known issues are listed on the HIP GitHub repository.
-To contribute features or functions to the HIP project, refer to Contributing to HIP. To contribute to the documentation, refer to Contributing to ROCm docs page.
-You can find licensing information on the Licensing page.
-HIP can be installed on AMD (ROCm with HIP-Clang) and NVIDIA (CUDA with NVCC) platforms.
-Note: The version definition for the HIP runtime is different from CUDA. On an AMD platform, the hipRuntimeGetVersion function returns the HIP runtime version; on an NVIDIA platform, this function returns the CUDA runtime version.
-Refer to the Prerequisites section in the ROCm install guides:
-Check the system requirements in the NVIDIA CUDA Installation Guide.
-HIP is automatically installed during the ROCm installation. If you haven't yet installed ROCm, you can find installation instructions here:
-By default, HIP is installed into /opt/rocm/hip .
-Note: There is no autodetection for the HIP installation. If you choose to install it somewhere other than the default location, you must set the HIP_PATH environment variable as explained in Build HIP from source.
-sudo apt-get install ubuntu-drivers-common && sudo ubuntu-drivers autoinstall
-sudo reboot
-Alternatively, you can download the latest CUDA Toolkit.
-**Following code does:** This code snippet defines an enumeration `hipMemoryType` in C/C++ for the AMD platform, specifically within the HIP (Heterogeneous-Compute Interface for Portability) runtime API. The enumeration categorizes different types of memory based on their physical location and management. The types include: - -- `hipMemoryTypeHost`: Memory located on the host (CPU). -- `hipMemoryTypeDevice`: Memory located on a specific device (GPU). -- `hipMemoryTypeArray`: Array memory located on a specific device. -- `hipMemoryTypeUnified`: Currently not used. -- `hipMemoryTypeManaged`: Managed memory that is automatically handled by the unified memory system. - -This enumeration helps in identifying and managing different memory types in heterogeneous computing environments.
-apt-get install hip-runtime-nvidia hip-dev
-You can optionally add /opt/rocm/bin to your path, which can make it easier to use the tools.
-Run hipconfig in your installation path.
-/opt/rocm/bin/hipconfig --full
-HIP code can be developed either on AMD ROCm platform using HIP-Clang compiler, or a CUDA platform with nvcc installed. Before building and running HIP, make sure drivers and prebuilt packages are installed properly on the platform.
-You also need to install Python 3, which includes the CppHeaderParser package. Install Python 3 using the following command:
-apt-get install python3
-Check and install CppHeaderParser package using the command:
-pip3 install CppHeaderParser
-Set the repository branch using the variable: ROCM_BRANCH . For example, for ROCm 6.1, use:
-**Following code does:** This code snippet is part of a build process for a project that involves compiling CUDA code using the HIP (Heterogeneous-Compute Interface for Portability) compiler. The `export HIPCC_VERBOSE=1` command sets an environment variable to enable verbose output from the HIP compiler, which provides detailed information about the compilation process. The `make` command is then executed, which typically runs a Makefile to build the project. The output shown (`hipcc-cmd: ...`) indicates that the HIP compiler (`hipcc`) is being invoked to compile a CUDA source file (`backprop_cuda.cu`) with specific options, such as targeting the native architecture for offloading. This process is likely part of porting or optimizing CUDA code for different hardware using HIP.
-| export
-Note: Starting in ROCM 5.6, CLR is a new repository that includes the former ROCclr, HIPAMD and OpenCl repositories. OpenCL provides headers that ROCclr runtime depends on.
-Note: Starting in ROCM 6.1, a new repository hipother is added to ROCm, which is branched out from HIP. hipother provides files required to support the HIP back-end implementation on some non-AMD platforms, like NVIDIA.
-**Following code does:** This code snippet is the beginning of a C++ program that is set up to use the HIP (Heterogeneous-computing Interface for Portability) API, which is designed for writing portable code that can run on both AMD and NVIDIA GPUs. The `#include` directives are used to include the necessary headers for HIP runtime functions (`hip_runtime.h` and `hip_runtime_api.h`) and standard C++ functionalities (`iostream` for input/output stream operations, `fstream` for file stream operations, and `vector` for using the vector container). This setup suggests that the program will likely involve GPU computations and possibly handle input/output operations and data storage using vectors.
-git clone -b "$ROCM_BRANCH" https://github.com/ROCm/clr.git
-git clone -b "$ROCM_BRANCH" https://github.com/ROCm/hip.git
-CLR (Common Language Runtime) repository includes ROCclr, HIPAMD and OpenCL.
-ROCclr (Radeon Open Compute Common Language Runtime) is a virtual device interface which is defined on the AMD platform. HIP runtime uses ROCclr to interact with different backends.
-HIPAMD provides implementation specifically for HIP on the AMD platform.
-OpenCL provides headers that ROCclr runtime currently depends on. hipother provides headers and implementation specifically for non-AMD HIP platforms, like NVIDIA.
-**Following code does:** This code snippet is a C++ program that includes headers for HIP (Heterogeneous-Compute Interface for Portability) runtime and API, which are used for GPU programming, particularly with AMD hardware. Additionally, it includes standard C++ headers for input/output operations (`iostream`), file handling (`fstream`), and using the vector container (`vector`). The purpose of this setup is likely to perform GPU-accelerated computations while also handling input/output operations and managing data using vectors in a C++ application.
-**Following code does:** This code snippet is a C++ program that uses the HIP (Heterogeneous-Compute Interface for Portability) API to perform a vector copy operation on a GPU. The program is designed to work on both AMD and NVIDIA platforms, as indicated by the conditional compilation directives. Here's a high-level summary of what the code does: - -1. **Define Constants**: It defines constants for the length of the vectors (`LEN`) and the size in bytes (`SIZE`). - -2. **Platform-Specific File Names**: Depending on whether the code is compiled for an AMD or NVIDIA platform, it sets the appropriate file name for the compiled GPU kernel. - -3. **Initialize Vectors**: It allocates and initializes two float arrays, `A` and `B`, each of length `LEN`. Array `A` is initialized with sequential float values, while `B` is initialized with zeros. - -4. **GPU Initialization (NVIDIA only)**: If compiled for an NVIDIA platform, it initializes the HIP runtime, gets a device, and creates a context. - -5. **Memory Allocation on GPU**: It allocates memory on the GPU for both arrays `A` and `B`. - -6. **Data Transfer to GPU**: It copies the contents of arrays `A` and `B` from host memory to the allocated GPU memory. - -7. **Kernel Loading and Execution Setup**: It loads a GPU module from a file and retrieves a function (kernel) named "hello_world" from the module. It prepares the arguments for the kernel launch. - -The code is incomplete, as it ends abruptly, but it appears to be setting up for launching a GPU kernel that would likely perform some operation on the vectors `A` and `B`.
- cd "$CLR_DIR"
- mkdir -p build; cd build
- cmake -DHIP_COMMON_DIR=$HIP_DIR -DHIP_PLATFORM=amd -DCMAKE_PREFIX_PATH="/opt/rocm/" -DCMAKE_INSTALL_PREFIX=$PWD/install -DHIP_CATCH_TEST=0 -DCLR_BUILD_HIP=ON -DCLR_BUILD_OCL=OFF ..
- make -j$(nproc)
- sudo make install
-Note: If you don't specify CMAKE_INSTALL_PREFIX , the HIP runtime is installed at <ROCM_PATH>/hip .
-By default, the release version of HIP is built. If you need a debug version, add the option CMAKE_BUILD_TYPE=Debug to the command line.
-Default paths and environment variables:
-• HSA is in <ROCM_PATH>/hsa . This can be overridden by setting the HSA_PATH environment variable.
-After you run the make install command, make sure HIP_PATH points to $PWD/install/hip .
-When you add or change a HIP API, you may need to generate a new hip_prof_str.h header. This header is used by ROCm tools, such as rocprofiler and roctracer , to track HIP APIs.
-To generate the header after your change, use the hip_prof_gen.py tool located in hipamd/src .
-Usage:
-**Following code does:** The code snippet appears to be part of a larger program that utilizes HIP (Heterogeneous-Compute Interface for Portability) to perform parallel computation on a GPU. Here's a high-level summary of what the code does: - -1. **Kernel Launch**: It launches a GPU kernel using `hipModuleLaunchKernel`. The kernel is specified by `Function`, and it is configured to execute with a grid size of 1x1x1 and a block size of `LEN`x1x1. The kernel is passed a configuration parameter `config`. - -2. **Data Transfer**: After the kernel execution, the code copies data from the device (GPU) to the host (CPU) using `hipMemcpyDtoH`. The data is copied from `Bd` (device memory) to `B` (host memory), with the size specified by `SIZE`. - -3. **Output**: It iterates over the length `LEN` and prints the values of arrays `A` and `B` to the standard output, displaying them in a formatted manner. - -4. **Context Detachment**: If the code is compiled for an NVIDIA platform (`__HIP_PLATFORM_NVIDIA__`), it detaches the HIP context using `hipCtxDetach`. - -5. **Return**: The function returns 0, indicating successful execution. - -Overall, this code is part of a GPU computing application that sets up and executes a kernel on a GPU, retrieves the results, and prints them.
-**Following code does:** This code snippet is part of a CUDA or HIP (Heterogeneous-Compute Interface for Portability) program, which is used for parallel computing on GPUs. It declares an external 2D texture reference named `tex` that holds floating-point data. The snippet also includes the beginning of a kernel function definition, `tex2dKernel`, which is intended to be executed on the GPU. This kernel function takes a parameter `outputData`, which is likely an array where the results of the texture operations will be stored. The `hipLaunchParm lp` is a parameter used for launching the kernel with specific execution configurations. The code is incomplete, so the specific operations performed by the kernel are not shown.
-Example usage:
-**Following code does:** This code snippet is part of a host program that interacts with a GPU using the HIP (Heterogeneous-Compute Interface for Portability) API, which is used for writing portable applications across different GPU architectures. The code sets up a 2D texture reference for use in GPU computations. Specifically, it retrieves a texture reference from a module, sets the address mode for both dimensions to wrap (which means texture coordinates will wrap around), and sets the filter mode to point sampling (which means the nearest texel value will be used). This setup is typically used to configure how textures are accessed and sampled in GPU kernels.
-**Following code does:** This code snippet is configuring a texture reference in a HIP (Heterogeneous-Compute Interface for Portability) application, which is used for GPU programming. Specifically, it sets the flags for the texture reference to zero, specifies the format of the texture data as a single-channel floating-point format, and associates the texture reference with a specified array while overriding the format. This setup is typically used to prepare texture data for use in GPU computations, such as in graphics or parallel processing tasks.
-1. Get the HIP source code.
-git clone -b "$ROCM_BRANCH" https://github.com/ROCm/clr.git
-git clone -b "$ROCM_BRANCH" https://github.com/ROCm/hip.git
-git clone -b "$ROCM_BRANCH" https://github.com/ROCm/hipother.git
- 2. Set the environment variables.
-**Following code does:** This code snippet is a call to the `hiprtcCreateProgram` function, which is part of the HIP Runtime Compilation (HIPRTC) API. The function is used to create a program object for a GPU kernel written in CUDA C/C++. The parameters provided include the kernel source code as a string, the name of the file for the kernel, the number of header files, and arrays containing the header source code and their corresponding names. This setup is typically used to compile and execute GPU kernels dynamically at runtime.
-export CLR_DIR="$(readlink -f clr)"
-export HIP_DIR="$(readlink -f hip)"
-export HIP_OTHER="$(readlink -f hipother)"
-**Following code does:** This code snippet appears to be a call to the `hiprtcCompileProgram` function, which is part of the HIP (Heterogeneous-Compute Interface for Portability) runtime compilation library. The function is used to compile a HIPRTC program, which is typically written in a CUDA-like language, into a binary that can be executed on a GPU. In this specific call, the function is provided with a `hiprtcProgram` object (`prog`) and is instructed to compile it with zero additional options (`0`), although there is a placeholder for options (`options`) which suggests that options could be passed if needed. The comment indicates that supported Clang options can be used, but none are specified here.
-3. Build HIP.
-cd "$CLR_DIR"
-mkdir -p build; cd build
-cmake -DHIP_COMMON_DIR=$HIP_DIR -DHIP_PLATFORM=nvidia -DCMAKE_INSTALL_PREFIX=$PWD/install -DHIP_CATCH_TEST=0 -DCLR_BUILD_HIP=ON -DCLR_BUILD_OCL=OFF -DHIPNV_DIR=$HIP_OTHER/hipnv ..
-make -j$(nproc)
-sudo make install
-**Following code does:** This code snippet retrieves the compiled binary code of a GPU kernel using the HIP runtime compilation (hipRTC) API. It first determines the size of the compiled code with `hiprtcGetCodeSize`, storing it in `codeSize`. Then, it allocates a vector `kernel_binary` of the appropriate size to hold the binary data. Finally, it populates this vector with the actual compiled kernel code using `hiprtcGetCode`. This process is typically part of a workflow where GPU kernels are dynamically compiled and executed.
-git clone -b "$ROCM_BRANCH" https://github.com/ROCm/hip-tests.git
-**Following code does:** This code snippet is using the HIP (Heterogeneous-Compute Interface for Portability) API to load and prepare a GPU kernel for execution. Specifically, it performs the following high-level tasks: - -1. Loads a compiled GPU kernel binary into a HIP module using `hipModuleLoadData`. -2. Retrieves a specific function, named "vector_add", from the loaded module using `hipModuleGetFunction`. - -This setup is typically used in GPU programming to execute parallel computations on a GPU, where "vector_add" likely refers to a kernel function that performs vector addition.
- export HIPTESTS_DIR="$(readlink -f hip-tests)"
- cd "$HIPTESTS_DIR"
- mkdir -p build; cd build
- cmake ../catch -DHIP_PLATFORM=amd -DHIP_PATH=$CLR_DIR/build/install # or any path where HIP is installed; for example: /opt/rocm
- make build_tests
- ctest # run tests
-AMD
-
- * Build HIP catch tests.
-
- HIP catch tests are separate from the HIP project and use Catch2.
-
- - Get HIP tests source code.
- cd "$HIPTESTS_DIR"
- hipcc $HIPTESTS_DIR/catch/unit/memory/hipPointerGetAttributes.cc \
- -I ./catch/include ./catch/hipTestMain/standalone_main.cc \
- -I ./catch/external/Catch2 -o hipPointerGetAttributes
- ./hipPointerGetAttributes
- ...
-
- All tests passed
-The commands to build HIP tests on an NVIDIA platform are the same as on an AMD platform. However, you must first set -DHIP_PLATFORM=nvidia .
-After installing and building HIP, you can compile and run your application. A simple example is the square sample.
-The HIP programming model makes it easy to map data-parallel C/C++ algorithms to massively parallel, wide single instruction, multiple data (SIMD) architectures, such as GPUs.
-While the model may be expressed in most imperative languages (for example, Python via PyHIP), this document focuses on the original C/C++ API of HIP.
-A basic understanding of the underlying device architecture helps you make efficient use of HIP and general purpose graphics processing unit (GPGPU) programming in general.
-GPUs in general are made up of basic building blocks called compute units (CUs), that execute the threads of a kernel. These CUs provide the necessary resources for the threads: the Arithmetic Logical Units (ALUs), register files, caches and shared memory for efficient communication between the threads.
-This design allows for efficient execution of kernels while also scaling from small GPUs embedded in APUs with few CUs up to GPUs designed for data centers with hundreds of CUs. The figures "Block Diagram of an RDNA3 Compute Unit" and "Block Diagram of a CDNA3 Compute Unit" show examples of such compute units.
-For architecture details, check Hardware implementation .
-**Image description:** The image is a block diagram representing the architecture of a graphics processing unit (GPU). It is divided into two main sections, each containing multiple components that are responsible for different computational tasks. - -On the left side, there are two identical blocks labeled "Scheduler," each containing: -- Vector GPR (General Purpose Registers) for handling operations like Float/INT/Matrix SIMD32, Float/Matrix SIMD32, and Transcendental SIMD8. -- AI MATRIX Accelerator for artificial intelligence matrix operations. -- DPFP (Double Precision Floating Point) unit. -- Scalar GPR and Scalar ALU (Arithmetic Logic Unit) for scalar operations. - -In the center, there are two caches: -- Scalar Cache for storing scalar data. -- Shader Instruction Cache for storing shader instructions. - -On the right side, there are two identical blocks labeled "Scheduler," each containing: -- Scalar GPR and Scalar ALU. -- Vector GPR for similar operations as the left side. -- AI MATRIX Accelerator and DPFP unit. - -Adjacent to these blocks is a section labeled "Shared Memory," which is used for data exchange between different processing units. - -Finally, on the far right, there are components for handling graphics-specific tasks: -- Ray Accelerator for ray tracing operations. -- Texture Filters for texture processing. -- LD/ST/Tex Addr for load/store and texture addressing. -- L0 cache for fast data access. - -The diagram uses red to highlight the functional units and white for the cache and shared memory areas, indicating a focus on computational and memory management capabilities within the GPU architecture.
-
**Image description:** Block diagram of a CDNA3 compute unit. The labels recoverable from the figure include "Local Data Share", "Schedule", and "Matrix Core Unit".
-
The HIP programming model assumes two execution contexts. One is referred to as the host, while compute kernels execute on the other, the device. These contexts have different capabilities, and therefore slightly different rules apply. The host execution is defined by the C++ abstract machine, while device execution follows the SIMT model of HIP. These execution contexts are signified in code by the __host__ and __device__ decorators. There are a few key differences between the two:
-Note: HIP performs implicit synchronization on some occasions, unlike APIs such as OpenCL or SYCL, in which the responsibility for synchronization mostly rests with the user.
-The SIMT programming model behind the HIP device-side execution is a middle-ground between SMT (Simultaneous Multi-Threading) programming known from multicore CPUs, and SIMD (Single Instruction, Multiple Data) programming mostly known from exploiting relevant instruction sets on CPUs (for example SSE/AVX/Neon).
-A HIP device compiler maps SIMT code written in HIP C++ to an inherently SIMD architecture (like GPUs). This is done by scalarizing the entire kernel and issuing the scalar instructions of multiple kernel instances (called threads) to each of the SIMD engine lanes, rather than exploiting data parallelism within a single instance of a kernel and spreading identical instructions over the available SIMD engines.
-Consider the following kernel:
-__global__ void k(float4* a, const float4* b)
-{
- int tid = threadIdx.x;
- int bid = blockIdx.x;
- int dim = blockDim.x;
-
- a[tid] += (tid + bid - dim) * b[tid];
-}
-The incoming four-vector of floating-point values b is multiplied by a scalar and then added element-wise to the four-vector of floating-point values a . On modern SIMD-capable architectures, the four-vector operations are expected to compile to a single SIMD instruction. However, GPU execution of this kernel will typically break down the vector elements into 4 separate threads for parallel execution, as seen in the following figure:
-Fig. 3: Instruction flow of the sample SIMT program.
-In HIP, lanes of the SIMD architecture are fed by mapping threads of a SIMT execution, one thread down each lane of an SIMD engine. Execution parallelism usually isn't exploited from the width of the built-in vector types, but across multiple threads via the thread ID constants threadIdx.x , blockIdx.x , etc.
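To make the thread-per-lane mapping concrete, the following is a minimal host-side sketch in plain C++ that emulates how the kernel `k` above is instantiated once per thread. It is not HIP code: `ToyFloat4`, `kernel_body`, and `emulate_launch` are hypothetical names introduced here for illustration, and the loops run sequentially, whereas on a GPU each inner iteration would occupy its own SIMD lane.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for HIP's float4 (assumption: not the real HIP type).
struct ToyFloat4 {
    float x, y, z, w;
};

// The kernel body as executed by ONE thread. The values HIP exposes as
// threadIdx.x, blockIdx.x and blockDim.x are passed in explicitly.
inline void kernel_body(std::vector<ToyFloat4>& a, const std::vector<ToyFloat4>& b,
                        int tid, int bid, int dim) {
    const float s = static_cast<float>(tid + bid - dim);
    const std::size_t i = static_cast<std::size_t>(tid);
    // Each thread handles all four components of its own element.
    a[i].x += s * b[i].x;
    a[i].y += s * b[i].y;
    a[i].z += s * b[i].z;
    a[i].w += s * b[i].w;
}

// Sequential emulation of launching `grid` blocks of `dim` threads each.
// Parallelism comes from the many (bid, tid) instances, not from the
// width of the four-vector type.
inline void emulate_launch(std::vector<ToyFloat4>& a, const std::vector<ToyFloat4>& b,
                           int grid, int dim) {
    for (int bid = 0; bid < grid; ++bid) {
        for (int tid = 0; tid < dim; ++tid) {
            kernel_body(a, b, tid, bid, dim);
        }
    }
}
```

Calling `emulate_launch(a, b, 1, 4)` corresponds to a launch of one block with four threads; each of the four elements is updated by a different thread ID.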
-The SIMT nature of HIP is captured by the ability to execute user-provided device programs, expressed as single-source C/C++ functions or sources compiled online/offline to binaries, in bulk.
-All threads of a kernel are uniquely identified by a set of integral values, called thread IDs. The set of integers identifying a thread relate to the hierarchy in which the threads execute.
-The thread hierarchy inherent to how AMD GPUs operate is depicted in the following figure.
-Fig. 4: Hierarchy of thread groups.
-The innermost grouping of threads is called a warp, or a wavefront in ISA terms. A warp is the most tightly coupled group of threads, both physically and logically. Threads inside a warp are also called lanes, and the integral value identifying them is the lane ID.
-Tip: Lane IDs aren't queried like other thread IDs, but are user-calculated. As a consequence, they are only as multidimensional as the user interprets the calculated values to be.
-The size of a warp is architecture dependent and always fixed. For AMD GPUs the wavefront is typically 64 threads, though sometimes 32 threads. Warps are distinguished by the set of communication primitives at their disposal, as discussed in Warp cross-lane functions .
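Because lane IDs are user-calculated, a common approach is to derive them from the linearized thread index within the block. The sketch below is plain C++ arithmetic rather than device code; `WarpPosition` and `warp_position` are hypothetical names, and the warp size is passed as a parameter (in HIP device code the built-in `warpSize` would typically supply it).

```cpp
// Position of a thread inside its block, expressed in warp terms.
// Hypothetical helper type for illustration only.
struct WarpPosition {
    unsigned warp;  // index of the warp within the block
    unsigned lane;  // lane ID within that warp
};

// Splits a linear thread index into (warp index, lane ID).
// Assumes warp_size is non-zero, for example 64 or 32.
constexpr WarpPosition warp_position(unsigned linear_tid, unsigned warp_size) {
    return WarpPosition{linear_tid / warp_size, linear_tid % warp_size};
}
```

For instance, linear thread 70 lands in warp 1, lane 6 with a 64-wide wavefront, but in warp 2, lane 6 with a 32-wide one, which is why code should not hard-code the warp size.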
-The middle grouping is called a block or thread block. The defining feature of a block is that all threads in a block will share an instance of memory which they may use to share data or synchronize with one another.
-The size of a block is user-configurable but is limited by the queryable capabilities of the executing hardware. The unique ID of the thread within a block is 3-dimensional as provided by the API. When linearizing thread IDs within a block, assume the 'fast index' being dimension x , followed by the y and z dimensions.
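The "fast index" rule above can be written out as a short formula. The following plain C++ sketch uses a hypothetical `ToyDim3` struct as a stand-in for the three-component index and extent types; it only illustrates the arithmetic and is not taken from the HIP headers.

```cpp
// Hypothetical stand-in for a three-component index or extent.
struct ToyDim3 {
    unsigned x, y, z;
};

// Linearizes a 3-D thread index within a block, with x varying fastest,
// then y, then z.
constexpr unsigned linear_thread_id(ToyDim3 idx, ToyDim3 dim) {
    return idx.x + idx.y * dim.x + idx.z * dim.x * dim.y;
}
```

With a block of extent (4, 5, 6), thread (1, 2, 3) maps to 1 + 2*4 + 3*4*5 = 69, and stepping y by one advances the linear index by the x extent.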
-The outermost grouping is called a grid. A grid manifests as a single dispatch of kernels for execution. The unique ID of each block within a grid is 3-dimensional, as provided by the API and is queryable by every thread within the block.
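As a one-dimensional illustration of how block and grid sizes interact, the sketch below computes how many blocks a grid needs to cover `n` elements and the global index of a thread. The function names are hypothetical, and the code is host-side arithmetic only; in a kernel the same global index is usually formed from `blockIdx.x`, `blockDim.x`, and `threadIdx.x`.

```cpp
// Number of blocks required so that blocks * block_size >= n (rounds up).
// Assumes block_size is non-zero.
constexpr unsigned blocks_needed(unsigned n, unsigned block_size) {
    return (n + block_size - 1) / block_size;
}

// Global 1-D index of a thread, given its block ID, the block size,
// and its thread ID within the block.
constexpr unsigned global_index(unsigned block_id, unsigned block_size, unsigned thread_id) {
    return block_id * block_size + thread_id;
}
```

Because the block count rounds up, the last block may contain threads whose global index is at or beyond `n`; kernels normally guard against this with a bounds check.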
-The Cooperative groups API introduces new APIs to launch, group, subdivide, synchronize and identify threads, as well as some predefined group-collective algorithms, but most importantly a matching threading model to think in terms of. It relaxes some restrictions of the Inherent thread model imposed by the strict 1:1 mapping of architectural details to the programming model. Cooperative groups let you define your own set of thread groups which may fit your use cases better than the defaults defined by the hardware.
-Note: The implicit groups defined by kernel launch parameters are still available when working with cooperative groups.
-For further information, see Cooperative groups.
-The hierarchy of threads introduced by the Inherent thread model is induced by the memory subsystem of GPUs. The following figure summarizes the memory namespaces and how they relate to the various levels of the threading model.
-Fig. 5: Memory hierarchy.
-Local (per-thread) memory: Read-write storage visible only to the thread that defines the given variable. The size of a block for a given kernel, and thereby the number of concurrent warps, is limited by local memory usage. This relates to an important aspect: occupancy. This is the default memory namespace.
-Shared memory: Read-write storage visible to all the threads in a given block.
-Global memory: Read-write storage visible to all threads in a given grid. There are specialized versions of global memory with different usage semantics, which are typically backed by the same hardware that stores global memory.
-Constant memory: Read-only storage visible to all threads in a given grid. It is a limited segment of global memory with queryable size.
-Texture memory: Read-only storage visible to all threads in a given grid and accessible through additional APIs.
-Surface memory: A read-write version of texture memory.
-HIP programs consist of two distinct scopes:
-Note: HIP does not present two separate APIs like NVIDIA CUDA does. HIP only extends the HIP runtime API with new APIs for hipModule and hipCtx .
-The parts of the host-side API that deal with device management and device queries are synchronous. All asynchronous APIs, such as kernel execution, data movement and potentially data allocation/freeing, operate in the context of device streams.
-Streams are FIFO buffers of commands to execute relating to a given device. Commands which enqueue tasks on a stream all return promptly and the command is executed asynchronously. All side effects of a command on a stream are visible to all subsequent commands on the same stream. Multiple streams may point to the same device and those streams may be fed from multiple concurrent host-side threads. Execution on multiple streams may be concurrent but isn't required to be.
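The FIFO behaviour of a stream can be modelled on the host with a simple command queue. The sketch below is a toy model in plain C++, not the HIP runtime: `ToyStream` and its methods are hypothetical names, and the commands run only when `synchronize()` is called, whereas a real device may begin executing as soon as work is enqueued. In HIP the corresponding operations are calls such as `hipMemcpyAsync` and `hipStreamSynchronize`.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>

// Toy model of a stream: commands are queued and later executed strictly
// in the order in which they were enqueued.
class ToyStream {
public:
    // Returns promptly; the command is only recorded, not executed.
    void enqueue(std::function<void()> command) {
        pending_.push(std::move(command));
    }

    // Drains the queue in FIFO order, so each command observes the
    // side effects of every command enqueued before it.
    void synchronize() {
        while (!pending_.empty()) {
            std::function<void()> command = std::move(pending_.front());
            pending_.pop();
            command();
        }
    }

    std::size_t pending() const { return pending_.size(); }

private:
    std::queue<std::function<void()>> pending_;
};
```

Enqueueing two commands and then calling `synchronize()` runs them in enqueue order; nothing in this model orders commands that sit on different `ToyStream` objects, which mirrors the statement that execution on multiple streams may be concurrent.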
-Asynchronous APIs involving a stream all return a stream event which may be used to synchronize the execution of multiple streams. A user may enqueue a barrier onto a stream referencing an event. The barrier blocks until the command related to the event completes, at which point all side effects of the command are visible to commands following the barrier, even if those side effects manifest on different devices.
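The event and barrier mechanism can be illustrated with a small single-threaded model. This is a sketch under stated assumptions, not the HIP API: `ToyEvent`, `ToyEventStream`, and their methods are hypothetical, and a stream only advances when `progress()` is called. In HIP, the analogous calls are `hipEventRecord` and `hipStreamWaitEvent`.

```cpp
#include <deque>
#include <functional>
#include <memory>
#include <utility>

// Toy event: becomes complete once the recording stream reaches it.
struct ToyEvent {
    bool complete = false;
};

class ToyEventStream {
    // A command is either work to run or a barrier on an event.
    struct Command {
        std::function<void()> work;
        std::shared_ptr<ToyEvent> barrier;
    };

public:
    void enqueue(std::function<void()> work) {
        commands_.push_back(Command{std::move(work), nullptr});
    }

    // Marks the event complete when all earlier commands have run.
    void record(std::shared_ptr<ToyEvent> event) {
        commands_.push_back(Command{[event] { event->complete = true; }, nullptr});
    }

    // Enqueues a barrier: later commands wait until the event completes.
    void wait(std::shared_ptr<ToyEvent> event) {
        commands_.push_back(Command{nullptr, std::move(event)});
    }

    // Runs commands in FIFO order. Returns false if the stream is
    // stalled at a barrier whose event has not completed yet.
    bool progress() {
        while (!commands_.empty()) {
            Command& next = commands_.front();
            if (next.barrier && !next.barrier->complete) {
                return false;
            }
            if (next.work) {
                next.work();
            }
            commands_.pop_front();
        }
        return true;
    }

private:
    std::deque<Command> commands_;
};
```

If a producer stream records an event after its work and a consumer stream waits on that event, the consumer's `progress()` returns `false` until the producer has run, after which the consumer's remaining commands execute.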
-Streams also support executing user-defined functions as callbacks on the host. The stream will not launch subsequent commands until the callback completes.
-The SIMT programming model behind the HIP device-side execution is a middle-ground between SMT (Simultaneous Multi-Threading) programming known from multicore CPUs, and SIMD (Single Instruction, Multiple Data) programming mostly known from exploiting relevant instruction sets on CPUs (for example SSE/AVX/Neon).
-Kernels may be launched in multiple ways all with different syntaxes and intended use-cases.
-Tip: This name by default is a macro expanding to triple-chevron. In cases where language syntax extensions are undesirable, or when launching templated and/or overloaded kernel functions, define the HIP_TEMPLATE_KERNEL_LAUNCH preprocessor macro before including the HIP headers to turn it into a templated function.
-Caution: These APIs are intended to be used/generated by tools such as the HIP compiler itself and not intended towards end-user code. Should you be writing a tool having to launch device code using HIP, consider using these over the alternatives.
-This chapter describes the typical hardware implementation of GPUs supported by HIP, and how the inherent thread model maps to the hardware.
-The basic building block of a GPU is a compute unit (CU), also known as streaming multiprocessor (SM) on NVIDIA GPUs. The thread blocks making up a grid are scheduled for execution on CUs. Each block is assigned to an individual CU, and a CU can accommodate several blocks. Depending on their resource usage up to thousands of threads can reside on a CU.
-CUs contain an array of processing elements, referred to as vector ALU (VALU), that execute the actual instructions of the threads according to the SIMT model , together with the necessary registers and caches.
-The threads are executed in groupings called warps. The number of threads making up a warp is architecture dependent. On AMD GPUs the warp size is commonly 64 threads, except in RDNA architectures, which can use a warp size of either 32 or 64. The warp size of supported AMD GPUs is listed in the Accelerator and GPU hardware specifications. NVIDIA GPUs have a warp size of 32.
-In contrast to CPUs, GPUs generally do not employ complex cache structures or control logic, like branch prediction or out-of-order execution, but instead rely on massive hardware multithreading to hide latency.
-Context switching between warps residing on a CU incurs no overhead, as the context for the warps is stored on the CU and does not need to be fetched from memory. If there are not enough free registers to accommodate all warps of a block, the block cannot be scheduled to that CU and has to wait until other blocks finish execution.
-The number of warps that can reside concurrently on a CU, known as occupancy, is determined by the warps' usage of registers and shared memory.
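The occupancy rule above can be sketched as a small calculation. This is only an illustrative model, not how the HIP runtime computes occupancy: the `CuLimits` and `KernelUsage` structs, their fields, and all numbers are hypothetical, and real per-CU limits depend on the architecture.

```cpp
// Hypothetical per-CU resource budget; real limits are architecture specific.
struct CuLimits {
    int registers;     // registers available to resident warps
    int shared_bytes;  // shared memory bytes available to resident blocks
    int max_warps;     // hardware cap on resident warps
};

// Hypothetical resource usage of one kernel.
struct KernelUsage {
    int registers_per_warp;
    int shared_bytes_per_block;
    int warps_per_block;
};

constexpr int min_int(int a, int b) { return a < b ? a : b; }

// Blocks are scheduled whole, so every limit is expressed in blocks first.
constexpr int resident_blocks(const CuLimits& cu, const KernelUsage& k) {
    return min_int(
        min_int(cu.max_warps / k.warps_per_block,
                cu.registers / (k.registers_per_warp * k.warps_per_block)),
        // A kernel that uses no shared memory is not limited by it.
        k.shared_bytes_per_block > 0
            ? cu.shared_bytes / k.shared_bytes_per_block
            : cu.max_warps / k.warps_per_block);
}

// Occupancy in this model: the number of warps resident on the CU at once.
constexpr int resident_warps(const CuLimits& cu, const KernelUsage& k) {
    return resident_blocks(cu, k) * k.warps_per_block;
}

// With these made-up numbers, registers and shared memory both allow 4 blocks.
static_assert(resident_warps(CuLimits{32768, 65536, 40},
                             KernelUsage{2048, 16384, 4}) == 16,
              "4 blocks of 4 warps fit");
```

Halving the shared memory budget in this model halves the resident warps, which is the kind of trade-off the occupancy definition describes.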
-Fig. 1: An AMD Graphics Core Next (GCN) CU. The CDNA and RDNA CUs are based on variations of the GCN CU.
-On AMD GCN GPUs the basic structure of a CU is:
-A SIMD consists of a VALU, which executes the instructions of a warp, together with a register file, which provides the registers for the warps.
-The size of the warp is inherently related to the width of the vector ALU of the SIMD. On GCN compute units the width of the VALU is 16, so a warp can be issued to a SIMD every 4 cycles. Since a CU has 4 SIMDs it issues one warp per cycle. The instructions of a warp are effectively executed in lock-step.
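The issue-rate arithmetic in the paragraph above can be written out as a short sketch. The function names are made up for illustration; the only inputs are the warp size, the VALU width, and the number of SIMDs per CU quoted in the text.

```cpp
// Cycles needed to issue one warp to a SIMD whose VALU is `valu_width` lanes wide.
constexpr int issue_cycles(int warp_size, int valu_width) {
    return (warp_size + valu_width - 1) / valu_width;  // ceiling division
}

// Warps issued per cycle by a CU with `simds_per_cu` SIMDs working in parallel.
constexpr double warps_per_cycle(int simds_per_cu, int warp_size, int valu_width) {
    return static_cast<double>(simds_per_cu) / issue_cycles(warp_size, valu_width);
}

// GCN figures from the text: 64-thread warp, 16-wide VALU, 4 SIMDs per CU.
static_assert(issue_cycles(64, 16) == 4, "one warp every 4 cycles per SIMD");
static_assert(warps_per_cycle(4, 64, 16) == 1.0, "one warp per cycle per CU");
```

The same helper covers other width combinations, for example a 32-wide VALU issuing a 64-thread warp in two cycles.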
-A SIMD always executes the same instruction for the whole VALU. If the control flow of a warp diverges, the performance is decreased, as the results for the threads that do not participate in that branch have to be masked out, and the instructions of the other branch have to be executed in the same way. The best performance can therefore be achieved when thread divergence is kept to a warp level, i.e. when all threads in a warp take the same execution path.
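A rough cost model makes the divergence penalty concrete. This is a simplified sketch that assumes a 64-thread warp and a single if/else; `branch_cost` and its cycle counts are hypothetical and ignore everything except which sides of the branch have to be executed.

```cpp
#include <cstdint>

// One bit per lane of a 64-thread warp: 1 means the lane takes the `then` side.
// The warp pays for a side whenever at least one of its lanes takes that side.
constexpr int branch_cost(std::uint64_t then_mask, int then_cycles, int else_cycles) {
    return (then_mask != 0 ? then_cycles : 0)
         + (then_mask != ~std::uint64_t{0} ? else_cycles : 0);
}

// Uniform warp: only one side is executed.
static_assert(branch_cost(~std::uint64_t{0}, 10, 20) == 10, "all lanes take `then`");
// A single diverging lane makes the whole warp pay for both sides.
static_assert(branch_cost(1, 10, 20) == 30, "divergent warp runs both sides");
```

In this model a warp whose lanes all agree costs the same as straight-line code, which is why branching on a warp-uniform condition is preferable.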
-The usage of cache on a GPU differs from that on a CPU, as there is less cache available per thread. Its main purpose is to coalesce memory accesses of the warps in order to reduce the amount of accesses to device memory, and make that memory available for other warps that currently reside on the compute unit, that also need to load those values.
-The local data share is memory that is accessible to all threads within a block. Its latency and bandwidth are comparable to those of the vector cache. It can be used to share memory between the threads in a block, or as a software managed cache.
-The scalar unit performs instructions that are uniform within a warp. It thereby improves efficiency and reduces the pressure on the vector ALUs and the vector register file.
-The general structure of CUs stays mostly as it is in GCN architectures. The most prominent change is the addition of matrix ALUs, which can greatly improve the performance of algorithms involving matrix multiply-accumulate operations for int8, float16, bfloat16 or float32.
-**Image description:** Block diagram of a compute unit. Recoverable labels: Scheduler, Matrix Core Unit, Local Data Share, Shader Core, L1 Cache.
-
RDNA makes a fundamental change to CU design, by changing the size of a warp to 32 threads. This is done by effectively combining two GCN5 SIMDs, creating a VALU of width 32, so that a whole warp can be issued in one cycle. The CU is also replaced by the work group processor (WGP), which encompasses two CUs. For backwards compatibility the WGP can also run in wave64 mode, in which it issues a warp of size 64 in two cycles.
-It also adds an extra layer of cache to the WGP, shared by the CUs within it. This cache is referred to as L1 cache, promoting the per-CU cache to an L0 cache.
-**Image description:** The image is a block diagram illustrating the architecture of a graphics processing unit (GPU) or a similar parallel processing unit. The diagram is divided into two main sections, each representing a compute unit with various components. - -Each compute unit contains: -- **Scheduler**: Manages the execution of instructions. -- **Vector GPR (General Purpose Registers)**: Handles vector operations with components like "Float/INT/Matrix SIMD32" for single instruction, multiple data operations, "AI MATRIX Accelerator" for AI-related tasks, and "Transcendental SIMD8" for complex mathematical functions. -- **Scalar GPR**: Manages scalar operations. -- **Scalar ALU (Arithmetic Logic Unit)**: Performs arithmetic and logical operations on scalar data. -- **DPFP (Double Precision Floating Point)**: Indicates support for double precision calculations. - -The central part of the diagram shows: -- **Scalar Cache** and **Shader Instruction Cache**: These are shared resources for storing scalar data and shader instructions, respectively. - -On the right side of the diagram: -- **Shared Memory**: A memory space accessible by all units for efficient data sharing. -- **Ray Accelerator**: Dedicated hardware for accelerating ray tracing operations. -- **LD/ST/Tex Addr**: Likely stands for Load/Store/Texture Address, handling memory operations and texture addressing. -- **Texture Filters**: Processes texture data for rendering. -- **L0**: Possibly indicates a level of cache or memory hierarchy. - -The diagram uses a consistent color scheme with red blocks indicating functional units and white/grey blocks for caches and shared resources. The layout suggests a focus on parallel processing capabilities, with emphasis on both vector and scalar operations, AI acceleration, and ray tracing support.
-
For hardware implementation's sake, multiple CUs are grouped together into a Shader Engine or Compute Engine, typically sharing some fixed function units or memory subsystem resources.
-**Image description:** Diagram of the fabric connecting the GPU to its memory subsystem. Recoverable labels: Memory, Memory Controller (repeated).
-
CLR contains the source code for AMD's compute language runtimes: HIP and OpenCL™. CLR is the part of the HIP runtime that is supported on the AMD ROCm platform; it provides a header and runtime library built on top of the HIP-Clang compiler. For developers and users, CLR implements the HIP runtime APIs, including streams, events, and memory APIs, as an object library that is linked with the application. The source code for all headers and the library implementation is available on GitHub in the CLR repository.
-CLR includes the following source code,
-Please refer to Quick Start Guide in ROCm Docs.
-Building CLR requires rocm-hip-libraries meta package, which provides the pre-requisites for CLR.
- * For HIP
-**Following code does:** This code snippet is written in C++ and is used to compile a HIP (Heterogeneous-Compute Interface for Portability) runtime compilation program. It sets up a compilation option for the program by creating a string `sarg` with the value "-fgpu-rdc", which is a flag typically used to enable relocatable device code in GPU programming. This string is then converted to a C-style string and stored in an array `options`. The `hiprtcCompileProgram` function is called with the program `prog`, specifying that there is one compilation option, and passing the `options` array to apply the "-fgpu-rdc" flag during the compilation process.
-<_Bash_>
-**Following code does:** This code snippet is used to retrieve the compiled bitcode of a HIP (Heterogeneous-Compute Interface for Portability) program. It first determines the size of the bitcode using `hiprtcGetBitcodeSize`, storing the size in `bitCodeSize`. Then, it creates a vector `kernel_bitcode` of the appropriate size to hold the bitcode. Finally, it retrieves the actual bitcode from the program `prog` and stores it in the `kernel_bitcode` vector using `hiprtcGetBitcode`. This process is typically part of compiling and managing GPU kernels in a HIP runtime environment.
-<_Haskell_>
-Users can also build OCL and HIP at the same time by passing -DCLR_BUILD_HIP=ON -DCLR_BUILD_OCL=ON to configure command.
-For detailed instructions, please refer to build HIP.
-hip-tests is a separate repository hosted at hip-tests.
-To run hip-tests please go to the repository and follow the steps.
-HIP provides release notes in CLR change log, which has records of changes in each release.
-hipHostMalloc allocates pinned host memory which is mapped into the address space of all GPUs in the system, the memory can be accessed directly by the GPU device, and can be read or written with much higher bandwidth than pageable memory obtained with functions such as malloc() . There are two use cases for this host memory:
-The flags parameter specifies how the memory is allocated. For example, with hipHostMallocPortable the memory is considered allocated by all contexts, not just the one on which the allocation is made. hipHostMallocMapped maps the allocation into the address space of the current device, and the device pointer can be obtained with the API hipHostGetDevicePointer() . hipHostMallocNumaUser allows the host memory allocation to follow the NUMA policy set by the user. Please note that this flag is currently only applicable on Linux and is under development on Windows.
-All allocation flags are independent, and can be used in any combination without restriction, for instance, hipHostMalloc can be called with both hipHostMallocPortable and hipHostMallocMapped flags set. Both usage models described above use the same allocation flags, and the difference is in how the surrounding code uses the host memory.
-NUMA policy determines how memory is allocated. The goal of the NUMA policy is to select the CPU that is closest to each GPU. NUMA distance measures how far apart a GPU and a CPU device are.
-By default, each GPU selects the NUMA CPU node with the least NUMA distance to it; that is, host memory is automatically allocated from the memory pool of the NUMA node closest to the current GPU device. After switching to a different GPU with the hipSetDevice API, the host allocation is still accessible, but the NUMA distance can be longer. Note that NUMA policy is so far implemented on Linux and is under development on Windows.
-ROCm defines two coherency options for host memory:
-HIP provides the developer with controls to select which type of memory is used via allocation flags passed to hipHostMalloc and the HIP_HOST_COHERENT environment variable. By default, the environment variable HIP_HOST_COHERENT is set to 0 in HIP. The control logic in the current version of HIP is as follows:
-Coherent host memory is automatically visible at synchronization points. Non-coherent
-**Following table contains:** The table appears to represent different series of graphics architectures and their compatibility or usage with various memory allocation methods in a computing context. Each row corresponds to a specific series of graphics architectures, while the columns indicate different memory allocation methods or features. - -- **Rows:** - - Each row represents a specific series of graphics architectures. The series listed are "MI200, MI300 Series," "MI100," "RDNA (Navi) Series," and "GCN5 (Vega) Series." - -- **Columns:** - - **Architecture:** This column lists the name of the graphics architecture series. - - **hipMallocManaged():** This column likely indicates whether the architecture series supports or utilizes the `hipMallocManaged()` memory allocation method, which is commonly used in GPU programming for unified memory management. - - **__managed__:** This column might represent the use of `__managed__` memory, which is a feature in some programming environments for managing memory in a unified manner across CPU and GPU. - - **malloc():** This column indicates the use of the standard `malloc()` function, which is a basic memory allocation method in C/C++ programming. - -- **Noteworthy Values:** - - The "MI200, MI300 Series" row has a value of "1" under the "malloc()" column, suggesting that this series uses or supports the `malloc()` function, while the other columns for this series are empty, indicating no information or support for `hipMallocManaged()` and `__managed__`. - - The other architecture series ("MI100," "RDNA (Navi) Series," and "GCN5 (Vega) Series") have no values in any of the columns, which might imply a lack of information or support for the listed memory allocation methods.
-| HIP API | Synchronization Effect | Fence | Coherent Host Memory Visibility | Non-Coherent Host Memory Visibility |
|---|---|---|---|---|
| hipStreamSynchronize | host waits for all commands in the specified stream to complete | system-scope release | yes | yes |
| hipDeviceSynchronize | host waits for all commands in all streams on the specified device to complete | system-scope release | yes | yes |
| hipEventSynchronize | host waits for the specified event to complete | device-scope release | yes | depends - see below |
| hipStreamWaitEvent | stream waits for the specified event to complete | none | yes | no |
Developers can control the release scope for hipEvents :
-A stronger system-level fence can be specified when the event is created with hipEventCreateWithFlags :
-Managed memory, including the __managed__ keyword, is supported in HIP combined host/device compilation, on Linux, not on Windows (under development).
-Managed memory, via unified memory allocation, allows data to be shared and accessed by both the CPU and GPU using a single pointer. The allocation is managed by the AMD GPU driver using the Linux HMM (Heterogeneous Memory Management) mechanism. The user can call the managed memory API hipMallocManaged to allocate a large chunk of HMM memory, execute kernels on the device, and fetch data between the host and device as needed.
-In HIP application, it is recommended to do the capability check before calling the managed memory APIs. For example:
-Please note, the managed memory capability check may not be necessary, but if HMM is not supported, then managed malloc will fall back to using system memory and other managed memory API calls will have undefined behavior.
-Note, managed memory management is implemented on Linux, not supported on Windows yet.
-HIP supports Stream Memory Operations to enable direct synchronization between network nodes and the GPU. The following APIs are added: hipStreamWaitValue32 , hipStreamWaitValue64 , hipStreamWriteValue32 , and hipStreamWriteValue64 .
-Note, CPU access to the semaphore's memory requires volatile keyword to disable CPU compiler's optimizations on memory access. For more details, please check the documentation HIP-API.pdf .
-Please note that HIP streams do not guarantee concurrency on AMD hardware when multiple (at least 6) long-running streams execute concurrently and hipStreamSynchronize(nullptr) is used for synchronization.
-The HIP runtime has Direct Dispatch enabled by default in ROCm 4.4 on Linux. With this feature we move away from our conventional producer-consumer model, where the runtime creates a worker thread (consumer) for each HIP stream and the host thread (producer) enqueues commands to a per-stream command queue.
-For Direct Dispatch, HIP runtime would directly enqueue a packet to the AQL queue (user mode queue on GPU) on the Dispatch API call from the application. That has shown to reduce the latency to launch the first wave on the idle GPU and total time of tiny dispatches synchronized with the host.
-In addition, eliminating the threads in runtime has reduced the variance in the dispatch numbers as the thread scheduling delays and atomics/locks synchronization latencies are reduced.
-This feature can be disabled via setting the following environment variable, AMD_DIRECT_DISPATCH=0
-Note, Direct Dispatch is implemented on Linux. It is currently not supported on Windows.
-HIP now supports runtime compilation (HIP RTC), the usage of which will provide the possibility of optimizations and performance improvement compared with other APIs via regular offline static compilation.
-HIP RTC APIs accept HIP source files in character string format as input parameters and create handles of programs by compiling the HIP source files without spawning separate processes.
-For more details on HIP RTC APIs, refer to HIP Runtime API Reference .
-For Linux developers, the link here shows an example how to program HIP application using runtime compilation mechanism, and a detailed HIP RTC programming guide is also available.
-HIP graph is supported. For more details, refer to the HIP API Guide.
-HIP-Clang now supports device-side malloc and free. This implementation does not require the use of hipDeviceSetLimit(hipLimitMallocHeapSize,value) , nor does it respect any such setting. The heap is fully dynamic and can grow until the available free memory on the device is consumed.
-The per-thread default stream is supported in HIP. It is an implicit stream local to both the thread and the current device. This means that the command issued to the per-thread default stream by the thread does not implicitly synchronize with other streams (like explicitly created streams), or default per-thread stream on other threads. The per-thread default stream is a blocking stream and will synchronize with the default null stream if both are used in a program. The per-thread default stream can be enabled via adding a compilation option, -fgpu-default-stream=per-thread .
-And users can explicitly use hipStreamPerThread as per-thread default stream handle as input in API commands. There are test codes as examples in the link.
-In HIP-Clang, long double type is 80-bit extended precision format for x86_64, which is not supported by AMDGPU. HIP-Clang treats long double type as IEEE double type for AMDGPU. Using long double type in HIP source code will not cause issue as long as data of long double type is not transferred between host and device. However, long double type should not be used as kernel argument type.
-If a host function is to be used between clang (or hipcc) and gcc for x86_64, i.e. its definition is compiled by one compiler but the caller is compiled by a different compiler, _Float16 or aggregates containing _Float16 should not be used as function argument or return type. This is due to lack of stable ABI for _Float16 on x86_64. Passing _Float16 or aggregates containing _Float16 between clang and gcc could cause undefined behavior.
-By default HIP-Clang assumes -ffp-contract=fast-honor-pragmas . Users can use #pragma clang fp contract(on|off|fast) to control fp contraction of a block of code. For x86_64, FMA is off by default since the generic x86_64 target does not support FMA by default. To turn on FMA on x86_64, either use -mfma or -march=native on CPUs supporting FMA.
-When contractions are enabled and the CPU has not enabled FMA instructions, the GPU can produce different numerical results than the CPU for expressions that can be contracted. Tolerance should be used for floating point comparisons.
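A tolerance-based comparison of host and device results might look like the sketch below. The helper `nearly_equal` and its default tolerances are assumptions chosen for illustration; suitable tolerances depend on the data type and on the computation being checked.

```cpp
constexpr double abs_d(double x) { return x < 0 ? -x : x; }
constexpr double max_d(double a, double b) { return a > b ? a : b; }

// True when a and b agree within a relative tolerance, with an absolute floor
// so that values close to zero can still compare equal.
constexpr bool nearly_equal(double a, double b,
                            double rel_tol = 1e-9, double abs_tol = 1e-12) {
    return abs_d(a - b) <= max_d(abs_tol, rel_tol * max_d(abs_d(a), abs_d(b)));
}

// Results that differ only by rounding compare equal; a real mismatch does not.
static_assert(nearly_equal(0.1 + 0.2, 0.3), "rounding-level difference is accepted");
static_assert(!nearly_equal(1.0, 1.0001), "a genuine difference is rejected");
```

A host-side check would call `nearly_equal(cpu_result, gpu_result)` for each element instead of comparing with `==`, so that a contracted multiply-add on the GPU does not show up as a failure.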
-Note: Currently, HIP only supports basic math functions with rounding mode rn (round to nearest). HIP does not support basic math functions with rounding modes ru (round up), rd (round down), and rz (round towards zero).
-HIP-Clang supports generating two types of static libraries. The first type of static library does not export device functions, and only exports and launches host functions within the same library. The advantage of this type is the ability to link with a non-hipcc compiler such as gcc. The second type exports device functions to be linked by other code objects. However, this requires using hipcc as the linker.
-In addition, the first type of library contains host objects with device code embedded as fat binaries. It is generated using the flag -emit-static-lib. The second type of library contains relocatable device objects and is generated using ar .
-Here is an example to create and use static libraries:
-**Following code does:** This code snippet is a function call to `hiprtcLinkAddData`, which is part of the HIPRTC (Heterogeneous Interface for Portability Runtime Compilation) API. The function is used to add input data, such as bitcode or other types of data, to a HIPRTC link state object. This is typically part of a process to compile or link GPU code at runtime. The parameters specify the link state object, the type and size of the input data, an optional name for the input, and any options or option values that might be applied to this input. In this case, no options or option values are provided.
-<_Bash_>
-**Following code does:** This code snippet is a function call to `hiprtcLinkAddFile`, which is part of the HIPRTC (Heterogeneous-Compute Interface for Portability Runtime Compilation) API. The function is used to add a file to the HIPRTC link state. Specifically, it adds a file containing input data or bitcode to the current link state, which is a part of the process of compiling and linking GPU code at runtime. The parameters specify the link state, the type of input, the file path to the bitcode, and options related to the input, although in this case, no additional options are provided (indicated by the zero values).
-hipcc hipDevice.cpp -c -fgpu-rdc -o hipDevice.o
-ar rcsD libHipDevice.a hipDevice.o
-hipcc libHipDevice.a test.cpp -fgpu-rdc -o test.out
-For more information, please see HIP samples host functions and device_functions.
-In addition to providing a portable C++ programming environment for GPUs, HIP is designed to ease the porting of existing CUDA code into the HIP environment. This section describes the available tools and provides practical suggestions on how to port CUDA code and work through common issues.
-The hipexamine-perl.sh tool will scan a source directory to determine which files contain CUDA code and how much of that code can be automatically hipified.
-**Following code does:** The code snippet appears to be part of a larger program written in C++ that uses the HIP (Heterogeneous-Compute Interface for Portability) API, which is designed for writing portable code that can run on both AMD and NVIDIA GPUs. The specific function call `hipModuleLoadData(&module, bina)` is used to load a compiled GPU module from binary data into the `module` variable. This is typically a step in preparing GPU code for execution, where `bina` represents the binary data of the compiled GPU program.
-hipexamine-perl scans each code file (cpp, c, h, hpp, etc.) found in the specified directory:
-**Following code does:** This code snippet is setting up options for a Just-In-Time (JIT) compilation process, likely related to GPU programming using the HIP (Heterogeneous-Compute Interface for Portability) runtime. The `isaopts` array contains command-line options for the LLVM compiler, specifically to set the inline threshold to 1, which influences function inlining decisions during compilation. The `jit_options` vector is initialized with specific JIT options that are likely used to configure the compilation process to include these LLVM options. The `isaoptssize` variable indicates the size of the `isaopts` array, which is used to pass these options to the compiler. Overall, this setup is preparing to customize the behavior of the JIT compiler with specific optimization settings.
-**Following code does:** This code snippet is setting up and initiating a linking process for a HIP (Heterogeneous-Compute Interface for Portability) runtime compilation. It creates an array `lopts` containing pointers to ISA (Instruction Set Architecture) options and their size, which are then used as input parameters for the `hiprtcLinkCreate` function. This function initializes a `hiprtcLinkState` object, `linkstate`, which represents the state of the linking process. The purpose of this code is to configure and start the linking of compiled code for execution on a GPU using HIP.
-| > hipify-perl --inplace
-For each input file FILE, this script will:
-This is useful for testing improvements to the hipify toolset.
-The hipconvertinplace-perl.sh script will perform inplace conversion for all code files in the specified directory. This can be quite handy when dealing with an existing CUDA code base since the script preserves the existing directory structure and filenames - and includes work. After converting in-place, you can review the code to add additional parameters to directory names.
-Most CUDA libraries have a corresponding ROCm library with similar functionality and APIs. However, ROCm also provides HIP marshalling libraries that greatly simplify the porting process because they more precisely reflect their CUDA counterparts and can be used with either the AMD or NVIDIA platforms (see 'Identifying HIP Target Platform' below). There are a few notable exceptions:
-**Following table contains:** The table appears to represent attributes related to memory management capabilities of a HIP (Heterogeneous-computing Interface for Portability) device, which is likely a GPU or similar hardware. Each row describes a specific attribute of the device's memory management features. - -- **Rows**: Each row represents a distinct attribute of the HIP device's memory management capabilities. -- **Columns**: - - The first column lists the attribute names, which are specific features or capabilities of the device. - - The second column provides a description of what each attribute means or supports. - -Noteworthy values include: -- "hipDeviceAttributeManagedMemory" indicates that unified addressing is supported, which means the device can access both host and device memory using a single address space. -- "hipDeviceAttributeConcurrentManagedAccess" suggests that the device supports full managed memory with concurrent access, allowing simultaneous access to managed memory by both the host and the device. -- "hipDeviceAttributePageableMemoryAccess" implies that both managed and system memory allocation APIs are supported, indicating flexibility in memory allocation methods.
-| CUDA Library | HIP Library | ROCm Library | Comment |
|---|---|---|---|
| cuBLAS | hipBLAS | rocBLAS | Basic Linear Algebra Subroutines |
| cuBLASLt | hipBLASLt | N/A | Basic Linear Algebra Subroutines, lightweight and new flexible API |
| cuFFT | hipFFT | rocFFT | Fast Fourier Transform Library |
| cuSPARSE | hipSPARSE | rocSPARSE | Sparse BLAS + SPMV |
| cuSOLVER | hipSOLVER | rocSOLVER | LAPACK library |
| AmgX | N/A | rocALUTION | Sparse iterative solvers and preconditioners with algebraic multigrid |
| Thrust | N/A | rocThrust | C++ parallel algorithms library |
| CUB | hipCUB | rocPRIM | Low Level Optimized Parallel Primitives |
| cuDNN | N/A | MIOpen | Deep learning Solver Library |
| cuRAND | hipRAND | rocRAND | Random Number Generator Library |
| EIGEN | EIGEN | N/A | C++ template library for linear algebra: matrices, vectors, numerical solvers |
| NCCL | N/A | RCCL | Communications Primitives Library based on the MPI equivalents |
All HIP projects target either the AMD or the NVIDIA platform. The platform affects which headers are included and which libraries are used for linking.
-Often, it's useful to know whether the underlying compiler is HIP-Clang or NVCC. This knowledge can guard platform-specific code or aid in platform-specific performance tuning.
- #ifdef __HIP_PLATFORM_AMD__
- // Compiled with HIP-Clang
- #endif
-#ifdef __HIP_PLATFORM_NVIDIA__
-// Compiled with nvcc
-// Could be compiling with CUDA language extensions enabled (for example, a ".cu" file)
-// Could be in pass-through mode to an underlying host compiler (for example, a ".cpp" file)
-#endif
- #ifdef __CUDACC__
- // Compiled with nvcc (CUDA language extensions enabled)
- #endif
-HIP-Clang directly generates the host code (using the Clang x86 target) without passing the code to another host compiler. Thus, it has no equivalent of the __CUDACC__ define.
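As a minimal sketch of how these defines can guard platform-specific code, the helper below reports which platform a translation unit was built for. The function names `hip_platform_name` and `nvcc_extensions_enabled` are hypothetical, and the "host-only" branch is an assumption added so that the snippet also builds with a plain C++ compiler that sets none of the HIP defines.

```cpp
#include <string>

// Hypothetical helper: names the HIP target platform selected at compile time.
// The "host-only" branch is an assumption for builds that set neither HIP define.
inline std::string hip_platform_name() {
#if defined(__HIP_PLATFORM_AMD__)
    return "amd";        // targeting the AMD platform
#elif defined(__HIP_PLATFORM_NVIDIA__)
    return "nvidia";     // targeting the NVIDIA platform
#else
    return "host-only";  // no HIP platform define is present
#endif
}

// Hypothetical helper: true only when NVCC compiles this file with CUDA
// language extensions enabled.
inline bool nvcc_extensions_enabled() {
#if defined(__CUDACC__)
    return true;
#else
    return false;
#endif
}
```

Keeping the checks inside small functions like these confines the preprocessor branches to one place, so the rest of the code can branch on ordinary values.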
-NVCC makes two passes over the code: one for host code and one for device code. HIP-Clang will have multiple passes over the code: one for the host code, and one for each architecture on the device code. __HIP_DEVICE_COMPILE__ is set to a nonzero value when the compiler (HIP-Clang or NVCC) is compiling code for a device inside a __global__ kernel or for a device function. __HIP_DEVICE_COMPILE__ can replace #ifdef checks on the __CUDA_ARCH__ define.
- #if __HIP_DEVICE_COMPILE__
-Unlike __CUDA_ARCH__ , the __HIP_DEVICE_COMPILE__ value is 1 or undefined, and it doesn't represent the feature capability of the target device.
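Because the macro is either 1 or undefined, a guard that first tests `defined(...)` stays valid in every pass. The sketch below wraps that test in a hypothetical helper named `compiling_for_device`; when built with an ordinary host compiler, which knows nothing about HIP, it simply takes the host branch.

```cpp
// Hypothetical helper: reports whether the current compilation pass targets a device.
// Testing defined() first keeps the check valid whether the macro is 1 or undefined.
constexpr bool compiling_for_device() {
#if defined(__HIP_DEVICE_COMPILE__) && __HIP_DEVICE_COMPILE__
    return true;   // device pass (HIP-Clang or NVCC)
#else
    return false;  // host pass, or a compiler that does not define the macro
#endif
}
```

Note that the result says nothing about which features the target device supports; feature checks belong to the `__HIP_ARCH_*` defines and device properties.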
-| Define | HIP-Clang | NVCC | Other (GCC, ICC, Clang, etc.) |
|---|---|---|---|
| HIP-related defines: | | | |
| __HIP_PLATFORM_AMD__ | Defined | Undefined | Defined if targeting AMD platform; undefined otherwise |
| __HIP_PLATFORM_NVIDIA__ | Undefined | Defined | Defined if targeting NVIDIA platform; undefined otherwise |
| __HIP_DEVICE_COMPILE__ | 1 if compiling for device; undefined if compiling for host | 1 if compiling for device; undefined if compiling for host | Undefined |
| __HIPCC__ | Defined | Defined | Undefined |
| __HIP_ARCH_* | 0 or 1 depending on feature support (see below) | 0 or 1 depending on feature support (see below) | 0 |
| NVCC-related defines: | | | |
| __CUDACC__ | Undefined | Defined if source code is compiled by NVCC; undefined otherwise | Undefined |
| __NVCC__ | Undefined | Defined | Undefined |
| __CUDA_ARCH__ | Undefined | Unsigned representing compute capability (e.g., '130') if in device code; 0 if in host code | Undefined |
| HIP-Clang-related defines: | | | |
| __HIP__ | Defined | Undefined | Undefined |
| HIP-Clang common defines: | | | |
| __clang__ | Defined | Defined | Undefined |
Some CUDA code tests __CUDA_ARCH__ for a specific value to determine whether the machine supports a certain architectural feature. For instance,
-#if (__CUDA_ARCH__ >= 130)
-// doubles are supported
-#endif
-This type of code requires special attention, since AMD and CUDA devices have different architectural capabilities. Moreover, you can't determine the presence of a feature using a simple comparison against an architecture's version number. HIP provides a set of defines and device properties to query whether a specific architectural feature is supported.
-The __HIP_ARCH_* defines can replace comparisons of __CUDA_ARCH__ values:
-//#if (__CUDA_ARCH__ >= 130) // non-portable
-if (__HIP_ARCH_HAS_DOUBLES__) { // portable HIP feature query
- // doubles are supported
-}
-For host code, the __HIP_ARCH_* defines are set to 0. You should only use the __HIP_ARCH_* defines in device code.
-Host code should query the architecture feature flags in the device properties that hipGetDeviceProperties returns, rather than testing the 'major' and 'minor' fields directly:
-hipGetDeviceProperties(&deviceProp, device);
-//if ((deviceProp.major == 1 && deviceProp.minor < 2)) // non-portable
-if (deviceProp.arch.hasSharedInt32Atomics) { // portable HIP feature query
- // has shared int32 atomic operations...
-}
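The same idea extends to choosing between several code paths on the host. The sketch below is illustrative only: `ArchFlags` is a hypothetical stand-in that mirrors two of the flag names found in `deviceProp.arch`, so that the example compiles without the HIP runtime, and `choose_reduction_path` and its return strings are invented for this illustration. A real program would read the flags from the structure filled in by `hipGetDeviceProperties`.

```cpp
#include <string>

// Hypothetical stand-in for the feature flags in deviceProp.arch.
// Only two flags are mirrored; a real program would use the structure
// filled in by hipGetDeviceProperties instead.
struct ArchFlags {
    bool hasSharedInt32Atomics;
    bool hasFloatAtomicAdd;
};

// Hypothetical dispatcher: selects a kernel variant from feature flags
// rather than from major/minor version numbers.
inline std::string choose_reduction_path(const ArchFlags& arch) {
    if (arch.hasSharedInt32Atomics && arch.hasFloatAtomicAdd) {
        return "shared-atomics";   // integer and float atomics are both available
    }
    if (arch.hasSharedInt32Atomics) {
        return "shared-int-only";  // only integer atomics in shared memory
    }
    return "global-fallback";      // no shared-memory integer atomics
}
```

Dispatching on named capabilities keeps the decision correct across AMD and NVIDIA devices, whose version numbers are not comparable.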
-The table below shows the full set of architectural properties that HIP supports.
-| Define (use only in device code) | Device Property (run-time query) | Comment |
|---|---|---|
| 32-bit atomics: | | |
| __HIP_ARCH_HAS_GLOBAL_INT32_ATOMICS__ | hasGlobalInt32Atomics | 32-bit integer atomics for global memory |
| __HIP_ARCH_HAS_GLOBAL_FLOAT_ATOMIC_EXCH__ | hasGlobalFloatAtomicExch | 32-bit float atomic exchange for global memory |
| __HIP_ARCH_HAS_SHARED_INT32_ATOMICS__ | hasSharedInt32Atomics | 32-bit integer atomics for shared memory |
| __HIP_ARCH_HAS_SHARED_FLOAT_ATOMIC_EXCH__ | hasSharedFloatAtomicExch | 32-bit float atomic exchange for shared memory |
| __HIP_ARCH_HAS_FLOAT_ATOMIC_ADD__ | hasFloatAtomicAdd | 32-bit float atomic add in global and shared memory |
| 64-bit atomics: | | |
| __HIP_ARCH_HAS_GLOBAL_INT64_ATOMICS__ | hasGlobalInt64Atomics | 64-bit integer atomics for global memory |
| __HIP_ARCH_HAS_SHARED_INT64_ATOMICS__ | hasSharedInt64Atomics | 64-bit integer atomics for shared memory |
| Doubles: | | |
| __HIP_ARCH_HAS_DOUBLES__ | hasDoubles | Double-precision floating point |
| Warp cross-lane operations: | | |
| __HIP_ARCH_HAS_WARP_VOTE__ | hasWarpVote | Warp vote instructions (any, all) |
| __HIP_ARCH_HAS_WARP_BALLOT__ | hasWarpBallot | Warp ballot instructions |
| __HIP_ARCH_HAS_WARP_SHUFFLE__ | hasWarpShuffle | Warp shuffle operations (shfl_*) |
| __HIP_ARCH_HAS_WARP_FUNNEL_SHIFT__ | hasFunnelShift | Funnel shift two input words into one |
| Sync: | | |
| __HIP_ARCH_HAS_THREAD_FENCE_SYSTEM__ | hasThreadFenceSystem | threadfence_system |
| __HIP_ARCH_HAS_SYNC_THREAD_EXT__ | hasSyncThreadsExt | syncthreads_count, syncthreads_and, syncthreads_or |
| Miscellaneous: | | |
| __HIP_ARCH_HAS_SURFACE_FUNCS__ | hasSurfaceFuncs | |
| __HIP_ARCH_HAS_3DGRID__ | has3dGrid | Grids and groups are 3D |
| __HIP_ARCH_HAS_DYNAMIC_PARALLEL__ | hasDynamicParallelism | |
Makefiles can use the following syntax to conditionally provide a default HIP_PATH if one does not exist:
-HIP_PATH ?= $(shell hipconfig --path)
-HIP can use either ROCclr or CUDA as its runtime.
-hipLaunchKernelGGL is a macro that provides an alternative way to launch kernels. It accepts the launch configuration parameters (grid dims, group dims, dynamic shared size, stream) followed by a variable number of kernel arguments. It can replace <<< >>>, if the user so desires.
-hipcc is a portable compiler driver that calls NVCC or HIP-Clang (depending on the target system) and attaches all required include and library options. It passes options through to the target compiler. Tools that call hipcc must ensure the compiler options are appropriate for the target compiler. The hipconfig script may be helpful in identifying the target platform, compiler and runtime. It can also help set options appropriately.
-Here are the main compiler options supported on AMD platforms by HIP-Clang.
-| Option | Description |
|---|---|
| --amdgpu-target=<gpu_arch> | [DEPRECATED] This option is being replaced by --offload-arch=<target>. Generate code for the given GPU target. Supported targets are gfx701, gfx801, gfx802, gfx803, gfx900, gfx906, gfx908, gfx1010, gfx1011, gfx1012, gfx1030, gfx1031. This option could appear multiple times on the same command line to generate a fat binary for multiple targets. |
| --fgpu-rdc | Generate relocatable device code, which allows kernels or device functions calling device functions in different translation units. |
| -ggdb | Equivalent to -g plus tuning for GDB. This is recommended when using ROCm's GDB to debug GPU code. |
| --gpu-max-threads-per-block=<num> | Generate code to support up to the specified number of threads per block. |
| -O<n> | Specify the optimization level. |
| --offload-arch=<target> | Specify the AMDGPU target ID. |
| -save-temps | Save the compiler generated intermediate files. |
| -v | Show the compilation steps. |
hipcc adds the necessary libraries for HIP as well as for the accelerator compiler (NVCC or AMD compiler). We recommend linking with hipcc since it automatically links the binary to the necessary HIP runtime libraries. It also has knowledge on how to link and to manage the GPU objects.
-hipcc adds -lm by default to the link command.
-CUDA code often uses NVCC for accelerator code (defining and launching kernels, typically defined in .cu or .cuh files). It also uses a standard compiler (g++) for the rest of the application. NVCC is a preprocessor that employs a standard host compiler (gcc) to generate the host code. Code compiled using this tool can employ only the intersection of language features supported by both NVCC and the host compiler. In some cases, you must take care to ensure the data types and alignment of the host compiler are identical to those of the device compiler. Only some host compilers are supported; for example, recent NVCC versions lack Clang host-compiler capability.
-HIP-Clang generates both device and host code using the same Clang-based compiler. The code uses the same ABI as gcc, which allows code generated by different gcc-compatible compilers to be linked together. For example, code compiled using HIP-Clang can link with code compiled using 'standard' compilers (such as gcc, ICC and Clang). Take care to ensure all compilers use the same standard C++ header and library formats.
-hipcc links to libstdc++ by default. This provides better compatibility between g++ and HIP.
-If you pass --stdlib=libc++ to hipcc, hipcc will use the libc++ library. Generally, libc++ provides a broader set of C++ features while libstdc++ is the standard for more compilers (notably including g++).
-When cross-linking C++ code, any C++ functions that use types from the C++ standard library (including std::string, std::vector and other containers) must use the same standard-library implementation. They include the following:
-Applications with these interfaces should use the default libstdc++ linking.
-Applications which are compiled entirely with hipcc, and which benefit from advanced C++ features not supported in libstdc++, and which do not require portability to NVCC, may choose to use libc++.
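When objects from several compilers are linked together, it can help to confirm which standard library each translation unit was actually built against. The sketch below is one possible check, assuming the usual identification macros (`__GLIBCXX__` for libstdc++ and `_LIBCPP_VERSION` for libc++); the function name `cxx_stdlib_name` is hypothetical.

```cpp
#include <cstddef>  // including any standard header makes the library's macro visible
#include <string>

// Hypothetical helper: reports which C++ standard library this translation unit uses.
inline std::string cxx_stdlib_name() {
#if defined(_LIBCPP_VERSION)
    return "libc++";
#elif defined(__GLIBCXX__)
    return "libstdc++";
#else
    return "unknown";  // some other standard library implementation
#endif
}
```

Logging this value once from each separately compiled component is a cheap way to spot a mismatch before it shows up as a link error or a crash at an interface that passes std::string or a container.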
-The hip_runtime.h and hip_runtime_api.h files define the types, functions and enumerations needed to compile a HIP program:
-CUDA has slightly different contents for these two files. In some cases you may need to convert hipified code to include the richer hip_runtime.h instead of hip_runtime_api.h .
-You can compile hip_runtime_api.h using a standard C or C++ compiler (e.g., gcc or ICC). The HIP include paths and defines ( __HIP_PLATFORM_AMD__ or __HIP_PLATFORM_NVIDIA__ ) must be passed to the standard compiler; hipconfig returns the necessary options:
-> hipconfig --cxx_config
- -D__HIP_PLATFORM_AMD__ -I/home/user1/hip/include
-You can capture the hipconfig output and pass it to the standard compiler; below is sample makefile syntax:
-CPPFLAGS += $(shell $(HIP_PATH)/bin/hipconfig --cpp_config)
-NVCC includes some headers by default. However, HIP does not include default headers, and instead all required files must be explicitly included. Specifically, files that call HIP run-time APIs or define HIP kernels must explicitly include the appropriate HIP headers. If the compilation process reports that it cannot find necessary APIs (for example, error: identifier hipSetDevice is undefined ), ensure that the file includes hip_runtime.h (or hip_runtime_api.h, if appropriate). The hipify-perl script automatically converts cuda_runtime.h to hip_runtime.h , and it converts cuda_runtime_api.h to hip_runtime_api.h , but it may miss nested headers or macros.
-The HIP-Clang path provides an empty cuda.h file. Some existing CUDA programs include this file but don't require any of the functions.
-Many existing CUDA projects use the .cu and .cuh file extensions to indicate code that should be run through the NVCC compiler. For quick HIP ports, leaving these file extensions unchanged is often easier, as it minimizes the work required to change file names in the directory and #include statements in the files.
-For new projects or ports which can be re-factored, we recommend the use of the extension .hip.cpp for source files, and .hip.h or .hip.hpp for header files. This indicates that the code is standard C++ code, but also provides a unique indication for make tools to run hipcc when appropriate.
-Code should not assume a warp size of 32 or 64. See Warp Cross-Lane Functions for information on how to write portable wave-aware code.
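One way to follow this advice on the host side is to treat the warp size as a run-time value. The sketch below assumes the caller obtains that value from the device properties (the `warpSize` field); the helper names `warps_per_block` and `per_warp_scratch_bytes` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of warps (wavefronts) needed to cover a block.
// warp_size is a run-time argument instead of a hard-coded 32 or 64.
inline int warps_per_block(int block_size, int warp_size) {
    assert(block_size >= 0 && warp_size > 0);
    return (block_size + warp_size - 1) / warp_size;  // round up
}

// Hypothetical helper: scratch bytes needed to hold one partial result per warp.
inline std::size_t per_warp_scratch_bytes(int block_size, int warp_size,
                                          std::size_t bytes_per_warp) {
    return static_cast<std::size_t>(warps_per_block(block_size, warp_size)) * bytes_per_warp;
}
```

For a 256-thread block this gives 4 warps when the warp size is 64 and 8 when it is 32, so buffers sized this way stay correct on both kinds of device.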
-Kernel code should use __attribute__((amdgpu_flat_work_group_size(<min>,<max>))) . For example:
<_SQL_>
-HIP support for hipMemcpyToSymbol is complete. This feature allows a kernel to define a device-side data symbol which can be accessed on the host side. The symbol can be in __constant or device space.
-Note that the symbol name needs to be encased in the HIP_SYMBOL macro, as shown in the code example below. This also applies to hipMemcpyFromSymbol , hipGetSymbolAddress , and hipGetSymbolSize .
-For example:
-Device Code:
-<_C++_>
- {
- A[i] = -1*i;
- B[i] = 0;
- }
-
- HIP_ASSERT(hipMalloc((void**)&Ad, SIZE));
-
- HIP_ASSERT(hipMemcpyToSymbol(HIP_SYMBOL(Value), A, SIZE, 0, hipMemcpyHostToDevice));
- hipLaunchKernelGGL(Get, dim3(1,1,1), dim3(LEN,1,1), 0, 0, Ad);
- HIP_ASSERT(hipMemcpy(B, Ad, SIZE, hipMemcpyDeviceToHost));
-
- for(unsigned i=0;i
-To get a pointer's memory type in HIP/HIP-Clang, developers should use the hipPointerGetAttributes API. The first parameter of the API is a hipPointerAttribute_t, which has 'type' as a member variable. 'type' indicates whether the input pointer is allocated on the device or the host.
-For example:
- double* ptr;
- hipMalloc(reinterpret_cast<void**>(&ptr), sizeof(double));
- hipPointerAttribute_t attr;
- hipPointerGetAttributes(&attr, ptr); /* attr.type will have the value hipMemoryTypeDevice */
-
- double* ptrHost;
- hipHostMalloc(&ptrHost, sizeof(double));
- hipPointerAttribute_t attrHost;
- hipPointerGetAttributes(&attrHost, ptrHost); /* attrHost.type will have the value hipMemoryTypeHost */
-Please note, hipMemoryType enum values are different from cudaMemoryType enum values.
-For example, on the AMD platform, hipMemoryType is defined in hip_runtime_api.h:
-
- typedef enum hipMemoryType {
-   hipMemoryTypeHost = 0,    ///< Memory is physically located on host
-   hipMemoryTypeDevice = 1,  ///< Memory is physically located on device (see deviceId for specific device)
-   hipMemoryTypeArray = 2,   ///< Array memory, physically located on device (see deviceId for specific device)
-   hipMemoryTypeUnified = 3, ///< Not used currently
-   hipMemoryTypeManaged = 4  ///< Managed memory, automatically managed by the unified memory system
- } hipMemoryType;
-The CUDA toolkit defines cudaMemoryType as follows:
-<_Cuda_>
-In this case, memory type translation for hipPointerGetAttributes needs to be handled properly on NVIDIA platform to get the correct memory type in CUDA, which is done in the file nvidia_hip_runtime_api.h .
-So in any HIP applications which use HIP APIs involving memory types, developers should use #ifdef in order to assign the correct enum values depending on NVIDIA or AMD platform.
-As an example, please see the code from the link.
-With the #ifdef condition, HIP APIs work as expected on both AMD and NVIDIA platforms.
-Note that cudaMemoryTypeUnregistered is currently not supported in the hipMemoryType enum, to preserve backward compatibility of HIP functionality.
-threadfence_system makes all device memory writes, all writes to mapped host memory, and all writes to peer memory visible to the CPU and other GPU devices. Some implementations can provide this behavior by flushing the GPU L2 cache. HIP/HIP-Clang does not provide this functionality. As a workaround, users can set the environment variable HSA_DISABLE_CACHE=1 to disable the GPU L2 cache. This affects all accesses in all kernels and so may have a performance impact.
-Compute programs sometimes use textures either to access dedicated texture caches or to use the texture-sampling hardware for interpolation and clamping. The former approach uses simple point samplers with linear interpolation, essentially only reading a single point. The latter approach uses the sampler hardware to interpolate and combine multiple samples. AMD hardware, as well as recent competing hardware, has a unified texture/L1 cache, so it no longer has a dedicated texture cache. But the NVCC path often caches global loads in the L2 cache, and some programs may benefit from explicit control of the L1 cache contents. We recommend the __ldg instruction for this purpose.
-AMD compilers currently load all data into both the L1 and L2 caches, so __ldg is treated as a no-op.
-We recommend the following for functional portability:
-On an AMD platform, set the AMD_LOG_LEVEL environment variable to log HIP application execution information.
-The value of the setting controls the logging level:
-<_C++_>
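As a rough illustration of how a level setting such as AMD_LOG_LEVEL can gate output, the sketch below models the comparison in plain C++. The enum values 0 through 4 (LOG_NONE up to LOG_DEBUG) follow the HIP logging documentation; `parseLogLevel`, `shouldLog`, and `currentLogLevel` are hypothetical helper names chosen for this example and are not part of the HIP runtime, whose actual filtering logic may differ.

```cpp
#include <cassert>
#include <cstdlib>

// Log levels as listed in the HIP logging documentation (0 = off, 4 = most verbose).
enum LogLevel { LOG_NONE = 0, LOG_ERROR = 1, LOG_WARNING = 2, LOG_INFO = 3, LOG_DEBUG = 4 };

// Hypothetical helper: turn the text of AMD_LOG_LEVEL into a LogLevel.
// Assumption of this sketch: a missing, non-numeric, or out-of-range value means LOG_NONE.
LogLevel parseLogLevel(const char* text) {
    if (text == nullptr) return LOG_NONE;
    char* end = nullptr;
    const long value = std::strtol(text, &end, 10);
    if (end == text || value < LOG_NONE || value > LOG_DEBUG) return LOG_NONE;
    return static_cast<LogLevel>(value);
}

// Hypothetical helper: a message is printed when its level does not exceed the configured level.
bool shouldLog(LogLevel configured, LogLevel message) {
    return message != LOG_NONE && message <= configured;
}

// Reads the real environment variable; the result depends on the caller's environment.
LogLevel currentLogLevel() {
    return parseLogLevel(std::getenv("AMD_LOG_LEVEL"));
}
```

With this model, running an application with `AMD_LOG_LEVEL=3` would let error, warning, and info messages through while suppressing debug messages.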
-The logging mask selects which types of functionality are printed while a HIP application runs. It can be set to one of the following values:
-<_C++_>
-To see the detailed commands that hipcc issues, set the environment variable HIPCC_VERBOSE to 1. Doing so will print to stderr the HIP-clang (or NVCC) commands that hipcc generates.
-export HIPCC_VERBOSE=1
-make
-
-...
-hipcc-cmd: /opt/rocm/bin/hipcc --offload-arch=native -x hip backprop_cuda.cu
-See the utils/vim or utils/gedit directories to add handy highlighting to hip files.
-CUDA provides separate Driver and Runtime APIs. The two APIs have significant overlap in functionality:
-The Driver API offers two additional pieces of functionality not provided by the Runtime API: cuModule and cuCtx APIs.
-The Module section of the Driver API provides additional control over how and when accelerator code objects are loaded. For example, the driver API allows code objects to be loaded from files or memory pointers. Symbols for kernels or global data can be extracted from the loaded code objects. In contrast, the Runtime API automatically loads and (if necessary) compiles all of the kernels from an executable binary when run. In this mode, NVCC must be used to compile kernel code so the automatic loading can function correctly.
-Both the Driver and Runtime APIs define a function for launching kernels (called cuLaunchKernel or cudaLaunchKernel). The kernel arguments and the execution configuration (grid dimensions, group dimensions, dynamic shared memory, and stream) are passed as arguments to the launch function. The Runtime additionally provides the <<< >>> syntax for launching kernels, which resembles a special function call and is easier to use than the explicit launch API (in particular with respect to the handling of kernel arguments). However, this syntax is not standard C++ and is available only when NVCC is used to compile the host code.
-The Module features are useful in an environment which generates the code objects directly, such as a new accelerator language front-end. Here, NVCC is not used. Instead, the environment may have a different kernel language or different compilation flow. Other environments have many kernels and do not want them to be all loaded automatically. The Module functions can be used to load the generated code objects and launch kernels. As we will see below, HIP defines a Module API which provides similar explicit control over code object management.
-The Driver API defines 'Context' and 'Devices' as separate entities. Contexts contain a single device, and a device can theoretically have multiple contexts. Each context contains a set of streams and events specific to the context. Historically, contexts also defined a unique address space for the GPU, though this may no longer be the case on Unified Memory platforms (since the CPU and all the devices in the same process share a single unified address space). The Context APIs also provide a mechanism to switch between devices, which allows a single CPU thread to send commands to different GPUs. HIP, as well as recent versions of the CUDA Runtime, provides other mechanisms to accomplish this, for example streams or cudaSetDevice.
-The CUDA Runtime API unifies the Context API with the Device API. This simplifies the APIs and loses little functionality, since each Context can contain only a single device and the benefits of multiple contexts have been replaced by other interfaces. HIP provides a context API to facilitate easy porting from existing Driver code. In HIP, the Ctx functions largely provide an alternate syntax for changing the active device.
-Most new applications will prefer hipSetDevice or the stream APIs; therefore, HIP has marked the hipCtx APIs as deprecated. Support for these APIs may not be available in future releases. For more details, refer to the HIP deprecated APIs.
-Rather than present two separate APIs, HIP extends the HIP API with new APIs for Modules and Ctx control.
-Like the CUDA Driver API, the Module API provides additional control over how code is loaded, including options to load code from files or from in-memory pointers. NVCC and HIP-Clang target different architectures and use different code object formats: NVCC produces cubin or ptx files, while the HIP-Clang path uses the hsaco format. The external compilers which generate these code objects are responsible for generating and loading the correct code object for each platform. Notably, there is no fat binary format that can contain code for both NVCC and HIP-Clang platforms. The following table summarizes the formats used on each platform:
-| Format | APIs | NVCC | HIP-CLANG |
|---|---|---|---|
| Code Object | hipModuleLoad, hipModuleLoadData | .cubin or PTX text | .hsaco |
| Fat Binary | hipModuleLoadFatBin | .fatbin | .hip_fatbin |
hipcc uses HIP-Clang or NVCC to compile host codes. Both of these may embed code objects into the final executable, and these code objects will be automatically loaded when the application starts. The hipModule API can be used to load additional code objects, and in this way provides an extended capability to the automatically loaded code objects. HIP-Clang allows both of these capabilities to be used together, if desired. Of course it is possible to create a program with no kernels and thus no automatic loading.
-HIP provides a Ctx API as a thin layer over the existing Device functions. This Ctx API can be used to set the current context, or to query properties of the device associated with the context. The current context is implicitly used by other APIs such as hipStreamCreate .
-The HIPIFY tools convert CUDA Driver APIs for streams, events, modules, devices, memory management, context, profiler to the equivalent HIP driver calls. For example, cuEventCreate will be translated to hipEventCreate . HIPIFY tools also convert error codes from the Driver namespace and coding convention to the equivalent HIP error code. Thus, HIP unifies the APIs for these common functions.
-The memory copy API requires additional explanation. The CUDA driver API includes the memory direction in the name of the API (cuMemcpyHtoD), while the CUDA runtime API provides a single memory copy API with a parameter that specifies the direction and additionally supports a 'default' direction where the runtime determines the direction automatically. HIP provides APIs in both styles: for example, hipMemcpyHtoD as well as hipMemcpy. The first flavor may be faster in some cases since it avoids the host overhead of detecting the memory direction.
-HIP defines a single error space, and uses camel-case for all errors (i.e. hipErrorInvalidValue ).
-HIP-Clang defines a process-wide address space where the CPU and all devices allocate addresses from a single unified pool. Thus addresses may be shared between contexts, and unlike the original CUDA definition a new context does not create a new address space for the device.
-hipModuleLaunchKernel is the HIP counterpart of cuLaunchKernel. It takes the same arguments as cuLaunchKernel.
-hip-clang links device code from different translation units together. For each device target, a code object is generated. Code objects for different device targets are bundled by clang-offload-bundler into one fatbinary, which is embedded as a global symbol __hip_fatbin in the .hip_fatbin section of the ELF file of the executable or shared object.
-hip-clang generates initialization and termination functions for each translation unit for host code compilation. The initialization functions call __hipRegisterFatBinary to register the fatbinary embedded in the ELF file. They also call __hipRegisterFunction and __hipRegisterVar to register kernel functions and device-side global variables. The termination functions call __hipUnregisterFatBinary. hip-clang emits a global variable __hip_gpubin_handle of void** type with linkonce linkage and initial value 0 for each host translation unit. Each initialization function checks __hip_gpubin_handle, registers the fatbinary only if __hip_gpubin_handle is 0, and saves the return value of __hipRegisterFatBinary to __hip_gpubin_handle. This guarantees that the fatbinary is registered only once. A similar check is done in the termination functions.
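The register-once check described above can be modelled in a few lines of host-only C++. This is only a sketch of the guard logic: `gpubinHandleModel` stands in for __hip_gpubin_handle, and `registerFatBinaryModel`, `moduleCtorModel`, and `moduleDtorModel` are hypothetical names for this example, not the symbols hip-clang actually emits.

```cpp
#include <cassert>

// Stand-in for the shared handle the text calls __hip_gpubin_handle (initial value 0).
void** gpubinHandleModel = nullptr;
int registerCalls = 0;    // how many times the modelled registration actually ran
int unregisterCalls = 0;  // how many times the modelled unregistration actually ran

// Hypothetical stand-in for __hipRegisterFatBinary: returns a non-null handle.
void** registerFatBinaryModel() {
    static void* slot = nullptr;
    ++registerCalls;
    return &slot;
}

// What each translation unit's initialization function does in this model.
void moduleCtorModel() {
    if (gpubinHandleModel == nullptr) {
        gpubinHandleModel = registerFatBinaryModel();
        assert(gpubinHandleModel != nullptr);
    }
    // Registration of kernels and device variables would follow here.
}

// Matching termination function: unregister once, then clear the handle.
void moduleDtorModel() {
    if (gpubinHandleModel != nullptr) {
        ++unregisterCalls;
        gpubinHandleModel = nullptr;
    }
}
```

Calling `moduleCtorModel()` once per translation unit leaves `registerCalls` at 1, which is the property the shared handle is meant to guarantee.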
-hip-clang supports kernel launching through the CUDA <<<>>> syntax and through hipLaunchKernelGGL. The latter is a macro that expands to the CUDA <<<>>> syntax.
-When the executable or shared library is loaded by the dynamic linker, the initialization functions are called. In the initialization functions, when __hipRegisterFatBinary is called, the code objects containing all kernels are loaded; when __hipRegisterFunction is called, the stub functions are associated with the corresponding kernels in code objects.
-hip-clang implements two sets of kernel launching APIs.
-By default, in the host code, for the <<<>>> statement, hip-clang first emits call of hipConfigureCall to set up the threads and grids, then emits call of the stub function with the given arguments. In the stub function, hipSetupArgument is called for each kernel argument, then hipLaunchByPtr is called with a function pointer to the stub function. In hipLaunchByPtr , the real kernel associated with the stub function is launched.
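The call order described above can be traced with a small host-only model. The functions below are hypothetical stand-ins (`configureCallModel`, `setupArgumentModel`, `launchByPtrModel`) that only record what would be called; they do not reproduce the real signatures of hipConfigureCall, hipSetupArgument, or hipLaunchByPtr, and the two-argument kernel is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Records the order of the modelled runtime calls.
std::vector<std::string> launchTrace;

using StubModel = void (*)(float*, int);

// Hypothetical stand-in for the configuration call (grid and block setup).
void configureCallModel(unsigned grid, unsigned block) {
    launchTrace.push_back("configure " + std::to_string(grid) + "x" + std::to_string(block));
}

// Hypothetical stand-in for the per-argument setup call.
void setupArgumentModel(std::size_t size, std::size_t offset) {
    launchTrace.push_back("arg size=" + std::to_string(size) + " offset=" + std::to_string(offset));
}

// Hypothetical stand-in for the launch-by-stub-pointer call.
void launchByPtrModel(StubModel stub) {
    assert(stub != nullptr);
    launchTrace.push_back("launch");
}

// Host-side stub the compiler would emit for a kernel taking (float* out, int n).
void kernelStubModel(float* out, int n) {
    setupArgumentModel(sizeof(out), 0);
    setupArgumentModel(sizeof(n), sizeof(out));
    launchByPtrModel(&kernelStubModel);
}

// What `kernel<<<grid, block>>>(out, n)` lowers to in this model.
void launchModel(unsigned grid, unsigned block, float* out, int n) {
    configureCallModel(grid, block);
    kernelStubModel(out, n);
}
```

Running `launchModel(4, 64, nullptr, 8)` leaves four entries in `launchTrace`: one configuration step, one entry per kernel argument, and the final launch.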
-CUDA applications may want to mix CUDA driver code with HIP code (see example below). This table shows the type equivalence to enable this interaction.
-| HIP Type | CU Driver Type | CUDA Runtime Type |
|---|---|---|
| hipModule_t | CUmodule | |
| hipFunction_t | CUfunction | |
| hipCtx_t | CUcontext | |
| hipDevice_t | CUdevice | |
| hipStream_t | CUstream | cudaStream_t |
| hipEvent_t | CUevent | cudaEvent_t |
| hipArray | CUarray | cudaArray |
The hipModule_t interface does not support cuModuleLoadDataEx function, which is used to control PTX compilation options. HIP-Clang does not use PTX and does not support these compilation options. In fact, HIP-Clang code objects always contain fully compiled ISA and do not require additional compilation as a part of the load step. The corresponding HIP function hipModuleLoadDataEx behaves as hipModuleLoadData on HIP-Clang path (compilation options are not used) and as cuModuleLoadDataEx on NVCC path. For example (CUDA):
-<_Cuda_>
-HIP:
-The sample below shows how to use hipModuleGetFunction.
-#include <hip/hip_runtime.h>
-#include <cstdint>
-#include <cstring>
-#include <iostream>
-#include <vector>
-
-#define LEN 64
-#define SIZE (LEN << 2)  // buffer size in bytes: LEN floats of 4 bytes each
-
-#ifdef __HIP_PLATFORM_AMD__
-#define fileName "vcpy_isa.co"
-#endif
-
-#ifdef __HIP_PLATFORM_NVIDIA__
-#define fileName "vcpy_isa.ptx"
-#endif
-
-#define kernel_name "hello_world"
-
-int main(){
-    float *A, *B;
-    hipDeviceptr_t Ad, Bd;
-    A = new float[LEN];
-    B = new float[LEN];
-
-    // Initialize the host buffers.
-    for(uint32_t i=0;i<LEN;i++){
-        A[i] = i*1.0f;
-        B[i] = 0.0f;
-    }
-
-    // Device setup needed by the launch below: allocate, copy inputs, load the module.
-    hipMalloc((void**)&Ad, SIZE);
-    hipMalloc((void**)&Bd, SIZE);
-    hipMemcpyHtoD(Ad, A, SIZE);
-    hipMemcpyHtoD(Bd, B, SIZE);
-
-    hipModule_t Module;
-    hipFunction_t Function;
-    hipModuleLoad(&Module, fileName);
-    hipModuleGetFunction(&Function, Module, kernel_name);
-
-    // Pack the two device pointers into one contiguous argument buffer.
-    std::vector<void*> argBuffer(2);
-    memcpy(&argBuffer[0], &Ad, sizeof(void*));
-    memcpy(&argBuffer[1], &Bd, sizeof(void*));
-
-    size_t size = argBuffer.size()*sizeof(void*);
-
-    void *config[] = {
-        HIP_LAUNCH_PARAM_BUFFER_POINTER, &argBuffer[0],
-        HIP_LAUNCH_PARAM_BUFFER_SIZE, &size,
-        HIP_LAUNCH_PARAM_END
-    };
-
-    hipModuleLaunchKernel(Function, 1, 1, 1, LEN, 1, 1, 0, 0, NULL, (void**)&config);
-
-    // Copy the result back and print input and output side by side.
-    hipMemcpyDtoH(B, Bd, SIZE);
-    for(uint32_t i=0;i<LEN;i++){
-        std::cout<<A[i]<<" - "<<B[i]<<std::endl;
-    }
-
-    return 0;
-}
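The sample passes its kernel arguments as one contiguous buffer plus a size. When the arguments are not all pointers, the buffer layout has to match what the kernel expects. The host-only sketch below shows one way to build such a buffer; `ArgBuffer` is a hypothetical helper for this example, and the rule it applies (each argument placed at the next multiple of its own alignment) is an assumption that should be checked against the kernel's actual argument layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical helper that builds a contiguous kernel-argument buffer.
class ArgBuffer {
public:
    // Append one argument, zero-padding so it starts at a multiple of its alignment.
    template <typename T>
    void push(const T& value) {
        static_assert(std::is_trivially_copyable<T>::value, "arguments are copied bytewise");
        const std::size_t align = alignof(T);
        const std::size_t offset = (bytes_.size() + align - 1) / align * align;
        bytes_.resize(offset + sizeof(T));
        std::memcpy(bytes_.data() + offset, &value, sizeof(T));
    }

    void* data() { return bytes_.data(); }          // would pair with HIP_LAUNCH_PARAM_BUFFER_POINTER
    std::size_t size() const { return bytes_.size(); }  // would pair with HIP_LAUNCH_PARAM_BUFFER_SIZE

private:
    std::vector<unsigned char> bytes_;
};
```

For the two-pointer case in the sample, pushing `Ad` and `Bd` yields a buffer of `2 * sizeof(void*)` bytes, the same size the sample computes from `argBuffer.size()*sizeof(void*)`.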
-HIP supports texture driver APIs; however, the texture reference must be declared in host scope. The following code shows how to use a texture reference on the __HIP_PLATFORM_AMD__ platform.
-// Code to generate code object
-
-#include "hip/hip_runtime.h"
-
-extern texture<float, 2, hipReadModeElementType> tex;
-
-__global__ void tex2dKernel(hipLaunchParm lp, float* outputData,
-                            int width,
-                            int height)
-{
-    int x = blockIdx.x*blockDim.x + threadIdx.x;
-    int y = blockIdx.y*blockDim.y + threadIdx.y;
-    outputData[y*width + x] = tex2D(tex, x, y);
-}
-
-// Host code:
-
-texture<float, 2, hipReadModeElementType> tex;
-
-void myFunc ()
-{
-    //...
-
-    textureReference* texref;
-    hipModuleGetTexRef(&texref, Module1, "tex");
-    hipTexRefSetAddressMode(texref, 0, hipAddressModeWrap);
-    hipTexRefSetAddressMode(texref, 1, hipAddressModeWrap);
-    hipTexRefSetFilterMode(texref, hipFilterModePoint);
-    hipTexRefSetFlags(texref, 0);
-    hipTexRefSetFormat(texref, HIP_AD_FORMAT_FLOAT, 1);
-    hipTexRefSetArray(texref, array, HIP_TRSA_OVERRIDE_FORMAT);
-
-    //...
-}
-HIP lets you compile kernels at runtime with the hiprtc* APIs. Kernels can be stored as a text string and can be passed to HIPRTC APIs alongside options to guide the compilation.
-NOTE: To use HIPRTC functionality, the HIPRTC header must be included first: #include <hip/hiprtc.h>
-Kernels can be stored in a string:
-<_C_>
-Now to compile this kernel, it needs to be associated with hiprtcProgram type, which is done by declaring hiprtcProgram prog; and associating the string of kernel with this program:
-hiprtcCreateProgram(&prog, // HIPRTC program
- kernel, // kernel string
- "gpu_kernel.cu", // Name of the file
- num_headers, // Number of headers
- &header_sources[0], // Header sources
- &header_names[0]); // Name of header files
-hiprtcCreateProgram API also allows you to add headers which can be included in your RTC program. For online compilation, the compiler pre-defines HIP device API functions, HIP specific types and macros for device compilation, but does not include standard C/C++ headers by default. Users can only include header files provided to hiprtcCreateProgram .
-After associating the kernel string with hiprtcProgram , you can now compile this program using:
-hiprtcCompileProgram(prog, // hiprtcProgram
- 0, // Number of options
- options); // Clang options; see [Supported Clang Options](clang_options.md)
-hiprtcCompileProgram returns a status value which can be converted to string via hiprtcGetErrorString . If compilation is successful, hiprtcCompileProgram will return HIPRTC_SUCCESS .
-If the compilation fails, you can look up the logs via:
-<_C++_>
-If the compilation is successful, you can load the compiled binary in a local variable.
- size_t codeSize;
- hiprtcGetCodeSize(prog, &codeSize);
-
- vector<char> kernel_binary(codeSize);
- hiprtcGetCode(prog, kernel_binary.data());
-After loading the binary, hiprtcProgram can be destroyed. hiprtcDestroyProgram(&prog);
-The binary present in kernel_binary can now be loaded via hipModuleLoadData API.
-hipModule_t module;
-hipFunction_t kernel;
-
-hipModuleLoadData(&module, kernel_binary.data());
-hipModuleGetFunction(&kernel, module, "vector_add");
-And now this kernel can be launched via hipModule APIs.
-The full example is below:
-HIPRTC provides a few HIPRTC-specific flags.
-In the usual scenario, the kernel associated with hiprtcProgram is compiled into a binary that can be loaded and run. However, if the -fgpu-rdc option is provided in the compile options, HIPRTC calls comgr and generates only the LLVM bitcode; it does not convert this bitcode to ISA or generate the final binary.
- std::string sarg = std::string("-fgpu-rdc");
- const char* options[] = {
- sarg.c_str() };
- hiprtcCompileProgram(prog, // hiprtcProgram
- 1, // Number of options
- options);
-If the compilation is successful, one can load the bitcode in a local variable using the bitcode APIs provided by HIPRTC.
-**Following code does:** The snippet queries the size of the generated bitcode with `hiprtcGetBitcodeSize`, allocates a buffer of that size, and copies the bitcode into it with `hiprtcGetBitcode`.
-size_t bitCodeSize;
-hiprtcGetBitcodeSize(prog, &bitCodeSize);
-
-vector<char> kernel_bitcode(bitCodeSize);
-hiprtcGetBitcode(prog, kernel_bitcode.data());
-AMD GPUs consist of an array of workgroup processors, each built with 2 compute units (CUs) capable of executing SIMD32. All the CUs inside a workgroup processor use local data share (LDS).
-gfx10+ supports executing wavefronts in CU mode and in workgroup processor (WGP) mode. Please refer to section 2.3 of the RDNA3 ISA reference.
-gfx9 and below support only CU mode.
-In WGP mode, 4 warps of a block can execute simultaneously on the workgroup processor, whereas in CU mode only 2 warps of a block can execute simultaneously on a CU. In theory, WGP mode might help with occupancy and increase the performance of certain HIP programs (if they are not bound by inter-warp communication), but it might incur a performance penalty on other HIP programs that rely on atomics and inter-warp communication. The mode also affects how the LDS is split between warps; please refer to the RDNA3 ISA reference for more information.
-HIPRTC assumes WGP mode by default for gfx10+. This can be overridden by passing -mcumode to the HIPRTC compile options in hiprtcCompileProgram.
-The bitcode generated using the HIPRTC bitcode APIs can be loaded using the hipModule APIs and can also be linked with other generated bitcode, with appropriate linker flags, using the HIPRTC linker APIs. This gives more flexibility and optimization opportunities to applications that want to generate the binary dynamically according to their needs. The input bitcode can be generated for a single specific architecture, or it can be bundled bitcode generated for multiple architectures.
-First, a HIPRTC link instance (a pending linker invocation) must be created using hiprtcLinkCreate, with the appropriate linker options provided.
-**Following code does:** This code snippet is part of a CUDA kernel function, which is designed to run on a GPU. The kernel, named `sum_kernel`, performs a parallel reduction operation. It utilizes shared memory, specifically an array named `workspace`, to efficiently compute the sum of elements from an input array. The `reduce_sum` function is likely responsible for aggregating the values stored in `workspace` and `input`, ultimately producing a single output value that represents the sum of the input elements. This approach leverages the parallel processing capabilities of the GPU to accelerate the summation process.
-<_C++_>
-Next, the bitcode data can be added to this link instance via hiprtcLinkAddData (if the data is present as a string) or hiprtcLinkAddFile (if the data is present as a file), with the input type that matches the data or bitcode used.
-**Following code does:** The snippet adds in-memory bitcode to the link state with `hiprtcLinkAddData`, passing no per-input options.
-
-hiprtcLinkAddData(rtc_link_state, // HIPRTC link state
- input_type, // type of the input data or bitcode
- bit_code_ptr, // input data which is null terminated
- bit_code_size, // size of the input data
- "a", // optional name for this input
- 0, // size of the options
- 0, // Array of options applied to this input
- 0); // Array of option values cast to void*
-**Following code does:** The snippet adds a bitcode file to the link state with `hiprtcLinkAddFile`, passing no per-input options.
-hiprtcLinkAddFile(rtc_link_state, // HIPRTC link state
- input_type, // type of the input data or bitcode
- bc_file_path.c_str(), // path to the input file where bitcode is present
- 0, // size of the options
- 0, // Array of options applied to this input
- 0); // Array of option values cast to void*
-Once the bitcodes for multiple architectures are added to the link instance, the linking of the device code must be completed using hiprtcLinkComplete which generates the final binary.
-**Following code does:** This code snippet is written in C++ using the HIP (Heterogeneous-Compute Interface for Portability) API, which is used for GPU programming. The code checks if a single AMD GPU supports cooperative launch capabilities. It first retrieves the current GPU device ID and then queries the device to determine if it supports cooperative launch features. If the device does not support cooperative groups, it outputs a message indicating this and exits the program. This check is important for ensuring that the GPU can handle tasks that require cooperation between multiple threads or blocks.
-<_C_>
-If the hiprtcLinkComplete returns successfully, the generated binary can be loaded and run using the hipModule* APIs.
-**Following code does:** This code snippet is designed to check the capability of multiple GPUs to support cooperative group launches, specifically in a multi-GPU environment. It iterates over available devices, checking each one for support of the `hipDeviceAttributeCooperativeMultiDeviceLaunch` attribute, which indicates whether the device can participate in cooperative multi-device launches. If a device supports this feature, its ID is added to a list of valid device IDs. If not, a message is printed indicating that the device does not support cooperative groups. This setup is typically used in scenarios where tasks need to be distributed across multiple GPUs that can work together cooperatively.
-**Following code does:** The snippet loads the linked binary into a module with `hipModuleLoadData`.
-hipModuleLoadData(&module, binary);
-HIPRTC provides the hiprtcJITInputType enumeration, which defines the input types accepted by the linker APIs. Currently, only the input types HIPRTC_JIT_INPUT_LLVM_BITCODE, HIPRTC_JIT_INPUT_LLVM_BUNDLED_BITCODE and HIPRTC_JIT_INPUT_LLVM_ARCHIVES_OF_BUNDLED_BITCODE are supported.
-HIPRTC_JIT_INPUT_LLVM_BITCODE can be used to load either LLVM bitcode or LLVM IR assembly code. However, HIPRTC_JIT_INPUT_LLVM_BUNDLED_BITCODE and HIPRTC_JIT_INPUT_LLVM_ARCHIVES_OF_BUNDLED_BITCODE accept only bundled bitcode and archives of bundled bitcode, respectively.
-**Following code does:** The code snippet `<_C_>` appears to be a placeholder or a fragment that does not represent any valid or complete Python code. It might be intended as a marker or a template for something to be filled in later. Without additional context or surrounding code, it does not perform any specific function or operation.
-<_Cuda_>
-For HIP applications utilizing HIPRTC to compile LLVM bitcode/IR, compatibility is assured only when the ROCm or HIP SDK version used for generating the LLVM bitcode/IR matches the version used during the runtime compilation. When an application requires the ingestion of bitcode/IR not derived from the currently installed AMD compiler, it must run with HIPRTC and comgr dynamic libraries that are compatible with the version of the bitcode/IR.
-comgr, a shared library, incorporates the LLVM/Clang compiler that HIPRTC relies on. To identify the bitcode/IR version that comgr is compatible with, one can execute 'clang -v' using the clang binary from the same ROCm or HIP SDK package. For instance, if compiling bitcode/IR version 14, the HIPRTC and comgr libraries released by AMD around mid 2022 would be the best choice, assuming the LLVM/Clang version included in the package is also version 14.
-To ensure smooth operation and compatibility, an application may choose to ship the specific versions of HIPRTC and comgr dynamic libraries, or it may opt to clearly specify the version requirements and dependencies. This approach guarantees that the application can correctly compile the specified version of bitcode/IR.
-Example:
-**Following code does:** The snippet defines the LLVM options to apply when lowering IR to ISA (`isaopts`), the JIT option keys used to pass them to the linker (`jit_options`), and the number of option strings (`isaoptssize`).
-const char* isaopts[] = {"-mllvm", "-inline-threshold=1", "-mllvm", "-inlinehint-threshold=1"};
-std::vector<hiprtcJIT_option> jit_options = {HIPRTC_JIT_IR_TO_ISA_OPT_EXT,
- HIPRTC_JIT_IR_TO_ISA_OPT_COUNT_EXT};
-size_t isaoptssize = 4;
-**Following code does:** The snippet packs the option values into `lopts` and creates the link state with `hiprtcLinkCreate`, passing the two JIT options defined above.
- const void* lopts[] = {(void*)isaopts, (void*)(isaoptssize)};
- hiprtcLinkState linkstate;
- hiprtcLinkCreate(2, jit_options.data(), (void**)lopts, &linkstate);
-HIPRTC defines the hiprtcResult enumeration type and a function hiprtcGetErrorString for API call error handling. hiprtcResult enum defines the API result codes. HIPRTC APIs return hiprtcResult to indicate the call result. hiprtcGetErrorString function returns a string describing the given hiprtcResult code, e.g., HIPRTC_SUCCESS to 'HIPRTC_SUCCESS'. For unrecognized enumeration values, it returns 'Invalid HIPRTC error code'.
-hiprtcResult enum supported values and the hiprtcGetErrorString usage are mentioned below.
-**Following code does:** The code snippet `<_Cython_>` appears to be a placeholder or a reference to Cython, which is a programming language that makes writing C extensions for Python as easy as Python itself. Cython is used to improve the performance of Python code by compiling it into C, which can then be executed more efficiently. However, the snippet itself does not contain any executable code or functionality; it simply seems to indicate the use of Cython in some context. Without additional context or code, it does not perform any specific action.
-<_Python_>
-**Following code does:** The snippet compiles `prog` with `hiprtcCompileProgram` and, if the call fails, prints the message returned by `hiprtcGetErrorString`.
-hiprtcResult result;
-result = hiprtcCompileProgram(prog, 1, opts);
-if (result != HIPRTC_SUCCESS) {
-  std::cout << "hiprtcCompileProgram fails with error " << hiprtcGetErrorString(result);
-}
-HIPRTC provides the following API for querying the version.
-hiprtcVersion(int* major, int* minor) - This sets the output parameters major and minor with the HIP Runtime compilation major version and minor version number respectively.
-Currently, it returns a hardcoded value. Future releases should return the actual HIP runtime major and minor versions.
-HIPRTC mangles the __global__ function names and names of __device__ and __constant__ variables. If the generated binary is being loaded using the HIP Runtime API, the kernel function or __device__/__constant__ variable must be looked up by name, but this is very hard when the name has been mangled. To overcome this, HIPRTC provides API functions that map __global__ function or __device__/__constant__ variable names in the source to the mangled names present in the generated binary.
-The two APIs hiprtcAddNameExpression and hiprtcGetLoweredName provide this functionality. First, a 'name expression' string denoting the address for the __global__ function or __device__/__constant__ variable is provided to hiprtcAddNameExpression . Then, the program is compiled with hiprtcCompileProgram . During compilation, HIPRTC will parse the name expression string as a C++ constant expression at the end of the user program. Finally, the function hiprtcGetLoweredName is called with the original name expression and it returns a pointer to the lowered name. The lowered name can be used to refer to the kernel or variable in the HIP Runtime API.
-A kernel source containing definitions of various __global__ functions/function templates and __device__/__constant__ variables can be stored in a string.
-**Following code does:** This code snippet appears to be part of a larger codebase, likely written in a language that supports object-oriented programming, such as C++ or a similar language. The line of code is performing the following high-level operation: - -- It declares a variable named `multi_grid` of type `multi_grid_group`. -- It initializes this variable by calling a function or method named `this_multi_grid()`, which presumably returns an object or value of type `multi_grid_group`. - -The purpose of this line is to create an instance of `multi_grid_group` and assign it to the variable `multi_grid`, using the result from `this_multi_grid()`. This suggests that `this_multi_grid()` is a function that provides or constructs a `multi_grid_group` object, possibly representing a grid or a collection of grids in a multi-grid system.
-
-
-
- static constexpr const char gpu_program[] {
-hiprtcAddNameExpression is called with various name expressions referring to the address of __global__ functions and __device__/__constant__ variables.
-**Following code does:** The snippet collects name expressions for the kernels in `kernel_name_vec` and registers each one with `hiprtcAddNameExpression`; the variables in `variable_name_vec` are handled the same way.
-kernel_name_vec.push_back("&f1");
-kernel_name_vec.push_back("N1::N2::f2");
-kernel_name_vec.push_back("f3");
-for (auto&& x : kernel_name_vec) hiprtcAddNameExpression(prog, x.c_str());
-variable_name_vec.push_back("&V1");
-**Following code does:** This code is a C++ program that uses the HIP (Heterogeneous-Compute Interface for Portability) runtime API to check if the current GPU device supports concurrent managed memory access. It retrieves the current GPU device, checks the `hipDeviceAttributeConcurrentManagedAccess` attribute, and then prints a message to the console indicating whether concurrent managed memory access is supported by the device.
-variable_name_vec.push_back("&N1::N2::V2");
-for (auto&& x : variable_name_vec) hiprtcAddNameExpression(prog, x.c_str());
-After that, the program is compiled using hiprtcCompileProgram and the generated binary is loaded using hipModuleLoadData. The mangled names can then be fetched using hiprtcGetLoweredName.
-**Following code does:** The snippet looks up the lowered (mangled) name of each registered variable name expression with `hiprtcGetLoweredName`.
-for (decltype(variable_name_vec.size()) i = 0; i != variable_name_vec.size(); ++i) {
- const char* name;
- hiprtcGetLoweredName(prog, variable_name_vec[i].c_str(), &name);
-}
-**Following code does:** The snippet looks up the lowered (mangled) name of each registered kernel name expression with `hiprtcGetLoweredName`.
- for (decltype(kernel_name_vec.size()) i = 0; i != kernel_name_vec.size(); ++i) {
- const char* name;
- hiprtcGetLoweredName(prog, kernel_name_vec[i].c_str(), &name);
- }
-The mangled name of the variables are used to look up the variable in the module and update its value.
-**Following code does:** The snippet resolves the lowered name to a device address and size with `hipModuleGetGlobal`, then writes `initial_value` to that address with `hipMemcpyHtoD`.
- hipDeviceptr_t variable_addr;
- size_t bytes{};
- hipModuleGetGlobal(&variable_addr, &bytes, module, name);
- hipMemcpyHtoD(variable_addr, &initial_value, sizeof(initial_value));
-Finally, the mangled name of the kernel is used to launch it using the hipModule APIs.
-**Following code does:** The snippet fetches the kernel from the module by its lowered name with `hipModuleGetFunction` and launches it with one block of one thread via `hipModuleLaunchKernel`.
- hipFunction_t kernel;
- hipModuleGetFunction(&kernel, module, name);
- hipModuleLaunchKernel(kernel, 1, 1, 1, 1, 1, 1, 0, nullptr, nullptr, config);
-Please have a look at hiprtcGetLoweredName.cpp for the detailed example.
-HIPRTC follows the below versioning.
-The AMD HIP Performance Guidelines are a set of best practices designed to help developers optimize the performance of AMD GPUs. They cover established parallelization and optimization techniques, coding metaphors, and idioms that can greatly simplify programming for HIP-capable GPU architectures.
-By following four main cornerstones, we can exploit the performance optimization potential of HIP.
-In the following chapters, we will show you their benefits and how to use them effectively.
-For optimal use, the application should expose as much parallelism as possible, and express it efficiently, to keep all system components active.
-The application should optimize parallel execution across the host and devices using asynchronous calls and streams. Workloads should be assigned based on efficiency: serial to the host, parallel to the devices.
-For parallel workloads, when threads need to synchronize to share data, if they belong to the same block, they should use __syncthreads() (see: Synchronization functions ) within the same kernel invocation. If they belong to different blocks, they must use global memory with two separate kernel invocations. The latter should be minimized as it adds overhead.
-Device-level optimization primarily involves maximizing parallel execution across the multiprocessors of the device. This can be achieved by executing multiple kernels concurrently on a device. The management of these kernels is facilitated by streams, which allow for the overlapping of computation and data transfers, enhancing performance. The aim is to keep all multiprocessors busy by executing enough kernels concurrently. However, launching too many kernels can lead to resource contention, so a balance must be found for optimal performance. This approach helps in achieving maximum utilization of the resources of the device.
-Multiprocessor-level optimization involves maximizing parallel execution within each multiprocessor on a device. Each multiprocessor can execute a number of threads concurrently, and the total number of threads that can run in parallel is determined by the number of concurrent threads each multiprocessor can handle.
-The key to multiprocessor-level optimization is to efficiently utilize the various functional units within a multiprocessor. This can be achieved by ensuring a sufficient number of resident warps, because at every instruction issue time a warp scheduler selects an instruction that is ready to execute. This instruction can be another independent instruction of the same warp, exploiting instruction-level parallelism (see: Optimization for maximum instruction throughput), or more commonly an instruction of another warp, exploiting thread-level parallelism.
-In comparison, device-level optimization focuses on the device as a whole, aiming to keep all multiprocessors busy by executing enough kernels concurrently. Both levels of optimization are crucial for achieving maximum performance. They work together to ensure efficient utilization of the resources of the GPU, from the individual multiprocessors to the device as a whole.
-The first step in maximizing memory throughput is to minimize low-bandwidth data transfers. This involves reducing data transfers between the host and the device, as these have lower bandwidth than transfers between global memory and the device.
-Additionally, data transfers between global memory and the device should be minimized by maximizing the use of on-chip memory: shared memory and caches. Shared memory acts as a user-managed cache, where the application explicitly allocates and accesses it. A common programming pattern is to stage data from device memory into shared memory. This involves each thread of a block loading data from device memory to shared memory, synchronizing with all other threads of the block, processing the data in shared memory, synchronizing again if necessary, and writing the results back to device global memory.
-For some applications, a traditional hardware-managed cache is more appropriate to exploit data locality. On devices of certain compute capabilities, the same on-chip memory is used for both L1 and shared memory, and the amount dedicated to each is configurable for each kernel call.
-Finally, the throughput of memory accesses by a kernel can vary significantly depending on the access pattern for each type of memory. Therefore, the next step in maximizing memory throughput is to organize memory accesses as optimally as possible. This is especially important for global memory accesses, as global memory bandwidth is low compared to available on-chip bandwidths and arithmetic instruction throughput. Thus, non-optimal global memory accesses generally have a high impact on performance.
-Applications should aim to minimize data transfers between the host and the device. This can be achieved by moving more computations from the host to the device, even if it means running kernels that do not fully utilize the parallelism of the device. Intermediate data structures can be created, used, and discarded in device memory without being mapped or copied to host memory.
-Batching small transfers into a single large transfer can improve performance due to the overhead associated with each transfer. On systems with a front-side bus, using page-locked host memory can enhance data transfer performance.
-When using mapped page-locked memory, there is no need to allocate device memory or explicitly copy data between device and host memory. Data transfers occur implicitly each time the kernel accesses the mapped memory. For optimal performance, these memory accesses should be coalesced, similar to global memory accesses.
-On integrated systems where device and host memory are physically the same, any copy operation between host and device memory is unnecessary, and mapped page-locked memory should be used instead. Applications can check if a device is integrated by querying the integrated device property.
-Memory access instructions may be repeated when the memory addresses accessed by the threads of a warp are spread out. How much this affects throughput depends on the memory type; throughput generally drops as addresses become more scattered, especially for global memory.
-Device memory is accessed via 32-, 64-, or 128-byte transactions that must be naturally aligned. Maximizing memory throughput involves coalescing memory accesses of threads within a warp into minimal transactions, following optimal access patterns, using properly sized and aligned data types, and padding data when necessary.
-Global memory instructions support reading or writing data of specific sizes (1, 2, 4, 8, or 16 bytes) that are naturally aligned. If the size and alignment requirements are not met, it leads to multiple instructions, reducing performance. Therefore, using data types that meet these requirements, ensuring alignment for structures, and maintaining alignment for all values or arrays is crucial for correct results and optimal performance.
-Threads often access 2D arrays at an address calculated as BaseAddress + xIndex + width * yIndex . For efficient memory access, the array and thread block widths should be multiples of the warp size. If the array width is not a multiple of the warp size, it is usually more efficient to allocate it with a width rounded up to the nearest multiple and pad the rows accordingly.
-Local memory is used for certain automatic variables, such as arrays with non-constant indices, large structures or arrays, and any variable when the kernel uses more registers than available. Local memory resides in device memory, leading to high latency and low bandwidth similar to global memory accesses. However, it is organized for consecutive 32-bit words to be accessed by consecutive thread IDs, allowing full coalescing when all threads in a warp access the same relative address.
-Shared memory, located on-chip, provides higher bandwidth and lower latency than local or global memory. It is divided into banks that can be simultaneously accessed, boosting bandwidth. However, bank conflicts, where two addresses fall in the same bank, lead to serialized access and decreased throughput. Therefore, understanding how memory addresses map to banks and scheduling requests to minimize conflicts is crucial for optimal performance.
-Constant memory is in device memory and cached in the constant cache. Requests are split based on different memory addresses, affecting throughput, and are serviced at the throughput of the constant cache for cache hits, or the throughput of the device memory otherwise.
-Texture and surface memory are stored in device memory and cached in texture cache. This setup optimizes 2D spatial locality, leading to better performance for threads reading close 2D addresses. Reading device memory through texture or surface fetching can be advantageous, offering higher bandwidth for local texture fetches or surface reads, offloading addressing calculations, allowing data broadcasting, and optional conversion of 8-bit and 16-bit integer input data to 32-bit floating-point values on-the-fly.
-To maximize instruction throughput:
-The type and complexity of arithmetic operations can significantly impact the performance of your application. The following hints help maximize instruction throughput.
-Using efficient operations: Some arithmetic operations are more costly than others. For example, multiplication is typically faster than division, and integer operations are usually faster than floating-point operations, especially with double-precision.
-Minimizing low-throughput instructions: This might involve trading precision for speed when it does not affect the final result. For instance, consider using single-precision arithmetic instead of double-precision.
-Leverage intrinsic functions: Intrinsic functions are pre-defined functions available in HIP that can often be executed faster than equivalent arithmetic operations (subject to some input or accuracy restrictions). They can help optimize performance by replacing more complex arithmetic operations.
-Avoiding divergent warps: Divergent warps occur when threads within the same warp follow different execution paths. This can happen due to conditional statements that lead to different arithmetic operations being performed by different threads. Divergent warps can significantly reduce instruction throughput, so try to structure your code to minimize divergence.
-Optimizing memory access: The efficiency of memory access can impact the speed of arithmetic operations. Coalesced memory access, where threads in a warp access consecutive memory locations, can improve memory throughput and thus the speed of arithmetic operations.
-Maximizing instruction parallelism: Some GPU architectures can issue independent instructions simultaneously, for example an integer and a floating-point operation, or two operations with independent inputs and outputs. This is mostly the compiler's job, but expressing parallelism explicitly in the code can improve instruction throughput.
-Flow control instructions (if, else, for, do, while, break, continue, switch) can impact instruction throughput by causing threads within a warp to diverge and follow different execution paths. To optimize performance, control conditions should be written to minimize divergent warps. For example, when the control condition depends on (threadIdx / warpSize), no warp diverges. The compiler may optimize loops or short if or switch blocks using branch predication, preventing warp divergence. With branch predication, instructions associated with a false predicate are scheduled but not executed, avoiding unnecessary operations.
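The effect of the control condition on divergence can be checked with a small host-side simulation. This is only a sketch under stated assumptions: `count_divergent_warps`, `uniform_pred`, and `divergent_pred` are hypothetical names, the warp size of 64 is assumed, and the code models the branch predicate per thread index rather than running on a GPU.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kWarpSize = 64;  // assumed warp size for this sketch

// Branch condition that depends on (tid / warpSize): identical for every lane of a warp.
inline bool uniform_pred(std::size_t tid) { return (tid / kWarpSize) % 2 == 0; }

// Branch condition that depends on the lane parity: neighbouring lanes disagree.
inline bool divergent_pred(std::size_t tid) { return tid % 2 == 0; }

// Count the warps of a block in which at least two lanes evaluate the predicate differently.
template <typename Pred>
std::size_t count_divergent_warps(std::size_t block_size, std::size_t warp_size, Pred pred) {
    assert(warp_size > 0);
    std::size_t divergent = 0;
    for (std::size_t base = 0; base < block_size; base += warp_size) {
        const bool first = pred(base);
        for (std::size_t t = base + 1; t < base + warp_size && t < block_size; ++t) {
            if (pred(t) != first) {  // one disagreeing lane is enough
                ++divergent;
                break;
            }
        }
    }
    return divergent;
}
```

For a 256-thread block, `count_divergent_warps(256, kWarpSize, uniform_pred)` yields 0, while `divergent_pred` makes all four warps diverge.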
-Synchronization ensures that all threads within a block have completed their computations and memory accesses before moving forward, which is critical when threads are dependent on the results of other threads. However, synchronization can also lead to performance overhead, as it requires threads to wait, potentially leading to idle GPU resources.
-__syncthreads() is used to synchronize all threads in a block, ensuring that all threads have reached the same point in the code and that shared memory is visible to all threads after the point of synchronization.
-An alternative way to synchronize is using streams. Different streams can execute commands out of order with respect to one another or concurrently. This allows for more fine-grained control over the execution order of commands, which can be beneficial in certain scenarios.
-Applications frequently allocating and freeing memory may experience slower allocation calls over time. This is expected as memory is released back to the operating system. To optimize performance in such scenarios, consider some recommendations:
-AMD debugging tools include ltrace and ROCgdb. External tools are available and can be found online. For example, if you're using Windows, you can use Microsoft Visual Studio and WinGDB.
-You can trace and debug your code using the following tools and techniques.
-You can use tracing to quickly observe the flow of an application before reviewing the detailed information provided by a command-line debugger. Tracing can be used to identify issues ranging from accidental API calls to calls made on a critical path.
-ltrace is a standard Linux tool that provides a message to stderr on every dynamic library call. You can use ltrace to visualize the runtime behavior of the entire ROCm software stack.
-Here's a simple command-line example that uses ltrace to trace HIP APIs and output:
-**Following code does:** This code snippet is part of a larger program that uses HIP, a C++ runtime API for GPU programming. The snippet performs two main tasks: - -1. It frees up memory that was previously allocated on the GPU for three variables (`d_a`, `d_b`, and `d_c`) using the `hipFree` function. This is a cleanup operation to release resources and prevent memory leaks. - -2. It prints the result of an addition operation to the console, displaying the values of `a`, `b`, and their sum `c` in the format "a + b = c". - -Finally, the function returns 0, indicating successful execution.
-Here's another example that uses ltrace to trace hsa APIs and output:
- $ ltrace -C -e "hsa*" ./hipGetChanDesc
- libamdhip64.so.4->hsa_init(0, 0x7fff325a69d0, 0x9c80e0, 0
- libhsa-runtime64.so.1->hsaKmtOpenKFD(0x7fff325a6590, 0x9c38c0, 0, 1) = 0
- libhsa-runtime64.so.1->hsaKmtGetVersion(0x7fff325a6608, 0, 0, 0) = 0
- libhsa-runtime64.so.1->hsaKmtReleaseSystemProperties(3, 0x80084b01, 0, 0) = 0
- libhsa-runtime64.so.1->hsaKmtAcquireSystemProperties(0x7fff325a6610, 0, 0, 1) = 0
- libhsa-runtime64.so.1->hsaKmtGetNodeProperties(0, 0x7fff325a66a0, 0, 0) = 0
- libhsa-runtime64.so.1->hsaKmtGetNodeMemoryProperties(0, 1, 0x9c42b0, 0x936012) = 0
- ...
- <... hsaKmtCreateEvent resumed> )
- libhsa-runtime64.so.1->hsaKmtAllocMemory(0, 4096, 64, 0x7fff325a6690) = 0
- libhsa-runtime64.so.1->hsaKmtMapMemoryToGPUNodes(0x7f1202749000, 4096, 0x7fff325a6690,,...
- --0) = 0
- libhsa-runtime64.so.1->hsaKmtCreateEvent(0x7fff325a6700, 0, 0, 0x7fff325a66f0) = 0
-**Following code does:** This code snippet demonstrates a simple GPU-accelerated program using HIP (Heterogeneous-Compute Interface for Portability) to perform the addition of two integers. It allocates managed memory for three integers (`a`, `b`, and `c`) that can be accessed by both the CPU and GPU. The program sets memory advice to optimize data placement and access patterns, indicating that the memory should be preferred for CPU access and read mostly from the GPU. It initializes the values of `a` and `b`, launches a GPU kernel to compute their sum, synchronizes the device to ensure computation is complete, and then prints the result. Finally, it frees the allocated memory. The use of `hipMemAdvise` helps optimize performance by guiding the runtime on how to handle memory access efficiently.
-HIP Documentation, Release 6.1.40092
-You can use ROCgdb for debugging and profiling.
-ROCgdb is the ROCm source-level debugger for Linux and is based on the GNU Project debugger (GDB). Like CUDA-GDB, it can be used with debugger frontends such as Eclipse, Visual Studio Code, or GDB dashboard. For details, see https://github.com/ROCm/ROCgdb.
-Below is a sample of how to use ROCgdb to run and debug a HIP application. ROCgdb is installed with the ROCm package in the folder /opt/rocm/bin.
-**Following code does:** This code snippet is part of a larger program that uses the HIP (Heterogeneous-Compute Interface for Portability) API, which is designed for GPU programming. The specific lines shown here are responsible for freeing or deallocating memory that was previously allocated on the GPU for the variables `b` and `c`. This is done using the `hipFree` function, which releases the memory resources associated with these variables. The `return 0;` statement indicates that the program is terminating successfully. Overall, this snippet is part of the cleanup process in a GPU-accelerated application.
-The following Linux example shows how to get useful information from the debugger while running a simple memory copy test, which caused a segmentation fault issue.
-**Following code does:** This code snippet demonstrates a simple example of using HIP (Heterogeneous-Compute Interface for Portability) to perform a basic addition operation on a GPU. It allocates managed memory for three integers (`a`, `b`, and `c`) that can be accessed by both the host (CPU) and the device (GPU). The `add` kernel function is launched on the GPU to compute the sum of `a` and `b`, storing the result in `c`. After synchronizing the device to ensure the computation is complete, the code queries a memory range attribute (`hipMemRangeAttributeReadMostly`) for the memory range pointed to by `a` and stores the result in `attributeValue`. Finally, it prints the result of the addition. The code illustrates basic memory management and kernel execution in a HIP environment.
- http://tests/src/runtimeApi/memory/hipMemcpy_simple.cpp:147
-**Following code does:** This code snippet is part of a C++ program that outputs the value of a queried attribute to the console. After displaying the attribute value, it performs cleanup by freeing memory that was previously allocated for three variables (`a`, `b`, and `c`) using the `hipFree` function, which is typically used in programs that utilize HIP (Heterogeneous-Compute Interface for Portability) for managing GPU memory. Finally, the program returns 0, indicating successful execution.
-Debugging HIP applications using Windows tools can be more informative than on Linux. Windows tools provide more visibility into the code being debugged, which makes it easier to inspect variables, watch multiple details, and examine call stacks.
-HIP provides environment variables that allow HIP, hip-clang, or HSA drivers to disable certain features and optimizations. These are not intended for production, but can be useful to diagnose synchronization problems in the application (or driver).
-Some of the more widely used environment variables are described in this section. These are supported on the Linux ROCm path and Windows.
-You can control kernel command serialization from the host:
-AMD_SERIALIZE_KERNEL = 1: Wait for completion before enqueue
-AMD_SERIALIZE_KERNEL = 2: Wait for completion after enqueue
-AMD_SERIALIZE_KERNEL = 3: Both
-Or
-AMD_SERIALIZE_COPY = 1: Wait for completion before enqueue
-AMD_SERIALIZE_COPY = 2: Wait for completion after enqueue
-AMD_SERIALIZE_COPY = 3: Both
-The HIP runtime can therefore wait for the GPU to become idle before or after any GPU command, depending on the environment setting.
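The three documented values (1, 2, 3) behave like two independent bit flags. The following host-side sketch decodes them on that assumption; `SerializeMode` and `decode_serialize` are hypothetical names used only for illustration and are not part of the HIP runtime.

```cpp
#include <cassert>

// Hypothetical decoded view of an AMD_SERIALIZE_KERNEL / AMD_SERIALIZE_COPY value.
struct SerializeMode {
    bool wait_before;  // wait for completion before enqueue (value 1 or 3)
    bool wait_after;   // wait for completion after enqueue  (value 2 or 3)
};

// Treat the documented values as a two-bit mask: bit 0 = before, bit 1 = after.
// This bit-level reading is an interpretation of the documented values.
constexpr SerializeMode decode_serialize(int value) {
    return SerializeMode{(value & 1) != 0, (value & 2) != 0};
}

// Value 3 ("Both") enables both waits; value 0 (the default) enables neither.
static_assert(decode_serialize(3).wait_before && decode_serialize(3).wait_after, "both");
static_assert(!decode_serialize(0).wait_before && !decode_serialize(0).wait_after, "none");
```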
-For systems with multiple devices, you can choose to make only certain device(s) visible to HIP using HIP_VISIBLE_DEVICES (or CUDA_VISIBLE_DEVICES on an NVIDIA platform). Once enabled, HIP can only view devices that have indices present in the sequence. For example:
- $ HIP_VISIBLE_DEVICES=0,1
-if (totalDeviceNum > 2) {
-setenv("HIP_VISIBLE_DEVICES", "0,1,2", 1);
-assert(getDeviceNumber(false) == 3);
-
-.......
-}
-To analyze compiler-related issues, you can dump the code object by setting GPU_DUMP_CODE_OBJECT.
-HSA provides environment variables that help analyze issues in drivers or hardware.
-HSA_ENABLE_INTERRUPT=0 causes completion signals to be detected with memory-based polling, rather than interrupts.
-Here are some of the more commonly used environment variables:
-**Following table contains:** The table lists commonly used HIP environment variables. The "Environment variable" column gives each variable's name with a short description, the "Default value" column gives its default, and the "Usage" column lists the accepted values and their effects.
-| Environment variable | Default value | Usage |
|---|---|---|
| AMD_LOG_LEVEL: Enable HIP log on different levels | 0 | 0: Disable log. 1: Enable log on error level. 2: Enable log on warning and below levels. 0x3: Enable log on information and below levels. 0x4: Decode and display AQL packets. |
| AMD_LOG_MASK: Enable HIP log on different levels | 0x7FFFFFFF | 0x1: Log API calls. 0x02: Kernel and Copy Commands and Barriers. 0x4: Synchronization and waiting for commands to finish. 0x8: Enable log on information and below levels. 0x20: Queue commands and queue contents. 0x40: Signal creation, allocation, pool. 0x80: Locks and thread-safety code. 0x100: Copy debug. 0x200: Detailed copy debug. 0x400: Resource allocation, performance-impacting events. 0x800: Initialization and shutdown. 0x1000: Misc debug, not yet classified. 0x2000: Show raw bytes of AQL packet. 0x4000: Show code creation debug. 0x8000: More detailed command info, including barrier commands. 0x10000: Log message location. 0xFFFFFFFF: Log always even mask flag is zero. |
| HIP_LAUNCH_BLOCKING: Used for serialization on kernel execution | 0 | 0: Disable. Kernel executes normally. 1: Enable. Serializes kernel enqueue, behaves the same as AMD_SERIALIZE_KERNEL. |
| HIP_VISIBLE_DEVICES (or CUDA_VISIBLE_DEVICES): Only devices whose index is present in the sequence are visible to HIP | | 0,1,2: Depending on the number of devices on the system |
| GPU_DUMP_CODE_OBJECT: Dump code object | 0 | 0: Disable. 1: Enable. |
| AMD_SERIALIZE_KERNEL: Serialize kernel enqueue | 0 | 1: Wait for completion before enqueue. 2: Wait for completion after enqueue. 3: Both. |
| AMD_SERIALIZE_COPY: Serialize copies | 0 | 1: Wait for completion before enqueue. 2: Wait for completion after enqueue. 3: Both. |
| HIP_HOST_COHERENT: Coherent memory in hipHostMalloc | 0 | 0: Memory is not coherent between host and GPU. 1: Memory is coherent with host. |
| AMD_DIRECT_DISPATCH: Enable direct kernel dispatch (currently for Linux; under development for Windows) | 1 | 0: Disable. 1: Enable. |
| GPU_MAX_HW_QUEUES: The maximum number of hardware queues allocated per device | 4 | The variable controls how many independent hardware queues HIP runtime can create per process, per device. If an application allocates more HIP streams than this number, then HIP runtime reuses the same hardware queues for the new streams in a round-robin manner. Note that this maximum number does not apply to hardware queues that are created for CU-masked HIP streams, or cooperative queues for HIP Cooperative Groups (single queue per device). |
**Following code does:** This code snippet is configuring memory access permissions for a specific memory region in a HIP (Heterogeneous-Compute Interface for Portability) application, which is used for GPU programming. It sets up a `hipMemAccessDesc` structure to specify that a memory region, identified by `ptr` and of size `padded_size`, should be accessible for both reading and writing on a specific GPU device (`currentDev`). The `hipMemSetAccess` function is then called to apply these access permissions.
- (gdb) set env AMD_SERIALIZE_KERNEL 3
-Note: This gdb command does not use an equals (=) sign.
-HIP provides a logging mechanism that allows you to trace HIP API calls and runtime code when running a HIP application. In addition to being useful to users and developers, these logs are used by the HIP development team to improve the HIP runtime.
-By adjusting the logging settings and logging mask, you can get different types of information for different functionalities, such as HIP APIs, executed kernels, queue commands, and queue contents. Refer to the following sections for examples.
-Tip: Logging works for the release and debug versions of HIP. If you want to save logging output in a file, define the file when running the application via command line. For example:
-**Following code does:** This code snippet is related to memory management in a GPU context using the HIP (Heterogeneous-Compute Interface for Portability) API, which is used for writing portable applications that can run on AMD and NVIDIA GPUs. - -The two functions perform the following high-level tasks: - -1. `hipMemUnmap(ptr, size);`: This function unmaps a previously mapped memory region from the host's address space. The `ptr` is the pointer to the memory region, and `size` specifies the size of the memory to be unmapped. - -2. `hipMemRelease(allocHandle);`: This function releases or deallocates the memory associated with a given allocation handle (`allocHandle`). It effectively frees the memory resources that were previously allocated. - -Overall, the code is responsible for cleaning up GPU memory resources by unmapping and releasing them, which is crucial for preventing memory leaks and ensuring efficient memory usage in GPU applications.
- user@user-test:~/hip/bin$ ./hipinfo > ~/hip_log.txt
-HIP logging is disabled by default. You can enable it via the AMD_LOG_LEVEL environment variable. The value of this variable controls your logging level. Levels are defined as follows:
-**Following code does:** This code snippet appears to be a call to a function named `hipMemAddressFree`, which is likely part of the HIP (Heterogeneous-Compute Interface for Portability) API used for GPU programming. The function is intended to free or release a block of memory that was previously allocated, identified by the pointer `ptr`, and of a specified `size`. This operation is typically used to manage memory resources efficiently in GPU applications. The vertical bar `|` at the beginning seems to be a typographical error or an artifact, as it is not standard syntax in Python or C/C++.
-
- enum LogLevel {
- LOG_NONE = 0,
- LOG_ERROR = 1,
- LOG_WARNING = 2,
- LOG_INFO = 3,
- LOG_DEBUG = 4
- };
-Tip: You can call a logging function with different logging levels. All messages at or below the level set for AMD_LOG_LEVEL are printed.
-The logging mask is designed to print functionality types when you're running a HIP application. Once you set AMD_LOG_LEVEL , the logging mask is set as the default value ( 0x7FFFFFFF ). You can change this to any of the valid values:
-**Following code does:** This code snippet is part of a memory management process using HIP (Heterogeneous-Compute Interface for Portability), which is a framework for GPU programming. The code performs the following high-level tasks: - -1. **Memory Reservation**: It reserves a block of memory starting at a specified address (`ptr + padded_size`) with a size of `new_size - padded_size`. This is done using `hipMemAddressReserve`, which allocates a virtual address space without actually allocating physical memory. - -2. **Memory Mapping**: It maps the reserved virtual memory to a physical memory allocation using `hipMemMap`. This associates the reserved address space (`new_ptr`) with a physical memory allocation identified by `newAllocHandle`. - -3. **Setting Memory Access**: It sets the access permissions for the mapped memory using `hipMemSetAccess`. This specifies how the memory can be accessed (e.g., read, write) based on the provided `accessDesc`. - -Overall, this code is setting up a reserved and mapped memory region with specific access permissions in a GPU context.
-
- enum LogMask {
- LOG_API = 0x00000001, //!< API call
- LOG_CMD = 0x00000002, //!< Kernel and Copy Commands and Barriers
- LOG_WAIT = 0x00000004, //!< Synchronization and waiting for commands to finish
- LOG_AQL = 0x00000008, //!< Decode and display AQL packets
- LOG_QUEUE = 0x00000010, //!< Queue commands and queue contents
- LOG_SIG = 0x00000020, //!< Signal creation, allocation, pool
- LOG_LOCK = 0x00000040, //!< Locks and thread-safety code.
- LOG_KERN = 0x00000080, //!< Kernel creations and arguments, etc.
- LOG_COPY = 0x00000100, //!< Copy debug
- LOG_COPY2 = 0x00000200, //!< Detailed copy debug
- LOG_RESOURCE = 0x00000400, //!< Resource allocation, performance-impacting events.
- LOG_INIT = 0x00000800, //!< Initialization and shutdown
- LOG_MISC = 0x00001000, //!< Misc debug, not yet classified
- LOG_AQL2 = 0x00002000, //!< Show raw bytes of AQL packet
- LOG_CODE = 0x00004000, //!< Show code creation debug
- LOG_CMD2 = 0x00008000, //!< More detailed command info, including barrier commands
- LOG_LOCATION = 0x00010000, //!< Log message location
- LOG_MEM = 0x00020000, //!< Memory allocation
- LOG_MEM_POOL = 0x00040000, //!< Memory pool allocation, including memory in graphs
- LOG_ALWAYS = 0xFFFFFFFF, //!< Log always even mask flag is zero
- };
-You can also define the logging mask via the AMD_LOG_MASK environment variable.
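The interaction between the logging level and the logging mask can be sketched as follows. This is a simplified model, not the runtime's actual implementation: the `sketch` namespace, `should_log`, and `kExampleMask` are hypothetical names, only a few mask bits are reproduced, and the gating rule (level check first, then mask check, with `LOG_ALWAYS` bypassing the mask) is an assumption based on the descriptions above.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {

// Levels as documented for AMD_LOG_LEVEL.
enum LogLevel : int { LOG_NONE = 0, LOG_ERROR = 1, LOG_WARNING = 2, LOG_INFO = 3, LOG_DEBUG = 4 };

// A few of the documented mask bits for AMD_LOG_MASK.
constexpr std::uint32_t LOG_API = 0x00000001;
constexpr std::uint32_t LOG_WAIT = 0x00000004;
constexpr std::uint32_t LOG_COPY = 0x00000100;
constexpr std::uint32_t LOG_ALWAYS = 0xFFFFFFFF;

// Assumed gating rule: a message is printed when its level does not exceed the
// configured level and its category bit is enabled in the configured mask.
// LOG_ALWAYS bypasses the mask check.
constexpr bool should_log(int configured_level, std::uint32_t configured_mask,
                          int msg_level, std::uint32_t msg_flag) {
    return configured_level >= msg_level &&
           (msg_flag == LOG_ALWAYS || (configured_mask & msg_flag) != 0);
}

// Categories are combined with bitwise OR: API calls plus copy debug.
constexpr std::uint32_t kExampleMask = LOG_API | LOG_COPY;
static_assert(kExampleMask == 0x101, "0x1 | 0x100");

}  // namespace sketch
```

Under this model, setting AMD_LOG_LEVEL to 3 and AMD_LOG_MASK to 0x101 would print API and copy messages up to the information level and suppress other categories, such as synchronization messages.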
-You can use the following code to print HIP logging information:
-**Following code does:** This code snippet is a simple C++ program that uses the HIP API, which is a C++ runtime API for GPU programming. The program initializes two `dim3` objects, which are typically used to define the dimensions of a grid or block in GPU programming. The first `dim3` object, `grid1`, is default-initialized, meaning its dimensions are set to zero. The second `dim3` object, `grid2`, is explicitly initialized with dimensions (1, 1, 1). The program then prints the dimensions of both `dim3` objects to the console. The purpose of this code is to demonstrate the initialization and usage of `dim3` objects in HIP.
-In HIP code, call the ClPrint() function with the desired input variables. For example:
-**Following code does:** This code snippet compiles and runs a C++ program that likely involves GPU programming using the HIP (Heterogeneous-Compute Interface for Portability) API, which is used for writing portable code that can run on both AMD and NVIDIA GPUs. - -1. The first line is a shell command that compiles a C++ source file named `test3.hip.cpp` using the `gcc` compiler. The `-x c++` flag specifies that the input file is a C++ source file. The `$(hipconfig --cpp_config)` part dynamically inserts the necessary compiler flags for HIP, which are obtained by running the `hipconfig --cpp_config` command. The output of the compilation is an executable named `test`. - -2. The second line executes the compiled program `./test`. The output suggests that the program is demonstrating the initialization and use of `dim3` objects, which are typically used in GPU programming to specify the dimensions of grid and block structures for kernel launches. The output shows two `dim3` objects, `grid1` and `grid2`, both initialized with dimensions `x=1, y=1, z=1`. This indicates that the program is likely testing or demonstrating basic grid configuration for GPU kernels.
- ClPrint(amd::LOG_INFO, amd::LOG_INIT, "Initializing HSA stack.");
-On Linux, you can enable HIP logging and retrieve logging information when you run hipinfo.
-**Following code does:** This code snippet appears to be a comment in Python that is referencing a C++ style of initialization. The comment suggests that there is a variable `grid` of type `dim3` being initialized with the values `{1,1,1}`. In C++, `dim3` is typically used in CUDA programming to specify the dimensions of a grid or block for parallel execution on a GPU. The comment is likely indicating that the `grid` variable is being set to a 1x1x1 configuration, which means it is a single block with a single thread in each dimension (x, y, z). However, since this is a comment in Python, it does not execute any code but rather provides information or context about how a similar operation might be performed in C++.
-On Windows, you can set AMD_LOG_LEVEL via environment variable from the advanced system settings or the command prompt (when run as administrator). The following example shows debug log information when calling the backend runtime.
-**Following code does:** The code snippet appears to be a shell command that sets an environment variable named `HIP_PLATFORM` to the value `amd`. This is typically used in a shell or terminal to configure the environment for software that relies on the HIP (Heterogeneous-Compute Interface for Portability) platform, indicating that the target platform for HIP operations should be AMD hardware.
- --copyBuffer
-...
-:4:C:\constructicon\builds\gfx\two\22.40\drivers\compute\vdi\device\pal\palmemory.
- --cpp:206 : 605414523422 us: 29864: [tid:0x9298] Alloc: 100000 bytes,_,
- --ptr[0000003008D0000-0000003009D0000], obj[0000003007D0000-0000003047D0000]
-:4:C:\constructicon\builds\gfx\two\22.40\drivers\compute\vdi\device\pal\palmemory.
- --cpp:206 : 605414523767 us: 29864: [tid:0x9298] Alloc: 100000 bytes,_,
- --ptr[0000003009D0000-000000300AD0000], obj[0000003007D0000-0000003047D0000]
-:3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_memory.cpp:681 :_,
- --605414524092 us: 29864: [tid:0x9298] hipMemGetInfo: Returned hipSuccess :
-memInfo.total: 12.06 GB
-memInfo.free: 11.93 GB (99%)
-The cooperative groups API is an extension to the HIP programming model that provides developers with a flexible, dynamic grouping mechanism for communicating threads. Cooperative groups let you define your own sets of thread groups, which may fit your use cases better than those defined by the hardware. This lets you specify the level of granularity for thread communication, which can lead to more efficient parallel decompositions.
-The API is accessible in the cooperative_groups namespace after hip_cooperative_groups.h is included. The header contains the following elements:
-The thread hierarchy abstraction of cooperative groups consists of a grid hierarchy and a block hierarchy.
-Fig. 1: Cooperative group thread hierarchy in grids.
-The multi grid is an abstraction of potentially multiple simultaneous launches of the same kernel over multiple devices (Deprecated since 5.0). The grid in cooperative groups is a single dispatch of kernels for execution like the original grid.
-Note: The ability to synchronize over a grid or multi grid requires the kernel to be launched using the specific cooperative groups API.
-The block is the same as the Inherent thread model block entity.
-Note: Explicit warp-level thread handling is absent from the Cooperative groups API. In order to exploit the known hardware SIMD width on which built-in functionality translates to simpler logic, you can use the group partitioning part of the API, such as tiled_partition .
-Fig. 2: Cooperative group thread hierarchy in blocks.
-The cooperative groups API introduces a new level between the thread block and individual threads. The thread-block tile makes it possible to partition the thread block into tiles, while the coalesced group holds the active threads of the parent group. These groups are discussed further in the group types section.
-For details on the memory model, check the memory model description.
-Group types are based on the levels of synchronization and data sharing among threads.
-Represents an intra-block cooperative groups type where the participating threads within the group are the same threads that participated in the currently executing block .
-**Following code does:** Declares the `thread_block` group type and shows how to obtain the group for the currently executing block with `this_thread_block()`.
- class thread_block;
-
- Constructed via:
-
- thread_block g = this_thread_block();
-The group_index(), thread_index(), thread_rank(), size(), cg_type(), is_valid(), sync() and group_dim() member functions are public members of the thread_block class. For further details, check the thread_block references.
-Represents an inter-block cooperative groups type where the group's participating threads span multiple blocks running the same kernel on the same device. Use the cooperative launch API to enable synchronization across the grid group.
-**Following code does:** Declares the `grid_group` type and shows how to obtain the group that spans the whole grid with `this_grid()`.
-class grid_group;
-
- Constructed via:
-
-grid_group g = this_grid();
-The thread_rank(), size(), cg_type(), is_valid() and sync() member functions are public members of the grid_group class. For further details, check the grid_group references.
-Represents an inter-device cooperative groups type where the participating threads within the group span multiple devices that run the same kernel on the devices. Use the cooperative launch API to enable synchronization across the multi-grid group.
-**Following code does:** Declares the `multi_grid_group` type and shows how to obtain the group that spans all participating devices with `this_multi_grid()`.
- class multi_grid_group;
-
- Constructed via:
-
- multi_grid_group g = this_multi_grid();
-The num_grids(), grid_rank(), thread_rank(), size(), cg_type(), is_valid() and sync() member functions are public members of the multi_grid_group class. For further details, check the multi_grid_group references.
-This constructs a templated class derived from thread_group . The template defines the tile size of the new thread group at compile time. This group type also supports sub-wave level intrinsics.
-<_C++_>
-Constructed via:
-<_SQL_>
-The thread_rank(), size(), cg_type(), is_valid(), sync(), meta_group_rank(), meta_group_size(), shfl(), shfl_down(), shfl_up(), shfl_xor(), ballot(), any(), all(), match_any() and match_all() member functions are public members of the thread_block_tile class. For further details, check the thread_block_tile references.
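-The rank bookkeeping behind a tiled partition can be traced on the host. The following is a minimal sketch in plain C++ (not HIP device code), assuming the parent group's threads are split into consecutive tiles of a power-of-two size. `TileCoords` and `tile_coords` are hypothetical names introduced only for this illustration; the fields are named after the member functions listed above.

```cpp
#include <cassert>

// Hypothetical helper type: where one thread of the parent group lands
// after the parent is split into consecutive tiles of `tile_size` threads.
struct TileCoords {
    unsigned thread_rank;      // rank of the thread inside its tile
    unsigned meta_group_rank;  // index of the tile inside the parent group
    unsigned meta_group_size;  // number of tiles in the parent group
};

// Assumes tile_size is a power of two that evenly divides parent_size.
inline TileCoords tile_coords(unsigned parent_rank, unsigned parent_size, unsigned tile_size) {
    assert(tile_size != 0 && (tile_size & (tile_size - 1)) == 0);
    assert(parent_size % tile_size == 0 && parent_rank < parent_size);
    return TileCoords{parent_rank % tile_size,
                      parent_rank / tile_size,
                      parent_size / tile_size};
}
```

-Under these assumptions, thread 37 of a 256-thread block split into tiles of 16 has rank 5 in tile 2 of 16 tiles.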
-Threads in a warp (64 threads on CDNA and 32 threads on RDNA) cannot execute different instructions simultaneously, so conditional branches are executed serially within the warp. When threads encounter a conditional branch, they can diverge, and the threads that do not meet the condition for a branch are disabled while that branch executes. The active threads are referred to as coalesced, and a coalesced group represents the active threads within a warp.
-Note: The NVIDIA GPU's independent thread scheduling presents the appearance that threads on different branches execute concurrently.
-Warning: AMD GPUs do not support independent thread scheduling. Some CUDA applications rely on this feature, and their HIP ports can deadlock on AMD GPUs when they try to use it.
-This group type also supports sub-wave level intrinsics.
-**Following code does:** Declares the `coalesced_group` type and shows how to obtain the group of currently active threads with `coalesced_threads()`.
- class coalesced_group;
-
- Constructed via:
-
- coalesced_group active = coalesced_threads();
-Note: shfl() functions support integer or float type.
-The thread_rank(), size(), cg_type(), is_valid(), sync(), meta_group_rank(), meta_group_size(), shfl(), shfl_down(), shfl_up(), ballot(), any(), all(), match_any() and match_all() member functions are public members of the coalesced_group class. For more information, see coalesced_group references.
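-To make the idea of an active (coalesced) set of lanes concrete, the sketch below models a 64-lane wavefront on the host with a 64-bit mask. It is plain C++, not the HIP device API: `simulate_ballot`, `active_count` and `rank_in_group` are hypothetical helpers, and numbering the ranks inside the group in lane order is an assumption of this model.

```cpp
#include <cassert>
#include <cstdint>

// Host-side model of a 64-lane wavefront: bit i of a mask stands for lane i.
constexpr unsigned kLanes = 64;

// Models a ballot: sets bit i when lane i evaluates the predicate as true.
template <typename Pred>
std::uint64_t simulate_ballot(Pred pred) {
    std::uint64_t mask = 0;
    for (unsigned lane = 0; lane < kLanes; ++lane) {
        if (pred(lane)) mask |= (std::uint64_t{1} << lane);
    }
    return mask;
}

// Number of active lanes, i.e. the size of the modelled coalesced group.
inline unsigned active_count(std::uint64_t mask) {
    unsigned n = 0;
    for (; mask != 0; mask &= mask - 1) ++n;  // clear the lowest set bit
    return n;
}

// Rank of `lane` among the active lanes: count the active lanes below it.
inline unsigned rank_in_group(std::uint64_t mask, unsigned lane) {
    assert(lane < kLanes && ((mask >> lane) & 1u));
    return active_count(mask & ((std::uint64_t{1} << lane) - 1));
}
```

-If only every fourth lane takes a branch, the model yields a group of 16 active lanes, and lane 8 has rank 2 inside that group because lanes 0 and 4 precede it.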
-The reduce_sum device function differs from the original block model as follows.
-**Following code does:** Shows the tail of the cooperative groups version of reduce_sum() (the opening of the listing is missing). Each iteration halves the number of active threads; the loop bound comes from g.size() and the synchronization from g.sync(), both taken from the group object g.
- for(unsigned int i = g.size() / 2; i > 0; i /= 2) {
- // Store value in shared memory with thread ID
- shared[group_thread_id] = val;
-
- // Synchronize all threads in the group
- g.sync();
-
- // Active thread sum up
- if(group_thread_id < i)
- val += shared[group_thread_id + i];
-
- // Synchronize all threads in the group
- g.sync();
- }
-
- //...
-}
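-The halving loop above can be traced on the host. This is a minimal sketch in plain C++ rather than device code: `simulate_reduce_sum` is a hypothetical function in which `values[t]` stands for `val` in thread t, `shared` stands for the shared-memory workspace, and the group size is assumed to be a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model of the reduction loop: `values[t]` plays the role of `val`
// in thread t, and `shared` plays the role of the shared-memory workspace.
// Assumes the group size (values.size()) is a power of two.
inline unsigned int simulate_reduce_sum(std::vector<unsigned int> values) {
    const std::size_t n = values.size();
    assert(n != 0 && (n & (n - 1)) == 0);
    std::vector<unsigned int> shared(n);
    for (std::size_t i = n / 2; i > 0; i /= 2) {
        // Every thread stores its value; the copy models the first g.sync().
        shared = values;
        // Threads with rank below i add their partner's value.
        for (std::size_t t = 0; t < i; ++t) {
            values[t] += shared[t + i];
        }
        // Leaving the inner loop models the second g.sync().
    }
    return values[0];  // thread 0 ends up holding the total
}
```

-With eight inputs the loop runs for i = 4, 2, 1, so the model needs three rounds before the first element holds the sum of all eight values.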
-
-The reduce_sum() function call and the input data initialization differ from the original block model as follows.
-**Following code does:** Shows the original block-model kernel: it declares the workspace array in shared memory and calls reduce_sum() with the workspace and the input value.
-Original Block
-
-__global__ void sum_kernel(...) {
-
- //...
-
- // Workspace array in shared memory
- __shared__ unsigned int workspace[2048];
-
- //...
-
- // Perform reduction
- output = reduce_sum(workspace, input);
-
- //...
-}
-**Following code does:** Shows the end of the cooperative groups version of the kernel (the opening of the listing is missing): it obtains the group for the current block with this_thread_block() and passes it to reduce_sum() as the first argument.
- thread_block thread_block_group = this_thread_block();
- // Perform reduction
- output = reduce_sum(thread_block_group, workspace, input);
-
- //...
-}
-In the device function, the input group type is thread_group, the parent class of all cooperative groups types. This lets you write generic functions that work with any type of cooperative group.
-With each group type, the synchronization requires using the correct cooperative groups launch API.
-Thread-block groups do not need kernel launch validation.
-Confirm the cooperative launch capability on a single AMD GPU:
-
- int device = 0;
- int supports_coop_launch = 0;
- // Check support
- // Use hipDeviceAttributeCooperativeMultiDeviceLaunch when launching across multiple devices
- HIP_CHECK(hipGetDevice(&device));
- HIP_CHECK(
- hipDeviceGetAttribute(&supports_coop_launch, hipDeviceAttributeCooperativeLaunch, device));
- if(!supports_coop_launch)
- {
- std::cout << "Skipping, device " << device << " does not support cooperative groups"
- << std::endl;
- return 0;
- }
-Confirm the cooperative launch capability over multiple GPUs:
-
- // Check support of cooperative groups
- std::vector<int> deviceIDs;
- for(int deviceID = 0; deviceID < device_count; deviceID++) {
- #ifdef __HIP_PLATFORM_AMD__
- int supports_coop_launch = 0;
- HIP_CHECK(
- hipDeviceGetAttribute(
- &supports_coop_launch,
- hipDeviceAttributeCooperativeMultiDeviceLaunch,
- deviceID));
- if(!supports_coop_launch) {
- std::cout << "Skipping, device " << deviceID << " does not support cooperative groups"
- << std::endl;
- }
- else
- #endif
- {
- std::cout << deviceID << std::endl;
- // Collect valid deviceIDs.
- deviceIDs.push_back(deviceID);
- }
- }
-
- Kernel launch
-
-You can access the new block representation using the original kernel launch methods.
-
-
-
- // Launching kernel from host.
-Launch the cooperative kernel on a single GPU:
-<_C_>
-Launch the cooperative kernel over multiple GPUs:
-
- hipLaunchParams *launchParamsList = (hipLaunchParams*)malloc(sizeof(hipLaunchParams) * deviceIDs.size());
- for(int deviceID : deviceIDs) {
-
- // Set device
- HIP_CHECK(hipSetDevice(deviceID));
-
- // Create stream
- hipStream_t stream;
- HIP_CHECK(hipStreamCreate(&stream));
-
- // Parameters
- void* params[] = {&(d_vector[deviceID]), &(d_block_reduced[deviceID]), &(d_partition_reduced[deviceID])};
-
- // Set launchParams
- launchParamsList[deviceID].func = (void*)vector_reduce_kernel;
- launchParamsList[deviceID].gridDim = dim3(1);
- launchParamsList[deviceID].blockDim = dim3(threads_per_block);
- launchParamsList[deviceID].sharedMem = 0;
- launchParamsList[deviceID].stream = stream;
- launchParamsList[deviceID].args = params;
- }
-
- HIP_CHECK(hipLaunchCooperativeKernelMultiDevice(launchParamsList,
- (int)deviceIDs.size(),
- hipCooperativeLaunchMultiDeviceNoPreSync));
-
-Device side synchronization
-The device side code of the thread_block synchronization on a single GPU:
- thread_block g = this_thread_block();
- g.sync();
-The device side code of the grid synchronization on a single GPU:
- grid_group grid = this_grid();
- grid.sync();
-The device side code of the multi-grid synchronization over multiple GPUs:
- multi_grid_group multi_grid = this_multi_grid();
- multi_grid.sync();
-HIP doesn't support the following NVIDIA CUDA optional headers:
-HIP doesn't support the following CUDA class in cooperative_groups namespace:
-HIP doesn't support the following CUDA functions/operators in cooperative_groups namespace:
-In conventional architectures, CPUs and GPUs have dedicated memory like Random Access Memory (RAM) and Video Random Access Memory (VRAM). This architectural design, while effective, can be limiting in terms of memory capacity and bandwidth, as continuous memory copying is required to allow the processors to access the appropriate data. New architectural features like Heterogeneous System Architectures (HSA) and Unified Memory (UM) help avoid these limitations and promise increased efficiency and innovation.
-Unified Memory is a single memory address space accessible from any processor within a system. This setup simplifies memory management processes and enables applications to allocate data that can be read or written by code running on either CPUs or GPUs. The Unified memory model is shown in the following figure.
-The AMD Accelerated Processing Unit (APU) is a typical example of a Unified Memory Architecture. On a single die, a central processing unit (CPU) is combined with an integrated graphics processing unit (iGPU), and both have access to a high-bandwidth memory (HBM) module named Unified Memory. The CPU enables high-performance, low-latency operations, while the GPU is optimized for high throughput (data processed per unit time).
-Unified memory is supported on Linux by all modern AMD GPUs from the Vega series onward. Unified memory management can be achieved with managed memory allocation and, for the latest GPUs, with a system allocator.
-The table below lists the supported allocators. The allocators are described in the next section.
-**Following table contains:** The allocators supported on each GPU architecture. Each row is an architecture family and each column is an allocation method; ✓ marks a supported combination and ✗ an unsupported one.
-| Architecture | hipMallocManaged() | __managed__ | malloc() |
|---|---|---|---|
| MI200, MI300 Series | ✓ | ✓ | ✓ 1 |
| MI100 | ✓ | ✓ | ✗ |
| RDNA (Navi) Series | ✓ | ✓ | ✗ |
| GCN5 (Vega) Series | ✓ | ✓ | ✗ |
1 Works only with XNACK=1. The first GPU access causes a recoverable page fault. For more details, visit GPU memory.
-The availability of the unified memory programming models shown here depends on your architecture. For more information, see System requirements and Checking unified memory management support.
-hipMallocManaged() is a dynamic memory allocator available on all GPUs with unified memory support. For more details, visit HIP managed memory allocation API.
-The __managed__ declaration specifier, which serves as its counterpart, is supported on all modern AMD cards and can be used for static allocation.
-Starting with the AMD MI300 series, the malloc() system allocator allows you to reserve unified memory. The system allocator is more versatile and offers an easy transition from C++ code written for the CPU to HIP code, as the same system allocation API is used.
-Some device attributes can offer information about which Unified memory programming models are supported. The attribute value is 1 if the functionality is supported, and 0 if it is not supported.
-**Following table contains:** Device attributes that report which unified memory programming models are supported. Each row gives an attribute and its meaning.
-| Attribute | Description |
|---|---|
| hipDeviceAttributeManagedMemory | Unified addressing is supported |
| hipDeviceAttributeConcurrentManagedAccess | Full managed memory support; concurrent access is supported |
| hipDeviceAttributePageableMemoryAccess | Both the managed and the system memory allocation APIs are supported |
The following examples show how to use device attributes:
-**Following code does:** Queries `hipDeviceAttributeConcurrentManagedAccess` for the current device and prints whether HIP managed memory is supported.
-
-#include <hip/hip_runtime.h>
-#include <iostream>
-
-int main() {
- int d;
- hipGetDevice(&d);
-
- int is_cma = 0;
- hipDeviceGetAttribute(&is_cma, hipDeviceAttributeConcurrentManagedAccess, d);
- std::cout << "HIP Managed Memory: "
- << (is_cma == 1 ? "is" : "NOT")
- << " supported" << std::endl;
- return 0;
-}
-The following example shows how to use unified memory management with the hipMallocManaged() function, with the __managed__ attribute for static allocation, and with standard malloc() allocation. For comparison, the Explicit Memory Management example is presented in the last tab.
-__managed__
-**Following code does:** Declares a, b and c as static `__managed__` variables, launches the add() kernel on them, and prints the result on the host after hipDeviceSynchronize().
-
-#include <hip/hip_runtime.h>
-#include <iostream>
-
-// Addition of two values.
-__global__ void add(int *a, int *b, int *c) {
- *c = *a + *b;
-}
-
-// Declare a, b and c as static variables.
-__managed__ int a, b, c;
-
-int main() {
- // Setup input values.
- a = 1;
- b = 2;
-
- // Launch add() kernel on GPU.
- hipLaunchKernelGGL(add, dim3(1), dim3(1), 0, 0, &a, &b, &c);
-
- // Wait for GPU to finish before accessing on host.
- hipDeviceSynchronize();
-
- // Prints the result.
- std::cout << a << " + " << b << " = " << c << std::endl;
-
- return 0;
-}
-
-
-malloc()
-**Following code does:** Allocates a, b and c with the standard malloc() system allocator, passes the pointers directly to the add() kernel, and prints the result on the host.
-
-#include <hip/hip_runtime.h>
-#include <iostream>
-
-// Addition of two values.
-__global__ void add(int* a, int* b, int* c) {
- *c = *a + *b;
-}
-
-int main() {
- int* a, * b, * c;
-
- // Allocate memory for a, b, and c.
- a = (int*)malloc(sizeof(*a));
- b = (int*)malloc(sizeof(*b));
- c = (int*)malloc(sizeof(*c));
-
- // Setup input values.
- *a = 1;
- *b = 2;
-
-
-
- // Launch add() kernel on GPU.
- hipLaunchKernelGGL(add, dim3(1), dim3(1), 0, 0, a, b, c);
-
- // Wait for GPU to finish before accessing on host.
- hipDeviceSynchronize();
-
- // Prints the result.
- std::cout << *a << " + " << *b << " = " << *c << std::endl;
-
- // Cleanup allocated memory.
- free(a);
- free(b);
- free(c);
-
- return 0;
- }
-Explicit memory management
- #include <hip/hip_runtime.h>
- #include <iostream>
-
- // Addition of two values.
- __global__ void add(int *a, int *b, int *c) {
- *c = *a + *b;
- }
-
- int main() {
- int a, b, c;
- int *d_a, *d_b, *d_c;
-
- // Setup input values.
- a = 1;
- b = 2;
-
- // Allocate device copies of a, b and c.
- hipMalloc(&d_a, sizeof(*d_a));
- hipMalloc(&d_b, sizeof(*d_b));
- hipMalloc(&d_c, sizeof(*d_c));
-
- // Copy input values to device.
- hipMemcpy(d_a, &a, sizeof(*d_a), hipMemcpyHostToDevice);
- hipMemcpy(d_b, &b, sizeof(*d_b), hipMemcpyHostToDevice);
-
- // Launch add() kernel on GPU.
- hipLaunchKernelGGL(add, dim3(1), dim3(1), 0, 0, d_a, d_b, d_c);
-
- // Copy the result back to the host.
- hipMemcpy(&c, d_c, sizeof(*d_c), hipMemcpyDeviceToHost);
-
- // Cleanup allocated memory.
- hipFree(d_a);
- hipFree(d_b);
- hipFree(d_c);
-
- // Prints the result.
- std::cout << a << " + " << b << " = " << c << std::endl;
-
- return 0;
-}
-Unified memory management (UMM) is a feature that can simplify the complexities of memory management in GPU computing. It is particularly useful in heterogeneous computing environments with heavy memory usage with both a CPU and a GPU, which would require large memory transfers. Here are some areas where UMM can be beneficial:
-UMM can help to simplify the complexities of memory management. This can make it easier for developers to write code without worrying about memory allocation and deallocation details.
-UMM allows for efficient data migration between the host (CPU) and the device (GPU). This can be particularly useful for applications that need to move data back and forth between the device and host.
-As a positive side effect, UMM can reduce the lines of code, thereby improving programming productivity.
-In HIP, pinned memory allocations are coherent by default. Pinned memory is host memory mapped into the address space of all GPUs, meaning that the pointer can be used on both host and device. Using pinned memory instead of pageable memory on the host can improve bandwidth.
-While UMM can provide numerous benefits, be aware of its potential performance overhead. Test and profile your code thoroughly to confirm that UMM is the most suitable choice for your use case.
-Unified memory HIP runtime hints can help improve the performance of your code if you know its memory access behavior and the infrastructure it runs on. Some hint techniques are presented in this section.
-The hint functions can set actions on a selected device, which can be identified by hipGetDeviceProperties(&prop, device_id). There are two special device_id values:
-For the best performance, profile your application to optimize the utilization of HIP runtime hints.
-Data prefetching is a technique used to improve the performance of your application by moving data closer to the processing unit before it's actually needed.
-Remember to check the return status of hipMemPrefetchAsync() to ensure that the prefetch operations are completed successfully.
-The effectiveness of hipMemAdvise() comes from its ability to inform the runtime system of the developer's intentions regarding memory usage. When the runtime system has knowledge of the expected memory access patterns, it can make better decisions about data placement and caching, leading to more efficient execution of the application. However, the actual impact on performance can vary based on the specific use case and the hardware architecture.
-For the description of hipMemAdvise() and the detailed list of advice, visit the HIP managed memory allocation API .
-Here is the updated version of the example above with memory advice.
- #include <hip/hip_runtime.h>
- #include <iostream>
-
- // Addition of two values.
- __global__ void add(int *a, int *b, int *c) {
- *c = *a + *b;
- }
-
- int main() {
- int *a, *b, *c;
-
- // Allocate memory for a, b, and c accessible to both device and host codes.
- hipMallocManaged(&a, sizeof(*a));
- hipMallocManaged(&b, sizeof(*b));
- hipMallocManaged(&c, sizeof(*c));
-
- // Set memory advice for a, b, and c to be accessed by the CPU.
- hipMemAdvise(a, sizeof(*a), hipMemAdviseSetPreferredLocation, hipCpuDeviceId);
- hipMemAdvise(b, sizeof(*b), hipMemAdviseSetPreferredLocation, hipCpuDeviceId);
- hipMemAdvise(c, sizeof(*c), hipMemAdviseSetPreferredLocation, hipCpuDeviceId);
-
- // Additionally, set memory advice for a, b, and c to be read mostly from device 0.
- constexpr int device = 0;
- hipMemAdvise(a, sizeof(*a), hipMemAdviseSetReadMostly, device);
- hipMemAdvise(b, sizeof(*b), hipMemAdviseSetReadMostly, device);
- hipMemAdvise(c, sizeof(*c), hipMemAdviseSetReadMostly, device);
-
- // Setup input values.
- *a = 1;
- *b = 2;
-
- // Launch add() kernel on GPU.
- hipLaunchKernelGGL(add, dim3(1), dim3(1), 0, 0, a, b, c);
-
- // Wait for GPU to finish before accessing on host.
- hipDeviceSynchronize();
-
- // Prints the result.
- std::cout << *a << " + " << *b << " = " << *c << std::endl;
-
- // Cleanup allocated memory.
- hipFree(a);
- hipFree(b);
- hipFree(c);
-
- return 0;
-}
-Memory Range attributes allow you to query attributes of a given memory range.
-The hipMemRangeGetAttribute() is added to the example to query the hipMemRangeAttributeReadMostly attribute of the memory range pointed to by a . The result is stored in attributeValue and then printed out.
-For more details, visit the HIP managed memory allocation API .
- #include <hip/hip_runtime.h>
- #include <iostream>
-
- // Addition of two values.
- __global__ void add(int *a, int *b, int *c) {
- *c = *a + *b;
- }
-
- int main() {
- int *a, *b, *c;
- unsigned int attributeValue;
- constexpr size_t attributeSize = sizeof(attributeValue);
-
- // Allocate memory for a, b and c that is accessible to both device and host codes.
- hipMallocManaged(&a, sizeof(*a));
- hipMallocManaged(&b, sizeof(*b));
- hipMallocManaged(&c, sizeof(*c));
-
- // Setup input values.
- *a = 1;
- *b = 2;
-
- // Launch add() kernel on GPU.
- hipLaunchKernelGGL(add, dim3(1), dim3(1), 0, 0, a, b, c);
-
- // Wait for GPU to finish before accessing on host.
- hipDeviceSynchronize();
-
- // Query an attribute of the memory range.
- hipMemRangeGetAttribute(&attributeValue,
- attributeSize,
- hipMemRangeAttributeReadMostly,
- a,
- sizeof(*a));
-
- // Prints the result.
- std::cout << *a << " + " << *b << " = " << *c << std::endl;
- std::cout << "The queried attribute value is: " << attributeValue << std::endl;
-
- // Cleanup allocated memory.
- hipFree(a);
- hipFree(b);
- hipFree(c);
-
- return 0;
-}
-The hipStreamAttachMemAsync function is intended to asynchronously attach memory to a stream, which can help concurrent execution when using streams.
-Currently, this function is a no-operation (NOP) function on AMD GPUs. It simply returns success after the runtime memory validation passes. This function is necessary on Microsoft Windows, and UMM is not supported on this operating system with AMD GPUs at the moment.
-Memory management is important when creating high-performance applications in the HIP ecosystem. Both allocating and copying memory can result in bottlenecks, which can significantly impact performance.
-Global memory allocation in HIP uses the C language style allocation function. This works fine for simple cases but can cause problems if your memory needs change. If you need to increase the size of your memory, you must allocate a second larger buffer and copy the data to it before you can free the original buffer. This increases overall memory usage and causes unnecessary memcpy calls. Another solution is to allocate a larger buffer than you initially need. However, this isn't an efficient way to handle resources and doesn't solve the issue of reallocation when the extra buffer runs out.
-Virtual memory management solves these memory management problems. It helps to reduce memory usage and unnecessary memcpy calls.
-Standard memory allocation uses the hipMalloc function to allocate a block of memory on the device. However, when using virtual memory, this process is separated into multiple steps using the hipMemCreate , hipMemAddressReserve , hipMemMap , and hipMemSetAccess functions. This guide explains what these functions do and how you can use them for virtual memory management.
-The first step is to allocate the physical memory itself with the hipMemCreate function. This function accepts the size of the buffer, an unsigned long long variable for the flags, and a hipMemAllocationProp variable. hipMemAllocationProp contains the properties of the memory to be allocated, such as where the memory is physically located and what kind of shareable handles are available. If the allocation is successful, the function returns a value of hipSuccess , with hipMemGenericAllocationHandle_t representing a valid physical memory allocation. The allocated memory size must be aligned with the granularity appropriate for the properties of the allocation. You can use the hipMemGetAllocationGranularity function to determine the correct granularity.
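The alignment requirement above can be sketched as a small round-up helper. This is a minimal illustration rather than code from the HIP documentation: the name `pad_to_granularity` is hypothetical, and the granularity is passed in as a plain number so the sketch runs without a GPU. In a real program the value would presumably come from hipMemGetAllocationGranularity.

```cpp
#include <cassert>
#include <cstddef>

// Round `size` up to the next multiple of `granularity`.
// Assumption: `granularity` stands in for the value a real program would
// query from the runtime; here it is just an ordinary argument.
inline std::size_t pad_to_granularity(std::size_t size, std::size_t granularity) {
    assert(granularity > 0 && "granularity must be non-zero");
    // Integer division truncates, so adding (granularity - 1) first rounds up.
    return ((size + granularity - 1) / granularity) * granularity;
}
```

With a granularity of 4096 bytes, a request for 4097 bytes would be padded to 8192 bytes, while a request for exactly 4096 bytes stays unchanged. A size computed this way plays the role of the padded size passed to the allocation and mapping calls.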
-After you have acquired an allocation of physical memory, you must map it before you can use it. To do so, you need a virtual address to map it to. Mapping means the physical memory allocation is available from the virtual address range it is mapped to. To reserve a virtual memory range, use the hipMemAddressReserve function. The size of the virtual memory must match the amount of physical memory previously allocated. You can then map the physical memory allocation to the newly-acquired virtual memory address range using the hipMemMap function.
-Finally, use the hipMemSetAccess function to enable memory access. It accepts the pointer to the virtual memory, the size, and a hipMemAccessDesc descriptor as parameters. In a multi-GPU environment, you can map the device memory of one GPU to another. This feature also works with the traditional memory management system, but isn't as scalable as with virtual memory. When memory is allocated with hipMalloc , hipDeviceEnablePeerAccess is used to enable peer access. This function enables access between two devices, but it means that every call to hipMalloc takes more time to perform the checks and the mapping between the devices. When using virtual memory management, peer access is enabled by hipMemSetAccess , which provides a finer level of control over what is shared. This has no performance impact on memory allocation and gives you more control over what memory buffers are shared with which devices.
-hipMemAccessDesc accessDesc = {};
-accessDesc.location.type = HIP_MEM_LOCATION_TYPE_DEVICE;
-accessDesc.location.id = currentDev;
-accessDesc.flags = HIP_MEM_ACCESS_FLAGS_PROT_READWRITE;
-hipMemSetAccess(ptr, padded_size, &accessDesc, 1);
-At this point the memory is allocated, mapped, and ready for use. You can read and write to it, just like you would a C style memory allocation.
-To free the memory allocated in this manner, use the corresponding free functions. To unmap the memory, use hipMemUnmap . To release the virtual address range, use hipMemAddressFree . Finally, to release the physical memory, use hipMemRelease . A side effect of these functions is the lack of synchronization when memory is released. If you call hipFree when you have multiple streams running in parallel, it synchronizes the device. This causes worse resource usage and performance.
-hipMemUnmap(ptr, size);
-hipMemRelease(allocHandle);
-hipMemAddressFree(ptr, size);
-The hipMemAddressReserve function allows you to increase the amount of pre-allocated memory. This function accepts a parameter representing the requested starting address of the virtual memory. This allows you to have a continuous virtual address space without worrying about the underlying physical allocation.
- hipMemAddressReserve(&new_ptr, (new_size - padded_size), 0, ptr + padded_size, 0);
- hipMemMap(new_ptr, (new_size - padded_size), 0, newAllocHandle, 0);
- hipMemSetAccess(new_ptr, (new_size - padded_size), &accessDesc, 1);
-The code sample above assumes that hipMemAddressReserve was able to reserve the memory address at the specified location. However, this isn't guaranteed to be true, so you should validate that new_ptr points to a specific virtual address before using it.
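One way to perform that validation is to compare the returned address with the end of the existing mapping. The helper below is a sketch in plain C++: the name `extends_contiguously` is hypothetical, and ordinary host pointers stand in for the virtual addresses that hipMemAddressReserve would return.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true when `extension` starts exactly where the existing range
// [base, base + base_size) ends, i.e. the two ranges form one contiguous span.
// Assumption: both pointers are plain addresses; no HIP call is made here.
inline bool extends_contiguously(const void* base, std::size_t base_size,
                                 const void* extension) {
    assert(base != nullptr && "base must point to an existing reservation");
    const std::uintptr_t expected_start =
        reinterpret_cast<std::uintptr_t>(base) + base_size;
    return reinterpret_cast<std::uintptr_t>(extension) == expected_start;
}
```

If the check fails, the new reservation landed somewhere else, and the buffer can no longer be treated as a single contiguous range.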
-HIP provides the following:
-The HIP API documentation describes each API and its limitations, if any, compared with the equivalent CUDA API.
-At a high-level, the following features are not supported:
-See the API Support Table for more detailed information.
-No. HIP provides porting tools which do most of the work to convert CUDA code into portable C++ code that uses the HIP APIs. Most developers will port their code from CUDA to HIP and then maintain the HIP version. HIP code provides the same performance as native CUDA code, plus the benefits of running on AMD platforms.
-HIP APIs and features do not map to a specific CUDA version. HIP provides a strong subset of the functionality provided in CUDA, and the hipify tools can scan code to identify any unsupported CUDA functions - this is useful for identifying the specific features required by a given application.
-However, we can provide a rough summary of the features included in each CUDA SDK and the support level in HIP. Each bullet below lists the major new language features in each CUDA release and then indicates which are supported or not supported in HIP:
-HIP includes growing support for the four key math libraries using hipBLAS, hipFFT, hipRAND and hipSPARSE, as well as MIOpen for machine intelligence applications. These offer pointer-based memory interfaces (as opposed to opaque buffers) and can be easily interfaced with other HIP applications. The hip interfaces support both ROCm and CUDA paths, with familiar library interfaces.
-Additionally, some of the cuBLAS routines are automatically converted to hipblas equivalents by the HIPIFY tools. These APIs use cuBLAS or hcBLAS depending on the platform and replace the need to use conditional compilation.
-Both AMD and NVIDIA support OpenCL 1.2 on their devices so that developers can write portable code. HIP offers several benefits over OpenCL:
-Both HIP and CUDA are dialects of C++, and thus porting between them is relatively straightforward. Both dialects support templates, classes, lambdas, and other C++ constructs. As one example, the hipify-perl tool was originally a Perl script that used simple text conversions from CUDA to HIP. HIP and CUDA provide similar math library calls as well. In summary, the HIP philosophy was to make the HIP language close enough to CUDA that the porting effort is relatively simple. This reduces the potential for error, and also makes it easy to automate the translation. HIP's goal is to quickly get the ported program running on both platforms with little manual intervention, so that the programmer can focus on performance optimizations.
-There have been several tools that have attempted to convert CUDA into OpenCL, such as CU2CL. OpenCL is a C99-based kernel language (rather than C++) and also does not support single-source compilation. As a result, the OpenCL syntax is different from CUDA, and the porting tools have to perform some heroic transformations to bridge this gap. The tools also struggle with more complex CUDA applications, in particular, those that use templates, classes, or other C++ features inside the kernel.
-Typically, HIPIFY tools can automatically convert almost all run-time code. Most device code needs no additional conversion since HIP and CUDA have similar names for math and built-in functions. The hipify-clang tool will automatically modify the kernel signature as needed (automating a step that used to be done manually). Additional porting may be required to deal with architecture feature queries or with CUDA capabilities that HIP doesn't support. In general, developers should always expect to perform some platform-specific tuning and optimization.
-NVCC is NVIDIA's compiler driver for compiling 'CUDA C++' code into PTX or device code for NVIDIA GPUs. It's a closed-source binary compiler that is provided by the CUDA SDK.
-HIP-Clang is a Clang/LLVM-based compiler that compiles HIP programs to run on AMD platforms.
-While HIP is a strong subset of CUDA, it is still a subset. The HIP layer allows that subset to be clearly defined and documented. Developers who code to the HIP API can be assured their code will remain portable across NVIDIA and AMD platforms. In addition, HIP defines portable mechanisms to query architectural features and supports a larger 64-bit WaveSize, which expands the return type for cross-lane functions like ballot and shuffle from 32-bit integers to 64-bit integers.
-Yes. HIP's CUDA path only exposes the APIs and functionality that work on both NVCC and AMDGPU back-ends. 'Extra' APIs, parameters, and features which exist in CUDA but not in HIP-Clang will typically result in compile-time or run-time errors. Developers need to use the HIP API for most accelerator code and bracket any CUDA-specific code with preprocessor conditionals. Developers concerned about portability should, of course, run on both platforms, and should expect to tune for performance. In some cases, CUDA has a richer set of modes for some APIs, and some C++ capabilities such as virtual functions; see the HIP API documentation for more details.
-Yes. HIP's HIP-Clang path only exposes the APIs and functions that work on AMD runtime back ends. 'Extra' APIs, parameters and features that appear in HIP-Clang but not CUDA will typically cause compile- or run-time errors. Developers must use the HIP API for most accelerator code and bracket any HIP-Clang specific code with preprocessor conditionals. Those concerned about portability should, of course, test their code on both platforms and should tune it for performance. Typically, HIP-Clang supports a more modern set of C++11/C++14/C++17 features, so HIP developers who want portability should be careful when using advanced C++ features on the HIP-Clang path.
-The environment variable can be used to set compiler path:
-There is an alternative environment variable to set compiler path:
-AMD Common Language Runtime (CLR) is a repository for the AMD platform, which contains source codes for AMD's compute languages runtimes as follows,
-A new repository 'hipother' is added in the ROCm 6.1 release, which is branched out from HIP. hipother supports the HIP back-end implementation on some non-AMD platforms, like NVIDIA.
-No, there is no HIP repository open publicly on Windows.
-HIP is a source-portable language that can be compiled to run on either AMD or NVIDIA platform. HIP tools don't create a 'fat binary' that can run on either platform, however.
-Yes. HIP generates the object code which conforms to the GCC ABI, and also links with libstdc++. This means you can compile host code with the compiler of your choice and link the generated object code with GPU code compiled with HIP. Larger projects often contain a mixture of accelerator code (initially written in CUDA with NVCC) and host code (compiled with gcc, icc, or clang). These projects can convert the accelerator code to HIP, compile that code with hipcc, and link with object code from their preferred compiler.
-HIP is a C++ runtime API that supports C style applications as well.
-Some C style applications (and interfaces to other languages such as FORTRAN and Python) call certain HIP APIs but do not use kernel programming. They can be compiled with a C compiler and run correctly; however, small details must be considered in the code. One example is initialization of the HIP struct dim3, as shown in the simple application below, saved as 'test.hip.cpp'.
-// File name: test.hip.cpp
-#include "hip/hip_runtime_api.h"
-#include <stdio.h>
-
- int main(int argc, char** argv) {
- dim3 grid1;
- printf("dim3 grid1; x=%d, y=%d, z=%d\n",grid1.x,grid1.y,grid1.z);
- dim3 grid2 = {1,1,1};
- printf("dim3 grid2 = {1,1,1}; x=%d, y=%d, z=%d\n",grid2.x,grid2.y,grid2.z);
- return 0;
- }
-When using a C++ compiler,
-$ gcc -x c++ $(hipconfig --cpp_config) test.hip.cpp -o test
-$ ./test
-dim3 grid1; x=1, y=1, z=1
-dim3 grid2 = {1,1,1}; x=1, y=1, z=1
-In which 'dim3 grid1;' yields a dim3 grid with all dimensional members x, y, z initialized to 1, as the default constructor behaves that way. Further, if written:
-dim3 grid(2); // yields {2,1,1}
-dim3 grid(2,3); // yields {2,3,1}
-In comparison, when using the C compiler,
-$ gcc -x c $(hipconfig --cpp_config) test.hip.cpp -o test
-$ ./test
-dim3 grid1; x=646881376, y=21975, z=1517277280
-dim3 grid2 = {1,1,1}; x=1, y=1, z=1
-In which 'dim3 grid;' does not imply any initialization, no constructor is called, and the dimensional values x, y, z of grid are undefined. NOTE: To get the C++ default behavior, C programmers must additionally specify the right-hand side as shown below,
-**Following code does:** This code snippet is a shell command that uses `git` to clone a specific branch of a repository from GitHub. It clones the `hip-tests` repository from the ROCm (Radeon Open Compute) GitHub organization. The branch to be cloned is specified by the environment variable `ROCM_BRANCH`. This command is typically used to obtain a local copy of the code from a particular branch of the repository for development, testing, or deployment purposes.
-dim3 grid = {1,1,1}; // initialized as in C++
-**Following code does:** This code snippet appears to be a shell script intended for setting up and running tests for a project that uses HIP (Heterogeneous-Compute Interface for Portability), which is a C++ runtime API and kernel language that allows developers to create portable applications across different GPU platforms. Here's a high-level summary of what it does:
-
-1. **Set Environment Variable**: It sets the `HIPTESTS_DIR` environment variable to the absolute path of the `hip-tests` directory using `readlink -f`.
-
-2. **Navigate to Directory**: It changes the current directory to `HIPTESTS_DIR`.
-
-3. **Create and Navigate to Build Directory**: It creates a `build` directory if it doesn't exist and navigates into it.
-
-4. **Configure Build with CMake**: It runs `cmake` to configure the build system for the project, specifying the HIP platform as AMD and setting the HIP path to a specified directory.
-
-5. **Build Tests**: It compiles the test suite using `make build_tests`.
-
-6. **Run Tests**: It executes the tests using `ctest`.
-
-Overall, this script automates the process of setting up the environment, configuring, building, and running tests for a HIP-based project.
-Yes. You can use HIP_PLATFORM to choose which path hipcc targets. This configuration can be useful when using HIP to develop an application which is portable to both AMD and NVIDIA.
-HIP will set the platform to AMD and use HIP-Clang as the compiler if it sees that the AMD graphics driver is installed and has detected an AMD GPU. Sometimes this isn't what you want; you can force HIP to recognize the platform by setting the following,
-**Following code does:** The code snippet defines a function `ynf` that calculates and returns the value of the Bessel function of the second kind of order `n` for a given input `x`. Bessel functions are a family of solutions to Bessel's differential equation and are commonly used in various fields such as physics and engineering, particularly in problems involving cylindrical or spherical symmetry.
-export HIP_PLATFORM=amd
-**Following code does:** The code snippet appears to be a comment or documentation rather than executable code. It describes the process of building HIP (Heterogeneous-Compute Interface for Portability) tests using Catch2, a unit testing framework. Specifically, it mentions that these HIP tests are separate from the main HIP project and outlines a step to obtain the source code for these tests.
-**Following code does:** This code snippet appears to be a part of a script or configuration file that compiles and runs a specific test for a HIP (Heterogeneous-Compute Interface for Portability) application. It navigates to a directory specified by the environment variable `HIPTESTS_DIR`, then uses the `hipcc` compiler to compile a C++ test file named `hipPointerGetAttributes.cc` along with some include directories. The compiled output is an executable named `hipPointerGetAttributes`, which is then executed. The purpose of this script is to test the functionality related to HIP pointer attributes, and it concludes with a message indicating that all tests have passed.
-HIP_COMPILER=cuda
-HIP_RUNTIME=nvcc
-One symptom of this problem is the message 'error: 'unknown error'(11) at square.hipref.cpp:56'. This can occur if you have a CUDA installation on an AMD platform, and HIP incorrectly detects the platform as NVCC. HIP may be able to compile the application using the NVCC tool-chain but will generate this error at runtime since the platform does not have a CUDA device.
-Yes. Most HIP data structures ( hipStream_t , hipEvent_t ) are typedefs to CUDA equivalents and can be intermixed. Both CUDA and HIP use integer device ids. One notable exception is that hipError_t is a new type, and cannot be used where a cudaError_t is expected. In these cases, refactor the code to remove the expectation. Alternatively, hip_runtime_api.h defines functions which convert between the error code spaces:
-hipErrorToCudaError
-hipCUDAErrorTohipError
-hipCUResultTohipError
-If platform portability is important, use #ifdef __HIP_PLATFORM_NVIDIA__ to guard the CUDA-specific code.
-See Logging HIP activity for more information.
-The product of block.x, block.y, and block.z should be less than 1024. Note that HIP does not support kernel launches in which the total number of work items in any dimension, gridDim x blockDim, reaches 2^32, so gridDim.x * blockDim.x, gridDim.y * blockDim.y, and gridDim.z * blockDim.z must each stay below 2^32.
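The two launch limits stated above can be checked on the host before a kernel is launched. The following is a minimal sketch under stated assumptions, not a HIP API: `Dim3` is a hypothetical stand-in for `dim3` so the code builds without HIP headers, and `launchConfigIsValid` is an illustrative helper name. The text asks for a block product under 1024, while many devices report a limit of exactly 1024 threads per block, so the bound is passed in as a parameter.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for HIP's dim3 so this sketch compiles without HIP headers.
struct Dim3 {
    uint32_t x, y, z;
};

// Returns true when the launch configuration respects both limits:
//  1. block.x * block.y * block.z does not exceed maxThreadsPerBlock
//     (assumption: 1024 by default; query the device for the real limit), and
//  2. grid * block stays below 2^32 in every dimension.
inline bool launchConfigIsValid(Dim3 grid, Dim3 block,
                                uint64_t maxThreadsPerBlock = 1024) {
    const uint64_t limit = 1ull << 32;
    // Use 64-bit arithmetic so the products themselves cannot overflow.
    const uint64_t threadsPerBlock =
        static_cast<uint64_t>(block.x) * block.y * block.z;
    if (threadsPerBlock == 0 || threadsPerBlock > maxThreadsPerBlock) {
        return false;
    }
    return static_cast<uint64_t>(grid.x) * block.x < limit &&
           static_cast<uint64_t>(grid.y) * block.y < limit &&
           static_cast<uint64_t>(grid.z) * block.z < limit;
}
```

A caller would typically run this check once per launch configuration and fall back to a smaller grid, or a loop inside the kernel, when it fails.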
-__shfl_*_sync is not supported on HIP, but on the NVCC path with CUDA 9.0 and above, all shuffle calls are redirected to their sync versions.
-The compiler defines the __HIP_DEVICE_COMPILE__ macro only when compiling the code for the GPU. It could be used to guard code that is specific to the host or the GPU.
-When compiling an OpenMP source file with hipcc -fopenmp , the compiler may generate an error if there is a reference to the _OPENMP macro. This is due to a limitation in hipcc that treats any source file type (for example .cpp ) as a HIP translation unit, leading to some conflicts with the OpenMP language switch. If the OpenMP source file doesn't contain any HIP language constructs, you can work around this issue by adding the -x c++ switch to force the compiler to treat the file as regular C++. Another approach is to guard the OpenMP code with #ifdef _OPENMP so that the code block is disabled when compiling for the GPU. The __HIP_DEVICE_COMPILE__ macro defined by the HIP compiler when compiling GPU code can also be used for guarding code paths specific to the host or the GPU.
-Previously, it was essential to declare dynamic shared memory using the HIP_DYNAMIC_SHARED macro for accuracy, as using static shared memory in the same kernel could result in overlapping memory ranges and data-races.
-Now, the HIP-Clang compiler provides support for extern __shared__ declarations, and the HIP_DYNAMIC_SHARED option is no longer required. You may use the standard extern definition: extern __shared__ type var[];
-This error message appears when you do not have a valid code object for all of your devices.
-If you have compiled the application yourself, make sure you have given the correct device name(s) and features via --offload-arch . If you are not specifying --offload-arch , make sure that hipcc is using the correct offload arch by verifying the hipcc output generated by setting the environment variable HIPCC_VERBOSE=1 .
-If you have a precompiled application/library (like rocblas, TensorFlow etc) which gives you such an error, there is one of two possibilities.
-Note: In previous releases, the error code is hipErrorNoBinaryForGpu with message 'Unable to find code object for all current devices'. The error code handling behavior is changed. HIP runtime shows the error code hipErrorSharedObjectInitFailed with message 'Error: shared object initialization failed' on unsupported GPU.
-The per-thread default stream is an implicit stream local to both the thread and the current device. It does not do any implicit synchronization with other streams (like explicitly created streams), or default per-thread stream on other threads.
-The per-thread default stream is a blocking stream and will synchronize with the default null stream if both are used in a program.
-In ROCm, a compilation option should be added in order to compile the translation unit with per-thread default stream enabled. -fgpu-default-stream=per-thread . Once source is compiled with per-thread default stream enabled, all APIs will be executed on per thread default stream, hence there will not be any implicit synchronization with other streams.
-In addition, the per-thread default stream can be enabled per translation unit, so users can compile some files with the feature enabled and others with it disabled. A translation unit compiled with the feature enabled uses the per-thread default stream and performs no implicit synchronization, while other modules use the legacy default stream, which does synchronize implicitly.
-In HIP, hipFloatComplex and hipDoubleComplex are defined as complex data types,
-**Following code does:** This code is a command-line instruction that uses the `git` version control system to create a local copy (clone) of the repository located at the specified URL, `https://github.com/amd/rcm-examples.git`. This repository is hosted on GitHub and likely contains example code or resources related to AMD's RCM (Resource and Configuration Management) tools or projects. The cloned repository will be downloaded to the current directory where the command is executed.
-Any application that uses complex multiplication and division operations needs to replace the '*' and '/' operators with the following,
-Note: These complex operations are equivalent to corresponding types/functions on the NVIDIA platform.
-Yes, HIP APIs are available to use on both Linux and Windows. Due to different working mechanisms on operating systems like Windows vs Linux, HIP APIs call corresponding lower level backend runtime libraries and kernel drivers for the OS, in order to control the executions on GPU hardware accordingly. There might be a few differences on the related backend software and driver support, which might affect usage of HIP APIs. See OS support details in HIP API document.
-Starting with ROCm 6.0, the HIP runtime supports Locally Unique Identifiers (LUID). This feature enables the local physical device(s) to interoperate with other devices and APIs, for example DirectX 12.
-HIP runtime sets device LUID properties so the driver can query LUID to identify each device for interoperability.
-Note: HIP supports LUID only on Windows OS.
-The HIP version definition has been updated since the ROCm 4.2 release as follows:
-**Following code does:** This code snippet is written in C++ using the HIP API, which is used for GPU programming. The code's high-level purpose is to allocate memory on a GPU device and copy data from the host (CPU) to the device (GPU). Specifically, it allocates memory for two float arrays (`d_x` and `d_y`) on the GPU, each with a size specified by `size_bytes`. It then copies data from two host arrays (`x` and `y`) to these newly allocated device arrays. The `HIP_CHECK` macro is likely used to handle errors that may occur during these operations.
-HIP version can be queried from HIP API call, hipRuntimeGetVersion(&runtimeVersion);
-The version returned will always be greater than the versions in previous ROCm releases.
-Note: The version definition of HIP runtime is different from CUDA. On AMD platform, the function returns HIP runtime version, while on NVIDIA platform, it returns CUDA runtime version. And there is no mapping/correlation between HIP version and CUDA version.
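The packed integer returned by hipRuntimeGetVersion can be split back into its parts on the host. The sketch below is only an illustration and rests on an assumption that this section does not spell out: that the AMD-platform value follows the encoding major * 10000000 + minor * 100000 + patch. `HipVersionParts` and `decodeHipVersion` are hypothetical names, not part of the HIP API, and the decoding does not apply to the CUDA runtime version returned on the NVIDIA platform.

```cpp
#include <cassert>

// Hypothetical container for the decoded fields; not a HIP type.
struct HipVersionParts {
    int majorVersion;
    int minorVersion;
    int patch;
};

// Splits a packed version number into its parts.
// Assumption: version = major * 10000000 + minor * 100000 + patch.
inline HipVersionParts decodeHipVersion(int packedVersion) {
    HipVersionParts parts{};
    parts.majorVersion = packedVersion / 10000000;
    parts.minorVersion = (packedVersion % 10000000) / 100000;
    parts.patch = packedVersion % 100000;
    return parts;
}
```

In a HIP program the input would be the `runtimeVersion` value filled in by `hipRuntimeGetVersion(&runtimeVersion)`.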
-HIP provides a C++ syntax that is suitable for compiling most code that commonly appears in compute kernels (classes, namespaces, operator overloading, and templates). HIP also defines other language features that are designed to target accelerators, such as:
-Note: This chapter describes the built-in variables and functions that are accessible from the HIP kernel. It's intended for users who are familiar with CUDA kernel syntax and want to learn how HIP differs from CUDA.
-Features are labeled with one of the following keywords:
-Supported __device__ functions are:
-You can combine __device__ with the host keyword ( __host__ ).
-Supported __global__ functions are:
-HIP __global__ functions must have a void return type.
-HIP doesn't support dynamic-parallelism, which means that you can't call __global__ functions from the device.
-Supported __host__ functions are:
-You can combine __host__ with __device__ ; in this case, the function compiles for the host and the device. Note that these functions can't use the HIP grid coordinate functions (e.g., threadIdx.x ). If you need to use HIP grid coordinate functions, you can pass the necessary coordinate information as an argument.
-You can't combine __host__ with __global__ .
-HIP parses the __noinline__ and __forceinline__ keywords and converts them into the appropriate Clang attributes.
-__global__ functions are often referred to as kernels . When you call a global function, you're launching a kernel . When launching a kernel, you must specify an execution configuration that includes the grid and block dimensions. The execution configuration can also include other information for the launch, such as the amount of additional shared memory to allocate and the stream where you want to execute the kernel.
-HIP introduces a standard C++ calling convention ( hipLaunchKernelGGL ) to pass the run configuration to the kernel. However, you can also use the CUDA <<< >>> syntax.
-When using hipLaunchKernelGGL , your first five parameters must be:
-You can include your kernel arguments after these parameters.
-**Following code does:** This code snippet is part of a GPU programming context, likely using a framework like CUDA or HIP for parallel computing. The high-level purpose of the code is to perform a SAXPY operation (Single-Precision A·X Plus Y) on a GPU. The `saxpy_kernel` function is a kernel function that runs on the GPU and performs the SAXPY operation on arrays `d_x` and `d_y` with a scalar `a`. The `main` function sets up the execution environment and launches this kernel on the GPU using a specified grid and block size configuration. The kernel is executed on the default stream, which is a queue for managing the execution order of operations on the GPU.
- // Example hipLaunchKernelGGL pseudocode:
-
-ize_t N)
-**Following code does:** This code snippet is part of a program that uses HIP, a C++ runtime API and kernel language that allows developers to write portable code to run on AMD and NVIDIA GPUs. The specific line of code is performing a memory copy operation from a GPU device to the host (CPU) memory.
-
-Here's a high-level explanation of what it does:
-
-- `hipMemcpy` is a function that copies data between host and device memory.
-- `y.data()` is likely a pointer or an array on the host where the data will be copied to.
-- `d_y` is a pointer or an array on the device (GPU) from which the data will be copied.
-- `size_bytes` specifies the number of bytes to copy.
-- `hipMemcpyDeviceToHost` is an enumeration that indicates the direction of the copy, from device to host.
-
-The `HIP_CHECK` macro is likely used to check for errors in the `hipMemcpy` operation, ensuring that the memory copy was successful.
-
-
-}
-
-MyKernel<<<dim3(gridDim), dim3(groupDim), 0, 0>>> (a,b,c,n);
-
-// Alternatively, you can launch the kernel using:
-// hipLaunchKernelGGL(MyKernel, dim3(gridDim), dim3(groupDim), 0/*dynamicShared*/, 0/*stream*/, a, b, c, n);
-You can use HIPIFY tools to convert CUDA launch syntax to hipLaunchKernelGGL . This includes the conversion of optional <<< >>> arguments into the five required hipLaunchKernelGGL parameters. Note: HIP doesn't support dimension sizes of gridDim * blockDim ≥ 2^32 when launching a kernel.
-**Following code does:** This code snippet is a shell command that modifies the `PATH` environment variable. It prepends the directory `/opt/rcm/bin` to the existing `PATH`. This means that when the system searches for executable files, it will first look in `/opt/rcm/bin` before checking the other directories listed in the current `PATH`. This is typically done to prioritize custom or specific versions of executables located in `/opt/rcm/bin` over those in other directories.
-
-// Example showing device function
-__device__ __host__ // <- compile for both device and host
-float PlusOne(float x)
-{
-    return x + 1.0;
-}
-
-__global__
-void
-MyKernel (hipLaunchParm lp, /*lp parm for execution configuration */
-          const float *a, const float *b, float *c, unsigned N)
-{
-    // Global index: each block handles its own blockDim.x-sized slice.
-    unsigned gid = blockIdx.x * blockDim.x + threadIdx.x; // <- coordinate index functions
-    if (gid < N) {
-        c[gid] = a[gid] + PlusOne(b[gid]);
-    }
-}
-void callMyKernel()
-{
-    float *a, *b, *c; // initialization not shown...
-    unsigned N = 1000000;
-    const unsigned blockSize = 256;
-
-    MyKernel<<<dim3(N/blockSize), dim3(blockSize), 0, 0>>> (a,b,c,N);
-    // Alternatively, kernel can be launched by
-    // hipLaunchKernelGGL(MyKernel, dim3(N/blockSize), dim3(blockSize), 0, 0, a,b,c,N);
-}
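One detail of the launch above is worth spelling out: `N/blockSize` is an integer division, so with N = 1000000 and blockSize = 256 it yields 3906 blocks, which cover only 999936 elements and leave the last 64 unprocessed. A common remedy is to round the block count up and rely on the `gid < N` guard in the kernel. The helper below is a minimal host-only sketch of that rounding; `blocksFor` is a hypothetical name, not a HIP function.

```cpp
#include <cassert>
#include <cstdint>

// Number of blocks needed so that blocks * blockSize >= n.
// Written as quotient plus remainder test so that n + blockSize cannot overflow.
// Assumes blockSize > 0.
constexpr uint32_t blocksFor(uint32_t n, uint32_t blockSize) {
    return n / blockSize + (n % blockSize != 0 ? 1u : 0u);
}
```

With this helper the grid argument would become `dim3(blocksFor(N, blockSize))`, and the threads past the end of the arrays in the final block are filtered out by the bounds check.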
-The host writes constant memory before launching the kernel. This memory is read-only from the GPU while the kernel is running. The functions for accessing constant memory are:
-To allow the host to dynamically allocate shared memory, you can specify extern __shared__ as a launch parameter.
-Note: Prior to the HIP-Clang compiler, dynamic shared memory had to be declared using the HIP_DYNAMIC_SHARED macro in order to ensure accuracy. This is because using static shared memory in the same kernel could have resulted in overlapping memory ranges and data races. The HIP-Clang compiler provides support for extern __shared__ declarations, so HIP_DYNAMIC_SHARED is no longer required.
-Managed memory, including the __managed__ keyword, is supported in HIP combined host/device compilation.
-__restrict__ tells the compiler that the associated memory pointer does not alias with any other pointer in the kernel or function. This can help the compiler generate better code. In most use cases, every pointer argument should use this keyword in order to achieve the benefit.
-The kernel uses coordinate built-ins ( thread* , block* , grid* ) to determine the coordinate index and bounds for the active work item.
-Built-ins are defined in amd_hip_runtime.h , rather than being implicitly defined by the compiler.
-Coordinate variable definitions for built-ins are the same for HIP and CUDA. For example: threadIdx.x , blockIdx.y , and gridDim.y . The products gridDim.x * blockDim.x , gridDim.y * blockDim.y , and gridDim.z * blockDim.z are always less than 2^32 .
-Coordinate built-ins are implemented as structures for improved performance. When used with printf , they must be explicitly cast to integer types.
-The warpSize variable type is int . It contains the warp size (in threads) for the target device. warpSize should only be used in device functions that develop portable wave-aware code.
-Note: NVIDIA devices return 32 for this variable; AMD devices return 64 for gfx9 and 32 for gfx10 and above.
-The following vector types are defined in hip_runtime.h . They are not automatically provided by the compiler.
-Short vector types derive from basic integer and floating-point types. These structures are defined in hip_vector_types.h . The first, second, third, and fourth components of the vector are defined by the x , y , z , and w fields, respectively. All short vector types support a constructor function of the form make_<type_name>() . For example, float4 make_float4(float x, float y, float z, float w) creates a vector with type float4 and value (x,y,z,w) .
-HIP supports the following short vector formats:
-dim3 is a three-dimensional integer vector type that is commonly used to specify grid and group dimensions.
-The dim3 constructor accepts between zero and three arguments. By default, it initializes unspecified dimensions to 1.
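The constructor behavior described above can be pictured with a small stand-in type. This is a hedged sketch for illustration only: `Dim3Sketch` is a hypothetical struct that mirrors the documented defaults, not the definition found in the HIP headers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in that mirrors the documented dim3 behavior:
// zero to three arguments, with unspecified dimensions defaulting to 1.
struct Dim3Sketch {
    uint32_t x;
    uint32_t y;
    uint32_t z;

    constexpr Dim3Sketch(uint32_t x_ = 1, uint32_t y_ = 1, uint32_t z_ = 1)
        : x(x_), y(y_), z(z_) {}
};
```

Declaring `Dim3Sketch grid(2, 3);` therefore gives the dimensions {2, 3, 1}, which matches what the text states for `dim3` when compiled as C++.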
-HIP supports __threadfence() and __threadfence_block() . If you're using threadfence_system() in the HIP-Clang path, you can use the following workaround:
-Synchronization functions cause all threads in the group to wait at the synchronization point until all shared and global memory accesses made by those threads have completed. This guarantees the visibility of accessed data for all threads in the group.
-The __syncthreads() built-in function is supported in HIP. The __syncthreads_count(int) , __syncthreads_and(int) , and __syncthreads_or(int) functions are under development.
-The Cooperative Groups API offers options to synchronize a developer-defined set of thread groups. For further information, check Cooperative Groups API or Cooperative Groups how to .
-HIP-Clang supports a set of math operations that are callable from the device. HIP supports most of the device functions supported by CUDA. These are described on Math API page .
-The supported texture functions are listed in the texture_fetch_functions.h and texture_indirect_functions.h header files in the HIP-AMD backend repository.
-Texture functions are not supported on some devices. To determine whether texture functions are supported on your device, check whether the macro __HIP_NO_IMAGE_SUPPORT equals 1. In host runtime code, you can query the attribute hipDeviceAttributeImageSupport instead.
-The following surface functions are supported in HIP:
-hipError_t hipCreateSurfaceObject ( hipSurfaceObject_t *pSurfObject, const hipResourceDesc *pResDesc )
-Create a surface object.
-hipSuccess, hipErrorInvalidValue
-hipError_t hipDestroySurfaceObject ( hipSurfaceObject_t surfaceObject )
-Destroy a surface object.
-surfaceObject -[in] Surface object to be destroyed.
-hipSuccess, hipErrorInvalidValue
-template<typename T , typename std::enable_if<__hip_is_tex_surf_channel_type< T >::value>::type* = nullptr> static void surf1Dread ( T *data, hipSurfaceObject_t surfObj, int x, int boundaryMode = hipBoundaryModeZero )
-Reads the value at coordinate x from the one-dimensional surface.
-T - The data type of the surface.
-template<typename T , typename std::enable_if<__hip_is_tex_surf_channel_type< T >::value>::type* = nullptr> static void surf1Dwrite ( T data, hipSurfaceObject_t surfObj, int x )
-Writes the value data to the one-dimensional surface at coordinate x.
-T - The data type of the surface.
-template<typename T , typename std::enable_if<__hip_is_tex_surf_channel_type< T >::value>::type* = nullptr> static void surf2Dread ( T *data, hipSurfaceObject_t surfObj, int x, int y )
-Reads the value from the two-dimensional surface at coordinate x, y.
-T - The data type of the surface.
-template<typename T , typename std::enable_if<__hip_is_tex_surf_channel_type< T >::value>::type* = nullptr> static void surf2Dwrite ( T data, hipSurfaceObject_t surfObj, int x, int y )
-Writes the value data to the two-dimensional surface at coordinate x, y.
-T - The data type of the surface.
-template<typename T , typename std::enable_if<__hip_is_tex_surf_channel_type< T >::value>::type* = nullptr> static void surf3Dread ( T *data, hipSurfaceObject_t surfObj, int x, int y, int z )
-Reads the value from the three-dimensional surface at coordinate x, y, z.
-T - The data type of the surface.
-template<typename T , typename std::enable_if<__hip_is_tex_surf_channel_type< T >::value>::type* = nullptr> static void surf3Dwrite ( T data, hipSurfaceObject_t surfObj, int x, int y, int z )
-Writes the value data to the three-dimensional surface at coordinate x, y, z.
-T - The data type of the surface.
-template<typename T , typename std::enable_if<__hip_is_tex_surf_channel_type< T >::value>::type* = nullptr> static void surf1DLayeredread ( T *data, hipSurfaceObject_t surfObj, int x, int layer )
-Reads the value from the one-dimensional layered surface at coordinate x and layer index.
-T - The data type of the surface.
-template<typename T , typename std::enable_if<__hip_is_tex_surf_channel_type< T >::value>::type* = nullptr> static void surf1DLayeredwrite ( T data, hipSurfaceObject_t surfObj, int x, int layer )
-Writes the value data to the one-dimensional layered surface at coordinate x and layer index.
-T - The data type of the surface.
-template<typename T , typename std::enable_if<__hip_is_tex_surf_channel_type< T >::value>::type* = nullptr> static void surf2DLayeredread ( T *data, hipSurfaceObject_t surfObj, int x, int y, int layer )
-Reads the value from the two-dimensional layered surface at coordinate x, y and layer index.
-T - The data type of the surface.
-template<typename T , typename std::enable_if<__hip_is_tex_surf_channel_type< T >::value>::type* = nullptr>
-static void surf2DLayeredwrite ( T data, hipSurfaceObject_t surfObj, int x, int y, int layer )
-Writes the value data to the two-dimensional layered surface at coordinate x, y and layer index.
-T - The data type of the surface.
-template<typename T , typename std::enable_if<__hip_is_tex_surf_channel_type< T >::value>::type* = nullptr> static void surfCubemapread ( T *data, hipSurfaceObject_t surfObj, int x, int y, int face )
-Reads the value from the cubemap surface at coordinate x, y and face index.
-T - The data type of the surface.
-template<typename T , typename std::enable_if<__hip_is_tex_surf_channel_type< T >::value>::type* = nullptr> static void surfCubemapwrite ( T data, hipSurfaceObject_t surfObj, int x, int y, int face )
-Writes the value data to the cubemap surface at coordinate x, y and face index.
-T - The data type of the surface.
-template<typename T , typename std::enable_if<__hip_is_tex_surf_channel_type< T >::value>::type* = nullptr> static void surfCubemapLayeredread ( T *data, hipSurfaceObject_t surfObj, int x, int y, int face, int layer )
-Reads the value from the layered cubemap surface at coordinate x, y and face, layer index.
-T - The data type of the surface.
-template<typename T , typename std::enable_if<__hip_is_tex_surf_channel_type< T >::value>::type* = nullptr> static void surfCubemapLayeredwrite ( T *data, hipSurfaceObject_t surfObj, int x, int y, int face, int layer )
-Writes the value data to the layered cubemap surface at coordinate x, y and face, layer index.
-T - The data type of the surface.
-To read a high-resolution timer from the device, HIP provides the following built-in functions:
-**Following code does:** The code snippet is not actually a code but rather an instruction. It suggests that you should be able to execute a command in the command line to check the version of a compiler named `amdclang++`. This implies that `amdclang++` is a C++ compiler, likely provided by AMD, and the command `amdclang++ --version` is used to display the version information of this compiler.
-**Following code does:** This code snippet is a shell command that checks the version of the NVIDIA CUDA Compiler (nvcc) installed on the system. By executing `nvcc --version`, it outputs the version information of the nvcc tool, which is part of the CUDA Toolkit used for compiling CUDA programs. This is useful for verifying the installation and version of CUDA on a machine.
-clock_t clock()
-long long int clock64()
-The difference between the values that are returned represents the cycles used.
-**Following code does:** This PowerShell script is designed to set up a development environment for Visual Studio. Here's a high-level summary of what it does:
-
-1. **Locate Visual Studio Installation**: It retrieves a list of installed Visual Studio instances, sorts them by version in descending order, and selects the most recent one. It then extracts the installation path of this latest version.
-
-2. **Import Visual Studio Module**: It imports a specific Visual Studio module (`Microsoft.VisualStudio.DevShell.dll`) from the installation path. This module is likely used to facilitate the setup of a development shell environment.
-
-3. **Enter Visual Studio Developer Shell**: It enters the Visual Studio Developer Shell using the installation path determined earlier. It specifies the architecture as `amd64` for both the host and the development environment, and it suppresses the display of the logo with the `-no_logo` argument.
-
-4. **Modify Environment Path**: It updates the system's `PATH` environment variable by prepending a directory (`bin` subdirectory of `HIP_PATH`) to the existing `PATH`. This ensures that executables in this directory are prioritized when running commands.
-
-Overall, the script automates the process of setting up a Visual Studio development environment with specific configurations and updates the system path to include additional tools.
-long long int wall_clock64()
-**Following code does:** This code snippet is a shell command that checks the version of the `clang++` compiler installed on the system. `clang++` is the C++ compiler that is part of the LLVM project. Running this command will output the version information of the `clang++` compiler, including details like the version number, target architecture, and possibly the build date.
-This can be queried using the HIP API with the hipDeviceAttributeWallClockRate attribute of the device in HIP application code. For example:
-**Following code does:** This PowerShell script is designed to set up a development environment for Visual Studio. It performs the following high-level tasks:
-
-1. **Retrieve Visual Studio Installation Path**: It queries the system for installed Visual Studio instances, sorts them by version in descending order, and selects the installation path of the latest version.
-
-2. **Import Visual Studio Development Shell Module**: It imports a specific module (`Microsoft.VisualStudio.DevShell.dll`) from the selected Visual Studio installation. This module is used to configure the development environment.
-
-3. **Enter Visual Studio Development Shell**: It initializes the Visual Studio Developer Command Prompt environment for the specified installation path, targeting the `amd64` architecture for both the host and the development environment, while suppressing the display of the logo. This sets up the necessary environment variables and paths for development tasks.
- int wallClkRate = 0; // in kilohertz
- HIPCHECK(hipDeviceGetAttribute(&wallClkRate, hipDeviceAttributeWallClockRate, deviceId));
-Here hipDeviceAttributeWallClockRate is a device attribute. Note that the wall clock frequency is a per-device attribute.
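Because the rate is reported in kilohertz, it is numerically the number of ticks per millisecond. The following host-side C++ sketch shows the conversion of a tick delta into milliseconds; the function name `wall_ticks_to_ms` is hypothetical, and the tick values are assumed to come from two `wall_clock64()` reads on the same device.

```cpp
#include <cassert>

// Convert a wall-clock tick delta into milliseconds.
// rate_khz is assumed to be the value read for hipDeviceAttributeWallClockRate;
// kilohertz means "thousand ticks per second", i.e. ticks per millisecond.
double wall_ticks_to_ms(long long start_ticks, long long end_ticks, int rate_khz) {
    assert(rate_khz > 0);  // a zero rate would mean the attribute query failed
    return static_cast<double>(end_ticks - start_ticks) /
           static_cast<double>(rate_khz);
}
```

With a rate of 100000 kHz, for instance, a delta of 100000 ticks corresponds to one millisecond.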
-Note that clock() and clock64() do not work properly on AMD RDNA3 (GFX11) graphic processors.
-Atomic functions are run as read-modify-write (RMW) operations that reside in global or shared memory. No other device or thread can observe or modify the memory location during an atomic operation. If multiple instructions from different devices or threads target the same memory location, the instructions are serialized in an undefined order.
-To support system scope atomic operations, you can use the HIP APIs that contain the _system suffix. For example:
-HIP supports the following atomic operations.
-Table 1: Atomic operations
-| Function |
|---|
| int atomicAdd(int* address, int val) |
| int atomicAdd_system(int* address, int val) |
| unsigned int atomicAdd(unsigned int* address, unsigned int val) |
| unsigned int atomicAdd_system(unsigned int* address, unsigned int val) |
| unsigned long long atomicAdd(unsigned long long* address, unsigned long long val) |
| unsigned long long atomicAdd_system(unsigned long long* address, unsigned long long val) |
| float atomicAdd(float* address, float val) |
| float atomicAdd_system(float* address, float val) |
| double atomicAdd(double* address, double val) |
| double atomicAdd_system(double* address, double val) |
| float unsafeAtomicAdd(float* address, float val) |
| float safeAtomicAdd(float* address, float val) |
| double unsafeAtomicAdd(double* address, double val) |
| double safeAtomicAdd(double* address, double val) |
| int atomicSub(int* address, int val) |
| int atomicSub_system(int* address, int val) |
| unsigned int atomicSub(unsigned int* address, unsigned int val) |
| unsigned int atomicSub_system(unsigned int* address, unsigned int val) |
| int atomicExch(int* address, int val) |
| int atomicExch_system(int* address, int val) |
| unsigned int atomicExch(unsigned int* address, unsigned int val) |
| unsigned int atomicExch_system(unsigned int* address, unsigned int val) |
| unsigned long long atomicExch(unsigned long long int* address, unsigned long long val) |
| unsigned long long atomicExch_system(unsigned long long* address, unsigned long long val) |
| float atomicExch(float* address, float val) |
| int atomicMin(int* address, int val) |
| int atomicMin_system(int* address, int val) |
| unsigned int atomicMin(unsigned int* address, unsigned int val) |
| unsigned int atomicMin_system(unsigned int* address, unsigned int val) |
| unsigned long long atomicMin(unsigned long long* address, unsigned long long val) |
| int atomicMax(int* address, int val) |
| int atomicMax_system(int* address, int val) |
| unsigned int atomicMax(unsigned int* address, unsigned int val) |
| unsigned int atomicMax_system(unsigned int* address, unsigned int val) |
| unsigned long long atomicMax(unsigned long long* address, unsigned long long val) |
| unsigned int atomicDec(unsigned int* address) |
| int atomicCAS(int* address, int compare, int val) |
| int atomicCAS_system(int* address, int compare, int val) |
| unsigned int atomicCAS(unsigned int* address, unsigned int compare, unsigned int val) |
| unsigned int atomicCAS_system(unsigned int* address, unsigned int compare, unsigned int val) |
| unsigned long long atomicCAS(unsigned long long* address, unsigned long long compare, unsigned long long val) |
| unsigned long long atomicCAS_system(unsigned long long* address, unsigned long long compare, unsigned long long val) |
| int atomicAnd(int* address, int val) |
| int atomicAnd_system(int* address, int val) |
| unsigned int atomicAnd(unsigned int* address, unsigned int val) |
| unsigned int atomicAnd_system(unsigned int* address, unsigned int val) |
| unsigned long long atomicAnd(unsigned long long* address, unsigned long long val) |
| unsigned long long atomicAnd_system(unsigned long long* address, unsigned long long val) |
| int atomicOr(int* address, int val) |
| int atomicOr_system(int* address, int val) |
| unsigned int atomicOr(unsigned int* address, unsigned int val) |
| unsigned int atomicOr_system(unsigned int* address, unsigned int val) |
| unsigned long long atomicOr(unsigned long long int* address, unsigned long long val) |
| unsigned long long atomicOr_system(unsigned long long* address, unsigned long long val) |
| int atomicXor(int* address, int val) |
| int atomicXor_system(int* address, int val) |
| unsigned int atomicXor(unsigned int* address, unsigned int val) |
| unsigned int atomicXor_system(unsigned int* address, unsigned int val) |
| unsigned long long atomicXor(unsigned long long* address, unsigned long long val) |
| unsigned long long atomicXor_system(unsigned long long* address, unsigned long long val) |
Some HIP devices support fast atomic RMW operations on floating-point values. For example, atomicAdd on single- or double-precision floating-point values may generate a hardware RMW instruction that is faster than emulating the atomic operation using an atomic compare-and-swap (CAS) loop.
-On some devices, fast atomic RMW instructions can produce results that differ from the same functions implemented with atomic CAS loops. For example, some devices will use different rounding or denormal modes, and some devices produce incorrect answers if fast floating-point atomic RMW instructions target fine-grained memory allocations.
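To make the "CAS loop" alternative concrete, here is a host-side C++ analogue built on `std::atomic<float>`. It is only a sketch of the emulation pattern, not the code HIP generates; the helper name `cas_loop_add` is hypothetical.

```cpp
#include <atomic>
#include <cassert>

// Host-side analogue of a floating-point atomic add emulated with a CAS loop.
// Returns the value stored before the addition, like atomicAdd does.
float cas_loop_add(std::atomic<float>& target, float val) {
    float expected = target.load();
    // On failure, compare_exchange_weak writes the current value back into
    // 'expected', so the sum is recomputed from fresh data on every retry.
    while (!target.compare_exchange_weak(expected, expected + val)) {
        // retry until no other thread modified 'target' in between
    }
    return expected;
}
```

The loop may run several times under contention, which is why a single hardware RMW instruction can be faster.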
-The HIP-Clang compiler offers a compile-time option that lets you choose fast, but potentially unsafe, atomic instructions for your code. On devices that support these instructions, you can include the -munsafe-fp-atomics option. This flag indicates to the compiler that all floating-point atomic function calls are allowed to use an unsafe version, if one exists. For example, on some devices, this flag indicates to the compiler that no floating-point atomicAdd function can target fine-grained memory.
-If you want to avoid unsafe floating-point atomic RMW operations, you can use the -mno-unsafe-fp-atomics option. Note that the compiler default is to not produce unsafe floating-point atomic RMW instructions, so the -mno-unsafe-fp-atomics option is not necessarily required. However, passing this option to the compiler is good practice.
-When you pass -munsafe-fp-atomics or -mno-unsafe-fp-atomics to the compiler's command line, the option is applied globally for the entire compilation. Note that if some of the atomic RMW function calls cannot safely use the faster floating-point atomic RMW instructions, you must use -mno-unsafe-fp-atomics in order to ensure that your atomic RMW function calls produce correct results.
-HIP has four extra functions that you can use to more precisely control which floating-point atomic RMW functions produce unsafe atomic RMW instructions:
-Threads in a warp are referred to as lanes and are numbered from 0 to warpSize - 1 . Warp cross-lane functions operate across all lanes in a warp. The hardware guarantees that all warp lanes will execute in lockstep, so additional synchronization is unnecessary, and the instructions use no shared memory.
-Note that NVIDIA and AMD devices have different warp sizes. You can use warpSize built-ins in your portable code to query the warp size.
-Tip: Be sure to review HIP code generated from the CUDA path to ensure that it doesn't assume a waveSize of 32. 'Wave-aware' code that assumes a waveSize of 32 can run on a wave-64 machine, but it only utilizes half of the machine's resources.
-To get the default warp size of a GPU device, use hipGetDeviceProperties in your host functions.
- cudaDeviceProp props;
- cudaGetDeviceProperties(&props, deviceID);
- int w = props.warpSize;
- // implement portable algorithm based on w (rather than assume 32 or 64)
-Only use warpSize built-ins in device functions, and don't assume warpSize to be a compile-time constant.
-Note that assembly kernels may be built for a warp size that is different from the default. All mask values either returned or accepted by these builtins are 64-bit unsigned integer values, even when compiled for a wave-32 device, where all the higher bits are unused. CUDA code ported to HIP requires changes to ensure that the correct type is used.
-Note that the __sync variants are made available in ROCm 6.2, but disabled by default to help with the transition to 64-bit masks. They can be enabled by setting the preprocessor macro HIP_ENABLE_WARP_SYNC_BUILTINS . These builtins will be enabled unconditionally in ROCm 6.3. Wherever possible, the implementation includes a static assert to check that the program source uses the correct type for the mask.
-int __all(int predicate)
-int __any(int predicate)
-unsigned long long __ballot(int predicate)
-unsigned long long __activemask()
-
-int __all_sync(unsigned long long mask, int predicate)
-You can use __any and __all to get a summary view of the predicates evaluated by the participating lanes.
-To determine if the target platform supports the any/all instruction, you can use the hasWarpVote device property or the HIP_ARCH_HAS_WARP_VOTE compiler definition.
-__ballot returns a bit mask containing the 1-bit predicate value from each lane. The nth bit of the result contains the 1 bit contributed by the nth warp lane.
-__activemask() returns a bit mask of currently active warp lanes. The nth bit of the result is 1 if the nth warp lane is active.
-Note that the __ballot and __activemask builtins in HIP have a 64-bit return value (unlike the 32-bit value returned by the CUDA builtins). Code ported from CUDA should be adapted to support the larger warp sizes that the HIP version requires.
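The effect of the 64-bit mask can be illustrated on the host. The sketch below models `__ballot` over a vector of per-lane predicates (the names `ballot_model` and `count_lanes` are hypothetical) and shows why storing the result in a 32-bit variable, as CUDA-style code often does, silently drops the upper lanes on a wave-64 device.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side model of __ballot: bit n of the result is set when the predicate
// of lane n is non-zero. At most 64 lanes are supported, matching the mask width.
std::uint64_t ballot_model(const std::vector<int>& predicates) {
    assert(predicates.size() <= 64);
    std::uint64_t mask = 0;
    for (std::size_t lane = 0; lane < predicates.size(); ++lane) {
        if (predicates[lane] != 0) {
            mask |= (std::uint64_t{1} << lane);
        }
    }
    return mask;
}

// Count the set bits of a mask (portable, no compiler builtins).
int count_lanes(std::uint64_t mask) {
    int n = 0;
    for (; mask != 0; mask &= (mask - 1)) {
        ++n;  // each iteration clears the lowest set bit
    }
    return n;
}
```

If lanes 3 and 40 vote true, the 64-bit mask reports two lanes, while the same mask truncated to `std::uint32_t` reports only one.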
-Applications can test whether the target platform supports the __ballot or __activemask instructions using the hasWarpBallot device property in host code or the HIP_ARCH_HAS_WARP_BALLOT macro defined by the compiler for device code.
-The _sync variants require a 64-bit unsigned integer mask argument that specifies the lanes in the warp that will participate in cross-lane communication with the calling lane. Each participating thread must have its own bit set in its mask argument, and all active threads specified in any mask argument must execute the same call with the same mask, otherwise the result is undefined.
- unsigned long long __match_any(T value)
- unsigned long long __match_all(T value, int *pred)
-
- unsigned long long __match_any_sync(unsigned long long mask, T value)
- unsigned long long __match_all_sync(unsigned long long mask, T value, int *pred)
-T can be a 32-bit integer type, 64-bit integer type or a single precision or double precision floating point type.
-__match_any returns a bit mask containing a 1-bit for every participating lane if and only if that lane has the same value in value as the current lane, and a 0-bit for all other lanes.
-__match_all returns a bit mask containing a 1-bit for every participating lane if and only if they all have the same value in value as the current lane, and a 0-bit for all other lanes. The predicate pred is set to true if and only if all participating threads have the same value in value .
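The two descriptions above can be checked against a small host-side model. This is a sketch under the assumption that every lane in the vector participates; `match_any_model` and `match_all_model` are hypothetical names, and `int` stands in for the template type `T`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Model of __match_any as seen from lane 'self': one bit per lane whose value
// equals the value held by 'self'.
std::uint64_t match_any_model(const std::vector<int>& values, std::size_t self) {
    assert(values.size() <= 64 && self < values.size());
    std::uint64_t mask = 0;
    for (std::size_t lane = 0; lane < values.size(); ++lane) {
        if (values[lane] == values[self]) {
            mask |= (std::uint64_t{1} << lane);
        }
    }
    return mask;
}

// Model of __match_all: all participating bits if every lane holds the same
// value, otherwise 0. *pred reports whether the values were all equal.
std::uint64_t match_all_model(const std::vector<int>& values, bool* pred) {
    assert(!values.empty() && values.size() <= 64);
    bool all_equal = true;
    for (std::size_t lane = 1; lane < values.size(); ++lane) {
        if (values[lane] != values[0]) all_equal = false;
    }
    *pred = all_equal;
    if (!all_equal) return 0;
    // Avoid shifting by 64, which would be undefined behaviour.
    return values.size() == 64 ? ~std::uint64_t{0}
                               : (std::uint64_t{1} << values.size()) - 1;
}
```

For the lane values {7, 3, 7, 7}, lane 0 sees the mask 0b1101 from the any-variant, and the all-variant returns 0 with the predicate cleared.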
-The _sync variants require a 64-bit unsigned integer mask argument that specifies the lanes in the warp that will participate in cross-lane communication with the calling lane. Each participating thread must have its own bit set in its mask argument, and all active threads specified in any mask argument must execute the same call with the same mask, otherwise the result is undefined.
-The default width is warpSize (see Warp cross-lane functions ). Half-float shuffles are not supported.
-
-
-T __shfl (T var, int srcLane, int width=warpSize);
-T can be a 32-bit integer type, 64-bit integer type or a single precision or double precision floating point type.
-The _sync variants require a 64-bit unsigned integer mask argument that specifies the lanes in the warp that will participate in cross-lane communication with the calling lane. Each participating thread must have its own bit set in its mask argument, and all active threads specified in any mask argument must execute the same call with the same mask, otherwise the result is undefined.
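The role of the width argument is easier to see in a host-side model. The sketch below assumes the CUDA-documented semantics, which the text here does not spell out: width is a power of two, the warp is split into width-sized segments, and each lane reads from lane `srcLane` modulo width inside its own segment. The name `shfl_model` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model of __shfl for one warp. vars[lane] is the value each lane
// passes in; the result holds what each lane would receive.
// Assumption: width is a power of two and divides the number of lanes.
std::vector<int> shfl_model(const std::vector<int>& vars, int src_lane, int width) {
    assert(width > 0 && (width & (width - 1)) == 0);
    assert(vars.size() % static_cast<std::size_t>(width) == 0);
    std::vector<int> out(vars.size());
    for (std::size_t lane = 0; lane < vars.size(); ++lane) {
        // First lane of the segment this lane belongs to.
        std::size_t segment_base = lane & ~static_cast<std::size_t>(width - 1);
        // Source lane wraps inside the segment.
        std::size_t offset = static_cast<std::size_t>(src_lane & (width - 1));
        out[lane] = vars[segment_base + offset];
    }
    return out;
}
```

With eight lanes holding 0 to 7, a shuffle from source lane 1 with width 4 gives lanes 0 to 3 the value 1 and lanes 4 to 7 the value 5.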
-You can use cooperative groups to synchronize groups of threads. Cooperative groups also provide a way of communicating between groups of threads at a granularity that is different from the block.
-HIP supports the following kernel language cooperative groups types and functions:
-| Function | Supported in HIP | Supported in CUDA |
|---|---|---|
| void thread_group.sync(); | ✓ | ✓ |
| unsigned thread_group.size(); | ✓ | ✓ |
| unsigned thread_group.thread_rank() | ✓ | ✓ |
| bool thread_group.is_valid(); | ✓ | ✓ |
| grid_group this_grid() | ✓ | ✓ |
| void grid_group.sync() | ✓ | ✓ |
| unsigned grid_group.size() | ✓ | ✓ |
| unsigned grid_group.thread_rank() | ✓ | ✓ |
| bool grid_group.is_valid() | ✓ | ✓ |
| multi_grid_group this_multi_grid() | ✓ | ✓ |
| void multi_grid_group.sync() | ✓ | ✓ |
| unsigned multi_grid_group.size() | ✓ | ✓ |
| unsigned multi_grid_group.thread_rank() | ✓ | ✓ |
| bool multi_grid_group.is_valid() | ✓ | ✓ |
| unsigned multi_grid_group.num_grids() | ✓ | ✓ |
| unsigned multi_grid_group.grid_rank() | ✓ | ✓ |
| thread_block this_thread_block() | ✓ | ✓ |
| void thread_block.sync() | ✓ | ✓ |
| unsigned thread_block.size() | ✓ | ✓ |
| unsigned thread_block.thread_rank() | ✓ | ✓ |
| bool thread_block.is_valid() | ✓ | ✓ |
| dim3 thread_block.group_index() | ✓ | ✓ |
| dim3 thread_block.thread_index() | ✓ | ✓ |
For further information, see the Cooperative Groups API reference or the Cooperative Groups how-to.
-Warp matrix functions allow a warp to cooperatively operate on small matrices that have elements spread over lanes in an unspecified manner.
-HIP does not support kernel language warp matrix types or functions.
-| Function | Supported in HIP | Supported in CUDA |
|---|---|---|
| void load_matrix_sync(fragment<...> &a, const T* mptr, unsigned lda) | | ✓ |
| void load_matrix_sync(fragment<...> &a, const T* mptr, unsigned lda, layout_t layout) | | ✓ |
| void store_matrix_sync(T* mptr, fragment<...> &a, unsigned lda, layout_t layout) | | ✓ |
| void fill_fragment(fragment<...> &a, const T &value) | | ✓ |
| void mma_sync(fragment<...> &d, const fragment<...> &a, const fragment<...> &b, const fragment<...> &c, bool sat) | | ✓ |
Certain architectures that support CUDA allow threads to progress independently of each other. This independent thread scheduling makes intra-warp synchronization possible.
-HIP does not support this type of scheduling.
-The CUDA __prof_trigger() instruction is not supported.
-The assert function is supported in HIP. It is used for debugging: when the input expression evaluates to zero, execution stops.
- void assert(int input)
-There are two implementations of assert, depending on the use case:
- The host version of assert, which is defined in assert.h.
- The device version of assert, which is implemented in hip/hip_runtime.h.
-Users need to include assert.h to use assert. For assert to work in both device and host functions, users need to include "hip/hip_runtime.h".
-HIP provides the function abort() which can be used to terminate the application when terminal failures are detected. It is implemented using the __builtin_trap() function.
-This function produces an effect similar to using asm("trap") in CUDA code.
-Note: In HIP, the function terminates the entire application, while in CUDA, asm("trap") only terminates the dispatch and the application continues to run.
-printf function is supported in HIP. The following is a simple example to print information in the kernel.
-
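-#include <hip/hip_runtime.h>
-#include <cstdio>
-
- __global__ void run_printf() { printf("Hello World\n"); }
-
- int main() {
-   // Launch a single block with a single thread on the default stream.
-   run_printf<<<dim3(1), dim3(1), 0, 0>>>();
-   // Wait for the kernel so its output is flushed before the program exits.
-   hipDeviceSynchronize();
- }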
-Device-side dynamic global memory allocation is under development. HIP now includes a preliminary implementation of malloc and free that can be called from device functions.
-GPU multiprocessors have a fixed pool of resources (primarily registers and shared memory) which are shared by the actively running warps. Using more resources can increase IPC of the kernel but reduces the resources available for other warps and limits the number of warps that can be simultaneously running. Thus GPUs have a complex relationship between resource usage and performance.
-__launch_bounds__ allows the application to provide usage hints that influence the resources (primarily registers) used by the generated code. It is a function attribute that must be attached to a __global__ function:
- __global__ void __launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_WARPS_PER_EXECUTION_UNIT) MyKernel(...)
-__launch_bounds__ supports two parameters:
- MAX_THREADS_PER_BLOCK: The programmer guarantees that the kernel will be launched with fewer threads than MAX_THREADS_PER_BLOCK. (On NVCC this maps to the .maxntid PTX directive.) If no launch_bounds is specified, MAX_THREADS_PER_BLOCK is the maximum block size supported by the device (typically 1024 or larger). Specifying MAX_THREADS_PER_BLOCK less than the maximum effectively allows the compiler to use more resources than a default unconstrained compilation that supports all possible block sizes at launch time. The threads-per-block value is the product (blockDim.x * blockDim.y * blockDim.z).
- MIN_WARPS_PER_EXECUTION_UNIT: Directs the compiler to minimize resource usage so that the requested number of warps can be simultaneously active on a multiprocessor. Since active warps compete for the same fixed pool of resources, the compiler must reduce the resources required by each warp (primarily registers). MIN_WARPS_PER_EXECUTION_UNIT is optional and defaults to 1 if not specified. Specifying a MIN_WARPS_PER_EXECUTION_UNIT greater than the default 1 effectively constrains the compiler's resource usage.
-When a kernel is launched with HIP APIs, for example hipModuleLaunchKernel(), HIP validates that the kernel dimension size is not larger than the specified launch_bounds. If it is, HIP returns a launch failure. If AMD_LOG_LEVEL is set to an appropriate value (for details, refer to docs/markdown/hip_logging.md), the error log shows detailed information, including the launch parameters (kernel dimension size and launch bounds) and the name of the faulting kernel. This helps identify the faulting kernel, and the kernel dimension size and launch bounds values also assist in debugging such failures.
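The check described above amounts to comparing the product of the block dimensions with the declared bound. A host-side C++ sketch of that comparison follows; `BlockDim`, `threads_per_block` and `launch_within_bounds` are hypothetical names used for illustration, not part of the HIP API.

```cpp
#include <cassert>

// Stand-in for the three block dimensions passed at launch time.
struct BlockDim {
    unsigned x;
    unsigned y;
    unsigned z;
};

// Threads per block is the product of the three block dimensions.
unsigned long long threads_per_block(BlockDim d) {
    return 1ULL * d.x * d.y * d.z;
}

// True when the launch respects the MAX_THREADS_PER_BLOCK promise,
// i.e. the block is not larger than the declared bound.
bool launch_within_bounds(BlockDim block, unsigned max_threads_per_block) {
    return threads_per_block(block) <= max_threads_per_block;
}
```

A 16 x 16 x 1 block fits a bound of 256, while a 32 x 16 x 1 block (512 threads) does not and would be rejected at launch.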
-The compiler uses these parameters as follows:
- The compiler uses the hints only to manage register usage, and does not automatically reduce shared memory or other resources.
- Compilation fails if the compiler cannot generate a kernel that meets the requirements of the specified launch bounds.
- From MAX_THREADS_PER_BLOCK, the compiler derives the maximum number of warps per block that can be used at launch time. Values of MAX_THREADS_PER_BLOCK less than the default allow the compiler to use a larger pool of registers: each warp uses registers, and this hint constrains the launch to a warps-per-block size that is less than the maximum.
- From MIN_WARPS_PER_EXECUTION_UNIT, the compiler derives a maximum number of registers that can be used by the kernel (to meet the required number of simultaneously active blocks). If MIN_WARPS_PER_EXECUTION_UNIT is 1, then the kernel can use all registers supported by the multiprocessor.
- The compiler ensures that the registers used in the kernel are fewer than both allowed maximums, typically by spilling registers (to shared or global memory), or by using more instructions.
- The compiler may use heuristics to increase register usage, or may simply be able to avoid spilling. MAX_THREADS_PER_BLOCK is particularly useful in this case, since it allows the compiler to use more registers and avoid situations where the compiler constrains the register usage (potentially spilling) to meet the requirements of a large block size that is never used at launch time.
-A compute unit (CU) is responsible for executing the waves of a work-group. It is composed of one or more execution units (EU) which are responsible for executing waves. An EU can have enough resources to maintain the state of more than one executing wave. This allows an EU to hide latency by switching between waves in a similar way to simultaneous multithreading on a CPU. In order to allow the state for multiple waves to fit on an EU, the resources used by a single wave have to be limited. Limiting such resources can allow greater latency hiding, but can result in having to spill some register state to memory. This attribute allows an advanced developer to tune the number of waves that are capable of fitting within the resources of an EU. It can be used to ensure at least a certain number will fit to help hide latency, and can also be used to ensure no more than a certain number will fit to limit cache thrashing.
-CUDA defines a __launch_bounds which is also designed to control occupancy:
- MIN_WARPS_PER_EXECUTION_UNIT = (MIN_BLOCKS_PER_MULTIPROCESSOR * MAX_THREADS_PER_BLOCK) / 32
-The key differences in the interface are:
- Warps (rather than blocks): The developer is trying to tell the compiler to control resource utilization to guarantee some amount of active warps per EU for latency hiding. Specifying active warps in terms of blocks appears to hide the micro-architectural details of the warp size, but makes the interface more confusing since the developer ultimately needs to compute the number of warps to obtain the desired level of control.
- Execution units (rather than multiprocessors): The use of execution units rather than multiprocessors provides support for architectures with multiple execution units per multiprocessor. For example, the AMD GCN architecture has 4 execution units per multiprocessor. The hipDeviceProps has a field executionUnitsPerMultiprocessor. Platform-specific coding techniques such as #ifdef can be used to specify different launch_bounds for NVCC and HIP-Clang platforms, if desired.
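When porting a CUDA-style bound expressed in blocks per multiprocessor, the second parameter has to be re-expressed in warps. The sketch below assumes the conversion divides by the 32-lane CUDA warp size; that divisor and the function name `min_warps_per_execution_unit` are assumptions made for illustration.

```cpp
#include <cassert>

// Convert CUDA-style launch bounds (blocks per multiprocessor) into a warp
// count. Assumption: the divisor is the CUDA warp size of 32 lanes.
int min_warps_per_execution_unit(int min_blocks_per_multiprocessor,
                                 int max_threads_per_block) {
    const int cuda_warp_size = 32;
    return (min_blocks_per_multiprocessor * max_threads_per_block) / cuda_warp_size;
}
```

For example, a CUDA bound of 256 threads per block with at least 2 resident blocks converts to 16 warps under this assumption.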
-Unlike NVCC, HIP-Clang does not support the --maxrregcount option. Instead, users are encouraged to use the hip_launch_bounds directive, since its parameters are more intuitive and portable than micro-architecture details like registers, and the directive allows per-kernel control rather than control over an entire file. hip_launch_bounds works on both HIP-Clang and NVCC targets.
-typedef void (* hipStreamCallback_t )(hipStream_t stream, hipError_t status, void *userData)
-Stream CallBack struct
-hipError_t hipStreamCreate ( hipStream_t *stream )
-Create an asynchronous stream.
-Create a new asynchronous stream. stream returns an opaque handle that can be used to reference the newly created stream in subsequent hipStream* commands. The stream is allocated on the heap and will remain allocated even if the handle goes out-of-scope. To release the memory used by the stream, application must call hipStreamDestroy.
-hipStreamCreateWithFlags , hipStreamCreateWithPriority , hipStreamSynchronize , hipStreamWaitEvent , hipStreamDestroy
-stream -[inout] Valid pointer to hipStream_t. This function writes the memory with the newly created stream.
-hipSuccess, hipErrorInvalidValue
-hipError_t hipStreamCreateWithFlags ( hipStream_t *stream, unsigned int flags )
-Create an asynchronous stream.
-Create a new asynchronous stream. stream returns an opaque handle that can be used to reference the newly created stream in subsequent hipStream* commands. The stream is allocated on the heap and will remain allocated even if the handle goes out-of-scope. To release the memory used by the stream, application must call hipStreamDestroy. Flags controls behavior of the stream. See hipStreamDefault, hipStreamNonBlocking.
-hipStreamCreate , hipStreamCreateWithPriority , hipStreamSynchronize , hipStreamWaitEvent , hipStreamDestroy
-hipSuccess, hipErrorInvalidValue
-hipError_t hipStreamCreateWithPriority ( hipStream_t *stream, unsigned int flags, int priority )
-Create an asynchronous stream with the specified priority.
-Create a new asynchronous stream with the specified priority. stream returns an opaque handle that can be used to reference the newly created stream in subsequent hipStream* commands. The stream is allocated on the heap and will remain allocated even if the handle goes out-of-scope. To release the memory used by the stream, application must call hipStreamDestroy. Flags controls behavior of the stream. See hipStreamDefault, hipStreamNonBlocking.
-hipStreamCreate , hipStreamSynchronize , hipStreamWaitEvent , hipStreamDestroy
-hipSuccess, hipErrorInvalidValue
-hipError_t hipDeviceGetStreamPriorityRange ( int *leastPriority, int *greatestPriority )
-Returns numerical values that correspond to the least and greatest stream priority.
-Returns in *leastPriority and *greatestPriority the numerical values that correspond to the least and greatest stream priority respectively. Stream priorities follow a convention where lower numbers imply greater priorities. The range of meaningful stream priorities is given by [*greatestPriority, *leastPriority]. If the user attempts to create a stream with a priority value that is outside the meaningful range as specified by this API, the priority is automatically clamped to within the valid range.
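The clamping rule described above can be modelled without any HIP calls. The sketch below is hedged: `clamp_stream_priority` is a hypothetical helper, and the example range of -1 (greatest) to 0 (least) is an assumed value for illustration, not one reported by a real device.

```cpp
// Hypothetical model of the clamping behaviour described above.
// Lower numbers mean greater priority, so greatest_priority <= least_priority
// and the meaningful range is [greatest_priority, least_priority].
constexpr int clamp_stream_priority(int requested,
                                    int least_priority,
                                    int greatest_priority) {
    return requested < greatest_priority
               ? greatest_priority
               : (requested > least_priority ? least_priority : requested);
}

// Assumed example range: greatest = -1, least = 0.
static_assert(clamp_stream_priority(5, 0, -1) == 0, "too low a priority clamps to least");
static_assert(clamp_stream_priority(-7, 0, -1) == -1, "too high a priority clamps to greatest");
static_assert(clamp_stream_priority(-1, 0, -1) == -1, "in-range values are unchanged");
```

In real code the two bounds would come from hipDeviceGetStreamPriorityRange, and the runtime performs this clamping itself when a stream is created.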
-hipSuccess
-hipError_t hipStreamDestroy ( hipStream_t stream )
-Destroys the specified stream.
-If commands are still executing on the specified stream, some may complete execution before the queue is deleted.
-The queue may be destroyed while some commands are still inflight, or may wait for all commands queued to the stream before destroying it.
-hipStreamCreate , hipStreamCreateWithFlags , hipStreamCreateWithPriority , hipStreamQuery , hipStreamWaitEvent , hipStreamSynchronize
-stream -[in] stream identifier.
-hipSuccess, hipErrorInvalidHandle
-hipError_t hipStreamQuery ( hipStream_t stream )
-Return hipSuccess if all of the operations in the specified stream have completed, or hipErrorNotReady if not.
-This is thread-safe and returns a snapshot of the current state of the queue. However, if other host threads are sending work to the stream, the status may change immediately after the function is called. It is typically used for debugging.
-hipStreamCreate , hipStreamCreateWithFlags , hipStreamCreateWithPriority , hipStreamWaitEvent , hipStreamSynchronize , hipStreamDestroy
-stream -[in] stream to query
-hipSuccess, hipErrorNotReady, hipErrorInvalidHandle
-hipError_t hipStreamSynchronize ( hipStream_t stream )
-Wait for all commands in stream to complete.
-This command is host-synchronous : the host will block until the specified stream is empty.
-This command follows standard null-stream semantics. Specifically, specifying the null stream will cause the command to wait for other streams on the same device to complete all pending operations.
-This command honors the hipDeviceLaunchBlocking flag, which controls whether the wait is active or blocking.
-hipStreamCreate , hipStreamCreateWithFlags , hipStreamCreateWithPriority , hipStreamWaitEvent , hipStreamDestroy
-stream -[in] stream identifier.
-hipSuccess, hipErrorInvalidHandle
-hipError_t hipStreamWaitEvent ( hipStream_t stream, hipEvent_t event, unsigned int flags )
-Make the specified compute stream wait for an event.
-This function inserts a wait operation into the specified stream. All future work submitted to stream will wait until event reports completion before beginning execution.
-This function only waits for commands in the current stream to complete. Notably, this function does not implicitly wait for commands in the default stream to complete, even if the specified stream is created with hipStreamNonBlocking = 0.
-hipStreamCreate , hipStreamCreateWithFlags , hipStreamCreateWithPriority , hipStreamSynchronize , hipStreamDestroy
-hipSuccess, hipErrorInvalidHandle
-hipError_t hipStreamGetFlags ( hipStream_t stream, unsigned int *flags )
-Return flags associated with this stream.
-Return flags associated with this stream in *flags.
-hipStreamCreateWithFlags
-hipSuccess, hipErrorInvalidValue, hipErrorInvalidHandle
-hipError_t hipStreamGetPriority ( hipStream_t stream, int *priority )
-Query the priority of a stream.
-Query the priority of a stream. The priority is returned in priority.
-hipStreamCreateWithFlags
-hipSuccess, hipErrorInvalidValue, hipErrorInvalidHandle
-hipError_t hipStreamGetDevice ( hipStream_t stream, hipDevice_t *device )
-Get the device associated with the stream.
-hipStreamCreate , hipStreamDestroy , hipDeviceGetStreamPriorityRange
-hipSuccess, hipErrorInvalidValue, hipErrorContextIsDestroyed, hipErrorInvalidHandle, hipErrorNotInitialized, hipErrorDeinitialized, hipErrorInvalidContext
-hipError_t hipExtStreamCreateWithCUMask ( hipStream_t *stream, uint32_t cuMaskSize, const uint32_t *cuMask )
-Create an asynchronous stream with the specified CU mask.
-Create a new asynchronous stream with the specified CU mask. stream returns an opaque handle that can be used to reference the newly created stream in subsequent hipStream* commands. The stream is allocated on the heap and will remain allocated even if the handle goes out-of-scope. To release the memory used by the stream, application must call hipStreamDestroy.
-hipStreamCreate , hipStreamSynchronize , hipStreamWaitEvent , hipStreamDestroy
-hipSuccess, hipErrorInvalidHandle, hipErrorInvalidValue
-hipError_t hipExtStreamGetCUMask ( hipStream_t stream, uint32_t cuMaskSize, uint32_t *cuMask )
-Get CU mask associated with an asynchronous stream.
-hipStreamCreate , hipStreamSynchronize , hipStreamWaitEvent , hipStreamDestroy
-hipSuccess, hipErrorInvalidHandle, hipErrorInvalidValue
-hipError_t hipStreamAddCallback ( hipStream_t stream, hipStreamCallback_t callback, void *userData, unsigned int flags )
-Adds a callback to be called on the host after all currently enqueued items in the stream have completed. For each hipStreamAddCallback call, a callback will be executed exactly once. The callback will block later work in the stream until it is finished.
-hipStreamCreate , hipStreamCreateWithFlags , hipStreamQuery , hipStreamSynchronize , hipStreamWaitEvent , hipStreamDestroy , hipStreamCreateWithPriority
-hipSuccess, hipErrorInvalidHandle, hipErrorNotSupported
-static inline hipError_t hipMallocAsync ( void **dev_ptr, size_t size, hipMemPool_t mem_pool, hipStream_t stream )
-C++ wrappers for allocations from a memory pool.
-This section describes wrappers for stream Ordered allocation from memory pool functions of HIP runtime API.
-This is an alternate C++ call for hipMallocFromPoolAsync, made available through function overloading.
-hipMallocFromPoolAsync
-Note: APIs in this section are implemented on Linux, under development on Windows.
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-static inline hipError_t hipMallocAsync ( T **dev_ptr, size_t size, hipMemPool_t mem_pool, hipStream_t stream )
-C++ wrappers for allocations from a memory pool on the stream.
-This is an alternate C++ call for hipMallocFromPoolAsync, made available through function overloading.
-hipMallocFromPoolAsync
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-static inline hipError_t hipMallocAsync ( T **dev_ptr, size_t size, hipStream_t stream )
-C++ wrappers for allocations from a memory pool.
-This is an alternate C++ call for hipMallocFromPoolAsync, made available through function overloading.
-hipMallocFromPoolAsync
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-static inline hipError_t hipMallocFromPoolAsync ( T **dev_ptr, size_t size, hipMemPool_t mem_pool, hipStream_t stream )
-C++ wrappers for allocations from a memory pool.
-This is an alternate C++ call for hipMallocFromPoolAsync, made available through function overloading.
-hipMallocFromPoolAsync
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-hipError_t hipMallocAsync ( void **dev_ptr, size_t size, hipStream_t stream )
-Allocates memory with stream ordered semantics.
-Inserts a memory allocation operation into stream. A pointer to the allocated memory is returned immediately in *dev_ptr. The allocation must not be accessed until the allocation operation completes. The allocation comes from the memory pool associated with the stream's device.
-hipMallocFromPoolAsync , hipFreeAsync , hipMemPoolTrimTo , hipMemPoolGetAttribute , hipDeviceSetMemPool, hipMemPoolSetAttribute , hipMemPoolSetAccess , hipMemPoolGetAccess
-Note: The default memory pool of a device contains device memory from that device.
-Note: Basic stream ordering allows future work submitted into the same stream to use the allocation. Stream query, stream synchronize, and HIP events can be used to guarantee that the allocation operation completes before work submitted in a separate stream runs.
-Note: During stream capture, this function results in the creation of an allocation node. In this case, the allocation is owned by the graph instead of the memory pool. The memory pool's properties are used to set the node's creation parameters.
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported, hipErrorOutOfMemory
-hipError_t hipFreeAsync ( void *dev_ptr, hipStream_t stream )
-Frees memory with stream ordered semantics.
-Inserts a free operation into stream . The allocation must not be used after stream execution reaches the free. After this API returns, accessing the memory from any subsequent work launched on the GPU or querying its pointer attributes results in undefined behavior.
-hipMallocFromPoolAsync , hipMallocAsync , hipMemPoolTrimTo , hipMemPoolGetAttribute , hipDeviceSetMemPool, hipMemPoolSetAttribute , hipMemPoolSetAccess , hipMemPoolGetAccess
-Note: During stream capture, this function results in the creation of a free node and must therefore be passed the address of a graph allocation.
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-hipError_t hipMemPoolTrimTo ( hipMemPool_t mem_pool, size_t min_bytes_to_hold )
-Releases freed memory back to the OS.
-Releases memory back to the OS until the pool contains fewer than min_bytes_to_hold reserved bytes, or there is no more memory that the allocator can safely release. The allocator cannot release OS allocations that back outstanding asynchronous allocations. The OS allocations may happen at different granularity from the user allocations.
-hipMallocFromPoolAsync , hipMallocAsync , hipFreeAsync , hipMemPoolGetAttribute , hipDeviceSetMemPool, hipMemPoolSetAttribute , hipMemPoolSetAccess , hipMemPoolGetAccess
-Note: Allocations that have not been freed count as outstanding.
-Note: Allocations that have been asynchronously freed but whose completion has not been observed on the host (e.g., by a synchronize call) can count as outstanding.
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-hipSuccess, hipErrorInvalidValue
-hipError_t hipMemPoolSetAttribute ( hipMemPool_t mem_pool, hipMemPoolAttr attr, void *value )
-Sets attributes of a memory pool.
-Supported attributes are:
-hipMallocFromPoolAsync , hipMallocAsync , hipFreeAsync , hipMemPoolGetAttribute , hipMemPoolTrimTo , hipDeviceSetMemPool, hipMemPoolSetAccess , hipMemPoolGetAccess
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-hipSuccess, hipErrorInvalidValue
-hipError_t hipMemPoolGetAttribute ( hipMemPool_t mem_pool, hipMemPoolAttr attr, void *value )
-Gets attributes of a memory pool.
-Supported attributes are:
-hipMallocFromPoolAsync , hipMallocAsync , hipFreeAsync , hipMemPoolTrimTo , hipDeviceSetMemPool, hipMemPoolSetAttribute , hipMemPoolSetAccess , hipMemPoolGetAccess
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-hipSuccess, hipErrorInvalidValue
-hipError_t hipMemPoolSetAccess ( hipMemPool_t mem_pool, const hipMemAccessDesc *desc_list, size_t count )
-Controls visibility of the specified pool between devices.
-hipMallocFromPoolAsync , hipMallocAsync , hipFreeAsync , hipMemPoolGetAttribute , hipMemPoolTrimTo , hipDeviceSetMemPool, hipMemPoolSetAttribute , hipMemPoolGetAccess
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-hipSuccess, hipErrorInvalidValue
-hipError_t hipMemPoolGetAccess ( hipMemAccessFlags *flags, hipMemPool_t mem_pool, hipMemLocation *location )
-Returns the accessibility of a pool from a device.
-Returns the accessibility of the pool's memory from the specified location.
-hipMallocFromPoolAsync , hipMallocAsync , hipFreeAsync , hipMemPoolGetAttribute , hipMemPoolTrimTo , hipDeviceSetMemPool, hipMemPoolSetAttribute , hipMemPoolSetAccess
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-hipSuccess, hipErrorInvalidValue
-hipError_t hipMemPoolCreate ( hipMemPool_t *mem_pool, const hipMemPoolProps *pool_props )
-Creates a memory pool.
-Creates a HIP memory pool and returns the handle in mem_pool . The pool_props determines the properties of the pool such as the backing device and IPC capabilities.
-By default, the memory pool will be accessible from the device it is allocated on.
-hipMallocFromPoolAsync , hipMallocAsync , hipFreeAsync , hipMemPoolGetAttribute , hipMemPoolDestroy , hipMemPoolTrimTo , hipDeviceSetMemPool, hipMemPoolSetAttribute , hipMemPoolSetAccess , hipMemPoolGetAccess
-Note: Specifying hipMemHandleTypeNone creates a memory pool that will not support IPC.
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-hipError_t hipMemPoolDestroy ( hipMemPool_t mem_pool )
-Destroys the specified memory pool.
-If any pointers obtained from this pool haven't been freed or the pool has free operations that haven't completed when hipMemPoolDestroy is invoked, the function will return immediately and the resources associated with the pool will be released automatically once there are no more outstanding allocations.
-Destroying the current mempool of a device sets the default mempool of that device as the current mempool for that device.
-hipMallocFromPoolAsync , hipMallocAsync , hipFreeAsync , hipMemPoolGetAttribute , hipMemPoolCreate , hipMemPoolTrimTo , hipDeviceSetMemPool, hipMemPoolSetAttribute , hipMemPoolSetAccess , hipMemPoolGetAccess
-Note: A device's default memory pool cannot be destroyed.
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-mem_pool -[in] Memory pool for destruction
-hipSuccess, hipErrorInvalidValue
-hipError_t hipMallocFromPoolAsync ( void **dev_ptr, size_t size, hipMemPool_t mem_pool, hipStream_t stream )
-Allocates memory from a specified pool with stream ordered semantics.
-Inserts an allocation operation into stream . A pointer to the allocated memory is returned immediately in dev_ptr . The allocation must not be accessed until the allocation operation completes. The allocation comes from the specified memory pool.
-Basic stream ordering allows future work submitted into the same stream to use the allocation. Stream query, stream synchronize, and HIP events can be used to guarantee that the allocation operation completes before work submitted in a separate stream runs.
-hipMallocAsync , hipFreeAsync , hipMemPoolGetAttribute , hipMemPoolCreate , hipMemPoolTrimTo , hipDeviceSetMemPool, hipMemPoolSetAttribute , hipMemPoolSetAccess , hipMemPoolGetAccess
-Note: The specified memory pool may be from a device different than that of the specified stream .
-Note: During stream capture, this function results in the creation of an allocation node. In this case, the allocation is owned by the graph instead of the memory pool. The memory pool's properties are used to set the node's creation parameters.
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported, hipErrorOutOfMemory
-hipError_t hipMemPoolExportToShareableHandle ( void *shared_handle, hipMemPool_t mem_pool, hipMemAllocationHandleType handle_type, unsigned int flags )
-Exports a memory pool to the requested handle type.
-Given an IPC capable mempool, create an OS handle to share the pool with another process. A recipient process can convert the shareable handle into a mempool with hipMemPoolImportFromShareableHandle . Individual pointers can then be shared with the hipMemPoolExportPointer and hipMemPoolImportPointer APIs. The implementation of what the shareable handle is and how it can be transferred is defined by the requested handle type.
-hipMemPoolImportFromShareableHandle
-Note: To create an IPC capable mempool, create a mempool with a hipMemAllocationHandleType other than hipMemHandleTypeNone .
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-hipSuccess, hipErrorInvalidValue, hipErrorOutOfMemory
-hipError_t hipMemPoolImportFromShareableHandle ( hipMemPool_t *mem_pool, void *shared_handle, hipMemAllocationHandleType handle_type, unsigned int flags )
-Imports a memory pool from a shared handle.
-Specific allocations can be imported from the imported pool with hipMemPoolImportPointer .
-hipMemPoolExportToShareableHandle
-Note: Imported memory pools do not support creating new allocations. As such imported memory pools may not be used in hipDeviceSetMemPool or hipMallocFromPoolAsync calls.
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-hipSuccess, hipErrorInvalidValue, hipErrorOutOfMemory
-hipError_t hipMemPoolExportPointer ( hipMemPoolPtrExportData *export_data, void *dev_ptr )
-Export data to share a memory pool allocation between processes.
-Constructs export_data for sharing a specific allocation from an already shared memory pool. The recipient process can import the allocation with the hipMemPoolImportPointer api. The data is not a handle and may be shared through any IPC mechanism.
-hipMemPoolImportPointer
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-hipSuccess, hipErrorInvalidValue, hipErrorOutOfMemory
-hipError_t hipMemPoolImportPointer ( void **dev_ptr, hipMemPool_t mem_pool, hipMemPoolPtrExportData *export_data )
-Import a memory pool allocation from another process.
-Returns in dev_ptr a pointer to the imported memory. The imported memory must not be accessed before the allocation operation completes in the exporting process. The imported memory must be freed from all importing processes before being freed in the exporting process. The pointer may be freed with hipFree or hipFreeAsync . If hipFreeAsync is used, the free must be completed on the importing process before the free operation on the exporting process.
-hipMemPoolExportPointer
-Note: The hipFreeAsync api may be used in the exporting process before the hipFreeAsync operation completes in its stream as long as the hipFreeAsync in the exporting process specifies a stream with a stream dependency on the importing process's hipFreeAsync .
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-hipSuccess, hipErrorInvalidValue, hipErrorNotInitialized, hipErrorOutOfMemory
-hipError_t hipDeviceCanAccessPeer ( int *canAccessPeer, int deviceId, int peerDeviceId )
-Determine if a device can access a peer's memory.
-Returns '1' in canAccessPeer if the specified device is capable of directly accessing memory physically located on peerDevice , or '0' if not.
-Returns '0' in canAccessPeer if deviceId == peerDeviceId, and both are valid devices : a device is not a peer of itself.
-hipSuccess, hipErrorInvalidDevice if deviceId or peerDeviceId are not valid devices
-hipError_t hipDeviceEnablePeerAccess ( int peerDeviceId, unsigned int flags )
-Enable direct access from current device's virtual address space to memory allocations physically located on a peer device.
-Memory that is already allocated on the peer device will be mapped into the address space of the current device. In addition, all future memory allocations on peerDeviceId will be mapped into the address space of the current device when the memory is allocated. The peer memory remains accessible from the current device until a call to hipDeviceDisablePeerAccess or hipDeviceReset.
-Returns hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue, or hipErrorPeerAccessAlreadyEnabled if peer access is already enabled for this device.
-hipError_t hipDeviceDisablePeerAccess ( int peerDeviceId )
-Disable direct access from current device's virtual address space to memory allocations physically located on a peer device.
-Returns hipErrorPeerAccessNotEnabled if direct access to memory on peerDevice has not yet been enabled from the current device.
-peerDeviceId -[in] Peer device to disable direct access to
-hipSuccess, hipErrorPeerAccessNotEnabled
-hipError_t hipMemGetAddressRange ( hipDeviceptr_t *pbase, size_t *psize, hipDeviceptr_t dptr )
-Get information on memory allocations.
-hipCtxCreate, hipCtxDestroy, hipCtxGetFlags, hipCtxPopCurrent, hipCtxGetCurrent, hipCtxSetCurrent, hipCtxPushCurrent, hipCtxSetCacheConfig, hipCtxSynchronize, hipCtxGetDevice
-hipSuccess, hipErrorNotFound
-hipError_t hipPointerSetAttribute ( const void *value, hipPointer_attribute attribute, hipDeviceptr_t ptr )
-Sets information on the specified pointer.[BETA].
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue
-hipError_t hipPointerGetAttributes ( hipPointerAttribute_t *attributes, const void *ptr )
-Returns attributes for the specified pointer.
-The output parameter 'attributes' has a member named 'type' that describes what memory the pointer is associated with, such as device memory, host memory, managed memory, and others. Otherwise, the API cannot handle the pointer and returns hipErrorInvalidValue.
-hipPointerGetAttribute
-Note: The unrecognized memory type is not supported, in order to keep HIP backward compatible with the existing hipMemoryType enum values.
-Note: The current behavior of this HIP API corresponds to the CUDA API before version 11.0.
-hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue
-hipError_t hipPointerGetAttribute ( void *data, hipPointer_attribute attribute, hipDeviceptr_t ptr )
-Returns information about the specified pointer.[BETA].
-hipPointerGetAttributes
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue
-hipError_t hipDrvPointerGetAttributes ( unsigned int numAttributes, hipPointer_attribute *attributes, void **data, hipDeviceptr_t ptr )
-Returns information about the specified pointer.[BETA].
-hipPointerGetAttribute
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue
-hipError_t hipMalloc ( void **ptr, size_t size )
-Allocate memory on the default accelerator.
-If size is 0, no memory is allocated, *ptr returns nullptr, and hipSuccess is returned.
-hipMallocPitch , hipFree , hipMallocArray , hipFreeArray , hipMalloc3D , hipMalloc3DArray , hipHostFree , hipHostMalloc
-hipSuccess, hipErrorOutOfMemory, hipErrorInvalidValue (bad context, null *ptr)
-hipError_t hipExtMallocWithFlags ( void **ptr, size_t sizeBytes, unsigned int flags )
-Allocate memory on the default accelerator.
-If requested memory size is 0, no memory is allocated, *ptr returns nullptr, and hipSuccess is returned.
-The memory allocation flag should be either hipDeviceMallocDefault, hipDeviceMallocFinegrained, hipDeviceMallocUncached, or hipMallocSignalMemory. If the flag is any other value, the API returns hipErrorInvalidValue.
-hipMallocPitch , hipFree , hipMallocArray , hipFreeArray , hipMalloc3D , hipMalloc3DArray , hipHostFree , hipHostMalloc
-hipSuccess, hipErrorOutOfMemory, hipErrorInvalidValue (bad context, null *ptr)
-hipError_t hipMallocHost ( void **ptr, size_t size )
-Allocate pinned host memory [Deprecated].
-If size is 0, no memory is allocated, *ptr returns nullptr, and hipSuccess is returned.
-Warning: This API is deprecated, use hipHostMalloc() instead
-hipSuccess, hipErrorOutOfMemory
-Allocate pinned host memory [Deprecated].
-If size is 0, no memory is allocated, *ptr returns nullptr, and hipSuccess is returned.
-Warning: This API is deprecated, use hipHostMalloc() instead
-hipSuccess, hipErrorOutOfMemory
-hipError_t hipHostMalloc ( void **ptr, size_t size, unsigned int flags )
-Allocates device accessible page locked (pinned) host memory.
-This API allocates pinned host memory that is mapped into the address space of all GPUs in the system. The memory can be accessed directly by the GPU device and can be read or written with much higher bandwidth than pageable memory obtained with functions such as malloc().
-Using the pinned host memory, applications can implement faster data transfers for HostToDevice and DeviceToHost. The runtime tracks the hipHostMalloc allocations and can avoid some of the setup required for regular unpinned memory.
-When the memory accesses are infrequent, zero-copy memory can be a good choice, for coherent allocation. GPU can directly access the host memory over the CPU/GPU interconnect, without need to copy the data.
-Currently the allocation granularity is 4KB for the API.
-Developers need to choose proper allocation flag with consideration of synchronization.
-If no flags are specified, the default pinned memory allocation on the host is used.
-hipSetDeviceFlags, hipHostFree
-hipSuccess, hipErrorOutOfMemory
-hipError_t hipHostAlloc ( void **ptr, size_t size, unsigned int flags )
-Allocate device accessible page locked host memory [Deprecated].
-If size is 0, no memory is allocated, *ptr returns nullptr, and hipSuccess is returned.
-Warning: This API is deprecated, use hipHostMalloc() instead
-hipSuccess, hipErrorOutOfMemory
-hipError_t hipHostGetDevicePointer ( void **devPtr, void *hstPtr, unsigned int flags )
-Get Device pointer from Host Pointer allocated through hipHostMalloc.
-hipSetDeviceFlags, hipHostMalloc
-hipSuccess, hipErrorInvalidValue, hipErrorOutOfMemory
-hipError_t hipHostGetFlags ( unsigned int *flagsPtr, void *hostPtr )
-Return flags associated with host pointer.
-hipSuccess, hipErrorInvalidValue
-hipError_t hipHostRegister ( void *hostPtr, size_t sizeBytes, unsigned int flags )
-Register host memory so it can be accessed from the current device.
-After registering the memory, use hipHostGetDevicePointer to obtain the mapped device pointer. On many systems, the mapped device pointer will have a different value than the mapped host pointer. Applications must use the device pointer in device code, and the host pointer in host code.
-On some systems, registered memory is pinned. On other systems, registered memory may not actually be pinned but instead uses OS or hardware facilities to allow GPU access to the host memory.
-Developers are strongly encouraged to register memory blocks that are aligned to the host cache-line size (typically 64 bytes, but the value can be obtained from the CPUID instruction).
-If registering non-aligned pointers, the application must take care when registering pointers from the same cache line on different devices. HIP's coarse-grained synchronization model does not guarantee correct results if different devices write to different parts of the same cache block: typically one of the writes will 'win' and overwrite data from the other registered memory region.
-hipHostUnregister , hipHostGetFlags , hipHostGetDevicePointer
-hipSuccess, hipErrorOutOfMemory hipError_t hipHostUnregister ( void *hostPtr )
-Un-register host pointer.
-hipHostRegister
-hostPtr -[in] Host pointer previously registered with hipHostRegister
-Error code hipError_t hipMallocPitch ( void **ptr, size_t *pitch, size_t width, size_t height )
-Allocates at least width (in bytes) * height bytes of linear memory. Padding may occur to ensure alignment requirements are met for the given row. The change in width size due to padding will be returned in *pitch. Currently the alignment is set to 128 bytes.
-If size is 0, no memory is allocated, *ptr returns nullptr, and hipSuccess is returned.
-hipMalloc , hipFree , hipMallocArray , hipFreeArray , hipHostFree , hipMalloc3D , hipMalloc3DArray , hipHostMalloc
-Error code hipError_t hipMemAllocPitch ( hipDeviceptr_t *dptr, size_t *pitch, size_t widthInBytes, size_t height, unsigned int elementSizeBytes )
-Allocates at least width (in bytes) * height bytes of linear memory. Padding may occur to ensure alignment requirements are met for the given row. The change in width size due to padding will be returned in *pitch. Currently the alignment is set to 128 bytes.
-If size is 0, no memory is allocated, ptr returns nullptr, and hipSuccess is returned. The intended usage of pitch is as a separate parameter of the allocation, used to compute addresses within the 2D array. Given the row and column of an array element of type T, the address is computed as: T* pElement = (T*)((char*)BaseAddress + Row * Pitch) + Column;
-hipMalloc , hipFree , hipMallocArray , hipFreeArray , hipHostFree , hipMalloc3D , hipMalloc3DArray , hipHostMalloc
-Error code
-Free memory allocated by the hcc hip memory allocation API. This API performs an implicit hipDeviceSynchronize() call. If pointer is NULL, the hip runtime is initialized and hipSuccess is returned.
-hipMalloc , hipMallocPitch , hipMallocArray , hipFreeArray , hipHostFree , hipMalloc3D , hipMalloc3DArray , hipHostMalloc
-ptr -[in] Pointer to memory to be freed
-hipSuccess
-hipErrorInvalidDevicePointer (if pointer is invalid, including host pointers allocated with hipHostMalloc)
-Free memory allocated by the hcc hip host memory allocation API [Deprecated].
-Warning: This API is deprecated, use hipHostFree() instead
-ptr -[in] Pointer to memory to be freed
-hipSuccess, hipErrorInvalidValue (if pointer is invalid, including device pointers allocated with hipMalloc)
-Free memory allocated by the hcc hip host memory allocation API This API performs an implicit hipDeviceSynchronize() call. If pointer is NULL, the hip runtime is initialized and hipSuccess is returned.
-hipMalloc , hipMallocPitch , hipFree , hipMallocArray , hipFreeArray , hipMalloc3D , hipMalloc3DArray , hipHostMalloc
-ptr -[in] Pointer to memory to be freed
-hipSuccess, hipErrorInvalidValue (if pointer is invalid, including device pointers allocated with hipMalloc)
-hipError_t hipMemcpy ( void *dst, const void *src, size_t sizeBytes, hipMemcpyKind kind )
-Copy data from src to dst.
-It supports memory from host to device, device to host, device to device and host to host. The src and dst must not overlap.
-For hipMemcpy, the copy is always performed by the current device (set by hipSetDevice). For multi-gpu or peer-to-peer configurations, it is recommended to set the current device to the device where the src data is physically located. For optimal peer-to-peer copies, the copy device must be able to access the src and dst pointers (by calling hipDeviceEnablePeerAccess with the copy agent as the current device and src/dst as the peerDevice argument). If this is not done, hipMemcpy will still work, but will perform the copy using a staging buffer on the host. Calling hipMemcpy with dst and src pointers that do not match the hipMemcpyKind results in undefined behavior.
-hipArrayCreate , hipArrayDestroy , hipArrayGetDescriptor , hipMemAlloc, hipMemAllocHost , hipMemAllocPitch , hipMemcpy2D , hipMemcpy2DAsync , hipMemcpy2DUnaligned, hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH , hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD , hipMemcpyDtoDAsync , hipMemcpyDtoH , hipMemcpyDtoHAsync , hipMemcpyHtoA , hipMemcpyHtoAAsync, hipMemcpyHtoDAsync , hipMemFree, hipMemFreeHost, hipMemGetAddressRange , hipMemGetInfo , hipMemHostAlloc, hipMemHostGetDevicePointer
-hipSuccess, hipErrorInvalidValue, hipErrorUnknown hipError_t hipMemcpyWithStream ( void *dst, const void *src, size_t sizeBytes, hipMemcpyKind kind, hipStream_t stream )
-Memory copy on the stream. It allows single or multiple devices to do memory copy on single or multiple streams.
-hipMemcpy , hipStreamCreate , hipStreamSynchronize , hipStreamDestroy , hipSetDevice, hipLaunchKernelGGL
-hipSuccess, hipErrorInvalidValue, hipErrorUnknown, hipErrorContextIsDestroyed hipError_t hipMemcpyHtoD ( hipDeviceptr_t dst, void *src, size_t sizeBytes )
-Copy data from Host to Device.
-hipArrayCreate , hipArrayDestroy , hipArrayGetDescriptor , hipMemAlloc, hipMemAllocHost , hipMemAllocPitch , hipMemcpy2D , hipMemcpy2DAsync , hipMemcpy2DUnaligned, hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH , hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD , hipMemcpyDtoDAsync , hipMemcpyDtoH , hipMemcpyDtoHAsync , hipMemcpyHtoA , hipMemcpyHtoAAsync, hipMemcpyHtoDAsync , hipMemFree, hipMemFreeHost, hipMemGetAddressRange , hipMemGetInfo , hipMemHostAlloc, hipMemHostGetDevicePointer
-hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidValue
-hipError_t hipMemcpyDtoH ( void *dst, hipDeviceptr_t src, size_t sizeBytes )
-Copy data from Device to Host.
-hipArrayCreate , hipArrayDestroy , hipArrayGetDescriptor , hipMemAlloc, hipMemAllocHost , hipMemAllocPitch , hipMemcpy2D , hipMemcpy2DAsync , hipMemcpy2DUnaligned, hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH , hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD , hipMemcpyDtoDAsync , hipMemcpyDtoH , hipMemcpyDtoHAsync , hipMemcpyHtoA , hipMemcpyHtoAAsync, hipMemcpyHtoDAsync , hipMemFree, hipMemFreeHost, hipMemGetAddressRange , hipMemGetInfo , hipMemHostAlloc, hipMemHostGetDevicePointer
-hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidValue hipError_t hipMemcpyDtoD ( hipDeviceptr_t dst, hipDeviceptr_t src, size_t sizeBytes )
-Copy data from Device to Device.
-hipArrayCreate , hipArrayDestroy , hipArrayGetDescriptor , hipMemAlloc, hipMemAllocHost , hipMemAllocPitch , hipMemcpy2D , hipMemcpy2DAsync , hipMemcpy2DUnaligned, hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH , hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD , hipMemcpyDtoDAsync , hipMemcpyDtoH , hipMemcpyDtoHAsync , hipMemcpyHtoA , hipMemcpyHtoAAsync, hipMemcpyHtoDAsync , hipMemFree, hipMemFreeHost, hipMemGetAddressRange , hipMemGetInfo , hipMemHostAlloc, hipMemHostGetDevicePointer
-hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidValue hipError_t hipMemcpyHtoDAsync ( hipDeviceptr_t dst, void *src, size_t sizeBytes, hipStream_t stream )
-Copy data from Host to Device asynchronously.
-hipArrayCreate , hipArrayDestroy , hipArrayGetDescriptor , hipMemAlloc, hipMemAllocHost , hipMemAllocPitch , hipMemcpy2D , hipMemcpy2DAsync , hipMemcpy2DUnaligned, hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH , hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD , hipMemcpyDtoDAsync , hipMemcpyDtoH , hipMemcpyDtoHAsync , hipMemcpyHtoA , hipMemcpyHtoAAsync, hipMemcpyHtoDAsync , hipMemFree, hipMemFreeHost, hipMemGetAddressRange , hipMemGetInfo , hipMemHostAlloc, hipMemHostGetDevicePointer
-hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidValue hipError_t hipMemcpyDtoHAsync ( void *dst, hipDeviceptr_t src, size_t sizeBytes, hipStream_t stream )
-Copy data from Device to Host asynchronously.
-hipArrayCreate , hipArrayDestroy , hipArrayGetDescriptor , hipMemAlloc, hipMemAllocHost , hipMemAllocPitch , hipMemcpy2D , hipMemcpy2DAsync , hipMemcpy2DUnaligned, hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH , hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD , hipMemcpyDtoDAsync , hipMemcpyDtoH , hipMemcpyDtoHAsync , hipMemcpyHtoA , hipMemcpyHtoAAsync, hipMemcpyHtoDAsync , hipMemFree, hipMemFreeHost, hipMemGetAddressRange , hipMemGetInfo , hipMemHostAlloc, hipMemHostGetDevicePointer
-hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidValue hipError_t hipMemcpyDtoDAsync ( hipDeviceptr_t dst, hipDeviceptr_t src, size_t sizeBytes, hipStream_t stream )
-Copy data from Device to Device asynchronously.
-hipArrayCreate , hipArrayDestroy , hipArrayGetDescriptor , hipMemAlloc, hipMemAllocHost , hipMemAllocPitch , hipMemcpy2D , hipMemcpy2DAsync , hipMemcpy2DUnaligned, hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH , hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD , hipMemcpyDtoDAsync , hipMemcpyDtoH , hipMemcpyDtoHAsync , hipMemcpyHtoA , hipMemcpyHtoAAsync, hipMemcpyHtoDAsync , hipMemFree, hipMemFreeHost, hipMemGetAddressRange , hipMemGetInfo , hipMemHostAlloc, hipMemHostGetDevicePointer
-hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidValue hipError_t hipModuleGetGlobal ( hipDeviceptr_t *dptr, size_t *bytes, hipModule_t hmod, const char *name )
-Returns a global pointer from a module. Returns in *dptr and *bytes the pointer and size of the global of name name located in module hmod. If no variable of that name exists, it returns hipErrorNotFound. Both parameters dptr and bytes are optional. If one of them is NULL, it is ignored and hipSuccess is returned.
-hipSuccess, hipErrorInvalidValue, hipErrorNotFound, hipErrorInvalidContext hipError_t hipGetSymbolAddress ( void **devPtr, const void *symbol )
-Gets device pointer associated with symbol on the device.
-hipSuccess, hipErrorInvalidValue hipError_t hipGetSymbolSize ( size_t *size, const void *symbol )
-Gets the size of the given symbol on the device.
-hipSuccess, hipErrorInvalidValue hipError_t hipGetProcAddress ( const char *symbol, void **pfn, int hipVersion, uint64_t flags, hipDriverProcAddressQueryResult *symbolStatus )
-Gets the pointer of requested HIP driver function.
-Returns hipSuccess if the returned pfn points to the requested driver function.
-hipSuccess, hipErrorInvalidValue.
-hipError_t hipMemcpyToSymbol ( const void *symbol, const void *src, size_t sizeBytes, size_t offset, hipMemcpyKind kind )
-Copies data to the given symbol on the device. Symbol HIP APIs allow a kernel to define a device-side data symbol which can be accessed on the host side. The symbol can be in __constant or device space. Note that the symbol name needs to be encased in the HIP_SYMBOL macro. This also applies to hipMemcpyFromSymbol, hipGetSymbolAddress, and hipGetSymbolSize. For detailed usage, see the memcpyToSymbol example in the HIP Porting Guide.
-hipSuccess, hipErrorInvalidValue hipError_t hipMemcpyToSymbolAsync ( const void *symbol, const void *src, size_t sizeBytes, size_t offset, hipMemcpyKind kind, hipStream_t stream )
-Copies data to the given symbol on the device asynchronously.
-hipSuccess, hipErrorInvalidValue
-hipError_t hipMemcpyFromSymbol ( void *dst, const void *symbol, size_t sizeBytes, size_t offset, hipMemcpyKind kind )
-Copies data from the given symbol on the device.
-hipSuccess, hipErrorInvalidValue hipError_t hipMemcpyFromSymbolAsync ( void *dst, const void *symbol, size_t sizeBytes, size_t offset, hipMemcpyKind kind, hipStream_t stream )
-Copies data from the given symbol on the device asynchronously.
-hipSuccess, hipErrorInvalidValue hipError_t hipMemcpyAsync ( void *dst, const void *src, size_t sizeBytes, hipMemcpyKind kind, hipStream_t stream )
-Copy data from src to dst asynchronously.
-For multi-gpu or peer-to-peer configurations, it is recommended to use a stream which is attached to the device where the src data is physically located. For optimal peer-to-peer copies, the copy device must be able to access the src and dst pointers (by calling hipDeviceEnablePeerAccess with the copy agent as the current device and src/dst as the peerDevice argument). If this is not done, hipMemcpyAsync will still work, but will perform the copy using a staging buffer on the host.
-hipMemcpy , hipMemcpy2D , hipMemcpyToArray , hipMemcpy2DToArray , hipMemcpyFromArray , hipMemcpy2DFromArray , hipMemcpyArrayToArray, hipMemcpy2DArrayToArray, hipMemcpyToSymbol , hipMemcpyFromSymbol , hipMemcpy2DAsync , hipMemcpyToArrayAsync, hipMemcpy2DToArrayAsync , hipMemcpyFromArrayAsync, hipMemcpy2DFromArrayAsync , hipMemcpyToSymbolAsync , hipMemcpyFromSymbolAsync
-Warning: If host or dest are not pinned, the memory copy will be performed synchronously. For best performance, use hipHostMalloc to allocate host memory that is transferred asynchronously.
-Warning: on HCC hipMemcpyAsync does not support overlapped H2D and D2H copies. For hipMemcpy, the copy is always performed by the device associated with the specified stream.
-hipSuccess, hipErrorInvalidValue, hipErrorUnknown hipError_t hipMemset ( void *dst, int value, size_t sizeBytes )
-Fills the first sizeBytes bytes of the memory area pointed to by dest with the constant byte value value.
-hipSuccess, hipErrorInvalidValue, hipErrorNotInitialized hipError_t hipMemsetD8 ( hipDeviceptr_t dest, unsigned char value, size_t count )
-Fills the first sizeBytes bytes of the memory area pointed to by dest with the constant byte value value.
-hipSuccess, hipErrorInvalidValue, hipErrorNotInitialized hipError_t hipMemsetD8Async ( hipDeviceptr_t dest, unsigned char value, size_t count, hipStream_t stream )
-Fills the first sizeBytes bytes of the memory area pointed to by dest with the constant byte value value.
-hipMemsetD8Async() is asynchronous with respect to the host, so the call may return before the memset is complete. The operation can optionally be associated to a stream by passing a non-zero stream argument. If stream is non-zero, the operation may overlap with operations in other streams.
-hipSuccess, hipErrorInvalidValue, hipErrorNotInitialized hipError_t hipMemsetD16 ( hipDeviceptr_t dest, unsigned short value, size_t count )
-Fills the first sizeBytes bytes of the memory area pointed to by dest with the constant short value value.
-hipSuccess, hipErrorInvalidValue, hipErrorNotInitialized hipError_t hipMemsetD16Async ( hipDeviceptr_t dest, unsigned short value, size_t count, hipStream_t stream )
-Fills the first sizeBytes bytes of the memory area pointed to by dest with the constant short value value.
-hipMemsetD16Async() is asynchronous with respect to the host, so the call may return before the memset is complete. The operation can optionally be associated to a stream by passing a non-zero stream argument. If stream is non-zero, the operation may overlap with operations in other streams.
-hipSuccess, hipErrorInvalidValue, hipErrorNotInitialized hipError_t hipMemsetD32 ( hipDeviceptr_t dest, int value, size_t count )
-Fills the memory area pointed to by dest with the constant integer value for specified number of times.
-hipSuccess, hipErrorInvalidValue, hipErrorNotInitialized hipError_t hipMemsetAsync ( void *dst, int value, size_t sizeBytes, hipStream_t stream )
-Fills the first sizeBytes bytes of the memory area pointed to by dev with the constant byte value value.
-hipMemsetAsync() is asynchronous with respect to the host, so the call may return before the memset is complete. The operation can optionally be associated to a stream by passing a non-zero stream argument. If stream is nonzero, the operation may overlap with operations in other streams.
-hipSuccess, hipErrorInvalidValue hipError_t hipMemsetD32Async ( hipDeviceptr_t dst, int value, size_t count, hipStream_t stream )
-Fills the memory area pointed to by dev with the constant integer value for specified number of times.
-hipMemsetD32Async() is asynchronous with respect to the host, so the call may return before the memset is complete. The operation can optionally be associated to a stream by passing a non-zero stream argument. If stream is non-zero, the operation may overlap with operations in other streams.
-hipSuccess, hipErrorInvalidValue hipError_t hipMemset2D ( void *dst, size_t pitch, int value, size_t width, size_t height )
-Fills the memory area pointed to by dst with the constant value.
-hipSuccess, hipErrorInvalidValue hipError_t hipMemset2DAsync ( void *dst, size_t pitch, int value, size_t width, size_t height, hipStream_t stream )
-Fills asynchronously the memory area pointed to by dst with the constant value.
-hipSuccess, hipErrorInvalidValue
-hipError_t hipMemset3D ( hipPitchedPtr pitchedDevPtr, int value, hipExtent extent )
-Fills synchronously the memory area pointed to by pitchedDevPtr with the constant value.
-hipSuccess, hipErrorInvalidValue hipError_t hipMemset3DAsync ( hipPitchedPtr pitchedDevPtr, int value, hipExtent extent, hipStream_t stream )
-Fills asynchronously the memory area pointed to by pitchedDevPtr with the constant value.
-hipSuccess, hipErrorInvalidValue hipError_t hipMemGetInfo ( size_t *free, size_t *total )
-Query memory info.
-On ROCm, this function gets the actual free memory left on the current device, so it supports cases where multiple workloads are running (such as multiple processes, multiple threads, and multiple GPUs).
-Warning: On Windows, the free memory only accounts for memory allocated by this process and may be optimistic.
-hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue hipError_t hipMemPtrGetInfo ( void *ptr, size_t *size )
-Get allocated memory size via memory pointer.
-This function gets the allocated shared virtual memory size from memory pointer.
-hipSuccess, hipErrorInvalidValue
-hipError_t hipMallocArray ( hipArray_t *array, const hipChannelFormatDesc *desc, size_t width, size_t height, unsigned int flags )
-Allocate an array on the device.
-hipMalloc , hipMallocPitch , hipFree , hipFreeArray , hipHostMalloc , hipHostFree
-hipSuccess, hipErrorOutOfMemory hipError_t hipArrayCreate ( hipArray_t *pHandle, const HIP_ARRAY_DESCRIPTOR *pAllocateArray )
-Create an array memory pointer on the device.
-hipMallocArray , hipArrayDestroy , hipFreeArray
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported hipError_t hipArrayDestroy ( hipArray_t array )
-Destroy an array memory pointer on the device.
-hipArrayCreate , hipArrayDestroy , hipFreeArray
-array -[in] Pointer to the array memory
-hipSuccess, hipErrorInvalidValue
-hipError_t hipArray3DCreate ( hipArray_t *array, const HIP_ARRAY3D_DESCRIPTOR *pAllocateArray )
-Create a 3D array memory pointer on the device.
-hipMallocArray , hipArrayDestroy , hipFreeArray
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported hipError_t hipMalloc3D ( hipPitchedPtr *pitchedDevPtr, hipExtent extent )
-Create a 3D memory pointer on the device.
-hipMallocPitch , hipMemGetInfo , hipFree
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported hipError_t hipFreeArray ( hipArray_t array )
-Frees an array on the device.
-hipMalloc , hipMallocPitch , hipFree , hipMallocArray , hipHostMalloc , hipHostFree
-array -[in] Pointer to array to free
-hipSuccess, hipErrorInvalidValue, hipErrorNotInitialized hipError_t hipMalloc3DArray ( hipArray_t *array, const struct hipChannelFormatDesc *desc, struct hipExtent extent, unsigned int flags )
-Allocate an array on the device.
-hipMalloc , hipMallocPitch , hipFree , hipFreeArray , hipHostMalloc , hipHostFree
-hipSuccess, hipErrorOutOfMemory hipError_t hipArrayGetInfo ( hipChannelFormatDesc *desc, hipExtent *extent, unsigned int *flags, hipArray_t array )
-Gets info about the specified array.
-hipArrayGetDescriptor , hipArray3DGetDescriptor
-hipSuccess, hipErrorInvalidValue, hipErrorInvalidHandle hipError_t hipArrayGetDescriptor ( HIP_ARRAY_DESCRIPTOR *pArrayDescriptor, hipArray_t array )
-Gets a 1D or 2D array descriptor.
-hipArray3DCreate , hipArray3DGetDescriptor , hipArrayCreate , hipArrayDestroy , hipMemAlloc, hipMemAllocHost , hipMemAllocPitch , hipMemcpy2D , hipMemcpy2DAsync , hipMemcpy2DUnaligned, hipMemcpy3D , hipMemcpy3DAsync , hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH , hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD , hipMemcpyDtoDAsync , hipMemcpyDtoH , hipMemcpyDtoHAsync , hipMemcpyHtoA , hipMemcpyHtoAAsync, hipMemcpyHtoD , hipMemcpyHtoDAsync , hipMemFree, hipMemFreeHost, hipMemGetAddressRange , hipMemGetInfo , hipMemHostAlloc, hipMemHostGetDevicePointer, hipMemsetD8 , hipMemsetD16 , hipMemsetD32 , hipArrayGetInfo
-hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidValue, hipErrorInvalidHandle
-hipError_t hipArray3DGetDescriptor ( HIP_ARRAY3D_DESCRIPTOR *pArrayDescriptor, hipArray_t array )
-Gets a 3D array descriptor.
-hipArray3DCreate , hipArrayCreate , hipArrayDestroy , hipArrayGetDescriptor , hipMemAlloc, hipMemAllocHost , hipMemAllocPitch , hipMemcpy2D , hipMemcpy2DAsync , hipMemcpy2DUnaligned, hipMemcpy3D , hipMemcpy3DAsync , hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH , hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD , hipMemcpyDtoDAsync , hipMemcpyDtoH , hipMemcpyDtoHAsync , hipMemcpyHtoA , hipMemcpyHtoAAsync, hipMemcpyHtoD , hipMemcpyHtoDAsync , hipMemFree, hipMemFreeHost, hipMemGetAddressRange , hipMemGetInfo , hipMemHostAlloc, hipMemHostGetDevicePointer, hipMemsetD8 , hipMemsetD16 , hipMemsetD32 , hipArrayGetInfo
-hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidValue, hipErrorInvalidHandle, hipErrorContextIsDestroyed hipError_t hipMemcpy2D ( void *dst, size_t dpitch, const void *src, size_t spitch, size_t width, size_t height, hipMemcpyKind kind )
-Copies data between host and device.
-hipMemcpy , hipMemcpyToArray , hipMemcpy2DToArray , hipMemcpyFromArray , hipMemcpyToSymbol , hipMemcpyAsync
-hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection hipError_t hipMemcpyParam2D ( const hip_Memcpy2D *pCopy )
-Copies memory for 2D arrays.
-hipMemcpy , hipMemcpy2D , hipMemcpyToArray , hipMemcpy2DToArray , hipMemcpyFromArray , hipMemcpyToSymbol , hipMemcpyAsync
-pCopy -[in] Parameters for the memory copy
-hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection hipError_t hipMemcpyParam2DAsync ( const hip_Memcpy2D *pCopy, hipStream_t stream )
-Copies memory for 2D arrays.
-hipMemcpy , hipMemcpy2D , hipMemcpyToArray , hipMemcpy2DToArray , hipMemcpyFromArray , hipMemcpyToSymbol , hipMemcpyAsync
-hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection hipError_t hipMemcpy2DAsync ( void *dst, size_t dpitch, const void *src, size_t spitch, size_t width, size_t height, hipMemcpyKind kind, hipStream_t stream )
-Copies data between host and device.
-hipMemcpy , hipMemcpyToArray , hipMemcpy2DToArray , hipMemcpyFromArray , hipMemcpyToSymbol , hipMemcpyAsync
-hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection hipError_t hipMemcpy2DToArray ( hipArray_t dst, size_t wOffset, size_t hOffset, const void *src, size_t spitch, size_t width, size_t height, hipMemcpyKind kind )
-Copies data between host and device.
-hipMemcpy , hipMemcpyToArray , hipMemcpy2D , hipMemcpyFromArray , hipMemcpyToSymbol , hipMemcpyAsync
-hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection hipError_t hipMemcpy2DToArrayAsync ( hipArray_t dst, size_t wOffset, size_t hOffset, const void *src, size_t spitch, size_t width, size_t height, hipMemcpyKind kind, hipStream_t stream )
-Copies data between host and device.
-hipMemcpy , hipMemcpyToArray , hipMemcpy2D , hipMemcpyFromArray , hipMemcpyToSymbol , hipMemcpyAsync
-hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection hipError_t hipMemcpyToArray ( hipArray_t dst, size_t wOffset, size_t hOffset, const void *src, size_t count, hipMemcpyKind kind )
-Copies data between host and device.
-hipMemcpy , hipMemcpy2DToArray , hipMemcpy2D , hipMemcpyFromArray , hipMemcpyToSymbol , hipMemcpyAsync
-Warning: This API is deprecated.
-hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection hipError_t hipMemcpyFromArray ( void *dst, hipArray_const_t srcArray, size_t wOffset, size_t hOffset, size_t count, hipMemcpyKind kind )
-Copies data between host and device.
-hipMemcpy , hipMemcpy2DToArray , hipMemcpy2D , hipMemcpyFromArray , hipMemcpyToSymbol , hipMemcpyAsync
-Warning: This API is deprecated.
-hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection hipError_t hipMemcpy2DFromArray ( void *dst, size_t dpitch, hipArray_const_t src, size_t wOffset, size_t hOffset, size_t width, size_t height, hipMemcpyKind kind )
-Copies data between host and device.
-hipMemcpy , hipMemcpy2DToArray , hipMemcpy2D , hipMemcpyFromArray , hipMemcpyToSymbol , hipMemcpyAsync
-hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection hipError_t hipMemcpy2DFromArrayAsync ( void *dst, size_t dpitch, hipArray_const_t src, size_t wOffset, size_t hOffset, size_t width, size_t height, hipMemcpyKind kind, hipStream_t stream )
-Copies data between host and device asynchronously.
-hipMemcpy , hipMemcpy2DToArray , hipMemcpy2D , hipMemcpyFromArray , hipMemcpyToSymbol , hipMemcpyAsync
-hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection hipError_t hipMemcpyAtoH ( void *dst, hipArray_t srcArray, size_t srcOffset, size_t count )
-Copies data between host and device.
-hipMemcpy , hipMemcpy2DToArray , hipMemcpy2D , hipMemcpyFromArray , hipMemcpyToSymbol , hipMemcpyAsync
-hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection hipError_t hipMemcpyHtoA ( hipArray_t dstArray, size_t dstOffset, const void *srcHost, size_t count )
-Copies data between host and device.
-hipMemcpy , hipMemcpy2DToArray , hipMemcpy2D , hipMemcpyFromArray , hipMemcpyToSymbol , hipMemcpyAsync
-hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection
-hipError_t hipMemcpy3D ( const struct hipMemcpy3DParms *p )
-Copies data between host and device.
-hipMemcpy , hipMemcpy2DToArray , hipMemcpy2D , hipMemcpyFromArray , hipMemcpyToSymbol , hipMemcpyAsync
-p -[in] 3D memory copy parameters
-hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection hipError_t hipMemcpy3DAsync ( const struct hipMemcpy3DParms *p, hipStream_t stream )
-Copies data between host and device asynchronously.
-hipMemcpy , hipMemcpy2DToArray , hipMemcpy2D , hipMemcpyFromArray , hipMemcpyToSymbol , hipMemcpyAsync
-hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection hipError_t hipDrvMemcpy3D ( const HIP_MEMCPY3D *pCopy )
-Copies data between host and device.
-hipMemcpy , hipMemcpy2DToArray , hipMemcpy2D , hipMemcpyFromArray , hipMemcpyToSymbol , hipMemcpyAsync
-pCopy -[in] 3D memory copy parameters
-hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection hipError_t hipDrvMemcpy3DAsync ( const HIP_MEMCPY3D *pCopy, hipStream_t stream )
-Copies data between host and device asynchronously.
-hipMemcpy , hipMemcpy2DToArray , hipMemcpy2D , hipMemcpyFromArray , hipMemcpyToSymbol , hipMemcpyAsync
-hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection template<typename T > hipError_t hipGetSymbolAddress ( void **devPtr, const T &symbol )
-Gets the address of a symbol.
-hipSuccess, hipErrorInvalidValue template<typename T > hipError_t hipGetSymbolSize ( size_t *size, const T &symbol )
-Gets the size of a symbol.
-hipSuccess, hipErrorInvalidValue template<typename T >
-hipError_t hipMemcpyToSymbol ( const T &symbol, const void *src, size_t sizeBytes, size_t offset, hipMemcpyKind kind )
-Copies data to the given symbol on the device.
-hipMemcpyToSymbol
-hipSuccess, hipErrorInvalidMemcpyDirection, hipErrorInvalidValue template<typename T >
-hipError_t hipMemcpyToSymbolAsync ( const T &symbol, const void *src, size_t sizeBytes, size_t offset, hipMemcpyKind kind, hipStream_t stream )
-Copies data to the given symbol on the device asynchronously on the stream.
-hipMemcpyToSymbolAsync
-hipSuccess, hipErrorInvalidMemcpyDirection, hipErrorInvalidValue template<typename T >
-hipError_t hipMemcpyFromSymbol ( void *dst, const T &symbol, size_t sizeBytes, size_t offset, hipMemcpyKind kind )
-Copies data from the given symbol on the device.
-hipMemcpyFromSymbol
-hipSuccess, hipErrorInvalidMemcpyDirection, hipErrorInvalidValue template<typename T >
-hipError_t hipMemcpyFromSymbolAsync ( void *dst, const T &symbol, size_t sizeBytes, size_t offset, hipMemcpyKind kind, hipStream_t stream )
-Copies data from the given symbol on the device asynchronously on the stream.
-hipMemcpyFromSymbolAsync
-hipSuccess, hipErrorInvalidMemcpyDirection, hipErrorInvalidValue template<class T >
-static inline hipError_t hipMalloc ( T **devPtr, size_t size )
-Perform automatic type conversion to eliminate the need for excessive typecasting (i.e. void**).
-HIP_DISABLE_CPP_FUNCTIONS macro can be defined to suppress these wrappers. It is useful for applications which need to obtain decltypes of HIP runtime APIs.
-hipMalloc
-static inline hipError_t hipHostMalloc ( T **ptr, size_t size, unsigned int flags = hipHostMallocDefault )
-Provide an override to automatically typecast the pointer type from void**, and also provide a default for the flags.
-HIP_DISABLE_CPP_FUNCTIONS macro can be defined to suppress these wrappers. It is useful for applications which need to obtain decltypes of HIP runtime APIs.
-hipHostMalloc
-hipError_t hipImportExternalSemaphore ( hipExternalSemaphore_t *extSem_out, const hipExternalSemaphoreHandleDesc *semHandleDesc )
-Imports an external semaphore.
-See also:
-hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue hipError_t hipSignalExternalSemaphoresAsync ( const hipExternalSemaphore_t *extSemArray, const hipExternalSemaphoreSignalParams *paramsArray, unsigned int numExtSems, hipStream_t stream )
-Signals a set of external semaphore objects.
-hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue hipError_t hipWaitExternalSemaphoresAsync ( const hipExternalSemaphore_t *extSemArray, const hipExternalSemaphoreWaitParams *paramsArray, unsigned int numExtSems, hipStream_t stream )
-Waits on a set of external semaphore objects.
-See also:
-hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue
-Destroys an external semaphore object and releases any references to the underlying resource. Any outstanding signals or waits must have completed before the semaphore is destroyed.
-extSem -[in] handle to an external memory object
-hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue hipError_t hipImportExternalMemory ( hipExternalMemory_t *extMem_out, const hipExternalMemoryHandleDesc *memHandleDesc )
-Imports an external memory object.
-hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue
-hipError_t hipExternalMemoryGetMappedBuffer ( void **devPtr, hipExternalMemory_t extMem, const hipExternalMemoryBufferDesc *bufferDesc )
-Maps a buffer onto an imported memory object.
-hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue
-Destroys an external memory object.
-See also:
-extMem -[in] External memory object to be destroyed
-hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue
-hipError_t hipExternalMemoryGetMappedMipmappedArray ( hipMipmappedArray_t *mipmap, hipExternalMemory_t extMem, const hipExternalMemoryMipmappedArrayDesc *mipmapDesc )
-Maps a mipmapped array onto an external memory object.
-Returned mipmapped array must be freed using hipFreeMipmappedArray.
-See also: hipImportExternalMemory, hipFreeMipmappedArray, hipDestroyExternalMemory
-hipSuccess, hipErrorInvalidValue, hipErrorInvalidResourceHandle
-The register keyword is deprecated in C++, and is silently ignored by both NVCC and HIP-Clang. You can pass the option -Wdeprecated-register to enable the compiler warning message.
-hipExternalMemoryGetMappedBuffer ,
-Loop unrolling with a bound that is known at compile time is supported. For example:
- #pragma unroll 16 /* hint to compiler to unroll next loop by 16 */
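As a minimal sketch of the unroll hint applied to a loop with a compile-time bound, the following host-compilable function sums 16 values. The function name `sum16` is hypothetical, and the pragma is guarded with `__clang__` on the assumption that other host compilers may not recognize it; HIP-Clang is Clang-based, so the hint is active there.

```cpp
#include <cassert>

// Sums exactly 16 floats. The trip count is a compile-time constant,
// so the compiler is able to unroll the loop by the requested factor.
inline float sum16(const float* v) {
    float acc = 0.0f;
#if defined(__clang__)
#pragma unroll 16  // hint to the compiler to unroll the next loop by 16
#endif
    for (int i = 0; i < 16; ++i) {
        acc += v[i];
    }
    return acc;
}
```

The pragma is only a hint: the result of `sum16` is the same whether or not the compiler honors it.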
-GCN ISA inline assembly is supported. For example:
-We insert the GCN ISA into the kernel using the asm() assembler statement.
-- The volatile keyword is used so that the optimizers do not change the number of volatile operations or change their order of execution relative to other volatile operations.
-- v_mac_f32_e32 is the GCN instruction. For more information, refer to the [AMD GCN3 ISA architecture manual](http://gpuopen.com/compute-product/amd-gcn3-isa-architecture-manual/).
-- The index of each operand is given by % followed by its position in the operand list.
-- 'v' is the constraint code (target-specific, for AMDGPU) for a 32-bit VGPR register. For more information, refer to the [Supported Constraint Code List for AMDGPU](https://llvm.org/docs/LangRef.html#supported-constraint-code-list).
-- Output constraints are specified by an '=' prefix ('=v'). This indicates that the assembly will write to this operand, and the operand will then be made available as a return value of the asm expression.
-- Input constraints do not have a prefix, just the constraint code.
-- The constraint string '0' says to use the register assigned to the output as an input as well (it being the 0th constraint).
-## C++ Support
-The following C++ features are not supported:
-- Run-time-type information (RTTI)
-- Try/catch
-
-Virtual functions
-Virtual functions are not supported if objects containing virtual function tables are passed between GPUs of different offload architectures, e.g. between gfx906 and gfx1030. Otherwise, virtual functions are supported.
-hipcc supports compiling C++/HIP kernels to binary code objects. The binary file format is .co, which stands for Code Object. The following command builds the code object using hipcc:
-hipcc --genco --offload-arch=[TARGET GPU] [INPUT FILE] -o [OUTPUT FILE]
-
-[TARGET GPU] = GPU architecture
-[INPUT FILE] = Name of the file containing kernels
-[OUTPUT FILE] = Name of the generated code object file
-Note: When using binary code objects, the number of arguments to the kernel differs between the HIP-Clang and NVCC paths. Refer to the HIP module_api sample for the differences in the arguments to be passed to the kernel.
-Clang-defined '__gfx*__' macros can be used to execute gfx architecture-specific code inside the kernel. Refer to the HIP 14_gpu_arch sample.
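A minimal sketch of branching on the `__gfx*__` macros follows. The function name `targetArchName` is hypothetical, and the choice of `__gfx942__` and `__gfx90a__` as the checked targets is an assumption made for illustration. On a host-only build none of these macros is defined, so the fallback branch is taken.

```cpp
#include <cassert>

// Returns a label for the architecture this translation unit is being
// compiled for. The __gfx*__ macros are only defined during device
// compilation for the matching --offload-arch target.
inline const char* targetArchName() {
#if defined(__gfx942__)
    return "gfx942";
#elif defined(__gfx90a__)
    return "gfx90a";
#else
    return "host-or-other";  // no recognized gfx macro is defined
#endif
}
```

The same `#if defined(...)` pattern can wrap architecture-specific code paths inside a kernel body, with a portable fallback in the `#else` branch.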
-The ROCm platform enables the power of combined C++ and HIP (Heterogeneous-computing Interface for Portability) code. This code is compiled with a clang or clang++ compiler. The official compilers support the HIP platform, or you can use the amdclang or amdclang++ included in the ROCm installation, which are wrappers for the official versions.
-The source code is compiled according to the C++03, C++11, C++14, C++17, and C++20 standards, along with HIP-specific extensions, but is subject to restrictions. The key restriction is the reduced support of the standard library in device code. This is because, by default, a function is considered to run on the host, except for constexpr functions, which can run on both host and device.
-C++ is considered a modern programming language as of C++11. This section describes how HIP supports these new C++ features.
-The C++11 standard introduced many new features. These features are supported in HIP host code, with some notable omissions on the device side. The rule of thumb here is that constexpr functions work on device, the rest doesn't. This means that some important functionality like std::function is missing on the device, but unfortunately the standard library wasn't designed with HIP in mind, which means that the support is in a state of 'works as-is'.
-Certain features have restrictions and clarifications. For example, any functions using the constexpr qualifier or the new initializer lists , std::move or std::forward features are implicitly considered to have the __host__ and __device__ execution space specifier. Also, constexpr variables that are static members or namespace scoped can be used from both host and device, but only for read access. Dereferencing a static constexpr outside its specified execution space causes an error.
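The constexpr rule of thumb can be illustrated with a small helper. This is a sketch in standard C++, so it compiles without HIP; `ceilDiv` is a hypothetical name. Under HIP-Clang, per the text above, such a function would be treated as callable from both host and device without explicit specifiers.

```cpp
#include <cassert>

// Integer division rounded up, e.g. for computing how many blocks of
// size d are needed to cover n elements. Being constexpr, it can also
// be evaluated at compile time.
constexpr int ceilDiv(int n, int d) {
    return (n + d - 1) / d;
}

// Compile-time check: 1000 elements in blocks of 256 need 4 blocks.
static_assert(ceilDiv(1000, 256) == 4, "ceilDiv must round up");
```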
-Lambdas are supported, but there are some extensions and restrictions on their usage. For more information, see the Extended lambdas section below.
-The C++14 language features are supported.
-All C++17 language features are supported.
-All C++20 language features are supported, but extensions and restrictions apply. C++20 introduced coroutines and modules, which fundamentally changed how programs are written. HIP doesn't support these features. However, consteval functions can be called from host and device, even if specified for host use only.
-The three-way comparison operator (spaceship operator <=> ) works with host and device code.
-In addition to the deviations from the standard, there are some general extensions and restrictions to consider.
-Functions that serve as an entry point for device execution are called kernels and are specified with the __global__ qualifier. To call a kernel function, use the triple chevron operator: <<< >>> . Kernel functions must have a void return type. These functions can't:
-Kernels can have variadic template parameters, but only one parameter pack, which must be the last item in the template parameter list.
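The shape of a template parameter list with a single trailing parameter pack can be sketched as follows. To stay compilable without HIP, this uses an ordinary function template rather than a `__global__` kernel; `sumAll` is a hypothetical name, and the assumption is that a kernel template would declare its parameters in the same way.

```cpp
#include <cassert>

// Base case: a single value.
template <typename T>
inline T sumAll(T value) {
    return value;
}

// One parameter pack (Rest), placed last in the template parameter list.
template <typename T, typename... Rest>
inline T sumAll(T first, Rest... rest) {
    // Recurse over the remaining arguments and accumulate in type T.
    return first + static_cast<T>(sumAll(rest...));
}
```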
-HIP includes device space memory specifiers to indicate whether a variable is allocated in host or device memory and how its memory should be allocated. HIP supports the __device__ , __shared__ , __managed__ , and __constant__ specifiers.
-The __device__ and __constant__ specifiers define global variables, which are allocated within global memory on the HIP devices. The only difference is that __constant__ variables can't be changed after allocation. The __shared__ specifier allocates the variable within shared memory, which is available for all threads in a block.
-The __managed__ variable specifier creates global variables that are initially undefined and unaddressed within the global symbol table. The HIP runtime allocates managed memory and defines the symbol when it loads the device binary. A managed variable can be accessed in both device and host code.
-It's important to know where a variable is stored because it is only available from certain locations. Generally, variables allocated in host memory are not accessible from device code, while variables allocated in device memory are not directly accessible from host code. Dereferencing a pointer to device memory on the host results in a segmentation fault. Accessing device variables in host code should be done through kernel execution or HIP functions like hipMemcpyToSymbol .
-An important difference between the host and device code is exception handling. In device code, this control flow isn't available due to the hardware architecture. The device code must use return codes to handle errors.
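A minimal sketch of return-code error handling that avoids exceptions follows. The `Status` enum, the `safeDivide` function, and the `HD` macro are hypothetical names introduced for illustration. The macro expands to `__host__ __device__` only when compiling with hipcc, so the same code also builds as plain C++.

```cpp
#include <cassert>

#if defined(__HIPCC__)
#define HD __host__ __device__
#else
#define HD
#endif

// Error codes returned instead of throwing.
enum class Status { Ok = 0, DivideByZero = 1 };

// Writes num / den to *out and reports success or failure via the
// return value. On failure, *out is left untouched.
HD inline Status safeDivide(float num, float den, float* out) {
    if (den == 0.0f) {
        return Status::DivideByZero;
    }
    *out = num / den;
    return Status::Ok;
}
```

In a kernel, one option is to store the returned status in an output buffer that the host inspects after the launch completes.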
-There are some restrictions on kernel function parameters. They cannot be passed by reference, because these functions are called from the host but run on the device. Also, a variable number of arguments is not allowed.
-Classes work on both the host and device side, but there are some constraints. Static member functions can't be __global__ . Virtual member functions work, but a virtual function must not be called from the host if the parent object was created on the device, or the other way around, because this behavior is undefined. Another minor restriction is that __device__ variables with global scope must have trivial constructors.
-HIP doesn't support the polymorphic function wrapper std::function , which was introduced in C++11.
-HIP supports Lambdas, which by default work as expected.
-Lambdas have implicit host device attributes. This means that they can be executed by both host and device code and work the way you would expect. To make a lambda callable only by host or device code, add the __host__ or __device__ attribute. The only restriction is that host variables can only be accessed by copy on the device. Accessing them by reference causes undefined behavior.
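Capture by copy can be sketched in standard C++ as follows. The function name `scaleOnce` is hypothetical, and the example runs on the host only. It shows the capture mode the text describes as safe for device use, not a device launch.

```cpp
#include <cassert>

// Demonstrates that a by-copy capture snapshots the variable's value
// at the point where the lambda is created.
inline int scaleOnce() {
    int factor = 3;
    auto scale = [factor](int x) { return factor * x; };  // capture by copy
    factor = 10;       // later changes do not affect the lambda's copy
    return scale(7);   // uses the captured value 3
}
```

A by-reference capture (`[&factor]`) would instead hold the address of a host variable, which is the case the text warns leads to undefined behavior on the device.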
-Inline namespaces are supported, but with a few exceptions. The following entities can't be declared in namespace scope within an inline unnamed namespace:
-HIP-Clang supports a set of math operations that are callable from the device. HIP supports most of the device functions supported by NVIDIA CUDA. These are described in the following sections.
-Following is the list of supported single precision mathematical functions.
-| Function | Supported on Host | Supported on Device |
|---|---|---|
| float abs(float x) Returns the absolute value of 𝑥 | ✓ | ✓ |
| float acosf(float x) Returns the arc cosine of 𝑥 . | ✓ | ✓ |
| float acoshf(float x) Returns the nonnegative arc hyperbolic cosine of 𝑥 . | ✓ | ✓ |
| float asinf(float x) Returns the arc sine of 𝑥 . | ✓ | ✓ |
| float asinhf(float x) Returns the arc hyperbolic sine of 𝑥 . | ✓ | ✓ |
| float atanf(float x) Returns the arc tangent of 𝑥 . | ✓ | ✓ |
| float atan2f(float x, float y) Returns the arc tangent of the ratio of 𝑥 and 𝑦 . | ✓ | ✓ |
| float atanhf(float x) Returns the arc hyperbolic tangent of 𝑥 . | ✓ | ✓ |
| float cbrtf(float x) Returns the cube root of 𝑥 . | ✓ | ✓ |
| float ceilf(float x) Returns ceiling of 𝑥 . | ✓ | ✓ |
| float copysignf(float x, float y) Create value with given magnitude, copying sign of second value. | ✓ | ✓ |
| float cosf(float x) Returns the cosine of 𝑥 . | ✓ | ✓ |
| float coshf(float x) Returns the hyperbolic cosine of 𝑥 . | ✓ | ✓ |
| float cospif(float x) Returns the cosine of 𝜋 · 𝑥 . | ✓ | ✓ |
| float cyl_bessel_i0f(float x) Returns the value of the regular modified cylindrical Bessel function of order 0 for 𝑥 . |
-| float cyl_bessel_i1f(float x) Returns the value of the regular modified cylindrical Bessel function of order 1 for 𝑥 . | ||
| float erff(float x) Returns the error function of 𝑥 . | ✓ | ✓ |
| float erfcf(float x) Returns the complementary error function of 𝑥 . | ✓ | ✓ |
| float erfcinvf(float x) Returns the inverse complementary error function of 𝑥 . | ✓ | ✓ |
| float erfcxf(float x) Returns the scaled complementary error function of 𝑥 . | ✓ | ✓ |
| float erfinvf(float x) Returns the inverse error function of 𝑥 . | ✓ | ✓ |
| float expf(float x) Returns 𝑒^𝑥 . | ✓ | ✓ |
| float exp10f(float x) Returns 10^𝑥 . | ✓ | ✓ |
| float exp2f(float x) Returns 2^𝑥 . | ✓ | ✓ |
| float expm1f(float x) Returns 𝑒^𝑥 - 1 . | ✓ | ✓ |
-| float fabsf(float x) Returns the absolute value of x | ✓ | ✓ |
| float fdimf(float x, float y) Returns the positive difference between 𝑥 and 𝑦 . | ✓ | ✓ |
| float fdividef(float x, float y) Divide two floating point values. | ✓ | ✓ |
| float floorf(float x) Returns the largest integer less than or equal to 𝑥 . | ✓ | ✓ |
| float fmaf(float x, float y, float z) Returns 𝑥 · 𝑦 + 𝑧 as a single operation. | ✓ | ✓ |
| float fmaxf(float x, float y) Determine the maximum numeric value of 𝑥 and 𝑦 . | ✓ | ✓ |
| float fminf(float x, float y) Determine the minimum numeric value of 𝑥 and 𝑦 . | ✓ | ✓ |
| float fmodf(float x, float y) Returns the floating-point remainder of 𝑥/𝑦 . | ✓ | ✓ |
| float modff(float x, float* iptr) Break down 𝑥 into fractional and integral parts. | ✓ |
-| float frexpf(float x, int* nptr) Extract mantissa and exponent of 𝑥 . | ✓ |
| float hypotf(float x, float y) Returns the square root of the sum of squares of 𝑥 and 𝑦 . | ✓ |
| int ilogbf(float x) Returns the unbiased integer exponent of 𝑥 . | ✓ |
| bool isfinite(float x) Determine whether 𝑥 is finite. | ✓ |
| bool isinf(float x) Determine whether 𝑥 is infinite. | ✓ |
| bool isnan(float x) Determine whether 𝑥 is a NAN . | ✓ |
| float j0f(float x) Returns the value of the Bessel function of the first kind of order 0 for 𝑥 . | ✓ |
| float j1f(float x) Returns the value of the Bessel function of the first kind of order 1 for 𝑥 . | ✓ |
| float jnf(int n, float x) Returns the value of the Bessel function of the first kind of order n for 𝑥 . | ✓ |
-| float ldexpf(float x, int exp) Returns 𝑥 · 2^exp . | ✓ | ✓ |
| float lgammaf(float x) Returns the natural logarithm of the absolute value of the gamma function of 𝑥 . | ✓ | |
| long int lrintf(float x) Round 𝑥 to nearest integer value. | ✓ | ✓ |
| long long int llrintf(float x) Round 𝑥 to nearest integer value. | ✓ | ✓ |
| long int lroundf(float x) Round to nearest integer value. | ✓ | ✓ |
| long long int llroundf(float x) Round to nearest integer value. | ✓ | ✓ |
| float log10f(float x) Returns the base 10 logarithm of 𝑥 . | ✓ | ✓ |
| float log1pf(float x) Returns the natural logarithm of 𝑥 +1 . | ✓ | ✓ |
| float log2f(float x) Returns the base 2 logarithm of 𝑥 . | ✓ | ✓ |
| float logf(float x) Returns the natural logarithm of 𝑥 . | ✓ | ✓ |
| float logbf(float x) Returns the floating point representation of the exponent of 𝑥 . | ✓ | |
| float nanf(const char* tagp) Returns 'Not a Number' value. | ✓ | |
| float nearbyintf(float x) Round 𝑥 to the nearest integer. | ✓ | ✓ |
| float nextafterf(float x, float y) Returns next representable single-precision floating-point value after argument. | ✓ | |
| float norm3df(float x, float y, float z) Returns the square root of the sum of squares of 𝑥 , 𝑦 and 𝑧 . | ✓ | ✓ |
| float norm4df(float x, float y, float z, float w) Returns the square root of the sum of squares of 𝑥 , 𝑦 , 𝑧 and 𝑤 . | ✓ | ✓ |
| float normcdff(float y) Returns the standard normal cumulative distribution function. | ✓ | ✓ |
| float normcdfinvf(float y) Returns the inverse of the standard normal cumulative distribution function. | ✓ | ✓ |
| float normf(int dim, const float *a) Returns the square root of the sum of squares of any number of coordinates. | ✓ | ✓ |
| float powf(float x, float y) Returns 𝑥^𝑦 . | ✓ | ✓ |
| float powif(float base, int iexp) Returns the value of first argument to the power of second argument. | ✓ | ✓ |
| float remainderf(float x, float y) Returns single-precision floating-point remainder. | ✓ | ✓ |
| float remquof(float x, float y, int* quo) Returns single-precision floating-point remainder and part of quotient. | ✓ | ✓ |
| float roundf(float x) Round to nearest integer value in floating-point. | ✓ | ✓ |
| float rcbrtf(float x) Returns the reciprocal cube root function. | ✓ | ✓ |
| float rhypotf(float x, float y) Returns one over the square root of the sum of squares of two arguments. | ✓ | ✓ |
| float rintf(float x) Round input to nearest integer value in floating-point. | ✓ | ✓ |
-| float rnorm3df(float x, float y, float z) Returns one over the square root of the sum of squares of three coordinates of the argument. | ✓ | ✓ |
| float rnorm4df(float x, float y, float z, float w) Returns one over the square root of the sum of squares of four coordinates of the argument. | ✓ | ✓ |
| float rnormf(int dim, const float *a) Returns the reciprocal of square root of the sum of squares of any number of coordinates. | ✓ | ✓ |
| float scalblnf(float x, long int n) Scale 𝑥 by 2^𝑛 . | ✓ | ✓ |
| float scalbnf(float x, int n) Scale 𝑥 by 2^𝑛 . | ✓ | ✓ |
| bool signbit(float x) Return the sign bit of 𝑥 . | ✓ | ✓ |
| float sinf(float x) Returns the sine of 𝑥 . | ✓ | ✓ |
| float sinhf(float x) Returns the hyperbolic sine of 𝑥 . | ✓ | ✓ |
| float sinpif(float x) Returns the sine of 𝜋 · 𝑥 . | ✓ | ✓ |
-| void sincosf(float x, float *sptr, float *cptr) Returns the sine and cosine of 𝑥 . | ✓ | ✓ |
| void sincospif(float x, float *sptr, float *cptr) Returns the sine and cosine of 𝜋 · 𝑥 . | ✓ | ✓ |
| float sqrtf(float x) Returns the square root of 𝑥 . | ✓ | ✓ |
| float rsqrtf(float x) Returns the reciprocal of the square root of 𝑥 . | ✓ | |
| float tanf(float x) Returns the tangent of 𝑥 . | ✓ | ✓ |
| float tanhf(float x) Returns the hyperbolic tangent of 𝑥 . | ✓ | ✓ |
| float tgammaf(float x) Returns the gamma function of 𝑥 . | ✓ | ✓ |
| float truncf(float x) Truncate 𝑥 to the integral part. | ✓ | ✓ |
| float y0f(float x) Returns the value of the Bessel function of the second kind of order 0 for 𝑥 . | ✓ | ✓ |
| float y1f(float x) Returns the value of the Bessel function of the second kind of order 1 for 𝑥 . | ✓ | ✓ |
| float ynf(int n, float x) Returns the value of the Bessel function of the second kind of order n for 𝑥 . |
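A few of the single-precision functions listed above can be exercised from host code as a quick check. This is a minimal sketch using the C math library versions; the wrapper names `axpy`, `length2d`, and `roundTrip` are hypothetical. In HIP device code the same function names would be called unqualified in the same way.

```cpp
#include <cassert>
#include <cmath>

// a * x + y computed as a single fused operation.
inline float axpy(float a, float x, float y) {
    return fmaf(a, x, y);
}

// Euclidean length of a 2D vector: sqrt(x^2 + y^2).
inline float length2d(float x, float y) {
    return hypotf(x, y);
}

// expm1f computes e^x - 1 and log1pf computes ln(1 + x), so applying
// one after the other returns approximately the original value.
inline float roundTrip(float x) {
    return log1pf(expm1f(x));
}
```

The `roundTrip` helper also makes the definition of expm1f concrete: it is the inverse of log1pf, which would not hold if it computed a logarithm.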
-Following is the list of supported double precision mathematical functions.
-| Function | Supported on Host | Supported on Device |
|---|---|---|
| double abs(double x) Returns the absolute value of 𝑥 | ✓ | ✓ |
| double acos(double x) Returns the arc cosine of 𝑥 . | ✓ | ✓ |
| double acosh(double x) Returns the nonnegative arc hyperbolic cosine of 𝑥 . | ✓ | ✓ |
| double asin(double x) Returns the arc sine of 𝑥 . | ✓ | ✓ |
| double asinh(double x) Returns the arc hyperbolic sine of 𝑥 . | ✓ | ✓ |
| double atan(double x) Returns the arc tangent of 𝑥 . | ✓ | ✓ |
| double atan2(double x, double y) Returns the arc tangent of the ratio of 𝑥 and 𝑦 . | ✓ | ✓ |
| double atanh(double x) Returns the arc hyperbolic tangent of 𝑥 . | ✓ | ✓ |
| double cbrt(double x) Returns the cube root of 𝑥 . | ✓ | ✓ |
| double ceil(double x) Returns ceiling of 𝑥 . | ✓ | ✓ |
| double copysign(double x, double y) Create value with given magnitude, copying sign of second value. | ✓ | ✓ |
| double cos(double x) Returns the cosine of 𝑥 . | ✓ | ✓ |
| double cosh(double x) Returns the hyperbolic cosine of 𝑥 . | ✓ | ✓ |
| double cospi(double x) Returns the cosine of 𝜋 · 𝑥 . | ✓ | ✓ |
| double cyl_bessel_i0(double x) Returns the value of the regular modified cylindrical Bessel function of order 0 for 𝑥 . | | |
| double cyl_bessel_i1(double x) Returns the value of the regular modified cylindrical Bessel function of order 1 for 𝑥 . | | |
| double erf(double x) Returns the error function of 𝑥 . | ✓ | ✓ |
| double erfc(double x) Returns the complementary error function of 𝑥 . | ✓ | ✓ |
| double erfcinv(double x) Returns the inverse complementary error function of 𝑥 . | ✓ | ✓ |
| double erfcx(double x) Returns the scaled complementary error function of 𝑥 . | ✓ | ✓ |
| double erfinv(double x) Returns the inverse error function of 𝑥 . | ✓ | ✓ |
| double exp(double x) Returns 𝑒^𝑥 . | ✓ | ✓ |
| double exp10(double x) Returns 10^𝑥 . | ✓ | ✓ |
| double exp2(double x) Returns 2^𝑥 . | ✓ | ✓ |
| double expm1(double x) Returns 𝑒^𝑥 - 1 . | ✓ | ✓ |
| double fabs(double x) Returns the absolute value of x | ✓ | ✓ |
| double fdim(double x, double y) Returns the positive difference between 𝑥 and 𝑦 . | ✓ | ✓ |
| double floor(double x) Returns the largest integer less than or equal to 𝑥 . | ✓ | ✓ |
| double fma(double x, double y, double z) Returns 𝑥 · 𝑦 + 𝑧 as a single operation. | ✓ | ✓ |
| double fmax(double x, double y) Determine the maximum numeric value of 𝑥 and 𝑦 . | ✓ | ✓ |
| double fmin(double x, double y) Determine the minimum numeric value of 𝑥 and 𝑦 . | ✓ | ✓ |
| double fmod(double x, double y) Returns the floating-point remainder of 𝑥/𝑦 . | ✓ | ✓ |
| double modf(double x, double* iptr) Break down 𝑥 into fractional and integral parts. | ✓ | |
| double frexp(double x, int* nptr) Extract mantissa and exponent of 𝑥 . | ✓ | |
| double hypot(double x, double y) Returns the square root of the sum of squares of 𝑥 and 𝑦 . | ✓ | ✓ |
| int ilogb(double x) Returns the unbiased integer exponent of 𝑥 . | ✓ | ✓ |
| bool isfinite(double x) Determine whether 𝑥 is finite. | ✓ | ✓ |
| bool isinf(double x) Determine whether 𝑥 is infinite. | ✓ | ✓ |
| bool isnan(double x) Determine whether 𝑥 is a NaN. | ✓ | ✓ |
| double j0(double x) Returns the value of the Bessel function of the first kind of order 0 for 𝑥 . | ✓ | ✓ |
| double j1(double x) Returns the value of the Bessel function of the first kind of order 1 for 𝑥 . | ✓ | ✓ |
| double jn(int n, double x) Returns the value of the Bessel function of the first kind of order n for 𝑥 . | ✓ | ✓ |
| double ldexp(double x, int exp) Returns 𝑥 · 2^exp . | ✓ | ✓ |
| double lgamma(double x) Returns the natural logarithm of the absolute value of the gamma function of 𝑥 . | ✓ | |
| long int lrint(double x) Round 𝑥 to nearest integer value. | ✓ | ✓ |
| long long int llrint(double x) Round 𝑥 to nearest integer value. | ✓ | ✓ |
| long int lround(double x) Round to nearest integer value. | ✓ | ✓ |
| long long int llround(double x) Round to nearest integer value. | ✓ | ✓ |
| double log10(double x) Returns the base 10 logarithm of 𝑥 . | ✓ | ✓ |
| double log1p(double x) Returns the natural logarithm of 𝑥 +1 . | ✓ | ✓ |
| double log2(double x) Returns the base 2 logarithm of 𝑥 . | ✓ | ✓ |
| double log(double x) Returns the natural logarithm of 𝑥 . | ✓ | ✓ |
| double logb(double x) Returns the floating point representation of the exponent of 𝑥 . | ✓ | ✓ |
| double nan(const char* tagp) Returns 'Not a Number' value. | ✓ | |
| double nearbyint(double x) Round 𝑥 to the nearest integer. | ✓ | ✓ |
| double nextafter(double x, double y) Returns next representable double-precision floating-point value after argument. | ✓ | |
| double norm3d(double x, double y, double z) Returns the square root of the sum of squares of 𝑥 , 𝑦 and 𝑧 . | ✓ | ✓ |
| double norm4d(double x, double y, double z, double w) Returns the square root of the sum of squares of 𝑥 , 𝑦 , 𝑧 and 𝑤 . | ✓ | ✓ |
| double normcdf(double y) Returns the standard normal cumulative distribution function. | ✓ | ✓ |
| double normcdfinv(double y) Returns the inverse of the standard normal cumulative distribution function. | ✓ | ✓ |
| double norm(int dim, const double *a) Returns the square root of the sum of squares of any number of coordinates. | ✓ | ✓ |
| double pow(double x, double y) Returns 𝑥^𝑦 . | ✓ | ✓ |
| double powi(double base, int iexp) Returns the value of first argument to the power of second argument. | ✓ | ✓ |
| double remainder(double x, double y) Returns double-precision floating-point remainder. | ✓ | ✓ |
| double remquo(double x, double y, int* quo) Returns double-precision floating-point remainder and part of quotient. | ✓ | |
| double round(double x) Round to nearest integer value in floating-point. | ✓ | ✓ |
| double rcbrt(double x) Returns the reciprocal cube root function. | ✓ | ✓ |
| double rhypot(double x, double y) Returns one over the square root of the sum of squares of two arguments. | ✓ | ✓ |
| double rint(double x) Round input to nearest integer value in floating-point. | ✓ | ✓ |
| double rnorm3d(double x, double y, double z) Returns one over the square root of the sum of squares of three coordinates of the argument. | ✓ | ✓ |
| double rnorm4d(double x, double y, double z, double w) Returns one over the square root of the sum of squares of four coordinates of the argument. | ✓ | ✓ |
| double rnorm(int dim, const double *a) Returns the reciprocal of square root of the sum of squares of any number of coordinates. | ✓ | |
| double scalbln(double x, long int n) Scale 𝑥 by 2^𝑛 . | ✓ | ✓ |
| double scalbn(double x, int n) Scale 𝑥 by 2^𝑛 . | ✓ | ✓ |
| bool signbit(double x) Return the sign bit of 𝑥 . | ✓ | ✓ |
| double sin(double x) Returns the sine of 𝑥 . | ✓ | ✓ |
| double sinh(double x) Returns the hyperbolic sine of 𝑥 . | ✓ | ✓ |
| double sinpi(double x) Returns the sine of 𝜋 · 𝑥 . | ✓ | ✓ |
| void sincos(double x, double *sptr, double *cptr) Returns the sine and cosine of 𝑥 . | ✓ | ✓ |
| void sincospi(double x, double *sptr, double *cptr) Returns the sine and cosine of 𝜋 · 𝑥 . | ✓ | ✓ |
| double sqrt(double x) Returns the square root of 𝑥 . | ✓ | ✓ |
| double rsqrt(double x) Returns the reciprocal of the square root of 𝑥 . | ✓ |
| double tan(double x) Returns the tangent of 𝑥 . | ✓ |
| double tanh(double x) Returns the hyperbolic tangent of 𝑥 . | ✓ |
| double tgamma(double x) Returns the gamma function of 𝑥 . | ✓ |
| double trunc(double x) Truncate 𝑥 to the integral part. | ✓ |
| double y0(double x) Returns the value of the Bessel function of the second kind of order 0 for 𝑥 . | ✓ |
| double y1(double x) Returns the value of the Bessel function of the second kind of order 1 for 𝑥 . | ✓ |
| double yn(int n, double x) Returns the value of the Bessel function of the second kind of order n for 𝑥 . | ✓ |
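
Several functions in the table exist because the obvious formula loses accuracy or overflows. The following is a minimal host-side sketch using the standard C++ `<cmath>` counterparts of `expm1` and `hypot`, assuming the HIP host versions behave like the standard library ones. The helper names `rel_err`, `expm1_naive` and `hypot_naive` are hypothetical and exist only for this comparison.

```cpp
#include <cassert>
#include <cmath>

// Relative error of an approximation against a non-zero reference value.
inline double rel_err(double approx, double exact) {
    assert(exact != 0.0);  // this sketch only compares against non-zero references
    return std::fabs(approx - exact) / std::fabs(exact);
}

// exp(x) - 1 computed directly: for tiny x the subtraction cancels
// most of the significant digits of exp(x).
inline double expm1_naive(double x) {
    return std::exp(x) - 1.0;
}

// sqrt(x*x + y*y) computed directly: the squares can overflow to
// infinity even when the true result is representable.
inline double hypot_naive(double x, double y) {
    return std::sqrt(x * x + y * y);
}
```

For `x = 1e-10`, `std::expm1(x)` stays close to `x`, while `expm1_naive(x)` is off in roughly the eighth significant digit. For inputs near `1e200`, `std::hypot` returns a finite value where `hypot_naive` overflows.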

Following is the list of supported integer intrinsics. Note that intrinsics are supported on device only.

Table 3: Integer intrinsics mathematical functions

| Function |
|---|
| unsigned int __brev(unsigned int x) Reverse the bit order of a 32 bit unsigned integer. |
| unsigned long long int __brevll(unsigned long long int x) Reverse the bit order of a 64 bit unsigned integer. |
| unsigned int __byte_perm(unsigned int x, unsigned int y, unsigned int z) Return selected bytes from two 32-bit unsigned integers. |
| unsigned int __clz(int x) Return the number of consecutive high-order zero bits in 32 bit integer. |
| unsigned int __clzll(long long int x) Return the number of consecutive high-order zero bits in 64 bit integer. |
| unsigned int __ffs(int x) Find the position of least significant bit set to 1 in a 32 bit integer. |
| unsigned int __ffsll(long long int x) Find the position of least significant bit set to 1 in a 64 bit signed integer. |
| unsigned int __fns32(unsigned long long mask, unsigned int base, int offset) Find the position of the n-th set to 1 bit in a 32-bit integer. |
| unsigned int __fns64(unsigned long long int mask, unsigned int base, int offset) Find the position of the n-th set to 1 bit in a 64-bit integer. |
| unsigned int __funnelshift_l(unsigned int lo, unsigned int hi, unsigned int shift) Concatenate hi and lo, shift left by shift & 31 bits, return the most significant 32 bits. |
| unsigned int __funnelshift_lc(unsigned int lo, unsigned int hi, unsigned int shift) Concatenate hi and lo, shift left by min(shift, 32) bits, return the most significant 32 bits. |
| unsigned int __funnelshift_r(unsigned int lo, unsigned int hi, unsigned int shift) Concatenate hi and lo, shift right by shift & 31 bits, return the least significant 32 bits. |
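
The one-line descriptions of `__funnelshift_l`, `__funnelshift_r` and `__brev` are easier to check against a reference model. The sketch below is plain host C++ that follows the wording of the table. The `_ref` functions are hypothetical helpers, not part of HIP, and they are not the device implementation.

```cpp
#include <cassert>
#include <cstdint>

// Model of __funnelshift_l: concatenate hi:lo into 64 bits, shift left
// by (shift & 31) and keep the most significant 32 bits.
inline std::uint32_t funnelshift_l_ref(std::uint32_t lo, std::uint32_t hi,
                                       std::uint32_t shift) {
    const std::uint64_t wide = (static_cast<std::uint64_t>(hi) << 32) | lo;
    return static_cast<std::uint32_t>((wide << (shift & 31u)) >> 32);
}

// Model of __funnelshift_r: concatenate hi:lo into 64 bits, shift right
// by (shift & 31) and keep the least significant 32 bits.
inline std::uint32_t funnelshift_r_ref(std::uint32_t lo, std::uint32_t hi,
                                       std::uint32_t shift) {
    const std::uint64_t wide = (static_cast<std::uint64_t>(hi) << 32) | lo;
    return static_cast<std::uint32_t>(wide >> (shift & 31u));
}

// Model of __brev: reverse the bit order of a 32-bit unsigned integer.
inline std::uint32_t brev_ref(std::uint32_t x) {
    std::uint32_t reversed = 0;
    for (int i = 0; i < 32; ++i) {
        reversed = (reversed << 1) | (x & 1u);  // move the lowest bit of x to the top
        x >>= 1;
    }
    return reversed;
}
```

Passing the same value as `lo` and `hi` turns the funnel shift into a 32-bit rotate, which is a common reason to reach for these intrinsics.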

The HIP-Clang implementation of `__ffs()` and `__ffsll()` contains code to add a constant +1 to produce the ffs result format. For cases where this overhead is not acceptable and the programmer is willing to specialize for the platform, HIP-Clang provides `__lastbit_u32_u32(unsigned int input)` and `__lastbit_u32_u64(unsigned long long int input)`. The index returned by the `__lastbit_` instructions starts at -1, while for ffs the index starts at 0.
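
The two index conventions can be sketched on the host as follows. This is a model based only on the description above, assuming the `__lastbit_` result is the ffs result minus one. `ffs_ref` and `lastbit_ref` are hypothetical names, and they return `int` so the -1 case is visible.

```cpp
#include <cassert>
#include <cstdint>

// ffs convention: 0 means "no bit set", otherwise the 1-based position
// of the least significant set bit.
inline int ffs_ref(std::uint32_t x) {
    if (x == 0) {
        return 0;
    }
    int position = 1;
    while ((x & 1u) == 0) {
        x >>= 1;
        ++position;
    }
    return position;
}

// lastbit convention (assumed): the same search without the +1, so the
// result is 0-based and "no bit set" is reported as -1.
inline int lastbit_ref(std::uint32_t x) {
    return ffs_ref(x) - 1;
}
```

For an input of 8 (only bit 3 set), `ffs_ref` yields 4 and `lastbit_ref` yields 3. For an input of 0 they yield 0 and -1 respectively.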

Following is the list of supported floating-point intrinsics. Note that intrinsics are supported on device only.

Note: Only the round-to-nearest-even rounding mode is supported on AMD GPUs by default. The `_rz`, `_ru` and `_rd` suffixed intrinsic functions exist in the HIP AMD backend if the `OCML_BASIC_ROUNDED_OPERATIONS` macro is defined.

Table 4: Single precision intrinsics mathematical functions

| Function |
|---|
| float __cosf(float x) Returns the fast approximate cosine of 𝑥 . |
| float __exp10f(float x) Returns the fast approximate for 10^x . |
| float __expf(float x) Returns the fast approximate for e^x . |
| float __fadd_rn(float x, float y) Add two floating-point values in round-to-nearest-even mode. |
| float __fdiv_rn(float x, float y) Divide two floating-point values in round-to-nearest-even mode. |
| float __fmaf_rn(float x, float y, float z) Returns x × y + z as a single operation in round-to-nearest-even mode. |
| float __fmul_rn(float x, float y) Multiply two floating-point values in round-to-nearest-even mode. |
| float __frcp_rn(float x) Returns 1 / x in round-to-nearest-even mode. |
| float __frsqrt_rn(float x) Returns 1 / √x in round-to-nearest-even mode. |
| float __fsqrt_rn(float x) Returns √x in round-to-nearest-even mode. |
| float __fsub_rn(float x, float y) Subtract two floating-point values in round-to-nearest-even mode. |
| float __log10f(float x) Returns the fast approximate for base 10 logarithm of 𝑥 . |

Table 5: Double precision intrinsics mathematical functions

| Function |
|---|
| double __dadd_rn(double x, double y) Add two floating-point values in round-to-nearest-even mode. |
| double __ddiv_rn(double x, double y) Divide two floating-point values in round-to-nearest-even mode. |
| double __dmul_rn(double x, double y) Multiply two floating-point values in round-to-nearest-even mode. |
| double __drcp_rn(double x) Returns 1 / x in round-to-nearest-even mode. |
| double __dsqrt_rn(double x) Returns √x in round-to-nearest-even mode. |
| double __dsub_rn(double x, double y) Subtract two floating-point values in round-to-nearest-even mode. |
| double __fma_rn(double x, double y, double z) Returns x × y + z as a single operation in round-to-nearest-even mode. |
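
"As a single operation" means the product is not rounded before the addition. The host-side sketch below uses the standard `std::fma` as a stand-in for `__fma_rn`, which is an assumption because the intrinsic itself is device only. The helper name `product_error` is hypothetical. It recovers the rounding error of a double product, which is only possible with a fused multiply-add.

```cpp
#include <cassert>
#include <cmath>

// Returns the exact rounding error of the double product x * y, that is
// (x * y computed exactly) minus (x * y rounded to double). The fused
// multiply-add forms x * y without intermediate rounding and subtracts
// the rounded product from it.
inline double product_error(double x, double y) {
    const double rounded = x * y;        // one rounding happens here
    return std::fma(x, y, -rounded);     // exact product minus rounded product
}
```

With `x = y = 1 + 2^-27`, the exact product is `1 + 2^-26 + 2^-54` and the rounded product drops the last term, so the helper returns `2^-54`. For products that are exactly representable, such as `3.0 * 4.0`, it returns zero.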

| Term | CUDA | HIP | OpenCL |
|---|---|---|---|
| Device | int deviceId | int deviceId | cl_device |
| Queue | cudaStream_t | hipStream_t | cl_command_queue |
| Event | cudaEvent_t | hipEvent_t | cl_event |
| Memory | void * | void * | cl_mem |
| | grid | grid | NDRange |
| | block | block | work-group |
| | thread | thread | work-item |
| | warp | warp | sub-group |
| Thread-index | threadIdx.x | threadIdx.x | get_local_id(0) |
| Block-index | blockIdx.x | blockIdx.x | get_group_id(0) |
| Block-dim | blockDim.x | blockDim.x | get_local_size(0) |
| Grid-dim | gridDim.x | gridDim.x | get_num_groups(0) |
| Device Kernel | __global__ | __global__ | __kernel |
| Device Function | __device__ | __device__ | Implied in device com |
| Host Function | __host__ (default) | __host__ (default) | Implied in host compil |
| Host + Device Function | __host__ __device__ | __host__ __device__ | No equivalent |
| Kernel Launch | <<< >>> | hipLaunchKernel / hipLaunchKernelGGL / <<< | clEnqueueNDRangeK |
| Global Memory | __global__ | __global__ | __global |
| Group Memory | __shared__ | __shared__ | __local |
| Constant | __constant__ | __constant__ | __constant |
| | __syncthreads | __syncthreads | barrier(CLK_LOCAL |
| Atomic Builtins | atomicAdd | atomicAdd | atomic_add |
| Precise Math | cos(f) | cos(f) | cos(f) |
| Fast Math | __cos(f) | __cos(f) | native_cos(f) |
| Vector | float4 | float4 | float4 |

The indexing functions (starting with thread-index) show the terminology for a 1D grid. Some APIs use reverse order of xyz / 012 indexing for 3D grids.
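
The 1D indexing terms combine into a global element index in the usual way. The sketch below simulates that arithmetic on the host instead of launching a kernel. `global_index` and `enumerate_grid` are hypothetical helpers, and their parameters stand in for `blockIdx.x`, `blockDim.x`, `threadIdx.x` and `gridDim.x` (or `get_group_id(0)`, `get_local_size(0)`, `get_local_id(0)` and `get_num_groups(0)` in OpenCL).

```cpp
#include <cassert>
#include <vector>

// Global 1D index of one thread: block index times block size plus the
// thread's position inside its block.
inline unsigned global_index(unsigned block_idx, unsigned block_dim,
                             unsigned thread_idx) {
    assert(thread_idx < block_dim);  // a thread index is always inside its block
    return block_idx * block_dim + thread_idx;
}

// Visit every (block, thread) pair of a 1D grid in order and collect the
// global indices, mimicking what each thread would compute for itself.
inline std::vector<unsigned> enumerate_grid(unsigned grid_dim, unsigned block_dim) {
    std::vector<unsigned> indices;
    indices.reserve(static_cast<std::size_t>(grid_dim) * block_dim);
    for (unsigned block = 0; block < grid_dim; ++block) {
        for (unsigned thread = 0; thread < block_dim; ++thread) {
            indices.push_back(global_index(block, block_dim, thread));
        }
    }
    return indices;
}
```

A grid of 2 blocks with 3 threads each covers the indices 0 through 5 exactly once, which is the property a kernel relies on when it maps one thread to one array element.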

The following host-side functions are used for cooperative kernel launches.

- `hipLaunchCooperativeKernel`
- `hipLaunchCooperativeKernelMultiDevice`
-| Warning: in | doxygenfunction: Cannot find xml output for project 'HIP | |||
|---|---|---|---|---|
| nel' | function 6.1.40092 | 'hipModuleLaunchCooperativeKer- Documentation' from directory: | ||
| /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-hip/checkouts/docs- | ||||
| 6.1.2/docs/doxygen/xml | ||||
Warning: doxygenfunction: Cannot find function 'hipModuleLaunchCooperativeKernelMultiDevice' in doxygen xml output for project 'HIP 6.1.40092 Documentation' from directory: /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-hip/checkouts/docs6.1.2/docs/doxygen/xml
-The following cooperative groups classes can be used on the device side.
-class thread_group
-The base type of all cooperative group types.
-Holds the key properties of a constructed cooperative group object, such as the group type and its size.
-Note: Cooperative groups feature is implemented on Linux, under development on Microsoft Windows.
-Subclassed by cooperative_groups::coalesced_group, cooperative_groups::grid_group, cooperative_groups::multi_grid_group, cooperative_groups::thread_block, cooperative_groups::tiled_group
-class thread_block : public cooperative_groups::thread_group
-The workgroup (thread-block in CUDA terminology) cooperative group type.
-Represents an intra-workgroup cooperative group type, where the participating threads within the group are the same threads that participated in the currently executing workgroup.
-Note: This function is implemented on Linux and is under development on Microsoft Windows.
-class grid_group : public cooperative_groups:: thread_group
-The grid cooperative group type.
-Represents an inter-workgroup cooperative group type, where the participating threads within the group span multiple workgroups running the (same) kernel on the same device.
-Note: This is implemented on Linux and is under development on Microsoft Windows.
-class multi_grid_group : public cooperative_groups:: thread_group
-The multi-grid cooperative group type.
-Represents an inter-device cooperative group type, where the participating threads within the group span across multiple devices, running the (same) kernel on these devices.
-Note: The multi-grid cooperative group type is implemented on Linux, under development on Microsoft Windows.
-class thread_block_tile : public cooperative_groups::impl::thread_block_tile_internal< size , ParentCGTy >
-Group type - thread_block_tile.
-Represents one tiled thread group in a wavefront. This group type also supports sub-wave level intrinsics.
-Note: This type is implemented on Linux, under development on Microsoft Windows.
-unsigned int thread_rank () const
-Rank of the calling thread within [0, size() ).
-void sync () const
-Synchronizes the threads in the group.
-Causes all threads in the group to wait at this synchronization point until every thread has reached it and all shared and global memory accesses by those threads have completed. This guarantees the visibility of accessed data for all threads in the group.
-Note: There are potential read-after-write (RAW), write-after-read (WAR), or write-after-write (WAW) hazards when threads in the group access the same addresses in shared or global memory. These data hazards can be avoided by synchronizing the group.
-unsigned int meta_group_rank () const
-Returns the linear rank of the group within the set of tiles partitioned from a parent group (bounded by meta_group_size).
-unsigned int meta_group_size () const
-Returns the number of groups created when the parent group was partitioned.
-T shfl ( T var, int srcRank ) const
-Shuffle operation on group level.
-Exchanges a variable between threads without using shared memory. The shuffle operation copies var directly from the thread whose rank in the group is srcRank.
-T - The type can be a 32-bit integer or single-precision floating point.
-T shfl_down ( T var, unsigned int lane_delta ) const
-Shuffle down operation on group level.
-Exchanges a variable between threads without using shared memory. The shuffle down operation copies var from the thread whose rank in the group is lane_delta higher than the caller's.
-T - The type can be a 32-bit integer or single-precision floating point.
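The data movement that shfl_down describes can be modelled on the host. The sketch below is plain C++ rather than device code; `model_shfl_down` and `model_reduce_sum` are hypothetical helper names, and it assumes that a lane whose source rank falls outside the group keeps its own value. It illustrates the common tree reduction in which lane 0 ends up holding the sum of all lanes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model of shfl_down: lane i reads the value held by lane i + delta.
// Assumption: a lane whose source falls outside the group keeps its own value.
std::vector<int> model_shfl_down(const std::vector<int>& lanes, std::size_t delta) {
    std::vector<int> out(lanes.size());
    for (std::size_t i = 0; i < lanes.size(); ++i) {
        const std::size_t src = i + delta;
        out[i] = (src < lanes.size()) ? lanes[src] : lanes[i];
    }
    return out;
}

// Tree reduction: the offset halves each step, so after log2(size) steps
// lane 0 holds the sum of every lane's initial value.
int model_reduce_sum(std::vector<int> lanes) {
    const std::size_t n = lanes.size();
    assert(n != 0 && (n & (n - 1)) == 0);  // group size must be a power of two
    for (std::size_t delta = n / 2; delta > 0; delta /= 2) {
        const std::vector<int> shifted = model_shfl_down(lanes, delta);
        for (std::size_t i = 0; i < n; ++i) {
            lanes[i] += shifted[i];
        }
    }
    return lanes[0];
}
```

In a kernel, the same loop would presumably call the group's `shfl_down` member once per step, with each thread holding one element of `lanes` in a register.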
-template<class T >
-T shfl_up ( T var, unsigned int lane_delta ) const
-Shuffle up operation on group level.
-Exchanges a variable between threads without using shared memory. The shuffle up operation copies var from the thread whose rank in the group is lane_delta lower than the caller's.
-T - The type can be a 32-bit integer or single-precision floating point.
-T shfl_xor ( T var, unsigned int laneMask ) const
-Shuffle xor operation on group level.
-Exchanges a variable between threads without using shared memory. The shuffle xor operation copies var from the thread whose rank in the group equals the caller's rank XORed with laneMask.
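An XOR shuffle pairs lanes symmetrically, which is why it is often used for a butterfly all-reduce in which every lane, not only lane 0, finishes with the result. The following host-side C++ model is a sketch under that reading; `model_shfl_xor` and `model_all_reduce_max` are hypothetical names, not HIP functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model of shfl_xor: lane i reads the value held by lane (i XOR mask).
std::vector<int> model_shfl_xor(const std::vector<int>& lanes, std::size_t mask) {
    std::vector<int> out(lanes.size());
    for (std::size_t i = 0; i < lanes.size(); ++i) {
        const std::size_t src = i ^ mask;
        // For a power-of-two group and mask < size, src is always in range;
        // the guard only keeps the model safe for other inputs.
        out[i] = (src < lanes.size()) ? lanes[src] : lanes[i];
    }
    return out;
}

// Butterfly all-reduce: each step exchanges values between lane pairs that
// differ in one bit, so every lane ends with the maximum over the group.
std::vector<int> model_all_reduce_max(std::vector<int> lanes) {
    const std::size_t n = lanes.size();
    assert(n != 0 && (n & (n - 1)) == 0);  // group size must be a power of two
    for (std::size_t mask = n / 2; mask > 0; mask /= 2) {
        const std::vector<int> partner = model_shfl_xor(lanes, mask);
        for (std::size_t i = 0; i < n; ++i) {
            lanes[i] = std::max(lanes[i], partner[i]);
        }
    }
    return lanes;
}
```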
-unsigned long long ballot ( int pred ) const
-Ballot function on group level.
-Returns a bit mask with the Nth bit set to one if the Nth thread predicate evaluates true.
-pred - [in] The predicate to evaluate on group threads.
-int any ( int pred ) const
-Any function on group level.
-Returns non-zero if a predicate evaluates true for any threads.
-pred - [in] The predicate to evaluate on group threads.
-int all ( int pred ) const
-All function on group level.
-Returns non-zero if a predicate evaluates true for all threads.
-pred - [in] The predicate to evaluate on group threads.
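The relationship between ballot, any, and all can be made concrete with a small host-side model. This is a plain C++ sketch with hypothetical `model_*` names; it assumes one predicate per lane and at most 64 lanes, matching the 64-bit return type of ballot.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side model of ballot: bit N is set when lane N's predicate is non-zero.
std::uint64_t model_ballot(const std::vector<int>& preds) {
    assert(preds.size() <= 64);  // the mask has one bit per lane
    std::uint64_t mask = 0;
    for (std::size_t i = 0; i < preds.size(); ++i) {
        if (preds[i] != 0) {
            mask |= std::uint64_t{1} << i;
        }
    }
    return mask;
}

// any: non-zero when at least one lane's predicate is true.
int model_any(const std::vector<int>& preds) {
    return model_ballot(preds) != 0 ? 1 : 0;
}

// all: non-zero when the ballot mask has a bit set for every lane in the group.
int model_all(const std::vector<int>& preds) {
    const std::size_t n = preds.size();
    const std::uint64_t full =
        (n == 64) ? ~std::uint64_t{0} : ((std::uint64_t{1} << n) - 1);
    return model_ballot(preds) == full ? 1 : 0;
}
```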
-template<typename T >
-unsigned long long match_any ( T value ) const
-Match any function on group level.
-Returns a bit mask containing a 1-bit for every participating thread if that thread has the same value in value as the caller thread.
-value - [in] The value to examine on the current thread in group.
-template<typename T > unsigned long long match_all ( T value, int &pred ) const
-Match all function on group level.
-Returns a bit mask containing a 1-bit for every participating thread if they all have the same value in value as the caller thread. The predicate pred is set to true if all participating threads have the same value in value .
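As a rough illustration of the two match functions, the host-side C++ sketch below computes the mask each lane would see. The `model_*` names are hypothetical, and the zero mask returned by `model_match_all` when the values differ is an assumption; the text above only specifies the case where all values agree.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side model of match_any: entry i is the mask lane i would receive,
// with bit j set when lane j holds the same value as lane i.
std::vector<std::uint64_t> model_match_any(const std::vector<int>& values) {
    assert(values.size() <= 64);
    std::vector<std::uint64_t> masks(values.size(), 0);
    for (std::size_t i = 0; i < values.size(); ++i) {
        for (std::size_t j = 0; j < values.size(); ++j) {
            if (values[j] == values[i]) {
                masks[i] |= std::uint64_t{1} << j;
            }
        }
    }
    return masks;
}

// Host-side model of match_all: pred is set to 1 and the full mask is returned
// only when every lane holds the same value.
// Assumption: when the values differ, pred is 0 and the returned mask is 0.
std::uint64_t model_match_all(const std::vector<int>& values, int& pred) {
    assert(values.size() <= 64);
    std::uint64_t full = 0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (values[i] != values[0]) {
            pred = 0;
            return 0;
        }
        full |= std::uint64_t{1} << i;
    }
    pred = 1;
    return full;
}
```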
-class coalesced_group : public cooperative_groups:: thread_group
-The coalesced_group cooperative group type.
-Represents an active thread group in a wavefront. This group type also supports sub-wave level intrinsics.
-Note: This is implemented on Linux and is under development on Microsoft Windows.
-The following functions are used to construct different group-type instances on the device side.
-Warning: doxygenfunction: Cannot find function 'cooperative_groups::this_multi_grid' in doxygen xml output for project 'HIP 6.1.40092 Documentation' from directory: /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-hip/checkouts/docs-6.1.2/docs/doxygen/xml
-Warning: doxygenfunction: Cannot find function 'cooperative_groups::this_grid' in doxygen xml output for project 'HIP 6.1.40092 Documentation' from directory: /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-hip/checkouts/docs-6.1.2/docs/doxygen/xml
-Warning: doxygenfunction: Cannot find function 'cooperative_groups::this_thread_block' in doxygen xml output for project 'HIP 6.1.40092 Documentation' from directory: /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-hip/checkouts/docs-6.1.2/docs/doxygen/xml
-Warning: doxygenfunction: Cannot find function 'cooperative_groups::coalesced_threads' in doxygen xml output for project 'HIP 6.1.40092 Documentation' from directory: /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-hip/checkouts/docs-6.1.2/docs/doxygen/xml
-Warning: doxygenfunction: Cannot find function 'cooperative_groups::tiled_partition' in doxygen xml output for project 'HIP 6.1.40092 Documentation' from directory: /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-hip/checkouts/docs-6.1.2/docs/doxygen/xml
-Warning: doxygenfunction: Cannot find function 'cooperative_groups::binary_partition' in doxygen xml output for project 'HIP 6.1.40092 Documentation' from directory: /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-hip/checkouts/docs-6.1.2/docs/doxygen/xml
The following functions are the exposed API for different group-type instances on the device side.
-Warning: doxygenfunction: Cannot find function 'cooperative_groups::group_size' in doxygen xml output for project 'HIP 6.1.40092 Documentation' from directory: /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-hip/checkouts/docs-6.1.2/docs/doxygen/xml
-Warning: doxygenfunction: Cannot find function 'cooperative_groups::thread_rank' in doxygen xml output for project 'HIP 6.1.40092 Documentation' from directory: /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-hip/checkouts/docs-6.1.2/docs/doxygen/xml
-Warning: doxygenfunction: Cannot find function 'cooperative_groups::is_valid' in doxygen xml output for project 'HIP 6.1.40092 Documentation' from directory: /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-hip/checkouts/docs-6.1.2/docs/doxygen/xml
-Warning: doxygenfunction: Cannot find function 'cooperative_groups::sync' in doxygen xml output for project 'HIP 6.1.40092 Documentation' from directory: /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-hip/checkouts/docs-6.1.2/docs/doxygen/xml
-The following functions are located in the https://github.com/ROCm/ROCR-Runtime repository.
-hsa_status_t hsa_amd_vmem_address_reserve ( void **va, size_t size, uint64_t address, uint64_t flags )
-Allocate a reserved address range.
-Reserve a virtual address range. The size must be a multiple of the system page size. If it is not possible to allocate the address specified by address , then va will be a different address range. Address range should be released by calling hsa_amd_vmem_address_free.
-Note that this API will be deprecated in a future release and replaced by hsa_amd_vmem_address_reserve_align.
-hsa_status_t hsa_amd_vmem_address_free ( void *va, size_t size )
-Free a reserved address range.
-Free a previously allocated address range. The size must match the size of a previously allocated address range.
-· ::HSA_STATUS_ERROR - Internal unexpected error
-hsa_status_t hsa_amd_vmem_handle_create ( hsa_amd_memory_pool_t pool, size_t size, hsa_amd_memory_type_t type, uint64_t flags, hsa_amd_vmem_alloc_handle_t *memory_handle )
-Create a virtual memory handle.
-Create a virtual memory handle within this pool. size must be aligned to the allocation granule size for this memory pool; see HSA_AMD_MEMORY_POOL_INFO_RUNTIME_ALLOC_GRANULE. To minimize internal memory fragmentation, align the size to the recommended allocation granule size; see HSA_AMD_MEMORY_POOL_INFO_RUNTIME_ALLOC_REC_GRANULE.
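Because both the reserve and create calls reject sizes that are not a multiple of the relevant granule, callers typically round the requested size up first. The helper below is a minimal sketch of that arithmetic in plain C++; `align_up` is a hypothetical name, and the granule passed in is assumed to have been queried from the pool (or to be the system page size) beforehand.

```cpp
#include <cassert>
#include <cstddef>

// Rounds size up to the next multiple of granule.
// Assumption: granule is a value obtained elsewhere, for example the pool's
// allocation granule or the system page size; it is not computed here.
std::size_t align_up(std::size_t size, std::size_t granule) {
    assert(granule != 0);
    return ((size + granule - 1) / granule) * granule;
}
```

A size that is already a multiple of the granule is returned unchanged, so the helper can be applied unconditionally before the allocation call.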
-hsa_status_t hsa_amd_vmem_handle_release ( hsa_amd_vmem_alloc_handle_t memory_handle )
-Release a virtual memory handle.
-memory_handle - [in] Handle that was previously allocated.
-hsa_status_t hsa_amd_vmem_map ( void *va, size_t size, size_t in_offset, hsa_amd_vmem_alloc_handle_t memory_handle, uint64_t flags )
-Map a virtual memory handle.
-Map a virtual memory handle to a reserved address range. The virtual address requested must be within a previously reserved address range: both va and (va + size) must lie within the previously allocated address range. size must be equal to the size of the memory_handle. hsa_amd_vmem_set_access needs to be called to make the memory accessible to specific agents.
-Unmap a virtual memory handle.
-Unmap a previously mapped virtual address range.
-hsa_status_t hsa_amd_vmem_set_access ( void *va, size_t size, const hsa_amd_memory_access_desc_t *desc, size_t desc_cnt )
-Make a memory mapping accessible.
-Make a previously mapped virtual address accessible to specific agents. size must be equal to the size of the previously mapped virtual memory handle. Calling hsa_amd_vmem_set_access multiple times on the same va will overwrite previous permissions for all agents.
-hsa_status_t hsa_amd_vmem_get_access ( void *va, hsa_access_permission_t *perms, hsa_agent_t agent_handle )
-Get current access permissions for memory mapping.
-Get access permissions for memory mapping for specific agent.
-hsa_status_t hsa_amd_vmem_export_shareable_handle ( int *dmabuf_fd, hsa_amd_vmem_alloc_handle_t handle, uint64_t flags )
-Get an exportable shareable handle.
-Get an exportable shareable handle for a memory_handle. This shareable handle can then be used to re-create a virtual memory handle using hsa_amd_vmem_import_shareable_handle. The shareable handle can be transferred using mechanisms that support POSIX file descriptors. Once all shareable handles are closed, the memory_handle is released.
-Import a shareable handle.
-Import a shareable handle for a memory handle. Importing a shareable handle that has been closed and released results in undefined behavior.
-hsa_status_t hsa_amd_vmem_retain_alloc_handle ( hsa_amd_vmem_alloc_handle_t *memory_handle, void *addr )
-Returns memory handle for mapped memory.
-Return a memory handle for previously mapped memory. The returned handle has the same value as the handle used to map the memory. It must be released with a corresponding number of calls to hsa_amd_vmem_handle_release.
-hsa_status_t hsa_amd_vmem_get_alloc_properties_from_handle ( hsa_amd_vmem_alloc_handle_t memory_handle, hsa_amd_memory_pool_t *pool, hsa_amd_memory_type_t *type )
-Returns the current allocation properties of a handle.
-Returns the allocation properties of an existing handle.
-hipError_t hipMallocManaged ( void **dev_ptr, size_t size, unsigned int flags )
-Allocates memory that will be automatically managed by HIP.
-This API is used for managed memory; it allows data to be shared between, and accessed by, both the CPU and GPU through a single pointer.
-The API returns the allocation pointer, managed by HMM, which can then be used to execute kernels on the device and to fetch data between the host and device as needed.
-Note: It is recommended to do the capability check before calling this API.
-hipSuccess, hipErrorMemoryAllocation, hipErrorNotSupported, hipErrorInvalidValue
-hipError_t hipMemPrefetchAsync ( const void *dev_ptr, size_t count, int device, hipStream_t stream )
-Prefetches memory to the specified destination device using HIP.
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-hipSuccess, hipErrorInvalidValue
-hipError_t hipMemAdvise ( const void *dev_ptr, size_t count, hipMemoryAdvise advice, int device )
-Advise about the usage of a given memory range to HIP.
-This HIP API advises about the usage to be applied to a unified memory allocation in the range starting at the pointer address dev_ptr and spanning count bytes. The memory range must refer to managed memory allocated via hipMallocManaged. The driver rounds the start of the range down and the end of the range up so that the range is aligned to the CPU page size, the same way the corresponding CUDA API behaves in CUDA version 8.0 and later.
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
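The rounding applied to the advised range can be worked through on the host. The sketch below is plain C++ with hypothetical names (`PageRange`, `covering_pages`); it assumes the page size is a power of two and is supplied by the caller, and it only illustrates the arithmetic, not what the driver actually executes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Page-aligned range [begin, end) that covers the requested bytes.
struct PageRange {
    std::uintptr_t begin;
    std::uintptr_t end;
};

// Rounds the start address down and the end address up to page boundaries.
// Assumption: page_size is a power of two (for example 4096) chosen by the caller.
PageRange covering_pages(std::uintptr_t addr, std::size_t count, std::size_t page_size) {
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    const std::uintptr_t mask = static_cast<std::uintptr_t>(page_size) - 1;
    PageRange r;
    r.begin = addr & ~mask;                  // round start down
    r.end = (addr + count + mask) & ~mask;   // round end up
    return r;
}
```

One consequence is that advice given for a few bytes still affects every page the range touches, so a range that straddles a page boundary covers two full pages.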
-hipSuccess, hipErrorInvalidValue
-hipError_t hipMemRangeGetAttribute ( void *data, size_t data_size, hipMemRangeAttribute attribute, const void *dev_ptr, size_t count )
-Query an attribute of a given memory range in HIP.
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-hipSuccess, hipErrorInvalidValue
-hipError_t hipMemRangeGetAttributes ( void **data, size_t *data_sizes, hipMemRangeAttribute *attributes, size_t num_attributes, const void *dev_ptr, size_t count )
-Query attributes of a given memory range in HIP.
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-hipSuccess, hipErrorInvalidValue
-hipError_t hipStreamAttachMemAsync ( hipStream_t stream, void *dev_ptr, size_t length, unsigned int flags )
-Attach memory to a stream asynchronously in HIP.
-Warning: This API is under development. Currently it is a no-operation (NOP) function on AMD GPUs and returns hipSuccess.
-hipSuccess, hipErrorInvalidValue
-static inline hipError_t hipMallocManaged ( T **devPtr, size_t size, unsigned int flags = hipMemAttachGlobal )
-Provide an override to automatically typecast the pointer type from void**, and also provide a default for the flags.
-HIP_DISABLE_CPP_FUNCTIONS macro can be defined to suppress these wrappers. It is useful for applications which need to obtain decltypes of HIP runtime APIs.
-hipMallocManaged
-hipError_t hipMemAddressFree ( void *devPtr, size_t size )
-Frees an address range reservation made via hipMemAddressReserve.
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-hipError_t hipMemAddressReserve ( void **ptr, size_t size, size_t alignment, void *addr, unsigned long long flags )
-Reserves an address range.
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-hipError_t hipMemCreate ( hipMemGenericAllocationHandle_t *handle, size_t size, const hipMemAllocationProp *prop, unsigned long long flags )
-Creates a memory allocation described by the properties and size.
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-hipError_t hipMemExportToShareableHandle ( void *shareableHandle, hipMemGenericAllocationHandle_t handle, hipMemAllocationHandleType handleType, unsigned long long flags )
-Exports an allocation to a requested shareable handle type.
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-hipError_t hipMemGetAccess ( unsigned long long *flags, const hipMemLocation *location, void *ptr )
-Get the access flags set for the given location and ptr.
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-hipError_t hipMemGetAllocationGranularity ( size_t *granularity, const hipMemAllocationProp *prop, hipMemAllocationGranularity_flags option )
-Calculates either the minimal or recommended granularity.
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-hipError_t hipMemGetAllocationPropertiesFromHandle ( hipMemAllocationProp *prop, hipMemGenericAllocationHandle_t handle )
-Retrieve the property structure of the given handle.
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-hipError_t hipMemImportFromShareableHandle ( hipMemGenericAllocationHandle_t *handle, void *osHandle, hipMemAllocationHandleType shHandleType )
-Imports an allocation from a requested shareable handle type.
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-hipError_t hipMemMap ( void *ptr, size_t size, size_t offset, hipMemGenericAllocationHandle_t handle, unsigned long long flags )
-Maps an allocation handle to a reserved virtual address range.
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-hipError_t hipMemMapArrayAsync ( hipArrayMapInfo *mapInfoList, unsigned int count, hipStream_t stream )
-Maps or unmaps subregions of sparse HIP arrays and sparse HIP mipmapped arrays.
-Warning: This API is under development. Currently it is not supported on AMD GPUs and returns hipErrorNotSupported.
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-hipError_t hipMemRelease ( hipMemGenericAllocationHandle_t handle )
-Release a memory handle representing a memory allocation which was previously allocated through hipMemCreate.
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-handle [in] - handle of the memory allocation.
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-hipError_t hipMemRetainAllocationHandle ( hipMemGenericAllocationHandle_t *handle, void *addr )
-Returns the allocation handle of the backing memory allocation given the address.
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-hipError_t hipMemSetAccess ( void *ptr, size_t size, const hipMemAccessDesc *desc, size_t count )
-Set the access flags for each location specified in desc for the given virtual address range.
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-hipError_t hipMemUnmap ( void *ptr, size_t size )
-Unmap memory allocation of a given address range.
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-Several of our API functions have been flagged for deprecation. Using the following functions results in errors and unexpected results, so we encourage you to update your code accordingly.
-CUDA supports the cuCtx API, which is the driver API that defines 'Context' and 'Devices' as separate entities. A context contains a single device, and a device can theoretically have multiple contexts. HIP initially added limited support for these APIs to facilitate porting from existing driver code. These APIs are now marked as deprecated because there are better alternative interfaces (such as hipSetDevice or the stream API) to achieve the same functions.
-This tutorial explains the basic concepts of the single-source Heterogeneous-computing Interface for Portability (HIP) programming model and the essential tooling around it. It also reviews some commonalities of heterogeneous APIs in general. This topic assumes basic familiarity with the C/C++ compilation model and language.
-To follow this tutorial, you'll need installed drivers and a HIP compiler toolchain to compile your code. Because HIP for ROCm supports compiling and running on Linux and Windows with AMD and NVIDIA GPUs, the combinations of install instructions are more than this tutorial can cover. For more information about installing HIP development packages, see Install HIP.
-Heterogeneous programming and offloading APIs are often mentioned together. Heterogeneous programming deals with devices of varying capabilities simultaneously. Offloading focuses on the 'remote' and asynchronous aspects of computation. HIP encompasses both. It exposes GPGPU (general-purpose GPU) programming much like ordinary host-side CPU programming and lets you move data across various devices.
-When programming in HIP (and other heterogeneous APIs for that matter), remember that target devices are built for a specific purpose. They are designed with different tradeoffs than traditional CPUs and therefore have very different performance characteristics. Even subtle changes in code might adversely affect execution time.
-First, let's do the 'Hello, World!' of GPGPU: SAXPY. Single-precision A times X Plus Y ( SAXPY ) is a mathematical acronym; a vector equation 𝑎 · 𝑥 + 𝑦 = 𝑧 where 𝑎 ∈ R is a scalar and 𝑥, 𝑦, 𝑧 ∈ V are vector quantities of some large dimensionality. This vector space is defined over the set of reals. Practically speaking, you can compute this using a single for loop over three arrays.
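The single-loop formulation mentioned above can be sketched on the host as follows. This is only an illustrative CPU reference, not code from the HIP samples; the function name `saxpy_reference` and the use of `std::vector` are assumptions made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only reference: z[i] = a * x[i] + y[i] for every element.
// saxpy_reference is a hypothetical helper name, not part of HIP.
std::vector<float> saxpy_reference(float a,
                                   const std::vector<float>& x,
                                   const std::vector<float>& y)
{
    // Both input vectors must have the same dimensionality.
    assert(x.size() == y.size());
    std::vector<float> z(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
    {
        z[i] = a * x[i] + y[i];
    }
    return z;
}
```

A reference like this is handy for validating the device result later: run it on the same inputs and compare element by element.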
-**Following code does:** This code snippet is a command to compile a C++ program using the `clang++` compiler. It is specifically compiling a HIP (Heterogeneous-Compute Interface for Portability) program, which is used for writing portable code that can run on both AMD and NVIDIA GPUs. The command compiles the source file `main.hip` located in the `HIP-Basic\saxpy` directory into an executable named `saxpy.exe`. It includes headers from the `.\Common` directory, links against the `amdhip64` library, and uses the library path specified by the `HIP_PATH` environment variable. The `-O2` flag is used for optimization, and `--save-temps` is used to save intermediate compilation files.
-++i)
-**Following code does:** This code snippet is a partial command written in PowerShell. It uses the `Get-ChildItem` cmdlet to list items (files and directories) in the current directory that match a specific pattern. The `-Filter main-hip-*` option specifies that only items with names starting with "main-hip-" should be included in the output. The command is incomplete, as it ends with `selec`, which is likely intended to be `Select-Object` to further process or filter the properties of the listed items.
-In linear algebra libraries, such as BLAS (Basic Linear Algebra Subprograms), this operation is defined as AXPY, 'A times X Plus Y'. The 'S' comes from single-precision, meaning that the array elements are floats (IEEE 754 binary32 representation).
-To quickly get started, use the set of HIP samples from GitHub. With Git configured on your machine, open a command line and navigate to your desired working directory, then run:
-**Following code does:** The code snippet appears to be incomplete and seems to be a mix of Python and PowerShell syntax. The part `n-hip-* | select -Property Name` resembles a PowerShell command rather than Python. In PowerShell, this command would filter objects whose names match the pattern `n-hip-*` and then select only the `Name` property of those objects. However, without additional context or a complete command, it's difficult to provide a precise explanation.
 git clone https://github.com/amd/rocm-examples.git
-A simple implementation of SAXPY resides in the HIP-Basic/saxpy/main.hip file in this repository. The HIP code here mostly deals with where data has to be and when, and how devices transform this data. The first HIP calls deal with allocating device-side memory and copying data from host-side memory to device side in a C runtime-like fashion.
-**Following code does:** The code snippet appears to be a list of filenames rather than executable code. These filenames suggest that they are related to a project or compilation process targeting the AMD GCN (Graphics Core Next) architecture, specifically the gfx906 variant, which is used in AMD GPUs. The files likely represent different stages or outputs of compiling a program using the HIP (Heterogeneous-Compute Interface for Portability) framework for AMD hardware. The extensions indicate various file types: - -- `.bc` is likely a LLVM bitcode file. -- `.hipi` could be an intermediate file specific to HIP. -- `.o` is an object file. -- `.out` is an executable or output file. -- `.out.resolution.txt` might be a text file containing resolution or linking information. -- `.s` is an assembly file. - -Overall, these files are part of the build process for a HIP application targeting AMD GPUs.
-// Allocate and copy vectors to device memory.
-float* d_x{};
-float* d_y{};
-HIP_CHECK(hipMalloc(&d_x, size_bytes));
-HIP_CHECK(hipMalloc(&d_y, size_bytes));
-HIP_CHECK(hipMemcpy(d_x, x.data(), size_bytes, hipMemcpyHostToDevice));
-HIP_CHECK(hipMemcpy(d_y, y.data(), size_bytes, hipMemcpyHostToDevice));
-HIP_CHECK is a custom macro borrowed from the examples utilities. It checks the error code returned by API functions and reports any error to the console. It is not essential to the API, but checking the error codes of HIP APIs is good practice, in case you pass incorrect values to an API or the API runs out of resources.
-The code does not select a device explicitly for the allocations and copies. Commands are issued to the HIP runtime per thread, and every thread has a device set as the target of commands. The default device is 0, which is equivalent to calling hipSetDevice(0).
-Launch the calculation on the device after the input data has been prepared.
-**Following code does:** This code snippet appears to be a disassembled output of a compiled program targeting AMD's GCN (Graphics Core Next) architecture, specifically the gfx906 variant. The disassembly is likely generated from a HIP (Heterogeneous-Compute Interface for Portability) application, which is used for GPU programming on AMD hardware. - -The code includes assembly instructions for a function named `_Z12saxpy_kernelPKfPfj`, which suggests it is implementing a SAXPY (Single-Precision A·X Plus Y) operation, a common vector operation in linear algebra. The SAXPY operation computes the result of `Y = a * X + Y`, where `a` is a scalar and `X` and `Y` are vectors. - -The assembly instructions involve loading data, performing arithmetic operations, and storing results back to memory, which are typical steps in executing a SAXPY operation on a GPU. The use of specific instructions like `s_load_dword`, `v_add_u32_e32`, and `global_store_dword` indicates manipulation of scalar and vector registers, memory access, and arithmetic operations optimized for parallel execution on the GPU.
 __global__ void saxpy_kernel(const float a, const float* d_x, float* d_y, const unsigned int size)
 {
     //...
 }
-
 int main()
 {
     //...
-
     // Launch the kernel on the default stream.
     saxpy_kernel<<<dim3(grid_size), dim3(block_size), 0, hipStreamDefault>>>(a, d_x, d_y, size);
 }
-
-Analyze the signature of the offloaded function:
-This function is launched from the host using a language extension often called the triple chevron syntax. Inside the angle brackets, provide the following.
-The block size and shared memory become important later in Reduction. For now, a hardcoded 256 is a safe default for simple kernels such as this. Following the triple chevron is ordinary function argument passing.
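With a fixed block size, the number of blocks has to be rounded up so that every element is covered. The following is a minimal sketch of that calculation in plain C++; the helper name `grid_size_for` is hypothetical and chosen only for illustration.

```cpp
#include <cassert>

// Hypothetical helper: smallest number of blocks such that
// grid_size * block_size >= size (integer division rounded up).
unsigned int grid_size_for(unsigned int size, unsigned int block_size)
{
    assert(block_size > 0);
    // Written as quotient plus remainder check so that it cannot overflow
    // for sizes close to the maximum of unsigned int.
    return size / block_size + (size % block_size != 0 ? 1u : 0u);
}
```

For example, `grid_size_for(size, 256)` would give the first launch parameter for a block size of 256. Because the last block can be only partially filled, the kernel still needs a bounds check on its global index.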
-Look at how the kernel is implemented.
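As a rough mental model of the kernel, the sketch below emulates a one-dimensional launch on the CPU. It is not the device code from the samples: the function name `saxpy_emulated` and the explicit loops over blocks and threads are assumptions made for illustration, whereas on a GPU each inner iteration would run as its own thread.

```cpp
#include <cassert>
#include <vector>

// CPU emulation of a one-dimensional launch; on a GPU the two loops
// below are replaced by threads that run in parallel.
void saxpy_emulated(float a, const std::vector<float>& x, std::vector<float>& y,
                    unsigned int grid_size, unsigned int block_size)
{
    assert(x.size() == y.size());
    const unsigned int size = static_cast<unsigned int>(x.size());
    for (unsigned int block_idx = 0; block_idx < grid_size; ++block_idx)
    {
        for (unsigned int thread_idx = 0; thread_idx < block_size; ++thread_idx)
        {
            // Same formula a kernel would use:
            // blockIdx.x * blockDim.x + threadIdx.x.
            const unsigned int global_idx = block_idx * block_size + thread_idx;
            // The last block may reach past the end of the data,
            // so the access is guarded.
            if (global_idx < size)
            {
                y[global_idx] = a * x[global_idx] + y[global_idx];
            }
        }
    }
}
```

The result is written back into `y`, matching the kernel signature shown earlier, where `d_y` is the only non-const pointer.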
-Retrieving the result from the device works much like copying the input data, only in the opposite direction: in this step the results are copied from device to host.
 HIP_CHECK(hipMemcpy(y.data(), d_y, size_bytes, hipMemcpyDeviceToHost));
-Strictly speaking, there's no such thing as 'setting up the command line for compilation' on Linux. To make invocations more terse, Linux and Windows examples follow.
-While distro maintainers might package ROCm so that it installs to system-default locations, AMD's packages aren't installed that way. They need to be added to the PATH by the user.
-**Following code does:** The code snippet you provided appears to be a command or a reference to a file rather than a typical code snippet. It seems to be related to a SAXPY operation, which stands for "Single-Precision A·X Plus Y." SAXPY is a common operation in linear algebra used to scale a vector `X` by a scalar `A` and add it to another vector `Y`. The term "sm_5" likely refers to a specific compute capability or architecture version for NVIDIA GPUs, and "ptx" indicates that this is a PTX (Parallel Thread Execution) file, which is an intermediate representation used in CUDA programming for NVIDIA GPUs. Overall, this line likely refers to a compiled or source file related to performing a SAXPY operation on a GPU.
 export PATH=/opt/rocm/bin:${PATH}
-**Following code does:** The code snippet appears to be a mix of Python and PHP syntax, but it is not valid code in either language. It seems to be an attempt to execute a shell command using PHP tags, specifically running `rocminfo` and filtering the output with `grep gfx`. The intention might be to extract information about AMD GPU architectures (gfx) from the `rocminfo` command output. However, due to the incorrect syntax and formatting, this code will not execute as intended in any programming environment.
- You should be able to call the compiler on the command line now:
-
- amdclang++ --version
-Note: Docker images distributed by AMD, such as rocm-terminal, already have /opt/rocm/bin on the PATH for convenience. This subtly affects the CMake package detection logic of ROCm libraries.
-Both distro maintainers and NVIDIA package CUDA so that nvcc and related tools are available on the command line by default. You can call the compiler on the command line with:
-**Following code does:** The code snippet describes a SAXPY operation, which stands for "Single-Precision A·X Plus Y". It performs a linear algebra operation where each element of an array `y` is updated by multiplying a scalar `a` with the corresponding element of another array `x` and then adding the result to the original element in `y`. This operation is applied over 10,000,000 elements. The snippet also provides the first 10 elements of the resulting array `y` after the operation, which are `[3, 5, 7, 9, 11, 13, 15, 17, 19, 21]`.
-nvcc --version
-Windows compilers and command line tooling have traditionally relied on extra environmental variables and PATH entries to function correctly. Visual Studio refers to command lines with this setup as 'Developer Command Prompt' or 'Developer PowerShell' for cmd.exe and PowerShell respectively.
-The HIP SDK on Windows doesn't include a complete toolchain. You will also need:
-If you don't have a version of Visual Studio 2022 installed, for a minimal command line experience, install the Build Tools for Visual Studio 2022 with the Desktop Development Workload. Under Individual Components select:
-Note: The 'C++ CMake tools for Windows' individual component is a convenience which puts both cmake.exe and ninja.exe onto the PATH inside developer command prompts. You can install these manually, but then you must manage them manually.
-Visual Studio 2017 and later are detectable as COM object instances via WMI. To set up a command line from any shell for the latest Visual Studio's default Visual C++ toolset, issue:
-$InstallationPath = Get-CimInstance MSFT_VSInstance | Sort-Object -Property Version -Descending | Select-Object -First 1 -ExpandProperty InstallLocation
-Import-Module $InstallationPath\Common7\Tools\Microsoft.VisualStudio.DevShell.dll
-Enter-VsDevShell -InstallPath $InstallationPath -SkipAutomaticLocation -Arch amd64 -HostArch amd64 -DevCmdArguments '-no_logo'
-$env:PATH = "${env:HIP_PATH}bin;${env:PATH}"
-You should be able to call the compiler on the command line now:
 clang++ --version
-Windows compilers and command line tooling have traditionally relied on extra environmental variables and PATH entries to function correctly. Visual Studio refers to command lines with this setup as 'Developer Command Prompt' or 'Developer PowerShell' for cmd.exe and PowerShell respectively.
-The HIP and CUDA SDKs on Windows don't include complete toolchains. You will also need:
-If you don't have a version of Visual Studio 2022 installed, for a minimal command line experience, install the Build Tools for Visual Studio 2022 with the Desktop Development Workload. Under Individual Components select:
-Note: The 'C++ CMake tools for Windows' individual component is a convenience which puts both cmake.exe and ninja.exe onto the PATH inside developer command prompts. You can install these manually, but then you must manage them manually.
-Visual Studio 2017 and later are detectable as COM object instances via WMI. To set up a command line from any shell for the latest Visual Studio's default Visual C++ toolset, issue:
-$InstallationPath = Get-CimInstance MSFT_VSInstance | Sort-Object -Property Version -Descending | Select-Object -First 1 -ExpandProperty InstallLocation
-Import-Module $InstallationPath\Common7\Tools\Microsoft.VisualStudio.DevShell.dll
-Enter-VsDevShell -InstallPath $InstallationPath -SkipAutomaticLocation -Arch amd64 -HostArch amd64 -DevCmdArguments '-no_logo'
-You should be able to call the compiler on the command line now:
-**Following code does:** The code snippet provided is not a valid or complete Python code. It appears to be a fragment with a lot of whitespace and a string that looks like a comment or a directive related to architecture specifications, possibly for CUDA (Compute Unified Device Architecture) which is used for programming NVIDIA GPUs. The `<_arch=sm_70,sm_86` part suggests specifying the compute capabilities (sm_70 and sm_86) for which the code might be intended to run. However, without additional context or surrounding code, it's not possible to determine its exact purpose or functionality.
-nvcc --version
-To compile and link a single-file application, use the following commands:
-**Following code does:** The code snippet appears to be executing a program or script named `saxpy`, which performs the SAXPY operation (Single-Precision A·X Plus Y). This operation is a common vector operation in numerical computing, where each element of a vector `y` is updated by multiplying a scalar `a` with the corresponding element of another vector `x` and then adding the result to the original element of `y`. The operation is performed over 10,000,000 elements. After the computation, the first 10 elements of the resulting vector `y` are displayed, showing a sequence of odd numbers starting from 3.
- |
- <---------------------------------------------------------------------------------------------------------------------- ---------------------------------------------------------------------------------------------------------------------- <--------------------------------------------------------------------------------------------------------------------- >---------------------------------------------------------------------------------------------------------------------- |
- |
-
-
-
-
- ----------------------------------------------------------------------------------------------------------------------
-**Following code does:** This code snippet is a PowerShell command that executes the `hipInfo.exe` program located in the `bin` directory of the path specified by the `HIP_PATH` environment variable. The output of this program is then filtered using `Select-String` to search for lines containing the string "gfx". The purpose of this command is to extract and display information related to graphics architecture names (e.g., `gfx1032`, `gfx1035`) from the output of `hipInfo.exe`.
 nvcc ./HIP-Basic/saxpy/main.hip -o saxpy -I ./Common -I /opt/rocm/include -O2 -x cu
-**Following code does:** This code snippet is a command line instruction for compiling a HIP (Heterogeneous-Compute Interface for Portability) program using the `clang++` compiler. The command is set up to compile a source file named `main.hip` located in the `HIP-Basic\saxpy` directory into an executable named `saxpy.exe`. It includes a directory `.\Common` for header files and links against the `amdhip64` library. The command also specifies the HIP installation path using the environment variable `${env:HIP_PATH}`. Additionally, it targets specific GPU architectures (`gfx1032` and `gfx1035`) for offloading, and uses optimization level `-O2` for the compilation process.
 clang++ .\HIP-Basic\saxpy\main.hip -o saxpy.exe -I .\Common -lamdhip64 -L ${env:HIP_PATH}lib -O2
-**Following code does:** This code snippet appears to be executing a compiled program named `saxpy.exe`, which performs the SAXPY operation (Single-Precision A·X Plus Y). The operation involves scaling a vector `x` by a scalar `a` and adding it to another vector `y`, element-wise. The program processes 10,000,000 elements and outputs the first 10 elements of the resulting vector `y` after the computation. The output suggests that the operation has been successfully completed, showing the first 10 elements of the updated vector `y`.
 nvcc .\HIP-Basic\saxpy\main.hip -o saxpy.exe -I ${env:HIP_PATH}include -I .\Common -O2 -x cu
-Depending on your computer, the resulting binary might or might not run. If not, it typically complains about 'Invalid device function'. That error (corresponding to the hipErrorInvalidDeviceFunction entry of hipError_t ) means that the runtime could not find a device program binary of the appropriate flavor embedded into the executable.
-So far, the discussion has covered how data makes it from the host to the device and back. It has also discussed the device code as source, with the HIP runtime selecting the correct binary to dispatch for execution. How can you find out what device binary flavors are embedded into the executable?
-The utilities included with ROCm help significantly to inspect binary artifacts on disk. Add the ROCmCC installation folder to your PATH if you want to use these utilities (the utilities expect them to be on the PATH).
-You can list embedded program binaries using roc-obj-ls .
-**Following code does:** The code snippet appears to be a command for compiling a C++ program using `nvcc`, which is the NVIDIA CUDA Compiler. The command is intended to compile a source file named `main.cpp` located in the `HIP-Basic/device_query` directory, and produce an executable named `device_query.exe`. It includes additional directories for header files specified by `-I.\Common` and `-I ${env:HIP_PATH}`. However, the command seems to be malformed or incomplete, as it includes unusual formatting and options like `-include -02` and `--include -0022`, which are not standard or correctly formatted compiler options. This suggests that the command might be incorrectly copied or formatted.
-**Following code does:** This code snippet executes a command-line program called `device_query.exe` and pipes its output to the `Select-String` command, which searches for occurrences of the string "major.minor". The output shows that the program found two instances of "major.minor" with associated version numbers, 8.6 and 7.0. This suggests that the code is used to query and display the major and minor version numbers of certain devices or software components.
-The compiler embeds a version 4 code object (more on code object versions) and uses the LLVM target triple amdgcn-amd-amdhsa--gfx803 (more on target triples). You can extract that program object in a disassembled fashion for human consumption via roc-obj.
 roc-obj -t gfx803 -d ./saxpy
-This creates two files on disk; the one with the .s extension is of most interest. Opening this file or dumping it to the console using cat shows the disassembled binary of the SAXPY compute kernel, something similar to:
-**Following code does:** This code snippet appears to be executing a compiled program named `saxpy.exe`, which performs the SAXPY operation (Single-Precision A·X Plus Y). The operation involves scaling a vector `x` by a scalar `a` and adding it to another vector `y`, element-wise, over 10,000,000 elements. The output shows the first 10 elements of the resulting vector `y` after the operation has been completed.
-Alternatively, call the compiler with --save-temps to dump all device binary to disk in separate files.
-List all the temporaries created while compiling main.hip with:
-**Following code does:** This code defines a CUDA kernel function named `k` that operates on arrays of `float4` structures. The kernel is intended to be executed on a GPU, where each thread in a block performs a computation on elements of the input arrays `a` and `b`. Specifically, for each thread, it calculates an index using the thread ID (`tid`), block ID (`bid`), and block dimension (`dim`). It then updates the element in array `a` at the current thread index by adding the product of the calculated index and the corresponding element in array `b`. This operation is performed in parallel across all threads in a block.
-
-
-
- ls main-hip-amdgcn-amd-amdhsa-*
- main-hip-amdgcn-amd-amdhsa-gfx803.bc
- main-hip-amdgcn-amd-amdhsa-gfx803.cui
- main-hip-amdgcn-amd-amdhsa-gfx803.o
- main-hip-amdgcn-amd-amdhsa-gfx803.out
- main-hip-amdgcn-amd-amdhsa-gfx803.out.resolution.txt
- main-hip-amdgcn-amd-amdhsa-gfx803.s
-Files with the .s extension hold the disassembled contents of the binary. The filename notes the graphics IPs used by the compiler. The contents of this file are similar to what roc-obj printed to the console.
-Unlike HIP on AMD, when compiling using the NVIDIA support of HIP, the resulting binary is a valid CUDA executable as far as the binary goes. Therefore, it incorporates PTX ISA (Parallel Thread eXecution Instruction Set Architecture) instead of AMDGPU binary. As a result, the tooling shipped with the CUDA SDK can be used to inspect which device ISA got compiled into a specific executable. The most useful tool for this purpose is cuobjdump.
-cuobjdump --list-ptx ./saxpy
-
- Which will print something like:
-**Following code does:** This code snippet appears to be part of a loop that launches a GPU kernel using the HIP (Heterogeneous-Compute Interface for Portability) API. The loop iterates in reverse, starting from `input_count` and decrementing until it is greater than 1. Within the loop, the `hipLaunchKernelGGL` function is called to execute a kernel function named `kernel` on the GPU. The kernel is launched with a grid configuration determined by `dim3(new_size(curr))` and a block size of `dim3(block_size)`. The kernel is provided with shared memory of size `factor * sizeof(unsigned)` and is executed on the default HIP stream. The parameters `front`, `back`, and `kernel_op` are passed to the kernel, likely as arguments for processing data. The purpose of this code is to perform parallel computations on a GPU, potentially for tasks like data processing or transformation, by repeatedly launching a kernel with varying configurations.
-1: saxpy.1.sm_52.ptx
-From this we can see that the saxpy kernel is stored as sm_52 , which shows that a compute capability 5.2 ISA got embedded into the executable, so devices which sport compute capability 5.2 or newer will be able to run this code.
-The HIP SDK for Windows doesn't yet include the roc-* set of utilities for working with binary artifacts. To find out which binary formats are embedded into an executable, use the dumpbin tool from the Windows SDK to obtain the raw data of the .hip_fat section of the executable. (This binary payload is what the roc-* utilities parse on Linux.) Skipping over the reported header, the raw data rendered as ASCII has about 3 lines per entry. Depending on how many binaries are embedded, you may need to alter the number of rendered lines. An invocation such as:
-**Following code does:** The code snippet `1;)` is not valid Python code. It appears to be a syntax error, as Python does not allow a semicolon to follow a number without a preceding valid statement. This snippet will result in a syntax error if executed in a Python interpreter.
- dumpbin.exe /nologo /section:.hip_fat /rawdata:8 .\saxpy.exe | select -Skip 20 -First 12
-The output may look like:
-**Following code does:** This code snippet appears to be part of a larger function or program, likely written in C++ given the use of `std::swap`. The code performs the following high-level operations: - -1. It assigns a new value to the variable `curr` by calling a function `new_size(curr)`. -2. It checks if the updated value of `curr` is greater than 1. -3. If `curr` is greater than 1, it swaps the values of two variables, `front` and `back`, using the `std::swap` function. - -Overall, the code seems to be adjusting the size of something (possibly a container or data structure) and conditionally swapping two elements based on the new size.
-<_Python_>
-We can see that the compiler embedded a version 4 code object (more on code object versions) and used the LLVM target triple amdgcn-amd-amdhsa-gfx906 (more on target triples). Don't be alarmed that linux shows up as a binary format: AMDGPU binaries uploaded to the GPU for execution are proper Linux ELF binaries.
-Alternatively, we can call the compiler with --save-temps to dump all device binaries to disk in separate files.
-**Following code does:** The code snippet `<_Cuda_>` appears to be a placeholder or a symbolic representation rather than actual executable code. In the context of programming, particularly with Python, it might be intended to represent a CUDA-related object or module. CUDA (Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) model created by NVIDIA, which allows developers to use a GPU for general purpose processing. However, without additional context or actual code, it's not possible to determine the specific functionality or purpose of this placeholder.
- clang++ .\HIP-Basic\saxpy\main.hip -o saxpy.exe -I .\Common -lamdhip64 -L ${env:HIP_PATH}lib -O2 --save-temps
-Now we can list all the temporaries created while compiling main.hip via
-**Following code does:** This code snippet is part of a parallel reduction algorithm typically used in GPU programming, such as with CUDA. The purpose of this code is to perform a reduction operation (e.g., summation, finding the maximum) on an array of data stored in shared memory (`shared`). The reduction is done in parallel by multiple threads within a block. - -The loop iteratively reduces the number of elements by combining pairs of elements using a binary operation `op`. The `for` loop iterates over increasing powers of two, effectively halving the number of active elements in each iteration. The `__syncthreads()` function is used to synchronize all threads in the block, ensuring that all threads have completed their operations before proceeding to the next iteration. - -The code has been modified to use a different indexing strategy, where the index `j` is calculated based on the thread ID (`tid`) and the current step size (`i`). This change aims to ensure that only valid indices within the bounds of the block are accessed, preventing out-of-bounds memory access.
- Get-ChildItem -Filter main-hip-* | select -Property Name
-**Following code does:** The provided code snippet appears to be a fragment of a larger codebase, possibly related to parallel computing or GPU programming, given the use of `blockDim.x` which is common in CUDA programming for defining the number of threads in a block. The snippet seems to be part of a loop intended for a shared reduction operation, which is a common parallel algorithm used to reduce an array of values to a single value (e.g., sum, max). However, the snippet is incomplete and contains some formatting artifacts (e.g., `-` and `+` symbols), which suggest it might be part of a diff or patch file showing changes to the code. The high-level purpose of this snippet is likely to perform a reduction operation across threads within a block.
-**Following code does:** The code snippet `<_Cuda_>` appears to be a placeholder or a symbolic representation rather than actual executable Python code. In the context of programming, "Cuda" typically refers to NVIDIA's parallel computing platform and application programming interface (API) model, which allows developers to use a CUDA-enabled graphics processing unit (GPU) for general purpose processing. However, without additional context or actual code, it's not possible to determine the specific functionality or purpose of this placeholder.
-Name
------
-main-hip-amdgcn-amd-amdhsa-gfx906.bc
-main-hip-amdgcn-amd-amdhsa-gfx906.hipi
-main-hip-amdgcn-amd-amdhsa-gfx906.o
-main-hip-amdgcn-amd-amdhsa-gfx906.out
-main-hip-amdgcn-amd-amdhsa-gfx906.out.resolution.txt
-main-hip-amdgcn-amd-amdhsa-gfx906.s
-Files with the .s extension hold the disassembled contents of the binary and the filename directly informs us of the graphics IPs used by the compiler.
-**Following code does:** This code snippet is written in C++ and its purpose is to print the numbers from 0 to 3 sequentially. It uses a `for` loop to iterate over a range of integers starting from 0 up to, but not including, 4. During each iteration, it prints the current value of the loop variable `i` using the `printf` function.
-Get-ChildItem main-hip-*.s | Get-Content
- .text
- .amdgcn_target "amdgcn-amd-amdhsa--gfx906"
- .protected _Z12saxpy_kernelPKfPfj ; -- Begin function _Z12saxpy_kernelPKfPfj
- .globl _Z12saxpy_kernelPKfPfj
- .p2align 8
- .type _Z12saxpy_kernelPKfPfj,@function
-_Z12saxpy_kernelPKfPfj:
- ; %bb.0:
- s_load_dword s0, s[4:5], 0x4
- s_load_dword s1, s[6:7], 0x18
- s_waitcnt lgkmcnt(0)
- s_and_b32 s0, s0, 0xffff
- s_mul_i32 s8, s8, s0
- v_add_u32_e32 v0, s8, v0
- v_cmp_gt_u32_e32 vcc, s1, v0
- s_and_saveexec_b64 s[0:1], vcc
- s_cbranch_execz .LBB0_2
- ; %bb.1:
- s_load_dwordx4 s[0:3], s[6:7], 0x8
- v_mov_b32_e32 v1, 0
- v_lshlrev_b64 v[0:1], 2, v[0:1]
- s_waitcnt lgkmcnt(0)
- v_mov_b32_e32 v3, s1
- v_add_co_u32_e32 v2, vcc, s0, v0
- v_addc_co_u32_e32 v3, vcc, v3, v1, vcc
- global_load_dword v2, v[2:3], off
- v_mov_b32_e32 v3, s3
- v_add_co_u32_e32 v0, vcc, s2, v0
- v_addc_co_u32_e32 v1, vcc, v3, v1, vcc
- global_load_dword v3, v[0:1], off
- s_load_dword s0, s[6:7], 0x0
- s_waitcnt vmcnt(0) lgkmcnt(0)
- v_fmac_f32_e32 v3, s0, v2
- global_store_dword v[0:1], v3, off
- .LBB0_2:
- s_endpgm
- ...
-Unlike HIP on AMD, when compiling using the NVIDIA support for HIP, the resulting binary will be a valid CUDA executable. Therefore, it incorporates PTX ISA (Parallel Thread eXecution Instruction Set Architecture) instead of AMDGPU binary. As a result, tooling included with the CUDA SDK can be used to inspect which device ISA was compiled into a specific executable. The most helpful tool for this is cuobjdump.
-**Following code does:** This code snippet is a portion of assembly code that is part of a function, likely the `main` function of a program. It primarily performs the following actions: - -1. It saves the `rbx` register onto the stack to preserve its value. -2. It loads the address of a string (likely a format string) into the `rbx` register. -3. It sets up arguments for the `printf` function call by moving the address of the string into the `rdi` register and zeroing out the `esi` and `eax` registers, which are used for additional arguments and return values, respectively. -4. It calls the `printf` function twice, first with `esi` set to zero and then with `esi` set to one, likely printing the same string twice with different formatting or arguments. -5. The code does not show the complete function, but it is likely part of a simple program that prints a string to the console using the `printf` function from the C standard library. - -Overall, the code is focused on printing a string to the console using the `printf` function.
-<_Bash_>
-**Following code does:** This code snippet is a fragment of assembly code, likely generated by the GCC compiler, that involves calling the `printf` function twice. The purpose of this code is to print two integers to the standard output using the format string "%d". - -1. The first `printf` call prints the integer `2`. -2. The second `printf` call prints the integer `3`. - -The code uses the `printf` function from the Procedure Linkage Table (PLT), which is a common technique in dynamically linked executables to resolve function addresses at runtime. The `xor eax, eax` instructions are used to set the `eax` register to zero, which is a common way to prepare for a function call in x86-64 assembly. The `pop rbx` and `ret` instructions are used to restore the base pointer and return from the function, respectively.
-PTX file
-**Following code does:** This code is a low-level assembly representation of a simple C program that prints the numbers 0 through 3, each on a new line. It uses a loop to increment a counter (`ebx`) from 0 to 3, and in each iteration, it calls the `printf` function to print the current value of the counter. The loop continues until the counter reaches 4, at which point the program exits. The `.string "%d"` is used as the format string for `printf` to print integers.
-1: saxpy.1.sm_52.ptx
-This example shows that the SAXPY kernel is stored as sm_52 . It also shows that a compute capability 5.2 ISA was embedded into the executable, so devices that support compute capability 5.2 or newer will be able to run this code.
-Now that you've found what binary got embedded into the executable, find which format our available devices use.
-On Linux a utility called rocminfo helps us list all the properties of the devices available on the system, including which version of graphics IP ( gfxXYZ ) they employ. You can filter the output to have only these lines:
-**Following code does:** This code is an assembly routine for a function named `main`, likely compiled from a C/C++ program using the Microsoft Visual C++ (MSVC) compiler. The function's high-level purpose is to print a string (referred to as `'string'` in the code) four times using a loop. It uses the `printf` function to output the string. The loop is controlled by the `ebx` register, which is incremented each iteration until it reaches 4. The function sets up and cleans up the stack frame before and after the loop, respectively, and returns 0 upon completion.
-
-
-
-
-Now that you know which graphics IPs our devices use, recompile your program with the appropriate parameters.
-**Following code does:** This code snippet sequentially prints the integers 0, 1, 2, and 3 to the standard output. Each `printf` function call outputs one integer, formatted as a decimal integer (`%d`).
-Now the sample will run.
-**Following code does:** The code snippet appears to be an incorrectly formatted XML declaration. It contains two XML declaration lines, which are not valid in XML syntax. The first line is a correctly formatted XML declaration, while the second line is incorrectly closed with a question mark before the "xml" keyword. This snippet does not perform any functional operation and would likely result in a parsing error if used in an XML document.
- ./saxpy
- Calculating y[i] = a * x[i] + y[i] over 10000000 elements.
- First 10 elements of the results: [ 3, 5, 7, 9, 11, 13, 15, 17, 19, 21 ]
-**Following code does:** This code snippet is written in C++ and is used in a GPU programming context, likely with the HIP (Heterogeneous-Compute Interface for Portability) API, which is used for writing portable code across AMD and NVIDIA GPUs. The code retrieves the warp size from the device properties and uses a switch statement to launch a GPU kernel with a template parameter that matches the warp size. Specifically, if the warp size is 32, it launches a kernel with a template argument of 32; if the warp size is 64, it launches a kernel with a template argument of 64. This allows the kernel to be optimized based on the warp size of the GPU being used.
-<_Python_>
-On Linux HIP with the NVIDIA back-end, the deviceQuery CUDA SDK sample can help us list all the properties of the devices available on the system, including the compute capability of each device. A <major>.<minor> compute capability is passed to nvcc on the command line as sm_<major><minor>; for example, 8.6 is sm_86.
-Because it's not included as a binary, compile the matching example from ROCm.
-**Following code does:** This code snippet is a template-based mechanism that uses a static switch to select between different compile-time options for a CUDA or HIP kernel launch. Specifically, it uses `tmp::static_switch` to choose between two possible values for `warp_size` (32 or 64). Depending on the selected value, it launches a GPU kernel (`kernel`) with the corresponding warp size as a template parameter. This allows the code to optimize the kernel execution for different warp sizes at compile time, potentially improving performance by tailoring the execution to the specific hardware configuration.
-Filter the output to have only the lines of interest, for example:
-**Following code does:** The code snippet appears to be incomplete or malformed. It seems to be missing context or additional code that would clarify its purpose. The snippet includes a generic placeholder `t WarpSize>()`, which might suggest a template or a function call in a language like C++ rather than Python. However, without additional context or surrounding code, it's not possible to determine its high-level purpose or functionality.
-<_YAML_>
-Note: Next to the nvcc executable is another tool called __nvcc_device_query, which prints the SM architecture numbers to standard output as a comma-separated list. The utility's name suggests it's not a user-facing executable but is used by nvcc to determine which devices are present in the system.
-Now that you know which graphics IPs our devices use, recompile your program with the appropriate parameters.
-**Following code does:** This code snippet is a modification of a GPU kernel function written in HIP (Heterogeneous-Compute Interface for Portability), which is used for parallel computing on AMD and NVIDIA GPUs. The changes introduce a template parameter `WarpSize` to the kernel, allowing it to handle different warp sizes (32 for NVIDIA and RDNA AMD GPUs, 64 for CDNA AMD GPUs) more flexibly. The shared memory reduction loop is adjusted to stop at the warp size, and a new warp-level reduction is added using a static loop unrolling technique (`tmp::static_for`). This ensures that the kernel can efficiently perform reductions across threads within a warp, adapting to the specific warp size of the target hardware.
-
-
-
- ... -arch=sm_70,sm_86
-Note: If you want to portably target the development machine which is compiling, you may specify -arch=native instead.
-Now the sample will run.
-**Following code does:** It seems there is a formatting error in your request. The code snippet is labeled as C++ but is enclosed within Python code tags. Please provide the correct code snippet or clarify the language so I can assist you accurately.
- ./saxpy
- Calculating y[i] = a * x[i] + y[i] over 10000000 elements.
- First 10 elements of the results: [ 3, 5, 7, 9, 11, 13, 15, 17, 19, 21 ]
-On Windows, a utility called hipInfo.exe helps us list all the properties of the devices available on the system, including which version of graphics IP ( gfxXYZ ) they employ. Filter the output to have only these lines:
-**Following code does:** The provided code snippet appears to be a collection of random characters and symbols, including some that resemble HTML/XML tags and others that are typical in programming syntax. However, it does not form any valid or meaningful code in Python or any other programming language. It seems to be a nonsensical or corrupted text rather than a functional code snippet.
-& ${env:HIP_PATH}bin\hipInfo.exe | Select-String gfx
-
-gcnArchName: gfx1032
-gcnArchName: gfx1035
-Now that you know which graphics IPs our devices use, recompile your program with the appropriate parameters.
-**Following code does:** The code snippet provided is incomplete and contains only a comment: `// Warp reduction`. This comment suggests that the code is likely part of a larger program, possibly written in a language like CUDA C/C++ for GPU programming, where "warp reduction" is a common technique. Warp reduction is used to efficiently perform parallel reduction operations (such as summing elements) within a warp, which is a group of threads that execute the same instruction simultaneously on a GPU. However, without additional context or code, it's not possible to describe the specific implementation or purpose beyond this general concept.
- clang++ .\HIP-Basic\saxpy\main.hip -o saxpy.exe -I .\Common -lamdhip64 -L ${env:HIP_PATH}lib -O2 --offload-arch=gfx1032 --offload-arch=gfx1035
-Now the sample will run.
-**Following code does:** The provided code snippet is a partial definition of a CUDA kernel function template in C++. This kernel is designed to be executed on a GPU. It uses template parameters to allow flexibility in specifying the block size, warp size, data type, and operation type. The kernel function, named `kernel`, takes several parameters: pointers to two arrays (`front` and `back`), an operation (`op`), a zero element (`zero_elem`), and the size of the `front` array (`front_size`). The `__launch_bounds__(BlockSize)` attribute is used to specify the maximum number of threads per block, optimizing the kernel's execution configuration. The actual implementation of the kernel is not provided in the snippet.
- .\saxpy.exe
- Calculating y[i] = a * x[i] + y[i] over 10000000 elements.
- First 10 elements of the results: [ 3, 5, 7, 9, 11, 13, 15, 17, 19, 21 ]
-On Windows HIP with the NVIDIA back-end, the deviceQuery CUDA SDK sample can help us list all the properties of the devices available on the system, including the compute capability of each device. A <major>.<minor> compute capability is passed to nvcc on the command line as sm_<major><minor>; for example, 8.6 is sm_86.
-Because it's not included as a binary, compile the matching example from ROCm.
-**Following code does:** This code snippet is part of a CUDA kernel designed for parallel computation on a GPU. It involves reading data from a global memory buffer (`front`) and performing operations on it using shared memory to optimize performance. Here's a high-level summary of what the code does: - -1. **Warp and Block Configuration**: It calculates the number of warps (`WarpCount`) within a block based on the block size and warp size. Warps are groups of threads that execute instructions in lockstep. - -2. **Shared Memory Allocation**: It allocates shared memory (`shared`) for storing intermediate results. The size of this shared memory is determined by the number of warps in the block. - -3. **Safe Reading Functions**: It defines two lambda functions, `read_global_safe` and `read_shared_safe`, to safely read elements from global and shared memory, respectively. These functions ensure that reads do not exceed the bounds of the memory arrays. - -4. **Thread and Block Indexing**: It calculates various indices (`tid`, `bid`, `gid`, `wid`, `lid`) to identify the thread's position within the block and grid. These indices are used to determine which data elements each thread should process. - -5. **Data Processing**: Each thread reads two elements from the global memory (`front`) using the `read_global_safe` function and applies an operation (`op`) on them. The result is stored in a local variable (`res`). - -Overall, this code is part of a larger CUDA kernel that performs parallel data processing using shared memory to improve performance by reducing global memory accesses. The use of shared memory allows for efficient communication of results between threads within the same block.
- nvcc .\HIP-Basic\device_query\main.cpp -o device_query.exe -I .\Common -I ${env:HIP_PATH}include -O2
-Filter the output to have only the lines of interest, for example:
-**Following code does:** This code snippet appears to be part of a parallel computing or multi-threading context, likely using a GPU or a similar architecture where threads are organized into blocks. The code checks if the current thread's identifier (`tid`) is zero, which typically means it is the first thread in a block. If this condition is true, it writes a computed result (`res`) from local memory to a shared or global memory location (`back`) indexed by the block identifier (`bid`). This pattern is often used to consolidate results from multiple threads within a block into a single output.
- .\device_query.exe | Select-String "major.minor"
-
- major.minor: 8.6
- major.minor: 7.0
-Note: Next to the nvcc executable is another tool called __nvcc_device_query.exe, which prints the SM architecture numbers to standard output as a comma-separated list. The utility's name suggests it's not a user-facing executable but is used by nvcc to determine which devices are present in the system.
-Now that you know which graphics IPs our devices use, recompile your program with the appropriate parameters.
-**Following code does:** The code snippet appears to be incomplete and contains a syntax error. It seems like it might be part of a larger piece of code, possibly involving threading or parallel processing, given the mention of "ItemsPerThread". However, without additional context or complete code, it's not possible to determine its high-level purpose.
-
-
-
-Note: If you want to portably target the development machine which is compiling, you may specify -arch=native instead.
-Now the sample will run.
-**Following code does:** The code snippet provided is a fragment of a CUDA kernel function declaration in C++. CUDA is a parallel computing platform and application programming interface model created by NVIDIA. The purpose of this code is to define a GPU kernel function that can be executed on an NVIDIA GPU. - -- `__global__`: This keyword indicates that the function is a CUDA kernel, which means it is executed on the GPU but can be called from the host (CPU) code. -- `static`: This keyword suggests that the kernel function has internal linkage, meaning it is only visible within the translation unit it is defined in. -- `__launch_bounds__(BlockSize)`: This is an optional CUDA attribute that specifies the maximum number of threads per block (`BlockSize`) that the kernel is optimized for. It helps the compiler optimize the kernel for specific execution configurations. - -The ellipsis (`...`) indicates that the function's parameters and body are not shown. Overall, this snippet is part of a larger CUDA program designed to perform parallel computations on a GPU.
- .\saxpy.exe
- Calculating y[i] = a * x[i] + y[i] over 10000000 elements.
- First 10 elements of the results: [ 3, 5, 7, 9, 11, 13, 15, 17, 19, 21 ]
-**Following code does:** The code snippet `<_C_>` appears to be a placeholder or a symbolic representation rather than actual executable code. It does not perform any operations or have any functionality in Python as it stands. It might be used as a marker or a template in a larger context where specific code or content is meant to be inserted.
-<_Python_>
-Reduction is a common algorithmic operation used in parallel programming to reduce an array of elements into a shorter array of elements or a single value. This document exploits reduction to introduce some key considerations while designing and optimizing GPU algorithms.
-This document revisits and extends the invaluable work of Mark Harris. Although it starts from a less naive approach, reviewing some of the original material is valuable to see how much the underlying hardware has changed, and it gives more insight into that progress.
-Reduction has many names depending on the domain: in functional programming it's referred to as fold, in C++ it's called std::accumulate, and since C++17 also std::reduce. A reduction takes a range of inputs and 'reduces' the range with a binary operation to a single scalar output. Canonically, a reduction requires a 'zero' element that bootstraps the algorithm and serves as one of the initial operands to the binary operation. The 'zero' element is generally called the identity or neutral element in group theory, meaning that it is an operand that doesn't change the result. Typical use cases include calculating a sum, normalizing a dataset, and finding the maximum value in a dataset. The last use case is discussed further in this tutorial.
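To make the role of the 'zero' element concrete, here is a minimal host-side sketch of the maximum-value reduction using std::accumulate. The function name `reduce_max` and the choice of `unsigned` elements are assumptions made for illustration; they are not taken from the tutorial's sources.

```cpp
#include <algorithm>
#include <limits>
#include <numeric>
#include <vector>

// Hypothetical helper: maximum of a range expressed as a reduction.
// For max over unsigned values the identity ('zero' element) is the
// smallest representable value, because max(zero_elem, x) == x for all x.
unsigned reduce_max(const std::vector<unsigned>& input)
{
    const unsigned zero_elem = std::numeric_limits<unsigned>::min();
    return std::accumulate(input.begin(),
                           input.end(),
                           zero_elem,
                           [](unsigned a, unsigned b) { return std::max(a, b); });
}
```

For an empty range the reduction simply returns the identity, which is why the 'zero' element has to be chosen so that it cannot change any real result.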
-There are multiple variations of reduction that allow parallel processing. The approach taken by std::reduce requires the user-provided binary operator to operate on any combination of identity and input range elements, or even exclusively on any of them. This allows you to insert any number of identities to facilitate parallel processing and then combine the partial results of parallel execution.
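The idea of inserting identities and then combining partial results can be sketched on the host as follows. This is only an illustration under assumptions: `chunked_max` and `chunk_size` are hypothetical names, and the chunks are processed sequentially here, whereas a parallel implementation would hand each chunk to a different worker.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: reduce fixed-size chunks independently, padding the
// last chunk with the identity, then reduce the partial results.
unsigned chunked_max(const std::vector<unsigned>& input, std::size_t chunk_size)
{
    assert(chunk_size > 0);
    const unsigned zero_elem = 0; // identity of max over unsigned values
    auto op = [](unsigned a, unsigned b) { return std::max(a, b); };

    std::vector<unsigned> partials;
    for(std::size_t begin = 0; begin < input.size(); begin += chunk_size)
    {
        unsigned partial = zero_elem;
        for(std::size_t k = 0; k < chunk_size; ++k)
        {
            const std::size_t idx = begin + k;
            // Out-of-range slots are filled with the identity.
            const unsigned value = idx < input.size() ? input[idx] : zero_elem;
            partial = op(partial, value);
        }
        partials.push_back(partial);
    }
    // Combining the partial results is itself a reduction with the same operator.
    return std::accumulate(partials.begin(), partials.end(), zero_elem, op);
}
```

Because the operator accepts any mix of identities, inputs, and partial results, the chunk size does not affect the final value.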
-Implementing reductions on GPUs requires a basic understanding of the HIP programming model (see /understand/programming_model_reference). The document explores aspects of low-level optimization that are best discussed through the Inherent thread model, and refrains from using cooperative groups.
-Synchronizing parallel threads of execution across a GPU is crucial for correctness, as partial results can't be combined before they have been produced. Synchronizing all the threads running on a GPU at any given time is possible, but it is a costly and intricate operation. If synchronization is not absolutely necessary, map the parallel algorithm so that multiprocessors and blocks can make independent progress and need not sync frequently.
-There are ten reduction implementations in the rocm-examples, which are described in the following sections.
-The naive algorithm takes a tree-like shape, where the computational domain is purposefully distributed among blocks. In every block, all threads participate in loading data from persistent (from the kernel's perspective) global memory into shared memory. The block then performs a tree-like reduction in shared memory, and a single thread writes the partial result to global memory, in a location unique to the block, which allows each block to make independent progress. The partial results are combined in subsequent launches of the same kernel until a scalar result is reached.
-This approach requires temporary storage based on the number of blocks launched, as each block outputs a scalar partial result. Depending on the need to store or destroy the input, a second temporary storage might be needed, which could be large enough to store the results of the second kernel launch. Alternatively, you can reuse the storage of the larger than necessary original input. These implementations differ so slightly that the document only considers the use case where the input could be destroyed.
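The multi-pass scheme with a destroyable input can be simulated on the host to see how the two buffers are used. The sketch below is an assumption-laden model, not the tutorial's kernel: `multi_pass_max` is a hypothetical name, `factor` stands for the number of inputs each block consumes, and the inner loop plays the role of one block's tree reduction.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical host-side model: every pass reduces groups of `factor`
// elements of `front` into one element of `back`, then the buffers swap.
unsigned multi_pass_max(std::vector<unsigned> front, std::size_t factor)
{
    assert(factor > 1);
    if(front.empty()) return 0;           // identity of max over unsigned
    if(front.size() == 1) return front[0];

    // Number of partial results produced by one pass (one per block).
    auto new_size = [factor](std::size_t n) { return (n + factor - 1) / factor; };

    // Temporary storage sized by the number of blocks of the first pass.
    std::vector<unsigned> back(new_size(front.size()), 0u);

    for(std::size_t curr = front.size(); curr > 1;)
    {
        const std::size_t blocks = new_size(curr);
        for(std::size_t bid = 0; bid < blocks; ++bid)
        {
            unsigned partial = 0u;
            const std::size_t end = std::min(curr, (bid + 1) * factor);
            for(std::size_t i = bid * factor; i < end; ++i)
                partial = std::max(partial, front[i]);
            back[bid] = partial;          // location unique to the block
        }
        curr = blocks;
        if(curr > 1) std::swap(front, back); // the last swap is omitted
    }
    return back[0];                       // result always ends up in `back`
}
```

After the first swap the original input buffer serves as the output buffer, which is why no second temporary allocation is needed when the input may be overwritten.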
-**Following code does:** The code snippet defines a lambda function `read_global_safe` in C++ that reads a sequence of elements from an array `front` into a static array of a specified size `ItemsPerThread`. The lambda uses a template parameter pack and `std::integer_sequence` to iterate over indices. If the index plus the number of items per thread is within the bounds of `front_size`, it directly loads elements from `front`. Otherwise, it loads elements conditionally, substituting a `zero_elem` for out-of-bounds indices. This ensures safe reading from the array without exceeding its bounds.
-
-
-
-For threads that don't have unique inputs, feed zero_elem instances to threads. The backing of double-buffering is allocated as such:
-**Following code does:** This code snippet initializes an array `arr` of size 4 with elements from another array or list called `front`. The elements are selected based on an index `gid`, such that `arr` contains four consecutive elements starting from `front[gid]` to `front[gid + 3]`. The type of the elements in `arr` is denoted by `T`, which suggests that this code is likely written in a language that uses templates or generics, such as C++ or a similar language. The purpose of this code is to create a subarray or slice of four elements from the `front` array starting at a specific index `gid`.
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-Data is initialized on the host and dispatched to the device followed by the commencement of device-side reduction. The swapping of the double-buffer on the last iteration is omitted, therefore the result is in the back-buffer irrespective of the input size.
-**Following code does:** This code snippet initializes an array `arr` of size 4 with elements from another array `front`. It uses a conditional expression to fill each position in `arr`. For each index `i` from 0 to 3, it checks if `i` is within the bounds of `front` (i.e., `i < front_size`). If it is, the corresponding element from `front` is used; otherwise, a default value `zero_elem` is used. This effectively copies up to four elements from `front` into `arr`, filling any out-of-bounds positions with `zero_elem`.
-
-for (uint32_t curr = input_count; curr > 1;)
-{
-    hipLaunchKernelGGL(
-        kernel,
-        dim3(new_size(curr)),
-        dim3(block_size),
-        factor * sizeof(unsigned),
-        hipStreamDefault,
-        front,
-        back,
-        kernel_op,
-**Following code does:** This code snippet is a host-side reference implementation of a reduction operation typically executed on a GPU. It processes an input vector of unsigned integers by dividing it into partitions of a specified size (`partition_size`). For each partition, it calculates the sum of its elements and stores the result in a new vector (`result`). The size of the result vector is determined by dividing the total input size by the partition size, effectively reducing the input data by aggregating sums of each partition. This operation is useful for tasks like parallel processing where data needs to be reduced or aggregated efficiently.
- zero_elem,
- curr);
-
- curr = new_size(curr);
- if (curr > 1)
- std::swap(front, back);
-}
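The buffer bookkeeping of the host loop above can be checked on the CPU. The following is a minimal sketch under stated assumptions: `new_size_for` is assumed to be a ceiling division by the reduction factor (the source does not show the definition of `new_size`), and `reduce_pass` is a hypothetical serial stand-in for one kernel launch, not the kernel itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Assumed definition: every block consumes `factor` inputs and emits one output.
inline std::uint32_t new_size_for(std::uint32_t factor, std::uint32_t actual)
{
    return actual / factor + (actual % factor == 0 ? 0 : 1);
}

// Hypothetical serial stand-in for one kernel launch: sums each chunk of
// `factor` elements of `front` into one element of `back`.
inline void reduce_pass(const std::vector<unsigned>& front,
                        std::vector<unsigned>& back,
                        std::uint32_t factor,
                        std::uint32_t curr)
{
    for (std::uint32_t b = 0; b < new_size_for(factor, curr); ++b)
    {
        unsigned acc = 0; // plays the role of zero_elem for addition
        for (std::uint32_t i = b * factor; i < std::min(curr, (b + 1) * factor); ++i)
            acc += front[i];
        back[b] = acc;
    }
}

// Mirrors the host loop: the swap is skipped after the last pass, so the
// result ends up in element 0 of the back buffer.
inline unsigned reduce_double_buffered(std::vector<unsigned> front, std::uint32_t factor)
{
    assert(factor >= 2); // a factor of 1 would never shrink the input
    std::uint32_t curr = static_cast<std::uint32_t>(front.size());
    if (curr == 0) return 0u;       // nothing to reduce
    if (curr == 1) return front[0]; // the loop would not run a single pass
    std::vector<unsigned> back(new_size_for(factor, curr));
    while (curr > 1)
    {
        reduce_pass(front, back, factor, curr);
        curr = new_size_for(factor, curr);
        if (curr > 1)
            std::swap(front, back);
    }
    return back[0];
}
```

With 1000 inputs and a factor of 8, the passes shrink the data to 125, 16, 2, and finally 1 element, swapping the two buffers after every pass except the last.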
-This structure persists in the kernel throughout all the variations of reduction with slight modifications to factor and shared memory allocation:
-**Following code does:** The code snippet appears to be part of a function in a programming language that uses semicolons to terminate statements, likely C, C++, or Java. It seems to be iterating over some collection or array, assigning a value `partition_result` to an element at index `i` in an array or list called `result`. After the loop completes, the function returns the `result` array or list. The high-level purpose of this code is to populate the `result` array with values computed or retrieved during the loop and then return this populated array.
-<_Cuda_>
-While the tid % (2 * i) == 0 indexing scheme yields correct results, it also leads to high thread divergence. Thread divergence occurs when the threads of a warp take different paths, so they would have to execute different instructions in a given clock cycle. It is easily triggered by if-else statements, as shown here, but can also arise from for loops whose trip count depends on the thread ID. Even though the number of active threads participating in the reduction shrinks, warps remain active longer than necessary, because at least one lane in each warp still enters the if statement.
-You can reduce divergence by keeping dataflow between memory addresses identical but reassigning the thread ids.
-**Following code does:** The provided code snippet is a CUDA device function named `reduce_sum` that performs a parallel reduction to compute the sum of unsigned integer values within a thread group using shared memory. The function takes a `thread_group` object `g`, a shared memory pointer `x`, and an unsigned integer `val` as inputs. It uses a loop to iteratively halve the number of active threads, synchronizing them at each step, and accumulates the sum of values from different threads. The final result of the reduction is stored in the first thread of the group, while other threads return 0. This function is typically used in GPU programming to efficiently compute sums across threads in a block or custom partition.
-// Shared reduction
-for (uint32_t i = 1; i < blockDim.x; i *= 2)
-{
-- if (tid % (2 * i) == 0)
-- shared[tid] = op(shared[tid], shared[tid + i]);
-+ if (uint32_t j = 2 * i * tid; j < blockDim.x)
-+ shared[j] = op(shared[j], shared[j + i]);
- __syncthreads();
-}
-This way, inactive threads accumulate uniformly towards the higher thread ID range, so entire warps can skip straight to __syncthreads(). However, this introduces bank conflicts.
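The effect of reassigning thread IDs can be estimated on the CPU. The sketch below is an illustration only: `active_warp_steps` is a hypothetical helper that, for every step of the loop `for (i = 1; i < block_size; i *= 2)`, counts the warps that still contain at least one active lane under either indexing scheme.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Sums, over all reduction steps, the number of warps with at least one
// active lane. `remapped == false` models `tid % (2 * i) == 0`;
// `remapped == true` models `2 * i * tid < block_size`.
inline std::uint32_t active_warp_steps(std::uint32_t block_size,
                                       std::uint32_t warp_size,
                                       bool remapped)
{
    std::uint32_t total = 0;
    for (std::uint32_t i = 1; i < block_size; i *= 2)
    {
        std::set<std::uint32_t> warps; // IDs of warps that must execute the body
        for (std::uint32_t tid = 0; tid < block_size; ++tid)
        {
            const bool active = remapped ? (2 * i * tid < block_size)
                                         : (tid % (2 * i) == 0);
            if (active)
                warps.insert(tid / warp_size);
        }
        total += static_cast<std::uint32_t>(warps.size());
    }
    return total;
}
```

For a block of 256 threads and a warp size of 64, the original scheme keeps warps busy for 27 warp-steps in total, while the remapped scheme needs only 9.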
-Both AMD and NVIDIA implement shared memory in hardware by organizing storage into banks of various sizes. This hardware element is known as Local Data Share (LDS) on AMD hardware. On NVIDIA hardware, it's implemented using the same silicon as the L1 data cache. You can think of shared memory as a striped 2-dimensional range of memory. The count, width, and depth of the banks depend on the architecture. A bank conflict occurs when different threads in a warp access the same bank during the same operation. In this case, the hardware serializes the attempted concurrent accesses to that bank.
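To make the notion of a bank conflict concrete, the following sketch counts how many distinct words land in the busiest bank for a strided access pattern. The bank layout is an assumption for illustration (word `w` lives in bank `w % banks`); real bank counts and widths depend on the architecture, and `conflict_degree` is a hypothetical helper.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Lane l accesses word index l * stride. Returns the largest number of
// distinct words that map to a single bank; 1 means conflict-free.
// Lanes reading the very same word count once, which models a broadcast.
inline std::uint32_t conflict_degree(std::uint32_t lanes,
                                     std::uint32_t stride,
                                     std::uint32_t banks)
{
    std::vector<std::set<std::uint32_t>> words(banks);
    for (std::uint32_t lane = 0; lane < lanes; ++lane)
    {
        const std::uint32_t word = lane * stride;
        words[word % banks].insert(word);
    }
    std::size_t worst = 0;
    for (const auto& bank : words)
        worst = std::max(worst, bank.size());
    return static_cast<std::uint32_t>(worst);
}
```

Under these assumptions, the remapped index `j = 2 * i * tid` corresponds to a stride of `2 * i` words, so the conflict degree grows as the reduction proceeds, whereas a stride of 1 stays conflict-free.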
-A notable exception is when the shared read uniformly broadcasts the same address across the entire warp. A better implementation of the naive algorithm is to form contiguous ranges of active threads and of their memory accesses.
-**Following code does:** This code snippet is part of a CUDA program, which is designed to run on NVIDIA GPUs. Its high-level purpose is to set up a parallel computation environment using CUDA's thread block and shared memory features. Specifically, it: - -1. Declares shared memory (`workspace`) to be used by threads within a block for operations like reduction, which is a common parallel algorithm for combining elements (e.g., summing an array). -2. Defines a `thread_block` object (`thread_block_group`) that represents all threads within a CUDA block, allowing them to coordinate and share data. -3. Loads an input value from global memory (`d_vector`) into a local variable (`input`) for each thread, based on the thread's rank within the block. -4. Creates a `custom_partition` of threads within the block, where each partition consists of 16 threads. This partitioning allows for more fine-grained control over thread collaboration and data sharing within the block. - -Overall, the code sets up the necessary structures for performing parallel computations on a GPU, leveraging shared memory and thread coordination to optimize performance.
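-// Shared reduction
--for (uint32_t i = 1; i < blockDim.x; i *= 2)
--{
--    if (uint32_t j = 2 * i * tid; j < blockDim.x)
--        shared[j] = op(shared[j], shared[j + i]);
-+for (uint32_t i = blockDim.x / 2; i != 0; i /= 2)
-+{
-+    if (tid < i)
-+        shared[tid] = op(shared[tid], shared[tid + i]);
-     __syncthreads();
- }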
-Note: To avoid bank conflicts, read shared memory in a coalesced manner, which implies that reads/writes of each lane in a warp evaluate to consecutive locations. Analyzing the read/write patterns could help you to understand the cause of bank conflicts. For more details, check CDNA3 ISA or RDNA3 ISA data share operations chapter.
-The preceding implementation is free of low-level GPU-specific anti-patterns. However, it still exhibits some common shortcomings. The loop performing the reduction in shared memory starts from i = blockDim.x / 2, and the first predicate if (tid < i) immediately disables half of the block, so those threads only help load the data into shared memory. You can change the kernel along with the calculation of factor on the host, as shown here:
-**Following code does:** This code snippet performs a parallel reduction operation on a set of input data using a thread block group, which is a collection of threads that work together. The `reduce_sum` function aggregates the input data into a single sum, storing the result in the `output` variable. After the reduction, only the first thread in the thread block group (determined by checking if its rank is 0) writes the computed sum to the first element of the `d_block_reduced_vector` array. This ensures that only one thread outputs the final reduced value, preventing race conditions or redundant writes.
-<_Cuda_>
-By launching half as many threads and giving each of them meaningful work through an unconditional binary op, no threads sit idle.
-Even though global memory is read in a coalesced fashion, as preferred by the memory controller, optimal performance is still limited by the instruction throughput.
-
-### Omit superfluous synchronization
-
-Warps are known to execute in a strict lockstep fashion. Therefore, once shared reduction reaches a point where only a single warp participates meaningfully, you can cut short the loop and let the rest of the warps terminate. Moreover, you can also unroll the loop without syncing the entire block.
-The tmp namespace used beyond this point in this document holds a handful of template meta-programmed utilities to facilitate writing flexible and optimal code.
-tmp::static_for is not just a constant folding within the optimizer but a variation of the language for loop, where the running index is a compile-time constant and is eligible for use in compile-time evaluated contexts.
-Consider the following code:
-**Following code does:** This code snippet appears to be part of a parallel computing or GPU programming context, where it performs a reduction operation. The `reduce_sum` function is used to sum elements within a specified partition of data. The `custom_partition` likely defines how the data is divided, and `workspace[group_offset]` and `input` are the data sources involved in the reduction. The comment indicates that only the first thread in each partition will return a valid result, which is a common pattern in parallel reductions to ensure that only one thread writes the final result of the reduction for each partition.
-constexpr int size = 4;
-for (int i = 0 ; i < size ; ++i)
-{
- printf("%d", i);
-}
-This compiles to the following binaries:
-**Following code does:** This code snippet appears to be part of a parallel computing operation, likely using CUDA or a similar framework for GPU programming. The code calculates a `partition_id` for a thread within a block by dividing the thread's rank by a constant `PartitionSize`. It then assigns a value `output` to an element in the `d_partition_reduced_vector` array at the index corresponding to this `partition_id`. The purpose is to organize or reduce data into partitions based on thread ranks within a block.
-LLVM Block
-main:
- push rbx
- lea rbx, [rip +.L.str]
- mov rdi, rbx
- xor esi, esi
- xor eax, eax
- call printf@PLT
- mov rdi, rbx
- mov esi, 1
- xor eax, eax
- call printf@PLT
- mov rdi, rbx
- mov esi, 2
- xor eax, eax
- call printf@PLT
- mov rdi, rbx
- mov esi, 3
- xor eax, eax
- call printf@PLT
- xor eax, eax
- pop rbx
- ret
-.L.str:
- .asciz "%d"
-
- GCC
- .LC0:
- .string "%d"
- main:
- push rbx
- xor ebx, ebx
- .L2:
- mov esi, ebx
- mov edi, OFFSET FLAT:.LC0
- xor eax, eax
- add ebx, 1
- call printf
- cmp ebx, 4
- jne .L2
- xor eax, eax
- pop rbx
- ret
-
- MSVC
-**Following code does:** This code snippet is part of a program that uses the HIP (Heterogeneous-Compute Interface for Portability) API to launch a cooperative kernel on a GPU. The cooperative kernel is likely designed to perform a reduction operation on a vector, as suggested by the variable names. The `hipLaunchCooperativeKernel` function is used to initiate the execution of the `vector_reduce_kernel` on the GPU, with `params` being an array of pointers to the data structures (`d_vector`, `d_block_reduced`, and `d_partition_reduced`) that the kernel will operate on. The cooperative groups API allows for more efficient synchronization and communication between threads within a GPU kernel.
-
-main PROC
- $LN12:
- push rbx
- sub rsp, 32
- xor ebx, ebx
- npad 8
- $LL4@main:
- mov edx, ebx
- lea rcx, OFFSET FLAT:'string'
- call printf
- inc ebx
- cmp ebx, 4
- jl SHORT $LL4@main
- xor eax, eax
- add rsp, 32
- pop rbx
- ret 0
- main ENDP
-LLVM unrolls the loop and compiles to a flat series of printf invocations, while both GCC and MSVC keep the loop intact, as visible from the compare ( cmp ) and the jump ( jne , jl ) instructions. LLVM code generation is identical to manually writing the unrolled loop:
-printf("%d", 0);
-printf("%d", 1);
-printf("%d", 2);
-printf("%d", 3);
-While various non-standard pragmas are available to hint or force the compiler to unroll the loop, we instead use template meta-programming to force feed the compiler the unrolled loop.
-The most notable structural difference is that in the language for loop, the loop variable is given a name in the beginning, while in the static_for utility, the loop variable is given a name in the end. An important bonus is that in the loop's body, you can use the running index i in contexts requiring constant expressions such as template arguments or inside if constexpr .
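The idea behind such a compile-time loop can be sketched in a few lines. The following is a hypothetical, simplified stand-in and not the `tmp::static_for` utility itself: it only counts upward by one, and it passes the index as a `std::integral_constant` argument rather than as a template parameter of the lambda, so that it also compiles as C++14.

```cpp
#include <cassert>
#include <type_traits>
#include <vector>

// Calls f(std::integral_constant<int, I>{}) for I = Begin, ..., End - 1.
// Each call is a separate instantiation, so the loop is unrolled by construction.
template <int Begin, int End>
struct static_for_impl
{
    template <typename F>
    static void run(F&& f)
    {
        f(std::integral_constant<int, Begin>{});
        static_for_impl<Begin + 1, End>::run(f);
    }
};

// Terminating case: Begin has reached End, nothing left to call.
template <int End>
struct static_for_impl<End, End>
{
    template <typename F>
    static void run(F&&) {}
};

template <int Begin, int End, typename F>
void static_for(F&& f)
{
    static_for_impl<Begin, End>::run(f);
}

// Usage: the running index is usable in constant expressions inside the body.
inline std::vector<int> collect_squares()
{
    std::vector<int> out;
    static_for<0, 4>([&](auto i) {
        constexpr int sq = decltype(i)::value * decltype(i)::value;
        static_assert(sq >= 0, "evaluated at compile time");
        out.push_back(sq);
    });
    return out;
}
```

Here the index is named at the end, as the parameter of the lambda, which mirrors the structural difference described above.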
-tmp::static_switch takes a runtime value and dispatches at runtime to one of a set of tabulated functions, inside which that value is a compile-time constant and is eligible for use in compile-time evaluated contexts.
-Consider the following code:
-int warp_size = device_props.warpSize;
-switch (warp_size)
-{
-case 32:
-    hipLaunchKernelGGL(kernel<32>, ...);
-    break;
-case 64:
-    hipLaunchKernelGGL(kernel<64>, ...);
-    break;
-}
-In the preceding code, note the repetition for every value of warp_size that the code is prepared to handle. To avoid this, use tmp::static_switch, as shown:
-tmp::static_switch<std::array{32, 64}>(warp_size, [&]<int WarpSize>()
-{
-    hipLaunchKernelGGL(kernel<WarpSize>, ...);
-});
-**Following code does:** This code snippet is a sequence of shell commands used to compile and link a C++ program that utilizes GPU resources with the HIP (Heterogeneous-Compute Interface for Portability) compiler, `hipcc`. - -1. The first command compiles `hipDevice.cpp` into an object file `hipDevice.o` with GPU relocatable device code enabled (`-fgpu-rdc`). -2. The second command creates a static library `libHipDevice.a` from the object file `hipDevice.o` using the `ar` archiving tool. -3. The third command links the static library `libHipDevice.a` with another source file `test.cpp`, again with GPU relocatable device code enabled, and produces an executable `test.out`. - -Overall, this process compiles and links a GPU-accelerated application using HIP, organizing the code into a static library before creating the final executable.
-
- HIP Documentation, Release 6.1.40092
-
-
-
- -template<typename T, typename F>
- +template<uint32_t WarpSize, typename T, typename F>
- __global__ void kernel(
- ...
- )
- {
- ...
- // Shared reduction
- -for (uint32_t i = blockDim.x / 2; i!= 0; i /= 2)
- +for (uint32_t i = blockDim.x / 2; i > WarpSize; i /= 2)
- {
- if (tid < i)
- shared[tid] = op(shared[tid], shared[tid + i]);
- __syncthreads();
- }
- +// Warp reduction
- +tmp::static_for<WarpSize, tmp::not_equal<0>, tmp::divide<2>>([&]<int I>()
- +{
- + if (tid < I)
- + shared[tid] = op(shared[tid], shared[tid + I]);
- +#ifdef __HIP_PLATFORM_NVIDIA__
- + __syncwarp(0xffffffff >> (WarpSize - I));
- +#endif
- +});
-
-Because HIP typically targets hardware with warp sizes of 32 (NVIDIA GPUs and RDNA AMD GPUs) and 64 (CDNA AMD GPUs), portable HIP code must handle both. Instead of assuming a warp size of 32, make the warp size a template argument of the kernel. This allows you to unroll the final loop using tmp::static_for in a parametric way while the code still reads much like an ordinary loop.
-Promoting the warp size to being a compile-time constant also requires you to handle it similarly on the host-side. You can sandwich the kernel launch with tmp::static_switch , promoting the snake-case run-time warp_size variable to a camel-case compile-time constant WarpSize .
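The promotion from a run-time value to a compile-time constant can be sketched as follows. This is a hypothetical, simplified stand-in and not the `tmp::static_switch` utility: the candidate list is a plain template parameter pack, the matched value is passed as a `std::integral_constant`, and `lanes_after_first_step` is an invented placeholder for a kernel template such as `kernel<WarpSize>`.

```cpp
#include <cassert>
#include <type_traits>

// Compares a run-time value against compile-time candidates and calls
// f(std::integral_constant<int, Match>{}) for the first match.
template <int... Candidates>
struct static_switch;

// No candidates left: nothing matched.
template <>
struct static_switch<>
{
    template <typename F>
    static bool run(int, F&&) { return false; }
};

template <int Head, int... Tail>
struct static_switch<Head, Tail...>
{
    template <typename F>
    static bool run(int value, F&& f)
    {
        if (value == Head)
        {
            f(std::integral_constant<int, Head>{});
            return true;
        }
        return static_switch<Tail...>::run(value, f);
    }
};

// Invented placeholder for a kernel that needs WarpSize at compile time.
template <int WarpSize>
int lanes_after_first_step() { return WarpSize / 2; }

// Promotes the run-time warp_size to the compile-time constant WarpSize.
inline int dispatch(int warp_size)
{
    int result = -1; // stays -1 when warp_size is not a supported candidate
    static_switch<32, 64>::run(warp_size, [&](auto ws) {
        constexpr int WarpSize = decltype(ws)::value;
        result = lanes_after_first_step<WarpSize>();
    });
    return result;
}
```

The body of the lambda is written once, yet it is instantiated separately for every candidate, which is what removes the repetition of the hand-written switch.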
-<_C++_>
-Note: Neither RDNA- nor CDNA-based AMD hardware provides guaranteed independent progress to lanes of the same warp. When targeting NVIDIA hardware, lanes of a warp might execute somewhat independently as long as the programmer assists the compiler using dedicated built-in functions. This feature is called Independent Thread Scheduling. The HIP headers don't expose the necessary warp primitives and their overloads.
-Portable applications can still tap into this feature with carefully #ifdef-ed code, but at this particular optimization level, it's a requirement. The code implicitly relies on the lockstep behavior of a ROCm wavefront, but CUDA warps don't share this property. You must synchronize all the active lanes of a warp to avoid a data race with some lanes progressing faster than others in the same warp.
-While the previous step primarily aims to remove unnecessary syncing, it also unrolls the end of the loop. However, you could also force unrolling the first part of the loop. This saves a few scalar registers (values the compiler can prove to be uniform across warps).
-Introducing yet another template argument for the kernel and moving from for to tmp::static_for leads to the following two notable improvements:
-Shared memory provides a fast communication path within a block. However, when performing the reduction within the last warp, you can use a faster means of communication: warp-collective, or cross-lane, functions. Instead of using the hardware-backed shared memory, you can directly copy between the local memory (registers) of each lane in a warp. This can be achieved using the shuffle functions.
-See how to use __shfl_down() , which is one of the most restrictive but also the most structured communication schemes.
- // Warp reduction
-Using warp-collective functions for communication requires the control flow to be uniform across warps, as the name warp-collective implies. Therefore, you can see that the thread ID is being checked outside the loop, but the result is written inside due to variable scoping.
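The data movement of a shuffle-based warp reduction can be emulated on the CPU. The sketch below is an illustration, not device code: `shfl_down_all` is a hypothetical helper that applies the documented `__shfl_down()` semantics to a whole warp at once, modelled as a vector with one entry per lane, and the lane count is assumed to be a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Lane l receives the value of lane l + delta; lanes whose source would fall
// outside the warp keep their own value.
inline std::vector<unsigned> shfl_down_all(const std::vector<unsigned>& lanes,
                                           std::uint32_t delta)
{
    std::vector<unsigned> out(lanes);
    for (std::size_t l = 0; l + delta < lanes.size(); ++l)
        out[l] = lanes[l + delta];
    return out;
}

// Tree reduction across the emulated warp. Every lane executes every step
// (uniform control flow); only lane 0 ends up holding the full sum.
inline unsigned warp_reduce_sum(std::vector<unsigned> lanes)
{
    for (std::uint32_t delta = static_cast<std::uint32_t>(lanes.size() / 2);
         delta != 0;
         delta /= 2)
    {
        const std::vector<unsigned> shifted = shfl_down_all(lanes, delta);
        for (std::size_t l = 0; l < lanes.size(); ++l)
            lanes[l] += shifted[l];
    }
    return lanes[0];
}
```

Lanes other than lane 0 accumulate meaningless partial values, which is why only one lane writes the result.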
-As mentioned in the previous step, communication between local memory is faster than shared memory. Instead of relying on the local memory only at the end of the tree-like reduction, a better approach is to turn the tree reduction inside out and perform multiple warp reductions in parallel on all active threads, thus communicating only their partial results through the shared memory.
-The kernel versions differ too much to be described with a diff, so the kernel is presented afresh instead.
-
- template<uint32_t BlockSize, uint32_t WarpSize, typename T, typename F>
- __global__ __launch_bounds__(BlockSize) void kernel(
- T* front,
- T* back,
- F op,
- T zero_elem,
- uint32_t front_size)
- {
- // ...
- }
-
-The kernel signature and the reduction factor are the same as in previous cases; only the implementation differs.
-static constexpr uint32_t WarpCount = BlockSize / WarpSize;
-
-__shared__ T shared[WarpCount];
-
-auto read_global_safe =
- [&](const uint32_t i) { return i < front_size? front[i] : zero_elem; };
-auto read_shared_safe =
- [&](const uint32_t i) { return i < WarpCount? shared[i] : zero_elem; };
-
-const uint32_t tid = threadIdx.x,
- bid = blockIdx.x,
- gid = bid * (blockDim.x * 2) + tid,
- wid = tid / WarpSize,
- lid = tid % WarpSize;
-
-// Read input from front buffer to local
-T res = op(read_global_safe(gid), read_global_safe(gid + blockDim.x));
-
-As we communicate the results of warps through shared memory, the same number of elements are required in the shared memory as warps within the block. Similar to how you can only launch kernels at block granularity, you can only warp reduce with WarpSize granularity due to the collective nature of the cross-lane builtins. To address this, you can use read_shared_safe to pad overindexing by reading zero_elem. Reading from global remains unaffected.
-// Perform warp reductions and communicate results via shared
-// for (uint32_t ActiveWarps = WarpCount;
-//      ActiveWarps != 0;
-//      ActiveWarps = ActiveWarps != 1 ?
-//          divide_ceil(ActiveWarps, WarpSize) :
-//          ActiveWarps = 0)
-tmp::static_for<
-    WarpCount,
-    tmp::not_equal<0>,
-    tmp::select<
-        tmp::not_equal<1>,
-        tmp::divide_ceil<WarpSize>,
-        tmp::constant<0>>>([&]<uint32_t ActiveWarps>()
-{
-    if (wid < ActiveWarps)
-    {
-        // Warp reduction
-        tmp::static_for<WarpSize / 2, tmp::not_equal<0>, tmp::divide<2>>([&]<int Delta>()
-        {
-            res = op(res, __shfl_down(res, Delta));
-        });
-
-        // Write warp result from local to shared
-        if (lid == 0)
-            shared[wid] = res;
-    }
-    __syncthreads();
-
-    // Read warp result from shared to local
-    res = read_shared_safe(tid);
-});
-
-// Write result from local to back buffer
-if(tid == 0)
- back[bid] = res;
-ActiveWarps iterates from WarpCount until it reaches 0. Each iteration divides ActiveWarps by WarpSize. When the partial result count isn't a multiple of WarpSize, an extra warp is needed for the remainder, so the division uses tmp::divide_ceil, which always rounds toward positive infinity. The ternary tmp::select is required because such a division never reaches 0, so the loop must be terminated explicitly after the last warp concludes.
-In each iteration, if the warp is active, which means it has at least a single valid input, it carries out a pass of warp reduction and writes output based on warp ID. Reading is carried out based on thread ID. Global output continues to be based on block ID.
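The update rule of ActiveWarps can be replayed at run time to see which values the compile-time loop visits. This is a minimal sketch: `divide_ceil` and `active_warps_sequence` are hypothetical run-time helpers that mirror the commented-out for loop in the kernel, not the `tmp` utilities themselves.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Integer division that rounds toward positive infinity.
inline std::uint32_t divide_ceil(std::uint32_t a, std::uint32_t b)
{
    return (a + b - 1) / b;
}

// Records every value ActiveWarps takes before the loop terminates.
// Without the explicit jump from 1 to 0, divide_ceil(1, warp_size) would
// return 1 forever and the loop would never end.
inline std::vector<std::uint32_t> active_warps_sequence(std::uint32_t warp_count,
                                                        std::uint32_t warp_size)
{
    std::vector<std::uint32_t> seq;
    for (std::uint32_t active = warp_count;
         active != 0;
         active = active != 1 ? divide_ceil(active, warp_size) : 0)
    {
        seq.push_back(active);
    }
    return seq;
}
```

For example, a block of 1024 threads with a warp size of 64 has 16 warps, so the loop body is instantiated for ActiveWarps equal to 16 and then 1.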
-The previous sections explained how to reduce register usage to improve occupancy. This allows more blocks to execute in parallel on all multiprocessors, leading to more global store/load latency to be hidden. Reducing the number of kernels in flight while still carrying out the same workload reduces the wastage of registers while loading and maintaining bookkeeping variables such as kernel indices.
-An example of this optimization is performing one binary op while loading input from global. Even though the operation is said to be carried out 'in flight', the two values are loaded into local memory (registers) before op is called.
-A more general form of this optimization is wrapping most kernel logic in loops that carry out the workload of multiple kernel instances but require storing only a single instance of most of the bookkeeping logic. In code, this multiplicity factor is referred to via the ItemsPerThread compile-time constant, which is supplied by a template argument to allow for loop unrolling.
-This kernel variant utilizes another generally applicable utility known as hip::static_array, which is a more restrictive wrapper over the builtin array than std::array, as it allows indexing only with compile-time constants using the usual tuple-like template <size_t I> auto get<I>(...) interface.
-Note: On a GPU, there is no stack, and the local memory is provisioned from the register file. This provisioning takes place statically. To paraphrase, the address range of a thread's local memory is determined at compile-time. When an array is defined and used in the local storage, the compiler can only maintain its storage in the register file as long as all accesses to the array are computable by the compiler at compile-time. The index doesn't need to be a compile-time constant as long as the compiler can resolve the addresses of the accesses through constant folding or some other means. If the compiler fails to do so, the array will be backed by global memory, which is indicated by a non-zero number of spill registers observable using static analysis tools. Such accesses are slower by multiple orders of magnitude. hip::static_array via its hip::get<> interface ensures that no such spills occur.
-**Following code does:** This code snippet is a preprocessor directive used in C/C++ programming, specifically when working with the HIP (Heterogeneous-Compute Interface for Portability) framework. The `#ifdef __HIP_PLATFORM_AMD__` checks if the macro `__HIP_PLATFORM_AMD__` is defined, which indicates that the code is being compiled for AMD platforms using HIP-Clang. If this condition is true, the comment `// Compiled with HIP-Clang` is included in the code. This is typically used to conditionally compile code specific to AMD hardware when using the HIP framework.
-**Following code does:** This code snippet is a preprocessor directive used in a C/C++ program to conditionally compile code based on the target platform. Specifically, it checks if the code is being compiled for an NVIDIA platform using the HIP (Heterogeneous-Compute Interface for Portability) API. If the `__HIP_PLATFORM_NVIDIA__` macro is defined, it indicates that the code is being compiled with NVIDIA's CUDA compiler (`nvcc`). The comments suggest that the code could be using CUDA language extensions or be in a pass-through mode to an underlying host compiler, depending on the file type or compilation settings.
-
-
-
- template<uint32_t BlockSize, uint32_t WarpSize, uint32_t ItemsPerThread>
- __global__ static __launch_bounds__(BlockSize) void kernel(...)
-The kernel now has three compile-time configurable parameters. The only part of the kernel that changes is how data is loaded from global memory and how the binary operation is applied to the loaded values. The step that previously read input from the front buffer into local memory is now split into two steps: reading ItemsPerThread items and processing ItemsPerThread items.
-**Following code does:** This code snippet is a preprocessor directive used in C/C++ programming to check if the code is being compiled with NVIDIA's CUDA Compiler (nvcc). The `#ifdef __CUDACC__` checks if the `__CUDACC__` macro is defined, which indicates that the CUDA language extensions are enabled. This is typically used to conditionally include or exclude code that is specific to CUDA, allowing the same source file to be compiled with or without CUDA support.
-<_C_>
-The change to reading happens inside read_global_safe :
- auto read_global_safe = [&](const int32_t i) -> hip::static_array<T, ItemsPerThread>
- {
-     return [&]<int32_t... I>(std::integer_sequence<int32_t, I...>)
-     {
-         if(i + ItemsPerThread < front_size)
-             return hip::static_array<T, ItemsPerThread>{
-                 front[i + I]...
-             };
-         else
-             return hip::static_array<T, ItemsPerThread>{
-                 (i + I < front_size ? front[i + I] : zero_elem)...
-             };
-     }(std::make_integer_sequence<int32_t, ItemsPerThread>());
- };
-
-Note that the array elements are loaded consecutively. With ItemsPerThread fixed at 4 instead of being configurable, this is morally equivalent to:
-**Following code does:** This code snippet is a preprocessor directive used in the context of HIP (Heterogeneous-Compute Interface for Portability), which is a C++ runtime API that allows developers to write portable code to run on AMD and NVIDIA GPUs. The line `#if __HIP__DEVICE__COMPILE__` is a conditional compilation directive that checks if the code is being compiled for a GPU device. If the condition is true, the code following this directive will be included in the compilation process for the device. This is typically used to separate code that should only be executed on the GPU from code that runs on the host (CPU).
-T arr[4] = {
- front[gid + 0],
- front[gid + 1],
- front[gid + 2],
- front[gid + 3]
-}
-This is exactly what's happening in the front[i + I]... fold-expression. However, this can only be issued if the entire read operates on real input without padding using zero_elem. If some reads over-index the input, the read turns into:
-**Following code does:** This code snippet is a preprocessor directive used in CUDA programming, which is a parallel computing platform and application programming interface model created by NVIDIA. The directive `#if (__CUDA_ARCH__ >= 130)` checks if the code is being compiled for a CUDA architecture version that is 1.3 or higher. If the condition is true, the code following this directive will be included in the compilation process. This is typically used to ensure that certain code segments are only compiled for specific GPU architectures that support the required features or capabilities.
-T arr[4] = {
- i + 0 < front_size? front[i + 0] : zero_elem,
- i + 1 < front_size? front[i + 1] : zero_elem,
- i + 2 < front_size? front[i + 2] : zero_elem,
- i + 3 < front_size? front[i + 3] : zero_elem
-}
-This makes it easier for the compiler to recognize vector loads from global memory. As the performance at large is dominated by how you move the data, it's only natural to use dedicated instructions that move more data with less binary. This is evident from the huge performance improvement when loading two values per thread. For more information, see the compiler explorer to learn how loading for AMD (both RDNA and CDNA) compiles to global_load_dwordx4, where x4 denotes the 4-vector variant of the instruction.
-Note: read_global_safe, which used to take a uint32_t as the index type, now takes a signed integer. Unsigned arithmetic wraps around on overflow, as the C/C++ standards define, so when indexing an array with unsigned integers the compiler has to account for the possibility that some of the vector load indices wrap, which would result in a non-contiguous read. If you change the previously linked code to use an unsigned integer as the thread ID, the compiler won't emit a vector load. Signed integer overflow is undefined behavior, so the optimizer may assume it never happens. To convey the absence of overflow to the compiler with unsigned indices, add __builtin_assume(gid + 4 > gid), or the more portable [[assume(gid + 4 > gid)]], once amdclang++ supports it.
-The read_global_safe implementation uses an Immediately Invoked Lambda Expression (IILE) because ItemsPerThread is an integer value, while the fold-expressions need a compile-time iota-like sequence of integers as a pack to expand on. Such a pack can only be obtained through template argument deduction on the IILE.
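As a host-side illustration of the guarded read, the following sketch reproduces the same pack expansion in plain C++ so that it compiles without a GPU toolchain. It is not the tutorial's code: it swaps the immediately invoked lambda for a helper function template (which also works before C++20) and replaces hip::static_array with std::array. The names `read_safe` and `read_safe_impl` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical helper: receives the compile-time index pack I = 0, 1, ..., N-1
// through template argument deduction on the integer_sequence parameter.
template<typename T, std::int32_t... I>
std::array<T, sizeof...(I)> read_safe_impl(const T* front,
                                           std::int32_t front_size,
                                           std::int32_t i,
                                           T zero_elem,
                                           std::integer_sequence<std::int32_t, I...>)
{
    constexpr std::int32_t items = static_cast<std::int32_t>(sizeof...(I));
    // Fast path: the whole read lies inside the input, so no per-element guard is needed.
    if(i + items < front_size)
        return std::array<T, sizeof...(I)>{{front[i + I]...}};
    // Slow path: pad every out-of-range element with zero_elem.
    return std::array<T, sizeof...(I)>{{(i + I < front_size ? front[i + I] : zero_elem)...}};
}

// Hypothetical entry point: turns the integer ItemsPerThread into an index pack.
template<std::int32_t ItemsPerThread, typename T>
std::array<T, ItemsPerThread> read_safe(const T* front,
                                        std::int32_t front_size,
                                        std::int32_t i,
                                        T zero_elem)
{
    static_assert(ItemsPerThread > 0, "at least one item per read is required");
    assert(i >= 0 && front_size >= 0);
    return read_safe_impl(front, front_size, i, zero_elem,
                          std::make_integer_sequence<std::int32_t, ItemsPerThread>());
}
```

Calling `read_safe<4>(data, 6, 4, 0)` on a six-element array takes the padded path, because elements 6 and 7 would over-index the input.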
-Once the kernel reads ItemsPerThread number of inputs to local, it immediately reduces them to a scalar. There is no reason to propagate the input element multiplicity to the warp reduction phase.
-**Following code does:** This code snippet appears to be a comment rather than executable code. It suggests that the programming environment or language being used supports the use of "doubles," which typically refers to double-precision floating-point numbers. This comment might be indicating that the code or system can handle numerical data types that require more precision than single-precision floating-point numbers.
-Alter kernel launch and input fetching such that no more blocks are launched than what a subsequent kernel launch's single block can conveniently reduce, while performing multiple passes of input reading from global and combining their results before engaging in the end game tree-like reduction.
-With this method, you can save at least one to two kernel launches for large inputs.
-Warning: This modification can only be executed on AMD hardware.
-Perform the first step of the two-pass reduction, but at the end, instead of writing to global memory and reading it back in a subsequent kernel, write the partial results to the Global Data Share (GDS). This is an (N+1)th shared memory that all multiprocessors can access and that is also on-chip.
-Note: The API doesn't guarantee the order in which blocks are scheduled even though all GPUs schedule them in the same monotonically increasing order of block ids. Relying on this implicitly, the last block of a grid is in the optimal position to observe the side effects of all other blocks (using spinlocks or other methods) without occupying a multiprocessor for longer than necessary.
-Without launching a second kernel, you can make the last block collect the results of all other blocks from GDS by implicitly exploiting the scheduling behavior or relying on another AMD-specific feature called Global Wave Sync (GWS) to merge them for a final tree-like reduction.
-Note: GDS and GWS are reserved runtime features that the HIP API doesn't cover. Invoking these functionalities requires inline AMDGCN assembly. Moreover, the fact that the runtime doesn't virtualize the GDS, imposes further restrictions on concurrent scheduling of other kernels.
-Optimizing code on GPUs, like on any other architecture, requires careful consideration and balancing of resources and costs of various operations to obtain optimal performance. This document explored optimizing reductions much beyond the territory of diminishing returns. This approach introduced multiple optimization techniques and discussed opportunities.
-The document focused on reductions in which an entire device participates. The chosen compile-time constants, or even the algorithm itself, might not be optimal when multiple blocks participate in multiple parallel reductions or when each thread performs its own reduction. When multiple devices participate in the same reduction, other aspects must be considered.
-Most solutions, including the ones covered in this document, are given to the end users in a turnkey fashion via algorithm primitive libraries. These solutions might not be the fastest in all cases, but they are close to being the gold standard for carrying out certain operations as reasonably as possible.
-This tutorial demonstrates the basic concepts of cooperative groups in the HIP (Heterogeneous-computing Interface for Portability) programming model and the most essential tooling supporting it. This topic also reviews the commonalities of heterogeneous APIs. Familiarity with the C/C++ compilation model and the language is assumed.
-To follow this tutorial, you'll need properly installed drivers and a HIP compiler toolchain to compile your code. Because ROCm HIP supports compiling and running on Linux and Microsoft Windows with AMD and NVIDIA GPUs, review the HIP development package installation before starting this tutorial. For more information, see Install HIP .
-To become familiar with heterogeneous programming, review the SAXPY tutorial and the first HIP code subsection. Compiling is also described in that tutorial.
-You can use tiled partition to calculate the sum of partition_size length sequences and the sum of result_size / BlockSize length sequences. The host-side reference implementation is the following:
-**Following code does:** This code snippet is a conditional check to determine if the hardware architecture supports double-precision floating-point operations. The commented-out line `//#if (__CUDA_ARCH__ >= 130)` is a non-portable way to check for this support in CUDA, which is specific to NVIDIA GPUs. The active line `if __HIP_ARCH_HAS_DOUBLES__` is a portable way to perform a similar check in HIP, which is a framework designed to run on both NVIDIA and AMD GPUs. If the condition is true, it indicates that the architecture supports double-precision operations, and the code within the block can safely use double-precision data types.
-
- // Host-side function to perform the same reductions as executed on the GPU
- std::vector<unsigned int> ref_reduced(const unsigned int partition_size,
-                                       std::vector<unsigned int> input)
- {
-     const unsigned int input_size = input.size();
-     const unsigned int result_size = input_size / partition_size;
-     std::vector<unsigned int> result(result_size);
-
- for(unsigned int i = 0; i < result_size; i++)
- {
- unsigned int partition_result = 0;
- for(unsigned int j = 0; j < partition_size; j++)
- {
- partition_result += input[partition_size * i + j];
- }
-**Following code does:** This code snippet is querying the properties of a GPU device using the HIP (Heterogeneous-Compute Interface for Portability) API. It retrieves the properties of a specified device and checks if the device supports shared 32-bit integer atomic operations. The commented-out line suggests an alternative, non-portable way of checking device capabilities based on major and minor version numbers, but the actual code uses a more portable method by directly querying the `hasSharedInt32Atomics` feature of the device architecture. If the device supports this feature, the code within the conditional block (not shown) would execute.
- result[i] = partition_result;
- }
-
- return result;
- }
-To calculate the sum of each set of numbers, the tutorial uses a shared-memory-based reduction on the device side. Unlike the reduction tutorial, this tutorial does not cover warp-level intrinsics. The x input is a pointer to shared memory, which needs to be synchronized after every value change. The thread_group input parameter can be a thread_block_tile or a thread_block because thread_group is the parent class of these types. The val input is the value that each thread contributes to the sum. The function returns the final result of the reduction on thread ID 0 of the thread_group; on every other thread, it returns 0.
-**Following code does:** The code snippet is a command-line instruction that uses the `hipconfig` tool with the `--cxx_config` option. `hipconfig` is a utility associated with the HIP (Heterogeneous-Compute Interface for Portability) framework, which is used for developing applications that can run on both AMD and NVIDIA GPUs. The `--cxx_config` option specifically retrieves and displays the C++ compiler configuration settings used by HIP. This information is useful for developers to understand or verify the compiler settings being applied in their HIP-based projects.
-
- /// \brief Summation of `unsigned int val`s in `thread_group g` using shared memory `x`
- __device__ unsigned int reduce_sum(thread_group g, unsigned int* x, unsigned int val)
- {
- // Rank of this thread in the group
- const unsigned int group_thread_id = g.thread_rank();
-
- // We start with half the group size as active threads
- // Every iteration the number of active threads halves, until we processed all values
- for(unsigned int i = g.size() / 2; i > 0; i /= 2)
- {
- // Store value for this thread in a shared, temporary array
- x[group_thread_id] = val;
-
- // Synchronize all threads in the group
- g.sync();
-
- // If our thread is still active, sum with its counterpart in the other half
- if(group_thread_id < i)
- {
- val += x[group_thread_id + i];
- }
-
- // Synchronize all threads in the group
- g.sync();
- }
-
- // Only the first thread returns a valid value
- if(g.thread_rank() == 0)
- return val;
- else
- return 0;
- }
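To make the halving loop easier to follow, here is a host-side emulation of the same steps in plain C++. It is only a sketch: the function name `simulate_reduce_sum` is hypothetical, the vector `x` stands in for the shared workspace, and copying `vals` into `x` plays the role of every thread storing its value followed by `g.sync()`. The sketch assumes a power-of-two group size, because with any other size the halving loop skips some elements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side emulation of reduce_sum: vals[t] is the value held by
// the thread with rank t, and x mimics the shared workspace array.
unsigned int simulate_reduce_sum(std::vector<unsigned int> vals)
{
    const std::size_t n = vals.size();
    // Assumption of this sketch: the group size is a non-zero power of two.
    assert(n > 0 && (n & (n - 1)) == 0);

    std::vector<unsigned int> x(n);
    // Start with half the group as active threads and halve every iteration.
    for(std::size_t i = n / 2; i > 0; i /= 2)
    {
        // Every thread stores its value, then the group synchronizes.
        x = vals;
        // Each active thread adds the value of its counterpart in the other half.
        for(std::size_t t = 0; t < i; ++t)
        {
            vals[t] += x[t + i];
        }
    }
    // Only rank 0 holds the complete sum.
    return vals[0];
}
```

For eight inputs, the loop runs with 4, 2, and 1 active threads, so rank 0 holds the full sum after three iterations.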
-
-The reduce_sum device function is reused to calculate the block and custom partition sum of the input numbers. The kernel has three sections:
-
-1. Initialization of the reduction function variables.
-In this code section, the shared memory is declared, the thread_block_group and custom_partition are defined, and the input variables are loaded from global memory.
-**Following code does:** The code snippet appears to be a fragment of a command or configuration related to compiling or building software that uses the HIP (Heterogeneous-Compute Interface for Portability) platform. Specifically, it includes a compiler definition `-D___HIP_PLATFORM_AMD___`, which indicates that the target platform is AMD, and an include path `-I/home/user1/hip/include`, which specifies where to find the HIP header files during compilation. This setup is typically used in environments where code is being prepared to run on AMD GPUs using HIP.
-
-// threadBlockGroup consists of all threads in the block
-thread_block thread_block_group = this_thread_block();
-
-// Workspace array in shared memory required for reduction
-__shared__ unsigned int workspace[2048];
-
-unsigned int output;
-
-// Input to reduce
-const unsigned int input = d_vector[thread_block_group.thread_rank()];
-
-//...
-
-// Every custom_partition group consists of 16 threads
-thread_block_tile<PartitionSize> custom_partition
-    = tiled_partition<PartitionSize>(thread_block_group);
-In this code section, the sum is calculated on thread_block_group level, then the results are stored in global memory.
-**Following code does:** This code snippet is a Makefile command that appends additional preprocessor flags to the `CPPFLAGS` variable. It uses the `hipconfig` tool, which is part of the HIP (Heterogeneous-Compute Interface for Portability) framework, to generate the necessary preprocessor configuration flags for compiling HIP code. The `$(shell ...)` function executes the `hipconfig --cpp_config` command and captures its output, which is then added to `CPPFLAGS`. This setup is typically used to ensure that the correct compiler flags are used when building applications that utilize HIP for GPU programming.
-// Perform reduction
-output = reduce_sum(thread_block_group, workspace, input);
-
-// Only the first thread returns a valid value
-if(thread_block_group.thread_rank() == 0)
-{
- d_block_reduced_vector[0] = output;
-}
-In this code section, the sum is calculated on the custom partition level, then the results are stored in global memory. The custom partition is a subset of the thread block, so the reduction operates on a shorter sequence of input numbers than in the thread_block_group case.
-**Following code does:** The code snippet you provided appears to be incomplete or malformed, as it only contains a closing parenthesis `)`. Without additional context or surrounding code, it is not possible to determine its purpose or functionality.
-// Perform reduction
-output = reduce_sum(custom_partition, &workspace[group_offset], input);
-
-// Only the first thread in each partition returns a valid value
-if(custom_partition.thread_rank() == 0)
-{
-    const unsigned int partition_id = thread_block_group.thread_rank() / PartitionSize;
-    d_partition_reduced_vector[partition_id] = output;
-}
-**Following code does:** The code snippet `<_SQL_>` appears to be a placeholder or a tag indicating that the actual SQL code is not provided. In this context, it suggests that the code is related to SQL (Structured Query Language), which is used for managing and manipulating relational databases. Without the actual SQL code, it's not possible to determine the specific operations or queries being performed. The placeholder might be used in documentation, templates, or code generation tools to signify where SQL code should be inserted or processed.
-On the host-side, the following steps are done in the example:
-Only the first, second and fourth steps are relevant to cooperative groups, so only those steps are detailed further.
-Not all AMD GPUs support cooperative groups. You can confirm support with the following code:
-**Following code does:** It seems there is a formatting error in your request. The code snippet is labeled as Python, but it contains a placeholder that suggests it should be C++ code. Please provide the correct code snippet or clarify the language so I can assist you accurately.
-<_C++_>
-In the example, there is only one block in the grid, and threads_per_block must be divisible by partition_size.
-**Following code does:** This code snippet is part of a GPU programming workflow using HIP, a C++ runtime API and kernel language that allows developers to write portable code for AMD and NVIDIA GPUs. The code performs the following high-level tasks: - -1. **Initialization**: It initializes two arrays, `A` and `B`, where `A` is filled with negative indices and `B` is initialized to zero. - -2. **Memory Allocation**: It allocates memory on the GPU for the array `Ad` using `hipMalloc`. - -3. **Data Transfer to GPU**: It copies the contents of array `A` from the host (CPU) to a symbol on the device (GPU) using `hipMemcpyToSymbol`. - -4. **Kernel Launch**: It launches a GPU kernel named `Get` with a specific grid and block configuration to perform operations on the data stored in `Ad`. - -5. **Data Transfer to Host**: It copies the results from the device array `Ad` back to the host array `B` using `hipMemcpy`. - -6. **Validation**: It checks if the contents of arrays `A` and `B` are equal, asserting that the GPU computation was performed correctly. - -7. **Output**: If the assertion passes, it prints "Passed" to indicate successful execution and validation of the GPU operations. - -Overall, this code tests the correctness of a GPU computation by comparing the results with expected values.
-<_C_>
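The divisibility requirement and the per-partition indexing can be sketched on the host as follows. This is a hedged illustration, not the example's code: `PartitionSlot`, `valid_partitioning`, and `locate_thread` are hypothetical names, and `group_offset` is assumed to be the block rank of the first thread in a partition, since the excerpt uses it without showing its definition.

```cpp
#include <cassert>

// Hypothetical description of where a thread sits inside the tiled block.
struct PartitionSlot
{
    unsigned int partition_id;      // which tile the thread belongs to
    unsigned int rank_in_partition; // rank of the thread inside its tile
    unsigned int group_offset;      // assumed: block rank of the tile's first thread
};

// The block can be tiled evenly only if the partition size divides the block size.
inline bool valid_partitioning(unsigned int threads_per_block, unsigned int partition_size)
{
    return partition_size != 0 && threads_per_block % partition_size == 0;
}

// Maps a thread's rank in the block to its tile, its rank in the tile,
// and the offset of that tile's slice of the shared workspace.
inline PartitionSlot locate_thread(unsigned int block_rank, unsigned int partition_size)
{
    assert(partition_size != 0);
    const unsigned int rank_in_partition = block_rank % partition_size;
    return PartitionSlot{block_rank / partition_size,
                         rank_in_partition,
                         block_rank - rank_in_partition};
}
```

With 64 threads per block and 16-thread partitions, the thread with block rank 37 lands in partition 2 with rank 5, and its partition's workspace slice starts at index 32.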
-The kernel launch is done with the hipLaunchCooperativeKernel of the cooperative groups API.
-**Following code does:** This code snippet demonstrates the allocation of memory on both the device (GPU) and the host (CPU) using HIP, a C++ runtime API for GPU programming. It first allocates memory on the device for a double pointer using `hipMalloc` and retrieves its attributes with `hipPointerGetAttributes`, which would indicate that the memory type is `hipMemoryTypeDevice`. Then, it allocates memory on the host using `hipHostMalloc` for another double pointer and again retrieves its attributes, which would indicate that the memory type is `hipMemoryTypeHost`. The comments suggest that the code is checking the type of memory allocated (device vs. host). The last line seems to be incomplete and unrelated to the memory operations shown.
- void* params[] = {&d_vector, &d_block_reduced, &d_partition_reduced};
- // Launching kernel from host.
- HIP_CHECK(hipLaunchCooperativeKernel(vector_reduce_kernel,
-With cooperative groups, you can easily use custom partitions to create custom tiles for custom solutions. You can find the complete code at cooperative groups ROCm example.
-Copyright © 2008 - 2024 Advanced Micro Devices, Inc.
-Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the 'Software'), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
-The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
-THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
-**Following table contains:** The table appears to represent a list of functions or operations, possibly related to a programming or computational context, given the mention of terms like "Atomic" and "Warp cross-lane." Each row seems to describe a specific function or category of functions. - -- **Rows**: Each row represents a specific function or a sub-category of functions, possibly organized by a version or section number (e.g., 19.9, 19.10, etc.). - -- **Columns**: - - **Column 0**: This column seems to contain version or section numbers, which might indicate the order or hierarchy of the functions listed. - - **Column 1**: This column provides a brief descriptor or category name for the function (e.g., Math, Texture, Surface). - - **Column 2**: This column contains a more detailed description of the function or operation, although it is often abbreviated or truncated with ellipses. - - **Column 3**: This column appears to contain numerical identifiers or codes associated with each function or category. - -- **Noteworthy Values**: - - The entry "19.13.1" in Column 0 suggests a sub-section under "Atomic" functions, indicating a more detailed breakdown within that category. - - The description in Column 2 for "19.13.1" mentions "Unsafe floating-point atomic RMW operations," which could be a specific type of operation that warrants caution or special handling. - - The numerical values in Column 3 are sequential but not strictly incremental, suggesting they might be identifiers rather than simple counts. - -Overall, the table seems to be a structured list of functions or operations, possibly from a technical manual or documentation, organized by categories and sub-categories.
-| hipMemPoolExportPointer hipMemPoolExportToShareableHandle ( C++ | func- |
| tion ), 160 | |
| hipMemPoolGetAccess ( C++ function ), 158 hipMemPoolGetAttribute ( C++ function ), 156 | |
| hipMemPoolImportFromShareableHandle function ), 161 | ( C++ |
| hipMemPoolImportPointer ( C++ function ), 162 hipMemPoolSetAccess ( C++ function ), 157 | |
| hipMemPoolSetAttribute ( C++ function ), 156 | |
| hipMemPoolTrimTo ( C++ function ), 155 | |
| hipMemPrefetchAsync ( C++ function ), 247 | |
| hipMemPtrGetInfo ( C++ function ), 182 | |
| hipMemRangeGetAttribute ( C++ function ), 248 | |
| hipMemRangeGetAttributes ( C++ function ), 248 | |
| hipMemRelease ( C++ function ), 255 | |
| hipMemRetainAllocationHandle ( C++function ), 255 | |
| hipMemset ( C++ function ), 179 | |
| hipMemset2D ( C++ function ), 181 | |
| hipMemset2DAsync ( C++ function ), 181 | |
| hipMemset3D ( C++ function ), 181 | |
| hipMemset3DAsync ( C++ function ), 182 | |
| hipMemSetAccess ( C++ function ), 255 | |
| hipMemsetAsync ( C++ function ), 180 | |
| hipMemsetD16 ( C++ function ), 180 hipMemsetD16Async ( C++ function ), 180 | |
| hipMemsetD32 ( C++ function ), 180 | |
| hipMemsetD32Async ( C++ function ), 181 | |
| hipMemsetD8 ( C++ function ), 179 | |
| hipMemsetD8Async ( C++ function ), 179 | |
| hipMemUnmap ( C++ function ), 256 | |
| hipModuleGetGlobal ( C++ function ), 176 | |
| hipPointerGetAttribute ( C++ function ), 165 | |
| hipPointerGetAttributes ( C++ function ), 165 | |
| hipPointerSetAttribute ( C++ function ), 165 | |
| hipSignalExternalSemaphoresAsync ( C++ function ), 195 | |
| hipStreamAddCallback ( C++ function ), 152 hipStreamAttachMemAsync ( C++ function ), 249 | |
| hipStreamCallback_t ( C++ type ), 147 | |
| hipStreamCreate ( C++ function ), 147 | |
| hipStreamCreateWithFlags ( C++ function ), 147 | |
| hipStreamCreateWithPriority ( C++ function ), 147 hipStreamDestroy ( C++ function ), 148 | |
| hipStreamGetDevice ( C++ function ), 151 | |
| hipStreamGetFlags ( C++ function ), 150 | |
| hipStreamGetPriority ( C++ function ), 150 hipStreamQuery ( C++ function ), 149 | |
| hipStreamSynchronize ( C++ function ), 149 | |
| hipStreamWaitEvent ( C++ function ), 149 | |
| hipWaitExternalSemaphoresAsync ( C++ function ), 195 | |
| hsa_amd_vmem_address_free ( C++ function ), 241 | |
| hsa_amd_vmem_address_reserve ( C++ function ), 241 | |
**Following table contains:** The table appears to represent a structured outline or index of a document, possibly a technical manual or guide related to computing or programming. Each row corresponds to a specific section or subsection of the document. - -- **Column 0**: This column seems to represent the section or subsection number, indicating the hierarchical structure of the document. -- **Column 1**: This column contains the main title or heading of each section or subsection. -- **Column 2**: This column provides a brief description or additional details about the content of the section. -- **Column 3**: This column likely represents a page number or reference number where the section can be found in the document. - -Noteworthy values include: -- The presence of both main sections (e.g., "21.4 Floating-point Intrinsics") and subsections (e.g., "23.1 Cooperative kernel launches"), indicating a detailed breakdown of topics. -- The use of ellipses in the descriptions suggests that these are truncated or summarized titles, possibly indicating longer, more detailed content in the actual document. -- The section "22 Table, comparing syntax for different compute APIs" suggests a comparative analysis, which might be a key part of the document for readers interested in different computing APIs.
-| hsa_amd_vmem_export_shareable_handle ( C++ function ), 244 | |
| hsa_amd_vmem_get_access ( C++ function ), 243 | |
| hsa_amd_vmem_get_alloc_properties_from_handle ( C++ function ), 245 | |
| hsa_amd_vmem_handle_create ( C++ function ), 242 | |
| hsa_amd_vmem_handle_release ( C++ function ), 242 | |
| hsa_amd_vmem_import_shareable_handle ( C++ function ), 244 | |
| hsa_amd_vmem_map ( C++ function ), 242 | |
| hsa_amd_vmem_retain_alloc_handle ( C++ function ), 245 | |
| hsa_amd_vmem_set_access ( C++ function ), 243 | |
| hsa_amd_vmem_unmap ( C++ function ), 243 | |
**Following table contains:** The table appears to represent a structured outline or index of a document, possibly a technical manual or report. Each row corresponds to a section or subsection of the document, with the columns providing different pieces of information about each section. - -- **Column 0**: This column seems to contain section numbers, indicating the hierarchical structure of the document. For example, "30.3.1.3" and "30.3.2" suggest subsections within a larger section 30. - -- **Column 1**: This column contains the titles or descriptions of the sections. These descriptions provide a brief overview of the content covered in each section, such as "The reduction of custom partition" and "Host-side code." - -- **Column 2**: This column likely represents page numbers where each section begins, helping readers locate the sections within the document. For instance, section "30.3.1.3" starts on page 291, and "31 License" starts on page 295. - -Noteworthy values include the presence of detailed subsections under "30.3.2," indicating a focus on cooperative group support and configuration on AMDGPUs, and the "31 License" section, which might contain legal or usage information. The document seems to be technical, possibly related to programming or hardware configuration.
-| surf1DLayeredread |
| surf1DLayeredwrite ( C++ function ), 135 |
| surf1Dread ( C++ function ), 133 |
| surf1Dwrite ( C++ function ), 133 |
| surf2DLayeredread ( C++ function ), 135 |
| surf2DLayeredwrite ( C++ function ), 135 |
| surf2Dread ( C++ function ), 134 |
| surf2Dwrite ( C++ function ), 134 |
| surf3Dread ( C++ function ), 134 |
| surf3Dwrite ( C++ function ), 134 |
| surfCubemapLayeredread ( C++ function ), 136 |
| surfCubemapLayeredwrite ( C++ function ), 137 |
| surfCubemapread ( C++ function ), 136 |
| surfCubemapwrite ( C++ function ), 136 |
USE_PEER_NON_UNIFIED ( C macro ), 164
-`), headings (`
-
-
-
-
-
-
-
-
-
-
-
-
-
-```
-
-## 12.2.1 Debugging HIP applications
-
-The following Linux example shows how to get useful information from the debugger while running a simple memory copy test, which caused a segmentation fault issue.
-**Following code does:** This code snippet demonstrates a simple example of using HIP (Heterogeneous-Compute Interface for Portability) to perform a basic addition operation on a GPU. It allocates managed memory for three integers (`a`, `b`, and `c`) that can be accessed by both the host (CPU) and the device (GPU). The `add` kernel function is launched on the GPU to compute the sum of `a` and `b`, storing the result in `c`. After synchronizing the device to ensure the computation is complete, the code queries a memory range attribute (`hipMemRangeAttributeReadMostly`) for the memory range pointed to by `a` and stores the result in `attributeValue`. Finally, it prints the result of the addition. The code illustrates basic memory management and kernel execution in a HIP environment.
-
-
-```
-
-
-
- Advanced Micro Devices, Inc. Sep 13, 2024

The Heterogeneous-computing Interface for Portability (HIP) API is a C++ runtime API and kernel language that lets developers create portable applications for AMD and NVIDIA GPUs from a single source code. HIP supports AMD GPUs on multiple operating systems as well as CUDA-enabled NVIDIA GPUs; for the latter, see GPU Compute Capability.

On the AMD ROCm platform, HIP provides header files and a runtime library built on top of the HIP-Clang compiler in the Common Language Runtimes (CLR) repository, which contains the source code for AMD's compute language runtimes. On non-AMD platforms, like NVIDIA, HIP provides the header files required to support non-AMD back-end implementations in the 'hipother' repository, which translates HIP runtime APIs to CUDA runtime APIs.

Known issues are listed on the HIP GitHub repository. To contribute features or functions to the HIP project, refer to Contributing to HIP. To contribute to the documentation, refer to the Contributing to ROCm docs page. You can find licensing information on the Licensing page.

HIP can be installed on AMD (ROCm with HIP-Clang) and NVIDIA (CUDA with NVCC) platforms.

Note: The version definition for the HIP runtime is different from CUDA.
On an AMD platform, the hipRuntimeGetVersion function returns the HIP runtime version; on an NVIDIA platform, this function returns the CUDA runtime version.

For prerequisites, refer to the Prerequisites section in the ROCm install guides (AMD) or check the system requirements in the NVIDIA CUDA Installation Guide (NVIDIA).

HIP is automatically installed during the ROCm installation; if you haven't yet installed ROCm, follow the ROCm installation instructions first. By default, HIP is installed into /opt/rocm/hip.

Note: There is no autodetection for the HIP installation. If you choose to install it somewhere other than the default location, you must set the HIP_PATH environment variable as explained in Build HIP from source.

Install the drivers with `sudo apt-get install ubuntu-drivers-common && sudo ubuntu-drivers autoinstall`, followed by `sudo reboot`. Alternatively, you can download the latest CUDA Toolkit.

You can optionally add /opt/rocm/bin to your path, which can make it easier to use the tools. Run hipconfig in your installation path: `/opt/rocm/bin/hipconfig --full`.

HIP code can be developed either on the AMD ROCm platform using the HIP-Clang compiler, or on a CUDA platform with nvcc installed. Before building and running HIP, make sure drivers and prebuilt packages are installed properly on the platform.

You also need to install Python 3, which includes the CppHeaderParser package: install Python 3 first, then check for and install the CppHeaderParser package. Set the repository branch using the ROCM_BRANCH variable, choosing the value that matches your release (for example, ROCm 6.1).

Note: Starting in ROCm 5.6, CLR is a new repository that includes the former ROCclr, HIPAMD and OpenCL repositories. OpenCL provides headers that the ROCclr runtime depends on.

Note: Starting in ROCm 6.1, a new repository, hipother, is added to ROCm, branched out from HIP. hipother provides the files required to support the HIP back-end implementation on some non-AMD platforms, like NVIDIA.

The CLR (Common Language Runtime) repository includes ROCclr, HIPAMD and OpenCL.
ROCclr (Radeon Open Compute Common Language Runtime) is a virtual device interface which is defined on the AMD platform. The HIP runtime uses ROCclr to interact with different backends. HIPAMD provides the implementation specifically for HIP on the AMD platform. OpenCL provides headers that the ROCclr runtime currently depends on. hipother provides headers and implementation specifically for non-AMD HIP platforms, like NVIDIA.

Note: If you don't specify CMAKE_INSTALL_PREFIX, the HIP runtime is installed at <ROCM_PATH>/hip. By default, the release version of HIP is built. If you need the debug version, pass the option CMAKE_BUILD_TYPE=Debug on the command line.

Default paths and environment variables:

• HSA is in <ROCM_PATH>/hsa. This can be overridden by setting the HSA_PATH environment variable.

After you run the make install command, make sure HIP_PATH points to $PWD/install/hip.

When you add or change a HIP API, you may need to generate a new hip_prof_str.h header. This header is used by ROCm tools, such as rocprofiler and roctracer, to track HIP APIs. To generate the header after your change, use the hip_prof_gen.py tool located in hipamd/src.

The commands to build HIP tests on an NVIDIA platform are the same as on an AMD platform. However, you must first set -DHIP_PLATFORM=nvidia.

After installing and building HIP, you can compile and run your application. A simple example is the square sample.

The HIP programming model makes it easy to map data-parallel C/C++ algorithms to massively parallel, wide single instruction, multiple data (SIMD) architectures, such as GPUs. While the model may be expressed in most imperative languages (for example, Python via PyHIP), this document focuses on the original C/C++ API of HIP.

A basic understanding of the underlying device architecture helps you make efficient use of HIP and general purpose graphics processing unit (GPGPU) programming in general.
GPUs in general are made up of basic building blocks called compute units (CUs), which execute the threads of a kernel. These CUs provide the necessary resources for the threads: the arithmetic logic units (ALUs), register files, caches and shared memory for efficient communication between the threads. This design allows for efficient execution of kernels while also being able to scale from small GPUs embedded in APUs with few CUs up to GPUs designed for data centers with hundreds of CUs. The figures "Block Diagram of an RDNA3 Compute Unit" and "Block Diagram of a CDNA3 Compute Unit" show examples of such compute units. For architecture details, check Hardware implementation.

The HIP programming model assumes two execution contexts. One is referred to as the host, while compute kernels execute on a device. These contexts have different capabilities, therefore slightly different rules apply. The host execution is defined by the C++ abstract machine, while device execution follows the SIMT model of HIP. These execution contexts are signified in code by the __host__ and __device__ decorators.

Note: HIP does perform implicit synchronization on occasions, in contrast to other APIs such as OpenCL or SYCL, in which the responsibility of synchronization mostly depends on the user.

The SIMT programming model behind the HIP device-side execution is a middle-ground between SMT (Simultaneous Multi-Threading) programming known from multicore CPUs, and SIMD (Single Instruction, Multiple Data) programming mostly known from exploiting relevant instruction sets on CPUs (for example SSE/AVX/Neon). A HIP device compiler maps SIMT code written in HIP C++ to an inherently SIMD architecture (like GPUs).
This is done by scalarizing the entire kernel and issuing the scalar instructions of multiple kernel instances (called threads) to each of the SIMD engine lanes, rather than exploiting data parallelism within a single instance of a kernel and spreading identical instructions over the available SIMD engines.

Consider a kernel in which the incoming four-vector of floating-point values b is multiplied by a scalar and then added element-wise to the four-vector of floating-point values a. On modern SIMD-capable architectures, the four-vector ops are expected to compile to a single SIMD instruction. However, GPU execution of this kernel will typically break down the vector elements into 4 separate threads for parallel execution, as seen in the following figure:

Fig. 3: Instruction flow of the sample SIMT program.

In HIP, lanes of the SIMD architecture are fed by mapping threads of a SIMT execution, one thread down each lane of a SIMD engine. Execution parallelism usually isn't exploited from the width of the built-in vector types, but across multiple threads via the thread ID constants threadIdx.x, blockIdx.x, etc.

The SIMT nature of HIP is captured by the ability to execute user-provided device programs, expressed as single-source C/C++ functions or sources compiled online/offline to binaries, in bulk.

All threads of a kernel are uniquely identified by a set of integral values, called thread IDs. The set of integers identifying a thread relates to the hierarchy in which the threads execute. The thread hierarchy inherent to how AMD GPUs operate is depicted in the following figure.

Fig. 4: Hierarchy of thread groups.

The innermost grouping of threads is called a warp, or a wavefront in ISA terms. A warp is the most tightly coupled group of threads, both physically and logically. Threads inside a warp are also called lanes, and the integral value identifying them is the lane ID.

Tip: Lane IDs aren't queried like other thread IDs, but are user-calculated.
As a consequence, they are only as multidimensional as the user interprets the calculated values to be.

The size of a warp is architecture dependent and always fixed. For AMD GPUs the wavefront is typically 64 threads, though sometimes 32 threads. Warps are signified by the set of communication primitives at their disposal, as discussed in Warp cross-lane functions.

The middle grouping is called a block or thread block. The defining feature of a block is that all threads in a block share an instance of memory which they may use to share data or synchronize with one another. The size of a block is user-configurable but is limited by the queryable capabilities of the executing hardware. The unique ID of the thread within a block is 3-dimensional as provided by the API. When linearizing thread IDs within a block, assume the 'fast index' to be dimension x, followed by the y and z dimensions.

The outermost grouping is called a grid. A grid manifests as a single dispatch of kernels for execution. The unique ID of each block within a grid is 3-dimensional, as provided by the API, and is queryable by every thread within the block.

The Cooperative groups API introduces new APIs to launch, group, subdivide, synchronize and identify threads, as well as some predefined group-collective algorithms, but most importantly a matching threading model to think in terms of. It relaxes some restrictions of the Inherent thread model imposed by the strict 1:1 mapping of architectural details to the programming model. Cooperative groups let you define your own set of thread groups which may fit your use cases better than the defaults defined by the hardware.

Note: The implicit groups defined by kernel launch parameters are still available when working with cooperative groups. For further information, see Cooperative groups.

The hierarchy of threads introduced by the Inherent thread model is induced by the memory subsystem of GPUs.
The following figure summarizes the memory namespaces and how they relate to the various levels of the threading model.

Fig. 5: Memory hierarchy.

**Local (per-thread) memory:** Read-write storage only visible to the threads defining the given variables, also called per-thread memory. The size of a block for a given kernel, and thereby the number of concurrent warps, is limited by local memory usage. This relates to an important aspect: occupancy. This is the default memory namespace.

**Shared memory:** Read-write storage visible to all the threads in a given block.

**Global memory:** Read-write storage visible to all threads in a given grid. There are specialized versions of global memory with different usage semantics which are typically backed by the same hardware storing global.

**Constant memory:** Read-only storage visible to all threads in a given grid. It is a limited segment of global with queryable size.

**Texture memory:** Read-only storage visible to all threads in a given grid and accessible through additional APIs.

**Surface memory:** A read-write version of texture memory.

HIP programs consist of two distinct scopes.

Note: HIP does not present two separate APIs like NVIDIA CUDA. HIP only extends the HIP runtime API with new APIs for hipModule and hipCtx.

The part of the host-side API which deals with device management and queries is synchronous. All asynchronous APIs, such as kernel execution, data movement and potentially data allocation/freeing, happen in the context of device streams.

Streams are FIFO buffers of commands to execute relating to a given device. Commands which enqueue tasks on a stream all return promptly and the command is executed asynchronously. All side effects of a command on a stream are visible to all subsequent commands on the same stream. Multiple streams may point to the same device and those streams may be fed from multiple concurrent host-side threads. Execution on multiple streams may be concurrent but isn't required to be.
Asynchronous APIs involving a stream all return a stream event which may be used to synchronize the execution of multiple streams. A user may enqueue a barrier onto a stream referencing an event. The barrier will block until the command related to the event completes, at which point all side effects of the command shall be visible to commands following the barrier, even if those side effects manifest on different devices.

Streams also support executing user-defined functions as callbacks on the host. The stream will not launch subsequent commands until the callback completes.

Kernels may be launched in multiple ways, all with different syntaxes and intended use cases.

Tip: This name by default is a macro expanding to triple-chevron. In cases where language syntax extensions are undesirable, or when launching templated and/or overloaded kernel functions, define the HIP_TEMPLATE_KERNEL_LAUNCH preprocessor macro before including the HIP headers to turn it into a templated function.

Caution: These APIs are intended to be used/generated by tools such as the HIP compiler itself and are not intended for end-user code. Should you be writing a tool that has to launch device code using HIP, consider using these over the alternatives.

This chapter describes the typical hardware implementation of GPUs supported by HIP, and how the Inherent thread model maps to the hardware.

The basic building block of a GPU is a compute unit (CU), also known as a streaming multiprocessor (SM) on NVIDIA GPUs. The thread blocks making up a grid are scheduled for execution on CUs. Each block is assigned to an individual CU, and a CU can accommodate several blocks.
Depending on their resource usage, up to thousands of threads can reside on a CU. CUs contain an array of processing elements, referred to as the vector ALU (VALU), that execute the actual instructions of the threads according to the SIMT model, together with the necessary registers and caches.

The threads are executed in groupings called warps. The number of threads making up a warp is architecture dependent. On AMD GPUs the warp size is commonly 64 threads, except in RDNA architectures, which can use a warp size of either 32 or 64. The warp size of supported AMD GPUs is listed in the Accelerator and GPU hardware specifications. NVIDIA GPUs have a warp size of 32.

In contrast to CPUs, GPUs generally do not employ complex cache structures or control logic, like branch prediction or out-of-order execution, but instead rely on massive hardware multithreading to hide latency. Context switching between warps residing on a CU incurs no overhead, as the context for the warps is stored on the CU and does not need to be fetched from memory. If there are not enough free registers to accommodate all warps of a block, the block cannot be scheduled to that CU and has to wait until other blocks finish execution. The number of warps that can reside concurrently on a CU, known as occupancy, is determined by the warp's usage of registers and shared memory.

Fig. 1: An AMD Graphics Core Next (GCN) CU. The CDNA and RDNA CUs are based on variations of the GCN CU.

The following paragraphs describe the basic structure of a CU on AMD GCN GPUs.

A SIMD consists of a VALU, which executes the instructions of a warp, together with a register file, which provides the registers for the warps. The size of the warp is inherently related to the width of the vector ALU of the SIMD. On GCN compute units the width of the VALU is 16, so a warp can be issued to a SIMD every 4 cycles. Since a CU has 4 SIMDs, it issues one warp per cycle. The instructions of a warp are effectively executed in lock-step.
A SIMD always executes the same instruction for the whole VALU. If the control flow of a warp diverges, the performance is decreased, as the results for the threads that do not participate in that branch have to be masked out, and the instructions of the other branch have to be executed in the same way. The best performance can therefore be achieved when thread divergence is kept to a warp level, i.e. when all threads in a warp take the same execution path.

The usage of cache on a GPU differs from that on a CPU, as there is less cache available per thread. Its main purpose is to coalesce memory accesses of the warps in order to reduce the number of accesses to device memory, and to make that memory available to other warps currently residing on the compute unit that also need to load those values.

The local data share is memory that is accessible to all threads within a block. Its latency and bandwidth are comparable to those of the vector cache. It can be used to share memory between the threads in a block, or as a software managed cache.

The scalar unit performs instructions that are uniform within a warp. It thereby improves efficiency and reduces the pressure on the vector ALUs and the vector register file.

In CDNA, the general structure of CUs stays mostly as it is in GCN architectures. The most prominent change is the addition of matrix ALUs, which can greatly improve the performance of algorithms involving matrix multiply-accumulate operations for int8, float16, bfloat16 or float32.

RDNA makes a fundamental change to CU design by changing the size of a warp to 32 threads. This is done by effectively combining two GCN5 SIMDs, creating a VALU of width 32, so that a whole warp can be issued in one cycle. The CU is also replaced by the work group processor (WGP), which encompasses two CUs. For backwards compatibility the WGP can also run in wave64 mode, in which it issues a warp of size 64 in two cycles.
It also adds an extra layer of cache to the WGP, shared by the CUs within it. This cache is referred to as L1 cache, promoting the per-CU cache to an L0 cache.

For hardware implementation's sake, multiple CUs are grouped together into a Shader Engine or Compute Engine, typically sharing some fixed function units or memory subsystem resources.

CLR contains the source code for AMD's compute language runtimes: HIP and OpenCL™. CLR is the part of the HIP runtime which is supported on the AMD ROCm platform; it provides a header and runtime library built on top of the HIP-Clang compiler. For developers and users, CLR implements HIP runtime APIs including streams, events, and memory APIs, as an object library that is linked with the application. The source code for all headers and the library implementation is available on GitHub in the CLR repository.

Please refer to the Quick Start Guide in ROCm Docs. Building CLR requires the rocm-hip-libraries meta package, which provides the prerequisites for CLR. Users can also build OCL and HIP at the same time by passing -DCLR_BUILD_HIP=ON -DCLR_BUILD_OCL=ON to the configure command. For detailed instructions, please refer to build HIP.

hip-tests is a separate repository hosted at hip-tests. To run hip-tests, go to the repository and follow the steps there.

HIP provides release notes in the CLR change log, which records the changes in each release.

hipHostMalloc allocates pinned host memory which is mapped into the address space of all GPUs in the system. The memory can be accessed directly by the GPU device, and can be read or written with much higher bandwidth than pageable memory obtained with functions such as malloc(). There are two use cases for this host memory.

A flags parameter specifies options for how to allocate the memory. For example, with hipHostMallocPortable, the memory is considered allocated by all contexts, not just the one on which the allocation is made.
hipHostMallocMapped maps the allocation into the address space for the current device, and the device pointer can be obtained with the API hipHostGetDevicePointer(). hipHostMallocNumaUser is the flag that allows host memory allocation to follow the NUMA policy set by the user. Please note this flag is currently only applicable on Linux and is under development on Windows.

All allocation flags are independent, and can be used in any combination without restriction. For instance, hipHostMalloc can be called with both the hipHostMallocPortable and hipHostMallocMapped flags set. Both usage models described above use the same allocation flags; the difference is in how the surrounding code uses the host memory.

NUMA policy determines how memory is allocated. The target of the NUMA policy is to select a CPU that is closest to each GPU. NUMA distance is the measurement of how far apart GPU and CPU devices are. By default, each GPU selects the NUMA CPU node that has the least NUMA distance to it; that is, host memory is automatically allocated from the memory pool of the NUMA node closest to the current GPU device. After using the hipSetDevice API to switch to a different GPU, the host allocation is still accessible, but the NUMA distance can be longer. Note that NUMA policy is so far implemented on Linux and is under development on Windows.

ROCm defines two coherency options for host memory: coherent and non-coherent. HIP provides the developer with controls to select which type of memory is used via allocation flags passed to hipHostMalloc and the HIP_HOST_COHERENT environment variable. By default, the environment variable HIP_HOST_COHERENT is set to 0 in HIP.

Coherent host memory is automatically visible at synchronization points.
Developers can control the release scope for hipEvents. A stronger system-level fence can be specified when the event is created with hipEventCreateWithFlags.

Managed memory, including the __managed__ keyword, is supported in HIP combined host/device compilation on Linux, but not on Windows (under development). Managed memory, via unified memory allocation, allows data to be shared and accessed by both the CPU and GPU using a single pointer. The allocation is managed by the AMD GPU driver using the Linux HMM (Heterogeneous Memory Management) mechanism. The user can call the managed memory API hipMallocManaged to allocate a large chunk of HMM memory, execute kernels on the device and fetch data between the host and device as needed.

In a HIP application, it is recommended to do the capability check before calling the managed memory APIs. Please note that the managed memory capability check may not be necessary, but if HMM is not supported, managed malloc will fall back to using system memory and other managed memory API calls will have undefined behavior.

HIP supports Stream Memory Operations to enable direct synchronization between network nodes and the GPU. The following new APIs are added: hipStreamWaitValue32, hipStreamWaitValue64, hipStreamWriteValue32 and hipStreamWriteValue64.

Note that CPU access to the semaphore's memory requires the volatile keyword to disable the CPU compiler's optimizations on memory access. For more details, please check the documentation HIP-API.pdf.

Please note that HIP streams do not guarantee concurrency on AMD hardware in the case of multiple (at least 6) long-running streams executing concurrently, using hipStreamSynchronize(nullptr) for synchronization.

The HIP runtime has Direct Dispatch enabled by default in ROCm 4.4 on Linux.
With this feature we move away from our conventional producer-consumer model, where the runtime creates a worker thread (consumer) for each HIP stream, and the host thread (producer) enqueues commands to a per-stream command queue. With Direct Dispatch, the HIP runtime directly enqueues a packet to the AQL queue (the user mode queue on the GPU) on the Dispatch API call from the application. This has been shown to reduce the latency to launch the first wave on an idle GPU and the total time of tiny dispatches synchronized with the host. In addition, eliminating the threads in the runtime has reduced the variance in the dispatch numbers, as the thread scheduling delays and atomics/locks synchronization latencies are reduced.

This feature can be disabled by setting the following environment variable: AMD_DIRECT_DISPATCH=0. Note that Direct Dispatch is implemented on Linux. It is currently not supported on Windows.

HIP now supports runtime compilation (HIP RTC), which offers the possibility of optimizations and performance improvements compared with other APIs that use regular offline static compilation. HIP RTC APIs accept HIP source files in character string format as input parameters and create handles of programs by compiling the HIP source files without spawning separate processes. For more details on HIP RTC APIs, refer to the HIP Runtime API Reference. For Linux developers, an example shows how to program a HIP application using the runtime compilation mechanism, and a detailed HIP RTC programming guide is also available.

HIP graph is supported. For more details, refer to the HIP API Guide.

HIP-Clang now supports device-side malloc and free. This implementation does not require the use of hipDeviceSetLimit(hipLimitMallocHeapSize,value), and does not respect any such setting. The heap is fully dynamic and can grow until the available free memory on the device is consumed.

The per-thread default stream is supported in HIP.
It is an implicit stream local to both the thread and the current device. This means that a command issued to the per-thread default stream by a thread does not implicitly synchronize with other streams (such as explicitly created streams) or with the per-thread default stream of other threads. The per-thread default stream is a blocking stream and will synchronize with the default null stream if both are used in a program. It can be enabled by adding the compilation option `-fgpu-default-stream=per-thread`. Users can also explicitly pass `hipStreamPerThread` as the per-thread default stream handle in API commands. Test codes are available as examples.

In HIP-Clang, the `long double` type is the 80-bit extended precision format for x86_64, which is not supported by AMDGPU. HIP-Clang treats `long double` as IEEE double for AMDGPU. Using `long double` in HIP source code will not cause issues as long as data of that type is not transferred between host and device. However, `long double` should not be used as a kernel argument type.

If a host function is to be used between clang (or hipcc) and gcc for x86_64, i.e. its definition is compiled by one compiler but the caller is compiled by a different compiler, `_Float16` or aggregates containing `_Float16` should not be used as a function argument or return type. This is due to the lack of a stable ABI for `_Float16` on x86_64. Passing `_Float16` or aggregates containing `_Float16` between clang and gcc could cause undefined behavior.

By default HIP-Clang assumes `-ffp-contract=fast-honor-pragmas`. Users can use `#pragma clang fp contract(on|off|fast)` to control fp contraction of a block of code. For x86_64, FMA is off by default since the generic x86_64 target does not support FMA by default. To turn on FMA on x86_64, use either `-mfma` or `-march=native` on CPUs that support FMA.
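To make the effect of contraction concrete, the following host-only sketch computes `a*b + c` once with two roundings and once with a single rounding via `std::fma`, which models what a contracted expression does on hardware with FMA. The function names (`mul_add_unfused`, `mul_add_fused`, `nearly_equal`) are illustrative and not part of HIP; the tolerance helper is one possible way to compare results, not a prescribed one.

```cpp
#include <cassert>
#include <cmath>

// a*b + c with two roundings: the product is rounded to double before the
// addition. The volatile store keeps the compiler from fusing the two
// operations regardless of the -ffp-contract setting in effect.
double mul_add_unfused(double a, double b, double c) {
    volatile double product = a * b;
    return product + c;
}

// a*b + c with a single rounding, as a fused multiply-add performs it.
double mul_add_fused(double a, double b, double c) {
    return std::fma(a, b, c);
}

// Comparison with an absolute and a relative tolerance (hypothetical helper).
bool nearly_equal(double x, double y, double rel_tol, double abs_tol) {
    assert(rel_tol >= 0.0 && abs_tol >= 0.0);
    const double diff = std::fabs(x - y);
    const double scale = std::fmax(std::fabs(x), std::fabs(y));
    return diff <= std::fmax(abs_tol, rel_tol * scale);
}
```

With `a = 1 + 2^-27`, `b = 1 - 2^-27` and `c = -1`, the exact product is `1 - 2^-54`, which rounds to `1.0` in double precision. The unfused version therefore returns `0`, while the fused version returns `-2^-54`. An exact equality test between the two fails, whereas `nearly_equal` with a small absolute tolerance accepts them.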
When contractions are enabled and the CPU has not enabled FMA instructions, the GPU can produce different numerical results than the CPU for expressions that can be contracted. A tolerance should be used for floating point comparisons.

Note: Currently, HIP only supports basic math functions with rounding mode rn (round to nearest). HIP does not support basic math functions with rounding modes ru (round up), rd (round down), and rz (round towards zero).

HIP-Clang supports generating two types of static libraries. The first type does not export device functions, and only exports and launches host functions within the same library. The advantage of this type is the ability to link with a non-hipcc compiler such as gcc. The second type exports device functions to be linked by other code objects. However, this requires using hipcc as the linker. In addition, the first type of library contains host objects with device code embedded as fat binaries. It is generated using the flag `--emit-static-lib`. The second type of library contains relocatable device objects and is generated using `ar`. For examples of creating and using both types of static libraries, see the HIP samples for host functions and device functions.

In addition to providing a portable C++ programming environment for GPUs, HIP is designed to ease the porting of existing CUDA code into the HIP environment. This section describes the available tools and provides practical suggestions on how to port CUDA code and work through common issues.

The `hipexamine-perl.sh` tool scans a source directory to determine which files contain CUDA code and how much of that code can be automatically hipified. It scans each code file (cpp, c, h, hpp, etc.) found in the specified directory. The script processes each input file FILE individually, which is useful for testing improvements to the hipify toolset.
The `hipconvertinplace-perl.sh` script performs in-place conversion for all code files in the specified directory. This can be quite handy when dealing with an existing CUDA code base, since the script preserves the existing directory structure and filenames, and includes continue to work. After converting in-place, you can review the code to add additional parameters to directory names.

Most CUDA libraries have a corresponding ROCm library with similar functionality and APIs. However, ROCm also provides HIP marshalling libraries that greatly simplify the porting process because they more precisely reflect their CUDA counterparts and can be used with either the AMD or NVIDIA platforms (see 'Identifying HIP Target Platform' below). There are a few notable exceptions.

All HIP projects target either the AMD or the NVIDIA platform. The platform affects which headers are included and which libraries are used for linking.

Often, it's useful to know whether the underlying compiler is HIP-Clang or NVCC. This knowledge can guard platform-specific code or aid in platform-specific performance tuning. HIP-Clang generates the host code directly (using the Clang x86 target) rather than handing it to a separate host compiler, so it has no equivalent of the `__CUDACC__` define.

NVCC makes two passes over the code: one for host code and one for device code. HIP-Clang makes multiple passes over the code: one for the host code, and one for each architecture on the device code. `__HIP_DEVICE_COMPILE__` is set to a nonzero value when the compiler (HIP-Clang or NVCC) is compiling code for a device inside a `__global__` kernel or for a device function. `__HIP_DEVICE_COMPILE__` can replace `#ifdef` checks on the `__CUDA_ARCH__` define. Unlike `__CUDA_ARCH__`, the `__HIP_DEVICE_COMPILE__` value is 1 or undefined, and it doesn't represent the feature capability of the target device.

Some CUDA code tests `__CUDA_ARCH__` for a specific value to determine whether the machine supports a certain architectural feature.
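As a rough illustration of the two styles, the sketch below contrasts a version-number comparison on `__CUDA_ARCH__` with a feature query through `__HIP_ARCH_HAS_DOUBLES__`, and shows `__HIP_DEVICE_COMPILE__` used to identify the compilation pass. The wrapper function names and the threshold `130` are assumptions chosen for illustration. Each macro is tested with `defined(...)` so the snippet also compiles with a plain host compiler, where all three macros are absent.

```cpp
// Version-style check: compares an architecture number. __CUDA_ARCH__ is
// undefined in host passes and when compiling device code for AMD targets.
constexpr bool version_check_reports_doubles() {
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 130)
    return true;
#else
    return false;
#endif
}

// Feature-style check: asks about the capability itself rather than
// inferring it from a version number.
constexpr bool feature_check_reports_doubles() {
#if defined(__HIP_ARCH_HAS_DOUBLES__) && __HIP_ARCH_HAS_DOUBLES__
    return true;
#else
    return false;
#endif
}

// True only while a device pass is being compiled.
constexpr bool compiling_for_device() {
#if defined(__HIP_DEVICE_COMPILE__) && __HIP_DEVICE_COMPILE__
    return true;
#else
    return false;
#endif
}
```

In a host pass all three functions return `false`, which is why host code is better served by the feature flags in the device properties than by these defines.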
This type of code requires special attention, since AMD and CUDA devices have different architectural capabilities. Moreover, you can't determine the presence of a feature using a simple comparison against an architecture's version number. HIP provides a set of defines and device properties to query whether a specific architectural feature is supported.

The `__HIP_ARCH_*` defines can replace comparisons of `__CUDA_ARCH__` values. For host code, the `__HIP_ARCH_*` defines are set to 0, so you should only use them in device code. Host code should query the architecture feature flags in the device properties that `hipGetDeviceProperties` returns, rather than testing the 'major' and 'minor' fields directly. The table below shows the full set of architectural properties that HIP supports.

Makefiles can use the following syntax to conditionally provide a default `HIP_PATH` if one does not exist: `HIP_PATH ?= $(shell hipconfig --path)`

HIP can depend on either rocclr or CUDA as its runtime.

`hipLaunchKernelGGL` is a macro that serves as an alternative way to launch a kernel. It accepts the launch configuration parameters (grid dims, group dims, stream, dynamic shared size) followed by a variable number of kernel arguments. It can replace `<<< >>>`, if the user so desires.

`hipcc` is a portable compiler driver that calls NVCC or HIP-Clang (depending on the target system) and attaches all required include and library options. It passes options through to the target compiler. Tools that call hipcc must ensure the compiler options are appropriate for the target compiler. The `hipconfig` script may be helpful in identifying the target platform, compiler and runtime. It can also help set options appropriately.

Here are the main compiler options supported on AMD platforms by HIP-Clang.

hipcc adds the necessary libraries for HIP as well as for the accelerator compiler (NVCC or AMD compiler).
We recommend linking with hipcc since it automatically links the binary to the necessary HIP runtime libraries. It also knows how to link and manage the GPU objects. hipcc adds `-lm` by default to the link command.

CUDA code often uses NVCC for accelerator code (defining and launching kernels, typically defined in `.cu` or `.cuh` files). It also uses a standard compiler (g++) for the rest of the application. NVCC is a preprocessor that employs a standard host compiler (gcc) to generate the host code. Code compiled using this tool can employ only the intersection of language features supported by both NVCC and the host compiler. In some cases, you must take care to ensure the data types and alignment of the host compiler are identical to those of the device compiler. Only some host compilers are supported; for example, recent NVCC versions lack Clang host-compiler capability.

HIP-Clang generates both device and host code using the same Clang-based compiler. The code uses the same ABI as gcc, which allows code generated by different gcc-compatible compilers to be linked together. For example, code compiled using HIP-Clang can link with code compiled using 'standard' compilers (such as gcc, ICC and Clang). Take care to ensure all compilers use the same standard C++ header and library formats.

hipcc links to libstdc++ by default. This provides better compatibility between g++ and HIP. If you pass `--stdlib=libc++` to hipcc, hipcc will use the libc++ library. Generally, libc++ provides a broader set of C++ features while libstdc++ is the standard for more compilers (notably including g++).

When cross-linking C++ code, any C++ functions that use types from the C++ standard library (including `std::string`, `std::vector` and other containers) must use the same standard-library implementation on both sides of the call. Applications with these interfaces should use the default libstdc++ linking.
Applications which are compiled entirely with hipcc, which benefit from advanced C++ features not supported in libstdc++, and which do not require portability to NVCC, may choose to use libc++.

The `hip_runtime.h` and `hip_runtime_api.h` files define the types, functions and enumerations needed to compile a HIP program. CUDA has slightly different contents for these two files. In some cases you may need to convert hipified code to include the richer `hip_runtime.h` instead of `hip_runtime_api.h`.

You can compile `hip_runtime_api.h` using a standard C or C++ compiler (e.g., gcc or ICC). The HIP include paths and defines (`__HIP_PLATFORM_AMD__` or `__HIP_PLATFORM_NVIDIA__`) must be passed to the standard compiler; hipconfig returns the necessary options. You can capture the hipconfig output and pass it to the standard compiler, for example from a makefile.

NVCC includes some headers by default. However, HIP does not include default headers, and instead all required files must be explicitly included. Specifically, files that call HIP run-time APIs or define HIP kernels must explicitly include the appropriate HIP headers. If the compilation process reports that it cannot find necessary APIs (for example, `error: identifier hipSetDevice is undefined`), ensure that the file includes `hip_runtime.h` (or `hip_runtime_api.h`, if appropriate). The hipify-perl script automatically converts `cuda_runtime.h` to `hip_runtime.h`, and it converts `cuda_runtime_api.h` to `hip_runtime_api.h`, but it may miss nested headers or macros.

The HIP-Clang path provides an empty `cuda.h` file. Some existing CUDA programs include this file but don't require any of the functions.

Many existing CUDA projects use the `.cu` and `.cuh` file extensions to indicate code that should be run through the NVCC compiler. For quick HIP ports, leaving these file extensions unchanged is often easier, as it minimizes the work required to change file names in the directory and `#include` statements in the files.
For new projects or ports which can be refactored, we recommend the extension `.hip.cpp` for source files, and `.hip.h` or `.hip.hpp` for header files. This indicates that the code is standard C++ code, but also provides a unique indication for make tools to run hipcc when appropriate.

Code should not assume a warp size of 32 or 64. See Warp Cross-Lane Functions for information on how to write portable wave-aware code.

Kernel code should use `__attribute__((amdgpu_flat_work_group_size(<min>,<max>)))`.

HIP support for `hipMemcpyToSymbol` is complete. This feature allows a kernel to define a device-side data symbol which can be accessed on the host side. The symbol can be in `__constant__` or `__device__` space. Note that the symbol name needs to be encased in the `HIP_SYMBOL` macro. This also applies to `hipMemcpyFromSymbol`, `hipGetSymbolAddress`, and `hipGetSymbolSize`.

To get a pointer's memory type in HIP/HIP-Clang, developers should use the `hipPointerGetAttributes` API. The first parameter of the API is `hipPointerAttribute_t`, which has 'type' as a member variable. 'type' indicates whether the input pointer is allocated on the device or the host.

Please note, `hipMemoryType` enum values are different from `cudaMemoryType` enum values. On the AMD platform, `hipMemoryType` is defined in `hip_runtime_api.h`, while the CUDA toolkit defines `cudaMemoryType` with its own values. Memory type translation for `hipPointerGetAttributes` therefore needs to be handled properly on the NVIDIA platform to get the correct memory type in CUDA, which is done in the file `nvidia_hip_runtime_api.h`. In any HIP application that uses HIP APIs involving memory types, developers should use `#ifdef` in order to assign the correct enum values depending on the NVIDIA or AMD platform. With the `#ifdef` condition, HIP APIs work as expected on both AMD and NVIDIA platforms.
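The reason a translation step is needed can be sketched with two stand-in enums whose numeric values disagree. The type names, enumerators, and values below are hypothetical and chosen only for illustration; the real definitions live in `hip_runtime_api.h` and the CUDA toolkit headers and should be consulted directly.

```cpp
#include <cassert>

// Hypothetical stand-ins: two enums that describe the same concepts but
// assign different integer values to them.
enum class MemTypeA { Host = 0, Device = 1, Managed = 2 };
enum class MemTypeB { Unregistered = 0, Host = 1, Device = 2, Managed = 3 };

// Translates by meaning instead of by integer value. Returns false when
// the input has no counterpart in the target enum.
bool translate_mem_type(MemTypeB in, MemTypeA* out) {
    assert(out != nullptr);
    switch (in) {
        case MemTypeB::Host:         *out = MemTypeA::Host;    return true;
        case MemTypeB::Device:       *out = MemTypeA::Device;  return true;
        case MemTypeB::Managed:      *out = MemTypeA::Managed; return true;
        case MemTypeB::Unregistered: return false;  // no counterpart
    }
    return false;
}
```

Casting the integer value directly would silently turn `MemTypeB::Host` (1) into `MemTypeA::Device` (1), which is the kind of mismatch an explicit mapping avoids.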
Note, `cudaMemoryTypeUnregistered` is currently not supported in the `hipMemoryType` enum, due to HIP functionality backward compatibility.

`__threadfence_system` makes all device memory writes, all writes to mapped host memory, and all writes to peer memory visible to the CPU and other GPU devices. Some implementations can provide this behavior by flushing the GPU L2 cache. HIP/HIP-Clang does not provide this functionality. As a workaround, users can set the environment variable `HSA_DISABLE_CACHE=1` to disable the GPU L2 cache. This affects all accesses for all kernels and so may have a performance impact.

Compute programs sometimes use textures either to access dedicated texture caches or to use the texture-sampling hardware for interpolation and clamping. The former approach uses simple point samplers with linear interpolation, essentially only reading a single point. The latter approach uses the sampler hardware to interpolate and combine multiple samples. AMD hardware, as well as recent competing hardware, has a unified texture/L1 cache, so it no longer has a dedicated texture cache. But the NVCC path often caches global loads in the L2 cache, and some programs may benefit from explicit control of the L1 cache contents. We recommend the `__ldg` instruction for this purpose. AMD compilers currently load all data into both the L1 and L2 caches, so `__ldg` is treated as a no-op.

On an AMD platform, set the `AMD_LOG_LEVEL` environment variable to log HIP application execution information. The value of the setting controls the logging level. A logging mask selects which types of functionality are printed during the execution of the HIP application.

To see the detailed commands that hipcc issues, set the environment variable `HIPCC_VERBOSE` to 1. Doing so prints to stderr the HIP-Clang (or NVCC) commands that hipcc generates.
See the `utils/vim` or `utils/gedit` directories to add handy highlighting to hip files.

CUDA provides separate Driver and Runtime APIs, which overlap significantly in functionality. The Driver API offers two additional pieces of functionality not provided by the Runtime API: the cuModule and cuCtx APIs.

The Module section of the Driver API provides additional control over how and when accelerator code objects are loaded. For example, the Driver API allows code objects to be loaded from files or memory pointers. Symbols for kernels or global data can be extracted from the loaded code objects. In contrast, the Runtime API automatically loads and (if necessary) compiles all of the kernels from an executable binary when run. In this mode, NVCC must be used to compile kernel code so the automatic loading can function correctly.

Both Driver and Runtime APIs define a function for launching kernels (called `cuLaunchKernel` or `cudaLaunchKernel`). The kernel arguments and the execution configuration (grid dimensions, group dimensions, dynamic shared memory, and stream) are passed as arguments to the launch function. The Runtime additionally provides the `<<< >>>` syntax for launching kernels, which resembles a special function call and is easier to use than the explicit launch API (in particular with respect to handling of kernel arguments). However, this syntax is not standard C++ and is available only when NVCC is used to compile the host code.

The Module features are useful in an environment which generates the code objects directly, such as a new accelerator language front-end. Here, NVCC is not used. Instead, the environment may have a different kernel language or a different compilation flow. Other environments have many kernels and do not want them all to be loaded automatically. The Module functions can be used to load the generated code objects and launch kernels.
As we will see below, HIP defines a Module API which provides similar explicit control over code object management.

The Driver API defines 'Context' and 'Devices' as separate entities. Contexts contain a single device, and a device can theoretically have multiple contexts. Each context contains a set of streams and events specific to the context. Historically, contexts also defined a unique address space for the GPU, though this may no longer be the case in Unified Memory platforms (since the CPU and all the devices in the same process share a single unified address space). The Context APIs also provide a mechanism to switch between devices, which allowed a single CPU thread to send commands to different GPUs. HIP, as well as recent versions of the CUDA Runtime, provides other mechanisms to accomplish this, for example using streams or `cudaSetDevice`.

The CUDA Runtime API unifies the Context API with the Device API. This simplifies the APIs and has little loss of functionality, since each Context can contain a single device and the benefits of multiple contexts have been replaced with other interfaces. HIP provides a context API to facilitate easy porting from existing Driver codes. In HIP, the Ctx functions largely provide an alternate syntax for changing the active device.

Most new applications will prefer to use `hipSetDevice` or the stream APIs, therefore HIP has marked the hipCtx APIs as deprecated. Support for these APIs may not be available in future releases. For more details on deprecated APIs, please refer to HIP deprecated APIs.

Rather than present two separate APIs, HIP extends the HIP API with new APIs for Modules and Ctx control. Like the CUDA Driver API, the Module API provides additional control over how code is loaded, including options to load code from files or from in-memory pointers. NVCC and HIP-Clang target different architectures and use different code object formats: NVCC uses cubin or ptx files, while the HIP-Clang path uses the hsaco format.
The external compilers which generate these code objects are responsible for generating and loading the correct code object for each platform. Notably, there is no fat binary format that can contain code for both NVCC and HIP-Clang platforms. The following table summarizes the formats used on each platform:

hipcc uses HIP-Clang or NVCC to compile host code. Both of these may embed code objects into the final executable, and these code objects are automatically loaded when the application starts. The hipModule API can be used to load additional code objects, and in this way extends the capability of the automatically loaded code objects. HIP-Clang allows both of these capabilities to be used together, if desired. Of course it is possible to create a program with no kernels and thus no automatic loading.

HIP provides a Ctx API as a thin layer over the existing Device functions. This Ctx API can be used to set the current context, or to query properties of the device associated with the context. The current context is implicitly used by other APIs such as `hipStreamCreate`.

The HIPIFY tools convert CUDA Driver APIs for streams, events, modules, devices, memory management, context, and profiler to the equivalent HIP driver calls. For example, `cuEventCreate` is translated to `hipEventCreate`. HIPIFY tools also convert error codes from the Driver namespace and coding convention to the equivalent HIP error code. Thus, HIP unifies the APIs for these common functions.

The memory copy API requires additional explanation. The CUDA Driver API includes the memory direction in the name of the API (`cuMemcpyHtoD`), while the CUDA Runtime API provides a single memory copy API with a parameter that specifies the direction and additionally supports a 'default' direction where the runtime determines the direction automatically. HIP provides APIs with both styles: for example, `hipMemcpyHtoD` as well as `hipMemcpy`.
The first flavor may be faster in some cases since it avoids the host overhead of detecting the memory direction.

HIP defines a single error space, and uses camel-case for all errors (e.g. `hipErrorInvalidValue`).

HIP-Clang defines a process-wide address space where the CPU and all devices allocate addresses from a single unified pool. Thus addresses may be shared between contexts, and unlike the original CUDA definition a new context does not create a new address space for the device.

`hipModuleLaunchKernel` is the HIP counterpart of `cuLaunchKernel`. It takes the same arguments as `cuLaunchKernel`.

hip-clang links device code from different translation units together. For each device target, a code object is generated. Code objects for different device targets are bundled by clang-offload-bundler as one fatbinary, which is embedded as a global symbol `__hip_fatbin` in the `.hip_fatbin` section of the ELF file of the executable or shared object.

hip-clang generates initialization and termination functions for each translation unit for host code compilation. The initialization functions call `__hipRegisterFatBinary` to register the fatbinary embedded in the ELF file. They also call `__hipRegisterFunction` and `__hipRegisterVar` to register kernel functions and device-side global variables. The termination functions call `__hipUnregisterFatBinary`.

hip-clang emits a global variable `__hip_gpubin_handle` of `void**` type with linkonce linkage and initial value 0 for each host translation unit. Each initialization function checks `__hip_gpubin_handle`, registers the fatbinary only if `__hip_gpubin_handle` is 0, and saves the return value of `__hipRegisterFatBinary` to `__hip_gpubin_handle`. This guarantees that the fatbinary is registered only once. A similar check is done in the termination functions.

hip-clang supports kernel launching through the CUDA `<<<>>>` syntax and through `hipLaunchKernelGGL`. The latter is a macro which expands to the CUDA `<<<>>>` syntax.
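The register-once guard described above can be modeled in plain C++. This is a simplified sketch: everything in the `sketch` namespace is a hypothetical stand-in, and the fake register and unregister functions only count calls instead of talking to a runtime. It is meant to show the control flow, not the code that hip-clang actually emits.

```cpp
#include <cassert>

namespace sketch {

int register_calls = 0;
int unregister_calls = 0;
void* fat_binary_slot = nullptr;  // storage the fake handle points at

// Stand-in for the registration entry point: returns a non-null handle.
void** fake_register_fat_binary() {
    ++register_calls;
    return &fat_binary_slot;
}

// Stand-in for the unregistration entry point.
void fake_unregister_fat_binary(void** handle) {
    assert(handle != nullptr);
    ++unregister_calls;
}

// One copy shared by every translation unit, initially 0.
void** gpubin_handle = nullptr;

// Per-translation-unit initializer: registers only if nobody has yet.
void tu_init() {
    if (gpubin_handle == nullptr) {
        gpubin_handle = fake_register_fat_binary();
    }
}

// Per-translation-unit finalizer: unregisters once, then clears the handle.
void tu_fini() {
    if (gpubin_handle != nullptr) {
        fake_unregister_fat_binary(gpubin_handle);
        gpubin_handle = nullptr;
    }
}

}  // namespace sketch
```

Calling `sketch::tu_init()` from three simulated translation units results in a single registration, and repeated calls to `sketch::tu_fini()` result in a single unregistration.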
When the executable or shared library is loaded by the dynamic linker, the initialization functions are called. In the initialization functions, when `__hipRegisterFatBinary` is called, the code objects containing all kernels are loaded; when `__hipRegisterFunction` is called, the stub functions are associated with the corresponding kernels in the code objects.

hip-clang implements two sets of kernel launching APIs. By default, in the host code, for the `<<<>>>` statement, hip-clang first emits a call to `hipConfigureCall` to set up the threads and grids, then emits a call to the stub function with the given arguments. In the stub function, `hipSetupArgument` is called for each kernel argument, then `hipLaunchByPtr` is called with a function pointer to the stub function. In `hipLaunchByPtr`, the real kernel associated with the stub function is launched.

CUDA applications may want to mix CUDA driver code with HIP code. This table shows the type equivalence to enable this interaction.

The `hipModule_t` interface does not support the `cuModuleLoadDataEx` function, which is used to control PTX compilation options. HIP-Clang does not use PTX and does not support these compilation options. In fact, HIP-Clang code objects always contain fully compiled ISA and do not require additional compilation as a part of the load step. The corresponding HIP function `hipModuleLoadDataEx` behaves as `hipModuleLoadData` on the HIP-Clang path (compilation options are not used) and as `cuModuleLoadDataEx` on the NVCC path. Kernels in a loaded module are looked up with `hipModuleGetFunction`.

HIP supports texture driver APIs; however, a texture reference should be declared in host scope.

HIP lets you compile kernels at runtime with the `hiprtc*` APIs.
Kernels can be stored as a text string and can be passed to HIPRTC APIs alongside options to guide the compilation.

NOTE: To use HIPRTC functionality, the HIPRTC header needs to be included first: `#include <hip/hiprtc.h>`

To compile a kernel stored in a string, it needs to be associated with the `hiprtcProgram` type. This is done by declaring `hiprtcProgram prog;` and associating the kernel string with this program through `hiprtcCreateProgram`. The `hiprtcCreateProgram` API also allows you to add headers which can be included in your RTC program. For online compilation, the compiler pre-defines HIP device API functions, HIP specific types and macros for device compilation, but does not include standard C/C++ headers by default. Users can only include header files provided to `hiprtcCreateProgram`.

After associating the kernel string with `hiprtcProgram`, you can compile the program using `hiprtcCompileProgram`. It returns a status value which can be converted to a string via `hiprtcGetErrorString`. If compilation is successful, `hiprtcCompileProgram` returns `HIPRTC_SUCCESS`. If the compilation fails, you can look up the logs through the program log APIs (`hiprtcGetProgramLogSize` and `hiprtcGetProgramLog`).

If the compilation is successful, you can load the compiled binary into a local variable (via `hiprtcGetCodeSize` and `hiprtcGetCode`). After loading the binary, the `hiprtcProgram` can be destroyed with `hiprtcDestroyProgram(&prog);`. The binary present in `kernel_binary` can now be loaded via the `hipModuleLoadData` API, and the kernel can then be launched via the hipModule APIs.

HIPRTC provides a few HIPRTC-specific flags.

In the usual scenario, the kernel associated with `hiprtcProgram` is compiled into a binary which can be loaded and run. However, if the `-fgpu-rdc` option is provided in the compile options, HIPRTC calls comgr and generates only the LLVM bitcode. It doesn't convert this bitcode to ISA and generate the final binary.
If the compilation is successful, one can load the bitcode into a local variable using the bitcode APIs provided by HIPRTC.

AMD GPUs consist of an array of workgroup processors, each built with 2 compute units (CUs) capable of executing SIMD32. All the CUs inside a workgroup processor use local data share (LDS). gfx10+ supports execution of wavefronts in CU mode and in workgroup processor (WGP) mode. Please refer to section 2.3 of the RDNA3 ISA reference. gfx9 and below only support CU mode.

In WGP mode, 4 warps of a block can be executed simultaneously on the workgroup processor, whereas in CU mode only 2 warps of a block can execute simultaneously on a CU. In theory, WGP mode might help with occupancy and increase the performance of certain HIP programs (if not bound to inter-warp communication), but might incur a performance penalty on other HIP programs which rely on atomics and inter-warp communication. This also affects how the LDS is split between warps; please refer to the RDNA3 ISA reference for more information. HIPRTC assumes WGP mode by default for gfx10+. This can be overridden by passing `-mcumode` to the HIPRTC compile options in `hiprtcCompileProgram`.

The bitcode generated using the HIPRTC bitcode APIs can be loaded using the hipModule APIs and can also be linked with other generated bitcodes, with appropriate linker flags, using the HIPRTC linker APIs. This provides more flexibility and optimization opportunities to applications that want to generate the binary dynamically according to their needs. The input bitcodes can be generated for a specific architecture only, or can be bundled bitcode generated for multiple architectures.

Firstly, a HIPRTC link instance, or a pending linker invocation, must be created using `hiprtcLinkCreate`, with the appropriate linker options provided.
Following which, the bitcode data can be added to this link instance via `hiprtcLinkAddData` (if the data is present as a string) or `hiprtcLinkAddFile` (if the data is present as a file), with the appropriate input type according to the data or the bitcode used. Once the bitcodes for multiple architectures are added to the link instance, the linking of the device code must be completed using `hiprtcLinkComplete`, which generates the final binary. If `hiprtcLinkComplete` returns successfully, the generated binary can be loaded and run using the `hipModule*` APIs.

HIPRTC provides the `hiprtcJITInputType` enumeration type, which defines the input types accepted by the linker APIs. Only the input types `HIPRTC_JIT_INPUT_LLVM_BITCODE`, `HIPRTC_JIT_INPUT_LLVM_BUNDLED_BITCODE` and `HIPRTC_JIT_INPUT_LLVM_ARCHIVES_OF_BUNDLED_BITCODE` are supported currently. `HIPRTC_JIT_INPUT_LLVM_BITCODE` can be used to load either LLVM bitcode or LLVM IR assembly code. However, `HIPRTC_JIT_INPUT_LLVM_BUNDLED_BITCODE` and `HIPRTC_JIT_INPUT_LLVM_ARCHIVES_OF_BUNDLED_BITCODE` are only for bundled bitcode and archives of bundled bitcode.

For HIP applications utilizing HIPRTC to compile LLVM bitcode/IR, compatibility is assured only when the ROCm or HIP SDK version used for generating the LLVM bitcode/IR matches the version used during the runtime compilation. When an application requires the ingestion of bitcode/IR not derived from the currently installed AMD compiler, it must run with HIPRTC and comgr dynamic libraries that are compatible with the version of the bitcode/IR. comgr, a shared library, incorporates the LLVM/Clang compiler that HIPRTC relies on. To identify the bitcode/IR version that comgr is compatible with, one can execute `clang -v` using the clang binary from the same ROCm or HIP SDK package.
For instance, if compiling bitcode/IR version 14, the HIPRTC and comgr libraries released by AMD around mid 2022 would be the best choice, assuming the LLVM/Clang version included in the package is also version 14. To ensure smooth operation and compatibility, an application may choose to ship the specific versions of the HIPRTC and comgr dynamic libraries, or it may opt to clearly specify the version requirements and dependencies. This approach guarantees that the application can correctly compile the specified version of bitcode/IR.

HIPRTC defines the `hiprtcResult` enumeration type and a function `hiprtcGetErrorString` for API call error handling. The `hiprtcResult` enum defines the API result codes, and HIPRTC APIs return `hiprtcResult` to indicate the call result. The `hiprtcGetErrorString` function returns a string describing the given `hiprtcResult` code, e.g., `HIPRTC_SUCCESS` to 'HIPRTC_SUCCESS'. For unrecognized enumeration values, it returns 'Invalid HIPRTC error code'.

HIPRTC provides the following API for querying the version. `hiprtcVersion(int* major, int* minor)` sets the output parameters major and minor to the HIP runtime compilation major and minor version numbers respectively. Currently, it returns a hardcoded value. In future releases it should be implemented to return the HIP runtime major and minor version.

HIPRTC mangles the `__global__` function names and the names of `__device__` and `__constant__` variables. If the generated binary is being loaded using the HIP Runtime API, the kernel function or `__device__`/`__constant__` variable must be looked up by name, but this is very hard when the name has been mangled. To overcome this, HIPRTC provides API functions that map `__global__` function or `__device__`/`__constant__` variable names in the source to the mangled names present in the generated binary.
The two APIs hiprtcAddNameExpression and hiprtcGetLoweredName provide this functionality. First, a 'name expression' string denoting the address of the `__global__` function or `__device__`/`__constant__` variable is passed to hiprtcAddNameExpression. Then, the program is compiled with hiprtcCompileProgram. During compilation, HIPRTC parses the name expression string as a C++ constant expression at the end of the user program. Finally, hiprtcGetLoweredName is called with the original name expression and returns a pointer to the lowered name. The lowered name can be used to refer to the kernel or variable in the HIP Runtime API.

Kernel source containing definitions of `__global__` functions or function templates and `__device__`/`__constant__` variables can be stored in a string. hiprtcAddNameExpression is called with name expressions referring to the addresses of the `__global__` functions and `__device__`/`__constant__` variables. The program is then compiled using hiprtcCompileProgram, and the generated binary is loaded using hipModuleLoadData. The mangled names are fetched using hiprtcGetLoweredName. The mangled names of the variables are used to look up the variables in the module and update their values. Finally, the mangled name of the kernel is used to launch it using the hipModule APIs. See hiprtcGetLoweredName.cpp for the detailed example.

HIPRTC follows the versioning below.

The AMD HIP Performance Guidelines are a set of best practices designed to help developers optimize the performance of AMD GPUs. They cover established parallelization and optimization techniques, coding metaphors, and idioms that can greatly simplify programming for HIP-capable GPU architectures. By following four main cornerstones, we can exploit the performance optimization potential of HIP. In the following chapters, we will show you their benefits and how to use them effectively.
For optimal use, the application should expose and efficiently express as much parallelism as possible to keep all system components active. The application should optimize parallel execution across the host and devices using asynchronous calls and streams. Workloads should be assigned based on efficiency: serial to the host, parallel to the devices.

For parallel workloads, when threads need to synchronize to share data, threads that belong to the same block should use `__syncthreads()` (see Synchronization functions) within the same kernel invocation. Threads that belong to different blocks must share data through global memory using two separate kernel invocations. The latter should be minimized because it adds overhead.

Device-level optimization primarily involves maximizing parallel execution across the multiprocessors of the device. This can be achieved by executing multiple kernels concurrently on a device. These kernels are managed through streams, which allow computation and data transfers to overlap, enhancing performance. The aim is to keep all multiprocessors busy by executing enough kernels concurrently. However, launching too many kernels can lead to resource contention, so a balance must be found for optimal performance. This approach helps achieve maximum utilization of the device's resources.

Multiprocessor-level optimization involves maximizing parallel execution within each multiprocessor on a device. Each multiprocessor can execute a number of threads concurrently, and the total number of threads that can run in parallel is determined by the number of concurrent threads each multiprocessor can handle. The key to multiprocessor-level optimization is to efficiently utilize the various functional units within a multiprocessor. This can be achieved by ensuring a sufficient number of resident warps, because at every instruction issue time, a warp scheduler selects an instruction that is ready to execute.
This instruction can be another independent instruction of the same warp, exploiting instruction-level parallelism (see Optimization for maximum instruction throughput), or more commonly an instruction of another warp, exploiting thread-level parallelism. In comparison, device-level optimization focuses on the device as a whole, aiming to keep all multiprocessors busy by executing enough kernels concurrently. Both levels of optimization are crucial for achieving maximum performance. They work together to ensure efficient utilization of the resources of the GPU, from the individual multiprocessors to the device as a whole.

The first step in maximizing memory throughput is to minimize low-bandwidth data transfers. This involves reducing data transfers between the host and the device, as these have lower bandwidth than transfers between global memory and the device. Additionally, data transfers between global memory and the device should be minimized by maximizing the use of on-chip memory: shared memory and caches.

Shared memory acts as a user-managed cache, which the application explicitly allocates and accesses. A common programming pattern is to stage data from device memory into shared memory. Each thread of a block loads data from device memory to shared memory, synchronizes with all other threads of the block, processes the data in shared memory, synchronizes again if necessary, and writes the results back to device global memory.

For some applications, a traditional hardware-managed cache is more appropriate to exploit data locality. On devices of certain compute capabilities, the same on-chip memory is used for both L1 and shared memory, and the amount dedicated to each is configurable for each kernel call.

Finally, the throughput of memory accesses by a kernel can vary significantly depending on the access pattern for each type of memory. Therefore, the next step in maximizing memory throughput is to organize memory accesses as optimally as possible.
This is especially important for global memory accesses, as global memory bandwidth is low compared to available on-chip bandwidths and arithmetic instruction throughput. Thus, non-optimal global memory accesses generally have a high impact on performance.

Applications should aim to minimize data transfers between the host and the device. This can be achieved by moving more computations from the host to the device, even if it means running kernels that do not fully utilize the parallelism of the device. Intermediate data structures can be created, used, and discarded in device memory without being mapped or copied to host memory. Batching small transfers into a single large transfer can improve performance because each transfer carries a fixed overhead.

On systems with a front-side bus, using page-locked host memory can enhance data transfer performance. When using mapped page-locked memory, there is no need to allocate device memory or explicitly copy data between device and host memory. Data transfers occur implicitly each time the kernel accesses the mapped memory. For optimal performance, these memory accesses should be coalesced, similar to global memory accesses. On integrated systems where device and host memory are physically the same, any copy operation between host and device memory is unnecessary, and mapped page-locked memory should be used instead. Applications can check whether a device is integrated by querying the integrated device property.

Memory access instructions may be repeated due to the spread of memory addresses across warp threads. The impact on throughput varies with memory type, and throughput generally drops as addresses become more scattered, especially in global memory. Device memory is accessed via 32-, 64-, or 128-byte transactions that must be naturally aligned.
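The effect of address spread on the number of transactions can be illustrated with a small host-side model. This is only a sketch, not a description of any particular GPU's memory pipeline: the function name `segmentsTouched`, the 64-thread warp used in the usage note, and the single 128-byte segment size are assumptions chosen for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Host-side model (hypothetical helper, not a HIP API): counts how many
// naturally aligned segments of `segmentBytes` are touched when thread t of a
// warp accesses `elemBytes` bytes at address base + t * strideBytes.
// Addresses are non-decreasing in t, so segments are visited in order and
// each segment is counted once.
constexpr std::size_t segmentsTouched(std::uint64_t base,
                                      std::uint64_t strideBytes,
                                      std::uint64_t elemBytes,
                                      int warpSize,
                                      std::uint64_t segmentBytes) {
    std::size_t count = 0;
    std::uint64_t next = 0;  // lowest segment index not yet counted
    bool any = false;        // true once at least one segment was counted
    for (int t = 0; t < warpSize; ++t) {
        const std::uint64_t addr = base + static_cast<std::uint64_t>(t) * strideBytes;
        const std::uint64_t first = addr / segmentBytes;
        const std::uint64_t last = (addr + elemBytes - 1) / segmentBytes;
        // Skip segments already counted for earlier threads.
        const std::uint64_t from = (any && first < next) ? next : first;
        if (last >= from) {
            count += static_cast<std::size_t>(last - from + 1);
            next = last + 1;
            any = true;
        }
    }
    return count;
}
```

Under these assumptions, a 64-thread warp reading 4-byte elements with unit stride from an aligned base touches 2 segments, the same access shifted by 4 bytes touches 3, and a 128-byte stride touches 64, one segment per thread.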
Maximizing memory throughput involves coalescing memory accesses of threads within a warp into minimal transactions, following optimal access patterns, using properly sized and aligned data types, and padding data when necessary. Global memory instructions support reading or writing data of specific sizes (1, 2, 4, 8, or 16 bytes) that are naturally aligned. If the size and alignment requirements are not met, it leads to multiple instructions, reducing performance. Therefore, using data types that meet these requirements, ensuring alignment for structures, and maintaining alignment for all values or arrays is crucial for correct results and optimal performance. Threads often access 2D arrays at an address calculated as BaseAddress + xIndex + width * yIndex . For efficient memory access, the array and thread block widths should be multiples of the warp size. If the array width is not a multiple of the warp size, it is usually more efficient to allocate it with a width rounded up to the nearest multiple and pad the rows accordingly. Local memory is used for certain automatic variables, such as arrays with non-constant indices, large structures or arrays, and any variable when the kernel uses more registers than available. Local memory resides in device memory, leading to high latency and low bandwidth similar to global memory accesses. However, it is organized for consecutive 32-bit words to be accessed by consecutive thread IDs, allowing full coalescing when all threads in a warp access the same relative address. Shared memory, located on-chip, provides higher bandwidth and lower latency than local or global memory. It is divided into banks that can be simultaneously accessed, boosting bandwidth. However, bank conflicts, where two addresses fall in the same bank, lead to serialized access and decreased throughput. Therefore, understanding how memory addresses map to banks and scheduling requests to minimize conflicts is crucial for optimal performance. 
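How a strided access pattern maps onto banks can be sketched with a simplified host-side model. The bank count (32) and bank width (one 4-byte word) used in the usage note are assumptions for illustration only, since the text above does not state them, and `conflictDegree` is a hypothetical helper rather than part of HIP. Real hardware may also process a warp in several phases, which this model ignores.

```cpp
// Simplified model (hypothetical helper): thread t accesses the 32-bit word
// at index t * strideWords, and word w lives in bank w % banks.
// Returns the largest number of distinct words that land in one bank, that
// is, how many serialized passes the access would need in this model.
// Supports up to 64 banks; returns -1 for an unsupported bank count.
constexpr int conflictDegree(int threads, int strideWords, int banks) {
    if (banks <= 0 || banks > 64) return -1;
    if (strideWords == 0) return 1;  // all threads read the same word
    int perBank[64] = {};            // accesses counted per bank
    int worst = 0;
    for (int t = 0; t < threads; ++t) {
        const int bank = (t * strideWords) % banks;
        ++perBank[bank];
        if (perBank[bank] > worst) worst = perBank[bank];
    }
    return worst;
}
```

With 32 threads and 32 assumed banks, a stride of 1 word is conflict-free, a stride of 2 gives a 2-way conflict, and a stride of 32 serializes all 32 accesses. A stride of 33 is conflict-free again in this model, which is why padding a shared-memory row by one element is a common remedy.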
Constant memory is in device memory and cached in the constant cache. Requests are split based on different memory addresses, affecting throughput, and are serviced at the throughput of the constant cache for cache hits, or the throughput of the device memory otherwise.

Texture and surface memory are stored in device memory and cached in the texture cache. This setup optimizes 2D spatial locality, leading to better performance for threads reading close 2D addresses. Reading device memory through texture or surface fetching can be advantageous, offering higher bandwidth for local texture fetches or surface reads, offloading addressing calculations, allowing data broadcasting, and optionally converting 8-bit and 16-bit integer input data to 32-bit floating-point values on the fly.

The type and complexity of arithmetic operations can significantly impact the performance of your application. The following hints can help maximize instruction throughput:

- Using efficient operations: Some arithmetic operations are more costly than others. For example, multiplication is typically faster than division, and integer operations are usually faster than floating-point operations, especially with double-precision.
- Minimizing low-throughput instructions: This might involve trading precision for speed when it does not affect the final result. For instance, consider using single-precision arithmetic instead of double-precision.
- Leveraging intrinsic functions: Intrinsic functions are pre-defined functions available in HIP that can often be executed faster than equivalent arithmetic operations (subject to some input or accuracy restrictions). They can help optimize performance by replacing more complex arithmetic operations.
- Avoiding divergent warps: Divergent warps occur when threads within the same warp follow different execution paths. This can happen due to conditional statements that lead to different arithmetic operations being performed by different threads. Divergent warps can significantly reduce instruction throughput, so try to structure your code to minimize divergence.
- Optimizing memory access: The efficiency of memory access can impact the speed of arithmetic operations. Coalesced memory access, where threads in a warp access consecutive memory locations, can improve memory throughput and thus the speed of arithmetic operations.
- Maximizing instruction parallelism: Some GPU architectures can issue independent instructions simultaneously, for example integer and floating-point instructions, or two operations with independent inputs and outputs. This is mostly the compiler's job, but expressing parallelism explicitly in the code can improve instruction throughput.

Flow control instructions (if, else, for, do, while, break, continue, switch) can impact instruction throughput by causing threads within a warp to diverge and follow different execution paths. To optimize performance, control conditions should be written to minimize divergent warps. For example, when the control condition depends on (threadIdx / warpSize), no warp diverges. The compiler may optimize loops or short if or switch blocks using branch predication, preventing warp divergence. With branch predication, instructions associated with a false predicate are scheduled but not executed, avoiding unnecessary operations.

Synchronization ensures that all threads within a block have completed their computations and memory accesses before moving forward, which is critical when threads depend on the results of other threads. However, synchronization can also add performance overhead, because it requires threads to wait, potentially leaving GPU resources idle. `__syncthreads()` synchronizes all threads in a block, ensuring that all threads have reached the same point in the code and that shared memory is visible to all threads after the point of synchronization. An alternative way to synchronize is using streams.
Different streams can execute commands out of order with respect to one another or concurrently. This allows for more fine-grained control over the execution order of commands, which can be beneficial in certain scenarios.

Applications that frequently allocate and free memory may experience slower allocation calls over time. This is expected as memory is released back to the operating system. To optimize performance in such scenarios, consider some recommendations:

AMD debugging tools include ltrace and ROCgdb. External tools are available and can be found online. For example, if you're using Windows, you can use Microsoft Visual Studio and WinGDB. You can trace and debug your code using the following tools and techniques.

You can use tracing to quickly observe the flow of an application before reviewing the detailed information provided by a command-line debugger. Tracing can be used to identify issues ranging from accidental API calls to calls made on a critical path. ltrace is a standard Linux tool that prints a message to stderr on every dynamic library call. You can use ltrace to visualize the runtime behavior of the entire ROCm software stack. Here's a simple command-line example that uses ltrace to trace HIP APIs and output: Here's another example that uses ltrace to trace hsa APIs and output:

You can use ROCgdb for debugging and profiling. ROCgdb is the ROCm source-level debugger for Linux and is based on the GNU Project debugger (GDB). As the ROCm equivalent of CUDA-GDB, it can be used with debugger frontends such as Eclipse, Visual Studio Code, or GDB dashboard. For details, see (https://github.com/ROCm/ROCgdb). The sample below shows how to use ROCgdb to run and debug a HIP application. ROCgdb is installed with the ROCm package in the /opt/rocm/bin folder.
The following Linux example shows how to get useful information from the debugger while running a simple memory copy test, which caused a segmentation fault issue.

Debugging HIP applications using Windows tools can be more informative than on Linux. Windows tools provide more visibility into debug codes, which makes it easier to inspect variables, watch multiple details, and examine call stacks.

HIP provides environment variables that allow HIP, hip-clang, or HSA drivers to disable certain features and optimizations. These are not intended for production, but can be useful to diagnose synchronization problems in the application (or driver). Some of the more widely used environment variables are described in this section. These are supported on the Linux ROCm path and Windows.

You can control kernel command serialization from the host:

- AMD_SERIALIZE_KERNEL=1: wait for completion before enqueue
- AMD_SERIALIZE_KERNEL=2: wait for completion after enqueue
- AMD_SERIALIZE_KERNEL=3: both

Or copy command serialization:

- AMD_SERIALIZE_COPY=1: wait for completion before enqueue
- AMD_SERIALIZE_COPY=2: wait for completion after enqueue
- AMD_SERIALIZE_COPY=3: both

Depending on these settings, the HIP runtime waits for the GPU to become idle before and/or after any GPU command.

For systems with multiple devices, you can make only certain devices visible to HIP using HIP_VISIBLE_DEVICES (or CUDA_VISIBLE_DEVICES on an NVIDIA platform). Once it is set, HIP can only see devices whose indices are present in the sequence. For example:

To analyze compiler-related issues, you can dump the code object using GPU_DUMP_CODE_OBJECT.

HSA provides environment variables that help analyze issues in drivers or hardware. HSA_ENABLE_INTERRUPT=0 causes completion signals to be detected with memory-based polling, rather than interrupts.
Here are some of the more commonly used environment variables:

Note: This gdb command does not use an equal (=) sign.

HIP provides a logging mechanism that allows you to trace HIP API and runtime codes when running a HIP application. In addition to being useful to our users/developers, the HIP development team uses these logs to improve the HIP runtime. By adjusting the logging settings and logging mask, you can get different types of information for different functionalities, such as HIP APIs, executed kernels, queue commands, and queue contents. Refer to the following sections for examples.

Tip: Logging works for the release and debug versions of HIP. If you want to save logging output in a file, define the file when running the application via command line. For example:

HIP logging is disabled by default. You can enable it via the AMD_LOG_LEVEL environment variable. The value of this variable controls your logging level. Levels are defined as follows:

Tip: You can call a logging function with different logging levels. All information under the value set for AMD_LOG_LEVEL is printed.

The logging mask is designed to print functionality types when you're running a HIP application. Once you set AMD_LOG_LEVEL, the logging mask is set to the default value (0x7FFFFFFF). You can change this to any of the valid values: You can also define the logging mask via the AMD_LOG_MASK environment variable.

You can use the following code to print HIP logging information: Using HIP code, call the ClPrint() function with the desired input variables. For example:

On Linux, you can enable HIP logging and retrieve logging information when you run hipinfo.

On Windows, you can set AMD_LOG_LEVEL via environment variable from the advanced system settings or the command prompt (when run as administrator). The following example shows debug log information when calling the backend runtime.
The cooperative groups API is an extension to the HIP programming model that provides developers with a flexible, dynamic grouping mechanism for communicating threads. Cooperative groups let you define your own sets of thread groups, which may fit your use cases better than those defined by the hardware. This lets you specify the level of granularity for thread communication, which can lead to more efficient parallel decompositions.

The API is accessible in the cooperative_groups namespace after hip_cooperative_groups.h is included. The header contains the following elements:

The thread hierarchy abstraction of cooperative groups consists of the grid hierarchy and the block hierarchy.

Fig. 1: Cooperative group thread hierarchy in grids.

The multi grid is an abstraction of potentially multiple simultaneous launches of the same kernel over multiple devices (deprecated since 5.0). The grid in cooperative groups is a single dispatch of kernels for execution, like the original grid.

Note: The ability to synchronize over a grid or multi grid requires the kernel to be launched using the specific cooperative groups API.

The block is the same as the block entity of the inherent thread model.

Note: Explicit warp-level thread handling is absent from the cooperative groups API. To exploit the known hardware SIMD width, on which built-in functionality translates to simpler logic, you can use the group partitioning part of the API, such as tiled_partition.

Fig. 2: Cooperative group thread hierarchy in blocks.

The cooperative groups API introduces a new level between the thread block and individual threads. The thread-block tile makes it possible to have tiles in the thread block, while the coalesced group holds the active threads of the parent group. These groups are further discussed in the group types section. For details on the memory model, check the memory model description.
Group types are based on the levels of synchronization and data sharing among threads.

The thread-block group represents an intra-block cooperative groups type where the participating threads within the group are the same threads that participated in the currently executing block. The group_index(), thread_index(), thread_rank(), size(), cg_type(), is_valid(), sync() and group_dim() member functions are public members of the thread_block class. For further details, check the thread_block references.

The grid group represents an inter-block cooperative groups type where the group's participating threads span multiple blocks running the same kernel on the same device. Use the cooperative launch API to enable synchronization across the grid group. The thread_rank(), size(), cg_type(), is_valid() and sync() member functions are public members of the grid_group class. For further details, check the grid_group references.

The multi-grid group represents an inter-device cooperative groups type where the participating threads within the group span multiple devices that run the same kernel. Use the cooperative launch API to enable synchronization across the multi-grid group. The num_grids(), grid_rank(), thread_rank(), size(), cg_type(), is_valid(), and sync() member functions are public members of the multi_grid_group class. For further details, check the multi_grid_group references.

The thread-block tile is a templated class derived from thread_group. The template parameter defines the tile size of the new thread group at compile time. This group type also supports sub-wave level intrinsics. The thread_rank(), size(), cg_type(), is_valid(), sync(), meta_group_rank(), meta_group_size(), shfl(), shfl_down(), shfl_up(), shfl_xor(), ballot(), any(), all(), match_any() and match_all() member functions are public members of the thread_block_tile class. For further details, check the thread_block_tile references.
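The relationship between a thread's rank in its block and its position within a thread-block tile can be sketched on the host. This is a model under the assumption that tiles are formed from consecutive thread ranks of the parent block. The `TileCoords` struct and `tileCoords` function are hypothetical names, and the fields only mirror what thread_rank(), meta_group_rank() and meta_group_size() would be expected to report under that assumption.

```cpp
// Host-side model of partitioning a thread block into fixed-size tiles.
// Assumption: tiles consist of consecutive thread ranks of the parent block.
struct TileCoords {
    unsigned threadRank;     // rank of the thread inside its tile
    unsigned metaGroupRank;  // index of the tile inside the parent block
    unsigned metaGroupSize;  // number of tiles created from the parent block
};

constexpr TileCoords tileCoords(unsigned blockRank, unsigned blockSize,
                                unsigned tileSize) {
    return TileCoords{blockRank % tileSize,
                      blockRank / tileSize,
                      (blockSize + tileSize - 1) / tileSize};
}
```

For example, in a 256-thread block split into tiles of 64, the thread with block rank 70 would be thread 6 of tile 1, out of 4 tiles in total.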
Threads in a warp (64 threads on CDNA and 32 threads on RDNA) cannot execute different instructions simultaneously, so conditional branches are executed serially within the warp. When threads encounter a conditional branch, they can diverge, and threads that do not meet the condition to execute that branch are disabled. The active threads are referred to as coalesced, and a coalesced group represents an active thread group within a warp.

Note: The NVIDIA GPU's independent thread scheduling presents the appearance that threads on different branches execute concurrently.

Warning: AMD GPUs do not support independent thread scheduling. Some CUDA applications rely on this feature, and their ported HIP versions can deadlock on AMD GPUs when they try to make use of independent thread scheduling.

This group type also supports sub-wave level intrinsics.

Note: shfl() functions support integer or float type.

The thread_rank(), size(), cg_type(), is_valid(), sync(), meta_group_rank(), meta_group_size(), shfl(), shfl_down(), shfl_up(), ballot(), any(), all(), match_any() and match_all() member functions are public members of the coalesced_group class. For more information, see coalesced_group references.

The difference to the original block model in the reduce_sum device function is the following.

The difference to the original block model in the reduce_sum() function call and input data initialization is the following.

In the device function, the input group type is thread_group, which is the parent class of all the cooperative groups types. With this, you can write generic functions that work with any type of cooperative group.

With each group type, synchronization requires using the correct cooperative groups launch API. Thread-block groups do not need kernel launch validation.
Confirm the cooperative launch capability on the single AMD GPU:

Confirm the cooperative launch capability over multiple GPUs:

You can access the new block representation using the original kernel launch methods.

Launch the cooperative kernel on a single GPU:

Launch the cooperative kernel over multiple GPUs:

Device-side synchronization

The device-side code of the thread_block synchronization over single GPUs:

The device-side code of the grid synchronization over single GPUs:

The device-side code of the multi-grid synchronization over multiple GPUs:

HIP doesn't support the following NVIDIA CUDA optional headers:

HIP doesn't support the following CUDA class in cooperative_groups namespace:

HIP doesn't support the following CUDA functions/operators in cooperative_groups namespace:

In conventional architectures, CPUs and GPUs have dedicated memory like Random Access Memory (RAM) and Video Random Access Memory (VRAM). This architectural design, while effective, can be limiting in terms of memory capacity and bandwidth, as continuous memory copying is required to allow the processors to access the appropriate data. New architectural features like Heterogeneous System Architectures (HSA) and Unified Memory (UM) help avoid these limitations and promise increased efficiency and innovation.

Unified Memory is a single memory address space accessible from any processor within a system. This setup simplifies memory management processes and enables applications to allocate data that can be read or written by code running on either CPUs or GPUs. The Unified memory model is shown in the following figure.

The AMD Accelerated Processing Unit (APU) is a typical example of a Unified Memory Architecture. On a single die, a central processing unit (CPU) is combined with an integrated graphics processing unit (iGPU), and both have access to a high-bandwidth memory (HBM) module named Unified Memory.
The CPU enables high-performance, low-latency operations, while the GPU is optimized for high throughput (data processed per unit time).

Unified memory is supported on Linux by all modern AMD GPUs from the Vega series onward. Unified memory management can be achieved with managed memory allocation and, for the latest GPUs, with a system allocator. The table below lists the supported allocators. The allocators are described in the next section.

1 Works only with XNACK=1. First GPU access causes recoverable page-fault. For more details, visit GPU memory.

Several unified memory programming models are available; which ones you can use depends on your architecture. For more information, see System requirements and Checking unified memory management support.

hipMallocManaged() is a dynamic memory allocator available on all GPUs with unified memory support. For more details, visit HIP managed memory allocation API. The `__managed__` declaration specifier, which serves as its counterpart for static allocation, is supported on all modern AMD cards. Starting with the AMD MI300 series, the malloc() system allocator allows you to reserve unified memory. The system allocator is more versatile and offers an easy transition from C++ code written for the CPU to HIP code, because the same system allocation API is used.

Some device attributes can offer information about which unified memory programming models are supported. The attribute value is 1 if the functionality is supported, and 0 if it is not supported. The following examples show how to use device attributes:

The following example shows how to use unified memory management with the hipMallocManaged() function, with the `__managed__` attribute for static allocation, and with standard malloc() allocation. For comparison, the Explicit Memory Management example is presented in the last tab.
Unified memory management (UMM) is a feature that can simplify the complexities of memory management in GPU computing. It is particularly useful in heterogeneous computing environments where both a CPU and a GPU use memory heavily, which would otherwise require large memory transfers. Here are some areas where UMM can be beneficial:

UMM can help to simplify the complexities of memory management. This can make it easier for developers to write code without worrying about memory allocation and deallocation details.

UMM allows for efficient data migration between the host (CPU) and the device (GPU). This can be particularly useful for applications that need to move data back and forth between the device and host.

As a positive side effect, UMM can reduce the lines of code, thereby improving programming productivity.

In HIP, pinned memory allocations are coherent by default. Pinned memory is host memory mapped into the address space of all GPUs, meaning that the pointer can be used on both host and device. Using pinned memory instead of pageable memory on the host can improve bandwidth.

While UMM can provide numerous benefits, it's important to be aware of the potential performance overhead associated with it. You must thoroughly test and profile your code to ensure it's the most suitable choice for your use case.

Unified memory HIP runtime hints can help improve the performance of your code if you know your code's abilities and infrastructure. Some hint techniques are presented in this section.

The hint functions can set actions on a selected device, which can be identified by hipGetDeviceProperties(&prop, device_id). There are two special device_id values:

For the best performance, profile your application to optimize the utilization of HIP runtime hints.
Data prefetching is a technique used to improve the performance of your application by moving data closer to the processing unit before it's actually needed. Remember to check the return status of hipMemPrefetchAsync() to ensure that the prefetch operations completed successfully.

The effectiveness of hipMemAdvise() comes from its ability to inform the runtime system of the developer's intentions regarding memory usage. When the runtime system knows the expected memory access patterns, it can make better decisions about data placement and caching, leading to more efficient execution of the application. However, the actual impact on performance can vary based on the specific use case and the hardware architecture. For the description of hipMemAdvise() and the detailed list of advice, visit the HIP managed memory allocation API. Here is the updated version of the example above with memory advice.

Memory range attributes allow you to query attributes of a given memory range. hipMemRangeGetAttribute() is added to the example to query the hipMemRangeAttributeReadMostly attribute of the memory range pointed to by a. The result is stored in attributeValue and then printed out. For more details, visit the HIP managed memory allocation API.

The hipStreamAttachMemAsync function is intended to asynchronously attach memory to a stream, which can help concurrent execution when using streams. Currently, this function is a no-operation (NOP) function on AMD GPUs. It simply returns success after the runtime memory validation has passed. This function is necessary on Microsoft Windows, and UMM is not supported on this operating system with AMD GPUs at the moment.

Memory management is important when creating high-performance applications in the HIP ecosystem.
Both allocating and copying memory can result in bottlenecks, which can significantly impact performance. Global memory allocation in HIP uses the C language style allocation function. This works fine for simple cases but can cause problems if your memory needs change. If you need to increase the size of your memory, you must allocate a second larger buffer and copy the data to it before you can free the original buffer. This increases overall memory usage and causes unnecessary memcpy calls. Another solution is to allocate a larger buffer than you initially need. However, this isn't an efficient way to handle resources and doesn't solve the issue of reallocation when the extra buffer runs out. Virtual memory management solves these memory management problems. It helps to reduce memory usage and unnecessary memcpy calls. Standard memory allocation uses the hipMalloc function to allocate a block of memory on the device. However, when using virtual memory, this process is separated into multiple steps using the hipMemCreate , hipMemAddressReserve , hipMemMap , and hipMemSetAccess functions. This guide explains what these functions do and how you can use them for virtual memory management. The first step is to allocate the physical memory itself with the hipMemCreate function. This function accepts the size of the buffer, an unsigned long long variable for the flags, and a hipMemAllocationProp variable. hipMemAllocationProp contains the properties of the memory to be allocated, such as where the memory is physically located and what kind of shareable handles are available. If the allocation is successful, the function returns a value of hipSuccess , with hipMemGenericAllocationHandle_t representing a valid physical memory allocation. The allocated memory size must be aligned with the granularity appropriate for the properties of the allocation. You can use the hipMemGetAllocationGranularity function to determine the correct granularity. 
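The alignment requirement above can be illustrated with a small host-only sketch. It assumes the granularity has already been obtained (for instance from hipMemGetAllocationGranularity) and is passed in as a plain number; the helper name align_to_granularity is hypothetical and not part of the HIP API.

```cpp
#include <cstddef>

// Hypothetical helper (not a HIP function): round `requested` up to the next
// multiple of `granularity`. `granularity` stands in for the value reported
// by hipMemGetAllocationGranularity and must be non-zero. The sketch does not
// guard against overflow for sizes close to the maximum of std::size_t.
constexpr std::size_t align_to_granularity(std::size_t requested,
                                           std::size_t granularity) {
    return ((requested + granularity - 1) / granularity) * granularity;
}
```

The rounded size would then be the one passed to the physical allocation and to the matching virtual address reservation, so that both describe the same number of bytes.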
After you have acquired an allocation of physical memory, you must map it before you can use it. To do so, you need a virtual address to map it to. Mapping means the physical memory allocation is available from the virtual address range it is mapped to. To reserve a virtual memory range, use the hipMemAddressReserve function. The size of the virtual memory must match the amount of physical memory previously allocated. You can then map the physical memory allocation to the newly-acquired virtual memory address range using the hipMemMap function. Finally, use the hipMemSetAccess function to enable memory access. It accepts the pointer to the virtual memory, the size, and a hipMemAccessDesc descriptor as parameters. In a multi-GPU environment, you can map the device memory of one GPU to another. This feature also works with the traditional memory management system, but isn't as scalable as with virtual memory. When memory is allocated with hipMalloc , hipDeviceEnablePeerAccess is used to enable peer access. This function enables access between two devices, but it means that every call to hipMalloc takes more time to perform the checks and the mapping between the devices. When using virtual memory management, peer access is enabled by hipMemSetAccess , which provides a finer level of control over what is shared. This has no performance impact on memory allocation and gives you more control over what memory buffers are shared with which devices. At this point the memory is allocated, mapped, and ready for use. You can read and write to it, just like you would a C style memory allocation. To free the memory allocated in this manner, use the corresponding free functions. To unmap the memory, use hipMemUnmap . To release the virtual address range, use hipMemAddressFree . Finally, to release the physical memory, use hipMemRelease . A side effect of these functions is the lack of synchronization when memory is released. 
If you call hipFree when you have multiple streams running in parallel, it synchronizes the device. This causes worse resource usage and performance.

The hipMemAddressReserve function allows you to increase the amount of pre-allocated memory. This function accepts a parameter representing the requested starting address of the virtual memory. This allows you to have a continuous virtual address space without worrying about the underlying physical allocation.

The code sample above assumes that hipMemAddressReserve was able to reserve the memory address at the specified location. However, this isn't guaranteed to be true, so you should validate that new_ptr points to the requested virtual address before using it.

HIP provides the following:

The HIP API documentation describes each API and its limitations, if any, compared with the equivalent CUDA API.

At a high level, the following features are not supported:

See the API Support Table for more detailed information.

No. HIP provides porting tools which do most of the work to convert CUDA code into portable C++ code that uses the HIP APIs. Most developers will port their code from CUDA to HIP and then maintain the HIP version. HIP code provides the same performance as native CUDA code, plus the benefits of running on AMD platforms.

HIP APIs and features do not map to a specific CUDA version. HIP provides a strong subset of the functionality provided in CUDA, and the hipify tools can scan code to identify any unsupported CUDA functions. This is useful for identifying the specific features required by a given application.

However, we can provide a rough summary of the features included in each CUDA SDK and the support level in HIP.
Each bullet below lists the major new language features in each CUDA release and then indicates which are supported or not supported in HIP:

HIP includes growing support for the four key math libraries using hipBLAS, hipFFT, hipRAND and hipSPARSE, as well as MIOpen for machine intelligence applications. These offer pointer-based memory interfaces (as opposed to opaque buffers) and can be easily interfaced with other HIP applications. The HIP interfaces support both ROCm and CUDA paths, with familiar library interfaces.

Additionally, some of the cuBLAS routines are automatically converted to hipBLAS equivalents by the HIPIFY tools. These APIs use cuBLAS or hcBLAS depending on the platform and replace the need to use conditional compilation.

Both AMD and NVIDIA support OpenCL 1.2 on their devices so that developers can write portable code. HIP offers several benefits over OpenCL:

Both HIP and CUDA are dialects of C++, and thus porting between them is relatively straightforward. Both dialects support templates, classes, lambdas, and other C++ constructs. As one example, the hipify-perl tool was originally a Perl script that used simple text conversions from CUDA to HIP. HIP and CUDA provide similar math library calls as well. In summary, the HIP philosophy was to make the HIP language close enough to CUDA that the porting effort is relatively simple. This reduces the potential for error, and also makes it easy to automate the translation. HIP's goal is to quickly get the ported program running on both platforms with little manual intervention, so that the programmer can focus on performance optimizations.

There have been several tools that have attempted to convert CUDA into OpenCL, such as CU2CL. OpenCL is a C99-based kernel language (rather than C++) and also does not support single-source compilation. As a result, the OpenCL syntax is different from CUDA, and the porting tools have to perform some heroic transformations to bridge this gap.
The tools also struggle with more complex CUDA applications, in particular, those that use templates, classes, or other C++ features inside the kernel.

Typically, HIPIFY tools can automatically convert almost all run-time code. Most device code needs no additional conversion since HIP and CUDA have similar names for math and built-in functions. The hipify-clang tool will automatically modify the kernel signature as needed (automating a step that used to be done manually). Additional porting may be required to deal with architecture feature queries or with CUDA capabilities that HIP doesn't support. In general, developers should always expect to perform some platform-specific tuning and optimization.

NVCC is NVIDIA's compiler driver for compiling 'CUDA C++' code into PTX or device code for NVIDIA GPUs. It's a closed-source binary compiler that is provided by the CUDA SDK.

HIP-Clang is a Clang/LLVM-based compiler for HIP programs that run on AMD platforms.

While HIP is a strong subset of CUDA, it is a subset. The HIP layer allows that subset to be clearly defined and documented. Developers who code to the HIP API can be assured their code will remain portable across NVIDIA and AMD platforms. In addition, HIP defines portable mechanisms to query architectural features and supports a larger 64-bit WaveSize, which expands the return type for cross-lane functions like ballot and shuffle from 32-bit integers to 64-bit integers.

Yes. HIP's CUDA path only exposes the APIs and functionality that work on both NVCC and AMDGPU back-ends. 'Extra' APIs, parameters, and features which exist in CUDA but not in HIP-Clang will typically result in compile-time or run-time errors. Developers need to use the HIP API for most accelerator code and bracket any CUDA-specific code with preprocessor conditionals. Developers concerned about portability should, of course, run on both platforms, and should expect to tune for performance.
In some cases, CUDA has a richer set of modes for some APIs, and some C++ capabilities such as virtual functions. See the HIP API documentation for more details.

Yes. HIP's HIP-Clang path only exposes the APIs and functions that work on AMD runtime back ends. 'Extra' APIs, parameters and features that appear in HIP-Clang but not CUDA will typically cause compile-time or run-time errors. Developers must use the HIP API for most accelerator code and bracket any HIP-Clang specific code with preprocessor conditionals. Those concerned about portability should, of course, test their code on both platforms and should tune it for performance. Typically, HIP-Clang supports a more modern set of C++11/C++14/C++17 features, so HIP developers who want portability should be careful when using advanced C++ features on the HIP-Clang path.

The environment variable can be used to set the compiler path:

There is an alternative environment variable to set the compiler path:

AMD Common Language Runtime (CLR) is a repository for the AMD platform, which contains the source code for AMD's compute language runtimes as follows:

A new repository, 'hipother', was added in the ROCm 6.1 release, branched out from HIP. hipother supports the HIP back-end implementation on some non-AMD platforms, like NVIDIA.

No, there is no HIP repository open publicly on Windows.

HIP is a source-portable language that can be compiled to run on either the AMD or the NVIDIA platform. HIP tools don't create a 'fat binary' that can run on either platform, however.

Yes. HIP generates object code which conforms to the GCC ABI, and also links with libstdc++. This means you can compile host code with the compiler of your choice and link the generated object code with GPU code compiled with HIP. Larger projects often contain a mixture of accelerator code (initially written in CUDA with NVCC) and host code (compiled with gcc, icc, or clang).
These projects can convert the accelerator code to HIP, compile that code with hipcc, and link with object code from their preferred compiler.

HIP is a C++ runtime API that supports C style applications as well.

Some C style applications (and interfaces to other languages such as FORTRAN and Python) call certain HIP APIs but do not use kernel programming. They can be compiled with a C compiler and run correctly; however, small details must be considered in the code. For example, consider initialization of the HIP struct dim3, as shown in the simple application below, saved under the file name 'test.hip.cpp'.

When using a C++ compiler, 'dim3 grid1;' yields a dim3 grid with all dimensional members x, y, z initialized to 1, as the default constructor behaves that way. Further, if written:

dim3 grid(2); // yields {2,1,1}

dim3 grid(2,3); // yields {2,3,1}

In comparison, when using the C compiler,

$ gcc -x c $(hipconfig --cpp_config) test.hip.cpp -o test

$ ./test

dim3 grid1; x=646881376, y=21975, z=1517277280

dim3 grid2 = {1,1,1}; x=1, y=1, z=1

Here, 'dim3 grid;' does not imply any initialization: no constructor is called, and the dimensional values x, y, z of grid are undefined.

NOTE: To get the C++ default behavior, C programmers must additionally specify the right-hand side as shown below,

Yes. You can use HIP_PLATFORM to choose which path hipcc targets. This configuration can be useful when using HIP to develop an application which is portable to both AMD and NVIDIA.

HIP will set the platform to AMD and use HIP-Clang as the compiler if it sees that the AMD graphics driver is installed and has detected an AMD GPU. Sometimes this isn't what you want. You can force HIP to recognize the platform by setting the following,

One symptom of this problem is the message 'error: 'unknown error'(11) at square.hipref.cpp:56'. This can occur if you have a CUDA installation on an AMD platform, and HIP incorrectly detects the platform as NVCC.
HIP may be able to compile the application using the NVCC tool-chain but will generate this error at runtime since the platform does not have a CUDA device.

Yes. Most HIP data structures (hipStream_t, hipEvent_t) are typedefs to CUDA equivalents and can be intermixed. Both CUDA and HIP use integer device ids. One notable exception is that hipError_t is a new type, and cannot be used where a cudaError_t is expected. In these cases, refactor the code to remove the expectation. Alternatively, hip_runtime_api.h defines functions which convert between the error code spaces:

- hipErrorToCudaError
- hipCUDAErrorTohipError
- hipCUResultTohipError

If platform portability is important, use #ifdef __HIP_PLATFORM_NVIDIA__ to guard the CUDA-specific code.

See Logging HIP activity for more information.

The product of block.x, block.y, and block.z should be less than 1024. Note that HIP does not support kernel launches where the total number of work items in any dimension, gridDim x blockDim, is 2^32 or greater, so gridDim.x * blockDim.x, gridDim.y * blockDim.y and gridDim.z * blockDim.z are always less than 2^32.

__shfl_*_sync is not supported on HIP, but on the NVCC path with CUDA 9.0 and above all shuffle calls are redirected to their sync versions.

The compiler defines the __HIP_DEVICE_COMPILE__ macro only when compiling the code for the GPU. It can be used to guard code that is specific to the host or the GPU.

When compiling an OpenMP source file with hipcc -fopenmp, the compiler may generate an error if there is a reference to the _OPENMP macro. This is due to a limitation in hipcc that treats any source file type (for example .cpp) as a HIP translation unit, leading to some conflicts with the OpenMP language switch. If the OpenMP source file doesn't contain any HIP language constructs, you can work around this issue by adding the -x c++ switch to force the compiler to treat the file as regular C++.
Another approach would be to guard the OpenMP code with #ifdef _OPENMP so that the code block is disabled when compiling for the GPU. The __HIP_DEVICE_COMPILE__ macro defined by the HIP compiler when compiling GPU code could also be used for guarding code paths specific to the host or the GPU.

Previously, it was essential to declare dynamic shared memory using the HIP_DYNAMIC_SHARED macro for accuracy, as using static shared memory in the same kernel could result in overlapping memory ranges and data races. Now, the HIP-Clang compiler provides support for extern shared declarations, and the HIP_DYNAMIC_SHARED option is no longer required. You may use the standard extern definition: extern __shared__ type var[];

This error message appears when you do not have a valid code object for all of your devices.

If you have compiled the application yourself, make sure you have given the correct device name(s) and features via --offload-arch. If you are not specifying --offload-arch, make sure that hipcc is using the correct offload arch by verifying the hipcc output generated by setting the environment variable HIPCC_VERBOSE=1.

If you have a precompiled application or library (like rocblas, TensorFlow etc.) which gives you such an error, there are two possibilities.

Note: In previous releases, the error code was hipErrorNoBinaryForGpu with the message 'Unable to find code object for all current devices'. The error code handling behavior has changed: the HIP runtime now shows the error code hipErrorSharedObjectInitFailed with the message 'Error: shared object initialization failed' on an unsupported GPU.

The per-thread default stream is an implicit stream local to both the thread and the current device. It does not do any implicit synchronization with other streams (like explicitly created streams), or with the default per-thread stream on other threads.

The per-thread default stream is a blocking stream and will synchronize with the default null stream if both are used in a program.
In ROCm, a compilation option must be added in order to compile the translation unit with the per-thread default stream enabled: -fgpu-default-stream=per-thread. Once the source is compiled with the per-thread default stream enabled, all APIs are executed on the per-thread default stream, so there is no implicit synchronization with other streams.

In addition, the per-thread default stream can be enabled per translation unit, so users can compile some files with the feature enabled and some with it disabled. A translation unit with the feature enabled uses the per-thread default stream and performs no implicit synchronization, while other modules use the legacy default stream, which does perform implicit synchronization.

In HIP, hipFloatComplex and hipDoubleComplex are defined as complex data types,

Any application that uses complex multiplication and division operations needs to replace the '*' and '/' operators with the following,

Note: These complex operations are equivalent to the corresponding types and functions on the NVIDIA platform.

Yes, HIP APIs are available to use on both Linux and Windows. Because operating systems such as Windows and Linux work differently, HIP APIs call the corresponding lower-level backend runtime libraries and kernel drivers for the OS in order to control execution on the GPU hardware. There might be a few differences in the related backend software and driver support, which might affect usage of HIP APIs. See the OS support details in the HIP API document.

Starting with ROCm 6.0, the HIP runtime supports the Locally Unique Identifier (LUID). This feature enables the local physical device(s) to interoperate with other devices, for example, DirectX 12.

The HIP runtime sets device LUID properties so the driver can query the LUID to identify each device for interoperability.

Note: HIP supports LUID only on Windows OS.
The HIP version definition has been updated since the ROCm 4.2 release as follows:

The HIP version can be queried with the HIP API call hipRuntimeGetVersion(&runtimeVersion);

The version returned will always be greater than the versions in previous ROCm releases.

Note: The version definition of the HIP runtime is different from CUDA. On the AMD platform, the function returns the HIP runtime version, while on the NVIDIA platform, it returns the CUDA runtime version. There is no mapping or correlation between the HIP version and the CUDA version.

HIP provides a C++ syntax that is suitable for compiling most code that commonly appears in compute kernels (classes, namespaces, operator overloading, and templates). HIP also defines other language features that are designed to target accelerators, such as:

Note: This chapter describes the built-in variables and functions that are accessible from the HIP kernel. It's intended for users who are familiar with CUDA kernel syntax and want to learn how HIP differs from CUDA.

Features are labeled with one of the following keywords:

Supported __device__ functions are:

You can combine __device__ with the host keyword (__host__).

Supported __global__ functions are:

HIP __global__ functions must have a void return type.

HIP doesn't support dynamic parallelism, which means that you can't call __global__ functions from the device.

Supported __host__ functions are:

You can combine __host__ with __device__; in this case, the function compiles for the host and the device. Note that these functions can't use the HIP grid coordinate functions (e.g., threadIdx.x).
If you need to use HIP grid coordinate functions, you can pass the necessary coordinate information as an argument.

You can't combine __host__ with __global__.

HIP parses the __noinline__ and __forceinline__ keywords and converts them into the appropriate Clang attributes.

__global__ functions are often referred to as kernels. When you call a global function, you're launching a kernel. When launching a kernel, you must specify an execution configuration that includes the grid and block dimensions. The execution configuration can also include other information for the launch, such as the amount of additional shared memory to allocate and the stream where you want to execute the kernel.

HIP introduces a standard C++ calling convention (hipLaunchKernelGGL) to pass the run configuration to the kernel. However, you can also use the CUDA <<< >>> syntax.

When using hipLaunchKernelGGL, your first five parameters must be:

You can include your kernel arguments after these parameters.

You can use HIPIFY tools to convert CUDA launch syntax to hipLaunchKernelGGL. This includes the conversion of optional <<< >>> arguments into the five required hipLaunchKernelGGL parameters.

Note: HIP doesn't support dimension sizes of gridDim * blockDim ≥ 2^32 when launching a kernel.

The host writes constant memory before launching the kernel. This memory is read-only from the GPU while the kernel is running. The functions for accessing constant memory are:

To allow the host to dynamically allocate shared memory, you can specify extern __shared__ as a launch parameter.

Note: Prior to the HIP-Clang compiler, dynamic shared memory had to be declared using the HIP_DYNAMIC_SHARED macro in order to ensure accuracy. This is because using static shared memory in the same kernel could have resulted in overlapping memory ranges and data races. The HIP-Clang compiler provides support for extern __shared__ declarations, so HIP_DYNAMIC_SHARED is no longer required.
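The launch limit noted above (gridDim * blockDim must stay below 2^32 in each dimension) can be checked on the host before a launch. The following is a minimal host-only sketch: Dim3Sketch is a hypothetical stand-in for dim3 so that the code builds without the HIP headers, and blocks_for, launch_dims_ok and kLaunchLimit are illustrative names, not HIP APIs.

```cpp
#include <cstdint>

// Hypothetical stand-in for HIP's dim3 so the sketch builds without HIP
// headers. Like dim3, unspecified dimensions default to 1.
struct Dim3Sketch {
    std::uint32_t x, y, z;
    constexpr Dim3Sketch(std::uint32_t x_ = 1, std::uint32_t y_ = 1,
                         std::uint32_t z_ = 1)
        : x(x_), y(y_), z(z_) {}
};

// The documented bound: gridDim * blockDim must be less than 2^32.
constexpr std::uint64_t kLaunchLimit = static_cast<std::uint64_t>(1) << 32;

// Blocks needed to cover `work_items` with `block_size` threads per block.
// Assumes block_size is non-zero and that the result fits in 32 bits.
constexpr std::uint32_t blocks_for(std::uint64_t work_items,
                                   std::uint32_t block_size) {
    return static_cast<std::uint32_t>((work_items + block_size - 1) / block_size);
}

// True when the product stays below the limit in every dimension.
constexpr bool launch_dims_ok(const Dim3Sketch& grid, const Dim3Sketch& block) {
    return static_cast<std::uint64_t>(grid.x) * block.x < kLaunchLimit &&
           static_cast<std::uint64_t>(grid.y) * block.y < kLaunchLimit &&
           static_cast<std::uint64_t>(grid.z) * block.z < kLaunchLimit;
}
```

In a real program the same arithmetic would be applied to the dim3 values handed to hipLaunchKernelGGL or the <<< >>> syntax. Doing the multiplication in 64 bits matters, because a 32-bit product would wrap around exactly in the cases the check is meant to catch.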
Managed memory, including the __managed__ keyword, is supported in HIP combined host/device compilation.

__restrict__ tells the compiler that the associated memory pointer does not alias with any other pointer in the kernel or function. This can help the compiler generate better code. In most use cases, every pointer argument should use this keyword in order to achieve the benefit.

The kernel uses coordinate built-ins (thread*, block*, grid*) to determine the coordinate index and bounds for the active work item.

Built-ins are defined in amd_hip_runtime.h, rather than being implicitly defined by the compiler.

Coordinate variable definitions for built-ins are the same for HIP and CUDA. For example: threadIdx.x, blockIdx.y, and gridDim.y. The products gridDim.x * blockDim.x, gridDim.y * blockDim.y, and gridDim.z * blockDim.z are always less than 2^32.

Coordinate built-ins are implemented as structures for improved performance. When used with printf, they must be explicitly cast to integer types.

The warpSize variable type is int. It contains the warp size (in threads) for the target device. warpSize should only be used in device functions that develop portable wave-aware code.

Note: NVIDIA devices return 32 for this variable; AMD devices return 64 for gfx9 and 32 for gfx10 and above.

The following vector types are defined in hip_runtime.h. They are not automatically provided by the compiler.

Short vector types derive from basic integer and floating-point types. These structures are defined in hip_vector_types.h. The first, second, third, and fourth components of the vector are defined by the x, y, z, and w fields, respectively. All short vector types support a constructor function of the form make_<type_name>(). For example, float4 make_float4(float x, float y, float z, float w) creates a vector with type float4 and value (x,y,z,w).
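To make the shape of a short vector type and its make_<type_name>() constructor function concrete, here is a plain C++ stand-in. It is only a sketch of the pattern described above: Float4Sketch and make_float4_sketch are hypothetical names chosen to avoid clashing with the real float4 and make_float4 from the HIP headers, and the real types may differ in details such as alignment.

```cpp
// Hypothetical stand-in for float4: four components named x, y, z, w.
struct Float4Sketch {
    float x, y, z, w;
};

// Mirrors the make_<type_name>() pattern: builds the vector from its
// components in x, y, z, w order.
constexpr Float4Sketch make_float4_sketch(float x, float y, float z, float w) {
    return Float4Sketch{x, y, z, w};
}
```

With the HIP headers available, the equivalent call would be make_float4(x, y, z, w), with components read back through the same x, y, z, and w fields.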
HIP supports the following short vector formats:

dim3 is a three-dimensional integer vector type that is commonly used to specify grid and group dimensions. The dim3 constructor accepts between zero and three arguments. By default, it initializes unspecified dimensions to 1.

HIP supports __threadfence() and __threadfence_block(). If you're using __threadfence_system() in the HIP-Clang path, you can use the following workaround:

Synchronization functions cause all threads in the group to wait at the synchronization point until every thread has reached it and all shared and global memory accesses by those threads have completed. This guarantees the visibility of accessed data for all threads in the group.

The __syncthreads() built-in function is supported in HIP. The __syncthreads_count(int), __syncthreads_and(int), and __syncthreads_or(int) functions are under development.

The Cooperative Groups API offers options to synchronize a developer-defined set of thread groups. For further information, check Cooperative Groups API or Cooperative Groups how to.

HIP-Clang supports a set of math operations that are callable from the device. HIP supports most of the device functions supported by CUDA. These are described on the Math API page.

The supported texture functions are listed in the texture_fetch_functions.h and texture_indirect_functions.h header files in the HIP-AMD backend repository.

Texture functions are not supported on some devices. To determine whether texture functions are supported on your device, check whether the macro __HIP_NO_IMAGE_SUPPORT is equal to 1. In host runtime code, you can query the attribute hipDeviceAttributeImageSupport to check if texture functions are supported.

The following surface functions are supported in HIP:

hipError_t hipCreateSurfaceObject ( hipSurfaceObject_t *pSurfObject, const hipResourceDesc *pResDesc ) Create a surface object.
Returns: hipSuccess, hipErrorInvalidValue

hipError_t hipDestroySurfaceObject ( hipSurfaceObject_t surfaceObject )
Destroy a surface object. Parameters: surfaceObject [in]: Surface object to be destroyed. Returns: hipSuccess, hipErrorInvalidValue

template<typename T , typename std::enable_if<__hip_is_tex_surf_channel_type< T >::value>::type* = nullptr> static void surf1Dread ( T *data, hipSurfaceObject_t surfObj, int x, int boundaryMode = hipBoundaryModeZero )
Reads the value at coordinate x from the one-dimensional surface. T: The data type of the surface.

template<typename T , typename std::enable_if<__hip_is_tex_surf_channel_type< T >::value>::type* = nullptr> static void surf1Dwrite ( T data, hipSurfaceObject_t surfObj, int x )
Writes the value data to the one-dimensional surface at coordinate x. T: The data type of the surface.

template<typename T , typename std::enable_if<__hip_is_tex_surf_channel_type< T >::value>::type* = nullptr> static void surf2Dread ( T *data, hipSurfaceObject_t surfObj, int x, int y )
Reads the value from the two-dimensional surface at coordinate x, y. T: The data type of the surface.

template<typename T , typename std::enable_if<__hip_is_tex_surf_channel_type< T >::value>::type* = nullptr> static void surf2Dwrite ( T data, hipSurfaceObject_t surfObj, int x, int y )
Writes the value data to the two-dimensional surface at coordinate x, y. T: The data type of the surface.

template<typename T , typename std::enable_if<__hip_is_tex_surf_channel_type< T >::value>::type* = nullptr> static void surf3Dread ( T *data, hipSurfaceObject_t surfObj, int x, int y, int z )
Reads the value from the three-dimensional surface at coordinate x, y, z. T: The data type of the surface.

template<typename T , typename std::enable_if<__hip_is_tex_surf_channel_type< T >::value>::type* = nullptr> static void surf3Dwrite ( T data, hipSurfaceObject_t surfObj, int x, int y, int z )
Writes the value data to the three-dimensional surface at coordinate x, y, z. T: The data type of the surface.
template<typename T , typename std::enable_if<__hip_is_tex_surf_channel_type< T >::value>::type* = nullptr> static void surf1DLayeredread ( T *data, hipSurfaceObject_t surfObj, int x, int layer )
Reads the value from the one-dimensional layered surface at coordinate x and layer index. T: The data type of the surface.

template<typename T , typename std::enable_if<__hip_is_tex_surf_channel_type< T >::value>::type* = nullptr> static void surf1DLayeredwrite ( T data, hipSurfaceObject_t surfObj, int x, int layer )
Writes the value data to the one-dimensional layered surface at coordinate x and layer index. T: The data type of the surface.

template<typename T , typename std::enable_if<__hip_is_tex_surf_channel_type< T >::value>::type* = nullptr> static void surf2DLayeredread ( T *data, hipSurfaceObject_t surfObj, int x, int y, int layer )
Reads the value from the two-dimensional layered surface at coordinate x, y and layer index. T: The data type of the surface.

template<typename T , typename std::enable_if<__hip_is_tex_surf_channel_type< T >::value>::type* = nullptr> static void surf2DLayeredwrite ( T data, hipSurfaceObject_t surfObj, int x, int y, int layer )
Writes the value data to the two-dimensional layered surface at coordinate x, y and layer index. T: The data type of the surface.

template<typename T , typename std::enable_if<__hip_is_tex_surf_channel_type< T >::value>::type* = nullptr> static void surfCubemapread ( T *data, hipSurfaceObject_t surfObj, int x, int y, int face )
Reads the value from the cubemap surface at coordinate x, y and face index. T: The data type of the surface.

template<typename T , typename std::enable_if<__hip_is_tex_surf_channel_type< T >::value>::type* = nullptr> static void surfCubemapwrite ( T data, hipSurfaceObject_t surfObj, int x, int y, int face )
Writes the value data to the cubemap surface at coordinate x, y and face index. T: The data type of the surface.
template<typename T , typename std::enable_if<__hip_is_tex_surf_channel_type< T >::value>::type* = nullptr> static void surfCubemapLayeredread ( T *data, hipSurfaceObject_t surfObj, int x, int y, int face, int layer )
Reads the value from the layered cubemap surface at coordinate x, y and face, layer index. T: The data type of the surface.

template<typename T , typename std::enable_if<__hip_is_tex_surf_channel_type< T >::value>::type* = nullptr> static void surfCubemapLayeredwrite ( T *data, hipSurfaceObject_t surfObj, int x, int y, int face, int layer )
Writes the value data to the layered cubemap surface at coordinate x, y and face, layer index. T: The data type of the surface.

To read a high-resolution timer from the device, HIP provides the following built-in functions:

The difference between the values that are returned represents the cycles used.

This can be queried using the HIP API with the hipDeviceAttributeWallClockRate attribute of the device in HIP application code. For example:

Where hipDeviceAttributeWallClockRate is a device attribute. Note that wall clock frequency is a per-device attribute.

Note that clock() and clock64() do not work properly on AMD RDNA3 (GFX11) graphics processors.

Atomic functions are run as read-modify-write (RMW) operations that reside in global or shared memory. No other device or thread can observe or modify the memory location during an atomic operation. If multiple instructions from different devices or threads target the same memory location, the instructions are serialized in an undefined order.

To support system scope atomic operations, you can use the HIP APIs that contain the _system suffix. For example:

HIP supports the following atomic operations.

Table 1: Atomic operations

Some HIP devices support fast atomic RMW operations on floating-point values.
For example, atomicAdd on single- or double-precision floating-point values may generate a hardware RMW instruction that is faster than emulating the atomic operation using an atomic compare-and-swap (CAS) loop.

On some devices, fast atomic RMW instructions can produce results that differ from the same functions implemented with atomic CAS loops. For example, some devices use different rounding or denormal modes, and some devices produce incorrect answers if fast floating-point atomic RMW instructions target fine-grained memory allocations.

The HIP-Clang compiler offers a compile-time option that lets you choose fast (but potentially unsafe) atomic instructions for your code. On devices that support these instructions, you can pass the -munsafe-fp-atomics option. This flag tells the compiler that all floating-point atomic function calls are allowed to use an unsafe version, if one exists. For example, on some devices, this flag tells the compiler that no floating-point atomicAdd call targets fine-grained memory.

If you want to avoid unsafe floating-point atomic RMW operations, you can use the -mno-unsafe-fp-atomics option. The compiler does not produce unsafe floating-point atomic RMW instructions by default, so the -mno-unsafe-fp-atomics option is not strictly required. However, passing this option to the compiler is good practice.

When you pass -munsafe-fp-atomics or -mno-unsafe-fp-atomics on the compiler's command line, the option is applied globally for the entire compilation. If some of the atomic RMW function calls cannot safely use the faster floating-point atomic RMW instructions, you must use -mno-unsafe-fp-atomics to ensure that your atomic RMW function calls produce correct results.
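The CAS-loop emulation mentioned above can be sketched as follows. This is a minimal host-side model in standard C++ (std::atomic), not HIP device code; the helper name atomic_add_cas is hypothetical and only illustrates the retry pattern that a hardware floating-point RMW instruction avoids.

```cpp
#include <atomic>
#include <cassert>

// Host-side model (standard C++, not HIP device code) of emulating a
// floating-point atomic add with a compare-and-swap (CAS) loop.
// atomic_add_cas is a hypothetical helper name used for illustration.
// It returns the value stored before the addition, as atomicAdd does.
inline float atomic_add_cas(std::atomic<float>& target, float operand) {
    float observed = target.load(std::memory_order_relaxed);
    // compare_exchange_weak stores observed + operand only if target still
    // holds `observed`. On failure it refreshes `observed` with the current
    // value, so the loop retries with up-to-date data.
    while (!target.compare_exchange_weak(observed, observed + operand,
                                         std::memory_order_relaxed)) {
    }
    return observed;
}
```

Each failed exchange costs another round trip to memory, which is why a single hardware RMW instruction can be faster under contention.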
HIP has four extra functions that you can use to more precisely control which floating-point atomic RMW functions produce unsafe atomic RMW instructions.

Threads in a warp are referred to as lanes and are numbered from 0 to warpSize - 1. Warp cross-lane functions operate across all lanes in a warp. The hardware guarantees that all warp lanes will execute in lockstep, so additional synchronization is unnecessary, and the instructions use no shared memory.

Note that NVIDIA and AMD devices have different warp sizes. You can use warpSize built-ins in your portable code to query the warp size.

Tip: Be sure to review HIP code generated from the CUDA path to ensure that it doesn't assume a waveSize of 32. 'Wave-aware' code that assumes a waveSize of 32 can run on a wave-64 machine, but it only utilizes half of the machine's resources.

To get the default warp size of a GPU device, use hipGetDeviceProperties in your host functions. Only use warpSize built-ins in device functions, and don't assume warpSize to be a compile-time constant.

Note that assembly kernels may be built for a warp size that is different from the default.

All mask values either returned or accepted by these builtins are 64-bit unsigned integer values, even when compiled for a wave-32 device, where all the higher bits are unused. CUDA code ported to HIP requires changes to ensure that the correct type is used.

Note that the _sync variants are made available in ROCm 6.2, but disabled by default to help with the transition to 64-bit masks. They can be enabled by setting the preprocessor macro HIP_ENABLE_WARP_SYNC_BUILTINS. These builtins will be enabled unconditionally in ROCm 6.3. Wherever possible, the implementation includes a static assert to check that the program source uses the correct type for the mask.

You can use __any and __all to get a summary view of the predicates evaluated by the participating lanes.
To determine if the target platform supports the any/all instruction, you can use the hasWarpVote device property or the HIP_ARCH_HAS_WARP_VOTE compiler definition. __ballot returns a bit mask containing the 1-bit predicate value from each lane. The nth bit of the result contains the 1 bit contributed by the nth warp lane. __activemask() returns a bit mask of currently active warp lanes. The nth bit of the result is 1 if the nth warp lane is active. Note that the __ballot and __activemask builtins in HIP have a 64-bit return value (unlike the 32-bit value returned by the CUDA builtins). Code ported from CUDA should be adapted to support the larger warp sizes that the HIP version requires. Applications can test whether the target platform supports the __ballot or __activemask instructions using the hasWarpBallot device property in host code or the HIP_ARCH_HAS_WARP_BALLOT macro defined by the compiler for device code. The _sync variants require a 64-bit unsigned integer mask argument that specifies the lanes in the warp that will participate in cross-lane communication with the calling lane. Each participating thread must have its own bit set in its mask argument, and all active threads specified in any mask argument must execute the same call with the same mask, otherwise the result is undefined. T can be a 32-bit integer type, 64-bit integer type or a single precision or double precision floating point type. __match_any returns a bit mask containing a 1-bit for every participating lane if and only if that lane has the same value in value as the current lane, and a 0-bit for all other lanes. __match_all returns a bit mask containing a 1-bit for every participating lane if and only if they all have the same value in value as the current lane, and a 0-bit for all other lanes. The predicate pred is set to true if and only if all participating threads have the same value in value . 
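The 64-bit mask semantics described above can be illustrated with a host-side model. The sketch below is plain C++ rather than device code: the names ballot_model, match_any_model, and kWaveSize are hypothetical, a wave size of 64 is assumed, and every lane is treated as participating. It shows why a lane index above 31 cannot be represented in a 32-bit mask.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kWaveSize = 64;  // assumption: a wave-64 device

// Model of __ballot: bit n of the result is the predicate of lane n.
inline std::uint64_t ballot_model(const std::array<bool, kWaveSize>& pred) {
    std::uint64_t mask = 0;
    for (int lane = 0; lane < kWaveSize; ++lane) {
        if (pred[lane]) mask |= std::uint64_t{1} << lane;
    }
    return mask;
}

// Model of __match_any as seen from lane `self`: a 1-bit for every lane
// that holds the same value as `self`, and a 0-bit for all other lanes.
inline std::uint64_t match_any_model(const std::array<int, kWaveSize>& value,
                                     int self) {
    std::uint64_t mask = 0;
    for (int lane = 0; lane < kWaveSize; ++lane) {
        if (value[lane] == value[self]) mask |= std::uint64_t{1} << lane;
    }
    return mask;
}
```

Storing the result of either function in a 32-bit variable silently drops lanes 32 to 63, which is the porting hazard the text warns about.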
The _sync variants require a 64-bit unsigned integer mask argument that specifies the lanes in the warp that will participate in cross-lane communication with the calling lane. Each participating thread must have its own bit set in its mask argument, and all active threads specified in any mask argument must execute the same call with the same mask, otherwise the result is undefined.

For the warp shuffle functions, the default width is warpSize (see Warp cross-lane functions). Half-float shuffles are not supported. T can be a 32-bit integer type, 64-bit integer type or a single precision or double precision floating point type. The _sync shuffle variants take the same 64-bit mask argument, with the same requirements as described above.

You can use cooperative groups to synchronize groups of threads. Cooperative groups also provide a way of communicating between groups of threads at a granularity that is different from the block. HIP supports a set of kernel language cooperative groups types and functions. For further information, check Cooperative Groups API or Cooperative Groups how to.

Warp matrix functions allow a warp to cooperatively operate on small matrices that have elements spread over lanes in an unspecified manner. HIP does not support kernel language warp matrix types or functions.

Certain architectures that support CUDA allow threads to progress independently of each other. This independent thread scheduling makes intra-warp synchronization possible. HIP does not support this type of scheduling.

The CUDA __prof_trigger() instruction is not supported.

The assert function is supported in HIP.
The assert function is used for debugging: when the input expression evaluates to zero, execution stops. There are two implementations of assert, depending on the usage scenario:
- The host version of assert, which is defined in assert.h.
- The device version of assert, which is implemented in hip/hip_runtime.h.

Users need to include assert.h to use assert. For assert to work in both device and host functions, users need to include "hip/hip_runtime.h".

HIP provides the function abort(), which can be used to terminate the application when terminal failures are detected. It is implemented using the __builtin_trap() function. This function produces an effect similar to using asm("trap") in CUDA code.

Note: In HIP, the function terminates the entire application, while in CUDA, asm("trap") only terminates the dispatch and the application continues to run.

The printf function is supported in HIP and can be used to print information from within a kernel.

Device-side dynamic global memory allocation is under development. HIP now includes a preliminary implementation of malloc and free that can be called from device functions.

GPU multiprocessors have a fixed pool of resources (primarily registers and shared memory) which are shared by the actively running warps. Using more resources can increase the IPC of the kernel but reduces the resources available for other warps and limits the number of warps that can run simultaneously. Thus GPUs have a complex relationship between resource usage and performance.

__launch_bounds__ allows the application to provide usage hints that influence the resources (primarily registers) used by the generated code. It is a function attribute that must be attached to a __global__ function. __launch_bounds__ supports two parameters:
- MAX_THREADS_PER_BLOCK - The programmer guarantees that the kernel will be launched with no more than MAX_THREADS_PER_BLOCK threads per block.
(On NVCC this maps to the .maxntid PTX directive.) If no launch_bounds is specified, MAX_THREADS_PER_BLOCK is the maximum block size supported by the device (typically 1024 or larger). Specifying MAX_THREADS_PER_BLOCK less than the maximum effectively allows the compiler to use more resources than a default unconstrained compilation that supports all possible block sizes at launch time. The threads-per-block is the product of ( blockDim.x * blockDim.y * blockDim.z ).
- MIN_WARPS_PER_EXECUTION_UNIT - Directs the compiler to minimize resource usage so that the requested number of warps can be simultaneously active on a multiprocessor. Since active warps compete for the same fixed pool of resources, the compiler must reduce the resources required by each warp (primarily registers). MIN_WARPS_PER_EXECUTION_UNIT is optional and defaults to 1 if not specified. Specifying a MIN_WARPS_PER_EXECUTION_UNIT greater than the default 1 effectively constrains the compiler's resource usage.

When a kernel is launched with HIP APIs, for example hipModuleLaunchKernel(), HIP validates that the kernel dimension size is not larger than the specified launch_bounds. If it is exceeded, HIP returns a launch failure. If AMD_LOG_LEVEL is set to a proper value (for details, please refer to docs/markdown/hip_logging.md), the error log message shows detailed information, including the launch parameters (kernel dimension size), the launch bounds, and the name of the faulting kernel. This helps identify the faulting kernel, and the kernel dimension size and launch bounds values assist in debugging such failures.

The compiler uses these parameters as follows:
- The compiler uses the hints only to manage register usage, and does not automatically reduce shared memory or other resources.
- Compilation fails if the compiler cannot generate a kernel that meets the requirements of the specified launch bounds.
- From MAX_THREADS_PER_BLOCK, the compiler derives the maximum number of warps/block that can be used at launch time. Values of MAX_THREADS_PER_BLOCK less than the default allow the compiler to use a larger pool of registers: each warp uses registers, and this hint constrains the launch to a warps/block size which is less than the maximum.
- From MIN_WARPS_PER_EXECUTION_UNIT, the compiler derives a maximum number of registers that can be used by the kernel (to meet the required number of simultaneously active blocks). If MIN_WARPS_PER_EXECUTION_UNIT is 1, then the kernel can use all registers supported by the multiprocessor.
- The compiler ensures that the number of registers used in the kernel is less than both allowed maximums, typically by spilling registers (to shared or global memory), or by using more instructions.
- The compiler may use heuristics to increase register usage, or may simply be able to avoid spilling. MAX_THREADS_PER_BLOCK is particularly useful in these cases, since it allows the compiler to use more registers and avoid situations where the compiler constrains the register usage (potentially spilling) to meet the requirements of a large block size that is never used at launch time.

A compute unit (CU) is responsible for executing the waves of a work-group. It is composed of one or more execution units (EU) which are responsible for executing waves. An EU can have enough resources to maintain the state of more than one executing wave. This allows an EU to hide latency by switching between waves in a similar way to simultaneous multithreading on a CPU. In order to allow the state for multiple waves to fit on an EU, the resources used by a single wave have to be limited. Limiting such resources can allow greater latency hiding, but can result in having to spill some register state to memory. This attribute allows an advanced developer to tune the number of waves that are capable of fitting within the resources of an EU.
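The arithmetic behind these hints can be sketched with two small helpers. This is a simplified model written in plain C++, not the compiler's actual algorithm: the function names are hypothetical, the register counts used with them are illustrative rather than taken from any specific device, and real compilers also account for allocation granularity and other resources.

```cpp
#include <cassert>

// Largest number of warps a block can contain for a given
// MAX_THREADS_PER_BLOCK hint (rounded up to whole warps).
constexpr int max_warps_per_block(int max_threads_per_block, int warp_size) {
    return (max_threads_per_block + warp_size - 1) / warp_size;
}

// Upper bound on registers per warp so that at least `min_warps_per_eu`
// warps fit in one execution unit. `registers_per_eu` is a hypothetical
// per-EU register pool size supplied by the caller.
constexpr int register_budget_per_warp(int registers_per_eu,
                                       int min_warps_per_eu) {
    return registers_per_eu / min_warps_per_eu;
}
```

Under these assumptions, a pool of 512 registers gives one warp the full 512, while requesting four simultaneously active warps cuts each warp's budget to 128, which is the point at which spilling may begin.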
The attribute can be used to ensure that at least a certain number of waves fit, which helps hide latency, and to ensure that no more than a certain number fit, which limits cache thrashing.

CUDA defines a __launch_bounds__ attribute that is also designed to control occupancy. The key differences in the interface are:
- Warps (rather than blocks): The developer is trying to tell the compiler to control resource utilization to guarantee some amount of active Warps/EU for latency hiding. Specifying active warps in terms of blocks appears to hide the micro-architectural details of the warp size, but makes the interface more confusing since the developer ultimately needs to compute the number of warps to obtain the desired level of control.
- Execution Units (rather than multiprocessor): The use of execution units rather than multiprocessors provides support for architectures with multiple execution units/multi-processor. For example, the AMD GCN architecture has 4 execution units per multiprocessor. The hipDeviceProps has a field executionUnitsPerMultiprocessor. Platform-specific coding techniques such as #ifdef can be used to specify different launch_bounds for NVCC and HIP-Clang platforms, if desired.

Unlike NVCC, HIP-Clang does not support the --maxregcount option. Instead, users are encouraged to use the hip_launch_bounds directive since the parameters are more intuitive and portable than micro-architecture details like registers, and also the directive allows per-kernel control rather than an entire file. hip_launch_bounds works on both HIP-Clang and NVCC targets.

typedef void (* hipStreamCallback_t )(hipStream_t stream, hipError_t status, void *userData) Stream CallBack struct

hipError_t hipStreamCreate ( hipStream_t *stream ) Create an asynchronous stream. stream returns an opaque handle that can be used to reference the newly created stream in subsequent hipStream* commands.
The stream is allocated on the heap and will remain allocated even if the handle goes out-of-scope. To release the memory used by the stream, application must call hipStreamDestroy.

See also: hipStreamCreateWithFlags , hipStreamCreateWithPriority , hipStreamSynchronize , hipStreamWaitEvent , hipStreamDestroy

Parameters: stream -[inout] Valid pointer to hipStream_t. This function writes the memory with the newly created stream.

Returns: hipSuccess, hipErrorInvalidValue

hipError_t hipStreamCreateWithFlags ( hipStream_t *stream, unsigned int flags ) Create an asynchronous stream. stream returns an opaque handle that can be used to reference the newly created stream in subsequent hipStream* commands. The stream is allocated on the heap and will remain allocated even if the handle goes out-of-scope. To release the memory used by the stream, application must call hipStreamDestroy. Flags controls behavior of the stream. See hipStreamDefault, hipStreamNonBlocking.

See also: hipStreamCreate , hipStreamCreateWithPriority , hipStreamSynchronize , hipStreamWaitEvent , hipStreamDestroy

Returns: hipSuccess, hipErrorInvalidValue

hipError_t hipStreamCreateWithPriority ( hipStream_t *stream, unsigned int flags, int priority ) Create an asynchronous stream with the specified priority. stream returns an opaque handle that can be used to reference the newly created stream in subsequent hipStream* commands. The stream is allocated on the heap and will remain allocated even if the handle goes out-of-scope. To release the memory used by the stream, application must call hipStreamDestroy. Flags controls behavior of the stream. See hipStreamDefault, hipStreamNonBlocking.
hipStreamCreate , hipStreamSynchronize , hipStreamWaitEvent , hipStreamDestroy hipSuccess, hipErrorInvalidValue hipError_t hipDeviceGetStreamPriorityRange ( int *leastPriority, int *greatestPriority ) Returns numerical values that correspond to the least and greatest stream priority. Returns in *leastPriority and *greatestPriority the numerical values that correspond to the least and greatest stream priority respectively. Stream priorities follow a convention where lower numbers imply greater priorities. The range of meaningful stream priorities is given by [*greatestPriority, *leastPriority]. If the user attempts to create a stream with a priority value that is outside the meaningful range as specified by this API, the priority is automatically clamped to within the valid range. hipSuccess hipError_t hipStreamDestroy ( hipStream_t stream ) Destroys the specified stream. Destroys the specified stream. If commands are still executing on the specified stream, some may complete execution before the queue is deleted. The queue may be destroyed while some commands are still inflight, or may wait for all commands queued to the stream before destroying it. hipStreamCreate , hipStreamCreateWithFlags , hipStreamCreateWithPriority , hipStreamQuery , hipStreamWaitEvent , hipStreamSynchronize stream -[in] stream identifier. hipSuccess hipErrorInvalidHandle Return hipSuccess if all of the operations in the specified stream have completed, or hipErrorNotReady if not. This is thread-safe and returns a snapshot of the current state of the queue. However, if other host threads are sending work to the stream, the status may change immediately after the function is called. It is typically used for debug. hipStreamCreate , hipStreamCreateWithFlags , hipStreamCreateWithPriority , hipStreamWaitEvent , hipStreamSynchronize , hipStreamDestroy stream -[in] stream to query hipSuccess, hipErrorNotReady, hipErrorInvalidHandle Wait for all commands in stream to complete. 
This command is host-synchronous: the host will block until the specified stream is empty. This command follows standard null-stream semantics. Specifically, specifying the null stream will cause the command to wait for other streams on the same device to complete all pending operations. This command honors the hipDeviceLaunchBlocking flag, which controls whether the wait is active or blocking.

See also: hipStreamCreate , hipStreamCreateWithFlags , hipStreamCreateWithPriority , hipStreamWaitEvent , hipStreamDestroy

Parameters: stream -[in] stream identifier.

Returns: hipSuccess, hipErrorInvalidHandle

hipError_t hipStreamWaitEvent ( hipStream_t stream, hipEvent_t event, unsigned int flags ) Make the specified compute stream wait for an event. This function inserts a wait operation into the specified stream. All future work submitted to stream will wait until event reports completion before beginning execution. This function only waits for commands in the current stream to complete. Notably, this function does not implicitly wait for commands in the default stream to complete, even if the specified stream is created with hipStreamNonBlocking = 0.

See also: hipStreamCreate , hipStreamCreateWithFlags , hipStreamCreateWithPriority , hipStreamSynchronize , hipStreamDestroy

Returns: hipSuccess, hipErrorInvalidHandle

hipError_t hipStreamGetFlags ( hipStream_t stream, unsigned int *flags ) Return flags associated with this stream. The flags are returned in *flags.

See also: hipStreamCreateWithFlags

Returns: hipSuccess, hipErrorInvalidValue, hipErrorInvalidHandle

hipError_t hipStreamGetPriority ( hipStream_t stream, int *priority ) Query the priority of a stream. The priority is returned in priority.
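The priority convention (lower numbers mean greater priority, with out-of-range requests clamped into [greatestPriority, leastPriority]) can be modeled in a few lines. The sketch below is plain host-side C++ and does not call the HIP runtime; clamp_stream_priority is a hypothetical helper, and the range values used with it are illustrative, since the real range comes from hipDeviceGetStreamPriorityRange.

```cpp
#include <algorithm>
#include <cassert>

// Model of the documented clamping rule. Because lower numbers imply
// greater priority, greatest_priority <= least_priority numerically.
// clamp_stream_priority is a hypothetical name used for illustration.
inline int clamp_stream_priority(int requested, int least_priority,
                                 int greatest_priority) {
    assert(greatest_priority <= least_priority);
    return std::min(std::max(requested, greatest_priority), least_priority);
}
```

With an assumed range of least 0 and greatest -1, a request of -5 is clamped to -1 and a request of 3 is clamped to 0, so a value later read back from a stream may differ from the one originally requested.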
See also: hipStreamCreateWithFlags

Returns: hipSuccess, hipErrorInvalidValue, hipErrorInvalidHandle

hipError_t hipStreamGetDevice ( hipStream_t stream, hipDevice_t *device ) Get the device associated with the stream.

See also: hipStreamCreate , hipStreamDestroy , hipDeviceGetStreamPriorityRange

Returns: hipSuccess, hipErrorInvalidValue, hipErrorContextIsDestroyed, hipErrorInvalidHandle, hipErrorNotInitialized, hipErrorDeinitialized, hipErrorInvalidContext

hipError_t hipExtStreamCreateWithCUMask ( hipStream_t *stream, uint32_t cuMaskSize, const uint32_t *cuMask ) Create an asynchronous stream with the specified CU mask. stream returns an opaque handle that can be used to reference the newly created stream in subsequent hipStream* commands. The stream is allocated on the heap and will remain allocated even if the handle goes out-of-scope. To release the memory used by the stream, application must call hipStreamDestroy.

See also: hipStreamCreate , hipStreamSynchronize , hipStreamWaitEvent , hipStreamDestroy

Returns: hipSuccess, hipErrorInvalidHandle, hipErrorInvalidValue

hipError_t hipExtStreamGetCUMask ( hipStream_t stream, uint32_t cuMaskSize, uint32_t *cuMask ) Get CU mask associated with an asynchronous stream.

See also: hipStreamCreate , hipStreamSynchronize , hipStreamWaitEvent , hipStreamDestroy

Returns: hipSuccess, hipErrorInvalidHandle, hipErrorInvalidValue

hipError_t hipStreamAddCallback ( hipStream_t stream, hipStreamCallback_t callback, void *userData, unsigned int flags ) Adds a callback to be called on the host after all currently enqueued items in the stream have completed. For each hipStreamAddCallback call, a callback will be executed exactly once. The callback will block later work in the stream until it is finished.
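The ordering guarantee for callbacks can be pictured with a toy in-order queue. This is a host-side model in standard C++, not how the HIP runtime is implemented; ToyStream and its member names are hypothetical and exist only to show that a callback runs once, after earlier work and before later work.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <string>
#include <utility>

// Toy in-order stream: work items run strictly in submission order.
// ToyStream is a hypothetical model, not a HIP type.
class ToyStream {
public:
    void enqueue(std::function<void()> work) { queue_.push(std::move(work)); }

    // A callback is modeled as one more in-order item, so it runs exactly
    // once, after everything enqueued before it, and it delays everything
    // enqueued after it until it returns.
    void add_callback(std::function<void()> callback) {
        enqueue(std::move(callback));
    }

    // Drain the queue, loosely analogous to synchronizing the stream.
    void synchronize() {
        while (!queue_.empty()) {
            queue_.front()();
            queue_.pop();
        }
    }

private:
    std::queue<std::function<void()>> queue_;
};
```

Enqueuing work A, then a callback, then work B and draining the queue runs them in exactly that order, which is why a slow callback stalls later work in the same stream.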
See also: hipStreamCreate , hipStreamCreateWithFlags , hipStreamQuery , hipStreamSynchronize , hipStreamWaitEvent , hipStreamDestroy , hipStreamCreateWithPriority

Returns: hipSuccess, hipErrorInvalidHandle, hipErrorNotSupported

This section describes C++ wrappers for the stream-ordered memory pool allocation functions of the HIP runtime API. Note: APIs in this section are implemented on Linux and are under development on Microsoft Windows.

static inline hipError_t hipMallocAsync ( void **dev_ptr, size_t size, hipMemPool_t mem_pool, hipStream_t stream ) C++ wrapper for allocations from a memory pool. This is an alternate C++ call for hipMallocFromPoolAsync made available through function overloading. See also: hipMallocFromPoolAsync

static inline hipError_t hipMallocAsync ( T **dev_ptr, size_t size, hipMemPool_t mem_pool, hipStream_t stream ) C++ wrapper for allocations from a memory pool on the stream. This is an alternate C++ call for hipMallocFromPoolAsync made available through function overloading. See also: hipMallocFromPoolAsync

static inline hipError_t hipMallocAsync ( T **dev_ptr, size_t size, hipStream_t stream ) C++ wrapper for allocations from a memory pool. This is an alternate C++ call for hipMallocFromPoolAsync made available through function overloading. See also: hipMallocFromPoolAsync

static inline hipError_t hipMallocFromPoolAsync ( T **dev_ptr, size_t size, hipMemPool_t mem_pool, hipStream_t stream ) C++ wrapper for allocations from a memory pool. This is an alternate C++ call for hipMallocFromPoolAsync made available through function overloading. See also: hipMallocFromPoolAsync

Note: Each of these wrappers is implemented on Linux and is under development on Microsoft Windows.
hipError_t hipMallocAsync ( void **dev_ptr, size_t size, hipStream_t stream ) Allocates memory with stream ordered semantics. Inserts a memory allocation operation into stream. A pointer to the allocated memory is returned immediately in *dev_ptr. The allocation must not be accessed until the allocation operation completes. The allocation comes from the memory pool associated with the stream's device.

See also: hipMallocFromPoolAsync , hipFreeAsync , hipMemPoolTrimTo , hipMemPoolGetAttribute , hipDeviceSetMemPool, hipMemPoolSetAttribute , hipMemPoolSetAccess , hipMemPoolGetAccess

Note: The default memory pool of a device contains device memory from that device.

Note: Basic stream ordering allows future work submitted into the same stream to use the allocation. Stream query, stream synchronize, and HIP events can be used to guarantee that the allocation operation completes before work submitted in a separate stream runs.

Note: During stream capture, this function results in the creation of an allocation node. In this case, the allocation is owned by the graph instead of the memory pool. The memory pool's properties are used to set the node's creation parameters.

Note: This API is implemented on Linux and is under development on Microsoft Windows.

Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.

Returns: hipSuccess, hipErrorInvalidValue, hipErrorNotSupported, hipErrorOutOfMemory

hipError_t hipFreeAsync ( void *dev_ptr, hipStream_t stream ) Frees memory with stream ordered semantics. Inserts a free operation into stream. The allocation must not be used after stream execution reaches the free. After this API returns, accessing the memory from any subsequent work launched on the GPU or querying its pointer attributes results in undefined behavior.
See also: hipMallocFromPoolAsync , hipMallocAsync , hipMemPoolTrimTo , hipMemPoolGetAttribute , hipDeviceSetMemPool, hipMemPoolSetAttribute , hipMemPoolSetAccess , hipMemPoolGetAccess

Note: During stream capture, this function results in the creation of a free node and must therefore be passed the address of a graph allocation.

Note: This API is implemented on Linux and is under development on Microsoft Windows.

Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.

Returns: hipSuccess, hipErrorInvalidValue, hipErrorNotSupported

hipError_t hipMemPoolTrimTo ( hipMemPool_t mem_pool, size_t min_bytes_to_hold ) Releases freed memory back to the OS. Releases memory back to the OS until the pool contains fewer than min_bytes_to_hold reserved bytes, or there is no more memory that the allocator can safely release. The allocator cannot release OS allocations that back outstanding asynchronous allocations. The OS allocations may happen at different granularity from the user allocations.

See also: hipMallocFromPoolAsync , hipMallocAsync , hipFreeAsync , hipMemPoolGetAttribute , hipDeviceSetMemPool, hipMemPoolSetAttribute , hipMemPoolSetAccess , hipMemPoolGetAccess

Note: Allocations that have not been freed count as outstanding.

Note: Allocations that have been asynchronously freed but whose completion has not been observed on the host (e.g. by a synchronize) can count as outstanding.

Note: This API is implemented on Linux and is under development on Microsoft Windows.

Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.

Returns: hipSuccess, hipErrorInvalidValue

hipError_t hipMemPoolSetAttribute ( hipMemPool_t mem_pool, hipMemPoolAttr attr, void *value ) Sets attributes of a memory pool.
Supported attributes are: hipMallocFromPoolAsync , hipMallocAsync , hipFreeAsync , hipMemPoolGetAttribute , hipMemPoolTrimTo , hipDeviceSetMemPool, hipMemPoolSetAccess , hipMemPoolGetAccess Note: This API is implemented on Linux and is under development on Microsoft Windows. Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues. hipSuccess, hipErrorInvalidValue hipError_t hipMemPoolGetAttribute ( hipMemPool_t mem_pool, hipMemPoolAttr attr, void *value ) Gets attributes of a memory pool. Supported attributes are: hipMallocFromPoolAsync , hipMallocAsync , hipFreeAsync , hipMemPoolTrimTo , hipDeviceSetMemPool, hipMemPoolSetAttribute , hipMemPoolSetAccess , hipMemPoolGetAccess Note: This API is implemented on Linux and is under development on Microsoft Windows. Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues. hipSuccess, hipErrorInvalidValue hipError_t hipMemPoolSetAccess ( hipMemPool_t mem_pool, const hipMemAccessDesc *desc_list, size_t count ) Controls visibility of the specified pool between devices. hipMallocFromPoolAsync , hipMallocAsync , hipFreeAsync , hipMemPoolGetAttribute , hipMemPoolTrimTo , hipDeviceSetMemPool, hipMemPoolSetAttribute , hipMemPoolGetAccess Note: This API is implemented on Linux and is under development on Microsoft Windows. Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues. hipSuccess, hipErrorInvalidValue hipError_t hipMemPoolGetAccess ( hipMemAccessFlags *flags, hipMemPool_t mem_pool, hipMemLocation *location ) Returns the accessibility of a pool from a device. Returns the accessibility of the pool's memory from the specified location. 
hipMallocFromPoolAsync , hipMallocAsync , hipFreeAsync , hipMemPoolGetAttribute , hipMemPoolTrimTo , hipDeviceSetMemPool, hipMemPoolSetAttribute , hipMemPoolSetAccess Note: This API is implemented on Linux and is under development on Microsoft Windows. Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues. hipSuccess, hipErrorInvalidValue hipError_t hipMemPoolCreate ( hipMemPool_t *mem_pool, const hipMemPoolProps *pool_props ) Creates a memory pool. Creates a HIP memory pool and returns the handle in mem_pool . The pool_props determines the properties of the pool such as the backing device and IPC capabilities. By default, the memory pool will be accessible from the device it is allocated on. hipMallocFromPoolAsync , hipMallocAsync , hipFreeAsync , hipMemPoolGetAttribute , hipMemPoolDestroy , hipMemPoolTrimTo , hipDeviceSetMemPool, hipMemPoolSetAttribute , hipMemPoolSetAccess , hipMemPoolGetAccess Note: Specifying hipMemHandleTypeNone creates a memory pool that will not support IPC. Note: This API is implemented on Linux and is under development on Microsoft Windows. Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues. hipSuccess, hipErrorInvalidValue, hipErrorNotSupported hipError_t hipMemPoolDestroy ( hipMemPool_t mem_pool ) Destroys the specified memory pool. If any pointers obtained from this pool haven't been freed or the pool has free operations that haven't completed when hipMemPoolDestroy is invoked, the function will return immediately and the resources associated with the pool will be released automatically once there are no more outstanding allocations. Destroying the current mempool of a device sets the default mempool of that device as the current mempool for that device. 
See also: hipMallocFromPoolAsync , hipMallocAsync , hipFreeAsync , hipMemPoolGetAttribute , hipMemPoolCreate , hipMemPoolTrimTo , hipDeviceSetMemPool, hipMemPoolSetAttribute , hipMemPoolSetAccess , hipMemPoolGetAccess

Note: A device's default memory pool cannot be destroyed.

Note: This API is implemented on Linux and is under development on Microsoft Windows.

Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.

Parameters: mem_pool -[in] Memory pool for destruction

Returns: hipSuccess, hipErrorInvalidValue

hipError_t hipMallocFromPoolAsync ( void **dev_ptr, size_t size, hipMemPool_t mem_pool, hipStream_t stream ) Allocates memory from a specified pool with stream ordered semantics. Inserts an allocation operation into stream. A pointer to the allocated memory is returned immediately in dev_ptr. The allocation must not be accessed until the allocation operation completes. The allocation comes from the specified memory pool. Basic stream ordering allows future work submitted into the same stream to use the allocation. Stream query, stream synchronize, and HIP events can be used to guarantee that the allocation operation completes before work submitted in a separate stream runs.

See also: hipMallocAsync , hipFreeAsync , hipMemPoolGetAttribute , hipMemPoolCreate , hipMemPoolTrimTo , hipDeviceSetMemPool, hipMemPoolSetAttribute , hipMemPoolSetAccess , hipMemPoolGetAccess

Note: The specified memory pool may be from a device different than that of the specified stream.

Note: During stream capture, this function results in the creation of an allocation node. In this case, the allocation is owned by the graph instead of the memory pool. The memory pool's properties are used to set the node's creation parameters.

Note: This API is implemented on Linux and is under development on Microsoft Windows.

Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
hipSuccess, hipErrorInvalidValue, hipErrorNotSupported, hipErrorOutOfMemory hipError_t hipMemPoolExportToShareableHandle ( void *shared_handle, hipMemPool_t mem_pool, hipMemAllocationHandleType handle_type, unsigned int flags ) Exports a memory pool to the requested handle type. Given an IPC capable mempool, create an OS handle to share the pool with another process. A recipient process can convert the shareable handle into a mempool with hipMemPoolImportFromShareableHandle . Individual pointers can then be shared with the hipMemPoolExportPointer and hipMemPoolImportPointer APIs. The implementation of what the shareable handle is and how it can be transferred is defined by the requested handle type. hipMemPoolImportFromShareableHandle Note: To create an IPC capable mempool, create a mempool with a hipMemAllocationHandleType other than hipMemHandleTypeNone . Note: This API is implemented on Linux and is under development on Microsoft Windows. Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues. hipSuccess, hipErrorInvalidValue, hipErrorOutOfMemory hipError_t hipMemPoolImportFromShareableHandle ( hipMemPool_t *mem_pool, void *shared_handle, hipMemAllocationHandleType handle_type, unsigned int flags ) Imports a memory pool from a shared handle. Specific allocations can be imported from the imported pool with hipMemPoolImportPointer . hipMemPoolExportToShareableHandle Note: Imported memory pools do not support creating new allocations. As such imported memory pools may not be used in hipDeviceSetMemPool or hipMallocFromPoolAsync calls. Note: This API is implemented on Linux and is under development on Microsoft Windows. Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues. 
hipSuccess, hipErrorInvalidValue, hipErrorOutOfMemory hipError_t hipMemPoolExportPointer ( hipMemPoolPtrExportData *export_data, void *dev_ptr ) Export data to share a memory pool allocation between processes. Constructs export_data for sharing a specific allocation from an already shared memory pool. The recipient process can import the allocation with the hipMemPoolImportPointer api. The data is not a handle and may be shared through any IPC mechanism. hipMemPoolImportPointer Note: This API is implemented on Linux and is under development on Microsoft Windows. Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues. hipSuccess, hipErrorInvalidValue, hipErrorOutOfMemory hipError_t hipMemPoolImportPointer ( void **dev_ptr, hipMemPool_t mem_pool, hipMemPoolPtrExportData *export_data ) Import a memory pool allocation from another process. Returns in dev_ptr a pointer to the imported memory. The imported memory must not be accessed before the allocation operation completes in the exporting process. The imported memory must be freed from all importing processes before being freed in the exporting process. The pointer may be freed with hipFree or hipFreeAsync . If hipFreeAsync is used, the free must be completed on the importing process before the free operation on the exporting process. hipMemPoolExportPointer Note: The hipFreeAsync api may be used in the exporting process before the hipFreeAsync operation completes in its stream as long as the hipFreeAsync in the exporting process specifies a stream with a stream dependency on the importing process's hipFreeAsync . Note: This API is implemented on Linux and is under development on Microsoft Windows. Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues. 
Returns: hipSuccess, hipErrorInvalidValue, hipErrorNotInitialized, hipErrorOutOfMemory

`hipError_t hipDeviceCanAccessPeer(int *canAccessPeer, int deviceId, int peerDeviceId)`

Determine if a device can access a peer's memory. Returns '1' in `canAccessPeer` if the specified device is capable of directly accessing memory physically located on `peerDeviceId`, or '0' if not. Returns '0' in `canAccessPeer` if deviceId == peerDeviceId and both are valid devices: a device is not a peer of itself.

Returns: hipSuccess, hipErrorInvalidDevice (if deviceId or peerDeviceId are not valid devices)

`hipError_t hipDeviceEnablePeerAccess(int peerDeviceId, unsigned int flags)`

Enable direct access from the current device's virtual address space to memory allocations physically located on a peer device. Memory that is already allocated on the peer device is mapped into the address space of the current device. In addition, all future memory allocations on `peerDeviceId` are mapped into the address space of the current device when the memory is allocated. The peer memory remains accessible from the current device until a call to hipDeviceDisablePeerAccess or hipDeviceReset.

Returns: hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue, hipErrorPeerAccessAlreadyEnabled (if peer access is already enabled for this device)

`hipError_t hipDeviceDisablePeerAccess(int peerDeviceId)`

Disable direct access from the current device's virtual address space to memory allocations physically located on a peer device. Returns hipErrorPeerAccessNotEnabled if direct access to memory on `peerDeviceId` has not yet been enabled from the current device.

Parameters: `peerDeviceId` [in] Peer device to disable direct access to

Returns: hipSuccess, hipErrorPeerAccessNotEnabled

`hipError_t hipMemGetAddressRange(hipDeviceptr_t *pbase, size_t *psize, hipDeviceptr_t dptr)`

Get information on memory allocations.

hipCtxCreate, hipCtxDestroy, hipCtxGetFlags, hipCtxPopCurrent, hipCtxGetCurrent, hipCtxSetCurrent, hipCtxPushCurrent, hipCtxSetCacheConfig, hipCtxSynchronize, hipCtxGetDevice hipSuccess, hipErrorNotFound hipError_t hipPointerSetAttribute ( const void *value, hipPointer_attribute attribute, hipDeviceptr_t ptr ) Sets information on the specified pointer.[BETA]. Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues. hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue hipError_t hipPointerGetAttributes ( hipPointerAttribute_t *attributes, const void *ptr ) Returns attributes for the specified pointer. The output parameter 'attributes' has a member named 'type' that describes what memory the pointer is associated with, such as device memory, host memory, managed memory, and others. Otherwise, the API cannot handle the pointer and returns hipErrorInvalidValue. hipPointerGetAttribute Note: The unrecognized memory type is unsupported to keep the HIP functionality backward compatibility due to hipMemoryType enum values. Note: The current behavior of this HIP API corresponds to the CUDA API before version 11.0. hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue hipError_t hipPointerGetAttribute ( void *data, hipPointer_attribute attribute, hipDeviceptr_t ptr ) Returns information about the specified pointer.[BETA]. hipPointerGetAttributes Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues. hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue hipError_t hipDrvPointerGetAttributes ( unsigned int numAttributes, hipPointer_attribute *attributes, void **data, hipDeviceptr_t ptr ) Returns information about the specified pointer.[BETA]. hipPointerGetAttribute Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues. 
Returns: hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue

`hipError_t hipMalloc(void **ptr, size_t size)`

Allocate memory on the default accelerator. If size is 0, no memory is allocated, `*ptr` returns nullptr, and hipSuccess is returned.

See also: hipMallocPitch, hipFree, hipMallocArray, hipFreeArray, hipMalloc3D, hipMalloc3DArray, hipHostFree, hipHostMalloc

Returns: hipSuccess, hipErrorOutOfMemory, hipErrorInvalidValue (bad context, null `*ptr`)

`hipError_t hipExtMallocWithFlags(void **ptr, size_t sizeBytes, unsigned int flags)`

Allocate memory on the default accelerator. If the requested memory size is 0, no memory is allocated, `*ptr` returns nullptr, and hipSuccess is returned. The memory allocation flag should be either hipDeviceMallocDefault, hipDeviceMallocFinegrained, hipDeviceMallocUncached, or hipMallocSignalMemory. If the flag is any other value, the API returns hipErrorInvalidValue.

See also: hipMallocPitch, hipFree, hipMallocArray, hipFreeArray, hipMalloc3D, hipMalloc3DArray, hipHostFree, hipHostMalloc

Returns: hipSuccess, hipErrorOutOfMemory, hipErrorInvalidValue (bad context, null `*ptr`)

`hipError_t hipMallocHost(void **ptr, size_t size)`

Allocate pinned host memory [Deprecated]. If size is 0, no memory is allocated, `*ptr` returns nullptr, and hipSuccess is returned.

Warning: This API is deprecated, use hipHostMalloc() instead

Returns: hipSuccess, hipErrorOutOfMemory

`hipError_t hipMemAllocHost(void **ptr, size_t size)`

Allocate pinned host memory [Deprecated]. If size is 0, no memory is allocated, `*ptr` returns nullptr, and hipSuccess is returned.

Warning: This API is deprecated, use hipHostMalloc() instead

Returns: hipSuccess, hipErrorOutOfMemory

`hipError_t hipHostMalloc(void **ptr, size_t size, unsigned int flags)`

Allocates device-accessible page-locked (pinned) host memory. This API allocates pinned host memory that is mapped into the address space of all GPUs in the system. The memory can be accessed directly by the GPU device and can be read or written with much higher bandwidth than pageable memory obtained with functions such as malloc().

Using the pinned host memory, applications can implement faster data transfers for HostToDevice and DeviceToHost. The runtime tracks the hipHostMalloc allocations and can avoid some of the setup required for regular unpinned memory.

When memory accesses are infrequent, zero-copy memory can be a good choice for coherent allocation: the GPU can directly access the host memory over the CPU/GPU interconnect, without needing to copy the data.

Currently the allocation granularity is 4KB for the API.

Developers need to choose the proper allocation flag with consideration of synchronization. If no flags are specified, the default pinned host memory allocation is used.

See also: hipSetDeviceFlags, hipHostFree

Returns: hipSuccess, hipErrorOutOfMemory

`hipError_t hipHostAlloc(void **ptr, size_t size, unsigned int flags)`

Allocate device-accessible page-locked host memory [Deprecated]. If size is 0, no memory is allocated, `*ptr` returns nullptr, and hipSuccess is returned.

Warning: This API is deprecated, use hipHostMalloc() instead

Returns: hipSuccess, hipErrorOutOfMemory

`hipError_t hipHostGetDevicePointer(void **devPtr, void *hstPtr, unsigned int flags)`

Get the device pointer from a host pointer allocated through hipHostMalloc.

See also: hipSetDeviceFlags, hipHostMalloc

Returns: hipSuccess, hipErrorInvalidValue, hipErrorOutOfMemory

`hipError_t hipHostGetFlags(unsigned int *flagsPtr, void *hostPtr)`

Return flags associated with a host pointer.

Returns: hipSuccess, hipErrorInvalidValue

`hipError_t hipHostRegister(void *hostPtr, size_t sizeBytes, unsigned int flags)`

Register host memory so it can be accessed from the current device. After registering the memory, use hipHostGetDevicePointer to obtain the mapped device pointer. On many systems, the mapped device pointer will have a different value than the mapped host pointer. Applications must use the device pointer in device code, and the host pointer in host code. On some systems, registered memory is pinned.
On some systems, registered memory may not actually be pinned but instead uses OS or hardware facilities to allow GPU access to the host memory. Developers are strongly encouraged to register memory blocks that are aligned to the host cache-line size (typically 64 bytes, but it can be obtained from the CPUID instruction). When registering non-aligned pointers, the application must take care when registering pointers from the same cache line on different devices. HIP's coarse-grained synchronization model does not guarantee correct results if different devices write to different parts of the same cache block: typically one of the writes will 'win' and overwrite data from the other registered memory region.

See also: hipHostUnregister, hipHostGetFlags, hipHostGetDevicePointer

Returns: hipSuccess, hipErrorOutOfMemory

`hipError_t hipHostUnregister(void *hostPtr)`

Un-register host pointer.

See also: hipHostRegister

Parameters: `hostPtr` [in] Host pointer previously registered with hipHostRegister

Returns: Error code

`hipError_t hipMallocPitch(void **ptr, size_t *pitch, size_t width, size_t height)`

Allocates at least width (in bytes) * height bytes of linear memory. Padding may occur to ensure alignment requirements are met for the given row. The change in width size due to padding will be returned in `*pitch`. Currently the alignment is set to 128 bytes. If size is 0, no memory is allocated, `*ptr` returns nullptr, and hipSuccess is returned.

See also: hipMalloc, hipFree, hipMallocArray, hipFreeArray, hipHostFree, hipMalloc3D, hipMalloc3DArray, hipHostMalloc

Returns: Error code

`hipError_t hipMemAllocPitch(hipDeviceptr_t *dptr, size_t *pitch, size_t widthInBytes, size_t height, unsigned int elementSizeBytes)`

Allocates at least width (in bytes) * height bytes of linear memory. Padding may occur to ensure alignment requirements are met for the given row. The change in width size due to padding will be returned in `*pitch`.
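
To make the role of the pitch concrete, the following host-only C++ sketch shows how a row width might be padded and how an element is then addressed in a pitched allocation. The helper names `round_up_pitch` and `pitched_element` are hypothetical and not part of HIP. The 128-byte default only mirrors the "currently 128 bytes" alignment quoted above for hipMallocPitch, so real code should always use the pitch value the allocator returns rather than recomputing it.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: round a row width (in bytes) up to a multiple of
// `alignment`. Assumption: the 128-byte default mirrors the alignment the
// text above quotes as current behaviour; it is not a guaranteed constant.
inline std::size_t round_up_pitch(std::size_t width_bytes,
                                  std::size_t alignment = 128) {
    assert(alignment != 0);
    return (width_bytes + alignment - 1) / alignment * alignment;
}

// Hypothetical helper: address of element (row, column) in a pitched 2D
// allocation. `pitch` is in bytes, `column` is in elements of T.
template <typename T>
inline T* pitched_element(void* base, std::size_t pitch,
                          std::size_t row, std::size_t column) {
    // Step to the start of the row in bytes, then index by element.
    return reinterpret_cast<T*>(static_cast<char*>(base) + row * pitch) + column;
}
```

For a row of 100 floats (400 bytes), `round_up_pitch(400)` yields 512, and element (2, 3) then sits `2 * 512 + 3 * sizeof(float)` bytes past the base address. Rows are stepped in bytes and columns in elements, which is why the cast to `char*` happens before the row offset is added.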
Currently the alignment is set to 128 bytes. If size is 0, no memory is allocated, `*dptr` returns nullptr, and hipSuccess is returned. The intended usage of pitch is as a separate parameter of the allocation, used to compute addresses within the 2D array. Given the row and column of an array element of type T, the address is computed as: `T* pElement = (T*)((char*)BaseAddress + Row * Pitch) + Column;`

See also: hipMalloc, hipFree, hipMallocArray, hipFreeArray, hipHostFree, hipMalloc3D, hipMalloc3DArray, hipHostMalloc

Returns: Error code

`hipError_t hipFree(void *ptr)`

Free memory allocated by the HIP memory allocation API. This API performs an implicit hipDeviceSynchronize() call. If the pointer is NULL, the HIP runtime is initialized and hipSuccess is returned.

See also: hipMalloc, hipMallocPitch, hipMallocArray, hipFreeArray, hipHostFree, hipMalloc3D, hipMalloc3DArray, hipHostMalloc

Parameters: `ptr` [in] Pointer to memory to be freed

Returns: hipSuccess, hipErrorInvalidDevicePointer (if the pointer is invalid, including host pointers allocated with hipHostMalloc)

`hipError_t hipFreeHost(void *ptr)`

Free memory allocated by the HIP host memory allocation API [Deprecated].

Warning: This API is deprecated, use hipHostFree() instead

Parameters: `ptr` [in] Pointer to memory to be freed

Returns: hipSuccess, hipErrorInvalidValue (if the pointer is invalid, including device pointers allocated with hipMalloc)

`hipError_t hipHostFree(void *ptr)`

Free memory allocated by the HIP host memory allocation API. This API performs an implicit hipDeviceSynchronize() call. If the pointer is NULL, the HIP runtime is initialized and hipSuccess is returned.

See also: hipMalloc, hipMallocPitch, hipFree, hipMallocArray, hipFreeArray, hipMalloc3D, hipMalloc3DArray, hipHostMalloc

Parameters: `ptr` [in] Pointer to memory to be freed

Returns: hipSuccess, hipErrorInvalidValue (if the pointer is invalid, including device pointers allocated with hipMalloc)

`hipError_t hipMemcpy(void *dst, const void *src, size_t sizeBytes, hipMemcpyKind kind)`

Copy data from src to dst. It supports memory copies from host to device, device to host, device to device, and host to host. The src and dst must not overlap.
For hipMemcpy, the copy is always performed by the current device (set by hipSetDevice). For multi-GPU or peer-to-peer configurations, it is recommended to set the current device to the device where the src data is physically located. For optimal peer-to-peer copies, the copy device must be able to access the src and dst pointers (by calling hipDeviceEnablePeerAccess with the copy agent as the current device and src/dst as the peerDevice argument). If this is not done, hipMemcpy will still work, but will perform the copy using a staging buffer on the host. Calling hipMemcpy with dst and src pointers that do not match the hipMemcpyKind results in undefined behavior.

See also: hipArrayCreate, hipArrayDestroy, hipArrayGetDescriptor, hipMemAlloc, hipMemAllocHost, hipMemAllocPitch, hipMemcpy2D, hipMemcpy2DAsync, hipMemcpy2DUnaligned, hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH, hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD, hipMemcpyDtoDAsync, hipMemcpyDtoH, hipMemcpyDtoHAsync, hipMemcpyHtoA, hipMemcpyHtoAAsync, hipMemcpyHtoDAsync, hipMemFree, hipMemFreeHost, hipMemGetAddressRange, hipMemGetInfo, hipMemHostAlloc, hipMemHostGetDevicePointer

Returns: hipSuccess, hipErrorInvalidValue, hipErrorUnknown

`hipError_t hipMemcpyWithStream(void *dst, const void *src, size_t sizeBytes, hipMemcpyKind kind, hipStream_t stream)`

Memory copy on the stream. It allows single or multiple devices to do memory copy on single or multiple streams.

See also: hipMemcpy, hipStreamCreate, hipStreamSynchronize, hipStreamDestroy, hipSetDevice, hipLaunchKernelGGL

Returns: hipSuccess, hipErrorInvalidValue, hipErrorUnknown, hipErrorContextIsDestroyed

`hipError_t hipMemcpyHtoD(hipDeviceptr_t dst, void *src, size_t sizeBytes)`

Copy data from Host to Device.

hipArrayCreate , hipArrayDestroy , hipArrayGetDescriptor , hipMemAlloc, hipMemAllocHost , hipMemAllocPitch , hipMemcpy2D , hipMemcpy2DAsync , hipMemcpy2DUnaligned, hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH , hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD , hipMemcpyDtoDAsync , hipMemcpyDtoH , hipMemcpyDtoHAsync , hipMemcpyHtoA , hipMemcpyHtoAAsync, hipMemcpyHtoDAsync , hipMemFree, hipMemFreeHost, hipMemGetAddressRange , hipMemGetInfo , hipMemHostAlloc, hipMemHostGetDevicePointer hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidValue hipError_t hipMemcpyDtoH ( void *dst, hipDeviceptr_t src, size_t sizeBytes ) Copy data from Device to Host. hipArrayCreate , hipArrayDestroy , hipArrayGetDescriptor , hipMemAlloc, hipMemAllocHost , hipMemAllocPitch , hipMemcpy2D , hipMemcpy2DAsync , hipMemcpy2DUnaligned, hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH , hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD , hipMemcpyDtoDAsync , hipMemcpyDtoH , hipMemcpyDtoHAsync , hipMemcpyHtoA , hipMemcpyHtoAAsync, hipMemcpyHtoDAsync , hipMemFree, hipMemFreeHost, hipMemGetAddressRange , hipMemGetInfo , hipMemHostAlloc, hipMemHostGetDevicePointer hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidValue hipError_t hipMemcpyDtoD ( hipDeviceptr_t dst, hipDeviceptr_t src, size_t sizeBytes ) Copy data from Device to Device. 
hipArrayCreate , hipArrayDestroy , hipArrayGetDescriptor , hipMemAlloc, hipMemAllocHost , hipMemAllocPitch , hipMemcpy2D , hipMemcpy2DAsync , hipMemcpy2DUnaligned, hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH , hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD , hipMemcpyDtoDAsync , hipMemcpyDtoH , hipMemcpyDtoHAsync , hipMemcpyHtoA , hipMemcpyHtoAAsync, hipMemcpyHtoDAsync , hipMemFree, hipMemFreeHost, hipMemGetAddressRange , hipMemGetInfo , hipMemHostAlloc, hipMemHostGetDevicePointer hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidValue hipError_t hipMemcpyHtoDAsync ( hipDeviceptr_t dst, void *src, size_t sizeBytes, hipStream_t stream ) Copy data from Host to Device asynchronously. hipArrayCreate , hipArrayDestroy , hipArrayGetDescriptor , hipMemAlloc, hipMemAllocHost , hipMemAllocPitch , hipMemcpy2D , hipMemcpy2DAsync , hipMemcpy2DUnaligned, hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH , hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD , hipMemcpyDtoDAsync , hipMemcpyDtoH , hipMemcpyDtoHAsync , hipMemcpyHtoA , hipMemcpyHtoAAsync, hipMemcpyHtoDAsync , hipMemFree, hipMemFreeHost, hipMemGetAddressRange , hipMemGetInfo , hipMemHostAlloc, hipMemHostGetDevicePointer hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidValue hipError_t hipMemcpyDtoHAsync ( void *dst, hipDeviceptr_t src, size_t sizeBytes, hipStream_t stream ) Copy data from Device to Host asynchronously. 
hipArrayCreate , hipArrayDestroy , hipArrayGetDescriptor , hipMemAlloc, hipMemAllocHost , hipMemAllocPitch , hipMemcpy2D , hipMemcpy2DAsync , hipMemcpy2DUnaligned, hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH , hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD , hipMemcpyDtoDAsync , hipMemcpyDtoH , hipMemcpyDtoHAsync , hipMemcpyHtoA , hipMemcpyHtoAAsync, hipMemcpyHtoDAsync , hipMemFree, hipMemFreeHost, hipMemGetAddressRange , hipMemGetInfo , hipMemHostAlloc, hipMemHostGetDevicePointer hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidValue hipError_t hipMemcpyDtoDAsync ( hipDeviceptr_t dst, hipDeviceptr_t src, size_t sizeBytes, hipStream_t stream ) Copy data from Device to Device asynchronously. hipArrayCreate , hipArrayDestroy , hipArrayGetDescriptor , hipMemAlloc, hipMemAllocHost , hipMemAllocPitch , hipMemcpy2D , hipMemcpy2DAsync , hipMemcpy2DUnaligned, hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH , hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD , hipMemcpyDtoDAsync , hipMemcpyDtoH , hipMemcpyDtoHAsync , hipMemcpyHtoA , hipMemcpyHtoAAsync, hipMemcpyHtoDAsync , hipMemFree, hipMemFreeHost, hipMemGetAddressRange , hipMemGetInfo , hipMemHostAlloc, hipMemHostGetDevicePointer hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidValue hipError_t hipModuleGetGlobal ( hipDeviceptr_t *dptr, size_t *bytes, hipModule_t hmod, const char *name ) Returns a global pointer from a module. Returns in *dptr and *bytes the pointer and size of the global of name name located in module hmod. If no variable of that name exists, it returns hipErrorNotFound. Both parameters dptr and bytes are optional. If one of them is NULL, it is ignored and hipSuccess is returned. hipSuccess, hipErrorInvalidValue, hipErrorNotFound, hipErrorInvalidContext hipError_t hipGetSymbolAddress ( void **devPtr, const void *symbol ) Gets device pointer associated with symbol on the device. 
hipSuccess, hipErrorInvalidValue hipError_t hipGetSymbolSize ( size_t *size, const void *symbol ) Gets the size of the given symbol on the device. hipSuccess, hipErrorInvalidValue hipError_t hipGetProcAddress ( const char *symbol, void **pfn, int hipVersion, uint64_t flags, hipDriverProcAddressQueryResult *symbolStatus ) Gets the pointer of requested HIP driver function. Returns hipSuccess if the returned pfn is addressed to the pointer of found driver function. hipSuccess, hipErrorInvalidValue. hipError_t hipMemcpyToSymbol ( const void *symbol, const void *src, size_t sizeBytes, size_t offset, hipMemcpyKind kind ) Copies data to the given symbol on the device. Symbol HIP APIs allow a kernel to define a device-side data symbol which can be accessed on the host side. The symbol can be in __constant or device space. Note that the symbol name needs to be encased in the HIP_SYMBOL macro. This also applies to hipMemcpyFromSymbol, hipGetSymbolAddress, and hipGetSymbolSize. For detailed usage, see the memcpyToSymbol example in the HIP Porting Guide. hipSuccess, hipErrorInvalidValue hipError_t hipMemcpyToSymbolAsync ( const void *symbol, const void *src, size_t sizeBytes, size_t offset, hipMemcpyKind kind, hipStream_t stream ) Copies data to the given symbol on the device asynchronously. hipSuccess, hipErrorInvalidValue hipError_t hipMemcpyFromSymbol ( void *dst, const void *symbol, size_t sizeBytes, size_t offset, hipMemcpyKind kind ) Copies data from the given symbol on the device. hipSuccess, hipErrorInvalidValue hipError_t hipMemcpyFromSymbolAsync ( void *dst, const void *symbol, size_t sizeBytes, size_t offset, hipMemcpyKind kind, hipStream_t stream ) Copies data from the given symbol on the device asynchronously. hipSuccess, hipErrorInvalidValue hipError_t hipMemcpyAsync ( void *dst, const void *src, size_t sizeBytes, hipMemcpyKind kind, hipStream_t stream ) Copy data from src to dst asynchronously. 
For multi-GPU or peer-to-peer configurations, it is recommended to use a stream that is attached to the device where the src data is physically located. For optimal peer-to-peer copies, the copy device must be able to access the src and dst pointers (by calling hipDeviceEnablePeerAccess with the copy agent as the current device and src/dst as the peerDevice argument). If this is not done, the copy will still work, but will be performed using a staging buffer on the host.

See also: hipMemcpy, hipMemcpy2D, hipMemcpyToArray, hipMemcpy2DToArray, hipMemcpyFromArray, hipMemcpy2DFromArray, hipMemcpyArrayToArray, hipMemcpy2DArrayToArray, hipMemcpyToSymbol, hipMemcpyFromSymbol, hipMemcpy2DAsync, hipMemcpyToArrayAsync, hipMemcpy2DToArrayAsync, hipMemcpyFromArrayAsync, hipMemcpy2DFromArrayAsync, hipMemcpyToSymbolAsync, hipMemcpyFromSymbolAsync

Warning: If host or dest are not pinned, the memory copy will be performed synchronously. For best performance, use hipHostMalloc to allocate host memory that is transferred asynchronously.

Warning: on HCC hipMemcpyAsync does not support overlapped H2D and D2H copies. For hipMemcpyAsync, the copy is always performed by the device associated with the specified stream.

Returns: hipSuccess, hipErrorInvalidValue, hipErrorUnknown

`hipError_t hipMemset(void *dst, int value, size_t sizeBytes)`

Fills the first `sizeBytes` bytes of the memory area pointed to by `dst` with the constant byte value `value`.

Returns: hipSuccess, hipErrorInvalidValue, hipErrorNotInitialized

`hipError_t hipMemsetD8(hipDeviceptr_t dest, unsigned char value, size_t count)`

Fills the first `count` bytes of the memory area pointed to by `dest` with the constant byte value `value`.

Returns: hipSuccess, hipErrorInvalidValue, hipErrorNotInitialized

`hipError_t hipMemsetD8Async(hipDeviceptr_t dest, unsigned char value, size_t count, hipStream_t stream)`

Fills the first `count` bytes of the memory area pointed to by `dest` with the constant byte value `value`. hipMemsetD8Async() is asynchronous with respect to the host, so the call may return before the memset is complete. The operation can optionally be associated with a stream by passing a non-zero stream argument. If stream is non-zero, the operation may overlap with operations in other streams.

Returns: hipSuccess, hipErrorInvalidValue, hipErrorNotInitialized

`hipError_t hipMemsetD16(hipDeviceptr_t dest, unsigned short value, size_t count)`

Fills the first `count` 16-bit values of the memory area pointed to by `dest` with the constant short value `value`.

Returns: hipSuccess, hipErrorInvalidValue, hipErrorNotInitialized

`hipError_t hipMemsetD16Async(hipDeviceptr_t dest, unsigned short value, size_t count, hipStream_t stream)`

Fills the first `count` 16-bit values of the memory area pointed to by `dest` with the constant short value `value`. hipMemsetD16Async() is asynchronous with respect to the host, so the call may return before the memset is complete. The operation can optionally be associated with a stream by passing a non-zero stream argument. If stream is non-zero, the operation may overlap with operations in other streams.

Returns: hipSuccess, hipErrorInvalidValue, hipErrorNotInitialized

`hipError_t hipMemsetD32(hipDeviceptr_t dest, int value, size_t count)`

Fills the memory area pointed to by `dest` with the constant integer `value`, repeated `count` times.

Returns: hipSuccess, hipErrorInvalidValue, hipErrorNotInitialized

`hipError_t hipMemsetAsync(void *dst, int value, size_t sizeBytes, hipStream_t stream)`

Fills the first `sizeBytes` bytes of the memory area pointed to by `dst` with the constant byte value `value`. hipMemsetAsync() is asynchronous with respect to the host, so the call may return before the memset is complete. The operation can optionally be associated with a stream by passing a non-zero stream argument. If stream is non-zero, the operation may overlap with operations in other streams.

Returns: hipSuccess, hipErrorInvalidValue

`hipError_t hipMemsetD32Async(hipDeviceptr_t dst, int value, size_t count, hipStream_t stream)`

Fills the memory area pointed to by `dst` with the constant integer `value`, repeated `count` times. hipMemsetD32Async() is asynchronous with respect to the host, so the call may return before the memset is complete. The operation can optionally be associated with a stream by passing a non-zero stream argument. If stream is non-zero, the operation may overlap with operations in other streams.

Returns: hipSuccess, hipErrorInvalidValue

`hipError_t hipMemset2D(void *dst, size_t pitch, int value, size_t width, size_t height)`

Fills the memory area pointed to by `dst` with the constant value.

Returns: hipSuccess, hipErrorInvalidValue

`hipError_t hipMemset2DAsync(void *dst, size_t pitch, int value, size_t width, size_t height, hipStream_t stream)`

Asynchronously fills the memory area pointed to by `dst` with the constant value.

Returns: hipSuccess, hipErrorInvalidValue

`hipError_t hipMemset3D(hipPitchedPtr pitchedDevPtr, int value, hipExtent extent)`

Synchronously fills the memory area pointed to by `pitchedDevPtr` with the constant value.

Returns: hipSuccess, hipErrorInvalidValue

`hipError_t hipMemset3DAsync(hipPitchedPtr pitchedDevPtr, int value, hipExtent extent, hipStream_t stream)`

Asynchronously fills the memory area pointed to by `pitchedDevPtr` with the constant value.

Returns: hipSuccess, hipErrorInvalidValue

`hipError_t hipMemGetInfo(size_t *free, size_t *total)`

Query memory info. On ROCm, this function gets the actual free memory left on the current device, so it supports cases where multiple workloads run concurrently (such as multiple processes, multiple threads, and multiple GPUs).

Warning: On Windows, the free memory only accounts for memory allocated by this process and may be optimistic.

Returns: hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue

`hipError_t hipMemPtrGetInfo(void *ptr, size_t *size)`

Get allocated memory size via memory pointer.
This function gets the allocated shared virtual memory size from memory pointer. hipSuccess, hipErrorInvalidValue hipError_t hipMallocArray ( hipArray_t *array, const hipChannelFormatDesc *desc, size_t width, size_t height, unsigned int flags ) Allocate an array on the device. hipMalloc , hipMallocPitch , hipFree , hipFreeArray , hipHostMalloc , hipHostFree hipSuccess, hipErrorOutOfMemory hipError_t hipArrayCreate ( hipArray_t *pHandle, const HIP_ARRAY_DESCRIPTOR *pAllocateArray ) Create an array memory pointer on the device. hipMallocArray , hipArrayDestroy , hipFreeArray hipSuccess, hipErrorInvalidValue, hipErrorNotSupported hipError_t hipArrayDestroy ( hipArray_t array ) Destroy an array memory pointer on the device. hipArrayCreate , hipArrayDestroy , hipFreeArray array -[in] Pointer to the array memory hipSuccess, hipErrorInvalidValue hipError_t hipArray3DCreate ( hipArray_t *array, const HIP_ARRAY3D_DESCRIPTOR *pAllocateArray ) Create a 3D array memory pointer on the device. hipMallocArray , hipArrayDestroy , hipFreeArray hipSuccess, hipErrorInvalidValue, hipErrorNotSupported hipError_t hipMalloc3D ( hipPitchedPtr *pitchedDevPtr, hipExtent extent ) Create a 3D memory pointer on the device. hipMallocPitch , hipMemGetInfo , hipFree hipSuccess, hipErrorInvalidValue, hipErrorNotSupported hipError_t hipFreeArray ( hipArray_t array ) Frees an array on the device. hipMalloc , hipMallocPitch , hipFree , hipMallocArray , hipHostMalloc , hipHostFree array -[in] Pointer to array to free hipSuccess, hipErrorInvalidValue, hipErrorNotInitialized hipError_t hipMalloc3DArray ( hipArray_t *array, const struct hipChannelFormatDesc *desc, struct hipExtent extent, unsigned int flags ) Allocate an array on the device. hipMalloc , hipMallocPitch , hipFree , hipFreeArray , hipHostMalloc , hipHostFree hipSuccess, hipErrorOutOfMemory hipError_t hipArrayGetInfo ( hipChannelFormatDesc *desc, hipExtent *extent, unsigned int *flags, hipArray_t array ) Gets info about the specified array. 
hipArrayGetDescriptor , hipArray3DGetDescriptor hipSuccess, hipErrorInvalidValue hipErrorInvalidHandle hipError_t hipArrayGetDescriptor ( HIP_ARRAY_DESCRIPTOR *pArrayDescriptor, hipArray_t array ) Gets a 1D or 2D array descriptor. hipArray3DCreate , hipArray3DGetDescriptor , hipArrayCreate , hipArrayDestroy , hipMemAlloc, hipMemAllocHost , hipMemAllocPitch , hipMemcpy2D , hipMemcpy2DAsync , hipMemcpy2DUnaligned, hipMemcpy3D , hipMemcpy3DAsync , hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH , hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD , hipMemcpyDtoDAsync , hipMemcpyDtoH , hipMemcpyDtoHAsync , hipMemcpyHtoA , hipMemcpyHtoAAsync, hipMemcpyHtoD , hipMemcpyHtoDAsync , hipMemFree, hipMemFreeHost, hipMemGetAddressRange , hipMemGetInfo , hipMemHostAlloc, hipMemHostGetDevicePointer, hipMemsetD8 , hipMemsetD16 , hipMemsetD32 , hipArrayGetInfo hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidValue hipErrorInvalidHandle hipError_t hipArray3DGetDescriptor ( HIP_ARRAY3D_DESCRIPTOR *pArrayDescriptor, hipArray_t array ) Gets a 3D array descriptor. 
See also: hipArray3DCreate, hipArrayCreate, hipArrayDestroy, hipArrayGetDescriptor, hipMemAlloc, hipMemAllocHost, hipMemAllocPitch, hipMemcpy2D, hipMemcpy2DAsync, hipMemcpy2DUnaligned, hipMemcpy3D, hipMemcpy3DAsync, hipMemcpyAtoA, hipMemcpyAtoD, hipMemcpyAtoH, hipMemcpyAtoHAsync, hipMemcpyDtoA, hipMemcpyDtoD, hipMemcpyDtoDAsync, hipMemcpyDtoH, hipMemcpyDtoHAsync, hipMemcpyHtoA, hipMemcpyHtoAAsync, hipMemcpyHtoD, hipMemcpyHtoDAsync, hipMemFree, hipMemFreeHost, hipMemGetAddressRange, hipMemGetInfo, hipMemHostAlloc, hipMemHostGetDevicePointer, hipMemsetD8, hipMemsetD16, hipMemsetD32, hipArrayGetInfo. Returns: hipSuccess, hipErrorDeinitialized, hipErrorNotInitialized, hipErrorInvalidContext, hipErrorInvalidValue, hipErrorInvalidHandle, hipErrorContextIsDestroyed.

- `hipError_t hipMemcpy2D(void *dst, size_t dpitch, const void *src, size_t spitch, size_t width, size_t height, hipMemcpyKind kind)`: Copies data between host and device. See also: hipMemcpy, hipMemcpyToArray, hipMemcpy2DToArray, hipMemcpyFromArray, hipMemcpyToSymbol, hipMemcpyAsync. Returns: hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection.
- `hipError_t hipMemcpyParam2D(const hip_Memcpy2D *pCopy)`: Copies memory for 2D arrays. See also: hipMemcpy, hipMemcpy2D, hipMemcpyToArray, hipMemcpy2DToArray, hipMemcpyFromArray, hipMemcpyToSymbol, hipMemcpyAsync. Parameters: `pCopy` [in], parameters for the memory copy. Returns: hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection.
- `hipError_t hipMemcpyParam2DAsync(const hip_Memcpy2D *pCopy, hipStream_t stream)`: Copies memory for 2D arrays.
See also: hipMemcpy, hipMemcpy2D, hipMemcpyToArray, hipMemcpy2DToArray, hipMemcpyFromArray, hipMemcpyToSymbol, hipMemcpyAsync. Returns: hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection.

- `hipError_t hipMemcpy2DAsync(void *dst, size_t dpitch, const void *src, size_t spitch, size_t width, size_t height, hipMemcpyKind kind, hipStream_t stream)`: Copies data between host and device. See also: hipMemcpy, hipMemcpyToArray, hipMemcpy2DToArray, hipMemcpyFromArray, hipMemcpyToSymbol, hipMemcpyAsync. Returns: hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection.
- `hipError_t hipMemcpy2DToArray(hipArray_t dst, size_t wOffset, size_t hOffset, const void *src, size_t spitch, size_t width, size_t height, hipMemcpyKind kind)`: Copies data between host and device. See also: hipMemcpy, hipMemcpyToArray, hipMemcpy2D, hipMemcpyFromArray, hipMemcpyToSymbol, hipMemcpyAsync. Returns: hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection.
- `hipError_t hipMemcpy2DToArrayAsync(hipArray_t dst, size_t wOffset, size_t hOffset, const void *src, size_t spitch, size_t width, size_t height, hipMemcpyKind kind, hipStream_t stream)`: Copies data between host and device. See also: hipMemcpy, hipMemcpyToArray, hipMemcpy2D, hipMemcpyFromArray, hipMemcpyToSymbol, hipMemcpyAsync. Returns: hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection.
- `hipError_t hipMemcpyToArray(hipArray_t dst, size_t wOffset, size_t hOffset, const void *src, size_t count, hipMemcpyKind kind)`: Copies data between host and device. See also: hipMemcpy, hipMemcpy2DToArray, hipMemcpy2D, hipMemcpyFromArray, hipMemcpyToSymbol, hipMemcpyAsync. Warning: This API is deprecated.
Returns: hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection.

- `hipError_t hipMemcpyFromArray(void *dst, hipArray_const_t srcArray, size_t wOffset, size_t hOffset, size_t count, hipMemcpyKind kind)`: Copies data between host and device. See also: hipMemcpy, hipMemcpy2DToArray, hipMemcpy2D, hipMemcpyFromArray, hipMemcpyToSymbol, hipMemcpyAsync. Warning: This API is deprecated. Returns: hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection.
- `hipError_t hipMemcpy2DFromArray(void *dst, size_t dpitch, hipArray_const_t src, size_t wOffset, size_t hOffset, size_t width, size_t height, hipMemcpyKind kind)`: Copies data between host and device. See also: hipMemcpy, hipMemcpy2DToArray, hipMemcpy2D, hipMemcpyFromArray, hipMemcpyToSymbol, hipMemcpyAsync. Returns: hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection.
- `hipError_t hipMemcpy2DFromArrayAsync(void *dst, size_t dpitch, hipArray_const_t src, size_t wOffset, size_t hOffset, size_t width, size_t height, hipMemcpyKind kind, hipStream_t stream)`: Copies data between host and device asynchronously. See also: hipMemcpy, hipMemcpy2DToArray, hipMemcpy2D, hipMemcpyFromArray, hipMemcpyToSymbol, hipMemcpyAsync. Returns: hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection.
- `hipError_t hipMemcpyAtoH(void *dst, hipArray_t srcArray, size_t srcOffset, size_t count)`: Copies data between host and device. See also: hipMemcpy, hipMemcpy2DToArray, hipMemcpy2D, hipMemcpyFromArray, hipMemcpyToSymbol, hipMemcpyAsync. Returns: hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection.
- `hipError_t hipMemcpyHtoA(hipArray_t dstArray, size_t dstOffset, const void *srcHost, size_t count)`: Copies data between host and device.
See also: hipMemcpy, hipMemcpy2DToArray, hipMemcpy2D, hipMemcpyFromArray, hipMemcpyToSymbol, hipMemcpyAsync. Returns: hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection.

- `hipError_t hipMemcpy3D(const struct hipMemcpy3DParms *p)`: Copies data between host and device. See also: hipMemcpy, hipMemcpy2DToArray, hipMemcpy2D, hipMemcpyFromArray, hipMemcpyToSymbol, hipMemcpyAsync. Parameters: `p` [in], 3D memory copy parameters. Returns: hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection.
- `hipError_t hipMemcpy3DAsync(const struct hipMemcpy3DParms *p, hipStream_t stream)`: Copies data between host and device asynchronously. See also: hipMemcpy, hipMemcpy2DToArray, hipMemcpy2D, hipMemcpyFromArray, hipMemcpyToSymbol, hipMemcpyAsync. Returns: hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection.
- `hipError_t hipDrvMemcpy3D(const HIP_MEMCPY3D *pCopy)`: Copies data between host and device. See also: hipMemcpy, hipMemcpy2DToArray, hipMemcpy2D, hipMemcpyFromArray, hipMemcpyToSymbol, hipMemcpyAsync. Parameters: `pCopy` [in], 3D memory copy parameters. Returns: hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection.
- `hipError_t hipDrvMemcpy3DAsync(const HIP_MEMCPY3D *pCopy, hipStream_t stream)`: Copies data between host and device asynchronously. See also: hipMemcpy, hipMemcpy2DToArray, hipMemcpy2D, hipMemcpyFromArray, hipMemcpyToSymbol, hipMemcpyAsync. Returns: hipSuccess, hipErrorInvalidValue, hipErrorInvalidPitchValue, hipErrorInvalidDevicePointer, hipErrorInvalidMemcpyDirection.
- `template<typename T> hipError_t hipGetSymbolAddress(void **devPtr, const T &symbol)`: Gets the address of a symbol. Returns: hipSuccess, hipErrorInvalidValue.
- `template<typename T> hipError_t hipGetSymbolSize(size_t *size, const T &symbol)`: Gets the size of a symbol.
Returns: hipSuccess, hipErrorInvalidValue.

- `template<typename T> hipError_t hipMemcpyToSymbol(const T &symbol, const void *src, size_t sizeBytes, size_t offset, hipMemcpyKind kind)`: Copies data to the given symbol on the device. See also: hipMemcpyToSymbol. Returns: hipSuccess, hipErrorInvalidMemcpyDirection, hipErrorInvalidValue.
- `template<typename T> hipError_t hipMemcpyToSymbolAsync(const T &symbol, const void *src, size_t sizeBytes, size_t offset, hipMemcpyKind kind, hipStream_t stream)`: Copies data to the given symbol on the device asynchronously on the stream. See also: hipMemcpyToSymbolAsync. Returns: hipSuccess, hipErrorInvalidMemcpyDirection, hipErrorInvalidValue.
- `template<typename T> hipError_t hipMemcpyFromSymbol(void *dst, const T &symbol, size_t sizeBytes, size_t offset, hipMemcpyKind kind)`: Copies data from the given symbol on the device. See also: hipMemcpyFromSymbol. Returns: hipSuccess, hipErrorInvalidMemcpyDirection, hipErrorInvalidValue.
- `template<typename T> hipError_t hipMemcpyFromSymbolAsync(void *dst, const T &symbol, size_t sizeBytes, size_t offset, hipMemcpyKind kind, hipStream_t stream)`: Copies data from the given symbol on the device asynchronously on the stream. See also: hipMemcpyFromSymbolAsync. Returns: hipSuccess, hipErrorInvalidMemcpyDirection, hipErrorInvalidValue.
- `template<class T> static inline hipError_t hipMalloc(T **devPtr, size_t size)`: Performs automatic type conversion to eliminate the need for excessive typecasting (i.e. `void**`). The `HIP_DISABLE_CPP_FUNCTIONS` macro can be defined to suppress these wrappers, which is useful for applications that need to obtain decltypes of HIP runtime APIs. See also: hipMalloc.
- `template<class T> static inline hipError_t hipHostMalloc(T **ptr, size_t size, unsigned int flags = hipHostMallocDefault)`: Provides an override that automatically typecasts the pointer type from `void**` and supplies a default for the flags. The `HIP_DISABLE_CPP_FUNCTIONS` macro can be defined to suppress these wrappers, which is useful for applications that need to obtain decltypes of HIP runtime APIs.
See also: hipHostMalloc.

- `hipError_t hipImportExternalSemaphore(hipExternalSemaphore_t *extSem_out, const hipExternalSemaphoreHandleDesc *semHandleDesc)`: Imports an external semaphore. Returns: hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue.
- `hipError_t hipSignalExternalSemaphoresAsync(const hipExternalSemaphore_t *extSemArray, const hipExternalSemaphoreSignalParams *paramsArray, unsigned int numExtSems, hipStream_t stream)`: Signals a set of external semaphore objects. Returns: hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue.
- `hipError_t hipWaitExternalSemaphoresAsync(const hipExternalSemaphore_t *extSemArray, const hipExternalSemaphoreWaitParams *paramsArray, unsigned int numExtSems, hipStream_t stream)`: Waits on a set of external semaphore objects. Returns: hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue.
- Destroys an external semaphore object and releases any references to the underlying resource. Any outstanding signals or waits must have completed before the semaphore is destroyed. Parameters: `extSem` [in], handle to an external semaphore object. Returns: hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue.
- `hipError_t hipImportExternalMemory(hipExternalMemory_t *extMem_out, const hipExternalMemoryHandleDesc *memHandleDesc)`: Imports an external memory object. Returns: hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue.
- `hipError_t hipExternalMemoryGetMappedBuffer(void **devPtr, hipExternalMemory_t extMem, const hipExternalMemoryBufferDesc *bufferDesc)`: Maps a buffer onto an imported memory object. Returns: hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue.
- Destroys an external memory object. Parameters: `extMem` [in], external memory object to be destroyed. Returns: hipSuccess, hipErrorInvalidDevice, hipErrorInvalidValue.
- `hipError_t hipExternalMemoryGetMappedMipmappedArray(hipMipmappedArray_t *mipmap, hipExternalMemory_t extMem, const hipExternalMemoryMipmappedArrayDesc *mipmapDesc)`: Maps a mipmapped array onto an external memory object.
The returned mipmapped array must be freed using hipFreeMipmappedArray. See also: hipImportExternalMemory, hipExternalMemoryGetMappedBuffer, hipDestroyExternalMemory, hipFreeMipmappedArray. Returns: hipSuccess, hipErrorInvalidValue, hipErrorInvalidResourceHandle.

The `register` keyword is deprecated in C++ and is silently ignored by both NVCC and HIP-Clang. You can pass the option `-Wdeprecated-register` to enable the compiler warning message.

Unrolling a loop whose bounds are known at compile time is supported.

GCN ISA inline assembly is supported. GCN ISA instructions are inserted into the kernel using the `asm()` assembler statement:

- The `volatile` keyword is used so that the optimizers must not change the number of volatile operations or change their order of execution relative to other volatile operations.
- `v_mac_f32_e32` is a GCN instruction. For more information, refer to the [AMD GCN3 ISA architecture manual](http://gpuopen.com/compute-product/amd-gcn3-isa-architecture-manual/).
- The index of each operand is given by `%` followed by its position in the ordered list of operands.
- `'v'` is the constraint code (target-specific for AMDGPU) for a 32-bit VGPR register. For more information, refer to the [Supported Constraint Code List for AMDGPU](https://llvm.org/docs/LangRef.html#supported-constraint-code-list).
- Output constraints are specified by an `'='` prefix (for example `'=v'`). This indicates that the assembly will write to this operand, and the operand will then be made available as a return value of the asm expression.
- Input constraints do not have a prefix, just the constraint code. The constraint string `'0'` says to use the register assigned to the output as an input as well (it being the 0th constraint).

## C++ Support

The following C++ features are not supported:

- Run-time-type information (RTTI)
- Try/catch

### Virtual functions

Virtual functions are not supported if objects containing virtual function tables are passed between GPUs of different offload architectures, e.g.
between gfx906 and gfx1030. Otherwise virtual functions are supported.

hipcc supports compiling C++/HIP kernels to binary code objects. The binary file format is `.co`, which stands for Code Object, and code objects are built using `hipcc`.

Note: When using binary code objects, the number of arguments to the kernel differs between the HIP-Clang and NVCC paths. Refer to the HIP module_api sample for the differences in the arguments to be passed to the kernel.

Clang-defined `__gfx*__` macros can be used to execute gfx-architecture-specific code inside the kernel. Refer to the HIP 14_gpu_arch sample.

The ROCm platform enables the power of combined C++ and HIP (Heterogeneous-computing Interface for Portability) code. This code is compiled with a `clang` or `clang++` compiler. The official compilers support the HIP platform, or you can use the `amdclang` or `amdclang++` compilers included in the ROCm installation, which are wrappers for the official versions.

The source code is compiled according to the C++03, C++11, C++14, C++17, and C++20 standards, along with HIP-specific extensions, but is subject to restrictions. The key restriction is the reduced support of the standard library in device code. This is because a function is by default considered to run on the host, except for `constexpr` functions, which can run on both host and device.

C++ is considered a modern programming language as of C++11. This section describes how HIP supports these new C++ features.

The C++11 standard introduced many new features. These features are supported in HIP host code, with some notable omissions on the device side. The rule of thumb is that `constexpr` functions work on the device and the rest don't. This means that some important functionality like `std::function` is missing on the device. The standard library wasn't designed with HIP in mind, so device-side support is in a state of 'works as-is'.
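The `constexpr` rule of thumb can be illustrated with a small helper. This is a minimal sketch, not taken from the HIP documentation: the macro name `HD` and the functions `ceil_div` and `blocks_for` are hypothetical, and the sketch assumes that a HIP compiler defines `__HIPCC__` and makes the `__host__` and `__device__` specifiers available. Compiled as plain C++, the macro expands to nothing.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical portability macro (an assumption for this sketch):
// expands to HIP execution-space specifiers only under a HIP compiler.
#if defined(__HIPCC__)
#define HD __host__ __device__
#else
#define HD
#endif

// A constexpr helper. Following the rule of thumb above, a function like
// this is expected to be usable from both host and device code.
HD constexpr std::size_t ceil_div(std::size_t n, std::size_t d) {
    return (n + d - 1) / d;
}

// Compile-time check on the host side.
static_assert(ceil_div(1000, 256) == 4, "1000 elements need 4 blocks of 256");

// Host-side use: number of blocks needed to cover n elements.
inline std::size_t blocks_for(std::size_t n, std::size_t block_size) {
    return ceil_div(n, block_size);
}
```

The same `ceil_div` could then be called inside a kernel to compute per-thread work ranges. Facilities such as `std::function` have no equivalent on the device.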
Certain features have restrictions and clarifications. For example, any function using the `constexpr` qualifier, the new initializer lists, `std::move`, or `std::forward` is implicitly considered to have the `__host__` and `__device__` execution space specifiers. Also, `constexpr` variables that are static members or namespace scoped can be used from both host and device, but only for read access. Dereferencing a static `constexpr` outside its specified execution space causes an error.

Lambdas are supported, but there are some extensions and restrictions on their usage. For more information, see the Extended lambdas section below.

The C++14 language features are supported. All C++17 language features are supported. All C++20 language features are supported, but extensions and restrictions apply. C++20 introduced coroutines and modules, which fundamentally changed how programs are written; HIP doesn't support these features. However, `consteval` functions can be called from host and device, even if specified for host use only. The three-way comparison operator (spaceship operator `<=>`) works with host and device code.

In addition to the deviations from the standard, there are some general extensions and restrictions to consider.

Functions that serve as an entry point for device execution are called kernels and are specified with the `__global__` qualifier. To call a kernel function, use the triple chevron operator: `<<< >>>`. Kernel functions must have a `void` return type and are subject to further restrictions. Kernels can have variadic template parameters, but only one parameter pack, which must be the last item in the template parameter list.

HIP includes device space memory specifiers to indicate whether a variable is allocated in host or device memory and how its memory should be allocated. HIP supports the `__device__`, `__shared__`, `__managed__`, and `__constant__` specifiers.
The `__device__` and `__constant__` specifiers define global variables, which are allocated within global memory on the HIP devices. The only difference is that `__constant__` variables can't be changed after allocation. The `__shared__` specifier allocates the variable within shared memory, which is available to all threads in a block.

The `__managed__` variable specifier creates global variables that are initially undefined and unaddressed within the global symbol table. The HIP runtime allocates managed memory and defines the symbol when it loads the device binary. A managed variable can be accessed in both device and host code.

It's important to know where a variable is stored because it is only available from certain locations. Generally, variables allocated in host memory are not accessible from device code, while variables allocated in device memory are not directly accessible from host code. Dereferencing a pointer to device memory on the host results in a segmentation fault. Device variables should be accessed from host code through kernel execution or HIP functions like `hipMemcpyToSymbol`.

An important difference between host and device code is exception handling. In device code, this control flow isn't available due to the hardware architecture. Device code must use return codes to handle errors.

There are some restrictions on kernel function parameters. They cannot be passed by reference, because these functions are called from the host but run on the device. Also, a variable number of arguments is not allowed.

Classes work on both the host and device side, but there are some constraints. Static member functions can't be `__global__`. Virtual member functions work, but a virtual function must not be called from the host if the parent object was created on the device, or the other way around, because this behavior is undefined.
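Because exceptions are unavailable in device code, errors have to travel through return values or output flags. The following is a minimal sketch of that pattern, written so that it also compiles as plain C++. The `Status` enum, the `HD` macro, and the functions `checked_div` and `div_all` are hypothetical names introduced for illustration; none of them are part of the HIP API.

```cpp
#include <cassert>

// Hypothetical portability macro (an assumption for this sketch).
#if defined(__HIPCC__)
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical status codes; not part of the HIP API.
enum class Status : int { Ok = 0, DivideByZero = 1 };

// Reports failure through the return value instead of throwing.
// On failure, *out is left untouched.
HD inline Status checked_div(int num, int den, int* out) {
    if (den == 0) return Status::DivideByZero;
    *out = num / den;
    return Status::Ok;
}

// Processes n elements and records the first non-Ok status in *first_error.
// In a kernel, each thread would typically handle one index and publish its
// status through a flag in device memory that the host inspects afterwards.
HD inline void div_all(const int* num, const int* den, int* out, int n,
                       Status* first_error) {
    *first_error = Status::Ok;
    for (int i = 0; i < n; ++i) {
        Status s = checked_div(num[i], den[i], &out[i]);
        if (s != Status::Ok && *first_error == Status::Ok) *first_error = s;
    }
}
```

On the host, the caller checks the recorded status after the computation finishes, much as it would check the `hipError_t` returned by a runtime call.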
Another minor restriction is that `__device__` variables that are globally scoped must have trivial constructors.

HIP doesn't support the polymorphic function wrapper `std::function`, which was introduced in C++11.

HIP supports lambdas, which by default work as expected. Lambdas have implicit `__host__ __device__` attributes, which means that they can be executed by both host and device code. To make a lambda callable only by host or device code, add the `__host__` or `__device__` attribute. The only restriction is that host variables can only be accessed through copy on the device. Accessing them through reference causes undefined behavior.

Inline namespaces are supported, but with a few exceptions: certain entities can't be declared in namespace scope within an inline unnamed namespace.

HIP-Clang supports a set of math operations that are callable from the device. HIP supports most of the device functions supported by NVIDIA CUDA. These are described in the following sections.

The supported single precision mathematical functions are listed in Table 1. The supported double precision mathematical functions are listed separately.

Following is the list of supported integer intrinsics. Note that intrinsics are supported on device only.
Table 3: Integer intrinsics mathematical functions

- `unsigned int __brev(unsigned int x)`: Reverse the bit order of a 32-bit unsigned integer.
- `unsigned long long int __brevll(unsigned long long int x)`: Reverse the bit order of a 64-bit unsigned integer.
- `unsigned int __byte_perm(unsigned int x, unsigned int y, unsigned int z)`: Return selected bytes from two 32-bit unsigned integers.
- `unsigned int __clz(int x)`: Return the number of consecutive high-order zero bits in a 32-bit integer.
- `unsigned int __clzll(long long int x)`: Return the number of consecutive high-order zero bits in a 64-bit integer.
- `unsigned int __ffs(int x)`: Find the position of the least significant bit set to 1 in a 32-bit integer.
- `unsigned int __ffsll(long long int x)`: Find the position of the least significant bit set to 1 in a 64-bit signed integer.
- `unsigned int __fns32(unsigned long long mask, unsigned int base, int offset)`: Find the position of the n-th bit set to 1 in a 32-bit integer.
- `unsigned int __fns64(unsigned long long int mask, unsigned int base, int offset)`: Find the position of the n-th bit set to 1 in a 64-bit integer.
- `unsigned int __funnelshift_l(unsigned int lo, unsigned int hi, unsigned int shift)`: Concatenate `hi` and `lo`, shift left by `shift & 31` bits, return the most significant 32 bits.
- `unsigned int __funnelshift_lc(unsigned int lo, unsigned int hi, unsigned int shift)`: Concatenate `hi` and `lo`, shift left by `min(shift, 32)` bits, return the most significant 32 bits.
- `unsigned int __funnelshift_r(unsigned int lo, unsigned int hi, unsigned int shift)`: Concatenate `hi` and `lo`, shift right by `shift & 31` bits, return the least significant 32 bits.

The HIP-Clang implementation of `__ffs()` and `__ffsll()` contains code to add a constant +1 to produce the ffs result format. For cases where this overhead is not acceptable and the programmer is willing to specialize for the platform, HIP-Clang provides `__lastbit_u32_u32(unsigned int input)` and `__lastbit_u32_u64(unsigned long long int input)`.
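The bit-level behavior described for several of these intrinsics can be checked against a host-side reference model. The sketch below is plain C++ following the textual descriptions above. The `ref_` functions are hypothetical names, not HIP APIs, and they are not the HIP-Clang implementations. `ref_lastbit` assumes that the last-bit result is the ffs result minus one, as implied by the +1 adjustment described above.

```cpp
#include <cassert>
#include <cstdint>

// Reference model of __brev: reverse the bit order of a 32-bit value.
inline uint32_t ref_brev(uint32_t x) {
    uint32_t r = 0;
    for (int i = 0; i < 32; ++i) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}

// Reference model of __ffs: 1-based position of the least significant
// set bit, or 0 when no bit is set.
inline uint32_t ref_ffs(uint32_t x) {
    if (x == 0) return 0;
    uint32_t pos = 1;
    while ((x & 1u) == 0) {
        x >>= 1;
        ++pos;
    }
    return pos;
}

// Assumed relation: the last-bit index is the ffs result without the +1.
inline int ref_lastbit(uint32_t x) { return static_cast<int>(ref_ffs(x)) - 1; }

// Concatenate hi:lo into 64 bits for the funnel shifts.
inline uint64_t concat(uint32_t lo, uint32_t hi) {
    return (static_cast<uint64_t>(hi) << 32) | lo;
}

// Shift left by (shift & 31), return the most significant 32 bits.
inline uint32_t ref_funnelshift_l(uint32_t lo, uint32_t hi, uint32_t shift) {
    return static_cast<uint32_t>((concat(lo, hi) << (shift & 31u)) >> 32);
}

// Shift left by min(shift, 32), return the most significant 32 bits.
inline uint32_t ref_funnelshift_lc(uint32_t lo, uint32_t hi, uint32_t shift) {
    uint32_t s = shift < 32u ? shift : 32u;
    return static_cast<uint32_t>((concat(lo, hi) << s) >> 32);
}

// Shift right by (shift & 31), return the least significant 32 bits.
inline uint32_t ref_funnelshift_r(uint32_t lo, uint32_t hi, uint32_t shift) {
    return static_cast<uint32_t>(concat(lo, hi) >> (shift & 31u));
}
```

The wrapping and clamping variants differ only for shifts of 32 or more. With a shift of 32, the wrapping form shifts by zero bits and returns `hi`, while the clamping form shifts by 32 bits and returns `lo`.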
The index returned by the `__lastbit_` instructions starts at -1, while for ffs the index starts at 0.

Following is the list of supported floating-point intrinsics. Note that intrinsics are supported on device only.

Note: Only the round-to-nearest-even rounding mode is supported on AMD GPUs by default. The `_rz`, `_ru`, and `_rd` suffixed intrinsic functions exist in the HIP AMD backend if the `OCML_BASIC_ROUNDED_OPERATIONS` macro is defined.

Table 4: Single precision intrinsics mathematical functions

- `float __cosf(float x)`: Returns the fast approximate cosine of x.
- `float __exp10f(float x)`: Returns the fast approximation of 10^x.
- `float __expf(float x)`: Returns the fast approximation of e^x.
- `float __fadd_rn(float x, float y)`: Add two floating-point values in round-to-nearest-even mode.
- `float __fdiv_rn(float x, float y)`: Divide two floating-point values in round-to-nearest-even mode.
- `float __fmaf_rn(float x, float y, float z)`: Returns x × y + z as a single operation in round-to-nearest-even mode.
- `float __fmul_rn(float x, float y)`: Multiply two floating-point values in round-to-nearest-even mode.
- `float __frcp_rn(float x)`: Returns 1/x in round-to-nearest-even mode.
- `float __frsqrt_rn(float x)`: Returns 1/√x in round-to-nearest-even mode.
- `float __fsqrt_rn(float x)`: Returns √x in round-to-nearest-even mode.
- `float __fsub_rn(float x, float y)`: Subtract two floating-point values in round-to-nearest-even mode.
- `float __log10f(float x)`: Returns the fast approximation of the base 10 logarithm of x.

Table 5: Double precision intrinsics mathematical functions

- `double __dadd_rn(double x, double y)`: Add two floating-point values in round-to-nearest-even mode.
- `double __ddiv_rn(double x, double y)`: Divide two floating-point values in round-to-nearest-even mode.
- `double __dmul_rn(double x, double y)`: Multiply two floating-point values in round-to-nearest-even mode.
- `double __drcp_rn(double x)`: Returns 1/x in round-to-nearest-even mode.
- `double __dsqrt_rn(double x)`: Returns √x in round-to-nearest-even mode.
- `double __dsub_rn(double x, double y)`: Subtract two floating-point values in round-to-nearest-even mode.
- `double __fma_rn(double x, double y, double z)`: Returns x × y + z as a single operation in round-to-nearest-even mode.

The indexing functions (starting with thread-index) show the terminology for a 1D grid. Some APIs use reverse order of xyz / 012 indexing for 3D grids.

The host-side functions used for cooperative kernel launches include `hipModuleLaunchCooperativeKernelMultiDevice`.

The following cooperative groups classes can be used on the device side.

- `cooperative_groups::thread_group`: The base type of all cooperative group types. Holds the key properties of a constructed cooperative group object, such as the group type and its size. Note: The cooperative groups feature is implemented on Linux and is under development on Microsoft Windows. Subclassed by `cooperative_groups::coalesced_group`, `cooperative_groups::grid_group`, `cooperative_groups::multi_grid_group`, `cooperative_groups::thread_block`, and `cooperative_groups::tiled_group`.
- `class thread_block : public cooperative_groups::thread_group`: The workgroup (thread-block in CUDA terminology) cooperative group type. Represents an intra-workgroup cooperative group type, where the participating threads within the group are the same threads that participated in the currently executing workgroup. Note: This is implemented on Linux and is under development on Microsoft Windows.
- `class grid_group : public cooperative_groups::thread_group`: The grid cooperative group type.
Represents an inter-workgroup cooperative group type, where the participating threads within the group span multiple workgroups running the same kernel on the same device. Note: This is implemented on Linux and is under development on Microsoft Windows.

- `class multi_grid_group : public cooperative_groups::thread_group`: The multi-grid cooperative group type. Represents an inter-device cooperative group type, where the participating threads within the group span multiple devices running the same kernel on these devices. Note: The multi-grid cooperative group type is implemented on Linux and is under development on Microsoft Windows.
- `class thread_block_tile : public cooperative_groups::impl::thread_block_tile_internal<size, ParentCGTy>`: Group type `thread_block_tile`. Represents one tiled thread group in a wavefront. This group type also supports sub-wave level intrinsics. Note: This type is implemented on Linux and is under development on Microsoft Windows.

Member functions:

- `unsigned int thread_rank() const`: Rank of the calling thread within [0, `size()`).
- Synchronizes the threads in the group. Causes all threads in the group to wait at this synchronization point, and for all shared and global memory accesses by the threads to complete, before running synchronization. This guarantees the visibility of accessed data for all threads in the group. Note: There are potential read-after-write (RAW), write-after-read (WAR), or write-after-write (WAW) hazards when threads in the group access the same addresses in shared or global memory. These data hazards can be avoided by synchronizing the group.
- Returns the linear rank of the group within the set of tiles partitioned from a parent group (bounded by `meta_group_size`).
- `unsigned int meta_group_size() const`: Returns the number of groups created when the parent group was partitioned.
- `T shfl(T var, int srcRank) const`: Shuffle operation on group level. Exchanges variables between threads without use of shared memory.
Shuffle operation is a direct copy of `var` from the thread with ID `srcRank` in the group. `T` can be a 32-bit integer or single-precision floating point.

- `T shfl_down(T var, unsigned int lane_delta) const`: Shuffle down operation on group level. Exchanges variables between threads without use of shared memory. Shuffle down copies `var` from the thread whose ID within the group is higher than the caller's thread ID by `lane_delta`. `T` can be a 32-bit integer or single-precision floating point.
- Shuffle up operation on group level (templated on `T`). Exchanges variables between threads without use of shared memory. Shuffle up copies `var` from the thread whose ID within the group is lower than the caller's thread ID by `lane_delta`. `T` can be a 32-bit integer or single-precision floating point.
- `T shfl_xor(T var, unsigned int laneMask) const`: Shuffle xor operation on group level. Exchanges variables between threads without use of shared memory. Shuffle xor copies `var` from the thread whose ID within the group is the XOR of `laneMask` and the caller's thread ID.
- `unsigned long long ballot(int pred) const`: Ballot function on group level. Returns a bit mask with the Nth bit set to one if the Nth thread's predicate evaluates to true. Parameters: `pred` [in], the predicate to evaluate on group threads.
- `int any(int pred) const`: Any function on group level. Returns non-zero if the predicate evaluates to true for any thread. Parameters: `pred` [in], the predicate to evaluate on group threads.
- `int all(int pred) const`: All function on group level. Returns non-zero if the predicate evaluates to true for all threads. Parameters: `pred` [in], the predicate to evaluate on group threads.
- `template<typename T> unsigned long long match_any(T value) const`: Match any function on group level. Returns a bit mask containing a 1-bit for every participating thread that has the same value in `value` as the caller thread. Parameters: `value` [in], the value to examine on the current thread in the group.
- `template<typename T> unsigned long long match_all(T value, int &pred) const`: Match all function on group level. Returns a bit mask containing a 1-bit for every participating thread if they all have the same value in `value` as the caller thread. The predicate `pred` is set to true if all participating threads have the same value in `value`.
- `class coalesced_group : public cooperative_groups::thread_group`: The `coalesced_group` cooperative group type. Represents an active thread group in a wavefront. This group type also supports sub-wave level intrinsics. Note: This is implemented on Linux and is under development on Microsoft Windows.

The following functions are used to construct different group-type instances on the device side. These include `cooperative_groups::this_grid` and `cooperative_groups::this_thread_block`.

The following functions are the exposed API for different group-type instances on the device side.
Warning: doxygenfunction: Cannot find function 'cooperative_groups::is_valid' in doxygen xml output for project 'HIP 6.1.40092 Documentation' from directory: /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-hip/checkouts/docs6.1.2/docs/doxygen/xml Warning: doxygenfunction: Cannot find function 'cooperative_groups::sync' in doxygen xml output for project 'HIP 6.1.40092 Documentation' from directory: /home/docs/checkouts/readthedocs.org/user_builds/advancedmicro-devices-hip/checkouts/docs-6.1.2/docs/doxygen/xml The following functions are located in the https://github.com/ROCm/ROCR-Runtime repository. hsa_status_t hsa_amd_vmem_address_reserve ( void **va, size_t size, uint64_t address, uint64_t flags ) Allocate a reserved address range. Reserve a virtual address range. The size must be a multiple of the system page size. If it is not possible to allocate the address specified by address , then va will be a different address range. Address range should be released by calling hsa_amd_vmem_address_free. Note that this API will be deprecated in a future release and replaced by hsa_amd_vmem_address_reserve_align hsa_status_t hsa_amd_vmem_address_free ( void *va, size_t size ) Free a reserved address range. Free a previously allocated address range. The size must match the size of a previously allocated address range. · ::HSA_STATUS_ERROR - Internal unexpected error hsa_status_t hsa_amd_vmem_handle_create ( hsa_amd_memory_pool_t pool, size_t size, hsa_amd_memory_type_t type, uint64_t flags, hsa_amd_vmem_alloc_handle_t *memory_handle ) Create a virtual memory handle. 
Create a virtual memory handle within this pool. size must be aligned to the allocation granule size for this memory pool; see HSA_AMD_MEMORY_POOL_INFO_RUNTIME_ALLOC_GRANULE. To minimize internal memory fragmentation, align the size to the recommended allocation granule size; see HSA_AMD_MEMORY_POOL_INFO_RUNTIME_ALLOC_REC_GRANULE.

hsa_status_t hsa_amd_vmem_handle_release ( hsa_amd_vmem_alloc_handle_t memory_handle )
Release a virtual memory handle. memory_handle - [in] handle that was previously allocated.

hsa_status_t hsa_amd_vmem_map ( void *va, size_t size, size_t in_offset, hsa_amd_vmem_alloc_handle_t memory_handle, uint64_t flags )
Map a virtual memory handle. Maps a virtual memory handle to a reserved address range. The virtual address requested must be within a previously reserved address range: both va and (va + size) must lie within the previously reserved range. size must be equal to the size of the memory_handle. hsa_amd_vmem_set_access needs to be called to make the memory accessible to specific agents.

Unmap a virtual memory handle. Unmaps a previously mapped virtual address range.

hsa_status_t hsa_amd_vmem_set_access ( void *va, size_t size, const hsa_amd_memory_access_desc_t *desc, size_t desc_cnt )
Make a memory mapping accessible. Makes a previously mapped virtual address accessible to specific agents. size must be equal to the size of the previously mapped virtual memory handle. Calling hsa_amd_vmem_set_access multiple times on the same va overwrites the previous permissions for all agents.

hsa_status_t hsa_amd_vmem_get_access ( void *va, hsa_access_permission_t *perms, hsa_agent_t agent_handle )
Get current access permissions for memory mapping. Gets the access permissions of a memory mapping for a specific agent.

hsa_status_t hsa_amd_vmem_export_shareable_handle ( int *dmabuf_fd, hsa_amd_vmem_alloc_handle_t handle, uint64_t flags )
Get an exportable shareable handle. Gets an exportable shareable handle for a memory_handle.
This shareable handle can then be used to re-create a virtual memory handle using hsa_amd_vmem_import_shareable_handle. The shareable handle can be transferred using mechanisms that support POSIX file descriptors. Once all shareable handles are closed, the memory_handle is released.

Import a shareable handle. Imports a shareable handle for a memory handle. Importing a shareable handle that has been closed and released results in undefined behavior.

hsa_status_t hsa_amd_vmem_retain_alloc_handle ( hsa_amd_vmem_alloc_handle_t *memory_handle, void *addr )
Returns memory handle for mapped memory. Returns a memory handle for previously mapped memory. The handle has the same value as the handle used to map the memory. The returned handle must be released with a corresponding number of calls to hsa_amd_vmem_handle_release.

hsa_status_t hsa_amd_vmem_get_alloc_properties_from_handle ( hsa_amd_vmem_alloc_handle_t memory_handle, hsa_amd_memory_pool_t *pool, hsa_amd_memory_type_t *type )
Returns the current allocation properties of a handle. Returns the allocation properties of an existing handle.

hipError_t hipMallocManaged ( void **dev_ptr, size_t size, unsigned int flags )
Allocates memory that will be automatically managed by HIP. This API is used for managed memory; it allows data to be shared and accessed by both the CPU and the GPU using a single pointer. The API returns the allocation pointer, managed by HMM, which can then be used to execute kernels on the device and to fetch data between the host and device as needed. Note: It is recommended to do the capability check before calling this API. Returns: hipSuccess, hipErrorMemoryAllocation, hipErrorNotSupported, hipErrorInvalidValue

hipError_t hipMemPrefetchAsync ( const void *dev_ptr, size_t count, int device, hipStream_t stream )
Prefetches memory to the specified destination device using HIP. Note: This API is implemented on Linux and is under development on Microsoft Windows.
hipSuccess, hipErrorInvalidValue hipError_t hipMemAdvise ( const void *dev_ptr, size_t count, hipMemoryAdvise advice, int device ) Advise about the usage of a given memory range to HIP. This HIP API advises about the usage to be applied on unified memory allocation in the range starting from the pointer address devPtr, with the size of count bytes. The memory range must refer to managed memory allocated via the API hipMallocManaged, and the range will be handled with proper round down and round up respectively in the driver to be aligned to CPU page size, the same way as corresponding CUDA API behaves in CUDA version 8.0 and afterwards. Note: This API is implemented on Linux and is under development on Microsoft Windows. hipSuccess, hipErrorInvalidValue hipError_t hipMemRangeGetAttribute ( void *data, size_t data_size, hipMemRangeAttribute attribute, const void *dev_ptr, size_t count ) Query an attribute of a given memory range in HIP. Note: This API is implemented on Linux and is under development on Microsoft Windows. hipSuccess, hipErrorInvalidValue hipError_t hipMemRangeGetAttributes ( void **data, size_t *data_sizes, hipMemRangeAttribute *attributes, size_t num_attributes, const void *dev_ptr, size_t count ) Query attributes of a given memory range in HIP. Note: This API is implemented on Linux and is under development on Microsoft Windows. hipSuccess, hipErrorInvalidValue hipError_t hipStreamAttachMemAsync ( hipStream_t stream, void *dev_ptr, size_t length, unsigned int flags ) Attach memory to a stream asynchronously in HIP. Warning: This API is under development. Currently it is a no-operation (NOP) function on AMD GPUs and returns hipSuccess. hipSuccess, hipErrorInvalidValue static inline hipError_t hipMallocManaged ( T **devPtr, size_t size, unsigned int flags = hipMemAttachGlobal ) Provide an override to automatically typecast the pointer type from void**, and also provide a default for the flags. 
HIP_DISABLE_CPP_FUNCTIONS macro can be defined to suppress these wrappers. It is useful for applications which need to obtain decltypes of HIP runtime APIs. hipMallocManaged hipError_t hipMemAddressFree ( void *devPtr, size_t size ) Frees an address range reservation made via hipMemAddressReserve. Note: This API is implemented on Linux and is under development on Microsoft Windows. Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues. hipSuccess, hipErrorInvalidValue, hipErrorNotSupported hipError_t hipMemAddressReserve ( void **ptr, size_t size, size_t alignment, void *addr, unsigned long long flags ) Reserves an address range. Note: This API is implemented on Linux and is under development on Microsoft Windows. Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues. hipSuccess, hipErrorInvalidValue, hipErrorNotSupported hipError_t hipMemCreate ( hipMemGenericAllocationHandle_t *handle, size_t size, const hipMemAllocationProp *prop, unsigned long long flags ) Creates a memory allocation described by the properties and size. Note: This API is implemented on Linux and is under development on Microsoft Windows. Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues. hipSuccess, hipErrorInvalidValue, hipErrorNotSupported hipError_t hipMemExportToShareableHandle ( void *shareableHandle, hipMemGenericAllocationHandle_t handle, hipMemAllocationHandleType handleType, unsigned long long flags ) Exports an allocation to a requested shareable handle type. Note: This API is implemented on Linux and is under development on Microsoft Windows. Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues. 
hipSuccess, hipErrorInvalidValue, hipErrorNotSupported hipError_t hipMemGetAccess ( unsigned long long *flags, const hipMemLocation *location, void *ptr ) Get the access flags set for the given location and ptr. Note: This API is implemented on Linux and is under development on Microsoft Windows. Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues. hipSuccess, hipErrorInvalidValue, hipErrorNotSupported hipError_t hipMemGetAllocationGranularity ( size_t *granularity, const hipMemAllocationProp *prop, hipMemAllocationGranularity_flags option ) Calculates either the minimal or recommended granularity. Note: This API is implemented on Linux and is under development on Microsoft Windows. Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues. hipSuccess, hipErrorInvalidValue, hipErrorNotSupported hipError_t hipMemGetAllocationPropertiesFromHandle ( hipMemAllocationProp *prop, hipMemGenericAllocationHandle_t handle ) Retrieve the property structure of the given handle. Note: This API is implemented on Linux and is under development on Microsoft Windows. Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues. hipSuccess, hipErrorInvalidValue, hipErrorNotSupported hipError_t hipMemImportFromShareableHandle ( hipMemGenericAllocationHandle_t *handle, void *osHandle, hipMemAllocationHandleType shHandleType ) Imports an allocation from a requested shareable handle type. Note: This API is implemented on Linux and is under development on Microsoft Windows. Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues. 
hipSuccess, hipErrorInvalidValue, hipErrorNotSupported hipError_t hipMemMap ( void *ptr, size_t size, size_t offset, hipMemGenericAllocationHandle_t handle, unsigned long long flags ) Maps an allocation handle to a reserved virtual address range. Note: This API is implemented on Linux and is under development on Microsoft Windows. Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues. hipSuccess, hipErrorInvalidValue, hipErrorNotSupported hipError_t hipMemMapArrayAsync ( hipArrayMapInfo *mapInfoList, unsigned int count, hipStream_t stream ) Maps or unmaps subregions of sparse HIP arrays and sparse HIP mipmapped arrays. Warning: This API is under development. Currently it is not supported on AMD GPUs and returns hipErrorNotSupported. hipSuccess, hipErrorInvalidValue, hipErrorNotSupported hipError_t hipMemRelease ( hipMemGenericAllocationHandle_t handle ) Release a memory handle representing a memory allocation which was previously allocated through hipMemCreate. Note: This API is implemented on Linux and is under development on Microsoft Windows. Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues. handle -[in] - handle of the memory allocation. hipSuccess, hipErrorInvalidValue, hipErrorNotSupported hipError_t hipMemRetainAllocationHandle ( hipMemGenericAllocationHandle_t *handle, void *addr ) Returns the allocation handle of the backing memory allocation given the address. Note: This API is implemented on Linux and is under development on Microsoft Windows. Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues. hipSuccess, hipErrorInvalidValue, hipErrorNotSupported hipError_t hipMemSetAccess ( void *ptr, size_t size, const hipMemAccessDesc *desc, size_t count ) Set the access flags for each location specified in desc for the given virtual address range. 
Note: This API is implemented on Linux and is under development on Microsoft Windows. Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues. hipSuccess, hipErrorInvalidValue, hipErrorNotSupported

Unmap memory allocation of a given address range. Note: This API is implemented on Linux and is under development on Microsoft Windows. Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues. hipSuccess, hipErrorInvalidValue, hipErrorNotSupported

Several of our API functions have been flagged for deprecation. Using the following functions results in errors and unexpected results, so we encourage you to update your code accordingly.

CUDA supports the cuCtx API, which is the driver API that defines 'Context' and 'Devices' as separate entities. A context contains a single device, and a device can theoretically have multiple contexts. HIP initially added limited support for these APIs in order to facilitate porting from existing driver codes. These APIs are now marked as deprecated because there are better alternative interfaces (such as hipSetDevice or the stream API) to achieve the same functions.

This tutorial explains the basic concepts of the single-source Heterogeneous-computing Interface for Portability (HIP) programming model and the essential tooling around it. It also reviews some commonalities of heterogeneous APIs in general. This topic assumes basic familiarity with the C/C++ compilation model and language.

To follow this tutorial, you'll need installed drivers and a HIP compiler toolchain to compile your code. Because HIP for ROCm supports compiling and running on Linux and Windows with AMD and NVIDIA GPUs, the combinations of install instructions are more than this tutorial can cover. For more information about installing HIP development packages, see Install HIP.
Heterogeneous programming and offloading APIs are often mentioned together. Heterogeneous programming deals with devices of varying capabilities simultaneously. Offloading focuses on the 'remote' and asynchronous aspects of computation. HIP encompasses both. It exposes GPGPU (general-purpose GPU) programming much like ordinary host-side CPU programming and lets you move data across various devices.

When programming in HIP (and other heterogeneous APIs for that matter), remember that target devices are built for a specific purpose. They are designed with different tradeoffs than traditional CPUs and therefore have very different performance characteristics. Even subtle changes in code might adversely affect execution time.

First, let's do the 'Hello, World!' of GPGPU: SAXPY. Single-precision A times X Plus Y (SAXPY) is a mathematical acronym for the vector equation a · x + y = z, where a ∈ ℝ is a scalar and x, y, z ∈ V are vectors of some large dimensionality. This vector space is defined over the set of reals. Practically speaking, you can compute this using a single for loop over three arrays. In linear algebra libraries, such as BLAS (Basic Linear Algebra Subprograms), this operation is defined as AXPY, 'A times X Plus Y'. The 'S' comes from single-precision, meaning that every array element is a float (IEEE 754 binary32 representation).

To quickly get started, use the set of HIP samples from GitHub. With Git configured on your machine, open a command line and navigate to your desired working directory, then run:

A simple implementation of SAXPY resides in the HIP-Basic/saxpy/main.hip file in this repository. The HIP code here mostly deals with where data has to be and when, and how devices transform this data. The first HIP calls deal with allocating device-side memory and copying data from host-side memory to the device side in a C runtime-like fashion.
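Before looking at the device code, it can help to have a host-side reference for the single-loop formulation of SAXPY mentioned above. The following is a minimal C++ sketch, not the code from the samples repository; the name `saxpy_reference` is a hypothetical helper chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for SAXPY: returns z with z[i] = a * x[i] + y[i].
// `saxpy_reference` is a hypothetical name, not part of HIP or BLAS.
std::vector<float> saxpy_reference(float a,
                                   const std::vector<float>& x,
                                   const std::vector<float>& y)
{
    assert(x.size() == y.size());  // both inputs must have the same length
    std::vector<float> z(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        z[i] = a * x[i] + y[i];    // one multiply-add per element
    return z;
}
```

A reference like this is mainly useful for validating the output of a device kernel element by element; on the device, the loop body is what each thread would compute for its own index.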
HIP_CHECK is a custom macro borrowed from the examples utilities. It checks the error code returned by API functions and reports any errors to the console. It is not essential to the API, but checking the error codes of HIP APIs is good practice, because you might pass incorrect values to the API or the API might run out of resources.

The code selects the device to allocate to and to copy to. Commands are issued to the HIP runtime per thread, and every thread has a device set as the target of commands. The default device is 0, which is equivalent to calling hipSetDevice(0).

Launch the calculation on the device after the input data has been prepared. Look at the signature of the offloaded function:

This function is launched from the host using a language extension often called the triple chevron syntax. Inside the angle brackets, provide the following.

The block size and shared memory become important later in Reduction. For now, a hardcoded 256 is a safe default for simple kernels such as this. Following the triple chevron is ordinary function argument passing. Look at how the kernel is implemented.

Retrieval of the result from the device is done much like the input data copy. In this step the results are copied from device to host, the opposite direction of the input data copy:

Strictly speaking, there's no such thing as 'setting up the command line for compilation' on Linux. To make invocations more terse, Linux and Windows examples follow.

While distro maintainers might package ROCm so that it installs to system-default locations, AMD's packages aren't installed that way. They need to be added to the PATH by the user. Note: Docker images distributed by AMD, such as rocm-terminal, already have /opt/rocm/bin on the PATH for convenience. This subtly affects the CMake package detection logic of ROCm libraries.

Both distro maintainers and NVIDIA package CUDA so that nvcc and related tools are available on the command line by default.
You can call the compiler on the command line with:

Windows compilers and command line tooling have traditionally relied on extra environment variables and PATH entries to function correctly. Visual Studio refers to command lines with this setup as 'Developer Command Prompt' or 'Developer PowerShell' for cmd.exe and PowerShell respectively.

The HIP SDK on Windows doesn't include a complete toolchain, and neither does the CUDA SDK when you target the NVIDIA back-end. You will also need:

If you don't have a version of Visual Studio 2022 installed, for a minimal command line experience, install the Build Tools for Visual Studio 2022 with the Desktop Development Workload. Under Individual Components select:

Note: The 'C++ CMake tools for Windows' individual component is a convenience which puts both cmake.exe and ninja.exe onto the PATH inside developer command prompts. You can install these manually, but then you must manage them manually.

Visual Studio 2017 and later are detectable as COM object instances via WMI. To set up a command line from any shell for the latest Visual Studio's default Visual C++ toolset, issue:

You should be able to call the compiler on the command line now:

To compile and link a single-file application, use the following commands:

Depending on your computer, the resulting binary might or might not run. If not, it typically complains about 'Invalid device function'. That error (corresponding to the hipErrorInvalidDeviceFunction entry of hipError_t) means that the runtime could not find a device program binary of the appropriate flavor embedded into the executable.

So far, the discussion has covered how data makes it from the host to the device and back. It has also discussed the device code as source, with the HIP runtime selecting the correct binary to dispatch for execution. How can you find out what device binary flavors are embedded into the executable?

The utilities included with ROCm help significantly to inspect binary artifacts on disk. Add the ROCmCC installation folder to your PATH if you want to use these utilities (the utilities expect them to be on the PATH). You can list embedded program binaries using roc-obj-ls.

The compiler embedded a version 4 code object (more on code object versions) and used the LLVM target triple amdgcn-amd-amdhsa-gfx803 (more on target triples). You can extract that program object in a disassembled fashion for human consumption via roc-obj. This creates two files on disk; the one with the .s extension is of most interest. Opening this file or dumping it to the console using cat lets you find the disassembled binary of the SAXPY compute kernel, something similar to:

Alternatively, call the compiler with --save-temps to dump all device binaries to disk in separate files. List all the temporaries created while compiling main.hip with:

Files with the .s extension hold the disassembled contents of the binary.
The filename notes the graphics IPs used by the compiler. The contents of this file are similar to what roc-obj printed to the console.

Unlike HIP on AMD, when compiling using the NVIDIA support of HIP, the resulting binary is a valid CUDA executable as far as the binary goes. Therefore, it incorporates PTX ISA (Parallel Thread eXecution Instruction Set Architecture) instead of AMDGPU binary. As a result, tooling shipped with the CUDA SDK can be used to inspect which device ISA was compiled into a specific executable. The most useful tool for this purpose is cuobjdump.

From this you can see that the saxpy kernel is stored as sm_52, which shows that a compute capability 5.2 ISA was embedded into the executable, so devices with compute capability 5.2 or newer will be able to run this code.

The HIP SDK for Windows doesn't yet include the roc-* set of utilities for working with binary artifacts. To find out what binary formats are embedded into an executable, you can use the dumpbin tool from the Windows SDK to obtain the raw data of the .hip_fat section of an executable. (This binary payload is what gets parsed by the roc-* set of utilities on Linux.) Skipping over the reported header, the raw data rendered as ASCII has roughly three lines per entry. Depending on how many binaries are embedded, you may need to alter the number of rendered lines. An invocation such as:

The output may look like:

You can see that the compiler embedded a version 4 code object (more on code object versions) and used the LLVM target triple amdgcn-amd-amdhsa-gfx906 (more on target triples). Don't be alarmed about linux showing up as a binary format: AMDGPU binaries uploaded to the GPU for execution are proper Linux ELF binaries in their format.

Alternatively, you can call the compiler with --save-temps to dump all device binaries to disk in separate files.
Now you can list all the temporaries created while compiling main.hip via:

Files with the .s extension hold the disassembled contents of the binary, and the filename directly tells you the graphics IPs used by the compiler.

Unlike HIP on AMD, when compiling using the NVIDIA support for HIP, the resulting binary is a valid CUDA executable. Therefore, it incorporates PTX ISA (Parallel Thread eXecution Instruction Set Architecture) instead of AMDGPU binary. As a result, tooling included with the CUDA SDK can be used to inspect which device ISA was compiled into a specific executable. The most helpful tool for this purpose is cuobjdump.

This example shows that the SAXPY kernel is stored as sm_52. It also shows that a compute capability 5.2 ISA was embedded into the executable, so devices that support compute capability 5.2 or newer will be able to run this code.

Now that you've found which binary was embedded into the executable, find which format your available devices use.

On Linux, a utility called rocminfo lists all the properties of the devices available on the system, including which version of graphics IP (gfxXYZ) they employ. You can filter the output to have only these lines:

Now that you know which graphics IPs your devices use, recompile your program with the appropriate parameters. Now the sample will run.

On Linux HIP with the NVIDIA back-end, the deviceQuery CUDA SDK sample lists all the properties of the devices available on the system, including which version of compute capability a device has. A <major>.<minor> compute capability is passed to nvcc on the command line as sm_<major><minor>; for example, 8.6 is sm_86. Because it's not included as a binary, compile the matching example from ROCm.
Filter the output to have only the lines of interest, for example:

Note: Next to the nvcc executable is another tool called __nvcc_device_query, which prints the SM architecture numbers to standard out as a comma-separated list of numbers. The utility's name suggests it's not a user-facing executable but is used by nvcc to determine what devices are in the system at hand.

Now that you know which graphics IPs your devices use, recompile your program with the appropriate parameters. Note: If you want to portably target the development machine which is compiling, you may specify -arch=native instead. Now the sample will run.

On Windows, a utility called hipInfo.exe lists all the properties of the devices available on the system, including which version of graphics IP (gfxXYZ) they employ. Filter the output to have only these lines:

Now that you know which graphics IPs your devices use, recompile your program with the appropriate parameters. Now the sample will run.

On Windows HIP with the NVIDIA back-end, the deviceQuery CUDA SDK sample lists all the properties of the devices available on the system, including which version of compute capability a device has. A <major>.<minor> compute capability is passed to nvcc on the command line as sm_<major><minor>; for example, 8.6 is sm_86. Because it's not included as a binary, compile the matching example from ROCm. Filter the output to have only the lines of interest, for example:

Note: Next to the nvcc executable is another tool called __nvcc_device_query.exe, which simply prints the SM architecture numbers to standard out as a comma-separated list of numbers. The naming of this utility suggests it's not a user-facing executable but is used by nvcc to determine what devices are in the system at hand.

Now that you know which graphics IPs your devices use, recompile your program with the appropriate parameters.
Note: If you want to portably target the development machine which is compiling, you may specify -arch=native instead. Now the sample will run.

Reduction is a common algorithmic operation used in parallel programming to reduce an array of elements into a shorter array of elements or a single value. This document uses reduction to introduce some key considerations for designing and optimizing GPU algorithms.

This document is a rejuvenation and extension of the invaluable work of Mark Harris. It starts from a less naive baseline, but revisits some of the original material to show how much the underlying hardware has changed since then.

Reduction has many names depending on the domain: in functional programming it's referred to as fold; in C++ it's std::accumulate and, since C++17, std::reduce. A reduction takes a range of inputs and 'reduces' the given range with a binary operation to a single scalar output. Canonically, a reduction requires a 'zero' element that bootstraps the algorithm and serves as one of the initial operands to the binary operation. The 'zero' element is generally called the identity or neutral element in group theory, which implies that it is an operand that doesn't change the result. Typical use cases include calculating a sum, normalizing a dataset, and finding the maximum value in a dataset. The last use case is discussed further in this tutorial.

There are multiple variations of reduction that allow parallel processing. The approach taken by std::reduce requires the user-provided binary operator to operate on any combination of identity and input range elements, or even exclusively on any of them. This allows you to insert any number of identities to facilitate parallel processing and then combine the partial results of parallel execution.
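To illustrate why inserting identities is harmless, the following host-side C++ sketch pads the input with the identity of the maximum operation, reduces fixed-size chunks independently, and then combines the partial results. It is an illustration under stated assumptions, not code from the tutorial: `chunked_max` is a hypothetical helper, and negative infinity is used as the identity for `max` (NaN inputs are not considered).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <numeric>
#include <vector>

// Hypothetical helper: maximum of `input`, computed chunk by chunk.
// Each chunk could be handled by an independent worker.
float chunked_max(const std::vector<float>& input, std::size_t chunk)
{
    assert(chunk > 0);
    // Identity of max: combining it with any value leaves that value unchanged.
    const float identity = -std::numeric_limits<float>::infinity();
    const auto op = [](float a, float b) { return std::max(a, b); };

    // Pad with identities so that every chunk is full.
    std::vector<float> padded(input);
    const std::size_t padded_size = (input.size() + chunk - 1) / chunk * chunk;
    padded.resize(padded_size, identity);

    // Reduce each chunk on its own, producing one partial result per chunk.
    std::vector<float> partials;
    for (std::size_t begin = 0; begin < padded.size(); begin += chunk) {
        const auto first = padded.begin() + static_cast<std::ptrdiff_t>(begin);
        const auto last = first + static_cast<std::ptrdiff_t>(chunk);
        partials.push_back(std::accumulate(first, last, identity, op));
    }

    // Combine the partial results with the same operator.
    return std::accumulate(partials.begin(), partials.end(), identity, op);
}
```

The result does not depend on the chunk size, which is the property that lets a GPU implementation pick block sizes freely and feed identity elements to threads that have no input of their own.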
Implementing reductions on GPUs requires a basic understanding of the HIP programming model. The document explores aspects of low-level optimization best discussed through the inherent thread model, and refrains from using cooperative groups.

Synchronizing parallel threads of execution across a GPU is crucial for correctness, because partial results can't be combined before they have been produced. Synchronizing all the threads running on a GPU at any given time is possible; however, it is a costly and intricate operation. If synchronization is not absolutely necessary, map the parallel algorithm so that multiprocessors and blocks can make independent progress and need not sync frequently.

There are ten reduction implementations in the rocm-examples, which are described in the following sections.

The naive algorithm takes a tree-like shape, where the computational domain is purposefully distributed among blocks. In all blocks, all threads participate in loading data from persistent (from the kernel's perspective) global memory into shared memory. The block then performs the tree-like reduction in shared memory, and a single thread writes the partial result to global memory, in a location unique to the block, which allows the block to make independent progress. The partial results are combined in subsequent launches of the same kernel until a scalar result is reached.

This approach requires temporary storage based on the number of blocks launched, as each block outputs a scalar partial result. Depending on whether the input must be preserved or can be destroyed, a second temporary storage might be needed, which has to be large enough to store the results of the second kernel launch. Alternatively, you can reuse the storage of the larger-than-necessary original input. These implementations differ so slightly that the document only considers the use case where the input can be destroyed.

For threads that don't have unique inputs, feed zero_elem instances to the threads.
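The multi-pass structure described above can be modeled on the host to make the data flow explicit. The sketch below is a sequential C++ simulation, not the kernel from rocm-examples: `reduce_pass` stands in for one kernel launch, `reduce_max` for the host loop that relaunches it, and both names are hypothetical. It assumes a power-of-two block size of at least 2 and uses the lowest `int` as `zero_elem` for a maximum reduction.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Models one kernel launch: every block loads block_size elements into a
// "shared" buffer, reduces them in a tree, and writes one partial result.
std::vector<int> reduce_pass(const std::vector<int>& in,
                             std::size_t block_size, int zero_elem)
{
    // Assumption of this sketch: block_size is a power of two and >= 2.
    assert(block_size >= 2 && (block_size & (block_size - 1)) == 0);
    const std::size_t blocks = (in.size() + block_size - 1) / block_size;
    std::vector<int> out(blocks);
    std::vector<int> shared(block_size);
    for (std::size_t b = 0; b < blocks; ++b) {
        // Load phase: threads without a unique input receive zero_elem.
        for (std::size_t tid = 0; tid < block_size; ++tid) {
            const std::size_t gid = b * block_size + tid;
            shared[tid] = gid < in.size() ? in[gid] : zero_elem;
        }
        // Tree phase with the naive tid % (2 * i) == 0 indexing scheme.
        for (std::size_t i = 1; i < block_size; i *= 2)
            for (std::size_t tid = 0; tid < block_size; ++tid)
                if (tid % (2 * i) == 0)
                    shared[tid] = std::max(shared[tid], shared[tid + i]);
        out[b] = shared[0];  // a single thread writes the block's result
    }
    return out;
}

// Models the host loop: relaunch until a single scalar remains.
int reduce_max(std::vector<int> data, std::size_t block_size)
{
    const int zero_elem = std::numeric_limits<int>::lowest();
    if (data.empty()) return zero_elem;
    while (data.size() > 1)
        data = reduce_pass(data, block_size, zero_elem);
    return data[0];
}
```

Each pass shrinks the data by a factor of `block_size`, so the number of launches grows only logarithmically with the input size. The simulation returns a fresh vector per pass instead of swapping two preallocated buffers, which is a simplification of the double-buffering scheme.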
The backing of double-buffering is allocated as such:

Data is initialized on the host and dispatched to the device, after which the device-side reduction begins. The swapping of the double-buffer on the last iteration is omitted, therefore the result is in the back-buffer irrespective of the input size.

This structure persists in the kernel throughout all the variations of reduction with slight modifications to factor and shared memory allocation:

While the tid % (2 * i) == 0 indexing scheme yields correct results, it also leads to high thread divergence. Thread divergence occurs when the threads in a warp take different paths, so that they have to execute different instructions in a given clock cycle. It arises easily from if-else statements as shown here, but also from for loops whose trip count depends on the thread ID. Even though the number of active threads participating in the reduction decreases, warps remain active longer than necessary, as at least one lane in a warp hits the if statement.

You can reduce divergence by keeping dataflow between memory addresses identical but reassigning the thread ids. This way inactive threads start accumulating uniformly towards the higher thread ID index range and might uniformly skip to __syncthreads(). However, this introduces bank conflicts.

Both AMD and NVIDIA implement shared memory in the hardware by organizing storage into banks of various sizes. This hardware element is known as Local Data Share (LDS) on AMD hardware. On NVIDIA hardware, it's implemented using the same silicon as the L1 data cache. You can think of shared memory as a striped 2-dimensional range of memory. The count, width, and depth of the shared memory banks depend on the architecture. A bank conflict occurs when different threads in a warp access the same bank during the same operation.
In this case, the hardware prevents the attempted concurrent accesses to the same bank by converting them into serial accesses. A notable exception is when the shared read uniformly broadcasts to the same address across the entire warp. A better implementation of the naive algorithm is to form continuous ranges of the threads activities and their memory accesses. Note: To avoid bank conflicts, read shared memory in a coalesced manner, which implies that reads/writes of each lane in a warp evaluate to consecutive locations. Analyzing the read/write patterns could help you to understand the cause of bank conflicts. For more details, check CDNA3 ISA or RDNA3 ISA data share operations chapter. The preceding implementation is free of low-level GPU-specific anti-patterns. However, it still exhibits some common shortcomings. The loop performing the reduction in the shared memory starts from i = blockDim.x / 2 and the first predicate if (tid < i) immediately disables half of the block, which only helps load the data into the shared memory. You can change the kernel along with the calculation of factor on the host, as shown here: By eliminating half of the threads and giving meaningful work to all the threads by unconditionally performing a binary op , you can prevent the wastage of half of the threads. Even though global memory is read in a coalesced fashion, as preferred by the memory controller, optimal performance is still limited by the instruction throughput. Omit superfluous synchronization ----------- Warps are known to execute in a strict lockstep fashion. Therefore, once shared reduction reaches a point where only a single warp participates meaningfully, you can cut short the loop and let the rest of the warps terminate. Moreover, you can also unroll the loop without syncing the entire block. The tmp namespace used beyond this point in this document holds a handful of template meta-programmed utilities to facilitate writing flexible and optimal code. 
tmp::static_for is not just a constant folding within the optimizer but a variation of the language for loop, where the running index is a compile-time constant and is eligible for use in compile-time evaluated contexts. Consider the following code:

This compiles to the following binaries:

LLVM unrolls the loop and compiles to a flat series of printf invocations, while both GCC and MSVC keep the loop intact, as visible from the compare (cmp) and the jump (jne, jl) instructions. LLVM code generation is identical to manually writing the unrolled loop:

While various non-standard pragmas are available to hint or force the compiler to unroll the loop, we instead use template meta-programming to force feed the compiler the unrolled loop. The most notable structural difference is that in the language for loop, the loop variable is given a name in the beginning, while in the static_for utility, the loop variable is given a name in the end. An important bonus is that in the loop's body, you can use the running index i in contexts requiring constant expressions such as template arguments or inside if constexpr.

tmp::static_switch takes a runtime value and dispatches at run time to a set of tabulated functions, in which said value is a compile-time constant and is eligible for use in compile-time evaluated contexts. Consider the following code:

In the preceding code, note the code repetition for all possible values of warp_size that the code is prepared to handle. To avoid this, use tmp::static_switch, as shown:

Because HIP typically targets hardware with warp sizes of 32 (NVIDIA GPUs and RDNA AMD GPUs) and 64 (CDNA AMD GPUs), portable HIP code must handle both. That is why, instead of assuming a warp size of 32, you make the warp size a template argument of the kernel. This allows you to unroll the final loop using tmp::static_for in a parametric way while the code still reads much like an ordinary loop.
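The two utilities can be approximated in standard C++17. The sketch below is a simplified, hypothetical stand-in and not the `tmp::static_for` or `tmp::static_switch` from rocm-examples: `static_for<N>` only counts from 0 to N - 1, and `warp_size_switch` is assumed to handle exactly the warp sizes 32 and 64. The helper names `shuffle_steps`, `steps_for`, and `sum_of_squares_below_four` are likewise invented for illustration.

```cpp
#include <stdexcept>
#include <type_traits>
#include <utility>

// Hypothetical compile-time loop: calls f(std::integral_constant<int, I>{})
// for I = 0, 1, ..., N - 1. The argument's type carries the index, so the
// body can use it wherever a constant expression is required.
template <typename F, int... Is>
void static_for_impl(F&& f, std::integer_sequence<int, Is...>)
{
    (f(std::integral_constant<int, Is>{}), ...);
}

template <int N, typename F>
void static_for(F&& f)
{
    static_for_impl(std::forward<F>(f), std::make_integer_sequence<int, N>{});
}

// Hypothetical runtime-to-compile-time dispatch: the runtime warp size picks
// one of the tabulated instantiations of f.
template <typename F>
auto warp_size_switch(int warp_size, F&& f)
{
    switch (warp_size)
    {
        case 32: return f(std::integral_constant<int, 32>{});
        case 64: return f(std::integral_constant<int, 64>{});
        default: throw std::invalid_argument("unsupported warp size");
    }
}

// Number of halving steps a warp reduction needs for a given warp size.
template <int WarpSize>
constexpr int shuffle_steps()
{
    int steps = 0;
    for (int delta = WarpSize / 2; delta != 0; delta /= 2) ++steps;
    return steps;
}

// The runtime value warp_size becomes the template argument WarpSize.
inline int steps_for(int warp_size)
{
    return warp_size_switch(warp_size, [](auto ws) {
        return shuffle_steps<decltype(ws)::value>();
    });
}

// The loop index is named at the end (the lambda parameter) and is usable
// in constant expressions inside the body.
inline int sum_of_squares_below_four()
{
    int sum = 0;
    static_for<4>([&](auto i) {
        constexpr int I = decltype(i)::value;
        static_assert(I >= 0 && I < 4, "index is a compile-time constant");
        sum += I * I;
    });
    return sum;
}
```

Calling `steps_for(warp_size)` with a value queried at run time selects the matching instantiation without repeating the call site for every supported warp size.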
Promoting the warp size to a compile-time constant also requires you to handle it similarly on the host side. You can sandwich the kernel launch with tmp::static_switch, promoting the snake-case run-time warp_size variable to a camel-case compile-time constant WarpSize.

Note: Neither RDNA- nor CDNA-based AMD hardware provides guaranteed independent progress to lanes of the same warp. When targeting NVIDIA hardware, lanes of a warp might execute somewhat independently as long as the programmer assists the compiler using dedicated built-in functions. This feature is called Independent Thread Scheduling. The HIP headers don't expose the necessary warp primitives and their overloads. Portable applications can still tap into this feature with carefully #ifdef-ed code, but at this particular optimization level, it's a requirement. The code implicitly relies on the lockstep behavior of a ROCm wavefront, but CUDA warps don't share this property. You must synchronize all the active lanes of a warp to avoid a data race with some lanes progressing faster than others in the same warp.

While the previous step primarily aims to remove unnecessary syncing, it also unrolls the end of the loop. However, you could also force unrolling the first part of the loop. This saves a few scalar registers (values the compiler can prove to be uniform across warps). Introducing yet another template argument for the kernel and moving from for to tmp::static_for leads to the following two notable improvements:

Shared memory provides a fast communication path within a block; however, when performing reduction within the last warp, you can use a faster means of communication: warp-collective or cross-lane functions. Instead of using the hardware-backed shared memory, you can directly copy between the local memory (registers) of each lane in a warp. This can be achieved using the shuffle functions.
See how to use __shfl_down(), which is one of the most restrictive but also the most structured communication schemes. Using warp-collective functions for communication requires the control flow to be uniform across warps, as the name warp-collective implies. Therefore, you can see that the thread ID is being checked outside the loop, but the result is written inside due to variable scoping.

As mentioned in the previous step, communication between local memory is faster than shared memory. Instead of relying on the local memory only at the end of the tree-like reduction, a better approach is to turn the tree reduction inside out and perform multiple warp reductions in parallel on all active threads, thus communicating only their partial results through the shared memory.

The kernel versions differ too significantly to be described using a diff, so the kernel is presented afresh instead. The kernel signature and the reduction factor are the same as in previous cases; only the implementation differs. As we communicate the results of warps through shared memory, the same number of elements are required in the shared memory as warps within the block. Similar to how you can only launch kernels at block granularity, you can only warp reduce with WarpSize granularity due to the collective nature of the cross-lane builtins. To address this, you can use read_shared_safe to pad overindexing by reading zero_elem. Reading from global remains unaffected.

```
// Perform warp reductions and communicate results via shared
// for (uint32_t ActiveWarps = WarpCount;
//      ActiveWarps != 0;
//      ActiveWarps = ActiveWarps != 1 ?
//          divide_ceil(ActiveWarps, WarpSize) :
//          ActiveWarps = 0)
tmp::static_for<
    WarpCount,
    tmp::not_equal<0>,
    tmp::select<
        tmp::not_equal<1>,
        tmp::divide_ceil<WarpSize>,
        tmp::constant<0>>>([&]<uint32_t ActiveWarps>()
{
    if (wid < ActiveWarps)
    {
        // Warp reduction
        tmp::static_for<WarpSize / 2, tmp::not_equal<0>, tmp::divide<2>>([&]<int Delta>()
        {
            res = op(res, __shfl_down(res, Delta));
        });

        // Write warp result from local to shared
        if (lid == 0)
            shared[wid] = res;
    }
    __syncthreads();

    // Read warp result from shared to local
    res = read_shared_safe(tid);
```

ActiveWarps iterates from WarpCount until it reaches 0. Every iteration divides ActiveWarps by WarpSize. In cases where the partial result count isn't a multiple of WarpSize, an extra warp is needed, so the division uses tmp::divide_ceil, which always rounds toward positive infinity. The ternary tmp::select is required because such division never reaches 0, so you must terminate the loop after the last warp concludes.

In each iteration, if the warp is active, which means it has at least a single valid input, it carries out a pass of warp reduction and writes output based on warp ID. Reading is carried out based on thread ID. Global output continues to be based on block ID.

The previous sections explained how to reduce register usage to improve occupancy. This allows more blocks to execute in parallel on all multiprocessors, leading to more global store/load latency to be hidden. Reducing the number of kernels in flight while still carrying out the same workload reduces the wastage of registers while loading and maintaining bookkeeping variables such as kernel indices.

An example of this optimization is performing one binary op while loading input from global. Even though the operation is said to be carried out 'in flight', the two values are loaded into local memory (registers) before op is called.
A more general form of this optimization is wrapping most kernel logic in loops that carry out the workload of multiple kernel instances but require storing only a single instance of most of the bookkeeping logic. In code, this multiplicity factor is referred to via the ItemsPerThread compile-time constant, which is supplied by a template argument to allow for loop unrolling.

This kernel variant utilizes another generally applicable utility known as hip::static_array, which is a more restrictive wrapper over the builtin array than std::array, as it allows indexing only with compile-time constants using the usual tuple-like template <size_t I> auto get<I>(...) interface.

Note: On a GPU, there is no stack, and the local memory is provisioned from the register file. This provisioning takes place statically; in other words, the address range of a thread's local memory is determined at compile time. When an array is defined and used in the local storage, the compiler can only maintain its storage in the register file as long as all accesses to the array are computable by the compiler at compile time. The index doesn't need to be a compile-time constant as long as the compiler can resolve the addresses of the accesses through constant folding or some other means. If the compiler fails to do so, the array will be backed by global memory, which is indicated by a non-zero number of spill registers observable using static analysis tools. However, this is slower by multiple orders of magnitude. hip::static_array via its hip::get<> interface ensures that no such spills occur.

The kernel now has three compile-time configurable parameters. The only part of the kernel that changes is how you load data from global and perform the binary operation on those loaded values. So, the step that reads input from the front buffer is now split into two steps: reading ItemsPerThread items and processing ItemsPerThread items.
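The two steps, reading ItemsPerThread values and then processing them, can be sketched on the host in standard C++17. This is an illustrative model only: `read_safe`, `reduce_items`, and `thread_partial` are hypothetical names, `std::array` stands in for `hip::static_array`, and a plain loop replaces the compile-time indexed fold used on the device.

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Expand the compile-time index pack Is = 0, 1, ..., ItemsPerThread - 1 into
// one guarded read per item; reads past the end are padded with zero_elem.
template <typename T, int... Is>
std::array<T, sizeof...(Is)> read_safe_impl(const T* data, int size, int i, T zero_elem,
                                            std::integer_sequence<int, Is...>)
{
    return {{(i + Is < size ? data[i + Is] : zero_elem)...}};
}

// Step 1: read ItemsPerThread consecutive values starting at index i.
// The index is a signed integer, matching the signed indexing used for loads.
template <int ItemsPerThread, typename T>
std::array<T, ItemsPerThread> read_safe(const T* data, int size, int i, T zero_elem)
{
    return read_safe_impl(data, size, i, zero_elem,
                          std::make_integer_sequence<int, ItemsPerThread>{});
}

// Step 2: reduce the items held by one thread to a single scalar.
template <typename T, std::size_t N, typename Op>
T reduce_items(const std::array<T, N>& items, Op op, T zero_elem)
{
    T result = zero_elem;
    for (const T& item : items) result = op(result, item);
    return result;
}

// Work of a single "thread": read its ItemsPerThread inputs, then fold them.
template <int ItemsPerThread, typename T, typename Op>
T thread_partial(const T* data, int size, int tid, Op op, T zero_elem)
{
    const auto items = read_safe<ItemsPerThread>(data, size, tid * ItemsPerThread, zero_elem);
    return reduce_items(items, op, zero_elem);
}
```

For six inputs and ItemsPerThread set to 4, thread 0 reads four real values while thread 1 reads two real values and two padded `zero_elem` instances, so only a scalar per thread reaches the later reduction phases.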
The change to reading happens inside read_global_safe : Note that each array element is being loaded consecutively without the flexibility of a configurable ItemsPerThread property. This is morally equivalent to: This is exactly what's happening in the front[i + I]... fold-expression. However, this can only be issued if the entire read operates on real input without padding using zero_elem . If some reads over-index the input, the read turns into: This makes it easier for the compiler to recognize vector loads from global. As the performance at large is dominated by how you move the data, it's only natural to utilize dedicated instructions to move more data with less binary. This is evident by the huge performance improvement when loading two values per thread. For more information, see the compiler explorer to learn how loading for AMD (both RDNA and CDNA) compiles to global_load_dwordx4 , where x4 denotes the 4-vector variant of the instruction. Note: Note that read_global_safe , which used to take an uint32_t as the index type, now takes a signed integer. When indexing an array with unsigned integers, the compiler has to handle integer overflows, as the C/C++ standards defined them. It might happen that some part of the vector load indices overflow, thus resulting in a non-contiguous read. If you change the previously linked code to use an unsigned integer as the thread ID, the compiler won't emit a vector load. Signed integer overflow is an undefined behavior, and hence, unknown to the optimizer. To convey the absence of overflow to the compiler with unsigned indices, add __builtin_assume(gid + 4 > gid) , or the more portable [[assume]](gid + 4 > gid) , once amdclang++ supports it. read_global_safe implementation is an Immediately Invoked Lambda Expression (IILE), because ItemsPerThread is an integer value, while you need a compile-time iota -like sequence of integers as a pack for the fold-expressions to expand on. 
This can only occur as part of template argument deduction on the IILE. Once the kernel reads ItemsPerThread number of inputs to local, it immediately reduces them to a scalar. There is no reason to propagate the input element multiplicity to the warp reduction phase. Alter kernel launch and input fetching such that no more blocks are launched than what a subsequent kernel launch's single block can conveniently reduce, while performing multiple passes of input reading from global and combining their results before engaging in the end game tree-like reduction. With this method, you can save at least one to two kernel launches for large inputs. Warning: This modification can only be executed on AMD hardware. Perform the first step of the two-pass reduction, but in the end, instead of writing to global and reading it back in a subsequent kernel, write the partial results to the Global Data Share (GDS). This is an N+1 th shared memory that is accessed by all multiprocessors and is also on-chip memory. Note: The API doesn't guarantee the order in which blocks are scheduled even though all GPUs schedule them in the same monotonically increasing order of block ids. Relying on this implicitly, the last block of a grid is in the optimal position to observe the side effects of all other blocks (using spinlocks or other methods) without occupying a multiprocessor for longer than necessary. Without launching a second kernel, you can make the last block collect the results of all other blocks from GDS by implicitly exploiting the scheduling behavior or relying on another AMD-specific feature called Global Wave Sync (GWS) to merge them for a final tree-like reduction. Note: GDS and GWS are reserved runtime features that the HIP API doesn't cover. Invoking these functionalities requires inline AMDGCN assembly. Moreover, the fact that the runtime doesn't virtualize the GDS, imposes further restrictions on concurrent scheduling of other kernels. 
Optimizing code on GPUs, like on any other architecture, requires careful consideration and balancing of resources and costs of various operations to obtain optimal performance. This document explored optimizing reductions much beyond the territory of diminishing returns. This approach introduced multiple optimization techniques and discussed opportunities.

The document focused on reductions in which an entire device participates. The choice of compile-time constants, or even of the algorithm itself, might not be optimal when multiple blocks participate in multiple parallel reductions or when each thread performs its own reduction. When multiple devices participate in the same reduction, other aspects must be considered as well.

Most solutions, including the ones covered in this document, are given to the end users in a turnkey fashion via algorithm primitive libraries. These solutions might not be the fastest in all cases, but they are close to being the gold standard for carrying out certain operations as reasonably as possible.

This tutorial demonstrates the basic concepts of cooperative groups in the HIP (Heterogeneous-computing Interface for Portability) programming model and the most essential tooling supporting it. This topic also reviews the commonalities of heterogeneous APIs. Familiarity with the C/C++ compilation model and the language is assumed.

To follow this tutorial, you'll need properly installed drivers and a HIP compiler toolchain to compile your code. Because ROCm HIP supports compiling and running on Linux and Microsoft Windows with AMD and NVIDIA GPUs, review the HIP development package installation before starting this tutorial. For more information, see Install HIP.

To become familiar with heterogeneous programming, review the SAXPY tutorial and the first HIP code subsection. Compiling is also described in that tutorial.
You can use tiled partition to calculate the sum of partition_size length sequences and the sum of result_size / BlockSize length sequences. The host-side reference implementation is the following:

To calculate the sum of the sets of numbers, the tutorial uses the shared memory-based reduction on the device side. Unlike the reduction tutorial, this tutorial does not cover the use of warp-level intrinsics. The x input variable is a pointer to shared memory, which needs to be synchronized after every value change. The thread_group input parameter can be thread_block_tile or thread_block because thread_group is the parent class of these types. The val parameter holds the numbers to sum. The function returns the final result of the reduction on thread ID 0 of the thread_group, and 0 on every other thread.

The reduce_sum device function is reused to calculate the block and custom partition sum of the input numbers. The kernel has three sections:

In this code section, the shared memory is declared, the thread_block_group and custom_partition are defined, and the input variables are loaded from global memory.

In this code section, the sum is calculated at the thread_block_group level, then the results are stored in global memory.

In this code section, the sum is calculated at the custom partition level, then the results are stored in global memory. The custom partition is a part of the thread block, so the reduction operates on a shorter sequence of input numbers than in the thread_block_group case.

On the host side, the following steps are done in the example:

Only the first, second and fourth steps are important from the cooperative groups aspect, which is why those steps are detailed further. Not all AMD GPUs support cooperative groups.
You can confirm support with the following code:

In the example, there is only one block in the grid, and threads_per_block must be divisible by partition_size.

The kernel launch is done with the hipLaunchCooperativeKernel of the cooperative groups API.

With cooperative groups, you can easily use custom partitions to create custom tiles for custom solutions. You can find the complete code at cooperative groups ROCm example.

Copyright © 2008 - 2024 Advanced Micro Devices, Inc.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the 'Software'), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
-
-```
-HIP Documentation, Release 6.1.40092
-```
-
-On Windows, you can set AMD\_LOG\_LEVEL via environment variable from the advanced system settings or the command prompt (when run as administrator). The following example shows debug log information when calling the backend runtime.
-
-## 16.1.3 Set memory access
-
-Finally, use the hipMemSetAccess function to enable memory access. It accepts the pointer to the virtual memory, the size, and a hipMemAccessDesc descriptor as parameters. In a multi-GPU environment, you can map the device memory of one GPU to another. This feature also works with the traditional memory management system, but isn't as scalable as with virtual memory. When memory is allocated with hipMalloc , hipDeviceEnablePeerAccess is used to enable peer access. This function enables access between two devices, but it means that every call to hipMalloc takes more time to perform the checks and the mapping between the devices. When using virtual memory management, peer access is enabled by hipMemSetAccess , which provides a finer level of control over what is shared. This has no performance impact on memory allocation and gives you more control over what memory buffers are shared with which devices.
-**Following code does:** This snippet fills in a `hipMemAccessDesc` descriptor that names device `currentDev` as the accessing location and requests read/write protection, then passes it to `hipMemSetAccess` to enable access to the `padded_size`-byte virtual range starting at `ptr`.
-
-
-```
-hipMemAccessDesc accessDesc = {};
-accessDesc.location.type = HIP_MEM_LOCATION_TYPE_DEVICE;
-accessDesc.location.id = currentDev;
-accessDesc.flags = HIP_MEM_ACCESS_FLAGS_PROT_READWRITE;
-hipMemSetAccess(ptr, padded_size, &accessDesc, 1);
-```
-
-At this point the memory is allocated, mapped, and ready for use. You can read and write to it, just like you would a C style memory allocation.
-
-## 16.1.4 Free virtual memory
-
-To free the memory allocated in this manner, use the corresponding free functions. To unmap the memory, use hipMemUnmap . To release the virtual address range, use hipMemAddressFree . Finally, to release the physical memory, use hipMemRelease . Unlike hipFree , these functions do not synchronize the device when memory is released. Calling hipFree while multiple streams run in parallel synchronizes the device, which leads to worse resource usage and performance.
-**Following code does:** This snippet unmaps the virtual address range at `ptr` with `hipMemUnmap` and then releases the physical allocation referenced by `allocHandle` with `hipMemRelease`.
-
-
-```
-hipMemUnmap(ptr, size);
-hipMemRelease(allocHandle);
-```
-**Following code does:** This snippet releases the reserved virtual address range starting at `ptr` with `hipMemAddressFree`.
-
-
-```
-hipMemAddressFree(ptr, size);
-```
-
-## 16.2 Memory usage
-
-## 16.2.1 Dynamically increase allocation size
-
-The hipMemAddressReserve function allows you to increase the amount of pre-allocated memory. This function accepts a parameter representing the requested starting address of the virtual memory. This allows you to have a contiguous virtual address space without worrying about the underlying physical allocation.
-**Following code does:** This snippet requests a new virtual address range directly after the existing one (at `ptr + padded_size`) with `hipMemAddressReserve`, maps the physical allocation `newAllocHandle` into it with `hipMemMap`, and enables access with `hipMemSetAccess`, reusing the earlier `accessDesc`.
-
-
-```
- hipMemAddressReserve(&new_ptr, (new_size - padded_size), 0, ptr + padded_size, 0);
- hipMemMap(new_ptr, (new_size - padded_size), 0, newAllocHandle, 0);
- hipMemSetAccess(new_ptr, (new_size - padded_size), &accessDesc, 1);
-```
-
-The code sample above assumes that hipMemAddressReserve was able to reserve the memory address at the specified location. However, this isn't guaranteed to be true, so you should validate that new\_ptr points to the requested virtual address ( ptr + padded\_size ) before using it.
-
-## CHAPTER
-
-## SEVENTEEN
-
-## FREQUENTLY ASKED QUESTIONS
-
-## 17.1 What APIs and features does HIP support?
-
-HIP provides the following:
-
-- Devices ( hipSetDevice() , hipGetDeviceProperties() , etc.)
-- Memory management ( hipMalloc() , hipMemcpy() , hipFree() , etc.)
-- Streams ( hipStreamCreate() , hipStreamSynchronize() , hipStreamWaitEvent() , etc.)
-- Events ( hipEventRecord() , hipEventElapsedTime() , etc.)
-- Kernel launching ( hipLaunchKernel / hipLaunchKernelGGL is the preferred way of launching kernels. hipLaunchKernelGGL is a standard C/C++ macro that can serve as an alternative way to launch kernels, replacing the CUDA triple-chevron ( <<< >>> ) syntax).
-- HIP Module API to control when and how code is loaded.
-- CUDA-style kernel coordinate functions ( threadIdx , blockIdx , blockDim , gridDim )
-- Cross-lane instructions including shfl , ballot , any , all
-- Most device-side math built-ins
-- Error reporting ( hipGetLastError() , hipGetErrorString() )
-
-The HIP API documentation describes each API and its limitations, if any, compared with the equivalent CUDA API.
-
-## 17.2 What is not supported?
-
-## 17.2.1 Runtime/Driver API features
-
-At a high-level, the following features are not supported:
-
-- Textures (partial support available)
-- Dynamic parallelism (CUDA 5.0)
-- Graphics interoperability with OpenGL or Direct3D
-- CUDA IPC Functions (Under Development)
-- CUDA array, mipmappedArray and pitched memory
-- Queue priority controls
-
-See the API Support Table for more detailed information.
-
-## 17.2.2 Kernel language features
-
-- C++-style device-side dynamic memory allocations (free, new, delete) (CUDA 4.0)
-- Virtual functions, indirect functions and try/catch (CUDA 4.0)
-- \_\_prof\_trigger
-- PTX assembly (CUDA 4.0). HIP-Clang supports inline GCN assembly.
-- Several kernel features are under development. See the C++ language extensions for more information.
-
-## 17.3 Is HIP a drop-in replacement for CUDA?
-
-No. HIP provides porting tools which do most of the work to convert CUDA code into portable C++ code that uses the HIP APIs. Most developers will port their code from CUDA to HIP and then maintain the HIP version. HIP code provides the same performance as native CUDA code, plus the benefits of running on AMD platforms.
-
-## 17.4 What specific version of CUDA does HIP support?
-
-HIP APIs and features do not map to a specific CUDA version. HIP provides a strong subset of the functionality provided in CUDA, and the hipify tools can scan code to identify any unsupported CUDA functions - this is useful for identifying the specific features required by a given application.
-
-However, we can provide a rough summary of the features included in each CUDA SDK and the support level in HIP. Each bullet below lists the major new language features in a CUDA release and indicates which are supported or not supported in HIP:
-
-- CUDA 4.0 and earlier:
-  - HIP supports CUDA 4.0 except for the limitations described above.
-- CUDA 5.0:
-  - Dynamic Parallelism (not supported)
-  - cuIpc functions (under development)
-- CUDA 6.0:
-  - Managed memory (under development)
-- CUDA 6.5:
-  - \_\_shfl intrinsic (supported)
-- CUDA 7.0:
-  - Per-thread default streams (supported)
-  - C++11 (HIP-Clang supports all of C++11, all of C++14 and some C++17 features)
-- CUDA 7.5:
-  - float16 (supported)
-- CUDA 8.0:
-  - Page Migration including cudaMemAdvise , cudaMemPrefetch , other cudaMem* APIs (not supported)
-- CUDA 9.0:
-  - Cooperative Launch, Surface Object Management, Version Management
-
-## 17.5 What libraries does HIP support?
-
-HIP includes growing support for the four key math libraries using hipBLAS, hipFFT, hipRAND and hipSPARSE, as well as MIOpen for machine intelligence applications. These offer pointer-based memory interfaces (as opposed to opaque buffers) and can be easily interfaced with other HIP applications. The HIP interfaces support both ROCm and CUDA paths, with familiar library interfaces.
-
-- hipBLAS, which utilizes rocBLAS.
-- hipFFT
-- hipSPARSE
-- hipRAND
-- MIOpen
-
-Additionally, some of the cuBLAS routines are automatically converted to hipBLAS equivalents by the HIPIFY tools. These APIs use cuBLAS or hcBLAS depending on the platform and replace the need to use conditional compilation.
-
-## 17.6 How does HIP compare with OpenCL?
-
-Both AMD and NVIDIA support OpenCL 1.2 on their devices so that developers can write portable code. HIP offers several benefits over OpenCL:
-
-- Developers can code in C++ as well as mix host and device C++ code in their source files. HIP C++ code can use templates, lambdas, classes and so on.
-- The HIP API is less verbose than OpenCL and is familiar to CUDA developers.
-- Because both CUDA and HIP are C++ languages, porting from CUDA to HIP is significantly easier than porting from CUDA to OpenCL.
-- HIP uses the best available development tools on each platform: on NVIDIA GPUs, HIP code compiles using NVCC and can employ the Nsight profiler and debugger (unlike OpenCL on NVIDIA GPUs).
-- HIP provides pointers and host-side pointer arithmetic.
-- HIP provides device-level control over memory allocation and placement.
-- HIP offers an offline compilation model.
-
-## 17.7 How does porting CUDA to HIP compare to porting CUDA to OpenCL?
-
-Both HIP and CUDA are dialects of C++, and thus porting between them is relatively straightforward. Both dialects support templates, classes, lambdas, and other C++ constructs. As one example, the hipify-perl tool was originally a Perl script that used simple text conversions from CUDA to HIP. HIP and CUDA provide similar math library calls as well. In summary, the HIP philosophy was to make the HIP language close enough to CUDA that the porting effort is relatively simple. This reduces the potential for error, and also makes it easy to automate the translation. HIP's goal is to quickly get the ported program running on both platforms with little manual intervention, so that the programmer can focus on performance optimizations.
-
-There have been several tools that have attempted to convert CUDA into OpenCL, such as CU2CL. OpenCL is a C99-based kernel language (rather than C++) and also does not support single-source compilation. As a result, the OpenCL syntax is different from CUDA, and the porting tools have to perform some heroic transformations to bridge this gap. The tools also struggle with more complex CUDA applications, in particular, those that use templates, classes, or other C++ features inside the kernel.
-
-## 17.8 What hardware does HIP support?
-
-- For AMD platforms, see the ROCm documentation for the list of supported platforms.
-- For NVIDIA platforms, HIP requires unified memory and should run on any device supporting CUDA SDK 6.0 or newer. We have tested the NVIDIA Titan and Tesla K40.
-
-## 17.9 Do HIPIFY tools automatically convert all source code?
-
-Typically, HIPIFY tools can automatically convert almost all run-time code. Most device code needs no additional conversion since HIP and CUDA have similar names for math and built-in functions. The hipify-clang tool will automatically modify the kernel signature as needed (automating a step that used to be done manually). Additional porting may be required to deal with architecture feature queries or with CUDA capabilities that HIP doesn't support. In general, developers should always expect to perform some platform-specific tuning and optimization.
-
-## 17.10 What is NVCC?
-
-NVCC is NVIDIA's compiler driver for compiling 'CUDA C++' code into PTX or device code for NVIDIA GPUs. It's a closed-source binary compiler that is provided by the CUDA SDK.
-
-## 17.11 What is HIP-Clang?
-
-HIP-Clang is a Clang/LLVM-based compiler that compiles HIP programs to run on AMD platforms.
-
-## 17.12 Why use HIP rather than supporting CUDA directly?
-
-While HIP is a strong subset of CUDA, it is a subset. The HIP layer allows that subset to be clearly defined and documented. Developers who code to the HIP API can be assured their code will remain portable across NVIDIA and AMD platforms. In addition, HIP defines portable mechanisms to query architectural features and supports a larger 64-bit WaveSize, which expands the return type for cross-lane functions like ballot and shuffle from 32-bit integers to 64-bit integers.
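-
-The practical effect of the wider return type can be illustrated on the host. The sketch below is plain C++ rather than device code, and both helper names are hypothetical; it only shows why a ballot-style mask should be held in a 64-bit integer when the wavefront is 64 lanes wide.
-
-```cpp
-#include <bitset>
-#include <cstdint>
-
-// Count the lanes whose predicate was true, given a ballot-style bit mask.
-// Taking the mask as a 64-bit value keeps every lane of a 64-wide wavefront.
-inline int count_active_lanes(std::uint64_t ballot_mask) {
-    return static_cast<int>(std::bitset<64>(ballot_mask).count());
-}
-
-// Deliberately lossy variant: squeezes the mask into 32 bits first,
-// which silently drops lanes 32..63.
-inline int count_active_lanes_truncated(std::uint64_t ballot_mask) {
-    return count_active_lanes(static_cast<std::uint32_t>(ballot_mask));
-}
-```
-
-Code ported from CUDA that stores a ballot result in a 32-bit variable behaves like the truncated variant, so such variables are worth reviewing during a port.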
-
-## 17.13 Can I develop HIP code on an NVIDIA CUDA platform?
-
-Yes. HIP's CUDA path only exposes the APIs and functionality that work on both NVCC and AMDGPU back-ends. 'Extra' APIs, parameters, and features which exist in CUDA but not in HIP-Clang will typically result in compile-time or run-time errors. Developers need to use the HIP API for most accelerator code and bracket any CUDA-specific code with preprocessor conditionals. Developers concerned about portability should, of course, run on both platforms, and should expect to tune for performance. In some cases, CUDA has a richer set of modes for some APIs, and some C++ capabilities such as virtual functions - see the HIP API documentation for more details.
-
-## 17.14 Can I develop HIP code on an AMD HIP-Clang platform?
-
-Yes. HIP's HIP-Clang path only exposes the APIs and functions that work on AMD runtime back ends. 'Extra' APIs, parameters and features that appear in HIP-Clang but not CUDA will typically cause compile- or run-time errors. Developers must use the HIP API for most accelerator code and bracket any HIP-Clang specific code with preprocessor conditionals. Those concerned about portability should, of course, test their code on both platforms and should tune it for performance. Typically, HIP-Clang supports a more modern set of C++11/C++14/C++17 features, so HIP developers who want portability should be careful when using advanced C++ features on the HIP-Clang path.
-
-## 17.15 How to use HIP-Clang to build HIP programs?
-
-The following environment variable can be used to set the compiler path:
-
-- HIP\_CLANG\_PATH: path to hip-clang. When set, this variable makes hipcc use hip-clang for compilation and linking.
-
-There is an alternative environment variable for setting the compiler path:
-
-- HIP\_ROCCLR\_HOME: path to the root directory of the HIP-ROCclr runtime. When set, this variable makes hipcc use hip-clang from the ROCclr distribution. NOTE: If HIP\_ROCCLR\_HOME is set, there is no need to set HIP\_CLANG\_PATH, since hipcc will deduce it from HIP\_ROCCLR\_HOME.
-
-## 17.16 What is AMD clr?
-
-AMD Common Language Runtime (CLR) is a repository for the AMD platform that contains the source code for AMD's compute language runtimes, as follows:
-
-- hipamd - contains implementation of HIP language for AMD GPU.
-- rocclr - contains virtual device interfaces that compute runtimes interact with backends, such as ROCr on Linux and PAL on Windows.
-- opencl - contains implementation of OpenCL™ on the AMD platform.
-
-## 17.17 What is hipother?
-
-A new repository 'hipother' is added in the ROCm 6.1 release, which is branched out from HIP. hipother supports the HIP back-end implementation on some non-AMD platforms, like NVIDIA.
-
-## 17.18 Can I get HIP open source repository for Windows?
-
-No, there is no HIP repository open publicly on Windows.
-
-## 17.19 Can a HIP binary run on both AMD and NVIDIA platforms?
-
-HIP is a source-portable language that can be compiled to run on either AMD or NVIDIA platform. HIP tools don't create a 'fat binary' that can run on either platform, however.
-
-## 17.20 On HIP-Clang, can I link HIP code with host code compiled with another compiler such as gcc, icc, or clang?
-
-Yes. HIP generates the object code which conforms to the GCC ABI, and also links with libstdc++. This means you can compile host code with the compiler of your choice and link the generated object code with GPU code compiled with HIP. Larger projects often contain a mixture of accelerator code (initially written in CUDA with NVCC) and host code (compiled with gcc, icc, or clang). These projects can convert the accelerator code to HIP, compile that code with hipcc, and link with object code from their preferred compiler.
-
-## 17.21 Can HIP API support C style application? What is the difference between C and C++?
-
-HIP is a C++ runtime API that supports C-style applications as well.
-
-Some C-style applications (and interfaces to other languages such as FORTRAN and Python) call certain HIP APIs but do not use kernel programming. They can be compiled with a C compiler and run correctly; however, small details must be considered in the code. One example is initialization, as shown in the simple application below, which uses the HIP struct dim3 and has the file name 'test.hip.cpp'.
-**Following code does:** Declares one dim3 without an initializer and one with a brace initializer, then prints the x, y, z members of each.
-
-
-```
-// file name: test.hip.cpp
-#include <stdio.h>
-#include "hip/hip_runtime_api.h"
-
-int main(int argc, char** argv) {
-  // No initializer: the result depends on whether a constructor runs.
-  dim3 grid1;
-  printf("dim3 grid1; x=%d, y=%d, z=%d\n", grid1.x, grid1.y, grid1.z);
-  // Brace initializer: well defined in both C and C++.
-  dim3 grid2 = {1,1,1};
-  printf("dim3 grid2 = {1,1,1}; x=%d, y=%d, z=%d\n", grid2.x, grid2.y, grid2.z);
-  return 0;
-}
-```
-
-When using a C++ compiler,
-**Following code does:** Compiles test.hip.cpp as C++ with gcc, using the flags reported by hipconfig --cpp_config, then runs the resulting binary.
-
-
-```
-$ gcc -x c++ $(hipconfig --cpp_config) test.hip.cpp -o test
-$ ./test
-dim3 grid1; x=1, y=1, z=1
-dim3 grid2 = {1,1,1}; x=1, y=1, z=1
-```
-
-Here 'dim3 grid1;' yields a dim3 with all members x, y, z initialized to 1, because that is how the default constructor behaves. Likewise, 'dim3 grid(2);' yields {2,1,1} and 'dim3 grid(2,3);' yields {2,3,1}.
-
-In comparison, when using the C compiler,
-
-```
-$ gcc -x c $(hipconfig --cpp_config) test.hip.cpp -o test
-$ ./test
-dim3 grid1; x=646881376, y=21975, z=1517277280
-dim3 grid2 = {1,1,1}; x=1, y=1, z=1
-```
-
-Here 'dim3 grid1;' does not imply any initialization: no constructor is called, and the values of x, y, z are undefined.
-
-NOTE: To get the C++ default behavior, C programmers must additionally specify the right-hand side as shown below,
-
-```
-dim3 grid = {1,1,1}; // initialized as in C++
-```
-
-## 17.22 Can I install both CUDA SDK and HIP-Clang on the same machine?
-
-Yes. You can use HIP\_PLATFORM to choose which path hipcc targets. This configuration can be useful when using HIP to develop an application which is portable to both AMD and NVIDIA.
-
-## 17.23 HIP detected my platform (HIP-Clang vs NVCC) incorrectly - what should I do?
-
-HIP sets the platform to AMD and uses HIP-Clang as the compiler if it sees that the AMD graphics driver is installed and has detected an AMD GPU. Sometimes this isn't what you want - you can force HIP to recognize the platform by setting the following,
-
-```
-export HIP_PLATFORM=amd
-```
-
-To select the NVIDIA platform instead, set HIP\_PLATFORM=nvidia . In that case HIP uses the following compiler and runtime settings,
-
-```
-HIP_COMPILER=cuda
-HIP_RUNTIME=nvcc
-```
-
-One symptom of this problem is the message 'error: 'unknown error'(11) at square.hipref.cpp:56'. This can occur if you have a CUDA installation on an AMD platform and HIP incorrectly detects the platform as NVCC. HIP may be able to compile the application using the NVCC tool-chain, but it will generate this error at runtime because the platform does not have a CUDA device.
-
-## 17.24 On CUDA, can I mix CUDA code with HIP code?
-
-Yes. Most HIP data structures ( hipStream\_t , hipEvent\_t ) are typedefs to CUDA equivalents and can be intermixed. Both CUDA and HIP use integer device ids. One notable exception is that hipError\_t is a new type, and cannot be used where a cudaError\_t is expected. In these cases, refactor the code to remove the expectation. Alternatively, hip\_runtime\_api.h defines functions which convert between the error code spaces:
-
-- hipErrorToCudaError
-- hipCUDAErrorTohipError
-- hipCUResultTohipError
-
-If platform portability is important, use #ifdef \_\_HIP\_PLATFORM\_NVIDIA\_\_ to guard the CUDA-specific code.
-
-## 17.25 How do I trace HIP application flow?
-
-See Logging HIP activity for more information.
-
-## 17.26 What are the maximum limits of kernel launch parameters?
-
-The product of block.x, block.y, and block.z should be less than 1024. Note that HIP does not support kernel launches in which the total number of work items in any dimension, gridDim x blockDim, is 2^32 or more, so gridDim.x * blockDim.x, gridDim.y * blockDim.y and gridDim.z * blockDim.z must each be less than 2^32.
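-
-These limits can be checked on the host before a launch. The following is a minimal sketch in plain C++, not part of the HIP API: Dim3Like is a hypothetical stand-in for dim3 so that the example builds without HIP headers, and max_block_threads is an assumed parameter that this sketch treats as an inclusive bound. The real per-device limit should be queried from the device properties.
-
-```cpp
-#include <cstdint>
-
-// Hypothetical stand-in for HIP's dim3 so the sketch compiles without HIP headers.
-struct Dim3Like {
-    std::uint32_t x = 1, y = 1, z = 1;
-};
-
-// Returns true when the launch configuration respects the limits quoted above.
-// max_block_threads is an assumption of this sketch, not a value read from the device.
-inline bool launch_config_ok(const Dim3Like& grid, const Dim3Like& block,
-                             std::uint64_t max_block_threads = 1024) {
-    const std::uint64_t limit = std::uint64_t{1} << 32;
-    const std::uint64_t block_threads =
-        std::uint64_t{block.x} * block.y * block.z;
-    if (block_threads == 0 || block_threads > max_block_threads) return false;
-    // 64-bit products avoid the very 32-bit overflow this check is meant to detect.
-    return std::uint64_t{grid.x} * block.x < limit &&
-           std::uint64_t{grid.y} * block.y < limit &&
-           std::uint64_t{grid.z} * block.z < limit;
-}
-```
-
-Computing the products in 64 bits matters here: a 32-bit multiplication of gridDim.x by blockDim.x would wrap around and hide an oversized launch.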
-
-## 17.27 Are \_\_shfl\_*\_sync functions supported on HIP platform?
-
-\_\_shfl\_*\_sync is not supported on HIP. On the NVCC path with CUDA 9.0 and above, however, all shuffle calls are redirected to their sync versions.
-
-## 17.28 How to create a guard for code that is specific to the host or the GPU?
-
-The compiler defines the \_\_HIP\_DEVICE\_COMPILE\_\_ macro only when compiling the code for the GPU. It could be used to guard code that is specific to the host or the GPU.
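-
-A minimal sketch of such a guard is shown below. The function name compiled_for is hypothetical, and the fallback definitions of \_\_host\_\_ and \_\_device\_\_ are an assumption added only so that the same file also builds with a plain host C++ compiler, where the macro is never defined and the host branch is selected.
-
-```cpp
-// When not compiling with hipcc, define the qualifiers away so the sketch
-// also builds with an ordinary host compiler (an assumption for illustration).
-#if !defined(__HIPCC__)
-#define __host__
-#define __device__
-#endif
-
-// Reports which compilation pass produced the code that is running.
-__host__ __device__ inline const char* compiled_for() {
-#if defined(__HIP_DEVICE_COMPILE__)
-    return "device";  // only seen by the GPU compilation pass
-#else
-    return "host";    // seen by the host pass and by non-HIP compilers
-#endif
-}
-```
-
-Because the macro is evaluated by the preprocessor, the unused branch is removed entirely from each pass, so it may contain code that would not compile for the other side.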
-
-## 17.29 Why is \_OPENMP undefined when compiling with -fopenmp ?
-
-When compiling an OpenMP source file with hipcc -fopenmp , the compiler may generate an error if there is a reference to the \_OPENMP macro. This is due to a limitation in hipcc that treats any source file type (for example .cpp ) as a HIP translation unit, leading to some conflicts with the OpenMP language switch. If the OpenMP source file doesn't contain any HIP language constructs, you can work around this issue by adding the -x c++ switch to force the compiler to treat the file as regular C++. Another approach is to guard the OpenMP code with #ifdef \_OPENMP so that the code block is disabled when compiling for the GPU. The \_\_HIP\_DEVICE\_COMPILE\_\_ macro, defined by the HIP compiler when compiling GPU code, can also be used to guard code paths specific to the host or the GPU.
-
-## 17.30 Does the HIP-Clang compiler support extern shared declarations?
-
-Previously, it was essential to declare dynamic shared memory using the HIP\_DYNAMIC\_SHARED macro for accuracy, as using static shared memory in the same kernel could result in overlapping memory ranges and data-races.
-
-Now, the HIP-Clang compiler provides support for extern shared declarations, and the HIP\_DYNAMIC\_SHARED macro is no longer required. You may use the standard extern definition: extern \_\_shared\_\_ type var[];
-
-## 17.31 I have multiple HIP enabled devices and I am getting an error code hipErrorSharedObjectInitFailed with the message 'Error: shared object initialization failed'?
-
-This error appears when you do not have a valid code object for all of your devices.
-
-If you have compiled the application yourself, make sure you have given the correct device name(s) and features via --offload-arch . If you are not specifying --offload-arch , make sure that hipcc is using the correct offload arch by checking the hipcc output generated when the environment variable HIPCC\_VERBOSE=1 is set.
-
-If a precompiled application or library (such as rocblas or TensorFlow) gives you this error, there are two possibilities.
-
-- The application/library does not ship code object bundles for all of your device(s): in this case you need to recompile the application/library yourself with correct --offload-arch .
-- The application/library does not ship code object bundles for some of your device(s), for example you have a system with an APU + GPU and the library does not ship code objects for your APU. For this you can set the environment variable HIP\_VISIBLE\_DEVICES or CUDA\_VISIBLE\_DEVICES on NVIDIA platform, to only enable GPUs for which code object is available. This will limit the GPUs visible to your application and allow it to run.
-
-Note: In previous releases, the error code is hipErrorNoBinaryForGpu with message 'Unable to find code object for all current devices'. The error code handling behavior is changed. HIP runtime shows the error code hipErrorSharedObjectInitFailed with message 'Error: shared object initialization failed' on unsupported GPU.
-
-## 17.32 How to use per-thread default stream in HIP?
-
-The per-thread default stream is an implicit stream local to both the thread and the current device. It does not do any implicit synchronization with other streams (like explicitly created streams), or default per-thread stream on other threads.
-
-The per-thread default stream is a blocking stream and will synchronize with the default null stream if both are used in a program.
-
-In ROCm, a compilation option must be added to compile the translation unit with the per-thread default stream enabled: -fgpu-default-stream=per-thread . Once the source is compiled with the per-thread default stream enabled, all APIs are executed on the per-thread default stream, so there is no implicit synchronization with other streams.
-
-In addition, the per-thread default stream is enabled per translation unit, so users can compile some files with the feature enabled and some with it disabled. A translation unit with the feature enabled uses the per-thread default stream and performs no implicit synchronization, while other modules use the legacy default stream, which does synchronize implicitly.
-
-## 17.33 How to use complex multiplication and division operations?
-
-In HIP, hipFloatComplex and hipDoubleComplex are defined as complex data types,
-
-```
-typedef float2 hipFloatComplex;
-typedef double2 hipDoubleComplex;
-```
-
-Any application that uses complex multiplication and division operations needs to replace the '*' and '/' operators with the following,
-
-- hipCmulf() and hipCdivf() for hipFloatComplex
-- hipCmul() and hipCdiv() for hipDoubleComplex
-
-Note: These complex operations are equivalent to corresponding types/functions on the NVIDIA platform.
-
-## 17.34 Can I develop applications with HIP APIs on Windows the same as on Linux?
-
-Yes, HIP APIs are available to use on both Linux and Windows. Due to different working mechanisms on operating systems like Windows vs Linux, HIP APIs call corresponding lower level backend runtime libraries and kernel drivers for the OS, in order to control the executions on GPU hardware accordingly. There might be a few differences on the related backend software and driver support, which might affect usage of HIP APIs. See OS support details in HIP API document.
-
-## 17.35 Does HIP support LUID?
-
-Starting with ROCm 6.0, the HIP runtime supports the Locally Unique Identifier (LUID). This feature enables the local physical device(s) to interoperate with other devices, for example through DirectX 12.
-
-HIP runtime sets device LUID properties so the driver can query LUID to identify each device for interoperability.
-
-Note: HIP supports LUID only on Windows OS.
-
-## 17.36 How can I know the version of HIP?
-
-The HIP version definition has been updated since the ROCm 4.2 release as follows:
-
-```
-HIP_VERSION=HIP_VERSION_MAJOR * 10000000 + HIP_VERSION_MINOR * 100000 + HIP_VERSION_PATCH
-```
-
-HIP version can be queried from HIP API call, hipRuntimeGetVersion(&runtimeVersion);
-
-The version returned will always be greater than the versions in previous ROCm releases.
-
-Note: The version definition of HIP runtime is different from CUDA. On AMD platform, the function returns HIP runtime version, while on NVIDIA platform, it returns CUDA runtime version. And there is no mapping/correlation between HIP version and CUDA version.
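-
-A packed version number can be split back into its components. The sketch below is plain C++ and assumes the packing MAJOR * 10000000 + MINOR * 100000 + PATCH used by HIP's version header; the struct and function names are hypothetical, and the result is only meaningful for values produced by the HIP runtime on the AMD platform.
-
-```cpp
-// Components of a packed HIP version number (names are illustrative only).
-struct HipVersionParts {
-    int major_part;
-    int minor_part;
-    int patch;
-};
-
-// Undo the assumed packing: MAJOR * 10000000 + MINOR * 100000 + PATCH.
-inline HipVersionParts decode_hip_version(int packed) {
-    HipVersionParts v;
-    v.major_part = packed / 10000000;        // everything above the low 7 digits
-    v.minor_part = (packed / 100000) % 100;  // the next two digits
-    v.patch = packed % 100000;               // the low 5 digits
-    return v;
-}
-```
-
-For example, a value of 60140092 obtained from hipRuntimeGetVersion decodes to major 6, minor 1, patch 40092 under this assumption.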
-
-## CHAPTER EIGHTEEN: HIP RUNTIME API REFERENCE
-
-## 18.1 Related Pages
-
-## 18.2 Topics
-
-## 18.3 Namespaces
-
-- 18.3.1 Namespace List
-- 18.3.2 Namespace Members
-
-## 18.4 Data Structures
-
-- 18.4.1 Data Structures
-- 18.4.2 Data Structure Index
-- 18.4.3 Class Hierarchy
-- 18.4.4 Data Fields
-
-## CHAPTER NINETEEN: C++ LANGUAGE EXTENSIONS
-
-HIP provides a C++ syntax that is suitable for compiling most code that commonly appears in compute kernels (classes, namespaces, operator overloading, and templates). HIP also defines other language features that are designed to target accelerators, such as:
-
-- A kernel-launch syntax that uses standard C++ (this resembles a function call and is portable to all HIP targets)
-- Short-vector headers that can serve on a host or device
-- Math functions that resemble those in math.h , which is included with standard C++ compilers
-- Built-in functions for accessing specific GPU hardware capabilities
-
-Note: This chapter describes the built-in variables and functions that are accessible from the HIP kernel. It's intended for users who are familiar with CUDA kernel syntax and want to learn how HIP differs from CUDA.
-
-Features are labeled with one of the following keywords:
-
-- Supported : HIP supports the feature with a CUDA-equivalent function
-- Not supported : HIP does not support the feature
-- Under development : The feature is under development and not yet available
-
-## 19.1 Function-type qualifiers
-
-## 19.1.1 \_\_device\_\_
-
-Supported \_\_device\_\_ functions are:
-
-- Run on the device
-- Called from the device only
-
-You can combine \_\_device\_\_ with the host keyword ( \_\_host\_\_ ).
-
-## 19.1.2 \_\_global\_\_
-
-Supported \_\_global\_\_ functions are:
-
-- Run on the device
-- Called (launched) from the host
-
-HIP \_\_global\_\_ functions must have a void return type.
-
-HIP doesn't support dynamic-parallelism, which means that you can't call \_\_global\_\_ functions from the device.
-
-## 19.1.3 \_\_host\_\_
-
-Supported \_\_host\_\_ functions are:
-
-- Run on the host
-- Called from the host
-
-You can combine \_\_host\_\_ with \_\_device\_\_ ; in this case, the function compiles for the host and the device. Note that these functions can't use the HIP grid coordinate functions (e.g., threadIdx.x ). If you need to use HIP grid coordinate functions, you can pass the necessary coordinate information as an argument.
-
-You can't combine \_\_host\_\_ with \_\_global\_\_ .
-
-HIP parses the \_\_noinline\_\_ and \_\_forceinline\_\_ keywords and converts them into the appropriate Clang attributes.
-
-## 19.2 Calling \_\_global\_\_ functions
-
-\_\_global\_\_ functions are often referred to as kernels . When you call a global function, you're launching a kernel . When launching a kernel, you must specify an execution configuration that includes the grid and block dimensions. The execution configuration can also include other information for the launch, such as the amount of additional shared memory to allocate and the stream where you want to execute the kernel.
-
-HIP introduces a standard C++ calling convention ( hipLaunchKernelGGL ) to pass the run configuration to the kernel. However, you can also use the CUDA <<< >>> syntax.
-
-When using hipLaunchKernelGGL , your first five parameters must be:
-
-- symbol kernelName : The name of the kernel you want to launch. To support template kernels that contain "," , use the HIP\_KERNEL\_NAME macro (HIPIFY tools insert this automatically).
-- dim3 gridDim : 3D-grid dimensions that specify the number of blocks to launch.
-- dim3 blockDim : 3D-block dimensions that specify the number of threads in each block.
-- size\_t dynamicShared : The amount of additional shared memory that you want to allocate when launching the kernel (see \_\_shared\_\_ ).
-- hipStream\_t : The stream where you want to run the kernel. A value of 0 corresponds to the NULL stream (see Synchronization functions ).
-
-You can include your kernel arguments after these parameters.
-**Following code does:** Performs a SAXPY operation (Single-Precision A·X Plus Y) on the GPU. The `saxpy_kernel` kernel updates the device arrays `d_x` and `d_y` using the scalar `a`. The host code launches it through `hipLaunchKernelGGL` with an explicit grid and block size, no dynamic shared memory, and the default stream. The kernel signature and the names `gridSize`, `blockSize`, and `N` are assumed for illustration.
-
-
-```
-// Example hipLaunchKernelGGL pseudocode:
-__global__ void saxpy_kernel(float a, const float* d_x, float* d_y, size_t N);
-
-// Order: kernel, gridDim, blockDim, dynamicShared, stream, then the kernel arguments.
-hipLaunchKernelGGL(saxpy_kernel, dim3(gridSize), dim3(blockSize), 0, 0, a, d_x, d_y, N);
-```
-**Following code does:** Copies data from device (GPU) memory to host (CPU) memory using HIP, a C++ runtime API and kernel language for writing portable code that runs on AMD and NVIDIA GPUs.
-
-- `hipMemcpy` copies data between host and device memory.
-- `y.data()` is the destination on the host, likely a pointer or an array.
-- `d_y` is the source on the device (GPU), likely a pointer or an array.
-- `size_bytes` specifies the number of bytes to copy.
-- `hipMemcpyDeviceToHost` is an enumeration value that indicates the direction of the copy, from device to host.
-
-The `HIP_CHECK` macro is likely an error-checking wrapper that verifies the `hipMemcpy` call succeeded.
-
-
-```
-HIP_CHECK(hipMemcpy(y.data(), d_y, size_bytes, hipMemcpyDeviceToHost));
-```
-
-The GCN ISA is inserted into the kernel with the asm() assembler statement. The volatile keyword ensures that the optimizer does not change the number of volatile operations or reorder them relative to other volatile operations. v\_mac\_f32\_e32 is the GCN instruction; for more information, refer to the [AMD GCN3 ISA architecture manual](http://gpuopen.com/compute-product/amd-gcn3-isa-architecture-manual/). Each operand is referenced by % followed by its position in the operand list. 'v' is the constraint code (specific to the AMDGPU target) for a 32-bit VGPR register; for more information, refer to the [Supported Constraint Code List for AMDGPU](https://llvm.org/docs/LangRef.html#supported-constraint-code-list). Output constraints are specified with an '=' prefix (for example, '=v'). This indicates that the assembly writes to this operand, which is then made available as a return value of the asm expression. Input constraints have no prefix, just the constraint code. The constraint string '0' says to use the register assigned to the output as an input as well (it is the 0th constraint).
-
-## C++ Support
-
-The following C++ features are not supported:
-
-- Run-time-type information (RTTI)
-- Try/catch
-
-## Virtual functions
-
-Virtual functions are not supported if objects containing virtual function tables are passed between GPUs of different offload architectures, e.g. between gfx906 and gfx1030. Otherwise, virtual functions are supported.
-
-## 19.27 Kernel Compilation
-
-hipcc supports compiling C++/HIP kernels to binary code objects. The binary file format is .co, which stands for Code Object. The following command builds a code object using hipcc:
-
-
-```
-hipcc --genco --offload-arch=[TARGET GPU] [INPUT FILE] -o [OUTPUT FILE]
-
-[TARGET GPU] = GPU architecture
-[INPUT FILE] = Name of the file containing kernels
-[OUTPUT FILE] = Name of the generated code object file
-```
-
-Note: When using binary code objects, the number of arguments to the kernel differs between the HIP-Clang and NVCC paths. Refer to the HIP module\_api sample for the differences in the arguments to be passed to the kernel.
-
-## 19.28 gfx-arch-specific-kernel
-
-The Clang-defined '\_\_gfx*\_\_' macros can be used to execute gfx-architecture-specific code inside a kernel. Refer to the HIP 14\_gpu\_arch sample.
-
-## CHAPTER TWENTY: C++ LANGUAGE SUPPORT
-
-The ROCm platform enables the power of combined C++ and HIP (Heterogeneous-computing Interface for Portability) code. This code is compiled with a clang or clang++ compiler. The official compilers support the HIP platform, or you can use the amdclang or amdclang++ compilers included in the ROCm installation, which are wrappers for the official versions.
-
-The source code is compiled according to the C++03, C++11, C++14, C++17, and C++20 standards, along with HIP-specific extensions, but is subject to restrictions. The key restriction is reduced support for the standard library in device code. This is because a function is by default considered to run on the host, except for constexpr functions, which can run on both host and device.
-
-## 20.1 Modern C++ support
-
-C++ is considered a modern programming language as of C++11. This section describes how HIP supports these new C++ features.
-
-## 20.1.1 C++11 support
-
-The C++11 standard introduced many new features. These features are supported in HIP host code, with some notable omissions on the device side. The rule of thumb is that constexpr functions work on the device and the rest don't. This means that some important functionality, such as std::function, is missing on the device. The standard library wasn't designed with HIP in mind, so device-side support is in a 'works as-is' state.
-
-Certain features have restrictions and clarifications. For example, any function that uses the constexpr qualifier or the new initializer lists, std::move, or std::forward features is implicitly considered to have the \_\_host\_\_ and \_\_device\_\_ execution space specifiers. Also, constexpr variables that are static members or namespace scoped can be used from both host and device, but only for read access. Dereferencing a static constexpr outside its specified execution space causes an error.
-
-Lambdas are supported, but there are some extensions and restrictions on their usage. For more information, see the Extended lambdas section below.
-
-## 20.1.2 C++14 support
-
-The C++14 language features are supported.
-
-## 20.1.3 C++17 support
-
-All C++17 language features are supported.
-
-## 20.1.4 C++20 support
-
-All C++20 language features are supported, but extensions and restrictions apply. C++20 introduced coroutines and modules, which fundamentally changed how programs are written. HIP doesn't support these features. However, consteval functions can be called from host and device, even if specified for host use only.
-
-The three-way comparison operator (spaceship operator <=>) works with host and device code.
-
-## 20.2 Extensions and restrictions
-
-In addition to the deviations from the standard, there are some general extensions and restrictions to consider.
-
-## 20.2.1 Global functions
-
-Functions that serve as an entry point for device execution are called kernels and are specified with the \_\_global\_\_ qualifier. To call a kernel function, use the triple chevron operator: <<< >>>. Kernel functions must have a void return type. These functions can't:
-
-- have a constexpr specifier
-- have a parameter of type std::initializer\_list or va\_list
-- use an rvalue reference as a parameter
-- use parameters having different sizes in host and device code, e.g. long double arguments, or structs containing long double members
-- use struct-type arguments which have a different layout in host and device code
-
-Kernels can have variadic template parameters, but only one parameter pack, which must be the last item in the template parameter list.
-
-## 20.2.2 Device space memory specifiers
-
-HIP includes device space memory specifiers to indicate whether a variable is allocated in host or device memory and how its memory should be allocated. HIP supports the \_\_device\_\_, \_\_shared\_\_, \_\_managed\_\_, and \_\_constant\_\_ specifiers.
-
-The \_\_device\_\_ and \_\_constant\_\_ specifiers define global variables, which are allocated within global memory on the HIP devices. The only difference is that \_\_constant\_\_ variables can't be changed after allocation. The \_\_shared\_\_ specifier allocates the variable within shared memory, which is available to all threads in a block.
-
-The \_\_managed\_\_ variable specifier creates global variables that are initially undefined and unaddressed within the global symbol table. The HIP runtime allocates managed memory and defines the symbol when it loads the device binary. A managed variable can be accessed in both device and host code.
-
-It's important to know where a variable is stored because it is only available from certain locations. Generally, variables allocated in host memory are not accessible from device code, while variables allocated in device memory are not directly accessible from host code. Dereferencing a pointer to device memory on the host results in a segmentation fault. Accessing device variables in host code should be done through kernel execution or HIP functions like hipMemcpyToSymbol.
-
-## 20.2.3 Exception handling
-
-An important difference between the host and device code is exception handling. In device code, this control flow isn't available due to the hardware architecture. The device code must use return codes to handle errors.
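-
-As a minimal sketch of the return-code style described above, the following plain C++ function reports failure through a status value instead of throwing. The `Status` enum and the `safe_divide` helper are hypothetical names chosen for illustration and are not part of HIP; in real device code such a helper would carry the \_\_device\_\_ qualifier.
-
-```cpp
-// Hypothetical status codes for illustration; not part of the HIP API.
-enum class Status { Ok = 0, DivideByZero = 1 };
-
-// Writes num / den to *out and reports success or failure through the
-// return value instead of throwing an exception.
-inline Status safe_divide(float num, float den, float* out) {
-    if (den == 0.0f) {
-        return Status::DivideByZero;  // signal the error; *out is left untouched
-    }
-    *out = num / den;
-    return Status::Ok;
-}
-```
-
-The caller checks the returned status before using the output, which is the same pattern a kernel would use to record an error flag for the host to inspect.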
-
-## 20.2.4 Kernel parameters
-
-There are some restrictions on kernel function parameters. They cannot be passed by reference, because these functions are called from the host but run on the device. Also, a variable number of arguments is not allowed.
-
-## 20.2.5 Classes
-
-Classes work on both the host and device side, but there are some constraints. Static member functions can't be \_\_global\_\_. Virtual member functions work, but a virtual function must not be called from the host if the parent object was created on the device, or the other way around, because this behavior is undefined. Another minor restriction is that \_\_device\_\_ variables that are globally scoped must have trivial constructors.
-
-## 20.2.6 Polymorphic function wrappers
-
-HIP doesn't support the polymorphic function wrapper std::function, which was introduced in C++11.
-
-## 20.2.7 Extended lambdas
-
-HIP supports lambdas, which by default work as expected.
-
-Lambdas have implicit \_\_host\_\_ \_\_device\_\_ attributes. This means that they can be executed by both host and device code and work the way you would expect. To make a lambda callable only by host or device code, add the \_\_host\_\_ or \_\_device\_\_ attribute. The only restriction is that host variables can only be accessed through copy on the device. Accessing them through reference causes undefined behavior.
-
-## 20.2.8 Inline namespaces
-
-Inline namespaces are supported, but with a few exceptions. The following entities can't be declared in namespace scope within an inline unnamed namespace:
-
-- \_\_managed\_\_, \_\_device\_\_, \_\_shared\_\_, and \_\_constant\_\_ variables
-- \_\_global\_\_ functions and function templates
-- variables with surface or texture type
-
-## CHAPTER TWENTY-ONE: HIP MATH API
-
-HIP-Clang supports a set of math operations that are callable from the device. HIP supports most of the device functions supported by NVIDIA CUDA. These are described in the following sections.
-
-## 21.1 Single precision mathematical functions
-
-Following is the list of supported single precision mathematical functions.
-
-Table 1: Single precision mathematical functions
-
-| Function | Supported on Host | Supported on Device |
-|---|---|---|
-| float abs(float x) Returns the absolute value of 𝑥. | ✓ | ✓ |
-| float acosf(float x) Returns the arc cosine of 𝑥. | ✓ | ✓ |
-| float acoshf(float x) Returns the nonnegative arc hyperbolic cosine of 𝑥. | ✓ | ✓ |
-| float asinf(float x) Returns the arc sine of 𝑥. | ✓ | ✓ |
-| float asinhf(float x) Returns the arc hyperbolic sine of 𝑥. | ✓ | ✓ |
-| float atanf(float x) Returns the arc tangent of 𝑥. | ✓ | ✓ |
-| float atan2f(float x, float y) Returns the arc tangent of the ratio of 𝑥 and 𝑦. | ✓ | ✓ |
-| float atanhf(float x) Returns the arc hyperbolic tangent of 𝑥. | ✓ | ✓ |
-| float cbrtf(float x) Returns the cube root of 𝑥. | ✓ | ✓ |
-| float ceilf(float x) Returns the ceiling of 𝑥. | ✓ | ✓ |
-| float copysignf(float x, float y) Create value with given magnitude, copying sign of second value. | ✓ | ✓ |
-| float cosf(float x) Returns the cosine of 𝑥. | ✓ | ✓ |
-| float coshf(float x) Returns the hyperbolic cosine of 𝑥. | ✓ | ✓ |
-| float cospif(float x) Returns the cosine of 𝜋 · 𝑥. | ✓ | ✓ |
-| float cyl_bessel_i0f(float x) Returns the value of the regular modified cylindrical Bessel function of order 0 for 𝑥. | | |
-| float cyl_bessel_i1f(float x) Returns the value of the regular modified cylindrical Bessel function of order 1 for 𝑥. | | |
-| float erff(float x) Returns the error function of 𝑥. | ✓ | ✓ |
-| float erfcf(float x) Returns the complementary error function of 𝑥. | ✓ | ✓ |
-| float erfcinvf(float x) Returns the inverse complementary error function of 𝑥. | ✓ | ✓ |
-| float erfcxf(float x) Returns the scaled complementary error function of 𝑥. | ✓ | ✓ |
-| float erfinvf(float x) Returns the inverse error function of 𝑥. | ✓ | ✓ |
-| float expf(float x) Returns 𝑒^𝑥. | ✓ | ✓ |
-| float exp10f(float x) Returns 10^𝑥. | ✓ | ✓ |
-| float exp2f(float x) Returns 2^𝑥. | ✓ | ✓ |
-| float expm1f(float x) Returns 𝑒^𝑥 - 1. | ✓ | ✓ |
-| float fabsf(float x) Returns the absolute value of 𝑥. | ✓ | ✓ |
-| float fdimf(float x, float y) Returns the positive difference between 𝑥 and 𝑦. | ✓ | ✓ |
-| float fdividef(float x, float y) Divide two floating point values. | ✓ | ✓ |
-| float floorf(float x) Returns the largest integer less than or equal to 𝑥. | ✓ | ✓ |
-| float fmaf(float x, float y, float z) Returns 𝑥 · 𝑦 + 𝑧 as a single operation. | ✓ | ✓ |
-| float fmaxf(float x, float y) Determine the maximum numeric value of 𝑥 and 𝑦. | ✓ | ✓ |
-| float fminf(float x, float y) Determine the minimum numeric value of 𝑥 and 𝑦. | ✓ | ✓ |
-| float fmodf(float x, float y) Returns the floating-point remainder of 𝑥/𝑦. | ✓ | ✓ |
-| float modff(float x, float* iptr) Break down 𝑥 into fractional and integral parts. | ✓ | |
-| float frexpf(float x, int* nptr) Extract mantissa and exponent of 𝑥. | ✓* | * |
-| float hypotf(float x, float y) Returns the square root of the sum of squares of 𝑥 and 𝑦. | ✓* | * |
-| int ilogbf(float x) Returns the unbiased integer exponent of 𝑥. | ✓* | * |
-| bool isfinite(float x) Determine whether 𝑥 is finite. | ✓* | * |
-| bool isinf(float x) Determine whether 𝑥 is infinite. | ✓* | * |
-| bool isnan(float x) Determine whether 𝑥 is a NAN. | ✓* | * |
-| float j0f(float x) Returns the value of the Bessel function of the first kind of order 0 for 𝑥. | ✓* | * |
-| float j1f(float x) Returns the value of the Bessel function of the first kind of order 1 for 𝑥. | ✓* | * |
-| float jnf(int n, float x) Returns the value of the Bessel function of the first kind of order n for 𝑥. | ✓* | * |
-| float ldexpf(float x, int exp) Returns 𝑥 · 2^exp. | ✓ | ✓ |
-| float lgammaf(float x) Returns the natural logarithm of the absolute value of the gamma function of 𝑥. | ✓ | |
-| long int lrintf(float x) Round 𝑥 to nearest integer value. | ✓ | ✓ |
-| long long int llrintf(float x) Round 𝑥 to nearest integer value. | ✓ | ✓ |
-| long int lroundf(float x) Round to nearest integer value. | ✓ | ✓ |
-| long long int llroundf(float x) Round to nearest integer value. | ✓ | ✓ |
-| float log10f(float x) Returns the base 10 logarithm of 𝑥. | ✓ | ✓ |
-| float log1pf(float x) Returns the natural logarithm of 𝑥 + 1. | ✓ | ✓ |
-| float log2f(float x) Returns the base 2 logarithm of 𝑥. | ✓ | ✓ |
-| float logf(float x) Returns the natural logarithm of 𝑥. | ✓ | ✓ |
-| float logbf(float x) Returns the floating point representation of the exponent of 𝑥. | ✓ | |
-| float nanf(const char* tagp) Returns 'Not a Number' value. | | ✓ |
-| float nearbyintf(float x) Round 𝑥 to the nearest integer. | ✓ | ✓ |
-| float nextafterf(float x, float y) Returns next representable single-precision floating-point value after argument. | ✓ | |
-| float norm3df(float x, float y, float z) Returns the square root of the sum of squares of 𝑥, 𝑦 and 𝑧. | ✓ | ✓ |
-| float norm4df(float x, float y, float z, float w) Returns the square root of the sum of squares of 𝑥, 𝑦, 𝑧 and 𝑤. | ✓ | ✓ |
-| float normcdff(float y) Returns the standard normal cumulative distribution function. | ✓ | ✓ |
-| float normcdfinvf(float y) Returns the inverse of the standard normal cumulative distribution function. | ✓ | ✓ |
-| float normf(int dim, const float *a) Returns the square root of the sum of squares of any number of coordinates. | ✓ | ✓ |
-| float powf(float x, float y) Returns 𝑥^𝑦. | ✓ | ✓ |
-| float powif(float base, int iexp) Returns the value of first argument to the power of second argument. | ✓ | ✓ |
-| float remainderf(float x, float y) Returns single-precision floating-point remainder. | ✓ | ✓ |
-| float remquof(float x, float y, int* quo) Returns single-precision floating-point remainder and part of quotient. | ✓ | ✓ |
-| float roundf(float x) Round to nearest integer value in floating-point. | ✓ | ✓ |
-| float rcbrtf(float x) Returns the reciprocal cube root function. | ✓ | ✓ |
-| float rhypotf(float x, float y) Returns one over the square root of the sum of squares of two arguments. | ✓ | ✓ |
-| float rintf(float x) Round input to nearest integer value in floating-point. | ✓ | ✓ |
-| float rnorm3df(float x, float y, float z) Returns one over the square root of the sum of squares of three coordinates of the argument. | ✓ | ✓ |
-| float rnorm4df(float x, float y, float z, float w) Returns one over the square root of the sum of squares of four coordinates of the argument. | ✓ | ✓ |
-| float rnormf(int dim, const float *a) Returns the reciprocal of square root of the sum of squares of any number of coordinates. | ✓ | ✓ |
-| float scalblnf(float x, long int n) Scale 𝑥 by 2^𝑛. | ✓ | ✓ |
-| float scalbnf(float x, int n) Scale 𝑥 by 2^𝑛. | ✓ | ✓ |
-| bool signbit(float x) Return the sign bit of 𝑥. | ✓ | ✓ |
-| float sinf(float x) Returns the sine of 𝑥. | ✓ | ✓ |
-| float sinhf(float x) Returns the hyperbolic sine of 𝑥. | ✓ | ✓ |
-| float sinpif(float x) Returns the sine of 𝜋 · 𝑥. | ✓ | ✓ |
-| void sincosf(float x, float *sptr, float *cptr) Returns the sine and cosine of 𝑥. | ✓ | ✓ |
-| void sincospif(float x, float *sptr, float *cptr) Returns the sine and cosine of 𝜋 · 𝑥. | ✓ | ✓ |
-| float sqrtf(float x) Returns the square root of 𝑥. | ✓ | ✓ |
-| float rsqrtf(float x) Returns the reciprocal of the square root of 𝑥. | | ✓ |
-| float tanf(float x) Returns the tangent of 𝑥. | ✓ | ✓ |
-| float tanhf(float x) Returns the hyperbolic tangent of 𝑥. | ✓ | ✓ |
-| float tgammaf(float x) Returns the gamma function of 𝑥. | ✓ | ✓ |
-| float truncf(float x) Truncate 𝑥 to the integral part. | ✓ | ✓ |
-| float y0f(float x) Returns the value of the Bessel function of the second kind of order 0 for 𝑥. | ✓ | ✓ |
-| float y1f(float x) Returns the value of the Bessel function of the second kind of order 1 for 𝑥. | ✓ | ✓ |
-| float ynf(int n, float x) Returns the value of the Bessel function of the second kind of order n for 𝑥. | * | * |
-
-Note: * marks a cell damaged in extraction. For rows showing ✓*, only one support mark survived and the column it belongs to is uncertain; for ynf, no support marks survived.
-
-## 21.2 Double precision mathematical functions
-
-Following is the list of supported double precision mathematical functions.
-
-Table 2: Double precision mathematical functions
-
-| Function | Supported on Host | Supported on Device |
-|---|---|---|
-| double abs(double x) Returns the absolute value of 𝑥. | ✓ | ✓ |
-| double acos(double x) Returns the arc cosine of 𝑥. | ✓ | ✓ |
-| double acosh(double x) Returns the nonnegative arc hyperbolic cosine of 𝑥. | ✓ | ✓ |
-| double asin(double x) Returns the arc sine of 𝑥. | ✓ | ✓ |
-| double asinh(double x) Returns the arc hyperbolic sine of 𝑥. | ✓ | ✓ |
-| double atan(double x) Returns the arc tangent of 𝑥. | ✓ | ✓ |
-| double atan2(double x, double y) Returns the arc tangent of the ratio of 𝑥 and 𝑦. | ✓ | ✓ |
-| double atanh(double x) Returns the arc hyperbolic tangent of 𝑥. | ✓ | ✓ |
-| double cbrt(double x) Returns the cube root of 𝑥. | ✓ | ✓ |
-| double ceil(double x) Returns the ceiling of 𝑥. | ✓ | ✓ |
-| double copysign(double x, double y) Create value with given magnitude, copying sign of second value. | ✓ | ✓ |
-| double cos(double x) Returns the cosine of 𝑥. | ✓ | ✓ |
-| double cosh(double x) Returns the hyperbolic cosine of 𝑥. | ✓ | ✓ |
-| double cospi(double x) Returns the cosine of 𝜋 · 𝑥. | ✓ | ✓ |
-| double cyl_bessel_i0(double x) Returns the value of the regular modified cylindrical Bessel function of order 0 for 𝑥. | | |
-| double cyl_bessel_i1(double x) Returns the value of the regular modified cylindrical Bessel function of order 1 for 𝑥. | | |
-| double erf(double x) Returns the error function of 𝑥. | ✓ | ✓ |
-| double erfc(double x) Returns the complementary error function of 𝑥. | ✓ | ✓ |
-| double erfcinv(double x) Returns the inverse complementary error function of 𝑥. | ✓ | ✓ |
-| double erfcx(double x) Returns the scaled complementary error function of 𝑥. | ✓ | ✓ |
-| double erfinv(double x) Returns the inverse error function of 𝑥. | ✓ | ✓ |
-| double exp(double x) Returns 𝑒^𝑥. | ✓ | ✓ |
-| double exp10(double x) Returns 10^𝑥. | ✓ | ✓ |
-| double exp2(double x) Returns 2^𝑥. | ✓ | ✓ |
-| double expm1(double x) Returns 𝑒^𝑥 - 1. | ✓ | ✓ |
-| double fabs(double x) Returns the absolute value of 𝑥. | ✓ | ✓ |
-| double fdim(double x, double y) Returns the positive difference between 𝑥 and 𝑦. | ✓ | ✓ |
-| double floor(double x) Returns the largest integer less than or equal to 𝑥. | ✓ | ✓ |
-| double fma(double x, double y, double z) Returns 𝑥 · 𝑦 + 𝑧 as a single operation. | ✓ | ✓ |
-| double fmax(double x, double y) Determine the maximum numeric value of 𝑥 and 𝑦. | ✓ | ✓ |
-| double fmin(double x, double y) Determine the minimum numeric value of 𝑥 and 𝑦. | ✓ | ✓ |
-| double fmod(double x, double y) Returns the floating-point remainder of 𝑥/𝑦. | ✓ | ✓ |
-| double modf(double x, double* iptr) Break down 𝑥 into fractional and integral parts. | ✓ | |
-| double frexp(double x, int* nptr) Extract mantissa and exponent of 𝑥. | ✓ | |
-| double hypot(double x, double y) Returns the square root of the sum of squares of 𝑥 and 𝑦. | ✓ | ✓ |
-| int ilogb(double x) Returns the unbiased integer exponent of 𝑥. | ✓ | ✓ |
-| bool isfinite(double x) Determine whether 𝑥 is finite. | ✓ | ✓ |
-| bool isinf(double x) Determine whether 𝑥 is infinite. | ✓ | ✓ |
-| bool isnan(double x) Determine whether 𝑥 is a NAN. | ✓ | ✓ |
-| double j0(double x) Returns the value of the Bessel function of the first kind of order 0 for 𝑥. | ✓ | ✓ |
-| double j1(double x) Returns the value of the Bessel function of the first kind of order 1 for 𝑥. | ✓ | ✓ |
-| double jn(int n, double x) Returns the value of the Bessel function of the first kind of order n for 𝑥. | ✓ | ✓ |
-| double ldexp(double x, int exp) Returns 𝑥 · 2^exp. | ✓ | ✓ |
-| double lgamma(double x) Returns the natural logarithm of the absolute value of the gamma function of 𝑥. | ✓ | |
-| long int lrint(double x) Round 𝑥 to nearest integer value. | ✓ | ✓ |
-| long long int llrint(double x) Round 𝑥 to nearest integer value. | ✓ | ✓ |
-| long int lround(double x) Round to nearest integer value. | ✓ | ✓ |
-| long long int llround(double x) Round to nearest integer value. | ✓ | ✓ |
-| double log10(double x) Returns the base 10 logarithm of 𝑥. | ✓ | ✓ |
-| double log1p(double x) Returns the natural logarithm of 𝑥 + 1. | ✓ | ✓ |
-| double log2(double x) Returns the base 2 logarithm of 𝑥. | ✓ | ✓ |
-| double log(double x) Returns the natural logarithm of 𝑥. | ✓ | ✓ |
-| double logb(double x) Returns the floating point representation of the exponent of 𝑥. | ✓ | ✓ |
-| double nan(const char* tagp) Returns 'Not a Number' value. | | ✓ |
-| double nearbyint(double x) Round 𝑥 to the nearest integer. | ✓ | ✓ |
-| double nextafter(double x, double y) Returns next representable double-precision floating-point value after argument. | ✓ | |
-| double norm3d(double x, double y, double z) Returns the square root of the sum of squares of 𝑥, 𝑦 and 𝑧. | ✓ | ✓ |
-| double norm4d(double x, double y, double z, double w) Returns the square root of the sum of squares of 𝑥, 𝑦, 𝑧 and 𝑤. | ✓ | ✓ |
-| double normcdf(double y) Returns the standard normal cumulative distribution function. | ✓ | ✓ |
-| double normcdfinv(double y) Returns the inverse of the standard normal cumulative distribution function. | ✓ | ✓ |
-| double norm(int dim, const double *a) Returns the square root of the sum of squares of any number of coordinates. | ✓ | ✓ |
-| double pow(double x, double y) Returns 𝑥^𝑦. | ✓ | ✓ |
-| double powi(double base, int iexp) Returns the value of first argument to the power of second argument. | ✓ | ✓ |
-| double remainder(double x, double y) Returns double-precision floating-point remainder. | ✓ | ✓ |
-| double remquo(double x, double y, int* quo) Returns double-precision floating-point remainder and part of quotient. | ✓ | * |
-| double round(double x) Round to nearest integer value in floating-point. | ✓ | ✓ |
-| double rcbrt(double x) Returns the reciprocal cube root function. | ✓ | ✓ |
-| double rhypot(double x, double y) Returns one over the square root of the sum of squares of two arguments. | ✓ | ✓ |
-| double rint(double x) Round input to nearest integer value in floating-point. | ✓ | ✓ |
-| double rnorm3d(double x, double y, double z) Returns one over the square root of the sum of squares of three coordinates of the argument. | ✓ | ✓ |
-| double rnorm4d(double x, double y, double z, double w) Returns one over the square root of the sum of squares of four coordinates of the argument. | ✓ | ✓ |
-| double rnorm(int dim, const double *a) Returns the reciprocal of square root of the sum of squares of any number of coordinates. | | ✓ |
-| double scalbln(double x, long int n) Scale 𝑥 by 2^𝑛. | ✓ | ✓ |
-| double scalbn(double x, int n) Scale 𝑥 by 2^𝑛. | ✓ | ✓ |
-| bool signbit(double x) Return the sign bit of 𝑥. | ✓ | ✓ |
-| double sin(double x) Returns the sine of 𝑥. | ✓ | ✓ |
-| double sinh(double x) Returns the hyperbolic sine of 𝑥. | ✓ | ✓ |
-| double sinpi(double x) Returns the sine of 𝜋 · 𝑥. | ✓ | ✓ |
-| void sincos(double x, double *sptr, double *cptr) Returns the sine and cosine of 𝑥. | ✓ | ✓ |
-| void sincospi(double x, double *sptr, double *cptr) Returns the sine and cosine of 𝜋 · 𝑥. | ✓ | ✓ |
-| double sqrt(double x) Returns the square root of 𝑥. | ✓ | ✓ |
-| double rsqrt(double x) Returns the reciprocal of the square root of 𝑥. | ✓* | * |
-| double tan(double x) Returns the tangent of 𝑥. | ✓* | * |
-| double tanh(double x) Returns the hyperbolic tangent of 𝑥. | ✓* | * |
-| double tgamma(double x) Returns the gamma function of 𝑥. | ✓* | * |
-| double trunc(double x) Truncate 𝑥 to the integral part. | ✓* | * |
-| double y0(double x) Returns the value of the Bessel function of the second kind of order 0 for 𝑥. | ✓* | * |
-| double y1(double x) Returns the value of the Bessel function of the second kind of order 1 for 𝑥. | ✓* | * |
-| double yn(int n, double x) Returns the value of the Bessel function of the second kind of order n for 𝑥. | ✓* | * |
-
-Note: * marks a cell damaged in extraction. For rows showing ✓*, only one support mark survived and the column it belongs to is uncertain; for remquo, the device mark did not survive.
-
-## 21.3 Integer intrinsics
-
-Following is the list of supported integer intrinsics. Note that intrinsics are supported on device only.
-
-Table 3: Integer intrinsics mathematical functions
-
-| Function | Description |
-|----------|-------------|
-| unsigned int \_\_brev(unsigned int x) | Reverse the bit order of a 32-bit unsigned integer. |
-| unsigned long long int \_\_brevll(unsigned long long int x) | Reverse the bit order of a 64-bit unsigned integer. |
-| unsigned int \_\_byte\_perm(unsigned int x, unsigned int y, unsigned int z) | Return selected bytes from two 32-bit unsigned integers. |
-| unsigned int \_\_clz(int x) | Return the number of consecutive high-order zero bits in a 32-bit integer. |
-| unsigned int \_\_clzll(long long int x) | Return the number of consecutive high-order zero bits in a 64-bit integer. |
-| unsigned int \_\_ffs(int x) | Find the position of the least significant bit set to 1 in a 32-bit integer. |
-| unsigned int \_\_ffsll(long long int x) | Find the position of the least significant bit set to 1 in a 64-bit signed integer. |
-| unsigned int \_\_fns32(unsigned long long mask, unsigned int base, int offset) | Find the position of the n-th bit set to 1 in a 32-bit integer. |
-| unsigned int \_\_fns64(unsigned long long int mask, unsigned int base, int offset) | Find the position of the n-th bit set to 1 in a 64-bit integer. |
-| unsigned int \_\_funnelshift\_l(unsigned int lo, unsigned int hi, unsigned int shift) | Concatenate hi and lo, shift left by shift & 31 bits, return the most significant 32 bits. |
-| unsigned int \_\_funnelshift\_lc(unsigned int lo, unsigned int hi, unsigned int shift) | Concatenate hi and lo, shift left by min(shift, 32) bits, return the most significant 32 bits. |
-| unsigned int \_\_funnelshift\_r(unsigned int lo, unsigned int hi, unsigned int shift) | Concatenate hi and lo, shift right by shift & 31 bits, return the least significant 32 bits. |
-
-The HIP-Clang implementation of \_\_ffs() and \_\_ffsll() contains code to add a constant +1 to produce the ffs result format. For cases where this overhead is not acceptable and the programmer is willing to specialize for the platform, HIP-Clang provides \_\_lastbit\_u32\_u32(unsigned int input) and \_\_lastbit\_u32\_u64(unsigned long long int input). The index returned by the \_\_lastbit\_ instructions starts at -1, while for ffs the index starts at 0.
-
-## 21.4 Floating-point Intrinsics
-
-Following is the list of supported floating-point intrinsics. Note that intrinsics are supported on device only.
-
-Note: Only the round-to-nearest-even rounding mode is supported on AMD GPUs by default. The \_rz , \_ru and \_rd suffixed intrinsic functions exist in the HIP AMD backend only if the OCML\_BASIC\_ROUNDED\_OPERATIONS macro is defined.
-
-Table 4: Single precision intrinsics mathematical functions
-
-| Function | Description |
-|----------|-------------|
-| float \_\_cosf(float x) | Returns the fast approximate cosine of x. |
-| float \_\_exp10f(float x) | Returns the fast approximate of 10^x. |
-| float \_\_expf(float x) | Returns the fast approximate of e^x. |
-| float \_\_fadd\_rn(float x, float y) | Add two floating-point values in round-to-nearest-even mode. |
-| float \_\_fdiv\_rn(float x, float y) | Divide two floating-point values in round-to-nearest-even mode. |
-| float \_\_fmaf\_rn(float x, float y, float z) | Returns x × y + z as a single operation in round-to-nearest-even mode. |
-| float \_\_fmul\_rn(float x, float y) | Multiply two floating-point values in round-to-nearest-even mode. |
-| float \_\_frcp\_rn(float x) | Returns 1 / x in round-to-nearest-even mode. |
-| float \_\_frsqrt\_rn(float x) | Returns 1 / √x in round-to-nearest-even mode. |
-| float \_\_fsqrt\_rn(float x) | Returns √x in round-to-nearest-even mode. |
-| float \_\_fsub\_rn(float x, float y) | Subtract two floating-point values in round-to-nearest-even mode. |
-| float \_\_log10f(float x) | Returns the fast approximate base-10 logarithm of x. |
-
-Table 5: Double precision intrinsics mathematical functions
-
-| Function | Description |
-|----------|-------------|
-| double \_\_dadd\_rn(double x, double y) | Add two floating-point values in round-to-nearest-even mode. |
-| double \_\_ddiv\_rn(double x, double y) | Divide two floating-point values in round-to-nearest-even mode. |
-| double \_\_dmul\_rn(double x, double y) | Multiply two floating-point values in round-to-nearest-even mode. |
-| double \_\_drcp\_rn(double x) | Returns 1 / x in round-to-nearest-even mode. |
-| double \_\_dsqrt\_rn(double x) | Returns √x in round-to-nearest-even mode. |
-| double \_\_dsub\_rn(double x, double y) | Subtract two floating-point values in round-to-nearest-even mode. |
-| double \_\_fma\_rn(double x, double y, double z) | Returns x × y + z as a single operation in round-to-nearest-even mode. |
-
-## Chapter 22: Table comparing syntax for different compute APIs
-
-| Term | CUDA | HIP | OpenCL |
-|------------------------|---------------------|--------------------------------------------|------------------------|
-| Device | int deviceId | int deviceId | cl_device |
-| Queue | cudaStream_t | hipStream_t | cl_command_queue |
-| Event | cudaEvent_t | hipEvent_t | cl_event |
-| Memory | void * | void * | cl_mem |
-| | grid | grid | NDRange |
-| | block | block | work-group |
-| | thread | thread | work-item |
-| | warp | warp | sub-group |
-| Thread-index | threadIdx.x | threadIdx.x | get_local_id(0) |
-| Block-index | blockIdx.x | blockIdx.x | get_group_id(0) |
-| Block-dim | blockDim.x | blockDim.x | get_local_size(0) |
-| Grid-dim | gridDim.x | gridDim.x | get_num_groups(0) |
-| Device Kernel | __global__ | __global__ | __kernel |
-| Device Function | __device__ | __device__ | Implied in device com |
-| Host Function | __host__ (default) | __host__ (default) | Implied in host compil |
-| Host + Device Function | __host__ __device__ | __host__ __device__ | No equivalent |
-| Kernel Launch | <<< >>> | hipLaunchKernel / hipLaunchKernelGGL / <<< | clEnqueueNDRangeK |
-| Global Memory | __global__ | __global__ | __global |
-| Group Memory | __shared__ | __shared__ | __local |
-| Constant | __constant__ | __constant__ | __constant |
-| | __syncthreads | __syncthreads | barrier(CLK_LOCAL |
-| Atomic Builtins | atomicAdd | atomicAdd | atomic_add |
-| Precise Math | cos(f) | cos(f) | cos(f) |
-| Fast Math | __cos(f) | __cos(f) | native_cos(f) |
-| Vector | float4 | float4 | float4 |
-
-## 22.1 Notes
-
-The indexing functions (starting with thread-index) show the terminology for a 1D grid. Some APIs use the reverse order of xyz / 012 indexing for 3D grids.
-
-## Chapter 23: HIP cooperative groups API
-
-## 23.1 Cooperative kernel launches
-
-The following host-side functions are used for cooperative kernel launches.
-
-- hipLaunchCooperativeKernel
-- hipLaunchCooperativeKernelMultiDevice
-- hipModuleLaunchCooperativeKernel
-- hipModuleLaunchCooperativeKernelMultiDevice
-
-## 23.2 Cooperative groups classes
-
-The following cooperative groups classes can be used on the device side.
-
-## class thread\_group
-
-The base type of all cooperative group types.
-
-Holds the key properties of a constructed cooperative group type object, such as the group type and its size.
-
-Note: The cooperative groups feature is implemented on Linux and is under development on Microsoft Windows.
-
-Subclassed by cooperative\_groups::coalesced\_group , cooperative\_groups::grid\_group , cooperative\_groups::multi\_grid\_group , cooperative\_groups::thread\_block , cooperative\_groups::tiled\_group
-
-class thread\_block : public cooperative\_groups:: thread\_group
-
-The workgroup (thread-block in CUDA terminology) cooperative group type.
-
-Represents an intra-workgroup cooperative group type, where the participating threads within the group are the same threads that participate in the currently executing workgroup.
-
-Note: This type is implemented on Linux and is under development on Microsoft Windows.
-
-class grid\_group : public cooperative\_groups:: thread\_group
-
-The grid cooperative group type.
-
-Represents an inter-workgroup cooperative group type, where the participating threads within the group span multiple workgroups running the same kernel on the same device.
-
-Note: This is implemented on Linux and is under development on Microsoft Windows.
-
-class multi\_grid\_group : public cooperative\_groups:: thread\_group
-
-The multi-grid cooperative group type.
-
-Represents an inter-device cooperative group type, where the participating threads within the group span across multiple devices, running the (same) kernel on these devices.
-
-Note: The multi-grid cooperative group type is implemented on Linux, under development on Microsoft Windows.
-
-## template<unsigned int size , class ParentCGTy >
-
-class thread\_block\_tile : public cooperative\_groups::impl::thread\_block\_tile\_internal< size , ParentCGTy >
-
-Group type - thread\_block\_tile .
-
-Represents one tiled thread group in a wavefront. This group type also supports sub-wave level intrinsics.
-
-Note: This type is implemented on Linux, under development on Microsoft Windows.
-
-## Public Functions
-
-unsigned int thread\_rank () const
-
-Rank of the calling thread within [0, size() ).
-
-## void sync ()
-
-Synchronizes the threads in the group.
-
-Causes all threads in the group to wait at this synchronization point until all shared and global memory accesses by those threads have completed. This guarantees the visibility of accessed data for all threads in the group.
-
-Note: There are potential read-after-write (RAW), write-after-read (WAR), or write-after-write (WAW) hazards, when threads in the group access the same addresses in shared or global memory. The data hazards can be avoided with synchronization of the group.
-
-## unsigned int meta\_group\_rank () const
-
-Returns the linear rank of the group within the set of tiles partitioned from a parent group (bounded by meta\_group\_size)
-
-unsigned int meta\_group\_size () const
-
-Returns the number of groups created when the parent group was partitioned.
-
-## template<class T >
-
-T shfl ( T var, int srcRank ) const
-
-Shuffle operation on group level.
-
-Exchanges variables between threads without using shared memory. The shuffle operation directly copies var from the thread whose ID within the group is srcRank .
-
-## Template Parameters
-
-T - The type can be a 32-bit integer or single-precision floating point.
-
-## Parameters
-
-- var - [in] The source variable to copy. Only the srcRank thread ID of group is copied to other threads.
-- srcRank - [in] The source thread ID of the group for copy.
-
-## template<class T >
-
-T shfl\_down ( T var, unsigned int lane\_delta ) const
-
-Shuffle down operation on group level.
-
-Exchanges variables between threads without using shared memory. The shuffle down operation copies var from the thread whose ID within the group is lane\_delta higher than the caller's thread ID.
-
-## Template Parameters
-
-T - The type can be a 32-bit integer or single-precision floating point.
-
-## Parameters
-
-- var - [in] The source variable to copy.
-- lane\_delta - [in] The difference between the caller's thread ID and the thread ID of the copy source: sourceID = (threadID + lane\_delta) % size()
-
-## template<class T >
-
-T shfl\_up ( T var, unsigned int lane\_delta ) const
-
-Shuffle up operation on group level.
-
-Exchanges variables between threads without using shared memory. The shuffle up operation copies var from the thread whose ID within the group is lane\_delta lower than the caller's thread ID.
-
-## Template Parameters
-
-T - The type can be a 32-bit integer or single-precision floating point.
-
-## Parameters
-
-- var - [in] The source variable to copy.
-- lane\_delta - [in] The difference between the caller's thread ID and the thread ID of the copy source: sourceID = (threadID - lane\_delta) % size()
-
-## template<class T >
-
-T shfl\_xor ( T var, unsigned int laneMask ) const
-
-Shuffle xor operation on group level.
-
-Exchanges variables between threads without using shared memory. The shuffle xor operation copies var from the thread whose ID within the group is the bitwise XOR of the caller's thread ID and laneMask .
-
-## Template Parameters
-
-T - The type can be a 32-bit integer or single-precision floating point.
-
-## Parameters
-
-- var - [in] The source variable to copy.
-- laneMask - [in] The mask for the XOR operation: sourceID = threadID ^ laneMask
-
-unsigned long long ballot ( int pred ) const
-
-Ballot function on group level.
-
-Returns a bit mask with the Nth bit set to one if the Nth thread predicate evaluates true.
-
-## Parameters
-
-pred - [in] The predicate to evaluate on group threads.
-
-int any ( int pred ) const
-
-Any function on group level.
-
-Returns non-zero if the predicate evaluates true for any thread.
-
-## Parameters
-
-pred - [in] The predicate to evaluate on group threads.
-
-int all ( int pred ) const
-
-All function on group level.
-
-Returns non-zero if a predicate evaluates true for all threads.
-
-## Parameters
-
-pred - [in] The predicate to evaluate on group threads.
-
-template<typename T >
-
-unsigned long long match\_any ( T value ) const
-
-Match any function on group level.
-
-Returns a bit mask containing a 1-bit for every participating thread if that thread has the same value in value as the caller thread.
-
-## Parameters
-
-value - [in] The value to examine on the current thread in group.
-
-template<typename T >
-
-unsigned long long match\_all ( T value, int &pred ) const
-
-Match all function on group level.
-
-Returns a bit mask containing a 1-bit for every participating thread if they all have the same value in value as the caller thread. The predicate pred is set to true if all participating threads have the same value in value .
-
-## Parameters
-
-- value - [in] The value to examine on the current thread in group.
-- pred - [out] The predicate is set to true if all participating threads in the thread group have the same value.
-
-class coalesced\_group : public cooperative\_groups:: thread\_group
-
-The coalesced\_group cooperative group type.
-
-Represents an active thread group in a wavefront. This group type also supports sub-wave level intrinsics.
-
-Note: This is implemented on Linux and is under development on Microsoft Windows.
-
-## 23.3 Cooperative groups construct functions
-
-The following functions are used to construct different group-type instances on the device side.
-
-- cooperative\_groups::this\_multi\_grid
-- cooperative\_groups::this\_grid
-- cooperative\_groups::this\_thread\_block
-- cooperative\_groups::coalesced\_threads
-- cooperative\_groups::tiled\_partition
-
-Warning: doxygenfunction: Cannot find function 'cooperative\_groups::binary\_partition' in doxygen xml output for project 'HIP 6.1.40092 Documentation' from directory: /home/docs/checkouts/readthedocs.org/user\_builds/advanced-micro-devices-hip/checkouts/docs-6.1.2/docs/doxygen/xml
-
-## 23.4 Cooperative groups exposed API functions
-
-The following functions are the exposed API for different group-type instances on the device side.
-
-Warning: doxygenfunction: Cannot find function 'cooperative\_groups::group\_size' in doxygen xml output for project 'HIP 6.1.40092 Documentation' from directory: /home/docs/checkouts/readthedocs.org/user\_builds/advanced-micro-devices-hip/checkouts/docs-6.1.2/docs/doxygen/xml
-
-Warning: doxygenfunction: Cannot find function 'cooperative\_groups::thread\_rank' in doxygen xml output for project 'HIP 6.1.40092 Documentation' from directory: /home/docs/checkouts/readthedocs.org/user\_builds/advanced-micro-devices-hip/checkouts/docs-6.1.2/docs/doxygen/xml
-
-Warning: doxygenfunction: Cannot find function 'cooperative\_groups::is\_valid' in doxygen xml output for project 'HIP 6.1.40092 Documentation' from directory: /home/docs/checkouts/readthedocs.org/user\_builds/advanced-micro-devices-hip/checkouts/docs-6.1.2/docs/doxygen/xml
-
-Warning: doxygenfunction: Cannot find function 'cooperative\_groups::sync' in doxygen xml output for project 'HIP 6.1.40092 Documentation' from directory: /home/docs/checkouts/readthedocs.org/user\_builds/advanced-micro-devices-hip/checkouts/docs-6.1.2/docs/doxygen/xml
-
-## Chapter 24: HSA Runtime API for ROCm
-
-The following functions are located in the https://github.com/ROCm/ROCR-Runtime repository.
-
-hsa\_status\_t hsa\_amd\_vmem\_address\_reserve ( void **va, size\_t size, uint64\_t address, uint64\_t flags )
-
-Allocate a reserved address range.
-
-Reserve a virtual address range. The size must be a multiple of the system page size. If it is not possible to allocate the address specified by address, then va will be a different address range. The address range should be released by calling hsa\_amd\_vmem\_address\_free.
-
-Note that this API will be deprecated in a future release and replaced by hsa\_amd\_vmem\_address\_reserve\_align.
-
-## Parameters
-
-- va -[out] virtual address allocated
-- size -[in] of address range requested
-- address -[in] requested
-- flags -[in] currently unsupported
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS - Address range allocated successfully
-- ::HSA\_STATUS\_ERROR\_NOT\_INITIALIZED - The HSA runtime has not been initialized.
-- ::HSA\_STATUS\_ERROR\_OUT\_OF\_RESOURCES - Insufficient resources to allocate an address range of this size.
-
-hsa\_status\_t hsa\_amd\_vmem\_address\_free ( void *va, size\_t size )
-
-Free a reserved address range.
-
-Free a previously allocated address range. The size must match the size of a previously allocated address range.
-
-## Parameters
-
-- va -[in] virtual address to be freed
-- size -[in] of address range
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS - Address range released successfully
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - Invalid va specified
-- ::HSA\_STATUS\_ERROR\_INVALID\_ARGUMENT - Invalid size specified
-- ::HSA\_STATUS\_ERROR\_RESOURCE\_FREE - Address range is still in use
-
-- ::HSA\_STATUS\_ERROR - Internal unexpected error
-
-hsa\_status\_t hsa\_amd\_vmem\_handle\_create ( hsa\_amd\_memory\_pool\_t pool, size\_t size, hsa\_amd\_memory\_type\_t type, uint64\_t flags, hsa\_amd\_vmem\_alloc\_handle\_t *memory\_handle )
-
-Create a virtual memory handle.
-
-Create a virtual memory handle within this pool. The size must be aligned to the allocation granule size for this memory pool; see HSA\_AMD\_MEMORY\_POOL\_INFO\_RUNTIME\_ALLOC\_GRANULE. To minimize internal memory fragmentation, align the size to the recommended allocation granule size; see HSA\_AMD\_MEMORY\_POOL\_INFO\_RUNTIME\_ALLOC\_REC\_GRANULE.
-
-## Parameters
-
-- pool -[in] memory to use
-- size -[in] of the memory allocation
-- type -[in] of memory
-- flags -[in] - currently unsupported
-- memory\_handle -[out] - handle for the allocation
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS - memory allocated successfully
-- ::HSA\_STATUS\_ERROR\_NOT\_INITIALIZED - The HSA runtime has not been initialized.
-- ::HSA\_STATUS\_ERROR\_INVALID\_ARGUMENT - Invalid arguments
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - This memory pool does not support allocations
-- ::HSA\_STATUS\_ERROR\_OUT\_OF\_RESOURCES - Insufficient resources to allocate this memory
-
-hsa\_status\_t hsa\_amd\_vmem\_handle\_release ( hsa\_amd\_vmem\_alloc\_handle\_t memory\_handle )
-
-Release a virtual memory handle.
-
-## Parameters
-
-- memory\_handle -[in] handle that was previously allocated
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS - Memory handle released successfully
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - Invalid memory handle
-
-hsa\_status\_t hsa\_amd\_vmem\_map ( void *va, size\_t size, size\_t in\_offset, hsa\_amd\_vmem\_alloc\_handle\_t memory\_handle, uint64\_t flags )
-
-Map a virtual memory handle.
-
-Map a virtual memory handle to a reserved address range. The virtual address requested must be within a previously reserved address range: both va and (va + size) must lie within the previously reserved address range. size must be equal to the size of the memory\_handle. hsa\_amd\_vmem\_set\_access must then be called to make the memory accessible to specific agents.
-
-## Parameters
-
-- va -[in] virtual address range where memory will be mapped
-- size -[in] of memory mapping
-- in\_offset -[in] offset into memory. Currently unsupported
-- memory\_handle -[in] virtual memory handle to be mapped
-- flags -[in] Currently unsupported
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS - Memory mapped successfully
-- ::HSA\_STATUS\_ERROR\_INVALID\_ARGUMENT - va, size or memory\_handle are invalid
-- ::HSA\_STATUS\_ERROR\_OUT\_OF\_RESOURCES - Insufficient resources
-- ::HSA\_STATUS\_ERROR - Unexpected internal error
-
-hsa\_status\_t hsa\_amd\_vmem\_unmap ( void *va, size\_t size )
-
-Unmap a virtual memory handle.
-
-Unmap a previously mapped virtual address range.
-
-## Parameters
-
-- va -[in] virtual address range to be unmapped
-- size -[in] of memory mapping
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS - Memory backing unmapped successfully
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - memory\_handle is invalid
-- ::HSA\_STATUS\_ERROR\_INVALID\_ARGUMENT - size is invalid
-- ::HSA\_STATUS\_ERROR - Unexpected internal error
-
-hsa\_status\_t hsa\_amd\_vmem\_set\_access ( void *va, size\_t size, const hsa\_amd\_memory\_access\_desc\_t *desc, size\_t desc\_cnt )
-
-Make a memory mapping accessible.
-
-Make a previously mapped virtual address accessible to specific agents. size must be equal to the size of the previously mapped virtual memory handle. Calling hsa\_amd\_vmem\_set\_access multiple times on the same va will overwrite the previous permissions for all agents.
-
-## Parameters
-
-- va -[in] previously mapped virtual address
-- size -[in] of memory mapping
-- desc -[in] list of access permissions for each agent
-- desc\_cnt -[in] number of elements in desc
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS -
-- ::HSA\_STATUS\_ERROR\_INVALID\_ARGUMENT - va, size or memory\_handle are invalid
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - memory\_handle is invalid
-- ::HSA\_STATUS\_ERROR\_OUT\_OF\_RESOURCES - Insufficient resources
-- ::HSA\_STATUS\_ERROR\_INVALID\_AGENT - Invalid agent in desc
-- ::HSA\_STATUS\_ERROR - Unexpected internal error
-
-hsa\_status\_t hsa\_amd\_vmem\_get\_access ( void *va, hsa\_access\_permission\_t *perms, hsa\_agent\_t agent\_handle )
-
-Get current access permissions for memory mapping.
-
-Get access permissions for memory mapping for specific agent.
-
-## Parameters
-
-- va -[in] previously mapped virtual address
-- perms -[out] current permissions
-- agent\_handle -[in] agent
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS -
-- ::HSA\_STATUS\_ERROR\_INVALID\_AGENT - Invalid agent
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - va is not mapped or permissions never set for this agent
-- ::HSA\_STATUS\_ERROR - Unexpected internal error
-
-hsa\_status\_t hsa\_amd\_vmem\_export\_shareable\_handle ( int *dmabuf\_fd, hsa\_amd\_vmem\_alloc\_handle\_t handle, uint64\_t flags )
-
-Get an exportable shareable handle.
-
-Get an exportable shareable handle for a memory\_handle. This shareable handle can then be used to re-create a virtual memory handle using hsa\_amd\_vmem\_import\_shareable\_handle. The shareable handle can be transferred using mechanisms that support POSIX file descriptors. Once all shareable handles are closed, the memory\_handle is released.
-
-## Parameters
-
-- dmabuf\_fd -[out] shareable handle
-- handle -[in] previously allocated virtual memory handle
-- flags -[in] Currently unsupported
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS -
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - Invalid memory handle
-- ::HSA\_STATUS\_ERROR\_OUT\_OF\_RESOURCES - Out of resources
-- ::HSA\_STATUS\_ERROR - Unexpected internal error
-
-hsa\_status\_t hsa\_amd\_vmem\_import\_shareable\_handle ( int dmabuf\_fd, hsa\_amd\_vmem\_alloc\_handle\_t *handle )
-
-Import a shareable handle.
-
-Import a shareable handle for a memory handle. Importing a shareable handle that has been closed and released results in undefined behavior.
-
-## Parameters
-
-- dmabuf\_fd -[in] shareable handle exported with hsa\_amd\_vmem\_export\_shareable\_handle
-- handle -[out] virtual memory handle
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS -
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - Invalid memory handle
-- ::HSA\_STATUS\_ERROR\_OUT\_OF\_RESOURCES - Out of resources
-- ::HSA\_STATUS\_ERROR - Unexpected internal error
-
-hsa\_status\_t hsa\_amd\_vmem\_retain\_alloc\_handle ( hsa\_amd\_vmem\_alloc\_handle\_t *memory\_handle, void *addr )
-
-Returns memory handle for mapped memory.
-
-Return a memory handle for previously mapped memory. The handle will have the same value as the handle used to map the memory. The returned handle must be released with a corresponding number of calls to hsa\_amd\_vmem\_handle\_release.
-
-## Parameters
-
-- memory\_handle -[out] memory handle for this mapped address
-- addr -[in] mapped address
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS -
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - Invalid address
-
-hsa\_status\_t hsa\_amd\_vmem\_get\_alloc\_properties\_from\_handle ( hsa\_amd\_vmem\_alloc\_handle\_t memory\_handle, hsa\_amd\_memory\_pool\_t *pool, hsa\_amd\_memory\_type\_t *type )
-
-Returns the current allocation properties of a handle.
-
-Returns the allocation properties of an existing handle
-
-## Parameters
-
-- memory\_handle -[in] memory handle to be queried
-- pool -[out] memory pool that owns this handle
-- type -[out] memory type
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS -
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - Invalid memory\_handle
-
-## Chapter 25: HIP Managed Memory Allocation API
-
-hipError\_t hipMallocManaged ( void **dev\_ptr, size\_t size, unsigned int flags )
-
-Allocates memory that will be automatically managed by HIP.
-
-This API is used for managed memory; it allows data to be shared and accessed by both the CPU and the GPU through a single pointer.
-
-The API returns the allocation pointer, managed by HMM, which can then be used to execute kernels on the device and to fetch data between the host and the device as needed.
-
-Note: It is recommended to perform a capability check before calling this API.
-
-## Parameters
-
-- dev\_ptr -[out] - pointer to allocated device memory
-- size -[in] - requested allocation size in bytes; it should have a granularity of 4 KB
-- flags -[in] - must be either hipMemAttachGlobal or hipMemAttachHost (defaults to hipMemAttachGlobal)
-
-## Returns
-
-hipSuccess, hipErrorMemoryAllocation, hipErrorNotSupported, hipErrorInvalidValue
-
-hipError\_t hipMemPrefetchAsync ( const void *dev\_ptr, size\_t count, int device, hipStream\_t stream )
-
-Prefetches memory to the specified destination device using HIP.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-## Parameters
-
-- dev\_ptr -[in] pointer to be prefetched
-- count -[in] size in bytes for prefetching
-- device -[in] destination device to prefetch to
-- stream -[in] stream to enqueue prefetch operation
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue
-
-hipError\_t hipMemAdvise ( const void *dev\_ptr, size\_t count, hipMemoryAdvise advice, int device )
-
-Advise about the usage of a given memory range to HIP.
-
-This HIP API advises about the usage to be applied to a unified memory allocation in the range starting at the address dev\_ptr and spanning count bytes. The memory range must refer to managed memory allocated via hipMallocManaged. The driver rounds the start of the range down and the end of the range up so that the range is aligned to the CPU page size, the same way the corresponding CUDA API behaves in CUDA version 8.0 and later.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-## Parameters
-
-- dev\_ptr -[in] pointer to memory to set the advice for
-- count -[in] size in bytes of the memory range; it should be aligned to the CPU page size.
-- advice -[in] advice to be applied for the specified memory range
-- device -[in] device to apply the advice for
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue
-
-hipError\_t hipMemRangeGetAttribute ( void *data, size\_t data\_size, hipMemRangeAttribute attribute, const void *dev\_ptr, size\_t count )
-
-Query an attribute of a given memory range in HIP.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-## Parameters
-
-- data -[inout] a pointer to a memory location where the result of each attribute query will be written to
-- data\_size -[in] the size of data
-- attribute -[in] the attribute to query
-- dev\_ptr -[in] start of the range to query
-- count -[in] size of the range to query
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue
-
-hipError\_t hipMemRangeGetAttributes ( void **data, size\_t *data\_sizes, hipMemRangeAttribute *attributes, size\_t num\_attributes, const void *dev\_ptr, size\_t count )
-
-Query attributes of a given memory range in HIP.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-## Parameters
-
-- data -[inout] a two-dimensional array containing pointers to memory locations where the result of each attribute query will be written to
-- data\_sizes -[in] an array, containing the sizes of each result
-- attributes -[in] an array of attributes to query (num\_attributes and the number of attributes in this array should match)
-- num\_attributes -[in] the number of attributes to query
-- dev\_ptr -[in] start of the range to query
-- count -[in] size of the range to query
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue
-
-hipError\_t hipStreamAttachMemAsync ( hipStream\_t stream, void *dev\_ptr, size\_t length, unsigned int flags )
-
-Attach memory to a stream asynchronously in HIP.
-
-Warning: This API is under development. Currently it is a no-operation (NOP) function on AMD GPUs and returns hipSuccess.
-
-## Parameters
-
-- stream -[in] - stream in which to enqueue the attach operation
-- dev\_ptr -[in] - pointer to memory (must be a pointer to managed memory or to a valid host-accessible region of system-allocated memory)
-- length -[in] - length of memory (defaults to zero)
-- flags -[in] - must be one of hipMemAttachGlobal, hipMemAttachHost or hipMemAttachSingle (defaults to hipMemAttachSingle)
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue
-
-template<class T> static inline hipError\_t hipMallocManaged ( T **devPtr, size\_t size, unsigned int flags = hipMemAttachGlobal )
-
-C++ wrapper for hipMallocManaged.
-
-Provides an override to automatically typecast the pointer type from void**, and also provides a default for the flags.
-
-The HIP\_DISABLE\_CPP\_FUNCTIONS macro can be defined to suppress these wrappers. It is useful for applications that need to obtain decltypes of HIP runtime APIs.
-
-## See also:
-
-hipMallocManaged
-
-## Chapter 26: HIP Virtual Memory Management API
-
-hipError\_t hipMemAddressFree ( void *devPtr, size\_t size )
-
-Frees an address range reservation made via hipMemAddressReserve.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- devPtr -[in] - starting address of the range.
-- size -[in] - size of the range.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemAddressReserve ( void **ptr, size\_t size, size\_t alignment, void *addr, unsigned long long flags )
-
-Reserves an address range.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- ptr -[out] - starting address of the reserved range.
-- size -[in] - size of the reservation.
-- alignment -[in] - alignment of the address.
-- addr -[in] - requested starting address of the range.
-- flags -[in] - currently unused, must be zero.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemCreate ( hipMemGenericAllocationHandle\_t *handle, size\_t size, const hipMemAllocationProp *prop, unsigned long long flags )
-
-Creates a memory allocation described by the properties and size.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- handle -[out] - value of the returned handle.
-- size -[in] - size of the allocation.
-- prop -[in] - properties of the allocation.
-- flags -[in] - currently unused, must be zero.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemExportToShareableHandle ( void *shareableHandle, hipMemGenericAllocationHandle\_t handle, hipMemAllocationHandleType handleType, unsigned long long flags )
-
-Exports an allocation to a requested shareable handle type.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- shareableHandle -[out] - value of the returned handle.
-- handle -[in] - handle to share.
-- handleType -[in] - type of the shareable handle.
-- flags -[in] - currently unused, must be zero.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemGetAccess ( unsigned long long *flags, const hipMemLocation *location, void *ptr )
-
-Get the access flags set for the given location and ptr.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- flags -[out] - flags for this location.
-- location -[in] - target location.
-- ptr -[in] - address to check the access flags.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemGetAllocationGranularity ( size\_t *granularity, const hipMemAllocationProp *prop, hipMemAllocationGranularity\_flags option )
-
-Calculates either the minimal or recommended granularity.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- granularity -[out] - returned granularity.
-- prop -[in] - location properties.
-- option -[in] - determines which granularity to return.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemGetAllocationPropertiesFromHandle ( hipMemAllocationProp *prop, hipMemGenericAllocationHandle\_t handle )
-
-Retrieve the property structure of the given handle.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- prop -[out] - properties of the given handle.
-- handle -[in] - handle to perform the query on.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemImportFromShareableHandle ( hipMemGenericAllocationHandle\_t *handle, void *osHandle, hipMemAllocationHandleType shHandleType )
-
-Imports an allocation from a requested shareable handle type.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- handle -[out] - returned value.
-- osHandle -[in] - shareable handle representing the memory allocation.
-- shHandleType -[in] - handle type.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemMap ( void *ptr, size\_t size, size\_t offset, hipMemGenericAllocationHandle\_t handle, unsigned long long flags )
-
-Maps an allocation handle to a reserved virtual address range.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- ptr -[in] - address where the memory will be mapped.
-- size -[in] - size of the mapping.
-- offset -[in] - offset into the memory, currently must be zero.
-- handle -[in] - memory allocation to be mapped.
-- flags -[in] - currently unused, must be zero.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemMapArrayAsync ( hipArrayMapInfo *mapInfoList, unsigned int count, hipStream\_t stream )
-
-Maps or unmaps subregions of sparse HIP arrays and sparse HIP mipmapped arrays.
-
-Warning: This API is under development. Currently it is not supported on AMD GPUs and returns hipErrorNotSupported.
-
-## Parameters
-
-- mapInfoList -[in] - list of hipArrayMapInfo.
-- count -[in] - number of hipArrayMapInfo in mapInfoList.
-- stream -[in] - stream identifier for the stream to use for map or unmap operations.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemRelease ( hipMemGenericAllocationHandle\_t handle )
-
-Release a memory handle representing a memory allocation which was previously allocated through hipMemCreate.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- handle -[in] - handle of the memory allocation.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemRetainAllocationHandle ( hipMemGenericAllocationHandle\_t *handle, void *addr )
-
-Returns the allocation handle of the backing memory allocation given the address.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- handle -[out] - handle representing addr.
-- addr -[in] - address to look up.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemSetAccess ( void *ptr, size\_t size, const hipMemAccessDesc *desc, size\_t count )
-
-Set the access flags for each location specified in desc for the given virtual address range.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- ptr -[in] - starting address of the virtual address range.
-- size -[in] - size of the range.
-- desc -[in] - array of hipMemAccessDesc.
-- count -[in] - number of hipMemAccessDesc in desc.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemUnmap ( void *ptr, size\_t size )
-
-Unmap memory allocation of a given address range.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- ptr -[in] - starting address of the range to unmap.
-- size -[in] - size of the virtual address range.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-## Chapter 27: HIP deprecated runtime API functions
-
-Several of our API functions have been flagged for deprecation. Using the following functions results in errors and unexpected results, so we encourage you to update your code accordingly.
-
-## 27.1 Context management
-
-CUDA supports the cuCtx API, which is the driver API that defines 'Context' and 'Devices' as separate entities. A context contains a single device, and a device can theoretically have multiple contexts. HIP initially added limited support for these APIs to facilitate porting from existing driver codes. These APIs are now marked as deprecated because there are better alternative interfaces (such as hipSetDevice or the stream API) that achieve the same functions.
-
-- hipCtxCreate
-- hipCtxDestroy
-- hipCtxPopCurrent
-- hipCtxPushCurrent
-- hipCtxSetCurrent
-- hipCtxGetCurrent
-- hipCtxGetDevice
-- hipCtxGetApiVersion
-- hipCtxGetCacheConfig
-- hipCtxSetCacheConfig
-- hipCtxSetSharedMemConfig
-- hipCtxGetSharedMemConfig
-- hipCtxSynchronize
-- hipCtxGetFlags
-- hipCtxEnablePeerAccess
-- hipCtxDisablePeerAccess
-- hipDevicePrimaryCtxGetState
-- hipDevicePrimaryCtxRelease
-- hipDevicePrimaryCtxRetain
-- hipDevicePrimaryCtxReset
-- hipDevicePrimaryCtxSetFlags
-
-## 27.2 Memory management
-
-- hipMallocHost (replaced with hipHostMalloc )
-- hipMemAllocHost (replaced with hipHostMalloc )
-- hipHostAlloc (replaced with hipHostMalloc )
-- hipFreeHost (replaced with hipHostFree )
-- hipMemcpyToArray
-- hipMemcpyFromArray
-
-## 27.3 Profiler control
-
-- hipProfilerStart (use roctracer/rocTX)
-- hipProfilerStop (use roctracer/rocTX)
-
-## 27.4 Texture management
-
-- hipGetTextureReference
-- hipTexRefSetAddressMode
-- hipTexRefSetArray
-- hipTexRefSetFilterMode
-- hipTexRefSetFlags
-- hipTexRefSetFormat
-- hipTexRefGetAddress
-- hipTexRefGetAddressMode
-- hipTexRefGetFilterMode
-- hipTexRefGetFlags
-- hipTexRefGetFormat
-- hipTexRefGetMaxAnisotropy
-- hipTexRefGetMipmapFilterMode
-- hipTexRefGetMipmapLevelBias
-- hipTexRefGetMipmapLevelClamp
-- hipTexRefGetMipMappedArray
-- hipTexRefSetAddress
-- hipTexRefSetAddress2D
-- hipTexRefSetMaxAnisotropy
-- hipTexRefSetBorderColor
-- hipTexRefSetMipmapFilterMode
-- hipTexRefSetMipmapLevelBias
-- hipTexRefSetMipmapLevelClamp
-- hipTexRefSetMipmappedArray
-- hipTexRefGetBorderColor
-- hipTexRefGetArray
-- hipBindTexture
-- hipBindTexture2D
-- hipBindTextureToArray
-- hipGetTextureAlignmentOffset
-- hipUnbindTexture
-- hipBindTextureToMipmappedArray
-
-## Chapter 28: SAXPY - Hello, HIP
-
-This tutorial explains the basic concepts of the single-source Heterogeneous-computing Interface for Portability (HIP) programming model and the essential tooling around it. It also reviews some commonalities of heterogeneous APIs in general. This topic assumes basic familiarity with the C/C++ compilation model and language.
-
-## 28.1 Prerequisites
-
-To follow this tutorial, you'll need installed drivers and a HIP compiler toolchain to compile your code. Because HIP for ROCm supports compiling and running on Linux and Windows with AMD and NVIDIA GPUs, there are more combinations of installation instructions than this tutorial can cover. For more information about installing HIP development packages, see Install HIP.
-
-## 28.2 Heterogeneous programming
-
-Heterogeneous programming and offloading APIs are often mentioned together. Heterogeneous programming deals with devices of varying capabilities simultaneously. Offloading focuses on the 'remote' and asynchronous aspects of computation. HIP encompasses both. It exposes GPGPU (general-purpose GPU) programming much like ordinary host-side CPU programming and lets you move data across various devices.
-
-When programming in HIP (and other heterogeneous APIs for that matter), remember that target devices are built for a specific purpose. They are designed with different tradeoffs than traditional CPUs and therefore have very different performance characteristics. Even subtle changes in code might adversely affect execution time.
-
-## 28.3 Your first lines of HIP code
-
-First, let's do the 'Hello, World!' of GPGPU: SAXPY. Single-precision A times X Plus Y ( SAXPY ) is a mathematical acronym; a vector equation 𝑎 · 𝑥 + 𝑦 = 𝑧 where 𝑎 ∈ R is a scalar and 𝑥, 𝑦, 𝑧 ∈ V are vector quantities of some large dimensionality. This vector space is defined over the set of reals. Practically speaking, you can compute this using a single for loop over three arrays.
-**Following code does:** This snippet sketches the sequential CPU form of SAXPY described above: a single `for` loop over three arrays that computes `z[i] = a * x[i] + y[i]` for every index. The element count `N` is an assumed name.
-
-
-```
-for (int i = 0; i < N; ++i)
-    z[i] = a * x[i] + y[i];
-```
-
-In linear algebra libraries such as BLAS (Basic Linear Algebra Subprograms), this operation is defined as AXPY, 'A times X Plus Y'. The 'S' stands for single-precision, meaning that the array elements are floats (IEEE 754 binary32 representation).
-
-To quickly get started, use the set of HIP samples from GitHub. With Git configured on your machine, open a command line and navigate to your desired working directory, then run:
-**Following code does:** This command clones the ROCm examples repository from GitHub into the current working directory.
-
-
-```
-git clone https://github.com/amd/rocm-examples.git
-```
-
-A simple implementation of SAXPY resides in the HIP-Basic/saxpy/main.hip file in this repository. The HIP code here mostly deals with where data has to be and when, and how devices transform this data. The first HIP calls deal with allocating device-side memory and copying data from host-side memory to device side in a C runtime-like fashion.
-**Following code does:** This snippet allocates two device buffers, `d_x` and `d_y`, with `hipMalloc` and copies the host vectors `x` and `y` into them with `hipMemcpy` using the `hipMemcpyHostToDevice` direction. Each call is wrapped in the `HIP_CHECK` macro so that a failing API call is reported.
-
-
-```
-// Allocate and copy vectors to device memory.
-float* d_x{};
-float* d_y{};
-HIP_CHECK(hipMalloc(&d_x, size_bytes));
-HIP_CHECK(hipMalloc(&d_y, size_bytes));
-HIP_CHECK(hipMemcpy(d_x, x.data(), size_bytes, hipMemcpyHostToDevice));
-HIP_CHECK(hipMemcpy(d_y, y.data(), size_bytes, hipMemcpyHostToDevice));
-```
-
-HIP\_CHECK is a custom macro borrowed from the examples utilities. It checks the error code returned by an API function and reports any error to the console. It is not essential to the API, but checking the error codes of HIP APIs is good practice, because you might pass incorrect values to an API or the API might run out of resources.
-
-The code does not select a device explicitly for the allocations and copies. Commands are issued to the HIP runtime per thread, and every thread has a device set as the target of its commands. The default device is 0 , so the code behaves as if it had called hipSetDevice(0) .
-
-Launch the calculation on the device after the input data has been prepared.
-**Following code does:** This snippet declares the `saxpy_kernel` device function with the `__global__` qualifier and shows `main` launching it on the default stream. The kernel body and the surrounding host code are elided with `//...` comments.
-
-
-```
- __global__ void saxpy_kernel(const float a, const float* d_x, float* d_y, const unsigned int size)
- {
- //...
- }
-
- int main()
- {
- //...
-
- // Launch the kernel on the default stream.
- saxpy_kernel<<
HIP Documentation Release 6.1.40092
-INSTALL
-
-- 1 Overview
-- 2 Install HIP
-  - 2.1 Prerequisites
-  - 2.2 Installation
-  - 2.3 Verify your installation
-- 3 Build HIP from source
-  - 3.1 Prerequisites
-  - 3.2 Building the HIP runtime
-  - 3.3 Build HIP tests
-  - 3.4 Run HIP
-- 4 HIP programming model
-  - 4.1 RDNA & CDNA architecture summary
-  - 4.2 Heterogeneous Programming
-  - 4.3 Single instruction multiple threads (SIMT)
-  - 4.4 Inherent thread model
-    - 4.4.1 Cooperative groups thread model
-  - 4.5 Memory model
-  - 4.6 Execution model
-    - 4.6.1 Host-side execution
-    - 4.6.2 Device-side execution
-    - 4.6.3 Kernel launch
-- 5 Hardware implementation
-  - 5.1 Compute units
-    - 5.1.1 SIMD
-    - 5.1.2 Vector cache
-    - 5.1.3 Local data share
-    - 5.1.4 Scalar Unit
-  - 5.2 CDNA architecture
-  - 5.3 RDNA architecture
-  - 5.4 Shader engines
-- 6 AMD common language runtimes (CLR)
-  - 6.1 Project organization
-  - 6.2 How to build/install
-    - 6.2.1 Prerequisites
-    - 6.2.2 Linux
-    - 6.2.3 Test
-    - 6.2.4 Release notes
-- 7 HIP programming manual
-  - 7.1 Host Memory
-    - 7.1.1 Introduction
-    - 7.1.2 Memory allocation flags
-    - 7.1.3 Numa-aware host memory allocation
-    - 7.1.4 Coherency Controls
-    - 7.1.5 Visibility of Zero-Copy Host Memory
-    - 7.1.6 hipEventSynchronize
-    - 7.1.7 Summary and Recommendations
-    - 7.1.8 Managed memory allocation
-    - 7.1.9 HIP Stream Memory Operations
-  - 7.2 Direct Dispatch
-  - 7.3 HIP Runtime Compilation
-  - 7.4 HIP Graph
-  - 7.5 Device-Side Malloc
-  - 7.6 Use of Per-thread default stream
-  - 7.7 Use of Long Double Type
-  - 7.8 Use of _Float16 Type
-  - 7.9 FMA and contractions
-  - 7.10 Math functions with special rounding modes
-  - 7.11 Creating Static Libraries
-- 8 HIP porting guide
-  - 8.1 Porting a New CUDA Project
-    - 8.1.1 General Tips
-    - 8.1.2 Scanning existing CUDA code to scope the porting effort
-    - 8.1.3 Converting a project 'in-place'
-    - 8.1.4 Library Equivalents
-  - 8.2 Distinguishing Compiler Modes
-    - 8.2.1 Identifying HIP Target Platform
-    - 8.2.2 Identifying the Compiler: hip-clang or NVCC
-    - 8.2.3 Identifying Current Compilation Pass: Host or Device
-    - 8.2.4 Compiler Defines: Summary
-  - 8.3 Identifying Architecture Features
-    - 8.3.1 HIP_ARCH Defines
-    - 8.3.2 Device-Architecture Properties
-    - 8.3.3 Table of Architecture Properties
-  - 8.4 Finding HIP
-  - 8.5 Identifying HIP Runtime
-  - 8.6 hipLaunchKernelGGL
-  - 8.7 Compiler Options
-    - 8.7.1 Compiler options supported on AMD platforms
-  - 8.8 Linking Issues
-    - 8.8.1 Linking With hipcc
-    - 8.8.2 -lm Option
-  - 8.9 Linking Code With Other Compilers
-    - 8.9.1 libc++ and libstdc++
-    - 8.9.2 HIP Headers ( hip_runtime.h , hip_runtime_api.h )
-    - 8.9.3 Using a Standard C++ Compiler
-      - 8.9.3.1 cuda.h
-    - 8.9.4 Choosing HIP File Extensions
-  - 8.10 Workarounds
-    - 8.10.1 warpSize
-    - 8.10.2 Kernel launch with group size > 256
-  - 8.11 memcpyToSymbol
-  - 8.12 CU_POINTER_ATTRIBUTE_MEMORY_TYPE
-  - 8.13 threadfence_system
-    - 8.13.1 Textures and Cache Control
-  - 8.14 More Tips
-    - 8.14.1 HIP Logging
-    - 8.14.2 Debugging hipcc
-    - 8.14.3 Editor Highlighting
-- 9 Porting CUDA driver API
-  - 9.1 Introduction to the CUDA Driver and Runtime APIs
-    - 9.1.1 cuModule API
-    - 9.1.2 cuCtx API
-  - 9.2 HIP Module and Ctx APIs
-    - 9.2.1 hipModule API
-    - 9.2.2 hipCtx API
-    - 9.2.3 hipify translation of CUDA Driver API
-      - 9.2.3.1 Address Spaces
-      - 9.2.3.2 Using hipModuleLaunchKernel
-      - 9.2.3.3 Additional Information
-    - 9.2.4 hip-clang Implementation Notes
-      - 9.2.4.1 .hip_fatbin
-      - 9.2.4.2 Initialization and Termination Functions
-      - 9.2.4.3 Kernel Launching
-    - 9.2.5 NVCC Implementation Notes
-      - 9.2.5.1 Interoperation between HIP and CUDA Driver
-      - 9.2.5.2 Compilation Options
-  - 9.3 HIP Module and Texture Driver API
-- 10 Programming for HIP runtime compiler (RTC)
-  - 10.1 Example
-  - 10.2 HIPRTC specific options
-    - 10.2.1 Bitcode
-    - 10.2.2 CU Mode vs WGP mode
-  - 10.3 Linker APIs
-    - 10.3.1 Example
-      - 10.3.1.1 Note
-    - 10.3.2 Input Types
-    - 10.3.3 Backward Compatibility of LLVM Bitcode/IR
-    - 10.3.4 Link Options
-  - 10.4 Error Handling
-  - 10.5 HIPRTC General APIs
-  - 10.6 Lowered Names (Mangled Names)
-    - 10.6.1 Note
-    - 10.6.2 Example
-  - 10.7 Versioning
-  - 10.8 HIP header support
-  - 10.9 Deprecation notice
-- 11 Performance guidelines
-  - 11.1 Parallel execution
-    - 11.1.3 Multiprocessor level
-  - 11.2 Memory optimization
-    - 11.2.1 Data Transfer
-    - 11.2.2 Device Memory Access
-  - 11.3 Optimization for maximum instruction throughput
-    - 11.3.1 Arithmetic instructions
-    - 11.3.2 Control flow instructions
-    - 11.3.3 Synchronization
-  - 11.4 Minimizing memory thrashing
-- 12 Debugging with HIP
-  - 12.1 Tracing
-  - 12.2 Debugging
-    - 12.2.1 Debugging HIP applications
-  - 12.3 Useful environment variables
-    - 12.3.1 Kernel enqueue serialization
-    - 12.3.2 Making device visible
-    - 12.3.3 Dump code object
-    - 12.3.4 HSA-related environment variables (Linux)
-    - 12.3.5 HIP environment variable summary
-  - 12.4 General debugging tips
-- 13 Logging HIP activity
-  - 13.1 Logging level
-  - 13.2 Logging mask
-  - 13.3 Logging command
-  - 13.4 Logging examples
-- 14 Cooperative groups
-  - 14.1 Cooperative groups thread model
-  - 14.2 Group types
-    - 14.2.1 Thread-block group
-    - 14.2.2 Grid group
-    - 14.2.3 Multi-grid group
-    - 14.2.4 Thread-block tile
-    - 14.2.5 Coalesced groups
-  - 14.3 Cooperative groups simple example
-  - 14.4 Synchronization
-  - 14.5 Unsupported NVIDIA CUDA features
-- 15 Unified memory
-  - 15.1 Unified memory
-  - 15.2 System requirements
-  - 15.3 Unified memory programming models
-    - 15.3.1 Checking unified memory management support
-    - 15.3.2 Example for unified memory management
-  - 15.4 Using unified memory management (UMM)
-  - 15.5 Unified memory HIP runtime hints for the better performance
-    - 15.5.1 Data prefetching
-    - 15.5.2 Memory advice
-    - 15.5.3 Memory range attributes
-    - 15.5.4 Asynchronously attach memory to a stream
-- 16 Virtual memory management
-  - 16.1 Memory allocation
-    - 16.1.1 Allocate physical memory
-    - 16.1.2 Reserve virtual address range
-    - 16.1.3 Set memory access
-    - 16.1.4 Free virtual memory
-  - 16.2 Memory usage
-    - 16.2.1 Dynamically increase allocation size
-- 17 Frequently asked questions
-  - 17.1 What APIs and features does HIP support?
-  - 17.2 What is not supported?
-    - 17.2.1 Runtime/Driver API features
-    - 17.2.2 Kernel language features
-  - 17.3 Is HIP a drop-in replacement for CUDA?
-  - 17.4 What specific version of CUDA does HIP support?
-  - 17.5 What libraries does HIP support?
-  - 17.6 How does HIP compare with OpenCL?
-  - 17.7 How does porting CUDA to HIP compare to porting CUDA to OpenCL?
-  - 17.8 What hardware does HIP support?
-  - 17.9 Do HIPIFY tools automatically convert all source code?
-  - 17.10 What is NVCC?
-  - 17.11 What is HIP-Clang?
-  - 17.12 Why use HIP rather than supporting CUDA directly?
-  - 17.13 Can I develop HIP code on an NVIDIA CUDA platform?
-  - 17.14 Can I develop HIP code on an AMD HIP-Clang platform?
-  - 17.15 How to use HIP-Clang to build HIP programs?
-  - 17.16 What is AMD clr?
-  - 17.17 What is hipother?
-  - 17.18 Can I get HIP open source repository for Windows?
-  - 17.19 Can a HIP binary run on both AMD and NVIDIA platforms?
-  - 17.20 On HIP-Clang, can I link HIP code with host code compiled with another compiler such as icc or clang?
-  - 17.21 Can HIP API support C style application? What is the difference between C and C++?
-  - 17.22 Can I install both CUDA SDK and HIP-Clang on the same machine?
-  - 17.23 HIP detected my platform (HIP-Clang vs NVCC) incorrectly - what should I do?
-  - 17.24 On CUDA, can I mix CUDA code with HIP code?
-  - 17.25 How do I trace HIP application flow?
-  - 17.26 What are the maximum limits of kernel launch parameters?
-  - 17.27 Are __shfl_*_sync functions supported on HIP platform?
-  - 17.28 How to create a guard for code that is specific to the host or the GPU?
-  - 17.29 Why _OpenMP is undefined when compiling with -fopenmp ?
-  - 17.30 Does the HIP-Clang compiler support extern shared declarations?
-  - 17.31 I have multiple HIP enabled devices and I am getting an error code hipErrorSharedObjectInitFailed with the message 'Error: shared object initialization failed'?
-  - 17.32 How to use per-thread default stream in HIP?
-  - 17.33 How to use complex multiplication and division operations?
-  - 17.34 Can I develop applications with HIP APIs on Windows the same on Linux?
-  - 17.35 Does HIP support LUID?
-  - 17.36 How can I know the version of HIP?
-- 18 HIP Runtime
-  - 18.1 Related API Reference Pages
-    - 18.3.1 Namespace List
-    - 18.3.2 Namespace Members
-      - 18.3.2.1 Namespace Members
-      - 18.3.2.2 Namespace Members
-  - 18.4 Data Structures
-    - 18.4.1 Data Structures
-    - 18.4.2 Data Structure Index
-    - 18.4.3 Class Hierarchy
-    - 18.4.4 Data Fields
-      - 18.4.4.1 All
-        - 18.4.4.1.1 to 18.4.4.1.25 Data Fields
-      - 18.4.4.2 Data Fields - Functions
-      - 18.4.4.3 Variables
-        - 18.4.4.3.1 to 18.4.4.3.25 Data Fields - Variables
-      - 18.4.4.4 Data Fields - Related Symbols
-  - 18.5 Files
-    - 18.5.1 File List
-    - 18.5.2 Globals
-      - 18.5.2.1 All
-        - 18.5.2.1.1 to 18.5.2.1.9 Globals
-      - 18.5.2.2 Functions
-        - 18.5.2.2.1 to 18.5.2.2.3 Globals
-      - 18.5.2.3 Globals
-      - 18.5.2.4 Globals
-      - 18.5.2.5 Globals
-      - 18.5.2.6 Enumerator
-        - 18.5.2.6.1 to 18.5.2.6.2 Globals
-      - 18.5.2.7 Globals
-- 19 C++ language extensions
-  - 19.1 Function-type qualifiers
-    - 19.1.1 __device__
-    - 19.1.2 __global__
-    - 19.1.3 __host__
-  - 19.2 Calling __global__ functions
-  - 19.3 Kernel launch example
-  - 19.4 Variable type qualifiers
. . . . 130 . 130 19.4.1 __constant__ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19.4.2 __shared__ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 19.4.3 __managed__ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 19.4.4 __restrict__ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 19.5 . Built-in variables . . . . . . . . . . . . . . . . . . . . . . 130 19.5.1 Coordinate built-ins . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 19.5.2 warpSize . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 19.6 Vector types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 19.6.1 Short vector types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 19.7 19.6.2 dim3 . . . . . . . . . . . . . . . . . . Memory fence instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 132
-. . . 19.9 Math functions . . . . . . . . . . . 132 19.10 Texture functions . . . . . . . . . . . . . 133 19.11 Surface functions . . . . . . . . . . . . . 133 19.12 Timer functions . . . . . . . . . . . . . . 137 19.13 Atomic functions 138 19.13.1 . . . . . . . . . . . . . Unsafe floating-point atomic RMWoperations 139 19.14 Warp cross-lane functions . . 140 19.14.1 . . . . . . Warp vote and ballot functions . 140 19.14.2 Warp match functions . . . . . 141 19.14.3 . Warp shuffle functions . . . . . 142 19.15 Cooperative groups functions . . . . . . 142 19.16 Warp matrix functions . . . . . . . 143 19.17 Independent . . . thread scheduling . . . . . . 144 19.18 Profiler Counter Function . . . . . . . . 144 19.19 Assert . . . . . . . . . . . . . . . . . . . 144 19.20 . . . . printf . . . . . . . . . . . . . . 144 19.21 Device-Side Dynamic Global Memory Allocation . . . . . . . . 145 19.22 __launch_bounds__ . . 145 19.22.1 Compiler Impact . . . . . . . . 145 19.22.2 CU and EU Definitions . . . . . 146 146 19.22.3 19.22.4 maxregcount Porting from CUDA __launch_bounds . . . . . . . . . . 146 19.23 Asynchronous Functions . . . . . . . . . 147 19.23.1 Memory stream . . . . . . . . . 147 19.23.2 Peer to peer . . . . . . 163 19.23.3 . . . . . Memory management . . . 165 19.23.4 . . . External Resource Interoperability 195 Register Keyword . . . . . . . . . . . . . 197 19.24 19.25 Pragma Unroll . . . . . . . . . . . . . 198 19.26 In-Line Assembly . . . . . . . . . . . . 198 19.27 Kernel Compilation . . . . . . . . . . . 198 19.28 gfx-arch-specific-kernel . . . . . . . . . 199 C++ language 20.1 20.1.1 Modern C++ support . . . . . . . . . . . C++11 support . . . . . . . . . 201 201 20.1.2 C++14 support . . . . . . . . . 202 20.1.3 C++17 support . . . . . . . . . 202 20.1.4 C++20 support . . . . . . . . . 202 20.2 Extensions and restrictions . . . . . . . . 202 20.2.1 Global functions . . 202 20.2.2 . . . . . . Device space memory specifiers . . . . 
202 20.2.3 Exception handling . . . . . 203 20.2.4 Kernel parameters . . . . . . 203 20.2.5 Classes . . . . 203 20.2.6 . . . . . . . . . Polymorphic function wrappers . 203 20.2.7 Extended lambdas . . . . . . . . 203 Inline namespaces 20.2.8 . . . . . . . 203 . 205 21 HIP math API 205 21.1 Single precision mathematical functions . . . . Double precision mathematical functions . . . Integer . . . . 215 21.2 intrinsics . . . . . . . . . . . . 21.3 225
-21.4 Floating-point Intrinsics . . . . . . . . . . . . . . 227 22 Table comparing syntax for different compute APIs 231 22.1 Notes . . . . . . . . . . . . . . . . . . . . . . . . 232 23 HIP Cooperative groups API 233 23.1 Cooperative kernel launches . . . . . . . . . . . . 233 23.2 Cooperative groups classes . . . . . . . . . . . . . 234 23.3 Cooperative groups construct functions . . . . 237 23.4 . . Cooperative groups exposed API functions . . . . 238 24 HSA runtime API for ROCm 241 25 HIP managed memory allocation API 247 26 HIP virtual memory management API 251 27 HIP deprecated runtime API functions 257 27.1 Context management . . . . . . . . . . . . . . . . 257 27.2 Memory management . . . . . . . . . . . . . . . . . . 258 27.3 Profiler control . . . . . . . . . . . . . . . . 258 27.4 Texture management . . . . . . . . . . . . . . . . 258 28 SAXPY - Hello, HIP 261 28.1 Prerequisites . . . . . . . . . . . . . . . . . 261 . . . 28.2 Heterogeneous programming . . . . . . . . . 261 . . 28.3 Your first lines of HIP code . . . . . . . . . 261 . . . 28.4 Compiling on the command line . . . . . . . . . . 263 28.4.1 Setting up the command line . . . . . . . 263 28.4.2 Invoking the compiler manually . . . . . 266 29 Reduction 273 29.1 The algorithm . . . 29.2 . . . . . . . . . . . . . . . . 273 Reduction on GPUs . . . . . . . . . . . . . . . . 273 29.2.1 Naive shared reduction . . . . . . . . . . 274 29.2.2 Reducing thread divergence . . . . . . . . 276 29.2.3 Resolving bank conflicts . . . . . . . . . 276 29.2.4 Utilize upper half of the block . . . . . . . . . . . . . 277 29.2.5 Unroll all loops . . . . . . . 281 29.2.6 Communicate using warp-collective functions 282 29.2.7 Prefer warp communication over shared 282 29.2.8 . Amortize bookkeeping variable overhead 284 29.2.8.1 Reading ItemsPerThread . . . 285 29.2.8.2 Processing ItemsPerThread . . 286 29.2.9 Two-pass reduction . . . . . . . . . . . . 286 29.2.10 Global data share . . . . . . . . . . . . . 
286 29.3 Conclusion . . . . . . . . . . . . . . . . . . . . . 287 30 Cooperative groups . 289 30.1 Prerequisites . . . . . . . . . . . . . . . . . . . 289 30.2 Simple HIP Code . . . . . . . . . . . . . . . . . . 289 Tiled partition . . . . . . . 289 30.3 . . . . . . . . . . . . 30.3.1 Device-side code . . . . . . . . . 290 . . . . 30.3.1.1 1. Initialization of the reduction 291 function variables . . . 30.3.1.2 2. The reduction of thread block . . . . . . . . . . . . 291
-30.3.1.3 3. The reduction of custom partition . . . . . . . . . . . . . . . . . . . . . . . . . . 291 30.3.2 Host-side code . . . . . . . . . . . . 292 30.3.2.1 1. Confirm the cooperative group support on AMDGPUs 30.3.2.2 . . . . . 292 2. Initialize the cooperative group configuration . . . . . . . . . . . . . . . . . . . 293 30.3.2.3 Conclusion 4. Launch the kernel . . . . . 293 30.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293 31 License 295
-
-Install
-
-
-Conceptual
-
-
-How to
-
-
-Reference
-
-
-CHAPTER ONE
-OVERVIEW
-
-
-Tutorial
-
-
-CHAPTER
-TWO
-INSTALL HIP
-2.1 Prerequisites
-AMD
-
-
-NVIDIA
-2.2 Installation
-AMD
-
-
-NVIDIA
-
-
-
-
-
-apt-get install hip-runtime-nvidia hip-dev
-The default paths are:
-
-
-2.3 Verify your installation
-THREE
-BUILD HIP FROM SOURCE
-3.1 Prerequisites
-
-apt-get install python3
-pip3 install CppHeaderParser
-3.2 Building the HIP runtime
-
-| export
-AMD
-
-
-
-git clone -b "$ROCM_BRANCH" https://github.com/ROCm/clr.git
-git clone -b "$ROCM_BRANCH" https://github.com/ROCm/hip.git
-
- cd "$CLR_DIR"
- mkdir -p build; cd build
- cmake -DHIP_COMMON_DIR=$HIP_DIR -DHIP_PLATFORM=amd -DCMAKE_PREFIX_PATH="/opt/rocm/" \
-   -DCMAKE_INSTALL_PREFIX=$PWD/install -DHIP_CATCH_TEST=0 -DCLR_BUILD_HIP=ON \
-   -DCLR_BUILD_OCL=OFF ..
-
- make -j$(nproc)
- sudo make install
-
-
-
-
-
-
-
- Flags:
-
-
-
-NVIDIA
-
-1. Get the HIP source code.
-git clone -b "$ROCM_BRANCH" https://github.com/ROCm/clr.git
-git clone -b "$ROCM_BRANCH" https://github.com/ROCm/hip.git
-git clone -b "$ROCM_BRANCH" https://github.com/ROCm/hipother.git
-
-2. Set the environment variables.
-export CLR_DIR="$(readlink -f clr)"
-export HIP_DIR="$(readlink -f hip)"
-export HIP_OTHER="$(readlink -f hipother)"
-3. Build HIP.
-cd "$CLR_DIR"
-mkdir -p build; cd build
-cmake -DHIP_COMMON_DIR=$HIP_DIR -DHIP_PLATFORM=nvidia -DCMAKE_INSTALL_PREFIX=$PWD/install \
-  -DHIP_CATCH_TEST=0 -DCLR_BUILD_HIP=ON -DCLR_BUILD_OCL=OFF -DHIPNV_DIR=$HIP_OTHER/hipnv ..
-make -j$(nproc)
-sudo make install
-3.3 Build HIP tests
-AMD
-
-
-
-git clone -b "$ROCM_BRANCH" https://github.com/ROCm/hip-tests.git
-
- export HIPTESTS_DIR="$(readlink -f hip-tests)"
- cd "$HIPTESTS_DIR"
- mkdir -p build; cd build
- cmake ../catch -DHIP_PLATFORM=amd -DHIP_PATH=$CLR_DIR/build/install # or any path where HIP is installed; for example: /opt/rocm
- make build_tests
- ctest # run tests
-AMD
-
- * Build HIP catch tests.
-
- HIP catch tests are separate from the HIP project and use Catch2.
-
- - Get HIP tests source code.
-
- cd "$HIPTESTS_DIR"
- hipcc $HIPTESTS_DIR/catch/unit/memory/hipPointerGetAttributes.cc \
-   -I./catch/include ./catch/hipTestMain/standalone_main.cc \
-   -I./catch/external/Catch2 -o hipPointerGetAttributes
- ./hipPointerGetAttributes
- ...
-
- All tests passed
-NVIDIA
-3.4 Run HIP
-FOUR
-HIP PROGRAMMING MODEL
-4.1 RDNA & CDNA architecture summary
-

4.2 Heterogeneous Programming
-
-
-4.3 Single instruction multiple threads (SIMT)
-
-__global__ void k(float4* a, const float4* b)
-{
- int tid = threadIdx.x;
- int bid = blockIdx.x;
- int dim = blockDim.x;
-
- a[tid] += (tid + bid - dim) * b[tid];
-}
-4.4 Inherent thread model
-Warp (or Wavefront)
-Block
-Grid
-4.4.1 Cooperative groups thread model
-4.5 Memory model
-Local or per-thread memory
-Shared memory
-Global
-Constant
-Texture
-Surface
-4.6 Execution model
-
-
-4.6.1 Host-side execution
-4.6.2 Device-side execution
-4.6.3 Kernel launch
-
-
-
-
-HARDWARE IMPLEMENTATION
-5.1 Compute units
-
-
-5.1.1 SIMD
-5.1.2 Vector cache
-5.1.3 Local data share
-5.1.4 Scalar Unit
-5.2 CDNA architecture
-
5.3 RDNA architecture
-
5.4 Shader engines
-
CHAPTER
-SIX
-AMD COMMON LANGUAGE RUNTIMES (CLR)
-6.1 Project organization
-
-
-6.2 How to build/install
-6.2.1 Prerequisites
-6.2.2 Linux
-
-
-
-
-
-
- * For HIP
-6.2.3 Test
-6.2.4 Release notes
-HIP PROGRAMMING MANUAL
-7.1 Host Memory
-7.1.1 Introduction
-
-
-7.1.2 Memory allocation flags
-7.1.3 Numa-aware host memory allocation
-7.1.4 Coherency Controls
-
-
-
-
-7.1.5 Visibility of Zero-Copy Host Memory
-
-| HIP API | Synchronization Effect | Fence | Coherent Host Memory Visibility | Non-Coherent Host Memory Visibility |
-| --- | --- | --- | --- | --- |
-| hipStreamSynchronize | host waits for all commands in the specified stream to complete | system-scope release | yes | yes |
-| hipDeviceSynchronize | host waits for all commands in all streams on the specified device to complete | system-scope release | yes | yes |
-| hipEventSynchronize | host waits for the specified event to complete | device-scope release | yes | depends - see below |
-| hipStreamWaitEvent | stream waits for the specified event to complete | none | yes | no |
-
-7.1.6 hipEventSynchronize
-
-
-
-
-7.1.7 Summary and Recommendations
-
-
-7.1.8 Managed memory allocation
-
-
-
-
-7.1.9 HIP Stream Memory Operations
-7.2 Direct Dispatch
-7.3 HIP Runtime Compilation
-7.4 HIP Graph
-7.5 Device-Side Malloc
-7.6 Use of Per-thread default stream
-7.7 Use of Long Double Type
-7.8 Use of _Float16 Type
-7.9 FMA and contractions
-7.10 Math functions with special rounding modes
-7.11 Creating Static Libraries
-
-
-
-hipcc hipDevice.cpp -c -fgpu-rdc -o hipDevice.o
-ar rcsD libHipDevice.a hipDevice.o
-hipcc libHipDevice.a test.cpp -fgpu-rdc -o test.out
-EIGHT
-HIP PORTING GUIDE
-8.1 Porting a New CUDA Project
-8.1.1 General Tips
-
-
-8.1.2 Scanning existing CUDA code to scope the porting effort
-
-8.1.3 Converting a project 'in-place'
-
-hipify-perl --inplace
-
-
-
-8.1.4 Library Equivalents
-
-
-
-| CUDA Library | HIP Library | ROCm Library | Comment |
-| --- | --- | --- | --- |
-| cuBLAS | hipBLAS | rocBLAS | Basic Linear Algebra Subroutines |
-| cuBLASLt | hipBLASLt | N/A | Basic Linear Algebra Subroutines, lightweight and new flexible API |
-| cuFFT | hipFFT | rocFFT | Fast Fourier Transform Library |
-| cuSPARSE | hipSPARSE | rocSPARSE | Sparse BLAS + SPMV |
-| cuSOLVER | hipSOLVER | rocSOLVER | Lapack library |
-| AmgX | N/A | rocALUTION | Sparse iterative solvers and preconditioners with algebraic multigrid |
-| Thrust | N/A | rocThrust | C++ parallel algorithms library |
-| CUB | hipCUB | rocPRIM | Low Level Optimized Parallel Primitives |
-| cuDNN | N/A | MIOpen | Deep learning Solver Library |
-| cuRAND | hipRAND | rocRAND | Random Number Generator Library |
-| EIGEN | EIGEN | N/A | C++ template library for linear algebra: matrices, vectors, numerical solvers |
-| NCCL | N/A | RCCL | Communications Primitives Library based on the MPI equivalents |
-
-8.2 Distinguishing Compiler Modes
-8.2.1 Identifying HIP Target Platform
-
-
-8.2.2 Identifying the Compiler: hip-clang or NVCC
-
-#ifdef __HIP_PLATFORM_AMD__
-// Compiled with HIP-Clang
-#endif
-#ifdef __HIP_PLATFORM_NVIDIA__
-// Compiled with nvcc
-// Could be compiling with CUDA language extensions enabled (for example, a ".cu" file)
-// Could be in pass-through mode to an underlying host compiler (for example, a ".cpp" file)
-#endif
-#ifdef __CUDACC__
-// Compiled with nvcc (CUDA language extensions enabled)
-#endif
-8.2.3 Identifying Current Compilation Pass: Host or Device
-
-
-#if __HIP_DEVICE_COMPILE__
-8.2.4 Compiler Defines: Summary
-
-| Define | HIP-Clang | NVCC | Other (GCC, ICC, Clang, etc.) |
-| --- | --- | --- | --- |
-| HIP-related defines: | | | |
-| __HIP_PLATFORM_AMD__ | Defined | Undefined | Defined if targeting AMD platform; undefined otherwise |
-| __HIP_PLATFORM_NVIDIA__ | Undefined | Defined | Defined if targeting NVIDIA platform; undefined otherwise |
-| __HIP_DEVICE_COMPILE__ | 1 if compiling for device; undefined if compiling for host | 1 if compiling for device; undefined if compiling for host | Undefined |
-| __HIPCC__ | Defined | Defined | Undefined |
-| __HIP_ARCH_* | 0 or 1 depending on feature support (see below) | 0 or 1 depending on feature support (see below) | 0 |
-| NVCC-related defines: | | | |
-| __CUDACC__ | | Defined if source code is compiled by NVCC; undefined otherwise | Undefined |
-| __NVCC__ | Undefined | Defined | Undefined |
-| __CUDA_ARCH__ | Undefined | Unsigned representing compute capability (e.g., '130') if in device code; 0 if in host code | Undefined |
-| hip-clang-related defines: | | | |
-| __HIP__ | Defined | Undefined | Undefined |
-| HIP-Clang common defines: | | | |
-| __clang__ | Defined | Defined | Undefined |
-
-8.3 Identifying Architecture Features
-8.3.1 HIP_ARCH Defines
-
-#if (__CUDA_ARCH__ >= 130)
-// doubles are supported
-#endif
-
-//#if (__CUDA_ARCH__ >= 130) // non-portable
-if (__HIP_ARCH_HAS_DOUBLES__) { // portable HIP feature query
-  // doubles are supported
-}
-8.3.2 Device-Architecture Properties
-
-hipGetDeviceProperties(&deviceProp, device);
-//if ((deviceProp.major == 1 && deviceProp.minor < 2)) // non-portable
-if (deviceProp.arch.hasSharedInt32Atomics) { // portable HIP feature query
- // has shared int32 atomic operations...
-}
-8.3.3 Table of Architecture Properties
-
-| Define (use only in device code) | Device Property (run-time query) | Comment |
-| --- | --- | --- |
-| 32-bit atomics: | | |
-| __HIP_ARCH_HAS_GLOBAL_INT32_ATOMICS__ | hasGlobalInt32Atomics | 32-bit integer atomics for global memory |
-| __HIP_ARCH_HAS_GLOBAL_FLOAT_ATOMIC_EXCH__ | hasGlobalFloatAtomicExch | 32-bit float atomic exchange for global memory |
-| __HIP_ARCH_HAS_SHARED_INT32_ATOMICS__ | hasSharedInt32Atomics | 32-bit integer atomics for shared memory |
-| __HIP_ARCH_HAS_SHARED_FLOAT_ATOMIC_EXCH__ | hasSharedFloatAtomicExch | 32-bit float atomic exchange for shared memory |
-| __HIP_ARCH_HAS_FLOAT_ATOMIC_ADD__ | hasFloatAtomicAdd | 32-bit float atomic add in global and shared memory |
-| 64-bit atomics: | | |
-| __HIP_ARCH_HAS_GLOBAL_INT64_ATOMICS__ | hasGlobalInt64Atomics | 64-bit integer atomics for global memory |
-| __HIP_ARCH_HAS_SHARED_INT64_ATOMICS__ | hasSharedInt64Atomics | 64-bit integer atomics for shared memory |
-| Doubles: | | |
-| __HIP_ARCH_HAS_DOUBLES__ | hasDoubles | Double-precision floating point |
-| Warp cross-lane operations: | | |
-| __HIP_ARCH_HAS_WARP_VOTE__ | hasWarpVote | Warp vote instructions (any, all) |
-| __HIP_ARCH_HAS_WARP_BALLOT__ | hasWarpBallot | Warp ballot instructions |
-| __HIP_ARCH_HAS_WARP_SHUFFLE__ | hasWarpShuffle | Warp shuffle operations (shfl_*) |
-| __HIP_ARCH_HAS_WARP_FUNNEL_SHIFT__ | hasFunnelShift | Funnel shift two input words into one |
-| Sync: | | |
-| __HIP_ARCH_HAS_THREAD_FENCE_SYSTEM__ | hasThreadFenceSystem | threadfence_system |
-| __HIP_ARCH_HAS_SYNC_THREAD_EXT__ | hasSyncThreadsExt | syncthreads_count, syncthreads_and, syncthreads_or |
-| Miscellaneous: | | |
-| __HIP_ARCH_HAS_SURFACE_FUNCS__ | hasSurfaceFuncs | |
-| __HIP_ARCH_HAS_3DGRID__ | has3dGrid | Grids and groups are 3D |
-| __HIP_ARCH_HAS_DYNAMIC_PARALLEL__ | hasDynamicParallelism | |
-
-8.4 Finding HIP
-8.5 Identifying HIP Runtime
-
-
-8.6 hipLaunchKernelGGL
-8.7 Compiler Options
-8.7.1 Compiler options supported on AMD platforms
-
-| Option | Description |
-| --- | --- |
-| --amdgpu-target=<gpu_arch> | [DEPRECATED] This option is being replaced by --offload-arch=<target>. Generate code for the given GPU target. Supported targets are gfx701, gfx801, gfx802, gfx803, gfx900, gfx906, gfx908, gfx1010, gfx1011, gfx1012, gfx1030, gfx1031. This option could appear multiple times on the same command line to generate a fat binary for multiple targets. |
-| --fgpu-rdc | Generate relocatable device code, which allows kernels or device functions calling device functions in different translation units. |
-| -ggdb | Equivalent to -g plus tuning for GDB. This is recommended when using ROCm's GDB to debug GPU code. |
-| --gpu-max-threads-per-block=<num> | Generate code to support up to the specified number of threads per block. |
-| -O<n> | Specify the optimization level. |
-| -offload-arch=<target> | Specify the AMD GPU target ID. |
-| -save-temps | Save the compiler generated intermediate files. |
-| -v | Show the compilation steps. |
-
-8.8 Linking Issues
-8.8.1 Linking With hipcc
-8.8.2 -lm Option
-8.9 Linking Code With Other Compilers
-8.9.1 libc++ and libstdc++
-
-
-8.9.2 HIP Headers ( hip_runtime.h , hip_runtime_api.h )
-
-
-8.9.3 Using a Standard C++ Compiler
-
-
-
-> hipconfig --cxx_config
- -D__HIP_PLATFORM_AMD__ -I/home/user1/hip/include
-
-CPPFLAGS += $(shell $(HIP_PATH)/bin/hipconfig --cpp_config)
-8.9.3.1 cuda.h
-8.9.4 Choosing HIP File Extensions
-8.10 Workarounds
-8.10.1 warpSize
-8.10.2 Kernel launch with group size > 256
-
-8.11 memcpyToSymbol
-
- {
- A[i] = -1*i;
- B[i] = 0;
- }
-
- HIP_ASSERT(hipMalloc((void**)&Ad, SIZE));
-
- HIP_ASSERT(hipMemcpyToSymbol(HIP_SYMBOL(Value), A, SIZE, 0, hipMemcpyHostToDevice));
- hipLaunchKernelGGL(Get, dim3(1,1,1), dim3(LEN,1,1), 0, 0, Ad);
- HIP_ASSERT(hipMemcpy(B, Ad, SIZE, hipMemcpyDeviceToHost));
-
- for(unsigned i=0;i
-8.12 CU_POINTER_ATTRIBUTE_MEMORY_TYPE
-
- For example:
- double * ptr;
- hipMalloc(reinterpret_cast
- For example, on AMD platform, hipMemoryType is defined in hip_runtime_api.h,
-
- typedef enum hipMemoryType {
-   hipMemoryTypeHost = 0,    ///< Memory is physically located on host
-   hipMemoryTypeDevice = 1,  ///< Memory is physically located on device. (see deviceId for specific device)
-   hipMemoryTypeArray = 2,   ///< Array memory, physically located on device. (see deviceId for specific device)
-   hipMemoryTypeUnified = 3, ///< Not used currently
-   hipMemoryTypeManaged = 4  ///< Managed memory, automatically managed by the unified memory system
- } hipMemoryType;
-8.13 threadfence_system
-8.13.1 Textures and Cache Control
-
-
-8.14 More Tips
-8.14.1 HIP Logging
-
-8.14.2 Debugging hipcc
-
-export HIPCC_VERBOSE=1
-make
-
-...
-hipcc-cmd: /opt/rocm/bin/hipcc --offload-arch=native -x hip backprop_cuda.cu
-8.14.3 Editor Highlighting
-PORTING CUDA DRIVER API
-9.1 Introduction to the CUDA Driver and Runtime APIs
-
-
-9.1.1 cuModule API
-9.1.2 cuCtx API
-9.2 HIP Module and Ctx APIs
-9.2.1 hipModule API
-
-| Format | APIs | NVCC | HIP-CLANG |
-| --- | --- | --- | --- |
-| Code Object | hipModuleLoad, hipModuleLoadData | .cubin or PTX text | .hsaco |
-| Fat Binary | hipModuleLoadFatBin | .fatbin | .hip_fatbin |
-
-9.2.2 hipCtx API
-9.2.3 hipify translation of CUDA Driver API
-9.2.3.1 Address Spaces
-9.2.3.2 Using hipModuleLaunchKernel
-9.2.3.3 Additional Information
-
-
-9.2.4 hip-clang Implementation Notes
-9.2.4.1 .hip_fatbin
-9.2.4.2 Initialization and Termination Functions
-9.2.4.3 Kernel Launching
-9.2.5 NVCC Implementation Notes
-9.2.5.1 Interoperation between HIP and CUDA Driver
-
-| HIP Type | CU Driver Type | CUDA Runtime Type |
-| --- | --- | --- |
-| hipModule_t | CUmodule | |
-| hipFunction_t | CUfunction | |
-| hipCtx_t | CUcontext | |
-| hipDevice_t | CUdevice | |
-| hipStream_t | CUstream | cudaStream_t |
-| hipEvent_t | CUevent | cudaEvent_t |
-| hipArray | CUarray | cudaArray |
-
-9.2.5.2 Compilation Options
-
-#include
-#include
-HIP Documentation, Release 6.1.40092
-
-
-
-
-#define LEN 64
-#define SIZE (LEN << 2)
-
-#ifdef __HIP_PLATFORM_AMD__
-#define fileName "vcpy_isa.co"
-#endif
-
-#ifdef __HIP_PLATFORM_NVIDIA__
-#define fileName "vcpy_isa.ptx"
-#endif
-
-#define kernel_name "hello_world"
-
-int main(){
- float *A, *B;
- hipDeviceptr_t Ad, Bd;
- A = new float[LEN];
- B = new float[LEN];
-
- for(uint32_t i=0;i
- HIP_LAUNCH_PARAM_BUFFER_SIZE, &size,
- HIP_LAUNCH_PARAM_END
- };
-
- hipModuleLaunchKernel(Function, 1, 1, 1, LEN, 1, 1, 0, 0, NULL, (void**)&config);
-
- hipMemcpyDtoH(B, Bd, SIZE);
- for(uint32_t i=0;i
-9.3 HIP Module and Texture Driver API
-
-
-
-
- // Code to generate code object
-
-
-#include "hip/hip_runtime.h"
-
-extern texture
- }
-
- // Host code:
-
- texture
- hipTexRefSetFlags(texref, 0);
- hipTexRefSetFormat(texref, HIP_AD_FORMAT_FLOAT, 1);
- hipTexRefSetArray(texref, array, HIP_TRSA_OVERRIDE_FORMAT);
-
- //...
-}
-TEN
-PROGRAMMING FOR HIP RUNTIME COMPILER (RTC)
-
-
-10.1 Example
-
-hiprtcCreateProgram(&prog, // HIPRTC program
- kernel, // kernel string
- "gpu_kernel.cu", // Name of the file
- num_headers, // Number of headers
- &header_sources[0], // Header sources
- &header_names[0]); // Name of header files
-
-
-hiprtcCompileProgram(prog, // hiprtcProgram
- 0, // Number of options
- options); // Clang Options [Supported Clang Options](clang_options.md)
- size_t codeSize;
- hiprtcGetCodeSize(prog, &codeSize);
-
- vector
-hipModule_t module;
-hipFunction_t kernel;
-
-hipModuleLoadData(&module, kernel_binary.data());
-hipModuleGetFunction(&kernel, module, "vector_add");
-
-
-
-10.2 HIPRTC specific options
-
-
-10.2.1 Bitcode
-
- std::string sarg = std::string("-fgpu-rdc");
- const char* options[] = {
- sarg.c_str() };
- hiprtcCompileProgram(prog, // hiprtcProgram
- 1, // Number of options
- options);
-size_t bitCodeSize;
-hiprtcGetBitcodeSize(prog, &bitCodeSize);
-
-vector
-10.2.2 CU Mode vs WGP mode
-10.3 Linker APIs
-10.3.1 Example
-
-
-hiprtcLinkAddData(rtc_link_state, // HIPRTC link state
- input_type, // type of the input data or bitcode
- bit_code_ptr, // input data which is null terminated
- bit_code_size, // size of the input data
- "a", // optional name for this input
- 0, // size of the options
- 0, // Array of options applied to this input
- 0); // Array of option values cast to void*
-hiprtcLinkAddFile(rtc_link_state, // HIPRTC link state
- input_type, // type of the input data or bitcode
- bc_file_path.c_str(), // path to the input file where bitcode is present
- 0, // size of the options
- 0, // Array of options applied to this input
- 0); // Array of option values cast to void*
-hipModuleLoadData(&module, binary);
-10.3.1.1 Note
-
-
-hiprtcLinkDestroy(rtc_link_state);
-
-
-10.3.2 Input Types
-
-10.3.3 Backward Compatibility of LLVM Bitcode/IR
-10.3.4 Link Options
-
-
-
-const char* isaopts[] = {"-mllvm", "-inline-threshold=1", "-mllvm", "-inlinehint-threshold=1"};
-std::vector
- const void* lopts[] = {(void*)isaopts, (void*)(isaoptssize)};
- hiprtcLinkState linkstate;
- hiprtcLinkCreate(2, jit_options.data(), (void**)lopts, &linkstate);
-10.4 Error Handling
-
-hiprtcResult result;
-result = hiprtcCompileProgram(prog, 1, opts);
-if (result != HIPRTC_SUCCESS) {
-  std::cout << "hiprtcCompileProgram fails with error " << hiprtcGetErrorString(result);
-}
-10.5 HIPRTC General APIs
-10.6 Lowered Names (Mangled Names)
-10.6.1 Note
-
-
-10.6.2 Example
-
-
-
-
- static constexpr const char gpu_program[] {
-kernel_name_vec.push_back("&f1");
-kernel_name_vec.push_back("N1::N2::f2");
-kernel_name_vec.push_back("f3
-variable_name_vec.push_back("&N1::N2::V2");
-for (auto&& x : variable_name_vec) hiprtcAddNameExpression(prog, x.c_str());
-for (decltype(variable_name_vec.size()) i = 0; i != variable_name_vec.size(); ++i) {
-  const char* name;
-  hiprtcGetLoweredName(prog, variable_name_vec[i].c_str(), &name);
-}
-for (decltype(kernel_name_vec.size()) i = 0; i != kernel_name_vec.size(); ++i) {
-  const char* name;
-  hiprtcGetLoweredName(prog, kernel_name_vec[i].c_str(), &name);
-}
- hipDeviceptr_t variable_addr;
- size_t bytes{};
- hipModuleGetGlobal(&variable_addr, &bytes, module, name);
- hipMemcpyHtoD(variable_addr, &initial_value, sizeof(initial_value));
- hipFunction_t kernel;
- hipModuleGetFunction(&kernel, module, name);
- hipModuleLaunchKernel(kernel, 1, 1, 1, 1, 1, 1, 0, nullptr, nullptr, config);
-10.7 Versioning
-
-
-10.8 HIP header support
-
-
-10.9 Deprecation notice
-
-
-CHAPTER
-ELEVEN
-PERFORMANCE GUIDELINES
-
-
-11.1 Parallel execution
-11.1.1 Application level
-11.1.2 Device level
-11.1.3 Multiprocessor level
-11.2 Memory optimization
-11.2.1 Data Transfer
-11.2.2 Device Memory Access
-11.3 Optimization for maximum instruction throughput
-
-
-11.3.1 Arithmetic instructions
-11.3.2 Control flow instructions
-11.3.3 Synchronization
-11.4 Minimizing memory thrashing
-
-
-TWELVE
-DEBUGGING WITH HIP
-12.1 Tracing
-
-
-
-
-Here's another example that uses ltrace to trace hsa APIs and output:
-$ ltrace -C -e "hsa*" ./hipGetChanDesc
-libamdhip64.so.4->hsa_init(0, 0x7fff325a69d0, 0x9c80e0, 0
-12.2 Debugging
-
-
-
-
-
-
-
-
- 12.2.1 Debugging HIP applications
-
-
-
-
-
-12.3 Useful environment variables
-12.3.1 Kernel enqueue serialization
-AMD_SERIALIZE_KERNEL, for serializing kernel enqueue
-AMD_SERIALIZE_COPY, for serializing copies
-12.3.2 Making device visible
-
-$ HIP_VISIBLE_DEVICES=0,1
-
-if (totalDeviceNum > 2) {
-  setenv("HIP_VISIBLE_DEVICES", "0,1,2", 1);
-  assert(getDeviceNumber(false) == 3);
-
-  .......
-}
-12.3.3 Dump code object
-12.3.4 HSA-related environment variables (Linux)
-
-
-12.3.5 HIP environment variable summary
-
-| Environment variable | Default value | Usage |
-| --- | --- | --- |
-| AMD_LOG_LEVEL: Enable HIP log on different Level | 0 | 0: Disable log. 1: Enable log on error level. 2: Enable log on warning and below levels. 0x3: Enable log on information and below levels. 0x4: Decode and display AQL packets. |
-| AMD_LOG_MASK: Enable HIP log on different Level | 0x7FFFFFFF | 0x1: Log API calls. 0x02: Kernel and Copy Commands and Barriers. 0x4: Synchronization and waiting for commands to finish. 0x8: Enable log on information and below levels. 0x20: Queue commands and queue contents. 0x40: Signal creation, allocation, pool. 0x80: Locks and thread-safety code. 0x100: Copy debug. 0x200: Detailed copy debug. 0x400: Resource allocation, performance-impacting events. 0x800: Initialization and shutdown. 0x1000: Misc debug, not yet classified. 0x2000: Show raw bytes of AQL packet. 0x4000: Show code creation debug. 0x8000: More detailed command info, including barrier commands. 0x10000: Log message location. 0xFFFFFFFF: Log always even mask flag is zero. |
-| HIP_LAUNCH_BLOCKING: Used for serialization on kernel execution. | 0 | 0: Disable. Kernel executes normally. 1: Enable. Serializes kernel enqueue, behaves the same as AMD_SERIALIZE_KERNEL. |
-| HIP_VISIBLE_DEVICES (or CUDA_VISIBLE_DEVICES): Only devices whose index is present in the sequence are visible to HIP | | 0,1,2: Depending on the number of devices on the system |
-| GPU_DUMP_CODE_OBJECT: Dump code object | 0 | 0: Disable. 1: Enable. |
-| AMD_SERIALIZE_KERNEL: Serialize kernel enqueue | 0 | 1: Wait for completion before enqueue. 2: Wait for completion after enqueue. 3: Both. |
-| AMD_SERIALIZE_COPY: Serialize copies | 0 | 1: Wait for completion before enqueue. 2: Wait for completion after enqueue. 3: Both. |
-| HIP_HOST_COHERENT: Coherent memory in hipHostMalloc | 0 | 0: memory is not coherent between host and GPU. 1: memory is coherent with host. |
-| AMD_DIRECT_DISPATCH: Enable direct kernel dispatch (Currently for Linux; under development for Windows) | 1 | 0: Disable. 1: Enable. |
-| GPU_MAX_HW_QUEUES: The maximum number of hardware queues allocated per device | 4 | The variable controls how many independent hardware queues HIP runtime can create per process, per device. If an application allocates more HIP streams than this number, then HIP runtime reuses the same hardware queues for the new streams in a round-robin manner. Note that this maximum number does not apply to hardware queues that are created for CU-masked HIP streams, or cooperative queues for HIP Cooperative Groups (single queue per device). |
-
-12.4 General debugging tips
-
-
-
-(gdb) set env AMD_SERIALIZE_KERNEL 3
-
-CHAPTER
-THIRTEEN
-LOGGING HIP ACTIVITY
-
-user@user-test:~/hip/bin$ ./hipinfo > ~/hip_log.txt
-13.1 Logging level
-
-
- enum LogLevel {
- LOG_NONE = 0,
- LOG_ERROR = 1,
- LOG_WARNING = 2,
- LOG_INFO = 3,
- LOG_DEBUG = 4
- };
-13.2 Logging mask
-
- The logging mask is designed to print functionality types when you're running a HIP application. Once you set
- AMD_LOG_LEVEL, the logging mask is set as the default value (0x7FFFFFFF). You can change this to any of the valid
- values:
-
- enum LogMask {
- LOG_API = 0x000000001, //!< API call
- LOG_CMD = 0x000000002, //!< Kernel and Copy Commands and Barriers
- LOG_WAIT = 0x000000004, //!< Synchronization and waiting for commands to finish
- LOG_AQL = 0x000000008, //!< Decode and display AQL packets
- LOG_QUEUE = 0x00000010, //!< Queue commands and queue contents
- LOG_SIG = 0x00000020, //!< Signal creation, allocation, pool
- LOG_LOCK = 0x00000040, //!< Locks and thread-safety code.
- LOG_KERN = 0x00000080, //!< kernel creations and arguments, etc.
- LOG_COPY = 0x000000100, //!< Copy debug
- LOG_COPY2 = 0x000000200, //!< Detailed copy debug
- LOG_RESOURCE = 0x000000400, //!< Resource allocation, performance-impacting events.
- LOG_INIT = 0x00000800, //!< Initialization and shutdown
- LOG_MISC = 0x00001000, //!< misc debug, not yet classified
- LOG_AQL2 = 0x00002000, //!< Show raw bytes of AQL packet
- LOG_CODE = 0x00004000, //!< Show code creation debug
- LOG_CMD2 = 0x00008000, //!< More detailed command info, including barrier commands
- LOG_LOCATION = 0x00010000, //!< Log message location
- LOG_MEM = 0x00020000, //!< Memory allocation
- LOG_MEM_POOL = 0x00040000, //!< Memory pool allocation, including memory in graphs
- LOG_ALWAYS = 0xFFFFFFFF, //!< Log always even mask flag is zero
- };
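
Because each LogMask value above is a single bit, several categories can be combined with a bitwise OR before being passed through AMD_LOG_MASK. The following is a minimal host-only sketch of that arithmetic; the constant names and both helper functions are hypothetical and are not part of the HIP runtime, and only the three bit values are taken from the listing.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Bit values copied from the LogMask listing above.
constexpr std::uint32_t kLogApi  = 0x00000001;  // API calls
constexpr std::uint32_t kLogCmd  = 0x00000002;  // kernel and copy commands, barriers
constexpr std::uint32_t kLogKern = 0x00000080;  // kernel creation and arguments

// Hypothetical helper: format a combined mask as the hex string that
// could be assigned to the AMD_LOG_MASK environment variable.
std::string format_log_mask(std::uint32_t bits) {
    char buf[16];
    std::snprintf(buf, sizeof(buf), "0x%X", static_cast<unsigned int>(bits));
    return std::string(buf);
}

// Hypothetical helper: true when the given category bit is set in the mask.
constexpr bool mask_enables(std::uint32_t mask, std::uint32_t category) {
    return (mask & category) != 0;
}
```

For instance, `format_log_mask(kLogApi | kLogCmd | kLogKern)` yields `0x83`, which could then be exported as `AMD_LOG_MASK=0x83` alongside a non-zero AMD_LOG_LEVEL.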
-
- You can also define the logging mask via the AMD_LOG_MASK environment variable.
-13.3 Logging command
-
-
-ClPrint(amd::LOG_INFO, amd::LOG_INIT, "Initializing HSA stack.");
-13.4 Logging examples
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-HIP Documentation, Release 6.1.40092
-
-
-
-
-
-
-
-
- --copyBuffer
-...
-:4:C:\constructicon\builds\gfx\two\22.40\drivers\compute\vdi\device\pal\palmemory.cpp:206 : 605414523422 us: 29864: [tid:0x9298] Alloc: 100000 bytes, ptr[0000003008D0000-0000003009D0000], obj[0000003007D0000-0000003047D0000]
-:4:C:\constructicon\builds\gfx\two\22.40\drivers\compute\vdi\device\pal\palmemory.cpp:206 : 605414523767 us: 29864: [tid:0x9298] Alloc: 100000 bytes, ptr[0000003009D0000-000000300AD0000], obj[0000003007D0000-0000003047D0000]
-:3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_memory.cpp:681 : 605414524092 us: 29864: [tid:0x9298] hipMemGetInfo: Returned hipSuccess :
-memInfo.total: 12.06 GB
-memInfo.free: 11.93 GB (99%)
-CHAPTER
-FOURTEEN
-COOPERATIVE GROUPS
-
-
-14.1 Cooperative groups thread model
-14.2 Group types
-14.2.1 Thread-block group
-
- class thread_block;
-
- Constructed via:
-
- thread_block g = this_thread_block();
-14.2.2 Grid group
-
-class grid_group;
-
- Constructed via:
-
-grid_group g = this_grid();
-14.2.3 Multi-grid group
-
-class multi_grid_group;
-14.2.4 Thread-block tile
-
-Note:
-
-
-14.2.5 Coalesced groups
-
-class coalesced_group;
-coalesced_group active = coalesced_threads();
-14.3 Cooperative groups simple example
-Original Block
-
-Cooperative groups
-
- // Thread ID
- for(unsigned int i = g.size() / 2; i > 0; i /= 2) {
- // Store value in shared memory with thread ID
- shared[group_thread_id] = val;
-
- // Synchronize all threads in the group
- g.sync();
-
- // Active thread sum up
- if(group_thread_id < i)
- val += shared[group_thread_id + i];
-
- // Synchronize all threads in the group
- g.sync();
- }
-
- //...
-}
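
The loop above halves the stride on every iteration, with each active thread adding the value its partner stored in shared memory. The following host-only sketch emulates that behavior sequentially so the arithmetic can be checked without a GPU. The function name `emulate_reduce_sum` is hypothetical, and the sketch assumes a power-of-two group size, which the loop above also implicitly relies on.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of the group reduction loop: "thread" t keeps a private
// value in val[t]; shared[] stands in for the shared-memory workspace.
// Assumption: the group size (input.size()) is a power of two.
unsigned int emulate_reduce_sum(const std::vector<unsigned int>& input) {
    const std::size_t n = input.size();
    assert(n > 0 && (n & (n - 1)) == 0 && "group size must be a power of two");

    std::vector<unsigned int> val(input);  // per-thread running value
    std::vector<unsigned int> shared(n);   // emulated shared memory

    for (std::size_t i = n / 2; i > 0; i /= 2) {
        shared = val;                          // every thread stores, then g.sync()
        for (std::size_t t = 0; t < i; ++t) {  // only threads with id < i accumulate
            val[t] += shared[t + i];
        }
        // the second g.sync() of the device loop would go here
    }
    return val[0];                             // thread 0 ends up with the total
}
```

Calling `emulate_reduce_sum({1, 2, 3, 4, 5, 6, 7, 8})` walks through strides 4, 2, and 1 and leaves the full sum in element 0, mirroring what thread 0 of the group holds after the device loop.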
-
-The reduce_sum() function call and the input data initialization differ from the original.
-
-Original Block
-
-__global__ void sum_kernel(...) {
-
- //...
-
- // Workspace array in shared memory
- __shared__ unsigned int workspace[2048];
-
- //...
-
- // Perform reduction
- output = reduce_sum(workspace, input);
-
- //...
-}
-Cooperative groups
-
- thread_block thread_block_group = this_thread_block();
- // Perform reduction
- output = reduce_sum(thread_block_group, workspace, input);
-
- //...
-}
-14.4 Synchronization
-Check the kernel launch capability
-Thread-block
-Grid
-
- Confirm the cooperative launch capability on the single AMD GPU:
-
- int device = 0;
- int supports_coop_launch = 0;
- // Check support
- // Use hipDeviceAttributeCooperativeMultiDeviceLaunch when launching across multiple devices
- HIP_CHECK(hipGetDevice(&device));
- HIP_CHECK(
- hipDeviceGetAttribute(&supports_coop_launch, hipDeviceAttributeCooperativeLaunch, device));
- if(!supports_coop_launch)
- {
- std::cout << "Skipping, device " << device << " does not support cooperative groups"
- << std::endl;
- return 0;
- }
-
-Multi-grid
-
- Confirm the cooperative launch capability over multiple GPUs:
-
- // Check support of cooperative groups
- std::vector
-Kernel launch
-Thread-block
-
-
-
-
- // Launching kernel from host.
-Grid
-
-Multi-grid
-
- Launch the cooperative kernel over multiple GPUs:
-
- hipLaunchParams *launchParamsList = (hipLaunchParams*)malloc(sizeof(hipLaunchParams) * deviceIDs.size());
- for(int deviceID : deviceIDs) {
-
- // Set device
- HIP_CHECK(hipSetDevice(deviceID));
-
- // Create stream
- hipStream_t stream;
- HIP_CHECK(hipStreamCreate(&stream));
-
- // Parameters
- void* params[] = {&(d_vector[deviceID]), &(d_block_reduced[deviceID]), &(d_partition_reduced[deviceID])};
-
- // Set launchParams
- launchParamsList[deviceID].func = (void*)vector_reduce_kernel
-Thread-block
-
-Grid
-
-grid_group grid = this_grid();
-Multi-grid
-
-multi_grid_group multi_grid = this_multi_grid();
-multi_grid.sync();
-14.5 Unsupported NVIDIA CUDA features
-
-
-
-
-
-
-CHAPTER
-FIFTEEN
-UNIFIED MEMORY
-15.1 Unified memory
-15.2 System requirements
-
-Architecture hipMallocManaged() __managed__ malloc() MI200, MI300 Series 1 MI100 RDNA (Navi) Series GCN5 (Vega) Series : Supported
-
-
-15.3 Unified memory programming models
-
-
-
-
-
-
-15.3.1 Checking unified memory management support
-
-| Attribute | Description |
-|---|---|
-| hipDeviceAttributeManagedMemory | unified addressing is supported |
-| hipDeviceAttributeConcurrentManagedAccess | full managed memory support, concurrent access is supported |
-| hipDeviceAttributePageableMemoryAccess | both managed and system memory allocation API is supported |
-
-
-#include
-15.3.2 Example for unified memory management
-hipMallocManaged()
-
-__managed__
-
-#include
-
-malloc()
-
-#include
-
-
-
- // Launch add() kernel on GPU.
- hipLaunchKernelGGL(add, dim3(1), dim3(1), 0, 0, a, b, c);
-
- // Wait for GPU to finish before accessing on host.
- hipDeviceSynchronize();
-
- // Prints the result.
- std::cout << *a << " + " << *b << " = " << *c << std::endl;
-
- // Cleanup allocated memory.
- free(a);
- free(b);
- free(c);
-
- return 0;
- }
- tree
-
- // Cleanup allocated memory.
- hipFree(d_a);
- hipFree(d_b);
- hipFree(d_c);
-
- // Prints the result.
- std::cout << a << " + " << b << " = " << c << std::endl;
-
- return 0;
-}
-15.4 Using unified memory management (UMM)
-
-
-
-
-
-
-15.5 Unified memory HIP runtime hints for the better performance
-
-
-15.5.1 Data prefetching
-
-15.5.2 Memory advice
-
-
- The effectiveness of hipMemAdvise() comes from its ability to inform the runtime system of the developer's intentions
- regarding memory usage. When the runtime system has knowledge of the expected memory access patterns, it can make
- better decisions about data placement and caching, leading to more efficient execution of the application. However, the
- actual impact on performance can vary based on the specific use case and the hardware architecture.
- For the description of hipMemAdvise() and the detailed list of advice, visit the HIP managed memory allocation API.
- Here is the updated version of the example above with memory advice.
-
- #include
- hipFree(b);
- hipFree(c);
-
- return 0;
-}
-15.5.3 Memory range attributes
-
-
- Memory Range attributes allow you to query attributes of a given memory range.
- hipMemRangeGetAttribute() is added to the example to query the hipMemRangeAttributeReadMostly attribute
- of the memory range pointed to by a. The result is stored in attributeValue and then printed out.
- For more details, visit the HIP managed memory allocation API.
- #include
- std::cout << "The queried attribute value is: " << attributeValue << std::endl;
-
- // Cleanup allocated memory.
- hipFree(a);
- hipFree(b);
- hipFree(c);
-
- return 0;
-}
-15.5.4 Asynchronously attach memory to a stream
-SIXTEEN
-VIRTUAL MEMORY MANAGEMENT
-16.1 Memory allocation
-16.1.1 Allocate physical memory
-
-16.1.2 Reserve virtual address range
-
-16.1.3 Set memory access
-
-hipMemAccessDesc accessDesc = {};
-accessDesc.location.type = HIP_MEM_LOCATION_TYPE_DEVICE;
-accessDesc.location.id = currentDev;
-accessDesc.flags = HIP_MEM_ACCESS_FLAGS_PROT_READWRITE;
-hipMemSetAccess(ptr, padded_size, &accessDesc, 1);
-16.1.4 Free virtual memory
-
-hipMemUnmap(ptr, size);
-hipMemRelease(allocHandle);
-hipMemAddressFree(ptr, size);
-16.2 Memory usage
-16.2.1 Dynamically increase allocation size
-
- hipMemAddressReserve(&new_ptr, (new_size - padded_size), 0, ptr + padded_size, 0);
- hipMemMap(new_ptr, (new_size - padded_size), 0, newAllocHandle, 0);
- hipMemSetAccess(new_ptr, (new_size - padded_size), &accessDesc, 1);
-CHAPTER
-SEVENTEEN
-FREQUENTLY ASKED QUESTIONS
-17.1 What APIs and features does HIP support?
-
-
-17.2 What is not supported?
-17.2.1 Runtime/Driver API features
-
-
-17.2.2 Kernel language features
-
-
-17.3 Is HIP a drop-in replacement for CUDA?
-17.4 What specific version of CUDA does HIP support?
-
-
-
-
-17.5 What libraries does HIP support?
-
-
-17.6 How does HIP compare with OpenCL?
-
-
-17.7 How does porting CUDA to HIP compare to porting CUDA to OpenCL?
-17.8 What hardware does HIP support?
-
-
-17.9 Do HIPIFY tools automatically convert all source code?
-17.10 What is NVCC?
-17.11 What is HIP-Clang?
-17.12 Why use HIP rather than supporting CUDA directly?
-17.13 Can I develop HIP code on an NVIDIA CUDA platform?
-17.14 Can I develop HIP code on an AMD HIP-Clang platform?
-17.15 How to use HIP-Clang to build HIP programs?
-
-
-
-
-17.16 What is AMD clr?
-
-
-17.17 What is hipother?
-17.18 Can I get HIP open source repository for Windows?
-17.19 Can a HIP binary run on both AMD and NVIDIA platforms?
-17.20 On HIP-Clang, can I link HIP code with host code compiled with another compiler such as gcc, icc, or clang?
-17.21 Can HIP API support C style application? What is the difference between C and C++?
-
- //the file name `test.hip.cpp`
-
-
-#include "hip/hip_runtime_api.h"
- //this file name `test.hip.cpp`
-
- int main(int argc, char** argv) {
- dim3 grid1;
- printf("dim3 grid1; x=%d, y=%d, z=%d\n",grid1.x,grid1.y,grid1.z);
- dim3 grid2 = {1,1,1};
- printf("dim3 grid2 = {1,1,1}; x=%d, y=%d, z=%d\n",grid2.x,grid2.y,grid2.z);
- return 0;
- }
-$ gcc -x c++ $(hipconfig --cpp_config) test3.hip.cpp -o test
-$./test
-dim3 grid1; x=1, y=1, z=1
-dim3 grid2 = {1,1,1}; x=1, y=1, z=1
-dim3 grid = {1,1,1}; // initialized as in C++
-17.22 Can I install both CUDA SDK and HIP-Clang on the same machine?
-17.23 HIP detected my platform (HIP-Clang vs NVCC) incorrectly - what should I do?
-
-export HIP_PLATFORM=amd
-HIP_COMPILER=cuda
-HIP_RUNTIME=nvcc
-17.24 On CUDA, can I mix CUDA code with HIP code?
-17.25 How do I trace HIP application flow?
-17.26 What are the maximum limits of kernel launch parameters?
-17.27 Are __shfl_*_sync functions supported on HIP platform?
-17.28 How to create a guard for code that is specific to the host or the GPU?
-17.29 Why _OpenMP is undefined when compiling with -fopenmp ?
-17.30 Does the HIP-Clang compiler support extern shared declarations?
-17.31 I have multiple HIP enabled devices and I am getting an error code hipErrorSharedObjectInitFailed with the message 'Error: shared object initialization failed'?
-
-
-17.32 How to use per-thread default stream in HIP?
-17.33 How to use complex multiplication and division operations?
-
-<_C_>
-
-17.34 Can I develop applications with HIP APIs on Windows the same as on Linux?
-17.35 Does HIP support LUID?
-17.36 How can I know the version of HIP?
-
-CHAPTER
-EIGHTEEN
-HIP RUNTIME API REFERENCE
-18.1 Related Pages
-18.3 Namespaces
-18.4 Data Structures
-CHAPTER
-NINETEEN
-C++ LANGUAGE EXTENSIONS
-
-
-
-
-19.1 Function-type qualifiers
-19.1.1 __device__
-
-
-19.1.2 __global__
-
-
-19.1.3 __host__
-
-
-19.2 Calling __global__ functions
-
-
-
- // Example hipLaunchKernelGGL pseudocode:
-
-ize_t N)
-
-
-}
-
-MyKernel<<
-19.3 Kernel launch example
-
-
-// Example showing device function
-__device__ __host__ // <- compile for both device and host
-float PlusOne(float x)
-{
- return x + 1.0;
-}
-
-__global__
-void
-MyKernel (hipLaunchParm lp, /*lp parm for execution configuration */
- const float *a, const float *b, float *c, unsigned N)
-{
- unsigned gid = blockIdx.x * blockDim.x + threadIdx.x; // <- coordinate index functions (global index, so all N elements are covered)
- if (gid < N) {
- c[gid] = a[gid] + PlusOne(b[gid]);
- }
-}
-void callMyKernel()
-{
- float *a, *b, *c; // initialization not shown...
- unsigned N = 1000000;
- const unsigned blockSize = 256;
-
- MyKernel<<
-19.4 Variable type qualifiers
-19.4.1 __constant__
-
-
-19.4.2 __shared__
-19.4.3 __managed__
-19.4.4 __restrict__
-19.5 Built-in variables
-19.5.1 Coordinate built-ins
-19.5.2 warpSize
-19.6 Vector types
-19.6.1 Short vector types
-
-
-19.6.2 dim3
-
-19.7 Memory fence instructions
-
-
-19.8 Synchronization functions
-19.9 Math functions
-19.10 Texture functions
-19.11 Surface functions
-19.12 Timer functions
-
-
-
-clock_t clock()
-long long int clock64()
-
-long long int wall_clock64()
-
- int wallClkRate = 0; //in kilohertz
- HIPCHECK(hipDeviceGetAttribute(&wallClkRate, hipDeviceAttributeWallClockRate, deviceId));
-19.13 Atomic functions
-
-
-
-Function
-int atomicAdd(int* address, int val)
-int atomicAdd_system(int* address, int val)
-unsigned int atomicAdd(unsigned int* address, unsigned int val)
-unsigned int atomicAdd_system(unsigned int* address, unsigned int val)
-unsigned long long atomicAdd(unsigned long long* address, unsigned long long val)
-unsigned long long atomicAdd_system(unsigned long long* address, unsigned long long val)
-float atomicAdd(float* address, float val)
-float atomicAdd_system(float* address, float val)
-double atomicAdd(double* address, double val)
-double atomicAdd_system(double* address, double val)
-float unsafeAtomicAdd(float* address, float val)
-float safeAtomicAdd(float* address, float val)
-double unsafeAtomicAdd(double* address, double val)
-double safeAtomicAdd(double* address, double val)
-int atomicSub(int* address, int val)
-int atomicSub_system(int* address, int val)
-unsigned int atomicSub(unsigned int* address, unsigned int val)
-unsigned int atomicSub_system(unsigned int* address, unsigned int val)
-int atomicExch(int* address, int val)
-int atomicExch_system(int* address, int val)
-unsigned int atomicExch(unsigned int* address, unsigned int val)
-unsigned int atomicExch_system(unsigned int* address, unsigned int val)
-unsigned long long atomicExch(unsigned long long int* address, unsigned long long val)
-unsigned long long atomicExch_system(unsigned long long* address, unsigned long long val)
-float atomicExch(float* address, float val)
-int atomicMin(int* address, int val)
-int atomicMin_system(int* address, int val)
-unsigned int atomicMin(unsigned int* address, unsigned int val)
-unsigned int atomicMin_system(unsigned int* address, unsigned int val)
-unsigned long long atomicMin(unsigned long long* address, unsigned long long val)
-int atomicMax(int* address, int val)
-int atomicMax_system(int* address, int val)
-unsigned int atomicMax(unsigned int* address, unsigned int val)
-unsigned int atomicMax_system(unsigned int* address, unsigned int val)
-unsigned long long atomicMax(unsigned long long* address, unsigned long long val)
-unsigned int atomicDec(unsigned int* address)
-int atomicCAS(int* address, int compare, int val)
-int atomicCAS_system(int* address, int compare, int val)
-unsigned int atomicCAS(unsigned int* address, unsigned int compare, unsigned int val)
-unsigned int atomicCAS_system(unsigned int* address, unsigned int compare, unsigned int val)
-unsigned long long atomicCAS(unsigned long long* address, unsigned long long compare, unsigned long long val)
-unsigned long long atomicCAS_system(unsigned long long* address, unsigned long long compare, unsigned long long val)
-int atomicAnd(int* address, int val)
-int atomicAnd_system(int* address, int val)
-unsigned int atomicAnd(unsigned int* address, unsigned int val)
-unsigned int atomicAnd_system(unsigned int* address, unsigned int val)
-unsigned long long atomicAnd(unsigned long long* address, unsigned long long val)
-unsigned long long atomicAnd_system(unsigned long long* address, unsigned long long val)
-int atomicOr(int* address, int val)
-int atomicOr_system(int* address, int val)
-unsigned int atomicOr(unsigned int* address, unsigned int val)
-unsigned int atomicOr_system(unsigned int* address, unsigned int val)
-unsigned long long atomicOr(unsigned long long int* address, unsigned long long val)
-unsigned long long atomicOr_system(unsigned long long* address, unsigned long long val)
-int atomicXor(int* address, int val)
-int atomicXor_system(int* address, int val)
-unsigned int atomicXor(unsigned int* address, unsigned int val)
-unsigned int atomicXor_system(unsigned int* address, unsigned int val)
-unsigned long long atomicXor(unsigned long long* address, unsigned long long val)
-unsigned long long atomicXor_system(unsigned long long* address, unsigned long long val)
-19.13.1 Unsafe floating-point atomic RMW operations
-
-
-19.14 Warp cross-lane functions
-
- cudaDeviceProp props;
- cudaGetDeviceProperties(&props, deviceID);
- int w = props.warpSize;
- // implement portable algorithm based on w (rather than assume 32 or 64)
-19.14.1 Warp vote and ballot functions
-
-int __all(int predicate)
-int __any(int predicate)
-unsigned long long __ballot(int predicate)
-unsigned long long __activemask()
-
-int __all_sync(unsigned long long mask, int predicate)
-
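
The ballot-style functions listed above return `unsigned long long` rather than a 32-bit value, which leaves room for one bit per lane when the wavefront is wider than 32 lanes. The host-only sketch below models that encoding under the assumption of at most 64 lanes; `emulate_ballot` and `count_lanes` are hypothetical names used only for illustration and are not HIP functions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side model of a ballot: bit k of the result is set when the predicate
// of lane k is non-zero. Lanes beyond 64 are ignored in this sketch.
std::uint64_t emulate_ballot(const std::vector<int>& predicates) {
    std::uint64_t mask = 0;
    for (std::size_t lane = 0; lane < predicates.size() && lane < 64; ++lane) {
        if (predicates[lane] != 0) {
            mask |= (std::uint64_t{1} << lane);  // 64-bit shift, safe for lane >= 32
        }
    }
    return mask;
}

// Count how many lanes voted true by clearing the lowest set bit repeatedly.
int count_lanes(std::uint64_t mask) {
    int count = 0;
    while (mask != 0) {
        mask &= (mask - 1);
        ++count;
    }
    return count;
}
```

A vote from lane 40 alone produces the mask `1ULL << 40`, which would not fit in a 32-bit integer; this is one reason portable code should not truncate the ballot result to `unsigned int`.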
-19.14.2 Warp match functions
-
- unsigned long long __match_any(T value)
- unsigned long long __match_all(T value, int *pred)
-
- unsigned long long __match_any_sync(unsigned long long mask, T value)
- unsigned long long __match_all_sync(unsigned long long mask, T value, int *pred)
-19.14.3 Warp shuffle functions
-
- The default width is warpSize (see Warp cross-lane functions). Half-float shuffles are not supported.
-
-
-int __shfl (T var, int srcLane, int width=warpSize);
-19.15 Cooperative groups functions
-
-| Function | Supported in HIP | Supported in CUDA |
-|---|---|---|
-| void thread_group.sync(); | ✓ | ✓ |
-| unsigned thread_group.size(); | ✓ | ✓ |
-| unsigned thread_group.thread_rank() | ✓ | ✓ |
-| bool thread_group.is_valid(); | ✓ | ✓ |
-| grid_group this_grid() | ✓ | ✓ |
-| void grid_group.sync() | ✓ | ✓ |
-| unsigned grid_group.size() | ✓ | ✓ |
-| unsigned grid_group.thread_rank() | ✓ | ✓ |
-| bool grid_group.is_valid() | ✓ | ✓ |
-| multi_grid_group this_multi_grid() | ✓ | ✓ |
-| void multi_grid_group.sync() | ✓ | ✓ |
-| unsigned multi_grid_group.size() | ✓ | ✓ |
-| unsigned multi_grid_group.thread_rank() | ✓ | ✓ |
-| bool multi_grid_group.is_valid() | ✓ | ✓ |
-| unsigned multi_grid_group.num_grids() | ✓ | ✓ |
-| unsigned multi_grid_group.grid_rank() | ✓ | ✓ |
-| thread_block this_thread_block() | ✓ | ✓ |
-| void thread_block.sync() | ✓ | ✓ |
-| unsigned thread_block.size() | ✓ | ✓ |
-| unsigned thread_block.thread_rank() | ✓ | ✓ |
-| bool thread_block.is_valid() | ✓ | ✓ |
-| dim3 thread_block.group_index() | ✓ | ✓ |
-| dim3 thread_block.thread_index() | ✓ | ✓ |
-19.16 Warp matrix functions
-
-| Function | Supported in HIP | Supported in CUDA |
-|---|---|---|
-| void load_matrix_sync(fragment<...> &a, const T* mptr, unsigned lda) | | ✓ |
-| void load_matrix_sync(fragment<...> &a, const T* mptr, unsigned lda, layout_t layout) | | ✓ |
-| void store_matrix_sync(T* mptr, fragment<...> &a, unsigned lda, layout_t layout) | | ✓ |
-| void fill_fragment(fragment<...> &a, const T &value) | | ✓ |
-| void mma_sync(fragment<...> &d, const fragment<...> &a, const fragment<...> &b, const fragment<...> &c, bool sat) | | ✓ |
-19.17 Independent thread scheduling
-19.18 Profiler Counter Function
-19.19 Assert
-
-void assert(int input)
-19.20 printf
-
-
-#include
-19.21 Device-Side Dynamic Global Memory Allocation
-19.22 __launch_bounds__
-
-19.22.1 Compiler Impact
-19.22.2 CU and EU Definitions
-19.22.3 Porting from CUDA __launch_bounds
-
-
-
-MIN_WARPS_PER_EXECUTION_UNIT = (MIN_BLOCKS_PER_MULTIPROCESSOR * MAX_THREADS_PER_BLOCK) / 32
-19.22.4 maxregcount
-19.23 Asynchronous Functions
-19.23.1 Memory stream
-hipError_t hipStreamQuery ( hipStream_t stream )
-hipError_t hipStreamSynchronize ( hipStream_t stream )
-19.23.2 Peer to peer
-hipError_t hipDeviceDisablePeerAccess ( int peerDeviceId )
-USE_PEER_NON_UNIFIED
-19.23.3 Memory management
-Warning: This API is deprecated, use hipHostMalloc() instead.
-hipError_t hipMemAllocHost ( void **ptr, size_t size )
-hipError_t hipFree ( void *ptr )
-hipError_t hipFreeHost ( void *ptr )
-Warning: This API is deprecated, use hipHostFree() instead.
-hipError_t hipHostFree ( void *ptr )
-19.23.4 External Resource Interoperability
-hipError_t hipDestroyExternalSemaphore ( hipExternalSemaphore_t extSem )
-hipError_t hipDestroyExternalMemory ( hipExternalMemory_t extMem )
-19.24 Register Keyword
-19.25 Pragma Unroll
-
-
-
-
-#pragma unroll 16 /* hint to compiler to unroll next loop by 16 */
-19.26 In-Line Assembly
-19.27 Kernel Compilation
-
-hipcc --genco --offload-arch=[TARGET GPU] [INPUT FILE] -o [OUTPUT FILE]
-
-[TARGET GPU] = GPU architecture
-[INPUT FILE] = Name of the file containing kernels
-[OUTPUT FILE] = Name of the generated code object file
-19.28 gfx-arch-specific-kernel
-CHAPTER
-TWENTY
-C++ LANGUAGE SUPPORT
-20.1 Modern C++ support
-20.1.1 C++11 support
-20.1.2 C++14 support
-20.1.3 C++17 support
-20.1.4 C++20 support
-20.2 Extensions and restrictions
-20.2.1 Global functions
-
-
-20.2.2 Device space memory specifiers
-20.2.3 Exception handling
-20.2.4 Kernel parameters
-20.2.5 Classes
-20.2.6 Polymorphic function wrappers
-20.2.7 Extended lambdas
-20.2.8 Inline namespaces
-
-
-CHAPTER
-TWENTYONE
-HIP MATH API
-21.1 Single precision mathematical functions
-
-Function Supported on Host Supported on Device float abs(float x) Returns the absolute value of 𝑥 ✓ ✓ float acosf(float x) Returns the arc cosine of 𝑥 . ✓ ✓ float acoshf(float x) Returns the nonnegative arc hyperbolic cosine of 𝑥 . ✓ ✓ float asinf(float x) Returns the arc sine of 𝑥 . ✓ ✓ float asinhf(float x) Returns the arc hyperbolic sine of 𝑥 . ✓ ✓ float atanf(float x) Returns the arc tangent of 𝑥 . ✓ ✓
-Table 1 - continued from previous page float atan2f(float x, float y) Returns the arc tangent of the ratio of 𝑥 and 𝑦 . ✓ ✓ float atanhf(float x) Returns the arc hyperbolic tangent of 𝑥 . ✓ ✓ float cbrtf(float x) Returns the cube root of 𝑥 . ✓ ✓ float ceilf(float x) Returns ceiling of 𝑥 . ✓ ✓ float copysignf(float x, float y) Create value with given magnitude, copying sign of second value. ✓ ✓ float cosf(float x) Returns the cosine of 𝑥 . ✓ ✓ float coshf(float x) Returns the hyperbolic cosine of 𝑥 . ✓ ✓ float cospif(float x) Returns the cosine of 𝜋 · 𝑥 . ✓ ✓ float cyl_bessel_i0f(float x) Returns the value of the regular modified cylindrical Bessel function of order 0 for 𝑥 .
-float cyl_bessel_i1f(float x) Returns the value of the regular modified cylindrical Bessel function of order 1 for 𝑥 . float erff(float x) Returns the error function of 𝑥 . ✓ ✓ float erfcf(float x) Returns the complementary error function of 𝑥 . ✓ ✓ float erfcinvf(float x) Returns the inverse complementary function of 𝑥 . ✓ ✓ float erfcxf(float x) Returns the scaled complementary error function of 𝑥 . ✓ ✓ float erfinvf(float x) Returns the inverse error function of 𝑥 . ✓ ✓ float expf(float x) Returns 𝑒 𝑥 . ✓ ✓ float exp10f(float x) Returns 10 𝑥 . ✓ ✓ float exp2f( float x) Returns 2 𝑥 . ✓ ✓ float expm1f(float x) Returns 𝑙𝑛 ( 𝑥 - 1) ✓ ✓
-float fabsf(float x) Returns the absolute value of x ✓ ✓ float fdimf(float x, float y) Returns the positive difference between 𝑥 and 𝑦 . ✓ ✓ float fdividef(float x, float y) Divide two floating point values. ✓ ✓ float floorf(float x) Returns the largest integer less than or equal to 𝑥 . ✓ ✓ float fmaf(float x, float y, float z) Returns 𝑥 · 𝑦 + 𝑧 as a single operation. ✓ ✓ float fmaxf(float x, float y) Determine the maximum numeric value of 𝑥 and 𝑦 . ✓ ✓ float fminf(float x, float y) Determine the minimum numeric value of 𝑥 and 𝑦 . ✓ ✓ float fmodf(float x, float y) Returns the floating-point remainder of 𝑥/𝑦 . ✓ ✓ float modff(float x, float* iptr) Break down 𝑥 into fractional and integral parts. ✓
-float frexpf(float x, int* nptr) Extract mantissa and exponent of 𝑥 . ✓ float hypotf(float x, float y) Returns the square root of the sum of squares of 𝑥 and 𝑦 . ✓ int ilogbf(float x) Returns the unbiased integer exponent of 𝑥 . ✓ bool isfinite(float x) Determine whether 𝑥 is finite. ✓ bool isinf(float x) Determine whether 𝑥 is infinite. ✓ bool isnan(float x) Determine whether 𝑥 is a NAN . ✓ float j0f(float x) Returns the value of the Bessel function of the first kind of order 0 for 𝑥 . ✓ float j1f(float x) Returns the value of the Bessel function of the first kind of order 1 for 𝑥 . ✓ float jnf(int n, float x) Returns the value of the Bessel function of the first kind of order n for 𝑥 . ✓
-float ldexpf(float x, int exp) Returns 𝑥 · 2^exp . ✓ ✓ float lgammaf(float x) Returns the natural logarithm of the absolute value of the gamma function of 𝑥 . ✓ long int lrintf(float x) Round 𝑥 to nearest integer value. ✓ ✓ long long int llrintf(float x) Round 𝑥 to nearest integer value. ✓ ✓ long int lroundf(float x) Round to nearest integer value. ✓ ✓ long long int llroundf(float x) Round to nearest integer value. ✓ ✓ float log10f(float x) Returns the base 10 logarithm of 𝑥 . ✓ ✓ float log1pf(float x) Returns the natural logarithm of 𝑥 + 1 . ✓ ✓ float log2f(float x) Returns the base 2 logarithm of 𝑥 . ✓ ✓ float logf(float x) Returns the natural logarithm of 𝑥 . ✓ ✓
-✓ float logbf(float x) Returns the floating point representation of the exponent of 𝑥 . ✓ float nanf(const char* tagp) Returns 'Not a Number' value. ✓ float nearbyintf(float x) Round 𝑥 to the nearest integer. ✓ ✓ float nextafterf(float x, float y) Returns next representable single-precision floating-point value after argument. ✓ float norm3df(float x, float y, float z) Returns the square root of the sum of squares of 𝑥 , 𝑦 and 𝑧 . ✓ ✓ float norm4df(float x, float y, float z, float w) Returns the square root of the sum of squares of 𝑥 , 𝑦 , 𝑧 and 𝑤 . ✓ ✓ float normcdff(float y) Returns the standard normal cumulative distribution function. ✓ ✓ float normcdfinvf(float y) Returns the inverse of the standard normal cumulative distribution function. ✓ ✓ float normf(int dim, const float *a) Returns the square root of the sum of squares of any number of coordinates. ✓ ✓
-Table 1 - continued from previous page float powf(float x, float y) Returns 𝑥 𝑦 . ✓ ✓ float powif(float base, int iexp) Returns the value of first argument to the power of second argument. ✓ ✓ float remainderf(float x, float y) Returns single-precision floating-point remainder. ✓ ✓ float remquof(float x, float y, int* quo) Returns single-precision floating-point remainder and part of quotient. ✓ ✓ float roundf(float x) Round to nearest integer value in floating-point. ✓ ✓ float rcbrtf(float x) Returns the reciprocal cube root function. ✓ ✓ float rhypotf(float x, float y) Returns one over the square root of the sum of squares of two arguments. ✓ ✓ float rintf(float x) Round input to nearest integer value in floating-point. ✓ ✓
-float rnorm3df(float x, float y, float z) Returns one over the square root of the sum of squares of three coordinates of the argument. ✓ ✓ float rnorm4df(float x, float y, float z, float w) Returns one over the square root of the sum of squares of four coordinates of the argument. ✓ ✓ float rnormf(int dim, const float *a) Returns the reciprocal of square root of the sum of squares of any number of coordinates. ✓ ✓ float scalblnf(float x, long int n) Scale 𝑥 by 2^𝑛 . ✓ ✓ float scalbnf(float x, int n) Scale 𝑥 by 2^𝑛 . ✓ ✓ bool signbit(float x) Return the sign bit of 𝑥 . ✓ ✓ float sinf(float x) Returns the sine of 𝑥 . ✓ ✓ float sinhf(float x) Returns the hyperbolic sine of 𝑥 . ✓ ✓ float sinpif(float x) Returns the sine of 𝜋 · 𝑥 . ✓ ✓
-void sincosf(float x, float *sptr, float *cptr) Returns the sine and cosine of 𝑥 . ✓ ✓ void sincospif(float x, float *sptr, float *cptr) Returns the sine and cosine of 𝜋 · 𝑥 . ✓ ✓ float sqrtf(float x) Returns the square root of 𝑥 . ✓ ✓ float rsqrtf(float x) Returns the reciprocal of the square root of 𝑥 . ✓ float tanf(float x) Returns the tangent of 𝑥 . ✓ ✓ float tanhf(float x) Returns the hyperbolic tangent of 𝑥 . ✓ ✓ float tgammaf(float x) Returns the gamma function of 𝑥 . ✓ ✓ float truncf(float x) Truncate 𝑥 to the integral part. ✓ ✓ float y0f(float x) Returns the value of the Bessel function of the second kind of order 0 for 𝑥 . ✓ ✓ float y1f(float x) Returns the value of the Bessel function of the second kind of order 1 for 𝑥 . ✓ ✓
-float ynf(int n, float x) Returns the value of the Bessel function of the second kind of order n for 𝑥 .
-
-21.2 Double precision mathematical functions
-
-Function Supported on Host Supported on Device double abs(double x) Returns the absolute value of 𝑥 ✓ ✓ double acos(double x) Returns the arc cosine of 𝑥 . ✓ ✓ double acosh(double x) Returns the nonnegative arc hyperbolic cosine of 𝑥 . ✓ ✓ double asin(double x) Returns the arc sine of 𝑥 . ✓ ✓ double asinh(double x) Returns the arc hyperbolic sine of 𝑥 . ✓ ✓ double atan(double x) Returns the arc tangent of 𝑥 . ✓ ✓ double atan2(double x, double y) Returns the arc tangent of the ratio of 𝑥 and 𝑦 . ✓ ✓
-double atanh(double x) Returns the arc hyperbolic tangent of 𝑥 . ✓ ✓ double cbrt(double x) Returns the cube root of 𝑥 . ✓ ✓ double ceil(double x) Returns ceiling of 𝑥 . ✓ ✓ double copysign(double x, double y) Create value with given magnitude, copying sign of second value. ✓ ✓ double cos(double x) Returns the cosine of 𝑥 . ✓ ✓ double cosh(double x) Returns the hyperbolic cosine of 𝑥 . ✓ ✓ double cospi(double x) Returns the cosine of 𝜋 · 𝑥 . ✓ ✓ double cyl_bessel_i0(double x) Returns the value of the regular modified cylindrical Bessel function of order 0 for 𝑥 . double cyl_bessel_i1(double x) Returns the value of the regular modified cylindrical Bessel function of order 1 for 𝑥 . double erf(double x) Returns the error function of 𝑥 . ✓ ✓
-double erfc(double x) Returns the complementary error function of 𝑥 . ✓ ✓ double erfcinv(double x) Returns the inverse complementary error function of 𝑥 . ✓ ✓ double erfcx(double x) Returns the scaled complementary error function of 𝑥 . ✓ ✓ double erfinv(double x) Returns the inverse error function of 𝑥 . ✓ ✓ double exp(double x) Returns 𝑒^𝑥 . ✓ ✓ double exp10(double x) Returns 10^𝑥 . ✓ ✓ double exp2(double x) Returns 2^𝑥 . ✓ ✓ double expm1(double x) Returns 𝑒^𝑥 − 1 . ✓ ✓ double fabs(double x) Returns the absolute value of 𝑥 . ✓ ✓ double fdim(double x, double y) Returns the positive difference between 𝑥 and 𝑦 . ✓ ✓
-double floor(double x) Returns the largest integer less than or equal to 𝑥 . ✓ ✓ double fma(double x, double y, double z) Returns 𝑥 · 𝑦 + 𝑧 as a single operation. ✓ ✓ double fmax(double x, double y) Determine the maximum numeric value of 𝑥 and 𝑦 . ✓ ✓ double fmin(double x, double y) Determine the minimum numeric value of 𝑥 and 𝑦 . ✓ ✓ double fmod(double x, double y) Returns the floating-point remainder of 𝑥/𝑦 . ✓ ✓ double modf(double x, double* iptr) Break down 𝑥 into fractional and integral parts. ✓ double frexp(double x, int* nptr) Extract mantissa and exponent of 𝑥 . ✓ double hypot(double x, double y) Returns the square root of the sum of squares of 𝑥 and 𝑦 . ✓ ✓ int ilogb(double x) Returns the unbiased integer exponent of 𝑥 . ✓ ✓
-bool isfinite(double x) Determine whether 𝑥 is finite. ✓ ✓ bool isinf(double x) Determine whether 𝑥 is infinite. ✓ ✓ bool isnan(double x) Determine whether 𝑥 is a NAN . ✓ ✓ double j0(double x) Returns the value of the Bessel function of the first kind of order 0 for 𝑥 . ✓ ✓ double j1(double x) Returns the value of the Bessel function of the first kind of order 1 for 𝑥 . ✓ ✓ double jn(int n, double x) Returns the value of the Bessel function of the first kind of order n for 𝑥 . ✓ ✓ double ldexp(double x, int exp) Returns 𝑥 · 2^exp . ✓ ✓ double lgamma(double x) Returns the natural logarithm of the absolute value of the gamma function of 𝑥 . ✓ long int lrint(double x) Round 𝑥 to nearest integer value. ✓ ✓
-long long int llrint(double x) Round 𝑥 to nearest integer value. ✓ ✓ long int lround(double x) Round to nearest integer value. ✓ ✓ long long int llround(double x) Round to nearest integer value. ✓ ✓ double log10(double x) Returns the base 10 logarithm of 𝑥 . ✓ ✓ double log1p(double x) Returns the natural logarithm of 𝑥 +1 . ✓ ✓ double log2(double x) Returns the base 2 logarithm of 𝑥 . ✓ ✓ double log(double x) Returns the natural logarithm of 𝑥 . ✓ ✓ double logb(double x) Returns the floating point representation of the exponent of 𝑥 . ✓ ✓ double nan(const char* tagp) Returns 'Not a Number' value. ✓ double nearbyint(double x) Round 𝑥 to the nearest integer. ✓ ✓
-✓ double nextafter(double x, double y) Returns next representable double-precision floating-point value after argument. ✓ double norm3d(double x, double y, double z) Returns the square root of the sum of squares of 𝑥 , 𝑦 and 𝑧 . ✓ ✓ double norm4d(double x, double y, double z, double w) Returns the square root of the sum of squares of 𝑥 , 𝑦 , 𝑧 and 𝑤 . ✓ ✓ double normcdf(double y) Returns the standard normal cumulative distribution function. ✓ ✓ double normcdfinv(double y) Returns the inverse of the standard normal cumulative distribution function. ✓ ✓ double norm(int dim, const double *a) Returns the square root of the sum of squares of any number of coordinates. ✓ ✓ double pow(double x, double y) Returns 𝑥 𝑦 . ✓ ✓ double powi(double base, int iexp) Returns the value of first argument to the power of second argument. ✓ ✓
-Table 2 - continued from previous page double remainder(double x, double y) Returns double-precision floating-point remainder. ✓ ✓ double remquo(double x, double y, int* quo) Returns double-precision floating-point remainder and part quotient. ✓ of double round(double x) Round to nearest integer value in floating-point. ✓ ✓ double rcbrt(double x) Returns the reciprocal cube root function. ✓ ✓ double rhypot(double x, double y) Returns one over the square root of the sum of squares of two arguments. ✓ ✓ double rint(double x) Round input to nearest integer value in floating-point. ✓ ✓ double rnorm3d(double x, double y, double z) Returns one over the square root of the sum of squares of three coordinates of the argument. ✓ ✓ double rnorm4d(double x, double y, double z, double w) Returns one over the square root of the sum of squares of four coordinates of the argument. ✓ ✓
-✓ double rnorm(int dim, const double *a) Returns the reciprocal of square root of the sum of squares of any number of coordinates. ✓ double scalbln(double x, long int n) Scale 𝑥 by 2^𝑛 . ✓ ✓ double scalbn(double x, int n) Scale 𝑥 by 2^𝑛 . ✓ ✓ bool signbit(double x) Return the sign bit of 𝑥 . ✓ ✓ double sin(double x) Returns the sine of 𝑥 . ✓ ✓ double sinh(double x) Returns the hyperbolic sine of 𝑥 . ✓ ✓ double sinpi(double x) Returns the sine of 𝜋 · 𝑥 . ✓ ✓ void sincos(double x, double *sptr, double *cptr) Returns the sine and cosine of 𝑥 . ✓ ✓ void sincospi(double x, double *sptr, double *cptr) Returns the sine and cosine of 𝜋 · 𝑥 . ✓ ✓ double sqrt(double x) Returns the square root of 𝑥 . ✓ ✓
-double rsqrt(double x) Returns the reciprocal of the square root of 𝑥 . ✓ double tan(double x) Returns the tangent of 𝑥 . ✓ double tanh(double x) Returns the hyperbolic tangent of 𝑥 . ✓ double tgamma(double x) Returns the gamma function of 𝑥 . ✓ double trunc(double x) Truncate 𝑥 to the integral part. ✓ double y0(double x) Returns the value of the Bessel function of the second kind of order 0 for 𝑥 . ✓ double y1(double x) Returns the value of the Bessel function of the second kind of order 1 for 𝑥 . ✓ double yn(int n, double x) Returns the value of the Bessel function of the second kind of order n for 𝑥 . ✓ 21.3 Integer intrinsics
-Function
-21.4 Floating-point Intrinsics
-CHAPTER
-TWENTYTWO
-TABLE COMPARING SYNTAX FOR DIFFERENT COMPUTE APIS
-
-| Term | CUDA | HIP | OpenCL |
-|---|---|---|---|
-| Device | `int deviceId` | `int deviceId` | `cl_device` |
-| Queue | `cudaStream_t` | `hipStream_t` | `cl_command_queue` |
-| Event | `cudaEvent_t` | `hipEvent_t` | `cl_event` |
-| Memory | `void *` | `void *` | `cl_mem` |
-| | grid | grid | NDRange |
-| | block | block | work-group |
-| | thread | thread | work-item |
-| | warp | warp | sub-group |
-| Thread-index | `threadIdx.x` | `threadIdx.x` | `get_local_id(0)` |
-| Block-index | `blockIdx.x` | `blockIdx.x` | `get_group_id(0)` |
-| Block-dim | `blockDim.x` | `blockDim.x` | `get_local_size(0)` |
-| Grid-dim | `gridDim.x` | `gridDim.x` | `get_num_groups(0)` |
-| Device Kernel | `__global__` | `__global__` | `__kernel` |
-| Device Function | `__device__` | `__device__` | Implied in device com |
-| Host Function | `__host__` (default) | `__host__` (default) | Implied in host compil |
-| Host + Device Function | `__host__ __device__` | `__host__ __device__` | No equivalent |
-| Kernel Launch | `<<< >>>` | `hipLaunchKernel` / `hipLaunchKernelGGL` / `<<<` | `clEnqueueNDRangeK` |
-| Global Memory | `__global__` | `__global__` | `__global` |
-| Group Memory | `__shared__` | `__shared__` | `__local` |
-| Constant | `__constant__` | `__constant__` | `__constant` |
-| | `__syncthreads` | `__syncthreads` | `barrier(CLK_LOCAL` |
-| Atomic Builtins | `atomicAdd` | `atomicAdd` | `atomic_add` |
-| Precise Math | `cos(f)` | `cos(f)` | `cos(f)` |
-| Fast Math | `__cos(f)` | `__cos(f)` | `native_cos(f)` |
-| Vector | `float4` | `float4` | `float4` |
-
-22.1 Notes
-CHAPTER
-TWENTYTHREE
-HIP COOPERATIVE GROUPS API
-23.1 Cooperative kernel launches
-
-23.2 Cooperative groups classes
-class thread_group
-template<unsigned int size , class ParentCGTy >
-Public Functions
-void sync ()
-unsigned int meta_group_rank () const
-template<class T >
-Template Parameters
-Parameters
-
-
-template<class T >
-Template Parameters
-Parameters
-
-
-T shfl_up ( T var, unsigned int lane_delta ) const
-Template Parameters
-Parameters
-
-
-template<class T >
-Template Parameters
-
-
-Parameters
-
-
-Parameters
-Parameters
-Parameters
-Parameters
-Parameters
-
-
-23.3 Cooperative groups construct functions
-
-23.4 Cooperative groups exposed API functions
-
-CHAPTER
-TWENTYFOUR
-HSA RUNTIME API FOR ROCM
-Parameters
-
-
-Return values
-
-
-Parameters
-
-
-Return values
-
-
-Parameters
-
-
-Return values
-
-
-Parameters
-Return values
-
-
-Parameters
-
-
-
-
-Return values
-
-
-hsa_status_t hsa_amd_vmem_unmap ( void *va, size_t size )
-Parameters
-
-
-Return values
-
-
-Parameters
-
-
-Return values
-
-
-Parameters
-
-
-Return values
-
-
-Parameters
-
-
-Return values
-
-
-hsa_status_t hsa_amd_vmem_import_shareable_handle ( int dmabuf_fd, hsa_amd_vmem_alloc_handle_t *handle )
-Parameters
-
-
-Return values
-
-
-
-
-Parameters
-
-
-Return values
-
-
-Parameters
-
-
-Return values
-
-
-CHAPTER
-TWENTYFIVE
-HIP MANAGED MEMORY ALLOCATION API
-Parameters
-
-
-Returns
-Parameters
-
-
-Returns
-Parameters
-
-
-Returns
-Parameters
-
-
-Returns
-Parameters
-
-
-Returns
-Parameters
-
-
-Returns
-template<class T >
-
-
-See also:
-CHAPTER
-TWENTYSIX
-HIP VIRTUAL MEMORY MANAGEMENT API
-Parameters
-
-
-Returns
-Parameters
-
-
-Returns
-Parameters
-
-
-Returns
-Parameters
-
-
-Returns
-Parameters
-
-
-Returns
-Parameters
-
-
-Returns
-Parameters
-
-
-Returns
-Parameters
-
-
-Returns
-Parameters
-
-
-Returns
-Parameters
-
-
-Returns
-Parameters
-Returns
-Parameters
-
-
-Returns
-Parameters
-
-
-Returns
-hipError_t hipMemUnmap ( void *ptr, size_t size )
-Parameters
-
-
-Returns
-CHAPTER
-TWENTYSEVEN
-HIP DEPRECATED RUNTIME API FUNCTIONS
-27.1 Context management
-
-
-
-
-27.2 Memory management
-
-
-27.3 Profiler control
-
-
-27.4 Texture management
-
-
-
-
-CHAPTER
-TWENTYEIGHT
-SAXPY - HELLO, HIP
-28.1 Prerequisites
-28.2 Heterogeneous programming
-28.3 Your first lines of HIP code
-
-git clone https://github.com/amd/rocm-examples.git
-// Allocate and copy vectors to device memory.
-float* d_x{};
-float* d_y{};
-HIP_CHECK(hipMalloc(&d_x, size_bytes));
-HIP_CHECK(hipMalloc(&d_y, size_bytes));
-HIP_CHECK(hipMemcpy(d_x, x.data(), size_bytes, hipMemcpyHostToDevice));
-HIP_CHECK(hipMemcpy(d_y, y.data(), size_bytes, hipMemcpyHostToDevice));
- Launch the calculation on the device after the input data has been prepared.
- __global__ void saxpy_kernel(const float a, const float* d_x, float* d_y, const unsigned int size)
- {
- //...
- }
-
- int main()
- {
- //...
-
- // Launch the kernel on the default stream.
- saxpy_kernel<<
-
-
-
-
-
-
-HIP_CHECK(hipMemcpy(y.data(), d_y, size_bytes, hipMemcpyDeviceToHost));
-
-28.4 Compiling on the command line
-28.4.1 Setting up the command line
-Linux and AMD
-
-export PATH=/opt/rocm/bin:${PATH}
-
-You should be able to call the compiler on the command line now:
-
-amdclang++ --version
-
-Linux and NVIDIA
-
-nvcc --version
-
-Windows and AMD
-
-
-
-
-
-$InstallationPath = Get-CimInstance MSFT_VSInstance | Sort-Object -Property Version -Descending | Select-Object -First 1 -ExpandProperty InstallLocation
-Import-Module $InstallationPath\Common7\Tools\Microsoft.VisualStudio.DevShell.dll
-Enter-VsDevShell -InstallPath $InstallationPath -SkipAutomaticLocation -Arch amd64 -HostArch amd64 -DevCmdArguments '-no_logo'
-$env:PATH = "${env:HIP_PATH}bin;${env:PATH}"
-
-clang++ --version
-
-Windows and NVIDIA
-
-
-
-
-
-$InstallationPath = Get-CimInstance MSFT_VSInstance | Sort-Object -Property Version -Descending | Select-Object -First 1 -ExpandProperty InstallLocation
-Import-Module $InstallationPath\Common7\Tools\Microsoft.VisualStudio.DevShell.dll
-Enter-VsDevShell -InstallPath $InstallationPath -SkipAutomaticLocation -Arch amd64 -HostArch amd64 -DevCmdArguments '-no_logo'
-
-nvcc --version
-
-28.4.2 Invoking the compiler manually
-Linux and AMD
-
-Linux and NVIDIA
-
-nvcc ./HIP-Basic/saxpy/main.hip -o saxpy -I ./Common -I /opt/rocm/include -O2 -x cu
-
-Windows and AMD
-
-clang++ .\HIP-Basic\saxpy\main.hip -o saxpy.exe -I .\Common -lamdhip64 -L ${env:HIP_PATH}lib -O2
-
-Windows and NVIDIA
-
-nvcc .\HIP-Basic\saxpy\main.hip -o saxpy.exe -I ${env:HIP_PATH}include -I .\Common -O2 -x cu
-
-Linux and AMD
-
-roc-obj -t gfx803 -d ./saxpy
-
-
- ls main-hip-amdgcn-amd-amdhsa-*
- main-hip-amdgcn-amd-amdhsa-gfx803.bc
- main-hip-amdgcn-amd-amdhsa-gfx803.hipi
- main-hip-amdgcn-amd-amdhsa-gfx803.o
- main-hip-amdgcn-amd-amdhsa-gfx803.out
- main-hip-amdgcn-amd-amdhsa-gfx803.out.resolution.txt
- main-hip-amdgcn-amd-amdhsa-gfx803.s
-
-Linux and NVIDIA
-
- cuobjdump --list-ptx ./saxpy
-
- Which will print something like:
-
- PTX file    1: saxpy.1.sm_5.ptx
-
-Windows and AMD
-
-dumpbin.exe /nologo /section:.hip_fat /rawdata:8 .\saxpy.exe | select -Skip 20 -First 12
-
-clang++ .\HIP-Basic\saxpy\main.hip -o saxpy.exe -I .\Common -lamdhip64 -L ${env:HIP_PATH}lib -O2 --save-temps
-Get-ChildItem -Filter main-hip-* | select -Property Name
-
-Name
------
-main-hip-amdgcn-amd-amdhsa-gfx906.bc
-main-hip-amdgcn-amd-amdhsa-gfx906.hipi
-main-hip-amdgcn-amd-amdhsa-gfx906.o
-main-hip-amdgcn-amd-amdhsa-gfx906.out
-main-hip-amdgcn-amd-amdhsa-gfx906.out.resolution.txt
-main-hip-amdgcn-amd-amdhsa-gfx906.s
-
-Files with the .s extension hold the disassembled contents of the binary, and the filename directly tells us which graphics IPs the compiler targeted.
-
-Get-ChildItem main-hip-*.s | Get-Content
- .text
- .amdgcn_target "amdgcn-amd-amdhsa--gfx906"
- .protected _Z12saxpy_kernelPKfPfj ; -- Begin function _Z12saxpy_kernelPKfPfj
- .globl _Z12saxpy_kernelPKfPfj
- .p2align 8
- .type _Z12saxpy_kernelPKfPfj,@function
-_Z12saxpy_kernelPKfPfj:
- ; %bb.0:
- s_load_dword s0, s[4:5], 0x4
- s_load_dword s1, s[6:7], 0x18
- s_waitcnt lgkmcnt(0)
- s_and_b32 s0, s0, 0xffff
- s_mul_i32 s8, s8, s0
- v_add_u32_e32 v0, s8, v0
- v_cmp_gt_u32_e32 vcc, s1, v0
- s_and_saveexec_b64 s[0:1], vcc
- s_cbranch_execz .LBB0_2
- ; %bb.1:
- s_load_dwordx4 s[0:3], s[6:7], 0x8
- v_mov_b32_e32 v1, 0
- v_lshlrev_b64 v[0:1], 2, v[0:1]
- s_waitcnt lgkmcnt(0)
- v_mov_b32_e32 v3, s1
- v_add_co_u32_e32 v2, vcc, s0, v0
- v_addc_co_u32_e32 v3, vcc, v3, v1, vcc
- global_load_dword v2, v[2:3], off
- v_mov_b32_e32 v3, s3
- v_add_co_u32_e32 v0, vcc, s2, v0
- v_addc_co_u32_e32 v1, vcc, v3, v1, vcc
- global_load_dword v3, v[0:1], off
- s_load_dword s0, s[6:7], 0x0
- s_waitcnt vmcnt(0) lgkmcnt(0)
- v_fmac_f32_e32 v3, s0, v2
- global_store_dword v[0:1], v3, off
- .LBB0_2:
- s_endpgm
- ...
-
-Windows and NVIDIA
-
-PTX file    1: saxpy.1.sm_5.ptx
-
-Linux and AMD
-
-
-./saxpy
-Calculating y[i] = a * x[i] + y[i] over 10000000 elements.
-First 10 elements of the results: [ 3, 5, 7, 9, 11, 13, 15, 17, 19, 21 ]
-
-Linux and NVIDIA
-
- -arch=sm_70,sm_86
- ./saxpy
- Calculating y[i] = a * x[i] + y[i] over 10000000 elements.
- First 10 elements of the results: [ 3, 5, 7, 9, 11, 13, 15, 17, 19, 21 ]
-
-Windows and AMD
-
-& ${env:HIP_PATH}bin\hipInfo.exe | Select-String gfx
-
-gcnArchName: gfx1032
-gcnArchName: gfx1035
-clang++ .\HIP-Basic\saxpy\main.hip -o saxpy.exe -I .\Common -lamdhip64 -L ${env:HIP_PATH}lib -O2 --offload-arch=gfx1032 --offload-arch=gfx1035
-.\saxpy.exe
-Calculating y[i] = a * x[i] + y[i] over 10000000 elements.
-First 10 elements of the results: [ 3, 5, 7, 9, 11, 13, 15, 17, 19, 21 ]
-
-Windows and NVIDIA
-
-nvcc .\HIP-Basic\device_query\main.cpp -o device_query.exe -I .\Common -I ${env:HIP_PATH}include -O2
- .\device_query.exe | Select-String "major.minor"
-
- major.minor: 8.6
- major.minor: 7.0
-
-
-
- .\saxpy.exe
- Calculating y[i] = a * x[i] + y[i] over 10000000 elements.
- First 10 elements of the results: [ 3, 5, 7, 9, 11, 13, 15, 17, 19, 21 ]
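The printed results can be cross-checked on the host. The sketch below is a plain C++ reference for `y[i] = a * x[i] + y[i]`; it is not taken from the tutorial sources. The helper names `saxpy_reference` and `saxpy_example` are hypothetical, and the initialization (`a = 2`, `x = 1, 2, 3, ...`, `y = 1`) is an assumption chosen only because it reproduces the first ten values shown above.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Host-side reference for y[i] = a * x[i] + y[i].
// Hypothetical helper, not part of the tutorial sources.
inline void saxpy_reference(float a, const std::vector<float>& x, std::vector<float>& y)
{
    assert(x.size() == y.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] = a * x[i] + y[i];
}

// Builds inputs under the ASSUMED initialization (a = 2, x = 1, 2, 3, ..., y = 1)
// and returns the updated y vector.
inline std::vector<float> saxpy_example(std::size_t n)
{
    std::vector<float> x(n);
    std::iota(x.begin(), x.end(), 1.0f); // x = 1, 2, 3, ...
    std::vector<float> y(n, 1.0f);       // y = 1, 1, 1, ...
    saxpy_reference(2.0f, x, y);
    return y;
}
```

Comparing the vector copied back from the device against `saxpy_reference` element by element is a cheap way to catch indexing mistakes in the kernel.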
-CHAPTER
-TWENTYNINE
-REDUCTION
-29.1 The algorithm
-29.2 Reduction on GPUs
-29.2.1 Naive shared reduction
-
-
-
-
-
-
-
-
-
-of the input size.
-
-for (uint32_t curr = input_count; curr > 1;)
-{
- hipLaunchKernelGGL(
- kernel,
- dim3(new_size(curr)),
- dim3(block_size),
- factor * sizeof(unsigned),
- hipStreamDefault,
- front,
- back,
- kernel_op,
- zero_elem,
- curr);
-
- curr = new_size(curr);
- if (curr > 1)
- std::swap(front, back);
-}
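The driver loop above can be emulated on the CPU to see why the buffers are swapped between passes. This is a sketch under stated assumptions: `new_size` is assumed to compute `ceil(count / factor)` (the excerpt only shows it being called), `multipass_reduce` is a hypothetical name, and each simulated "block" simply folds `factor` consecutive elements.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Number of partial results one pass produces: ceil(count / factor).
// ASSUMED definition; the excerpt only shows new_size being called.
inline std::size_t new_size(std::size_t count, std::size_t factor)
{
    return count / factor + (count % factor == 0 ? 0 : 1);
}

// CPU stand-in for the driver loop: each simulated block folds `factor`
// consecutive elements of `front` into one element of `back`, then the
// two buffers swap roles for the next pass.
template <typename T, typename Op>
T multipass_reduce(std::vector<T> front, Op op, T zero_elem, std::size_t factor)
{
    assert(factor >= 2); // factor == 1 would never shrink the input
    if (front.empty())
        return zero_elem;

    std::vector<T> back(new_size(front.size(), factor));
    std::size_t curr = front.size();
    while (curr > 1)
    {
        const std::size_t next = new_size(curr, factor);
        for (std::size_t block = 0; block < next; ++block)
        {
            const std::size_t begin = block * factor;
            const std::size_t end   = std::min(begin + factor, curr);
            T acc = zero_elem;
            for (std::size_t i = begin; i < end; ++i)
                acc = op(acc, front[i]);
            back[block] = acc;
        }
        curr = next;
        // Unlike the tutorial loop, always swap so the result ends up in front[0].
        std::swap(front, back);
    }
    return front[0];
}
```

Every pass shrinks the problem by `factor`, so the number of kernel launches grows only logarithmically with the input size.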
-29.2.2 Reducing thread divergence
-
-// Shared reduction
-for (uint32_t i = 1; i < blockDim.x; i *= 2)
-{
-- if (tid % (2 * i) == 0)
-- shared[tid] = op(shared[tid], shared[tid + i]);
-+ if (uint32_t j = 2 * i * tid; j < blockDim.x)
-+ shared[j] = op(shared[j], shared[j + i]);
- __syncthreads();
-}
-
-29.2.3 Resolving bank conflicts
-
-
-
-
- implementation of the naive algorithm is to form continuous ranges of the threads activ
-
- // Shared reduction
- -for (uint32_t i = 1; i < blockDim.x; i *= 2)
- -{
-
-29.2.4 Utilize upper half of the block
-
-constexpr int size = 4;
-for (int i = 0 ; i < size ; ++i)
-{
- printf("%d", i);
-}
-
-LLVM Block
-main:
- push rbx
- lea rbx, [rip +.L.str]
- mov rdi, rbx
- xor esi, esi
- xor eax, eax
- call printf@PLT
- mov rdi, rbx
- mov esi, 1
- xor eax, eax
- call printf@PLT
- mov rdi, rbx
- mov esi, 2
- xor eax, eax
- call printf@PLT
- mov rdi, rbx
- mov esi, 3
- xor eax, eax
- call printf@PLT
- xor eax, eax
- pop rbx
- ret
-.L.str:
- .asciz "%d"
-
- GCC
- .LC0:
- .string "%d"
- main:
- push rbx
- xor ebx, ebx
- .L2:
- mov esi, ebx
- mov edi, OFFSET FLAT:.LC0
- xor eax, eax
- add ebx, 1
- call printf
- cmp ebx, 4
- jne .L2
- xor eax, eax
- pop rbx
- ret
-
- MSVC
-
-main PROC
- $LN12:
- push rbx
- sub rsp, 32
- xor ebx, ebx
- npad 8
- $LL4@main:
- mov edx, ebx
- lea rcx, OFFSET FLAT:'string'
- call printf
- inc ebx
- cmp ebx, 4
- jl SHORT $LL4@main
- xor eax, eax
- add rsp, 32
- pop rbx
- ret 0
- main ENDP
-printf("%d", 0);
-printf("%d", 1);
-printf("%d", 2);
-printf("%d", 3);
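Since the three compilers above disagree on whether to unroll the loop, one way to avoid depending on the optimizer is to expand the iterations at compile time. The following is a minimal C++17 sketch of that idea; `static_for` and `static_for_impl` are hypothetical helper names written for this illustration, not the tutorial's own utilities.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>

// Calls f once per index in Is..., in order. Each index arrives as a distinct
// std::integral_constant type, so it is usable as a compile-time constant.
template <typename F, std::size_t... Is>
constexpr void static_for_impl(F& f, std::index_sequence<Is...>)
{
    // Fold over the comma operator: one call expression per index.
    (f(std::integral_constant<std::size_t, Is>{}), ...);
}

// Hypothetical helper: expands to N calls of f at compile time.
template <std::size_t N, typename F>
constexpr void static_for(F f)
{
    static_for_impl(f, std::make_index_sequence<N>{});
}

// Small usage example: sums the indices 0, 1, 2, 3.
inline int sum_of_first_four_indices()
{
    int total = 0;
    static_for<4>([&](auto i) { total += static_cast<int>(decltype(i)::value); });
    return total;
}
```

The expansion into four separate calls happens in the front end, so it does not rely on a loop-unrolling heuristic; whether each call is then inlined is still up to the compiler.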
-
-
-
-
-
- Consider the following code:
-
- int warp_size = device_props.warpSize;
-
- switch (warp_size)
-
- {
-
- case 32:
-
- hipLaunchKernelGGL(kernel<32>, ...);
-
- break;
-
- case 64:
-
- hipLaunchKernelGGL(kernel<64>, ...);
-
- break;
-
- }
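The `switch` above maps a value known only at run time onto a template argument. The same dispatch pattern can be tried on the host without a GPU. In this sketch `warps_per_block` is a hypothetical stand-in for the templated kernel and `dispatch_on_warp_size` is a hypothetical wrapper; rejecting other warp sizes with an exception is a design choice made here, not something the source prescribes.

```cpp
#include <cstdint>
#include <stdexcept>

// Stand-in for a kernel templated on the warp size. It only reports how many
// warps a block of block_size threads contains (rounded up).
template <std::uint32_t WarpSize>
std::uint32_t warps_per_block(std::uint32_t block_size)
{
    return (block_size + WarpSize - 1) / WarpSize;
}

// Maps the run-time warp size onto one of the precompiled instantiations.
inline std::uint32_t dispatch_on_warp_size(int warp_size, std::uint32_t block_size)
{
    switch (warp_size)
    {
        case 32: return warps_per_block<32>(block_size);
        case 64: return warps_per_block<64>(block_size);
        default: throw std::invalid_argument("unsupported warp size");
    }
}
```

Both instantiations are compiled ahead of time, so the run-time cost is a single branch per launch while the kernel body still sees the warp size as a constant.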
- tmp::static_switch
-
-t WarpSize>()
-
-
-
-
- -template
-29.2.5 Unroll all loops
-
-
-
-
-
-
-29.2.6 Communicate using warp-collective functions
-
-
-
-
- // Warp reduction
-
-29.2.7 Prefer warp communication over shared
-
- The kernel versions differ too significantly to be described using a diff, so the kernel is shown afresh instead.
-
- template
-static constexpr uint32_t WarpCount = BlockSize / WarpSize;
-
-__shared__ T shared[WarpCount];
-
-auto read_global_safe =
- [&](const uint32_t i) { return i < front_size ? front[i] : zero_elem; };
-auto read_shared_safe =
- [&](const uint32_t i) { return i < WarpCount ? shared[i] : zero_elem; };
-
-const uint32_t tid = threadIdx.x,
- bid = blockIdx.x,
- gid = bid * (blockDim.x * 2) + tid,
- wid = tid / WarpSize,
- lid = tid % WarpSize;
-
-// Read input from front buffer to local
-T res = op(read_global_safe(gid), read_global_safe(gid + blockDim.x));
-
-As the results of the warps are communicated through shared memory, shared memory needs as many elements as there are warps in the block. Similar to how you can only launch kernels at block granularity, you can
-});
-
-// Write result from local to back buffer
-if(tid == 0)
- back[bid] = res;
-
-29.2.8 Amortize bookkeeping variable overhead
-
-_t ItemsPerThread>
-
-
-
- __global__ static __launch_bounds__(BlockSize) void kernel(...)
-
-29.2.8.1 Reading ItemsPerThread
-
- The change to reading happens inside read_global_safe:
- auto read_global_safe = [&](const int32_t i) -> hip::static_array
-T arr[4] = {
- front[gid + 0],
- front[gid + 1],
- front[gid + 2],
- front[gid + 3]
-};
-T arr[4] = {
- i + 0 < front_size ? front[i + 0] : zero_elem,
- i + 1 < front_size ? front[i + 1] : zero_elem,
- i + 2 < front_size ? front[i + 2] : zero_elem,
- i + 3 < front_size ? front[i + 3] : zero_elem
-};
-
-29.2.8.2 Processing ItemsPerThread
-
-
-29.2.9 Two-pass reduction
-29.2.10 Global data share
-29.3 Conclusion
-THIRTY
-COOPERATIVE GROUPS
-30.1 Prerequisites
-30.2 Simple HIP Code
-30.3 Tiled partition
-
- You can use tiled partition to calculate the sum of partition_size length sequences and the sum of result_size / BlockSize length sequences. The host-side reference implementation is the following:
-
- // Host-side function to perform the same reductions as executed on the GPU
- std::vector
- result[i] = partition_result;
- }
-
- return result;
- }
-
-30.3.1 Device-side code
-
- The usage of warp-level intrinsics is not covered in this tutorial, unlike in the reduction tutorial. The x input variable is a pointer to shared memory, which must be synchronized after every value change. The thread_group input parameter can be a thread_block_tile or a thread_block, because thread_group is the parent class of both types. The val parameter holds the numbers to sum. The function returns the final result of the reduction on the thread with rank 0 in the thread_group; on every other thread it returns 0.
-
- /// \brief Summation of `unsigned int val`s in `thread_group g` using shared memory `x`
- __device__ unsigned int reduce_sum(thread_group g, unsigned int* x, unsigned int val)
- {
- // Rank of this thread in the group
- const unsigned int group_thread_id = g.thread_rank();
-
- // We start with half the group size as active threads
- // Every iteration the number of active threads halves, until we processed all values
- for(unsigned int i = g.size() / 2; i > 0; i /= 2)
- {
- // Store value for this thread in a shared, temporary array
- x[group_thread_id] = val;
-
- // Synchronize all threads in the group
- g.sync();
-
- // If our thread is still active, sum with its counterpart in the other half
- if(group_thread_id < i)
- {
- val += x[group_thread_id + i];
- }
-
- // Synchronize all threads in the group
- g.sync();
- }
-
- // Only the first thread returns a valid value
- if(g.thread_rank() == 0)
- return val;
- else
- return 0;
- }
-
- The reduce_sum device function is reused to calculate both the block sum and the custom partition sum of the input
- numbers. The kernel has three sections:
-
- 1. Initialization of the reduction function variables.
- 2. The reduction of thread block.
- 3. The reduction of custom partition.
-
-30.3.1.1 1. Initialization of the reduction function variables
-
-In this code section, the shared memory is declared, the thread_block_group and custom_partition are defined, and the
-input variables are loaded from global memory.
-
-// threadBlockGroup consists of all threads in the block
-thread_block thread_block_group = this_thread_block();
-
-// Workspace array in shared memory required for reduction
-__shared__ unsigned int workspace[2048];
-
-unsigned int output;
-
-// Input to reduce
-const unsigned int input = d_vector[thread_block_group.thread_rank()];
-
-//...
-
-// Every custom_partition group consists of 16 threads
-thread_block_tile
-
-30.3.1.2 2. The reduction of thread block
-
-// Perform reduction
-output = reduce_sum(thread_block_group, workspace, input);
-
-// Only the first thread returns a valid value
-if(thread_block_group.thread_rank() == 0)
-{
- d_block_reduced_vector[0] = output;
-}
-
-30.3.1.3 3. The reduction of custom partition
-
-// Perform reduction
-output = reduce_sum(custom_partition, &workspace[group_offset], input);
-
-// Only the first thread in each partition returns a valid value
-if(custom_partition.thread_rank() == 0)
-{
- const unsigned int partition_id = thread_block_group.thread_rank() / PartitionSize;
- d_partition_reduced_vector[partition_id] = output;
-}
-
-30.3.2 Host-side code
-
-
-30.3.2.1 1. Confirm the cooperative group support on AMD GPUs
-
-30.3.2.2 2. Initialize the cooperative group configuration
-
-30.3.2.3 4. Launch the kernel
-
- The kernel launch is done with the hipLaunchCooperativeKernel of the cooperative groups API.
- void* params[] = {&d_vector, &d_block_reduced, &d_partition_reduced};
- // Launching kernel from host.
- HIP_CHECK(hipLaunchCooperativeKernel(vector_reduce_kernel
-
-30.4 Conclusion
-CHAPTER
-THIRTYONE
-LICENSE
-```
-
-## 12.2.1 Debugging HIP applications
-
-The following Linux example shows how to get useful information from the debugger while running a simple memory copy test, which caused a segmentation fault issue.
-
-```
-HIP Documentation, Release 6.1.40092
-```
-
-On Windows, you can set AMD\_LOG\_LEVEL through an environment variable, either from the advanced system settings or from the command prompt (when run as administrator). The following example shows debug log information when calling the backend runtime.
-
-```
-
-
-
- void void
-```
-
-We insert the GCN ISA into the kernel using the asm() assembler statement:
-
-- The volatile keyword ensures that the optimizer does not change the number of volatile operations or reorder them relative to other volatile operations.
-- v\_mac\_f32\_e32 is the GCN instruction. For more information, refer to the [AMD GCN3 ISA architecture manual](http://gpuopen.com/compute-product/amd-gcn3-isa-architecture-manual/).
-- The index of each operand is given by % followed by its position in the ordered list of operands.
-- 'v' is the constraint code (target-specific to AMDGPU) for a 32-bit VGPR register. For more information, refer to the [Supported Constraint Code List for AMDGPU](https://llvm.org/docs/LangRef.html#supported-constraint-code-list).
-- Output constraints are specified by an '=' prefix ('=v'). This indicates that the assembly writes to this operand, which is then made available as a return value of the asm expression.
-- Input constraints have no prefix, only the constraint code.
-- The constraint string '0' says to use the register assigned to the output as an input as well (it being the 0th constraint).
-
-## C++ Support
-
-The following C++ features are not supported:
-
-- Run-time-type information (RTTI)
-- Try/catch
-
-## Virtual functions
-
-Virtual functions are not supported if objects containing virtual function tables are passed between GPUs of different offload architectures, e.g. between gfx906 and gfx1030. Otherwise, virtual functions are supported.
-
-## 19.27 Kernel Compilation
-
-hipcc now supports compiling C++/HIP kernels to binary code objects. The binary file format is .co, which stands for Code Object. The following command builds a code object using hipcc.
-
-```
-hipcc --genco --offload-arch=[TARGET GPU] [INPUT FILE] -o [OUTPUT FILE]
-
-[TARGET GPU] = GPU architecture
-[INPUT FILE] = Name of the file containing kernels
-[OUTPUT FILE] = Name of the generated code object file
-```
-
-Note: When using binary code objects, the number of arguments to the kernel differs between the HIP-Clang and NVCC paths. Refer to the HIP module\_api sample for the differences in the arguments to be passed to the kernel.
-
-## 19.28 gfx-arch-specific-kernel
-
-The '\_\_gfx*\_\_' macros defined by Clang can be used to execute gfx-architecture-specific code inside the kernel. Refer to the HIP 14\_gpu\_arch sample.
-
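As a rough illustration of the macro-based selection described above, the following sketch picks a code path at compile time. The function name `compiled_arch` and the fallback definitions of `__host__` and `__device__` are assumptions made for this example so that it also builds with a plain host compiler; only the `__gfx*__` macro family comes from the text.

```cpp
#include <cassert>
#include <cstring>

// Assumption for this sketch: when not compiling with hipcc, define the HIP
// qualifiers away so the example also builds with a plain host compiler.
#ifndef __HIPCC__
#define __host__
#define __device__
#endif

// Hypothetical helper: returns a label for the architecture this function was
// compiled for. Clang defines __gfx90a__, __gfx942__, ... only in the device
// compilation pass for the matching target, so the host pass (and any other
// target) falls through to the generic branch.
__host__ __device__ inline const char* compiled_arch()
{
#if defined(__gfx942__)
    return "gfx942";
#elif defined(__gfx90a__)
    return "gfx90a";
#else
    return "generic";
#endif
}
```

Because the selection happens in the preprocessor, each offload architecture gets its own specialized copy of the function, and no runtime branch is needed.
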
-## CHAPTER
-
-## TWENTY
-
-## C++ LANGUAGE SUPPORT
-
-The ROCm platform enables the power of combined C++ and HIP (Heterogeneous-computing Interface for Portability) code. This code is compiled with a clang or clang++ compiler. The official compilers support the HIP platform, or you can use the amdclang or amdclang++ included in the ROCm installation, which are wrappers for the official versions.
-
-The source code is compiled according to the C++03, C++11, C++14, C++17, and C++20 standards, along with HIP-specific extensions, but is subject to restrictions. The key restriction is the reduced support of the standard library in device code. This is because, by default, a function is considered to run on the host, except for constexpr functions, which can run on both host and device.
-
-## 20.1 Modern C++ support
-
-C++ is considered a modern programming language as of C++11. This section describes how HIP supports these new C++ features.
-
-## 20.1.1 C++11 support
-
-The C++11 standard introduced many new features. These features are supported in HIP host code, with some notable omissions on the device side. The rule of thumb here is that constexpr functions work on device, the rest doesn't. This means that some important functionality like std::function is missing on the device, but unfortunately the standard library wasn't designed with HIP in mind, which means that the support is in a state of 'works as-is'.
-
-Certain features have restrictions and clarifications. For example, any function using the constexpr qualifier or the new initializer lists, std::move, or std::forward features is implicitly considered to have the \_\_host\_\_ and \_\_device\_\_ execution space specifiers. Also, constexpr variables that are static members or namespace scoped can be used from both host and device, but only for read access. Dereferencing a static constexpr outside its specified execution space causes an error.
-
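A minimal sketch of the constexpr rule above, assuming hypothetical helper names (`ceil_div`, `kBlockSize`, `grid_size_for`) that do not appear in the HIP documentation. The functions carry no explicit execution space specifier and rely on constexpr alone; the namespace-scope constant is only ever read.

```cpp
#include <cassert>

// constexpr function: per the rule above it is usable from host and device
// code without writing __host__ __device__ explicitly.
constexpr unsigned int ceil_div(unsigned int n, unsigned int d)
{
    return (n + d - 1) / d;
}

// Namespace-scope constexpr variable: readable from both sides, never written.
constexpr unsigned int kBlockSize = 256;

// Hypothetical helper: number of blocks needed to cover n elements.
constexpr unsigned int grid_size_for(unsigned int n)
{
    return ceil_div(n, kBlockSize);
}

// Evaluated at compile time: 1000 elements need 4 blocks of 256 threads.
static_assert(grid_size_for(1000) == 4, "unexpected grid size");
```

The same `grid_size_for` call could be used on the host to size a launch and inside a kernel to compute loop bounds, which is the practical benefit of the implicit specifiers.
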
-Lambdas are supported, but there are some extensions and restrictions on their usage. For more information, see the Extended lambdas section below.
-
-## 20.1.2 C++14 support
-
-The C++14 language features are supported.
-
-## 20.1.3 C++17 support
-
-All C++17 language features are supported.
-
-## 20.1.4 C++20 support
-
-All C++20 language features are supported, but extensions and restrictions apply. C++20 introduced coroutines and modules, which fundamentally changed how programs are written. HIP doesn't support these features. However, consteval functions can be called from host and device, even if specified for host use only.
-
-The three-way comparison operator (spaceship operator <=> ) works with host and device code.
-
-## 20.2 Extensions and restrictions
-
-In addition to the deviations from the standard, there are some general extensions and restrictions to consider.
-
-## 20.2.1 Global functions
-
-Functions that serve as an entry point for device execution are called kernels and are specified with the \_\_global\_\_ qualifier. To call a kernel function, use the triple chevron operator: <<< >>> . Kernel functions must have a void return type. These functions can't:
-
-- have a constexpr specifier
-- have a parameter of type std::initializer\_list or va\_list
-- use an rvalue reference as a parameter.
-- use parameters having different sizes in host and device code, e.g. long double arguments, or structs containing long double members.
-- use struct-type arguments which have different layout in host and device code.
-
-Kernels can have variadic template parameters, but only one parameter pack, which must be the last item in the template parameter list.
-
-## 20.2.2 Device space memory specifiers
-
-HIP includes device space memory specifiers to indicate whether a variable is allocated in host or device memory and how its memory should be allocated. HIP supports the \_\_device\_\_ , \_\_shared\_\_ , \_\_managed\_\_ , and \_\_constant\_\_ specifiers.
-
-The \_\_device\_\_ and \_\_constant\_\_ specifiers define global variables, which are allocated within global memory on the HIP devices. The only difference is that \_\_constant\_\_ variables can't be changed after allocation. The \_\_shared\_\_ specifier allocates the variable within shared memory, which is available for all threads in a block.
-
-The \_\_managed\_\_ variable specifier creates global variables that are initially undefined and unaddressed within the global symbol table. The HIP runtime allocates managed memory and defines the symbol when it loads the device binary. A managed variable can be accessed in both device and host code.
-
-It's important to know where a variable is stored because it is only available from certain locations. Generally, variables allocated in the host memory are not accessible from the device code, while variables allocated in the device memory are not directly accessible from the host code. Dereferencing a pointer to device memory on the host results in a segmentation fault. Accessing device variables in host code should be done through kernel execution or HIP functions like hipMemcpyToSymbol .
-
-## 20.2.3 Exception handling
-
-An important difference between the host and device code is exception handling. In device code, this control flow isn't available due to the hardware architecture. The device code must use return codes to handle errors.
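
The return-code style mentioned above might look like the following sketch. The `Status` enum and `checked_divide` are hypothetical names for illustration only and are not part of the HIP API; the qualifier fallback is an assumption so the example also compiles without hipcc.

```cpp
#include <cassert>

// Assumption for this sketch: define the HIP qualifiers away when the file is
// built with a plain host compiler.
#ifndef __HIPCC__
#define __host__
#define __device__
#endif

// Hypothetical status codes; not part of the HIP API.
enum class Status : int { Ok = 0, DivideByZero = 1 };

// Reports failure through the return value instead of throwing, so the same
// function can be called from device code where exceptions are unavailable.
// On failure, *out is left untouched.
__host__ __device__ inline Status checked_divide(int num, int den, int* out)
{
    if(den == 0)
    {
        return Status::DivideByZero;
    }
    *out = num / den;
    return Status::Ok;
}
```

In a kernel, a status like this would typically be written to a device buffer and inspected on the host after the kernel finishes.
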
-
-## 20.2.4 Kernel parameters
-
-There are some restrictions on kernel function parameters. They cannot be passed by reference, because these functions are called from the host but run on the device. Also, a variable number of arguments is not allowed.
-
-## 20.2.5 Classes
-
-Classes work on both the host and device side, but there are some constraints. Static member functions can't be \_\_global\_\_ . Virtual member functions work, but a virtual function must not be called from the host if the parent object was created on the device, or the other way around, because this behavior is undefined. Another minor restriction is that \_\_device\_\_ variables that are globally scoped must have trivial constructors.
-
-## 20.2.6 Polymorphic function wrappers
-
-HIP doesn't support the polymorphic function wrapper std::function , which was introduced in C++11.
-
-## 20.2.7 Extended lambdas
-
-HIP supports Lambdas, which by default work as expected.
-
-Lambdas have implicit host device attributes. This means that they can be executed by both host and device code, and work the way you would expect. To make a lambda callable only by host or device code, users can add the \_\_host\_\_ or \_\_device\_\_ attribute. The only restriction is that host variables can only be accessed through copy on the device. Accessing them through reference causes undefined behavior.
-
-## 20.2.8 Inline namespaces
-
-Inline namespaces are supported, but with a few exceptions. The following entities can't be declared in namespace scope within an inline unnamed namespace:
-
-- \_\_managed\_\_ , \_\_device\_\_ , \_\_shared\_\_ and \_\_constant\_\_ variables
-- \_\_global\_\_ function and function templates
-- variables with surface or texture type
-
-## CHAPTER
-
-## TWENTYONE
-
-## HIP MATH API
-
-HIP-Clang supports a set of math operations that are callable from the device. HIP supports most of the device functions supported by NVIDIA CUDA. These are described in the following sections.
-
-## 21.1 Single precision mathematical functions
-
-Following is the list of supported single precision mathematical functions.
-
-Table 1: Single precision mathematical functions
-
-| Function | Supported on Host | Supported on Device |
-|----------------------------------------------------------------------------|---------------------|-----------------------|
-| float abs(float x) Returns the absolute value of 𝑥 | ✓ | ✓ |
-| float acosf(float x) Returns the arc cosine of 𝑥 . | ✓ | ✓ |
-| float acoshf(float x) Returns the nonnegative arc hyperbolic cosine of 𝑥 . | ✓ | ✓ |
-| float asinf(float x) Returns the arc sine of 𝑥 . | ✓ | ✓ |
-| float asinhf(float x) Returns the arc hyperbolic sine of 𝑥 . | ✓ | ✓ |
-| float atanf(float x) Returns the arc tangent of 𝑥 . | ✓ | ✓ |
-| float atan2f(float x, float y) Returns the arc tangent of the ratio of 𝑥 and 𝑦 . | ✓ | ✓ |
-| float atanhf(float x) Returns the arc hyperbolic tangent of 𝑥 . | ✓ | ✓ |
-| float cbrtf(float x) Returns the cube root of 𝑥 . | ✓ | ✓ |
-| float ceilf(float x) Returns ceiling of 𝑥 . | ✓ | ✓ |
-| float copysignf(float x, float y) Create value with given magnitude, copying sign of second value. | ✓ | ✓ |
-| float cosf(float x) Returns the cosine of 𝑥 . | ✓ | ✓ |
-| float coshf(float x) Returns the hyperbolic cosine of 𝑥 . | ✓ | ✓ |
-| float cospif(float x) Returns the cosine of 𝜋 · 𝑥 . | ✓ | ✓ |
-| float cyl_bessel_i0f(float x) Returns the value of the regular modified cylindrical Bessel function of order 0 for 𝑥 . | | |
-| float cyl_bessel_i1f(float x) Returns the value of the regular modified cylindrical Bessel function of order 1 for 𝑥 . | | |
-| float erff(float x) Returns the error function of 𝑥 . | ✓ | ✓ |
-| float erfcf(float x) Returns the complementary error function of 𝑥 . | ✓ | ✓ |
-| float erfcinvf(float x) Returns the inverse complementary error function of 𝑥 . | ✓ | ✓ |
-| float erfcxf(float x) Returns the scaled complementary error function of 𝑥 . | ✓ | ✓ |
-| float erfinvf(float x) Returns the inverse error function of 𝑥 . | ✓ | ✓ |
-| float expf(float x) Returns 𝑒^𝑥 . | ✓ | ✓ |
-| float exp10f(float x) Returns 10^𝑥 . | ✓ | ✓ |
-| float exp2f(float x) Returns 2^𝑥 . | ✓ | ✓ |
-| float expm1f(float x) Returns 𝑒^𝑥 − 1 . | ✓ | ✓ |
-| float fabsf(float x) Returns the absolute value of x | ✓ | ✓ |
-| float fdimf(float x, float y) Returns the positive difference between 𝑥 and 𝑦 . | ✓ | ✓ |
-| float fdividef(float x, float y) Divide two floating point values. | ✓ | ✓ |
-| float floorf(float x) Returns the largest integer less than or equal to 𝑥 . | ✓ | ✓ |
-| float fmaf(float x, float y, float z) Returns 𝑥 · 𝑦 + 𝑧 as a single operation. | ✓ | ✓ |
-| float fmaxf(float x, float y) Determine the maximum numeric value of 𝑥 and 𝑦 . | ✓ | ✓ |
-| float fminf(float x, float y) Determine the minimum numeric value of 𝑥 and 𝑦 . | ✓ | ✓ |
-| float fmodf(float x, float y) Returns the floating-point remainder of 𝑥/𝑦 . | ✓ | ✓ |
-| float modff(float x, float* iptr) Break down 𝑥 into fractional and integral parts. | ✓ | |
-
-continues on next page
-
-Table 1 - continued from previous page
-
-| float frexpf(float x, int* nptr) Extract mantissa and exponent of 𝑥 . | ✓ |
-|---------------------------------------------------------------------------------------------------------|-----|
-| float hypotf(float x, float y) Returns the square root of the sum of squares of 𝑥 and 𝑦 . | ✓ |
-| int ilogbf(float x) Returns the unbiased integer exponent of 𝑥 . | ✓ |
-| bool isfinite(float x) Determine whether 𝑥 is finite. | ✓ |
-| bool isinf(float x) Determine whether 𝑥 is infinite. | ✓ |
-| bool isnan(float x) Determine whether 𝑥 is a NAN . | ✓ |
-| float j0f(float x) Returns the value of the Bessel function of the first kind of order 0 for 𝑥 . | ✓ |
-| float j1f(float x) Returns the value of the Bessel function of the first kind of order 1 for 𝑥 . | ✓ |
-| float jnf(int n, float x) Returns the value of the Bessel function of the first kind of order n for 𝑥 . | ✓ |
-
-continues on next page
-
-Table
-
-1 - continued from previous page
-
-| float ldexpf(float x, int exp) Returns 𝑥 multiplied by 2 raised to the power exp . | ✓ | ✓ |
-|-------------------------------------------------------------------------------------------------------------------|-----|-----|
-| float lgammaf(float x) Returns the natural logarithm of the absolute value of the gamma function of 𝑥 . | ✓ | |
-| long int lrintf(float x) Round 𝑥 to nearest integer value. | ✓ | ✓ |
-| long long int llrintf(float x) Round 𝑥 to nearest integer value. | ✓ | ✓ |
-| long int lroundf(float x) Round to nearest integer value. | ✓ | ✓ |
-| long long int llroundf(float x) Round to nearest integer value. | ✓ | ✓ |
-| float log10f(float x) Returns the base 10 logarithm of 𝑥 . | ✓ | ✓ |
-| float log1pf(float x) Returns the natural logarithm of 𝑥 +1 . | ✓ | ✓ |
-| float log2f(float x) Returns the base 2 logarithm of 𝑥 . | ✓ | ✓ |
-| float logf(float x) Returns the natural logarithm of 𝑥 . | ✓ | ✓ |
-| float logbf(float x) Returns the floating point representation of the exponent of 𝑥 . | ✓ | |
-| float nanf(const char* tagp) Returns 'Not a Number' value. | | ✓ |
-| float nearbyintf(float x) Round 𝑥 to the nearest integer. | ✓ | ✓ |
-| float nextafterf(float x, float y) Returns next representable single-precision floating-point value after argument. | ✓ | |
-| float norm3df(float x, float y, float z) Returns the square root of the sum of squares of 𝑥 , 𝑦 and 𝑧 . | ✓ | ✓ |
-| float norm4df(float x, float y, float z, float w) Returns the square root of the sum of squares of 𝑥 , 𝑦 , 𝑧 and 𝑤 . | ✓ | ✓ |
-| float normcdff(float y) Returns the standard normal cumulative distribution function. | ✓ | ✓ |
-| float normcdfinvf(float y) Returns the inverse of the standard normal cumulative distribution function. | ✓ | ✓ |
-| float normf(int dim, const float *a) Returns the square root of the sum of squares of any number of coordinates. | ✓ | ✓ |
-| float powf(float x, float y) Returns 𝑥^𝑦 . | ✓ | ✓ |
-| float powif(float base, int iexp) Returns the value of first argument to the power of second argument. | ✓ | ✓ |
-| float remainderf(float x, float y) Returns single-precision floating-point remainder. | ✓ | ✓ |
-| float remquof(float x, float y, int* quo) Returns single-precision floating-point remainder and part of quotient. | ✓ | ✓ |
-| float roundf(float x) Round to nearest integer value in floating-point. | ✓ | ✓ |
-| float rcbrtf(float x) Returns the reciprocal cube root function. | ✓ | ✓ |
-| float rhypotf(float x, float y) Returns one over the square root of the sum of squares of two arguments. | ✓ | ✓ |
-| float rintf(float x) Round input to nearest integer value in floating-point. | ✓ | ✓ |
-| float rnorm3df(float x, float y, float z) Returns one over the square root of the sum of squares of three coordinates of the argument. | ✓ | ✓ |
-| float rnorm4df(float x, float y, float z, float w) Returns one over the square root of the sum of squares of four coordinates of the argument. | ✓ | ✓ |
-| float rnormf(int dim, const float *a) Returns the reciprocal of square root of the sum of squares of any number of coordinates. | ✓ | ✓ |
-| float scalblnf(float x, long int n) Scale 𝑥 by 2^𝑛 . | ✓ | ✓ |
-| float scalbnf(float x, int n) Scale 𝑥 by 2^𝑛 . | ✓ | ✓ |
-| bool signbit(float x) Return the sign bit of 𝑥 . | ✓ | ✓ |
-| float sinf(float x) Returns the sine of 𝑥 . | ✓ | ✓ |
-| float sinhf(float x) Returns the hyperbolic sine of 𝑥 . | ✓ | ✓ |
-| float sinpif(float x) Returns the sine of 𝜋 · 𝑥 . | ✓ | ✓ |
-| void sincosf(float x, float *sptr, float *cptr) Returns the sine and cosine of 𝑥 . | ✓ | ✓ |
-| void sincospif(float x, float *sptr, float *cptr) Returns the sine and cosine of 𝜋 · 𝑥 . | ✓ | ✓ |
-| float sqrtf(float x) Returns the square root of 𝑥 . | ✓ | ✓ |
-| float rsqrtf(float x) Returns the reciprocal of the square root of 𝑥 . | | ✓ |
-| float tanf(float x) Returns the tangent of 𝑥 . | ✓ | ✓ |
-| float tanhf(float x) Returns the hyperbolic tangent of 𝑥 . | ✓ | ✓ |
-| float tgammaf(float x) Returns the gamma function of 𝑥 . | ✓ | ✓ |
-| float truncf(float x) Truncate 𝑥 to the integral part. | ✓ | ✓ |
-| float y0f(float x) Returns the value of the Bessel function of the second kind of order 0 for 𝑥 . | ✓ | ✓ |
-| float y1f(float x) Returns the value of the Bessel function of the second kind of order 1 for 𝑥 . | ✓ | ✓ |
-
-continues on next page
-
-```
- \
- float ynf(int n, float x)
- Returns the value of the Bessel
- function of the second kind of order
- n for x.
-```
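
To give a sense of how entries from Table 1 are used, here is a small sketch built on `expm1f` and `fmaf`, both listed above as supported on host and device. The wrapper names `small_growth` and `fused_axpy` are hypothetical, and the qualifier fallback is an assumption so the example also builds with a plain host compiler.

```cpp
#include <cassert>
#include <math.h>

// Assumption for this sketch: define the HIP qualifiers away when the file is
// built with a plain host compiler.
#ifndef __HIPCC__
#define __host__
#define __device__
#endif

// e^x - 1 computed with expm1f, which stays accurate when x is close to zero
// (evaluating expf(x) - 1.0f directly loses precision in that range).
__host__ __device__ inline float small_growth(float x)
{
    return expm1f(x);
}

// a * x + y evaluated as a single fused operation with fmaf.
__host__ __device__ inline float fused_axpy(float a, float x, float y)
{
    return fmaf(a, x, y);
}
```

The single precision names with the `f` suffix keep the computation in float; the corresponding double precision functions are listed in the next section.
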
-
-Table 1 - continued from previous page
-
-## 21.2 Double precision mathematical functions
-
-Following is the list of supported double precision mathematical functions.
-
-Table 2: Double precision mathematical functions
-
-| Function | Supported on Host | Supported on Device |
-|------------------------------------------------------------------------------------|---------------------|-----------------------|
-| double abs(double x) Returns the absolute value of 𝑥 | ✓ | ✓ |
-| double acos(double x) Returns the arc cosine of 𝑥 . | ✓ | ✓ |
-| double acosh(double x) Returns the nonnegative arc hyperbolic cosine of 𝑥 . | ✓ | ✓ |
-| double asin(double x) Returns the arc sine of 𝑥 . | ✓ | ✓ |
-| double asinh(double x) Returns the arc hyperbolic sine of 𝑥 . | ✓ | ✓ |
-| double atan(double x) Returns the arc tangent of 𝑥 . | ✓ | ✓ |
-| double atan2(double x, double y) Returns the arc tangent of the ratio of 𝑥 and 𝑦 . | ✓ | ✓ |
-| double atanh(double x) Returns the arc hyperbolic tangent of 𝑥 . | ✓ | ✓ |
-| double cbrt(double x) Returns the cube root of 𝑥 . | ✓ | ✓ |
-| double ceil(double x) Returns ceiling of 𝑥 . | ✓ | ✓ |
-| double copysign(double x, double y) Create value with given magnitude, copying sign of second value. | ✓ | ✓ |
-| double cos(double x) Returns the cosine of 𝑥 . | ✓ | ✓ |
-| double cosh(double x) Returns the hyperbolic cosine of 𝑥 . | ✓ | ✓ |
-| double cospi(double x) Returns the cosine of 𝜋 · 𝑥 . | ✓ | ✓ |
-| double cyl_bessel_i0(double x) Returns the value of the regular modified cylindrical Bessel function of order 0 for 𝑥 . | | |
-| double cyl_bessel_i1(double x) Returns the value of the regular modified cylindrical Bessel function of order 1 for 𝑥 . | | |
-| double erf(double x) Returns the error function of 𝑥 . | ✓ | ✓ |
-| double erfc(double x) Returns the complementary error function of 𝑥 . | ✓ | ✓ |
-| double erfcinv(double x) Returns the inverse complementary error function of 𝑥 . | ✓ | ✓ |
-| double erfcx(double x) Returns the scaled complementary error function of 𝑥 . | ✓ | ✓ |
-| double erfinv(double x) Returns the inverse error function of 𝑥 . | ✓ | ✓ |
-| double exp(double x) Returns 𝑒^𝑥 . | ✓ | ✓ |
-| double exp10(double x) Returns 10^𝑥 . | ✓ | ✓ |
-| double exp2(double x) Returns 2^𝑥 . | ✓ | ✓ |
-| double expm1(double x) Returns 𝑒^𝑥 - 1 . | ✓ | ✓ |
-| double fabs(double x) Returns the absolute value of 𝑥 . | ✓ | ✓ |
-| double fdim(double x, double y) Returns the positive difference between 𝑥 and 𝑦 . | ✓ | ✓ |
-| double floor(double x) Returns the largest integer less than or equal to 𝑥 . | ✓ | ✓ |
-| double fma(double x, double y, double z) Returns 𝑥 · 𝑦 + 𝑧 as a single operation. | ✓ | ✓ |
-| double fmax(double x, double y) Determine the maximum numeric value of 𝑥 and 𝑦 . | ✓ | ✓ |
-| double fmin(double x, double y) Determine the minimum numeric value of 𝑥 and 𝑦 . | ✓ | ✓ |
-| double fmod(double x, double y) Returns the floating-point remainder of 𝑥/𝑦 . | ✓ | ✓ |
-| double modf(double x, double* iptr) Break down 𝑥 into fractional and integral parts. | ✓ | |
-| double frexp(double x, int* nptr) Extract mantissa and exponent of 𝑥 . | ✓ | |
-| double hypot(double x, double y) Returns the square root of the sum of squares of 𝑥 and 𝑦 . | ✓ | ✓ |
-| int ilogb(double x) Returns the unbiased integer exponent of 𝑥 . | ✓ | ✓ |
-| bool isfinite(double x) Determine whether 𝑥 is finite. | ✓ | ✓ |
-| bool isinf(double x) Determine whether 𝑥 is infinite. | ✓ | ✓ |
-| bool isnan(double x) Determine whether 𝑥 is a NaN . | ✓ | ✓ |
-| double j0(double x) Returns the value of the Bessel function of the first kind of order 0 for 𝑥 . | ✓ | ✓ |
-| double j1(double x) Returns the value of the Bessel function of the first kind of order 1 for 𝑥 . | ✓ | ✓ |
-| double jn(int n, double x) Returns the value of the Bessel function of the first kind of order n for 𝑥 . | ✓ | ✓ |
-| double ldexp(double x, int exp) Multiply 𝑥 by 2 to the power of exp. | ✓ | ✓ |
-| double lgamma(double x) Returns the natural logarithm of the absolute value of the gamma function of 𝑥 . | ✓ | |
-| long int lrint(double x) Round 𝑥 to nearest integer value. | ✓ | ✓ |
-| long long int llrint(double x) Round 𝑥 to nearest integer value. | ✓ | ✓ |
-| long int lround(double x) Round to nearest integer value. | ✓ | ✓ |
-| long long int llround(double x) Round to nearest integer value. | ✓ | ✓ |
-| double log10(double x) Returns the base 10 logarithm of 𝑥 . | ✓ | ✓ |
-| double log1p(double x) Returns the natural logarithm of 𝑥 +1 . | ✓ | ✓ |
-| double log2(double x) Returns the base 2 logarithm of 𝑥 . | ✓ | ✓ |
-| double log(double x) Returns the natural logarithm of 𝑥 . | ✓ | ✓ |
-| double logb(double x) Returns the floating point representation of the exponent of 𝑥 . | ✓ | ✓ |
-| double nan(const char* tagp) Returns 'Not a Number' value. | | ✓ |
-| double nearbyint(double x) Round 𝑥 to the nearest integer. | ✓ | ✓ |
-| double nextafter(double x, double y) Returns next representable double-precision floating-point value after argument. | ✓ | ✓ |
-| double norm3d(double x, double y, double z) Returns the square root of the sum of squares of 𝑥 , 𝑦 and 𝑧 . | ✓ | ✓ |
-| double norm4d(double x, double y, double z, double w) Returns the square root of the sum of squares of 𝑥 , 𝑦 , 𝑧 and 𝑤 . | ✓ | ✓ |
-| double normcdf(double y) Returns the standard normal cumulative distribution function. | ✓ | ✓ |
-| double normcdfinv(double y) Returns the inverse of the standard normal cumulative distribution function. | ✓ | ✓ |
-| double norm(int dim, const double *a) Returns the square root of the sum of squares of any number of coordinates. | ✓ | ✓ |
-| double pow(double x, double y) Returns 𝑥 𝑦 . | ✓ | ✓ |
-| double powi(double base, int iexp) Returns the value of first argument to the power of second argument. | ✓ | ✓ |
-
-| double remainder(double x, double y) Returns double-precision floating-point remainder. | ✓ | ✓ |
-| double remquo(double x, double y, int* quo) Returns double-precision floating-point remainder and part of quotient. | ✓ | |
-| double round(double x) Round to nearest integer value in floating-point. | ✓ | ✓ |
-| double rcbrt(double x) Returns the reciprocal cube root function. | ✓ | ✓ |
-| double rhypot(double x, double y) Returns one over the square root of the sum of squares of two arguments. | ✓ | ✓ |
-| double rint(double x) Round input to nearest integer value in floating-point. | ✓ | ✓ |
-| double rnorm3d(double x, double y, double z) Returns one over the square root of the sum of squares of three coordinates of the argument. | ✓ | ✓ |
-| double rnorm4d(double x, double y, double z, double w) Returns one over the square root of the sum of squares of four coordinates of the argument. | ✓ | ✓ |
-
-| double rnorm(int dim, const double *a) Returns the reciprocal of square root of the sum of squares of any number of coordinates. | ✓ | ✓ |
-| double scalbln(double x, long int n) Scale 𝑥 by 2 𝑛 . | ✓ | ✓ |
-| double scalbn(double x, int n) Scale 𝑥 by 2 𝑛 . | ✓ | ✓ |
-| bool signbit(double x) Return the sign bit of 𝑥 . | ✓ | ✓ |
-| double sin(double x) Returns the sine of 𝑥 . | ✓ | ✓ |
-| double sinh(double x) Returns the hyperbolic sine of 𝑥 . | ✓ | ✓ |
-| double sinpi(double x) Returns the sine of 𝜋 · 𝑥 . | ✓ | ✓ |
-| void sincos(double x, double *sptr, double *cptr) Returns the sine and cosine of 𝑥 . | ✓ | ✓ |
-| void sincospi(double x, double *sptr, double *cptr) Returns the sine and cosine of 𝜋 · 𝑥 . | ✓ | ✓ |
-| double sqrt(double x) Returns the square root of 𝑥 . | ✓ | ✓ |
-
-| double rsqrt(double x) Returns the reciprocal of the square root of 𝑥 . | ✓ |
-|-----------------------------------------------------------------------------------------------------------|-----|
-| double tan(double x) Returns the tangent of 𝑥 . | ✓ |
-| double tanh(double x) Returns the hyperbolic tangent of 𝑥 . | ✓ |
-| double tgamma(double x) Returns the gamma function of 𝑥 . | ✓ |
-| double trunc(double x) Truncate 𝑥 to the integral part. | ✓ |
-| double y0(double x) Returns the value of the Bessel function of the second kind of order 0 for 𝑥 . | ✓ |
-| double y1(double x) Returns the value of the Bessel function of the second kind of order 1 for 𝑥 . | ✓ |
-| double yn(int n, double x) Returns the value of the Bessel function of the second kind of order n for 𝑥 . | ✓ |
-
-## 21.3 Integer intrinsics
-
-Following is the list of supported integer intrinsics. Note that intrinsics are supported on device only.
-
-Table 3: Integer intrinsics mathematical functions
-
-| Function |
-|----------|
-| unsigned int \_\_brev(unsigned int x) Reverse the bit order of a 32 bit unsigned integer. |
-| unsigned long long int \_\_brevll(unsigned long long int x) Reverse the bit order of a 64 bit unsigned integer. |
-| unsigned int \_\_byte\_perm(unsigned int x, unsigned int y, unsigned int z) Return selected bytes from two 32-bit unsigned integers. |
-| unsigned int \_\_clz(int x) Return the number of consecutive high-order zero bits in 32 bit integer. |
-| unsigned int \_\_clzll(long long int x) Return the number of consecutive high-order zero bits in 64 bit integer. |
-| unsigned int \_\_ffs(int x) Find the position of least significant bit set to 1 in a 32 bit integer. |
-| unsigned int \_\_ffsll(long long int x) Find the position of least significant bit set to 1 in a 64 bit signed integer. |
-| unsigned int \_\_fns32(unsigned long long mask, unsigned int base, int offset) Find the position of the n-th bit set to 1 in a 32-bit integer. |
-| unsigned int \_\_fns64(unsigned long long int mask, unsigned int base, int offset) Find the position of the n-th bit set to 1 in a 64-bit integer. |
-| unsigned int \_\_funnelshift\_l(unsigned int lo, unsigned int hi, unsigned int shift) Concatenate ℎ𝑖 and 𝑙𝑜 , shift left by shift & 31 bits, return the most significant 32 bits. |
-| unsigned int \_\_funnelshift\_lc(unsigned int lo, unsigned int hi, unsigned int shift) Concatenate ℎ𝑖 and 𝑙𝑜 , shift left by min(shift, 32) bits, return the most significant 32 bits. |
-| unsigned int \_\_funnelshift\_r(unsigned int lo, unsigned int hi, unsigned int shift) Concatenate ℎ𝑖 and 𝑙𝑜 , shift right by shift & 31 bits, return the least significant 32 bits. |
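The funnel-shift descriptions above can be checked on the host with a small model. The sketch below is plain C++, not the HIP device intrinsics; the names `funnelshift_l_model` and `funnelshift_r_model` are hypothetical and only mirror the documented "concatenate, shift by `shift & 31`, keep 32 bits" behaviour.

```cpp
#include <cassert>
#include <cstdint>

// Host-side model (assumption: mirrors the documented semantics only).
// Concatenate hi:lo into 64 bits, shift left by (shift & 31),
// and return the most significant 32 bits.
inline std::uint32_t funnelshift_l_model(std::uint32_t lo, std::uint32_t hi,
                                         std::uint32_t shift) {
    const std::uint64_t wide = (static_cast<std::uint64_t>(hi) << 32) | lo;
    return static_cast<std::uint32_t>((wide << (shift & 31u)) >> 32);
}

// Concatenate hi:lo into 64 bits, shift right by (shift & 31),
// and return the least significant 32 bits.
inline std::uint32_t funnelshift_r_model(std::uint32_t lo, std::uint32_t hi,
                                         std::uint32_t shift) {
    const std::uint64_t wide = (static_cast<std::uint64_t>(hi) << 32) | lo;
    return static_cast<std::uint32_t>(wide >> (shift & 31u));
}

// Small self-check: passing the same word as lo and hi gives a rotate.
inline void check_funnelshift_models() {
    assert(funnelshift_l_model(0x80000001u, 0x80000001u, 1u) == 0x00000003u);
    assert(funnelshift_r_model(0x80000000u, 0x00000001u, 4u) == 0x18000000u);
}
```

Because the shift amount is masked with 31, a shift of 32 behaves like a shift of 0 in this model; the `_lc` variant listed above clamps instead of wrapping and is not modelled here.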
-
-The HIP-Clang implementation of \_\_ffs() and \_\_ffsll() adds a constant +1 to produce the ffs result format. For cases where this overhead is not acceptable and the programmer is willing to specialize for the platform, HIP-Clang provides \_\_lastbit\_u32\_u32(unsigned int input) and \_\_lastbit\_u32\_u64(unsigned long long int input) . The index returned by the \_\_lastbit\_ instructions starts at -1, while for ffs the index starts at 0.
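The relationship between the two index conventions can be illustrated with a host-side model. This is a sketch in plain C++ with hypothetical names (`lastbit_model`, `ffs_model`), not the HIP device functions; it assumes only what the paragraph states, namely that the ffs result is the lastbit result plus one.

```cpp
#include <cassert>
#include <cstdint>

// Host-side model of the "lastbit" convention: index of the lowest set bit,
// counted from 0, and -1 when no bit is set.
inline int lastbit_model(std::uint32_t x) {
    if (x == 0u) return -1;
    int index = 0;
    while ((x & 1u) == 0u) {  // skip trailing zero bits
        x >>= 1;
        ++index;
    }
    return index;
}

// Host-side model of the "ffs" convention: the same index plus a constant 1,
// so an input of 0 yields 0 and bit 0 yields 1.
inline int ffs_model(std::uint32_t x) { return lastbit_model(x) + 1; }

inline void check_ffs_models() {
    assert(lastbit_model(0u) == -1 && ffs_model(0u) == 0);
    assert(lastbit_model(8u) == 3 && ffs_model(8u) == 4);
}
```

In this model the only difference between the two conventions is the added constant, which is the overhead the text says the platform-specific variants avoid.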
-
-## 21.4 Floating-point Intrinsics
-
-Following is the list of supported floating-point intrinsics. Note that intrinsics are supported on device only.
-
-Note: Only the round-to-nearest-even rounding mode is supported on AMD GPUs by default. The \_rz , \_ru and \_rd suffixed intrinsic functions exist in the HIP AMD backend if the OCML\_BASIC\_ROUNDED\_OPERATIONS macro is defined.
-
-Table 4: Single precision intrinsics mathematical functions
-
-| Function |
-|----------|
-| float \_\_cosf(float x) Returns the fast approximate cosine of 𝑥 . |
-| float \_\_exp10f(float x) Returns the fast approximate for 10^𝑥 . |
-| float \_\_expf(float x) Returns the fast approximate for 𝑒^𝑥 . |
-| float \_\_fadd\_rn(float x, float y) Add two floating-point values in round-to-nearest-even mode. |
-| float \_\_fdiv\_rn(float x, float y) Divide two floating-point values in round-to-nearest-even mode. |
-| float \_\_fmaf\_rn(float x, float y, float z) Returns 𝑥 · 𝑦 + 𝑧 as a single operation in round-to-nearest-even mode. |
-| float \_\_fmul\_rn(float x, float y) Multiply two floating-point values in round-to-nearest-even mode. |
-| float \_\_frcp\_rn(float x) Returns 1 / 𝑥 in round-to-nearest-even mode. |
-| float \_\_frsqrt\_rn(float x) Returns 1 / √𝑥 in round-to-nearest-even mode. |
-| float \_\_fsqrt\_rn(float x) Returns √𝑥 in round-to-nearest-even mode. |
-| float \_\_fsub\_rn(float x, float y) Subtract two floating-point values in round-to-nearest-even mode. |
-| float \_\_log10f(float x) Returns the fast approximate for base 10 logarithm of 𝑥 . |
-
-Table 5: Double precision intrinsics mathematical functions
-
-| Function |
-|----------|
-| double \_\_dadd\_rn(double x, double y) Add two floating-point values in round-to-nearest-even mode. |
-| double \_\_ddiv\_rn(double x, double y) Divide two floating-point values in round-to-nearest-even mode. |
-| double \_\_dmul\_rn(double x, double y) Multiply two floating-point values in round-to-nearest-even mode. |
-| double \_\_drcp\_rn(double x) Returns 1 / 𝑥 in round-to-nearest-even mode. |
-| double \_\_dsqrt\_rn(double x) Returns √𝑥 in round-to-nearest-even mode. |
-| double \_\_dsub\_rn(double x, double y) Subtract two floating-point values in round-to-nearest-even mode. |
-| double \_\_fma\_rn(double x, double y, double z) Returns 𝑥 · 𝑦 + 𝑧 as a single operation in round-to-nearest-even mode. |
-
-## CHAPTER TWENTYTWO: TABLE COMPARING SYNTAX FOR DIFFERENT COMPUTE APIS
-
-| Term | CUDA | HIP | OpenCL |
-|------------------------|---------------------|--------------------------------------------|------------------------|
-| Device | int deviceId | int deviceId | cl_device |
-| Queue | cudaStream_t | hipStream_t | cl_command_queue |
-| Event | cudaEvent_t | hipEvent_t | cl_event |
-| Memory | void * | void * | cl_mem |
-| | grid | grid | NDRange |
-| | block | block | work-group |
-| | thread | thread | work-item |
-| | warp | warp | sub-group |
-| Thread-index | threadIdx.x | threadIdx.x | get_local_id(0) |
-| Block-index | blockIdx.x | blockIdx.x | get_group_id(0) |
-| Block-dim | blockDim.x | blockDim.x | get_local_size(0) |
-| Grid-dim | gridDim.x | gridDim.x | get_num_groups(0) |
-| Device Kernel | __global__ | __global__ | __kernel |
-| Device Function | __device__ | __device__ | Implied in device com |
-| Host Function | __host__ (default) | __host__ (default) | Implied in host compil |
-| Host + Device Function | __host__ __device__ | __host__ __device__ | No equivalent |
-| Kernel Launch | <<< >>> | hipLaunchKernel / hipLaunchKernelGGL / <<< | clEnqueueNDRangeK |
-| Global Memory | __global__ | __global__ | __global |
-| Group Memory | __shared__ | __shared__ | __local |
-| Constant | __constant__ | __constant__ | __constant |
-| | __syncthreads | __syncthreads | barrier(CLK_LOCAL |
-| Atomic Builtins | atomicAdd | atomicAdd | atomic_add |
-| Precise Math | cos(f) | cos(f) | cos(f) |
-| Fast Math | __cos(f) | __cos(f) | native_cos(f) |
-| Vector | float4 | float4 | float4 |
-
-## 22.1 Notes
-
-The indexing functions (starting with thread-index ) show the terminology for a 1D grid. Some APIs use reverse order of xyz / 012 indexing for 3D grids.
-
-## CHAPTER TWENTYTHREE: HIP COOPERATIVE GROUPS API
-
-## 23.1 Cooperative kernel launches
-
-The following host-side functions are used for cooperative kernel launches.
-
-- hipLaunchCooperativeKernel
-- hipLaunchCooperativeKernelMultiDevice
-- hipModuleLaunchCooperativeKernel
-- hipModuleLaunchCooperativeKernelMultiDevice
-
-## 23.2 Cooperative groups classes
-
-The following cooperative groups classes can be used on the device side.
-
-## class thread\_group
-
-The base type of all cooperative group types.
-
-Holds the key properties of a constructed cooperative group type object, such as the group type and its size.
-
-Note: Cooperative groups feature is implemented on Linux, under development on Microsoft Windows.
-
-Subclassed by cooperative\_groups::coalesced\_group , cooperative\_groups::grid\_group , cooperative\_groups::multi\_grid\_group , cooperative\_groups::thread\_block , cooperative\_groups::tiled\_group
-
-class thread\_block : public cooperative\_groups:: thread\_group
-
-The workgroup (thread-block in CUDA terminology) cooperative group type.
-
-Represents an intra-workgroup cooperative group type, where the participating threads within the group are the same threads that participated in the currently executing workgroup .
-
-Note: This function is implemented on Linux and is under development on Microsoft Windows.
-
-class grid\_group : public cooperative\_groups:: thread\_group
-
-The grid cooperative group type.
-
-Represents an inter-workgroup cooperative group type, where the participating threads within the group span across multiple workgroups running the (same) kernel on the same device.
-
-Note: This is implemented on Linux and is under development on Microsoft Windows.
-
-class multi\_grid\_group : public cooperative\_groups:: thread\_group
-
-The multi-grid cooperative group type.
-
-Represents an inter-device cooperative group type, where the participating threads within the group span across multiple devices, running the (same) kernel on these devices.
-
-Note: The multi-grid cooperative group type is implemented on Linux, under development on Microsoft Windows.
-
-template<unsigned int size , class ParentCGTy >
-
-class thread\_block\_tile : public cooperative\_groups::impl::thread\_block\_tile\_internal< size , ParentCGTy >
-
-Group type - thread\_block\_tile .
-
-Represents one tiled thread group in a wavefront. This group type also supports sub-wave level intrinsics.
-
-Note: This type is implemented on Linux, under development on Microsoft Windows.
-
-## Public Functions
-
-unsigned int thread\_rank () const
-
-Rank of the calling thread within [0, size() ).
-
-void sync ()
-
-Synchronizes the threads in the group.
-
-Causes all threads in the group to wait at this synchronization point, and for all shared and global memory accesses by the threads to complete, before running synchronization. This guarantees the visibility of accessed data for all threads in the group.
-
-Note: There are potential read-after-write (RAW), write-after-read (WAR), or write-after-write (WAW) hazards, when threads in the group access the same addresses in shared or global memory. The data hazards can be avoided with synchronization of the group.
-
-unsigned int meta\_group\_rank () const
-
-Returns the linear rank of the group within the set of tiles partitioned from a parent group (bounded by meta\_group\_size).
-
-unsigned int meta\_group\_size () const
-
-Returns the number of groups created when the parent group was partitioned.
-
-template<class T >
-
-T shfl ( T var, int srcRank ) const
-
-Shuffle operation on group level.
-
-Exchanges variables between threads without using shared memory. The shuffle operation is a direct copy of var from the thread whose ID within the group is srcRank .
-
-## Template Parameters
-
-T - The type can be a 32-bit integer or single-precision floating point.
-
-## Parameters
-
-- var - [in] The source variable to copy. Only the srcRank thread ID of group is copied to other threads.
-- srcRank - [in] The source thread ID of the group for copy.
-
-template<class T >
-
-T shfl\_down ( T var, unsigned int lane\_delta ) const
-
-Shuffle down operation on group level.
-
-Exchanges variables between threads without using shared memory. The shuffle down operation copies var from the thread whose ID within the group is higher than the caller's thread ID by lane\_delta .
-
-## Template Parameters
-
-T - The type can be a 32-bit integer or single-precision floating point.
-
-## Parameters
-
-- var - [in] The source variable to copy.
-- lane\_delta - [in] The lane\_delta is the relative thread ID difference between caller thread ID and source of copy thread ID. sourceID = (threadID + lane\_delta) % size()
-
-template<class T >
-
-T shfl\_up ( T var, unsigned int lane\_delta ) const
-
-Shuffle up operation on group level.
-
-Exchanges variables between threads without using shared memory. The shuffle up operation copies var from the thread whose ID within the group is lower than the caller's thread ID by lane\_delta .
-
-## Template Parameters
-
-T - The type can be a 32-bit integer or single-precision floating point.
-
-## Parameters
-
-- var - [in] The source variable to copy.
-- lane\_delta - [in] The lane\_delta is the relative thread ID difference between caller thread ID and source of copy thread ID. sourceID = (threadID - lane\_delta) % size()
-
-template<class T >
-
-T shfl\_xor ( T var, unsigned int laneMask ) const
-
-Shuffle xor operation on group level.
-
-Exchanges variables between threads without using shared memory. The shuffle xor operation copies var from the thread whose ID within the group is the caller's thread ID XOR laneMask .
-
-## Template Parameters
-
-T - The type can be a 32-bit integer or single-precision floating point.
-
-## Parameters
-
-- var - [in] The source variable to copy.
-- laneMask - [in] The laneMask is the mask for XOR operation. sourceID = threadID ^ laneMask
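The `sourceID = threadID ^ laneMask` rule is what makes butterfly reductions possible. The following is a host-side C++ model, not device code: the tile is represented as a `std::vector<int>` with one entry per lane, and the names `shfl_xor_model` and `tile_sum_model` are hypothetical. It assumes the tile size is a power of two and that `laneMask` is smaller than the tile size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Model of one shfl_xor step applied to a whole tile:
// lane i receives the value held by lane (i ^ laneMask).
inline std::vector<int> shfl_xor_model(const std::vector<int>& lanes,
                                       unsigned int laneMask) {
    std::vector<int> out(lanes.size());
    for (std::size_t i = 0; i < lanes.size(); ++i) {
        out[i] = lanes[i ^ laneMask];
    }
    return out;
}

// Butterfly all-reduce: after log2(size) exchange steps every lane
// holds the sum of all lanes. Assumes lanes.size() is a power of two.
inline std::vector<int> tile_sum_model(std::vector<int> lanes) {
    for (std::size_t mask = lanes.size() / 2; mask > 0; mask >>= 1) {
        const std::vector<int> peer =
            shfl_xor_model(lanes, static_cast<unsigned int>(mask));
        for (std::size_t i = 0; i < lanes.size(); ++i) {
            lanes[i] += peer[i];
        }
    }
    return lanes;
}

inline void check_shfl_xor_models() {
    const std::vector<int> summed = tile_sum_model({1, 2, 3, 4, 5, 6, 7, 8});
    for (int v : summed) assert(v == 36);
}
```

On the device the same pattern would call `shfl_xor` on a tile object once per step, with each thread holding a single value instead of the whole vector.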
-
-unsigned long long ballot ( int pred ) const
-
-Ballot function on group level.
-
-Returns a bit mask with the Nth bit set to one if the Nth thread predicate evaluates true.
-
-## Parameters
-
-pred - [in] The predicate to evaluate on group threads.
-
-int any ( int pred ) const
-
-Any function on group level.
-
-Returns non-zero if a predicate evaluates true for any threads.
-
-## Parameters
-
-pred - [in] The predicate to evaluate on group threads.
-
-int all ( int pred ) const
-
-All function on group level.
-
-Returns non-zero if a predicate evaluates true for all threads.
-
-## Parameters
-
-pred - [in] The predicate to evaluate on group threads.
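The three vote functions above are closely related: `any` and `all` can be read as questions about the mask that `ballot` returns. The sketch below is a host-side C++ model with hypothetical names (`ballot_model`, `any_model`, `all_model`); it takes the predicates of all threads in a group as a vector and assumes groups of at most 64 threads, matching the 64-bit return type of `ballot`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Model of ballot: bit N of the result is set when thread N's predicate
// is non-zero. Only the first 64 predicates are considered.
inline std::uint64_t ballot_model(const std::vector<int>& preds) {
    std::uint64_t mask = 0;
    for (std::size_t i = 0; i < preds.size() && i < 64; ++i) {
        if (preds[i] != 0) mask |= (std::uint64_t{1} << i);
    }
    return mask;
}

// Model of any: true when at least one bit of the ballot mask is set.
inline bool any_model(const std::vector<int>& preds) {
    return ballot_model(preds) != 0;
}

// Model of all: true when every thread of the group has its bit set.
inline bool all_model(const std::vector<int>& preds) {
    const std::uint64_t full = preds.size() >= 64
        ? ~std::uint64_t{0}
        : ((std::uint64_t{1} << preds.size()) - 1);
    return ballot_model(preds) == full;
}

inline void check_vote_models() {
    assert(ballot_model({1, 0, 1, 1}) == 13u);  // bits 0, 2 and 3 set
    assert(any_model({1, 0, 1, 1}) && !all_model({1, 0, 1, 1}));
}
```

The device functions return `int` rather than `bool`; the model uses `bool` only to keep the host-side checks short.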
-
-template<typename T >
-
-unsigned long long match\_any ( T value ) const
-
-Match any function on group level.
-
-Returns a bit mask containing a 1-bit for every participating thread if that thread has the same value in value as the caller thread.
-
-## Parameters
-
-value - [in] The value to examine on the current thread in group.
-
-template<typename T >
-
-unsigned long long match\_all ( T value, int &pred ) const
-
-Match all function on group level.
-
-Returns a bit mask containing a 1-bit for every participating thread if they all have the same value in value as the caller thread. The predicate pred is set to true if all participating threads have the same value in value .
-
-## Parameters
-
-- value - [in] The value to examine on the current thread in group.
-- pred - [out] The predicate is set to true if all participating threads in the thread group have the same value.
-
-class coalesced\_group : public cooperative\_groups:: thread\_group
-
-The coalesced\_group cooperative group type.
-
-Represents an active thread group in a wavefront. This group type also supports sub-wave level intrinsics.
-
-Note: This is implemented on Linux and is under development on Microsoft Windows.
-
-## 23.3 Cooperative groups construct functions
-
-The following functions are used to construct different group-type instances on the device side.
-
-- cooperative\_groups::this\_multi\_grid
-- cooperative\_groups::this\_grid
-- cooperative\_groups::this\_thread\_block
-- cooperative\_groups::coalesced\_threads
-- cooperative\_groups::tiled\_partition
-- cooperative\_groups::binary\_partition
-
-## 23.4 Cooperative groups exposed API functions
-
-The following functions are the exposed API for different group-type instances on the device side.
-
-- cooperative\_groups::group\_size
-- cooperative\_groups::thread\_rank
-- cooperative\_groups::is\_valid
-- cooperative\_groups::sync
-
-## CHAPTER TWENTYFOUR: HSA RUNTIME API FOR ROCM
-
-The following functions are located in the https://github.com/ROCm/ROCR-Runtime repository.
-
-hsa\_status\_t hsa\_amd\_vmem\_address\_reserve ( void **va, size\_t size, uint64\_t address, uint64\_t flags )
-
-Allocate a reserved address range.
-
-Reserve a virtual address range. The size must be a multiple of the system page size. If it is not possible to allocate the address specified by address , then va will be a different address range. Address range should be released by calling hsa\_amd\_vmem\_address\_free.
-
-Note that this API will be deprecated in a future release and replaced by hsa\_amd\_vmem\_address\_reserve\_align.
-
-## Parameters
-
-- va - [out] Virtual address allocated.
-- size - [in] Size of the address range requested.
-- address - [in] Address requested.
-- flags - [in] Currently unsupported.
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS - Address range allocated successfully
-- ::HSA\_STATUS\_ERROR\_NOT\_INITIALIZED - The HSA runtime has not been initialized.
-- ::HSA\_STATUS\_ERROR\_OUT\_OF\_RESOURCES - Insufficient resources to allocate an address range of this size.
-
-hsa\_status\_t hsa\_amd\_vmem\_address\_free ( void *va, size\_t size )
-
-Free a reserved address range.
-
-Free a previously allocated address range. The size must match the size of a previously allocated address range.
-
-## Parameters
-
-- va - [in] Virtual address to be freed.
-- size - [in] Size of the address range.
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS - Address range released successfully
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - Invalid va specified
-- ::HSA\_STATUS\_ERROR\_INVALID\_ARGUMENT - Invalid size specified
-- ::HSA\_STATUS\_ERROR\_RESOURCE\_FREE - Address range is still in use
-- ::HSA\_STATUS\_ERROR - Internal unexpected error
-
-hsa\_status\_t hsa\_amd\_vmem\_handle\_create ( hsa\_amd\_memory\_pool\_t pool, size\_t size, hsa\_amd\_memory\_type\_t type, uint64\_t flags, hsa\_amd\_vmem\_alloc\_handle\_t *memory\_handle )
-
-Create a virtual memory handle.
-
-Create a virtual memory handle within this pool. size must be aligned to the allocation granule size for this memory pool (see HSA\_AMD\_MEMORY\_POOL\_INFO\_RUNTIME\_ALLOC\_GRANULE). To minimize internal memory fragmentation, align the size to the recommended allocation granule size (see HSA\_AMD\_MEMORY\_POOL\_INFO\_RUNTIME\_ALLOC\_REC\_GRANULE).
-
-## Parameters
-
-- pool - [in] Memory pool to use.
-- size - [in] Size of the memory allocation.
-- type - [in] Type of memory.
-- flags - [in] Currently unsupported.
-- memory\_handle - [out] Handle for the allocation.
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS - memory allocated successfully
-- ::HSA\_STATUS\_ERROR\_NOT\_INITIALIZED - The HSA runtime has not been initialized.
-- ::HSA\_STATUS\_ERROR\_INVALID\_ARGUMENT - Invalid arguments
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - This memory pool does not support allocations
-- ::HSA\_STATUS\_ERROR\_OUT\_OF\_RESOURCES - Insufficient resources to allocate this memory
-
-hsa\_status\_t hsa\_amd\_vmem\_handle\_release ( hsa\_amd\_vmem\_alloc\_handle\_t memory\_handle )
-
-Release a virtual memory handle.
-
-## Parameters
-
-- memory\_handle - [in] Memory handle that was previously allocated.
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS - Memory handle released successfully
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - Invalid memory handle
-
-hsa\_status\_t hsa\_amd\_vmem\_map ( void *va, size\_t size, size\_t in\_offset, hsa\_amd\_vmem\_alloc\_handle\_t memory\_handle, uint64\_t flags )
-
-Map a virtual memory handle.
-
-Map a virtual memory handle to a reserved address range. The virtual address requested must be within a previously reserved address range: va and (va + size) must both be within (va + size) of the previously allocated address range. size must be equal to the size of the memory\_handle. hsa\_amd\_vmem\_set\_access needs to be called to make the memory accessible to specific agents.
-
-## Parameters
-
-- va - [in] Virtual address range where memory will be mapped.
-- size - [in] Size of the memory mapping.
-- in\_offset - [in] Offset into memory. Currently unsupported.
-- memory\_handle - [in] Virtual memory handle to be mapped.
-- flags - [in] Currently unsupported.
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS - Memory mapped successfully
-- ::HSA\_STATUS\_ERROR\_INVALID\_ARGUMENT - va, size or memory\_handle are invalid
-- ::HSA\_STATUS\_ERROR\_OUT\_OF\_RESOURCES - Insufficient resources
-- ::HSA\_STATUS\_ERROR - Unexpected internal error
-
-hsa\_status\_t hsa\_amd\_vmem\_unmap ( void *va, size\_t size )
-
-Unmap a virtual memory handle.
-
-Unmap a previously mapped virtual address range.
-
-## Parameters
-
-- va - [in] Virtual address range to be unmapped.
-- size - [in] Size of the memory mapping.
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS - Memory backing unmapped successfully
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - memory\_handle is invalid
-- ::HSA\_STATUS\_ERROR\_INVALID\_ARGUMENT - size is invalid
-- ::HSA\_STATUS\_ERROR - Unexpected internal error
-
-hsa\_status\_t hsa\_amd\_vmem\_set\_access ( void *va, size\_t size, const hsa\_amd\_memory\_access\_desc\_t *desc, size\_t desc\_cnt )
-
-Make a memory mapping accessible.
-
-Make a previously mapped virtual address accessible to specific agents. size must be equal to the size of the previously mapped virtual memory handle. Calling hsa\_amd\_vmem\_set\_access multiple times on the same va will overwrite previous permissions for all agents.
-
-## Parameters
-
-- va - [in] Previously mapped virtual address.
-- size - [in] Size of the memory mapping.
-- desc - [in] List of access permissions for each agent.
-- desc\_cnt - [in] Number of elements in desc.
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS -
-- ::HSA\_STATUS\_ERROR\_INVALID\_ARGUMENT - va, size or memory\_handle are invalid
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - memory\_handle is invalid
-- ::HSA\_STATUS\_ERROR\_OUT\_OF\_RESOURCES - Insufficient resources
-- ::HSA\_STATUS\_ERROR\_INVALID\_AGENT - Invalid agent in desc
-- ::HSA\_STATUS\_ERROR - Unexpected internal error
-
-hsa\_status\_t hsa\_amd\_vmem\_get\_access ( void *va, hsa\_access\_permission\_t *perms, hsa\_agent\_t agent\_handle )
-
-Get current access permissions for memory mapping.
-
-Get access permissions for memory mapping for specific agent.
-
-## Parameters
-
-- va - [in] Previously mapped virtual address.
-- perms - [out] Current permissions.
-- agent\_handle - [in] Agent.
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS -
-- ::HSA\_STATUS\_ERROR\_INVALID\_AGENT - Invalid agent
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - va is not mapped or permissions never set for this agent
-- ::HSA\_STATUS\_ERROR - Unexpected internal error
-
-hsa\_status\_t hsa\_amd\_vmem\_export\_shareable\_handle ( int *dmabuf\_fd, hsa\_amd\_vmem\_alloc\_handle\_t handle, uint64\_t flags )
-
-Get an exportable shareable handle.
-
-Get an exportable shareable handle for a memory\_handle. This shareable handle can then be used to re-create a virtual memory handle using hsa\_amd\_vmem\_import\_shareable\_handle. The shareable handle can be transferred using mechanisms that support POSIX file descriptors. Once all shareable handles are closed, the memory\_handle is released.
-
-## Parameters
-
-- dmabuf\_fd - [out] Shareable handle.
-- handle - [in] Previously allocated virtual memory handle.
-- flags - [in] Currently unsupported.
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS -
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - Invalid memory handle
-- ::HSA\_STATUS\_ERROR\_OUT\_OF\_RESOURCES - Out of resources
-- ::HSA\_STATUS\_ERROR - Unexpected internal error
-
-hsa\_status\_t hsa\_amd\_vmem\_import\_shareable\_handle ( int dmabuf\_fd, hsa\_amd\_vmem\_alloc\_handle\_t *handle )
-
-Import a shareable handle.
-
-Import a shareable handle for a memory handle. Importing a shareable handle that has been closed and released results in undefined behavior.
-
-## Parameters
-
-- dmabuf\_fd - [in] Shareable handle exported with hsa\_amd\_vmem\_export\_shareable\_handle.
-- handle - [out] Virtual memory handle.
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS -
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - Invalid memory handle
-- ::HSA\_STATUS\_ERROR\_OUT\_OF\_RESOURCES - Out of resources
-- ::HSA\_STATUS\_ERROR - Unexpected internal error
-
-hsa\_status\_t hsa\_amd\_vmem\_retain\_alloc\_handle ( hsa\_amd\_vmem\_alloc\_handle\_t *memory\_handle, void *addr )
-
-Returns memory handle for mapped memory.
-
-Return a memory handle for previously mapped memory. The handle will have the same value as the handle used to map the memory. The returned handle must be released with a corresponding number of calls to hsa\_amd\_vmem\_handle\_release.
-
-## Parameters
-
-- memory\_handle - [out] Memory handle for this mapped address.
-- addr - [in] Mapped address.
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS -
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - Invalid address
-
-hsa\_status\_t hsa\_amd\_vmem\_get\_alloc\_properties\_from\_handle ( hsa\_amd\_vmem\_alloc\_handle\_t memory\_handle, hsa\_amd\_memory\_pool\_t *pool, hsa\_amd\_memory\_type\_t *type )
-
-Returns the current allocation properties of a handle.
-
-Returns the allocation properties of an existing handle.
-
-## Parameters
-
-- memory\_handle - [in] Memory handle to be queried.
-- pool - [out] Memory pool that owns this handle.
-- type - [out] Memory type.
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS -
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - Invalid memory\_handle
-
-## CHAPTER
-
-## TWENTYFIVE
-
-## HIP MANAGED MEMORY ALLOCATION API
-
-hipError\_t hipMallocManaged ( void **dev\_ptr, size\_t size, unsigned int flags )
-
-Allocates memory that will be automatically managed by HIP.
-
-This API is used for managed memory. It allows data to be shared and accessed by both the CPU and the GPU through a single pointer.
-
-The API returns the allocation pointer, which is managed by HMM and can be used to execute kernels on the device and to fetch data between the host and the device as needed.
-
-Note: It is recommended to perform the capability check before calling this API.
-
-## Parameters
-
-- dev\_ptr -[out] - pointer to allocated device memory
-- size -[in] - requested allocation size in bytes; it should be a multiple of the 4 KB granularity
-- flags -[in] - must be either hipMemAttachGlobal or hipMemAttachHost (defaults to hipMemAttachGlobal)
-
-## Returns
-
-hipSuccess, hipErrorMemoryAllocation, hipErrorNotSupported, hipErrorInvalidValue
-
-hipError\_t hipMemPrefetchAsync ( const void *dev\_ptr, size\_t count, int device, hipStream\_t stream )
-
-Prefetches memory to the specified destination device using HIP.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-## Parameters
-
-- dev\_ptr -[in] pointer to be prefetched
-- count -[in] size in bytes for prefetching
-- device -[in] destination device to prefetch to
-- stream -[in] stream to enqueue prefetch operation
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue
-
-hipError\_t hipMemAdvise ( const void *dev\_ptr, size\_t count, hipMemoryAdvise advice, int device )
-
-Advise about the usage of a given memory range to HIP.
-
-This HIP API advises about the usage to be applied to a unified memory allocation in the range starting at the pointer address dev\_ptr, with a size of count bytes. The memory range must refer to managed memory allocated via hipMallocManaged. The driver rounds the start of the range down and the end of the range up so that the range is aligned to the CPU page size, the same way the corresponding CUDA API behaves in CUDA 8.0 and later.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-## Parameters
-
-- dev\_ptr -[in] pointer to memory to set the advice for
-- count -[in] size in bytes of the memory range; it should be aligned to the CPU page size.
-- advice -[in] advice to be applied for the specified memory range
-- device -[in] device to apply the advice for
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue
-
-hipError\_t hipMemRangeGetAttribute ( void *data, size\_t data\_size, hipMemRangeAttribute attribute, const void *dev\_ptr, size\_t count )
-
-Query an attribute of a given memory range in HIP.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-## Parameters
-
-- data -[inout] a pointer to a memory location where the result of each attribute query will be written to
-- data\_size -[in] the size of data
-- attribute -[in] the attribute to query
-- dev\_ptr -[in] start of the range to query
-- count -[in] size of the range to query
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue
-
-hipError\_t hipMemRangeGetAttributes ( void **data, size\_t *data\_sizes, hipMemRangeAttribute *attributes, size\_t num\_attributes, const void *dev\_ptr, size\_t count )
-
-Query attributes of a given memory range in HIP.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-## Parameters
-
-- data -[inout] a two-dimensional array containing pointers to memory locations where the result of each attribute query will be written to
-- data\_sizes -[in] an array, containing the sizes of each result
-- attributes -[in] an array of attributes to query (num\_attributes and the number of attributes in this array should match)
-- num\_attributes -[in] the number of attributes to query
-- dev\_ptr -[in] start of the range to query
-- count -[in] size of the range to query
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue
-
-hipError\_t hipStreamAttachMemAsync ( hipStream\_t stream, void *dev\_ptr, size\_t length, unsigned int flags )
-
-Attach memory to a stream asynchronously in HIP.
-
-Warning: This API is under development. Currently it is a no-operation (NOP) function on AMD GPUs and returns hipSuccess.
-
-## Parameters
-
-- stream -[in] - stream in which to enqueue the attach operation
-- dev\_ptr -[in] - pointer to memory (must be a pointer to managed memory or to a valid host-accessible region of system-allocated memory)
-- length -[in] - length of memory (defaults to zero)
-- flags -[in] - must be one of hipMemAttachGlobal, hipMemAttachHost or hipMemAttachSingle (defaults to hipMemAttachSingle)
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue
-
-template<class T> static inline hipError\_t hipMallocManaged ( T **devPtr, size\_t size, unsigned int flags = hipMemAttachGlobal )
-
-C++ wrapper for hipMallocManaged.
-
-Provides an overload that automatically typecasts the pointer type from void** and supplies a default for the flags.
-
-HIP\_DISABLE\_CPP\_FUNCTIONS macro can be defined to suppress these wrappers. It is useful for applications which need to obtain decltypes of HIP runtime APIs.
-
-## See also:
-
-hipMallocManaged
-
-## CHAPTER
-
-## TWENTYSIX
-
-## HIP VIRTUAL MEMORY MANAGEMENT API
-
-hipError\_t hipMemAddressFree ( void *devPtr, size\_t size )
-
-Frees an address range reservation made via hipMemAddressReserve.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- devPtr -[in] - starting address of the range.
-- size -[in] - size of the range.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemAddressReserve ( void **ptr, size\_t size, size\_t alignment, void *addr, unsigned long long flags )
-
-Reserves an address range.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- ptr -[out] - starting address of the reserved range.
-- size -[in] - size of the reservation.
-- alignment -[in] - alignment of the address.
-- addr -[in] - requested starting address of the range.
-- flags -[in] - currently unused, must be zero.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemCreate ( hipMemGenericAllocationHandle\_t *handle, size\_t size, const hipMemAllocationProp *prop, unsigned long long flags )
-
-Creates a memory allocation described by the properties and size.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- handle -[out] - value of the returned handle.
-- size -[in] - size of the allocation.
-- prop -[in] - properties of the allocation.
-- flags -[in] - currently unused, must be zero.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemExportToShareableHandle ( void *shareableHandle, hipMemGenericAllocationHandle\_t handle, hipMemAllocationHandleType handleType, unsigned long long flags )
-
-Exports an allocation to a requested shareable handle type.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- shareableHandle -[out] - value of the returned handle.
-- handle -[in] - handle to share.
-- handleType -[in] - type of the shareable handle.
-- flags -[in] - currently unused, must be zero.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemGetAccess ( unsigned long long *flags, const hipMemLocation *location, void *ptr )
-
-Get the access flags set for the given location and ptr.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- flags -[out] - flags for this location.
-- location -[in] - target location.
-- ptr -[in] - address to check the access flags.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemGetAllocationGranularity ( size\_t *granularity, const hipMemAllocationProp *prop, hipMemAllocationGranularity\_flags option )
-
-Calculates either the minimal or recommended granularity.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- granularity -[out] - returned granularity.
-- prop -[in] - location properties.
-- option -[in] - determines which granularity to return.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemGetAllocationPropertiesFromHandle ( hipMemAllocationProp *prop, hipMemGenericAllocationHandle\_t handle )
-
-Retrieve the property structure of the given handle.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- prop -[out] - properties of the given handle.
-- handle -[in] - handle to perform the query on.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemImportFromShareableHandle ( hipMemGenericAllocationHandle\_t *handle, void *osHandle, hipMemAllocationHandleType shHandleType )
-
-Imports an allocation from a requested shareable handle type.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- handle -[out] - returned value.
-- osHandle -[in] - shareable handle representing the memory allocation.
-- shHandleType -[in] - handle type.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemMap ( void *ptr, size\_t size, size\_t offset, hipMemGenericAllocationHandle\_t handle, unsigned long long flags )
-
-Maps an allocation handle to a reserved virtual address range.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- ptr -[in] - address where the memory will be mapped.
-- size -[in] - size of the mapping.
-- offset -[in] - offset into the memory, currently must be zero.
-- handle -[in] - memory allocation to be mapped.
-- flags -[in] - currently unused, must be zero.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemMapArrayAsync ( hipArrayMapInfo *mapInfoList, unsigned int count, hipStream\_t stream )
-
-Maps or unmaps subregions of sparse HIP arrays and sparse HIP mipmapped arrays.
-
-Warning: This API is under development. Currently it is not supported on AMD GPUs and returns hipErrorNotSupported.
-
-## Parameters
-
-- mapInfoList -[in] - list of hipArrayMapInfo.
-- count -[in] - number of hipArrayMapInfo in mapInfoList.
-- stream -[in] - stream identifier for the stream to use for map or unmap operations.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemRelease ( hipMemGenericAllocationHandle\_t handle )
-
-Release a memory handle representing a memory allocation which was previously allocated through hipMemCreate.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- handle -[in] - handle of the memory allocation.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemRetainAllocationHandle ( hipMemGenericAllocationHandle\_t *handle, void *addr )
-
-Returns the allocation handle of the backing memory allocation given the address.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- handle -[out] - handle representing addr.
-- addr -[in] - address to look up.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemSetAccess ( void *ptr, size\_t size, const hipMemAccessDesc *desc, size\_t count )
-
-Set the access flags for each location specified in desc for the given virtual address range.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- ptr -[in] - starting address of the virtual address range.
-- size -[in] - size of the range.
-- desc -[in] - array of hipMemAccessDesc.
-- count -[in] - number of hipMemAccessDesc in desc.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemUnmap ( void *ptr, size\_t size )
-
-Unmap memory allocation of a given address range.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- ptr -[in] - starting address of the range to unmap.
-- size -[in] - size of the virtual address range.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-## CHAPTER
-
-## TWENTYSEVEN
-
-## HIP DEPRECATED RUNTIME API FUNCTIONS
-
-Several of our API functions have been flagged for deprecation. Using the following functions results in errors and unexpected results, so we encourage you to update your code accordingly.
-
-## 27.1 Context management
-
-CUDA supports the cuCtx API, which is the driver API that defines 'Context' and 'Devices' as separate entities. A context contains a single device, and a device can theoretically have multiple contexts. HIP initially added limited support for these APIs to facilitate porting from existing driver codes. These APIs are now marked as deprecated because there are better alternative interfaces (such as hipSetDevice or the stream API) to achieve these functions.
-
-- hipCtxCreate
-- hipCtxDestroy
-- hipCtxPopCurrent
-- hipCtxPushCurrent
-- hipCtxSetCurrent
-- hipCtxGetCurrent
-- hipCtxGetDevice
-- hipCtxGetApiVersion
-- hipCtxGetCacheConfig
-- hipCtxSetCacheConfig
-- hipCtxSetSharedMemConfig
-- hipCtxGetSharedMemConfig
-- hipCtxSynchronize
-- hipCtxGetFlags
-- hipCtxEnablePeerAccess
-- hipCtxDisablePeerAccess
-- hipDevicePrimaryCtxGetState
-- hipDevicePrimaryCtxRelease
-- hipDevicePrimaryCtxRetain
-- hipDevicePrimaryCtxReset
-- hipDevicePrimaryCtxSetFlags
-
-## 27.2 Memory management
-
-- hipMallocHost (replaced with hipHostMalloc )
-- hipMemAllocHost (replaced with hipHostMalloc )
-- hipHostAlloc (replaced with hipHostMalloc )
-- hipFreeHost (replaced with hipHostFree )
-- hipMemcpyToArray
-- hipMemcpyFromArray
-
-## 27.3 Profiler control
-
-- hipProfilerStart (use roctracer/rocTX)
-- hipProfilerStop (use roctracer/rocTX)
-
-## 27.4 Texture management
-
-- hipGetTextureReference
-- hipTexRefSetAddressMode
-- hipTexRefSetArray
-- hipTexRefSetFilterMode
-- hipTexRefSetFlags
-- hipTexRefSetFormat
-- hipTexRefGetAddress
-- hipTexRefGetAddressMode
-- hipTexRefGetFilterMode
-- hipTexRefGetFlags
-- hipTexRefGetFormat
-- hipTexRefGetMaxAnisotropy
-- hipTexRefGetMipmapFilterMode
-- hipTexRefGetMipmapLevelBias
-- hipTexRefGetMipmapLevelClamp
-- hipTexRefGetMipMappedArray
-- hipTexRefSetAddress
-- hipTexRefSetAddress2D
-- hipTexRefSetMaxAnisotropy
-- hipTexRefSetBorderColor
-- hipTexRefSetMipmapFilterMode
-- hipTexRefSetMipmapLevelBias
-- hipTexRefSetMipmapLevelClamp
-- hipTexRefSetMipmappedArray
-- hipTexRefGetBorderColor
-- hipTexRefGetArray
-- hipBindTexture
-- hipBindTexture2D
-- hipBindTextureToArray
-- hipGetTextureAlignmentOffset
-- hipUnbindTexture
-- hipBindTextureToMipmappedArray
-
-## CHAPTER
-
-## TWENTYEIGHT
-
-## SAXPY - HELLO, HIP
-
-This tutorial explains the basic concepts of the single-source Heterogeneous-computing Interface for Portability (HIP) programming model and the essential tooling around it. It also reviews some commonalities of heterogeneous APIs in general. This topic assumes basic familiarity with the C/C++ compilation model and language.
-
-## 28.1 Prerequisites
-
-To follow this tutorial, you'll need installed drivers and a HIP compiler toolchain to compile your code. Because HIP for ROCm supports compiling and running on Linux and Windows with AMD and NVIDIA GPUs, the combination of install instructions is more than is worth covering as part of this tutorial. For more information about installing HIP development packages, see Install HIP.
-
-## 28.2 Heterogeneous programming
-
-Heterogeneous programming and offloading APIs are often mentioned together. Heterogeneous programming deals with devices of varying capabilities simultaneously. Offloading focuses on the 'remote' and asynchronous aspects of computation. HIP encompasses both. It exposes GPGPU (general-purpose GPU) programming much like ordinary host-side CPU programming and lets you move data across various devices.
-
-When programming in HIP (and other heterogeneous APIs, for that matter), remember that target devices are built for a specific purpose. They are designed with different tradeoffs than traditional CPUs and therefore have very different performance characteristics. Even subtle changes in code might adversely affect execution time.
-
-## 28.3 Your first lines of HIP code
-
-First, let's do the 'Hello, World!' of GPGPU: SAXPY. Single-precision A times X Plus Y ( SAXPY ) is a mathematical acronym; a vector equation 𝑎 · 𝑥 + 𝑦 = 𝑧 where 𝑎 ∈ R is a scalar and 𝑥, 𝑦, 𝑧 ∈ V are vector quantities of some large dimensionality. This vector space is defined over the set of reals. Practically speaking, you can compute this using a single for loop over three arrays.
-
-```
-for (int i = 0; i < N; ++i)
-    z[i] = a * x[i] + y[i];
-```
-
-In linear algebra libraries, such as BLAS (Basic Linear Algebra Subprograms), this operation is defined as AXPY, 'A times X Plus Y'. The 'S' comes from single-precision, meaning that the array elements are floats (IEEE 754 binary32 representation).
-
-To quickly get started, use the set of HIP samples from GitHub. With Git configured on your machine, open a commandline and navigate to your desired working directory, then run:
-
-```
-git clone https://github.com/amd/rocm-examples.git
-```
-
-A simple implementation of SAXPY resides in the HIP-Basic/saxpy/main.hip file in this repository. The HIP code here mostly deals with where data has to be and when, and how devices transform this data. The first HIP calls deal with allocating device-side memory and copying data from host-side memory to device side in a C runtime-like fashion.
-
-```
-// Allocate and copy vectors to device memory.
-float* d_x{};
-float* d_y{};
-HIP_CHECK(hipMalloc(&d_x, size_bytes));
-HIP_CHECK(hipMalloc(&d_y, size_bytes));
-HIP_CHECK(hipMemcpy(d_x, x.data(), size_bytes, hipMemcpyHostToDevice));
-HIP_CHECK(hipMemcpy(d_y, y.data(), size_bytes, hipMemcpyHostToDevice));
-```
-
-HIP\_CHECK is a custom macro borrowed from the examples utilities which checks the error code returned by API functions for errors and reports them to the console. It is not essential to the API, but it is a good practice to check the error codes of the HIP APIs in case you pass on incorrect values to the API, or the API might be out of resources.
-
-The code selects the device to allocate to and to copy to. Commands are issued to the HIP runtime per thread, and every thread has a device set as the target of commands. The default device is 0 , which is equivalent to calling hipSetDevice(0) .
-
-Launch the calculation on the device after the input data has been prepared.
-
-```
-__global__ void saxpy_kernel(const float a, const float* d_x, float* d_y, const unsigned int size)
-{
-    // ...
-}
-
-int main()
-{
-    // ...
-
-    // Launch the kernel on the default stream.
-    // ... (the launch statement is truncated in the source)
-}
-```
-
-## 8.11 memcpyToSymbol
-
-HIP support for hipMemcpyToSymbol is complete. This feature allows a kernel to define a device-side data symbol which can be accessed on the host side. The symbol can be in \_\_constant or device space.
-
-Note that the symbol name needs to be encased in the HIP\_SYMBOL macro, as shown in the code example below. This also applies to hipMemcpyFromSymbol , hipGetSymbolAddress , and hipGetSymbolSize .
-
-For example:
-
-Device Code:
-
-```
-HIP Documentation, Release 6.1.40092
-```
-
-On Windows , you can set AMD\_LOG\_LEVEL via environment variable from the advanced system settings or the command prompt (when run as administrator). The following example shows debug log information when calling the backend runtime.
-
-## 16.1.3 Set memory access
-
-Finally, use the hipMemSetAccess function to enable memory access. It accepts the pointer to the virtual memory, the size, and a hipMemAccessDesc descriptor as parameters. In a multi-GPU environment, you can map the device memory of one GPU to another. This feature also works with the traditional memory management system, but isn't as scalable as with virtual memory. When memory is allocated with hipMalloc , hipDeviceEnablePeerAccess is used to enable peer access. This function enables access between two devices, but it means that every call to hipMalloc takes more time to perform the checks and the mapping between the devices. When using virtual memory management, peer access is enabled by hipMemSetAccess , which provides a finer level of control over what is shared. This has no performance impact on memory allocation and gives you more control over what memory buffers are shared with which devices.
-
-```
-hipMemAccessDesc accessDesc = {};
-accessDesc.location.type = HIP_MEM_LOCATION_TYPE_DEVICE;
-accessDesc.location.id = currentDev;
-accessDesc.flags = HIP_MEM_ACCESS_FLAGS_PROT_READWRITE;
-hipMemSetAccess(ptr, padded_size, &accessDesc, 1);
-```
-
-At this point the memory is allocated, mapped, and ready for use. You can read and write to it, just like you would a C style memory allocation.
-
-## 16.1.4 Free virtual memory
-
-To free the memory allocated in this manner, use the corresponding free functions. To unmap the memory, use hipMemUnmap . To release the virtual address range, use hipMemAddressFree . Finally, to release the physical memory, use hipMemRelease . A side effect of these functions is the lack of synchronization when memory is released. If you call hipFree when you have multiple streams running in parallel, it synchronizes the device. This causes worse resource usage and performance.
-
-```
-hipMemUnmap(ptr, size);
-hipMemRelease(allocHandle);
-hipMemAddressFree(ptr, size);
-```
-
-## 16.2 Memory usage
-
-## 16.2.1 Dynamically increase allocation size
-
-The hipMemAddressReserve function allows you to increase the amount of pre-allocated memory. This function accepts a parameter representing the requested starting address of the virtual memory. This allows you to have a continuous virtual address space without worrying about the underlying physical allocation.
-
-```
- hipMemAddressReserve(&new_ptr, (new_size - padded_size), 0, ptr + padded_size, 0);
- hipMemMap(new_ptr, (new_size - padded_size), 0, newAllocHandle, 0);
- hipMemSetAccess(new_ptr, (new_size - padded_size), &accessDesc, 1);
-```
-
-The code sample above assumes that hipMemAddressReserve was able to reserve the memory address at the specified location. However, this isn't guaranteed to be true, so you should validate that new\_ptr points to the requested virtual address before using it.
-
-## CHAPTER
-
-## SEVENTEEN
-
-## FREQUENTLY ASKED QUESTIONS
-
-## 17.1 What APIs and features does HIP support?
-
-HIP provides the following:
-
-- Devices ( hipSetDevice() , hipGetDeviceProperties() , etc.)
-- Memory management ( hipMalloc() , hipMemcpy() , hipFree() , etc.)
-- Streams ( hipStreamCreate() , hipStreamSynchronize() , hipStreamWaitEvent() , etc.)
-- Events ( hipEventRecord() , hipEventElapsedTime() , etc.)
-- Kernel launching ( hipLaunchKernel / hipLaunchKernelGGL is the preferred way of launching kernels. hipLaunchKernelGGL is a standard C/C++ macro that can serve as an alternative way to launch kernels, replacing the CUDA triple-chevron ( <<< >>> ) syntax).
-- HIP Module API to control when and how code is loaded.
-- CUDA-style kernel coordinate functions ( threadIdx , blockIdx , blockDim , gridDim )
-- Cross-lane instructions including shfl , ballot , any , all
-- Most device-side math built-ins
-- Error reporting ( hipGetLastError() , hipGetErrorString() )
-
-The HIP API documentation describes each API and its limitations, if any, compared with the equivalent CUDA API.
-
-## 17.2 What is not supported?
-
-## 17.2.1 Runtime/Driver API features
-
-At a high-level, the following features are not supported:
-
-- Textures (partial support available)
-- Dynamic parallelism (CUDA 5.0)
-- Graphics interoperability with OpenGL or Direct3D
-- CUDA IPC Functions (Under Development)
-- CUDA array, mipmappedArray and pitched memory
-- Queue priority controls
-
-See the API Support Table for more detailed information.
-
-## 17.2.2 Kernel language features
-
-- C++-style device-side dynamic memory allocations (free, new, delete) (CUDA 4.0)
-- Virtual functions, indirect functions and try/catch (CUDA 4.0)
-- \_\_prof\_trigger
-- PTX assembly (CUDA 4.0). HIP-Clang supports inline GCN assembly.
-- Several kernel features are under development. See the C++ language extensions for more information.
-
-## 17.3 Is HIP a drop-in replacement for CUDA?
-
-No. HIP provides porting tools which do most of the work to convert CUDA code into portable C++ code that uses the HIP APIs. Most developers will port their code from CUDA to HIP and then maintain the HIP version. HIP code provides the same performance as native CUDA code, plus the benefits of running on AMD platforms.
-
-## 17.4 What specific version of CUDA does HIP support?
-
-HIP APIs and features do not map to a specific CUDA version. HIP provides a strong subset of the functionality provided in CUDA, and the hipify tools can scan code to identify any unsupported CUDA functions - this is useful for identifying the specific features required by a given application.
-
-However, we can provide a rough summary of the features included in each CUDA SDK and the support level in HIP. Each bullet below lists the major new language features in a CUDA release and indicates which are supported or not supported in HIP:
-
-- CUDA 4.0 and earlier:
-  - HIP supports CUDA 4.0 except for the limitations described above.
-- CUDA 5.0:
-  - Dynamic Parallelism (not supported)
-  - cuIpc functions (under development)
-- CUDA 6.0:
-  - Managed memory (under development)
-- CUDA 6.5:
-  - \_\_shfl intrinsic (supported)
-- CUDA 7.0:
-  - Per-thread default streams (supported)
-  - C++11 (HIP-Clang supports all of C++11, all of C++14 and some C++17 features)
-- CUDA 7.5:
-  - float16 (supported)
-- CUDA 8.0:
-  - Page Migration including cudaMemAdvise, cudaMemPrefetch, other cudaMem* APIs (not supported)
-- CUDA 9.0:
-  - Cooperative Launch, Surface Object Management, Version Management
-
-## 17.5 What libraries does HIP support?
-
-HIP includes growing support for the four key math libraries using hipBLAS, hipFFT, hipRAND and hipSPARSE, as well as MIOpen for machine intelligence applications. These offer pointer-based memory interfaces (as opposed to opaque buffers) and can be easily interfaced with other HIP applications. The HIP interfaces support both ROCm and CUDA paths, with familiar library interfaces.
-
-- hipBLAS, which utilizes rocBLAS.
-- hipFFT
-- hipSPARSE
-- hipRAND
-- MIOpen
-
-Additionally, some of the cuBLAS routines are automatically converted to hipblas equivalents by the HIPIFY tools. These APIs use cuBLAS or hcBLAS depending on the platform and replace the need to use conditional compilation.
-
-## 17.6 How does HIP compare with OpenCL?
-
-Both AMD and NVIDIA support OpenCL 1.2 on their devices so that developers can write portable code. HIP offers several benefits over OpenCL:
-
-- Developers can code in C++ as well as mix host and device C++ code in their source files. HIP C++ code can use templates, lambdas, classes and so on.
-- The HIP API is less verbose than OpenCL and is familiar to CUDA developers.
-- Because both CUDA and HIP are C++ languages, porting from CUDA to HIP is significantly easier than porting from CUDA to OpenCL.
-- HIP uses the best available development tools on each platform: on NVIDIA GPUs, HIP code compiles using NVCC and can employ the Nsight profiler and debugger (unlike OpenCL on NVIDIA GPUs).
-- HIP provides pointers and host-side pointer arithmetic.
-- HIP provides device-level control over memory allocation and placement.
-- HIP offers an offline compilation model.
-
-## 17.7 How does porting CUDA to HIP compare to porting CUDA to OpenCL?
-
-Both HIP and CUDA are dialects of C++, and thus porting between them is relatively straightforward. Both dialects support templates, classes, lambdas, and other C++ constructs. As one example, the hipify-perl tool was originally a Perl script that used simple text conversions from CUDA to HIP. HIP and CUDA provide similar math library calls as well. In summary, the HIP philosophy was to make the HIP language close enough to CUDA that the porting effort is relatively simple. This reduces the potential for error, and also makes it easy to automate the translation. The goal of HIP is to quickly get the ported program running on both platforms with little manual intervention, so that the programmer can focus on performance optimizations.
-
-There have been several tools that have attempted to convert CUDA into OpenCL, such as CU2CL. OpenCL is a C99-based kernel language (rather than C++) and also does not support single-source compilation. As a result, the OpenCL syntax is different from CUDA, and the porting tools have to perform some heroic transformations to bridge this gap. The tools also struggle with more complex CUDA applications, in particular, those that use templates, classes, or other C++ features inside the kernel.
-
-## 17.8 What hardware does HIP support?
-
-- For AMD platforms, see the ROCm documentation for the list of supported platforms.
-- For NVIDIA platforms, HIP requires unified memory and should run on any device supporting CUDA SDK 6.0 or newer. We have tested the NVIDIA Titan and Tesla K40.
-
-## 17.9 Do HIPIFY tools automatically convert all source code?
-
-Typically, HIPIFY tools can automatically convert almost all run-time code. Most device code needs no additional conversion since HIP and CUDA have similar names for math and built-in functions. The hipify-clang tool will automatically modify the kernel signature as needed (automating a step that used to be done manually). Additional porting may be required to deal with architecture feature queries or with CUDA capabilities that HIP doesn't support. In general, developers should always expect to perform some platform-specific tuning and optimization.
-
-## 17.10 What is NVCC?
-
-NVCC is NVIDIA's compiler driver for compiling 'CUDA C++' code into PTX or device code for NVIDIA GPUs. It's a closed-source binary compiler that is provided by the CUDA SDK.
-
-## 17.11 What is HIP-Clang?
-
-HIP-Clang is a Clang/LLVM-based compiler that compiles HIP programs to run on AMD platforms.
-
-## 17.12 Why use HIP rather than supporting CUDA directly?
-
-While HIP is a strong subset of CUDA, it is a subset. The HIP layer allows that subset to be clearly defined and documented. Developers who code to the HIP API can be assured their code will remain portable across NVIDIA and AMD platforms. In addition, HIP defines portable mechanisms to query architectural features and supports a larger 64-bit WaveSize, which expands the return type for cross-lane functions like ballot and shuffle from 32-bit integers to 64-bit integers.
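
The practical consequence of the wider return type is that lane masks should be held in 64-bit integers. The host-side sketch below illustrates this with a hypothetical helper, `count_active_lanes`, which is not part of the HIP API; it only shows why a 64-bit mask type works for both 32-wide and 64-wide wavefronts.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not a HIP function): counts the lanes whose bit is set
// in a ballot-style mask. A 64-bit mask type covers both wavefront widths.
inline int count_active_lanes(std::uint64_t ballot_mask, int wave_size) {
    assert(wave_size == 32 || wave_size == 64);
    // Ignore any bits above the wavefront width.
    const std::uint64_t valid = (wave_size == 64) ? ~0ULL : 0xFFFFFFFFULL;
    std::uint64_t m = ballot_mask & valid;
    int count = 0;
    while (m != 0) {
        m &= m - 1;  // clear the lowest set bit
        ++count;
    }
    return count;
}
```

Storing the mask in a 32-bit integer would silently drop lanes 32 to 63 on a 64-wide wavefront, which is the portability issue the wider type avoids.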
-
-## 17.13 Can I develop HIP code on an NVIDIA CUDA platform?
-
-Yes. HIP's CUDA path only exposes the APIs and functionality that work on both NVCC and AMDGPU back-ends. 'Extra' APIs, parameters, and features which exist in CUDA but not in HIP-Clang will typically result in compile-time or run-time errors. Developers need to use the HIP API for most accelerator code and bracket any CUDA-specific code with preprocessor conditionals. Developers concerned about portability should, of course, run on both platforms, and should expect to tune for performance. In some cases, CUDA has a richer set of modes for some APIs, and some C++ capabilities such as virtual functions - see the HIP API documentation for more details.
-
-## 17.14 Can I develop HIP code on an AMD HIP-Clang platform?
-
-Yes. HIP's HIP-Clang path only exposes the APIs and functions that work on AMD runtime back ends. 'Extra' APIs, parameters and features that appear in HIP-Clang but not CUDA will typically cause compile- or run-time errors. Developers must use the HIP API for most accelerator code and bracket any HIP-Clang specific code with preprocessor conditionals. Those concerned about portability should, of course, test their code on both platforms and should tune it for performance. Typically, HIP-Clang supports a more modern set of C++11/C++14/C++17 features, so HIP developers who want portability should be careful when using advanced C++ features on the HIP-Clang path.
-
-## 17.15 How to use HIP-Clang to build HIP programs?
-
-An environment variable can be used to set the compiler path:
-
-- HIP\_CLANG\_PATH: path to hip-clang. When set, this variable lets hipcc use hip-clang for compilation and linking.
-
-There is an alternative environment variable to set the compiler path:
-
-- HIP\_ROCCLR\_HOME: path to the root directory of the HIP-ROCclr runtime. When set, this variable lets hipcc use hip-clang from the ROCclr distribution. NOTE: If HIP\_ROCCLR\_HOME is set, there is no need to set HIP\_CLANG\_PATH, since hipcc will deduce it from HIP\_ROCCLR\_HOME.
-
-## 17.16 What is AMD clr?
-
-AMD Common Language Runtime (CLR) is a repository for the AMD platform. It contains the source code for AMD's compute language runtimes:
-
-- hipamd - contains implementation of HIP language for AMD GPU.
-- rocclr - contains virtual device interfaces that compute runtimes interact with backends, such as ROCr on Linux and PAL on Windows.
-- opencl - contains implementation of OpenCL™ on the AMD platform.
-
-## 17.17 What is hipother?
-
-A new repository 'hipother' is added in the ROCm 6.1 release, which is branched out from HIP. hipother supports the HIP back-end implementation on some non-AMD platforms, like NVIDIA.
-
-## 17.18 Can I get HIP open source repository for Windows?
-
-No, there is no HIP repository open publicly on Windows.
-
-## 17.19 Can a HIP binary run on both AMD and NVIDIA platforms?
-
-HIP is a source-portable language that can be compiled to run on either AMD or NVIDIA platform. HIP tools don't create a 'fat binary' that can run on either platform, however.
-
-## 17.20 On HIP-Clang, can I link HIP code with host code compiled with another compiler such as gcc, icc, or clang?
-
-Yes. HIP generates the object code which conforms to the GCC ABI, and also links with libstdc++. This means you can compile host code with the compiler of your choice and link the generated object code with GPU code compiled with HIP. Larger projects often contain a mixture of accelerator code (initially written in CUDA with NVCC) and host code (compiled with gcc, icc, or clang). These projects can convert the accelerator code to HIP, compile that code with hipcc, and link with object code from their preferred compiler.
-
-## 17.21 Can HIP API support C style application? What is the difference between C and C++?
-
-HIP is a C++ runtime API that supports C-style applications as well.
-
-Some C-style applications (and interfaces to other languages such as Fortran and Python) call certain HIP APIs but do not use kernel programming. They can be compiled with a C compiler and run correctly; however, small details must be considered in the code. One example is initialization, as shown in the simple application below, which uses the HIP struct dim3 and is saved in a file named 'test.hip.cpp':
-
-```
-// File name: test.hip.cpp
-#include <stdio.h>
-#include "hip/hip_runtime_api.h"
-
-int main(int argc, char** argv) {
-    // No initializer: default-constructed in C++, left uninitialized in C.
-    dim3 grid1;
-    printf("dim3 grid1; x=%d, y=%d, z=%d\n", grid1.x, grid1.y, grid1.z);
-    // Explicit initializer: behaves the same in C and C++.
-    dim3 grid2 = {1,1,1};
-    printf("dim3 grid2 = {1,1,1}; x=%d, y=%d, z=%d\n", grid2.x, grid2.y, grid2.z);
-    return 0;
-}
-```
-
-When using a C++ compiler,
-
-```
-$ gcc -x c++ $(hipconfig --cpp_config) test.hip.cpp -o test
-$ ./test
-dim3 grid1; x=1, y=1, z=1
-dim3 grid2 = {1,1,1}; x=1, y=1, z=1
-```
-
-In which 'dim3 grid1;' yields a dim3 grid with all dimensional members x, y, z initialized to 1, as the default constructor behaves that way. Further, if written:
-
-```
-dim3 grid(2);   // yields {2,1,1}
-dim3 grid(2,3); // yields {2,3,1}
-```
-
-In comparison, when using the C compiler,
-
-```
-$ gcc -x c $(hipconfig --cpp_config) test.hip.cpp -o test
-$ ./test
-dim3 grid1; x=646881376, y=21975, z=1517277280
-dim3 grid2 = {1,1,1}; x=1, y=1, z=1
-```
-
-In which 'dim3 grid;' does not imply any initialization, no constructor is called, and the dimensional values x, y, z of grid are undefined. NOTE: To get the C++ default behavior, C programmers must additionally specify the right-hand side as shown below,
-
-```
-dim3 grid = {1,1,1}; // initialized as in C++
-```
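
The C++ behavior described in this answer comes from default constructor arguments. The sketch below models it with a hypothetical `Dim3` struct, which is a stand-in written for illustration rather than the actual HIP header definition, so that it builds with any C++11 compiler.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for HIP's dim3, written only to illustrate the
// constructor behavior; the real type is declared in the HIP headers.
struct Dim3 {
    std::uint32_t x, y, z;
    // Every omitted dimension defaults to 1.
    constexpr Dim3(std::uint32_t x_ = 1, std::uint32_t y_ = 1, std::uint32_t z_ = 1)
        : x(x_), y(y_), z(z_) {}
};

// Total number of elements covered by the three dimensions.
constexpr std::uint64_t volume(const Dim3& d) {
    return static_cast<std::uint64_t>(d.x) * d.y * d.z;
}

// A default-constructed Dim3 is {1,1,1}, so its volume is 1.
static_assert(volume(Dim3()) == 1, "default-constructed Dim3 is {1,1,1}");
```

A plain C struct has no constructor, which is why the same declaration compiled as C leaves the members holding arbitrary values.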
-
-## 17.22 Can I install both CUDA SDK and HIP-Clang on the same machine?
-
-Yes. You can use HIP\_PLATFORM to choose which path hipcc targets. This configuration can be useful when using HIP to develop an application which is portable to both AMD and NVIDIA.
-
-## 17.23 HIP detected my platform (HIP-Clang vs NVCC) incorrectly. What should I do?
-
-HIP will set the platform to AMD and use HIP-Clang as the compiler if it sees that the AMD graphics driver is installed and has detected an AMD GPU. Sometimes this isn't what you want; you can force HIP to recognize the platform by setting the following,
-
-```
-export HIP_PLATFORM=amd
-```
-
-On the NVIDIA path, the corresponding compiler and runtime settings are:
-
-```
-HIP_COMPILER=cuda
-HIP_RUNTIME=nvcc
-```
-
-One symptom of this problem is the message "error: 'unknown error'(11) at square.hipref.cpp:56". This can occur if you have a CUDA installation on an AMD platform, and HIP incorrectly detects the platform as NVCC. HIP may be able to compile the application using the NVCC tool-chain but will generate this error at runtime since the platform does not have a CUDA device.
-
-## 17.24 On CUDA, can I mix CUDA code with HIP code?
-
-Yes. Most HIP data structures ( hipStream\_t , hipEvent\_t ) are typedefs to CUDA equivalents and can be intermixed. Both CUDA and HIP use integer device ids. One notable exception is that hipError\_t is a new type, and cannot be used where a cudaError\_t is expected. In these cases, refactor the code to remove the expectation. Alternatively, hip\_runtime\_api.h defines functions which convert between the error code spaces:
-
-- hipErrorToCudaError
-- hipCUDAErrorTohipError
-- hipCUResultTohipError
-
-If platform portability is important, use #ifdef \_\_HIP\_PLATFORM\_NVIDIA\_\_ to guard the CUDA-specific code.
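
A minimal sketch of such a guard follows. The function name `hip_backend_name` is hypothetical, and the third branch is an addition so that the file also builds with a plain host compiler that defines neither platform macro.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper: reports which HIP back end this file was compiled for.
inline const char* hip_backend_name() {
#if defined(__HIP_PLATFORM_NVIDIA__)
    // CUDA-specific code belongs in this branch only.
    return "nvidia";
#elif defined(__HIP_PLATFORM_AMD__)
    return "amd";
#else
    // Neither macro is defined, e.g. when built without hipcc.
    return "none";
#endif
}
```

Keeping the CUDA-only calls inside the first branch lets the same source build unchanged on the AMD path.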
-
-## 17.25 How do I trace HIP application flow?
-
-See Logging HIP activity for more information.
-
-## 17.26 What are the maximum limits of kernel launch parameters?
-
-The product of block.x, block.y, and block.z should not exceed 1024. Note that HIP does not support kernel launches whose total work items in any dimension (gridDim x blockDim) reach 2^32 or more, so gridDim.x * blockDim.x, gridDim.y * blockDim.y, and gridDim.z * blockDim.z must each be less than 2^32.
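
These limits can be checked on the host before a launch. The sketch below is one possible check, not a HIP API: `LaunchDims` and `launch_config_is_valid` are hypothetical names, and it assumes that 1024 threads per block is an inclusive limit. Real code could query the device properties instead of hard-coding the value.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for dim3 so the sketch builds without the HIP headers.
struct LaunchDims {
    std::uint32_t x, y, z;
};

// Assumed limits for this sketch (1024 is treated as inclusive).
constexpr std::uint64_t kMaxThreadsPerBlock = 1024;
constexpr std::uint64_t kWorkItemLimit = 1ULL << 32;

// True when the block size and the per-dimension work-item counts are in range.
inline bool launch_config_is_valid(LaunchDims grid, LaunchDims block) {
    const std::uint32_t g[3] = {grid.x, grid.y, grid.z};
    const std::uint32_t b[3] = {block.x, block.y, block.z};
    std::uint64_t threads = 1;
    for (int i = 0; i < 3; ++i) {
        // Reject empty or oversized dimensions first so the product cannot overflow.
        if (g[i] == 0 || b[i] == 0 || b[i] > kMaxThreadsPerBlock) return false;
        threads *= b[i];
        // 64-bit arithmetic: the product of two 32-bit values always fits.
        if (static_cast<std::uint64_t>(g[i]) * b[i] >= kWorkItemLimit) return false;
    }
    return threads <= kMaxThreadsPerBlock;
}
```

Doing the per-dimension multiplication in 64 bits matters here, because a 32-bit product would wrap around exactly in the range the check is meant to reject.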
-
-## 17.27 Are \_\_shfl\_*\_sync functions supported on HIP platform?
-
-\_\_shfl\_*\_sync is not supported on HIP, but on the NVCC path with CUDA 9.0 and above, all shuffle calls are redirected to their sync versions.
-
-## 17.28 How to create a guard for code that is specific to the host or the GPU?
-
-The compiler defines the \_\_HIP\_DEVICE\_COMPILE\_\_ macro only when compiling the code for the GPU. It could be used to guard code that is specific to the host or the GPU.
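
A minimal sketch of this guard is shown below. The `HOST_DEVICE` macro and the function `compilation_pass` are hypothetical names introduced for illustration; the fallback definition lets the same file build with an ordinary host compiler, where only the host branch exists.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical convenience macro: expands to the HIP qualifiers under hipcc
// and to nothing under a plain host compiler.
#if defined(__HIPCC__)
#include <hip/hip_runtime.h>
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Returns a label for the compilation pass that produced this code.
HOST_DEVICE inline const char* compilation_pass() {
#ifdef __HIP_DEVICE_COMPILE__
    return "device";  // seen only while compiling for the GPU
#else
    return "host";    // host pass, or a non-HIP compiler
#endif
}
```

Because the macro is evaluated per compilation pass, a single function body can carry both variants without any runtime branching.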
-
-## 17.29 Why is \_OPENMP undefined when compiling with -fopenmp?
-
-When compiling an OpenMP source file with hipcc -fopenmp, the compiler may generate an error if there is a reference to the \_OPENMP macro. This is due to a limitation in hipcc that treats any source file type (for example .cpp) as a HIP translation unit, leading to some conflicts with the OpenMP language switch. If the OpenMP source file doesn't contain any HIP language constructs, you could work around this issue by adding the -x c++ switch to force the compiler to treat the file as regular C++. Another approach would be to guard the OpenMP code with #ifdef \_OPENMP so that the code block is disabled when compiling for the GPU. The \_\_HIP\_DEVICE\_COMPILE\_\_ macro defined by the HIP compiler when compiling GPU code could also be used for guarding code paths specific to the host or the GPU.
-
-## 17.30 Does the HIP-Clang compiler support extern shared declarations?
-
-Previously, it was essential to declare dynamic shared memory using the HIP\_DYNAMIC\_SHARED macro for accuracy, as using static shared memory in the same kernel could result in overlapping memory ranges and data-races.
-
-Now, the HIP-Clang compiler provides support for extern shared declarations, and the HIP\_DYNAMIC\_SHARED option is no longer required. You may use the standard extern definition: extern \_\_shared\_\_ type var[];
-
-## 17.31 I have multiple HIP enabled devices and I am getting an error code hipErrorSharedObjectInitFailed with the message 'Error: shared object initialization failed'?
-
-This error occurs when you do not have a valid code object for all of your devices.
-
-If you have compiled the application yourself, make sure you have given the correct device name(s) and features via --offload-arch. If you are not specifying --offload-arch, make sure that hipcc is using the correct offload arch by checking the hipcc output generated when the environment variable HIPCC\_VERBOSE=1 is set.
-
-If you have a precompiled application or library (like rocBLAS, TensorFlow, etc.) that gives you this error, there are two possibilities.
-
-- The application/library does not ship code object bundles for any of your device(s): in this case you need to recompile the application/library yourself with the correct --offload-arch.
-- The application/library does not ship code object bundles for some of your device(s), for example you have a system with an APU + GPU and the library does not ship code objects for your APU. In this case you can set the environment variable HIP\_VISIBLE\_DEVICES (or CUDA\_VISIBLE\_DEVICES on the NVIDIA platform) to enable only the GPUs for which a code object is available. This limits the GPUs visible to your application and allows it to run.
-
-Note: In previous releases, the error code was hipErrorNoBinaryForGpu with the message 'Unable to find code object for all current devices'. The error code handling behavior has changed: on an unsupported GPU, the HIP runtime now reports hipErrorSharedObjectInitFailed with the message 'Error: shared object initialization failed'.
-
-## 17.32 How to use per-thread default stream in HIP?
-
-The per-thread default stream is an implicit stream local to both the thread and the current device. It does not do any implicit synchronization with other streams (like explicitly created streams), or default per-thread stream on other threads.
-
-The per-thread default stream is a blocking stream and will synchronize with the default null stream if both are used in a program.
-
-In ROCm, add the compilation option -fgpu-default-stream=per-thread to compile a translation unit with the per-thread default stream enabled. Once the source is compiled this way, all APIs execute on the per-thread default stream, so there is no implicit synchronization with other streams.
-
-Because the per-thread default stream is enabled per translation unit, users can compile some files with the feature enabled and others with it disabled. A translation unit with the feature enabled uses the per-thread default stream and performs no implicit synchronization, while other modules use the legacy default stream, which does synchronize implicitly.
-
-## 17.33 How to use complex multiplication and division operations?
-
-In HIP, hipFloatComplex and hipDoubleComplex are defined as complex data types.
-
-Any application that uses complex multiplication and division operations needs to replace the '*' and '/' operators with the following,
-
-- hipCmulf() and hipCdivf() for hipFloatComplex
-- hipCmul() and hipCdiv() for hipDoubleComplex
-
-Note: These complex operations are equivalent to corresponding types/functions on the NVIDIA platform.
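
To make explicit what these functions compute, the sketch below implements the textbook complex product and quotient on a hypothetical `FloatComplex` struct. It assumes the real part is stored in `x` and the imaginary part in `y`; the names `cmulf` and `cdivf` are illustrative and are not the HIP implementations, which may differ in how they handle overflow and rounding.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for hipFloatComplex: x = real part, y = imaginary part.
struct FloatComplex {
    float x, y;
};

// (a.x + i a.y) * (b.x + i b.y)
inline FloatComplex cmulf(FloatComplex a, FloatComplex b) {
    return {a.x * b.x - a.y * b.y, a.x * b.y + a.y * b.x};
}

// (a.x + i a.y) / (b.x + i b.y), using the conjugate of b; b must be non-zero.
inline FloatComplex cdivf(FloatComplex a, FloatComplex b) {
    const float denom = b.x * b.x + b.y * b.y;
    return {(a.x * b.x + a.y * b.y) / denom, (a.y * b.x - a.x * b.y) / denom};
}
```

The built-in '*' and '/' operators do not perform these component-wise combinations on a two-member struct, which is why the named functions are needed.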
-
-## 17.34 Can I develop applications with HIP APIs on Windows the same as on Linux?
-
-Yes, HIP APIs are available to use on both Linux and Windows. Due to different working mechanisms on operating systems like Windows vs Linux, HIP APIs call corresponding lower level backend runtime libraries and kernel drivers for the OS, in order to control the executions on GPU hardware accordingly. There might be a few differences on the related backend software and driver support, which might affect usage of HIP APIs. See OS support details in HIP API document.
-
-## 17.35 Does HIP support LUID?
-
-Starting with ROCm 6.0, the HIP runtime supports Locally Unique Identifiers (LUID). This feature enables the local physical device(s) to interoperate with other devices, for example with DirectX 12.
-
-HIP runtime sets device LUID properties so the driver can query LUID to identify each device for interoperability.
-
-Note: HIP supports LUID only on Windows OS.
-
-## 17.36 How can I know the version of HIP?
-
-The HIP version definition has been updated since the ROCm 4.2 release.
-
-The HIP version can be queried with the HIP API call hipRuntimeGetVersion(&runtimeVersion);
-
-The version returned will always be greater than the versions in previous ROCm releases.
-
-Note: The version definition of HIP runtime is different from CUDA. On AMD platform, the function returns HIP runtime version, while on NVIDIA platform, it returns CUDA runtime version. And there is no mapping/correlation between HIP version and CUDA version.
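
The integer returned on the AMD platform can be split back into its components. The sketch below assumes the packing major * 10000000 + minor * 100000 + patch, which is not stated in the text above and should be checked against the installed HIP version header; `HipVersion` and `decode_hip_version` are hypothetical names.

```cpp
#include <cassert>

// Hypothetical container for the decoded components.
struct HipVersion {
    int major_version;
    int minor_version;
    int patch_level;
};

// Assumed packing: major * 10000000 + minor * 100000 + patch.
inline HipVersion decode_hip_version(int packed) {
    assert(packed >= 0);
    return {packed / 10000000, (packed / 100000) % 100, packed % 100000};
}
```

In a HIP program the `packed` argument would be the value written by hipRuntimeGetVersion; on the NVIDIA platform the value is a CUDA runtime version, so this decoding would not apply there.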
-
-CHAPTER
-
-## EIGHTEEN
-
-## HIP RUNTIME API REFERENCE
-
-- 18.1 Related Pages
-- 18.2 Topics
-- 18.3 Namespaces
-  - 18.3.1 Namespace List
-  - 18.3.2 Namespace Members
-- 18.4 Data Structures
-  - 18.4.1 Data Structures
-  - 18.4.2 Data Structure Index
-  - 18.4.3 Class Hierarchy
-  - 18.4.4 Data Fields
-
-CHAPTER
-
-## NINETEEN
-
-## C++ LANGUAGE EXTENSIONS
-
-HIP provides a C++ syntax that is suitable for compiling most code that commonly appears in compute kernels (classes, namespaces, operator overloading, and templates). HIP also defines other language features that are designed to target accelerators, such as:
-
-- A kernel-launch syntax that uses standard C++ (this resembles a function call and is portable to all HIP targets)
-- Short-vector headers that can serve on a host or device
-- Math functions that resemble those in math.h , which is included with standard C++ compilers
-- Built-in functions for accessing specific GPU hardware capabilities
-
-Note: This chapter describes the built-in variables and functions that are accessible from the HIP kernel. It's intended for users who are familiar with CUDA kernel syntax and want to learn how HIP differs from CUDA.
-
-Features are labeled with one of the following keywords:
-
-- Supported : HIP supports the feature with a CUDA-equivalent function
-- Not supported : HIP does not support the feature
-- Under development : The feature is under development and not yet available
-
-## 19.1 Function-type qualifiers
-
-## 19.1.1 \_\_device\_\_
-
-Supported \_\_device\_\_ functions are:
-
-- Run on the device
-- Called from the device only
-
-You can combine \_\_device\_\_ with the host keyword ( \_\_host\_\_ ).
-
-## 19.1.2 \_\_global\_\_
-
-Supported \_\_global\_\_ functions are:
-
-- Run on the device
-- Called (launched) from the host
-
-HIP \_\_global\_\_ functions must have a void return type.
-
-HIP doesn't support dynamic-parallelism, which means that you can't call \_\_global\_\_ functions from the device.
-
-## 19.1.3 \_\_host\_\_
-
-Supported \_\_host\_\_ functions are:
-
-- Run on the host
-- Called from the host
-
-You can combine \_\_host\_\_ with \_\_device\_\_ ; in this case, the function compiles for the host and the device. Note that these functions can't use the HIP grid coordinate functions (e.g., threadIdx.x ). If you need to use HIP grid coordinate functions, you can pass the necessary coordinate information as an argument.
-
-You can't combine \_\_host\_\_ with \_\_global\_\_ .
-
-HIP parses the \_\_noinline\_\_ and \_\_forceinline\_\_ keywords and converts them into the appropriate Clang attributes.
-
-## 19.2 Calling \_\_global\_\_ functions
-
-\_\_global\_\_ functions are often referred to as kernels . When you call a global function, you're launching a kernel . When launching a kernel, you must specify an execution configuration that includes the grid and block dimensions. The execution configuration can also include other information for the launch, such as the amount of additional shared memory to allocate and the stream where you want to execute the kernel.
-
-HIP introduces a standard C++ calling convention ( hipLaunchKernelGGL ) to pass the run configuration to the kernel. However, you can also use the CUDA <<< >>> syntax.
-
-When using hipLaunchKernelGGL , your first five parameters must be:
-
-- symbol kernelName : The name of the kernel you want to launch. To support template kernels that contain "," , use the HIP\_KERNEL\_NAME macro (HIPIFY tools insert this automatically).
-- dim3 gridDim : 3D-grid dimensions that specify the number of blocks to launch.
-- dim3 blockDim : 3D-block dimensions that specify the number of threads in each block.
-- size\_t dynamicShared : The amount of additional shared memory that you want to allocate when launching the kernel (see \_\_shared\_\_ ).
-- hipStream\_t : The stream where you want to run the kernel. A value of 0 corresponds to the NULL stream (see Synchronization functions ).
-
-You can include your kernel arguments after these parameters.
-**Following code does:** This code snippet is part of a GPU programming context, likely using a framework like CUDA or HIP for parallel computing. The high-level purpose of the code is to perform a SAXPY operation (Single-Precision A·X Plus Y) on a GPU. The `saxpy_kernel` function is a kernel function that runs on the GPU and performs the SAXPY operation on arrays `d_x` and `d_y` with a scalar `a`. The `main` function sets up the execution environment and launches this kernel on the GPU using a specified grid and block size configuration. The kernel is executed on the default stream, which is a queue for managing the execution order of operations on the GPU.
-
-
-```
-// Example hipLaunchKernelGGL pseudocode:
-__global__ void MyKernel(/* ..., */ size_t N)
-{
-    // ...
-}
-
-// Launch with the five leading parameters described above, followed by the kernel arguments:
-// hipLaunchKernelGGL(MyKernel, gridDim, blockDim, dynamicShared, stream, /* kernel arguments */);
-```
-
-**Device-to-host copy in the SAXPY example:** The example copies the result from device (GPU) memory back to host (CPU) memory with a `hipMemcpy` call wrapped in `HIP_CHECK`. The pieces of that call are:
-
-- `hipMemcpy` is the function that copies data between host and device memory.
-- `y.data()` is the host pointer that the data is copied to.
-- `d_y` is the device pointer that the data is copied from.
-- `size_bytes` is the number of bytes to copy.
-- `hipMemcpyDeviceToHost` is the enumerator that selects the copy direction, from device to host.
-
-The `HIP_CHECK` macro checks the value returned by `hipMemcpy` so that a failed copy is reported.
-
-We insert the GCN ISA into the kernel using the asm() assembler statement. The volatile keyword ensures that the optimizer does not change the number of volatile operations or reorder them relative to other volatile operations. v\_mac\_f32\_e32 is the GCN instruction; for more information, refer to the [AMD GCN3 ISA architecture manual](http://gpuopen.com/compute-product/amd-gcn3-isa-architecture-manual/). The index of each operand is given by % followed by its position in the operand list. 'v' is the constraint code (target-specific to AMDGPU) for a 32-bit VGPR register; for more information, refer to the [Supported Constraint Code List for AMDGPU](https://llvm.org/docs/LangRef.html#supported-constraint-code-list). Output constraints are specified by an '=' prefix, as in '=v'. This indicates that the assembly writes to this operand, which is then made available as the return value of the asm expression. Input constraints have no prefix, only the constraint code. The constraint string '0' means that the register assigned to the output is also used as an input (it is the 0th constraint).
-
-## C++ Support
-
-The following C++ features are not supported:
-
-- Run-time-type information (RTTI)
-- Try/catch
-
-## Virtual functions
-
-Virtual functions are not supported if objects containing virtual function tables are passed between GPUs of different offload architectures, e.g. between gfx906 and gfx1030. Otherwise, virtual functions are supported.
-
-## 19.27 Kernel Compilation
-
-hipcc now supports compiling C++/HIP kernels to binary code objects. The file format for binary is .co which means Code Object. The following command builds the code object using hipcc .
-**Following code does:** The command invokes `hipcc` with `--genco` to compile the kernels in the input file into a binary code object for the GPU architecture given by `--offload-arch`, and writes the result to the file named after `-o`.
-
-
-```
-hipcc --genco --offload-arch=[TARGET GPU] [INPUT FILE] -o [OUTPUT FILE]
-
-[TARGET GPU] = GPU architecture
-[INPUT FILE] = Name of the file containing kernels
-[OUTPUT FILE] = Name of the generated code object file
-```
-
-Note: When using binary code objects, the number of arguments passed to the kernel differs between the HIP-Clang and NVCC paths. Refer to the HIP module\_api sample for the differences in the arguments to be passed to the kernel.
-
-## 19.28 gfx-arch-specific-kernel
-
-Clang-defined '\_\_gfx*\_\_' macros can be used to execute gfx-arch-specific code inside the kernel. Refer to the HIP 14\_gpu\_arch sample.
-
-## CHAPTER TWENTY: C++ LANGUAGE SUPPORT
-
-The ROCm platform enables the power of combined C++ and HIP (Heterogeneous-computing Interface for Portability) code. This code is compiled with a clang or clang++ compiler. The official compilers support the HIP platform, or you can use the amdclang or amdclang++ included in the ROCm installation, which are wrappers for the official versions.
-
-The source code is compiled according to the C++03 , C++11 , C++14 , C++17 , and C++20 standards, along with HIP-specific extensions, but is subject to restrictions. The key restriction is the reduced support of the standard library in device code. This is because a function is by default considered to run on the host, except for constexpr functions, which can run on both host and device.
-
-## 20.1 Modern C++ support
-
-C++ is considered a modern programming language as of C++11. This section describes how HIP supports these new C++ features.
-
-## 20.1.1 C++11 support
-
-The C++11 standard introduced many new features. These features are supported in HIP host code, with some notable omissions on the device side. The rule of thumb is that constexpr functions work on the device and the rest do not. As a result, some important functionality, such as std::function, is missing on the device. The standard library wasn't designed with HIP in mind, so device-side support is in a state of 'works as-is'.
-
-Certain features have restrictions and clarifications. For example, any functions using the constexpr qualifier or the new initializer lists , std::move or std::forward features are implicitly considered to have the \_\_host\_\_ and \_\_device\_\_ execution space specifier. Also, constexpr variables that are static members or namespace scoped can be used from both host and device, but only for read access. Dereferencing a static constexpr outside its specified execution space causes an error.
-
-Lambdas are supported, but there are some extensions and restrictions on their usage. For more information, see the Extended lambdas section below.
-
-## 20.1.2 C++14 support
-
-The C++14 language features are supported.
-
-## 20.1.3 C++17 support
-
-All C++17 language features are supported.
-
-## 20.1.4 C++20 support
-
-C++20 language features are supported, subject to extensions and restrictions. C++20 introduced coroutines and modules, which fundamentally changed how programs are written; HIP doesn't support these two features. However, consteval functions can be called from host and device, even if specified for host use only.
-
-The three-way comparison operator (spaceship operator <=> ) works with host and device code.
-
-## 20.2 Extensions and restrictions
-
-In addition to the deviations from the standard, there are some general extensions and restrictions to consider.
-
-## 20.2.1 Global functions
-
-Functions that serve as an entry point for device execution are called kernels and are specified with the \_\_global\_\_ qualifier. To call a kernel function, use the triple chevron operator: <<< >>> . Kernel functions must have a void return type. These functions can't:
-
-- have a constexpr specifier
-- have a parameter of type std::initializer\_list or va\_list
-- use an rvalue reference as a parameter.
-- use parameters having different sizes in host and device code, e.g. long double arguments, or structs containing long double members.
-- use struct-type arguments which have different layout in host and device code.
-
-Kernels can have variadic template parameters, but only one parameter pack, which must be the last item in the template parameter list.
-
-## 20.2.2 Device space memory specifiers
-
-HIP includes device space memory specifiers to indicate whether a variable is allocated in host or device memory and how its memory should be allocated. HIP supports the \_\_device\_\_ , \_\_shared\_\_ , \_\_managed\_\_ , and \_\_constant\_\_ specifiers.
-
-The \_\_device\_\_ and \_\_constant\_\_ specifiers define global variables, which are allocated within global memory on the HIP devices. The only difference is that \_\_constant\_\_ variables can't be changed after allocation. The \_\_shared\_\_ specifier allocates the variable within shared memory, which is available for all threads in a block.
-
-The \_\_managed\_\_ variable specifier creates global variables that are initially undefined and unaddressed within the global symbol table. The HIP runtime allocates managed memory and defines the symbol when it loads the device binary. A managed variable can be accessed in both device and host code.
-
-It's important to know where a variable is stored because it is only available from certain locations. Generally, variables allocated in host memory are not accessible from device code, while variables allocated in device memory are not directly accessible from host code. Dereferencing a pointer to device memory on the host results in a segmentation fault. Host code should access device variables through kernel execution or HIP functions like hipMemcpyToSymbol .
-
-## 20.2.3 Exception handling
-
-An important difference between the host and device code is exception handling. In device code, this control flow isn't available due to the hardware architecture. The device code must use return codes to handle errors.
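-
-A minimal sketch of the return-code style follows. It is written as plain C++ so that it also compiles on the host; in a HIP source file the function would carry the \_\_device\_\_ qualifier. The `Status` enum and the `safe_divide` function are hypothetical names chosen for illustration.
-
-```cpp
-#include <cassert>
-
-// Hypothetical status codes used in place of exceptions.
-enum class Status { Ok, DivideByZero };
-
-// In HIP device code this would be declared __device__ (omitted here so the
-// sketch also compiles as host-only C++). The result is returned through an
-// output pointer and the error is reported through the return value.
-inline Status safe_divide(float num, float den, float* out) {
-    if (den == 0.0f) {
-        return Status::DivideByZero;  // report the error instead of throwing
-    }
-    *out = num / den;
-    return Status::Ok;
-}
-```
-
-A kernel could store the returned `Status` in a result buffer that the host inspects after the launch completes.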
-
-## 20.2.4 Kernel parameters
-
-There are some restrictions on kernel function parameters. They cannot be passed by reference, because these functions are called from the host but run on the device. Also, a variable number of arguments is not allowed.
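-
-Some of these parameter rules can be checked at compile time. The sketch below uses a hypothetical trait, `is_plausible_kernel_param`, that rejects reference types and `long double`. It is an illustration in standard C++ only, and it does not cover the layout-related restrictions listed in section 20.2.1.
-
-```cpp
-#include <cassert>
-#include <type_traits>
-
-// Hypothetical helper: true when T is neither a reference (lvalue or rvalue)
-// nor long double, two of the parameter kinds that kernels cannot take.
-template <typename T>
-struct is_plausible_kernel_param
-    : std::integral_constant<bool,
-          !std::is_reference<T>::value &&
-          !std::is_same<typename std::remove_cv<T>::type, long double>::value> {};
-
-// Pointers and plain values pass the check.
-static_assert(is_plausible_kernel_param<float*>::value, "pointer is fine");
-static_assert(is_plausible_kernel_param<int>::value, "value is fine");
-// References and long double are rejected.
-static_assert(!is_plausible_kernel_param<float&>::value, "no lvalue references");
-static_assert(!is_plausible_kernel_param<float&&>::value, "no rvalue references");
-static_assert(!is_plausible_kernel_param<long double>::value, "no long double");
-```
-
-A launch wrapper could apply the trait to each argument type before forwarding the arguments to the kernel.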
-
-## 20.2.5 Classes
-
-Classes work on both the host and device side, but there are some constraints. Static member functions can't be \_\_global\_\_ . Virtual member functions work, but a virtual function must not be called from the host if the parent object was created on the device, or the other way around, because this behavior is undefined. Another minor restriction is that globally scoped \_\_device\_\_ variables must have trivial constructors.
-
-## 20.2.6 Polymorphic function wrappers
-
-HIP doesn't support the polymorphic function wrapper std::function , which was introduced in C++11.
-
-## 20.2.7 Extended lambdas
-
-HIP supports Lambdas, which by default work as expected.
-
-Lambdas have implicit \_\_host\_\_ \_\_device\_\_ attributes. This means that they can be executed by both host and device code. To make a lambda callable only by host or device code, add the \_\_host\_\_ or \_\_device\_\_ attribute. The only restriction is that host variables can be accessed on the device only through copy. Accessing them through reference causes undefined behavior.
-
-## 20.2.8 Inline namespaces
-
-Inline namespaces are supported, but with a few exceptions. The following entities can't be declared in namespace scope within an inline unnamed namespace:
-
-- \_\_managed\_\_ , \_\_device\_\_ , \_\_shared\_\_ and \_\_constant\_\_ variables
-- \_\_global\_\_ function and function templates
-- variables with surface or texture type
-
-## CHAPTER TWENTY-ONE: HIP MATH API
-
-HIP-Clang supports a set of math operations that are callable from the device. HIP supports most of the device functions supported by NVIDIA CUDA. These are described in the following sections.
-
-## 21.1 Single precision mathematical functions
-
-Following is the list of supported single precision mathematical functions.
-
-Table 1: Single precision mathematical functions
-
-| Function | Supported on Host | Supported on Device |
-|----------------------------------------------------------------------------|---------------------|-----------------------|
-| float abs(float x) Returns the absolute value of 𝑥 | ✓ | ✓ |
-| float acosf(float x) Returns the arc cosine of 𝑥 . | ✓ | ✓ |
-| float acoshf(float x) Returns the nonnegative arc hyperbolic cosine of 𝑥 . | ✓ | ✓ |
-| float asinf(float x) Returns the arc sine of 𝑥 . | ✓ | ✓ |
-| float asinhf(float x) Returns the arc hyperbolic sine of 𝑥 . | ✓ | ✓ |
-| float atanf(float x) Returns the arc tangent of 𝑥 . | ✓ | ✓ |
-| float atan2f(float x, float y) Returns the arc tangent of the ratio of 𝑥 and 𝑦 . | ✓ | ✓ |
-| float atanhf(float x) Returns the arc hyperbolic tangent of 𝑥 . | ✓ | ✓ |
-| float cbrtf(float x) Returns the cube root of 𝑥 . | ✓ | ✓ |
-| float ceilf(float x) Returns ceiling of 𝑥 . | ✓ | ✓ |
-| float copysignf(float x, float y) Create value with given magnitude, copying sign of second value. | ✓ | ✓ |
-| float cosf(float x) Returns the cosine of 𝑥 . | ✓ | ✓ |
-| float coshf(float x) Returns the hyperbolic cosine of 𝑥 . | ✓ | ✓ |
-| float cospif(float x) Returns the cosine of 𝜋 · 𝑥 . | ✓ | ✓ |
-| float cyl_bessel_i0f(float x) Returns the value of the regular modified cylindrical Bessel function of order 0 for 𝑥 . | | |
-| float cyl_bessel_i1f(float x) Returns the value of the regular modified cylindrical Bessel function of order 1 for 𝑥 . | | |
-| float erff(float x) Returns the error function of 𝑥 . | ✓ | ✓ |
-| float erfcf(float x) Returns the complementary error function of 𝑥 . | ✓ | ✓ |
-| float erfcinvf(float x) Returns the inverse complementary function of 𝑥 . | ✓ | ✓ |
-| float erfcxf(float x) Returns the scaled complementary error function of 𝑥 . | ✓ | ✓ |
-| float erfinvf(float x) Returns the inverse error function of 𝑥 . | ✓ | ✓ |
-| float expf(float x) Returns 𝑒^𝑥 . | ✓ | ✓ |
-| float exp10f(float x) Returns 10^𝑥 . | ✓ | ✓ |
-| float exp2f(float x) Returns 2^𝑥 . | ✓ | ✓ |
-| float expm1f(float x) Returns 𝑒^𝑥 - 1 . | ✓ | ✓ |
-| float fabsf(float x) Returns the absolute value of 𝑥 . | ✓ | ✓ |
-| float fdimf(float x, float y) Returns the positive difference between 𝑥 and 𝑦 . | ✓ | ✓ |
-| float fdividef(float x, float y) Divide two floating point values. | ✓ | ✓ |
-| float floorf(float x) Returns the largest integer less than or equal to 𝑥 . | ✓ | ✓ |
-| float fmaf(float x, float y, float z) Returns 𝑥 · 𝑦 + 𝑧 as a single operation. | ✓ | ✓ |
-| float fmaxf(float x, float y) Determine the maximum numeric value of 𝑥 and 𝑦 . | ✓ | ✓ |
-| float fminf(float x, float y) Determine the minimum numeric value of 𝑥 and 𝑦 . | ✓ | ✓ |
-| float fmodf(float x, float y) Returns the floating-point remainder of 𝑥/𝑦 . | ✓ | ✓ |
-| float modff(float x, float* iptr) Break down 𝑥 into fractional and integral parts. | ✓ | |
-
-| float frexpf(float x, int* nptr) Extract mantissa and exponent of 𝑥 . | ✓ |
-|---------------------------------------------------------------------------------------------------------|-----|
-| float hypotf(float x, float y) Returns the square root of the sum of squares of 𝑥 and 𝑦 . | ✓ |
-| int ilogbf(float x) Returns the unbiased integer exponent of 𝑥 . | ✓ |
-| bool isfinite(float x) Determine whether 𝑥 is finite. | ✓ |
-| bool isinf(float x) Determine whether 𝑥 is infinite. | ✓ |
-| bool isnan(float x) Determine whether 𝑥 is a NAN . | ✓ |
-| float j0f(float x) Returns the value of the Bessel function of the first kind of order 0 for 𝑥 . | ✓ |
-| float j1f(float x) Returns the value of the Bessel function of the first kind of order 1 for 𝑥 . | ✓ |
-| float jnf(int n, float x) Returns the value of the Bessel function of the first kind of order n for 𝑥 . | ✓ |
-
-| float ldexpf(float x, int exp) Returns 𝑥 · 2^exp . | ✓ | ✓ |
-|-------------------------------------------------------------------------------------------------------------------|-----|-----|
-| float lgammaf(float x) Returns the natural logarithm of the absolute value of the gamma function of 𝑥 . | ✓ | |
-| long int lrintf(float x) Round 𝑥 to nearest integer value. | ✓ | ✓ |
-| long long int llrintf(float x) Round 𝑥 to nearest integer value. | ✓ | ✓ |
-| long int lroundf(float x) Round to nearest integer value. | ✓ | ✓ |
-| long long int llroundf(float x) Round to nearest integer value. | ✓ | ✓ |
-| float log10f(float x) Returns the base 10 logarithm of 𝑥 . | ✓ | ✓ |
-| float log1pf(float x) Returns the natural logarithm of 𝑥 +1 . | ✓ | ✓ |
-| float log2f(float x) Returns the base 2 logarithm of 𝑥 . | ✓ | ✓ |
-| float logf(float x) Returns the natural logarithm of 𝑥 . | ✓ | ✓ |
-| float logbf(float x) Returns the floating point representation of the exponent of 𝑥 . | ✓ | ✓ |
-| float nanf(const char* tagp) Returns 'Not a Number' value. | | ✓ |
-| float nearbyintf(float x) Round 𝑥 to the nearest integer. | ✓ | ✓ |
-| float nextafterf(float x, float y) Returns next representable single-precision floating-point value after argument. | ✓ | |
-| float norm3df(float x, float y, float z) Returns the square root of the sum of squares of 𝑥 , 𝑦 and 𝑧 . | ✓ | ✓ |
-| float norm4df(float x, float y, float z, float w) Returns the square root of the sum of squares of 𝑥 , 𝑦 , 𝑧 and 𝑤 . | ✓ | ✓ |
-| float normcdff(float y) Returns the standard normal cumulative distribution function. | ✓ | ✓ |
-| float normcdfinvf(float y) Returns the inverse of the standard normal cumulative distribution function. | ✓ | ✓ |
-| float normf(int dim, const float *a) Returns the square root of the sum of squares of any number of coordinates. | ✓ | ✓ |
-| float powf(float x, float y) Returns 𝑥^𝑦 . | ✓ | ✓ |
-| float powif(float base, int iexp) Returns the value of first argument to the power of second argument. | ✓ | ✓ |
-| float remainderf(float x, float y) Returns single-precision floating-point remainder. | ✓ | ✓ |
-| float remquof(float x, float y, int* quo) Returns single-precision floating-point remainder and part of quotient. | ✓ | ✓ |
-| float roundf(float x) Round to nearest integer value in floating-point. | ✓ | ✓ |
-| float rcbrtf(float x) Returns the reciprocal cube root function. | ✓ | ✓ |
-| float rhypotf(float x, float y) Returns one over the square root of the sum of squares of two arguments. | ✓ | ✓ |
-| float rintf(float x) Round input to nearest integer value in floating-point. | ✓ | ✓ |
-| float rnorm3df(float x, float y, float z) Returns one over the square root of the sum of squares of three coordinates of the argument. | ✓ | ✓ |
-| float rnorm4df(float x, float y, float z, float w) Returns one over the square root of the sum of squares of four coordinates of the argument. | ✓ | ✓ |
-| float rnormf(int dim, const float *a) Returns the reciprocal of square root of the sum of squares of any number of coordinates. | ✓ | ✓ |
-| float scalblnf(float x, long int n) Scale 𝑥 by 2^𝑛 . | ✓ | ✓ |
-| float scalbnf(float x, int n) Scale 𝑥 by 2^𝑛 . | ✓ | ✓ |
-| bool signbit(float x) Return the sign bit of 𝑥 . | ✓ | ✓ |
-| float sinf(float x) Returns the sine of 𝑥 . | ✓ | ✓ |
-| float sinhf(float x) Returns the hyperbolic sine of 𝑥 . | ✓ | ✓ |
-| float sinpif(float x) Returns the sine of 𝜋 · 𝑥 . | ✓ | ✓ |
-| void sincosf(float x, float *sptr, float *cptr) Returns the sine and cosine of 𝑥 . | ✓ | ✓ |
-| void sincospif(float x, float *sptr, float *cptr) Returns the sine and cosine of 𝜋 · 𝑥 . | ✓ | ✓ |
-| float sqrtf(float x) Returns the square root of 𝑥 . | ✓ | ✓ |
-| float rsqrtf(float x) Returns the reciprocal of the square root of 𝑥 . | | ✓ |
-| float tanf(float x) Returns the tangent of 𝑥 . | ✓ | ✓ |
-| float tanhf(float x) Returns the hyperbolic tangent of 𝑥 . | ✓ | ✓ |
-| float tgammaf(float x) Returns the gamma function of 𝑥 . | ✓ | ✓ |
-| float truncf(float x) Truncate 𝑥 to the integral part. | ✓ | ✓ |
-| float y0f(float x) Returns the value of the Bessel function of the second kind of order 0 for 𝑥 . | ✓ | ✓ |
-| float y1f(float x) Returns the value of the Bessel function of the second kind of order 1 for 𝑥 . | ✓ | ✓ |
-
-float ynf(int n, float x) Returns the value of the Bessel function of the second kind of order n for 𝑥 .
-
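-A few of the descriptions in Table 1 can be checked on the host with the standard `<cmath>` functions of the same family. The sketch below is host-only C++ and assumes that the HIP device functions follow the same mathematical definitions. The helper name `table1_identities_hold` is hypothetical, and the inputs are chosen so that every result is exactly representable.
-
-```cpp
-#include <cassert>
-#include <cmath>
-
-// Hypothetical helper: returns true when a handful of Table 1 functions
-// produce the values their descriptions imply (host-side <cmath> versions).
-inline bool table1_identities_hold() {
-    bool ok = true;
-    ok = ok && (std::scalbn(1.5f, 3) == 12.0f);          // x * 2^n
-    ok = ok && (std::fdim(5.0f, 3.0f) == 2.0f);          // positive difference
-    ok = ok && (std::fdim(3.0f, 5.0f) == 0.0f);          // clamps at zero
-    ok = ok && (std::fma(2.0f, 3.0f, 4.0f) == 10.0f);    // x * y + z
-    ok = ok && (std::copysign(2.0f, -1.0f) == -2.0f);    // magnitude of x, sign of y
-    ok = ok && (std::floor(-1.5f) == -2.0f);             // largest integer <= x
-    ok = ok && (std::trunc(-1.5f) == -1.0f);             // integral part of x
-    return ok;
-}
-```
-
-The same comparisons could be run inside a kernel against `scalbnf`, `fdimf`, `fmaf`, `copysignf`, `floorf`, and `truncf` to confirm device behavior on a given platform.
-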
-## 21.2 Double precision mathematical functions
-
-Following is the list of supported double precision mathematical functions.
-
-Table 2: Double precision mathematical functions
-
-| Function | Supported on Host | Supported on Device |
-|------------------------------------------------------------------------------------|---------------------|-----------------------|
-| double abs(double x) Returns the absolute value of 𝑥 | ✓ | ✓ |
-| double acos(double x) Returns the arc cosine of 𝑥 . | ✓ | ✓ |
-| double acosh(double x) Returns the nonnegative arc hyperbolic cosine of 𝑥 . | ✓ | ✓ |
-| double asin(double x) Returns the arc sine of 𝑥 . | ✓ | ✓ |
-| double asinh(double x) Returns the arc hyperbolic sine of 𝑥 . | ✓ | ✓ |
-| double atan(double x) Returns the arc tangent of 𝑥 . | ✓ | ✓ |
-| double atan2(double x, double y) Returns the arc tangent of the ratio of 𝑥 and 𝑦 . | ✓ | ✓ |
-| double atanh(double x) Returns the arc hyperbolic tangent of 𝑥 . | ✓ | ✓ |
-| double cbrt(double x) Returns the cube root of 𝑥 . | ✓ | ✓ |
-| double ceil(double x) Returns ceiling of 𝑥 . | ✓ | ✓ |
-| double copysign(double x, double y) Create value with given magnitude, copying sign of second value. | ✓ | ✓ |
-| double cos(double x) Returns the cosine of 𝑥 . | ✓ | ✓ |
-| double cosh(double x) Returns the hyperbolic cosine of 𝑥 . | ✓ | ✓ |
-| double cospi(double x) Returns the cosine of 𝜋 · 𝑥 . | ✓ | ✓ |
-| double cyl_bessel_i0(double x) Returns the value of the regular modified cylindrical Bessel function of order 0 for 𝑥 . | | |
-| double cyl_bessel_i1(double x) Returns the value of the regular modified cylindrical Bessel function of order 1 for 𝑥 . | | |
-| double erf(double x) Returns the error function of 𝑥 . | ✓ | ✓ |
-| double erfc(double x) Returns the complementary error function of 𝑥 . | ✓ | ✓ |
-| double erfcinv(double x) Returns the inverse complementary function of 𝑥 . | ✓ | ✓ |
-| double erfcx(double x) Returns the scaled complementary error function of 𝑥 . | ✓ | ✓ |
-| double erfinv(double x) Returns the inverse error function of 𝑥 . | ✓ | ✓ |
-| double exp(double x) Returns 𝑒^𝑥 . | ✓ | ✓ |
-| double exp10(double x) Returns 10^𝑥 . | ✓ | ✓ |
-| double exp2(double x) Returns 2^𝑥 . | ✓ | ✓ |
-| double expm1(double x) Returns 𝑒^𝑥 - 1 . | ✓ | ✓ |
-| double fabs(double x) Returns the absolute value of x | ✓ | ✓ |
-| double fdim(double x, double y) Returns the positive difference between 𝑥 and 𝑦 . | ✓ | ✓ |
-| double floor(double x) Returns the largest integer less than or equal to 𝑥 . | ✓ | ✓ |
-| double fma(double x, double y, double z) Returns 𝑥 · 𝑦 + 𝑧 as a single operation. | ✓ | ✓ |
-| double fmax(double x, double y) Determine the maximum numeric value of 𝑥 and 𝑦 . | ✓ | ✓ |
-| double fmin(double x, double y) Determine the minimum numeric value of 𝑥 and 𝑦 . | ✓ | ✓ |
-| double fmod(double x, double y) Returns the floating-point remainder of 𝑥/𝑦 . | ✓ | ✓ |
-| double modf(double x, double* iptr) Break down 𝑥 into fractional and integral parts. | ✓ | |
-| double frexp(double x, int* nptr) Extract mantissa and exponent of 𝑥 . | ✓ | |
-| double hypot(double x, double y) Returns the square root of the sum of squares of 𝑥 and 𝑦 . | ✓ | ✓ |
-| int ilogb(double x) Returns the unbiased integer exponent of 𝑥 . | ✓ | ✓ |
-| bool isfinite(double x) Determine whether 𝑥 is finite. | ✓ | ✓ |
-| bool isinf(double x) Determine whether 𝑥 is infinite. | ✓ | ✓ |
-| bool isnan(double x) Determine whether 𝑥 is a NAN . | ✓ | ✓ |
-| double j0(double x) Returns the value of the Bessel function of the first kind of order 0 for 𝑥 . | ✓ | ✓ |
-| double j1(double x) Returns the value of the Bessel function of the first kind of order 1 for 𝑥 . | ✓ | ✓ |
-| double jn(int n, double x) Returns the value of the Bessel function of the first kind of order n for 𝑥 . | ✓ | ✓ |
-| double ldexp(double x, int exp) Returns 𝑥 · 2^exp . | ✓ | ✓ |
-| double lgamma(double x) Returns the natural logarithm of the absolute value of the gamma function of 𝑥 . | ✓ | |
-| long int lrint(double x) Round 𝑥 to nearest integer value. | ✓ | ✓ |
-| long long int llrint(double x) Round 𝑥 to nearest integer value. | ✓ | ✓ |
-| long int lround(double x) Round to nearest integer value. | ✓ | ✓ |
-| long long int llround(double x) Round to nearest integer value. | ✓ | ✓ |
-| double log10(double x) Returns the base 10 logarithm of 𝑥 . | ✓ | ✓ |
-| double log1p(double x) Returns the natural logarithm of 𝑥 +1 . | ✓ | ✓ |
-| double log2(double x) Returns the base 2 logarithm of 𝑥 . | ✓ | ✓ |
-| double log(double x) Returns the natural logarithm of 𝑥 . | ✓ | ✓ |
-| double logb(double x) Returns the floating point representation of the exponent of 𝑥 . | ✓ | ✓ |
-| double nan(const char* tagp) Returns 'Not a Number' value. | | ✓ |
-| double nearbyint(double x) Round 𝑥 to the nearest integer. | ✓ | ✓ |
-| double nextafter(double x, double y) Returns next representable double-precision floating-point value after argument. | ✓ | ✓ |
-| double norm3d(double x, double y, double z) Returns the square root of the sum of squares of 𝑥 , 𝑦 and 𝑧 . | ✓ | ✓ |
-| double norm4d(double x, double y, double z, double w) Returns the square root of the sum of squares of 𝑥 , 𝑦 , 𝑧 and 𝑤 . | ✓ | ✓ |
-| double normcdf(double y) Returns the standard normal cumulative distribution function. | ✓ | ✓ |
-| double normcdfinv(double y) Returns the inverse of the standard normal cumulative distribution function. | ✓ | ✓ |
-| double norm(int dim, const double *a) Returns the square root of the sum of squares of any number of coordinates. | ✓ | ✓ |
-| double pow(double x, double y) Returns 𝑥^𝑦 . | ✓ | ✓ |
-| double powi(double base, int iexp) Returns the value of first argument to the power of second argument. | ✓ | ✓ |
-| double remainder(double x, double y) Returns double-precision floating-point remainder. | ✓ | ✓ |
-| double remquo(double x, double y, int* quo) Returns double-precision floating-point remainder and part of quotient. | ✓ | ✓ |
-| double round(double x) Round to nearest integer value in floating-point. | ✓ | ✓ |
-| double rcbrt(double x) Returns the reciprocal cube root function. | ✓ | ✓ |
-| double rhypot(double x, double y) Returns one over the square root of the sum of squares of two arguments. | ✓ | ✓ |
-| double rint(double x) Round input to nearest integer value in floating-point. | ✓ | ✓ |
-| double rnorm3d(double x, double y, double z) Returns one over the square root of the sum of squares of three coordinates of the argument. | ✓ | ✓ |
-| double rnorm4d(double x, double y, double z, double w) Returns one over the square root of the sum of squares of four coordinates of the argument. | ✓ | ✓ |
-| double rnorm(int dim, const double *a) Returns the reciprocal of square root of the sum of squares of any number of coordinates. | ✓ | ✓ |
-| double scalbln(double x, long int n) Scale 𝑥 by 2^𝑛 . | ✓ | ✓ |
-| double scalbn(double x, int n) Scale 𝑥 by 2^𝑛 . | ✓ | ✓ |
-| bool signbit(double x) Return the sign bit of 𝑥 . | ✓ | ✓ |
-| double sin(double x) Returns the sine of 𝑥 . | ✓ | ✓ |
-| double sinh(double x) Returns the hyperbolic sine of 𝑥 . | ✓ | ✓ |
-| double sinpi(double x) Returns the sine of 𝜋 · 𝑥 . | ✓ | ✓ |
-| void sincos(double x, double *sptr, double *cptr) Returns the sine and cosine of 𝑥 . | ✓ | ✓ |
-| void sincospi(double x, double *sptr, double *cptr) Returns the sine and cosine of 𝜋 · 𝑥 . | ✓ | ✓ |
-| double sqrt(double x) Returns the square root of 𝑥 . | ✓ | ✓ |
-
-| double rsqrt(double x) Returns the reciprocal of the square root of 𝑥 . | ✓ |
-|-----------------------------------------------------------------------------------------------------------|-----|
-| double tan(double x) Returns the tangent of 𝑥 . | ✓ |
-| double tanh(double x) Returns the hyperbolic tangent of 𝑥 . | ✓ |
-| double tgamma(double x) Returns the gamma function of 𝑥 . | ✓ |
-| double trunc(double x) Truncate 𝑥 to the integral part. | ✓ |
-| double y0(double x) Returns the value of the Bessel function of the second kind of order 0 for 𝑥 . | ✓ |
-| double y1(double x) Returns the value of the Bessel function of the second kind of order 1 for 𝑥 . | ✓ |
-| double yn(int n, double x) Returns the value of the Bessel function of the second kind of order n for 𝑥 . | ✓ |
-
-## 21.3 Integer intrinsics
-
-Following is the list of supported integer intrinsics. Note that intrinsics are supported on device only.
-
-Table 3: Integer intrinsics mathematical functions
-
-| Function |
-|----------|
-| unsigned int \_\_brev(unsigned int x) Reverse the bit order of a 32 bit unsigned integer. |
-| unsigned long long int \_\_brevll(unsigned long long int x) Reverse the bit order of a 64 bit unsigned integer. |
-| unsigned int \_\_byte\_perm(unsigned int x, unsigned int y, unsigned int z) Return selected bytes from two 32-bit unsigned integers. |
-| unsigned int \_\_clz(int x) Return the number of consecutive high-order zero bits in 32 bit integer. |
-| unsigned int \_\_clzll(long long int x) Return the number of consecutive high-order zero bits in 64 bit integer. |
-| unsigned int \_\_ffs(int x) Find the position of least significant bit set to 1 in a 32 bit integer. |
-| unsigned int \_\_ffsll(long long int x) Find the position of least significant bit set to 1 in a 64 bit signed integer. |
-| unsigned int \_\_fns32(unsigned long long mask, unsigned int base, int offset) Find the position of the n-th set to 1 bit in a 32-bit integer. |
-| unsigned int \_\_fns64(unsigned long long int mask, unsigned int base, int offset) Find the position of the n-th set to 1 bit in a 64-bit integer. |
-| unsigned int \_\_funnelshift\_l(unsigned int lo, unsigned int hi, unsigned int shift) Concatenate ℎ𝑖 and 𝑙𝑜 , shift left by shift & 31 bits, return the most significant 32 bits. |
-| unsigned int \_\_funnelshift\_lc(unsigned int lo, unsigned int hi, unsigned int shift) Concatenate ℎ𝑖 and 𝑙𝑜 , shift left by min(shift, 32) bits, return the most significant 32 bits. |
-| unsigned int \_\_funnelshift\_r(unsigned int lo, unsigned int hi, unsigned int shift) Concatenate ℎ𝑖 and 𝑙𝑜 , shift right by shift & 31 bits, return the least significant 32 bits. |
-
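The funnel-shift descriptions in Table 3 can be checked against a small host-side reference model. The sketch below is plain C++ and runs on the CPU. The `ref_funnelshift_*` names are hypothetical helpers introduced for illustration, not part of HIP, and they only follow the wording of the table (concatenate `hi:lo` into 64 bits, shift, keep one 32-bit half).

```cpp
#include <algorithm>
#include <cassert>  // lets callers assert on the results
#include <cstdint>

// Hypothetical host-side models of the funnel-shift intrinsics in Table 3.

// Builds the 64-bit value hi:lo described in the table.
inline std::uint64_t ref_concat(std::uint32_t lo, std::uint32_t hi) {
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}

// Models __funnelshift_l: shift left by (shift & 31), keep the upper 32 bits.
inline std::uint32_t ref_funnelshift_l(std::uint32_t lo, std::uint32_t hi,
                                       std::uint32_t shift) {
    return static_cast<std::uint32_t>((ref_concat(lo, hi) << (shift & 31u)) >> 32);
}

// Models __funnelshift_lc: shift left by min(shift, 32), keep the upper 32 bits.
inline std::uint32_t ref_funnelshift_lc(std::uint32_t lo, std::uint32_t hi,
                                        std::uint32_t shift) {
    const std::uint32_t clamped = std::min<std::uint32_t>(shift, 32u);
    return static_cast<std::uint32_t>((ref_concat(lo, hi) << clamped) >> 32);
}

// Models __funnelshift_r: shift right by (shift & 31), keep the lower 32 bits.
inline std::uint32_t ref_funnelshift_r(std::uint32_t lo, std::uint32_t hi,
                                       std::uint32_t shift) {
    return static_cast<std::uint32_t>(ref_concat(lo, hi) >> (shift & 31u));
}
```

Under this model, passing the same word as both `lo` and `hi` to `ref_funnelshift_l` rotates it left by `shift & 31` bits, which is a common use of funnel shifts.
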
-The HIP-Clang implementation of \_\_ffs() and \_\_ffsll() adds a constant +1 to produce the ffs result format. Where this overhead is not acceptable and the programmer is willing to specialize for the platform, HIP-Clang provides \_\_lastbit\_u32\_u32(unsigned int input) and \_\_lastbit\_u32\_u64(unsigned long long int input) . The index returned by the \_\_lastbit\_ instructions starts at -1, while for ffs the index starts at 0.
-
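The relationship between the two index conventions can be sketched on the host. The helper names `ref_lastbit_u32` and `ref_ffs_u32` are hypothetical. The sketch assumes that "starts at -1" means a 0-based bit index with -1 for an input with no bits set, and that "starts at 0" means a 1-based position with 0 for an input with no bits set.

```cpp
#include <cassert>  // lets callers assert on the results
#include <cstdint>

// Hypothetical model of the __lastbit_ convention: 0-based index of the
// least significant set bit, or -1 when the input is zero (assumed).
inline int ref_lastbit_u32(std::uint32_t x) {
    if (x == 0) return -1;
    int index = 0;
    while ((x & 1u) == 0) {
        x >>= 1;
        ++index;
    }
    return index;
}

// Hypothetical model of the ffs convention: the same index plus the constant +1,
// so a zero input yields 0 and bit 0 yields 1.
inline int ref_ffs_u32(std::uint32_t x) {
    return ref_lastbit_u32(x) + 1;
}
```

The only difference between the two results is the constant +1, which is the overhead the paragraph above refers to.
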
-## 21.4 Floating-point Intrinsics
-
-Following is the list of supported floating-point intrinsics. Note that intrinsics are supported on device only.
-
-Note: Only the round-to-nearest-even rounding mode is supported on AMD GPUs by default. The \_rz , \_ru and \_rd suffixed intrinsic functions are available in the HIP AMD backend only if the OCML\_BASIC\_ROUNDED\_OPERATIONS macro is defined.
-
-Table 4: Single precision intrinsics mathematical functions
-
-| Function | Description |
-|---|---|
-| float \_\_cosf(float x) | Returns the fast approximate cosine of x. |
-| float \_\_exp10f(float x) | Returns the fast approximate value of 10^x. |
-| float \_\_expf(float x) | Returns the fast approximate value of e^x. |
-| float \_\_fadd\_rn(float x, float y) | Add two floating-point values in round-to-nearest-even mode. |
-| float \_\_fdiv\_rn(float x, float y) | Divide two floating-point values in round-to-nearest-even mode. |
-| float \_\_fmaf\_rn(float x, float y, float z) | Returns x × y + z as a single operation in round-to-nearest-even mode. |
-| float \_\_fmul\_rn(float x, float y) | Multiply two floating-point values in round-to-nearest-even mode. |
-| float \_\_frcp\_rn(float x) | Returns 1 / x in round-to-nearest-even mode. |
-| float \_\_frsqrt\_rn(float x) | Returns 1 / √x in round-to-nearest-even mode. |
-| float \_\_fsqrt\_rn(float x) | Returns √x in round-to-nearest-even mode. |
-| float \_\_fsub\_rn(float x, float y) | Subtract two floating-point values in round-to-nearest-even mode. |
-| float \_\_log10f(float x) | Returns the fast approximate base 10 logarithm of x. |
-
-Table 5: Double precision intrinsics mathematical functions
-
-| Function | Description |
-|---|---|
-| double \_\_dadd\_rn(double x, double y) | Add two floating-point values in round-to-nearest-even mode. |
-| double \_\_ddiv\_rn(double x, double y) | Divide two floating-point values in round-to-nearest-even mode. |
-| double \_\_dmul\_rn(double x, double y) | Multiply two floating-point values in round-to-nearest-even mode. |
-| double \_\_drcp\_rn(double x) | Returns 1 / x in round-to-nearest-even mode. |
-| double \_\_dsqrt\_rn(double x) | Returns √x in round-to-nearest-even mode. |
-| double \_\_dsub\_rn(double x, double y) | Subtract two floating-point values in round-to-nearest-even mode. |
-| double \_\_fma\_rn(double x, double y, double z) | Returns x × y + z as a single operation in round-to-nearest-even mode. |
-
-## Chapter 22: Table comparing syntax for different compute APIs
-
-| Term | CUDA | HIP | OpenCL |
-|------------------------|---------------------|--------------------------------------------|------------------------|
-| Device | int deviceId | int deviceId | cl_device |
-| Queue | cudaStream_t | hipStream_t | cl_command_queue |
-| Event | cudaEvent_t | hipEvent_t | cl_event |
-| Memory | void * | void * | cl_mem |
-| | grid | grid | NDRange |
-| | block | block | work-group |
-| | thread | thread | work-item |
-| | warp | warp | sub-group |
-| Thread-index | threadIdx.x | threadIdx.x | get_local_id(0) |
-| Block-index | blockIdx.x | blockIdx.x | get_group_id(0) |
-| Block-dim | blockDim.x | blockDim.x | get_local_size(0) |
-| Grid-dim | gridDim.x | gridDim.x | get_num_groups(0) |
-| Device Kernel | __global__ | __global__ | __kernel |
-| Device Function | __device__ | __device__ | Implied in device com |
-| Host Function          | __host__ (default)  | __host__ (default)                         | Implied in host compil |
-| Host + Device Function | __host__ __device__ | __host__ __device__ | No equivalent |
-| Kernel Launch | <<< >>> | hipLaunchKernel / hipLaunchKernelGGL / <<< | clEnqueueNDRangeK |
-| Global Memory | __global__ | __global__ | __global |
-| Group Memory | __shared__ | __shared__ | __local |
-| Constant | __constant__ | __constant__ | __constant |
-| | __syncthreads | __syncthreads | barrier(CLK_LOCAL |
-| Atomic Builtins | atomicAdd | atomicAdd | atomic_add |
-| Precise Math | cos(f) | cos(f) | cos(f) |
-| Fast Math | __cos(f) | __cos(f) | native_cos(f) |
-| Vector | float4 | float4 | float4 |
-
-## 22.1 Notes
-
-The indexing functions (starting with thread-index ) show the terminology for a 1D grid. Some APIs use reverse order of xyz / 012 indexing for 3D grids.
-
-## Chapter 23: HIP cooperative groups API
-
-## 23.1 Cooperative kernel launches
-
-The following host-side functions are used for cooperative kernel launches.
-
-- hipLaunchCooperativeKernel
-- hipLaunchCooperativeKernelMultiDevice
-- hipModuleLaunchCooperativeKernel
-- hipModuleLaunchCooperativeKernelMultiDevice
-
-## 23.2 Cooperative groups classes
-
-The following cooperative groups classes can be used on the device side.
-
-class thread\_group
-
-The base type of all cooperative group types.
-
-Holds the key properties of a constructed cooperative group object, such as the group type and its size.
-
-Note: The cooperative groups feature is implemented on Linux and is under development on Microsoft Windows.
-
-Subclassed by cooperative\_groups::coalesced\_group, cooperative\_groups::grid\_group, cooperative\_groups::multi\_grid\_group, cooperative\_groups::thread\_block, cooperative\_groups::tiled\_group
-
-class thread\_block : public cooperative\_groups:: thread\_group
-
-The workgroup (thread-block in CUDA terminology) cooperative group type.
-
-Represents an intra-workgroup cooperative group type, where the participating threads within the group are the same threads that participate in the currently executing workgroup.
-
-Note: This type is implemented on Linux and is under development on Microsoft Windows.
-
-class grid\_group : public cooperative\_groups:: thread\_group
-
-The grid cooperative group type.
-
-Represents an inter-workgroup cooperative group type, where the participating threads within the group span multiple workgroups running the same kernel on the same device.
-
-Note: This is implemented on Linux and is under development on Microsoft Windows.
-
-class multi\_grid\_group : public cooperative\_groups:: thread\_group
-
-The multi-grid cooperative group type.
-
-Represents an inter-device cooperative group type, where the participating threads within the group span multiple devices running the same kernel.
-
-Note: The multi-grid cooperative group type is implemented on Linux and is under development on Microsoft Windows.
-
-template<unsigned int size , class ParentCGTy >
-
-class thread\_block\_tile : public cooperative\_groups::impl::thread\_block\_tile\_internal< size , ParentCGTy >
-
-Group type: thread\_block\_tile.
-
-Represents one tiled thread group in a wavefront. This group type also supports sub-wave level intrinsics.
-
-Note: This type is implemented on Linux and is under development on Microsoft Windows.
-
-## Public Functions
-
-unsigned int thread\_rank () const
-
-Rank of the calling thread within [0, size()).
-
-void sync ()
-
-Synchronizes the threads in the group.
-
-Causes all threads in the group to wait at this synchronization point until every thread has reached it and all shared and global memory accesses made by those threads have completed. This guarantees the visibility of accessed data for all threads in the group.
-
-Note: There are potential read-after-write (RAW), write-after-read (WAR), or write-after-write (WAW) hazards when threads in the group access the same addresses in shared or global memory. These data hazards can be avoided by synchronizing the group.
-
-unsigned int meta\_group\_rank () const
-
-Returns the linear rank of the group within the set of tiles partitioned from a parent group (bounded by meta\_group\_size).
-
-unsigned int meta\_group\_size () const
-
-Returns the number of groups created when the parent group was partitioned.
-
-template<class T >
-
-T shfl ( T var, int srcRank ) const
-
-Shuffle operation on group level.
-
-Exchanges variables between threads without using shared memory. The shuffle operation is a direct copy of var from the thread whose ID within the group is srcRank.
-
-## Template Parameters
-
-T - The type can be a 32-bit integer or single-precision floating point.
-
-## Parameters
-
-- var - [in] The source variable to copy. Only the value held by the thread whose ID within the group is srcRank is copied to the other threads.
-- srcRank - [in] The source thread ID of the group for copy.
-
-template<class T >
-
-T shfl\_down ( T var, unsigned int lane\_delta ) const
-
-Shuffle down operation on group level.
-
-Exchanges variables between threads without using shared memory. The shuffle down operation copies var from the thread whose ID within the group is lane\_delta higher than the caller's thread ID.
-
-## Template Parameters
-
-T - The type can be a 32-bit integer or single-precision floating point.
-
-## Parameters
-
-- var - [in] The source variable to copy.
-- lane\_delta - [in] The difference between the caller's thread ID and the thread ID of the copy source: sourceID = (threadID + lane\_delta) % size()
-
-template<class T >
-
-T shfl\_up ( T var, unsigned int lane\_delta ) const
-
-Shuffle up operation on group level.
-
-Exchanges variables between threads without using shared memory. The shuffle up operation copies var from the thread whose ID within the group is lane\_delta lower than the caller's thread ID.
-
-## Template Parameters
-
-T - The type can be a 32-bit integer or single-precision floating point.
-
-## Parameters
-
-- var - [in] The source variable to copy.
-- lane\_delta - [in] The difference between the caller's thread ID and the thread ID of the copy source: sourceID = (threadID - lane\_delta) % size()
-
-template<class T >
-
-T shfl\_xor ( T var, unsigned int laneMask ) const
-
-Shuffle xor operation on group level.
-
-Exchanges variables between threads without using shared memory. The shuffle xor operation copies var from the thread whose ID within the group is the caller's thread ID XORed with laneMask.
-
-## Template Parameters
-
-T - The type can be a 32-bit integer or single-precision floating point.
-
-## Parameters
-
-- var - [in] The source variable to copy.
-- laneMask - [in] The mask for the XOR operation: sourceID = threadID ^ laneMask
-
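The source-lane formulas quoted for shfl\_down, shfl\_up and shfl\_xor can be exercised with a host-side model. The sketch below is plain C++ run on the CPU, with hypothetical helper names (`src_down`, `src_up`, `src_xor`, `butterfly_sum`). It models only the index arithmetic as written above, assuming a power-of-two group size, and makes no claim about how a particular device handles out-of-range lanes.

```cpp
#include <cassert>
#include <vector>

// Source lane for shfl_down, as quoted: (threadID + lane_delta) % size().
inline unsigned src_down(unsigned thread_id, unsigned lane_delta, unsigned size) {
    return (thread_id + lane_delta) % size;
}

// Source lane for shfl_up, as quoted: (threadID - lane_delta) % size().
// size is added before subtracting so the unsigned arithmetic cannot wrap.
inline unsigned src_up(unsigned thread_id, unsigned lane_delta, unsigned size) {
    return (thread_id + size - lane_delta % size) % size;
}

// Source lane for shfl_xor, as quoted: threadID ^ laneMask.
inline unsigned src_xor(unsigned thread_id, unsigned lane_mask) {
    return thread_id ^ lane_mask;
}

// Host emulation of a butterfly reduction built from shfl_xor-style exchanges.
// After log2(size) rounds every lane holds the sum over the whole group.
inline std::vector<int> butterfly_sum(std::vector<int> lanes) {
    const unsigned size = static_cast<unsigned>(lanes.size());
    assert(size != 0 && (size & (size - 1)) == 0);  // power-of-two group size
    for (unsigned mask = size / 2; mask > 0; mask /= 2) {
        std::vector<int> next(lanes.size());
        for (unsigned id = 0; id < size; ++id) {
            next[id] = lanes[id] + lanes[src_xor(id, mask)];  // one exchange per lane
        }
        lanes = next;
    }
    return lanes;
}
```

On a device, each round of the loop would correspond to one shfl\_xor call per thread. The host version copies into `next` to mimic all lanes reading their partner's value before any lane overwrites its own.
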
-unsigned long long ballot ( int pred ) const
-
-Ballot function on group level.
-
-Returns a bit mask with the Nth bit set to one if the Nth thread predicate evaluates true.
-
-## Parameters
-
-pred - [in] The predicate to evaluate on group threads.
-
-int any ( int pred ) const
-
-Any function on group level.
-
-Returns non-zero if the predicate evaluates true for any thread in the group.
-
-## Parameters
-
-pred - [in] The predicate to evaluate on group threads.
-
-int all ( int pred ) const
-
-All function on group level.
-
-Returns non-zero if a predicate evaluates true for all threads.
-
-## Parameters
-
-pred - [in] The predicate to evaluate on group threads.
-
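The ballot, any and all descriptions above can be related to one another with a host-side model. The sketch is plain C++ on the CPU and uses hypothetical names (`ref_ballot`, `ref_any`, `ref_all`). It assumes a group of at most 64 lanes, matching the `unsigned long long` return type of ballot, and takes one predicate value per lane.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Models ballot: bit N of the result is set when lane N's predicate is non-zero.
inline std::uint64_t ref_ballot(const std::vector<int>& preds) {
    assert(preds.size() <= 64);
    std::uint64_t mask = 0;
    for (std::size_t lane = 0; lane < preds.size(); ++lane) {
        if (preds[lane] != 0) mask |= (std::uint64_t{1} << lane);
    }
    return mask;
}

// Models any: non-zero when at least one lane's predicate is true.
inline int ref_any(const std::vector<int>& preds) {
    return ref_ballot(preds) != 0 ? 1 : 0;
}

// Models all: non-zero when every lane's bit is set in the ballot mask.
inline int ref_all(const std::vector<int>& preds) {
    const std::uint64_t full = preds.size() == 64
        ? ~std::uint64_t{0}
        : (std::uint64_t{1} << preds.size()) - 1;
    return ref_ballot(preds) == full ? 1 : 0;
}
```

In this model, any and all are both derived from the ballot mask: any tests for a non-zero mask, and all compares the mask against one with a bit set for every lane in the group.
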
-template<typename T >
-
-unsigned long long match\_any ( T value ) const
-
-Match any function on group level.
-
-Returns a bit mask containing a 1-bit for every participating thread if that thread has the same value in value as the caller thread.
-
-## Parameters
-
-value - [in] The value to examine on the current thread in group.
-
-template<typename T >
-
-unsigned long long match\_all ( T value, int &pred ) const
-
-Match all function on group level.
-
-Returns a bit mask containing a 1-bit for every participating thread if they all have the same value in value as the caller thread. The predicate pred is set to true if all participating threads have the same value in value .
-
-## Parameters
-
-- value - [in] The value to examine on the current thread in group.
-- pred - [out] The predicate is set to true if all participating threads in the thread group have the same value.
-
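The difference between match\_any and match\_all can be illustrated with a host-side model. The sketch is plain C++ on the CPU with hypothetical names (`ref_match_any`, `ref_match_all`), taking one value per lane in a group of at most 64 lanes. Returning 0 from `ref_match_all` when the lanes disagree is an assumption, since the description above only covers the case where all values agree.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Models match_any: mask of lanes whose value equals the caller's value.
inline std::uint64_t ref_match_any(const std::vector<int>& values, std::size_t caller) {
    assert(values.size() <= 64 && caller < values.size());
    std::uint64_t mask = 0;
    for (std::size_t lane = 0; lane < values.size(); ++lane) {
        if (values[lane] == values[caller]) mask |= (std::uint64_t{1} << lane);
    }
    return mask;
}

// Models match_all: pred is set to 1 when every lane holds the same value.
// The 0 return value on disagreement is an assumption, not taken from the text.
inline std::uint64_t ref_match_all(const std::vector<int>& values, int& pred) {
    assert(!values.empty() && values.size() <= 64);
    const std::uint64_t mask = ref_match_any(values, 0);
    const std::uint64_t full = values.size() == 64
        ? ~std::uint64_t{0}
        : (std::uint64_t{1} << values.size()) - 1;
    pred = (mask == full) ? 1 : 0;
    return pred != 0 ? mask : 0;
}
```

With lane values {7, 3, 7, 3}, the model gives lanes 0 and 2 one mask and lanes 1 and 3 another, while `ref_match_all` reports that the group does not agree.
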
-class coalesced\_group : public cooperative\_groups:: thread\_group
-
-The coalesced\_group cooperative group type.
-
-Represents an active thread group in a wavefront. This group type also supports sub-wave level intrinsics.
-
-Note: This is implemented on Linux and is under development on Microsoft Windows.
-
-## 23.3 Cooperative groups construct functions
-
-The following functions are used to construct different group-type instances on the device side.
-
-- cooperative\_groups::this\_multi\_grid
-- cooperative\_groups::this\_grid
-- cooperative\_groups::this\_thread\_block
-- cooperative\_groups::coalesced\_threads
-- cooperative\_groups::tiled\_partition
-- cooperative\_groups::binary\_partition
-| 6.1.2/docs/doxygen/xml | 6.1.2/docs/doxygen/xml | 6.1.2/docs/doxygen/xml | 6.1.2/docs/doxygen/xml | 6.1.2/docs/doxygen/xml | 6.1.2/docs/doxygen/xml | 6.1.2/docs/doxygen/xml | 6.1.2/docs/doxygen/xml | 6.1.2/docs/doxygen/xml | 6.1.2/docs/doxygen/xml | 6.1.2/docs/doxygen/xml |
-
-| Warning: | Warning: | doxygenfunction: | doxygenfunction: | doxygenfunction: | Cannot find | Cannot find | function 'cooperative_groups::binary_partition' 6.1.40092 | function 'cooperative_groups::binary_partition' 6.1.40092 | function 'cooperative_groups::binary_partition' 6.1.40092 |
-|---------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------|
-| in | doxygen | xml | output | for | project | 'HIP | Documentation' | from | directory: |
-| /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-hip/checkouts/docs- | /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-hip/checkouts/docs- | /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-hip/checkouts/docs- | /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-hip/checkouts/docs- | /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-hip/checkouts/docs- | /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-hip/checkouts/docs- | /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-hip/checkouts/docs- | /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-hip/checkouts/docs- | /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-hip/checkouts/docs- | /home/docs/checkouts/readthedocs.org/user_builds/advanced-micro-devices-hip/checkouts/docs- |
-| 6.1.2/docs/doxygen/xml | 6.1.2/docs/doxygen/xml | 6.1.2/docs/doxygen/xml | 6.1.2/docs/doxygen/xml | 6.1.2/docs/doxygen/xml | 6.1.2/docs/doxygen/xml | 6.1.2/docs/doxygen/xml | 6.1.2/docs/doxygen/xml | 6.1.2/docs/doxygen/xml | 6.1.2/docs/doxygen/xml |
-
-## 23.4 Cooperative groups exposed API functions
-
-The following functions are the exposed API for different group-type instances on the device side.
-
-- cooperative\_groups::group\_size
-- cooperative\_groups::thread\_rank
-- cooperative\_groups::is\_valid
-- cooperative\_groups::sync
-
-## Chapter 24: HSA Runtime API for ROCm
-
-The following functions are located in the https://github.com/ROCm/ROCR-Runtime repository.
-
-hsa\_status\_t hsa\_amd\_vmem\_address\_reserve ( void **va, size\_t size, uint64\_t address, uint64\_t flags )
-
-Allocate a reserved address range.
-
-Reserve a virtual address range. The size must be a multiple of the system page size. If it is not possible to allocate the address specified by address, then va will be a different address range. The address range should be released by calling hsa\_amd\_vmem\_address\_free.
-
-Note that this API will be deprecated in a future release and replaced by hsa\_amd\_vmem\_address\_reserve\_align.
-
-## Parameters
-
-- va -[out] virtual address allocated
-- size -[in] size of the address range requested
-- address -[in] requested address
-- flags -[in] currently unsupported
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS - Address range allocated successfully
-- ::HSA\_STATUS\_ERROR\_NOT\_INITIALIZED - The HSA runtime has not been initialized.
-- ::HSA\_STATUS\_ERROR\_OUT\_OF\_RESOURCES - Insufficient resources to allocate an address range of this size.
-
-hsa\_status\_t hsa\_amd\_vmem\_address\_free ( void *va, size\_t size )
-
-Free a reserved address range.
-
-Free a previously allocated address range. The size must match the size of a previously allocated address range.
-
-## Parameters
-
-- va -[in] virtual address to be freed
-- size -[in] size of the address range
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS - Address range released successfully
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - Invalid va specified
-- ::HSA\_STATUS\_ERROR\_INVALID\_ARGUMENT - Invalid size specified
-- ::HSA\_STATUS\_ERROR\_RESOURCE\_FREE - Address range is still in use
-- ::HSA\_STATUS\_ERROR - Internal unexpected error
-
-hsa\_status\_t hsa\_amd\_vmem\_handle\_create ( hsa\_amd\_memory\_pool\_t pool, size\_t size, hsa\_amd\_memory\_type\_t type, uint64\_t flags, hsa\_amd\_vmem\_alloc\_handle\_t *memory\_handle )
-
-Create a virtual memory handle.
-
-Create a virtual memory handle within this pool. size must be aligned to the allocation granule size for this memory pool; see HSA\_AMD\_MEMORY\_POOL\_INFO\_RUNTIME\_ALLOC\_GRANULE. To minimize internal memory fragmentation, align the size to the recommended allocation granule size; see HSA\_AMD\_MEMORY\_POOL\_INFO\_RUNTIME\_ALLOC\_REC\_GRANULE.
-
-## Parameters
-
-- pool -[in] memory pool to use
-- size -[in] size of the memory allocation
-- type -[in] type of memory
-- flags -[in] currently unsupported
-- memory\_handle -[out] handle for the allocation
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS - memory allocated successfully
-- ::HSA\_STATUS\_ERROR\_NOT\_INITIALIZED - The HSA runtime has not been initialized.
-- ::HSA\_STATUS\_ERROR\_INVALID\_ARGUMENT - Invalid arguments
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - This memory pool does not support allocations
-- ::HSA\_STATUS\_ERROR\_OUT\_OF\_RESOURCES - Insufficient resources to allocate this memory
-
-hsa\_status\_t hsa\_amd\_vmem\_handle\_release ( hsa\_amd\_vmem\_alloc\_handle\_t memory\_handle )
-
-Release a virtual memory handle.
-
-## Parameters
-
-- memory\_handle -[in] memory handle that was previously allocated
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS - Memory handle released successfully
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - Invalid memory handle
-
-hsa\_status\_t hsa\_amd\_vmem\_map ( void *va, size\_t size, size\_t in\_offset, hsa\_amd\_vmem\_alloc\_handle\_t memory\_handle, uint64\_t flags )
-
-Map a virtual memory handle.
-
-Map a virtual memory handle to a reserved address range. The virtual address requested must be within a previously reserved address range: both va and (va + size) must fall within that range. size must be equal to the size of memory\_handle. hsa\_amd\_vmem\_set\_access needs to be called to make the memory accessible to specific agents.
-
-## Parameters
-
-- va -[in] virtual address range where memory will be mapped
-- size -[in] size of the memory mapping
-- in\_offset -[in] offset into memory; currently unsupported
-- memory\_handle -[in] virtual memory handle to be mapped
-- flags -[in] currently unsupported
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS - Memory mapped successfully
-- ::HSA\_STATUS\_ERROR\_INVALID\_ARGUMENT - va, size or memory\_handle are invalid
-- ::HSA\_STATUS\_ERROR\_OUT\_OF\_RESOURCES - Insufficient resources
-- ::HSA\_STATUS\_ERROR - Unexpected internal error
-
-hsa\_status\_t hsa\_amd\_vmem\_unmap ( void *va, size\_t size )
-
-Unmap a virtual memory handle.
-
-Unmap a previously mapped virtual address range.
-
-## Parameters
-
-- va -[in] virtual address range to be unmapped
-- size -[in] size of the memory mapping
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS - Memory backing unmapped successfully
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - memory\_handle is invalid
-- ::HSA\_STATUS\_ERROR\_INVALID\_ARGUMENT - size is invalid
-- ::HSA\_STATUS\_ERROR - Unexpected internal error
-
-hsa\_status\_t hsa\_amd\_vmem\_set\_access ( void *va, size\_t size, const hsa\_amd\_memory\_access\_desc\_t *desc, size\_t desc\_cnt )
-
-Make a memory mapping accessible.
-
-Make a previously mapped virtual address accessible to specific agents. size must be equal to the size of the previously mapped virtual memory handle. Calling hsa\_amd\_vmem\_set\_access multiple times on the same va will overwrite the previous permissions for all agents.
-
-## Parameters
-
-- va -[in] previously mapped virtual address
-- size -[in] size of the memory mapping
-- desc -[in] list of access permissions for each agent
-- desc\_cnt -[in] number of elements in desc
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS -
-- ::HSA\_STATUS\_ERROR\_INVALID\_ARGUMENT - va, size or memory\_handle are invalid
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - memory\_handle is invalid
-- ::HSA\_STATUS\_ERROR\_OUT\_OF\_RESOURCES - Insufficient resources
-- ::HSA\_STATUS\_ERROR\_INVALID\_AGENT - Invalid agent in desc
-- ::HSA\_STATUS\_ERROR - Unexpected internal error
-
-hsa\_status\_t hsa\_amd\_vmem\_get\_access ( void *va, hsa\_access\_permission\_t *perms, hsa\_agent\_t agent\_handle )
-
-Get current access permissions for memory mapping.
-
-Get access permissions for memory mapping for specific agent.
-
-## Parameters
-
-- va -[in] previously mapped virtual address
-- perms -[out] current permissions
-- agent\_handle -[in] agent
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS -
-- ::HSA\_STATUS\_ERROR\_INVALID\_AGENT - Invalid agent
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - va is not mapped or permissions never set for this agent
-- ::HSA\_STATUS\_ERROR - Unexpected internal error
-
-hsa\_status\_t hsa\_amd\_vmem\_export\_shareable\_handle ( int *dmabuf\_fd, hsa\_amd\_vmem\_alloc\_handle\_t handle, uint64\_t flags )
-
-Get an exportable shareable handle.
-
-Get an exportable shareable handle for a memory\_handle. This shareable handle can then be used to re-create a virtual memory handle using hsa\_amd\_vmem\_import\_shareable\_handle. The shareable handle can be transferred using mechanisms that support POSIX file descriptors. Once all shareable handles are closed, the memory\_handle is released.
-
-## Parameters
-
-- dmabuf\_fd -[out] shareable handle
-- handle -[in] previously allocated virtual memory handle
-- flags -[in] Currently unsupported
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS -
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - Invalid memory handle
-- ::HSA\_STATUS\_ERROR\_OUT\_OF\_RESOURCES - Out of resources
-- ::HSA\_STATUS\_ERROR - Unexpected internal error
-
-hsa\_status\_t hsa\_amd\_vmem\_import\_shareable\_handle ( int dmabuf\_fd, hsa\_amd\_vmem\_alloc\_handle\_t *handle )
-
-Import a shareable handle.
-
-Import a shareable handle for a memory handle. Importing a shareable handle that has been closed and released results in undefined behavior.
-
-## Parameters
-
-- dmabuf\_fd -[in] shareable handle exported with hsa\_amd\_vmem\_export\_shareable\_handle
-- handle -[out] virtual memory handle
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS -
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - Invalid memory handle
-- ::HSA\_STATUS\_ERROR\_OUT\_OF\_RESOURCES - Out of resources
-- ::HSA\_STATUS\_ERROR - Unexpected internal error
-
-hsa\_status\_t hsa\_amd\_vmem\_retain\_alloc\_handle ( hsa\_amd\_vmem\_alloc\_handle\_t *memory\_handle, void *addr )
-
-Returns memory handle for mapped memory.
-
-Return a memory handle for previously mapped memory. The handle will have the same value as the handle used to map the memory. The returned handle must be released with a corresponding number of calls to hsa\_amd\_vmem\_handle\_release.
-
-## Parameters
-
-- memory\_handle -[out] memory handle for this mapped address
-- addr -[in] mapped address
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS -
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - Invalid address
-
-hsa\_status\_t hsa\_amd\_vmem\_get\_alloc\_properties\_from\_handle ( hsa\_amd\_vmem\_alloc\_handle\_t memory\_handle, hsa\_amd\_memory\_pool\_t *pool, hsa\_amd\_memory\_type\_t *type )
-
-Returns the current allocation properties of a handle.
-
-Returns the allocation properties of an existing handle.
-
-## Parameters
-
-- memory\_handle -[in] memory handle to be queried
-- pool -[out] memory pool that owns this handle
-- type -[out] memory type
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS -
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - Invalid memory\_handle
-
-## Chapter 25: HIP Managed Memory Allocation API
-
-hipError\_t hipMallocManaged ( void **dev\_ptr, size\_t size, unsigned int flags )
-
-Allocates memory that will be automatically managed by HIP.
-
-This API is used for managed memory and allows data to be shared and accessible to both the CPU and the GPU using a single pointer.
-
-The API returns the allocation pointer, managed by HMM, which can then be used to execute kernels on the device and to fetch data between the host and the device as needed.
-
-Note: It is recommended to perform the capability check before calling this API.
-
-## Parameters
-
-- dev\_ptr -[out] - pointer to allocated device memory
-- size -[in] - requested allocation size in bytes; it should be a multiple of 4 KB
-- flags -[in] - must be either hipMemAttachGlobal or hipMemAttachHost (defaults to hipMemAttachGlobal)
-
-## Returns
-
-hipSuccess, hipErrorMemoryAllocation, hipErrorNotSupported, hipErrorInvalidValue
-
-hipError\_t hipMemPrefetchAsync ( const void *dev\_ptr, size\_t count, int device, hipStream\_t stream )
-
-Prefetches memory to the specified destination device using HIP.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-## Parameters
-
-- dev\_ptr -[in] pointer to be prefetched
-- count -[in] size in bytes for prefetching
-- device -[in] destination device to prefetch to
-- stream -[in] stream to enqueue prefetch operation
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue
-
-hipError\_t hipMemAdvise ( const void *dev\_ptr, size\_t count, hipMemoryAdvise advice, int device )
-
-Advise about the usage of a given memory range to HIP.
-
-This HIP API advises about the usage to be applied to a unified memory allocation in the range starting at the pointer address dev\_ptr, with a size of count bytes. The memory range must refer to managed memory allocated via hipMallocManaged. The driver rounds the start of the range down and the end of the range up so that the range is aligned to the CPU page size, in the same way the corresponding CUDA API behaves in CUDA version 8.0 and later.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-## Parameters
-
-- dev\_ptr -[in] pointer to memory to set the advice for
-- count -[in] size in bytes of the memory range; it should be aligned to the CPU page size.
-- advice -[in] advice to be applied for the specified memory range
-- device -[in] device to apply the advice for
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue
-
-hipError\_t hipMemRangeGetAttribute ( void *data, size\_t data\_size, hipMemRangeAttribute attribute, const void *dev\_ptr, size\_t count )
-
-Query an attribute of a given memory range in HIP.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-## Parameters
-
-- data -[inout] a pointer to a memory location where the result of each attribute query will be written to
-- data\_size -[in] the size of data
-- attribute -[in] the attribute to query
-- dev\_ptr -[in] start of the range to query
-- count -[in] size of the range to query
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue
-
-hipError\_t hipMemRangeGetAttributes ( void **data, size\_t *data\_sizes, hipMemRangeAttribute *attributes, size\_t num\_attributes, const void *dev\_ptr, size\_t count )
-
-Query attributes of a given memory range in HIP.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-## Parameters
-
-- data -[inout] a two-dimensional array containing pointers to memory locations where the result of each attribute query will be written to
-- data\_sizes -[in] an array, containing the sizes of each result
-- attributes -[in] an array of attributes to query (num\_attributes and the number of attributes in this array should match)
-- num\_attributes -[in] the number of attributes to query
-- dev\_ptr -[in] start of the range to query
-- count -[in] size of the range to query
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue
-
-hipError\_t hipStreamAttachMemAsync ( hipStream\_t stream, void *dev\_ptr, size\_t length, unsigned int flags )
-
-Attach memory to a stream asynchronously in HIP.
-
-Warning: This API is under development. Currently it is a no-operation (NOP) function on AMD GPUs and returns hipSuccess.
-
-## Parameters
-
-- stream -[in] - stream in which to enqueue the attach operation
-- dev\_ptr -[in] - pointer to memory (must be a pointer to managed memory or to a valid host-accessible region of system-allocated memory)
-- length -[in] - length of memory (defaults to zero)
-- flags -[in] - must be one of hipMemAttachGlobal, hipMemAttachHost or hipMemAttachSingle (defaults to hipMemAttachSingle)
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue
-
-template<class T> static inline hipError\_t hipMallocManaged ( T **devPtr, size\_t size, unsigned int flags = hipMemAttachGlobal )
-
-C++ wrapper for hipMallocManaged.
-
-Provide an override to automatically typecast the pointer type from void**, and also provide a default for the flags.
-
-The HIP\_DISABLE\_CPP\_FUNCTIONS macro can be defined to suppress these wrappers. This is useful for applications that need to obtain decltypes of HIP runtime APIs.
-
-## See also:
-
-hipMallocManaged
-
-## Chapter 26: HIP Virtual Memory Management API
-
-hipError\_t hipMemAddressFree ( void *devPtr, size\_t size )
-
-Frees an address range reservation made via hipMemAddressReserve.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- devPtr -[in] - starting address of the range.
-- size -[in] - size of the range.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemAddressReserve ( void **ptr, size\_t size, size\_t alignment, void *addr, unsigned long long flags )
-
-Reserves an address range.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- ptr -[out] - starting address of the reserved range.
-- size -[in] - size of the reservation.
-- alignment -[in] - alignment of the address.
-- addr -[in] - requested starting address of the range.
-- flags -[in] - currently unused, must be zero.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemCreate ( hipMemGenericAllocationHandle\_t *handle, size\_t size, const hipMemAllocationProp *prop, unsigned long long flags )
-
-Creates a memory allocation described by the properties and size.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- handle -[out] - value of the returned handle.
-- size -[in] - size of the allocation.
-- prop -[in] - properties of the allocation.
-- flags -[in] - currently unused, must be zero.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemExportToShareableHandle ( void *shareableHandle, hipMemGenericAllocationHandle\_t handle, hipMemAllocationHandleType handleType, unsigned long long flags )
-
-Exports an allocation to a requested shareable handle type.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- shareableHandle -[out] - value of the returned handle.
-- handle -[in] - handle to share.
-- handleType -[in] - type of the shareable handle.
-- flags -[in] - currently unused, must be zero.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemGetAccess ( unsigned long long *flags, const hipMemLocation *location, void *ptr )
-
-Get the access flags set for the given location and ptr.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- flags -[out] - flags for this location.
-- location -[in] - target location.
-- ptr -[in] - address to check the access flags.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemGetAllocationGranularity ( size\_t *granularity, const hipMemAllocationProp *prop, hipMemAllocationGranularity\_flags option )
-
-Calculates either the minimal or recommended granularity.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- granularity -[out] - returned granularity.
-- prop -[in] - location properties.
-- option -[in] - determines which granularity to return.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemGetAllocationPropertiesFromHandle ( hipMemAllocationProp *prop, hipMemGenericAllocationHandle\_t handle )
-
-Retrieve the property structure of the given handle.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- prop -[out] - properties of the given handle.
-- handle -[in] - handle to perform the query on.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemImportFromShareableHandle ( hipMemGenericAllocationHandle\_t *handle, void *osHandle, hipMemAllocationHandleType shHandleType )
-
-Imports an allocation from a requested shareable handle type.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- handle -[out] - returned value.
-- osHandle -[in] - shareable handle representing the memory allocation.
-- shHandleType -[in] - handle type.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemMap ( void *ptr, size\_t size, size\_t offset, hipMemGenericAllocationHandle\_t handle, unsigned long long flags )
-
-Maps an allocation handle to a reserved virtual address range.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- ptr -[in] - address where the memory will be mapped.
-- size -[in] - size of the mapping.
-- offset -[in] - offset into the memory, currently must be zero.
-- handle -[in] - memory allocation to be mapped.
-- flags -[in] - currently unused, must be zero.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemMapArrayAsync ( hipArrayMapInfo *mapInfoList, unsigned int count, hipStream\_t stream )
-
-Maps or unmaps subregions of sparse HIP arrays and sparse HIP mipmapped arrays.
-
-Warning: This API is under development. Currently it is not supported on AMD GPUs and returns hipErrorNotSupported.
-
-## Parameters
-
-- mapInfoList -[in] - list of hipArrayMapInfo.
-- count -[in] - number of hipArrayMapInfo in mapInfoList.
-- stream -[in] - stream identifier for the stream to use for map or unmap operations.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemRelease ( hipMemGenericAllocationHandle\_t handle )
-
-Release a memory handle representing a memory allocation which was previously allocated through hipMemCreate.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- handle -[in] - handle of the memory allocation.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemRetainAllocationHandle ( hipMemGenericAllocationHandle\_t *handle, void *addr )
-
-Returns the allocation handle of the backing memory allocation given the address.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- handle -[out] - handle representing addr.
-- addr -[in] - address to look up.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemSetAccess ( void *ptr, size\_t size, const hipMemAccessDesc *desc, size\_t count )
-
-Set the access flags for each location specified in desc for the given virtual address range.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- ptr -[in] - starting address of the virtual address range.
-- size -[in] - size of the range.
-- desc -[in] - array of hipMemAccessDesc.
-- count -[in] - number of hipMemAccessDesc in desc.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemUnmap ( void *ptr, size\_t size )
-
-Unmap memory allocation of a given address range.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- ptr -[in] - starting address of the range to unmap.
-- size -[in] - size of the virtual address range.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-## CHAPTER
-
-## TWENTYSEVEN
-
-## HIP DEPRECATED RUNTIME API FUNCTIONS
-
-Several HIP API functions have been flagged for deprecation. Using the following functions can result in errors and unexpected results, so we encourage you to update your code accordingly.
-
-## 27.1 Context management
-
-CUDA supports the cuCtx API, which is the driver API that defines 'Context' and 'Devices' as separate entities. A context contains a single device, and a device can theoretically have multiple contexts. HIP initially added limited support for these APIs to facilitate porting from existing driver codes. These APIs are now marked as deprecated because there are better alternative interfaces (such as hipSetDevice or the stream API) to achieve the same functionality.
-
-- hipCtxCreate
-- hipCtxDestroy
-- hipCtxPopCurrent
-- hipCtxPushCurrent
-- hipCtxSetCurrent
-- hipCtxGetCurrent
-- hipCtxGetDevice
-- hipCtxGetApiVersion
-- hipCtxGetCacheConfig
-- hipCtxSetCacheConfig
-- hipCtxSetSharedMemConfig
-- hipCtxGetSharedMemConfig
-- hipCtxSynchronize
-- hipCtxGetFlags
-- hipCtxEnablePeerAccess
-- hipCtxDisablePeerAccess
-- hipDevicePrimaryCtxGetState
-- hipDevicePrimaryCtxRelease
-- hipDevicePrimaryCtxRetain
-- hipDevicePrimaryCtxReset
-- hipDevicePrimaryCtxSetFlags
-
-## 27.2 Memory management
-
-- hipMallocHost (replaced with hipHostMalloc )
-- hipMemAllocHost (replaced with hipHostMalloc )
-- hipHostAlloc (replaced with hipHostMalloc )
-- hipFreeHost (replaced with hipHostFree )
-- hipMemcpyToArray
-- hipMemcpyFromArray
-
-## 27.3 Profiler control
-
-- hipProfilerStart (use roctracer/rocTX)
-- hipProfilerStop (use roctracer/rocTX)
-
-## 27.4 Texture management
-
-- hipGetTextureReference
-- hipTexRefSetAddressMode
-- hipTexRefSetArray
-- hipTexRefSetFilterMode
-- hipTexRefSetFlags
-- hipTexRefSetFormat
-- hipTexRefGetAddress
-- hipTexRefGetAddressMode
-- hipTexRefGetFilterMode
-- hipTexRefGetFlags
-- hipTexRefGetFormat
-- hipTexRefGetMaxAnisotropy
-- hipTexRefGetMipmapFilterMode
-- hipTexRefGetMipmapLevelBias
-- hipTexRefGetMipmapLevelClamp
-- hipTexRefGetMipMappedArray
-- hipTexRefSetAddress
-- hipTexRefSetAddress2D
-- hipTexRefSetMaxAnisotropy
-- hipTexRefSetBorderColor
-- hipTexRefSetMipmapFilterMode
-- hipTexRefSetMipmapLevelBias
-- hipTexRefSetMipmapLevelClamp
-- hipTexRefSetMipmappedArray
-- hipTexRefGetBorderColor
-- hipTexRefGetArray
-- hipBindTexture
-- hipBindTexture2D
-- hipBindTextureToArray
-- hipGetTextureAlignmentOffset
-- hipUnbindTexture
-- hipBindTextureToMipmappedArray
-
-## CHAPTER
-
-## TWENTYEIGHT
-
-## SAXPY - HELLO, HIP
-
-This tutorial explains the basic concepts of the single-source Heterogeneous-computing Interface for Portability (HIP) programming model and the essential tooling around it. It also reviews some commonalities of heterogeneous APIs in general. This topic assumes basic familiarity with the C/C++ compilation model and language.
-
-## 28.1 Prerequisites
-
-To follow this tutorial, you'll need installed drivers and a HIP compiler toolchain to compile your code. Because HIP for ROCm supports compiling and running on Linux and Windows with AMD and NVIDIA GPUs, the combinations of install instructions are more than is worth covering as part of this tutorial. For more information about installing HIP development packages, see Install HIP.
-
-## 28.2 Heterogeneous programming
-
-Heterogeneous programming and offloading APIs are often mentioned together. Heterogeneous programming deals with devices of varying capabilities simultaneously. Offloading focuses on the 'remote' and asynchronous aspects of computation. HIP encompasses both. It exposes GPGPU (general-purpose GPU) programming much like ordinary host-side CPU programming and lets you move data across various devices.
-
-When programming in HIP (and other heterogeneous APIs for that matter), remember that target devices are built for a specific purpose. They are designed with different tradeoffs than traditional CPUs and therefore have very different performance characteristics. Even subtle changes in code might adversely affect execution time.
-
-## 28.3 Your first lines of HIP code
-
-First, let's do the 'Hello, World!' of GPGPU: SAXPY. Single-precision A times X Plus Y ( SAXPY ) is a mathematical acronym; a vector equation 𝑎 · 𝑥 + 𝑦 = 𝑧 where 𝑎 ∈ R is a scalar and 𝑥, 𝑦, 𝑧 ∈ V are vector quantities of some large dimensionality. This vector space is defined over the set of reals. Practically speaking, you can compute this using a single for loop over three arrays.
-**Following code does:** This loop computes SAXPY on the host: for each index `i`, it multiplies `x[i]` by the scalar `a`, adds `y[i]`, and stores the result in `z[i]`.
-
-
-```
-for (int i = 0; i < N; ++i)
-    z[i] = a * x[i] + y[i];
-```
-
-In linear algebra libraries, such as BLAS (Basic Linear Algebra Subprograms), this operation is defined as AXPY, 'A times X Plus Y'. The 'S' comes from single-precision, meaning that each array element is a float (IEEE 754 binary32 representation).
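-
-The single-loop computation described above can be wrapped into a small host-only reference function. This is a minimal sketch in plain C++ rather than code from the samples repository; the name `saxpy_host` and the use of `std::vector` are assumptions made for illustration.
-
-```cpp
-#include <cassert>
-#include <cstddef>
-#include <vector>
-
-// Hypothetical host-side reference: returns z = a * x + y element-wise.
-std::vector<float> saxpy_host(float a, const std::vector<float>& x, const std::vector<float>& y)
-{
-    assert(x.size() == y.size()); // both inputs must have the same length
-    std::vector<float> z(x.size());
-    for (std::size_t i = 0; i < x.size(); ++i)
-        z[i] = a * x[i] + y[i];
-    return z;
-}
-```
-
-A reference like this can be handy for checking the output of a device implementation against a known-good host result.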
-
-To quickly get started, use the set of HIP samples from GitHub. With Git configured on your machine, open a commandline and navigate to your desired working directory, then run:
-**Following code does:** This command clones the HIP samples repository from GitHub into the current working directory.
-
-
-```
-git clone https://github.com/amd/rocm-examples.git
-```
-
-A simple implementation of SAXPY resides in the HIP-Basic/saxpy/main.hip file in this repository. The HIP code here mostly deals with where data has to be and when, and how devices transform this data. The first HIP calls deal with allocating device-side memory and copying data from host-side memory to device side in a C runtime-like fashion.
-**Following code does:** This snippet declares two device pointers, allocates `size_bytes` of device memory for each with `hipMalloc`, and copies the host vectors `x` and `y` into those buffers with `hipMemcpy`. Every call is wrapped in `HIP_CHECK` so that a failing call is reported.
-
-
-```
-// Allocate and copy vectors to device memory.
-float* d_x{};
-float* d_y{};
-HIP_CHECK(hipMalloc(&d_x, size_bytes));
-HIP_CHECK(hipMalloc(&d_y, size_bytes));
-HIP_CHECK(hipMemcpy(d_x, x.data(), size_bytes, hipMemcpyHostToDevice));
-HIP_CHECK(hipMemcpy(d_y, y.data(), size_bytes, hipMemcpyHostToDevice));
-```
-
-HIP\_CHECK is a custom macro borrowed from the examples utilities that checks the error code returned by API functions and reports any errors to the console. It is not essential to the API, but it is good practice to check the error codes of HIP API calls in case you pass incorrect values to the API or the API runs out of resources.
-
-The code does not explicitly select the device to allocate on and copy to. Commands are issued to the HIP runtime per thread, and every thread has a device set as the target of commands. The default device is 0, which is equivalent to calling hipSetDevice(0).
-
-Launch the calculation on the device after the input data has been prepared.
-**Following code does:** This snippet outlines the device-side part of the program: the declaration of the `saxpy_kernel` kernel and the point in `main` where it is launched on the default stream. The kernel body is elided, and the launch statement is cut off in this excerpt.
-
-
-```
-__global__ void saxpy_kernel(const float a, const float* d_x, float* d_y, const unsigned int size)
-{
-    //...
-}
-
-int main()
-{
-    //...
-
-    // Launch the kernel on the default stream.
-    saxpy_kernel<<
-```
-
-## 12.2.1 Debugging HIP applications
-
-The following Linux example shows how to get useful information from the debugger while running a simple memory copy test, which caused a segmentation fault issue.
-
-```
-HIP Documentation, Release 6.1.40092
-```
-
-On Windows , you can set AMD\_LOG\_LEVEL via environment variable from the advanced system settings or the command prompt (when run as administrator). The following example shows debug log information when calling the backend runtime.
-
-The GCN ISA is inserted into the kernel using the asm() assembler statement. The volatile keyword ensures that the optimizer does not change the number of volatile operations or their order of execution relative to other volatile operations. v\_mac\_f32\_e32 is the GCN instruction; for more information, refer to the [AMD GCN3 ISA architecture manual](http://gpuopen.com/compute-product/amd-gcn3-isa-architecture-manual/). Each operand is referenced by % followed by its position in the list of operands. 'v' is the constraint code (target-specific for AMDGPU) for a 32-bit VGPR register; for more information, refer to the [Supported Constraint Code List for AMDGPU](https://llvm.org/docs/LangRef.html#supported-constraint-code-list). Output constraints are specified by an '=' prefix ('=v'). This indicates that the assembly will write to this operand, and the operand will then be made available as a return value of the asm expression. Input constraints do not have a prefix, just the constraint code. The constraint string '0' says to use the register assigned to the output as an input as well (it being the 0th constraint).
-
-## C++ Support
-
-The following C++ features are not supported:
-
-- Run-time-type information (RTTI)
-- Try/catch
-- Virtual functions
-
-Virtual functions are not supported if objects containing virtual function tables are passed between GPUs of different offload architectures, e.g. between gfx906 and gfx1030. Otherwise, virtual functions are supported.
-
-## 19.27 Kernel Compilation
-
-hipcc now supports compiling C++/HIP kernels to binary code objects. The file format for binary is .co which means Code Object. The following command builds the code object using hipcc .
-
-```
-hipcc --genco --offload-arch=[TARGET GPU] [INPUT FILE] -o [OUTPUT FILE]
-
-[TARGET GPU] = GPU architecture
-[INPUT FILE] = Name of the file containing kernels
-[OUTPUT FILE] = Name of the generated code object file
-```
-
-Note: When using binary code objects, the number of arguments to the kernel is different on the HIP-Clang and NVCC paths. Refer to the HIP module\_api sample for differences in the arguments to be passed to the kernel.
-
-## 19.28 gfx-arch-specific-kernel
-
-Clang-defined '\_\_gfx*\_\_' macros can be used to execute gfx-architecture-specific code inside the kernel. Refer to the HIP 14\_gpu\_arch sample.
-
-## CHAPTER
-
-## TWENTY
-
-## C++ LANGUAGE SUPPORT
-
-The ROCm platform enables the power of combined C++ and HIP (Heterogeneous-computing Interface for Portability) code. This code is compiled with a clang or clang++ compiler. The official compilers support the HIP platform, or you can use the amdclang or amdclang++ compilers included in the ROCm installation, which are wrappers for the official versions.
-
-The source code is compiled according to the C++03, C++11, C++14, C++17, and C++20 standards, along with HIP-specific extensions, but is subject to restrictions. The key restriction is the reduced support of the standard library in device code. This is because, by default, a function is considered to run on the host, except for constexpr functions, which can run on both host and device.
-
-## 20.1 Modern C++ support
-
-C++ is considered a modern programming language as of C++11. This section describes how HIP supports these new C++ features.
-
-## 20.1.1 C++11 support
-
-The C++11 standard introduced many new features. These features are supported in HIP host code, with some notable omissions on the device side. The rule of thumb here is that constexpr functions work on device, the rest doesn't. This means that some important functionality like std::function is missing on the device, but unfortunately the standard library wasn't designed with HIP in mind, which means that the support is in a state of 'works as-is'.
-
-Certain features have restrictions and clarifications. For example, any functions using the constexpr qualifier or the new initializer lists , std::move or std::forward features are implicitly considered to have the \_\_host\_\_ and \_\_device\_\_ execution space specifier. Also, constexpr variables that are static members or namespace scoped can be used from both host and device, but only for read access. Dereferencing a static constexpr outside its specified execution space causes an error.
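-
-The `constexpr` rule can be illustrated with a short sketch. It is written in plain C++ so that it compiles without a HIP toolchain; the function name `mix` is hypothetical, and the comment about device use restates the rule above rather than demonstrating it.
-
-```cpp
-// Hypothetical helper: linear interpolation between a and b.
-// Because it is constexpr, HIP would treat it as implicitly __host__ __device__,
-// so the same definition could be called from a kernel (assumption: compiled as HIP code).
-constexpr float mix(float a, float b, float t)
-{
-    return a + t * (b - a);
-}
-
-// Evaluated at compile time on the host.
-static_assert(mix(0.0f, 10.0f, 0.5f) == 5.0f, "mix is usable in constant expressions");
-```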
-
-Lambdas are supported, but there are some extensions and restrictions on their usage. For more information, see the Extended lambdas section below.
-
-## 20.1.2 C++14 support
-
-The C++14 language features are supported.
-
-## 20.1.3 C++17 support
-
-All C++17 language features are supported.
-
-## 20.1.4 C++20 support
-
-All C++20 language features are supported, but extensions and restrictions apply. C++20 introduced coroutines and modules, which fundamentally changed how programs are written. HIP doesn't support these features. However, consteval functions can be called from host and device, even if specified for host use only.
-
-The three-way comparison operator (spaceship operator <=> ) works with host and device code.
-
-## 20.2 Extensions and restrictions
-
-In addition to the deviations from the standard, there are some general extensions and restrictions to consider.
-
-## 20.2.1 Global functions
-
-Functions that serve as an entry point for device execution are called kernels and are specified with the \_\_global\_\_ qualifier. To call a kernel function, use the triple chevron operator: <<< >>> . Kernel functions must have a void return type. These functions can't:
-
-- have a constexpr specifier
-- have a parameter of type std::initializer\_list or va\_list
-- use an rvalue reference as a parameter.
-- use parameters having different sizes in host and device code, e.g. long double arguments, or structs containing long double members.
-- use struct-type arguments which have different layout in host and device code.
-
-Kernels can have variadic template parameters, but only one parameter pack, which must be the last item in the template parameter list.
-
-## 20.2.2 Device space memory specifiers
-
-HIP includes device space memory specifiers to indicate whether a variable is allocated in host or device memory and how its memory should be allocated. HIP supports the \_\_device\_\_ , \_\_shared\_\_ , \_\_managed\_\_ , and \_\_constant\_\_ specifiers.
-
-The \_\_device\_\_ and \_\_constant\_\_ specifiers define global variables, which are allocated within global memory on the HIP devices. The only difference is that \_\_constant\_\_ variables can't be changed after allocation. The \_\_shared\_\_ specifier allocates the variable within shared memory, which is available for all threads in a block.
-
-The \_\_managed\_\_ variable specifier creates global variables that are initially undefined and unaddressed within the global symbol table. The HIP runtime allocates managed memory and defines the symbol when it loads the device binary. A managed variable can be accessed in both device and host code.
-
-It's important to know where a variable is stored because it is only available from certain locations. Generally, variables allocated in host memory are not accessible from device code, while variables allocated in device memory are not directly accessible from host code. Dereferencing a pointer to device memory on the host results in a segmentation fault. Accessing device variables in host code should be done through kernel execution or HIP functions like hipMemcpyToSymbol .
-
-## 20.2.3 Exception handling
-
-An important difference between the host and device code is exception handling. In device code, this control flow isn't available due to the hardware architecture. The device code must use return codes to handle errors.
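-
-The return-code style can be sketched as follows. This is an illustrative pattern in plain C++, not an API provided by HIP; the `Status` enumeration and the `safe_divide` function are hypothetical names. In HIP code, a function like this could additionally be marked \_\_device\_\_ .
-
-```cpp
-// Hypothetical status codes; not part of the HIP API.
-enum class Status : int { Ok = 0, DivideByZero = 1 };
-
-// Reports failure through the return value instead of throwing an exception.
-// The result is written to *out only when the operation succeeds.
-inline Status safe_divide(float num, float den, float* out)
-{
-    if (den == 0.0f)
-        return Status::DivideByZero;
-    *out = num / den;
-    return Status::Ok;
-}
-```
-
-The caller checks the returned status before using the output, which takes the place of a try/catch block.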
-
-## 20.2.4 Kernel parameters
-
-There are some restrictions on kernel function parameters. They cannot be passed by reference, because these functions are called from the host but run on the device. Also, a variable number of arguments is not allowed.
-
-## 20.2.5 Classes
-
-Classes work on both the host and device side, but there are some constraints. Static member functions can't be \_\_global\_\_ . Virtual member functions work, but a virtual function must not be called from the host if the parent object was created on the device, or the other way around, because this behavior is undefined. Another minor restriction is that \_\_device\_\_ variables that are globally scoped must have trivial constructors.
-
-## 20.2.6 Polymorphic function wrappers
-
-HIP doesn't support the polymorphic function wrapper std::function , which was introduced in C++11.
-
-## 20.2.7 Extended lambdas
-
-HIP supports lambdas, which by default work as expected.
-
-Lambdas have implicit host device attributes. This means that they can be executed by both host and device code, and work the way you would expect. To make a lambda callable only by host or device code, add the \_\_host\_\_ or \_\_device\_\_ attribute. The only restriction is that host variables can only be accessed through copy on the device. Accessing them through reference causes undefined behavior.
-
-## 20.2.8 Inline namespaces
-
-Inline namespaces are supported, but with a few exceptions. The following entities can't be declared in namespace scope within an inline unnamed namespace:
-
-- \_\_managed\_\_ , \_\_device\_\_ , \_\_shared\_\_ and \_\_constant\_\_ variables
-- \_\_global\_\_ functions and function templates
-- variables with surface or texture type
-
-## CHAPTER
-
-## TWENTYONE
-
-## HIP MATH API
-
-HIP-Clang supports a set of math operations that are callable from the device. HIP supports most of the device functions supported by NVIDIA CUDA. These are described in the following sections.
-
-## 21.1 Single precision mathematical functions
-
-Following is the list of supported single precision mathematical functions.
-
-Table 1: Single precision mathematical functions
-
-| Function | Supported on Host | Supported on Device |
-|----------------------------------------------------------------------------|---------------------|-----------------------|
-| float abs(float x) Returns the absolute value of 𝑥 | ✓ | ✓ |
-| float acosf(float x) Returns the arc cosine of 𝑥 . | ✓ | ✓ |
-| float acoshf(float x) Returns the nonnegative arc hyperbolic cosine of 𝑥 . | ✓ | ✓ |
-| float asinf(float x) Returns the arc sine of 𝑥 . | ✓ | ✓ |
-| float asinhf(float x) Returns the arc hyperbolic sine of 𝑥 . | ✓ | ✓ |
-| float atanf(float x) Returns the arc tangent of 𝑥 . | ✓ | ✓ |
-| float atan2f(float x, float y) Returns the arc tangent of the ratio of 𝑥 and 𝑦 . | ✓ | ✓ |
-| float atanhf(float x) Returns the arc hyperbolic tangent of 𝑥 . | ✓ | ✓ |
-| float cbrtf(float x) Returns the cube root of 𝑥 . | ✓ | ✓ |
-| float ceilf(float x) Returns ceiling of 𝑥 . | ✓ | ✓ |
-| float copysignf(float x, float y) Create value with given magnitude, copying sign of second value. | ✓ | ✓ |
-| float cosf(float x) Returns the cosine of 𝑥 . | ✓ | ✓ |
-| float coshf(float x) Returns the hyperbolic cosine of 𝑥 . | ✓ | ✓ |
-| float cospif(float x) Returns the cosine of 𝜋 · 𝑥 . | ✓ | ✓ |
-| float cyl_bessel_i0f(float x) Returns the value of the regular modified cylindrical Bessel function of order 0 for 𝑥 . | | |
-| float cyl_bessel_i1f(float x) Returns the value of the regular modified cylindrical Bessel function of order 1 for 𝑥 . | | |
-| float erff(float x) Returns the error function of 𝑥 . | ✓ | ✓ |
-| float erfcf(float x) Returns the complementary error function of 𝑥 . | ✓ | ✓ |
-| float erfcinvf(float x) Returns the inverse complementary error function of 𝑥 . | ✓ | ✓ |
-| float erfcxf(float x) Returns the scaled complementary error function of 𝑥 . | ✓ | ✓ |
-| float erfinvf(float x) Returns the inverse error function of 𝑥 . | ✓ | ✓ |
-| float expf(float x) Returns 𝑒^𝑥 . | ✓ | ✓ |
-| float exp10f(float x) Returns 10^𝑥 . | ✓ | ✓ |
-| float exp2f(float x) Returns 2^𝑥 . | ✓ | ✓ |
-| float expm1f(float x) Returns 𝑒^𝑥 - 1 . | ✓ | ✓ |
-| float fabsf(float x) Returns the absolute value of 𝑥 . | ✓ | ✓ |
-| float fdimf(float x, float y) Returns the positive difference between 𝑥 and 𝑦 . | ✓ | ✓ |
-| float fdividef(float x, float y) Divide two floating point values. | ✓ | ✓ |
-| float floorf(float x) Returns the largest integer less than or equal to 𝑥 . | ✓ | ✓ |
-| float fmaf(float x, float y, float z) Returns 𝑥 · 𝑦 + 𝑧 as a single operation. | ✓ | ✓ |
-| float fmaxf(float x, float y) Determine the maximum numeric value of 𝑥 and 𝑦 . | ✓ | ✓ |
-| float fminf(float x, float y) Determine the minimum numeric value of 𝑥 and 𝑦 . | ✓ | ✓ |
-| float fmodf(float x, float y) Returns the floating-point remainder of 𝑥/𝑦 . | ✓ | ✓ |
-| float modff(float x, float* iptr) Break down 𝑥 into fractional and integral parts. | ✓ | |
-| float frexpf(float x, int* nptr) Extract mantissa and exponent of 𝑥 . | ✓ |
-| float hypotf(float x, float y) Returns the square root of the sum of squares of 𝑥 and 𝑦 . | ✓ |
-| int ilogbf(float x) Returns the unbiased integer exponent of 𝑥 . | ✓ |
-| bool isfinite(float x) Determine whether 𝑥 is finite. | ✓ |
-| bool isinf(float x) Determine whether 𝑥 is infinite. | ✓ |
-| bool isnan(float x) Determine whether 𝑥 is a NAN . | ✓ |
-| float j0f(float x) Returns the value of the Bessel function of the first kind of order 0 for 𝑥 . | ✓ |
-| float j1f(float x) Returns the value of the Bessel function of the first kind of order 1 for 𝑥 . | ✓ |
-| float jnf(int n, float x) Returns the value of the Bessel function of the first kind of order n for 𝑥 . | ✓ |
-| float ldexpf(float x, int exp) Returns 𝑥 multiplied by 2 raised to the power exp. | ✓ | ✓ |
-| float lgammaf(float x) Returns the natural logarithm of the absolute value of the gamma function of 𝑥 . | ✓ | |
-| long int lrintf(float x) Round 𝑥 to nearest integer value. | ✓ | ✓ |
-| long long int llrintf(float x) Round 𝑥 to nearest integer value. | ✓ | ✓ |
-| long int lroundf(float x) Round to nearest integer value. | ✓ | ✓ |
-| long long int llroundf(float x) Round to nearest integer value. | ✓ | ✓ |
-| float log10f(float x) Returns the base 10 logarithm of 𝑥 . | ✓ | ✓ |
-| float log1pf(float x) Returns the natural logarithm of 𝑥 +1 . | ✓ | ✓ |
-| float log2f(float x) Returns the base 2 logarithm of 𝑥 . | ✓ | ✓ |
-| float logf(float x) Returns the natural logarithm of 𝑥 . | ✓ | ✓ |
-| float logbf(float x) Returns the floating point representation of the exponent of 𝑥 . | ✓ | |
-| float nanf(const char* tagp) Returns 'Not a Number' value. | | ✓ |
-| float nearbyintf(float x) Round 𝑥 to the nearest integer. | ✓ | ✓ |
-| float nextafterf(float x, float y) Returns next representable single-precision floating-point value after argument. | ✓ | |
-| float norm3df(float x, float y, float z) Returns the square root of the sum of squares of 𝑥 , 𝑦 and 𝑧 . | ✓ | ✓ |
-| float norm4df(float x, float y, float z, float w) Returns the square root of the sum of squares of 𝑥 , 𝑦 , 𝑧 and 𝑤 . | ✓ | ✓ |
-| float normcdff(float y) Returns the standard normal cumulative distribution function. | ✓ | ✓ |
-| float normcdfinvf(float y) Returns the inverse of the standard normal cumulative distribution function. | ✓ | ✓ |
-| float normf(int dim, const float *a) Returns the square root of the sum of squares of any number of coordinates. | ✓ | ✓ |
-| float powf(float x, float y) Returns 𝑥^𝑦 . | ✓ | ✓ |
-| float powif(float base, int iexp) Returns the value of first argument to the power of second argument. | ✓ | ✓ |
-| float remainderf(float x, float y) Returns single-precision floating-point remainder. | ✓ | ✓ |
-| float remquof(float x, float y, int* quo) Returns single-precision floating-point remainder and part of quotient. | ✓ | ✓ |
-| float roundf(float x) Round to nearest integer value in floating-point. | ✓ | ✓ |
-| float rcbrtf(float x) Returns the reciprocal cube root function. | ✓ | ✓ |
-| float rhypotf(float x, float y) Returns one over the square root of the sum of squares of two arguments. | ✓ | ✓ |
-| float rintf(float x) Round input to nearest integer value in floating-point. | ✓ | ✓ |
-| float rnorm3df(float x, float y, float z) Returns one over the square root of the sum of squares of three coordinates of the argument. | ✓ | ✓ |
-| float rnorm4df(float x, float y, float z, float w) Returns one over the square root of the sum of squares of four coordinates of the argument. | ✓ | ✓ |
-| float rnormf(int dim, const float *a) Returns the reciprocal of square root of the sum of squares of any number of coordinates. | ✓ | ✓ |
-| float scalblnf(float x, long int n) Scale 𝑥 by 2^𝑛 . | ✓ | ✓ |
-| float scalbnf(float x, int n) Scale 𝑥 by 2^𝑛 . | ✓ | ✓ |
-| bool signbit(float x) Return the sign bit of 𝑥 . | ✓ | ✓ |
-| float sinf(float x) Returns the sine of 𝑥 . | ✓ | ✓ |
-| float sinhf(float x) Returns the hyperbolic sine of 𝑥 . | ✓ | ✓ |
-| float sinpif(float x) Returns the sine of 𝜋 · 𝑥 . | ✓ | ✓ |
-| void sincosf(float x, float *sptr, float *cptr) Returns the sine and cosine of 𝑥 . | ✓ | ✓ |
-| void sincospif(float x, float *sptr, float *cptr) Returns the sine and cosine of 𝜋 · 𝑥 . | ✓ | ✓ |
-| float sqrtf(float x) Returns the square root of 𝑥 . | ✓ | ✓ |
-| float rsqrtf(float x) Returns the reciprocal of the square root of 𝑥 . | | ✓ |
-| float tanf(float x) Returns the tangent of 𝑥 . | ✓ | ✓ |
-| float tanhf(float x) Returns the hyperbolic tangent of 𝑥 . | ✓ | ✓ |
-| float tgammaf(float x) Returns the gamma function of 𝑥 . | ✓ | ✓ |
-| float truncf(float x) Truncate 𝑥 to the integral part. | ✓ | ✓ |
-| float y0f(float x) Returns the value of the Bessel function of the second kind of order 0 for 𝑥 . | ✓ | ✓ |
-| float y1f(float x) Returns the value of the Bessel function of the second kind of order 1 for 𝑥 . | ✓ | ✓ |
-| float ynf(int n, float x) Returns the value of the Bessel function of the second kind of order n for 𝑥 . | ✓ | ✓ |
-
-## 21.2 Double precision mathematical functions
-
-Following is the list of supported double precision mathematical functions.
-
-Table 2: Double precision mathematical functions
-
-| Function | Supported on Host | Supported on Device |
-|------------------------------------------------------------------------------------|---------------------|-----------------------|
-| double abs(double x) Returns the absolute value of 𝑥 | ✓ | ✓ |
-| double acos(double x) Returns the arc cosine of 𝑥 . | ✓ | ✓ |
-| double acosh(double x) Returns the nonnegative arc hyperbolic cosine of 𝑥 . | ✓ | ✓ |
-| double asin(double x) Returns the arc sine of 𝑥 . | ✓ | ✓ |
-| double asinh(double x) Returns the arc hyperbolic sine of 𝑥 . | ✓ | ✓ |
-| double atan(double x) Returns the arc tangent of 𝑥 . | ✓ | ✓ |
-| double atan2(double x, double y) Returns the arc tangent of the ratio of 𝑥 and 𝑦 . | ✓ | ✓ |
-| double atanh(double x) Returns the arc hyperbolic tangent of 𝑥 . | ✓ | ✓ |
-| double cbrt(double x) Returns the cube root of 𝑥 . | ✓ | ✓ |
-| double ceil(double x) Returns ceiling of 𝑥 . | ✓ | ✓ |
-| double copysign(double x, double y) Create value with given magnitude, copying sign of second value. | ✓ | ✓ |
-| double cos(double x) Returns the cosine of 𝑥 . | ✓ | ✓ |
-| double cosh(double x) Returns the hyperbolic cosine of 𝑥 . | ✓ | ✓ |
-| double cospi(double x) Returns the cosine of 𝜋 · 𝑥 . | ✓ | ✓ |
-| double cyl_bessel_i0(double x) Returns the value of the regular modified cylindrical Bessel function of order 0 for 𝑥 . | | |
-| double cyl_bessel_i1(double x) Returns the value of the regular modified cylindrical Bessel function of order 1 for 𝑥 . | | |
-| double erf(double x) Returns the error function of 𝑥 . | ✓ | ✓ |
-| double erfc(double x) Returns the complementary error function of 𝑥 . | ✓ | ✓ |
-| double erfcinv(double x) Returns the inverse complementary error function of 𝑥 . | ✓ | ✓ |
-| double erfcx(double x) Returns the scaled complementary error function of 𝑥 . | ✓ | ✓ |
-| double erfinv(double x) Returns the inverse error function of 𝑥 . | ✓ | ✓ |
-| double exp(double x) Returns 𝑒^𝑥 . | ✓ | ✓ |
-| double exp10(double x) Returns 10^𝑥 . | ✓ | ✓ |
-| double exp2(double x) Returns 2^𝑥 . | ✓ | ✓ |
-| double expm1(double x) Returns 𝑒^𝑥 - 1 . | ✓ | ✓ |
-| double fabs(double x) Returns the absolute value of x | ✓ | ✓ |
-| double fdim(double x, double y) Returns the positive difference between 𝑥 and 𝑦 . | ✓ | ✓ |
-| double floor(double x) Returns the largest integer less than or equal to 𝑥 . | ✓ | ✓ |
-| double fma(double x, double y, double z) Returns 𝑥 · 𝑦 + 𝑧 as a single operation. | ✓ | ✓ |
-| double fmax(double x, double y) Determine the maximum numeric value of 𝑥 and 𝑦 . | ✓ | ✓ |
-| double fmin(double x, double y) Determine the minimum numeric value of 𝑥 and 𝑦 . | ✓ | ✓ |
-| double fmod(double x, double y) Returns the floating-point remainder of 𝑥/𝑦 . | ✓ | ✓ |
-| double modf(double x, double* iptr) Break down 𝑥 into fractional and integral parts. | ✓ | |
-| double frexp(double x, int* nptr) Extract mantissa and exponent of 𝑥 . | ✓ | |
-| double hypot(double x, double y) Returns the square root of the sum of squares of 𝑥 and 𝑦 . | ✓ | ✓ |
-| int ilogb(double x) Returns the unbiased integer exponent of 𝑥 . | ✓ | ✓ |
-| bool isfinite(double x) Determine whether 𝑥 is finite. | ✓ | ✓ |
-| bool isinf(double x) Determine whether 𝑥 is infinite. | ✓ | ✓ |
-| bool isnan(double x) Determine whether 𝑥 is a NAN . | ✓ | ✓ |
-| double j0(double x) Returns the value of the Bessel function of the first kind of order 0 for 𝑥 . | ✓ | ✓ |
-| double j1(double x) Returns the value of the Bessel function of the first kind of order 1 for 𝑥 . | ✓ | ✓ |
-| double jn(int n, double x) Returns the value of the Bessel function of the first kind of order n for 𝑥 . | ✓ | ✓ |
-| double ldexp(double x, int exp) Returns 𝑥 multiplied by 2 raised to the power exp. | ✓ | ✓ |
-| double lgamma(double x) Returns the natural logarithm of the absolute value of the gamma function of 𝑥 . | ✓ | |
-| long int lrint(double x) Round 𝑥 to nearest integer value. | ✓ | ✓ |
-| long long int llrint(double x) Round 𝑥 to nearest integer value. | ✓ | ✓ |
-| long int lround(double x) Round to nearest integer value. | ✓ | ✓ |
-| long long int llround(double x) Round to nearest integer value. | ✓ | ✓ |
-| double log10(double x) Returns the base 10 logarithm of 𝑥 . | ✓ | ✓ |
-| double log1p(double x) Returns the natural logarithm of 𝑥 +1 . | ✓ | ✓ |
-| double log2(double x) Returns the base 2 logarithm of 𝑥 . | ✓ | ✓ |
-| double log(double x) Returns the natural logarithm of 𝑥 . | ✓ | ✓ |
-| double logb(double x) Returns the floating point representation of the exponent of 𝑥 . | ✓ | ✓ |
-| double nan(const char* tagp) Returns 'Not a Number' value. | | ✓ |
-| double nearbyint(double x) Round 𝑥 to the nearest integer. | ✓ | ✓ |
-| double nextafter(double x, double y) Returns next representable double-precision floating-point value after argument. | ✓ | ✓ |
-| double norm3d(double x, double y, double z) Returns the square root of the sum of squares of 𝑥 , 𝑦 and 𝑧 . | ✓ | ✓ |
-| double norm4d(double x, double y, double z, double w) Returns the square root of the sum of squares of 𝑥 , 𝑦 , 𝑧 and 𝑤 . | ✓ | ✓ |
-| double normcdf(double y) Returns the standard normal cumulative distribution function. | ✓ | ✓ |
-| double normcdfinv(double y) Returns the inverse of the standard normal cumulative distribution function. | ✓ | ✓ |
-| double norm(int dim, const double *a) Returns the square root of the sum of squares of any number of coordinates. | ✓ | ✓ |
-| double pow(double x, double y) Returns 𝑥^𝑦 . | ✓ | ✓ |
-| double powi(double base, int iexp) Returns the value of first argument to the power of second argument. | ✓ | ✓ |
-| double remainder(double x, double y) Returns double-precision floating-point remainder. | ✓ | ✓ |
-| double remquo(double x, double y, int* quo) Returns double-precision floating-point remainder and part of quotient. | ✓ | ✓ |
-| double round(double x) Round to nearest integer value in floating-point. | ✓ | ✓ |
-| double rcbrt(double x) Returns the reciprocal cube root function. | ✓ | ✓ |
-| double rhypot(double x, double y) Returns one over the square root of the sum of squares of two arguments. | ✓ | ✓ |
-| double rint(double x) Round input to nearest integer value in floating-point. | ✓ | ✓ |
-| double rnorm3d(double x, double y, double z) Returns one over the square root of the sum of squares of three coordinates of the argument. | ✓ | ✓ |
-| double rnorm4d(double x, double y, double z, double w) Returns one over the square root of the sum of squares of four coordinates of the argument. | ✓ | ✓ |
-| double rnorm(int dim, const double *a) Returns the reciprocal of square root of the sum of squares of any number of coordinates. | ✓ | ✓ |
-| double scalbln(double x, long int n) Scale 𝑥 by 2^𝑛 . | ✓ | ✓ |
-| double scalbn(double x, int n) Scale 𝑥 by 2^𝑛 . | ✓ | ✓ |
-| bool signbit(double x) Return the sign bit of 𝑥 . | ✓ | ✓ |
-| double sin(double x) Returns the sine of 𝑥 . | ✓ | ✓ |
-| double sinh(double x) Returns the hyperbolic sine of 𝑥 . | ✓ | ✓ |
-| double sinpi(double x) Returns the sine of 𝜋 · 𝑥 . | ✓ | ✓ |
-| void sincos(double x, double *sptr, double *cptr) Returns the sine and cosine of 𝑥 . | ✓ | ✓ |
-| void sincospi(double x, double *sptr, double *cptr) Returns the sine and cosine of 𝜋 · 𝑥 . | ✓ | ✓ |
-| double sqrt(double x) Returns the square root of 𝑥 . | ✓ | ✓ |
-
-| double rsqrt(double x) Returns the reciprocal of the square root of 𝑥 . | ✓ |
-|-----------------------------------------------------------------------------------------------------------|-----|
-| double tan(double x) Returns the tangent of 𝑥 . | ✓ |
-| double tanh(double x) Returns the hyperbolic tangent of 𝑥 . | ✓ |
-| double tgamma(double x) Returns the gamma function of 𝑥 . | ✓ |
-| double trunc(double x) Truncate 𝑥 to the integral part. | ✓ |
-| double y0(double x) Returns the value of the Bessel function of the second kind of order 0 for 𝑥 . | ✓ |
-| double y1(double x) Returns the value of the Bessel function of the second kind of order 1 for 𝑥 . | ✓ |
-| double yn(int n, double x) Returns the value of the Bessel function of the second kind of order n for 𝑥 . | ✓ |
-
-## 21.3 Integer intrinsics
-
-Following is the list of supported integer intrinsics. Note that intrinsics are supported on device only.
-
-Table 3: Integer intrinsics mathematical functions
-
-| Function |
-|----------|
-| unsigned int \_\_brev(unsigned int x) Reverse the bit order of a 32 bit unsigned integer. |
-| unsigned long long int \_\_brevll(unsigned long long int x) Reverse the bit order of a 64 bit unsigned integer. |
-| unsigned int \_\_byte\_perm(unsigned int x, unsigned int y, unsigned int z) Return selected bytes from two 32-bit unsigned integers. |
-| unsigned int \_\_clz(int x) Return the number of consecutive high-order zero bits in 32 bit integer. |
-| unsigned int \_\_clzll(long long int x) Return the number of consecutive high-order zero bits in 64 bit integer. |
-| unsigned int \_\_ffs(int x) Find the position of least significant bit set to 1 in a 32 bit integer. |
-| unsigned int \_\_ffsll(long long int x) Find the position of least significant bit set to 1 in a 64 bit signed integer. |
-| unsigned int \_\_fns32(unsigned long long mask, unsigned int base, int offset) Find the position of the n-th set to 1 bit in a 32-bit integer. |
-| unsigned int \_\_fns64(unsigned long long int mask, unsigned int base, int offset) Find the position of the n-th set to 1 bit in a 64-bit integer. |
-| unsigned int \_\_funnelshift\_l(unsigned int lo, unsigned int hi, unsigned int shift) Concatenate ℎ𝑖 and 𝑙𝑜 , shift left by shift & 31 bits, return the most significant 32 bits. |
-| unsigned int \_\_funnelshift\_lc(unsigned int lo, unsigned int hi, unsigned int shift) Concatenate ℎ𝑖 and 𝑙𝑜 , shift left by min(shift, 32) bits, return the most significant 32 bits. |
-| unsigned int \_\_funnelshift\_r(unsigned int lo, unsigned int hi, unsigned int shift) Concatenate ℎ𝑖 and 𝑙𝑜 , shift right by shift & 31 bits, return the least significant 32 bits. |
-
-The HIP-Clang implementation of \_\_ffs() and \_\_ffsll() contains code to add a constant +1 to produce the ffs result format. For cases where this overhead is not acceptable and the programmer is willing to specialize for the platform, HIP-Clang provides \_\_lastbit\_u32\_u32(unsigned int input) and \_\_lastbit\_u32\_u64(unsigned long long int input) . The index returned by the \_\_lastbit\_ instructions starts at -1, while for ffs the index starts at 0.
-
-## 21.4 Floating-point Intrinsics
-
-Following is the list of supported floating-point intrinsics. Note that intrinsics are supported on device only.
-
-Note: By default, AMD GPUs support only the round-to-nearest-even rounding mode. The \_rz , \_ru and \_rd suffixed intrinsic functions are available in the HIP AMD backend only if the OCML\_BASIC\_ROUNDED\_OPERATIONS macro is defined.
-
-Table 4: Single precision intrinsics mathematical functions
-
-| Function |
-|----------|
-| float \_\_cosf(float x) Returns the fast approximate cosine of 𝑥 . |
-| float \_\_exp10f(float x) Returns the fast approximate for 10^𝑥 . |
-| float \_\_expf(float x) Returns the fast approximate for 𝑒^𝑥 . |
-| float \_\_fadd\_rn(float x, float y) Add two floating-point values in round-to-nearest-even mode. |
-| float \_\_fdiv\_rn(float x, float y) Divide two floating-point values in round-to-nearest-even mode. |
-| float \_\_fmaf\_rn(float x, float y, float z) Returns 𝑥 × 𝑦 + 𝑧 as a single operation in round-to-nearest-even mode. |
-| float \_\_fmul\_rn(float x, float y) Multiply two floating-point values in round-to-nearest-even mode. |
-| float \_\_frcp\_rn(float x) Returns 1 / 𝑥 in round-to-nearest-even mode. |
-| float \_\_frsqrt\_rn(float x) Returns 1 / √𝑥 in round-to-nearest-even mode. |
-| float \_\_fsqrt\_rn(float x) Returns √𝑥 in round-to-nearest-even mode. |
-| float \_\_fsub\_rn(float x, float y) Subtract two floating-point values in round-to-nearest-even mode. |
-| float \_\_log10f(float x) Returns the fast approximate for base 10 logarithm of 𝑥 . |
-
-Table 5: Double precision intrinsics mathematical functions
-
-| Function |
-|----------|
-| double \_\_dadd\_rn(double x, double y) Add two floating-point values in round-to-nearest-even mode. |
-| double \_\_ddiv\_rn(double x, double y) Divide two floating-point values in round-to-nearest-even mode. |
-| double \_\_dmul\_rn(double x, double y) Multiply two floating-point values in round-to-nearest-even mode. |
-| double \_\_drcp\_rn(double x) Returns 1 / 𝑥 in round-to-nearest-even mode. |
-| double \_\_dsqrt\_rn(double x) Returns √𝑥 in round-to-nearest-even mode. |
-| double \_\_dsub\_rn(double x, double y) Subtract two floating-point values in round-to-nearest-even mode. |
-| double \_\_fma\_rn(double x, double y, double z) Returns 𝑥 × 𝑦 + 𝑧 as a single operation in round-to-nearest-even mode. |
-
-## 22 Table comparing syntax for different compute APIs
-
-| Term | CUDA | HIP | OpenCL |
-|------------------------|---------------------|--------------------------------------------|------------------------|
-| Device | int deviceId | int deviceId | cl_device |
-| Queue | cudaStream_t | hipStream_t | cl_command_queue |
-| Event | cudaEvent_t | hipEvent_t | cl_event |
-| Memory | void * | void * | cl_mem |
-| | grid | grid | NDRange |
-| | block | block | work-group |
-| | thread | thread | work-item |
-| | warp | warp | sub-group |
-| Thread-index | threadIdx.x | threadIdx.x | get_local_id(0) |
-| Block-index | blockIdx.x | blockIdx.x | get_group_id(0) |
-| Block-dim | blockDim.x | blockDim.x | get_local_size(0) |
-| Grid-dim | gridDim.x | gridDim.x | get_num_groups(0) |
-| Device Kernel | __global__ | __global__ | __kernel |
-| Device Function | __device__ | __device__ | Implied in device com |
-| Host Function | __host__ (default) | __host__ (default) | Implied in host compil |
-| Host + Device Function | __host__ __device__ | __host__ __device__ | No equivalent |
-| Kernel Launch | <<< >>> | hipLaunchKernel / hipLaunchKernelGGL / <<< | clEnqueueNDRangeK |
-| Global Memory | __global__ | __global__ | __global |
-| Group Memory | __shared__ | __shared__ | __local |
-| Constant | __constant__ | __constant__ | __constant |
-| | __syncthreads | __syncthreads | barrier(CLK_LOCAL |
-| Atomic Builtins | atomicAdd | atomicAdd | atomic_add |
-| Precise Math | cos(f) | cos(f) | cos(f) |
-| Fast Math | __cos(f) | __cos(f) | native_cos(f) |
-| Vector | float4 | float4 | float4 |
-
-## 22.1 Notes
-
-The indexing functions (starting with thread-index ) show the terminology for a 1D grid. Some APIs use reverse order of xyz / 012 indexing for 3D grids.
-
-## 23 HIP cooperative groups API
-
-## 23.1 Cooperative kernel launches
-
-The following host-side functions are used for cooperative kernel launches.
-
-- hipLaunchCooperativeKernel
-- hipLaunchCooperativeKernelMultiDevice
-- hipModuleLaunchCooperativeKernel
-- hipModuleLaunchCooperativeKernelMultiDevice
-
-## 23.2 Cooperative groups classes
-
-The following cooperative groups classes can be used on the device side.
-
-## class thread\_group
-
-The base type of all cooperative group types.
-
-Holds the key properties of a constructed cooperative group types object, like the group type, its size, etc.
-
-Note: Cooperative groups feature is implemented on Linux, under development on Microsoft Windows.
-
-Subclassed by cooperative\_groups::coalesced\_group , cooperative\_groups::grid\_group , cooperative\_groups::multi\_grid\_group , cooperative\_groups::thread\_block , cooperative\_groups::tiled\_group
-
-class thread\_block : public cooperative\_groups:: thread\_group
-
-The workgroup (thread-block in CUDA terminology) cooperative group type.
-
-Represents an intra-workgroup cooperative group type, where the participating threads within the group are the same threads that participated in the currently executing workgroup .
-
-Note: This function is implemented on Linux and is under development on Microsoft Windows.
-
-class grid\_group : public cooperative\_groups:: thread\_group
-
-The grid cooperative group type.
-
-Represents an inter-workgroup cooperative group type, where the participating threads within the group span multiple workgroups running the (same) kernel on the same device.
-
-Note: This is implemented on Linux and is under development on Microsoft Windows.
-
-class multi\_grid\_group : public cooperative\_groups:: thread\_group
-
-The multi-grid cooperative group type.
-
-Represents an inter-device cooperative group type, where the participating threads within the group span across multiple devices, running the (same) kernel on these devices.
-
-Note: The multi-grid cooperative group type is implemented on Linux, under development on Microsoft Windows.
-
-## template<unsigned int size , class ParentCGTy >
-
-class thread\_block\_tile : public cooperative\_groups::impl::thread\_block\_tile\_internal< size , ParentCGTy >
-
-Group type - thread\_block\_tile .
-
-Represents one tiled thread group in a wavefront. This group type also supports sub-wave level intrinsics.
-
-Note: This type is implemented on Linux, under development on Microsoft Windows.
-
-## Public Functions
-
-unsigned int thread\_rank () const
-
-Rank of the calling thread within [0, size() ).
-
-## void sync ()
-
-Synchronizes the threads in the group.
-
-Causes all threads in the group to wait at this synchronization point, and for all shared and global memory accesses by the threads to complete, before running synchronization. This guarantees the visibility of accessed data for all threads in the group.
-
-Note: There are potential read-after-write (RAW), write-after-read (WAR), or write-after-write (WAW) hazards, when threads in the group access the same addresses in shared or global memory. The data hazards can be avoided with synchronization of the group.
-
-## unsigned int meta\_group\_rank () const
-
-Returns the linear rank of the group within the set of tiles partitioned from a parent group (bounded by meta\_group\_size)
-
-unsigned int meta\_group\_size () const
-
-Returns the number of groups created when the parent group was partitioned.
-
-## template<class T >
-
-T shfl ( T var, int srcRank ) const
-
-Shuffle operation on group level.
-
-Exchanges variables between threads without using shared memory. The shuffle operation copies var directly from the thread whose ID within the group is srcRank.
-
-## Template Parameters
-
-T - The type can be a 32-bit integer or single-precision floating point.
-
-## Parameters
-
-- var - [in] The source variable to copy. Only the srcRank thread ID of group is copied to other threads.
-- srcRank - [in] The source thread ID of the group for copy.
-
-## template<class T >
-
-T shfl\_down ( T var, unsigned int lane\_delta ) const
-
-Shuffle down operation on group level.
-
-Exchanges a variable between threads without using shared memory. The shuffle down operation copies var from the thread whose rank in the group is lane\_delta higher than the caller's.
-
-## Template Parameters
-
-T - The type can be a 32-bit integer or single-precision floating point.
-
-## Parameters
-
-- var - [in] The source variable to copy.
-- lane\_delta - [in] The difference between the caller's thread ID and the source thread ID: sourceID = (threadID + lane\_delta) % size()
-
-## template<class T >
-
-T shfl\_up ( T var, unsigned int lane\_delta ) const
-
-Shuffle up operation on group level.
-
-Exchanges a variable between threads without using shared memory. The shuffle up operation copies var from the thread whose rank in the group is lane\_delta lower than the caller's.
-
-## Template Parameters
-
-T - The type can be a 32-bit integer or single-precision floating point.
-
-## Parameters
-
-- var - [in] The source variable to copy.
-- lane\_delta - [in] The difference between the caller's thread ID and the source thread ID: sourceID = (threadID - lane\_delta) % size()
-
-## template<class T >
-
-T shfl\_xor ( T var, unsigned int laneMask ) const
-
-Shuffle xor operation on group level.
-
-Exchanges a variable between threads without using shared memory. The shuffle xor operation copies var from the thread whose rank in the group is the caller's rank XORed with laneMask.
-
-## Template Parameters
-
-- T - The type can be a 32-bit integer or single-precision floating point.
-
-## Parameters
-
-- var - [in] The source variable to copy.
-- laneMask - [in] The laneMask is the mask for XOR operation. sourceID = threadID ^ laneMask
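The three source-lane formulas above can be checked on the host with a small model. The sketch below is plain C++, not the HIP cooperative groups API: the function names and the `tile_size` parameter (standing in for size()) are hypothetical, and the modulus is taken exactly as the formulas are written. It says nothing about how the hardware treats out-of-range lanes.

```cpp
#include <cassert>
#include <vector>

// Host-side model of the documented formulas. Hypothetical helper names;
// `lane` plays the role of the caller's thread ID, `tile_size` of size().

// shfl_down: sourceID = (threadID + lane_delta) % size()
unsigned shfl_down_source(unsigned lane, unsigned lane_delta, unsigned tile_size) {
    assert(tile_size > 0 && lane < tile_size);
    return (lane + lane_delta % tile_size) % tile_size;
}

// shfl_up: sourceID = (threadID - lane_delta) % size()
unsigned shfl_up_source(unsigned lane, unsigned lane_delta, unsigned tile_size) {
    assert(tile_size > 0 && lane < tile_size);
    // Add tile_size first so the unsigned subtraction cannot wrap below zero.
    return (lane + tile_size - lane_delta % tile_size) % tile_size;
}

// shfl_xor: sourceID = threadID ^ laneMask
unsigned shfl_xor_source(unsigned lane, unsigned lane_mask) {
    return lane ^ lane_mask;
}

// Emulates the common butterfly pattern `v += shfl_xor(v, mask)` across a
// power-of-two tile; afterwards every lane holds the sum of all inputs.
int butterfly_sum(std::vector<int> values) {
    const unsigned n = static_cast<unsigned>(values.size());
    assert(n > 0 && (n & (n - 1)) == 0);  // power-of-two tile sizes only
    for (unsigned mask = n / 2; mask > 0; mask /= 2) {
        std::vector<int> next(n);
        for (unsigned lane = 0; lane < n; ++lane) {
            next[lane] = values[lane] + values[shfl_xor_source(lane, mask)];
        }
        values = next;
    }
    return values[0];
}
```

In this model, lane 6 of an 8-wide tile reads from lane 1 on a shuffle down by 3, and `butterfly_sum` leaves the full reduction in every lane. This is why XOR shuffles are a common choice for tile-wide reductions.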
-
-unsigned long long ballot ( int pred ) const
-
-Ballot function on group level.
-
-Returns a bit mask with the Nth bit set to one if the Nth thread predicate evaluates true.
-
-## Parameters
-
-pred - [in] The predicate to evaluate on group threads.
-
-int any ( int pred ) const
-
-Any function on group level.
-
-Returns non-zero if a predicate evaluates true for any threads.
-
-## Parameters
-
-pred - [in] The predicate to evaluate on group threads.
-
-int all ( int pred ) const
-
-All function on group level.
-
-Returns non-zero if a predicate evaluates true for all threads.
-
-## Parameters
-
-pred - [in] The predicate to evaluate on group threads.
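The relationship between ballot, any, and all can be illustrated with a host-side model. This is a sketch in plain C++ under the assumption that the group has at most 64 threads (the width of the `unsigned long long` mask). The `*_model` function names are hypothetical and are not part of HIP.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// preds[i] is the predicate value supplied by the thread with rank i.

// ballot: bit N is set when thread N's predicate is non-zero.
std::uint64_t ballot_model(const std::vector<int>& preds) {
    assert(preds.size() <= 64);
    std::uint64_t mask = 0;
    for (std::size_t lane = 0; lane < preds.size(); ++lane) {
        if (preds[lane] != 0) {
            mask |= (std::uint64_t{1} << lane);
        }
    }
    return mask;
}

// any: non-zero when at least one ballot bit is set.
int any_model(const std::vector<int>& preds) {
    return ballot_model(preds) != 0 ? 1 : 0;
}

// all: non-zero when the ballot mask covers every thread in the group.
int all_model(const std::vector<int>& preds) {
    const std::size_t n = preds.size();
    const std::uint64_t full =
        (n == 64) ? ~std::uint64_t{0} : ((std::uint64_t{1} << n) - 1);
    return ballot_model(preds) == full ? 1 : 0;
}
```

Seen this way, any and all are derived views of the ballot mask: any tests for a non-empty mask, and all tests for a mask with one bit per group member.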
-
-template<typename T >
-
-unsigned long long match\_any ( T value ) const
-
-Match any function on group level.
-
-Returns a bit mask containing a 1-bit for every participating thread if that thread has the same value in value as the caller thread.
-
-## Parameters
-
-value - [in] The value to examine on the current thread in group.
-
-template<typename T > unsigned long long match\_all ( T value, int &pred ) const
-
-Match all function on group level.
-
-Returns a bit mask containing a 1-bit for every participating thread if they all have the same value in value as the caller thread. The predicate pred is set to true if all participating threads have the same value in value .
-
-## Parameters
-
-- value - [in] The value to examine on the current thread in group.
-- pred - [out] The predicate is set to true if all participating threads in the thread group have the same value.
-
-class coalesced\_group : public cooperative\_groups:: thread\_group
-
-The coalesced\_group cooperative group type.
-
-Represents an active thread group in a wavefront. This group type also supports sub-wave level intrinsics.
-
-Note: This is implemented on Linux and is under development on Microsoft Windows.
-
-## 23.3 Cooperative groups construct functions
-
-The following functions are used to construct different group-type instances on the device side.
-
-Note: The reference entries for the following functions did not render in this build of the HIP 6.1.40092 documentation (Doxygen could not find them in its XML output):
-
-- cooperative\_groups::this\_multi\_grid
-- cooperative\_groups::this\_grid
-- cooperative\_groups::this\_thread\_block
-- cooperative\_groups::coalesced\_threads
-- cooperative\_groups::tiled\_partition (listed twice)
-- cooperative\_groups::binary\_partition (listed twice)
-
-## 23.4 Cooperative groups exposed API functions
-
-The following functions are the exposed API for different group-type instances on the device side.
-
-Note: The reference entries for the following functions did not render in this build of the HIP 6.1.40092 documentation (Doxygen could not find them in its XML output):
-
-- cooperative\_groups::group\_size
-- cooperative\_groups::thread\_rank
-- cooperative\_groups::is\_valid
-- cooperative\_groups::sync
-
-## Chapter 24: HSA runtime API for ROCm
-
-The following functions are located in the https://github.com/ROCm/ROCR-Runtime repository.
-
-hsa\_status\_t hsa\_amd\_vmem\_address\_reserve ( void **va, size\_t size, uint64\_t address, uint64\_t flags )
-
-Allocate a reserved address range.
-
-Reserve a virtual address range. The size must be a multiple of the system page size. If the range cannot be reserved at the address specified by address, va will point to a different address range. Release the address range by calling hsa\_amd\_vmem\_address\_free.
-
-Note that this API will be deprecated in a future release and replaced by hsa\_amd\_vmem\_address\_reserve\_align.
-
-## Parameters
-
-- va - [out] virtual address allocated
-- size - [in] size of the address range requested
-- address - [in] requested address
-- flags - [in] currently unsupported
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS - Address range allocated successfully
-- ::HSA\_STATUS\_ERROR\_NOT\_INITIALIZED - The HSA runtime has not been initialized.
-- ::HSA\_STATUS\_ERROR\_OUT\_OF\_RESOURCES - Insufficient resources to allocate an address range of this size.
-
-hsa\_status\_t hsa\_amd\_vmem\_address\_free ( void *va, size\_t size )
-
-Free a reserved address range.
-
-Free a previously allocated address range. The size must match the size of a previously allocated address range.
-
-## Parameters
-
-- va - [in] virtual address to be freed
-- size - [in] size of the address range
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS - Address range released successfully
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - Invalid va specified
-- ::HSA\_STATUS\_ERROR\_INVALID\_ARGUMENT - Invalid size specified
-- ::HSA\_STATUS\_ERROR\_RESOURCE\_FREE - Address range is still in use
-- ::HSA\_STATUS\_ERROR - Internal unexpected error
-
-hsa\_status\_t hsa\_amd\_vmem\_handle\_create ( hsa\_amd\_memory\_pool\_t pool, size\_t size, hsa\_amd\_memory\_type\_t type, uint64\_t flags, hsa\_amd\_vmem\_alloc\_handle\_t *memory\_handle )
-
-Create a virtual memory handle.
-
-Create a virtual memory handle within this pool. size must be aligned to the allocation granule size for this memory pool; see HSA\_AMD\_MEMORY\_POOL\_INFO\_RUNTIME\_ALLOC\_GRANULE. To minimize internal memory fragmentation, align the size to the recommended allocation granule size; see HSA\_AMD\_MEMORY\_POOL\_INFO\_RUNTIME\_ALLOC\_REC\_GRANULE.
-
-## Parameters
-
-- pool - [in] memory pool to use
-- size - [in] size of the memory allocation
-- type - [in] type of memory
-- flags - [in] currently unsupported
-- memory\_handle - [out] handle for the allocation
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS - memory allocated successfully
-- ::HSA\_STATUS\_ERROR\_NOT\_INITIALIZED - The HSA runtime has not been initialized.
-- ::HSA\_STATUS\_ERROR\_INVALID\_ARGUMENT - Invalid arguments
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - This memory pool does not support allocations
-- ::HSA\_STATUS\_ERROR\_OUT\_OF\_RESOURCES - Insufficient resources to allocate this memory
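The alignment requirement on size can be met by rounding the requested byte count up to the pool's granule before calling hsa\_amd\_vmem\_handle\_create. The helper below is a minimal sketch in plain C++. The name `align_up` is hypothetical, and the granule is passed in as an ordinary parameter; in real code it would come from querying the pool for HSA\_AMD\_MEMORY\_POOL\_INFO\_RUNTIME\_ALLOC\_GRANULE (or the recommended granule), which this sketch does not do.

```cpp
#include <cassert>
#include <cstddef>

// Round `size` up to the next multiple of `granule`.
// `granule` stands in for the value reported by the memory pool.
std::size_t align_up(std::size_t size, std::size_t granule) {
    assert(granule > 0);
    const std::size_t remainder = size % granule;
    return remainder == 0 ? size : size + (granule - remainder);
}
```

The modulo form works for any positive granule, not only powers of two, at the cost of a division. If the granule is known to be a power of two, a mask-based round-up is a common alternative.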
-
-hsa\_status\_t hsa\_amd\_vmem\_handle\_release ( hsa\_amd\_vmem\_alloc\_handle\_t memory\_handle )
-
-Release a virtual memory handle.
-
-## Parameters
-
-- memory\_handle - [in] handle that was previously allocated
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS - Memory handle released successfully
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - Invalid memory handle
-
-hsa\_status\_t hsa\_amd\_vmem\_map ( void *va, size\_t size, size\_t in\_offset, hsa\_amd\_vmem\_alloc\_handle\_t memory\_handle, uint64\_t flags )
-
-Map a virtual memory handle.
-
-Map a virtual memory handle to a reserved address range. The virtual address requested must be within a previously reserved address range: both va and (va + size) must lie within that range. size must be equal to the size of memory\_handle. hsa\_amd\_vmem\_set\_access must then be called to make the memory accessible to specific agents.
-
-## Parameters
-
-- va - [in] virtual address range where memory will be mapped
-- size - [in] size of the memory mapping
-- in\_offset - [in] offset into memory; currently unsupported
-- memory\_handle - [in] virtual memory handle to be mapped
-- flags - [in] currently unsupported
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS - Memory mapped successfully
-- ::HSA\_STATUS\_ERROR\_INVALID\_ARGUMENT - va, size or memory\_handle are invalid
-- ::HSA\_STATUS\_ERROR\_OUT\_OF\_RESOURCES - Insufficient resources
-- ::HSA\_STATUS\_ERROR - Unexpected internal error
-
-hsa\_status\_t hsa\_amd\_vmem\_unmap ( void *va, size\_t size )
-
-Unmap a virtual memory handle.
-
-Unmap a previously mapped virtual address range.
-
-## Parameters
-
-- va -[in] virtual address range where memory will be mapped
-- size -[in] of memory mapping
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS - Memory backing unmapped successfully
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - memory\_handle is invalid
-- ::HSA\_STATUS\_ERROR\_INVALID\_ARGUMENT - size is invalid
-- ::HSA\_STATUS\_ERROR - Unexpected internal error
-
-hsa\_status\_t hsa\_amd\_vmem\_set\_access ( void *va, size\_t size, const hsa\_amd\_memory\_access\_desc\_t *desc, size\_t desc\_cnt )
-
-Make a memory mapping accessible.
-
-Make a previously mapped virtual address accessible to specific agents. size must be equal to the size of the previously mapped virtual memory handle. Calling hsa\_amd\_vmem\_set\_access multiple times on the same va overwrites the previous permissions for all agents.
-
-## Parameters
-
-- va -[in] previously mapped virtual address
-- size -[in] of memory mapping
-- desc -[in] list of access permissions for each agent
-- desc\_cnt -[in] number of elements in desc
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS -
-- ::HSA\_STATUS\_ERROR\_INVALID\_ARGUMENT - va, size or memory\_handle are invalid
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - memory\_handle is invalid
-- ::HSA\_STATUS\_ERROR\_OUT\_OF\_RESOURCES - Insufficient resources
-- ::HSA\_STATUS\_ERROR\_INVALID\_AGENT - Invalid agent in desc
-- ::HSA\_STATUS\_ERROR - Unexpected internal error
-
-hsa\_status\_t hsa\_amd\_vmem\_get\_access ( void *va, hsa\_access\_permission\_t *perms, hsa\_agent\_t agent\_handle )
-
-Get current access permissions for memory mapping.
-
-Get access permissions for memory mapping for specific agent.
-
-## Parameters
-
-- va -[in] previously mapped virtual address
-- perms -[out] current permissions
-- agent\_handle -[in] agent
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS -
-- ::HSA\_STATUS\_ERROR\_INVALID\_AGENT - Invalid agent
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - va is not mapped or permissions never set for this agent
-- ::HSA\_STATUS\_ERROR - Unexpected internal error
-
-hsa\_status\_t hsa\_amd\_vmem\_export\_shareable\_handle ( int *dmabuf\_fd, hsa\_amd\_vmem\_alloc\_handle\_t handle, uint64\_t flags )
-
-Get an exportable shareable handle.
-
-Get an exportable shareable handle for a memory\_handle. This shareable handle can then be used to re-create a virtual memory handle using hsa\_amd\_vmem\_import\_shareable\_handle. The shareable handle can be transferred using mechanisms that support POSIX file descriptors. Once all shareable handles are closed, the memory\_handle is released.
-
-## Parameters
-
-- dmabuf\_fd -[out] shareable handle
-- handle -[in] previously allocated virtual memory handle
-- flags -[in] Currently unsupported
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS -
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - Invalid memory handle
-- ::HSA\_STATUS\_ERROR\_OUT\_OF\_RESOURCES - Out of resources
-- ::HSA\_STATUS\_ERROR - Unexpected internal error
-
-## hsa\_status\_t hsa\_amd\_vmem\_import\_shareable\_handle ( int dmabuf\_fd, hsa\_amd\_vmem\_alloc\_handle\_t *handle )
-
-Import a shareable handle.
-
-Import a shareable handle for a memory handle. Importing a shareable handle that has been closed and released results in undefined behavior.
-
-## Parameters
-
-- dmabuf\_fd -[in] shareable handle exported with hsa\_amd\_vmem\_export\_shareable\_handle
-- handle -[out] virtual memory handle
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS -
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - Invalid memory handle
-- ::HSA\_STATUS\_ERROR\_OUT\_OF\_RESOURCES - Out of resources
-- ::HSA\_STATUS\_ERROR - Unexpected internal error
-
-hsa\_status\_t hsa\_amd\_vmem\_retain\_alloc\_handle ( hsa\_amd\_vmem\_alloc\_handle\_t *memory\_handle, void *addr )
-
-Returns memory handle for mapped memory.
-
-Return a memory handle for previously mapped memory. The handle has the same value as the handle used to map the memory. The returned handle must be released with a corresponding number of calls to hsa\_amd\_vmem\_handle\_release.
-
-## Parameters
-
-- memory\_handle -[out] memory handle for this mapped address
-- addr -[in] mapped address
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS -
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - Invalid address
-
-hsa\_status\_t hsa\_amd\_vmem\_get\_alloc\_properties\_from\_handle ( hsa\_amd\_vmem\_alloc\_handle\_t memory\_handle, hsa\_amd\_memory\_pool\_t *pool, hsa\_amd\_memory\_type\_t *type )
-
-Returns the current allocation properties of a handle.
-
-Returns the allocation properties of an existing handle
-
-## Parameters
-
-- memory\_handle -[in] memory handle to be queried
-- pool -[out] memory pool that owns this handle
-- type -[out] memory type
-
-## Return values
-
-- ::HSA\_STATUS\_SUCCESS -
-- ::HSA\_STATUS\_ERROR\_INVALID\_ALLOCATION - Invalid memory\_handle
-
-## CHAPTER TWENTYFIVE: HIP MANAGED MEMORY ALLOCATION API
-
-hipError\_t hipMallocManaged ( void **dev\_ptr, size\_t size, unsigned int flags )
-
-Allocates memory that will be automatically managed by HIP.
-
-This API is used for managed memory. It allows data to be shared and accessed by both the CPU and the GPU through a single pointer.
-
-The API returns an allocation pointer managed by HMM, which can then be used to execute kernels on the device and to move data between the host and device as needed.
-
-Note: It is recommended to perform a capability check before calling this API.
-
-## Parameters
-
-- dev\_ptr -[out] - pointer to allocated device memory
-- size -[in] - requested allocation size in bytes; it should be a multiple of the 4 KB granularity
-- flags -[in] - must be either hipMemAttachGlobal or hipMemAttachHost (defaults to hipMemAttachGlobal)
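-
-The 4 KB granularity noted for size can be met by rounding the requested byte count up before the call. The following is a minimal host-side sketch; the helper name round\_up\_to\_granularity and the constant kManagedGranularity are illustrative assumptions, not part of the HIP API.
-
-```cpp
-#include <cstddef>
-
-// Assumed granularity, taken from the parameter note above (4 KB).
-constexpr std::size_t kManagedGranularity = 4096;
-
-// Hypothetical helper: round `bytes` up to the next multiple of `granularity`.
-// `granularity` must be non-zero.
-constexpr std::size_t round_up_to_granularity(std::size_t bytes,
-                                              std::size_t granularity = kManagedGranularity)
-{
-    return ((bytes + granularity - 1) / granularity) * granularity;
-}
-```
-
-The rounded value would then be passed as the size argument; a request that is already a multiple of the granularity is returned unchanged.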
-
-## Returns
-
-hipSuccess, hipErrorMemoryAllocation, hipErrorNotSupported, hipErrorInvalidValue
-
-hipError\_t hipMemPrefetchAsync ( const void *dev\_ptr, size\_t count, int device, hipStream\_t stream )
-
-Prefetches memory to the specified destination device using HIP.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-## Parameters
-
-- dev\_ptr -[in] pointer to be prefetched
-- count -[in] size in bytes for prefetching
-- device -[in] destination device to prefetch to
-- stream -[in] stream to enqueue prefetch operation
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue
-
-hipError\_t hipMemAdvise ( const void *dev\_ptr, size\_t count, hipMemoryAdvise advice, int device )
-
-Advise about the usage of a given memory range to HIP.
-
-This HIP API advises about the usage to be applied to a unified memory allocation in the range starting at the pointer address dev\_ptr with a size of count bytes. The memory range must refer to managed memory allocated via hipMallocManaged. The driver rounds the start of the range down and the end of the range up so that the range is aligned to the CPU page size, the same way the corresponding CUDA API behaves in CUDA 8.0 and later.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-## Parameters
-
-- dev\_ptr -[in] pointer to memory to set the advice for
-- count -[in] size in bytes of the memory range; it should be aligned to the CPU page size.
-- advice -[in] advice to be applied for the specified memory range
-- device -[in] device to apply the advice for
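-
-The rounding described for hipMemAdvise can be pictured with plain address arithmetic. The sketch below assumes a 4096-byte CPU page size and uses hypothetical names (AlignedRange, align\_range\_to\_pages). It only illustrates rounding the start down and the end up to page boundaries; it is not the driver's actual implementation.
-
-```cpp
-#include <cstddef>
-#include <cstdint>
-
-// Hypothetical result type: a page-aligned range covering the requested bytes.
-struct AlignedRange {
-    std::uintptr_t begin; // start address rounded down to a page boundary
-    std::size_t    size;  // byte count covering whole pages
-};
-
-// Round the start of [addr, addr + count) down and its end up to `page_size`.
-// `page_size` is an assumption here; query the real value on the target system.
-inline AlignedRange align_range_to_pages(std::uintptr_t addr, std::size_t count,
-                                         std::size_t page_size = 4096)
-{
-    const std::uintptr_t begin = addr - (addr % page_size);
-    const std::uintptr_t end_unaligned = addr + count;
-    const std::uintptr_t end =
-        ((end_unaligned + page_size - 1) / page_size) * page_size;
-    return AlignedRange{begin, static_cast<std::size_t>(end - begin)};
-}
-```
-
-Passing a range that is already page aligned avoids any surprise about which neighbouring bytes the advice ends up covering.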
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue
-
-hipError\_t hipMemRangeGetAttribute ( void *data, size\_t data\_size, hipMemRangeAttribute attribute, const void *dev\_ptr, size\_t count )
-
-Query an attribute of a given memory range in HIP.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-## Parameters
-
-- data -[inout] a pointer to a memory location where the result of each attribute query will be written to
-- data\_size -[in] the size of data
-- attribute -[in] the attribute to query
-- dev\_ptr -[in] start of the range to query
-- count -[in] size of the range to query
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue
-
-hipError\_t hipMemRangeGetAttributes ( void **data, size\_t *data\_sizes, hipMemRangeAttribute *attributes, size\_t num\_attributes, const void *dev\_ptr, size\_t count )
-
-Query attributes of a given memory range in HIP.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-## Parameters
-
-- data -[inout] a two-dimensional array containing pointers to memory locations where the result of each attribute query will be written to
-- data\_sizes -[in] an array, containing the sizes of each result
-- attributes -[in] an array of attributes to query
-- num\_attributes -[in] the number of attributes to query (must match the number of attributes in the attributes array)
-- dev\_ptr -[in] start of the range to query
-- count -[in] size of the range to query
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue
-
-hipError\_t hipStreamAttachMemAsync ( hipStream\_t stream, void *dev\_ptr, size\_t length, unsigned int flags )
-
-Attach memory to a stream asynchronously in HIP.
-
-Warning: This API is under development. Currently it is a no-operation (NOP) function on AMD GPUs and returns hipSuccess.
-
-## Parameters
-
-- stream -[in] - stream in which to enqueue the attach operation
-- dev\_ptr -[in] - pointer to memory (must be a pointer to managed memory or to a valid host-accessible region of system-allocated memory)
-- length -[in] - length of memory (defaults to zero)
-- flags -[in] - must be one of hipMemAttachGlobal, hipMemAttachHost or hipMemAttachSingle (defaults to hipMemAttachSingle)
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue
-
-template<class T> static inline hipError\_t hipMallocManaged ( T **devPtr, size\_t size, unsigned int flags = hipMemAttachGlobal )
-
-C++ wrapper for hipMallocManaged.
-
-Provide an override to automatically typecast the pointer type from void**, and also provide a default for the flags.
-
-HIP\_DISABLE\_CPP\_FUNCTIONS macro can be defined to suppress these wrappers. It is useful for applications which need to obtain decltypes of HIP runtime APIs.
-
-## See also:
-
-hipMallocManaged
-
-## CHAPTER TWENTYSIX: HIP VIRTUAL MEMORY MANAGEMENT API
-
-hipError\_t hipMemAddressFree ( void *devPtr, size\_t size )
-
-Frees an address range reservation made via hipMemAddressReserve.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- devPtr -[in] - starting address of the range.
-- size -[in] - size of the range.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemAddressReserve ( void **ptr, size\_t size, size\_t alignment, void *addr, unsigned long long flags )
-
-Reserves an address range.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- ptr -[out] - starting address of the reserved range.
-- size -[in] - size of the reservation.
-- alignment -[in] - alignment of the address.
-- addr -[in] - requested starting address of the range.
-- flags -[in] - currently unused, must be zero.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemCreate ( hipMemGenericAllocationHandle\_t *handle, size\_t size, const hipMemAllocationProp *prop, unsigned long long flags )
-
-Creates a memory allocation described by the properties and size.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- handle -[out] - value of the returned handle.
-- size -[in] - size of the allocation.
-- prop -[in] - properties of the allocation.
-- flags -[in] - currently unused, must be zero.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemExportToShareableHandle ( void *shareableHandle, hipMemGenericAllocationHandle\_t handle, hipMemAllocationHandleType handleType, unsigned long long flags )
-
-Exports an allocation to a requested shareable handle type.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- shareableHandle -[out] - value of the returned handle.
-- handle -[in] - handle to share.
-- handleType -[in] - type of the shareable handle.
-- flags -[in] - currently unused, must be zero.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemGetAccess ( unsigned long long *flags, const hipMemLocation *location, void *ptr )
-
-Get the access flags set for the given location and ptr.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- flags -[out] - flags for this location.
-- location -[in] - target location.
-- ptr -[in] - address to check the access flags.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemGetAllocationGranularity ( size\_t *granularity, const hipMemAllocationProp *prop, hipMemAllocationGranularity\_flags option )
-
-Calculates either the minimal or recommended granularity.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- granularity -[out] - returned granularity.
-- prop -[in] - location properties.
-- option -[in] - determines which granularity to return.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemGetAllocationPropertiesFromHandle ( hipMemAllocationProp *prop, hipMemGenericAllocationHandle\_t handle )
-
-Retrieve the property structure of the given handle.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- prop -[out] - properties of the given handle.
-- handle -[in] - handle to perform the query on.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemImportFromShareableHandle ( hipMemGenericAllocationHandle\_t *handle, void *osHandle, hipMemAllocationHandleType shHandleType )
-
-Imports an allocation from a requested shareable handle type.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- handle -[out] - returned value.
-- osHandle -[in] - shareable handle representing the memory allocation.
-- shHandleType -[in] - handle type.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemMap ( void *ptr, size\_t size, size\_t offset, hipMemGenericAllocationHandle\_t handle, unsigned long long flags )
-
-Maps an allocation handle to a reserved virtual address range.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- ptr -[in] - address where the memory will be mapped.
-- size -[in] - size of the mapping.
-- offset -[in] - offset into the memory, currently must be zero.
-- handle -[in] - memory allocation to be mapped.
-- flags -[in] - currently unused, must be zero.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemMapArrayAsync ( hipArrayMapInfo *mapInfoList, unsigned int count, hipStream\_t stream )
-
-Maps or unmaps subregions of sparse HIP arrays and sparse HIP mipmapped arrays.
-
-Warning: This API is under development. Currently it is not supported on AMD GPUs and returns hipErrorNotSupported.
-
-## Parameters
-
-- mapInfoList -[in] - list of hipArrayMapInfo.
-- count -[in] - number of hipArrayMapInfo in mapInfoList.
-- stream -[in] - stream identifier for the stream to use for map or unmap operations.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemRelease ( hipMemGenericAllocationHandle\_t handle )
-
-Release a memory handle representing a memory allocation which was previously allocated through hipMemCreate.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- handle -[in] - handle of the memory allocation.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemRetainAllocationHandle ( hipMemGenericAllocationHandle\_t *handle, void *addr )
-
-Returns the allocation handle of the backing memory allocation given the address.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- handle -[out] - handle representing addr.
-- addr -[in] - address to look up.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-hipError\_t hipMemSetAccess ( void *ptr, size\_t size, const hipMemAccessDesc *desc, size\_t count )
-
-Set the access flags for each location specified in desc for the given virtual address range.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- ptr -[in] - starting address of the virtual address range.
-- size -[in] - size of the range.
-- desc -[in] - array of hipMemAccessDesc.
-- count -[in] - number of hipMemAccessDesc in desc.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
-
-## hipError\_t hipMemUnmap ( void *ptr, size\_t size )
-
-Unmap memory allocation of a given address range.
-
-Note: This API is implemented on Linux and is under development on Microsoft Windows.
-
-Warning: This API is marked as Beta. While this feature is complete, it can change and might have outstanding issues.
-
-## Parameters
-
-- ptr -[in] - starting address of the range to unmap.
-- size -[in] - size of the virtual address range.
-
-## Returns
-
-hipSuccess, hipErrorInvalidValue, hipErrorNotSupported
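-
-The functions in this chapter are typically combined in a fixed order: query the granularity, create the physical allocation, reserve a virtual range, map, set access, and finally tear everything down in reverse. The sketch below illustrates that order under several assumptions: device 0 is the target, a 1 MiB request is arbitrary, and the CHECK macro is a hypothetical helper defined only for this example. It requires a HIP-capable system and has not been validated against a specific ROCm release.
-
-```cpp
-#include <hip/hip_runtime.h>
-#include <cstdio>
-#include <cstdlib>
-
-// Hypothetical error-checking macro for this sketch only.
-#define CHECK(expr)                                                  \
-    do {                                                             \
-        hipError_t err_ = (expr);                                    \
-        if (err_ != hipSuccess) {                                    \
-            std::fprintf(stderr, "%s failed: %s\n", #expr,           \
-                         hipGetErrorString(err_));                   \
-            std::exit(EXIT_FAILURE);                                 \
-        }                                                            \
-    } while (0)
-
-int main()
-{
-    const int device = 0; // assumed target device
-
-    // Describe a pinned allocation that lives on the chosen device.
-    hipMemAllocationProp prop{};
-    prop.type          = hipMemAllocationTypePinned;
-    prop.location.type = hipMemLocationTypeDevice;
-    prop.location.id   = device;
-
-    // Round the requested size up to the reported minimum granularity.
-    size_t granularity = 0;
-    CHECK(hipMemGetAllocationGranularity(&granularity, &prop,
-                                         hipMemAllocationGranularityMinimum));
-    const size_t requested = 1 << 20; // 1 MiB, arbitrary
-    const size_t size = ((requested + granularity - 1) / granularity) * granularity;
-
-    // 1. Create physical memory. 2. Reserve a virtual range. 3. Map one onto the other.
-    hipMemGenericAllocationHandle_t handle;
-    CHECK(hipMemCreate(&handle, size, &prop, 0));
-    void* ptr = nullptr;
-    CHECK(hipMemAddressReserve(&ptr, size, 0, nullptr, 0));
-    CHECK(hipMemMap(ptr, size, 0, handle, 0));
-
-    // 4. Grant the device read/write access to the mapped range.
-    hipMemAccessDesc access{};
-    access.location = prop.location;
-    access.flags    = hipMemAccessFlagsProtReadWrite;
-    CHECK(hipMemSetAccess(ptr, size, &access, 1));
-
-    // The range can now be used like an ordinary device pointer.
-    CHECK(hipMemset(ptr, 0, size));
-
-    // 5. Tear down in reverse order.
-    CHECK(hipMemUnmap(ptr, size));
-    CHECK(hipMemRelease(handle));
-    CHECK(hipMemAddressFree(ptr, size));
-    return 0;
-}
-```
-
-Because these APIs are marked as Beta, checking every return value as above makes hipErrorNotSupported visible on platforms where a call is unavailable.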
-
-## CHAPTER TWENTYSEVEN: HIP DEPRECATED RUNTIME API FUNCTIONS
-
-Several of our API functions have been flagged for deprecation. Using the following functions results in errors and unexpected results, so we encourage you to update your code accordingly.
-
-## 27.1 Context management
-
-CUDA supports the cuCtx API, the driver API that defines 'Context' and 'Devices' as separate entities. A context contains a single device, and a device can theoretically have multiple contexts. HIP initially added limited support for these APIs to facilitate porting from existing driver code. These APIs are now marked as deprecated because there are better alternative interfaces (such as hipSetDevice or the stream API) that achieve the same functions.
-
-- hipCtxCreate
-- hipCtxDestroy
-- hipCtxPopCurrent
-- hipCtxPushCurrent
-- hipCtxSetCurrent
-- hipCtxGetCurrent
-- hipCtxGetDevice
-- hipCtxGetApiVersion
-- hipCtxGetCacheConfig
-- hipCtxSetCacheConfig
-- hipCtxSetSharedMemConfig
-- hipCtxGetSharedMemConfig
-- hipCtxSynchronize
-- hipCtxGetFlags
-- hipCtxEnablePeerAccess
-- hipCtxDisablePeerAccess
-- hipDevicePrimaryCtxGetState
-- hipDevicePrimaryCtxRelease
-- hipDevicePrimaryCtxRetain
-- hipDevicePrimaryCtxReset
-- hipDevicePrimaryCtxSetFlags
-
-## 27.2 Memory management
-
-- hipMallocHost (replaced with hipHostMalloc )
-- hipMemAllocHost (replaced with hipHostMalloc )
-- hipHostAlloc (replaced with hipHostMalloc )
-- hipFreeHost (replaced with hipHostFree )
-- hipMemcpyToArray
-- hipMemcpyFromArray
-
-## 27.3 Profiler control
-
-- hipProfilerStart (use roctracer/rocTX)
-- hipProfilerStop (use roctracer/rocTX)
-
-## 27.4 Texture management
-
-- hipGetTextureReference
-- hipTexRefSetAddressMode
-- hipTexRefSetArray
-- hipTexRefSetFilterMode
-- hipTexRefSetFlags
-- hipTexRefSetFormat
-- hipTexRefGetAddress
-- hipTexRefGetAddressMode
-- hipTexRefGetFilterMode
-- hipTexRefGetFlags
-- hipTexRefGetFormat
-- hipTexRefGetMaxAnisotropy
-- hipTexRefGetMipmapFilterMode
-- hipTexRefGetMipmapLevelBias
-- hipTexRefGetMipmapLevelClamp
-- hipTexRefGetMipMappedArray
-- hipTexRefSetAddress
-- hipTexRefSetAddress2D
-- hipTexRefSetMaxAnisotropy
-- hipTexRefSetBorderColor
-- hipTexRefSetMipmapFilterMode
-- hipTexRefSetMipmapLevelBias
-- hipTexRefSetMipmapLevelClamp
-- hipTexRefSetMipmappedArray
-- hipTexRefGetBorderColor
-- hipTexRefGetArray
-- hipBindTexture
-- hipBindTexture2D
-- hipBindTextureToArray
-- hipGetTextureAlignmentOffset
-- hipUnbindTexture
-- hipBindTextureToMipmappedArray
-
-## CHAPTER TWENTYEIGHT: SAXPY - HELLO, HIP
-
-This tutorial explains the basic concepts of the single-source Heterogeneous-computing Interface for Portability (HIP) programming model and the essential tooling around it. It also reviews some commonalities of heterogeneous APIs in general. This topic assumes basic familiarity with the C/C++ compilation model and language.
-
-## 28.1 Prerequisites
-
-To follow this tutorial, you'll need installed drivers and a HIP compiler toolchain to compile your code. Because HIP for ROCm supports compiling and running on Linux and Windows with AMD and NVIDIA GPUs, the combination of install instructions is more than is worth covering as part of this tutorial. For more information about installing HIP development packages, see Install HIP.
-
-## 28.2 Heterogeneous programming
-
-Heterogeneous programming and offloading APIs are often mentioned together. Heterogeneous programming deals with devices of varying capabilities simultaneously. Offloading focuses on the 'remote' and asynchronous aspects of computation. HIP encompasses both. It exposes GPGPU (general-purpose GPU) programming much like ordinary host-side CPU programming and lets you move data across various devices.
-
-When programming in HIP (and other heterogeneous APIs, for that matter), remember that target devices are built for a specific purpose. They are designed with different tradeoffs than traditional CPUs and therefore have very different performance characteristics. Even subtle changes in code might adversely affect execution time.
-
-## 28.3 Your first lines of HIP code
-
-First, let's do the 'Hello, World!' of GPGPU: SAXPY. Single-precision A times X Plus Y (SAXPY) is a mathematical acronym for the vector equation 𝑎 · 𝑥 + 𝑦 = 𝑧, where 𝑎 ∈ R is a scalar and 𝑥, 𝑦, 𝑧 ∈ V are vector quantities of some large dimensionality. This vector space is defined over the set of reals. Practically speaking, you can compute this using a single for loop over three arrays.
-
-```
-for (int i = 0; i < N; ++i)
-    z[i] = a * x[i] + y[i];
-```
-
-In linear algebra libraries such as BLAS (Basic Linear Algebra Subprograms), this operation is defined as AXPY, 'A times X Plus Y'. The 'S' comes from single-precision, meaning that the array elements are floats (IEEE 754 binary32 representation).
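-
-For reference, the loop described above can be written as a small host-side C++ function. This is a CPU-only sketch with an assumed function name (saxpy\_host) and std::vector inputs. It is not taken from the sample repository and is useful mainly as a baseline for checking device results.
-
-```cpp
-#include <cstddef>
-#include <vector>
-
-// Hypothetical host-side reference: z[i] = a * x[i] + y[i] for every element.
-// Assumes x and y have the same length.
-std::vector<float> saxpy_host(float a, const std::vector<float>& x,
-                              const std::vector<float>& y)
-{
-    std::vector<float> z(x.size());
-    for (std::size_t i = 0; i < x.size(); ++i)
-        z[i] = a * x[i] + y[i];
-    return z;
-}
-```
-
-Each iteration is independent of the others, which is what makes the operation a natural first candidate for offloading to a GPU.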
-
-To quickly get started, use the set of HIP samples from GitHub. With Git configured on your machine, open a commandline and navigate to your desired working directory, then run:
-
-```
-git clone https://github.com/amd/rocm-examples.git
-```
-
-A simple implementation of SAXPY resides in the HIP-Basic/saxpy/main.hip file in this repository. The HIP code here mostly deals with where data has to be and when, and how devices transform this data. The first HIP calls deal with allocating device-side memory and copying data from host-side memory to device side in a C runtime-like fashion.
-
-```
-// Allocate and copy vectors to device memory.
-float* d_x{};
-float* d_y{};
-HIP_CHECK(hipMalloc(&d_x, size_bytes));
-HIP_CHECK(hipMalloc(&d_y, size_bytes));
-HIP_CHECK(hipMemcpy(d_x, x.data(), size_bytes, hipMemcpyHostToDevice));
-HIP_CHECK(hipMemcpy(d_y, y.data(), size_bytes, hipMemcpyHostToDevice));
-```
-
-HIP\_CHECK is a custom macro borrowed from the examples' utilities. It checks the error code returned by API functions and reports any error to the console. It is not essential to the API, but it is good practice to check the error codes of HIP APIs in case you pass incorrect values to the API or the API runs out of resources.
-
-The code selects the device to allocate to and to copy to. Commands are issued to the HIP runtime per thread, and every thread has a device set as the target of commands. The default device is 0 , which is equivalent to calling hipSetDevice(0) .
-
-Launch the calculation on the device after the input data has been prepared.
-
-```
- __global__ void saxpy_kernel(const float a, const float* d_x, float* d_y, const unsigned int size)
- {
- //...
- }
-
- int main()
- {
- //...
-
- // Launch the kernel on the default stream.
- saxpy_kernel<<