This project implements the ShangMi 3 (SM3) cryptographic hash algorithm as a Swift Package. The implementation must:
- Use Swift 6.2
- Be entirely implemented in Swift (no C or Objective-C)
- Use no third-party libraries
- Produce 256-bit hash values
- Be compatible with official SM3 specifications
SM3 is a cryptographic hash function published by the Chinese National Cryptography Administration on 2010-12-17 as GM/T 0004-2012: SM3 cryptographic hash algorithm. It is also standardized in:
- GB/T 32905-2016 (Chinese standard)
- ISO/IEC 10118-3:2018 (International standard)
- IETF Draft: draft-sca-cfrg-sm3
Key properties:
- Output Size: 256 bits (32 bytes)
- Block Size: 512 bits (64 bytes)
- Input Limit: Messages shorter than 2^64 bits
- Construction: Merkle-Damgård with Davies-Meyer compression function
- Security Level: Similar to SHA-256
Initial value (IV):
0x7380166f, 0x4914b2b9, 0x172442d7, 0xda8a0600,
0xa96f30bc, 0x163138aa, 0xe38dee4d, 0xb0fb0e4e
Round constants T_j:
- For j = 0 to 15: 0x79cc4519
- For j = 16 to 63: 0x7a879d8a
Boolean function FF_j:
- For j = 0 to 15: FF_j(X,Y,Z) = X ⊕ Y ⊕ Z
- For j = 16 to 63: FF_j(X,Y,Z) = (X ∧ Y) ∨ (X ∧ Z) ∨ (Y ∧ Z)
Boolean function GG_j:
- For j = 0 to 15: GG_j(X,Y,Z) = X ⊕ Y ⊕ Z
- For j = 16 to 63: GG_j(X,Y,Z) = (X ∧ Y) ∨ (¬X ∧ Z)
Permutation functions:
P₀(X) = X ⊕ (X <<< 9) ⊕ (X <<< 17)
P₁(X) = X ⊕ (X <<< 15) ⊕ (X <<< 23)
Where <<< denotes circular left shift (rotate left).
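The constants and functions above translate almost directly into Swift. The following is a minimal sketch, not the project's actual code; the names `sm3IV`, `rotl`, `t`, `ff`, `gg`, `p0`, and `p1` are illustrative assumptions.

```swift
// Initial hash value V_0 (eight 32-bit words).
let sm3IV: [UInt32] = [
    0x7380166f, 0x4914b2b9, 0x172442d7, 0xda8a0600,
    0xa96f30bc, 0x163138aa, 0xe38dee4d, 0xb0fb0e4e,
]

// Circular left shift on a 32-bit word.
// Masking shifts (&<<, &>>) keep a rotation by 0 well-defined.
@inline(__always)
func rotl(_ x: UInt32, _ n: UInt32) -> UInt32 {
    (x &<< n) | (x &>> (32 &- n))
}

// Round constant T_j.
func t(_ j: Int) -> UInt32 {
    j < 16 ? 0x79cc4519 : 0x7a879d8a
}

// Boolean function FF_j: XOR for rounds 0...15, majority afterwards.
func ff(_ j: Int, _ x: UInt32, _ y: UInt32, _ z: UInt32) -> UInt32 {
    j < 16 ? x ^ y ^ z : (x & y) | (x & z) | (y & z)
}

// Boolean function GG_j: XOR for rounds 0...15, choose afterwards.
func gg(_ j: Int, _ x: UInt32, _ y: UInt32, _ z: UInt32) -> UInt32 {
    j < 16 ? x ^ y ^ z : (x & y) | (~x & z)
}

// Permutations P0 and P1.
func p0(_ x: UInt32) -> UInt32 { x ^ rotl(x, 9) ^ rotl(x, 17) }
func p1(_ x: UInt32) -> UInt32 { x ^ rotl(x, 15) ^ rotl(x, 23) }
```

Passing the round index `j` as a parameter keeps the code close to the specification; an optimized version would more likely split the 64 rounds into two loops so the branch disappears.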
Given a message M of length l bits:
- Append a single "1" bit to the message
- Append k "0" bits, where k is the smallest non-negative solution to l + 1 + k ≡ 448 (mod 512)
- Append the 64-bit big-endian representation of l
- Result: padded message with length ≡ 0 (mod 512)
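The padding steps above can be sketched as follows. This is a minimal illustration that assumes byte-aligned input (so the single "1" bit and the seven "0" bits after it form the byte `0x80`); the function name `sm3Pad` is hypothetical.

```swift
// Pads a byte-aligned message per the SM3 rule:
// 0x80, zero bytes up to 448 mod 512 bits, then the 64-bit big-endian bit length.
func sm3Pad(_ message: [UInt8]) -> [UInt8] {
    let bitLength = UInt64(message.count) &* 8
    var padded = message
    padded.append(0x80)                  // the "1" bit followed by seven "0" bits
    while padded.count % 64 != 56 {      // 56 bytes = 448 bits
        padded.append(0)
    }
    // Append l as eight bytes, most significant byte first.
    for shift in stride(from: 56, through: 0, by: -8) {
        padded.append(UInt8(truncatingIfNeeded: bitLength >> UInt64(shift)))
    }
    return padded
}

let paddedABC = sm3Pad(Array("abc".utf8))
// paddedABC is one 64-byte block: 61 62 63 80 00 ... 00 18
```

A message of 56 bytes or more in its final block spills into a second block, because the length field no longer fits after the `0x80` byte.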
For each 512-bit block, divide into 16 words W₀...W₁₅ (32-bit big-endian), then expand to 68 words:
For j = 16 to 67:
W_j = P₁(W_{j-16} ⊕ W_{j-9} ⊕ (W_{j-3} <<< 15))
⊕ (W_{j-13} <<< 7) ⊕ W_{j-6}
Generate W' array (64 words):
For j = 0 to 63:
W'_j = W_j ⊕ W_{j+4}
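A scalar version of the expansion above might look like this. It is a sketch under the assumption that a block arrives as a 64-element `[UInt8]`; `expand(block:)`, `rotl`, and `p1` are illustrative names.

```swift
@inline(__always)
func rotl(_ x: UInt32, _ n: UInt32) -> UInt32 {
    (x &<< n) | (x &>> (32 &- n))
}

func p1(_ x: UInt32) -> UInt32 { x ^ rotl(x, 15) ^ rotl(x, 23) }

// Expands one 64-byte block into W (68 words) and W' (64 words).
func expand(block: [UInt8]) -> (w: [UInt32], wPrime: [UInt32]) {
    precondition(block.count == 64, "SM3 blocks are 64 bytes")
    var w = [UInt32](repeating: 0, count: 68)
    for i in 0..<16 {
        // Big-endian load of word i.
        w[i] = block[4 * i..<4 * i + 4].reduce(UInt32(0)) { ($0 << 8) | UInt32($1) }
    }
    for j in 16..<68 {
        let x = w[j - 16] ^ w[j - 9] ^ rotl(w[j - 3], 15)
        w[j] = p1(x) ^ rotl(w[j - 13], 7) ^ w[j - 6]
    }
    let wPrime = (0..<64).map { w[$0] ^ w[$0 + 4] }
    return (w, wPrime)
}
```

For the padded "abc" block, W₀ is 0x61626380 and every other input word except W₁₅ is zero, so W₁₆ reduces to P₁(W₀) = 0x9092e200, which is easy to check by hand.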
Initialize working variables A,B,C,D,E,F,G,H with current hash value V_i.
For j = 0 to 63:
SS1 = ((A <<< 12) + E + (T_j <<< (j mod 32))) <<< 7
SS2 = SS1 ⊕ (A <<< 12)
TT1 = FF_j(A,B,C) + D + SS2 + W'_j
TT2 = GG_j(E,F,G) + H + SS1 + W_j
D = C
C = B <<< 9
B = A
A = TT1
H = G
G = F <<< 19
F = E
E = P₀(TT2)
After all 64 rounds:
V_{i+1} = (A||B||C||D||E||F||G||H) ⊕ V_i
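The round loop above maps onto Swift as shown below. This is a self-contained sketch rather than the planned implementation: the name `compress(_:block:)` is an assumption, the message expansion is repeated inline so the block stands alone, and W'_j is computed on the fly as `w[j] ^ w[j + 4]` instead of being stored.

```swift
@inline(__always)
func rotl(_ x: UInt32, _ n: UInt32) -> UInt32 {
    (x &<< n) | (x &>> (32 &- n))
}

// One application of the compression function: returns V_{i+1} from V_i and a 64-byte block.
func compress(_ v: [UInt32], block: [UInt8]) -> [UInt32] {
    precondition(v.count == 8 && block.count == 64)

    // Message expansion (W_0 ... W_67).
    var w = [UInt32](repeating: 0, count: 68)
    for i in 0..<16 {
        w[i] = block[4 * i..<4 * i + 4].reduce(UInt32(0)) { ($0 << 8) | UInt32($1) }
    }
    for j in 16..<68 {
        let x = w[j - 16] ^ w[j - 9] ^ rotl(w[j - 3], 15)
        let p1 = x ^ rotl(x, 15) ^ rotl(x, 23)
        w[j] = p1 ^ rotl(w[j - 13], 7) ^ w[j - 6]
    }

    var a = v[0], b = v[1], c = v[2], d = v[3]
    var e = v[4], f = v[5], g = v[6], h = v[7]

    for j in 0..<64 {
        let tj: UInt32 = j < 16 ? 0x79cc4519 : 0x7a879d8a
        let ffv = j < 16 ? a ^ b ^ c : (a & b) | (a & c) | (b & c)
        let ggv = j < 16 ? e ^ f ^ g : (e & f) | (~e & g)
        let a12 = rotl(a, 12)
        // All additions are modulo 2^32, hence the wrapping operator &+.
        let ss1 = rotl(a12 &+ e &+ rotl(tj, UInt32(j % 32)), 7)
        let ss2 = ss1 ^ a12
        let tt1 = ffv &+ d &+ ss2 &+ (w[j] ^ w[j + 4])   // W'_j
        let tt2 = ggv &+ h &+ ss1 &+ w[j]
        d = c; c = rotl(b, 9); b = a; a = tt1
        h = g; g = rotl(f, 19); f = e
        e = tt2 ^ rotl(tt2, 9) ^ rotl(tt2, 17)           // P0(TT2)
    }

    // Feed-forward: XOR the working variables into the previous state.
    return zip([a, b, c, d, e, f, g, h], v).map { $0 ^ $1 }
}
```

Compressing the single padded "abc" block, starting from the IV, should reproduce the first test vector, whose leading word is 0x66c7f0f4.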
After processing all message blocks, the final hash value is:
H = V_n = (A||B||C||D||E||F||G||H)
Input: "abc" (UTF-8: 0x616263)
Expected Output:
66c7f0f462eeedd9d1f2d46bdc10e4e24167c4875cf2f7a2297da02b8f4ba8e0
Input: "abcd" repeated 16 times (64 bytes)
Expected Output:
debe9ff92275b8a138604889c18e5a4d6fdb70e5387e5765293dcba39c0c5732
Input: "Yoda said, Do or do not. There is not try."
Expected Output:
6bb5ff84416dc1edf21c7b0c36d7adfdebe9378702a8982dd6ff0842188b67a5
Input: "" (empty string)
Expected Output:
1ab21d8355cfa17f8e61194831e81a8f22bec8c728fefb747ed035eb5082aa2b
- emmansun/gmsm - https://github.com/emmansun/gmsm
  - Comprehensive ShangMi cipher suite
  - SIMD optimizations (AVX2, AVX, SSE2, NEON)
  - MIT License
  - Good reference for optimization techniques
- sammyne/sm3 - https://github.com/sammyne/sm3
  - Pure Go implementation
  - Simple, readable code structure
- zhao07/libsm3 - https://github.com/zhao07/libsm3
  - Reference C implementation
  - Clear algorithm structure
- Crypto++ Library
  - SM3 implementation in the Crypto++ suite
  - Well-documented API
  - Extensive testing
- siddontang/pygmcrypto - https://github.com/siddontang/pygmcrypto
  - C implementation with Python bindings
- Use `UInt32` for all 32-bit word operations
- Use `UInt64` for message length tracking
- All multi-byte values are big-endian
- Circular left shift (rotate left): `<<<`
- XOR: `⊕` (use `^` in Swift)
- AND: `∧` (use `&` in Swift)
- OR: `∨` (use `|` in Swift)
- NOT: `¬` (use `~` in Swift)
- Addition: modulo 2^32 (use the wrapping operator `&+` on `UInt32`; plain `+` traps on overflow)
- Endianness: Use `bigEndian` property or byte swapping
- Rotate Left: Implement as `(value << n) | (value >> (32 - n))`
- Array Access: W array needs 68 elements, W' needs 64 elements
- Memory Safety: Swift 6.2's strict concurrency will help prevent data races
- Performance: Consider using inline functions for frequently called operations
- Protocol Conformance: Consider an incremental `update`/`finalize` hasher pattern; note that the standard library's `Hashable` protocol targets hash tables, not cryptographic digests
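Two of the notes above, wrapping addition and big-endian byte order, are easy to get wrong, so a short sketch may help. The helper names `bigEndianBytes` and `word(fromBigEndian:)` are hypothetical.

```swift
// Wrapping addition: `&+` reduces modulo 2^32, whereas plain `+` would trap here.
let maxWord: UInt32 = 0xffff_ffff
let wrappedSum = maxWord &+ 2            // wraps around to 1

// Big-endian store: most significant byte first, independent of host byte order.
func bigEndianBytes(_ word: UInt32) -> [UInt8] {
    withUnsafeBytes(of: word.bigEndian) { Array($0) }
}

// Big-endian load from exactly four bytes.
func word(fromBigEndian bytes: [UInt8]) -> UInt32 {
    precondition(bytes.count == 4, "a 32-bit word needs four bytes")
    return bytes.reduce(UInt32(0)) { ($0 << 8) | UInt32($1) }
}

let firstIVWordBytes = bigEndianBytes(0x7380166f)   // [0x73, 0x80, 0x16, 0x6f]
let firstABCWord = word(fromBigEndian: [0x61, 0x62, 0x63, 0x80])
```

The same store helper can serialize the eight final state words into the 32-byte digest.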
The Go implementation (emmansun/gmsm) achieves ~53% performance improvement using SIMD instructions (AVX2 on x86, NEON on ARM64). Swift may be able to achieve similar results on Apple Silicon using native SIMD types.
According to the Go SIMD optimization documentation, the primary parallelized operations are:
- Message Schedule Computation: Calculating multiple W words simultaneously
- P₁ Permutation Function: The most computation-heavy operation
  - Multiple rotations: 15-bit and 23-bit circular shifts
  - XOR operations across multiple words
- W_{j-13} and W_{j-3} Rotations: 7-bit and 15-bit shifts in parallel
- Vector XOR Operations: Multiple XOR combinations computed simultaneously
Performance: The Go AVX2 implementation achieves ~384.5 MB/s throughput.
Swift 5+ includes built-in SIMD types that compile directly to hardware instructions:
- `SIMD2<T>`, `SIMD4<T>`, `SIMD8<T>`, `SIMD16<T>`, `SIMD32<T>`, `SIMD64<T>`
- `T` can be `UInt32` (perfect for SM3's 32-bit words)
- On Apple Silicon, these compile directly to NEON instructions
// Example operands
let a = SIMD4<UInt32>(1, 2, 3, 4)
let b = SIMD4<UInt32>(5, 6, 7, 8)
// Arithmetic (wrapping)
let sum = a &+ b // Vector addition
let difference = a &- b // Vector subtraction
let product = a &* b // Vector multiplication
// Bitwise operations
let andResult = a & b // AND
let orResult = a | b // OR
let xorResult = a ^ b // XOR
let notResult = ~a // NOT
// Shifts (integer SIMD vectors use the masking forms &<< and &>>; rotations need a custom implementation)
let shiftedLeft = a &<< 5 // Left shift
let shiftedRight = a &>> 5 // Right shift
- 2-10x speedup for data-parallel operations
- Zero overhead abstraction - compiles to native SIMD instructions
- Auto-vectorization: Swift compiler can automatically vectorize some operations
- Standard UInt32 operations
- Clear, readable code
- Easy to verify correctness
- Target: Correct implementation first
Most promising optimization target:
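// Process 4 W words at once using SIMD4<UInt32>
func expandMessageSIMD4(W: inout [UInt32]) {
    for j in stride(from: 16, to: 68, by: 4) {
        // Load 4 words into SIMD registers
        let w_j_minus_16 = SIMD4<UInt32>(W[j-16], W[j-15], W[j-14], W[j-13])
        let w_j_minus_9 = SIMD4<UInt32>(W[j-9], W[j-8], W[j-7], W[j-6])
        // Note: the W[j-3] operand of the fourth lane is W[j] itself, so that lane needs a fix-up
        // ... compute 4 words in parallel
    }
}
Benefits:
- Message expansion has few data dependencies between iterations: only the W_{j-3} term reaches back fewer than four words, so three of every four words are independent and the fourth needs a small fix-up
- Well suited to SIMD processing
- P₁ consists only of rotations and XORs, so the permutations for a group of words can be computed in parallel lanes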
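One way to handle the fourth-lane dependency is to compute that lane with a zero in place of the unknown W[j] and patch it afterwards, relying on P₁(u ⊕ v) = P₁(u) ⊕ P₁(v). The sketch below illustrates the idea and cross-checks it against a scalar loop; it is not the Go project's method, and the names `expandScalar`, `expandSIMD4`, and `rotlv` are assumptions.

```swift
@inline(__always)
func rotl(_ x: UInt32, _ n: UInt32) -> UInt32 {
    (x &<< n) | (x &>> (32 &- n))
}

// Lane-wise rotate; SIMD integer vectors offer the masking shifts &<< and &>>.
@inline(__always)
func rotlv(_ v: SIMD4<UInt32>, _ n: UInt32) -> SIMD4<UInt32> {
    let m: UInt32 = 32 &- n
    return (v &<< n) | (v &>> m)
}

// Scalar reference used to cross-check the vector version.
func expandScalar(_ w: inout [UInt32]) {
    for j in 16..<68 {
        let x = w[j - 16] ^ w[j - 9] ^ rotl(w[j - 3], 15)
        let p1 = x ^ rotl(x, 15) ^ rotl(x, 23)
        w[j] = p1 ^ rotl(w[j - 13], 7) ^ w[j - 6]
    }
}

// Vector version: lanes 0...2 are independent; lane 3 depends on lane 0 of the same step.
func expandSIMD4(_ w: inout [UInt32]) {
    precondition(w.count == 68)
    for j in stride(from: 16, to: 68, by: 4) {
        let a = SIMD4<UInt32>(w[j - 16], w[j - 15], w[j - 14], w[j - 13])
        let b = SIMD4<UInt32>(w[j - 9], w[j - 8], w[j - 7], w[j - 6])
        let c = SIMD4<UInt32>(w[j - 3], w[j - 2], w[j - 1], 0)  // lane 3 patched below
        let d = SIMD4<UInt32>(w[j - 13], w[j - 12], w[j - 11], w[j - 10])
        let e = SIMD4<UInt32>(w[j - 6], w[j - 5], w[j - 4], w[j - 3])
        let x = a ^ b ^ rotlv(c, 15)
        let p1 = x ^ rotlv(x, 15) ^ rotlv(x, 23)
        var r = p1 ^ rotlv(d, 7) ^ e
        // Fold in the missing term P1(W[j] <<< 15) for lane 3, using W[j] = r[0].
        let y = rotl(r[0], 15)
        r[3] ^= y ^ rotl(y, 15) ^ rotl(y, 23)
        for lane in 0..<4 { w[j + lane] = r[lane] }
    }
}
```

Whether this beats the scalar loop depends on how well the compiler keeps the gathered lanes in registers, so it is worth benchmarking before adopting.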
// Compute W'[j] = W[j] ^ W[j+4] in parallel
func generateWPrimeSIMD8(W: [UInt32]) -> [UInt32] {
    precondition(W.count == 68, "expected the 68-word expanded schedule")
    var WPrime = [UInt32](repeating: 0, count: 64)
    for j in stride(from: 0, to: 64, by: 8) {
        let w_j = SIMD8<UInt32>(W[j..<j + 8]) // load W[j]...W[j+7]
        let w_j_plus_4 = SIMD8<UInt32>(W[j + 4..<j + 12]) // load W[j+4]...W[j+11]
        let result = w_j ^ w_j_plus_4 // 8 XORs in one vector operation
        for lane in 0..<8 { WPrime[j + lane] = result[lane] } // store result
    }
    return WPrime
}
Process multiple independent message blocks simultaneously:
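// Hash 4 blocks in parallel (SIMD-across-blocks)
func hashMultipleBlocks(_ blocks: [[UInt8]]) -> [[UInt8]] {
    // Use SIMD4<UInt32> where each lane processes one block
    // Lanes must hold blocks from independent messages: blocks of one message are chained
    // Requires significant refactoring but maximum throughput
    fatalError("Not implemented: placeholder for a future multi-block design")
}
Swift doesn't have built-in rotation operators, but they can be implemented efficiently: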
@inline(__always)
func rotateLeft(_ value: UInt32, by amount: UInt32) -> UInt32 {
    return (value << amount) | (value >> (32 - amount))
}
// SIMD version for rotating vectors (uses the masking shifts &<< and &>>)
@inline(__always)
func rotateLeft(_ vector: SIMD4<UInt32>, by amount: UInt32) -> SIMD4<UInt32> {
    return (vector &<< amount) | (vector &>> (32 &- amount))
}
- Phase 1: Implement scalar version
  - Verify correctness with all test vectors
  - Profile to identify hotspots
- Phase 2: Add SIMD message expansion
  - Use `SIMD4<UInt32>` or `SIMD8<UInt32>`
  - Benchmark against scalar version
  - Expected: 30-50% improvement
- Phase 3: Optimize based on profiling
  - Add SIMD to other hotspots
  - Consider multi-block processing for batch operations
Why SIMD types are better than Accelerate for SM3:
- Type Safety: SIMD types are type-safe at compile time
- Simplicity: No need to manage vDSP buffers or setup
- Portability: SIMD types work on all platforms (iOS, macOS, Linux)
- Cryptographic Operations: Accelerate/vDSP is designed for DSP operations (FFT, convolution), not bitwise crypto operations
- Direct Hardware Mapping: SIMD types compile directly to NEON on Apple Silicon
- Inline-able: Can be inlined by compiler for zero overhead
Accelerate framework limitations:
- vDSP focuses on floating-point DSP operations
- No direct support for bitwise rotations or crypto-specific operations
- Overhead of function calls to Accelerate library
- Not designed for the type of operations SM3 requires
Projected throughput, extrapolated from the Go SIMD implementation results (estimates, not measurements):
- Baseline (scalar): ~250 MB/s (estimated)
- SIMD4 message expansion: ~350-400 MB/s (40-60% improvement)
- Full SIMD optimization: ~450-500 MB/s (80-100% improvement)
On Apple Silicon (M1/M2/M3) with NEON instructions, we may achieve even better results due to:
- Unified memory architecture
- Wide execution units
- Advanced branch prediction
- L1/L2 cache optimization
SM3/
├── Package.swift
├── README.md
├── CLAUDE.md (this file)
├── Sources/
│   └── SM3/
│       ├── SM3.swift (main algorithm)
│       ├── SM3+Extensions.swift (convenience methods)
│       └── Internal/
│           ├── Constants.swift
│           ├── BitOperations.swift
│           └── Padding.swift
└── Tests/
    └── SM3Tests/
        ├── SM3Tests.swift
        └── TestVectors.swift
import Foundation

// Hash a string
let stringHash = SM3.hash(data: Data("abc".utf8))
// Hash data
let data = Data([0x61, 0x62, 0x63])
let dataHash = SM3.hash(data: data)
// Streaming API (data1 and data2 are placeholder Data values)
var hasher = SM3()
hasher.update(data: data1)
hasher.update(data: data2)
let streamedHash = hasher.finalize()
- GM/T OID: 1.2.156.10197.1.401
- ISO OID: 1.0.10118.3.0.65
Current cryptanalytic attacks can reach approximately:
- 31% of compression function steps for collision attacks
- 47% for preimage attacks
This demonstrates security resistance comparable to or exceeding SHA-2 variants. No practical attacks are known against full SM3.
- GM/T 0004-2012 (Chinese cryptography industry standard)
- GB/T 32905-2016 (Chinese National Standard)
- ISO/IEC 10118-3:2018
- IETF Draft: https://datatracker.ietf.org/doc/html/draft-sca-cfrg-sm3
- Wikipedia: https://en.wikipedia.org/wiki/SM3_(hash_function)
- Algorithm specification reviewed
- Multiple reference implementations analyzed
- Test vectors collected and verified
- Implementation strategy defined
- Swift package structure planned
Research Date: 2025-10-25