[Solution sharing] Pitch-preserving variable-speed playback with SoundTouch #487

@mayu888

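Problem description

When OffscreenSprite renders an audio Clip at a playback rate other than 1x, it internally calls changePCMPlaybackRate to resample the PCM data and achieve the speed change. This changes the pitch along with the playback speed: at 1.5x the audio not only plays faster, its pitch also rises, which is incorrect behavior for video editing.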
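
To make the pitch shift concrete, here is a minimal, self-contained sketch. It is not WebAV's actual changePCMPlaybackRate implementation; it assumes a naive linear-interpolation resampler, and resampleByRate, estimateFrequency and measurePitchShift are hypothetical names introduced only for illustration. It speeds up a 440 Hz sine tone and estimates the frequency before and after by counting zero crossings.

```typescript
// Naive speed change: step through the source `rate` times faster than normal,
// linearly interpolating between neighbouring samples.
function resampleByRate(input: Float32Array, rate: number): Float32Array {
    const outLength = Math.floor(input.length / rate);
    const out = new Float32Array(outLength);
    for (let i = 0; i < outLength; i++) {
        const pos = i * rate;
        const i0 = Math.floor(pos);
        const i1 = Math.min(i0 + 1, input.length - 1);
        const frac = pos - i0;
        out[i] = input[i0] * (1 - frac) + input[i1] * frac;
    }
    return out;
}

// Rough frequency estimate in Hz: count upward zero crossings per second.
function estimateFrequency(samples: Float32Array, sampleRate: number): number {
    let crossings = 0;
    for (let i = 1; i < samples.length; i++) {
        if (samples[i - 1] < 0 && samples[i] >= 0) crossings++;
    }
    return crossings / (samples.length / sampleRate);
}

// Generate one second of a 440 Hz sine and compare it with its resampled version.
function measurePitchShift(rate: number, sampleRate = 48000): { originalHz: number; resampledHz: number } {
    const tone = new Float32Array(sampleRate);
    for (let i = 0; i < tone.length; i++) {
        tone[i] = Math.sin((2 * Math.PI * 440 * i) / sampleRate);
    }
    const fast = resampleByRate(tone, rate);
    return {
        originalHz: estimateFrequency(tone, sampleRate),
        resampledHz: estimateFrequency(fast, sampleRate),
    };
}

const { originalHz, resampledHz } = measurePitchShift(1.5);
console.log(`original: ${originalHz.toFixed(0)} Hz, resampled: ${resampledHz.toFixed(0)} Hz`);
```

At a rate of 1.5 the estimated frequency rises by roughly the same factor, which is the pitch increase described above.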

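Root cause

OffscreenSprite handles the playback rate by changing the PCM sample rate rather than by running a time-stretching algorithm. There is currently no built-in way to decouple playback speed from pitch.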

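Solution: wrap AudioClip with SoundTouch

The core idea is to bypass OffscreenSprite's speed-change logic and handle audio time stretching externally:

  1. Create a wrapper class, SpeedAudioClip, that implements the IClip interface
  2. In tick(time), multiply the time by the rate (time * speed) before passing it to the original AudioClip, so that it reads more or less of the source material
  3. Run the PCM data that was read through soundtouch-ts with tempo = speed and pitch = 1, which stretches time without changing the pitch
  4. Set sprite.time.playbackRate to 1 so that OffscreenSprite skips its own resampling logic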
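
The time and frame bookkeeping behind steps 2 and 3 can be sketched as three small pure functions. This is only an illustration: the function names are hypothetical and belong to neither WebAV nor soundtouch-ts, and timestamps are assumed to be in microseconds.

```typescript
/** Map a timestamp on the sped-up timeline to a position in the source material. */
function toSourceTime(timelineTimeUs: number, speed: number): number {
    return timelineTimeUs * speed;
}

/** Duration the clip occupies on the timeline after the speed change. */
function toTimelineDuration(sourceDurationUs: number, speed: number): number {
    return sourceDurationUs / speed;
}

/** Frames the time stretcher should hand back for one tick's worth of input. */
function expectedOutputFrames(inputFrames: number, speed: number): number {
    return Math.round(inputFrames / speed);
}

console.log(toSourceTime(2_000_000, 1.5));        // source position for the 2 s mark
console.log(toTimelineDuration(60_000_000, 1.5)); // timeline length of a 60 s source
console.log(expectedOutputFrames(2400, 1.5));     // output frames for 2400 input frames
```

At 1.5x, the 2 s mark on the timeline reads from the 3 s mark of the source, a 60 s source occupies 40 s of timeline, and 2400 input frames shrink to 1600 output frames.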

Installing dependencies

npm install soundtouch-ts
# or
pnpm add soundtouch-ts

Full implementation

import type { AudioClip, IClip } from '@webav/av-cliper';
import { SoundTouch } from 'soundtouch-ts';

interface IClipMeta {
    width: number;
    height: number;
    duration: number;
    sampleRate?: number;
    chanCount?: number;
}

export class SpeedAudioClip implements IClip {
    private realClip: AudioClip;
    private speed: number;
    private st: SoundTouch;
    private sampleRate: number;
    private lastRealTime: number = 0;
    private outputBuffer: Float32Array[] = [];
    private bufferedSamples: number = 0;

    readonly ready: Promise<IClipMeta>;
    private _meta: IClipMeta | null = null;

    constructor(realClip: AudioClip, speed: number, sampleRate: number = 48000) {
        this.realClip = realClip;
        this.speed = speed;
        this.sampleRate = sampleRate;

        this.st = new SoundTouch(sampleRate);

        // Derive suitable parameters from the rate (modeled on SoundTouch's automatic parameter algorithm)
        const { sequenceMs, seekWindowMs, overlapMs } = this.calculateOptimalParams(speed);
        this.st.tdStretch.setParameters(sampleRate, sequenceMs, seekWindowMs, overlapMs);
        this.st.tdStretch.quickSeek = false;

        // tempo changes the speed while preserving pitch; pitch stays fixed at 1
        this.st.tempo = speed;
        this.st.pitch = 1;

        this.outputBuffer = [new Float32Array(0), new Float32Array(0)];

        this.ready = realClip.ready.then(async () => {
            const meta = await realClip.meta;
            this._meta = {
                width: meta.width,
                height: meta.height,
                duration: meta.duration / speed, // duration after the speed change
                sampleRate: meta.sampleRate,
                chanCount: meta.chanCount,
            };
            return this._meta;
        });
    }

    /**
     * Compute SoundTouch parameters for a given playback rate
     * (ported from the automatic parameter algorithm in SoundTouch's C++ code).
     */
    private calculateOptimalParams(rate: number) {
        const AUTOSEQ_TEMPO_LOW = 0.5;
        const AUTOSEQ_TEMPO_TOP = 2.0;
        const AUTOSEQ_AT_MIN = 125.0;
        const AUTOSEQ_AT_MAX = 50.0;
        const AUTOSEEK_AT_MIN = 25.0;
        const AUTOSEEK_AT_MAX = 15.0;

        const AUTOSEQ_K = (AUTOSEQ_AT_MAX - AUTOSEQ_AT_MIN) / (AUTOSEQ_TEMPO_TOP - AUTOSEQ_TEMPO_LOW);
        const AUTOSEQ_C = AUTOSEQ_AT_MIN - AUTOSEQ_K * AUTOSEQ_TEMPO_LOW;
        const AUTOSEEK_K = (AUTOSEEK_AT_MAX - AUTOSEEK_AT_MIN) / (AUTOSEQ_TEMPO_TOP - AUTOSEQ_TEMPO_LOW);
        const AUTOSEEK_C = AUTOSEEK_AT_MIN - AUTOSEEK_K * AUTOSEQ_TEMPO_LOW;

        const clampedTempo = Math.max(AUTOSEQ_TEMPO_LOW, Math.min(AUTOSEQ_TEMPO_TOP, rate));

        const sequenceMs = Math.max(AUTOSEQ_AT_MAX, Math.min(AUTOSEQ_AT_MIN, AUTOSEQ_C + AUTOSEQ_K * clampedTempo));
        const seekWindowMs = Math.max(AUTOSEEK_AT_MAX, Math.min(AUTOSEEK_AT_MIN, AUTOSEEK_C + AUTOSEEK_K * clampedTempo));

        return { sequenceMs, seekWindowMs, overlapMs: 8 };
    }

    get meta(): IClipMeta {
        if (!this._meta) throw new Error('SpeedAudioClip not ready');
        return this._meta;
    }

    tick = async (time: number): Promise<{ audio: Float32Array[]; state: 'success' | 'done' }> => {
        // Time on the sped-up timeline -> time in the original material
        const realTime = time * this.speed;
        const result = await this.realClip.tick(realTime);

        if (result.state === 'done' || !result.audio || result.audio.length === 0) {
            return result;
        }

        // When the speed is close to 1, return the audio unprocessed
        if (Math.abs(this.speed - 1) < 0.01) return result;

        // Detect a seek (time moved backwards) or a large forward jump, then reset the SoundTouch state
        const timeDiff = realTime - this.lastRealTime;
        if (this.lastRealTime > 0 && (timeDiff < 0 || timeDiff > 1_000_000)) {
            this.st.clear();
            this.outputBuffer = [new Float32Array(0), new Float32Array(0)];
            this.bufferedSamples = 0;
        }
        this.lastRealTime = realTime;

        const audio = result.audio;
        const channelCount = audio.length;
        const inputFrameCount = audio[0].length;

        // Convert to interleaved stereo (SoundTouch only accepts interleaved stereo input)
        const stereoInterleaved = new Float32Array(inputFrameCount * 2);
        for (let i = 0; i < inputFrameCount; i++) {
            stereoInterleaved[i * 2] = audio[0][i];
            stereoInterleaved[i * 2 + 1] = audio[1]?.[i] ?? audio[0][i];
        }

        // Feed the samples into SoundTouch
        this.st.inputBuffer.putSamples(stereoInterleaved, 0, inputFrameCount);
        this.st.process();

        // Pull the processed samples, split them into two channels, and append them to the output buffer
        const stOutputCount = this.st.outputBuffer.frameCount;
        if (stOutputCount > 0) {
            const stereoOutput = new Float32Array(stOutputCount * 2);
            this.st.outputBuffer.receiveSamples(stereoOutput, stOutputCount);

            const left = new Float32Array(stOutputCount);
            const right = new Float32Array(stOutputCount);
            for (let i = 0; i < stOutputCount; i++) {
                left[i] = stereoOutput[i * 2];
                right[i] = stereoOutput[i * 2 + 1];
            }
            this.appendToBuffer(left, right);
        }

        // Expected output frames = input frames / speed (because speed times as much source data was read)
        const expectedOutputFrames = Math.round(inputFrameCount / this.speed);

        // Return silence while the buffer is still short (SoundTouch warm-up period)
        if (this.bufferedSamples < expectedOutputFrames) {
            return {
                audio: Array.from({ length: channelCount }, () => new Float32Array(expectedOutputFrames)),
                state: 'success',
            };
        }

        const [left, right] = this.consumeFromBuffer(expectedOutputFrames);

        const outputAudio: Float32Array[] = [];
        if (channelCount === 1) {
            outputAudio[0] = new Float32Array(expectedOutputFrames);
            for (let i = 0; i < expectedOutputFrames; i++) {
                outputAudio[0][i] = (left[i] + right[i]) / 2;
            }
        } else {
            outputAudio[0] = left;
            outputAudio[1] = right;
            for (let ch = 2; ch < channelCount; ch++) {
                outputAudio[ch] = new Float32Array(right);
            }
        }

        return { audio: outputAudio, state: 'success' };
    };

    private appendToBuffer(left: Float32Array, right: Float32Array): void {
        const newLeft = new Float32Array(this.bufferedSamples + left.length);
        const newRight = new Float32Array(this.bufferedSamples + right.length);
        newLeft.set(this.outputBuffer[0]);
        newLeft.set(left, this.bufferedSamples);
        newRight.set(this.outputBuffer[1]);
        newRight.set(right, this.bufferedSamples);
        this.outputBuffer[0] = newLeft;
        this.outputBuffer[1] = newRight;
        this.bufferedSamples += left.length;
    }

    private consumeFromBuffer(frameCount: number): [Float32Array, Float32Array] {
        const left = this.outputBuffer[0].slice(0, frameCount);
        const right = this.outputBuffer[1].slice(0, frameCount);
        this.outputBuffer[0] = this.outputBuffer[0].slice(frameCount);
        this.outputBuffer[1] = this.outputBuffer[1].slice(frameCount);
        this.bufferedSamples = Math.max(0, this.bufferedSamples - frameCount);
        return [left, right];
    }

    clone = async () => {
        const clonedReal = await this.realClip.clone();
        return new SpeedAudioClip(clonedReal as AudioClip, this.speed, this.sampleRate) as this;
    };

    split = async (time: number) => {
        const realTime = time * this.speed;
        const [l, r] = await this.realClip.split(realTime);
        return [
            new SpeedAudioClip(l as AudioClip, this.speed, this.sampleRate),
            new SpeedAudioClip(r as AudioClip, this.speed, this.sampleRate),
        ] as [this, this];
    };

    destroy(): void {
        this.realClip.destroy();
        this.st.clear();
        this.outputBuffer = [];
    }
}

/**
 * Factory function: when the speed is close to 1, return the original Clip directly to avoid unnecessary processing
 */
export function createSpeedAudioClip(audioClip: AudioClip, speed: number, sampleRate = 48000): IClip {
    if (Math.abs(speed - 1) < 0.01) return audioClip;
    return new SpeedAudioClip(audioClip, speed, sampleRate);
}
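
The channel handling inside tick can also be isolated into two pure helpers, which makes the mono round trip easier to test on its own. This is a sketch under that assumption; interleaveToStereo and deinterleaveStereo are hypothetical names and are not part of soundtouch-ts or WebAV.

```typescript
/** Interleave planar channels into L/R pairs; a mono input is duplicated into both channels. */
function interleaveToStereo(channels: Float32Array[]): Float32Array {
    const left = channels[0];
    const right = channels[1] ?? left;
    const out = new Float32Array(left.length * 2);
    for (let i = 0; i < left.length; i++) {
        out[i * 2] = left[i];
        out[i * 2 + 1] = right[i];
    }
    return out;
}

/** Split interleaved stereo back into planar channels; mono output is the average of L and R. */
function deinterleaveStereo(interleaved: Float32Array, channelCount: number): Float32Array[] {
    const frames = Math.floor(interleaved.length / 2);
    const left = new Float32Array(frames);
    const right = new Float32Array(frames);
    for (let i = 0; i < frames; i++) {
        left[i] = interleaved[i * 2];
        right[i] = interleaved[i * 2 + 1];
    }
    if (channelCount === 1) {
        const mono = new Float32Array(frames);
        for (let i = 0; i < frames; i++) {
            mono[i] = (left[i] + right[i]) / 2;
        }
        return [mono];
    }
    return [left, right];
}

// Round trip for a two-frame mono signal.
const stereoPairs = interleaveToStereo([new Float32Array([0.25, -0.5])]);
const monoAgain = deinterleaveStereo(stereoPairs, 1);
console.log(Array.from(stereoPairs), Array.from(monoAgain[0]));
```

Because the mono signal is copied into both channels unchanged, averaging the two channels on the way out recovers the original samples when no processing happens in between.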

Usage

import { AudioClip, OffscreenSprite } from '@webav/av-cliper';
import { createSpeedAudioClip } from './speed-audio-clip';

// Wrap the original AudioClip, then pass the wrapper to OffscreenSprite
const audioClip = new AudioClip(audioSource);
const speedClip = createSpeedAudioClip(audioClip, 1.5); // 1.5x speed

const sprite = new OffscreenSprite(speedClip);

// Key step: set the sprite's playbackRate to 1 so that OffscreenSprite skips its own speed handling
sprite.time.playbackRate = 1;

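Notes

  • Warm-up latency: SoundTouch has an internal warm-up period of a few frames. During it the buffer does not hold enough data and silent frames are returned; this is expected
  • Seek reset: when the time moves backwards or jumps far ahead, st.clear() must be called to reset the internal state, otherwise the audio becomes garbled
  • Mono compatibility: SoundTouch only processes interleaved stereo, so mono input is copied into both channels and averaged back into one channel on output
  • Speed range: calculateOptimalParams follows SoundTouch's automatic parameter algorithm and works best between 0.5x and 2.0x; outside that range the parameters are clamped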
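
To show what the clamping means in numbers, here is a standalone sketch of the same linear mapping. It reuses the constants from calculateOptimalParams in this issue; autoStretchParams and StretchParams are hypothetical names introduced for illustration.

```typescript
interface StretchParams {
    sequenceMs: number;
    seekWindowMs: number;
    overlapMs: number;
}

// Linear interpolation of sequence and seek-window lengths over the tempo range 0.5x to 2.0x.
function autoStretchParams(tempo: number): StretchParams {
    const TEMPO_LOW = 0.5;
    const TEMPO_TOP = 2.0;
    const SEQ_AT_LOW = 125.0;  // sequence length (ms) at 0.5x
    const SEQ_AT_TOP = 50.0;   // sequence length (ms) at 2.0x
    const SEEK_AT_LOW = 25.0;  // seek window (ms) at 0.5x
    const SEEK_AT_TOP = 15.0;  // seek window (ms) at 2.0x

    const seqK = (SEQ_AT_TOP - SEQ_AT_LOW) / (TEMPO_TOP - TEMPO_LOW);
    const seqC = SEQ_AT_LOW - seqK * TEMPO_LOW;
    const seekK = (SEEK_AT_TOP - SEEK_AT_LOW) / (TEMPO_TOP - TEMPO_LOW);
    const seekC = SEEK_AT_LOW - seekK * TEMPO_LOW;

    // Tempos outside the supported range are clamped before interpolation.
    const clamped = Math.max(TEMPO_LOW, Math.min(TEMPO_TOP, tempo));

    return {
        sequenceMs: Math.max(SEQ_AT_TOP, Math.min(SEQ_AT_LOW, seqC + seqK * clamped)),
        seekWindowMs: Math.max(SEEK_AT_TOP, Math.min(SEEK_AT_LOW, seekC + seekK * clamped)),
        overlapMs: 8,
    };
}

for (const tempo of [0.25, 0.5, 1.5, 2.0, 3.0]) {
    const p = autoStretchParams(tempo);
    console.log(`${tempo}x: sequence ${p.sequenceMs} ms, seek window ${p.seekWindowMs.toFixed(2)} ms`);
}
```

With these constants the sequence length is 125 ms at 0.5x, 75 ms at 1.5x and 50 ms at 2.0x. A tempo of 0.25x yields the same parameters as 0.5x, and 3.0x the same as 2.0x, because the tempo is clamped first.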

Result

Method | Speed change | Pitch change
OffscreenSprite native playbackRate | | ❌ Pitch changes with speed
This solution (SoundTouch wrapper) | | ✅ Pitch stays unchanged
