Question about LLM inference performance #81

@davidray222

Thank you for providing such outstanding research!

I tested the LLaMA 7B model, and after pruning, neither memory usage nor inference speed differs significantly from the original model. Does your work describe, or can you recommend, any methods to accelerate inference for pruned models?

GPU: NVIDIA A6000
torch 2.2.0
transformers 4.31.0
accelerate 0.21.0
