Skip to content

Slow regex performance #853

@stefanspringer1

Description

@stefanspringer1

The regex implementation in Swift is quite slow and this very much affects applications that apply a lot of regex in a very negative way.

There are some other issues open that mention slow performance in some cases, but it might be a bigger problem than those few special cases. So I am adding this issue as a reminder.

I started a Swift forums topic on this matter in November 2024 but not much seems to have changed since than. In this topic, I have links to a benchmark implemented in Java and in Swift (the Swift one is now part of the benchmarks of this package), the factor is more than 10x between them. I think a factor of 2 to 3 could be OK given that the Swift version is more capable and more correct, but with the current slowness it was considered for some of my projects to leave the Swift platform completely just for this matter.

But there seem to be ways for improvements as noted by Michael_Ilseman in the mentioned forums topic:

...the slow perf comes down to some areas we're currently looking at improving:

  • More time is spent on retain/release than actually matching content
    • Upcoming adoption of ~Copyable and ~Escapable
  • More time is spent in copying our COW arrays than in actual matching
    • Upcoming rework using custom data structures
    • Situational: dropping captures that are never read from
  • Lots of time drilling through String's higher-level representation
    • Upcoming adoption of UTF8Span for direct UTF-8 processing

Also, we'd like to add more optimizations, such as more quickly finding runs of literal content. Unlike the improvements listed above, these might only benefit specific regexes (or the way they're written) over specific kinds of input. Thus, there is a risk that a benchmark might see dramatic improvements that don't translate to the workload the benchmark was derived from.

For example, most of your input data are lines that are successfully matched in their entirety. Other input data could include long lines with content that starts off looking like a URL, but ultimately isn't matched. If you're able to share more regexes or more kinds of input, we can make sure that optimizations are applicable to real workloads.

Finally, we've made some significant performance improvements that are (unfortunately) not yet in a toolchain but will be soon. On your benchmark, I'm seeing a 1.56x improvement from those changes for grapheme-cluster semantics and a 1.53x improvement for scalar semantics.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions