Skip to content

Speed up BATS tests#4478

Open
janhoy wants to merge 13 commits into
apache:mainfrom
janhoy:feature/slim-bats-tests
Open

Speed up BATS tests#4478
janhoy wants to merge 13 commits into
apache:mainfrom
janhoy:feature/slim-bats-tests

Conversation

@janhoy
Copy link
Copy Markdown
Contributor

@janhoy janhoy commented May 30, 2026

The BATS test suite was taking longer and longer, now about 40 mins on GH CI. This PR makes targeted, low-risk reductions without removing meaningful coverage.

Total estimated CI time saving: ~10 minutes

Slowest tests are these
image

Changes

DISCLAIMER All improvements made by Claude Code LLM, be aware of rough edges

Gate OpenNLP test behind SOLR_BATS_OPENNLP_TESTS=true (~90 s saved)

test_opennlp.bats downloads ONNX models from the network and its own comment describes it as exploratory rather than a regular CI test. It now skips unless SOLR_BATS_OPENNLP_TESTS=true is set.

test_status.bats — one shared Solr startup (~200 s saved)

Four of six tests independently started and stopped their own Solr instance. Restructured with setup_file/teardown_file so all tests share a single startup (~3 Solr startups saved).

CI timing showed "status with --short format" at 197 s — almost certainly caused by solr stop in teardown hanging when each test managed its own Solr lifecycle. The new structure uses a single teardown_file stop which eliminates this.

test_healthcheck.bats — cheaper collection setup (~15 s saved)

Both tests used solr start -e films, which loads and indexes the full films dataset. Replaced with solr start + solr create -c healthcheck_test -d _default for the cloud test, and plain solr start --user-managed for the standalone test (which fails before any collection is needed).

test_packages.batsPackageToolTest.java (~25 s saved, delete BATS file)

PackageToolTest already exercises the full package lifecycle (list-available, add-repo, install, deploy, undeploy). Added testDeployValidationMessages() to cover the two specific BATS assertions not previously tested in JUnit: collection exists but package does not → correct error message; undeploy of never-deployed package → correct error message.

test_compression.batsGzipCompressionTest.java (~23 s saved, delete BATS file)

The per-test times (154 ms, 72 ms, 77 ms) looked cheap, but the setup_file that ran solr start -e films cost ~22 s before any test ran. Added GzipCompressionTest to solr/core (SolrCloudTestCase) verifying Jetty's GzipHandler via the same Jetty HttpClient pattern used by CacheHeaderTest and SecurityHeadersTest. Sets jetty.gzip.minGzipSize=1 as a system property before cluster start so that any non-empty response body is eligible for compression.

test_zk.bats — remove/replace unnecessary sleep calls (~7 s saved)

Removed six sleep 1 calls that were redundant: solr start already waits until Solr is ready before returning, and solr zk cp is synchronous. Replaced one sleep 1 after solr zk upconfig with wait_for that polls the configsets REST endpoint instead of relying on a fixed delay.

test_start_solr.bats — cap solr stop wait to 30 s (~150 s saved)

CI timing showed test 105 ("deprecated system properties converted to modern properties") taking 196 s. The culprit was solr stop --all in teardown() hanging until the default ~180 s timeout expired. Added SOLR_STOP_WAIT=30 so teardown gives up after 30 s instead of 180 s, saving ~150 s per occurrence.

test_example_noprompt.bats merged into test_example.bats (~78 s saved, delete BATS file)

test_example_noprompt.bats contained a single test (start -e cloud works with --no-prompt) that did a full two-node cloud startup, identical command to the existing test_example.bats test. Merged both into test_example.bats using setup_file/teardown_file so the two-node cloud is started once and both tests reuse it. Deletes test_example_noprompt.bats.

test_ssl.bats — replace fixed sleep 6 with wait_for polling in keystore reload test (~8 s saved)

"test keystore reload" (test 100, 75 s on CI) slept 6 s twice after each keystore file replacement to give Jetty time to detect the change via its file-scan interval. Replaced both with wait_for 30 1 solr healthcheck --solr-url https://localhost:${SOLR_PORT} so the test advances as soon as the reload completes rather than always paying 12 s. The 30 s cap keeps CI safe on slow boxes.

Testing

# JUnit
./gradlew :solr:core:test --tests "*.PackageToolTest"
./gradlew :solr:core:test --tests "*.GzipCompressionTest"

janhoy added 6 commits May 30, 2026 20:31
The test downloads ONNX models from the network and its own comment
describes it as exploratory rather than a regular integration test.
Skip it by default; set OPENNLP_TESTS=true to opt in.
Previously each of the four tests that needed a running Solr started and
stopped their own instance, costing four separate Solr startups.  Switch
to a setup_file/teardown_file pair so all tests share a single startup.
The "No Solr nodes running" assertion is removed from the lifecycle test
since it is already verified by the suite-wide test_zz_cleanup.bats.
Both healthcheck tests previously used 'solr start -e films' which loads
and indexes the full films example dataset.  Replace with a plain
'solr start' plus 'solr create -c healthcheck_test -d _default' for the
cloud test, and a plain 'solr start --user-managed' for the standalone
test (which fails before any collection is needed).
PackageToolTest already tests the full lifecycle (list-available, add-repo,
install, deploy, undeploy).  Add testDeployValidationMessages() to cover
the two remaining BATS assertions:
- collection exists but package not found → "Package instance doesn't exist"
- undeploy of never-deployed package → "Package … not deployed on collection"

Delete test_packages.bats.
Add GzipCompressionTest (SolrCloudTestCase) to solr/core which verifies:
- Requests without Accept-Encoding get no Content-Encoding header
- Requests with Accept-Encoding: gzip get Content-Encoding: gzip

The minGzipSize Jetty property is lowered to 1 byte for the test cluster
so that any non-empty response body is eligible for compression.  Uses
the Jetty HttpClient already available via JettySolrRunner, consistent
with CacheHeaderTest and SecurityHeadersTest.

Delete test_compression.bats.
solr start already waits until Solr is ready before returning, so the
sleep 1 calls at the start of "listing out files" and "connecting to
solr via various solr urls" were unnecessary.

solr zk cp is synchronous; the three sleep 1 calls that followed it
before listing the copied file were also unnecessary.

The sleep 1 before the ZK_HOST env-var test had no purpose.

Replace the sleep 1 after solr zk upconfig with wait_for so we poll
until the configsets REST endpoint actually reflects the new config
instead of relying on a fixed delay.
janhoy added 2 commits May 30, 2026 20:49
Default SOLR_STOP_WAIT is ~180 s; on CI test 105 ("deprecated system
properties") was taking 196 s entirely because teardown waited for the
180 s timeout before giving up.  Set SOLR_STOP_WAIT=30 in teardown so
worst-case the stop waits 30 s, saving ~150 s per affected test.
…startup

Both tests ran 'solr start -e cloud --no-prompt'.  Merge them into a
single file with setup_file/teardown_file so the two-node cloud is
started once and both tests reuse it.  Deletes test_example_noprompt.bats.

Saves ~78 s on CI (one full two-node cloud startup removed).
@janhoy janhoy requested review from dsmiley and epugh May 30, 2026 18:55
janhoy added 2 commits May 30, 2026 21:00
…test

'test keystore reload' (test 100, 75 s on CI) slept 6 s twice to give
Jetty time to pick up the replaced keystore file.  Replace both with
wait_for so the test proceeds as soon as the reload completes rather
than always paying 12 s.  Timeout is 30 s to handle slow CI boxes.
More descriptive name that makes it clear the variable is specific to
the Solr BATS test suite.
@janhoy janhoy force-pushed the feature/slim-bats-tests branch from 9b91336 to 9aadafd Compare May 30, 2026 19:01
@janhoy
Copy link
Copy Markdown
Contributor Author

janhoy commented May 30, 2026

Estimated speedup was 10min, i.e. from 40 to 30 minutes. Reality for last run was

BUILD SUCCESSFUL in 24m 49s

So that's almost 50% improvement!

Now 10 slowest are:
image

Anyone see more bats tests that could be improved?

Copy link
Copy Markdown
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this!!

SOLR_BATS_OPENNLP_TESTS=true

Sounds like we should have a general concept of "nightly" BATS. The Jenkins job for integration tests can also run these since these slower tests aren't yet too much for. The GitHub PR workflow for them shouldn't run them, however. A one-off obscure boolean effectively means this test is dead-weight forever (tests that never run are nothing but a maintenance burden).

test_compression.bats → GzipCompressionTest.java

I wrote this recently... and I'm flabbergasted that this is testable in our Java test infrastructure because our simpler JettySolrRunner doesn't configure production matters like Gzip nor the configuration in jetty XML files. This is deliberate.

*
* <p>Replaces the BATS integration test {@code test_compression.bats}.
*/
public class GzipCompressionTest extends SolrCloudTestCase {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would use need SolrCloud for only a jetty level thing?

Copy link
Copy Markdown
Contributor Author

@janhoy janhoy May 31, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solrcloud is the default now, why not - it's a 1-node "cluster". Is there some overhead with SolrCloudTestCase that we don't have with a simpler JettySolrRunner?

I also tried to refactor to JettySolrRunner but it has no GZip handler as the cloud setup does.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point is the principle of using the leanest test infrastructure to accomplish the testing goal. Default doesn't matter.

Where is the gzip support in the SolrCloudTestCase/MiniSolrCloudCluster you speak of?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd have to dive into the test setup to answer. I thought we used same JettySolrRunner as in standalone, and that Jetty is configured in java instead of xml, but still not 100% identical. I have not verified that the new Gzip test actually tests what it claims to. Will have to do a manual dive to verify. I had my hopes up when Claude selected it as a low hanging fruit.

Comment thread solr/core/src/test/org/apache/solr/servlet/GzipCompressionTest.java Outdated
@epugh
Copy link
Copy Markdown
Contributor

epugh commented May 31, 2026

An interesting things about the AI generated code comments is that I really like them to help me understand "this is the migraiton, this is what was before, this is how we do it now", but that the life span of that comment is that of the PR, once the PR is merged, I wish the comment disappeared. I think that some of the "self review" comments people put into PR's accomplish that same goal.

I also expect at some point that in a git diff you will someday see files that are added/changed/removed, and then a set of files that are marked to be deleted WHEN the PR is merged. I know I often have a notes.md file that accidentally gets committed that is my notes to myself when working on the code. It would be nice if we could label files as only lasting as long as the PR is open!

* Verifies that Solr's Jetty GzipHandler correctly compresses HTTP responses when the client sends
* {@code Accept-Encoding: gzip} and omits compression when that header is absent.
*/
public class GzipCompressionTest extends SolrCloudTestCase {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I miss the succinteness of the .bats test and how easy it is to read (to my eyes at least!), but totally understand the overhead.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we keep adding BATS tests in the pace we have been doing lately, we'll surpass 60m run time very soon. So we should probably reserve BATS for testing our bin/solr shell script, plus mechanisms that won't kick in in unit tests. The java-agent is an example of such. And "realistic" jetty setup is probably another.

But I'd hope that as we succeed in slimming down bin/solr even more and move things into java-land, we should be able to migrate more BATS tests to JUnit since the logic is no longer done in bash.

We have already delegated lots of SolrCLI arg parsing to java-land. There is still some glue in script that construct the java cmdline, which do need to be tested, but the cmdline arg/opts parsing for each tool should not need full bats coverage if we know we have separate junit coverage of such parsing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants