Speed up BATS tests #4478
The test downloads ONNX models from the network and its own comment describes it as exploratory rather than a regular integration test. Skip it by default; set OPENNLP_TESTS=true to opt in.
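A minimal sketch of how such an opt-in gate might look, using the variable name from this commit message (the PR description refers to it as SOLR_BATS_OPENNLP_TESTS). The helper name opennlp_tests_enabled is hypothetical and is not taken from the PR; skip is the standard BATS command for skipping a test.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only when the opt-in variable is exactly "true".
opennlp_tests_enabled() {
  [ "${OPENNLP_TESTS:-false}" = "true" ]
}

# In a .bats file the gate would typically sit in setup(), for example:
#
#   setup() {
#     opennlp_tests_enabled || skip "set OPENNLP_TESTS=true to run OpenNLP tests"
#   }
```

Because the default is "false", an unset variable skips the test, which matches the "skip by default, opt in explicitly" behaviour described above.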
Previously each of the four tests that needed a running Solr started and stopped their own instance, costing four separate Solr startups. Switch to a setup_file/teardown_file pair so all tests share a single startup. The "No Solr nodes running" assertion is removed from the lifecycle test since it is already verified by the suite-wide test_zz_cleanup.bats.
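The shared-startup pattern can be sketched roughly as follows. This is an illustration of BATS file-level fixtures, not the actual contents of the changed test file; the commands shown are assumed to be on the PATH the way the suite arranges it.

```shell
#!/usr/bin/env bash
# BATS calls setup_file once before the first test in a .bats file
# and teardown_file once after the last one.

setup_file() {
  # One Solr startup shared by every test in the file.
  solr start
}

teardown_file() {
  # One shutdown after all tests in the file have finished.
  solr stop --all
}

# Individual @test blocks then assume Solr is already running and
# no longer start or stop it themselves.
```

The trade-off is that tests in the file now share server state, so each test should avoid leaving behind changes that a later test depends on not being there.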
Both healthcheck tests previously used 'solr start -e films' which loads and indexes the full films example dataset. Replace with a plain 'solr start' plus 'solr create -c healthcheck_test -d _default' for the cloud test, and a plain 'solr start --user-managed' for the standalone test (which fails before any collection is needed).
PackageToolTest already tests the full lifecycle (list-available, add-repo, install, deploy, undeploy). Add testDeployValidationMessages() to cover the two remaining BATS assertions:
- collection exists but package not found → "Package instance doesn't exist"
- undeploy of never-deployed package → "Package … not deployed on collection"
Delete test_packages.bats.
Add GzipCompressionTest (SolrCloudTestCase) to solr/core which verifies:
- Requests without Accept-Encoding get no Content-Encoding header
- Requests with Accept-Encoding: gzip get Content-Encoding: gzip
The minGzipSize Jetty property is lowered to 1 byte for the test cluster so that any non-empty response body is eligible for compression. Uses the Jetty HttpClient already available via JettySolrRunner, consistent with CacheHeaderTest and SecurityHeadersTest. Delete test_compression.bats.
solr start already waits until Solr is ready before returning, so the sleep 1 calls at the start of "listing out files" and "connecting to solr via various solr urls" were unnecessary. solr zk cp is synchronous; the three sleep 1 calls that followed it before listing the copied file were also unnecessary. The sleep 1 before the ZK_HOST env-var test had no purpose. Replace the sleep 1 after solr zk upconfig with wait_for so we poll until the configsets REST endpoint actually reflects the new config instead of relying on a fixed delay.
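To illustrate the difference between a fixed sleep and polling, here is a minimal sketch of a polling helper. The name poll_until and its body are hypothetical; this is not the suite's actual wait_for implementation, only a function with the same argument shape (timeout, interval, command) as the wait_for call quoted in the PR description.

```shell
#!/usr/bin/env bash
# Hypothetical helper, for illustration only.
# Usage: poll_until <timeout_seconds> <interval_seconds> <command> [args...]
# Both durations are assumed to be whole seconds, with interval >= 1.
poll_until() {
  local timeout="$1" interval="$2" elapsed=0
  shift 2
  # Re-run the command until it succeeds or the time budget is used up.
  while ! "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1   # condition never became true within the budget
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 0       # condition met; the caller proceeds immediately
}
```

With a helper of this shape the test pays only as long as the condition actually takes to become true, and the timeout is a worst-case bound rather than a fixed cost.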
Default SOLR_STOP_WAIT is ~180 s; on CI test 105 ("deprecated system
properties") was taking 196 s entirely because teardown waited for the
180 s timeout before giving up. Set SOLR_STOP_WAIT=30 in teardown so
worst-case the stop waits 30 s, saving ~150 s per affected test.
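One way the cap could be applied is sketched below. Whether the PR sets the variable as a per-command prefix or exports it is an assumption here; only the variable name, the value, and the solr stop --all command come from the PR text.

```shell
#!/usr/bin/env bash
teardown() {
  # Assumed form: pass the shorter wait to this one invocation,
  # so a hung stop gives up after 30 s instead of the default.
  SOLR_STOP_WAIT=30 solr stop --all
}
```

The saving follows from the numbers above: a teardown that previously blocked for the full ~180 s now blocks for at most 30 s, a difference of ~150 s.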
…startup Both tests ran 'solr start -e cloud --no-prompt'. Merge them into a single file with setup_file/teardown_file so the two-node cloud is started once and both tests reuse it. Deletes test_example_noprompt.bats. Saves ~78 s on CI (one full two-node cloud startup removed).
…test 'test keystore reload' (test 100, 75 s on CI) slept 6 s twice to give Jetty time to pick up the replaced keystore file. Replace both with wait_for so the test proceeds as soon as the reload completes rather than always paying 12 s. Timeout is 30 s to handle slow CI boxes.
More descriptive name that makes it clear the variable is specific to the Solr BATS test suite.
dsmiley
left a comment
Thanks for doing this!!
SOLR_BATS_OPENNLP_TESTS=true
Sounds like we should have a general concept of "nightly" BATS. The Jenkins job for integration tests can also run these, since these slower tests aren't yet too much for it. The GitHub PR workflow shouldn't run them, however. A one-off obscure boolean effectively means this test is dead weight forever (tests that never run are nothing but a maintenance burden).
test_compression.bats → GzipCompressionTest.java
I wrote this recently... and I'm flabbergasted that this is testable in our Java test infrastructure because our simpler JettySolrRunner doesn't configure production matters like Gzip nor the configuration in jetty XML files. This is deliberate.
 *
 * <p>Replaces the BATS integration test {@code test_compression.bats}.
 */
public class GzipCompressionTest extends SolrCloudTestCase {
Why would you need SolrCloud for only a Jetty-level thing?
SolrCloud is the default now, so why not? It's a 1-node "cluster". Is there some overhead with SolrCloudTestCase that we don't have with a simpler JettySolrRunner?
I also tried to refactor to JettySolrRunner, but it has no Gzip handler as the cloud setup does.
My point is the principle of using the leanest test infrastructure to accomplish the testing goal. Default doesn't matter.
Where is the gzip support in the SolrCloudTestCase/MiniSolrCloudCluster you speak of?
I'd have to dive into the test setup to answer. I thought we used the same JettySolrRunner as in standalone, and that Jetty is configured in Java instead of XML, but it is still not 100% identical. I have not verified that the new Gzip test actually tests what it claims to. Will have to do a manual dive to verify. I had my hopes up when Claude selected it as low-hanging fruit.
An interesting thing about the AI-generated code comments is that I really like how they help me understand "this is the migration, this is what was before, this is how we do it now", but the life span of such a comment is that of the PR: once the PR is merged, I wish the comment disappeared. I think that some of the "self review" comments people put into PRs accomplish that same goal. I also expect at some point that in a
 * Verifies that Solr's Jetty GzipHandler correctly compresses HTTP responses when the client sends
 * {@code Accept-Encoding: gzip} and omits compression when that header is absent.
 */
public class GzipCompressionTest extends SolrCloudTestCase {
I miss the succinctness of the .bats test and how easy it is to read (to my eyes at least!), but totally understand the overhead.
If we keep adding BATS tests at the pace we have lately, we'll surpass a 60 min run time very soon. So we should probably reserve BATS for testing our bin/solr shell script, plus mechanisms that won't kick in during unit tests. The java-agent is an example of such. And a "realistic" Jetty setup is probably another.
But I'd hope that as we succeed in slimming down bin/solr even more and move things into Java-land, we should be able to migrate more BATS tests to JUnit, since the logic is no longer done in bash.
We have already delegated lots of SolrCLI arg parsing to Java-land. There is still some glue in the script that constructs the java cmdline, which does need to be tested, but the cmdline arg/opts parsing for each tool should not need full BATS coverage if we know we have separate JUnit coverage of such parsing.

The BATS test suite was taking longer and longer, now about 40 mins on GH CI. This PR makes targeted, low-risk reductions without removing meaningful coverage.
Total estimated CI time saving: ~10 minutes
Slowest tests are these

Changes
DISCLAIMER All improvements made by Claude Code LLM, be aware of rough edges
Gate OpenNLP test behind SOLR_BATS_OPENNLP_TESTS=true (~90 s saved)
test_opennlp.bats downloads ONNX models from the network and its own comment describes it as exploratory rather than a regular CI test. It now skips unless SOLR_BATS_OPENNLP_TESTS=true is set.

test_status.bats: one shared Solr startup (~200 s saved)
Four of six tests independently started and stopped their own Solr instance. Restructured with setup_file/teardown_file so all tests share a single startup (~3 Solr startups saved).
CI timing showed "status with --short format" at 197 s, almost certainly caused by solr stop in teardown hanging when each test managed its own Solr lifecycle. The new structure uses a single teardown_file stop, which eliminates this.

test_healthcheck.bats: cheaper collection setup (~15 s saved)
Both tests used solr start -e films, which loads and indexes the full films dataset. Replaced with solr start + solr create -c healthcheck_test -d _default for the cloud test, and plain solr start --user-managed for the standalone test (which fails before any collection is needed).

test_packages.bats → PackageToolTest.java (~25 s saved, delete BATS file)
PackageToolTest already exercises the full package lifecycle (list-available, add-repo, install, deploy, undeploy). Added testDeployValidationMessages() to cover the two specific BATS assertions not previously tested in JUnit: collection exists but package does not → correct error message; undeploy of never-deployed package → correct error message.

test_compression.bats → GzipCompressionTest.java (~23 s saved, delete BATS file)
The per-test times (154 ms, 72 ms, 77 ms) looked cheap, but the setup_file that ran solr start -e films cost ~22 s before any test ran. Added GzipCompressionTest to solr/core (SolrCloudTestCase) verifying Jetty's GzipHandler via the same Jetty HttpClient pattern used by CacheHeaderTest and SecurityHeadersTest. Sets jetty.gzip.minGzipSize=1 as a system property before cluster start so that any non-empty response body is eligible for compression.

test_zk.bats: remove/replace unnecessary sleep calls (~7 s saved)
Removed six sleep 1 calls that were redundant: solr start already waits until Solr is ready before returning, and solr zk cp is synchronous. Replaced one sleep 1 after solr zk upconfig with wait_for that polls the configsets REST endpoint instead of relying on a fixed delay.

test_start_solr.bats: cap solr stop wait to 30 s (~150 s saved)
CI timing showed test 105 ("deprecated system properties converted to modern properties") taking 196 s. The culprit was solr stop --all in teardown() hanging until the default ~180 s timeout expired. Added SOLR_STOP_WAIT=30 so teardown gives up after 30 s instead of 180 s, saving ~150 s per occurrence.

test_example_noprompt.bats merged into test_example.bats (~78 s saved, delete BATS file)
test_example_noprompt.bats contained a single test (start -e cloud works with --no-prompt) that did a full two-node cloud startup, identical command to the existing test_example.bats test. Merged both into test_example.bats using setup_file/teardown_file so the two-node cloud is started once and both tests reuse it. Deletes test_example_noprompt.bats.

test_ssl.bats: replace fixed sleep 6 with wait_for polling in keystore reload test (~8 s saved)
"test keystore reload" (test 100, 75 s on CI) slept 6 s twice after each keystore file replacement to give Jetty time to detect the change via its file-scan interval. Replaced both with wait_for 30 1 solr healthcheck --solr-url https://localhost:${SOLR_PORT} so the test advances as soon as the reload completes rather than always paying 12 s. The 30 s cap keeps CI safe on slow boxes.

Testing