
Migrate parameter estimation to Quarkus REST with database-backed job tracking #1659

Open
jcschaff wants to merge 39 commits into master from parest-bug

jcschaff commented Apr 8, 2026

Summary

Migrates parameter estimation (optimization) from the legacy vcell-api to Quarkus vcell-rest with database-backed job tracking, replacing the fragile in-memory TCP socket protocol. Fixes #1653.

Architecture

  • Database-backed job tracking: vc_optjob table with bigint keys, status lifecycle (SUBMITTED, QUEUED, RUNNING, COMPLETE, FAILED, STOPPED)
  • REST endpoints: POST/GET /api/v1/optimization, GET /{id}, POST /{id}/stop with typed OptimizationJobStatus response
  • Cross-protocol Artemis messaging: vcell-rest (SmallRye AMQP 1.0) and vcell-submit (OpenWire JMS) through shared Artemis broker
  • Filesystem polling: Progress and results read from NFS, no persistent socket connection needed
  • Python solver unchanged: vcell-opt/COPASI/SLURM pipeline is untouched
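
The status lifecycle and polling model listed above can be illustrated with a small client-side loop. This is a minimal sketch, not the code in CopasiOptimizationSolverRemote: the class name `OptPollingSketch`, the `isTerminal()` helper, and the `pollUntilTerminal` signature are hypothetical, and treating COMPLETE, FAILED, and STOPPED as the terminal states is an assumption drawn from the lifecycle names.

```java
import java.time.Duration;
import java.util.function.Supplier;

final class OptPollingSketch {
    // Status names taken from the PR description; isTerminal() is an assumption.
    enum OptJobStatus {
        SUBMITTED, QUEUED, RUNNING, COMPLETE, FAILED, STOPPED;

        boolean isTerminal() {
            return this == COMPLETE || this == FAILED || this == STOPPED;
        }
    }

    // Polls the supplied status source until a terminal status is seen or the
    // timeout elapses. The supplier stands in for a call to the status endpoint.
    static OptJobStatus pollUntilTerminal(Supplier<OptJobStatus> fetchStatus,
                                          Duration timeout,
                                          Duration interval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            OptJobStatus status = fetchStatus.get();
            if (status.isTerminal()) {
                return status;
            }
            if (System.nanoTime() - deadline >= 0) {
                throw new IllegalStateException("timed out, last status " + status);
            }
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                throw new IllegalStateException("interrupted while polling", e);
            }
        }
    }
}
```

A caller would pass something like `Duration.ofMinutes(10)` as the timeout to mirror the client timeout mentioned under Changes. Because each poll is an independent request, no connection has to stay open between polls.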

Changes

  • vcell-rest: OptimizationResource, OptimizationRestService, OptimizationMQ, DB schema
  • vcell-server: JMS queue listener in OptimizationBatchServer, removed legacy socket server
  • vcell-core: Shared message types, removed OptMessage.java
  • vcell-client: CopasiOptimizationSolverRemote rewritten for generated API client, 10-minute timeout
  • vcell-apiclient: Added getOptimizationApi(), removed legacy submitOptimization()/getOptRunJson()/VCellOptClient
  • vcell-api: Removed /api/v0/optimization endpoints, route registration, vcell.submit.service.host property
  • pythonCopasiOpt: Upgraded basico 0.40 to 0.86, python-copasi 4.37 to 4.46, Dockerfile bullseye to bookworm, fixed COPASI report buffering (confirm_overwrite=False)
  • OpenAPI: Regenerated Java/Python/TypeScript clients
  • Docker/K8s: Removed port 8877 from docker-compose and submit.yaml, removed submit_service_host config

Deployment notes

  • Legacy /api/v0/optimization endpoints have been removed. Old desktop clients will not be able to run parameter estimation until they update.
  • vcell-fluxcd changes (port 8877, submit_service_host removal) should be deployed alongside this release
  • The vcell-opt singularity image must be rebuilt on the SLURM cluster for the new COPASI version

Design document

See docs/parameter-estimation-service.md for architecture, configuration, and maintenance reference.

Test plan

  • mvn compile test-compile builds successfully
  • SlurmProxyTest passes (12 tests)
  • Python vcell-opt tests pass locally (4 tests including incremental progress verification)
  • Docker vcell-opt image builds and runs
  • Dev deployment: submit, AMQP, SLURM, results verified via curl
  • Desktop client: submit, poll, COMPLETE with results and progress graph
  • Full end-to-end with upgraded COPASI image (7.7.0.66) on dev
  • Verify real-time progress updates during RUNNING state

jcschaff and others added 14 commits April 8, 2026 11:07
Design document for migrating optimization endpoints from legacy
vcell-api (/api/v0/) to Quarkus vcell-rest (/api/v1/) with
database-backed job tracking, ActiveMQ messaging, and filesystem
polling. Includes desktop client migration and decommissioning plan.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add vc_optjob table to init.sql for database-backed job tracking
- Add OptJobStatus enum (SUBMITTED, QUEUED, RUNNING, COMPLETE, FAILED, STOPPED)
- Add OptimizationJobStatus response DTO with progress and results fields
- Add OptimizationRestService with submit, status polling, and update methods
  - Writes OptProblem to NFS filesystem
  - Reads progress/results from filesystem via CopasiUtils
  - JDBC operations for vc_optjob table

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- POST /api/v1/optimization — submit optimization job
- GET /api/v1/optimization/{optId} — get status, progress, or results
- POST /api/v1/optimization/{optId}/stop — stop a running job

All endpoints require an authenticated user. The status endpoint returns
OptimizationJobStatus with typed fields for progress and results.
ActiveMQ dispatch to vcell-submit is marked TODO for commit 3.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use VCell convention: bigint primary key from newSeq database sequence,
consistent with all other VCell tables. Uses KeyValue type and
KeyFactory.getNewKey() infrastructure throughout.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Returns lightweight job metadata (id, status, htcJobId, statusMessage)
without the heavy progressReport/results fields. Most recent first.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add OptimizationMQ with producer (opt-request) and consumer (opt-status)
- Producer sends submit/stop commands to vcell-submit via AMQP
- Consumer receives status updates (QUEUED/htcJobId, FAILED/error) and
  updates the database accordingly
- Wire messaging into OptimizationResource submit and stop endpoints
- Add AMQP channel configuration to application.properties (test profile)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
OptimizationBatchServer.initOptimizationQueue() creates a JMS consumer
on the "opt-request" queue (activemqint broker). Cross-protocol with
vcell-rest's SmallRye AMQP producer — ActiveMQ bridges AMQP 1.0 and
OpenWire transparently on the same queue name.

On "submit": reads OptProblem from NFS, submits SLURM job via
SlurmProxy, sends QUEUED status back on "opt-status" with htcJobId.
On "stop": parses htcJobId and calls killJobSafe() for scancel.

Message format: plain JSON text matching OptimizationMQ records in
vcell-rest. Uses mutable POJOs (not records) for Jackson compatibility
with the vcell-server Java 17 codebase.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move OptJobStatus, OptRequestMessage, and OptStatusMessage from
duplicate definitions in vcell-rest and vcell-server into
org.vcell.optimization in vcell-core. Both modules now share a
single source of truth for the cross-protocol messaging contract.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests cover: submit and verify status, list jobs, progress report
from mock report file, auto-transition to COMPLETE on output file,
stop running job, unauthorized access (different user), and
unauthenticated access (401).

Uses testcontainers for PostgreSQL and Keycloak (same infrastructure
as existing Quarkus tests).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…endpoints

- Add tools/openapi-clients.sh: single script for spec generation and
  client generation, with --update-spec flag to optionally rebuild
  vcell-rest first. Replaces tools/generate.sh and
  tools/compile-and-build-clients.sh.
- Regenerate OpenAPI spec with new optimization endpoints:
  GET/POST /api/v1/optimization, GET /api/v1/optimization/{optId},
  POST /api/v1/optimization/{optId}/stop
- Regenerate Java (vcell-restclient), Python (python-restclient),
  and TypeScript-Angular (webapp-ng) clients
- Update CLAUDE.md with new script name and usage

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add 3 tests using the auto-generated OptimizationResourceApi from
vcell-restclient. These exercise the same client library the desktop
client will use, validating serialization round-trips:
- Submit and get status via generated client
- List jobs via generated client
- Stop a running job via generated client

Also serves as usage documentation for the generated API.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rewrite CopasiOptimizationSolverRemote.solveRemoteApi() to use the
auto-generated OptimizationResourceApi (vcell-restclient) instead of
the legacy VCellApiClient.submitOptimization/getOptRunJson methods.

Key improvements:
- Typed OptimizationJobStatus response with explicit status enum,
  progressReport, and results fields
- Replaces error-prone string-prefix parsing ("QUEUED:", "RUNNING:")
- Separate stop endpoint (POST /{id}/stop) replaces bStop query param
- Clean switch-based status handling

Add getOptimizationApi() accessor to VCellApiClient.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
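
The move from string-prefix parsing to a typed status enum described in the commit above could look roughly like the following. This is an illustrative sketch only: `StatusHandlingSketch`, `ClientAction`, and `nextAction` are hypothetical names, and the mapping from status to action is an assumption rather than the desktop client's actual behavior.

```java
final class StatusHandlingSketch {
    // Status names taken from the PR description.
    enum OptJobStatus { SUBMITTED, QUEUED, RUNNING, COMPLETE, FAILED, STOPPED }

    // Hypothetical client-side actions, introduced for illustration.
    enum ClientAction { CONTINUE_POLLING, READ_RESULTS, REPORT_FAILURE, REPORT_STOPPED }

    // A switch over the enum replaces checks such as text.startsWith("QUEUED:").
    static ClientAction nextAction(OptJobStatus status) {
        switch (status) {
            case SUBMITTED:
            case QUEUED:
            case RUNNING:
                return ClientAction.CONTINUE_POLLING;
            case COMPLETE:
                return ClientAction.READ_RESULTS;
            case FAILED:
                return ClientAction.REPORT_FAILURE;
            case STOPPED:
                return ClientAction.REPORT_STOPPED;
            default:
                // Reached only if a new status is added without updating this switch.
                throw new IllegalStateException("unhandled status: " + status);
        }
    }
}
```

With an enum, a misspelled or unexpected status fails at deserialization or in the `default` branch instead of silently falling through a chain of string comparisons.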
The optimization messaging uses Artemis (shared with vcell-rest's
SmallRye AMQP), not activemqint. Add PropertyLoader properties for
Artemis host/port (vcell.jms.artemis.host.internal,
vcell.jms.artemis.port.internal) and use them in
HtcSimulationWorker.init() for the optimization queue listener.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The optimization endpoint had a hardcoded /simdata/parest_data path
which doesn't exist in CI. Make it configurable via
vcell.optimization.parest-data-dir property, defaulting to
/simdata/parest_data in production. Test profile uses java.io.tmpdir.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jcschaff and others added 2 commits April 8, 2026 15:02
Validate that file paths from JMS messages are under the expected
parest_data directory using canonical path comparison. Also validate
that jobId is numeric to prevent injection in file names constructed
from it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
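
The numeric jobId check mentioned in the commit above might be sketched as follows. The class `JobIdValidator`, the 18-digit limit, and the file name pattern are all assumptions made for illustration; the actual validation in OptimizationBatchServer may differ.

```java
import java.util.regex.Pattern;

final class JobIdValidator {
    // Digits only; at most 18 digits so the value always fits in a long.
    // The length limit is an assumption, not taken from the PR.
    private static final Pattern DIGITS = Pattern.compile("[0-9]{1,18}");

    // Returns the parsed id, or throws if the text contains anything but digits.
    static long requireNumericJobId(String jobId) {
        if (jobId == null || !DIGITS.matcher(jobId).matches()) {
            throw new IllegalArgumentException("jobId must be numeric");
        }
        return Long.parseLong(jobId);
    }

    // Hypothetical file name built from a validated id. Because the id is
    // known to be digits only, it cannot contain "/" or ".." segments.
    static String problemFileName(String jobId) {
        return "optProblem_" + requireNumericJobId(jobId) + ".json";
    }
}
```

Validating the id before it reaches any file name means a value such as `../42` is rejected outright rather than sanitized.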
Refactor CopasiOptimizationSolverRemote to extract a testable overload
that accepts OptimizationResourceApi directly and a pluggable progress
dispatcher (SwingUtilities::invokeLater in GUI, Runnable::run in tests).

Add OptimizationE2ETest that exercises the same client code path as the
desktop client against a live Quarkus instance with testcontainers:
- testOptimizationE2E_submitPollComplete: submit, mock vcell-submit
  processing (QUEUED → RUNNING with progress → COMPLETE with results),
  poll and verify results match
- testOptimizationE2E_submitAndStop: submit, transition to RUNNING with
  progress, stop, verify progress survives stop

The mock vcell-submit consumer runs in-process, updating DB status and
writing result files to the filesystem — same contract as the real
vcell-submit service.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jcschaff and others added 5 commits April 8, 2026 21:38
Document the three-tier database architecture, Table class hierarchy,
CRUD operation patterns, connection management, access control, and
schema management utilities (AdminCli db-create-script, db-compare-schema).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Create OptJobTable (extends Table) with field declarations, SQL generation
methods, and ResultSet mapping following VCell's established patterns.
Register in SQLCreateAllTables.getVCellTables() so the table participates
in db-create-script and db-compare-schema tooling. Refactor
OptimizationRestService to use OptJobTable instead of inline SQL strings.
Regenerate init.sql DDL from db-create-script. Update database design
patterns doc with corrected SQLDataType table and init.sql structure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Map existing K8s configmap/secret env vars (jmshost_artemis_internal,
jmsport_artemis_internal, AMQP_USER, AMQP_PASSWORD) to SmallRye AMQP
connection properties so the REST pod can connect to Artemis for
optimization job messaging.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace Executors.newFixedThreadPool with Quarkus ManagedExecutor in
ExportRequestListenerMQ so async export jobs run on threads with CDI
context, fixing PropertyLoader access failures. Apply same fix in
ExportServerTest. Scope AMQP connection properties to %prod profile
so DevServices handles test AMQP configuration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Split the single sequential build job into parallel matrix jobs:
- maven-build: compiles Java once, uploads JARs as artifacts
- docker-build: 17 parallel jobs, one per image (api, rest, exporter,
  5 webapp variants, db, sched, submit, data, mongo, batch, opt,
  clientgen, admin)
- tag-and-push: tags all images with friendly version and latest

Installer secrets for clientgen are fetched directly via SSH in that
matrix job rather than uploaded as artifacts (security: artifacts are
downloadable on public repos).

Previously, all 13+ Docker builds ran sequentially in one job, taking
6+ hours. With matrix parallelization, total wall time should be
limited by the slowest single image build.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jcschaff force-pushed the parest-bug branch 2 times, most recently from c40fbd4 to 1e9c088 on April 9, 2026 15:09
- Upload all **/target/ directories and localsolvers/ as artifacts
  so all Docker matrix jobs get the complete Maven build output
  including transitive dependencies
- rest/exporter do full mvn install dependency:copy-dependencies
  with their respective -Dvcell.exporter flag
- localsolvers/ contains solver binaries downloaded by Maven profiles

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jcschaff and others added 3 commits April 9, 2026 17:25
Add production AMQP channel config for opt-request/opt-status queues —
without these, SmallRye was sending to the channel name instead of the
queue address, so messages never reached vcell-submit. Fix Statement and
ResultSet leaks in getOptJobRecord() and listOptimizationJobs() by
wrapping in try-with-resources.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
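
The leak fix in the commit above relies on try-with-resources closing every declared resource, in reverse order, even when the body throws. The sketch below demonstrates that behavior with a stand-in `Tracked` resource instead of real JDBC objects, so it runs without a database; `TryWithResourcesSketch`, `Tracked`, and `runQuery` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class TryWithResourcesSketch {
    // Stand-in for a JDBC Statement or ResultSet; records when it is closed.
    static final class Tracked implements AutoCloseable {
        private final String name;
        private final List<String> log;

        Tracked(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        @Override
        public void close() {
            log.add("close " + name);
        }
    }

    // Opens two resources and optionally fails while "reading" from them.
    // Returns the ordered log of what happened.
    static List<String> runQuery(boolean failWhileReading) {
        List<String> log = new ArrayList<>();
        try (Tracked statement = new Tracked("statement", log);
             Tracked resultSet = new Tracked("resultSet", log)) {
            if (failWhileReading) {
                throw new IllegalStateException("simulated read failure");
            }
            log.add("read row");
        } catch (IllegalStateException e) {
            // Both resources are already closed by the time this runs.
            log.add("caught " + e.getMessage());
        }
        return log;
    }
}
```

In the failing case the log shows both resources closed before the exception is handled, which is the guarantee that manual `close()` calls at the end of a method do not provide.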
Pass vcell.jms.artemis.host.internal and vcell.jms.artemis.port.internal
as Java system properties so the optimization queue listener can connect
to the Artemis broker.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…routing

Add capabilities=queue to both production and test AMQP channel configs
so SmallRye attaches as ANYCAST consumer/producer. Without this, Artemis
creates MULTICAST subscriptions that miss messages from OpenWire JMS
producers on the same queue.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jcschaff and others added 2 commits April 9, 2026 23:21
Test the full round-trip through Artemis: vcell-rest publishes via AMQP
1.0, an OpenWire JMS stub (mimicking vcell-submit) consumes and sends
status back, vcell-rest consumes the response. This catches address
mapping and ANYCAST/MULTICAST routing bugs that the existing E2E test
misses by bypassing messaging.

- ArtemisTestResource: testcontainer with both AMQP and OpenWire ports
- OpenWireOptSubmitStub: mirrors OptimizationBatchServer.handleSubmitRequest()
- OptimizationCrossProtocolTest: submit via REST, poll until COMPLETE

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Stale import of org.apache.activemq.ActiveMQConnectionFactory fails
compile when activemq-client is only in test scope.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jcschaff and others added 10 commits April 10, 2026 07:05
Server now reads the progress report file for SUBMITTED/QUEUED states
(not just RUNNING), and auto-promotes to RUNNING when progress appears
on disk. Client now dispatches progress to the UI for all active states,
so the objective function graph and best parameter values update as soon
as the SLURM solver starts writing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
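
The filesystem-driven promotion described in the commit above can be sketched as a pure function of the stored status and two files. This is an assumption-laden illustration: `StatusPromotionSketch` and `effectiveStatus` are hypothetical names, and the rules used here (a non-empty report file implies RUNNING, an existing output file implies COMPLETE) are a simplified reading of the commit messages, not the logic in OptimizationRestService.

```java
import java.nio.file.Files;
import java.nio.file.Path;

final class StatusPromotionSketch {
    // Status names taken from the PR description.
    enum OptJobStatus { SUBMITTED, QUEUED, RUNNING, COMPLETE, FAILED, STOPPED }

    // Derives the status to report from the stored status plus what the
    // solver has written to disk so far.
    static OptJobStatus effectiveStatus(OptJobStatus stored, Path reportFile, Path outputFile) {
        boolean active = stored == OptJobStatus.SUBMITTED
                || stored == OptJobStatus.QUEUED
                || stored == OptJobStatus.RUNNING;
        if (!active) {
            return stored; // terminal states are never changed by file checks
        }
        if (Files.isRegularFile(outputFile)) {
            return OptJobStatus.COMPLETE; // results are on disk
        }
        // File.length() returns 0 for a missing file, so no exception handling is needed.
        if (reportFile.toFile().length() > 0) {
            return OptJobStatus.RUNNING; // solver has started writing progress
        }
        return stored;
    }
}
```

Checking the files for SUBMITTED and QUEUED as well as RUNNING is what lets progress appear in the client even if the QUEUED status message from vcell-submit arrives late.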
Rewrite architecture diagram to show Artemis cross-protocol flow and
filesystem-driven status promotion. Add sections on cross-protocol
messaging pitfalls, real-time progress reporting, and message types.
Replace implementation plan with completed work and remaining items.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Set confirm_overwrite=False in basico.assign_report() so COPASI flushes
  progress lines incrementally during execution. The default (True) caused
  COPASI to buffer the entire report until task completion, preventing
  real-time progress updates in the client.
- Remove redundant mkdir on external NFS path in SlurmProxy.createOptJobScript()
  — vcell-rest already creates the parest_data directory, and the external path
  is not accessible from inside the vcell-submit container.
- Add test_incremental_report_writing using multiprocessing to verify COPASI
  writes progress to the report file during execution (not just at the end).
- Add debug/info logging to CopasiOptimizationSolverRemote polling loop.
- Add .gitignore entries for vcell-opt .venv and test artifacts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Upgrade copasi-basico 0.40 → 0.86, python-copasi 4.37.264 → 4.46.300
- Set minimum Python to 3.10 (matches Dockerfile and COPASI wheel availability)
- Fix report format: use separator='\t' parameter instead of inline '\\\t'
  body items, which new basico writes as literal backslashes
- Upgrade Dockerfile base from bullseye (EOL) to bookworm (Debian 12)
- Add gcc and python3-dev to Dockerfile for psutil compilation
- Fix deprecated poetry.dev-dependencies → poetry.group.dev.dependencies
- Regenerate poetry.lock

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…points

Delete the legacy parameter estimation code path that used direct TCP socket
connections (port 8877) between vcell-api and vcell-submit. This is replaced
by the new Quarkus /api/v1/optimization endpoints with database-backed job
tracking and AMQP messaging via Artemis.

Removed:
- OptimizationRunServerResource.java, OptimizationRunResource.java (vcell-api)
- Optimization route registration in VCellApiApplication.java
- OptMessage.java socket protocol classes (vcell-core)
- Socket server (initOptimizationSocket, OptCommunicationThread) from OptimizationBatchServer
- Legacy submitOptProblem (random IDs), optServerStopJob, optServerGetJobStatus
- submitOptimization(), getOptRunJson() from VCellApiClient
- VCellOptClient.java (unused standalone client)
- Port 8877 from docker-compose.yml

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This property was only used by the deleted OptimizationRunServerResource
to find the submit service for socket connections on port 8877.

The corresponding submit_service_host config in vcell-fluxcd api.env files
and port 8877 in submit.yaml should also be removed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove migration-specific content (commit history, decommissioning plan,
implementation status tracking). Legacy code has been removed — the doc
now describes the current architecture for ongoing maintenance.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- SlurmProxy.submitOptimizationJob: validate sub_file_internal is under
  htcLogDir and use canonical path for writeString
- OpenWireOptSubmitStub: add validatePath() and use canonical paths for
  all file operations (matches real OptimizationBatchServer.validateParestPath)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jcschaff and others added 2 commits April 15, 2026 07:22
A string prefix check, getCanonicalPath().startsWith(String), does not require
the prefix to end at a path separator, so /data/parest_data would incorrectly
match /data/parest_data_evil. Switch to Path.startsWith(Path), which compares
whole path segments.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
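
The difference between the two prefix checks in the commit above can be reproduced with the example paths from the commit message. This is a minimal sketch: `PathPrefixCheck` and its method names are hypothetical, and it uses `normalize()` rather than canonical or real paths so that it runs without touching the filesystem.

```java
import java.nio.file.Path;

final class PathPrefixCheck {
    // Character-based check: accepts any path whose text begins with the base,
    // including sibling directories such as parest_data_evil.
    static boolean isUnderByString(String base, String candidate) {
        return candidate.startsWith(base);
    }

    // Segment-based check: Path.startsWith compares whole name elements.
    // normalize() collapses "." and ".." but does not resolve symlinks.
    static boolean isUnderByPath(Path base, Path candidate) {
        return candidate.normalize().startsWith(base.normalize());
    }

    public static void main(String[] args) {
        Path base = Path.of("/data/parest_data");
        Path evil = Path.of("/data/parest_data_evil/job.json");
        Path good = Path.of("/data/parest_data/job.json");
        System.out.println(isUnderByString(base.toString(), evil.toString())); // true (wrongly accepted)
        System.out.println(isUnderByPath(base, evil));                         // false
        System.out.println(isUnderByPath(base, good));                         // true
    }
}
```

Code that must also defend against symlinks would compare canonical or real paths, as the commit describes, instead of normalized ones.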
Development

Successfully merging this pull request may close these issues.

VCell Parameter Estimation is broken

2 participants