
Commit c22196c

Update paper.tex

1 parent dc982d0 commit c22196c

1 file changed: doc/src/Papers/RLpaper/paper.tex (35 additions, 0 deletions)
@@ -661,3 +661,38 @@ \section{Conclusion}
the connection to metrology explicit, and several works (cited above)
have demonstrated that RL agents can indeed maximize QFI or minimize
the Cramér–Rao bound in practice.
\subsection{Quantum sensing and control with spin qubits}
Quantum spin qubits (e.g. NV centers or trapped electrons/ions) are powerful quantum sensors: one applies carefully shaped control pulses and potentials to prepare probe states, accumulate phase from an unknown field, and read out its value. The goal is to maximize the quantum Fisher information (QFI) about the target parameter (e.g. a magnetic field), yielding the highest possible measurement precision. In practice, this means optimizing pulse sequences (and even trap potentials) during the state-preparation and interaction steps so that the final state is maximally sensitive to the parameter. For example, MacLellan et al. describe a variational sensing protocol in which a parameterized circuit $U(\theta)$ prepares an initial probe, the probe evolves under the unknown phase $\phi$, and a parameterized measurement $M(\mu)$ is performed; finally, a classical neural network $E(\lambda)$ processes the outcomes to estimate $\phi$. Such end-to-end variational quantum sensing frameworks make the entire protocol trainable under realistic noise (Fig.~1).
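For concreteness (standard definitions, not specific to any one cited work): for a pure probe state $|\psi_\phi\rangle$, the QFI and the resulting quantum Cramér–Rao bound read
\[
F_Q(\phi) = 4\left(\langle \partial_\phi \psi_\phi | \partial_\phi \psi_\phi \rangle - \left|\langle \psi_\phi | \partial_\phi \psi_\phi \rangle\right|^2\right),
\qquad
\Delta\phi \ge \frac{1}{\sqrt{\nu\, F_Q(\phi)}},
\]
where $\nu$ is the number of independent repetitions; maximizing $F_Q$ is therefore the natural reward signal for the optimizations discussed below.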
Figure: A variational quantum sensing pipeline from MacLellan et al. (npj QI 2024). A quantum circuit $U(\theta)$ prepares a probe state, which interacts with the unknown parameter $\phi$ (block $K(\phi)$); a parameterized measurement $M(\mu)$ then yields data from which a neural estimator $E(\lambda)$ predicts $\bar\phi$. All stages are trainable.
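As a minimal illustration of this pipeline (a single spin-1/2 Ramsey-style probe simulated in plain NumPy; the function names and parameter choices here are ours, not MacLellan et al.'s, and a simple fringe-inversion rule stands in for the neural estimator $E(\lambda)$):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def ry(a):  # rotation about Y
    return np.array([[np.cos(a/2), -np.sin(a/2)],
                     [np.sin(a/2),  np.cos(a/2)]], dtype=complex)

def rz(a):  # rotation about Z (phase accumulation)
    return np.diag([np.exp(-1j*a/2), np.exp(1j*a/2)])

def run_protocol(theta, phi, mu, shots=2000):
    """U(theta) prepares the probe, K(phi) imprints the unknown
    phase, M(mu) sets the measurement basis; a classical rule
    (the stand-in for E) inverts the Ramsey fringe."""
    psi = ry(mu) @ rz(phi) @ ry(theta) @ np.array([1, 0], dtype=complex)
    p0 = abs(psi[0])**2                      # Born-rule probability
    n0 = rng.binomial(shots, p0)             # finite measurement shots
    # For theta = pi/2, mu = -pi/2 one gets p0 = cos^2(phi/2):
    return 2*np.arccos(np.sqrt(n0/shots))

print(run_protocol(np.pi/2, 0.7, -np.pi/2))  # approx 0.7
\end{verbatim}

In a trainable version, $\theta$ and $\mu$ would be tuned (and the inversion rule replaced by a learned estimator) to minimize the estimation error, i.e. to push the protocol toward the QFI bound above.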
In this language, classical RL (deep or otherwise) can optimize the control parameters $\theta,\mu$ by treating them as actions and rewarding high QFI. Indeed, several works have demonstrated classical RL agents finding near-optimal controls for sensing tasks. For example, Cooke et al. train a Soft Actor-Critic (SAC) agent to control a spin-$S$ magnetometer by applying a sequence of transverse-field pulses, and find policies that maximize the QFI for estimating a background magnetic field. Xiao et al. use deep RL (policy gradients) to find time-dependent control signals that saturate the theoretical QFI bound in both noiseless and noisy settings. These studies report robust, sample-efficient learning: the RL agents achieve state-of-the-art sensing precision and even generalize to shifted parameter values (i.e. fields slightly different from those seen in training). In [25], the RL scheme (called DRLQS) attains near-optimal QFI under both noise-free and noisy dynamics and exhibits strong transferability when the true parameter shifts, outperforming conventional optimizers. Likewise, [11] finds that the trained RL agent generalizes well to Hamiltonian parameters not seen in training.
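A hedged sketch of how such a control problem can be exposed to a standard agent (a gym-style loop for a single spin-1/2 with a transverse-pulse action and the final-state QFI as a sparse reward; the class and method names are illustrative, not Cooke et al.'s code):

\begin{verbatim}
import numpy as np
from scipy.linalg import expm

SX = np.array([[0, 1], [1, 0]], dtype=complex) / 2
SZ = np.array([[1, 0], [0, -1]], dtype=complex) / 2

class SpinMagnetometerEnv:
    """Toy pulsed spin-1/2 magnetometer: actions are transverse
    pulse amplitudes u, reward is the QFI about the field B."""
    def __init__(self, B=1.0, dt=0.1, n_steps=20):
        self.B, self.dt, self.n = B, dt, n_steps

    def reset(self):
        self.t = 0
        self.psi = lambda b: np.array([1, 0], dtype=complex)  # |0>
        return self.t

    def step(self, u):
        prev = self.psi  # keep the trajectory as a function of B
        self.psi = lambda b, prev=prev, u=u: (
            expm(-1j*self.dt*(b*SZ + u*SX)) @ prev(b))
        self.t += 1
        done = self.t >= self.n
        return self.t, (self._qfi() if done else 0.0), done

    def _qfi(self, eps=1e-4):  # pure-state QFI by finite differences
        psi = self.psi(self.B)
        dpsi = (self.psi(self.B + eps) - self.psi(self.B - eps))/(2*eps)
        return float(4*(np.vdot(dpsi, dpsi).real
                        - abs(np.vdot(psi, dpsi))**2))
\end{verbatim}

An off-the-shelf SAC or policy-gradient implementation can then be wrapped around this loop; the physics enters only through the Hamiltonian in step and the QFI reward.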
\subsection{Quantum reinforcement learning (QRL) concepts}
“Quantum reinforcement learning” refers to schemes in which the RL agent or its internal models are quantum. In contrast to classical RL for quantum systems (a classical computer controlling a quantum system), QRL envisions a quantum agent: for example, a policy or value function realized as a variational quantum circuit, or a Boltzmann machine implemented with qubits. One can further distinguish QC scenarios (quantum agent, classical environment) from QQ scenarios (both agent and environment quantum). A key motivation is that quantum circuits can in principle evaluate certain functions or make decisions in superposition or with entanglement, potentially speeding up learning or enabling richer function classes. For instance, quantum parallelism might allow evaluating many actions in one pass, and Grover-like subroutines can quadratically speed up the search for high-reward actions.
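The quadratic speed-up referred to here is the standard amplitude-amplification query count: to find one of $M$ high-reward actions among $N$ candidates, Grover-type search needs about
\[
k_{\text{Grover}} \approx \frac{\pi}{4}\sqrt{\frac{N}{M}}
\]
oracle queries, versus $\Theta(N/M)$ expected queries classically.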
Several near-term QRL proposals use parameterized quantum circuits (also known as quantum neural networks) as function approximators. For example, a policy or Q-function may be encoded by a variational circuit whose rotation angles are updated by an RL algorithm. Alternatively, quantum Boltzmann machines (QBMs), entangled networks of qubits with transverse fields, have been proposed to model action-value distributions. Hybrid architectures are also possible: e.g. the agent's policy might consist of alternating quantum and classical episodes, using short photonic circuits that exchange quantum information with the environment. Crucially, all these methods are intended to run on noisy intermediate-scale quantum (NISQ) devices: they use only a few qubits and shallow circuits, with parameter updates handled by a classical optimizer. For example, recent works have implemented QRL agents on current hardware: one study ran a trapped-ion QRL agent with 16 qubits, demonstrating feasibility and hinting at advantages such as higher expressivity and sample efficiency.
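A minimal sketch of such a parameterized-circuit policy (a single qubit in plain NumPy, with no quantum SDK, so that the Born rule defining a two-action policy is explicit; the Z-Y-Z ansatz is our illustrative choice):

\begin{verbatim}
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def rot(P, a):  # exp(-i a P / 2) for a Pauli matrix P
    return np.cos(a/2)*np.eye(2) - 1j*np.sin(a/2)*P

def policy(params, s):
    """pi(a|s) for two actions: encode the scalar state s,
    apply a trainable Z-Y-Z layer, read out Born probabilities."""
    psi = np.array([1, 0], dtype=complex)
    psi = rot(Y, s) @ psi                    # data encoding
    for P, p in zip((Z, Y, Z), params):      # variational layer
        psi = rot(P, p) @ psi
    return np.abs(psi)**2                    # [P(a=0), P(a=1)]

probs = policy(np.array([0.3, 1.2, -0.5]), s=0.8)
action = np.random.default_rng().choice(2, p=probs)
\end{verbatim}

The three rotation angles play the role of the network weights; a classical optimizer (REINFORCE, parameter-shift gradients, etc.) updates them exactly as described above.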
\subsection{Comparing QRL and classical RL in sensing tasks}
In principle, QRL could offer advantages in training efficiency and function expressivity. Studies indicate that variational circuits can represent complex policies with fewer parameters than neural networks. Gurgul et al. compare quantum DQN/DDPG agents against classical deep RL on a sequential decision task and find that the quantum agents outperform the classical ones per parameter: QRL achieves similar or better reward with orders of magnitude fewer variational parameters. Likewise, Crawford et al. showed that a QBM-based RL agent outperformed a classical RBM-based one on a toy navigation task: the QBM (with a transverse field) trained more effectively than its classical Boltzmann-machine counterpart. These results suggest sample-efficiency and model-compactness benefits: a quantum policy ansatz may require fewer training episodes to converge or to generalize well because of its richer representational power.
On the other hand, classical deep RL agents for sensing already perform very well in practice. They can exploit conventional policy-gradient and actor-critic methods and run on powerful GPUs. The SAC agent in [11] achieves high QFI and good generalization over a range of Hamiltonian parameters without any quantum resources. A fair comparison therefore focuses on per-resource performance: if a QRL agent uses a small quantum circuit with far fewer parameters, it may train faster or generalize better than a massive classical network. Indeed, the finance QRL example observed that a quantum circuit with $\sim$30 parameters matched the performance of a classical network with thousands of parameters.
In terms of generalization, QRL agents can also leverage continuous quantum actions. For example, Wu et al. propose a quantum DDPG algorithm in a continuous action space: after a single training run, their variational policy can output a continuous control sequence for any target state. In other words, once trained, the single quantum policy generates the appropriate pulse sequence for any desired target without retraining. This “one-shot” capability contrasts with standard control methods (including classical RL), which typically require re-optimizing for each new target state. Such general-purpose adaptability could be valuable in sensing tasks where the optimal pulses depend smoothly on the unknown parameter or trap configuration.
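Read in conventional RL terms, this is a goal-conditioned policy: the target enters the policy's input, so a single parameter set covers all targets. A purely hypothetical interface (not Wu et al.'s construction):

\begin{verbatim}
import numpy as np

def pulse_sequence(params, current_state, target_state, horizon=10):
    """One trained parameter vector, any target: the policy maps
    (state, target, time) features to a continuous pulse amplitude."""
    feats = np.concatenate([current_state, target_state, [0.0]])
    pulses = []
    for t in range(horizon):
        feats[-1] = t / horizon              # time feature
        pulses.append(float(np.tanh(params @ feats)))
    return pulses
\end{verbatim}

In the quantum version the linear map would be replaced by a variational circuit, but the one-shot property comes from the target conditioning rather than from the circuit itself.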
Finally, regarding final sensing performance, some theoretical work suggests QRL could unlock higher precision. For instance, a recent study using a QRL agent in a many-body critical system found protocols that saturate the quantum speed limit and achieve Heisenberg-limited or even super-Heisenberg scaling of the sensitivity. Although this has not yet been demonstrated on a physical NISQ device, it shows that QRL can in principle discover highly nontrivial control sequences beyond standard adiabatic methods. Classical RL has also been used to approach optimal performance; e.g. Xiao et al. report state-of-the-art precision using deep RL, but note that the RL required many episodes to converge and does not easily reach the $T^4$ QFI scaling in time-dependent problems. The promise is that a quantum learner might overcome such limits by exploring entangled or coherent control resources more naturally.
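For reference, the scalings at stake (standard metrology benchmarks): with $N$ probes,
\[
\Delta\phi \sim N^{-1/2} \ \text{(standard quantum limit)},
\qquad
\Delta\phi \sim N^{-1} \ \text{(Heisenberg limit)},
\]
while in interrogation time $T$ the Heisenberg benchmark is $F_Q \propto T^2$, so the $T^4$ QFI scaling mentioned above is a super-Heisenberg regime.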
\subsection{Demonstrations and feasibility on NISQ devices}
To date, most demonstrations of RL for quantum sensing are numerical or run on classical computers. For example, [11] and [25] use simulated spin systems to train deep RL, and [38] and [40] present proof-of-principle simulations of QRL algorithms. On actual quantum hardware, QRL is still in its infancy, but early experiments exist: the hybrid nanophotonic processor in [22] implemented a small QRL agent that learned a search problem faster than a classical agent. More relevantly, Gurgul et al. realized a variational QRL agent (for a portfolio task) on a 16-qubit trapped-ion device and observed it working in practice. Although no experiment has yet applied QRL specifically to a spin-based sensor, these examples demonstrate that NISQ devices can host the required variational circuits. Such circuits typically involve only a few qubits and shallow entangling layers (well within current capability) and use classical outer-loop optimizers, exactly the hybrid quantum-classical models envisioned for near-term QRL.
In summary, current evidence suggests that QRL algorithms (using variational quantum circuits or quantum Boltzmann machines) are feasible on NISQ hardware and may offer advantages in model efficiency and learning speed. Classical RL remains highly effective and versatile, but QRL could become advantageous as quantum processors improve. For quantum sensing with spin qubits, QRL could in principle learn optimal pulse and potential controls that maximize the QFI, potentially surpassing classical strategies. Future work should benchmark QRL and classical RL side by side on realistic sensor models to quantify any quantum advantage in training efficiency, generalization, and final precision.
Sources: We have drawn on recent literature in quantum metrology and reinforcement learning. Classical RL applications to spin sensors include Cooke et al. and Xiao et al.; variational sensing models are described by MacLellan et al. QRL and QBM references include Crawford et al. and Wu et al., as well as reviews and simulations comparing QRL with classical RL performance. Notably, Gurgul et al. report QRL agents needing far fewer parameters than classical agents. These sources illustrate the state of the art in applying RL to quantum sensing.
