This project implements a Double Deep Q-Network (Double-DQN) to optimize inventory management for a retail environment (Walmart-style simulation). The AI agent learns to balance the cost of holding inventory against the penalty of stockouts, specifically anticipating surges during Weekends and Festivals.
The agent uses a 3-layer neural network implemented in pure NumPy for high-speed inference. It learns by interacting with a 10-year demand dataset (walmart_demand.csv).
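A minimal sketch of such a 3-layer forward pass in pure NumPy (the hidden width of 64 and the weight initialization here are illustrative assumptions, not the project's actual values):

```python
import numpy as np

STATE_SIZE, HIDDEN, N_ACTIONS = 6, 64, 11  # 6 state features -> 11 order quantities

# Randomly initialized weights for a 3-layer MLP (sizes assumed for illustration)
rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.1, (STATE_SIZE, HIDDEN)); b1 = np.zeros(HIDDEN)
W2 = rng.normal(0, 0.1, (HIDDEN, HIDDEN));     b2 = np.zeros(HIDDEN)
W3 = rng.normal(0, 0.1, (HIDDEN, N_ACTIONS));  b3 = np.zeros(N_ACTIONS)

def forward(state):
    """Map a 6-dim state to 11 Q-values using two ReLU hidden layers."""
    h1 = np.maximum(0, state @ W1 + b1)
    h2 = np.maximum(0, h1 @ W2 + b2)
    return h2 @ W3 + b3

q = forward(np.zeros(STATE_SIZE))
print(q.shape)  # (11,)
```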
The model observes the following 6 variables to make a decision:
- Current Inventory: Scaled to 0.0–1.0 relative to warehouse capacity.
- Is Today Weekend?: Indicates if a weekend surge is happening now.
- Is Tomorrow Weekend?: Allows the agent to "Pre-stock" ahead of time.
- Days Since Weekend: Normalized time feature to catch weekly patterns.
- Is Today Festival?: Indicates a massive holiday/festival surge.
- Is Tomorrow Festival?: Crucial for large volume pre-ordering to avoid stockouts.
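The six features above can be assembled into a state vector along these lines (the feature order and the 7-day normalization are assumptions for illustration):

```python
import numpy as np

def build_state(inventory, capacity, today_weekend, tomorrow_weekend,
                days_since_weekend, today_festival, tomorrow_festival):
    """Assemble the 6-feature observation vector the agent sees each day."""
    return np.array([
        inventory / capacity,          # current inventory, scaled 0.0-1.0
        float(today_weekend),          # weekend surge happening now?
        float(tomorrow_weekend),       # weekend tomorrow -> chance to pre-stock
        days_since_weekend / 7.0,      # normalized weekly-cycle feature
        float(today_festival),         # festival surge today?
        float(tomorrow_festival),      # festival tomorrow -> large pre-order
    ], dtype=np.float32)

state = build_state(inventory=40, capacity=100,
                    today_weekend=False, tomorrow_weekend=True,
                    days_since_weekend=3,
                    today_festival=False, tomorrow_festival=False)
print(state)
```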
The model outputs Q-Values for 11 discrete order quantities:
[0, 5, 10, 15, 20, 25, 30, 35, 40, 50, 60]
During exploitation, it selects the quantity that maximizes the expected future "Reward" (Profit).
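Greedy selection is an argmax over the 11 Q-values, mapped back to an order quantity. A minimal epsilon-greedy sketch (the function name and epsilon handling are assumptions):

```python
import numpy as np

ACTIONS = [0, 5, 10, 15, 20, 25, 30, 35, 40, 50, 60]  # discrete order quantities

def select_quantity(q_values, epsilon, rng=np.random.default_rng()):
    """Epsilon-greedy: random quantity with prob. epsilon, else greedy argmax."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)              # exploration
    return ACTIONS[int(np.argmax(q_values))]    # exploitation

q = np.array([1.0, 2.0, 0.5, 3.2, 1.1, 0.0, 0.3, 0.9, 2.5, 1.7, 0.8])
print(select_quantity(q, epsilon=0.0))  # greedy pick: index 3 -> quantity 15
```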
The RL agent's performance is governed by several critical settings that balance profit and warehouse safety:
- `STATE_SIZE = 6`: The "eyes" of the AI. Increasing this from the previous 4 to 6 gives the model the ability to "see" festivals a full day in advance, allowing preemptive stocking.
- `GAMMA = 0.90` (Discount Factor): Determines the agent's foresight. A value of 0.90 is high, meaning the agent values future stockouts almost as much as today's costs. It learns to "save for a rainy day" (surges).
- `EPS_DECAY = 0.995`: Controls the transition from Exploration (trying random actions) to Exploitation (using its brain). It ensures the model tries every possible order quantity before settling on the most profitable one.
- `BATCH = 128`: During training, the model doesn't just learn from the current day. It replays 128 random past experiences so its learning is stable and not biased by a single bad day.
- `LR = 0.0001` (Learning Rate): A very careful step size. This prevents exploding gradients, where the AI might overreact to a single holiday and start over-ordering every day.
- `STOCK_PENALTY = 80`: Set intentionally high ($80 per unit). In the retail world, losing a customer to a stockout is far more expensive than paying for warehouse shelf space.
- `TARGET_SYNC = 10`: We use a Double-DQN architecture. Every 10 episodes, the "Target Brain" is synced with the "Online Brain" to prevent the model from "chasing its own tail" during learning.
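The Double-DQN update that `GAMMA` and `TARGET_SYNC` govern can be sketched as follows: the online network *chooses* the next action, while the target network *evaluates* it. This is a generic illustration of the technique, not the project's exact code (the array shapes and function name are assumptions):

```python
import numpy as np

GAMMA = 0.90  # discount factor from the settings above

def double_dqn_targets(rewards, dones, q_online_next, q_target_next):
    """Compute Double-DQN targets for a batch of transitions.

    q_online_next / q_target_next: (batch, n_actions) Q-values for the
    next states from the online and target networks respectively.
    """
    best = np.argmax(q_online_next, axis=1)             # online net selects
    q_eval = q_target_next[np.arange(len(best)), best]  # target net evaluates
    return rewards + GAMMA * q_eval * (1.0 - dones)     # zero future value at done

# One-transition example: online net prefers action 1; target net scores it 4.0
rewards = np.array([10.0])
dones = np.array([0.0])
q_on = np.array([[1.0, 5.0, 2.0]])
q_tg = np.array([[0.5, 4.0, 3.0]])
print(double_dqn_targets(rewards, dones, q_on, q_tg))  # [13.6] = 10 + 0.9 * 4.0
```

Decoupling selection from evaluation this way is what reduces the over-estimation bias of vanilla DQN.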
- Simulation: The dashboard compares the RL Agent against a "Traditional" fixed-order model.
- SHAP (Explainability): The project uses SHAP values to show which factor (e.g., Tomorrow Festival) caused the AI to order a specific quantity.
- Learning Logic: The agent learns that if Tomorrow Festival is True, it must order 50+ units even if current stock is high, to cover the predicted surge.
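The incentive behind that behavior comes from the reward signal. A hedged sketch of how such a daily reward might be shaped (only `STOCK_PENALTY = 80` comes from the settings above; the profit and holding-cost figures are made-up assumptions):

```python
STOCK_PENALTY = 80    # $ per unit of unmet demand (from the settings above)
HOLDING_COST = 2      # assumed $ per leftover unit held overnight
UNIT_PROFIT = 10      # assumed $ profit per unit sold

def daily_reward(inventory, order_qty, demand):
    """Profit from sales minus stockout penalty and holding cost."""
    available = inventory + order_qty
    sold = min(available, demand)
    stockout = max(demand - available, 0)   # unmet demand -> lost customers
    leftover = available - sold             # units carried on the shelf
    return UNIT_PROFIT * sold - STOCK_PENALTY * stockout - HOLDING_COST * leftover

# A festival-day surge of 40 units with only 30 available is painful:
print(daily_reward(inventory=10, order_qty=20, demand=40))
# 10*30 - 80*10 - 2*0 = -500
```

With the penalty dwarfing the holding cost, pre-ordering ahead of a predicted surge is clearly the profitable move.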
- Generate data: `python generate_data.py`
- Train model: `python inventory_rl.py`
- Start Dashboard: `python app.py` (visit http://localhost:5000)