This repository implements a production-grade machine learning pipeline to forecast the effectiveness of digital advertising creatives. It uses a Gated Recurrent Unit (GRU) network, a type of Recurrent Neural Network (RNN), to predict lead generation (the number of leads) by modeling temporal dependencies in historical performance data.
Unlike standard regression models, which typically treat each observation independently, a GRU carries a hidden state across time steps, so it can capture sequential patterns and longer-range dependencies in time-series data. This makes it a strong alternative to Temporal Convolutional Network (TCN) approaches.
- Hybrid Data Ingestion: Merges time-series performance metrics (CSV) with static creative attributes parsed from complex, nested JSON files.
- Automated Feature Engineering:
  - Extracts NLP features (e.g., keyword presence, question length) from ad copy.
  - Engineers rolling-window statistics (7-day/14-day means) for trend capture.
- Bayesian Hyperparameter Tuning: Utilizes Optuna to optimize RNN architecture (Hidden Dimensions, Stacked Layers, Dropout) with cross-validation.
- Robust Evaluation: Compares deep learning forecasts against a custom "Realistic Baseline" (Efficiency = Target/Spend) to ensure true model lift.
- Production Safety: Implements strict temporal splitting to prevent data leakage and handles missing data with forward-filling logic.
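
The rolling-window, forward-fill, and temporal-split ideas in the list above can be sketched in a few lines. This is a minimal standard-library illustration, not the repository's actual code: the function names are hypothetical, and excluding the current day from each rolling mean is an assumption made here to keep the target out of its own features.

```python
from collections import deque
from typing import List, Optional, Sequence, Tuple


def lagged_rolling_mean(values: Sequence[float], window: int) -> List[Optional[float]]:
    """Mean of the previous `window` observations, excluding the current one.

    Returns None until a full window of history is available.
    """
    history: deque = deque(maxlen=window)
    out: List[Optional[float]] = []
    for value in values:
        out.append(sum(history) / window if len(history) == window else None)
        history.append(value)  # the current value only becomes visible to later rows
    return out


def forward_fill(values: Sequence[Optional[float]]) -> List[Optional[float]]:
    """Replace each missing value with the most recent observed value."""
    last: Optional[float] = None
    out: List[Optional[float]] = []
    for value in values:
        if value is not None:
            last = value
        out.append(last)
    return out


def temporal_split(rows: Sequence[Tuple[str, float]], cutoff: str):
    """Split (ISO date, value) rows so that every training row precedes the cutoff."""
    train = [row for row in rows if row[0] < cutoff]
    test = [row for row in rows if row[0] >= cutoff]
    return train, test


daily_leads = [1.0, 2.0, 3.0, 4.0]
rolling_2d = lagged_rolling_mean(daily_leads, window=2)
filled = forward_fill([None, 1.0, None, 3.0])
train_rows, test_rows = temporal_split(
    [("2024-01-01", 1.0), ("2024-01-02", 2.0), ("2024-01-03", 3.0)],
    cutoff="2024-01-03",
)
```

Splitting by date rather than at random keeps every test observation later than every training observation, which is what prevents future information from leaking into training.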
The raw performance data is extracted from the company's data warehouse using the scripts located in the sql/ directory:
- data_extraction_train.sql: Extracts historical data for model training (applying business logic filters).
- data_extraction_inf.sql: Extracts recent data for daily inference batches.
- JSON Parsing: Flattens nested JSON hierarchies to extract creative metadata (e.g., button_style, mood, colors).
- Cleaning: Handles messy real-world data, including outlier removal and NaN imputation for "cold start" ads.
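
As an illustration of the JSON flattening step, the sketch below turns a nested document into a single-level dictionary with dotted keys. The `flatten` helper and the shape of the sample JSON are assumptions for demonstration only; the real creative files may be structured differently.

```python
import json
from typing import Any, Dict


def flatten(obj: Any, prefix: str = "", sep: str = ".") -> Dict[str, Any]:
    """Recursively flatten nested dicts and lists into one dict with dotted keys."""
    flat: Dict[str, Any] = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            child = f"{prefix}{sep}{key}" if prefix else str(key)
            flat.update(flatten(value, child, sep))
    elif isinstance(obj, list):
        # List elements are addressed by their position.
        for index, value in enumerate(obj):
            child = f"{prefix}{sep}{index}" if prefix else str(index)
            flat.update(flatten(value, child, sep))
    else:
        flat[prefix] = obj  # leaf value (string, number, bool, or None)
    return flat


# Hypothetical creative file; only the field names come from the README.
raw = '{"creative": {"button_style": "rounded", "mood": "playful", "colors": ["red", "white"]}}'
features = flatten(json.loads(raw))
```

Each flattened dictionary can then become one row of static attributes to merge with the time-series metrics.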
- Model: GRU (Gated Recurrent Unit) implemented via the Darts library (RNNModel).
- Objective: Minimize Mean Absolute Error (MAE) on the validation set.
- Search Space:
  - hidden_dim: 16 - 64 (Size of the internal memory state)
  - n_rnn_layers: 1 - 3 (Depth of the network)
  - dropout: 0.1 - 0.4 (Regularization)
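
The search space above could be wired into an Optuna objective roughly as follows. This is a sketch under stated assumptions rather than the repository's implementation: `suggest_params` and `make_objective` are hypothetical names, `input_chunk_length`, `training_length`, and `n_epochs` are placeholder values, and a single train/validation split stands in for the cross-validation the pipeline uses. Darts is imported lazily so that the sampling helper can be used without it installed.

```python
from typing import Any, Callable, Dict

# Ranges copied from the search space listed above.
SEARCH_SPACE = {
    "hidden_dim": (16, 64),
    "n_rnn_layers": (1, 3),
    "dropout": (0.1, 0.4),
}


def suggest_params(trial: Any) -> Dict[str, Any]:
    """Sample one GRU configuration from an Optuna trial."""
    return {
        "hidden_dim": trial.suggest_int("hidden_dim", *SEARCH_SPACE["hidden_dim"]),
        "n_rnn_layers": trial.suggest_int("n_rnn_layers", *SEARCH_SPACE["n_rnn_layers"]),
        "dropout": trial.suggest_float("dropout", *SEARCH_SPACE["dropout"]),
    }


def make_objective(train: Any, val: Any, input_chunk_length: int = 14,
                   n_epochs: int = 20) -> Callable[[Any], float]:
    """Build an objective that fits a GRU and returns validation MAE.

    `train` and `val` are expected to be darts TimeSeries objects, with `val`
    starting right after `train` ends.
    """
    from darts.metrics import mae      # imported lazily: heavy optional dependency
    from darts.models import RNNModel

    def objective(trial: Any) -> float:
        model = RNNModel(
            model="GRU",
            input_chunk_length=input_chunk_length,
            training_length=input_chunk_length + 7,  # assumed value
            n_epochs=n_epochs,
            random_state=42,
            **suggest_params(trial),
        )
        model.fit(train)
        forecast = model.predict(len(val))
        return mae(val, forecast)

    return objective


# Typical use, once train/val series exist:
#   import optuna
#   study = optuna.create_study(direction="minimize")
#   study.optimize(make_objective(train, val), n_trials=30)
```

Optuna's default sampler (TPE) proposes new configurations based on the scores of earlier trials, so the number of trials is the main cost knob.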
The pipeline evaluates models using a suite of regression and business metrics:
- MAE / RMSE: For raw error magnitude.
- sMAPE: For relative error that stays usable when actual values are zero (unlike MAPE).
- Bias (ME): To detect systematic over/under-forecasting.
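
For reference, the metrics listed above can be computed as in the following sketch. The function name is hypothetical, and two conventions are assumptions: bias is taken as forecast minus actual (positive means over-forecasting), and a sMAPE term counts as zero when both the actual and the forecast are zero.

```python
import math
from typing import Dict, Sequence


def regression_report(actual: Sequence[float], forecast: Sequence[float]) -> Dict[str, float]:
    """Compute MAE, RMSE, sMAPE (in percent), and bias (mean error)."""
    if len(actual) != len(forecast) or len(actual) == 0:
        raise ValueError("actual and forecast must be non-empty and of equal length")

    n = len(actual)
    errors = [f - a for a, f in zip(actual, forecast)]

    smape_terms = []
    for a, f in zip(actual, forecast):
        denom = abs(a) + abs(f)
        # Assumed convention: a perfect zero forecast of a zero actual contributes 0.
        smape_terms.append(0.0 if denom == 0 else 2 * abs(f - a) / denom)

    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(sum(e * e for e in errors) / n),
        "smape": 100 * sum(smape_terms) / n,
        "bias": sum(errors) / n,
    }


report = regression_report(actual=[0, 2, 4], forecast=[0, 4, 2])
```

In this toy example the over- and under-forecasts cancel, so the bias is zero even though MAE is not, which is why both are reported.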
- Clone the repository:

      git clone https://github.com/yourusername/gru-forecasting-pipeline.git
      cd gru-forecasting-pipeline

- Install dependencies:

      pip install -r requirements.txt

- Configuration: Update config.py with your local data paths:

      PERFORMANCE_CSV_PATH = "./data/database_full.csv"
      CREATIVE_JSON_DIR = "./data/jsons/"

- Run the Pipeline: Execute the main orchestration script:

      python main.py

  This will ingest data, run feature engineering, optimize the GRU model using Optuna, and output performance metrics.

- Run Inference: Generate forecasts for your ads:

      python inference.py
The pipeline automatically generates diagnostic plots in the results/ directory:
- Forecast vs. Actuals: Time-series overlay of predictions.
- Residual Analysis: Analysis of error distribution over time.
Luciën Tuijp