| 1 | +--- |
| 2 | +title: "Complex Metadata Filtering" |
| 3 | +description: "Advanced document filtering using dates, arrays, decimals, and multiple operators for precise retrieval." |
| 4 | +--- |
| 5 | + |
| 6 | +This cookbook shows how to filter Morphik documents and chunks on typed metadata fields: strings, dates, decimals, booleans, arrays, and nested objects.
| 7 | + |
| 8 | +> **Prerequisites** |
| 9 | +> - Install the Morphik SDK: `pip install morphik` |
| 10 | +> - Provide credentials via a Morphik URI (`morphik://<app>:<token>@<host>`)
| 11 | +> - Basic understanding of document ingestion |
| 12 | +
| 13 | +## 1. Ingest Documents with Rich Typed Metadata |
| 14 | + |
| 15 | +Morphik accepts typed metadata values (strings, dates, numbers, booleans, arrays, and nested objects) that filters can target later:
| 16 | + |
| 17 | +```python |
| 18 | +from datetime import date, datetime, timezone |
| 19 | +from decimal import Decimal |
| 20 | +from morphik import Morphik |
| 21 | + |
| 22 | +client = Morphik("morphik://your-app:token@api.morphik.ai") |
| 23 | + |
| 24 | +# Rich metadata with multiple types |
| 25 | +metadata = {
| 26 | +    # Strings
| 27 | +    "region": "andes",
| 28 | +    "project_code": "hydro-life-2024",
| 29 | +
| 30 | +    # Dates and datetimes
| 31 | +    "fieldwork_date": date(2024, 9, 18),
| 32 | +    "monitoring_window_start": datetime(2024, 9, 18, 9, 10, tzinfo=timezone.utc),
| 33 | +    "monitoring_window_end": datetime(2024, 9, 18, 17, 35, tzinfo=timezone.utc),
| 34 | +
| 35 | +    # Numbers
| 36 | +    "hazard_score": 41,  # Integer
| 37 | +    "ph_reading": Decimal("6.3"),  # Decimal (precise)
| 38 | +    "water_depth_cm": 12.4,  # Float
| 39 | +    "samples_collected": 18,
| 40 | +
| 41 | +    # Boolean
| 42 | +    "is_priority_site": True,
| 43 | +
| 44 | +    # Arrays
| 45 | +    "tags": ["wildlife", "flood-risk", "community"],
| 46 | +
| 47 | +    # Nested objects
| 48 | +    "sensor_loadout": {
| 49 | +        "drone": "Skydio X10",
| 50 | +        "camera": "multispectral",
| 51 | +        "thermal_gain": 0.43,
| 52 | +    },
| 53 | +}
| 54 | +
| 55 | +# Ingest document with metadata
| 56 | +doc = client.ingest_text(
| 57 | +    content="Laguna Amazonas boardwalk inspection for wetlands buffers...",
| 58 | +    filename="laguna-amazonas-field-brief.md",
| 59 | +    metadata=metadata,
| 60 | +    use_colpali=True,
| 61 | +)
| 62 | + |
| 63 | +# Wait for completion |
| 64 | +doc.wait_for_completion(timeout_seconds=150) |
| 65 | +print(f"Ingested: {doc.external_id}") |
| 66 | +``` |
| 67 | + |
| 68 | +## 2. Build Complex Filters |
| 69 | + |
| 70 | +Combine operators under `$and` to require every condition at once:
| 71 | + |
| 72 | +```python |
| 73 | +from datetime import date |
| 74 | + |
| 75 | +# Complex filter with multiple conditions |
| 76 | +filters = {
| 77 | +    "$and": [
| 78 | +        # Exact match
| 79 | +        {"project_code": {"$eq": "hydro-life-2024"}},
| 80 | +
| 81 | +        # Field value is one of the listed values
| 82 | +        {"region": {"$in": ["andes"]}},
| 83 | +
| 84 | +        # Date range (>= September 15, 2024)
| 85 | +        {"fieldwork_date": {"$gte": date(2024, 9, 15).isoformat()}},
| 86 | +
| 87 | +        # Number range (<= 45)
| 88 | +        {"hazard_score": {"$lte": 45}},
| 89 | +
| 90 | +        # Boolean match (implicit equality)
| 91 | +        {"is_priority_site": True},
| 92 | +
| 93 | +        # Array contains value
| 94 | +        {"tags": {"$contains": {"value": "wildlife"}}},
| 95 | +
| 96 | +        # Decimal comparison (value passed as a string)
| 97 | +        {"ph_reading": {"$lte": "6.5"}},
| 98 | +    ]
| 99 | +}
| 100 | +``` |
| 101 | + |
| 102 | +## 3. List Documents with Filters |
| 103 | + |
| 104 | +Find documents matching your criteria: |
| 105 | + |
| 106 | +```python |
| 107 | +# Query documents with filters |
| 108 | +response = client.list_documents(
| 109 | +    filters=filters,
| 110 | +    include_total_count=True,
| 111 | +    completed_only=True,
| 112 | +)
| 113 | +
| 114 | +print(f"\nFound {response.total_count} matching documents:")
| 115 | +for doc in response.documents:
| 116 | +    print(f"- {doc.filename}")
| 117 | +    print(f"  Hazard Score: {doc.metadata.get('hazard_score')}")
| 118 | +    print(f"  Tags: {doc.metadata.get('tags')}")
| 119 | +``` |
| 120 | + |
| 121 | +## 4. Retrieve Chunks with Filters |
| 122 | + |
| 123 | +Get document chunks that match your metadata filters: |
| 124 | + |
| 125 | +```python |
| 126 | +# Retrieve filtered chunks |
| 127 | +chunks = client.retrieve_chunks(
| 128 | +    query="Summarize wildlife or flood risks that impact the wetlands buffer program",
| 129 | +    filters=filters,
| 130 | +    k=4,
| 131 | +    padding=1,
| 132 | +    use_colpali=True,
| 133 | +)
| 134 | +
| 135 | +print(f"\nRetrieved {len(chunks)} filtered chunks:")
| 136 | +for chunk in chunks:
| 137 | +    print(f"\nChunk {chunk.chunk_number} from {chunk.filename} (score={chunk.score:.3f})")
| 138 | +    print(f"Content preview: {chunk.content[:200]}...")
| 139 | +    print(f"Metadata: {chunk.metadata}")
| 140 | +``` |
| 141 | + |
| 142 | +## Supported Filter Operators |
| 143 | + |
| 144 | +| Operator | Description | Example | |
| 145 | +|----------|-------------|---------| |
| 146 | +| `$eq` | Exact match | `{"status": {"$eq": "active"}}` | |
| 147 | +| `$in` | Field value is one of the listed values | `{"region": {"$in": ["andes", "altiplano"]}}` |
| 148 | +| `$gte` | Greater than or equal | `{"date": {"$gte": "2024-01-01"}}` | |
| 149 | +| `$lte` | Less than or equal | `{"score": {"$lte": 45}}` | |
| 150 | +| `$gt` | Greater than | `{"temperature": {"$gt": 0}}` | |
| 151 | +| `$lt` | Less than | `{"count": {"$lt": 100}}` | |
| 152 | +| `$contains` | Array contains value | `{"tags": {"$contains": {"value": "urgent"}}}` | |
| 153 | +| `$and` | All conditions must match | `{"$and": [condition1, condition2]}` | |
| 154 | +| `$or` | At least one condition must match | `{"$or": [condition1, condition2]}` |
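
The filter in section 2 uses only `$and`. As a rough sketch of how the other operators in the table might combine, the filter below nests an `$or` inside an `$and` and uses `$gt` and `$lt`. Nesting `$or` inside `$and` is an assumption drawn from the table rather than something this cookbook runs against the API, and the thresholds are made up for illustration.

```python
import json
from datetime import date

# Hypothetical filter: priority sites OR high hazard scores, restricted to
# fieldwork in September 2024. Field names reuse the metadata from section 1.
or_filters = {
    "$and": [
        {
            "$or": [
                {"is_priority_site": True},
                {"hazard_score": {"$gt": 60}},
            ]
        },
        {"fieldwork_date": {"$gte": date(2024, 9, 1).isoformat()}},
        {"fieldwork_date": {"$lt": date(2024, 10, 1).isoformat()}},
    ]
}

# Filters are plain dictionaries of JSON-compatible values, so they
# serialize cleanly for logging or inspection.
print(json.dumps(or_filters, indent=2))
```

Using `$gte` on the first day of the month and `$lt` on the first day of the next month gives a half-open window, which avoids having to work out the last day of the month.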
| 155 | + |
| 156 | +## Use Cases |
| 157 | + |
| 158 | +Complex metadata filtering is ideal for: |
| 159 | + |
| 160 | +- **Document management systems** with multi-dimensional categorization |
| 161 | +- **Compliance and audit systems** requiring date-based queries |
| 162 | +- **Scientific data repositories** with measurements and precise numerical filtering |
| 163 | +- **Multi-tenant applications** with scope-based isolation |
| 164 | +- **Time-series document collections** with date range queries |
| 165 | +- **Hierarchical data** with nested metadata structures |
| 166 | + |
| 167 | +## Best Practices |
| 168 | + |
| 169 | +### 1. Use Appropriate Types |
| 170 | + |
| 171 | +Use the correct Python types for metadata: |
| 172 | + |
| 173 | +```python |
| 174 | +# ✅ Correct
| 175 | +metadata = {
| 176 | +    "date": date(2024, 9, 15),  # Use date objects
| 177 | +    "price": Decimal("19.99"),  # Use Decimal for precision
| 178 | +    "is_active": True,  # Use bool for flags
| 179 | +}
| 180 | +
| 181 | +# ❌ Avoid
| 182 | +metadata = {
| 183 | +    "date": "2024-09-15",  # String instead of date
| 184 | +    "price": 19.99,  # Float loses precision
| 185 | +    "is_active": "true",  # String instead of bool
| 186 | +}
| 187 | +``` |
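
To see why the float version is discouraged, the short standalone example below compares float and `Decimal` arithmetic. It uses only the Python standard library and makes no claims about how Morphik stores either type.

```python
from decimal import Decimal

# Binary floats cannot represent most decimal fractions exactly.
float_total = 0.1 + 0.2
print(float_total == 0.3)  # False

# Decimals built from strings keep the exact decimal value.
decimal_total = Decimal("0.1") + Decimal("0.2")
print(decimal_total == Decimal("0.3"))  # True

# Building a Decimal from a float inherits the float's rounding error,
# so construct it from a string instead.
print(Decimal(19.99) == Decimal("19.99"))  # False
```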
| 188 | + |
| 189 | +### 2. Convert Dates for Filtering |
| 190 | + |
| 191 | +Always convert date objects to ISO 8601 strings with `.isoformat()` when building filters:
| 192 | + |
| 193 | +```python |
| 194 | +# ✅ Correct |
| 195 | +{"fieldwork_date": {"$gte": date(2024, 9, 15).isoformat()}} |
| 196 | + |
| 197 | +# ❌ Wrong |
| 198 | +{"fieldwork_date": {"$gte": date(2024, 9, 15)}} # Date object won't work |
| 199 | +``` |
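
One way to avoid forgetting the conversion is a small helper that normalizes filter values before the query is sent. The sketch below is a hypothetical utility, not part of the Morphik SDK. It assumes dates should become ISO 8601 strings and that `Decimal` values should be sent as strings, mirroring the `"6.5"` comparison in section 2.

```python
from datetime import date
from decimal import Decimal
from typing import Any


def to_filter_value(value: Any) -> Any:
    """Recursively convert dates and Decimals into filter-friendly strings."""
    # datetime is a subclass of date, so this check covers both types.
    if isinstance(value, date):
        return value.isoformat()
    if isinstance(value, Decimal):
        # Assumption: Decimals are passed as strings, as in section 2.
        return str(value)
    if isinstance(value, dict):
        return {key: to_filter_value(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_filter_value(item) for item in value]
    return value


raw = {
    "$and": [
        {"fieldwork_date": {"$gte": date(2024, 9, 15)}},
        {"ph_reading": {"$lte": Decimal("6.5")}},
    ]
}
print(to_filter_value(raw))
```

Because the helper walks nested dictionaries and lists, it can be applied once to a whole filter rather than to each condition.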
| 200 | + |
| 201 | +### 3. Combine Operators Strategically |
| 202 | + |
| 203 | +- Use `$and` for required conditions that must all match |
| 204 | +- Use `$in` when a field can have multiple possible values |
| 205 | +- Use range operators (`$gte`, `$lte`) for numerical and date filtering |
| 206 | +- Use `$contains` for array membership checks |
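
As a rough illustration of these guidelines, the hypothetical helper below assembles an `$and` filter from optional arguments and adds a condition only when the caller supplies a value. The function name and parameters are invented for this sketch; the field names reuse the metadata from section 1.

```python
from datetime import date
from typing import Any, Optional, Sequence


def build_site_filter(
    regions: Optional[Sequence[str]] = None,
    since: Optional[date] = None,
    until: Optional[date] = None,
    max_hazard: Optional[int] = None,
    tag: Optional[str] = None,
) -> dict[str, Any]:
    """Assemble an $and filter, skipping conditions that were not requested."""
    conditions: list[dict[str, Any]] = []
    if regions:
        # $in: the field may take any one of several values.
        conditions.append({"region": {"$in": list(regions)}})
    if since is not None:
        conditions.append({"fieldwork_date": {"$gte": since.isoformat()}})
    if until is not None:
        conditions.append({"fieldwork_date": {"$lte": until.isoformat()}})
    if max_hazard is not None:
        conditions.append({"hazard_score": {"$lte": max_hazard}})
    if tag is not None:
        # $contains: array membership, using the shape from the operator table.
        conditions.append({"tags": {"$contains": {"value": tag}}})
    # Assumption: an empty dict means "no filtering" to the caller.
    return {"$and": conditions} if conditions else {}


site_filter = build_site_filter(
    regions=["andes", "altiplano"],
    since=date(2024, 9, 15),
    max_hazard=45,
    tag="wildlife",
)
print(site_filter)
```

Keeping each condition as its own entry under `$and` matches the structure used in section 2 and makes it easy to log or test the individual conditions.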
| 207 | + |
| 208 | +### 4. Index Important Fields |
| 209 | + |
| 210 | +Filters on frequently queried fields run faster when those fields are indexed. Keep metadata lean: add only the fields you expect to filter on, since every extra field adds storage and maintenance cost.
| 211 | + |
| 212 | +## Running the Example |
| 213 | + |
| 214 | +```bash |
| 215 | +# Set your Morphik URI (read it in your script with os.environ["MORPHIK_URI"] rather than hardcoding it)
| 216 | +export MORPHIK_URI="morphik://your-app:your-token@api.morphik.ai" |
| 217 | + |
| 218 | +# Run your Python script with the code above |
| 219 | +python your_script.py |
| 220 | +``` |
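
The snippets above pass the URI to `Morphik(...)` as a literal string. A minimal sketch of reading it from the `MORPHIK_URI` variable instead is shown below. The helper name `load_morphik_uri` is hypothetical, and the `morphik://` prefix check assumes every URI follows the form used in this cookbook.

```python
import os


def load_morphik_uri(env_var: str = "MORPHIK_URI") -> str:
    """Read the connection URI from the environment and sanity-check it."""
    uri = os.environ.get(env_var)
    if not uri:
        raise RuntimeError(f"Set {env_var} before running the script")
    # Assumption: all Morphik URIs use the scheme shown in this cookbook.
    if not uri.startswith("morphik://"):
        raise ValueError(f"{env_var} should start with 'morphik://'")
    return uri


# In your script, pass the result to the client instead of a literal string:
#   from morphik import Morphik
#   client = Morphik(load_morphik_uri())
```

Failing early with a clear message is easier to debug than a connection error raised later during ingestion.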
| 221 | + |
| 222 | +## Related Cookbooks |
| 223 | + |
| 224 | +- [Generating Completions with Retrieved Chunks](./generating-completions-with-retrieved-chunks) - Send filtered chunks to OpenAI |
| 225 | +- [Python SDK Basic Operations](./python-basic-operations) - Core Morphik operations |