Commit d18cc72

complex filtering and open ai completions (#12)
1 parent 00fba57 commit d18cc72

3 files changed

Lines changed: 484 additions & 6 deletions

Lines changed: 225 additions & 0 deletions
@@ -0,0 +1,225 @@
---
title: "Complex Metadata Filtering"
description: "Advanced document filtering using dates, arrays, decimals, and multiple operators for precise retrieval."
---

This cookbook demonstrates Morphik's advanced metadata filtering with richly typed metadata fields, including dates, decimals, booleans, arrays, and nested objects.

> **Prerequisites**
> - Install the Morphik SDK: `pip install morphik`
> - Provide credentials via a Morphik URI
> - Basic understanding of document ingestion

## 1. Ingest Documents with Rich Typed Metadata

Morphik supports several metadata types that can later be used for filtering:

```python
from datetime import date, datetime, timezone
from decimal import Decimal
from morphik import Morphik

client = Morphik("morphik://your-app:token@api.morphik.ai")

# Rich metadata with multiple types
metadata = {
    # Strings
    "region": "andes",
    "project_code": "hydro-life-2024",

    # Dates and datetimes
    "fieldwork_date": date(2024, 9, 18),
    "monitoring_window_start": datetime(2024, 9, 18, 9, 10, tzinfo=timezone.utc),
    "monitoring_window_end": datetime(2024, 9, 18, 17, 35, tzinfo=timezone.utc),

    # Numbers
    "hazard_score": 41,            # Integer
    "ph_reading": Decimal("6.3"),  # Decimal (precise)
    "water_depth_cm": 12.4,        # Float
    "samples_collected": 18,

    # Boolean
    "is_priority_site": True,

    # Arrays
    "tags": ["wildlife", "flood-risk", "community"],

    # Nested objects
    "sensor_loadout": {
        "drone": "Skydio X10",
        "camera": "multispectral",
        "thermal_gain": 0.43,
    },
}

# Ingest document with metadata
doc = client.ingest_text(
    content="Laguna Amazonas boardwalk inspection for wetlands buffers...",
    filename="laguna-amazonas-field-brief.md",
    metadata=metadata,
    use_colpali=True,
)

# Wait for completion
doc.wait_for_completion(timeout_seconds=150)
print(f"Ingested: {doc.external_id}")
```

## 2. Build Complex Filters

Combine multiple operators to create sophisticated queries:

```python
from datetime import date

# Complex filter with multiple conditions
filters = {
    "$and": [
        # Exact match
        {"project_code": {"$eq": "hydro-life-2024"}},

        # Field value is one of the listed options
        {"region": {"$in": ["andes"]}},

        # Date range (>= September 15, 2024), passed as an ISO string
        {"fieldwork_date": {"$gte": date(2024, 9, 15).isoformat()}},

        # Number range (<= 45)
        {"hazard_score": {"$lte": 45}},

        # Boolean match
        {"is_priority_site": True},

        # Array contains value
        {"tags": {"$contains": {"value": "wildlife"}}},

        # Decimal comparison (threshold passed as a string)
        {"ph_reading": {"$lte": "6.5"}},
    ]
}
```

## 3. List Documents with Filters

Find documents matching your criteria:

```python
# Query documents with filters
response = client.list_documents(
    filters=filters,
    include_total_count=True,
    completed_only=True,
)

print(f"\nFound {response.total_count} matching documents:")
for doc in response.documents:
    print(f"- {doc.filename}")
    print(f"  Hazard Score: {doc.metadata.get('hazard_score')}")
    print(f"  Tags: {doc.metadata.get('tags')}")
```

## 4. Retrieve Chunks with Filters

Get document chunks that match your metadata filters:

```python
# Retrieve filtered chunks
chunks = client.retrieve_chunks(
    query="Summarize wildlife or flood risks that impact the wetlands buffer program",
    filters=filters,
    k=4,
    padding=1,
    use_colpali=True,
)

print(f"\nRetrieved {len(chunks)} filtered chunks:")
for chunk in chunks:
    print(f"\nChunk {chunk.chunk_number} from {chunk.filename} (score={chunk.score:.3f})")
    print(f"Content preview: {chunk.content[:200]}...")
    print(f"Metadata: {chunk.metadata}")
```
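
Once chunks are retrieved, a common next step is to assemble them into a single context string. The sketch below is illustrative only: `build_context` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that mimic just the `filename`, `chunk_number`, `score`, and `content` attributes used in the loop above, so the snippet runs without a Morphik connection.

```python
from types import SimpleNamespace


def build_context(chunks, max_chars=500):
    """Join chunks into one string, highest score first (hypothetical helper)."""
    ordered = sorted(chunks, key=lambda c: c.score, reverse=True)
    parts = []
    for chunk in ordered:
        header = f"[{chunk.filename} #{chunk.chunk_number} score={chunk.score:.3f}]"
        # Truncate long chunks so the combined context stays bounded.
        parts.append(f"{header}\n{chunk.content[:max_chars]}")
    return "\n\n".join(parts)


# Stand-in objects for illustration; real chunks come from retrieve_chunks().
sample_chunks = [
    SimpleNamespace(filename="a.md", chunk_number=0, score=0.71,
                    content="Flood risk near the boardwalk."),
    SimpleNamespace(filename="a.md", chunk_number=1, score=0.84,
                    content="Wildlife corridor crosses the buffer."),
]

context = build_context(sample_chunks)
print(context)
```

With real results, `build_context(chunks)` could be called on the list returned by `client.retrieve_chunks(...)`.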

## Supported Filter Operators

| Operator | Description | Example |
|----------|-------------|---------|
| `$eq` | Exact match | `{"status": {"$eq": "active"}}` |
| `$in` | Field value is one of the listed values | `{"region": {"$in": ["andes", "altiplano"]}}` |
| `$gte` | Greater than or equal | `{"date": {"$gte": "2024-01-01"}}` |
| `$lte` | Less than or equal | `{"score": {"$lte": 45}}` |
| `$gt` | Greater than | `{"temperature": {"$gt": 0}}` |
| `$lt` | Less than | `{"count": {"$lt": 100}}` |
| `$contains` | Array field contains the value | `{"tags": {"$contains": {"value": "urgent"}}}` |
| `$and` | All conditions must match | `{"$and": [condition1, condition2]}` |
| `$or` | At least one condition must match | `{"$or": [condition1, condition2]}` |
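
To make the operator semantics in the table concrete, here is a small local evaluator. It is a teaching sketch, not Morphik's implementation: `matches` and `_field_matches` are hypothetical names, the behavior is inferred from the table alone, and the real service may treat types, missing fields, and nesting differently. Dates are compared as ISO strings, which sort chronologically when they share one format.

```python
import operator

# Comparison operators from the table, mapped to Python equivalents.
_COMPARE = {
    "$eq": operator.eq,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def _field_matches(value, cond):
    """Check one field value against one condition."""
    if not isinstance(cond, dict):
        return value == cond  # implicit equality, e.g. {"is_priority_site": True}
    for op, operand in cond.items():
        if op in _COMPARE:
            # In this sketch, a missing field fails every comparison.
            if value is None or not _COMPARE[op](value, operand):
                return False
        elif op == "$in":
            if value not in operand:
                return False
        elif op == "$contains":
            if not isinstance(value, list) or operand["value"] not in value:
                return False
        else:
            raise ValueError(f"unsupported operator: {op}")
    return True


def matches(metadata, flt):
    """Return True if `metadata` satisfies the filter `flt`."""
    for key, cond in flt.items():
        if key == "$and":
            if not all(matches(metadata, c) for c in cond):
                return False
        elif key == "$or":
            if not any(matches(metadata, c) for c in cond):
                return False
        elif not _field_matches(metadata.get(key), cond):
            return False
    return True


record = {
    "region": "andes",
    "fieldwork_date": "2024-09-18",
    "hazard_score": 41,
    "is_priority_site": True,
    "tags": ["wildlife", "flood-risk"],
}
flt = {
    "$and": [
        {"region": {"$in": ["andes", "altiplano"]}},
        {"fieldwork_date": {"$gte": "2024-09-15"}},
        {"hazard_score": {"$lte": 45}},
        {"is_priority_site": True},
        {"$or": [
            {"tags": {"$contains": {"value": "urgent"}}},
            {"tags": {"$contains": {"value": "wildlife"}}},
        ]},
    ]
}

print(matches(record, flt))                              # True
print(matches(record, {"hazard_score": {"$gt": 41}}))    # False
```

Running a filter through a local check like this can help catch typos in operator names before a request is sent.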

## Use Cases

Complex metadata filtering is ideal for:

- **Document management systems** with multi-dimensional categorization
- **Compliance and audit systems** requiring date-based queries
- **Scientific data repositories** with measurements and precise numerical filtering
- **Multi-tenant applications** with scope-based isolation
- **Time-series document collections** with date range queries
- **Hierarchical data** with nested metadata structures

## Best Practices

### 1. Use Appropriate Types

Store each metadata value with the Python type that matches its meaning:

```python
from datetime import date
from decimal import Decimal

# ✅ Correct
metadata = {
    "date": date(2024, 9, 15),   # Use date objects
    "price": Decimal("19.99"),   # Use Decimal for precision
    "is_active": True,           # Use bool for flags
}

# ❌ Avoid
metadata = {
    "date": "2024-09-15",        # String instead of date
    "price": 19.99,              # Float can introduce rounding error
    "is_active": "true",         # String instead of bool
}
```

### 2. Convert Dates for Filtering

Always convert date objects to ISO-format strings with `.isoformat()` when building filters:

```python
from datetime import date

# ✅ Correct
{"fieldwork_date": {"$gte": date(2024, 9, 15).isoformat()}}

# ❌ Wrong
{"fieldwork_date": {"$gte": date(2024, 9, 15)}}  # Date object won't work
```
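
To avoid calling `.isoformat()` by hand on every condition, a filter can be normalized in one pass. The sketch below uses a hypothetical helper, `to_filter_value`, that walks a filter recursively. Converting `Decimal` to a string mirrors the `"6.5"` threshold used in section 2; whether other encodings are accepted is not covered here.

```python
import json
from datetime import date, datetime
from decimal import Decimal


def to_filter_value(value):
    """Recursively convert dates and Decimals into filter-friendly strings."""
    if isinstance(value, (date, datetime)):
        return value.isoformat()
    if isinstance(value, Decimal):
        return str(value)
    if isinstance(value, dict):
        return {key: to_filter_value(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_filter_value(item) for item in value]
    return value  # str, int, float, bool, None pass through unchanged


raw_filters = {
    "$and": [
        {"fieldwork_date": {"$gte": date(2024, 9, 15)}},
        {"ph_reading": {"$lte": Decimal("6.5")}},
        {"hazard_score": {"$lte": 45}},
    ]
}

filters = to_filter_value(raw_filters)
print(json.dumps(filters, indent=2))
```

After normalization the filter contains only JSON-friendly values, which is why `json.dumps` succeeds on `filters` but would raise a `TypeError` on `raw_filters`.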

### 3. Combine Operators Strategically

- Use `$and` for required conditions that must all match
- Use `$in` when a field can have multiple possible values
- Use range operators (`$gte`, `$lte`) for numerical and date filtering
- Use `$contains` for array membership checks
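
The guidelines above can be wrapped in a few small builder functions so that filters read closer to their intent. The names `all_of`, `any_of`, and `between` are hypothetical and not part of the Morphik SDK; they only produce plain dictionaries. The sketch also assumes that `$and` and `$or` groups may be nested inside one another.

```python
def all_of(*conditions):
    """Every condition must match."""
    return {"$and": list(conditions)}


def any_of(*conditions):
    """At least one condition must match."""
    return {"$or": list(conditions)}


def between(field, low, high):
    """Inclusive range, expressed as two single-operator conditions."""
    return all_of({field: {"$gte": low}}, {field: {"$lte": high}})


filters = all_of(
    {"project_code": {"$eq": "hydro-life-2024"}},
    between("hazard_score", 20, 45),
    any_of(
        {"region": {"$in": ["andes", "altiplano"]}},
        {"tags": {"$contains": {"value": "flood-risk"}}},
    ),
)

print(filters)
```

`between` keeps one operator per condition, matching the style of the examples in this cookbook, instead of placing `$gte` and `$lte` in the same dictionary.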

### 4. Index Important Fields

Frequently filtered fields benefit from proper indexing. Consider performance when adding many metadata fields.

## Running the Example

```bash
# Set your Morphik URI
export MORPHIK_URI="morphik://your-app:your-token@api.morphik.ai"

# Run your Python script with the code above
python your_script.py
```
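
The shell snippet exports `MORPHIK_URI`, while the client in section 1 is constructed from a hardcoded string. One way to connect the two is to read the variable inside the script. `morphik_uri_from_env` below is a hypothetical helper, and the client construction is left as a comment so the snippet runs without the SDK installed.

```python
import os


def morphik_uri_from_env(var_name="MORPHIK_URI"):
    """Return the Morphik URI from the environment, failing loudly if unset."""
    uri = os.environ.get(var_name)
    if not uri:
        raise RuntimeError(
            f"{var_name} is not set; export it before running the script"
        )
    return uri


# Usage (requires the `morphik` package and the exported variable):
#   from morphik import Morphik
#   client = Morphik(morphik_uri_from_env())
```

Failing early with a clear message avoids a less obvious connection error later in the script.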

## Related Cookbooks

- [Generating Completions with Retrieved Chunks](./generating-completions-with-retrieved-chunks) - Send filtered chunks to OpenAI
- [Python SDK Basic Operations](./python-basic-operations) - Core Morphik operations
