# Commit 7c8c395
Merge pull request #30 from edanalytics/feature/reference_validation

validate references

2 parents: 608e809 + b97e203

6 files changed: 417 additions & 82 deletions
## README.md

29 additions & 6 deletions

````diff
@@ -135,13 +135,36 @@ Like [selectors](#selectors), `keep-keys` and `drop-keys` are comma-separated li
 ```bash
 lightbeam validate -c path/to/config.yaml
 ```
-You may `validate` your JSONL before transmitting it. This checks that the payloads
-1. are valid JSON
-1. conform to the structure described in the Swagger documents for [resources](https://api.ed-fi.org/v5.3/api/metadata/data/v3/resourcess/swagger.json) and [descriptors](https://api.ed-fi.org/v5.3/api/metadata/data/v3/descriptors/swagger.json) fetched from your API
-1. contain valid descriptor values (fetched from your API and/or from descriptor values in your JSONL files)
-1. contain unique values for any natural key
+You may `validate` your JSONL before transmitting it. Configuration for `validate` goes in its own section of `lightbeam.yaml`:
+```yaml
+validate:
+  methods:
+    - schema # checks that payloads conform to the Swagger definitions from the API
+    - descriptors # checks that descriptor values are either locally-defined or exist in the remote API
+    - uniqueness # checks that local payloads are unique by the required property values
+    - references # checks that references resolve, either locally or in the remote API
+  # or
+  # methods: "*"
+```
+Default `validate.methods` are `["schema", "descriptors", "uniqueness"]` (not `references`; see below). In addition to the above methods, `lightbeam validate` will also (first) check that each payload is valid JSON.
+
+The `references` method can be slow, as a separate `GET` request may be made to your API for each reference. (This is why the method is disabled by default.) `lightbeam` tries to improve efficiency by:
+* batching requests and sending several concurrently (based on `connection.pool_size` in `lightbeam.yaml`)
+* caching responses and checking the cache before making another (potentially identical) request
+
+Even with these optimizations, checking `references` can easily take minutes for even relatively small amounts of data. Therefore `lightbeam.yaml` also accepts a further configuration option:
+```yaml
+validate:
+  references:
+    max_failures: 10 # stop testing after X failed payloads ("fail fast")
+```
+This is optional; if absent, references in every payload are checked, no matter how many fail.
+
+**Note:** Reference validation efficiency may be improved by first `lightbeam fetch`ing certain resources to have a local copy. `lightbeam validate` checks local JSONL files to resolve references before trying the remote API, and `fetch` retrieves many records per `GET`, so total runtime can be faster in this scenario. The downsides include
+* more data movement
+* `fetch`ed data becoming stale over time
+* needing to track which data is your own vs. was `fetch`ed (all the data must coexist in the `config.data_dir` to be discoverable by `lightbeam validate`)
 
-This command will not find invalid reference errors, but is helpful for finding payloads that are invalid JSON, are missing required fields, or have other structural issues.
 
 ## `send`
 ```bash
````
## lightbeam/api.py

2 additions & 2 deletions

```diff
@@ -53,10 +53,10 @@ def prepare(self):
         self.config["open_api_metadata_url"] = api_base["urls"]["openApiMetadata"]
 
         # load all endpoints in dependency-order
-        all_endpoints = self.get_sorted_endpoints()
+        self.lightbeam.all_endpoints = self.get_sorted_endpoints()
 
         # filter down to only selected endpoints
-        self.lightbeam.endpoints = self.apply_filters(all_endpoints)
+        self.lightbeam.endpoints = self.apply_filters(self.lightbeam.all_endpoints)
 
 
     def apply_filters(self, endpoints=[]):
```

## lightbeam/delete.py

2 additions & 1 deletion

```diff
@@ -86,7 +86,8 @@ async def do_deletes(self, endpoint):
             data = line.strip()
             # fill out the required fields from the data payload
             # (so we can search for matching records in the API)
-            params = util.interpolate_params(params_structure, data)
+            payload = json.loads(data)
+            params = util.interpolate_params(params_structure, payload)
 
             # check if we've posted this data before
             data_hash = hashlog.get_hash(data)
```

## lightbeam/lightbeam.py

25 additions & 4 deletions

```diff
@@ -174,12 +174,33 @@ def get_data_files_for_endpoint(self, endpoint):
         return file_list
 
     # Prunes the list of endpoints down to those for which .jsonl files exist in the config.data_dir
-    def get_endpoints_with_data(self, endpoints):
+    def get_endpoints_with_data(self):
         self.logger.debug("discovering data...")
         endpoints_with_data = []
-        for endpoint in endpoints:
-            if self.get_data_files_for_endpoint(endpoint):
-                endpoints_with_data.append(endpoint)
+        data_dir_list = os.listdir(self.config["data_dir"])
+        for data_dir_item in data_dir_list:
+            data_dir_item_path = os.path.join(self.config["data_dir"], data_dir_item)
+            if os.path.isfile(data_dir_item_path):
+                filename = os.path.basename(data_dir_item)
+                extension = filename.rsplit(".", 1)[-1]
+                filename_without_extension = filename.rsplit(".", 1)[0]
+                if extension in self.DATA_FILE_EXTENSIONS and filename_without_extension in self.all_endpoints:
+                    endpoints_with_data.append(filename_without_extension)
+            elif os.path.isdir(data_dir_item_path):
+                if data_dir_item in self.all_endpoints:
+                    has_data_file = False
+                    sub_dir_list = os.listdir(data_dir_item_path)
+                    for sub_dir_item in sub_dir_list:
+                        sub_dir_item_path = os.path.join(data_dir_item_path, sub_dir_item)
+                        if os.path.isfile(sub_dir_item_path):
+                            filename = os.path.basename(sub_dir_item)
+                            extension = filename.rsplit(".", 1)[-1]
+                            if extension in self.DATA_FILE_EXTENSIONS:
+                                has_data_file = True
+                                break
+                    if has_data_file:
+                        endpoints_with_data.append(data_dir_item)
         return endpoints_with_data
 
     # Returns a generator which produces json lines for a given endpoint based on relevant files in config.data_dir
```
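The new discovery logic (accept top-level `endpoint.jsonl` files, or `endpoint/` directories containing at least one data file) can be exercised standalone. This sketch assumes `DATA_FILE_EXTENSIONS` is `["jsonl"]` and uses a free function rather than lightbeam's actual class:

```python
import os
import tempfile

DATA_FILE_EXTENSIONS = ["jsonl"]  # assumption for this sketch

def endpoints_with_data(data_dir, all_endpoints):
    # Mirrors the diff above: match "students.jsonl" files, or
    # "students/" directories that contain at least one data file.
    found = []
    for item in os.listdir(data_dir):
        path = os.path.join(data_dir, item)
        if os.path.isfile(path):
            name, _, ext = item.rpartition(".")
            if ext in DATA_FILE_EXTENSIONS and name in all_endpoints:
                found.append(name)
        elif os.path.isdir(path) and item in all_endpoints:
            if any(f.rsplit(".", 1)[-1] in DATA_FILE_EXTENSIONS
                   for f in os.listdir(path)
                   if os.path.isfile(os.path.join(path, f))):
                found.append(item)
    return sorted(found)

# Usage: a flat file for "students", a directory of parts for "schools".
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "students.jsonl"), "w").close()
    os.mkdir(os.path.join(d, "schools"))
    open(os.path.join(d, "schools", "part1.jsonl"), "w").close()
    os.mkdir(os.path.join(d, "notes"))  # not an endpoint; ignored
    result = endpoints_with_data(d, ["students", "schools", "staffs"])
```

Endpoints with no data on disk (here, `staffs`) are pruned, which is why the method no longer needs an `endpoints` argument: it scans `config.data_dir` against `all_endpoints` directly.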

## lightbeam/util.py

5 additions & 1 deletion

```diff
@@ -28,12 +28,16 @@ def singularize_endpoint(endpoint):
     if endpoint[-3:]=="ies": return endpoint[0:-3] + "y"
     elif endpoint=="people": return "person"
     else: return endpoint[0:-1]
+def pluralize_endpoint(endpoint):
+    if endpoint[-1:]=="y": return endpoint[0:-1] + "ies"
+    elif endpoint=="person": return "people"
+    else: return endpoint+"s"
 
 # Takes a params structure and interpolates values from a (string) JSON payload
 def interpolate_params(params_structure, payload):
     params = {}
     for k,v in params_structure.items():
-        value = json.loads(payload)
+        value = payload.copy()
         for key in v.split('.'):
             value = value[key]
         params[k] = value
```
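The `delete.py` and `util.py` changes work together: the payload is parsed once with `json.loads`, and `interpolate_params` then walks the resulting dict instead of re-parsing the JSON string for every parameter. A minimal sketch of the fixed behavior (the trailing `return` is assumed, since the diff shows only part of the function, and the example keys are illustrative):

```python
import json

def interpolate_params(params_structure, payload):
    # payload is now a parsed dict; each value in params_structure is a
    # dotted path into it (previously the JSON string was re-parsed here
    # once per parameter).
    params = {}
    for k, v in params_structure.items():
        value = payload.copy()
        for key in v.split("."):
            value = value[key]
        params[k] = value
    return params

# Usage: resolve query params from a payload line, as do_deletes() does.
params_structure = {
    "studentUniqueId": "studentReference.studentUniqueId",
    "schoolYear": "schoolYearTypeReference.schoolYear",
}
data = '{"studentReference": {"studentUniqueId": "ab123"}, "schoolYearTypeReference": {"schoolYear": 2024}}'
payload = json.loads(data)  # parse once, as the delete.py fix does
params = interpolate_params(params_structure, payload)
```

Each dotted path is resolved by descending through nested dicts, so one parsed payload can feed any number of parameters.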
