# DataBridge Quality Core

`dbq` is a free, open-source data quality CLI that provides a set of tools to profile, validate, and test data in your data warehouses and databases.
It is designed to be flexible, fast, and easy to use, and to integrate seamlessly into your existing workflow.

## Features

- Data profiling: row counts, min/max/avg/stddev for numeric columns, null and blank counts, most frequent values, and JSON export
- Data quality checks

### Supported checks (v1)

- `row_count > 10`
- `null_count(col) == 0`
- `avg(col) <= 24.2`
- `max(col) < 1000`
- `min(col) == 0`
- `sum(col) > 0`
- `stddevPop(col) between 1 and 100_000_000`
- custom SQL checks via `raw_query(query = "...")`

## Supported databases

- [ClickHouse](https://clickhouse.com/)

## Usage

### Installation

Download the latest binaries from [GitHub Releases](https://github.com/DataBridge-Tech/dbq/releases).

### Configuration

Create a dbq configuration file (by default, dbq looks for `$HOME/.dbq.yaml` or `./dbq.yaml`). Alternatively,
you can point dbq at a configuration file at launch via the `--config` flag:

```bash
dbq --config /path/to/dbq.yaml import
```

```yaml
# dbq.yaml
version: "1"
datasources:
  - id: clickhouse
    type: clickhouse
    configuration:
      host: 0.0.0.0
      port: 19000
      username: default
      password: changeme
      database: default
    datasets:
      - nyc_taxi.trips_small
```

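For local experimentation, a ClickHouse instance matching the sample configuration can be started with Docker. The compose file below is a sketch using the official `clickhouse/clickhouse-server` image, with the ports and password aligned to the `dbq.yaml` example:

```yaml
# docker-compose.yaml -- local ClickHouse for trying out dbq
services:
  clickhouse:
    image: clickhouse/clickhouse-server
    ports:
      - "18123:8123"   # HTTP interface
      - "19000:9000"   # native protocol, matches host/port in dbq.yaml
    environment:
      CLICKHOUSE_PASSWORD: changeme
    ulimits:
      nofile:
        soft: 262144
        hard: 262144
```

After `docker compose up -d`, the `clickhouse` datasource from the sample config should be reachable, e.g. via `dbq ping clickhouse`.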

### Checks example

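The checks file format is not fully documented here. As an illustration only, the sketch below assumes a layout where each dataset lists check expressions from the v1 list above; the field names and the `nyc_taxi` column names are hypothetical and the exact schema may differ:

```yaml
# checks.yaml -- hypothetical layout; keys and column names are assumptions
version: "1"
datasets:
  - name: nyc_taxi.trips_small
    checks:
      - row_count > 10
      - null_count(pickup_datetime) == 0
      - min(trip_distance) == 0
      - raw_query(query = "SELECT count() FROM trips_small WHERE fare_amount < 0")
```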

### Commands

```bash
$ dbq help

dbq is a CLI tool for profiling data and running quality checks across various data sources

Usage:
  dbq [command]

Available Commands:
  check       Runs data quality checks defined in a configuration file against a datasource
  completion  Generate the autocompletion script for the specified shell
  help        Help about any command
  import      Connects to a data source and imports all available tables as datasets
  ping        Checks if the data source is reachable
  profile     Collects dataset's information and generates column statistics
  version     Prints dbq version

Flags:
      --config string   config file (default is $HOME/.dbq.yaml or ./dbq.yaml)
  -h, --help            help for dbq

Use "dbq [command] --help" for more information about a command.
```

### Quick start

- `dbq ping cnn-id`
- `dbq import cnn-id --filter "reporting.*" --cfg checks.yaml --update-cfg`
- `dbq check --cfg checks.yaml`
- `dbq --config /Users/artem/code/dbq/dbq.yaml import`
- `dbq profile --datasource cnn-id --dataset table_name`