cd data_pipeline
pip install -r requirements.txt
python3 pipeline.pybq show --format=prettyjson tli-sample-01:fetch.receipts | jq '.schema.fields' >> 'schema/receipts.sql'
bq mk --table tli-sample-01:fetch.receipts 'schema/receipts.sql'
bq show --format=prettyjson tli-sample-01:fetch.users | jq '.schema.fields' >> 'schema/users.sql'
bq mk --table tli-sample-01:fetch.users 'schema/users.sql'
bq show --format=prettyjson tli-sample-01:fetch.brands_temp | jq '.schema.fields' >> 'schema/brands.sql'
bq mk --table tli-sample-01:fetch.brands 'schema/brand.sql'SELECT ri.brand_code, COUNT(DISTINCT r.receipt_id) AS receipt_scanned
FROM `tli-sample-01.fetch.vw_receipt` r
INNER JOIN `tli-sample-01.fetch.vw_dim_date` d ON r.date_scanned = d.date
INNER JOIN `tli-sample-01.fetch.vw_receipt_item` ri USING (receipt_id)
WHERE
-- used earlier month for testing, should be updated to use CURRENT_DATE
-- d.month_start_date=DATE_TRUNC(CURRENT_DATE(), MONTH)
d.month_start_date = DATE_TRUNC('2021-01-13', MONTH)
AND ri.brand_code IS NOT NULL
GROUP BY ri.brand_code
ORDER BY receipt_scanned DESC
LIMIT 5;
/*
A generic approach also support Month Over Month Compare
*/
WITH top_scanned_brand_by_month AS (
SELECT d.month_start_date, ri.brand_code, COUNT(DISTINCT r.receipt_id) AS receipt_scanned
FROM `tli-sample-01.fetch.vw_receipt` r
INNER JOIN `tli-sample-01.fetch.vw_dim_date` d ON r.date_scanned = d.date
INNER JOIN `tli-sample-01.fetch.vw_receipt_item` ri USING (receipt_id)
WHERE ri.brand_code IS NOT NULL
GROUP BY d.month_start_date, ri.brand_code
)
SELECT brand_code, receipt_scanned
FROM top_scanned_brand_by_month
WHERE
-- month_start_date=DATE_TRUNC(CURRENT_DATE(), MONTH)
month_start_date = DATE_TRUNC('2021-01-13', MONTH)
ORDER BY receipt_scanned DESC
LIMIT 5;
How does the ranking of the top 5 brands by receipts scanned for the recent month compare to the ranking for the previous month?
WITH top_scanned_brand_by_month AS (
SELECT d.month_start_date, ri.brand_code, COUNT(DISTINCT r.receipt_id) AS receipt_scanned
FROM `tli-sample-01.fetch.vw_receipt` r
INNER JOIN `tli-sample-01.fetch.vw_dim_date` d ON r.date_scanned = d.date
INNER JOIN `tli-sample-01.fetch.vw_receipt_item` ri USING (receipt_id)
WHERE ri.brand_code IS NOT NULL
GROUP BY d.month_start_date, ri.brand_code
),
mom_comparison AS (
SELECT month_start_date, brand_code, receipt_scanned,
LAG(receipt_scanned) OVER (PARTITION BY brand_code ORDER BY month_start_date) AS previous_month_receipt_scanned
FROM top_scanned_brand_by_month
)
SELECT brand_code, receipt_scanned, previous_month_receipt_scanned,
(receipt_scanned - previous_month_receipt_scanned) AS receipt_scanned_change,
ROUND(SAFE_DIVIDE(receipt_scanned - previous_month_receipt_scanned, previous_month_receipt_scanned) * 100, 2) AS receipt_scanned_change_percent
FROM mom_comparison
WHERE
-- month_start_date = DATE_TRUNC(CURRENT_DATE(), MONTH) -- Use this for the current month
month_start_date = DATE_TRUNC('2021-02-13', MONTH)
ORDER BY receipt_scanned DESC
LIMIT 5;WITH recent_user_receipts AS (
SELECT r.user_id, r.receipt_id
FROM `tli-sample-01.fetch.vw_receipt` r
INNER JOIN `tli-sample-01.fetch.vw_user` u ON r.user_id = u.user_id
WHERE DATE(u.created_date) >= DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH)
)
SELECT ri.brand_code, COUNT(DISTINCT receipt_id) AS transaction_counts
FROM recent_user_receipts r
INNER JOIN `tli-sample-01.fetch.vw_receipt_item` ri USING(receipt_id)
WHERE ri.brand_code IS NOT NULL
GROUP BY ri.brand_code
ORDER BY transaction_counts DESC;- Missing
last_loginvalues in theuserstable for some records signUpSourceexists in the data file but is not in the schema provided.- Inconsistent casing for
rolevalues (schema uses uppercase, data uses lowercase), is this field case sensitive? - Duplicate
user_idvalues detected:
SELECT user_id, COUNT(*)
FROM `tli-sample-01.fetch.vw_user`
GROUP BY user_id
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC;- Receipts: Inconsistent
purchase_date(beforecreate_dateor aftermodify_date):
SELECT *
FROM (
SELECT GREATEST(
COALESCE(create_date, DATE '1900-01-01'),
COALESCE(modify_date, DATE '1900-01-01'),
COALESCE(finished_date, DATE '1900-01-01'),
COALESCE(purchase_date, DATE '1900-01-01'),
COALESCE(date_scanned, DATE '1900-01-01')
) AS latest_date, modify_date, * EXCEPT(modify_date)
FROM `tli-sample-01.fetch.vw_receipt`
)
WHERE latest_date > modify_date;- Brands: Missing valid
brand_codeand/orbarcodevalues.
Subject: Data Quality Findings & Next Steps for Optimization
Hi ,
I wanted to share an update on the recent data analysis we conducted and highlight a few key data quality issues that could impact our reporting accuracy and decision-making.
-
Inconsistent Dates:
Somepurchase_datevalues are way earlier than thecreate_date, and a few appear way after themodify_date, which suggests potential data entry errors or system issues. -
Missing and Invalid Values:
Number of records are missing validbrand_codeand/orbarcodevalues, which could affect any brand-level reporting or analysis. -
User Data Inconsistencies:
- Missing
last_logindata for several users, making it difficult to assess user engagement accurately. - The data file includes a
signUpSourcefield that’s not part of the official schema, raising questions about schema alignment. - The
rolefield has inconsistent casing (upper vs. lower case), which may cause discrepancies in role-based reporting. - Duplicate user IDs (e.g.,
'$oid=600056a3f7e5b011fce897b0') suggest potential issues with user record management.
- Missing
-
Deduplication Gaps:
The current ELT process doesn’t deduplicate records, which was not implemented in this exercise but should be addressed in a follow-up ticket to ensure data integrity.
-
Date Inconsistencies:
What are the expected business rules for date sequences (e.g., shouldpurchase_datealways be aftercreate_date)? -
Missing
brand_code/barcode:
Are there fallback rules or alternative data sources we can reference to fill these gaps? -
User Data Concerns:
- Should
signUpSourcebe officially added to the schema, or is it a legacy artifact? - Are there guidelines on handling duplicate user IDs, or should we consider a deduplication strategy based on other attributes?
- Should
- Business Context: Understanding how these fields are used in decision-making will help us prioritize fixes.
- Data Ownership: Clarification on who owns the source systems for these data points, so we can work with the right teams to address root causes.
- Historical Data: Access to historical data trends could help identify when these issues started and whether they’re systemic or isolated incidents.
- Deduplication at Scale: As data volume grows, deduplication logic will need to be optimized to handle large datasets efficiently, potentially leveraging partitioning strategies.
- Real-time Data Integrity: If we move toward more real-time data processing, we’ll need robust validation rules to catch these issues early without impacting performance.
- Schema Evolution: Handling schema changes gracefully, especially with new fields like
signUpSource, to prevent breaking downstream systems.
Let me know if you'd like to discuss these points further or if there’s additional context that could help us address these issues effectively.
Best regards,