📊 datalint

Lint your CSVs before they break your pipeline — locally, no Python, no API key.

A deterministic CLI that profiles every column of a CSV/TSV file and lints it for data-quality problems — ragged rows, type drift, missing values, duplicates, mixed date formats, numeric outliers, and optional schema violations — with a quality score, A–F grade and JSON/Markdown reports.


One-line summary

datalint reads your CSV/TSV files, infers each column's type, profiles the data, and reports every quality issue that would trip up an import or analysis — 100% locally, no API key, no server, and no dependency on a data library (the CSV parser is hand-rolled).

Why this project exists

CSV is the universal data format, and it's almost always messy. A file that "looks fine" in a spreadsheet hides:

  • Ragged rows: an unescaped comma silently shifts every column after it.
  • Type drift: a number column with a stray N/A or 1.2.3.
  • Mixed date formats: 2024-01-05 next to 01/06/2024 (which is which?).
  • Missing values, duplicates, stray whitespace, inconsistent casing (US vs us), and outliers that are really data-entry errors.

Eyeballing this doesn't scale, and feeding a 50k-row file to an LLM gets you a confident-but-wrong summary. You want a deterministic, repeatable audit you can run on every export and gate in CI. That's datalint.

Key features

  • 🧱 Dependency-free CSV/TSV parser — RFC 4180 quotes, embedded newlines, escaped quotes, CRLF/LF, plus automatic delimiter detection.
  • 🔎 Column profiling — inferred type, empty rate, distinct count, min/max/mean, and top values for every column.
  • 🚦 12 built-in checks — ragged rows, duplicate/empty headers, empty columns/rows, missing values, type drift, whitespace, mixed date formats, inconsistent casing, duplicate rows, and numeric outliers (Tukey/IQR).
  • 📐 Optional schema — required, type, enum, min/max, regex pattern, unique, not-null constraints per column.
  • 📊 Quality score + A–F grade, per file and overall.
  • 📄 JSON & Markdown export, colored console output, CI gate exit codes.
  • ⚙️ Config file, custom delimiter, headerless mode, per-rule severities.
  • 🔒 Runs entirely offline. Nothing is uploaded.
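
To make the column-profiling bullet concrete, here is a minimal sketch of how an empty rate, distinct count, numeric range and top values could be computed for one column. `ColumnProfile` and `profileColumn` are hypothetical names chosen for illustration; they are not datalint's actual API, and the numeric check via `Number()` is a simplifying assumption.

```typescript
// Illustrative only: hypothetical names, not datalint's real types or functions.
interface ColumnProfile {
  emptyRate: number;
  distinct: number;
  min?: number;
  max?: number;
  mean?: number;
  topValues: Array<[string, number]>;
}

function profileColumn(cells: string[], topN = 3): ColumnProfile {
  const counts = new Map<string, number>();
  const numbers: number[] = [];
  let empty = 0;

  for (const raw of cells) {
    const cell = raw.trim();
    if (cell === "") {
      empty++;
      continue;
    }
    counts.set(cell, (counts.get(cell) ?? 0) + 1);
    // Assumption: anything Number() accepts counts as numeric.
    const n = Number(cell);
    if (Number.isFinite(n)) numbers.push(n);
  }

  // Most frequent values first.
  const topValues = Array.from(counts.entries())
    .sort((a, b) => b[1] - a[1])
    .slice(0, topN);

  const profile: ColumnProfile = {
    emptyRate: cells.length === 0 ? 0 : empty / cells.length,
    distinct: counts.size,
    topValues,
  };

  // Report numeric stats only when every non-empty cell parsed as a number.
  const nonEmpty = cells.length - empty;
  if (numbers.length > 0 && numbers.length === nonEmpty) {
    profile.min = Math.min(...numbers);
    profile.max = Math.max(...numbers);
    profile.mean = numbers.reduce((sum, n) => sum + n, 0) / numbers.length;
  }
  return profile;
}

const sampleProfile = profileColumn(["10", "20", "", "20"]);
```

For the four sample cells, one of four is empty, two distinct values remain, and the numeric range is 10 to 20.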

Install

# run without installing
npx @didrod2539/datalint scan data.csv

# or install
npm install -g @didrod2539/datalint    # global CLI (provides `datalint`)
npm install -D @didrod2539/datalint    # project dev-dependency (for CI)

Requires Node ≥ 18. The package ships ESM and CJS builds with TypeScript types.

Quick start

datalint scan data.csv
data.csv  42/100 (F)  12 rows × 8 cols · comma
  • id integer · 11 distinct
  • email email · 12 distinct
  • country string · 5 distinct
  • signup_date date · 11 distinct
  • amount decimal · 10 distinct
  • note string · 3 distinct 75% empty
  ✗ 1 row(s) have a different column count than the header (8)
  ✗ Duplicate header "email" (columns 3 and 4)
  ⚠ Column "note" is 75.0% empty (9/12)
  ⚠ Column "amount" looks decimal but 1 value(s) don't match
  ⚠ Column "signup_date" mixes 2 date formats
  ⚠ 1 duplicate row(s)
  ℹ Column "country" has 1 value(s) that differ only by case

Overall  42/100 (F)  1 file(s), 12 row(s), 2 error(s), 4 warning(s), 1 info

CLI usage

datalint scan [...targets]    # analyze CSV/TSV files or directories
datalint report <input.json>  # re-render a saved JSON report as Markdown
datalint init                 # scaffold datalint.config.json (with a schema)
datalint --help
datalint --version

scan options:

| Option | Description |
| --- | --- |
| `--config <file>` | Path to a config file (otherwise auto-detected) |
| `--delimiter <char>` | `,` `\t` `;` `\|` or `auto` (default) |
| `--no-header` | Treat the first row as data (synthesize column names) |
| `--json <file>` | Write a JSON report |
| `--md <file>` | Write a Markdown report |
| `--min-score <n>` | Exit non-zero if the overall score < n (CI gate) |
| `--quiet` | Hide info-level issues in the console |
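
As a rough idea of what `auto` delimiter detection can involve, the sketch below counts each candidate delimiter outside quotes on the first few lines and prefers the one that appears consistently. This is an assumed heuristic for illustration, not necessarily how datalint decides; `detectDelimiter` and `countOutsideQuotes` are hypothetical names.

```typescript
// Illustrative heuristic only; datalint's real detection may differ.
const CANDIDATES = [",", "\t", ";", "|"];

function countOutsideQuotes(line: string, delimiter: string): number {
  let inQuotes = false;
  let count = 0;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (ch === '"') inQuotes = !inQuotes; // an escaped "" toggles twice
    else if (ch === delimiter && !inQuotes) count++;
  }
  return count;
}

function detectDelimiter(text: string, sampleLines = 10): string {
  // Simplification: splits on line breaks, so quoted embedded newlines are not handled.
  const lines = text
    .split(/\r?\n/)
    .filter((line) => line.length > 0)
    .slice(0, sampleLines);

  let best = ","; // fall back to comma when nothing matches
  let bestScore = 0;
  for (const candidate of CANDIDATES) {
    const counts = lines.map((line) => countOutsideQuotes(line, candidate));
    const first = counts[0] ?? 0;
    if (first === 0) continue;
    // Reward delimiters that occur the same number of times on many lines.
    const consistent = counts.filter((c) => c === first).length;
    const score = consistent * first;
    if (score > bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  return best;
}
```

Counting only outside quotes keeps a field such as `"x;y;z"` from voting for the semicolon.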

Point scan at a directory and it recursively finds every *.csv, *.tsv, and *.txt file.

Example result

Full reports for the bundled sample files are in examples/sample-report.md and examples/sample-report.json.

Configuration

Create datalint.config.json (or run datalint init):

{
  "delimiter": "auto",
  "hasHeader": true,
  "maxEmptyRate": 0.1,
  "enumThreshold": 20,
  "outlierIqrFactor": 1.5,
  "minScore": 80,
  "disableRules": [],
  "ruleSeverity": { "inconsistent-case": "warning" },
  "schema": [
    { "name": "id", "type": "integer", "required": true, "unique": true },
    { "name": "email", "type": "email", "notNull": true },
    { "name": "amount", "type": "decimal", "min": 0, "max": 100000 },
    { "name": "country", "enum": ["US", "CA", "UK"] }
  ]
}

| Field | Meaning |
| --- | --- |
| `delimiter` | `"auto"` or a literal delimiter |
| `hasHeader` | Whether row 1 is a header |
| `maxEmptyRate` | Warn columns above this empty rate (0–1) |
| `enumThreshold` | Max distinct values for casing checks to apply |
| `outlierIqrFactor` | Tukey IQR multiplier (1.5 default; 0 disables outliers) |
| `minScore` | CI gate threshold (overridable with `--min-score`) |
| `disableRules` | Rule ids to turn off |
| `ruleSeverity` | Override severity per rule id |
| `schema` | Optional per-column constraints |

Rule ids: ragged-rows, duplicate-headers, empty-column, empty-row, missing-values, type-drift, whitespace, mixed-date-formats, inconsistent-case, duplicate-rows, outliers, and schema-*.
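
To show what `outlierIqrFactor` controls, here is a minimal sketch of Tukey's IQR fence: values below Q1 − k·IQR or above Q3 + k·IQR are flagged, where k is the factor. The quartile method (linear interpolation) and the function names are assumptions made for this example; datalint's own rule may compute quartiles differently.

```typescript
// Illustrative sketch of Tukey fences; not datalint's actual implementation.
function quantile(sorted: number[], q: number): number {
  // Linear interpolation between the two nearest ranks (an assumed convention).
  const pos = (sorted.length - 1) * q;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

function tukeyOutliers(values: number[], factor = 1.5): number[] {
  // A factor of 0 (or less) disables the check, mirroring the config table.
  if (factor <= 0 || values.length < 4) return [];
  const sorted = values.slice().sort((a, b) => a - b);
  const q1 = quantile(sorted, 0.25);
  const q3 = quantile(sorted, 0.75);
  const iqr = q3 - q1;
  const low = q1 - factor * iqr;
  const high = q3 + factor * iqr;
  return values.filter((v) => v < low || v > high);
}
```

A larger factor widens the fences and flags fewer values, which is the knob to reach for when legitimate extremes are being reported.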

Real-world use cases

  1. Gate a data pipeline in CI. Add datalint scan ./exports --min-score 85 to your workflow. A nightly export that arrives with shifted columns or a broken date format fails the build instead of corrupting downstream tables.
  2. Vet a file before import. Before loading a vendor/marketing CSV into your warehouse, run datalint scan leads.csv --md audit.md and fix what it finds.
  3. Profile an unfamiliar dataset. Run datalint scan dataset.csv to instantly see each column's type, null rate, distinct count and ranges — a fast EDA pass without spinning up a notebook.

Programmatic API

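import { readFile, writeFile } from "node:fs/promises";
import { analyze, buildReport, toMarkdown } from "@didrod2539/datalint";

// Assumption: `content` is the raw file text, read here as UTF-8.
const content = await readFile("data.csv", "utf8");

// Analyze one file: score, grade, per-column profiles, and issues.
const ds = analyze({ source: "data.csv", content });
console.log(ds.score, ds.grade, ds.profiles, ds.issues);

// Combine the analyzed files into a report and save it as Markdown.
const report = buildReport([ds], { version: "0.1.0" });
await writeFile("report.md", toMarkdown(report));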

Roadmap

  • Excel (.xlsx) and Parquet input.
  • Cross-file referential checks (foreign keys across CSVs).
  • A --fix mode to auto-trim whitespace and normalize obvious issues.
  • An HTML report with charts.
  • A GitHub Action that comments data-quality on PRs.
  • Streaming mode for very large files.

FAQ

Does it send my data anywhere? No. datalint runs entirely on your machine — no API key, no telemetry, no uploads, no network calls.

Do I need to define a schema? No. datalint is useful with zero config — it infers column types and catches drift, duplicates, missing values, etc. A schema is optional for stricter checks.

How does it parse CSV? With a small, hand-rolled RFC 4180 parser (no external CSV library) that handles quoted fields, embedded delimiters/newlines, escaped quotes and CRLF/LF — so behavior is fully predictable. Delimiter is auto-detected or set via config.
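
For readers curious what such a parser looks like, the following is a small character-by-character sketch covering the cases named above (quoted fields, embedded delimiters and newlines, doubled quotes, CRLF/LF). It is written for this README as an illustration and is not a copy of datalint's source; `parseCsv` is a hypothetical name.

```typescript
// Illustrative RFC 4180-style parser; not datalint's actual code.
function parseCsv(text: string, delimiter = ","): string[][] {
  const rows: string[][] = [];
  let row: string[] = [];
  let field = "";
  let inQuotes = false;
  let i = 0;

  while (i < text.length) {
    const ch = text[i];
    if (inQuotes) {
      if (ch === '"') {
        if (text[i + 1] === '"') {
          // A doubled quote inside a quoted field is a literal quote.
          field += '"';
          i += 2;
          continue;
        }
        inQuotes = false; // closing quote
      } else {
        field += ch; // delimiters and newlines are plain data here
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === delimiter) {
      row.push(field);
      field = "";
    } else if (ch === "\n" || ch === "\r") {
      if (ch === "\r" && text[i + 1] === "\n") i++; // treat CRLF as one break
      row.push(field);
      field = "";
      rows.push(row);
      row = [];
    } else {
      field += ch;
    }
    i++;
  }

  // Flush the final record when the input does not end with a line break.
  if (field !== "" || row.length > 0) {
    row.push(field);
    rows.push(row);
  }
  return rows;
}
```

Once rows are parsed this way, a ragged-row check reduces to comparing each row's length with the header's.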

How are dates / types detected? By deterministic pattern matching (src/infer.ts). Type inference is conservative; ambiguous cells fall back to string. The date check recognizes common ISO and slash/dot formats and flags a column that mixes more than one.
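
As an illustration of pattern-based detection, the sketch below labels each cell by the shape of its date and reports every shape seen in a column; more than one shape would mean mixed formats. The three patterns and all names here are assumptions for this example, not the contents of src/infer.ts.

```typescript
// Illustrative patterns only; the formats datalint recognizes may differ.
const DATE_FORMATS: Array<[string, RegExp]> = [
  ["YYYY-MM-DD", /^\d{4}-\d{2}-\d{2}$/],
  ["NN/NN/YYYY", /^\d{1,2}\/\d{1,2}\/\d{4}$/],
  ["NN.NN.YYYY", /^\d{1,2}\.\d{1,2}\.\d{4}$/],
];

// Returns a shape label, or null when the cell does not look like a date.
function dateFormatOf(cell: string): string | null {
  const trimmed = cell.trim();
  for (const [label, pattern] of DATE_FORMATS) {
    if (pattern.test(trimmed)) return label;
  }
  return null;
}

// Collects the distinct date shapes present in a column, in first-seen order.
function dateFormatsIn(cells: string[]): string[] {
  const seen = new Set<string>();
  for (const cell of cells) {
    const label = dateFormatOf(cell);
    if (label !== null) seen.add(label);
  }
  return Array.from(seen);
}
```

The slash label is deliberately shape-only: a pattern cannot tell whether 01/06/2024 is January 6 or June 1, which is why a column mixing shapes is worth flagging.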

Is the quality score official? No — it's a transparent metric: each issue costs a base penalty plus an amount scaled by how much of the data it affects, weighted by severity (src/score.ts). Use it to track and gate quality.
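
The description above can be sketched as follows. Every constant here (severity weights, base penalty, scaled penalty, grade cutoffs) is invented for illustration and should not be read as the values in src/score.ts; the names `Issue`, `qualityScore` and `gradeOf` are hypothetical too.

```typescript
// Illustrative scoring sketch with made-up constants; see src/score.ts for the real ones.
type Severity = "error" | "warning" | "info";

interface Issue {
  severity: Severity;
  affectedFraction: number; // share of rows or cells affected, 0 to 1
}

const WEIGHT: Record<Severity, number> = { error: 3, warning: 2, info: 1 };
const BASE_PENALTY = 2; // charged once per issue
const SCALED_PENALTY = 10; // multiplied by the affected fraction

function qualityScore(issues: Issue[]): number {
  let penalty = 0;
  for (const issue of issues) {
    const fraction = Math.min(1, Math.max(0, issue.affectedFraction));
    penalty += WEIGHT[issue.severity] * (BASE_PENALTY + SCALED_PENALTY * fraction);
  }
  return Math.max(0, Math.round(100 - penalty));
}

// Hypothetical cutoffs for the A to F grade.
function gradeOf(score: number): string {
  if (score >= 90) return "A";
  if (score >= 80) return "B";
  if (score >= 70) return "C";
  if (score >= 60) return "D";
  return "F";
}
```

Under these assumed constants, a single error touching half the data costs 3 × (2 + 10 × 0.5) = 21 points.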

My valid data is being flagged — how do I silence it? Use disableRules, ruleSeverity, maxEmptyRate, or outlierIqrFactor in the config. Every heuristic is tunable.

Contributing

Contributions welcome! Each check is a small, self-contained rule in src/rules/. See CONTRIBUTING.md and the Code of Conduct.

git clone https://github.com/didrod205/datalint.git
cd datalint
npm install
npm test
npm run build
node dist/cli.js scan examples/messy.csv

License

MIT © datalint contributors

💖 Sponsor

datalint is free, MIT-licensed, and built in spare time. If it caught a bad export before it hit production, please consider supporting it.

Where your support goes: Excel/Parquet input, cross-file referential checks, a --fix autoclean mode, an HTML report, a PR-commenting GitHub Action, and fast issue responses.
