Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[Unreleased]

[0.6.0] - 2026-03-22

Added

  • ProcessToWriter API (457d341): New method that writes preprocessed output directly to an io.Writer, avoiding the output buffer allocation for large datasets
  • Sentinel Errors (0cd1558):
    • ErrNilWriter: Returned when a nil io.Writer is passed to ProcessToWriter
    • ErrNilReader: Returned when a nil io.Reader is passed to Process or ProcessToWriter, distinguished from ErrEmptyFile
  • Struct Tag Parse Cache (457d341): sync.Map-based cache keyed by (reflect.Type, strict) eliminates redundant tag parsing on repeated Process calls

Fixed

  • Slice Reset on Reuse (457d341): Process now calls SetLen(0) before appending, so reusing the same destination slice no longer carries over stale elements
  • Sentinel Error Wrapping (457d341, 3026a38): Errors from fileparser.Parse are now wrapped with ErrEmptyFile / ErrUnsupportedFileType so errors.Is works correctly
  • Cross-Field Validation (457d341): Cross-field validators now use a preprocessed field-value map instead of column indices, fixing the "target field ... not found" error raised when the column is absent from the input but filled by prep:"default=..."
  • wrapParseError Precision (3026a38, 0cd1558): Replaced broad substring matching with exact message matching against fileparser v0.5.1 error strings; separated "reader cannot be nil" from ErrEmptyFile into its own ErrNilReader

Changed

  • CI Hardening (457d341, 3026a38): Added -race flag and govulncheck (pinned v1.1.4) to CI workflow
  • Go Version (785493a, 3026a38): Bumped minimum Go version to 1.25; updated golang.org/x/net to v0.51.0 to resolve GO-2026-4559
  • Documentation (c685c44):
    • Rewrote "Before Using fileprep" as concise "Gotchas" section across all 7 README languages
    • Added ProcessToWriter to README, doc.go, and example_test.go
    • Fixed CONTRIBUTING.md reference to non-existent .cursorrules / .github/copilot-instructions.md
    • Added version support table to SECURITY.md

[0.5.0] - 2026-02-15

Added

  • JSON/JSONL Format Support: First-class support for JSON (.json) and JSONL (.jsonl) file formats
    • JSON arrays are parsed into individual rows; each element becomes a row with a single "data" column
    • JSONL files are parsed line-by-line; each line becomes a row with a single "data" column
    • JSON/JSONL output is always compact JSONL (one JSON value per line, no header)
    • Pretty-printed JSON input is automatically compacted via json.Compact
  • 18 New FileType Constants: 2 base types + 16 compressed variants
    • JSON: FileTypeJSON, FileTypeJSONGZ, FileTypeJSONBZ2, FileTypeJSONXZ, FileTypeJSONZSTD, FileTypeJSONZLIB, FileTypeJSONSNAPPY, FileTypeJSONS2, FileTypeJSONLZ4
    • JSONL: FileTypeJSONL, FileTypeJSONLGZ, FileTypeJSONLBZ2, FileTypeJSONLXZ, FileTypeJSONLZSTD, FileTypeJSONLZLIB, FileTypeJSONLSNAPPY, FileTypeJSONLS2, FileTypeJSONLLZ4
  • Sentinel Errors for JSON Integrity:
    • ErrInvalidJSONAfterPrep: Hard error when preprocessing (e.g., truncate) destroys JSON structure
    • ErrEmptyJSONOutput: Hard error when all rows become empty after preprocessing, resulting in 0-line JSONL output
  • omitempty Validator:
    • Added validate:"omitempty,..." support to skip subsequent validators when a field value is empty
    • Useful for optional fields such as omitempty,email
  • Processor Options:
    • WithStrictTagParsing(): Strict mode that returns an error for invalid tag arguments
    • WithValidRowsOnly(): Output and destination slice include only rows that passed all validations
  • Comprehensive Tests: Unit and integration tests for JSON/JSONL processing including pretty-printed input, compressed variants, validation, and error paths
    • Added tests for conditional cross-field validators (required_if, required_unless, required_with, required_without)
    • Added tests for type-conversion paths (setFieldValue) across string/int/uint/float/bool
    • Added end-to-end tests for XLSX and Parquet pipelines
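The JSON-to-JSONL behavior described above (array elements become rows, output is compact JSONL) can be sketched with the standard library alone. The `toJSONL` helper is hypothetical and only shows the json.Compact step; it is not fileprep's implementation and skips preprocessing and validation entirely.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// toJSONL converts a JSON array into compact JSONL: one JSON value per line,
// no header.
func toJSONL(input []byte) (string, error) {
	var elems []json.RawMessage
	if err := json.Unmarshal(input, &elems); err != nil {
		return "", fmt.Errorf("input is not a JSON array: %w", err)
	}
	var out bytes.Buffer
	for _, e := range elems {
		// json.Compact strips insignificant whitespace from each element.
		if err := json.Compact(&out, e); err != nil {
			return "", err
		}
		out.WriteByte('\n')
	}
	return out.String(), nil
}

func main() {
	pretty := []byte(`[
  {"id": 1, "name": "alice"},
  {"id": 2, "tags": ["a", "b"]}
]`)
	out, err := toJSONL(pretty)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```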

Changed

  • Tag Parser Refactor:
    • Refactored prep / validate tag parsing to a registry-based implementation for easier extension and maintenance
    • Improved error reporting for invalid tag argument formats in strict mode
  • Output Behavior:
    • Added optional valid-row filtering behavior via WithValidRowsOnly() while preserving row/error statistics in ProcessResult
  • Dependency Update: Updated fileparser from v0.4.0 to v0.5.1 for JSON/JSONL parsing support
  • Documentation:
    • Updated README content with clearer pre-use notes and conditional-validator examples
    • Replaced internal CLAUDE.md reference in package docs with pkg.go.dev link
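As a rough picture of what a registry-based validate tag parser with omitempty support might look like, consider the sketch below. The `registry` map, the `validate` function, and the simplified rule implementations are all hypothetical; fileprep's real registry and validators are more complete and may be structured differently.

```go
package main

import (
	"fmt"
	"strings"
)

// validatorFunc reports whether value satisfies a rule; arg is the text after "=".
type validatorFunc func(value, arg string) bool

// registry maps tag names to implementations, so adding a rule is one map entry.
var registry = map[string]validatorFunc{
	"required": func(v, _ string) bool { return v != "" },
	"email": func(v, _ string) bool {
		// Deliberately simplistic check, for illustration only.
		at := strings.Index(v, "@")
		return at > 0 && at < len(v)-1
	},
	"oneof": func(v, arg string) bool {
		for _, opt := range strings.Fields(arg) {
			if v == opt {
				return true
			}
		}
		return false
	},
}

// validate runs each comma-separated rule in tag against value and returns
// the name of the first failing rule, or "" when every rule passes.
func validate(tag, value string) string {
	for _, rule := range strings.Split(tag, ",") {
		name, arg, _ := strings.Cut(rule, "=")
		if name == "omitempty" {
			if value == "" {
				return "" // empty value: skip all subsequent validators
			}
			continue
		}
		fn, ok := registry[name]
		if !ok {
			continue // unknown rules are ignored in this sketch
		}
		if !fn(value, arg) {
			return name
		}
	}
	return ""
}

func main() {
	fmt.Printf("%q\n", validate("omitempty,email", ""))     // ""
	fmt.Printf("%q\n", validate("omitempty,email", "nope")) // "email"
	fmt.Printf("%q\n", validate("required,email", ""))      // "required"
}
```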

[0.4.0] - 2025-12-11

Added

  • New Compression Formats: Added support for 4 new compression formats via fileparser v0.2.0
    • zlib (.z) - Standard DEFLATE compression
    • snappy (.snappy) - Google's high-speed compression
    • s2 (.s2) - Snappy-compatible extension with faster compression
    • lz4 (.lz4) - Extremely fast compression
  • New FileType Constants: Added 20 new FileType aliases for new compression format combinations
    • CSV: FileTypeCSVZLIB, FileTypeCSVSNAPPY, FileTypeCSVS2, FileTypeCSVLZ4
    • TSV: FileTypeTSVZLIB, FileTypeTSVSNAPPY, FileTypeTSVS2, FileTypeTSVLZ4
    • LTSV: FileTypeLTSVZLIB, FileTypeLTSVSNAPPY, FileTypeLTSVS2, FileTypeLTSVLZ4
    • Parquet: FileTypeParquetZLIB, FileTypeParquetSNAPPY, FileTypeParquetS2, FileTypeParquetLZ4
    • Excel: FileTypeXLSXZLIB, FileTypeXLSXSNAPPY, FileTypeXLSXS2, FileTypeXLSXLZ4
  • Integration Tests: Added comprehensive tests for new compression formats (CSV, TSV, LTSV)

Changed

  • Dependency Update: Updated to fileparser v0.2.0 for new compression format support
  • Documentation: Updated all README files (en, ja, es, fr, ko, ru, zh-cn) with new compression formats

[0.3.0] - 2025-12-11

Changed

  • Migrated from github.com/nao1215/filesql/parser to github.com/nao1215/fileparser for file parsing
  • Updated all internal references from parser. to fileparser.

Removed

  • Dependency on github.com/nao1215/filesql

[0.2.0] - 2025-12-08

Added

  • Conditional Required Validators (9caa374): New validators for conditional field requirements
    • required_if: Required if another field equals a specific value
    • required_unless: Required unless another field equals a specific value
    • required_with: Required if another field is present
    • required_without: Required if another field is not present
  • Date/Time Validator (9caa374): datetime validator with custom Go layout format support
  • Phone Number Validator (9caa374): e164 validator for E.164 international phone number format
  • Geolocation Validators (9caa374): latitude (-90 to 90) and longitude (-180 to 180) validators
  • UUID Variant Validators (9caa374): uuid3, uuid4, uuid5 for specific UUID versions, and ulid for ULID format
  • Hexadecimal and Color Validators (9caa374): hexadecimal, hexcolor, rgb, rgba, hsl, hsla validators
  • MAC Address Validator (9caa374): mac validator for MAC address format
  • Advanced Examples (f771f9b): Comprehensive documentation examples
    • Complex Data Preprocessing and Validation example with real-world messy data
    • Detailed Error Reporting example demonstrating validation error handling
  • Benchmark Tests (607b868): Comprehensive benchmark suite for performance testing
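The four conditional required validators can be summarized as small predicates over a row. The sketch below assumes that "present" means a non-empty string value and represents a row as a field-name map; both are assumptions for illustration, and the function names are hypothetical rather than fileprep's API.

```go
package main

import "fmt"

// row maps field names to their (preprocessed) string values.
type row map[string]string

// present treats a non-empty value as present (an assumption of this sketch).
func present(r row, name string) bool { return r[name] != "" }

// requiredIf: field must be present when other equals value.
func requiredIf(r row, field, other, value string) bool {
	return r[other] != value || present(r, field)
}

// requiredUnless: field must be present unless other equals value.
func requiredUnless(r row, field, other, value string) bool {
	return r[other] == value || present(r, field)
}

// requiredWith: field must be present when other is present.
func requiredWith(r row, field, other string) bool {
	return !present(r, other) || present(r, field)
}

// requiredWithout: field must be present when other is not present.
func requiredWithout(r row, field, other string) bool {
	return present(r, other) || present(r, field)
}

func main() {
	r := row{"country": "JP", "postal_code": "", "email": "a@example.com", "phone": ""}

	fmt.Println(requiredIf(r, "postal_code", "country", "JP"))     // false
	fmt.Println(requiredUnless(r, "postal_code", "country", "JP")) // true
	fmt.Println(requiredWith(r, "phone", "email"))                 // false
	fmt.Println(requiredWithout(r, "phone", "email"))              // true
}
```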

Changed

  • Performance Improvement (PR #6, 607b868): ~10% faster processing through optimized preprocessing and validation pipeline
  • Documentation (f771f9b): Complete update of all README translations (Japanese, Spanish, French, Korean, Russian, Chinese) to match the English version with full feature documentation

[0.1.0] - 2025-12-07

Added

  • Initial Release: First stable release of the fileprep library
  • File Format Support: CSV, TSV, LTSV, Parquet, Excel (.xlsx) with compression support (gzip, bzip2, xz, zstd)
  • Preprocessing Tags (prep): Comprehensive struct tag-based preprocessing
    • Basic preprocessors: trim, ltrim, rtrim, lowercase, uppercase, default
    • String transformation: replace, prefix, suffix, truncate, strip_html, strip_newline, collapse_space
    • Character filtering: remove_digits, remove_alpha, keep_digits, keep_alpha, trim_set
    • Padding: pad_left, pad_right
    • Advanced: normalize_unicode, nullify, coerce, fix_scheme, regex_replace
  • Validation Tags (validate): Compatible with go-playground/validator syntax
    • Basic validators: required, boolean
    • Character type validators: alpha, alphaunicode, alphaspace, alphanumeric, alphanumunicode, numeric, number, ascii, printascii, multibyte
    • Numeric comparison: eq, ne, gt, gte, lt, lte, min, max, len
    • String validators: oneof, lowercase, uppercase, eq_ignore_case, ne_ignore_case
    • String content: startswith, startsnotwith, endswith, endsnotwith, contains, containsany, containsrune, excludes, excludesall, excludesrune
    • Format validators: email, uri, url, http_url, https_url, url_encoded, datauri, uuid
    • Network validators: ip_addr, ip4_addr, ip6_addr, cidr, cidrv4, cidrv6, fqdn, hostname, hostname_rfc1123, hostname_port
    • Cross-field validators: eqfield, nefield, gtfield, gtefield, ltfield, ltefield, fieldcontains, fieldexcludes
  • Name-Based Column Binding: Automatic snake_case conversion with name tag override
  • filesql Integration: Returns io.Reader for direct use with filesql
  • Detailed Error Reporting: Row and column information for each validation error

Technical Details

  • Memory Optimization: In-place record processing, pre-allocated output buffers, streaming parsers for CSV/TSV/LTSV
  • XLSX Streaming: Uses excelize streaming API to reduce memory usage for large files
  • Parquet Buffer Reuse: Reusable row buffer across row groups to reduce allocations
  • Format-Specific Limitations:
    • XLSX: Only the first sheet is processed
    • LTSV: Maximum line size is 10MB