All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- `ProcessToWriter` API (457d341): New method that writes preprocessed output directly to an `io.Writer`, avoiding the output buffer allocation for large datasets
- Sentinel Errors (0cd1558):
  - `ErrNilWriter`: Returned when a nil `io.Writer` is passed to `ProcessToWriter`
  - `ErrNilReader`: Returned when a nil `io.Reader` is passed to `Processor.ProcessToWriter`, distinguished from `ErrEmptyFile`
- Struct Tag Parse Cache (457d341): `sync.Map`-based cache keyed by `(reflect.Type, strict)` eliminates redundant tag parsing on repeated `Process` calls
- Slice Reset on Reuse (457d341): `Process` now calls `SetLen(0)` before appending, so reusing the same destination slice no longer carries over stale elements
- Sentinel Error Wrapping (457d341, 3026a38): Errors from `fileparser.Parse` are now wrapped with `ErrEmptyFile`/`ErrUnsupportedFileType` so `errors.Is` works correctly
- Cross-Field Validation (457d341): Cross-field validators now use a preprocessed field-value map instead of column indices, fixing `target field ... not found` when the column is absent but filled by `prep:"default=..."`
- `wrapParseError` Precision (3026a38, 0cd1558): Replaced broad substring matching with exact message matching against fileparser v0.5.1 error strings; separated "reader cannot be nil" from `ErrEmptyFile` into its own `ErrNilReader`
- CI Hardening (457d341, 3026a38): Added `-race` flag and `govulncheck` (pinned v1.1.4) to CI workflow
- Go Version (785493a, 3026a38): Bumped minimum Go version to 1.25; updated `golang.org/x/net` to v0.51.0 to resolve GO-2026-4559
- Documentation (c685c44):
  - Rewrote "Before Using fileprep" as a concise "Gotchas" section across all 7 README languages
  - Added `ProcessToWriter` to README, `doc.go`, and `example_test.go`
  - Fixed CONTRIBUTING.md reference to non-existent `.cursorrules`/`.github/copilot-instructions.md`
  - Added version support table to SECURITY.md
- JSON/JSONL Format Support: First-class support for JSON (.json) and JSONL (.jsonl) file formats
  - JSON arrays are parsed into individual rows; each element becomes a row with a single `"data"` column
  - JSONL files are parsed line-by-line; each line becomes a row with a single `"data"` column
  - JSON/JSONL output is always compact JSONL (one JSON value per line, no header)
  - Pretty-printed JSON input is automatically compacted via `json.Compact`
- 18 New FileType Constants: 2 base types + 16 compressed variants
  - JSON: `FileTypeJSON`, `FileTypeJSONGZ`, `FileTypeJSONBZ2`, `FileTypeJSONXZ`, `FileTypeJSONZSTD`, `FileTypeJSONZLIB`, `FileTypeJSONSNAPPY`, `FileTypeJSONS2`, `FileTypeJSONLZ4`
  - JSONL: `FileTypeJSONL`, `FileTypeJSONLGZ`, `FileTypeJSONLBZ2`, `FileTypeJSONLXZ`, `FileTypeJSONLZSTD`, `FileTypeJSONLZLIB`, `FileTypeJSONLSNAPPY`, `FileTypeJSONLS2`, `FileTypeJSONLLZ4`
- Sentinel Errors for JSON Integrity:
  - `ErrInvalidJSONAfterPrep`: Hard error when preprocessing (e.g., `truncate`) destroys JSON structure
  - `ErrEmptyJSONOutput`: Hard error when all rows become empty after preprocessing, resulting in 0-line JSONL output
- `omitempty` Validator:
  - Added `validate:"omitempty,..."` support to skip subsequent validators when a field value is empty
  - Useful for optional fields such as `omitempty,email`
- Processor Options:
  - `WithStrictTagParsing()`: Strict mode that returns an error for invalid tag arguments
  - `WithValidRowsOnly()`: Output and destination slice include only rows that passed all validations
- Comprehensive Tests: Unit and integration tests for JSON/JSONL processing including pretty-printed input, compressed variants, validation, and error paths
  - Added tests for conditional cross-field validators (`required_if`, `required_unless`, `required_with`, `required_without`)
  - Added tests for type-conversion paths (`setFieldValue`) across string/int/uint/float/bool
  - Added end-to-end tests for XLSX and Parquet pipelines
- Tag Parser Refactor:
  - Refactored `prep`/`validate` tag parsing to a registry-based implementation for easier extension and maintenance
  - Improved error reporting for invalid tag argument formats in strict mode
- Output Behavior:
  - Added optional valid-row filtering via `WithValidRowsOnly()` while preserving row/error statistics in `ProcessResult`
- Dependency Update: Updated fileparser from v0.4.0 to v0.5.1 for JSON/JSONL parsing support
- Documentation:
  - Updated README content with clearer pre-use notes and conditional-validator examples
  - Replaced internal `CLAUDE.md` reference in package docs with pkg.go.dev link
- New Compression Formats: Added support for 4 new compression formats via fileparser v0.2.0
- zlib (.z) - Standard DEFLATE compression
- snappy (.snappy) - Google's high-speed compression
- s2 (.s2) - Improved Snappy extension, faster
- lz4 (.lz4) - Extremely fast compression
- New FileType Constants: Added 20 new FileType aliases for new compression format combinations
  - CSV: `FileTypeCSVZLIB`, `FileTypeCSVSNAPPY`, `FileTypeCSVS2`, `FileTypeCSVLZ4`
  - TSV: `FileTypeTSVZLIB`, `FileTypeTSVSNAPPY`, `FileTypeTSVS2`, `FileTypeTSVLZ4`
  - LTSV: `FileTypeLTSVZLIB`, `FileTypeLTSVSNAPPY`, `FileTypeLTSVS2`, `FileTypeLTSVLZ4`
  - Parquet: `FileTypeParquetZLIB`, `FileTypeParquetSNAPPY`, `FileTypeParquetS2`, `FileTypeParquetLZ4`
  - Excel: `FileTypeXLSXZLIB`, `FileTypeXLSXSNAPPY`, `FileTypeXLSXS2`, `FileTypeXLSXLZ4`
- Integration Tests: Added comprehensive tests for new compression formats (CSV, TSV, LTSV)
- Dependency Update: Updated to fileparser v0.2.0 for new compression format support
- Documentation: Updated all README files (en, ja, es, fr, ko, ru, zh-cn) with new compression formats
- Migrated from `github.com/nao1215/filesql/parser` to `github.com/nao1215/fileparser` for file parsing
- Updated all internal references from `parser.` to `fileparser.`
- Removed the dependency on `github.com/nao1215/filesql`
- Conditional Required Validators (9caa374): New validators for conditional field requirements
  - `required_if`: Required if another field equals a specific value
  - `required_unless`: Required unless another field equals a specific value
  - `required_with`: Required if another field is present
  - `required_without`: Required if another field is not present
- Date/Time Validator (9caa374): `datetime` validator with custom Go layout format support
- Phone Number Validator (9caa374): `e164` validator for E.164 international phone number format
- Geolocation Validators (9caa374): `latitude` (-90 to 90) and `longitude` (-180 to 180) validators
- UUID Variant Validators (9caa374): `uuid3`, `uuid4`, `uuid5` for specific UUID versions, and `ulid` for ULID format
- Hexadecimal and Color Validators (9caa374): `hexadecimal`, `hexcolor`, `rgb`, `rgba`, `hsl`, `hsla` validators
- MAC Address Validator (9caa374): `mac` validator for MAC address format
- Advanced Examples (f771f9b): Comprehensive documentation examples
- Complex Data Preprocessing and Validation example with real-world messy data
- Detailed Error Reporting example demonstrating validation error handling
- Benchmark Tests (607b868): Comprehensive benchmark suite for performance testing
- Performance Improvement (PR #6, 607b868): ~10% faster processing through optimized preprocessing and validation pipeline
- Documentation (f771f9b): Complete update of all README translations (Japanese, Spanish, French, Korean, Russian, Chinese) to match the English version with full feature documentation
- Initial Release: First stable release of fileprep library
- File Format Support: CSV, TSV, LTSV, Parquet, Excel (.xlsx) with compression support (gzip, bzip2, xz, zstd)
- Preprocessing Tags (`prep`): Comprehensive struct tag-based preprocessing
  - Basic preprocessors: `trim`, `ltrim`, `rtrim`, `lowercase`, `uppercase`, `default`
  - String transformation: `replace`, `prefix`, `suffix`, `truncate`, `strip_html`, `strip_newline`, `collapse_space`
  - Character filtering: `remove_digits`, `remove_alpha`, `keep_digits`, `keep_alpha`, `trim_set`
  - Padding: `pad_left`, `pad_right`
  - Advanced: `normalize_unicode`, `nullify`, `coerce`, `fix_scheme`, `regex_replace`
- Validation Tags (`validate`): Compatible with go-playground/validator syntax
  - Basic validators: `required`, `boolean`
  - Character type validators: `alpha`, `alphaunicode`, `alphaspace`, `alphanumeric`, `alphanumunicode`, `numeric`, `number`, `ascii`, `printascii`, `multibyte`
  - Numeric comparison: `eq`, `ne`, `gt`, `gte`, `lt`, `lte`, `min`, `max`, `len`
  - String validators: `oneof`, `lowercase`, `uppercase`, `eq_ignore_case`, `ne_ignore_case`
  - String content: `startswith`, `startsnotwith`, `endswith`, `endsnotwith`, `contains`, `containsany`, `containsrune`, `excludes`, `excludesall`, `excludesrune`
  - Format validators: `email`, `uri`, `url`, `http_url`, `https_url`, `url_encoded`, `datauri`, `uuid`
  - Network validators: `ip_addr`, `ip4_addr`, `ip6_addr`, `cidr`, `cidrv4`, `cidrv6`, `fqdn`, `hostname`, `hostname_rfc1123`, `hostname_port`
  - Cross-field validators: `eqfield`, `nefield`, `gtfield`, `gtefield`, `ltfield`, `ltefield`, `fieldcontains`, `fieldexcludes`
- Name-Based Column Binding: Automatic snake_case conversion with `name` tag override
- filesql Integration: Returns an `io.Reader` for direct use with filesql
- Detailed Error Reporting: Row and column information for each validation error
- Memory Optimization: In-place record processing, pre-allocated output buffers, streaming parsers for CSV/TSV/LTSV
- XLSX Streaming: Uses excelize streaming API to reduce memory usage for large files
- Parquet Buffer Reuse: Reusable row buffer across row groups to reduce allocations
- Format-Specific Limitations:
- XLSX: Only the first sheet is processed
- LTSV: Maximum line size is 10MB