@@ -75,141 +75,101 @@ scalability of the generated data files.
75 75 Introduction
76 76 ------------
77 77
78- ### Background
79-
80-
81- Data are the digital representations of our world. Generating and processing
82- data are essential parts of our daily lives, and form the very
83- foundations of modern sciences, information technologies, businesses, and
84- interactions between global societies.
85-
86- Data can take many forms. Some data can be represented by simple scalars.
87- Others have complex forms with hierarchical structures. An efficient
88- representation of data also strongly depends upon application-specific needs. In some
89- cases, plain text files with white-space delimited fields are sufficient,
90- however, for performance-sensitive applications, binary formats can
91- significantly reduce loading and processing time. The ability to store and
92- parse complex data structures is particularly important to the scientific
93- community.
94-
95- It is a challenging task to encapsulate a wide variety of data forms within a
96- single data interchange format. There have been many previous efforts in
97- designing a general-purpose data storage specification. Some of them have
98- become popular choices in one or multiple applications.
99- Extensible Markup Language (XML), for example, is ubiquitously used as a
100- data-exchange format, but the verbosity of the syntax, moderate complexity for
101- parsing, impeded readability and inefficiency in expressing structured data
102- indicate room for improvement. Comma Separated Value (CSV), a rather
103- simple plain-text format, is used among some applications to exchange tabular
104- data structures (such as spreadsheets); yet, its inability to encode more
105- complex data forms, lack of flexibility and data precision restrict it to specific
106- applications.
107-
108- The Hierarchical Data Format (HDF) is a format targeting the broad needs of
109- the scientific communities. It has an extensible hierarchical data model with a
110- large capacity to represent complex binary data. However, to effectively use
111- HDF requires skillful implementation and an in-depth understanding of the
112- complex underlying programming interfaces. For small projects with non-critical performance needs,
113- using an advanced data format such as HDF may require additional development
114- and maintenance efforts. Similar arguments can be made for the Common Data
115- Format (CDF) or Network Common Data Format (netCDF) that are partly derived
116- from HDF. In addition, the MATLAB mat-file format and Python pickle format
117- have also been used among the research communities. However, their usage
118- has been largely limited to the respective programming environments.
119-
120- ### JSON and Binary JData
121-
122- The JavaScript Object Notation (JSON) format is a text-based data format that
123- is known for its complex data storage capability, excellent portability and
124- human-readability. JSON has been widely adopted in modern web applications, and is
125- becoming popular among local/native applications. The key advantages of JSON
126- include:
127-
128- * **simplicity**: JSON data are composed of lists of `"name":value` pairs;
129- such simple syntax greatly eases the use and parsing of the data file; free
130- JSON-encoders and decoders are widely available for most popular programming
131- languages;
132- * **human-readability**: the text-based nature of JSON and its clean,
133- easy-to-read format make it intuitively readable without in-depth knowledge of
134- the format itself;
135- * **hierarchical data support**: JSON has a tree-like data storage paradigm
136- which has the capacity to support complex hierarchical data structures;
137- there is no inherent data size limit imposed by the format itself;
138- * **web-readiness**: because JSON can be readily parsed by JavaScript, most
139- JSON-encoded data files can be directly invoked (inline or loaded from remote
140- sites) by a JavaScript based web-application.
141-
142- JSON also has limitations. JSON's `"value"` fields are weakly-typed. They only
143- support strings, numbers, and Boolean types, and lack the fine-granularity
144- to represent various numerical types of different byte-lengths (in C-language,
145- for example, short, int, long int, float, double, long double) and their signs
146- (signed and unsigned). Because JSON is a text-based format, the size of the
147- data file can be significantly larger than a respective binary file and requires
148- additional conversion when used in an application. This introduces overhead in
149- both storage and processing.
150-
151- The [Binary JData (BJData) format](https://github.com/NeuroJSON/bjdata) was derived
152- from the Universal Binary JSON (UBJSON) format, which is one of the binary counterparts
153- to the JSON format. It specifically addresses the above mentioned limitations,
154- yet adheres to a simple grammar similar to the text-based JSON. Compared to other
155- binary JSON-like formats, such as BSON (Binary JSON, https://bson.org), CBOR (Concise
156- Binary Object Representation, [RFC 7049], https://cbor.io) and MessagePack
157- (https://msgpack.org), BJData and UBJSON files are **"quasi-human-readable"** -
158- a unique capability that is absent from almost all other binary formats.
159- Compared to UBJSON, BJData specification supports extended binary data types (such
160- as unsigned integers and half-precision floating-point numbers) as well as
161- optimized N-dimensional array format. The extended data constructs also allow
162- a BJData file to store binary arrays larger than 4 GB in size, which is not
163- currently possible with MessagePack (maximum data record size is limited to
164- 4 GB) and BSON (maximum total file size is 4 GB).
165-
166- With ease-of-use, superior portability and parser availability, JSON and
167- BJData/UBJSON have the potential to serve as main-stream data storage and
168- interchange formats for general needs, especially for the storage and interchange
169- of scientific data. A combination of JSON and its binary counterpart offers features
170- that are not currently available within existing data storage schemes. Although
171- they do not provide all of the advanced features found in more sophisticated
172- formats, their greatly simplified encoding and decoding strategies permit
173- efficient data sharing among general audiences.
174-
175- ### JData specification overview
176-
177- JData is a specification for storing, exchanging and processing general-purpose
178- data that are commonly encountered in the information technology (IT) industries and
179- research communities. It has a text/UNICODE format derived from the JSON
180- specification and a binary format derived from the BJData/UBJSON specification. JData
181- is designed to represent commonly used data structures, including arrays,
182- structures, trees and graphs. A round-trip conversion is defined between the
183- text and binary versions of JData documents.
184-
185- The inception of this specification started in 2011 as part of the development
186- of the [JSONLab Toolbox](https://neurojson.org/jsonlab/) - a popular
187- open-source MATLAB/GNU Octave JSON reader/writer. The majority of the
188- [annotated N-D array constructs](#annotated-storage-of-n-d-arrays) had been implemented
189- in the [early releases](https://sourceforge.net/projects/iso2mesh/files/jsonlab/)
190- of JSONLab. In 2015, the initial draft of this specification
191- was [developed in the Iso2Mesh Wiki](https://iso2mesh.sourceforge.net/cgi-bin/index.cgi?action=history&id=jsonlab/Doc/JData);
192- since 2019, the development has been migrated to Github.
193-
194- The purpose of this document is to define the text and binary JData format
195- specifications. This is achieved through defining a semantic layer
196- over the JSON/UBJSON data storage syntax to map various types of complex data
197- structures. Such a semantic layer includes
198-
199- - a list of dedicated `"name"` fields, or keywords, that define the containers
200- of various data types that are commonly used in research,
201- - a list of dedicated `"name"` fields and formats to facilitate the grouping and
202- organization of hierarchical data,
203- - a list of format properties for the associated "value" field to store the
204- specific metadata of the data points
205- - a set of conversion rules between the text and binary forms.
206-
207- In the following sections, we will define the basic JData grammar and data models,
208- followed by the keywords for data grouping and various data types, including
209- scalars, N-dimensional arrays, sparse and complex arrays, structures, tables,
210- hashes/associative arrays, trees and graphs. The expressions for these data
211- structures in both text and binary forms are specified and exemplified, and
212- their conversion rules are defined.
78+ #### Background
79+
80+ Open sharing of scientific data has become a cornerstone of modern research,
81+ driven by community standards such as the FAIR principles -- Findability,
82+ Accessibility, Interoperability, and Reusability -- and mandated by an
83+ increasing number of funding agencies and journals. Yet while open-source
84+ software development has converged on widely accepted, human-readable
85+ "source-code" formats -- plain text files under version control, universally
86+ readable without specialized tools -- the sharing of complex scientific
87+ datasets lacks an equivalent convention. Shared datasets today are commonly
88+ distributed in domain-specific binary formats whose readability and
89+ long-term usability depend entirely on the continued availability of
90+ specialized parsers and libraries.
91+
92+ This tight coupling between data and its parser creates a fundamental
93+ fragility. Binary formats encode data in opaque byte sequences whose
94+ interpretation requires an external schema -- often embedded only in the
95+ parser source code or a separate specification document. As formats
96+ evolve, parsers are updated or retired, and the risk of datasets becoming
97+ unreadable grows with time. The scientific record is consequently
98+ vulnerable: data carefully collected and shared today may be effectively
99+ inaccessible to researchers a decade from now, undermining the very
100+ reproducibility that data sharing is meant to support. A durable solution
101+ requires a format in which the data structure is self-describing and
102+ human-readable -- analogous to source code -- rather than opaque and
103+ parser-dependent.
104+
105+ #### JData as the source-code format for scientific data
106+
107+ JData is designed to serve as the "source-code" format for scientific data
108+ exchange -- a representation that is simultaneously machine-processable and
109+ directly interpretable by humans, without requiring domain-specific tooling.
110+ By encoding data structures using self-explanatory, standardized annotation
111+ keywords embedded within the data file itself, JData makes the organization
112+ and meaning of a dataset transparent at the point of storage. This
113+ human-readability also makes JData inherently compatible with modern
114+ artificial intelligence (AI) tools: large language models and AI-assisted
115+ data pipelines can parse, query, and reason about JData documents without
116+ bespoke adapters, enabling sophisticated data interoperation, automated
117+ processing, and broad reuse in ways that opaque binary formats preclude.
118+
119+ In this specification, we define a set of JSON-compatible annotation
120+ keywords to unambiguously map the data structures most commonly encountered
121+ in research -- N-dimensional (N-D) arrays, associative arrays (maps),
122+ tables, trees, linked lists, directed and undirected graphs, and binary
123+ large objects (blobs) -- into standardized, self-describing JSON and binary
124+ JSON wrappers. Built-in support for internal data compression is also
125+ provided, allowing JData files to remain compact without sacrificing
126+ readability or portability.
127+
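To make the idea of a self-describing JSON wrapper concrete, the following is a minimal sketch of how an N-D array might be annotated and recovered. It assumes the keyword names `_ArrayType_`, `_ArraySize_`, and `_ArrayData_` from the annotated N-D array section of the specification and a row-major flattening of the data; these should be checked against that section. The helper names `encode_ndarray` and `decode_ndarray` are hypothetical and are not part of any JData library.

```python
import json


def encode_ndarray(nested, dtype):
    """Wrap a non-empty rectangular nested list in N-D array annotations.

    Keyword names are assumed from the JData annotated-array section;
    this helper itself is hypothetical.
    """
    # Derive the shape by walking down the first element of each level.
    shape = []
    level = nested
    while isinstance(level, list):
        shape.append(len(level))
        level = level[0]
    # Flatten one nesting level at a time, which yields row-major order.
    flat = nested
    for _ in range(len(shape) - 1):
        flat = [item for row in flat for item in row]
    return {"_ArrayType_": dtype, "_ArraySize_": shape, "_ArrayData_": flat}


def decode_ndarray(obj):
    """Rebuild the nested list from the annotated form."""
    shape = obj["_ArraySize_"]
    data = list(obj["_ArrayData_"])
    # Regroup from the innermost dimension outward.
    for n in reversed(shape[1:]):
        data = [data[i:i + n] for i in range(0, len(data), n)]
    return data


matrix = [[1, 2, 3], [4, 5, 6]]
annotated = encode_ndarray(matrix, "uint8")
text = json.dumps(annotated)                  # readable by any JSON parser
roundtrip = decode_ndarray(json.loads(text))  # back to the nested list
```

Because the wrapper is ordinary JSON, the shape and numeric type travel with the data and can be read without a format-specific parser.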
128+ #### Building on the JSON ecosystem
129+
130+ JData is built upon the JavaScript Object Notation (JSON) format [RFC4627]
131+ -- an internationally standardized (ECMA-404, ISO/IEC 21778:2017), text-based
132+ data-exchange format that has become one of the most widely adopted
133+ serialization standards across web and native applications. By anchoring
134+ JData to JSON, the specification immediately inherits a vast and mature
135+ ecosystem of free parsers available for virtually every programming language,
136+ as well as a suite of powerful complementary technologies: JSON Schema for
137+ automated data validation, JSONPath for structured data queries, JSON-LD for
138+ semantic data linking and referencing, and NoSQL database engines -- such as
139+ CouchDB and MongoDB -- that natively ingest JSON documents for scalable,
140+ searchable data storage. This ecosystem transforms JData from a static file
141+ format into an active data platform capable of supporting automated
142+ pipelines, cross-dataset queries, and large-scale data integration.
143+
144+ #### Binary JData for performance-sensitive applications
145+
146+ For applications where storage efficiency and parsing speed are critical,
147+ JData provides a binary representation through the Binary JData (BJData)
148+ format, derived from the Universal Binary JSON (UBJSON) specification.
149+ BJData retains a grammar closely aligned with JSON while addressing its core
150+ limitations for scientific use: it supports strongly typed numerical data
151+ across a full range of integer and floating-point precisions, enables
152+ storage of N-dimensional packed arrays in an optimized container format, and
153+ lifts the per-record and total-file size restrictions imposed by competing
154+ binary formats such as MessagePack and BSON. A distinguishing property of
155+ BJData is that all semantic elements -- record name tags and data-type
156+ markers -- remain human-readable strings, placing it in a unique
157+ "quasi-human-readable" category absent from almost all other binary
158+ formats. Conversion between text JData and binary JData is lossless and
159+ fully specified, so the choice between the two representations is purely
160+ a performance trade-off with no loss of data fidelity or self-description.
161+
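The storage argument for a typed binary representation can be illustrated with a rough size comparison. The sketch below does not implement the BJData encoding; it only uses Python's standard `struct` module to pack values as raw little-endian `float64` and `uint8` bytes, as an assumed stand-in for what a strongly typed packed array makes possible. The variable names are illustrative only.

```python
import json
import math
import struct

# 1000 full-precision doubles: JSON text versus a raw packed float64 payload.
samples = [math.sin(i) for i in range(1000)]
text_bytes = json.dumps(samples).encode("utf-8")
binary_bytes = struct.pack("<1000d", *samples)       # 8 bytes per value
restored = list(struct.unpack("<1000d", binary_bytes))

# Small integers: a typed format can use 1 byte each (uint8),
# whereas JSON text spends digits and separators on every value.
small_ints = list(range(256))
text_ints = json.dumps(small_ints).encode("utf-8")
packed_ints = struct.pack("<256B", *small_ints)

print(len(text_bytes), len(binary_bytes))
print(len(text_ints), len(packed_ints))
```

The packed payloads are fixed-width and need no number parsing on load, which is the trade-off the binary form offers against the direct readability of the text form.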
162+ #### Scope of this document
163+
164+ The remainder of this specification defines the text and binary JData
165+ grammar, the topological and semantic data models, and the complete set of
166+ data annotation keywords. For each supported data structure, both the text
167+ and binary representations are specified and exemplified, and their mutual
168+ conversion rules are defined. We first describe the basic JData grammar,
169+ then the data annotation keywords for data grouping, N-D arrays, maps,
170+ tables, trees, linked lists, graphs, and byte-stream data, followed by
171+ indexing and query conventions, data referencing and linking, and a
172+ summary of recommended file specifiers.
213 173
214 174 Grammar
215 175 ------------------------