Commit 57de727

Update introduction
1 parent 9157670 commit 57de727

1 file changed

Lines changed: 95 additions & 135 deletions

JData_specification.md

@@ -75,141 +75,101 @@ scalability of the generated data files.
75 75
Introduction
76 76
------------
77 77

78-
### Background
79-
80-
81-
Data are the digital representations of our world. Generating and processing
82-
data are essential parts of our daily lives, and form the very
83-
foundations of modern sciences, information technologies, businesses, and
84-
interactions between global societies.
85-
86-
Data can take many forms. Some data can be represented by simple scalars.
87-
Others have complex forms with hierarchical structures. An efficient
88-
representation of data also strongly depends upon application-specific needs. In some
89-
cases, plain text files with white-space delimited fields are sufficient;
90-
however, for performance-sensitive applications, binary formats can
91-
significantly reduce loading and processing time. The ability to store and
92-
parse complex data structures is particularly important to the scientific
93-
community.
94-
95-
It is a challenging task to encapsulate a wide variety of data forms within a
96-
single data interchange format. There have been many previous efforts in
97-
designing a general-purpose data storage specification. Some of them have
98-
become popular choices in one or multiple applications.
99-
Extensible Markup Language (XML), for example, is ubiquitously used as a
100-
data-exchange format, but the verbosity of the syntax, moderate complexity for
101-
parsing, impeded readability and inefficiency in expressing structured data
102-
indicate room for improvement. Comma Separated Value (CSV), a rather
103-
simple plain-text format, is used among some applications to exchange tabular
104-
data structures (such as spreadsheets); yet, its inability to encode more
105-
complex data forms, lack of flexibility and data precision restrict it to specific
106-
applications.
107-
108-
The Hierarchical Data Format (HDF) is a format targeting the broad needs of
109-
the scientific communities. It has an extensible hierarchical data model with a
110-
large capacity to represent complex binary data. However, using HDF effectively
111-
requires skillful implementation and an in-depth understanding of the
112-
complex underlying programming interfaces. For small projects with non-critical performance needs,
113-
using an advanced data format such as HDF may require additional development
114-
and maintenance efforts. Similar arguments can be made for the Common Data
115-
Format (CDF) or Network Common Data Format (netCDF) that are partly derived
116-
from HDF. In addition, the MATLAB mat-file format and Python pickle format
117-
have also been used among the research communities. However, their usage
118-
has been largely limited to the respective programming environments.
119-
120-
### JSON and Binary JData
121-
122-
The JavaScript Object Notation (JSON) format is a text-based data format that
123-
is known for its complex data storage capability, excellent portability and
124-
human-readability. JSON has been widely adopted in modern web applications, and is
125-
becoming popular among local/native applications. The key advantages of JSON
126-
include:
127-
128-
* **simplicity**: JSON data are composed of lists of `"name":value` pairs;
129-
such simple syntax greatly eases the use and parsing of the data file; free
130-
JSON-encoders and decoders are widely available for most popular programming
131-
languages;
132-
* **human-readability**: the text-based nature of JSON and its clean,
133-
easy-to-read format make it intuitively readable without in-depth knowledge of
134-
the format itself;
135-
* **hierarchical data support**: JSON has a tree-like data storage paradigm
136-
which has the capacity to support complex hierarchical data structures;
137-
there is no inherent data size limit imposed by the format itself;
138-
* **web-readiness**: because JSON can be readily parsed by JavaScript, most
139-
JSON-encoded data files can be directly invoked (inline or loaded from remote
140-
sites) by a JavaScript based web-application.
141-
142-
JSON also has limitations. JSON's `"value"` fields are weakly typed. Their scalar
143-
types are limited to strings, numbers, Booleans and null, and lack the granularity
144-
to represent various numerical types of different byte-lengths (in the C language,
145-
for example, short, int, long int, float, double, long double) and their signs
146-
(signed and unsigned). Because JSON is a text-based format, the size of the
147-
data file can be significantly larger than the corresponding binary file and requires
148-
additional conversion when used in an application. This introduces overhead in
149-
both storage and processing.
150-
151-
The [Binary JData (BJData) format](https://github.com/NeuroJSON/bjdata) was derived
152-
from the Universal Binary JSON (UBJSON) format, which is one of the binary counterparts
153-
to the JSON format. It specifically addresses the above-mentioned limitations,
154-
yet adheres to a simple grammar similar to the text-based JSON. Compared to other
155-
binary JSON-like formats, such as BSON (Binary JSON, https://bson.org), CBOR (Concise
156-
Binary Object Representation, [RFC 7049], https://cbor.io) and MessagePack
157-
(https://msgpack.org), BJData and UBJSON files are **"quasi-human-readable"** -
158-
a unique capability that is absent from almost all other binary formats.
159-
Compared to UBJSON, the BJData specification supports extended binary data types (such
160-
as unsigned integers and half-precision floating-point numbers) as well as
161-
an optimized N-dimensional array format. The extended data constructs also allow
162-
a BJData file to store binary arrays larger than 4 GB in size, which is not
163-
currently possible with MessagePack (maximum data record size is limited to
164-
4 GB) and BSON (maximum total file size is 4 GB).
165-
166-
With ease-of-use, superior portability and parser availability, JSON and
167-
BJData/UBJSON have the potential to serve as main-stream data storage and
168-
interchange formats for general needs, especially for the storage and interchange
169-
of scientific data. A combination of JSON and its binary counterpart offers features
170-
that are not currently available within existing data storage schemes. Although
171-
they do not provide all of the advanced features found in more sophisticated
172-
formats, their greatly simplified encoding and decoding strategies permit
173-
efficient data sharing among general audiences.
174-
175-
### JData specification overview
176-
177-
JData is a specification for storing, exchanging and processing general-purpose
178-
data that are commonly encountered in the information technology (IT) industries and
179-
research communities. It has a text/UNICODE format derived from the JSON
180-
specification and a binary format derived from the BJData/UBJSON specification. JData
181-
is designed to represent commonly used data structures, including arrays,
182-
structures, trees and graphs. A round-trip conversion is defined between the
183-
text and binary versions of JData documents.
184-
185-
The inception of this specification started in 2011 as part of the development
186-
of the [JSONLab Toolbox](https://neurojson.org/jsonlab/) - a popular
187-
open-source MATLAB/GNU Octave JSON reader/writer. The majority of the
188-
[annotated N-D array constructs](#annotated-storage-of-n-d-arrays) had been implemented
189-
in the [early releases](https://sourceforge.net/projects/iso2mesh/files/jsonlab/)
190-
of JSONLab. In 2015, the initial draft of this specification
191-
was [developed in the Iso2Mesh Wiki](https://iso2mesh.sourceforge.net/cgi-bin/index.cgi?action=history&id=jsonlab/Doc/JData);
192-
since 2019, the development has been migrated to GitHub.
193-
194-
The purpose of this document is to define the text and binary JData format
195-
specifications. This is achieved through defining a semantic layer
196-
over the JSON/UBJSON data storage syntax to map various types of complex data
197-
structures. Such a semantic layer includes
198-
199-
- a list of dedicated `"name"` fields, or keywords, that define the containers
200-
of various data types that are commonly used in research,
201-
- a list of dedicated `"name"` fields and formats to facilitate the grouping and
202-
organization of hierarchical data,
203-
- a list of format properties for the associated `"value"` field to store the
204-
specific metadata of the data points, and
205-
- a set of conversion rules between the text and binary forms.
206-
207-
In the following sections, we will define the basic JData grammar and data models,
208-
followed by the keywords for data grouping and various data types, including
209-
scalars, N-dimensional arrays, sparse and complex arrays, structures, tables,
210-
hashes/associative arrays, trees and graphs. The expressions for these data
211-
structures in both text and binary forms are specified and exemplified, and
212-
their conversion rules are defined.
78+
#### Background
79+
80+
Open sharing of scientific data has become a cornerstone of modern research,
81+
driven by community standards such as the FAIR principles -- Findability,
82+
Accessibility, Interoperability, and Reusability -- and mandated by an
83+
increasing number of funding agencies and journals. Yet while open-source
84+
software development has converged on widely accepted, human-readable
85+
"source-code" formats -- plain text files under version control, universally
86+
readable without specialized tools -- the sharing of complex scientific
87+
datasets lacks an equivalent convention. Shared datasets today are commonly
88+
distributed in domain-specific binary formats whose readability and
89+
long-term usability depend entirely on the continued availability of
90+
specialized parsers and libraries.
91+
92+
This tight coupling between data and its parser creates a fundamental
93+
fragility. Binary formats encode data in opaque byte sequences whose
94+
interpretation requires an external schema -- often embedded only in the
95+
parser source code or a separate specification document. As formats
96+
evolve, parsers are updated or retired, and the risk of datasets becoming
97+
unreadable grows with time. The scientific record is consequently
98+
vulnerable: data carefully collected and shared today may be effectively
99+
inaccessible to researchers a decade from now, undermining the very
100+
reproducibility that data sharing is meant to support. A durable solution
101+
requires a format in which the data structure is self-describing and
102+
human-readable -- analogous to source code -- rather than opaque and
103+
parser-dependent.
104+
105+
#### JData as the source-code format for scientific data
106+
107+
JData is designed to serve as the "source-code" format for scientific data
108+
exchange -- a representation that is simultaneously machine-processable and
109+
directly interpretable by humans, without requiring domain-specific tooling.
110+
By encoding data structures using self-explanatory, standardized annotation
111+
keywords embedded within the data file itself, JData makes the organization
112+
and meaning of a dataset transparent at the point of storage. This
113+
human-readability also makes JData inherently compatible with modern
114+
artificial intelligence (AI) tools: large language models and AI-assisted
115+
data pipelines can parse, query, and reason about JData documents without
116+
bespoke adapters, enabling sophisticated data interoperation, automated
117+
processing, and broad reuse in ways that opaque binary formats preclude.
118+
119+
In this specification, we define a set of JSON-compatible annotation
120+
keywords to unambiguously map the data structures most commonly encountered
121+
in research -- N-dimensional (N-D) arrays, associative arrays (maps),
122+
tables, trees, linked lists, directed and undirected graphs, and binary
123+
large objects (blobs) -- into standardized, self-describing JSON and binary
124+
JSON wrappers. Built-in support for internal data compression is also
125+
provided, allowing JData files to remain compact without sacrificing
126+
readability or portability.
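
As a rough illustration of what such a self-describing wrapper might look like, the sketch below wraps a small 2-D array in a JSON object and restores it. The keyword names `_ArrayType_`, `_ArraySize_` and `_ArrayData_`, and the row-major flattening, are assumptions taken from the annotated N-D array constructs defined later in the specification and are not shown in this commit. The helper names `annotate_array` and `restore_array` are hypothetical.

```python
import json


def annotate_array(rows, dtype):
    """Wrap a rectangular 2-D list in an annotated-array object (illustrative only)."""
    n_rows, n_cols = len(rows), len(rows[0])
    if any(len(r) != n_cols for r in rows):
        raise ValueError("all rows must have the same length")
    return {
        "_ArrayType_": dtype,                          # assumed keyword: element type
        "_ArraySize_": [n_rows, n_cols],               # assumed keyword: dimensions
        "_ArrayData_": [v for r in rows for v in r],   # assumed keyword: row-major values
    }


def restore_array(obj):
    """Rebuild the nested list from the annotated-array object."""
    n_rows, n_cols = obj["_ArraySize_"]
    flat = obj["_ArrayData_"]
    return [flat[i * n_cols:(i + 1) * n_cols] for i in range(n_rows)]


doc = {"image": annotate_array([[1, 2, 3], [4, 5, 6]], "uint8")}
text = json.dumps(doc)                       # plain JSON, readable by any JSON parser
restored = restore_array(json.loads(text)["image"])
print(text)
```

Because the wrapper is ordinary JSON, a reader without any JData-aware tooling can still see the array's type, shape and values directly in the file.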
127+
128+
#### Building on the JSON ecosystem
129+
130+
JData is built upon the JavaScript Object Notation (JSON) format [RFC4627]
131+
-- an internationally standardized (ECMA-404, ISO/IEC 21778:2017), text-based
132+
data-exchange format that has become one of the most widely adopted
133+
serialization standards across web and native applications. By anchoring
134+
JData to JSON, the specification immediately inherits a vast and mature
135+
ecosystem of free parsers available for virtually every programming language,
136+
as well as a suite of powerful complementary technologies: JSON Schema for
137+
automated data validation, JSONPath for structured data queries, JSON-LD for
138+
semantic data linking and referencing, and NoSQL database engines -- such as
139+
CouchDB and MongoDB -- that natively ingest JSON documents for scalable,
140+
searchable data storage. This ecosystem transforms JData from a static file
141+
format into an active data platform capable of supporting automated
142+
pipelines, cross-dataset queries, and large-scale data integration.
143+
144+
#### Binary JData for performance-sensitive applications
145+
146+
For applications where storage efficiency and parsing speed are critical,
147+
JData provides a binary representation through the Binary JData (BJData)
148+
format, derived from the Universal Binary JSON (UBJSON) specification.
149+
BJData retains a grammar closely aligned with JSON while addressing its core
150+
limitations for scientific use: it supports strongly typed numerical data
151+
across a full range of integer and floating-point precisions, enables
152+
storage of N-dimensional packed arrays in an optimized container format, and
153+
lifts the per-record and total-file size restrictions imposed by competing
154+
binary formats such as MessagePack and BSON. A distinguishing property of
155+
BJData is that all semantic elements -- record name tags and data-type
156+
markers -- remain human-readable strings, placing it in a unique
157+
"quasi-human-readable" category absent from almost all other binary
158+
formats. Conversion between text JData and binary JData is lossless and
159+
fully specified, so the choice between the two representations is purely
160+
a performance trade-off with no loss of data fidelity or self-description.
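
The storage trade-off between text and strongly typed binary encodings can be sketched with Python's standard library alone. This is not a BJData encoder: it only packs the same numbers as fixed-width little-endian float64 values with `struct` and compares the result with their JSON text form. The sample data and variable names are assumptions made for illustration.

```python
import json
import math
import struct

# 1000 double-precision samples; most need 17 or more characters as decimal text
values = [math.sin(i) for i in range(1000)]

# Text form: a JSON array of decimal strings
text_bytes = json.dumps(values).encode("utf-8")

# Typed binary form: 8 bytes per value, little-endian float64
packed = struct.pack("<1000d", *values)
unpacked = list(struct.unpack("<1000d", packed))

print("JSON text size (bytes):", len(text_bytes))
print("packed binary size (bytes):", len(packed))
```

Both forms recover the original values exactly here, so the choice between them affects size and parsing cost rather than fidelity.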
161+
162+
#### Scope of this document
163+
164+
The remainder of this specification defines the text and binary JData
165+
grammar, the topological and semantic data models, and the complete set of
166+
data annotation keywords. For each supported data structure, both the text
167+
and binary representations are specified and exemplified, and their mutual
168+
conversion rules are defined. We first describe the basic JData grammar,
169+
then the data annotation keywords for data grouping, N-D arrays, maps,
170+
tables, trees, linked lists, graphs, and byte-stream data, followed by
171+
indexing and query conventions, data referencing and linking, and a
172+
summary of recommended file specifiers.
213 173

214 174
Grammar
215 175
------------------------
