Turn your MARC exports into something else.
npm install
npm run build:ts
npm link
Generate JSON:
marcattacks --to json ./data/sample.xml
This also works for tarred and gzipped files:
marcattacks --to json ./data/sample.tar.gz
Generate Aleph sequential:
marcattacks --to alephseq ./data/sample.xml
Generate RDF:
marcattacks --to rdf --map marc2rdf ./data/sample.xml
Generate XML:
marcattacks --from alephseq --to xml ./data/one.alephseq
Transform the MARC input using a JSONata expression or file:
marcattacks --param fix=./demo/demo.jsonata ./data/sample.xml
Or transform using a Catmandu Fix script — a declarative, line-based mapping language built for library metadata (and faster than JSONata):
marcattacks --to jsonl --map fix --param fix=./demo/marc2rdf.fix ./data/sample.xml
A Fix script is a list of name(args) statements, with if/unless ... end
conditionals and do ... end binds:
marc_map('245ab', title, join: ' ') # copy MARC 245$a$b into title
upcase(title) # uppercase it
add_field(type, Book) # add a constant field
lookup(type, ./types.csv) # map a value through a CSV table
do marc_each() # loop over each MARC field
unless marc_match('500e', skip)
marc_map('500', note.$append)
end
end
remove_field(record)
See ./demo/marc2rdf.fix and ./demo/example.fix for complete examples.
types.csv is a two-column lookup table in CSV format, e.g.:
A,B
Book,http://example.org/ns#Book
The fix mapper implements a subset of the most common Catmandu Fix builtins created by the LibreCat project. A full reference for the Fix language can be found here: https://librecat.org/assets/catmandu_cheat_sheet.pdf.
Use the pseudo URL stdin:// to read from standard input.
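A minimal sketch of reading from standard input follows. The wrapper name xml_stdin_to_json is hypothetical, and --from xml is passed explicitly on the assumption that the input format cannot be inferred from a file name when reading stdin://.

```shell
# Hypothetical helper: convert MARCXML arriving on standard input to JSON.
# Assumption: --from is given explicitly because stdin:// has no file extension.
xml_stdin_to_json() {
  marcattacks --from xml --to json stdin://
}

# Example call (not executed here):
#   cat ./data/sample.xml | xml_stdin_to_json
```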
A remote SFTP path:
marcattacks --key ~/.ssh/privatekey sftp://username@hostname:port/remote/path
The latest XML file in a remote SFTP:
marcattacks --key ~/.ssh/privatekey sftp://username@hostname:port/remote/path/@latest:xml
An HTTP path:
marcattacks http://somewhere.org/data.xml
An S3 path:
marcattacks s3://accessKey:secretKey@hostname:port/bucket/key
Use s3s://... to connect over SSL.
- alephseq (Aleph sequential)
- json
- jsonl
- marc (ISO2709)
- rdf
- csv
- tsv
- xml (MARCXML)
- fastxml (optimized parser for MARCXML)
- alephseq (Aleph sequential)
- csv
  - opts:
    - header: string
    - delimiter: string
- json
- jsonl
- multipart
  - opts:
    - header: string
    - delimited: string
    - noEndDelimited: true | false
- null (output nothing, for benchmarking)
- parquet
  - opts:
    - schema: string (path)
    - rowGroupSize: number
    - pageIndex: true | false (default: false)
- rdf
- csv
- tsv
  - opts:
    - header: string
    - delimiter: string
- xml (MARCXML)
- avram : A mapper from MARC to Avram
- fix : A Catmandu Fix-language mapper (--param fix=<file>). See ./demo/marc2rdf.fix
- jsonata : A jsonata fixer (default)
- marc2rdf : A mapper from MARC to RDF (demonstrator)
- marcids : A mapper from MARC to a list of record ids
- marcinrdf : A naive mapper from MARC into RDF producing a list of lists (demonstrator)
Or, provide your own transformers using JavaScript plugins. See: ./plugin/demo.js for an example.
Pass parameters to the mapper, the input reader and the output writer. For examples, see:
npm run demo:jsonld
npm run demo:n3
npm run biblio:one
--workers <n> runs the map stage (--map) on <n> worker threads while the
main thread handles I/O, parsing and serialization. Output order is preserved.
The default is auto, which uses CPU cores − 1 — leaving one core free
for the main thread (parsing / I/O / serialization / result reordering). Using
all cores oversubscribes the machine and is typically a few percent slower, so
cores − 1 is the sweet spot. Pass an explicit number to override (e.g.
--workers 4), or --workers 1 to disable threading.
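The "cores − 1" arithmetic can be sketched as a small shell function, for example to log the value or pass an explicit number. The function name auto_workers and the floor of 1 on single-core machines are assumptions made for illustration; marcattacks computes its own auto value internally.

```shell
# Print the worker count that "cores - 1" yields for a given core count.
# With no argument, the number of online cores is read via getconf.
# Assumption: the result is floored at 1 so a single-core machine still gets one worker.
auto_workers() (
  cores="${1:-$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)}"
  if [ "$cores" -gt 1 ]; then
    echo $((cores - 1))
  else
    echo 1
  fi
)

# Example call (not executed here):
#   marcattacks --workers "$(auto_workers)" --param fix=./demo/demo.jsonata ./data/sample.xml
```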
Threading only helps when the map is the bottleneck — i.e. a heavy,
interpreted JSONata transform
(--param fix=...jsonata), where it scales to roughly 1.8× (capped by
main-thread coordination, not the map). For cheap maps the per-record cost of
shipping records to/from threads outweighs the work, so the auto default only
threads maps that actually benefit:
- jsonata opts in: auto threads it (this is also the default map).
- The Fix mapper (--map fix) is compiled and runs at ~100k+ rec/s, so it is almost never the bottleneck; auto leaves it single-threaded. (You can still force threads with an explicit --workers <n>, but it rarely helps.)
- Any other map (no createMapper) always runs single-threaded; an explicit --workers <n> on such a map is ignored with a warning.
For the cheap-map cases the bottleneck is the reader/writer, not the map. The
biggest lever is the input reader: prefer --from fastxml over the default
sax xml reader (roughly 2× on MARCXML). For example, with a Fix map:
marcattacks --from fastxml --to jsonl --map fix --param fix=./demo/marc2rdf.fix input.xml.gz
Rule of thumb: heavy jsonata → keep the auto default (or set --workers <n>);
fix / cheap maps → --from fastxml (the auto default already keeps them
single-threaded, so no --workers flag is needed).
- default: stdout
- file path
- sftp://username@host:port/path
- s3://accessKey:secretKey@host:port/bucket/key (or s3s://)
When writing to an s3:// (or s3s://) destination you can set a canned ACL on the uploaded object with the --acl option. E.g. to make the output publicly readable:
marcattacks input.xml -o s3://accessKey:secretKey@host:port/bucket/key.json --acl public-read
The ACL is left unset by default. Note that public-read only takes effect on buckets where ACLs are enabled (Object Ownership "Bucket owner preferred"); on buckets with ACLs disabled the request is rejected and you should use a bucket policy instead.
Log verbosity is set with the --info, --debug and --trace options.
By default, log messages are written to stderr in a text format. Both the format and the output stream can be changed with the --log option:
- --log json: write logs in JSON format
- --log stdout: write logs to stdout
- --log json+stdout: write logs in JSON format to stdout
Gzip compression and tar archives are detected automatically from the file name extension. If the file name has no such extension, the following flags force decompression:
- --z: the input file is gzipped
- --tar: the input file is tarred
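To illustrate what extension-based detection means in practice, here is a rough sketch. The function name compression_of is hypothetical, and the list of recognised extensions is an assumption limited to those that appear in this README; the tool's real detection logic may differ.

```shell
# Classify a file name by extension, mirroring the description above.
# Assumption: only .tar.gz, .tar and .gz are recognised.
compression_of() {
  case "$1" in
    *.tar.gz) echo "tar+gzip" ;;
    *.tar)    echo "tar" ;;
    *.gz)     echo "gzip" ;;
    *)        echo "none" ;;
  esac
}
```

A gzipped export saved under a name such as export.bin would be classified as "none" here, which is the situation where the --z flag is needed.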
marcattacks (and globtrotr) use semantic exit codes following the
BSD sysexits.h conventions, so set -o pipefail scripts can react to why a
run failed:
| Code | Name | Meaning |
|---|---|---|
| 0 | OK | Success; also a benign stop: the downstream reader closed the pipe (\| head, quitting \| less) or --count reached its limit |
| 64 | USAGE | Bad invocation: missing input file, missing --from, an unknown --from/--to/--map plugin name, an unsupported URL scheme |
| 65 | DATAERR | The input could not be parsed (malformed XML/JSON/MARC record) |
| 66 | NOINPUT | The input file / object / @latest target was not found |
| 70 | SOFTWARE | Internal error: a worker thread crashed, or a plugin file failed to load (syntax/runtime error) |
| 73 | CANTCREAT | The output could not be created (--out file or S3 object) |
| 74 | IOERR | A read/write/connection failure mid-stream (dropped connection, premature close) |
| 76 | PROTOCOL | A remote protocol error (HTTP 4xx/5xx, too many redirects) |
| 77 | NOPERM | Permission denied |
| 78 | CONFIG | A configuration / credential error |
Note: when output is piped to a pager (| less) and you quit mid-stream,
marcattacks restores the terminal and exits via SIGKILL (status 137) to keep
the terminal usable — this is unavoidable for the interactive raw-mode case. A
plain | head or a piped/cron run is detected as benign and exits 0.
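A script can branch on these codes roughly as sketched below. The function names exit_name and is_retryable are hypothetical, and the retry policy (only IOERR counts as transient) is an assumption chosen for illustration, not something marcattacks prescribes.

```shell
# Map an exit status to the sysexits.h name listed in the table above.
exit_name() {
  case "$1" in
    0)  echo OK ;;
    64) echo USAGE ;;
    65) echo DATAERR ;;
    66) echo NOINPUT ;;
    70) echo SOFTWARE ;;
    73) echo CANTCREAT ;;
    74) echo IOERR ;;
    76) echo PROTOCOL ;;
    77) echo NOPERM ;;
    78) echo CONFIG ;;
    *)  echo UNKNOWN ;;
  esac
}

# Hypothetical policy: only a mid-stream I/O failure is worth retrying.
is_retryable() {
  case "$1" in
    74) return 0 ;;
    *)  return 1 ;;
  esac
}

# Example use (not executed here):
#   marcattacks --to jsonl input.xml > out.jsonl
#   status=$?
#   if is_retryable "$status"; then
#     echo "transient failure ($(exit_name "$status")), retrying" >&2
#   fi
```

Status 137 from the pager case described in the note maps to UNKNOWN in this sketch, since it is a signal exit rather than a sysexits code.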
SFTP and S3 credentials can be set using environment variables or a local .env file.
Credentials embedded in the URL take precedence; these variables are only used as
a fallback when the URL omits them. Available variables:
- SFTP_USERNAME
- SFTP_PASSWORD
- S3_ACCESS_KEY
- S3_SECRET_KEY
An SFTP private key can be provided with the --key-env command line option, e.g. --key-env PRIVATE_KEY, which reads the key from the PRIVATE_KEY environment variable.
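The precedence rule for credentials (URL first, environment as fallback) can be illustrated with a short sketch. The function sftp_username is hypothetical and only mimics the documented rule with plain string handling; it is not how marcattacks parses URLs internally.

```shell
# Resolve the SFTP username for a URL: a user embedded in the URL wins,
# otherwise fall back to the SFTP_USERNAME environment variable.
sftp_username() (
  authority="${1#sftp://}"        # drop the scheme
  authority="${authority%%/*}"    # keep only user@host:port
  case "$authority" in
    *@*) userinfo="${authority%@*}"
         echo "${userinfo%%:*}" ;;      # URL takes precedence; drop an optional :password
    *)   echo "${SFTP_USERNAME:-}" ;;   # fallback when the URL omits the user
  esac
)
```

With SFTP_USERNAME=bob set, sftp://alice@hostname:22/remote/path resolves to alice, while sftp://hostname:22/remote/path resolves to bob.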
Find all files ending in xml on an SFTP site:
npx globtrotr --key ~/.ssh/mykey sftp://username@hostname:port/remote/path/@glob:xml
Or, for an S3 site:
npx globtrotr s3s://accessKey:secretKey@hostname:port/bucket/@glob:xml
Some formats, such as jsonl, allow output to be concatenated. Using Bash command grouping, marcattacks can then combine several files into one:
#!/bin/bash
# Example: process files in sequence and concatenate the output
{
npx marcattacks --from alephseq --to jsonl data/one.alephseq
npx marcattacks --from xml --to jsonl data/sample.tar
npx marcattacks --from xml --to jsonl data/sample.tar.gz
npx marcattacks --from xml --to jsonl data/sample.xml.gz
npx marcattacks --from xml --to jsonl data/sample.xml
} | npx marcattacks --from jsonl --to xml stdin://
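The same pattern can be generalised to an arbitrary list of files, roughly as follows. The helper names reader_for and to_jsonl_all are hypothetical, and the rule that every file not ending in .alephseq is MARCXML is an assumption that only fits the sample files used above.

```shell
# Choose the --from reader from the file name.
# Assumption: .alephseq files are Aleph sequential, everything else is MARCXML.
reader_for() {
  case "$1" in
    *.alephseq) echo alephseq ;;
    *)          echo xml ;;
  esac
}

# Convert each file to JSON Lines on stdout, stopping at the first failure
# and passing the failing exit status on to the caller.
to_jsonl_all() {
  for f in "$@"; do
    npx marcattacks --from "$(reader_for "$f")" --to jsonl "$f" || return $?
  done
}

# Example call (not executed here):
#   to_jsonl_all data/one.alephseq data/sample.xml.gz \
#     | npx marcattacks --from jsonl --to xml stdin://
```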
