marcattacks!

Turn your MARC exports into something else.

Build

npm install

npm run build:ts

npm link

Run

Generate JSON:

marcattacks --to json ./data/sample.xml

The same works for tarred and/or gzipped files:

marcattacks --to json ./data/sample.tar.gz

Generate Aleph sequential:

marcattacks --to alephseq ./data/sample.xml

Generate RDF:

marcattacks --to rdf --map marc2rdf ./data/sample.xml

Generate XML:

marcattacks --from alephseq --to xml ./data/one.alephseq

Transform the MARC input using a JSONata expression or file:

marcattacks --param fix=./demo/demo.jsonata ./data/sample.xml

Or transform using a Catmandu Fix script — a declarative, line-based mapping language built for library metadata (and faster than JSONata):

marcattacks --to jsonl --map fix --param fix=./demo/marc2rdf.fix ./data/sample.xml

A Fix script is a list of name(args) statements, with if/unless ... end conditionals and do ... end binds:

marc_map('245ab', title, join: ' ')    # copy MARC 245$a$b into title
upcase(title)                          # uppercase it
add_field(type, Book)                  # add a constant field
lookup(type, ./types.csv)              # map a value through a CSV table
do marc_each()                         # loop over each MARC field
  unless marc_match('500e', skip)
    marc_map('500', note.$append)
  end
end
remove_field(record)

See ./demo/marc2rdf.fix and ./demo/example.fix for complete examples.

types.csv is a two-column lookup table in CSV format, for example:

A,B
Book,http://example.org/ns#Book

The fix mapper implements a subset of the most common Catmandu Fix builtins created by the LibreCat project. A full reference for the Fix language can be found here: https://librecat.org/assets/catmandu_cheat_sheet.pdf.

Stdin

Use the pseudo-URL stdin:// to read from standard input.

Remote files

A remote SFTP path:

marcattacks --key ~/.ssh/privatekey sftp://username@hostname:port/remote/path

The latest XML file in a remote SFTP directory:

marcattacks --key ~/.ssh/privatekey sftp://username@hostname:port/remote/path/@latest:xml

An HTTP path:

marcattacks http://somewhere.org/data.xml

An S3 path:

marcattacks s3://accessKey:secretKey@hostname:port/bucket/key

Use s3s://... to connect over SSL.

Options

Input (--from)

alephseq (Aleph sequential)
json
jsonl
marc (ISO2709)
rdf
csv
tsv
xml (MARCXML)
fastxml (optimized parser for MARCXML)

Output (--to)

alephseq (Aleph sequential)
csv
- opts:
  - header: string
  - delimiter: string
json
jsonl
multipart
- opts:
  - header: string
  - delimited: string
  - noEndDelimited: true | false
null (output nothing, for benchmarking)
parquet
- opts:
  - schema: string (path)
  - rowGroupSize: number
  - pageIndex: true | false (default: false)
rdf
tsv
- opts:
  - header: string
  - delimiter: string
xml (MARCXML)

Transform (--map)

avram : A mapper from MARC to Avram
fix : A Catmandu Fix-language mapper (--param fix=<file>). See ./demo/marc2rdf.fix
jsonata : A jsonata fixer (default)
marc2rdf : A mapper from MARC to RDF (demonstrator)
marcids : A mapper from MARC to a list of record ids
marcinrdf : A naive mapper from MARC into RDF producing a list of lists (demonstrator)

Or, provide your own transformers using JavaScript plugins. See: ./plugin/demo.js for an example.

Param (--param)

Pass parameters to the mapper, the input reader and the output writer. See the examples:

npm run demo:jsonld
npm run demo:n3
npm run biblio:one

Parallelism (--workers)

--workers <n> runs the map stage (--map) on <n> worker threads while the main thread handles I/O, parsing and serialization. Output order is preserved.

The default is auto, which uses CPU cores − 1 — leaving one core free for the main thread (parsing / I/O / serialization / result reordering). Using all cores oversubscribes the machine and is typically a few percent slower, so cores − 1 is the sweet spot. Pass an explicit number to override (e.g. --workers 4), or --workers 1 to disable threading.
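If a script should pin the worker count explicitly instead of relying on auto, the cores minus one rule can be reproduced in the shell. This is a minimal sketch, not how marcattacks computes the value internally: the helper name auto_workers and the floor of 1 on single-core machines are assumptions.

```shell
#!/bin/bash
# Hypothetical helper: mirrors the documented "CPU cores - 1" rule.
# Assumption: never go below 1 worker (the README does not say what
# auto resolves to on a single-core machine).
auto_workers() {
    local cores="$1"
    local n=$(( cores - 1 ))
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    echo "$n"
}

# Usage (nproc is GNU coreutils; on macOS use: sysctl -n hw.ncpu):
#   marcattacks --workers "$(auto_workers "$(nproc)")" \
#       --param fix=./demo/demo.jsonata ./data/sample.xml
```

Passing the number explicitly makes the thread count visible in logs and job definitions, at the cost of bypassing the auto logic that keeps cheap maps single-threaded.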

Threading only helps when the map is the bottleneck — i.e. a heavy, interpreted JSONata transform (--param fix=...jsonata), where it scales to roughly 1.8× (capped by main-thread coordination, not the map). For cheap maps the per-record cost of shipping records to/from threads outweighs the work, so the auto default only threads maps that actually benefit:

jsonata opts in — auto threads it (this is also the default map).
The Fix mapper (--map fix) is compiled and runs at ~100k+ rec/s, so it is almost never the bottleneck; auto leaves it single-threaded. (You can still force threads with an explicit --workers <n>, but it rarely helps.)
Any other map (no createMapper) always runs single-threaded; an explicit --workers <n> on such a map is ignored with a warning.

For the cheap-map cases the bottleneck is the reader/writer, not the map. The biggest lever is the input reader: prefer --from fastxml over the default sax xml reader (roughly 2× on MARCXML). For example, with a Fix map:

marcattacks --from fastxml --to jsonl --map fix --param fix=./demo/marc2rdf.fix input.xml.gz

Rule of thumb: heavy jsonata → keep the auto default (or set --workers <n>); fix / cheap maps → --from fastxml (the auto default already keeps them single-threaded, so no --workers flag is needed).

Writable (--out)

default: stdout
file path
sftp://username@host:port/path
s3://accessKey:secretKey@host:port/bucket/key (or s3s://)

S3 object ACL (--acl)

When writing to an s3:// (or s3s://) destination you can set a canned ACL on the uploaded object with the --acl option. E.g. to make the output publicly readable:

marcattacks input.xml -o s3://accessKey:secretKey@host:port/bucket/key.json --acl public-read

The ACL is left unset by default. Note that public-read only takes effect on buckets where ACLs are enabled (Object Ownership "Bucket owner preferred"); on buckets with ACLs disabled the request is rejected and you should use a bucket policy instead.

Logging (--info,--debug,--trace,--log)

Log messages are enabled with the --info, --debug and --trace options.

By default, logs are written to stderr in a text format. Both the format and the output stream can be changed with the --log option:

--log json : write logs in JSON format
--log stdout : write logs to stdout
--log json+stdout : write logs in JSON format to stdout

Compression (--z,--tar)

Gzip compression and tar archives are detected automatically from the input file name extension. If the file name has no such extension, set these flags to force decompression or unpacking:

--z : the input file is gzipped
--tar : the input file is a tar archive
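When file names carry no extension, a wrapper script can decide whether to pass --z by looking at the file content. The sketch below checks for the gzip magic bytes (1f 8b) at the start of the file. The helper name z_flag and the path ./data/export are hypothetical, and the check is independent of whatever detection marcattacks performs itself.

```shell
#!/bin/bash
# Hypothetical helper: print "--z" when a file starts with the gzip
# magic bytes 1f 8b, and print nothing otherwise.
z_flag() {
    local file="$1"
    local magic
    # Read the first two bytes and render them as lowercase hex.
    magic=$(head -c 2 "$file" | od -An -tx1 | tr -d ' \n')
    if [ "$magic" = "1f8b" ]; then
        echo "--z"
    fi
}

# Usage (unquoted on purpose, so an empty result adds no argument):
#   marcattacks $(z_flag ./data/export) --from xml --to json ./data/export
```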

Exit codes

marcattacks (and globtrotr) use semantic exit codes following the BSD sysexits.h conventions, so set -o pipefail scripts can react to why a run failed:

Code	Name	Meaning
0	OK	Success; also a benign stop: the downstream reader closed the pipe (| head, quitting | less) or --count reached its limit
64	USAGE	Bad invocation: missing input file, missing --from, an unknown --from/--to/--map plugin name, an unsupported URL scheme
65	DATAERR	The input could not be parsed (malformed XML/JSON/MARC record)
66	NOINPUT	The input file / object / @latest target was not found
70	SOFTWARE	Internal error: a worker thread crashed, or a plugin file failed to load (syntax/runtime error)
73	CANTCREAT	The output could not be created (--out file or S3 object)
74	IOERR	A read/write/connection failure mid-stream (dropped connection, premature close)
76	PROTOCOL	A remote protocol error (HTTP 4xx/5xx, too many redirects)
77	NOPERM	Permission denied
78	CONFIG	A configuration / credential error

Note: when output is piped to a pager (| less) and you quit mid-stream, marcattacks restores the terminal and exits via SIGKILL (status 137) to keep the terminal usable — this is unavoidable for the interactive raw-mode case. A plain | head or a piped/cron run is detected as benign and exits 0.
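A wrapper script can turn these codes into readable names and decide what to do next. The following is a minimal sketch: the codes and names come from the table above, while the helper names exit_name and is_retryable, and the choice to retry only on IOERR, are assumptions of this example.

```shell
#!/bin/bash
# Map a marcattacks exit status to the sysexits-style name from the table.
exit_name() {
    case "$1" in
        0)  echo "OK" ;;
        64) echo "USAGE" ;;
        65) echo "DATAERR" ;;
        66) echo "NOINPUT" ;;
        70) echo "SOFTWARE" ;;
        73) echo "CANTCREAT" ;;
        74) echo "IOERR" ;;
        76) echo "PROTOCOL" ;;
        77) echo "NOPERM" ;;
        78) echo "CONFIG" ;;
        *)  echo "UNKNOWN" ;;
    esac
}

# Assumption: only IOERR (74) is treated as transient. PROTOCOL (76) mixes
# temporary 5xx and permanent 4xx errors, so it is left out here.
is_retryable() {
    case "$1" in
        74) return 0 ;;
        *)  return 1 ;;
    esac
}

# Usage:
#   marcattacks --to json ./data/sample.xml > out.json
#   status=$?
#   if [ "$status" -ne 0 ]; then
#       echo "marcattacks failed: $(exit_name "$status")" >&2
#       is_retryable "$status" && echo "will retry later" >&2
#   fi
```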

Environment Variables

SFTP and S3 credentials can be set using environment variables or a local .env file. Credentials embedded in the URL take precedence; these variables are only used as a fallback when the URL omits them. Available variables:

SFTP_USERNAME
SFTP_PASSWORD
S3_ACCESS_KEY
S3_SECRET_KEY

An SFTP private key can be provided through the --key-env command line option, e.g. --key-env PRIVATE_KEY, which reads the key from the PRIVATE_KEY environment variable.
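The precedence rule for credentials (URL first, environment variable as fallback) can be illustrated for the SFTP user name. This sketch only models the documented behaviour in the shell; it is not taken from the marcattacks source, and the helper name sftp_user is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the user name that applies to an SFTP URL
# under the documented rule (URL first, SFTP_USERNAME as fallback).
sftp_user() {
    local url="$1"
    local rest="${url#*://}"        # drop the scheme
    local authority="${rest%%/*}"   # keep only user@host:port
    local userinfo
    case "$authority" in
        *@*)
            userinfo="${authority%@*}"   # part before the last "@"
            echo "${userinfo%%:*}"       # drop an optional ":password"
            ;;
        *)
            echo "${SFTP_USERNAME:-}"    # URL has no user: use the env
            ;;
    esac
}

# Usage:
#   export SFTP_USERNAME=bob
#   sftp_user sftp://alice@hostname:22/remote/path   # URL wins
#   sftp_user sftp://hostname:22/remote/path         # falls back to env
```

Only the part before the first slash is inspected, so an @latest or @glob selector in the path is not mistaken for a user name.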

Discover files at a (remote) endpoint

Find all files that end with xml on an SFTP site:

npx globtrotr --key ~/.ssh/mykey sftp://username@hostname:port/remote/path/@glob:xml

Or, for an S3 site:

npx globtrotr s3s://accessKey:secretKey@hostname:port/bucket/@glob:xml

Concatenate files

Some formats, such as jsonl, can be concatenated. With a Bash command group, the output of several marcattacks runs can be combined into one stream:

#!/bin/bash

# Example of how to process files in sequence and concatenate the output
{
    npx marcattacks --from alephseq --to jsonl data/one.alephseq
    npx marcattacks --from xml --to jsonl data/sample.tar
    npx marcattacks --from xml --to jsonl data/sample.tar.gz
    npx marcattacks --from xml --to jsonl data/sample.xml.gz
    npx marcattacks --from xml --to jsonl data/sample.xml
} | npx marcattacks --from jsonl --to xml stdin://
