ugent-library/marcattacks

marcattacks!

Turn your MARC exports into something else.

Build

npm install
npm run build:ts
npm link

Run

Generate JSON:

marcattacks --to json ./data/sample.xml

The same works for tarred and/or gzipped files:

marcattacks --to json ./data/sample.tar.gz

Generate Aleph sequential:

marcattacks --to alephseq ./data/sample.xml

Generate RDF:

marcattacks --to rdf --map marc2rdf ./data/sample.xml

Generate XML:

marcattacks --from alephseq --to xml ./data/one.alephseq

Transform the MARC input using a JSONata expression or file:

marcattacks --param fix=./demo/demo.jsonata ./data/sample.xml

Or transform using a Catmandu Fix script — a declarative, line-based mapping language built for library metadata (and faster than JSONata):

marcattacks --to jsonl --map fix --param fix=./demo/marc2rdf.fix ./data/sample.xml

A Fix script is a list of name(args) statements, with if/unless ... end conditionals and do ... end binds:

marc_map('245ab', title, join: ' ')    # copy MARC 245$a$b into title
upcase(title)                          # uppercase it
add_field(type, Book)                  # add a constant field
lookup(type, ./types.csv)              # map a value through a CSV table
do marc_each()                         # loop over each MARC field
  unless marc_match('500e', skip)
    marc_map('500', note.$append)
  end
end
remove_field(record)

See ./demo/marc2rdf.fix and ./demo/example.fix for complete examples.

types.csv is a two-column lookup table in CSV format, for example:

A,B
Book,http://example.org/ns#Book

The fix mapper implements a subset of the most common Catmandu Fix builtins created by the LibreCat project. A full reference for the Fix language can be found here: https://librecat.org/assets/catmandu_cheat_sheet.pdf.

Stdin

Use the pseudo-URL stdin:// to read from standard input.
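As a minimal sketch, a decompressed stream could be piped in like this. The helper name gz_to_json is hypothetical, and --from xml is spelled out on the assumption that the input format cannot be inferred from a file name when reading from stdin.

```shell
# Hypothetical helper: decompress a gzipped MARCXML file and feed the
# stream to marcattacks through the stdin:// pseudo-URL.
gz_to_json() {
    gzip -dc "$1" | marcattacks --from xml --to json stdin://
}
```

Usage: gz_to_json ./data/sample.xml.gz > sample.json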

Remote files

A remote SFTP path:

marcattacks --key ~/.ssh/privatekey sftp://username@hostname:port/remote/path

The latest XML file on a remote SFTP server:

marcattacks --key ~/.ssh/privatekey sftp://username@hostname:port/remote/path/@latest:xml

An HTTP path:

marcattacks http://somewhere.org/data.xml

An S3 path:

marcattacks s3://accessKey:secretKey@hostname:port/bucket/key

Use s3s://... to connect over SSL/TLS.

Options

Input (--from)

  • alephseq (Aleph sequential)
  • json
  • jsonl
  • marc (ISO2709)
  • rdf
  • csv
  • tsv
  • xml (MARCXML)
  • fastxml (optimized parser for MARCXML)

Output (--to)

  • alephseq (Aleph sequential)
  • csv
    • opts:
      • header: string
      • delimiter: string
  • json
  • jsonl
  • multipart
    • opts:
      • header: string
      • delimited: string
      • noEndDelimited: true | false
  • null (output nothing, for benchmarking)
  • parquet
    • opts:
      • schema: string (path)
      • rowGroupSize: number
      • pageIndex: true | false (default: false)
  • rdf
  • tsv
    • opts:
      • header: string
      • delimiter: string
  • xml (MARCXML)

Transform (--map)

  • avram : A mapper from MARC to Avram
  • fix : A Catmandu Fix-language mapper (--param fix=<file>). See ./demo/marc2rdf.fix
  • jsonata : A jsonata fixer (default)
  • marc2rdf : A mapper from MARC to RDF (demonstrator)
  • marcids : A mapper from MARC to a list of record ids
  • marcinrdf : A naive mapper from MARC into RDF producing a list of lists (demonstrator)

Or, provide your own transformers using JavaScript plugins. See: ./plugin/demo.js for an example.

Param (--param)

Pass parameters to the mapper, input and output. See these examples:

  • npm run demo:jsonld
  • npm run demo:n3
  • npm run biblio:one

Parallelism (--workers)

--workers <n> runs the map stage (--map) on <n> worker threads while the main thread handles I/O, parsing and serialization. Output order is preserved.

The default is auto, which uses CPU cores − 1 — leaving one core free for the main thread (parsing / I/O / serialization / result reordering). Using all cores oversubscribes the machine and is typically a few percent slower, so cores − 1 is the sweet spot. Pass an explicit number to override (e.g. --workers 4), or --workers 1 to disable threading.
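The arithmetic behind the auto default can be sketched in Bash, for instance to log the value or pass it on explicitly. The function name auto_workers is hypothetical, and the floor of 1 on a single-core machine is an assumption, since the text above does not say what happens there.

```shell
# auto_workers [cores]: print the worker count the documented "auto" rule
# (CPU cores - 1) would give. With no argument the core count is read from
# getconf. The floor of 1 is an assumption, not documented behaviour.
auto_workers() {
    local cores="${1:-$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)}"
    if [ "$cores" -gt 1 ]; then
        echo $((cores - 1))
    else
        echo 1
    fi
}
```

Usage: marcattacks --workers "$(auto_workers)" --to json ./data/sample.xml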

Threading only helps when the map is the bottleneck — i.e. a heavy, interpreted JSONata transform (--param fix=...jsonata), where it scales to roughly 1.8× (capped by main-thread coordination, not the map). For cheap maps the per-record cost of shipping records to/from threads outweighs the work, so the auto default only threads maps that actually benefit:

  • jsonata opts in — auto threads it (this is also the default map).
  • The Fix mapper (--map fix) is compiled and runs at ~100k+ rec/s, so it is almost never the bottleneck; auto leaves it single-threaded. (You can still force threads with an explicit --workers <n>, but it rarely helps.)
  • Any other map (no createMapper) always runs single-threaded; an explicit --workers <n> on such a map is ignored with a warning.

For the cheap-map cases the bottleneck is the reader/writer, not the map. The biggest lever is the input reader: prefer --from fastxml over the default sax xml reader (roughly 2× on MARCXML). For example, with a Fix map:

marcattacks --from fastxml --to jsonl --map fix --param fix=./demo/marc2rdf.fix input.xml.gz

Rule of thumb: heavy jsonata → keep the auto default (or set --workers <n>); fix / cheap maps → --from fastxml (the auto default already keeps them single-threaded, so no --workers flag is needed).

Writable (--out)

  • default: stdout
  • file path
  • sftp://username@host:port/path
  • s3://accessKey:secretKey@host:port/bucket/key (or s3s://)

S3 object ACL (--acl)

When writing to an s3:// (or s3s://) destination you can set a canned ACL on the uploaded object with the --acl option. E.g. to make the output publicly readable:

marcattacks input.xml -o s3://accessKey:secretKey@host:port/bucket/key.json --acl public-read

The ACL is left unset by default. Note that public-read only takes effect on buckets where ACLs are enabled (Object Ownership "Bucket owner preferred"); on buckets with ACLs disabled the request is rejected and you should use a bucket policy instead.

Logging (--info,--debug,--trace,--log)

Log messages are enabled with the --info, --debug and --trace options.

By default, logs are written as plain text to stderr. The format and the output stream can be changed with the --log option:

  • --log json : write logs in JSON format
  • --log stdout : write logs to stdout
  • --log json+stdout : write logs in JSON format to stdout

Compression (--z,--tar)

Gzip compression and tar archiving of input files are detected automatically from the file name extension. If the file name has no such extension, set these flags to force decompression or unpacking:

  • --z : the input file is gzipped
  • --tar : the input file is tarred

Exit codes

marcattacks (and globtrotr) use semantic exit codes following the BSD sysexits.h conventions, so set -o pipefail scripts can react to why a run failed:

Code  Name       Meaning
0     OK         Success. Also a benign stop: the downstream reader closed the pipe (| head, quitting | less) or --count reached its limit
64    USAGE      Bad invocation: missing input file, missing --from, an unknown --from/--to/--map plugin name, an unsupported URL scheme
65    DATAERR    The input could not be parsed (malformed XML/JSON/MARC record)
66    NOINPUT    The input file / object / @latest target was not found
70    SOFTWARE   Internal error: a worker thread crashed, or a plugin file failed to load (syntax/runtime error)
73    CANTCREAT  The output could not be created (--out file or S3 object)
74    IOERR      A read/write/connection failure mid-stream (dropped connection, premature close)
76    PROTOCOL   A remote protocol error (HTTP 4xx/5xx, too many redirects)
77    NOPERM     Permission denied
78    CONFIG     A configuration / credential error

Note: when output is piped to a pager (| less) and you quit mid-stream, marcattacks restores the terminal and exits via SIGKILL (status 137) to keep the terminal usable — this is unavoidable for the interactive raw-mode case. A plain | head or a piped/cron run is detected as benign and exits 0.
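A wrapper script could branch on these codes roughly as follows. This is a sketch in Bash: the function names sysexit_name, is_retryable and run_export are hypothetical, and treating only IOERR (74) as worth retrying is an assumed policy, not something marcattacks prescribes.

```shell
# Map an exit status to its sysexits.h name, using the codes listed above.
sysexit_name() {
    case "$1" in
        0)  echo OK ;;
        64) echo USAGE ;;
        65) echo DATAERR ;;
        66) echo NOINPUT ;;
        70) echo SOFTWARE ;;
        73) echo CANTCREAT ;;
        74) echo IOERR ;;
        76) echo PROTOCOL ;;
        77) echo NOPERM ;;
        78) echo CONFIG ;;
        *)  echo UNKNOWN ;;
    esac
}

# Assumed policy: only a mid-stream I/O failure is considered transient.
is_retryable() {
    [ "$1" -eq 74 ]
}

# Run marcattacks with the given arguments, report why it failed on stderr,
# and pass the original exit status through to the caller.
run_export() {
    local status=0
    marcattacks "$@" || status=$?
    if [ "$status" -ne 0 ]; then
        echo "marcattacks failed: $(sysexit_name "$status") ($status)" >&2
    fi
    return "$status"
}
```

Usage: run_export --to json ./data/sample.xml > out.json || is_retryable $? && echo "transient failure, retry later" >&2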

Environment Variables

SFTP and S3 credentials can be set using environment variables or a local .env file. Credentials embedded in the URL take precedence; these variables are only used as a fallback when the URL omits them. Available variables:

  • SFTP_USERNAME
  • SFTP_PASSWORD
  • S3_ACCESS_KEY
  • S3_SECRET_KEY

An SFTP private key can be provided with the --key-env command-line option. For example, --key-env PRIVATE_KEY reads the key from the PRIVATE_KEY environment variable.
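The precedence rule for the username can be illustrated with a small Bash function. This is only a sketch of the documented behaviour (URL first, environment as fallback), not the actual marcattacks implementation, and the name sftp_user is hypothetical.

```shell
# sftp_user <url>: print the username that the documented precedence rule
# selects: the one embedded in the URL if present, else $SFTP_USERNAME.
sftp_user() {
    local rest="${1#*://}"          # drop the scheme, e.g. "sftp://"
    local authority="${rest%%/*}"   # keep [user[:password]@]host[:port]
    case "$authority" in
        *@*)
            local userinfo="${authority%@*}"
            echo "${userinfo%%:*}"  # strip an optional ":password"
            ;;
        *)
            echo "${SFTP_USERNAME:-}"
            ;;
    esac
}
```

With SFTP_USERNAME=batch set, sftp_user sftp://alice@hostname:22/remote/path selects alice, while sftp_user sftp://hostname:22/remote/path falls back to batch.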

Discover files at a (remote) endpoint

Find all files that end with xml on an sftp site:

npx globtrotr --key ~/.ssh/mykey sftp://username@hostname:port/remote/path/@glob:xml

Or, for an S3 site:

npx globtrotr s3s://accessKey:secretKey@hostname:port/bucket/@glob:xml
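The discovered files could then be converted one by one. The sketch below rests on two assumptions that should be checked against the actual tool: that globtrotr prints one matching URL per line on stdout, and that credentials come from the environment variables rather than from the URL. The function name convert_all is hypothetical.

```shell
# convert_all <glob-url> [extra marcattacks options...]
# Assumption: globtrotr prints one matching URL per line on stdout.
convert_all() {
    local glob_url="$1"
    shift
    npx globtrotr "$glob_url" | while IFS= read -r url; do
        # </dev/null keeps marcattacks from consuming the URL list on stdin
        npx marcattacks "$@" --to jsonl "$url" </dev/null
    done
}
```

Usage: convert_all s3s://hostname:port/bucket/@glob:xml --from fastxml > all.jsonl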

Concatenate files

Some formats, such as jsonl, allow output to be concatenated. With a Bash command group, the output of several marcattacks runs can be combined and piped into a final run:

#!/bin/bash

# Example how to process files in sequence and concatenate the output
{
    npx marcattacks --from alephseq --to jsonl data/one.alephseq
    npx marcattacks --from xml --to jsonl data/sample.tar
    npx marcattacks --from xml --to jsonl data/sample.tar.gz
    npx marcattacks --from xml --to jsonl data/sample.xml.gz
    npx marcattacks --from xml --to jsonl data/sample.xml
} | npx marcattacks --from jsonl --to xml stdin://
