-
Notifications
You must be signed in to change notification settings - Fork 14
Importing MusicBrainz data into Neo4J (work in progress)
So you want to import the awesome MusicBrainz database into the awesome Neo4J graph database? Here's one way to do it.
MusicBrainz database has several "core entities" (see http://musicbrainz.org/statistics)
| Core entities | as of 2013-08-30 |
|---|---|
| Artists | 780,259 |
| Release Groups | 977,377 |
| Releases | 1,185,344 |
| Mediums | 1,316,108 |
| Recordings | 12,624,038 |
| Tracks | 15,146,240 |
| Labels | 73,504 |
| Works | 414,645 |
| URLs | 1,833,144 |
| Areas | 22,969 |
I personally use https://bitbucket.org/lalinsky/mbslave with Postgresl.
And for this write-up, I downloaded http://mirrors.dotsrc.org/MusicBrainz/data/fullexport/20130824-033012/mbdump.tar.bz2 (thus from the EU mirror, this is big, 1,5+GB). See http://musicbrainz.org/doc/MusicBrainz_Database/Download for details on how and where to fetch the data.
Follow the instuctions and you should end up with these tables and sizes close to the following (I only imported mbdump.tar.bz2):
| Schema | Name | Type | Size |
|---|---|---|---|
| musicbrainz | annotation | table | 8192 bytes |
| musicbrainz | application | table | 8192 bytes |
| musicbrainz | area | table | 2320 kB |
| musicbrainz | area_alias | table | 1176 kB |
| musicbrainz | area_alias_type | table | 16 kB |
| musicbrainz | area_annotation | table | 0 bytes |
| musicbrainz | area_gid_redirect | table | 8192 bytes |
| musicbrainz | area_type | table | 8192 bytes |
| musicbrainz | artist | table | 71 MB |
| musicbrainz | artist_alias | table | 8120 kB |
| musicbrainz | artist_alias_type | table | 16 kB |
| musicbrainz | artist_annotation | table | 0 bytes |
| musicbrainz | artist_credit | table | 36 MB |
| musicbrainz | artist_credit_name | table | 50 MB |
| musicbrainz | artist_deletion | table | 8192 bytes |
| musicbrainz | artist_gid_redirect | table | 1512 kB |
| musicbrainz | artist_ipi | table | 576 kB |
| musicbrainz | artist_isni | table | 144 kB |
| musicbrainz | artist_meta | table | 0 bytes |
| musicbrainz | artist_name | table | 80 MB |
| musicbrainz | artist_rating_raw | table | 0 bytes |
| musicbrainz | artist_tag | table | 0 bytes |
| musicbrainz | artist_tag_raw | table | 0 bytes |
| musicbrainz | artist_type | table | 8192 bytes |
| musicbrainz | autoeditor_election | table | 0 bytes |
| musicbrainz | autoeditor_election_vote | table | 0 bytes |
| musicbrainz | cdtoc | table | 87 MB |
| musicbrainz | cdtoc_raw | table | 8192 bytes |
| musicbrainz | clientversion | table | 48 kB |
| musicbrainz | country_area | table | 40 kB |
| musicbrainz | edit | table | 8192 bytes |
| musicbrainz | edit_area | table | 0 bytes |
| musicbrainz | edit_artist | table | 0 bytes |
| musicbrainz | edit_label | table | 0 bytes |
| musicbrainz | edit_note | table | 8192 bytes |
| musicbrainz | edit_recording | table | 0 bytes |
| musicbrainz | edit_release | table | 0 bytes |
| musicbrainz | edit_release_group | table | 0 bytes |
| musicbrainz | edit_url | table | 0 bytes |
| musicbrainz | edit_work | table | 0 bytes |
| musicbrainz | editor | table | 8192 bytes |
| musicbrainz | editor_collection | table | 8192 bytes |
| musicbrainz | editor_collection_release | table | 0 bytes |
| musicbrainz | editor_language | table | 0 bytes |
| musicbrainz | editor_oauth_token | table | 8192 bytes |
| musicbrainz | editor_preference | table | 0 bytes |
| musicbrainz | editor_subscribe_artist | table | 0 bytes |
| musicbrainz | editor_subscribe_artist_deleted | table | 0 bytes |
| musicbrainz | editor_subscribe_collection | table | 0 bytes |
| musicbrainz | editor_subscribe_editor | table | 0 bytes |
| musicbrainz | editor_subscribe_label | table | 0 bytes |
| musicbrainz | editor_subscribe_label_deleted | table | 0 bytes |
| musicbrainz | editor_watch_artist | table | 0 bytes |
| musicbrainz | editor_watch_preferences | table | 0 bytes |
| musicbrainz | editor_watch_release_group_type | table | 0 bytes |
| musicbrainz | editor_watch_release_status | table | 0 bytes |
| musicbrainz | gender | table | 8192 bytes |
| musicbrainz | iso_3166_1 | table | 40 kB |
| musicbrainz | iso_3166_2 | table | 240 kB |
| musicbrainz | iso_3166_3 | table | 8192 bytes |
| musicbrainz | isrc | table | 22 MB |
| musicbrainz | iswc | table | 3360 kB |
| musicbrainz | l_area_area | table | 1304 kB |
| musicbrainz | l_area_artist | table | 0 bytes |
| musicbrainz | l_area_label | table | 0 bytes |
| musicbrainz | l_area_recording | table | 0 bytes |
| musicbrainz | l_area_release | table | 0 bytes |
| musicbrainz | l_area_release_group | table | 0 bytes |
| musicbrainz | l_area_url | table | 2608 kB |
| musicbrainz | l_area_work | table | 40 kB |
| musicbrainz | l_artist_artist | table | 10 MB |
| musicbrainz | l_artist_label | table | 376 kB |
| musicbrainz | l_artist_recording | table | 116 MB |
| musicbrainz | l_artist_release | table | 19 MB |
| musicbrainz | l_artist_release_group | table | 216 kB |
| musicbrainz | l_artist_url | table | 40 MB |
| musicbrainz | l_artist_work | table | 37 MB |
| musicbrainz | l_label_label | table | 424 kB |
| musicbrainz | l_label_recording | table | 464 kB |
| musicbrainz | l_label_release | table | 96 kB |
| musicbrainz | l_label_release_group | table | 0 bytes |
| musicbrainz | l_label_url | table | 3480 kB |
| musicbrainz | l_label_work | table | 1176 kB |
| musicbrainz | l_recording_recording | table | 1968 kB |
| musicbrainz | l_recording_release | table | 40 kB |
| musicbrainz | l_recording_release_group | table | 0 bytes |
| musicbrainz | l_recording_url | table | 856 kB |
| musicbrainz | l_recording_work | table | 62 MB |
| musicbrainz | l_release_group_release_group | table | 312 kB |
| musicbrainz | l_release_group_url | table | 14 MB |
| musicbrainz | l_release_group_work | table | 0 bytes |
| musicbrainz | l_release_release | table | 928 kB |
| musicbrainz | l_release_release_group | table | 0 bytes |
| musicbrainz | l_release_url | table | 44 MB |
| musicbrainz | l_release_work | table | 0 bytes |
| musicbrainz | l_url_url | table | 0 bytes |
| musicbrainz | l_url_work | table | 3008 kB |
| musicbrainz | l_work_work | table | 3424 kB |
| musicbrainz | label | table | 6872 kB |
| musicbrainz | label_alias | table | 560 kB |
| musicbrainz | label_alias_type | table | 16 kB |
| musicbrainz | label_annotation | table | 0 bytes |
| musicbrainz | label_deletion | table | 8192 bytes |
| musicbrainz | label_gid_redirect | table | 184 kB |
| musicbrainz | label_ipi | table | 64 kB |
| musicbrainz | label_isni | table | 8192 bytes |
| musicbrainz | label_meta | table | 0 bytes |
| musicbrainz | label_name | table | 4344 kB |
| musicbrainz | label_rating_raw | table | 0 bytes |
| musicbrainz | label_tag | table | 0 bytes |
| musicbrainz | label_tag_raw | table | 0 bytes |
| musicbrainz | label_type | table | 8192 bytes |
| musicbrainz | language | table | 448 kB |
| musicbrainz | link | table | 5808 kB |
| musicbrainz | link_attribute | table | 3704 kB |
| musicbrainz | link_attribute_credit | table | 8192 bytes |
| musicbrainz | link_attribute_type | table | 136 kB |
| musicbrainz | link_creditable_attribute_type | table | 48 kB |
| musicbrainz | link_type | table | 136 kB |
| musicbrainz | link_type_attribute_type | table | 40 kB |
| musicbrainz | medium | table | 84 MB |
| musicbrainz | medium_cdtoc | table | 34 MB |
| musicbrainz | medium_format | table | 8192 bytes |
| musicbrainz | medium_index | table | 8192 bytes |
| musicbrainz | puid | table | 397 MB |
| musicbrainz | recording | table | 852 MB |
| musicbrainz | recording_annotation | table | 0 bytes |
| musicbrainz | recording_gid_redirect | table | 60 MB |
| musicbrainz | recording_meta | table | 0 bytes |
| musicbrainz | recording_puid | table | 297 MB |
| musicbrainz | recording_rating_raw | table | 0 bytes |
| musicbrainz | recording_tag | table | 0 bytes |
| musicbrainz | recording_tag_raw | table | 0 bytes |
| musicbrainz | release | table | 117 MB |
| musicbrainz | release_annotation | table | 0 bytes |
| musicbrainz | release_country | table | 40 MB |
| musicbrainz | release_coverart | table | 0 bytes |
| musicbrainz | release_gid_redirect | table | 6312 kB |
| musicbrainz | release_group | table | 71 MB |
| musicbrainz | release_group_annotation | table | 0 bytes |
| musicbrainz | release_group_gid_redirect | table | 6472 kB |
| musicbrainz | release_group_meta | table | 0 bytes |
| musicbrainz | release_group_primary_type | table | 8192 bytes |
| musicbrainz | release_group_rating_raw | table | 0 bytes |
| musicbrainz | release_group_secondary_type | table | 16 kB |
| musicbrainz | release_group_secondary_type_join | table | 11 MB |
| musicbrainz | release_group_tag | table | 0 bytes |
| musicbrainz | release_group_tag_raw | table | 0 bytes |
| musicbrainz | release_label | table | 45 MB |
| musicbrainz | release_meta | table | 0 bytes |
| musicbrainz | release_name | table | 62 MB |
| musicbrainz | release_packaging | table | 8192 bytes |
| musicbrainz | release_raw | table | 8192 bytes |
| musicbrainz | release_status | table | 8192 bytes |
| musicbrainz | release_tag | table | 0 bytes |
| musicbrainz | release_tag_raw | table | 0 bytes |
| musicbrainz | release_unknown_country | table | 1712 kB |
| musicbrainz | replication_control | table | 8192 bytes |
| musicbrainz | script | table | 40 kB |
| musicbrainz | script_language | table | 8192 bytes |
| musicbrainz | tag | table | 0 bytes |
| musicbrainz | tag_relation | table | 0 bytes |
| musicbrainz | track | table | 1340 MB |
| musicbrainz | track_gid_redirect | table | 0 bytes |
| musicbrainz | track_name | table | 406 MB |
| musicbrainz | track_raw | table | 8192 bytes |
| musicbrainz | url | table | 184 MB |
| musicbrainz | url_gid_redirect | table | 208 kB |
| musicbrainz | vote | table | 0 bytes |
| musicbrainz | work | table | 28 MB |
| musicbrainz | work_alias | table | 3768 kB |
| musicbrainz | work_alias_type | table | 16 kB |
| musicbrainz | work_annotation | table | 0 bytes |
| musicbrainz | work_attribute | table | 8192 bytes |
| musicbrainz | work_attribute_type | table | 8192 bytes |
| musicbrainz | work_attribute_type_allowed_value | table | 8192 bytes |
| musicbrainz | work_gid_redirect | table | 6776 kB |
| musicbrainz | work_meta | table | 0 bytes |
| musicbrainz | work_name | table | 30 MB |
| musicbrainz | work_rating_raw | table | 0 bytes |
| musicbrainz | work_tag | table | 0 bytes |
| musicbrainz | work_tag_raw | table | 0 bytes |
| musicbrainz | work_type | table | 8192 bytes |
We're not interested in all tables but nearly everything is interesting.
One way to bulk-insert data into Neo4J is to use Michael Hunger's batch-import https://github.com/jexp/batch-import
After having played with a single nodes.csv (for nodes) and a single rels.csv (for relations), I decided to try out a new feature of batch-import: use multiple CSV files for nodes and relations.
MusicBrainz are deeply normalized so that for names for example is not duplicated. For example when you look at the SQL table definitions for artist (in https://github.com/metabrainz/musicbrainz-server/blob/master/admin/sql/CreateTables.sql):
CREATE TABLE artist (
id SERIAL,
gid UUID NOT NULL,
name INTEGER NOT NULL, -- references artist_name.id
sort_name INTEGER NOT NULL, -- references artist_name.id
begin_date_year SMALLINT,
begin_date_month SMALLINT,
begin_date_day SMALLINT,
end_date_year SMALLINT,
end_date_month SMALLINT,
end_date_day SMALLINT,
type INTEGER, -- references artist_type.id
area INTEGER, -- references area.id
gender INTEGER, -- references gender.id
comment VARCHAR(255) NOT NULL DEFAULT '',
edits_pending INTEGER NOT NULL DEFAULT 0 CHECK (edits_pending >= 0),
last_updated TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
ended BOOLEAN NOT NULL DEFAULT FALSE
..
begin_area INTEGER, -- references area.id
end_area INTEGER -- references area.id
);which is linked to the artist_name table for name and sortname
CREATE TABLE artist_name (
id SERIAL,
name VARCHAR NOT NULL
);MusicBrainz schema also defines a few "simple views" (see https://github.com/metabrainz/musicbrainz-server/blob/master/admin/sql/CreateSimpleViews.sql) so that you can "resolve" the name links for example with something like this:
SELECT
a.id,
a.gid,
n.name,
a.begin_date_year,
a.end_date_year,
atype.name,
area.name AS area,
g.name AS gender,
a.comment,
a.ended
FROM artist a
JOIN artist_name n ON a.name=n.id
LEFT JOIN artist_type atype ON a.type=atype.id
LEFT JOIN area ON a.area=area.id
LEFT JOIN gender g ON a.gender=g.id"That's all good but how do we get this data into Neo4J??" I hear you say.
Well it happens Postgresl can very well export CSV files for batch-import directly (TSV files actually, tab-delimited)
Suppose for each MusicBrainz entity we want to store its kind (artist, recording, label...), it's MBID (unique identifier), it's name, it's primary key in the database for reference and various properties depending on the entity, we can write an SQL query like this and feed it to psql (or ./mbslave-psql.sh which is easier):
COPY(
SELECT
'artist' AS kind,
a.id AS pk,
a.gid AS mbid,
n.name AS name,
a.begin_date_year,
a.end_date_year,
atype.name AS type,
area.name AS area,
g.name AS gender,
a.comment,
a.ended
FROM artist a
JOIN artist_name n ON a.name=n.id
LEFT JOIN artist_type atype ON a.type=atype.id
LEFT JOIN area ON a.area=area.id
LEFT JOIN gender g ON a.gender=g.id
)
TO stdout CSV HEADER DELIMITER E'\t';(see http://maxdemarzi.com/2012/02/28/batch-importer-part-1/ and http://maxdemarzi.com/2012/02/28/batch-importer-part-2/ for details)
Redirect this to a new file and you have something like this:
kind pk mbid name begin_date_year end_date_year type area gender comment ended
artist 345225 c3d70436-faa2-4669-be5a-f8ba61f3ac29 Ben Christian "" f
artist 708456 1789287b-d124-47c6-80c4-adb0dfdec1e2 Sean Grissom "" f
artist 514886 a7f3c871-3ba3-40b1-ba58-d08b40312789 Uffe Andersen Person "" f
artist 99032 816688b7-6850-4af9-9c63-cebd0ff48546 Genet "" f
artist 689027 61406f75-53b9-4557-b39f-7cd5a5ad1dc7 Pentti Mikkonen Person "" f
...
A nice new feature of batch-import is automatic indexing by adding an index name to the property type: https://github.com/jexp/batch-import#automatic-indexing
Let's modify our SQL query to add these (again for the artist entity):
COPY(
SELECT
'artist' AS "kind:string:mb",
a.id AS pk,
a.gid AS "mbid:string:mbid",
n.name AS "name:string:mb",
a.begin_date_year,
a.end_date_year,
atype.name AS "type:string:mb",
area.name AS area,
g.name AS gender,
a.comment,
a.ended
FROM artist a
JOIN artist_name n ON a.name=n.id
LEFT JOIN artist_type atype ON a.type=atype.id
LEFT JOIN area ON a.area=area.id
LEFT JOIN gender g ON a.gender=g.id
)
TO stdout CSV HEADER DELIMITER E'\t';
Here I use 2 indexes, "mb" and "mbid", as suggested by @peterneubauer in this issue: "mbid" is used to hold all MBIDs for the different entities, and "mb" is used for the other properties that you want to be fulltext-searchable.
And the output file is modifies like this:
kind:string:mb pk mbid:string:mbid name:string:mb begin_date_year end_date_year etype:string:mb area gender comment ended
artist 345225 c3d70436-faa2-4669-be5a-f8ba61f3ac29 Ben Christian "" f
artist 708456 1789287b-d124-47c6-80c4-adb0dfdec1e2 Sean Grissom "" f
artist 514886 a7f3c871-3ba3-40b1-ba58-d08b40312789 Uffe Andersen Person "" f
artist 99032 816688b7-6850-4af9-9c63-cebd0ff48546 Genet "" f
artist 689027 61406f75-53b9-4557-b39f-7cd5a5ad1dc7 Pentti Mikkonen Person "" f
...
I've prepared SQL queries for all MusicBrainz core entities at https://github.com/redapple/sql2graph/tree/multi_nodescsv/examples/musicbrainz/sql.autoindex
You can run all these queries with an script (also proposed by @peterneubauer) that you can find (and adapt to your paths) at https://github.com/redapple/sql2graph/blob/multi_nodescsv/examples/musicbrainz/export_mb_to_csv.autoindex.sh.default