A local Streamlit web application for curating the SSH Open Marketplace (SSHOMP). It reads data from a local sshompitor snapshot for fast offline analysis and writes back to the live Marketplace API for any changes.
- Overview
- Prerequisites
- Setup
- Running the app
- Login & environments
- Tool reference
- Data freshness
- API reference
- Library reference
- Architecture
- Caveats & known limitations
- Troubleshooting
The SSH Open Marketplace accumulates data quality issues over time: duplicate actors, broken access URLs, orphaned entries, and keywords that duplicate concepts already defined in controlled vocabularies. These issues are difficult to spot and fix through the standard Marketplace web interface.
This toolkit provides a set of purpose-built curation screens that:
- Load the full Marketplace dataset from a local snapshot for fast, offline analysis
- Verify findings against the live API before making any changes
- Write changes back through the official Marketplace REST API with appropriate safeguards
- Record every API write call and major action in an exportable session log
The toolkit is intended for Marketplace moderators and administrators. All write operations require a valid account token and respect the API's own permission checks.
| Requirement | Notes |
|---|---|
| Python 3.10+ | Tested on 3.10–3.12 |
| Marketplace account | Moderator or administrator role required for write operations |
No local clone of sshompitor is required. The sshmarketplacelib Python package is installed directly from GitHub as part of the normal dependency install, and a bundled config.yaml provides the configuration it needs. Snapshot data is downloaded on first use from the Data Source page.
pip install -r requirements.txtThis installs all required packages including sshmarketplacelib (pulled directly from the sshompitor GitHub repository):
streamlit>=1.35, pandas, numpy, pyyaml, pillow, requests, fastparquet, sshmarketplacelib
python3 setup.pyCreates the data/ directory where snapshot files will be stored. No symlinks or local sshompitor clone required.
Start the app, log in, and use the Data Source page to download the current snapshot from GitHub or fetch fresh data from the API.
streamlit run app.pyOpen http://localhost:8501 in a browser. After login you are taken directly to the Data Source page.
The login screen lets you choose between Production and Stage before signing in.
| Environment | API base URL |
|---|---|
| Production | https://marketplace-api.sshopencloud.eu |
| Stage | https://sshoc-marketplace-api-stage.acdh-dev.oeaw.ac.at |
Credentials are sent via POST /api/auth/sign-in. On success the Authorization header from the response is stored as a bearer token in session state and attached to all subsequent API requests.
Navigating directly to any tool page without being logged in redirects back to the login screen.
Recommendation: always test changes on Stage first. The Stage environment is a full copy of Production and its data can be reset.
File: pages/1_Data.py
Landing page after login.
Shows the current environment, logged-in user, and all snapshot files in data/. The most recently modified file is the active snapshot used by all tools.
Lists every full_items_*.json file with its date, age, size, environment (Production or Stage), and source (GitHub or API). The environment and source are read from a sidecar full_items_{ts}.meta.json file written whenever a snapshot is created. Files that pre-date this feature show — in those columns.
| Button | What it does |
|---|---|
| Download latest snapshot | Queries the GitHub Contents API for the newest snapshot in the sshompitor repository and streams it to data/. Fast (typically a few seconds). Production data only. |
| Create fresh snapshot | Fetches all items from the live Marketplace API across all five categories and saves a new full_items_{timestamp}.json. Takes several minutes depending on environment. |
Stage environment: the GitHub snapshot archive only contains Production data. When working on Stage, always use Create fresh snapshot to build a snapshot from the Stage API. The UI highlights this option as primary and shows a reminder banner when Stage is selected.
After either operation all page caches are cleared automatically and the page reloads.
If multiple snapshots are present an expander lists older files with individual Delete buttons.
File: pages/2_Actors.py
All actor-related curation in one place. Actors are the people and organisations credited on Marketplace items. "Contributor" is one of the roles an actor can have; the term "actor" is used throughout this page.
A shared Load actors from API bar at the top of the page fetches all ~9,000 actors from GET /api/actors and stores them in st.session_state["actor_details"]. This data is used across all three tabs and only needs to be fetched once per session.
Displays every actor × item appearance from the snapshot as a filterable table (one row per actor–role–item combination). Built from Util.getContributors().
Filters (applied together):
| Filter | Behaviour |
|---|---|
| Category | Multiselect; defaults to all |
| Role | Multiselect; defaults to all |
| Actor name | Case-insensitive substring match |
| Case-insensitive substring match; available only after actor data is loaded |
Metrics show the number of rows, unique actors, and unique items matching the current filters. Results can be downloaded as CSV.
Finds actors that share a name or website value. These are common when items are imported from different sources with slightly different contributor metadata.
If actor data has been loaded from the API, groups are classified by confidence:
| Confidence | Condition |
|---|---|
| High (★) | At least two actors in the group share the same non-empty email address |
| Low | Name match only; email is absent or different across actors |
High-confidence groups are shown first under Suggested merges and open expanded by default. Low-confidence groups appear under Other name matches.
- Select which field(s) to match on:
name,website. - Click Find Duplicates.
Results are stored in session state and persist while navigating between tabs.
For each duplicate group an expander shows a table of actors with IDs, item counts, example item links, and emails (if loaded). Select which actor to keep using the Keep this actor selectbox, then click Merge.
The merge performs three steps to preserve all data:
GET /api/actors/{id}for every actor in the group to retrieve full recordsPUT /api/actors/{keep_id}with a consolidated payload — email, website, and externalIds are unioned across all actors (keep actor's values take priority, falling back to the first non-empty value from the others); affiliations are kept from the keep actor onlyPOST /api/actors/{keep_id}/merge?with={other_ids}to transfer item associations and delete the merged actors
This three-step approach ensures no email addresses, websites, or external identifiers are silently lost when actors are merged.
The merge button is disabled after a successful merge (tracked in session state) so it cannot be clicked twice.
Finds actors that exist in the Marketplace but are not credited on any item, and allows bulk deletion.
Step 1 — Cross-reference with snapshot
Actors absent from the local snapshot are flagged as candidates. A stale snapshot may produce false positives.
Step 2 — Verify candidates via the live API
For each candidate calls GET /api/actors/{id}?items=true to confirm the API itself considers the actor item-free. Requests run in configurable batches (default: 50 concurrent). Each request retries up to three times on 5xx, timeout, and connection errors. Actors that cannot be confirmed are classified as uncertain and excluded from deletion.
Step 3 — Review and delete
Confirmed orphans are shown in an editable table. All rows start unchecked; actors must be selected explicitly. Deletion calls DELETE /api/actors/{id}?force=false — actors affiliated with other actors are refused by the API and reported individually.
File: pages/3_Item_Duplicates.py
Scans snapshot items for shared values in selected fields.
Items without an accessibleAt URL are excluded before the search. These are stub or incomplete entries that cannot be meaningfully distinguished from each other by URL and would produce noise in the results. The count of excluded items is shown when the search runs.
How to use:
- Select one or more categories.
- Choose which fields to check:
label,description,accessibleAt. - Click Find Duplicates.
Results persist in session state while navigating the page.
Shows matched items with label, category, persistentId, and a clickable link to each item in the Marketplace. Downloadable as CSV.
Below the summary table each duplicate group appears as a collapsible expander. By default it shows snapshot data (label, category, ID, MP link). Click Fetch live data from API to load the full current record for each item and display them side by side. Each card shows label, status, source system, access URLs, contributor names, and description. Fetched data is cached per group in session state so it is not re-fetched when other controls are interacted with.
Note:
accessibleAtis a list value and cannot be used directly as a groupby key. It is converted to a comma-separated string for grouping purposes; the original list values are preserved in the data.
File: pages/4_URL_Checker.py
Checks whether URLs extracted from the snapshot are reachable.
| Mode | What is checked |
|---|---|
accessibleAt only |
The primary access URL(s) for each item |
All URLs in entry |
Every http(s) URL in any field (thumbnails, media, external IDs, descriptions, …) |
- Select categories and scope, then click Extract URLs. Scans the snapshot and reports unique URL count — no HTTP requests yet.
- Set Timeout and Parallel workers, then click Check URLs.
- Each URL receives an HTTP
HEADrequest. Servers returning405or501are retried with a streamingGET(body not downloaded).
Sorted by items with the most broken links first, then by item, then broken-before-OK within each item. Filter to Broken only / OK only / All. The same URL appearing on multiple items is checked once and joined to all rows. Downloadable as CSV.
File: pages/5_Keywords.py
Audits and curates the sshoc-keyword vocabulary.
Unlike other Marketplace vocabularies (languages, disciplines, standards), the keyword vocabulary grows organically and is not formally governed. This leads to three classes of quality issues:
- Unused concepts — keywords registered but not applied to any item
- Malformed labels — keywords with leading spaces, no alphabetical characters, or excessive length
- Cross-vocabulary duplicates — keywords whose labels already exist as concepts in a controlled vocabulary
The screenshot below illustrates all three classes at once:
Click Load keyword concepts from API to fetch all concepts in the sshoc-keyword vocabulary from GET /api/concept-search?types=keyword. Keywords are also extracted from the local snapshot via Util.getAllProperties() to compute usage counts.
Concepts in the vocabulary not referenced by any item in the current snapshot.
Filters: free-text search on label or code; Candidates only toggle.
Deletion: tick Delete? for one or more rows. Selected rows are immediately highlighted in red below the table as a confirmation preview. Confirm the checkbox and click Delete. Calls DELETE /api/vocabularies/sshoc-keyword/concepts/{code}?force=true.
force=true is required because the API keeps a full version history of every item. Older revisions may still reference a keyword even after it has been removed from the current version; the API counts those historical references as active and refuses a plain delete. force=true removes the concept and purges all historical references.
Delete results persist visually until the next delete operation — a success message or per-concept error appears above the table and is not cleared by page reruns. 403 Forbidden errors are explained explicitly: the concept may be a system/protected entry or the account may lack vocabulary administrator privileges.
All concepts appearing on at least one item, sorted by usage count descending.
Quality filters (independent, combinable):
| Filter | Detects |
|---|---|
| Starts with space | Labels like " machine learning" that sort and display incorrectly |
| No letters | Labels composed entirely of numbers or symbols |
| Longer than N characters | Labels that are suspiciously long — often pasted description text |
Near-duplicate detection
Below the quality filters a Near-duplicate keywords section groups concepts that are identical after normalisation:
- Labels are lowercased and stripped
- Hyphens, underscores, slashes, and backslashes are replaced with spaces
- Multiple spaces are collapsed
This catches cases like Machine Learning / machine learning / machine-learning which are three separate vocabulary entries for the same concept.
Each group of near-duplicates appears in an expander showing all variants with their usage counts. A Merge variants control inside the expander lets you:
- Select which variant to keep
- Review the list of affected items (items that use one of the non-kept variants)
- Confirm and click Merge
The merge re-points all affected items to the kept concept (using GET → modify property → PUT per item) and then deletes the merged variants with force=true. Both steps are logged in the session log.
Finds used keywords whose label also exists as a concept in another Marketplace vocabulary. A common high-volume case is language names entered as keywords:
Only used keywords are shown here — unused ones belong in Tab 1 and cannot be fixed here because fixing requires updating item properties.
How to use:
- Click Load all concepts from API (~15 000) to fetch all concepts across all vocabularies.
- The tool performs a case-insensitive label match between
sshoc-keywordconcepts and other-vocabulary concepts. - Use the concept type filter to narrow results.
- Select a keyword → target concept pair, review the affected items, and click Fix N item(s).
For each affected item: GET the full payload, replace the matching property (type keyword, matching concept code) with the target type and concept, PUT the modified payload back.
Complete sshoc-keyword vocabulary with usage counts. Searchable by label or code.
File: pages/6_Session_Log.py
Records every significant action and API write call made during the current session.
| Entry type | Examples |
|---|---|
action |
Login, snapshot download, actor fetch, orphan verification summary |
api |
DELETE /api/actors/{id}, POST /api/actors/{id}/merge, PUT /api/{path}/{id}, DELETE /api/vocabularies/…/concepts/{code} |
Every API write entry records: timestamp, HTTP method, full URL, request summary, HTTP status code, and an abbreviated response. API error responses in the standard JSON envelope ({"status": 403, "error": "Forbidden", "path": "…"}) are automatically collapsed to a readable Forbidden — path: /api/… form in the Response column.
Columns are ordered to surface the most useful information first: time → type → ok → method → status → response → description → url → request. The wide URL and request body columns are at the right so they do not crowd out the response.
Filter by entry type (action / api), result (ok / failed), and free-text search across description, URL, request, and response.
- Export filtered as CSV — the current filtered view
- Export full log as JSON — the complete unfiltered log as a JSON array
Clear log resets the in-session log. The log does not persist across browser sessions or server restarts.
Every page shows a Session log: N entries count in the sidebar once entries exist.
The local snapshot is a JSON export of the full Marketplace catalogue. A data age badge appears in the sidebar of every page:
- Green — snapshot is less than 3 days old
- Orange — snapshot is 3 or more days old
If no snapshot is found when navigating to any tool page, a full-page prompt offers GitHub download or API creation to recover without navigating away.
A stale snapshot affects: orphaned actor detection, unused keyword detection, actor and item duplicate detection, and cross-vocab duplicate coverage. Refresh before starting a curation session if the snapshot is more than a day old.
Stage environment: the GitHub archive contains Production snapshots only. When working on Stage, use Create fresh snapshot on the Data Source page.
All write operations target the environment selected at login.
| Operation | Method | Endpoint | Notes |
|---|---|---|---|
| Login | POST | /api/auth/sign-in |
Returns bearer token in Authorization header |
| List actors | GET | /api/actors?perpage=100&page={n} |
~95 pages for full list |
| Get actor | GET | /api/actors/{id} |
Full record including externalIds, affiliations |
| Check actor items | GET | /api/actors/{id}?items=true |
Authoritative orphan check |
| Update actor | PUT | /api/actors/{id} |
Used before merge to consolidate attributes |
| Merge actors | POST | /api/actors/{id}/merge?with={ids} |
Comma-separated IDs to absorb |
| Delete actor | DELETE | /api/actors/{id}?force=false |
Refused if actor has affiliations |
| Get item | GET | /api/{category-path}/{persistentId} |
Full item payload |
| Update item | PUT | /api/{category-path}/{persistentId} |
Requires full item payload |
| List keyword concepts | GET | /api/concept-search?types=keyword&perpage=100 |
~27 pages |
| List all concepts | GET | /api/concept-search?perpage=100 |
~152 pages; all vocabularies |
| Delete concept | DELETE | /api/vocabularies/{vocab}/concepts/{code}?force=true |
Also clears historical item-version references |
| Fetch category items | GET | /api/{category-path}?perpage=20&page={n} |
Used when creating a fresh snapshot; small pages reduce per-request timeouts |
Concept codes arrive from the API as plain JSON string values — in JSON + is a literal plus sign, not a space. Before building a DELETE URL the toolkit passes the raw code through urllib.parse.quote(code, safe="") without any prior decoding step. This encodes + as %2B, % as %25, and other special characters accordingly, so the server decodes the path back to the original code string stored in the database. An earlier approach used unquote_plus before quote, which incorrectly treated + as a space and produced %20 in the URL, causing 404 responses for any concept whose code contains a plus sign (e.g. 10+languages).
Snapshot category |
API path segment |
|---|---|
tool-or-service |
tools-services |
training-material |
training-materials |
dataset |
datasets |
publication |
publications |
workflow |
workflows |
step |
steps |
Defines the two target environments (Production and Stage) with their API base URLs and frontend URLs. Add new entries here to make additional environments selectable at login.
try_login(username, password, env_name) — posts credentials to /api/auth/sign-in and returns the bearer token string, or None on failure.
require_login() — called at the top of every page. Redirects to the login page and stops rendering if the session is unauthenticated.
get_util() — returns a cached Util instance (@st.cache_resource). Tries the pip-installed sshmarketplacelib package first; falls back to a sibling ../sshompitor/ clone if not installed. Using cache_resource means the snapshot DataFrame is loaded once per server process and shared across browser sessions.
All functions that communicate with the live Marketplace API. Every write function calls log_api() automatically.
| Function | Description |
|---|---|
fetch_all_actors(api_url, bearer) |
Paginated GET /api/actors; returns DataFrame [id, name, email, website, item_count] |
verify_orphans(ids, api_url, bearer, batch_size) |
Batched concurrent GET /api/actors/{id}?items=true with retries; returns dict[id → bool|None] |
delete_actor(actor_id) |
DELETE /api/actors/{id}?force=false |
_get_actor(actor_id, api_url, bearer) |
GET /api/actors/{id}; returns full actor record including externalIds and affiliations |
_consolidate_actor_payload(actors) |
Merges email, website, and externalIds from a list of actor records; used before merge to preserve all attributes |
merge_actors(keep_id, merge_ids) |
3-step: GET all actors → PUT consolidated attributes → POST /api/actors/{id}/merge?with={ids} |
get_item(category, persistent_id, api_url, bearer) |
GET /api/{path}/{id}; returns full item dict |
put_item(category, persistent_id, item_data, api_url, bearer) |
PUT /api/{path}/{id}; logs the call |
fix_item_keyword(category, persistent_id, old_code, new_type, new_concept, api_url, bearer) |
GET item → replace matching keyword property → PUT back |
fetch_all_keyword_concepts(api_url, bearer) |
Paginated GET /api/concept-search?types=keyword |
fetch_all_concepts(api_url, bearer) |
Paginated GET /api/concept-search (all types and vocabularies) |
delete_concept(concept_code, vocab_code) |
DELETE /api/vocabularies/{vocab}/concepts/{code}?force=true; URL-encodes the concept code |
create_snapshot_from_api(api_url, bearer, data_dir, env_label) |
Fetches all items from all 5 categories, saves full_items_{ts}.json and a sidecar full_items_{ts}.meta.json |
get_latest_snapshot_info() — returns (path, age_timedelta, snapshot_datetime) for the newest snapshot.
fetch_latest_from_github() — downloads the newest snapshot from the sshompitor repository; writes a sidecar meta file recording source: github, env_label: Production.
require_snapshot() — stops page rendering with a recovery UI if no snapshot is present.
render_data_status() — renders the age badge, GitHub refresh button, and session log count in the sidebar.
read_snapshot_meta(path) — reads the sidecar *.meta.json for a snapshot file; returns {} if not found.
Maintains a session log in st.session_state["session_log"].
log_action(description, ok) — records a high-level event.
log_api(method, url, description, status, request, response, ok) — records an HTTP API call. API error responses in standard JSON envelope format are automatically formatted into a readable summary.
get_log() / get_log_df() — retrieve the current log as a list or DataFrame.
GitHub (SSHOC/sshompitor)
│ fetch_latest_from_github() → writes .meta.json (source=github, env=Production)
▼
data/full_items_*.json ◄── also created by create_snapshot_from_api() (writes .meta.json with env label)
│
▼
sshmarketplacelib.Util (loaded once, cached via st.cache_resource)
│
├── getContributors() ──► 2_Actors (Browse + Duplicates + Orphaned)
├── _load_snapshot() ──► 3_Item_Duplicates, 4_URL_Checker
├── getDuplicates() ──► 3_Item_Duplicates
├── getDuplicatedActorsWithItems ──► 2_Actors (Duplicates tab)
└── getAllProperties() ──► 5_Keywords
Live Marketplace API ◄──► lib/api.py ◄──► all write operations + read-back for side-by-side compare
│
└── lib/logger.py ──► 6_Session_Log
| File | Page | Role |
|---|---|---|
app.py |
Login | Entry point; redirects to Data Source after login |
pages/1_Data.py |
Data Source | Landing page; snapshot management with environment labels |
pages/2_Actors.py |
Actors | Browse contributions; find and merge duplicates; remove orphaned actors |
pages/3_Item_Duplicates.py |
Item Duplicates | Item field duplicate detection with side-by-side live comparison |
pages/4_URL_Checker.py |
URL Checker | Concurrent URL reachability check |
pages/5_Keywords.py |
Keywords | Keyword vocabulary curation including near-duplicate merge |
pages/6_Session_Log.py |
Session Log | Audit log with export |
| Key | Type | Set by | Used by |
|---|---|---|---|
authenticated |
bool |
Login | All pages |
bearer |
str |
Login | All API calls |
username |
str |
Login | Data Source display |
env |
dict |
Login | All pages |
actor_details |
DataFrame |
Actors page | Actors (all three tabs) |
actor_dup_summary |
DataFrame |
Actors — Duplicates tab | Actors — Duplicates tab |
actor_dup_full |
DataFrame |
Actors — Duplicates tab | Actors — Duplicates tab |
merged_groups |
dict |
Actors — Duplicates tab | Disables merge buttons after merge |
expander_open |
dict |
Actors — Duplicates tab | Keeps expanders open after merge |
orphan_verified |
dict |
Actors — Orphaned tab | Verification results cache |
item_dup_result |
DataFrame |
Item Duplicates | Item Duplicates (survives fetch-button reruns) |
item_dup_props |
list |
Item Duplicates | Item Duplicates |
item_dup_filtered |
int |
Item Duplicates | Count of excluded no-accessibleAt items |
fetched_items |
dict |
Item Duplicates | Live API data per duplicate group |
url_check_input |
DataFrame |
URL Checker | URL Checker |
url_check_results |
DataFrame |
URL Checker | URL Checker |
keyword_vocab |
DataFrame |
Keywords | Keywords (all tabs) |
all_concepts_cache |
DataFrame |
Keywords — Tab 3 | Cross-vocab comparison |
kw_delete_status |
dict |
Keywords — Tab 1 | Persists delete result across reruns |
session_log |
list[dict] |
All pages (via logger) | Session Log |
| Decorator | Used for | Invalidated when |
|---|---|---|
@st.cache_resource |
Util() instance (loads the snapshot once per server process) |
st.cache_resource.clear() after snapshot refresh |
@st.cache_data |
Per-page derived DataFrames (appearances, unique actors, keyword extraction, snapshot load) | st.cache_data.clear() after snapshot refresh, or .clear() on the specific function after a merge |
Snapshot staleness. All analysis that reads from the snapshot reflects the state of the Marketplace at snapshot time. Changes made through the toolkit or by other users will not be visible until the snapshot is refreshed.
Concept delete and item history. The API maintains a full version history of every item. Older revisions may still reference a keyword that was removed from the current version. The API counts those as active and refuses a plain delete — this is why force=true is required.
403 on concept delete. Some vocabulary concepts are system entries or are in a protected state. The API returns HTTP 403 regardless of force=true. The error message displayed identifies this case explicitly. These concepts cannot be deleted via the toolkit.
Concurrent item edits. The keyword fix and near-duplicate keyword merge workflows use GET-modify-PUT. If another user edits the same item concurrently their changes will be overwritten. Use during low-traffic periods and on Stage before Production.
Merge only transfers item associations. POST /api/actors/{id}/merge moves item credits to the kept actor but does not copy email, website, or external identifiers. The toolkit works around this by consolidating attributes in a PUT call before the merge, but only the kept actor's affiliations are retained.
force=false on actor deletion. The API refuses to delete an actor that is the affiliation target of another actor. Such actors must have their affiliation relationship removed first, or be merged into the affiliated actor.
Item PUT requires full payload. Every item update sends the complete object returned by GET with only the target fields modified. Fields absent from the GET response will be absent from the PUT and may be cleared server-side.
ModuleNotFoundError: No module named 'sshmarketplacelib'
Run pip install -r requirements.txt. If pip cannot reach GitHub, install manually:
pip install git+https://github.com/SSHOC/sshompitor.git#egg=sshmarketplacelib.
As a last resort, clone sshompitor into ../sshompitor/ — lib/mplib.py finds it automatically.
No snapshot found
Run python3 setup.py to create the data/ directory, then use the Data Source page to download a snapshot.
Login fails with a network error
The selected API server is unreachable. Check your connection and verify the environment is online. The Stage environment is occasionally offline for maintenance.
GitHub download re-downloads the same file
The local filename must exactly match the GitHub filename (including the Unix timestamp). If the file was renamed the comparison fails and the file is re-downloaded.
Orphan verification produces many uncertain results
The API is under load. Lower the batch size (try 20–30) and re-run. Uncertain actors are excluded from deletion.
Concept delete fails with 403
This is a server-side permissions decision. The concept may be a system entry, or the logged-in account may lack vocabulary administrator privileges. The error is shown explicitly in the UI and in the Session Log.
Concept delete returns 404 after the URL encoding fix
Verify the concept still exists in the vocabulary — it may have already been deleted in a previous session or by another user.
Item PUT returns HTTP 422 after keyword fix
The target concept's type_code is not valid for this property type in the Marketplace schema. Review the type in the cross-vocabulary match.
After merging actors the group still appears in results
The snapshot has not been refreshed. The merge is complete on the server. Use the Data Source page to download a fresh snapshot, then re-run duplicate detection.
Session log is empty
The log is in-memory only and resets when the server restarts or Clear log is clicked. Export before closing the browser if you need a record.


