diff --git a/README.md b/README.md index f6397ac..710ddc4 100644 --- a/README.md +++ b/README.md @@ -1,3 +1,223 @@ # ArcFlow -Code for exporting data from ArchivesSpace to ArcLight, along with additional utility scripts for data handling and transformation. \ No newline at end of file +Code for exporting data from ArchivesSpace to ArcLight, along with additional utility scripts for data handling and transformation. + +## Quick Start + +This directory contains a complete, working installation of arcflow with creator records support. To run it: + +```bash +# 1. Install dependencies +pip install -r requirements.txt + +# 2. Configure credentials +cp .archivessnake.yml.example .archivessnake.yml +nano .archivessnake.yml # Add your ArchivesSpace credentials + +# 3. Set environment variables +export ARCLIGHT_DIR=/path/to/your/arclight-app +export ASPACE_DIR=/path/to/your/archivesspace +export SOLR_URL=http://localhost:8983/solr/blacklight-core + +# 4. Run arcflow +python -m arcflow.main + +``` + +--- + +## Features + +- **Collection Indexing**: Exports EAD XML from ArchivesSpace and indexes to ArcLight Solr +- **Creator Records**: Extracts creator agent information and indexes as standalone documents +- **Biographical Notes**: Injects creator biographical/historical notes into collection EAD XML +- **PDF Generation**: Generates finding aid PDFs via ArchivesSpace jobs +- **Incremental Updates**: Supports modified-since filtering for efficient updates + +## Creator Records + +ArcFlow now generates standalone creator documents in addition to collection records. 
Creator documents: + +- Include biographical/historical notes from ArchivesSpace agent records +- Link to all collections where the creator is listed +- Can be searched and displayed independently in ArcLight +- Are marked with `is_creator: true` to distinguish from collections +- Must be fed into a Solr instance with fields to match their specific facets (See: Configure Solr Schema below) + +### Agent Filtering + +**ArcFlow automatically filters agents to include only legitimate creators** of archival materials. The following agent types are **excluded** from indexing: + +- ✗ **System users** - ArchivesSpace software users (identified by `is_user` field) +- ✗ **System-generated agents** - Auto-created for users (identified by `system_generated` field) +- ✗ **Software agents** - Excluded by not querying the `/agents/software` endpoint +- ✗ **Repository agents** - Corporate entities representing the repository itself (identified by `is_repo_agent` field) +- ✗ **Donor-only agents** - Agents with only the 'donor' role and no creator role + +**Agents are included if they meet any of these criteria:** + +- ✓ Have the **'creator' role** in linked_agent_roles +- ✓ Are **linked to published records** (and not excluded by filters above) + +This filtering ensures that only legitimate archival creators are discoverable in ArcLight, while protecting privacy and security by excluding system users and donors. + +### How Creator Records Work + +1. **Extraction**: `get_all_agents()` fetches all agents from ArchivesSpace +2. **Filtering**: `is_target_agent()` filters out system users, donors, and non-creator agents +3. **Processing**: `task_agent()` generates an EAC-CPF XML document for each target agent with bioghist notes +4. **Linking**: Handled via Solr using the persistent_id field (agents and collections linked through bioghist references) +5. 
**Indexing**: Creator XML files are indexed to Solr using `traject_config_eac_cpf.rb` + +### Creator Document Format + +Creator documents are stored as XML files in `agents/` directory using the ArchivesSpace EAC-CPF export: + +```xml + + + + + + corporateBody + + Core: Leadership, Infrastructure, Futures + local + + + + + 2020- + + +

Founded on September 1, 2020, the Core: Leadership, Infrastructure, Futures division of the American Library Association has a mission to cultivate and amplify the collective expertise of library workers in core functions through community building, advocacy, and learning. + In June 2020, the ALA Council voted to approve Core: Leadership, Infrastructure, Futures as a new ALA division beginning September 1, 2020, and to dissolve the Association for Library Collections and Technical Services (ALCTS), the Library Information Technology Association (LITA) and the Library Leadership and Management Association (LLAMA) effective August 31, 2020. The vote to form Core was 163 to 1.(1)

+ 1. "ALA Council approves Core; dissolves ALCTS, LITA and LLAMA," July 1, 2020, http://www.ala.org/news/member-news/2020/07/ala-council-approves-core-dissolves-alcts-lita-and-llama. +
+
+ +
+
+``` + +### Indexing Creator Documents + +#### Configure Solr Schema (Required Before Indexing) + +⚠️ **CRITICAL PREREQUISITE** - Before you can index creator records to Solr, you must configure the Solr schema. + +**See [SOLR_SCHEMA.md](SOLR_SCHEMA.md) for complete instructions on:** +- Which fields to add (is_creator, creator_persistent_id, etc.) +- Three methods to add them (Schema API recommended, managed-schema, or schema.xml) +- How to verify they're added +- Troubleshooting "unknown field" errors + +**Quick Schema Setup (Schema API method):** +```bash +# Add is_creator field +curl -X POST -H 'Content-type:application/json' \ + http://localhost:8983/solr/blacklight-core/schema \ + -d '{"add-field": {"name": "is_creator", "type": "boolean", "indexed": true, "stored": true}}' + +# Add other required fields (see SOLR_SCHEMA.md for complete list) +``` + +**Verify schema is configured:** +```bash +curl "http://localhost:8983/solr/blacklight-core/schema/fields/is_creator" +# Should return field definition, not 404 +``` + +⚠️ **If you skip this step, you'll get:** +``` +ERROR: [doc=creator_corporate_entities_584] unknown field 'is_creator' +``` + +This is a **one-time setup** per Solr instance. + +--- + +### Traject Configuration for Creator Indexing + +The `traject_config_eac_cpf.rb` file defines how EAC-CPF creator records are mapped to Solr fields. + +**Search Order**: arcflow searches for the traject config following the collection records pattern: +1. **arcuit_dir parameter** (if provided via `--arcuit-dir`) - Highest priority, most up-to-date user control +2. **arcuit gem** (via `bundle show arcuit`) - For backward compatibility when arcuit_dir not provided +3. **example_traject_config_eac_cpf.rb** in arcflow - Fallback for module usage without arcuit + +**Example File**: arcflow includes `example_traject_config_eac_cpf.rb` as a reference implementation. 
For production: +- Copy this file to your arcuit gem as `traject_config_eac_cpf.rb`, or +- Specify the location with `--arcuit-dir /path/to/arcuit` + +**Logging**: arcflow logs which traject config file is being used when creator indexing runs. + +To index creator documents to Solr manually: + +```bash +bundle exec traject \ + -u http://localhost:8983/solr/blacklight-core \ + -i xml \ + -c traject_config_eac_cpf.rb \ + /path/to/agents/*.xml +``` + +Or integrate into your ArcFlow deployment workflow. + +## Installation + +See the original installation instructions in your deployment documentation. + +## Configuration + +- `.archivessnake.yml` - ArchivesSpace API credentials +- `.arcflow.yml` - Last update timestamp tracking + +## Usage + +```bash +python -m arcflow.main --arclight-dir /path --aspace-dir /path --solr-url http://... [options] +``` + +### Command Line Options + +Required arguments: +- `--arclight-dir` - Path to ArcLight installation directory +- `--aspace-dir` - Path to ArchivesSpace installation directory +- `--solr-url` - URL of the Solr core (e.g., http://localhost:8983/solr/blacklight-core) + +Optional arguments: +- `--force-update` - Force update of all data (recreates everything from scratch) +- `--traject-extra-config` - Path to extra Traject configuration file +- `--agents-only` - Process only agent records, skip collections (useful for testing agents) +- `--collections-only` - Process only collections (EAD export, finding aid PDFs, and indexing), skip creators +- `--skip-creator-indexing` - Generate EAC-CPF files only, without indexing them into Solr +### Examples + +**Normal run (process all collections and agents):** +```bash +python -m arcflow.main \ + --arclight-dir /path/to/arclight \ + --aspace-dir /path/to/archivesspace \ + --solr-url http://localhost:8983/solr/blacklight-core +``` + +**Process only agents (skip collections):** +```bash +python -m arcflow.main \ + --arclight-dir /path/to/arclight \ + --aspace-dir /path/to/archivesspace \ + 
--solr-url http://localhost:8983/solr/blacklight-core \ + --agents-only +``` + +**Force full update:** +```bash +python -m arcflow.main \ + --arclight-dir /path/to/arclight \ + --aspace-dir /path/to/archivesspace \ + --solr-url http://localhost:8983/solr/blacklight-core \ + --force-update +``` + +See `--help` for all available options. \ No newline at end of file diff --git a/TESTING_PLAN.md b/TESTING_PLAN.md new file mode 100644 index 0000000..37b7fb7 --- /dev/null +++ b/TESTING_PLAN.md @@ -0,0 +1,457 @@ +# Testing and CI/CD Infrastructure Plan for ArcFlow + +This document outlines a phased approach to implementing automated testing and CI/CD infrastructure for the arcflow repository. The goal is two-fold: +1. **Test Coverage**: Ensure code quality and catch regressions +2. **Agent Testing Environment**: Enable Copilot agents to test PRs against infrastructure and iterate without manual intervention on live systems + +## Overview + +The plan is divided into 6 phases, ordered by complexity and ROI (Return on Investment). Early phases provide quick wins and enable future development, while later phases add polish and production-grade features. + +--- + +## Phase 1: Foundation - Basic Testing Infrastructure + +**Effort**: 1-2 days +**ROI**: ⭐⭐⭐⭐⭐ High +**Priority**: Critical + +### Goals +Establish the minimum viable testing infrastructure that enables continuous integration and provides immediate value. 
+ +### Tasks + +#### 1.1 Set up pytest framework +- Install pytest and pytest-cov as development dependencies +- Create `tests/` directory structure: + ``` + tests/ + ├── __init__.py + ├── conftest.py # Shared fixtures + ├── test_utils.py # Utility function tests + └── fixtures/ # Mock data files + ├── mock_aspace_responses.json + └── sample_ead.xml + ``` +- Create `conftest.py` with base fixtures for mocking + +#### 1.2 GitHub Actions CI Workflow +- Create `.github/workflows/ci.yml`: + - Run on: push, pull_request + - Test matrix: Python 3.8, 3.9, 3.10, 3.11 + - Steps: checkout, setup Python, install dependencies, run linters, run tests + - Upload coverage reports to codecov.io + +#### 1.3 Linting Configuration +- Add `flake8` or `pylint` to development dependencies +- Create `.flake8` or `pylintrc` configuration +- Enforce basic style rules (line length, imports, etc.) +- Add linting step to CI workflow + +#### 1.4 Initial Test Suite +Write first 5-10 tests to validate infrastructure: +- `test_xml_escape()` - Test XML escaping utility +- `test_extract_labels()` - Test classification label extraction +- `test_config_file_loading()` - Test YAML configuration parsing +- `test_pid_file_creation()` - Test process locking +- `test_get_ead_id_from_file()` - Test EAD ID extraction + +#### 1.5 Code Coverage Reporting +- Configure pytest-cov to generate coverage reports +- Add coverage badge to README.md +- Set minimum coverage threshold (start low, e.g., 30%, increase over time) + +### Deliverables +- ✅ Tests run automatically on every PR +- ✅ Test failures block PR merges +- ✅ Basic code coverage visibility +- ✅ Linting catches style issues early + +### Why This Matters +This phase provides **immediate feedback** for developers and agents. A failing test or lint check prevents broken code from being merged. This is the foundation for everything else. 
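As a minimal sketch of the first test listed in 1.4, assuming the utility under test is the standard-library `xml.sax.saxutils.escape` that `arcflow/main.py` imports as `xml_escape`. The file name `tests/test_utils.py` comes from the directory tree in 1.1; the second test name is illustrative only.

```python
# tests/test_utils.py (sketch): exercises the stdlib escape helper that
# arcflow/main.py imports as xml_escape. Not the project's actual test file.
from xml.sax.saxutils import escape as xml_escape


def test_xml_escape():
    # &, < and > are the characters escape() replaces by default
    assert xml_escape('Smith & Sons <archive>') == 'Smith &amp; Sons &lt;archive&gt;'


def test_xml_escape_leaves_plain_text_alone():
    # Text without markup characters passes through unchanged
    assert xml_escape('Core: Leadership') == 'Core: Leadership'
```

Running `pytest` from the repository root should collect both functions; tests for arcflow's own helpers would follow the same shape once their signatures are confirmed.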
+ +--- + +## Phase 2: Core Logic Testing + +**Effort**: 2-3 days +**ROI**: ⭐⭐⭐⭐⭐ High +**Priority**: High + +### Goals +Test the critical business logic that defines arcflow's functionality, particularly the complex agent filtering and XML manipulation. + +### Tasks + +#### 2.1 Agent Filtering Tests +Test `is_target_agent()` and related logic: +- Test system user exclusion (is_user field) +- Test system-generated agent exclusion +- Test repository agent exclusion (is_repo_agent) +- Test donor-only agent exclusion +- Test creator role inclusion +- Test linked published records logic +- Test edge cases (multiple roles, missing fields) + +#### 2.2 XML Manipulation Tests +- Test XML escaping vs. pass-through for structured content +- Test bioghist injection into EAD XML +- Test recordgroup/subgroup label injection +- Test XML parsing and DOM manipulation +- Test handling of malformed XML + +#### 2.3 Configuration Handling Tests +- Test `.archivessnake.yml` parsing +- Test `.arcflow.yml` timestamp handling +- Test missing configuration file error handling +- Test invalid configuration format handling + +#### 2.4 Timestamp and Modified-Since Logic +- Test `last_updated` timestamp comparison +- Test force-update mode bypassing timestamps +- Test modified_since parameter in API calls +- Test datetime parsing and timezone handling + +#### 2.5 Mock ArchivesSpace Client +Create comprehensive mocks: +- Mock `ASnakeClient.authorize()` +- Mock repository queries +- Mock resource queries with resolve parameters +- Mock agent queries with filtering +- Mock EAC-CPF XML generation +- Mock error responses (401, 404, 500) + +#### 2.6 File Management Tests +- Test `save_file()` with various XML content +- Test symlink creation and directory structure +- Test file deletion and cleanup +- Test handling of write permissions errors + +### Deliverables +- ✅ Core business logic has >80% test coverage +- ✅ Agent filtering is thoroughly validated +- ✅ XML manipulation is safe and correct +- ✅ 
Configuration edge cases are handled + +### Why This Matters +These are the functions most likely to have bugs and the hardest to test manually. Comprehensive tests here give confidence that arcflow's core logic is correct. + +--- + +## Phase 3: Integration Testing + +**Effort**: 3-4 days +**ROI**: ⭐⭐⭐⭐ Medium-High +**Priority**: Medium + +### Goals +Test how components work together, including interactions with external systems (mocked or real). + +### Tasks + +#### 3.1 Mock Solr Server +- Use `responses` library or create test Solr HTTP mock +- Mock POST requests for document indexing +- Mock DELETE requests for document removal +- Mock query responses for verification +- Test error handling (Solr down, timeout, 400 errors) + +#### 3.2 EAD Processing Workflow Tests +Test `update_eads()` end-to-end: +- Mock ArchivesSpace responses for repositories and resources +- Mock traject subprocess calls +- Verify XML files are created correctly +- Verify Solr indexing is called with correct parameters +- Test error handling (missing EAD, invalid XML) + +#### 3.3 Creator Processing Workflow Tests +Test `process_creators()` end-to-end: +- Mock agent API responses +- Mock traject subprocess calls +- Verify EAC-CPF XML files are generated +- Verify Solr indexing is called +- Test `--agents-only` and `--skip-creator-indexing` modes + +#### 3.4 Subprocess Call Tests +Mock and verify: +- `traject` command invocations +- `bundle exec` command invocations +- `bundle show` for finding gems +- Handle subprocess errors and non-zero exit codes + +#### 3.5 Parallel Processing Tests +- Test ThreadPool with mocked tasks +- Test error handling in parallel tasks +- Test graceful degradation (some tasks fail, others succeed) +- Test resource cleanup after parallel execution + +#### 3.6 Traject Config Discovery Tests +Test `find_traject_config()`: +- Mock `--arcuit-dir` parameter +- Mock `bundle show arcuit` output +- Test fallback to example config +- Test error when no config found + +### 
Deliverables +- ✅ Multi-component workflows are tested +- ✅ External system interactions are verified +- ✅ Error propagation is correct +- ✅ Parallel processing is reliable + +### Why This Matters +Integration tests catch issues that unit tests miss, like incorrect API calls or mismatched data formats between components. + +--- + +## Phase 4: End-to-End Testing + +**Effort**: 4-5 days +**ROI**: ⭐⭐⭐ Medium +**Priority**: Medium + +### Goals +Test arcflow in a realistic environment with real instances of ArchivesSpace, Solr, and ArcLight. + +### Tasks + +#### 4.1 Docker Compose Test Environment +Create `docker-compose.test.yml`: +- ArchivesSpace container (pre-populated with test data) +- Solr container (pre-configured with ArcLight schema) +- ArcLight container (optional, for full stack testing) +- Test data initialization scripts + +#### 4.2 Full Workflow Tests +Test `python -m arcflow.main` with real services: +- Test initial run with empty Solr +- Test incremental update (modified_since) +- Test force-update mode (full reindex) +- Test agents-only mode +- Test collections-only mode + +#### 4.3 Data Validation Tests +After test runs: +- Verify Solr document counts +- Verify XML file generation +- Verify PDF file generation +- Verify symlink structure +- Verify `.arcflow.yml` timestamp updates + +#### 4.4 Performance Benchmarking +- Measure indexing throughput (collections/minute) +- Measure agent processing throughput +- Measure memory usage +- Identify bottlenecks for optimization + +#### 4.5 Regression Test Suite +Create test fixtures for known issues: +- Test previously reported bugs +- Test edge cases discovered in production +- Test compatibility with ArchivesSpace versions + +### Deliverables +- ✅ One-command test environment setup +- ✅ Full workflow validation +- ✅ Agents can test PRs against real infrastructure +- ✅ Performance baselines established + +### Why This Matters +End-to-end tests give the highest confidence that arcflow works in production. 
The Docker environment enables agents to test PRs without needing manual setup. + +--- + +## Phase 5: Advanced CI/CD + +**Effort**: 2-3 days +**ROI**: ⭐⭐⭐ Medium-Low +**Priority**: Low + +### Goals +Add development tools and automation to improve code quality and developer experience. + +### Tasks + +#### 5.1 Pre-commit Hooks +- Install `pre-commit` framework +- Configure hooks: + - Run linters (flake8, pylint) + - Run formatters (black, isort) + - Check for secrets (detect-secrets) + - Check YAML/JSON validity + - Check for large files + +#### 5.2 Type Checking +- Add type hints to function signatures +- Configure `mypy` for static type checking +- Add mypy to CI workflow +- Gradually increase type coverage + +#### 5.3 Security Scanning +- Add `bandit` for Python security issues +- Add `safety` for dependency vulnerability scanning +- Add security checks to CI workflow +- Configure issue notifications + +#### 5.4 Dependency Management +- Configure Dependabot for automated updates +- Set update schedule (weekly) +- Configure auto-merge for minor/patch updates +- Test dependency updates in CI before merging + +#### 5.5 Test Artifacts and Reporting +- Generate HTML coverage reports +- Upload test results in JUnit XML format +- Create test summary comments on PRs +- Archive logs and artifacts for debugging + +#### 5.6 Developer Documentation +- Document how to run tests locally +- Document how to add new tests +- Document how to use Docker test environment +- Create contributing guidelines for testing + +### Deliverables +- ✅ Pre-commit hooks prevent common mistakes +- ✅ Type safety reduces runtime errors +- ✅ Security issues are caught early +- ✅ Dependencies stay up-to-date automatically + +### Why This Matters +These tools catch issues before they reach CI, saving time and making development smoother. However, they're lower priority than functional testing. 
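For 5.2, a minimal sketch of what type hints might look like, using a standalone copy of the `is_target_agent()` filter logic covered in Phase 2.1. The real method lives on the `ArcFlow` class and takes `self`; the `Dict[str, Any]` shape for an agent record is an assumption made for illustration.

```python
from typing import Any, Dict, List


def is_target_agent(agent: Dict[str, Any]) -> bool:
    """Typed, standalone sketch of the creator filter (not the class method)."""
    # Exclude system users, system-generated agents and repository agents
    if agent.get('is_user') or agent.get('system_generated') or agent.get('is_repo_agent'):
        return False

    roles: List[str] = agent.get('linked_agent_roles', [])

    # Include anything explicitly marked as a creator
    if 'creator' in roles:
        return True

    # Exclude agents whose only role is donor
    if roles == ['donor']:
        return False

    # Default: include only if linked to a published record
    return bool(agent.get('is_linked_to_published_record', False))
```

With annotations like these, `mypy` can flag a caller that passes an agent URI string where an agent record is expected, before the code ever runs.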
+ +--- + +## Phase 6: Production-Ready Enhancements + +**Effort**: 3-4 days +**ROI**: ⭐⭐ Low +**Priority**: Low + +### Goals +Add polish and production-grade features for a mature CI/CD pipeline. + +### Tasks + +#### 6.1 Smoke Tests for Deployments +- Create a minimal test suite for post-deployment validation +- Test ArchivesSpace connectivity +- Test Solr connectivity +- Test file system permissions +- Run in production after deployments + +#### 6.2 Test Data Generation +- Create utilities to generate synthetic test data +- Generate ArchivesSpace test fixtures +- Generate EAD/EAC-CPF test files +- Support various edge cases and scenarios + +#### 6.3 Mutation Testing +- Install `mutmut` for mutation testing +- Generate mutants from source code +- Verify tests catch introduced bugs +- Identify weak test coverage areas + +#### 6.4 Performance Regression Tests +- Establish performance baselines +- Track indexing throughput over time +- Alert on performance degradation +- Optimize slow operations + +#### 6.5 Load Testing +- Test parallel processing with high volumes +- Test ThreadPool under load +- Test memory usage with large datasets +- Identify concurrency issues + +#### 6.6 External Monitoring Integration +- Send test results to monitoring systems +- Create dashboards for test trends +- Alert on test failures +- Track coverage trends over time + +### Deliverables +- ✅ Production deployments are validated automatically +- ✅ Test quality is measured objectively +- ✅ Performance regressions are caught +- ✅ Load handling is verified + +### Why This Matters +These are "nice-to-have" features that add polish but aren't critical for basic testing and CI/CD. They become valuable as the project matures. + +--- + +## Recommended Implementation Order + +For **maximum ROI with minimal effort**, implement in this order: + +1. **Start with Phase 1** (Foundation) - This is essential and provides immediate value +2. 
**Move to Phase 2** (Core Logic) - Tests the most critical functionality +3. **Skip to Phase 4.1** (Docker Compose) - Gives agents a test environment quickly +4. **Complete Phase 3** (Integration) - Now that you have an environment and core tests +5. **Add Phase 4.2-4.5** (E2E Tests) - Full validation +6. **Add Phase 5** (Advanced) - As time permits +7. **Add Phase 6** (Production) - Only if needed + +## Success Metrics + +### Short-term (After Phase 1-2) +- ✅ CI runs on every PR +- ✅ Core logic has >60% test coverage +- ✅ Agents get immediate feedback on test failures + +### Medium-term (After Phase 3-4) +- ✅ Integration tests cover major workflows +- ✅ Docker environment enables full testing +- ✅ Test coverage >75% + +### Long-term (After Phase 5-6) +- ✅ Code quality tools prevent issues proactively +- ✅ Performance is monitored and stable +- ✅ Test coverage >85% + +--- + +## Reference: Previous Conversation + +This plan is based on **PR #19**, which laid out initial ideas for testing infrastructure. That PR is currently open but has no actual implementation yet. + +**PR #19 Link**: https://github.com/UIUCLibrary/arcflow/pull/19 + +This document expands on those ideas with detailed phasing, ROI analysis, and specific implementation tasks. + +--- + +## Getting Started + +To begin implementation, start with Phase 1: + +```bash +# Install test dependencies +pip install pytest pytest-cov flake8 responses + +# Create test directory +mkdir -p tests/fixtures + +# Create first test file +# tests/test_utils.py + +# Run tests +pytest + +# Run with coverage +pytest --cov=arcflow --cov-report=html +``` + +Then create `.github/workflows/ci.yml` to run tests in CI. + +--- + +## Questions or Feedback? + +This plan is designed to be flexible. 
Phases can be reordered, split, or combined based on: +- Available development time +- Specific pain points in current workflow +- Agent testing needs +- Production requirements + +The key is to start with Phase 1 (foundation) and build incrementally from there. diff --git a/arcflow/main.py b/arcflow/main.py index f4f900e..0c8c043 100644 --- a/arcflow/main.py +++ b/arcflow/main.py @@ -9,12 +9,14 @@ import re import logging import math +import sys from xml.dom.pulldom import parse, START_ELEMENT from xml.sax.saxutils import escape as xml_escape +from xml.etree import ElementTree as ET from datetime import datetime, timezone from asnake.client import ASnakeClient from multiprocessing.pool import ThreadPool as Pool -from utils.stage_classifications import extract_labels +from .utils.stage_classifications import extract_labels base_dir = os.path.abspath((__file__) + "/../../") @@ -38,14 +40,19 @@ class ArcFlow: """ - def __init__(self, arclight_dir, aspace_dir, solr_url, traject_extra_config='', force_update=False): + def __init__(self, arclight_dir, aspace_dir, solr_url, traject_extra_config='', force_update=False, agents_only=False, collections_only=False, arcuit_dir=None, skip_creator_indexing=False): self.solr_url = solr_url self.batch_size = 1000 - self.traject_extra_config = f'-c {traject_extra_config}' if traject_extra_config.strip() else '' + clean_extra_config = traject_extra_config.strip() + self.traject_extra_config = clean_extra_config or None self.arclight_dir = arclight_dir self.aspace_jobs_dir = f'{aspace_dir}/data/shared/job_files' self.job_type = 'print_to_pdf_job' self.force_update = force_update + self.agents_only = agents_only + self.collections_only = collections_only + self.arcuit_dir = arcuit_dir + self.skip_creator_indexing = skip_creator_indexing self.log = logging.getLogger('arcflow') self.pid = os.getpid() self.pid_file_path = os.path.join(base_dir, 'arcflow.pid') @@ -395,9 +402,11 @@ def update_eads(self): ArchivesSpace. 
""" xml_dir = f'{self.arclight_dir}/public/xml' + resource_dir = f'{xml_dir}/resources' pdf_dir = f'{self.arclight_dir}/public/pdf' modified_since = int(self.last_updated.timestamp()) + if self.force_update or modified_since <= 0: modified_since = 0 # delete all EADs and related files in ArcLight Solr @@ -407,7 +416,7 @@ def update_eads(self): json={'delete': {'query': '*:*'}}, ) if response.status_code == 200: - self.log.info('Deleted all EADs from ArcLight Solr.') + self.log.info('Deleted all EADs and Creators from ArcLight Solr.') # delete related directories after suscessful # deletion from solr for dir_path, dir_name in [(xml_dir, 'XMLs'), (pdf_dir, 'PDFs')]: @@ -419,10 +428,10 @@ def update_eads(self): else: self.log.error(f'Failed to delete all EADs from Arclight Solr. Status code: {response.status_code}') except requests.exceptions.RequestException as e: - self.log.error(f'Error deleting all EADs from ArcLight Solr: {e}') + self.log.error(f'Error deleting all EADs and Creators from ArcLight Solr: {e}') # create directories if don't exist - for dir_path in (xml_dir, pdf_dir): + for dir_path in (resource_dir, pdf_dir): os.makedirs(dir_path, exist_ok=True) # process resources that have been modified in ArchivesSpace since last update @@ -435,7 +444,7 @@ def update_eads(self): # Tasks for processing repositories results_1 = [pool.apply_async( self.task_repository, - args=(repo, xml_dir, modified_since, indent_size)) + args=(repo, resource_dir, modified_since, indent_size)) for repo in repos] # Collect outputs from repository tasks outputs_1 = [r.get() for r in results_1] @@ -443,7 +452,7 @@ def update_eads(self): # Tasks for processing resources results_2 = [pool.apply_async( self.task_resource, - args=(repo, resource_id, xml_dir, pdf_dir, indent_size)) + args=(repo, resource_id, resource_dir, pdf_dir, indent_size)) for repo, resources in outputs_1 for resource_id in resources] # Collect outputs from resource tasks outputs_2 = [r.get() for r in results_2] @@ 
-457,8 +466,8 @@ def update_eads(self): # Tasks for indexing pending resources results_3 = [pool.apply_async( - self.index, - args=(repo_id, f'{xml_dir}/{repo_id}_*_batch_{batch_num}.xml', indent_size)) + self.index_collections, + args=(repo_id, f'{resource_dir}/{repo_id}_*_batch_{batch_num}.xml', indent_size)) for repo_id, batch_num in batches] # Wait for indexing tasks to complete @@ -467,7 +476,7 @@ def update_eads(self): # Remove pending symlinks after indexing for repo_id, batch_num in batches: - xml_file_path = f'{xml_dir}/{repo_id}_*_batch_{batch_num}.xml' + xml_file_path = f'{resource_dir}/{repo_id}_*_batch_{batch_num}.xml' try: result = subprocess.run( f'rm {xml_file_path}', @@ -490,14 +499,23 @@ def update_eads(self): for r in results_4: r.get() - # processing deleted resources is not needed when - # force-update is set or modified_since is set to 0 - if self.force_update or modified_since <= 0: - self.log.info('Skipping deleted resources processing.') - return + return + + + + def process_deleted_records(self): + + xml_dir = f'{self.arclight_dir}/public/xml' + resource_dir = f'{xml_dir}/resources' + agent_dir = f'{xml_dir}/agents' + pdf_dir = f'{self.arclight_dir}/public/pdf' + modified_since = int(self.last_updated.timestamp()) + + # process records that have been deleted since last update in ArchivesSpace + resource_pattern = r'^/repositories/(?P\d+)/resources/(?P\d+)$' + agent_pattern = r'^/agents/(?Ppeople|corporate_entities|families)/(?P\d+)$' + - # process resources that have been deleted since last update in ArchivesSpace - pattern = r'^/repositories/(?P\d+)/resources/(?P\d+)$' page = 1 while True: deleted_records = self.client.get( @@ -508,12 +526,13 @@ def update_eads(self): } ).json() for record in deleted_records['results']: - match = re.match(pattern, record) - if match: + resource_match = re.match(resource_pattern, record) + agent_match = re.match(agent_pattern, record) + if resource_match and not self.agents_only: resource_id = 
match.group('resource_id') self.log.info(f'{" " * indent_size}Processing deleted resource ID {resource_id}...') - symlink_path = f'{xml_dir}/{resource_id}.xml' + symlink_path = f'{resource_dir}/{resource_id}.xml' ead_id = self.get_ead_from_symlink(symlink_path) if ead_id: self.delete_ead( @@ -525,22 +544,68 @@ def update_eads(self): else: self.log.error(f'{" " * (indent_size+2)}Symlink {symlink_path} not found. Unable to delete the associated EAD from Arclight Solr.') + if agent_match and not self.collections_only: + #TODO: delete agent records. If these can be done in idential ways + self.log.info('there was an agent record to delete') + if deleted_records['last_page'] == page: break page += 1 - def index(self, repo_id, xml_file_path, indent_size=0): + def index_collections(self, repo_id, xml_file_path, indent_size=0): + """Index collection XML files to Solr using traject.""" indent = ' ' * indent_size self.log.info(f'{indent}Indexing pending resources in repository ID {repo_id} to ArcLight Solr...') try: + # Get arclight traject config path + result_show = subprocess.run( + ['bundle', 'show', 'arclight'], + capture_output=True, + text=True, + cwd=self.arclight_dir + ) + arclight_path = result_show.stdout.strip() if result_show.returncode == 0 else '' + + if not arclight_path: + self.log.error(f'{indent}Could not find arclight gem path') + return + + traject_config = f'{arclight_path}/lib/arclight/traject/ead2_config.rb' + + cmd = [ + 'bundle', 'exec', 'traject', + '-u', self.solr_url, + '-s', 'processing_thread_pool=8', + '-s', 'solr_writer.thread_pool=8', + '-s', f'solr_writer.batch_size={self.batch_size}', + '-s', 'solr_writer.commit_on_close=true', + '-i', 'xml', + '-c', traject_config + ] + + if self.traject_extra_config: + if isinstance(self.traject_extra_config, (list, tuple)): + cmd.extend(self.traject_extra_config) + else: + # Treat a string extra config as a path and pass it with -c + cmd.extend(['-c', self.traject_extra_config]) + + 
cmd.append(xml_file_path) + + env = os.environ.copy() + env['REPOSITORY_ID'] = str(repo_id) + cmd_string = ' '.join(cmd) result = subprocess.run( - f'REPOSITORY_ID={repo_id} bundle exec traject -u {self.solr_url} -s processing_thread_pool=8 -s solr_writer.thread_pool=8 -s solr_writer.batch_size={self.batch_size} -s solr_writer.commit_on_close=true -i xml -c $(bundle show arclight)/lib/arclight/traject/ead2_config.rb {self.traject_extra_config} {xml_file_path}', -# f'FILE={xml_file_path} SOLR_URL={self.solr_url} REPOSITORY_ID={repo_id} TRAJECT_SETTINGS="processing_thread_pool=8 solr_writer.thread_pool=8 solr_writer.batch_size=1000 solr_writer.commit_on_close=false" bundle exec rake arcuit:index', + cmd_string, shell=True, cwd=self.arclight_dir, - stderr=subprocess.PIPE,) - self.log.error(f'{indent}{result.stderr.decode("utf-8")}') + env=env, + stderr=subprocess.PIPE, + ) + + if result.stderr: + self.log.error(f'{indent}{result.stderr.decode("utf-8")}') if result.returncode != 0: self.log.error(f'{indent}Failed to index pending resources in repository ID {repo_id} to ArcLight Solr. Return code: {result.returncode}') else: @@ -628,6 +693,470 @@ def get_creator_bioghist(self, resource, indent_size=0): return None + def is_target_agent(self, agent): + """ + Determine if agent is a target creator of archival materials. + + Excludes: + - System users (is_user field present) + - System-generated agents (system_generated = true) + - Repository agents (is_repo_agent field present) + - Donor-only agents (only has 'donor' role, no creator role) + + Note: Software agents are excluded by not querying /agents/software endpoint. 
+ + Args: + agent: Agent record from ArchivesSpace API + + Returns: + bool: True if agent should be indexed, False to exclude + """ + # TIER 1: Exclude system users (PRIMARY FILTER) + if agent.get('is_user'): + return False + + # TIER 2: Exclude system-generated agents + if agent.get('system_generated'): + return False + + # TIER 3: Exclude repository agents (corporate entities only) + if agent.get('is_repo_agent'): + return False + + # TIER 4: Role-based filtering + roles = agent.get('linked_agent_roles', []) + + # Include if explicitly marked as creator + if 'creator' in roles: + return True + + # Exclude if ONLY marked as donor + if roles == ['donor']: + return False + + # TIER 5: Default - include if linked to published records + # (covers cases where roles aren't populated yet) + return agent.get('is_linked_to_published_record', False) + + def get_all_agents(self, agent_types=None, modified_since=0, indent_size=0): + """ + Fetch target agents from ArchivesSpace and filter to creators only. + Excludes system users, donors, and other non-creator agents. + + Args: + agent_types: List of agent types to fetch. 
Default: ['corporate_entities', 'people', 'families'] + modified_since: Unix timestamp to filter agents modified since this time (if API supports it) + indent_size: Indentation size for logging + + Returns: + list: List of filtered agent URIs (e.g., '/agents/corporate_entities/123') + """ + if agent_types is None: + agent_types = ['corporate_entities', 'people', 'families'] + + indent = ' ' * indent_size + target_agents = [] + stats = { + 'total': 0, + 'excluded_user': 0, + 'excluded_system_generated': 0, + 'excluded_repo_agent': 0, + 'excluded_donor_only': 0, + 'excluded_no_links': 0, + 'included': 0 + } + + self.log.info(f'{indent}Fetching agents from ArchivesSpace and applying filters...') + + for agent_type in agent_types: + try: + # Try with modified_since parameter first + params = {'all_ids': True} + if modified_since > 0 and not self.force_update: + params['modified_since'] = modified_since + + response = self.client.get(f'/agents/{agent_type}', params=params) + agent_ids = response.json() + + self.log.info(f'{indent}Found {len(agent_ids)} {agent_type} agents, filtering...') + + # Fetch and filter each agent + for agent_id in agent_ids: + stats['total'] += 1 + agent_uri = f'/agents/{agent_type}/{agent_id}' + + try: + # Fetch full agent record to access filtering fields + agent_response = self.client.get(agent_uri) + agent = agent_response.json() + + # Apply filtering logic + if agent.get('is_user'): + stats['excluded_user'] += 1 + continue + + if agent.get('system_generated'): + stats['excluded_system_generated'] += 1 + continue + + if agent.get('is_repo_agent'): + stats['excluded_repo_agent'] += 1 + continue + + roles = agent.get('linked_agent_roles', []) + + # Include creators + if 'creator' in roles: + stats['included'] += 1 + target_agents.append(agent_uri) + continue + + # Exclude donor-only agents + if roles == ['donor']: + stats['excluded_donor_only'] += 1 + continue + + # Default: include if linked to published records + if 
agent.get('is_linked_to_published_record', False): + stats['included'] += 1 + target_agents.append(agent_uri) + else: + stats['excluded_no_links'] += 1 + + except Exception as e: + self.log.warning(f'{indent}Error fetching agent {agent_uri}: {e}') + # On error, include the agent (fail-open) + target_agents.append(agent_uri) + + except Exception as e: + self.log.error(f'{indent}Error fetching {agent_type} agents: {e}') + # If modified_since fails, try without it + if modified_since > 0: + self.log.warning(f'{indent}Retrying {agent_type} without modified_since filter...') + try: + response = self.client.get(f'/agents/{agent_type}', params={'all_ids': True}) + agent_ids = response.json() + self.log.info(f'{indent}Found {len(agent_ids)} {agent_type} agents (no date filter)') + + # Re-process with filtering + for agent_id in agent_ids: + stats['total'] += 1 + agent_uri = f'/agents/{agent_type}/{agent_id}' + + try: + agent_response = self.client.get(agent_uri) + agent = agent_response.json() + + if self.is_target_agent(agent): + stats['included'] += 1 + target_agents.append(agent_uri) + + except Exception as e: + self.log.warning(f'{indent}Error fetching agent {agent_uri}: {e}') + target_agents.append(agent_uri) + + except Exception as e2: + self.log.error(f'{indent}Failed to fetch {agent_type} agents: {e2}') + + # Log filtering statistics + self.log.info(f'{indent}Agent filtering complete:') + self.log.info(f'{indent} Total agents processed: {stats["total"]}') + self.log.info(f'{indent} Included (target creators): {stats["included"]}') + self.log.info(f'{indent} Excluded (system users): {stats["excluded_user"]}') + self.log.info(f'{indent} Excluded (system-generated): {stats["excluded_system_generated"]}') + self.log.info(f'{indent} Excluded (repository agents): {stats["excluded_repo_agent"]}') + self.log.info(f'{indent} Excluded (donor-only): {stats["excluded_donor_only"]}') + self.log.info(f'{indent} Excluded (no published links): {stats["excluded_no_links"]}') + + 
return target_agents + + + def task_agent(self, agent_uri, agents_dir, repo_id=1, indent_size=0): + """ + Process a single agent and generate a creator document in EAC-CPF XML format. + Retrieves EAC-CPF directly from ArchivesSpace archival_contexts endpoint. + + Args: + agent_uri: Agent URI from ArchivesSpace (e.g., '/agents/corporate_entities/123') + agents_dir: Directory to save agent XML files + repo_id: Repository ID to use for archival_contexts endpoint (default: 1) + indent_size: Indentation size for logging + + Returns: + str: Creator document ID if successful, None otherwise + """ + indent = ' ' * indent_size + + try: + # Parse agent URI to extract type and ID + # URI format: /agents/{agent_type}/{id} + parts = agent_uri.strip('/').split('/') + if len(parts) != 3 or parts[0] != 'agents': + self.log.error(f'{indent}Invalid agent URI format: {agent_uri}') + return None + + agent_type = parts[1] # e.g., 'corporate_entities', 'people', 'families' + agent_id = parts[2] + + # Construct EAC-CPF endpoint + # Format: /repositories/{repo_id}/archival_contexts/{agent_type}/{id}.xml + eac_cpf_endpoint = f'/repositories/{repo_id}/archival_contexts/{agent_type}/{agent_id}.xml' + + self.log.debug(f'{indent}Fetching EAC-CPF from: {eac_cpf_endpoint}') + + # Fetch EAC-CPF XML + response = self.client.get(eac_cpf_endpoint) + + if response.status_code != 200: + self.log.error(f'{indent}Failed to fetch EAC-CPF for {agent_uri}: HTTP {response.status_code}') + return None + + eac_cpf_xml = response.text + + # Parse the EAC-CPF XML to validate and inspect its structure + try: + root = ET.fromstring(eac_cpf_xml) + self.log.debug(f'{indent}Parsed EAC-CPF XML root element: {root.tag}') + except ET.ParseError as e: + self.log.error(f'{indent}Failed to parse EAC-CPF XML for {agent_uri}: {e}') + return None + + # Generate creator ID + creator_id = f'creator_{agent_type}_{agent_id}' + + # Save EAC-CPF XML to file + filename = f'{agents_dir}/{creator_id}.xml' + with open(filename, 'w', 
encoding='utf-8') as f: + f.write(eac_cpf_xml) + + self.log.info(f'{indent}Created creator document: {creator_id}') + return creator_id + + except Exception as e: + self.log.error(f'{indent}Error processing agent {agent_uri}: {e}') + import traceback + self.log.error(f'{indent}{traceback.format_exc()}') + return None + + def process_creators(self): + """ + Process creator agents and generate standalone creator documents. + + Returns: + list: List of created creator document IDs + """ + + xml_dir = f'{self.arclight_dir}/public/xml' + agents_dir = f'{xml_dir}/agents' + modified_since = int(self.last_updated.timestamp()) + indent_size = 0 + indent = ' ' * indent_size + + self.log.info(f'{indent}Processing creator agents...') + + # Create agents directory if it doesn't exist + os.makedirs(agents_dir, exist_ok=True) + + # Get agents to process + agents = self.get_all_agents(modified_since=modified_since, indent_size=indent_size) + + # Process agents in parallel + with Pool(processes=10) as pool: + results_agents = [pool.apply_async( + self.task_agent, + args=(agent_uri_item, agents_dir, 1, indent_size)) # Use repo_id=1 + for agent_uri_item in agents] + + creator_ids = [r.get() for r in results_agents] + creator_ids = [cid for cid in creator_ids if cid is not None] + + self.log.info(f'{indent}Created {len(creator_ids)} creator documents.') + + # NOTE: Collection links are NOT added to creator XML files. + # Instead, linking is handled via Solr using the persistent_id field: + # - Creator bioghist has persistent_id as the 'id' attribute + # - Collection EADs reference creators via bioghist with persistent_id + # - Solr indexes both, allowing queries to link them + # This avoids the expensive operation of scanning all resources to build a linkage map. 
+ + # Index creators to Solr (if not skipped) + if not self.skip_creator_indexing and creator_ids: + self.log.info(f'{indent}Indexing {len(creator_ids)} creator records to Solr...') + traject_config = self.find_traject_config() + if traject_config: + self.log.info(f'{indent}Using traject config: {traject_config}') + indexed = self.index_creators(agents_dir, creator_ids) + self.log.info(f'{indent}Creator indexing complete: {indexed}/{len(creator_ids)} indexed') + else: + self.log.warning(f'{indent}Skipping creator indexing (traject config not found)') + self.log.info(f'{indent}To index manually:') + self.log.info(f'{indent} cd {self.arclight_dir}') + self.log.info(f'{indent} bundle exec traject -u {self.solr_url} -i xml \\') + self.log.info(f'{indent} -c /path/to/arcuit-gem/traject_config_eac_cpf.rb \\') + self.log.info(f'{indent} {agents_dir}/*.xml') + elif self.skip_creator_indexing: + self.log.info(f'{indent}Skipping creator indexing (--skip-creator-indexing flag set)') + + return creator_ids + + + def find_traject_config(self): + """ + Find the traject config for creator indexing. + + Search order (follows collection records pattern): + 1. arcuit_dir if provided (most up-to-date user control) + 2. arcuit gem via bundle show (for backward compatibility) + 3. 
example_traject_config_eac_cpf.rb in arcflow (fallback when used as module without arcuit) + + Returns: + str: Path to traject config, or None if not found + """ + self.log.info('Searching for traject_config_eac_cpf.rb...') + searched_paths = [] + + # Try 1: arcuit_dir if provided (highest priority - user's explicit choice) + if self.arcuit_dir: + self.log.debug(f' Checking arcuit_dir parameter: {self.arcuit_dir}') + candidate_paths = [ + os.path.join(self.arcuit_dir, 'traject_config_eac_cpf.rb'), + os.path.join(self.arcuit_dir, 'lib', 'arcuit', 'traject', 'traject_config_eac_cpf.rb'), + ] + searched_paths.extend(candidate_paths) + for traject_config in candidate_paths: + if os.path.exists(traject_config): + self.log.info(f'✓ Using traject config from arcuit_dir: {traject_config}') + return traject_config + self.log.debug(' traject_config_eac_cpf.rb not found in arcuit_dir') + + # Try 2: bundle show arcuit (for backward compatibility when arcuit_dir not provided) + try: + result = subprocess.run( + ['bundle', 'show', 'arcuit'], + cwd=self.arclight_dir, + capture_output=True, + text=True, + timeout=10 + ) + if result.returncode == 0: + arcuit_path = result.stdout.strip() + self.log.debug(f' Found arcuit gem at: {arcuit_path}') + candidate_paths = [ + os.path.join(arcuit_path, 'traject_config_eac_cpf.rb'), + os.path.join(arcuit_path, 'lib', 'arcuit', 'traject', 'traject_config_eac_cpf.rb'), + ] + searched_paths.extend(candidate_paths) + for traject_config in candidate_paths: + if os.path.exists(traject_config): + self.log.info(f'✓ Using traject config from arcuit gem: {traject_config}') + return traject_config + self.log.debug( + ' traject_config_eac_cpf.rb not found in arcuit gem ' + '(checked root and lib/arcuit/traject/ subdirectory)' + ) + else: + self.log.debug(' arcuit gem not found via bundle show') + except Exception as e: + self.log.debug(f' Error checking for arcuit gem: {e}') + + # Try 3: example file in arcflow package (fallback for module usage without 
arcuit) + # We know exactly where this file is located - at the repo root + arcflow_package_dir = os.path.dirname(os.path.abspath(__file__)) + arcflow_repo_root = os.path.dirname(arcflow_package_dir) + traject_config = os.path.join(arcflow_repo_root, 'example_traject_config_eac_cpf.rb') + searched_paths.append(traject_config) + + if os.path.exists(traject_config): + self.log.info(f'✓ Using example traject config from arcflow: {traject_config}') + self.log.info( + ' Note: Using example config. For production, copy this file to your ' + 'arcuit gem or specify location with --arcuit-dir.' + ) + return traject_config + + # No config found anywhere - show all paths searched + self.log.error('✗ Could not find traject_config_eac_cpf.rb in any of these locations:') + for i, path in enumerate(searched_paths, 1): + self.log.error(f' {i}. {path}') + self.log.error('') + self.log.error(' Add traject_config_eac_cpf.rb to your arcuit gem or specify with --arcuit-dir.') + return None + + + def index_creators(self, agents_dir, creator_ids, batch_size=100): + """ + Index creator XML files to Solr using traject. 
+ + Args: + agents_dir: Directory containing creator XML files + creator_ids: List of creator IDs to index + batch_size: Number of files to index per traject call (default: 100) + + Returns: + int: Number of successfully indexed creators + """ + traject_config = self.find_traject_config() + if not traject_config: + return 0 + + indexed_count = 0 + failed_count = 0 + + # Process in batches to avoid command line length limits + total_batches = math.ceil(len(creator_ids) / batch_size) + for i in range(0, len(creator_ids), batch_size): + batch = creator_ids[i:i+batch_size] + batch_num = (i // batch_size) + 1 + + # Build list of XML files for this batch + xml_files = [f'{agents_dir}/{cid}.xml' for cid in batch] + + # Filter to only existing files + existing_files = [f for f in xml_files if os.path.exists(f)] + + if not existing_files: + self.log.warning(f' Batch {batch_num}/{total_batches}: No files found, skipping') + continue + + try: + cmd = [ + 'bundle', 'exec', 'traject', + '-u', self.solr_url, + '-i', 'xml', + '-c', traject_config + ] + existing_files + + self.log.info(f' Indexing batch {batch_num}/{total_batches}: {len(existing_files)} files') + + result = subprocess.run( + cmd, + cwd=self.arclight_dir, + stderr=subprocess.PIPE, + timeout=300 # 5 minute timeout per batch + ) + + if result.returncode == 0: + indexed_count += len(existing_files) + self.log.info(f' Successfully indexed {len(existing_files)} creators') + else: + failed_count += len(existing_files) + self.log.error(f' Traject failed with exit code {result.returncode}') + if result.stderr: + self.log.error(f' STDERR: {result.stderr.decode("utf-8")}') + + except subprocess.TimeoutExpired: + self.log.error(f' Traject timed out for batch {batch_num}/{total_batches}') + failed_count += len(existing_files) + except Exception as e: + self.log.error(f' Error indexing batch {batch_num}/{total_batches}: {e}') + failed_count += len(existing_files) + + if failed_count > 0: + self.log.warning(f'Creator indexing 
completed with errors: {indexed_count} succeeded, {failed_count} failed') + + return indexed_count + + def get_repo_id(self, repo): """ Get the repository ID from the repository URI. @@ -756,11 +1285,31 @@ def run(self): Run the ArcFlow process. """ self.log.info(f'ArcFlow process started (PID: {self.pid}).') - self.update_repositories() - self.update_eads() + + # Update repositories (unless agents-only mode) + if not self.agents_only: + self.update_repositories() + + # Update collections/EADs (unless agents-only mode) + if not self.agents_only: + self.update_eads() + + # Update creator records (unless collections-only mode) + if not self.collections_only: + self.process_creators() + + # processing deleted resources is not needed when + # force-update is set or modified_since is set to 0 + if self.force_update or int(self.last_updated.timestamp()) <= 0: + self.log.info('Skipping deleted record processing.') + else: + self.process_deleted_records() + self.save_config_file() self.log.info(f'ArcFlow process completed (PID: {self.pid}). Elapsed time: {time.strftime("%H:%M:%S", time.gmtime(int(time.time()) - self.start_time))}.') + + def main(): parser = argparse.ArgumentParser(description='ArcFlow') @@ -784,14 +1333,38 @@ def main(): '--traject-extra-config', default='', help='Path to extra Traject configuration file',) + parser.add_argument( + '--agents-only', + action='store_true', + help='Process only agent records, skip collections (for testing)',) + parser.add_argument( + '--collections-only', + action='store_true', + help='Process only repositories and collections, skip creator processing',) + parser.add_argument( + '--arcuit-dir', + default=None, + help='Path to arcuit repository (for traject config). 
If not provided, will try bundle show arcuit.',) + parser.add_argument( + '--skip-creator-indexing', + action='store_true', + help='Generate creator XML files but skip Solr indexing (for testing)',) args = parser.parse_args() + + # Validate mutually exclusive flags + if args.agents_only and args.collections_only: + parser.error('Cannot use both --agents-only and --collections-only') arcflow = ArcFlow( arclight_dir=args.arclight_dir, aspace_dir=args.aspace_dir, solr_url=args.solr_url, traject_extra_config=args.traject_extra_config, - force_update=args.force_update) + force_update=args.force_update, + agents_only=args.agents_only, + collections_only=args.collections_only, + arcuit_dir=args.arcuit_dir, + skip_creator_indexing=args.skip_creator_indexing) arcflow.run() diff --git a/example_traject_config_eac_cpf.rb b/example_traject_config_eac_cpf.rb new file mode 100644 index 0000000..1fe97d0 --- /dev/null +++ b/example_traject_config_eac_cpf.rb @@ -0,0 +1,277 @@ +# Traject configuration for indexing EAC-CPF creator records to Solr +# +# This config file processes EAC-CPF (Encoded Archival Context - Corporate Bodies, +# Persons, and Families) XML documents from ArchivesSpace archival_contexts endpoint. 
+# +# Usage: +# bundle exec traject -u $SOLR_URL -c example_traject_config_eac_cpf.rb /path/to/agents/*.xml +# +# For production, copy this file to your arcuit gem as traject_config_eac_cpf.rb +# +# The EAC-CPF XML documents are retrieved directly from ArchivesSpace via: +# /repositories/{repo_id}/archival_contexts/{agent_type}/{id}.xml + +require 'traject' +require 'traject_plus' +require 'traject_plus/macros' +require 'time' + +# Use TrajectPlus macros (provides extract_xpath and other helpers) +extend TrajectPlus::Macros + +# EAC-CPF namespace - used consistently throughout this config +EAC_NS = { 'eac' => 'urn:isbn:1-931666-33-4' } + +settings do + provide "solr.url", ENV['SOLR_URL'] || "http://localhost:8983/solr/blacklight-core" + provide "solr_writer.commit_on_close", "true" + provide "solr_writer.thread_pool", "8" + provide "solr_writer.batch_size", "100" + provide "processing_thread_pool", "4" + + # Use NokogiriReader for XML processing + provide "reader_class_name", "Traject::NokogiriReader" +end + +# Each record from reader +each_record do |record, context| + context.clipboard[:is_creator] = true +end + +# Core identity field +# CRITICAL: The 'id' field is required by Solr's schema (uniqueKey) +# Must ensure this field is never empty or indexing will fail +# +# IMPORTANT: Real EAC-CPF from ArchivesSpace has empty <recordId> element! +# Cannot rely on recordId being present. Must extract from filename or generate. +to_field 'id' do |record, accumulator, context| + # Try 1: Extract from control/recordId (if present) + record_id = record.xpath('//eac:control/eac:recordId', EAC_NS).first + + if record_id && !record_id.text.strip.empty? 
+ accumulator << record_id.text.strip + else + # Try 2: Extract from source filename (most reliable for ArchivesSpace exports) + # Filename format: creator_corporate_entities_584.xml or similar + source_file = context.source_record_id || context.input_name + if source_file + # Remove .xml extension and any path + id_from_filename = File.basename(source_file, '.xml') + # Check if it looks valid (starts with creator_ or agent_) + if id_from_filename =~ /^(creator_|agent_)/ + accumulator << id_from_filename + context.logger.info("Using filename-based ID: #{id_from_filename}") + else + # Try 3: Generate from entity type and name + entity_type = record.xpath('//eac:cpfDescription/eac:identity/eac:entityType', EAC_NS).first&.text&.strip + name_entry = record.xpath('//eac:cpfDescription/eac:identity/eac:nameEntry/eac:part', EAC_NS).first&.text&.strip + + if entity_type && name_entry + # Create stable ID from type and name + type_short = case entity_type + when 'corporateBody' then 'corporate' + when 'person' then 'person' + when 'family' then 'family' + else 'entity' + end + name_id = name_entry.gsub(/[^a-z0-9]/i, '_').downcase[0..50] # Limit length + generated_id = "creator_#{type_short}_#{name_id}" + accumulator << generated_id + context.logger.warn("Generated ID from name: #{generated_id}") + else + # Last resort: timestamp-based unique ID + fallback_id = "creator_unknown_#{Time.now.to_i}_#{rand(10000)}" + accumulator << fallback_id + context.logger.error("Using fallback ID: #{fallback_id}") + end + end + else + # No filename available, generate from name + entity_type = record.xpath('//eac:cpfDescription/eac:identity/eac:entityType', EAC_NS).first&.text&.strip + name_entry = record.xpath('//eac:cpfDescription/eac:identity/eac:nameEntry/eac:part', EAC_NS).first&.text&.strip + + if entity_type && name_entry + type_short = case entity_type + when 'corporateBody' then 'corporate' + when 'person' then 'person' + when 'family' then 'family' + else 'entity' + end + name_id = 
name_entry.gsub(/[^a-z0-9]/i, '_').downcase[0..50] + generated_id = "creator_#{type_short}_#{name_id}" + accumulator << generated_id + context.logger.warn("Generated ID from name: #{generated_id}") + else + # Absolute last resort + fallback_id = "creator_unknown_#{Time.now.to_i}_#{rand(10000)}" + accumulator << fallback_id + context.logger.error("Using fallback ID: #{fallback_id}") + end + end + end +end + +# Add is_creator marker field +to_field 'is_creator' do |record, accumulator| + accumulator << 'true' +end + +# Record type +to_field 'record_type' do |record, accumulator| + accumulator << 'creator' +end + +# Entity type (corporateBody, person, family) +to_field 'entity_type' do |record, accumulator| + entity = record.xpath('//eac:cpfDescription/eac:identity/eac:entityType', EAC_NS).first + accumulator << entity.text if entity +end + +# Title/name fields - using ArcLight dynamic field naming convention +# _tesim = text, stored, indexed, multiValued (for full-text search) +# _ssm = string, stored, multiValued (for display) +# _ssi = string, stored, indexed (for faceting/sorting) +to_field 'title_tesim' do |record, accumulator| + name = record.xpath('//eac:cpfDescription/eac:identity/eac:nameEntry/eac:part', EAC_NS) + accumulator << name.map(&:text).join(' ') if name.any? +end + +to_field 'title_ssm' do |record, accumulator| + name = record.xpath('//eac:cpfDescription/eac:identity/eac:nameEntry/eac:part', EAC_NS) + accumulator << name.map(&:text).join(' ') if name.any? +end + +to_field 'title_filing_ssi' do |record, accumulator| + name = record.xpath('//eac:cpfDescription/eac:identity/eac:nameEntry/eac:part', EAC_NS) + if name.any? 
+ text = name.map(&:text).join(' ') + # Remove leading articles and convert to lowercase for filing + accumulator << text.gsub(/^(a|an|the)\s+/i, '').downcase + end +end + +# Dates of existence - using ArcLight standard field unitdate_ssm +# (matches what ArcLight uses for collection dates) +to_field 'unitdate_ssm' do |record, accumulator| + # Try existDates element + base_path = '//eac:cpfDescription/eac:description/eac:existDates' + dates = record.xpath("#{base_path}/eac:dateRange/eac:fromDate | #{base_path}/eac:dateRange/eac:toDate | #{base_path}/eac:date", EAC_NS) + if dates.any? + from_date = record.xpath("#{base_path}/eac:dateRange/eac:fromDate", EAC_NS).first + to_date = record.xpath("#{base_path}/eac:dateRange/eac:toDate", EAC_NS).first + + if from_date || to_date + from_text = from_date ? from_date.text : '' + to_text = to_date ? to_date.text : '' + accumulator << "#{from_text}-#{to_text}".gsub(/^-|-$/, '') + else + # Single date + dates.each { |d| accumulator << d.text } + end + end +end + +# Biographical/historical note - using ArcLight conventions +# _tesim for searchable plain text +# _tesm for searchable HTML (text, stored, multiValued but not for display) +# _ssm for section heading display +to_field 'bioghist_tesim' do |record, accumulator| + # Extract text from biogHist elements for full-text search + bioghist = record.xpath('//eac:cpfDescription/eac:description/eac:biogHist//eac:p', EAC_NS) + if bioghist.any? + text = bioghist.map(&:text).join(' ') + accumulator << text + end +end + +# Biographical/historical note - HTML +to_field 'bioghist_html_tesm' do |record, accumulator| + # Extract HTML for searchable content (matches ArcLight's bioghist_html_tesm) + bioghist = record.xpath('//eac:cpfDescription/eac:description/eac:biogHist//eac:p', EAC_NS) + if bioghist.any? + # Preserve inline EAC markup inside <p> by serializing child nodes + html = bioghist.map { |p| "

<p>#{p.inner_html}</p>

" }.join("\n") + accumulator << html + end +end + +to_field 'bioghist_heading_ssm' do |record, accumulator| + # Extract section heading (matches ArcLight's bioghist_heading_ssm pattern) + heading = record.xpath('//eac:cpfDescription/eac:description/eac:biogHist//eac:head', EAC_NS).first + accumulator << heading.text if heading +end + +# Full-text search field +to_field 'text' do |record, accumulator| + # Title + name = record.xpath('//eac:cpfDescription/eac:identity/eac:nameEntry/eac:part', EAC_NS) + accumulator << name.map(&:text).join(' ') if name.any? + + # Bioghist + bioghist = record.xpath('//eac:cpfDescription/eac:description/eac:biogHist//eac:p', EAC_NS) + accumulator << bioghist.map(&:text).join(' ') if bioghist.any? +end + +# Related agents (from cpfRelation elements) +to_field 'related_agents_ssim' do |record, accumulator| + relations = record.xpath('//eac:cpfDescription/eac:relations/eac:cpfRelation', EAC_NS) + relations.each do |rel| + # Get the related entity href/identifier + href = rel['href'] || rel['xlink:href'] + relation_type = rel['cpfRelationType'] + + if href + # Store as: "uri|type" for easy parsing later + accumulator << "#{href}|#{relation_type}" + elsif relation_entry = rel.xpath('eac:relationEntry', EAC_NS).first + # If no href, at least store the name + name = relation_entry.text + accumulator << "#{name}|#{relation_type}" if name + end + end +end + +# Related agents - just URIs (for simpler queries) +to_field 'related_agent_uris_ssim' do |record, accumulator| + relations = record.xpath('//eac:cpfDescription/eac:relations/eac:cpfRelation', EAC_NS) + relations.each do |rel| + href = rel['href'] || rel['xlink:href'] + accumulator << href if href + end +end + +# Relationship types +to_field 'relationship_types_ssim' do |record, accumulator| + relations = record.xpath('//eac:cpfDescription/eac:relations/eac:cpfRelation', EAC_NS) + relations.each do |rel| + relation_type = rel['cpfRelationType'] + accumulator << relation_type if relation_type 
&& !accumulator.include?(relation_type) + end +end + +# Agent source URI (from original ArchivesSpace) +to_field 'agent_uri' do |record, accumulator| + # Try to extract from control section or otherRecordId + other_id = record.xpath('//eac:control/eac:otherRecordId[@localType="archivesspace_uri"]', EAC_NS).first + if other_id + accumulator << other_id.text + end +end + +# Timestamp +to_field 'timestamp' do |record, accumulator| + accumulator << Time.now.utc.iso8601 +end + +# Document type marker +to_field 'document_type' do |record, accumulator| + accumulator << 'creator' +end + +# Log successful indexing +each_record do |record, context| + record_id = record.xpath('//eac:control/eac:recordId', EAC_NS).first + if record_id + context.logger.info("Indexed creator: #{record_id.text}") + end +end
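
The filtering tiers in `is_target_agent()` earlier in this diff are repeated inline in `get_all_agents()` to collect statistics, while the retry path calls `is_target_agent()` and records no exclusion counts. One way to keep the two in step is a single function that returns a reason code instead of a bare boolean. The following is a minimal sketch, not the ArcFlow implementation: `classify_agent`, `filter_agents`, and the reason strings are hypothetical names chosen for illustration, and the sample records are invented. Only the field names (`is_user`, `system_generated`, `is_repo_agent`, `linked_agent_roles`, `is_linked_to_published_record`) come from the diff.

```python
from collections import Counter

INCLUDED = 'included'  # hypothetical reason code for agents that pass the filter


def classify_agent(agent):
    """Return 'included' or an exclusion reason for one agent record (a dict).

    The checks run in the same order as the tiers in is_target_agent().
    """
    if agent.get('is_user'):
        return 'excluded_user'
    if agent.get('system_generated'):
        return 'excluded_system_generated'
    if agent.get('is_repo_agent'):
        return 'excluded_repo_agent'

    roles = agent.get('linked_agent_roles', [])
    if 'creator' in roles:
        return INCLUDED
    if roles == ['donor']:
        return 'excluded_donor_only'

    # Default tier: keep only agents linked to a published record.
    if agent.get('is_linked_to_published_record', False):
        return INCLUDED
    return 'excluded_no_links'


def filter_agents(agents):
    """Take (uri, record) pairs; return (kept_uris, per-reason counts)."""
    stats = Counter()
    kept = []
    for uri, record in agents:
        reason = classify_agent(record)
        stats[reason] += 1
        if reason == INCLUDED:
            kept.append(uri)
    return kept, stats


# Invented sample records, one per decision branch of interest.
sample = [
    ('/agents/people/1', {'is_user': '/users/1'}),
    ('/agents/people/2', {'linked_agent_roles': ['creator', 'donor']}),
    ('/agents/people/3', {'linked_agent_roles': ['donor']}),
    ('/agents/families/4', {'linked_agent_roles': ['subject'],
                            'is_linked_to_published_record': True}),
    ('/agents/corporate_entities/5', {'is_repo_agent': True,
                                      'linked_agent_roles': ['creator']}),
    ('/agents/people/6', {}),
]

kept, stats = filter_agents(sample)
print(kept)
print(dict(stats))
```

With this shape, both the first attempt and the retry without `modified_since` could call the same helper, so the logged totals would cover every agent that was examined. Whether that refactor suits ArcFlow is a design choice for the maintainers.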