Ray Executor

Overview

Ray Executor is a Ray-based distributed execution framework responsible for executing data processing operators, task scheduling, and distributed computing.

Architecture

runtime/python-executor/
└── datamate/
    ├── core/
    │   ├── base_op.py      # BaseOp, Mapper, Filter, Slicer, LLM
    │   ├── dataset.py      # Dataset processing
    │   └── constant.py     # Constant definitions
    ├── scheduler/
    │   ├── scheduler.py    # TaskScheduler, Task, TaskStatus
    │   ├── func_task_scheduler.py   # Function task scheduling
    │   └── cmd_task_scheduler.py    # Command task scheduling
    ├── wrappers/
    │   ├── executor.py     # Ray executor entry point
    │   ├── datamate_wrapper.py      # DataMate task wrapper
    │   └── data_juicer_wrapper.py   # DataJuicer integration
    └── common/utils/       # Utility functions
        ├── bytes_transform.py
        ├── file_scanner.py
        ├── lazy_loader.py
        └── text_splitter.py

Components

1. Base Classes

BaseOp

Base class for all operators:

class BaseOp:
    def __init__(self, *args, **kwargs):
        self.accelerator = kwargs.get('accelerator', "cpu")
        self.text_key = kwargs.get('text_key', "text")
        # ... other configuration
    
    def execute(self, sample):
        raise NotImplementedError

Mapper

Base class for data transformation operators (1:1):

class Mapper(BaseOp):
    def execute(self, sample: Dict) -> Dict:
        # Transformation logic
        return processed_sample

Filter

Base class for data filtering operators (returns bool):

class Filter(BaseOp):
    def execute(self, sample: Dict) -> bool:
        # Filtering logic
        return True  # Keep or filter out

Slicer

Base class for data slicing operators (1:N):

class Slicer(BaseOp):
    def execute(self, sample: Dict) -> List[Dict]:
        # Slicing logic
        return [sample1, sample2, ...]

LLM

Base class for LLM operators:

class LLM(Mapper):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.llm = self.get_llm(*args, **kwargs)
    
    def build_llm_prompt(self, *args, **kwargs):
        raise NotImplementedError

2. Task Scheduler

Async task scheduler:

class TaskScheduler:
    def __init__(self, max_concurrent: int = 10):
        self.tasks: Dict[str, Task] = {}
        self.semaphore = asyncio.Semaphore(max_concurrent)
    
    async def submit(self, task_id, task, *args, **kwargs):
        # Submit task
        pass
    
    def get_task_status(self, task_id: str) -> Optional[TaskResult]:
        # Get task status
        pass
    
    def cancel_task(self, task_id: str) -> bool:
        # Cancel task
        pass

3. Operator Execution

Operator Registration

from datamate.core.base_op import OPERATORS

OPERATORS.register_module(
    module_name='YourOperatorName',
    module_path="ops.user.operator_package.process"
)

Execute Operator

from datamate.core.base_op import Mapper

class MyMapper(Mapper):
    def execute(self, sample):
        text = sample.get('text', '')
        processed = text.upper()
        sample['text'] = processed
        return sample

Quick Start

Prerequisites

Python 3.11+
Ray 2.7.0+
Poetry

Installation

cd runtime/python-executor
poetry install

Start Ray Head

ray start --head

Start Ray Worker

ray start --head-address=<head-ip>:6379

Usage

Submit Task to Ray

from ray import remote

@remote
def execute_operator(sample, operator_config):
    # Execute operator logic
    return result

# Submit task
result_ref = execute_operator.remote(sample, config)
result = ray.get(result_ref)

Use Task Scheduler

from datamate.scheduler.scheduler import TaskScheduler

scheduler = TaskScheduler(max_concurrent=10)
task_id = "task-001"
scheduler.submit(task_id, my_function, arg1, arg2)
status = scheduler.get_task_status(task_id)

Development

Adding a New Operator

Create operator directory in runtime/ops/
Implement process.py and __init__.py
Register operator in __init__.py
Test the operator

Debugging Operators

# Local test
python -c "from ops.user.operator_package.process import YourOperatorName; op = YourOperatorName(); print(op.execute({'text': 'test'}))"

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ray Executor

Overview

Architecture

Components

1. Base Classes

BaseOp

Mapper

Filter

Slicer

LLM

2. Task Scheduler

3. Operator Execution

Operator Registration

Execute Operator

Quick Start

Prerequisites

Installation

Start Ray Head

Start Ray Worker

Usage

Submit Task to Ray

Use Task Scheduler

Development

Adding a New Operator

Debugging Operators

Performance

Parallel Execution

Fault Tolerance

Resource Management

Documentation

Related Links

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

Ray Executor

Overview

Architecture

Components

1. Base Classes

BaseOp

Mapper

Filter

Slicer

LLM

2. Task Scheduler

3. Operator Execution

Operator Registration

Execute Operator

Quick Start

Prerequisites

Installation

Start Ray Head

Start Ray Worker

Usage

Submit Task to Ray

Use Task Scheduler

Development

Adding a New Operator

Debugging Operators

Performance

Parallel Execution

Fault Tolerance

Resource Management

Documentation

Related Links