1 change: 1 addition & 0 deletions .gitignore
@@ -13,3 +13,4 @@ dist/
.dir-locals.el
private_configs/
logs/
tmp.*
1 change: 1 addition & 0 deletions Makefile
@@ -52,6 +52,7 @@ quality: ## check coding style with pycodestyle and pylint
pylint xapi_db_load *.py
pycodestyle xapi_db_load *.py
pydocstyle xapi_db_load *.py
mypy xapi_db_load
isort --check-only --diff --recursive xapi_db_load *.py test_settings.py
python setup.py bdist_wheel
twine check dist/*
118 changes: 112 additions & 6 deletions README.rst
@@ -26,6 +26,44 @@ Please add any issues you find here: https://github.com/openedx/xapi-db-load/iss

Data can be generated using the following backends:

Backend comparison
------------------

.. list-table::
   :header-rows: 1
   :widths: 12 18 18 18 34

   * - Backend
     - Recommended scale
     - Speed
     - Output
     - Best for
   * - ``clickhouse``
     - Up to ~10K xAPI events
     - Slow
     - Direct ClickHouse inserts
     - Smoke tests, configuration and permission checks
   * - ``ralph``
     - Up to ~1M xAPI events
     - Slowest
     - HTTP POST to Ralph LRS
     - Exercising the full Aspects / Ralph integration path
   * - ``vector``
     - Up to ~10M xAPI events
     - Medium
     - Log statements consumed by Vector
     - Testing a Vector-based pipeline into ClickHouse
   * - ``csv``
     - Up to 100M xAPI events
     - Fast
     - Gzipped CSV files (local or block storage)
     - Reusable fixtures, readable output, medium-to-large performance tests
   * - ``chdb``
     - 100M+ xAPI events
     - Fastest
     - lz4 ClickHouse Native files on block storage
     - Reusable fixtures, readable output, very large scale tests
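
As a rough illustration of how the "Recommended scale" column could be applied, the sketch below lists every backend whose recommended scale covers a run of ``num_xapi_batches * batch_size`` statements. The helper name ``backends_for`` is hypothetical and not part of the tool, and treating "100M+" as having no upper bound is an assumption. Scale is only one input; the "Best for" column still decides which candidate fits the test.

```python
# Upper bounds copied from the "Recommended scale" column of the table.
# None means "no upper bound"; reading "100M+" that way is an assumption.
RECOMMENDED_MAX_EVENTS = {
    "clickhouse": 10_000,
    "ralph": 1_000_000,
    "vector": 10_000_000,
    "csv": 100_000_000,
    "chdb": None,
}


def backends_for(num_xapi_batches, batch_size):
    """Return the backends whose recommended scale covers the run size."""
    total_events = num_xapi_batches * batch_size
    return [
        name
        for name, limit in RECOMMENDED_MAX_EVENTS.items()
        if limit is None or total_events <= limit
    ]


# 300 batches of 1000 statements is 300,000 events.
print(backends_for(300, 1000))  # ['ralph', 'vector', 'csv', 'chdb']
```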

clickhouse
----------
This backend issues batched insert statements directly against the configured
@@ -42,6 +80,14 @@ clickhouse backend. It is useful for testing Ralph configuration, integration,
and permissions. This is the slowest method, but exercises the largest
surface area of the Aspects project.

vector
------
This backend emits xAPI statements through a dedicated ``xapi_tracking``
Python logger so that a co-located `Vector <https://vector.dev>`_ agent can
read them and forward them to ClickHouse. All non-xAPI data (courses, blocks,
enrollments, etc.) is still written using the direct ``clickhouse`` backend.
Use this backend when validating a Vector-based ingestion pipeline.
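
The following is a minimal sketch of what emitting a statement through a dedicated ``xapi_tracking`` logger might look like, using only the standard library. It is not the backend's actual code: the helper names, the handler setup, and the example statement (including the actor name and object URL) are assumptions for illustration. In a real deployment the handler would write somewhere the Vector agent reads from, rather than to an in-memory buffer.

```python
import io
import json
import logging


def make_xapi_logger(stream):
    """Configure the ``xapi_tracking`` logger to write bare JSON lines."""
    logger = logging.getLogger("xapi_tracking")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep statements out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.handlers = [handler]
    return logger


def emit_statement(logger, statement):
    """Serialize one xAPI statement as a single log line."""
    logger.info(json.dumps(statement))


# Hypothetical statement; the field values are placeholders.
example_statement = {
    "actor": {
        "account": {"homePage": "http://localhost:18000", "name": "hypothetical-user"}
    },
    "verb": {"id": "http://adlnet.gov/expapi/verbs/completed"},
    "object": {"id": "http://localhost:18000/course/hypothetical-course"},
}

buffer = io.StringIO()  # stands in for the stream a Vector agent would read
emit_statement(make_xapi_logger(buffer), example_statement)
print(buffer.getvalue(), end="")
```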

csv
---
This backend generates a single gzipped CSV file for each type of data and
@@ -59,8 +105,8 @@ configured `start_date` and `end_date`.

chdb
----
This backend generates lz4 compressed ClickHouse Native files in S3 using the
CHDB in-process ClickHouse engine and can optionally load the files to
This backend generates lz4 compressed ClickHouse Native files in block storage
using the CHDB in-process ClickHouse engine, and can optionally load the files to
a ClickHouse service directly after creation or at a later time using the
``--load_db_only`` option. The generated files are partitioned differently per data
type to parallelize data writing and loading. This is the fastest engine for
@@ -103,6 +149,37 @@ To try out the new UI mode:



Secrets and environment variable overrides
------------------------------------------
Sensitive credentials should not be committed to source control. The following
environment variables override their corresponding config keys at load time
(env vars take precedence over values in the YAML file):

.. list-table::
   :header-rows: 1
   :widths: 45 25 30

   * - Environment variable
     - Config key it overrides
     - Used by
   * - ``XAPI_DB_LOAD_CLICKHOUSE_PASSWORD``
     - ``db_password``
     - All ClickHouse-backed runs
   * - ``XAPI_DB_LOAD_AWS_SECRET_ACCESS_KEY``
     - ``s3_secret``
     - ``csv`` (S3 destination), ``chdb``
   * - ``XAPI_DB_LOAD_RALPH_PASSWORD``
     - ``lrs_password``
     - ``ralph``

A typical pattern is to keep all non-secret keys in YAML and provide the
secrets via the shell, your CI secret store, or a ``.env`` file::

    export XAPI_DB_LOAD_CLICKHOUSE_PASSWORD=...
    export XAPI_DB_LOAD_AWS_SECRET_ACCESS_KEY=...
    export XAPI_DB_LOAD_RALPH_PASSWORD=...
    xapi-db-load load-db --config_file my_config.yaml
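
The precedence rule can be sketched as follows. This is an illustration of the documented behavior, not the tool's implementation: the function name ``apply_env_overrides`` is hypothetical, and treating any variable that is set (even to an empty string) as an override is an assumption.

```python
import os

# Mapping taken from the table above: environment variable -> config key.
ENV_OVERRIDES = {
    "XAPI_DB_LOAD_CLICKHOUSE_PASSWORD": "db_password",
    "XAPI_DB_LOAD_AWS_SECRET_ACCESS_KEY": "s3_secret",
    "XAPI_DB_LOAD_RALPH_PASSWORD": "lrs_password",
}


def apply_env_overrides(config, environ=None):
    """Return a copy of ``config`` with secrets replaced from the environment."""
    environ = os.environ if environ is None else environ
    merged = dict(config)  # leave the parsed YAML untouched
    for env_name, config_key in ENV_OVERRIDES.items():
        if env_name in environ:  # assumption: any set variable wins
            merged[config_key] = environ[env_name]
    return merged


yaml_config = {"db_password": "from-yaml", "lrs_password": "from-yaml"}
fake_env = {"XAPI_DB_LOAD_CLICKHOUSE_PASSWORD": "from-env"}
print(apply_env_overrides(yaml_config, fake_env))
# {'db_password': 'from-env', 'lrs_password': 'from-yaml'}
```

Passing ``environ`` explicitly keeps the sketch easy to test without touching the real process environment.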

Configuration Format
--------------------
There are a number of different configuration options for tuning the output.
@@ -117,6 +194,20 @@ test::
# Location where timing logs will be saved
log_dir: logs

# Maximum size of db_load.log before it rotates, in bytes.
# Defaults to 10 MB. Set to 0 to disable rotation and keep a single
# unbounded log file (the pre-rotation behavior).
log_max_bytes: 10485760

# Number of rotated log backups to retain (db_load.log.1 ... .5 by default).
log_backup_count: 5

# Base URL used as the LMS "homePage" / course URL prefix in every
# generated xAPI statement. Defaults to http://localhost:18000. Set this
# to match a real environment when you need the emitted events to point
# at a specific host.
lms_url: http://localhost:18000

# xAPI statements are generated in batches; the total number of
# statements is ``num_xapi_batches * batch_size``. The batch size is the number
# of xAPI statements sent to the backend (Ralph POST, ClickHouse insert, etc.)
@@ -208,26 +299,41 @@ Ralph / ClickHouse Backend
^^^^^^^^^^^^^^^^^^^^^^^^^^
Variables necessary to send xAPI statements via Ralph::

backend: ralph_clickhouse
backend: ralph
lrs_url: http://ralph.tutor-nightly-local.orb.local/xAPI/statements
lrs_username: ralph
lrs_password: secret

# Optional: per-request timeout (seconds) applied to every Ralph POST.
# Defaults to 120. Set to ``null`` for unbounded waits (pre-timeout
# behavior). A finite value prevents a hung Ralph endpoint from stalling
# the entire run.
lrs_request_timeout: 120

# This also requires all of the ClickHouse backend variables!
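
A small sketch of how the ``lrs_request_timeout`` semantics described above could be resolved from a parsed config. The function name and the rejection of non-positive values are assumptions for illustration, not confirmed behavior of the tool.

```python
DEFAULT_LRS_REQUEST_TIMEOUT = 120  # seconds; the documented default


def resolve_lrs_timeout(config):
    """Return the timeout in seconds, or None for an unbounded wait."""
    # A missing key falls back to the default. An explicit YAML null
    # arrives as None after parsing and is passed through unchanged.
    timeout = config.get("lrs_request_timeout", DEFAULT_LRS_REQUEST_TIMEOUT)
    if timeout is not None and timeout <= 0:
        # Assumption: zero or negative values are treated as a config error.
        raise ValueError("lrs_request_timeout must be positive or null")
    return timeout


print(resolve_lrs_timeout({}))                             # 120
print(resolve_lrs_timeout({"lrs_request_timeout": None}))  # None
print(resolve_lrs_timeout({"lrs_request_timeout": 30}))    # 30
```

If the HTTP client is ``requests``, the resolved value can be passed as the ``timeout`` argument of ``requests.post``, where ``None`` means the call waits indefinitely.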

Vector Backend
^^^^^^^^^^^^^^
The ``vector`` backend reuses the ``clickhouse`` connection variables for the
non-xAPI data and emits xAPI statements through the ``xapi_tracking`` logger
for Vector to consume::

backend: vector
# ... plus all the ClickHouse backend variables above


CSV Backend, Local Files
^^^^^^^^^^^^^^^^^^^^^^^^
Generates gzipped CSV files to a local directory::

backend: csv_file
backend: csv
csv_output_destination: logs/

CSV Backend, S3 Compatible Destination
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Generates gzipped CSV files to a remote location::

backend: csv_file
backend: csv
# This can be anything smart-open can handle (ex. a local directory or
# an S3 bucket etc.) but importing to ClickHouse using this tool only
# supports S3 or compatible services like MinIO right now.
@@ -244,7 +350,7 @@ CSV Backend, S3 Compatible Destination, Load to ClickHouse
Generates gzipped CSV files to a remote location, then automatically loads
them to ClickHouse::

backend: csv_file
backend: csv
# csv_output_destination can be anything smart_open can handle, a local
# directory or an S3 bucket etc., but importing to ClickHouse using this
# tool only supports S3 or compatible services (ex: MinIO) right now
21 changes: 20 additions & 1 deletion default_config.yaml
@@ -1,4 +1,10 @@
# This default configuration should generally work as long as the logs director is writable.
# This default configuration should generally work as long as the logs directory is writable.
#
# SECRETS: Do not commit credentials to source control. Sensitive config values can be
# overridden via environment variables (env vars take precedence over values in this file):
# XAPI_DB_LOAD_CLICKHOUSE_PASSWORD -> db_password
# XAPI_DB_LOAD_AWS_SECRET_ACCESS_KEY -> s3_secret
# XAPI_DB_LOAD_RALPH_PASSWORD -> lrs_password

# CSV backend configuration
# #########################
@@ -11,6 +17,19 @@ csv_load_from_s3_after: false

# Run options
log_dir: logs

# Maximum size of db_load.log before it rotates, in bytes.
# Set to 0 to disable rotation.
log_max_bytes: 10485760 # 10 MB

# Number of rotated log backups to retain (e.g. db_load.log.1 ... .5).
log_backup_count: 5

# Base URL used as the LMS "homePage" / course URL prefix in generated xAPI
# statements. Override this to match a real environment when you need the
# emitted events to point at a specific host.
lms_url: http://localhost:18000

num_xapi_batches: 300
batch_size: 1000

1 change: 1 addition & 0 deletions requirements/quality.in
@@ -5,6 +5,7 @@

edx-lint # edX pylint rules and plugins
isort # to standardize order of imports
mypy # Static type checker
pycodestyle # PEP 8 compliance validation
pydocstyle # PEP 257 compliance validation
twine # Utility for publishing Python packages on PyPI.
10 changes: 10 additions & 0 deletions setup.cfg
@@ -8,3 +8,13 @@ skip=

[wheel]
universal = 1

[mypy]
python_version = 3.12
ignore_missing_imports = True
check_untyped_defs = False
warn_unused_ignores = True
exclude = (^docs/|^build/|^dist/)

[mypy-xapi_db_load.tests.*]
ignore_errors = True
7 changes: 6 additions & 1 deletion tox.ini
@@ -31,8 +31,13 @@ ignore = E501
; D412 = No blank lines allowed between a section header and its content (numpy style)
; D413 = Missing blank line after last section (numpy style)
; D414 = Section has no content (numpy style)
; D202 = No blank lines allowed after function docstring (conflicts with isort/black)
; D205 = 1 blank line required between summary line and description (cosmetic only)
; D400 = First line should end with a period (cosmetic only)
; D401 = First line should be in imperative mood (heuristic is unreliable)
; D415 = First line should end with a period, question mark, or exclamation point
; E501 = Line too long, this is handled in pylint
ignore = D101,D105,D107,D200,D203,D212,D215,D404,D405,D406,D407,D408,D409,D410,D411,D412,D413,D414
ignore = D101,D105,D107,D200,D202,D203,D205,D212,D215,D400,D401,D404,D405,D406,D407,D408,D409,D410,D411,D412,D413,D414,D415


[pytest]