Add throughput bucket samples for Cosmos Spark connector#48734

Open
xinlian12 wants to merge 1 commit into Azure:main from xinlian12:addSampleForThroughputBucketInSpark-v2

Conversation

@xinlian12
Member

Summary

Add Python and Scala sample notebooks demonstrating server-side throughput bucket configuration for the Cosmos Spark connector (azure-cosmos-spark_3).

Changes

  • Samples/Python/NYC-Taxi-Data/04_ThroughputBucket.ipynb — PySpark notebook
  • Samples/Scala/NYC-Taxi-Data/04_ThroughputBucket.scala — Scala Databricks notebook

Both samples are modeled after the existing 01_Batch samples but replace the SDK-side global throughput control with the simpler server-side throughputBucket configuration:

| Config key | Description |
| --- | --- |
| `spark.cosmos.throughputControl.enabled` | `"true"` |
| `spark.cosmos.throughputControl.name` | Group name |
| `spark.cosmos.throughputControl.throughputBucket` | Integer between 1 and 5 |
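To make the table concrete, here is a minimal sketch of how these keys might be assembled into write options in the PySpark notebook. The account endpoint, key, database, container, and group-name values are illustrative placeholders, not taken from the PR:

```python
# Hypothetical placeholder values for illustration; only the three
# throughputControl keys below are documented in this PR.
write_config = {
    "spark.cosmos.accountEndpoint": "https://<your-account>.documents.azure.com:443/",
    "spark.cosmos.accountKey": "<your-account-key>",
    "spark.cosmos.database": "SampleDatabase",
    "spark.cosmos.container": "GreenTaxiRecords",
    # Server-side throughput bucket configuration:
    "spark.cosmos.throughputControl.enabled": "true",
    "spark.cosmos.throughputControl.name": "NYCGreenTaxiDataIngestion",
    "spark.cosmos.throughputControl.throughputBucket": "2",  # integer between 1 and 5
}

# In a Databricks notebook the dict would typically be passed to the writer, e.g.:
# df.write.format("cosmos.oltp").options(**write_config).mode("APPEND").save()
```

Note that no metadata container or separate throughput-control account is referenced; the bucket is enforced server-side.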

Key differences from 01_Batch

  • Removed the ThroughputControl metadata container creation (not needed for server-side buckets)
  • Removed separate throughput control account/catalog configuration
  • Replaced targetThroughputThreshold, globalControl.database, globalControl.container with throughputBucket
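The key swap described above can be sketched as two option fragments. The database and container names here are hypothetical, matching the style of the 01_Batch samples:

```python
# SDK-side global throughput control keys (as in 01_Batch) that the new
# samples remove -- values are illustrative placeholders:
sdk_side_only = {
    "spark.cosmos.throughputControl.targetThroughputThreshold": "0.95",
    "spark.cosmos.throughputControl.globalControl.database": "SampleDatabase",
    "spark.cosmos.throughputControl.globalControl.container": "ThroughputControl",
}

# The single server-side key that replaces them in the new samples:
server_side_only = {
    "spark.cosmos.throughputControl.throughputBucket": "2",
}

# Three client-side coordination keys collapse into one server-side key.
assert len(sdk_side_only) == 3 and len(server_side_only) == 1
```

The practical upshot is that no ThroughputControl metadata container has to be provisioned or cleaned up when buckets are used.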

Verification

These are Databricks notebook samples and do not have associated unit tests. The structure and configuration keys were verified against the Spark connector source code (CosmosConfig.scala, ThroughputControlHelper.scala).

Add Python (.ipynb) and Scala sample notebooks demonstrating
server-side throughput bucket configuration as an alternative
to SDK-based global throughput control.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@xinlian12 xinlian12 marked this pull request as ready for review April 8, 2026 19:21
@xinlian12 xinlian12 requested review from a team and kirankumarkolli as code owners April 8, 2026 19:21
Copilot AI review requested due to automatic review settings April 8, 2026 19:21
Member

@FabianMeiswinkel FabianMeiswinkel left a comment


LGTM

Contributor

Copilot AI left a comment


Pull request overview

Adds new Scala and Python Databricks sample notebooks under azure-cosmos-spark_3 demonstrating server-side throughput buckets (spark.cosmos.throughputControl.throughputBucket) for the NYC Taxi ingestion workflow, modeled after the existing 01_Batch samples but without SDK/global throughput-control metadata container setup.

Changes:

  • Add Scala Databricks notebook sample showing ingest, query, change feed validation, and delete flows using throughput buckets.
  • Add PySpark Databricks notebook sample showing the same flow using throughput buckets.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
sdk/cosmos/azure-cosmos-spark_3/Samples/Scala/NYC-Taxi-Data/04_ThroughputBucket.scala New Scala Databricks sample demonstrating throughput bucket configuration for ingest and delete workloads.
sdk/cosmos/azure-cosmos-spark_3/Samples/Python/NYC-Taxi-Data/04_ThroughputBucket.ipynb New PySpark Databricks sample notebook demonstrating throughput bucket configuration for ingest and delete workloads.

Member

@tvaron3 tvaron3 left a comment


LGTM
