CODEPUSH-3: Add config gen by bharathappali · Pull Request #8 · kruize/data-gen

bharathappali · 2026-04-23T08:29:00Z

This PR is built on top of #7

#7 Needs to be merged before merging this PR

This PR adds the config generator which generates the hierarchy mapping needed for data generator. Config will have the metadata like namespace details, deployment info, pod info and container details along with the nature of the workload

Summary by Sourcery

Add a configurable workload metadata generator that produces hierarchical cluster/namespace/workload configuration files for use by the data generator.

New Features:

Introduce a command-line script to generate meta.json configs describing clusters, namespaces, workloads, pods, and containers over a specified time range.
Add support for modeling CPU, memory, and optional GPU resource utilization and capacities per node and container when generating configs.
Create the metadata mapping file scaffold used to derive clusters, namespaces, and workloads for configuration generation.

Signed-off-by: bharathappali <abharath@redhat.com>

sourcery-ai · 2026-04-23T08:29:08Z

Reviewer's Guide

Introduce a standalone configuration generator script that creates metadata for synthetic workload hierarchies (clusters → namespaces → workloads → pods → containers) using a mapping.json file and various randomization and resource-constraint rules, and writes the result to data/configs//meta.json.

ER diagram for generated meta_json workload hierarchy

erDiagram
    CONFIG {
        int start_time
        int end_time
        int interval_secs
        int num_clusters
        int num_namespaces
    }

    CLUSTER {
        string name
    }

    NAMESPACE {
        string name
    }

    WORKLOAD {
        string name
        boolean is_deployment
        boolean is_gpu_workload
        int num_replicas
        int num_containers
        string cpu_utilization
        string memory_utilization
        string gpu_utilization
    }

    CONTAINER {
        string name
        string image
        string image_id
        float cpu_min
        float cpu_max
        string cpu_utilization
        float memory_min
        float memory_max
        string memory_utilization
        float gpu_min
        float gpu_max
        string gpu_utilization
    }

    POD {
        string name
        string id
        string node_name
        string gpu_uuid
        string gpu_device
        string gpu_model
    }

    POD_CONTAINER_STATE {
        string container_name
        string container_id
        float current_cpu_seconds
        float current_throttled_seconds
    }

    NODE {
        string name
        int capacity_cpu
        int capacity_memory
        string capacity_gpu_type
        int capacity_gpu_count
    }

    NODE_AVAILABILITY {
        int available_cpu
        int available_memory
        string available_gpu_type
        int available_gpu_count
        boolean is_assigned
    }

    CONFIG ||--o{ CLUSTER : has
    CLUSTER ||--o{ NAMESPACE : contains
    NAMESPACE ||--o{ WORKLOAD : owns
    WORKLOAD ||--o{ POD : replicates
    WORKLOAD ||--o{ CONTAINER : defines

    POD ||--o{ POD_CONTAINER_STATE : tracks
    CONTAINER ||--o{ POD_CONTAINER_STATE : used_in

    NODE ||--|| NODE_AVAILABILITY : has
    NODE ||--o{ POD : hosts

File-Level Changes

Change	Details	Files
Add a CLI-based config generator that builds hierarchical workload metadata and writes meta.json under a named config directory.	Implement argument parsing for config name, namespace/deployment/replica counts, pre/post day ranges, and time interval, using defaults from Constants. Add time-range helpers to convert pre/post day windows and interval strings into UNIX timestamp ranges and sequences. Validate and normalize user inputs for config name and temporal parameters before generation, printing the number of generated timestamps. Create per-config output directories under a common configs root, overwriting existing contents if necessary.	`generate_config.py`
Introduce a synthetic node inventory and node-selection logic to assign pods to nodes with CPU, memory, and optional GPU capacity tracking.	Pre-generate a large NODE_MAP of worker nodes with random IP-like names, CPU/memory capacities, and optional GPU types/counts. Implement get_node to pick an unassigned node that has sufficient CPU/memory (and GPUs for GPU workloads), decrementing its available resources and tracking assignment. Model GPU-specific details such as GPU device ID, model, and UUID for GPU workloads within pod state metadata.	`generate_config.py`
Generate hierarchical cluster/namespace/workload/pod/container metadata based on a world mapping and resource-utilization constraints.	Load cluster-to-namespace-to-workload mappings from mapping.json and randomly choose a subset of namespaces and workloads according to requested counts. For each selected workload, randomly decide deployment vs non-deployment, GPU vs non-GPU (for deployments), replicas, container count, and utilization levels for CPU, memory, and GPU. Derive min/max resource ranges from Constants.ResourceConstraints based on utilization categories and attach them to each container’s resource spec. Generate pod names and container names deterministically patterned off workloads, assign pods to nodes via get_node, and attach IDs, image IDs, image references, and initial resource counters. Assemble the full meta structure with cluster, namespace, workload, pod, and container metadata plus time_range, and write it as pretty-printed JSON to meta.json in the config directory.	`generate_config.py` `data/metadata/mapping.json`

Tips and commands

Interacting with Sourcery

Trigger a new review: Comment @sourcery-ai review on the pull request.
Continue discussions: Reply directly to Sourcery's review comments.
Generate a GitHub issue from a review comment: Ask Sourcery to create an
issue from a review comment by replying to it. You can also reply to a
review comment with @sourcery-ai issue to create an issue from it.
Generate a pull request title: Write @sourcery-ai anywhere in the pull
request title to generate a title at any time. You can also comment
@sourcery-ai title on the pull request to (re-)generate the title at any time.
Generate a pull request summary: Write @sourcery-ai summary anywhere in
the pull request body to generate a PR summary at any time exactly where you
want it. You can also comment @sourcery-ai summary on the pull request to
(re-)generate the summary at any time.
Generate reviewer's guide: Comment @sourcery-ai guide on the pull
request to (re-)generate the reviewer's guide at any time.
Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
pull request to resolve all Sourcery comments. Useful if you've already
addressed all the comments and don't want to see them anymore.
Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
request to dismiss all existing Sourcery reviews. Especially useful if you
want to start fresh with a new review - don't forget to comment
@sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

Enable or disable review features such as the Sourcery-generated pull request
summary, the reviewer's guide, and others.
Change the review language.
Add, remove or edit custom review instructions.
Adjust other review settings.

Getting Help

Contact our support team for questions or feedback.
Visit our documentation for detailed guides and information.
Keep in touch with the Sourcery team by following us on X/Twitter, LinkedIn or GitHub.

sourcery-ai

Hey - I've found 6 issues, and left some high level feedback:

Building the large NODE_MAP with random data at import time makes this module slow and side‑effectful when imported; consider moving that initialization into main() or a dedicated initializer function so it only runs when needed.
load_metadata() can return None, but generate_meta_json() assumes mapping_data is a valid dict and will fail if the metadata file is missing or invalid; add a defensive check and exit or raise a clear error when metadata cannot be loaded.
The generated timestamps list in main() is only used to print its length and is otherwise unused; either remove this computation or wire the timestamps into the config/meta generation so the work has a clear purpose.

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- Building the large `NODE_MAP` with random data at import time makes this module slow and side‑effectful when imported; consider moving that initialization into `main()` or a dedicated initializer function so it only runs when needed.
- `load_metadata()` can return `None`, but `generate_meta_json()` assumes `mapping_data` is a valid dict and will fail if the metadata file is missing or invalid; add a defensive check and exit or raise a clear error when metadata cannot be loaded.
- The generated `timestamps` list in `main()` is only used to print its length and is otherwise unused; either remove this computation or wire the timestamps into the config/meta generation so the work has a clear purpose.

## Individual Comments

### Comment 1
<location path="generate_config.py" line_range="70-79" />
<code_context>
+def get_node(is_gpu_workload, max_cpu, max_memory):
</code_context>
<issue_to_address>
**issue (bug_risk):** Node assignment logic only allows a single workload per node due to `is_assigned`, effectively ignoring remaining CPU/memory capacity.

`get_node` sets `node_info["is_assigned"] = True` on the first scheduled workload, so later calls skip that node even if `availability["cpu"]` and `availability["memory"]` still allow more workloads. If nodes are meant to host multiple workloads until capacity is exhausted, either remove `is_assigned` and rely solely on the availability checks, or reserve `is_assigned` for a true terminal state (e.g., drained node). Otherwise, the availability tracking becomes misleading and nodes are systematically underutilized.
</issue_to_address>

### Comment 2
<location path="generate_config.py" line_range="131-140" />
<code_context>
+    return parser.parse_args()
+
+
+def get_num_secs(secs: str):
+    if secs is None:
+        return 30
+    if secs == "":
+        return 30
+    secs = secs.strip()
+    if secs not in Constants.INTERVAL_CHOICES:
+        return 30
+    if secs == "1s":
+        return 1
+    if secs == "5s":
+        return 5
+    if secs == "15s":
+        return 15
+    if secs == "30s":
+        return 30
+    if secs == "60s":
+        return 60
+
</code_context>
<issue_to_address>
**issue (bug_risk):** get_num_secs can fall through without returning a value if new interval choices are added in the future.

Because the function only handles a fixed set of values (`1s`, `5s`, `15s`, `30s`, `60s`), any future value added to `Constants.INTERVAL_CHOICES` (e.g. `"300s"`) could pass the membership check but not match any branch, causing an implicit `None` return and later issues in `get_timestamps`. Please either switch to a mapping dict (e.g. `{'1s': 1, '5s': 5, ...}`) or add an explicit default `return` at the end to guarantee a numeric result for all valid choices.
</issue_to_address>

### Comment 3
<location path="generate_config.py" line_range="151-156" />
<code_context>
+        return 60
+
+
+def get_start_end_unix_seconds(pre_days: int, post_days: int):
+    current_time = datetime.now(UTC).replace(second=0, microsecond=0)
+    start_time = current_time - timedelta(days=pre_days)
+    end_time = current_time + timedelta(days=post_days)
+    start_unix_seconds = int(time.mktime(start_time.timetuple()))
+    end_unix_seconds = int(time.mktime(end_time.timetuple()))
+    return start_unix_seconds, end_unix_seconds
+
</code_context>
<issue_to_address>
**issue (bug_risk):** Using time.mktime with UTC datetimes can skew timestamps on systems with non-UTC local time.

`datetime.now(UTC)` returns UTC-aware datetimes, but `time.mktime()` treats the timetuple as local time. On non-UTC hosts this will offset the epoch value by the local timezone. To get correct UTC epoch seconds, use `calendar.timegm(dt.utctimetuple())` or `int(dt.timestamp())` on the aware UTC datetime instead.
</issue_to_address>

### Comment 4
<location path="generate_config.py" line_range="183-162" />
<code_context>
+    return return_path
+
+
+def load_metadata():
+    try:
+        with open(MAPPING_JSON, 'r') as f:
+            metadata = json.load(f)
+        return metadata
+    except FileNotFoundError:
+        print(f"Error: File '{MAPPING_JSON}' not found.")
+        return None
+    except json.JSONDecodeError as e:
+        print(f"Error decoding JSON from '{MAPPING_JSON}': {str(e)}")
</code_context>
<issue_to_address>
**issue (bug_risk):** Returning None from load_metadata leads to runtime errors in generate_meta_json when the mapping file is missing or invalid.

`main()` passes the result of `load_metadata` directly as `mapping_data` into `generate_meta_json`, which immediately calls `mapping_data.values()`. When the mapping file is missing or invalid, `load_metadata` returns `None`, causing an exception in `generate_meta_json` with an unhelpful error. Either raise a clear exception from `load_metadata` instead of returning `None`, or have `main()` validate the result and fail fast with a clearer message before calling `generate_meta_json`.
</issue_to_address>

### Comment 5
<location path="generate_config.py" line_range="332-340" />
<code_context>
+                    pod = generate_pod_name(deployment_name=workload, is_deployment=is_deployment, replica_index=i)
+                    pod_id = str(uuid.uuid4())
+                    pods.append(pod)
+                    node_name, available_gpu_type = get_node(is_gpu_workload, max_cpu, max_memory)
+                    pod_states[pod] = {
+                        "id": pod_id,
+                        "node": node_name
+                    }
+                    if is_gpu_workload:
+                        pod_states[pod]['gpu_uuid'] = str(uuid.uuid4())
+                        pod_states[pod]['gpu_device'] = "nvidia0"
+                        pod_states[pod]['gpu_model'] = available_gpu_type
+                    for container_name in container_list:
+                        pod_states[pod][container_name] = {
</code_context>
<issue_to_address>
**issue:** No handling for failure to allocate a node leads to pods with `node=None` and GPU workloads with `gpu_model=None`.

`get_node` may return `(None, None)` when no node is available, but the code still records `"node": None` and, for GPU workloads, sets `gpu_uuid`, `gpu_device`, and `gpu_model=None`. This creates pods with GPU metadata that are not actually scheduled on any node. Please handle the `node_name is None` case explicitly (e.g., skip pod creation, mark as unschedulable, or raise) and only populate GPU metadata when a node is successfully allocated.
</issue_to_address>

### Comment 6
<location path="generate_config.py" line_range="399-403" />
<code_context>
+    interval_secs = get_num_secs(interval)
+    config_name = args.config_name
+
+    if not re.match(r"^[a-zA-Z0-9_-]+$", config_name):
+        raise ValueError(
+            "Invalid characters in config_name. Only alphanumeric, hyphen (-), and underscore (_) are allowed.")
+
+    if '/' in config_name:
+        raise ValueError("config_name cannot contain '/' character.")
+
</code_context>
<issue_to_address>
**nitpick:** The explicit '/' check for config_name is redundant with the regex and could be folded into one validation step.

Because `^[a-zA-Z0-9_-]+$` already disallows `/`, the `if '/' in config_name` branch is effectively unreachable when the regex check runs first. To keep distinct error messages, either check for `/` before applying the regex, or remove the second check and rely on the regex error. This simplifies the validation logic and avoids dead code paths.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

sourcery-ai · 2026-04-23T08:30:56Z

+def get_node(is_gpu_workload, max_cpu, max_memory):
+    for node_name, node_info in NODE_MAP.items():
+        if not node_info["is_assigned"]:
+            if (node_info["availability"]["cpu"] >= max_cpu and
+                    node_info["availability"]["memory"] >= max_memory):
+                if is_gpu_workload:
+                    if node_info["availability"]["gpu"] and node_info["availability"]["gpu"]["count"] > 0:
+                        node_info["availability"]["cpu"] -= max_cpu
+                        node_info["availability"]["memory"] -= max_memory
+                        node_info["availability"]["gpu"]["count"] -= 1


issue (bug_risk): Node assignment logic only allows a single workload per node due to is_assigned, effectively ignoring remaining CPU/memory capacity.

get_node sets node_info["is_assigned"] = True on the first scheduled workload, so later calls skip that node even if availability["cpu"] and availability["memory"] still allow more workloads. If nodes are meant to host multiple workloads until capacity is exhausted, either remove is_assigned and rely solely on the availability checks, or reserve is_assigned for a true terminal state (e.g., drained node). Otherwise, the availability tracking becomes misleading and nodes are systematically underutilized.

sourcery-ai · 2026-04-23T08:30:56Z

+def get_num_secs(secs: str):
+    if secs is None:
+        return 30
+    if secs == "":
+        return 30
+    secs = secs.strip()
+    if secs not in Constants.INTERVAL_CHOICES:
+        return 30
+    if secs == "1s":
+        return 1


issue (bug_risk): get_num_secs can fall through without returning a value if new interval choices are added in the future.

Because the function only handles a fixed set of values (1s, 5s, 15s, 30s, 60s), any future value added to Constants.INTERVAL_CHOICES (e.g. "300s") could pass the membership check but not match any branch, causing an implicit None return and later issues in get_timestamps. Please either switch to a mapping dict (e.g. {'1s': 1, '5s': 5, ...}) or add an explicit default return at the end to guarantee a numeric result for all valid choices.

sourcery-ai · 2026-04-23T08:30:56Z

+def get_start_end_unix_seconds(pre_days: int, post_days: int):
+    current_time = datetime.now(UTC).replace(second=0, microsecond=0)
+    start_time = current_time - timedelta(days=pre_days)
+    end_time = current_time + timedelta(days=post_days)
+    start_unix_seconds = int(time.mktime(start_time.timetuple()))
+    end_unix_seconds = int(time.mktime(end_time.timetuple()))


issue (bug_risk): Using time.mktime with UTC datetimes can skew timestamps on systems with non-UTC local time.

datetime.now(UTC) returns UTC-aware datetimes, but time.mktime() treats the timetuple as local time. On non-UTC hosts this will offset the epoch value by the local timezone. To get correct UTC epoch seconds, use calendar.timegm(dt.utctimetuple()) or int(dt.timestamp()) on the aware UTC datetime instead.

sourcery-ai · 2026-04-23T08:30:56Z

+
+def get_timestamps(from_ts: int, to_ts: int, interval_secs: int):
+    if from_ts <= 0 or to_ts <= 0 or interval_secs <= 0 or from_ts >= to_ts:
+        return None


issue (bug_risk): Returning None from load_metadata leads to runtime errors in generate_meta_json when the mapping file is missing or invalid.

main() passes the result of load_metadata directly as mapping_data into generate_meta_json, which immediately calls mapping_data.values(). When the mapping file is missing or invalid, load_metadata returns None, causing an exception in generate_meta_json with an unhelpful error. Either raise a clear exception from load_metadata instead of returning None, or have main() validate the result and fail fast with a clearer message before calling generate_meta_json.

sourcery-ai · 2026-04-23T08:30:56Z

+                    node_name, available_gpu_type = get_node(is_gpu_workload, max_cpu, max_memory)
+                    pod_states[pod] = {
+                        "id": pod_id,
+                        "node": node_name
+                    }
+                    if is_gpu_workload:
+                        pod_states[pod]['gpu_uuid'] = str(uuid.uuid4())
+                        pod_states[pod]['gpu_device'] = "nvidia0"
+                        pod_states[pod]['gpu_model'] = available_gpu_type


issue: No handling for failure to allocate a node leads to pods with node=None and GPU workloads with gpu_model=None.

get_node may return (None, None) when no node is available, but the code still records "node": None and, for GPU workloads, sets gpu_uuid, gpu_device, and gpu_model=None. This creates pods with GPU metadata that are not actually scheduled on any node. Please handle the node_name is None case explicitly (e.g., skip pod creation, mark as unschedulable, or raise) and only populate GPU metadata when a node is successfully allocated.

sourcery-ai · 2026-04-23T08:30:56Z

+    if not re.match(r"^[a-zA-Z0-9_-]+$", config_name):
+        raise ValueError(
+            "Invalid characters in config_name. Only alphanumeric, hyphen (-), and underscore (_) are allowed.")
+
+    if '/' in config_name:


nitpick: The explicit '/' check for config_name is redundant with the regex and could be folded into one validation step.

Because ^[a-zA-Z0-9_-]+$ already disallows /, the if '/' in config_name branch is effectively unreachable when the regex check runs first. To keep distinct error messages, either check for / before applying the regex, or remove the second check and rely on the regex error. This simplifies the validation logic and avoids dead code paths.

chandrams · 2026-04-24T06:32:50Z

+
+NODE_MAP = {}
+
+for node_index in range(NUM_NODES):


Can you add description into this file. Also rebase the PR

bharathappali added 3 commits April 23, 2026 13:41

add mapping json

d8189b4

Signed-off-by: bharathappali <abharath@redhat.com>

add the json content

ec4ffd2

Signed-off-by: bharathappali <abharath@redhat.com>

add generate config python script

d2cb54e

Signed-off-by: bharathappali <abharath@redhat.com>

bharathappali requested a review from chandrams April 23, 2026 08:29

bharathappali self-assigned this Apr 23, 2026

bharathappali changed the title ~~Add config gen~~ CODEPUSH-3: Add config gen Apr 23, 2026

sourcery-ai Bot reviewed Apr 23, 2026

View reviewed changes

chandrams reviewed Apr 24, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CODEPUSH-3: Add config gen#8

CODEPUSH-3: Add config gen#8
bharathappali wants to merge 3 commits into
kruize:mainfrom
bharathappali:add-config-gen

bharathappali commented Apr 23, 2026 •

edited by sourcery-ai Bot

Loading

Uh oh!

sourcery-ai Bot commented Apr 23, 2026 •

edited

Loading

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

sourcery-ai Bot left a comment

Uh oh!

sourcery-ai Bot Apr 23, 2026

Uh oh!

sourcery-ai Bot Apr 23, 2026

Uh oh!

sourcery-ai Bot Apr 23, 2026

Uh oh!

sourcery-ai Bot Apr 23, 2026

Uh oh!

sourcery-ai Bot Apr 23, 2026

Uh oh!

sourcery-ai Bot Apr 23, 2026

Uh oh!

chandrams Apr 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

bharathappali commented Apr 23, 2026 • edited by sourcery-ai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by Sourcery

Uh oh!

sourcery-ai Bot commented Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviewer's Guide

ER diagram for generated meta_json workload hierarchy

File-Level Changes

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

sourcery-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

sourcery-ai Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

sourcery-ai Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

sourcery-ai Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

sourcery-ai Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

sourcery-ai Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

sourcery-ai Bot Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

chandrams Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

bharathappali commented Apr 23, 2026 •

edited by sourcery-ai Bot

Loading

sourcery-ai Bot commented Apr 23, 2026 •

edited

Loading