CODEPUSH-3: Add config gen#8
Conversation
Signed-off-by: bharathappali <abharath@redhat.com>
Signed-off-by: bharathappali <abharath@redhat.com>
Signed-off-by: bharathappali <abharath@redhat.com>
Reviewer's GuideIntroduce a standalone configuration generator script that creates metadata for synthetic workload hierarchies (clusters → namespaces → workloads → pods → containers) using a mapping.json file and various randomization and resource-constraint rules, and writes the result to data/configs//meta.json. ER diagram for generated meta_json workload hierarchyerDiagram
CONFIG {
int start_time
int end_time
int interval_secs
int num_clusters
int num_namespaces
}
CLUSTER {
string name
}
NAMESPACE {
string name
}
WORKLOAD {
string name
boolean is_deployment
boolean is_gpu_workload
int num_replicas
int num_containers
string cpu_utilization
string memory_utilization
string gpu_utilization
}
CONTAINER {
string name
string image
string image_id
float cpu_min
float cpu_max
string cpu_utilization
float memory_min
float memory_max
string memory_utilization
float gpu_min
float gpu_max
string gpu_utilization
}
POD {
string name
string id
string node_name
string gpu_uuid
string gpu_device
string gpu_model
}
POD_CONTAINER_STATE {
string container_name
string container_id
float current_cpu_seconds
float current_throttled_seconds
}
NODE {
string name
int capacity_cpu
int capacity_memory
string capacity_gpu_type
int capacity_gpu_count
}
NODE_AVAILABILITY {
int available_cpu
int available_memory
string available_gpu_type
int available_gpu_count
boolean is_assigned
}
CONFIG ||--o{ CLUSTER : has
CLUSTER ||--o{ NAMESPACE : contains
NAMESPACE ||--o{ WORKLOAD : owns
WORKLOAD ||--o{ POD : replicates
WORKLOAD ||--o{ CONTAINER : defines
POD ||--o{ POD_CONTAINER_STATE : tracks
CONTAINER ||--o{ POD_CONTAINER_STATE : used_in
NODE ||--|| NODE_AVAILABILITY : has
NODE ||--o{ POD : hosts
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 6 issues, and left some high level feedback:
- Building the large
NODE_MAPwith random data at import time makes this module slow and side‑effectful when imported; consider moving that initialization intomain()or a dedicated initializer function so it only runs when needed. load_metadata()can returnNone, butgenerate_meta_json()assumesmapping_datais a valid dict and will fail if the metadata file is missing or invalid; add a defensive check and exit or raise a clear error when metadata cannot be loaded.- The generated
timestampslist inmain()is only used to print its length and is otherwise unused; either remove this computation or wire the timestamps into the config/meta generation so the work has a clear purpose.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Building the large `NODE_MAP` with random data at import time makes this module slow and side‑effectful when imported; consider moving that initialization into `main()` or a dedicated initializer function so it only runs when needed.
- `load_metadata()` can return `None`, but `generate_meta_json()` assumes `mapping_data` is a valid dict and will fail if the metadata file is missing or invalid; add a defensive check and exit or raise a clear error when metadata cannot be loaded.
- The generated `timestamps` list in `main()` is only used to print its length and is otherwise unused; either remove this computation or wire the timestamps into the config/meta generation so the work has a clear purpose.
## Individual Comments
### Comment 1
<location path="generate_config.py" line_range="70-79" />
<code_context>
+def get_node(is_gpu_workload, max_cpu, max_memory):
</code_context>
<issue_to_address>
**issue (bug_risk):** Node assignment logic only allows a single workload per node due to `is_assigned`, effectively ignoring remaining CPU/memory capacity.
`get_node` sets `node_info["is_assigned"] = True` on the first scheduled workload, so later calls skip that node even if `availability["cpu"]` and `availability["memory"]` still allow more workloads. If nodes are meant to host multiple workloads until capacity is exhausted, either remove `is_assigned` and rely solely on the availability checks, or reserve `is_assigned` for a true terminal state (e.g., drained node). Otherwise, the availability tracking becomes misleading and nodes are systematically underutilized.
</issue_to_address>
### Comment 2
<location path="generate_config.py" line_range="131-140" />
<code_context>
+ return parser.parse_args()
+
+
+def get_num_secs(secs: str):
+ if secs is None:
+ return 30
+ if secs == "":
+ return 30
+ secs = secs.strip()
+ if secs not in Constants.INTERVAL_CHOICES:
+ return 30
+ if secs == "1s":
+ return 1
+ if secs == "5s":
+ return 5
+ if secs == "15s":
+ return 15
+ if secs == "30s":
+ return 30
+ if secs == "60s":
+ return 60
+
</code_context>
<issue_to_address>
**issue (bug_risk):** get_num_secs can fall through without returning a value if new interval choices are added in the future.
Because the function only handles a fixed set of values (`1s`, `5s`, `15s`, `30s`, `60s`), any future value added to `Constants.INTERVAL_CHOICES` (e.g. `"300s"`) could pass the membership check but not match any branch, causing an implicit `None` return and later issues in `get_timestamps`. Please either switch to a mapping dict (e.g. `{'1s': 1, '5s': 5, ...}`) or add an explicit default `return` at the end to guarantee a numeric result for all valid choices.
</issue_to_address>
### Comment 3
<location path="generate_config.py" line_range="151-156" />
<code_context>
+ return 60
+
+
+def get_start_end_unix_seconds(pre_days: int, post_days: int):
+ current_time = datetime.now(UTC).replace(second=0, microsecond=0)
+ start_time = current_time - timedelta(days=pre_days)
+ end_time = current_time + timedelta(days=post_days)
+ start_unix_seconds = int(time.mktime(start_time.timetuple()))
+ end_unix_seconds = int(time.mktime(end_time.timetuple()))
+ return start_unix_seconds, end_unix_seconds
+
</code_context>
<issue_to_address>
**issue (bug_risk):** Using time.mktime with UTC datetimes can skew timestamps on systems with non-UTC local time.
`datetime.now(UTC)` returns UTC-aware datetimes, but `time.mktime()` treats the timetuple as local time. On non-UTC hosts this will offset the epoch value by the local timezone. To get correct UTC epoch seconds, use `calendar.timegm(dt.utctimetuple())` or `int(dt.timestamp())` on the aware UTC datetime instead.
</issue_to_address>
### Comment 4
<location path="generate_config.py" line_range="183-162" />
<code_context>
+ return return_path
+
+
+def load_metadata():
+ try:
+ with open(MAPPING_JSON, 'r') as f:
+ metadata = json.load(f)
+ return metadata
+ except FileNotFoundError:
+ print(f"Error: File '{MAPPING_JSON}' not found.")
+ return None
+ except json.JSONDecodeError as e:
+ print(f"Error decoding JSON from '{MAPPING_JSON}': {str(e)}")
</code_context>
<issue_to_address>
**issue (bug_risk):** Returning None from load_metadata leads to runtime errors in generate_meta_json when the mapping file is missing or invalid.
`main()` passes the result of `load_metadata` directly as `mapping_data` into `generate_meta_json`, which immediately calls `mapping_data.values()`. When the mapping file is missing or invalid, `load_metadata` returns `None`, causing an exception in `generate_meta_json` with an unhelpful error. Either raise a clear exception from `load_metadata` instead of returning `None`, or have `main()` validate the result and fail fast with a clearer message before calling `generate_meta_json`.
</issue_to_address>
### Comment 5
<location path="generate_config.py" line_range="332-340" />
<code_context>
+ pod = generate_pod_name(deployment_name=workload, is_deployment=is_deployment, replica_index=i)
+ pod_id = str(uuid.uuid4())
+ pods.append(pod)
+ node_name, available_gpu_type = get_node(is_gpu_workload, max_cpu, max_memory)
+ pod_states[pod] = {
+ "id": pod_id,
+ "node": node_name
+ }
+ if is_gpu_workload:
+ pod_states[pod]['gpu_uuid'] = str(uuid.uuid4())
+ pod_states[pod]['gpu_device'] = "nvidia0"
+ pod_states[pod]['gpu_model'] = available_gpu_type
+ for container_name in container_list:
+ pod_states[pod][container_name] = {
</code_context>
<issue_to_address>
**issue:** No handling for failure to allocate a node leads to pods with `node=None` and GPU workloads with `gpu_model=None`.
`get_node` may return `(None, None)` when no node is available, but the code still records `"node": None` and, for GPU workloads, sets `gpu_uuid`, `gpu_device`, and `gpu_model=None`. This creates pods with GPU metadata that are not actually scheduled on any node. Please handle the `node_name is None` case explicitly (e.g., skip pod creation, mark as unschedulable, or raise) and only populate GPU metadata when a node is successfully allocated.
</issue_to_address>
### Comment 6
<location path="generate_config.py" line_range="399-403" />
<code_context>
+ interval_secs = get_num_secs(interval)
+ config_name = args.config_name
+
+ if not re.match(r"^[a-zA-Z0-9_-]+$", config_name):
+ raise ValueError(
+ "Invalid characters in config_name. Only alphanumeric, hyphen (-), and underscore (_) are allowed.")
+
+ if '/' in config_name:
+ raise ValueError("config_name cannot contain '/' character.")
+
</code_context>
<issue_to_address>
**nitpick:** The explicit '/' check for config_name is redundant with the regex and could be folded into one validation step.
Because `^[a-zA-Z0-9_-]+$` already disallows `/`, the `if '/' in config_name` branch is effectively unreachable when the regex check runs first. To keep distinct error messages, either check for `/` before applying the regex, or remove the second check and rely on the regex error. This simplifies the validation logic and avoids dead code paths.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| def get_node(is_gpu_workload, max_cpu, max_memory): | ||
| for node_name, node_info in NODE_MAP.items(): | ||
| if not node_info["is_assigned"]: | ||
| if (node_info["availability"]["cpu"] >= max_cpu and | ||
| node_info["availability"]["memory"] >= max_memory): | ||
| if is_gpu_workload: | ||
| if node_info["availability"]["gpu"] and node_info["availability"]["gpu"]["count"] > 0: | ||
| node_info["availability"]["cpu"] -= max_cpu | ||
| node_info["availability"]["memory"] -= max_memory | ||
| node_info["availability"]["gpu"]["count"] -= 1 |
There was a problem hiding this comment.
issue (bug_risk): Node assignment logic only allows a single workload per node due to is_assigned, effectively ignoring remaining CPU/memory capacity.
get_node sets node_info["is_assigned"] = True on the first scheduled workload, so later calls skip that node even if availability["cpu"] and availability["memory"] still allow more workloads. If nodes are meant to host multiple workloads until capacity is exhausted, either remove is_assigned and rely solely on the availability checks, or reserve is_assigned for a true terminal state (e.g., drained node). Otherwise, the availability tracking becomes misleading and nodes are systematically underutilized.
| def get_num_secs(secs: str): | ||
| if secs is None: | ||
| return 30 | ||
| if secs == "": | ||
| return 30 | ||
| secs = secs.strip() | ||
| if secs not in Constants.INTERVAL_CHOICES: | ||
| return 30 | ||
| if secs == "1s": | ||
| return 1 |
There was a problem hiding this comment.
issue (bug_risk): get_num_secs can fall through without returning a value if new interval choices are added in the future.
Because the function only handles a fixed set of values (1s, 5s, 15s, 30s, 60s), any future value added to Constants.INTERVAL_CHOICES (e.g. "300s") could pass the membership check but not match any branch, causing an implicit None return and later issues in get_timestamps. Please either switch to a mapping dict (e.g. {'1s': 1, '5s': 5, ...}) or add an explicit default return at the end to guarantee a numeric result for all valid choices.
| def get_start_end_unix_seconds(pre_days: int, post_days: int): | ||
| current_time = datetime.now(UTC).replace(second=0, microsecond=0) | ||
| start_time = current_time - timedelta(days=pre_days) | ||
| end_time = current_time + timedelta(days=post_days) | ||
| start_unix_seconds = int(time.mktime(start_time.timetuple())) | ||
| end_unix_seconds = int(time.mktime(end_time.timetuple())) |
There was a problem hiding this comment.
issue (bug_risk): Using time.mktime with UTC datetimes can skew timestamps on systems with non-UTC local time.
datetime.now(UTC) returns UTC-aware datetimes, but time.mktime() treats the timetuple as local time. On non-UTC hosts this will offset the epoch value by the local timezone. To get correct UTC epoch seconds, use calendar.timegm(dt.utctimetuple()) or int(dt.timestamp()) on the aware UTC datetime instead.
|
|
||
| def get_timestamps(from_ts: int, to_ts: int, interval_secs: int): | ||
| if from_ts <= 0 or to_ts <= 0 or interval_secs <= 0 or from_ts >= to_ts: | ||
| return None |
There was a problem hiding this comment.
issue (bug_risk): Returning None from load_metadata leads to runtime errors in generate_meta_json when the mapping file is missing or invalid.
main() passes the result of load_metadata directly as mapping_data into generate_meta_json, which immediately calls mapping_data.values(). When the mapping file is missing or invalid, load_metadata returns None, causing an exception in generate_meta_json with an unhelpful error. Either raise a clear exception from load_metadata instead of returning None, or have main() validate the result and fail fast with a clearer message before calling generate_meta_json.
| node_name, available_gpu_type = get_node(is_gpu_workload, max_cpu, max_memory) | ||
| pod_states[pod] = { | ||
| "id": pod_id, | ||
| "node": node_name | ||
| } | ||
| if is_gpu_workload: | ||
| pod_states[pod]['gpu_uuid'] = str(uuid.uuid4()) | ||
| pod_states[pod]['gpu_device'] = "nvidia0" | ||
| pod_states[pod]['gpu_model'] = available_gpu_type |
There was a problem hiding this comment.
issue: No handling for failure to allocate a node leads to pods with node=None and GPU workloads with gpu_model=None.
get_node may return (None, None) when no node is available, but the code still records "node": None and, for GPU workloads, sets gpu_uuid, gpu_device, and gpu_model=None. This creates pods with GPU metadata that are not actually scheduled on any node. Please handle the node_name is None case explicitly (e.g., skip pod creation, mark as unschedulable, or raise) and only populate GPU metadata when a node is successfully allocated.
| if not re.match(r"^[a-zA-Z0-9_-]+$", config_name): | ||
| raise ValueError( | ||
| "Invalid characters in config_name. Only alphanumeric, hyphen (-), and underscore (_) are allowed.") | ||
|
|
||
| if '/' in config_name: |
There was a problem hiding this comment.
nitpick: The explicit '/' check for config_name is redundant with the regex and could be folded into one validation step.
Because ^[a-zA-Z0-9_-]+$ already disallows /, the if '/' in config_name branch is effectively unreachable when the regex check runs first. To keep distinct error messages, either check for / before applying the regex, or remove the second check and rely on the regex error. This simplifies the validation logic and avoids dead code paths.
|
|
||
| NODE_MAP = {} | ||
|
|
||
| for node_index in range(NUM_NODES): |
There was a problem hiding this comment.
Can you add description into this file. Also rebase the PR
This PR is built on top of #7
#7 Needs to be merged before merging this PR
This PR adds the config generator which generates the hierarchy mapping needed for data generator. Config will have the metadata like namespace details, deployment info, pod info and container details along with the nature of the workload
Summary by Sourcery
Add a configurable workload metadata generator that produces hierarchical cluster/namespace/workload configuration files for use by the data generator.
New Features: