(feat.) Initial Chat-With-Factory module#633
Conversation
Add the query_factory_ontology Foundry tool to 513-chat-with-your-factory: ontologyClient (mssql Fabric lakehouse reads), factoryTool dispatch, agent provisioning, Helm/deploy env wiring, and grounding/live-data docs. Validated end-to-end against a real Fabric lakehouse (T2) and through the Foundry agent (T3). Harden env gitignore to .env.* (keep .env.template), ignore megalinter-reports/ and *.lscache, and remove stray tracked C# cache files.
The .hve-core gitlink (mode 160000, no .gitmodules) was a local HVE-Core tooling install swept into the migration commit. Untrack it and ignore .hve-core/ so the dangling submodule reference does not reach main.
Dependency ReviewThe following issues were found:
|
📚 Documentation Health ReportGenerated on: 2026-06-24 15:26:45 UTC 📈 Documentation Statistics
🏗️ Three-Tree Architecture Status
🔍 Quality Metrics
This report is automatically generated by the Documentation Automation workflow. |
📚 Documentation Health ReportGenerated on: 2026-06-24 15:31:11 UTC 📈 Documentation Statistics
🏗️ Three-Tree Architecture Status
🔍 Quality Metrics
This report is automatically generated by the Documentation Automation workflow. |
- Remove duplicate H1 (MD025) and rename duplicate Copilot Studio headings (MD024) - Format markdown tables to satisfy markdown-table-formatter - Add product terms (Copilot Studio, Dev Tunnels, Direct Line, Bot Framework, Voice Live, Web Speech, sideload) to cspell words - Replace example dev tunnel ID to avoid cspell false positive
📚 Documentation Health ReportGenerated on: 2026-06-24 16:37:28 UTC 📈 Documentation Statistics
🏗️ Three-Tree Architecture Status
🔍 Quality Metrics
This report is automatically generated by the Documentation Automation workflow. |
… root eslint The service has its own eslint.config.js importing typescript-eslint and globals, which are only installed in the service's node_modules. Root CI runs npx eslint . with only root deps, so traversing into the service crashed with ERR_MODULE_NOT_FOUND. Ignore the service subtree at the repo root; it is linted with its own config locally.
Transitive dependency of express@5 via router@2. The flagged 8.3.0 is vulnerable to ReDoS via sequential optional groups (CVE-2026-4926, high). Patched in 8.4.0; lockfile resolves to 8.4.2 under router's ^8.0.0 range.
📚 Documentation Health ReportGenerated on: 2026-06-24 18:33:14 UTC 📈 Documentation Statistics
🏗️ Three-Tree Architecture Status
🔍 Quality Metrics
This report is automatically generated by the Documentation Automation workflow. |
CodeQL js/missing-rate-limiting flagged the authorized /api middleware and the SPA fallback file-serving handler. Add express-rate-limit and apply a global limiter after the unauthenticated health probe so the authorized API routes and static file fallback are throttled. Set trust proxy=1 so the limiter keys on the real client IP behind the App Service reverse proxy.
📚 Documentation Health ReportGenerated on: 2026-06-24 19:07:08 UTC 📈 Documentation Statistics
🏗️ Three-Tree Architecture Status
🔍 Quality Metrics
This report is automatically generated by the Documentation Automation workflow. |
📚 Documentation Health ReportGenerated on: 2026-06-25 15:00:39 UTC 📈 Documentation Statistics
🏗️ Three-Tree Architecture Status
🔍 Quality Metrics
This report is automatically generated by the Documentation Automation workflow. |
📚 Documentation Health ReportGenerated on: 2026-06-25 16:33:54 UTC 📈 Documentation Statistics
🏗️ Three-Tree Architecture Status
🔍 Quality Metrics
This report is automatically generated by the Documentation Automation workflow. |
| # (synthesizes a dev user — insecure, non-production). Otherwise the server | ||
| # requires TEAMS_APP_ID and TEAMS_AUTH_AUDIENCES and will fail to start without them. | ||
| AUTH_REQUIRED: "false" | ||
| # SKIP_AUTH="true" only bypasses auth for true loopback traffic, not in-cluster requests. |
There was a problem hiding this comment.
The default AUTH_REQUIRED: "false" disables authentication for all requests regardless of network origin. Any operator who installs the chart without explicitly overriding this value deploys without auth in a cluster environment. Unlike SKIP_AUTH=true (which is restricted to loopback connections), AUTH_REQUIRED=false applies to all traffic reaching the pod.
Consider removing this line from the defaults so the chart fails closed, and leaving it as a commented-out example that operators must consciously opt into:
# Set to "false" for a Teams-less standalone/demo deployment ONLY.
# When unset, TEAMS_APP_ID and TEAMS_AUTH_AUDIENCES are required.
# AUTH_REQUIRED: "false"| @@ -0,0 +1,42 @@ | |||
| FROM node:22-slim AS builder | |||
There was a problem hiding this comment.
Every other Dockerfile in the src/500-application/ directory pins base images to a full digest (image:tag@sha256:...). Both stages here use the floating node:22-slim tag, which means the next build silently picks up whatever is current upstream with no review gate.
Please pin both stages to the current digest, for example:
FROM node:22-slim@sha256:<digest> AS builderRun docker pull node:22-slim && docker inspect node:22-slim --format '{{index .RepoDigests 0}}' to resolve the digest at the time of merge.
| path: /api/healthz | ||
| port: http | ||
| initialDelaySeconds: 5 | ||
| periodSeconds: 10 |
There was a problem hiding this comment.
The Dockerfile correctly switches to a non-root appuser, but the Kubernetes manifest does not enforce this at the pod level. Without a securityContext, a cluster that does not enforce Pod Security Admission Restricted profile will allow the container to run as root if the image is changed or overridden.
Adding explicit constraints makes the manifest self-enforcing regardless of cluster policy:
securityContext:
runAsNonRoot: true
allowPrivilegeEscalation: false
capabilities:
drop: ["ALL"]| @@ -0,0 +1,54 @@ | |||
| --- | |||
| # Default values for chat-with-your-factory | |||
| replicaCount: 1 | |||
There was a problem hiding this comment.
sessionStore, sseRegistry, sseTicketStore, and wsTicketStore are all process-local Map instances. If replicaCount is scaled above 1, a session created on one pod is invisible to another, and SSE/WebSocket connections land on whichever pod the load balancer selects.
The default of 1 is correct, but nothing prevents an operator from bumping this value. A comment here would make the constraint self-documenting:
# Session state is in-memory and process-local. replicaCount MUST remain 1.
# Scaling beyond 1 replica requires an external session store (e.g. Redis)
# and is not supported in this release.
replicaCount: 1There was a problem hiding this comment.
The Foundry run-polling loop has no timeout or max-iteration guard. If a run stalls in queued or in_progress indefinitely (quota pause, upstream hang), the HTTP connection is held open indefinitely, blocking a server worker and accumulating sockets until resources are exhausted. Express 5 has no built-in request timeout.
Two improvements together:
- Move the status set to module scope as a
ReadonlySet(prevents mutation, avoids recreating the array on every request). - Add a deadline guard inside the loop.
// Module scope
const POLL_STATUSES: ReadonlySet<string> = new Set(['queued', 'in_progress', 'requires_action'])
const MAX_POLL_MS = 120_000
// Inside dispatchChat, Foundry branch
const pollDeadline = Date.now() + MAX_POLL_MS
while (POLL_STATUSES.has(run.status)) {
if (Date.now() > pollDeadline) {
throw new Error('Agent run timed out after 120 s')
}
await new Promise(r => setTimeout(r, 1000))
run = await agentsClient.runs.get(threadId, run.id)
...
}|
|
||
| PORT=${PORT:-3978} | ||
| AUTH_REQUIRED=${AUTH_REQUIRED:-false} | ||
|
|
There was a problem hiding this comment.
AUTH_REQUIRED defaults to false here, so running this script without pre-setting the variable generates a .env with authentication disabled. Combined with the Helm values default, both the script-based and chart-based setup paths point toward auth-disabled as the path of least resistance.
Changing the default to true means operators must explicitly opt out:
AUTH_REQUIRED=${AUTH_REQUIRED:-true}There was a problem hiding this comment.
Thank you for this contribution — the Chat With Factory component is a genuinely useful addition to the repo. The pluggable agent backends, the factory ontology grounding tool, the Teams SSO path, and the graceful shutdown handling all reflect careful engineering. The Foundry polling loop and the SSE ticket design in particular are well thought through.
A few things need to be addressed before this can merge:
Blocking
- The component folder must be renamed from
513-chat-with-your-factoryto516-chat-with-your-factory—513is already taken by513-tiered-notification-service.
Required
- All
ms.datefrontmatter fields across the 8 documentation files need updating to today's date — full file list in the reply there). - The
docs/subfolder structure is out of step with every other component insrc/500-application/; please consolidate into the rootREADME.md, keeping onlyfactory-tool-grounding.mdandfactory-tool-live-data.mdas standalone references (see inline ondocs/development.md). - The README is missing explicit references to the edge-ai blueprints and components that operators need to provision first (see inline on
README.md). image.tag: "latest"withpullPolicy: Alwaysinvalues.yamlis a supply-chain anti-pattern — please change totag: ""andpullPolicy: IfNotPresent(see inline).- There are also several open threads from @rezatnoMsirhC (
AUTH_REQUIREDdefaults, unpinned Dockerfile base image, missing podsecurityContext,replicaCountcomment, Foundry polling timeout) that need responses.
Minor
generate-env-config.shshould useset -euo pipefailto be consistent with the other two scripts in the same directory (see inline).namespace: defaultshould be changed to a dedicated namespace (see inline).- The
trust proxycomment references App Service but the primary deployment path is Kubernetes — a small wording update keeps it accurate for both (see inline).
Happy to discuss any of these. Looking forward to seeing this merged once the items above are addressed!
| image: | ||
| repository: your-registry.azurecr.io/chat-with-your-factory | ||
| pullPolicy: Always | ||
| tag: "latest" |
There was a problem hiding this comment.
image.tag: "latest" Is a Supply-Chain Anti-Pattern
# Current
image:
repository: your-registry.azurecr.io/chat-with-your-factory
pullPolicy: Always
tag: "latest"tag: "latest" combined with pullPolicy: Always means every pod restart silently pulls whatever image is currently tagged latest — including unreviewed changes — with no deployment audit trail. This is in direct tension with the open comment on this PR about pinning the Dockerfile base image to a digest: the application image itself needs equal rigour.
# Suggested
image:
repository: your-registry.azurecr.io/chat-with-your-factory
pullPolicy: IfNotPresent
tag: "" # Set to a specific semver tag or digest at deploy timeWith tag: "", Helm fails at render time if the operator does not supply --set image.tag=<value>, which forces an explicit, auditable version at every deployment.
| # Creates a .env file with all required environment variables. | ||
| # Uses current environment variables as values, with defaults for unset variables. | ||
|
|
||
| set -e |
There was a problem hiding this comment.
Missing set -u and pipefail
# Current
set -e
# Suggested — consistent with the other two scripts in this directory
set -euo pipefaildeploy-chat-with-your-factory.sh and provision-foundry-agent.sh both already use set -euo pipefail. The -u flag catches unbound variable references early; pipefail propagates failures through pipelines. Please align all three scripts.
There was a problem hiding this comment.
Prerequisites: Please Add Edge-AI Blueprint References
This component depends on several other edge-ai components/blueprints that operators need to provision first. Please add an explicit prerequisites block referencing them, following the pattern in 507-ai-inference/README.md.
Suggested prerequisites block:
## Prerequisites
- Azure subscription with sufficient quota
- Edge cluster, ACR, and Azure AI Foundry deployed via the
[full-multi-node-cluster](../../../blueprints/full-multi-node-cluster/) blueprint
- Microsoft Fabric workspace deployed via the
[fabric](../../../blueprints/fabric/) blueprint
- CORA/CORAX ontology provisioned into that workspace via
[033-fabric-ontology](../../000-cloud/033-fabric-ontology/README.md)
- `kubectl` and `helm` installed locallyThere was a problem hiding this comment.
Docs Structure Not Aligned With Repository Convention
Every other application component under src/500-application/ uses a single README.md at the component root — no docs/ subfolder:
| Component | Docs approach |
|---|---|
503-media-capture-service |
README.md only |
507-ai-inference |
README.md only |
509-sse-connector |
README.md only |
513-tiered-notification-service |
README.md only |
This PR introduces 7 separate files under docs/. That creates navigation friction and breaks the established pattern.
Suggested consolidation:
| Current file | Disposition |
|---|---|
docs/development.md |
Merge into README.md as ## Development |
docs/setup-guide.md |
Merge into README.md as ## Setup |
docs/architecture.md |
Merge into README.md as ## Architecture |
docs/DOCKER_COMPOSE_README.md |
Merge into README.md as ## Local Development (Docker Compose) |
docs/HELM_CHART_GUIDE.md |
Merge into README.md as ## Kubernetes Deployment (Helm) |
docs/factory-tool-grounding.md |
✅ Keep — tool-specific reference |
docs/factory-tool-live-data.md |
✅ Keep — future-path spec |
The .env.template approach for configuration documentation is already well-aligned — no change needed there.
| nameOverride: "" | ||
| fullnameOverride: "" | ||
|
|
||
| namespace: default |
There was a problem hiding this comment.
namespace: default — Use a Dedicated Namespace
Deploying application workloads into the default namespace makes it harder to scope RBAC policies, NetworkPolicies, and resource quotas to this workload.
# Suggested
namespace: chat-with-your-factory # Operators may override via --set namespace=<value>Please also add a one-liner to the Helm deployment guide noting that namespace creation is a prerequisite:
kubectl create namespace chat-with-your-factory| // App Service terminates TLS at a reverse proxy, so the real client IP arrives | ||
| // in X-Forwarded-For. Trust the first hop so the rate limiter keys on the | ||
| // caller rather than the proxy. | ||
| app.set('trust proxy', 1) |
There was a problem hiding this comment.
trust proxy Comment References App Service; Deployment Target Is Kubernetes
The comment says "App Service terminates TLS" but the documented deployment path is Kubernetes via the Helm chart. The value of 1 is likely correct for both environments (AKS + ingress-nginx also uses a single proxy hop), but the comment is misleading for operators deploying to Kubernetes without App Service.
// Suggested
// A reverse proxy (App Service or Kubernetes Ingress) terminates TLS and
// forwards the real client IP in X-Forwarded-For. Trust one hop so the rate
// limiter keys on the caller rather than the proxy address.
app.set('trust proxy', 1)|
Supplementing my
|
…faults Remove the active AUTH_REQUIRED="false" default from the chart so an operator who installs without overrides deploys WITH auth required. Leave it as a commented opt-in for Teams-less standalone/demo only, noting it disables auth for all in-cluster traffic (not just loopback). Addresses PR #633 review feedback.
Both builder and runtime stages used the floating node:22-slim tag, so rebuilds could silently pull a new upstream image. Pin both to the current digest, matching the digest-pinning convention used by every other src/500-application Dockerfile. Addresses PR #633 review feedback.
…ontext The Dockerfile runs as appuser but the Deployment relied on cluster Pod Security Admission to prevent running as root. Add an explicit container securityContext (runAsNonRoot, allowPrivilegeEscalation=false, drop ALL capabilities) so the manifest is self-enforcing regardless of cluster policy. Addresses PR #633 review feedback.
Session, SSE registry, and ticket stores are process-local in-memory Maps, so running more than one replica breaks session affinity. Add a comment on replicaCount making the constraint self-documenting. Addresses PR #633 review feedback.
…dline The run-polling loop had no timeout, so a stalled run (quota pause, upstream hang) would hold the HTTP socket and worker open indefinitely. Add a 120 s deadline guard and hoist the poll-status set to a module-scope ReadonlySet. Addresses PR #633 review feedback.
… generator The env-config generator defaulted AUTH_REQUIRED to false, so running it without pre-setting the variable produced a .env with auth disabled. Flip the default to true so operators must explicitly opt out. Addresses PR #633 review feedback.
…hart Replace image.tag=latest + pullPolicy=Always (which silently pulls unreviewed images on every restart) with an empty default and pullPolicy=IfNotPresent. Make the image helper 'required' so rendering fails unless the operator supplies an explicit, auditable tag/digest at deploy time. Addresses PR #633 review feedback.
…ipefail Align generate-env-config.sh with the other two scripts in the directory: add -u (catch unbound variables) and pipefail (propagate pipeline failures). Addresses PR #633 review feedback.
…README Add a top-level Prerequisites section referencing the upstream blueprints and components operators must provision first (full-multi-node-cluster, fabric, 033-fabric-ontology), following the 507-ai-inference pattern. Rename the tool-specific subsection to 'Tool Prerequisites' to avoid a duplicate heading. Addresses PR #633 review feedback.
Merge the five general docs (setup-guide, architecture, development, DOCKER_COMPOSE_README, HELM_CHART_GUIDE) into README.md as top-level sections, aligning with the single-README convention used by every other src/500-application component. Keep the two tool-specific references (factory-tool-grounding, factory-tool-live-data). Demote merged headings to keep them unique (MD024), unlink the pre-existing broken ADR references, and fix the Helm example to use an explicit image tag. Addresses PR #633 review feedback.
Change the chart default namespace from 'default' to 'chat-with-your-factory' so RBAC, NetworkPolicies, and quotas can be scoped to this workload; operators may override via --set namespace. Update the README Helm commands to match and add a kubectl create namespace prerequisite note. Addresses PR #633 review feedback.
…ernetes The trust proxy comment referenced only App Service, but the documented deployment path is Kubernetes via the Helm chart. Reword to cover both reverse proxies (App Service or Kubernetes Ingress). Addresses PR #633 review feedback.
Update the ms.date front-matter on the surviving docs (README, factory-tool- grounding, factory-tool-live-data). The other five files in the reviewer's list were consolidated into README and no longer exist. Addresses PR #633 review feedback.
513 is already used by 513-tiered-notification-service. Rename the component directory to 516-chat-with-your-factory and update the two README self-references and the root eslint ignore path. Addresses PR #633 blocking review feedback.
📚 Documentation Health ReportGenerated on: 2026-06-29 20:28:09 UTC 📈 Documentation Statistics
🏗️ Three-Tree Architecture Status
🔍 Quality Metrics
This report is automatically generated by the Documentation Automation workflow. |
📚 Documentation Health ReportGenerated on: 2026-06-29 21:05:55 UTC 📈 Documentation Statistics
🏗️ Three-Tree Architecture Status
🔍 Quality Metrics
This report is automatically generated by the Documentation Automation workflow. |
Pull Request
Description
Adds a new application component,
513-chat-with-your-factory: a voice-enabled web chat application that lets operators ask natural-language questions about factory equipment and get answers grounded in live industrial data. It runs standalone today and is built to also run embedded in Microsoft Teams (Teams SSO + a tab manifest template are included but not yet deployed or validated in Teams — that is a planned follow-up).The service is a Node.js/TypeScript (ESM) Express app (port 3978) with a React client, pluggable agent backends, an optional Teams SSO layer, a Microsoft Fabric–backed grounding tool, and Helm charts for Kubernetes deployment.
Highlights:
AGENT_BACKEND: Copilot Studio Agents SDK (default, OBO token exchange), Azure AI Foundry, and Copilot Studio Direct Line.query_factory_ontology) that queries a Microsoft Fabric lakehouse overmssqlusing Entra token auth (scopehttps://database.windows.net/.default).@microsoft/teams-jsclient integration, Teams SSO/JWT validation (jose) with on-behalf-of token exchange, and a Teams app manifest template.Type of Change
Implementation Details
src/500-application/513-chat-with-your-factory/: service code (services/chat-with-your-factory/), Helm charts (charts/), Teams app manifest template (appManifest/), provisioning scripts (scripts/), and docs (docs/).src/server/): Express handlers, agent-backend adapters, optional Teams JWT auth middleware (bypassed in standalone mode viaAUTH_REQUIRED=false/SKIP_AUTH=true), and thequery_factory_ontologyfactory tool with a Fabric lakehouse client (ontologyClient.ts) using Entra access tokens.src/client/): React UI with chat hooks/components, plus@microsoft/teams-jsintegration for theme/user/SSO when embedded in Teams..env.template(at the component root) documents all variables. No secrets are committed (.env*is gitignored except the template).src/000-cloud/033-fabric-ontology/definitions/examples/for the ontology mapping consumed by the tool..gitignoreupdated to exclude local tooling/IDE artifacts.Testing Performed
Terraform plan/apply
Blueprint deployment test
Unit tests
Integration tests
Bug fix includes regression test (see Test Policy)
Manual validation
Other:
Unit tests: vitest suite passing (6/6) in
factoryTool.test.ts, covering thequery_factory_ontologytool's input schema, query construction, and result shaping. Other modules (agent backends, auth middleware, client) are not yet covered by automated tests.Manual validation (standalone UI): ran the server in standalone mode (
AUTH_REQUIRED=false); the React landing page renders and is interactive — session controls, suggestion chips (Summarize open work orders / Recent alerts / Compare machine downtime / Draft a work order), and the mic + text-input bar. Screenshot below.Not exercised in this run: agent-backend round-trips, Teams SSO, and live Fabric lakehouse queries were not invoked during standalone validation.
Validation Steps
cd src/500-application/513-chat-with-your-factory/services/chat-with-your-factory && npm cinpm test(expect 6/6 passing).src/500-application/513-chat-with-your-factory/), copy.env.templateto.envand populate perdocs/setup-guide.md(agent backend, Fabric lakehouse). SetAUTH_REQUIRED=falsefor standalone validation.npm start(builds, then runs with--env-file ../../.env). The server logsChat With Factory server running on http://localhost:3978.http://localhost:3978/; the landing page renders. (Agent/Fabric responses require a configured backend and Fabric lakehouse.)Checklist
terraform fmton all Terraform code — N/A (no Terraform changes)terraform validateon all Terraform code — N/A (no Terraform changes)az bicep formaton all Bicep code — N/A (no Bicep changes)az bicep buildto validate all Bicep code — N/A (no Bicep changes)Security Review
.env*gitignored (template only). Only well-known public Microsoft first-party IDs (Graph, User.Read scope, Teams web/desktop clients, Power Platform API) appear in docs, annotated as such.https://database.windows.net/.default).Additional Notes
513-chat-with-your-factorycomponent (plus one fabric-ontology example and a.gitignoreupdate).AUTH_REQUIRED=true), Kubernetes deployment, and live agent/Fabric validation are planned follow-ups.