Content Classification: improve relevance of taxonomy suggestions by saarnilauri · Pull Request #633 · WordPress/ai

saarnilauri · 2026-05-29T05:55:36Z

What?

Closes #452.

Improves the relevance of category and tag suggestions produced by the
Content Classification experiment. Six commits, all scoped to the
ability and its tests; no schema changes, no UI changes.

Why?

#452 reports that suggestions are often not relevant to the post.
An audit found three concrete drivers in the existing pipeline:

1. The system prompt told the model to "strongly prefer selecting
   from those terms" when the existing-only candidate pool was
   present. That pool was the top 100 terms ordered by usage count, so
   the model was being nudged toward generic popular tags
   (Tutorial, Guide, Beginner, WordPress, News) regardless
   of fit.
2. The prompt told the model only the taxonomy slug, not its
   intent. Tag and category guidance was identical, and the rubric
   ("prioritize specificity") was wrong for categories.
3. There was no server-side confidence floor: model-reported confidences
   of 0.5 ("somewhat relevant") were treated identically to 0.95 matches.
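The popularity bias in the first driver can be illustrated with a small standalone sketch. It is written in Python rather than the plugin's PHP, uses a subset of the seed tags posted later in this thread, and scales the pool size down from 100 to 5 so the effect is visible; the function name `top_terms_by_count` is hypothetical and is not the plugin's `get_top_terms()`.

```python
# Subset of the seed tags (name, usage count) from the evaluation corpus.
TAGS = [
    ("WordPress", 8), ("Tutorial", 7), ("News", 7), ("Beginner", 6), ("Guide", 6),
    ("Machine Learning", 3), ("React", 3),
    ("Post-Quantum Cryptography", 0), ("Zero Trust", 0),
]


def top_terms_by_count(tags, limit):
    """Return the `limit` most-used term names (a popularity-ordered pool)."""
    # sorted() is stable, so ties keep their original order.
    ranked = sorted(tags, key=lambda t: t[1], reverse=True)
    return [name for name, _count in ranked[:limit]]


# Assumption: limit scaled down from 100 to 5 for illustration.
pool = top_terms_by_count(TAGS, limit=5)
print(pool)
print("Post-Quantum Cryptography" in pool)
```

With a pool built this way, the only tags that fit a post on post-quantum cryptography never reach the model, while the generic cluster always does.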

How?

The change rolls up four small, independently shippable adjustments
to includes/Abilities/Content_Classification/:

1. Lower temperature + confidence floor. Temperature drops from 0.5
   to 0.2 for more consistent classification. New MIN_CONFIDENCE = 0.6
   constant and wpai_content_classification_min_confidence filter.
   Suggestions below the floor are dropped before sort and slice; the
   filter return is clamped to [0, 1] defensively.
2. Richer taxonomy descriptor in the prompt. Replaces
   <taxonomy>slug</taxonomy> with
   <taxonomy name="…" label="…" kind="tag|category" hierarchical="…">description</taxonomy>,
   built from get_taxonomy(). The kind attribute is what the
   next step branches on.
3. System instruction rewrite (the headline change).
   - <available-terms> reframed as a candidate pool: "Use these
     only when they genuinely fit. Relevance always outweighs
     popularity. Do not force a match."
   - Branched guidance:
     - kind="category": broad, thematic; respect hierarchy; do not
       pad with loosely-related parents.
     - kind="tag": specific, descriptive; explicit nudge against
       generic process-style tags ("Tutorial", "Guide", "Beginner",
       "News", "WordPress") unless the post is genuinely framed that
       way and there is no more specific tag.
   - Confidence rubric tightened with explicit bands and a "fewer
     high-quality suggestions is better than padding" directive.
4. wpai_content_classification_available_terms filter. Lets
   sites swap the default popularity-ordered candidate pool for a
   relevance-ranked one (e.g. embedding-based retrieval) without
   touching prompt code. Returns the same default as before when no
   filter is registered, so this is purely additive.
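The floor-then-sort-then-slice order from the first adjustment can be sketched as follows. This is a standalone Python illustration, not the plugin's PHP. It assumes the floor is inclusive (a confidence equal to the floor is kept), and `max_results`, `clamp_floor`, and `apply_confidence_floor` are hypothetical names introduced here.

```python
MIN_CONFIDENCE = 0.6  # default floor named in the PR description


def clamp_floor(value):
    """Clamp a floor value (e.g. one returned by a filter) to [0, 1]."""
    return max(0.0, min(1.0, float(value)))


def apply_confidence_floor(suggestions, floor=MIN_CONFIDENCE, max_results=5):
    """Drop low-confidence suggestions, then sort and slice the rest."""
    floor = clamp_floor(floor)
    # Assumption: the floor is inclusive.
    kept = [s for s in suggestions if s["confidence"] >= floor]
    kept.sort(key=lambda s: s["confidence"], reverse=True)
    return kept[:max_results]


suggestions = [
    {"term": "Guide", "confidence": 0.5},
    {"term": "Post-Quantum Cryptography", "confidence": 0.95},
    {"term": "Zero Trust", "confidence": 0.7},
]
result = apply_confidence_floor(suggestions)
print([s["term"] for s in result])
```

Filtering before the slice matters: if the slice ran first, a weak suggestion could take a slot and then be dropped, leaving fewer results than the relevant ones available.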

Evaluation

To make sure each step actually moved the needle, I built an
out-of-tree fixture corpus (12 posts spanning English/Spanish/Finnish,
short/medium/long, hierarchical and orphan categories, long-tail and
near-duplicate tags) plus a wp eval-file-driven runner that invokes
the ability per fixture × taxonomy and computes precision / recall
against a ground-truth relevant_terms block. The eval harness is
intentionally not included in this PR (it's a measurement tool,
not part of the plugin), but the resulting numbers are:

| Stage | Precision | Recall | TP / FP / FN |
|---|---|---|---|
| Baseline (develop HEAD) | 0.537 | 0.785 | 51 / 44 / 14 |
| 1. Temperature + confidence floor | 0.607 | 0.810 | 51 / 33 / 12 |
| 2. Taxonomy metadata in prompt | 0.630 | 0.810 | 51 / 30 / 12 |
| 3. System instruction rewrite | 0.833 | 0.794 | 50 / 10 / 13 |
| 4. Available-terms filter (no-op default) | 0.857 | 0.762 | 48 / 8 / 15 |

Eight of twelve fixtures hit precision 1.00 on tag suggestions after
step 3. The generic-tag FP cluster identified in the audit
("Tutorial", "Guide", "Beginner", "WordPress") is essentially gone.

LLM variance at temp=0.2 is ±0.03 precision / ±0.04 recall between
identical-code runs, so the step-by-step deltas should be read as a
trend rather than as precise measurements; the headline finding
(roughly +30 percentage points of precision over baseline, with recall
held roughly flat) is well outside that envelope.
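For readers who want to check the table, here is a minimal Python sketch of the scoring. The harness itself is not part of this PR, so this only assumes how it might work: exact, case-sensitive matching of suggested term names against relevant_terms, with counts pooled across fixtures. The function names are hypothetical.

```python
def score(suggested, relevant):
    """Count true positives, false positives, and false negatives."""
    suggested, relevant = set(suggested), set(relevant)
    tp = len(suggested & relevant)   # suggested and relevant
    fp = len(suggested - relevant)   # suggested but not relevant
    fn = len(relevant - suggested)   # relevant but not suggested
    return tp, fp, fn


def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0


# One fixture: the Kyoto post's ground-truth tags vs. a made-up model output.
tp, fp, fn = score(["Kyoto", "Ryokan", "Guide"], ["Kyoto", "Ryokan", "Travel Tips"])
print(tp, fp, fn)

# Pooled baseline counts from the table: TP=51, FP=44, FN=14.
print(round(precision(51, 44), 3), round(recall(51, 14), 3))
```

The baseline row's 0.537 / 0.785 follows directly from its pooled counts, which is consistent with the table reporting micro-averaged figures.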

Use of AI Tools

AI assistance: Yes
Tool(s): Claude Code
Model(s): Claude Opus 4.7
Used for: Prompt audit, instruction redraft, test scaffolding, eval
harness design, and step-by-step implementation. All code, tests,
and the prompt rewrite were reviewed before commit; eval numbers
were measured against a real OpenAI provider (gpt-4.1-mini /
gpt-5-mini fallback chain).

Testing Instructions

1. Activate the plugin and the AI Provider for OpenAI plugin; enable
   the Content Classification experiment under
   Settings → AI Plugin.
2. Configure an OpenAI key via the connector's standard mechanism
   (env / constant / settings).
3. Create a few posts on varied topics. In the editor, open the
   Content Classification panel and request suggestions for tags and
   for categories.
4. Verify (qualitatively) that:
   - Suggested tags are specific to the post (entities, technologies,
     places) rather than generic ("Tutorial", "Guide", "Beginner")
     unless the post is genuinely tutorial-style.
   - Suggested categories track the post's topic rather than
     popular site-wide categories.

github-actions · 2026-05-29T05:55:45Z

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

If you're merging code through a pull request on GitHub, copy and paste the following into the bottom of the merge commit message.

Co-authored-by: saarnilauri <laurisaarni@git.wordpress.org>
Co-authored-by: mikinchauhan <mikinc860@git.wordpress.org>
Co-authored-by: dkotter <dkotter@git.wordpress.org>

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

codecov · 2026-05-29T06:00:07Z

Codecov Report

❌ Patch coverage is 94.87179% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.59%. Comparing base (b4d6ce3) to head (497f500).
⚠️ Report is 3 commits behind head on develop.

Files with missing lines	Patch %	Lines
.../Content_Classification/Content_Classification.php	94.87%	2 Missing ⚠️

Additional details and impacted files

@@              Coverage Diff              @@
##             develop     #633      +/-   ##
=============================================
+ Coverage      74.45%   74.59%   +0.13%     
- Complexity      1740     1749       +9     
=============================================
  Files             85       85              
  Lines           7521     7558      +37     
=============================================
+ Hits            5600     5638      +38     
+ Misses          1921     1920       -1

Flag	Coverage Δ
unit	`74.59% <94.87%> (+0.13%)`	⬆️


saarnilauri · 2026-05-29T06:12:16Z

This was the data seeded for the evaluation:

{
  "$comment": "Seed corpus for Content Classification relevance work (issue #452). Imported by seed.php. The `relevant_terms` block on each post is ground truth for the eval script — it is NOT stored in WordPress.",
  "version": 1,
  "categories": [
    { "slug": "technology", "name": "Technology", "parent": null },
    { "slug": "ai", "name": "AI", "parent": "technology" },
    { "slug": "web-development", "name": "Web Development", "parent": "technology" },
    { "slug": "cybersecurity", "name": "Cybersecurity", "parent": "technology" },
    { "slug": "lifestyle", "name": "Lifestyle", "parent": null },
    { "slug": "travel", "name": "Travel", "parent": "lifestyle" },
    { "slug": "food", "name": "Food", "parent": "lifestyle" },
    { "slug": "business", "name": "Business", "parent": null },
    { "slug": "finance", "name": "Finance", "parent": "business" },
    { "slug": "science", "name": "Science", "parent": null },
    { "slug": "climate", "name": "Climate", "parent": "science" },
    { "$comment": "Deliberately broad orphan parent — exercises the 'broad vs. specific' branch.",
      "slug": "opinion", "name": "Opinion", "parent": null }
  ],
  "tags": [
    { "$comment": "Heavily used cluster — popularity-bias test. These get assigned to many posts so they dominate top-100-by-count.",
      "name": "WordPress", "popularity": 8 },
    { "name": "Tutorial", "popularity": 7 },
    { "name": "News", "popularity": 7 },
    { "name": "Beginner", "popularity": 6 },
    { "name": "Guide", "popularity": 6 },

    { "$comment": "Topic-specific tags — moderate use.",
      "name": "Machine Learning", "popularity": 3 },
    { "name": "Neural Networks", "popularity": 2 },
    { "name": "Transformers", "popularity": 2 },
    { "name": "Large Language Models", "popularity": 2 },
    { "name": "React", "popularity": 3 },
    { "name": "TypeScript", "popularity": 3 },
    { "name": "Frontend", "popularity": 2 },
    { "name": "Performance", "popularity": 2 },

    { "$comment": "Near-duplicates — dedup/match test.",
      "name": "AI", "popularity": 4 },
    { "name": "Artificial Intelligence", "popularity": 1 },
    { "name": "JS", "popularity": 1 },
    { "name": "JavaScript", "popularity": 3 },

    { "$comment": "Long-tail tags — used 0–1 times. Some are the *only* relevant tag for a post.",
      "name": "Post-Quantum Cryptography", "popularity": 0 },
    { "name": "Zero Trust", "popularity": 0 },
    { "name": "Kyoto", "popularity": 0 },
    { "name": "Ryokan", "popularity": 0 },
    { "name": "Travel Tips", "popularity": 1 },
    { "name": "Paella", "popularity": 0 },
    { "name": "Valencia", "popularity": 0 },
    { "name": "Recipes", "popularity": 1 },
    { "name": "Index Funds", "popularity": 0 },
    { "name": "Retirement Planning", "popularity": 0 },
    { "name": "Personal Finance", "popularity": 1 },
    { "name": "Carbon Capture", "popularity": 0 },
    { "name": "IPCC", "popularity": 0 },
    { "name": "Climate Policy", "popularity": 1 },
    { "name": "Sauna", "popularity": 0 },
    { "name": "Suomi", "popularity": 0 },
    { "name": "Editorial", "popularity": 0 }
  ],
  "$tag_assignment_note": "Tags with popularity > 0 are spread across filler posts during import so they reach that approximate usage count. Filler posts are created in seed.php alongside the corpus posts below.",
  "posts": [
    {
      "id": "post-ai-transformers-en",
      "lang": "en",
      "length": "medium",
      "title": "How transformer models reshaped natural language processing",
      "content": "Transformer architectures introduced in 2017 fundamentally changed how machines process language. Unlike earlier recurrent networks, transformers use self-attention to weigh every token in a sequence against every other token, which lets them capture long-range dependencies that RNNs struggled with. This single architectural shift unlocked the wave of large language models that now dominate the field, from BERT through GPT to today's frontier systems. We look at the encoder-decoder split, how multi-head attention works in practice, and why positional encodings are needed when the model itself is permutation-invariant. We also walk through what changes when you fine-tune a pretrained transformer on a downstream task versus training from scratch, and where parameter-efficient methods like LoRA fit in. The piece closes with a short note on inference cost, since attention is quadratic in sequence length and that constraint shapes most deployment decisions.",
      "assigned_categories": ["ai"],
      "assigned_tags": [],
      "relevant_terms": {
        "category": ["AI", "Technology"],
        "post_tag": ["Machine Learning", "Neural Networks", "Transformers", "Large Language Models", "Artificial Intelligence"]
      }
    },
    {
      "id": "post-react-perf-en",
      "lang": "en",
      "length": "medium",
      "title": "Profiling React apps: where the time actually goes",
      "content": "Most React performance complaints come down to one of three things: too many re-renders, oversized JavaScript bundles, or work happening on the main thread that should not be there. The React DevTools Profiler is a good first stop — it tells you which components rendered, why, and how long each one took. From there, common fixes include memoizing expensive children with React.memo, stabilizing callback identities with useCallback, and moving derived state into useMemo only when the measurement justifies it. Bundle size is its own discipline: code-split at the route level, lazy-load components that are not visible on first paint, and audit your dependencies — a single misbehaving library can dominate your bundle. Finally, watch the main thread in the browser's Performance panel for long tasks. Hydration cost on server-rendered apps is often the hidden culprit. None of this is guesswork once you have the right tooling open.",
      "assigned_categories": ["web-development"],
      "assigned_tags": ["React"],
      "relevant_terms": {
        "category": ["Web Development", "Technology"],
        "post_tag": ["React", "Performance", "Frontend", "JavaScript", "TypeScript"]
      }
    },
    {
      "id": "post-pqc-en",
      "lang": "en",
      "length": "medium",
      "title": "Preparing your stack for post-quantum cryptography",
      "$note": "Relevant tag (Post-Quantum Cryptography) is in the long-tail bucket with popularity 0 — used to verify the popularity-bias fix in Step 3.",
      "content": "NIST has finalized its first set of post-quantum cryptographic standards, and the migration clock has started for every team that ships TLS, signed software updates, or long-lived encrypted archives. The threat is not that quantum computers can break RSA or elliptic curve cryptography today — they cannot. The threat is that adversaries can capture encrypted traffic now and decrypt it years from now when sufficiently large quantum hardware exists. For data that must remain confidential for a decade, that risk window is already open. Practical first steps: inventory where you use public-key cryptography, understand which protocols you control versus which you depend on, and start tracking library support for ML-KEM and ML-DSA. Hybrid key exchange, where a classical algorithm is combined with a post-quantum one, is the realistic transition path. A zero-trust posture also helps — if every connection is authenticated and short-lived, the blast radius of any cryptographic break shrinks.",
      "assigned_categories": ["cybersecurity"],
      "assigned_tags": [],
      "relevant_terms": {
        "category": ["Cybersecurity", "Technology"],
        "post_tag": ["Post-Quantum Cryptography", "Zero Trust"]
      }
    },
    {
      "id": "post-kyoto-en",
      "lang": "en",
      "length": "medium",
      "title": "Three quiet days in Kyoto, off the tourist track",
      "content": "Kyoto in late autumn is busy, but the city has more shoulder than its reputation suggests. Skip the famous bamboo grove at sunrise and try the smaller groves behind Adashino Nenbutsu-ji instead — same atmosphere, almost nobody there. Stay in a family-run ryokan in the northern Kamigamo neighborhood rather than near the station; the walk to the river bakeries in the morning is worth the bus ride into the center later. Eat where the salarymen eat: the basement of any department store has obanzai counters that beat most kaiseki spots at a tenth of the price. For temples, pick two and spend real time at them rather than racing between ten. Tofuku-ji and Shoren-in both reward an unhurried afternoon. And bring small bills — many of the best places still do not take cards.",
      "assigned_categories": ["travel"],
      "assigned_tags": ["Travel Tips"],
      "relevant_terms": {
        "category": ["Travel", "Lifestyle"],
        "post_tag": ["Kyoto", "Ryokan", "Travel Tips"]
      }
    },
    {
      "id": "post-paella-es",
      "lang": "es",
      "length": "medium",
      "$note": "Multilingual test — model must respond in Spanish.",
      "title": "Cómo hacer una paella valenciana auténtica en casa",
      "content": "La paella valenciana auténtica no lleva chorizo, no lleva guisantes y, sobre todo, no lleva prisa. Los ingredientes tradicionales son pollo, conejo, judía verde plana, garrofó, tomate rallado, pimentón dulce, azafrán, aceite de oliva y arroz bomba. El sofrito se hace con calma, dorando bien la carne antes de añadir la verdura, porque ahí es donde nace el sabor del plato. El caldo se incorpora hirviendo y se reparte por toda la paella; a partir de ese momento ya no se remueve. El fuego debe ser fuerte al principio y bajar progresivamente para que se forme el socarrat en el fondo, esa capa de arroz tostado que es la firma del plato. Una paella para cuatro personas se hace en una paella de treinta y ocho centímetros: si la sartén es demasiado pequeña, el arroz queda apelmazado y nunca se cocina como debe.",
      "assigned_categories": ["food"],
      "assigned_tags": [],
      "relevant_terms": {
        "category": ["Food", "Lifestyle"],
        "post_tag": ["Paella", "Valencia", "Recipes"]
      }
    },
    {
      "id": "post-index-funds-en",
      "lang": "en",
      "length": "medium",
      "title": "Why most investors should just buy index funds",
      "content": "The argument for low-cost index funds has only gotten stronger over the last twenty years. SPIVA reports consistently show that the majority of actively managed funds underperform their benchmark over any meaningful time horizon, and the gap widens the longer you measure. The reason is mostly arithmetic: active management has costs, indexing has almost none, and over decades those costs compound into a substantial gap in returns. For most investors building a retirement portfolio, a small number of broad-market index funds — total US market, total international, and a bond allocation appropriate to your age — covers ninety percent of what you need. Rebalance annually, increase contributions when your income grows, and stop checking the balance during downturns. The boring strategy is the one that works, and most of the energy spent on stock picking would be better spent on increasing your savings rate.",
      "assigned_categories": ["finance"],
      "assigned_tags": [],
      "relevant_terms": {
        "category": ["Finance", "Business"],
        "post_tag": ["Index Funds", "Retirement Planning", "Personal Finance"]
      }
    },
    {
      "id": "post-climate-ipcc-en",
      "lang": "en",
      "length": "long",
      "title": "Reading the latest IPCC synthesis report without losing your mind",
      "content": "The IPCC synthesis reports are some of the most carefully written scientific documents in existence, and also some of the most punishing to read straight through. The structure rewards persistence: the Summary for Policymakers is short, but every sentence in it is the negotiated end-point of weeks of line-by-line approval by government representatives, which is why the language can feel oddly hedged. The full underlying assessment behind it runs into thousands of pages. The headline findings of the most recent cycle are familiar but worth restating: warming has reached roughly 1.1°C above pre-industrial levels, the impacts already observed are widespread and unequal, and limiting warming to 1.5°C now requires emissions to roughly halve by 2030. What changes in this report compared to earlier ones is the weight given to adaptation limits — there are places and systems where no further adaptation is possible, and that finding is new in its bluntness. Carbon capture appears in nearly every modeled pathway, but the report is careful to distinguish proven, near-term options from speculative ones that depend on technologies not yet deployed at scale. Climate policy, ultimately, is policy about who pays and when, and the report is unusually direct about that framing. If you are reading this as a non-specialist, two pieces of advice: start with the figures, especially the famous burning-embers diagrams, and read the SPM once before attempting any chapter of the underlying assessment. The figures carry more of the argument than most readers expect, and the SPM provides the vocabulary you need for the rest.",
      "assigned_categories": ["climate"],
      "assigned_tags": [],
      "relevant_terms": {
        "category": ["Climate", "Science"],
        "post_tag": ["IPCC", "Climate Policy", "Carbon Capture"]
      }
    },
    {
      "id": "post-sauna-fi",
      "lang": "fi",
      "length": "medium",
      "$note": "Multilingual test — model must respond in Finnish.",
      "title": "Suomalaisen saunan etiketti vieraille",
      "content": "Sauna on suomalaisessa kulttuurissa enemmän kuin pelkkä kylpypaikka, ja vieraan kannattaa tietää muutama perussääntö ennen ensimmäistä saunavuoroa. Saunaan mennään yleensä alasti, mutta pyyhe otetaan aina mukaan istuttavaksi lauteille. Löylyä heitetään maltilla, ja yleensä se, joka istuu lähimpänä kiuasta, kysyy muilta ennen kuin lisää vettä. Hiljaisuus on hyväksyttyä ja jopa toivottua: saunassa ei tarvitse pitää keskustelua yllä jos siltä ei tunnu. Saunavuorojen välissä on tapana käydä viilentymässä joko ulkoilmassa, kylmässä suihkussa tai parhaimmillaan järvessä. Saunan jälkeen juodaan jotain kylmää ja syödään pieni välipala, perinteisesti makkaraa, vaikka nykyään mikä tahansa käy. Tärkein sääntö on, että saunassa ei ole kiire: kokonainen saunailta voi kestää useita tunteja, eikä sitä ole tarkoitettu suoritettavaksi.",
      "assigned_categories": ["travel"],
      "assigned_tags": [],
      "relevant_terms": {
        "category": ["Travel", "Lifestyle"],
        "post_tag": ["Sauna", "Suomi"]
      }
    },
    {
      "id": "post-tiny-en",
      "lang": "en",
      "length": "short",
      "$note": "Very short post — exercises behavior on minimal context.",
      "title": "Quick note: GPT-5 mini benchmarks dropped today",
      "content": "OpenAI published the GPT-5 mini benchmarks this morning. Reasoning scores are up across the board, latency is roughly 30% lower than the prior generation, and pricing held steady. Full evaluation pending.",
      "assigned_categories": ["ai"],
      "assigned_tags": [],
      "relevant_terms": {
        "category": ["AI", "Technology"],
        "post_tag": ["AI", "Large Language Models", "News"]
      }
    },
    {
      "id": "post-climate-ai-straddle-en",
      "lang": "en",
      "length": "medium",
      "$note": "Straddles two categories (AI and Climate) — exercises the broad/specific category branch.",
      "title": "Using machine learning to forecast regional climate impacts",
      "content": "Climate models have historically been built on physics-based simulations that run on supercomputers and produce coarse-grained predictions at the regional level. A growing line of research uses machine learning to downscale those predictions to the neighborhood level, combining the physics output with local observations to produce forecasts that are useful for actual planning decisions. The models are not replacing the physics; they are post-processing it. Recent work has shown that transformer-based architectures can capture the spatial dependencies in climate fields surprisingly well, and ensemble methods that blend ML and traditional dynamical downscaling outperform either approach alone. The catch is data: training these models requires decades of high-resolution observations that simply do not exist for most of the planet. Where the data is good — Western Europe, parts of the US, Japan — the results are promising. Where it is not, the models inherit the biases of the sparse inputs they are trained on, and using them for climate policy decisions without that caveat would be a mistake.",
      "assigned_categories": [],
      "assigned_tags": [],
      "relevant_terms": {
        "category": ["AI", "Climate", "Science", "Technology"],
        "post_tag": ["Machine Learning", "Climate Policy", "Neural Networks"]
      }
    },
    {
      "id": "post-undertagged-en",
      "lang": "en",
      "length": "medium",
      "$note": "Under-tagged post — should produce useful tag suggestions.",
      "title": "TypeScript discriminated unions are the feature you actually need",
      "content": "Most TypeScript code I see at review treats the type system as decoration: any everywhere, type assertions where the compiler complains, generics avoided. The single feature that changes how a codebase feels — once a team actually uses it — is the discriminated union. A discriminated union is a union of object types that share a literal property the compiler can narrow on. The classic example is a Result type with success and failure variants, but the pattern generalizes to anything with finite states: form fields, async data, parser tokens, animation phases. The discipline it forces is good: every state must be named, every state must be handled, and exhaustiveness checks catch the case you forgot. The runtime cost is zero, because types are erased at compile time. If you find yourself reaching for booleans-plus-optional-fields to model state, replace that with a discriminated union and watch how much code disappears.",
      "assigned_categories": [],
      "assigned_tags": [],
      "relevant_terms": {
        "category": ["Web Development", "Technology"],
        "post_tag": ["TypeScript", "JavaScript", "Frontend"]
      }
    },
    {
      "id": "post-dup-ai-en",
      "lang": "en",
      "length": "short",
      "$note": "Near-duplicate test — site has both 'AI' and 'Artificial Intelligence' tags; suggestions should not propose both.",
      "title": "What I actually changed when I started using AI coding assistants daily",
      "content": "After six months of using AI coding assistants as part of my normal workflow, the things that changed were not the things I expected. I write fewer one-off scripts by hand. I write more tests, because describing what I want to a model forces me to articulate the contract first. I read more code than I used to, because the assistant produces drafts that need checking. I have not become faster at the tasks I was already fast at; I have become willing to attempt tasks I would previously have skipped. That trade is the one that actually matters.",
      "assigned_categories": [],
      "assigned_tags": [],
      "relevant_terms": {
        "category": ["AI", "Technology", "Opinion"],
        "post_tag": ["AI", "Artificial Intelligence", "Editorial"]
      }
    }
  ]
}

saarnilauri · 2026-06-01T09:35:34Z

@jeffpaul @dkotter

At a quick glance, the failing Plugin Check and Test / Run E2E tests checks are not related to the changes in this PR. It looks like a repo-wide CI infrastructure failure.

dkotter · 2026-06-01T15:53:42Z

> @jeffpaul @dkotter
>
> At a quick glance, the failing Plugin Check and Test / Run E2E tests checks are not related to the changes in this PR. It looks like a repo-wide CI infrastructure failure.

Plugin Check is a known failure due to upstream problems. I re-ran the E2E tests and they are passing now, so it seems like there is a flaky test somewhere in there.

miyanialkesh7 · 2026-06-03T17:15:12Z

+- <taxonomy …>…</taxonomy> describes the target taxonomy. The `kind` attribute is either `category` (broad, thematic, often hierarchical) or `tag` (specific, descriptive). Use this to decide what kind of terms to suggest.
+- <content>…</content> is the post content to classify.
+- <assigned-terms>…</assigned-terms> (optional) lists terms already applied to this post. Never propose these.
+- <available-terms>…</available-terms> (optional) is a *candidate pool* of existing terms on the site, listed in arbitrary order. Use these only when they genuinely fit the content. Relevance always outweighs popularity. If nothing in the pool fits, return only the truly relevant suggestions you would propose anyway — do not force a match.


Relevance always outweighs popularity, frequency of use, or availability in the candidate pool.

miyanialkesh7 · 2026-06-03T18:11:36Z

@@ -315,10 +332,41 @@
 			$available_terms = $this->get_top_terms( $taxonomy );


Consider making get_top_terms() limit filterable

private function get_top_terms( string $taxonomy, int $limit = 100 )

100 may be too much for some sites and too little for others.

Suggestion:

Would it be useful to expose the candidate pool size via a filter? Larger taxonomies may benefit from more than 100 terms, while smaller sites could reduce token usage by lowering the limit.

miyanialkesh7 · 2026-06-03T18:24:35Z


 		// Only fetch existing terms when we need them for post-processing (existing_only strategy).
 		$existing_terms = Content_Classification_Experiment::STRATEGY_EXISTING_ONLY === $strategy
 			? $this->get_existing_terms( $taxonomy )


existing_only strategy currently loads all existing terms:

$this->get_existing_terms( $taxonomy )

which uses:

get_terms( [ 'hide_empty' => false, 'fields' => 'names', ] )

Suggestion:

For very large taxonomies, fetching every term name on every request could become expensive. It may be worth caching the lookup or limiting retrieval to only what is needed for matching.

mikinchauhan · 2026-06-03T19:00:33Z

+			? wp_strip_all_tags( (string) $tax_object->labels->name )
+			: $taxonomy;
+
+		$description = trim( wp_strip_all_tags( (string) $tax_object->description ) );


To safely handle a missing property, use the null coalescing operator:

$description = trim( wp_strip_all_tags( (string) ( $tax_object->description ?? '' ) ) );

mikinchauhan · 2026-06-03T19:01:46Z

+
+		// The description is plain text (tags already stripped) and is read by
+		// the model as content, so it is injected as-is rather than entity-encoded.
+		return $open_tag . $description . '</taxonomy>';


You can avoid string concatenation by using sprintf() for the entire return value:

return sprintf( '<taxonomy name="%1$s" label="%2$s" kind="%3$s" hierarchical="%4$s">%5$s</taxonomy>', esc_attr( $taxonomy ), esc_attr( $label ), esc_attr( $kind ), $is_hierarchical ? 'true' : 'false', $description );
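The descriptor format discussed in this thread can be sketched outside WordPress as well. The Python below is a stand-in for the PHP, not the plugin's code: `html.escape` only approximates `esc_attr()`, and the function name and arguments are assumptions for illustration. As in the diff comment above, the description is treated as plain text and inserted without entity encoding.

```python
from html import escape


def build_taxonomy_descriptor(name, label, kind, hierarchical, description):
    """Build a <taxonomy …>description</taxonomy> prompt fragment."""
    attrs = 'name="{}" label="{}" kind="{}" hierarchical="{}"'.format(
        escape(name, quote=True),    # attribute values are escaped
        escape(label, quote=True),
        escape(kind, quote=True),
        "true" if hierarchical else "false",
    )
    # Assumption: description is already tag-stripped plain text.
    return "<taxonomy {}>{}</taxonomy>".format(attrs, description.strip())


descriptor = build_taxonomy_descriptor(
    "post_tag", 'Tags & "Topics"', "tag", False, " Free-form keywords. "
)
print(descriptor)
```

Building the whole string from one template, as the sprintf() suggestion does, keeps the attribute order and quoting in a single place.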

saarnilauri added 6 commits May 29, 2026 08:10

Content Classification: lower temperature and add confidence floor

8d276ff

Content Classification: enrich prompt with taxonomy metadata

702565c

Adjust tests after implementing the min confidence floor

cd59b99

Content Classification: rewrite system instruction for relevance

a6d9eba

Content Classification: add filter for available-terms pool

32492c8

Content Classification: add new tests and update existing tests

990f967

Merge branch 'develop' into improvement/content-classification

c1716a5

mikinchauhan reviewed May 29, 2026

Comment thread includes/Abilities/Content_Classification/Content_Classification.php

saarnilauri added 2 commits June 1, 2026 08:59

Improve build_taxonomy_descriptor

1fb64c8

Update test related to build_taxonomy_descriptor

497f500

dkotter assigned saarnilauri Jun 1, 2026

dkotter added this to the 1.1.0 milestone Jun 1, 2026

saarnilauri requested a review from mikinchauhan June 3, 2026 11:14

miyanialkesh7 reviewed Jun 3, 2026

mikinchauhan reviewed Jun 3, 2026

dkotter reviewed Jun 3, 2026

Comment thread includes/Abilities/Content_Classification/Content_Classification.php

saarnilauri wants to merge 9 commits into WordPress:develop from saarnilauri:improvement/content-classification


4 participants
