Content Classification: improve relevance of taxonomy suggestions #633
saarnilauri wants to merge 9 commits into
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## develop #633 +/- ##
=============================================
+ Coverage 74.45% 74.59% +0.13%
- Complexity 1740 1749 +9
=============================================
Files 85 85
Lines 7521 7558 +37
=============================================
+ Hits 5600 5638 +38
+ Misses 1921 1920 -1
This was the data seeded for the evaluation:
Plugin Check is a known failure due to upstream problems. I re-ran the E2E tests and they pass now, so there seems to be a flaky test somewhere in there.
| - <taxonomy …>…</taxonomy> describes the target taxonomy. The `kind` attribute is either `category` (broad, thematic, often hierarchical) or `tag` (specific, descriptive). Use this to decide what kind of terms to suggest. | ||
| - <content>…</content> is the post content to classify. | ||
| - <assigned-terms>…</assigned-terms> (optional) lists terms already applied to this post. Never propose these. | ||
| - <available-terms>…</available-terms> (optional) is a *candidate pool* of existing terms on the site, listed in arbitrary order. Use these only when they genuinely fit the content. Relevance always outweighs popularity. If nothing in the pool fits, return only the truly relevant suggestions you would propose anyway — do not force a match. |
Relevance always outweighs popularity, frequency of use, or availability in the candidate pool.
| @@ -315,10 +332,41 @@ | |||
| $available_terms = $this->get_top_terms( $taxonomy ); | |||
Consider making get_top_terms() limit filterable
private function get_top_terms( string $taxonomy, int $limit = 100 )
- 100 may be too much for some sites and too little for others.
Suggestion:
Would it be useful to expose the candidate pool size via a filter? Larger taxonomies may benefit from more than 100 terms, while smaller sites could reduce token usage by lowering the limit.
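A minimal sketch of what that could look like. The filter name `wpai_content_classification_top_terms_limit` and the helper `wpai_sketch_top_terms_limit()` are hypothetical and not part of this PR; only `apply_filters()` is WordPress core.

```php
<?php
/**
 * Sketch only: resolve the candidate pool size through a filter.
 *
 * The filter name used here is hypothetical; it does not exist in the plugin.
 */
function wpai_sketch_top_terms_limit( string $taxonomy, int $default = 100 ): int {
    // apply_filters() is WordPress core. The function_exists() guard is only
    // here so the sketch also runs outside WordPress.
    $limit = function_exists( 'apply_filters' )
        ? (int) apply_filters( 'wpai_content_classification_top_terms_limit', $default, $taxonomy )
        : $default;

    // Guard against a filter returning zero or a negative number.
    return max( 1, $limit );
}
```

`get_top_terms()` could then be called with the resolved value instead of relying on the hard-coded default of 100.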
|
|
||
| // Only fetch existing terms when we need them for post-processing (existing_only strategy). | ||
| $existing_terms = Content_Classification_Experiment::STRATEGY_EXISTING_ONLY === $strategy | ||
| ? $this->get_existing_terms( $taxonomy ) |
existing_only strategy currently loads all existing terms:
$this->get_existing_terms( $taxonomy )
which uses:
get_terms(
[
'hide_empty' => false,
'fields' => 'names',
]
)
Suggestion:
For very large taxonomies, fetching every term name on every request could become expensive. It may be worth caching the lookup or limiting retrieval to only what is needed for matching.
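One possible shape for the caching idea, as a sketch only: memoize the term names per taxonomy for the duration of the request. The helper name and the `$loader` callable are hypothetical; in the plugin the loader would be whatever `get_existing_terms()` does today.

```php
<?php
/**
 * Sketch only: memoize term names per taxonomy for the current request.
 *
 * $loader stands in for the plugin's existing term lookup; both the helper
 * name and the callable parameter are assumptions made for illustration.
 */
function wpai_sketch_cached_term_names( string $taxonomy, callable $loader ): array {
    static $cache = array();

    // Only hit the loader the first time a taxonomy is requested.
    if ( ! array_key_exists( $taxonomy, $cache ) ) {
        $cache[ $taxonomy ] = $loader( $taxonomy );
    }

    return $cache[ $taxonomy ];
}
```

This only saves repeated lookups within one request. A persistent variant could use the WordPress object cache (`wp_cache_get()` / `wp_cache_set()`), but it would need invalidation whenever terms change.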
| ? wp_strip_all_tags( (string) $tax_object->labels->name ) | ||
| : $taxonomy; | ||
|
|
||
| $description = trim( wp_strip_all_tags( (string) $tax_object->description ) ); |
To safely handle a missing property, use the null coalescing operator:
$description = trim(
wp_strip_all_tags( (string) ( $tax_object->description ?? '' ) )
);
|
|
||
| // The description is plain text (tags already stripped) and is read by | ||
| // the model as content, so it is injected as-is rather than entity-encoded. | ||
| return $open_tag . $description . '</taxonomy>'; |
You can avoid string concatenation by using sprintf() for the entire return value:
return sprintf(
'<taxonomy name="%1$s" label="%2$s" kind="%3$s" hierarchical="%4$s">%5$s</taxonomy>',
esc_attr( $taxonomy ),
esc_attr( $label ),
esc_attr( $kind ),
$is_hierarchical ? 'true' : 'false',
$description
);
What?
Closes #452.
Improves the relevance of category and tag suggestions produced by the
Content Classification experiment. Six commits, all scoped to the
ability and its tests; no schema changes, no UI changes.
Why?
#452 reports that suggestions are often not relevant to the post.
Audit found three concrete drivers in the existing pipeline:
- … from those terms" when the existing-only candidate pool was
  present. That pool was the top 100 terms ordered by usage count, so
  the model was being nudged toward generic popular tags
  (Tutorial, Guide, Beginner, WordPress, News) regardless of fit.
- … intent. Tag vs. category guidance was identical, and the rubric
  ("prioritize specificity") was wrong for categories.
- … ("somewhat relevant") were treated identically to 0.95 matches.
How?
The change rolls up four small, independently shippable adjustments
to includes/Abilities/Content_Classification/:
1. Lower temperature + confidence floor. 0.5 → 0.2 for more
   consistent classification. New MIN_CONFIDENCE = 0.6 constant and
   wpai_content_classification_min_confidence filter. Suggestions
   below the floor are dropped before sort and slice; the filter
   return is clamped to [0, 1] defensively.
2. Richer taxonomy descriptor in the prompt. Replaces
   <taxonomy>slug</taxonomy> with
   <taxonomy name="…" label="…" kind="tag|category" hierarchical="…">description</taxonomy>,
   built from get_taxonomy(). The kind attribute is what the next
   step branches on.
3. System instruction rewrite, the headline change.
   - <available-terms> reframed as a candidate pool: "Use these
     only when they genuinely fit. Relevance always outweighs
     popularity. Do not force a match."
   - kind="category": broad, thematic; respect hierarchy; do not
     pad with loosely-related parents.
   - kind="tag": specific, descriptive; explicit nudge against
     generic process-style tags ("Tutorial", "Guide", "Beginner",
     "News", "WordPress") unless the post is genuinely framed that
     way and there is no more specific tag.
   - … high-quality suggestions is better than padding" directive.
4. wpai_content_classification_available_terms filter. Lets
   sites swap the default popularity-ordered candidate pool for a
   relevance-ranked one (e.g. embedding-based retrieval) without
   touching prompt code. Returns the same default as before when no
   filter is registered, so this is purely additive.
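As a rough illustration of the confidence floor in step 1 (not the code in this PR), the sketch below clamps the floor to [0, 1], drops suggestions under it, then sorts and slices. The function name, the suggestion array shape (`term` / `confidence` keys), and the `$max_results` parameter are assumptions made for the example.

```php
<?php
/**
 * Sketch only: apply a confidence floor before sorting and slicing.
 *
 * Assumes each suggestion is an array with 'term' and 'confidence' keys;
 * the real ability may use a different shape.
 */
function wpai_sketch_apply_confidence_floor( array $suggestions, float $min_confidence, int $max_results ): array {
    // Clamp defensively, as a filter callback could return anything.
    $floor = max( 0.0, min( 1.0, $min_confidence ) );

    // Drop everything below the floor first.
    $kept = array_filter(
        $suggestions,
        static fn( array $s ): bool => (float) $s['confidence'] >= $floor
    );

    // Highest confidence first; usort() also reindexes the array.
    usort(
        $kept,
        static fn( array $a, array $b ): int => $b['confidence'] <=> $a['confidence']
    );

    return array_slice( $kept, 0, $max_results );
}
```

In the plugin, `$min_confidence` would come from the `wpai_content_classification_min_confidence` filter with `MIN_CONFIDENCE` as the default. Filtering before the slice means a weak suggestion never takes a slot that a stronger one could have filled.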
Evaluation
To make sure each step actually moved the needle, I built an
out-of-tree fixture corpus (12 posts spanning English/Spanish/Finnish,
short/medium/long, hierarchical and orphan categories, long-tail and
near-duplicate tags) plus a wp eval-file-driven runner that invokes
the ability per fixture × taxonomy and computes precision / recall
against a ground-truth relevant_terms block. The eval harness is
intentionally not included in this PR (it's a measurement tool,
not part of the plugin), but the resulting numbers are:
Eight of twelve fixtures hit precision 1.00 on tag suggestions after
step 3. The generic-tag FP cluster identified in the audit
("Tutorial", "Guide", "Beginner", "WordPress") is essentially gone.
LLM variance at temp=0.2 is ±0.03 precision / ±0.04 recall between
identical-code runs, so the step-by-step deltas should be read as
a trend rather than as precise values; the headline finding (roughly
+30 percentage points of precision over baseline, with recall held
roughly flat) is well outside that envelope.
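For reference, the per-fixture metric can be sketched as below. This is not the harness itself (which is out of tree); the function name and the exact-match comparison of term names are assumptions.

```php
<?php
/**
 * Sketch only: precision and recall of suggested terms against a
 * ground-truth list, using exact string matching on term names.
 *
 * @return array{precision: float, recall: float}
 */
function wpai_sketch_precision_recall( array $suggested, array $relevant ): array {
    $suggested = array_unique( $suggested );
    $relevant  = array_unique( $relevant );

    // True positives: suggested terms that are in the ground truth.
    $true_positives = count( array_intersect( $suggested, $relevant ) );

    return array(
        // Share of suggestions that were relevant.
        'precision' => count( $suggested ) > 0 ? $true_positives / count( $suggested ) : 0.0,
        // Share of relevant terms that were suggested.
        'recall'    => count( $relevant ) > 0 ? $true_positives / count( $relevant ) : 0.0,
    );
}
```

A generic tag such as "Tutorial" that is not in the ground truth lowers precision without changing recall, which is why removing that cluster shows up mainly as a precision gain.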
Use of AI Tools
AI assistance: Yes
Tool(s): Claude Code
Model(s): Claude Opus 4.7
Used for: Prompt audit, instruction redraft, test scaffolding, eval
harness design, and step-by-step implementation. All code, tests,
and the prompt rewrite were reviewed before commit; eval numbers
were measured against a real OpenAI provider (gpt-4.1-mini /
gpt-5-mini fallback chain).
Testing Instructions
- … the Content Classification experiment under
  Settings → AI Plugin.
- … (env / constant / settings).
- … Content Classification panel and request suggestions for tags and
  for categories.
- … places) rather than generic ("Tutorial", "Guide", "Beginner")
  unless the post is genuinely tutorial-style.
- … popular site-wide categories.