Fix: locale-aware (words/characters) minimum content length across AI content experiments #581

Open
hbhalodia wants to merge 24 commits into WordPress:develop from hbhalodia:fix/issue-578

@hbhalodia hbhalodia commented May 19, 2026

Contributor

What?

Closes #578, #391, #390

Why?

  • This PR uses wordCountType to count words or characters based on the user's locale, and standardizes the minimum content length check across all the experiments.

How?

Use of AI Tools

  • Yes, Claude Code, Opus 4.8, 4.7.
  • Used during the implementation phase to standardize the approach across the experiments.

Testing Instructions

  1. Change the site language to Japanese.
  2. Open a post.
  3. Copy the following text and paste it multiple times inside a paragraph block: あああああああああああ
  4. Execute the Shorten action.
  5. Confirm that no error notice is displayed and that the Shorten action works as expected.

Screenshots or screencast

Screen.Recording.2026-05-19.at.3.52.39.PM.mov

Changelog Entry

Fixed - Add wordCountType to check the user's locale, count characters or words accordingly, and standardize the minimum content length across the experiments.

@github-actions

github-actions Bot commented May 19, 2026

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

If you're merging code through a pull request on GitHub, copy and paste the following into the bottom of the merge commit message.

Co-authored-by: hbhalodia <hbhalodia@git.wordpress.org>
Co-authored-by: dkotter <dkotter@git.wordpress.org>
Co-authored-by: jeffpaul <jeffpaul@git.wordpress.org>
Co-authored-by: t-hamano <wildworks@git.wordpress.org>

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

Comment thread src/experiments/content-resizing/components/ContentResizingToolbar.tsx Outdated
@codecov

codecov Bot commented May 19, 2026

Codecov Report

❌ Patch coverage is 0% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.15%. Comparing base (246680c) to head (3e0143c).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| .../Content_Classification/Content_Classification.php | 0.00% | 4 Missing ⚠️ |
| .../Experiments/Content_Resizing/Content_Resizing.php | 0.00% | 2 Missing ⚠️ |
| ...cludes/Experiments/Summarization/Summarization.php | 0.00% | 1 Missing ⚠️ |
| includes/helpers.php | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@              Coverage Diff              @@
##             develop     #581      +/-   ##
=============================================
- Coverage      73.18%   73.15%   -0.03%     
  Complexity      1731     1731              
=============================================
  Files             85       85              
  Lines           7473     7476       +3     
=============================================
  Hits            5469     5469              
- Misses          2004     2007       +3     
Flag Coverage Δ
unit 73.15% <0.00%> (-0.03%) ⬇️

@dkotter dkotter linked an issue May 21, 2026 that may be closed by this pull request
6 tasks
@dkotter dkotter modified the milestones: 1.0.1, 1.1.0 May 26, 2026

@dkotter dkotter left a comment

Collaborator

Left a few comments but overall this is looking good.

The only other thing I'd flag is that this standardizes things for Content Classification, Content Resizing, and Summarization but none of the other Features. I still think it would be ideal to standardize everything, which likely means closing out #479 and #545 and handling those in this PR, as well as looking at all other Features and ensuring, where needed, that they follow the same approach.

Comment thread includes/Experiments/Summarization/Summarization.php Outdated
* @param int $min_content_length The minimum number of characters required. Default 100.
*/
- $min_content_length = (int) apply_filters( 'wpai_summarization_min_content_length', 100 );
+ $min_content_length = (int) apply_filters( 'wpai_summarization_min_content_length', get_min_content_length( 'summarization', 100 ) );

Collaborator

So I know 100 was the last value, but that was when we looked at characters only. Now we'll look at words if the locale supports it, so is 100 words the right length? It seems that may be too long.

Contributor Author

I am not sure what number would be best here. If we lower it, the threshold also drops for locales that count characters, where it may then be too low. So maybe we can change it to 50?

'enabled' => $this->is_enabled(),
'strategy' => $this->get_strategy(),
'maxSuggestions' => $this->get_max_suggestions(),
'minContentLength' => get_min_content_length( 'content-classification', 150 ),

Collaborator

Is 150 the right length here or should we lower that?

Contributor Author

I think a higher minimum is a good thing. For word-based locales, the more words there are, the better the suggestions the AI can provide. The same holds for character-based locales.

The same applies to the other comment, #581 (comment): I guess we can stick with 100, since for word-based locales more words should give a better summarization. Does that sound right?

Comment thread src/experiments/content-resizing/components/ContentResizingToolbar.tsx Outdated
Comment thread src/experiments/content-resizing/components/ContentResizingToolbar.tsx Outdated
Comment thread includes/Experiments/Summarization/Summarization.php
@hbhalodia

Contributor Author

Left a few comments but overall this is looking good.

The only other thing I'd flag is that this standardizes things for Content Classification, Content Resizing, and Summarization but none of the other Features. I still think it would be ideal to standardize everything, which likely means closing out #479 and #545 and handling those in this PR, as well as looking at all other Features and ensuring, where needed, that they follow the same approach.

Thanks @dkotter, I am checking those and will update in the PR as needed.

@hbhalodia

Contributor Author

Summary

Fixes Content Resizing's "Shorten" action not detecting the length of Japanese (and other character-based) text, and generalizes the fix into a unified, locale-aware minimum content length system shared by every AI content experiment.

The root problem: content-length checks counted words (whitespace-delimited), which is meaningless for CJK languages where there are no spaces. This PR introduces locale-aware counting (words vs. characters, driven by WordPress core's word-count strategy) and a single, filterable source of truth for the minimum length, then applies it consistently across all content experiments — disabling the relevant action and showing a clear, locale-correct message when there isn't enough content.

Problem / root cause

  • Content Resizing's "Shorten" used a word count to decide whether there was enough text to shorten. In Japanese/Chinese/Korean locales (no word separators) the count was effectively wrong, so the gate misbehaved.
  • Minimum-length logic was duplicated/hardcoded and inconsistent across experiments, with word-only assumptions baked in.
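
To illustrate the root cause, here is a minimal sketch comparing a whitespace-delimited word count with a character count. The function names naiveWordCount and charCountExcludingSpaces are hypothetical and are not taken from the PR; they only show why a word-only gate misjudges text written without spaces.

```typescript
// Hypothetical helper: whitespace-delimited word count, as a word-only gate might do.
function naiveWordCount( text: string ): number {
	const trimmed = text.trim();
	return trimmed === '' ? 0 : trimmed.split( /\s+/ ).length;
}

// Hypothetical helper: count Unicode code points, ignoring whitespace.
function charCountExcludingSpaces( text: string ): number {
	return Array.from( text.replace( /\s+/g, '' ) ).length;
}

const english = 'The quick brown fox jumps over the lazy dog';
// 110 characters with no spaces, similar to the text in the testing instructions.
const japanese = 'あ'.repeat( 110 );

console.log( naiveWordCount( english ) ); // 9
console.log( naiveWordCount( japanese ) ); // 1, no matter how long the text is
console.log( charCountExcludingSpaces( japanese ) ); // 110
```

With a word-based threshold, the Japanese sample never gets past a count of 1, which is why the gate has to switch to characters for such locales.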

Solution

  1. Locale-aware counting — new src/utils/word-count.ts utilities that respect WordPress core's word-count strategy (words vs. characters_excluding_spaces / characters_including_spaces).
  2. Centralized, filterable minimum — new get_min_content_length() PHP helper with a single wpai_min_content_length filter.
  3. Consistent UX — each experiment disables its Generate/Regenerate action when content is too short and shows a message phrased in words or characters to match the locale.

New shared infrastructure

  • includes/helpers.php adds get_min_content_length( string $feature_id, int $default ), which returns the per-feature minimum and is filterable via wpai_min_content_length (callback arguments: $length, $feature_id).
  • src/utils/word-count.ts — new utilities:
    • getWordCountType() — returns the locale's strategy via core's translatable _x( 'words', … ) ("Do not translate" string), so CJK locales report a character strategy.
    • getContentCount( content ) — counts using @wordpress/wordcount with that strategy.
    • hasMinimumContent( content, min ) — the shared gate used everywhere.
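
As a rough, self-contained sketch of how these utilities fit together (not the code in src/utils/word-count.ts): the version below takes the strategy as an explicit parameter instead of reading it from the locale, and uses simple string operations instead of @wordpress/wordcount, so it does not strip HTML or shortcodes. The extra type parameter and the counting details are assumptions made for illustration.

```typescript
type WordCountType =
	| 'words'
	| 'characters_excluding_spaces'
	| 'characters_including_spaces';

// Simplified stand-in for getContentCount(); the strategy is passed in explicitly
// here (an assumption) rather than resolved from the locale.
function getContentCount( content: string, type: WordCountType ): number {
	const text = content.trim();
	if ( text === '' ) {
		return 0;
	}
	if ( type === 'words' ) {
		return text.split( /\s+/ ).length;
	}
	if ( type === 'characters_excluding_spaces' ) {
		return Array.from( text.replace( /\s+/g, '' ) ).length;
	}
	// characters_including_spaces: count every code point, whitespace included.
	return Array.from( text ).length;
}

// Simplified stand-in for the shared gate: true when the count reaches the minimum.
function hasMinimumContent(
	content: string,
	min: number,
	type: WordCountType
): boolean {
	return getContentCount( content, type ) >= min;
}

const japanese = 'あ'.repeat( 100 );
console.log( hasMinimumContent( japanese, 100, 'words' ) ); // false
console.log( hasMinimumContent( japanese, 100, 'characters_excluding_spaces' ) ); // true
```

The comparison uses >= so that content with exactly the minimum count passes, matching the "at least N" wording of the messages.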

Per-experiment changes

All seven content experiments now localize a minContentLength (via get_min_content_length()) and gate their UI through hasMinimumContent() / getWordCountType():

| Experiment | Default min | Notes |
| --- | --- | --- |
| Content Classification | 100 | Hint now shows words or characters; suggest button gated. |
| Content Resizing | 100 | The original Japanese-text fix; "Shorten" now uses locale-aware count. |
| Editorial Notes | 100 | Gate retained; message standardized to words/characters. |
| Excerpt Generation | 100 | New gate; panel + inline buttons disabled with tooltip. |
| Meta Description | 100 | New gate; panel Generate/Regenerate + modal Generate gated. |
| Summarization | 100 | Now sourced from get_min_content_length(); see deprecation below. |
| Title Generation | 100 | New gate; standalone + toolbar buttons disabled with tooltip. |

PHP (each adds the use function … get_min_content_length; import + minContentLength in the localized data):

  • includes/Experiments/Content_Classification/Content_Classification.php
  • includes/Experiments/Content_Resizing/Content_Resizing.php
  • includes/Experiments/Editorial_Notes/Editorial_Notes.php
  • includes/Experiments/Excerpt_Generation/Excerpt_Generation.php
  • includes/Experiments/Meta_Description/Meta_Description.php
  • includes/Experiments/Summarization/Summarization.php
  • includes/Experiments/Title_Generation/Title_Generation.php

Frontend (types + components/hooks reading minContentLength, computing isContentTooShort, disabling actions, and rendering words/characters messaging):

  • Content Classification — types.ts, useContentClassification.ts, SuggestionPanel.tsx
  • Content Resizing — types.ts, ContentResizingToolbar.tsx
  • Editorial Notes — EditorialNotesPlugin.tsx, useEditorialNotes.ts
  • Excerpt Generation — types.ts, useExcerptGeneration.ts, ExcerptGeneration.tsx, ExcerptInlineButton.tsx
  • Meta Description — types.ts, useMetaDescription.ts, MetaDescriptionPanel.tsx, MetaDescriptionModal.tsx
  • Summarization — types.ts
  • Title Generation — types.ts, components/TitleToolbar.tsx

Deprecation / backward compatibility

  • Summarization.php — the legacy wpai_summarization_min_content_length filter is migrated to apply_filters_deprecated( …, 'wpai_min_content_length' ). The canonical wpai_min_content_length filter (applied inside get_min_content_length()) is now the supported extension point; the old filter still works for back-compat.

Behavior

When content is below the threshold the action button is disabled (with accessibleWhenDisabled) and surfaces:

  • "… will be available when the post content has at least N words." — word-based locales
  • "… at least N characters." — character-based locales (e.g. Japanese), via getWordCountType()

Because of accessibleWhenDisabled, disabled buttons report state via aria-disabled="true" (focusable, tooltip announced) rather than the native disabled attribute.
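
The behavior above could be sketched roughly as follows. This is an illustration under assumptions, not the PR's component code: getActionState and its ActionState shape are hypothetical, the action label is passed in because the message prefix is elided above, and the strings are plain templates rather than translatable ones (real code would presumably use the i18n helpers, including plural handling).

```typescript
type WordCountType =
	| 'words'
	| 'characters_excluding_spaces'
	| 'characters_including_spaces';

// Hypothetical shape for the props an action button might receive.
interface ActionState {
	disabled: boolean;
	accessibleWhenDisabled: boolean;
	label: string;
}

// Hypothetical helper: derive the button state from the content count,
// the minimum, and the locale's counting strategy.
function getActionState(
	actionLabel: string,
	contentCount: number,
	min: number,
	type: WordCountType
): ActionState {
	const isContentTooShort = contentCount < min;
	// Both character strategies share the "characters" wording.
	const unit = type === 'words' ? 'words' : 'characters';
	return {
		disabled: isContentTooShort,
		// Keeps the disabled button focusable so the explanation can be announced.
		accessibleWhenDisabled: true,
		label: isContentTooShort
			? `${ actionLabel } will be available when the post content has at least ${ min } ${ unit }.`
			: actionLabel,
	};
}

console.log( getActionState( 'Shorten', 10, 100, 'characters_excluding_spaces' ).label );
// Shorten will be available when the post content has at least 100 characters.
```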

Testing

Unit / Integration (PHPUnit) — 7 experiment test files

Each adds localization assertions:

  • test_enqueue_assets_localizes_default_min_content_length → default value present in script data.
  • test_enqueue_assets_localizes_filtered_min_content_length → value reflects a wpai_min_content_length filter.

Files under tests/Integration/Includes/Experiments/…: Content_Classification, Content_Resizing, Editorial_Notes, Excerpt_Generation, Meta_Description, Summarization, Title_Generation.

npm run wp-env:test start
npm run test:php tests/Integration/Includes/Experiments/Title_Generation/Title_GenerationTest.php
# …repeat per experiment, or run the whole suite:
npm run test:php

End-to-end (Playwright) — 7 experiment spec files

Coverage added across tests/e2e/specs/experiments/:

  • Disabled-button tests (short content → aria-disabled="true") for title, excerpt, meta-description (and equivalents for the others).
  • Threshold tests — e.g. Content Classification enables suggestions once content crosses the minimum; Content Resizing "Shorten" opens and renders the exact original/suggested text when content meets the minimum.
  • Existing functional flows updated to seed sufficient content (LONG_CONTENT) so they remain green under the new gate.
  • Removed the obsolete Excerpt "clean error notice on empty content" e2e test (that empty-content path is now prevented client-side).
npm run test:e2e -- specs/experiments/content-resizing.spec.js \
  specs/experiments/content-classification.spec.js \
  specs/experiments/title-generation.spec.js \
  specs/experiments/excerpt-generation.spec.js \
  specs/experiments/meta-description.spec.js

e2e assertions assume each feature's default minimum (no wpai_min_content_length filter seeded in the test env).

Configuration

// Adjust the minimum per feature (defaults shown in the table above).
add_filter( 'wpai_min_content_length', function ( $length, $feature_id ) {
    // $feature_id: 'content-resizing' | 'content-classification' | 'summarization'
    //            | 'editorial-notes' | 'excerpt-generation' | 'meta-description'
    //            | 'title-generation'
    return $length;
}, 10, 2 );

Notes / follow-ups

  • @since / @deprecated x.x.x placeholders should be set to the target release version before merge.
  • Optional: standardize Editorial Notes to the same getSettings() helper pattern used by the other experiments (currently reads the localized global inline; functionally equivalent).

🤖 Generated with Claude Code

@hbhalodia hbhalodia changed the title from 'Fix: Content Resizing: The "shorten" action does not detect the character length of Japanese text' to 'Fix: locale-aware (words/characters) minimum content length across AI content experiments' Jun 9, 2026
@hbhalodia

Contributor Author

Hi @dkotter, below are some visuals for the updates:

Title Generation

Screen.Recording.2026-06-09.at.4.57.45.PM.mov

Excerpt Generation

Screen.Recording.2026-06-09.at.4.58.19.PM.mov

Meta Description Generation

Screen.Recording.2026-06-09.at.4.58.53.PM.mov

Content Classification

Screen.Recording.2026-06-09.at.5.00.01.PM.mov

Thanks,

4 participants