Rule-based content checking filter for use with Funnelback.
Install the files into the collection's conf folder:
$SEARCH_HOME/conf/COLLECTION/@groovy/com/funnelback/CAFilters/CheckContent.groovy
$SEARCH_HOME/conf/COLLECTION/check-content.cfg
$SEARCH_HOME/conf/COLLECTION/check-content.validation-patterns.cfg
$SEARCH_HOME/conf/COLLECTION/check-content.word-list.plain-english.cfg
$SEARCH_HOME/conf/COLLECTION/check-content.word-list.weasel-words.cfg
Requires:
filter.jsoup.classes=com.funnelback.CAFilters.CheckContent
check-content.cfg containing the rules for content checking. The included file contains some sample rules.
collection.cfg entries (depending on features used):
filter.check-content.word-list.WORD_LIST_NAME=/path-to/file-containing-word-list.cfg

Test for the existence of an element
Parameters in configuration:
name: A name to identify the check
description: A short description of the check
selector: the jsoup selector to check the existence of
metaField: Adds the metadata field with a value of true if the check was found, and <metaField>-count with a count of how many times the match was found
extractField: either CONTENT, or ATTRIBUTE:<attribute name> where <attribute name> is the attribute to extract.
extractValue: extract the value of the extractField and write to <metaField>-value if set to true
extractMode: extract the contents as TEXT (will strip out all tags) or HTML. Acceptable values are TEXT or HTML.
document: Jsoup object representing the document
Check for the presence of link tags with a property of rel=canonical.
If found sets the metadata field:
X-FUNNELBACK-CANONICAL=true
X-FUNNELBACK-CANONICAL-COUNT=<N> where <N> is the number of times the element was detected within the page
X-FUNNELBACK-CANONICAL-VALUE=<V> where <V> is the extracted value if extractValue is set to true.
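The mapping from a rule to the metadata fields listed above can be sketched in a few lines. This is a Python illustration only, not the filter's Groovy code: the function name `existence_metadata`, the list-of-strings representation, and the choice to emit one `-VALUE` entry per matched element are all assumptions made for the example.

```python
def existence_metadata(meta_field, matched_values, extract_value=False):
    """Build the metadata an ELEMENT_EXISTENCE rule would set.

    matched_values holds one extracted value per element the selector
    matched (hypothetical input; the real filter obtains these via jsoup).
    """
    metadata = {}
    if not matched_values:
        # Nothing matched, so no fields are set.
        return metadata
    metadata[meta_field] = ["true"]
    metadata[meta_field + "-COUNT"] = [str(len(matched_values))]
    if extract_value:
        # Assumption: one -VALUE entry per matched element.
        metadata[meta_field + "-VALUE"] = list(matched_values)
    return metadata


result = existence_metadata(
    "X-FUNNELBACK-CANONICAL",
    ["https://example.com/page"],
    extract_value=True,
)
print(result)
```

With `extract_value=False` the `-VALUE` field is simply omitted, which mirrors the "if extractValue is set to true" condition above.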
Example JSON entry:
{
"name":"Canonical URL defined",
"check":"ELEMENT_EXISTENCE",
"metaField":"X-FUNNELBACK-CANONICAL",
"selector":"link[rel=canonical]",
"extractField":"CONTENT",
"extractValue":true,
"description":"Detects the presence of a canonical URL"
}

Test the length in characters or words of an element's content. Example: identify pages with any H1 elements longer than 200 characters.
Parameters in configuration:
name: A name to identify the check
description: A short description of the check
selector: the jsoup selector of the elements to check
metaField: Adds the metadata field with a value of true if the check was found, and <metaField>-count with a count of how many times the match was found
compareField: either CONTENT, or ATTRIBUTE:<attribute name> where <attribute name> is the attribute to compare against and extract.
comparator: compare function to use. Acceptable values are (where N is set using the length parameter):
  - LENGTH_EQ_CHARS: length of the selected value is equal to N characters
  - LENGTH_GT_CHARS: length of the selected value is greater than N characters
  - LENGTH_LT_CHARS: length of the selected value is less than N characters
  - LENGTH_EQ_WORDS: length of the selected value is equal to N words
  - LENGTH_GT_WORDS: length of the selected value is greater than N words
  - LENGTH_LT_WORDS: length of the selected value is less than N words
extractValue: extract the value of the compareField and write to <metaField>-value, and write the size (in words or chars based on the comparator) to <metaField>-size if set to true
length: length value used for the comparison
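The six length comparators differ only in the unit (characters or words) and the comparison operator. The following Python sketch shows that logic; it is an illustration rather than the filter's own code, the function name `length_check` is hypothetical, and counting words by splitting on whitespace is an assumption (the filter may tokenise differently).

```python
def length_check(value, comparator, length):
    """Return (matched, size) for a LENGTH_* comparator.

    Assumption: words are counted by splitting on whitespace.
    """
    # The sample configuration supplies length as a string, so convert it.
    length = int(length)
    # "LENGTH_GT_CHARS" -> op "GT", unit "CHARS"
    _, op, unit = comparator.split("_")
    size = len(value.split()) if unit == "WORDS" else len(value)
    matched = {
        "EQ": size == length,
        "GT": size > length,
        "LT": size < length,
    }[op]
    return matched, size


print(length_check("A short heading", "LENGTH_GT_CHARS", "200"))
print(length_check("A short heading", "LENGTH_LT_WORDS", "5"))
```

The returned size corresponds to what would be written to `<metaField>-size` when extractValue is true.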
Check for the presence of any H1 headings with more than 200 characters.
If found sets the metadata field:
X-FUNNELBACK-H1-GT-200=true
X-FUNNELBACK-H1-GT-200-COUNT=<N> where <N> is the number of times the element was detected within the page
X-FUNNELBACK-H1-GT-200-SIZE=<N> where <N> is the number of words or chars of the matched field if extractValue is set to true
X-FUNNELBACK-H1-GT-200-VALUE=<V> where <V> is the extracted value if extractValue is set to true.
Example JSON entry:
{
"name":"Heading 1 length",
"check":"ELEMENT_LENGTH",
"metaField":"X-FUNNELBACK-H1-GT-200",
"selector":"h1",
"description":"Identifies if the document contains any H1 values that exceed 200 characters in length.",
"comparator":"LENGTH_GT_CHARS",
"length":"200",
"extractValue":false
}

Tests an element's content
Parameters in configuration:
name: A name to identify the check
description: A short description of the check
selector: the jsoup selector of the elements to check
metaField: Adds the metadata field with a value of true if the check was found, and <metaField>-count with a count of how many times the match was found
comparator: compare function to use. Acceptable values are (where <comparison text> is set using the compareText parameter):
  - ENDS_WITH: compareField value ends with <comparison text>
  - STARTS_WITH: compareField value starts with <comparison text>
  - NOT_ENDS_WITH: compareField value does not end with <comparison text>
  - NOT_STARTS_WITH: compareField value does not start with <comparison text>
  - EQUALS: compareField value equals <comparison text>
  - NOT_EQUALS: compareField value does not equal <comparison text>
  - CONTAINS: compareField value contains <comparison text>
  - NOT_CONTAINS: compareField value does not contain <comparison text>
  - MATCHES: compareField value includes text that matches the regular expression defined in <comparison text>
  - NOT_MATCHES: compareField value does not include text that matches the regular expression defined in <comparison text>
  - FULLY_MATCHES: compareField value fully matches the regular expression defined in <comparison text>
  - NOT_FULLY_MATCHES: compareField value does not fully match the regular expression defined in <comparison text>
compareField: either CONTENT, or ATTRIBUTE:<attribute name> where <attribute name> is the attribute to compare against and extract.
compareText: value used for the comparison
extractValue: extract the value of the selector and write to <metaField>-value if set to true
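The difference between MATCHES (the pattern is found somewhere in the value) and FULLY_MATCHES (the pattern covers the whole value) is easy to miss. The sketch below models the comparators in Python as an illustration; it is not the filter's code, the names `COMPARATORS` and `compare` are hypothetical, and case-sensitive comparison is an assumption. The filter uses Java regular expressions, which behave the same as Python's for simple patterns like the ones shown.

```python
import re

# Positive comparators; each NOT_* variant is the negation of one of these.
COMPARATORS = {
    "ENDS_WITH": lambda value, text: value.endswith(text),
    "STARTS_WITH": lambda value, text: value.startswith(text),
    "EQUALS": lambda value, text: value == text,
    "CONTAINS": lambda value, text: text in value,
    "MATCHES": lambda value, text: re.search(text, value) is not None,
    "FULLY_MATCHES": lambda value, text: re.fullmatch(text, value) is not None,
}


def compare(value, comparator, compare_text):
    """Apply a comparator name such as CONTAINS or NOT_FULLY_MATCHES."""
    negate = comparator.startswith("NOT_")
    name = comparator[len("NOT_"):] if negate else comparator
    result = COMPARATORS[name](value, compare_text)
    return not result if negate else result


print(compare("click here to read more", "CONTAINS", "click here"))
print(compare("Annual report 2020", "MATCHES", r"\d{4}"))
print(compare("Annual report 2020", "FULLY_MATCHES", r"\d{4}"))
```

In this sketch the last two calls differ because `\d{4}` occurs inside the value but does not describe the value as a whole.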
Check for the presence of any links containing the phrase click here.
If found sets the metadata field:
X-FUNNELBACK-LINK-CLICK-HERE=true
X-FUNNELBACK-LINK-CLICK-HERE-COUNT=<N> where <N> is the number of times the element was detected within the page
X-FUNNELBACK-LINK-CLICK-HERE-VALUE=<V> where <V> is the extracted value if extractValue is set to true.
Example JSON entry:
{
"name":"Links containing click here",
"check":"ELEMENT_CONTENT",
"metaField":"X-FUNNELBACK-LINK-CLICK-HERE",
"selector":"a",
"description":"Identifies if the document contains any links containing the phrase click here.",
"comparator":"CONTAINS",
"compareText":"click here",
"extractValue":false
}

Check an element's content to see if it includes any words from a specified word list
Parameters in configuration:
name: A name to identify the check
description: A short description of the check
metaField: Adds the metadata field with a value of true if the check was found, and <metaField>-count with a count of how many times the match was found
selector: the jsoup selector to analyse.
wordList: The word list to check against (the word list must be loaded via the filter.check-content.word-list.X collection.cfg parameter, where X is the word list name)
Code is based on the undesirable text core filter code, extended to support multiple word lists.
Check for the presence of any weasel words within the page body.
If found sets the metadata field:
X-FUNNELBACK-WEASEL-WORDS=<weasel words found>. This field will be set for each word that is found.
X-FUNNELBACK-WEASEL-WORDS_COUNT=<N> where <N> is the number of times a word was detected within the page
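As a rough illustration of how the two fields above relate, the following Python sketch records each distinct word found and counts every occurrence. It is not the filter's Groovy code: the function name `word_list_matches`, the sample word list contents, case-insensitive whole-word matching, and support for single-word entries only are all assumptions made for the example.

```python
import re


def word_list_matches(text, words):
    """Return the listed words found in text, one entry per occurrence.

    Assumptions: matching is case-insensitive, on whole words, and the
    list contains single words only.
    """
    found = []
    for token in re.findall(r"[A-Za-z']+", text):
        if token.lower() in words:
            found.append(token.lower())
    return found


# Hypothetical word list contents, for illustration only.
weasel_words = {"arguably", "clearly", "probably"}
hits = word_list_matches("Clearly this is probably fine, clearly.", weasel_words)

metadata = {
    # One value per distinct word that was found.
    "X-FUNNELBACK-WEASEL-WORDS": sorted(set(hits)),
    # Total number of occurrences across the selected content.
    "X-FUNNELBACK-WEASEL-WORDS_COUNT": len(hits),
}
print(metadata)
```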
Requires a configuration entry:
filter.check-content.word-list.weasel-words=$SEARCH_HOME/conf/$COLLECTION_NAME/weasel-words.cfg
Example JSON entry:
{
"name":"Weasel words",
"check":"WORD_LIST_COMPARE",
"metaField":"X-FUNNELBACK-WEASEL-WORDS",
"selector":"body",
"description":"Identifies weasel words present in the document as defined in the weasel words list.",
"wordList":"weasel-words"
}

Validate an element's content using a library pattern
Parameters in configuration:
name: A name to identify the check
description: A short description of the check
selector: the jsoup selector to validate
metaField: Adds the metadata field with a value of true if the check was found, and <metaField>-count with a count of how many times the match was found
comparator: compare function to use. Acceptable values are (where <comparison text> is the regular expression selected by the matchPattern parameter):
  - MATCHES: compareField value includes text that matches the regular expression defined in <comparison text>
  - NOT_MATCHES: compareField value does not include text that matches the regular expression defined in <comparison text>
  - FULLY_MATCHES: compareField value fully matches the regular expression defined in <comparison text>
  - NOT_FULLY_MATCHES: compareField value does not fully match the regular expression defined in <comparison text>
matchPattern: Regex matcher as defined in the validation patterns. Metadata field is set to true if the selector matches the pattern. Acceptable values are:
  - NUMBER_AS_TEXT: Identifies numbers from 1-10 written out as text
  - PHONE_NUMBER_AU: Identifies Australian telephone numbers
  - URL: Identifies valid URLs
  - EMAIL_ADDRESS: Identifies valid email addresses
Check the og:url field for an invalid url.
If found sets the metadata field:
X-FUNNELBACK-VALIDOGURL=true
X-FUNNELBACK-VALIDOGURL-count=<N> where <N> is the number of times the check matched within the page
Example JSON entry:
Check for valid URLs in og:url metadata fields.
{
"name":"Validate OG URL",
"check":"ELEMENT_VALIDATE",
"metaField":"X-FUNNELBACK-VALIDOGURL",
"selector":"meta[property=og:url]",
"description":"Identifies OG URL fields that do not contain a valid URL",
"comparator":"NOT_FULLY_MATCHES",
"compareField":"ATTRIBUTE:content",
"matchPattern":"URL",
"extractValue":false
}

Note: the order of items within each JSON record is not important.
[
{
"name":"Canonical URL defined",
"check":"ELEMENT_EXISTENCE",
"metaField":"X-DTA-CANONICAL",
"selector":"link[rel=canonical]",
"description":"Detects the presence of a canonical URL"
},
{
"name":"Links containing click here",
"check":"ELEMENT_CONTENT",
"metaField":"X-FUNNELBACK-LINK-CLICK-HERE",
"selector":"a",
"description":"Identifies if the document contains any links containing the phrase click here.",
"comparator":"CONTAINS",
"compareText":"click here",
"extractValue":false
},
{
"name":"Weasel words",
"check":"WORD_LIST_COMPARE",
"metaField":"X-FUNNELBACK-WEASEL-WORDS",
"selector":"body",
"description":"Identifies weasel words present in the document as defined in the weasel words list.",
"wordList":"weasel-words"
},
{
"name":"Validate OG URL",
"check":"ELEMENT_VALIDATE",
"metaField":"X-FUNNELBACK-VALIDOGURL",
"selector":"meta[property=og:url]",
"description":"Identifies OG URL fields that do not contain a valid URL",
"comparator":"NOT_FULLY_MATCHES",
"compareField":"ATTRIBUTE:content",
"matchPattern":"URL",
"extractValue":false
}
]

check-content.validation-patterns.cfg contains the pre-defined regular expressions that can be used in conjunction with the ELEMENT_VALIDATE check.
The format of the file is a key identifying the pattern, and the regular expression to use for the match.
Regular expressions must follow Java escaping rules, so each backslash in a pattern must be written as a double backslash.
{
"NUMBER_AS_TEXT":"\\b(one|two|three|four|five|six|seven|eight|nine|ten)\\b",
"PHONE_NUMBER_AU":"(\\b1[38]00\\s+(?:\\D|\\d\\D|\\d\\d\\D))|(\\b1[38]00\\d)",
"URL":"(https?|ftp|file):\/\/[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]",
"EMAIL_ADDRESS":"[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,6}"
}
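
The effect of the backslash escaping can be checked outside Funnelback. The following Python sketch parses two of the patterns above as JSON and applies them. It is an illustration only: the filter itself uses Java regular expressions, and these two patterns happen to behave the same way in Python's `re` module. The sample strings are made up for the example.

```python
import json
import re

# Two entries copied from the patterns file; the raw string keeps the
# doubled backslashes exactly as they appear in the file.
patterns_json = r'''
{
  "NUMBER_AS_TEXT": "\\b(one|two|three|four|five|six|seven|eight|nine|ten)\\b",
  "URL": "(https?|ftp|file):\/\/[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]"
}
'''
patterns = json.loads(patterns_json)

# After JSON parsing each doubled backslash becomes a single one,
# so the regex engine sees \b (a word boundary).
print(patterns["NUMBER_AS_TEXT"])

# MATCHES-style check: the pattern occurs somewhere in the value.
has_number_word = re.search(patterns["NUMBER_AS_TEXT"], "We have three offices") is not None

# FULLY_MATCHES-style check: the whole value must be a URL.
valid_url = re.fullmatch(patterns["URL"], "https://example.com/page") is not None
invalid_url = re.fullmatch(patterns["URL"], "not a url") is not None

print(has_number_word, valid_url, invalid_url)
```

A rule using NOT_FULLY_MATCHES with the URL pattern, as in the og:url example, would flag the second sample string and leave the first alone.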