[new model] Identify spam comments #3994
Open
jpangas wants to merge 67 commits into mozilla:master from jpangas:spamcom
Changes from 11 commits
Commits
efef0bc Create spamcomment model (jpangas)
bff58c5 Add New Features (jpangas)
48871dc Merge remote-tracking branch 'upstream/master' into spamcom (jpangas)
61b0fe0 Include new features and change spamcom (jpangas)
e31fa75 Version 0.0.534 (suhaibmujahid)
5103030 Merge remote-tracking branch 'upstream/master' into spamcom (jpangas)
a69cc54 Merge remote-tracking branch 'upstream/master' into spamcom (jpangas)
d365ad3 Create comments extractor (jpangas)
9ce864a Remove comment features from Bug Features (jpangas)
77d534d Add New features (jpangas)
73f74a4 Refine Link feature (jpangas)
2d65489 Test with TomekLinks (jpangas)
501a89f Change df in text vectorizer (jpangas)
606f743 Use oversampling (jpangas)
41a73cb Use max_step (jpangas)
586576d Include and Refine features (jpangas)
ba7a1a1 Split Date Features (jpangas)
8f429d1 Rename features correctly (jpangas)
1ef2493 Remove Commenter Experience and Invalid Bugs (jpangas)
5a18517 Remove first comment (jpangas)
ea6c168 Include Links Dictionary
874b19f Fix Error and Lint (jpangas)
b3da2e5 Refactor the Links Dictionary (jpangas)
b49485d Use List instead (jpangas)
71fe950 Merge remote-tracking branch 'origin/master' into spamcom (jpangas)
4626064 Merge remote-tracking branch 'origin/spamcom' into spamcom (jpangas)
a7044b0 Use Dictionary for # of links (jpangas)
13772c7 Include older bugs (jpangas)
7cf0dcd Replace Weekday with Weekend (jpangas)
cc8e6f6 Include max_delta_step (jpangas)
c4e4f22 Revert "Include max_delta_step" (jpangas)
01cca1e Test using scale_pos_weight (jpangas)
cc42dee Use URL Extract (jpangas)
4b8cf49 Revert to Using Regex (jpangas)
5c5da8c Introduce new extraction func and features (jpangas)
dc16331 Include tests for extraction function (jpangas)
e5b0349 Change scale_pos_weight value (jpangas)
644795a Change regex for extraction (jpangas)
45097da Include tld_extract library (jpangas)
0a06ea3 Test without scale_pos_weight (jpangas)
e193764 Test with n_estimators changed (jpangas)
dda9b95 Test with GridSearch CV Values (jpangas)
5ba0c22 Remove scale_pos_weight from model.py (jpangas)
ca16b98 Set n_estimators to 1000 (jpangas)
18d18f0 Revert "Remove scale_pos_weight from model.py" (jpangas)
1d35968 Remove comments which have 'redacted- (jpangas)
0a21b61 Test with new parameters (jpangas)
00a9f9f Change df (jpangas)
f55d137 Test: Include tags as feature (jpangas)
dbcb311 Exclude comment tags (jpangas)
1b437da Exclude emails from commit authors (jpangas)
16e14c5 Test without scale pos weight (jpangas)
94ab283 Test with scale_pos_weight adjusted (jpangas)
5a58108 Adjust scale pos weight (jpangas)
3eab988 Test wihout WeekOfYear (jpangas)
bd16d56 Include comment classifier (jpangas)
0a11f3c Include script in setup (jpangas)
a3956b4 Fix script error (jpangas)
5c93d23 Fix setup error (jpangas)
15c8d5a Classify all comments (jpangas)
5f953ac Include spamcom in model names (jpangas)
df77a40 Merge remote-tracking branch 'upstream/master' into spamcom (jpangas)
4cd6c6d Merge branch 'mozilla:master' into spamcom (jpangas)
4237f8f Remove comment independent files (jpangas)
ba2ece2 Merge remote-tracking branch 'origin/spamcom' into spamcom (jpangas)
5490d01 Use(bug,comment) tuple (jpangas)
d95852d Include BugvsCreator Feature (jpangas)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,136 @@ | ||
| # -*- coding: utf-8 -*- | ||
| # This Source Code Form is subject to the terms of the Mozilla Public | ||
| # License, v. 2.0. If a copy of the MPL was not distributed with this file, | ||
| # You can obtain one at http://mozilla.org/MPL/2.0/. | ||
| | ||
| import re | ||
| import sys | ||
| from collections import defaultdict | ||
| from typing import Any | ||
| | ||
| import pandas as pd | ||
| from sklearn.base import BaseEstimator, TransformerMixin | ||
| | ||
| | ||
| class CommentFeature(object): | ||
|     pass | ||
| | ||
| | ||
| class CommentExtractor(BaseEstimator, TransformerMixin): | ||
|     def __init__( | ||
|         self, | ||
|         feature_extractors, | ||
|         cleanup_functions, | ||
|     ): | ||
|         assert len(set(type(fe) for fe in feature_extractors)) == len( | ||
|             feature_extractors | ||
|         ), "Duplicate Feature Extractors" | ||
|         self.feature_extractors = feature_extractors | ||
| | ||
|         assert len(set(type(cf) for cf in cleanup_functions)) == len( | ||
|             cleanup_functions | ||
|         ), "Duplicate Cleanup Functions" | ||
|         self.cleanup_functions = cleanup_functions | ||
| | ||
|     def fit(self, x, y=None): | ||
|         for feature in self.feature_extractors: | ||
|             if hasattr(feature, "fit"): | ||
|                 feature.fit(x()) | ||
| | ||
|         return self | ||
| | ||
|     def transform(self, comments): | ||
|         comments_iter = iter(comments()) | ||
| | ||
|         commenter_experience_map = defaultdict(int) | ||
| | ||
|         def apply_transform(comment): | ||
|             data = {} | ||
| | ||
|             for feature_extractor in self.feature_extractors: | ||
|                 res = feature_extractor( | ||
|                     comment, | ||
|                     commenter_experience=commenter_experience_map[comment["creator"]], | ||
|                 ) | ||
| | ||
|                 if hasattr(feature_extractor, "name"): | ||
|                     feature_extractor_name = feature_extractor.name | ||
|                 else: | ||
|                     feature_extractor_name = feature_extractor.__class__.__name__ | ||
| | ||
|                 if res is None: | ||
|                     continue | ||
| | ||
|                 if isinstance(res, (list, set)): | ||
|                     for item in res: | ||
|                         data[sys.intern(f"{item} in {feature_extractor_name}")] = True | ||
|                     continue | ||
| | ||
|                 data[feature_extractor_name] = res | ||
| | ||
|             commenter_experience_map[comment["creator"]] += 1 | ||
| | ||
|             comment_text = comment["text"] | ||
|             for cleanup_function in self.cleanup_functions: | ||
|                 comment_text = cleanup_function(comment_text) | ||
| | ||
|             return { | ||
|                 "data": data, | ||
|                 "comment_text": comment_text, | ||
|             } | ||
| | ||
|         return pd.DataFrame(apply_transform(comment) for comment in comments_iter) | ||
| | ||
| | ||
| class CommenterExperience(CommentFeature): | ||
|     name = "# of Comments made by Commenter in the past" | ||
| | ||
|     def __call__(self, comment, commenter_experience, **kwargs): | ||
|         return commenter_experience | ||
| | ||
| | ||
| class CommentHasLink(CommentFeature): | ||
|     name = "Comment Has a Link" | ||
| | ||
|     # We check for links that are not from Mozilla | ||
|     url_pattern = re.compile(r"http[s]?://(?!mozilla\.org|mozilla\.com)\S+") | ||
| | ||
|     def __call__(self, comment, **kwargs) -> Any: | ||
|         return bool(self.url_pattern.search(comment["text"])) | ||
| | ||
| | ||
| class LengthofComment(CommentFeature): | ||
|     name = "Length of Comment" | ||
| | ||
|     def __call__(self, comment, **kwargs): | ||
|         return len(comment["text"]) | ||
| | ||
| | ||
| class TimeCommentWasPosted(CommentFeature): | ||
|     name = "Time Comment Was Posted" | ||
| | ||
|     def __call__(self, comment, **kwargs): | ||
|         pass | ||
| | ||
| | ||
| class TimeDifferenceCommentAccountCreation(CommentFeature): | ||
|     name = "Time Difference Between Account Creation and when Comment was Made" | ||
| | ||
|     def __call__(self, comment, account_creation_time, **kwargs): | ||
|         pass | ||
| | ||
| | ||
| class CommentTags(CommentFeature): | ||
|     name = "Comment Tags" | ||
| | ||
|     def __init__(self, to_ignore=None): | ||
|         self.to_ignore = set() if to_ignore is None else to_ignore | ||
| | ||
|     def __call__(self, comment, **kwargs): | ||
|         tags = [] | ||
|         for tag in comment["tags"]: | ||
|             if tag in self.to_ignore: | ||
|                 continue | ||
| | ||
|             tags.append(tag) | ||
|         return tags | ||
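
To make the shape of the extractor's output concrete, the loop in `CommentExtractor.transform` can be approximated without pandas or the cleanup functions. This is only a sketch under stated assumptions: `extract_rows` is a hypothetical helper, the two feature classes are simplified copies of the ones in the diff (this `CommentTags` has no `to_ignore`), and the sample comments are invented.

```python
import sys
from collections import defaultdict


class CommenterExperience:
    name = "# of Comments made by Commenter in the past"

    def __call__(self, comment, commenter_experience, **kwargs):
        return commenter_experience


class CommentTags:
    # Simplified for illustration: no `to_ignore` filtering.
    name = "Comment Tags"

    def __call__(self, comment, **kwargs):
        return list(comment["tags"])


def extract_rows(comments, feature_extractors):
    """Hypothetical stand-in for the transform loop, returning plain dicts."""
    experience = defaultdict(int)
    rows = []
    for comment in comments:
        data = {}
        for extractor in feature_extractors:
            res = extractor(
                comment, commenter_experience=experience[comment["creator"]]
            )
            name = getattr(extractor, "name", extractor.__class__.__name__)
            if res is None:
                continue
            if isinstance(res, (list, set)):
                # Collection results become one boolean key per item.
                for item in res:
                    data[sys.intern(f"{item} in {name}")] = True
                continue
            data[name] = res
        # Counted once per comment, after all extractors have run.
        experience[comment["creator"]] += 1
        rows.append(data)
    return rows


# Invented sample data.
comments = [
    {"creator": "a@example.com", "text": "first", "tags": ["spam"]},
    {"creator": "a@example.com", "text": "second", "tags": []},
]
rows = extract_rows(comments, [CommenterExperience(), CommentTags()])
```

Because the experience counter is incremented after the extractors run, the first comment from a given creator sees a count of 0. The sketch also suggests why feeding `CommentTags` to a spam model would need care: the `spam` tag is also what the labels are derived from, so it would have to go in `to_ignore`.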
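
The negative lookahead in `CommentHasLink.url_pattern` is worth probing with a few inputs, since it only rejects hosts that begin with `mozilla.org` or `mozilla.com`. The snippet below copies the pattern from the diff; the helper name `has_external_link` and the sample strings are assumptions made for illustration.

```python
import re

# Pattern copied from CommentHasLink in the diff.
url_pattern = re.compile(r"http[s]?://(?!mozilla\.org|mozilla\.com)\S+")


def has_external_link(text):
    """Hypothetical wrapper mirroring CommentHasLink.__call__."""
    return bool(url_pattern.search(text))


# Invented sample strings.
results = {
    text: has_external_link(text)
    for text in [
        "see https://example.com/x",
        "see https://mozilla.org/firefox",
        "see https://bugzilla.mozilla.org/show_bug.cgi?id=1",
        "see https://mozilla.org.evil.example/",
        "no link here",
    ]
}
```

Under this pattern, Mozilla subdomains such as `bugzilla.mozilla.org` still count as links, while a host that merely starts with `mozilla.org` is treated as Mozilla. Later commits in this PR are titled "Change regex for extraction" and "Include tld_extract library", which may address this; those versions are not shown in this diff.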
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,127 @@ | ||
| # -*- coding: utf-8 -*- | ||
| # This Source Code Form is subject to the terms of the Mozilla Public | ||
| # License, v. 2.0. If a copy of the MPL was not distributed with this file, | ||
| # You can obtain one at http://mozilla.org/MPL/2.0/. | ||
| | ||
| import logging | ||
| | ||
| import xgboost | ||
| from imblearn.pipeline import Pipeline as ImblearnPipeline | ||
| from imblearn.under_sampling import RandomUnderSampler | ||
| from sklearn.compose import ColumnTransformer | ||
| from sklearn.feature_extraction import DictVectorizer | ||
| from sklearn.pipeline import Pipeline | ||
| | ||
| from bugbug import bugzilla, comment_features, feature_cleanup, utils | ||
| from bugbug.model import CommentModel | ||
| | ||
| logging.basicConfig(level=logging.INFO) | ||
| logger = logging.getLogger(__name__) | ||
| | ||
| | ||
| class SpamCommentModel(CommentModel): | ||
|     def __init__(self, lemmatization=True): | ||
|         CommentModel.__init__(self, lemmatization) | ||
| | ||
|         self.calculate_importance = False | ||
| | ||
|         feature_extractors = [ | ||
|             comment_features.CommenterExperience(), | ||
|             comment_features.CommentHasLink(), | ||
|             comment_features.LengthofComment(), | ||
|         ] | ||
| | ||
|         cleanup_functions = [ | ||
|             feature_cleanup.fileref(), | ||
|             feature_cleanup.url(), | ||
|             feature_cleanup.synonyms(), | ||
|         ] | ||
| | ||
|         self.extraction_pipeline = Pipeline( | ||
|             [ | ||
|                 ( | ||
|                     "comment_extractor", | ||
|                     comment_features.CommentExtractor( | ||
|                         feature_extractors, cleanup_functions | ||
|                     ), | ||
|                 ), | ||
|             ] | ||
|         ) | ||
| | ||
|         self.clf = ImblearnPipeline( | ||
|             [ | ||
|                 ( | ||
|                     "union", | ||
|                     ColumnTransformer( | ||
|                         [ | ||
|                             ("data", DictVectorizer(), "data"), | ||
|                             ( | ||
|                                 "comment_text", | ||
|                                 self.text_vectorizer(min_df=0.001), | ||
|                                 "comment_text", | ||
|                             ), | ||
|                         ] | ||
|                     ), | ||
|                 ), | ||
|                 ( | ||
|                     "sampler", | ||
|                     RandomUnderSampler( | ||
|                         random_state=0, sampling_strategy="not minority" | ||
|                     ), | ||
|                 ), | ||
|                 ( | ||
|                     "estimator", | ||
|                     xgboost.XGBClassifier(n_jobs=utils.get_physical_cpu_count()), | ||
|                 ), | ||
|             ] | ||
|         ) | ||
| | ||
|     def get_labels(self): | ||
|         classes = {} | ||
| | ||
|         for bug in bugzilla.get_bugs(include_invalid=True): | ||
|             for comment in bug["comments"]: | ||
|                 comment_id = comment["id"] | ||
| | ||
|                 # Skip comments filed by Mozillians and bots, since we are sure they are not spam. | ||
|                 if "@mozilla" in comment["creator"]: | ||
|                     continue | ||
| | ||
|                 if "spam" in comment["tags"]: | ||
|                     classes[comment_id] = 1 | ||
|                 else: | ||
|                     classes[comment_id] = 0 | ||
| | ||
|         logger.info( | ||
|             "%d comments are classified as non-spam", | ||
|             sum(label == 0 for label in classes.values()), | ||
|         ) | ||
|         logger.info( | ||
|             "%d comments are classified as spam", | ||
|             sum(label == 1 for label in classes.values()), | ||
|         ) | ||
| | ||
|         return classes, [0, 1] | ||
| | ||
|     def items_gen(self, classes): | ||
|         # Overwriting this method to add include_invalid=True to get_bugs to | ||
|         # include spam bugs which have a number of spam comments. | ||
|         return ( | ||
|             (comment, classes[comment["id"]]) | ||
|             for bug in bugzilla.get_bugs(include_invalid=True) | ||
|             for comment in bug["comments"] | ||
|             if comment["id"] in classes | ||
|         ) | ||
| | ||
|     def get_feature_names(self): | ||
|         return self.clf.named_steps["union"].get_feature_names_out() | ||
| | ||
|     def overwrite_classes(self, comments, classes, probabilities): | ||
|         for i, comment in enumerate(comments): | ||
|             if "@mozilla" in comment["creator"]: | ||
|                 if probabilities: | ||
|                     classes[i] = [1.0, 0.0] | ||
|                 else: | ||
|                     classes[i] = 0 | ||
| | ||
|         return classes | ||
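
The labeling rule in `get_labels` and the post-prediction override in `overwrite_classes` can be exercised on toy data without Bugzilla or a trained model. The following is a rough sketch, not the PR's code path: `label_comments` is a hypothetical free-function version of `get_labels`, `overwrite_classes` is lifted out of the class, and the bug and comment dictionaries are invented.

```python
def label_comments(bugs):
    """Hypothetical free-function version of SpamCommentModel.get_labels."""
    classes = {}
    for bug in bugs:
        for comment in bug["comments"]:
            # Comments from "@mozilla" addresses are skipped, not labeled.
            if "@mozilla" in comment["creator"]:
                continue
            classes[comment["id"]] = 1 if "spam" in comment["tags"] else 0
    return classes


def overwrite_classes(comments, classes, probabilities):
    """Same logic as the method in the diff, lifted out of the class."""
    for i, comment in enumerate(comments):
        if "@mozilla" in comment["creator"]:
            classes[i] = [1.0, 0.0] if probabilities else 0
    return classes


# Invented sample data.
bugs = [
    {
        "id": 1,
        "comments": [
            {"id": 10, "creator": "dev@mozilla.com", "tags": []},
            {"id": 11, "creator": "x@example.com", "tags": ["spam"]},
            {"id": 12, "creator": "y@example.com", "tags": []},
        ],
    }
]

labels = label_comments(bugs)

# Pretend the model predicted spam for the first two comments.
predicted = overwrite_classes(bugs[0]["comments"], [1, 1, 0], probabilities=False)
```

The `"@mozilla" in creator` test is a substring check, so it would also match an address such as `someone@mozilla-fans.example`; a stricter domain comparison may be preferable, depending on how creators are represented in the data.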