The dataset was taken from the default pycaret datasets (blood) and contains the features Recency, Frequency, Monetary, and Time, and the target variable Class.
Histograms of the distribution for each feature, boxplots for each feature, and Pearson and Kendall correlation matrices were generated.
- Data was input into the model as-is. The best F1 metric was 0.3678 for the Gradient Boosting Classifier.
- The Monetary feature was dropped due to its functional dependency on the Frequency feature. The best F1 metric was 0.3693 for the Gradient Boosting Classifier.
- Class imbalance in the target variable was addressed by upsampling. The best F1 metric was 0.9601 for the Extra Trees Classifier.
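
The upsampling step in the last experiment could look roughly like the sketch below. It is a minimal, stdlib-only illustration of random oversampling, not the code used in the experiment; the function name `upsample_minority` and the toy rows are hypothetical.

```python
import random
from collections import Counter, defaultdict


def upsample_minority(rows, labels, seed=0):
    """Randomly duplicate rows of the smaller classes until every class
    is as large as the biggest one."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    target = max(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Draw (with replacement) only the extra rows this class is missing.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy data (hypothetical values): six rows of class 0, two rows of class 1.
rows = [[2, 50], [0, 13], [1, 16], [4, 4], [1, 24], [4, 4], [2, 7], [1, 12]]
labels = [0, 0, 0, 0, 0, 0, 1, 1]
balanced_rows, balanced_labels = upsample_minority(rows, labels)
print(Counter(balanced_labels))
```

One caveat worth checking: when rows are duplicated before the train/test split, copies of the same row can end up on both sides of the split, which tends to inflate the score. The description above does not say at which point the upsampling was applied.
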
ROC curves, confusion matrices, class prediction errors, learning curves, and feature importance plots were generated. A screenshot from the local run of mlflow was provided for each experiment.
| # | Best model | F1-score |
|---|---|---|
| 1 | Gradient Boosting Classifier | 0.3678 |
| 2 | Gradient Boosting Classifier | 0.3693 |
| 3 | Random Forest Classifier | 0.8137 |
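
The functional dependency between Monetary and Frequency that motivated experiment 2 can be checked in a few lines. The sketch below tests whether one column is a constant multiple of another; `constant_ratio` is a hypothetical helper, and the values are made up for illustration rather than read from the dataset.

```python
def constant_ratio(xs, ys, tol=1e-9):
    """Return k if ys[i] == k * xs[i] for every pair, otherwise None."""
    ratios = [y / x for x, y in zip(xs, ys) if x != 0]
    if not ratios:
        return None
    k = ratios[0]
    same_ratio = all(abs(r - k) <= tol for r in ratios)
    # Where x is zero, y must be zero too for y = k * x to hold.
    zeros_match = all(abs(y) <= tol for x, y in zip(xs, ys) if x == 0)
    return k if same_ratio and zeros_match else None


# Hypothetical values for illustration only.
frequency = [50, 13, 16, 20, 24]
monetary = [12500, 3250, 4000, 5000, 6000]
print(constant_ratio(frequency, monetary))
```

If the helper returns a number, one of the two columns carries no extra information and can be dropped, which is consistent with the nearly unchanged F1 score between experiments 1 and 2.
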
The dataset was taken from the default pycaret datasets (glass) and contains the glass parameters RI, Na, Mg, Al, Si, K, Ca, Ba, and Fe, and the target variable Type.
Histograms of the distribution for each feature, boxplots for each feature, and Pearson and Kendall correlation matrices were generated.
- Data was input into the model as-is. The best F1 metric was 0.7470 for the Extra Trees Classifier.
- Outliers from the Al, Fe, and K features were removed. The best F1 metric was 0.7615 for the Extra Trees Classifier.
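
The description does not say how the outliers were identified. One common choice, assumed here, is the 1.5 × IQR fence that boxplots conventionally draw. The sketch below applies that rule to selected columns; the helper names and the sample rows are hypothetical.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Lower and upper fence: quartiles widened by k times the IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(rows, columns, k=1.5):
    """Keep only rows whose value in every listed column lies inside the fence.
    Fences are computed on the full data, then applied column by column."""
    kept = rows
    for col in columns:
        low, high = iqr_bounds([row[col] for row in rows], k)
        kept = [row for row in kept if low <= row[col] <= high]
    return kept


# Hypothetical sample rows: one extreme Al value and one extreme K value.
sample = [
    {"Al": 1.10, "K": 0.50},
    {"Al": 1.30, "K": 0.60},
    {"Al": 1.20, "K": 0.55},
    {"Al": 1.40, "K": 0.60},
    {"Al": 1.50, "K": 0.58},
    {"Al": 3.50, "K": 0.57},
    {"Al": 1.25, "K": 6.20},
]
cleaned = drop_outliers(sample, ["Al", "K"])
print(len(sample), "->", len(cleaned))
```
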
ROC curves, confusion matrices, class prediction errors, learning curves, and feature importance plots were generated. A screenshot from the local run of mlflow was provided for each experiment.
| # | Best model | F1-score |
|---|---|---|
| 1 | Extra Trees Classifier | 0.7470 |
| 2 | Extra Trees Classifier | 0.7615 |
The dataset was taken from the default pycaret datasets (pokemon) and contains the features #, Name, Type 1, Type 2, Total, HP, Attack, Defense, Sp. Atk, Sp. Def, Speed, Generation, and Legendary.
Features # and Name were dropped because they carry no useful information for clustering. Boxplots were generated for each feature. The categorical features 'Type 1', 'Type 2', and 'Legendary' were converted to integer codes using LabelEncoder. An elbow method diagram was constructed, suggesting an optimal cluster count of 6.
- Raw data was passed to pycaret for various clustering algorithms: kmeans, ap, meanshift, sc, hclust, dbscan, optics, birch. The best models were kmeans and birch with Calinski-Harabasz scores of 172.6402 and 170.6474, respectively.
- Pycaret suggested 4 clusters, whereas my elbow diagram had pointed to 6. I ran pycaret with 6 clusters for the two best models and obtained scores of 141.6419 and 125.6700, both lower than with 4 clusters, indicating that 4 clusters are indeed the better choice. To reinforce this conclusion, I had pycaret plot its own elbow diagram, which also showed that 4 clusters are optimal.
- Dimensionality reduction with umap was tried, first with an arbitrary number of components and then with 2 components, where 4 clusters are clearly visible. Kmeans and birch trained on the umap-compressed data obtained scores of 926.8205 and 851.5435, respectively.
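
For reference, the Calinski-Harabasz score used to compare the runs above is the ratio of between-cluster dispersion to within-cluster dispersion, each divided by its degrees of freedom. The function below is a minimal stdlib-only sketch of that definition on toy points; it is not pycaret's implementation and skips edge cases such as a single cluster or zero within-cluster dispersion.

```python
def calinski_harabasz(points, labels):
    """(between-cluster dispersion / (k - 1)) / (within-cluster dispersion / (n - k))."""
    n, dim = len(points), len(points[0])
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)

    def centroid(ps):
        return [sum(p[d] for p in ps) / len(ps) for d in range(dim)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = centroid(points)
    # Weighted squared distance of each cluster centre from the global centre.
    between = sum(len(ps) * sq_dist(centroid(ps), overall) for ps in clusters.values())
    # Squared distance of each point from its own cluster centre.
    within = sum(sq_dist(p, centroid(ps)) for ps in clusters.values() for p in ps)
    return (between / (k - 1)) / (within / (n - k))


# Toy points: two tight, well-separated groups.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = calinski_harabasz(toy_points, [0, 0, 0, 1, 1, 1])
bad = calinski_harabasz(toy_points, [0, 1, 0, 1, 0, 1])
print(good, bad)
```

Because both dispersions are measured in the feature space the model was trained on, scores obtained on umap-compressed data are probably not directly comparable with scores obtained on the raw features.
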
Silhouette plots and cluster plots were generated. A screenshot from the local run of mlflow was provided for each experiment.
| # | Calinski-Harabasz score (kmeans) | Number of clusters |
|---|---|---|
| 1 | 172.6402 | 4 |
| 2 | 141.6419 | 6 |
| 3 | 926.8205 | 4 |
The dataset was taken from the default pycaret datasets (diamond) and contains the features Carat Weight, Cut, Color, Clarity, Polish, Symmetry, and Report, and the target variable Price.
All features except Carat Weight are categorical. LabelEncoder was applied to convert them to integer codes that the machine learning models can consume. Pearson and Kendall correlation matrices were generated. Boxplots were displayed for real-valued features.
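
To make the encoding step concrete, the sketch below mimics what scikit-learn's LabelEncoder does: distinct values are sorted and mapped to 0, 1, 2, and so on. It is a stdlib-only stand-in, `fit_label_encoder` is a hypothetical name, and the Cut values are assumed examples.

```python
def fit_label_encoder(values):
    """Map each distinct value to an integer code, assigned in sorted order."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


cut = ["Ideal", "Good", "Very Good", "Fair", "Ideal"]  # assumed example values
mapping = fit_label_encoder(cut)
encoded = [mapping[value] for value in cut]
print(mapping)
print(encoded)
```

Note that the codes follow alphabetical order, so in this example Ideal receives a smaller code than Very Good. For ordered grades, an explicit mapping in quality order may be worth considering.
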
Models were given the data as-is. The best R2 score, 0.9813, was achieved with the Light Gradient Boosting Machine model. Since this score is already high, no further experiments were run.
Prediction error, residuals, and feature importance plots were generated. A screenshot from the local run of mlflow for the experiment was provided.
| # | Best model | R2-score |
|---|---|---|
| 1 | Light Gradient Boosting Machine | 0.9813 |
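
As a reminder of what the R2 score in the table measures, the sketch below computes it from its definition: one minus the ratio of the residual sum of squares to the total sum of squares. The prices are invented for illustration and are not model output.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Invented prices, each prediction off by 100.
actual = [1000, 2000, 3000, 4000]
predicted = [1100, 1900, 3100, 3900]
score = r2_score(actual, predicted)
print(round(score, 4))
```

A score of 1 means the predictions match the targets exactly, while a score of 0 means the model does no better than always predicting the mean price.
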
The dataset was taken from the default pycaret datasets (wikipedia) and contains the columns Title, Category, and Text.
Rows with a missing target variable were dropped, since filling in the labels by hand and then predicting them would not be a sound approach. A Word Cloud was generated.
- General preprocessing for every experiment: the text was converted to lowercase, stop words were removed, and lemmatization was applied.
- A Bag of Words was created, and the tf-idf coefficient was calculated. To avoid having 61,000 features, only the 1000 most significant words according to tf-idf were kept. Their tf-idf values were multiplied by the corresponding values from the Bag of Words. The best F1 score was 0.1323 with the Decision Tree Classifier.
- Reducing the number of words from 1000 to 250 lowered the best F1 score to 0.0916, obtained by both the Extra Trees Classifier and Quadratic Discriminant Analysis.
- Bag of Words and TF-IDF were then abandoned in favor of LDA to extract topics from the texts. With 20 topics, the best F1 score was 0.3650 with the Random Forest Classifier.
- Increasing the number of topics to 50 raised the best F1 score to 0.4996 with the Random Forest Classifier.
- The class distribution was balanced by upsampling or downsampling each class to 44 samples. The best F1 score was 0.8741 with the Random Forest Classifier.
- Tried reducing the number of topics to 30 and increasing the number of samples per class to 52.
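
The Bag of Words and tf-idf step from the first experiments in this list could be sketched as follows. This is a stdlib-only illustration under several assumptions: terms are ranked by their tf-idf summed over all documents, idf is the plain log(N / df) variant, the stop-word list is a tiny placeholder, and lemmatization is left out because it needs an external library. All function names are hypothetical.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}  # placeholder list


def tokenize(text):
    """Lowercase, split on non-letters, drop stop words (no lemmatization)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def top_k_terms(docs, k):
    """Rank terms by tf-idf summed over all documents and keep the best k."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for c in counts for term in c)
    scores = Counter()
    for c in counts:
        for term, tf in c.items():
            scores[term] += tf * math.log(n_docs / doc_freq[term])
    return [term for term, _ in scores.most_common(k)]


def bag_of_words(docs, vocabulary):
    """Count matrix restricted to the selected vocabulary."""
    rows = []
    for doc in docs:
        c = Counter(tokenize(doc))
        rows.append([c[term] for term in vocabulary])
    return rows


docs = ["The cat sat in the hat", "The cat and the dog", "A dog is a dog"]
vocabulary = top_k_terms(docs, 1)
print(vocabulary)
print(bag_of_words(docs, ["dog", "cat"]))
```

With a real corpus, `k` would be 1000 or 250 as in the experiments above, and the resulting count matrix would be passed on to the classifiers.
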
ROC curves, confusion matrices, class prediction errors, learning curves, and feature importance plots were generated. A screenshot from the local run of mlflow for each experiment was provided.
| # | Best model | F1-score |
|---|---|---|
| 1 | Decision Tree Classifier | 0.1323 |
| 2 | Random Forest Classifier | 0.0916 |
| 3 | Random Forest Classifier | 0.3650 |
| 4 | Random Forest Classifier | 0.4996 |
| 5 | Random Forest Classifier | 0.6527 |
| 6 | Random Forest Classifier | 0.7169 |