
Introduction

This document presents the model training and evaluation process carried out using Typerclass. The goal is to provide a clear overview of the methodological choices and results of the modeling process.

Key decisions in the modeling process

The model was trained following a few important choices regarding the data and the algorithm.

Dataset composition

Data from three surveys available from the Italian National Institute of Statistics (ISTAT) were combined.

The final dataset included 400 instances of class “Nominal”, 200 of class “Ordinal”, and 125 of class “Scale” (mapped in the code as N, O, and S, respectively).

Indicators

The following indicators are used in the model, along with a brief description of each:

  • n_unique_values – Number of unique values in the variable (excluding missing values).
  • std_dev – Standard deviation; measures how spread out the values are.
  • max_relative_frequency – Proportion of the most frequent value relative to the total number of observations.
  • norm_entropy – Normalized entropy; measures how evenly the values are
    distributed.
  • min_value – Minimum observed value.
  • max_value – Maximum observed value.
  • range_value – Difference between the maximum and minimum values.
  • shannon_entropy – Shannon entropy; a measure of uncertainty or information
    content.
  • simpson_index – Simpson diversity index; indicates how concentrated the values are.
  • skewness_probs – Skewness of the value-probability distribution; measures
    asymmetry.
  • kurtosis_probs – Excess kurtosis of the value distribution; indicates tail
    heaviness.
  • dispersion_index – Variance-to-mean ratio of value probabilities; measures
    dispersion.
  • uniformity – Shannon entropy normalized by log(n_unique_values); measures distributional evenness.
  • top2_ratio – Sum of the probabilities of the two most frequent values.
  • top3_ratio – Sum of the probabilities of the three most frequent values.
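
As an illustration of how several of these probability-based indicators relate to one another, the following minimal Python sketch computes a few of them from a variable's observed values. This is illustrative only, not the Typerclass implementation; the handling of missing values and the Gini–Simpson form of the diversity index are assumptions.

```python
import math
from collections import Counter

def indicators(values):
    """Compute a few of the listed indicators from a list of observed values.

    Missing values are assumed to be represented as None and are excluded,
    matching the definition of n_unique_values above.
    """
    vals = [v for v in values if v is not None]
    n = len(vals)
    counts = Counter(vals)
    probs = [c / n for c in counts.values()]
    k = len(counts)  # n_unique_values

    shannon = -sum(p * math.log(p) for p in probs)           # shannon_entropy
    norm_ent = shannon / math.log(k) if k > 1 else 0.0       # norm_entropy / uniformity
    simpson = 1 - sum(p * p for p in probs)                  # Gini-Simpson diversity form (assumed)
    top = sorted(probs, reverse=True)
    return {
        "n_unique_values": k,
        "max_relative_frequency": top[0],
        "shannon_entropy": shannon,
        "norm_entropy": norm_ent,
        "simpson_index": simpson,
        "top2_ratio": sum(top[:2]),
        "top3_ratio": sum(top[:3]),
    }
```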

The table provides an overview of all indicators used in the model. Most indicators have no missing values, ensuring reliable inputs for training. The only exception is dispersion_index, which contains 128 missing values. These were left unimputed: the ranger engine handles missing values natively, and preliminary tests showed that imputation did not improve model performance.

Overview of Indicators and Missing Values
Indicator NAs
dispersion_index 128
n_unique_values 0
std_dev 0
max_relative_frequency 0
norm_entropy 0
min_value 0
max_value 0
range_value 0
shannon_entropy 0
simpson_index 0
skewness_probs 0
kurtosis_probs 0
uniformity 0
top2_ratio 0
top3_ratio 0

Indicator Distributions

This section shows the distributions of all indicators, highlighting variability and potential outliers. Several indicators display heavy tails and numerous outliers, while others (e.g., proportion-based measures) are more tightly concentrated; overall, the plots suggest substantial heterogeneity across predictors.

To improve readability, a log10(x + 1) transformation was applied to the subset of indicators with very different scales or heavy tails. The remaining panels are shown on the original scale so that comparably scaled indicators can be interpreted directly.
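
The transformation in question is simply log10(x + 1): the +1 offset keeps zero values defined (and mapped to zero) while compressing heavy right tails. A minimal sketch:

```python
import math

def log10p1(x):
    """log10(x + 1): maps 0 to 0 and compresses large values for plotting."""
    return math.log10(x + 1)
```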

Correlation matrix

The correlation matrix displays pairwise relationships between all model indicators. Several predictors are highly correlated, but all were retained because Random Forest, one of the algorithms selected for this study, is robust to multicollinearity; correlated indicators can still contribute complementary predictive signal.
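
Each cell of the matrix is a pairwise correlation coefficient (presumably Pearson; the report does not state the method). As a reminder of the underlying computation, a minimal Python sketch:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```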

Correlation Matrix of Indicators
indicator n_unique_values std_dev max_relative_frequency norm_entropy min_value max_value range_value shannon_entropy simpson_index skewness_probs kurtosis_probs dispersion_index uniformity top2_ratio top3_ratio
n_unique_values 1.0 0.3 −0.2 0.1 0.3 0.3 0.3 0.7 0.2 0.3 0.3 −0.1 0.1 −0.3 −0.4
std_dev 0.3 1.0 −0.1 0.1 1.0 1.0 1.0 0.3 0.1 0.8 1.0 0.0 0.1 −0.1 −0.2
max_relative_frequency −0.2 −0.1 1.0 −0.9 −0.1 −0.1 −0.1 −0.8 −1.0 −0.3 −0.1 0.9 −0.9 0.9 0.8
norm_entropy 0.1 0.1 −0.9 1.0 0.1 0.1 0.1 0.6 0.9 0.1 0.0 −1.0 1.0 −0.6 −0.5
min_value 0.3 1.0 −0.1 0.1 1.0 1.0 1.0 0.3 0.1 0.8 1.0 −0.1 0.1 −0.1 −0.2
max_value 0.3 1.0 −0.1 0.1 1.0 1.0 1.0 0.3 0.1 0.8 1.0 0.0 0.1 −0.1 −0.2
range_value 0.3 1.0 −0.1 0.1 1.0 1.0 1.0 0.3 0.1 0.8 1.0 0.0 0.1 −0.1 −0.2
shannon_entropy 0.7 0.3 −0.8 0.6 0.3 0.3 0.3 1.0 0.8 0.5 0.3 −0.5 0.6 −0.9 −0.9
simpson_index 0.2 0.1 −1.0 0.9 0.1 0.1 0.1 0.8 1.0 0.3 0.1 −0.9 0.9 −0.8 −0.7
skewness_probs 0.3 0.8 −0.3 0.1 0.8 0.8 0.8 0.5 0.3 1.0 0.8 −0.1 0.1 −0.4 −0.4
kurtosis_probs 0.3 1.0 −0.1 0.0 1.0 1.0 1.0 0.3 0.1 0.8 1.0 −0.1 0.0 −0.2 −0.2
dispersion_index −0.1 0.0 0.9 −1.0 −0.1 0.0 0.0 −0.5 −0.9 −0.1 −0.1 1.0 −1.0 0.6 0.5
uniformity 0.1 0.1 −0.9 1.0 0.1 0.1 0.1 0.6 0.9 0.1 0.0 −1.0 1.0 −0.6 −0.5
top2_ratio −0.3 −0.1 0.9 −0.6 −0.1 −0.1 −0.1 −0.9 −0.8 −0.4 −0.2 0.6 −0.6 1.0 1.0
top3_ratio −0.4 −0.2 0.8 −0.5 −0.2 −0.2 −0.2 −0.9 −0.7 −0.4 −0.2 0.5 −0.5 1.0 1.0

Model Selection

Preprocessing Recipe

We tested a preprocessing approach using median imputation for all numeric predictors. However, preliminary tests showed no improvement in model performance, so we decided to proceed with a recipe without any imputation.
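
For reference, median imputation replaces each predictor's missing values with the median of that predictor's observed values. A minimal Python sketch of the tested (and ultimately dropped) step; `impute_median` is an illustrative helper, not the actual recipe step name:

```python
def median(xs):
    """Median of a non-empty list of numbers."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def impute_median(column):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in column if v is not None]
    med = median(observed)
    return [med if v is None else v for v in column]
```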

Random Forest and XGBoost were both selected and evaluated with hyperparameter tuning.

Model Specification: Random Forest

Hyperparameter Tuning for Random Forest

We selected the mtry range using the classic rule-of-thumb centered on sqrt(p), where p is the number of predictors, and expanded it by ±50% to allow slightly simpler or more complex splits.
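
Under this rule of thumb, the mtry range can be derived directly from the number of predictors. A minimal sketch; rounding the expanded bounds to the nearest integer is an assumption, not necessarily how the tuning grid was built:

```python
import math

def mtry_range(p):
    """Rule-of-thumb mtry range: sqrt(p) expanded by +/-50%, rounded to integers."""
    center = math.sqrt(p)
    lo = max(1, round(center * 0.5))
    hi = round(center * 1.5)
    return lo, hi
```

With the 15 indicators listed above, this yields a range of roughly 2 to 6, consistent with the selected value of mtry = 2.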

Best Random Forest Hyperparameters
mtry trees min_n .config
2 410 4 pre0_mod06_post0

Model Specification: XGBoost

XGBoost was specified using a gradient-boosted tree model with tunable hyperparameters. We used the xgboost engine and set the model to classification mode to predict the three classes.

Hyperparameter Tuning for XGBoost

We tuned XGBoost using the same cross-validation folds as Random Forest. The mtry range follows the same rule-of-thumb (centered on sqrt(p)), while the other hyperparameters use typical ranges for boosted trees.

Best XGBoost Hyperparameters
mtry trees min_n tree_depth learn_rate loss_reduction sample_size .config
3 889 2 6 1.674 3.360 0.921 pre0_mod12_post0

Evaluation

Confusion matrix

We compare Random Forest and XGBoost on the same held-out test set. The confusion matrices are normalized by row (true class), so values represent per-class recall.
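
Row normalization converts raw confusion counts into per-class recall: each cell is divided by its row total, so each row of the normalized matrix expresses where instances of one true class end up. A minimal sketch:

```python
def row_normalize(cm):
    """Convert a confusion matrix of counts into per-class recall percentages.

    cm[i][j] = number of class-i instances predicted as class j.
    """
    return [[100 * c / sum(row) for c in row] for row in cm]
```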

Summary of Confusion Matrix (Per-Class Recall, %)
Model True N → Pred N True N → Pred O True N → Pred S True O → Pred N True O → Pred O True O → Pred S True S → Pred N True S → Pred O True S → Pred S
RF_tuned 91.1 15.6 4.3 5.1 73.3 13.0 3.8 11.1 82.6
XGB_tuned 84.3 10.3 5.3 7.9 76.9 15.8 7.9 12.8 78.9

Performance metrics

The table below compares Random Forest and XGBoost on the same test set using a consistent set of metrics (accuracy, balanced accuracy, macro F1, Cohen’s Kappa, and macro ROC AUC). Hyperparameters were selected using accuracy for simplicity; we report additional metrics to assess class‑balanced performance.

Model Comparison Metrics
model accuracy bal_accuracy f_meas kap roc_auc
RF_tuned 0.84 0.86 0.81 0.74 0.93
XGB_tuned 0.82 0.82 0.77 0.68 0.90
Winner RF_tuned RF_tuned RF_tuned RF_tuned RF_tuned

Accuracy is the overall proportion of correct predictions, while balanced accuracy averages recall across classes to reduce class-imbalance effects. Macro F1 gives equal weight to each class, Cohen’s Kappa adjusts for chance agreement, and macro ROC AUC summarizes discriminative ability across classes. Based on the average of these metrics, the best overall model is RF_tuned.
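
Both accuracy and balanced accuracy can be read directly off a confusion-count matrix, which makes the difference between them concrete: under class imbalance the two can diverge substantially. A minimal sketch:

```python
def accuracy(cm):
    """Overall accuracy: diagonal counts over total counts."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total

def balanced_accuracy(cm):
    """Balanced accuracy: unweighted mean of per-class recall."""
    recalls = [cm[i][i] / sum(row) for i, row in enumerate(cm)]
    return sum(recalls) / len(recalls)
```

For an imbalanced two-class example with counts [[90, 10], [5, 5]], accuracy is 95/110 ≈ 0.86 while balanced accuracy is only (0.9 + 0.5)/2 = 0.7, because the minority class's weaker recall counts equally.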

Misclassification Analysis

Misclassifications are inspected on the held-out test set to avoid optimistic bias. The tables below summarize the misclassification patterns for each model.

Misclassification Summary (Test Set) — RF
True class Predicted class Count
N O 7
N S 1
O N 4
O S 3
S N 3
S O 5
Misclassification Summary (Test Set) — XGBoost
True class Predicted class Count
N O 4
N S 1
O N 7
O S 3
S N 7
S O 5

For RF, the most frequent error is N → O (7 of 23 total misclassifications). For XGBoost, the top confusion is O → N (7 of 27). These summaries highlight whether errors cluster between particular pairs of classes or are more diffuse; fewer and more concentrated errors generally indicate a more reliable model.
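
The misclassification summaries above amount to counting (true, predicted) pairs among the errors only. A minimal sketch:

```python
from collections import Counter

def misclassification_summary(true_labels, pred_labels):
    """Count (true, predicted) pairs for misclassified instances only."""
    return Counter(
        (t, p) for t, p in zip(true_labels, pred_labels) if t != p
    )
```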

Consistent with the overall performance metrics, Random Forest remains the best-performing model and will be used for the final classifier.

Variable importance scores

This plot reports permutation importance for the Random Forest model. For each indicator, its values are randomly permuted and the resulting drop in model performance is measured; larger drops indicate more important predictors. Because several indicators are correlated (as shown in the correlation section), importance can be shared across related features, so the plot should be read as a relative ranking rather than an absolute measure of effect.
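
The permutation procedure described here can be sketched generically: shuffle one predictor's column, re-score the model, and average the resulting accuracy drop over several repeats. This mirrors the idea, not ranger's internal implementation; `predict` is any function mapping rows to labels:

```python
import random

def permutation_importance(predict, X, y, col, n_repeats=10, seed=42):
    """Permutation importance of one column: mean accuracy drop when that
    column's values are shuffled across rows, breaking its link to the target."""
    rng = random.Random(seed)
    base = sum(p == t for p, t in zip(predict(X), y)) / len(y)
    drops = []
    for _ in range(n_repeats):
        shuffled = [row[:] for row in X]          # fresh copy each repeat
        perm = [row[col] for row in shuffled]
        rng.shuffle(perm)
        for row, v in zip(shuffled, perm):
            row[col] = v
        score = sum(p == t for p, t in zip(predict(shuffled), y)) / len(y)
        drops.append(base - score)
    return sum(drops) / n_repeats
```

A column the model never uses yields an importance of exactly zero, since shuffling it cannot change any prediction.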

Distribution of Predictor Variables by True Class (N, O, S)

The final figure shows the distribution of predictor variables by their true class (N, O, S) on the test set, highlighting how indicator values vary across classes.