Deep Dive: ML Guide

Everything you need to understand what the Model Lab is doing — the metrics, the graphs, the strategies, and the maths. No ML background assumed.

1. What is the system actually doing?

The scanner visits web pages and looks for User-Generated Content (UGC) — comment sections, review widgets, forum threads. It extracts up to hundreds of HTML elements from each page. The model's job is to score those elements and rank the most likely UGC candidate to the top.

The pipeline in plain English

1. A URL is visited by a headless Chromium browser.

2. The page's DOM is parsed. Every button, input, form, text block and section is a potential candidate.

3. For each candidate, 73 features are extracted — things like: "does this element contain a timestamp?", "does it sit inside a <section>?", "does the text look like a review?".

4. The trained model scores each candidate on a scale of 0–1 (probability it is UGC). The highest-scoring candidate is the best guess for that page.

5. If the score exceeds a threshold, the page is flagged as UGC detected.
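Steps 4 and 5 can be sketched in a few lines of Python. Everything here is illustrative — the weights, feature values, and the 0.5 threshold are made-up stand-ins, and the real model uses 73 features rather than 3:

```python
import math

def score(features, weights, bias):
    """Logistic-regression score: sigmoid of the weighted feature sum."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# hypothetical values -- one weight per feature (73 in the real model)
weights = [1.2, -0.4, 0.8]
bias = -0.5
candidates = {
    "comment-block": [1.0, 0.0, 1.0],  # timestamp present, not nav, review-like
    "nav-menu":      [0.0, 1.0, 0.0],
}

scores = {name: score(f, weights, bias) for name, f in candidates.items()}
best = max(scores, key=scores.get)       # step 4: highest score = best guess
THRESHOLD = 0.5                          # illustrative cutoff
page_is_ugc = scores[best] >= THRESHOLD  # step 5: flag the page
```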

What does "training" mean?

Training is when the model learns from historical data. You have already labeled 2,462 candidates as positive (genuine UGC) or negative (not UGC).

The model is shown those labeled examples and adjusts its internal weights so that it can predict the correct label on unseen pages. The 73 feature values for each candidate become a numerical vector; training finds the best decision boundary in that 73-dimensional space.

Once trained, the weights are saved as a JSON artifact. Inference is then instant — just one matrix multiply per candidate.
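The whole train → save artifact → score cycle fits in a short sketch. This is plain gradient descent on a toy two-feature dataset, and the JSON layout is hypothetical — it is not the Model Lab's actual artifact schema or training code:

```python
import json
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# toy labeled data: feature vectors, and 1 = UGC, 0 = not UGC
X = [[1.0, 0.1], [0.9, 0.2], [0.1, 0.9], [0.2, 1.0]]
y = [1, 1, 0, 0]

# training: repeatedly nudge the weights to reduce the log-loss
w, b, lr = [0.0, 0.0], 0.0, 0.5
for _ in range(500):
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
        err = p - yi                     # gradient of the log-loss
        w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
        b -= lr * err

# the trained weights become the saved artifact (illustrative schema)
artifact = json.dumps({"weights": w, "bias": b})

# inference is then instant: load once, one weighted sum per candidate
loaded = json.loads(artifact)
def predict(features):
    z = sum(wj * xj for wj, xj in zip(loaded["weights"], features)) + loaded["bias"]
    return sigmoid(z)
```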

2. The Class Imbalance Problem

Out of your 2,462 labeled candidates, only 372 are positive (UGC) and 2,090 are negative. That is a 5.6:1 ratio. When data is this skewed the model has a tempting shortcut — predict everything as negative and be right 84.9% of the time, even though it has learned nothing useful.

Original dataset (unbalanced)

  Positive (UGC):    372  (15.1%)
  Negative:        2,090  (84.9%)

Why this is a problem

A model that says "not UGC" for every single candidate would score 84.9% accuracy — but it would be completely useless. It misses every real UGC element. This is why accuracy alone is a bad metric for imbalanced tasks.
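That shortcut is easy to verify with this dataset's class counts:

```python
# An always-"not UGC" classifier evaluated against the labeled counts.
positives, negatives = 372, 2090
total = positives + negatives        # 2,462 labeled candidates

accuracy = negatives / total         # every negative right, every positive wrong
recall = 0 / positives               # it finds zero real UGC
```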

After balancing (e.g. SMOTE)

  Positive (UGC):  2,090  (50.0%)
  Negative:        2,090  (50.0%)

What balancing does

Balancing adjusts the in-memory training data so both classes appear equally often. The model then has to learn features that distinguish UGC, not just learn "when in doubt say negative". The stored database is never changed — balancing happens only during the training run.
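A minimal sketch of that idea — naive random oversampling applied to an in-memory copy, with the stored list left untouched (the `(label, id)` record format is invented for illustration):

```python
import random

random.seed(0)

# stand-in for the labeled database rows: (label, id) pairs
labeled = [("pos", i) for i in range(372)] + [("neg", i) for i in range(2090)]

pos = [r for r in labeled if r[0] == "pos"]
neg = [r for r in labeled if r[0] == "neg"]

# draw positives with replacement until both classes appear equally often
train_copy = neg + [random.choice(pos) for _ in range(len(neg))]
# `labeled` is never modified -- only the copy handed to the trainer changes
```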

3. Understanding the Metrics

Every metric you see on the comparison table comes from running the trained model against a held-out test set (data it has never seen). Here is what each number means in plain English.

Step 1 — The Confusion Matrix

Every prediction the model makes falls into one of four boxes:

                       Predicted Positive      Predicted Negative
  Actual Positive      TP (True Positive)      FN (False Negative)
  Actual Negative      FP (False Positive)     TN (True Negative)
Precision = TP / (TP + FP)
Of all the candidates the model said were UGC, what fraction actually were? High precision = few false alarms.

Recall = TP / (TP + FN)
Of all the candidates that actually were UGC, what fraction did the model find? High recall = few misses.

F1 Score = 2 × P × R / (P + R)
The harmonic mean of precision and recall. Use F1 when you need both metrics to be good simultaneously — a model can't cheat by optimising just one.

PR AUC = area under the precision-recall curve
Best metric for imbalanced data. Measures how well the model's scores separate the classes regardless of the threshold you pick. 1.0 = perfect; random = ~0.15 (your positive rate).

ROC AUC = area under the ROC curve
The probability that the model ranks a random positive candidate above a random negative one. 1.0 = perfect; 0.5 = coin flip. Less informative than PR AUC for imbalanced data.

Top-1 Accuracy = % of pages where the best candidate is the genuine UGC element
For each scanned page, was the highest-scored candidate actually the genuine UGC element? This is the most operationally important metric — it drives what gets auto-accepted.
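The first four formulas above are simple ratios over the confusion-matrix counts. With made-up counts for illustration:

```python
# hypothetical confusion-matrix counts
tp, fp, fn, tn = 80, 15, 25, 400

precision = tp / (tp + fp)                          # few false alarms when high
recall = tp / (tp + fn)                             # few misses when high
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / (tp + fp + fn + tn)          # misleading when imbalanced
```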

Precision vs Recall — A Proper X/Y Plot

Each dot = one strategy. X axis = Recall (how many UGC items found). Y axis = Precision (how many predictions were correct). Moving up-right is better. The curved grey lines are F1 iso-contours — any point on the same curve has the same F1 score.

4. The Five Balancing Strategies

Each strategy is a different answer to the question: "How do we stop the model from ignoring the minority class?" They all operate on the in-memory training copy only — your database is never touched.

Minority Share vs F1 — Does Balancing Help?

X axis = what percentage of the prepared training set is positive (UGC). Y axis = resulting F1 score. This shows whether pushing toward 50/50 actually improves the model for your data.

⬛ Baseline

Train on the raw labeled data — 372 positives and 2,090 negatives — with no changes. This is your control run. All other strategies are compared against it.

When it wins: Your dataset is actually well-distributed enough that the model still generalises. In your data, Baseline has the highest F1 (0.756) and highest Precision (0.849) because it doesn't over-fit to balanced-but-artificial data.

🔵 Class Weighted

The dataset is not changed at all. Instead, the loss function is changed: every error on a positive example is penalised 5.6× more heavily than an error on a negative example (matching the class ratio).

When it wins: You want higher recall without touching the data. It sacrifices some precision for recall gains — useful if missing a UGC page is more costly than a false alarm.
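One way class weighting can be realised is a weighted log-loss — a sketch of the idea, not the Model Lab's actual loss code:

```python
import math

def weighted_log_loss(probs, labels, pos_weight=2090 / 372):
    """Log-loss where an error on a positive example counts pos_weight times more."""
    total = 0.0
    for p, y in zip(probs, labels):
        w = pos_weight if y == 1 else 1.0
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

# the same-sized mistake hurts ~5.6x more on a positive example:
miss_positive = weighted_log_loss([0.1], [1])   # predicted 0.1, truth: UGC
miss_negative = weighted_log_loss([0.9], [0])   # predicted 0.9, truth: not UGC
```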

🟠 Undersample Majority

Randomly removes majority-class rows in memory until both classes are equal. Trains on 372 positives + 372 negatives (744 rows total instead of 2,462).

Trade-off: You throw away 82% of your negative examples. The model sees less data overall, which can hurt generalisation. But it forces the model to learn what makes positives distinctive rather than just "most things are negative".

🟢 Oversample Minority

Randomly duplicates minority-class rows until both classes are equal. Trains on 2,090 positives + 2,090 negatives (4,180 rows) by repeating the 372 real positive examples ~5.6× each.

Trade-off: No new information is created — the model just sees the same positive examples repeated. Can cause over-fitting to those specific positive examples, which is why SMOTE is often preferred.

🟣 SMOTE-like (K=5)

Creates new synthetic positive examples by interpolating between real ones in feature space. For each minority point, it finds its K=5 nearest minority neighbours, then generates a new point at a random position between them.

Why this matters: Unlike oversample, new information is created. Each synthetic point is slightly different from any real example, which helps the model generalise rather than memorise.

5. Why Does the Radar Chart Have No X/Y Axis?

This is a completely reasonable question. A standard chart has one X axis and one Y axis — that means you can only compare two things at once. But you have six metrics to compare across five strategies simultaneously. The radar chart solves this by giving each metric its own axis, all radiating from the same centre.

  • 1
    Choose N metrics. We have 6: F1, Precision, Recall, PR AUC, ROC AUC, Top-1. Place one axis per metric, evenly spaced around a circle (360°/6 = 60° apart).
  • 2
    Each axis goes from 0 (centre) to 1.0 (outer ring). The concentric rings mark 25%, 50%, 75%, 100%. So "the outer ring" = perfect score = 1.0 on every metric.
  • 3
    For each strategy, find its point on each axis. F1=0.756 for Baseline means: walk 75.6% of the way out along the F1 axis. Do this for all 6 metrics.
  • 4
    Connect the 6 points into a polygon. That polygon is the strategy's "shape". A larger polygon = better overall performance. A lopsided polygon = the strategy is strong on some metrics and weak on others.
  • 5
    Overlay all strategies. Now you can compare 5 polygons at once — which is impossible with a standard X/Y chart without multiple separate graphs.

The key insight

The radar chart does have axes — it has six of them, arranged radially instead of at right angles. There is no "X" or "Y" because each axis is its own dimension. You read it by looking at how far each polygon extends along each spoke.
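The placement rule in steps 1–4 is just polar coordinates. The values below are partly illustrative — only the F1, Precision and Recall figures come from this guide's Baseline numbers; the last three are placeholders:

```python
import math

metrics = {"F1": 0.756, "Precision": 0.849, "Recall": 0.681,
           "PR AUC": 0.80, "ROC AUC": 0.945, "Top-1": 0.70}

N = len(metrics)
points = []
for k, value in enumerate(metrics.values()):
    theta = 2 * math.pi * k / N          # axes spaced 360/6 = 60 degrees apart
    # walk `value` of the way out along the axis, then convert to x/y
    points.append((value * math.cos(theta), value * math.sin(theta)))
# connecting the 6 points in order draws the strategy's polygon
```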


6. The KNN Synthesis Diagram — Step by Step

The KNN diagram shows what SMOTE is doing in feature space. Feature space is an abstract mathematical space where each of your 73 features is one dimension. Each candidate is a single point in that 73-dimensional space. We can't draw 73 dimensions, so the diagram shows a simplified 2D slice to illustrate the principle.

  • 1
    Every candidate is a dot. Green dots = positive (UGC). Red dots = negative. Their position in the diagram represents their feature values — similar candidates end up near each other.
  • 2
    Pick a seed. SMOTE randomly selects a minority (green) candidate as the seed point. In the diagram it is the larger green dot labelled "seed".
  • 3
    Find K nearest minority neighbours. Only other positive candidates count — majority (red) dots are ignored entirely. K=5 means: find the 5 closest green dots by Euclidean distance in feature space. They are shown connected by dashed lines.
  • 4
    Interpolate. A new synthetic candidate is created at a random point along the line between the seed and one chosen neighbour. It is not a copy — every feature value is a blend: new_feature = seed_feature + random(0,1) × (neighbour_feature − seed_feature).
  • 5
    Repeat until balanced. This process continues for every minority candidate until there are as many positive rows as negative ones (1,718 synthetic rows in your run).
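The steps above can be sketched in pure Python. The 2-D minority points are invented for illustration — the real feature space has 73 dimensions:

```python
import random

random.seed(1)

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def smote_point(seed, minority, k=5):
    """Generate one synthetic point between seed and a random k-NN neighbour."""
    # only minority points count as neighbours; majority points are ignored
    neighbours = sorted((p for p in minority if p != seed),
                        key=lambda p: euclidean(seed, p))[:k]
    chosen = random.choice(neighbours)
    t = random.random()                   # random position along the line
    return [s + t * (c - s) for s, c in zip(seed, chosen)]

minority = [[0.1, 0.2], [0.15, 0.25], [0.2, 0.1], [0.3, 0.3],
            [0.25, 0.15], [0.12, 0.3], [0.4, 0.2]]
synthetic = smote_point(minority[0], minority)   # not a copy: a blend
```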

Why K=0 for Oversample Minority?

Oversample does NOT use KNN — it simply duplicates existing rows. The K column shows "n/a" for that strategy. The KNN diagram only appears for strategies that actually use nearest-neighbour synthesis (K ≥ 1).


7. From 73 Features to a 2D Scatter — How?

You asked: if there are 73 features, how can we draw the model's decisions on a flat 2D chart? The answer is dimensionality reduction — a mathematical technique that compresses 73 numbers per candidate into 2, while preserving as much of the original structure as possible.

PCA — Principal Component Analysis

PCA finds the two directions in the 73-dimensional space that capture the most variance (spread) in your data. These become the X and Y axes of the projected scatter plot.

Think of it like finding the best angle to photograph a 3D sculpture so that the flat photo reveals as much shape as possible. The photo loses some depth, but you still get the most important structure.

The trade-off: PCA axes are abstract combinations of features, not single features. "PC1" might mean "0.4 × timestamp_present + 0.3 × is_in_section + …" — the axis label has no simple name.
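A minimal PCA projection, assuming NumPy is available: centre the data, take the top-2 right singular vectors from an SVD, and project. The 6-feature random matrix stands in for the real 73-feature one:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))            # 50 candidates x 6 features

X_centred = X - X.mean(axis=0)          # PCA works on mean-centred data
U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)
components = Vt[:2]                     # directions of most variance (PC1, PC2)
projected = X_centred @ components.T    # each candidate becomes an (x, y) point
```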

Why the KNN diagram uses fixed positions

The KNN diagram in the Model Lab uses illustrative fixed positions because we don't ship a full PCA computation to the browser. The real feature space has 73 dimensions — the diagram's 2D layout is a conceptual sketch showing the principle of interpolation, not the actual positions of your candidates.

Simulated PCA scatter (concept)

X = first principal component (most variance). Y = second principal component. Green = positive (UGC), red = negative. A well-trained model finds a boundary that separates the clusters.

Feature Importance — Logistic Regression Weights

X axis = absolute weight magnitude (how much the feature moves the score). Y axis = feature name. Green bars push the prediction toward UGC; red bars push it away. The chart below shows the top features from a typical run — load the Model Lab to see your exact weights.
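Reading importance straight off logistic-regression weights is a one-liner: sort by absolute magnitude, and let the sign say which way each feature pushes the score. The feature names and values here are hypothetical:

```python
# hypothetical trained weights: positive pushes toward UGC, negative away
weights = {
    "timestamp_present": 1.8,
    "is_in_section": 0.9,
    "is_nav_element": -2.1,
    "link_density": -0.7,
}

# rank by absolute magnitude: how much each feature moves the score
ranked = sorted(weights.items(), key=lambda kv: abs(kv[1]), reverse=True)
```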

8. Reading Your Results — What the Table is Telling You

Looking at your actual comparison numbers (from the screenshot), here is what the results mean for your specific dataset.

Baseline wins on F1 (0.756) and Precision (0.849)

This is a good sign. It means your original labeled data is rich enough that balancing doesn't help the overall score. The model can learn the UGC pattern from the real 372 examples without needing synthetic augmentation.

High precision (0.849) means: when the model says "this is UGC", it is right 85% of the time. The cost of a false positive is low.

Undersample wins on Recall (0.769)

By throwing away 82% of negative examples, Undersample forces the model to be more aggressive — it would rather make a false alarm than miss a real UGC page. Recall 0.769 vs Baseline 0.681 means Undersample finds 13% more real UGC pages, but at the cost of more false positives (Precision drops to 0.614).

SMOTE and Oversample are close (F1 ~0.704–0.705)

Both strategies push the minority share to 50% and add 1,718 extra rows. The small difference (SMOTE: 0.667 precision vs Oversample: 0.657) suggests SMOTE's synthetic interpolation creates slightly more diverse training signal than simple duplication.

Neither beats Baseline here, but with more data or harder pages, SMOTE typically helps more.

ROC AUC is high across all strategies (0.943–0.948)

ROC AUC above 0.94 means the model almost always ranks a real UGC candidate higher than a random non-UGC one. This is strong. The strategies barely differ here (range = 0.005), which tells you the ranking is robust regardless of how you balance.

The bigger differences are in Precision and Recall — where the threshold decision happens.

Which strategy should I use?

For your current dataset: Baseline gives the best overall score. Use it for production scoring where you want the fewest false alarms. If you find that the scanner is consistently missing UGC on certain types of pages (recall too low), switch to Undersample or Class Weighted to catch more — accepting some extra false positives as the trade-off. SMOTE becomes more valuable as your dataset grows and becomes more diverse in its positive examples.

9. Quick Glossary

All the terminology in one place.

Artifact

The saved JSON file containing a trained model's weights, feature list, and evaluation results. Without an artifact on disk, the model cannot score anything.

Candidate

A single HTML element extracted from a scanned page that the model considers as a potential UGC container.

Feature vector

The 73 numerical values that represent a single candidate. This is what the model actually sees — not raw HTML, but a list of numbers derived from the HTML.

Logistic Regression

The model family used here. It learns a set of weights (one per feature) and computes a probability by applying those weights via a sigmoid function. Fast, interpretable, and reliable.

Minority class

The less-frequent class — here, positive (UGC). At 15.1% of labeled data, it is the minority. Balancing strategies increase its effective share during training.

Imbalance ratio

Majority count ÷ minority count. Your ratio is 2090 ÷ 372 = 5.6:1. Ratios above 4:1 are typically considered problematic for naive models.

SMOTE

Synthetic Minority Over-sampling Technique. Creates new training examples by interpolating between existing minority examples in feature space, using K nearest neighbours.

K (nearest neighbours)

The number of nearest minority-class examples SMOTE considers when generating a synthetic point. K=5 means each seed looks at its 5 closest minority neighbours.

PR AUC

Precision-Recall Area Under Curve. Summarises model quality across all possible classification thresholds. The gold-standard metric for imbalanced binary classification.

Threshold

The probability cutoff above which the model declares "UGC detected". Lowering the threshold increases recall but decreases precision. The comparison metrics are computed at a fixed threshold.

Test set

A held-out portion of labeled data (typically 20%) that the model never sees during training. Metrics computed on the test set reflect real-world performance, not just how well the model memorised training data.

Principal Component

A new axis created by PCA that is a weighted combination of the original features. PC1 captures the most variance, PC2 the next most, etc. Used to project high-dimensional data to 2D for visualisation.