The exponential progress in AI we have witnessed in recent years is astonishing. Since the launch of ChatGPT in 2022, things that were previously considered impossible — such as truly intelligent chat agents capable of performing a wide variety of tasks — are becoming increasingly integrated into business workflows.
But what about the “first generation” of AI, those traditional Machine Learning tools like logistic regression or tree-based classifiers? Do they still have a purpose in this new landscape dominated by super-intelligent agents?
In this blog post, we argue that the answer is “yes.” We will demonstrate — through a concrete example — how “classic” models such as XGBoost can still be useful, particularly when integrated properly with the new wave of generative and conversational AI.
Please note: this post is not a rigorous scientific analysis. It is an illustrative example of how traditional ML tools can be integrated with modern LLM-based agents. The results are shown on a small private dataset and should be interpreted in that spirit.
Use Case: L1 First Responder Agent
The specific use case we consider is that of an L1 first responder agent — a task that has historically been, and in many instances still is, performed by humans. The core function of the agent is to efficiently open and categorize support tickets based on issues raised by users. Given a natural language description of a problem, the agent must accurately fill a number of key fields — fields that are crucial for L2 and L3 support teams to quickly understand and route the issue.
Problem Description
We tackle the problem of classifying support tickets from a private dataset comprising approximately 1,000 unique entries. Due to the proprietary nature of the data, we cannot share further details beyond what is described here. We split the data 75/25 into training (~750 samples) and test (~250 samples) sets.
Each ticket consists of a free-text subject and description written by the user, alongside a set of structured categorical fields — such as Ticket_Type — that are typically filled in by the support agent when opening the ticket, not by the end user. The classification task is to predict four hierarchical categories — Cat1, Cat2, Cat3, and Cat4 — that determine the nature and routing of the issue. These are four distinct problems that must be solved simultaneously, with each level being progressively more fine-grained than the previous one.
The distinction between user-provided text and agent-inferred categorical fields has an important bearing on how the three configurations work, and it is worth clarifying upfront. In Configuration 2 (RAG only), the agent infers Cat1–Cat4 directly from semantically similar examples retrieved from the training set — no structured features are needed. In Configuration 3, however, we introduce an XGBoost classifier that predicts Cat1–Cat4 from structured categorical inputs (such as Ticket_Type), rather than from raw text. Since these categorical fields are not provided by the user at inference time, the RAG step in Configuration 3 serves a dual purpose: it grounds the LLM on what the categories might be, and it provides the context needed to infer the structured features required by the classifier. This design choice is described in detail in the section below.
Three Configurations
We will evaluate three progressively richer configurations of the L1 agent. All share the same goal: produce a JSON object containing predicted values for Cat1, Cat2, Cat3, and Cat4 given a user-provided ticket description.
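The post does not show the schema of the create_json tool itself; in the OpenAI function-calling format it would look roughly like the following sketch (the description strings are assumptions, and any fields beyond Cat1-Cat4 are omitted):

```python
# Hypothetical schema for the create_json tool in the OpenAI
# function-calling format. Only the four category fields are modelled.
CREATE_JSON_TOOL = {
    "type": "function",
    "function": {
        "name": "create_json",
        "description": "Emit the final classification for a support ticket.",
        "parameters": {
            "type": "object",
            "properties": {
                cat: {"type": "string", "description": f"Predicted value for {cat}"}
                for cat in ("Cat1", "Cat2", "Cat3", "Cat4")
            },
            "required": ["Cat1", "Cat2", "Cat3", "Cat4"],
        },
    },
}
```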
Configuration 1 — Simple LLM Agent with Hardcoded Context
This baseline tests the raw reasoning capability of the LLM, augmented only with high-level statistical information derived from the training set. Before deployment, we compute simple statistics from the training data — the list of valid category values for each level along with their occurrence counts — and bake this information directly into the system prompt. During inference, no data retrieval takes place.
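As an illustration, the statistics interpolated into the prompt as {train_stats} could be computed with a helper like this (the list-of-dicts input and the exact output format are assumptions; the post's actual code is not shown):

```python
from collections import Counter

def build_train_stats(tickets: list[dict]) -> str:
    """Summarise the valid values and occurrence counts for each category
    level, producing the text baked into the system prompt as {train_stats}.

    `tickets` is a list of training rows, each a dict with keys Cat1-Cat4.
    """
    lines = []
    for level in ("Cat1", "Cat2", "Cat3", "Cat4"):
        counts = Counter(t[level] for t in tickets)
        summary = ", ".join(f"{value} ({n})" for value, n in counts.most_common())
        lines.append(f"{level}: {summary}")
    return "\n".join(lines)
```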
System prompt excerpt
You are an L1 support ticket responder. Your task is to classify a support ticket into four hierarchical categories (Cat1, Cat2, Cat3, Cat4) based on the user's description, and produce a final JSON.
You have access to the following statistical summary of historical tickets, extracted from the training data. Use it to guide your predictions: {train_stats}
Steps:
1. Based on the subject and the statistical context above, infer the most likely values for: Cat1, Cat2, Cat3, and Cat4.
2. Explain your reasoning briefly, then call the create_json tool.
Strengths and limitations
The key advantage is speed and simplicity — no external calls, no retrieval latency. The limitation is equally clear: frequency statistics alone do not provide enough information for the LLM to reliably disambiguate categories, especially at finer levels. This is reflected in the results: Configuration 1 achieves an overall accuracy of 0% — it never manages to get all four categories correct simultaneously. This is expected. Without access to real examples, the model is simply not equipped to resolve the many ambiguities that arise in practice.
Configuration 2 — LLM Agent with RAG
This configuration augments the agent with a Retrieval-Augmented Generation (RAG) system. The training set is embedded using OpenAI’s text-embedding-3-large model and stored in a ChromaDB vector store. When the agent receives a ticket, it first retrieves the top-N semantically similar historical tickets, then conditions its predictions on those examples before producing the final JSON. Cat1–Cat4 are inferred directly from the patterns observed in the retrieved examples.
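In the post this lookup is delegated to ChromaDB over text-embedding-3-large vectors; as a self-contained sketch, the top-N ranking the vector store performs amounts to a cosine-similarity search over precomputed embeddings:

```python
import math

def top_n_similar(query_vec: list[float],
                  corpus: list[tuple[str, list[float]]],
                  n: int = 5) -> list[str]:
    """Return the ids of the n corpus entries most similar to the query.

    Illustrative only: in the post this ranking is performed by ChromaDB
    over OpenAI text-embedding-3-large vectors, not by hand-rolled code.
    `corpus` is a list of (ticket_id, embedding) pairs.
    """
    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [ticket_id for ticket_id, _ in ranked[:n]]
```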
System prompt excerpt
You are an L1 support ticket responder. Your task is to classify a support ticket into four hierarchical categories (Cat1, Cat2, Cat3, Cat4) based on the user's description, and produce a final JSON.
Steps:
1. The user will provide a short description of their issue.
2. Immediately call the search_vector_database tool using the subject as query to retrieve the most semantically similar historical tickets.
3. Analyze the retrieved tickets carefully. Use the category patterns you observe to infer the most likely values for all fields.
4. Call the create_json tool with all fields filled in.
Strengths and limitations
RAG gives the agent access to granular, real-world labelled examples. This is particularly powerful for category disambiguation: if retrieved tickets with similar subjects consistently share a Cat1 value, the model can leverage that consensus. The main cost is an added round-trip to the vector store and higher token usage in the context window.
Configuration 3 — LLM Agent with RAG and XGBoost Classifier Tool
This configuration combines contextual LLM reasoning, RAG-based example grounding, and the structured predictive power of a traditional ML model.
Four independent XGBoost classifiers are trained — one per category level. Rather than a single multi-output model, each classifier is trained separately on its own label space, allowing it to specialise independently. They are wrapped in a single ClassifierWrapper class and deployed as one MLflow endpoint, so from the agent’s perspective this is a single tool call returning predictions and confidence scores for all four categories at once.
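A minimal sketch of the wrapper's aggregation logic, stripped of the MLflow plumbing (the real class extends mlflow.pyfunc.PythonModel; the method names and output shape here are assumptions):

```python
class ClassifierWrapper:
    """Aggregates four per-level classifiers behind one predict() call.

    Sketch only: the deployed version extends mlflow.pyfunc.PythonModel
    and is served as a single REST endpoint. Each model is expected to
    expose predict_proba() and a classes_ attribute, as scikit-learn
    compatible XGBoost classifiers do.
    """

    def __init__(self, models: dict):
        # e.g. {"Cat1": clf1, "Cat2": clf2, "Cat3": clf3, "Cat4": clf4}
        self.models = models

    def predict(self, features) -> dict:
        out = {}
        for level, model in self.models.items():
            proba = model.predict_proba(features)[0]
            best = max(range(len(proba)), key=proba.__getitem__)
            out[level] = {
                "label": model.classes_[best],
                "confidence": float(proba[best]),
            }
        return out
```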
Because the classifier requires Ticket_Type and other categorical features not directly provided by the user, the RAG step plays a crucial role here beyond just grounding the predictions: the LLM uses the retrieved examples to infer reasonable values for those categorical features, which are then passed to the classifier. The LLM then synthesises the classifier signal with the RAG context — leaning on the classifier when confidence is high, and falling back to RAG when it is not.
System prompt excerpt
You are an L1 support ticket responder. Your task is to classify a support ticket into four hierarchical categories (Cat1, Cat2, Cat3, Cat4) based on the user's description, and produce a final JSON.
You have two tools available:
- search_vector_database: retrieves semantically similar historical tickets.
- classify_ticket: an XGBoost classifier that predicts Cat1-Cat4 with a confidence score for each.
Steps:
1. Call search_vector_database to retrieve similar historical tickets.
2. From the retrieved tickets, infer your best guess for the classifier's required features.
3. Call classify_ticket using the ticket subject and the inferred categorical features.
4. Combine both tools: use RAG for context, classifier predictions weighted by confidence to determine final Cat1-Cat4 values.
5. Call create_json with all fields filled in.
Strengths and limitations
This configuration is the most capable of the three. However, it is important to stress that its performance depends heavily on the quality of the underlying XGBoost models. A poorly trained classifier will degrade, not improve, the overall system. The trade-off also includes added latency (two external tool calls) and the need to maintain a separate classifier service.
Workflows
All three configurations share a common agent loop built with the OpenAI tool-calling API and tracked via LangSmith. The key difference lies in how many reasoning and retrieval steps the LLM orchestrates before producing the final JSON:
Config 1: user message → LLM (with hardcoded stats) → create_json
Config 2: user message → LLM → search_vector_database → LLM → create_json
Config 3: user message → LLM → search_vector_database → LLM → classify_ticket → LLM → create_json
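The tool-dispatch step shared by all three loops can be sketched as follows (the handler wiring is hypothetical; the actual agent relies on the OpenAI tool-calling API to produce the tool name and JSON arguments):

```python
import json

def dispatch_tool_call(name: str, arguments: str, handlers: dict) -> str:
    """Route one LLM tool call to its handler and return a JSON string
    to append to the conversation before re-invoking the model.

    `arguments` is the JSON-encoded argument object emitted by the LLM;
    `handlers` maps tool names to plain Python callables.
    """
    if name not in handlers:
        return json.dumps({"error": f"unknown tool: {name}"})
    result = handlers[name](**json.loads(arguments))
    return json.dumps(result)
```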
In Config 3, the LLM is invoked three times: first to understand the ticket and decide to retrieve examples, then to interpret the retrieved examples and infer the categorical features needed for the classifier, and finally to synthesise the classifier output into a structured prediction. The four XGBoost models are wrapped in a ClassifierWrapper class extending mlflow.pyfunc.PythonModel, making them straightforward to serve as a single REST endpoint.
Evaluation Results
We evaluate Configurations 2 and 3 on the held-out test set. Configuration 1 is excluded from the quantitative comparison: its 0% overall accuracy shows that statistical context alone is inadequate, and it is not a meaningful baseline to improve upon.
For each configuration we report per-category accuracy and overall accuracy (all four categories simultaneously correct — the stricter and more meaningful metric).
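The strict overall metric can be computed alongside the per-category accuracies in a few lines (a sketch, assuming predictions and labels are lists of dicts keyed by Cat1-Cat4):

```python
def evaluate(preds: list[dict], labels: list[dict]) -> dict:
    """Per-category accuracy plus the strict 'overall' metric, which
    counts a ticket as correct only when all four categories match."""
    levels = ("Cat1", "Cat2", "Cat3", "Cat4")
    n = len(labels)
    scores = {
        lvl: sum(p[lvl] == y[lvl] for p, y in zip(preds, labels)) / n
        for lvl in levels
    }
    scores["Overall"] = sum(
        all(p[lvl] == y[lvl] for lvl in levels) for p, y in zip(preds, labels)
    ) / n
    return scores
```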
| Configuration | Cat1 | Cat2 | Cat3 | Cat4 | Overall |
|---|---|---|---|---|---|
| Config 2 (RAG only) | 100% | 66.7% | 66.7% | 33.3% | 33.3% |
| Config 3 (RAG + XGBoost) | 100% | 83.3% | 83.3% | 66.7% | 66.7% |
Both configurations achieve perfect Cat1 accuracy, as expected given the small label space at the coarsest level. The performance gap widens at finer-grained categories. Config 3 improves Cat2 and Cat3 by ~17 percentage points each, and Cat4 by ~33 points. Overall accuracy doubles from 33.3% to 66.7%.
These numbers are illustrative. The dataset is small and private, and the results should not be taken as a general claim about this architecture’s superiority.
Discussion
A few observations are worth highlighting. The performance gap between configurations widens as the category level increases — all agents handle Cat1 well, but differences become pronounced at Cat4. Coarse categories can often be inferred from the ticket subject alone; fine-grained ones require richer context or a model that has explicitly learned the label distribution.
The confidence scores from the XGBoost classifier also prove useful as a routing signal. When confidence is high, the LLM follows the classifier and performs better. When confidence is low, it appropriately discounts the classifier and leans on RAG.
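This routing is performed by the LLM in natural language, but the underlying policy it follows is roughly equivalent to the following sketch (the threshold value is an assumption, not something reported in the post):

```python
def choose_prediction(classifier_pred: str, confidence: float,
                      rag_votes: list[str], threshold: float = 0.7) -> str:
    """Follow the classifier when it is confident; otherwise fall back
    to the majority label among the retrieved similar tickets."""
    if confidence >= threshold:
        return classifier_pred
    return max(set(rag_votes), key=rag_votes.count)
```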
Finally, Config 3’s performance is tightly coupled to the quality of the trained XGBoost models. This is not a plug-and-play architecture: it requires investing in a reliable ML training pipeline as a prerequisite. If that investment is not made, the classifier will introduce noise rather than signal.
Conclusion
The results of this experiment make a case — illustrative, not definitive — for the continued relevance of traditional ML models in the age of LLM agents. XGBoost does not replace the LLM; on its own it would be helpless, lacking the agent's ability to infer missing fields, retrieve context, and reason across ambiguous inputs. But deployed as a structured tool, it contributes a calibrated, data-driven confidence that pure language models can struggle to replicate.
The three configurations represent a useful spectrum for practitioners designing agentic systems. Not every problem warrants the complexity of a hybrid pipeline. But when accuracy matters, labelled training data exists, and a reliable classifier can be built, integrating it as an agent tool is a powerful and underexplored pattern.
The key takeaway is simple: old and new AI tools are not competitors — they are complementary, and thoughtful integration can bring out the best of both.
Main Author: Giovanni Vacanti, Data Scientist @ Bitrock