7 Advanced Feature Engineering Tricks for Text Data Using LLM Embeddings
Introduction
Large language models (LLMs) are not only good at understanding and generating text; they can also turn raw text into numerical representations called embeddings. These embeddings are useful for incorporating additional information into traditional predictive machine learning models—such as those used in scikit-learn—to improve downstream performance.
This article presents seven advanced Python examples of feature engineering tricks that add extra value to text data by leveraging LLM-generated embeddings. These techniques can improve the accuracy and robustness of downstream machine learning models that rely on text, in applications such as sentiment analysis, topic classification, document clustering, and semantic similarity detection.
Common setup for all examples
Unless stated otherwise, the seven example tricks below make use of this common setup. We rely on Sentence Transformers for embeddings and scikit-learn for modeling utilities.
!pip install sentence-transformers scikit-learn -q

from sentence_transformers import SentenceTransformer
import numpy as np

# Load a lightweight LLM embedding model; it produces 384-dimensional embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
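As a quick sanity check, you can encode a sample sentence and confirm the 384-dimensional output mentioned above (the sentence used here is arbitrary):

# Encode one sample sentence and inspect the resulting embedding matrix
sample_emb = model.encode(["Embeddings turn text into numbers."])
print(sample_emb.shape)  # Expected: (1, 384) for all-MiniLM-L6-v2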
1. Combining TF-IDF and Embedding Features
The first example shows how to jointly extract—given a source text dataset like fetch_20newsgroups—both TF-IDF and LLM-generated sentence-embedding features. We then combine these feature types to train a logistic regression model that classifies news texts based on the combined features, often boosting accuracy by capturing both lexical and semantic information.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Loading data
data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
texts, y = data.data[:500], data.target[:500]

# Extracting features of two broad types: lexical (TF-IDF) and semantic (embeddings)
tfidf = TfidfVectorizer(max_features=300).fit_transform(texts).toarray()
emb = model.encode(texts, show_progress_bar=False)

# Combining features and training the ML model
X = np.hstack([tfidf, StandardScaler().fit_transform(emb)])
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Accuracy:", clf.score(X, y))
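Note that the accuracy above is computed on the same data the model was trained on. One reasonable variation, sketched below with the X and y already built, is to evaluate on a held-out split instead:

from sklearn.model_selection import train_test_split

# Evaluate the combined lexical + semantic features on a held-out split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Held-out accuracy:", clf.score(X_test, y_test))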
2. Topic-Aware Embedding Clusters
This trick takes a few sample text sequences, generates embeddings using the preloaded language model, applies K-Means clustering on these embeddings to assign topics, and then combines the embeddings with a one-hot encoding of each example’s cluster identifier (its “topic class”) to build a new feature representation. It is a useful strategy for creating compact topic meta-features.
|
from sklearn.cluster import KMeans from sklearn.preprocessing import OneHotEncoder
texts = [“Tokyo Tower is a popular landmark.”, “Sushi is a traditional Japanese dish.”, “Mount Fuji is a famous volcano in Japan.”, “Cherry blossoms bloom in the spring in Japan.”]
emb = model.encode(texts) topics = KMeans(n_clusters=2, n_init=‘auto’, random_state=42).fit_predict(emb) topic_ohe = OneHotEncoder(sparse_output=False).fit_transform(topics.reshape(–1, 1))
X = np.hstack([emb, topic_ohe]) print(X.shape) |
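If new texts will arrive later, one option is to keep the fitted clusterer and encoder so unseen documents can be mapped onto the same topic features. A minimal sketch reusing the texts and embeddings above (the new sentence is just an illustrative example):

# Fit the clusterer and encoder once, then reuse them for new documents
kmeans = KMeans(n_clusters=2, n_init="auto", random_state=42).fit(emb)
ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore").fit(
    kmeans.labels_.reshape(-1, 1)
)

new_texts = ["Ramen shops are common across Japan."]  # illustrative new document
new_emb = model.encode(new_texts)
new_topics = kmeans.predict(new_emb).reshape(-1, 1)

X_new = np.hstack([new_emb, ohe.transform(new_topics)])
print(X_new.shape)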
3. Semantic Anchor Similarity Features
This simple strategy computes similarity to a small set of fixed “anchor” (or reference) sentences used as compact semantic descriptors—essentially, semantic landmarks. Each column in the similarity-feature matrix contains the similarity of the text to one anchor. The main value lies in allowing the model to learn relationships between the text’s similarity to key concepts and a target variable—useful for text classification models.
from sklearn.metrics.pairwise import cosine_similarity

# Anchor sentences act as semantic landmarks
anchors = ["space mission", "car performance", "politics"]
anchor_emb = model.encode(anchors)

texts = ["The rocket launch was successful.", "The car handled well on the track."]
emb = model.encode(texts)

# One similarity column per anchor
sim_features = cosine_similarity(emb, anchor_emb)
print(sim_features)
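These similarity columns can be used on their own or stacked with the raw embeddings as inputs to any scikit-learn estimator. A minimal sketch with hypothetical binary labels (1 for space-related, 0 otherwise), purely for illustration:

from sklearn.linear_model import LogisticRegression

# Hypothetical labels for the two example texts: 1 = space-related, 0 = not
y_demo = np.array([1, 0])

# Anchor similarities stacked alongside the raw embeddings
X_anchor = np.hstack([emb, sim_features])
clf_demo = LogisticRegression(max_iter=1000).fit(X_anchor, y_demo)
print(clf_demo.predict(X_anchor))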
4. Meta-Feature Stacking via Auxiliary Sentiment Classifier
For text associated with labels such as sentiments, the following feature-engineering technique adds extra value. A meta-feature is built as the prediction probability returned by an auxiliary classifier trained on the embeddings. This meta-feature is stacked with the original embeddings, resulting in an augmented feature set that can improve downstream performance by exposing potentially more discriminative information than raw embeddings alone.
A slight additional setup is needed for this example:
!pip install sentence-transformers scikit-learn -q

from sentence_transformers import SentenceTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import numpy as np

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings

# Small dataset containing texts and sentiment labels
texts = ["I love this!", "This is terrible.", "Amazing quality.", "Not good at all."]
y = np.array([1, 0, 1, 0])

# Obtain embeddings from the embedder LLM
emb = embedder.encode(texts, show_progress_bar=False)

# Train an auxiliary classifier on the embeddings
X_train, X_test, y_train, y_test = train_test_split(
    emb, y, test_size=0.5, random_state=42, stratify=y
)
meta_clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Use the auxiliary model's predicted probability of the positive class as a meta-feature
# (here it is predicted for all rows, including ones the classifier was trained on;
# see the out-of-fold variant below for a leakage-free alternative)
meta_feature = meta_clf.predict_proba(emb)[:, 1].reshape(-1, 1)

# Augment the original embeddings with the meta-feature
# Do not forget to scale again for consistency
scaler = StandardScaler()
emb_scaled = scaler.fit_transform(emb)
X_aug = np.hstack([emb_scaled, meta_feature])  # Stack features together

print("emb shape:", emb.shape)
print("meta_feature shape:", meta_feature.shape)
print("augmented shape:", X_aug.shape)
print("meta clf accuracy on test slice:", meta_clf.score(X_test, y_test))
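Because the meta-feature above is predicted for rows the auxiliary classifier has already seen, it can leak label information into the augmented features. On a realistically sized dataset, a common and safer variant is to generate out-of-fold probabilities with cross_val_predict; the sketch below assumes there are enough samples per class for the chosen number of folds:

from sklearn.model_selection import cross_val_predict

# Out-of-fold probabilities: each row is scored by a model that never saw it during training
oof_proba = cross_val_predict(
    LogisticRegression(max_iter=1000), emb, y, cv=2, method="predict_proba"
)[:, 1].reshape(-1, 1)

X_aug_oof = np.hstack([emb_scaled, oof_proba])
print("augmented (out-of-fold) shape:", X_aug_oof.shape)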
5. Embedding Compression and Nonlinear Expansion
This strategy applies PCA dimensionality reduction to compress the raw embeddings built by the LLM and then polynomially expands these compressed embeddings. It may sound odd at first, but this can be an effective approach to capture nonlinear structure while maintaining efficiency.
!pip install sentence-transformers scikit-learn -q

from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

# Loading a lightweight embedding language model
embedder = SentenceTransformer("all-MiniLM-L6-v2")

texts = [
    "The satellite was launched into orbit.",
    "Cars require regular maintenance.",
    "The telescope observed distant galaxies.",
]

# Obtaining embeddings
emb = embedder.encode(texts, show_progress_bar=False)

# Compressing with PCA (n_components cannot exceed the number of samples, here 3),
# then enriching the compressed representation with polynomial features
pca = PCA(n_components=2).fit_transform(emb)
poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(pca)

print("Original shape:", emb.shape)
print("After PCA:", pca.shape)
print("After polynomial expansion:", poly.shape)
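In practice, the compression and expansion steps should be fitted on training data and then reused on new text. Wrapping them in a scikit-learn Pipeline is one straightforward way to do that; a minimal sketch reusing the embeddings above (the new sentence is just an example):

from sklearn.pipeline import Pipeline

# Fit the compress-then-expand transformation once, then reuse it on new embeddings
compressor = Pipeline([
    ("pca", PCA(n_components=2)),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
]).fit(emb)

new_emb = embedder.encode(["The rover sent back images of the Martian surface."])
print(compressor.transform(new_emb).shape)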
6. Relational Learning with Pairwise Contrastive Features
The goal here is to build pairwise relational features from text embeddings. Interrelated features—constructed in a contrastive fashion—can highlight aspects of similarity and dissimilarity. This is particularly effective for predictive processes that inherently entail comparisons among texts.
!pip install sentence-transformers -q

from sentence_transformers import SentenceTransformer
import numpy as np

# Loading embedder
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Example text pairs
pairs = [
    ("The car is fast.", "The vehicle moves quickly."),
    ("The sky is blue.", "Bananas are yellow."),
]

# Generating embeddings for both sides of each pair
emb1 = embedder.encode([p[0] for p in pairs], show_progress_bar=False)
emb2 = embedder.encode([p[1] for p in pairs], show_progress_bar=False)

# Building contrastive features: absolute difference and element-wise product
X_pairs = np.hstack([np.abs(emb1 - emb2), emb1 * emb2])

print("Pairwise feature shape:", X_pairs.shape)
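If it suits the task, the row-wise cosine similarity of each pair can be appended as one more column next to the difference and product features. A small sketch reusing emb1 and emb2:

# Row-wise cosine similarity between the two sides of each pair
norms = np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1)
cos_sim = (np.sum(emb1 * emb2, axis=1) / norms).reshape(-1, 1)

X_pairs_ext = np.hstack([X_pairs, cos_sim])
print("Extended pairwise feature shape:", X_pairs_ext.shape)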
7. Cross-Modal Fusion
The last trick combines LLM embeddings with simple linguistic or numeric features, such as punctuation ratio or other domain-specific engineered features. It contributes to more holistic text-derived features by uniting semantic signals with handcrafted linguistic aspects. Here is an example that adds word-count and punctuation-ratio features.
!pip install sentence-transformers -q

from sentence_transformers import SentenceTransformer
import numpy as np
import re

# Loading embedder
embedder = SentenceTransformer("all-MiniLM-L6-v2")

texts = ["Mars mission 2024!", "New electric car model launched."]

# Computing embeddings
emb = embedder.encode(texts, show_progress_bar=False)

# Adding simple numeric text features: word count and punctuation ratio
lengths = np.array([len(t.split()) for t in texts]).reshape(-1, 1)
punct_ratio = np.array([len(re.findall(r"[^\w\s]", t)) / len(t) for t in texts]).reshape(-1, 1)

# Combining all features
X = np.hstack([emb, lengths, punct_ratio])

print("Final feature matrix shape:", X.shape)
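Word counts and ratios live on very different scales than the embedding dimensions, so scaling the handcrafted columns before fusing them is often a sensible extra step, especially for scale-sensitive models. A sketch using StandardScaler on the features built above:

from sklearn.preprocessing import StandardScaler

# Scale only the handcrafted numeric columns before fusing them with the embeddings
handcrafted = np.hstack([lengths, punct_ratio]).astype(float)
handcrafted_scaled = StandardScaler().fit_transform(handcrafted)

X_fused = np.hstack([emb, handcrafted_scaled])
print("Fused feature matrix shape:", X_fused.shape)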
Wrapping Up
We explored seven advanced feature-engineering tricks that help extract more information from raw text, going beyond LLM-generated embeddings alone. These practical strategies can boost downstream machine learning models that take text as input by capturing complementary lexical, semantic, relational, and handcrafted signals.

