7 Advanced Feature Engineering Tricks for Text Data Using LLM Embeddings
Introduction
Large language models (LLMs) are not only good at understanding and generating text; they can also turn raw text into numerical representations called embeddings. These embeddings are useful for incorporating additional information into traditional predictive machine learning models—such as those used in scikit-learn—to improve downstream performance.
This article presents seven advanced Python examples of feature engineering tricks that add extra value to text data by leveraging LLM-generated embeddings. These techniques can improve the accuracy and robustness of downstream machine learning models that rely on text, in applications such as sentiment analysis, topic classification, document clustering, and semantic similarity detection.
Common setup for all examples
Unless stated otherwise, the seven example tricks below make use of this common setup. We rely on Sentence Transformers for embeddings and scikit-learn for modeling utilities.
!pip install sentence-transformers scikit-learn -q

from sentence_transformers import SentenceTransformer
import numpy as np

# Load a lightweight LLM embedding model; it produces 384-dimensional embeddings
model = SentenceTransformer("all-MiniLM-L6-v2")
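As a quick sanity check, you can encode a sample sentence and confirm the 384-dimensional output mentioned above (the sentence used here is arbitrary):

# Encode one sample sentence and inspect the resulting embedding matrix
sample_emb = model.encode(["Embeddings turn text into numbers."])
print(sample_emb.shape)  # Expected: (1, 384) for all-MiniLM-L6-v2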
1. Combining TF-IDF and Embedding Features
The first example shows how to jointly extract—given a source text dataset like fetch_20newsgroups—both TF-IDF and LLM-generated sentence-embedding features. We then combine these feature types to train a logistic regression model that classifies news texts based on the combined features, often boosting accuracy by capturing both lexical and semantic information.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Loading data
data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
texts, y = data.data[:500], data.target[:500]

# Extracting features of two broad types: lexical (TF-IDF) and semantic (embeddings)
tfidf = TfidfVectorizer(max_features=300).fit_transform(texts).toarray()
emb = model.encode(texts, show_progress_bar=False)

# Combining features and training the ML model
X = np.hstack([tfidf, StandardScaler().fit_transform(emb)])
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Accuracy:", clf.score(X, y))
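Note that the accuracy above is computed on the same data the model was trained on. One reasonable variation, sketched below with the X and y already built, is to evaluate on a held-out split instead:

from sklearn.model_selection import train_test_split

# Evaluate the combined lexical + semantic features on a held-out split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Held-out accuracy:", clf.score(X_test, y_test))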
2. Topic-Aware Embedding Clusters
This trick takes a few sample text sequences, generates embeddings using the preloaded language model, applies K-Means clustering on these embeddings to assign topics, and then combines the embeddings with a one-hot encoding of each example’s cluster identifier (its “topic class”) to build a new feature representation. It is a useful strategy for creating compact topic meta-features.
|
from sklearn.cluster import KMeans from sklearn.preprocessing import OneHotEncoder
texts = [“Tokyo Tower is a popular landmark.”, “Sushi is a traditional Japanese dish.”, “Mount Fuji is a famous volcano in Japan.”, “Cherry blossoms bloom in the spring in Japan.”]
emb = model.encode(texts) topics = KMeans(n_clusters=2, n_init=‘auto’, random_state=42).fit_predict(emb) topic_ohe = OneHotEncoder(sparse_output=False).fit_transform(topics.reshape(–1, 1))
X = np.hstack([emb, topic_ohe]) print(X.shape) |
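If new texts will arrive later, one option is to keep the fitted clusterer and encoder so unseen documents can be mapped onto the same topic features. A minimal sketch reusing the texts and embeddings above (the new sentence is just an illustrative example):

# Fit the clusterer and encoder once, then reuse them for new documents
kmeans = KMeans(n_clusters=2, n_init="auto", random_state=42).fit(emb)
ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore").fit(
    kmeans.labels_.reshape(-1, 1)
)

new_texts = ["Ramen shops are common across Japan."]  # illustrative new document
new_emb = model.encode(new_texts)
new_topics = kmeans.predict(new_emb).reshape(-1, 1)

X_new = np.hstack([new_emb, ohe.transform(new_topics)])
print(X_new.shape)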
3. Semantic Anchor Similarity Features
This simple strategy computes similarity to a small set of fixed “anchor” (or reference) sentences used as compact semantic descriptors—essentially, semantic landmarks. Each column in the similarity-feature matrix contains the similarity of the text to one anchor. The main value lies in allowing the model to learn relationships between the text’s similarity to key concepts and a target variable—useful for text classification models.
from sklearn.metrics.pairwise import cosine_similarity

# Anchor sentences act as semantic landmarks
anchors = ["space mission", "car performance", "politics"]
anchor_emb = model.encode(anchors)

texts = ["The rocket launch was successful.", "The car handled well on the track."]
emb = model.encode(texts)

# One similarity column per anchor
sim_features = cosine_similarity(emb, anchor_emb)
print(sim_features)
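These similarity columns can be used on their own or stacked with the raw embeddings as inputs to any scikit-learn estimator. A minimal sketch with hypothetical binary labels (1 for space-related, 0 otherwise), purely for illustration:

from sklearn.linear_model import LogisticRegression

# Hypothetical labels for the two example texts: 1 = space-related, 0 = not
y_demo = np.array([1, 0])

# Anchor similarities stacked alongside the raw embeddings
X_anchor = np.hstack([emb, sim_features])
clf_demo = LogisticRegression(max_iter=1000).fit(X_anchor, y_demo)
print(clf_demo.predict(X_anchor))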
4. Meta-Feature Stacking via Auxiliary Sentiment Classifier
For text associated with labels such as sentiments, the following feature-engineering technique adds extra value. A meta-feature is built as the prediction probability returned by an auxiliary classifier trained on the embeddings. This meta-feature is stacked with the original embeddings, resulting in an augmented feature set that can improve downstream performance by exposing potentially more discriminative information than raw embeddings alone.
A slight additional setup is needed for this example:
!pip install sentence-transformers scikit-learn -q

from sentence_transformers import SentenceTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import numpy as np

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings

# Small dataset containing texts and sentiment labels
texts = ["I love this!", "This is terrible.", "Amazing quality.", "Not good at all."]
y = np.array([1, 0, 1, 0])

# Obtain embeddings from the embedder LLM
emb = embedder.encode(texts, show_progress_bar=False)

# Train an auxiliary classifier on the embeddings
X_train, X_test, y_train, y_test = train_test_split(
    emb, y, test_size=0.5, random_state=42, stratify=y
)
meta_clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Use the auxiliary model's predicted probability of the positive class as a meta-feature
# (here it is predicted for all rows, including ones the classifier was trained on;
# see the out-of-fold variant below for a leakage-free alternative)
meta_feature = meta_clf.predict_proba(emb)[:, 1].reshape(-1, 1)

# Augment the original embeddings with the meta-feature
# Do not forget to scale again for consistency
scaler = StandardScaler()
emb_scaled = scaler.fit_transform(emb)
X_aug = np.hstack([emb_scaled, meta_feature])  # Stack features together

print("emb shape:", emb.shape)
print("meta_feature shape:", meta_feature.shape)
print("augmented shape:", X_aug.shape)
print("meta clf accuracy on test slice:", meta_clf.score(X_test, y_test))
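Because the meta-feature above is predicted for rows the auxiliary classifier has already seen, it can leak label information into the augmented features. On a realistically sized dataset, a common and safer variant is to generate out-of-fold probabilities with cross_val_predict; the sketch below assumes there are enough samples per class for the chosen number of folds:

from sklearn.model_selection import cross_val_predict

# Out-of-fold probabilities: each row is scored by a model that never saw it during training
oof_proba = cross_val_predict(
    LogisticRegression(max_iter=1000), emb, y, cv=2, method="predict_proba"
)[:, 1].reshape(-1, 1)

X_aug_oof = np.hstack([emb_scaled, oof_proba])
print("augmented (out-of-fold) shape:", X_aug_oof.shape)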
5. Embedding Compression and Nonlinear Expansion
This strategy applies PCA dimensionality reduction to compress the raw embeddings built by the LLM and then polynomially expands these compressed embeddings. It may sound odd at first, but this can be an effective approach to capture nonlinear structure while maintaining efficiency.
!pip install sentence-transformers scikit-learn -q

from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

# Loading a lightweight embedding language model
embedder = SentenceTransformer("all-MiniLM-L6-v2")

texts = [
    "The satellite was launched into orbit.",
    "Cars require regular maintenance.",
    "The telescope observed distant galaxies.",
]

# Obtaining embeddings
emb = embedder.encode(texts, show_progress_bar=False)

# Compressing with PCA (n_components cannot exceed the number of samples, here 3),
# then enriching the compressed representation with polynomial features
pca = PCA(n_components=2).fit_transform(emb)
poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(pca)

print("Original shape:", emb.shape)
print("After PCA:", pca.shape)
print("After polynomial expansion:", poly.shape)
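In practice, the compression and expansion steps should be fitted on training data and then reused on new text. Wrapping them in a scikit-learn Pipeline is one straightforward way to do that; a minimal sketch reusing the embeddings above (the new sentence is just an example):

from sklearn.pipeline import Pipeline

# Fit the compress-then-expand transformation once, then reuse it on new embeddings
compressor = Pipeline([
    ("pca", PCA(n_components=2)),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
]).fit(emb)

new_emb = embedder.encode(["The rover sent back images of the Martian surface."])
print(compressor.transform(new_emb).shape)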
6. Relational Learning with Pairwise Contrastive Features
The goal here is to build pairwise relational features from text embeddings. Interrelated features—constructed in a contrastive fashion—can highlight aspects of similarity and dissimilarity. This is particularly effective for predictive processes that inherently entail comparisons among texts.
!pip install sentence-transformers -q

from sentence_transformers import SentenceTransformer
import numpy as np

# Loading embedder
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Example text pairs
pairs = [
    ("The car is fast.", "The vehicle moves quickly."),
    ("The sky is blue.", "Bananas are yellow."),
]

# Generating embeddings for both sides of each pair
emb1 = embedder.encode([p[0] for p in pairs], show_progress_bar=False)
emb2 = embedder.encode([p[1] for p in pairs], show_progress_bar=False)

# Building contrastive features: absolute difference and element-wise product
X_pairs = np.hstack([np.abs(emb1 - emb2), emb1 * emb2])

print("Pairwise feature shape:", X_pairs.shape)
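If it suits the task, the row-wise cosine similarity of each pair can be appended as one more column next to the difference and product features. A small sketch reusing emb1 and emb2:

# Row-wise cosine similarity between the two sides of each pair
norms = np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1)
cos_sim = (np.sum(emb1 * emb2, axis=1) / norms).reshape(-1, 1)

X_pairs_ext = np.hstack([X_pairs, cos_sim])
print("Extended pairwise feature shape:", X_pairs_ext.shape)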
7. Cross-Modal Fusion
The last trick combines LLM embeddings with simple linguistic or numeric features, such as punctuation ratio or other domain-specific engineered features. It contributes to more holistic text-derived features by uniting semantic signals with handcrafted linguistic aspects. Here is an example that adds word-count and punctuation-ratio features.
!pip install sentence-transformers -q

from sentence_transformers import SentenceTransformer
import numpy as np
import re

# Loading embedder
embedder = SentenceTransformer("all-MiniLM-L6-v2")

texts = ["Mars mission 2024!", "New electric car model launched."]

# Computing embeddings
emb = embedder.encode(texts, show_progress_bar=False)

# Adding simple numeric text features: word count and punctuation ratio
lengths = np.array([len(t.split()) for t in texts]).reshape(-1, 1)
punct_ratio = np.array([len(re.findall(r"[^\w\s]", t)) / len(t) for t in texts]).reshape(-1, 1)

# Combining all features
X = np.hstack([emb, lengths, punct_ratio])

print("Final feature matrix shape:", X.shape)
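Word counts and ratios live on very different scales than the embedding dimensions, so scaling the handcrafted columns before fusing them is often a sensible extra step, especially for scale-sensitive models. A sketch using StandardScaler on the features built above:

from sklearn.preprocessing import StandardScaler

# Scale only the handcrafted numeric columns before fusing them with the embeddings
handcrafted = np.hstack([lengths, punct_ratio]).astype(float)
handcrafted_scaled = StandardScaler().fit_transform(handcrafted)

X_fused = np.hstack([emb, handcrafted_scaled])
print("Fused feature matrix shape:", X_fused.shape)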
Wrapping Up
We explored seven advanced feature-engineering tricks that help extract more information from raw text, going beyond LLM-generated embeddings alone. These practical strategies can boost downstream machine learning models that take text as input by capturing complementary lexical, semantic, relational, and handcrafted signals.

