Software systems exist to orchestrate the flow of information. When we engage with these flows, we start the nonlinear, spontaneous process of transforming them into knowledge. This process is entirely subjective: people may draw different conclusions from the same information, but where only one interpretation is right, only one person creates knowledge.
All of this is predicated on the data being a reliable source of truth. But what if it's not? Then the info-rot will cascade throughout the system, and the production of knowledge will stop. Worse, all presumed "knowledge" and prior conclusions have to be discarded.
In the linked example, the data was fraudulently falsified, but bad data is bad regardless of how it got there. I want to explore ways to fix this problem—the problem of bad data—with large language models. I will draw on my personal experiences, along with two papers:
This guide covers techniques for cleaning and preprocessing data with LLMs: backfilling missing values, fixing inconsistent formatting, removing duplicates, and consolidating records. We also cover data integration tasks like matching schemas across different systems.
Bad data ruins analysis and leads to poor decisions. Traditional cleaning techniques are too specialized and rigid, while Large Language Models (LLMs) can fix multiple problems in one sweep: missing values, inconsistent formatting, duplicate records, and more.
Let's look at how a single, simple prompt to an LLM can automatically detect and fix all these issues at once.
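Here is a minimal sketch of what that single prompt could look like. It assumes the OpenAI Python SDK's structured-output support (`client.beta.chat.completions.parse`) together with Pydantic; the model name and the messy sample records are placeholders for illustration, not data from the original example.

```python
import json

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

# Deliberately messy sample records: inconsistent casing and whitespace,
# mixed date formats, an "N/A" placeholder, and an exact duplicate row.
raw_records = [
    {"name": "  jane doe ", "city": "new york", "signup_date": "2023/01/15"},
    {"name": "JOHN SMITH", "city": "N/A", "signup_date": "Jan 16, 2023"},
    {"name": "  jane doe ", "city": "new york", "signup_date": "2023/01/15"},
]

class CleanRecord(BaseModel):
    name: str
    city: str | None
    signup_date: str  # expected as ISO 8601 (YYYY-MM-DD)

class CleanedData(BaseModel):
    records: list[CleanRecord]
    issues_found: list[str]

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",  # any structured-output-capable model works here
    messages=[
        {
            "role": "system",
            "content": (
                "Clean this data: trim whitespace, use Title Case for names and "
                "cities, convert dates to YYYY-MM-DD, treat 'N/A' as missing, "
                "and drop exact duplicates. List every issue you fixed."
            ),
        },
        {"role": "user", "content": json.dumps(raw_records)},
    ],
    response_format=CleanedData,
)

cleaned = completion.choices[0].message.parsed
print(cleaned.model_dump_json(indent=2))
```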
And the results:
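With those hypothetical records, the parsed response might look roughly like this; an actual run can of course differ in wording and detail:

```
{
  "records": [
    {"name": "Jane Doe", "city": "New York", "signup_date": "2023-01-15"},
    {"name": "John Smith", "city": null, "signup_date": "2023-01-16"}
  ],
  "issues_found": [
    "Trimmed whitespace and normalized casing in names and cities",
    "Converted dates to YYYY-MM-DD",
    "Replaced 'N/A' with null",
    "Removed one exact duplicate record"
  ]
}
```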
LLMs can infer categories from text descriptions, making them useful for automated data labeling. We can either let the model choose appropriate categories based on context, or define a specific set of categories ourselves. In this example, we will constrain the model to use predefined industry labels.
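A sketch of the constrained-labeling setup, under the same assumptions as the cleaning example; the company descriptions and the four-industry label set below are invented for illustration.

```python
from enum import Enum

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

# Predefined industry labels the model must choose from.
class Industry(str, Enum):
    technology = "Technology"
    healthcare = "Healthcare"
    finance = "Finance"
    retail = "Retail"

class LabeledCompany(BaseModel):
    description: str
    industry: Industry

class LabeledData(BaseModel):
    companies: list[LabeledCompany]

descriptions = [
    "Acme Cloud provides managed Kubernetes hosting for startups.",
    "Bright Dental operates a chain of orthodontic clinics.",
    "Pennywise offers micro-loans and budgeting tools to consumers.",
]

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "Assign each company to exactly one of the allowed industries.",
        },
        {"role": "user", "content": "\n".join(descriptions)},
    ],
    response_format=LabeledData,
)

for company in completion.choices[0].message.parsed.companies:
    print(f"{company.industry.value:12} {company.description}")
```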
Our labeled data:
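With those made-up descriptions, the printed labels would look something like this (illustrative, not an actual run):

```
Technology   Acme Cloud provides managed Kubernetes hosting for startups.
Healthcare   Bright Dental operates a chain of orthodontic clinics.
Finance      Pennywise offers micro-loans and budgeting tools to consumers.
```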
Notice that by using Pydantic models for structured outputs, the returned data automatically conforms to our schema. We don't need to provide additional formatting instructions or parse the response ourselves.
When merging data from different systems, we often need to identify which columns mean the same thing. For example, one database might call a field "customer_id" while another uses "client_number"; it's the same data, just labeled differently.
Consider a common scenario: two companies run separate sales-tracking systems. Both track the same information but use different column names:
Dataset A:
| ProductID | Name | UnitsSold |
|---|---|---|
| 001 | Laptop | 50 |
| 002 | Mouse | 100 |
| 003 | Keyboard | 75 |
| 004 | Monitor | 40 |
Dataset B:
| ItemCode | ProductName | QuantitySold |
|---|---|---|
| A1 | Notebook | 48 |
| B2 | PC Mouse | 105 |
| C3 | Key Board | 78 |
| D4 | Screen | 42 |
First, we ask the LLM to identify the columns between datasets A and B that contain the same information but use different names:
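One way to pose that question, again assuming the OpenAI SDK's structured outputs; the column names and the sample values passed to the model come from the tables above.

```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class ColumnMatch(BaseModel):
    dataset_a_column: str
    dataset_b_column: str
    reasoning: str

class SchemaMapping(BaseModel):
    matches: list[ColumnMatch]

# Column names plus a couple of sample values give the model enough context.
schema_a = {"ProductID": ["001", "002"], "Name": ["Laptop", "Mouse"], "UnitsSold": [50, 100]}
schema_b = {"ItemCode": ["A1", "B2"], "ProductName": ["Notebook", "PC Mouse"], "QuantitySold": [48, 105]}

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": (
                "Identify the columns between datasets A and B that contain "
                "the same information but use different names."
            ),
        },
        {"role": "user", "content": f"Dataset A: {schema_a}\nDataset B: {schema_b}"},
    ],
    response_format=SchemaMapping,
)

for match in completion.choices[0].message.parsed.matches:
    print(f"{match.dataset_a_column} <-> {match.dataset_b_column}")
```

For these two schemas the mapping is unambiguous: ProductID ↔ ItemCode, Name ↔ ProductName, and UnitsSold ↔ QuantitySold.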
With the columns mapped out, we can now standardize both datasets into the target schema using Python:
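A plain-pandas sketch of that step: rebuild the two tables from above as DataFrames, rename dataset B's columns to dataset A's names, and stack the frames.

```python
import pandas as pd

# The two source tables, reproduced from above.
df_a = pd.DataFrame({
    "ProductID": ["001", "002", "003", "004"],
    "Name": ["Laptop", "Mouse", "Keyboard", "Monitor"],
    "UnitsSold": [50, 100, 75, 40],
})
df_b = pd.DataFrame({
    "ItemCode": ["A1", "B2", "C3", "D4"],
    "ProductName": ["Notebook", "PC Mouse", "Key Board", "Screen"],
    "QuantitySold": [48, 105, 78, 42],
})

# The mapping identified in the previous step, with dataset A as the target schema.
column_mapping = {
    "ItemCode": "ProductID",
    "ProductName": "Name",
    "QuantitySold": "UnitsSold",
}

combined = pd.concat(
    [df_a, df_b.rename(columns=column_mapping)],
    ignore_index=True,
)
print(combined)
```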
The result:
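With the sketch above, the combined table comes out as:

| ProductID | Name | UnitsSold |
|---|---|---|
| 001 | Laptop | 50 |
| 002 | Mouse | 100 |
| 003 | Keyboard | 75 |
| 004 | Monitor | 40 |
| A1 | Notebook | 48 |
| B2 | PC Mouse | 105 |
| C3 | Key Board | 78 |
| D4 | Screen | 42 |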
With our columns standardized, the next challenge is matching products that appear under different names across the datasets.
Looking at the combined dataset above, we can identify several products that may be the same item under different names: Laptop / Notebook, Mouse / PC Mouse, Keyboard / Key Board, and Monitor / Screen.
Next, we'll combine the matching records using the following code (which continues from our previous code):
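A sketch of that step, continuing from the pandas code above and reusing the same structured-output pattern: ask the model to group names that refer to the same item, then aggregate units per canonical name. Summing quantities on merge is our own choice here, not something the original code is known to do.

```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class ProductMatch(BaseModel):
    canonical_name: str
    matching_names: list[str]

class ProductMatches(BaseModel):
    matches: list[ProductMatch]

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "Group product names that refer to the same physical item.",
        },
        {"role": "user", "content": ", ".join(combined["Name"].tolist())},
    ],
    response_format=ProductMatches,
)

# Map every matched alias to its canonical name, then sum units per product.
name_map = {
    alias: match.canonical_name
    for match in completion.choices[0].message.parsed.matches
    for alias in match.matching_names
}
merged = (
    combined.assign(Name=combined["Name"].replace(name_map))
    .groupby("Name", as_index=False)["UnitsSold"]
    .sum()
)
print(merged)
```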
The merged dataset:
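Assuming the model groups only Mouse/PC Mouse and Keyboard/Key Board (as described next) and that units are summed per canonical name, the merged table would look roughly like this:

| Name | UnitsSold |
|---|---|
| Keyboard | 153 |
| Laptop | 50 |
| Monitor | 40 |
| Mouse | 205 |
| Notebook | 48 |
| Screen | 42 |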
The LLM merged the two obvious matches: Mouse / PC Mouse and Keyboard / Key Board.
Other seemingly similar items, such as Laptop / Notebook and Monitor / Screen, were kept separate due to potential product distinctions.
Only the most obviously identical items were combined (Mouse/PC Mouse and Keyboard/Key Board), while potentially distinct products remained separate. Of course, we could easily instruct it to merge additional pairs like Laptop/Notebook, but this default behavior of erring on the side of caution helps prevent unwanted data loss while still automating the clear matches.