Software systems exist to orchestrate the flow of information. When we engage with these flows, we start the nonlinear, spontaneous process of transforming them into knowledge. This process is entirely subjective: people may draw different conclusions from the same information, but where only one interpretation is right, only one person creates knowledge.
All of this is predicated on the data being a reliable source of truth. But what if it's not? Then the info-rot will cascade throughout the system, and the production of knowledge will stop. Worse, all presumed "knowledge" and prior conclusions have to be discarded.
In the linked example, the data was fraudulently falsified, but bad data is bad regardless of how it got there. I want to explore ways to fix this problem—the problem of bad data—with large language models. I will draw on my personal experiences, along with two papers:
This guide covers techniques for cleaning and preprocessing data with LLMs: backfilling missing values, fixing inconsistent formatting, removing duplicates, and consolidating records. We also cover data integration tasks like matching schemas across different systems.
Bad data ruins analysis and leads to poor decisions. Traditional cleaning techniques are too specialized and rigid, while Large Language Models (LLMs) can fix multiple problems in one sweep: missing values, inconsistent formatting, duplicate records, and more.
Let's look at how a single, simple prompt to an LLM can automatically detect and fix all these issues at once.
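Here is a minimal sketch of what that single prompt could look like. It assumes the OpenAI Python SDK's structured-output support (`client.beta.chat.completions.parse`) together with Pydantic; the model name and the messy sample records are placeholders for illustration, not data from the original example.

```python
import json

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

# Deliberately messy sample records: inconsistent casing and whitespace,
# mixed date formats, an "N/A" placeholder, and an exact duplicate row.
raw_records = [
    {"name": "  jane doe ", "city": "new york", "signup_date": "2023/01/15"},
    {"name": "JOHN SMITH", "city": "N/A", "signup_date": "Jan 16, 2023"},
    {"name": "  jane doe ", "city": "new york", "signup_date": "2023/01/15"},
]

class CleanRecord(BaseModel):
    name: str
    city: str | None
    signup_date: str  # expected as ISO 8601 (YYYY-MM-DD)

class CleanedData(BaseModel):
    records: list[CleanRecord]
    issues_found: list[str]

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",  # any structured-output-capable model works here
    messages=[
        {
            "role": "system",
            "content": (
                "Clean this data: trim whitespace, use Title Case for names and "
                "cities, convert dates to YYYY-MM-DD, treat 'N/A' as missing, "
                "and drop exact duplicates. List every issue you fixed."
            ),
        },
        {"role": "user", "content": json.dumps(raw_records)},
    ],
    response_format=CleanedData,
)

cleaned = completion.choices[0].message.parsed
print(cleaned.model_dump_json(indent=2))
```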
And the results:
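With those hypothetical records, the parsed response might look roughly like this; an actual run can of course differ in wording and detail:

```
{
  "records": [
    {"name": "Jane Doe", "city": "New York", "signup_date": "2023-01-15"},
    {"name": "John Smith", "city": null, "signup_date": "2023-01-16"}
  ],
  "issues_found": [
    "Trimmed whitespace and normalized casing in names and cities",
    "Converted dates to YYYY-MM-DD",
    "Replaced 'N/A' with null",
    "Removed one exact duplicate record"
  ]
}
```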
LLMs can infer categories from text descriptions, making them useful for automated data labeling. We can either let the model choose appropriate categories based on context, or define a specific set of categories ourselves. In this example, we will constrain the model to use predefined industry labels.
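A sketch of the constrained-labeling setup, under the same assumptions as the cleaning example; the company descriptions and the four-industry label set below are invented for illustration.

```python
from enum import Enum

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

# Predefined industry labels the model must choose from.
class Industry(str, Enum):
    technology = "Technology"
    healthcare = "Healthcare"
    finance = "Finance"
    retail = "Retail"

class LabeledCompany(BaseModel):
    description: str
    industry: Industry

class LabeledData(BaseModel):
    companies: list[LabeledCompany]

descriptions = [
    "Acme Cloud provides managed Kubernetes hosting for startups.",
    "Bright Dental operates a chain of orthodontic clinics.",
    "Pennywise offers micro-loans and budgeting tools to consumers.",
]

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "Assign each company to exactly one of the allowed industries.",
        },
        {"role": "user", "content": "\n".join(descriptions)},
    ],
    response_format=LabeledData,
)

for company in completion.choices[0].message.parsed.companies:
    print(f"{company.industry.value:12} {company.description}")
```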
Our labeled data:
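With those made-up descriptions, the printed labels would look something like this (illustrative, not an actual run):

```
Technology   Acme Cloud provides managed Kubernetes hosting for startups.
Healthcare   Bright Dental operates a chain of orthodontic clinics.
Finance      Pennywise offers micro-loans and budgeting tools to consumers.
```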
Notice that by using Pydantic models for structured outputs, the returned data automatically conforms to our schema. We don't need to provide additional formatting instructions or parse the response ourselves.
When merging data from different systems, we often need to identify which columns mean the same thing. For example, one database might call a field "customer_id" while another uses "client_number"; it's the same data, just labeled differently.
Consider a common scenario: two companies run separate sales-tracking systems. Both track the same information but use different column names:
Dataset A:
| ProductID | Name | UnitsSold |
|---|---|---|
| 001 | Laptop | 50 |
| 002 | Mouse | 100 |
| 003 | Keyboard | 75 |
| 004 | Monitor | 40 |
Dataset B:
| ItemCode | ProductName | QuantitySold |
|---|---|---|
| A1 | Notebook | 48 |
| B2 | PC Mouse | 105 |
| C3 | Key Board | 78 |
| D4 | Screen | 42 |
First, we ask the LLM to identify the columns between datasets A and B that contain the same information but use different names:
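One way to pose that question, again assuming the OpenAI SDK's structured outputs; the column names and the sample values passed to the model come from the tables above.

```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class ColumnMatch(BaseModel):
    dataset_a_column: str
    dataset_b_column: str
    reasoning: str

class SchemaMapping(BaseModel):
    matches: list[ColumnMatch]

# Column names plus a couple of sample values give the model enough context.
schema_a = {"ProductID": ["001", "002"], "Name": ["Laptop", "Mouse"], "UnitsSold": [50, 100]}
schema_b = {"ItemCode": ["A1", "B2"], "ProductName": ["Notebook", "PC Mouse"], "QuantitySold": [48, 105]}

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": (
                "Identify the columns between datasets A and B that contain "
                "the same information but use different names."
            ),
        },
        {"role": "user", "content": f"Dataset A: {schema_a}\nDataset B: {schema_b}"},
    ],
    response_format=SchemaMapping,
)

for match in completion.choices[0].message.parsed.matches:
    print(f"{match.dataset_a_column} <-> {match.dataset_b_column}")
```

For these two schemas the mapping is unambiguous: ProductID ↔ ItemCode, Name ↔ ProductName, and UnitsSold ↔ QuantitySold.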
With the columns mapped out, we can now standardize both datasets into the target schema using Python:
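A plain-pandas sketch of that step: rebuild the two tables from above as DataFrames, rename dataset B's columns to dataset A's names, and stack the frames.

```python
import pandas as pd

# The two source tables, reproduced from above.
df_a = pd.DataFrame({
    "ProductID": ["001", "002", "003", "004"],
    "Name": ["Laptop", "Mouse", "Keyboard", "Monitor"],
    "UnitsSold": [50, 100, 75, 40],
})
df_b = pd.DataFrame({
    "ItemCode": ["A1", "B2", "C3", "D4"],
    "ProductName": ["Notebook", "PC Mouse", "Key Board", "Screen"],
    "QuantitySold": [48, 105, 78, 42],
})

# The mapping identified in the previous step, with dataset A as the target schema.
column_mapping = {
    "ItemCode": "ProductID",
    "ProductName": "Name",
    "QuantitySold": "UnitsSold",
}

combined = pd.concat(
    [df_a, df_b.rename(columns=column_mapping)],
    ignore_index=True,
)
print(combined)
```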
The result:
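With the sketch above, the combined table comes out as:

| ProductID | Name | UnitsSold |
|---|---|---|
| 001 | Laptop | 50 |
| 002 | Mouse | 100 |
| 003 | Keyboard | 75 |
| 004 | Monitor | 40 |
| A1 | Notebook | 48 |
| B2 | PC Mouse | 105 |
| C3 | Key Board | 78 |
| D4 | Screen | 42 |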
With our columns standardized, the next challenge is matching products that appear under different names across the datasets.
Looking at the combined dataset above, we can identify several products that may be the same item under different names: Laptop / Notebook, Mouse / PC Mouse, Keyboard / Key Board, and Monitor / Screen.
Next, we'll combine the matching records using the following code (which continues from our previous code):
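A sketch of that step, continuing from the pandas code above and reusing the same structured-output pattern: ask the model to group names that refer to the same item, then aggregate units per canonical name. Summing quantities on merge is our own choice here, not something the original code is known to do.

```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class ProductMatch(BaseModel):
    canonical_name: str
    matching_names: list[str]

class ProductMatches(BaseModel):
    matches: list[ProductMatch]

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "Group product names that refer to the same physical item.",
        },
        {"role": "user", "content": ", ".join(combined["Name"].tolist())},
    ],
    response_format=ProductMatches,
)

# Map every matched alias to its canonical name, then sum units per product.
name_map = {
    alias: match.canonical_name
    for match in completion.choices[0].message.parsed.matches
    for alias in match.matching_names
}
merged = (
    combined.assign(Name=combined["Name"].replace(name_map))
    .groupby("Name", as_index=False)["UnitsSold"]
    .sum()
)
print(merged)
```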
The merged dataset:
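Assuming the model groups only Mouse/PC Mouse and Keyboard/Key Board (as described next) and that units are summed per canonical name, the merged table would look roughly like this:

| Name | UnitsSold |
|---|---|
| Keyboard | 153 |
| Laptop | 50 |
| Monitor | 40 |
| Mouse | 205 |
| Notebook | 48 |
| Screen | 42 |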
The LLM merged the two obvious matches: Mouse / PC Mouse and Keyboard / Key Board.
Other seemingly similar items, such as Laptop / Notebook and Monitor / Screen, were kept separate due to potential product distinctions.
Only the most obviously identical items were combined (Mouse/PC Mouse and Keyboard/Key Board), while potentially distinct products remained separate. Of course, we could easily instruct it to merge additional pairs like Laptop/Notebook, but this default behavior of erring on the side of caution helps prevent unwanted data loss while still automating the clear matches.