You have just pulled data from a promising new API. You open the file, expecting a clean table of rows and columns, but instead, you find a chaotic tree of curly braces, brackets, and nested objects. This is JSON (JavaScript Object Notation), the language of the web—and the nightmare of data analysts who need tabular data.
While JSON is perfect for transmitting flexible data structures, it is fundamentally at odds with the row-and-column format of DataFrames and SQL databases. A single "row" in JSON might contain a list of ten items, each with its own dictionary of attributes. Forcing this into a spreadsheet often leaves cells holding stringified objects like {'id': 101, 'status': 'active'}, effectively locking your data away.
In this guide, we will break down exactly how to dismantle these trees and turn them into flat, analytical tables using Python and Pandas.
Why is nested JSON so difficult to analyze?
Nested JSON is difficult because it represents data as a hierarchy (a tree), while data science tools like Pandas and SQL expect a flat matrix (a grid). In a tree, a single parent entity (like a "Customer") can own multiple child entities (like "Orders"), which creates a one-to-many relationship that cannot fit into a single spreadsheet row without duplication or data loss.
To bridge this gap, we perform flattening (or normalization).
In Plain English: Imagine trying to write down your entire family tree on a single line of notebook paper. You can write your name, but where do you put your children? Do you squeeze them into the same box? Do you start new lines? JSON is the family tree; Pandas is the notebook line. Flattening is the set of rules we use to decide how to fit the tree onto the lines.
How do we flatten simple nested dictionaries?
The most basic form of nesting occurs when a column contains a dictionary instead of a single value. Pandas offers a powerful utility called json_normalize designed specifically for this.
If your data looks like a list of dictionaries where some keys contain other dictionaries, json_normalize will flatten the hierarchy by combining field names with dot notation (e.g., address.city).
import pandas as pd
# A list of dictionaries with simple nesting
data = [
{
"id": 101,
"name": "Sarah Connor",
"contact": {
"email": "sarah@example.com",
"phone": "555-0199"
},
"location": {
"city": "Los Angeles",
"state": "CA"
}
},
{
"id": 102,
"name": "Kyle Reese",
"contact": {
"email": "kyle@example.com",
"phone": None
},
"location": {
"city": "Unknown",
"state": "Unknown"
}
}
]
# Flattening the data
df = pd.json_normalize(data)
print(df)
Output:
id name contact.email contact.phone location.city location.state
0 101 Sarah Connor sarah@example.com 555-0199 Los Angeles CA
1 102 Kyle Reese kyle@example.com None Unknown Unknown
Notice how contact and location disappeared as columns and were replaced by contact.email, location.city, etc. This makes the data immediately ready for analysis.
💡 Pro Tip: You can change the separator. If you prefer underscores (better for SQL databases), use pd.json_normalize(data, sep='_') to get contact_email instead of contact.email.
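For instance, reusing the data from above:
# Underscore separator: SQL-friendly column names
df_sql = pd.json_normalize(data, sep='_')
print(df_sql.columns.tolist())
# ['id', 'name', 'contact_email', 'contact_phone', 'location_city', 'location_state']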
How do we handle lists inside records?
Handling lists is trickier because a list implies a "one-to-many" relationship. If a user has five past orders, you cannot simply flatten that into one row. You usually want to create one row per order.
This is where the record_path parameter becomes essential. It tells Pandas: "Drill down to this specific list and make each item in it a new row."
# Data where each user has a LIST of orders
complex_data = [
{
"user_id": 1,
"name": "Neo",
"orders": [
{"order_id": "A100", "amount": 50.00},
{"order_id": "A101", "amount": 25.50}
]
},
{
"user_id": 2,
"name": "Trinity",
"orders": [
{"order_id": "B200", "amount": 100.00}
]
}
]
# We want to analyze ORDERS, so we set the path to 'orders'
df_orders = pd.json_normalize(complex_data, record_path=['orders'])
print(df_orders)
Output:
order_id amount
0 A100 50.0
1 A101 25.5
2 B200 100.0
Wait! We lost the user information. We know A100 exists, but we don't know who bought it. This leads us to the next critical question.
How do we keep parent data attached to child records?
When you drill down into a list using record_path, you lose the context of the parent object by default. To retain the parent's information (like user_id or name), you must use the meta parameter.
The meta argument specifies which fields from the parent object should be copied down into every child row generated from the list.
# Using record_path AND meta to keep the relationship
df_complete = pd.json_normalize(
complex_data,
record_path=['orders'],
meta=['user_id', 'name']
)
print(df_complete)
Output:
order_id amount user_id name
0 A100 50.0 1 Neo
1 A101 25.5 1 Neo
2 B200 100.0 2 Trinity
This is the "magic command" for most JSON data wrangling. You are effectively performing a "Left Join" between the parent and the children instantly.
⚠️ Common Pitfall: If a field you list in meta is missing from one of the objects, Pandas will raise a KeyError. To prevent this, pass errors='ignore' (missing meta values then come through as NaN, as shown below) or ensure your data is clean. For handling inconsistent data structures, check our guide on Data Cleaning: A Complete Workflow from Messy to Model-Ready.
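Here is a minimal sketch of the failure case and the fix, using a hypothetical records list where the second user is missing name:
# One record lacks 'name'; meta=['name'] would normally raise a KeyError
records = [
    {"user_id": 1, "name": "Neo", "orders": [{"order_id": "A100"}]},
    {"user_id": 2, "orders": [{"order_id": "B200"}]}  # no 'name' key
]
df_safe = pd.json_normalize(
    records,
    record_path=['orders'],
    meta=['user_id', 'name'],
    errors='ignore'
)
print(df_safe)  # the missing name comes through as NaN instead of crashing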
What if the nesting goes even deeper?
Real-world API responses, such as those from social media or financial platforms, can be nested five or six levels deep. While json_normalize handles this automatically, sometimes you only want to flatten the first few levels to avoid creating hundreds of sparse columns.
You can control this depth using the max_level parameter.
The Max Level Control
If you set max_level=1, Pandas will flatten the immediate children but leave any deeper dictionaries as raw objects.
deep_data = [
{
"id": 1,
"details": {
"specs": {
"cpu": "M1",
"ram": "16GB"
},
"color": "Silver"
}
}
]
# Flatten completely (default)
print("--- Default Flattening ---")
print(pd.json_normalize(deep_data).columns)
# Flatten only 1 level
print("\n--- Max Level = 1 ---")
print(pd.json_normalize(deep_data, max_level=1).columns)
Output:
--- Default Flattening ---
Index(['id', 'details.specs.cpu', 'details.specs.ram', 'details.color'], dtype='object')
--- Max Level = 1 ---
Index(['id', 'details.specs', 'details.color'], dtype='object')
In the second example, details.specs remains a dictionary. This is useful when you want to parse that specific column later using custom logic or if the schema varies too wildly at deeper levels.
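One way to handle that two-pass pattern, sketched here with the deep_data from above: flatten to max_level=1 first, then normalize the leftover dictionary column on its own and join it back.
# First pass: stop at level 1, leaving 'details.specs' as raw dicts
df_partial = pd.json_normalize(deep_data, max_level=1)
# Second pass: flatten just that column and join it back by index
specs = pd.json_normalize(df_partial['details.specs'].tolist())
df_full = df_partial.drop(columns='details.specs').join(specs.add_prefix('details.specs.'))
print(df_full.columns.tolist())
# ['id', 'details.color', 'details.specs.cpu', 'details.specs.ram']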
How does this apply to Feature Engineering?
Flattening JSON is often the first step in Feature Engineering. APIs frequently return rich metadata hidden inside nested fields—like a user's device type, browser settings, or geolocation coordinates.
If you leave these as JSON strings, your model can't learn from them. By flattening them into distinct columns like device.os or geo.latitude, you unlock new predictive signals.
However, be careful of column explosion. A JSON field containing a "tags" dictionary with dynamic keys (e.g., {"tag_123": "funny", "tag_456": "viral"}) can result in thousands of columns if flattened blindly. In these cases, it is better to extract specific keys manually rather than normalizing the entire structure.
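A sketch of that selective approach, using a hypothetical events list with a free-form tags dictionary: pull out only the keys you actually need instead of normalizing the whole structure.
# Hypothetical records with a dynamic 'tags' dictionary
events = [
    {"id": 1, "tags": {"tag_123": "funny", "lang": "en"}},
    {"id": 2, "tags": {"tag_456": "viral"}}
]
df_events = pd.DataFrame(events)
# Extract only the stable, known key; skip the dynamic tag_* keys entirely
df_events['lang'] = df_events['tags'].apply(lambda t: (t or {}).get('lang'))
print(df_events[['id', 'lang']])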
Handling Common JSON Errors
1. The "Mixing Dicts and Lists" Error
Sometimes an API is inconsistent. One record has a list of items, and another record has null or a completely different structure.
If json_normalize fails, your fallback strategy is plain Python: write a small function that inspects each record and normalizes the field yourself. It is slower but infinitely flexible.
raw_data = [
{"id": 1, "tags": ["urgent", "home"]}, # List
{"id": 2, "tags": None}, # NoneType
{"id": 3, "tags": "work"} # String (Inconsistent!)
]
# json_normalize might struggle or produce weird results here.
# Better to process manually:
def process_tags(row):
tags = row.get('tags')
if isinstance(tags, list):
return ", ".join(tags)
elif isinstance(tags, str):
return tags
return "Unknown"
# Create DataFrame first, then apply logic
df = pd.DataFrame(raw_data)
df['tags_clean'] = df.apply(process_tags, axis=1)
print(df)
2. File Size Issues
JSON files are verbose. A 1GB CSV might bloat to 5GB in JSON format due to repeated keys. If you try to load a massive JSON file into memory at once, your kernel will crash.
Solution: If the file is line-delimited (one JSON object per line), read it with standard Python I/O, extract only the fields you need from each line, and build the DataFrame from the collected results.
import json
# Conceptual example for large line-delimited JSON
results = []
# with open('huge_file.json', 'r') as f:
# for line in f:
# data = json.loads(line)
# # Extract only what you need
# results.append({'id': data['id'], 'val': data['metric']['value']})
# df = pd.DataFrame(results)
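If the file is newline-delimited, Pandas can also do the chunking for you with read_json using lines=True and chunksize. A sketch, again assuming the hypothetical huge_file.json:
# Chunked reading with pandas (requires line-delimited JSON)
# chunks = pd.read_json('huge_file.json', lines=True, chunksize=100_000)
# df = pd.concat((chunk[['id', 'metric']] for chunk in chunks), ignore_index=True)
# (nested columns like 'metric' may still need a json_normalize pass)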
Conclusion
Working with JSON does not have to be a struggle against brackets and braces. By understanding the difference between the tree structure of JSON and the tabular structure of DataFrames, you can choose the right flattening strategy.
Start with pd.json_normalize for 90% of your use cases. Use record_path when you need to drill into lists, and meta to keep the context. When data gets messy or inconsistent, fall back to plain Python functions and loops.
Once your data is flat, the real work begins. You will likely need to handle missing values generated by the flattening process—check out our guide on Missing Data Strategies to ensure your now-flat data is ready for modeling. If your JSON contained text data, your next stop should be Mastering Text Preprocessing.