
JSON to Pandas: How to Flatten Complex Nested Data Without Losing Your Mind

LDS Team
Let's Data Science

You just pulled 50,000 order records from a REST API. Each record is a tree: customer objects nested inside orders, arrays of line items nested inside customers, discount codes nested inside line items. You call pd.DataFrame() and get a DataFrame where half the columns contain raw dictionaries and lists. Nothing is filterable, nothing is joinable, nothing is usable.

That gap between hierarchical JSON and flat, tabular pandas DataFrames trips up even experienced engineers. The pd.json_normalize() function in pandas 3.0 (released January 2026) handles most of the heavy lifting, but knowing when to reach for record_path, when to use .explode(), and when to write a custom flattener is what separates a five-minute solution from an afternoon of frustration.

This guide uses a single running example throughout: an e-commerce API returning order records with nested customer details, shipping addresses, line items, and review metadata. Every code block, every diagram, every table references this same dataset.

The structural mismatch between JSON and DataFrames

JSON organizes data as a tree. A single order record can contain a customer object, which holds an address object, which holds a coordinates object. Each level adds depth, and there's no requirement that sibling records share the same shape.

DataFrames are the opposite. They're strictly two-dimensional grids: every row has the same columns, and every cell holds a scalar value. When you pass nested JSON to pd.DataFrame(), pandas doesn't recursively unpack anything. It stores inner dictionaries and lists as opaque Python objects inside cells.

[Figure: JSON tree structure versus flat DataFrame grid after json_normalize]

Here's the problem in action with our e-commerce data:
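A minimal reproduction, using a trimmed one-order version of the running dataset (only four fields, so the dtype listing stays short; the literal values are placeholders consistent with the rest of this guide):

```python
import pandas as pd

# One order, trimmed to four fields
orders = [
    {
        "order_id": "ORD-1001",
        "total": 129.99,
        "customer": {"name": "Maria Chen", "email": "maria@example.com"},
        "items": [{"sku": "WIDGET-A", "qty": 2, "price": 29.99}],
    }
]

df = pd.DataFrame(orders)
print(df.dtypes)
print(f"\ncustomer column value: {type(df.loc[0, 'customer'])}")
print(f"items column value: {type(df.loc[0, 'items'])}")
```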

Expected output:

```
order_id     object
total       float64
customer     object
items        object
dtype: object

customer column value: <class 'dict'>
items column value: <class 'list'>
```

The customer column holds a dictionary. The items column holds a list of dictionaries. You can't filter by city, group by customer name, or calculate revenue per SKU until these nested structures become proper columns. That's exactly what pd.json_normalize() fixes.

Key Insight: The object dtype in pandas is a catch-all. When you see it on a column that should contain structured data, it almost always means you have nested dicts or lists that need flattening.

The e-commerce dataset we'll flatten

Before jumping into techniques, let's define the complete dataset we'll reference through every section. Three orders, each with nested customer info, variable-length item lists, and optional reviews:
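Here's one way to define that dataset in code. The structure and names match the running example used throughout this guide; values not shown elsewhere in the article, like the ZIP codes, are placeholders:

```python
orders = [
    {"order_id": "ORD-1001", "total": 129.99, "status": "shipped",
     "customer": {"name": "Maria Chen", "email": "maria@example.com",
                  "address": {"city": "Seattle", "state": "WA", "zip": "98101"}},
     "items": [{"sku": "WIDGET-A", "qty": 2, "price": 29.99},
               {"sku": "GADGET-B", "qty": 1, "price": 70.01}],
     "reviews": [{"rating": 5, "text": "Fast shipping", "verified": True}]},
    {"order_id": "ORD-1002", "total": 45.00, "status": "delivered",
     "customer": {"name": "James Okafor", "email": "james@example.com",
                  "address": {"city": "Austin", "state": "TX", "zip": "73301"}},
     "items": [{"sku": "WIDGET-A", "qty": 1, "price": 29.99},
               {"sku": "CABLE-C", "qty": 3, "price": 5.00}],
     "reviews": []},  # zero reviews -- the missing-data case
    {"order_id": "ORD-1003", "total": 210.50, "status": "processing",
     "customer": {"name": "Priya Sharma", "email": "priya@example.com",
                  "address": {"city": "Chicago", "state": "IL", "zip": "60601"}},
     "items": [{"sku": "GADGET-B", "qty": 3, "price": 70.01}],
     "reviews": [{"rating": 4, "text": "Good quality", "verified": True},
                 {"rating": 3, "text": "Arrived late", "verified": False}]},
]

print(f"Orders: {len(orders)}")
print(f"Total items across all orders: {sum(len(o['items']) for o in orders)}")
print(f"Total reviews across all orders: {sum(len(o['reviews']) for o in orders)}")
```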

Expected output:

```
Orders: 3
Total items across all orders: 5
Total reviews across all orders: 3
```

This dataset covers the patterns you'll hit in production: nested dicts (customer, address), variable-length arrays (items, reviews), and missing data (ORD-1002 has zero reviews).

Flattening nested dictionaries with pd.json_normalize

pd.json_normalize() is the primary tool for converting hierarchical JSON into flat DataFrames. It recursively unpacks nested dictionaries and joins key names with a dot separator by default. The function lives at the top level of pandas, so you call pd.json_normalize(), not pd.io.json.json_normalize() (the old import path was deprecated in pandas 1.0 and removed in pandas 2.0).
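A sketch of the basic call, using an abbreviated one-order version of the running dataset (the ZIP code is a placeholder):

```python
import pandas as pd

orders = [
    {"order_id": "ORD-1001", "total": 129.99, "status": "shipped",
     "customer": {"name": "Maria Chen", "email": "maria@example.com",
                  "address": {"city": "Seattle", "state": "WA", "zip": "98101"}},
     "items": [{"sku": "WIDGET-A", "qty": 2, "price": 29.99}],
     "reviews": [{"rating": 5, "text": "Fast shipping", "verified": True}]},
]

# Nested dicts become dot-separated columns; lists stay as lists
df = pd.json_normalize(orders)
print(list(df.columns))
```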

Expected output:

```
['order_id', 'total', 'status', 'items', 'reviews', 'customer.name', 'customer.email', 'customer.address.city', 'customer.address.state', 'customer.address.zip']
```

Every nested dictionary (customer, customer.address) was unpacked into dot-separated column names. The items and reviews columns stayed as lists because json_normalize only auto-flattens dictionaries, not lists. That distinction matters: dicts represent attributes of the same entity (a customer's properties), while lists represent one-to-many relationships (an order's items).

Pro Tip: The sep parameter changes the delimiter between nested keys. For SQL-compatible column names, pass sep='_' to produce customer_address_city instead of customer.address.city. In pandas 3.0, when you pass a Series to json_normalize(), it now retains the original Series index instead of resetting to a RangeIndex, a behavior change from pandas 2.x worth noting in production code.

The complete json_normalize parameter reference

Here's every parameter the function accepts, as of pandas 3.0 documentation:

| Parameter | Type | Default | Purpose |
|---|---|---|---|
| data | dict, list of dicts, or Series | required | The JSON data to flatten |
| record_path | str or list of str | None | Path to the nested list to expand into rows |
| meta | list of str or list of lists | None | Parent fields to carry into child rows |
| meta_prefix | str | None | Prefix for meta column names |
| record_prefix | str | None | Prefix for record column names |
| errors | 'raise' or 'ignore' | 'raise' | How to handle missing meta keys |
| sep | str | '.' | Separator for nested key names |
| max_level | int or None | None | Maximum nesting depth to flatten |

Expanding nested lists with record_path

Lists inside JSON records represent one-to-many relationships. One order has many items. One customer has many reviews. These can't be flattened into a single row without either duplicating parent data or losing child records.

The record_path parameter tells json_normalize() which list to expand. Each element in that list becomes its own row:

[Figure: How record_path and meta expand nested JSON lists into DataFrame rows]
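A sketch against an abbreviated version of the running dataset (only the fields record_path needs):

```python
import pandas as pd

orders = [
    {"order_id": "ORD-1001", "items": [
        {"sku": "WIDGET-A", "qty": 2, "price": 29.99},
        {"sku": "GADGET-B", "qty": 1, "price": 70.01}]},
    {"order_id": "ORD-1002", "items": [
        {"sku": "WIDGET-A", "qty": 1, "price": 29.99},
        {"sku": "CABLE-C", "qty": 3, "price": 5.00}]},
    {"order_id": "ORD-1003", "items": [
        {"sku": "GADGET-B", "qty": 3, "price": 70.01}]},
]

# One output row per element of each order's "items" list
df_items = pd.json_normalize(orders, record_path="items")
print(df_items)
```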

Expected output:

```
        sku  qty  price
0  WIDGET-A    2  29.99
1  GADGET-B    1  70.01
2  WIDGET-A    1  29.99
3   CABLE-C    3   5.00
4  GADGET-B    3  70.01
```

Five items across three orders, each on its own row. But there's a problem: we've lost all order context. Which item belongs to which order? That's what meta fixes.

Preserving parent context with meta

The meta parameter specifies which fields from the parent record should be copied into every child row. Think of it as an automatic left join between the parent and child data:
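A sketch using an abbreviated version of the dataset (only the fields that meta pulls in):

```python
import pandas as pd

orders = [
    {"order_id": "ORD-1001", "status": "shipped", "total": 129.99,
     "customer": {"name": "Maria Chen"},
     "items": [{"sku": "WIDGET-A", "qty": 2, "price": 29.99},
               {"sku": "GADGET-B", "qty": 1, "price": 70.01}]},
    {"order_id": "ORD-1002", "status": "delivered", "total": 45.0,
     "customer": {"name": "James Okafor"},
     "items": [{"sku": "WIDGET-A", "qty": 1, "price": 29.99},
               {"sku": "CABLE-C", "qty": 3, "price": 5.00}]},
    {"order_id": "ORD-1003", "status": "processing", "total": 210.5,
     "customer": {"name": "Priya Sharma"},
     "items": [{"sku": "GADGET-B", "qty": 3, "price": 70.01}]},
]

# Copy parent-level fields into every child (item) row
df = pd.json_normalize(
    orders,
    record_path="items",
    meta=["order_id", "status", "total", ["customer", "name"]],
)
print(df)
```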

Expected output:

```
        sku  qty  price  order_id      status   total customer.name
0  WIDGET-A    2  29.99  ORD-1001     shipped  129.99    Maria Chen
1  GADGET-B    1  70.01  ORD-1001     shipped  129.99    Maria Chen
2  WIDGET-A    1  29.99  ORD-1002   delivered    45.0  James Okafor
3   CABLE-C    3   5.00  ORD-1002   delivered    45.0  James Okafor
4  GADGET-B    3  70.01  ORD-1003  processing   210.5  Priya Sharma
```

The meta list accepts two syntax patterns: top-level scalar fields like order_id go in as plain strings, while nested fields like customer.name go in as a list of keys: ["customer", "name"].

The record_prefix parameter prevents column name collisions when record fields share names with meta fields:
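A single-order sketch:

```python
import pandas as pd

orders = [{"order_id": "ORD-1001",
           "items": [{"sku": "WIDGET-A", "qty": 2, "price": 29.99}]}]

# Prefix record columns so they can't collide with meta columns
df = pd.json_normalize(orders, record_path="items",
                       meta=["order_id"], record_prefix="item_")
print(list(df.columns))
```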

Expected output:

```
['item_sku', 'item_qty', 'item_price', 'order_id']
```

Controlling depth with max_level

Production APIs often nest data five or six levels deep. Flattening everything produces dozens of sparse columns, many of which you don't need. The max_level parameter limits recursion depth:
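A sketch comparing the two depths on a one-order version of the dataset (placeholder ZIP code):

```python
import pandas as pd

orders = [
    {"order_id": "ORD-1001", "total": 129.99, "status": "shipped",
     "customer": {"name": "Maria Chen", "email": "maria@example.com",
                  "address": {"city": "Seattle", "state": "WA", "zip": "98101"}},
     "items": [], "reviews": []},
]

full = pd.json_normalize(orders)               # flatten every dict level
shallow = pd.json_normalize(orders, max_level=1)  # stop one level down

print("All levels:", list(full.columns))
print("Max level 1:", list(shallow.columns))
```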

Expected output:

```
All levels: ['order_id', 'total', 'status', 'items', 'reviews', 'customer.name', 'customer.email', 'customer.address.city', 'customer.address.state', 'customer.address.zip']
Max level 1: ['order_id', 'total', 'status', 'items', 'reviews', 'customer.name', 'customer.email', 'customer.address']
```

With max_level=1, the customer.address column stays as a dictionary instead of being split into three columns. This is useful when you only care about top-level customer info and want to avoid 20 address-related columns you'll never touch.

Common Pitfall: max_level only controls dict flattening depth. It has zero effect on lists. If you have a list at level 1, it stays as a list regardless of max_level. You still need record_path or .explode() for lists.

Handling missing and inconsistent keys

APIs are rarely consistent. One record might include a loyalty_tier field while another omits it. One customer might have a phone while another doesn't. By default, pd.json_normalize() raises a KeyError when a meta field is missing from any record.

The errors parameter controls this:
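A sketch with a small two-order dataset in which only the first customer has a loyalty_tier key:

```python
import pandas as pd

orders = [
    {"order_id": "ORD-2001",
     "customer": {"name": "Alice", "loyalty_tier": "gold"},
     "items": [{"sku": "X1", "qty": 1, "price": 10.0}]},
    {"order_id": "ORD-2002",
     "customer": {"name": "Bob"},  # no loyalty_tier key
     "items": [{"sku": "X2", "qty": 2, "price": 20.0}]},
]

df = pd.json_normalize(
    orders,
    record_path="items",
    meta=["order_id", ["customer", "name"], ["customer", "loyalty_tier"]],
    errors="ignore",  # missing meta keys become NaN instead of raising
)
print(df)
print("\nMissing values per column:")
print(df.isna().sum())
```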

Expected output:

```
  sku  qty  price  order_id customer.name customer.loyalty_tier
0  X1    1   10.0  ORD-2001         Alice                  gold
1  X2    2   20.0  ORD-2002           Bob                   NaN

Missing values per column:
sku                      0
qty                      0
price                    0
order_id                 0
customer.name            0
customer.loyalty_tier    1
dtype: int64
```

With errors='ignore', missing metadata fields become NaN instead of crashing your pipeline. This is non-negotiable for production ETL where API schemas aren't guaranteed to be uniform.

Warning: errors='ignore' silently swallows missing data. Always follow it with df.isna().sum() to catch unexpected nulls. A field that's NaN in 95% of records might indicate a schema change upstream that you should investigate, not just ignore.

Using explode for list columns already in a DataFrame

Sometimes you receive data that's already partially normalized. Maybe you loaded it from a database column that stores JSON arrays, or you ran json_normalize on the top level but still have list columns left over. The .explode() method handles this case without re-running the full normalization:
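A sketch of the pattern; ignore_index=True and .tolist() are one way to keep the row indexes aligned through the concat:

```python
import pandas as pd

# A DataFrame that already has a list-of-dicts column
df = pd.DataFrame({
    "order_id": ["ORD-1001", "ORD-1002"],
    "items": [
        [{"sku": "WIDGET-A", "qty": 2}, {"sku": "GADGET-B", "qty": 1}],
        [{"sku": "CABLE-C", "qty": 3}],
    ],
})

# Step 1: one row per list element; each cell now holds a single dict
exploded = df.explode("items", ignore_index=True)
print("After explode:")
print(exploded)

# Step 2: unpack the dicts into proper columns and rejoin
flat = pd.concat(
    [exploded.drop(columns="items"),
     pd.json_normalize(exploded["items"].tolist())],
    axis=1,
)
print("\nFully flattened:")
print(flat)
```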

Expected output:

```
After explode:
   order_id                          items
0  ORD-1001  {'sku': 'WIDGET-A', 'qty': 2}
1  ORD-1001  {'sku': 'GADGET-B', 'qty': 1}
2  ORD-1002   {'sku': 'CABLE-C', 'qty': 3}

Fully flattened:
   order_id       sku  qty
0  ORD-1001  WIDGET-A    2
1  ORD-1001  GADGET-B    3
```

The explode() then json_normalize() pattern is useful when you can't go back to the raw JSON source, or when you only want to flatten one specific column without re-processing the entire dataset.

Pro Tip: Since pandas 1.3, .explode() supports multiple columns at once: df.explode(["items", "reviews"]). But the lists in each column must have the same length per row, which is rarely the case in practice. For independent lists, explode them separately and merge back.

Custom flattening for irregular structures

pd.json_normalize() works when the JSON structure is predictable. But real-world data often isn't. The same field might be a list in one record, a string in another, and absent in a third. When json_normalize can't handle the inconsistency, a custom flattening function gives you full control:
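Here's one possible flattener for the running dataset. The flatten_order helper and its exact output columns are an illustration, not a fixed recipe; the point is that every column, default, and aggregate is explicit:

```python
import pandas as pd

def flatten_order(order):
    """Flatten one raw order dict into a flat row, defaulting
    missing keys with .get() and computing aggregates inline."""
    customer = order.get("customer", {})
    address = customer.get("address", {})
    items = order.get("items", [])
    reviews = order.get("reviews", [])
    ratings = [r.get("rating") for r in reviews if r.get("rating") is not None]
    return {
        "order_id": order.get("order_id"),
        "total": order.get("total"),
        "status": order.get("status"),
        "customer_name": customer.get("name"),
        "customer_email": customer.get("email"),
        "city": address.get("city"),
        "state": address.get("state"),
        "item_count": len(items),
        "total_quantity": sum(i.get("qty", 0) for i in items),
        "review_count": len(reviews),
        "avg_rating": sum(ratings) / len(ratings) if ratings else None,
    }

orders = [
    {"order_id": "ORD-1001", "total": 129.99, "status": "shipped",
     "customer": {"name": "Maria Chen", "email": "maria@example.com",
                  "address": {"city": "Seattle", "state": "WA"}},
     "items": [{"sku": "WIDGET-A", "qty": 2, "price": 29.99},
               {"sku": "GADGET-B", "qty": 1, "price": 70.01}],
     "reviews": [{"rating": 5, "text": "Fast shipping", "verified": True}]},
    {"order_id": "ORD-1002", "total": 45.00, "status": "delivered",
     "customer": {"name": "James Okafor", "email": "james@example.com",
                  "address": {"city": "Austin", "state": "TX"}},
     "items": [{"sku": "WIDGET-A", "qty": 1, "price": 29.99},
               {"sku": "CABLE-C", "qty": 3, "price": 5.00}],
     "reviews": []},
    {"order_id": "ORD-1003", "total": 210.50, "status": "processing",
     "customer": {"name": "Priya Sharma", "email": "priya@example.com",
                  "address": {"city": "Chicago", "state": "IL"}},
     "items": [{"sku": "GADGET-B", "qty": 3, "price": 70.01}],
     "reviews": [{"rating": 4, "text": "Good quality", "verified": True},
                 {"rating": 3, "text": "Arrived late", "verified": False}]},
]

df = pd.DataFrame([flatten_order(o) for o in orders])
print(df.to_string(index=False))
```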

Expected output:

```
order_id  total     status customer_name    customer_email    city state  item_count  total_quantity  review_count  avg_rating
ORD-1001 129.99    shipped    Maria Chen maria@example.com Seattle    WA           2               3             1         5.0
ORD-1002  45.00  delivered  James Okafor james@example.com  Austin    TX           2               4             0         NaN
ORD-1003 210.50 processing  Priya Sharma priya@example.com Chicago    IL           1               3             2         3.5
```

This approach trades the convenience of json_normalize for explicit control over every output column. It's the right choice when you need computed fields (like avg_rating) or when the JSON schema varies too wildly for automatic normalization. The .get() method with default values prevents KeyError on missing keys.

Key Insight: Custom flattening isn't a fallback; it's often the better choice for analytics. You're building exactly the DataFrame you need for analysis rather than normalizing everything and then selecting, renaming, and aggregating afterward.

Combining multiple normalized DataFrames

Most analysis requires data from several nested levels at once. The pattern is to normalize each level separately, then merge on the shared key. This mirrors how relational databases handle one-to-many relationships: each level becomes its own table, and foreign keys tie them together.

[Figure: Normalize-then-merge pipeline for multi-level JSON flattening]
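A sketch of the two per-level normalizations, using an abbreviated version of the dataset:

```python
import pandas as pd

orders = [
    {"order_id": "ORD-1001", "total": 129.99,
     "customer": {"name": "Maria Chen"},
     "items": [{"sku": "WIDGET-A", "qty": 2, "price": 29.99},
               {"sku": "GADGET-B", "qty": 1, "price": 70.01}],
     "reviews": [{"rating": 5, "text": "Fast shipping", "verified": True}]},
    {"order_id": "ORD-1002", "total": 45.00,
     "customer": {"name": "James Okafor"},
     "items": [{"sku": "WIDGET-A", "qty": 1, "price": 29.99},
               {"sku": "CABLE-C", "qty": 3, "price": 5.00}],
     "reviews": []},
    {"order_id": "ORD-1003", "total": 210.50,
     "customer": {"name": "Priya Sharma"},
     "items": [{"sku": "GADGET-B", "qty": 3, "price": 70.01}],
     "reviews": [{"rating": 4, "text": "Good quality", "verified": True},
                 {"rating": 3, "text": "Arrived late", "verified": False}]},
]

# Level 1: one row per item, with order context carried along
df_items = pd.json_normalize(
    orders, record_path="items", record_prefix="item_",
    meta=["order_id", ["customer", "name"], "total"],
)
print("Items with order context:")
print(df_items[["order_id", "item_sku", "item_price", "customer.name", "total"]]
      .to_string(index=False))

# Level 2: one row per review; orders with no reviews produce no rows
df_reviews = pd.json_normalize(
    orders, record_path="reviews", record_prefix="review_", meta=["order_id"],
)
print("\nReviews:")
print(df_reviews.to_string(index=False))

# The shared order_id key lets you merge the levels as needed
```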

Expected output:

```
Items with order context:
order_id item_sku  item_price customer.name  total
ORD-1001 WIDGET-A       29.99    Maria Chen 129.99
ORD-1001 GADGET-B       70.01    Maria Chen 129.99
ORD-1002 WIDGET-A       29.99  James Okafor  45.00
ORD-1002  CABLE-C        5.00  James Okafor  45.00
ORD-1003 GADGET-B       70.01  Priya Sharma 210.50

Reviews:
 review_rating   review_text  review_verified order_id
             5 Fast shipping             True ORD-1001
             4  Good quality             True ORD-1003
             3  Arrived late            False ORD-1003
```

Notice that ORD-1002 (James Okafor) has no reviews, so it doesn't appear in df_reviews. If you need all orders including those with zero reviews, use a left join from orders to reviews.

Nested arrays inside nested arrays

Some APIs nest arrays inside arrays. An order contains items, and each item contains a list of applied discounts. This double-nesting requires a two-step approach: normalize the outer list, then explode() and normalize the inner list.
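A sketch of the cycle, using an assumed ORD-4001 record with a discounts list inside each item (the isinstance check maps the NaN produced by exploding an empty list to an empty dict so the inner normalize doesn't choke):

```python
import pandas as pd

orders = [{
    "order_id": "ORD-4001",
    "items": [
        {"sku": "WIDGET-A", "price": 29.99,
         "discounts": [{"code": "SAVE10", "amount": 3.0},
                       {"code": "LOYALTY", "amount": 1.5}]},
        {"sku": "GADGET-B", "price": 70.01, "discounts": []},
    ],
}]

# Step 1: one row per item; the inner 'discounts' list survives as a column
items = pd.json_normalize(orders, record_path="items", meta=["order_id"])
print("After step 1 (item level):")
print(items[["sku", "price", "order_id"]].to_string(index=False))

# Step 2: one row per discount (an empty list becomes a single NaN row)
exploded = items.explode("discounts", ignore_index=True)

# Step 3: unpack the discount dicts and rejoin
discounts = pd.json_normalize(
    exploded["discounts"].apply(lambda d: d if isinstance(d, dict) else {}).tolist()
)
flat = pd.concat([exploded.drop(columns="discounts"), discounts], axis=1)
print("\nAfter step 3 (fully flattened):")
print(flat[["order_id", "sku", "price", "code", "amount"]].to_string(index=False))
```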

Expected output:

```
After step 1 (item level):
     sku  price order_id
WIDGET-A  29.99 ORD-4001
GADGET-B  70.01 ORD-4001

After step 3 (fully flattened):
order_id      sku  price    code  amount
ORD-4001 WIDGET-A  29.99  SAVE10     3.0
ORD-4001 WIDGET-A  29.99 LOYALTY     1.5
ORD-4001 GADGET-B  70.01     NaN     NaN
```

The pattern generalizes: for N levels of nested lists, you repeat the normalize-explode-normalize cycle N-1 times.

Production patterns for large JSON files

JSON is verbose. Repeated key names in every record mean that a 500 MB CSV equivalent can balloon to 2-3 GB as JSON. Loading the entire file with json.load() allocates everything into memory at once. Here's how to handle scale.

Line-delimited JSON with chunked reading

For large files, use line-delimited JSON (one JSON object per line, file extension .jsonl) with pd.read_json():

```python
import pandas as pd

# Read a large .jsonl file in chunks of 10,000 records
chunk_iter = pd.read_json(
    "large_orders.jsonl",
    lines=True,
    chunksize=10_000
)

frames = []
for chunk in chunk_iter:
    # Each chunk is a DataFrame -- normalize, filter, transform
    normalized = pd.json_normalize(chunk.to_dict(orient="records"))
    frames.append(normalized)

df_all = pd.concat(frames, ignore_index=True)
```

The chunksize parameter returns an iterator instead of loading everything. Memory usage stays proportional to chunk size, not file size. For a 2 GB JSONL file on a 16 GB laptop, chunksize=10_000 typically keeps memory under 500 MB.

Faster parsing with orjson

Python's built-in json module is fast enough for casual use, but for high-throughput pipelines, orjson, a Rust-based JSON parser, is roughly 6x faster for loads() and 10x faster for dumps() compared to the standard library (benchmarks from orjson 3.10, December 2025):

```python
import orjson
import pandas as pd

# Parse JSON ~6x faster than json.loads()
with open("large_orders.jsonl", "rb") as f:
    records = [orjson.loads(line) for line in f]

df = pd.json_normalize(records)
```

Selective field extraction

When you need only a few fields from deeply nested structures, skip json_normalize entirely and extract directly. This avoids building the full column hierarchy:

```python
import json
import pandas as pd

results = []
with open("large_orders.jsonl", "r") as f:
    for line in f:
        record = json.loads(line)
        results.append({
            "order_id": record["order_id"],
            "customer_name": record["customer"]["name"],
            "item_count": len(record.get("items", []))
        })

df = pd.DataFrame(results)  # ~3-5x faster than normalize-then-select
```

Performance comparison

| Approach | Use Case | Relative Speed | Memory |
|---|---|---|---|
| pd.json_normalize() | Full schema needed | 1x (baseline) | High — builds all columns |
| Custom .get() extraction | Few fields needed | 3-5x faster | Low — only stores target fields |
| orjson + json_normalize | Full schema, big files | ~2-3x faster | Same as baseline |
| Chunked read_json() | Files > 1 GB | Streaming | Bounded by chunk size |
| Convert to Parquet first | Repeated analysis | 10-50x faster reads | Compact columnar format |

Pro Tip: If you're going to analyze the same JSON dataset more than once, convert it to Parquet on first load: df.to_parquet("orders.parquet"). Subsequent reads with pd.read_parquet() are 10-50x faster and use far less memory because Parquet is columnar and compressed.

When to flatten JSON (and when not to)

Not every nested JSON structure needs flattening. Here's the decision framework:

[Figure: Decision flowchart for choosing the right JSON flattening strategy]

When to flatten

  • You need to filter, group, or join on nested fields. If you're writing df[df["customer.address.state"] == "WA"], those fields need to be proper columns.
  • You're feeding data into scikit-learn, XGBoost, or any ML model. Models expect flat feature matrices. Nested dicts aren't features.
  • You're building dashboards or reports. BI tools work with flat tables, not nested structures.
  • You're doing aggregations across the nested level. Revenue per SKU, average rating per customer, count of items per order — all require flattened data.

When NOT to flatten

  • You're just passing data through to another API. If your pipeline receives JSON and sends it to another service unchanged, flattening and re-nesting wastes cycles.
  • The nested structure IS the feature. In NLP or graph problems, the hierarchical structure itself carries information. Flattening destroys it.
  • You're storing it for later and don't know the query patterns yet. Store raw JSON in a document database (MongoDB, DynamoDB) or as JSONB in PostgreSQL. Flatten on read, not on write.
  • The nesting is 8+ levels deep and you only need 2 fields. Manual .get() extraction is cleaner than normalizing 50 columns to use 2.

| Scenario | Approach | Why |
|---|---|---|
| ML feature engineering | Full json_normalize | Models need flat numeric/categorical arrays |
| Dashboard/reporting | Normalize + merge | BI tools expect tabular data |
| Quick EDA on a few fields | Custom .get() extract | Faster, less memory, no schema noise |
| ETL passthrough | Don't flatten | Unnecessary round-trip transformation |
| Store-then-query | Keep as JSON, flatten on read | Avoids premature schema commitment |

Conclusion

Flattening nested JSON into pandas DataFrames is fundamentally about choosing the right tool for the structure you're facing. pd.json_normalize() handles the majority of cases: nested dicts become dot-separated columns automatically, record_path expands one-to-many lists into rows, and meta preserves parent context. For inconsistent schemas, errors='ignore' prevents crashes while isna().sum() catches silent data loss.

The trickier cases (arrays nested inside arrays, schemas that vary record to record, computed aggregates) need the explode-then-normalize pattern or a custom flattening function. Neither approach is inherently better; they solve different problems. Production pipelines almost always combine both: json_normalize for the predictable 80% and custom functions for the irregular 20%.

Once your JSON is flat, the next step is making sure the flattening didn't introduce issues. Run a Data Profiling pass to catch unexpected nulls and type mismatches. If the flattened data has gaps from missing keys, the strategies in Missing Data Strategies will help you decide between dropping, imputing, or flagging. And when your JSON contains text fields like product reviews or descriptions, Mastering Text Preprocessing covers the full pipeline from raw strings to clean, analysis-ready features.

Frequently Asked Interview Questions

Q: What's the difference between pd.json_normalize() and pd.DataFrame() when working with nested JSON?

pd.DataFrame() creates a DataFrame where nested dictionaries and lists remain as opaque Python objects inside cells, so you can't filter or aggregate on them. pd.json_normalize() recursively unpacks nested dictionaries into dot-separated column names, turning {"customer": {"name": "Alice"}} into a customer.name column with the scalar value "Alice". Use DataFrame() for already-flat data; use json_normalize() for anything with nesting.

Q: How do you handle one-to-many relationships in JSON when flattening to a DataFrame?

Use the record_path parameter to tell json_normalize which list to expand into rows. Each list element becomes its own row. Combine it with the meta parameter to carry parent-level fields (like order_id) into every child row. This is equivalent to an automatic left join between the parent record and its child list.

Q: Your JSON API returns records where some have a "phone" field and others don't. json_normalize crashes. How do you fix it?

Set errors='ignore' in json_normalize(). This fills missing metadata fields with NaN instead of raising a KeyError. After normalization, always check df.isna().sum() to quantify how much data is missing per column. If a field is missing in most records, it may indicate a schema version change you should investigate.

Q: When would you write a custom flattening function instead of using json_normalize?

Three situations: (1) the JSON schema varies significantly between records (a field is a string in one record, a list in another), (2) you need computed/aggregated fields like average rating or item count rather than raw values, or (3) you only need 3-4 fields from a deeply nested 50-column structure and want to avoid the overhead of full normalization. Custom functions using .get() with defaults are also more explicit and easier to debug in production.

Q: How do you flatten arrays nested inside arrays? For example, orders containing items, and each item containing a list of discounts.

Use a two-step process: first call json_normalize with record_path="items" and meta=["order_id"] to get one row per item. Then call .explode("discounts") on the resulting DataFrame to expand the inner list, followed by pd.json_normalize() on the exploded discount dictionaries to split them into proper columns. This pattern generalizes to any depth of list nesting.

Q: A 3 GB JSON file crashes your kernel when you try to load it. What's your approach?

Three options depending on context. First, convert to line-delimited JSON (.jsonl, one object per line) and use pd.read_json("file.jsonl", lines=True, chunksize=10_000) to stream it in bounded-memory chunks. Second, if you control the pipeline, switch to Parquet for repeated analysis: it's columnar, compressed, and 10-50x faster to read. Third, if you only need a few fields, read line-by-line with json.loads() and extract directly into a list of flat dicts, which avoids building the full nested structure in memory.

Q: What's the difference between explode() and record_path in json_normalize?

record_path works on raw JSON data (lists of dicts) during the initial normalization step. explode() works on a DataFrame column that already contains lists. If you still have the raw JSON, record_path with meta is cleaner because it handles the parent-child join in one call. If you already have a DataFrame with list columns (from a database query or partial normalization), explode() is the right tool.

Q: How would you handle a JSON field that's sometimes a list and sometimes a single value?

Standardize it before normalization. Wrap single values in a list: field = field if isinstance(field, list) else [field]. This ensures consistent structure so json_normalize or .explode() won't break. In a custom flatten function, handle this with a type check: items = order.get("items", []) followed by if isinstance(items, dict): items = [items].

Explore all career paths