In the idealo entity matching project, we recently ran into a problem with slow JSON parsing. An intermediate pipeline step produces results with the following schema:
from typing import Sequence

from pydantic import BaseModel

class Candidate(BaseModel):
    offer_key: OfferKey
    score: float

class AnchoredCandidates(BaseModel):
    anchor: OfferKey
    candidates: Sequence[Candidate]
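To make the data layout concrete, here is a minimal sketch of what one serialized line might look like, assuming (hypothetically) that `OfferKey` serializes to a plain string; the values are made up for illustration:

```python
import json

# One record: an anchor with 1000 scored candidates (hypothetical values)
record = {
    "anchor": "offer:123",
    "candidates": [
        {"offer_key": f"offer:{i}", "score": 1.0 - i / 1000}
        for i in range(1000)
    ],
}

# jsonlines stores one such JSON document per line
line = json.dumps(record)
parsed = json.loads(line)
```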
We serialized these results with jsonlines, storing 1000 candidates per anchor. Downstream, these results are used to mine hard examples for other models in the pipeline. But naively loading them like this
with jsonlines.open(blocker_results_path, loads=orjson.loads) as reader:
    for block in tqdm(reader, "Loading blocker results"):
        # process block
        ...
leads to mediocre performance: we can only read about 500 results/sec this way. On our dataset, that works out to roughly 40 minutes of loading time.
We could speed this up with two tweaks:

1. Better parsing of the results
2. Using a faster JSON parser
Better parsing of results
For our purposes it is actually enough to process only the top 20 candidates. But the code above loads all 1000 candidates and parses them into the schema. Using raw JSON instead,
with jsonlines.open(blocker_results_path) as reader:
    for block in tqdm(reader, "Loading blocker results"):
        anchor = block["anchor"]
        candidates = block["candidates"]
        # process raw json...
gives about 2000 results/sec.
But this is error-prone, since we are no longer validating the fields. Instead, we can adapt the schema slightly to load the results lazily:
  class AnchoredCandidates(BaseModel):
      anchor: OfferKey
-     candidates: Sequence[Candidate]
+     candidates: Iterable[Candidate]

      class Config:
          arbitrary_types_allowed = True

+     @validator("candidates", pre=True, always=True)
+     @classmethod
+     def validate_candidates_lazily(cls, v: Any) -> Iterator[Candidate]:
+         if isinstance(v, Iterable):
+             return (Candidate(**item) for item in v)
+         raise ValueError("Candidates must be an iterable")
Then parsing the blocks as follows gives the same speed-up:
with jsonlines.open(blocker_results_path) as reader:
    for block in tqdm(reader, "Loading blocker results"):
        block = AnchoredCandidates(**block)
        candidates = itertools.islice(block.candidates, topk)  # only parse topk=20 results
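The effect of the lazy validator can be illustrated without pydantic: a generator only constructs the candidate objects that are actually consumed, so `islice` stops the parsing work after `topk` items. A stdlib-only sketch (a dataclass stands in for the pydantic model, and a counter tracks how many candidates were actually built):

```python
import itertools
from dataclasses import dataclass

parsed_count = 0  # tracks how many Candidate objects were constructed

@dataclass
class Candidate:
    offer_key: str
    score: float

def lazy_candidates(items):
    # Generator: each Candidate is built only when the consumer asks for it,
    # mirroring the generator returned by the lazy validator above.
    global parsed_count
    for item in items:
        parsed_count += 1
        yield Candidate(**item)

raw = [{"offer_key": f"offer:{i}", "score": 0.5} for i in range(1000)]
topk = 20
top = list(itertools.islice(lazy_candidates(raw), topk))
# Only topk of the 1000 raw candidates were ever constructed.
```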
Using orjson to speed up JSON parsing
Now, the main bottleneck is parsing the raw JSON itself. We can use orjson for faster parsing:
- with jsonlines.open(blocker_results_path) as reader:
+ with jsonlines.open(blocker_results_path, loads=orjson.loads) as reader:
This gives a speed-up of about a factor of two: 2000 results/sec -> 4200 results/sec.
Note
jsonlines will use orjson by default if it is installed. So if you want to benchmark this, you have to use loads=json.loads for the baseline comparison.
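A minimal way to set up such a measurement is sketched below, using only the stdlib; swap `json.loads` for `orjson.loads` (if installed) to reproduce the comparison. The record shape and numbers here are illustrative, not the measurements from this post:

```python
import json
import timeit

# A synthetic line matching the schema: one anchor, 1000 candidates
line = json.dumps({
    "anchor": "offer:123",
    "candidates": [{"offer_key": f"offer:{i}", "score": 0.5} for i in range(1000)],
})

# Time 100 parses of the same line; replace json.loads with orjson.loads
# to benchmark the faster parser.
seconds = timeit.timeit(lambda: json.loads(line), number=100)
results_per_sec = 100 / seconds
```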
Final benchmarks
Combining these two tweaks, we get the following table: