Rapid JSON Loading


Konrad Schultka (PhD)


In the idealo entity matching project, we recently ran into a problem with slow JSON parsing. An intermediate pipeline step produces results with the following schema:

from typing import Sequence

from pydantic import BaseModel

# OfferKey is the project-specific offer identifier type
class Candidate(BaseModel):
    offer_key: OfferKey
    score: float

class AnchoredCandidates(BaseModel):
    anchor: OfferKey
    candidates: Sequence[Candidate]

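On disk, each line of the results file holds one serialized AnchoredCandidates block. A minimal writer sketch (write_blocker_results is a hypothetical helper; it assumes pydantic v1's .dict() yields JSON-serializable values, and the actual pipeline code may differ):

import jsonlines

def write_blocker_results(path, blocks) -> None:
    # One JSON object per line, e.g.
    # {"anchor": ..., "candidates": [{"offer_key": ..., "score": 0.93}, ...]}
    with jsonlines.open(path, mode="w") as writer:
        for block in blocks:  # block: AnchoredCandidates
            writer.write(block.dict())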
We serialized these results using jsonlines, with 1000 candidates per anchor; they were used to mine hard examples for other models in the pipeline. But naively loading these results like this

import jsonlines, orjson
from tqdm import tqdm

with jsonlines.open(blocker_results_path, loads=orjson.loads) as reader:
    for block in tqdm(reader, "Loading blocker results"):
        block = AnchoredCandidates(**block)  # parses all 1000 candidates eagerly

leads to mediocre performance: we can only read about 500 results/sec this way, which adds up to roughly 40 minutes of loading time on our dataset.

We could speed this up with two tweaks:

  • Better parsing of the results

  • Using a faster json parser


Better parsing of results


For our purposes it is actually enough to process only the top 20 candidates, but the code above loads all 1000 results and parses them into the schema. Using raw JSON instead,

with jsonlines.open(blocker_results_path) as reader:
    for block in tqdm(reader, "Loading blocker results"):
        anchor = block["anchor"]
        candidates = block["candidates"]
        # process raw json... 

gives about 2000 results/sec.

But this is error-prone, since we are no longer validating the fields. Instead, we can adapt the schema slightly to load the results lazily:

class AnchoredCandidates(BaseModel):
    anchor: OfferKey
-    candidates: Sequence[Candidate]
+    candidates: Iterable[Candidate]

    class Config:
        arbitrary_types_allowed = True

+    @validator("candidates", pre=True, always=True)
+    @classmethod
+    def validate_candidates_lazily(cls, v: Any) -> Iterator[Candidate]:
+        if isinstance(v, Iterable):
+            return (Candidate(**item) for item in v)
+        raise ValueError("Candidates must be an iterable")

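Put together, the adapted model reads as follows (a sketch assuming pydantic v1; Candidate and OfferKey are defined as above):

from typing import Any, Iterable, Iterator

from pydantic import BaseModel, validator

class AnchoredCandidates(BaseModel):
    anchor: OfferKey
    candidates: Iterable[Candidate]

    class Config:
        # allow field types pydantic does not natively validate
        arbitrary_types_allowed = True

    @validator("candidates", pre=True, always=True)
    @classmethod
    def validate_candidates_lazily(cls, v: Any) -> Iterator[Candidate]:
        # Wrap the raw dicts in a generator; each Candidate is only
        # constructed and validated when the consumer iterates over it.
        if isinstance(v, Iterable):
            return (Candidate(**item) for item in v)
        raise ValueError("Candidates must be an iterable")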
Then parsing the blocks as follows gives the same speed-up (about 2000 results/sec):

import itertools

topk = 20
with jsonlines.open(blocker_results_path) as reader:
    for block in tqdm(reader, "Loading blocker results"):
        block = AnchoredCandidates(**block)
        candidates = itertools.islice(block.candidates, topk)  # Only parse the topk=20 candidates
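Since block.candidates is now a generator, the Candidate objects (and any validation errors) only materialize once the slice is actually consumed, for example:

top_candidates = list(candidates)  # the 20 Candidate objects are built and validated here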


Using orjson to speed up JSON parsing


Now the main bottleneck is parsing the raw JSON itself. We can use orjson to speed this up:

-with jsonlines.open(blocker_results_path) as reader:
+with jsonlines.open(blocker_results_path, loads=orjson.loads) as reader:

This gives another speed-up of about a factor of two: from 2000 results/sec to 4200 results/sec.


Note


jsonlines will use orjson by default if it is installed. So if you want to benchmark this, you have to use loads=json.loads for the baseline comparison.
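To reproduce such a comparison, a rough throughput measurement along these lines can be used (measure_throughput is a hypothetical helper; blocker_results_path and AnchoredCandidates are as above):

import json
import time

import jsonlines
import orjson

def measure_throughput(loads) -> float:
    # Blocks parsed per second for a given JSON decoder.
    n = 0
    start = time.perf_counter()
    with jsonlines.open(blocker_results_path, loads=loads) as reader:
        for block in reader:
            AnchoredCandidates(**block)
            n += 1
    return n / (time.perf_counter() - start)

print("stdlib json:", measure_throughput(json.loads))
print("orjson:     ", measure_throughput(orjson.loads))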


Final benchmarks


Combining these two tweaks, we get the following table:

                        results/second
baseline                           500
lazy parsing                      2000
orjson + lazy parsing             4200