In the idealo entity matching project, we recently ran into a problem with slow JSON parsing. An intermediate pipeline step produces results with the following schema:
from typing import Sequence

from pydantic import BaseModel

class Candidate(BaseModel):
    offer_key: OfferKey
    score: float

class AnchoredCandidates(BaseModel):
    anchor: OfferKey
    candidates: Sequence[Candidate]
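To make the data layout concrete, here is a minimal sketch of what one serialized line might look like, assuming (hypothetically) that `OfferKey` serializes to a plain string; the values are made up for illustration:

```python
import json

# One record: an anchor with 1000 scored candidates (hypothetical values)
record = {
    "anchor": "offer:123",
    "candidates": [
        {"offer_key": f"offer:{i}", "score": 1.0 - i / 1000}
        for i in range(1000)
    ],
}

# jsonlines stores one such JSON document per line
line = json.dumps(record)
parsed = json.loads(line)
```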
We serialized these results with jsonlines, storing 1000 candidates per anchor. Downstream, these results are used to mine hard examples for other models in the pipeline. But naively loading them like this
with jsonlines.open(blocker_results_path, loads=orjson.loads) as reader:
    for block in tqdm(reader, "Loading blocker results"):
        # process block
        ...
leads to mediocre performance: we can only read about 500 results/sec this way. On our dataset, that works out to roughly 40 minutes of loading time.
We could speed this up with two tweaks:

1. Better parsing of the results
2. Using a faster JSON parser
Better parsing of results
For our purposes it is actually enough to process only the top 20 candidates. But the code above loads all 1000 candidates and parses them into the schema. Using raw JSON instead,
with jsonlines.open(blocker_results_path) as reader:
    for block in tqdm(reader, "Loading blocker results"):
        anchor = block["anchor"]
        candidates = block["candidates"]
        # process raw json...
gives about 2000 results/sec.
But this is error-prone, since we are no longer validating the fields. Instead, we can adapt the schema slightly to load the results lazily:
  class AnchoredCandidates(BaseModel):
      anchor: OfferKey
-     candidates: Sequence[Candidate]
+     candidates: Iterable[Candidate]

      class Config:
          arbitrary_types_allowed = True

+     @validator("candidates", pre=True, always=True)
+     @classmethod
+     def validate_candidates_lazily(cls, v: Any) -> Iterator[Candidate]:
+         if isinstance(v, Iterable):
+             return (Candidate(**item) for item in v)
+         raise ValueError("Candidates must be an iterable")
Then parsing the blocks as follows gives the same speed-up:
with jsonlines.open(blocker_results_path) as reader:
    for block in tqdm(reader, "Loading blocker results"):
        block = AnchoredCandidates(**block)
        candidates = itertools.islice(block.candidates, topk)  # only parse topk=20 results
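The effect of the lazy validator can be illustrated without pydantic: a generator only constructs the candidate objects that are actually consumed, so `islice` stops the parsing work after `topk` items. A stdlib-only sketch (a dataclass stands in for the pydantic model, and a counter tracks how many candidates were actually built):

```python
import itertools
from dataclasses import dataclass

parsed_count = 0  # tracks how many Candidate objects were constructed

@dataclass
class Candidate:
    offer_key: str
    score: float

def lazy_candidates(items):
    # Generator: each Candidate is built only when the consumer asks for it,
    # mirroring the generator returned by the lazy validator above.
    global parsed_count
    for item in items:
        parsed_count += 1
        yield Candidate(**item)

raw = [{"offer_key": f"offer:{i}", "score": 0.5} for i in range(1000)]
topk = 20
top = list(itertools.islice(lazy_candidates(raw), topk))
# Only topk of the 1000 raw candidates were ever constructed.
```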
Using orjson to speed up JSON parsing
Now, the main bottleneck is parsing the raw JSON itself. We can use orjson for faster parsing:
- with jsonlines.open(blocker_results_path) as reader:
+ with jsonlines.open(blocker_results_path, loads=orjson.loads) as reader:
This gives a speed-up of about a factor of two: 2000 results/sec -> 4200 results/sec.
Note
jsonlines will use orjson by default if it is installed. So if you want to benchmark this, you have to use loads=json.loads for the baseline comparison.
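A minimal way to set up such a measurement is sketched below, using only the stdlib; swap `json.loads` for `orjson.loads` (if installed) to reproduce the comparison. The record shape and numbers here are illustrative, not the measurements from this post:

```python
import json
import timeit

# A synthetic line matching the schema: one anchor, 1000 candidates
line = json.dumps({
    "anchor": "offer:123",
    "candidates": [{"offer_key": f"offer:{i}", "score": 0.5} for i in range(1000)],
})

# Time 100 parses of the same line; replace json.loads with orjson.loads
# to benchmark the faster parser.
seconds = timeit.timeit(lambda: json.loads(line), number=100)
results_per_sec = 100 / seconds
```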
Final benchmarks
Combining these two tweaks, we get the following table: