Benchmark

These benchmarks show the time and memory used by versus.compare() with the default materialize="all" setting. The data comes from the weather table in the Python nycflights13 package. For each size, rows are sampled with replacement from the original table, and each side of the comparison keeps 95% of the sampled rows. In 4 of the 15 columns (temp, dewp, humid, wind_dir), 5% of values differ between the two inputs. For the parquet scan case, the sampled tables are written to parquet files before compare() runs.
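The data preparation described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the script used for the published numbers: the helper name make_benchmark_inputs, the row_id key column, the toy base table, and the choice to perturb values on one side only are all hypothetical. It also assumes that each side drops its 5% of rows independently, which the description does not spell out.

```python
import numpy as np
import pandas as pd


def make_benchmark_inputs(base, n_rows, changed_cols,
                          keep_frac=0.95, diff_frac=0.05, seed=0):
    """Build two comparison inputs from a base table (hypothetical helper)."""
    rng = np.random.default_rng(seed)

    # Sample n_rows with replacement, then add a unique key (assumed name).
    picks = rng.integers(0, len(base), size=n_rows)
    sampled = base.iloc[picks].reset_index(drop=True)
    sampled.insert(0, "row_id", np.arange(n_rows))

    # Each side independently keeps keep_frac of the sampled rows (assumption).
    n_keep = int(round(n_rows * keep_frac))
    idx_a = np.sort(rng.choice(n_rows, size=n_keep, replace=False))
    idx_b = np.sort(rng.choice(n_rows, size=n_keep, replace=False))
    table_a = sampled.iloc[idx_a].reset_index(drop=True).copy()
    table_b = sampled.iloc[idx_b].reset_index(drop=True).copy()

    # Change diff_frac of the values in each listed column on side b only.
    n_diff = int(round(len(table_b) * diff_frac))
    for col in changed_cols:
        rows = rng.choice(len(table_b), size=n_diff, replace=False)
        pos = table_b.columns.get_loc(col)
        table_b.iloc[rows, pos] = table_b.iloc[rows, pos].to_numpy() + 1.0

    return table_a, table_b


# Toy stand-in for the weather table (float columns only, made-up values).
base = pd.DataFrame({
    "temp": np.linspace(30.0, 90.0, 100),
    "dewp": np.linspace(20.0, 70.0, 100),
    "humid": np.linspace(10.0, 100.0, 100),
    "wind_dir": np.linspace(0.0, 360.0, 100),
    "pressure": np.linspace(990.0, 1030.0, 100),
})
table_a, table_b = make_benchmark_inputs(
    base, 10_000, ["temp", "dewp", "humid", "wind_dir"]
)
```

For the parquet scan case, the two frames would additionally be written to disk (for example with DataFrame.to_parquet) before the comparison runs.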

Row sizes: 250k, 1M, 2M, 5M, 10M, 20M.

Benchmarks were run on a 2020 13-inch MacBook Pro (2.3 GHz quad-core Intel Core i7, 32 GB RAM).

Methods:

  • Pandas DataFrames: versus.compare() on pandas DataFrames held in memory.

  • DuckDB parquet scan: versus.compare() on DuckDB relations backed by parquet files.
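The page does not say how time and memory were recorded. As one possible approach, a small harness built on the standard library could wrap each call; the measure helper below is hypothetical and is not part of versus. Note that tracemalloc only traces memory allocated through Python's allocator, so native allocations made elsewhere, such as those inside DuckDB, would likely not be counted; a process-level peak memory tool would be needed for those.

```python
import time
import tracemalloc


def measure(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed seconds, peak traced bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
    finally:
        # Record timing and peak memory even if fn raises.
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, elapsed, peak


# Example with a stand-in workload; a real run would pass the comparison call.
result, seconds, peak_bytes = measure(sorted, range(100_000, 0, -1))
```

Running each method and input size through the same wrapper keeps the measurements comparable across the two methods.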

Hover over a point to see the exact value.