adding in benchmark docs (#309)
* adding in benchmark docs

* Update docs/source/benchmark.rst

Co-authored-by: Jacob Dawang <[email protected]>

---------

Co-authored-by: Jacob Dawang <[email protected]>
fdosani and jdawang authored Jun 4, 2024
1 parent 1fd8ac1 commit 5861594
Showing 3 changed files with 71 additions and 0 deletions.
70 changes: 70 additions & 0 deletions docs/source/benchmark.rst
@@ -0,0 +1,70 @@
Benchmarks
==========

.. important::

    - Minimal tuning was done
    - Benchmarking is hard (and biased)
    - Take with a grain of salt


We've put together some benchmarks to help users understand which DataFrame implementation to use when.
Please take the results with a grain of salt! If you have ideas on how we can further improve
performance, please open an issue; we always welcome contributions.
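
The operation being timed is a ``datacompy`` comparison of a base and a compare DataFrame. A minimal
pandas sketch of that operation (with illustrative column names, not the benchmark's actual schema)
looks like this:

.. code-block:: python

    # A minimal pandas example of the kind of comparison being benchmarked.
    # The schema here is illustrative, not the exact one used in the runs.
    import datacompy
    import pandas as pd

    base = pd.DataFrame({"id": [1, 2, 3], "value": [1.0, 2.0, 3.0]})
    compare = pd.DataFrame({"id": [1, 2, 3], "value": [1.0, 2.5, 3.0]})

    comparison = datacompy.Compare(base, compare, join_columns="id")
    print(comparison.report())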


Setup used
----------

Single Machine:

- 16 CPUs
- 64GB RAM

Distributed Spark:

- 20 Executors
- 8 Cores
- 32GB RAM

The Data
---------

The data (base and compare) we generated was purely synthetic, consisting of 10 columns (a sketch of
how such data could be generated is shown after the list):

- 1 id (monotonically increasing) column used for joining
- 3 string columns
- 6 numeric columns
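
The exact generation script isn't included here, but a rough sketch of how base/compare data with
this shape could be produced (column names and value distributions are assumptions) is:

.. code-block:: python

    # A rough sketch of generating synthetic data with the shape described above.
    # Column names and value distributions are assumptions, not the actual script.
    import numpy as np
    import pandas as pd

    def make_frame(n_rows: int, seed: int = 0) -> pd.DataFrame:
        rng = np.random.default_rng(seed)
        data = {"id": np.arange(n_rows)}  # monotonically increasing join key
        for i in range(3):
            data[f"str_{i}"] = rng.choice(["a", "b", "c"], size=n_rows)
        for i in range(6):
            data[f"num_{i}"] = rng.random(n_rows)
        return pd.DataFrame(data)

    base = make_frame(1_000)
    compare = make_frame(1_000)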


Table of mean benchmark times in seconds:

=========== ======= ======= =============== =============== ===============
Number of   pandas  polars  pandas on spark spark (fugue)   spark (fugue)
rows                        (distributed)   (single)        (distributed)
=========== ======= ======= =============== =============== ===============
1000        0.025   0.025   15.2838         2.041           1.109
100,000     0.196   0.120   11.1113         1.743           3.175
10,000,000  18.804  11.330  20.6274         17.560          16.455
50,000,000  96.494  62.827  57.5735         90.578          94.304
100,000,000 DNR     127.194 96.3204         DNR             193.234
500,000,000 DNR     DNR     262.6094        DNR             DNR
=========== ======= ======= =============== =============== ===============

.. note:: DNR = Did not run

.. image:: img/benchmarks.png

TLDR
----

* Polars can handle a lot of data and is fast!

* From our experiments we can see that on a 64GB machine it was able to process 100 million records.

* The Pandas on Spark implementation will be slower for small to medium-sized data.

* In the 100 million+ range it starts to shine, and due to its distributed nature it can process vast amounts of data.

* The native Pandas version is best for small and medium data.
Binary file added docs/source/img/benchmarks.png
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -13,6 +13,7 @@ Contents
Spark (Pandas on Spark) Usage <spark_usage>
Polars Usage <polars_usage>
Fugue Usage <fugue_usage>
Benchmarks <benchmark>
Developer Instructions <developer_instructions>

.. toctree::