Pandas on Spark refactor (#275)
* refactor SparkCompare

* tweaking SparkCompare and adding back Legacy

* conditional import

* cleaning up tests and using pytest-spark for legacy

* adding docs

* caching and some typo fixes

* adding in doc and pandas 2 changes

* adding pandas to testing matrix

* drop 3.8

* drop 3.8

* refactoring ^

* rebase fix for #277

* fixing legacy unicode column names

* unicode fix for legacy

* unicode test for new spark logic

* typo fix

* changes from PR review
fdosani authored Mar 25, 2024
1 parent 605152c commit 6a1920d
Showing 10 changed files with 5,322 additions and 3,007 deletions.
13 changes: 7 additions & 6 deletions .github/workflows/test-package.yml
@@ -19,11 +19,10 @@ jobs:
strategy:
fail-fast: false
matrix:
python-version: [3.8, 3.9, '3.10', '3.11']
spark-version: [3.1.3, 3.2.4, 3.3.4, 3.4.2, 3.5.0]
python-version: [3.9, '3.10', '3.11']
spark-version: [3.2.4, 3.3.4, 3.4.2, 3.5.1]
pandas-version: [2.2.1, 1.5.3]
exclude:
- python-version: '3.11'
spark-version: 3.1.3
- python-version: '3.11'
spark-version: 3.2.4
- python-version: '3.11'
@@ -51,6 +50,7 @@ jobs:
python -m pip install --upgrade pip
python -m pip install pytest pytest-spark pypandoc
python -m pip install pyspark==${{ matrix.spark-version }}
python -m pip install pandas==${{ matrix.pandas-version }}
python -m pip install .[dev]
- name: Test with pytest
run: |
@@ -62,7 +62,8 @@
strategy:
fail-fast: false
matrix:
python-version: [3.8, 3.9, '3.10', '3.11']
python-version: [3.9, '3.10', '3.11']

env:
PYTHON_VERSION: ${{ matrix.python-version }}

@@ -88,7 +89,7 @@
strategy:
fail-fast: false
matrix:
python-version: [3.8, 3.9, '3.10', '3.11']
python-version: [3.9, '3.10', '3.11']
env:
PYTHON_VERSION: ${{ matrix.python-version }}

48 changes: 38 additions & 10 deletions README.md
@@ -38,16 +38,44 @@ pip install datacompy[ray]

```

### In-scope Spark versions
Different versions of Spark play nicely with only certain versions of Python below is a matrix of what we test with
### Legacy Spark Deprecation

#### Starting with version 0.12.0

The original ``SparkCompare`` implementation differs from all the other native implementations. To better align the API and keep behaviour consistent, we are deprecating ``SparkCompare`` and moving it into a new module as ``LegacySparkCompare``.

If you wish to keep using the old ``SparkCompare`` moving forward, you can import it as follows:

```python
from datacompy.legacy import LegacySparkCompare
```
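
For context, here is a minimal usage sketch. It assumes the legacy class keeps the historical ``SparkCompare`` constructor (a ``SparkSession`` plus ``base_df``, ``compare_df``, and ``join_columns``); check the module for the exact signature in your version.

```python
# Hedged sketch: assumes LegacySparkCompare keeps the historical
# SparkCompare signature (spark_session, base_df, compare_df, join_columns).
from pyspark.sql import SparkSession

from datacompy.legacy import LegacySparkCompare

spark = SparkSession.builder.getOrCreate()

base_df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
compare_df = spark.createDataFrame([(1, "a"), (2, "c")], ["id", "value"])

comparison = LegacySparkCompare(
    spark,
    base_df,
    compare_df,
    join_columns=["id"],
)
comparison.report()  # prints a comparison summary to stdout
```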

#### Supported versions and dependencies

Different versions of Spark, Pandas, and Python interact differently. Below is a matrix of what we test with.
With the move to the Pandas on Spark API and compatibility issues with Pandas 2+, we will not, for the time being, support Pandas 2
with the Pandas on Spark implementation. Spark plans to support Pandas 2 in [Spark 4](https://issues.apache.org/jira/browse/SPARK-44101).
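
As a rough illustration of the refactor, the sketch below assumes the new ``SparkCompare`` mirrors the core Pandas ``Compare`` constructor (``df1``, ``df2``, ``join_columns``) and operates on pandas-on-Spark (``pyspark.pandas``) DataFrames; the names and the ``report()`` call are illustrative rather than a definitive reference.

```python
# Hedged sketch: assumes the refactored SparkCompare accepts
# pandas-on-Spark DataFrames and mirrors the core Compare API.
# Note: per the compatibility note above, this path currently requires pandas < 2.
import pyspark.pandas as ps

from datacompy import SparkCompare

df1 = ps.DataFrame({"id": [1, 2, 3], "value": [1.0, 2.0, 3.0]})
df2 = ps.DataFrame({"id": [1, 2, 3], "value": [1.0, 2.0, 3.1]})

compare = SparkCompare(df1, df2, join_columns=["id"])
print(compare.report())  # assumed to return a printable summary string
```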

With version ``0.12.0``:
- Pandas ``2.0.0`` and above are not supported for the native Spark implementation
- Spark ``3.1`` support is dropped
- Python ``3.8`` support is dropped


| | Spark 3.2.4 | Spark 3.3.4 | Spark 3.4.2 | Spark 3.5.1 |
|-------------|-------------|-------------|-------------|-------------|
| Python 3.9  | ✅ | ✅ | ✅ | ✅ |
| Python 3.10 | ✅ | ✅ | ✅ | ✅ |
| Python 3.11 | ❌ | ❌ | ✅ | ✅ |
| Python 3.12 | ❌ | ❌ | ❌ | ❌ |


| | Pandas < 1.5.3 | Pandas >=2.0.0 |
|---------------|----------------|----------------|
| Native Pandas | ✅ | ✅ |
| Native Spark  | ✅ | ❌ |
| Fugue         | ✅ | ✅ |

| | Spark 3.1.3 | Spark 3.2.3 | Spark 3.3.4 | Spark 3.4.2 | Spark 3.5.0 |
|-------------|--------------|-------------|-------------|-------------|-------------|
| Python 3.8  | ✅ | ✅ | ✅ | ✅ | ✅ |
| Python 3.9  | ✅ | ✅ | ✅ | ✅ | ✅ |
| Python 3.10 | ✅ | ✅ | ✅ | ✅ | ✅ |
| Python 3.11 | ❌ | ❌ | ❌ | ✅ | ✅ |
| Python 3.12 | ❌ | ❌ | ❌ | ❌ | ❌ |


> [!NOTE]
@@ -56,7 +84,7 @@ Different versions of Spark play nicely with only certain versions of Python bel
## Supported backends

- Pandas: ([See documentation](https://capitalone.github.io/datacompy/pandas_usage.html))
- Spark: ([See documentation](https://capitalone.github.io/datacompy/spark_usage.html))
- Spark (Pandas on Spark API): ([See documentation](https://capitalone.github.io/datacompy/spark_usage.html))
- Polars (Experimental): ([See documentation](https://capitalone.github.io/datacompy/polars_usage.html))
- Fugue is a Python library that provides a unified interface for data processing on Pandas, DuckDB, Polars, Arrow,
Spark, Dask, Ray, and many other backends. DataComPy integrates with Fugue to provide a simple way to compare data
4 changes: 2 additions & 2 deletions datacompy/__init__.py
@@ -13,7 +13,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.

__version__ = "0.11.3"
__version__ = "0.12.0"

from datacompy.core import *
from datacompy.fugue import (
@@ -25,4 +25,4 @@
unq_columns,
)
from datacompy.polars import PolarsCompare
from datacompy.spark import NUMERIC_SPARK_TYPES, SparkCompare
from datacompy.spark import SparkCompare