
Commit 80dc029

dbeatty10 and ueshin authored
Convert df to pyspark DataFrame if it is koalas before writing (#474)
* Temporarily update dev-requirements.txt
* Changelog entry
* Temporarily update dev-requirements.txt
* Convert df to pyspark DataFrame if it is koalas before writing
* Restore original version of dev-requirements.txt
* Preferentially convert Koalas DataFrames to pandas-on-Spark DataFrames first
* Fix explanation

Co-authored-by: Takuya UESHIN <[email protected]>
1 parent 23d17a0 commit 80dc029
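For context, the case this commit handles is a dbt Python model that returns a legacy Koalas DataFrame rather than a pandas or Spark one; before this change the write path in table.sql only converted pandas and pandas-on-Spark inputs. The following is a minimal, hypothetical sketch of such a model: the ref target and column names are invented for illustration, and it assumes dbt.ref resolves to a Spark DataFrame on this adapter and that importing databricks.koalas adds .to_koalas() to Spark DataFrames.

# Hypothetical dbt Python model returning a databricks.koalas DataFrame.
# Before this commit, such a return value was not converted to a Spark
# DataFrame before writing.
import databricks.koalas  # legacy Koalas package (predates pyspark.pandas)


def model(dbt, session):
    spark_df = dbt.ref("my_upstream_model")        # upstream relation as a Spark DataFrame (hypothetical name)
    kdf = spark_df.to_koalas()                     # importing databricks.koalas patches .to_koalas() onto Spark DataFrames
    kdf["amount_doubled"] = kdf["amount"] * 2      # pandas-style column math (hypothetical column)
    return kdf                                     # a Koalas frame: handled by the new elif branches below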

File tree

2 files changed, +18 -1 lines changed

Lines changed: 7 additions & 0 deletions

@@ -0,0 +1,7 @@
+kind: Under the Hood
+body: Convert df to pyspark DataFrame if it is koalas before writing
+time: 2022-09-24T14:37:13.100404-06:00
+custom:
+  Author: dbeatty10 ueshin
+  Issue: "473"
+  PR: "474"

dbt/include/spark/macros/materializations/table.sql

Lines changed: 11 additions & 1 deletion
@@ -46,6 +46,7 @@ import importlib.util

 pandas_available = False
 pyspark_available = False
+koalas_available = False

 # make sure pandas exists before using it
 if importlib.util.find_spec("pandas"):

@@ -57,17 +58,26 @@ if importlib.util.find_spec("pyspark.pandas"):
     import pyspark.pandas
     pyspark_available = True

-# preferentially convert pandas DataFrames to pandas-on-Spark DataFrames first
+# make sure databricks.koalas exists before using it
+if importlib.util.find_spec("databricks.koalas"):
+    import databricks.koalas
+    koalas_available = True
+
+# preferentially convert pandas DataFrames to pandas-on-Spark or Koalas DataFrames first
 # since they know how to convert pandas DataFrames better than `spark.createDataFrame(df)`
 # and converting from pandas-on-Spark to Spark DataFrame has no overhead
 if pyspark_available and pandas_available and isinstance(df, pandas.core.frame.DataFrame):
     df = pyspark.pandas.frame.DataFrame(df)
+elif koalas_available and pandas_available and isinstance(df, pandas.core.frame.DataFrame):
+    df = databricks.koalas.frame.DataFrame(df)

 # convert to pyspark.sql.dataframe.DataFrame
 if isinstance(df, pyspark.sql.dataframe.DataFrame):
     pass  # since it is already a Spark DataFrame
 elif pyspark_available and isinstance(df, pyspark.pandas.frame.DataFrame):
     df = df.to_spark()
+elif koalas_available and isinstance(df, databricks.koalas.frame.DataFrame):
+    df = df.to_spark()
 elif pandas_available and isinstance(df, pandas.core.frame.DataFrame):
     df = spark.createDataFrame(df)
 else:
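Read outside the macro, the conversion above amounts to the standalone sketch below. The helper name to_spark_dataframe and the explicit spark parameter are hypothetical (in the macro, df and spark are already in scope in the generated model code); the branch order mirrors the diff: lift a plain pandas frame into pandas-on-Spark or Koalas first, then normalize anything pandas-on-Spark or Koalas to a Spark DataFrame via to_spark(), with spark.createDataFrame as the pandas fallback.

# Standalone sketch of the conversion cascade added by this commit.
# Names are illustrative, not the adapter's API.
import importlib.util

import pyspark.sql.dataframe


def to_spark_dataframe(df, spark):
    pandas_available = importlib.util.find_spec("pandas") is not None
    pyspark_pandas_available = importlib.util.find_spec("pyspark.pandas") is not None
    koalas_available = importlib.util.find_spec("databricks.koalas") is not None

    if pandas_available:
        import pandas
    if pyspark_pandas_available:
        import pyspark.pandas
    if koalas_available:
        import databricks.koalas

    # step 1: upgrade plain pandas to pandas-on-Spark (or Koalas) first, since those
    # libraries convert pandas frames better than spark.createDataFrame(df) and the
    # later to_spark() call has no overhead
    if pyspark_pandas_available and pandas_available and isinstance(df, pandas.core.frame.DataFrame):
        df = pyspark.pandas.frame.DataFrame(df)
    elif koalas_available and pandas_available and isinstance(df, pandas.core.frame.DataFrame):
        df = databricks.koalas.frame.DataFrame(df)

    # step 2: normalize everything to pyspark.sql.dataframe.DataFrame
    if isinstance(df, pyspark.sql.dataframe.DataFrame):
        return df  # already a Spark DataFrame
    if pyspark_pandas_available and isinstance(df, pyspark.pandas.frame.DataFrame):
        return df.to_spark()
    if koalas_available and isinstance(df, databricks.koalas.frame.DataFrame):
        return df.to_spark()
    if pandas_available and isinstance(df, pandas.core.frame.DataFrame):
        return spark.createDataFrame(df)
    raise TypeError(f"Unsupported DataFrame type: {type(df)}")

As the diff's own comments note, routing pandas input through pandas-on-Spark or Koalas first preserves the conversion fidelity of those libraries, and the final to_spark() step is effectively free.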

0 commit comments