Add a User Guide for Parquet file reading #1452

@zaleslaw

Description

About Parquet

Apache Parquet is an open-source, column-oriented data file format designed for efficient data storage and retrieval. It provides several advantages:

  • Columnar storage: Data is stored column-by-column, which enables efficient compression and encoding schemes
  • Schema evolution: Supports adding new columns without breaking existing data readers
  • Efficient querying: Optimized for analytics workloads where you typically read a subset of columns
  • Cross-platform: Works across different programming languages and data processing frameworks
  • Compression: Built-in support for various compression algorithms (GZIP, Snappy, etc.)

Parquet files are commonly used in data lakes, data warehouses, and big data processing pipelines. They're frequently created by tools like Apache Spark, Pandas, Dask, and various cloud data services.
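
To make this concrete, here is a minimal sketch of reading a Parquet file with Kotlin DataFrame. The DataFrame.readParquet function, its import path, and the sample file path are assumptions for illustration and may differ from the final API this user guide will document:

    // Minimal sketch, assuming a DataFrame.readParquet function; names are illustrative.
    import org.jetbrains.kotlinx.dataframe.DataFrame
    import org.jetbrains.kotlinx.dataframe.io.readParquet

    fun main() {
        // Read the whole file into an in-memory DataFrame
        val df = DataFrame.readParquet("data/sales.parquet")

        // Inspect the inferred schema and the first rows
        println(df.schema())
        println(df.head())
    }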

Typical use cases

  • Exchanging columnar datasets between Spark and Kotlin/JVM applications (see the sketch after this list).
  • Analytical workloads where columnar compression and predicate pushdown matter.
  • Reading data exported from data lakes and lakehouse tables (e.g., from Spark, Hive, or Delta/Iceberg exports).
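
As referenced in the first bullet, the sketch below shows the Spark interchange case: Spark's df.write.parquet(...) produces a directory of part files, one of which is read here. The readParquet call, its import, and the part-file name are assumptions for illustration:

    // Hedged sketch: consuming one part file of a Parquet dataset exported by Spark.
    import org.jetbrains.kotlinx.dataframe.DataFrame
    import org.jetbrains.kotlinx.dataframe.io.readParquet

    fun main() {
        // Spark writes a directory such as sales.parquet/ containing part files;
        // the exact file name below is illustrative only.
        val df = DataFrame.readParquet("sales.parquet/part-00000.snappy.parquet")

        // Project only the columns the analysis needs, a typical columnar workload
        val slim = df.select("region", "revenue")
        println(slim)
    }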

Android Compatibility

Reading Parquet files directly on Android is not supported. If you need to process Parquet data in an Android application, consider one of the following approaches:

  • Processing files on a server and exposing the data via an API
  • Converting Parquet files to a supported format (JSON, CSV) for Android consumption; a server-side conversion sketch follows this list
  • Using cloud-based data processing services
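
For the conversion option in the second bullet, a server-side job could rewrite Parquet data into CSV and JSON before handing it to the mobile client. This is a hedged sketch: readParquet, writeCSV, and toJson are assumed names following the library's io conventions:

    // Hedged sketch: server-side conversion of Parquet into Android-friendly formats.
    import java.io.File
    import org.jetbrains.kotlinx.dataframe.DataFrame
    import org.jetbrains.kotlinx.dataframe.io.readParquet
    import org.jetbrains.kotlinx.dataframe.io.toJson
    import org.jetbrains.kotlinx.dataframe.io.writeCSV

    fun main() {
        val df = DataFrame.readParquet("reports/latest.parquet")

        // CSV for bulk download by the Android client
        df.writeCSV("reports/latest.csv")

        // JSON for serving through an API endpoint
        File("reports/latest.json").writeText(df.toJson())
    }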
