diff --git a/.ai/skills/check-upstream/SKILL.md b/.ai/skills/check-upstream/SKILL.md new file mode 100644 index 000000000..ac4835a4e --- /dev/null +++ b/.ai/skills/check-upstream/SKILL.md @@ -0,0 +1,383 @@ + + +--- +name: check-upstream +description: Check if upstream Apache DataFusion features (functions, DataFrame ops, SessionContext methods, FFI types) are exposed in this Python project. Use when adding missing functions, auditing API coverage, or ensuring parity with upstream. +argument-hint: [area] (e.g., "scalar functions", "aggregate functions", "window functions", "dataframe", "session context", "ffi types", "all") +--- + +# Check Upstream DataFusion Feature Coverage + +You are auditing the datafusion-python project to find features from the upstream Apache DataFusion Rust library that are **not yet exposed** in this Python binding project. Your goal is to identify gaps and, if asked, implement the missing bindings. + +**IMPORTANT: The Python API is the source of truth for coverage.** A function or method is considered "exposed" if it exists in the Python API (e.g., `python/datafusion/functions.py`), even if there is no corresponding entry in the Rust bindings. Many upstream functions are aliases of other functions — the Python layer can expose these aliases by calling a different underlying Rust binding. Do NOT report a function as missing if it appears in the Python `__all__` list and has a working implementation, regardless of whether a matching `#[pyfunction]` exists in Rust. + +## Areas to Check + +The user may specify an area via `$ARGUMENTS`. If no area is specified or "all" is given, check all areas. + +### 1. 
Scalar Functions + +**Upstream source of truth:** +- Rust docs: https://docs.rs/datafusion/latest/datafusion/functions/index.html +- User docs: https://datafusion.apache.org/user-guide/sql/scalar_functions.html + +**Where they are exposed in this project:** +- Python API: `python/datafusion/functions.py` — each function wraps a call to `datafusion._internal.functions` +- Rust bindings: `crates/core/src/functions.rs` — `#[pyfunction]` definitions registered via `init_module()` + +**How to check:** +1. Fetch the upstream scalar function documentation page +2. Compare against functions listed in `python/datafusion/functions.py` (check the `__all__` list and function definitions) +3. A function is covered if it exists in the Python API — it does NOT need a dedicated Rust `#[pyfunction]`. Many functions are aliases that reuse another function's Rust binding. +4. Only report functions that are missing from the Python `__all__` list / function definitions + +### 2. Aggregate Functions + +**Upstream source of truth:** +- Rust docs: https://docs.rs/datafusion/latest/datafusion/functions_aggregate/index.html +- User docs: https://datafusion.apache.org/user-guide/sql/aggregate_functions.html + +**Where they are exposed in this project:** +- Python API: `python/datafusion/functions.py` (aggregate functions are mixed in with scalar functions) +- Rust bindings: `crates/core/src/functions.rs` + +**How to check:** +1. Fetch the upstream aggregate function documentation page +2. Compare against aggregate functions in `python/datafusion/functions.py` (check `__all__` list and function definitions) +3. A function is covered if it exists in the Python API, even if it aliases another function's Rust binding +4. Report only functions missing from the Python API + +### 3. 
Window Functions + +**Upstream source of truth:** +- Rust docs: https://docs.rs/datafusion/latest/datafusion/functions_window/index.html +- User docs: https://datafusion.apache.org/user-guide/sql/window_functions.html + +**Where they are exposed in this project:** +- Python API: `python/datafusion/functions.py` (window functions like `rank`, `dense_rank`, `lag`, `lead`, etc.) +- Rust bindings: `crates/core/src/functions.rs` + +**How to check:** +1. Fetch the upstream window function documentation page +2. Compare against window functions in `python/datafusion/functions.py` (check `__all__` list and function definitions) +3. A function is covered if it exists in the Python API, even if it aliases another function's Rust binding +4. Report only functions missing from the Python API + +### 4. Table Functions + +**Upstream source of truth:** +- Rust docs: https://docs.rs/datafusion/latest/datafusion/functions_table/index.html +- User docs: https://datafusion.apache.org/user-guide/sql/table_functions.html (if available) + +**Where they are exposed in this project:** +- Python API: `python/datafusion/functions.py` and `python/datafusion/user_defined.py` (TableFunction/udtf) +- Rust bindings: `crates/core/src/functions.rs` and `crates/core/src/udtf.rs` + +**How to check:** +1. Fetch the upstream table function documentation +2. Compare against what's available in the Python API +3. A function is covered if it exists in the Python API, even if it aliases another function's Rust binding +4. Report only functions missing from the Python API + +### 5. 
DataFrame Operations + +**Upstream source of truth:** +- Rust docs: https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html + +**Where they are exposed in this project:** +- Python API: `python/datafusion/dataframe.py` — the `DataFrame` class +- Rust bindings: `crates/core/src/dataframe.rs` — `PyDataFrame` with `#[pymethods]` + +**Evaluated and not requiring separate Python exposure:** +- `show_limit` — already covered by `DataFrame.show()`, which provides the same functionality with a simpler API +- `with_param_values` — already covered by the `param_values` argument on `SessionContext.sql()`, which accomplishes the same thing more robustly +- `union_by_name_distinct` — already covered by `DataFrame.union_by_name(distinct=True)`, which provides a more Pythonic API + +**How to check:** +1. Fetch the upstream DataFrame documentation page listing all methods +2. Compare against methods in `python/datafusion/dataframe.py` — this is the source of truth for coverage +3. The Rust bindings (`crates/core/src/dataframe.rs`) may be consulted for context, but a method is covered if it exists in the Python API +4. Check against the "evaluated and not requiring exposure" list before flagging as a gap +5. Report only methods missing from the Python API + +### 6. SessionContext Methods + +**Upstream source of truth:** +- Rust docs: https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html + +**Where they are exposed in this project:** +- Python API: `python/datafusion/context.py` — the `SessionContext` class +- Rust bindings: `crates/core/src/context.rs` — `PySessionContext` with `#[pymethods]` + +**How to check:** +1. Fetch the upstream SessionContext documentation page listing all methods +2. Compare against methods in `python/datafusion/context.py` — this is the source of truth for coverage +3. 
The Rust bindings (`crates/core/src/context.rs`) may be consulted for context, but a method is covered if it exists in the Python API +4. Report only methods missing from the Python API + +### 7. FFI Types (datafusion-ffi) + +**Upstream source of truth:** +- Crate source: https://github.com/apache/datafusion/tree/main/datafusion/ffi/src +- Rust docs: https://docs.rs/datafusion-ffi/latest/datafusion_ffi/ + +**Where they are exposed in this project:** +- Rust bindings: various files under `crates/core/src/` and `crates/util/src/` +- FFI example: `examples/datafusion-ffi-example/src/` +- Dependency declared in root `Cargo.toml` and `crates/core/Cargo.toml` + +**Discovering currently supported FFI types:** +Grep for `use datafusion_ffi::` in `crates/core/src/` and `crates/util/src/` to find all FFI types currently imported and used. + +**Evaluated and not requiring direct Python exposure:** +These upstream FFI types have been reviewed and do not need to be independently exposed to end users: +- `FFI_ExecutionPlan` — already used indirectly through table providers; no need for direct exposure +- `FFI_PhysicalExpr` / `FFI_PhysicalSortExpr` — internal physical planning types not expected to be needed by end users +- `FFI_RecordBatchStream` — one level deeper than FFI_ExecutionPlan, used internally when execution plans stream results +- `FFI_SessionRef` / `ForeignSession` — session sharing across FFI; Python manages sessions natively via SessionContext +- `FFI_SessionConfig` — Python can configure sessions natively without FFI +- `FFI_ConfigOptions` / `FFI_TableOptions` — internal configuration plumbing +- `FFI_PlanProperties` / `FFI_Boundedness` / `FFI_EmissionType` — read from existing plans, not user-facing +- `FFI_Partitioning` — supporting type for physical planning +- Supporting/utility types (`FFI_Option`, `FFI_Result`, `WrappedSchema`, `WrappedArray`, `FFI_ColumnarValue`, `FFI_Volatility`, `FFI_InsertOp`, `FFI_AccumulatorArgs`, `FFI_Accumulator`, 
`FFI_GroupsAccumulator`, `FFI_EmitTo`, `FFI_AggregateOrderSensitivity`, `FFI_PartitionEvaluator`, `FFI_PartitionEvaluatorArgs`, `FFI_Range`, `FFI_SortOptions`, `FFI_Distribution`, `FFI_ExprProperties`, `FFI_SortProperties`, `FFI_Interval`, `FFI_TableProviderFilterPushDown`, `FFI_TableType`) — used as building blocks within the types above, not independently exposed + +**How to check:** +1. Discover currently supported types by grepping for `use datafusion_ffi::` in `crates/core/src/` and `crates/util/src/`, then compare against the upstream `datafusion-ffi` crate's `lib.rs` exports +2. If new FFI types appear upstream, evaluate whether they represent a user-facing capability +3. Check against the "evaluated and not requiring exposure" list before flagging as a gap +4. Report any genuinely new types that enable user-facing functionality +5. For each currently supported FFI type, verify the full pipeline is present using the checklist from "Adding a New FFI Type": + - Rust PyO3 wrapper with `from_pycapsule()` method + - Python Protocol type (e.g., `ScalarUDFExportable`) for FFI objects + - Python wrapper class with full type hints on all public methods + - ABC base class (if the type can be user-implemented) + - Registered in Rust `init_module()` and Python `__init__.py` + - FFI example in `examples/datafusion-ffi-example/` + - Type appears in union type hints where accepted + +## Checking for Existing GitHub Issues + +After identifying missing APIs, search the open issues at https://github.com/apache/datafusion-python/issues for each gap to see if an issue already exists requesting that API be exposed. Search using the function or method name as the query. + +- If an existing issue is found, include a link to it in the report. Do NOT create a new issue. +- If no existing issue is found, note that no issue exists yet. If the user asks to create issues for missing APIs, each issue should specify that Python test coverage is required as part of the implementation. 
+
+## Output Format
+
+For each area checked, produce a report like:
+
+```
+## [Area Name] Coverage Report
+
+### Currently Exposed (X functions/methods)
+- list of what's already available
+
+### Missing from Upstream (Y functions/methods)
+- function_name — brief description of what it does (existing issue: #123)
+- function_name — brief description of what it does (no existing issue)
+
+### Notes
+- Any relevant observations about partial implementations, naming differences, etc.
+```
+
+## Implementation Pattern
+
+If the user asks you to implement missing features, follow these patterns:
+
+### Adding a New Function (Scalar/Aggregate/Window)
+
+**Step 1: Rust binding** in `crates/core/src/functions.rs`:
+```rust
+#[pyfunction]
+#[pyo3(signature = (arg1, arg2))]
+fn new_function_name(arg1: PyExpr, arg2: PyExpr) -> PyResult<PyExpr> {
+    Ok(datafusion::functions::module::expr_fn::new_function_name(arg1.expr, arg2.expr).into())
+}
+```
+Then register in `init_module()`:
+```rust
+m.add_wrapped(wrap_pyfunction!(new_function_name))?;
+```
+
+**Step 2: Python wrapper** in `python/datafusion/functions.py`:
+```python
+def new_function_name(arg1: Expr, arg2: Expr) -> Expr:
+    """Description of what the function does.
+
+    Args:
+        arg1: Description of first argument.
+        arg2: Description of second argument.
+
+    Returns:
+        Description of return value.
+    """
+    return Expr(f.new_function_name(arg1.expr, arg2.expr))
+```
+Add to `__all__` list.
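The coverage audit in the sections above reduces to a set difference, with the Python `__all__` list as the source of truth; a minimal sketch with illustrative names (the upstream name set would come from the fetched documentation pages):

```python
def find_missing(upstream_names: set[str], python_all: list[str]) -> list[str]:
    """Return upstream names that are absent from the Python ``__all__`` list.

    Aliases that reuse another function's Rust binding still count as covered,
    since only presence in the Python API matters.
    """
    return sorted(upstream_names - set(python_all))
```

For example, `find_missing({"abs", "array_sort"}, ["abs"])` yields `["array_sort"]`, which would then be reported and cross-checked against open issues.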
+
+### Adding a New DataFrame Method
+
+**Step 1: Rust binding** in `crates/core/src/dataframe.rs`:
+```rust
+#[pymethods]
+impl PyDataFrame {
+    fn new_method(&self, py: Python, param: PyExpr) -> PyDataFusionResult<Self> {
+        let df = self.df.as_ref().clone().new_method(param.into())?;
+        Ok(Self::new(df))
+    }
+}
+```
+
+**Step 2: Python wrapper** in `python/datafusion/dataframe.py`:
+```python
+def new_method(self, param: Expr) -> DataFrame:
+    """Description of the method."""
+    return DataFrame(self.df.new_method(param.expr))
+```
+
+### Adding a New SessionContext Method
+
+**Step 1: Rust binding** in `crates/core/src/context.rs`:
+```rust
+#[pymethods]
+impl PySessionContext {
+    pub fn new_method(&self, py: Python, param: String) -> PyDataFusionResult<PyDataFrame> {
+        let df = wait_for_future(py, self.ctx.new_method(&param))?;
+        Ok(PyDataFrame::new(df))
+    }
+}
+```
+
+**Step 2: Python wrapper** in `python/datafusion/context.py`:
+```python
+def new_method(self, param: str) -> DataFrame:
+    """Description of the method."""
+    return DataFrame(self.ctx.new_method(param))
+```
+
+### Adding a New FFI Type
+
+FFI types require a full pipeline from C struct through to a typed Python wrapper. Each layer must be present.
+
+**Step 1: Rust PyO3 wrapper class** in a new or existing file under `crates/core/src/`:
+```rust
+use datafusion_ffi::new_type::FFI_NewType;
+
+#[pyclass(from_py_object, frozen, name = "RawNewType", module = "datafusion.module_name", subclass)]
+pub struct PyNewType {
+    pub inner: Arc<FFI_NewType>,
+}
+
+#[pymethods]
+impl PyNewType {
+    #[staticmethod]
+    fn from_pycapsule(obj: &Bound<'_, PyAny>) -> PyDataFusionResult<Self> {
+        let capsule = obj
+            .getattr("__datafusion_new_type__")?
+            .call0()?
+            .downcast::<PyCapsule>()?;
+        let ffi_ptr = unsafe { capsule.reference::<FFI_NewType>() };
+        let provider: Arc<FFI_NewType> = ffi_ptr.into();
+        Ok(Self { inner: provider })
+    }
+
+    fn some_method(&self) -> PyResult<...> {
+        // wrap inner trait method
+    }
+}
+```
+Register in the appropriate `init_module()`:
+```rust
+m.add_class::<PyNewType>()?;
+```
+
+**Step 2: Python Protocol type** in the appropriate Python module (e.g., `python/datafusion/catalog.py`):
+```python
+class NewTypeExportable(Protocol):
+    """Type hint for objects providing a __datafusion_new_type__ PyCapsule."""
+
+    def __datafusion_new_type__(self) -> object: ...
+```
+
+**Step 3: Python wrapper class** in the same module:
+```python
+class NewType:
+    """Description of the type.
+
+    This class wraps a DataFusion NewType, which can be created from a native
+    Python implementation or imported from an FFI-compatible library.
+    """
+
+    def __init__(
+        self,
+        new_type: df_internal.module_name.RawNewType | NewTypeExportable,
+    ) -> None:
+        if isinstance(new_type, df_internal.module_name.RawNewType):
+            self._raw = new_type
+        else:
+            self._raw = df_internal.module_name.RawNewType.from_pycapsule(new_type)
+
+    def some_method(self) -> ReturnType:
+        """Description of the method."""
+        return self._raw.some_method()
+```
+
+**Step 4: ABC base class** (if users should be able to subclass and provide custom implementations in Python):
+```python
+from abc import ABC, abstractmethod
+
+class NewTypeProvider(ABC):
+    """Abstract base class for implementing a custom NewType in Python."""
+
+    @abstractmethod
+    def some_method(self) -> ReturnType:
+        """Description of the method."""
+        ...
+``` + +**Step 5: Module exports** — add to the appropriate `__init__.py`: +- Add the wrapper class (`NewType`) to `python/datafusion/__init__.py` +- Add the ABC (`NewTypeProvider`) if applicable +- Add the Protocol type (`NewTypeExportable`) if it should be public + +**Step 6: FFI example** — add an example implementation under `examples/datafusion-ffi-example/src/`: +```rust +// examples/datafusion-ffi-example/src/new_type.rs +use datafusion_ffi::new_type::FFI_NewType; +// ... example showing how an external Rust library exposes this type via PyCapsule +``` + +**Checklist for each FFI type:** +- [ ] Rust PyO3 wrapper with `from_pycapsule()` method +- [ ] Python Protocol type (e.g., `NewTypeExportable`) for FFI objects +- [ ] Python wrapper class with full type hints on all public methods +- [ ] ABC base class (if the type can be user-implemented) +- [ ] Registered in Rust `init_module()` and Python `__init__.py` +- [ ] FFI example in `examples/datafusion-ffi-example/` +- [ ] Type appears in union type hints where accepted (e.g., `Table | TableProviderExportable`) + +## Important Notes + +- The upstream DataFusion version used by this project is specified in `crates/core/Cargo.toml` — check the `datafusion` dependency version to ensure you're comparing against the right upstream version. +- Some upstream features may intentionally not be exposed (e.g., internal-only APIs). Use judgment about what's user-facing. +- When fetching upstream docs, prefer the published docs.rs documentation as it matches the crate version. +- Function aliases (e.g., `array_append` / `list_append`) should both be exposed if upstream supports them. +- Check the `__all__` list in `functions.py` to see what's publicly exported vs just defined. diff --git a/.asf.yaml b/.asf.yaml index f27975c84..cb0520c17 100644 --- a/.asf.yaml +++ b/.asf.yaml @@ -16,24 +16,31 @@ # under the License. 
notifications: - commits: commits@arrow.apache.org - issues: github@arrow.apache.org - pullrequests: github@arrow.apache.org + commits: commits@datafusion.apache.org + issues: github@datafusion.apache.org + pullrequests: github@datafusion.apache.org jira_options: link label worklog github: - description: "Apache Arrow DataFusion Python Bindings" - homepage: https://arrow.apache.org/datafusion + description: "Apache DataFusion Python Bindings" + homepage: https://datafusion.apache.org/python enabled_merge_buttons: squash: true merge: false rebase: false features: issues: true + protected_branches: + main: + required_status_checks: + # require branches to be up-to-date before merging + strict: true + # don't require any jobs to pass + contexts: [] staging: whoami: asf-staging - subdir: datafusion-python + subdir: python publish: whoami: asf-site - subdir: datafusion-python + subdir: python diff --git a/.claude/skills b/.claude/skills new file mode 120000 index 000000000..6838a1160 --- /dev/null +++ b/.claude/skills @@ -0,0 +1 @@ +../.ai/skills \ No newline at end of file diff --git a/.dockerignore b/.dockerignore new file mode 100644 index 000000000..411e60291 --- /dev/null +++ b/.dockerignore @@ -0,0 +1,12 @@ +.cargo +.github +.pytest_cache +ci +conda +dev +docs +examples +parquet +target +testing +venv \ No newline at end of file diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml index fc8910766..7682d6cb0 100644 --- a/.github/workflows/build.yml +++ b/.github/workflows/build.yml @@ -15,34 +15,275 @@ # specific language governing permissions and limitations # under the License. 
-name: Python Release Build +# Reusable workflow for running building +# This ensures the same tests run for both debug (PRs) and release (main/tags) builds + +name: Build + on: - pull_request: - branches: ["main"] - push: - tags: ["*-rc*"] - branches: ["branch-*"] + workflow_call: + inputs: + build_mode: + description: 'Build mode: debug or release' + required: true + type: string + run_wheels: + description: 'Whether to build distribution wheels' + required: false + type: boolean + default: false + +env: + CARGO_TERM_COLOR: always + RUST_BACKTRACE: 1 + UV_LOCKED: true jobs: + # ============================================ + # Linting Jobs + # ============================================ + lint-rust: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v6 + + - name: Setup Rust + uses: dtolnay/rust-toolchain@29eef336d9b2848a0b548edc03f92a220660cdb8 + with: + toolchain: "nightly" + components: rustfmt + + - name: Cache Cargo + uses: Swatinem/rust-cache@v2 + + - name: Check formatting + run: cargo +nightly fmt --all -- --check + + lint-python: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v6 + + - name: Install Python + uses: actions/setup-python@v5 + with: + python-version: "3.12" + + - uses: astral-sh/setup-uv@5a095e7a2014a4212f075830d4f7277575a9d098 + with: + enable-cache: true + + - name: Install dependencies + run: uv sync --dev --no-install-package datafusion + + - name: Run Ruff + run: | + uv run --no-project ruff check --output-format=github python/ + uv run --no-project ruff format --check python/ + + - name: Run codespell + run: | + uv run --no-project codespell --toml pyproject.toml + + lint-toml: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v6 + + - name: Install taplo + uses: taiki-e/install-action@v2 + with: + tool: taplo-cli + + # if you encounter an error, try running 'taplo format' to fix the formatting automatically. 
+ - name: Check Cargo.toml formatting + run: taplo format --check + + check-crates-patch: + if: inputs.build_mode == 'release' && startsWith(github.ref, 'refs/tags/') + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v6 + + - name: Ensure [patch.crates-io] is empty + run: python3 dev/check_crates_patch.py + generate-license: runs-on: ubuntu-latest steps: - - uses: actions/checkout@v3 - - uses: actions-rs/toolchain@v1 + - uses: actions/checkout@v6 + + - uses: astral-sh/setup-uv@5a095e7a2014a4212f075830d4f7277575a9d098 with: - profile: minimal - toolchain: stable - override: true + enable-cache: true + + - name: Install cargo-license + uses: taiki-e/install-action@v2 + with: + tool: cargo-license + - name: Generate license file - run: python ./dev/create_license.py - - uses: actions/upload-artifact@v3 + run: uv run --no-project python ./dev/create_license.py + + - uses: actions/upload-artifact@v7 with: name: python-wheel-license path: LICENSE.txt + # ============================================ + # Build - Linux x86_64 + # ============================================ + build-manylinux-x86_64: + needs: [generate-license, lint-rust, lint-python] + name: ManyLinux x86_64 + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v6 + + - run: rm LICENSE.txt + - name: Download LICENSE.txt + uses: actions/download-artifact@v8 + with: + name: python-wheel-license + path: . 
+ + - name: Setup Rust + uses: dtolnay/rust-toolchain@29eef336d9b2848a0b548edc03f92a220660cdb8 + + - name: Cache Cargo + uses: Swatinem/rust-cache@v2 + with: + key: ${{ inputs.build_mode }} + + - uses: astral-sh/setup-uv@5a095e7a2014a4212f075830d4f7277575a9d098 + with: + enable-cache: true + + - name: Add extra swap for release build + if: inputs.build_mode == 'release' + run: | + set -euxo pipefail + sudo swapoff -a || true + sudo rm -f /swapfile + sudo fallocate -l 8G /swapfile || sudo dd if=/dev/zero of=/swapfile bs=1M count=8192 + sudo chmod 600 /swapfile + sudo mkswap /swapfile + sudo swapon /swapfile + free -h + swapon --show + + - name: Build (release mode) + uses: PyO3/maturin-action@v1 + if: inputs.build_mode == 'release' + with: + target: x86_64-unknown-linux-gnu + manylinux: "2_28" + args: --release --strip --features protoc,substrait --out dist + rustup-components: rust-std + + - name: Build (debug mode) + uses: PyO3/maturin-action@v1 + if: inputs.build_mode == 'debug' + with: + target: x86_64-unknown-linux-gnu + manylinux: "2_28" + args: --features protoc,substrait --out dist + rustup-components: rust-std + + - name: Build FFI test library + uses: PyO3/maturin-action@v1 + with: + target: x86_64-unknown-linux-gnu + manylinux: "2_28" + working-directory: examples/datafusion-ffi-example + args: --out dist + rustup-components: rust-std + + - name: Archive wheels + uses: actions/upload-artifact@v7 + with: + name: dist-manylinux-x86_64 + path: dist/* + + - name: Archive FFI test wheel + uses: actions/upload-artifact@v7 + with: + name: test-ffi-manylinux-x86_64 + path: examples/datafusion-ffi-example/dist/* + + # ============================================ + # Build - Linux ARM64 + # ============================================ + build-manylinux-aarch64: + needs: [generate-license, lint-rust, lint-python] + name: ManyLinux arm64 + runs-on: ubuntu-24.04-arm + steps: + - uses: actions/checkout@v6 + + - run: rm LICENSE.txt + - name: Download LICENSE.txt + uses: 
actions/download-artifact@v8 + with: + name: python-wheel-license + path: . + + - name: Setup Rust + uses: dtolnay/rust-toolchain@29eef336d9b2848a0b548edc03f92a220660cdb8 + + - name: Cache Cargo + uses: Swatinem/rust-cache@v2 + with: + key: ${{ inputs.build_mode }} + + - uses: astral-sh/setup-uv@5a095e7a2014a4212f075830d4f7277575a9d098 + with: + enable-cache: true + + - name: Add extra swap for release build + if: inputs.build_mode == 'release' + run: | + set -euxo pipefail + sudo swapoff -a || true + sudo rm -f /swapfile + sudo fallocate -l 8G /swapfile || sudo dd if=/dev/zero of=/swapfile bs=1M count=8192 + sudo chmod 600 /swapfile + sudo mkswap /swapfile + sudo swapon /swapfile + free -h + swapon --show + + - name: Build (release mode) + uses: PyO3/maturin-action@v1 + if: inputs.build_mode == 'release' + with: + target: aarch64-unknown-linux-gnu + manylinux: "2_28" + args: --release --strip --features protoc,substrait --out dist + rustup-components: rust-std + + - name: Build (debug mode) + uses: PyO3/maturin-action@v1 + if: inputs.build_mode == 'debug' + with: + target: aarch64-unknown-linux-gnu + manylinux: "2_28" + args: --features protoc,substrait --out dist + rustup-components: rust-std + + - name: Archive wheels + uses: actions/upload-artifact@v7 + if: inputs.build_mode == 'release' + with: + name: dist-manylinux-aarch64 + path: dist/* + + # ============================================ + # Build - macOS arm64 / Windows + # ============================================ build-python-mac-win: - needs: [generate-license] - name: Mac/Win + needs: [generate-license, lint-rust, lint-python] + name: macOS arm64 & Windows runs-on: ${{ matrix.os }} strategy: fail-fast: false @@ -50,82 +291,136 @@ jobs: python-version: ["3.10"] os: [macos-latest, windows-latest] steps: - - uses: actions/checkout@v3 + - uses: actions/checkout@v6 - - uses: actions/setup-python@v4 - with: - python-version: ${{ matrix.python-version }} - - - uses: actions-rs/toolchain@v1 - with: - 
toolchain: stable + - uses: dtolnay/rust-toolchain@29eef336d9b2848a0b548edc03f92a220660cdb8 - run: rm LICENSE.txt - name: Download LICENSE.txt - uses: actions/download-artifact@v3 + uses: actions/download-artifact@v8 with: name: python-wheel-license path: . - - name: Build Python package - run: maturin build --release --strip + - name: Cache Cargo + uses: Swatinem/rust-cache@v2 + with: + key: ${{ inputs.build_mode }} + + - uses: astral-sh/setup-uv@5a095e7a2014a4212f075830d4f7277575a9d098 + with: + enable-cache: true + + - name: Install Protoc + uses: arduino/setup-protoc@v3 + with: + version: "27.4" + repo-token: ${{ secrets.GITHUB_TOKEN }} + + - name: Install dependencies + run: uv sync --dev --no-install-package datafusion + + # Run clippy BEFORE maturin so we can avoid rebuilding. The features must match + # exactly the features used by maturin. Linux maturin builds need to happen in a + # container so only run this for our mac runner. + - name: Run Clippy + if: matrix.os != 'windows-latest' + run: cargo clippy --no-deps --all-targets --features substrait -- -D warnings + + - name: Build Python package (release mode) + if: inputs.build_mode == 'release' + run: uv run --no-project maturin build --release --strip --features substrait + + - name: Build Python package (debug mode) + if: inputs.build_mode != 'release' + run: uv run --no-project maturin build --features substrait - name: List Windows wheels if: matrix.os == 'windows-latest' run: dir target\wheels\ + # since the runner is dynamic shellcheck (from actionlint) can't infer this is powershell + # so we specify it explicitly + shell: powershell - name: List Mac wheels if: matrix.os != 'windows-latest' run: find target/wheels/ - name: Archive wheels - uses: actions/upload-artifact@v3 + uses: actions/upload-artifact@v7 + if: inputs.build_mode == 'release' with: - name: dist + name: dist-${{ matrix.os }} path: target/wheels/* - build-manylinux: - needs: [generate-license] - name: Manylinux - runs-on: 
ubuntu-latest + # ============================================ + # Build - macOS x86_64 (release only) + # ============================================ + build-macos-x86_64: + if: inputs.build_mode == 'release' + needs: [generate-license, lint-rust, lint-python] + runs-on: macos-15-intel + strategy: + fail-fast: false + matrix: + python-version: ["3.10"] steps: - - uses: actions/checkout@v3 + - uses: actions/checkout@v6 + + - uses: dtolnay/rust-toolchain@29eef336d9b2848a0b548edc03f92a220660cdb8 + - run: rm LICENSE.txt - name: Download LICENSE.txt - uses: actions/download-artifact@v3 + uses: actions/download-artifact@v8 with: name: python-wheel-license path: . - - run: cat LICENSE.txt + + - name: Cache Cargo + uses: Swatinem/rust-cache@v2 + with: + key: ${{ inputs.build_mode }} + + - uses: astral-sh/setup-uv@5a095e7a2014a4212f075830d4f7277575a9d098 + with: + enable-cache: true + - name: Install Protoc - uses: arduino/setup-protoc@v1 + uses: arduino/setup-protoc@v3 with: - version: '3.x' + version: "27.4" repo-token: ${{ secrets.GITHUB_TOKEN }} - - name: Build wheels - uses: PyO3/maturin-action@v1 - with: - env: - RUST_BACKTRACE: 1 - rust-toolchain: nightly - target: x86_64 - manylinux: auto - args: --release --manylinux 2014 + + - name: Install dependencies + run: uv sync --dev --no-install-package datafusion + + - name: Build (release mode) + run: | + uv run --no-project maturin build --release --strip --features substrait + + - name: List Mac wheels + run: find target/wheels/ + - name: Archive wheels - uses: actions/upload-artifact@v3 + uses: actions/upload-artifact@v7 with: - name: dist + name: dist-macos-aarch64 path: target/wheels/* + # ============================================ + # Build - Source Distribution + # ============================================ + build-sdist: needs: [generate-license] name: Source distribution + if: inputs.build_mode == 'release' runs-on: ubuntu-latest steps: - - uses: actions/checkout@v3 + - uses: actions/checkout@v6 - run: rm 
LICENSE.txt - name: Download LICENSE.txt - uses: actions/download-artifact@v3 + uses: actions/download-artifact@v8 with: name: python-wheel-license path: . @@ -135,22 +430,132 @@ jobs: with: rust-toolchain: stable manylinux: auto - args: --release --sdist --out dist - - name: Archive wheels - uses: actions/upload-artifact@v3 + rustup-components: rust-std rustfmt + args: --release --sdist --out dist --features protoc,substrait + - name: Assert sdist build does not generate wheels + run: | + if [ "$(ls -A target/wheels)" ]; then + echo "Error: Sdist build generated wheels" + exit 1 + else + echo "Directory is clean" + fi + shell: bash + + # ============================================ + # Build - Source Distribution + # ============================================ + + merge-build-artifacts: + runs-on: ubuntu-latest + name: Merge build artifacts + if: inputs.build_mode == 'release' + needs: + - build-python-mac-win + - build-macos-x86_64 + - build-manylinux-x86_64 + - build-manylinux-aarch64 + - build-sdist + steps: + - name: Merge Build Artifacts + uses: actions/upload-artifact/merge@v7 with: name: dist - path: target/wheels/* + pattern: dist-* + + # ============================================ + # Build - Documentation + # ============================================ + # Documentation build job that runs after wheels are built + build-docs: + name: Build docs + runs-on: ubuntu-latest + needs: [build-manylinux-x86_64] # Only need the Linux wheel for docs + # Only run docs on main branch pushes, tags, or PRs + if: github.event_name == 'push' || github.event_name == 'pull_request' + steps: + - name: Set target branch + if: github.event_name == 'push' && (github.ref == 'refs/heads/main' || github.ref_type == 'tag') + id: target-branch + run: | + set -x + if test '${{ github.ref }}' = 'refs/heads/main'; then + echo "value=asf-staging" >> "$GITHUB_OUTPUT" + elif test '${{ github.ref_type }}' = 'tag'; then + echo "value=asf-site" >> "$GITHUB_OUTPUT" + else + echo 
"Unsupported input: ${{ github.ref }} / ${{ github.ref_type }}" + exit 1 + fi + + - name: Checkout docs sources + uses: actions/checkout@v6 + + - name: Checkout docs target branch + if: github.event_name == 'push' && (github.ref == 'refs/heads/main' || github.ref_type == 'tag') + uses: actions/checkout@v6 + with: + fetch-depth: 0 + ref: ${{ steps.target-branch.outputs.value }} + path: docs-target + + - name: Setup Python + uses: actions/setup-python@v6 + with: + python-version: "3.10" + + - name: Install dependencies + uses: astral-sh/setup-uv@5a095e7a2014a4212f075830d4f7277575a9d098 + with: + enable-cache: true + + # Download the Linux wheel built in the previous job + - name: Download pre-built Linux wheel + uses: actions/download-artifact@v8 + with: + name: dist-manylinux-x86_64 + path: wheels/ + + # Install from the pre-built wheels + - name: Install from pre-built wheels + run: | + set -x + uv venv + # Install documentation dependencies + uv sync --dev --no-install-package datafusion --group docs + # Install all pre-built wheels + WHEELS=$(find wheels/ -name "*.whl") + if [ -n "$WHEELS" ]; then + echo "Installing wheels:" + echo "$WHEELS" + uv pip install wheels/*.whl + else + echo "ERROR: No wheels found!" 
+ exit 1 + fi + + - name: Build docs + run: | + set -x + cd docs + curl -O https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv + curl -O https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet + uv run --no-project make html - # NOTE: PyPI publish needs to be done manually for now after release passed the vote - # release: - # name: Publish in PyPI - # needs: [build-manylinux, build-python-mac-win] - # runs-on: ubuntu-latest - # steps: - # - uses: actions/download-artifact@v3 - # - name: Publish to PyPI - # uses: pypa/gh-action-pypi-publish@master - # with: - # user: __token__ - # password: ${{ secrets.pypi_password }} + - name: Copy & push the generated HTML + if: github.event_name == 'push' && (github.ref == 'refs/heads/main' || github.ref_type == 'tag') + run: | + set -x + cd docs-target + # delete anything but: 1) '.'; 2) '..'; 3) .git/ + find ./ | grep -vE "^./$|^../$|^./.git" | xargs rm -rf + cp ../.asf.yaml . + cp -r ../docs/build/html/* . + git status --porcelain + if [ "$(git status --porcelain)" != "" ]; then + git config user.name "github-actions[bot]" + git config user.email "github-actions[bot]@users.noreply.github.com" + git add --all + git commit -m 'Publish built docs triggered by ${{ github.sha }}' + git push || git push --force + fi diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml new file mode 100644 index 000000000..ab284b522 --- /dev/null +++ b/.github/workflows/ci.yml @@ -0,0 +1,41 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. 
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +# CI workflow for pull requests - runs tests in DEBUG mode for faster feedback + +name: CI + +on: + pull_request: + branches: ["main"] + +concurrency: + group: ${{ github.repository }}-${{ github.head_ref || github.sha }}-${{ github.workflow }} + cancel-in-progress: true + +jobs: + build: + uses: ./.github/workflows/build.yml + with: + build_mode: debug + run_wheels: false + secrets: inherit + + test: + needs: build + uses: ./.github/workflows/test.yml + secrets: inherit diff --git a/.github/workflows/codeql.yml b/.github/workflows/codeql.yml new file mode 100644 index 000000000..a9855cf48 --- /dev/null +++ b/.github/workflows/codeql.yml @@ -0,0 +1,54 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+# + +name: "CodeQL" + +on: + push: + branches: [ "main" ] + pull_request: + branches: [ "main" ] + schedule: + - cron: '16 4 * * 1' + +permissions: + contents: read + +jobs: + analyze: + name: Analyze Actions + runs-on: ubuntu-latest + permissions: + contents: read + security-events: write + + steps: + - name: Checkout repository + uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6 + with: + persist-credentials: false + + - name: Initialize CodeQL + uses: github/codeql-action/init@c793b717bc78562f491db7b0e93a3a178b099162 # v4 + with: + languages: actions + + - name: Perform CodeQL Analysis + uses: github/codeql-action/analyze@c793b717bc78562f491db7b0e93a3a178b099162 # v4 + with: + category: "/language:actions" diff --git a/.github/workflows/dev.yml b/.github/workflows/dev.yml index 05cf8ce68..2c8ecbc5e 100644 --- a/.github/workflows/dev.yml +++ b/.github/workflows/dev.yml @@ -25,10 +25,10 @@ jobs: runs-on: ubuntu-latest steps: - name: Checkout - uses: actions/checkout@v3 + uses: actions/checkout@v6 - name: Setup Python - uses: actions/setup-python@v4 + uses: actions/setup-python@v6 with: - python-version: "3.10" + python-version: "3.14" - name: Audit licenses run: ./dev/release/run-rat.sh . 
diff --git a/.github/workflows/docs.yaml b/.github/workflows/docs.yaml deleted file mode 100644 index b1c9ffc12..000000000 --- a/.github/workflows/docs.yaml +++ /dev/null @@ -1,75 +0,0 @@ -on: - push: - branches: - - master - tags-ignore: - - "**-rc**" - -name: Deploy DataFusion Python site - -jobs: - build-docs: - name: Build docs - runs-on: ubuntu-latest - steps: - - name: Set target branch - id: target-branch - run: | - set -x - if test '${{ github.ref }}' = 'refs/heads/main'; then - echo "value=asf-staging" >> $GITHUB_OUTPUT - elif test '${{ github.ref_type }}' = 'tag'; then - echo "value=asf-site" >> $GITHUB_OUTPUT - else - echo "Unsupported input: ${{ github.ref }} / ${{ github.ref_type }}" - exit 1 - fi - - name: Checkout docs sources - uses: actions/checkout@v3 - - name: Checkout docs target branch - uses: actions/checkout@v3 - with: - fetch-depth: 0 - ref: ${{ steps.target-branch.outputs.value }} - path: docs-target - - name: Setup Python - uses: actions/setup-python@v4 - with: - python-version: "3.10" - - - name: Install dependencies - run: | - set -x - python3 -m venv venv - source venv/bin/activate - pip install -r requirements-310.txt - pip install -r docs/requirements.txt - - name: Build Datafusion - run: | - set -x - source venv/bin/activate - maturin develop - - - name: Build docs - run: | - set -x - source venv/bin/activate - cd docs - make html - - - name: Copy & push the generated HTML - run: | - set -x - cd docs-target - # delete anything but: 1) '.'; 2) '..'; 3) .git/ - find ./ | grep -vE "^./$|^../$|^./.git" | xargs rm -rf - cp ../.asf.yaml . - cp -r ../docs/build/html/* . 
- git status --porcelain - if [ "$(git status --porcelain)" != "" ]; then - git config user.name "github-actions[bot]" - git config user.email "github-actions[bot]@users.noreply.github.com" - git add --all - git commit -m 'Publish built docs triggered by ${{ github.sha }}' - git push || git push --force - fi diff --git a/.github/workflows/release.yml b/.github/workflows/release.yml new file mode 100644 index 000000000..bddc89eac --- /dev/null +++ b/.github/workflows/release.yml @@ -0,0 +1,49 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+ +# Release workflow - runs tests in RELEASE mode and builds distribution wheels +# Triggered on: +# - Merges to main +# - Release candidate tags (*-rc*) +# - Release tags (e.g., 45.0.0) + +name: Release Build + +on: + push: + branches: + - "main" + tags: + - "*-rc*" # Release candidates (e.g., 45.0.0-rc1) + - "[0-9]+.*" # Release tags (e.g., 45.0.0) + +concurrency: + group: ${{ github.repository }}-${{ github.head_ref || github.sha }}-${{ github.workflow }} + cancel-in-progress: true + +jobs: + build: + uses: ./.github/workflows/build.yml + with: + build_mode: release + run_wheels: true + secrets: inherit + + test: + needs: build + uses: ./.github/workflows/test.yml + secrets: inherit diff --git a/.github/workflows/take.yml b/.github/workflows/take.yml new file mode 100644 index 000000000..86dc190ad --- /dev/null +++ b/.github/workflows/take.yml @@ -0,0 +1,41 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+ +name: Assign the issue via a `take` comment +on: + issue_comment: + types: created + +permissions: + issues: write + +jobs: + issue_assign: + runs-on: ubuntu-latest + if: (!github.event.issue.pull_request) && github.event.comment.body == 'take' + concurrency: + group: ${{ github.actor }}-issue-assign + steps: + - run: | + CODE=$(curl -H "Authorization: token ${{ secrets.GITHUB_TOKEN }}" -LI https://api.github.com/repos/${{ github.repository }}/issues/${{ github.event.issue.number }}/assignees/${{ github.event.comment.user.login }} -o /dev/null -w '%{http_code}\n' -s) + if [ "$CODE" -eq "204" ] + then + echo "Assigning issue ${{ github.event.issue.number }} to ${{ github.event.comment.user.login }}" + curl -H "Authorization: token ${{ secrets.GITHUB_TOKEN }}" -d '{"assignees": ["${{ github.event.comment.user.login }}"]}' https://api.github.com/repos/${{ github.repository }}/issues/${{ github.event.issue.number }}/assignees + else + echo "Cannot assign issue ${{ github.event.issue.number }} to ${{ github.event.comment.user.login }}" + fi \ No newline at end of file diff --git a/.github/workflows/test.yaml b/.github/workflows/test.yaml deleted file mode 100644 index 164b09e15..000000000 --- a/.github/workflows/test.yaml +++ /dev/null @@ -1,115 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. 
See the License for the -# specific language governing permissions and limitations -# under the License. - -name: Python test -on: - push: - branches: [main] - pull_request: - branches: [main] - -concurrency: - group: ${{ github.repository }}-${{ github.head_ref || github.sha }}-${{ github.workflow }} - cancel-in-progress: true - -jobs: - test-matrix: - runs-on: ubuntu-latest - strategy: - fail-fast: false - matrix: - python-version: - - "3.10" - toolchain: - - "stable" - - "beta" - # we are not that much eager in walking on the edge yet - # - nightly - # build stable for only 3.7 - include: - - python-version: "3.7" - toolchain: "stable" - steps: - - uses: actions/checkout@v3 - - - name: Setup Rust Toolchain - uses: actions-rs/toolchain@v1 - id: rust-toolchain - with: - toolchain: ${{ matrix.toolchain }} - override: true - - - name: Install Protoc - uses: arduino/setup-protoc@v1 - with: - version: '3.x' - repo-token: ${{ secrets.GITHUB_TOKEN }} - - - name: Setup Python - uses: actions/setup-python@v4 - with: - python-version: ${{ matrix.python-version }} - - - name: Cache Cargo - uses: actions/cache@v3 - with: - path: ~/.cargo - key: cargo-cache-${{ steps.rust-toolchain.outputs.rustc_hash }}-${{ hashFiles('Cargo.lock') }} - - - name: Check Formatting - uses: actions-rs/cargo@v1 - if: ${{ matrix.python-version == '3.10' && matrix.toolchain == 'stable' }} - with: - command: fmt - args: -- --check - - - name: Run Clippy - uses: actions-rs/cargo@v1 - if: ${{ matrix.python-version == '3.10' && matrix.toolchain == 'stable' }} - with: - command: clippy - args: --all-targets --all-features -- -D clippy::all -A clippy::redundant_closure - - - name: Create Virtualenv (3.10) - if: ${{ matrix.python-version == '3.10' }} - run: | - python -m venv venv - source venv/bin/activate - pip install -r requirements-310.txt - - - name: Create Virtualenv (3.7) - if: ${{ matrix.python-version == '3.7' }} - run: | - python -m venv venv - source venv/bin/activate - pip install -r 
requirements-37.txt - - - name: Run Python Linters - if: ${{ matrix.python-version == '3.10' && matrix.toolchain == 'stable' }} - run: | - source venv/bin/activate - flake8 --exclude venv --ignore=E501,W503 - black --line-length 79 --diff --check . - - - name: Run tests - env: - RUST_BACKTRACE: 1 - run: | - git submodule update --init - source venv/bin/activate - pip install -e . -vv - pytest -v . diff --git a/.github/workflows/test.yml b/.github/workflows/test.yml new file mode 100644 index 000000000..706ccbc55 --- /dev/null +++ b/.github/workflows/test.yml @@ -0,0 +1,117 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+ +# Reusable workflow for running tests +# This ensures the same tests run for both debug (PRs) and release (main/tags) builds + +name: Test + +on: + workflow_call: + +env: + UV_LOCKED: true + +jobs: + test-matrix: + runs-on: ubuntu-latest + strategy: + fail-fast: false + matrix: + python-version: + - "3.10" + - "3.11" + - "3.12" + - "3.13" + - "3.14" + toolchain: + - "stable" + + steps: + - uses: actions/checkout@v6 + + - name: Setup Python + uses: actions/setup-python@v6 + with: + python-version: ${{ matrix.python-version }} + + - name: Cache Cargo + uses: actions/cache@v5 + with: + path: ~/.cargo + key: cargo-cache-${{ matrix.toolchain }}-${{ hashFiles('Cargo.lock') }} + + - name: Install dependencies + uses: astral-sh/setup-uv@5a095e7a2014a4212f075830d4f7277575a9d098 + with: + enable-cache: true + + # Download the Linux wheel built in the build workflow + - name: Download pre-built Linux wheel + uses: actions/download-artifact@v8 + with: + name: dist-manylinux-x86_64 + path: wheels/ + + # Download the FFI test wheel + - name: Download pre-built FFI test wheel + uses: actions/download-artifact@v8 + with: + name: test-ffi-manylinux-x86_64 + path: wheels/ + + # Install from the pre-built wheels + - name: Install from pre-built wheels + run: | + set -x + uv venv + # Install development dependencies + uv sync --dev --no-install-package datafusion + # Install all pre-built wheels + WHEELS=$(find wheels/ -name "*.whl") + if [ -n "$WHEELS" ]; then + echo "Installing wheels:" + echo "$WHEELS" + uv pip install wheels/*.whl + else + echo "ERROR: No wheels found!" 
+ exit 1 + fi + + - name: Run tests + env: + RUST_BACKTRACE: 1 + run: | + git submodule update --init + uv run --no-project pytest -v --import-mode=importlib + + - name: FFI unit tests + run: | + cd examples/datafusion-ffi-example + uv run --no-project pytest python/tests/_test*.py + + - name: Run tpchgen-cli to create 1 Gb dataset + run: | + mkdir examples/tpch/data + cd examples/tpch/data + uv pip install tpchgen-cli + uv run --no-project tpchgen-cli -s 1 --format=parquet + + - name: Run TPC-H examples + run: | + cd examples/tpch + uv run --no-project pytest _tests.py diff --git a/.github/workflows/verify-release-candidate.yml b/.github/workflows/verify-release-candidate.yml new file mode 100644 index 000000000..a10a4faa9 --- /dev/null +++ b/.github/workflows/verify-release-candidate.yml @@ -0,0 +1,78 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +name: Verify Release Candidate + +# NOTE: This workflow is intended to be run manually via workflow_dispatch. 
+ +on: + workflow_dispatch: + inputs: + version: + description: Version number (e.g., 52.0.0) + required: true + type: string + rc_number: + description: Release candidate number (e.g., 0) + required: true + type: string + +concurrency: + group: ${{ github.repository }}-${{ github.ref }}-${{ github.workflow }} + cancel-in-progress: true + +jobs: + verify: + name: Verify RC (${{ matrix.os }}-${{ matrix.arch }}) + strategy: + fail-fast: false + matrix: + include: + # Linux + - os: linux + arch: x64 + runner: ubuntu-latest + - os: linux + arch: arm64 + runner: ubuntu-24.04-arm + + # macOS + - os: macos + arch: arm64 + runner: macos-latest + - os: macos + arch: x64 + runner: macos-15-intel + + # Windows + - os: windows + arch: x64 + runner: windows-latest + runs-on: ${{ matrix.runner }} + steps: + - name: Checkout repository + uses: actions/checkout@v6 + + - name: Set up protoc + uses: arduino/setup-protoc@v3 + with: + version: "27.4" + repo-token: ${{ secrets.GITHUB_TOKEN }} + + - name: Run release candidate verification + shell: bash + run: ./dev/release/verify-release-candidate.sh "${{ inputs.version }}" "${{ inputs.rc_number }}" diff --git a/.gitignore b/.gitignore index 4e4450082..614d82327 100644 --- a/.gitignore +++ b/.gitignore @@ -4,22 +4,35 @@ target /docs/temp /docs/build .DS_Store +.vscode # Byte-compiled / optimized / DLL files __pycache__/ *.py[cod] *$py.class +# Python dist ignore +dist + # C extensions *.so +# Python dist +dist + # pyenv # For a library or package, you might want to ignore these files since the code is # intended to run in multiple environments; otherwise, check them in: .python-version venv +.venv apache-rat-*.jar *rat.txt .env -CHANGELOG.md.bak \ No newline at end of file +CHANGELOG.md.bak + +docs/mdbook/book + +.pyo3_build_config + diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml index 3c6805322..2d3c2bc59 100644 --- a/.pre-commit-config.yaml +++ b/.pre-commit-config.yaml @@ -16,37 +16,49 @@ # under the License. 
repos: - - repo: https://github.com/psf/black - rev: 22.3.0 + - repo: https://github.com/rhysd/actionlint + rev: v1.7.6 hooks: - - id: black - files: datafusion/.* - # Explicitly specify the pyproject.toml at the repo root, not per-project. - args: ["--config", "pyproject.toml", "--line-length", "79", "--diff", "--check", "."] - - repo: https://github.com/PyCQA/flake8 - rev: 5.0.4 + - id: actionlint-docker + - repo: https://github.com/astral-sh/ruff-pre-commit + # Ruff version. + rev: v0.15.1 hooks: - - id: flake8 - files: datafusion/.*$ - types: [file] - types_or: [python] - additional_dependencies: ["flake8-force"] + # Run the linter. + - id: ruff + # Run the formatter. + - id: ruff-format - repo: local hooks: - id: rust-fmt name: Rust fmt description: Run cargo fmt on files included in the commit. rustfmt should be installed before-hand. - entry: cargo fmt --all -- + entry: cargo +nightly fmt --all -- pass_filenames: true types: [file, rust] language: system - id: rust-clippy name: Rust clippy description: Run cargo clippy on files included in the commit. clippy should be installed before-hand. - entry: cargo clippy --all-targets --all-features -- -Dclippy::all -Aclippy::redundant_closure + entry: cargo clippy --all-targets --all-features -- -Dclippy::all -D warnings -Aclippy::redundant_closure pass_filenames: false types: [file, rust] language: system + - repo: https://github.com/codespell-project/codespell + rev: v2.4.1 + hooks: + - id: codespell + args: [ --toml, "pyproject.toml"] + additional_dependencies: + - tomli + + - repo: https://github.com/astral-sh/uv-pre-commit + # uv version. + rev: 0.10.7 + hooks: + # Update the uv lockfile + - id: uv-lock + default_language_version: python: python3 diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 000000000..86c2e9c3b --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,56 @@ + + +# Agent Instructions + +This project uses AI agent skills stored in `.ai/skills/`. 
Each skill is a directory containing a `SKILL.md` file with instructions for performing a specific task. + +Skills follow the [Agent Skills](https://agentskills.io) open standard. Each skill directory contains: + +- `SKILL.md` — The skill definition with YAML frontmatter (name, description, argument-hint) and detailed instructions. +- Additional supporting files as needed. + +## Python Function Docstrings + +Every Python function must include a docstring with usage examples. + +- **Examples are required**: Each function needs at least one doctest-style example + demonstrating basic usage. +- **Optional parameters**: If a function has optional parameters, include separate + examples that show usage both without and with the optional arguments. Pass + optional arguments using their keyword name (e.g., `step=dfn.lit(3)`) so readers + can immediately see which parameter is being demonstrated. +- **Reuse input data**: Use the same input data across examples wherever possible. + The examples should demonstrate how different optional arguments change the output + for the same input, making the effect of each option easy to understand. +- **Alias functions**: Functions that are simple aliases (e.g., `list_sort` aliasing + `array_sort`) only need a one-line description and a `See Also` reference to the + primary function. They do not need their own examples. + +## Aggregate and Window Function Documentation + +When adding or updating an aggregate or window function, ensure the corresponding +site documentation is kept in sync: + +- **Aggregations**: `docs/source/user-guide/common-operations/aggregations.rst` — + add new aggregate functions to the "Aggregate Functions" list and include usage + examples if appropriate. +- **Window functions**: `docs/source/user-guide/common-operations/windows.rst` — + add new window functions to the "Available Functions" list and include usage + examples if appropriate. 
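The docstring convention added to AGENTS.md above can be sketched with a small example. The function name and values below are invented purely for illustration; they are not part of the datafusion API:

```python
# Hypothetical helper illustrating the AGENTS.md docstring rules:
# a required doctest-style example, plus a separate example that passes
# the optional argument by keyword so readers can see which parameter
# is being demonstrated. The same input data is reused across examples.
def add_n(values, n=1):
    """Add ``n`` to every element of ``values``.

    Basic usage::

        >>> add_n([1, 2, 3])
        [2, 3, 4]

    With the optional argument passed by keyword::

        >>> add_n([1, 2, 3], n=10)
        [11, 12, 13]
    """
    return [v + n for v in values]


if __name__ == "__main__":
    import doctest

    doctest.testmod()
```

Running the module executes the doctests, so the examples double as a regression check on the documented behavior.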
diff --git a/CHANGELOG.md b/CHANGELOG.md index 0cf05413a..ae40911d8 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -17,279 +17,6 @@ under the License. --> -# Changelog +# DataFusion Python Changelog -## [20.0.0](https://github.com/apache/arrow-datafusion-python/tree/20.0.0) (2023-03-17) - -[Full Changelog](https://github.com/apache/arrow-datafusion-python/compare/0.8.0...20.0.0) - -**Implemented enhancements:** - -- Empty relation bindings [#208](https://github.com/apache/arrow-datafusion-python/pull/208) (jdye64) -- wrap display_name and canonical_name functions [#214](https://github.com/apache/arrow-datafusion-python/pull/214) (jdye64) -- Add PyAlias bindings [#216](https://github.com/apache/arrow-datafusion-python/pull/216) (jdye64) -- Add bindings for scalar_variable [#218](https://github.com/apache/arrow-datafusion-python/pull/218) (jdye64) -- Bindings for LIKE type expressions [#220](https://github.com/apache/arrow-datafusion-python/pull/220) (jdye64) -- Bool expr bindings [#223](https://github.com/apache/arrow-datafusion-python/pull/223) (jdye64) -- Between bindings [#229](https://github.com/apache/arrow-datafusion-python/pull/229) (jdye64) -- Add bindings for GetIndexedField [#227](https://github.com/apache/arrow-datafusion-python/pull/227) (jdye64) -- Add bindings for case, cast, and trycast [#232](https://github.com/apache/arrow-datafusion-python/pull/232) (jdye64) -- add remaining expr bindings [#233](https://github.com/apache/arrow-datafusion-python/pull/233) (jdye64) -- feature: Additional export methods [#236](https://github.com/apache/arrow-datafusion-python/pull/236) (simicd) -- Add Python wrapper for LogicalPlan::Union [#240](https://github.com/apache/arrow-datafusion-python/pull/240) (iajoiner) -- feature: Create dataframe from pandas, polars, dictionary, list or pyarrow Table [#242](https://github.com/apache/arrow-datafusion-python/pull/242) (simicd) -- Add Python wrappers for `LogicalPlan::Join` and `LogicalPlan::CrossJoin` 
[#246](https://github.com/apache/arrow-datafusion-python/pull/246) (iajoiner) -- feature: Set table name from ctx functions [#260](https://github.com/apache/arrow-datafusion-python/pull/260) (simicd) -- Explain bindings [#264](https://github.com/apache/arrow-datafusion-python/pull/264) (jdye64) -- Extension bindings [#266](https://github.com/apache/arrow-datafusion-python/pull/266) (jdye64) -- Subquery alias bindings [#269](https://github.com/apache/arrow-datafusion-python/pull/269) (jdye64) -- Create memory table [#271](https://github.com/apache/arrow-datafusion-python/pull/271) (jdye64) -- Create view bindings [#273](https://github.com/apache/arrow-datafusion-python/pull/273) (jdye64) -- Re-export Datafusion dependencies [#277](https://github.com/apache/arrow-datafusion-python/pull/277) (jdye64) -- Distinct bindings [#275](https://github.com/apache/arrow-datafusion-python/pull/275) (jdye64) -- Drop table bindings [#283](https://github.com/apache/arrow-datafusion-python/pull/283) (jdye64) -- Bindings for LogicalPlan::Repartition [#285](https://github.com/apache/arrow-datafusion-python/pull/285) (jdye64) -- Expand Rust return type support for Arrow DataTypes in ScalarValue [#287](https://github.com/apache/arrow-datafusion-python/pull/287) (jdye64) - -**Documentation updates:** - -- docs: Example of calling Python UDF & UDAF in SQL [#258](https://github.com/apache/arrow-datafusion-python/pull/258) (simicd) - -**Merged pull requests:** - -- Minor docs updates [#210](https://github.com/apache/arrow-datafusion-python/pull/210) (andygrove) -- Empty relation bindings [#208](https://github.com/apache/arrow-datafusion-python/pull/208) (jdye64) -- wrap display_name and canonical_name functions [#214](https://github.com/apache/arrow-datafusion-python/pull/214) (jdye64) -- Add PyAlias bindings [#216](https://github.com/apache/arrow-datafusion-python/pull/216) (jdye64) -- Add bindings for scalar_variable [#218](https://github.com/apache/arrow-datafusion-python/pull/218) 
(jdye64) -- Bindings for LIKE type expressions [#220](https://github.com/apache/arrow-datafusion-python/pull/220) (jdye64) -- Bool expr bindings [#223](https://github.com/apache/arrow-datafusion-python/pull/223) (jdye64) -- Between bindings [#229](https://github.com/apache/arrow-datafusion-python/pull/229) (jdye64) -- Add bindings for GetIndexedField [#227](https://github.com/apache/arrow-datafusion-python/pull/227) (jdye64) -- Add bindings for case, cast, and trycast [#232](https://github.com/apache/arrow-datafusion-python/pull/232) (jdye64) -- add remaining expr bindings [#233](https://github.com/apache/arrow-datafusion-python/pull/233) (jdye64) -- Pre-commit hooks [#228](https://github.com/apache/arrow-datafusion-python/pull/228) (jdye64) -- Implement new release process [#149](https://github.com/apache/arrow-datafusion-python/pull/149) (andygrove) -- feature: Additional export methods [#236](https://github.com/apache/arrow-datafusion-python/pull/236) (simicd) -- Add Python wrapper for LogicalPlan::Union [#240](https://github.com/apache/arrow-datafusion-python/pull/240) (iajoiner) -- feature: Create dataframe from pandas, polars, dictionary, list or pyarrow Table [#242](https://github.com/apache/arrow-datafusion-python/pull/242) (simicd) -- Fix release instructions [#238](https://github.com/apache/arrow-datafusion-python/pull/238) (andygrove) -- Add Python wrappers for `LogicalPlan::Join` and `LogicalPlan::CrossJoin` [#246](https://github.com/apache/arrow-datafusion-python/pull/246) (iajoiner) -- docs: Example of calling Python UDF & UDAF in SQL [#258](https://github.com/apache/arrow-datafusion-python/pull/258) (simicd) -- feature: Set table name from ctx functions [#260](https://github.com/apache/arrow-datafusion-python/pull/260) (simicd) -- Upgrade to DataFusion 19 [#262](https://github.com/apache/arrow-datafusion-python/pull/262) (andygrove) -- Explain bindings [#264](https://github.com/apache/arrow-datafusion-python/pull/264) (jdye64) -- Extension bindings 
[#266](https://github.com/apache/arrow-datafusion-python/pull/266) (jdye64) -- Subquery alias bindings [#269](https://github.com/apache/arrow-datafusion-python/pull/269) (jdye64) -- Create memory table [#271](https://github.com/apache/arrow-datafusion-python/pull/271) (jdye64) -- Create view bindings [#273](https://github.com/apache/arrow-datafusion-python/pull/273) (jdye64) -- Re-export Datafusion dependencies [#277](https://github.com/apache/arrow-datafusion-python/pull/277) (jdye64) -- Distinct bindings [#275](https://github.com/apache/arrow-datafusion-python/pull/275) (jdye64) -- build(deps): bump actions/checkout from 2 to 3 [#244](https://github.com/apache/arrow-datafusion-python/pull/244) (dependabot[bot]) -- build(deps): bump actions/upload-artifact from 2 to 3 [#245](https://github.com/apache/arrow-datafusion-python/pull/245) (dependabot[bot]) -- build(deps): bump actions/download-artifact from 2 to 3 [#243](https://github.com/apache/arrow-datafusion-python/pull/243) (dependabot[bot]) -- Use DataFusion 20 [#278](https://github.com/apache/arrow-datafusion-python/pull/278) (andygrove) -- Drop table bindings [#283](https://github.com/apache/arrow-datafusion-python/pull/283) (jdye64) -- Bindings for LogicalPlan::Repartition [#285](https://github.com/apache/arrow-datafusion-python/pull/285) (jdye64) -- Expand Rust return type support for Arrow DataTypes in ScalarValue [#287](https://github.com/apache/arrow-datafusion-python/pull/287) (jdye64) - -## [0.8.0](https://github.com/apache/arrow-datafusion-python/tree/0.8.0) (2023-02-22) - -[Full Changelog](https://github.com/apache/arrow-datafusion-python/compare/0.8.0-rc1...0.8.0) - -**Implemented enhancements:** - -- Add support for cuDF physical execution engine [\#202](https://github.com/apache/arrow-datafusion-python/issues/202) -- Make it easier to create a Pandas dataframe from DataFusion query results [\#139](https://github.com/apache/arrow-datafusion-python/issues/139) - -**Fixed bugs:** - -- Build error: 
could not compile `thiserror` due to 2 previous errors [\#69](https://github.com/apache/arrow-datafusion-python/issues/69) - -**Closed issues:** - -- Integrate with the new `object_store` crate [\#22](https://github.com/apache/arrow-datafusion-python/issues/22) - -**Merged pull requests:** - -- Update README in preparation for 0.8 release [\#206](https://github.com/apache/arrow-datafusion-python/pull/206) ([andygrove](https://github.com/andygrove)) -- Add support for cudf as a physical execution engine [\#205](https://github.com/apache/arrow-datafusion-python/pull/205) ([jdye64](https://github.com/jdye64)) -- Run `maturin develop` instead of `cargo build` in verification script [\#200](https://github.com/apache/arrow-datafusion-python/pull/200) ([andygrove](https://github.com/andygrove)) -- Add tests for recently added functionality [\#199](https://github.com/apache/arrow-datafusion-python/pull/199) ([andygrove](https://github.com/andygrove)) -- Implement `to_pandas()` [\#197](https://github.com/apache/arrow-datafusion-python/pull/197) ([simicd](https://github.com/simicd)) -- Add Python wrapper for LogicalPlan::Sort [\#196](https://github.com/apache/arrow-datafusion-python/pull/196) ([andygrove](https://github.com/andygrove)) -- Add Python wrapper for LogicalPlan::Aggregate [\#195](https://github.com/apache/arrow-datafusion-python/pull/195) ([andygrove](https://github.com/andygrove)) -- Add Python wrapper for LogicalPlan::Limit [\#193](https://github.com/apache/arrow-datafusion-python/pull/193) ([andygrove](https://github.com/andygrove)) -- Add Python wrapper for LogicalPlan::Filter [\#192](https://github.com/apache/arrow-datafusion-python/pull/192) ([andygrove](https://github.com/andygrove)) -- Add experimental support for executing SQL with Polars and Pandas [\#190](https://github.com/apache/arrow-datafusion-python/pull/190) ([andygrove](https://github.com/andygrove)) -- Update changelog for 0.8 release 
[\#188](https://github.com/apache/arrow-datafusion-python/pull/188) ([andygrove](https://github.com/andygrove)) -- Add ability to execute ExecutionPlan and get a stream of RecordBatch [\#186](https://github.com/apache/arrow-datafusion-python/pull/186) ([andygrove](https://github.com/andygrove)) -- Dffield bindings [\#185](https://github.com/apache/arrow-datafusion-python/pull/185) ([jdye64](https://github.com/jdye64)) -- Add bindings for DFSchema [\#183](https://github.com/apache/arrow-datafusion-python/pull/183) ([jdye64](https://github.com/jdye64)) -- test: Window functions [\#182](https://github.com/apache/arrow-datafusion-python/pull/182) ([simicd](https://github.com/simicd)) -- Add bindings for Projection [\#180](https://github.com/apache/arrow-datafusion-python/pull/180) ([jdye64](https://github.com/jdye64)) -- Table scan bindings [\#178](https://github.com/apache/arrow-datafusion-python/pull/178) ([jdye64](https://github.com/jdye64)) -- Make session configurable [\#176](https://github.com/apache/arrow-datafusion-python/pull/176) ([andygrove](https://github.com/andygrove)) -- Upgrade to DataFusion 18.0.0 [\#175](https://github.com/apache/arrow-datafusion-python/pull/175) ([andygrove](https://github.com/andygrove)) -- Use latest DataFusion rev in preparation for DF 18 release [\#174](https://github.com/apache/arrow-datafusion-python/pull/174) ([andygrove](https://github.com/andygrove)) -- Arrow type bindings [\#173](https://github.com/apache/arrow-datafusion-python/pull/173) ([jdye64](https://github.com/jdye64)) -- Pyo3 bump [\#171](https://github.com/apache/arrow-datafusion-python/pull/171) ([jdye64](https://github.com/jdye64)) -- feature: Add additional aggregation functions [\#170](https://github.com/apache/arrow-datafusion-python/pull/170) ([simicd](https://github.com/simicd)) -- Make from\_substrait\_plan return DataFrame instead of LogicalPlan [\#164](https://github.com/apache/arrow-datafusion-python/pull/164) ([andygrove](https://github.com/andygrove)) 
-- feature: Implement count method [\#163](https://github.com/apache/arrow-datafusion-python/pull/163) ([simicd](https://github.com/simicd)) -- CI Fixes [\#162](https://github.com/apache/arrow-datafusion-python/pull/162) ([jdye64](https://github.com/jdye64)) -- Upgrade to DataFusion 17 [\#160](https://github.com/apache/arrow-datafusion-python/pull/160) ([andygrove](https://github.com/andygrove)) -- feature: Improve string representation of datafusion classes [\#159](https://github.com/apache/arrow-datafusion-python/pull/159) ([simicd](https://github.com/simicd)) -- Make PyExecutionPlan.plan public [\#156](https://github.com/apache/arrow-datafusion-python/pull/156) ([andygrove](https://github.com/andygrove)) -- Expose methods on logical and execution plans [\#155](https://github.com/apache/arrow-datafusion-python/pull/155) ([andygrove](https://github.com/andygrove)) -- Fix clippy for new Rust version [\#154](https://github.com/apache/arrow-datafusion-python/pull/154) ([andygrove](https://github.com/andygrove)) -- Add DataFrame methods for accessing plans [\#153](https://github.com/apache/arrow-datafusion-python/pull/153) ([andygrove](https://github.com/andygrove)) -- Use DataFusion rev 5238e8c97f998b4d2cb9fab85fb182f325a1a7fb [\#150](https://github.com/apache/arrow-datafusion-python/pull/150) ([andygrove](https://github.com/andygrove)) -- build\(deps\): bump async-trait from 0.1.61 to 0.1.62 [\#148](https://github.com/apache/arrow-datafusion-python/pull/148) ([dependabot[bot]](https://github.com/apps/dependabot)) -- Rename default branch from master to main [\#147](https://github.com/apache/arrow-datafusion-python/pull/147) ([andygrove](https://github.com/andygrove)) -- Substrait bindings [\#145](https://github.com/apache/arrow-datafusion-python/pull/145) ([jdye64](https://github.com/jdye64)) -- build\(deps\): bump uuid from 0.8.2 to 1.2.2 [\#143](https://github.com/apache/arrow-datafusion-python/pull/143) ([dependabot[bot]](https://github.com/apps/dependabot)) -- 
Prepare for 0.8.0 release [\#141](https://github.com/apache/arrow-datafusion-python/pull/141) ([andygrove](https://github.com/andygrove)) -- Improve README and add more examples [\#137](https://github.com/apache/arrow-datafusion-python/pull/137) ([andygrove](https://github.com/andygrove)) -- test: Expand tests for built-in functions [\#129](https://github.com/apache/arrow-datafusion-python/pull/129) ([simicd](https://github.com/simicd)) -- build\(deps\): bump object\_store from 0.5.2 to 0.5.3 [\#126](https://github.com/apache/arrow-datafusion-python/pull/126) ([dependabot[bot]](https://github.com/apps/dependabot)) -- build\(deps\): bump mimalloc from 0.1.32 to 0.1.34 [\#125](https://github.com/apache/arrow-datafusion-python/pull/125) ([dependabot[bot]](https://github.com/apps/dependabot)) -- Introduce conda directory containing datafusion-dev.yaml conda enviro… [\#124](https://github.com/apache/arrow-datafusion-python/pull/124) ([jdye64](https://github.com/jdye64)) -- build\(deps\): bump bzip2 from 0.4.3 to 0.4.4 [\#121](https://github.com/apache/arrow-datafusion-python/pull/121) ([dependabot[bot]](https://github.com/apps/dependabot)) -- build\(deps\): bump tokio from 1.23.0 to 1.24.1 [\#119](https://github.com/apache/arrow-datafusion-python/pull/119) ([dependabot[bot]](https://github.com/apps/dependabot)) -- build\(deps\): bump async-trait from 0.1.60 to 0.1.61 [\#118](https://github.com/apache/arrow-datafusion-python/pull/118) ([dependabot[bot]](https://github.com/apps/dependabot)) -- Upgrade to DataFusion 16.0.0 [\#115](https://github.com/apache/arrow-datafusion-python/pull/115) ([andygrove](https://github.com/andygrove)) -- Bump async-trait from 0.1.57 to 0.1.60 [\#114](https://github.com/apache/arrow-datafusion-python/pull/114) ([dependabot[bot]](https://github.com/apps/dependabot)) -- Bump object\_store from 0.5.1 to 0.5.2 [\#112](https://github.com/apache/arrow-datafusion-python/pull/112) ([dependabot[bot]](https://github.com/apps/dependabot)) -- Bump tokio 
from 1.21.2 to 1.23.0 [\#109](https://github.com/apache/arrow-datafusion-python/pull/109) ([dependabot[bot]](https://github.com/apps/dependabot)) -- Add entries for publishing production \(asf-site\) and staging docs [\#107](https://github.com/apache/arrow-datafusion-python/pull/107) ([martin-g](https://github.com/martin-g)) -- Add a workflow that builds the docs and deploys them at staged or production [\#104](https://github.com/apache/arrow-datafusion-python/pull/104) ([martin-g](https://github.com/martin-g)) -- Upgrade to DataFusion 15.0.0 [\#103](https://github.com/apache/arrow-datafusion-python/pull/103) ([andygrove](https://github.com/andygrove)) -- build\(deps\): bump futures from 0.3.24 to 0.3.25 [\#102](https://github.com/apache/arrow-datafusion-python/pull/102) ([dependabot[bot]](https://github.com/apps/dependabot)) -- build\(deps\): bump pyo3 from 0.17.2 to 0.17.3 [\#101](https://github.com/apache/arrow-datafusion-python/pull/101) ([dependabot[bot]](https://github.com/apps/dependabot)) -- build\(deps\): bump mimalloc from 0.1.30 to 0.1.32 [\#98](https://github.com/apache/arrow-datafusion-python/pull/98) ([dependabot[bot]](https://github.com/apps/dependabot)) -- build\(deps\): bump rand from 0.7.3 to 0.8.5 [\#97](https://github.com/apache/arrow-datafusion-python/pull/97) ([dependabot[bot]](https://github.com/apps/dependabot)) -- Fix GitHub actions warnings [\#95](https://github.com/apache/arrow-datafusion-python/pull/95) ([martin-g](https://github.com/martin-g)) -- Fixes \#81 - Add CI workflow for source distribution [\#93](https://github.com/apache/arrow-datafusion-python/pull/93) ([martin-g](https://github.com/martin-g)) -- post-release updates [\#91](https://github.com/apache/arrow-datafusion-python/pull/91) ([andygrove](https://github.com/andygrove)) -- Build for manylinux 2014 [\#88](https://github.com/apache/arrow-datafusion-python/pull/88) ([martin-g](https://github.com/martin-g)) -- update release readme tag 
[\#86](https://github.com/apache/arrow-datafusion-python/pull/86) ([Jimexist](https://github.com/Jimexist)) -- Upgrade Maturin to 0.14.2 [\#85](https://github.com/apache/arrow-datafusion-python/pull/85) ([martin-g](https://github.com/martin-g)) -- Update release instructions [\#83](https://github.com/apache/arrow-datafusion-python/pull/83) ([andygrove](https://github.com/andygrove)) -- \[Functions\] - Add python function binding to `functions` [\#73](https://github.com/apache/arrow-datafusion-python/pull/73) ([francis-du](https://github.com/francis-du)) - -## [0.8.0-rc1](https://github.com/apache/arrow-datafusion-python/tree/0.8.0-rc1) (2023-02-17) - -[Full Changelog](https://github.com/apache/arrow-datafusion-python/compare/0.7.0-rc2...0.8.0-rc1) - -**Implemented enhancements:** - -- Add bindings for datafusion\_common::DFField [\#184](https://github.com/apache/arrow-datafusion-python/issues/184) -- Add bindings for DFSchema/DFSchemaRef [\#181](https://github.com/apache/arrow-datafusion-python/issues/181) -- Add bindings for datafusion\_expr Projection [\#179](https://github.com/apache/arrow-datafusion-python/issues/179) -- Add bindings for `TableScan` struct from `datafusion_expr::TableScan` [\#177](https://github.com/apache/arrow-datafusion-python/issues/177) -- Add a "mapping" struct for types [\#172](https://github.com/apache/arrow-datafusion-python/issues/172) -- Improve string representation of datafusion classes \(dataframe, context, expression, ...\) [\#158](https://github.com/apache/arrow-datafusion-python/issues/158) -- Add DataFrame count method [\#151](https://github.com/apache/arrow-datafusion-python/issues/151) -- \[REQUEST\] Github Actions Improvements [\#146](https://github.com/apache/arrow-datafusion-python/issues/146) -- Change default branch name from master to main [\#144](https://github.com/apache/arrow-datafusion-python/issues/144) -- Bump pyo3 to 0.18.0 [\#140](https://github.com/apache/arrow-datafusion-python/issues/140) -- Add script for 
Python linting [\#134](https://github.com/apache/arrow-datafusion-python/issues/134) -- Add Python bindings for substrait module [\#132](https://github.com/apache/arrow-datafusion-python/issues/132) -- Expand unit tests for built-in functions [\#128](https://github.com/apache/arrow-datafusion-python/issues/128) -- support creating arrow-datafusion-python conda environment [\#122](https://github.com/apache/arrow-datafusion-python/issues/122) -- Build Python source distribution in GitHub workflow [\#81](https://github.com/apache/arrow-datafusion-python/issues/81) -- EPIC: Add all functions to python binding `functions` [\#72](https://github.com/apache/arrow-datafusion-python/issues/72) - -**Fixed bugs:** - -- Build is broken [\#161](https://github.com/apache/arrow-datafusion-python/issues/161) -- Out of memory when sorting [\#157](https://github.com/apache/arrow-datafusion-python/issues/157) -- window\_lead test appears to be non-deterministic [\#135](https://github.com/apache/arrow-datafusion-python/issues/135) -- Reading csv does not work [\#130](https://github.com/apache/arrow-datafusion-python/issues/130) -- Github actions produce a lot of warnings [\#94](https://github.com/apache/arrow-datafusion-python/issues/94) -- ASF source release tarball has wrong directory name [\#90](https://github.com/apache/arrow-datafusion-python/issues/90) -- Python Release Build failing after upgrading to maturin 14.2 [\#87](https://github.com/apache/arrow-datafusion-python/issues/87) -- Maturin build hangs on Linux ARM64 [\#84](https://github.com/apache/arrow-datafusion-python/issues/84) -- Cannot install on Mac M1 from source tarball from testpypi [\#82](https://github.com/apache/arrow-datafusion-python/issues/82) -- ImportPathMismatchError when running pytest locally [\#77](https://github.com/apache/arrow-datafusion-python/issues/77) - -**Closed issues:** - -- Publish documentation for Python bindings [\#39](https://github.com/apache/arrow-datafusion-python/issues/39) -- Add 
Python binding for `approx_median` [\#32](https://github.com/apache/arrow-datafusion-python/issues/32) -- Release version 0.7.0 [\#7](https://github.com/apache/arrow-datafusion-python/issues/7) - -## [0.7.0-rc2](https://github.com/apache/arrow-datafusion-python/tree/0.7.0-rc2) (2022-11-26) - -[Full Changelog](https://github.com/apache/arrow-datafusion-python/compare/0.7.0...0.7.0-rc2) - - -## [Unreleased](https://github.com/datafusion-contrib/datafusion-python/tree/HEAD) - -[Full Changelog](https://github.com/datafusion-contrib/datafusion-python/compare/0.5.1...HEAD) - -**Merged pull requests:** - -- use \_\_getitem\_\_ for df column selection [\#41](https://github.com/datafusion-contrib/datafusion-python/pull/41) ([Jimexist](https://github.com/Jimexist)) -- fix demo in readme [\#40](https://github.com/datafusion-contrib/datafusion-python/pull/40) ([Jimexist](https://github.com/Jimexist)) -- Implement select_columns [\#39](https://github.com/datafusion-contrib/datafusion-python/pull/39) ([andygrove](https://github.com/andygrove)) -- update readme and changelog [\#38](https://github.com/datafusion-contrib/datafusion-python/pull/38) ([Jimexist](https://github.com/Jimexist)) -- Add PyDataFrame.explain [\#36](https://github.com/datafusion-contrib/datafusion-python/pull/36) ([andygrove](https://github.com/andygrove)) -- Release 0.5.0 [\#34](https://github.com/datafusion-contrib/datafusion-python/pull/34) ([Jimexist](https://github.com/Jimexist)) -- disable nightly in workflow [\#33](https://github.com/datafusion-contrib/datafusion-python/pull/33) ([Jimexist](https://github.com/Jimexist)) -- update requirements to 37 and 310, update readme [\#32](https://github.com/datafusion-contrib/datafusion-python/pull/32) ([Jimexist](https://github.com/Jimexist)) -- Add custom global allocator [\#30](https://github.com/datafusion-contrib/datafusion-python/pull/30) ([matthewmturner](https://github.com/matthewmturner)) -- Remove pandas dependency 
[\#25](https://github.com/datafusion-contrib/datafusion-python/pull/25) ([matthewmturner](https://github.com/matthewmturner)) -- upgrade datafusion and pyo3 [\#20](https://github.com/datafusion-contrib/datafusion-python/pull/20) ([Jimexist](https://github.com/Jimexist)) -- update maturin 0.12+ [\#17](https://github.com/datafusion-contrib/datafusion-python/pull/17) ([Jimexist](https://github.com/Jimexist)) -- Update README.md [\#16](https://github.com/datafusion-contrib/datafusion-python/pull/16) ([Jimexist](https://github.com/Jimexist)) -- apply cargo clippy --fix [\#15](https://github.com/datafusion-contrib/datafusion-python/pull/15) ([Jimexist](https://github.com/Jimexist)) -- update test workflow to include rust clippy and check [\#14](https://github.com/datafusion-contrib/datafusion-python/pull/14) ([Jimexist](https://github.com/Jimexist)) -- use maturin 0.12.6 [\#13](https://github.com/datafusion-contrib/datafusion-python/pull/13) ([Jimexist](https://github.com/Jimexist)) -- apply cargo fmt [\#12](https://github.com/datafusion-contrib/datafusion-python/pull/12) ([Jimexist](https://github.com/Jimexist)) -- use stable not nightly [\#11](https://github.com/datafusion-contrib/datafusion-python/pull/11) ([Jimexist](https://github.com/Jimexist)) -- ci: test against more compilers, setup clippy and fix clippy lints [\#9](https://github.com/datafusion-contrib/datafusion-python/pull/9) ([cpcloud](https://github.com/cpcloud)) -- Fix use of importlib.metadata and unify requirements.txt [\#8](https://github.com/datafusion-contrib/datafusion-python/pull/8) ([cpcloud](https://github.com/cpcloud)) -- Ship the Cargo.lock file in the source distribution [\#7](https://github.com/datafusion-contrib/datafusion-python/pull/7) ([cpcloud](https://github.com/cpcloud)) -- add \_\_version\_\_ attribute to datafusion object [\#3](https://github.com/datafusion-contrib/datafusion-python/pull/3) ([tfeda](https://github.com/tfeda)) -- fix ci by fixing directories 
[\#2](https://github.com/datafusion-contrib/datafusion-python/pull/2) ([Jimexist](https://github.com/Jimexist)) -- setup workflow [\#1](https://github.com/datafusion-contrib/datafusion-python/pull/1) ([Jimexist](https://github.com/Jimexist)) - -## [0.5.1](https://github.com/datafusion-contrib/datafusion-python/tree/0.5.1) (2022-03-15) - -[Full Changelog](https://github.com/datafusion-contrib/datafusion-python/compare/0.5.1-rc1...0.5.1) - -## [0.5.1-rc1](https://github.com/datafusion-contrib/datafusion-python/tree/0.5.1-rc1) (2022-03-15) - -[Full Changelog](https://github.com/datafusion-contrib/datafusion-python/compare/0.5.0...0.5.1-rc1) - -## [0.5.0](https://github.com/datafusion-contrib/datafusion-python/tree/0.5.0) (2022-03-10) - -[Full Changelog](https://github.com/datafusion-contrib/datafusion-python/compare/0.5.0-rc2...0.5.0) - -## [0.5.0-rc2](https://github.com/datafusion-contrib/datafusion-python/tree/0.5.0-rc2) (2022-03-10) - -[Full Changelog](https://github.com/datafusion-contrib/datafusion-python/compare/0.5.0-rc1...0.5.0-rc2) - -**Closed issues:** - -- Add support for Ballista [\#37](https://github.com/datafusion-contrib/datafusion-python/issues/37) -- Implement DataFrame.explain [\#35](https://github.com/datafusion-contrib/datafusion-python/issues/35) - -## [0.5.0-rc1](https://github.com/datafusion-contrib/datafusion-python/tree/0.5.0-rc1) (2022-03-09) - -[Full Changelog](https://github.com/datafusion-contrib/datafusion-python/compare/4c98b8e9c3c3f8e2e6a8f2d1ffcfefda344c4680...0.5.0-rc1) - -**Closed issues:** - -- Investigate exposing additional optimizations [\#28](https://github.com/datafusion-contrib/datafusion-python/issues/28) -- Use custom allocator in Python build [\#27](https://github.com/datafusion-contrib/datafusion-python/issues/27) -- Why is pandas a requirement? 
[\#24](https://github.com/datafusion-contrib/datafusion-python/issues/24) -- Unable to build [\#18](https://github.com/datafusion-contrib/datafusion-python/issues/18) -- Setup CI against multiple Python version [\#6](https://github.com/datafusion-contrib/datafusion-python/issues/6) - -\* _This Changelog was automatically generated by [github_changelog_generator](https://github.com/github-changelog-generator/github-changelog-generator)_ - - +The changelogs have now moved [here](./dev/changelog). diff --git a/CLAUDE.md b/CLAUDE.md new file mode 120000 index 000000000..47dc3e3d8 --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1 @@ +AGENTS.md \ No newline at end of file diff --git a/Cargo.lock b/Cargo.lock index 23f1486b4..1cbb0acb8 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -1,37 +1,80 @@ # This file is automatically @generated by Cargo. # It is not intended for manual editing.
-version = 3 +version = 4 [[package]] -name = "adler" -version = "1.0.2" +name = "abi_stable" +version = "0.11.3" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f26201604c87b1e01bd3d98f8d5d9a8fcbb815e8cedb41ffccbeb4bf593a35fe" +checksum = "69d6512d3eb05ffe5004c59c206de7f99c34951504056ce23fc953842f12c445" +dependencies = [ + "abi_stable_derive", + "abi_stable_shared", + "const_panic", + "core_extensions", + "crossbeam-channel", + "generational-arena", + "libloading", + "lock_api", + "parking_lot", + "paste", + "repr_offset", + "rustc_version", + "serde", + "serde_derive", + "serde_json", +] [[package]] -name = "adler32" -version = "1.2.0" +name = "abi_stable_derive" +version = "0.11.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d7178468b407a4ee10e881bc7a328a65e739f0863615cca4429d43916b05e898" +dependencies = [ + "abi_stable_shared", + "as_derive_utils", + "core_extensions", + "proc-macro2", + "quote", + "rustc_version", + "syn 1.0.109", + "typed-arena", +] + +[[package]] +name = "abi_stable_shared" +version = "0.11.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b2b5df7688c123e63f4d4d649cba63f2967ba7f7861b1664fca3f77d3dad2b63" +dependencies = [ + "core_extensions", +] + +[[package]] +name = "adler2" +version = "2.0.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "aae1277d39aeec15cb388266ecc24b11c80469deae6067e17a1a7aa9e5c1f234" +checksum = "320119579fcad9c21884f5c4861d16174d0e06250625266f50fe6898340abefa" [[package]] name = "ahash" -version = "0.8.3" +version = "0.8.12" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2c99f64d1e06488f620f932677e24bc6e2897582980441ae90a671415bd7ec2f" +checksum = "5a15f179cd60c4584b8a8c596927aadc462e27f2ca70c04e0071964a73ba7a75" dependencies = [ "cfg-if", "const-random", - "getrandom", + "getrandom 0.3.4", "once_cell", "version_check", + "zerocopy", ] [[package]] name = 
"aho-corasick" -version = "0.7.20" +version = "1.1.4" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "cc936419f96fa211c1b9166887b38e5e40b19958e5b895be7c1f93adec7071ac" +checksum = "ddd31a130427c27518df266943a5308ed92d4b226cc639f5a8f1002816174301" dependencies = [ "memchr", ] @@ -51,6 +94,12 @@ dependencies = [ "alloc-no-stdlib", ] +[[package]] +name = "allocator-api2" +version = "0.2.21" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "683d7910e743518b0e34f1186f92494becacb047c7b6bf616c96772180fef923" + [[package]] name = "android_system_properties" version = "0.1.5" @@ -62,56 +111,75 @@ dependencies = [ [[package]] name = "anyhow" -version = "1.0.69" +version = "1.0.102" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "224afbd727c3d6e4b90103ece64b8d1b67fbb1973b1046c2281eed3f3803f800" +checksum = "7f202df86484c868dbad7eaa557ef785d5c66295e41b460ef922eca0723b842c" [[package]] name = "apache-avro" -version = "0.14.0" +version = "0.21.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "8cf4144857f9e4d7dd6cc4ba4c78efd2a46bad682b029bd0d91e76a021af1b2a" +checksum = "36fa98bc79671c7981272d91a8753a928ff6a1cd8e4f20a44c45bd5d313840bf" dependencies = [ - "byteorder", + "bigdecimal", + "bon", + "bzip2", "crc32fast", "digest", - "lazy_static", - "libflate", + "liblzma", "log", + "miniz_oxide", "num-bigint", "quad-rand", - "rand", - "regex", + "rand 0.9.2", + "regex-lite", "serde", + "serde_bytes", "serde_json", "snap", "strum", "strum_macros", "thiserror", - "typed-builder", "uuid", - "zerocopy", + "zstd", +] + +[[package]] +name = "ar_archive_writer" +version = "0.5.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7eb93bbb63b9c227414f6eb3a0adfddca591a8ce1e9b60661bb08969b87e340b" +dependencies = [ + "object", +] + +[[package]] +name = "arc-swap" +version = "1.9.0" +source = "registry+https://github.com/rust-lang/crates.io-index" 
+checksum = "a07d1f37ff60921c83bdfc7407723bdefe89b44b98a9b772f225c8f9d67141a6" +dependencies = [ + "rustversion", ] [[package]] name = "arrayref" -version = "0.3.6" +version = "0.3.9" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a4c527152e37cf757a3f78aae5a06fbeefdb07ccc535c980a3208ee3060dd544" +checksum = "76a2e8124351fda1ef8aaaa3bbd7ebbcb486bbcd4225aca0aa0d84bb2db8fecb" [[package]] name = "arrayvec" -version = "0.7.2" +version = "0.7.6" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "8da52d66c7071e2e3fa2a1e5c6d088fec47b593032b254f5e980de8ea54454d6" +checksum = "7c02d123df017efcdfbd739ef81735b36c5ba83ec3c59c80a9d7ecc718f92e50" [[package]] name = "arrow" -version = "34.0.0" +version = "58.1.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f410d3907b6b3647b9e7bca4551274b2e3d716aa940afb67b7287257401da921" +checksum = "d441fdda254b65f3e9025910eb2c2066b6295d9c8ed409522b8d2ace1ff8574c" dependencies = [ - "ahash", "arrow-arith", "arrow-array", "arrow-buffer", @@ -121,121 +189,129 @@ dependencies = [ "arrow-ipc", "arrow-json", "arrow-ord", + "arrow-pyarrow", "arrow-row", "arrow-schema", "arrow-select", "arrow-string", - "comfy-table", - "pyo3", ] [[package]] name = "arrow-arith" -version = "34.0.0" +version = "58.1.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f87391cf46473c9bc53dab68cb8872c3a81d4dfd1703f1c8aa397dba9880a043" +checksum = "ced5406f8b720cc0bc3aa9cf5758f93e8593cda5490677aa194e4b4b383f9a59" dependencies = [ "arrow-array", "arrow-buffer", "arrow-data", "arrow-schema", "chrono", - "half", - "num", + "num-traits", ] [[package]] name = "arrow-array" -version = "34.0.0" +version = "58.1.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d35d5475e65c57cffba06d0022e3006b677515f99b54af33a7cd54f6cdd4a5b5" +checksum = "772bd34cacdda8baec9418d80d23d0fb4d50ef0735685bd45158b83dfeb6e62d" dependencies = [ "ahash", 
"arrow-buffer", "arrow-data", "arrow-schema", "chrono", + "chrono-tz", "half", - "hashbrown 0.13.2", - "num", + "hashbrown 0.16.1", + "num-complex", + "num-integer", + "num-traits", ] [[package]] name = "arrow-buffer" -version = "34.0.0" +version = "58.1.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "68b4ec72eda7c0207727df96cf200f539749d736b21f3e782ece113e18c1a0a7" +checksum = "898f4cf1e9598fdb77f356fdf2134feedfd0ee8d5a4e0a5f573e7d0aec16baa4" dependencies = [ + "bytes", "half", - "num", + "num-bigint", + "num-traits", ] [[package]] name = "arrow-cast" -version = "34.0.0" +version = "58.1.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "0a7285272c9897321dfdba59de29f5b05aeafd3cdedf104a941256d155f6d304" +checksum = "b0127816c96533d20fc938729f48c52d3e48f99717e7a0b5ade77d742510736d" dependencies = [ "arrow-array", "arrow-buffer", "arrow-data", + "arrow-ord", "arrow-schema", "arrow-select", + "atoi", + "base64", "chrono", + "comfy-table", + "half", "lexical-core", - "num", + "num-traits", + "ryu", ] [[package]] name = "arrow-csv" -version = "34.0.0" +version = "58.1.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "981ee4e7f6a120da04e00d0b39182e1eeacccb59c8da74511de753c56b7fddf7" +checksum = "ca025bd0f38eeecb57c2153c0123b960494138e6a957bbda10da2b25415209fe" dependencies = [ "arrow-array", - "arrow-buffer", "arrow-cast", - "arrow-data", "arrow-schema", "chrono", "csv", "csv-core", - "lazy_static", - "lexical-core", "regex", ] [[package]] name = "arrow-data" -version = "34.0.0" +version = "58.1.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "27cc673ee6989ea6e4b4e8c7d461f7e06026a096c8f0b1a7288885ff71ae1e56" +checksum = "42d10beeab2b1c3bb0b53a00f7c944a178b622173a5c7bcabc3cb45d90238df4" dependencies = [ "arrow-buffer", "arrow-schema", "half", - "num", + "num-integer", + "num-traits", ] [[package]] name = "arrow-ipc" -version = "34.0.0" +version = 
"58.1.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e37b8b69d9e59116b6b538e8514e0ec63a30f08b617ce800d31cb44e3ef64c1a" +checksum = "609a441080e338147a84e8e6904b6da482cefb957c5cdc0f3398872f69a315d0" dependencies = [ "arrow-array", "arrow-buffer", - "arrow-cast", "arrow-data", "arrow-schema", + "arrow-select", "flatbuffers", + "lz4_flex", + "zstd", ] [[package]] name = "arrow-json" -version = "34.0.0" +version = "58.1.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "80c3fa0bed7cfebf6d18e46b733f9cb8a1cb43ce8e6539055ca3e1e48a426266" +checksum = "6ead0914e4861a531be48fe05858265cf854a4880b9ed12618b1d08cba9bebc8" dependencies = [ "arrow-array", "arrow-buffer", @@ -245,134 +321,197 @@ dependencies = [ "chrono", "half", "indexmap", + "itoa", "lexical-core", - "num", + "memchr", + "num-traits", + "ryu", + "serde_core", "serde_json", + "simdutf8", ] [[package]] name = "arrow-ord" -version = "34.0.0" +version = "58.1.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d247dce7bed6a8d6a3c6debfa707a3a2f694383f0c692a39d736a593eae5ef94" +checksum = "763a7ba279b20b52dad300e68cfc37c17efa65e68623169076855b3a9e941ca5" dependencies = [ "arrow-array", "arrow-buffer", "arrow-data", "arrow-schema", "arrow-select", - "num", +] + +[[package]] +name = "arrow-pyarrow" +version = "58.1.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "e63351dc11981a316c828a6032a5021345bba882f68bc4a36c36825a50725089" +dependencies = [ + "arrow-array", + "arrow-data", + "arrow-schema", + "pyo3", ] [[package]] name = "arrow-row" -version = "34.0.0" +version = "58.1.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "8d609c0181f963cea5c70fddf9a388595b5be441f3aa1d1cdbf728ca834bbd3a" +checksum = "e14fe367802f16d7668163ff647830258e6e0aeea9a4d79aaedf273af3bdcd3e" dependencies = [ - "ahash", "arrow-array", "arrow-buffer", "arrow-data", "arrow-schema", "half", - 
"hashbrown 0.13.2", ] [[package]] name = "arrow-schema" -version = "34.0.0" +version = "58.1.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "64951898473bfb8e22293e83a44f02874d2257514d49cd95f9aa4afcff183fbc" +checksum = "c30a1365d7a7dc50cc847e54154e6af49e4c4b0fddc9f607b687f29212082743" dependencies = [ "bitflags", + "serde_core", + "serde_json", ] [[package]] name = "arrow-select" -version = "34.0.0" +version = "58.1.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2a513d89c2e1ac22b28380900036cf1f3992c6443efc5e079de631dcf83c6888" +checksum = "78694888660a9e8ac949853db393af2a8b8fc82c19ce333132dfa2e72cc1a7fe" dependencies = [ + "ahash", "arrow-array", "arrow-buffer", "arrow-data", "arrow-schema", - "num", + "num-traits", ] [[package]] name = "arrow-string" -version = "34.0.0" +version = "58.1.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5288979b2705dae1114c864d73150629add9153b9b8f1d7ee3963db94c372ba5" +checksum = "61e04a01f8bb73ce54437514c5fd3ee2aa3e8abe4c777ee5cc55853b1652f79e" dependencies = [ "arrow-array", "arrow-buffer", "arrow-data", "arrow-schema", "arrow-select", + "memchr", + "num-traits", "regex", "regex-syntax", ] +[[package]] +name = "as_derive_utils" +version = "0.11.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ff3c96645900a44cf11941c111bd08a6573b0e2f9f69bc9264b179d8fae753c4" +dependencies = [ + "core_extensions", + "proc-macro2", + "quote", + "syn 1.0.109", +] + [[package]] name = "async-compression" -version = "0.3.15" +version = "0.4.41" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "942c7cd7ae39e91bde4820d74132e9862e62c2f386c3aa90ccf55949f5bad63a" +checksum = "d0f9ee0f6e02ffd7ad5816e9464499fba7b3effd01123b515c41d1697c43dad1" dependencies = [ - "bzip2", - "flate2", - "futures-core", - "futures-io", - "memchr", + "compression-codecs", + "compression-core", "pin-project-lite", "tokio", - 
"xz2", - "zstd 0.11.2+zstd.1.5.2", - "zstd-safe 5.0.2+zstd.1.5.2", +] + +[[package]] +name = "async-ffi" +version = "0.5.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "f4de21c0feef7e5a556e51af767c953f0501f7f300ba785cc99c47bdc8081a50" +dependencies = [ + "abi_stable", ] [[package]] name = "async-recursion" -version = "1.0.2" +version = "1.1.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "3b015a331cc64ebd1774ba119538573603427eaace0a1950c423ab971f903796" +checksum = "3b43422f69d8ff38f95f1b2bb76517c91589a924d1559a0e935d7c8ce0274c11" dependencies = [ "proc-macro2", "quote", - "syn", + "syn 2.0.117", ] [[package]] name = "async-trait" -version = "0.1.66" +version = "0.1.89" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "b84f9ebcc6c1f5b8cb160f6990096a5c127f423fcb6e1ccc46c370cbdfb75dfc" +checksum = "9035ad2d096bed7955a320ee7e2230574d28fd3c3a0f186cbea1ff3c7eed5dbb" dependencies = [ "proc-macro2", "quote", - "syn", + "syn 2.0.117", +] + +[[package]] +name = "atoi" +version = "2.0.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "f28d99ec8bfea296261ca1af174f24225171fea9664ba9003cbebee704810528" +dependencies = [ + "num-traits", ] +[[package]] +name = "atomic-waker" +version = "1.1.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "1505bd5d3d116872e7271a6d4e16d81d0c8570876c8de68093a09ac269d8aac0" + [[package]] name = "autocfg" -version = "1.1.0" +version = "1.5.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d468802bab17cbc0cc575e9b053f41e72aa36bfa6b7f55e3529ffa43161b97fa" +checksum = "c08606f8c3cbf4ce6ec8e28fb0014a2c086708fe954eaa885384a6165172e7e8" [[package]] name = "base64" -version = "0.21.0" +version = "0.22.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "72b3254f16251a8381aa12e40e3c4d2f0199f8c6508fbecb9d91f575e0fbb8c6" + +[[package]] +name = 
"bigdecimal" +version = "0.4.10" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a4a4ddaa51a5bc52a6948f74c06d20aaaddb71924eab79b8c97a8c556e942d6a" +checksum = "4d6867f1565b3aad85681f1015055b087fcfd840d6aeee6eee7f2da317603695" +dependencies = [ + "autocfg", + "libm", + "num-bigint", + "num-integer", + "num-traits", + "serde", +] [[package]] name = "bitflags" -version = "1.3.2" +version = "2.11.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "bef38d45163c2f1dde094a7dfd33ccf595c92905c8f8f4fdc18d06fb1037718a" +checksum = "843867be96c8daad0d758b57df9392b6d8d271134fce549de6ce169ff98a92af" [[package]] name = "blake2" @@ -385,16 +524,16 @@ dependencies = [ [[package]] name = "blake3" -version = "1.3.3" +version = "1.8.3" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "42ae2468a89544a466886840aa467a25b766499f4f04bf7d9fcd10ecee9fccef" +checksum = "2468ef7d57b3fb7e16b576e8377cdbde2320c60e1491e961d11da40fc4f02a2d" dependencies = [ "arrayref", "arrayvec", "cc", "cfg-if", "constant_time_eq", - "digest", + "cpufeatures 0.2.17", ] [[package]] @@ -406,11 +545,36 @@ dependencies = [ "generic-array", ] +[[package]] +name = "bon" +version = "3.9.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "f47dbe92550676ee653353c310dfb9cf6ba17ee70396e1f7cf0a2020ad49b2fe" +dependencies = [ + "bon-macros", + "rustversion", +] + +[[package]] +name = "bon-macros" +version = "3.9.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "519bd3116aeeb42d5372c29d982d16d0170d3d4a5ed85fc7dd91642ffff3c67c" +dependencies = [ + "darling", + "ident_case", + "prettyplease", + "proc-macro2", + "quote", + "rustversion", + "syn 2.0.117", +] + [[package]] name = "brotli" -version = "3.3.4" +version = "8.0.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a1a0b1dbcc8ae29329621f8d4f0d835787c1c38bb1401979b49d13b0b305ff68" +checksum = 
"4bd8b9603c7aa97359dbd97ecf258968c95f3adddd6db2f7e7a5bef101c84560" dependencies = [ "alloc-no-stdlib", "alloc-stdlib", @@ -419,9 +583,9 @@ dependencies = [ [[package]] name = "brotli-decompressor" -version = "2.3.4" +version = "5.0.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "4b6561fd3f895a11e8f72af2cb7d22e08366bebc2b6b57f7744c4bda27034744" +checksum = "874bb8112abecc98cbd6d81ea4fa7e94fb9449648c93cc89aa40c81c24d7de03" dependencies = [ "alloc-no-stdlib", "alloc-stdlib", @@ -429,236 +593,326 @@ dependencies = [ [[package]] name = "bumpalo" -version = "3.12.0" +version = "3.20.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "0d261e256854913907f67ed06efbc3338dfe6179796deefc1ff763fc1aee5535" +checksum = "5d20789868f4b01b2f2caec9f5c4e0213b41e3e5702a50157d699ae31ced2fcb" [[package]] name = "byteorder" -version = "1.4.3" +version = "1.5.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "14c189c53d098945499cdfa7ecc63567cf3886b3332b312a5b4585d8d3a6a610" +checksum = "1fd0f2584146f6f2ef48085050886acf353beff7305ebd1ae69500e27c67f64b" [[package]] name = "bytes" -version = "1.4.0" +version = "1.11.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "89b2fd2a0dcf38d7971e2194b6b6eebab45ae01067456a7fd93d5547a61b70be" +checksum = "1e748733b7cbc798e1434b6ac524f0c1ff2ab456fe201501e6497c8417a4fc33" [[package]] name = "bzip2" -version = "0.4.4" +version = "0.6.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "bdb116a6ef3f6c3698828873ad02c3014b3c85cadb88496095628e3ef1e347f8" +checksum = "f3a53fac24f34a81bc9954b5d6cfce0c21e18ec6959f44f56e8e90e4bb7c346c" dependencies = [ - "bzip2-sys", - "libc", + "libbz2-rs-sys", ] [[package]] -name = "bzip2-sys" -version = "0.1.11+1.0.8" +name = "cc" +version = "1.2.58" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = 
"736a955f3fa7875102d57c82b8cac37ec45224a07fd32d58f9f7a186b6cd4cdc" +checksum = "e1e928d4b69e3077709075a938a05ffbedfa53a84c8f766efbf8220bb1ff60e1" dependencies = [ - "cc", + "find-msvc-tools", + "jobserver", "libc", - "pkg-config", + "shlex", ] [[package]] -name = "cc" -version = "1.0.79" +name = "cfg-if" +version = "1.0.4" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "50d30906286121d95be3d479533b458f87493b30a4b5f79a607db8f5d11aa91f" -dependencies = [ - "jobserver", -] +checksum = "9330f8b2ff13f34540b44e946ef35111825727b38d33286ef986142615121801" [[package]] -name = "cfg-if" -version = "1.0.0" +name = "cfg_aliases" +version = "0.2.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "613afe47fcd5fac7ccf1db93babcb082c5994d996f20b8b159f2ad1658eb5724" + +[[package]] +name = "chacha20" +version = "0.10.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "baf1de4339761588bc0619e3cbc0120ee582ebb74b53b4efbf79117bd2da40fd" +checksum = "6f8d983286843e49675a4b7a2d174efe136dc93a18d69130dd18198a6c167601" +dependencies = [ + "cfg-if", + "cpufeatures 0.3.0", + "rand_core 0.10.0", +] [[package]] name = "chrono" -version = "0.4.24" +version = "0.4.44" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "4e3c5919066adf22df73762e50cffcde3a758f2a848b113b586d1f86728b673b" +checksum = "c673075a2e0e5f4a1dde27ce9dee1ea4558c7ffe648f576438a20ca1d2acc4b0" dependencies = [ "iana-time-zone", - "js-sys", - "num-integer", "num-traits", "serde", - "time", - "wasm-bindgen", - "winapi", + "windows-link", ] [[package]] -name = "codespan-reporting" -version = "0.11.1" +name = "chrono-tz" +version = "0.10.4" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "3538270d33cc669650c4b093848450d380def10c331d38c768e34cac80576e6e" +checksum = "a6139a8597ed92cf816dfb33f5dd6cf0bb93a6adc938f11039f371bc5bcd26c3" dependencies = [ - "termcolor", - "unicode-width", + "chrono", + 
"phf", +] + +[[package]] +name = "cmake" +version = "0.1.58" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "c0f78a02292a74a88ac736019ab962ece0bc380e3f977bf72e376c5d78ff0678" +dependencies = [ + "cc", ] [[package]] name = "comfy-table" -version = "6.1.4" +version = "7.2.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "6e7b787b0dc42e8111badfdbe4c3059158ccb2db8780352fa1b01e8ccf45cc4d" +checksum = "958c5d6ecf1f214b4c2bbbbf6ab9523a864bd136dcf71a7e8904799acfe1ad47" dependencies = [ - "strum", - "strum_macros", + "unicode-segmentation", "unicode-width", ] +[[package]] +name = "compression-codecs" +version = "0.4.37" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "eb7b51a7d9c967fc26773061ba86150f19c50c0d65c887cb1fbe295fd16619b7" +dependencies = [ + "bzip2", + "compression-core", + "flate2", + "liblzma", + "memchr", + "zstd", + "zstd-safe", +] + +[[package]] +name = "compression-core" +version = "0.4.31" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "75984efb6ed102a0d42db99afb6c1948f0380d1d91808d5529916e6c08b49d8d" + [[package]] name = "const-random" -version = "0.1.15" +version = "0.1.18" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "368a7a772ead6ce7e1de82bfb04c485f3db8ec744f72925af5735e29a22cc18e" +checksum = "87e00182fe74b066627d63b85fd550ac2998d4b0bd86bfed477a0ae4c7c71359" dependencies = [ "const-random-macro", - "proc-macro-hack", ] [[package]] name = "const-random-macro" -version = "0.1.15" +version = "0.1.16" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "9d7d6ab3c3a2282db210df5f02c4dab6e0a7057af0fb7ebd4070f30fe05c0ddb" +checksum = "f9d839f2a20b0aee515dc581a6172f2321f96cab76c1a38a4c584a194955390e" dependencies = [ - "getrandom", + "getrandom 0.2.17", "once_cell", - "proc-macro-hack", "tiny-keccak", ] [[package]] -name = "constant_time_eq" -version = "0.2.5" +name = "const_panic" 
+version = "0.2.15" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "13418e745008f7349ec7e449155f419a61b92b58a99cc3616942b926825ec76b" +checksum = "e262cdaac42494e3ae34c43969f9cdeb7da178bdb4b66fa6a1ea2edb4c8ae652" +dependencies = [ + "typewit", +] [[package]] -name = "core-foundation-sys" -version = "0.8.3" +name = "constant_time_eq" +version = "0.4.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5827cebf4670468b8772dd191856768aedcb1b0278a04f989f7766351917b9dc" +checksum = "3d52eff69cd5e647efe296129160853a42795992097e8af39800e1060caeea9b" [[package]] -name = "cpufeatures" -version = "0.2.5" +name = "core-foundation" +version = "0.10.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "28d997bd5e24a5928dd43e46dc529867e207907fe0b239c3477d924f7f2ca320" +checksum = "b2a6cd9ae233e7f62ba4e9353e81a88df7fc8a5987b8d445b4d90c879bd156f6" dependencies = [ + "core-foundation-sys", "libc", ] [[package]] -name = "crc32fast" -version = "1.3.2" +name = "core-foundation-sys" +version = "0.8.7" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "773648b94d0e5d620f64f280777445740e61fe701025087ec8b57f45c791888b" + +[[package]] +name = "core_extensions" +version = "1.5.4" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "b540bd8bc810d3885c6ea91e2018302f68baba2129ab3e88f32389ee9370880d" +checksum = "42bb5e5d0269fd4f739ea6cedaf29c16d81c27a7ce7582008e90eb50dcd57003" dependencies = [ - "cfg-if", + "core_extensions_proc_macros", ] [[package]] -name = "crunchy" -version = "0.2.2" +name = "core_extensions_proc_macros" +version = "1.5.4" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "7a81dae078cea95a014a339291cec439d2f232ebe854a9d672b796c6afafa9b7" +checksum = "533d38ecd2709b7608fb8e18e4504deb99e9a72879e6aa66373a76d8dc4259ea" [[package]] -name = "crypto-common" -version = "0.1.6" +name = "cpufeatures" +version = "0.2.17" 
source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1bfb12502f3fc46cca1bb51ac28df9d618d813cdc3d2f25b9fe775a34af26bb3" +checksum = "59ed5838eebb26a2bb2e58f6d5b5316989ae9d08bab10e0e6d103e656d1b0280" dependencies = [ - "generic-array", - "typenum", + "libc", ] [[package]] -name = "csv" -version = "1.2.1" +name = "cpufeatures" +version = "0.3.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "0b015497079b9a9d69c02ad25de6c0a6edef051ea6360a327d0bd05802ef64ad" +checksum = "8b2a41393f66f16b0823bb79094d54ac5fbd34ab292ddafb9a0456ac9f87d201" dependencies = [ - "csv-core", - "itoa", - "ryu", - "serde", + "libc", ] [[package]] -name = "csv-core" -version = "0.1.10" +name = "crc32fast" +version = "1.5.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2b2466559f260f48ad25fe6317b3c8dac77b5bdb5763ac7d9d6103530663bc90" +checksum = "9481c1c90cbf2ac953f07c8d4a58aa3945c425b7185c9154d67a65e4230da511" dependencies = [ - "memchr", + "cfg-if", ] [[package]] -name = "cxx" -version = "1.0.92" +name = "crossbeam-channel" +version = "0.5.15" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "9a140f260e6f3f79013b8bfc65e7ce630c9ab4388c6a89c71e07226f49487b72" +checksum = "82b8f8f868b36967f9606790d1903570de9ceaf870a7bf9fbbd3016d636a2cb2" dependencies = [ - "cc", - "cxxbridge-flags", - "cxxbridge-macro", - "link-cplusplus", + "crossbeam-utils", ] [[package]] -name = "cxx-build" -version = "1.0.92" +name = "crossbeam-utils" +version = "0.8.21" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "da6383f459341ea689374bf0a42979739dc421874f112ff26f829b8040b8e613" -dependencies = [ - "cc", - "codespan-reporting", - "once_cell", - "proc-macro2", - "quote", - "scratch", - "syn", -] +checksum = "d0a5c400df2834b80a4c3327b3aad3a4c4cd4de0629063962b03235697506a28" [[package]] -name = "cxxbridge-flags" -version = "1.0.92" +name = "crunchy" +version = "0.2.4" source = 
"registry+https://github.com/rust-lang/crates.io-index" -checksum = "90201c1a650e95ccff1c8c0bb5a343213bdd317c6e600a93075bca2eff54ec97" +checksum = "460fbee9c2c2f33933d720630a6a0bac33ba7053db5344fac858d4b8952d77d5" [[package]] -name = "cxxbridge-macro" -version = "1.0.92" +name = "crypto-common" +version = "0.1.7" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "0b75aed41bb2e6367cae39e6326ef817a851db13c13e4f3263714ca3cfb8de56" +checksum = "78c8292055d1c1df0cce5d180393dc8cce0abec0a7102adb6c7b1eef6016d60a" dependencies = [ - "proc-macro2", - "quote", - "syn", + "generic-array", + "typenum", +] + +[[package]] +name = "cstr" +version = "0.2.12" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "68523903c8ae5aacfa32a0d9ae60cadeb764e1da14ee0d26b1f3089f13a54636" +dependencies = [ + "proc-macro2", + "quote", +] + +[[package]] +name = "csv" +version = "1.4.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "52cd9d68cf7efc6ddfaaee42e7288d3a99d613d4b50f76ce9827ae0c6e14f938" +dependencies = [ + "csv-core", + "itoa", + "ryu", + "serde_core", +] + +[[package]] +name = "csv-core" +version = "0.1.13" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "704a3c26996a80471189265814dbc2c257598b96b8a7feae2d31ace646bb9782" +dependencies = [ + "memchr", +] + +[[package]] +name = "darling" +version = "0.23.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "25ae13da2f202d56bd7f91c25fba009e7717a1e4a1cc98a76d844b65ae912e9d" +dependencies = [ + "darling_core", + "darling_macro", +] + +[[package]] +name = "darling_core" +version = "0.23.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9865a50f7c335f53564bb694ef660825eb8610e0a53d3e11bf1b0d3df31e03b0" +dependencies = [ + "ident_case", + "proc-macro2", + "quote", + "strsim", + "syn 2.0.117", +] + +[[package]] +name = "darling_macro" +version = "0.23.0" +source = 
"registry+https://github.com/rust-lang/crates.io-index" +checksum = "ac3984ec7bd6cfa798e62b4a642426a5be0e68f9401cfc2a01e3fa9ea2fcdb8d" +dependencies = [ + "darling_core", + "quote", + "syn 2.0.117", ] [[package]] name = "dashmap" -version = "5.4.0" +version = "6.1.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "907076dfda823b0b36d2a1bb5f90c96660a5bbcd7729e10727f07858f22c4edc" +checksum = "5041cc499144891f3790297212f32a74fb938e5136a14943f338ef9e0ae276cf" dependencies = [ "cfg-if", - "hashbrown 0.12.3", + "crossbeam-utils", + "hashbrown 0.14.5", "lock_api", "once_cell", "parking_lot_core", @@ -666,146 +920,442 @@ dependencies = [ [[package]] name = "datafusion" -version = "20.0.0" +version = "53.0.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2c467c5802cb75ecb0acffa2121d8361a8903fef05b21fd1ca12a55797df8a75" +checksum = "de9f8117889ba9503440f1dd79ebab32ba52ccf1720bb83cd718a29d4edc0d16" dependencies = [ - "ahash", - "apache-avro", "arrow", - "async-compression", + "arrow-schema", "async-trait", "bytes", "bzip2", "chrono", - "dashmap", + "datafusion-catalog", + "datafusion-catalog-listing", "datafusion-common", + "datafusion-common-runtime", + "datafusion-datasource", + "datafusion-datasource-arrow", + "datafusion-datasource-avro", + "datafusion-datasource-csv", + "datafusion-datasource-json", + "datafusion-datasource-parquet", "datafusion-execution", "datafusion-expr", + "datafusion-expr-common", + "datafusion-functions", + "datafusion-functions-aggregate", + "datafusion-functions-nested", + "datafusion-functions-table", + "datafusion-functions-window", "datafusion-optimizer", "datafusion-physical-expr", - "datafusion-row", + "datafusion-physical-expr-adapter", + "datafusion-physical-expr-common", + "datafusion-physical-optimizer", + "datafusion-physical-plan", + "datafusion-session", "datafusion-sql", "flate2", "futures", - "glob", - "hashbrown 0.13.2", - "indexmap", "itertools", - "lazy_static", + 
"liblzma", "log", - "num-traits", - "num_cpus", "object_store", "parking_lot", "parquet", - "paste", - "percent-encoding", - "pin-project-lite", - "rand", - "smallvec", + "rand 0.9.2", + "regex", "sqlparser", "tempfile", "tokio", - "tokio-stream", - "tokio-util", "url", "uuid", - "xz2", - "zstd 0.12.3+zstd.1.5.2", + "zstd", +] + +[[package]] +name = "datafusion-catalog" +version = "53.0.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "be893b73a13671f310ffcc8da2c546b81efcc54c22e0382c0a28aa3537017137" +dependencies = [ + "arrow", + "async-trait", + "dashmap", + "datafusion-common", + "datafusion-common-runtime", + "datafusion-datasource", + "datafusion-execution", + "datafusion-expr", + "datafusion-physical-expr", + "datafusion-physical-plan", + "datafusion-session", + "futures", + "itertools", + "log", + "object_store", + "parking_lot", + "tokio", +] + +[[package]] +name = "datafusion-catalog-listing" +version = "53.0.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "830487b51ed83807d6b32d6325f349c3144ae0c9bf772cf2a712db180c31d5e6" +dependencies = [ + "arrow", + "async-trait", + "datafusion-catalog", + "datafusion-common", + "datafusion-datasource", + "datafusion-execution", + "datafusion-expr", + "datafusion-physical-expr", + "datafusion-physical-expr-adapter", + "datafusion-physical-expr-common", + "datafusion-physical-plan", + "futures", + "itertools", + "log", + "object_store", ] [[package]] name = "datafusion-common" -version = "20.0.0" +version = "53.0.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "c5c60e0a92bae06c6a55c4618947ae0837f833929b8d52ade1d50cf8b5f3b381" +checksum = "0d7663f3af955292f8004e74bcaf8f7ea3d66cc38438749615bb84815b61a293" dependencies = [ + "ahash", "apache-avro", "arrow", + "arrow-ipc", "chrono", - "num_cpus", + "half", + "hashbrown 0.16.1", + "indexmap", + "itertools", + "libc", + "log", "object_store", "parquet", - "pyo3", + "paste", + 
"recursive", "sqlparser", + "tokio", + "web-time", +] + +[[package]] +name = "datafusion-common-runtime" +version = "53.0.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "5f590205c7e32fe1fea48dd53ffb406e56ae0e7a062213a3ac848db8771641bd" +dependencies = [ + "futures", + "log", + "tokio", ] +[[package]] +name = "datafusion-datasource" +version = "53.0.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "fde1e030a9dc87b743c806fbd631f5ecfa2ccaa4ffb61fa19144a07fea406b79" +dependencies = [ + "arrow", + "async-compression", + "async-trait", + "bytes", + "bzip2", + "chrono", + "datafusion-common", + "datafusion-common-runtime", + "datafusion-execution", + "datafusion-expr", + "datafusion-physical-expr", + "datafusion-physical-expr-adapter", + "datafusion-physical-expr-common", + "datafusion-physical-plan", + "datafusion-session", + "flate2", + "futures", + "glob", + "itertools", + "liblzma", + "log", + "object_store", + "rand 0.9.2", + "tokio", + "tokio-util", + "url", + "zstd", +] + +[[package]] +name = "datafusion-datasource-arrow" +version = "53.0.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "331ebae7055dc108f9b54994b93dff91f3a17445539efe5b74e89264f7b36e15" +dependencies = [ + "arrow", + "arrow-ipc", + "async-trait", + "bytes", + "datafusion-common", + "datafusion-common-runtime", + "datafusion-datasource", + "datafusion-execution", + "datafusion-expr", + "datafusion-physical-expr-common", + "datafusion-physical-plan", + "datafusion-session", + "futures", + "itertools", + "object_store", + "tokio", +] + +[[package]] +name = "datafusion-datasource-avro" +version = "53.0.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "49dda81c79b6ba57b1853a9158abc66eb85a3aa1cede0c517dabec6d8a4ed3aa" +dependencies = [ + "apache-avro", + "arrow", + "async-trait", + "bytes", + "datafusion-common", + "datafusion-datasource", + "datafusion-physical-expr-common", + 
"datafusion-physical-plan", + "datafusion-session", + "futures", + "num-traits", + "object_store", +] + +[[package]] +name = "datafusion-datasource-csv" +version = "53.0.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9e0d475088325e2986876aa27bb30d0574f72a22955a527d202f454681d55c5c" +dependencies = [ + "arrow", + "async-trait", + "bytes", + "datafusion-common", + "datafusion-common-runtime", + "datafusion-datasource", + "datafusion-execution", + "datafusion-expr", + "datafusion-physical-expr-common", + "datafusion-physical-plan", + "datafusion-session", + "futures", + "object_store", + "regex", + "tokio", +] + +[[package]] +name = "datafusion-datasource-json" +version = "53.0.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ea1520d81f31770f3ad6ee98b391e75e87a68a5bb90de70064ace5e0a7182fe8" +dependencies = [ + "arrow", + "async-trait", + "bytes", + "datafusion-common", + "datafusion-common-runtime", + "datafusion-datasource", + "datafusion-execution", + "datafusion-expr", + "datafusion-physical-expr-common", + "datafusion-physical-plan", + "datafusion-session", + "futures", + "object_store", + "serde_json", + "tokio", + "tokio-stream", +] + +[[package]] +name = "datafusion-datasource-parquet" +version = "53.0.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "95be805d0742ab129720f4c51ad9242cd872599cdb076098b03f061fcdc7f946" +dependencies = [ + "arrow", + "async-trait", + "bytes", + "datafusion-common", + "datafusion-common-runtime", + "datafusion-datasource", + "datafusion-execution", + "datafusion-expr", + "datafusion-functions-aggregate-common", + "datafusion-physical-expr", + "datafusion-physical-expr-adapter", + "datafusion-physical-expr-common", + "datafusion-physical-plan", + "datafusion-pruning", + "datafusion-session", + "futures", + "itertools", + "log", + "object_store", + "parking_lot", + "parquet", + "tokio", +] + +[[package]] +name = "datafusion-doc" 
+version = "53.0.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "5c93ad9e37730d2c7196e68616f3f2dd3b04c892e03acd3a8eeca6e177f3c06a" + [[package]] name = "datafusion-execution" -version = "20.0.0" +version = "53.0.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "96eab4ef369c3972a4f99418ba8673aa0cf818aa4b892d44c35b473edce8e021" +checksum = "9437d3cd5d363f9319f8122182d4d233427de79c7eb748f23054c9aaa0fdd8df" dependencies = [ + "arrow", + "arrow-buffer", + "async-trait", + "chrono", "dashmap", "datafusion-common", "datafusion-expr", - "hashbrown 0.13.2", + "datafusion-physical-expr-common", + "futures", "log", "object_store", "parking_lot", - "rand", + "rand 0.9.2", "tempfile", "url", ] [[package]] name = "datafusion-expr" -version = "20.0.0" +version = "53.0.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d0533db37a619a045a4fd37dabc2d99a85f10127d7f100d75914c194e9e36ed7" +checksum = "67164333342b86521d6d93fa54081ee39839894fb10f7a700c099af96d7552cf" dependencies = [ - "ahash", "arrow", + "async-trait", + "chrono", "datafusion-common", - "log", + "datafusion-doc", + "datafusion-expr-common", + "datafusion-functions-aggregate-common", + "datafusion-functions-window-common", + "datafusion-physical-expr-common", + "indexmap", + "itertools", + "paste", + "recursive", + "serde_json", "sqlparser", ] [[package]] -name = "datafusion-optimizer" -version = "20.0.0" +name = "datafusion-expr-common" +version = "53.0.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "3dbb8f467d31c3efebd372854e35cf1ca1572f163d0f031e2c6f8d78b317a43b" +checksum = "ab05fdd00e05d5a6ee362882546d29d6d3df43a6c55355164a7fbee12d163bc9" dependencies = [ "arrow", + "datafusion-common", + "indexmap", + "itertools", + "paste", +] + +[[package]] +name = "datafusion-ffi" +version = "53.0.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"4b8250f7cdf463a0ad145f41d7508bcfa54c9b9f027317e599f0331097e3cc38" +dependencies = [ + "abi_stable", + "arrow", + "arrow-schema", + "async-ffi", "async-trait", - "chrono", + "datafusion-catalog", "datafusion-common", + "datafusion-datasource", + "datafusion-execution", "datafusion-expr", + "datafusion-functions-aggregate-common", "datafusion-physical-expr", - "hashbrown 0.13.2", - "itertools", + "datafusion-physical-expr-common", + "datafusion-physical-plan", + "datafusion-proto", + "datafusion-proto-common", + "datafusion-session", + "futures", "log", - "regex-syntax", + "prost", + "semver", + "tokio", ] [[package]] -name = "datafusion-physical-expr" -version = "20.0.0" +name = "datafusion-ffi-example" +version = "53.0.0" +dependencies = [ + "arrow", + "arrow-array", + "arrow-schema", + "async-trait", + "datafusion-catalog", + "datafusion-common", + "datafusion-expr", + "datafusion-ffi", + "datafusion-functions-aggregate", + "datafusion-functions-window", + "datafusion-python-util", + "pyo3", + "pyo3-build-config", + "pyo3-log", +] + +[[package]] +name = "datafusion-functions" +version = "53.0.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ceb003f6c0cdc25d2e9e92f8ee0322faca2a8f94f03d070b5ca15224c5503a3c" +checksum = "04fb863482d987cf938db2079e07ab0d3bb64595f28907a6c2f8671ad71cca7e" dependencies = [ - "ahash", "arrow", "arrow-buffer", - "arrow-schema", + "base64", "blake2", "blake3", "chrono", + "chrono-tz", "datafusion-common", + "datafusion-doc", + "datafusion-execution", "datafusion-expr", - "datafusion-row", - "half", - "hashbrown 0.13.2", - "indexmap", + "datafusion-expr-common", + "datafusion-macros", + "hex", "itertools", - "lazy_static", + "log", "md-5", + "memchr", "num-traits", - "paste", - "petgraph", - "rand", + "rand 0.9.2", "regex", "sha2", "unicode-segmentation", @@ -813,75 +1363,403 @@ dependencies = [ ] [[package]] -name = "datafusion-python" -version = "20.0.0" +name = "datafusion-functions-aggregate" +version = 
"53.0.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "829856f4e14275fb376c104f27cbf3c3b57a9cfe24885d98677525f5e43ce8d6" +dependencies = [ + "ahash", + "arrow", + "datafusion-common", + "datafusion-doc", + "datafusion-execution", + "datafusion-expr", + "datafusion-functions-aggregate-common", + "datafusion-macros", + "datafusion-physical-expr", + "datafusion-physical-expr-common", + "half", + "log", + "num-traits", + "paste", +] + +[[package]] +name = "datafusion-functions-aggregate-common" +version = "53.0.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "08af79cc3d2aa874a362fb97decfcbd73d687190cb096f16a6c85a7780cce311" +dependencies = [ + "ahash", + "arrow", + "datafusion-common", + "datafusion-expr-common", + "datafusion-physical-expr-common", +] + +[[package]] +name = "datafusion-functions-nested" +version = "53.0.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "465ae3368146d49c2eda3e2c0ef114424c87e8a6b509ab34c1026ace6497e790" dependencies = [ + "arrow", + "arrow-ord", + "datafusion-common", + "datafusion-doc", + "datafusion-execution", + "datafusion-expr", + "datafusion-expr-common", + "datafusion-functions", + "datafusion-functions-aggregate", + "datafusion-functions-aggregate-common", + "datafusion-macros", + "datafusion-physical-expr-common", + "hashbrown 0.16.1", + "itertools", + "itoa", + "log", + "paste", +] + +[[package]] +name = "datafusion-functions-table" +version = "53.0.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "6156e6b22fcf1784112fc0173f3ae6e78c8fdb4d3ed0eace9543873b437e2af6" +dependencies = [ + "arrow", "async-trait", - "datafusion", + "datafusion-catalog", "datafusion-common", "datafusion-expr", - "datafusion-optimizer", - "datafusion-sql", + "datafusion-physical-plan", + "parking_lot", + "paste", +] + +[[package]] +name = "datafusion-functions-window" +version = "53.0.0" +source = 
"registry+https://github.com/rust-lang/crates.io-index" +checksum = "ca7baec14f866729012efb89011a6973f3a346dc8090c567bfcd328deff551c1" +dependencies = [ + "arrow", + "datafusion-common", + "datafusion-doc", + "datafusion-expr", + "datafusion-functions-window-common", + "datafusion-macros", + "datafusion-physical-expr", + "datafusion-physical-expr-common", + "log", + "paste", +] + +[[package]] +name = "datafusion-functions-window-common" +version = "53.0.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "159228c3280d342658466bb556dc24de30047fe1d7e559dc5d16ccc5324166f9" +dependencies = [ + "datafusion-common", + "datafusion-physical-expr-common", +] + +[[package]] +name = "datafusion-macros" +version = "53.0.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "e5427e5da5edca4d21ea1c7f50e1c9421775fe33d7d5726e5641a833566e7578" +dependencies = [ + "datafusion-doc", + "quote", + "syn 2.0.117", +] + +[[package]] +name = "datafusion-optimizer" +version = "53.0.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "89099eefcd5b223ec685c36a41d35c69239236310d71d339f2af0fa4383f3f46" +dependencies = [ + "arrow", + "chrono", + "datafusion-common", + "datafusion-expr", + "datafusion-expr-common", + "datafusion-physical-expr", + "indexmap", + "itertools", + "log", + "recursive", + "regex", + "regex-syntax", +] + +[[package]] +name = "datafusion-physical-expr" +version = "53.0.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0f222df5195d605d79098ef37bdd5323bff0131c9d877a24da6ec98dfca9fe36" +dependencies = [ + "ahash", + "arrow", + "datafusion-common", + "datafusion-expr", + "datafusion-expr-common", + "datafusion-functions-aggregate-common", + "datafusion-physical-expr-common", + "half", + "hashbrown 0.16.1", + "indexmap", + "itertools", + "parking_lot", + "paste", + "petgraph", + "recursive", + "tokio", +] + +[[package]] +name = "datafusion-physical-expr-adapter" 
+version = "53.0.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "40838625d63d9c12549d81979db3dd675d159055eb9135009ba272ab0e8d0f64" +dependencies = [ + "arrow", + "datafusion-common", + "datafusion-expr", + "datafusion-functions", + "datafusion-physical-expr", + "datafusion-physical-expr-common", + "itertools", +] + +[[package]] +name = "datafusion-physical-expr-common" +version = "53.0.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "eacbcc4cfd502558184ed58fa3c72e775ec65bf077eef5fd2b3453db676f893c" +dependencies = [ + "ahash", + "arrow", + "chrono", + "datafusion-common", + "datafusion-expr-common", + "hashbrown 0.16.1", + "indexmap", + "itertools", + "parking_lot", +] + +[[package]] +name = "datafusion-physical-optimizer" +version = "53.0.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d501d0e1d0910f015677121601ac177ec59272ef5c9324d1147b394988f40941" +dependencies = [ + "arrow", + "datafusion-common", + "datafusion-execution", + "datafusion-expr", + "datafusion-expr-common", + "datafusion-physical-expr", + "datafusion-physical-expr-common", + "datafusion-physical-plan", + "datafusion-pruning", + "itertools", + "recursive", +] + +[[package]] +name = "datafusion-physical-plan" +version = "53.0.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "463c88ad6f1ecab1810f4c9f046898bee035b370137eb79b2b2db925e270631d" +dependencies = [ + "ahash", + "arrow", + "arrow-ord", + "arrow-schema", + "async-trait", + "datafusion-common", + "datafusion-common-runtime", + "datafusion-execution", + "datafusion-expr", + "datafusion-functions", + "datafusion-functions-aggregate-common", + "datafusion-functions-window-common", + "datafusion-physical-expr", + "datafusion-physical-expr-common", + "futures", + "half", + "hashbrown 0.16.1", + "indexmap", + "itertools", + "log", + "num-traits", + "parking_lot", + "pin-project-lite", + "tokio", +] + +[[package]] 
+name = "datafusion-proto" +version = "53.0.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "677ee4448a010ed5faeff8d73ff78972c2ace59eff3cd7bd15833a1dafa00492" +dependencies = [ + "arrow", + "chrono", + "datafusion-catalog", + "datafusion-catalog-listing", + "datafusion-common", + "datafusion-datasource", + "datafusion-datasource-arrow", + "datafusion-datasource-csv", + "datafusion-datasource-json", + "datafusion-datasource-parquet", + "datafusion-execution", + "datafusion-expr", + "datafusion-functions-table", + "datafusion-physical-expr", + "datafusion-physical-expr-common", + "datafusion-physical-plan", + "datafusion-proto-common", + "object_store", + "prost", + "rand 0.9.2", +] + +[[package]] +name = "datafusion-proto-common" +version = "53.0.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "965eca01edc8259edbbd95883a00b6d81e329fd44a019cfac3a03b026a83eade" +dependencies = [ + "arrow", + "datafusion-common", + "prost", +] + +[[package]] +name = "datafusion-pruning" +version = "53.0.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "2857618a0ecbd8cd0cf29826889edd3a25774ec26b2995fc3862095c95d88fc6" +dependencies = [ + "arrow", + "datafusion-common", + "datafusion-datasource", + "datafusion-expr-common", + "datafusion-physical-expr", + "datafusion-physical-expr-common", + "datafusion-physical-plan", + "itertools", + "log", +] + +[[package]] +name = "datafusion-python" +version = "53.0.0" +dependencies = [ + "arrow", + "arrow-select", + "async-trait", + "cstr", + "datafusion", + "datafusion-ffi", + "datafusion-proto", + "datafusion-python-util", "datafusion-substrait", "futures", + "log", "mimalloc", "object_store", "parking_lot", + "prost", + "prost-types", "pyo3", - "rand", - "regex-syntax", + "pyo3-async-runtimes", + "pyo3-build-config", + "pyo3-log", + "serde_json", "tokio", + "url", "uuid", ] [[package]] -name = "datafusion-row" -version = "20.0.0" -source = 
"registry+https://github.com/rust-lang/crates.io-index" -checksum = "ba0b4478a205b767f6d98e26730a604562155a8b3cf64fe3600367e741b0f129" +name = "datafusion-python-util" +version = "53.0.0" dependencies = [ "arrow", + "datafusion", + "datafusion-ffi", + "prost", + "pyo3", + "tokio", +] + +[[package]] +name = "datafusion-session" +version = "53.0.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ef8637e35022c5c775003b3ab1debc6b4a8f0eb41b069bdd5475dd3aa93f6eba" +dependencies = [ + "async-trait", "datafusion-common", - "paste", - "rand", + "datafusion-execution", + "datafusion-expr", + "datafusion-physical-plan", + "parking_lot", ] [[package]] name = "datafusion-sql" -version = "20.0.0" +version = "53.0.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a07af4200a9353ebe7bf0d6024894275cdebb303bb534b3a38eb0f532583083d" +checksum = "12d9e9f16a1692a11c94bcc418191fa15fd2b4d72a0c1a0c607db93c0b84dd81" dependencies = [ - "arrow-schema", + "arrow", + "bigdecimal", + "chrono", "datafusion-common", "datafusion-expr", + "datafusion-functions-nested", + "indexmap", "log", + "recursive", + "regex", "sqlparser", ] [[package]] name = "datafusion-substrait" -version = "20.0.0" +version = "53.0.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "71a9cc3e51c77ada10990e996c15c5bda293b3e4a508f55c2f3993ea6c6a39a7" +checksum = "d5e5656a7e63d51dd3e5af3dbd347ea83bbe993a77c66b854b74961570d16490" dependencies = [ "async-recursion", + "async-trait", "chrono", "datafusion", + "half", "itertools", "object_store", - "prost 0.11.8", - "prost-build 0.9.0", - "prost-types 0.11.8", + "pbjson-types", + "prost", "substrait", "tokio", + "url", ] [[package]] name = "digest" -version = "0.10.6" +version = "0.10.7" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "8168378f4e5023e7218c89c891c0fd8ecdb5e5e4f18cb78f38cf245dd021e76f" +checksum = 
"9ed9a281f7bc9b7576e61468ba615a66a5c8cfdff42420a70aa82701a3b1e292" dependencies = [ "block-buffer", "crypto-common", @@ -889,73 +1767,67 @@ dependencies = [ ] [[package]] -name = "doc-comment" -version = "0.3.3" +name = "displaydoc" +version = "0.2.5" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "fea41bba32d969b513997752735605054bc0dfa92b4c56bf1189f2e174be7a10" +checksum = "97369cbbc041bc366949bc74d34658d6cda5621039731c6310521892a3a20ae0" +dependencies = [ + "proc-macro2", + "quote", + "syn 2.0.117", +] [[package]] name = "dyn-clone" -version = "1.0.11" +version = "1.0.20" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "68b0cf012f1230e43cd00ebb729c6bb58707ecfa8ad08b52ef3a4ccd2697fc30" +checksum = "d0881ea181b1df73ff77ffaaf9c7544ecc11e82fba9b5f27b262a3c73a332555" [[package]] name = "either" -version = "1.8.1" +version = "1.15.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "7fcaabb2fef8c910e7f4c7ce9f67a1283a1715879a7c230ca9d6d1ae31f16d91" +checksum = "48c757948c5ede0e46177b7add2e67155f70e33c07fea8284df6576da70b3719" [[package]] -name = "encoding_rs" -version = "0.8.32" +name = "equivalent" +version = "1.0.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "071a31f4ee85403370b58aca746f01041ede6f0da2730960ad001edc2b71b394" -dependencies = [ - "cfg-if", -] +checksum = "877a4ace8713b0bcf2a4e7eec82529c029f1d0619886d18145fea96c3ffe5c0f" [[package]] name = "errno" -version = "0.2.8" +version = "0.3.14" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f639046355ee4f37944e44f60642c6f3a7efa3cf6b78c78a0d989a8ce6c396a1" +checksum = "39cab71617ae0d63f51a36d69f866391735b51691dbda63cf6f96d042b63efeb" dependencies = [ - "errno-dragonfly", "libc", - "winapi", + "windows-sys 0.61.2", ] [[package]] -name = "errno-dragonfly" -version = "0.1.2" +name = "fastrand" +version = "2.3.0" source = 
"registry+https://github.com/rust-lang/crates.io-index" -checksum = "aa68f1b12764fab894d2755d2518754e71b4fd80ecfb822714a1206c2aab39bf" -dependencies = [ - "cc", - "libc", -] +checksum = "37909eebbb50d72f9059c3b6d82c0463f2ff062c9e95845c43a6c9c0355411be" [[package]] -name = "fastrand" -version = "1.9.0" +name = "find-msvc-tools" +version = "0.1.9" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e51093e27b0797c359783294ca4f0a911c270184cb10f85783b118614a1501be" -dependencies = [ - "instant", -] +checksum = "5baebc0774151f905a1a2cc41989300b1e6fbb29aff0ceffa1064fdd3088d582" [[package]] name = "fixedbitset" -version = "0.4.2" +version = "0.5.7" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "0ce7134b9999ecaf8bcd65542e436736ef32ddca1b3e06094cb6ec5755203b80" +checksum = "1d674e81391d1e1ab681a28d99df07927c6d4aa5b027d7da16ba32d1d21ecd99" [[package]] name = "flatbuffers" -version = "23.1.21" +version = "25.12.19" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "77f5399c2c9c50ae9418e522842ad362f61ee48b346ac106807bd355a8a7c619" +checksum = "35f6839d7b3b98adde531effaf34f0c2badc6f4735d26fe74709d8e513a96ef3" dependencies = [ "bitflags", "rustc_version", @@ -963,12 +1835,13 @@ dependencies = [ [[package]] name = "flate2" -version = "1.0.25" +version = "1.1.9" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a8a2db397cb1c8772f31494cb8917e48cd1e64f0fa7efac59fbd741a0a8ce841" +checksum = "843fba2746e448b37e26a819579957415c8cef339bf08564fe8b7ddbd959573c" dependencies = [ "crc32fast", "miniz_oxide", + "zlib-rs", ] [[package]] @@ -977,20 +1850,32 @@ version = "1.0.7" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "3f9eec918d3f24069decb9af1554cad7c880e2da24a9afd88aca000531ab82c1" +[[package]] +name = "foldhash" +version = "0.1.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"d9c4f5dac5e15c24eb999c26181a6ca40b39fe946cbe4c263c7209467bc83af2" + +[[package]] +name = "foldhash" +version = "0.2.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "77ce24cb58228fbb8aa041425bb1050850ac19177686ea6e0f41a70416f56fdb" + [[package]] name = "form_urlencoded" -version = "1.1.0" +version = "1.2.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a9c384f161156f5260c24a097c56119f9be8c798586aecc13afbcbe7b7e26bf8" +checksum = "cb4cb245038516f5f85277875cdaa4f7d2c9a0fa0468de06ed190163b1581fcf" dependencies = [ "percent-encoding", ] [[package]] name = "futures" -version = "0.3.27" +version = "0.3.32" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "531ac96c6ff5fd7c62263c5e3c67a603af4fcaee2e1a0ae5565ba3a11e69e549" +checksum = "8b147ee9d1f6d097cef9ce628cd2ee62288d963e16fb287bd9286455b241382d" dependencies = [ "futures-channel", "futures-core", @@ -1003,9 +1888,9 @@ dependencies = [ [[package]] name = "futures-channel" -version = "0.3.27" +version = "0.3.32" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "164713a5a0dcc3e7b4b1ed7d3b433cabc18025386f9339346e8daf15963cf7ac" +checksum = "07bbe89c50d7a535e539b8c17bc0b49bdb77747034daa8087407d655f3f7cc1d" dependencies = [ "futures-core", "futures-sink", @@ -1013,15 +1898,15 @@ dependencies = [ [[package]] name = "futures-core" -version = "0.3.27" +version = "0.3.32" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "86d7a0c1aa76363dac491de0ee99faf6941128376f1cf96f07db7603b7de69dd" +checksum = "7e3450815272ef58cec6d564423f6e755e25379b217b0bc688e295ba24df6b1d" [[package]] name = "futures-executor" -version = "0.3.27" +version = "0.3.32" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1997dd9df74cdac935c76252744c1ed5794fac083242ea4fe77ef3ed60ba0f83" +checksum = "baf29c38818342a3b26b5b923639e7b1f4a61fc5e76102d4b1981c6dc7a7579d" dependencies = [ 
"futures-core", "futures-task", @@ -1030,38 +1915,38 @@ dependencies = [ [[package]] name = "futures-io" -version = "0.3.27" +version = "0.3.32" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "89d422fa3cbe3b40dca574ab087abb5bc98258ea57eea3fd6f1fa7162c778b91" +checksum = "cecba35d7ad927e23624b22ad55235f2239cfa44fd10428eecbeba6d6a717718" [[package]] name = "futures-macro" -version = "0.3.27" +version = "0.3.32" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "3eb14ed937631bd8b8b8977f2c198443447a8355b6e3ca599f38c975e5a963b6" +checksum = "e835b70203e41293343137df5c0664546da5745f82ec9b84d40be8336958447b" dependencies = [ "proc-macro2", "quote", - "syn", + "syn 2.0.117", ] [[package]] name = "futures-sink" -version = "0.3.27" +version = "0.3.32" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ec93083a4aecafb2a80a885c9de1f0ccae9dbd32c2bb54b0c3a65690e0b8d2f2" +checksum = "c39754e157331b013978ec91992bde1ac089843443c49cbc7f46150b0fad0893" [[package]] name = "futures-task" -version = "0.3.27" +version = "0.3.32" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "fd65540d33b37b16542a0438c12e6aeead10d4ac5d05bd3f805b8f35ab592879" +checksum = "037711b3d59c33004d3856fbdc83b99d4ff37a24768fa1be9ce3538a1cde4393" [[package]] name = "futures-util" -version = "0.3.27" +version = "0.3.32" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "3ef6b17e481503ec85211fed8f39d1970f128935ca1f814cd32ac4a6842e84ab" +checksum = "389ca41296e6190b48053de0321d02a77f32f8a5d2461dd38762c0593805c6d6" dependencies = [ "futures-channel", "futures-core", @@ -1071,15 +1956,23 @@ dependencies = [ "futures-task", "memchr", "pin-project-lite", - "pin-utils", "slab", ] +[[package]] +name = "generational-arena" +version = "0.2.9" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "877e94aff08e743b651baaea359664321055749b398adff8740a7399af7796e7" 
+dependencies = [ + "cfg-if", +] + [[package]] name = "generic-array" -version = "0.14.6" +version = "0.14.7" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "bff49e947297f3312447abdca79f45f4738097cc82b06e72054d2223f601f1b9" +checksum = "85649ca51fd72272d7821adaf274ad91c288277713d9c18820d8499a7ff69e9a" dependencies = [ "typenum", "version_check", @@ -1087,32 +1980,62 @@ dependencies = [ [[package]] name = "getrandom" -version = "0.2.8" +version = "0.2.17" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ff2abc00be7fca6ebc474524697ae276ad847ad0a6b3faa4bcb027e9a4614ad0" +dependencies = [ + "cfg-if", + "js-sys", + "libc", + "wasi", + "wasm-bindgen", +] + +[[package]] +name = "getrandom" +version = "0.3.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "899def5c37c4fd7b2664648c28120ecec138e4d395b459e5ca34f9cce2dd77fd" +dependencies = [ + "cfg-if", + "js-sys", + "libc", + "r-efi 5.3.0", + "wasip2", + "wasm-bindgen", +] + +[[package]] +name = "getrandom" +version = "0.4.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "c05aeb6a22b8f62540c194aac980f2115af067bfe15a0734d7277a768d396b31" +checksum = "0de51e6874e94e7bf76d726fc5d13ba782deca734ff60d5bb2fb2607c7406555" dependencies = [ "cfg-if", "libc", - "wasi 0.11.0+wasi-snapshot-preview1", + "r-efi 6.0.0", + "rand_core 0.10.0", + "wasip2", + "wasip3", ] [[package]] name = "glob" -version = "0.3.1" +version = "0.3.3" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d2fabcfbdc87f4758337ca535fb41a6d701b65693ce38287d856d1674551ec9b" +checksum = "0cc23270f6e1808e30a928bdc84dea0b9b4136a8bc82338574f23baf47bbd280" [[package]] name = "h2" -version = "0.3.16" +version = "0.4.13" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5be7b54589b581f624f566bf5d8eb2bab1db736c51528720b6bd36b96b55924d" +checksum = 
"2f44da3a8150a6703ed5d34e164b875fd14c2cdab9af1252a9a1020bde2bdc54" dependencies = [ + "atomic-waker", "bytes", "fnv", "futures-core", "futures-sink", - "futures-util", "http", "indexmap", "slab", @@ -1123,258 +2046,379 @@ dependencies = [ [[package]] name = "half" -version = "2.2.1" +version = "2.7.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "02b4af3693f1b705df946e9fe5631932443781d0aabb423b62fcd4d73f6d2fd0" +checksum = "6ea2d84b969582b4b1864a92dc5d27cd2b77b622a8d79306834f1be5ba20d84b" dependencies = [ + "cfg-if", "crunchy", "num-traits", + "zerocopy", ] [[package]] name = "hashbrown" -version = "0.12.3" +version = "0.14.5" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "8a9ee70c43aaf417c914396645a0fa852624801b24ebb7ae78fe8272889ac888" +checksum = "e5274423e17b7c9fc20b6e7e208532f9b19825d82dfd615708b70edd83df41f1" [[package]] name = "hashbrown" -version = "0.13.2" +version = "0.15.5" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "43a3c133739dddd0d2990f9a4bdf8eb4b21ef50e4851ca85ab661199821d510e" +checksum = "9229cfe53dfd69f0609a49f65461bd93001ea1ef889cd5529dd176593f5338a1" dependencies = [ - "ahash", + "foldhash 0.1.5", ] [[package]] -name = "heck" -version = "0.3.3" +name = "hashbrown" +version = "0.16.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "6d621efb26863f0e9924c6ac577e8275e5e6b77455db64ffa6c65c904e9e132c" +checksum = "841d1cc9bed7f9236f321df977030373f4a4163ae1a7dbfe1a51a2c1a51d9100" dependencies = [ - "unicode-segmentation", + "allocator-api2", + "equivalent", + "foldhash 0.2.0", ] [[package]] name = "heck" -version = "0.4.1" +version = "0.5.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "95505c38b4572b2d910cecb0281560f54b440a19336cbbcb27bf6ce6adc6f5a8" +checksum = "2304e00983f87ffb38b55b444b5e3b60a884b5d30c0fca7d82fe33449bbe55ea" [[package]] -name = "hermit-abi" -version = "0.2.6" +name = "hex" 
+version = "0.4.3" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ee512640fe35acbfb4bb779db6f0d80704c2cacfa2e39b601ef3e3f47d1ae4c7" -dependencies = [ - "libc", -] +checksum = "7f24254aa9a54b5c858eaee2f5bccdb46aaf0e486a595ed5fd8f86ba55232a70" [[package]] -name = "home" -version = "0.5.4" +name = "http" +version = "1.4.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "747309b4b440c06d57b0b25f2aee03ee9b5e5397d288c60e21fc709bb98a7408" +checksum = "e3ba2a386d7f85a81f119ad7498ebe444d2e22c2af0b86b069416ace48b3311a" dependencies = [ - "winapi", + "bytes", + "itoa", ] [[package]] -name = "http" -version = "0.2.9" +name = "http-body" +version = "1.0.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "bd6effc99afb63425aff9b05836f029929e345a6148a14b7ecd5ab67af944482" +checksum = "1efedce1fb8e6913f23e0c92de8e62cd5b772a67e7b3946df930a62566c93184" dependencies = [ "bytes", - "fnv", - "itoa", + "http", ] [[package]] -name = "http-body" -version = "0.4.5" +name = "http-body-util" +version = "0.1.3" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d5f38f16d184e36f2408a55281cd658ecbd3ca05cce6d6510a176eca393e26d1" +checksum = "b021d93e26becf5dc7e1b75b1bed1fd93124b374ceb73f43d4d4eafec896a64a" dependencies = [ "bytes", + "futures-core", "http", + "http-body", "pin-project-lite", ] [[package]] name = "httparse" -version = "1.8.0" +version = "1.10.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d897f394bad6a705d5f4104762e116a75639e470d80901eed05a860a95cb1904" +checksum = "6dbf3de79e51f3d586ab4cb9d5c3e2c14aa28ed23d180cf89b4df0454a69cc87" [[package]] -name = "httpdate" -version = "1.0.2" +name = "humantime" +version = "2.3.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "c4a1e36c821dbe04574f602848a19f742f4fb3c98d40449f11bcad18d6b17421" +checksum = 
"135b12329e5e3ce057a9f972339ea52bc954fe1e9358ef27f95e89716fbc5424" [[package]] name = "hyper" -version = "0.14.25" +version = "1.8.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "cc5e554ff619822309ffd57d8734d77cd5ce6238bc956f037ea06c58238c9899" +checksum = "2ab2d4f250c3d7b1c9fcdff1cece94ea4e2dfbec68614f7b87cb205f24ca9d11" dependencies = [ + "atomic-waker", "bytes", "futures-channel", "futures-core", - "futures-util", "h2", "http", "http-body", "httparse", - "httpdate", "itoa", "pin-project-lite", - "socket2", + "pin-utils", + "smallvec", "tokio", - "tower-service", - "tracing", "want", ] [[package]] name = "hyper-rustls" -version = "0.23.2" +version = "0.27.7" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1788965e61b367cd03a62950836d5cd41560c3577d90e40e0819373194d1661c" +checksum = "e3c93eb611681b207e1fe55d5a71ecf91572ec8a6705cdb6857f7d8d5242cf58" dependencies = [ "http", "hyper", + "hyper-util", "rustls", + "rustls-native-certs", + "rustls-pki-types", "tokio", "tokio-rustls", + "tower-service", +] + +[[package]] +name = "hyper-util" +version = "0.1.20" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "96547c2556ec9d12fb1578c4eaf448b04993e7fb79cbaad930a656880a6bdfa0" +dependencies = [ + "base64", + "bytes", + "futures-channel", + "futures-util", + "http", + "http-body", + "hyper", + "ipnet", + "libc", + "percent-encoding", + "pin-project-lite", + "socket2", + "tokio", + "tower-service", + "tracing", ] [[package]] name = "iana-time-zone" -version = "0.1.53" +version = "0.1.65" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "64c122667b287044802d6ce17ee2ddf13207ed924c712de9a66a5814d5b64765" +checksum = "e31bc9ad994ba00e440a8aa5c9ef0ec67d5cb5e5cb0cc7f8b744a35b389cc470" dependencies = [ "android_system_properties", "core-foundation-sys", "iana-time-zone-haiku", "js-sys", + "log", "wasm-bindgen", - "winapi", + "windows-core", ] [[package]] name = 
"iana-time-zone-haiku" -version = "0.1.1" +version = "0.1.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "0703ae284fc167426161c2e3f1da3ea71d94b21bedbcc9494e92b28e334e3dca" +checksum = "f31827a206f56af32e590ba56d5d2d085f558508192593743f16b2306495269f" dependencies = [ - "cxx", - "cxx-build", + "cc", ] [[package]] -name = "idna" -version = "0.3.0" +name = "icu_collections" +version = "2.1.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e14ddfc70884202db2244c223200c204c2bda1bc6e0998d11b5e024d657209e6" +checksum = "4c6b649701667bbe825c3b7e6388cb521c23d88644678e83c0c4d0a621a34b43" dependencies = [ - "unicode-bidi", - "unicode-normalization", + "displaydoc", + "potential_utf", + "yoke", + "zerofrom", + "zerovec", ] [[package]] -name = "indexmap" -version = "1.9.2" +name = "icu_locale_core" +version = "2.1.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1885e79c1fc4b10f0e172c475f458b7f7b93061064d98c3293e98c5ba0c8b399" +checksum = "edba7861004dd3714265b4db54a3c390e880ab658fec5f7db895fae2046b5bb6" dependencies = [ - "autocfg", - "hashbrown 0.12.3", + "displaydoc", + "litemap", + "tinystr", + "writeable", + "zerovec", +] + +[[package]] +name = "icu_normalizer" +version = "2.1.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "5f6c8828b67bf8908d82127b2054ea1b4427ff0230ee9141c54251934ab1b599" +dependencies = [ + "icu_collections", + "icu_normalizer_data", + "icu_properties", + "icu_provider", + "smallvec", + "zerovec", ] [[package]] -name = "indoc" -version = "1.0.9" +name = "icu_normalizer_data" +version = "2.1.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "bfa799dd5ed20a7e349f3b4639aa80d74549c81716d9ec4f994c9b5815598306" +checksum = "7aedcccd01fc5fe81e6b489c15b247b8b0690feb23304303a9e560f37efc560a" [[package]] -name = "instant" -version = "0.1.12" +name = "icu_properties" +version = "2.1.2" source = 
"registry+https://github.com/rust-lang/crates.io-index" -checksum = "7a5bbe824c507c5da5956355e86a746d82e0e1464f65d862cc5e71da70e94b2c" +checksum = "020bfc02fe870ec3a66d93e677ccca0562506e5872c650f893269e08615d74ec" dependencies = [ - "cfg-if", + "icu_collections", + "icu_locale_core", + "icu_properties_data", + "icu_provider", + "zerotrie", + "zerovec", ] [[package]] -name = "integer-encoding" -version = "3.0.4" +name = "icu_properties_data" +version = "2.1.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "8bb03732005da905c88227371639bf1ad885cc712789c011c31c5fb3ab3ccf02" +checksum = "616c294cf8d725c6afcd8f55abc17c56464ef6211f9ed59cccffe534129c77af" [[package]] -name = "io-lifetimes" -version = "1.0.6" +name = "icu_provider" +version = "2.1.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "cfa919a82ea574332e2de6e74b4c36e74d41982b335080fa59d4ef31be20fdf3" +checksum = "85962cf0ce02e1e0a629cc34e7ca3e373ce20dda4c4d7294bbd0bf1fdb59e614" dependencies = [ - "libc", - "windows-sys 0.45.0", + "displaydoc", + "icu_locale_core", + "writeable", + "yoke", + "zerofrom", + "zerotrie", + "zerovec", +] + +[[package]] +name = "id-arena" +version = "2.3.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "3d3067d79b975e8844ca9eb072e16b31c3c1c36928edf9c6789548c524d0d954" + +[[package]] +name = "ident_case" +version = "1.0.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b9e0384b61958566e926dc50660321d12159025e767c18e043daf26b70104c39" + +[[package]] +name = "idna" +version = "1.1.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "3b0875f23caa03898994f6ddc501886a45c7d3d62d04d2d90788d47be1b1e4de" +dependencies = [ + "idna_adapter", + "smallvec", + "utf8_iter", +] + +[[package]] +name = "idna_adapter" +version = "1.2.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"3acae9609540aa318d1bc588455225fb2085b9ed0c4f6bd0d9d5bcd86f1a0344" +dependencies = [ + "icu_normalizer", + "icu_properties", +] + +[[package]] +name = "indexmap" +version = "2.13.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7714e70437a7dc3ac8eb7e6f8df75fd8eb422675fc7678aff7364301092b1017" +dependencies = [ + "equivalent", + "hashbrown 0.16.1", + "serde", + "serde_core", ] +[[package]] +name = "integer-encoding" +version = "3.0.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "8bb03732005da905c88227371639bf1ad885cc712789c011c31c5fb3ab3ccf02" + [[package]] name = "ipnet" -version = "2.7.1" +version = "2.12.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "30e22bd8629359895450b59ea7a776c850561b96a3b1d31321c1949d9e6c9146" +checksum = "d98f6fed1fde3f8c21bc40a1abb88dd75e67924f9cffc3ef95607bad8017f8e2" + +[[package]] +name = "iri-string" +version = "0.7.11" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d8e7418f59cc01c88316161279a7f665217ae316b388e58a0d10e29f54f1e5eb" +dependencies = [ + "memchr", + "serde", +] [[package]] name = "itertools" -version = "0.10.5" +version = "0.14.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "b0fd2260e829bddf4cb6ea802289de2f86d6a7a690192fbe91b3f46e0f2c8473" +checksum = "2b192c782037fadd9cfa75548310488aabdbf3d2da73885b31bd0abd03351285" dependencies = [ "either", ] [[package]] name = "itoa" -version = "1.0.6" +version = "1.0.18" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "453ad9f582a441959e5f0d088b02ce04cfe8d51a8eaf077f12ac6d3e94164ca6" +checksum = "8f42a60cbdf9a97f5d2305f08a87dc4e09308d1276d28c869c684d7777685682" [[package]] name = "jobserver" -version = "0.1.26" +version = "0.1.34" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "936cfd212a0155903bcbc060e316fb6cc7cbf2e1907329391ebadc1fe0ce77c2" +checksum = 
"9afb3de4395d6b3e67a780b6de64b51c978ecf11cb9a462c66be7d4ca9039d33" dependencies = [ + "getrandom 0.3.4", "libc", ] [[package]] name = "js-sys" -version = "0.3.61" +version = "0.3.91" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "445dde2150c55e483f3d8416706b97ec8e8237c307e5b7b4b8dd15e6af2a0730" +checksum = "b49715b7073f385ba4bc528e5747d02e66cb39c6146efb66b781f131f0fb399c" dependencies = [ + "once_cell", "wasm-bindgen", ] [[package]] -name = "lazy_static" -version = "1.4.0" +name = "leb128fmt" +version = "0.1.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e2abad23fbc42b3700f2f279844dc832adb2b2eb069b2df918f455c4e18cc646" +checksum = "09edd9e8b54e49e587e4f6295a7d29c3ea94d469cb40ab8ca70b288248a81db2" [[package]] name = "lexical-core" -version = "0.8.5" +version = "1.0.6" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2cde5de06e8d4c2faabc400238f9ae1c74d5412d03a7bd067645ccbc47070e46" +checksum = "7d8d125a277f807e55a77304455eb7b1cb52f2b18c143b60e766c120bd64a594" dependencies = [ "lexical-parse-float", "lexical-parse-integer", @@ -1385,363 +2429,314 @@ dependencies = [ [[package]] name = "lexical-parse-float" -version = "0.8.5" +version = "1.0.6" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "683b3a5ebd0130b8fb52ba0bdc718cc56815b6a097e28ae5a6997d0ad17dc05f" +checksum = "52a9f232fbd6f550bc0137dcb5f99ab674071ac2d690ac69704593cb4abbea56" dependencies = [ "lexical-parse-integer", "lexical-util", - "static_assertions", ] [[package]] name = "lexical-parse-integer" -version = "0.8.6" +version = "1.0.6" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "6d0994485ed0c312f6d965766754ea177d07f9c00c9b82a5ee62ed5b47945ee9" +checksum = "9a7a039f8fb9c19c996cd7b2fcce303c1b2874fe1aca544edc85c4a5f8489b34" dependencies = [ "lexical-util", - "static_assertions", ] [[package]] name = "lexical-util" -version = "0.8.5" +version = "1.0.7" source 
= "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5255b9ff16ff898710eb9eb63cb39248ea8a5bb036bea8085b1a767ff6c4e3fc" -dependencies = [ - "static_assertions", -] +checksum = "2604dd126bb14f13fb5d1bd6a66155079cb9fa655b37f875b3a742c705dbed17" [[package]] name = "lexical-write-float" -version = "0.8.5" +version = "1.0.6" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "accabaa1c4581f05a3923d1b4cfd124c329352288b7b9da09e766b0668116862" +checksum = "50c438c87c013188d415fbabbb1dceb44249ab81664efbd31b14ae55dabb6361" dependencies = [ "lexical-util", "lexical-write-integer", - "static_assertions", ] [[package]] name = "lexical-write-integer" -version = "0.8.5" +version = "1.0.6" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e1b6f3d1f4422866b68192d62f77bc5c700bee84f3069f2469d7bc8c77852446" +checksum = "409851a618475d2d5796377cad353802345cba92c867d9fbcde9cf4eac4e14df" dependencies = [ "lexical-util", - "static_assertions", ] +[[package]] +name = "libbz2-rs-sys" +version = "0.2.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "2c4a545a15244c7d945065b5d392b2d2d7f21526fba56ce51467b06ed445e8f7" + [[package]] name = "libc" -version = "0.2.140" +version = "0.2.183" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "99227334921fae1a979cf0bfdfcc6b3e5ce376ef57e16fb6fb3ea2ed6095f80c" +checksum = "b5b646652bf6661599e1da8901b3b9522896f01e736bad5f723fe7a3a27f899d" [[package]] -name = "libflate" -version = "1.3.0" +name = "libloading" +version = "0.7.4" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "97822bf791bd4d5b403713886a5fbe8bf49520fe78e323b0dc480ca1a03e50b0" +checksum = "b67380fd3b2fbe7527a606e18729d21c6f3951633d0500574c4dc22d2d638b9f" dependencies = [ - "adler32", - "crc32fast", - "libflate_lz77", + "cfg-if", + "winapi", ] [[package]] -name = "libflate_lz77" -version = "1.2.0" +name = "liblzma" +version = "0.4.6" 
source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a52d3a8bfc85f250440e4424db7d857e241a3aebbbe301f3eb606ab15c39acbf" +checksum = "b6033b77c21d1f56deeae8014eb9fbe7bdf1765185a6c508b5ca82eeaed7f899" dependencies = [ - "rle-decode-fast", + "liblzma-sys", +] + +[[package]] +name = "liblzma-sys" +version = "0.4.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9f2db66f3268487b5033077f266da6777d057949b8f93c8ad82e441df25e6186" +dependencies = [ + "cc", + "libc", + "pkg-config", ] [[package]] name = "libm" -version = "0.2.6" +version = "0.2.16" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "348108ab3fba42ec82ff6e9564fc4ca0247bdccdc68dd8af9764bbc79c3c8ffb" +checksum = "b6d2cec3eae94f9f509c767b45932f1ada8350c4bdb85af2fcab4a3c14807981" [[package]] name = "libmimalloc-sys" -version = "0.1.30" +version = "0.1.44" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "dd8c7cbf8b89019683667e347572e6d55a7df7ea36b0c4ce69961b0cde67b174" +checksum = "667f4fec20f29dfc6bc7357c582d91796c169ad7e2fce709468aefeb2c099870" dependencies = [ "cc", "libc", ] [[package]] -name = "link-cplusplus" -version = "1.0.8" +name = "linux-raw-sys" +version = "0.12.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ecd207c9c713c34f95a097a5b029ac2ce6010530c7b49d7fea24d977dede04f5" -dependencies = [ - "cc", -] +checksum = "32a66949e030da00e8c7d4434b251670a91556f4144941d37452769c25d58a53" [[package]] -name = "linux-raw-sys" -version = "0.1.4" +name = "litemap" +version = "0.8.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f051f77a7c8e6957c0696eac88f26b0117e54f52d3fc682ab19397a8812846a4" +checksum = "6373607a59f0be73a39b6fe456b8192fcc3585f602af20751600e974dd455e77" [[package]] name = "lock_api" -version = "0.4.9" +version = "0.4.14" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = 
"435011366fe56583b16cf956f9df0095b405b82d76425bc8981c0e22e60ec4df" +checksum = "224399e74b87b5f3557511d98dff8b14089b3dadafcab6bb93eab67d3aace965" dependencies = [ - "autocfg", "scopeguard", ] [[package]] name = "log" -version = "0.4.17" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "abb12e687cfb44aa40f41fc3978ef76448f9b6038cad6aef4259d3c095a2382e" -dependencies = [ - "cfg-if", -] - -[[package]] -name = "lz4" -version = "1.24.0" +version = "0.4.29" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "7e9e2dd86df36ce760a60f6ff6ad526f7ba1f14ba0356f8254fb6905e6494df1" -dependencies = [ - "libc", - "lz4-sys", -] +checksum = "5e5032e24019045c762d3c0f28f5b6b8bbf38563a65908389bf7978758920897" [[package]] -name = "lz4-sys" -version = "1.9.4" +name = "lru-slab" +version = "0.1.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "57d27b317e207b10f69f5e75494119e391a96f48861ae870d1da6edac98ca900" -dependencies = [ - "cc", - "libc", -] +checksum = "112b39cec0b298b6c1999fee3e31427f74f676e4cb9879ed1a121b43661a4154" [[package]] -name = "lzma-sys" -version = "0.1.20" +name = "lz4_flex" +version = "0.13.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5fda04ab3764e6cde78b9974eec4f779acaba7c4e84b36eca3cf77c581b85d27" +checksum = "db9a0d582c2874f68138a16ce1867e0ffde6c0bb0a0df85e1f36d04146db488a" dependencies = [ - "cc", - "libc", - "pkg-config", + "twox-hash", ] [[package]] name = "md-5" -version = "0.10.5" +version = "0.10.6" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "6365506850d44bff6e2fbcb5176cf63650e48bd45ef2fe2665ae1570e0f4b9ca" +checksum = "d89e7ee0cfbedfc4da3340218492196241d89eefb6dab27de5df917a6d2e78cf" dependencies = [ + "cfg-if", "digest", -] - -[[package]] -name = "memchr" -version = "2.5.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = 
"2dffe52ecf27772e601905b7522cb4ef790d2cc203488bbd0e2fe85fcb74566d" +] [[package]] -name = "memoffset" -version = "0.8.0" +name = "memchr" +version = "2.8.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d61c719bcfbcf5d62b3a09efa6088de8c54bc0bfcd3ea7ae39fcc186108b8de1" -dependencies = [ - "autocfg", -] +checksum = "f8ca58f447f06ed17d5fc4043ce1b10dd205e060fb3ce5b979b8ed8e59ff3f79" [[package]] name = "mimalloc" -version = "0.1.34" +version = "0.1.48" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "9dcb174b18635f7561a0c6c9fc2ce57218ac7523cf72c50af80e2d79ab8f3ba1" +checksum = "e1ee66a4b64c74f4ef288bcbb9192ad9c3feaad75193129ac8509af543894fd8" dependencies = [ "libmimalloc-sys", ] -[[package]] -name = "mime" -version = "0.3.16" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2a60c7ce501c71e03a9c9c0d35b861413ae925bd979cc7a4e30d060069aaac8d" - [[package]] name = "miniz_oxide" -version = "0.6.2" +version = "0.8.9" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "b275950c28b37e794e8c55d88aeb5e139d0ce23fdbbeda68f8d7174abdf9e8fa" +checksum = "1fa76a2c86f704bdb222d66965fb3d63269ce38518b83cb0575fca855ebb6316" dependencies = [ - "adler", + "adler2", + "simd-adler32", ] [[package]] name = "mio" -version = "0.8.6" +version = "1.2.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5b9d9a46eff5b4ff64b45a9e316a6d1e0bc719ef429cbec4dc630684212bfdf9" +checksum = "50b7e5b27aa02a74bac8c3f23f448f8d87ff11f92d3aac1a6ed369ee08cc56c1" dependencies = [ "libc", - "log", - "wasi 0.11.0+wasi-snapshot-preview1", - "windows-sys 0.45.0", + "wasi", + "windows-sys 0.61.2", ] [[package]] name = "multimap" -version = "0.8.3" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e5ce46fe64a9d73be07dcbe690a38ce1b293be448fd8ce1e6c1b8062c9f72c6a" - -[[package]] -name = "num" -version = "0.4.0" +version = "0.10.1" source = 
"registry+https://github.com/rust-lang/crates.io-index" -checksum = "43db66d1170d347f9a065114077f7dccb00c1b9478c89384490a3425279a4606" -dependencies = [ - "num-bigint", - "num-complex", - "num-integer", - "num-iter", - "num-rational", - "num-traits", -] +checksum = "1d87ecb2933e8aeadb3e3a02b828fed80a7528047e68b4f424523a0981a3a084" [[package]] name = "num-bigint" -version = "0.4.3" +version = "0.4.6" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f93ab6289c7b344a8a9f60f88d80aa20032336fe78da341afc91c8a2341fc75f" +checksum = "a5e44f723f1133c9deac646763579fdb3ac745e418f2a7af9cd0c431da1f20b9" dependencies = [ - "autocfg", "num-integer", "num-traits", + "serde", ] [[package]] name = "num-complex" -version = "0.4.3" +version = "0.4.6" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "02e0d21255c828d6f128a1e41534206671e8c3ea0c62f32291e808dc82cff17d" +checksum = "73f88a1307638156682bada9d7604135552957b7818057dcef22705b4d509495" dependencies = [ "num-traits", ] [[package]] name = "num-integer" -version = "0.1.45" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "225d3389fb3509a24c93f5c29eb6bde2586b98d9f016636dff58d7c6f7569cd9" -dependencies = [ - "autocfg", - "num-traits", -] - -[[package]] -name = "num-iter" -version = "0.1.43" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "7d03e6c028c5dc5cac6e2dec0efda81fc887605bb3d884578bb6d6bf7514e252" -dependencies = [ - "autocfg", - "num-integer", - "num-traits", -] - -[[package]] -name = "num-rational" -version = "0.4.1" +version = "0.1.46" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "0638a1c9d0a3c0914158145bc76cff373a75a627e6ecbfb71cbe6f453a5a19b0" +checksum = "7969661fd2958a5cb096e56c8e1ad0444ac2bbcd0061bd28660485a44879858f" dependencies = [ - "autocfg", - "num-bigint", - "num-integer", "num-traits", ] [[package]] name = "num-traits" -version = "0.2.15" +version = "0.2.19" source = 
"registry+https://github.com/rust-lang/crates.io-index" -checksum = "578ede34cf02f8924ab9447f50c28075b4d3e5b269972345e7e0372b38c6cdcd" +checksum = "071dfc062690e90b734c0b2273ce72ad0ffa95f0c74596bc250dcfd960262841" dependencies = [ "autocfg", "libm", ] [[package]] -name = "num_cpus" -version = "1.15.0" +name = "object" +version = "0.37.3" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "0fac9e2da13b5eb447a6ce3d392f23a29d8694bff781bf03a16cd9ac8697593b" +checksum = "ff76201f031d8863c38aa7f905eca4f53abbfa15f609db4277d44cd8938f33fe" dependencies = [ - "hermit-abi", - "libc", + "memchr", ] [[package]] name = "object_store" -version = "0.5.5" +version = "0.13.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e1ea8f683b4f89a64181393742c041520a1a87e9775e6b4c0dd5a3281af05fc6" +checksum = "622acbc9100d3c10e2ee15804b0caa40e55c933d5aa53814cd520805b7958a49" dependencies = [ "async-trait", "base64", "bytes", "chrono", - "futures", + "form_urlencoded", + "futures-channel", + "futures-core", + "futures-util", + "http", + "http-body-util", + "httparse", + "humantime", + "hyper", "itertools", + "md-5", "parking_lot", "percent-encoding", "quick-xml", - "rand", + "rand 0.10.0", "reqwest", "ring", - "rustls-pemfile", + "rustls-pki-types", "serde", "serde_json", - "snafu", + "serde_urlencoded", + "thiserror", "tokio", "tracing", "url", "walkdir", + "wasm-bindgen-futures", + "web-time", ] [[package]] name = "once_cell" -version = "1.17.1" +version = "1.21.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9f7c3e4beb33f85d45ae3e3a1792185706c8e16d043238c593331cc7cd313b50" + +[[package]] +name = "openssl-probe" +version = "0.2.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "b7e5500299e16ebb147ae15a00a942af264cf3688f47923b8fc2cd5858f23ad3" +checksum = "7c87def4c32ab89d880effc9e097653c8da5d6ef28e6b539d313baaacfbafcbe" [[package]] name = "ordered-float" -version = "2.10.0" 
+version = "2.10.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "7940cf2ca942593318d07fcf2596cdca60a85c9e7fab408a5e21a4f9dcd40d87" +checksum = "68f19d67e5a2795c94e73e0bb1cc1a7edeb2e28efd39e2e1c9b7a40c1108b11c" dependencies = [ "num-traits", ] [[package]] name = "parking_lot" -version = "0.12.1" +version = "0.12.5" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "3742b2c103b9f06bc9fff0a37ff4912935851bee6d36f3c02bcc755bcfec228f" +checksum = "93857453250e3077bd71ff98b6a65ea6621a19bb0f559a85248955ac12c45a1a" dependencies = [ "lock_api", "parking_lot_core", @@ -1749,27 +2744,26 @@ dependencies = [ [[package]] name = "parking_lot_core" -version = "0.9.7" +version = "0.9.12" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "9069cbb9f99e3a5083476ccb29ceb1de18b9118cafa53e90c9551235de2b9521" +checksum = "2621685985a2ebf1c516881c026032ac7deafcda1a2c9b7850dc81e3dfcb64c1" dependencies = [ "cfg-if", "libc", "redox_syscall", "smallvec", - "windows-sys 0.45.0", + "windows-link", ] [[package]] name = "parquet" -version = "34.0.0" +version = "58.1.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "7ac135ecf63ebb5f53dda0921b0b76d6048b3ef631a5f4760b9e8f863ff00cfa" +checksum = "7d3f9f2205199603564127932b89695f52b62322f541d0fc7179d57c2e1c9877" dependencies = [ "ahash", "arrow-array", "arrow-buffer", - "arrow-cast", "arrow-data", "arrow-ipc", "arrow-schema", @@ -1780,56 +2774,107 @@ dependencies = [ "chrono", "flate2", "futures", - "hashbrown 0.13.2", - "lz4", - "num", + "half", + "hashbrown 0.16.1", + "lz4_flex", "num-bigint", + "num-integer", + "num-traits", + "object_store", "paste", "seq-macro", + "simdutf8", "snap", "thrift", "tokio", "twox-hash", - "zstd 0.12.3+zstd.1.5.2", + "zstd", ] [[package]] name = "paste" -version = "1.0.12" +version = "1.0.15" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = 
"9f746c4065a8fa3fe23974dd82f15431cc8d40779821001404d10d2e79ca7d79" +checksum = "57c0d7b74b563b49d38dae00a0c37d4d6de9b432382b2892f0574ddcae73fd0a" [[package]] -name = "percent-encoding" -version = "2.2.0" +name = "pbjson" +version = "0.8.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "898bac3fa00d0ba57a4e8289837e965baa2dee8c3749f3b11d45a64b4223d9c3" +dependencies = [ + "base64", + "serde", +] + +[[package]] +name = "pbjson-build" +version = "0.8.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "478c572c3d73181ff3c2539045f6eb99e5491218eae919370993b890cdbdd98e" +checksum = "af22d08a625a2213a78dbb0ffa253318c5c79ce3133d32d296655a7bdfb02095" +dependencies = [ + "heck", + "itertools", + "prost", + "prost-types", +] [[package]] -name = "pest" -version = "2.5.6" +name = "pbjson-types" +version = "0.8.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "8cbd939b234e95d72bc393d51788aec68aeeb5d51e748ca08ff3aad58cb722f7" +checksum = "8e748e28374f10a330ee3bb9f29b828c0ac79831a32bab65015ad9b661ead526" dependencies = [ - "thiserror", - "ucd-trie", + "bytes", + "chrono", + "pbjson", + "pbjson-build", + "prost", + "prost-build", + "serde", ] +[[package]] +name = "percent-encoding" +version = "2.3.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9b4f627cb1b25917193a259e49bdad08f671f8d9708acfd5fe0a8c1455d87220" + [[package]] name = "petgraph" -version = "0.6.3" +version = "0.8.3" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "4dd7d28ee937e54fe3080c91faa1c3a46c06de6252988a7f4592ba2310ef22a4" +checksum = "8701b58ea97060d5e5b155d383a69952a60943f0e6dfe30b04c287beb0b27455" dependencies = [ "fixedbitset", + "hashbrown 0.15.5", "indexmap", + "serde", +] + +[[package]] +name = "phf" +version = "0.12.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"913273894cec178f401a31ec4b656318d95473527be05c0752cc41cdc32be8b7" +dependencies = [ + "phf_shared", +] + +[[package]] +name = "phf_shared" +version = "0.12.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "06005508882fb681fd97892ecff4b7fd0fee13ef1aa569f8695dae7ab9099981" +dependencies = [ + "siphasher", ] [[package]] name = "pin-project-lite" -version = "0.2.9" +version = "0.2.17" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e0a7ae3ac2f1173085d398531c705756c94a4c56843785df85a60c1a0afac116" +checksum = "a89322df9ebe1c1578d689c92318e070967d1042b512afbe49518723f4e6d5cd" [[package]] name = "pin-utils" @@ -1839,416 +2884,569 @@ checksum = "8b870d8c151b6f2fb93e84a13146138f05d02ed11c7e7c54f8826aaaf7c9f184" [[package]] name = "pkg-config" -version = "0.3.26" +version = "0.3.32" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "6ac9a59f73473f1b8d852421e59e64809f025994837ef743615c6d0c5b305160" +checksum = "7edddbd0b52d732b21ad9a5fab5c704c14cd949e5e9a1ec5929a24fded1b904c" [[package]] -name = "ppv-lite86" -version = "0.2.17" +name = "portable-atomic" +version = "1.13.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5b40af805b3121feab8a3c29f04d8ad262fa8e0561883e7653e024ae4479e6de" +checksum = "c33a9471896f1c69cecef8d20cbe2f7accd12527ce60845ff44c153bb2a21b49" [[package]] -name = "proc-macro-hack" -version = "0.5.20+deprecated" +name = "potential_utf" +version = "0.1.4" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "dc375e1527247fe1a97d8b7156678dfe7c1af2fc075c9a4db3690ecd2a148068" +checksum = "b73949432f5e2a09657003c25bca5e19a0e9c84f8058ca374f49e0ebe605af77" +dependencies = [ + "zerovec", +] [[package]] -name = "proc-macro2" -version = "1.0.52" +name = "ppv-lite86" +version = "0.2.21" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1d0e1ae9e836cc3beddd63db0df682593d7e2d3d891ae8c9083d2113e1744224" 
+checksum = "85eae3c4ed2f50dcfe72643da4befc30deadb458a9b590d720cde2f2b1e97da9" dependencies = [ - "unicode-ident", + "zerocopy", ] [[package]] -name = "prost" -version = "0.9.0" +name = "prettyplease" +version = "0.2.37" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "444879275cb4fd84958b1a1d5420d15e6fcf7c235fe47f053c9c2a80aceb6001" +checksum = "479ca8adacdd7ce8f1fb39ce9ecccbfe93a3f1344b3d0d97f20bc0196208f62b" dependencies = [ - "bytes", - "prost-derive 0.9.0", + "proc-macro2", + "syn 2.0.117", ] [[package]] -name = "prost" -version = "0.11.8" +name = "proc-macro2" +version = "1.0.106" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e48e50df39172a3e7eb17e14642445da64996989bc212b583015435d39a58537" +checksum = "8fd00f0bb2e90d81d1044c2b32617f68fcb9fa3bb7640c23e9c748e53fb30934" dependencies = [ - "bytes", - "prost-derive 0.11.8", + "unicode-ident", ] [[package]] -name = "prost-build" -version = "0.9.0" +name = "prost" +version = "0.14.3" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "62941722fb675d463659e49c4f3fe1fe792ff24fe5bbaa9c08cd3b98a1c354f5" +checksum = "d2ea70524a2f82d518bce41317d0fae74151505651af45faf1ffbd6fd33f0568" dependencies = [ "bytes", - "heck 0.3.3", - "itertools", - "lazy_static", - "log", - "multimap", - "petgraph", - "prost 0.9.0", - "prost-types 0.9.0", - "regex", - "tempfile", - "which", + "prost-derive", ] [[package]] name = "prost-build" -version = "0.11.8" +version = "0.14.3" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2c828f93f5ca4826f97fedcbd3f9a536c16b12cff3dbbb4a007f932bbad95b12" +checksum = "343d3bd7056eda839b03204e68deff7d1b13aba7af2b2fd16890697274262ee7" dependencies = [ - "bytes", - "heck 0.4.1", + "heck", "itertools", - "lazy_static", "log", "multimap", "petgraph", - "prost 0.11.8", - "prost-types 0.11.8", + "prettyplease", + "prost", + "prost-types", "regex", + "syn 2.0.117", "tempfile", - "which", ] 
[[package]] name = "prost-derive" -version = "0.9.0" +version = "0.14.3" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f9cc1a3263e07e0bf68e96268f37665207b49560d98739662cdfaae215c720fe" +checksum = "27c6023962132f4b30eb4c172c91ce92d933da334c59c23cddee82358ddafb0b" dependencies = [ "anyhow", "itertools", "proc-macro2", "quote", - "syn", + "syn 2.0.117", ] [[package]] -name = "prost-derive" -version = "0.11.8" +name = "prost-types" +version = "0.14.3" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "4ea9b0f8cbe5e15a8a042d030bd96668db28ecb567ec37d691971ff5731d2b1b" +checksum = "8991c4cbdb8bc5b11f0b074ffe286c30e523de90fee5ba8132f1399f23cb3dd7" dependencies = [ - "anyhow", - "itertools", - "proc-macro2", - "quote", - "syn", + "prost", ] [[package]] -name = "prost-types" -version = "0.9.0" +name = "protobuf-src" +version = "2.1.1+27.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "534b7a0e836e3c482d2693070f982e39e7611da9695d4d1f5a4b186b51faef0a" +checksum = "6217c3504da19b85a3a4b2e9a5183d635822d83507ba0986624b5c05b83bfc40" dependencies = [ - "bytes", - "prost 0.9.0", + "cmake", ] [[package]] -name = "prost-types" -version = "0.11.8" +name = "psm" +version = "0.1.30" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "379119666929a1afd7a043aa6cf96fa67a6dce9af60c88095a4686dbce4c9c88" +checksum = "3852766467df634d74f0b2d7819bf8dc483a0eb2e3b0f50f756f9cfe8b0d18d8" dependencies = [ - "prost 0.11.8", + "ar_archive_writer", + "cc", ] [[package]] name = "pyo3" -version = "0.18.1" +version = "0.28.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "06a3d8e8a46ab2738109347433cb7b96dffda2e4a218b03ef27090238886b147" +checksum = "cf85e27e86080aafd5a22eae58a162e133a589551542b3e5cee4beb27e54f8e1" dependencies = [ - "cfg-if", - "indoc", "libc", - "memoffset", - "parking_lot", + "once_cell", + "portable-atomic", "pyo3-build-config", "pyo3-ffi", 
"pyo3-macros", - "unindent", ] [[package]] -name = "pyo3-build-config" -version = "0.18.1" +name = "pyo3-async-runtimes" +version = "0.28.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "75439f995d07ddfad42b192dfcf3bc66a7ecfd8b4a1f5f6f046aa5c2c5d7677d" +checksum = "9e7364a95bf00e8377bbf9b0f09d7ff9715a29d8fcf93b47d1a967363b973178" dependencies = [ + "futures-channel", + "futures-util", "once_cell", + "pin-project-lite", + "pyo3", + "tokio", +] + +[[package]] +name = "pyo3-build-config" +version = "0.28.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "8bf94ee265674bf76c09fa430b0e99c26e319c945d96ca0d5a8215f31bf81cf7" +dependencies = [ "target-lexicon", ] [[package]] name = "pyo3-ffi" -version = "0.18.1" +version = "0.28.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "839526a5c07a17ff44823679b68add4a58004de00512a95b6c1c98a6dcac0ee5" +checksum = "491aa5fc66d8059dd44a75f4580a2962c1862a1c2945359db36f6c2818b748dc" dependencies = [ "libc", "pyo3-build-config", ] +[[package]] +name = "pyo3-log" +version = "0.13.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "26c2ec80932c5c3b2d4fbc578c9b56b2d4502098587edb8bef5b6bfcad43682e" +dependencies = [ + "arc-swap", + "log", + "pyo3", +] + [[package]] name = "pyo3-macros" -version = "0.18.1" +version = "0.28.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "bd44cf207476c6a9760c4653559be4f206efafb924d3e4cbf2721475fc0d6cc5" +checksum = "f5d671734e9d7a43449f8480f8b38115df67bef8d21f76837fa75ee7aaa5e52e" dependencies = [ "proc-macro2", "pyo3-macros-backend", "quote", - "syn", + "syn 2.0.117", ] [[package]] name = "pyo3-macros-backend" -version = "0.18.1" +version = "0.28.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "dc1f43d8e30460f36350d18631ccf85ded64c059829208fe680904c65bcd0a4c" +checksum = 
"22faaa1ce6c430a1f71658760497291065e6450d7b5dc2bcf254d49f66ee700a" dependencies = [ + "heck", "proc-macro2", + "pyo3-build-config", "quote", - "syn", + "syn 2.0.117", ] [[package]] name = "quad-rand" -version = "0.2.1" +version = "0.2.3" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "658fa1faf7a4cc5f057c9ee5ef560f717ad9d8dc66d975267f709624d6e1ab88" +checksum = "5a651516ddc9168ebd67b24afd085a718be02f8858fe406591b013d101ce2f40" [[package]] name = "quick-xml" -version = "0.27.1" +version = "0.39.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ffc053f057dd768a56f62cd7e434c42c831d296968997e9ac1f76ea7c2d14c41" +checksum = "958f21e8e7ceb5a1aa7fa87fab28e7c75976e0bfe7e23ff069e0a260f894067d" dependencies = [ "memchr", "serde", ] +[[package]] +name = "quinn" +version = "0.11.9" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b9e20a958963c291dc322d98411f541009df2ced7b5a4f2bd52337638cfccf20" +dependencies = [ + "bytes", + "cfg_aliases", + "pin-project-lite", + "quinn-proto", + "quinn-udp", + "rustc-hash", + "rustls", + "socket2", + "thiserror", + "tokio", + "tracing", + "web-time", +] + +[[package]] +name = "quinn-proto" +version = "0.11.14" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "434b42fec591c96ef50e21e886936e66d3cc3f737104fdb9b737c40ffb94c098" +dependencies = [ + "bytes", + "getrandom 0.3.4", + "lru-slab", + "rand 0.9.2", + "ring", + "rustc-hash", + "rustls", + "rustls-pki-types", + "slab", + "thiserror", + "tinyvec", + "tracing", + "web-time", +] + +[[package]] +name = "quinn-udp" +version = "0.5.14" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "addec6a0dcad8a8d96a771f815f0eaf55f9d1805756410b39f5fa81332574cbd" +dependencies = [ + "cfg_aliases", + "libc", + "once_cell", + "socket2", + "tracing", + "windows-sys 0.60.2", +] + [[package]] name = "quote" -version = "1.0.26" +version = "1.0.45" source = 
"registry+https://github.com/rust-lang/crates.io-index" -checksum = "4424af4bf778aae2051a77b60283332f386554255d722233d09fbfc7e30da2fc" +checksum = "41f2619966050689382d2b44f664f4bc593e129785a36d6ee376ddf37259b924" dependencies = [ "proc-macro2", ] +[[package]] +name = "r-efi" +version = "5.3.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "69cdb34c158ceb288df11e18b4bd39de994f6657d83847bdffdbd7f346754b0f" + +[[package]] +name = "r-efi" +version = "6.0.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "f8dcc9c7d52a811697d2151c701e0d08956f92b0e24136cf4cf27b57a6a0d9bf" + [[package]] name = "rand" -version = "0.8.5" +version = "0.9.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "34af8d1a0e25924bc5b7c43c079c942339d8f0a8b57c39049bef581b46327404" +checksum = "6db2770f06117d490610c7488547d543617b21bfa07796d7a12f6f1bd53850d1" dependencies = [ - "libc", "rand_chacha", - "rand_core", + "rand_core 0.9.5", +] + +[[package]] +name = "rand" +version = "0.10.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "bc266eb313df6c5c09c1c7b1fbe2510961e5bcd3add930c1e31f7ed9da0feff8" +dependencies = [ + "chacha20", + "getrandom 0.4.2", + "rand_core 0.10.0", ] [[package]] name = "rand_chacha" -version = "0.3.1" +version = "0.9.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e6c10a63a0fa32252be49d21e7709d4d4baf8d231c2dbce1eaa8141b9b127d88" +checksum = "d3022b5f1df60f26e1ffddd6c66e8aa15de382ae63b3a0c1bfc0e4d3e3f325cb" dependencies = [ "ppv-lite86", - "rand_core", + "rand_core 0.9.5", +] + +[[package]] +name = "rand_core" +version = "0.9.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "76afc826de14238e6e8c374ddcc1fa19e374fd8dd986b0d2af0d02377261d83c" +dependencies = [ + "getrandom 0.3.4", ] [[package]] name = "rand_core" -version = "0.6.4" +version = "0.10.0" +source = 
"registry+https://github.com/rust-lang/crates.io-index" +checksum = "0c8d0fd677905edcbeedbf2edb6494d676f0e98d54d5cf9bda0b061cb8fb8aba" + +[[package]] +name = "recursive" +version = "0.1.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0786a43debb760f491b1bc0269fe5e84155353c67482b9e60d0cfb596054b43e" +dependencies = [ + "recursive-proc-macro-impl", + "stacker", +] + +[[package]] +name = "recursive-proc-macro-impl" +version = "0.1.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ec0be4795e2f6a28069bec0b5ff3e2ac9bafc99e6a9a7dc3547996c5c816922c" +checksum = "76009fbe0614077fc1a2ce255e3a1881a2e3a3527097d5dc6d8212c585e7e38b" dependencies = [ - "getrandom", + "quote", + "syn 2.0.117", ] [[package]] name = "redox_syscall" -version = "0.2.16" +version = "0.5.18" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ed2bf2547551a7053d6fdfafda3f938979645c44812fbfcda098faae3f1a362d" +dependencies = [ + "bitflags", +] + +[[package]] +name = "regex" +version = "1.12.3" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "fb5a58c1855b4b6819d59012155603f0b22ad30cad752600aadfcb695265519a" +checksum = "e10754a14b9137dd7b1e3e5b0493cc9171fdd105e0ab477f51b72e7f3ac0e276" dependencies = [ - "bitflags", + "aho-corasick", + "memchr", + "regex-automata", + "regex-syntax", ] [[package]] -name = "regex" -version = "1.7.1" +name = "regex-automata" +version = "0.4.14" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "48aaa5748ba571fb95cd2c85c09f629215d3a6ece942baa100950af03a34f733" +checksum = "6e1dd4122fc1595e8162618945476892eefca7b88c52820e74af6262213cae8f" dependencies = [ "aho-corasick", "memchr", "regex-syntax", ] +[[package]] +name = "regex-lite" +version = "0.1.9" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "cab834c73d247e67f4fae452806d17d3c7501756d98c8808d7c9c7aa7d18f973" + [[package]] name = "regex-syntax" 
-version = "0.6.28" +version = "0.8.10" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "456c603be3e8d448b072f410900c09faf164fbce2d480456f50eea6e25f9c848" +checksum = "dc897dd8d9e8bd1ed8cdad82b5966c3e0ecae09fb1907d58efaa013543185d0a" [[package]] name = "regress" -version = "0.4.1" +version = "0.10.5" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "0a92ff21fe8026ce3f2627faaf43606f0b67b014dbc9ccf027181a804f75d92e" +checksum = "2057b2325e68a893284d1538021ab90279adac1139957ca2a74426c6f118fb48" dependencies = [ + "hashbrown 0.16.1", "memchr", ] +[[package]] +name = "repr_offset" +version = "0.2.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "fb1070755bd29dffc19d0971cab794e607839ba2ef4b69a9e6fbc8733c1b72ea" +dependencies = [ + "tstr", +] + [[package]] name = "reqwest" -version = "0.11.14" +version = "0.12.28" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "21eed90ec8570952d53b772ecf8f206aa1ec9a3d76b2521c56c42973f2d91ee9" +checksum = "eddd3ca559203180a307f12d114c268abf583f59b03cb906fd0b3ff8646c1147" dependencies = [ "base64", "bytes", - "encoding_rs", "futures-core", "futures-util", "h2", "http", "http-body", + "http-body-util", "hyper", "hyper-rustls", - "ipnet", + "hyper-util", "js-sys", "log", - "mime", - "once_cell", "percent-encoding", "pin-project-lite", + "quinn", "rustls", - "rustls-pemfile", + "rustls-native-certs", + "rustls-pki-types", "serde", "serde_json", "serde_urlencoded", + "sync_wrapper", "tokio", "tokio-rustls", "tokio-util", + "tower", + "tower-http", "tower-service", "url", "wasm-bindgen", "wasm-bindgen-futures", "wasm-streams", "web-sys", - "webpki-roots", - "winreg", ] [[package]] name = "ring" -version = "0.16.20" +version = "0.17.14" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "3053cf52e236a3ed746dfc745aa9cacf1b791d846bdaf412f60a8d7d6e17c8fc" +checksum = 
"a4689e6c2294d81e88dc6261c768b63bc4fcdb852be6d1352498b114f61383b7" dependencies = [ "cc", + "cfg-if", + "getrandom 0.2.17", "libc", - "once_cell", - "spin", "untrusted", - "web-sys", - "winapi", + "windows-sys 0.52.0", ] [[package]] -name = "rle-decode-fast" -version = "1.0.3" +name = "rustc-hash" +version = "2.1.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "3582f63211428f83597b51b2ddb88e2a91a9d52d12831f9d08f5e624e8977422" +checksum = "357703d41365b4b27c590e3ed91eabb1b663f07c4c084095e60cbed4362dff0d" [[package]] name = "rustc_version" -version = "0.4.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "bfa0f585226d2e68097d4f95d113b15b83a82e819ab25717ec0590d9584ef366" -dependencies = [ - "semver 1.0.17", -] - -[[package]] -name = "rustfmt-wrapper" -version = "0.2.0" +version = "0.4.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ed729e3bee08ec2befd593c27e90ca9fdd25efdc83c94c3b82eaef16e4f7406e" +checksum = "cfcb3a22ef46e85b45de6ee7e79d063319ebb6594faafcf1c225ea92ab6e9b92" dependencies = [ - "serde", - "tempfile", - "thiserror", - "toml", - "toolchain_find", + "semver", ] [[package]] name = "rustix" -version = "0.36.9" +version = "1.1.4" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "fd5c6ff11fecd55b40746d1995a02f2eb375bf8c00d192d521ee09f42bef37bc" +checksum = "b6fe4565b9518b83ef4f91bb47ce29620ca828bd32cb7e408f0062e9930ba190" dependencies = [ "bitflags", "errno", - "io-lifetimes", "libc", "linux-raw-sys", - "windows-sys 0.45.0", + "windows-sys 0.61.2", ] [[package]] name = "rustls" -version = "0.20.8" +version = "0.23.37" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "fff78fc74d175294f4e83b28343315ffcfb114b156f0185e9741cb5570f50e2f" +checksum = "758025cb5fccfd3bc2fd74708fd4682be41d99e5dff73c377c0646c6012c73a4" dependencies = [ - "log", + "once_cell", "ring", - "sct", - "webpki", + "rustls-pki-types", + 
"rustls-webpki", + "subtle", + "zeroize", ] [[package]] -name = "rustls-pemfile" -version = "1.0.2" +name = "rustls-native-certs" +version = "0.8.3" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d194b56d58803a43635bdc398cd17e383d6f71f9182b9a192c127ca42494a59b" +checksum = "612460d5f7bea540c490b2b6395d8e34a953e52b491accd6c86c8164c5932a63" dependencies = [ - "base64", + "openssl-probe", + "rustls-pki-types", + "schannel", + "security-framework", +] + +[[package]] +name = "rustls-pki-types" +version = "1.14.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "be040f8b0a225e40375822a563fa9524378b9d63112f53e19ffff34df5d33fdd" +dependencies = [ + "web-time", + "zeroize", +] + +[[package]] +name = "rustls-webpki" +version = "0.103.10" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "df33b2b81ac578cabaf06b89b0631153a3f416b0a886e8a7a1707fb51abbd1ef" +dependencies = [ + "ring", + "rustls-pki-types", + "untrusted", ] [[package]] name = "rustversion" -version = "1.0.12" +version = "1.0.22" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "4f3208ce4d8448b3f3e7d168a73f5e0c43a61e32930de3bceeccedb388b6bf06" +checksum = "b39cdef0fa800fc44525c84ccb54a029961a8215f9619753635a9c0d2538d46d" [[package]] name = "ryu" -version = "1.0.13" +version = "1.0.23" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f91339c0467de62360649f8d3e185ca8de4224ff281f66000de5eb2a77a79041" +checksum = "9774ba4a74de5f7b1c1451ed6cd5285a32eddb5cccb8cc655a4e50009e06477f" [[package]] name = "same-file" @@ -2259,11 +3457,20 @@ dependencies = [ "winapi-util", ] +[[package]] +name = "schannel" +version = "0.1.29" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "91c1b7e4904c873ef0710c1f407dde2e6287de2bebc1bbbf7d430bb7cbffd939" +dependencies = [ + "windows-sys 0.61.2", +] + [[package]] name = "schemars" -version = "0.8.12" +version = "0.8.22" 
source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "02c613288622e5f0c3fdc5dbd4db1c5fbe752746b1d1a56a0630b78fd00de44f" +checksum = "3fbf2ae1b8bc8e02df939598064d22402220cd5bbcca1c76f7d6a310974d5615" dependencies = [ "dyn-clone", "schemars_derive", @@ -2273,119 +3480,135 @@ dependencies = [ [[package]] name = "schemars_derive" -version = "0.8.12" +version = "0.8.22" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "109da1e6b197438deb6db99952990c7f959572794b80ff93707d55a232545e7c" +checksum = "32e265784ad618884abaea0600a9adf15393368d840e0222d101a072f3f7534d" dependencies = [ "proc-macro2", "quote", "serde_derive_internals", - "syn", + "syn 2.0.117", ] [[package]] name = "scopeguard" -version = "1.1.0" +version = "1.2.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d29ab0c6d3fc0ee92fe66e2d99f700eab17a8d57d1c1d3b748380fb20baa78cd" +checksum = "94143f37725109f92c262ed2cf5e59bce7498c01bcc1502d7b9afe439a4e9f49" [[package]] -name = "scratch" -version = "1.0.5" +name = "security-framework" +version = "3.7.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1792db035ce95be60c3f8853017b3999209281c24e2ba5bc8e59bf97a0c590c1" +checksum = "b7f4bc775c73d9a02cde8bf7b2ec4c9d12743edf609006c7facc23998404cd1d" +dependencies = [ + "bitflags", + "core-foundation", + "core-foundation-sys", + "libc", + "security-framework-sys", +] [[package]] -name = "sct" -version = "0.7.0" +name = "security-framework-sys" +version = "2.17.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d53dcdb7c9f8158937a7981b48accfd39a43af418591a5d008c7b22b5e1b7ca4" +checksum = "6ce2691df843ecc5d231c0b14ece2acc3efb62c0a398c7e1d875f3983ce020e3" dependencies = [ - "ring", - "untrusted", + "core-foundation-sys", + "libc", ] [[package]] name = "semver" -version = "0.11.0" +version = "1.0.27" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = 
"f301af10236f6df4160f7c3f04eec6dbc70ace82d23326abad5edee88801c6b6" +checksum = "d767eb0aabc880b29956c35734170f26ed551a859dbd361d140cdbeca61ab1e2" dependencies = [ - "semver-parser", + "serde", + "serde_core", ] [[package]] -name = "semver" -version = "1.0.17" +name = "seq-macro" +version = "0.3.6" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "bebd363326d05ec3e2f532ab7660680f3b02130d780c299bca73469d521bc0ed" +checksum = "1bc711410fbe7399f390ca1c3b60ad0f53f80e95c5eb935e52268a0e2cd49acc" [[package]] -name = "semver-parser" -version = "0.10.2" +name = "serde" +version = "1.0.228" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "00b0bef5b7f9e0df16536d3961cfb6e84331c065b4066afb39768d0e319411f7" +checksum = "9a8e94ea7f378bd32cbbd37198a4a91436180c5bb472411e48b5ec2e2124ae9e" dependencies = [ - "pest", + "serde_core", + "serde_derive", ] [[package]] -name = "seq-macro" -version = "0.3.3" +name = "serde_bytes" +version = "0.11.19" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e6b44e8fc93a14e66336d230954dda83d18b4605ccace8fe09bc7514a71ad0bc" +checksum = "a5d440709e79d88e51ac01c4b72fc6cb7314017bb7da9eeff678aa94c10e3ea8" +dependencies = [ + "serde", + "serde_core", +] [[package]] -name = "serde" -version = "1.0.156" +name = "serde_core" +version = "1.0.228" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "314b5b092c0ade17c00142951e50ced110ec27cea304b1037c6969246c2469a4" +checksum = "41d385c7d4ca58e59fc732af25c3983b67ac852c1a25000afe1175de458b67ad" dependencies = [ "serde_derive", ] [[package]] name = "serde_derive" -version = "1.0.156" +version = "1.0.228" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d7e29c4601e36bcec74a223228dce795f4cd3616341a4af93520ca1a837c087d" +checksum = "d540f220d3187173da220f885ab66608367b6574e925011a9353e4badda91d79" dependencies = [ "proc-macro2", "quote", - "syn", + "syn 2.0.117", ] [[package]] name 
= "serde_derive_internals" -version = "0.26.0" +version = "0.29.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "85bf8229e7920a9f636479437026331ce11aa132b4dde37d121944a44d6e5f3c" +checksum = "18d26a20a969b9e3fdf2fc2d9f21eda6c40e2de84c9408bb5d3b05d499aae711" dependencies = [ "proc-macro2", "quote", - "syn", + "syn 2.0.117", ] [[package]] name = "serde_json" -version = "1.0.94" +version = "1.0.149" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1c533a59c9d8a93a09c6ab31f0fd5e5f4dd1b8fc9434804029839884765d04ea" +checksum = "83fc039473c5595ace860d8c4fafa220ff474b3fc6bfdb4293327f1a37e94d86" dependencies = [ "itoa", - "ryu", + "memchr", "serde", + "serde_core", + "zmij", ] [[package]] name = "serde_tokenstream" -version = "0.1.7" +version = "0.2.3" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "797ba1d80299b264f3aac68ab5d12e5825a561749db4df7cd7c8083900c5d4e9" +checksum = "d7c49585c52c01f13c5c2ebb333f14f6885d76daa768d8a037d28017ec538c69" dependencies = [ "proc-macro2", + "quote", "serde", - "syn", + "syn 2.0.117", ] [[package]] @@ -2402,9 +3625,9 @@ dependencies = [ [[package]] name = "serde_yaml" -version = "0.9.19" +version = "0.9.34+deprecated" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f82e6c8c047aa50a7328632d067bcae6ef38772a79e28daf32f735e0e4f3dd10" +checksum = "6a8b1a1a2ebf674015cc02edccce75287f1a0130d394307b36743c2f5d504b47" dependencies = [ "indexmap", "itoa", @@ -2415,143 +3638,163 @@ dependencies = [ [[package]] name = "sha2" -version = "0.10.6" +version = "0.10.9" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "82e6b795fe2e3b1e845bafcb27aa35405c4d47cdfc92af5fc8d3002f76cebdc0" +checksum = "a7507d819769d01a365ab707794a4084392c824f54a7a6a7862f8c3d0892b283" dependencies = [ "cfg-if", - "cpufeatures", + "cpufeatures 0.2.17", "digest", ] [[package]] -name = "slab" -version = "0.4.8" +name = "shlex" +version = 
"1.3.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "6528351c9bc8ab22353f9d776db39a20288e8d6c37ef8cfe3317cf875eecfc2d" -dependencies = [ - "autocfg", -] +checksum = "0fda2ff0d084019ba4d7c6f371c95d8fd75ce3524c3cb8fb653a3023f6323e64" [[package]] -name = "smallvec" -version = "1.10.0" +name = "simd-adler32" +version = "0.3.9" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a507befe795404456341dfab10cef66ead4c041f62b8b11bbb92bffe5d0953e0" +checksum = "703d5c7ef118737c72f1af64ad2f6f8c5e1921f818cdcb97b8fe6fc69bf66214" [[package]] -name = "snafu" -version = "0.7.4" +name = "simdutf8" +version = "0.1.5" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "cb0656e7e3ffb70f6c39b3c2a86332bb74aa3c679da781642590f3c1118c5045" -dependencies = [ - "doc-comment", - "snafu-derive", -] +checksum = "e3a9fe34e3e7a50316060351f37187a3f546bce95496156754b601a5fa71b76e" [[package]] -name = "snafu-derive" -version = "0.7.4" +name = "siphasher" +version = "1.0.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "475b3bbe5245c26f2d8a6f62d67c1f30eb9fffeccee721c45d162c3ebbdf81b2" -dependencies = [ - "heck 0.4.1", - "proc-macro2", - "quote", - "syn", -] +checksum = "b2aa850e253778c88a04c3d7323b043aeda9d3e30d5971937c1855769763678e" + +[[package]] +name = "slab" +version = "0.4.12" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0c790de23124f9ab44544d7ac05d60440adc586479ce501c1d6d7da3cd8c9cf5" + +[[package]] +name = "smallvec" +version = "1.15.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "67b1b7a3b5fe4f1376887184045fcf45c69e92af734b7aaddc05fb777b6fbd03" [[package]] name = "snap" -version = "1.1.0" +version = "1.1.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5e9f0ab6ef7eb7353d9119c170a436d1bf248eea575ac42d19d12f4e34130831" +checksum = 
"1b6b67fb9a61334225b5b790716f609cd58395f895b3fe8b328786812a40bc3b" [[package]] name = "socket2" -version = "0.4.9" +version = "0.6.3" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "64a4a911eed85daf18834cfaa86a79b7d266ff93ff5ba14005426219480ed662" +checksum = "3a766e1110788c36f4fa1c2b71b387a7815aa65f88ce0229841826633d93723e" dependencies = [ "libc", - "winapi", + "windows-sys 0.61.2", ] -[[package]] -name = "spin" -version = "0.5.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "6e63cff320ae2c57904679ba7cb63280a3dc4613885beafb148ee7bf9aa9042d" - [[package]] name = "sqlparser" -version = "0.32.0" +version = "0.61.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "0366f270dbabb5cc2e4c88427dc4c08bba144f81e32fbd459a013f26a4d16aa0" +checksum = "dbf5ea8d4d7c808e1af1cbabebca9a2abe603bcefc22294c5b95018d53200cb7" dependencies = [ "log", + "recursive", "sqlparser_derive", ] [[package]] name = "sqlparser_derive" -version = "0.1.1" +version = "0.5.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "55fe75cb4a364c7f7ae06c7dbbc8d84bddd85d6cdf9975963c3935bc1991761e" +checksum = "a6dd45d8fc1c79299bfbb7190e42ccbbdf6a5f52e4a6ad98d92357ea965bd289" dependencies = [ "proc-macro2", "quote", - "syn", + "syn 2.0.117", ] [[package]] -name = "static_assertions" -version = "1.1.0" +name = "stable_deref_trait" +version = "1.2.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "6ce2be8dc25455e1f91df71bfa12ad37d7af1092ae736f3a6cd0e37bc7810596" + +[[package]] +name = "stacker" +version = "0.1.23" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "08d74a23609d509411d10e2176dc2a4346e3b4aea2e7b1869f19fdedbc71c013" +dependencies = [ + "cc", + "cfg-if", + "libc", + "psm", + "windows-sys 0.59.0", +] + +[[package]] +name = "strsim" +version = "0.11.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = 
"a2eb9349b6444b326872e140eb1cf5e7c522154d69e7a0ffb0fb81c06b37543f" +checksum = "7da8b5736845d9f2fcb837ea5d9e2628564b3b043a70948a3f0b778838c5fb4f" [[package]] name = "strum" -version = "0.24.1" +version = "0.27.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "063e6045c0e62079840579a7e47a355ae92f60eb74daaf156fb1e84ba164e63f" +checksum = "af23d6f6c1a224baef9d3f61e287d2761385a5b88fdab4eb4c6f11aeb54c4bcf" [[package]] name = "strum_macros" -version = "0.24.3" +version = "0.27.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1e385be0d24f186b4ce2f9982191e7101bb737312ad61c1f2f984f34bcf85d59" +checksum = "7695ce3845ea4b33927c055a39dc438a45b059f7c1b3d91d38d10355fb8cbca7" dependencies = [ - "heck 0.4.1", + "heck", "proc-macro2", "quote", - "rustversion", - "syn", + "syn 2.0.117", ] [[package]] name = "substrait" -version = "0.4.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "3e977fc98d1e03cf99220bb6bb96f8838ffa5c1306a8c83c1b25aa20817eb6d0" -dependencies = [ - "heck 0.4.1", - "prost 0.11.8", - "prost-build 0.11.8", - "prost-types 0.11.8", +version = "0.62.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "62fc4b483a129b9772ccb9c3f7945a472112fdd9140da87f8a4e7f1d44e045d0" +dependencies = [ + "heck", + "pbjson", + "pbjson-build", + "pbjson-types", + "prettyplease", + "prost", + "prost-build", + "prost-types", + "protobuf-src", + "regress", "schemars", + "semver", "serde", "serde_json", "serde_yaml", + "syn 2.0.117", "typify", "walkdir", ] [[package]] name = "subtle" -version = "2.4.1" +version = "2.6.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "6bdef32e8150c2a081110b42772ffe7d7c9032b606bc226c8260fd97e0976601" +checksum = "13c2bddecc57b384dee18652358fb23172facb8a2c51ccc10d74c157bdea3292" [[package]] name = "syn" @@ -2565,51 +3808,73 @@ dependencies = [ ] [[package]] -name = "target-lexicon" -version = "0.12.6" +name = 
"syn" +version = "2.0.117" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "8ae9980cab1db3fceee2f6c6f643d5d8de2997c58ee8d25fb0cc8a9e9e7348e5" +checksum = "e665b8803e7b1d2a727f4023456bbbbe74da67099c585258af0ad9c5013b9b99" +dependencies = [ + "proc-macro2", + "quote", + "unicode-ident", +] [[package]] -name = "tempfile" -version = "3.4.0" +name = "sync_wrapper" +version = "1.0.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "af18f7ae1acd354b992402e9ec5864359d693cd8a79dcbef59f76891701c1e95" +checksum = "0bf256ce5efdfa370213c1dabab5935a12e49f2c58d15e9eac2870d3b4f27263" dependencies = [ - "cfg-if", - "fastrand", - "redox_syscall", - "rustix", - "windows-sys 0.42.0", + "futures-core", ] [[package]] -name = "termcolor" -version = "1.2.0" +name = "synstructure" +version = "0.13.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "be55cf8942feac5c765c2c993422806843c9a9a45d4d5c407ad6dd2ea95eb9b6" +checksum = "728a70f3dbaf5bab7f0c4b1ac8d7ae5ea60a4b5549c8a5914361c99147a709d2" dependencies = [ - "winapi-util", + "proc-macro2", + "quote", + "syn 2.0.117", +] + +[[package]] +name = "target-lexicon" +version = "0.13.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "adb6935a6f5c20170eeceb1a3835a49e12e19d792f6dd344ccc76a985ca5a6ca" + +[[package]] +name = "tempfile" +version = "3.27.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "32497e9a4c7b38532efcdebeef879707aa9f794296a4f0244f6f69e9bc8574bd" +dependencies = [ + "fastrand", + "getrandom 0.4.2", + "once_cell", + "rustix", + "windows-sys 0.61.2", ] [[package]] name = "thiserror" -version = "1.0.39" +version = "2.0.18" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a5ab016db510546d856297882807df8da66a16fb8c4101cb8b30054b0d5b2d9c" +checksum = "4288b5bcbc7920c07a1149a35cf9590a2aa808e0bc1eafaade0b80947865fbc4" dependencies = [ "thiserror-impl", ] 
[[package]] name = "thiserror-impl" -version = "1.0.39" +version = "2.0.18" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5420d42e90af0c38c3290abcca25b9b3bdf379fc9f55c528f53a269d9c9a267e" +checksum = "ebc4ee7f67670e9b64d05fa4253e753e016c6c95ff35b89b7941d6b856dec1d5" dependencies = [ "proc-macro2", "quote", - "syn", + "syn 2.0.117", ] [[package]] @@ -2624,30 +3889,29 @@ dependencies = [ ] [[package]] -name = "time" -version = "0.1.45" +name = "tiny-keccak" +version = "2.0.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1b797afad3f312d1c66a56d11d0316f916356d11bd158fbc6ca6389ff6bf805a" +checksum = "2c9d3793400a45f954c52e73d068316d76b6f4e36977e3fcebb13a2721e80237" dependencies = [ - "libc", - "wasi 0.10.0+wasi-snapshot-preview1", - "winapi", + "crunchy", ] [[package]] -name = "tiny-keccak" -version = "2.0.2" +name = "tinystr" +version = "0.8.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2c9d3793400a45f954c52e73d068316d76b6f4e36977e3fcebb13a2721e80237" +checksum = "42d3e9c45c09de15d06dd8acf5f4e0e399e85927b7f00711024eb7ae10fa4869" dependencies = [ - "crunchy", + "displaydoc", + "zerovec", ] [[package]] name = "tinyvec" -version = "1.6.0" +version = "1.11.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "87cc5ceb3875bb20c2890005a4e226a4651264a5c75edb2421b52861a0a0cb50" +checksum = "3e61e67053d25a4e82c844e8424039d9745781b3fc4f32b8d55ed50f5f667ef3" dependencies = [ "tinyvec_macros", ] @@ -2660,105 +3924,116 @@ checksum = "1f3ccbac311fea05f86f61904b462b55fb3df8837a366dfc601a0161d0532f20" [[package]] name = "tokio" -version = "1.26.0" +version = "1.50.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "03201d01c3c27a29c8a5cee5b55a93ddae1ccf6f08f65365c2c918f8c1b76f64" +checksum = "27ad5e34374e03cfffefc301becb44e9dc3c17584f414349ebe29ed26661822d" dependencies = [ - "autocfg", "bytes", "libc", - "memchr", "mio", - 
"num_cpus", - "parking_lot", "pin-project-lite", "socket2", "tokio-macros", - "windows-sys 0.45.0", + "windows-sys 0.61.2", ] [[package]] name = "tokio-macros" -version = "1.8.2" +version = "2.6.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d266c00fde287f55d3f1c3e96c500c362a2b8c695076ec180f27918820bc6df8" +checksum = "5c55a2eff8b69ce66c84f85e1da1c233edc36ceb85a2058d11b0d6a3c7e7569c" dependencies = [ "proc-macro2", "quote", - "syn", + "syn 2.0.117", ] [[package]] name = "tokio-rustls" -version = "0.23.4" +version = "0.26.4" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "c43ee83903113e03984cb9e5cebe6c04a5116269e900e3ddba8f068a62adda59" +checksum = "1729aa945f29d91ba541258c8df89027d5792d85a8841fb65e8bf0f4ede4ef61" dependencies = [ "rustls", "tokio", - "webpki", ] [[package]] name = "tokio-stream" -version = "0.1.12" +version = "0.1.18" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "8fb52b74f05dbf495a8fba459fdc331812b96aa086d9eb78101fa0d4569c3313" +checksum = "32da49809aab5c3bc678af03902d4ccddea2a87d028d86392a4b1560c6906c70" dependencies = [ "futures-core", "pin-project-lite", "tokio", + "tokio-util", ] [[package]] name = "tokio-util" -version = "0.7.7" +version = "0.7.18" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5427d89453009325de0d8f342c9490009f76e999cb7672d77e46267448f7e6b2" +checksum = "9ae9cec805b01e8fc3fd2fe289f89149a9b66dd16786abd8b19cfa7b48cb0098" dependencies = [ "bytes", "futures-core", "futures-sink", "pin-project-lite", "tokio", - "tracing", ] [[package]] -name = "toml" -version = "0.5.11" +name = "tower" +version = "0.5.3" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f4f7f0dd8d50a853a531c426359045b1998f04219d88799810762cd4ad314234" +checksum = "ebe5ef63511595f1344e2d5cfa636d973292adc0eec1f0ad45fae9f0851ab1d4" dependencies = [ - "serde", + "futures-core", + "futures-util", + 
"pin-project-lite", + "sync_wrapper", + "tokio", + "tower-layer", + "tower-service", ] [[package]] -name = "toolchain_find" -version = "0.2.0" +name = "tower-http" +version = "0.6.8" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5e85654a10e7a07a47c6f19d93818f3f343e22927f2fa280c84f7c8042743413" +checksum = "d4e6559d53cc268e5031cd8429d05415bc4cb4aefc4aa5d6cc35fbf5b924a1f8" dependencies = [ - "home", - "lazy_static", - "regex", - "semver 0.11.0", - "walkdir", + "bitflags", + "bytes", + "futures-util", + "http", + "http-body", + "iri-string", + "pin-project-lite", + "tower", + "tower-layer", + "tower-service", ] +[[package]] +name = "tower-layer" +version = "0.3.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "121c2a6cda46980bb0fcd1647ffaf6cd3fc79a013de288782836f6df9c48780e" + [[package]] name = "tower-service" -version = "0.3.2" +version = "0.3.3" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "b6bc1c9ce2b5135ac7f93c72918fc37feb872bdc6a5533a8b85eb4b86bfdae52" +checksum = "8df9b6e13f2d32c91b9bd719c00d1958837bc7dec474d94952798cc8e69eeec3" [[package]] name = "tracing" -version = "0.1.37" +version = "0.1.44" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "8ce8c33a8d48bd45d624a6e523445fd21ec13d3653cd51f681abf67418f54eb8" +checksum = "63e71662fa4b2a2c3a26f570f037eb95bb1f85397f3cd8076caed2f026a6d100" dependencies = [ - "cfg-if", "pin-project-lite", "tracing-attributes", "tracing-core", @@ -2766,62 +4041,74 @@ dependencies = [ [[package]] name = "tracing-attributes" -version = "0.1.23" +version = "0.1.31" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "4017f8f45139870ca7e672686113917c71c7a6e02d4924eda67186083c03081a" +checksum = "7490cfa5ec963746568740651ac6781f701c9c5ea257c58e057f3ba8cf69e8da" dependencies = [ "proc-macro2", "quote", - "syn", + "syn 2.0.117", ] [[package]] name = "tracing-core" -version = "0.1.30" +version 
= "0.1.36" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "24eb03ba0eab1fd845050058ce5e616558e8f8d8fca633e6b163fe25c797213a" +checksum = "db97caf9d906fbde555dd62fa95ddba9eecfd14cb388e4f491a66d74cd5fb79a" dependencies = [ "once_cell", ] [[package]] name = "try-lock" -version = "0.2.4" +version = "0.2.5" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "3528ecfd12c466c6f163363caf2d02a71161dd5e1cc6ae7b34207ea2d42d81ed" +checksum = "e421abadd41a4225275504ea4d6566923418b7f05506fbc9c0fe86ba7396114b" [[package]] -name = "twox-hash" -version = "1.6.3" +name = "tstr" +version = "0.2.4" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "97fee6b57c6a41524a810daee9286c02d7752c4253064d0b05472833a438f675" +checksum = "7f8e0294f14baae476d0dd0a2d780b2e24d66e349a9de876f5126777a37bdba7" dependencies = [ - "cfg-if", - "static_assertions", + "tstr_proc_macros", ] [[package]] -name = "typed-builder" -version = "0.10.0" +name = "tstr_proc_macros" +version = "0.2.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "89851716b67b937e393b3daa8423e67ddfc4bbbf1654bcf05488e95e0828db0c" -dependencies = [ - "proc-macro2", - "quote", - "syn", -] +checksum = "e78122066b0cb818b8afd08f7ed22f7fdbc3e90815035726f0840d0d26c0747a" + +[[package]] +name = "twox-hash" +version = "2.1.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9ea3136b675547379c4bd395ca6b938e5ad3c3d20fad76e7fe85f9e0d011419c" + +[[package]] +name = "typed-arena" +version = "2.0.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "6af6ae20167a9ece4bcb41af5b80f8a1f1df981f6391189ce00fd257af04126a" [[package]] name = "typenum" -version = "1.16.0" +version = "1.19.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "497961ef93d974e23eb6f433eb5fe1b7930b659f06d12dec6fc44a8f554c0bba" +checksum = 
"562d481066bde0658276a35467c4af00bdc6ee726305698a55b86e61d7ad82bb" + +[[package]] +name = "typewit" +version = "1.14.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "f8c1ae7cc0fdb8b842d65d127cb981574b0d2b249b74d1c7a2986863dc134f71" [[package]] name = "typify" -version = "0.0.10" +version = "0.5.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2e8486352f3c946e69f983558cfc09b295250b01e01b381ec67a05a812d01d63" +checksum = "e6d5bcc6f62eb1fa8aa4098f39b29f93dcb914e17158b76c50360911257aa629" dependencies = [ "typify-impl", "typify-macro", @@ -2829,198 +4116,188 @@ dependencies = [ [[package]] name = "typify-impl" -version = "0.0.10" +version = "0.5.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a7624d0b911df6e2bbf34a236f76281f93b294cdde1d4df1dbdb748e5a7fefa5" +checksum = "a1eb359f7ffa4f9ebe947fa11a1b2da054564502968db5f317b7e37693cb2240" dependencies = [ - "heck 0.4.1", + "heck", "log", "proc-macro2", "quote", "regress", - "rustfmt-wrapper", "schemars", + "semver", + "serde", "serde_json", - "syn", + "syn 2.0.117", "thiserror", "unicode-ident", ] [[package]] name = "typify-macro" -version = "0.0.10" +version = "0.5.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "0c42802aa033cee7650a4e1509ba7d5848a56f84be7c4b31e4385ee12445e942" +checksum = "911c32f3c8514b048c1b228361bebb5e6d73aeec01696e8cc0e82e2ffef8ab7a" dependencies = [ "proc-macro2", "quote", "schemars", + "semver", "serde", "serde_json", "serde_tokenstream", - "syn", + "syn 2.0.117", "typify-impl", ] -[[package]] -name = "ucd-trie" -version = "0.1.5" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "9e79c4d996edb816c91e4308506774452e55e95c3c9de07b6729e17e15a5ef81" - -[[package]] -name = "unicode-bidi" -version = "0.3.11" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = 
"524b68aca1d05e03fdf03fcdce2c6c94b6daf6d16861ddaa7e4f2b6638a9052c" - [[package]] name = "unicode-ident" -version = "1.0.8" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e5464a87b239f13a63a501f2701565754bae92d243d4bb7eb12f6d57d2269bf4" - -[[package]] -name = "unicode-normalization" -version = "0.1.22" +version = "1.0.24" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5c5713f0fc4b5db668a2ac63cdb7bb4469d8c9fed047b1d0292cc7b0ce2ba921" -dependencies = [ - "tinyvec", -] +checksum = "e6e4313cd5fcd3dad5cafa179702e2b244f760991f45397d14d4ebf38247da75" [[package]] name = "unicode-segmentation" -version = "1.10.1" +version = "1.13.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1dd624098567895118886609431a7c3b8f516e41d30e0643f03d94592a147e36" +checksum = "9629274872b2bfaf8d66f5f15725007f635594914870f65218920345aa11aa8c" [[package]] name = "unicode-width" -version = "0.1.10" +version = "0.2.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "c0edd1e5b14653f783770bce4a4dabb4a5108a5370a5f5d8cfe8710c361f6c8b" +checksum = "b4ac048d71ede7ee76d585517add45da530660ef4390e49b098733c6e897f254" [[package]] -name = "unindent" -version = "0.1.11" +name = "unicode-xid" +version = "0.2.6" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e1766d682d402817b5ac4490b3c3002d91dfa0d22812f341609f97b08757359c" +checksum = "ebc1c04c71510c7f702b52b7c350734c9ff1295c464a03335b00bb84fc54f853" [[package]] name = "unsafe-libyaml" -version = "0.2.7" +version = "0.2.11" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ad2024452afd3874bf539695e04af6732ba06517424dbf958fdb16a01f3bef6c" +checksum = "673aac59facbab8a9007c7f6108d11f63b603f7cabff99fabf650fea5c32b861" [[package]] name = "untrusted" -version = "0.7.1" +version = "0.9.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = 
"a156c684c91ea7d62626509bce3cb4e1d9ed5c4d978f7b4352658f96a4c26b4a" +checksum = "8ecb6da28b8a351d773b68d5825ac39017e680750f980f3a1a85cd8dd28a47c1" [[package]] name = "url" -version = "2.3.1" +version = "2.5.8" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "0d68c799ae75762b8c3fe375feb6600ef5602c883c5d21eb51c09f22b83c4643" +checksum = "ff67a8a4397373c3ef660812acab3268222035010ab8680ec4215f38ba3d0eed" dependencies = [ "form_urlencoded", "idna", "percent-encoding", + "serde", ] +[[package]] +name = "utf8_iter" +version = "1.0.4" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b6c140620e7ffbb22c2dee59cafe6084a59b5ffc27a8859a5f0d494b5d52b6be" + [[package]] name = "uuid" -version = "1.3.0" +version = "1.23.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1674845326ee10d37ca60470760d4288a6f80f304007d92e5c53bab78c9cfd79" +checksum = "5ac8b6f42ead25368cf5b098aeb3dc8a1a2c05a3eee8a9a1a68c640edbfc79d9" dependencies = [ - "getrandom", - "serde", + "getrandom 0.4.2", + "js-sys", + "serde_core", + "wasm-bindgen", ] [[package]] name = "version_check" -version = "0.9.4" +version = "0.9.5" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "49874b5167b65d7193b8aba1567f5c7d93d001cafc34600cee003eda787e483f" +checksum = "0b928f33d975fc6ad9f86c8f283853ad26bdd5b10b7f1542aa2fa15e2289105a" [[package]] name = "walkdir" -version = "2.3.2" +version = "2.5.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "808cf2735cd4b6866113f648b791c6adc5714537bc222d9347bb203386ffda56" +checksum = "29790946404f91d9c5d06f9874efddea1dc06c5efe94541a7d6863108e3a5e4b" dependencies = [ "same-file", - "winapi", "winapi-util", ] [[package]] name = "want" -version = "0.3.0" +version = "0.3.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1ce8a968cb1cd110d136ff8b819a556d6fb6d919363c61534f6860c7eb172ba0" +checksum = 
"bfa7760aed19e106de2c7c0b581b509f2f25d3dacaf737cb82ac61bc6d760b0e" dependencies = [ - "log", "try-lock", ] [[package]] name = "wasi" -version = "0.10.0+wasi-snapshot-preview1" +version = "0.11.1+wasi-snapshot-preview1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1a143597ca7c7793eff794def352d41792a93c481eb1042423ff7ff72ba2c31f" +checksum = "ccf3ec651a847eb01de73ccad15eb7d99f80485de043efb2f370cd654f4ea44b" [[package]] -name = "wasi" -version = "0.11.0+wasi-snapshot-preview1" +name = "wasip2" +version = "1.0.2+wasi-0.2.9" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "9c8d87e72b64a3b4db28d11ce29237c246188f4f51057d65a7eab63b7987e423" +checksum = "9517f9239f02c069db75e65f174b3da828fe5f5b945c4dd26bd25d89c03ebcf5" +dependencies = [ + "wit-bindgen", +] [[package]] -name = "wasm-bindgen" -version = "0.2.84" +name = "wasip3" +version = "0.4.0+wasi-0.3.0-rc-2026-01-06" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "31f8dcbc21f30d9b8f2ea926ecb58f6b91192c17e9d33594b3df58b2007ca53b" +checksum = "5428f8bf88ea5ddc08faddef2ac4a67e390b88186c703ce6dbd955e1c145aca5" dependencies = [ - "cfg-if", - "wasm-bindgen-macro", + "wit-bindgen", ] [[package]] -name = "wasm-bindgen-backend" -version = "0.2.84" +name = "wasm-bindgen" +version = "0.2.114" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "95ce90fd5bcc06af55a641a86428ee4229e44e07033963a2290a8e241607ccb9" +checksum = "6532f9a5c1ece3798cb1c2cfdba640b9b3ba884f5db45973a6f442510a87d38e" dependencies = [ - "bumpalo", - "log", + "cfg-if", "once_cell", - "proc-macro2", - "quote", - "syn", + "rustversion", + "wasm-bindgen-macro", "wasm-bindgen-shared", ] [[package]] name = "wasm-bindgen-futures" -version = "0.4.34" +version = "0.4.64" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f219e0d211ba40266969f6dbdd90636da12f75bee4fc9d6c23d1260dadb51454" +checksum = 
"e9c5522b3a28661442748e09d40924dfb9ca614b21c00d3fd135720e48b67db8" dependencies = [ "cfg-if", + "futures-util", "js-sys", + "once_cell", "wasm-bindgen", "web-sys", ] [[package]] name = "wasm-bindgen-macro" -version = "0.2.84" +version = "0.2.114" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "4c21f77c0bedc37fd5dc21f897894a5ca01e7bb159884559461862ae90c0b4c5" +checksum = "18a2d50fcf105fb33bb15f00e7a77b772945a2ee45dcf454961fd843e74c18e6" dependencies = [ "quote", "wasm-bindgen-macro-support", @@ -3028,74 +4305,91 @@ dependencies = [ [[package]] name = "wasm-bindgen-macro-support" -version = "0.2.84" +version = "0.2.114" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2aff81306fcac3c7515ad4e177f521b5c9a15f2b08f4e32d823066102f35a5f6" +checksum = "03ce4caeaac547cdf713d280eda22a730824dd11e6b8c3ca9e42247b25c631e3" dependencies = [ + "bumpalo", "proc-macro2", "quote", - "syn", - "wasm-bindgen-backend", + "syn 2.0.117", "wasm-bindgen-shared", ] [[package]] name = "wasm-bindgen-shared" -version = "0.2.84" +version = "0.2.114" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "75a326b8c223ee17883a4251907455a2431acc2791c98c26279376490c378c16" +dependencies = [ + "unicode-ident", +] + +[[package]] +name = "wasm-encoder" +version = "0.244.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "0046fef7e28c3804e5e38bfa31ea2a0f73905319b677e57ebe37e49358989b5d" +checksum = "990065f2fe63003fe337b932cfb5e3b80e0b4d0f5ff650e6985b1048f62c8319" +dependencies = [ + "leb128fmt", + "wasmparser", +] [[package]] -name = "wasm-streams" -version = "0.2.3" +name = "wasm-metadata" +version = "0.244.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "6bbae3363c08332cadccd13b67db371814cd214c2524020932f0804b8cf7c078" +checksum = "bb0e353e6a2fbdc176932bbaab493762eb1255a7900fe0fea1a2f96c296cc909" dependencies = [ - "futures-util", - "js-sys", - "wasm-bindgen", - 
"wasm-bindgen-futures", - "web-sys", + "anyhow", + "indexmap", + "wasm-encoder", + "wasmparser", ] [[package]] -name = "web-sys" -version = "0.3.61" +name = "wasm-streams" +version = "0.4.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e33b99f4b23ba3eec1a53ac264e35a755f00e966e0065077d6027c0f575b0b97" +checksum = "15053d8d85c7eccdbefef60f06769760a563c7f0a9d6902a13d35c7800b0ad65" dependencies = [ + "futures-util", "js-sys", "wasm-bindgen", + "wasm-bindgen-futures", + "web-sys", ] [[package]] -name = "webpki" -version = "0.22.0" +name = "wasmparser" +version = "0.244.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f095d78192e208183081cc07bc5515ef55216397af48b873e5edcd72637fa1bd" +checksum = "47b807c72e1bac69382b3a6fb3dbe8ea4c0ed87ff5629b8685ae6b9a611028fe" dependencies = [ - "ring", - "untrusted", + "bitflags", + "hashbrown 0.15.5", + "indexmap", + "semver", ] [[package]] -name = "webpki-roots" -version = "0.22.6" +name = "web-sys" +version = "0.3.91" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "b6c71e40d7d2c34a5106301fb632274ca37242cd0c9d3e64dbece371a40a2d87" +checksum = "854ba17bb104abfb26ba36da9729addc7ce7f06f5c0f90f3c391f8461cca21f9" dependencies = [ - "webpki", + "js-sys", + "wasm-bindgen", ] [[package]] -name = "which" -version = "4.4.0" +name = "web-time" +version = "1.1.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2441c784c52b289a054b7201fc93253e288f094e2f4be9058343127c4226a269" +checksum = "5a6580f308b1fad9207618087a65c04e7a10bc77e02c8e84e9b00dd4b12fa0bb" dependencies = [ - "either", - "libc", - "once_cell", + "js-sys", + "wasm-bindgen", ] [[package]] @@ -3116,11 +4410,11 @@ checksum = "ac3b87c63620426dd9b991e5ce0329eff545bccbbb34f3be09ff6fb6ab51b7b6" [[package]] name = "winapi-util" -version = "0.1.5" +version = "0.1.11" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = 
"70ec6ce85bb158151cae5e5c87f95a8e97d2c0c4b001223f33a334e3ce5de178" +checksum = "c2a7b1c03c876122aa43f3020e6c3c3ee5c05081c9a00739faf7503aeba10d22" dependencies = [ - "winapi", + "windows-sys 0.61.2", ] [[package]] @@ -3129,171 +4423,463 @@ version = "0.4.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "712e227841d057c1ee1cd2fb22fa7e5a5461ae8e48fa2ca79ec42cfc1931183f" +[[package]] +name = "windows-core" +version = "0.62.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b8e83a14d34d0623b51dce9581199302a221863196a1dde71a7663a4c2be9deb" +dependencies = [ + "windows-implement", + "windows-interface", + "windows-link", + "windows-result", + "windows-strings", +] + +[[package]] +name = "windows-implement" +version = "0.60.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "053e2e040ab57b9dc951b72c264860db7eb3b0200ba345b4e4c3b14f67855ddf" +dependencies = [ + "proc-macro2", + "quote", + "syn 2.0.117", +] + +[[package]] +name = "windows-interface" +version = "0.59.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "3f316c4a2570ba26bbec722032c4099d8c8bc095efccdc15688708623367e358" +dependencies = [ + "proc-macro2", + "quote", + "syn 2.0.117", +] + +[[package]] +name = "windows-link" +version = "0.2.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "f0805222e57f7521d6a62e36fa9163bc891acd422f971defe97d64e70d0a4fe5" + +[[package]] +name = "windows-result" +version = "0.4.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7781fa89eaf60850ac3d2da7af8e5242a5ea78d1a11c49bf2910bb5a73853eb5" +dependencies = [ + "windows-link", +] + +[[package]] +name = "windows-strings" +version = "0.5.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7837d08f69c77cf6b07689544538e017c1bfcf57e34b4c0ff58e6c2cd3b37091" +dependencies = [ + "windows-link", +] + +[[package]] +name = 
"windows-sys" +version = "0.52.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "282be5f36a8ce781fad8c8ae18fa3f9beff57ec1b52cb3de0789201425d9a33d" +dependencies = [ + "windows-targets 0.52.6", +] + +[[package]] +name = "windows-sys" +version = "0.59.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "1e38bc4d79ed67fd075bcc251a1c39b32a1776bbe92e5bef1f0bf1f8c531853b" +dependencies = [ + "windows-targets 0.52.6", +] + [[package]] name = "windows-sys" -version = "0.42.0" +version = "0.60.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5a3e1820f08b8513f676f7ab6c1f99ff312fb97b553d30ff4dd86f9f15728aa7" +checksum = "f2f500e4d28234f72040990ec9d39e3a6b950f9f22d3dba18416c35882612bcb" dependencies = [ - "windows_aarch64_gnullvm", - "windows_aarch64_msvc", - "windows_i686_gnu", - "windows_i686_msvc", - "windows_x86_64_gnu", - "windows_x86_64_gnullvm", - "windows_x86_64_msvc", + "windows-targets 0.53.5", ] [[package]] name = "windows-sys" -version = "0.45.0" +version = "0.61.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ae137229bcbd6cdf0f7b80a31df61766145077ddf49416a728b02cb3921ff3fc" +dependencies = [ + "windows-link", +] + +[[package]] +name = "windows-targets" +version = "0.52.6" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "75283be5efb2831d37ea142365f009c02ec203cd29a3ebecbc093d52315b66d0" +checksum = "9b724f72796e036ab90c1021d4780d4d3d648aca59e491e6b98e725b84e99973" dependencies = [ - "windows-targets", + "windows_aarch64_gnullvm 0.52.6", + "windows_aarch64_msvc 0.52.6", + "windows_i686_gnu 0.52.6", + "windows_i686_gnullvm 0.52.6", + "windows_i686_msvc 0.52.6", + "windows_x86_64_gnu 0.52.6", + "windows_x86_64_gnullvm 0.52.6", + "windows_x86_64_msvc 0.52.6", ] [[package]] name = "windows-targets" -version = "0.42.2" +version = "0.53.5" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = 
"8e5180c00cd44c9b1c88adb3693291f1cd93605ded80c250a75d472756b4d071" +checksum = "4945f9f551b88e0d65f3db0bc25c33b8acea4d9e41163edf90dcd0b19f9069f3" dependencies = [ - "windows_aarch64_gnullvm", - "windows_aarch64_msvc", - "windows_i686_gnu", - "windows_i686_msvc", - "windows_x86_64_gnu", - "windows_x86_64_gnullvm", - "windows_x86_64_msvc", + "windows-link", + "windows_aarch64_gnullvm 0.53.1", + "windows_aarch64_msvc 0.53.1", + "windows_i686_gnu 0.53.1", + "windows_i686_gnullvm 0.53.1", + "windows_i686_msvc 0.53.1", + "windows_x86_64_gnu 0.53.1", + "windows_x86_64_gnullvm 0.53.1", + "windows_x86_64_msvc 0.53.1", ] [[package]] name = "windows_aarch64_gnullvm" -version = "0.42.2" +version = "0.52.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "32a4622180e7a0ec044bb555404c800bc9fd9ec262ec147edd5989ccd0c02cd3" + +[[package]] +name = "windows_aarch64_gnullvm" +version = "0.53.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "a9d8416fa8b42f5c947f8482c43e7d89e73a173cead56d044f6a56104a6d1b53" + +[[package]] +name = "windows_aarch64_msvc" +version = "0.52.6" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "597a5118570b68bc08d8d59125332c54f1ba9d9adeedeef5b99b02ba2b0698f8" +checksum = "09ec2a7bb152e2252b53fa7803150007879548bc709c039df7627cabbd05d469" [[package]] name = "windows_aarch64_msvc" -version = "0.42.2" +version = "0.53.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b9d782e804c2f632e395708e99a94275910eb9100b2114651e04744e9b125006" + +[[package]] +name = "windows_i686_gnu" +version = "0.52.6" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e08e8864a60f06ef0d0ff4ba04124db8b0fb3be5776a5cd47641e942e58c4d43" +checksum = "8e9b5ad5ab802e97eb8e295ac6720e509ee4c243f69d781394014ebfe8bbfa0b" [[package]] name = "windows_i686_gnu" -version = "0.42.2" +version = "0.53.1" +source = 
"registry+https://github.com/rust-lang/crates.io-index" +checksum = "960e6da069d81e09becb0ca57a65220ddff016ff2d6af6a223cf372a506593a3" + +[[package]] +name = "windows_i686_gnullvm" +version = "0.52.6" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "c61d927d8da41da96a81f029489353e68739737d3beca43145c8afec9a31a84f" +checksum = "0eee52d38c090b3caa76c563b86c3a4bd71ef1a819287c19d586d7334ae8ed66" + +[[package]] +name = "windows_i686_gnullvm" +version = "0.53.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "fa7359d10048f68ab8b09fa71c3daccfb0e9b559aed648a8f95469c27057180c" + +[[package]] +name = "windows_i686_msvc" +version = "0.52.6" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "240948bc05c5e7c6dabba28bf89d89ffce3e303022809e73deaefe4f6ec56c66" [[package]] name = "windows_i686_msvc" -version = "0.42.2" +version = "0.53.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "1e7ac75179f18232fe9c285163565a57ef8d3c89254a30685b57d83a38d326c2" + +[[package]] +name = "windows_x86_64_gnu" +version = "0.52.6" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "44d840b6ec649f480a41c8d80f9c65108b92d89345dd94027bfe06ac444d1060" +checksum = "147a5c80aabfbf0c7d901cb5895d1de30ef2907eb21fbbab29ca94c5b08b1a78" [[package]] name = "windows_x86_64_gnu" -version = "0.42.2" +version = "0.53.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9c3842cdd74a865a8066ab39c8a7a473c0778a3f29370b5fd6b4b9aa7df4a499" + +[[package]] +name = "windows_x86_64_gnullvm" +version = "0.52.6" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "8de912b8b8feb55c064867cf047dda097f92d51efad5b491dfb98f6bbb70cb36" +checksum = "24d5b23dc417412679681396f2b49f3de8c1473deb516bd34410872eff51ed0d" [[package]] name = "windows_x86_64_gnullvm" -version = "0.42.2" +version = "0.53.1" source = 
"registry+https://github.com/rust-lang/crates.io-index" -checksum = "26d41b46a36d453748aedef1486d5c7a85db22e56aff34643984ea85514e94a3" +checksum = "0ffa179e2d07eee8ad8f57493436566c7cc30ac536a3379fdf008f47f6bb7ae1" [[package]] name = "windows_x86_64_msvc" -version = "0.42.2" +version = "0.52.6" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "9aec5da331524158c6d1a4ac0ab1541149c0b9505fde06423b02f5ef0106b9f0" +checksum = "589f6da84c646204747d1270a2a5661ea66ed1cced2631d546fdfb155959f9ec" [[package]] -name = "winreg" -version = "0.10.1" +name = "windows_x86_64_msvc" +version = "0.53.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d6bbff5f0aada427a1e5a6da5f1f98158182f26556f345ac9e04d36d0ebed650" + +[[package]] +name = "wit-bindgen" +version = "0.51.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "80d0f4e272c85def139476380b12f9ac60926689dd2e01d4923222f40580869d" +checksum = "d7249219f66ced02969388cf2bb044a09756a083d0fab1e566056b04d9fbcaa5" dependencies = [ - "winapi", + "wit-bindgen-rust-macro", ] [[package]] -name = "xz2" -version = "0.1.7" +name = "wit-bindgen-core" +version = "0.51.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ea61de684c3ea68cb082b7a88508a8b27fcc8b797d738bfc99a82facf1d752dc" +dependencies = [ + "anyhow", + "heck", + "wit-parser", +] + +[[package]] +name = "wit-bindgen-rust" +version = "0.51.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b7c566e0f4b284dd6561c786d9cb0142da491f46a9fbed79ea69cdad5db17f21" +dependencies = [ + "anyhow", + "heck", + "indexmap", + "prettyplease", + "syn 2.0.117", + "wasm-metadata", + "wit-bindgen-core", + "wit-component", +] + +[[package]] +name = "wit-bindgen-rust-macro" +version = "0.51.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0c0f9bfd77e6a48eccf51359e3ae77140a7f50b1e2ebfe62422d8afdaffab17a" +dependencies = [ + "anyhow", 
+ "prettyplease", + "proc-macro2", + "quote", + "syn 2.0.117", + "wit-bindgen-core", + "wit-bindgen-rust", +] + +[[package]] +name = "wit-component" +version = "0.244.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9d66ea20e9553b30172b5e831994e35fbde2d165325bec84fc43dbf6f4eb9cb2" +dependencies = [ + "anyhow", + "bitflags", + "indexmap", + "log", + "serde", + "serde_derive", + "serde_json", + "wasm-encoder", + "wasm-metadata", + "wasmparser", + "wit-parser", +] + +[[package]] +name = "wit-parser" +version = "0.244.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "ecc8ac4bc1dc3381b7f59c34f00b67e18f910c2c0f50015669dde7def656a736" +dependencies = [ + "anyhow", + "id-arena", + "indexmap", + "log", + "semver", + "serde", + "serde_derive", + "serde_json", + "unicode-xid", + "wasmparser", +] + +[[package]] +name = "writeable" +version = "0.6.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9edde0db4769d2dc68579893f2306b26c6ecfbe0ef499b013d731b7b9247e0b9" + +[[package]] +name = "yoke" +version = "0.8.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "72d6e5c6afb84d73944e5cedb052c4680d5657337201555f9f2a16b7406d4954" +dependencies = [ + "stable_deref_trait", + "yoke-derive", + "zerofrom", +] + +[[package]] +name = "yoke-derive" +version = "0.8.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "388c44dc09d76f1536602ead6d325eb532f5c122f17782bd57fb47baeeb767e2" +checksum = "b659052874eb698efe5b9e8cf382204678a0086ebf46982b79d6ca3182927e5d" dependencies = [ - "lzma-sys", + "proc-macro2", + "quote", + "syn 2.0.117", + "synstructure", ] [[package]] name = "zerocopy" -version = "0.6.1" +version = "0.8.47" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "332f188cc1bcf1fe1064b8c58d150f497e697f49774aa846f2dc949d9a25f236" +checksum = "efbb2a062be311f2ba113ce66f697a4dc589f85e78a4aea276200804cea0ed87" 
dependencies = [ - "byteorder", "zerocopy-derive", ] [[package]] name = "zerocopy-derive" -version = "0.3.2" +version = "0.8.47" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "6505e6815af7de1746a08f69c69606bb45695a17149517680f3b2149713b19a3" +checksum = "0e8bc7269b54418e7aeeef514aa68f8690b8c0489a06b0136e5f57c4c5ccab89" dependencies = [ "proc-macro2", "quote", - "syn", + "syn 2.0.117", ] [[package]] -name = "zstd" -version = "0.11.2+zstd.1.5.2" +name = "zerofrom" +version = "0.1.6" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "20cc960326ece64f010d2d2107537f26dc589a6573a316bd5b1dba685fa5fde4" +checksum = "50cc42e0333e05660c3587f3bf9d0478688e15d870fab3346451ce7f8c9fbea5" dependencies = [ - "zstd-safe 5.0.2+zstd.1.5.2", + "zerofrom-derive", ] [[package]] -name = "zstd" -version = "0.12.3+zstd.1.5.2" +name = "zerofrom-derive" +version = "0.1.6" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "76eea132fb024e0e13fd9c2f5d5d595d8a967aa72382ac2f9d39fcc95afd0806" +checksum = "d71e5d6e06ab090c67b5e44993ec16b72dcbaabc526db883a360057678b48502" dependencies = [ - "zstd-safe 6.0.4+zstd.1.5.4", + "proc-macro2", + "quote", + "syn 2.0.117", + "synstructure", ] [[package]] -name = "zstd-safe" -version = "5.0.2+zstd.1.5.2" +name = "zeroize" +version = "1.8.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b97154e67e32c85465826e8bcc1c59429aaaf107c1e4a9e53c8d8ccd5eff88d0" + +[[package]] +name = "zerotrie" +version = "0.2.3" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1d2a5585e04f9eea4b2a3d1eca508c4dee9592a89ef6f450c11719da0726f4db" +checksum = "2a59c17a5562d507e4b54960e8569ebee33bee890c70aa3fe7b97e85a9fd7851" dependencies = [ - "libc", - "zstd-sys", + "displaydoc", + "yoke", + "zerofrom", +] + +[[package]] +name = "zerovec" +version = "0.11.5" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = 
"6c28719294829477f525be0186d13efa9a3c602f7ec202ca9e353d310fb9a002" +dependencies = [ + "yoke", + "zerofrom", + "zerovec-derive", +] + +[[package]] +name = "zerovec-derive" +version = "0.11.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "eadce39539ca5cb3985590102671f2567e659fca9666581ad3411d59207951f3" +dependencies = [ + "proc-macro2", + "quote", + "syn 2.0.117", +] + +[[package]] +name = "zlib-rs" +version = "0.6.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "3be3d40e40a133f9c916ee3f9f4fa2d9d63435b5fbe1bfc6d9dae0aa0ada1513" + +[[package]] +name = "zmij" +version = "1.0.21" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "b8848ee67ecc8aedbaf3e4122217aff892639231befc6a1b58d29fff4c2cabaa" + +[[package]] +name = "zstd" +version = "0.13.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "e91ee311a569c327171651566e07972200e76fcfe2242a4fa446149a3881c08a" +dependencies = [ + "zstd-safe", ] [[package]] name = "zstd-safe" -version = "6.0.4+zstd.1.5.4" +version = "7.2.4" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "7afb4b54b8910cf5447638cb54bf4e8a65cbedd783af98b98c62ffe91f185543" +checksum = "8f49c4d5f0abb602a93fb8736af2a4f4dd9512e36f7f570d66e65ff867ed3b9d" dependencies = [ - "libc", "zstd-sys", ] [[package]] name = "zstd-sys" -version = "2.0.7+zstd.1.5.4" +version = "2.0.16+zstd.1.5.7" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "94509c3ba2fe55294d752b79842c530ccfab760192521df74a081a78d2b3c7f5" +checksum = "91e19ebc2adc8f83e43039e79776e3fda8ca919132d68a1fed6a5faca2683748" dependencies = [ "cc", - "libc", "pkg-config", ] diff --git a/Cargo.toml b/Cargo.toml index 03dafefe8..14408d2bc 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -15,46 +15,58 @@ # specific language governing permissions and limitations # under the License. 
-[package] -name = "datafusion-python" -version = "20.0.0" -homepage = "https://github.com/apache/arrow-datafusion-python" -repository = "https://github.com/apache/arrow-datafusion-python" -authors = ["Apache Arrow "] -description = "Apache Arrow DataFusion DataFrame and SQL Query Engine" +[workspace.package] +version = "53.0.0" +homepage = "https://datafusion.apache.org/python" +repository = "https://github.com/apache/datafusion-python" +authors = ["Apache DataFusion "] +description = "Apache DataFusion DataFrame and SQL Query Engine" readme = "README.md" license = "Apache-2.0" -edition = "2021" -rust-version = "1.64" +edition = "2024" +rust-version = "1.88" -[features] -default = ["mimalloc"] +[workspace] +members = ["crates/core", "crates/util", "examples/datafusion-ffi-example"] +resolver = "3" -[dependencies] -tokio = { version = "1.24", features = ["macros", "rt", "rt-multi-thread", "sync"] } -rand = "0.8" -pyo3 = { version = "0.18.1", features = ["extension-module", "abi3", "abi3-py37"] } -datafusion = { version = "20.0.0", features = ["pyarrow", "avro"]} -datafusion-common = { version = "20.0.0", features = ["pyarrow"]} -datafusion-expr = { version = "20.0.0" } -datafusion-optimizer = { version = "20.0.0" } -datafusion-sql = { version = "20.0.0" } -datafusion-substrait = { version = "20.0.0" } -uuid = { version = "1.2", features = ["v4"] } -mimalloc = { version = "*", optional = true, default-features = false } -async-trait = "0.1" +[workspace.dependencies] +tokio = { version = "1.50" } +pyo3 = { version = "0.28" } +pyo3-async-runtimes = { version = "0.28" } +pyo3-log = "0.13.3" +arrow = { version = "58" } +arrow-array = { version = "58" } +arrow-schema = { version = "58" } +arrow-select = { version = "58" } +datafusion = { version = "53" } +datafusion-substrait = { version = "53" } +datafusion-proto = { version = "53" } +datafusion-ffi = { version = "53" } +datafusion-catalog = { version = "53", default-features = false } +datafusion-common = { version = 
"53", default-features = false } +datafusion-functions-aggregate = { version = "53" } +datafusion-functions-window = { version = "53" } +datafusion-expr = { version = "53" } +prost = "0.14.3" +serde_json = "1" +uuid = { version = "1.23" } +mimalloc = { version = "0.1", default-features = false } +async-trait = "0.1.89" futures = "0.3" -object_store = { version = "0.5.3", features = ["aws", "gcp", "azure"] } +cstr = "0.2" +object_store = { version = "0.13.1" } +url = "2" +log = "0.4.29" parking_lot = "0.12" -regex-syntax = "0.6.28" - -[lib] -name = "datafusion_python" -crate-type = ["cdylib", "rlib"] - -[package.metadata.maturin] -name = "datafusion._internal" +prost-types = "0.14.3" # keep in line with `datafusion-substrait` +pyo3-build-config = "0.28" +datafusion-python-util = { path = "crates/util", version = "53.0.0" } [profile.release] -lto = true -codegen-units = 1 +lto = "thin" +codegen-units = 2 + +# We cannot publish to crates.io with any patches in the below section. Developers +# must remove any entries in this section before creating a release candidate. 
+[patch.crates-io] diff --git a/README.md b/README.md index 7c29defdc..7849e7a02 100644 --- a/README.md +++ b/README.md @@ -19,12 +19,19 @@ # DataFusion in Python -[![Python test](https://github.com/apache/arrow-datafusion-python/actions/workflows/test.yaml/badge.svg)](https://github.com/apache/arrow-datafusion-python/actions/workflows/test.yaml) -[![Python Release Build](https://github.com/apache/arrow-datafusion-python/actions/workflows/build.yml/badge.svg)](https://github.com/apache/arrow-datafusion-python/actions/workflows/build.yml) +[![Python test](https://github.com/apache/datafusion-python/actions/workflows/test.yaml/badge.svg)](https://github.com/apache/datafusion-python/actions/workflows/test.yaml) +[![Python Release Build](https://github.com/apache/datafusion-python/actions/workflows/build.yml/badge.svg)](https://github.com/apache/datafusion-python/actions/workflows/build.yml) -This is a Python library that binds to [Apache Arrow](https://arrow.apache.org/) in-memory query engine [DataFusion](https://github.com/apache/arrow-datafusion). +This is a Python library that binds to [Apache Arrow](https://arrow.apache.org/) in-memory query engine [DataFusion](https://github.com/apache/datafusion). -DataFusion's Python bindings can be used as an end-user tool as well as providing a foundation for building new systems. +DataFusion's Python bindings can be used as a foundation for building new data systems in Python. Here are some examples: + +- [Dask SQL](https://github.com/dask-contrib/dask-sql) uses DataFusion's Python bindings for SQL parsing, query + planning, and logical plan optimizations, and then transpiles the logical plan to Dask operations for execution. +- [DataFusion Ballista](https://github.com/apache/datafusion-ballista) is a distributed SQL query engine that extends + DataFusion's Python bindings for distributed use cases. 
+- [DataFusion Ray](https://github.com/apache/datafusion-ray) is another distributed query engine that uses + DataFusion's Python bindings. ## Features @@ -35,19 +42,9 @@ DataFusion's Python bindings can be used as an end-user tool as well as providin - Serialize and deserialize query plans in Substrait format. - Experimental support for transpiling SQL queries to DataFrame calls with Polars, Pandas, and cuDF. -## Comparison with other projects - -Here is a comparison with similar projects that may help understand when DataFusion might be suitable and unsuitable -for your needs: - -- [DuckDB](http://www.duckdb.org/) is an open source, in-process analytic database. Like DataFusion, it supports - very fast execution, both from its custom file format and directly from Parquet files. Unlike DataFusion, it is - written in C/C++ and it is primarily used directly by users as a serverless database and query system rather than - as a library for building such database systems. - -- [Polars](http://pola.rs/) is one of the fastest DataFrame libraries at the time of writing. Like DataFusion, it - is also written in Rust and uses the Apache Arrow memory model, but unlike DataFusion it does not provide full SQL - support, nor as many extension points. +For tips on tuning parallelism, see +[Maximizing CPU Usage](docs/source/user-guide/configuration.rst#maximizing-cpu-usage) +in the configuration guide. ## Example Usage @@ -86,37 +83,115 @@ This produces the following chart: ![Chart](examples/chart.png) +## Registering a DataFrame as a View + +You can use SessionContext's `register_view` method to convert a DataFrame into a view and register it with the context. 
+ +```python +from datafusion import SessionContext, col, literal + +# Create a DataFusion context +ctx = SessionContext() + +# Create sample data +data = {"a": [1, 2, 3, 4, 5], "b": [10, 20, 30, 40, 50]} + +# Create a DataFrame from the dictionary +df = ctx.from_pydict(data, "my_table") + +# Filter the DataFrame (for example, keep rows where a > 2) +df_filtered = df.filter(col("a") > literal(2)) + +# Register the DataFrame as a view with the context +ctx.register_view("view1", df_filtered) + +# Now run a SQL query against the registered view +df_view = ctx.sql("SELECT * FROM view1") + +# Collect the results +results = df_view.collect() + +# Convert results to a list of dictionaries for display +result_dicts = [batch.to_pydict() for batch in results] + +print(result_dicts) +``` + +This will output: + +```python +[{'a': [3, 4, 5], 'b': [30, 40, 50]}] +``` + +## Configuration + +It is possible to configure both the runtime environment (memory and disk settings) and the session configuration when creating a context. + +```python +from datafusion import RuntimeEnvBuilder, SessionConfig, SessionContext + +runtime = ( + RuntimeEnvBuilder() + .with_disk_manager_os() + .with_fair_spill_pool(10000000) +) +config = ( + SessionConfig() + .with_create_default_catalog_and_schema(True) + .with_default_catalog_and_schema("foo", "bar") + .with_target_partitions(8) + .with_information_schema(True) + .with_repartition_joins(False) + .with_repartition_aggregations(False) + .with_repartition_windows(False) + .with_parquet_pruning(False) + .set("datafusion.execution.parquet.pushdown_filters", "true") +) +ctx = SessionContext(config, runtime) +``` + +Refer to the [API documentation](https://arrow.apache.org/datafusion-python/#api-reference) for more information. + +Printing the context will show the current configuration settings. + +```python +print(ctx) +``` + +## Extensions + +For information about how to extend DataFusion Python, please see the extensions page of the +[online documentation](https://datafusion.apache.org/python/). 
+ ## More Examples See [examples](examples/README.md) for more information. ### Executing Queries with DataFusion -- [Query a Parquet file using SQL](./examples/sql-parquet.py) -- [Query a Parquet file using the DataFrame API](./examples/dataframe-parquet.py) -- [Run a SQL query and store the results in a Pandas DataFrame](./examples/sql-to-pandas.py) -- [Run a SQL query with a Python user-defined function (UDF)](./examples/sql-using-python-udf.py) -- [Run a SQL query with a Python user-defined aggregation function (UDAF)](./examples/sql-using-python-udaf.py) -- [Query PyArrow Data](./examples/query-pyarrow-data.py) -- [Create dataframe](./examples/import.py) -- [Export dataframe](./examples/export.py) +- [Query a Parquet file using SQL](https://github.com/apache/datafusion-python/blob/main/examples/sql-parquet.py) +- [Query a Parquet file using the DataFrame API](https://github.com/apache/datafusion-python/blob/main/examples/dataframe-parquet.py) +- [Run a SQL query and store the results in a Pandas DataFrame](https://github.com/apache/datafusion-python/blob/main/examples/sql-to-pandas.py) +- [Run a SQL query with a Python user-defined function (UDF)](https://github.com/apache/datafusion-python/blob/main/examples/sql-using-python-udf.py) +- [Run a SQL query with a Python user-defined aggregation function (UDAF)](https://github.com/apache/datafusion-python/blob/main/examples/sql-using-python-udaf.py) +- [Query PyArrow Data](https://github.com/apache/datafusion-python/blob/main/examples/query-pyarrow-data.py) +- [Create dataframe](https://github.com/apache/datafusion-python/blob/main/examples/import.py) +- [Export dataframe](https://github.com/apache/datafusion-python/blob/main/examples/export.py) ### Running User-Defined Python Code -- [Register a Python UDF with DataFusion](./examples/python-udf.py) -- [Register a Python UDAF with DataFusion](./examples/python-udaf.py) +- [Register a Python UDF with 
DataFusion](https://github.com/apache/datafusion-python/blob/main/examples/python-udf.py) +- [Register a Python UDAF with DataFusion](https://github.com/apache/datafusion-python/blob/main/examples/python-udaf.py) ### Substrait Support -- [Serialize query plans using Substrait](./examples/substrait.py) +- [Serialize query plans using Substrait](https://github.com/apache/datafusion-python/blob/main/examples/substrait.py) -### Executing SQL against DataFrame Libraries (Experimental) +## How to install -- [Executing SQL on Polars](./examples/sql-on-polars.py) -- [Executing SQL on Pandas](./examples/sql-on-pandas.py) -- [Executing SQL on cuDF](./examples/sql-on-cudf.py) +### uv -## How to install (from pip) +```bash +uv add datafusion +``` ### Pip @@ -142,73 +217,132 @@ You can verify the installation by running: ## How to develop -This assumes that you have rust and cargo installed. We use the workflow recommended by [pyo3](https://github.com/PyO3/pyo3) and [maturin](https://github.com/PyO3/maturin). +This assumes that you have rust and cargo installed. We use the workflow recommended by [pyo3](https://github.com/PyO3/pyo3) and [maturin](https://github.com/PyO3/maturin). The Maturin tools used in this workflow can be installed either via `uv` or `pip`. Both approaches should offer the same experience. It is recommended to use `uv` since it has significant performance improvements +over `pip`. + +Currently for protobuf support either [protobuf](https://protobuf.dev/installation/) or cmake must be installed. -The Maturin tools used in this workflow can be installed either via Conda or Pip. Both approaches should offer the same experience. Multiple approaches are only offered to appease developer preference. Bootstrapping for both Conda and Pip are as follows. +Bootstrap (`uv`): -Bootstrap (Conda): +By default `uv` will attempt to build the datafusion python package. For our development we prefer to build manually. 
This means +that when creating your virtual environment using `uv sync` you need to pass in the additional `--no-install-package datafusion` +and for `uv run` commands the additional parameter `--no-project` ```bash # fetch this repo -git clone git@github.com:apache/arrow-datafusion-python.git -# create the conda environment for dev -conda env create -f ./conda/environments/datafusion-dev.yaml -n datafusion-dev -# activate the conda environment -conda activate datafusion-dev +git clone git@github.com:apache/datafusion-python.git +# cd to the repo root +cd datafusion-python/ +# create the virtual environment +uv sync --dev --no-install-package datafusion +# activate the environment +source .venv/bin/activate ``` -Bootstrap (Pip): +Bootstrap (`pip`): ```bash # fetch this repo -git clone git@github.com:apache/arrow-datafusion-python.git +git clone git@github.com:apache/datafusion-python.git +# cd to the repo root +cd datafusion-python/ # prepare development environment (used to build wheel / install in development) -python3 -m venv venv +python3 -m venv .venv # activate the venv -source venv/bin/activate +source .venv/bin/activate # update pip itself if necessary python -m pip install -U pip -# install dependencies (for Python 3.8+) -python -m pip install -r requirements-310.txt +# install dependencies +python -m pip install -r pyproject.toml ``` The tests rely on test data in git submodules. 
```bash -git submodule init -git submodule update +git submodule update --init ``` Whenever rust code changes (your changes or via `git pull`): ```bash # make sure you activate the venv using "source venv/bin/activate" first -maturin develop +maturin develop --uv python -m pytest ``` +Alternatively if you are using `uv` you can do the following without +needing to activate the virtual environment: + +```bash +uv run --no-project maturin develop --uv +uv run --no-project pytest +``` + +To run the FFI tests within the examples folder, after you have built +`datafusion-python` with the previous commands: + +```bash +cd examples/datafusion-ffi-example +uv run --no-project maturin develop --uv +uv run --no-project pytest python/tests/_test_*py +``` + ### Running & Installing pre-commit hooks -arrow-datafusion-python takes advantage of (pre-commit)[https://pre-commit.com/] to assist developers in with code linting to help reduce the number of commits that ultimately fail in CI due to linter errors. Using the pre-commit hooks is optional for the developer but certainly helpful for keep PRs clean and concise. +`datafusion-python` takes advantage of [pre-commit](https://pre-commit.com/) to assist developers with code linting to help reduce +the number of commits that ultimately fail in CI due to linter errors. Using the pre-commit hooks is optional for the +developer but certainly helpful for keeping PRs clean and concise. -Our pre-commit hooks can be installed by running `pre-commit install` which will install the configurations in your ARROW_DATAFUSION_PYTHON_ROOT/.github directory and run each time you perform a commit failing to perform the commit if an offending lint is found giving you the opportunity to make changes locally before pushing. 
+Our pre-commit hooks can be installed by running `pre-commit install`, which will install the configurations in +your DATAFUSION_PYTHON_ROOT/.github directory and run each time you perform a commit, failing to complete +the commit if an offending lint is found allowing you to make changes locally before pushing. -The pre-commit hooks can also be ran ad-hoc without installing them by simply running `pre-commit run --all-files` +The pre-commit hooks can also be run adhoc without installing them by simply running `pre-commit run --all-files`. -## How to update dependencies +NOTE: the current `pre-commit` hooks require docker, and cmake. See note on protobuf above. -To change test dependencies, change the `requirements.in` and run +## Running linters without using pre-commit -```bash -# install pip-tools (this can be done only once), also consider running in venv -python -m pip install pip-tools -python -m piptools compile --generate-hashes -o requirements-310.txt +There are scripts in `ci/scripts` for running Rust and Python linters. + +```shell +./ci/scripts/python_lint.sh +./ci/scripts/rust_clippy.sh +./ci/scripts/rust_fmt.sh +./ci/scripts/rust_toml_fmt.sh ``` -To update dependencies, run with `-U` +## Checking Upstream DataFusion Coverage -```bash -python -m piptools compile -U --generate-hashes -o requirements-310.txt +This project includes an [AI agent skill](.ai/skills/check-upstream/SKILL.md) for auditing which +features from the upstream Apache DataFusion Rust library are not yet exposed in these Python +bindings. This is useful when adding missing functions, auditing API coverage, or ensuring parity +with upstream. + +The skill accepts an optional area argument: + +``` +scalar functions +aggregate functions +window functions +dataframe +session context +ffi types +all ``` -More details [here](https://github.com/jazzband/pip-tools) +If no argument is provided, it defaults to checking all areas. 
The skill will fetch the upstream +DataFusion documentation, compare it against the functions and methods exposed in this project, and +produce a coverage report listing what is currently exposed and what is missing. + +The skill definition lives in `.ai/skills/check-upstream/SKILL.md` and follows the +[Agent Skills](https://agentskills.io) open standard. It can be used by any AI coding agent that +supports skill discovery, or followed manually. + +## How to update dependencies + +To change test dependencies, edit `pyproject.toml` and run + +```bash +uv sync --dev --no-install-package datafusion +``` diff --git a/benchmarks/db-benchmark/README.md b/benchmarks/db-benchmark/README.md new file mode 100644 index 000000000..8ce45344d --- /dev/null +++ b/benchmarks/db-benchmark/README.md @@ -0,0 +1,32 @@ + + +# DataFusion Implementation of db-benchmark + +This directory contains scripts for running [db-benchmark](https://github.com/duckdblabs/db-benchmark) with +DataFusion's Python bindings. + +## Directions + +Run the following from the root of this project. + +```bash +docker build -t db-benchmark -f benchmarks/db-benchmark/db-benchmark.dockerfile . +docker run --privileged -it db-benchmark +``` diff --git a/benchmarks/db-benchmark/db-benchmark.dockerfile b/benchmarks/db-benchmark/db-benchmark.dockerfile new file mode 100644 index 000000000..af2edd0f4 --- /dev/null +++ b/benchmarks/db-benchmark/db-benchmark.dockerfile @@ -0,0 +1,120 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. 
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +FROM ubuntu:22.04 +ARG DEBIAN_FRONTEND=noninteractive +ARG TARGETPLATFORM + +# This section is based on https://github.com/duckdblabs/db-benchmark/blob/master/_utils/repro.sh + +RUN apt-get -qq update +RUN apt-get -qq -y upgrade +RUN apt-get -qq install -y apt-utils + +RUN apt-get -qq install -y lsb-release software-properties-common wget curl vim htop git byobu libcurl4-openssl-dev libssl-dev +RUN apt-get -qq install -y libfreetype6-dev +RUN apt-get -qq install -y libfribidi-dev +RUN apt-get -qq install -y libharfbuzz-dev +RUN apt-get -qq install -y git +RUN apt-get -qq install -y libxml2-dev +RUN apt-get -qq install -y make +RUN apt-get -qq install -y libfontconfig1-dev +RUN apt-get -qq install -y libicu-dev pandoc zlib1g-dev libgit2-dev libcurl4-openssl-dev libssl-dev libjpeg-dev libpng-dev libtiff-dev +# apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9 +RUN add-apt-repository "deb [arch=amd64,i386] https://cloud.r-project.org/bin/linux/ubuntu $(lsb_release -cs)-cran40/" + +RUN apt-get -qq install -y r-base-dev virtualenv + +RUN cd /usr/local/lib/R && \ + chmod o+w site-library + +RUN cd / && \ + git clone https://github.com/duckdblabs/db-benchmark.git + +WORKDIR /db-benchmark + +RUN mkdir -p .R && \ + echo 'CFLAGS=-O3 -mtune=native' >> .R/Makevars && \ + echo 'CXXFLAGS=-O3 -mtune=native' >> .R/Makevars + +RUN cd pydatatable && \ + virtualenv py-pydatatable --python=/usr/bin/python3.10 +RUN cd pandas && \ + virtualenv py-pandas --python=/usr/bin/python3.10 +RUN cd modin && \ + virtualenv py-modin 
--python=/usr/bin/python3.10 + +RUN Rscript -e 'install.packages(c("jsonlite","bit64","devtools","rmarkdown"), dependencies=TRUE, repos="https://cloud.r-project.org")' + +SHELL ["/bin/bash", "-c"] + +RUN source ./pandas/py-pandas/bin/activate && \ + python3 -m pip install --upgrade psutil && \ + python3 -m pip install --upgrade pandas && \ + deactivate + +RUN source ./modin/py-modin/bin/activate && \ + python3 -m pip install --upgrade modin && \ + deactivate + +RUN source ./pydatatable/py-pydatatable/bin/activate && \ + python3 -m pip install --upgrade git+https://github.com/h2oai/datatable && \ + deactivate + +## install dplyr +#RUN Rscript -e 'devtools::install_github(c("tidyverse/readr","tidyverse/dplyr"))' + +# install data.table +RUN Rscript -e 'install.packages("data.table", repos="https://rdatatable.gitlab.io/data.table/")' + +## generate data for groupby 0.5GB +RUN Rscript _data/groupby-datagen.R 1e7 1e2 0 0 +RUN #Rscript _data/groupby-datagen.R 1e8 1e2 0 0 +RUN #Rscript _data/groupby-datagen.R 1e9 1e2 0 0 + +RUN mkdir data && \ + mv G1_1e7_1e2_0_0.csv data/ + +# set only groupby task +RUN echo "Changing run.conf and _control/data.csv to run only groupby at 0.5GB" && \ + cp run.conf run.conf.original && \ + sed -i 's/groupby join groupby2014/groupby/g' run.conf && \ + sed -i 's/data.table dplyr pandas pydatatable spark dask clickhouse polars arrow duckdb/data.table dplyr duckdb/g' run.conf && \ + sed -i 's/DO_PUBLISH=true/DO_PUBLISH=false/g' run.conf + +## set sizes +RUN mv _control/data.csv _control/data.csv.original && \ + echo "task,data,nrow,k,na,sort,active" > _control/data.csv && \ + echo "groupby,G1_1e7_1e2_0_0,1e7,1e2,0,0,1" >> _control/data.csv + +RUN #./dplyr/setup-dplyr.sh +RUN #./datatable/setup-datatable.sh +RUN #./duckdb/setup-duckdb.sh + +# END OF SETUP + +RUN python3 -m pip install --upgrade pandas +RUN python3 -m pip install --upgrade polars psutil +RUN python3 -m pip install --upgrade datafusion + +# Now add our solution +RUN rm -rf 
datafusion-python 2>/dev/null && \ + mkdir datafusion-python +ADD benchmarks/db-benchmark/*.py datafusion-python/ +ADD benchmarks/db-benchmark/run-bench.sh . + +ENTRYPOINT [ "/db-benchmark/run-bench.sh" ] \ No newline at end of file diff --git a/benchmarks/db-benchmark/groupby-datafusion.py b/benchmarks/db-benchmark/groupby-datafusion.py new file mode 100644 index 000000000..533166695 --- /dev/null +++ b/benchmarks/db-benchmark/groupby-datafusion.py @@ -0,0 +1,527 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+
+import gc
+import os
+import timeit
+from pathlib import Path
+
+import datafusion as df
+import pyarrow as pa
+from datafusion import (
+    RuntimeEnvBuilder,
+    SessionConfig,
+    SessionContext,
+    col,
+)
+from datafusion import (
+    functions as f,
+)
+from pyarrow import csv as pacsv
+
+print("# groupby-datafusion.py", flush=True)
+
+exec(Path("./_helpers/helpers.py").read_text())
+
+
+def ans_shape(batches) -> tuple[int, int]:
+    rows, cols = 0, 0
+    for batch in batches:
+        rows += batch.num_rows
+        if cols == 0:
+            cols = batch.num_columns
+        else:
+            assert cols == batch.num_columns
+    return rows, cols
+
+
+def execute(df) -> list:
+    print(df.execution_plan().display_indent())
+    return df.collect()
+
+
+ver = df.__version__
+git = ""
+task = "groupby"
+solution = "datafusion"
+fun = ".groupby"
+cache = "TRUE"
+on_disk = "FALSE"
+
+# experimental - support running with both DataFrame and SQL APIs
+sql = True
+
+data_name = os.environ["SRC_DATANAME"]
+src_grp = f"data/{data_name}.csv"
+print("loading dataset %s" % src_grp, flush=True)
+
+schema = pa.schema(
+    [
+        ("id4", pa.int32()),
+        ("id5", pa.int32()),
+        ("id6", pa.int32()),
+        ("v1", pa.int32()),
+        ("v2", pa.int32()),
+        ("v3", pa.float64()),
+    ]
+)
+
+data = pacsv.read_csv(
+    src_grp,
+    convert_options=pacsv.ConvertOptions(auto_dict_encode=True, column_types=schema),
+)
+print("dataset loaded")
+
+# create a session context with explicit runtime and config settings
+runtime = (
+    RuntimeEnvBuilder()
+    .with_disk_manager_os()
+    .with_fair_spill_pool(64 * 1024 * 1024 * 1024)
+)
+config = (
+    SessionConfig()
+    .with_repartition_joins(enabled=False)
+    .with_repartition_aggregations(enabled=False)
+    .set("datafusion.execution.coalesce_batches", "false")
+)
+ctx = SessionContext(config, runtime)
+print(ctx)
+
+ctx.register_record_batches("x", [data.to_batches()])
+print("registered record batches")
+# cols = ctx.sql("SHOW columns from x")
+# ans.show()
+
+in_rows = data.num_rows
+# print(in_rows, flush=True)
+
+task_init = timeit.default_timer() + +question = "sum v1 by id1" # q1 +gc.collect() +t_start = timeit.default_timer() +if sql: + df = ctx.sql("SELECT id1, SUM(v1) AS v1 FROM x GROUP BY id1") +else: + df = ctx.table("x").aggregate([f.col("id1")], [f.sum(f.col("v1")).alias("v1")]) +ans = execute(df) + +shape = ans_shape(ans) +print(shape, flush=True) +t = timeit.default_timer() - t_start +print(f"q1: {t}") +m = memory_usage() +t_start = timeit.default_timer() +df = ctx.create_dataframe([ans]) +chk = df.aggregate([], [f.sum(col("v1"))]).collect()[0].column(0)[0] +chkt = timeit.default_timer() - t_start +write_log( + task=task, + data=data_name, + in_rows=in_rows, + question=question, + out_rows=shape[0], + out_cols=shape[1], + solution=solution, + version=ver, + git=git, + fun=fun, + run=1, + time_sec=t, + mem_gb=m, + cache=cache, + chk=make_chk([chk]), + chk_time_sec=chkt, + on_disk=on_disk, +) +del ans +gc.collect() + +question = "sum v1 by id1:id2" # q2 +gc.collect() +t_start = timeit.default_timer() +if sql: + df = ctx.sql("SELECT id1, id2, SUM(v1) AS v1 FROM x GROUP BY id1, id2") +else: + df = ctx.table("x").aggregate( + [f.col("id1"), f.col("id2")], [f.sum(f.col("v1")).alias("v1")] + ) +ans = execute(df) +shape = ans_shape(ans) +print(shape, flush=True) +t = timeit.default_timer() - t_start +print(f"q2: {t}") +m = memory_usage() +t_start = timeit.default_timer() +df = ctx.create_dataframe([ans]) +chk = df.aggregate([], [f.sum(col("v1"))]).collect()[0].column(0)[0] +chkt = timeit.default_timer() - t_start +write_log( + task=task, + data=data_name, + in_rows=in_rows, + question=question, + out_rows=shape[0], + out_cols=shape[1], + solution=solution, + version=ver, + git=git, + fun=fun, + run=1, + time_sec=t, + mem_gb=m, + cache=cache, + chk=make_chk([chk]), + chk_time_sec=chkt, + on_disk=on_disk, +) +del ans +gc.collect() + +question = "sum v1 mean v3 by id3" # q3 +gc.collect() +t_start = timeit.default_timer() +if sql: + df = ctx.sql("SELECT id3, SUM(v1) AS v1, 
AVG(v3) AS v3 FROM x GROUP BY id3") +else: + df = ctx.table("x").aggregate( + [f.col("id3")], + [ + f.sum(f.col("v1")).alias("v1"), + f.avg(f.col("v3")).alias("v3"), + ], + ) +ans = execute(df) +shape = ans_shape(ans) +print(shape, flush=True) +t = timeit.default_timer() - t_start +print(f"q3: {t}") +m = memory_usage() +t_start = timeit.default_timer() +df = ctx.create_dataframe([ans]) +chk = ( + df.aggregate([], [f.sum(col("v1")), f.sum(col("v3"))]) + .collect()[0] + .to_pandas() + .to_numpy()[0] +) +chkt = timeit.default_timer() - t_start +write_log( + task=task, + data=data_name, + in_rows=in_rows, + question=question, + out_rows=shape[0], + out_cols=shape[1], + solution=solution, + version=ver, + git=git, + fun=fun, + run=1, + time_sec=t, + mem_gb=m, + cache=cache, + chk=make_chk([chk]), + chk_time_sec=chkt, + on_disk=on_disk, +) +del ans +gc.collect() + +question = "mean v1:v3 by id4" # q4 +gc.collect() +t_start = timeit.default_timer() +ans = ctx.sql( + "SELECT id4, AVG(v1) AS v1, AVG(v2) AS v2, AVG(v3) AS v3 FROM x GROUP BY id4" +).collect() +shape = ans_shape(ans) +print(shape, flush=True) +t = timeit.default_timer() - t_start +print(f"q4: {t}") +m = memory_usage() +t_start = timeit.default_timer() +df = ctx.create_dataframe([ans]) +chk = ( + df.aggregate([], [f.sum(col("v1")), f.sum(col("v2")), f.sum(col("v3"))]) + .collect()[0] + .to_pandas() + .to_numpy()[0] +) +chkt = timeit.default_timer() - t_start +write_log( + task=task, + data=data_name, + in_rows=in_rows, + question=question, + out_rows=shape[0], + out_cols=shape[1], + solution=solution, + version=ver, + git=git, + fun=fun, + run=1, + time_sec=t, + mem_gb=m, + cache=cache, + chk=make_chk([chk]), + chk_time_sec=chkt, + on_disk=on_disk, +) +del ans +gc.collect() + +question = "sum v1:v3 by id6" # q5 +gc.collect() +t_start = timeit.default_timer() +ans = ctx.sql( + "SELECT id6, SUM(v1) AS v1, SUM(v2) AS v2, SUM(v3) AS v3 FROM x GROUP BY id6" +).collect() +shape = ans_shape(ans) +print(shape, 
flush=True) +t = timeit.default_timer() - t_start +print(f"q5: {t}") +m = memory_usage() +t_start = timeit.default_timer() +df = ctx.create_dataframe([ans]) +chk = ( + df.aggregate([], [f.sum(col("v1")), f.sum(col("v2")), f.sum(col("v3"))]) + .collect()[0] + .to_pandas() + .to_numpy()[0] +) +chkt = timeit.default_timer() - t_start +write_log( + task=task, + data=data_name, + in_rows=in_rows, + question=question, + out_rows=shape[0], + out_cols=shape[1], + solution=solution, + version=ver, + git=git, + fun=fun, + run=1, + time_sec=t, + mem_gb=m, + cache=cache, + chk=make_chk([chk]), + chk_time_sec=chkt, + on_disk=on_disk, +) +del ans +gc.collect() + +question = "median v3 sd v3 by id4 id5" # q6 +gc.collect() +t_start = timeit.default_timer() +ans = ctx.sql( + "SELECT id4, id5, approx_percentile_cont(v3, .5) AS median_v3, stddev(v3) AS stddev_v3 FROM x GROUP BY id4, id5" +).collect() +shape = ans_shape(ans) +print(shape, flush=True) +t = timeit.default_timer() - t_start +print(f"q6: {t}") +m = memory_usage() +t_start = timeit.default_timer() +df = ctx.create_dataframe([ans]) +chk = ( + df.aggregate([], [f.sum(col("median_v3")), f.sum(col("stddev_v3"))]) + .collect()[0] + .to_pandas() + .to_numpy()[0] +) +chkt = timeit.default_timer() - t_start +write_log( + task=task, + data=data_name, + in_rows=in_rows, + question=question, + out_rows=shape[0], + out_cols=shape[1], + solution=solution, + version=ver, + git=git, + fun=fun, + run=1, + time_sec=t, + mem_gb=m, + cache=cache, + chk=make_chk([chk]), + chk_time_sec=chkt, + on_disk=on_disk, +) +del ans +gc.collect() + +question = "max v1 - min v2 by id3" # q7 +gc.collect() +t_start = timeit.default_timer() +ans = ctx.sql( + "SELECT id3, MAX(v1) - MIN(v2) AS range_v1_v2 FROM x GROUP BY id3" +).collect() +shape = ans_shape(ans) +print(shape, flush=True) +t = timeit.default_timer() - t_start +print(f"q7: {t}") +m = memory_usage() +t_start = timeit.default_timer() +df = ctx.create_dataframe([ans]) +chk = df.aggregate([], 
[f.sum(col("range_v1_v2"))]).collect()[0].column(0)[0] +chkt = timeit.default_timer() - t_start +write_log( + task=task, + data=data_name, + in_rows=in_rows, + question=question, + out_rows=shape[0], + out_cols=shape[1], + solution=solution, + version=ver, + git=git, + fun=fun, + run=1, + time_sec=t, + mem_gb=m, + cache=cache, + chk=make_chk([chk]), + chk_time_sec=chkt, + on_disk=on_disk, +) +del ans +gc.collect() + +question = "largest two v3 by id6" # q8 +gc.collect() +t_start = timeit.default_timer() +ans = ctx.sql( + "SELECT id6, v3 from (SELECT id6, v3, row_number() OVER (PARTITION BY id6 ORDER BY v3 DESC) AS row FROM x) t WHERE row <= 2" +).collect() +shape = ans_shape(ans) +print(shape, flush=True) +t = timeit.default_timer() - t_start +print(f"q8: {t}") +m = memory_usage() +t_start = timeit.default_timer() +df = ctx.create_dataframe([ans]) +chk = df.aggregate([], [f.sum(col("v3"))]).collect()[0].column(0)[0] +chkt = timeit.default_timer() - t_start +write_log( + task=task, + data=data_name, + in_rows=in_rows, + question=question, + out_rows=shape[0], + out_cols=shape[1], + solution=solution, + version=ver, + git=git, + fun=fun, + run=1, + time_sec=t, + mem_gb=m, + cache=cache, + chk=make_chk([chk]), + chk_time_sec=chkt, + on_disk=on_disk, +) +del ans +gc.collect() + +question = "regression v1 v2 by id2 id4" # q9 +gc.collect() +t_start = timeit.default_timer() +ans = ctx.sql("SELECT corr(v1, v2) as corr FROM x GROUP BY id2, id4").collect() +shape = ans_shape(ans) +print(shape, flush=True) +t = timeit.default_timer() - t_start +print(f"q9: {t}") +m = memory_usage() +t_start = timeit.default_timer() +df = ctx.create_dataframe([ans]) +chk = df.aggregate([], [f.sum(col("corr"))]).collect()[0].column(0)[0] +chkt = timeit.default_timer() - t_start +write_log( + task=task, + data=data_name, + in_rows=in_rows, + question=question, + out_rows=shape[0], + out_cols=shape[1], + solution=solution, + version=ver, + git=git, + fun=fun, + run=1, + time_sec=t, + mem_gb=m, + 
cache=cache, + chk=make_chk([chk]), + chk_time_sec=chkt, + on_disk=on_disk, +) +del ans +gc.collect() + +question = "sum v3 count by id1:id6" # q10 +gc.collect() +t_start = timeit.default_timer() +ans = ctx.sql( + "SELECT id1, id2, id3, id4, id5, id6, SUM(v3) as v3, COUNT(*) AS cnt FROM x GROUP BY id1, id2, id3, id4, id5, id6" +).collect() +shape = ans_shape(ans) +print(shape, flush=True) +t = timeit.default_timer() - t_start +print(f"q10: {t}") +m = memory_usage() +t_start = timeit.default_timer() +df = ctx.create_dataframe([ans]) +chk = ( + df.aggregate([], [f.sum(col("v3")), f.sum(col("cnt"))]) + .collect()[0] + .to_pandas() + .to_numpy()[0] +) +chkt = timeit.default_timer() - t_start +write_log( + task=task, + data=data_name, + in_rows=in_rows, + question=question, + out_rows=shape[0], + out_cols=shape[1], + solution=solution, + version=ver, + git=git, + fun=fun, + run=1, + time_sec=t, + mem_gb=m, + cache=cache, + chk=make_chk([chk]), + chk_time_sec=chkt, + on_disk=on_disk, +) +del ans +gc.collect() + +print( + "grouping finished, took %0.fs" % (timeit.default_timer() - task_init), + flush=True, +) + +exit(0) diff --git a/benchmarks/db-benchmark/join-datafusion.py b/benchmarks/db-benchmark/join-datafusion.py new file mode 100755 index 000000000..3be296c81 --- /dev/null +++ b/benchmarks/db-benchmark/join-datafusion.py @@ -0,0 +1,299 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. 
You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+import gc
+import os
+import timeit
+from pathlib import Path
+
+import datafusion as df
+from datafusion import col
+from datafusion import functions as f
+from pyarrow import csv as pacsv
+
+print("# join-datafusion.py", flush=True)
+
+exec(Path("./_helpers/helpers.py").read_text())
+
+
+def ans_shape(batches) -> tuple[int, int]:
+    rows, cols = 0, 0
+    for batch in batches:
+        rows += batch.num_rows
+        if cols == 0:
+            cols = batch.num_columns
+        else:
+            assert cols == batch.num_columns
+    return rows, cols
+
+
+ver = df.__version__
+task = "join"
+git = ""
+solution = "datafusion"
+fun = ".join"
+cache = "TRUE"
+on_disk = "FALSE"
+
+data_name = os.environ["SRC_DATANAME"]
+src_jn_x = f"data/{data_name}.csv"
+y_data_name = join_to_tbls(data_name)
+src_jn_y = [
+    f"data/{y_data_name[0]}.csv",
+    f"data/{y_data_name[1]}.csv",
+    f"data/{y_data_name[2]}.csv",
+]
+if len(src_jn_y) != 3:
+    error_msg = "Something went wrong in preparing files used for join"
+    raise Exception(error_msg)
+
+print(
+    "loading datasets "
+    + data_name
+    + ", "
+    + y_data_name[0]
+    + ", "
+    + y_data_name[1]
+    + ", "
+    + y_data_name[2],
+    flush=True,
+)
+
+ctx = df.SessionContext()
+print(ctx)
+
+# TODO we should be applying projections to these table reads to create relations
+# of different sizes
+
+x_data = pacsv.read_csv(
+    src_jn_x, convert_options=pacsv.ConvertOptions(auto_dict_encode=True)
+)
+ctx.register_record_batches("x", [x_data.to_batches()])
+small_data = pacsv.read_csv(
+    src_jn_y[0], convert_options=pacsv.ConvertOptions(auto_dict_encode=True)
+)
+ctx.register_record_batches("small", [small_data.to_batches()]) +medium_data = pacsv.read_csv( + src_jn_y[1], convert_options=pacsv.ConvertOptions(auto_dict_encode=True) +) +ctx.register_record_batches("medium", [medium_data.to_batches()]) +large_data = pacsv.read_csv( + src_jn_y[2], convert_options=pacsv.ConvertOptions(auto_dict_encode=True) +) +ctx.register_record_batches("large", [large_data.to_batches()]) + +print(x_data.num_rows, flush=True) +print(small_data.num_rows, flush=True) +print(medium_data.num_rows, flush=True) +print(large_data.num_rows, flush=True) + +task_init = timeit.default_timer() +print("joining...", flush=True) + +question = "small inner on int" # q1 +gc.collect() +t_start = timeit.default_timer() +ans = ctx.sql( + "SELECT x.id1, x.id2, x.id3, x.id4 as xid4, small.id4 as smallid4, x.id5, x.id6, x.v1, small.v2 FROM x INNER JOIN small ON x.id1 = small.id1" +).collect() +# ans = ctx.sql("SELECT * FROM x INNER JOIN small ON x.id1 = small.id1").collect() +# print(set([b.schema for b in ans])) +shape = ans_shape(ans) +# print(shape, flush=True) +t = timeit.default_timer() - t_start +print(f"q1: {t}") +t_start = timeit.default_timer() +df = ctx.create_dataframe([ans]) +chk = df.aggregate([], [f.sum(col("v1"))]).collect()[0].column(0)[0] +chkt = timeit.default_timer() - t_start +m = memory_usage() +write_log( + task=task, + data=data_name, + in_rows=x_data.num_rows, + question=question, + out_rows=shape[0], + out_cols=shape[1], + solution=solution, + version=ver, + git=git, + fun=fun, + run=1, + time_sec=t, + mem_gb=m, + cache=cache, + chk=make_chk([chk]), + chk_time_sec=chkt, + on_disk=on_disk, +) +del ans +gc.collect() + +question = "medium inner on int" # q2 +gc.collect() +t_start = timeit.default_timer() +ans = ctx.sql( + "SELECT x.id1 as xid1, medium.id1 as mediumid1, x.id2, x.id3, x.id4 as xid4, medium.id4 as mediumid4, x.id5 as xid5, medium.id5 as mediumid5, x.id6, x.v1, medium.v2 FROM x INNER JOIN medium ON x.id2 = medium.id2" +).collect() 
+shape = ans_shape(ans) +# print(shape, flush=True) +t = timeit.default_timer() - t_start +print(f"q2: {t}") +t_start = timeit.default_timer() +df = ctx.create_dataframe([ans]) +chk = df.aggregate([], [f.sum(col("v1")), f.sum(col("v2"))]).collect()[0].column(0)[0] +chkt = timeit.default_timer() - t_start +m = memory_usage() +write_log( + task=task, + data=data_name, + in_rows=x_data.num_rows, + question=question, + out_rows=shape[0], + out_cols=shape[1], + solution=solution, + version=ver, + git=git, + fun=fun, + run=1, + time_sec=t, + mem_gb=m, + cache=cache, + chk=make_chk([chk]), + chk_time_sec=chkt, + on_disk=on_disk, +) +del ans +gc.collect() + +question = "medium outer on int" # q3 +gc.collect() +t_start = timeit.default_timer() +ans = ctx.sql( + "SELECT x.id1 as xid1, medium.id1 as mediumid1, x.id2, x.id3, x.id4 as xid4, medium.id4 as mediumid4, x.id5 as xid5, medium.id5 as mediumid5, x.id6, x.v1, medium.v2 FROM x LEFT JOIN medium ON x.id2 = medium.id2" +).collect() +shape = ans_shape(ans) +# print(shape, flush=True) +t = timeit.default_timer() - t_start +print(f"q3: {t}") +t_start = timeit.default_timer() +df = ctx.create_dataframe([ans]) +chk = df.aggregate([], [f.sum(col("v1")), f.sum(col("v2"))]).collect()[0].column(0)[0] +chkt = timeit.default_timer() - t_start +m = memory_usage() +write_log( + task=task, + data=data_name, + in_rows=x_data.num_rows, + question=question, + out_rows=shape[0], + out_cols=shape[1], + solution=solution, + version=ver, + git=git, + fun=fun, + run=1, + time_sec=t, + mem_gb=m, + cache=cache, + chk=make_chk([chk]), + chk_time_sec=chkt, + on_disk=on_disk, +) +del ans +gc.collect() + +question = "medium inner on factor" # q4 +gc.collect() +t_start = timeit.default_timer() +ans = ctx.sql( + "SELECT x.id1 as xid1, medium.id1 as mediumid1, x.id2, x.id3, x.id4 as xid4, medium.id4 as mediumid4, x.id5 as xid5, medium.id5 as mediumid5, x.id6, x.v1, medium.v2 FROM x LEFT JOIN medium ON x.id5 = medium.id5" +).collect() +shape = 
ans_shape(ans) +# print(shape) +t = timeit.default_timer() - t_start +print(f"q4: {t}") +t_start = timeit.default_timer() +df = ctx.create_dataframe([ans]) +chk = df.aggregate([], [f.sum(col("v1")), f.sum(col("v2"))]).collect()[0].column(0)[0] +chkt = timeit.default_timer() - t_start +m = memory_usage() +write_log( + task=task, + data=data_name, + in_rows=x_data.num_rows, + question=question, + out_rows=shape[0], + out_cols=shape[1], + solution=solution, + version=ver, + git=git, + fun=fun, + run=1, + time_sec=t, + mem_gb=m, + cache=cache, + chk=make_chk([chk]), + chk_time_sec=chkt, + on_disk=on_disk, +) +del ans +gc.collect() + +question = "big inner on int" # q5 +gc.collect() +t_start = timeit.default_timer() +ans = ctx.sql( + "SELECT x.id1 as xid1, large.id1 as largeid1, x.id2 as xid2, large.id2 as largeid2, x.id3, x.id4 as xid4, large.id4 as largeid4, x.id5 as xid5, large.id5 as largeid5, x.id6 as xid6, large.id6 as largeid6, x.v1, large.v2 FROM x LEFT JOIN large ON x.id3 = large.id3" +).collect() +shape = ans_shape(ans) +# print(shape) +t = timeit.default_timer() - t_start +print(f"q5: {t}") +t_start = timeit.default_timer() +df = ctx.create_dataframe([ans]) +chk = df.aggregate([], [f.sum(col("v1")), f.sum(col("v2"))]).collect()[0].column(0)[0] +chkt = timeit.default_timer() - t_start +m = memory_usage() +write_log( + task=task, + data=data_name, + in_rows=x_data.num_rows, + question=question, + out_rows=shape[0], + out_cols=shape[1], + solution=solution, + version=ver, + git=git, + fun=fun, + run=1, + time_sec=t, + mem_gb=m, + cache=cache, + chk=make_chk([chk]), + chk_time_sec=chkt, + on_disk=on_disk, +) +del ans +gc.collect() + +print( + "joining finished, took %0.fs" % (timeit.default_timer() - task_init), + flush=True, +) + +exit(0) diff --git a/benchmarks/db-benchmark/run-bench.sh b/benchmarks/db-benchmark/run-bench.sh new file mode 100755 index 000000000..36a6087d9 --- /dev/null +++ b/benchmarks/db-benchmark/run-bench.sh @@ -0,0 +1,27 @@ +#!/bin/bash +# 
Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +set -e + +#SRC_DATANAME=G1_1e7_1e2_0_0 python3 /db-benchmark/polars/groupby-polars.py +SRC_DATANAME=G1_1e7_1e2_0_0 python3 /db-benchmark/datafusion-python/groupby-datafusion.py + +# joins need more work still +#SRC_DATANAME=G1_1e7_1e2_0_0 python3 /db-benchmark/datafusion-python/join-datafusion.py +#SRC_DATANAME=G1_1e7_1e2_0_0 python3 /db-benchmark/polars/join-polars.py + +cat time.csv diff --git a/benchmarks/max_cpu_usage.py b/benchmarks/max_cpu_usage.py new file mode 100644 index 000000000..ae73baad6 --- /dev/null +++ b/benchmarks/max_cpu_usage.py @@ -0,0 +1,107 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. 
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +"""Benchmark script showing how to maximize CPU usage. + +This script demonstrates one example of tuning DataFusion for improved parallelism +and CPU utilization. It uses synthetic in-memory data and performs simple aggregation +operations to showcase the impact of partitioning configuration. + +IMPORTANT: This is a simplified example designed to illustrate partitioning concepts. +Actual performance in your applications may vary significantly based on many factors: + +- Type of table providers (Parquet files, CSV, databases, etc.) +- I/O operations and storage characteristics (local disk, network, cloud storage) +- Query complexity and operation types (joins, window functions, complex expressions) +- Data distribution and size characteristics +- Memory available and hardware specifications +- Network latency for distributed data sources + +It is strongly recommended that you create similar benchmarks tailored to your specific: +- Hardware configuration +- Data sources and formats +- Typical query patterns and workloads +- Performance requirements + +This will give you more accurate insights into how DataFusion configuration options +will affect your particular use case. +""" + +from __future__ import annotations + +import argparse +import multiprocessing +import time + +import pyarrow as pa +from datafusion import SessionConfig, SessionContext, col +from datafusion import functions as f + + +def main(num_rows: int, partitions: int) -> None: + """Run a simple aggregation after repartitioning. 
+ + This function demonstrates basic partitioning concepts using synthetic data. + Real-world performance will depend on your specific data sources, query types, + and system configuration. + """ + # Create some example data (synthetic in-memory data for demonstration) + # Note: Real applications typically work with files, databases, or other + # data sources that have different I/O and distribution characteristics + array = pa.array(range(num_rows)) + batch = pa.record_batch([array], names=["a"]) + + # Configure the session to use a higher target partition count and + # enable automatic repartitioning. + config = ( + SessionConfig() + .with_target_partitions(partitions) + .with_repartition_joins(enabled=True) + .with_repartition_aggregations(enabled=True) + .with_repartition_windows(enabled=True) + ) + ctx = SessionContext(config) + + # Register the input data and repartition manually to ensure that all + # partitions are used. + df = ctx.create_dataframe([[batch]]).repartition(partitions) + + start = time.time() + df = df.aggregate([], [f.sum(col("a"))]) + df.collect() + end = time.time() + + print( + f"Processed {num_rows} rows using {partitions} partitions in {end - start:.3f}s" + ) + + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument( + "--rows", + type=int, + default=1_000_000, + help="Number of rows in the generated dataset", + ) + parser.add_argument( + "--partitions", + type=int, + default=multiprocessing.cpu_count(), + help="Target number of partitions to use", + ) + args = parser.parse_args() + main(args.rows, args.partitions) diff --git a/benchmarks/tpch/.gitignore b/benchmarks/tpch/.gitignore new file mode 100644 index 000000000..4471c6d15 --- /dev/null +++ b/benchmarks/tpch/.gitignore @@ -0,0 +1,2 @@ +data +results.csv \ No newline at end of file diff --git a/benchmarks/tpch/README.md b/benchmarks/tpch/README.md new file mode 100644 index 000000000..a118a7449 --- /dev/null +++ 
b/benchmarks/tpch/README.md
@@ -0,0 +1,78 @@
+
+
+# DataFusion Python Benchmarks Derived from TPC-H
+
+## Create Release Build
+
+From the repo root:
+
+```bash
+maturin develop --release
+```
+
+Note that release builds take a long time, so you may want to temporarily comment out this section of the
+root `Cargo.toml` when building frequently.
+
+```toml
+[profile.release]
+lto = true
+codegen-units = 1
+```
+
+## Generate Data
+
+```bash
+./tpch-gen.sh 1
+```
+
+## Run Benchmarks
+
+```bash
+python tpch.py ./data ./queries
+```
+
+A summary of the benchmark timings will be written to `results.csv`. For example:
+
+```csv
+setup,1.4
+q1,2978.6
+q2,679.7
+q3,2943.7
+q4,2894.9
+q5,3592.3
+q6,1691.4
+q7,3003.9
+q8,3818.7
+q9,4237.9
+q10,2344.7
+q11,526.1
+q12,2284.6
+q13,1009.2
+q14,1738.4
+q15,1942.1
+q16,499.8
+q17,5178.9
+q18,4127.7
+q19,2056.6
+q20,2162.5
+q21,8046.5
+q22,754.9
+total,58513.2
+```
\ No newline at end of file
diff --git a/benchmarks/tpch/create_tables.sql b/benchmarks/tpch/create_tables.sql
new file mode 100644
index 000000000..9f3aeea20
--- /dev/null
+++ b/benchmarks/tpch/create_tables.sql
@@ -0,0 +1,143 @@
+-- Schema derived from TPC-H schema under the terms of the TPC Fair Use Policy.
+-- TPC-H queries are Copyright 1993-2022 Transaction Processing Performance Council.
+ +CREATE EXTERNAL TABLE customer ( + c_custkey INT NOT NULL, + c_name VARCHAR NOT NULL, + c_address VARCHAR NOT NULL, + c_nationkey INT NOT NULL, + c_phone VARCHAR NOT NULL, + c_acctbal DECIMAL(15, 2) NOT NULL, + c_mktsegment VARCHAR NOT NULL, + c_comment VARCHAR NOT NULL, + c_extra VARCHAR NOT NULL, +) +STORED AS CSV +OPTIONS ( + format.delimiter '|', + format.has_header true +) +LOCATION '$PATH/customer.csv'; + +CREATE EXTERNAL TABLE lineitem ( + l_orderkey INT NOT NULL, + l_partkey INT NOT NULL, + l_suppkey INT NOT NULL, + l_linenumber INT NOT NULL, + l_quantity DECIMAL(15, 2) NOT NULL, + l_extendedprice DECIMAL(15, 2) NOT NULL, + l_discount DECIMAL(15, 2) NOT NULL, + l_tax DECIMAL(15, 2) NOT NULL, + l_returnflag VARCHAR NOT NULL, + l_linestatus VARCHAR NOT NULL, + l_shipdate DATE NOT NULL, + l_commitdate DATE NOT NULL, + l_receiptdate DATE NOT NULL, + l_shipinstruct VARCHAR NOT NULL, + l_shipmode VARCHAR NOT NULL, + l_comment VARCHAR NOT NULL, + l_extra VARCHAR NOT NULL, +) +STORED AS CSV +OPTIONS ( + format.delimiter '|', + format.has_header true +) +LOCATION '$PATH/lineitem.csv'; + +CREATE EXTERNAL TABLE nation ( + n_nationkey INT NOT NULL, + n_name VARCHAR NOT NULL, + n_regionkey INT NOT NULL, + n_comment VARCHAR NOT NULL, + n_extra VARCHAR NOT NULL, +) +STORED AS CSV +OPTIONS ( + format.delimiter '|', + format.has_header true +) +LOCATION '$PATH/nation.csv'; + +CREATE EXTERNAL TABLE orders ( + o_orderkey INT NOT NULL, + o_custkey INT NOT NULL, + o_orderstatus VARCHAR NOT NULL, + o_totalprice DECIMAL(15, 2) NOT NULL, + o_orderdate DATE NOT NULL, + o_orderpriority VARCHAR NOT NULL, + o_clerk VARCHAR NOT NULL, + o_shippriority INT NULL, + o_comment VARCHAR NOT NULL, + o_extra VARCHAR NOT NULL, +) +STORED AS CSV +OPTIONS ( + format.delimiter '|', + format.has_header true +) +LOCATION '$PATH/orders.csv'; + +CREATE EXTERNAL TABLE part ( + p_partkey INT NOT NULL, + p_name VARCHAR NOT NULL, + p_mfgr VARCHAR NOT NULL, + p_brand VARCHAR NOT NULL, + p_type VARCHAR 
NOT NULL, + p_size INT NULL, + p_container VARCHAR NOT NULL, + p_retailprice DECIMAL(15, 2) NOT NULL, + p_comment VARCHAR NOT NULL, + p_extra VARCHAR NOT NULL, +) +STORED AS CSV +OPTIONS ( + format.delimiter '|', + format.has_header true +) +LOCATION '$PATH/part.csv'; + +CREATE EXTERNAL TABLE partsupp ( + ps_partkey INT NOT NULL, + ps_suppkey INT NOT NULL, + ps_availqty INT NOT NULL, + ps_supplycost DECIMAL(15, 2) NOT NULL, + ps_comment VARCHAR NOT NULL, + ps_extra VARCHAR NOT NULL, +) +STORED AS CSV +OPTIONS ( + format.delimiter '|', + format.has_header true +) +LOCATION '$PATH/partsupp.csv'; + +CREATE EXTERNAL TABLE region ( + r_regionkey INT NOT NULL, + r_name VARCHAR NOT NULL, + r_comment VARCHAR NOT NULL, + r_extra VARCHAR NOT NULL, +) +STORED AS CSV +OPTIONS ( + format.delimiter '|', + format.has_header true +) +LOCATION '$PATH/region.csv'; + +CREATE EXTERNAL TABLE supplier ( + s_suppkey INT NOT NULL, + s_name VARCHAR NOT NULL, + s_address VARCHAR NOT NULL, + s_nationkey INT NOT NULL, + s_phone VARCHAR NOT NULL, + s_acctbal DECIMAL(15, 2) NOT NULL, + s_comment VARCHAR NOT NULL, + s_extra VARCHAR NOT NULL, +) +STORED AS CSV +OPTIONS ( + format.delimiter '|', + format.has_header true +) +LOCATION '$PATH/supplier.csv'; \ No newline at end of file diff --git a/benchmarks/tpch/queries/q1.sql b/benchmarks/tpch/queries/q1.sql new file mode 100644 index 000000000..e7e8e32b8 --- /dev/null +++ b/benchmarks/tpch/queries/q1.sql @@ -0,0 +1,23 @@ +-- Benchmark Query 1 derived from TPC-H query 1 under the terms of the TPC Fair Use Policy. +-- TPC-H queries are Copyright 1993-2022 Transaction Processing Performance Council. 
+select + l_returnflag, + l_linestatus, + sum(l_quantity) as sum_qty, + sum(l_extendedprice) as sum_base_price, + sum(l_extendedprice * (1 - l_discount)) as sum_disc_price, + sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge, + avg(l_quantity) as avg_qty, + avg(l_extendedprice) as avg_price, + avg(l_discount) as avg_disc, + count(*) as count_order +from + lineitem +where + l_shipdate <= date '1998-12-01' - interval '68 days' +group by + l_returnflag, + l_linestatus +order by + l_returnflag, + l_linestatus; diff --git a/benchmarks/tpch/queries/q10.sql b/benchmarks/tpch/queries/q10.sql new file mode 100644 index 000000000..8391f6277 --- /dev/null +++ b/benchmarks/tpch/queries/q10.sql @@ -0,0 +1,33 @@ +-- Benchmark Query 10 derived from TPC-H query 10 under the terms of the TPC Fair Use Policy. +-- TPC-H queries are Copyright 1993-2022 Transaction Processing Performance Council. +select + c_custkey, + c_name, + sum(l_extendedprice * (1 - l_discount)) as revenue, + c_acctbal, + n_name, + c_address, + c_phone, + c_comment +from + customer, + orders, + lineitem, + nation +where + c_custkey = o_custkey + and l_orderkey = o_orderkey + and o_orderdate >= date '1993-07-01' + and o_orderdate < date '1993-07-01' + interval '3' month + and l_returnflag = 'R' + and c_nationkey = n_nationkey +group by + c_custkey, + c_name, + c_acctbal, + c_phone, + n_name, + c_address, + c_comment +order by + revenue desc limit 20; diff --git a/benchmarks/tpch/queries/q11.sql b/benchmarks/tpch/queries/q11.sql new file mode 100644 index 000000000..58776d369 --- /dev/null +++ b/benchmarks/tpch/queries/q11.sql @@ -0,0 +1,29 @@ +-- Benchmark Query 11 derived from TPC-H query 11 under the terms of the TPC Fair Use Policy. +-- TPC-H queries are Copyright 1993-2022 Transaction Processing Performance Council. 
+select + ps_partkey, + sum(ps_supplycost * ps_availqty) as value +from + partsupp, + supplier, + nation +where + ps_suppkey = s_suppkey + and s_nationkey = n_nationkey + and n_name = 'ALGERIA' +group by + ps_partkey having + sum(ps_supplycost * ps_availqty) > ( + select + sum(ps_supplycost * ps_availqty) * 0.0001000000 + from + partsupp, + supplier, + nation + where + ps_suppkey = s_suppkey + and s_nationkey = n_nationkey + and n_name = 'ALGERIA' + ) +order by + value desc; diff --git a/benchmarks/tpch/queries/q12.sql b/benchmarks/tpch/queries/q12.sql new file mode 100644 index 000000000..0b973de98 --- /dev/null +++ b/benchmarks/tpch/queries/q12.sql @@ -0,0 +1,30 @@ +-- Benchmark Query 12 derived from TPC-H query 12 under the terms of the TPC Fair Use Policy. +-- TPC-H queries are Copyright 1993-2022 Transaction Processing Performance Council. +select + l_shipmode, + sum(case + when o_orderpriority = '1-URGENT' + or o_orderpriority = '2-HIGH' + then 1 + else 0 + end) as high_line_count, + sum(case + when o_orderpriority <> '1-URGENT' + and o_orderpriority <> '2-HIGH' + then 1 + else 0 + end) as low_line_count +from + orders, + lineitem +where + o_orderkey = l_orderkey + and l_shipmode in ('FOB', 'SHIP') + and l_commitdate < l_receiptdate + and l_shipdate < l_commitdate + and l_receiptdate >= date '1995-01-01' + and l_receiptdate < date '1995-01-01' + interval '1' year +group by + l_shipmode +order by + l_shipmode; diff --git a/benchmarks/tpch/queries/q13.sql b/benchmarks/tpch/queries/q13.sql new file mode 100644 index 000000000..145dd6f10 --- /dev/null +++ b/benchmarks/tpch/queries/q13.sql @@ -0,0 +1,22 @@ +-- Benchmark Query 13 derived from TPC-H query 13 under the terms of the TPC Fair Use Policy. +-- TPC-H queries are Copyright 1993-2022 Transaction Processing Performance Council. 
+select + c_count, + count(*) as custdist +from + ( + select + c_custkey, + count(o_orderkey) + from + customer left outer join orders on + c_custkey = o_custkey + and o_comment not like '%express%requests%' + group by + c_custkey + ) as c_orders (c_custkey, c_count) +group by + c_count +order by + custdist desc, + c_count desc; diff --git a/benchmarks/tpch/queries/q14.sql b/benchmarks/tpch/queries/q14.sql new file mode 100644 index 000000000..1a91a04df --- /dev/null +++ b/benchmarks/tpch/queries/q14.sql @@ -0,0 +1,15 @@ +-- Benchmark Query 14 derived from TPC-H query 14 under the terms of the TPC Fair Use Policy. +-- TPC-H queries are Copyright 1993-2022 Transaction Processing Performance Council. +select + 100.00 * sum(case + when p_type like 'PROMO%' + then l_extendedprice * (1 - l_discount) + else 0 + end) / sum(l_extendedprice * (1 - l_discount)) as promo_revenue +from + lineitem, + part +where + l_partkey = p_partkey + and l_shipdate >= date '1995-02-01' + and l_shipdate < date '1995-02-01' + interval '1' month; diff --git a/benchmarks/tpch/queries/q15.sql b/benchmarks/tpch/queries/q15.sql new file mode 100644 index 000000000..68cc32cb7 --- /dev/null +++ b/benchmarks/tpch/queries/q15.sql @@ -0,0 +1,33 @@ +-- Benchmark Query 15 derived from TPC-H query 15 under the terms of the TPC Fair Use Policy. +-- TPC-H queries are Copyright 1993-2022 Transaction Processing Performance Council. 
+create view revenue0 (supplier_no, total_revenue) as + select + l_suppkey, + sum(l_extendedprice * (1 - l_discount)) + from + lineitem + where + l_shipdate >= date '1996-08-01' + and l_shipdate < date '1996-08-01' + interval '3' month + group by + l_suppkey; +select + s_suppkey, + s_name, + s_address, + s_phone, + total_revenue +from + supplier, + revenue0 +where + s_suppkey = supplier_no + and total_revenue = ( + select + max(total_revenue) + from + revenue0 + ) +order by + s_suppkey; +drop view revenue0; diff --git a/benchmarks/tpch/queries/q16.sql b/benchmarks/tpch/queries/q16.sql new file mode 100644 index 000000000..098b4f3b3 --- /dev/null +++ b/benchmarks/tpch/queries/q16.sql @@ -0,0 +1,32 @@ +-- Benchmark Query 16 derived from TPC-H query 16 under the terms of the TPC Fair Use Policy. +-- TPC-H queries are Copyright 1993-2022 Transaction Processing Performance Council. +select + p_brand, + p_type, + p_size, + count(distinct ps_suppkey) as supplier_cnt +from + partsupp, + part +where + p_partkey = ps_partkey + and p_brand <> 'Brand#14' + and p_type not like 'SMALL PLATED%' + and p_size in (14, 6, 5, 31, 49, 15, 41, 47) + and ps_suppkey not in ( + select + s_suppkey + from + supplier + where + s_comment like '%Customer%Complaints%' + ) +group by + p_brand, + p_type, + p_size +order by + supplier_cnt desc, + p_brand, + p_type, + p_size; diff --git a/benchmarks/tpch/queries/q17.sql b/benchmarks/tpch/queries/q17.sql new file mode 100644 index 000000000..ed02d7b77 --- /dev/null +++ b/benchmarks/tpch/queries/q17.sql @@ -0,0 +1,19 @@ +-- Benchmark Query 17 derived from TPC-H query 17 under the terms of the TPC Fair Use Policy. +-- TPC-H queries are Copyright 1993-2022 Transaction Processing Performance Council. 
+select + sum(l_extendedprice) / 7.0 as avg_yearly +from + lineitem, + part +where + p_partkey = l_partkey + and p_brand = 'Brand#42' + and p_container = 'LG BAG' + and l_quantity < ( + select + 0.2 * avg(l_quantity) + from + lineitem + where + l_partkey = p_partkey + ); diff --git a/benchmarks/tpch/queries/q18.sql b/benchmarks/tpch/queries/q18.sql new file mode 100644 index 000000000..cf1f8c89a --- /dev/null +++ b/benchmarks/tpch/queries/q18.sql @@ -0,0 +1,34 @@ +-- Benchmark Query 18 derived from TPC-H query 18 under the terms of the TPC Fair Use Policy. +-- TPC-H queries are Copyright 1993-2022 Transaction Processing Performance Council. +select + c_name, + c_custkey, + o_orderkey, + o_orderdate, + o_totalprice, + sum(l_quantity) +from + customer, + orders, + lineitem +where + o_orderkey in ( + select + l_orderkey + from + lineitem + group by + l_orderkey having + sum(l_quantity) > 313 + ) + and c_custkey = o_custkey + and o_orderkey = l_orderkey +group by + c_name, + c_custkey, + o_orderkey, + o_orderdate, + o_totalprice +order by + o_totalprice desc, + o_orderdate limit 100; diff --git a/benchmarks/tpch/queries/q19.sql b/benchmarks/tpch/queries/q19.sql new file mode 100644 index 000000000..3968f0d24 --- /dev/null +++ b/benchmarks/tpch/queries/q19.sql @@ -0,0 +1,37 @@ +-- Benchmark Query 19 derived from TPC-H query 19 under the terms of the TPC Fair Use Policy. +-- TPC-H queries are Copyright 1993-2022 Transaction Processing Performance Council. 
+select + sum(l_extendedprice* (1 - l_discount)) as revenue +from + lineitem, + part +where + ( + p_partkey = l_partkey + and p_brand = 'Brand#21' + and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG') + and l_quantity >= 8 and l_quantity <= 8 + 10 + and p_size between 1 and 5 + and l_shipmode in ('AIR', 'AIR REG') + and l_shipinstruct = 'DELIVER IN PERSON' + ) + or + ( + p_partkey = l_partkey + and p_brand = 'Brand#13' + and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK') + and l_quantity >= 20 and l_quantity <= 20 + 10 + and p_size between 1 and 10 + and l_shipmode in ('AIR', 'AIR REG') + and l_shipinstruct = 'DELIVER IN PERSON' + ) + or + ( + p_partkey = l_partkey + and p_brand = 'Brand#52' + and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG') + and l_quantity >= 30 and l_quantity <= 30 + 10 + and p_size between 1 and 15 + and l_shipmode in ('AIR', 'AIR REG') + and l_shipinstruct = 'DELIVER IN PERSON' + ); diff --git a/benchmarks/tpch/queries/q2.sql b/benchmarks/tpch/queries/q2.sql new file mode 100644 index 000000000..46ec5d239 --- /dev/null +++ b/benchmarks/tpch/queries/q2.sql @@ -0,0 +1,45 @@ +-- Benchmark Query 2 derived from TPC-H query 2 under the terms of the TPC Fair Use Policy. +-- TPC-H queries are Copyright 1993-2022 Transaction Processing Performance Council. 
+select + s_acctbal, + s_name, + n_name, + p_partkey, + p_mfgr, + s_address, + s_phone, + s_comment +from + part, + supplier, + partsupp, + nation, + region +where + p_partkey = ps_partkey + and s_suppkey = ps_suppkey + and p_size = 48 + and p_type like '%TIN' + and s_nationkey = n_nationkey + and n_regionkey = r_regionkey + and r_name = 'ASIA' + and ps_supplycost = ( + select + min(ps_supplycost) + from + partsupp, + supplier, + nation, + region + where + p_partkey = ps_partkey + and s_suppkey = ps_suppkey + and s_nationkey = n_nationkey + and n_regionkey = r_regionkey + and r_name = 'ASIA' + ) +order by + s_acctbal desc, + n_name, + s_name, + p_partkey limit 100; diff --git a/benchmarks/tpch/queries/q20.sql b/benchmarks/tpch/queries/q20.sql new file mode 100644 index 000000000..5bb16563b --- /dev/null +++ b/benchmarks/tpch/queries/q20.sql @@ -0,0 +1,39 @@ +-- Benchmark Query 20 derived from TPC-H query 20 under the terms of the TPC Fair Use Policy. +-- TPC-H queries are Copyright 1993-2022 Transaction Processing Performance Council. +select + s_name, + s_address +from + supplier, + nation +where + s_suppkey in ( + select + ps_suppkey + from + partsupp + where + ps_partkey in ( + select + p_partkey + from + part + where + p_name like 'blanched%' + ) + and ps_availqty > ( + select + 0.5 * sum(l_quantity) + from + lineitem + where + l_partkey = ps_partkey + and l_suppkey = ps_suppkey + and l_shipdate >= date '1993-01-01' + and l_shipdate < date '1993-01-01' + interval '1' year + ) + ) + and s_nationkey = n_nationkey + and n_name = 'KENYA' +order by + s_name; diff --git a/benchmarks/tpch/queries/q21.sql b/benchmarks/tpch/queries/q21.sql new file mode 100644 index 000000000..6f84b876e --- /dev/null +++ b/benchmarks/tpch/queries/q21.sql @@ -0,0 +1,41 @@ +-- Benchmark Query 21 derived from TPC-H query 21 under the terms of the TPC Fair Use Policy. +-- TPC-H queries are Copyright 1993-2022 Transaction Processing Performance Council. 
+select + s_name, + count(*) as numwait +from + supplier, + lineitem l1, + orders, + nation +where + s_suppkey = l1.l_suppkey + and o_orderkey = l1.l_orderkey + and o_orderstatus = 'F' + and l1.l_receiptdate > l1.l_commitdate + and exists ( + select + * + from + lineitem l2 + where + l2.l_orderkey = l1.l_orderkey + and l2.l_suppkey <> l1.l_suppkey + ) + and not exists ( + select + * + from + lineitem l3 + where + l3.l_orderkey = l1.l_orderkey + and l3.l_suppkey <> l1.l_suppkey + and l3.l_receiptdate > l3.l_commitdate + ) + and s_nationkey = n_nationkey + and n_name = 'ARGENTINA' +group by + s_name +order by + numwait desc, + s_name limit 100; diff --git a/benchmarks/tpch/queries/q22.sql b/benchmarks/tpch/queries/q22.sql new file mode 100644 index 000000000..65ea49b04 --- /dev/null +++ b/benchmarks/tpch/queries/q22.sql @@ -0,0 +1,39 @@ +-- Benchmark Query 22 derived from TPC-H query 22 under the terms of the TPC Fair Use Policy. +-- TPC-H queries are Copyright 1993-2022 Transaction Processing Performance Council. +select + cntrycode, + count(*) as numcust, + sum(c_acctbal) as totacctbal +from + ( + select + substring(c_phone from 1 for 2) as cntrycode, + c_acctbal + from + customer + where + substring(c_phone from 1 for 2) in + ('24', '34', '16', '30', '33', '14', '13') + and c_acctbal > ( + select + avg(c_acctbal) + from + customer + where + c_acctbal > 0.00 + and substring(c_phone from 1 for 2) in + ('24', '34', '16', '30', '33', '14', '13') + ) + and not exists ( + select + * + from + orders + where + o_custkey = c_custkey + ) + ) as custsale +group by + cntrycode +order by + cntrycode; diff --git a/benchmarks/tpch/queries/q3.sql b/benchmarks/tpch/queries/q3.sql new file mode 100644 index 000000000..161f2e1e4 --- /dev/null +++ b/benchmarks/tpch/queries/q3.sql @@ -0,0 +1,24 @@ +-- Benchmark Query 3 derived from TPC-H query 3 under the terms of the TPC Fair Use Policy. +-- TPC-H queries are Copyright 1993-2022 Transaction Processing Performance Council. 
+select + l_orderkey, + sum(l_extendedprice * (1 - l_discount)) as revenue, + o_orderdate, + o_shippriority +from + customer, + orders, + lineitem +where + c_mktsegment = 'BUILDING' + and c_custkey = o_custkey + and l_orderkey = o_orderkey + and o_orderdate < date '1995-03-15' + and l_shipdate > date '1995-03-15' +group by + l_orderkey, + o_orderdate, + o_shippriority +order by + revenue desc, + o_orderdate limit 10; diff --git a/benchmarks/tpch/queries/q4.sql b/benchmarks/tpch/queries/q4.sql new file mode 100644 index 000000000..e444dbfce --- /dev/null +++ b/benchmarks/tpch/queries/q4.sql @@ -0,0 +1,23 @@ +-- Benchmark Query 4 derived from TPC-H query 4 under the terms of the TPC Fair Use Policy. +-- TPC-H queries are Copyright 1993-2022 Transaction Processing Performance Council. +select + o_orderpriority, + count(*) as order_count +from + orders +where + o_orderdate >= date '1995-04-01' + and o_orderdate < date '1995-04-01' + interval '3' month + and exists ( + select + * + from + lineitem + where + l_orderkey = o_orderkey + and l_commitdate < l_receiptdate + ) +group by + o_orderpriority +order by + o_orderpriority; diff --git a/benchmarks/tpch/queries/q5.sql b/benchmarks/tpch/queries/q5.sql new file mode 100644 index 000000000..4426bd245 --- /dev/null +++ b/benchmarks/tpch/queries/q5.sql @@ -0,0 +1,26 @@ +-- Benchmark Query 5 derived from TPC-H query 5 under the terms of the TPC Fair Use Policy. +-- TPC-H queries are Copyright 1993-2022 Transaction Processing Performance Council. 
+select + n_name, + sum(l_extendedprice * (1 - l_discount)) as revenue +from + customer, + orders, + lineitem, + supplier, + nation, + region +where + c_custkey = o_custkey + and l_orderkey = o_orderkey + and l_suppkey = s_suppkey + and c_nationkey = s_nationkey + and s_nationkey = n_nationkey + and n_regionkey = r_regionkey + and r_name = 'AFRICA' + and o_orderdate >= date '1994-01-01' + and o_orderdate < date '1994-01-01' + interval '1' year +group by + n_name +order by + revenue desc; diff --git a/benchmarks/tpch/queries/q6.sql b/benchmarks/tpch/queries/q6.sql new file mode 100644 index 000000000..3d6e51cfe --- /dev/null +++ b/benchmarks/tpch/queries/q6.sql @@ -0,0 +1,11 @@ +-- Benchmark Query 6 derived from TPC-H query 6 under the terms of the TPC Fair Use Policy. +-- TPC-H queries are Copyright 1993-2022 Transaction Processing Performance Council. +select + sum(l_extendedprice * l_discount) as revenue +from + lineitem +where + l_shipdate >= date '1994-01-01' + and l_shipdate < date '1994-01-01' + interval '1' year + and l_discount between 0.04 - 0.01 and 0.04 + 0.01 + and l_quantity < 24; diff --git a/benchmarks/tpch/queries/q7.sql b/benchmarks/tpch/queries/q7.sql new file mode 100644 index 000000000..6e36ad616 --- /dev/null +++ b/benchmarks/tpch/queries/q7.sql @@ -0,0 +1,41 @@ +-- Benchmark Query 7 derived from TPC-H query 7 under the terms of the TPC Fair Use Policy. +-- TPC-H queries are Copyright 1993-2022 Transaction Processing Performance Council. 
+select + supp_nation, + cust_nation, + l_year, + sum(volume) as revenue +from + ( + select + n1.n_name as supp_nation, + n2.n_name as cust_nation, + extract(year from l_shipdate) as l_year, + l_extendedprice * (1 - l_discount) as volume + from + supplier, + lineitem, + orders, + customer, + nation n1, + nation n2 + where + s_suppkey = l_suppkey + and o_orderkey = l_orderkey + and c_custkey = o_custkey + and s_nationkey = n1.n_nationkey + and c_nationkey = n2.n_nationkey + and ( + (n1.n_name = 'GERMANY' and n2.n_name = 'IRAQ') + or (n1.n_name = 'IRAQ' and n2.n_name = 'GERMANY') + ) + and l_shipdate between date '1995-01-01' and date '1996-12-31' + ) as shipping +group by + supp_nation, + cust_nation, + l_year +order by + supp_nation, + cust_nation, + l_year; diff --git a/benchmarks/tpch/queries/q8.sql b/benchmarks/tpch/queries/q8.sql new file mode 100644 index 000000000..e28235ed4 --- /dev/null +++ b/benchmarks/tpch/queries/q8.sql @@ -0,0 +1,39 @@ +-- Benchmark Query 8 derived from TPC-H query 8 under the terms of the TPC Fair Use Policy. +-- TPC-H queries are Copyright 1993-2022 Transaction Processing Performance Council. 
+select + o_year, + sum(case + when nation = 'IRAQ' then volume + else 0 + end) / sum(volume) as mkt_share +from + ( + select + extract(year from o_orderdate) as o_year, + l_extendedprice * (1 - l_discount) as volume, + n2.n_name as nation + from + part, + supplier, + lineitem, + orders, + customer, + nation n1, + nation n2, + region + where + p_partkey = l_partkey + and s_suppkey = l_suppkey + and l_orderkey = o_orderkey + and o_custkey = c_custkey + and c_nationkey = n1.n_nationkey + and n1.n_regionkey = r_regionkey + and r_name = 'MIDDLE EAST' + and s_nationkey = n2.n_nationkey + and o_orderdate between date '1995-01-01' and date '1996-12-31' + and p_type = 'LARGE PLATED STEEL' + ) as all_nations +group by + o_year +order by + o_year; diff --git a/benchmarks/tpch/queries/q9.sql b/benchmarks/tpch/queries/q9.sql new file mode 100644 index 000000000..86ae02482 --- /dev/null +++ b/benchmarks/tpch/queries/q9.sql @@ -0,0 +1,34 @@ +-- Benchmark Query 9 derived from TPC-H query 9 under the terms of the TPC Fair Use Policy. +-- TPC-H queries are Copyright 1993-2022 Transaction Processing Performance Council. +select + nation, + o_year, + sum(amount) as sum_profit +from + ( + select + n_name as nation, + extract(year from o_orderdate) as o_year, + l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity as amount + from + part, + supplier, + lineitem, + partsupp, + orders, + nation + where + s_suppkey = l_suppkey + and ps_suppkey = l_suppkey + and ps_partkey = l_partkey + and p_partkey = l_partkey + and o_orderkey = l_orderkey + and s_nationkey = n_nationkey + and p_name like '%moccasin%' + ) as profit +group by + nation, + o_year +order by + nation, + o_year desc; diff --git a/benchmarks/tpch/tpch-gen.sh b/benchmarks/tpch/tpch-gen.sh new file mode 100755 index 000000000..139c300a2 --- /dev/null +++ b/benchmarks/tpch/tpch-gen.sh @@ -0,0 +1,62 @@ +#!/bin/bash +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. 
See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +mkdir -p data/answers 2>/dev/null + +set -e + +# If RUN_IN_CI is set, then do not produce verbose output or use an interactive terminal +if [[ -z "${RUN_IN_CI}" ]]; then + TERMINAL_FLAG="-it" + VERBOSE_OUTPUT="-vf" +else + TERMINAL_FLAG="" + VERBOSE_OUTPUT="-f" +fi + +#pushd .. +#. ./dev/build-set-env.sh +#popd + +# Generate data into the ./data directory if it does not already exist +FILE=./data/supplier.tbl +if test -f "$FILE"; then + echo "$FILE exists." +else + docker run -v `pwd`/data:/data $TERMINAL_FLAG --rm ghcr.io/scalytics/tpch-docker:main $VERBOSE_OUTPUT -s $1 + + # workaround for https://github.com/apache/arrow-datafusion/issues/6147 + mv data/customer.tbl data/customer.csv + mv data/lineitem.tbl data/lineitem.csv + mv data/nation.tbl data/nation.csv + mv data/orders.tbl data/orders.csv + mv data/part.tbl data/part.csv + mv data/partsupp.tbl data/partsupp.csv + mv data/region.tbl data/region.csv + mv data/supplier.tbl data/supplier.csv + + ls -l data +fi + +# Copy expected answers (at SF=1) into the ./data/answers directory if it does not already exist +FILE=./data/answers/q1.out +if test -f "$FILE"; then + echo "$FILE exists." 
+else + docker run -v `pwd`/data:/data $TERMINAL_FLAG --entrypoint /bin/bash --rm ghcr.io/scalytics/tpch-docker:main -c "cp /opt/tpch/2.18.0_rc2/dbgen/answers/* /data/answers/" +fi diff --git a/benchmarks/tpch/tpch.py b/benchmarks/tpch/tpch.py new file mode 100644 index 000000000..ffee5554c --- /dev/null +++ b/benchmarks/tpch/tpch.py @@ -0,0 +1,99 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+ +import argparse +import time +from pathlib import Path + +from datafusion import SessionContext + + +def bench(data_path, query_path) -> None: + with Path("results.csv").open("w") as results: + # register tables + start = time.time() + total_time_millis = 0 + + # create context + # runtime = ( + # RuntimeEnvBuilder() + # .with_disk_manager_os() + # .with_fair_spill_pool(10000000) + # ) + # config = ( + # SessionConfig() + # .with_create_default_catalog_and_schema(True) + # .with_default_catalog_and_schema("datafusion", "tpch") + # .with_information_schema(True) + # ) + # ctx = SessionContext(config, runtime) + + ctx = SessionContext() + print("Configuration:\n", ctx) + + # register tables + with Path("create_tables.sql").open() as f: + sql = "" + for line in f.readlines(): + if line.startswith("--"): + continue + sql = sql + line + if sql.strip().endswith(";"): + sql = sql.strip().replace("$PATH", data_path) + ctx.sql(sql) + sql = "" + + end = time.time() + time_millis = (end - start) * 1000 + total_time_millis += time_millis + print(f"setup,{round(time_millis, 1)}") + results.write(f"setup,{round(time_millis, 1)}\n") + results.flush() + + # run queries + for query in range(1, 23): + with Path(f"{query_path}/q{query}.sql").open() as f: + text = f.read() + tmp = text.split(";") + queries = [s.strip() for s in tmp if len(s.strip()) > 0] + + try: + start = time.time() + for sql in queries: + print(sql) + df = ctx.sql(sql) + # result_set = df.collect() + df.show() + end = time.time() + time_millis = (end - start) * 1000 + total_time_millis += time_millis + print(f"q{query},{round(time_millis, 1)}") + results.write(f"q{query},{round(time_millis, 1)}\n") + results.flush() + except Exception as e: + print("query", query, "failed", e) + + print(f"total,{round(total_time_millis, 1)}") + results.write(f"total,{round(total_time_millis, 1)}\n") + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + parser.add_argument("data_path") + 
parser.add_argument("query_path") + args = parser.parse_args() + bench(args.data_path, args.query_path) diff --git a/datafusion/expr.py b/ci/scripts/python_lint.sh old mode 100644 new mode 100755 similarity index 90% rename from datafusion/expr.py rename to ci/scripts/python_lint.sh index e914b85d7..3f7310ba7 --- a/datafusion/expr.py +++ b/ci/scripts/python_lint.sh @@ -1,3 +1,5 @@ +#!/usr/bin/env bash +# # Licensed to the Apache Software Foundation (ASF) under one # or more contributor license agreements. See the NOTICE file # distributed with this work for additional information @@ -15,9 +17,6 @@ # specific language governing permissions and limitations # under the License. - -from ._internal import expr - - -def __getattr__(name): - return getattr(expr, name) +set -ex +ruff format datafusion +ruff check datafusion \ No newline at end of file diff --git a/ci/scripts/rust_fmt.sh b/ci/scripts/rust_fmt.sh index 9d8325877..05cb6b208 100755 --- a/ci/scripts/rust_fmt.sh +++ b/ci/scripts/rust_fmt.sh @@ -18,4 +18,4 @@ # under the License. set -ex -cargo fmt --all -- --check +cargo +nightly fmt --all -- --check diff --git a/conda/environments/datafusion-dev.yaml b/conda/environments/datafusion-dev.yaml deleted file mode 100644 index d9405e4fe..000000000 --- a/conda/environments/datafusion-dev.yaml +++ /dev/null @@ -1,44 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. 
You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. See the License for the -# specific language governing permissions and limitations -# under the License. - -channels: -- conda-forge -dependencies: -- black -- flake8 -- isort -- maturin -- mypy -- numpy -- pyarrow -- pytest -- toml -- importlib_metadata -- python>=3.10 -# Packages useful for building distributions and releasing -- mamba -- conda-build -- anaconda-client -# Packages for documentation building -- sphinx -- pydata-sphinx-theme==0.8.0 -- myst-parser -- jinja2 -# GPU packages -- cudf -- cudatoolkit=11.8 -name: datafusion-dev diff --git a/conftest.py b/conftest.py new file mode 100644 index 000000000..0c9410636 --- /dev/null +++ b/conftest.py @@ -0,0 +1,36 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+ +"""Pytest configuration for doctest namespace injection.""" + +import datafusion as dfn +import numpy as np +import pyarrow as pa +import pytest +from datafusion import col, lit +from datafusion import functions as F + + +@pytest.fixture(autouse=True) +def _doctest_namespace(doctest_namespace: dict) -> None: + """Add common imports to the doctest namespace.""" + doctest_namespace["dfn"] = dfn + doctest_namespace["np"] = np + doctest_namespace["pa"] = pa + doctest_namespace["col"] = col + doctest_namespace["lit"] = lit + doctest_namespace["F"] = F diff --git a/crates/core/Cargo.toml b/crates/core/Cargo.toml new file mode 100644 index 000000000..3e2b01c8e --- /dev/null +++ b/crates/core/Cargo.toml @@ -0,0 +1,82 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+ +[package] +name = "datafusion-python" +version.workspace = true +edition.workspace = true +rust-version.workspace = true +license.workspace = true +description.workspace = true +homepage.workspace = true +repository.workspace = true +include = [ + "src", + "../LICENSE.txt", + "build.rs", + "../pyproject.toml", + "Cargo.toml", + "../Cargo.lock", +] + +[dependencies] +tokio = { workspace = true, features = [ + "macros", + "rt", + "rt-multi-thread", + "sync", +] } +pyo3 = { workspace = true, features = [ + "extension-module", + "abi3", + "abi3-py310", +] } +pyo3-async-runtimes = { workspace = true, features = ["tokio-runtime"] } +pyo3-log = { workspace = true } +arrow = { workspace = true, features = ["pyarrow"] } +arrow-select = { workspace = true } +datafusion = { workspace = true, features = ["avro", "unicode_expressions"] } +datafusion-substrait = { workspace = true, optional = true } +datafusion-proto = { workspace = true } +datafusion-ffi = { workspace = true } +prost = { workspace = true } # keep in line with `datafusion-substrait` +serde_json = { workspace = true } +uuid = { workspace = true, features = ["v4"] } +mimalloc = { workspace = true, optional = true, features = [ + "local_dynamic_tls", +] } +async-trait = { workspace = true } +futures = { workspace = true } +cstr = { workspace = true } +object_store = { workspace = true, features = ["aws", "gcp", "azure", "http"] } +url = { workspace = true } +log = { workspace = true } +parking_lot = { workspace = true } +datafusion-python-util = { workspace = true } + +[build-dependencies] +prost-types = { workspace = true } +pyo3-build-config = { workspace = true } + +[features] +default = ["mimalloc"] +protoc = ["datafusion-substrait/protoc"] +substrait = ["dep:datafusion-substrait"] + +[lib] +name = "datafusion_python" +crate-type = ["cdylib", "rlib"] diff --git a/src/expr/grouping_set.rs b/crates/core/build.rs similarity index 62% rename from src/expr/grouping_set.rs rename to crates/core/build.rs index 
b73932863..4878d8b0e 100644 --- a/src/expr/grouping_set.rs +++ b/crates/core/build.rs @@ -15,23 +15,6 @@ // specific language governing permissions and limitations // under the License. -use datafusion_expr::GroupingSet; -use pyo3::prelude::*; - -#[pyclass(name = "GroupingSet", module = "datafusion.expr", subclass)] -#[derive(Clone)] -pub struct PyGroupingSet { - grouping_set: GroupingSet, -} - -impl From for GroupingSet { - fn from(grouping_set: PyGroupingSet) -> Self { - grouping_set.grouping_set - } -} - -impl From for PyGroupingSet { - fn from(grouping_set: GroupingSet) -> PyGroupingSet { - PyGroupingSet { grouping_set } - } +fn main() { + pyo3_build_config::add_extension_module_link_args(); } diff --git a/crates/core/src/array.rs b/crates/core/src/array.rs new file mode 100644 index 000000000..f284fa9de --- /dev/null +++ b/crates/core/src/array.rs @@ -0,0 +1,88 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +use std::ptr::NonNull; +use std::sync::Arc; + +use arrow::array::{Array, ArrayRef}; +use arrow::datatypes::{Field, FieldRef}; +use arrow::ffi::{FFI_ArrowArray, FFI_ArrowSchema}; +use arrow::pyarrow::ToPyArrow; +use pyo3::prelude::{PyAnyMethods, PyCapsuleMethods}; +use pyo3::types::PyCapsule; +use pyo3::{Bound, PyAny, PyResult, Python, pyclass, pymethods}; + +use crate::errors::PyDataFusionResult; + +/// A Python object which implements the Arrow PyCapsule interface for importing +/// into other libraries. +#[pyclass( + from_py_object, + name = "ArrowArrayExportable", + module = "datafusion", + frozen +)] +#[derive(Clone)] +pub struct PyArrowArrayExportable { + array: ArrayRef, + field: FieldRef, +} + +#[pymethods] +impl PyArrowArrayExportable { + #[pyo3(signature = (requested_schema=None))] + fn __arrow_c_array__<'py>( + &'py self, + py: Python<'py>, + requested_schema: Option<Bound<'py, PyCapsule>>, + ) -> PyDataFusionResult<(Bound<'py, PyCapsule>, Bound<'py, PyCapsule>)> { + let field = if let Some(schema_capsule) = requested_schema { + let data: NonNull<FFI_ArrowSchema> = schema_capsule + .pointer_checked(Some(c"arrow_schema"))?
+ .cast(); + let schema_ptr = unsafe { data.as_ref() }; + let desired_field = Field::try_from(schema_ptr)?; + + Arc::new(desired_field) + } else { + Arc::clone(&self.field) + }; + + let ffi_schema = FFI_ArrowSchema::try_from(&field)?; + let schema_capsule = PyCapsule::new(py, ffi_schema, Some(cr"arrow_schema".into()))?; + + let ffi_array = FFI_ArrowArray::new(&self.array.to_data()); + let array_capsule = PyCapsule::new(py, ffi_array, Some(cr"arrow_array".into()))?; + + Ok((schema_capsule, array_capsule)) + } +} + +impl ToPyArrow for PyArrowArrayExportable { + fn to_pyarrow<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> { + let module = py.import("pyarrow")?; + let method = module.getattr("array")?; + let array = method.call((self.clone(),), None)?; + Ok(array) + } +} + +impl PyArrowArrayExportable { + pub fn new(array: ArrayRef, field: FieldRef) -> Self { + Self { array, field } + } +} diff --git a/crates/core/src/catalog.rs b/crates/core/src/catalog.rs new file mode 100644 index 000000000..30ec4744c --- /dev/null +++ b/crates/core/src/catalog.rs @@ -0,0 +1,726 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License.
+ +use std::any::Any; +use std::collections::HashSet; +use std::ptr::NonNull; +use std::sync::Arc; + +use async_trait::async_trait; +use datafusion::catalog::{ + CatalogProvider, CatalogProviderList, MemoryCatalogProvider, MemoryCatalogProviderList, + MemorySchemaProvider, SchemaProvider, +}; +use datafusion::common::DataFusionError; +use datafusion::datasource::TableProvider; +use datafusion_ffi::catalog_provider::FFI_CatalogProvider; +use datafusion_ffi::proto::logical_extension_codec::FFI_LogicalExtensionCodec; +use datafusion_ffi::schema_provider::FFI_SchemaProvider; +use datafusion_python_util::{ + create_logical_extension_capsule, ffi_logical_codec_from_pycapsule, wait_for_future, +}; +use pyo3::IntoPyObjectExt; +use pyo3::exceptions::PyKeyError; +use pyo3::prelude::*; +use pyo3::types::PyCapsule; + +use crate::context::PySessionContext; +use crate::dataset::Dataset; +use crate::errors::{PyDataFusionError, PyDataFusionResult, py_datafusion_err, to_datafusion_err}; +use crate::table::PyTable; + +#[pyclass( + from_py_object, + frozen, + name = "RawCatalogList", + module = "datafusion.catalog", + subclass +)] +#[derive(Clone)] +pub struct PyCatalogList { + pub catalog_list: Arc<dyn CatalogProviderList>, + codec: Arc<FFI_LogicalExtensionCodec>, +} + +#[pyclass( + from_py_object, + frozen, + name = "RawCatalog", + module = "datafusion.catalog", + subclass +)] +#[derive(Clone)] +pub struct PyCatalog { + pub catalog: Arc<dyn CatalogProvider>, + codec: Arc<FFI_LogicalExtensionCodec>, +} + +#[pyclass( + from_py_object, + frozen, + name = "RawSchema", + module = "datafusion.catalog", + subclass +)] +#[derive(Clone)] +pub struct PySchema { + pub schema: Arc<dyn SchemaProvider>, + codec: Arc<FFI_LogicalExtensionCodec>, +} + +impl PyCatalog { + pub(crate) fn new_from_parts( + catalog: Arc<dyn CatalogProvider>, + codec: Arc<FFI_LogicalExtensionCodec>, + ) -> Self { + Self { catalog, codec } + } +} + +impl PySchema { + pub(crate) fn new_from_parts( + schema: Arc<dyn SchemaProvider>, + codec: Arc<FFI_LogicalExtensionCodec>, + ) -> Self { + Self { schema, codec } + } +} + +#[pymethods] +impl PyCatalogList { + #[new] + pub fn new( + py: Python, + catalog_list: Py<PyAny>, + session: Option<Bound<'_, PyAny>>, + ) -> PyResult<Self> { + let codec
= extract_logical_extension_codec(py, session)?; + let catalog_list = Arc::new(RustWrappedPyCatalogProviderList::new( + catalog_list, + codec.clone(), + )) as Arc<dyn CatalogProviderList>; + Ok(Self { + catalog_list, + codec, + }) + } + + #[staticmethod] + pub fn memory_catalog_list(py: Python, session: Option<Bound<'_, PyAny>>) -> PyResult<Self> { + let codec = extract_logical_extension_codec(py, session)?; + let catalog_list = + Arc::new(MemoryCatalogProviderList::default()) as Arc<dyn CatalogProviderList>; + Ok(Self { + catalog_list, + codec, + }) + } + + pub fn catalog_names(&self) -> HashSet<String> { + self.catalog_list.catalog_names().into_iter().collect() + } + + #[pyo3(signature = (name="public"))] + pub fn catalog(&self, name: &str) -> PyResult<Py<PyAny>> { + let catalog = self + .catalog_list + .catalog(name) + .ok_or(PyKeyError::new_err(format!( + "Catalog with name {name} doesn't exist." + )))?; + + Python::attach(|py| { + match catalog + .as_any() + .downcast_ref::<RustWrappedPyCatalogProvider>() + { + Some(wrapped_catalog) => Ok(wrapped_catalog.catalog_provider.clone_ref(py)), + None => PyCatalog::new_from_parts(catalog, self.codec.clone()).into_py_any(py), + } + }) + } + + pub fn register_catalog(&self, name: &str, catalog_provider: Bound<'_, PyAny>) -> PyResult<()> { + let provider = extract_catalog_provider_from_pyobj(catalog_provider, self.codec.as_ref())?; + + let _ = self + .catalog_list + .register_catalog(name.to_owned(), provider); + + Ok(()) + } + + pub fn __repr__(&self) -> PyResult<String> { + let mut names: Vec<String> = self.catalog_names().into_iter().collect(); + names.sort(); + Ok(format!("CatalogList(catalog_names=[{}])", names.join(", "))) + } +} + +#[pymethods] +impl PyCatalog { + #[new] + pub fn new(py: Python, catalog: Py<PyAny>, session: Option<Bound<'_, PyAny>>) -> PyResult<Self> { + let codec = extract_logical_extension_codec(py, session)?; + let catalog = Arc::new(RustWrappedPyCatalogProvider::new(catalog, codec.clone())) + as Arc<dyn CatalogProvider>; + Ok(Self { catalog, codec }) + } + + #[staticmethod] + pub fn memory_catalog(py: Python, session: Option<Bound<'_, PyAny>>) -> PyResult<Self> { + let codec =
extract_logical_extension_codec(py, session)?; + let catalog = Arc::new(MemoryCatalogProvider::default()) as Arc<dyn CatalogProvider>; + Ok(Self { catalog, codec }) + } + + pub fn schema_names(&self) -> HashSet<String> { + self.catalog.schema_names().into_iter().collect() + } + + #[pyo3(signature = (name="public"))] + pub fn schema(&self, name: &str) -> PyResult<Py<PyAny>> { + let schema = self + .catalog + .schema(name) + .ok_or(PyKeyError::new_err(format!( + "Schema with name {name} doesn't exist." + )))?; + + Python::attach(|py| { + match schema + .as_any() + .downcast_ref::<RustWrappedPySchemaProvider>() + { + Some(wrapped_schema) => Ok(wrapped_schema.schema_provider.clone_ref(py)), + None => PySchema::new_from_parts(schema, self.codec.clone()).into_py_any(py), + } + }) + } + + pub fn register_schema(&self, name: &str, schema_provider: Bound<'_, PyAny>) -> PyResult<()> { + let provider = extract_schema_provider_from_pyobj(schema_provider, self.codec.as_ref())?; + + let _ = self + .catalog + .register_schema(name, provider) + .map_err(py_datafusion_err)?; + + Ok(()) + } + + pub fn deregister_schema(&self, name: &str, cascade: bool) -> PyResult<()> { + let _ = self + .catalog + .deregister_schema(name, cascade) + .map_err(py_datafusion_err)?; + + Ok(()) + } + + pub fn __repr__(&self) -> PyResult<String> { + let mut names: Vec<String> = self.schema_names().into_iter().collect(); + names.sort(); + Ok(format!("Catalog(schema_names=[{}])", names.join(", "))) + } +} + +#[pymethods] +impl PySchema { + #[new] + pub fn new( + py: Python, + schema_provider: Py<PyAny>, + session: Option<Bound<'_, PyAny>>, + ) -> PyResult<Self> { + let codec = extract_logical_extension_codec(py, session)?; + let schema = + Arc::new(RustWrappedPySchemaProvider::new(schema_provider)) as Arc<dyn SchemaProvider>; + Ok(Self { schema, codec }) + } + + #[staticmethod] + fn memory_schema(py: Python, session: Option<Bound<'_, PyAny>>) -> PyResult<Self> { + let codec = extract_logical_extension_codec(py, session)?; + let schema = Arc::new(MemorySchemaProvider::default()) as Arc<dyn SchemaProvider>; + Ok(Self { schema, codec }) + } + + #[getter] + fn table_names(&self) ->
HashSet<String> { + self.schema.table_names().into_iter().collect() + } + + fn table(&self, name: &str, py: Python) -> PyDataFusionResult<PyTable> { + if let Some(table) = wait_for_future(py, self.schema.table(name))?? { + Ok(PyTable::from(table)) + } else { + Err(PyDataFusionError::Common(format!( + "Table not found: {name}" + ))) + } + } + + fn __repr__(&self) -> PyResult<String> { + let mut names: Vec<String> = self.table_names().into_iter().collect(); + names.sort(); + Ok(format!("Schema(table_names=[{}])", names.join(";"))) + } + + fn register_table(&self, name: &str, table_provider: Bound<'_, PyAny>) -> PyResult<()> { + let py = table_provider.py(); + let codec_capsule = create_logical_extension_capsule(py, self.codec.as_ref())? + .as_any() + .clone(); + + let table = PyTable::new(table_provider, Some(codec_capsule))?; + + let _ = self + .schema + .register_table(name.to_string(), table.table) + .map_err(py_datafusion_err)?; + + Ok(()) + } + + fn deregister_table(&self, name: &str) -> PyResult<()> { + let _ = self + .schema + .deregister_table(name) + .map_err(py_datafusion_err)?; + + Ok(()) + } + + fn table_exist(&self, name: &str) -> bool { + self.schema.table_exist(name) + } +} + +#[derive(Debug)] +pub(crate) struct RustWrappedPySchemaProvider { + schema_provider: Py<PyAny>, + owner_name: Option<String>, +} + +impl RustWrappedPySchemaProvider { + pub fn new(schema_provider: Py<PyAny>) -> Self { + let owner_name = Python::attach(|py| { + schema_provider + .bind(py) + .getattr("owner_name") + .ok() + .map(|name| name.to_string()) + }); + + Self { + schema_provider, + owner_name, + } + } + + fn table_inner(&self, name: &str) -> PyResult<Option<Arc<dyn TableProvider>>> { + Python::attach(|py| { + let provider = self.schema_provider.bind(py); + let py_table_method = provider.getattr("table")?; + + let py_table = py_table_method.call((name,), None)?; + if py_table.is_none() { + return Ok(None); + } + + let table = PyTable::new(py_table, None)?; + + Ok(Some(table.table)) + }) + } +} + +#[async_trait] +impl SchemaProvider for
RustWrappedPySchemaProvider { + fn owner_name(&self) -> Option<&str> { + self.owner_name.as_deref() + } + + fn as_any(&self) -> &dyn Any { + self + } + + fn table_names(&self) -> Vec<String> { + Python::attach(|py| { + let provider = self.schema_provider.bind(py); + + provider + .getattr("table_names") + .and_then(|names| names.extract::<Vec<String>>()) + .unwrap_or_else(|err| { + log::error!("Unable to get table_names: {err}"); + Vec::default() + }) + }) + } + + async fn table( + &self, + name: &str, + ) -> datafusion::common::Result<Option<Arc<dyn TableProvider>>, DataFusionError> { + self.table_inner(name) + .map_err(|e| DataFusionError::External(Box::new(e))) + } + + fn register_table( + &self, + name: String, + table: Arc<dyn TableProvider>, + ) -> datafusion::common::Result<Option<Arc<dyn TableProvider>>> { + let py_table = PyTable::from(table); + Python::attach(|py| { + let provider = self.schema_provider.bind(py); + let _ = provider + .call_method1("register_table", (name, py_table)) + .map_err(to_datafusion_err)?; + // Since the definition of `register_table` says that an error + // will be returned if the table already exists, there is no + // case where we want to return a table provider as output. + Ok(None) + }) + } + + fn deregister_table( + &self, + name: &str, + ) -> datafusion::common::Result<Option<Arc<dyn TableProvider>>> { + Python::attach(|py| { + let provider = self.schema_provider.bind(py); + let table = provider + .call_method1("deregister_table", (name,)) + .map_err(to_datafusion_err)?; + if table.is_none() { + return Ok(None); + } + + // If we can turn this table provider into a `Dataset`, return it. + // Otherwise, return None.
+ let dataset = match Dataset::new(&table, py) { + Ok(dataset) => Some(Arc::new(dataset) as Arc<dyn TableProvider>), + Err(_) => None, + }; + + Ok(dataset) + }) + } + + fn table_exist(&self, name: &str) -> bool { + Python::attach(|py| { + let provider = self.schema_provider.bind(py); + provider + .call_method1("table_exist", (name,)) + .and_then(|pyobj| pyobj.extract()) + .unwrap_or(false) + }) + } +} + +#[derive(Debug)] +pub(crate) struct RustWrappedPyCatalogProvider { + pub(crate) catalog_provider: Py<PyAny>, + codec: Arc<FFI_LogicalExtensionCodec>, +} + +impl RustWrappedPyCatalogProvider { + pub fn new(catalog_provider: Py<PyAny>, codec: Arc<FFI_LogicalExtensionCodec>) -> Self { + Self { + catalog_provider, + codec, + } + } + + fn schema_inner(&self, name: &str) -> PyResult<Option<Arc<dyn SchemaProvider>>> { + Python::attach(|py| { + let provider = self.catalog_provider.bind(py); + + let py_schema = provider.call_method1("schema", (name,))?; + if py_schema.is_none() { + return Ok(None); + } + + extract_schema_provider_from_pyobj(py_schema, self.codec.as_ref()).map(Some) + }) + } +} + +#[async_trait] +impl CatalogProvider for RustWrappedPyCatalogProvider { + fn as_any(&self) -> &dyn Any { + self + } + + fn schema_names(&self) -> Vec<String> { + Python::attach(|py| { + let provider = self.catalog_provider.bind(py); + provider + .call_method0("schema_names") + .and_then(|names| names.extract::<HashSet<String>>()) + .map(|names| names.into_iter().collect()) + .unwrap_or_else(|err| { + log::error!("Unable to get schema_names: {err}"); + Vec::default() + }) + }) + } + + fn schema(&self, name: &str) -> Option<Arc<dyn SchemaProvider>> { + self.schema_inner(name).unwrap_or_else(|err| { + log::error!("CatalogProvider schema returned error: {err}"); + None + }) + } + + fn register_schema( + &self, + name: &str, + schema: Arc<dyn SchemaProvider>, + ) -> datafusion::common::Result<Option<Arc<dyn SchemaProvider>>> { + Python::attach(|py| { + let py_schema = match schema + .as_any() + .downcast_ref::<RustWrappedPySchemaProvider>() + { + Some(wrapped_schema) => wrapped_schema.schema_provider.as_any(), + None => &PySchema::new_from_parts(schema, self.codec.clone()) + .into_py_any(py) + .map_err(to_datafusion_err)?, + };
+ + let provider = self.catalog_provider.bind(py); + let schema = provider + .call_method1("register_schema", (name, py_schema)) + .map_err(to_datafusion_err)?; + if schema.is_none() { + return Ok(None); + } + + let schema = Arc::new(RustWrappedPySchemaProvider::new(schema.into())) + as Arc<dyn SchemaProvider>; + + Ok(Some(schema)) + }) + } + + fn deregister_schema( + &self, + name: &str, + cascade: bool, + ) -> datafusion::common::Result<Option<Arc<dyn SchemaProvider>>> { + Python::attach(|py| { + let provider = self.catalog_provider.bind(py); + let schema = provider + .call_method1("deregister_schema", (name, cascade)) + .map_err(to_datafusion_err)?; + if schema.is_none() { + return Ok(None); + } + + let schema = Arc::new(RustWrappedPySchemaProvider::new(schema.into())) + as Arc<dyn SchemaProvider>; + + Ok(Some(schema)) + }) + } +} + +#[derive(Debug)] +pub(crate) struct RustWrappedPyCatalogProviderList { + pub(crate) catalog_provider_list: Py<PyAny>, + codec: Arc<FFI_LogicalExtensionCodec>, +} + +impl RustWrappedPyCatalogProviderList { + pub fn new(catalog_provider_list: Py<PyAny>, codec: Arc<FFI_LogicalExtensionCodec>) -> Self { + Self { + catalog_provider_list, + codec, + } + } + + fn catalog_inner(&self, name: &str) -> PyResult<Option<Arc<dyn CatalogProvider>>> { + Python::attach(|py| { + let provider = self.catalog_provider_list.bind(py); + + let py_schema = provider.call_method1("catalog", (name,))?; + if py_schema.is_none() { + return Ok(None); + } + + extract_catalog_provider_from_pyobj(py_schema, self.codec.as_ref()).map(Some) + }) + } +} + +#[async_trait] +impl CatalogProviderList for RustWrappedPyCatalogProviderList { + fn as_any(&self) -> &dyn Any { + self + } + + fn catalog_names(&self) -> Vec<String> { + Python::attach(|py| { + let provider = self.catalog_provider_list.bind(py); + provider + .call_method0("catalog_names") + .and_then(|names| names.extract::<HashSet<String>>()) + .map(|names| names.into_iter().collect()) + .unwrap_or_else(|err| { + log::error!("Unable to get catalog_names: {err}"); + Vec::default() + }) + }) + } + + fn catalog(&self, name: &str) -> Option<Arc<dyn CatalogProvider>> { + self.catalog_inner(name).unwrap_or_else(|err| {
log::error!("CatalogProvider catalog returned error: {err}"); + None + }) + } + + fn register_catalog( + &self, + name: String, + catalog: Arc<dyn CatalogProvider>, + ) -> Option<Arc<dyn CatalogProvider>> { + Python::attach(|py| { + let py_catalog = match catalog + .as_any() + .downcast_ref::<RustWrappedPyCatalogProvider>() + { + Some(wrapped_schema) => wrapped_schema.catalog_provider.as_any().clone_ref(py), + None => { + match PyCatalog::new_from_parts(catalog, self.codec.clone()).into_py_any(py) { + Ok(c) => c, + Err(err) => { + log::error!( + "register_catalog returned error during conversion to PyAny: {err}" + ); + return None; + } + } + } + }; + + let provider = self.catalog_provider_list.bind(py); + let catalog = match provider.call_method1("register_catalog", (name, py_catalog)) { + Ok(c) => c, + Err(err) => { + log::error!("register_catalog returned error: {err}"); + return None; + } + }; + if catalog.is_none() { + return None; + } + + let catalog = Arc::new(RustWrappedPyCatalogProvider::new( + catalog.into(), + self.codec.clone(), + )) as Arc<dyn CatalogProvider>; + + Some(catalog) + }) + } +} + +fn extract_catalog_provider_from_pyobj( + mut catalog_provider: Bound<PyAny>, + codec: &FFI_LogicalExtensionCodec, +) -> PyResult<Arc<dyn CatalogProvider>> { + if catalog_provider.hasattr("__datafusion_catalog_provider__")? { + let py = catalog_provider.py(); + let codec_capsule = create_logical_extension_capsule(py, codec)?; + catalog_provider = catalog_provider + .getattr("__datafusion_catalog_provider__")? + .call1((codec_capsule,))?; + } + + let provider = if let Ok(capsule) = catalog_provider.cast::<PyCapsule>() { + let data: NonNull<FFI_CatalogProvider> = capsule + .pointer_checked(Some(c"datafusion_catalog_provider"))?
+ .cast(); + let provider = unsafe { data.as_ref() }; + let provider: Arc<dyn CatalogProvider> = provider.into(); + provider as Arc<dyn CatalogProvider> + } else { + match catalog_provider.extract::<PyCatalog>() { + Ok(py_catalog) => py_catalog.catalog, + Err(_) => Arc::new(RustWrappedPyCatalogProvider::new( + catalog_provider.into(), + Arc::new(codec.clone()), + )) as Arc<dyn CatalogProvider>, + } + }; + + Ok(provider) +} + +fn extract_schema_provider_from_pyobj( + mut schema_provider: Bound<PyAny>, + codec: &FFI_LogicalExtensionCodec, +) -> PyResult<Arc<dyn SchemaProvider>> { + if schema_provider.hasattr("__datafusion_schema_provider__")? { + let py = schema_provider.py(); + let codec_capsule = create_logical_extension_capsule(py, codec)?; + schema_provider = schema_provider + .getattr("__datafusion_schema_provider__")? + .call1((codec_capsule,))?; + } + + let provider = if let Ok(capsule) = schema_provider.cast::<PyCapsule>() { + let data: NonNull<FFI_SchemaProvider> = capsule + .pointer_checked(Some(c"datafusion_schema_provider"))? + .cast(); + let provider = unsafe { data.as_ref() }; + let provider: Arc<dyn SchemaProvider> = provider.into(); + provider as Arc<dyn SchemaProvider> + } else { + match schema_provider.extract::<PySchema>() { + Ok(py_schema) => py_schema.schema, + Err(_) => Arc::new(RustWrappedPySchemaProvider::new(schema_provider.into())) + as Arc<dyn SchemaProvider>, + } + }; + + Ok(provider) +} + +fn extract_logical_extension_codec( + py: Python, + obj: Option<Bound<PyAny>>, +) -> PyResult<Arc<FFI_LogicalExtensionCodec>> { + let obj = match obj { + Some(obj) => obj, + None => PySessionContext::global_ctx()?.into_bound_py_any(py)?, + }; + ffi_logical_codec_from_pycapsule(obj).map(Arc::new) +} + +pub(crate) fn init_module(m: &Bound<'_, PyModule>) -> PyResult<()> { + m.add_class::<PyCatalogList>()?; + m.add_class::<PyCatalog>()?; + m.add_class::<PySchema>()?; + + Ok(()) +} diff --git a/src/common.rs b/crates/core/src/common.rs similarity index 69% rename from src/common.rs rename to crates/core/src/common.rs index 8a8e2adf5..88d2fdd5f 100644 --- a/src/common.rs +++ b/crates/core/src/common.rs @@ -18,16 +18,26 @@ use pyo3::prelude::*; pub mod data_type; -pub mod df_field; pub mod df_schema; +pub mod function; +pub mod schema; ///
Initializes the `common` module to match the pattern of `datafusion-common` https://docs.rs/datafusion-common/18.0.0/datafusion_common/index.html -pub(crate) fn init_module(m: &PyModule) -> PyResult<()> { +pub(crate) fn init_module(m: &Bound<'_, PyModule>) -> PyResult<()> { m.add_class::()?; - m.add_class::()?; m.add_class::()?; m.add_class::()?; + m.add_class::()?; m.add_class::()?; m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; Ok(()) } diff --git a/crates/core/src/common/data_type.rs b/crates/core/src/common/data_type.rs new file mode 100644 index 000000000..af4179806 --- /dev/null +++ b/crates/core/src/common/data_type.rs @@ -0,0 +1,792 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use std::sync::Arc; + +use datafusion::arrow::array::Array; +use datafusion::arrow::datatypes::{DataType, IntervalUnit, TimeUnit}; +use datafusion::common::ScalarValue; +use datafusion::logical_expr::expr::NullTreatment as DFNullTreatment; +use pyo3::exceptions::{PyNotImplementedError, PyValueError}; +use pyo3::prelude::*; + +/// A [`ScalarValue`] wrapped in a Python object. 
This struct allows for conversion +/// from a variety of Python objects into a [`ScalarValue`]. See +/// ``FromPyArrow::from_pyarrow_bound`` for conversion details. +#[derive(Debug, Clone, PartialEq, Eq, Hash, PartialOrd)] +pub struct PyScalarValue(pub ScalarValue); + +impl From<ScalarValue> for PyScalarValue { + fn from(value: ScalarValue) -> Self { + Self(value) + } +} +impl From<PyScalarValue> for ScalarValue { + fn from(value: PyScalarValue) -> Self { + value.0 + } +} + +#[derive(Debug, Clone, PartialEq, Eq, Hash, PartialOrd, Ord)] +#[pyclass( + from_py_object, + frozen, + eq, + eq_int, + name = "RexType", + module = "datafusion.common" +)] +pub enum RexType { + Alias, + Literal, + Call, + Reference, + ScalarSubquery, + Other, +} + +/// These bindings are tying together several disparate systems. +/// You have SQL types for the SQL strings and RDBMS systems itself. +/// Rust types for the DataFusion code +/// Arrow types which represents the underlying arrow format +/// Python types which represent the type in Python +/// It is important to keep all of those types in a single +/// and manageable location. Therefore this structure exists +/// to map those types and provide a simple place for developers +/// to map types from one system to another.
+// TODO: This looks like it needs pyo3 tracking so leaving unfrozen for now +#[derive(Debug, Clone)] +#[pyclass( + from_py_object, + name = "DataTypeMap", + module = "datafusion.common", + subclass +)] +pub struct DataTypeMap { + #[pyo3(get, set)] + pub arrow_type: PyDataType, + #[pyo3(get, set)] + pub python_type: PythonType, + #[pyo3(get, set)] + pub sql_type: SqlType, +} + +impl DataTypeMap { + fn new(arrow_type: DataType, python_type: PythonType, sql_type: SqlType) -> Self { + DataTypeMap { + arrow_type: PyDataType { + data_type: arrow_type, + }, + python_type, + sql_type, + } + } + + pub fn map_from_arrow_type(arrow_type: &DataType) -> Result<DataTypeMap, PyErr> { + match arrow_type { + DataType::Null => Ok(DataTypeMap::new( + DataType::Null, + PythonType::None, + SqlType::NULL, + )), + DataType::Boolean => Ok(DataTypeMap::new( + DataType::Boolean, + PythonType::Bool, + SqlType::BOOLEAN, + )), + DataType::Int8 => Ok(DataTypeMap::new( + DataType::Int8, + PythonType::Int, + SqlType::TINYINT, + )), + DataType::Int16 => Ok(DataTypeMap::new( + DataType::Int16, + PythonType::Int, + SqlType::SMALLINT, + )), + DataType::Int32 => Ok(DataTypeMap::new( + DataType::Int32, + PythonType::Int, + SqlType::INTEGER, + )), + DataType::Int64 => Ok(DataTypeMap::new( + DataType::Int64, + PythonType::Int, + SqlType::BIGINT, + )), + DataType::UInt8 => Ok(DataTypeMap::new( + DataType::UInt8, + PythonType::Int, + SqlType::TINYINT, + )), + DataType::UInt16 => Ok(DataTypeMap::new( + DataType::UInt16, + PythonType::Int, + SqlType::SMALLINT, + )), + DataType::UInt32 => Ok(DataTypeMap::new( + DataType::UInt32, + PythonType::Int, + SqlType::INTEGER, + )), + DataType::UInt64 => Ok(DataTypeMap::new( + DataType::UInt64, + PythonType::Int, + SqlType::BIGINT, + )), + DataType::Float16 => Ok(DataTypeMap::new( + DataType::Float16, + PythonType::Float, + SqlType::FLOAT, + )), + DataType::Float32 => Ok(DataTypeMap::new( + DataType::Float32, + PythonType::Float, + SqlType::FLOAT, + )), + DataType::Float64 =>
Ok(DataTypeMap::new( + DataType::Float64, + PythonType::Float, + SqlType::FLOAT, + )), + DataType::Timestamp(unit, tz) => Ok(DataTypeMap::new( + DataType::Timestamp(*unit, tz.clone()), + PythonType::Datetime, + SqlType::DATE, + )), + DataType::Date32 => Ok(DataTypeMap::new( + DataType::Date32, + PythonType::Datetime, + SqlType::DATE, + )), + DataType::Date64 => Ok(DataTypeMap::new( + DataType::Date64, + PythonType::Datetime, + SqlType::DATE, + )), + DataType::Time32(unit) => Ok(DataTypeMap::new( + DataType::Time32(*unit), + PythonType::Datetime, + SqlType::DATE, + )), + DataType::Time64(unit) => Ok(DataTypeMap::new( + DataType::Time64(*unit), + PythonType::Datetime, + SqlType::DATE, + )), + DataType::Duration(_) => Err(PyNotImplementedError::new_err(format!("{arrow_type:?}"))), + DataType::Interval(interval_unit) => Ok(DataTypeMap::new( + DataType::Interval(*interval_unit), + PythonType::Datetime, + match interval_unit { + IntervalUnit::DayTime => SqlType::INTERVAL_DAY, + IntervalUnit::MonthDayNano => SqlType::INTERVAL_MONTH, + IntervalUnit::YearMonth => SqlType::INTERVAL_YEAR_MONTH, + }, + )), + DataType::Binary => Ok(DataTypeMap::new( + DataType::Binary, + PythonType::Bytes, + SqlType::BINARY, + )), + DataType::FixedSizeBinary(_) => { + Err(PyNotImplementedError::new_err(format!("{arrow_type:?}"))) + } + DataType::LargeBinary => Ok(DataTypeMap::new( + DataType::LargeBinary, + PythonType::Bytes, + SqlType::BINARY, + )), + DataType::Utf8 => Ok(DataTypeMap::new( + DataType::Utf8, + PythonType::Str, + SqlType::VARCHAR, + )), + DataType::LargeUtf8 => Ok(DataTypeMap::new( + DataType::LargeUtf8, + PythonType::Str, + SqlType::VARCHAR, + )), + DataType::List(_) => Err(PyNotImplementedError::new_err(format!("{arrow_type:?}"))), + DataType::FixedSizeList(_, _) => { + Err(PyNotImplementedError::new_err(format!("{arrow_type:?}"))) + } + DataType::LargeList(_) => { + Err(PyNotImplementedError::new_err(format!("{arrow_type:?}"))) + } + DataType::Struct(_) => 
Err(PyNotImplementedError::new_err(format!("{arrow_type:?}"))), + DataType::Union(_, _) => Err(PyNotImplementedError::new_err(format!("{arrow_type:?}"))), + DataType::Dictionary(_, _) => { + Err(PyNotImplementedError::new_err(format!("{arrow_type:?}"))) + } + DataType::Decimal32(precision, scale) => Ok(DataTypeMap::new( + DataType::Decimal32(*precision, *scale), + PythonType::Float, + SqlType::DECIMAL, + )), + DataType::Decimal64(precision, scale) => Ok(DataTypeMap::new( + DataType::Decimal64(*precision, *scale), + PythonType::Float, + SqlType::DECIMAL, + )), + DataType::Decimal128(precision, scale) => Ok(DataTypeMap::new( + DataType::Decimal128(*precision, *scale), + PythonType::Float, + SqlType::DECIMAL, + )), + DataType::Decimal256(precision, scale) => Ok(DataTypeMap::new( + DataType::Decimal256(*precision, *scale), + PythonType::Float, + SqlType::DECIMAL, + )), + DataType::Map(_, _) => Err(PyNotImplementedError::new_err(format!("{arrow_type:?}"))), + DataType::RunEndEncoded(_, _) => { + Err(PyNotImplementedError::new_err(format!("{arrow_type:?}"))) + } + DataType::BinaryView => Err(PyNotImplementedError::new_err(format!("{arrow_type:?}"))), + DataType::Utf8View => Err(PyNotImplementedError::new_err(format!("{arrow_type:?}"))), + DataType::ListView(_) => Err(PyNotImplementedError::new_err(format!("{arrow_type:?}"))), + DataType::LargeListView(_) => { + Err(PyNotImplementedError::new_err(format!("{arrow_type:?}"))) + } + } + } + + /// Generate the `DataTypeMap` from a `ScalarValue` instance + pub fn map_from_scalar_value(scalar_val: &ScalarValue) -> Result<DataTypeMap, PyErr> { + DataTypeMap::map_from_arrow_type(&DataTypeMap::map_from_scalar_to_arrow(scalar_val)?)
+ } + + /// Maps a `ScalarValue` to an Arrow `DataType` + pub fn map_from_scalar_to_arrow(scalar_val: &ScalarValue) -> Result<DataType, PyErr> { + match scalar_val { + ScalarValue::Boolean(_) => Ok(DataType::Boolean), + ScalarValue::Float16(_) => Ok(DataType::Float16), + ScalarValue::Float32(_) => Ok(DataType::Float32), + ScalarValue::Float64(_) => Ok(DataType::Float64), + ScalarValue::Decimal32(_, precision, scale) => { + Ok(DataType::Decimal32(*precision, *scale)) + } + ScalarValue::Decimal64(_, precision, scale) => { + Ok(DataType::Decimal64(*precision, *scale)) + } + ScalarValue::Decimal128(_, precision, scale) => { + Ok(DataType::Decimal128(*precision, *scale)) + } + ScalarValue::Decimal256(_, precision, scale) => { + Ok(DataType::Decimal256(*precision, *scale)) + } + ScalarValue::Dictionary(data_type, scalar_type) => { + // Call this function again to map the dictionary scalar_value to an Arrow type + Ok(DataType::Dictionary( + Box::new(*data_type.clone()), + Box::new(DataTypeMap::map_from_scalar_to_arrow(scalar_type)?), + )) + } + ScalarValue::Int8(_) => Ok(DataType::Int8), + ScalarValue::Int16(_) => Ok(DataType::Int16), + ScalarValue::Int32(_) => Ok(DataType::Int32), + ScalarValue::Int64(_) => Ok(DataType::Int64), + ScalarValue::UInt8(_) => Ok(DataType::UInt8), + ScalarValue::UInt16(_) => Ok(DataType::UInt16), + ScalarValue::UInt32(_) => Ok(DataType::UInt32), + ScalarValue::UInt64(_) => Ok(DataType::UInt64), + ScalarValue::Utf8(_) => Ok(DataType::Utf8), + ScalarValue::LargeUtf8(_) => Ok(DataType::LargeUtf8), + ScalarValue::Binary(_) => Ok(DataType::Binary), + ScalarValue::LargeBinary(_) => Ok(DataType::LargeBinary), + ScalarValue::Date32(_) => Ok(DataType::Date32), + ScalarValue::Date64(_) => Ok(DataType::Date64), + ScalarValue::Time32Second(_) => Ok(DataType::Time32(TimeUnit::Second)), + ScalarValue::Time32Millisecond(_) => Ok(DataType::Time32(TimeUnit::Millisecond)), + ScalarValue::Time64Microsecond(_) => Ok(DataType::Time64(TimeUnit::Microsecond)), +
ScalarValue::Time64Nanosecond(_) => Ok(DataType::Time64(TimeUnit::Nanosecond)), + ScalarValue::Null => Ok(DataType::Null), + ScalarValue::TimestampSecond(_, tz) => { + Ok(DataType::Timestamp(TimeUnit::Second, tz.to_owned())) + } + ScalarValue::TimestampMillisecond(_, tz) => { + Ok(DataType::Timestamp(TimeUnit::Millisecond, tz.to_owned())) + } + ScalarValue::TimestampMicrosecond(_, tz) => { + Ok(DataType::Timestamp(TimeUnit::Microsecond, tz.to_owned())) + } + ScalarValue::TimestampNanosecond(_, tz) => { + Ok(DataType::Timestamp(TimeUnit::Nanosecond, tz.to_owned())) + } + ScalarValue::IntervalYearMonth(..) => Ok(DataType::Interval(IntervalUnit::YearMonth)), + ScalarValue::IntervalDayTime(..) => Ok(DataType::Interval(IntervalUnit::DayTime)), + ScalarValue::IntervalMonthDayNano(..) => { + Ok(DataType::Interval(IntervalUnit::MonthDayNano)) + } + ScalarValue::List(arr) => Ok(arr.data_type().to_owned()), + ScalarValue::Struct(_fields) => Err(PyNotImplementedError::new_err( + "ScalarValue::Struct".to_string(), + )), + ScalarValue::FixedSizeBinary(size, _) => Ok(DataType::FixedSizeBinary(*size)), + ScalarValue::FixedSizeList(_array_ref) => { + // The FieldRef was removed from ScalarValue::FixedSizeList in + // https://github.com/apache/arrow-datafusion/pull/8221, so we can no + // longer convert back to a DataType here + Err(PyNotImplementedError::new_err( + "ScalarValue::FixedSizeList".to_string(), + )) + } + ScalarValue::LargeList(_) => Err(PyNotImplementedError::new_err( + "ScalarValue::LargeList".to_string(), + )), + ScalarValue::DurationSecond(_) => Ok(DataType::Duration(TimeUnit::Second)), + ScalarValue::DurationMillisecond(_) => Ok(DataType::Duration(TimeUnit::Millisecond)), + ScalarValue::DurationMicrosecond(_) => Ok(DataType::Duration(TimeUnit::Microsecond)), + ScalarValue::DurationNanosecond(_) => Ok(DataType::Duration(TimeUnit::Nanosecond)), + ScalarValue::Union(_, _, _) => Err(PyNotImplementedError::new_err( + "ScalarValue::Union".to_string(), + )), +
+            ScalarValue::Utf8View(_) => Ok(DataType::Utf8View),
+            ScalarValue::BinaryView(_) => Ok(DataType::BinaryView),
+            ScalarValue::Map(_) => Err(PyNotImplementedError::new_err(
+                "ScalarValue::Map".to_string(),
+            )),
+            ScalarValue::RunEndEncoded(field1, field2, _) => Ok(DataType::RunEndEncoded(
+                Arc::clone(field1),
+                Arc::clone(field2),
+            )),
+        }
+    }
+}
+
+#[pymethods]
+impl DataTypeMap {
+    #[new]
+    pub fn py_new(arrow_type: PyDataType, python_type: PythonType, sql_type: SqlType) -> Self {
+        DataTypeMap {
+            arrow_type,
+            python_type,
+            sql_type,
+        }
+    }
+
+    #[staticmethod]
+    #[pyo3(name = "from_parquet_type_str")]
+    /// When using pyarrow.parquet.read_metadata().schema.column(x).physical_type you are
+    /// presented with a String for the schema rather than a typed object. Here we make a
+    /// best effort to convert that string to an Arrow `DataType`.
+    pub fn py_map_from_parquet_type_str(parquet_str_type: String) -> PyResult<DataTypeMap> {
+        let arrow_dtype = match parquet_str_type.to_lowercase().as_str() {
+            "boolean" => Ok(DataType::Boolean),
+            "int32" => Ok(DataType::Int32),
+            "int64" => Ok(DataType::Int64),
+            "int96" => {
+                // Int96 is an old parquet datatype that is now deprecated. We convert to nanosecond timestamp
+                Ok(DataType::Timestamp(TimeUnit::Nanosecond, None))
+            }
+            "float" => Ok(DataType::Float32),
+            "double" => Ok(DataType::Float64),
+            "byte_array" => Ok(DataType::Utf8),
+            _ => Err(PyValueError::new_err(format!(
+                "Unable to determine Arrow Data Type from Parquet String type: {parquet_str_type:?}"
+            ))),
+        };
+        DataTypeMap::map_from_arrow_type(&arrow_dtype?)
+    }
+
+    #[staticmethod]
+    #[pyo3(name = "arrow")]
+    pub fn py_map_from_arrow_type(arrow_type: &PyDataType) -> PyResult<DataTypeMap> {
+        DataTypeMap::map_from_arrow_type(&arrow_type.data_type)
+    }
+
+    #[staticmethod]
+    #[pyo3(name = "arrow_str")]
+    pub fn py_map_from_arrow_type_str(arrow_type_str: String) -> PyResult<DataTypeMap> {
+        let data_type = PyDataType::py_map_from_arrow_type_str(arrow_type_str);
+        DataTypeMap::map_from_arrow_type(&data_type?.data_type)
+    }
+
+    #[staticmethod]
+    #[pyo3(name = "sql")]
+    pub fn py_map_from_sql_type(sql_type: &SqlType) -> PyResult<DataTypeMap> {
+        match sql_type {
+            SqlType::ANY => Err(PyNotImplementedError::new_err(format!("{sql_type:?}"))),
+            SqlType::ARRAY => Err(PyNotImplementedError::new_err(format!("{sql_type:?}"))),
+            SqlType::BIGINT => Ok(DataTypeMap::new(
+                DataType::Int64,
+                PythonType::Int,
+                SqlType::BIGINT,
+            )),
+            SqlType::BINARY => Ok(DataTypeMap::new(
+                DataType::Binary,
+                PythonType::Bytes,
+                SqlType::BINARY,
+            )),
+            SqlType::BOOLEAN => Ok(DataTypeMap::new(
+                DataType::Boolean,
+                PythonType::Bool,
+                SqlType::BOOLEAN,
+            )),
+            SqlType::CHAR => Ok(DataTypeMap::new(
+                DataType::UInt8,
+                PythonType::Int,
+                SqlType::CHAR,
+            )),
+            SqlType::COLUMN_LIST => Err(PyNotImplementedError::new_err(format!("{sql_type:?}"))),
+            SqlType::CURSOR => Err(PyNotImplementedError::new_err(format!("{sql_type:?}"))),
+            SqlType::DATE => Ok(DataTypeMap::new(
+                DataType::Date64,
+                PythonType::Datetime,
+                SqlType::DATE,
+            )),
+            SqlType::DECIMAL => Ok(DataTypeMap::new(
+                DataType::Decimal128(1, 1),
+                PythonType::Float,
+                SqlType::DECIMAL,
+            )),
+            SqlType::DISTINCT => Err(PyNotImplementedError::new_err(format!("{sql_type:?}"))),
+            SqlType::DOUBLE => Ok(DataTypeMap::new(
+                DataType::Decimal256(1, 1),
+                PythonType::Float,
+                SqlType::DOUBLE,
+            )),
+            SqlType::DYNAMIC_STAR => Err(PyNotImplementedError::new_err(format!("{sql_type:?}"))),
+            SqlType::FLOAT => Ok(DataTypeMap::new(
+                DataType::Decimal128(1, 1),
+                PythonType::Float,
+                SqlType::FLOAT,
+            )),
+            SqlType::GEOMETRY => Err(PyNotImplementedError::new_err(format!("{sql_type:?}"))),
+            SqlType::INTEGER => Ok(DataTypeMap::new(
+                DataType::Int32,
+                PythonType::Int,
+                SqlType::INTEGER,
+            )),
+            SqlType::INTERVAL => Err(PyNotImplementedError::new_err(format!("{sql_type:?}"))),
+            SqlType::INTERVAL_DAY => Err(PyNotImplementedError::new_err(format!("{sql_type:?}"))),
+            SqlType::INTERVAL_DAY_HOUR => {
+                Err(PyNotImplementedError::new_err(format!("{sql_type:?}")))
+            }
+            SqlType::INTERVAL_DAY_MINUTE => {
+                Err(PyNotImplementedError::new_err(format!("{sql_type:?}")))
+            }
+            SqlType::INTERVAL_DAY_SECOND => {
+                Err(PyNotImplementedError::new_err(format!("{sql_type:?}")))
+            }
+            SqlType::INTERVAL_HOUR => Err(PyNotImplementedError::new_err(format!("{sql_type:?}"))),
+            SqlType::INTERVAL_HOUR_MINUTE => {
+                Err(PyNotImplementedError::new_err(format!("{sql_type:?}")))
+            }
+            SqlType::INTERVAL_HOUR_SECOND => {
+                Err(PyNotImplementedError::new_err(format!("{sql_type:?}")))
+            }
+            SqlType::INTERVAL_MINUTE => {
+                Err(PyNotImplementedError::new_err(format!("{sql_type:?}")))
+            }
+            SqlType::INTERVAL_MINUTE_SECOND => {
+                Err(PyNotImplementedError::new_err(format!("{sql_type:?}")))
+            }
+            SqlType::INTERVAL_MONTH => Err(PyNotImplementedError::new_err(format!("{sql_type:?}"))),
+            SqlType::INTERVAL_SECOND => {
+                Err(PyNotImplementedError::new_err(format!("{sql_type:?}")))
+            }
+            SqlType::INTERVAL_YEAR => Err(PyNotImplementedError::new_err(format!("{sql_type:?}"))),
+            SqlType::INTERVAL_YEAR_MONTH => {
+                Err(PyNotImplementedError::new_err(format!("{sql_type:?}")))
+            }
+            SqlType::MAP => Err(PyNotImplementedError::new_err(format!("{sql_type:?}"))),
+            SqlType::MULTISET => Err(PyNotImplementedError::new_err(format!("{sql_type:?}"))),
+            SqlType::NULL => Ok(DataTypeMap::new(
+                DataType::Null,
+                PythonType::None,
+                SqlType::NULL,
+            )),
+            SqlType::OTHER => Err(PyNotImplementedError::new_err(format!("{sql_type:?}"))),
+            SqlType::REAL => Err(PyNotImplementedError::new_err(format!("{sql_type:?}"))),
+            SqlType::ROW =>
Err(PyNotImplementedError::new_err(format!("{sql_type:?}"))), + SqlType::SARG => Err(PyNotImplementedError::new_err(format!("{sql_type:?}"))), + SqlType::SMALLINT => Ok(DataTypeMap::new( + DataType::Int16, + PythonType::Int, + SqlType::SMALLINT, + )), + SqlType::STRUCTURED => Err(PyNotImplementedError::new_err(format!("{sql_type:?}"))), + SqlType::SYMBOL => Err(PyNotImplementedError::new_err(format!("{sql_type:?}"))), + SqlType::TIME => Err(PyNotImplementedError::new_err(format!("{sql_type:?}"))), + SqlType::TIME_WITH_LOCAL_TIME_ZONE => { + Err(PyNotImplementedError::new_err(format!("{sql_type:?}"))) + } + SqlType::TIMESTAMP => Err(PyNotImplementedError::new_err(format!("{sql_type:?}"))), + SqlType::TIMESTAMP_WITH_LOCAL_TIME_ZONE => { + Err(PyNotImplementedError::new_err(format!("{sql_type:?}"))) + } + SqlType::TINYINT => Ok(DataTypeMap::new( + DataType::Int8, + PythonType::Int, + SqlType::TINYINT, + )), + SqlType::UNKNOWN => Err(PyNotImplementedError::new_err(format!("{sql_type:?}"))), + SqlType::VARBINARY => Ok(DataTypeMap::new( + DataType::LargeBinary, + PythonType::Bytes, + SqlType::VARBINARY, + )), + SqlType::VARCHAR => Ok(DataTypeMap::new( + DataType::Utf8, + PythonType::Str, + SqlType::VARCHAR, + )), + } + } + + /// Unfortunately PyO3 does not allow for us to expose the DataType as an enum since + /// we cannot directly annotate the Enum instance of dependency code. Therefore, here + /// we provide an enum to mimic it. 
+ #[pyo3(name = "friendly_arrow_type_name")] + pub fn friendly_arrow_type_name(&self) -> PyResult<&str> { + Ok(match &self.arrow_type.data_type { + DataType::Null => "Null", + DataType::Boolean => "Boolean", + DataType::Int8 => "Int8", + DataType::Int16 => "Int16", + DataType::Int32 => "Int32", + DataType::Int64 => "Int64", + DataType::UInt8 => "UInt8", + DataType::UInt16 => "UInt16", + DataType::UInt32 => "UInt32", + DataType::UInt64 => "UInt64", + DataType::Float16 => "Float16", + DataType::Float32 => "Float32", + DataType::Float64 => "Float64", + DataType::Timestamp(_, _) => "Timestamp", + DataType::Date32 => "Date32", + DataType::Date64 => "Date64", + DataType::Time32(_) => "Time32", + DataType::Time64(_) => "Time64", + DataType::Duration(_) => "Duration", + DataType::Interval(_) => "Interval", + DataType::Binary => "Binary", + DataType::FixedSizeBinary(_) => "FixedSizeBinary", + DataType::LargeBinary => "LargeBinary", + DataType::Utf8 => "Utf8", + DataType::LargeUtf8 => "LargeUtf8", + DataType::List(_) => "List", + DataType::FixedSizeList(_, _) => "FixedSizeList", + DataType::LargeList(_) => "LargeList", + DataType::Struct(_) => "Struct", + DataType::Union(_, _) => "Union", + DataType::Dictionary(_, _) => "Dictionary", + DataType::Decimal32(_, _) => "Decimal32", + DataType::Decimal64(_, _) => "Decimal64", + DataType::Decimal128(_, _) => "Decimal128", + DataType::Decimal256(_, _) => "Decimal256", + DataType::Map(_, _) => "Map", + DataType::RunEndEncoded(_, _) => "RunEndEncoded", + DataType::BinaryView => "BinaryView", + DataType::Utf8View => "Utf8View", + DataType::ListView(_) => "ListView", + DataType::LargeListView(_) => "LargeListView", + }) + } +} + +/// PyO3 requires that objects passed between Rust and Python implement the trait `PyClass` +/// Since `DataType` exists in another package we cannot make that happen here so we wrap +/// `DataType` as `PyDataType` This exists solely to satisfy those constraints. 
+#[derive(Debug, Clone, PartialEq, Eq, Hash, PartialOrd, Ord)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "DataType",
+    module = "datafusion.common"
+)]
+pub struct PyDataType {
+    pub data_type: DataType,
+}
+
+impl PyDataType {
+    /// There are situations when obtaining dtypes on the Python side where the Arrow type
+    /// is presented as a String rather than an actual DataType. This function is used to
+    /// convert that String to a DataType for the Python side to use.
+    pub fn py_map_from_arrow_type_str(arrow_str_type: String) -> PyResult<PyDataType> {
+        // Certain string types contain "metadata" that should be trimmed here. Ex: "datetime64[ns, Europe/Berlin]"
+        let arrow_str_type = match arrow_str_type.find('[') {
+            Some(index) => arrow_str_type[0..index].to_string(),
+            None => arrow_str_type, // Use the string as-is if no '[' is found.
+        };
+
+        let arrow_dtype = match arrow_str_type.to_lowercase().as_str() {
+            "bool" => Ok(DataType::Boolean),
+            "boolean" => Ok(DataType::Boolean),
+            "uint8" => Ok(DataType::UInt8),
+            "uint16" => Ok(DataType::UInt16),
+            "uint32" => Ok(DataType::UInt32),
+            "uint64" => Ok(DataType::UInt64),
+            "int8" => Ok(DataType::Int8),
+            "int16" => Ok(DataType::Int16),
+            "int32" => Ok(DataType::Int32),
+            "int64" => Ok(DataType::Int64),
+            "float" => Ok(DataType::Float32),
+            "double" => Ok(DataType::Float64),
+            "float16" => Ok(DataType::Float16),
+            "float32" => Ok(DataType::Float32),
+            "float64" => Ok(DataType::Float64),
+            "datetime64" => Ok(DataType::Date64),
+            "object" => Ok(DataType::Utf8),
+            _ => Err(PyValueError::new_err(format!(
+                "Unable to determine Arrow Data Type from Arrow String type: {arrow_str_type:?}"
+            ))),
+        };
+        Ok(PyDataType {
+            data_type: arrow_dtype?,
+        })
+    }
+}
+
+impl From<PyDataType> for DataType {
+    fn from(data_type: PyDataType) -> DataType {
+        data_type.data_type
+    }
+}
+
+impl From<DataType> for PyDataType {
+    fn from(data_type: DataType) -> PyDataType {
+        PyDataType { data_type }
+    }
+}
+
+/// Represents the possible Python types that can be mapped to
the SQL types +#[derive(Debug, Clone, PartialEq, Eq, Hash, PartialOrd, Ord)] +#[pyclass( + from_py_object, + frozen, + eq, + eq_int, + name = "PythonType", + module = "datafusion.common" +)] +pub enum PythonType { + Array, + Bool, + Bytes, + Datetime, + Float, + Int, + List, + None, + Object, + Str, +} + +/// Represents the types that are possible for DataFusion to parse +/// from a SQL query. Aka "SqlType" and are valid values for +/// ANSI SQL +#[allow(non_camel_case_types)] +#[allow(clippy::upper_case_acronyms)] +#[derive(Debug, Clone, PartialEq, Eq, Hash, PartialOrd, Ord)] +#[pyclass( + from_py_object, + frozen, + eq, + eq_int, + name = "SqlType", + module = "datafusion.common" +)] +pub enum SqlType { + ANY, + ARRAY, + BIGINT, + BINARY, + BOOLEAN, + CHAR, + COLUMN_LIST, + CURSOR, + DATE, + DECIMAL, + DISTINCT, + DOUBLE, + DYNAMIC_STAR, + FLOAT, + GEOMETRY, + INTEGER, + INTERVAL, + INTERVAL_DAY, + INTERVAL_DAY_HOUR, + INTERVAL_DAY_MINUTE, + INTERVAL_DAY_SECOND, + INTERVAL_HOUR, + INTERVAL_HOUR_MINUTE, + INTERVAL_HOUR_SECOND, + INTERVAL_MINUTE, + INTERVAL_MINUTE_SECOND, + INTERVAL_MONTH, + INTERVAL_SECOND, + INTERVAL_YEAR, + INTERVAL_YEAR_MONTH, + MAP, + MULTISET, + NULL, + OTHER, + REAL, + ROW, + SARG, + SMALLINT, + STRUCTURED, + SYMBOL, + TIME, + TIME_WITH_LOCAL_TIME_ZONE, + TIMESTAMP, + TIMESTAMP_WITH_LOCAL_TIME_ZONE, + TINYINT, + UNKNOWN, + VARBINARY, + VARCHAR, +} + +/// Specifies Ignore / Respect NULL within window functions. 
+/// For example
+/// `FIRST_VALUE(column2) IGNORE NULLS OVER (PARTITION BY column1)`
+#[allow(non_camel_case_types)]
+#[allow(clippy::upper_case_acronyms)]
+#[derive(Debug, Clone, PartialEq, Eq, Hash, PartialOrd, Ord)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    eq,
+    eq_int,
+    name = "NullTreatment",
+    module = "datafusion.common"
+)]
+pub enum NullTreatment {
+    IGNORE_NULLS,
+    RESPECT_NULLS,
+}
+
+impl From<NullTreatment> for DFNullTreatment {
+    fn from(null_treatment: NullTreatment) -> DFNullTreatment {
+        match null_treatment {
+            NullTreatment::IGNORE_NULLS => DFNullTreatment::IgnoreNulls,
+            NullTreatment::RESPECT_NULLS => DFNullTreatment::RespectNulls,
+        }
+    }
+}
+
+impl From<DFNullTreatment> for NullTreatment {
+    fn from(null_treatment: DFNullTreatment) -> NullTreatment {
+        match null_treatment {
+            DFNullTreatment::IgnoreNulls => NullTreatment::IGNORE_NULLS,
+            DFNullTreatment::RespectNulls => NullTreatment::RESPECT_NULLS,
+        }
+    }
+}
diff --git a/src/common/df_schema.rs b/crates/core/src/common/df_schema.rs
similarity index 91%
rename from src/common/df_schema.rs
rename to crates/core/src/common/df_schema.rs
index c16b8eba0..9167e772e 100644
--- a/src/common/df_schema.rs
+++ b/crates/core/src/common/df_schema.rs
@@ -17,11 +17,17 @@
 use std::sync::Arc;
 
-use datafusion_common::DFSchema;
+use datafusion::common::DFSchema;
 use pyo3::prelude::*;
 
 #[derive(Debug, Clone)]
-#[pyclass(name = "DFSchema", module = "datafusion.common", subclass)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "DFSchema",
+    module = "datafusion.common",
+    subclass
+)]
 pub struct PyDFSchema {
     schema: Arc<DFSchema>,
 }
diff --git a/crates/core/src/common/function.rs b/crates/core/src/common/function.rs
new file mode 100644
index 000000000..41cab515f
--- /dev/null
+++ b/crates/core/src/common/function.rs
@@ -0,0 +1,61 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::collections::HashMap;
+
+use datafusion::arrow::datatypes::DataType;
+use pyo3::prelude::*;
+
+use super::data_type::PyDataType;
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "SqlFunction",
+    module = "datafusion.common",
+    subclass
+)]
+#[derive(Debug, Clone)]
+pub struct SqlFunction {
+    pub name: String,
+    pub return_types: HashMap<Vec<DataType>, DataType>,
+    pub aggregation: bool,
+}
+
+impl SqlFunction {
+    pub fn new(
+        function_name: String,
+        input_types: Vec<PyDataType>,
+        return_type: PyDataType,
+        aggregation_bool: bool,
+    ) -> Self {
+        let mut func = Self {
+            name: function_name,
+            return_types: HashMap::new(),
+            aggregation: aggregation_bool,
+        };
+        func.add_type_mapping(input_types, return_type);
+        func
+    }
+
+    pub fn add_type_mapping(&mut self, input_types: Vec<PyDataType>, return_type: PyDataType) {
+        self.return_types.insert(
+            input_types.iter().map(|t| t.clone().into()).collect(),
+            return_type.into(),
+        );
+    }
+}
diff --git a/crates/core/src/common/schema.rs b/crates/core/src/common/schema.rs
new file mode 100644
index 000000000..29a27b204
--- /dev/null
+++ b/crates/core/src/common/schema.rs
@@ -0,0 +1,389 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::any::Any;
+use std::borrow::Cow;
+use std::fmt::{self, Display, Formatter};
+use std::sync::Arc;
+
+use arrow::datatypes::Schema;
+use arrow::pyarrow::PyArrowType;
+use datafusion::arrow::datatypes::SchemaRef;
+use datafusion::common::Constraints;
+use datafusion::datasource::TableType;
+use datafusion::logical_expr::utils::split_conjunction;
+use datafusion::logical_expr::{Expr, TableProviderFilterPushDown, TableSource};
+use parking_lot::RwLock;
+use pyo3::prelude::*;
+
+use super::data_type::DataTypeMap;
+use super::function::SqlFunction;
+use crate::sql::logical::PyLogicalPlan;
+
+#[pyclass(
+    from_py_object,
+    name = "SqlSchema",
+    module = "datafusion.common",
+    subclass,
+    frozen
+)]
+#[derive(Debug, Clone)]
+pub struct SqlSchema {
+    name: Arc<RwLock<String>>,
+    tables: Arc<RwLock<Vec<SqlTable>>>,
+    views: Arc<RwLock<Vec<SqlView>>>,
+    functions: Arc<RwLock<Vec<SqlFunction>>>,
+}
+
+#[pyclass(
+    from_py_object,
+    name = "SqlTable",
+    module = "datafusion.common",
+    subclass
+)]
+#[derive(Debug, Clone)]
+pub struct SqlTable {
+    #[pyo3(get, set)]
+    pub name: String,
+    #[pyo3(get, set)]
+    pub columns: Vec<(String, DataTypeMap)>,
+    #[pyo3(get, set)]
+    pub primary_key: Option<String>,
+    #[pyo3(get, set)]
+    pub foreign_keys: Vec<String>,
+    #[pyo3(get, set)]
+    pub indexes: Vec<String>,
+    #[pyo3(get, set)]
+    pub constraints: Vec<String>,
+    #[pyo3(get, set)]
+    pub statistics: SqlStatistics,
+    #[pyo3(get, set)]
+    pub filepaths: Option<Vec<String>>,
+}
+
+#[pymethods]
+impl SqlTable {
+    #[new]
+    #[pyo3(signature = (table_name, columns, row_count, filepaths=None))]
+    pub fn new(
+        table_name: String,
+        columns: Vec<(String, DataTypeMap)>,
+        row_count: f64,
+        filepaths: Option<Vec<String>>,
+    ) -> Self {
+        Self {
+            name: table_name,
+            columns,
+            primary_key: None,
+            foreign_keys: Vec::new(),
+            indexes: Vec::new(),
+            constraints: Vec::new(),
+            statistics: SqlStatistics::new(row_count),
+            filepaths,
+        }
+    }
+}
+
+#[pyclass(
+    from_py_object,
+    name = "SqlView",
+    module = "datafusion.common",
+    subclass
+)]
+#[derive(Debug, Clone)]
+pub struct SqlView {
+    #[pyo3(get, set)]
+    pub name: String,
+    #[pyo3(get, set)]
+    pub definition: String, // SQL code that defines the view
+}
+
+#[pymethods]
+impl SqlSchema {
+    #[new]
+    pub fn new(schema_name: &str) -> Self {
+        Self {
+            name: Arc::new(RwLock::new(schema_name.to_owned())),
+            tables: Arc::new(RwLock::new(Vec::new())),
+            views: Arc::new(RwLock::new(Vec::new())),
+            functions: Arc::new(RwLock::new(Vec::new())),
+        }
+    }
+
+    #[getter]
+    fn name(&self) -> PyResult<String> {
+        Ok(self.name.read().clone())
+    }
+
+    #[setter]
+    fn set_name(&self, value: String) -> PyResult<()> {
+        *self.name.write() = value;
+        Ok(())
+    }
+
+    #[getter]
+    fn tables(&self) -> PyResult<Vec<SqlTable>> {
+        Ok(self.tables.read().clone())
+    }
+
+    #[setter]
+    fn set_tables(&self, tables: Vec<SqlTable>) -> PyResult<()> {
+        *self.tables.write() = tables;
+        Ok(())
+    }
+
+    #[getter]
+    fn views(&self) -> PyResult<Vec<SqlView>> {
+        Ok(self.views.read().clone())
+    }
+
+    #[setter]
+    fn set_views(&self, views: Vec<SqlView>) -> PyResult<()> {
+        *self.views.write() = views;
+        Ok(())
+    }
+
+    #[getter]
+    fn functions(&self) -> PyResult<Vec<SqlFunction>> {
+        Ok(self.functions.read().clone())
+    }
+
+    #[setter]
+    fn set_functions(&self, functions: Vec<SqlFunction>) -> PyResult<()> {
+        *self.functions.write() = functions;
+        Ok(())
+    }
+
+    pub fn table_by_name(&self, table_name: &str) -> Option<SqlTable> {
+        let tables = self.tables.read();
+        tables.iter().find(|tbl| tbl.name.eq(table_name)).cloned()
+    }
+
+    pub fn add_table(&self, table: SqlTable) {
+        let mut tables = self.tables.write();
+        tables.push(table);
+    }
+
+    pub fn drop_table(&self, table_name: String) {
+        let mut tables = self.tables.write();
+        tables.retain(|x| !x.name.eq(&table_name));
+    }
+}
+
+/// SqlTable wrapper that is compatible with DataFusion logical query plans
+pub struct SqlTableSource {
+    schema: SchemaRef,
+    statistics: Option<SqlStatistics>,
+    filepaths: Option<Vec<String>>,
+}
+
+impl SqlTableSource {
+    /// Initialize a new `SqlTableSource` from a schema
+    pub fn new(
+        schema: SchemaRef,
+        statistics: Option<SqlStatistics>,
+        filepaths: Option<Vec<String>>,
+    ) -> Self {
+        Self {
+            schema,
+            statistics,
+            filepaths,
+        }
+    }
+
+    /// Access optional statistics associated with this table source
+    pub fn statistics(&self) -> Option<&SqlStatistics> {
+        self.statistics.as_ref()
+    }
+
+    /// Access optional filepaths associated with this table source
+    #[allow(dead_code)]
+    pub fn filepaths(&self) -> Option<&Vec<String>> {
+        self.filepaths.as_ref()
+    }
+}
+
+/// Implement TableSource, used in the logical query plan and in logical query optimizations
+impl TableSource for SqlTableSource {
+    fn as_any(&self) -> &dyn Any {
+        self
+    }
+
+    fn schema(&self) -> SchemaRef {
+        self.schema.clone()
+    }
+
+    fn table_type(&self) -> datafusion::logical_expr::TableType {
+        datafusion::logical_expr::TableType::Base
+    }
+
+    fn supports_filters_pushdown(
+        &self,
+        filters: &[&Expr],
+    ) -> datafusion::common::Result<Vec<TableProviderFilterPushDown>> {
+        filters
+            .iter()
+            .map(|f| {
+                let filters = split_conjunction(f);
+                if filters.iter().all(|f| is_supported_push_down_expr(f)) {
+                    // Push down filters to the table scan operation if all are supported
+                    Ok(TableProviderFilterPushDown::Exact)
+                } else if filters.iter().any(|f| is_supported_push_down_expr(f)) {
+                    // Partially apply the filter in the TableScan but retain
+                    // the Filter operator in the plan as well
+                    Ok(TableProviderFilterPushDown::Inexact)
+                } else {
+                    Ok(TableProviderFilterPushDown::Unsupported)
+                }
+            })
+            .collect()
+    }
+
+    fn get_logical_plan(&self) -> Option<Cow<'_, datafusion::logical_expr::LogicalPlan>> {
+        None
+    }
+}
+
+fn is_supported_push_down_expr(_expr: &Expr) -> bool {
+    // For now we support all kinds of expr's at this level
+    true
+}
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "SqlStatistics",
+    module = "datafusion.common",
+    subclass
+)]
+#[derive(Debug, Clone)]
+pub struct SqlStatistics {
+    row_count: f64,
+}
+
+#[pymethods]
+impl SqlStatistics {
+    #[new]
+    pub fn new(row_count: f64) -> Self {
+        Self { row_count }
+    }
+
+    #[pyo3(name = "getRowCount")]
+    pub fn get_row_count(&self) -> f64 {
+        self.row_count
+    }
+}
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "Constraints",
+    module = "datafusion.expr",
+    subclass
+)]
+#[derive(Clone)]
+pub struct PyConstraints {
+    pub constraints: Constraints,
+}
+
+impl From<PyConstraints> for Constraints {
+    fn from(constraints: PyConstraints) -> Self {
+        constraints.constraints
+    }
+}
+
+impl From<Constraints> for PyConstraints {
+    fn from(constraints: Constraints) -> Self {
+        PyConstraints { constraints }
+    }
+}
+
+impl Display for PyConstraints {
+    fn fmt(&self, f: &mut Formatter) -> fmt::Result {
+        write!(f, "Constraints: {:?}", self.constraints)
+    }
+}
+
+#[derive(Debug, Clone, PartialEq, Eq, Hash, PartialOrd, Ord)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    eq,
+    eq_int,
+    name = "TableType",
+    module = "datafusion.common"
+)]
+pub enum PyTableType {
+    Base,
+    View,
+    Temporary,
+}
+
+impl From<PyTableType> for datafusion::logical_expr::TableType {
+    fn from(table_type: PyTableType) -> Self {
+        match table_type {
+            PyTableType::Base => datafusion::logical_expr::TableType::Base,
+            PyTableType::View => datafusion::logical_expr::TableType::View,
+            PyTableType::Temporary => datafusion::logical_expr::TableType::Temporary,
+        }
+    }
+}
+
+impl From<TableType> for PyTableType {
+    fn from(table_type: TableType) -> Self {
+        match table_type {
+            datafusion::logical_expr::TableType::Base => PyTableType::Base,
+            datafusion::logical_expr::TableType::View => PyTableType::View,
+            datafusion::logical_expr::TableType::Temporary => PyTableType::Temporary,
+        }
+    }
+}
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "TableSource",
+    module = "datafusion.common",
+    subclass
+)]
+#[derive(Clone)]
+pub struct PyTableSource {
+    pub table_source: Arc<dyn TableSource>,
+}
+
+#[pymethods]
+impl PyTableSource {
+    pub fn schema(&self) -> PyArrowType<Schema> {
+        (*self.table_source.schema()).clone().into()
+    }
+
+    pub fn constraints(&self) -> Option<PyConstraints> {
+        self.table_source.constraints().map(|c| PyConstraints {
+            constraints: c.clone(),
+        })
+    }
+
+    pub fn table_type(&self) -> PyTableType {
+        self.table_source.table_type().into()
+    }
+
+    pub fn get_logical_plan(&self) -> Option<PyLogicalPlan> {
+        self.table_source
+            .get_logical_plan()
+            .map(|plan| PyLogicalPlan::new(plan.into_owned()))
+    }
+}
diff --git a/crates/core/src/config.rs b/crates/core/src/config.rs
new file mode 100644
index 000000000..fdb693a12
--- /dev/null
+++ b/crates/core/src/config.rs
@@ -0,0 +1,104 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
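The `PyConfig` class defined in `config.rs` below exposes DataFusion's `ConfigOptions` to Python as a string-keyed store: values are set as strings, missing keys read back as `None`, and `get_all` snapshots every entry. As a rough illustration of that get/set pattern, here is a pure-Python sketch (the class name, the sample key, and the use of a plain lock are illustrative, not part of the binding):

```python
from threading import Lock


class ConfigSketch:
    """Minimal stand-in for the string-keyed config store that PyConfig wraps."""

    def __init__(self):
        self._lock = Lock()
        # One sample entry with an assumed default, for illustration only.
        self._entries = {"datafusion.execution.batch_size": "8192"}

    def get(self, key):
        # Missing keys yield None rather than raising, mirroring Config.get.
        with self._lock:
            return self._entries.get(key)

    def set(self, key, value):
        # Values are coerced to strings, as ConfigOptions stores them.
        with self._lock:
            self._entries[key] = str(value)

    def get_all(self):
        # Snapshot of every entry, mirroring Config.get_all.
        with self._lock:
            return dict(self._entries)
```

Note that, as in the Rust wrapper, reads and writes go through a lock so the object can be shared safely; the real binding uses `parking_lot::RwLock` behind an `Arc`.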
+
+use std::sync::Arc;
+
+use datafusion::config::ConfigOptions;
+use parking_lot::RwLock;
+use pyo3::prelude::*;
+use pyo3::types::*;
+
+use crate::common::data_type::PyScalarValue;
+use crate::errors::PyDataFusionResult;
+
+#[pyclass(
+    from_py_object,
+    name = "Config",
+    module = "datafusion",
+    subclass,
+    frozen
+)]
+#[derive(Clone)]
+pub(crate) struct PyConfig {
+    config: Arc<RwLock<ConfigOptions>>,
+}
+
+#[pymethods]
+impl PyConfig {
+    #[new]
+    fn py_new() -> Self {
+        Self {
+            config: Arc::new(RwLock::new(ConfigOptions::new())),
+        }
+    }
+
+    /// Get configurations from environment variables
+    #[staticmethod]
+    pub fn from_env() -> PyDataFusionResult<Self> {
+        Ok(Self {
+            config: Arc::new(RwLock::new(ConfigOptions::from_env()?)),
+        })
+    }
+
+    /// Get a configuration option
+    pub fn get<'py>(&self, key: &str, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        let value: Option<Option<String>> = {
+            let options = self.config.read();
+            options
+                .entries()
+                .into_iter()
+                .find_map(|entry| (entry.key == key).then_some(entry.value.clone()))
+        };
+
+        match value {
+            Some(value) => Ok(value.into_pyobject(py)?),
+            None => Ok(None::<String>.into_pyobject(py)?),
+        }
+    }
+
+    /// Set a configuration option
+    pub fn set(&self, key: &str, value: Py<PyAny>, py: Python) -> PyDataFusionResult<()> {
+        let scalar_value: PyScalarValue = value.extract(py)?;
+        let mut options = self.config.write();
+        options.set(key, scalar_value.0.to_string().as_str())?;
+        Ok(())
+    }
+
+    /// Get all configuration options
+    pub fn get_all(&self, py: Python) -> PyResult<Py<PyDict>> {
+        let entries: Vec<(String, Option<String>)> = {
+            let options = self.config.read();
+            options
+                .entries()
+                .into_iter()
+                .map(|entry| (entry.key.clone(), entry.value.clone()))
+                .collect()
+        };
+
+        let dict = PyDict::new(py);
+        for (key, value) in entries {
+            dict.set_item(key, value.into_pyobject(py)?)?;
+        }
+        Ok(dict.into())
+    }
+
+    fn __repr__(&self, py: Python) -> PyResult<String> {
+        match self.get_all(py) {
+            Ok(result) => Ok(format!("Config({result})")),
+            Err(err) => Ok(format!("Error: {:?}", err.to_string())),
+        }
+    }
+}
diff --git a/crates/core/src/context.rs b/crates/core/src/context.rs
new file mode 100644
index 000000000..e46d359d6
--- /dev/null
+++ b/crates/core/src/context.rs
@@ -0,0 +1,1453 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+ +use std::collections::{HashMap, HashSet}; +use std::path::PathBuf; +use std::ptr::NonNull; +use std::str::FromStr; +use std::sync::Arc; + +use arrow::array::RecordBatchReader; +use arrow::ffi_stream::ArrowArrayStreamReader; +use arrow::pyarrow::FromPyArrow; +use datafusion::arrow::datatypes::{DataType, Schema, SchemaRef}; +use datafusion::arrow::pyarrow::PyArrowType; +use datafusion::arrow::record_batch::RecordBatch; +use datafusion::catalog::{CatalogProvider, CatalogProviderList, TableProviderFactory}; +use datafusion::common::{DFSchema, ScalarValue, TableReference, exec_err}; +use datafusion::datasource::file_format::file_compression_type::FileCompressionType; +use datafusion::datasource::file_format::parquet::ParquetFormat; +use datafusion::datasource::listing::{ + ListingOptions, ListingTable, ListingTableConfig, ListingTableUrl, +}; +use datafusion::datasource::{MemTable, TableProvider}; +use datafusion::execution::TaskContextProvider; +use datafusion::execution::context::{ + DataFilePaths, SQLOptions, SessionConfig, SessionContext, TaskContext, +}; +use datafusion::execution::disk_manager::DiskManagerMode; +use datafusion::execution::memory_pool::{FairSpillPool, GreedyMemoryPool, UnboundedMemoryPool}; +use datafusion::execution::options::{ArrowReadOptions, ReadOptions}; +use datafusion::execution::runtime_env::RuntimeEnvBuilder; +use datafusion::execution::session_state::SessionStateBuilder; +use datafusion::prelude::{ + AvroReadOptions, CsvReadOptions, DataFrame, JsonReadOptions, ParquetReadOptions, +}; +use datafusion_ffi::catalog_provider::FFI_CatalogProvider; +use datafusion_ffi::catalog_provider_list::FFI_CatalogProviderList; +use datafusion_ffi::config::extension_options::FFI_ExtensionOptions; +use datafusion_ffi::execution::FFI_TaskContextProvider; +use datafusion_ffi::proto::logical_extension_codec::FFI_LogicalExtensionCodec; +use datafusion_ffi::table_provider_factory::FFI_TableProviderFactory; +use 
datafusion_proto::logical_plan::DefaultLogicalExtensionCodec; +use datafusion_python_util::{ + create_logical_extension_capsule, ffi_logical_codec_from_pycapsule, get_global_ctx, + get_tokio_runtime, spawn_future, wait_for_future, +}; +use object_store::ObjectStore; +use pyo3::IntoPyObjectExt; +use pyo3::exceptions::{PyKeyError, PyRuntimeError, PyValueError}; +use pyo3::prelude::*; +use pyo3::types::{PyCapsule, PyDict, PyList, PyTuple}; +use url::Url; +use uuid::Uuid; + +use crate::catalog::{ + PyCatalog, PyCatalogList, RustWrappedPyCatalogProvider, RustWrappedPyCatalogProviderList, +}; +use crate::common::data_type::PyScalarValue; +use crate::common::df_schema::PyDFSchema; +use crate::dataframe::PyDataFrame; +use crate::dataset::Dataset; +use crate::errors::{ + PyDataFusionError, PyDataFusionResult, from_datafusion_error, py_datafusion_err, +}; +use crate::expr::PyExpr; +use crate::expr::sort_expr::PySortExpr; +use crate::options::PyCsvReadOptions; +use crate::physical_plan::PyExecutionPlan; +use crate::record_batch::PyRecordBatchStream; +use crate::sql::logical::PyLogicalPlan; +use crate::sql::util::replace_placeholders_with_strings; +use crate::store::StorageContexts; +use crate::table::{PyTable, RustWrappedPyTableProviderFactory}; +use crate::udaf::PyAggregateUDF; +use crate::udf::PyScalarUDF; +use crate::udtf::PyTableFunction; +use crate::udwf::PyWindowUDF; + +/// Configuration options for a SessionContext +#[pyclass( + from_py_object, + frozen, + name = "SessionConfig", + module = "datafusion", + subclass +)] +#[derive(Clone, Default)] +pub struct PySessionConfig { + pub config: SessionConfig, +} + +impl From for PySessionConfig { + fn from(config: SessionConfig) -> Self { + Self { config } + } +} + +#[pymethods] +impl PySessionConfig { + #[pyo3(signature = (config_options=None))] + #[new] + fn new(config_options: Option>) -> Self { + let mut config = SessionConfig::new(); + if let Some(hash_map) = config_options { + for (k, v) in &hash_map { + config = 
config.set(k, &ScalarValue::Utf8(Some(v.clone()))); + } + } + + Self { config } + } + + fn with_create_default_catalog_and_schema(&self, enabled: bool) -> Self { + Self::from( + self.config + .clone() + .with_create_default_catalog_and_schema(enabled), + ) + } + + fn with_default_catalog_and_schema(&self, catalog: &str, schema: &str) -> Self { + Self::from( + self.config + .clone() + .with_default_catalog_and_schema(catalog, schema), + ) + } + + fn with_information_schema(&self, enabled: bool) -> Self { + Self::from(self.config.clone().with_information_schema(enabled)) + } + + fn with_batch_size(&self, batch_size: usize) -> Self { + Self::from(self.config.clone().with_batch_size(batch_size)) + } + + fn with_target_partitions(&self, target_partitions: usize) -> Self { + Self::from( + self.config + .clone() + .with_target_partitions(target_partitions), + ) + } + + fn with_repartition_aggregations(&self, enabled: bool) -> Self { + Self::from(self.config.clone().with_repartition_aggregations(enabled)) + } + + fn with_repartition_joins(&self, enabled: bool) -> Self { + Self::from(self.config.clone().with_repartition_joins(enabled)) + } + + fn with_repartition_windows(&self, enabled: bool) -> Self { + Self::from(self.config.clone().with_repartition_windows(enabled)) + } + + fn with_repartition_sorts(&self, enabled: bool) -> Self { + Self::from(self.config.clone().with_repartition_sorts(enabled)) + } + + fn with_repartition_file_scans(&self, enabled: bool) -> Self { + Self::from(self.config.clone().with_repartition_file_scans(enabled)) + } + + fn with_repartition_file_min_size(&self, size: usize) -> Self { + Self::from(self.config.clone().with_repartition_file_min_size(size)) + } + + fn with_parquet_pruning(&self, enabled: bool) -> Self { + Self::from(self.config.clone().with_parquet_pruning(enabled)) + } + + fn set(&self, key: &str, value: &str) -> Self { + Self::from(self.config.clone().set_str(key, value)) + } + + pub fn with_extension(&self, extension: Bound) -> 
PyResult<Self> {
+        if !extension.hasattr("__datafusion_extension_options__")? {
+            return Err(pyo3::exceptions::PyAttributeError::new_err(
+                "Expected extension object to define __datafusion_extension_options__()",
+            ));
+        }
+        let capsule = extension.call_method0("__datafusion_extension_options__")?;
+        let capsule = capsule.cast::<PyCapsule>()?;
+
+        let extension: NonNull<FFI_ExtensionOptions> = capsule
+            .pointer_checked(Some(c"datafusion_extension_options"))?
+            .cast();
+        let mut extension = unsafe { extension.as_ref() }.clone();
+
+        let mut config = self.config.clone();
+        let options = config.options_mut();
+        if let Some(prior_extension) = options.extensions.get::<FFI_ExtensionOptions>() {
+            extension
+                .merge(prior_extension)
+                .map_err(py_datafusion_err)?;
+        }
+
+        options.extensions.insert(extension);
+
+        Ok(Self::from(config))
+    }
+}
+
+/// Runtime options for a SessionContext
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "RuntimeEnvBuilder",
+    module = "datafusion",
+    subclass
+)]
+#[derive(Clone)]
+pub struct PyRuntimeEnvBuilder {
+    pub builder: RuntimeEnvBuilder,
+}
+
+#[pymethods]
+impl PyRuntimeEnvBuilder {
+    #[new]
+    fn new() -> Self {
+        Self {
+            builder: RuntimeEnvBuilder::default(),
+        }
+    }
+
+    fn with_disk_manager_disabled(&self) -> Self {
+        let mut runtime_builder = self.builder.clone();
+
+        let mut disk_mgr_builder = runtime_builder
+            .disk_manager_builder
+            .clone()
+            .unwrap_or_default();
+        disk_mgr_builder.set_mode(DiskManagerMode::Disabled);
+
+        runtime_builder = runtime_builder.with_disk_manager_builder(disk_mgr_builder);
+        Self {
+            builder: runtime_builder,
+        }
+    }
+
+    fn with_disk_manager_os(&self) -> Self {
+        let mut runtime_builder = self.builder.clone();
+
+        let mut disk_mgr_builder = runtime_builder
+            .disk_manager_builder
+            .clone()
+            .unwrap_or_default();
+        disk_mgr_builder.set_mode(DiskManagerMode::OsTmpDirectory);
+
+        runtime_builder = runtime_builder.with_disk_manager_builder(disk_mgr_builder);
+        Self {
+            builder: runtime_builder,
+        }
+    }
+
+    fn with_disk_manager_specified(&self, paths:
Vec<String>) -> Self {
+        let paths = paths.iter().map(|s| s.into()).collect();
+        let mut runtime_builder = self.builder.clone();
+
+        let mut disk_mgr_builder = runtime_builder
+            .disk_manager_builder
+            .clone()
+            .unwrap_or_default();
+        disk_mgr_builder.set_mode(DiskManagerMode::Directories(paths));
+
+        runtime_builder = runtime_builder.with_disk_manager_builder(disk_mgr_builder);
+        Self {
+            builder: runtime_builder,
+        }
+    }
+
+    fn with_unbounded_memory_pool(&self) -> Self {
+        let builder = self.builder.clone();
+        let builder = builder.with_memory_pool(Arc::new(UnboundedMemoryPool::default()));
+        Self { builder }
+    }
+
+    fn with_fair_spill_pool(&self, size: usize) -> Self {
+        let builder = self.builder.clone();
+        let builder = builder.with_memory_pool(Arc::new(FairSpillPool::new(size)));
+        Self { builder }
+    }
+
+    fn with_greedy_memory_pool(&self, size: usize) -> Self {
+        let builder = self.builder.clone();
+        let builder = builder.with_memory_pool(Arc::new(GreedyMemoryPool::new(size)));
+        Self { builder }
+    }
+
+    fn with_temp_file_path(&self, path: &str) -> Self {
+        let builder = self.builder.clone();
+        let builder = builder.with_temp_file_path(path);
+        Self { builder }
+    }
+}
+
+/// `PySQLOptions` allows you to specify options for SQL execution.
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "SQLOptions",
+    module = "datafusion",
+    subclass
+)]
+#[derive(Clone)]
+pub struct PySQLOptions {
+    pub options: SQLOptions,
+}
+
+impl From<SQLOptions> for PySQLOptions {
+    fn from(options: SQLOptions) -> Self {
+        Self { options }
+    }
+}
+
+#[pymethods]
+impl PySQLOptions {
+    #[new]
+    fn new() -> Self {
+        let options = SQLOptions::new();
+        Self { options }
+    }
+
+    /// Should DDL (data definition) commands such as `CREATE TABLE` be run? Defaults to `true`.
+    fn with_allow_ddl(&self, allow: bool) -> Self {
+        Self::from(self.options.with_allow_ddl(allow))
+    }
+
+    /// Should DML (data modification) commands such as `INSERT` and `COPY` be run?
+    /// Defaults to `true`.
+    pub fn with_allow_dml(&self, allow: bool) -> Self {
+        Self::from(self.options.with_allow_dml(allow))
+    }
+
+    /// Should statements such as `SET VARIABLE` and `BEGIN TRANSACTION` be run? Defaults to `true`.
+    pub fn with_allow_statements(&self, allow: bool) -> Self {
+        Self::from(self.options.with_allow_statements(allow))
+    }
+}
+
+/// `PySessionContext` is able to plan and execute DataFusion plans.
+/// It has a powerful optimizer, a physical planner for local execution, and a
+/// multi-threaded execution engine to perform the execution.
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "SessionContext",
+    module = "datafusion",
+    subclass
+)]
+#[derive(Clone)]
+pub struct PySessionContext {
+    pub ctx: Arc<SessionContext>,
+    logical_codec: Arc<FFI_LogicalExtensionCodec>,
+}
+
+#[pymethods]
+impl PySessionContext {
+    #[pyo3(signature = (config=None, runtime=None))]
+    #[new]
+    pub fn new(
+        config: Option<PySessionConfig>,
+        runtime: Option<PyRuntimeEnvBuilder>,
+    ) -> PyDataFusionResult<Self> {
+        let config = if let Some(c) = config {
+            c.config
+        } else {
+            SessionConfig::default().with_information_schema(true)
+        };
+        let runtime_env_builder = if let Some(c) = runtime {
+            c.builder
+        } else {
+            RuntimeEnvBuilder::default()
+        };
+        let runtime = Arc::new(runtime_env_builder.build()?);
+        let session_state = SessionStateBuilder::new()
+            .with_config(config)
+            .with_runtime_env(runtime)
+            .with_default_features()
+            .build();
+        let ctx = Arc::new(SessionContext::new_with_state(session_state));
+        let logical_codec = Self::default_logical_codec(&ctx);
+        Ok(PySessionContext { ctx, logical_codec })
+    }
+
+    pub fn enable_url_table(&self) -> PyResult<Self> {
+        Ok(PySessionContext {
+            ctx: Arc::new(self.ctx.as_ref().clone().enable_url_table()),
+            logical_codec: Arc::clone(&self.logical_codec),
+        })
+    }
+
+    #[staticmethod]
+    #[pyo3(signature = ())]
+    pub fn global_ctx() -> PyResult<Self> {
+        let ctx = get_global_ctx().clone();
+        let logical_codec = Self::default_logical_codec(&ctx);
+        Ok(Self { ctx, logical_codec })
+    }
+
+    /// Register an
object store with the given name
+    #[pyo3(signature = (scheme, store, host=None))]
+    pub fn register_object_store(
+        &self,
+        scheme: &str,
+        store: StorageContexts,
+        host: Option<&str>,
+    ) -> PyResult<()> {
+        // for most stores the "host" is the bucket name and can be inferred from the store
+        let (store, upstream_host): (Arc<dyn ObjectStore>, String) = match store {
+            StorageContexts::AmazonS3(s3) => (s3.inner, s3.bucket_name),
+            StorageContexts::GoogleCloudStorage(gcs) => (gcs.inner, gcs.bucket_name),
+            StorageContexts::MicrosoftAzure(azure) => (azure.inner, azure.container_name),
+            StorageContexts::LocalFileSystem(local) => (local.inner, "".to_string()),
+            StorageContexts::HTTP(http) => (http.store, http.url),
+        };
+
+        // let users override the host to match the api signature from upstream
+        let derived_host = if let Some(host) = host {
+            host
+        } else {
+            &upstream_host
+        };
+        let url_string = format!("{scheme}{derived_host}");
+        let url = Url::parse(&url_string).map_err(|e| PyValueError::new_err(e.to_string()))?;
+        self.ctx.runtime_env().register_object_store(&url, store);
+        Ok(())
+    }
+
+    /// Deregister an object store with the given url
+    #[pyo3(signature = (scheme, host=None))]
+    pub fn deregister_object_store(
+        &self,
+        scheme: &str,
+        host: Option<&str>,
+    ) -> PyDataFusionResult<()> {
+        let host = host.unwrap_or("");
+        let url_string = format!("{scheme}{host}");
+        let url = Url::parse(&url_string).map_err(|e| PyDataFusionError::Common(e.to_string()))?;
+        self.ctx.runtime_env().deregister_object_store(&url)?;
+        Ok(())
+    }
+
+    #[allow(clippy::too_many_arguments)]
+    #[pyo3(signature = (name, path, table_partition_cols=vec![],
+        file_extension=".parquet",
+        schema=None,
+        file_sort_order=None))]
+    pub fn register_listing_table(
+        &self,
+        name: &str,
+        path: &str,
+        table_partition_cols: Vec<(String, PyArrowType<DataType>)>,
+        file_extension: &str,
+        schema: Option<PyArrowType<Schema>>,
+        file_sort_order: Option<Vec<Vec<PySortExpr>>>,
+        py: Python,
+    ) -> PyDataFusionResult<()> {
+        let options =
ListingOptions::new(Arc::new(ParquetFormat::new()))
+            .with_file_extension(file_extension)
+            .with_table_partition_cols(
+                table_partition_cols
+                    .into_iter()
+                    .map(|(name, ty)| (name, ty.0))
+                    .collect::<Vec<_>>(),
+            )
+            .with_file_sort_order(
+                file_sort_order
+                    .unwrap_or_default()
+                    .into_iter()
+                    .map(|e| e.into_iter().map(|f| f.into()).collect())
+                    .collect(),
+            );
+        let table_path = ListingTableUrl::parse(path)?;
+        let resolved_schema: SchemaRef = match schema {
+            Some(s) => Arc::new(s.0),
+            None => {
+                let state = self.ctx.state();
+                let schema = options.infer_schema(&state, &table_path);
+                wait_for_future(py, schema)??
+            }
+        };
+        let config = ListingTableConfig::new(table_path)
+            .with_listing_options(options)
+            .with_schema(resolved_schema);
+        let table = ListingTable::try_new(config)?;
+        self.ctx.register_table(name, Arc::new(table))?;
+        Ok(())
+    }
+
+    pub fn register_udtf(&self, func: PyTableFunction) {
+        let name = func.name.clone();
+        let func = Arc::new(func);
+        self.ctx.register_udtf(&name, func);
+    }
+
+    pub fn deregister_udtf(&self, name: &str) {
+        self.ctx.deregister_udtf(name);
+    }
+
+    #[pyo3(signature = (query, options=None, param_values=HashMap::default(), param_strings=HashMap::default()))]
+    pub fn sql_with_options(
+        &self,
+        py: Python,
+        mut query: String,
+        options: Option<PySQLOptions>,
+        param_values: HashMap<String, PyScalarValue>,
+        param_strings: HashMap<String, String>,
+    ) -> PyDataFusionResult<PyDataFrame> {
+        let options = if let Some(options) = options {
+            options.options
+        } else {
+            SQLOptions::new()
+        };
+
+        let param_values = param_values
+            .into_iter()
+            .map(|(name, value)| (name, ScalarValue::from(value)))
+            .collect::<Vec<_>>();
+
+        let state = self.ctx.state();
+        let dialect = state.config().options().sql_parser.dialect.as_ref();
+
+        if !param_strings.is_empty() {
+            query = replace_placeholders_with_strings(&query, dialect, param_strings)?;
+        }
+
+        let mut df = wait_for_future(py, async {
+            self.ctx.sql_with_options(&query, options).await
+        })?
+ .map_err(from_datafusion_error)?; + + if !param_values.is_empty() { + df = df.with_param_values(param_values)?; + } + + Ok(PyDataFrame::new(df)) + } + + #[pyo3(signature = (partitions, name=None, schema=None))] + pub fn create_dataframe( + &self, + partitions: PyArrowType>>, + name: Option<&str>, + schema: Option>, + py: Python, + ) -> PyDataFusionResult { + let schema = if let Some(schema) = schema { + SchemaRef::from(schema.0) + } else { + partitions.0[0][0].schema() + }; + + let table = MemTable::try_new(schema, partitions.0)?; + + // generate a random (unique) name for this table if none is provided + // table name cannot start with numeric digit + let table_name = match name { + Some(val) => val.to_owned(), + None => { + "c".to_owned() + + Uuid::new_v4() + .simple() + .encode_lower(&mut Uuid::encode_buffer()) + } + }; + + self.ctx.register_table(&*table_name, Arc::new(table))?; + + let table = wait_for_future(py, self._table(&table_name))??; + + let df = PyDataFrame::new(table); + Ok(df) + } + + /// Create a DataFrame from an existing logical plan + pub fn create_dataframe_from_logical_plan(&self, plan: PyLogicalPlan) -> PyDataFrame { + PyDataFrame::new(DataFrame::new(self.ctx.state(), plan.plan.as_ref().clone())) + } + + /// Construct datafusion dataframe from Python list + #[pyo3(signature = (data, name=None))] + pub fn from_pylist( + &self, + data: Bound<'_, PyList>, + name: Option<&str>, + ) -> PyResult { + // Acquire GIL Token + let py = data.py(); + + // Instantiate pyarrow Table object & convert to Arrow Table + let table_class = py.import("pyarrow")?.getattr("Table")?; + let args = PyTuple::new(py, &[data])?; + let table = table_class.call_method1("from_pylist", args)?; + + // Convert Arrow Table to datafusion DataFrame + let df = self.from_arrow(table, name, py)?; + Ok(df) + } + + /// Construct datafusion dataframe from Python dictionary + #[pyo3(signature = (data, name=None))] + pub fn from_pydict( + &self, + data: Bound<'_, PyDict>, + name: 
Option<&str>, + ) -> PyResult { + // Acquire GIL Token + let py = data.py(); + + // Instantiate pyarrow Table object & convert to Arrow Table + let table_class = py.import("pyarrow")?.getattr("Table")?; + let args = PyTuple::new(py, &[data])?; + let table = table_class.call_method1("from_pydict", args)?; + + // Convert Arrow Table to datafusion DataFrame + let df = self.from_arrow(table, name, py)?; + Ok(df) + } + + /// Construct datafusion dataframe from Arrow Table + #[pyo3(signature = (data, name=None))] + pub fn from_arrow( + &self, + data: Bound<'_, PyAny>, + name: Option<&str>, + py: Python, + ) -> PyDataFusionResult { + let (schema, batches) = + if let Ok(stream_reader) = ArrowArrayStreamReader::from_pyarrow_bound(&data) { + // Works for any object that implements __arrow_c_stream__ in pycapsule. + + let schema = stream_reader.schema().as_ref().to_owned(); + let batches = stream_reader + .collect::, arrow::error::ArrowError>>()?; + + (schema, batches) + } else if let Ok(array) = RecordBatch::from_pyarrow_bound(&data) { + // While this says RecordBatch, it will work for any object that implements + // __arrow_c_array__ and returns a StructArray. 
+ + (array.schema().as_ref().to_owned(), vec![array]) + } else { + return Err(PyDataFusionError::Common( + "Expected either a Arrow Array or Arrow Stream in from_arrow().".to_string(), + )); + }; + + // Because create_dataframe() expects a vector of vectors of record batches + // here we need to wrap the vector of record batches in an additional vector + let list_of_batches = PyArrowType::from(vec![batches]); + self.create_dataframe(list_of_batches, name, Some(schema.into()), py) + } + + /// Construct datafusion dataframe from pandas + #[allow(clippy::wrong_self_convention)] + #[pyo3(signature = (data, name=None))] + pub fn from_pandas(&self, data: Bound<'_, PyAny>, name: Option<&str>) -> PyResult { + // Obtain GIL token + let py = data.py(); + + // Instantiate pyarrow Table object & convert to Arrow Table + let table_class = py.import("pyarrow")?.getattr("Table")?; + let args = PyTuple::new(py, &[data])?; + let table = table_class.call_method1("from_pandas", args)?; + + // Convert Arrow Table to datafusion DataFrame + let df = self.from_arrow(table, name, py)?; + Ok(df) + } + + /// Construct datafusion dataframe from polars + #[pyo3(signature = (data, name=None))] + pub fn from_polars(&self, data: Bound<'_, PyAny>, name: Option<&str>) -> PyResult { + // Convert Polars dataframe to Arrow Table + let table = data.call_method0("to_arrow")?; + + // Convert Arrow Table to datafusion DataFrame + let df = self.from_arrow(table, name, data.py())?; + Ok(df) + } + + pub fn register_table(&self, name: &str, table: Bound<'_, PyAny>) -> PyDataFusionResult<()> { + let session = self.clone().into_bound_py_any(table.py())?; + let table = PyTable::new(table, Some(session))?; + + self.ctx.register_table(name, table.table)?; + Ok(()) + } + + pub fn deregister_table(&self, name: &str) -> PyDataFusionResult<()> { + self.ctx.deregister_table(name)?; + Ok(()) + } + + pub fn register_table_factory( + &self, + format: &str, + mut factory: Bound<'_, PyAny>, + ) -> PyDataFusionResult<()> { 
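+        // Orientation comment on the resolution order below (behavior
+        // unchanged): (1) if the object implements
+        // `__datafusion_table_provider_factory__`, call it with a capsule
+        // wrapping our logical extension codec to obtain an FFI capsule;
+        // (2) if we now hold a PyCapsule named
+        // `datafusion_table_provider_factory`, use that FFI factory directly;
+        // (3) otherwise treat it as a plain Python object and wrap it in
+        // `RustWrappedPyTableProviderFactory`.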
+ if factory.hasattr("__datafusion_table_provider_factory__")? { + let py = factory.py(); + let codec_capsule = create_logical_extension_capsule(py, self.logical_codec.as_ref())?; + factory = factory + .getattr("__datafusion_table_provider_factory__")? + .call1((codec_capsule,))?; + } + + let factory: Arc = + if let Ok(capsule) = factory.cast::().map_err(py_datafusion_err) { + let data: NonNull = capsule + .pointer_checked(Some(c"datafusion_table_provider_factory"))? + .cast(); + let factory = unsafe { data.as_ref() }; + factory.into() + } else { + Arc::new(RustWrappedPyTableProviderFactory::new( + factory.into(), + self.logical_codec.clone(), + )) + }; + + let st = self.ctx.state_ref(); + let mut lock = st.write(); + lock.table_factories_mut() + .insert(format.to_owned(), factory); + + Ok(()) + } + + pub fn register_catalog_provider_list( + &self, + mut provider: Bound, + ) -> PyDataFusionResult<()> { + if provider.hasattr("__datafusion_catalog_provider_list__")? { + let py = provider.py(); + let codec_capsule = create_logical_extension_capsule(py, self.logical_codec.as_ref())?; + provider = provider + .getattr("__datafusion_catalog_provider_list__")? + .call1((codec_capsule,))?; + } + + let provider = if let Ok(capsule) = provider.cast::() { + let data: NonNull = capsule + .pointer_checked(Some(c"datafusion_catalog_provider_list"))? + .cast(); + let provider = unsafe { data.as_ref() }; + let provider: Arc = provider.into(); + provider as Arc + } else { + match provider.extract::() { + Ok(py_catalog_list) => py_catalog_list.catalog_list, + Err(_) => Arc::new(RustWrappedPyCatalogProviderList::new( + provider.into(), + Arc::clone(&self.logical_codec), + )) as Arc, + } + }; + + self.ctx.register_catalog_list(provider); + + Ok(()) + } + + pub fn register_catalog_provider( + &self, + name: &str, + mut provider: Bound<'_, PyAny>, + ) -> PyDataFusionResult<()> { + if provider.hasattr("__datafusion_catalog_provider__")? 
{ + let py = provider.py(); + let codec_capsule = create_logical_extension_capsule(py, self.logical_codec.as_ref())?; + provider = provider + .getattr("__datafusion_catalog_provider__")? + .call1((codec_capsule,))?; + } + + let provider = if let Ok(capsule) = provider.cast::() { + let data: NonNull = capsule + .pointer_checked(Some(c"datafusion_catalog_provider"))? + .cast(); + let provider = unsafe { data.as_ref() }; + let provider: Arc = provider.into(); + provider as Arc + } else { + match provider.extract::() { + Ok(py_catalog) => py_catalog.catalog, + Err(_) => Arc::new(RustWrappedPyCatalogProvider::new( + provider.into(), + Arc::clone(&self.logical_codec), + )) as Arc, + } + }; + + let _ = self.ctx.register_catalog(name, provider); + + Ok(()) + } + + /// Construct datafusion dataframe from Arrow Table + pub fn register_table_provider( + &self, + name: &str, + provider: Bound<'_, PyAny>, + ) -> PyDataFusionResult<()> { + // Deprecated: use `register_table` instead + self.register_table(name, provider) + } + + pub fn register_record_batches( + &self, + name: &str, + partitions: PyArrowType>>, + ) -> PyDataFusionResult<()> { + let schema = partitions.0[0][0].schema(); + let table = MemTable::try_new(schema, partitions.0)?; + self.ctx.register_table(name, Arc::new(table))?; + Ok(()) + } + + #[allow(clippy::too_many_arguments)] + #[pyo3(signature = (name, path, table_partition_cols=vec![], + parquet_pruning=true, + file_extension=".parquet", + skip_metadata=true, + schema=None, + file_sort_order=None))] + pub fn register_parquet( + &self, + name: &str, + path: &str, + table_partition_cols: Vec<(String, PyArrowType)>, + parquet_pruning: bool, + file_extension: &str, + skip_metadata: bool, + schema: Option>, + file_sort_order: Option>>, + py: Python, + ) -> PyDataFusionResult<()> { + let mut options = ParquetReadOptions::default() + .table_partition_cols( + table_partition_cols + .into_iter() + .map(|(name, ty)| (name, ty.0)) + .collect::>(), + ) + 
.parquet_pruning(parquet_pruning) + .skip_metadata(skip_metadata); + options.file_extension = file_extension; + options.schema = schema.as_ref().map(|x| &x.0); + options.file_sort_order = file_sort_order + .unwrap_or_default() + .into_iter() + .map(|e| e.into_iter().map(|f| f.into()).collect()) + .collect(); + + let result = self.ctx.register_parquet(name, path, options); + wait_for_future(py, result)??; + Ok(()) + } + + #[pyo3(signature = (name, + path, + options=None))] + pub fn register_csv( + &self, + name: &str, + path: &Bound<'_, PyAny>, + options: Option<&PyCsvReadOptions>, + py: Python, + ) -> PyDataFusionResult<()> { + let options = options + .map(|opts| opts.try_into()) + .transpose()? + .unwrap_or_default(); + + if path.is_instance_of::() { + let paths = path.extract::>()?; + let result = self.register_csv_from_multiple_paths(name, paths, options); + wait_for_future(py, result)??; + } else { + let path = path.extract::()?; + let result = self.ctx.register_csv(name, &path, options); + wait_for_future(py, result)??; + } + + Ok(()) + } + + #[allow(clippy::too_many_arguments)] + #[pyo3(signature = (name, + path, + schema=None, + schema_infer_max_records=1000, + file_extension=".json", + table_partition_cols=vec![], + file_compression_type=None))] + pub fn register_json( + &self, + name: &str, + path: PathBuf, + schema: Option>, + schema_infer_max_records: usize, + file_extension: &str, + table_partition_cols: Vec<(String, PyArrowType)>, + file_compression_type: Option, + py: Python, + ) -> PyDataFusionResult<()> { + let path = path + .to_str() + .ok_or_else(|| PyValueError::new_err("Unable to convert path to a string"))?; + + let mut options = JsonReadOptions::default() + .file_compression_type(parse_file_compression_type(file_compression_type)?) 
+ .table_partition_cols( + table_partition_cols + .into_iter() + .map(|(name, ty)| (name, ty.0)) + .collect::>(), + ); + options.schema_infer_max_records = schema_infer_max_records; + options.file_extension = file_extension; + options.schema = schema.as_ref().map(|x| &x.0); + + let result = self.ctx.register_json(name, path, options); + wait_for_future(py, result)??; + + Ok(()) + } + + #[allow(clippy::too_many_arguments)] + #[pyo3(signature = (name, + path, + schema=None, + file_extension=".avro", + table_partition_cols=vec![]))] + pub fn register_avro( + &self, + name: &str, + path: PathBuf, + schema: Option>, + file_extension: &str, + table_partition_cols: Vec<(String, PyArrowType)>, + py: Python, + ) -> PyDataFusionResult<()> { + let path = path + .to_str() + .ok_or_else(|| PyValueError::new_err("Unable to convert path to a string"))?; + + let mut options = AvroReadOptions::default().table_partition_cols( + table_partition_cols + .into_iter() + .map(|(name, ty)| (name, ty.0)) + .collect::>(), + ); + options.file_extension = file_extension; + options.schema = schema.as_ref().map(|x| &x.0); + + let result = self.ctx.register_avro(name, path, options); + wait_for_future(py, result)??; + + Ok(()) + } + + #[pyo3(signature = (name, path, schema=None, file_extension=".arrow", table_partition_cols=vec![]))] + pub fn register_arrow( + &self, + name: &str, + path: &str, + schema: Option>, + file_extension: &str, + table_partition_cols: Vec<(String, PyArrowType)>, + py: Python, + ) -> PyDataFusionResult<()> { + let mut options = ArrowReadOptions::default().table_partition_cols( + table_partition_cols + .into_iter() + .map(|(name, ty)| (name, ty.0)) + .collect::>(), + ); + options.file_extension = file_extension; + options.schema = schema.as_ref().map(|x| &x.0); + + let result = self.ctx.register_arrow(name, path, options); + wait_for_future(py, result)??; + Ok(()) + } + + pub fn register_batch( + &self, + name: &str, + batch: PyArrowType, + ) -> PyDataFusionResult<()> { + 
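+        // `batch.0` is the RecordBatch unwrapped from its PyArrowType; the
+        // upstream `SessionContext::register_batch` exposes it under `name`
+        // as an in-memory table.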
self.ctx.register_batch(name, batch.0)?; + Ok(()) + } + + // Registers a PyArrow.Dataset + pub fn register_dataset( + &self, + name: &str, + dataset: &Bound<'_, PyAny>, + py: Python, + ) -> PyDataFusionResult<()> { + let table: Arc = Arc::new(Dataset::new(dataset, py)?); + + self.ctx.register_table(name, table)?; + + Ok(()) + } + + pub fn register_udf(&self, udf: PyScalarUDF) -> PyResult<()> { + self.ctx.register_udf(udf.function); + Ok(()) + } + + pub fn deregister_udf(&self, name: &str) { + self.ctx.deregister_udf(name); + } + + pub fn register_udaf(&self, udaf: PyAggregateUDF) -> PyResult<()> { + self.ctx.register_udaf(udaf.function); + Ok(()) + } + + pub fn deregister_udaf(&self, name: &str) { + self.ctx.deregister_udaf(name); + } + + pub fn register_udwf(&self, udwf: PyWindowUDF) -> PyResult<()> { + self.ctx.register_udwf(udwf.function); + Ok(()) + } + + pub fn deregister_udwf(&self, name: &str) { + self.ctx.deregister_udwf(name); + } + + #[pyo3(signature = (name="datafusion"))] + pub fn catalog(&self, py: Python, name: &str) -> PyResult> { + let catalog = self.ctx.catalog(name).ok_or(PyKeyError::new_err(format!( + "Catalog with name {name} doesn't exist." 
+ )))?; + + match catalog + .as_any() + .downcast_ref::() + { + Some(wrapped_schema) => Ok(wrapped_schema.catalog_provider.clone_ref(py)), + None => Ok( + PyCatalog::new_from_parts(catalog, Arc::clone(&self.logical_codec)) + .into_py_any(py)?, + ), + } + } + + pub fn catalog_names(&self) -> HashSet { + self.ctx.catalog_names().into_iter().collect() + } + + pub fn table(&self, name: &str, py: Python) -> PyResult { + let res = wait_for_future(py, self.ctx.table(name)) + .map_err(|e| PyKeyError::new_err(e.to_string()))?; + match res { + Ok(df) => Ok(PyDataFrame::new(df)), + Err(e) => { + if let datafusion::error::DataFusionError::Plan(msg) = &e + && msg.contains("No table named") + { + return Err(PyKeyError::new_err(msg.to_string())); + } + Err(py_datafusion_err(e)) + } + } + } + + pub fn table_exist(&self, name: &str) -> PyDataFusionResult { + Ok(self.ctx.table_exist(name)?) + } + + pub fn empty_table(&self) -> PyDataFusionResult { + Ok(PyDataFrame::new(self.ctx.read_empty()?)) + } + + pub fn session_id(&self) -> String { + self.ctx.session_id() + } + + pub fn session_start_time(&self) -> String { + self.ctx.session_start_time().to_rfc3339() + } + + pub fn enable_ident_normalization(&self) -> bool { + self.ctx.enable_ident_normalization() + } + + pub fn parse_sql_expr(&self, sql: &str, schema: PyDFSchema) -> PyDataFusionResult { + let df_schema: DFSchema = schema.into(); + Ok(self.ctx.parse_sql_expr(sql, &df_schema)?.into()) + } + + pub fn execute_logical_plan( + &self, + plan: PyLogicalPlan, + py: Python, + ) -> PyDataFusionResult { + let df = wait_for_future( + py, + self.ctx.execute_logical_plan(plan.plan.as_ref().clone()), + )??; + Ok(PyDataFrame::new(df)) + } + + pub fn refresh_catalogs(&self, py: Python) -> PyDataFusionResult<()> { + wait_for_future(py, self.ctx.refresh_catalogs())??; + Ok(()) + } + + pub fn remove_optimizer_rule(&self, name: &str) -> bool { + self.ctx.remove_optimizer_rule(name) + } + + pub fn table_provider(&self, name: &str, py: Python) -> 
PyResult { + let provider = wait_for_future(py, self.ctx.table_provider(name)) + // Outer error: runtime/async failure + .map_err(|e| PyRuntimeError::new_err(e.to_string()))? + // Inner error: table not found + .map_err(|e| PyKeyError::new_err(e.to_string()))?; + Ok(PyTable { table: provider }) + } + + #[allow(clippy::too_many_arguments)] + #[pyo3(signature = (path, schema=None, schema_infer_max_records=1000, file_extension=".json", table_partition_cols=vec![], file_compression_type=None))] + pub fn read_json( + &self, + path: PathBuf, + schema: Option>, + schema_infer_max_records: usize, + file_extension: &str, + table_partition_cols: Vec<(String, PyArrowType)>, + file_compression_type: Option, + py: Python, + ) -> PyDataFusionResult { + let path = path + .to_str() + .ok_or_else(|| PyValueError::new_err("Unable to convert path to a string"))?; + let mut options = JsonReadOptions::default() + .table_partition_cols( + table_partition_cols + .into_iter() + .map(|(name, ty)| (name, ty.0)) + .collect::>(), + ) + .file_compression_type(parse_file_compression_type(file_compression_type)?); + options.schema_infer_max_records = schema_infer_max_records; + options.file_extension = file_extension; + let df = if let Some(schema) = schema { + options.schema = Some(&schema.0); + let result = self.ctx.read_json(path, options); + wait_for_future(py, result)?? + } else { + let result = self.ctx.read_json(path, options); + wait_for_future(py, result)?? + }; + Ok(PyDataFrame::new(df)) + } + + #[pyo3(signature = ( + path, + options=None))] + pub fn read_csv( + &self, + path: &Bound<'_, PyAny>, + options: Option<&PyCsvReadOptions>, + py: Python, + ) -> PyDataFusionResult { + let options = options + .map(|opts| opts.try_into()) + .transpose()? 
+ .unwrap_or_default(); + + if path.is_instance_of::() { + let paths = path.extract::>()?; + let paths = paths.iter().map(|p| p as &str).collect::>(); + let result = self.ctx.read_csv(paths, options); + let df = PyDataFrame::new(wait_for_future(py, result)??); + Ok(df) + } else { + let path = path.extract::()?; + let result = self.ctx.read_csv(path, options); + let df = PyDataFrame::new(wait_for_future(py, result)??); + Ok(df) + } + } + + #[allow(clippy::too_many_arguments)] + #[pyo3(signature = ( + path, + table_partition_cols=vec![], + parquet_pruning=true, + file_extension=".parquet", + skip_metadata=true, + schema=None, + file_sort_order=None))] + pub fn read_parquet( + &self, + path: &str, + table_partition_cols: Vec<(String, PyArrowType)>, + parquet_pruning: bool, + file_extension: &str, + skip_metadata: bool, + schema: Option>, + file_sort_order: Option>>, + py: Python, + ) -> PyDataFusionResult { + let mut options = ParquetReadOptions::default() + .table_partition_cols( + table_partition_cols + .into_iter() + .map(|(name, ty)| (name, ty.0)) + .collect::>(), + ) + .parquet_pruning(parquet_pruning) + .skip_metadata(skip_metadata); + options.file_extension = file_extension; + options.schema = schema.as_ref().map(|x| &x.0); + options.file_sort_order = file_sort_order + .unwrap_or_default() + .into_iter() + .map(|e| e.into_iter().map(|f| f.into()).collect()) + .collect(); + + let result = self.ctx.read_parquet(path, options); + let df = PyDataFrame::new(wait_for_future(py, result)??); + Ok(df) + } + + #[allow(clippy::too_many_arguments)] + #[pyo3(signature = (path, schema=None, table_partition_cols=vec![], file_extension=".avro"))] + pub fn read_avro( + &self, + path: &str, + schema: Option>, + table_partition_cols: Vec<(String, PyArrowType)>, + file_extension: &str, + py: Python, + ) -> PyDataFusionResult { + let mut options = AvroReadOptions::default().table_partition_cols( + table_partition_cols + .into_iter() + .map(|(name, ty)| (name, ty.0)) + 
.collect::>(), + ); + options.file_extension = file_extension; + let df = if let Some(schema) = schema { + options.schema = Some(&schema.0); + let read_future = self.ctx.read_avro(path, options); + wait_for_future(py, read_future)?? + } else { + let read_future = self.ctx.read_avro(path, options); + wait_for_future(py, read_future)?? + }; + Ok(PyDataFrame::new(df)) + } + + #[pyo3(signature = (path, schema=None, file_extension=".arrow", table_partition_cols=vec![]))] + pub fn read_arrow( + &self, + path: &str, + schema: Option>, + file_extension: &str, + table_partition_cols: Vec<(String, PyArrowType)>, + py: Python, + ) -> PyDataFusionResult { + let mut options = ArrowReadOptions::default().table_partition_cols( + table_partition_cols + .into_iter() + .map(|(name, ty)| (name, ty.0)) + .collect::>(), + ); + options.file_extension = file_extension; + options.schema = schema.as_ref().map(|x| &x.0); + + let result = self.ctx.read_arrow(path, options); + let df = wait_for_future(py, result)??; + Ok(PyDataFrame::new(df)) + } + + pub fn read_table(&self, table: Bound<'_, PyAny>) -> PyDataFusionResult { + let session = self.clone().into_bound_py_any(table.py())?; + let table = PyTable::new(table, Some(session))?; + let df = self.ctx.read_table(table.table())?; + Ok(PyDataFrame::new(df)) + } + + fn __repr__(&self) -> PyResult { + let config = self.ctx.copied_config(); + let mut config_entries = config + .options() + .entries() + .iter() + .filter(|e| e.value.is_some()) + .map(|e| format!("{} = {}", e.key, e.value.as_ref().unwrap())) + .collect::>(); + config_entries.sort(); + Ok(format!( + "SessionContext: id={}; configs=[\n\t{}]", + self.session_id(), + config_entries.join("\n\t") + )) + } + + /// Execute a partition of an execution plan and return a stream of record batches + pub fn execute( + &self, + plan: PyExecutionPlan, + part: usize, + py: Python, + ) -> PyDataFusionResult { + let ctx: TaskContext = TaskContext::from(&self.ctx.state()); + let plan = 
plan.plan.clone(); + let stream = spawn_future(py, async move { plan.execute(part, Arc::new(ctx)) })?; + Ok(PyRecordBatchStream::new(stream)) + } + + pub fn __datafusion_task_context_provider__<'py>( + &self, + py: Python<'py>, + ) -> PyResult> { + let name = cr"datafusion_task_context_provider".into(); + + let ctx_provider = Arc::clone(&self.ctx) as Arc; + let ffi_ctx_provider = FFI_TaskContextProvider::from(&ctx_provider); + + PyCapsule::new(py, ffi_ctx_provider, Some(name)) + } + + pub fn __datafusion_logical_extension_codec__<'py>( + &self, + py: Python<'py>, + ) -> PyResult> { + create_logical_extension_capsule(py, self.logical_codec.as_ref()) + } + + pub fn with_logical_extension_codec<'py>( + &self, + codec: Bound<'py, PyAny>, + ) -> PyDataFusionResult { + let logical_codec = Arc::new(ffi_logical_codec_from_pycapsule(codec)?); + + Ok({ + Self { + ctx: Arc::clone(&self.ctx), + logical_codec, + } + }) + } +} + +impl PySessionContext { + async fn _table(&self, name: &str) -> datafusion::common::Result { + self.ctx.table(name).await + } + + async fn register_csv_from_multiple_paths( + &self, + name: &str, + table_paths: Vec, + options: CsvReadOptions<'_>, + ) -> datafusion::common::Result<()> { + let table_paths = table_paths.to_urls()?; + let session_config = self.ctx.copied_config(); + let listing_options = + options.to_listing_options(&session_config, self.ctx.copied_table_options()); + + let option_extension = listing_options.file_extension.clone(); + + if table_paths.is_empty() { + return exec_err!("No table paths were provided"); + } + + // check if the file extension matches the expected extension + for path in &table_paths { + let file_path = path.as_str(); + if !file_path.ends_with(option_extension.clone().as_str()) && !path.is_collection() { + return exec_err!( + "File path '{file_path}' does not match the expected extension '{option_extension}'" + ); + } + } + + let resolved_schema = options + .get_resolved_schema(&session_config, self.ctx.state(), 
table_paths[0].clone())
+            .await?;
+
+        let config = ListingTableConfig::new_with_multi_paths(table_paths)
+            .with_listing_options(listing_options)
+            .with_schema(resolved_schema);
+        let table = ListingTable::try_new(config)?;
+        self.ctx
+            .register_table(TableReference::Bare { table: name.into() }, Arc::new(table))?;
+        Ok(())
+    }
+
+    fn default_logical_codec(ctx: &Arc<SessionContext>) -> Arc<FFI_LogicalExtensionCodec> {
+        let codec = Arc::new(DefaultLogicalExtensionCodec {});
+        let runtime = get_tokio_runtime().handle().clone();
+        let ctx_provider = Arc::clone(ctx) as Arc<dyn TaskContextProvider>;
+        Arc::new(FFI_LogicalExtensionCodec::new(
+            codec,
+            Some(runtime),
+            &ctx_provider,
+        ))
+    }
+}
+
+pub fn parse_file_compression_type(
+    file_compression_type: Option<String>,
+) -> Result<FileCompressionType, PyErr> {
+    FileCompressionType::from_str(file_compression_type.unwrap_or_default().as_str())
+        .map_err(|_| {
+            PyValueError::new_err("file_compression_type must be one of: gzip, bz2, xz, zstd")
+        })
+}
+
+impl From<PySessionContext> for SessionContext {
+    fn from(ctx: PySessionContext) -> SessionContext {
+        ctx.ctx.as_ref().clone()
+    }
+}
+
+impl From<SessionContext> for PySessionContext {
+    fn from(ctx: SessionContext) -> PySessionContext {
+        let ctx = Arc::new(ctx);
+        let logical_codec = Self::default_logical_codec(&ctx);
+
+        PySessionContext { ctx, logical_codec }
+    }
+}
diff --git a/crates/core/src/dataframe.rs b/crates/core/src/dataframe.rs
new file mode 100644
index 000000000..c067eac30
--- /dev/null
+++ b/crates/core/src/dataframe.rs
@@ -0,0 +1,1541 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.
You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use std::collections::HashMap; +use std::ffi::{CStr, CString}; +use std::ptr::NonNull; +use std::str::FromStr; +use std::sync::Arc; + +use arrow::array::{Array, ArrayRef, RecordBatch, RecordBatchReader, new_null_array}; +use arrow::compute::can_cast_types; +use arrow::error::ArrowError; +use arrow::ffi::FFI_ArrowSchema; +use arrow::ffi_stream::FFI_ArrowArrayStream; +use arrow::pyarrow::FromPyArrow; +use cstr::cstr; +use datafusion::arrow::datatypes::{Schema, SchemaRef}; +use datafusion::arrow::pyarrow::{PyArrowType, ToPyArrow}; +use datafusion::arrow::util::pretty; +use datafusion::catalog::TableProvider; +use datafusion::common::UnnestOptions; +use datafusion::config::{CsvOptions, ParquetColumnOptions, ParquetOptions, TableParquetOptions}; +use datafusion::dataframe::{DataFrame, DataFrameWriteOptions}; +use datafusion::error::DataFusionError; +use datafusion::execution::SendableRecordBatchStream; +use datafusion::logical_expr::SortExpr; +use datafusion::logical_expr::dml::InsertOp; +use datafusion::parquet::basic::{BrotliLevel, Compression, GzipLevel, ZstdLevel}; +use datafusion::prelude::*; +use datafusion_python_util::{is_ipython_env, spawn_future, wait_for_future}; +use futures::{StreamExt, TryStreamExt}; +use parking_lot::Mutex; +use pyo3::PyErr; +use pyo3::exceptions::PyValueError; +use pyo3::prelude::*; +use pyo3::pybacked::PyBackedStr; +use pyo3::types::{PyCapsule, PyList, PyTuple, PyTupleMethods}; + +use crate::common::data_type::PyScalarValue; +use crate::errors::{PyDataFusionError, PyDataFusionResult, py_datafusion_err}; +use 
crate::expr::PyExpr; +use crate::expr::sort_expr::{PySortExpr, to_sort_expressions}; +use crate::physical_plan::PyExecutionPlan; +use crate::record_batch::{PyRecordBatchStream, poll_next_batch}; +use crate::sql::logical::PyLogicalPlan; +use crate::table::{PyTable, TempViewTable}; + +/// File-level static CStr for the Arrow array stream capsule name. +static ARROW_ARRAY_STREAM_NAME: &CStr = cstr!("arrow_array_stream"); + +// Type aliases to simplify very complex types used in this file and +// avoid compiler complaints about deeply nested types in struct fields. +type CachedBatches = Option<(Vec, bool)>; +type SharedCachedBatches = Arc>; + +/// Configuration for DataFrame display formatting +#[derive(Debug, Clone)] +pub struct FormatterConfig { + /// Maximum memory in bytes to use for display (default: 2MB) + pub max_bytes: usize, + /// Minimum number of rows to display (default: 10) + pub min_rows: usize, + /// Maximum number of rows to include in __repr__ output (default: 10) + pub max_rows: usize, +} + +impl Default for FormatterConfig { + fn default() -> Self { + Self { + max_bytes: 2 * 1024 * 1024, // 2MB + min_rows: 10, + max_rows: 10, + } + } +} + +impl FormatterConfig { + /// Validates that all configuration values are positive integers. + /// + /// # Returns + /// + /// `Ok(())` if all values are valid, or an `Err` with a descriptive error message. 
+    pub fn validate(&self) -> Result<(), String> {
+        if self.max_bytes == 0 {
+            return Err("max_bytes must be a positive integer".to_string());
+        }
+
+        if self.min_rows == 0 {
+            return Err("min_rows must be a positive integer".to_string());
+        }
+
+        if self.max_rows == 0 {
+            return Err("max_rows must be a positive integer".to_string());
+        }
+
+        if self.min_rows > self.max_rows {
+            return Err("min_rows must be less than or equal to max_rows".to_string());
+        }
+
+        Ok(())
+    }
+}
+
+/// Holds the Python formatter and its configuration
+struct PythonFormatter<'py> {
+    /// The Python formatter object
+    formatter: Bound<'py, PyAny>,
+    /// The formatter configuration
+    config: FormatterConfig,
+}
+
+/// Get the Python formatter and its configuration
+fn get_python_formatter_with_config(py: Python) -> PyResult<PythonFormatter<'_>> {
+    let formatter = import_python_formatter(py)?;
+    let config = build_formatter_config_from_python(&formatter)?;
+    Ok(PythonFormatter { formatter, config })
+}
+
+/// Get the Python formatter from the datafusion.dataframe_formatter module
+fn import_python_formatter(py: Python<'_>) -> PyResult<Bound<'_, PyAny>> {
+    let formatter_module = py.import("datafusion.dataframe_formatter")?;
+    let get_formatter = formatter_module.getattr("get_formatter")?;
+    get_formatter.call0()
+}
+
+// Helper function to extract attributes with fallback to default
+fn get_attr<'a, T>(py_object: &'a Bound<'a, PyAny>, attr_name: &str, default_value: T) -> T
+where
+    T: for<'py> pyo3::FromPyObject<'py, 'py> + Clone,
+{
+    py_object
+        .getattr(attr_name)
+        .and_then(|v| v.extract::<T>().map_err(Into::<PyErr>::into))
+        .unwrap_or_else(|_| default_value.clone())
+}
+
+/// Helper function to create a FormatterConfig from a Python formatter object
+fn build_formatter_config_from_python(formatter: &Bound<'_, PyAny>) -> PyResult<FormatterConfig> {
+    let default_config = FormatterConfig::default();
+    let max_bytes = get_attr(formatter, "max_memory_bytes", default_config.max_bytes);
+    let min_rows = get_attr(formatter, "min_rows",
default_config.min_rows); + + // Backward compatibility: Try max_rows first (new name), fall back to repr_rows (deprecated), + // then use default. This ensures backward compatibility with custom formatter implementations + // during the deprecation period. + let max_rows = get_attr(formatter, "max_rows", 0usize); + let max_rows = if max_rows > 0 { + // max_rows attribute exists and has a value + max_rows + } else { + // Try the deprecated repr_rows attribute + let repr_rows = get_attr(formatter, "repr_rows", 0usize); + if repr_rows > 0 { + repr_rows + } else { + // Use default + default_config.max_rows + } + }; + + let config = FormatterConfig { + max_bytes, + min_rows, + max_rows, + }; + + // Return the validated config, converting String error to PyErr + config.validate().map_err(PyValueError::new_err)?; + Ok(config) +} + +/// Python mapping of `ParquetOptions` (includes just the writer-related options). +#[pyclass( + from_py_object, + frozen, + name = "ParquetWriterOptions", + module = "datafusion", + subclass +)] +#[derive(Clone, Default)] +pub struct PyParquetWriterOptions { + options: ParquetOptions, +} + +#[pymethods] +impl PyParquetWriterOptions { + #[new] + #[allow(clippy::too_many_arguments)] + pub fn new( + data_pagesize_limit: usize, + write_batch_size: usize, + writer_version: &str, + skip_arrow_metadata: bool, + compression: Option, + dictionary_enabled: Option, + dictionary_page_size_limit: usize, + statistics_enabled: Option, + max_row_group_size: usize, + created_by: String, + column_index_truncate_length: Option, + statistics_truncate_length: Option, + data_page_row_count_limit: usize, + encoding: Option, + bloom_filter_on_write: bool, + bloom_filter_fpp: Option, + bloom_filter_ndv: Option, + allow_single_file_parallelism: bool, + maximum_parallel_row_group_writers: usize, + maximum_buffered_record_batches_per_stream: usize, + ) -> PyResult { + let writer_version = + 
datafusion::common::parquet_config::DFParquetWriterVersion::from_str(writer_version) + .map_err(py_datafusion_err)?; + Ok(Self { + options: ParquetOptions { + data_pagesize_limit, + write_batch_size, + writer_version, + skip_arrow_metadata, + compression, + dictionary_enabled, + dictionary_page_size_limit, + statistics_enabled, + max_row_group_size, + created_by, + column_index_truncate_length, + statistics_truncate_length, + data_page_row_count_limit, + encoding, + bloom_filter_on_write, + bloom_filter_fpp, + bloom_filter_ndv, + allow_single_file_parallelism, + maximum_parallel_row_group_writers, + maximum_buffered_record_batches_per_stream, + ..Default::default() + }, + }) + } +} + +/// Python mapping of `ParquetColumnOptions`. +#[pyclass( + from_py_object, + frozen, + name = "ParquetColumnOptions", + module = "datafusion", + subclass +)] +#[derive(Clone, Default)] +pub struct PyParquetColumnOptions { + options: ParquetColumnOptions, +} + +#[pymethods] +impl PyParquetColumnOptions { + #[new] + pub fn new( + bloom_filter_enabled: Option, + encoding: Option, + dictionary_enabled: Option, + compression: Option, + statistics_enabled: Option, + bloom_filter_fpp: Option, + bloom_filter_ndv: Option, + ) -> Self { + Self { + options: ParquetColumnOptions { + bloom_filter_enabled, + encoding, + dictionary_enabled, + compression, + statistics_enabled, + bloom_filter_fpp, + bloom_filter_ndv, + }, + } + } +} + +/// A PyDataFrame is a representation of a logical plan and an API to compose statements. +/// Use it to build a plan and `.collect()` to execute the plan and collect the result. +/// The actual execution of a plan runs natively on Rust and Arrow on a multi-threaded environment. +#[pyclass( + from_py_object, + name = "DataFrame", + module = "datafusion", + subclass, + frozen +)] +#[derive(Clone)] +pub struct PyDataFrame { + df: Arc, + + // In IPython environment cache batches between __repr__ and _repr_html_ calls. 
+ batches: SharedCachedBatches, +} + +impl PyDataFrame { + /// creates a new PyDataFrame + pub fn new(df: DataFrame) -> Self { + Self { + df: Arc::new(df), + batches: Arc::new(Mutex::new(None)), + } + } + + /// Return a clone of the inner Arc for crate-local callers. + pub(crate) fn inner_df(&self) -> Arc { + Arc::clone(&self.df) + } + + fn prepare_repr_string<'py>( + &self, + py: Python<'py>, + as_html: bool, + ) -> PyDataFusionResult { + // Get the Python formatter and config + let PythonFormatter { formatter, config } = get_python_formatter_with_config(py)?; + + let is_ipython = *is_ipython_env(py); + + let (cached_batches, should_cache) = { + let mut cache = self.batches.lock(); + let should_cache = is_ipython && cache.is_none(); + let batches = cache.take(); + (batches, should_cache) + }; + + let (batches, has_more) = match cached_batches { + Some(b) => b, + None => wait_for_future( + py, + collect_record_batches_to_display(self.df.as_ref().clone(), config), + )??, + }; + + if batches.is_empty() { + // This should not be reached, but do it for safety since we index into the vector below + return Ok("No data to display".to_string()); + } + + let table_uuid = uuid::Uuid::new_v4().to_string(); + + // Convert record batches to Py list + let py_batches = batches + .iter() + .map(|rb| rb.to_pyarrow(py)) + .collect::>>>()?; + + let py_schema = self.schema().into_pyobject(py)?; + + let kwargs = pyo3::types::PyDict::new(py); + let py_batches_list = PyList::new(py, py_batches.as_slice())?; + kwargs.set_item("batches", py_batches_list)?; + kwargs.set_item("schema", py_schema)?; + kwargs.set_item("has_more", has_more)?; + kwargs.set_item("table_uuid", table_uuid)?; + + let method_name = match as_html { + true => "format_html", + false => "format_str", + }; + + let html_result = formatter.call_method(method_name, (), Some(&kwargs))?; + let html_str: String = html_result.extract()?; + + if should_cache { + let mut cache = self.batches.lock(); + *cache = 
Some((batches.clone(), has_more)); + } + + Ok(html_str) + } + + async fn collect_column_inner(&self, column: &str) -> Result { + let batches = self + .df + .as_ref() + .clone() + .select_columns(&[column])? + .collect() + .await?; + + let arrays = batches + .iter() + .map(|b| b.column(0).as_ref()) + .collect::>(); + + arrow_select::concat::concat(&arrays).map_err(Into::into) + } +} + +/// Synchronous wrapper around partitioned [`SendableRecordBatchStream`]s used +/// for the `__arrow_c_stream__` implementation. +/// +/// It drains each partition's stream sequentially, yielding record batches in +/// their original partition order. When a `projection` is set, each batch is +/// converted via `record_batch_into_schema` to apply schema changes per batch. +struct PartitionedDataFrameStreamReader { + streams: Vec, + schema: SchemaRef, + projection: Option, + current: usize, +} + +impl Iterator for PartitionedDataFrameStreamReader { + type Item = Result; + + fn next(&mut self) -> Option { + while self.current < self.streams.len() { + let stream = &mut self.streams[self.current]; + let fut = poll_next_batch(stream); + let result = Python::attach(|py| wait_for_future(py, fut)); + + match result { + Ok(Ok(Some(batch))) => { + let batch = if let Some(ref schema) = self.projection { + match record_batch_into_schema(batch, schema.as_ref()) { + Ok(b) => b, + Err(e) => return Some(Err(e)), + } + } else { + batch + }; + return Some(Ok(batch)); + } + Ok(Ok(None)) => { + self.current += 1; + continue; + } + Ok(Err(e)) => { + return Some(Err(ArrowError::ExternalError(Box::new(e)))); + } + Err(e) => { + return Some(Err(ArrowError::ExternalError(Box::new(e)))); + } + } + } + + None + } +} + +impl RecordBatchReader for PartitionedDataFrameStreamReader { + fn schema(&self) -> SchemaRef { + self.schema.clone() + } +} + +#[pymethods] +impl PyDataFrame { + /// Enable selection for `df[col]`, `df[col1, col2, col3]`, and `df[[col1, col2, col3]]` + fn __getitem__(&self, key: Bound<'_, PyAny>) 
-> PyDataFusionResult { + if let Ok(key) = key.extract::() { + // df[col] + self.select_exprs(vec![key]) + } else if let Ok(tuple) = key.cast::() { + // df[col1, col2, col3] + let keys = tuple + .iter() + .map(|item| item.extract::()) + .collect::>>()?; + self.select_exprs(keys) + } else if let Ok(keys) = key.extract::>() { + // df[[col1, col2, col3]] + self.select_exprs(keys) + } else { + let message = "DataFrame can only be indexed by string index or indices".to_string(); + Err(PyDataFusionError::Common(message)) + } + } + + fn __repr__(&self, py: Python) -> PyDataFusionResult { + self.prepare_repr_string(py, false) + } + + #[staticmethod] + #[expect(unused_variables)] + fn default_str_repr<'py>( + batches: Vec>, + schema: &Bound<'py, PyAny>, + has_more: bool, + table_uuid: &str, + ) -> PyResult { + let batches = batches + .into_iter() + .map(|batch| RecordBatch::from_pyarrow_bound(&batch)) + .collect::>>()? + .into_iter() + .filter(|batch| batch.num_rows() > 0) + .collect::>(); + + if batches.is_empty() { + return Ok("No data to display".to_owned()); + } + + let batches_as_displ = + pretty::pretty_format_batches(&batches).map_err(py_datafusion_err)?; + + let additional_str = match has_more { + true => "\nData truncated.", + false => "", + }; + + Ok(format!("DataFrame()\n{batches_as_displ}{additional_str}")) + } + + fn _repr_html_(&self, py: Python) -> PyDataFusionResult { + self.prepare_repr_string(py, true) + } + + /// Calculate summary statistics for a DataFrame + fn describe(&self, py: Python) -> PyDataFusionResult { + let df = self.df.as_ref().clone(); + let stat_df = wait_for_future(py, df.describe())??; + Ok(Self::new(stat_df)) + } + + /// Returns the schema from the logical plan + fn schema(&self) -> PyArrowType { + PyArrowType(self.df.schema().as_arrow().clone()) + } + + /// Convert this DataFrame into a Table Provider that can be used in register_table + /// By convention, into_... methods consume self and return the new object. 
+ /// Disabling the clippy lint, so we can use &self + /// because we're working with Python bindings + /// where objects are shared + #[allow(clippy::wrong_self_convention)] + pub fn into_view(&self, temporary: bool) -> PyDataFusionResult { + let table_provider = if temporary { + Arc::new(TempViewTable::new(Arc::clone(&self.df))) as Arc + } else { + // Call the underlying Rust DataFrame::into_view method. + // Note that the Rust method consumes self; here we clone the inner Arc + // so that we don't invalidate this PyDataFrame. + self.df.as_ref().clone().into_view() + }; + Ok(PyTable::from(table_provider)) + } + + #[pyo3(signature = (*args))] + fn select_exprs(&self, args: Vec) -> PyDataFusionResult { + let args = args.iter().map(|s| s.as_ref()).collect::>(); + let df = self.df.as_ref().clone().select_exprs(&args)?; + Ok(Self::new(df)) + } + + #[pyo3(signature = (*args))] + fn select(&self, args: Vec) -> PyDataFusionResult { + let expr: Vec = args.into_iter().map(|e| e.into()).collect(); + let df = self.df.as_ref().clone().select(expr)?; + Ok(Self::new(df)) + } + + #[pyo3(signature = (*args))] + fn drop(&self, args: Vec) -> PyDataFusionResult { + let cols = args.iter().map(|s| s.as_ref()).collect::>(); + let df = self.df.as_ref().clone().drop_columns(&cols)?; + Ok(Self::new(df)) + } + + /// Apply window function expressions to the DataFrame + #[pyo3(signature = (*exprs))] + fn window(&self, exprs: Vec) -> PyDataFusionResult { + let window_exprs = exprs.into_iter().map(|e| e.into()).collect(); + let df = self.df.as_ref().clone().window(window_exprs)?; + Ok(Self::new(df)) + } + + fn filter(&self, predicate: PyExpr) -> PyDataFusionResult { + let df = self.df.as_ref().clone().filter(predicate.into())?; + Ok(Self::new(df)) + } + + fn parse_sql_expr(&self, expr: PyBackedStr) -> PyDataFusionResult { + self.df + .as_ref() + .parse_sql_expr(&expr) + .map(PyExpr::from) + .map_err(PyDataFusionError::from) + } + + fn with_column(&self, name: &str, expr: PyExpr) -> 
PyDataFusionResult { + let df = self.df.as_ref().clone().with_column(name, expr.into())?; + Ok(Self::new(df)) + } + + fn with_columns(&self, exprs: Vec) -> PyDataFusionResult { + let mut df = self.df.as_ref().clone(); + for expr in exprs { + let expr: Expr = expr.into(); + let name = format!("{}", expr.schema_name()); + df = df.with_column(name.as_str(), expr)? + } + Ok(Self::new(df)) + } + + /// Rename one column by applying a new projection. This is a no-op if the column to be + /// renamed does not exist. + fn with_column_renamed(&self, old_name: &str, new_name: &str) -> PyDataFusionResult { + let df = self + .df + .as_ref() + .clone() + .with_column_renamed(old_name, new_name)?; + Ok(Self::new(df)) + } + + fn aggregate(&self, group_by: Vec, aggs: Vec) -> PyDataFusionResult { + let group_by = group_by.into_iter().map(|e| e.into()).collect(); + let aggs = aggs.into_iter().map(|e| e.into()).collect(); + let df = self.df.as_ref().clone().aggregate(group_by, aggs)?; + Ok(Self::new(df)) + } + + #[pyo3(signature = (*exprs))] + fn sort(&self, exprs: Vec) -> PyDataFusionResult { + let exprs = to_sort_expressions(exprs); + let df = self.df.as_ref().clone().sort(exprs)?; + Ok(Self::new(df)) + } + + #[pyo3(signature = (count, offset=0))] + fn limit(&self, count: usize, offset: usize) -> PyDataFusionResult { + let df = self.df.as_ref().clone().limit(offset, Some(count))?; + Ok(Self::new(df)) + } + + /// Executes the plan, returning a list of `RecordBatch`es. + /// Unless some order is specified in the plan, there is no + /// guarantee of the order of the result. + fn collect<'py>(&self, py: Python<'py>) -> PyResult>> { + let batches = wait_for_future(py, self.df.as_ref().clone().collect())? + .map_err(PyDataFusionError::from)?; + // cannot use PyResult> return type due to + // https://github.com/PyO3/pyo3/issues/1813 + batches.into_iter().map(|rb| rb.to_pyarrow(py)).collect() + } + + /// Cache DataFrame. 
+ fn cache(&self, py: Python) -> PyDataFusionResult { + let df = wait_for_future(py, self.df.as_ref().clone().cache())??; + Ok(Self::new(df)) + } + + /// Executes this DataFrame and collects all results into a vector of vector of RecordBatch + /// maintaining the input partitioning. + fn collect_partitioned<'py>(&self, py: Python<'py>) -> PyResult>>> { + let batches = wait_for_future(py, self.df.as_ref().clone().collect_partitioned())? + .map_err(PyDataFusionError::from)?; + + batches + .into_iter() + .map(|rbs| rbs.into_iter().map(|rb| rb.to_pyarrow(py)).collect()) + .collect() + } + + fn collect_column<'py>(&self, py: Python<'py>, column: &str) -> PyResult> { + wait_for_future(py, self.collect_column_inner(column))? + .map_err(PyDataFusionError::from)? + .to_data() + .to_pyarrow(py) + } + + /// Print the result, 20 lines by default + #[pyo3(signature = (num=20))] + fn show(&self, py: Python, num: usize) -> PyDataFusionResult<()> { + let df = self.df.as_ref().clone().limit(0, Some(num))?; + print_dataframe(py, df) + } + + /// Filter out duplicate rows + fn distinct(&self) -> PyDataFusionResult { + let df = self.df.as_ref().clone().distinct()?; + Ok(Self::new(df)) + } + + fn join( + &self, + right: PyDataFrame, + how: &str, + left_on: Vec, + right_on: Vec, + coalesce_keys: bool, + ) -> PyDataFusionResult { + let join_type = match how { + "inner" => JoinType::Inner, + "left" => JoinType::Left, + "right" => JoinType::Right, + "full" => JoinType::Full, + "semi" => JoinType::LeftSemi, + "anti" => JoinType::LeftAnti, + how => { + return Err(PyDataFusionError::Common(format!( + "The join type {how} does not exist or is not implemented" + ))); + } + }; + + let left_keys = left_on.iter().map(|s| s.as_ref()).collect::>(); + let right_keys = right_on.iter().map(|s| s.as_ref()).collect::>(); + + let mut df = self.df.as_ref().clone().join( + right.df.as_ref().clone(), + join_type, + &left_keys, + &right_keys, + None, + )?; + + if coalesce_keys { + let mutual_keys = left_keys + 
.iter() + .zip(right_keys.iter()) + .filter(|(l, r)| l == r) + .map(|(key, _)| *key) + .collect::>(); + + let fields_to_coalesce = mutual_keys + .iter() + .map(|name| { + let qualified_fields = df + .logical_plan() + .schema() + .qualified_fields_with_unqualified_name(name); + (*name, qualified_fields) + }) + .filter(|(_, fields)| fields.len() == 2) + .collect::>(); + + let expr: Vec = df + .logical_plan() + .schema() + .fields() + .into_iter() + .enumerate() + .map(|(idx, _)| df.logical_plan().schema().qualified_field(idx)) + .filter_map(|(qualifier, field)| { + if let Some((key_name, qualified_fields)) = fields_to_coalesce + .iter() + .find(|(_, qf)| qf.contains(&(qualifier, field))) + { + // Only add the coalesce expression once (when we encounter the first field) + // Skip the second field (it's already included in to coalesce) + if (qualifier, field) == qualified_fields[0] { + let left_col = Expr::Column(Column::from(qualified_fields[0])); + let right_col = Expr::Column(Column::from(qualified_fields[1])); + return Some(coalesce(vec![left_col, right_col]).alias(*key_name)); + } + None + } else { + Some(Expr::Column(Column::from((qualifier, field)))) + } + }) + .collect(); + df = df.select(expr)?; + } + + Ok(Self::new(df)) + } + + fn join_on( + &self, + right: PyDataFrame, + on_exprs: Vec, + how: &str, + ) -> PyDataFusionResult { + let join_type = match how { + "inner" => JoinType::Inner, + "left" => JoinType::Left, + "right" => JoinType::Right, + "full" => JoinType::Full, + "semi" => JoinType::LeftSemi, + "anti" => JoinType::LeftAnti, + how => { + return Err(PyDataFusionError::Common(format!( + "The join type {how} does not exist or is not implemented" + ))); + } + }; + let exprs: Vec = on_exprs.into_iter().map(|e| e.into()).collect(); + + let df = self + .df + .as_ref() + .clone() + .join_on(right.df.as_ref().clone(), join_type, exprs)?; + Ok(Self::new(df)) + } + + /// Print the query plan + #[pyo3(signature = (verbose=false, analyze=false, format=None))] + fn 
explain( + &self, + py: Python, + verbose: bool, + analyze: bool, + format: Option<&str>, + ) -> PyDataFusionResult<()> { + let explain_format = match format { + Some(f) => f + .parse::() + .map_err(|e| { + PyDataFusionError::Common(format!("Invalid explain format '{}': {}", f, e)) + })?, + None => datafusion::common::format::ExplainFormat::Indent, + }; + let opts = datafusion::logical_expr::ExplainOption::default() + .with_verbose(verbose) + .with_analyze(analyze) + .with_format(explain_format); + let df = self.df.as_ref().clone().explain_with_options(opts)?; + print_dataframe(py, df) + } + + /// Get the logical plan for this `DataFrame` + fn logical_plan(&self) -> PyResult { + Ok(self.df.as_ref().clone().logical_plan().clone().into()) + } + + /// Get the optimized logical plan for this `DataFrame` + fn optimized_logical_plan(&self) -> PyDataFusionResult { + Ok(self.df.as_ref().clone().into_optimized_plan()?.into()) + } + + /// Get the execution plan for this `DataFrame` + fn execution_plan(&self, py: Python) -> PyDataFusionResult { + let plan = wait_for_future(py, self.df.as_ref().clone().create_physical_plan())??; + Ok(plan.into()) + } + + /// Repartition a `DataFrame` based on a logical partitioning scheme. + fn repartition(&self, num: usize) -> PyDataFusionResult { + let new_df = self + .df + .as_ref() + .clone() + .repartition(Partitioning::RoundRobinBatch(num))?; + Ok(Self::new(new_df)) + } + + /// Repartition a `DataFrame` based on a logical partitioning scheme. 
+ #[pyo3(signature = (*args, num))] + fn repartition_by_hash(&self, args: Vec, num: usize) -> PyDataFusionResult { + let expr = args.into_iter().map(|py_expr| py_expr.into()).collect(); + let new_df = self + .df + .as_ref() + .clone() + .repartition(Partitioning::Hash(expr, num))?; + Ok(Self::new(new_df)) + } + + /// Calculate the union of two `DataFrame`s, preserving duplicate rows.The + /// two `DataFrame`s must have exactly the same schema + #[pyo3(signature = (py_df, distinct=false))] + fn union(&self, py_df: PyDataFrame, distinct: bool) -> PyDataFusionResult { + let new_df = if distinct { + self.df + .as_ref() + .clone() + .union_distinct(py_df.df.as_ref().clone())? + } else { + self.df.as_ref().clone().union(py_df.df.as_ref().clone())? + }; + + Ok(Self::new(new_df)) + } + + #[pyo3(signature = (columns, preserve_nulls=true, recursions=None))] + fn unnest_columns( + &self, + columns: Vec, + preserve_nulls: bool, + recursions: Option>, + ) -> PyDataFusionResult { + let unnest_options = build_unnest_options(preserve_nulls, recursions); + let cols = columns.iter().map(|s| s.as_ref()).collect::>(); + let df = self + .df + .as_ref() + .clone() + .unnest_columns_with_options(&cols, unnest_options)?; + Ok(Self::new(df)) + } + + /// Calculate the intersection of two `DataFrame`s. The two `DataFrame`s must have exactly the same schema + #[pyo3(signature = (py_df, distinct=false))] + fn intersect(&self, py_df: PyDataFrame, distinct: bool) -> PyDataFusionResult { + let base = self.df.as_ref().clone(); + let other = py_df.df.as_ref().clone(); + let new_df = if distinct { + base.intersect_distinct(other)? + } else { + base.intersect(other)? + }; + Ok(Self::new(new_df)) + } + + /// Calculate the exception of two `DataFrame`s. 
The two `DataFrame`s must have exactly the same schema + #[pyo3(signature = (py_df, distinct=false))] + fn except_all(&self, py_df: PyDataFrame, distinct: bool) -> PyDataFusionResult { + let base = self.df.as_ref().clone(); + let other = py_df.df.as_ref().clone(); + let new_df = if distinct { + base.except_distinct(other)? + } else { + base.except(other)? + }; + Ok(Self::new(new_df)) + } + + /// Union two DataFrames matching columns by name + #[pyo3(signature = (py_df, distinct=false))] + fn union_by_name(&self, py_df: PyDataFrame, distinct: bool) -> PyDataFusionResult { + let base = self.df.as_ref().clone(); + let other = py_df.df.as_ref().clone(); + let new_df = if distinct { + base.union_by_name_distinct(other)? + } else { + base.union_by_name(other)? + }; + Ok(Self::new(new_df)) + } + + /// Deduplicate rows based on specific columns, keeping the first row per group + fn distinct_on( + &self, + on_expr: Vec, + select_expr: Vec, + sort_expr: Option>, + ) -> PyDataFusionResult { + let on_expr = on_expr.into_iter().map(|e| e.into()).collect(); + let select_expr = select_expr.into_iter().map(|e| e.into()).collect(); + let sort_expr = sort_expr.map(to_sort_expressions); + let df = self + .df + .as_ref() + .clone() + .distinct_on(on_expr, select_expr, sort_expr)?; + Ok(Self::new(df)) + } + + /// Sort by column expressions with ascending order and nulls last + fn sort_by(&self, exprs: Vec) -> PyDataFusionResult { + let exprs = exprs.into_iter().map(|e| e.into()).collect(); + let df = self.df.as_ref().clone().sort_by(exprs)?; + Ok(Self::new(df)) + } + + /// Return fully qualified column expressions for the given column names + fn find_qualified_columns(&self, names: Vec) -> PyDataFusionResult> { + let name_refs: Vec<&str> = names.iter().map(|s| s.as_str()).collect(); + let qualified = self.df.find_qualified_columns(&name_refs)?; + Ok(qualified + .into_iter() + .map(|q| Expr::Column(Column::from(q)).into()) + .collect()) + } + + /// Write a `DataFrame` to a CSV file. 
+ fn write_csv( + &self, + py: Python, + path: &str, + with_header: bool, + write_options: Option, + ) -> PyDataFusionResult<()> { + let csv_options = CsvOptions { + has_header: Some(with_header), + ..Default::default() + }; + let write_options = write_options + .map(DataFrameWriteOptions::from) + .unwrap_or_default(); + + wait_for_future( + py, + self.df + .as_ref() + .clone() + .write_csv(path, write_options, Some(csv_options)), + )??; + Ok(()) + } + + /// Write a `DataFrame` to a Parquet file. + #[pyo3(signature = ( + path, + compression="zstd", + compression_level=None, + write_options=None, + ))] + fn write_parquet( + &self, + path: &str, + compression: &str, + compression_level: Option, + write_options: Option, + py: Python, + ) -> PyDataFusionResult<()> { + fn verify_compression_level(cl: Option) -> Result { + cl.ok_or(PyValueError::new_err("compression_level is not defined")) + } + + let _validated = match compression.to_lowercase().as_str() { + "snappy" => Compression::SNAPPY, + "gzip" => Compression::GZIP( + GzipLevel::try_new(compression_level.unwrap_or(6)) + .map_err(|e| PyValueError::new_err(format!("{e}")))?, + ), + "brotli" => Compression::BROTLI( + BrotliLevel::try_new(verify_compression_level(compression_level)?) + .map_err(|e| PyValueError::new_err(format!("{e}")))?, + ), + "zstd" => Compression::ZSTD( + ZstdLevel::try_new(verify_compression_level(compression_level)? 
as i32) + .map_err(|e| PyValueError::new_err(format!("{e}")))?, + ), + "lzo" => Compression::LZO, + "lz4" => Compression::LZ4, + "lz4_raw" => Compression::LZ4_RAW, + "uncompressed" => Compression::UNCOMPRESSED, + _ => { + return Err(PyDataFusionError::Common(format!( + "Unrecognized compression type {compression}" + ))); + } + }; + + let mut compression_string = compression.to_string(); + if let Some(level) = compression_level { + compression_string.push_str(&format!("({level})")); + } + + let mut options = TableParquetOptions::default(); + options.global.compression = Some(compression_string); + let write_options = write_options + .map(DataFrameWriteOptions::from) + .unwrap_or_default(); + + wait_for_future( + py, + self.df + .as_ref() + .clone() + .write_parquet(path, write_options, Option::from(options)), + )??; + Ok(()) + } + + /// Write a `DataFrame` to a Parquet file, using advanced options. + fn write_parquet_with_options( + &self, + path: &str, + options: PyParquetWriterOptions, + column_specific_options: HashMap, + write_options: Option, + py: Python, + ) -> PyDataFusionResult<()> { + let table_options = TableParquetOptions { + global: options.options, + column_specific_options: column_specific_options + .into_iter() + .map(|(k, v)| (k, v.options)) + .collect(), + ..Default::default() + }; + let write_options = write_options + .map(DataFrameWriteOptions::from) + .unwrap_or_default(); + wait_for_future( + py, + self.df.as_ref().clone().write_parquet( + path, + write_options, + Option::from(table_options), + ), + )??; + Ok(()) + } + + /// Executes a query and writes the results to a partitioned JSON file. 
+ fn write_json( + &self, + path: &str, + py: Python, + write_options: Option, + ) -> PyDataFusionResult<()> { + let write_options = write_options + .map(DataFrameWriteOptions::from) + .unwrap_or_default(); + wait_for_future( + py, + self.df + .as_ref() + .clone() + .write_json(path, write_options, None), + )??; + Ok(()) + } + + fn write_table( + &self, + py: Python, + table_name: &str, + write_options: Option, + ) -> PyDataFusionResult<()> { + let write_options = write_options + .map(DataFrameWriteOptions::from) + .unwrap_or_default(); + wait_for_future( + py, + self.df + .as_ref() + .clone() + .write_table(table_name, write_options), + )??; + Ok(()) + } + + /// Convert to Arrow Table + /// Collect the batches and pass to Arrow Table + fn to_arrow_table(&self, py: Python<'_>) -> PyResult> { + let batches = self.collect(py)?.into_pyobject(py)?; + + // only use the DataFrame's schema if there are no batches, otherwise let the schema be + // determined from the batches (avoids some inconsistencies with nullable columns) + let args = if batches.len()? == 0 { + let schema = self.schema().into_pyobject(py)?; + PyTuple::new(py, &[batches, schema])? + } else { + PyTuple::new(py, &[batches])? + }; + + // Instantiate pyarrow Table object and use its from_batches method + let table_class = py.import("pyarrow")?.getattr("Table")?; + let table: Py = table_class.call_method1("from_batches", args)?.into(); + Ok(table) + } + + #[pyo3(signature = (requested_schema=None))] + fn __arrow_c_stream__<'py>( + &'py self, + py: Python<'py>, + requested_schema: Option>, + ) -> PyDataFusionResult> { + let df = self.df.as_ref().clone(); + let streams = spawn_future(py, async move { df.execute_stream_partitioned().await })?; + + let mut schema: Schema = self.df.schema().to_owned().as_arrow().clone(); + let mut projection: Option = None; + + if let Some(schema_capsule) = requested_schema { + let data: NonNull = schema_capsule + .pointer_checked(Some(c"arrow_schema"))? 
+ .cast(); + let schema_ptr = unsafe { data.as_ref() }; + let desired_schema = Schema::try_from(schema_ptr)?; + + schema = project_schema(schema, desired_schema)?; + projection = Some(Arc::new(schema.clone())); + } + + let schema_ref = Arc::new(schema.clone()); + + let reader = PartitionedDataFrameStreamReader { + streams, + schema: schema_ref, + projection, + current: 0, + }; + let reader: Box = Box::new(reader); + + // Create the Arrow stream and wrap it in a PyCapsule. The default + // destructor provided by PyO3 will drop the stream unless ownership is + // transferred to PyArrow during import. + let stream = FFI_ArrowArrayStream::new(reader); + let name = CString::new(ARROW_ARRAY_STREAM_NAME.to_bytes()).unwrap(); + let capsule = PyCapsule::new(py, stream, Some(name))?; + Ok(capsule) + } + + fn execute_stream(&self, py: Python) -> PyDataFusionResult { + let df = self.df.as_ref().clone(); + let stream = spawn_future(py, async move { df.execute_stream().await })?; + Ok(PyRecordBatchStream::new(stream)) + } + + fn execute_stream_partitioned(&self, py: Python) -> PyResult> { + let df = self.df.as_ref().clone(); + let streams = spawn_future(py, async move { df.execute_stream_partitioned().await })?; + Ok(streams.into_iter().map(PyRecordBatchStream::new).collect()) + } + + /// Convert to pandas dataframe with pyarrow + /// Collect the batches, pass to Arrow Table & then convert to Pandas DataFrame + fn to_pandas(&self, py: Python<'_>) -> PyResult> { + let table = self.to_arrow_table(py)?; + + // See also: https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_pandas + let result = table.call_method0(py, "to_pandas")?; + Ok(result) + } + + /// Convert to Python list using pyarrow + /// Each list item represents one row encoded as dictionary + fn to_pylist(&self, py: Python<'_>) -> PyResult> { + let table = self.to_arrow_table(py)?; + + // See also: https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_pylist + 
let result = table.call_method0(py, "to_pylist")?; + Ok(result) + } + + /// Convert to Python dictionary using pyarrow + /// Each dictionary key is a column and the dictionary value represents the column values + fn to_pydict(&self, py: Python) -> PyResult> { + let table = self.to_arrow_table(py)?; + + // See also: https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.to_pydict + let result = table.call_method0(py, "to_pydict")?; + Ok(result) + } + + /// Convert to polars dataframe with pyarrow + /// Collect the batches, pass to Arrow Table & then convert to polars DataFrame + fn to_polars(&self, py: Python<'_>) -> PyResult> { + let table = self.to_arrow_table(py)?; + let dataframe = py.import("polars")?.getattr("DataFrame")?; + let args = PyTuple::new(py, &[table])?; + let result: Py = dataframe.call1(args)?.into(); + Ok(result) + } + + // Executes this DataFrame to get the total number of rows. + fn count(&self, py: Python) -> PyDataFusionResult { + Ok(wait_for_future(py, self.df.as_ref().clone().count())??) 
+ } + + /// Fill null values with a specified value for specific columns + #[pyo3(signature = (value, columns=None))] + fn fill_null( + &self, + value: Py<PyAny>, + columns: Option<Vec<String>>, + py: Python, + ) -> PyDataFusionResult<Self> { + let scalar_value: PyScalarValue = value.extract(py)?; + + let cols = match columns { + Some(col_names) => col_names.iter().map(|c| c.to_string()).collect(), + None => Vec::new(), // Empty vector means fill null for all columns + }; + + let df = self.df.as_ref().clone().fill_null(scalar_value.0, cols)?; + Ok(Self::new(df)) + } +} + +#[derive(Debug, Clone, PartialEq, Eq, Hash, PartialOrd, Ord)] +#[pyclass( + from_py_object, + frozen, + eq, + eq_int, + name = "InsertOp", + module = "datafusion" +)] +pub enum PyInsertOp { + APPEND, + REPLACE, + OVERWRITE, +} + +impl From<PyInsertOp> for InsertOp { + fn from(value: PyInsertOp) -> Self { + match value { + PyInsertOp::APPEND => InsertOp::Append, + PyInsertOp::REPLACE => InsertOp::Replace, + PyInsertOp::OVERWRITE => InsertOp::Overwrite, + } + } +} + +#[derive(Debug, Clone)] +#[pyclass( + from_py_object, + frozen, + name = "DataFrameWriteOptions", + module = "datafusion" +)] +pub struct PyDataFrameWriteOptions { + insert_operation: InsertOp, + single_file_output: bool, + partition_by: Vec<String>, + sort_by: Vec<SortExpr>, +} + +impl From<PyDataFrameWriteOptions> for DataFrameWriteOptions { + fn from(value: PyDataFrameWriteOptions) -> Self { + DataFrameWriteOptions::new() + .with_insert_operation(value.insert_operation) + .with_single_file_output(value.single_file_output) + .with_partition_by(value.partition_by) + .with_sort_by(value.sort_by) + } +} + +#[pymethods] +impl PyDataFrameWriteOptions { + #[new] + fn new( + insert_operation: Option<PyInsertOp>, + single_file_output: bool, + partition_by: Option<Vec<String>>, + sort_by: Option<Vec<PySortExpr>>, + ) -> Self { + let insert_operation = insert_operation.map(Into::into).unwrap_or(InsertOp::Append); + let sort_by = sort_by + .unwrap_or_default() + .into_iter() + .map(Into::into) + .collect(); + Self { + insert_operation, + single_file_output, + 
partition_by: partition_by.unwrap_or_default(), + sort_by, + } + } +} + +fn build_unnest_options( + preserve_nulls: bool, + recursions: Option>, +) -> UnnestOptions { + let mut opts = UnnestOptions::default().with_preserve_nulls(preserve_nulls); + if let Some(recs) = recursions { + opts.recursions = recs + .into_iter() + .map( + |(input, output, depth)| datafusion::common::RecursionUnnestOption { + input_column: datafusion::common::Column::from(input.as_str()), + output_column: datafusion::common::Column::from(output.as_str()), + depth, + }, + ) + .collect(); + } + opts +} + +/// Print DataFrame +fn print_dataframe(py: Python, df: DataFrame) -> PyDataFusionResult<()> { + // Get string representation of record batches + let batches = wait_for_future(py, df.collect())??; + let result = if batches.is_empty() { + "DataFrame has no rows".to_string() + } else { + match pretty::pretty_format_batches(&batches) { + Ok(batch) => format!("DataFrame()\n{batch}"), + Err(err) => format!("Error: {:?}", err.to_string()), + } + }; + + // Import the Python 'builtins' module to access the print function + // Note that println! does not print to the Python debug console and is not visible in notebooks for instance + let print = py.import("builtins")?.getattr("print")?; + print.call1((result,))?; + Ok(()) +} + +fn project_schema(from_schema: Schema, to_schema: Schema) -> Result { + let merged_schema = Schema::try_merge(vec![from_schema, to_schema.clone()])?; + + let project_indices: Vec = to_schema + .fields + .iter() + .map(|field| field.name()) + .filter_map(|field_name| merged_schema.index_of(field_name).ok()) + .collect(); + + merged_schema.project(&project_indices) +} +// NOTE: `arrow::compute::cast` in combination with `RecordBatch::try_select` or +// DataFusion's `schema::cast_record_batch` do not fully cover the required +// transformations here. They will not create missing columns and may insert +// nulls for non-nullable fields without erroring. 
To maintain current behavior +// we perform the casting and null checks manually. +fn record_batch_into_schema( + record_batch: RecordBatch, + schema: &Schema, +) -> Result { + let schema = Arc::new(schema.clone()); + let base_schema = record_batch.schema(); + if base_schema.fields().is_empty() { + // Nothing to project + return Ok(RecordBatch::new_empty(schema)); + } + + let array_size = record_batch.column(0).len(); + let mut data_arrays = Vec::with_capacity(schema.fields().len()); + + for field in schema.fields() { + let desired_data_type = field.data_type(); + if let Some(original_data) = record_batch.column_by_name(field.name()) { + let original_data_type = original_data.data_type(); + + if can_cast_types(original_data_type, desired_data_type) { + data_arrays.push(arrow::compute::kernels::cast( + original_data, + desired_data_type, + )?); + } else if field.is_nullable() { + data_arrays.push(new_null_array(desired_data_type, array_size)); + } else { + return Err(ArrowError::CastError(format!( + "Attempting to cast to non-nullable and non-castable field {} during schema projection.", + field.name() + ))); + } + } else { + if !field.is_nullable() { + return Err(ArrowError::CastError(format!( + "Attempting to set null to non-nullable field {} during schema projection.", + field.name() + ))); + } + data_arrays.push(new_null_array(desired_data_type, array_size)); + } + } + + RecordBatch::try_new(schema, data_arrays) +} + +/// This is a helper function to return the first non-empty record batch from executing a DataFrame. +/// It additionally returns a bool, which indicates if there are more record batches available. +/// We do this so we can determine if we should indicate to the user that the data has been +/// truncated. 
This collects until we have achieved both of these conditions +/// +/// - We have collected our minimum number of rows +/// - We have reached our limit, either data size or maximum number of rows +/// +/// Otherwise it will return when the stream is exhausted. If you want a specific number of +/// rows, set min_rows == max_rows. +async fn collect_record_batches_to_display( + df: DataFrame, + config: FormatterConfig, +) -> Result<(Vec<RecordBatch>, bool), DataFusionError> { + let FormatterConfig { + max_bytes, + min_rows, + max_rows, + } = config; + + let partitioned_stream = df.execute_stream_partitioned().await?; + let mut stream = futures::stream::iter(partitioned_stream).flatten(); + let mut size_estimate_so_far = 0; + let mut rows_so_far = 0; + let mut record_batches = Vec::default(); + let mut has_more = false; + + // Collect rows until we hit a limit (memory or max_rows) OR reach the guaranteed minimum. + // The minimum rows constraint overrides both memory and row limits to ensure a baseline + // of data is always displayed, even if it temporarily exceeds those limits. + // This provides better UX by guaranteeing users see at least min_rows rows. 
+ while (size_estimate_so_far < max_bytes && rows_so_far < max_rows) || rows_so_far < min_rows { + let mut rb = match stream.next().await { + None => { + break; + } + Some(Ok(r)) => r, + Some(Err(e)) => return Err(e), + }; + + let mut rows_in_rb = rb.num_rows(); + if rows_in_rb > 0 { + size_estimate_so_far += rb.get_array_memory_size(); + + // When memory limit is exceeded, scale back row count proportionally to stay within budget + if size_estimate_so_far > max_bytes { + let ratio = max_bytes as f32 / size_estimate_so_far as f32; + let total_rows = rows_in_rb + rows_so_far; + + // Calculate reduced rows maintaining the memory/data proportion + let mut reduced_row_num = (total_rows as f32 * ratio).round() as usize; + // Ensure we always respect the minimum rows guarantee + if reduced_row_num < min_rows { + reduced_row_num = min_rows.min(total_rows); + } + + let limited_rows_this_rb = reduced_row_num - rows_so_far; + if limited_rows_this_rb < rows_in_rb { + rows_in_rb = limited_rows_this_rb; + rb = rb.slice(0, limited_rows_this_rb); + has_more = true; + } + } + + if rows_in_rb + rows_so_far > max_rows { + rb = rb.slice(0, max_rows - rows_so_far); + has_more = true; + } + + rows_so_far += rb.num_rows(); + record_batches.push(rb); + } + } + + if record_batches.is_empty() { + return Ok((Vec::default(), false)); + } + + if !has_more { + // Data was not already truncated, so check to see if more record batches remain + has_more = match stream.try_next().await { + Ok(None) => false, // reached end + Ok(Some(_)) => true, + Err(_) => false, // Stream disconnected + }; + } + + Ok((record_batches, has_more)) +} diff --git a/src/dataset.rs b/crates/core/src/dataset.rs similarity index 79% rename from src/dataset.rs rename to crates/core/src/dataset.rs index 0a2c7f50f..dbeafcd9f 100644 --- a/src/dataset.rs +++ b/crates/core/src/dataset.rs @@ -15,44 +15,42 @@ // specific language governing permissions and limitations // under the License. 
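The proportional scaling inside `collect_record_batches_to_display` above is easy to get wrong around the `min_rows` guarantee. A minimal Python sketch of just that arithmetic (the helper name `cap_rows_for_budget` and its parameters are illustrative, not part of the bindings):

```python
def cap_rows_for_budget(rows_so_far, rows_in_batch, size_estimate, max_bytes, min_rows):
    """Return how many rows of the incoming batch to keep.

    Mirrors the display-truncation arithmetic: when the running byte
    estimate exceeds max_bytes, scale the total row count down
    proportionally, but never below the guaranteed minimum.
    """
    if size_estimate <= max_bytes:
        return rows_in_batch  # still within budget: keep the whole batch
    ratio = max_bytes / size_estimate
    total_rows = rows_so_far + rows_in_batch
    reduced = round(total_rows * ratio)
    if reduced < min_rows:
        # the min_rows guarantee overrides the memory budget
        reduced = min(min_rows, total_rows)
    # rows still owed from this batch, capped at what the batch holds
    return min(max(reduced - rows_so_far, 0), rows_in_batch)
```

For instance, a batch of 100 rows that pushes the estimate to twice `max_bytes` is sliced to 50 rows, matching the ratio-based `rb.slice` in the Rust loop.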
-use pyo3::exceptions::PyValueError; -/// Implements a Datafusion TableProvider that delegates to a PyArrow Dataset -/// This allows us to use PyArrow Datasets as Datafusion tables while pushing down projections and filters -use pyo3::prelude::*; -use pyo3::types::PyType; - use std::any::Any; use std::sync::Arc; use async_trait::async_trait; - use datafusion::arrow::datatypes::SchemaRef; use datafusion::arrow::pyarrow::PyArrowType; -use datafusion::datasource::datasource::TableProviderFilterPushDown; +use datafusion::catalog::Session; use datafusion::datasource::{TableProvider, TableType}; use datafusion::error::{DataFusionError, Result as DFResult}; -use datafusion::execution::context::SessionState; +use datafusion::logical_expr::{Expr, TableProviderFilterPushDown}; use datafusion::physical_plan::ExecutionPlan; -use datafusion_expr::Expr; +use pyo3::exceptions::PyValueError; +/// Implements a Datafusion TableProvider that delegates to a PyArrow Dataset +/// This allows us to use PyArrow Datasets as Datafusion tables while pushing down projections and filters +use pyo3::prelude::*; +use pyo3::types::PyType; use crate::dataset_exec::DatasetExec; use crate::pyarrow_filter_expression::PyArrowFilterExpression; // Wraps a pyarrow.dataset.Dataset class and implements a Datafusion TableProvider around it -#[derive(Debug, Clone)] +#[derive(Debug)] pub(crate) struct Dataset { - dataset: PyObject, + dataset: Py, } impl Dataset { // Creates a Python PyArrow.Dataset - pub fn new(dataset: &PyAny, py: Python) -> PyResult { + pub fn new(dataset: &Bound<'_, PyAny>, py: Python) -> PyResult { // Ensure that we were passed an instance of pyarrow.dataset.Dataset let ds = PyModule::import(py, "pyarrow.dataset")?; - let ds_type: &PyType = ds.getattr("Dataset")?.downcast()?; + let ds_attr = ds.getattr("Dataset")?; + let ds_type = ds_attr.cast::()?; if dataset.is_instance(ds_type)? 
{ Ok(Dataset { - dataset: dataset.into(), + dataset: dataset.clone().unbind(), }) } else { Err(PyValueError::new_err( @@ -72,8 +70,8 @@ impl TableProvider for Dataset { /// Get a reference to the schema for this table fn schema(&self) -> SchemaRef { - Python::with_gil(|py| { - let dataset = self.dataset.as_ref(py); + Python::attach(|py| { + let dataset = self.dataset.bind(py); // This can panic but since we checked that self.dataset is a pyarrow.dataset.Dataset it should never Arc::new( dataset @@ -97,7 +95,7 @@ impl TableProvider for Dataset { /// parallelized or distributed. async fn scan( &self, - _ctx: &SessionState, + _ctx: &dyn Session, projection: Option<&Vec>, filters: &[Expr], // limit can be used to reduce the amount scanned @@ -106,9 +104,9 @@ impl TableProvider for Dataset { // The datasource should return *at least* this number of rows if available. _limit: Option, ) -> DFResult> { - Python::with_gil(|py| { + Python::attach(|py| { let plan: Arc = Arc::new( - DatasetExec::new(py, self.dataset.as_ref(py), projection.cloned(), filters) + DatasetExec::new(py, self.dataset.bind(py), projection.cloned(), filters) .map_err(|err| DataFusionError::External(Box::new(err)))?, ); Ok(plan) @@ -117,10 +115,16 @@ impl TableProvider for Dataset { /// Tests whether the table provider can make use of a filter expression /// to optimise data retrieval. 
- fn supports_filter_pushdown(&self, filter: &Expr) -> DFResult { - match PyArrowFilterExpression::try_from(filter) { - Ok(_) => Ok(TableProviderFilterPushDown::Exact), - _ => Ok(TableProviderFilterPushDown::Unsupported), - } + fn supports_filters_pushdown( + &self, + filter: &[&Expr], + ) -> DFResult> { + filter + .iter() + .map(|&f| match PyArrowFilterExpression::try_from(f) { + Ok(_) => Ok(TableProviderFilterPushDown::Exact), + _ => Ok(TableProviderFilterPushDown::Unsupported), + }) + .collect() } } diff --git a/src/dataset_exec.rs b/crates/core/src/dataset_exec.rs similarity index 69% rename from src/dataset_exec.rs rename to crates/core/src/dataset_exec.rs index 859678856..a7dd1500d 100644 --- a/src/dataset_exec.rs +++ b/crates/core/src/dataset_exec.rs @@ -15,32 +15,31 @@ // specific language governing permissions and limitations // under the License. -/// Implements a Datafusion physical ExecutionPlan that delegates to a PyArrow Dataset -/// This actually performs the projection, filtering and scanning of a Dataset -use pyo3::prelude::*; -use pyo3::types::{PyDict, PyIterator, PyList}; - use std::any::Any; use std::sync::Arc; -use futures::{stream, TryStreamExt}; - use datafusion::arrow::datatypes::SchemaRef; -use datafusion::arrow::error::ArrowError; -use datafusion::arrow::error::Result as ArrowResult; +use datafusion::arrow::error::{ArrowError, Result as ArrowResult}; use datafusion::arrow::pyarrow::PyArrowType; use datafusion::arrow::record_batch::RecordBatch; use datafusion::error::{DataFusionError as InnerDataFusionError, Result as DFResult}; use datafusion::execution::context::TaskContext; -use datafusion::physical_expr::PhysicalSortExpr; +use datafusion::logical_expr::Expr; +use datafusion::logical_expr::utils::conjunction; +use datafusion::physical_expr::{EquivalenceProperties, LexOrdering}; +use datafusion::physical_plan::execution_plan::{Boundedness, EmissionType}; use datafusion::physical_plan::stream::RecordBatchStreamAdapter; use 
datafusion::physical_plan::{ - DisplayFormatType, ExecutionPlan, Partitioning, SendableRecordBatchStream, Statistics, + DisplayAs, DisplayFormatType, ExecutionPlan, ExecutionPlanProperties, Partitioning, + PlanProperties, SendableRecordBatchStream, Statistics, }; -use datafusion_expr::Expr; -use datafusion_optimizer::utils::conjunction; +use futures::{TryStreamExt, stream}; +/// Implements a Datafusion physical ExecutionPlan that delegates to a PyArrow Dataset +/// This actually performs the projection, filtering and scanning of a Dataset +use pyo3::prelude::*; +use pyo3::types::{PyDict, PyIterator, PyList}; -use crate::errors::DataFusionError; +use crate::errors::PyDataFusionResult; use crate::pyarrow_filter_expression::PyArrowFilterExpression; struct PyArrowBatchesAdapter { @@ -51,8 +50,8 @@ impl Iterator for PyArrowBatchesAdapter { type Item = ArrowResult; fn next(&mut self) -> Option { - Python::with_gil(|py| { - let mut batches: &PyIterator = self.batches.as_ref(py); + Python::attach(|py| { + let mut batches = self.batches.clone_ref(py).into_bound(py); Some( batches .next()? 
@@ -64,24 +63,25 @@ impl Iterator for PyArrowBatchesAdapter { } // Wraps a pyarrow.dataset.Dataset class and implements a Datafusion ExecutionPlan around it -#[derive(Debug, Clone)] +#[derive(Debug)] pub(crate) struct DatasetExec { - dataset: PyObject, + dataset: Py, schema: SchemaRef, fragments: Py, columns: Option>, - filter_expr: Option, + filter_expr: Option>, projected_statistics: Statistics, + plan_properties: Arc, } impl DatasetExec { pub fn new( py: Python, - dataset: &PyAny, + dataset: &Bound<'_, PyAny>, projection: Option>, filters: &[Expr], - ) -> Result { - let columns: Option, DataFusionError>> = projection.map(|p| { + ) -> PyDataFusionResult { + let columns: Option>> = projection.map(|p| { p.iter() .map(|index| { let name: String = dataset @@ -94,7 +94,7 @@ impl DatasetExec { .collect() }); let columns: Option> = columns.transpose()?; - let filter_expr: Option = conjunction(filters.to_owned()) + let filter_expr: Option> = conjunction(filters.to_owned()) .map(|filters| { PyArrowFilterExpression::try_from(&filters) .map(|filter_expr| filter_expr.inner().clone_ref(py)) @@ -109,9 +109,9 @@ impl DatasetExec { filter_expr.as_ref().map(|expr| expr.clone_ref(py)), )?; - let scanner = dataset.call_method("scanner", (), Some(kwargs))?; + let scanner = dataset.call_method("scanner", (), Some(&kwargs))?; - let schema = Arc::new( + let schema: SchemaRef = Arc::new( scanner .getattr("projected_schema")? .extract::>()? @@ -122,28 +122,40 @@ impl DatasetExec { let pylist = builtins.getattr("list")?; // Get the fragments or partitions of the dataset - let fragments_iterator: &PyAny = dataset.call_method1( + let fragments_iterator: Bound<'_, PyAny> = dataset.call_method1( "get_fragments", (filter_expr.as_ref().map(|expr| expr.clone_ref(py)),), )?; - let fragments: &PyList = pylist - .call1((fragments_iterator,))? 
- .downcast() - .map_err(PyErr::from)?; + let fragments_iter = pylist.call1((fragments_iterator,))?; + let fragments = fragments_iter.cast::().map_err(PyErr::from)?; + + let projected_statistics = Statistics::new_unknown(&schema); + let plan_properties = Arc::new(PlanProperties::new( + EquivalenceProperties::new(schema.clone()), + Partitioning::UnknownPartitioning(fragments.len()), + EmissionType::Final, + Boundedness::Bounded, + )); Ok(DatasetExec { - dataset: dataset.into(), + dataset: dataset.clone().unbind(), schema, - fragments: fragments.into(), + fragments: fragments.clone().unbind(), columns, filter_expr, - projected_statistics: Default::default(), + projected_statistics, + plan_properties, }) } } impl ExecutionPlan for DatasetExec { + fn name(&self) -> &str { + // [ExecutionPlan::name] docs recommends forwarding to `static_name` + Self::static_name() + } + /// Return a reference to Any that can be used for downcasting fn as_any(&self) -> &dyn Any { self @@ -154,19 +166,7 @@ impl ExecutionPlan for DatasetExec { self.schema.clone() } - /// Get the output partitioning of this plan - fn output_partitioning(&self) -> Partitioning { - Python::with_gil(|py| { - let fragments = self.fragments.as_ref(py); - Partitioning::UnknownPartitioning(fragments.len()) - }) - } - - fn output_ordering(&self) -> Option<&[PhysicalSortExpr]> { - None - } - - fn children(&self) -> Vec> { + fn children(&self) -> Vec<&Arc> { // this is a leaf node and has no children vec![] } @@ -184,9 +184,9 @@ impl ExecutionPlan for DatasetExec { context: Arc, ) -> DFResult { let batch_size = context.session_config().batch_size(); - Python::with_gil(|py| { - let dataset = self.dataset.as_ref(py); - let fragments = self.fragments.as_ref(py); + Python::attach(|py| { + let dataset = self.dataset.bind(py); + let fragments = self.fragments.bind(py); let fragment = fragments .get_item(partition) .map_err(|err| InnerDataFusionError::External(Box::new(err)))?; @@ -209,7 +209,7 @@ impl ExecutionPlan for 
DatasetExec { .set_item("batch_size", batch_size) .map_err(|err| InnerDataFusionError::External(Box::new(err)))?; let scanner = fragment - .call_method("scanner", (dataset_schema,), Some(kwargs)) + .call_method("scanner", (dataset_schema,), Some(&kwargs)) .map_err(|err| InnerDataFusionError::External(Box::new(err)))?; let schema: SchemaRef = Arc::new( scanner @@ -217,10 +217,10 @@ impl ExecutionPlan for DatasetExec { .and_then(|schema| Ok(schema.extract::>()?.0)) .map_err(|err| InnerDataFusionError::External(Box::new(err)))?, ); - let record_batches: &PyIterator = scanner + let record_batches: Bound<'_, PyIterator> = scanner .call_method0("to_batches") .map_err(|err| InnerDataFusionError::External(Box::new(err)))? - .iter() + .try_iter() .map_err(|err| InnerDataFusionError::External(Box::new(err)))?; let record_batches = PyArrowBatchesAdapter { @@ -235,11 +235,46 @@ impl ExecutionPlan for DatasetExec { }) } + fn partition_statistics(&self, _partition: Option) -> DFResult { + Ok(self.projected_statistics.clone()) + } + + fn properties(&self) -> &Arc { + &self.plan_properties + } +} + +impl ExecutionPlanProperties for DatasetExec { + /// Get the output partitioning of this plan + fn output_partitioning(&self) -> &Partitioning { + self.plan_properties.output_partitioning() + } + + fn output_ordering(&self) -> Option<&LexOrdering> { + None + } + + fn boundedness(&self) -> Boundedness { + self.plan_properties.boundedness + } + + fn pipeline_behavior(&self) -> EmissionType { + self.plan_properties.emission_type + } + + fn equivalence_properties(&self) -> &datafusion::physical_expr::EquivalenceProperties { + &self.plan_properties.eq_properties + } +} + +impl DisplayAs for DatasetExec { fn fmt_as(&self, t: DisplayFormatType, f: &mut std::fmt::Formatter) -> std::fmt::Result { - Python::with_gil(|py| { - let number_of_fragments = self.fragments.as_ref(py).len(); + Python::attach(|py| { + let number_of_fragments = self.fragments.bind(py).len(); match t { - 
DisplayFormatType::Default => { + DisplayFormatType::Default + | DisplayFormatType::Verbose + | DisplayFormatType::TreeRender => { let projected_columns: Vec = self .schema .fields() @@ -247,7 +282,7 @@ impl ExecutionPlan for DatasetExec { .map(|x| x.name().to_owned()) .collect(); if let Some(filter_expr) = &self.filter_expr { - let filter_expr = filter_expr.as_ref(py).str().or(Err(std::fmt::Error))?; + let filter_expr = filter_expr.bind(py).str().or(Err(std::fmt::Error))?; write!( f, "DatasetExec: number_of_fragments={}, filter_expr={}, projection=[{}]", @@ -267,8 +302,4 @@ impl ExecutionPlan for DatasetExec { } }) } - - fn statistics(&self) -> Statistics { - self.projected_statistics.clone() - } } diff --git a/crates/core/src/errors.rs b/crates/core/src/errors.rs new file mode 100644 index 000000000..8babc5a56 --- /dev/null +++ b/crates/core/src/errors.rs @@ -0,0 +1,18 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +pub use datafusion_python_util::errors::*; diff --git a/crates/core/src/expr.rs b/crates/core/src/expr.rs new file mode 100644 index 000000000..c4f2a12da --- /dev/null +++ b/crates/core/src/expr.rs @@ -0,0 +1,884 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
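Earlier in this diff, `Dataset::supports_filter_pushdown` was replaced by the plural `supports_filters_pushdown`, which classifies each filter independently instead of all-or-nothing. A hedged Python sketch of that per-filter decision (all names here are illustrative; `can_convert` stands in for `PyArrowFilterExpression::try_from` succeeding):

```python
from enum import Enum

class PushDown(Enum):
    EXACT = "exact"              # filter fully handled by the PyArrow scanner
    UNSUPPORTED = "unsupported"  # DataFusion must re-apply this filter itself

def classify_filters(filters, can_convert):
    # One unconvertible expression no longer disables pushdown for the
    # rest: each filter is classified on its own.
    return [PushDown.EXACT if can_convert(f) else PushDown.UNSUPPORTED
            for f in filters]
```

For example, `classify_filters(["a > 1", "my_udf(a)"], lambda f: "udf" not in f)` marks the comparison `EXACT` and the UDF call `UNSUPPORTED`.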
+ +use std::collections::HashMap; +use std::convert::{From, Into}; +use std::sync::Arc; + +use datafusion::arrow::datatypes::{DataType, Field}; +use datafusion::arrow::pyarrow::PyArrowType; +use datafusion::functions::core::expr_ext::FieldAccessor; +use datafusion::logical_expr::expr::{ + AggregateFunction, AggregateFunctionParams, FieldMetadata, InList, InSubquery, ScalarFunction, + SetComparison, WindowFunction, +}; +use datafusion::logical_expr::utils::exprlist_to_fields; +use datafusion::logical_expr::{ + Between, BinaryExpr, Case, Cast, Expr, ExprFuncBuilder, ExprFunctionExt, Like, LogicalPlan, + Operator, TryCast, WindowFunctionDefinition, col, lit, lit_with_metadata, +}; +use pyo3::IntoPyObjectExt; +use pyo3::basic::CompareOp; +use pyo3::prelude::*; +use window::PyWindowFrame; + +use self::alias::PyAlias; +use self::bool_expr::{ + PyIsFalse, PyIsNotFalse, PyIsNotNull, PyIsNotTrue, PyIsNotUnknown, PyIsNull, PyIsTrue, + PyIsUnknown, PyNegative, PyNot, +}; +use self::like::{PyILike, PyLike, PySimilarTo}; +use self::scalar_variable::PyScalarVariable; +use crate::common::data_type::{DataTypeMap, NullTreatment, PyScalarValue, RexType}; +use crate::errors::{PyDataFusionResult, py_runtime_err, py_type_err, py_unsupported_variant_err}; +use crate::expr::aggregate_expr::PyAggregateFunction; +use crate::expr::binary_expr::PyBinaryExpr; +use crate::expr::column::PyColumn; +use crate::expr::literal::PyLiteral; +use crate::functions::add_builder_fns_to_window; +use crate::pyarrow_util::scalar_to_pyarrow; +use crate::sql::logical::PyLogicalPlan; + +pub mod aggregate; +pub mod aggregate_expr; +pub mod alias; +pub mod analyze; +pub mod between; +pub mod binary_expr; +pub mod bool_expr; +pub mod case; +pub mod cast; +pub mod column; +pub mod conditional_expr; +pub mod copy_to; +pub mod create_catalog; +pub mod create_catalog_schema; +pub mod create_external_table; +pub mod create_function; +pub mod create_index; +pub mod create_memory_table; +pub mod create_view; +pub mod 
describe_table;
+pub mod distinct;
+pub mod dml;
+pub mod drop_catalog_schema;
+pub mod drop_function;
+pub mod drop_table;
+pub mod drop_view;
+pub mod empty_relation;
+pub mod exists;
+pub mod explain;
+pub mod extension;
+pub mod filter;
+pub mod grouping_set;
+pub mod in_list;
+pub mod in_subquery;
+pub mod join;
+pub mod like;
+pub mod limit;
+pub mod literal;
+pub mod logical_node;
+pub mod placeholder;
+pub mod projection;
+pub mod recursive_query;
+pub mod repartition;
+pub mod scalar_subquery;
+pub mod scalar_variable;
+pub mod set_comparison;
+pub mod signature;
+pub mod sort;
+pub mod sort_expr;
+pub mod statement;
+pub mod subquery;
+pub mod subquery_alias;
+pub mod table_scan;
+pub mod union;
+pub mod unnest;
+pub mod unnest_expr;
+pub mod values;
+pub mod window;
+
+use sort_expr::{PySortExpr, to_sort_expressions};
+
+/// A PyExpr that can be used on a DataFrame
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "RawExpr",
+    module = "datafusion.expr",
+    subclass
+)]
+#[derive(Debug, Clone)]
+pub struct PyExpr {
+    pub expr: Expr,
+}
+
+impl From<PyExpr> for Expr {
+    fn from(expr: PyExpr) -> Expr {
+        expr.expr
+    }
+}
+
+impl From<Expr> for PyExpr {
+    fn from(expr: Expr) -> PyExpr {
+        PyExpr { expr }
+    }
+}
+
+/// Convert a list of DataFusion Expr to PyExpr
+pub fn py_expr_list(expr: &[Expr]) -> PyResult<Vec<PyExpr>> {
+    Ok(expr.iter().map(|e| PyExpr::from(e.clone())).collect())
+}
+
+#[pymethods]
+impl PyExpr {
+    /// Return the specific expression
+    fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        Python::attach(|_| match &self.expr {
+            Expr::Alias(alias) => Ok(PyAlias::from(alias.clone()).into_bound_py_any(py)?),
+            Expr::Column(col) => Ok(PyColumn::from(col.clone()).into_bound_py_any(py)?),
+            Expr::ScalarVariable(field, variables) => {
+                Ok(PyScalarVariable::new(field, variables).into_bound_py_any(py)?)
+            }
+            Expr::Like(value) => Ok(PyLike::from(value.clone()).into_bound_py_any(py)?),
+            Expr::Literal(value, metadata) => Ok(PyLiteral::new_with_metadata(
+                value.clone(),
+                metadata.clone(),
+            )
+            .into_bound_py_any(py)?),
+            Expr::BinaryExpr(expr) => Ok(PyBinaryExpr::from(expr.clone()).into_bound_py_any(py)?),
+            Expr::Not(expr) => Ok(PyNot::new(*expr.clone()).into_bound_py_any(py)?),
+            Expr::IsNotNull(expr) => Ok(PyIsNotNull::new(*expr.clone()).into_bound_py_any(py)?),
+            Expr::IsNull(expr) => Ok(PyIsNull::new(*expr.clone()).into_bound_py_any(py)?),
+            Expr::IsTrue(expr) => Ok(PyIsTrue::new(*expr.clone()).into_bound_py_any(py)?),
+            Expr::IsFalse(expr) => Ok(PyIsFalse::new(*expr.clone()).into_bound_py_any(py)?),
+            Expr::IsUnknown(expr) => Ok(PyIsUnknown::new(*expr.clone()).into_bound_py_any(py)?),
+            Expr::IsNotTrue(expr) => Ok(PyIsNotTrue::new(*expr.clone()).into_bound_py_any(py)?),
+            Expr::IsNotFalse(expr) => Ok(PyIsNotFalse::new(*expr.clone()).into_bound_py_any(py)?),
+            Expr::IsNotUnknown(expr) => {
+                Ok(PyIsNotUnknown::new(*expr.clone()).into_bound_py_any(py)?)
+            }
+            Expr::Negative(expr) => Ok(PyNegative::new(*expr.clone()).into_bound_py_any(py)?),
+            Expr::AggregateFunction(expr) => {
+                Ok(PyAggregateFunction::from(expr.clone()).into_bound_py_any(py)?)
+            }
+            Expr::SimilarTo(value) => Ok(PySimilarTo::from(value.clone()).into_bound_py_any(py)?),
+            Expr::Between(value) => {
+                Ok(between::PyBetween::from(value.clone()).into_bound_py_any(py)?)
}
+            Expr::Case(value) => Ok(case::PyCase::from(value.clone()).into_bound_py_any(py)?),
+            Expr::Cast(value) => Ok(cast::PyCast::from(value.clone()).into_bound_py_any(py)?),
+            Expr::TryCast(value) => Ok(cast::PyTryCast::from(value.clone()).into_bound_py_any(py)?),
+            Expr::ScalarFunction(value) => Err(py_unsupported_variant_err(format!(
+                "Converting Expr::ScalarFunction to a Python object is not implemented: {value:?}"
+            ))),
+            Expr::WindowFunction(value) => Err(py_unsupported_variant_err(format!(
+                "Converting Expr::WindowFunction to a Python object is not implemented: {value:?}"
+            ))),
+            Expr::InList(value) => {
+                Ok(in_list::PyInList::from(value.clone()).into_bound_py_any(py)?)
+            }
+            Expr::Exists(value) => Ok(exists::PyExists::from(value.clone()).into_bound_py_any(py)?),
+            Expr::InSubquery(value) => {
+                Ok(in_subquery::PyInSubquery::from(value.clone()).into_bound_py_any(py)?)
+            }
+            Expr::ScalarSubquery(value) => {
+                Ok(scalar_subquery::PyScalarSubquery::from(value.clone()).into_bound_py_any(py)?)
+            }
+            #[allow(deprecated)]
+            Expr::Wildcard { qualifier, options } => Err(py_unsupported_variant_err(format!(
+                "Converting Expr::Wildcard to a Python object is not implemented : {qualifier:?} {options:?}"
+            ))),
+            Expr::GroupingSet(value) => {
+                Ok(grouping_set::PyGroupingSet::from(value.clone()).into_bound_py_any(py)?)
+            }
+            Expr::Placeholder(value) => {
+                Ok(placeholder::PyPlaceholder::from(value.clone()).into_bound_py_any(py)?)
+            }
+            Expr::OuterReferenceColumn(data_type, column) => {
+                Err(py_unsupported_variant_err(format!(
+                    "Converting Expr::OuterReferenceColumn to a Python object is not implemented: {data_type:?} - {column:?}"
+                )))
+            }
+            Expr::Unnest(value) => {
+                Ok(unnest_expr::PyUnnestExpr::from(value.clone()).into_bound_py_any(py)?)
+            }
+            Expr::SetComparison(value) => {
+                Ok(set_comparison::PySetComparison::from(value.clone()).into_bound_py_any(py)?)
+            }
+        })
+    }
+
+    /// Returns the name of this expression as it should appear in a schema.
This name
+    /// will not include any CAST expressions.
+    fn schema_name(&self) -> PyResult<String> {
+        Ok(format!("{}", self.expr.schema_name()))
+    }
+
+    /// Returns a full and complete string representation of this expression.
+    fn canonical_name(&self) -> PyResult<String> {
+        Ok(format!("{}", self.expr))
+    }
+
+    /// Returns the name of the Expr variant.
+    /// Ex: 'IsNotNull', 'Literal', 'BinaryExpr', etc
+    fn variant_name(&self) -> PyResult<&str> {
+        Ok(self.expr.variant_name())
+    }
+
+    fn __richcmp__(&self, other: PyExpr, op: CompareOp) -> PyExpr {
+        let expr = match op {
+            CompareOp::Lt => self.expr.clone().lt(other.expr),
+            CompareOp::Le => self.expr.clone().lt_eq(other.expr),
+            CompareOp::Eq => self.expr.clone().eq(other.expr),
+            CompareOp::Ne => self.expr.clone().not_eq(other.expr),
+            CompareOp::Gt => self.expr.clone().gt(other.expr),
+            CompareOp::Ge => self.expr.clone().gt_eq(other.expr),
+        };
+        expr.into()
+    }
+
+    fn __repr__(&self) -> PyResult<String> {
+        Ok(format!("Expr({})", self.expr))
+    }
+
+    fn __add__(&self, rhs: PyExpr) -> PyResult<PyExpr> {
+        Ok((self.expr.clone() + rhs.expr).into())
+    }
+
+    fn __sub__(&self, rhs: PyExpr) -> PyResult<PyExpr> {
+        Ok((self.expr.clone() - rhs.expr).into())
+    }
+
+    fn __truediv__(&self, rhs: PyExpr) -> PyResult<PyExpr> {
+        Ok((self.expr.clone() / rhs.expr).into())
+    }
+
+    fn __mul__(&self, rhs: PyExpr) -> PyResult<PyExpr> {
+        Ok((self.expr.clone() * rhs.expr).into())
+    }
+
+    fn __mod__(&self, rhs: PyExpr) -> PyResult<PyExpr> {
+        let expr = self.expr.clone() % rhs.expr;
+        Ok(expr.into())
+    }
+
+    fn __and__(&self, rhs: PyExpr) -> PyResult<PyExpr> {
+        Ok(self.expr.clone().and(rhs.expr).into())
+    }
+
+    fn __or__(&self, rhs: PyExpr) -> PyResult<PyExpr> {
+        Ok(self.expr.clone().or(rhs.expr).into())
+    }
+
+    fn __invert__(&self) -> PyResult<PyExpr> {
+        let expr = !self.expr.clone();
+        Ok(expr.into())
+    }
+
+    fn __getitem__(&self, key: &str) -> PyResult<PyExpr> {
+        Ok(self.expr.clone().field(key).into())
+    }
+
+    #[staticmethod]
+    pub fn literal(value: PyScalarValue) -> PyExpr {
+        lit(value.0).into()
+    }
+
#[staticmethod]
+    pub fn literal_with_metadata(
+        value: PyScalarValue,
+        metadata: HashMap<String, String>,
+    ) -> PyExpr {
+        let metadata = FieldMetadata::new(metadata.into_iter().collect());
+        lit_with_metadata(value.0, Some(metadata)).into()
+    }
+
+    #[staticmethod]
+    pub fn column(value: &str) -> PyExpr {
+        col(value).into()
+    }
+
+    /// assign a name to the PyExpr
+    #[pyo3(signature = (name, metadata=None))]
+    pub fn alias(&self, name: &str, metadata: Option<HashMap<String, String>>) -> PyExpr {
+        let metadata = metadata.map(|m| FieldMetadata::new(m.into_iter().collect()));
+        self.expr.clone().alias_with_metadata(name, metadata).into()
+    }
+
+    /// Create a sort PyExpr from an existing PyExpr.
+    #[pyo3(signature = (ascending=true, nulls_first=true))]
+    pub fn sort(&self, ascending: bool, nulls_first: bool) -> PySortExpr {
+        self.expr.clone().sort(ascending, nulls_first).into()
+    }
+
+    pub fn is_null(&self) -> PyExpr {
+        self.expr.clone().is_null().into()
+    }
+
+    pub fn is_not_null(&self) -> PyExpr {
+        self.expr.clone().is_not_null().into()
+    }
+
+    pub fn cast(&self, to: PyArrowType<DataType>) -> PyExpr {
+        // self.expr.cast_to() requires DFSchema to validate that the cast
+        // is supported, omit that for now
+        let expr = Expr::Cast(Cast::new(Box::new(self.expr.clone()), to.0));
+        expr.into()
+    }
+
+    #[pyo3(signature = (low, high, negated=false))]
+    pub fn between(&self, low: PyExpr, high: PyExpr, negated: bool) -> PyExpr {
+        let expr = Expr::Between(Between::new(
+            Box::new(self.expr.clone()),
+            negated,
+            Box::new(low.into()),
+            Box::new(high.into()),
+        ));
+        expr.into()
+    }
+
+    /// A Rex (Row Expression) specifies a single row of data. That specification
+    /// could include user defined functions or types. RexType identifies the row
+    /// as one of the possible valid `RexTypes`.
+    pub fn rex_type(&self) -> PyResult<RexType> {
+        Ok(match self.expr {
+            Expr::Alias(..) => RexType::Alias,
+            Expr::Column(..) => RexType::Reference,
+            Expr::ScalarVariable(..) | Expr::Literal(..) => RexType::Literal,
+            Expr::BinaryExpr { ..
}
+            | Expr::Not(..)
+            | Expr::IsNotNull(..)
+            | Expr::Negative(..)
+            | Expr::IsNull(..)
+            | Expr::Like { .. }
+            | Expr::SimilarTo { .. }
+            | Expr::Between { .. }
+            | Expr::Case { .. }
+            | Expr::Cast { .. }
+            | Expr::TryCast { .. }
+            | Expr::ScalarFunction { .. }
+            | Expr::AggregateFunction { .. }
+            | Expr::WindowFunction { .. }
+            | Expr::InList { .. }
+            | Expr::Exists { .. }
+            | Expr::InSubquery { .. }
+            | Expr::GroupingSet(..)
+            | Expr::IsTrue(..)
+            | Expr::IsFalse(..)
+            | Expr::IsUnknown(_)
+            | Expr::IsNotTrue(..)
+            | Expr::IsNotFalse(..)
+            | Expr::Placeholder { .. }
+            | Expr::OuterReferenceColumn(_, _)
+            | Expr::Unnest(_)
+            | Expr::IsNotUnknown(_)
+            | Expr::SetComparison(_) => RexType::Call,
+            Expr::ScalarSubquery(..) => RexType::ScalarSubquery,
+            #[allow(deprecated)]
+            Expr::Wildcard { .. } => {
+                return Err(py_unsupported_variant_err("Expr::Wildcard is unsupported"));
+            }
+        })
+    }
+
+    /// Given the current `Expr` return the DataTypeMap which represents the
+    /// PythonType, Arrow DataType, and SqlType Enum which represents
+    pub fn types(&self) -> PyResult<DataTypeMap> {
+        Self::_types(&self.expr)
+    }
+
+    /// Extracts the Expr value into a Py<PyAny> that can be shared with Python
+    pub fn python_value<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        match &self.expr {
+            Expr::Literal(scalar_value, _) => scalar_to_pyarrow(scalar_value, py),
+            _ => Err(py_type_err(format!(
+                "Non Expr::Literal encountered in types: {:?}",
+                &self.expr
+            ))),
+        }
+    }
+
+    /// Row expressions, Rex(s), operate on the concept of operands. Different variants of Expressions, Expr(s),
+    /// store those operands in different datastructures. This function examines the Expr variant and returns
+    /// the operands to the calling logic as a Vec of PyExpr instances.
+    pub fn rex_call_operands(&self) -> PyResult<Vec<PyExpr>> {
+        match &self.expr {
+            // Expr variants that are themselves the operand to return
+            Expr::Column(..) | Expr::ScalarVariable(..) | Expr::Literal(..)
=> {
+                Ok(vec![PyExpr::from(self.expr.clone())])
+            }
+
+            Expr::Alias(alias) => Ok(vec![PyExpr::from(*alias.expr.clone())]),
+
+            // Expr(s) that house the Expr instance to return in their bounded params
+            Expr::Not(expr)
+            | Expr::IsNull(expr)
+            | Expr::IsNotNull(expr)
+            | Expr::IsTrue(expr)
+            | Expr::IsFalse(expr)
+            | Expr::IsUnknown(expr)
+            | Expr::IsNotTrue(expr)
+            | Expr::IsNotFalse(expr)
+            | Expr::IsNotUnknown(expr)
+            | Expr::Negative(expr)
+            | Expr::Cast(Cast { expr, .. })
+            | Expr::TryCast(TryCast { expr, .. })
+            | Expr::InSubquery(InSubquery { expr, .. })
+            | Expr::SetComparison(SetComparison { expr, .. }) => {
+                Ok(vec![PyExpr::from(*expr.clone())])
+            }
+
+            // Expr variants containing a collection of Expr(s) for operands
+            Expr::AggregateFunction(AggregateFunction {
+                params: AggregateFunctionParams { args, .. },
+                ..
+            })
+            | Expr::ScalarFunction(ScalarFunction { args, .. }) => {
+                Ok(args.iter().map(|arg| PyExpr::from(arg.clone())).collect())
+            }
+            Expr::WindowFunction(boxed_window_fn) => {
+                let args = &boxed_window_fn.params.args;
+                Ok(args.iter().map(|arg| PyExpr::from(arg.clone())).collect())
+            }
+
+            // Expr(s) that require more specific processing
+            Expr::Case(Case {
+                expr,
+                when_then_expr,
+                else_expr,
+            }) => {
+                let mut operands: Vec<PyExpr> = Vec::new();
+
+                if let Some(e) = expr {
+                    for (when, then) in when_then_expr {
+                        operands.push(PyExpr::from(Expr::BinaryExpr(BinaryExpr::new(
+                            Box::new(*e.clone()),
+                            Operator::Eq,
+                            Box::new(*when.clone()),
+                        ))));
+                        operands.push(PyExpr::from(*then.clone()));
+                    }
+                } else {
+                    for (when, then) in when_then_expr {
+                        operands.push(PyExpr::from(*when.clone()));
+                        operands.push(PyExpr::from(*then.clone()));
+                    }
+                };
+
+                if let Some(e) = else_expr {
+                    operands.push(PyExpr::from(*e.clone()));
+                };
+
+                Ok(operands)
+            }
+            Expr::InList(InList { expr, list, ..
}) => {
+                let mut operands: Vec<PyExpr> = vec![PyExpr::from(*expr.clone())];
+                for list_elem in list {
+                    operands.push(PyExpr::from(list_elem.clone()));
+                }
+
+                Ok(operands)
+            }
+            Expr::BinaryExpr(BinaryExpr { left, right, .. }) => Ok(vec![
+                PyExpr::from(*left.clone()),
+                PyExpr::from(*right.clone()),
+            ]),
+            Expr::Like(Like { expr, pattern, .. }) => Ok(vec![
+                PyExpr::from(*expr.clone()),
+                PyExpr::from(*pattern.clone()),
+            ]),
+            Expr::SimilarTo(Like { expr, pattern, .. }) => Ok(vec![
+                PyExpr::from(*expr.clone()),
+                PyExpr::from(*pattern.clone()),
+            ]),
+            Expr::Between(Between {
+                expr,
+                negated: _,
+                low,
+                high,
+            }) => Ok(vec![
+                PyExpr::from(*expr.clone()),
+                PyExpr::from(*low.clone()),
+                PyExpr::from(*high.clone()),
+            ]),
+
+            // Currently un-support/implemented Expr types for Rex Call operations
+            Expr::GroupingSet(..)
+            | Expr::Unnest(_)
+            | Expr::OuterReferenceColumn(_, _)
+            | Expr::ScalarSubquery(..)
+            | Expr::Placeholder { .. }
+            | Expr::Exists { .. } => Err(py_runtime_err(format!(
+                "Unimplemented Expr type: {}",
+                self.expr
+            ))),
+
+            #[allow(deprecated)]
+            Expr::Wildcard { .. } => {
+                Err(py_unsupported_variant_err("Expr::Wildcard is unsupported"))
+            }
+        }
+    }
+
+    /// Extracts the operator associated with a RexType::Call
+    pub fn rex_call_operator(&self) -> PyResult<String> {
+        Ok(match &self.expr {
+            Expr::BinaryExpr(BinaryExpr {
+                left: _,
+                op,
+                right: _,
+            }) => format!("{op}"),
+            Expr::ScalarFunction(ScalarFunction { func, args: _ }) => func.name().to_string(),
+            Expr::Cast { .. } => "cast".to_string(),
+            Expr::Between { .. } => "between".to_string(),
+            Expr::Case { .. } => "case".to_string(),
+            Expr::IsNull(..) => "is null".to_string(),
+            Expr::IsNotNull(..)
=> "is not null".to_string(),
+            Expr::IsTrue(_) => "is true".to_string(),
+            Expr::IsFalse(_) => "is false".to_string(),
+            Expr::IsUnknown(_) => "is unknown".to_string(),
+            Expr::IsNotTrue(_) => "is not true".to_string(),
+            Expr::IsNotFalse(_) => "is not false".to_string(),
+            Expr::IsNotUnknown(_) => "is not unknown".to_string(),
+            Expr::InList { .. } => "in list".to_string(),
+            Expr::Negative(..) => "negative".to_string(),
+            Expr::Not(..) => "not".to_string(),
+            Expr::Like(Like {
+                negated,
+                case_insensitive,
+                ..
+            }) => {
+                let name = if *case_insensitive { "ilike" } else { "like" };
+                if *negated {
+                    format!("not {name}")
+                } else {
+                    name.to_string()
+                }
+            }
+            Expr::SimilarTo(Like { negated, .. }) => {
+                if *negated {
+                    "not similar to".to_string()
+                } else {
+                    "similar to".to_string()
+                }
+            }
+            _ => {
+                return Err(py_type_err(format!(
+                    "Catch all triggered in get_operator_name: {:?}",
+                    &self.expr
+                )));
+            }
+        })
+    }
+
+    pub fn column_name(&self, plan: PyLogicalPlan) -> PyResult<String> {
+        self._column_name(&plan.plan()).map_err(py_runtime_err)
+    }
+
+    // Expression Function Builder functions
+
+    pub fn order_by(&self, order_by: Vec<PySortExpr>) -> PyExprFuncBuilder {
+        self.expr
+            .clone()
+            .order_by(to_sort_expressions(order_by))
+            .into()
+    }
+
+    pub fn filter(&self, filter: PyExpr) -> PyExprFuncBuilder {
+        self.expr.clone().filter(filter.expr.clone()).into()
+    }
+
+    pub fn distinct(&self) -> PyExprFuncBuilder {
+        self.expr.clone().distinct().into()
+    }
+
+    pub fn null_treatment(&self, null_treatment: NullTreatment) -> PyExprFuncBuilder {
+        self.expr
+            .clone()
+            .null_treatment(Some(null_treatment.into()))
+            .into()
+    }
+
+    pub fn partition_by(&self, partition_by: Vec<PyExpr>) -> PyExprFuncBuilder {
+        let partition_by = partition_by.iter().map(|e| e.expr.clone()).collect();
+        self.expr.clone().partition_by(partition_by).into()
+    }
+
+    pub fn window_frame(&self, window_frame: PyWindowFrame) -> PyExprFuncBuilder {
+        self.expr.clone().window_frame(window_frame.into()).into()
+    }
+
#[pyo3(signature = (partition_by=None, window_frame=None, order_by=None, null_treatment=None))]
+    pub fn over(
+        &self,
+        partition_by: Option<Vec<PyExpr>>,
+        window_frame: Option<PyWindowFrame>,
+        order_by: Option<Vec<PySortExpr>>,
+        null_treatment: Option<NullTreatment>,
+    ) -> PyDataFusionResult<PyExpr> {
+        match &self.expr {
+            Expr::AggregateFunction(agg_fn) => {
+                let window_fn = Expr::WindowFunction(Box::new(WindowFunction::new(
+                    WindowFunctionDefinition::AggregateUDF(agg_fn.func.clone()),
+                    agg_fn.params.args.clone(),
+                )));
+
+                add_builder_fns_to_window(
+                    window_fn,
+                    partition_by,
+                    window_frame,
+                    order_by,
+                    null_treatment,
+                )
+            }
+            Expr::WindowFunction(_) => add_builder_fns_to_window(
+                self.expr.clone(),
+                partition_by,
+                window_frame,
+                order_by,
+                null_treatment,
+            ),
+            _ => Err(datafusion::error::DataFusionError::Plan(format!(
+                "Using {} with `over` is not allowed. Must use an aggregate or window function.",
+                self.expr.variant_name()
+            ))
+            .into()),
+        }
+    }
+}
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "ExprFuncBuilder",
+    module = "datafusion.expr",
+    subclass
+)]
+#[derive(Debug, Clone)]
+pub struct PyExprFuncBuilder {
+    pub builder: ExprFuncBuilder,
+}
+
+impl From<ExprFuncBuilder> for PyExprFuncBuilder {
+    fn from(builder: ExprFuncBuilder) -> Self {
+        Self { builder }
+    }
+}
+
+#[pymethods]
+impl PyExprFuncBuilder {
+    pub fn order_by(&self, order_by: Vec<PySortExpr>) -> PyExprFuncBuilder {
+        self.builder
+            .clone()
+            .order_by(to_sort_expressions(order_by))
+            .into()
+    }
+
+    pub fn filter(&self, filter: PyExpr) -> PyExprFuncBuilder {
+        self.builder.clone().filter(filter.expr.clone()).into()
+    }
+
+    pub fn distinct(&self) -> PyExprFuncBuilder {
+        self.builder.clone().distinct().into()
+    }
+
+    pub fn null_treatment(&self, null_treatment: NullTreatment) -> PyExprFuncBuilder {
+        self.builder
+            .clone()
+            .null_treatment(Some(null_treatment.into()))
+            .into()
+    }
+
+    pub fn partition_by(&self, partition_by: Vec<PyExpr>) -> PyExprFuncBuilder {
+        let partition_by = partition_by.iter().map(|e| e.expr.clone()).collect();
+
self.builder.clone().partition_by(partition_by).into()
+    }
+
+    pub fn window_frame(&self, window_frame: PyWindowFrame) -> PyExprFuncBuilder {
+        self.builder
+            .clone()
+            .window_frame(window_frame.into())
+            .into()
+    }
+
+    pub fn build(&self) -> PyDataFusionResult<PyExpr> {
+        Ok(self.builder.clone().build().map(|expr| expr.into())?)
+    }
+}
+
+impl PyExpr {
+    pub fn _column_name(&self, plan: &LogicalPlan) -> PyDataFusionResult<String> {
+        let field = Self::expr_to_field(&self.expr, plan)?;
+        Ok(field.name().to_owned())
+    }
+
+    /// Create a [Field] representing an [Expr], given an input [LogicalPlan] to resolve against
+    pub fn expr_to_field(expr: &Expr, input_plan: &LogicalPlan) -> PyDataFusionResult<Arc<Field>> {
+        let fields = exprlist_to_fields(std::slice::from_ref(expr), input_plan)?;
+        Ok(fields[0].1.clone())
+    }
+    fn _types(expr: &Expr) -> PyResult<DataTypeMap> {
+        match expr {
+            Expr::BinaryExpr(BinaryExpr {
+                left: _,
+                op,
+                right: _,
+            }) => match op {
+                Operator::Eq
+                | Operator::NotEq
+                | Operator::Lt
+                | Operator::LtEq
+                | Operator::Gt
+                | Operator::GtEq
+                | Operator::And
+                | Operator::Or
+                | Operator::IsDistinctFrom
+                | Operator::IsNotDistinctFrom
+                | Operator::RegexMatch
+                | Operator::RegexIMatch
+                | Operator::RegexNotMatch
+                | Operator::RegexNotIMatch
+                | Operator::LikeMatch
+                | Operator::ILikeMatch
+                | Operator::NotLikeMatch
+                | Operator::NotILikeMatch => DataTypeMap::map_from_arrow_type(&DataType::Boolean),
+                Operator::Plus | Operator::Minus | Operator::Multiply | Operator::Modulo => {
+                    DataTypeMap::map_from_arrow_type(&DataType::Int64)
+                }
+                Operator::Divide => DataTypeMap::map_from_arrow_type(&DataType::Float64),
+                Operator::StringConcat => DataTypeMap::map_from_arrow_type(&DataType::Utf8),
+                Operator::BitwiseShiftLeft
+                | Operator::BitwiseShiftRight
+                | Operator::BitwiseXor
+                | Operator::BitwiseAnd
+                | Operator::BitwiseOr => DataTypeMap::map_from_arrow_type(&DataType::Binary),
+                Operator::AtArrow
+                | Operator::ArrowAt
+                | Operator::Arrow
+                | Operator::LongArrow
+                |
Operator::HashArrow + | Operator::HashLongArrow + | Operator::AtAt + | Operator::IntegerDivide + | Operator::HashMinus + | Operator::AtQuestion + | Operator::Question + | Operator::QuestionAnd + | Operator::QuestionPipe + | Operator::Colon => Err(py_type_err(format!("Unsupported expr: ${op}"))), + }, + Expr::Cast(Cast { expr: _, data_type }) => DataTypeMap::map_from_arrow_type(data_type), + Expr::Literal(scalar_value, _) => DataTypeMap::map_from_scalar_value(scalar_value), + _ => Err(py_type_err(format!( + "Non Expr::Literal encountered in types: {expr:?}" + ))), + } + } +} + +/// Initializes the `expr` module to match the pattern of `datafusion-expr` https://docs.rs/datafusion-expr/latest/datafusion_expr/ +pub(crate) fn init_module(m: &Bound<'_, PyModule>) -> PyResult<()> { + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + 
m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + + Ok(()) +} diff --git a/src/expr/aggregate.rs b/crates/core/src/expr/aggregate.rs similarity index 57% rename from src/expr/aggregate.rs rename to crates/core/src/expr/aggregate.rs index c3de9673a..5a6a771a7 100644 --- a/src/expr/aggregate.rs +++ b/crates/core/src/expr/aggregate.rs @@ -15,17 +15,28 @@ // specific language governing permissions and limitations // under the License. -use datafusion_common::DataFusionError; -use datafusion_expr::logical_plan::Aggregate; -use pyo3::prelude::*; use std::fmt::{self, Display, Formatter}; +use datafusion::common::DataFusionError; +use datafusion::logical_expr::Expr; +use datafusion::logical_expr::expr::{AggregateFunction, AggregateFunctionParams, Alias}; +use datafusion::logical_expr::logical_plan::Aggregate; +use pyo3::IntoPyObjectExt; +use pyo3::prelude::*; + use super::logical_node::LogicalNode; use crate::common::df_schema::PyDFSchema; +use crate::errors::py_type_err; use crate::expr::PyExpr; use crate::sql::logical::PyLogicalPlan; -#[pyclass(name = "Aggregate", module = "datafusion.expr", subclass)] +#[pyclass( + from_py_object, + frozen, + name = "Aggregate", + module = "datafusion.expr", + subclass +)] #[derive(Clone)] pub struct PyAggregate { aggregate: Aggregate, @@ -84,6 +95,24 @@ impl PyAggregate { .collect()) } + /// Returns the inner Aggregate Expr(s) + pub fn agg_expressions(&self) -> PyResult> { + Ok(self + .aggregate + .aggr_expr + .iter() + .map(|e| PyExpr::from(e.clone())) + .collect()) + } + + pub fn 
agg_func_name(&self, expr: PyExpr) -> PyResult<String> {
+        Self::_agg_func_name(&expr.expr)
+    }
+
+    pub fn aggregation_arguments(&self, expr: PyExpr) -> PyResult<Vec<PyExpr>> {
+        self._aggregation_arguments(&expr.expr)
+    }
+
     // Retrieves the input `LogicalPlan` to this `Aggregate` node
     fn input(&self) -> PyResult<Vec<PyLogicalPlan>> {
         Ok(Self::inputs(self))
@@ -95,7 +124,35 @@ impl PyAggregate {
     }
 
     fn __repr__(&self) -> PyResult<String> {
-        Ok(format!("Aggregate({})", self))
+        Ok(format!("Aggregate({self})"))
+    }
+}
+
+impl PyAggregate {
+    #[allow(clippy::only_used_in_recursion)]
+    fn _aggregation_arguments(&self, expr: &Expr) -> PyResult<Vec<PyExpr>> {
+        match expr {
+            // TODO: This Alias logic seems to be returning some strange results that we should investigate
+            Expr::Alias(Alias { expr, .. }) => self._aggregation_arguments(expr.as_ref()),
+            Expr::AggregateFunction(AggregateFunction {
+                func: _,
+                params: AggregateFunctionParams { args, .. },
+                ..
+            }) => Ok(args.iter().map(|e| PyExpr::from(e.clone())).collect()),
+            _ => Err(py_type_err(
+                "Encountered a non Aggregate type in aggregation_arguments",
+            )),
+        }
+    }
+
+    fn _agg_func_name(expr: &Expr) -> PyResult<String> {
+        match expr {
+            Expr::Alias(Alias { expr, .. }) => Self::_agg_func_name(expr.as_ref()),
+            Expr::AggregateFunction(AggregateFunction { func, ..
}) => Ok(func.name().to_owned()),
+            _ => Err(py_type_err(
+                "Encountered a non Aggregate type in agg_func_name",
+            )),
+        }
     }
 }
@@ -104,7 +161,7 @@ impl LogicalNode for PyAggregate {
         vec![PyLogicalPlan::from((*self.aggregate.input).clone())]
     }
 
-    fn to_variant(&self, py: Python) -> PyResult<PyObject> {
-        Ok(self.clone().into_py(py))
+    fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.clone().into_bound_py_any(py)
     }
 }
diff --git a/src/expr/aggregate_expr.rs b/crates/core/src/expr/aggregate_expr.rs
similarity index 77%
rename from src/expr/aggregate_expr.rs
rename to crates/core/src/expr/aggregate_expr.rs
index 180105180..88e47999f 100644
--- a/src/expr/aggregate_expr.rs
+++ b/crates/core/src/expr/aggregate_expr.rs
@@ -15,12 +15,20 @@
 // specific language governing permissions and limitations
 // under the License.
 
-use crate::expr::PyExpr;
-use datafusion_expr::expr::AggregateFunction;
-use pyo3::prelude::*;
 use std::fmt::{Display, Formatter};
 
-#[pyclass(name = "AggregateFunction", module = "datafusion.expr", subclass)]
+use datafusion::logical_expr::expr::AggregateFunction;
+use pyo3::prelude::*;
+
+use crate::expr::PyExpr;
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "AggregateFunction",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone)]
 pub struct PyAggregateFunction {
     aggr: AggregateFunction,
@@ -40,8 +48,14 @@ impl From<AggregateFunction> for PyAggregateFunction {
 
 impl Display for PyAggregateFunction {
     fn fmt(&self, f: &mut Formatter) -> std::fmt::Result {
-        let args: Vec<String> = self.aggr.args.iter().map(|expr| expr.to_string()).collect();
-        write!(f, "{}({})", self.aggr.fun, args.join(", "))
+        let args: Vec<String> = self
+            .aggr
+            .params
+            .args
+            .iter()
+            .map(|expr| expr.to_string())
+            .collect();
+        write!(f, "{}({})", self.aggr.func.name(), args.join(", "))
     }
 }
 
@@ -49,17 +63,18 @@ impl Display for PyAggregateFunction {
 impl PyAggregateFunction {
     /// Get the aggregate type, such as "MIN", or "MAX"
     fn aggregate_type(&self) -> String {
-        format!("{}", self.aggr.fun)
+
self.aggr.func.name().to_string()
     }
 
     /// is this a distinct aggregate such as `COUNT(DISTINCT expr)`
     fn is_distinct(&self) -> bool {
-        self.aggr.distinct
+        self.aggr.params.distinct
     }
 
     /// Get the arguments to the aggregate function
     fn args(&self) -> Vec<PyExpr> {
         self.aggr
+            .params
             .args
             .iter()
             .map(|expr| PyExpr::from(expr.clone()))
@@ -68,6 +83,6 @@ impl PyAggregateFunction {
 
     /// Get a String representation of this column
     fn __repr__(&self) -> String {
-        format!("{}", self)
+        format!("{self}")
     }
 }
diff --git a/src/expr/alias.rs b/crates/core/src/expr/alias.rs
similarity index 73%
rename from src/expr/alias.rs
rename to crates/core/src/expr/alias.rs
index 2ce656342..b76e82e22 100644
--- a/src/expr/alias.rs
+++ b/crates/core/src/expr/alias.rs
@@ -15,17 +15,35 @@
 // specific language governing permissions and limitations
 // under the License.
 
-use crate::expr::PyExpr;
-use pyo3::prelude::*;
 use std::fmt::{self, Display, Formatter};
 
-use datafusion_expr::Expr;
+use datafusion::logical_expr::expr::Alias;
+use pyo3::prelude::*;
+
+use crate::expr::PyExpr;
 
-#[pyclass(name = "Alias", module = "datafusion.expr", subclass)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "Alias",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone)]
 pub struct PyAlias {
-    expr: PyExpr,
-    alias_name: String,
+    alias: Alias,
+}
+
+impl From<Alias> for PyAlias {
+    fn from(alias: Alias) -> Self {
+        Self { alias }
+    }
+}
+
+impl From<PyAlias> for Alias {
+    fn from(py_alias: PyAlias) -> Self {
+        py_alias.alias
+    }
 }
 
 impl Display for PyAlias {
@@ -35,33 +53,24 @@ impl Display for PyAlias {
         "Alias \nExpr: `{:?}` \nAlias Name: `{}`",
-            &self.expr, &self.alias_name
+            &self.alias.expr, &self.alias.name
         )
     }
 }
 
-impl PyAlias {
-    pub fn new(expr: &Expr, alias_name: &String) -> Self {
-        Self {
-            expr: expr.clone().into(),
-            alias_name: alias_name.to_owned(),
-        }
-    }
-}
-
 #[pymethods]
 impl PyAlias {
     /// Retrieve the "name" of the alias
     fn alias(&self) -> PyResult<String> {
-        Ok(self.alias_name.clone())
+
Ok(self.alias.name.clone())
     }
 
     fn expr(&self) -> PyResult<PyExpr> {
-        Ok(self.expr.clone())
+        Ok((*self.alias.expr.clone()).into())
     }
 
     /// Get a String representation of this column
     fn __repr__(&self) -> String {
-        format!("{}", self)
+        format!("{self}")
     }
 }
diff --git a/src/expr/analyze.rs b/crates/core/src/expr/analyze.rs
similarity index 85%
rename from src/expr/analyze.rs
rename to crates/core/src/expr/analyze.rs
index bbec3a808..137765fe1 100644
--- a/src/expr/analyze.rs
+++ b/crates/core/src/expr/analyze.rs
@@ -15,15 +15,23 @@
 // specific language governing permissions and limitations
 // under the License.
 
-use datafusion_expr::logical_plan::Analyze;
-use pyo3::prelude::*;
 use std::fmt::{self, Display, Formatter};
 
+use datafusion::logical_expr::logical_plan::Analyze;
+use pyo3::IntoPyObjectExt;
+use pyo3::prelude::*;
+
 use super::logical_node::LogicalNode;
 use crate::common::df_schema::PyDFSchema;
 use crate::sql::logical::PyLogicalPlan;
 
-#[pyclass(name = "Analyze", module = "datafusion.expr", subclass)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "Analyze",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone)]
 pub struct PyAnalyze {
     analyze: Analyze,
@@ -69,7 +77,7 @@ impl PyAnalyze {
     }
 
     fn __repr__(&self) -> PyResult<String> {
-        Ok(format!("Analyze({})", self))
+        Ok(format!("Analyze({self})"))
     }
 }
 
@@ -78,7 +86,7 @@ impl LogicalNode for PyAnalyze {
         vec![PyLogicalPlan::from((*self.analyze.input).clone())]
     }
 
-    fn to_variant(&self, py: Python) -> PyResult<PyObject> {
-        Ok(self.clone().into_py(py))
+    fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.clone().into_bound_py_any(py)
     }
 }
diff --git a/src/expr/between.rs b/crates/core/src/expr/between.rs
similarity index 91%
rename from src/expr/between.rs
rename to crates/core/src/expr/between.rs
index 9b78b9eeb..6943b6c3b 100644
--- a/src/expr/between.rs
+++ b/crates/core/src/expr/between.rs
@@ -15,12 +15,20 @@
 // specific language governing permissions and limitations
 // under the License.
-use crate::expr::PyExpr; -use datafusion_expr::expr::Between; -use pyo3::prelude::*; use std::fmt::{self, Display, Formatter}; -#[pyclass(name = "Between", module = "datafusion.expr", subclass)] +use datafusion::logical_expr::expr::Between; +use pyo3::prelude::*; + +use crate::expr::PyExpr; + +#[pyclass( + from_py_object, + frozen, + name = "Between", + module = "datafusion.expr", + subclass +)] #[derive(Clone)] pub struct PyBetween { between: Between, @@ -71,6 +79,6 @@ impl PyBetween { } fn __repr__(&self) -> String { - format!("{}", self) + format!("{self}") } } diff --git a/src/expr/binary_expr.rs b/crates/core/src/expr/binary_expr.rs similarity index 90% rename from src/expr/binary_expr.rs rename to crates/core/src/expr/binary_expr.rs index 5f382b770..2326ba705 100644 --- a/src/expr/binary_expr.rs +++ b/crates/core/src/expr/binary_expr.rs @@ -15,11 +15,18 @@ // specific language governing permissions and limitations // under the License. -use crate::expr::PyExpr; -use datafusion_expr::BinaryExpr; +use datafusion::logical_expr::BinaryExpr; use pyo3::prelude::*; -#[pyclass(name = "BinaryExpr", module = "datafusion.expr", subclass)] +use crate::expr::PyExpr; + +#[pyclass( + from_py_object, + frozen, + name = "BinaryExpr", + module = "datafusion.expr", + subclass +)] #[derive(Clone)] pub struct PyBinaryExpr { expr: BinaryExpr, diff --git a/src/expr/bool_expr.rs b/crates/core/src/expr/bool_expr.rs similarity index 82% rename from src/expr/bool_expr.rs rename to crates/core/src/expr/bool_expr.rs index d1502a4eb..9e374c7e2 100644 --- a/src/expr/bool_expr.rs +++ b/crates/core/src/expr/bool_expr.rs @@ -15,13 +15,20 @@ // specific language governing permissions and limitations // under the License. 
-use datafusion_expr::Expr;
-use pyo3::prelude::*;
 use std::fmt::{self, Display, Formatter};

+use datafusion::logical_expr::Expr;
+use pyo3::prelude::*;
+
 use super::PyExpr;

-#[pyclass(name = "Not", module = "datafusion.expr", subclass)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "Not",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone, Debug)]
 pub struct PyNot {
     expr: Expr,
@@ -51,7 +58,13 @@ impl PyNot {
     }
 }

-#[pyclass(name = "IsNotNull", module = "datafusion.expr", subclass)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "IsNotNull",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone, Debug)]
 pub struct PyIsNotNull {
     expr: Expr,
@@ -81,7 +94,13 @@ impl PyIsNotNull {
     }
 }

-#[pyclass(name = "IsNull", module = "datafusion.expr", subclass)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "IsNull",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone, Debug)]
 pub struct PyIsNull {
     expr: Expr,
@@ -111,7 +130,13 @@ impl PyIsNull {
     }
 }

-#[pyclass(name = "IsTrue", module = "datafusion.expr", subclass)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "IsTrue",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone, Debug)]
 pub struct PyIsTrue {
     expr: Expr,
@@ -141,7 +166,13 @@ impl PyIsTrue {
     }
 }

-#[pyclass(name = "IsFalse", module = "datafusion.expr", subclass)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "IsFalse",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone, Debug)]
 pub struct PyIsFalse {
     expr: Expr,
@@ -171,7 +202,13 @@ impl PyIsFalse {
     }
 }

-#[pyclass(name = "IsUnknown", module = "datafusion.expr", subclass)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "IsUnknown",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone, Debug)]
 pub struct PyIsUnknown {
     expr: Expr,
@@ -201,7 +238,13 @@ impl PyIsUnknown {
     }
 }

-#[pyclass(name = "IsNotTrue", module = "datafusion.expr", subclass)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "IsNotTrue",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone, Debug)]
 pub struct PyIsNotTrue {
     expr: Expr,
@@ -231,7 +274,13 @@ impl PyIsNotTrue {
     }
 }

-#[pyclass(name = "IsNotFalse", module = "datafusion.expr", subclass)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "IsNotFalse",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone, Debug)]
 pub struct PyIsNotFalse {
     expr: Expr,
@@ -261,7 +310,13 @@ impl PyIsNotFalse {
     }
 }

-#[pyclass(name = "IsNotUnknown", module = "datafusion.expr", subclass)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "IsNotUnknown",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone, Debug)]
 pub struct PyIsNotUnknown {
     expr: Expr,
@@ -291,7 +346,13 @@ impl PyIsNotUnknown {
     }
 }

-#[pyclass(name = "Negative", module = "datafusion.expr", subclass)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "Negative",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone, Debug)]
 pub struct PyNegative {
     expr: Expr,
diff --git a/src/expr/case.rs b/crates/core/src/expr/case.rs
similarity index 91%
rename from src/expr/case.rs
rename to crates/core/src/expr/case.rs
index 605275376..4f00449d8 100644
--- a/src/expr/case.rs
+++ b/crates/core/src/expr/case.rs
@@ -15,11 +15,18 @@
 // specific language governing permissions and limitations
 // under the License.

-use crate::expr::PyExpr;
-use datafusion_expr::Case;
+use datafusion::logical_expr::Case;
 use pyo3::prelude::*;

-#[pyclass(name = "Case", module = "datafusion.expr", subclass)]
+use crate::expr::PyExpr;
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "Case",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone)]
 pub struct PyCase {
     case: Case,
diff --git a/src/expr/cast.rs b/crates/core/src/expr/cast.rs
similarity index 85%
rename from src/expr/cast.rs
rename to crates/core/src/expr/cast.rs
index a72199876..37d603538 100644
--- a/src/expr/cast.rs
+++ b/crates/core/src/expr/cast.rs
@@ -15,11 +15,19 @@
 // specific language governing permissions and limitations
 // under the License.
-use crate::{common::data_type::PyDataType, expr::PyExpr};
-use datafusion_expr::{Cast, TryCast};
+use datafusion::logical_expr::{Cast, TryCast};
 use pyo3::prelude::*;

-#[pyclass(name = "Cast", module = "datafusion.expr", subclass)]
+use crate::common::data_type::PyDataType;
+use crate::expr::PyExpr;
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "Cast",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone)]
 pub struct PyCast {
     cast: Cast,
@@ -48,7 +56,7 @@ impl PyCast {
     }
 }

-#[pyclass(name = "TryCast", module = "datafusion.expr", subclass)]
+#[pyclass(from_py_object, name = "TryCast", module = "datafusion.expr", subclass)]
 #[derive(Clone)]
 pub struct PyTryCast {
     try_cast: TryCast,
diff --git a/src/expr/column.rs b/crates/core/src/expr/column.rs
similarity index 88%
rename from src/expr/column.rs
rename to crates/core/src/expr/column.rs
index 16b8bce3c..c1238f98a 100644
--- a/src/expr/column.rs
+++ b/crates/core/src/expr/column.rs
@@ -15,10 +15,16 @@
 // specific language governing permissions and limitations
 // under the License.

-use datafusion_common::Column;
+use datafusion::common::Column;
 use pyo3::prelude::*;

-#[pyclass(name = "Column", module = "datafusion.expr", subclass)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "Column",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone)]
 pub struct PyColumn {
     pub col: Column,
@@ -45,7 +51,7 @@ impl PyColumn {

     /// Get the column relation
     fn relation(&self) -> Option<String> {
-        self.col.relation.clone()
+        self.col.relation.as_ref().map(|r| format!("{r}"))
     }

     /// Get the fully-qualified column name
diff --git a/crates/core/src/expr/conditional_expr.rs b/crates/core/src/expr/conditional_expr.rs
new file mode 100644
index 000000000..ea21fdb20
--- /dev/null
+++ b/crates/core/src/expr/conditional_expr.rs
@@ -0,0 +1,84 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use datafusion::logical_expr::conditional_expressions::CaseBuilder;
+use datafusion::prelude::Expr;
+use pyo3::prelude::*;
+
+use crate::errors::PyDataFusionResult;
+use crate::expr::PyExpr;
+
+// TODO(tsaucer) replace this all with CaseBuilder after it implements Clone
+#[derive(Clone, Debug)]
+#[pyclass(
+    from_py_object,
+    name = "CaseBuilder",
+    module = "datafusion.expr",
+    subclass,
+    frozen
+)]
+pub struct PyCaseBuilder {
+    expr: Option<Expr>,
+    when: Vec<Expr>,
+    then: Vec<Expr>,
+}
+
+#[pymethods]
+impl PyCaseBuilder {
+    #[new]
+    pub fn new(expr: Option<PyExpr>) -> Self {
+        Self {
+            expr: expr.map(Into::into),
+            when: vec![],
+            then: vec![],
+        }
+    }
+
+    pub fn when(&self, when: PyExpr, then: PyExpr) -> PyCaseBuilder {
+        let mut case_builder = self.clone();
+        case_builder.when.push(when.into());
+        case_builder.then.push(then.into());
+
+        case_builder
+    }
+
+    fn otherwise(&self, else_expr: PyExpr) -> PyDataFusionResult<PyExpr> {
+        let case_builder = CaseBuilder::new(
+            self.expr.clone().map(Box::new),
+            self.when.clone(),
+            self.then.clone(),
+            Some(Box::new(else_expr.into())),
+        );
+
+        let expr = case_builder.end()?;
+
+        Ok(expr.into())
+    }
+
+    fn end(&self) -> PyDataFusionResult<PyExpr> {
+        let case_builder = CaseBuilder::new(
+            self.expr.clone().map(Box::new),
+            self.when.clone(),
+            self.then.clone(),
+            None,
+        );
+
+        let expr = case_builder.end()?;
+
+        Ok(expr.into())
+    }
+}
diff --git a/crates/core/src/expr/copy_to.rs b/crates/core/src/expr/copy_to.rs
new file mode 100644
index 000000000..78e53cdff
--- /dev/null
+++ b/crates/core/src/expr/copy_to.rs
@@ -0,0 +1,149 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::collections::HashMap;
+use std::fmt::{self, Display, Formatter};
+use std::sync::Arc;
+
+use datafusion::common::file_options::file_type::FileType;
+use datafusion::logical_expr::dml::CopyTo;
+use pyo3::IntoPyObjectExt;
+use pyo3::prelude::*;
+
+use super::logical_node::LogicalNode;
+use crate::sql::logical::PyLogicalPlan;
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "CopyTo",
+    module = "datafusion.expr",
+    subclass
+)]
+#[derive(Clone)]
+pub struct PyCopyTo {
+    copy: CopyTo,
+}
+
+impl From<PyCopyTo> for CopyTo {
+    fn from(copy: PyCopyTo) -> Self {
+        copy.copy
+    }
+}
+
+impl From<CopyTo> for PyCopyTo {
+    fn from(copy: CopyTo) -> PyCopyTo {
+        PyCopyTo { copy }
+    }
+}
+
+impl Display for PyCopyTo {
+    fn fmt(&self, f: &mut Formatter) -> fmt::Result {
+        write!(f, "CopyTo: {:?}", self.copy.output_url)
+    }
+}
+
+impl LogicalNode for PyCopyTo {
+    fn inputs(&self) -> Vec<PyLogicalPlan> {
+        vec![PyLogicalPlan::from((*self.copy.input).clone())]
+    }
+
+    fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.clone().into_bound_py_any(py)
+    }
+}
+
+#[pymethods]
+impl PyCopyTo {
+    #[new]
+    pub fn new(
+        input: PyLogicalPlan,
+        output_url: String,
+        partition_by: Vec<String>,
+        file_type: PyFileType,
+        options: HashMap<String, String>,
+    ) -> Self {
+        PyCopyTo {
+            copy: CopyTo::new(
+                input.plan(),
+                output_url,
+                partition_by,
+                file_type.file_type,
+                options,
+            ),
+        }
+    }
+
+    fn input(&self) -> PyLogicalPlan {
+        PyLogicalPlan::from((*self.copy.input).clone())
+    }
+
+    fn output_url(&self) -> String {
+        self.copy.output_url.clone()
+    }
+
+    fn partition_by(&self) -> Vec<String> {
+        self.copy.partition_by.clone()
+    }
+
+    fn file_type(&self) -> PyFileType {
+        PyFileType {
+            file_type: self.copy.file_type.clone(),
+        }
+    }
+
+    fn options(&self) -> HashMap<String, String> {
+        self.copy.options.clone()
+    }
+
+    fn __repr__(&self) -> PyResult<String> {
+        Ok(format!("CopyTo({self})"))
+    }
+
+    fn __name__(&self) -> PyResult<String> {
+        Ok("CopyTo".to_string())
+    }
+}
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "FileType",
+    module = "datafusion.expr",
+    subclass
+)]
+#[derive(Clone)]
+pub struct PyFileType {
+    file_type: Arc<dyn FileType>,
+}
+
+impl Display for PyFileType {
+    fn fmt(&self, f: &mut Formatter) -> fmt::Result {
+        write!(f, "FileType: {}", self.file_type)
+    }
+}
+
+#[pymethods]
+impl PyFileType {
+    fn __repr__(&self) -> PyResult<String> {
+        Ok(format!("FileType({self})"))
+    }
+
+    fn __name__(&self) -> PyResult<String> {
+        Ok("FileType".to_string())
+    }
+}
diff --git a/crates/core/src/expr/create_catalog.rs b/crates/core/src/expr/create_catalog.rs
new file mode 100644
index 000000000..fa95980c0
--- /dev/null
+++ b/crates/core/src/expr/create_catalog.rs
@@ -0,0 +1,105 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::fmt::{self, Display, Formatter};
+use std::sync::Arc;
+
+use datafusion::logical_expr::CreateCatalog;
+use pyo3::IntoPyObjectExt;
+use pyo3::prelude::*;
+
+use super::logical_node::LogicalNode;
+use crate::common::df_schema::PyDFSchema;
+use crate::sql::logical::PyLogicalPlan;
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "CreateCatalog",
+    module = "datafusion.expr",
+    subclass
+)]
+#[derive(Clone)]
+pub struct PyCreateCatalog {
+    create: CreateCatalog,
+}
+
+impl From<PyCreateCatalog> for CreateCatalog {
+    fn from(create: PyCreateCatalog) -> Self {
+        create.create
+    }
+}
+
+impl From<CreateCatalog> for PyCreateCatalog {
+    fn from(create: CreateCatalog) -> PyCreateCatalog {
+        PyCreateCatalog { create }
+    }
+}
+
+impl Display for PyCreateCatalog {
+    fn fmt(&self, f: &mut Formatter) -> fmt::Result {
+        write!(f, "CreateCatalog: {:?}", self.create.catalog_name)
+    }
+}
+
+#[pymethods]
+impl PyCreateCatalog {
+    #[new]
+    pub fn new(
+        catalog_name: String,
+        if_not_exists: bool,
+        schema: PyDFSchema,
+    ) -> PyResult<Self> {
+        Ok(PyCreateCatalog {
+            create: CreateCatalog {
+                catalog_name,
+                if_not_exists,
+                schema: Arc::new(schema.into()),
+            },
+        })
+    }
+
+    pub fn catalog_name(&self) -> String {
+        self.create.catalog_name.clone()
+    }
+
+    pub fn if_not_exists(&self) -> bool {
+        self.create.if_not_exists
+    }
+
+    pub fn schema(&self) -> PyDFSchema {
+        (*self.create.schema).clone().into()
+    }
+
+    fn __repr__(&self) -> PyResult<String> {
+        Ok(format!("CreateCatalog({self})"))
+    }
+
+    fn __name__(&self) -> PyResult<String> {
+        Ok("CreateCatalog".to_string())
+    }
+}
+
+impl LogicalNode for PyCreateCatalog {
+    fn inputs(&self) -> Vec<PyLogicalPlan> {
+        vec![]
+    }
+
+    fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.clone().into_bound_py_any(py)
+    }
+}
diff --git a/crates/core/src/expr/create_catalog_schema.rs b/crates/core/src/expr/create_catalog_schema.rs
new file mode 100644
index 000000000..d836284a0
--- /dev/null
+++ b/crates/core/src/expr/create_catalog_schema.rs
@@ -0,0 +1,105 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::fmt::{self, Display, Formatter};
+use std::sync::Arc;
+
+use datafusion::logical_expr::CreateCatalogSchema;
+use pyo3::IntoPyObjectExt;
+use pyo3::prelude::*;
+
+use super::logical_node::LogicalNode;
+use crate::common::df_schema::PyDFSchema;
+use crate::sql::logical::PyLogicalPlan;
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "CreateCatalogSchema",
+    module = "datafusion.expr",
+    subclass
+)]
+#[derive(Clone)]
+pub struct PyCreateCatalogSchema {
+    create: CreateCatalogSchema,
+}
+
+impl From<PyCreateCatalogSchema> for CreateCatalogSchema {
+    fn from(create: PyCreateCatalogSchema) -> Self {
+        create.create
+    }
+}
+
+impl From<CreateCatalogSchema> for PyCreateCatalogSchema {
+    fn from(create: CreateCatalogSchema) -> PyCreateCatalogSchema {
+        PyCreateCatalogSchema { create }
+    }
+}
+
+impl Display for PyCreateCatalogSchema {
+    fn fmt(&self, f: &mut Formatter) -> fmt::Result {
+        write!(f, "CreateCatalogSchema: {:?}", self.create.schema_name)
+    }
+}
+
+#[pymethods]
+impl PyCreateCatalogSchema {
+    #[new]
+    pub fn new(
+        schema_name: String,
+        if_not_exists: bool,
+        schema: PyDFSchema,
+    ) -> PyResult<Self> {
+        Ok(PyCreateCatalogSchema {
+            create: CreateCatalogSchema {
+                schema_name,
+                if_not_exists,
+                schema: Arc::new(schema.into()),
+            },
+        })
+    }
+
+    pub fn schema_name(&self) -> String {
+        self.create.schema_name.clone()
+    }
+
+    pub fn if_not_exists(&self) -> bool {
+        self.create.if_not_exists
+    }
+
+    pub fn schema(&self) -> PyDFSchema {
+        (*self.create.schema).clone().into()
+    }
+
+    fn __repr__(&self) -> PyResult<String> {
+        Ok(format!("CreateCatalogSchema({self})"))
+    }
+
+    fn __name__(&self) -> PyResult<String> {
+        Ok("CreateCatalogSchema".to_string())
+    }
+}
+
+impl LogicalNode for PyCreateCatalogSchema {
+    fn inputs(&self) -> Vec<PyLogicalPlan> {
+        vec![]
+    }
+
+    fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.clone().into_bound_py_any(py)
+    }
+}
diff --git a/crates/core/src/expr/create_external_table.rs b/crates/core/src/expr/create_external_table.rs
new file mode 100644
index 000000000..980eea131
--- /dev/null
+++ b/crates/core/src/expr/create_external_table.rs
@@ -0,0 +1,192 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::collections::HashMap;
+use std::fmt::{self, Display, Formatter};
+use std::sync::Arc;
+
+use datafusion::logical_expr::CreateExternalTable;
+use pyo3::IntoPyObjectExt;
+use pyo3::prelude::*;
+
+use super::logical_node::LogicalNode;
+use super::sort_expr::PySortExpr;
+use crate::common::df_schema::PyDFSchema;
+use crate::common::schema::PyConstraints;
+use crate::expr::PyExpr;
+use crate::sql::logical::PyLogicalPlan;
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "CreateExternalTable",
+    module = "datafusion.expr",
+    subclass
+)]
+#[derive(Clone)]
+pub struct PyCreateExternalTable {
+    create: CreateExternalTable,
+}
+
+impl From<PyCreateExternalTable> for CreateExternalTable {
+    fn from(create: PyCreateExternalTable) -> Self {
+        create.create
+    }
+}
+
+impl From<CreateExternalTable> for PyCreateExternalTable {
+    fn from(create: CreateExternalTable) -> PyCreateExternalTable {
+        PyCreateExternalTable { create }
+    }
+}
+
+impl Display for PyCreateExternalTable {
+    fn fmt(&self, f: &mut Formatter) -> fmt::Result {
+        write!(
+            f,
+            "CreateExternalTable: {:?}{}",
+            self.create.name, self.create.constraints
+        )
+    }
+}
+
+#[pymethods]
+impl PyCreateExternalTable {
+    #[allow(clippy::too_many_arguments)]
+    #[new]
+    #[pyo3(signature = (schema, name, location, file_type, table_partition_cols, if_not_exists, or_replace, temporary, order_exprs, unbounded, options, constraints, column_defaults, definition=None))]
+    pub fn new(
+        schema: PyDFSchema,
+        name: String,
+        location: String,
+        file_type: String,
+        table_partition_cols: Vec<String>,
+        if_not_exists: bool,
+        or_replace: bool,
+        temporary: bool,
+        order_exprs: Vec<Vec<PySortExpr>>,
+        unbounded: bool,
+        options: HashMap<String, String>,
+        constraints: PyConstraints,
+        column_defaults: HashMap<String, PyExpr>,
+        definition: Option<String>,
+    ) -> Self {
+        let create = CreateExternalTable {
+            schema: Arc::new(schema.into()),
+            name: name.into(),
+            location,
+            file_type,
+            table_partition_cols,
+            if_not_exists,
+            or_replace,
+            temporary,
+            definition,
+            order_exprs: order_exprs
+                .into_iter()
+                .map(|vec| vec.into_iter().map(|s| s.into()).collect::<Vec<_>>())
+                .collect::<Vec<_>>(),
+            unbounded,
+            options,
+            constraints: constraints.constraints,
+            column_defaults: column_defaults
+                .into_iter()
+                .map(|(k, v)| (k, v.into()))
+                .collect(),
+        };
+        PyCreateExternalTable { create }
+    }
+
+    pub fn schema(&self) -> PyDFSchema {
+        (*self.create.schema).clone().into()
+    }
+
+    pub fn name(&self) -> PyResult<String> {
+        Ok(self.create.name.to_string())
+    }
+
+    pub fn location(&self) -> String {
+        self.create.location.clone()
+    }
+
+    pub fn file_type(&self) -> String {
+        self.create.file_type.clone()
+    }
+
+    pub fn table_partition_cols(&self) -> Vec<String> {
+        self.create.table_partition_cols.clone()
+    }
+
+    pub fn if_not_exists(&self) -> bool {
+        self.create.if_not_exists
+    }
+
+    pub fn temporary(&self) -> bool {
+        self.create.temporary
+    }
+
+    pub fn definition(&self) -> Option<String> {
+        self.create.definition.clone()
+    }
+
+    pub fn order_exprs(&self) -> Vec<Vec<PySortExpr>> {
+        self.create
+            .order_exprs
+            .iter()
+            .map(|vec| vec.iter().map(|s| s.clone().into()).collect())
+            .collect()
+    }
+
+    pub fn unbounded(&self) -> bool {
+        self.create.unbounded
+    }
+
+    pub fn options(&self) -> HashMap<String, String> {
+        self.create.options.clone()
+    }
+
+    pub fn constraints(&self) -> PyConstraints {
+        PyConstraints {
+            constraints: self.create.constraints.clone(),
+        }
+    }
+
+    pub fn column_defaults(&self) -> HashMap<String, PyExpr> {
+        self.create
+            .column_defaults
+            .iter()
+            .map(|(k, v)| (k.clone(), v.clone().into()))
+            .collect()
+    }
+
+    fn __repr__(&self) -> PyResult<String> {
+        Ok(format!("CreateExternalTable({self})"))
+    }
+
+    fn __name__(&self) -> PyResult<String> {
+        Ok("CreateExternalTable".to_string())
+    }
+}
+
+impl LogicalNode for PyCreateExternalTable {
+    fn inputs(&self) -> Vec<PyLogicalPlan> {
+        vec![]
+    }
+
+    fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.clone().into_bound_py_any(py)
+    }
+}
diff --git a/crates/core/src/expr/create_function.rs b/crates/core/src/expr/create_function.rs
new file mode 100644
index 000000000..622858913
--- /dev/null
+++ b/crates/core/src/expr/create_function.rs
@@ -0,0 +1,207 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::fmt::{self, Display, Formatter};
+use std::sync::Arc;
+
+use datafusion::logical_expr::{
+    CreateFunction, CreateFunctionBody, OperateFunctionArg, Volatility,
+};
+use pyo3::IntoPyObjectExt;
+use pyo3::prelude::*;
+
+use super::PyExpr;
+use super::logical_node::LogicalNode;
+use crate::common::data_type::PyDataType;
+use crate::common::df_schema::PyDFSchema;
+use crate::sql::logical::PyLogicalPlan;
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "CreateFunction",
+    module = "datafusion.expr",
+    subclass
+)]
+#[derive(Clone)]
+pub struct PyCreateFunction {
+    create: CreateFunction,
+}
+
+impl From<PyCreateFunction> for CreateFunction {
+    fn from(create: PyCreateFunction) -> Self {
+        create.create
+    }
+}
+
+impl From<CreateFunction> for PyCreateFunction {
+    fn from(create: CreateFunction) -> PyCreateFunction {
+        PyCreateFunction { create }
+    }
+}
+
+impl Display for PyCreateFunction {
+    fn fmt(&self, f: &mut Formatter) -> fmt::Result {
+        write!(f, "CreateFunction: name {:?}", self.create.name)
+    }
+}
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "OperateFunctionArg",
+    module = "datafusion.expr",
+    subclass
+)]
+#[derive(Clone)]
+pub struct PyOperateFunctionArg {
+    arg: OperateFunctionArg,
+}
+
+#[derive(Debug, Clone, PartialEq, Eq, Hash, PartialOrd, Ord)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    eq,
+    eq_int,
+    name = "Volatility",
+    module = "datafusion.expr"
+)]
+pub enum PyVolatility {
+    Immutable,
+    Stable,
+    Volatile,
+}
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "CreateFunctionBody",
+    module = "datafusion.expr",
+    subclass
+)]
+#[derive(Clone)]
+pub struct PyCreateFunctionBody {
+    body: CreateFunctionBody,
+}
+
+#[pymethods]
+impl PyCreateFunctionBody {
+    pub fn language(&self) -> Option<String> {
+        self.body
+            .language
+            .as_ref()
+            .map(|language| language.to_string())
+    }
+
+    pub fn behavior(&self) -> Option<PyVolatility> {
+        self.body.behavior.as_ref().map(|behavior| match behavior {
+            Volatility::Immutable => PyVolatility::Immutable,
+            Volatility::Stable => PyVolatility::Stable,
+            Volatility::Volatile => PyVolatility::Volatile,
+        })
+    }
+
+    pub fn function_body(&self) -> Option<PyExpr> {
+        self.body
+            .function_body
+            .as_ref()
+            .map(|function_body| function_body.clone().into())
+    }
+}
+
+#[pymethods]
+impl PyCreateFunction {
+    #[new]
+    #[pyo3(signature = (or_replace, temporary, name, params, schema, return_type=None, args=None))]
+    pub fn new(
+        or_replace: bool,
+        temporary: bool,
+        name: String,
+        params: PyCreateFunctionBody,
+        schema: PyDFSchema,
+        return_type: Option<PyDataType>,
+        args: Option<Vec<PyOperateFunctionArg>>,
+    ) -> Self {
+        PyCreateFunction {
+            create: CreateFunction {
+                or_replace,
+                temporary,
+                name,
+                args: args.map(|args| args.into_iter().map(|arg| arg.arg).collect()),
+                return_type: return_type.map(|return_type| return_type.data_type),
+                params: params.body,
+                schema: Arc::new(schema.into()),
+            },
+        }
+    }
+
+    pub fn or_replace(&self) -> bool {
+        self.create.or_replace
+    }
+
+    pub fn temporary(&self) -> bool {
+        self.create.temporary
+    }
+
+    pub fn name(&self) -> String {
+        self.create.name.clone()
+    }
+
+    pub fn params(&self) -> PyCreateFunctionBody {
+        PyCreateFunctionBody {
+            body: self.create.params.clone(),
+        }
+    }
+
+    pub fn schema(&self) -> PyDFSchema {
+        (*self.create.schema).clone().into()
+    }
+
+    pub fn return_type(&self) -> Option<PyDataType> {
+        self.create
+            .return_type
+            .as_ref()
+            .map(|return_type| return_type.clone().into())
+    }
+
+    pub fn args(&self) -> Option<Vec<PyOperateFunctionArg>> {
+        self.create.args.as_ref().map(|args| {
+            args.iter()
+                .map(|arg| PyOperateFunctionArg { arg: arg.clone() })
+                .collect()
+        })
+    }
+
+    fn __repr__(&self) -> PyResult<String> {
+        Ok(format!("CreateFunction({self})"))
+    }
+
+    fn __name__(&self) -> PyResult<String> {
+        Ok("CreateFunction".to_string())
+    }
+}
+
+impl LogicalNode for PyCreateFunction {
+    fn inputs(&self) -> Vec<PyLogicalPlan> {
+        vec![]
+    }
+
+    fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.clone().into_bound_py_any(py)
+    }
+}
diff --git a/crates/core/src/expr/create_index.rs b/crates/core/src/expr/create_index.rs
new file mode 100644
index 000000000..5f9bd11e8
--- /dev/null
+++ b/crates/core/src/expr/create_index.rs
@@ -0,0 +1,135 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::fmt::{self, Display, Formatter};
+use std::sync::Arc;
+
+use datafusion::logical_expr::CreateIndex;
+use pyo3::IntoPyObjectExt;
+use pyo3::prelude::*;
+
+use super::logical_node::LogicalNode;
+use super::sort_expr::PySortExpr;
+use crate::common::df_schema::PyDFSchema;
+use crate::sql::logical::PyLogicalPlan;
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "CreateIndex",
+    module = "datafusion.expr",
+    subclass
+)]
+#[derive(Clone)]
+pub struct PyCreateIndex {
+    create: CreateIndex,
+}
+
+impl From<PyCreateIndex> for CreateIndex {
+    fn from(create: PyCreateIndex) -> Self {
+        create.create
+    }
+}
+
+impl From<CreateIndex> for PyCreateIndex {
+    fn from(create: CreateIndex) -> PyCreateIndex {
+        PyCreateIndex { create }
+    }
+}
+
+impl Display for PyCreateIndex {
+    fn fmt(&self, f: &mut Formatter) -> fmt::Result {
+        write!(f, "CreateIndex: {:?}", self.create.name)
+    }
+}
+
+#[pymethods]
+impl PyCreateIndex {
+    #[new]
+    #[pyo3(signature = (table, columns, unique, if_not_exists, schema, name=None, using=None))]
+    pub fn new(
+        table: String,
+        columns: Vec<PySortExpr>,
+        unique: bool,
+        if_not_exists: bool,
+        schema: PyDFSchema,
+        name: Option<String>,
+        using: Option<String>,
+    ) -> PyResult<Self> {
+        Ok(PyCreateIndex {
+            create: CreateIndex {
+                name,
+                table: table.into(),
+                using,
+                columns: columns.iter().map(|c| c.clone().into()).collect(),
+                unique,
+                if_not_exists,
+                schema: Arc::new(schema.into()),
+            },
+        })
+    }
+
+    pub fn name(&self) -> Option<String> {
+        self.create.name.clone()
+    }
+
+    pub fn table(&self) -> PyResult<String> {
+        Ok(self.create.table.to_string())
+    }
+
+    pub fn using(&self) -> Option<String> {
+        self.create.using.clone()
+    }
+
+    pub fn columns(&self) -> Vec<PySortExpr> {
+        self.create
+            .columns
+            .iter()
+            .map(|c| c.clone().into())
+            .collect()
+    }
+
+    pub fn unique(&self) -> bool {
+        self.create.unique
+    }
+
+    pub fn if_not_exists(&self) -> bool {
+        self.create.if_not_exists
+    }
+
+    pub fn schema(&self) -> PyDFSchema {
+        (*self.create.schema).clone().into()
+    }
+
+    fn __repr__(&self) -> PyResult<String> {
+        Ok(format!("CreateIndex({self})"))
+    }
+
+    fn __name__(&self) -> PyResult<String> {
+        Ok("CreateIndex".to_string())
+    }
+}
+
+impl LogicalNode for PyCreateIndex {
+    fn inputs(&self) -> Vec<PyLogicalPlan> {
+        vec![]
+    }
+
+    fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.clone().into_bound_py_any(py)
+    }
+}
diff --git a/src/expr/create_memory_table.rs b/crates/core/src/expr/create_memory_table.rs
similarity index 86%
rename from src/expr/create_memory_table.rs
rename to crates/core/src/expr/create_memory_table.rs
index 509bf2168..3214dab0e 100644
--- a/src/expr/create_memory_table.rs
+++ b/crates/core/src/expr/create_memory_table.rs
@@ -17,14 +17,20 @@

 use std::fmt::{self, Display, Formatter};

-use datafusion_expr::CreateMemoryTable;
+use datafusion::logical_expr::CreateMemoryTable;
+use pyo3::IntoPyObjectExt;
 use pyo3::prelude::*;

-use crate::sql::logical::PyLogicalPlan;
-
 use super::logical_node::LogicalNode;
+use crate::sql::logical::PyLogicalPlan;

-#[pyclass(name = "CreateMemoryTable", module = "datafusion.expr", subclass)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "CreateMemoryTable",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone)]
 pub struct PyCreateMemoryTable {
     create: CreateMemoryTable,
@@ -78,7 +84,7 @@ impl PyCreateMemoryTable {
     }

     fn __repr__(&self) -> PyResult<String> {
-        Ok(format!("CreateMemoryTable({})", self))
+        Ok(format!("CreateMemoryTable({self})"))
     }

     fn __name__(&self) -> PyResult<String> {
@@ -91,7 +97,7 @@ impl LogicalNode for PyCreateMemoryTable {
         vec![PyLogicalPlan::from((*self.create.input).clone())]
     }

-    fn to_variant(&self, py: Python) -> PyResult<PyObject> {
-        Ok(self.clone().into_py(py))
+    fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.clone().into_bound_py_any(py)
     }
 }
diff --git a/src/expr/create_view.rs b/crates/core/src/expr/create_view.rs
similarity index 75%
rename from src/expr/create_view.rs
rename to crates/core/src/expr/create_view.rs
index 9d06239ea..6941ef769 100644
--- a/src/expr/create_view.rs
+++ b/crates/core/src/expr/create_view.rs
@@ -17,14 +17,21 @@

 use std::fmt::{self, Display, Formatter};

-use datafusion_expr::CreateView;
+use datafusion::logical_expr::{CreateView, DdlStatement, LogicalPlan};
+use pyo3::IntoPyObjectExt;
 use pyo3::prelude::*;

-use crate::sql::logical::PyLogicalPlan;
-
 use super::logical_node::LogicalNode;
+use crate::errors::py_type_err;
+use crate::sql::logical::PyLogicalPlan;

-#[pyclass(name = "CreateView", module = "datafusion.expr", subclass)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "CreateView",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone)]
 pub struct PyCreateView {
     create: CreateView,
@@ -75,7 +82,7 @@ impl PyCreateView {
     }

     fn __repr__(&self) -> PyResult<String> {
-        Ok(format!("CreateView({})", self))
+        Ok(format!("CreateView({self})"))
     }

     fn __name__(&self) -> PyResult<String> {
@@ -88,7 +95,18 @@ impl LogicalNode for PyCreateView {
         vec![PyLogicalPlan::from((*self.create.input).clone())]
     }

-    fn to_variant(&self, py: Python) -> PyResult<PyObject> {
-        Ok(self.clone().into_py(py))
+    fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.clone().into_bound_py_any(py)
+    }
+}
+
+impl TryFrom<LogicalPlan> for PyCreateView {
+    type Error = PyErr;
+
+    fn try_from(logical_plan: LogicalPlan) -> Result<Self, Self::Error> {
+        match logical_plan {
+            LogicalPlan::Ddl(DdlStatement::CreateView(create)) => Ok(PyCreateView { create }),
+            _ => Err(py_type_err("unexpected plan")),
+        }
     }
 }
diff --git a/crates/core/src/expr/describe_table.rs b/crates/core/src/expr/describe_table.rs
new file mode 100644
index 000000000..73955bb34
--- /dev/null
+++ b/crates/core/src/expr/describe_table.rs
@@ -0,0 +1,98 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::fmt::{self, Display, Formatter};
+use std::sync::Arc;
+
+use arrow::datatypes::Schema;
+use arrow::pyarrow::PyArrowType;
+use datafusion::logical_expr::DescribeTable;
+use pyo3::IntoPyObjectExt;
+use pyo3::prelude::*;
+
+use super::logical_node::LogicalNode;
+use crate::common::df_schema::PyDFSchema;
+use crate::sql::logical::PyLogicalPlan;
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "DescribeTable",
+    module = "datafusion.expr",
+    subclass
+)]
+#[derive(Clone)]
+pub struct PyDescribeTable {
+    describe: DescribeTable,
+}
+
+impl Display for PyDescribeTable {
+    fn fmt(&self, f: &mut Formatter) -> fmt::Result {
+        write!(f, "DescribeTable")
+    }
+}
+
+#[pymethods]
+impl PyDescribeTable {
+    #[new]
+    fn new(schema: PyArrowType<Schema>, output_schema: PyDFSchema) -> Self {
+        Self {
+            describe: DescribeTable {
+                schema: Arc::new(schema.0),
+                output_schema: Arc::new(output_schema.into()),
+            },
+        }
+    }
+
+    pub fn schema(&self) -> PyArrowType<Schema> {
+        (*self.describe.schema).clone().into()
+    }
+
+    pub fn output_schema(&self) -> PyDFSchema {
+        (*self.describe.output_schema).clone().into()
+    }
+
+    fn __repr__(&self) -> PyResult<String> {
+        Ok(format!("DescribeTable({self})"))
+    }
+
+    fn __name__(&self) -> PyResult<String> {
+        Ok("DescribeTable".to_string())
+    }
+}
+
+impl From<PyDescribeTable> for DescribeTable {
+    fn from(describe: PyDescribeTable) -> Self {
+        describe.describe
+    }
+}
+
+impl From<DescribeTable> for PyDescribeTable {
+    fn
from(describe: DescribeTable) -> PyDescribeTable { + PyDescribeTable { describe } + } +} + +impl LogicalNode for PyDescribeTable { + fn inputs(&self) -> Vec { + vec![] + } + + fn to_variant<'py>(&self, py: Python<'py>) -> PyResult> { + self.clone().into_bound_py_any(py) + } +} diff --git a/src/expr/distinct.rs b/crates/core/src/expr/distinct.rs similarity index 64% rename from src/expr/distinct.rs rename to crates/core/src/expr/distinct.rs index 681ae953b..68c2a17fe 100644 --- a/src/expr/distinct.rs +++ b/crates/core/src/expr/distinct.rs @@ -17,14 +17,20 @@ use std::fmt::{self, Display, Formatter}; -use datafusion_expr::Distinct; +use datafusion::logical_expr::Distinct; +use pyo3::IntoPyObjectExt; use pyo3::prelude::*; -use crate::sql::logical::PyLogicalPlan; - use super::logical_node::LogicalNode; +use crate::sql::logical::PyLogicalPlan; -#[pyclass(name = "Distinct", module = "datafusion.expr", subclass)] +#[pyclass( + from_py_object, + frozen, + name = "Distinct", + module = "datafusion.expr", + subclass +)] #[derive(Clone)] pub struct PyDistinct { distinct: Distinct, @@ -44,12 +50,21 @@ impl From for PyDistinct { impl Display for PyDistinct { fn fmt(&self, f: &mut Formatter) -> fmt::Result { - write!( - f, - "Distinct + match &self.distinct { + Distinct::All(input) => write!( + f, + "Distinct ALL + \nInput: {input:?}", + ), + Distinct::On(distinct_on) => { + write!( + f, + "Distinct ON \nInput: {:?}", - self.distinct.input, - ) + distinct_on.input, + ) + } + } } } @@ -61,7 +76,7 @@ impl PyDistinct { } fn __repr__(&self) -> PyResult { - Ok(format!("Distinct({})", self)) + Ok(format!("Distinct({self})")) } fn __name__(&self) -> PyResult { @@ -71,10 +86,15 @@ impl PyDistinct { impl LogicalNode for PyDistinct { fn inputs(&self) -> Vec { - vec![PyLogicalPlan::from((*self.distinct.input).clone())] + match &self.distinct { + Distinct::All(input) => vec![PyLogicalPlan::from(input.as_ref().clone())], + Distinct::On(distinct_on) => { + 
vec![PyLogicalPlan::from(distinct_on.input.as_ref().clone())] + } + } } - fn to_variant(&self, py: Python) -> PyResult { - Ok(self.clone().into_py(py)) + fn to_variant<'py>(&self, py: Python<'py>) -> PyResult> { + self.clone().into_bound_py_any(py) } } diff --git a/crates/core/src/expr/dml.rs b/crates/core/src/expr/dml.rs new file mode 100644 index 000000000..26f975820 --- /dev/null +++ b/crates/core/src/expr/dml.rs @@ -0,0 +1,149 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+
+use datafusion::logical_expr::dml::InsertOp;
+use datafusion::logical_expr::{DmlStatement, WriteOp};
+use pyo3::IntoPyObjectExt;
+use pyo3::prelude::*;
+
+use super::logical_node::LogicalNode;
+use crate::common::df_schema::PyDFSchema;
+use crate::common::schema::PyTableSource;
+use crate::sql::logical::PyLogicalPlan;
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "DmlStatement",
+    module = "datafusion.expr",
+    subclass
+)]
+#[derive(Clone)]
+pub struct PyDmlStatement {
+    dml: DmlStatement,
+}
+
+impl From<PyDmlStatement> for DmlStatement {
+    fn from(dml: PyDmlStatement) -> Self {
+        dml.dml
+    }
+}
+
+impl From<DmlStatement> for PyDmlStatement {
+    fn from(dml: DmlStatement) -> PyDmlStatement {
+        PyDmlStatement { dml }
+    }
+}
+
+impl LogicalNode for PyDmlStatement {
+    fn inputs(&self) -> Vec<PyLogicalPlan> {
+        vec![PyLogicalPlan::from((*self.dml.input).clone())]
+    }
+
+    fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.clone().into_bound_py_any(py)
+    }
+}
+
+#[pymethods]
+impl PyDmlStatement {
+    pub fn table_name(&self) -> PyResult<String> {
+        Ok(self.dml.table_name.to_string())
+    }
+
+    pub fn target(&self) -> PyResult<PyTableSource> {
+        Ok(PyTableSource {
+            table_source: self.dml.target.clone(),
+        })
+    }
+
+    pub fn op(&self) -> PyWriteOp {
+        self.dml.op.clone().into()
+    }
+
+    pub fn input(&self) -> PyLogicalPlan {
+        PyLogicalPlan {
+            plan: self.dml.input.clone(),
+        }
+    }
+
+    pub fn output_schema(&self) -> PyDFSchema {
+        (*self.dml.output_schema).clone().into()
+    }
+
+    fn __repr__(&self) -> PyResult<String> {
+        Ok("DmlStatement".to_string())
+    }
+
+    fn __name__(&self) -> PyResult<String> {
+        Ok("DmlStatement".to_string())
+    }
+}
+
+#[derive(Debug, Clone, PartialEq, Eq, Hash, PartialOrd, Ord)]
+#[pyclass(
+    from_py_object,
+    eq,
+    eq_int,
+    name = "WriteOp",
+    module = "datafusion.expr"
+)]
+pub enum PyWriteOp {
+    Append,
+    Overwrite,
+    Replace,
+    Update,
+    Delete,
+    Ctas,
+    Truncate,
+}
+
+impl From<WriteOp> for PyWriteOp {
+    fn from(write_op: WriteOp) -> Self {
+        match write_op {
+            WriteOp::Insert(InsertOp::Append) => PyWriteOp::Append,
+            WriteOp::Insert(InsertOp::Overwrite) => PyWriteOp::Overwrite,
+            WriteOp::Insert(InsertOp::Replace) => PyWriteOp::Replace,
+            WriteOp::Update => PyWriteOp::Update,
+            WriteOp::Delete => PyWriteOp::Delete,
+            WriteOp::Ctas => PyWriteOp::Ctas,
+            WriteOp::Truncate => PyWriteOp::Truncate,
+        }
+    }
+}
+
+impl From<PyWriteOp> for WriteOp {
+    fn from(py: PyWriteOp) -> Self {
+        match py {
+            PyWriteOp::Append => WriteOp::Insert(InsertOp::Append),
+            PyWriteOp::Overwrite => WriteOp::Insert(InsertOp::Overwrite),
+            PyWriteOp::Replace => WriteOp::Insert(InsertOp::Replace),
+            PyWriteOp::Update => WriteOp::Update,
+            PyWriteOp::Delete => WriteOp::Delete,
+            PyWriteOp::Ctas => WriteOp::Ctas,
+            PyWriteOp::Truncate => WriteOp::Truncate,
+        }
+    }
+}
+
+#[pymethods]
+impl PyWriteOp {
+    fn name(&self) -> String {
+        let write_op: WriteOp = self.clone().into();
+        write_op.name().to_string()
+    }
+}
diff --git a/crates/core/src/expr/drop_catalog_schema.rs b/crates/core/src/expr/drop_catalog_schema.rs
new file mode 100644
index 000000000..fd5105332
--- /dev/null
+++ b/crates/core/src/expr/drop_catalog_schema.rs
@@ -0,0 +1,123 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::fmt::{self, Display, Formatter};
+use std::sync::Arc;
+
+use datafusion::common::SchemaReference;
+use datafusion::logical_expr::DropCatalogSchema;
+use datafusion::sql::TableReference;
+use pyo3::IntoPyObjectExt;
+use pyo3::exceptions::PyValueError;
+use pyo3::prelude::*;
+
+use super::logical_node::LogicalNode;
+use crate::common::df_schema::PyDFSchema;
+use crate::sql::logical::PyLogicalPlan;
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "DropCatalogSchema",
+    module = "datafusion.expr",
+    subclass
+)]
+#[derive(Clone)]
+pub struct PyDropCatalogSchema {
+    drop: DropCatalogSchema,
+}
+
+impl From<PyDropCatalogSchema> for DropCatalogSchema {
+    fn from(drop: PyDropCatalogSchema) -> Self {
+        drop.drop
+    }
+}
+
+impl From<DropCatalogSchema> for PyDropCatalogSchema {
+    fn from(drop: DropCatalogSchema) -> PyDropCatalogSchema {
+        PyDropCatalogSchema { drop }
+    }
+}
+
+impl Display for PyDropCatalogSchema {
+    fn fmt(&self, f: &mut Formatter) -> fmt::Result {
+        write!(f, "DropCatalogSchema")
+    }
+}
+
+fn parse_schema_reference(name: String) -> PyResult<SchemaReference> {
+    match name.into() {
+        TableReference::Bare { table } => Ok(SchemaReference::Bare { schema: table }),
+        TableReference::Partial { schema, table } => Ok(SchemaReference::Full {
+            schema: table,
+            catalog: schema,
+        }),
+        TableReference::Full {
+            catalog: _,
+            schema: _,
+            table: _,
+        } => Err(PyErr::new::<PyValueError, _>(
+            "Invalid schema specifier (has 3 parts)".to_string(),
+        )),
+    }
+}
+
+#[pymethods]
+impl PyDropCatalogSchema {
+    #[new]
+    fn new(name: String, schema: PyDFSchema, if_exists: bool, cascade: bool) -> PyResult<Self> {
+        let name = parse_schema_reference(name)?;
+        Ok(PyDropCatalogSchema {
+            drop: DropCatalogSchema {
+                name,
+                schema: Arc::new(schema.into()),
+                if_exists,
+                cascade,
+            },
+        })
+    }
+
+    fn name(&self) -> PyResult<String> {
+        Ok(self.drop.name.to_string())
+    }
+
+    fn schema(&self) -> PyDFSchema {
+        (*self.drop.schema).clone().into()
+    }
+
+    fn if_exists(&self) -> PyResult<bool> {
+        Ok(self.drop.if_exists)
+    }
+
+    fn cascade(&self) -> PyResult<bool> {
+        Ok(self.drop.cascade)
+    }
+
+    fn __repr__(&self) -> PyResult<String> {
+        Ok(format!("DropCatalogSchema({self})"))
+    }
+}
+
+impl LogicalNode for PyDropCatalogSchema {
+    fn inputs(&self) -> Vec<PyLogicalPlan> {
+        vec![]
+    }
+
+    fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.clone().into_bound_py_any(py)
+    }
+}
diff --git a/crates/core/src/expr/drop_function.rs b/crates/core/src/expr/drop_function.rs
new file mode 100644
index 000000000..0599dd49e
--- /dev/null
+++ b/crates/core/src/expr/drop_function.rs
@@ -0,0 +1,100 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::fmt::{self, Display, Formatter};
+use std::sync::Arc;
+
+use datafusion::logical_expr::DropFunction;
+use pyo3::IntoPyObjectExt;
+use pyo3::prelude::*;
+
+use super::logical_node::LogicalNode;
+use crate::common::df_schema::PyDFSchema;
+use crate::sql::logical::PyLogicalPlan;
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "DropFunction",
+    module = "datafusion.expr",
+    subclass
+)]
+#[derive(Clone)]
+pub struct PyDropFunction {
+    drop: DropFunction,
+}
+
+impl From<PyDropFunction> for DropFunction {
+    fn from(drop: PyDropFunction) -> Self {
+        drop.drop
+    }
+}
+
+impl From<DropFunction> for PyDropFunction {
+    fn from(drop: DropFunction) -> PyDropFunction {
+        PyDropFunction { drop }
+    }
+}
+
+impl Display for PyDropFunction {
+    fn fmt(&self, f: &mut Formatter) -> fmt::Result {
+        write!(f, "DropFunction")
+    }
+}
+
+#[pymethods]
+impl PyDropFunction {
+    #[new]
+    fn new(name: String, schema: PyDFSchema, if_exists: bool) -> PyResult<Self> {
+        Ok(PyDropFunction {
+            drop: DropFunction {
+                name,
+                schema: Arc::new(schema.into()),
+                if_exists,
+            },
+        })
+    }
+    fn name(&self) -> PyResult<String> {
+        Ok(self.drop.name.clone())
+    }
+
+    fn schema(&self) -> PyDFSchema {
+        (*self.drop.schema).clone().into()
+    }
+
+    fn if_exists(&self) -> PyResult<bool> {
+        Ok(self.drop.if_exists)
+    }
+
+    fn __repr__(&self) -> PyResult<String> {
+        Ok(format!("DropFunction({self})"))
+    }
+
+    fn __name__(&self) -> PyResult<String> {
+        Ok("DropFunction".to_string())
+    }
+}
+
+impl LogicalNode for PyDropFunction {
+    fn inputs(&self) -> Vec<PyLogicalPlan> {
+        vec![]
+    }
+
+    fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.clone().into_bound_py_any(py)
+    }
+}
diff --git a/src/expr/drop_table.rs b/crates/core/src/expr/drop_table.rs
similarity index 85%
rename from src/expr/drop_table.rs
rename to crates/core/src/expr/drop_table.rs
index 2a8836db5..46fe67465 100644
--- a/src/expr/drop_table.rs
+++ b/crates/core/src/expr/drop_table.rs
@@ -17,14 +17,20 @@
 use std::fmt::{self, Display, Formatter};
 
-use datafusion_expr::logical_plan::DropTable;
+use datafusion::logical_expr::logical_plan::DropTable;
+use pyo3::IntoPyObjectExt;
 use pyo3::prelude::*;
 
-use crate::sql::logical::PyLogicalPlan;
-
 use super::logical_node::LogicalNode;
+use crate::sql::logical::PyLogicalPlan;
 
-#[pyclass(name = "DropTable", module = "datafusion.expr", subclass)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "DropTable",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone)]
 pub struct PyDropTable {
     drop: DropTable,
@@ -70,7 +76,7 @@ impl PyDropTable {
     }
 
     fn __repr__(&self) -> PyResult<String> {
-        Ok(format!("DropTable({})", self))
+        Ok(format!("DropTable({self})"))
     }
 
     fn __name__(&self) -> PyResult<String> {
@@ -83,7 +89,7 @@ impl LogicalNode for PyDropTable {
         vec![]
     }
 
-    fn to_variant(&self, py: Python) -> PyResult<PyObject> {
-        Ok(self.clone().into_py(py))
+    fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.clone().into_bound_py_any(py)
     }
 }
diff --git a/crates/core/src/expr/drop_view.rs b/crates/core/src/expr/drop_view.rs
new file mode 100644
index 000000000..0d0c51f13
--- /dev/null
+++ b/crates/core/src/expr/drop_view.rs
@@ -0,0 +1,106 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::fmt::{self, Display, Formatter};
+use std::sync::Arc;
+
+use datafusion::logical_expr::DropView;
+use pyo3::IntoPyObjectExt;
+use pyo3::prelude::*;
+
+use super::logical_node::LogicalNode;
+use crate::common::df_schema::PyDFSchema;
+use crate::sql::logical::PyLogicalPlan;
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "DropView",
+    module = "datafusion.expr",
+    subclass
+)]
+#[derive(Clone)]
+pub struct PyDropView {
+    drop: DropView,
+}
+
+impl From<PyDropView> for DropView {
+    fn from(drop: PyDropView) -> Self {
+        drop.drop
+    }
+}
+
+impl From<DropView> for PyDropView {
+    fn from(drop: DropView) -> PyDropView {
+        PyDropView { drop }
+    }
+}
+
+impl Display for PyDropView {
+    fn fmt(&self, f: &mut Formatter) -> fmt::Result {
+        write!(
+            f,
+            "DropView: {name:?} if not exist:={if_exists}",
+            name = self.drop.name,
+            if_exists = self.drop.if_exists
+        )
+    }
+}
+
+#[pymethods]
+impl PyDropView {
+    #[new]
+    fn new(name: String, schema: PyDFSchema, if_exists: bool) -> PyResult<Self> {
+        Ok(PyDropView {
+            drop: DropView {
+                name: name.into(),
+                schema: Arc::new(schema.into()),
+                if_exists,
+            },
+        })
+    }
+
+    fn name(&self) -> PyResult<String> {
+        Ok(self.drop.name.to_string())
+    }
+
+    fn schema(&self) -> PyDFSchema {
+        (*self.drop.schema).clone().into()
+    }
+
+    fn if_exists(&self) -> PyResult<bool> {
+        Ok(self.drop.if_exists)
+    }
+
+    fn __repr__(&self) -> PyResult<String> {
+        Ok(format!("DropView({self})"))
+    }
+
+    fn __name__(&self) -> PyResult<String> {
+        Ok("DropView".to_string())
+    }
+}
+
+impl LogicalNode for PyDropView {
+    fn inputs(&self) -> Vec<PyLogicalPlan> {
+        vec![]
+    }
+
+    fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.clone().into_bound_py_any(py)
+    }
+}
diff --git a/src/expr/empty_relation.rs b/crates/core/src/expr/empty_relation.rs
similarity index 83%
rename from src/expr/empty_relation.rs
rename to crates/core/src/expr/empty_relation.rs
index 0bc222e59..f3c237731 100644
--- a/src/expr/empty_relation.rs
+++ b/crates/core/src/expr/empty_relation.rs
@@ -15,14 +15,23 @@
 // specific language governing permissions and limitations
 // under the License.
 
-use crate::{common::df_schema::PyDFSchema, sql::logical::PyLogicalPlan};
-use datafusion_expr::EmptyRelation;
-use pyo3::prelude::*;
 use std::fmt::{self, Display, Formatter};
 
+use datafusion::logical_expr::EmptyRelation;
+use pyo3::IntoPyObjectExt;
+use pyo3::prelude::*;
+
 use super::logical_node::LogicalNode;
+use crate::common::df_schema::PyDFSchema;
+use crate::sql::logical::PyLogicalPlan;
 
-#[pyclass(name = "EmptyRelation", module = "datafusion.expr", subclass)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "EmptyRelation",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone)]
 pub struct PyEmptyRelation {
     empty: EmptyRelation,
@@ -65,7 +74,7 @@ impl PyEmptyRelation {
 
     /// Get a String representation of this column
     fn __repr__(&self) -> String {
-        format!("{}", self)
+        format!("{self}")
     }
 
     fn __name__(&self) -> PyResult<String> {
@@ -79,7 +88,7 @@ impl LogicalNode for PyEmptyRelation {
         vec![]
    }
 
-    fn to_variant(&self, py: Python) -> PyResult<PyObject> {
-        Ok(self.clone().into_py(py))
+    fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.clone().into_bound_py_any(py)
     }
 }
diff --git a/src/expr/exists.rs b/crates/core/src/expr/exists.rs
similarity index 74%
rename from src/expr/exists.rs
rename to crates/core/src/expr/exists.rs
index 7df9a6e81..d2e816127 100644
--- a/src/expr/exists.rs
+++ b/crates/core/src/expr/exists.rs
@@ -15,31 +15,36 @@
 // specific language governing permissions and limitations
 // under the License.
 
-use datafusion_expr::Subquery;
+use datafusion::logical_expr::expr::Exists;
 use pyo3::prelude::*;
 
 use super::subquery::PySubquery;
 
-#[pyclass(name = "Exists", module = "datafusion.expr", subclass)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "Exists",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone)]
 pub struct PyExists {
-    subquery: Subquery,
-    negated: bool,
+    exists: Exists,
 }
 
-impl PyExists {
-    pub fn new(subquery: Subquery, negated: bool) -> Self {
-        Self { subquery, negated }
+impl From<Exists> for PyExists {
+    fn from(exists: Exists) -> Self {
+        PyExists { exists }
     }
 }
 
 #[pymethods]
 impl PyExists {
     fn subquery(&self) -> PySubquery {
-        self.subquery.clone().into()
+        self.exists.subquery.clone().into()
     }
 
     fn negated(&self) -> bool {
-        self.negated
+        self.exists.negated
     }
 }
diff --git a/src/expr/explain.rs b/crates/core/src/expr/explain.rs
similarity index 85%
rename from src/expr/explain.rs
rename to crates/core/src/expr/explain.rs
index d5d6a7bbd..6259951de 100644
--- a/src/expr/explain.rs
+++ b/crates/core/src/expr/explain.rs
@@ -17,14 +17,23 @@
 use std::fmt::{self, Display, Formatter};
 
-use datafusion_expr::{logical_plan::Explain, LogicalPlan};
+use datafusion::logical_expr::LogicalPlan;
+use datafusion::logical_expr::logical_plan::Explain;
+use pyo3::IntoPyObjectExt;
 use pyo3::prelude::*;
 
-use crate::{common::df_schema::PyDFSchema, errors::py_type_err, sql::logical::PyLogicalPlan};
-
 use super::logical_node::LogicalNode;
-
-#[pyclass(name = "Explain", module = "datafusion.expr", subclass)]
+use crate::common::df_schema::PyDFSchema;
+use crate::errors::py_type_err;
+use crate::sql::logical::PyLogicalPlan;
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "Explain",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone)]
 pub struct PyExplain {
     explain: Explain,
@@ -104,7 +113,7 @@ impl LogicalNode for PyExplain {
         vec![]
     }
 
-    fn to_variant(&self, py: Python) -> PyResult<PyObject> {
-        Ok(self.clone().into_py(py))
+    fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.clone().into_bound_py_any(py)
     }
 }
diff --git a/src/expr/extension.rs b/crates/core/src/expr/extension.rs
similarity index 81%
rename from src/expr/extension.rs
rename to crates/core/src/expr/extension.rs
index 81a435c23..a0b617565 100644
--- a/src/expr/extension.rs
+++ b/crates/core/src/expr/extension.rs
@@ -15,14 +15,20 @@
 // specific language governing permissions and limitations
 // under the License.
 
-use datafusion_expr::Extension;
+use datafusion::logical_expr::Extension;
+use pyo3::IntoPyObjectExt;
 use pyo3::prelude::*;
 
-use crate::sql::logical::PyLogicalPlan;
-
 use super::logical_node::LogicalNode;
+use crate::sql::logical::PyLogicalPlan;
 
-#[pyclass(name = "Extension", module = "datafusion.expr", subclass)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "Extension",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone)]
 pub struct PyExtension {
     pub node: Extension,
@@ -46,7 +52,7 @@ impl LogicalNode for PyExtension {
         vec![]
     }
 
-    fn to_variant(&self, py: Python) -> PyResult<PyObject> {
-        Ok(self.clone().into_py(py))
+    fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.clone().into_bound_py_any(py)
     }
 }
diff --git a/src/expr/filter.rs b/crates/core/src/expr/filter.rs
similarity index 86%
rename from src/expr/filter.rs
rename to crates/core/src/expr/filter.rs
index 2def2f7d6..67426806d 100644
--- a/src/expr/filter.rs
+++ b/crates/core/src/expr/filter.rs
@@ -15,16 +15,24 @@
 // specific language governing permissions and limitations
 // under the License.
 
-use datafusion_expr::logical_plan::Filter;
-use pyo3::prelude::*;
 use std::fmt::{self, Display, Formatter};
 
+use datafusion::logical_expr::logical_plan::Filter;
+use pyo3::IntoPyObjectExt;
+use pyo3::prelude::*;
+
 use crate::common::df_schema::PyDFSchema;
-use crate::expr::logical_node::LogicalNode;
 use crate::expr::PyExpr;
+use crate::expr::logical_node::LogicalNode;
 use crate::sql::logical::PyLogicalPlan;
 
-#[pyclass(name = "Filter", module = "datafusion.expr", subclass)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "Filter",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone)]
 pub struct PyFilter {
     filter: Filter,
@@ -72,7 +80,7 @@ impl PyFilter {
     }
 
     fn __repr__(&self) -> String {
-        format!("Filter({})", self)
+        format!("Filter({self})")
     }
 }
 
@@ -81,7 +89,7 @@ impl LogicalNode for PyFilter {
         vec![PyLogicalPlan::from((*self.filter.input).clone())]
     }
 
-    fn to_variant(&self, py: Python) -> PyResult<PyObject> {
-        Ok(self.clone().into_py(py))
+    fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.clone().into_bound_py_any(py)
     }
 }
diff --git a/crates/core/src/expr/grouping_set.rs b/crates/core/src/expr/grouping_set.rs
new file mode 100644
index 000000000..11d8f4fcd
--- /dev/null
+++ b/crates/core/src/expr/grouping_set.rs
@@ -0,0 +1,78 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use datafusion::logical_expr::{Expr, GroupingSet};
+use pyo3::prelude::*;
+
+use crate::expr::PyExpr;
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "GroupingSet",
+    module = "datafusion.expr",
+    subclass
+)]
+#[derive(Clone)]
+pub struct PyGroupingSet {
+    grouping_set: GroupingSet,
+}
+
+#[pymethods]
+impl PyGroupingSet {
+    #[staticmethod]
+    #[pyo3(signature = (*exprs))]
+    fn rollup(exprs: Vec<PyExpr>) -> PyExpr {
+        Expr::GroupingSet(GroupingSet::Rollup(
+            exprs.into_iter().map(|e| e.expr).collect(),
+        ))
+        .into()
+    }
+
+    #[staticmethod]
+    #[pyo3(signature = (*exprs))]
+    fn cube(exprs: Vec<PyExpr>) -> PyExpr {
+        Expr::GroupingSet(GroupingSet::Cube(
+            exprs.into_iter().map(|e| e.expr).collect(),
+        ))
+        .into()
+    }
+
+    #[staticmethod]
+    #[pyo3(signature = (*expr_lists))]
+    fn grouping_sets(expr_lists: Vec<Vec<PyExpr>>) -> PyExpr {
+        Expr::GroupingSet(GroupingSet::GroupingSets(
+            expr_lists
+                .into_iter()
+                .map(|list| list.into_iter().map(|e| e.expr).collect())
+                .collect(),
+        ))
+        .into()
+    }
+}
+
+impl From<PyGroupingSet> for GroupingSet {
+    fn from(grouping_set: PyGroupingSet) -> Self {
+        grouping_set.grouping_set
+    }
+}
+
+impl From<GroupingSet> for PyGroupingSet {
+    fn from(grouping_set: GroupingSet) -> PyGroupingSet {
+        PyGroupingSet { grouping_set }
+    }
+}
diff --git a/src/expr/in_list.rs b/crates/core/src/expr/in_list.rs
similarity index 70%
rename from src/expr/in_list.rs
rename to crates/core/src/expr/in_list.rs
index 840eee2ce..0612cc21e 100644
--- a/src/expr/in_list.rs
+++ b/crates/core/src/expr/in_list.rs
@@ -15,39 +15,40 @@
 // specific language governing permissions and limitations
 // under the License.
 
-use crate::expr::PyExpr;
-use datafusion_expr::Expr;
+use datafusion::logical_expr::expr::InList;
 use pyo3::prelude::*;
 
-#[pyclass(name = "InList", module = "datafusion.expr", subclass)]
+use crate::expr::PyExpr;
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "InList",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone)]
 pub struct PyInList {
-    expr: Box<Expr>,
-    list: Vec<Expr>,
-    negated: bool,
+    in_list: InList,
 }
 
-impl PyInList {
-    pub fn new(expr: Box<Expr>, list: Vec<Expr>, negated: bool) -> Self {
-        Self {
-            expr,
-            list,
-            negated,
-        }
+impl From<InList> for PyInList {
+    fn from(in_list: InList) -> Self {
+        PyInList { in_list }
     }
 }
 
 #[pymethods]
 impl PyInList {
     fn expr(&self) -> PyExpr {
-        (*self.expr).clone().into()
+        (*self.in_list.expr).clone().into()
     }
 
     fn list(&self) -> Vec<PyExpr> {
-        self.list.iter().map(|e| e.clone().into()).collect()
+        self.in_list.list.iter().map(|e| e.clone().into()).collect()
     }
 
     fn negated(&self) -> bool {
-        self.negated
+        self.in_list.negated
     }
 }
diff --git a/src/expr/in_subquery.rs b/crates/core/src/expr/in_subquery.rs
similarity index 67%
rename from src/expr/in_subquery.rs
rename to crates/core/src/expr/in_subquery.rs
index 6cee4a1f0..81a2c5794 100644
--- a/src/expr/in_subquery.rs
+++ b/crates/core/src/expr/in_subquery.rs
@@ -15,40 +15,41 @@
 // specific language governing permissions and limitations
 // under the License.
 
-use datafusion_expr::{Expr, Subquery};
+use datafusion::logical_expr::expr::InSubquery;
 use pyo3::prelude::*;
 
-use super::{subquery::PySubquery, PyExpr};
+use super::PyExpr;
+use super::subquery::PySubquery;
 
-#[pyclass(name = "InSubquery", module = "datafusion.expr", subclass)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "InSubquery",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone)]
 pub struct PyInSubquery {
-    expr: Box<Expr>,
-    subquery: Subquery,
-    negated: bool,
+    in_subquery: InSubquery,
 }
 
-impl PyInSubquery {
-    pub fn new(expr: Box<Expr>, subquery: Subquery, negated: bool) -> Self {
-        Self {
-            expr,
-            subquery,
-            negated,
-        }
+impl From<InSubquery> for PyInSubquery {
+    fn from(in_subquery: InSubquery) -> Self {
+        PyInSubquery { in_subquery }
     }
 }
 
 #[pymethods]
 impl PyInSubquery {
     fn expr(&self) -> PyExpr {
-        (*self.expr).clone().into()
+        (*self.in_subquery.expr).clone().into()
     }
 
     fn subquery(&self) -> PySubquery {
-        self.subquery.clone().into()
+        self.in_subquery.subquery.clone().into()
     }
 
     fn negated(&self) -> bool {
-        self.negated
+        self.in_subquery.negated
     }
 }
diff --git a/src/expr/indexed_field.rs b/crates/core/src/expr/indexed_field.rs
similarity index 81%
rename from src/expr/indexed_field.rs
rename to crates/core/src/expr/indexed_field.rs
index c98607712..98a90d8d4 100644
--- a/src/expr/indexed_field.rs
+++ b/crates/core/src/expr/indexed_field.rs
@@ -15,14 +15,21 @@
 // specific language governing permissions and limitations
 // under the License.
 
-use crate::expr::PyExpr;
-use datafusion_expr::expr::GetIndexedField;
-use pyo3::prelude::*;
 use std::fmt::{Display, Formatter};
 
+use datafusion::logical_expr::expr::{GetFieldAccess, GetIndexedField};
+use pyo3::prelude::*;
+
 use super::literal::PyLiteral;
+use crate::expr::PyExpr;
 
-#[pyclass(name = "GetIndexedField", module = "datafusion.expr", subclass)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "GetIndexedField",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone)]
 pub struct PyGetIndexedField {
     indexed_field: GetIndexedField,
@@ -47,7 +54,7 @@ impl Display for PyGetIndexedField {
             "GetIndexedField
             Expr: {:?}
             Key: {:?}",
-            &self.indexed_field.expr, &self.indexed_field.key
+            &self.indexed_field.expr, &self.indexed_field.field
         )
     }
 }
@@ -59,7 +66,10 @@ impl PyGetIndexedField {
     }
 
     fn key(&self) -> PyResult<PyLiteral> {
-        Ok(self.indexed_field.key.clone().into())
+        match &self.indexed_field.field {
+            GetFieldAccess::NamedStructField { name, .. } => Ok(name.clone().into()),
+            _ => todo!(),
+        }
     }
 
     /// Get a String representation of this column
diff --git a/src/expr/join.rs b/crates/core/src/expr/join.rs
similarity index 78%
rename from src/expr/join.rs
rename to crates/core/src/expr/join.rs
index 801662962..b90f2f57d 100644
--- a/src/expr/join.rs
+++ b/crates/core/src/expr/join.rs
@@ -15,16 +15,20 @@
 // specific language governing permissions and limitations
 // under the License.
-use datafusion_expr::logical_plan::{Join, JoinConstraint, JoinType}; -use pyo3::prelude::*; use std::fmt::{self, Display, Formatter}; +use datafusion::common::NullEquality; +use datafusion::logical_expr::logical_plan::{Join, JoinConstraint, JoinType}; +use pyo3::IntoPyObjectExt; +use pyo3::prelude::*; + use crate::common::df_schema::PyDFSchema; -use crate::expr::{logical_node::LogicalNode, PyExpr}; +use crate::expr::PyExpr; +use crate::expr::logical_node::LogicalNode; use crate::sql::logical::PyLogicalPlan; #[derive(Debug, Clone, PartialEq, Eq, Hash)] -#[pyclass(name = "JoinType", module = "datafusion.expr")] +#[pyclass(from_py_object, frozen, name = "JoinType", module = "datafusion.expr")] pub struct PyJoinType { join_type: JoinType, } @@ -46,6 +50,10 @@ impl PyJoinType { pub fn is_outer(&self) -> bool { self.join_type.is_outer() } + + fn __repr__(&self) -> PyResult { + Ok(format!("{}", self.join_type)) + } } impl Display for PyJoinType { @@ -55,7 +63,12 @@ impl Display for PyJoinType { } #[derive(Debug, Clone, Copy)] -#[pyclass(name = "JoinConstraint", module = "datafusion.expr")] +#[pyclass( + from_py_object, + frozen, + name = "JoinConstraint", + module = "datafusion.expr" +)] pub struct PyJoinConstraint { join_constraint: JoinConstraint, } @@ -72,7 +85,23 @@ impl From for JoinConstraint { } } -#[pyclass(name = "Join", module = "datafusion.expr", subclass)] +#[pymethods] +impl PyJoinConstraint { + fn __repr__(&self) -> PyResult { + match self.join_constraint { + JoinConstraint::On => Ok("On".to_string()), + JoinConstraint::Using => Ok("Using".to_string()), + } + } +} + +#[pyclass( + from_py_object, + frozen, + name = "Join", + module = "datafusion.expr", + subclass +)] #[derive(Clone)] pub struct PyJoin { join: Join, @@ -102,7 +131,7 @@ impl Display for PyJoin { JoinType: {:?} JoinConstraint: {:?} Schema: {:?} - NullEqualsNull: {:?}", + NullEquality: {:?}", &self.join.left, &self.join.right, &self.join.on, @@ -110,7 +139,7 @@ impl Display for PyJoin { 
             &self.join.join_type,
             &self.join.join_constraint,
             &self.join.schema,
-            &self.join.null_equals_null,
+            &self.join.null_equality,
         )
     }
 }
@@ -159,11 +188,14 @@ impl PyJoin {
     /// If null_equals_null is true, null == null else null != null
     fn null_equals_null(&self) -> PyResult<bool> {
-        Ok(self.join.null_equals_null)
+        match self.join.null_equality {
+            NullEquality::NullEqualsNothing => Ok(false),
+            NullEquality::NullEqualsNull => Ok(true),
+        }
     }

     fn __repr__(&self) -> PyResult<String> {
-        Ok(format!("Join({})", self))
+        Ok(format!("Join({self})"))
     }

     fn __name__(&self) -> PyResult<String> {
@@ -179,7 +211,7 @@ impl LogicalNode for PyJoin {
         ]
     }

-    fn to_variant(&self, py: Python) -> PyResult<PyObject> {
-        Ok(self.clone().into_py(py))
+    fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.clone().into_bound_py_any(py)
     }
 }
diff --git a/src/expr/like.rs b/crates/core/src/expr/like.rs
similarity index 89%
rename from src/expr/like.rs
rename to crates/core/src/expr/like.rs
index 6ed3c2467..417dc9182 100644
--- a/src/expr/like.rs
+++ b/crates/core/src/expr/like.rs
@@ -15,13 +15,20 @@
 // specific language governing permissions and limitations
 // under the License.
-use datafusion_expr::expr::Like;
-use pyo3::prelude::*;
 use std::fmt::{self, Display, Formatter};

+use datafusion::logical_expr::expr::Like;
+use pyo3::prelude::*;
+
 use crate::expr::PyExpr;

-#[pyclass(name = "Like", module = "datafusion.expr", subclass)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "Like",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone)]
 pub struct PyLike {
     like: Like,
@@ -75,11 +82,17 @@ impl PyLike {
     }

     fn __repr__(&self) -> String {
-        format!("Like({})", self)
+        format!("Like({self})")
     }
 }

-#[pyclass(name = "ILike", module = "datafusion.expr", subclass)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "ILike",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone)]
 pub struct PyILike {
     like: Like,
@@ -133,11 +146,17 @@ impl PyILike {
     }

     fn __repr__(&self) -> String {
-        format!("Like({})", self)
+        format!("Like({self})")
     }
 }

-#[pyclass(name = "SimilarTo", module = "datafusion.expr", subclass)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "SimilarTo",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone)]
 pub struct PySimilarTo {
     like: Like,
@@ -191,6 +210,6 @@ impl PySimilarTo {
     }

     fn __repr__(&self) -> String {
-        format!("Like({})", self)
+        format!("Like({self})")
     }
 }
diff --git a/src/expr/limit.rs b/crates/core/src/expr/limit.rs
similarity index 72%
rename from src/expr/limit.rs
rename to crates/core/src/expr/limit.rs
index d7b3f4ca5..c04b8bfa8 100644
--- a/src/expr/limit.rs
+++ b/crates/core/src/expr/limit.rs
@@ -15,15 +15,23 @@
 // specific language governing permissions and limitations
 // under the License.
-use datafusion_expr::logical_plan::Limit;
-use pyo3::prelude::*;
 use std::fmt::{self, Display, Formatter};

+use datafusion::logical_expr::logical_plan::Limit;
+use pyo3::IntoPyObjectExt;
+use pyo3::prelude::*;
+
 use crate::common::df_schema::PyDFSchema;
 use crate::expr::logical_node::LogicalNode;
 use crate::sql::logical::PyLogicalPlan;

-#[pyclass(name = "Limit", module = "datafusion.expr", subclass)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "Limit",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone)]
 pub struct PyLimit {
     limit: Limit,
@@ -46,7 +54,7 @@ impl Display for PyLimit {
         write!(
             f,
             "Limit
-            Skip: {}
+            Skip: {:?}
             Fetch: {:?}
             Input: {:?}",
             &self.limit.skip, &self.limit.fetch, &self.limit.input
@@ -56,15 +64,19 @@
 #[pymethods]
 impl PyLimit {
-    /// Retrieves the skip value for this `Limit`
-    fn skip(&self) -> usize {
-        self.limit.skip
-    }
+    // NOTE: Upstream now has expressions for skip and fetch
+    // TODO: Do we still want to expose these?
+    // REF: https://github.com/apache/datafusion/pull/12836
-    /// Retrieves the fetch value for this `Limit`
-    fn fetch(&self) -> Option<usize> {
-        self.limit.fetch
-    }
+    // /// Retrieves the skip value for this `Limit`
+    // fn skip(&self) -> usize {
+    //     self.limit.skip
+    // }
+
+    // /// Retrieves the fetch value for this `Limit`
+    // fn fetch(&self) -> Option<usize> {
+    //     self.limit.fetch
+    // }

     /// Retrieves the input `LogicalPlan` to this `Limit` node
     fn input(&self) -> PyResult<Vec<PyLogicalPlan>> {
@@ -77,7 +89,7 @@
     }

     fn __repr__(&self) -> PyResult<String> {
-        Ok(format!("Limit({})", self))
+        Ok(format!("Limit({self})"))
     }
 }

@@ -86,7 +98,7 @@ impl LogicalNode for PyLimit {
         vec![PyLogicalPlan::from((*self.limit.input).clone())]
     }

-    fn to_variant(&self, py: Python) -> PyResult<PyObject> {
-        Ok(self.clone().into_py(py))
+    fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.clone().into_bound_py_any(py)
     }
 }
diff --git a/src/expr/literal.rs b/crates/core/src/expr/literal.rs
similarity index 77%
rename from src/expr/literal.rs
rename to crates/core/src/expr/literal.rs
index b29497e64..9db0f594b 100644
--- a/src/expr/literal.rs
+++ b/crates/core/src/expr/literal.rs
@@ -15,14 +15,30 @@
 // specific language governing permissions and limitations
 // under the License.
-use crate::errors::{py_runtime_err, DataFusionError};
-use datafusion_common::ScalarValue;
+use datafusion::common::ScalarValue;
+use datafusion::logical_expr::expr::FieldMetadata;
+use pyo3::IntoPyObjectExt;
 use pyo3::prelude::*;

-#[pyclass(name = "Literal", module = "datafusion.expr", subclass)]
+use crate::errors::PyDataFusionError;
+
+#[pyclass(
+    from_py_object,
+    name = "Literal",
+    module = "datafusion.expr",
+    subclass,
+    frozen
+)]
 #[derive(Clone)]
 pub struct PyLiteral {
     pub value: ScalarValue,
+    pub metadata: Option<FieldMetadata>,
+}
+
+impl PyLiteral {
+    pub fn new_with_metadata(value: ScalarValue, metadata: Option<FieldMetadata>) -> PyLiteral {
+        Self { value, metadata }
+    }
 }

 impl From<PyLiteral> for ScalarValue {
@@ -33,7 +49,7 @@ impl From<ScalarValue> for PyLiteral {
     fn from(value: ScalarValue) -> PyLiteral {
-        PyLiteral { value }
+        PyLiteral {
+            value,
+            metadata: None,
+        }
     }
 }

@@ -50,7 +69,7 @@ macro_rules! extract_scalar_value {
 impl PyLiteral {
     /// Get the data type of this literal value
     fn data_type(&self) -> String {
-        format!("{}", self.value.get_datatype())
+        format!("{}", self.value.data_type())
     }

     pub fn value_f32(&self) -> PyResult<Option<f32>> {
         extract_scalar_value!(self, Float32)
     }
@@ -61,7 +80,7 @@
         extract_scalar_value!(self, Float64)
     }

-    pub fn value_decimal128(&mut self) -> PyResult<(Option<i128>, u8, i8)> {
+    pub fn value_decimal128(&self) -> PyResult<(Option<i128>, u8, i8)> {
         match &self.value {
             ScalarValue::Decimal128(value, precision, scale) => Ok((*value, *precision, *scale)),
             other => Err(unexpected_literal_value(other)),
@@ -112,12 +131,14 @@
         extract_scalar_value!(self, Time64Nanosecond)
     }

-    pub fn value_timestamp(&mut self) -> PyResult<(Option<i64>, Option<String>)> {
+    pub fn value_timestamp(&self) -> PyResult<(Option<i64>, Option<String>)> {
         match &self.value {
             ScalarValue::TimestampNanosecond(iv, tz)
             | ScalarValue::TimestampMicrosecond(iv, tz)
             | ScalarValue::TimestampMillisecond(iv, tz)
-            | ScalarValue::TimestampSecond(iv, tz) => Ok((*iv, tz.clone())),
+            | ScalarValue::TimestampSecond(iv, tz) => {
+                Ok((*iv, tz.as_ref().map(|s| s.as_ref().to_string())))
+            }
             other => Err(unexpected_literal_value(other)),
         }
     }
@@ -135,20 +156,15 @@
     pub fn value_interval_day_time(&self) -> PyResult<Option<(i32, i32)>> {
         match &self.value {
-            ScalarValue::IntervalDayTime(Some(iv)) => {
-                let interval = *iv as u64;
-                let days = (interval >> 32) as i32;
-                let ms = interval as i32;
-                Ok(Some((days, ms)))
-            }
+            ScalarValue::IntervalDayTime(Some(iv)) => Ok(Some((iv.days, iv.milliseconds))),
             ScalarValue::IntervalDayTime(None) => Ok(None),
             other => Err(unexpected_literal_value(other)),
         }
     }

     #[allow(clippy::wrong_self_convention)]
-    fn into_type(&self, py: Python) -> PyResult<PyObject> {
-        Ok(self.clone().into_py(py))
+    fn into_type<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.clone().into_bound_py_any(py)
     }

     fn __repr__(&self) -> PyResult<String> {
@@ -157,5 +173,5 @@
 }

 fn unexpected_literal_value(value: &ScalarValue) -> PyErr {
-    DataFusionError::Common(format!("getValue() - Unexpected value: {value}")).into()
+    PyDataFusionError::Common(format!("getValue() - Unexpected value: {value}")).into()
 }
diff --git a/src/expr/logical_node.rs b/crates/core/src/expr/logical_node.rs
similarity index 89%
rename from src/expr/logical_node.rs
rename to crates/core/src/expr/logical_node.rs
index 757e4f94b..5aff70059 100644
--- a/src/expr/logical_node.rs
+++ b/crates/core/src/expr/logical_node.rs
@@ -15,7 +15,7 @@
-use pyo3::{PyObject, PyResult, Python};
+use pyo3::{Bound, PyAny, PyResult, Python};

 use crate::sql::logical::PyLogicalPlan;

@@ -25,5 +25,5 @@ pub trait LogicalNode {
     /// The input plan to the current logical node instance.
     fn inputs(&self) -> Vec<PyLogicalPlan>;

-    fn to_variant(&self, py: Python) -> PyResult<PyObject>;
+    fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>>;
 }
diff --git a/src/expr/placeholder.rs b/crates/core/src/expr/placeholder.rs
similarity index 59%
rename from src/expr/placeholder.rs
rename to crates/core/src/expr/placeholder.rs
index e37c8b561..6bd88321c 100644
--- a/src/expr/placeholder.rs
+++ b/crates/core/src/expr/placeholder.rs
@@ -15,34 +15,48 @@
 // specific language governing permissions and limitations
 // under the License.

-use datafusion::arrow::datatypes::DataType;
+use arrow::datatypes::Field;
+use arrow::pyarrow::PyArrowType;
+use datafusion::logical_expr::expr::Placeholder;
 use pyo3::prelude::*;

 use crate::common::data_type::PyDataType;

-#[pyclass(name = "Placeholder", module = "datafusion.expr", subclass)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "Placeholder",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone)]
 pub struct PyPlaceholder {
-    id: String,
-    data_type: Option<DataType>,
+    placeholder: Placeholder,
 }

-impl PyPlaceholder {
-    pub fn new(id: String, data_type: DataType) -> Self {
-        Self {
-            id,
-            data_type: Some(data_type),
-        }
-    }
+impl From<Placeholder> for PyPlaceholder {
+    fn from(placeholder: Placeholder) -> Self {
+        PyPlaceholder { placeholder }
+    }
 }

 #[pymethods]
 impl PyPlaceholder {
     fn id(&self) -> String {
-        self.id.clone()
+        self.placeholder.id.clone()
     }

     fn data_type(&self) -> Option<PyDataType> {
-        self.data_type.as_ref().map(|e| e.clone().into())
+        self.placeholder
+            .field
+            .as_ref()
+            .map(|f| f.data_type().clone().into())
+    }
+
+    fn field(&self) -> Option<PyArrowType<Field>> {
+        self.placeholder
+            .field
+            .as_ref()
+            .map(|f| f.as_ref().clone().into())
     }
 }
diff --git a/src/expr/projection.rs b/crates/core/src/expr/projection.rs
similarity index 73%
rename from src/expr/projection.rs
rename to crates/core/src/expr/projection.rs
index f5ba12db2..456e06412 100644
--- a/src/expr/projection.rs
+++ b/crates/core/src/expr/projection.rs
@@ -15,19 +15,28 @@
 // specific language governing permissions and limitations
 // under the License.

-use datafusion_expr::logical_plan::Projection;
-use pyo3::prelude::*;
 use std::fmt::{self, Display, Formatter};

+use datafusion::logical_expr::Expr;
+use datafusion::logical_expr::logical_plan::Projection;
+use pyo3::IntoPyObjectExt;
+use pyo3::prelude::*;
+
 use crate::common::df_schema::PyDFSchema;
-use crate::expr::logical_node::LogicalNode;
 use crate::expr::PyExpr;
+use crate::expr::logical_node::LogicalNode;
 use crate::sql::logical::PyLogicalPlan;

-#[pyclass(name = "Projection", module = "datafusion.expr", subclass)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "Projection",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone)]
 pub struct PyProjection {
-    projection: Projection,
+    pub projection: Projection,
 }

 impl PyProjection {
@@ -84,7 +93,7 @@
     }

     fn __repr__(&self) -> PyResult<String> {
-        Ok(format!("Projection({})", self))
+        Ok(format!("Projection({self})"))
     }

     fn __name__(&self) -> PyResult<String> {
@@ -92,12 +101,27 @@
     }
 }

+impl PyProjection {
+    /// Projection: Gets the names of the fields that should be projected
+    pub fn projected_expressions(local_expr: &PyExpr) -> Vec<PyExpr> {
+        let mut projs: Vec<PyExpr> = Vec::new();
+        match &local_expr.expr {
+            Expr::Alias(alias) => {
+                let py_expr: PyExpr = PyExpr::from(*alias.expr.clone());
+                projs.extend_from_slice(Self::projected_expressions(&py_expr).as_slice());
+            }
+            _ => projs.push(local_expr.clone()),
+        }
+        projs
+    }
+}
+
 impl LogicalNode for PyProjection {
     fn inputs(&self) -> Vec<PyLogicalPlan> {
         vec![PyLogicalPlan::from((*self.projection.input).clone())]
     }

-    fn to_variant(&self, py: Python) -> PyResult<PyObject> {
-        Ok(self.clone().into_py(py))
+    fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.clone().into_bound_py_any(py)
     }
 }
diff --git a/crates/core/src/expr/recursive_query.rs b/crates/core/src/expr/recursive_query.rs
new file mode 100644
index 000000000..e03137b80
--- /dev/null
+++ b/crates/core/src/expr/recursive_query.rs
@@ -0,0 +1,117 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::fmt::{self, Display, Formatter};
+
+use datafusion::logical_expr::RecursiveQuery;
+use pyo3::IntoPyObjectExt;
+use pyo3::prelude::*;
+
+use super::logical_node::LogicalNode;
+use crate::sql::logical::PyLogicalPlan;
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "RecursiveQuery",
+    module = "datafusion.expr",
+    subclass
+)]
+#[derive(Clone)]
+pub struct PyRecursiveQuery {
+    query: RecursiveQuery,
+}
+
+impl From<PyRecursiveQuery> for RecursiveQuery {
+    fn from(query: PyRecursiveQuery) -> Self {
+        query.query
+    }
+}
+
+impl From<RecursiveQuery> for PyRecursiveQuery {
+    fn from(query: RecursiveQuery) -> PyRecursiveQuery {
+        PyRecursiveQuery { query }
+    }
+}
+
+impl Display for PyRecursiveQuery {
+    fn fmt(&self, f: &mut Formatter) -> fmt::Result {
+        write!(
+            f,
+            "RecursiveQuery {name:?} is_distinct:={is_distinct}",
+            name = self.query.name,
+            is_distinct = self.query.is_distinct
+        )
+    }
+}
+
+#[pymethods]
+impl PyRecursiveQuery {
+    #[new]
+    fn new(
+        name: String,
+        static_term: PyLogicalPlan,
+        recursive_term: PyLogicalPlan,
+        is_distinct: bool,
+    ) -> Self {
+        Self {
+            query: RecursiveQuery {
+                name,
+                static_term: static_term.plan(),
+                recursive_term: recursive_term.plan(),
+                is_distinct,
+            },
+        }
+    }
+
+    fn name(&self) -> PyResult<String> {
+        Ok(self.query.name.clone())
+    }
+
+    fn static_term(&self) -> PyLogicalPlan {
+        PyLogicalPlan::from((*self.query.static_term).clone())
+    }
+
+    fn recursive_term(&self) -> PyLogicalPlan {
+        PyLogicalPlan::from((*self.query.recursive_term).clone())
+    }
+
+    fn is_distinct(&self) -> PyResult<bool> {
+        Ok(self.query.is_distinct)
+    }
+
+    fn __repr__(&self) -> PyResult<String> {
+        Ok(format!("RecursiveQuery({self})"))
+    }
+
+    fn __name__(&self) -> PyResult<String> {
+        Ok("RecursiveQuery".to_string())
+    }
+}
+
+impl LogicalNode for PyRecursiveQuery {
+    fn inputs(&self) -> Vec<PyLogicalPlan> {
+        vec![
+            PyLogicalPlan::from((*self.query.static_term).clone()),
+            PyLogicalPlan::from((*self.query.recursive_term).clone()),
+        ]
+    }
+
+    fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.clone().into_bound_py_any(py)
+    }
+}
diff --git a/src/expr/repartition.rs b/crates/core/src/expr/repartition.rs
similarity index 83%
rename from src/expr/repartition.rs
rename to crates/core/src/expr/repartition.rs
index e3e14f878..be39b9978 100644
--- a/src/expr/repartition.rs
+++ b/crates/core/src/expr/repartition.rs
@@ -17,20 +17,35 @@
 use std::fmt::{self, Display, Formatter};

-use datafusion_expr::{logical_plan::Repartition, Expr, Partitioning};
+use datafusion::logical_expr::logical_plan::Repartition;
+use datafusion::logical_expr::{Expr, Partitioning};
+use pyo3::IntoPyObjectExt;
 use pyo3::prelude::*;

-use crate::{errors::py_type_err, sql::logical::PyLogicalPlan};
-
-use super::{logical_node::LogicalNode, PyExpr};
-
-#[pyclass(name = "Repartition", module = "datafusion.expr", subclass)]
+use super::PyExpr;
+use super::logical_node::LogicalNode;
+use crate::errors::py_type_err;
+use crate::sql::logical::PyLogicalPlan;
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "Repartition",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone)]
 pub struct PyRepartition {
     repartition: Repartition,
 }

-#[pyclass(name = "Partitioning", module = "datafusion.expr", subclass)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "Partitioning",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone)]
 pub struct PyPartitioning {
     partitioning: Partitioning,
@@ -108,7 +123,7 @@ impl PyRepartition {
     }

     fn __repr__(&self) -> PyResult<String> {
-        Ok(format!("Repartition({})", self))
+        Ok(format!("Repartition({self})"))
     }

     fn __name__(&self) -> PyResult<String> {
@@ -121,7 +136,7 @@ impl LogicalNode for PyRepartition {
         vec![PyLogicalPlan::from((*self.repartition.input).clone())]
     }

-    fn to_variant(&self, py: Python) -> PyResult<PyObject> {
-        Ok(self.clone().into_py(py))
+    fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.clone().into_bound_py_any(py)
     }
 }
diff --git a/src/expr/scalar_subquery.rs b/crates/core/src/expr/scalar_subquery.rs
similarity index 89%
rename from src/expr/scalar_subquery.rs
rename to crates/core/src/expr/scalar_subquery.rs
index c71bb9905..c7852a4c4 100644
--- a/src/expr/scalar_subquery.rs
+++ b/crates/core/src/expr/scalar_subquery.rs
@@ -15,12 +15,18 @@
 // specific language governing permissions and limitations
 // under the License.

-use datafusion_expr::Subquery;
+use datafusion::logical_expr::Subquery;
 use pyo3::prelude::*;

 use super::subquery::PySubquery;

-#[pyclass(name = "ScalarSubquery", module = "datafusion.expr", subclass)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "ScalarSubquery",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone)]
 pub struct PyScalarSubquery {
     subquery: Subquery,
diff --git a/src/expr/scalar_variable.rs b/crates/core/src/expr/scalar_variable.rs
similarity index 76%
rename from src/expr/scalar_variable.rs
rename to crates/core/src/expr/scalar_variable.rs
index 7b50ba241..2d3bc4b76 100644
--- a/src/expr/scalar_variable.rs
+++ b/crates/core/src/expr/scalar_variable.rs
@@ -15,22 +15,28 @@
 // specific language governing permissions and limitations
 // under the License.
-use datafusion::arrow::datatypes::DataType;
+use arrow::datatypes::FieldRef;
 use pyo3::prelude::*;

 use crate::common::data_type::PyDataType;

-#[pyclass(name = "ScalarVariable", module = "datafusion.expr", subclass)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "ScalarVariable",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone)]
 pub struct PyScalarVariable {
-    data_type: DataType,
+    field: FieldRef,
     variables: Vec<String>,
 }

 impl PyScalarVariable {
-    pub fn new(data_type: &DataType, variables: &[String]) -> Self {
+    pub fn new(field: &FieldRef, variables: &[String]) -> Self {
         Self {
-            data_type: data_type.to_owned(),
+            field: field.to_owned(),
             variables: variables.to_vec(),
         }
     }
@@ -40,7 +46,7 @@ impl PyScalarVariable {
     /// Get the data type
     fn data_type(&self) -> PyResult<PyDataType> {
-        Ok(self.data_type.clone().into())
+        Ok(self.field.data_type().clone().into())
     }

     fn variables(&self) -> PyResult<Vec<String>> {
@@ -48,6 +54,6 @@
     }

     fn __repr__(&self) -> PyResult<String> {
-        Ok(format!("{}{:?}", self.data_type, self.variables))
+        Ok(format!("{}{:?}", self.field.data_type(), self.variables))
     }
 }
diff --git a/crates/core/src/expr/set_comparison.rs b/crates/core/src/expr/set_comparison.rs
new file mode 100644
index 000000000..9f0c077e1
--- /dev/null
+++ b/crates/core/src/expr/set_comparison.rs
@@ -0,0 +1,59 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use datafusion::logical_expr::expr::SetComparison;
+use pyo3::prelude::*;
+
+use super::subquery::PySubquery;
+use crate::expr::PyExpr;
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "SetComparison",
+    module = "datafusion.set_comparison",
+    subclass
+)]
+#[derive(Clone)]
+pub struct PySetComparison {
+    set_comparison: SetComparison,
+}
+
+impl From<SetComparison> for PySetComparison {
+    fn from(set_comparison: SetComparison) -> Self {
+        PySetComparison { set_comparison }
+    }
+}
+
+#[pymethods]
+impl PySetComparison {
+    fn expr(&self) -> PyExpr {
+        (*self.set_comparison.expr).clone().into()
+    }
+
+    fn subquery(&self) -> PySubquery {
+        self.set_comparison.subquery.clone().into()
+    }
+
+    fn op(&self) -> String {
+        format!("{}", self.set_comparison.op)
+    }
+
+    fn quantifier(&self) -> String {
+        format!("{}", self.set_comparison.quantifier)
+    }
+}
diff --git a/src/expr/signature.rs b/crates/core/src/expr/signature.rs
similarity index 87%
rename from src/expr/signature.rs
rename to crates/core/src/expr/signature.rs
index 2f194982e..35268e3a9 100644
--- a/src/expr/signature.rs
+++ b/crates/core/src/expr/signature.rs
@@ -15,12 +15,17 @@
 // specific language governing permissions and limitations
 // under the License.
-use datafusion_expr::{TypeSignature, Volatility};
+use datafusion::logical_expr::{TypeSignature, Volatility};
 use pyo3::prelude::*;

 #[allow(dead_code)]
-#[pyclass(name = "Signature", module = "datafusion.expr", subclass)]
-#[allow(dead_code)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "Signature",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone)]
 pub struct PySignature {
     type_signature: TypeSignature,
diff --git a/src/expr/sort.rs b/crates/core/src/expr/sort.rs
similarity index 78%
rename from src/expr/sort.rs
rename to crates/core/src/expr/sort.rs
index 8843c638d..7c1e654c5 100644
--- a/src/expr/sort.rs
+++ b/crates/core/src/expr/sort.rs
@@ -15,17 +15,25 @@
 // specific language governing permissions and limitations
 // under the License.

-use datafusion_common::DataFusionError;
-use datafusion_expr::logical_plan::Sort;
-use pyo3::prelude::*;
 use std::fmt::{self, Display, Formatter};

+use datafusion::common::DataFusionError;
+use datafusion::logical_expr::logical_plan::Sort;
+use pyo3::IntoPyObjectExt;
+use pyo3::prelude::*;
+
 use crate::common::df_schema::PyDFSchema;
 use crate::expr::logical_node::LogicalNode;
-use crate::expr::PyExpr;
+use crate::expr::sort_expr::PySortExpr;
 use crate::sql::logical::PyLogicalPlan;

-#[pyclass(name = "Sort", module = "datafusion.expr", subclass)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "Sort",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone)]
 pub struct PySort {
     sort: Sort,
@@ -63,15 +71,19 @@
 #[pymethods]
 impl PySort {
     /// Retrieves the sort expressions for this `Sort`
-    fn sort_exprs(&self) -> PyResult<Vec<PyExpr>> {
+    fn sort_exprs(&self) -> PyResult<Vec<PySortExpr>> {
         Ok(self
             .sort
             .expr
             .iter()
-            .map(|e| PyExpr::from(e.clone()))
+            .map(|e| PySortExpr::from(e.clone()))
             .collect())
     }

+    fn get_fetch_val(&self) -> PyResult<Option<usize>> {
+        Ok(self.sort.fetch)
+    }
+
     /// Retrieves the input `LogicalPlan` to this `Sort` node
     fn input(&self) -> PyResult<Vec<PyLogicalPlan>> {
         Ok(Self::inputs(self))
     }
@@ -83,7 +95,7 @@ impl PySort {
     }

     fn __repr__(&self) -> PyResult<String> {
-        Ok(format!("Sort({})", self))
+        Ok(format!("Sort({self})"))
     }
 }

@@ -92,7 +104,7 @@ impl LogicalNode for PySort {
         vec![PyLogicalPlan::from((*self.sort.input).clone())]
     }

-    fn to_variant(&self, py: Python) -> PyResult<PyObject> {
-        Ok(self.clone().into_py(py))
+    fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.clone().into_bound_py_any(py)
     }
 }
diff --git a/crates/core/src/expr/sort_expr.rs b/crates/core/src/expr/sort_expr.rs
new file mode 100644
index 000000000..3c3c86bc1
--- /dev/null
+++ b/crates/core/src/expr/sort_expr.rs
@@ -0,0 +1,98 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::fmt::{self, Display, Formatter};
+
+use datafusion::logical_expr::SortExpr;
+use pyo3::prelude::*;
+
+use crate::expr::PyExpr;
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "SortExpr",
+    module = "datafusion.expr",
+    subclass
+)]
+#[derive(Clone)]
+pub struct PySortExpr {
+    pub(crate) sort: SortExpr,
+}
+
+impl From<PySortExpr> for SortExpr {
+    fn from(sort: PySortExpr) -> Self {
+        sort.sort
+    }
+}
+
+impl From<SortExpr> for PySortExpr {
+    fn from(sort: SortExpr) -> PySortExpr {
+        PySortExpr { sort }
+    }
+}
+
+impl Display for PySortExpr {
+    fn fmt(&self, f: &mut Formatter) -> fmt::Result {
+        write!(
+            f,
+            "Sort
+            Expr: {:?}
+            Asc: {:?}
+            NullsFirst: {:?}",
+            &self.sort.expr, &self.sort.asc, &self.sort.nulls_first
+        )
+    }
+}
+
+pub fn to_sort_expressions(order_by: Vec<PySortExpr>) -> Vec<SortExpr> {
+    order_by.iter().map(|e| e.sort.clone()).collect()
+}
+
+pub fn py_sort_expr_list(expr: &[SortExpr]) -> PyResult<Vec<PySortExpr>> {
+    Ok(expr.iter().map(|e| PySortExpr::from(e.clone())).collect())
+}
+
+#[pymethods]
+impl PySortExpr {
+    #[new]
+    fn new(expr: PyExpr, asc: bool, nulls_first: bool) -> Self {
+        Self {
+            sort: SortExpr {
+                expr: expr.into(),
+                asc,
+                nulls_first,
+            },
+        }
+    }
+
+    fn expr(&self) -> PyResult<PyExpr> {
+        Ok(self.sort.expr.clone().into())
+    }
+
+    fn ascending(&self) -> PyResult<bool> {
+        Ok(self.sort.asc)
+    }
+
+    fn nulls_first(&self) -> PyResult<bool> {
+        Ok(self.sort.nulls_first)
+    }
+
+    fn __repr__(&self) -> String {
+        format!("{self}")
+    }
+}
diff --git a/crates/core/src/expr/statement.rs b/crates/core/src/expr/statement.rs
new file mode 100644
index 000000000..5aa1e4e9c
--- /dev/null
+++ b/crates/core/src/expr/statement.rs
@@ -0,0 +1,558 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::sync::Arc;
+
+use arrow::datatypes::Field;
+use arrow::pyarrow::PyArrowType;
+use datafusion::logical_expr::{
+    Deallocate, Execute, Prepare, ResetVariable, SetVariable, TransactionAccessMode,
+    TransactionConclusion, TransactionEnd, TransactionIsolationLevel, TransactionStart,
+};
+use pyo3::IntoPyObjectExt;
+use pyo3::prelude::*;
+
+use super::PyExpr;
+use super::logical_node::LogicalNode;
+use crate::sql::logical::PyLogicalPlan;
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "TransactionStart",
+    module = "datafusion.expr",
+    subclass
+)]
+#[derive(Clone)]
+pub struct PyTransactionStart {
+    transaction_start: TransactionStart,
+}
+
+impl From<TransactionStart> for PyTransactionStart {
+    fn from(transaction_start: TransactionStart) -> PyTransactionStart {
+        PyTransactionStart { transaction_start }
+    }
+}
+
+impl TryFrom<PyTransactionStart> for TransactionStart {
+    type Error = PyErr;
+
+    fn try_from(py: PyTransactionStart) -> Result<Self, Self::Error> {
+        Ok(py.transaction_start)
+    }
+}
+
+impl LogicalNode for PyTransactionStart {
+    fn inputs(&self) -> Vec<PyLogicalPlan> {
+        vec![]
+    }
+
+    fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.clone().into_bound_py_any(py)
+    }
+}
+
+#[derive(Debug, Clone, PartialEq, Eq, Hash, PartialOrd, Ord)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    eq,
+    eq_int,
+    name = "TransactionAccessMode",
+    module = "datafusion.expr"
+)]
+pub enum PyTransactionAccessMode {
+    ReadOnly,
+    ReadWrite,
+}
+
+impl From for PyTransactionAccessMode { + fn from(access_mode: TransactionAccessMode) -> PyTransactionAccessMode { + match access_mode { + TransactionAccessMode::ReadOnly => PyTransactionAccessMode::ReadOnly, + TransactionAccessMode::ReadWrite => PyTransactionAccessMode::ReadWrite, + } + } +} + +impl TryFrom for TransactionAccessMode { + type Error = PyErr; + + fn try_from(py: PyTransactionAccessMode) -> Result { + match py { + PyTransactionAccessMode::ReadOnly => Ok(TransactionAccessMode::ReadOnly), + PyTransactionAccessMode::ReadWrite => Ok(TransactionAccessMode::ReadWrite), + } + } +} + +#[derive(Debug, Clone, PartialEq, Eq, Hash, PartialOrd, Ord)] +#[pyclass( + from_py_object, + frozen, + eq, + eq_int, + name = "TransactionIsolationLevel", + module = "datafusion.expr" +)] +pub enum PyTransactionIsolationLevel { + ReadUncommitted, + ReadCommitted, + RepeatableRead, + Serializable, + Snapshot, +} + +impl From for PyTransactionIsolationLevel { + fn from(isolation_level: TransactionIsolationLevel) -> PyTransactionIsolationLevel { + match isolation_level { + TransactionIsolationLevel::ReadUncommitted => { + PyTransactionIsolationLevel::ReadUncommitted + } + TransactionIsolationLevel::ReadCommitted => PyTransactionIsolationLevel::ReadCommitted, + TransactionIsolationLevel::RepeatableRead => { + PyTransactionIsolationLevel::RepeatableRead + } + TransactionIsolationLevel::Serializable => PyTransactionIsolationLevel::Serializable, + TransactionIsolationLevel::Snapshot => PyTransactionIsolationLevel::Snapshot, + } + } +} + +impl TryFrom for TransactionIsolationLevel { + type Error = PyErr; + + fn try_from(value: PyTransactionIsolationLevel) -> Result { + match value { + PyTransactionIsolationLevel::ReadUncommitted => { + Ok(TransactionIsolationLevel::ReadUncommitted) + } + PyTransactionIsolationLevel::ReadCommitted => { + Ok(TransactionIsolationLevel::ReadCommitted) + } + PyTransactionIsolationLevel::RepeatableRead => { + Ok(TransactionIsolationLevel::RepeatableRead) + 
} + PyTransactionIsolationLevel::Serializable => { + Ok(TransactionIsolationLevel::Serializable) + } + PyTransactionIsolationLevel::Snapshot => Ok(TransactionIsolationLevel::Snapshot), + } + } +} + +#[pymethods] +impl PyTransactionStart { + #[new] + pub fn new( + access_mode: PyTransactionAccessMode, + isolation_level: PyTransactionIsolationLevel, + ) -> PyResult { + let access_mode = access_mode.try_into()?; + let isolation_level = isolation_level.try_into()?; + Ok(PyTransactionStart { + transaction_start: TransactionStart { + access_mode, + isolation_level, + }, + }) + } + + pub fn access_mode(&self) -> PyResult { + Ok(self.transaction_start.access_mode.clone().into()) + } + + pub fn isolation_level(&self) -> PyResult { + Ok(self.transaction_start.isolation_level.clone().into()) + } +} + +#[pyclass( + from_py_object, + frozen, + name = "TransactionEnd", + module = "datafusion.expr", + subclass +)] +#[derive(Clone)] +pub struct PyTransactionEnd { + transaction_end: TransactionEnd, +} + +impl From for PyTransactionEnd { + fn from(transaction_end: TransactionEnd) -> PyTransactionEnd { + PyTransactionEnd { transaction_end } + } +} + +impl TryFrom for TransactionEnd { + type Error = PyErr; + + fn try_from(py: PyTransactionEnd) -> Result { + Ok(py.transaction_end) + } +} + +impl LogicalNode for PyTransactionEnd { + fn inputs(&self) -> Vec { + vec![] + } + + fn to_variant<'py>(&self, py: Python<'py>) -> PyResult> { + self.clone().into_bound_py_any(py) + } +} + +#[derive(Debug, Clone, PartialEq, Eq, Hash, PartialOrd, Ord)] +#[pyclass( + from_py_object, + frozen, + eq, + eq_int, + name = "TransactionConclusion", + module = "datafusion.expr" +)] +pub enum PyTransactionConclusion { + Commit, + Rollback, +} + +impl From for PyTransactionConclusion { + fn from(value: TransactionConclusion) -> Self { + match value { + TransactionConclusion::Commit => PyTransactionConclusion::Commit, + TransactionConclusion::Rollback => PyTransactionConclusion::Rollback, + } + } +} + +impl 
TryFrom<PyTransactionConclusion> for TransactionConclusion {
+    type Error = PyErr;
+
+    fn try_from(value: PyTransactionConclusion) -> Result<Self, Self::Error> {
+        match value {
+            PyTransactionConclusion::Commit => Ok(TransactionConclusion::Commit),
+            PyTransactionConclusion::Rollback => Ok(TransactionConclusion::Rollback),
+        }
+    }
+}
+#[pymethods]
+impl PyTransactionEnd {
+    #[new]
+    pub fn new(conclusion: PyTransactionConclusion, chain: bool) -> PyResult<Self> {
+        let conclusion = conclusion.try_into()?;
+        Ok(PyTransactionEnd {
+            transaction_end: TransactionEnd { conclusion, chain },
+        })
+    }
+
+    pub fn conclusion(&self) -> PyResult<PyTransactionConclusion> {
+        Ok(self.transaction_end.conclusion.clone().into())
+    }
+
+    pub fn chain(&self) -> bool {
+        self.transaction_end.chain
+    }
+}
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "ResetVariable",
+    module = "datafusion.expr",
+    subclass
+)]
+#[derive(Clone)]
+pub struct PyResetVariable {
+    reset_variable: ResetVariable,
+}
+
+impl From<ResetVariable> for PyResetVariable {
+    fn from(reset_variable: ResetVariable) -> PyResetVariable {
+        PyResetVariable { reset_variable }
+    }
+}
+
+impl TryFrom<PyResetVariable> for ResetVariable {
+    type Error = PyErr;
+
+    fn try_from(py: PyResetVariable) -> Result<Self, Self::Error> {
+        Ok(py.reset_variable)
+    }
+}
+
+impl LogicalNode for PyResetVariable {
+    fn inputs(&self) -> Vec<PyLogicalPlan> {
+        vec![]
+    }
+
+    fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.clone().into_bound_py_any(py)
+    }
+}
+
+#[pymethods]
+impl PyResetVariable {
+    #[new]
+    pub fn new(variable: String) -> Self {
+        PyResetVariable {
+            reset_variable: ResetVariable { variable },
+        }
+    }
+
+    pub fn variable(&self) -> String {
+        self.reset_variable.variable.clone()
+    }
+}
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "SetVariable",
+    module = "datafusion.expr",
+    subclass
+)]
+#[derive(Clone)]
+pub struct PySetVariable {
+    set_variable: SetVariable,
+}
+
+impl From<SetVariable> for PySetVariable {
+    fn from(set_variable: SetVariable) -> PySetVariable {
+        PySetVariable { set_variable }
+    }
+}
+
+impl TryFrom<PySetVariable> for SetVariable {
+    type Error
= PyErr;
+
+    fn try_from(py: PySetVariable) -> Result<Self, Self::Error> {
+        Ok(py.set_variable)
+    }
+}
+
+impl LogicalNode for PySetVariable {
+    fn inputs(&self) -> Vec<PyLogicalPlan> {
+        vec![]
+    }
+
+    fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.clone().into_bound_py_any(py)
+    }
+}
+
+#[pymethods]
+impl PySetVariable {
+    #[new]
+    pub fn new(variable: String, value: String) -> Self {
+        PySetVariable {
+            set_variable: SetVariable { variable, value },
+        }
+    }
+
+    pub fn variable(&self) -> String {
+        self.set_variable.variable.clone()
+    }
+
+    pub fn value(&self) -> String {
+        self.set_variable.value.clone()
+    }
+}
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "Prepare",
+    module = "datafusion.expr",
+    subclass
+)]
+#[derive(Clone)]
+pub struct PyPrepare {
+    prepare: Prepare,
+}
+
+impl From<Prepare> for PyPrepare {
+    fn from(prepare: Prepare) -> PyPrepare {
+        PyPrepare { prepare }
+    }
+}
+
+impl TryFrom<PyPrepare> for Prepare {
+    type Error = PyErr;
+
+    fn try_from(py: PyPrepare) -> Result<Self, Self::Error> {
+        Ok(py.prepare)
+    }
+}
+
+impl LogicalNode for PyPrepare {
+    fn inputs(&self) -> Vec<PyLogicalPlan> {
+        vec![PyLogicalPlan::from((*self.prepare.input).clone())]
+    }
+
+    fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.clone().into_bound_py_any(py)
+    }
+}
+
+#[pymethods]
+impl PyPrepare {
+    #[new]
+    pub fn new(name: String, fields: Vec<PyArrowType<Field>>, input: PyLogicalPlan) -> Self {
+        let input = input.plan().clone();
+        let fields = fields.into_iter().map(|field| Arc::new(field.0)).collect();
+        PyPrepare {
+            prepare: Prepare {
+                name,
+                fields,
+                input,
+            },
+        }
+    }
+
+    pub fn name(&self) -> String {
+        self.prepare.name.clone()
+    }
+
+    pub fn fields(&self) -> Vec<PyArrowType<Field>> {
+        self.prepare
+            .fields
+            .clone()
+            .into_iter()
+            .map(|f| f.as_ref().clone().into())
+            .collect()
+    }
+
+    pub fn input(&self) -> PyLogicalPlan {
+        PyLogicalPlan {
+            plan: self.prepare.input.clone(),
+        }
+    }
+}
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "Execute",
+    module = "datafusion.expr",
+    subclass
+)]
+#[derive(Clone)]
+pub struct PyExecute
{
+    execute: Execute,
+}
+
+impl From<Execute> for PyExecute {
+    fn from(execute: Execute) -> PyExecute {
+        PyExecute { execute }
+    }
+}
+
+impl TryFrom<PyExecute> for Execute {
+    type Error = PyErr;
+
+    fn try_from(py: PyExecute) -> Result<Self, Self::Error> {
+        Ok(py.execute)
+    }
+}
+
+impl LogicalNode for PyExecute {
+    fn inputs(&self) -> Vec<PyLogicalPlan> {
+        vec![]
+    }
+
+    fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.clone().into_bound_py_any(py)
+    }
+}
+
+#[pymethods]
+impl PyExecute {
+    #[new]
+    pub fn new(name: String, parameters: Vec<PyExpr>) -> Self {
+        let parameters = parameters
+            .into_iter()
+            .map(|parameter| parameter.into())
+            .collect();
+        PyExecute {
+            execute: Execute { name, parameters },
+        }
+    }
+
+    pub fn name(&self) -> String {
+        self.execute.name.clone()
+    }
+
+    pub fn parameters(&self) -> Vec<PyExpr> {
+        self.execute
+            .parameters
+            .clone()
+            .into_iter()
+            .map(|t| t.into())
+            .collect()
+    }
+}
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "Deallocate",
+    module = "datafusion.expr",
+    subclass
+)]
+#[derive(Clone)]
+pub struct PyDeallocate {
+    deallocate: Deallocate,
+}
+
+impl From<Deallocate> for PyDeallocate {
+    fn from(deallocate: Deallocate) -> PyDeallocate {
+        PyDeallocate { deallocate }
+    }
+}
+
+impl TryFrom<PyDeallocate> for Deallocate {
+    type Error = PyErr;
+
+    fn try_from(py: PyDeallocate) -> Result<Self, Self::Error> {
+        Ok(py.deallocate)
+    }
+}
+
+impl LogicalNode for PyDeallocate {
+    fn inputs(&self) -> Vec<PyLogicalPlan> {
+        vec![]
+    }
+
+    fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.clone().into_bound_py_any(py)
+    }
+}
+
+#[pymethods]
+impl PyDeallocate {
+    #[new]
+    pub fn new(name: String) -> Self {
+        PyDeallocate {
+            deallocate: Deallocate { name },
+        }
+    }
+
+    pub fn name(&self) -> String {
+        self.deallocate.name.clone()
+    }
+}
diff --git a/crates/core/src/expr/subquery.rs b/crates/core/src/expr/subquery.rs
new file mode 100644
index 000000000..c6fa83db8
--- /dev/null
+++ b/crates/core/src/expr/subquery.rs
@@ -0,0 +1,87 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more
contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::fmt::{self, Display, Formatter};
+
+use datafusion::logical_expr::Subquery;
+use pyo3::IntoPyObjectExt;
+use pyo3::prelude::*;
+
+use super::logical_node::LogicalNode;
+use crate::sql::logical::PyLogicalPlan;
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "Subquery",
+    module = "datafusion.expr",
+    subclass
+)]
+#[derive(Clone)]
+pub struct PySubquery {
+    subquery: Subquery,
+}
+
+impl From<PySubquery> for Subquery {
+    fn from(subquery: PySubquery) -> Self {
+        subquery.subquery
+    }
+}
+
+impl From<Subquery> for PySubquery {
+    fn from(subquery: Subquery) -> PySubquery {
+        PySubquery { subquery }
+    }
+}
+
+impl Display for PySubquery {
+    fn fmt(&self, f: &mut Formatter) -> fmt::Result {
+        write!(
+            f,
+            "Subquery
+            Subquery: {:?}
+            outer_ref_columns: {:?}",
+            self.subquery.subquery, self.subquery.outer_ref_columns,
+        )
+    }
+}
+
+#[pymethods]
+impl PySubquery {
+    /// Retrieves the input `LogicalPlan` to this `Subquery` node
+    fn input(&self) -> PyResult<Vec<PyLogicalPlan>> {
+        Ok(Self::inputs(self))
+    }
+
+    fn __repr__(&self) -> PyResult<String> {
+        Ok(format!("Subquery({self})"))
+    }
+
+    fn __name__(&self) -> PyResult<String> {
+        Ok("Subquery".to_string())
+    }
+}
+
+impl LogicalNode for PySubquery {
+    fn inputs(&self) -> Vec<PyLogicalPlan> {
+        vec![]
+    }
+
+    fn to_variant<'py>(&self, py:
Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.clone().into_bound_py_any(py)
+    }
+}
diff --git a/src/expr/subquery_alias.rs b/crates/core/src/expr/subquery_alias.rs
similarity index 82%
rename from src/expr/subquery_alias.rs
rename to crates/core/src/expr/subquery_alias.rs
index 5360bbbc4..a6b09e842 100644
--- a/src/expr/subquery_alias.rs
+++ b/crates/core/src/expr/subquery_alias.rs
@@ -17,14 +17,21 @@
 use std::fmt::{self, Display, Formatter};
 
-use datafusion_expr::SubqueryAlias;
+use datafusion::logical_expr::SubqueryAlias;
+use pyo3::IntoPyObjectExt;
 use pyo3::prelude::*;
 
-use crate::{common::df_schema::PyDFSchema, sql::logical::PyLogicalPlan};
-
 use super::logical_node::LogicalNode;
+use crate::common::df_schema::PyDFSchema;
+use crate::sql::logical::PyLogicalPlan;
 
-#[pyclass(name = "SubqueryAlias", module = "datafusion.expr", subclass)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "SubqueryAlias",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone)]
 pub struct PySubqueryAlias {
     subquery_alias: SubqueryAlias,
@@ -68,11 +75,11 @@ impl PySubqueryAlias {
     }
 
     fn alias(&self) -> PyResult<String> {
-        Ok(self.subquery_alias.alias.clone())
+        Ok(self.subquery_alias.alias.to_string())
     }
 
     fn __repr__(&self) -> PyResult<String> {
-        Ok(format!("SubqueryAlias({})", self))
+        Ok(format!("SubqueryAlias({self})"))
     }
 
     fn __name__(&self) -> PyResult<String> {
@@ -85,7 +92,7 @@ impl LogicalNode for PySubqueryAlias {
         vec![PyLogicalPlan::from((*self.subquery_alias.input).clone())]
     }
 
-    fn to_variant(&self, py: Python) -> PyResult<PyObject> {
-        Ok(self.clone().into_py(py))
+    fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.clone().into_bound_py_any(py)
     }
 }
diff --git a/src/expr/table_scan.rs b/crates/core/src/expr/table_scan.rs
similarity index 74%
rename from src/expr/table_scan.rs
rename to crates/core/src/expr/table_scan.rs
index 63684fe7f..8ba7e4a69 100644
--- a/src/expr/table_scan.rs
+++ b/crates/core/src/expr/table_scan.rs
@@ -15,15 +15,25 @@
 // specific language governing permissions and limitations
 // under the License.
 
-use datafusion_expr::logical_plan::TableScan;
-use pyo3::prelude::*;
 use std::fmt::{self, Display, Formatter};
 
+use datafusion::common::TableReference;
+use datafusion::logical_expr::logical_plan::TableScan;
+use pyo3::IntoPyObjectExt;
+use pyo3::prelude::*;
+
+use crate::common::df_schema::PyDFSchema;
+use crate::expr::PyExpr;
 use crate::expr::logical_node::LogicalNode;
 use crate::sql::logical::PyLogicalPlan;
-use crate::{common::df_schema::PyDFSchema, expr::PyExpr};
 
-#[pyclass(name = "TableScan", module = "datafusion.expr", subclass)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "TableScan",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone)]
 pub struct PyTableScan {
     table_scan: TableScan,
@@ -67,20 +77,33 @@ impl Display for PyTableScan {
 impl PyTableScan {
     /// Retrieves the name of the table represented by this `TableScan` instance
     #[pyo3(name = "table_name")]
-    fn py_table_name(&self) -> PyResult<&str> {
-        Ok(&self.table_scan.table_name)
+    fn py_table_name(&self) -> PyResult<String> {
+        Ok(format!("{}", self.table_scan.table_name))
     }
 
-    /// TODO: Bindings for `TableSource` need to exist first. Left as a
-    /// placeholder to display intention to add when able to.
-    // #[pyo3(name = "source")]
-    // fn py_source(&self) -> PyResult<Arc<dyn TableSource>> {
-    //     Ok(self.table_scan.source)
-    // }
+    #[pyo3(name = "fqn")]
+    fn fqn(&self) -> PyResult<(Option<String>, Option<String>, String)> {
+        let table_ref: TableReference = self.table_scan.table_name.clone();
+        Ok(match table_ref {
+            TableReference::Bare { table } => (None, None, table.to_string()),
+            TableReference::Partial { schema, table } => {
+                (None, Some(schema.to_string()), table.to_string())
+            }
+            TableReference::Full {
+                catalog,
+                schema,
+                table,
+            } => (
+                Some(catalog.to_string()),
+                Some(schema.to_string()),
+                table.to_string(),
+            ),
+        })
+    }
 
     /// The column indexes that should be. Note if this is empty then
     /// all columns should be read by the `TableProvider`. This function
This function - /// provides a Tuple of the (index, column_name) to make things simplier + /// provides a Tuple of the (index, column_name) to make things simpler /// for the calling code since often times the name is preferred to /// the index which is a lower level abstraction. #[pyo3(name = "projection")] @@ -122,7 +145,7 @@ impl PyTableScan { } fn __repr__(&self) -> PyResult { - Ok(format!("TableScan({})", self)) + Ok(format!("TableScan({self})")) } } @@ -132,7 +155,7 @@ impl LogicalNode for PyTableScan { vec![] } - fn to_variant(&self, py: Python) -> PyResult { - Ok(self.clone().into_py(py)) + fn to_variant<'py>(&self, py: Python<'py>) -> PyResult> { + self.clone().into_bound_py_any(py) } } diff --git a/src/expr/union.rs b/crates/core/src/expr/union.rs similarity index 86% rename from src/expr/union.rs rename to crates/core/src/expr/union.rs index 98e8eaae6..a3b9efe91 100644 --- a/src/expr/union.rs +++ b/crates/core/src/expr/union.rs @@ -15,15 +15,23 @@ // specific language governing permissions and limitations // under the License. 
-use datafusion_expr::logical_plan::Union;
-use pyo3::prelude::*;
 use std::fmt::{self, Display, Formatter};
 
+use datafusion::logical_expr::logical_plan::Union;
+use pyo3::IntoPyObjectExt;
+use pyo3::prelude::*;
+
 use crate::common::df_schema::PyDFSchema;
 use crate::expr::logical_node::LogicalNode;
 use crate::sql::logical::PyLogicalPlan;
 
-#[pyclass(name = "Union", module = "datafusion.expr", subclass)]
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "Union",
+    module = "datafusion.expr",
+    subclass
+)]
 #[derive(Clone)]
 pub struct PyUnion {
     union_: Union,
@@ -66,7 +74,7 @@ impl PyUnion {
     }
 
     fn __repr__(&self) -> PyResult<String> {
-        Ok(format!("Union({})", self))
+        Ok(format!("Union({self})"))
    }
 
     fn __name__(&self) -> PyResult<String> {
@@ -83,7 +91,7 @@ impl LogicalNode for PyUnion {
             .collect()
     }
 
-    fn to_variant(&self, py: Python) -> PyResult<PyObject> {
-        Ok(self.clone().into_py(py))
+    fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.clone().into_bound_py_any(py)
     }
 }
diff --git a/crates/core/src/expr/unnest.rs b/crates/core/src/expr/unnest.rs
new file mode 100644
index 000000000..880d0a279
--- /dev/null
+++ b/crates/core/src/expr/unnest.rs
@@ -0,0 +1,93 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::fmt::{self, Display, Formatter};
+
+use datafusion::logical_expr::logical_plan::Unnest;
+use pyo3::IntoPyObjectExt;
+use pyo3::prelude::*;
+
+use crate::common::df_schema::PyDFSchema;
+use crate::expr::logical_node::LogicalNode;
+use crate::sql::logical::PyLogicalPlan;
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "Unnest",
+    module = "datafusion.expr",
+    subclass
+)]
+#[derive(Clone)]
+pub struct PyUnnest {
+    unnest_: Unnest,
+}
+
+impl From<Unnest> for PyUnnest {
+    fn from(unnest_: Unnest) -> PyUnnest {
+        PyUnnest { unnest_ }
+    }
+}
+
+impl From<PyUnnest> for Unnest {
+    fn from(unnest_: PyUnnest) -> Self {
+        unnest_.unnest_
+    }
+}
+
+impl Display for PyUnnest {
+    fn fmt(&self, f: &mut Formatter) -> fmt::Result {
+        write!(
+            f,
+            "Unnest
+            Inputs: {:?}
+            Schema: {:?}",
+            &self.unnest_.input, &self.unnest_.schema,
+        )
+    }
+}
+
+#[pymethods]
+impl PyUnnest {
+    /// Retrieves the input `LogicalPlan` to this `Unnest` node
+    fn input(&self) -> PyResult<Vec<PyLogicalPlan>> {
+        Ok(Self::inputs(self))
+    }
+
+    /// Resulting Schema for this `Unnest` node instance
+    fn schema(&self) -> PyResult<PyDFSchema> {
+        Ok(self.unnest_.schema.as_ref().clone().into())
+    }
+
+    fn __repr__(&self) -> PyResult<String> {
+        Ok(format!("Unnest({self})"))
+    }
+
+    fn __name__(&self) -> PyResult<String> {
+        Ok("Unnest".to_string())
+    }
+}
+
+impl LogicalNode for PyUnnest {
+    fn inputs(&self) -> Vec<PyLogicalPlan> {
+        vec![PyLogicalPlan::from((*self.unnest_.input).clone())]
+    }
+
+    fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.clone().into_bound_py_any(py)
+    }
+}
diff --git a/crates/core/src/expr/unnest_expr.rs b/crates/core/src/expr/unnest_expr.rs
new file mode 100644
index 000000000..97feef1d1
--- /dev/null
+++ b/crates/core/src/expr/unnest_expr.rs
@@ -0,0 +1,74 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.
The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::fmt::{self, Display, Formatter};
+
+use datafusion::logical_expr::expr::Unnest;
+use pyo3::prelude::*;
+
+use super::PyExpr;
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "UnnestExpr",
+    module = "datafusion.expr",
+    subclass
+)]
+#[derive(Clone)]
+pub struct PyUnnestExpr {
+    unnest: Unnest,
+}
+
+impl From<Unnest> for PyUnnestExpr {
+    fn from(unnest: Unnest) -> PyUnnestExpr {
+        PyUnnestExpr { unnest }
+    }
+}
+
+impl From<PyUnnestExpr> for Unnest {
+    fn from(unnest: PyUnnestExpr) -> Self {
+        unnest.unnest
+    }
+}
+
+impl Display for PyUnnestExpr {
+    fn fmt(&self, f: &mut Formatter) -> fmt::Result {
+        write!(
+            f,
+            "Unnest
+            Expr: {:?}",
+            &self.unnest.expr,
+        )
+    }
+}
+
+#[pymethods]
+impl PyUnnestExpr {
+    /// Retrieves the expression that is being unnested
+    fn expr(&self) -> PyResult<PyExpr> {
+        Ok((*self.unnest.expr).clone().into())
+    }
+
+    fn __repr__(&self) -> PyResult<String> {
+        Ok(format!("UnnestExpr({self})"))
+    }
+
+    fn __name__(&self) -> PyResult<String> {
+        Ok("UnnestExpr".to_string())
+    }
+}
diff --git a/crates/core/src/expr/values.rs b/crates/core/src/expr/values.rs
new file mode 100644
index 000000000..d40b0e7cf
--- /dev/null
+++ b/crates/core/src/expr/values.rs
@@ -0,0 +1,93 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.
The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::sync::Arc;
+
+use datafusion::logical_expr::Values;
+use pyo3::prelude::*;
+use pyo3::{IntoPyObjectExt, PyErr, PyResult, Python, pyclass};
+
+use super::PyExpr;
+use super::logical_node::LogicalNode;
+use crate::common::df_schema::PyDFSchema;
+use crate::sql::logical::PyLogicalPlan;
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "Values",
+    module = "datafusion.expr",
+    subclass
+)]
+#[derive(Clone)]
+pub struct PyValues {
+    values: Values,
+}
+
+impl From<Values> for PyValues {
+    fn from(values: Values) -> PyValues {
+        PyValues { values }
+    }
+}
+
+impl TryFrom<PyValues> for Values {
+    type Error = PyErr;
+
+    fn try_from(py: PyValues) -> Result<Self, Self::Error> {
+        Ok(py.values)
+    }
+}
+
+impl LogicalNode for PyValues {
+    fn inputs(&self) -> Vec<PyLogicalPlan> {
+        vec![]
+    }
+
+    fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.clone().into_bound_py_any(py)
+    }
+}
+
+#[pymethods]
+impl PyValues {
+    #[new]
+    pub fn new(schema: PyDFSchema, values: Vec<Vec<PyExpr>>) -> PyResult<Self> {
+        let values = values
+            .into_iter()
+            .map(|row| row.into_iter().map(|expr| expr.into()).collect())
+            .collect();
+        Ok(PyValues {
+            values: Values {
+                schema: Arc::new(schema.into()),
+                values,
+            },
+        })
+    }
+
+    pub fn schema(&self) -> PyResult<PyDFSchema> {
+        Ok((*self.values.schema).clone().into())
+    }
+
+    pub fn values(&self) -> Vec<Vec<PyExpr>> {
+        self.values
+            .values
+            .clone()
+            .into_iter()
+            .map(|row| row.into_iter().map(|expr| expr.into()).collect())
+            .collect()
+    }
+}
diff --git a/crates/core/src/expr/window.rs b/crates/core/src/expr/window.rs
new file mode 100644
index 000000000..92d909bfc
--- /dev/null
+++ b/crates/core/src/expr/window.rs
@@ -0,0 +1,307 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::fmt::{self, Display, Formatter};
+
+use datafusion::common::{DataFusionError, ScalarValue};
+use datafusion::logical_expr::{Expr, Window, WindowFrame, WindowFrameBound, WindowFrameUnits};
+use pyo3::IntoPyObjectExt;
+use pyo3::exceptions::PyNotImplementedError;
+use pyo3::prelude::*;
+
+use super::py_expr_list;
+use crate::common::data_type::PyScalarValue;
+use crate::common::df_schema::PyDFSchema;
+use crate::errors::{PyDataFusionResult, py_type_err};
+use crate::expr::PyExpr;
+use crate::expr::logical_node::LogicalNode;
+use crate::expr::sort_expr::{PySortExpr, py_sort_expr_list};
+use crate::sql::logical::PyLogicalPlan;
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "WindowExpr",
+    module = "datafusion.expr",
+    subclass
+)]
+#[derive(Clone)]
+pub struct PyWindowExpr {
+    window: Window,
+}
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "WindowFrame",
+    module = "datafusion.expr",
+    subclass
+)]
+#[derive(Clone)]
+pub struct PyWindowFrame {
+    window_frame: WindowFrame,
+}
+
+impl From<PyWindowFrame> for WindowFrame {
+    fn from(window_frame: PyWindowFrame) -> Self {
+        window_frame.window_frame
+    }
+}
+
+impl From<WindowFrame> for PyWindowFrame {
+    fn from(window_frame: WindowFrame) -> PyWindowFrame {
+        PyWindowFrame { window_frame }
+    }
+}
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "WindowFrameBound",
+    module = "datafusion.expr",
+    subclass
+)]
+#[derive(Clone)]
+pub struct PyWindowFrameBound {
+    frame_bound: WindowFrameBound,
+}
+
+impl From<PyWindowExpr> for Window {
+    fn from(window: PyWindowExpr) -> Window {
+        window.window
+    }
+}
+
+impl From<Window> for PyWindowExpr {
+    fn from(window: Window) -> PyWindowExpr {
+        PyWindowExpr { window }
+    }
+}
+
+impl From<WindowFrameBound> for PyWindowFrameBound {
+    fn from(frame_bound: WindowFrameBound) -> Self {
+        PyWindowFrameBound { frame_bound }
+    }
+}
+
+impl Display for PyWindowExpr {
+    fn fmt(&self, f: &mut Formatter) -> fmt::Result {
+        write!(
+            f,
+            "Over\n
+            Window Expr: {:?}
+            Schema: {:?}",
+            &self.window.window_expr, &self.window.schema
+        )
+    }
+}
+
+impl Display for PyWindowFrame {
+    fn fmt(&self, f: &mut Formatter) -> std::fmt::Result {
+        write!(
+            f,
+            "OVER ({} BETWEEN {} AND {})",
+            self.window_frame.units, self.window_frame.start_bound, self.window_frame.end_bound
+        )
+    }
+}
+
+#[pymethods]
+impl PyWindowExpr {
+    /// Returns the schema of the Window
+    pub fn schema(&self) -> PyResult<PyDFSchema> {
+        Ok(self.window.schema.as_ref().clone().into())
+    }
+
+    /// Returns window expressions
+    pub fn get_window_expr(&self) -> PyResult<Vec<PyExpr>> {
+        py_expr_list(&self.window.window_expr)
+    }
+
+    /// Returns order by columns in a window function expression
+    pub fn get_sort_exprs(&self, expr: PyExpr) -> PyResult<Vec<PySortExpr>> {
+        match expr.expr.unalias() {
+            Expr::WindowFunction(boxed_window_fn) => {
+                py_sort_expr_list(&boxed_window_fn.params.order_by)
+            }
+            other => Err(not_window_function_err(other)),
+        }
+    }
+
+    /// Return partition by columns in a window function expression
+    pub fn get_partition_exprs(&self, expr: PyExpr) -> PyResult<Vec<PyExpr>> {
+        match expr.expr.unalias() {
+            Expr::WindowFunction(boxed_window_fn) => {
+                py_expr_list(&boxed_window_fn.params.partition_by)
+            }
+            other => Err(not_window_function_err(other)),
+        }
+    }
+
+    /// Return input args for window function
+    pub fn get_args(&self, expr: PyExpr) -> PyResult<Vec<PyExpr>> {
+        match expr.expr.unalias() {
+            Expr::WindowFunction(boxed_window_fn) => py_expr_list(&boxed_window_fn.params.args),
+            other => Err(not_window_function_err(other)),
+        }
+    }
+
+    /// Return window function name
+    pub fn window_func_name(&self, expr: PyExpr) -> PyResult<String> {
+        match expr.expr.unalias() {
+            Expr::WindowFunction(boxed_window_fn) => Ok(boxed_window_fn.fun.to_string()),
+            other => Err(not_window_function_err(other)),
+        }
+    }
+
+    /// Returns a PyWindowFrame for a given window function expression
+    pub fn get_frame(&self, expr: PyExpr) -> Option<PyWindowFrame> {
+        match expr.expr.unalias() {
+            Expr::WindowFunction(boxed_window_fn) => {
+                Some(boxed_window_fn.params.window_frame.into())
+            }
+            _ => None,
+        }
+    }
+}
+
+fn
not_window_function_err(expr: Expr) -> PyErr {
+    py_type_err(format!(
+        "Provided {} Expr {:?} is not a WindowFunction type",
+        expr.variant_name(),
+        expr
+    ))
+}
+
+#[pymethods]
+impl PyWindowFrame {
+    #[new]
+    #[pyo3(signature=(unit, start_bound, end_bound))]
+    pub fn new(
+        unit: &str,
+        start_bound: Option<PyScalarValue>,
+        end_bound: Option<PyScalarValue>,
+    ) -> PyResult<Self> {
+        let units = unit.to_ascii_lowercase();
+        let units = match units.as_str() {
+            "rows" => WindowFrameUnits::Rows,
+            "range" => WindowFrameUnits::Range,
+            "groups" => WindowFrameUnits::Groups,
+            _ => {
+                return Err(PyNotImplementedError::new_err(format!("{units:?}")));
+            }
+        };
+        let start_bound = match start_bound {
+            Some(start_bound) => WindowFrameBound::Preceding(start_bound.0),
+            None => match units {
+                WindowFrameUnits::Range => WindowFrameBound::Preceding(ScalarValue::UInt64(None)),
+                WindowFrameUnits::Rows => WindowFrameBound::Preceding(ScalarValue::UInt64(None)),
+                WindowFrameUnits::Groups => {
+                    return Err(PyNotImplementedError::new_err(format!("{units:?}")));
+                }
+            },
+        };
+        let end_bound = match end_bound {
+            Some(end_bound) => WindowFrameBound::Following(end_bound.0),
+            None => match units {
+                WindowFrameUnits::Rows => WindowFrameBound::Following(ScalarValue::UInt64(None)),
+                WindowFrameUnits::Range => WindowFrameBound::Following(ScalarValue::UInt64(None)),
+                WindowFrameUnits::Groups => {
+                    return Err(PyNotImplementedError::new_err(format!("{units:?}")));
+                }
+            },
+        };
+        Ok(PyWindowFrame {
+            window_frame: WindowFrame::new_bounds(units, start_bound, end_bound),
+        })
+    }
+
+    /// Returns the window frame units for the bounds
+    pub fn get_frame_units(&self) -> PyResult<String> {
+        Ok(self.window_frame.units.to_string())
+    }
+    /// Returns starting bound
+    pub fn get_lower_bound(&self) -> PyResult<PyWindowFrameBound> {
+        Ok(self.window_frame.start_bound.clone().into())
+    }
+    /// Returns end bound
+    pub fn get_upper_bound(&self) -> PyResult<PyWindowFrameBound> {
+        Ok(self.window_frame.end_bound.clone().into())
+    }
+
+    /// Get a String representation of this
window frame
+    fn __repr__(&self) -> String {
+        format!("{self}")
+    }
+}
+
+#[pymethods]
+impl PyWindowFrameBound {
+    /// Returns if the frame bound is current row
+    pub fn is_current_row(&self) -> bool {
+        matches!(self.frame_bound, WindowFrameBound::CurrentRow)
+    }
+
+    /// Returns if the frame bound is preceding
+    pub fn is_preceding(&self) -> bool {
+        matches!(self.frame_bound, WindowFrameBound::Preceding(_))
+    }
+
+    /// Returns if the frame bound is following
+    pub fn is_following(&self) -> bool {
+        matches!(self.frame_bound, WindowFrameBound::Following(_))
+    }
+    /// Returns the offset of the window frame
+    pub fn get_offset(&self) -> PyDataFusionResult<Option<u64>> {
+        match &self.frame_bound {
+            WindowFrameBound::Preceding(val) | WindowFrameBound::Following(val) => match val {
+                x if x.is_null() => Ok(None),
+                ScalarValue::UInt64(v) => Ok(*v),
+                // The cast below is only safe because window bounds cannot be negative
+                ScalarValue::Int64(v) => Ok(v.map(|n| n as u64)),
+                ScalarValue::Utf8(Some(s)) => match s.parse::<u64>() {
+                    Ok(s) => Ok(Some(s)),
+                    Err(_e) => Err(DataFusionError::Plan(format!(
+                        "Unable to parse u64 from Utf8 value '{s}'"
+                    ))
+                    .into()),
+                },
+                ref x => {
+                    Err(DataFusionError::Plan(format!("Unexpected window frame bound: {x}")).into())
+                }
+            },
+            WindowFrameBound::CurrentRow => Ok(None),
+        }
+    }
+    /// Returns if the frame bound is unbounded
+    pub fn is_unbounded(&self) -> PyResult<bool> {
+        match &self.frame_bound {
+            WindowFrameBound::Preceding(v) | WindowFrameBound::Following(v) => Ok(v.is_null()),
+            WindowFrameBound::CurrentRow => Ok(false),
+        }
+    }
+}
+
+impl LogicalNode for PyWindowExpr {
+    fn inputs(&self) -> Vec<PyLogicalPlan> {
+        vec![self.window.input.as_ref().clone().into()]
+    }
+
+    fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.clone().into_bound_py_any(py)
+    }
+}
diff --git a/crates/core/src/functions.rs b/crates/core/src/functions.rs
new file mode 100644
index 000000000..7feb62d79
--- /dev/null
+++ b/crates/core/src/functions.rs
@@ -0,0 +1,1146 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::collections::HashMap;
+
+use datafusion::common::{Column, ScalarValue, TableReference};
+use datafusion::logical_expr::expr::{Alias, FieldMetadata, NullTreatment as DFNullTreatment};
+use datafusion::logical_expr::{Expr, ExprFunctionExt, lit};
+use datafusion::{functions, functions_aggregate, functions_window};
+use pyo3::prelude::*;
+use pyo3::wrap_pyfunction;
+
+use crate::common::data_type::{NullTreatment, PyScalarValue};
+use crate::errors::PyDataFusionResult;
+use crate::expr::PyExpr;
+use crate::expr::conditional_expr::PyCaseBuilder;
+use crate::expr::sort_expr::{PySortExpr, to_sort_expressions};
+use crate::expr::window::PyWindowFrame;
+
+fn add_builder_fns_to_aggregate(
+    agg_fn: Expr,
+    distinct: Option<bool>,
+    filter: Option<PyExpr>,
+    order_by: Option<Vec<PySortExpr>>,
+    null_treatment: Option<NullTreatment>,
+) -> PyDataFusionResult<PyExpr> {
+    // Since ExprFuncBuilder::new() is private, we can guarantee initializing
+    // a builder with a `null_treatment` of None
+    let mut builder = agg_fn.null_treatment(None);
+
+    if let Some(order_by_cols) = order_by {
+        let order_by_cols = to_sort_expressions(order_by_cols);
+        builder = builder.order_by(order_by_cols);
+    }
+
+    if let Some(true) =
distinct {
+        builder = builder.distinct();
+    }
+
+    if let Some(filter) = filter {
+        builder = builder.filter(filter.expr);
+    }
+
+    builder = builder.null_treatment(null_treatment.map(DFNullTreatment::from));
+
+    Ok(builder.build()?.into())
+}
+
+#[pyfunction]
+fn in_list(expr: PyExpr, value: Vec<PyExpr>, negated: bool) -> PyExpr {
+    datafusion::logical_expr::in_list(
+        expr.expr,
+        value.into_iter().map(|x| x.expr).collect::<Vec<Expr>>(),
+        negated,
+    )
+    .into()
+}
+
+#[pyfunction]
+fn make_array(exprs: Vec<PyExpr>) -> PyExpr {
+    datafusion::functions_nested::expr_fn::make_array(exprs.into_iter().map(|x| x.into()).collect())
+        .into()
+}
+
+#[pyfunction]
+fn array_concat(exprs: Vec<PyExpr>) -> PyExpr {
+    let exprs = exprs.into_iter().map(|x| x.into()).collect();
+    datafusion::functions_nested::expr_fn::array_concat(exprs).into()
+}
+
+#[pyfunction]
+fn array_cat(exprs: Vec<PyExpr>) -> PyExpr {
+    array_concat(exprs)
+}
+
+#[pyfunction]
+fn array_distance(array1: PyExpr, array2: PyExpr) -> PyExpr {
+    let args = vec![array1.into(), array2.into()];
+    Expr::ScalarFunction(datafusion::logical_expr::expr::ScalarFunction::new_udf(
+        datafusion::functions_nested::distance::array_distance_udf(),
+        args,
+    ))
+    .into()
+}
+
+#[pyfunction]
+fn arrays_zip(exprs: Vec<PyExpr>) -> PyExpr {
+    let exprs = exprs.into_iter().map(|x| x.into()).collect();
+    datafusion::functions_nested::expr_fn::arrays_zip(exprs).into()
+}
+
+#[pyfunction]
+#[pyo3(signature = (string, delimiter, null_string=None))]
+fn string_to_array(string: PyExpr, delimiter: PyExpr, null_string: Option<PyExpr>) -> PyExpr {
+    let mut args = vec![string.into(), delimiter.into()];
+    if let Some(null_string) = null_string {
+        args.push(null_string.into());
+    }
+    Expr::ScalarFunction(datafusion::logical_expr::expr::ScalarFunction::new_udf(
+        datafusion::functions_nested::string::string_to_array_udf(),
+        args,
+    ))
+    .into()
+}
+
+#[pyfunction]
+#[pyo3(signature = (start, stop, step=None))]
+fn gen_series(start: PyExpr, stop: PyExpr, step: Option<PyExpr>) -> PyExpr {
+    let mut args =
vec![start.into(), stop.into()]; + if let Some(step) = step { + args.push(step.into()); + } + Expr::ScalarFunction(datafusion::logical_expr::expr::ScalarFunction::new_udf( + datafusion::functions_nested::range::gen_series_udf(), + args, + )) + .into() +} + +#[pyfunction] +fn make_map(keys: Vec, values: Vec) -> PyExpr { + let keys = keys.into_iter().map(|x| x.into()).collect(); + let values = values.into_iter().map(|x| x.into()).collect(); + datafusion::functions_nested::map::map(keys, values).into() +} + +#[pyfunction] +#[pyo3(signature = (array, element, index=None))] +fn array_position(array: PyExpr, element: PyExpr, index: Option) -> PyExpr { + let index = ScalarValue::Int64(index); + let index = Expr::Literal(index, None); + datafusion::functions_nested::expr_fn::array_position(array.into(), element.into(), index) + .into() +} + +#[pyfunction] +#[pyo3(signature = (array, begin, end, stride=None))] +fn array_slice(array: PyExpr, begin: PyExpr, end: PyExpr, stride: Option) -> PyExpr { + datafusion::functions_nested::expr_fn::array_slice( + array.into(), + begin.into(), + end.into(), + stride.map(Into::into), + ) + .into() +} + +/// Computes a binary hash of the given data. type is the algorithm to use. +/// Standard algorithms are md5, sha224, sha256, sha384, sha512, blake2s, blake2b, and blake3. +// #[pyfunction(value, method)] +#[pyfunction] +fn digest(value: PyExpr, method: PyExpr) -> PyExpr { + PyExpr { + expr: functions::expr_fn::digest(value.expr, method.expr), + } +} + +/// Concatenates the text representations of all the arguments. +/// NULL arguments are ignored. +#[pyfunction] +fn concat(args: Vec) -> PyResult { + let args = args.into_iter().map(|e| e.expr).collect::>(); + Ok(functions::string::expr_fn::concat(args).into()) +} + +/// Concatenates all but the first argument, with separators. +/// The first argument is used as the separator string, and should not be NULL. +/// Other NULL arguments are ignored. 
+#[pyfunction]
+fn concat_ws(sep: String, args: Vec<PyExpr>) -> PyResult<PyExpr> {
+    let args = args.into_iter().map(|e| e.expr).collect::<Vec<_>>();
+    Ok(functions::string::expr_fn::concat_ws(lit(sep), args).into())
+}
+
+#[pyfunction]
+#[pyo3(signature = (values, regex, flags=None))]
+fn regexp_like(values: PyExpr, regex: PyExpr, flags: Option<PyExpr>) -> PyResult<PyExpr> {
+    Ok(functions::expr_fn::regexp_like(values.expr, regex.expr, flags.map(|x| x.expr)).into())
+}
+
+#[pyfunction]
+#[pyo3(signature = (values, regex, flags=None))]
+fn regexp_match(values: PyExpr, regex: PyExpr, flags: Option<PyExpr>) -> PyResult<PyExpr> {
+    Ok(functions::expr_fn::regexp_match(values.expr, regex.expr, flags.map(|x| x.expr)).into())
+}
+
+#[pyfunction]
+#[pyo3(signature = (string, pattern, replacement, flags=None))]
+/// Replaces substring(s) matching a POSIX regular expression.
+fn regexp_replace(
+    string: PyExpr,
+    pattern: PyExpr,
+    replacement: PyExpr,
+    flags: Option<PyExpr>,
+) -> PyResult<PyExpr> {
+    Ok(functions::expr_fn::regexp_replace(
+        string.into(),
+        pattern.into(),
+        replacement.into(),
+        flags.map(|x| x.expr),
+    )
+    .into())
+}
+
+#[pyfunction]
+#[pyo3(signature = (string, pattern, start, flags=None))]
+/// Returns the number of matches found in the string.
+fn regexp_count(
+    string: PyExpr,
+    pattern: PyExpr,
+    start: Option<PyExpr>,
+    flags: Option<PyExpr>,
+) -> PyResult<PyExpr> {
+    Ok(functions::expr_fn::regexp_count(
+        string.expr,
+        pattern.expr,
+        start.map(|x| x.expr),
+        flags.map(|x| x.expr),
+    )
+    .into())
+}
+
+#[pyfunction]
+#[pyo3(signature = (values, regex, start=None, n=None, flags=None, subexpr=None))]
+/// Returns the position in a string where the specified occurrence of a regular expression is located
+fn regexp_instr(
+    values: PyExpr,
+    regex: PyExpr,
+    start: Option<PyExpr>,
+    n: Option<PyExpr>,
+    flags: Option<PyExpr>,
+    subexpr: Option<PyExpr>,
+) -> PyResult<PyExpr> {
+    Ok(functions::expr_fn::regexp_instr(
+        values.into(),
+        regex.into(),
+        start.map(|x| x.expr).or(Some(lit(1))),
+        n.map(|x| x.expr).or(Some(lit(1))),
+        None,
+        flags.map(|x| x.expr).or(Some(lit(""))),
+        subexpr.map(|x| x.expr).or(Some(lit(0))),
+    )
+    .into())
+}
+
+/// Creates a new Sort Expr
+#[pyfunction]
+fn order_by(expr: PyExpr, asc: bool, nulls_first: bool) -> PyResult<PySortExpr> {
+    Ok(PySortExpr::from(datafusion::logical_expr::expr::Sort {
+        expr: expr.expr,
+        asc,
+        nulls_first,
+    }))
+}
+
+/// Creates a new Alias Expr
+#[pyfunction]
+#[pyo3(signature = (expr, name, metadata=None))]
+fn alias(expr: PyExpr, name: &str, metadata: Option<HashMap<String, String>>) -> PyResult<PyExpr> {
+    let relation: Option<TableReference> = None;
+    let metadata = metadata.map(|m| FieldMetadata::new(m.into_iter().collect()));
+    Ok(PyExpr {
+        expr: datafusion::logical_expr::Expr::Alias(
+            Alias::new(expr.expr, relation, name).with_metadata(metadata),
+        ),
+    })
+}
+
+/// Create a column reference Expr
+#[pyfunction]
+fn col(name: &str) -> PyResult<PyExpr> {
+    Ok(PyExpr {
+        expr: datafusion::logical_expr::Expr::Column(Column::new_unqualified(name)),
+    })
+}
+
+/// Create a CASE WHEN statement with literal WHEN expressions for comparison to the base expression.
+#[pyfunction]
+fn case(expr: PyExpr) -> PyResult<PyCaseBuilder> {
+    Ok(PyCaseBuilder::new(Some(expr)))
+}
+
+/// Create a CASE WHEN statement with literal WHEN expressions for comparison to the base expression.
+#[pyfunction]
+fn when(when: PyExpr, then: PyExpr) -> PyResult<PyCaseBuilder> {
+    Ok(PyCaseBuilder::new(None).when(when, then))
+}
+
+// Generates a [pyo3] wrapper for associated aggregate functions.
+// All of the builder options are exposed to the python internal
+// function and we rely on the wrappers to only use those that
+// are appropriate.
+macro_rules! aggregate_function {
+    ($NAME: ident) => {
+        aggregate_function!($NAME, expr);
+    };
+    ($NAME: ident, $($arg:ident)*) => {
+        #[pyfunction]
+        #[pyo3(signature = ($($arg),*, distinct=None, filter=None, order_by=None, null_treatment=None))]
+        fn $NAME(
+            $($arg: PyExpr),*,
+            distinct: Option<bool>,
+            filter: Option<PyExpr>,
+            order_by: Option<Vec<PySortExpr>>,
+            null_treatment: Option<NullTreatment>
+        ) -> PyDataFusionResult<PyExpr> {
+            let agg_fn = functions_aggregate::expr_fn::$NAME($($arg.into()),*);
+
+            add_builder_fns_to_aggregate(agg_fn, distinct, filter, order_by, null_treatment)
+        }
+    };
+}
+
+/// Generates a [pyo3] wrapper for [datafusion::functions::expr_fn]
+///
+/// These functions have explicit named arguments.
+macro_rules! expr_fn {
+    ($FUNC: ident) => {
+        expr_fn!($FUNC, , stringify!($FUNC));
+    };
+    ($FUNC:ident, $($arg:ident)*) => {
+        expr_fn!($FUNC, $($arg)*, stringify!($FUNC));
+    };
+    ($FUNC: ident, $DOC: expr) => {
+        expr_fn!($FUNC, ,$DOC);
+    };
+    ($FUNC: ident, $($arg:ident)*, $DOC: expr) => {
+        #[doc = $DOC]
+        #[pyfunction]
+        fn $FUNC($($arg: PyExpr),*) -> PyExpr {
+            functions::expr_fn::$FUNC($($arg.into()),*).into()
+        }
+    };
+}
+
+/// Generates a [pyo3] wrapper for [datafusion::functions::expr_fn]
+///
+/// These functions take a single `Vec<PyExpr>` argument using `pyo3(signature = (*args))`.
+macro_rules! expr_fn_vec {
+    ($FUNC: ident) => {
+        expr_fn_vec!($FUNC, stringify!($FUNC));
+    };
+    ($FUNC: ident, $DOC: expr) => {
+        #[doc = $DOC]
+        #[pyfunction]
+        #[pyo3(signature = (*args))]
+        fn $FUNC(args: Vec<PyExpr>) -> PyExpr {
+            let args = args.into_iter().map(|e| e.into()).collect::<Vec<_>>();
+            functions::expr_fn::$FUNC(args).into()
+        }
+    };
+}
+
+/// Generates a [pyo3] wrapper for [datafusion_functions_nested::expr_fn]
+///
+/// These functions have explicit named arguments.
+macro_rules! array_fn {
+    ($FUNC: ident) => {
+        array_fn!($FUNC, , stringify!($FUNC));
+    };
+    ($FUNC:ident, $($arg:ident)*) => {
+        array_fn!($FUNC, $($arg)*, stringify!($FUNC));
+    };
+    ($FUNC: ident, $DOC: expr) => {
+        array_fn!($FUNC, , $DOC);
+    };
+    ($FUNC: ident, $($arg:ident)*, $DOC:expr) => {
+        #[doc = $DOC]
+        #[pyfunction]
+        fn $FUNC($($arg: PyExpr),*) -> PyExpr {
+            datafusion::functions_nested::expr_fn::$FUNC($($arg.into()),*).into()
+        }
+    };
+}
+
+expr_fn!(abs, num);
+expr_fn!(acos, num);
+expr_fn!(acosh, num);
+expr_fn!(
+    ascii,
+    arg1,
+    "Returns the numeric code of the first character of the argument. In UTF8 encoding, returns the Unicode code point of the character. In other multibyte encodings, the argument must be an ASCII character."
+);
+expr_fn!(asin, num);
+expr_fn!(asinh, num);
+expr_fn!(atan, num);
+expr_fn!(atanh, num);
+expr_fn!(atan2, y x);
+expr_fn!(
+    bit_length,
+    arg,
+    "Returns number of bits in the string (8 times the octet_length)."
+);
+expr_fn_vec!(
+    btrim,
+    "Removes the longest string containing only characters in characters (a space by default) from the start and end of string."
+);
+expr_fn!(cbrt, num);
+expr_fn!(ceil, num);
+expr_fn!(
+    character_length,
+    string,
+    "Returns number of characters in the string."
+); +expr_fn!(length, string); +expr_fn!(char_length, string); +expr_fn!(chr, arg, "Returns the character with the given code."); +expr_fn_vec!(coalesce); +expr_fn_vec!(greatest); +expr_fn_vec!(least); +expr_fn!( + contains, + string search_str, + "Return true if search_str is found within string (case-sensitive)." +); +expr_fn!(cos, num); +expr_fn!(cosh, num); +expr_fn!(cot, num); +expr_fn!(degrees, num); +expr_fn!(decode, input encoding); +expr_fn!(encode, input encoding); +expr_fn!(ends_with, string suffix, "Returns true if string ends with suffix."); +expr_fn!(exp, num); +expr_fn!(factorial, num); +expr_fn!(floor, num); +expr_fn!(gcd, x y); +expr_fn!( + initcap, + string, + "Converts the first letter of each word to upper case and the rest to lower case. Words are sequences of alphanumeric characters separated by non-alphanumeric characters." +); +expr_fn!(isnan, num); +expr_fn!(iszero, num); +expr_fn!(levenshtein, string1 string2); +expr_fn!(lcm, x y); +expr_fn!(left, string n, "Returns first n characters in the string, or when n is negative, returns all but last |n| characters."); +expr_fn!(ln, num); +expr_fn!(log, base num); +expr_fn!(log10, num); +expr_fn!(log2, num); +expr_fn!(lower, arg1, "Converts the string to all lower case"); +expr_fn_vec!( + lpad, + "Extends the string to length length by prepending the characters fill (a space by default). If the string is already longer than length then it is truncated (on the right)." +); +expr_fn_vec!( + ltrim, + "Removes the longest string containing only characters in characters (a space by default) from the start of string." +); +expr_fn!( + md5, + input_arg, + "Computes the MD5 hash of the argument, with the result written in hexadecimal." +); +expr_fn!( + nanvl, + x y, + "Returns x if x is not NaN otherwise returns y." +); +expr_fn!( + nvl, + x y, + "Returns x if x is not NULL otherwise returns y." +); +expr_fn!( + nvl2, + x y z, + "Returns y if x is not NULL; otherwise returns z." 
+); +expr_fn!(nullif, arg_1 arg_2); +expr_fn!( + octet_length, + args, + "Returns number of bytes in the string. Since this version of the function accepts type character directly, it will not strip trailing spaces." +); +expr_fn_vec!(overlay); +expr_fn!(pi); +expr_fn!(power, base exponent); +expr_fn!(radians, num); +expr_fn!(repeat, string n, "Repeats string the specified number of times."); +expr_fn!( + replace, + string from to, + "Replaces all occurrences in string of substring from with substring to." +); +expr_fn!( + reverse, + string, + "Reverses the order of the characters in the string." +); +expr_fn!(right, string n, "Returns last n characters in the string, or when n is negative, returns all but first |n| characters."); +expr_fn_vec!(round); +expr_fn_vec!( + rpad, + "Extends the string to length length by appending the characters fill (a space by default). If the string is already longer than length then it is truncated." +); +expr_fn_vec!( + rtrim, + "Removes the longest string containing only characters in characters (a space by default) from the end of string." +); +expr_fn!(sha224, input_arg1); +expr_fn!(sha256, input_arg1); +expr_fn!(sha384, input_arg1); +expr_fn!(sha512, input_arg1); +expr_fn!(signum, num); +expr_fn!(sin, num); +expr_fn!(sinh, num); +expr_fn!( + split_part, + string delimiter index, + "Splits string at occurrences of delimiter and returns the n'th field (counting from one)." +); +expr_fn!(sqrt, num); +expr_fn!(starts_with, string prefix, "Returns true if string starts with prefix."); +expr_fn!(strpos, string substring, "Returns starting index of specified substring within string, or zero if it's not present. 
(Same as position(substring in string), but note the reversed argument order.)"); +expr_fn!(substr, string position); +expr_fn!(substr_index, string delimiter count); +expr_fn!(substring, string position length); +expr_fn!(find_in_set, string string_list); +expr_fn!(tan, num); +expr_fn!(tanh, num); +expr_fn!( + to_hex, + arg1, + "Converts the number to its equivalent hexadecimal representation." +); +expr_fn!(now); +expr_fn_vec!(to_date); +expr_fn_vec!(to_local_time); +expr_fn_vec!(to_time); +expr_fn_vec!(to_timestamp); +expr_fn_vec!(to_timestamp_millis); +expr_fn_vec!(to_timestamp_nanos); +expr_fn_vec!(to_timestamp_micros); +expr_fn_vec!(to_timestamp_seconds); +expr_fn_vec!(to_unixtime); +expr_fn!(current_date); +expr_fn!(current_time); +expr_fn!(date_part, part date); +expr_fn!(date_trunc, part date); +expr_fn!(date_bin, stride source origin); +expr_fn!(make_date, year month day); +expr_fn!(make_time, hour minute second); +expr_fn!(to_char, datetime format); + +expr_fn!(translate, string from to, "Replaces each character in string that matches a character in the from set with the corresponding character in the to set. If from is longer than to, occurrences of the extra characters in from are deleted."); +expr_fn_vec!( + trim, + "Removes the longest string containing only characters in characters (a space by default) from the start, end, or both ends (BOTH is the default) of string." 
+); +expr_fn_vec!(trunc); +expr_fn!(upper, arg1, "Converts the string to all upper case."); +expr_fn!(uuid); +expr_fn_vec!(r#struct); // Use raw identifier since struct is a keyword +expr_fn_vec!(named_struct); +expr_fn!(from_unixtime, unixtime); +expr_fn!(arrow_typeof, arg_1); +expr_fn!(arrow_cast, arg_1 datatype); +expr_fn_vec!(arrow_metadata); +expr_fn!(union_tag, arg1); +expr_fn!(random); + +#[pyfunction] +fn get_field(expr: PyExpr, name: PyExpr) -> PyExpr { + functions::core::get_field() + .call(vec![expr.into(), name.into()]) + .into() +} + +#[pyfunction] +fn union_extract(union_expr: PyExpr, field_name: PyExpr) -> PyExpr { + functions::core::union_extract() + .call(vec![union_expr.into(), field_name.into()]) + .into() +} + +#[pyfunction] +fn version() -> PyExpr { + functions::core::version().call(vec![]).into() +} + +// Array Functions +array_fn!(array_append, array element); +array_fn!(array_to_string, array delimiter); +array_fn!(array_dims, array); +array_fn!(array_distinct, array); +array_fn!(array_element, array element); +array_fn!(array_empty, array); +array_fn!(array_length, array); +array_fn!(array_has, first_array second_array); +array_fn!(array_has_all, first_array second_array); +array_fn!(array_has_any, first_array second_array); +array_fn!(array_positions, array element); +array_fn!(array_ndims, array); +array_fn!(array_prepend, element array); +array_fn!(array_pop_back, array); +array_fn!(array_pop_front, array); +array_fn!(array_remove, array element); +array_fn!(array_remove_n, array element max); +array_fn!(array_remove_all, array element); +array_fn!(array_repeat, element count); +array_fn!(array_replace, array from to); +array_fn!(array_replace_n, array from to max); +array_fn!(array_replace_all, array from to); +array_fn!(array_sort, array desc null_first); +array_fn!(array_intersect, first_array second_array); +array_fn!(array_union, array1 array2); +array_fn!(array_except, first_array second_array); +array_fn!(array_resize, array size 
value); +array_fn!(array_any_value, array); +array_fn!(array_max, array); +array_fn!(array_min, array); +array_fn!(array_reverse, array); +array_fn!(cardinality, array); +array_fn!(flatten, array); +array_fn!(range, start stop step); + +// Map Functions +array_fn!(map_keys, map); +array_fn!(map_values, map); +array_fn!(map_extract, map key); +array_fn!(map_entries, map); + +aggregate_function!(array_agg); +aggregate_function!(max); +aggregate_function!(min); +aggregate_function!(avg); +aggregate_function!(sum); +aggregate_function!(bit_and); +aggregate_function!(bit_or); +aggregate_function!(bit_xor); +aggregate_function!(bool_and); +aggregate_function!(bool_or); +aggregate_function!(corr, y x); +aggregate_function!(count); +aggregate_function!(covar_samp, y x); +aggregate_function!(covar_pop, y x); +aggregate_function!(median); +aggregate_function!(regr_slope, y x); +aggregate_function!(regr_intercept, y x); +aggregate_function!(regr_count, y x); +aggregate_function!(regr_r2, y x); +aggregate_function!(regr_avgx, y x); +aggregate_function!(regr_avgy, y x); +aggregate_function!(regr_sxx, y x); +aggregate_function!(regr_syy, y x); +aggregate_function!(regr_sxy, y x); +aggregate_function!(stddev); +aggregate_function!(stddev_pop); +aggregate_function!(var_sample); +aggregate_function!(var_pop); +aggregate_function!(approx_distinct); +aggregate_function!(approx_median); + +// The grouping function's physical plan is not implemented, but the +// ResolveGroupingFunction analyzer rule rewrites it before the physical +// planner sees it, so it works correctly at runtime. 
+aggregate_function!(grouping);
+
+#[pyfunction]
+#[pyo3(signature = (sort_expression, percentile, num_centroids=None, filter=None))]
+pub fn approx_percentile_cont(
+    sort_expression: PySortExpr,
+    percentile: f64,
+    num_centroids: Option<i64>, // enforces optional arguments at the end, currently
+    filter: Option<PyExpr>,
+) -> PyDataFusionResult<PyExpr> {
+    let agg_fn = functions_aggregate::expr_fn::approx_percentile_cont(
+        sort_expression.sort,
+        lit(percentile),
+        num_centroids.map(lit),
+    );
+
+    add_builder_fns_to_aggregate(agg_fn, None, filter, None, None)
+}
+
+#[pyfunction]
+#[pyo3(signature = (sort_expression, weight, percentile, num_centroids=None, filter=None))]
+pub fn approx_percentile_cont_with_weight(
+    sort_expression: PySortExpr,
+    weight: PyExpr,
+    percentile: f64,
+    num_centroids: Option<i64>,
+    filter: Option<PyExpr>,
+) -> PyDataFusionResult<PyExpr> {
+    let agg_fn = functions_aggregate::expr_fn::approx_percentile_cont_with_weight(
+        sort_expression.sort,
+        weight.expr,
+        lit(percentile),
+        num_centroids.map(lit),
+    );
+
+    add_builder_fns_to_aggregate(agg_fn, None, filter, None, None)
+}
+
+#[pyfunction]
+#[pyo3(signature = (sort_expression, percentile, filter=None))]
+pub fn percentile_cont(
+    sort_expression: PySortExpr,
+    percentile: f64,
+    filter: Option<PyExpr>,
+) -> PyDataFusionResult<PyExpr> {
+    let agg_fn =
+        functions_aggregate::expr_fn::percentile_cont(sort_expression.sort, lit(percentile));
+
+    add_builder_fns_to_aggregate(agg_fn, None, filter, None, None)
+}
+
+// We handle last_value explicitly because the signature expects an order_by
+// https://github.com/apache/datafusion/issues/12376
+#[pyfunction]
+#[pyo3(signature = (expr, distinct=None, filter=None, order_by=None, null_treatment=None))]
+pub fn last_value(
+    expr: PyExpr,
+    distinct: Option<bool>,
+    filter: Option<PyExpr>,
+    order_by: Option<Vec<PySortExpr>>,
+    null_treatment: Option<NullTreatment>,
+) -> PyDataFusionResult<PyExpr> {
+    // If we initialize the UDAF with order_by directly, then it gets over-written by the builder
+    let agg_fn = functions_aggregate::expr_fn::last_value(expr.expr, vec![]);
+
+    add_builder_fns_to_aggregate(agg_fn, distinct, filter, order_by, null_treatment)
+}
+
+// We handle first_value explicitly because the signature expects an order_by
+// https://github.com/apache/datafusion/issues/12376
+#[pyfunction]
+#[pyo3(signature = (expr, distinct=None, filter=None, order_by=None, null_treatment=None))]
+pub fn first_value(
+    expr: PyExpr,
+    distinct: Option<bool>,
+    filter: Option<PyExpr>,
+    order_by: Option<Vec<PySortExpr>>,
+    null_treatment: Option<NullTreatment>,
+) -> PyDataFusionResult<PyExpr> {
+    // If we initialize the UDAF with order_by directly, then it gets over-written by the builder
+    let agg_fn = functions_aggregate::expr_fn::first_value(expr.expr, vec![]);
+
+    add_builder_fns_to_aggregate(agg_fn, distinct, filter, order_by, null_treatment)
+}
+
+// nth_value requires a non-expr argument
+#[pyfunction]
+#[pyo3(signature = (expr, n, distinct=None, filter=None, order_by=None, null_treatment=None))]
+pub fn nth_value(
+    expr: PyExpr,
+    n: i64,
+    distinct: Option<bool>,
+    filter: Option<PyExpr>,
+    order_by: Option<Vec<PySortExpr>>,
+    null_treatment: Option<NullTreatment>,
+) -> PyDataFusionResult<PyExpr> {
+    let agg_fn = datafusion::functions_aggregate::nth_value::nth_value(expr.expr, n, vec![]);
+    add_builder_fns_to_aggregate(agg_fn, distinct, filter, order_by, null_treatment)
+}
+
+// string_agg requires a non-expr argument
+#[pyfunction]
+#[pyo3(signature = (expr, delimiter, distinct=None, filter=None, order_by=None, null_treatment=None))]
+pub fn string_agg(
+    expr: PyExpr,
+    delimiter: String,
+    distinct: Option<bool>,
+    filter: Option<PyExpr>,
+    order_by: Option<Vec<PySortExpr>>,
+    null_treatment: Option<NullTreatment>,
+) -> PyDataFusionResult<PyExpr> {
+    let agg_fn = datafusion::functions_aggregate::string_agg::string_agg(expr.expr, lit(delimiter));
+    add_builder_fns_to_aggregate(agg_fn, distinct, filter, order_by, null_treatment)
+}
+
+pub(crate) fn add_builder_fns_to_window(
+    window_fn: Expr,
+    partition_by: Option<Vec<PyExpr>>,
+    window_frame: Option<PyWindowFrame>,
+    order_by: Option<Vec<PySortExpr>>,
+    null_treatment: Option<NullTreatment>,
+) -> PyDataFusionResult<PyExpr> {
+    let null_treatment = null_treatment.map(|n| n.into());
+    let mut builder = window_fn.null_treatment(null_treatment);
+
+    if let Some(partition_cols) = partition_by {
+        builder = builder.partition_by(
+            partition_cols
+                .into_iter()
+                .map(|col| col.clone().into())
+                .collect(),
+        );
+    }
+
+    if let Some(order_by_cols) = order_by {
+        let order_by_cols = to_sort_expressions(order_by_cols);
+        builder = builder.order_by(order_by_cols);
+    }
+
+    if let Some(window_frame) = window_frame {
+        builder = builder.window_frame(window_frame.into());
+    }
+
+    Ok(builder.build().map(|e| e.into())?)
+}
+
+#[pyfunction]
+#[pyo3(signature = (arg, shift_offset, default_value=None, partition_by=None, order_by=None))]
+pub fn lead(
+    arg: PyExpr,
+    shift_offset: i64,
+    default_value: Option<PyScalarValue>,
+    partition_by: Option<Vec<PyExpr>>,
+    order_by: Option<Vec<PySortExpr>>,
+) -> PyDataFusionResult<PyExpr> {
+    let default_value = default_value.map(|v| v.into());
+    let window_fn = functions_window::expr_fn::lead(arg.expr, Some(shift_offset), default_value);
+
+    add_builder_fns_to_window(window_fn, partition_by, None, order_by, None)
+}
+
+#[pyfunction]
+#[pyo3(signature = (arg, shift_offset, default_value=None, partition_by=None, order_by=None))]
+pub fn lag(
+    arg: PyExpr,
+    shift_offset: i64,
+    default_value: Option<PyScalarValue>,
+    partition_by: Option<Vec<PyExpr>>,
+    order_by: Option<Vec<PySortExpr>>,
+) -> PyDataFusionResult<PyExpr> {
+    let default_value = default_value.map(|v| v.into());
+    let window_fn = functions_window::expr_fn::lag(arg.expr, Some(shift_offset), default_value);
+
+    add_builder_fns_to_window(window_fn, partition_by, None, order_by, None)
+}
+
+#[pyfunction]
+#[pyo3(signature = (partition_by=None, order_by=None))]
+pub fn row_number(
+    partition_by: Option<Vec<PyExpr>>,
+    order_by: Option<Vec<PySortExpr>>,
+) -> PyDataFusionResult<PyExpr> {
+    let window_fn = functions_window::expr_fn::row_number();
+
+    add_builder_fns_to_window(window_fn, partition_by, None, order_by, None)
+}
+
+#[pyfunction]
+#[pyo3(signature = (partition_by=None, order_by=None))]
+pub fn rank(
+    partition_by: Option<Vec<PyExpr>>,
+    order_by: Option<Vec<PySortExpr>>,
+) -> PyDataFusionResult<PyExpr> {
+    let window_fn = functions_window::expr_fn::rank();
+
+    add_builder_fns_to_window(window_fn, partition_by, None, order_by, None)
+}
+
+#[pyfunction]
+#[pyo3(signature = (partition_by=None, order_by=None))]
+pub fn dense_rank(
+    partition_by: Option<Vec<PyExpr>>,
+    order_by: Option<Vec<PySortExpr>>,
+) -> PyDataFusionResult<PyExpr> {
+    let window_fn = functions_window::expr_fn::dense_rank();
+
+    add_builder_fns_to_window(window_fn, partition_by, None, order_by, None)
+}
+
+#[pyfunction]
+#[pyo3(signature = (partition_by=None, order_by=None))]
+pub fn percent_rank(
+    partition_by: Option<Vec<PyExpr>>,
+    order_by: Option<Vec<PySortExpr>>,
+) -> PyDataFusionResult<PyExpr> {
+    let window_fn = functions_window::expr_fn::percent_rank();
+
+    add_builder_fns_to_window(window_fn, partition_by, None, order_by, None)
+}
+
+#[pyfunction]
+#[pyo3(signature = (partition_by=None, order_by=None))]
+pub fn cume_dist(
+    partition_by: Option<Vec<PyExpr>>,
+    order_by: Option<Vec<PySortExpr>>,
+) -> PyDataFusionResult<PyExpr> {
+    let window_fn = functions_window::expr_fn::cume_dist();
+
+    add_builder_fns_to_window(window_fn, partition_by, None, order_by, None)
+}
+
+#[pyfunction]
+#[pyo3(signature = (arg, partition_by=None, order_by=None))]
+pub fn ntile(
+    arg: PyExpr,
+    partition_by: Option<Vec<PyExpr>>,
+    order_by: Option<Vec<PySortExpr>>,
+) -> PyDataFusionResult<PyExpr> {
+    let window_fn = functions_window::expr_fn::ntile(arg.into());
+
+    add_builder_fns_to_window(window_fn, partition_by, None, order_by, None)
+}
+
+pub(crate) fn init_module(m: &Bound<'_, PyModule>) -> PyResult<()> {
+    m.add_wrapped(wrap_pyfunction!(abs))?;
+    m.add_wrapped(wrap_pyfunction!(acos))?;
+    m.add_wrapped(wrap_pyfunction!(acosh))?;
+    m.add_wrapped(wrap_pyfunction!(approx_distinct))?;
+    m.add_wrapped(wrap_pyfunction!(alias))?;
+    m.add_wrapped(wrap_pyfunction!(approx_median))?;
+    m.add_wrapped(wrap_pyfunction!(approx_percentile_cont))?;
+    m.add_wrapped(wrap_pyfunction!(approx_percentile_cont_with_weight))?;
+    m.add_wrapped(wrap_pyfunction!(percentile_cont))?;
+    m.add_wrapped(wrap_pyfunction!(range))?;
+
m.add_wrapped(wrap_pyfunction!(array_agg))?; + m.add_wrapped(wrap_pyfunction!(arrow_typeof))?; + m.add_wrapped(wrap_pyfunction!(arrow_cast))?; + m.add_wrapped(wrap_pyfunction!(arrow_metadata))?; + m.add_wrapped(wrap_pyfunction!(ascii))?; + m.add_wrapped(wrap_pyfunction!(asin))?; + m.add_wrapped(wrap_pyfunction!(asinh))?; + m.add_wrapped(wrap_pyfunction!(atan))?; + m.add_wrapped(wrap_pyfunction!(atanh))?; + m.add_wrapped(wrap_pyfunction!(atan2))?; + m.add_wrapped(wrap_pyfunction!(avg))?; + m.add_wrapped(wrap_pyfunction!(bit_length))?; + m.add_wrapped(wrap_pyfunction!(btrim))?; + m.add_wrapped(wrap_pyfunction!(cbrt))?; + m.add_wrapped(wrap_pyfunction!(ceil))?; + m.add_wrapped(wrap_pyfunction!(character_length))?; + m.add_wrapped(wrap_pyfunction!(chr))?; + m.add_wrapped(wrap_pyfunction!(char_length))?; + m.add_wrapped(wrap_pyfunction!(coalesce))?; + m.add_wrapped(wrap_pyfunction!(case))?; + m.add_wrapped(wrap_pyfunction!(when))?; + m.add_wrapped(wrap_pyfunction!(col))?; + m.add_wrapped(wrap_pyfunction!(concat_ws))?; + m.add_wrapped(wrap_pyfunction!(concat))?; + m.add_wrapped(wrap_pyfunction!(contains))?; + m.add_wrapped(wrap_pyfunction!(corr))?; + m.add_wrapped(wrap_pyfunction!(cos))?; + m.add_wrapped(wrap_pyfunction!(cosh))?; + m.add_wrapped(wrap_pyfunction!(cot))?; + m.add_wrapped(wrap_pyfunction!(count))?; + m.add_wrapped(wrap_pyfunction!(covar_pop))?; + m.add_wrapped(wrap_pyfunction!(covar_samp))?; + m.add_wrapped(wrap_pyfunction!(current_date))?; + m.add_wrapped(wrap_pyfunction!(current_time))?; + m.add_wrapped(wrap_pyfunction!(degrees))?; + m.add_wrapped(wrap_pyfunction!(date_bin))?; + m.add_wrapped(wrap_pyfunction!(date_part))?; + m.add_wrapped(wrap_pyfunction!(date_trunc))?; + m.add_wrapped(wrap_pyfunction!(make_date))?; + m.add_wrapped(wrap_pyfunction!(make_time))?; + m.add_wrapped(wrap_pyfunction!(digest))?; + m.add_wrapped(wrap_pyfunction!(ends_with))?; + m.add_wrapped(wrap_pyfunction!(exp))?; + m.add_wrapped(wrap_pyfunction!(factorial))?; + 
m.add_wrapped(wrap_pyfunction!(floor))?; + m.add_wrapped(wrap_pyfunction!(from_unixtime))?; + m.add_wrapped(wrap_pyfunction!(gcd))?; + m.add_wrapped(wrap_pyfunction!(greatest))?; + m.add_wrapped(wrap_pyfunction!(grouping))?; + m.add_wrapped(wrap_pyfunction!(in_list))?; + m.add_wrapped(wrap_pyfunction!(initcap))?; + m.add_wrapped(wrap_pyfunction!(isnan))?; + m.add_wrapped(wrap_pyfunction!(iszero))?; + m.add_wrapped(wrap_pyfunction!(levenshtein))?; + m.add_wrapped(wrap_pyfunction!(lcm))?; + m.add_wrapped(wrap_pyfunction!(least))?; + m.add_wrapped(wrap_pyfunction!(left))?; + m.add_wrapped(wrap_pyfunction!(length))?; + m.add_wrapped(wrap_pyfunction!(ln))?; + m.add_wrapped(wrap_pyfunction!(self::log))?; + m.add_wrapped(wrap_pyfunction!(log10))?; + m.add_wrapped(wrap_pyfunction!(log2))?; + m.add_wrapped(wrap_pyfunction!(lower))?; + m.add_wrapped(wrap_pyfunction!(lpad))?; + m.add_wrapped(wrap_pyfunction!(ltrim))?; + m.add_wrapped(wrap_pyfunction!(max))?; + m.add_wrapped(wrap_pyfunction!(make_array))?; + m.add_wrapped(wrap_pyfunction!(md5))?; + m.add_wrapped(wrap_pyfunction!(median))?; + m.add_wrapped(wrap_pyfunction!(min))?; + m.add_wrapped(wrap_pyfunction!(named_struct))?; + m.add_wrapped(wrap_pyfunction!(nanvl))?; + m.add_wrapped(wrap_pyfunction!(nvl))?; + m.add_wrapped(wrap_pyfunction!(nvl2))?; + m.add_wrapped(wrap_pyfunction!(now))?; + m.add_wrapped(wrap_pyfunction!(nullif))?; + m.add_wrapped(wrap_pyfunction!(octet_length))?; + m.add_wrapped(wrap_pyfunction!(order_by))?; + m.add_wrapped(wrap_pyfunction!(overlay))?; + m.add_wrapped(wrap_pyfunction!(pi))?; + m.add_wrapped(wrap_pyfunction!(power))?; + m.add_wrapped(wrap_pyfunction!(radians))?; + m.add_wrapped(wrap_pyfunction!(random))?; + m.add_wrapped(wrap_pyfunction!(regexp_count))?; + m.add_wrapped(wrap_pyfunction!(regexp_instr))?; + m.add_wrapped(wrap_pyfunction!(regexp_like))?; + m.add_wrapped(wrap_pyfunction!(regexp_match))?; + m.add_wrapped(wrap_pyfunction!(regexp_replace))?; + 
m.add_wrapped(wrap_pyfunction!(repeat))?; + m.add_wrapped(wrap_pyfunction!(replace))?; + m.add_wrapped(wrap_pyfunction!(reverse))?; + m.add_wrapped(wrap_pyfunction!(right))?; + m.add_wrapped(wrap_pyfunction!(round))?; + m.add_wrapped(wrap_pyfunction!(rpad))?; + m.add_wrapped(wrap_pyfunction!(rtrim))?; + m.add_wrapped(wrap_pyfunction!(sha224))?; + m.add_wrapped(wrap_pyfunction!(sha256))?; + m.add_wrapped(wrap_pyfunction!(sha384))?; + m.add_wrapped(wrap_pyfunction!(sha512))?; + m.add_wrapped(wrap_pyfunction!(signum))?; + m.add_wrapped(wrap_pyfunction!(sin))?; + m.add_wrapped(wrap_pyfunction!(sinh))?; + m.add_wrapped(wrap_pyfunction!(split_part))?; + m.add_wrapped(wrap_pyfunction!(sqrt))?; + m.add_wrapped(wrap_pyfunction!(starts_with))?; + m.add_wrapped(wrap_pyfunction!(stddev))?; + m.add_wrapped(wrap_pyfunction!(stddev_pop))?; + m.add_wrapped(wrap_pyfunction!(string_agg))?; + m.add_wrapped(wrap_pyfunction!(strpos))?; + m.add_wrapped(wrap_pyfunction!(r#struct))?; // Use raw identifier since struct is a keyword + m.add_wrapped(wrap_pyfunction!(substr))?; + m.add_wrapped(wrap_pyfunction!(substr_index))?; + m.add_wrapped(wrap_pyfunction!(substring))?; + m.add_wrapped(wrap_pyfunction!(find_in_set))?; + m.add_wrapped(wrap_pyfunction!(sum))?; + m.add_wrapped(wrap_pyfunction!(tan))?; + m.add_wrapped(wrap_pyfunction!(tanh))?; + m.add_wrapped(wrap_pyfunction!(to_hex))?; + m.add_wrapped(wrap_pyfunction!(to_char))?; + m.add_wrapped(wrap_pyfunction!(to_date))?; + m.add_wrapped(wrap_pyfunction!(to_local_time))?; + m.add_wrapped(wrap_pyfunction!(to_time))?; + m.add_wrapped(wrap_pyfunction!(to_timestamp))?; + m.add_wrapped(wrap_pyfunction!(to_timestamp_millis))?; + m.add_wrapped(wrap_pyfunction!(to_timestamp_nanos))?; + m.add_wrapped(wrap_pyfunction!(to_timestamp_micros))?; + m.add_wrapped(wrap_pyfunction!(to_timestamp_seconds))?; + m.add_wrapped(wrap_pyfunction!(to_unixtime))?; + m.add_wrapped(wrap_pyfunction!(translate))?; + m.add_wrapped(wrap_pyfunction!(trim))?; + 
m.add_wrapped(wrap_pyfunction!(trunc))?; + m.add_wrapped(wrap_pyfunction!(upper))?; + m.add_wrapped(wrap_pyfunction!(get_field))?; + m.add_wrapped(wrap_pyfunction!(union_extract))?; + m.add_wrapped(wrap_pyfunction!(union_tag))?; + m.add_wrapped(wrap_pyfunction!(version))?; + m.add_wrapped(wrap_pyfunction!(self::uuid))?; // Use self to avoid name collision + m.add_wrapped(wrap_pyfunction!(var_pop))?; + m.add_wrapped(wrap_pyfunction!(var_sample))?; + m.add_wrapped(wrap_pyfunction!(regr_avgx))?; + m.add_wrapped(wrap_pyfunction!(regr_avgy))?; + m.add_wrapped(wrap_pyfunction!(regr_count))?; + m.add_wrapped(wrap_pyfunction!(regr_intercept))?; + m.add_wrapped(wrap_pyfunction!(regr_r2))?; + m.add_wrapped(wrap_pyfunction!(regr_slope))?; + m.add_wrapped(wrap_pyfunction!(regr_sxx))?; + m.add_wrapped(wrap_pyfunction!(regr_sxy))?; + m.add_wrapped(wrap_pyfunction!(regr_syy))?; + m.add_wrapped(wrap_pyfunction!(first_value))?; + m.add_wrapped(wrap_pyfunction!(last_value))?; + m.add_wrapped(wrap_pyfunction!(nth_value))?; + m.add_wrapped(wrap_pyfunction!(bit_and))?; + m.add_wrapped(wrap_pyfunction!(bit_or))?; + m.add_wrapped(wrap_pyfunction!(bit_xor))?; + m.add_wrapped(wrap_pyfunction!(bool_and))?; + m.add_wrapped(wrap_pyfunction!(bool_or))?; + + //Binary String Functions + m.add_wrapped(wrap_pyfunction!(encode))?; + m.add_wrapped(wrap_pyfunction!(decode))?; + + // Array Functions + m.add_wrapped(wrap_pyfunction!(array_append))?; + m.add_wrapped(wrap_pyfunction!(array_concat))?; + m.add_wrapped(wrap_pyfunction!(array_cat))?; + m.add_wrapped(wrap_pyfunction!(array_dims))?; + m.add_wrapped(wrap_pyfunction!(array_distinct))?; + m.add_wrapped(wrap_pyfunction!(array_element))?; + m.add_wrapped(wrap_pyfunction!(array_empty))?; + m.add_wrapped(wrap_pyfunction!(array_length))?; + m.add_wrapped(wrap_pyfunction!(array_has))?; + m.add_wrapped(wrap_pyfunction!(array_has_all))?; + m.add_wrapped(wrap_pyfunction!(array_has_any))?; + m.add_wrapped(wrap_pyfunction!(array_position))?; + 
m.add_wrapped(wrap_pyfunction!(array_positions))?;
+    m.add_wrapped(wrap_pyfunction!(array_to_string))?;
+    m.add_wrapped(wrap_pyfunction!(array_intersect))?;
+    m.add_wrapped(wrap_pyfunction!(array_union))?;
+    m.add_wrapped(wrap_pyfunction!(array_except))?;
+    m.add_wrapped(wrap_pyfunction!(array_resize))?;
+    m.add_wrapped(wrap_pyfunction!(array_ndims))?;
+    m.add_wrapped(wrap_pyfunction!(array_prepend))?;
+    m.add_wrapped(wrap_pyfunction!(array_pop_back))?;
+    m.add_wrapped(wrap_pyfunction!(array_pop_front))?;
+    m.add_wrapped(wrap_pyfunction!(array_remove))?;
+    m.add_wrapped(wrap_pyfunction!(array_remove_n))?;
+    m.add_wrapped(wrap_pyfunction!(array_remove_all))?;
+    m.add_wrapped(wrap_pyfunction!(array_repeat))?;
+    m.add_wrapped(wrap_pyfunction!(array_replace))?;
+    m.add_wrapped(wrap_pyfunction!(array_replace_n))?;
+    m.add_wrapped(wrap_pyfunction!(array_replace_all))?;
+    m.add_wrapped(wrap_pyfunction!(array_sort))?;
+    m.add_wrapped(wrap_pyfunction!(array_slice))?;
+    m.add_wrapped(wrap_pyfunction!(array_any_value))?;
+    m.add_wrapped(wrap_pyfunction!(array_distance))?;
+    m.add_wrapped(wrap_pyfunction!(array_max))?;
+    m.add_wrapped(wrap_pyfunction!(array_min))?;
+    m.add_wrapped(wrap_pyfunction!(array_reverse))?;
+    m.add_wrapped(wrap_pyfunction!(arrays_zip))?;
+    m.add_wrapped(wrap_pyfunction!(string_to_array))?;
+    m.add_wrapped(wrap_pyfunction!(gen_series))?;
+    m.add_wrapped(wrap_pyfunction!(flatten))?;
+    m.add_wrapped(wrap_pyfunction!(cardinality))?;
+
+    // Map Functions
+    m.add_wrapped(wrap_pyfunction!(make_map))?;
+    m.add_wrapped(wrap_pyfunction!(map_keys))?;
+    m.add_wrapped(wrap_pyfunction!(map_values))?;
+    m.add_wrapped(wrap_pyfunction!(map_extract))?;
+    m.add_wrapped(wrap_pyfunction!(map_entries))?;
+
+    // Window Functions
+    m.add_wrapped(wrap_pyfunction!(lead))?;
+    m.add_wrapped(wrap_pyfunction!(lag))?;
+    m.add_wrapped(wrap_pyfunction!(rank))?;
+    m.add_wrapped(wrap_pyfunction!(row_number))?;
+    m.add_wrapped(wrap_pyfunction!(dense_rank))?;
+
m.add_wrapped(wrap_pyfunction!(percent_rank))?;
+    m.add_wrapped(wrap_pyfunction!(cume_dist))?;
+    m.add_wrapped(wrap_pyfunction!(ntile))?;
+
+    Ok(())
+}
diff --git a/src/lib.rs b/crates/core/src/lib.rs
similarity index 59%
rename from src/lib.rs
rename to crates/core/src/lib.rs
index 4a6574c16..fc2d006d3 100644
--- a/src/lib.rs
+++ b/crates/core/src/lib.rs
@@ -15,45 +15,53 @@
 // specific language governing permissions and limitations
 // under the License.
 
+// Re-export Apache Arrow DataFusion dependencies
+pub use datafusion::{
+    self, common as datafusion_common, logical_expr as datafusion_expr, optimizer,
+    sql as datafusion_sql,
+};
+#[cfg(feature = "substrait")]
+pub use datafusion_substrait;
 #[cfg(feature = "mimalloc")]
 use mimalloc::MiMalloc;
 use pyo3::prelude::*;
 
-// Re-export Apache Arrow DataFusion dependencies
-pub use datafusion;
-pub use datafusion_common;
-pub use datafusion_expr;
-pub use datafusion_optimizer;
-pub use datafusion_sql;
-pub use datafusion_substrait;
-
 #[allow(clippy::borrow_deref_ref)]
 pub mod catalog;
 pub mod common;
+
 #[allow(clippy::borrow_deref_ref)]
 mod config;
 #[allow(clippy::borrow_deref_ref)]
-mod context;
+pub mod context;
 #[allow(clippy::borrow_deref_ref)]
-mod dataframe;
+pub mod dataframe;
 mod dataset;
 mod dataset_exec;
 pub mod errors;
 #[allow(clippy::borrow_deref_ref)]
-mod expr;
+pub mod expr;
 #[allow(clippy::borrow_deref_ref)]
 mod functions;
+mod options;
 pub mod physical_plan;
 mod pyarrow_filter_expression;
+pub mod pyarrow_util;
 mod record_batch;
 pub mod sql;
 pub mod store;
+pub mod table;
+pub mod unparser;
+
+mod array;
+#[cfg(feature = "substrait")]
 pub mod substrait;
 #[allow(clippy::borrow_deref_ref)]
 mod udaf;
 #[allow(clippy::borrow_deref_ref)]
 mod udf;
-pub mod utils;
+pub mod udtf;
+mod udwf;
 
 #[cfg(feature = "mimalloc")]
 #[global_allocator]
 static GLOBAL: MiMalloc = MiMalloc;
@@ -64,44 +72,72 @@
 
 /// The higher-level public API is defined in pure python files under the
 /// datafusion directory.
#[pymodule] -fn _internal(py: Python, m: &PyModule) -> PyResult<()> { +fn _internal(py: Python, m: Bound<'_, PyModule>) -> PyResult<()> { + // Initialize logging + pyo3_log::init(); + // Register the python classes - m.add_class::()?; - m.add_class::()?; - m.add_class::()?; - m.add_class::()?; + m.add_class::()?; m.add_class::()?; m.add_class::()?; + m.add_class::()?; m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + m.add_class::()?; m.add_class::()?; m.add_class::()?; + m.add_class::()?; + m.add_class::()?; m.add_class::()?; m.add_class::()?; m.add_class::()?; + m.add_class::()?; + m.add_class::()?; + + let catalog = PyModule::new(py, "catalog")?; + catalog::init_module(&catalog)?; + m.add_submodule(&catalog)?; // Register `common` as a submodule. Matching `datafusion-common` https://docs.rs/datafusion-common/latest/datafusion_common/ let common = PyModule::new(py, "common")?; - common::init_module(common)?; - m.add_submodule(common)?; + common::init_module(&common)?; + m.add_submodule(&common)?; // Register `expr` as a submodule. 
Matching `datafusion-expr` https://docs.rs/datafusion-expr/latest/datafusion_expr/ let expr = PyModule::new(py, "expr")?; - expr::init_module(expr)?; - m.add_submodule(expr)?; + expr::init_module(&expr)?; + m.add_submodule(&expr)?; + + let unparser = PyModule::new(py, "unparser")?; + unparser::init_module(&unparser)?; + m.add_submodule(&unparser)?; // Register the functions as a submodule let funcs = PyModule::new(py, "functions")?; - functions::init_module(funcs)?; - m.add_submodule(funcs)?; + functions::init_module(&funcs)?; + m.add_submodule(&funcs)?; let store = PyModule::new(py, "object_store")?; - store::init_module(store)?; - m.add_submodule(store)?; + store::init_module(&store)?; + m.add_submodule(&store)?; + + let options = PyModule::new(py, "options")?; + options::init_module(&options)?; + m.add_submodule(&options)?; // Register substrait as a submodule - let substrait = PyModule::new(py, "substrait")?; - substrait::init_module(substrait)?; - m.add_submodule(substrait)?; + #[cfg(feature = "substrait")] + setup_substrait_module(py, &m)?; Ok(()) } + +#[cfg(feature = "substrait")] +fn setup_substrait_module(py: Python, m: &Bound<'_, PyModule>) -> PyResult<()> { + let substrait = PyModule::new(py, "substrait")?; + substrait::init_module(&substrait)?; + m.add_submodule(&substrait)?; + Ok(()) +} diff --git a/crates/core/src/options.rs b/crates/core/src/options.rs new file mode 100644 index 000000000..6b6037695 --- /dev/null +++ b/crates/core/src/options.rs @@ -0,0 +1,159 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. 
You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use arrow::datatypes::{DataType, Schema};
+use arrow::pyarrow::PyArrowType;
+use datafusion::logical_expr::SortExpr;
+use datafusion::prelude::CsvReadOptions;
+use pyo3::prelude::{PyModule, PyModuleMethods};
+use pyo3::{Bound, PyResult, pyclass, pymethods};
+
+use crate::context::parse_file_compression_type;
+use crate::errors::PyDataFusionError;
+use crate::expr::sort_expr::PySortExpr;
+
+/// Options for reading CSV files
+#[pyclass(name = "CsvReadOptions", module = "datafusion.options", frozen)]
+pub struct PyCsvReadOptions {
+    pub has_header: bool,
+    pub delimiter: u8,
+    pub quote: u8,
+    pub terminator: Option<u8>,
+    pub escape: Option<u8>,
+    pub comment: Option<u8>,
+    pub newlines_in_values: bool,
+    pub schema: Option<PyArrowType<Schema>>,
+    pub schema_infer_max_records: usize,
+    pub file_extension: String,
+    pub table_partition_cols: Vec<(String, PyArrowType<DataType>)>,
+    pub file_compression_type: String,
+    pub file_sort_order: Vec<Vec<PySortExpr>>,
+    pub null_regex: Option<String>,
+    pub truncated_rows: bool,
+}
+
+#[pymethods]
+impl PyCsvReadOptions {
+    #[allow(clippy::too_many_arguments)]
+    #[pyo3(signature = (
+        has_header=true,
+        delimiter=b',',
+        quote=b'"',
+        terminator=None,
+        escape=None,
+        comment=None,
+        newlines_in_values=false,
+        schema=None,
+        schema_infer_max_records=1000,
+        file_extension=".csv".to_string(),
+        table_partition_cols=vec![],
+        file_compression_type="".to_string(),
+        file_sort_order=vec![],
+        null_regex=None,
+        truncated_rows=false
+    ))]
+    #[new]
+    fn new(
+        has_header: bool,
+        delimiter: u8,
+        quote: u8,
+        terminator: Option<u8>,
+        escape: Option<u8>,
+        comment: Option<u8>,
+        newlines_in_values: bool,
+        schema: Option<PyArrowType<Schema>>,
+        schema_infer_max_records: usize,
+        file_extension: String,
+        table_partition_cols: Vec<(String, PyArrowType<DataType>)>,
+        file_compression_type: String,
+        file_sort_order: Vec<Vec<PySortExpr>>,
+        null_regex: Option<String>,
+        truncated_rows: bool,
+    ) -> Self {
+        Self {
+            has_header,
+            delimiter,
+            quote,
+            terminator,
+            escape,
+            comment,
+            newlines_in_values,
+            schema,
+            schema_infer_max_records,
+            file_extension,
+            table_partition_cols,
+            file_compression_type,
+            file_sort_order,
+            null_regex,
+            truncated_rows,
+        }
+    }
+}
+
+impl<'a> TryFrom<&'a PyCsvReadOptions> for CsvReadOptions<'a> {
+    type Error = PyDataFusionError;
+
+    fn try_from(value: &'a PyCsvReadOptions) -> Result<CsvReadOptions<'a>, Self::Error> {
+        let partition_cols: Vec<(String, DataType)> = value
+            .table_partition_cols
+            .iter()
+            .map(|(name, dtype)| (name.clone(), dtype.0.clone()))
+            .collect();
+
+        let compression = parse_file_compression_type(Some(value.file_compression_type.clone()))?;
+
+        let sort_order: Vec<Vec<SortExpr>> = value
+            .file_sort_order
+            .iter()
+            .map(|inner| {
+                inner
+                    .iter()
+                    .map(|sort_expr| sort_expr.sort.clone())
+                    .collect()
+            })
+            .collect();
+
+        // Explicit struct initialization to catch upstream changes
+        let mut options = CsvReadOptions {
+            has_header: value.has_header,
+            delimiter: value.delimiter,
+            quote: value.quote,
+            terminator: value.terminator,
+            escape: value.escape,
+            comment: value.comment,
+            newlines_in_values: value.newlines_in_values,
+            schema: None, // Will be set separately due to lifetime constraints
+            schema_infer_max_records: value.schema_infer_max_records,
+            file_extension: value.file_extension.as_str(),
+            table_partition_cols: partition_cols,
+            file_compression_type: compression,
+            file_sort_order: sort_order,
+            null_regex: value.null_regex.clone(),
+            truncated_rows: value.truncated_rows,
+        };
+
+        // Set schema separately to handle the lifetime
+        options.schema = value.schema.as_ref().map(|s| &s.0);
+
+        Ok(options)
+    }
+}
+
+pub(crate) fn init_module(m: &Bound<'_, PyModule>) -> PyResult<()> {
+
    m.add_class::<PyCsvReadOptions>()?;
+
+    Ok(())
+}
diff --git a/src/physical_plan.rs b/crates/core/src/physical_plan.rs
similarity index 50%
rename from src/physical_plan.rs
rename to crates/core/src/physical_plan.rs
index 340d527fa..8674a8b55 100644
--- a/src/physical_plan.rs
+++ b/crates/core/src/physical_plan.rs
@@ -15,12 +15,25 @@
 // specific language governing permissions and limitations
 // under the License.
 
-use datafusion::physical_plan::{displayable, ExecutionPlan};
 use std::sync::Arc;
 
+use datafusion::physical_plan::{ExecutionPlan, ExecutionPlanProperties, displayable};
+use datafusion_proto::physical_plan::{AsExecutionPlan, DefaultPhysicalExtensionCodec};
+use prost::Message;
+use pyo3::exceptions::PyRuntimeError;
 use pyo3::prelude::*;
+use pyo3::types::PyBytes;
 
-#[pyclass(name = "ExecutionPlan", module = "datafusion", subclass)]
+use crate::context::PySessionContext;
+use crate::errors::PyDataFusionResult;
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "ExecutionPlan",
+    module = "datafusion",
+    subclass
+)]
 #[derive(Debug, Clone)]
 pub struct PyExecutionPlan {
     pub plan: Arc<dyn ExecutionPlan>,
 }
@@ -40,7 +53,7 @@ impl PyExecutionPlan {
         self.plan
             .children()
             .iter()
-            .map(|p| p.to_owned().into())
+            .map(|&p| p.to_owned().into())
            .collect()
     }
 
@@ -51,7 +64,45 @@
     pub fn display_indent(&self) -> String {
         let d = displayable(self.plan.as_ref());
-        format!("{}", d.indent())
+        format!("{}", d.indent(false))
+    }
+
+    pub fn to_proto<'py>(&'py self, py: Python<'py>) -> PyDataFusionResult<Bound<'py, PyBytes>> {
+        let codec = DefaultPhysicalExtensionCodec {};
+        let proto = datafusion_proto::protobuf::PhysicalPlanNode::try_from_physical_plan(
+            self.plan.clone(),
+            &codec,
+        )?;
+
+        let bytes = proto.encode_to_vec();
+        Ok(PyBytes::new(py, &bytes))
+    }
+
+    #[staticmethod]
+    pub fn from_proto(
+        ctx: PySessionContext,
+        proto_msg: Bound<'_, PyBytes>,
+    ) -> PyDataFusionResult<Self> {
+        let bytes: &[u8] = proto_msg.extract().map_err(Into::<PyErr>::into)?;
+        let proto_plan =
            datafusion_proto::protobuf::PhysicalPlanNode::decode(bytes).map_err(|e| {
+                PyRuntimeError::new_err(format!(
+                    "Unable to decode logical node from serialized bytes: {e}"
+                ))
+            })?;
+
+        let codec = DefaultPhysicalExtensionCodec {};
+        let plan = proto_plan.try_into_physical_plan(ctx.ctx.task_ctx().as_ref(), &codec)?;
+        Ok(Self::new(plan))
+    }
+
+    fn __repr__(&self) -> String {
+        self.display_indent()
+    }
+
+    #[getter]
+    pub fn partition_count(&self) -> usize {
+        self.plan.output_partitioning().partition_count()
     }
 }
diff --git a/src/pyarrow_filter_expression.rs b/crates/core/src/pyarrow_filter_expression.rs
similarity index 59%
rename from src/pyarrow_filter_expression.rs
rename to crates/core/src/pyarrow_filter_expression.rs
index d35101237..e3b4b6009 100644
--- a/src/pyarrow_filter_expression.rs
+++ b/crates/core/src/pyarrow_filter_expression.rs
@@ -15,26 +15,27 @@
 // specific language governing permissions and limitations
 // under the License.
 
-/// Converts a Datafusion logical plan expression (Expr) into a PyArrow compute expression
-use pyo3::prelude::*;
-
 use std::convert::TryFrom;
 use std::result::Result;
 
-use datafusion_common::{Column, ScalarValue};
-use datafusion_expr::{Between, BinaryExpr, Expr, Operator};
+use datafusion::common::{Column, ScalarValue};
+use datafusion::logical_expr::expr::InList;
+use datafusion::logical_expr::{Between, BinaryExpr, Expr, Operator};
+/// Converts a Datafusion logical plan expression (Expr) into a PyArrow compute expression
+use pyo3::{IntoPyObjectExt, prelude::*};
 
-use crate::errors::DataFusionError;
+use crate::errors::{PyDataFusionError, PyDataFusionResult};
+use crate::pyarrow_util::scalar_to_pyarrow;
 
-#[derive(Debug, Clone)]
+#[derive(Debug)]
 #[repr(transparent)]
-pub(crate) struct PyArrowFilterExpression(PyObject);
+pub(crate) struct PyArrowFilterExpression(Py<PyAny>);
 
 fn operator_to_py<'py>(
     operator: &Operator,
-    op: &'py PyModule,
-) -> Result<&'py PyAny, DataFusionError> {
-    let py_op: &PyAny = match operator {
+    op:
&Bound<'py, PyModule>,
+) -> PyDataFusionResult<Bound<'py, PyAny>> {
+    let py_op: Bound<'_, PyAny> = match operator {
         Operator::Eq => op.getattr("eq")?,
         Operator::NotEq => op.getattr("ne")?,
         Operator::Lt => op.getattr("lt")?,
@@ -44,81 +45,69 @@ fn operator_to_py<'py>(
         Operator::And => op.getattr("and_")?,
         Operator::Or => op.getattr("or_")?,
         _ => {
-            return Err(DataFusionError::Common(format!(
+            return Err(PyDataFusionError::Common(format!(
                 "Unsupported operator {operator:?}"
-            )))
+            )));
         }
     };
     Ok(py_op)
 }
 
-fn extract_scalar_list(exprs: &[Expr], py: Python) -> Result<Vec<PyObject>, DataFusionError> {
-    let ret: Result<Vec<PyObject>, DataFusionError> = exprs
+fn extract_scalar_list<'py>(
+    exprs: &[Expr],
+    py: Python<'py>,
+) -> PyDataFusionResult<Vec<Bound<'py, PyAny>>> {
+    exprs
         .iter()
         .map(|expr| match expr {
-            Expr::Literal(v) => match v {
-                ScalarValue::Boolean(Some(b)) => Ok(b.into_py(py)),
-                ScalarValue::Int8(Some(i)) => Ok(i.into_py(py)),
-                ScalarValue::Int16(Some(i)) => Ok(i.into_py(py)),
-                ScalarValue::Int32(Some(i)) => Ok(i.into_py(py)),
-                ScalarValue::Int64(Some(i)) => Ok(i.into_py(py)),
-                ScalarValue::UInt8(Some(i)) => Ok(i.into_py(py)),
-                ScalarValue::UInt16(Some(i)) => Ok(i.into_py(py)),
-                ScalarValue::UInt32(Some(i)) => Ok(i.into_py(py)),
-                ScalarValue::UInt64(Some(i)) => Ok(i.into_py(py)),
-                ScalarValue::Float32(Some(f)) => Ok(f.into_py(py)),
-                ScalarValue::Float64(Some(f)) => Ok(f.into_py(py)),
-                ScalarValue::Utf8(Some(s)) => Ok(s.into_py(py)),
-                _ => Err(DataFusionError::Common(format!(
+            // TODO: should we also leverage `ScalarValue::to_pyarrow` here?
+            Expr::Literal(v, _) => match v {
+                // The unwraps here are for infallible conversions
+                ScalarValue::Boolean(Some(b)) => Ok(b.into_bound_py_any(py)?),
+                ScalarValue::Int8(Some(i)) => Ok(i.into_bound_py_any(py)?),
+                ScalarValue::Int16(Some(i)) => Ok(i.into_bound_py_any(py)?),
+                ScalarValue::Int32(Some(i)) => Ok(i.into_bound_py_any(py)?),
+                ScalarValue::Int64(Some(i)) => Ok(i.into_bound_py_any(py)?),
+                ScalarValue::UInt8(Some(i)) => Ok(i.into_bound_py_any(py)?),
+                ScalarValue::UInt16(Some(i)) => Ok(i.into_bound_py_any(py)?),
+                ScalarValue::UInt32(Some(i)) => Ok(i.into_bound_py_any(py)?),
+                ScalarValue::UInt64(Some(i)) => Ok(i.into_bound_py_any(py)?),
+                ScalarValue::Float32(Some(f)) => Ok(f.into_bound_py_any(py)?),
+                ScalarValue::Float64(Some(f)) => Ok(f.into_bound_py_any(py)?),
+                ScalarValue::Utf8(Some(s)) => Ok(s.into_bound_py_any(py)?),
+                _ => Err(PyDataFusionError::Common(format!(
                     "PyArrow can't handle ScalarValue: {v:?}"
                 ))),
             },
-            _ => Err(DataFusionError::Common(format!(
+            _ => Err(PyDataFusionError::Common(format!(
                 "Only a list of Literals are supported got {expr:?}"
             ))),
         })
-        .collect();
-    ret
+        .collect()
 }
 
 impl PyArrowFilterExpression {
-    pub fn inner(&self) -> &PyObject {
+    pub fn inner(&self) -> &Py<PyAny> {
         &self.0
     }
 }
 
 impl TryFrom<&Expr> for PyArrowFilterExpression {
-    type Error = DataFusionError;
+    type Error = PyDataFusionError;
 
     // Converts a Datafusion filter Expr into an expression string that can be evaluated by Python
     // Note that pyarrow.compute.{field,scalar} are put into Python globals() when evaluated
     // isin, is_null, and is_valid (~is_null) are methods of pyarrow.dataset.Expression
     // https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Expression.html#pyarrow-dataset-expression
     fn try_from(expr: &Expr) -> Result<Self, Self::Error> {
-        Python::with_gil(|py| {
+        Python::attach(|py| {
             let pc = Python::import(py, "pyarrow.compute")?;
             let op_module = Python::import(py, "operator")?;
-            let pc_expr: Result<&PyAny, DataFusionError> = match expr {
+            let pc_expr:
PyDataFusionResult<Bound<'_, PyAny>> = match expr {
                 Expr::Column(Column { name, .. }) => Ok(pc.getattr("field")?.call1((name,))?),
-                Expr::Literal(v) => match v {
-                    ScalarValue::Boolean(Some(b)) => Ok(pc.getattr("scalar")?.call1((*b,))?),
-                    ScalarValue::Int8(Some(i)) => Ok(pc.getattr("scalar")?.call1((*i,))?),
-                    ScalarValue::Int16(Some(i)) => Ok(pc.getattr("scalar")?.call1((*i,))?),
-                    ScalarValue::Int32(Some(i)) => Ok(pc.getattr("scalar")?.call1((*i,))?),
-                    ScalarValue::Int64(Some(i)) => Ok(pc.getattr("scalar")?.call1((*i,))?),
-                    ScalarValue::UInt8(Some(i)) => Ok(pc.getattr("scalar")?.call1((*i,))?),
-                    ScalarValue::UInt16(Some(i)) => Ok(pc.getattr("scalar")?.call1((*i,))?),
-                    ScalarValue::UInt32(Some(i)) => Ok(pc.getattr("scalar")?.call1((*i,))?),
-                    ScalarValue::UInt64(Some(i)) => Ok(pc.getattr("scalar")?.call1((*i,))?),
-                    ScalarValue::Float32(Some(f)) => Ok(pc.getattr("scalar")?.call1((*f,))?),
-                    ScalarValue::Float64(Some(f)) => Ok(pc.getattr("scalar")?.call1((*f,))?),
-                    ScalarValue::Utf8(Some(s)) => Ok(pc.getattr("scalar")?.call1((s,))?),
-                    _ => Err(DataFusionError::Common(format!(
-                        "PyArrow can't handle ScalarValue: {v:?}"
-                    ))),
-                },
+                Expr::Literal(scalar, _) => Ok(scalar_to_pyarrow(scalar, py)?),
                 Expr::BinaryExpr(BinaryExpr { left, op, right }) => {
-                    let operator = operator_to_py(op, op_module)?;
+                    let operator = operator_to_py(op, &op_module)?;
                     let left = PyArrowFilterExpression::try_from(left.as_ref())?.0;
                     let right = PyArrowFilterExpression::try_from(right.as_ref())?.0;
                     Ok(operator.call1((left, right))?)
@@ -131,14 +120,20 @@ impl TryFrom<&Expr> for PyArrowFilterExpression {
                 Expr::IsNotNull(expr) => {
                     let py_expr = PyArrowFilterExpression::try_from(expr.as_ref())?
                         .0
-                        .into_ref(py);
+                        .into_bound(py);
                     Ok(py_expr.call_method0("is_valid")?)
                 }
                 Expr::IsNull(expr) => {
                     let expr = PyArrowFilterExpression::try_from(expr.as_ref())?
                         .0
-                        .into_ref(py);
-                    Ok(expr.call_method1("is_null", (expr,))?)
+                        .into_bound(py);
+
+                    // https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Expression.html#pyarrow.dataset.Expression.is_null
+                    // Whether floating-point NaNs are considered null.
+                    let nan_is_null = false;
+
+                    let res = expr.call_method1("is_null", (nan_is_null,))?;
+                    Ok(res)
                 }
                 Expr::Between(Between {
                     expr,
@@ -161,21 +156,21 @@
                     Ok(if *negated { invert.call1((ret,))? } else { ret })
                 }
 
-                Expr::InList {
+                Expr::InList(InList {
                     expr,
                     list,
                     negated,
-                } => {
+                }) => {
                     let expr = PyArrowFilterExpression::try_from(expr.as_ref())?
                         .0
-                        .into_ref(py);
+                        .into_bound(py);
                     let scalars = extract_scalar_list(list, py)?;
                     let ret = expr.call_method1("isin", (scalars,))?;
                     let invert = op_module.getattr("invert")?;
                     Ok(if *negated { invert.call1((ret,))? } else { ret })
                 }
-                _ => Err(DataFusionError::Common(format!(
+                _ => Err(PyDataFusionError::Common(format!(
                     "Unsupported Datafusion expression {expr:?}"
                 ))),
             };
diff --git a/crates/core/src/pyarrow_util.rs b/crates/core/src/pyarrow_util.rs
new file mode 100644
index 000000000..1401a4938
--- /dev/null
+++ b/crates/core/src/pyarrow_util.rs
@@ -0,0 +1,163 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//!
Conversions between PyArrow and DataFusion types
+
+use std::sync::Arc;
+
+use arrow::array::{Array, ArrayData, ArrayRef, ListArray, make_array};
+use arrow::buffer::OffsetBuffer;
+use arrow::datatypes::Field;
+use arrow::pyarrow::{FromPyArrow, ToPyArrow};
+use datafusion::common::exec_err;
+use datafusion::scalar::ScalarValue;
+use pyo3::types::{PyAnyMethods, PyList};
+use pyo3::{Borrowed, Bound, FromPyObject, PyAny, PyErr, PyResult, Python};
+
+use crate::common::data_type::PyScalarValue;
+use crate::errors::PyDataFusionError;
+
+/// Helper function to turn an Array into a ScalarValue. If ``as_list_array`` is true,
+/// the array will be turned into a ``ListArray``. Otherwise, we extract the first value
+/// from the array.
+fn array_to_scalar_value(array: ArrayRef, as_list_array: bool) -> PyResult<PyScalarValue> {
+    if as_list_array {
+        let field = Arc::new(Field::new_list_field(
+            array.data_type().clone(),
+            array.nulls().is_some(),
+        ));
+        let offsets = OffsetBuffer::from_lengths(vec![array.len()]);
+        let list_array = ListArray::new(field, offsets, array, None);
+        Ok(PyScalarValue(ScalarValue::List(Arc::new(list_array))))
+    } else {
+        let scalar = ScalarValue::try_from_array(&array, 0).map_err(PyDataFusionError::from)?;
+        Ok(PyScalarValue(scalar))
+    }
+}
+
+/// Helper function to take any Python object that contains an Arrow PyCapsule
+/// interface and attempt to extract a scalar value from it. If `as_list_array`
+/// is true, the array will be turned into a ``ListArray``. Otherwise, we extract
+/// the first value from the array.
+fn pyobj_extract_scalar_via_capsule(
+    value: &Bound<'_, PyAny>,
+    as_list_array: bool,
+) -> PyResult<PyScalarValue> {
+    let array_data = ArrayData::from_pyarrow_bound(value)?;
+    let array = make_array(array_data);
+
+    array_to_scalar_value(array, as_list_array)
+}
+
+impl FromPyArrow for PyScalarValue {
+    fn from_pyarrow_bound(value: &Bound<'_, PyAny>) -> PyResult<Self> {
+        let py = value.py();
+        let pyarrow_mod = py.import("pyarrow");
+
+        // Is it a PyArrow object?
+        if let Ok(pa) = pyarrow_mod.as_ref() {
+            let scalar_type = pa.getattr("Scalar")?;
+            if value.is_instance(&scalar_type)? {
+                let typ = value.getattr("type")?;
+
+                // construct pyarrow array from the python value and pyarrow type
+                let factory = py.import("pyarrow")?.getattr("array")?;
+                let args = PyList::new(py, [value])?;
+                let array = factory.call1((args, typ))?;
+
+                return pyobj_extract_scalar_via_capsule(&array, false);
+            }
+
+            let array_type = pa.getattr("Array")?;
+            if value.is_instance(&array_type)? {
+                return pyobj_extract_scalar_via_capsule(value, true);
+            }
+        }
+
+        // Is it a NanoArrow scalar?
+        if let Ok(na) = py.import("nanoarrow") {
+            let scalar_type = py.import("nanoarrow.array")?.getattr("Scalar")?;
+            if value.is_instance(&scalar_type)? {
+                return pyobj_extract_scalar_via_capsule(value, false);
+            }
+            let array_type = na.getattr("Array")?;
+            if value.is_instance(&array_type)? {
+                return pyobj_extract_scalar_via_capsule(value, true);
+            }
+        }
+
+        // Is it a arro3 scalar?
+        if let Ok(arro3) = py.import("arro3").and_then(|arro3| arro3.getattr("core")) {
+            let scalar_type = arro3.getattr("Scalar")?;
+            if value.is_instance(&scalar_type)? {
+                return pyobj_extract_scalar_via_capsule(value, false);
+            }
+            let array_type = arro3.getattr("Array")?;
+            if value.is_instance(&array_type)? {
+                return pyobj_extract_scalar_via_capsule(value, true);
+            }
+        }
+
+        // Does it have a PyCapsule interface but isn't one of our known libraries?
+        // If so do our "best guess".
Try checking type name, and if that fails
+        // return a single value if the length is 1 and return a List value otherwise
+        if value.hasattr("__arrow_c_array__")? {
+            let type_name = value.get_type().repr()?;
+            if type_name.contains("Scalar")? {
+                return pyobj_extract_scalar_via_capsule(value, false);
+            }
+            if type_name.contains("Array")? {
+                return pyobj_extract_scalar_via_capsule(value, true);
+            }
+
+            let array_data = ArrayData::from_pyarrow_bound(value)?;
+            let array = make_array(array_data);
+
+            let as_array_list = array.len() != 1;
+            return array_to_scalar_value(array, as_array_list);
+        }
+
+        // Last attempt - try to create a PyArrow scalar from a plain Python object
+        if let Ok(pa) = pyarrow_mod.as_ref() {
+            let scalar = pa.call_method1("scalar", (value,))?;
+
+            PyScalarValue::from_pyarrow_bound(&scalar)
+        } else {
+            exec_err!("Unable to import scalar value").map_err(PyDataFusionError::from)?
+        }
+    }
+}
+
+impl<'source> FromPyObject<'_, 'source> for PyScalarValue {
+    type Error = PyErr;
+
+    fn extract(value: Borrowed<'_, 'source, PyAny>) -> Result<Self, Self::Error> {
+        Self::from_pyarrow_bound(&value)
+    }
+}
+
+pub fn scalar_to_pyarrow<'py>(
+    scalar: &ScalarValue,
+    py: Python<'py>,
+) -> PyResult<Bound<'py, PyAny>> {
+    let array = scalar.to_array().map_err(PyDataFusionError::from)?;
+    // convert to pyarrow array using C data interface
+    let pyarray = array.to_data().to_pyarrow(py)?;
+    let pyscalar = pyarray.call_method1("__getitem__", (0,))?;
+
+    Ok(pyscalar)
+}
diff --git a/crates/core/src/record_batch.rs b/crates/core/src/record_batch.rs
new file mode 100644
index 000000000..0492c6c76
--- /dev/null
+++ b/crates/core/src/record_batch.rs
@@ -0,0 +1,113 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.
The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::sync::Arc;
+
+use datafusion::arrow::pyarrow::ToPyArrow;
+use datafusion::arrow::record_batch::RecordBatch;
+use datafusion::physical_plan::SendableRecordBatchStream;
+use datafusion_python_util::wait_for_future;
+use futures::StreamExt;
+use pyo3::exceptions::{PyStopAsyncIteration, PyStopIteration};
+use pyo3::prelude::*;
+use pyo3::{PyAny, PyResult, Python, pyclass, pymethods};
+use tokio::sync::Mutex;
+
+use crate::errors::PyDataFusionError;
+
+#[pyclass(name = "RecordBatch", module = "datafusion", subclass, frozen)]
+pub struct PyRecordBatch {
+    batch: RecordBatch,
+}
+
+#[pymethods]
+impl PyRecordBatch {
+    fn to_pyarrow<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        self.batch.to_pyarrow(py)
+    }
+}
+
+impl From<RecordBatch> for PyRecordBatch {
+    fn from(batch: RecordBatch) -> Self {
+        Self { batch }
+    }
+}
+
+#[pyclass(name = "RecordBatchStream", module = "datafusion", subclass, frozen)]
+pub struct PyRecordBatchStream {
+    stream: Arc<Mutex<SendableRecordBatchStream>>,
+}
+
+impl PyRecordBatchStream {
+    pub fn new(stream: SendableRecordBatchStream) -> Self {
+        Self {
+            stream: Arc::new(Mutex::new(stream)),
+        }
+    }
+}
+
+#[pymethods]
+impl PyRecordBatchStream {
+    fn next(&self, py: Python) -> PyResult<PyRecordBatch> {
+        let stream = self.stream.clone();
+        wait_for_future(py, next_stream(stream, true))?
+    }
+
+    fn __next__(&self, py: Python) -> PyResult<PyRecordBatch> {
+        self.next(py)
+    }
+
+    fn __anext__<'py>(&'py self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> {
+        let stream = self.stream.clone();
+        pyo3_async_runtimes::tokio::future_into_py(py, next_stream(stream, false))
+    }
+
+    fn __iter__(slf: PyRef<'_, Self>) -> PyRef<'_, Self> {
+        slf
+    }
+
+    fn __aiter__(slf: PyRef<'_, Self>) -> PyRef<'_, Self> {
+        slf
+    }
+}
+
+/// Polls the next batch from a `SendableRecordBatchStream`, transposing the
+/// `Option<Result<RecordBatch>>` it yields into a `Result<Option<RecordBatch>>`.
+pub(crate) async fn poll_next_batch(
+    stream: &mut SendableRecordBatchStream,
+) -> datafusion::error::Result<Option<RecordBatch>> {
+    stream.next().await.transpose()
+}
+
+async fn next_stream(
+    stream: Arc<Mutex<SendableRecordBatchStream>>,
+    sync: bool,
+) -> PyResult<PyRecordBatch> {
+    let mut stream = stream.lock().await;
+    match poll_next_batch(&mut stream).await {
+        Ok(Some(batch)) => Ok(batch.into()),
+        Ok(None) => {
+            // Depending on whether the iteration is sync or not, we raise either a
+            // StopIteration or a StopAsyncIteration
+            if sync {
+                Err(PyStopIteration::new_err("stream exhausted"))
+            } else {
+                Err(PyStopAsyncIteration::new_err("stream exhausted"))
+            }
+        }
+        Err(e) => Err(PyDataFusionError::from(e))?,
+    }
+}
diff --git a/src/sql.rs b/crates/core/src/sql.rs
similarity index 97%
rename from src/sql.rs
rename to crates/core/src/sql.rs
index 9f1fe81be..dea9b566a 100644
--- a/src/sql.rs
+++ b/crates/core/src/sql.rs
@@ -17,3 +17,4 @@
 
 pub mod exceptions;
 pub mod logical;
+pub(crate) mod util;
diff --git a/src/expr/subquery.rs b/crates/core/src/sql/exceptions.rs
similarity index 65%
rename from src/expr/subquery.rs
rename to crates/core/src/sql/exceptions.rs
index 93ff244f6..cfb02274b 100644
--- a/src/expr/subquery.rs
+++ b/crates/core/src/sql/exceptions.rs
@@ -15,23 +15,18 @@
 // specific language governing permissions and limitations
 // under the License.
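An editorial aside on the `PyRecordBatchStream` diff above: it exposes one underlying stream through both the synchronous and asynchronous Python iterator protocols, with the `sync` flag in `next_stream` choosing between `StopIteration` and `StopAsyncIteration` on exhaustion. A minimal pure-Python sketch of that dual-protocol pattern (class and method names invented for illustration, no DataFusion dependency):

```python
import asyncio

class DualProtocolStream:
    """Toy analogue of PyRecordBatchStream: one source, two iterator protocols."""

    def __init__(self, batches):
        self._batches = list(batches)

    def _next_batch(self, sync):
        # Shared poll step; `sync` only selects the exhaustion exception,
        # mirroring the `sync: bool` flag passed to next_stream in the Rust code.
        if not self._batches:
            exc = StopIteration if sync else StopAsyncIteration
            raise exc("stream exhausted")
        return self._batches.pop(0)

    def __iter__(self):
        return self

    def __next__(self):
        return self._next_batch(sync=True)

    def __aiter__(self):
        return self

    async def __anext__(self):
        return self._next_batch(sync=False)


# Both `for` and `async for` drain the same implementation.
sync_items = list(DualProtocolStream([1, 2, 3]))

async def collect():
    return [b async for b in DualProtocolStream([1, 2, 3])]

async_items = asyncio.run(collect())
```

The one shared `_next_batch` plus an exception-type switch is the same design choice the binding makes, so sync and async consumers can never observe different data.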
-use datafusion_expr::Subquery;
-use pyo3::prelude::*;
+use std::fmt::{Debug, Display};
 
-#[pyclass(name = "Subquery", module = "datafusion.expr", subclass)]
-#[derive(Clone)]
-pub struct PySubquery {
-    subquery: Subquery,
+use pyo3::PyErr;
+
+pub fn py_type_err(e: impl Debug + Display) -> PyErr {
+    PyErr::new::<pyo3::exceptions::PyTypeError, _>(format!("{e}"))
 }
 
-impl From<PySubquery> for Subquery {
-    fn from(subquery: PySubquery) -> Self {
-        subquery.subquery
-    }
+pub fn py_runtime_err(e: impl Debug + Display) -> PyErr {
+    PyErr::new::<pyo3::exceptions::PyRuntimeError, _>(format!("{e}"))
 }
 
-impl From<Subquery> for PySubquery {
-    fn from(subquery: Subquery) -> PySubquery {
-        PySubquery { subquery }
-    }
+pub fn py_value_err(e: impl Debug + Display) -> PyErr {
+    PyErr::new::<pyo3::exceptions::PyValueError, _>(format!("{e}"))
 }
diff --git a/crates/core/src/sql/logical.rs b/crates/core/src/sql/logical.rs
new file mode 100644
index 000000000..631aa9b09
--- /dev/null
+++ b/crates/core/src/sql/logical.rs
@@ -0,0 +1,239 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+ +use std::sync::Arc; + +use datafusion::logical_expr::{DdlStatement, LogicalPlan, Statement}; +use datafusion_proto::logical_plan::{AsLogicalPlan, DefaultLogicalExtensionCodec}; +use prost::Message; +use pyo3::exceptions::PyRuntimeError; +use pyo3::prelude::*; +use pyo3::types::PyBytes; + +use crate::context::PySessionContext; +use crate::errors::PyDataFusionResult; +use crate::expr::aggregate::PyAggregate; +use crate::expr::analyze::PyAnalyze; +use crate::expr::copy_to::PyCopyTo; +use crate::expr::create_catalog::PyCreateCatalog; +use crate::expr::create_catalog_schema::PyCreateCatalogSchema; +use crate::expr::create_external_table::PyCreateExternalTable; +use crate::expr::create_function::PyCreateFunction; +use crate::expr::create_index::PyCreateIndex; +use crate::expr::create_memory_table::PyCreateMemoryTable; +use crate::expr::create_view::PyCreateView; +use crate::expr::describe_table::PyDescribeTable; +use crate::expr::distinct::PyDistinct; +use crate::expr::dml::PyDmlStatement; +use crate::expr::drop_catalog_schema::PyDropCatalogSchema; +use crate::expr::drop_function::PyDropFunction; +use crate::expr::drop_table::PyDropTable; +use crate::expr::drop_view::PyDropView; +use crate::expr::empty_relation::PyEmptyRelation; +use crate::expr::explain::PyExplain; +use crate::expr::extension::PyExtension; +use crate::expr::filter::PyFilter; +use crate::expr::join::PyJoin; +use crate::expr::limit::PyLimit; +use crate::expr::logical_node::LogicalNode; +use crate::expr::projection::PyProjection; +use crate::expr::recursive_query::PyRecursiveQuery; +use crate::expr::repartition::PyRepartition; +use crate::expr::sort::PySort; +use crate::expr::statement::{ + PyDeallocate, PyExecute, PyPrepare, PyResetVariable, PySetVariable, PyTransactionEnd, + PyTransactionStart, +}; +use crate::expr::subquery::PySubquery; +use crate::expr::subquery_alias::PySubqueryAlias; +use crate::expr::table_scan::PyTableScan; +use crate::expr::union::PyUnion; +use crate::expr::unnest::PyUnnest; 
+use crate::expr::values::PyValues; +use crate::expr::window::PyWindowExpr; + +#[pyclass( + from_py_object, + frozen, + name = "LogicalPlan", + module = "datafusion", + subclass, + eq +)] +#[derive(Debug, Clone, PartialEq, Eq)] +pub struct PyLogicalPlan { + pub(crate) plan: Arc<LogicalPlan>, +} + +impl PyLogicalPlan { + /// Creates a new PyLogicalPlan + pub fn new(plan: LogicalPlan) -> Self { + Self { + plan: Arc::new(plan), + } + } + + pub fn plan(&self) -> Arc<LogicalPlan> { + self.plan.clone() + } +} + +#[pymethods] +impl PyLogicalPlan { + /// Return the specific logical operator + pub fn to_variant<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> { + match self.plan.as_ref() { + LogicalPlan::Aggregate(plan) => PyAggregate::from(plan.clone()).to_variant(py), + LogicalPlan::Analyze(plan) => PyAnalyze::from(plan.clone()).to_variant(py), + LogicalPlan::Distinct(plan) => PyDistinct::from(plan.clone()).to_variant(py), + LogicalPlan::EmptyRelation(plan) => PyEmptyRelation::from(plan.clone()).to_variant(py), + LogicalPlan::Explain(plan) => PyExplain::from(plan.clone()).to_variant(py), + LogicalPlan::Extension(plan) => PyExtension::from(plan.clone()).to_variant(py), + LogicalPlan::Filter(plan) => PyFilter::from(plan.clone()).to_variant(py), + LogicalPlan::Join(plan) => PyJoin::from(plan.clone()).to_variant(py), + LogicalPlan::Limit(plan) => PyLimit::from(plan.clone()).to_variant(py), + LogicalPlan::Projection(plan) => PyProjection::from(plan.clone()).to_variant(py), + LogicalPlan::Sort(plan) => PySort::from(plan.clone()).to_variant(py), + LogicalPlan::TableScan(plan) => PyTableScan::from(plan.clone()).to_variant(py), + LogicalPlan::Subquery(plan) => PySubquery::from(plan.clone()).to_variant(py), + LogicalPlan::SubqueryAlias(plan) => PySubqueryAlias::from(plan.clone()).to_variant(py), + LogicalPlan::Unnest(plan) => PyUnnest::from(plan.clone()).to_variant(py), + LogicalPlan::Window(plan) => PyWindowExpr::from(plan.clone()).to_variant(py), + LogicalPlan::Repartition(plan) =>
PyRepartition::from(plan.clone()).to_variant(py), + LogicalPlan::Union(plan) => PyUnion::from(plan.clone()).to_variant(py), + LogicalPlan::Statement(plan) => match plan { + Statement::TransactionStart(plan) => { + PyTransactionStart::from(plan.clone()).to_variant(py) + } + Statement::TransactionEnd(plan) => { + PyTransactionEnd::from(plan.clone()).to_variant(py) + } + Statement::SetVariable(plan) => PySetVariable::from(plan.clone()).to_variant(py), + Statement::ResetVariable(plan) => { + PyResetVariable::from(plan.clone()).to_variant(py) + } + Statement::Prepare(plan) => PyPrepare::from(plan.clone()).to_variant(py), + Statement::Execute(plan) => PyExecute::from(plan.clone()).to_variant(py), + Statement::Deallocate(plan) => PyDeallocate::from(plan.clone()).to_variant(py), + }, + LogicalPlan::Values(plan) => PyValues::from(plan.clone()).to_variant(py), + LogicalPlan::Dml(plan) => PyDmlStatement::from(plan.clone()).to_variant(py), + LogicalPlan::Ddl(plan) => match plan { + DdlStatement::CreateExternalTable(plan) => { + PyCreateExternalTable::from(plan.clone()).to_variant(py) + } + DdlStatement::CreateMemoryTable(plan) => { + PyCreateMemoryTable::from(plan.clone()).to_variant(py) + } + DdlStatement::CreateView(plan) => PyCreateView::from(plan.clone()).to_variant(py), + DdlStatement::CreateCatalogSchema(plan) => { + PyCreateCatalogSchema::from(plan.clone()).to_variant(py) + } + DdlStatement::CreateCatalog(plan) => { + PyCreateCatalog::from(plan.clone()).to_variant(py) + } + DdlStatement::CreateIndex(plan) => PyCreateIndex::from(plan.clone()).to_variant(py), + DdlStatement::DropTable(plan) => PyDropTable::from(plan.clone()).to_variant(py), + DdlStatement::DropView(plan) => PyDropView::from(plan.clone()).to_variant(py), + DdlStatement::DropCatalogSchema(plan) => { + PyDropCatalogSchema::from(plan.clone()).to_variant(py) + } + DdlStatement::CreateFunction(plan) => { + PyCreateFunction::from(plan.clone()).to_variant(py) + } + DdlStatement::DropFunction(plan) => { + 
PyDropFunction::from(plan.clone()).to_variant(py) + } + }, + LogicalPlan::Copy(plan) => PyCopyTo::from(plan.clone()).to_variant(py), + LogicalPlan::DescribeTable(plan) => PyDescribeTable::from(plan.clone()).to_variant(py), + LogicalPlan::RecursiveQuery(plan) => { + PyRecursiveQuery::from(plan.clone()).to_variant(py) + } + } + } + + /// Get the inputs to this plan + fn inputs(&self) -> Vec<PyLogicalPlan> { + let mut inputs = vec![]; + for input in self.plan.inputs() { + inputs.push(input.to_owned().into()); + } + inputs + } + + fn __repr__(&self) -> PyResult<String> { + Ok(format!("{:?}", self.plan)) + } + + fn display(&self) -> String { + format!("{}", self.plan.display()) + } + + fn display_indent(&self) -> String { + format!("{}", self.plan.display_indent()) + } + + fn display_indent_schema(&self) -> String { + format!("{}", self.plan.display_indent_schema()) + } + + fn display_graphviz(&self) -> String { + format!("{}", self.plan.display_graphviz()) + } + + pub fn to_proto<'py>(&'py self, py: Python<'py>) -> PyDataFusionResult<Bound<'py, PyBytes>> { + let codec = DefaultLogicalExtensionCodec {}; + let proto = + datafusion_proto::protobuf::LogicalPlanNode::try_from_logical_plan(&self.plan, &codec)?; + + let bytes = proto.encode_to_vec(); + Ok(PyBytes::new(py, &bytes)) + } + + #[staticmethod] + pub fn from_proto( + ctx: PySessionContext, + proto_msg: Bound<'_, PyBytes>, + ) -> PyDataFusionResult<Self> { + let bytes: &[u8] = proto_msg.extract().map_err(Into::<PyErr>::into)?; + let proto_plan = + datafusion_proto::protobuf::LogicalPlanNode::decode(bytes).map_err(|e| { + PyRuntimeError::new_err(format!( + "Unable to decode logical node from serialized bytes: {e}" + )) + })?; + + let codec = DefaultLogicalExtensionCodec {}; + let plan = proto_plan.try_into_logical_plan(&ctx.ctx.task_ctx(), &codec)?; + Ok(Self::new(plan)) + } +} + +impl From<PyLogicalPlan> for LogicalPlan { + fn from(logical_plan: PyLogicalPlan) -> LogicalPlan { + logical_plan.plan.as_ref().clone() + } +} + +impl From<LogicalPlan> for PyLogicalPlan { + fn from(logical_plan:
LogicalPlan) -> PyLogicalPlan { + PyLogicalPlan { + plan: Arc::new(logical_plan), + } + } +} diff --git a/crates/core/src/sql/util.rs b/crates/core/src/sql/util.rs new file mode 100644 index 000000000..d1e8964f8 --- /dev/null +++ b/crates/core/src/sql/util.rs @@ -0,0 +1,87 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
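The `From` conversions that close out `logical.rs` above follow a common wrapper pattern: an `Arc`-held inner value with conversions in both directions, cloning the inner plan only when unwrapping. A stdlib sketch of the same shape (all names here are illustrative, not the project's types):

```rust
use std::sync::Arc;

// Stand-in for the wrapped engine type (e.g. a logical plan).
#[derive(Debug, Clone, PartialEq)]
struct Plan(String);

// Wrapper holding the inner value behind an Arc, mirroring PyLogicalPlan.
#[derive(Clone)]
struct Wrapper {
    plan: Arc<Plan>,
}

impl From<Plan> for Wrapper {
    fn from(plan: Plan) -> Self {
        Self { plan: Arc::new(plan) }
    }
}

// Unwrapping clones the inner value out of the Arc, as PyLogicalPlan does.
impl From<Wrapper> for Plan {
    fn from(w: Wrapper) -> Plan {
        w.plan.as_ref().clone()
    }
}

fn main() {
    let w: Wrapper = Plan("scan".into()).into();
    let back: Plan = w.clone().into();
    assert_eq!(back, *w.plan);
    // Cloning the wrapper is cheap: both share the same Arc allocation.
    assert_eq!(Arc::strong_count(&w.plan), 2);
}
```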
+ +use std::collections::HashMap; + +use datafusion::common::{DataFusionError, exec_err, plan_datafusion_err}; +use datafusion::logical_expr::sqlparser::dialect::dialect_from_str; +use datafusion::sql::sqlparser::dialect::Dialect; +use datafusion::sql::sqlparser::parser::Parser; +use datafusion::sql::sqlparser::tokenizer::{Token, Tokenizer}; + +fn tokens_from_replacements( + placeholder: &str, + replacements: &HashMap<String, Vec<Token>>, +) -> Option<Vec<Token>> { + if let Some(pattern) = placeholder.strip_prefix("$") { + replacements.get(pattern).cloned() + } else { + None + } +} + +fn get_tokens_for_string_replacement( + dialect: &dyn Dialect, + replacements: HashMap<String, String>, +) -> Result<HashMap<String, Vec<Token>>, DataFusionError> { + replacements + .into_iter() + .map(|(name, value)| { + let tokens = Tokenizer::new(dialect, &value) + .tokenize() + .map_err(|err| DataFusionError::External(err.into()))?; + Ok((name, tokens)) + }) + .collect() +} + +pub(crate) fn replace_placeholders_with_strings( + query: &str, + dialect: &str, + replacements: HashMap<String, String>, +) -> Result<String, DataFusionError> { + let dialect = dialect_from_str(dialect) + .ok_or_else(|| plan_datafusion_err!("Unsupported SQL dialect: {dialect}."))?; + + let replacements = get_tokens_for_string_replacement(dialect.as_ref(), replacements)?; + + let tokens = Tokenizer::new(dialect.as_ref(), query) + .tokenize() + .map_err(|err| DataFusionError::External(err.into()))?; + + let replaced_tokens = tokens + .into_iter() + .flat_map(|token| { + if let Token::Placeholder(placeholder) = &token { + tokens_from_replacements(placeholder, &replacements).unwrap_or(vec![token]) + } else { + vec![token] + } + }) + .collect::<Vec<_>>(); + + let statement = Parser::new(dialect.as_ref()) + .with_tokens(replaced_tokens) + .parse_statements() + .map_err(|err| DataFusionError::External(Box::new(err)))?; + + if statement.len() != 1 { + return exec_err!("placeholder replacement should return exactly one statement"); + } + + Ok(statement[0].to_string()) +} diff --git a/src/store.rs b/crates/core/src/store.rs similarity
index 80% rename from src/store.rs rename to crates/core/src/store.rs index 7d9bb7518..8535e83b7 100644 --- a/src/store.rs +++ b/crates/core/src/store.rs @@ -17,12 +17,14 @@ use std::sync::Arc; -use pyo3::prelude::*; - use object_store::aws::{AmazonS3, AmazonS3Builder}; use object_store::azure::{MicrosoftAzure, MicrosoftAzureBuilder}; use object_store::gcp::{GoogleCloudStorage, GoogleCloudStorageBuilder}; +use object_store::http::{HttpBuilder, HttpStore}; use object_store::local::LocalFileSystem; +use pyo3::exceptions::PyValueError; +use pyo3::prelude::*; +use url::Url; #[derive(FromPyObject)] pub enum StorageContexts { @@ -30,13 +32,15 @@ pub enum StorageContexts { GoogleCloudStorage(PyGoogleCloudContext), MicrosoftAzure(PyMicrosoftAzureContext), LocalFileSystem(PyLocalFileSystemContext), + HTTP(PyHttpContext), } #[pyclass( + from_py_object, + frozen, name = "LocalFileSystem", module = "datafusion.store", - subclass, - unsendable + subclass )] #[derive(Debug, Clone)] pub struct PyLocalFileSystemContext { @@ -64,10 +68,11 @@ impl PyLocalFileSystemContext { } #[pyclass( + from_py_object, + frozen, name = "MicrosoftAzure", module = "datafusion.store", - subclass, - unsendable + subclass )] #[derive(Debug, Clone)] pub struct PyMicrosoftAzureContext { @@ -78,7 +83,7 @@ pub struct PyMicrosoftAzureContext { #[pymethods] impl PyMicrosoftAzureContext { #[allow(clippy::too_many_arguments)] - #[pyo3(signature = (container_name, account=None, access_key=None, bearer_token=None, client_id=None, client_secret=None, tenant_id=None, sas_query_pairs=None, use_emulator=None, allow_http=None))] + #[pyo3(signature = (container_name, account=None, access_key=None, bearer_token=None, client_id=None, client_secret=None, tenant_id=None, sas_query_pairs=None, use_emulator=None, allow_http=None, use_fabric_endpoint=None))] #[new] fn new( container_name: String, @@ -91,6 +96,7 @@ impl PyMicrosoftAzureContext { sas_query_pairs: Option<Vec<(String, String)>>, use_emulator: Option<bool>, allow_http: Option<bool>, +
use_fabric_endpoint: Option<bool>, ) -> Self { let mut builder = MicrosoftAzureBuilder::from_env().with_container_name(&container_name); @@ -129,6 +135,10 @@ impl PyMicrosoftAzureContext { builder = builder.with_allow_http(allow_http); } + if let Some(use_fabric_endpoint) = use_fabric_endpoint { + builder = builder.with_use_fabric_endpoint(use_fabric_endpoint); + } + Self { inner: Arc::new( builder @@ -141,10 +151,11 @@ } #[pyclass( + from_py_object, + frozen, name = "GoogleCloud", module = "datafusion.store", - subclass, - unsendable + subclass )] #[derive(Debug, Clone)] pub struct PyGoogleCloudContext { @@ -175,7 +186,13 @@ impl PyGoogleCloudContext { } } -#[pyclass(name = "AmazonS3", module = "datafusion.store", subclass, unsendable)] +#[pyclass( + from_py_object, + frozen, + name = "AmazonS3", + module = "datafusion.store", + subclass +)] #[derive(Debug, Clone)] pub struct PyAmazonS3Context { pub inner: Arc<AmazonS3>, @@ -185,13 +202,14 @@ pub struct PyAmazonS3Context { #[pymethods] impl PyAmazonS3Context { #[allow(clippy::too_many_arguments)] - #[pyo3(signature = (bucket_name, region=None, access_key_id=None, secret_access_key=None, endpoint=None, allow_http=false, imdsv1_fallback=false))] + #[pyo3(signature = (bucket_name, region=None, access_key_id=None, secret_access_key=None, session_token=None, endpoint=None, allow_http=false, imdsv1_fallback=false))] #[new] fn new( bucket_name: String, region: Option<String>, access_key_id: Option<String>, secret_access_key: Option<String>, + session_token: Option<String>, endpoint: Option<String>, //retry_config: RetryConfig, allow_http: bool, @@ -212,6 +230,10 @@ impl PyAmazonS3Context { builder = builder.with_secret_access_key(secret_access_key); }; + if let Some(session_token) = session_token { + builder = builder.with_token(session_token); + } + if let Some(endpoint) = endpoint { builder = builder.with_endpoint(endpoint); }; @@ -234,10 +256,43 @@ impl PyAmazonS3Context { } } -pub(crate) fn init_module(m: &PyModule) -> PyResult<()> {
+#[pyclass( + from_py_object, + frozen, + name = "Http", + module = "datafusion.store", + subclass +)] +#[derive(Debug, Clone)] +pub struct PyHttpContext { + pub url: String, + pub store: Arc<HttpStore>, +} + +#[pymethods] +impl PyHttpContext { + #[new] + fn new(url: String) -> PyResult<Self> { + let store = match Url::parse(url.as_str()) { + Ok(url) => HttpBuilder::new() + .with_url(url.origin().ascii_serialization()) + .build(), + Err(_) => HttpBuilder::new().build(), + } + .map_err(|e| PyValueError::new_err(format!("Error: {:?}", e.to_string())))?; + + Ok(Self { + url, + store: Arc::new(store), + }) + } +} + +pub(crate) fn init_module(m: &Bound<'_, PyModule>) -> PyResult<()> { m.add_class::<PyAmazonS3Context>()?; m.add_class::<PyMicrosoftAzureContext>()?; m.add_class::<PyGoogleCloudContext>()?; m.add_class::<PyLocalFileSystemContext>()?; + m.add_class::<PyHttpContext>()?; Ok(()) } diff --git a/crates/core/src/substrait.rs b/crates/core/src/substrait.rs new file mode 100644 index 000000000..27e446f48 --- /dev/null +++ b/crates/core/src/substrait.rs @@ -0,0 +1,195 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License.
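The placeholder substitution in `sql/util.rs` earlier in this diff boils down to a `flat_map` over the token stream: each `$name` placeholder is swapped for a pre-tokenized replacement sequence, and every other token passes through. A stdlib sketch of that core step (a toy `Token` enum stands in for sqlparser's token type):

```rust
use std::collections::HashMap;

// Toy token type; sqlparser's Token plays this role in the real code.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Word(String),
    Placeholder(String), // stored with its leading '$', as sqlparser does
}

// Mirror of the flat_map in replace_placeholders_with_strings: a placeholder
// expands to its replacement tokens, or passes through unchanged if unknown.
fn replace_placeholders(
    tokens: Vec<Token>,
    replacements: &HashMap<String, Vec<Token>>,
) -> Vec<Token> {
    tokens
        .into_iter()
        .flat_map(|token| {
            if let Token::Placeholder(name) = &token {
                name.strip_prefix('$')
                    .and_then(|key| replacements.get(key).cloned())
                    .unwrap_or(vec![token])
            } else {
                vec![token]
            }
        })
        .collect()
}

fn main() {
    let mut reps = HashMap::new();
    reps.insert("tbl".to_string(), vec![Token::Word("users".into())]);
    let out = replace_placeholders(
        vec![Token::Word("SELECT".into()), Token::Placeholder("$tbl".into())],
        &reps,
    );
    assert_eq!(out, vec![Token::Word("SELECT".into()), Token::Word("users".into())]);
}
```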
+ +use datafusion_python_util::wait_for_future; +use datafusion_substrait::logical_plan::{consumer, producer}; +use datafusion_substrait::serializer; +use datafusion_substrait::substrait::proto::Plan; +use prost::Message; +use pyo3::prelude::*; +use pyo3::types::PyBytes; + +use crate::context::PySessionContext; +use crate::errors::{PyDataFusionError, PyDataFusionResult, py_datafusion_err, to_datafusion_err}; +use crate::sql::logical::PyLogicalPlan; + +#[pyclass( + from_py_object, + frozen, + name = "Plan", + module = "datafusion.substrait", + subclass +)] +#[derive(Debug, Clone)] +pub struct PyPlan { + pub plan: Plan, +} + +#[pymethods] +impl PyPlan { + fn encode(&self, py: Python) -> PyResult<Py<PyAny>> { + let mut proto_bytes = Vec::<u8>::new(); + self.plan + .encode(&mut proto_bytes) + .map_err(PyDataFusionError::EncodeError)?; + Ok(PyBytes::new(py, &proto_bytes).into()) + } + + /// Get the JSON representation of the substrait plan + fn to_json(&self) -> PyDataFusionResult<String> { + let json = serde_json::to_string_pretty(&self.plan).map_err(to_datafusion_err)?; + Ok(json) + } + + /// Parse a Substrait Plan from its JSON representation + #[staticmethod] + fn from_json(json: &str) -> PyDataFusionResult<Self> { + let plan: Plan = serde_json::from_str(json).map_err(to_datafusion_err)?; + Ok(PyPlan { plan }) + } +} + +impl From<PyPlan> for Plan { + fn from(plan: PyPlan) -> Plan { + plan.plan + } +} + +impl From<Plan> for PyPlan { + fn from(plan: Plan) -> PyPlan { + PyPlan { plan } + } +} + +/// A PySubstraitSerializer is a representation of a Serializer that is capable of both serializing +/// a `LogicalPlan` instance to Substrait Protobuf bytes and deserializing Substrait Protobuf bytes +/// to a valid `LogicalPlan` instance.
+#[pyclass( + from_py_object, + frozen, + name = "Serde", + module = "datafusion.substrait", + subclass +)] +#[derive(Debug, Clone)] +pub struct PySubstraitSerializer; + +#[pymethods] +impl PySubstraitSerializer { + #[staticmethod] + pub fn serialize( + sql: &str, + ctx: PySessionContext, + path: &str, + py: Python, + ) -> PyDataFusionResult<()> { + wait_for_future(py, serializer::serialize(sql, &ctx.ctx, path))??; + Ok(()) + } + + #[staticmethod] + pub fn serialize_to_plan( + sql: &str, + ctx: PySessionContext, + py: Python, + ) -> PyDataFusionResult<PyPlan> { + PySubstraitSerializer::serialize_bytes(sql, ctx, py).and_then(|proto_bytes| { + let proto_bytes = proto_bytes.bind(py).cast::<PyBytes>().unwrap(); + PySubstraitSerializer::deserialize_bytes(proto_bytes.as_bytes().to_vec(), py) + }) + } + + #[staticmethod] + pub fn serialize_bytes( + sql: &str, + ctx: PySessionContext, + py: Python, + ) -> PyDataFusionResult<Py<PyAny>> { + let proto_bytes: Vec<u8> = + wait_for_future(py, serializer::serialize_bytes(sql, &ctx.ctx))??; + Ok(PyBytes::new(py, &proto_bytes).into()) + } + + #[staticmethod] + pub fn deserialize(path: &str, py: Python) -> PyDataFusionResult<PyPlan> { + let plan = wait_for_future(py, serializer::deserialize(path))??; + Ok(PyPlan { plan: *plan }) + } + + #[staticmethod] + pub fn deserialize_bytes(proto_bytes: Vec<u8>, py: Python) -> PyDataFusionResult<PyPlan> { + let plan = wait_for_future(py, serializer::deserialize_bytes(proto_bytes))??; + Ok(PyPlan { plan: *plan }) + } +} + +#[pyclass( + from_py_object, + frozen, + name = "Producer", + module = "datafusion.substrait", + subclass +)] +#[derive(Debug, Clone)] +pub struct PySubstraitProducer; + +#[pymethods] +impl PySubstraitProducer { + /// Convert DataFusion LogicalPlan to Substrait Plan + #[staticmethod] + pub fn to_substrait_plan(plan: PyLogicalPlan, ctx: &PySessionContext) -> PyResult<PyPlan> { + let session_state = ctx.ctx.state(); + match producer::to_substrait_plan(&plan.plan, &session_state) { + Ok(plan) => Ok(PyPlan { plan: *plan }), + Err(e) =>
Err(py_datafusion_err(e)), + } + } +} + +#[pyclass( + from_py_object, + frozen, + name = "Consumer", + module = "datafusion.substrait", + subclass +)] +#[derive(Debug, Clone)] +pub struct PySubstraitConsumer; + +#[pymethods] +impl PySubstraitConsumer { + /// Convert Substrait Plan to DataFusion DataFrame + #[staticmethod] + pub fn from_substrait_plan( + ctx: &PySessionContext, + plan: PyPlan, + py: Python, + ) -> PyDataFusionResult<PyLogicalPlan> { + let session_state = ctx.ctx.state(); + let result = consumer::from_substrait_plan(&session_state, &plan.plan); + let logical_plan = wait_for_future(py, result)??; + Ok(PyLogicalPlan::new(logical_plan)) + } +} + +pub fn init_module(m: &Bound<'_, PyModule>) -> PyResult<()> { + m.add_class::<PyPlan>()?; + m.add_class::<PySubstraitSerializer>()?; + m.add_class::<PySubstraitProducer>()?; + m.add_class::<PySubstraitConsumer>()?; + Ok(()) +} diff --git a/crates/core/src/table.rs b/crates/core/src/table.rs new file mode 100644 index 000000000..623349771 --- /dev/null +++ b/crates/core/src/table.rs @@ -0,0 +1,261 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License.
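`TempViewTable::scan` in the `table.rs` changes that follow folds the pushed-down predicates into a single conjunction with `reduce`, so an empty filter set naturally yields `None`. A stdlib sketch of that fold (string predicates stand in for DataFusion's `Expr`):

```rust
// Mirror of `filters.iter().cloned().reduce(|acc, new| acc.and(new))`:
// fold a slice of predicates into one AND-combined predicate, or None
// when there is nothing to filter on.
fn combine_filters(filters: &[String]) -> Option<String> {
    filters
        .iter()
        .cloned()
        .reduce(|acc, new| format!("({acc}) AND ({new})"))
}

fn main() {
    // No predicates: no filter node gets added to the plan.
    assert_eq!(combine_filters(&[]), None);
    // A single predicate passes through untouched.
    assert_eq!(combine_filters(&["a > 1".into()]), Some("a > 1".to_string()));
    // Multiple predicates are AND-ed together left to right.
    assert_eq!(
        combine_filters(&["a > 1".into(), "b < 2".into()]),
        Some("(a > 1) AND (b < 2)".to_string())
    );
}
```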
+ +use std::any::Any; +use std::sync::Arc; + +use arrow::datatypes::SchemaRef; +use arrow::pyarrow::ToPyArrow; +use async_trait::async_trait; +use datafusion::catalog::{Session, TableProviderFactory}; +use datafusion::common::Column; +use datafusion::datasource::{TableProvider, TableType}; +use datafusion::logical_expr::{ + CreateExternalTable, Expr, LogicalPlanBuilder, TableProviderFilterPushDown, +}; +use datafusion::physical_plan::ExecutionPlan; +use datafusion::prelude::DataFrame; +use datafusion_ffi::proto::logical_extension_codec::FFI_LogicalExtensionCodec; +use datafusion_python_util::{create_logical_extension_capsule, table_provider_from_pycapsule}; +use pyo3::IntoPyObjectExt; +use pyo3::prelude::*; + +use crate::context::PySessionContext; +use crate::dataframe::PyDataFrame; +use crate::dataset::Dataset; +use crate::errors; +use crate::expr::create_external_table::PyCreateExternalTable; + +/// This struct is used as a common method for all TableProviders, +/// whether they refer to an FFI provider, an internally known +/// implementation, a dataset, or a dataframe view. +#[pyclass( + from_py_object, + frozen, + name = "RawTable", + module = "datafusion.catalog", + subclass +)] +#[derive(Clone)] +pub struct PyTable { + pub table: Arc<dyn TableProvider>, +} + +impl PyTable { + pub fn table(&self) -> Arc<dyn TableProvider> { + self.table.clone() + } +} + +#[pymethods] +impl PyTable { + /// Instantiate from any Python object that supports any of the table + /// types. We do not know a priori when using this method whether the + /// object passed in is a wrapped or raw class.
Here we handle all of the + /// following object types: + /// + /// - PyTable (essentially a clone operation), but either raw or wrapped + /// - DataFrame, either raw or wrapped + /// - FFI Table Providers via PyCapsule + /// - PyArrow Dataset objects + #[new] + pub fn new(obj: Bound<'_, PyAny>, session: Option<Bound<'_, PyAny>>) -> PyResult<Self> { + let py = obj.py(); + if let Ok(py_table) = obj.extract::<PyTable>() { + Ok(py_table) + } else if let Ok(py_table) = obj + .getattr("_inner") + .and_then(|inner| inner.extract::<PyTable>().map_err(Into::<PyErr>::into)) + { + Ok(py_table) + } else if let Ok(py_df) = obj.extract::<PyDataFrame>() { + let provider = py_df.inner_df().as_ref().clone().into_view(); + Ok(PyTable::from(provider)) + } else if let Ok(py_df) = obj + .getattr("df") + .and_then(|inner| inner.extract::<PyDataFrame>().map_err(Into::<PyErr>::into)) + { + let provider = py_df.inner_df().as_ref().clone().into_view(); + Ok(PyTable::from(provider)) + } else if let Some(provider) = { + let session = match session { + Some(session) => session, + None => PySessionContext::global_ctx()?.into_bound_py_any(obj.py())?, + }; + table_provider_from_pycapsule(obj.clone(), session)? + } { + Ok(PyTable::from(provider)) + } else { + let provider = Arc::new(Dataset::new(&obj, py)?) as Arc<dyn TableProvider>; + Ok(PyTable::from(provider)) + } + } + + /// Get a reference to the schema for this table + #[getter] + fn schema<'py>(&self, py: Python<'py>) -> PyResult<Bound<'py, PyAny>> { + self.table.schema().to_pyarrow(py) + } + + /// Get the type of this table for metadata/catalog purposes.
+ #[getter] + fn kind(&self) -> &str { + match self.table.table_type() { + TableType::Base => "physical", + TableType::View => "view", + TableType::Temporary => "temporary", + } + } + + fn __repr__(&self) -> PyResult<String> { + let kind = self.kind(); + Ok(format!("Table(kind={kind})")) + } +} + +impl From<Arc<dyn TableProvider>> for PyTable { + fn from(table: Arc<dyn TableProvider>) -> Self { + Self { table } + } +} + +#[derive(Clone, Debug)] +pub(crate) struct TempViewTable { + df: Arc<DataFrame>, +} + +/// This is nearly identical to `DataFrameTableProvider` +/// except that it is for temporary tables. +/// Remove when https://github.com/apache/datafusion/issues/18026 +/// closes. +impl TempViewTable { + pub(crate) fn new(df: Arc<DataFrame>) -> Self { + Self { df } + } +} + +#[async_trait] +impl TableProvider for TempViewTable { + fn as_any(&self) -> &dyn Any { + self + } + + fn schema(&self) -> SchemaRef { + Arc::new(self.df.schema().as_arrow().clone()) + } + + fn table_type(&self) -> TableType { + TableType::Temporary + } + + async fn scan( + &self, + state: &dyn Session, + projection: Option<&Vec<usize>>, + filters: &[Expr], + limit: Option<usize>, + ) -> datafusion::common::Result<Arc<dyn ExecutionPlan>> { + let filter = filters.iter().cloned().reduce(|acc, new| acc.and(new)); + let plan = self.df.logical_plan().clone(); + let mut plan = LogicalPlanBuilder::from(plan); + + if let Some(filter) = filter { + plan = plan.filter(filter)?; + } + + let mut plan = if let Some(projection) = projection { + // avoiding adding a redundant projection (e.g. SELECT * FROM view) + let current_projection = (0..plan.schema().fields().len()).collect::<Vec<_>>(); + if projection == &current_projection { + plan + } else { + let fields: Vec<Expr> = projection + .iter() + .map(|i| { + Expr::Column(Column::from( + self.df.logical_plan().schema().qualified_field(*i), + )) + }) + .collect(); + plan.project(fields)?
+ } + } else { + plan + }; + + if let Some(limit) = limit { + plan = plan.limit(0, Some(limit))?; + } + + state.create_physical_plan(&plan.build()?).await + } + + fn supports_filters_pushdown( + &self, + filters: &[&Expr], + ) -> datafusion::common::Result<Vec<TableProviderFilterPushDown>> { + Ok(vec![TableProviderFilterPushDown::Exact; filters.len()]) + } +} + +#[derive(Debug)] +pub(crate) struct RustWrappedPyTableProviderFactory { + pub(crate) table_provider_factory: Py<PyAny>, + pub(crate) codec: Arc<FFI_LogicalExtensionCodec>, +} + +impl RustWrappedPyTableProviderFactory { + pub fn new(table_provider_factory: Py<PyAny>, codec: Arc<FFI_LogicalExtensionCodec>) -> Self { + Self { + table_provider_factory, + codec, + } + } + + fn create_inner( + &self, + cmd: CreateExternalTable, + codec: Bound<'_, PyAny>, + ) -> PyResult<Arc<dyn TableProvider>> { + Python::attach(|py| { + let provider = self.table_provider_factory.bind(py); + let cmd = PyCreateExternalTable::from(cmd); + + provider + .call_method1("create", (cmd,)) + .and_then(|t| PyTable::new(t, Some(codec))) + .map(|t| t.table()) + }) + } +} + +#[async_trait] +impl TableProviderFactory for RustWrappedPyTableProviderFactory { + async fn create( + &self, + _: &dyn Session, + cmd: &CreateExternalTable, + ) -> datafusion::common::Result<Arc<dyn TableProvider>> { + Python::attach(|py| { + let codec = create_logical_extension_capsule(py, self.codec.as_ref()) + .map_err(errors::to_datafusion_err)?; + + self.create_inner(cmd.clone(), codec.into_any()) + .map_err(errors::to_datafusion_err) + }) + } +} diff --git a/crates/core/src/udaf.rs b/crates/core/src/udaf.rs new file mode 100644 index 000000000..80ef51716 --- /dev/null +++ b/crates/core/src/udaf.rs @@ -0,0 +1,234 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License.
You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +use std::ptr::NonNull; +use std::sync::Arc; + +use datafusion::arrow::array::ArrayRef; +use datafusion::arrow::datatypes::DataType; +use datafusion::arrow::pyarrow::{PyArrowType, ToPyArrow}; +use datafusion::common::ScalarValue; +use datafusion::error::{DataFusionError, Result}; +use datafusion::logical_expr::{ + Accumulator, AccumulatorFactoryFunction, AggregateUDF, AggregateUDFImpl, create_udaf, +}; +use datafusion_ffi::udaf::FFI_AggregateUDF; +use datafusion_python_util::parse_volatility; +use pyo3::prelude::*; +use pyo3::types::{PyCapsule, PyTuple}; + +use crate::common::data_type::PyScalarValue; +use crate::errors::{PyDataFusionResult, py_datafusion_err, to_datafusion_err}; +use crate::expr::PyExpr; + +#[derive(Debug)] +struct RustAccumulator { + accum: Py<PyAny>, +} + +impl RustAccumulator { + fn new(accum: Py<PyAny>) -> Self { + Self { accum } + } +} + +impl Accumulator for RustAccumulator { + fn state(&mut self) -> Result<Vec<ScalarValue>> { + Python::attach(|py| -> PyResult<Vec<ScalarValue>> { + let values = self.accum.bind(py).call_method0("state")?; + let mut scalars = Vec::new(); + for item in values.try_iter()?
{ + let item: Bound<'_, PyAny> = item?; + let scalar = item.extract::<PyScalarValue>()?.0; + scalars.push(scalar); + } + Ok(scalars) + }) + .map_err(|e| DataFusionError::Execution(format!("{e}"))) + } + + fn evaluate(&mut self) -> Result<ScalarValue> { + Python::attach(|py| -> PyResult<ScalarValue> { + let value = self.accum.bind(py).call_method0("evaluate")?; + value.extract::<PyScalarValue>().map(|v| v.0) + }) + .map_err(|e| DataFusionError::Execution(format!("{e}"))) + } + + fn update_batch(&mut self, values: &[ArrayRef]) -> Result<()> { + Python::attach(|py| { + // 1. cast args to Pyarrow array + let py_args = values + .iter() + .map(|arg| arg.to_data().to_pyarrow(py).unwrap()) + .collect::<Vec<_>>(); + let py_args = PyTuple::new(py, py_args).map_err(to_datafusion_err)?; + + // 2. call function + self.accum + .bind(py) + .call_method1("update", py_args) + .map_err(|e| DataFusionError::Execution(format!("{e}")))?; + + Ok(()) + }) + } + + fn merge_batch(&mut self, states: &[ArrayRef]) -> Result<()> { + Python::attach(|py| { + // 1. cast states to Pyarrow arrays + let py_states: Result<Vec<Py<PyAny>>> = states + .iter() + .map(|state| { + state + .to_data() + .to_pyarrow(py) + .map_err(|e| DataFusionError::Execution(format!("{e}"))) + }) + .collect(); + + // 2. call merge + self.accum + .bind(py) + .call_method1("merge", (py_states?,)) + .map_err(|e| DataFusionError::Execution(format!("{e}")))?; + + Ok(()) + }) + } + + fn size(&self) -> usize { + std::mem::size_of_val(self) + } + + fn retract_batch(&mut self, values: &[ArrayRef]) -> Result<()> { + Python::attach(|py| { + // 1. cast args to Pyarrow array + let py_args = values + .iter() + .map(|arg| arg.to_data().to_pyarrow(py).unwrap()) + .collect::<Vec<_>>(); + let py_args = PyTuple::new(py, py_args).map_err(to_datafusion_err)?; + + // 2.
call function + self.accum + .bind(py) + .call_method1("retract_batch", py_args) + .map_err(|e| DataFusionError::Execution(format!("{e}")))?; + + Ok(()) + }) + } + + fn supports_retract_batch(&self) -> bool { + Python::attach( + |py| match self.accum.bind(py).call_method0("supports_retract_batch") { + Ok(x) => x.extract().unwrap_or(false), + Err(_) => false, + }, + ) + } +} + +pub fn to_rust_accumulator(accum: Py) -> AccumulatorFactoryFunction { + Arc::new(move |_args| -> Result> { + let accum = Python::attach(|py| { + accum + .call0(py) + .map_err(|e| DataFusionError::Execution(format!("{e}"))) + })?; + Ok(Box::new(RustAccumulator::new(accum))) + }) +} + +fn aggregate_udf_from_capsule(capsule: &Bound<'_, PyCapsule>) -> PyDataFusionResult { + let data: NonNull = capsule + .pointer_checked(Some(c"datafusion_aggregate_udf"))? + .cast(); + let udaf = unsafe { data.as_ref() }; + let udaf: Arc = udaf.into(); + + Ok(AggregateUDF::new_from_shared_impl(udaf)) +} + +/// Represents an AggregateUDF +#[pyclass( + from_py_object, + frozen, + name = "AggregateUDF", + module = "datafusion", + subclass +)] +#[derive(Debug, Clone)] +pub struct PyAggregateUDF { + pub(crate) function: AggregateUDF, +} + +#[pymethods] +impl PyAggregateUDF { + #[new] + #[pyo3(signature=(name, accumulator, input_type, return_type, state_type, volatility))] + fn new( + name: &str, + accumulator: Py, + input_type: PyArrowType>, + return_type: PyArrowType, + state_type: PyArrowType>, + volatility: &str, + ) -> PyResult { + let function = create_udaf( + name, + input_type.0, + Arc::new(return_type.0), + parse_volatility(volatility)?, + to_rust_accumulator(accumulator), + Arc::new(state_type.0), + ); + Ok(Self { function }) + } + + #[staticmethod] + pub fn from_pycapsule(func: Bound<'_, PyAny>) -> PyDataFusionResult { + if func.is_instance_of::() { + let capsule = func.cast::().map_err(py_datafusion_err)?; + let function = aggregate_udf_from_capsule(capsule)?; + return Ok(Self { function }); + } + + if 
func.hasattr("__datafusion_aggregate_udf__")? { + let capsule = func.getattr("__datafusion_aggregate_udf__")?.call0()?; + let capsule = capsule.cast::().map_err(py_datafusion_err)?; + let function = aggregate_udf_from_capsule(capsule)?; + return Ok(Self { function }); + } + + Err(crate::errors::PyDataFusionError::Common( + "__datafusion_aggregate_udf__ does not exist on AggregateUDF object.".to_string(), + )) + } + + /// creates a new PyExpr with the call of the udf + #[pyo3(signature = (*args))] + fn __call__(&self, args: Vec) -> PyResult { + let args = args.iter().map(|e| e.expr.clone()).collect(); + Ok(self.function.call(args).into()) + } + + fn __repr__(&self) -> PyResult { + Ok(format!("AggregateUDF({})", self.function.name())) + } +} diff --git a/crates/core/src/udf.rs b/crates/core/src/udf.rs new file mode 100644 index 000000000..c0a39cb47 --- /dev/null +++ b/crates/core/src/udf.rs @@ -0,0 +1,223 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
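The `RustAccumulator` above drives a user-supplied Python object through `state`, `update`, `merge`, and `evaluate` calls (plus optional `retract_batch` / `supports_retract_batch`). As an illustrative sketch only: in the real binding `update`/`merge` receive PyArrow arrays and the object is registered via the Python `udaf` API rather than called directly; plain Python lists and numbers stand in here so the example stays dependency-free.

```python
class SumAccumulator:
    """Illustrative accumulator: sums one numeric column."""

    def __init__(self) -> None:
        self._sum = 0

    def state(self):
        # Partial aggregation state exchanged between partitions.
        return [self._sum]

    def update(self, values) -> None:
        # Called once per input batch.
        self._sum += sum(values)

    def merge(self, states) -> None:
        # Combine partial states produced by other partitions.
        self._sum += sum(states)

    def evaluate(self):
        # Final aggregation result.
        return self._sum


acc = SumAccumulator()
acc.update([1, 2, 3])

other = SumAccumulator()
other.update([10, 20])
acc.merge(other.state())

print(acc.evaluate())  # 36
```

Note that `merge` consumes exactly what `state` produces; that round-trip is what `state()` serializes through `PyScalarValue` in the Rust code above.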
+
+use std::any::Any;
+use std::hash::{Hash, Hasher};
+use std::ptr::NonNull;
+use std::sync::Arc;
+
+use arrow::datatypes::{Field, FieldRef};
+use arrow::pyarrow::ToPyArrow;
+use datafusion::arrow::array::{ArrayData, make_array};
+use datafusion::arrow::datatypes::DataType;
+use datafusion::arrow::pyarrow::{FromPyArrow, PyArrowType};
+use datafusion::common::internal_err;
+use datafusion::error::DataFusionError;
+use datafusion::logical_expr::{
+    ColumnarValue, ReturnFieldArgs, ScalarFunctionArgs, ScalarUDF, ScalarUDFImpl, Signature,
+    Volatility,
+};
+use datafusion_ffi::udf::FFI_ScalarUDF;
+use datafusion_python_util::parse_volatility;
+use pyo3::prelude::*;
+use pyo3::types::{PyCapsule, PyTuple};
+
+use crate::array::PyArrowArrayExportable;
+use crate::errors::{PyDataFusionResult, to_datafusion_err};
+use crate::expr::PyExpr;
+
+/// This struct holds the Python written function that is a
+/// ScalarUDF.
+#[derive(Debug)]
+struct PythonFunctionScalarUDF {
+    name: String,
+    func: Py<PyAny>,
+    signature: Signature,
+    return_field: FieldRef,
+}
+
+impl PythonFunctionScalarUDF {
+    fn new(
+        name: String,
+        func: Py<PyAny>,
+        input_fields: Vec<Field>,
+        return_field: Field,
+        volatility: Volatility,
+    ) -> Self {
+        let input_types = input_fields.iter().map(|f| f.data_type().clone()).collect();
+        let signature = Signature::exact(input_types, volatility);
+        Self {
+            name,
+            func,
+            signature,
+            return_field: Arc::new(return_field),
+        }
+    }
+}
+
+impl Eq for PythonFunctionScalarUDF {}
+impl PartialEq for PythonFunctionScalarUDF {
+    fn eq(&self, other: &Self) -> bool {
+        self.name == other.name
+            && self.signature == other.signature
+            && self.return_field == other.return_field
+            && Python::attach(|py| self.func.bind(py).eq(other.func.bind(py)).unwrap_or(false))
+    }
+}
+
+impl Hash for PythonFunctionScalarUDF {
+    fn hash<H: Hasher>(&self, state: &mut H) {
+        self.name.hash(state);
+        self.signature.hash(state);
+        self.return_field.hash(state);
+
+        Python::attach(|py| {
+            let py_hash = self.func.bind(py).hash().unwrap_or(0); // Handle unhashable objects
+
+            state.write_isize(py_hash);
+        });
+    }
+}
+
+impl ScalarUDFImpl for PythonFunctionScalarUDF {
+    fn as_any(&self) -> &dyn Any {
+        self
+    }
+
+    fn name(&self) -> &str {
+        &self.name
+    }
+
+    fn signature(&self) -> &Signature {
+        &self.signature
+    }
+
+    fn return_type(&self, _arg_types: &[DataType]) -> datafusion::common::Result<DataType> {
+        internal_err!(
+            "return_type should not be called when return_field_from_args is implemented."
+        )
+    }
+
+    fn return_field_from_args(
+        &self,
+        _args: ReturnFieldArgs,
+    ) -> datafusion::common::Result<FieldRef> {
+        Ok(Arc::clone(&self.return_field))
+    }
+
+    fn invoke_with_args(
+        &self,
+        args: ScalarFunctionArgs,
+    ) -> datafusion::common::Result<ColumnarValue> {
+        let num_rows = args.number_rows;
+        Python::attach(|py| {
+            // 1. cast args to Pyarrow arrays
+            let py_args = args
+                .args
+                .into_iter()
+                .zip(args.arg_fields)
+                .map(|(arg, field)| {
+                    let array = arg.to_array(num_rows)?;
+                    PyArrowArrayExportable::new(array, field)
+                        .to_pyarrow(py)
+                        .map_err(to_datafusion_err)
+                })
+                .collect::<Result<Vec<_>, _>>()?;
+            let py_args = PyTuple::new(py, py_args).map_err(to_datafusion_err)?;
+
+            // 2. call function
+            let value = self
+                .func
+                .call(py, py_args, None)
+                .map_err(|e| DataFusionError::Execution(format!("{e:?}")))?;
+
+            // 3. cast to arrow::array::Array
+            let array_data = ArrayData::from_pyarrow_bound(value.bind(py))
+                .map_err(|e| DataFusionError::Execution(format!("{e:?}")))?;
+            Ok(ColumnarValue::Array(make_array(array_data)))
+        })
+    }
+}
+
+/// Represents a PyScalarUDF
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "ScalarUDF",
+    module = "datafusion",
+    subclass
+)]
+#[derive(Debug, Clone)]
+pub struct PyScalarUDF {
+    pub(crate) function: ScalarUDF,
+}
+
+#[pymethods]
+impl PyScalarUDF {
+    #[new]
+    #[pyo3(signature=(name, func, input_types, return_type, volatility))]
+    fn new(
+        name: String,
+        func: Py<PyAny>,
+        input_types: PyArrowType<Vec<Field>>,
+        return_type: PyArrowType<Field>,
+        volatility: &str,
+    ) -> PyResult<Self> {
+        let py_function = PythonFunctionScalarUDF::new(
+            name,
+            func,
+            input_types.0,
+            return_type.0,
+            parse_volatility(volatility)?,
+        );
+        let function = ScalarUDF::new_from_impl(py_function);
+
+        Ok(Self { function })
+    }
+
+    #[staticmethod]
+    pub fn from_pycapsule(func: Bound<'_, PyAny>) -> PyDataFusionResult<Self> {
+        if func.hasattr("__datafusion_scalar_udf__")? {
+            let capsule = func.getattr("__datafusion_scalar_udf__")?.call0()?;
+            let capsule = capsule.cast::<PyCapsule>().map_err(to_datafusion_err)?;
+            let data: NonNull<FFI_ScalarUDF> = capsule
+                .pointer_checked(Some(c"datafusion_scalar_udf"))?
+                .cast();
+            let udf = unsafe { data.as_ref() };
+            let udf: Arc<dyn ScalarUDFImpl> = udf.into();
+
+            Ok(Self {
+                function: ScalarUDF::new_from_shared_impl(udf),
+            })
+        } else {
+            Err(crate::errors::PyDataFusionError::Common(
+                "__datafusion_scalar_udf__ does not exist on ScalarUDF object.".to_string(),
+            ))
+        }
+    }
+
+    /// Creates a new PyExpr with the call of the UDF
+    #[pyo3(signature = (*args))]
+    fn __call__(&self, args: Vec<PyExpr>) -> PyResult<PyExpr> {
+        let args = args.iter().map(|e| e.expr.clone()).collect();
+        Ok(self.function.call(args).into())
+    }
+
+    fn __repr__(&self) -> PyResult<String> {
+        Ok(format!("ScalarUDF({})", self.function.name()))
+    }
+}
diff --git a/crates/core/src/udtf.rs b/crates/core/src/udtf.rs
new file mode 100644
index 000000000..9371732dc
--- /dev/null
+++ b/crates/core/src/udtf.rs
@@ -0,0 +1,134 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
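The `invoke_with_args` path above converts each argument to a PyArrow array, calls the Python callable positionally, and converts the single returned array back into an Arrow `ColumnarValue`. The contract for the Python side is therefore element-wise: one array per argument in, one array of equal length out. A dependency-free sketch of that shape (plain lists stand in for the PyArrow arrays the real binding passes; `invoke_scalar_udf` is a hypothetical stand-in for the Rust dispatch, not part of the API):

```python
def add_one(values):
    # Element-wise scalar UDF body: same-length output for any input batch.
    return [v + 1 for v in values]


def invoke_scalar_udf(func, *columns):
    """Mimic the Rust dispatch: call func with one 'array' per argument
    and enforce the one-output-per-row contract it must uphold."""
    num_rows = len(columns[0])
    result = func(*columns)
    if len(result) != num_rows:
        raise ValueError("scalar UDF must return one value per input row")
    return result


print(invoke_scalar_udf(add_one, [1, 2, 3]))  # [2, 3, 4]
```

A body that returns a different length would surface as an execution error when the resulting array is reassembled into a record batch.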
+
+use std::ptr::NonNull;
+use std::sync::Arc;
+
+use datafusion::catalog::{TableFunctionImpl, TableProvider};
+use datafusion::error::Result as DataFusionResult;
+use datafusion::logical_expr::Expr;
+use datafusion_ffi::udtf::FFI_TableFunction;
+use pyo3::IntoPyObjectExt;
+use pyo3::exceptions::{PyImportError, PyTypeError};
+use pyo3::prelude::*;
+use pyo3::types::{PyCapsule, PyTuple, PyType};
+
+use crate::context::PySessionContext;
+use crate::errors::{py_datafusion_err, to_datafusion_err};
+use crate::expr::PyExpr;
+use crate::table::PyTable;
+
+/// Represents a user defined table function
+#[pyclass(from_py_object, frozen, name = "TableFunction", module = "datafusion")]
+#[derive(Debug, Clone)]
+pub struct PyTableFunction {
+    pub(crate) name: String,
+    pub(crate) inner: PyTableFunctionInner,
+}
+
+// TODO: Implement pure python based user defined table functions
+#[derive(Debug, Clone)]
+pub(crate) enum PyTableFunctionInner {
+    PythonFunction(Arc<Py<PyAny>>),
+    FFIFunction(Arc<dyn TableFunctionImpl>),
+}
+
+#[pymethods]
+impl PyTableFunction {
+    #[new]
+    #[pyo3(signature=(name, func, session))]
+    pub fn new(
+        name: &str,
+        func: Bound<'_, PyAny>,
+        session: Option<Bound<'_, PyAny>>,
+    ) -> PyResult<Self> {
+        let inner = if func.hasattr("__datafusion_table_function__")? {
+            let py = func.py();
+            let session = match session {
+                Some(session) => session,
+                None => PySessionContext::global_ctx()?.into_bound_py_any(py)?,
+            };
+            let capsule = func
+                .getattr("__datafusion_table_function__")?
+                .call1((session,)).map_err(|err| {
+                    if err.get_type(py).is(PyType::new::<PyTypeError>(py)) {
+                        PyImportError::new_err("Incompatible libraries. DataFusion 52.0.0 introduced an incompatible signature change for table functions. Either downgrade DataFusion or upgrade your function library.")
+                    } else {
+                        err
+                    }
+                })?;
+            let capsule = capsule.cast::<PyCapsule>()?;
+            let data: NonNull<FFI_TableFunction> = capsule
+                .pointer_checked(Some(c"datafusion_table_function"))?
+                .cast();
+            let ffi_func = unsafe { data.as_ref() };
+            let foreign_func: Arc<dyn TableFunctionImpl> = ffi_func.to_owned().into();
+
+            PyTableFunctionInner::FFIFunction(foreign_func)
+        } else {
+            let py_obj = Arc::new(func.unbind());
+            PyTableFunctionInner::PythonFunction(py_obj)
+        };
+
+        Ok(Self {
+            name: name.to_string(),
+            inner,
+        })
+    }
+
+    #[pyo3(signature = (*args))]
+    pub fn __call__(&self, args: Vec<PyExpr>) -> PyResult<PyTable> {
+        let args: Vec<Expr> = args.iter().map(|e| e.expr.clone()).collect();
+        let table_provider = self.call(&args).map_err(py_datafusion_err)?;
+
+        Ok(PyTable::from(table_provider))
+    }
+
+    fn __repr__(&self) -> PyResult<String> {
+        Ok(format!("TableUDF({})", self.name))
+    }
+}
+
+#[allow(clippy::result_large_err)]
+fn call_python_table_function(
+    func: &Arc<Py<PyAny>>,
+    args: &[Expr],
+) -> DataFusionResult<Arc<dyn TableProvider>> {
+    let args = args
+        .iter()
+        .map(|arg| PyExpr::from(arg.clone()))
+        .collect::<Vec<_>>();
+
+    Python::attach(|py| {
+        let py_args = PyTuple::new(py, args)?;
+        let provider_obj = func.call1(py, py_args)?;
+        let provider = provider_obj.bind(py).clone();
+
+        Ok::<Arc<dyn TableProvider>, PyErr>(PyTable::new(provider, None)?.table)
+    })
+    .map_err(to_datafusion_err)
+}
+
+impl TableFunctionImpl for PyTableFunction {
+    fn call(&self, args: &[Expr]) -> DataFusionResult<Arc<dyn TableProvider>> {
+        match &self.inner {
+            PyTableFunctionInner::FFIFunction(func) => func.call(args),
+            PyTableFunctionInner::PythonFunction(obj) => call_python_table_function(obj, args),
+        }
+    }
+}
diff --git a/crates/core/src/udwf.rs b/crates/core/src/udwf.rs
new file mode 100644
index 000000000..1d3608ada
--- /dev/null
+++ b/crates/core/src/udwf.rs
@@ -0,0 +1,345 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.
The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::any::Any;
+use std::ops::Range;
+use std::ptr::NonNull;
+use std::sync::Arc;
+
+use arrow::array::{Array, ArrayData, ArrayRef, make_array};
+use datafusion::arrow::datatypes::DataType;
+use datafusion::arrow::pyarrow::{FromPyArrow, PyArrowType, ToPyArrow};
+use datafusion::error::{DataFusionError, Result};
+use datafusion::logical_expr::function::{PartitionEvaluatorArgs, WindowUDFFieldArgs};
+use datafusion::logical_expr::ptr_eq::PtrEq;
+use datafusion::logical_expr::window_state::WindowAggState;
+use datafusion::logical_expr::{
+    PartitionEvaluator, PartitionEvaluatorFactory, Signature, Volatility, WindowUDF, WindowUDFImpl,
+};
+use datafusion::scalar::ScalarValue;
+use datafusion_ffi::udwf::FFI_WindowUDF;
+use datafusion_python_util::parse_volatility;
+use pyo3::exceptions::PyValueError;
+use pyo3::prelude::*;
+use pyo3::types::{PyCapsule, PyList, PyTuple};
+
+use crate::common::data_type::PyScalarValue;
+use crate::errors::{PyDataFusionResult, to_datafusion_err};
+use crate::expr::PyExpr;
+
+#[derive(Debug)]
+struct RustPartitionEvaluator {
+    evaluator: Py<PyAny>,
+}
+
+impl RustPartitionEvaluator {
+    fn new(evaluator: Py<PyAny>) -> Self {
+        Self { evaluator }
+    }
+}
+
+impl PartitionEvaluator for RustPartitionEvaluator {
+    fn memoize(&mut self, _state: &mut WindowAggState) -> Result<()> {
+        Python::attach(|py| self.evaluator.bind(py).call_method0("memoize").map(|_| ()))
+            .map_err(|e| DataFusionError::Execution(format!("{e}")))
+    }
+
+    fn get_range(&self, idx: usize, n_rows: usize) -> Result<Range<usize>> {
+        Python::attach(|py| {
+            let py_args = vec![idx.into_pyobject(py)?, n_rows.into_pyobject(py)?];
+            let py_args = PyTuple::new(py, py_args)?;
+
+            self.evaluator
+                .bind(py)
+                .call_method1("get_range", py_args)
+                .and_then(|v| {
+                    let tuple: Bound<'_, PyTuple> = v.extract()?;
+                    if tuple.len() != 2 {
+                        return Err(PyValueError::new_err(format!(
+                            "Expected get_range to return tuple of length 2. Received length {}",
+                            tuple.len()
+                        )));
+                    }
+
+                    let start: usize = tuple.get_item(0).unwrap().extract()?;
+                    let end: usize = tuple.get_item(1).unwrap().extract()?;
+
+                    Ok(Range { start, end })
+                })
+        })
+        .map_err(|e| DataFusionError::Execution(format!("{e}")))
+    }
+
+    fn is_causal(&self) -> bool {
+        Python::attach(|py| {
+            self.evaluator
+                .bind(py)
+                .call_method0("is_causal")
+                .and_then(|v| v.extract())
+                .unwrap_or(false)
+        })
+    }
+
+    fn evaluate_all(&mut self, values: &[ArrayRef], num_rows: usize) -> Result<ArrayRef> {
+        Python::attach(|py| {
+            let py_values = PyList::new(
+                py,
+                values
+                    .iter()
+                    .map(|arg| arg.into_data().to_pyarrow(py).unwrap()),
+            )?;
+            let py_num_rows = num_rows.into_pyobject(py)?;
+            let py_args = PyTuple::new(py, vec![py_values.as_any(), &py_num_rows])?;
+
+            self.evaluator
+                .bind(py)
+                .call_method1("evaluate_all", py_args)
+                .map(|v| {
+                    let array_data = ArrayData::from_pyarrow_bound(&v).unwrap();
+                    make_array(array_data)
+                })
+        })
+        .map_err(to_datafusion_err)
+    }
+
+    fn evaluate(&mut self, values: &[ArrayRef], range: &Range<usize>) -> Result<ScalarValue> {
+        Python::attach(|py| {
+            let py_values = PyList::new(
+                py,
+                values
+                    .iter()
+                    .map(|arg| arg.into_data().to_pyarrow(py).unwrap()),
+            )?;
+            let range_tuple = PyTuple::new(py, vec![range.start, range.end])?;
+            let py_args = PyTuple::new(py, vec![py_values.as_any(), range_tuple.as_any()])?;
+
+            self.evaluator
+                .bind(py)
+                .call_method1("evaluate", py_args)
+                .and_then(|v| v.extract::<PyScalarValue>())
+                .map(|v| v.0)
+        })
+        .map_err(to_datafusion_err)
+    }
+
+    fn evaluate_all_with_rank(
+        &self,
+        num_rows: usize,
+        ranks_in_partition: &[Range<usize>],
+    ) -> Result<ArrayRef> {
+        Python::attach(|py| {
+            let ranks = ranks_in_partition
+                .iter()
+                .map(|r| PyTuple::new(py, vec![r.start, r.end]))
+                .collect::<PyResult<Vec<_>>>()?;
+
+            // 1. cast args to Pyarrow array
+            let py_args = vec![
+                num_rows.into_pyobject(py)?.into_any(),
+                PyList::new(py, ranks)?.into_any(),
+            ];
+
+            let py_args = PyTuple::new(py, py_args)?;
+
+            // 2. call function
+            self.evaluator
+                .bind(py)
+                .call_method1("evaluate_all_with_rank", py_args)
+                .map(|v| {
+                    let array_data = ArrayData::from_pyarrow_bound(&v).unwrap();
+                    make_array(array_data)
+                })
+        })
+        .map_err(to_datafusion_err)
+    }
+
+    fn supports_bounded_execution(&self) -> bool {
+        Python::attach(|py| {
+            self.evaluator
+                .bind(py)
+                .call_method0("supports_bounded_execution")
+                .and_then(|v| v.extract())
+                .unwrap_or(false)
+        })
+    }
+
+    fn uses_window_frame(&self) -> bool {
+        Python::attach(|py| {
+            self.evaluator
+                .bind(py)
+                .call_method0("uses_window_frame")
+                .and_then(|v| v.extract())
+                .unwrap_or(false)
+        })
+    }
+
+    fn include_rank(&self) -> bool {
+        Python::attach(|py| {
+            self.evaluator
+                .bind(py)
+                .call_method0("include_rank")
+                .and_then(|v| v.extract())
+                .unwrap_or(false)
+        })
+    }
+}
+
+pub fn to_rust_partition_evaluator(evaluator: Py<PyAny>) -> PartitionEvaluatorFactory {
+    Arc::new(move || -> Result<Box<dyn PartitionEvaluator>> {
+        let evaluator = Python::attach(|py| {
+            evaluator
+                .call0(py)
+                .map_err(|e| DataFusionError::Execution(e.to_string()))
+        })?;
+        Ok(Box::new(RustPartitionEvaluator::new(evaluator)))
+    })
+}
+
+/// Represents a WindowUDF
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "WindowUDF",
+    module = "datafusion",
+    subclass
+)]
+#[derive(Debug, Clone)]
+pub struct PyWindowUDF {
+    pub(crate) function: WindowUDF,
+}
+
+#[pymethods]
+impl PyWindowUDF {
+    #[new]
+    #[pyo3(signature=(name, evaluator, input_types, return_type, volatility))]
+    fn new(
+        name: &str,
+        evaluator: Py<PyAny>,
+        input_types: Vec<PyArrowType<DataType>>,
+        return_type: PyArrowType<DataType>,
+        volatility: &str,
+    ) -> PyResult<Self> {
+        let return_type = return_type.0;
+        let input_types = input_types.into_iter().map(|t| t.0).collect();
+
+        let function = WindowUDF::from(MultiColumnWindowUDF::new(
+            name,
+            input_types,
+            return_type,
+            parse_volatility(volatility)?,
+            to_rust_partition_evaluator(evaluator),
+        ));
+        Ok(Self { function })
+    }
+
+    /// Creates a new PyExpr with the call of the UDF
+    #[pyo3(signature = (*args))]
+    fn __call__(&self, args: Vec<PyExpr>) -> PyResult<PyExpr> {
+        let args = args.iter().map(|e| e.expr.clone()).collect();
+        Ok(self.function.call(args).into())
+    }
+
+    #[staticmethod]
+    pub fn from_pycapsule(func: Bound<'_, PyAny>) -> PyDataFusionResult<Self> {
+        let capsule = if func.hasattr("__datafusion_window_udf__")? {
+            func.getattr("__datafusion_window_udf__")?.call0()?
+        } else {
+            func
+        };
+
+        let capsule = capsule.cast::<PyCapsule>().map_err(to_datafusion_err)?;
+        let data: NonNull<FFI_WindowUDF> = capsule
+            .pointer_checked(Some(c"datafusion_window_udf"))?
+            .cast();
+        let udwf = unsafe { data.as_ref() };
+        let udwf: Arc<dyn WindowUDFImpl> = udwf.into();
+
+        Ok(Self {
+            function: WindowUDF::new_from_shared_impl(udwf),
+        })
+    }
+
+    fn __repr__(&self) -> PyResult<String> {
+        Ok(format!("WindowUDF({})", self.function.name()))
+    }
+}
+
+#[derive(Hash, Eq, PartialEq)]
+pub struct MultiColumnWindowUDF {
+    name: String,
+    signature: Signature,
+    return_type: DataType,
+    partition_evaluator_factory: PtrEq<PartitionEvaluatorFactory>,
+}
+
+impl std::fmt::Debug for MultiColumnWindowUDF {
+    fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
+        f.debug_struct("WindowUDF")
+            .field("name", &self.name)
+            .field("signature", &self.signature)
+            .field("return_type", &"")
+            .field("partition_evaluator_factory", &"")
+            .finish()
+    }
+}
+
+impl MultiColumnWindowUDF {
+    pub fn new(
+        name: impl Into<String>,
+        input_types: Vec<DataType>,
+        return_type: DataType,
+        volatility: Volatility,
+        partition_evaluator_factory: PartitionEvaluatorFactory,
+    ) -> Self {
+        let name = name.into();
+        let signature = Signature::exact(input_types, volatility);
+        Self {
+            name,
+            signature,
+            return_type,
+            partition_evaluator_factory: partition_evaluator_factory.into(),
+        }
+    }
+}
+
+impl WindowUDFImpl for MultiColumnWindowUDF {
+    fn as_any(&self) -> &dyn Any {
+        self
+    }
+
+    fn name(&self) -> &str {
+        &self.name
+    }
+
+    fn signature(&self) -> &Signature {
+        &self.signature
+    }
+
+    fn field(&self, field_args: WindowUDFFieldArgs) -> Result<FieldRef> {
+        // TODO: Should nullable always be `true`?
+        Ok(arrow::datatypes::Field::new(field_args.name(), self.return_type.clone(), true).into())
+    }
+
+    // TODO: Enable passing partition_evaluator_args to python?
+    fn partition_evaluator(
+        &self,
+        _partition_evaluator_args: PartitionEvaluatorArgs,
+    ) -> Result<Box<dyn PartitionEvaluator>> {
+        (self.partition_evaluator_factory)()
+    }
+}
diff --git a/crates/core/src/unparser/dialect.rs b/crates/core/src/unparser/dialect.rs
new file mode 100644
index 000000000..52a2da00b
--- /dev/null
+++ b/crates/core/src/unparser/dialect.rs
@@ -0,0 +1,69 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
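`RustPartitionEvaluator` above probes the Python object with `uses_window_frame`, `supports_bounded_execution`, and `include_rank`, then dispatches either to `evaluate_all` (whole partition at once) or `evaluate` (one frame at a time). A pure-Python sketch of an evaluator computing a running sum over a partition (illustrative only; lists stand in for the PyArrow arrays the real binding passes, and registration goes through the Python `udwf` API):

```python
class CumSumEvaluator:
    """Illustrative partition evaluator: running sum over one column."""

    def uses_window_frame(self) -> bool:
        # False: the engine calls evaluate_all once per partition.
        return False

    def supports_bounded_execution(self) -> bool:
        return False

    def evaluate_all(self, values, num_rows):
        # values is a list of input columns (pyarrow arrays in the real API).
        column = values[0]
        out, running = [], 0
        for i in range(num_rows):
            running += column[i]
            out.append(running)
        return out


ev = CumSumEvaluator()
print(ev.evaluate_all([[1, 2, 3, 4]], 4))  # [1, 3, 6, 10]
```

If `uses_window_frame` returned `True`, the engine would instead call `evaluate(values, (start, end))` once per row, mirroring the `evaluate` method in the Rust trait above.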
+
+use std::sync::Arc;
+
+use datafusion::sql::unparser::dialect::{
+    DefaultDialect, Dialect, DuckDBDialect, MySqlDialect, PostgreSqlDialect, SqliteDialect,
+};
+use pyo3::prelude::*;
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "Dialect",
+    module = "datafusion.unparser",
+    subclass
+)]
+#[derive(Clone)]
+pub struct PyDialect {
+    pub dialect: Arc<dyn Dialect>,
+}
+
+#[pymethods]
+impl PyDialect {
+    #[staticmethod]
+    pub fn default() -> Self {
+        Self {
+            dialect: Arc::new(DefaultDialect {}),
+        }
+    }
+    #[staticmethod]
+    pub fn postgres() -> Self {
+        Self {
+            dialect: Arc::new(PostgreSqlDialect {}),
+        }
+    }
+    #[staticmethod]
+    pub fn mysql() -> Self {
+        Self {
+            dialect: Arc::new(MySqlDialect {}),
+        }
+    }
+    #[staticmethod]
+    pub fn sqlite() -> Self {
+        Self {
+            dialect: Arc::new(SqliteDialect {}),
+        }
+    }
+    #[staticmethod]
+    pub fn duckdb() -> Self {
+        Self {
+            dialect: Arc::new(DuckDBDialect::new()),
+        }
+    }
+}
diff --git a/crates/core/src/unparser/mod.rs b/crates/core/src/unparser/mod.rs
new file mode 100644
index 000000000..5142b918e
--- /dev/null
+++ b/crates/core/src/unparser/mod.rs
@@ -0,0 +1,74 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+mod dialect;
+
+use std::sync::Arc;
+
+use datafusion::sql::unparser::Unparser;
+use datafusion::sql::unparser::dialect::Dialect;
+use dialect::PyDialect;
+use pyo3::exceptions::PyValueError;
+use pyo3::prelude::*;
+
+use crate::sql::logical::PyLogicalPlan;
+
+#[pyclass(
+    from_py_object,
+    frozen,
+    name = "Unparser",
+    module = "datafusion.unparser",
+    subclass
+)]
+#[derive(Clone)]
+pub struct PyUnparser {
+    dialect: Arc<dyn Dialect>,
+    pretty: bool,
+}
+
+#[pymethods]
+impl PyUnparser {
+    #[new]
+    pub fn new(dialect: PyDialect) -> Self {
+        Self {
+            dialect: dialect.dialect.clone(),
+            pretty: false,
+        }
+    }
+
+    pub fn plan_to_sql(&self, plan: &PyLogicalPlan) -> PyResult<String> {
+        let mut unparser = Unparser::new(self.dialect.as_ref());
+        unparser = unparser.with_pretty(self.pretty);
+        let sql = unparser
+            .plan_to_sql(&plan.plan())
+            .map_err(|e| PyValueError::new_err(e.to_string()))?;
+        Ok(sql.to_string())
+    }
+
+    pub fn with_pretty(&self, pretty: bool) -> Self {
+        Self {
+            dialect: self.dialect.clone(),
+            pretty,
+        }
+    }
+}
+
+pub(crate) fn init_module(m: &Bound<'_, PyModule>) -> PyResult<()> {
+    m.add_class::<PyUnparser>()?;
+    m.add_class::<PyDialect>()?;
+    Ok(())
+}
diff --git a/crates/util/Cargo.toml b/crates/util/Cargo.toml
new file mode 100644
index 000000000..00d5946a5
--- /dev/null
+++ b/crates/util/Cargo.toml
@@ -0,0 +1,34 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+[package]
+name = "datafusion-python-util"
+version.workspace = true
+edition.workspace = true
+rust-version.workspace = true
+license.workspace = true
+description.workspace = true
+homepage.workspace = true
+repository.workspace = true
+
+[dependencies]
+tokio = { workspace = true, features = ["macros", "rt", "rt-multi-thread"] }
+pyo3 = { workspace = true }
+datafusion = { workspace = true }
+datafusion-ffi = { workspace = true }
+arrow = { workspace = true }
+prost = { workspace = true }
diff --git a/crates/util/src/errors.rs b/crates/util/src/errors.rs
new file mode 100644
index 000000000..0d25c8847
--- /dev/null
+++ b/crates/util/src/errors.rs
@@ -0,0 +1,108 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use core::fmt;
+use std::error::Error;
+use std::fmt::Debug;
+
+use datafusion::arrow::error::ArrowError;
+use datafusion::error::DataFusionError as InnerDataFusionError;
+use prost::EncodeError;
+use pyo3::PyErr;
+use pyo3::exceptions::{PyException, PyValueError};
+
+pub type PyDataFusionResult<T> = std::result::Result<T, PyDataFusionError>;
+
+#[derive(Debug)]
+pub enum PyDataFusionError {
+    ExecutionError(Box<InnerDataFusionError>),
+    ArrowError(ArrowError),
+    Common(String),
+    PythonError(PyErr),
+    EncodeError(EncodeError),
+}
+
+impl fmt::Display for PyDataFusionError {
+    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
+        match self {
+            PyDataFusionError::ExecutionError(e) => write!(f, "DataFusion error: {e}"),
+            PyDataFusionError::ArrowError(e) => write!(f, "Arrow error: {e:?}"),
+            PyDataFusionError::PythonError(e) => write!(f, "Python error: {e:?}"),
+            PyDataFusionError::Common(e) => write!(f, "{e}"),
+            PyDataFusionError::EncodeError(e) => write!(f, "Failed to encode substrait plan: {e}"),
+        }
+    }
+}
+
+impl From<ArrowError> for PyDataFusionError {
+    fn from(err: ArrowError) -> PyDataFusionError {
+        PyDataFusionError::ArrowError(err)
+    }
+}
+
+impl From<InnerDataFusionError> for PyDataFusionError {
+    fn from(err: InnerDataFusionError) -> PyDataFusionError {
+        PyDataFusionError::ExecutionError(Box::new(err))
+    }
+}
+
+impl From<PyErr> for PyDataFusionError {
+    fn from(err: PyErr) -> PyDataFusionError {
+        PyDataFusionError::PythonError(err)
+    }
+}
+
+impl From<PyDataFusionError> for PyErr {
+    fn from(err: PyDataFusionError) -> PyErr {
+        match err {
+            PyDataFusionError::PythonError(py_err) => py_err,
+            _ => PyException::new_err(err.to_string()),
+        }
+    }
+}
+
+impl Error for PyDataFusionError {}
+
+pub fn py_type_err(e: impl Debug) -> PyErr {
+    PyErr::new::<PyException, _>(format!("{e:?}"))
+}
+
+pub fn py_runtime_err(e: impl Debug) -> PyErr {
+    PyErr::new::<PyException, _>(format!("{e:?}"))
+}
+
+pub fn py_datafusion_err(e: impl Debug) -> PyErr {
+    PyErr::new::<PyException, _>(format!("{e:?}"))
+}
+
+pub fn py_unsupported_variant_err(e: impl Debug) -> PyErr {
+    PyErr::new::<PyException, _>(format!("{e:?}"))
+}
+
+pub fn to_datafusion_err(e: impl Debug) -> InnerDataFusionError {
+    InnerDataFusionError::Execution(format!("{e:?}"))
+}
+
+pub fn from_datafusion_error(err: InnerDataFusionError) -> PyErr {
+    match err {
+        InnerDataFusionError::External(boxed) => match boxed.downcast::<PyErr>() {
+            Ok(py_err) => *py_err,
+            Err(original_boxed) => PyValueError::new_err(format!("{original_boxed}")),
+        },
+        _ => PyValueError::new_err(format!("{err}")),
+    }
+}
diff --git a/crates/util/src/lib.rs b/crates/util/src/lib.rs
new file mode 100644
index 000000000..5b1c89936
--- /dev/null
+++ b/crates/util/src/lib.rs
@@ -0,0 +1,226 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::future::Future;
+use std::ptr::NonNull;
+use std::sync::{Arc, OnceLock};
+use std::time::Duration;
+
+use datafusion::datasource::TableProvider;
+use datafusion::execution::context::SessionContext;
+use datafusion::logical_expr::Volatility;
+use datafusion_ffi::proto::logical_extension_codec::FFI_LogicalExtensionCodec;
+use datafusion_ffi::table_provider::FFI_TableProvider;
+use pyo3::exceptions::{PyImportError, PyTypeError, PyValueError};
+use pyo3::prelude::*;
+use pyo3::types::{PyCapsule, PyType};
+use tokio::runtime::Runtime;
+use tokio::task::JoinHandle;
+use tokio::time::sleep;
+
+use crate::errors::{PyDataFusionError, PyDataFusionResult, to_datafusion_err};
+
+pub mod errors;
+
+/// Utility to get the Tokio Runtime from Python
+#[inline]
+pub fn get_tokio_runtime() -> &'static Runtime {
+    // NOTE: Other pyo3 python libraries have had issues with using tokio
+    // behind a forking app-server like `gunicorn`
+    // If we run into that problem, in the future we can look to `delta-rs`
+    // which adds a check in that disallows calls from a forked process
+    // https://github.com/delta-io/delta-rs/blob/87010461cfe01563d91a4b9cd6fa468e2ad5f283/python/src/utils.rs#L10-L31
+    static RUNTIME: OnceLock<Runtime> = OnceLock::new();
+    RUNTIME.get_or_init(|| Runtime::new().unwrap())
+}
+
+#[inline]
+pub fn is_ipython_env(py: Python) -> &'static bool {
+    static IS_IPYTHON_ENV: OnceLock<bool> = OnceLock::new();
+    IS_IPYTHON_ENV.get_or_init(|| {
+        py.import("IPython")
+            .and_then(|ipython| ipython.call_method0("get_ipython"))
+            .map(|ipython| !ipython.is_none())
+            .unwrap_or(false)
+    })
+}
+
+/// Utility to get the global DataFusion [`SessionContext`]
+#[inline]
+pub fn get_global_ctx() -> &'static Arc<SessionContext> {
+    static CTX: OnceLock<Arc<SessionContext>> = OnceLock::new();
+    CTX.get_or_init(|| Arc::new(SessionContext::new()))
+}
+
+/// Utility to collect Rust futures with the GIL released while responding to
+/// Python interrupts such as ``KeyboardInterrupt``. 
+/// If a signal is
+/// received while the future is running, the future is aborted and the
+/// corresponding Python exception is raised.
+pub fn wait_for_future<F>(py: Python, fut: F) -> PyResult<F::Output>
+where
+    F: Future + Send,
+    F::Output: Send,
+{
+    let runtime: &Runtime = get_tokio_runtime();
+    const INTERVAL_CHECK_SIGNALS: Duration = Duration::from_millis(1_000);
+
+    // Some fast running processes that generate many `wait_for_future` calls like
+    // PartitionedDataFrameStreamReader::next require checking for interrupts early
+    py.run(cr"pass", None, None)?;
+    py.check_signals()?;
+
+    py.detach(|| {
+        runtime.block_on(async {
+            tokio::pin!(fut);
+            loop {
+                tokio::select! {
+                    res = &mut fut => break Ok(res),
+                    _ = sleep(INTERVAL_CHECK_SIGNALS) => {
+                        Python::attach(|py| {
+                            // Execute a no-op Python statement to trigger signal processing.
+                            // This is necessary because py.check_signals() alone doesn't
+                            // actually check for signals - it only raises an exception if
+                            // a signal was already set during a previous Python API call.
+                            // Running even trivial Python code forces the interpreter to
+                            // process any pending signals (like KeyboardInterrupt).
+                            py.run(cr"pass", None, None)?;
+                            py.check_signals()
+                        })?;
+                    }
+                }
+            }
+        })
+    })
+}
+
+/// Spawn a [`Future`] on the Tokio runtime and wait for completion
+/// while respecting Python signal handling.
+pub fn spawn_future<F, T>(py: Python, fut: F) -> PyDataFusionResult<T>
+where
+    F: Future<Output = datafusion::common::Result<T>> + Send + 'static,
+    T: Send + 'static,
+{
+    let rt = get_tokio_runtime();
+    let handle: JoinHandle<datafusion::common::Result<T>> = rt.spawn(fut);
+    // Wait for the join handle while respecting Python signal handling. 
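+    //
+    // Illustrative (hypothetical) call site -- `df` and `collect` here stand
+    // in for any DataFusion async API returning `datafusion::common::Result<T>`,
+    // e.g. from a `#[pymethods]` wrapper; they are not names defined in this
+    // module:
+    //
+    //     let batches = spawn_future(py, async move { df.collect().await })?;
+    //
+    // Spawning onto the runtime first (rather than handing `fut` directly to
+    // `wait_for_future`) lets the `'static` future make progress on the
+    // runtime's worker threads while we poll for signals below, and surfaces
+    // task panics as `JoinError`s we can convert.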
+    // We handle errors in two steps so `?` maps the error types correctly:
+    // 1) convert any Python-related error from `wait_for_future` into `PyDataFusionError`
+    // 2) convert any DataFusion error (inner result) into `PyDataFusionError`
+    let inner_result = wait_for_future(py, async {
+        // handle.await yields `Result<datafusion::common::Result<T>, JoinError>`
+        // map JoinError into a DataFusion error so the async block returns
+        // `datafusion::common::Result<T>` (i.e. Result<T, DataFusionError>)
+        match handle.await {
+            Ok(inner) => inner,
+            Err(join_err) => Err(to_datafusion_err(join_err)),
+        }
+    })?; // converts PyErr -> PyDataFusionError
+
+    // `inner_result` is `datafusion::common::Result<T>`; use `?` to convert
+    // the inner DataFusion error into `PyDataFusionError` via `From` and
+    // return the inner `T` on success.
+    Ok(inner_result?)
+}
+
+pub fn parse_volatility(value: &str) -> PyDataFusionResult<Volatility> {
+    Ok(match value {
+        "immutable" => Volatility::Immutable,
+        "stable" => Volatility::Stable,
+        "volatile" => Volatility::Volatile,
+        value => {
+            return Err(PyDataFusionError::Common(format!(
+                "Unsupported volatility type: `{value}`, supported \
+                values are: immutable, stable and volatile."
+            )));
+        }
+    })
+}
+
+pub fn validate_pycapsule(capsule: &Bound<PyCapsule>, name: &str) -> PyResult<()> {
+    let capsule_name = capsule.name()?;
+    if capsule_name.is_none() {
+        return Err(PyValueError::new_err(format!(
+            "Expected {name} PyCapsule to have name set."
+        )));
+    }
+
+    let capsule_name = unsafe { capsule_name.unwrap().as_cstr().to_str()? };
+    if capsule_name != name {
+        return Err(PyValueError::new_err(format!(
+            "Expected name '{name}' in PyCapsule, instead got '{capsule_name}'"
+        )));
+    }
+
+    Ok(())
+}
+
+pub fn table_provider_from_pycapsule<'py>(
+    mut obj: Bound<'py, PyAny>,
+    session: Bound<'py, PyAny>,
+) -> PyResult<Option<Arc<dyn TableProvider>>> {
+    if obj.hasattr("__datafusion_table_provider__")? {
+        obj = obj
+            .getattr("__datafusion_table_provider__")? 
+            .call1((session,)).map_err(|err| {
+                let py = obj.py();
+                if err.get_type(py).is(PyType::new::<PyTypeError>(py)) {
+                    PyImportError::new_err("Incompatible libraries. DataFusion 52.0.0 introduced an incompatible signature change for table providers. Either downgrade DataFusion or upgrade your function library.")
+                } else {
+                    err
+                }
+            })?;
+    }
+
+    if let Ok(capsule) = obj.cast::<PyCapsule>() {
+        let data: NonNull<FFI_TableProvider> = capsule
+            .pointer_checked(Some(c"datafusion_table_provider"))?
+            .cast();
+        let provider = unsafe { data.as_ref() };
+        let provider: Arc<dyn TableProvider> = provider.into();
+
+        Ok(Some(provider))
+    } else {
+        Ok(None)
+    }
+}
+
+pub fn create_logical_extension_capsule<'py>(
+    py: Python<'py>,
+    codec: &FFI_LogicalExtensionCodec,
+) -> PyResult<Bound<'py, PyCapsule>> {
+    let name = cr"datafusion_logical_extension_codec".into();
+    let codec = codec.clone();
+
+    PyCapsule::new(py, codec, Some(name))
+}
+
+pub fn ffi_logical_codec_from_pycapsule(obj: Bound<PyAny>) -> PyResult<FFI_LogicalExtensionCodec> {
+    let attr_name = "__datafusion_logical_extension_codec__";
+    let capsule = if obj.hasattr(attr_name)? {
+        obj.getattr(attr_name)?.call0()?
+    } else {
+        obj
+    };
+
+    let capsule = capsule.cast::<PyCapsule>()?;
+    let data: NonNull<FFI_LogicalExtensionCodec> = capsule
+        .pointer_checked(Some(c"datafusion_logical_extension_codec"))?
+        .cast();
+    let codec = unsafe { data.as_ref() };
+
+    Ok(codec.clone())
+}
diff --git a/datafusion/__init__.py b/datafusion/__init__.py
deleted file mode 100644
index bb1beacd9..000000000
--- a/datafusion/__init__.py
+++ /dev/null
@@ -1,219 +0,0 @@
-# Licensed to the Apache Software Foundation (ASF) under one
-# or more contributor license agreements. See the NOTICE file
-# distributed with this work for additional information
-# regarding copyright ownership. The ASF licenses this file
-# to you under the Apache License, Version 2.0 (the
-# "License"); you may not use this file except in compliance
-# with the License. 
You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. See the License for the -# specific language governing permissions and limitations -# under the License. - -from abc import ABCMeta, abstractmethod -from typing import List - -try: - import importlib.metadata as importlib_metadata -except ImportError: - import importlib_metadata - -import pyarrow as pa - -from ._internal import ( - AggregateUDF, - Config, - DataFrame, - SessionContext, - SessionConfig, - RuntimeConfig, - ScalarUDF, -) - -from .common import ( - DFField, - DFSchema, -) - -from .expr import ( - Alias, - Analyze, - Expr, - Filter, - Limit, - Like, - ILike, - Projection, - SimilarTo, - ScalarVariable, - Sort, - TableScan, - GetIndexedField, - Not, - IsNotNull, - IsTrue, - IsFalse, - IsUnknown, - IsNotTrue, - IsNotFalse, - IsNotUnknown, - Negative, - ScalarFunction, - BuiltinScalarFunction, - InList, - Exists, - Subquery, - InSubquery, - ScalarSubquery, - GroupingSet, - Placeholder, - Case, - Cast, - TryCast, - Between, - Explain, - CreateMemoryTable, - SubqueryAlias, - Extension, - CreateView, - Distinct, - DropTable, - Repartition, - Partitioning, -) - -__version__ = importlib_metadata.version(__name__) - -__all__ = [ - "Config", - "DataFrame", - "SessionContext", - "SessionConfig", - "RuntimeConfig", - "Expr", - "AggregateUDF", - "ScalarUDF", - "column", - "literal", - "TableScan", - "Projection", - "DFSchema", - "DFField", - "Analyze", - "Sort", - "Limit", - "Filter", - "Like", - "ILike", - "SimilarTo", - "ScalarVariable", - "Alias", - "GetIndexedField", - "Not", - "IsNotNull", - "IsTrue", - "IsFalse", - "IsUnknown", - "IsNotTrue", - "IsNotFalse", - "IsNotUnknown", - "Negative", - "ScalarFunction", - "BuiltinScalarFunction", - "InList", - 
"Exists", - "Subquery", - "InSubquery", - "ScalarSubquery", - "GroupingSet", - "Placeholder", - "Case", - "Cast", - "TryCast", - "Between", - "Explain", - "SubqueryAlias", - "Extension", - "CreateMemoryTable", - "CreateView", - "Distinct", - "DropTable", - "Repartition", - "Partitioning", -] - - -class Accumulator(metaclass=ABCMeta): - @abstractmethod - def state(self) -> List[pa.Scalar]: - pass - - @abstractmethod - def update(self, values: pa.Array) -> None: - pass - - @abstractmethod - def merge(self, states: pa.Array) -> None: - pass - - @abstractmethod - def evaluate(self) -> pa.Scalar: - pass - - -def column(value): - return Expr.column(value) - - -col = column - - -def literal(value): - if not isinstance(value, pa.Scalar): - value = pa.scalar(value) - return Expr.literal(value) - - -lit = literal - - -def udf(func, input_types, return_type, volatility, name=None): - """ - Create a new User Defined Function - """ - if not callable(func): - raise TypeError("`func` argument must be callable") - if name is None: - name = func.__qualname__.lower() - return ScalarUDF( - name=name, - func=func, - input_types=input_types, - return_type=return_type, - volatility=volatility, - ) - - -def udaf(accum, input_type, return_type, state_type, volatility, name=None): - """ - Create a new User Defined Aggregate Function - """ - if not issubclass(accum, Accumulator): - raise TypeError( - "`accum` must implement the abstract base class Accumulator" - ) - if name is None: - name = accum.__qualname__.lower() - return AggregateUDF( - name=name, - accumulator=accum, - input_type=input_type, - return_type=return_type, - state_type=state_type, - volatility=volatility, - ) diff --git a/datafusion/cudf.py b/datafusion/cudf.py deleted file mode 100644 index d5f02156f..000000000 --- a/datafusion/cudf.py +++ /dev/null @@ -1,61 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. 
See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. See the License for the -# specific language governing permissions and limitations -# under the License. - -import cudf -import datafusion -from datafusion.expr import Projection, TableScan, Column - - -class SessionContext: - def __init__(self): - self.datafusion_ctx = datafusion.SessionContext() - self.parquet_tables = {} - - def register_parquet(self, name, path): - self.parquet_tables[name] = path - self.datafusion_ctx.register_parquet(name, path) - - def to_cudf_expr(self, expr): - # get Python wrapper for logical expression - expr = expr.to_variant() - - if isinstance(expr, Column): - return expr.name() - else: - raise Exception("unsupported expression: {}".format(expr)) - - def to_cudf_df(self, plan): - # recurse down first to translate inputs into pandas data frames - inputs = [self.to_cudf_df(x) for x in plan.inputs()] - - # get Python wrapper for logical operator node - node = plan.to_variant() - - if isinstance(node, Projection): - args = [self.to_cudf_expr(expr) for expr in node.projections()] - return inputs[0][args] - elif isinstance(node, TableScan): - return cudf.read_parquet(self.parquet_tables[node.table_name()]) - else: - raise Exception( - "unsupported logical operator: {}".format(type(node)) - ) - - def sql(self, sql): - datafusion_df = self.datafusion_ctx.sql(sql) - plan = datafusion_df.logical_plan() - return self.to_cudf_df(plan) diff --git a/datafusion/pandas.py 
b/datafusion/pandas.py deleted file mode 100644 index f8e56512b..000000000 --- a/datafusion/pandas.py +++ /dev/null @@ -1,61 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. See the License for the -# specific language governing permissions and limitations -# under the License. - -import pandas as pd -import datafusion -from datafusion.expr import Projection, TableScan, Column - - -class SessionContext: - def __init__(self): - self.datafusion_ctx = datafusion.SessionContext() - self.parquet_tables = {} - - def register_parquet(self, name, path): - self.parquet_tables[name] = path - self.datafusion_ctx.register_parquet(name, path) - - def to_pandas_expr(self, expr): - # get Python wrapper for logical expression - expr = expr.to_variant() - - if isinstance(expr, Column): - return expr.name() - else: - raise Exception("unsupported expression: {}".format(expr)) - - def to_pandas_df(self, plan): - # recurse down first to translate inputs into pandas data frames - inputs = [self.to_pandas_df(x) for x in plan.inputs()] - - # get Python wrapper for logical operator node - node = plan.to_variant() - - if isinstance(node, Projection): - args = [self.to_pandas_expr(expr) for expr in node.projections()] - return inputs[0][args] - elif isinstance(node, TableScan): - return pd.read_parquet(self.parquet_tables[node.table_name()]) - else: - 
raise Exception( - "unsupported logical operator: {}".format(type(node)) - ) - - def sql(self, sql): - datafusion_df = self.datafusion_ctx.sql(sql) - plan = datafusion_df.logical_plan() - return self.to_pandas_df(plan) diff --git a/datafusion/polars.py b/datafusion/polars.py deleted file mode 100644 index a1bafbef8..000000000 --- a/datafusion/polars.py +++ /dev/null @@ -1,84 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. See the License for the -# specific language governing permissions and limitations -# under the License. 
- -import polars -import datafusion -from datafusion.expr import Projection, TableScan, Aggregate -from datafusion.expr import Column, AggregateFunction - - -class SessionContext: - def __init__(self): - self.datafusion_ctx = datafusion.SessionContext() - self.parquet_tables = {} - - def register_parquet(self, name, path): - self.parquet_tables[name] = path - self.datafusion_ctx.register_parquet(name, path) - - def to_polars_expr(self, expr): - # get Python wrapper for logical expression - expr = expr.to_variant() - - if isinstance(expr, Column): - return polars.col(expr.name()) - else: - raise Exception("unsupported expression: {}".format(expr)) - - def to_polars_df(self, plan): - # recurse down first to translate inputs into Polars data frames - inputs = [self.to_polars_df(x) for x in plan.inputs()] - - # get Python wrapper for logical operator node - node = plan.to_variant() - - if isinstance(node, Projection): - args = [self.to_polars_expr(expr) for expr in node.projections()] - return inputs[0].select(*args) - elif isinstance(node, Aggregate): - groupby_expr = [ - self.to_polars_expr(expr) for expr in node.group_by_exprs() - ] - aggs = [] - for expr in node.aggregate_exprs(): - expr = expr.to_variant() - if isinstance(expr, AggregateFunction): - if expr.aggregate_type() == "COUNT": - aggs.append(polars.count().alias("{}".format(expr))) - else: - raise Exception( - "Unsupported aggregate function {}".format( - expr.aggregate_type() - ) - ) - else: - raise Exception( - "Unsupported aggregate function {}".format(expr) - ) - df = inputs[0].groupby(groupby_expr).agg(aggs) - return df - elif isinstance(node, TableScan): - return polars.read_parquet(self.parquet_tables[node.table_name()]) - else: - raise Exception( - "unsupported logical operator: {}".format(type(node)) - ) - - def sql(self, sql): - datafusion_df = self.datafusion_ctx.sql(sql) - plan = datafusion_df.logical_plan() - return self.to_polars_df(plan) diff --git a/datafusion/substrait.py 
b/datafusion/substrait.py deleted file mode 100644 index eff809a0c..000000000 --- a/datafusion/substrait.py +++ /dev/null @@ -1,23 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. See the License for the -# specific language governing permissions and limitations -# under the License. - - -from ._internal import substrait - - -def __getattr__(name): - return getattr(substrait, name) diff --git a/datafusion/tests/test_aggregation.py b/datafusion/tests/test_aggregation.py deleted file mode 100644 index 2c8c064b1..000000000 --- a/datafusion/tests/test_aggregation.py +++ /dev/null @@ -1,127 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. 
See the License for the -# specific language governing permissions and limitations -# under the License. - -import numpy as np -import pyarrow as pa -import pytest - -from datafusion import SessionContext, column, lit -from datafusion import functions as f - - -@pytest.fixture -def df(): - ctx = SessionContext() - - # create a RecordBatch and a new DataFrame from it - batch = pa.RecordBatch.from_arrays( - [ - pa.array([1, 2, 3]), - pa.array([4, 4, 6]), - pa.array([9, 8, 5]), - ], - names=["a", "b", "c"], - ) - return ctx.create_dataframe([[batch]]) - - -def test_built_in_aggregation(df): - col_a = column("a") - col_b = column("b") - col_c = column("c") - - agg_df = df.aggregate( - [], - [ - f.approx_distinct(col_b), - f.approx_median(col_b), - f.approx_percentile_cont(col_b, lit(0.5)), - f.approx_percentile_cont_with_weight(col_b, lit(0.6), lit(0.5)), - f.array_agg(col_b), - f.avg(col_a), - f.corr(col_a, col_b), - f.count(col_a), - f.covar(col_a, col_b), - f.covar_pop(col_a, col_c), - f.covar_samp(col_b, col_c), - # f.grouping(col_a), # No physical plan implemented yet - f.max(col_a), - f.mean(col_b), - f.median(col_b), - f.min(col_a), - f.sum(col_b), - f.stddev(col_a), - f.stddev_pop(col_b), - f.stddev_samp(col_c), - f.var(col_a), - f.var_pop(col_b), - f.var_samp(col_c), - ], - ) - result = agg_df.collect()[0] - values_a, values_b, values_c = df.collect()[0] - - assert result.column(0) == pa.array([2], type=pa.uint64()) - assert result.column(1) == pa.array([4]) - assert result.column(2) == pa.array([4]) - assert result.column(3) == pa.array([6]) - assert result.column(4) == pa.array([[4, 4, 6]]) - np.testing.assert_array_almost_equal( - result.column(5), np.average(values_a) - ) - np.testing.assert_array_almost_equal( - result.column(6), np.corrcoef(values_a, values_b)[0][1] - ) - assert result.column(7) == pa.array([len(values_a)]) - # Sample (co)variance -> ddof=1 - # Population (co)variance -> ddof=0 - np.testing.assert_array_almost_equal( - result.column(8), 
np.cov(values_a, values_b, ddof=1)[0][1] - ) - np.testing.assert_array_almost_equal( - result.column(9), np.cov(values_a, values_c, ddof=0)[0][1] - ) - np.testing.assert_array_almost_equal( - result.column(10), np.cov(values_b, values_c, ddof=1)[0][1] - ) - np.testing.assert_array_almost_equal(result.column(11), np.max(values_a)) - np.testing.assert_array_almost_equal(result.column(12), np.mean(values_b)) - np.testing.assert_array_almost_equal( - result.column(13), np.median(values_b) - ) - np.testing.assert_array_almost_equal(result.column(14), np.min(values_a)) - np.testing.assert_array_almost_equal( - result.column(15), np.sum(values_b.to_pylist()) - ) - np.testing.assert_array_almost_equal( - result.column(16), np.std(values_a, ddof=1) - ) - np.testing.assert_array_almost_equal( - result.column(17), np.std(values_b, ddof=0) - ) - np.testing.assert_array_almost_equal( - result.column(18), np.std(values_c, ddof=1) - ) - np.testing.assert_array_almost_equal( - result.column(19), np.var(values_a, ddof=1) - ) - np.testing.assert_array_almost_equal( - result.column(20), np.var(values_b, ddof=0) - ) - np.testing.assert_array_almost_equal( - result.column(21), np.var(values_c, ddof=1) - ) diff --git a/datafusion/tests/test_context.py b/datafusion/tests/test_context.py deleted file mode 100644 index 1aea21c21..000000000 --- a/datafusion/tests/test_context.py +++ /dev/null @@ -1,351 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. 
You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. See the License for the -# specific language governing permissions and limitations -# under the License. - -import os - -import pyarrow as pa -import pyarrow.dataset as ds - -from datafusion import ( - column, - literal, - SessionContext, - SessionConfig, - RuntimeConfig, - DataFrame, -) -import pytest - - -def test_create_context_no_args(): - SessionContext() - - -def test_create_context_with_all_valid_args(): - runtime = ( - RuntimeConfig().with_disk_manager_os().with_fair_spill_pool(10000000) - ) - config = ( - SessionConfig() - .with_create_default_catalog_and_schema(True) - .with_default_catalog_and_schema("foo", "bar") - .with_target_partitions(1) - .with_information_schema(True) - .with_repartition_joins(False) - .with_repartition_aggregations(False) - .with_repartition_windows(False) - .with_parquet_pruning(False) - ) - - ctx = SessionContext(config, runtime) - - # verify that at least some of the arguments worked - ctx.catalog("foo").database("bar") - with pytest.raises(KeyError): - ctx.catalog("datafusion") - - -def test_register_record_batches(ctx): - # create a RecordBatch and register it as memtable - batch = pa.RecordBatch.from_arrays( - [pa.array([1, 2, 3]), pa.array([4, 5, 6])], - names=["a", "b"], - ) - - ctx.register_record_batches("t", [[batch]]) - - assert ctx.tables() == {"t"} - - result = ctx.sql("SELECT a+b, a-b FROM t").collect() - - assert result[0].column(0) == pa.array([5, 7, 9]) - assert result[0].column(1) == pa.array([-3, -3, -3]) - - -def test_create_dataframe_registers_unique_table_name(ctx): - # create a RecordBatch and register it as memtable - batch = pa.RecordBatch.from_arrays( - [pa.array([1, 2, 3]), pa.array([4, 5, 6])], - 
names=["a", "b"], - ) - - df = ctx.create_dataframe([[batch]]) - tables = list(ctx.tables()) - - assert df - assert len(tables) == 1 - assert len(tables[0]) == 33 - assert tables[0].startswith("c") - # ensure that the rest of the table name contains - # only hexadecimal numbers - for c in tables[0][1:]: - assert c in "0123456789abcdef" - - -def test_create_dataframe_registers_with_defined_table_name(ctx): - # create a RecordBatch and register it as memtable - batch = pa.RecordBatch.from_arrays( - [pa.array([1, 2, 3]), pa.array([4, 5, 6])], - names=["a", "b"], - ) - - df = ctx.create_dataframe([[batch]], name="tbl") - tables = list(ctx.tables()) - - assert df - assert len(tables) == 1 - assert tables[0] == "tbl" - - -def test_from_arrow_table(ctx): - # create a PyArrow table - data = {"a": [1, 2, 3], "b": [4, 5, 6]} - table = pa.Table.from_pydict(data) - - # convert to DataFrame - df = ctx.from_arrow_table(table) - tables = list(ctx.tables()) - - assert df - assert len(tables) == 1 - assert type(df) == DataFrame - assert set(df.schema().names) == {"a", "b"} - assert df.collect()[0].num_rows == 3 - - -def test_from_arrow_table_with_name(ctx): - # create a PyArrow table - data = {"a": [1, 2, 3], "b": [4, 5, 6]} - table = pa.Table.from_pydict(data) - - # convert to DataFrame with optional name - df = ctx.from_arrow_table(table, name="tbl") - tables = list(ctx.tables()) - - assert df - assert tables[0] == "tbl" - - -def test_from_pylist(ctx): - # create a dataframe from Python list - data = [ - {"a": 1, "b": 4}, - {"a": 2, "b": 5}, - {"a": 3, "b": 6}, - ] - - df = ctx.from_pylist(data) - tables = list(ctx.tables()) - - assert df - assert len(tables) == 1 - assert type(df) == DataFrame - assert set(df.schema().names) == {"a", "b"} - assert df.collect()[0].num_rows == 3 - - -def test_from_pydict(ctx): - # create a dataframe from Python dictionary - data = {"a": [1, 2, 3], "b": [4, 5, 6]} - - df = ctx.from_pydict(data) - tables = list(ctx.tables()) - - assert df - assert 
len(tables) == 1
-    assert type(df) == DataFrame
-    assert set(df.schema().names) == {"a", "b"}
-    assert df.collect()[0].num_rows == 3
-
-
-def test_from_pandas(ctx):
-    # create a dataframe from pandas dataframe
-    pd = pytest.importorskip("pandas")
-    data = {"a": [1, 2, 3], "b": [4, 5, 6]}
-    pandas_df = pd.DataFrame(data)
-
-    df = ctx.from_pandas(pandas_df)
-    tables = list(ctx.tables())
-
-    assert df
-    assert len(tables) == 1
-    assert type(df) == DataFrame
-    assert set(df.schema().names) == {"a", "b"}
-    assert df.collect()[0].num_rows == 3
-
-
-def test_from_polars(ctx):
-    # create a dataframe from Polars dataframe
-    pd = pytest.importorskip("polars")
-    data = {"a": [1, 2, 3], "b": [4, 5, 6]}
-    polars_df = pd.DataFrame(data)
-
-    df = ctx.from_polars(polars_df)
-    tables = list(ctx.tables())
-
-    assert df
-    assert len(tables) == 1
-    assert type(df) == DataFrame
-    assert set(df.schema().names) == {"a", "b"}
-    assert df.collect()[0].num_rows == 3
-
-
-def test_register_table(ctx, database):
-    default = ctx.catalog()
-    public = default.database("public")
-    assert public.names() == {"csv", "csv1", "csv2"}
-    table = public.table("csv")
-
-    ctx.register_table("csv3", table)
-    assert public.names() == {"csv", "csv1", "csv2", "csv3"}
-
-
-def test_deregister_table(ctx, database):
-    default = ctx.catalog()
-    public = default.database("public")
-    assert public.names() == {"csv", "csv1", "csv2"}
-
-    ctx.deregister_table("csv")
-    assert public.names() == {"csv1", "csv2"}
-
-
-def test_register_dataset(ctx):
-    # create a RecordBatch and register it as a pyarrow.dataset.Dataset
-    batch = pa.RecordBatch.from_arrays(
-        [pa.array([1, 2, 3]), pa.array([4, 5, 6])],
-        names=["a", "b"],
-    )
-    dataset = ds.dataset([batch])
-    ctx.register_dataset("t", dataset)
-
-    assert ctx.tables() == {"t"}
-
-    result = ctx.sql("SELECT a+b, a-b FROM t").collect()
-
-    assert result[0].column(0) == pa.array([5, 7, 9])
-    assert result[0].column(1) == pa.array([-3, -3, -3])
-
-
-def test_dataset_filter(ctx, capfd):
-    # create a RecordBatch and register it as a pyarrow.dataset.Dataset
-    batch = pa.RecordBatch.from_arrays(
-        [pa.array([1, 2, 3]), pa.array([4, 5, 6])],
-        names=["a", "b"],
-    )
-    dataset = ds.dataset([batch])
-    ctx.register_dataset("t", dataset)
-
-    assert ctx.tables() == {"t"}
-    df = ctx.sql("SELECT a+b, a-b FROM t WHERE a BETWEEN 2 and 3 AND b > 5")
-
-    # Make sure the filter was pushed down in Physical Plan
-    df.explain()
-    captured = capfd.readouterr()
-    assert "filter_expr=(((a >= 2) and (a <= 3)) and (b > 5))" in captured.out
-
-    result = df.collect()
-
-    assert result[0].column(0) == pa.array([9])
-    assert result[0].column(1) == pa.array([-3])
-
-
-def test_dataset_filter_nested_data(ctx):
-    # create Arrow StructArrays to test nested data types
-    data = pa.StructArray.from_arrays(
-        [pa.array([1, 2, 3]), pa.array([4, 5, 6])],
-        names=["a", "b"],
-    )
-    batch = pa.RecordBatch.from_arrays(
-        [data],
-        names=["nested_data"],
-    )
-    dataset = ds.dataset([batch])
-    ctx.register_dataset("t", dataset)
-
-    assert ctx.tables() == {"t"}
-
-    df = ctx.table("t")
-
-    # This filter will not be pushed down to DatasetExec since it
-    # isn't supported
-    df = df.select(
-        column("nested_data")["a"] + column("nested_data")["b"],
-        column("nested_data")["a"] - column("nested_data")["b"],
-    ).filter(column("nested_data")["b"] > literal(5))
-
-    result = df.collect()
-
-    assert result[0].column(0) == pa.array([9])
-    assert result[0].column(1) == pa.array([-3])
-
-
-def test_table_exist(ctx):
-    batch = pa.RecordBatch.from_arrays(
-        [pa.array([1, 2, 3]), pa.array([4, 5, 6])],
-        names=["a", "b"],
-    )
-    dataset = ds.dataset([batch])
-    ctx.register_dataset("t", dataset)
-
-    assert ctx.table_exist("t") is True
-
-
-def test_read_json(ctx):
-    path = os.path.dirname(os.path.abspath(__file__))
-
-    # Default
-    test_data_path = os.path.join(path, "data_test_context", "data.json")
-    df = ctx.read_json(test_data_path)
-    result = df.collect()
-
-    assert result[0].column(0) == pa.array(["a", "b", "c"])
-    assert result[0].column(1) == pa.array([1, 2, 3])
-
-    # Schema
-    schema = pa.schema(
-        [
-            pa.field("A", pa.string(), nullable=True),
-        ]
-    )
-    df = ctx.read_json(test_data_path, schema=schema)
-    result = df.collect()
-
-    assert result[0].column(0) == pa.array(["a", "b", "c"])
-    assert result[0].schema == schema
-
-    # File extension
-    test_data_path = os.path.join(path, "data_test_context", "data.json")
-    df = ctx.read_json(test_data_path, file_extension=".json")
-    result = df.collect()
-
-    assert result[0].column(0) == pa.array(["a", "b", "c"])
-    assert result[0].column(1) == pa.array([1, 2, 3])
-
-
-def test_read_csv(ctx):
-    csv_df = ctx.read_csv(path="testing/data/csv/aggregate_test_100.csv")
-    csv_df.select(column("c1")).show()
-
-
-def test_read_parquet(ctx):
-    csv_df = ctx.read_parquet(path="parquet/data/alltypes_plain.parquet")
-    csv_df.show()
-
-
-def test_read_avro(ctx):
-    csv_df = ctx.read_avro(path="testing/data/avro/alltypes_plain.avro")
-    csv_df.show()
diff --git a/datafusion/tests/test_dataframe.py b/datafusion/tests/test_dataframe.py
deleted file mode 100644
index 611bcabe4..000000000
--- a/datafusion/tests/test_dataframe.py
+++ /dev/null
@@ -1,609 +0,0 @@
-# Licensed to the Apache Software Foundation (ASF) under one
-# or more contributor license agreements.  See the NOTICE file
-# distributed with this work for additional information
-# regarding copyright ownership.  The ASF licenses this file
-# to you under the Apache License, Version 2.0 (the
-# "License"); you may not use this file except in compliance
-# with the License.  You may obtain a copy of the License at
-#
-#   http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing,
-# software distributed under the License is distributed on an
-# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-# KIND, either express or implied.  See the License for the
-# specific language governing permissions and limitations
-# under the License.
-
-import pyarrow as pa
-import pytest
-
-from datafusion import functions as f
-from datafusion import DataFrame, SessionContext, column, literal, udf
-
-
-@pytest.fixture
-def ctx():
-    return SessionContext()
-
-
-@pytest.fixture
-def df():
-    ctx = SessionContext()
-
-    # create a RecordBatch and a new DataFrame from it
-    batch = pa.RecordBatch.from_arrays(
-        [pa.array([1, 2, 3]), pa.array([4, 5, 6]), pa.array([8, 5, 8])],
-        names=["a", "b", "c"],
-    )
-
-    return ctx.create_dataframe([[batch]])
-
-
-@pytest.fixture
-def struct_df():
-    ctx = SessionContext()
-
-    # create a RecordBatch and a new DataFrame from it
-    batch = pa.RecordBatch.from_arrays(
-        [pa.array([{"c": 1}, {"c": 2}, {"c": 3}]), pa.array([4, 5, 6])],
-        names=["a", "b"],
-    )
-
-    return ctx.create_dataframe([[batch]])
-
-
-@pytest.fixture
-def aggregate_df():
-    ctx = SessionContext()
-    ctx.register_csv("test", "testing/data/csv/aggregate_test_100.csv")
-    return ctx.sql("select c1, sum(c2) from test group by c1")
-
-
-def test_select(df):
-    df = df.select(
-        column("a") + column("b"),
-        column("a") - column("b"),
-    )
-
-    # execute and collect the first (and only) batch
-    result = df.collect()[0]
-
-    assert result.column(0) == pa.array([5, 7, 9])
-    assert result.column(1) == pa.array([-3, -3, -3])
-
-
-def test_select_columns(df):
-    df = df.select_columns("b", "a")
-
-    # execute and collect the first (and only) batch
-    result = df.collect()[0]
-
-    assert result.column(0) == pa.array([4, 5, 6])
-    assert result.column(1) == pa.array([1, 2, 3])
-
-
-def test_filter(df):
-    df = df.select(
-        column("a") + column("b"),
-        column("a") - column("b"),
-    ).filter(column("a") > literal(2))
-
-    # execute and collect the first (and only) batch
-    result = df.collect()[0]
-
-    assert result.column(0) == pa.array([9])
-    assert result.column(1) == pa.array([-3])
-
-
-def test_sort(df):
-    df = df.sort(column("b").sort(ascending=False))
-
-    table = pa.Table.from_batches(df.collect())
-    expected = {"a": [3, 2, 1], "b": [6, 5, 4], "c": [8, 5, 8]}
-
-    assert table.to_pydict() == expected
-
-
-def test_limit(df):
-    df = df.limit(1)
-
-    # execute and collect the first (and only) batch
-    result = df.collect()[0]
-
-    assert len(result.column(0)) == 1
-    assert len(result.column(1)) == 1
-
-
-def test_with_column(df):
-    df = df.with_column("c", column("a") + column("b"))
-
-    # execute and collect the first (and only) batch
-    result = df.collect()[0]
-
-    assert result.schema.field(0).name == "a"
-    assert result.schema.field(1).name == "b"
-    assert result.schema.field(2).name == "c"
-
-    assert result.column(0) == pa.array([1, 2, 3])
-    assert result.column(1) == pa.array([4, 5, 6])
-    assert result.column(2) == pa.array([5, 7, 9])
-
-
-def test_with_column_renamed(df):
-    df = df.with_column("c", column("a") + column("b")).with_column_renamed(
-        "c", "sum"
-    )
-
-    result = df.collect()[0]
-
-    assert result.schema.field(0).name == "a"
-    assert result.schema.field(1).name == "b"
-    assert result.schema.field(2).name == "sum"
-
-
-def test_udf(df):
-    # is_null is a pa function over arrays
-    is_null = udf(
-        lambda x: x.is_null(),
-        [pa.int64()],
-        pa.bool_(),
-        volatility="immutable",
-    )
-
-    df = df.select(is_null(column("a")))
-    result = df.collect()[0].column(0)
-
-    assert result == pa.array([False, False, False])
-
-
-def test_join():
-    ctx = SessionContext()
-
-    batch = pa.RecordBatch.from_arrays(
-        [pa.array([1, 2, 3]), pa.array([4, 5, 6])],
-        names=["a", "b"],
-    )
-    df = ctx.create_dataframe([[batch]])
-
-    batch = pa.RecordBatch.from_arrays(
-        [pa.array([1, 2]), pa.array([8, 10])],
-        names=["a", "c"],
-    )
-    df1 = ctx.create_dataframe([[batch]])
-
-    df = df.join(df1, join_keys=(["a"], ["a"]), how="inner")
-    df = df.sort(column("a").sort(ascending=True))
-    table = pa.Table.from_batches(df.collect())
-
-    expected = {"a": [1, 2], "c": [8, 10], "b": [4, 5]}
-    assert table.to_pydict() == expected
-
-
-def test_distinct():
-    ctx = SessionContext()
-
-    batch = pa.RecordBatch.from_arrays(
-        [pa.array([1, 2, 3, 1, 2, 3]), pa.array([4, 5, 6, 4, 5, 6])],
-        names=["a", "b"],
-    )
-    df_a = (
-        ctx.create_dataframe([[batch]])
-        .distinct()
-        .sort(column("a").sort(ascending=True))
-    )
-
-    batch = pa.RecordBatch.from_arrays(
-        [pa.array([1, 2, 3]), pa.array([4, 5, 6])],
-        names=["a", "b"],
-    )
-    df_b = ctx.create_dataframe([[batch]]).sort(
-        column("a").sort(ascending=True)
-    )
-
-    assert df_a.collect() == df_b.collect()
-
-
-def test_window_functions(df):
-    df = df.select(
-        column("a"),
-        column("b"),
-        column("c"),
-        f.alias(
-            f.window("row_number", [], order_by=[f.order_by(column("c"))]),
-            "row",
-        ),
-        f.alias(
-            f.window("rank", [], order_by=[f.order_by(column("c"))]),
-            "rank",
-        ),
-        f.alias(
-            f.window("dense_rank", [], order_by=[f.order_by(column("c"))]),
-            "dense_rank",
-        ),
-        f.alias(
-            f.window("percent_rank", [], order_by=[f.order_by(column("c"))]),
-            "percent_rank",
-        ),
-        f.alias(
-            f.window("cume_dist", [], order_by=[f.order_by(column("b"))]),
-            "cume_dist",
-        ),
-        f.alias(
-            f.window(
-                "ntile", [literal(2)], order_by=[f.order_by(column("c"))]
-            ),
-            "ntile",
-        ),
-        f.alias(
-            f.window("lag", [column("b")], order_by=[f.order_by(column("b"))]),
-            "previous",
-        ),
-        f.alias(
-            f.window(
-                "lead", [column("b")], order_by=[f.order_by(column("b"))]
-            ),
-            "next",
-        ),
-        f.alias(
-            f.window(
-                "first_value",
-                [column("a")],
-                order_by=[f.order_by(column("b"))],
-            ),
-            "first_value",
-        ),
-        f.alias(
-            f.window(
-                "last_value", [column("b")], order_by=[f.order_by(column("b"))]
-            ),
-            "last_value",
-        ),
-        f.alias(
-            f.window(
-                "nth_value",
-                [column("b"), literal(2)],
-                order_by=[f.order_by(column("b"))],
-            ),
-            "2nd_value",
-        ),
-    )
-
-    table = pa.Table.from_batches(df.collect())
-
-    expected = {
-        "a": [1, 2, 3],
-        "b": [4, 5, 6],
-        "c": [8, 5, 8],
-        "row": [2, 1, 3],
-        "rank": [2, 1, 2],
-        "dense_rank": [2, 1, 2],
-        "percent_rank": [0.5, 0, 0.5],
-        "cume_dist": [0.3333333333333333, 0.6666666666666666, 1.0],
-        "ntile": [1, 1, 2],
-        "next": [5, 6, None],
-        "previous": [None, 4, 5],
-        "first_value": [1, 1, 1],
-        "last_value": [4, 5, 6],
-        "2nd_value": [None, 5, 5],
-    }
-    assert table.sort_by("a").to_pydict() == expected
-
-
-def test_get_dataframe(tmp_path):
-    ctx = SessionContext()
-
-    path = tmp_path / "test.csv"
-    table = pa.Table.from_arrays(
-        [
-            [1, 2, 3, 4],
-            ["a", "b", "c", "d"],
-            [1.1, 2.2, 3.3, 4.4],
-        ],
-        names=["int", "str", "float"],
-    )
-    pa.csv.write_csv(table, path)
-
-    ctx.register_csv("csv", path)
-
-    df = ctx.table("csv")
-    assert isinstance(df, DataFrame)
-
-
-def test_struct_select(struct_df):
-    df = struct_df.select(
-        column("a")["c"] + column("b"),
-        column("a")["c"] - column("b"),
-    )
-
-    # execute and collect the first (and only) batch
-    result = df.collect()[0]
-
-    assert result.column(0) == pa.array([5, 7, 9])
-    assert result.column(1) == pa.array([-3, -3, -3])
-
-
-def test_explain(df):
-    df = df.select(
-        column("a") + column("b"),
-        column("a") - column("b"),
-    )
-    df.explain()
-
-
-def test_logical_plan(aggregate_df):
-    plan = aggregate_df.logical_plan()
-
-    expected = "Projection: test.c1, SUM(test.c2)"
-
-    assert expected == plan.display()
-
-    expected = (
-        "Projection: test.c1, SUM(test.c2)\n"
-        "  Aggregate: groupBy=[[test.c1]], aggr=[[SUM(test.c2)]]\n"
-        "    TableScan: test"
-    )
-
-    assert expected == plan.display_indent()
-
-
-def test_optimized_logical_plan(aggregate_df):
-    plan = aggregate_df.optimized_logical_plan()
-
-    expected = "Aggregate: groupBy=[[test.c1]], aggr=[[SUM(test.c2)]]"
-
-    assert expected == plan.display()
-
-    expected = (
-        "Aggregate: groupBy=[[test.c1]], aggr=[[SUM(test.c2)]]\n"
-        "  TableScan: test projection=[c1, c2]"
-    )
-
-    assert expected == plan.display_indent()
-
-
-def test_execution_plan(aggregate_df):
-    plan = aggregate_df.execution_plan()
-
-    expected = "AggregateExec: mode=FinalPartitioned, gby=[c1@0 as c1], aggr=[SUM(test.c2)]\n"  # noqa: E501
-
-    assert expected == plan.display()
-
-    expected = (
-        "ProjectionExec: expr=[c1@0 as c1, SUM(test.c2)@1 as SUM(test.c2)]\n"
-        "  Aggregate: groupBy=[[test.c1]], aggr=[[SUM(test.c2)]]\n"
-        "    TableScan: test projection=[c1, c2]"
-    )
-
-    indent = plan.display_indent()
-
-    # indent plan will be different for everyone due to absolute path
-    # to filename, so we just check for some expected content
-    assert "AggregateExec:" in indent
-    assert "CoalesceBatchesExec:" in indent
-    assert "RepartitionExec:" in indent
-    assert "CsvExec:" in indent
-
-    ctx = SessionContext()
-    stream = ctx.execute(plan, 0)
-    # get the one and only batch
-    batch = stream.next()
-    assert batch is not None
-    # there should be no more batches
-    batch = stream.next()
-    assert batch is None
-
-
-def test_repartition(df):
-    df.repartition(2)
-
-
-def test_repartition_by_hash(df):
-    df.repartition_by_hash(column("a"), num=2)
-
-
-def test_intersect():
-    ctx = SessionContext()
-
-    batch = pa.RecordBatch.from_arrays(
-        [pa.array([1, 2, 3]), pa.array([4, 5, 6])],
-        names=["a", "b"],
-    )
-    df_a = ctx.create_dataframe([[batch]])
-
-    batch = pa.RecordBatch.from_arrays(
-        [pa.array([3, 4, 5]), pa.array([6, 7, 8])],
-        names=["a", "b"],
-    )
-    df_b = ctx.create_dataframe([[batch]])
-
-    batch = pa.RecordBatch.from_arrays(
-        [pa.array([3]), pa.array([6])],
-        names=["a", "b"],
-    )
-    df_c = ctx.create_dataframe([[batch]]).sort(
-        column("a").sort(ascending=True)
-    )
-
-    df_a_i_b = df_a.intersect(df_b).sort(column("a").sort(ascending=True))
-
-    assert df_c.collect() == df_a_i_b.collect()
-
-
-def test_except_all():
-    ctx = SessionContext()
-
-    batch = pa.RecordBatch.from_arrays(
-        [pa.array([1, 2, 3]), pa.array([4, 5, 6])],
-        names=["a", "b"],
-    )
-    df_a = ctx.create_dataframe([[batch]])
-
-    batch = pa.RecordBatch.from_arrays(
-        [pa.array([3, 4, 5]), pa.array([6, 7, 8])],
-        names=["a", "b"],
-    )
-    df_b = ctx.create_dataframe([[batch]])
-
-    batch = pa.RecordBatch.from_arrays(
-        [pa.array([1, 2]), pa.array([4, 5])],
-        names=["a", "b"],
-    )
-    df_c = ctx.create_dataframe([[batch]]).sort(
-        column("a").sort(ascending=True)
-    )
-
-    df_a_e_b = df_a.except_all(df_b).sort(column("a").sort(ascending=True))
-
-    assert df_c.collect() == df_a_e_b.collect()
-
-
-def test_collect_partitioned():
-    ctx = SessionContext()
-
-    batch = pa.RecordBatch.from_arrays(
-        [pa.array([1, 2, 3]), pa.array([4, 5, 6])],
-        names=["a", "b"],
-    )
-
-    assert [[batch]] == ctx.create_dataframe([[batch]]).collect_partitioned()
-
-
-def test_union(ctx):
-    batch = pa.RecordBatch.from_arrays(
-        [pa.array([1, 2, 3]), pa.array([4, 5, 6])],
-        names=["a", "b"],
-    )
-    df_a = ctx.create_dataframe([[batch]])
-
-    batch = pa.RecordBatch.from_arrays(
-        [pa.array([3, 4, 5]), pa.array([6, 7, 8])],
-        names=["a", "b"],
-    )
-    df_b = ctx.create_dataframe([[batch]])
-
-    batch = pa.RecordBatch.from_arrays(
-        [pa.array([1, 2, 3, 3, 4, 5]), pa.array([4, 5, 6, 6, 7, 8])],
-        names=["a", "b"],
-    )
-    df_c = ctx.create_dataframe([[batch]]).sort(
-        column("a").sort(ascending=True)
-    )
-
-    df_a_u_b = df_a.union(df_b).sort(column("a").sort(ascending=True))
-
-    assert df_c.collect() == df_a_u_b.collect()
-
-
-def test_union_distinct(ctx):
-    batch = pa.RecordBatch.from_arrays(
-        [pa.array([1, 2, 3]), pa.array([4, 5, 6])],
-        names=["a", "b"],
-    )
-    df_a = ctx.create_dataframe([[batch]])
-
-    batch = pa.RecordBatch.from_arrays(
-        [pa.array([3, 4, 5]), pa.array([6, 7, 8])],
-        names=["a", "b"],
-    )
-    df_b = ctx.create_dataframe([[batch]])
-
-    batch = pa.RecordBatch.from_arrays(
-        [pa.array([1, 2, 3, 4, 5]), pa.array([4, 5, 6, 7, 8])],
-        names=["a", "b"],
-    )
-    df_c = ctx.create_dataframe([[batch]]).sort(
-        column("a").sort(ascending=True)
-    )
-
-    df_a_u_b = df_a.union(df_b, True).sort(column("a").sort(ascending=True))
-
-    assert df_c.collect() == df_a_u_b.collect()
-    assert df_c.collect() == df_a_u_b.collect()
-
-
-def test_cache(df):
-    assert df.cache().collect() == df.collect()
-
-
-def test_count(df):
-    # Get number of rows
-    assert df.count() == 3
-
-
-def test_to_pandas(df):
-    # Skip test if pandas is not installed
-    pd = pytest.importorskip("pandas")
-
-    # Convert datafusion dataframe to pandas dataframe
-    pandas_df = df.to_pandas()
-    assert type(pandas_df) == pd.DataFrame
-    assert pandas_df.shape == (3, 3)
-    assert set(pandas_df.columns) == {"a", "b", "c"}
-
-
-def test_empty_to_pandas(df):
-    # Skip test if pandas is not installed
-    pd = pytest.importorskip("pandas")
-
-    # Convert empty datafusion dataframe to pandas dataframe
-    pandas_df = df.limit(0).to_pandas()
-    assert type(pandas_df) == pd.DataFrame
-    assert pandas_df.shape == (0, 3)
-    assert set(pandas_df.columns) == {"a", "b", "c"}
-
-
-def test_to_polars(df):
-    # Skip test if polars is not installed
-    pl = pytest.importorskip("polars")
-
-    # Convert datafusion dataframe to polars dataframe
-    polars_df = df.to_polars()
-    assert type(polars_df) == pl.DataFrame
-    assert polars_df.shape == (3, 3)
-    assert set(polars_df.columns) == {"a", "b", "c"}
-
-
-def test_empty_to_polars(df):
-    # Skip test if polars is not installed
-    pl = pytest.importorskip("polars")
-
-    # Convert empty datafusion dataframe to polars dataframe
-    polars_df = df.limit(0).to_polars()
-    assert type(polars_df) == pl.DataFrame
-    assert polars_df.shape == (0, 3)
-    assert set(polars_df.columns) == {"a", "b", "c"}
-
-
-def test_to_arrow_table(df):
-    # Convert datafusion dataframe to pyarrow Table
-    pyarrow_table = df.to_arrow_table()
-    assert type(pyarrow_table) == pa.Table
-    assert pyarrow_table.shape == (3, 3)
-    assert set(pyarrow_table.column_names) == {"a", "b", "c"}
-
-
-def test_empty_to_arrow_table(df):
-    # Convert empty datafusion dataframe to pyarrow Table
-    pyarrow_table = df.limit(0).to_arrow_table()
-    assert type(pyarrow_table) == pa.Table
-    assert pyarrow_table.shape == (0, 3)
-    assert set(pyarrow_table.column_names) == {"a", "b", "c"}
-
-
-def test_to_pylist(df):
-    # Convert datafusion dataframe to Python list
-    pylist = df.to_pylist()
-    assert type(pylist) == list
-    assert pylist == [
-        {"a": 1, "b": 4, "c": 8},
-        {"a": 2, "b": 5, "c": 5},
-        {"a": 3, "b": 6, "c": 8},
-    ]
-
-
-def test_to_pydict(df):
-    # Convert datafusion dataframe to Python dictionary
-    pydict = df.to_pydict()
-    assert type(pydict) == dict
-    assert pydict == {"a": [1, 2, 3], "b": [4, 5, 6], "c": [8, 5, 8]}
diff --git a/datafusion/tests/test_expr.py b/datafusion/tests/test_expr.py
deleted file mode 100644
index 0c4869f27..000000000
--- a/datafusion/tests/test_expr.py
+++ /dev/null
@@ -1,110 +0,0 @@
-# Licensed to the Apache Software Foundation (ASF) under one
-# or more contributor license agreements.  See the NOTICE file
-# distributed with this work for additional information
-# regarding copyright ownership.  The ASF licenses this file
-# to you under the Apache License, Version 2.0 (the
-# "License"); you may not use this file except in compliance
-# with the License.  You may obtain a copy of the License at
-#
-#   http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing,
-# software distributed under the License is distributed on an
-# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-# KIND, either express or implied.  See the License for the
-# specific language governing permissions and limitations
-# under the License.
-
-from datafusion import SessionContext
-from datafusion.expr import Column, Literal, BinaryExpr, AggregateFunction
-from datafusion.expr import (
-    Projection,
-    Filter,
-    Aggregate,
-    Limit,
-    Sort,
-    TableScan,
-)
-import pytest
-
-
-@pytest.fixture
-def test_ctx():
-    ctx = SessionContext()
-    ctx.register_csv("test", "testing/data/csv/aggregate_test_100.csv")
-    return ctx
-
-
-def test_projection(test_ctx):
-    df = test_ctx.sql("select c1, 123, c1 < 123 from test")
-    plan = df.logical_plan()
-
-    plan = plan.to_variant()
-    assert isinstance(plan, Projection)
-
-    expr = plan.projections()
-
-    col1 = expr[0].to_variant()
-    assert isinstance(col1, Column)
-    assert col1.name() == "c1"
-    assert col1.qualified_name() == "test.c1"
-
-    col2 = expr[1].to_variant()
-    assert isinstance(col2, Literal)
-    assert col2.data_type() == "Int64"
-    assert col2.value_i64() == 123
-
-    col3 = expr[2].to_variant()
-    assert isinstance(col3, BinaryExpr)
-    assert isinstance(col3.left().to_variant(), Column)
-    assert col3.op() == "<"
-    assert isinstance(col3.right().to_variant(), Literal)
-
-    plan = plan.input()[0].to_variant()
-    assert isinstance(plan, TableScan)
-
-
-def test_filter(test_ctx):
-    df = test_ctx.sql("select c1 from test WHERE c1 > 5")
-    plan = df.logical_plan()
-
-    plan = plan.to_variant()
-    assert isinstance(plan, Projection)
-
-    plan = plan.input()[0].to_variant()
-    assert isinstance(plan, Filter)
-
-
-def test_limit(test_ctx):
-    df = test_ctx.sql("select c1 from test LIMIT 10")
-    plan = df.logical_plan()
-
-    plan = plan.to_variant()
-    assert isinstance(plan, Limit)
-
-
-def test_aggregate_query(test_ctx):
-    df = test_ctx.sql("select c1, count(*) from test group by c1")
-    plan = df.logical_plan()
-
-    projection = plan.to_variant()
-    assert isinstance(projection, Projection)
-
-    aggregate = projection.input()[0].to_variant()
-    assert isinstance(aggregate, Aggregate)
-
-    col1 = aggregate.group_by_exprs()[0].to_variant()
-    assert isinstance(col1, Column)
-    assert col1.name() == "c1"
-    assert col1.qualified_name() == "test.c1"
-
-    col2 = aggregate.aggregate_exprs()[0].to_variant()
-    assert isinstance(col2, AggregateFunction)
-
-
-def test_sort(test_ctx):
-    df = test_ctx.sql("select c1 from test order by c1")
-    plan = df.logical_plan()
-
-    plan = plan.to_variant()
-    assert isinstance(plan, Sort)
diff --git a/datafusion/tests/test_functions.py b/datafusion/tests/test_functions.py
deleted file mode 100644
index bea580859..000000000
--- a/datafusion/tests/test_functions.py
+++ /dev/null
@@ -1,413 +0,0 @@
-# Licensed to the Apache Software Foundation (ASF) under one
-# or more contributor license agreements.  See the NOTICE file
-# distributed with this work for additional information
-# regarding copyright ownership.  The ASF licenses this file
-# to you under the Apache License, Version 2.0 (the
-# "License"); you may not use this file except in compliance
-# with the License.  You may obtain a copy of the License at
-#
-#   http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing,
-# software distributed under the License is distributed on an
-# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-# KIND, either express or implied.  See the License for the
-# specific language governing permissions and limitations
-# under the License.
-
-import numpy as np
-import pyarrow as pa
-import pytest
-from datetime import datetime
-
-from datafusion import SessionContext, column
-from datafusion import functions as f
-from datafusion import literal
-
-
-@pytest.fixture
-def df():
-    ctx = SessionContext()
-    # create a RecordBatch and a new DataFrame from it
-    batch = pa.RecordBatch.from_arrays(
-        [
-            pa.array(["Hello", "World", "!"]),
-            pa.array([4, 5, 6]),
-            pa.array(["hello ", " world ", " !"]),
-            pa.array(
-                [
-                    datetime(2022, 12, 31),
-                    datetime(2027, 6, 26),
-                    datetime(2020, 7, 2),
-                ]
-            ),
-        ],
-        names=["a", "b", "c", "d"],
-    )
-    return ctx.create_dataframe([[batch]])
-
-
-def test_literal(df):
-    df = df.select(
-        literal(1),
-        literal("1"),
-        literal("OK"),
-        literal(3.14),
-        literal(True),
-        literal(b"hello world"),
-    )
-    result = df.collect()
-    assert len(result) == 1
-    result = result[0]
-    assert result.column(0) == pa.array([1] * 3)
-    assert result.column(1) == pa.array(["1"] * 3)
-    assert result.column(2) == pa.array(["OK"] * 3)
-    assert result.column(3) == pa.array([3.14] * 3)
-    assert result.column(4) == pa.array([True] * 3)
-    assert result.column(5) == pa.array([b"hello world"] * 3)
-
-
-def test_lit_arith(df):
-    """
-    Test literals with arithmetic operations
-    """
-    df = df.select(
-        literal(1) + column("b"), f.concat(column("a"), literal("!"))
-    )
-    result = df.collect()
-    assert len(result) == 1
-    result = result[0]
-    assert result.column(0) == pa.array([5, 6, 7])
-    assert result.column(1) == pa.array(["Hello!", "World!", "!!"])
-
-
-def test_math_functions():
-    ctx = SessionContext()
-    # create a RecordBatch and a new DataFrame from it
-    batch = pa.RecordBatch.from_arrays(
-        [pa.array([0.1, -0.7, 0.55])], names=["value"]
-    )
-    df = ctx.create_dataframe([[batch]])
-
-    values = np.array([0.1, -0.7, 0.55])
-    col_v = column("value")
-    df = df.select(
-        f.abs(col_v),
-        f.sin(col_v),
-        f.cos(col_v),
-        f.tan(col_v),
-        f.asin(col_v),
-        f.acos(col_v),
-        f.exp(col_v),
-        f.ln(col_v + literal(pa.scalar(1))),
-        f.log2(col_v + literal(pa.scalar(1))),
-        f.log10(col_v + literal(pa.scalar(1))),
-        f.random(),
-        f.atan(col_v),
-        f.atan2(col_v, literal(pa.scalar(1.1))),
-        f.ceil(col_v),
-        f.floor(col_v),
-        f.power(col_v, literal(pa.scalar(3))),
-        f.pow(col_v, literal(pa.scalar(4))),
-        f.round(col_v),
-        f.sqrt(col_v),
-        f.signum(col_v),
-        f.trunc(col_v),
-    )
-    batches = df.collect()
-    assert len(batches) == 1
-    result = batches[0]
-
-    np.testing.assert_array_almost_equal(result.column(0), np.abs(values))
-    np.testing.assert_array_almost_equal(result.column(1), np.sin(values))
-    np.testing.assert_array_almost_equal(result.column(2), np.cos(values))
-    np.testing.assert_array_almost_equal(result.column(3), np.tan(values))
-    np.testing.assert_array_almost_equal(result.column(4), np.arcsin(values))
-    np.testing.assert_array_almost_equal(result.column(5), np.arccos(values))
-    np.testing.assert_array_almost_equal(result.column(6), np.exp(values))
-    np.testing.assert_array_almost_equal(
-        result.column(7), np.log(values + 1.0)
-    )
-    np.testing.assert_array_almost_equal(
-        result.column(8), np.log2(values + 1.0)
-    )
-    np.testing.assert_array_almost_equal(
-        result.column(9), np.log10(values + 1.0)
-    )
-    np.testing.assert_array_less(result.column(10), np.ones_like(values))
-    np.testing.assert_array_almost_equal(result.column(11), np.arctan(values))
-    np.testing.assert_array_almost_equal(
-        result.column(12), np.arctan2(values, 1.1)
-    )
-    np.testing.assert_array_almost_equal(result.column(13), np.ceil(values))
-    np.testing.assert_array_almost_equal(result.column(14), np.floor(values))
-    np.testing.assert_array_almost_equal(
-        result.column(15), np.power(values, 3)
-    )
-    np.testing.assert_array_almost_equal(
-        result.column(16), np.power(values, 4)
-    )
-    np.testing.assert_array_almost_equal(result.column(17), np.round(values))
-    np.testing.assert_array_almost_equal(result.column(18), np.sqrt(values))
-    np.testing.assert_array_almost_equal(result.column(19), np.sign(values))
-    np.testing.assert_array_almost_equal(result.column(20), np.trunc(values))
-
-
-def test_string_functions(df):
-    df = df.select(
-        f.ascii(column("a")),
-        f.bit_length(column("a")),
-        f.btrim(literal(" World ")),
-        f.character_length(column("a")),
-        f.chr(literal(68)),
-        f.concat_ws("-", column("a"), literal("test")),
-        f.concat(column("a"), literal("?")),
-        f.initcap(column("c")),
-        f.left(column("a"), literal(3)),
-        f.length(column("c")),
-        f.lower(column("a")),
-        f.lpad(column("a"), literal(7)),
-        f.ltrim(column("c")),
-        f.md5(column("a")),
-        f.octet_length(column("a")),
-        f.repeat(column("a"), literal(2)),
-        f.replace(column("a"), literal("l"), literal("?")),
-        f.reverse(column("a")),
-        f.right(column("a"), literal(4)),
-        f.rpad(column("a"), literal(8)),
-        f.rtrim(column("c")),
-        f.split_part(column("a"), literal("l"), literal(1)),
-        f.starts_with(column("a"), literal("Wor")),
-        f.strpos(column("a"), literal("o")),
-        f.substr(column("a"), literal(3)),
-        f.translate(column("a"), literal("or"), literal("ld")),
-        f.trim(column("c")),
-        f.upper(column("c")),
-    )
-    result = df.collect()
-    assert len(result) == 1
-    result = result[0]
-    assert result.column(0) == pa.array(
-        [72, 87, 33], type=pa.int32()
-    )  # H = 72; W = 87; ! = 33
-    assert result.column(1) == pa.array([40, 40, 8], type=pa.int32())
-    assert result.column(2) == pa.array(["World", "World", "World"])
-    assert result.column(3) == pa.array([5, 5, 1], type=pa.int32())
-    assert result.column(4) == pa.array(["D", "D", "D"])
-    assert result.column(5) == pa.array(["Hello-test", "World-test", "!-test"])
-    assert result.column(6) == pa.array(["Hello?", "World?", "!?"])
-    assert result.column(7) == pa.array(["Hello ", " World ", " !"])
-    assert result.column(8) == pa.array(["Hel", "Wor", "!"])
-    assert result.column(9) == pa.array([6, 7, 2], type=pa.int32())
-    assert result.column(10) == pa.array(["hello", "world", "!"])
-    assert result.column(11) == pa.array(["  Hello", "  World", "      !"])
-    assert result.column(12) == pa.array(["hello ", "world ", "!"])
-    assert result.column(13) == pa.array(
-        [
-            "8b1a9953c4611296a827abf8c47804d7",
-            "f5a7924e621e84c9280a9a27e1bcb7f6",
-            "9033e0e305f247c0c3c80d0c7848c8b3",
-        ]
-    )
-    assert result.column(14) == pa.array([5, 5, 1], type=pa.int32())
-    assert result.column(15) == pa.array(["HelloHello", "WorldWorld", "!!"])
-    assert result.column(16) == pa.array(["He??o", "Wor?d", "!"])
-    assert result.column(17) == pa.array(["olleH", "dlroW", "!"])
-    assert result.column(18) == pa.array(["ello", "orld", "!"])
-    assert result.column(19) == pa.array(["Hello   ", "World   ", "!       "])
-    assert result.column(20) == pa.array(["hello", " world", " !"])
-    assert result.column(21) == pa.array(["He", "Wor", "!"])
-    assert result.column(22) == pa.array([False, True, False])
-    assert result.column(23) == pa.array([5, 2, 0], type=pa.int32())
-    assert result.column(24) == pa.array(["llo", "rld", ""])
-    assert result.column(25) == pa.array(["Helll", "Wldld", "!"])
-    assert result.column(26) == pa.array(["hello", "world", "!"])
-    assert result.column(27) == pa.array(["HELLO ", " WORLD ", " !"])
-
-
-def test_hash_functions(df):
-    exprs = [
-        f.digest(column("a"), literal(m))
-        for m in (
-            "md5",
-            "sha224",
-            "sha256",
-            "sha384",
-            "sha512",
-            "blake2s",
-            "blake3",
-        )
-    ]
-    df = df.select(
-        *exprs,
-        f.sha224(column("a")),
-        f.sha256(column("a")),
-        f.sha384(column("a")),
-        f.sha512(column("a")),
-    )
-    result = df.collect()
-    assert len(result) == 1
-    result = result[0]
-    b = bytearray.fromhex
-    assert result.column(0) == pa.array(
-        [
-            b("8B1A9953C4611296A827ABF8C47804D7"),
-            b("F5A7924E621E84C9280A9A27E1BCB7F6"),
-            b("9033E0E305F247C0C3C80D0C7848C8B3"),
-        ]
-    )
-    assert result.column(1) == pa.array(
-        [
-            b("4149DA18AA8BFC2B1E382C6C26556D01A92C261B6436DAD5E3BE3FCC"),
-            b("12972632B6D3B6AA52BD6434552F08C1303D56B817119406466E9236"),
-            b("6641A7E8278BCD49E476E7ACAE158F4105B2952D22AEB2E0B9A231A0"),
-        ]
-    )
-    assert result.column(2) == pa.array(
-        [
-            b(
-                "185F8DB32271FE25F561A6FC938B2E26"
-                "4306EC304EDA518007D1764826381969"
-            ),
-            b(
-                "78AE647DC5544D227130A0682A51E30B"
-                "C7777FBB6D8A8F17007463A3ECD1D524"
-            ),
-            b(
-                "BB7208BC9B5D7C04F1236A82A0093A5E"
-                "33F40423D5BA8D4266F7092C3BA43B62"
-            ),
-        ]
-    )
-    assert result.column(3) == pa.array(
-        [
-            b(
-                "3519FE5AD2C596EFE3E276A6F351B8FC"
-                "0B03DB861782490D45F7598EBD0AB5FD"
-                "5520ED102F38C4A5EC834E98668035FC"
-            ),
-            b(
-                "ED7CED84875773603AF90402E42C65F3"
-                "B48A5E77F84ADC7A19E8F3E8D3101010"
-                "22F552AEC70E9E1087B225930C1D260A"
-            ),
-            b(
-                "1D0EC8C84EE9521E21F06774DE232367"
-                "B64DE628474CB5B2E372B699A1F55AE3"
-                "35CC37193EF823E33324DFD9A70738A6"
-            ),
-        ]
-    )
-    assert result.column(4) == pa.array(
-        [
-            b(
-                "3615F80C9D293ED7402687F94B22D58E"
-                "529B8CC7916F8FAC7FDDF7FBD5AF4CF7"
-                "77D3D795A7A00A16BF7E7F3FB9561EE9"
-                "BAAE480DA9FE7A18769E71886B03F315"
-            ),
-            b(
-                "8EA77393A42AB8FA92500FB077A9509C"
-                "C32BC95E72712EFA116EDAF2EDFAE34F"
-                "BB682EFDD6C5DD13C117E08BD4AAEF71"
-                "291D8AACE2F890273081D0677C16DF0F"
-            ),
-            b(
-                "3831A6A6155E509DEE59A7F451EB3532"
-                "4D8F8F2DF6E3708894740F98FDEE2388"
-                "9F4DE5ADB0C5010DFB555CDA77C8AB5D"
-                "C902094C52DE3278F35A75EBC25F093A"
-            ),
-        ]
-    )
-    assert result.column(5) == pa.array(
-        [
-            b(
-                "F73A5FBF881F89B814871F46E26AD3FA"
-                "37CB2921C5E8561618639015B3CCBB71"
-            ),
-            b(
-                "B792A0383FB9E7A189EC150686579532"
-                "854E44B71AC394831DAED169BA85CCC5"
-            ),
-            b(
-                "27988A0E51812297C77A433F63523334"
-                "6AEE29A829DCF4F46E0F58F402C6CFCB"
-            ),
-        ]
-    )
-    assert result.column(6) == pa.array(
-        [
-            b(
-                "FBC2B0516EE8744D293B980779178A35"
-                "08850FDCFE965985782C39601B65794F"
-            ),
-            b(
-                "BF73D18575A736E4037D45F9E316085B"
-                "86C19BE6363DE6AA789E13DEAACC1C4E"
-            ),
-            b(
-                "C8D11B9F7237E4034ADBCD2005735F9B"
-                "C4C597C75AD89F4492BEC8F77D15F7EB"
-            ),
-        ]
-    )
-    assert result.column(7) == result.column(1)  # SHA-224
-    assert result.column(8) == result.column(2)  # SHA-256
-    assert result.column(9) == result.column(3)  # SHA-384
-    assert result.column(10) == result.column(4)  # SHA-512
-
-
-def test_temporal_functions(df):
-    df = df.select(
-        f.date_part(literal("month"), column("d")),
-        f.datepart(literal("year"), column("d")),
-        f.date_trunc(literal("month"), column("d")),
-        f.datetrunc(literal("day"), column("d")),
-        f.date_bin(
-            literal("15 minutes"),
-            column("d"),
-            literal("2001-01-01 00:02:30"),
-        ),
-        f.from_unixtime(literal(1673383974)),
-        f.to_timestamp(literal("2023-09-07 05:06:14.523952")),
-        f.to_timestamp_seconds(literal("2023-09-07 05:06:14.523952")),
-        f.to_timestamp_millis(literal("2023-09-07 05:06:14.523952")),
-        f.to_timestamp_micros(literal("2023-09-07 05:06:14.523952")),
-    )
-    result = df.collect()
-    assert len(result) == 1
-    result = result[0]
-    assert result.column(0) == pa.array([12, 6, 7], type=pa.float64())
-    assert result.column(1) == pa.array([2022, 2027, 2020], type=pa.float64())
-    assert result.column(2) == pa.array(
-        [datetime(2022, 12, 1), datetime(2027, 6, 1), datetime(2020, 7, 1)],
-        type=pa.timestamp("ns"),
-    )
-    assert result.column(3) == pa.array(
-        [datetime(2022, 12, 31), datetime(2027, 6, 26), datetime(2020, 7, 2)],
-        type=pa.timestamp("ns"),
-    )
-    assert result.column(4) == pa.array(
-        [
-            datetime(2022, 12, 30, 23, 47, 30),
-            datetime(2027, 6, 25, 23, 47, 30),
-            datetime(2020, 7, 1, 23, 47, 30),
-        ],
-        type=pa.timestamp("ns"),
-    )
-    assert result.column(5) == pa.array(
-        [datetime(2023, 1, 10, 20, 52, 54)] * 3, type=pa.timestamp("s")
-    )
-    assert result.column(6) == pa.array(
-        [datetime(2023, 9, 7, 5, 6, 14, 523952)] * 3, type=pa.timestamp("ns")
-    )
-    assert result.column(7) == pa.array(
-        [datetime(2023, 9, 7, 5, 6, 14)] * 3, type=pa.timestamp("s")
-    )
-    assert result.column(8) == pa.array(
-        [datetime(2023, 9, 7, 5, 6, 14, 523000)] * 3, type=pa.timestamp("ms")
-    )
-    assert result.column(9) == pa.array(
-        [datetime(2023, 9, 7, 5, 6, 14, 523952)] * 3, type=pa.timestamp("us")
-    )
diff --git a/datafusion/tests/test_sql.py b/datafusion/tests/test_sql.py
deleted file mode 100644
index 638a222dc..000000000
--- a/datafusion/tests/test_sql.py
+++ /dev/null
@@ -1,295 +0,0 @@
-# Licensed to the Apache Software Foundation (ASF) under one
-# or more contributor license agreements.  See the NOTICE file
-# distributed with this work for additional information
-# regarding copyright ownership.  The ASF licenses this file
-# to you under the Apache License, Version 2.0 (the
-# "License"); you may not use this file except in compliance
-# with the License.  You may obtain a copy of the License at
-#
-#   http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing,
-# software distributed under the License is distributed on an
-# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-# KIND, either express or implied.  See the License for the
-# specific language governing permissions and limitations
-# under the License.
-
-import numpy as np
-import pyarrow as pa
-import pyarrow.dataset as ds
-import pytest
-
-from datafusion import udf
-
-from . import generic as helpers
-
-
-def test_no_table(ctx):
-    with pytest.raises(Exception, match="DataFusion error"):
-        ctx.sql("SELECT a FROM b").collect()
-
-
-def test_register_csv(ctx, tmp_path):
-    path = tmp_path / "test.csv"
-
-    table = pa.Table.from_arrays(
-        [
-            [1, 2, 3, 4],
-            ["a", "b", "c", "d"],
-            [1.1, 2.2, 3.3, 4.4],
-        ],
-        names=["int", "str", "float"],
-    )
-    pa.csv.write_csv(table, path)
-
-    ctx.register_csv("csv", path)
-    ctx.register_csv("csv1", str(path))
-    ctx.register_csv(
-        "csv2",
-        path,
-        has_header=True,
-        delimiter=",",
-        schema_infer_max_records=10,
-    )
-    alternative_schema = pa.schema(
-        [
-            ("some_int", pa.int16()),
-            ("some_bytes", pa.string()),
-            ("some_floats", pa.float32()),
-        ]
-    )
-    ctx.register_csv("csv3", path, schema=alternative_schema)
-
-    assert ctx.tables() == {"csv", "csv1", "csv2", "csv3"}
-
-    for table in ["csv", "csv1", "csv2"]:
-        result = ctx.sql(f"SELECT COUNT(int) AS cnt FROM {table}").collect()
-        result = pa.Table.from_batches(result)
-        assert result.to_pydict() == {"cnt": [4]}
-
-    result = ctx.sql("SELECT * FROM csv3").collect()
-    result = pa.Table.from_batches(result)
-    assert result.schema == alternative_schema
-
-    with pytest.raises(
-        ValueError, match="Delimiter must be a single character"
-    ):
-        ctx.register_csv("csv4", path, delimiter="wrong")
-
-
-def test_register_parquet(ctx, tmp_path):
-    path = helpers.write_parquet(tmp_path / "a.parquet", helpers.data())
-    ctx.register_parquet("t", path)
-    assert ctx.tables() == {"t"}
-
-    result = ctx.sql("SELECT COUNT(a) AS cnt FROM t").collect()
-    result = pa.Table.from_batches(result)
-    assert result.to_pydict() == {"cnt": [100]}
-
-
-def test_register_parquet_partitioned(ctx, tmp_path):
-    dir_root = tmp_path / "dataset_parquet_partitioned"
-    dir_root.mkdir(exist_ok=False)
-    (dir_root / "grp=a").mkdir(exist_ok=False)
-    (dir_root / "grp=b").mkdir(exist_ok=False)
-
-    table = pa.Table.from_arrays(
-        [
-            [1, 2, 3, 4],
-            ["a", "b", "c", "d"],
-            [1.1, 2.2, 3.3, 4.4],
-        ],
-        names=["int", "str", "float"],
-    )
-    pa.parquet.write_table(table.slice(0, 3), dir_root / "grp=a/file.parquet")
-    pa.parquet.write_table(table.slice(3, 4), dir_root / "grp=b/file.parquet")
-
-    ctx.register_parquet(
-        "datapp",
-        str(dir_root),
-        table_partition_cols=[("grp", "string")],
-        parquet_pruning=True,
-        file_extension=".parquet",
-    )
-    assert ctx.tables() == {"datapp"}
-
-    result = ctx.sql(
-        "SELECT grp, COUNT(*) AS cnt FROM datapp GROUP BY grp"
-    ).collect()
-    result = pa.Table.from_batches(result)
-
-    rd = result.to_pydict()
-    assert dict(zip(rd["grp"], rd["cnt"])) == {"a": 3, "b": 1}
-
-
-def test_register_dataset(ctx, tmp_path):
-    path = helpers.write_parquet(tmp_path / "a.parquet", helpers.data())
-    dataset = ds.dataset(path, format="parquet")
-
-    ctx.register_dataset("t", dataset)
-    assert ctx.tables() == {"t"}
-
-    result = ctx.sql("SELECT COUNT(a) AS cnt FROM t").collect()
-    result = pa.Table.from_batches(result)
-    assert result.to_pydict() == {"cnt": [100]}
-
-
-def test_execute(ctx, tmp_path):
-    data = [1, 1, 2, 2, 3, 11, 12]
-
-    # single column, "a"
-    path = helpers.write_parquet(tmp_path / "a.parquet", pa.array(data))
-    ctx.register_parquet("t", path)
-
-    assert ctx.tables() == {"t"}
-
-    # count
-    result = ctx.sql(
-        "SELECT COUNT(a) AS cnt FROM t WHERE a IS NOT NULL"
-    ).collect()
-
-    expected = pa.array([7], pa.int64())
-    expected = [pa.RecordBatch.from_arrays([expected], ["cnt"])]
-    assert
result == expected - - # where - expected = pa.array([2], pa.int64()) - expected = [pa.RecordBatch.from_arrays([expected], ["cnt"])] - result = ctx.sql("SELECT COUNT(a) AS cnt FROM t WHERE a > 10").collect() - assert result == expected - - # group by - results = ctx.sql( - "SELECT CAST(a as int) AS a, COUNT(a) AS cnt FROM t GROUP BY a" - ).collect() - - # group by returns batches - result_keys = [] - result_values = [] - for result in results: - pydict = result.to_pydict() - result_keys.extend(pydict["a"]) - result_values.extend(pydict["cnt"]) - - result_keys, result_values = ( - list(t) for t in zip(*sorted(zip(result_keys, result_values))) - ) - - assert result_keys == [1, 2, 3, 11, 12] - assert result_values == [2, 2, 1, 1, 1] - - # order by - result = ctx.sql( - "SELECT a, CAST(a AS int) AS a_int FROM t ORDER BY a DESC LIMIT 2" - ).collect() - expected_a = pa.array([50.0219, 50.0152], pa.float64()) - expected_cast = pa.array([50, 50], pa.int32()) - expected = [ - pa.RecordBatch.from_arrays([expected_a, expected_cast], ["a", "a_int"]) - ] - np.testing.assert_equal(expected[0].column(1), expected[0].column(1)) - - -def test_cast(ctx, tmp_path): - """ - Verify that we can cast - """ - path = helpers.write_parquet(tmp_path / "a.parquet", helpers.data()) - ctx.register_parquet("t", path) - - valid_types = [ - "smallint", - "int", - "bigint", - "float(32)", - "float(64)", - "float", - ] - - select = ", ".join( - [f"CAST(9 AS {t}) AS A{i}" for i, t in enumerate(valid_types)] - ) - - # can execute, which implies that we can cast - ctx.sql(f"SELECT {select} FROM t").collect() - - -@pytest.mark.parametrize( - ("fn", "input_types", "output_type", "input_values", "expected_values"), - [ - ( - lambda x: x, - [pa.float64()], - pa.float64(), - [-1.2, None, 1.2], - [-1.2, None, 1.2], - ), - ( - lambda x: x.is_null(), - [pa.float64()], - pa.bool_(), - [-1.2, None, 1.2], - [False, True, False], - ), - ], -) -def test_udf( - ctx, tmp_path, fn, input_types, output_type, 
input_values, expected_values -): - # write to disk - path = helpers.write_parquet( - tmp_path / "a.parquet", pa.array(input_values) - ) - ctx.register_parquet("t", path) - - func = udf( - fn, input_types, output_type, name="func", volatility="immutable" - ) - ctx.register_udf(func) - - batches = ctx.sql("SELECT func(a) AS tt FROM t").collect() - result = batches[0].column(0) - - assert result == pa.array(expected_values) - - -_null_mask = np.array([False, True, False]) - - -@pytest.mark.parametrize( - "arr", - [ - pa.array(["a", "b", "c"], pa.utf8(), _null_mask), - pa.array(["a", "b", "c"], pa.large_utf8(), _null_mask), - pa.array([b"1", b"2", b"3"], pa.binary(), _null_mask), - pa.array([b"1111", b"2222", b"3333"], pa.large_binary(), _null_mask), - pa.array([False, True, True], None, _null_mask), - pa.array([0, 1, 2], None), - helpers.data_binary_other(), - helpers.data_date32(), - helpers.data_with_nans(), - # C data interface missing - pytest.param( - pa.array([b"1111", b"2222", b"3333"], pa.binary(4), _null_mask), - marks=pytest.mark.xfail, - ), - pytest.param(helpers.data_datetime("s"), marks=pytest.mark.xfail), - pytest.param(helpers.data_datetime("ms"), marks=pytest.mark.xfail), - pytest.param(helpers.data_datetime("us"), marks=pytest.mark.xfail), - pytest.param(helpers.data_datetime("ns"), marks=pytest.mark.xfail), - # Not writtable to parquet - pytest.param(helpers.data_timedelta("s"), marks=pytest.mark.xfail), - pytest.param(helpers.data_timedelta("ms"), marks=pytest.mark.xfail), - pytest.param(helpers.data_timedelta("us"), marks=pytest.mark.xfail), - pytest.param(helpers.data_timedelta("ns"), marks=pytest.mark.xfail), - ], -) -def test_simple_select(ctx, tmp_path, arr): - path = helpers.write_parquet(tmp_path / "a.parquet", arr) - ctx.register_parquet("t", path) - - batches = ctx.sql("SELECT a AS tt FROM t").collect() - result = batches[0].column(0) - - np.testing.assert_equal(result, arr) diff --git a/datafusion/tests/test_substrait.py 
b/datafusion/tests/test_substrait.py deleted file mode 100644 index 9a08b760e..000000000 --- a/datafusion/tests/test_substrait.py +++ /dev/null @@ -1,55 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. See the License for the -# specific language governing permissions and limitations -# under the License. - -import pyarrow as pa - -from datafusion import SessionContext -from datafusion import substrait as ss -import pytest - - -@pytest.fixture -def ctx(): - return SessionContext() - - -def test_substrait_serialization(ctx): - batch = pa.RecordBatch.from_arrays( - [pa.array([1, 2, 3]), pa.array([4, 5, 6])], - names=["a", "b"], - ) - - ctx.register_record_batches("t", [[batch]]) - - assert ctx.tables() == {"t"} - - # For now just make sure the method calls blow up - substrait_plan = ss.substrait.serde.serialize_to_plan( - "SELECT * FROM t", ctx - ) - substrait_bytes = ss.substrait.serde.serialize_bytes( - "SELECT * FROM t", ctx - ) - substrait_plan = ss.substrait.serde.deserialize_bytes(substrait_bytes) - logical_plan = ss.substrait.consumer.from_substrait_plan( - ctx, substrait_plan - ) - - # demonstrate how to create a DataFrame from a deserialized logical plan - df = ctx.create_dataframe_from_logical_plan(logical_plan) - - substrait_plan = ss.substrait.producer.to_substrait_plan(df.logical_plan()) diff --git 
a/datafusion/tests/test_udaf.py b/datafusion/tests/test_udaf.py deleted file mode 100644 index c2b29d199..000000000 --- a/datafusion/tests/test_udaf.py +++ /dev/null @@ -1,136 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. See the License for the -# specific language governing permissions and limitations -# under the License. - -from typing import List - -import pyarrow as pa -import pyarrow.compute as pc -import pytest - -from datafusion import Accumulator, SessionContext, column, udaf - - -class Summarize(Accumulator): - """ - Interface of a user-defined accumulation. - """ - - def __init__(self): - self._sum = pa.scalar(0.0) - - def state(self) -> List[pa.Scalar]: - return [self._sum] - - def update(self, values: pa.Array) -> None: - # Not nice since pyarrow scalars can't be summed yet. - # This breaks on `None` - self._sum = pa.scalar(self._sum.as_py() + pc.sum(values).as_py()) - - def merge(self, states: pa.Array) -> None: - # Not nice since pyarrow scalars can't be summed yet. 
- # This breaks on `None` - self._sum = pa.scalar(self._sum.as_py() + pc.sum(states).as_py()) - - def evaluate(self) -> pa.Scalar: - return self._sum - - -class NotSubclassOfAccumulator: - pass - - -class MissingMethods(Accumulator): - def __init__(self): - self._sum = pa.scalar(0) - - def state(self) -> List[pa.Scalar]: - return [self._sum] - - -@pytest.fixture -def df(): - ctx = SessionContext() - - # create a RecordBatch and a new DataFrame from it - batch = pa.RecordBatch.from_arrays( - [pa.array([1, 2, 3]), pa.array([4, 4, 6])], - names=["a", "b"], - ) - return ctx.create_dataframe([[batch]]) - - -@pytest.mark.skip(reason="df.collect() will hang, need more investigations") -def test_errors(df): - with pytest.raises(TypeError): - udaf( - NotSubclassOfAccumulator, - pa.float64(), - pa.float64(), - [pa.float64()], - volatility="immutable", - ) - - accum = udaf( - MissingMethods, - pa.int64(), - pa.int64(), - [pa.int64()], - volatility="immutable", - ) - df = df.aggregate([], [accum(column("a"))]) - - msg = ( - "Can't instantiate abstract class MissingMethods with abstract " - "methods evaluate, merge, update" - ) - with pytest.raises(Exception, match=msg): - df.collect() - - -def test_aggregate(df): - summarize = udaf( - Summarize, - pa.float64(), - pa.float64(), - [pa.float64()], - volatility="immutable", - ) - - df = df.aggregate([], [summarize(column("a"))]) - - # execute and collect the first (and only) batch - result = df.collect()[0] - - assert result.column(0) == pa.array([1.0 + 2.0 + 3.0]) - - -def test_group_by(df): - summarize = udaf( - Summarize, - pa.float64(), - pa.float64(), - [pa.float64()], - volatility="immutable", - ) - - df = df.aggregate([column("b")], [summarize(column("a"))]) - - batches = df.collect() - - arrays = [batch.column(1) for batch in batches] - joined = pa.concat_arrays(arrays) - assert joined == pa.array([1.0 + 2.0, 3.0]) diff --git a/dev/changelog/43.0.0.md b/dev/changelog/43.0.0.md new file mode 100644 index 
000000000..bbb766910 --- /dev/null +++ b/dev/changelog/43.0.0.md @@ -0,0 +1,73 @@ + + +# Apache DataFusion Python 43.0.0 Changelog + +This release consists of 26 commits from 7 contributors. See credits at the end of this changelog for more information. + +**Implemented enhancements:** + +- feat: expose `drop` method [#913](https://github.com/apache/datafusion-python/pull/913) (ion-elgreco) +- feat: expose `join_on` [#914](https://github.com/apache/datafusion-python/pull/914) (ion-elgreco) +- feat: add fill_null/nan expressions [#919](https://github.com/apache/datafusion-python/pull/919) (ion-elgreco) +- feat: add `with_columns` [#909](https://github.com/apache/datafusion-python/pull/909) (ion-elgreco) +- feat: add `cast` to DataFrame [#916](https://github.com/apache/datafusion-python/pull/916) (ion-elgreco) +- feat: add `head`, `tail` methods [#915](https://github.com/apache/datafusion-python/pull/915) (ion-elgreco) + +**Fixed bugs:** + +- fix: remove use of deprecated `make_scalar_function` [#906](https://github.com/apache/datafusion-python/pull/906) (Michael-J-Ward) +- fix: udwf example [#948](https://github.com/apache/datafusion-python/pull/948) (mesejo) + +**Other:** + +- Ts/minor updates release process [#903](https://github.com/apache/datafusion-python/pull/903) (timsaucer) +- build(deps): bump pyo3 from 0.22.3 to 0.22.4 [#910](https://github.com/apache/datafusion-python/pull/910) (dependabot[bot]) +- refactor: `from_arrow` use protocol typehints [#917](https://github.com/apache/datafusion-python/pull/917) (ion-elgreco) +- Change requires-python version in pyproject.toml [#924](https://github.com/apache/datafusion-python/pull/924) (kosiew) +- chore: deprecate `select_columns` [#911](https://github.com/apache/datafusion-python/pull/911) (ion-elgreco) +- build(deps): bump uuid from 1.10.0 to 1.11.0 [#927](https://github.com/apache/datafusion-python/pull/927) (dependabot[bot]) +- Add array_empty scalar function 
[#931](https://github.com/apache/datafusion-python/pull/931) (kosiew) +- add `cardinality` function to calculate total distinct elements in an array [#937](https://github.com/apache/datafusion-python/pull/937) (kosiew) +- Add empty scalar function (alias of array_empty), fix a small typo [#938](https://github.com/apache/datafusion-python/pull/938) (kosiew) +- README How to develop section now also works on Apple M1 [#940](https://github.com/apache/datafusion-python/pull/940) (drauschenbach) +- refactor: dataframe `join` params [#912](https://github.com/apache/datafusion-python/pull/912) (ion-elgreco) +- Upgrade to Datafusion 43 [#905](https://github.com/apache/datafusion-python/pull/905) (Michael-J-Ward) +- build(deps): bump tokio from 1.40.0 to 1.41.1 [#946](https://github.com/apache/datafusion-python/pull/946) (dependabot[bot]) +- Add list_cat, list_concat, list_repeat [#942](https://github.com/apache/datafusion-python/pull/942) (kosiew) +- Add foreign table providers [#921](https://github.com/apache/datafusion-python/pull/921) (timsaucer) +- Add make_list and tests for make_list, make_array [#949](https://github.com/apache/datafusion-python/pull/949) (kosiew) +- Documentation updates: simplify examples and add section on data sources [#955](https://github.com/apache/datafusion-python/pull/955) (timsaucer) +- Add datafusion.extract [#959](https://github.com/apache/datafusion-python/pull/959) (kosiew) + +## Credits + +Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor. + +``` + 9 Ion Koutsouris + 7 kosiew + 3 Tim Saucer + 3 dependabot[bot] + 2 Michael J Ward + 1 Daniel Mesejo + 1 David Rauschenbach +``` + +Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release. 
diff --git a/dev/changelog/44.0.0.md b/dev/changelog/44.0.0.md new file mode 100644 index 000000000..c5ed4bdb0 --- /dev/null +++ b/dev/changelog/44.0.0.md @@ -0,0 +1,58 @@ + + +# Apache DataFusion Python 44.0.0 Changelog + +This release consists of 12 commits from 5 contributors. See credits at the end of this changelog for more information. + +**Implemented enhancements:** + +- feat: support enable_url_table config [#980](https://github.com/apache/datafusion-python/pull/980) (chenkovsky) +- feat: remove DataFusion pyarrow feat [#1000](https://github.com/apache/datafusion-python/pull/1000) (timsaucer) + +**Fixed bugs:** + +- fix: correct LZ0 to LZO in compression options [#995](https://github.com/apache/datafusion-python/pull/995) (kosiew) + +**Other:** + +- Add arrow cast [#962](https://github.com/apache/datafusion-python/pull/962) (kosiew) +- Fix small issues in pyproject.toml [#976](https://github.com/apache/datafusion-python/pull/976) (kylebarron) +- chore: set validation and type hint for ffi tableprovider [#983](https://github.com/apache/datafusion-python/pull/983) (ion-elgreco) +- Support async iteration of RecordBatchStream [#975](https://github.com/apache/datafusion-python/pull/975) (kylebarron) +- Chore/upgrade datafusion 44 [#973](https://github.com/apache/datafusion-python/pull/973) (timsaucer) +- Default to ZSTD compression when writing Parquet [#981](https://github.com/apache/datafusion-python/pull/981) (kosiew) +- Feat/use uv python management [#994](https://github.com/apache/datafusion-python/pull/994) (timsaucer) +- minor: Update dependencies prior to release [#999](https://github.com/apache/datafusion-python/pull/999) (timsaucer) +- Apply import ordering in ruff check [#1001](https://github.com/apache/datafusion-python/pull/1001) (timsaucer) + +## Credits + +Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor. 
+ +``` + 5 Tim Saucer + 3 kosiew + 2 Kyle Barron + 1 Chongchen Chen + 1 Ion Koutsouris +``` + +Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release. + diff --git a/dev/changelog/45.0.0.md b/dev/changelog/45.0.0.md new file mode 100644 index 000000000..93659b171 --- /dev/null +++ b/dev/changelog/45.0.0.md @@ -0,0 +1,42 @@ + + +# Apache DataFusion Python 45.0.0 Changelog + +This release consists of 2 commits from 2 contributors. See credits at the end of this changelog for more information. + +**Fixed bugs:** + +- fix: add to_timestamp_nanos [#1020](https://github.com/apache/datafusion-python/pull/1020) (chenkovsky) + +**Other:** + +- Chore/upgrade datafusion 45 [#1010](https://github.com/apache/datafusion-python/pull/1010) (kevinjqliu) + +## Credits + +Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor. + +``` + 1 Kevin Liu + 1 Tim Saucer +``` + +Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release. + diff --git a/dev/changelog/46.0.0.md b/dev/changelog/46.0.0.md new file mode 100644 index 000000000..3e5768099 --- /dev/null +++ b/dev/changelog/46.0.0.md @@ -0,0 +1,73 @@ + + +# Apache DataFusion Python 46.0.0 Changelog + +This release consists of 21 commits from 11 contributors. See credits at the end of this changelog for more information. 
+ +**Implemented enhancements:** + +- feat: reads using global ctx [#982](https://github.com/apache/datafusion-python/pull/982) (ion-elgreco) +- feat: Implementation of udf and udaf decorator [#1040](https://github.com/apache/datafusion-python/pull/1040) (CrystalZhou0529) +- feat: expose regex_count function [#1066](https://github.com/apache/datafusion-python/pull/1066) (nirnayroy) +- feat: Update DataFusion dependency to 46 [#1079](https://github.com/apache/datafusion-python/pull/1079) (timsaucer) + +**Fixed bugs:** + +- fix: add to_timestamp_nanos [#1020](https://github.com/apache/datafusion-python/pull/1020) (chenkovsky) +- fix: type checking [#993](https://github.com/apache/datafusion-python/pull/993) (chenkovsky) + +**Other:** + +- [infra] Fail Clippy on rust build warnings [#1029](https://github.com/apache/datafusion-python/pull/1029) (kevinjqliu) +- Add user documentation for the FFI approach [#1031](https://github.com/apache/datafusion-python/pull/1031) (timsaucer) +- build(deps): bump arrow from 54.1.0 to 54.2.0 [#1035](https://github.com/apache/datafusion-python/pull/1035) (dependabot[bot]) +- Chore: Release datafusion-python 45 [#1024](https://github.com/apache/datafusion-python/pull/1024) (timsaucer) +- Enable Dataframe to be converted into views which can be used in register_table [#1016](https://github.com/apache/datafusion-python/pull/1016) (kosiew) +- Add ruff check for missing futures import [#1052](https://github.com/apache/datafusion-python/pull/1052) (timsaucer) +- Enable take comments to assign issues to users [#1058](https://github.com/apache/datafusion-python/pull/1058) (timsaucer) +- Update python min version to 3.9 [#1043](https://github.com/apache/datafusion-python/pull/1043) (kevinjqliu) +- feat/improve ruff test coverage [#1055](https://github.com/apache/datafusion-python/pull/1055) (timsaucer) +- feat/making global context accessible for users [#1060](https://github.com/apache/datafusion-python/pull/1060) (jsai28) +- Renaming Internal 
Structs [#1059](https://github.com/apache/datafusion-python/pull/1059) (Spaarsh) +- test: add pytest asyncio tests [#1063](https://github.com/apache/datafusion-python/pull/1063) (jsai28) +- Add decorator for udwf [#1061](https://github.com/apache/datafusion-python/pull/1061) (kosiew) +- Add additional ruff suggestions [#1062](https://github.com/apache/datafusion-python/pull/1062) (Spaarsh) +- Improve collection during repr and repr_html [#1036](https://github.com/apache/datafusion-python/pull/1036) (timsaucer) + +## Credits + +Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor. + +``` + 7 Tim Saucer + 2 Kevin Liu + 2 Spaarsh + 2 jsai28 + 2 kosiew + 1 Chen Chongchen + 1 Chongchen Chen + 1 Crystal Zhou + 1 Ion Koutsouris + 1 Nirnay Roy + 1 dependabot[bot] +``` + +Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release. + diff --git a/dev/changelog/47.0.0.md b/dev/changelog/47.0.0.md new file mode 100644 index 000000000..a7ed90313 --- /dev/null +++ b/dev/changelog/47.0.0.md @@ -0,0 +1,64 @@ + + +# Apache DataFusion Python 47.0.0 Changelog + +This release consists of 23 commits from 5 contributors. See credits at the end of this changelog for more information. 
+ +**Implemented enhancements:** + +- feat: support unparser [#1088](https://github.com/apache/datafusion-python/pull/1088) (chenkovsky) +- feat: update datafusion dependency 47 [#1107](https://github.com/apache/datafusion-python/pull/1107) (timsaucer) +- feat: alias with metadata [#1111](https://github.com/apache/datafusion-python/pull/1111) (chenkovsky) +- feat: add missing PyLogicalPlan to_variant [#1085](https://github.com/apache/datafusion-python/pull/1085) (chenkovsky) +- feat: add user defined table function support [#1113](https://github.com/apache/datafusion-python/pull/1113) (timsaucer) + +**Fixed bugs:** + +- fix: recursive import [#1117](https://github.com/apache/datafusion-python/pull/1117) (chenkovsky) + +**Other:** + +- Update changelog and version number [#1089](https://github.com/apache/datafusion-python/pull/1089) (timsaucer) +- Documentation updates: mention correct dataset on basics page [#1081](https://github.com/apache/datafusion-python/pull/1081) (floscha) +- Add Configurable HTML Table Formatter for DataFusion DataFrames in Python [#1100](https://github.com/apache/datafusion-python/pull/1100) (kosiew) +- Add DataFrame usage guide with HTML rendering customization options [#1108](https://github.com/apache/datafusion-python/pull/1108) (kosiew) +- 1075/enhancement/Make col class with __getattr__ [#1076](https://github.com/apache/datafusion-python/pull/1076) (deanm0000) +- 1064/enhancement/add functions to Expr class [#1074](https://github.com/apache/datafusion-python/pull/1074) (deanm0000) +- ci: require approving review [#1122](https://github.com/apache/datafusion-python/pull/1122) (timsaucer) +- Partial fix for 1078: Enhance DataFrame Formatter Configuration with Memory and Display Controls [#1119](https://github.com/apache/datafusion-python/pull/1119) (kosiew) +- Add fill_null method to DataFrame API for handling missing values [#1019](https://github.com/apache/datafusion-python/pull/1019) (kosiew) +- minor: reduce error size 
[#1126](https://github.com/apache/datafusion-python/pull/1126) (timsaucer) +- Move the udf module to user_defined [#1112](https://github.com/apache/datafusion-python/pull/1112) (timsaucer) +- add unit tests for expression functions [#1121](https://github.com/apache/datafusion-python/pull/1121) (timsaucer) + +## Credits + +Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor. + +``` + 12 Tim Saucer + 4 Chen Chongchen + 4 kosiew + 2 deanm0000 + 1 Florian Schäfer +``` + +Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release. + diff --git a/dev/changelog/48.0.0.md b/dev/changelog/48.0.0.md new file mode 100644 index 000000000..80bc61aca --- /dev/null +++ b/dev/changelog/48.0.0.md @@ -0,0 +1,59 @@ + + +# Apache DataFusion Python 48.0.0 Changelog + +This release consists of 15 commits from 6 contributors. See credits at the end of this changelog for more information. 
+ +**Implemented enhancements:** + +- feat: upgrade df48 dependency [#1143](https://github.com/apache/datafusion-python/pull/1143) (timsaucer) +- feat: Support Parquet writer options [#1123](https://github.com/apache/datafusion-python/pull/1123) (nuno-faria) +- feat: dataframe string formatter [#1170](https://github.com/apache/datafusion-python/pull/1170) (timsaucer) +- feat: collect once during display() in jupyter notebooks [#1167](https://github.com/apache/datafusion-python/pull/1167) (timsaucer) +- feat: python based catalog and schema provider [#1156](https://github.com/apache/datafusion-python/pull/1156) (timsaucer) +- feat: add FFI support for user defined functions [#1145](https://github.com/apache/datafusion-python/pull/1145) (timsaucer) + +**Other:** + +- Release DataFusion 47.0.0 [#1130](https://github.com/apache/datafusion-python/pull/1130) (timsaucer) +- Add a documentation build step in CI [#1139](https://github.com/apache/datafusion-python/pull/1139) (crystalxyz) +- Add DataFrame API Documentation for DataFusion Python [#1132](https://github.com/apache/datafusion-python/pull/1132) (kosiew) +- Add Interruptible Query Execution in Jupyter via KeyboardInterrupt Support [#1141](https://github.com/apache/datafusion-python/pull/1141) (kosiew) +- Support types other than String and Int for partition columns [#1154](https://github.com/apache/datafusion-python/pull/1154) (miclegr) +- Fix signature of `__arrow_c_stream__` [#1168](https://github.com/apache/datafusion-python/pull/1168) (kylebarron) +- Consolidate DataFrame Docs: Merge HTML Rendering Section as Subpage [#1161](https://github.com/apache/datafusion-python/pull/1161) (kosiew) +- Add compression_level support to ParquetWriterOptions and enhance write_parquet to accept full options object [#1169](https://github.com/apache/datafusion-python/pull/1169) (kosiew) +- Simplify HTML Formatter Style Handling Using Script Injection [#1177](https://github.com/apache/datafusion-python/pull/1177) (kosiew) + +## 
Credits + +Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor. + +``` + 6 Tim Saucer + 5 kosiew + 1 Crystal Zhou + 1 Kyle Barron + 1 Michele Gregori + 1 Nuno Faria +``` + +Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release. + diff --git a/dev/changelog/49.0.0.md b/dev/changelog/49.0.0.md new file mode 100644 index 000000000..008bd43bc --- /dev/null +++ b/dev/changelog/49.0.0.md @@ -0,0 +1,61 @@ + + +# Apache DataFusion Python 49.0.0 Changelog + +This release consists of 16 commits from 7 contributors. See credits at the end of this changelog for more information. + +**Fixed bugs:** + +- fix(build): Include build.rs in published crates [#1199](https://github.com/apache/datafusion-python/pull/1199) (colinmarc) + +**Other:** + +- 48.0.0 Release [#1175](https://github.com/apache/datafusion-python/pull/1175) (timsaucer) +- Update CI rules [#1188](https://github.com/apache/datafusion-python/pull/1188) (timsaucer) +- Fix Python UDAF Accumulator Interface example to Properly Handle State and Updates with List[Array] Types [#1192](https://github.com/apache/datafusion-python/pull/1192) (kosiew) +- chore: Upgrade datafusion to version 49 [#1200](https://github.com/apache/datafusion-python/pull/1200) (nuno-faria) +- Update how to dev instructions [#1179](https://github.com/apache/datafusion-python/pull/1179) (ntjohnson1) +- build(deps): bump object_store from 0.12.2 to 0.12.3 [#1189](https://github.com/apache/datafusion-python/pull/1189) (dependabot[bot]) +- build(deps): bump uuid from 1.17.0 to 1.18.0 [#1202](https://github.com/apache/datafusion-python/pull/1202) (dependabot[bot]) +- build(deps): bump async-trait from 0.1.88 to 0.1.89 [#1203](https://github.com/apache/datafusion-python/pull/1203) (dependabot[bot]) +- build(deps): bump slab from 0.4.10 to 0.4.11 [#1205](https://github.com/apache/datafusion-python/pull/1205) 
(dependabot[bot]) +- Improved window and aggregate function signature [#1187](https://github.com/apache/datafusion-python/pull/1187) (timsaucer) +- Optional improvements in verification instructions [#1183](https://github.com/apache/datafusion-python/pull/1183) (paleolimbot) +- Improve `show()` output for empty DataFrames [#1208](https://github.com/apache/datafusion-python/pull/1208) (kosiew) +- build(deps): bump actions/download-artifact from 4 to 5 [#1201](https://github.com/apache/datafusion-python/pull/1201) (dependabot[bot]) +- build(deps): bump url from 2.5.4 to 2.5.7 [#1210](https://github.com/apache/datafusion-python/pull/1210) (dependabot[bot]) +- build(deps): bump actions/checkout from 4 to 5 [#1204](https://github.com/apache/datafusion-python/pull/1204) (dependabot[bot]) + +## Credits + +Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor. + +``` + 7 dependabot[bot] + 3 Tim Saucer + 2 kosiew + 1 Colin Marc + 1 Dewey Dunnington + 1 Nick + 1 Nuno Faria +``` + +Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release. + diff --git a/dev/changelog/50.0.0.md b/dev/changelog/50.0.0.md new file mode 100644 index 000000000..c3f09d180 --- /dev/null +++ b/dev/changelog/50.0.0.md @@ -0,0 +1,60 @@ + + +# Apache DataFusion Python 50.0.0 Changelog + +This release consists of 12 commits from 7 contributors. See credits at the end of this changelog for more information. 
+
+**Implemented enhancements:**
+
+- feat: allow passing a slice to an expression with the [] indexing [#1215](https://github.com/apache/datafusion-python/pull/1215) (timsaucer)
+
+**Documentation updates:**
+
+- docs: fix CaseBuilder documentation example [#1225](https://github.com/apache/datafusion-python/pull/1225) (IndexSeek)
+- docs: update link to user example for custom table provider [#1224](https://github.com/apache/datafusion-python/pull/1224) (IndexSeek)
+- docs: add apache iceberg as datafusion data source [#1240](https://github.com/apache/datafusion-python/pull/1240) (kevinjqliu)
+
+**Other:**
+
+- 49.0.0 release [#1211](https://github.com/apache/datafusion-python/pull/1211) (timsaucer)
+- Update development guide in README.md [#1213](https://github.com/apache/datafusion-python/pull/1213) (YKoustubhRao)
+- Add benchmark script and documentation for maximizing CPU usage in DataFusion Python [#1216](https://github.com/apache/datafusion-python/pull/1216) (kosiew)
+- Fixing a few Typos [#1220](https://github.com/apache/datafusion-python/pull/1220) (ntjohnson1)
+- Set fail on warning for documentation generation [#1218](https://github.com/apache/datafusion-python/pull/1218) (timsaucer)
+- chore: remove redundant error transformation [#1232](https://github.com/apache/datafusion-python/pull/1232) (mesejo)
+- Support string column identifiers for sort/aggregate/window and stricter Expr validation [#1221](https://github.com/apache/datafusion-python/pull/1221) (kosiew)
+- Prepare for DF50 [#1231](https://github.com/apache/datafusion-python/pull/1231) (timsaucer)
+
+## Credits
+
+Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.
+
+```
+ 4 Tim Saucer
+ 2 Tyler White
+ 2 kosiew
+ 1 Daniel Mesejo
+ 1 Kevin Liu
+ 1 Koustubh Rao
+ 1 Nick
+```
+
+Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release.
+
diff --git a/dev/changelog/50.1.0.md b/dev/changelog/50.1.0.md
new file mode 100644
index 000000000..3b9ff84ff
--- /dev/null
+++ b/dev/changelog/50.1.0.md
@@ -0,0 +1,57 @@
+
+
+# Apache DataFusion Python 50.1.0 Changelog
+
+This release consists of 11 commits from 7 contributors. See credits at the end of this changelog for more information.
+
+**Breaking changes:**
+
+- Unify Table representations [#1256](https://github.com/apache/datafusion-python/pull/1256) (timsaucer)
+
+**Implemented enhancements:**
+
+- feat: expose DataFrame.write_table [#1264](https://github.com/apache/datafusion-python/pull/1264) (timsaucer)
+- feat: expose `DataFrame.parse_sql_expr` [#1274](https://github.com/apache/datafusion-python/pull/1274) (milenkovicm)
+
+**Other:**
+
+- Update version number, add changelog [#1249](https://github.com/apache/datafusion-python/pull/1249) (timsaucer)
+- Fix drop() method to handle quoted column names consistently [#1242](https://github.com/apache/datafusion-python/pull/1242) (H0TB0X420)
+- Make Session Context `pyclass` frozen so interior mutability is only managed by rust [#1248](https://github.com/apache/datafusion-python/pull/1248) (ntjohnson1)
+- macos-13 is deprecated [#1259](https://github.com/apache/datafusion-python/pull/1259) (kevinjqliu)
+- Freeze PyO3 wrappers & introduce interior mutability to avoid PyO3 borrow errors [#1253](https://github.com/apache/datafusion-python/pull/1253) (kosiew)
+- chore: update dependencies [#1269](https://github.com/apache/datafusion-python/pull/1269) (timsaucer)
+
+## Credits
+
+Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.
+
+```
+ 4 Tim Saucer
+ 2 Siew Kam Onn
+ 1 H0TB0X420
+ 1 Kevin Liu
+ 1 Marko Milenković
+ 1 Nick
+ 1 kosiew
+```
+
+Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release.
+ diff --git a/dev/changelog/51.0.0.md b/dev/changelog/51.0.0.md new file mode 100644 index 000000000..cc157eb0d --- /dev/null +++ b/dev/changelog/51.0.0.md @@ -0,0 +1,74 @@ + + +# Apache DataFusion Python 51.0.0 Changelog + +This release consists of 23 commits from 7 contributors. See credits at the end of this changelog for more information. + +**Breaking changes:** + +- feat: reduce duplicate fields on join [#1184](https://github.com/apache/datafusion-python/pull/1184) (timsaucer) + +**Implemented enhancements:** + +- feat: expose `select_exprs` method on DataFrame [#1271](https://github.com/apache/datafusion-python/pull/1271) (milenkovicm) +- feat: allow DataFrame.filter to accept SQL strings [#1276](https://github.com/apache/datafusion-python/pull/1276) (K-dash) +- feat: add temporary view option for into_view [#1267](https://github.com/apache/datafusion-python/pull/1267) (timsaucer) +- feat: support session token parameter for AmazonS3 [#1275](https://github.com/apache/datafusion-python/pull/1275) (GCHQDeveloper028) +- feat: `with_column` supports SQL expression [#1284](https://github.com/apache/datafusion-python/pull/1284) (milenkovicm) +- feat: Add SQL expression for `repartition_by_hash` [#1285](https://github.com/apache/datafusion-python/pull/1285) (milenkovicm) +- feat: Add SQL expression support for `with_columns` [#1286](https://github.com/apache/datafusion-python/pull/1286) (milenkovicm) + +**Fixed bugs:** + +- fix: use coalesce instead of drop_duplicate_keys for join [#1318](https://github.com/apache/datafusion-python/pull/1318) (mesejo) +- fix: Inconsistent schemas when converting to pyarrow [#1315](https://github.com/apache/datafusion-python/pull/1315) (nuno-faria) + +**Other:** + +- Release 50.1 [#1281](https://github.com/apache/datafusion-python/pull/1281) (timsaucer) +- Update python minimum version to 3.10 [#1296](https://github.com/apache/datafusion-python/pull/1296) (timsaucer) +- chore: update datafusion minor version 
[#1297](https://github.com/apache/datafusion-python/pull/1297) (timsaucer) +- Enable remaining pylints [#1298](https://github.com/apache/datafusion-python/pull/1298) (timsaucer) +- Add Arrow C streaming, DataFrame iteration, and OOM-safe streaming execution [#1222](https://github.com/apache/datafusion-python/pull/1222) (kosiew) +- Add PyCapsule Type Support and Type Hint Enhancements for AggregateUDF in DataFusion Python Bindings [#1277](https://github.com/apache/datafusion-python/pull/1277) (kosiew) +- Add collect_column to dataframe [#1302](https://github.com/apache/datafusion-python/pull/1302) (timsaucer) +- chore: apply cargo fmt with import organization [#1303](https://github.com/apache/datafusion-python/pull/1303) (timsaucer) +- Feat/parameterized sql queries [#964](https://github.com/apache/datafusion-python/pull/964) (timsaucer) +- Upgrade to Datafusion 51 [#1311](https://github.com/apache/datafusion-python/pull/1311) (nuno-faria) +- minor: resolve build errors after latest merge into main [#1325](https://github.com/apache/datafusion-python/pull/1325) (timsaucer) +- Update build workflow link [#1330](https://github.com/apache/datafusion-python/pull/1330) (timsaucer) +- Do not convert pyarrow scalar values to plain python types when passing as `lit` [#1319](https://github.com/apache/datafusion-python/pull/1319) (timsaucer) + +## Credits + +Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor. + +``` + 12 Tim Saucer + 4 Marko Milenković + 2 Nuno Faria + 2 kosiew + 1 Daniel Mesejo + 1 GCHQDeveloper028 + 1 𝕂 +``` + +Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release. 
+ diff --git a/dev/changelog/52.0.0.md b/dev/changelog/52.0.0.md new file mode 100644 index 000000000..3f848bb47 --- /dev/null +++ b/dev/changelog/52.0.0.md @@ -0,0 +1,78 @@ + + +# Apache DataFusion Python 52.0.0 Changelog + +This release consists of 26 commits from 9 contributors. See credits at the end of this changelog for more information. + +**Implemented enhancements:** + +- feat: add CatalogProviderList support [#1363](https://github.com/apache/datafusion-python/pull/1363) (timsaucer) +- feat: add support for generating JSON formatted substrait plan [#1376](https://github.com/apache/datafusion-python/pull/1376) (Prathamesh9284) +- feat: add regexp_instr function [#1382](https://github.com/apache/datafusion-python/pull/1382) (mesejo) + +**Fixed bugs:** + +- fix: mangled errors [#1377](https://github.com/apache/datafusion-python/pull/1377) (mesejo) + +**Documentation updates:** + +- docs: Clarify first_value usage in select vs aggregate [#1348](https://github.com/apache/datafusion-python/pull/1348) (AdMub) + +**Other:** + +- Release 51.0.0 [#1333](https://github.com/apache/datafusion-python/pull/1333) (timsaucer) +- Use explicit timer in unit test [#1338](https://github.com/apache/datafusion-python/pull/1338) (timsaucer) +- Add use_fabric_endpoint parameter to MicrosoftAzure class [#1357](https://github.com/apache/datafusion-python/pull/1357) (djouallah) +- Prepare for DF52 release [#1337](https://github.com/apache/datafusion-python/pull/1337) (timsaucer) +- build(deps): bump actions/checkout from 5 to 6 [#1310](https://github.com/apache/datafusion-python/pull/1310) (dependabot[bot]) +- build(deps): bump actions/download-artifact from 5 to 7 [#1321](https://github.com/apache/datafusion-python/pull/1321) (dependabot[bot]) +- build(deps): bump actions/upload-artifact from 4 to 6 [#1322](https://github.com/apache/datafusion-python/pull/1322) (dependabot[bot]) +- build(deps): bump actions/cache from 4 to 5 
[#1323](https://github.com/apache/datafusion-python/pull/1323) (dependabot[bot]) +- Pass Field information back and forth when using scalar UDFs [#1299](https://github.com/apache/datafusion-python/pull/1299) (timsaucer) +- Update dependency minor versions to prepare for DF52 release [#1368](https://github.com/apache/datafusion-python/pull/1368) (timsaucer) +- Improve displayed error by using `DataFusionError`'s `Display` trait [#1370](https://github.com/apache/datafusion-python/pull/1370) (abey79) +- Enforce DataFrame display memory limits with `max_rows` + `min_rows` constraint (deprecate `repr_rows`) [#1367](https://github.com/apache/datafusion-python/pull/1367) (kosiew) +- Implement all CSV reader options [#1361](https://github.com/apache/datafusion-python/pull/1361) (timsaucer) +- chore: add confirmation before tarball is released [#1372](https://github.com/apache/datafusion-python/pull/1372) (milenkovicm) +- Build in debug mode for PRs [#1375](https://github.com/apache/datafusion-python/pull/1375) (timsaucer) +- minor: remove ffi test wheel from distribution artifact [#1378](https://github.com/apache/datafusion-python/pull/1378) (timsaucer) +- chore: update rust 2024 edition [#1371](https://github.com/apache/datafusion-python/pull/1371) (timsaucer) +- Fix Python UDAF list-of-timestamps return by enforcing list-valued scalars and caching PyArrow types [#1347](https://github.com/apache/datafusion-python/pull/1347) (kosiew) +- minor: update cargo dependencies [#1383](https://github.com/apache/datafusion-python/pull/1383) (timsaucer) +- chore: bump Python version for RAT checking [#1386](https://github.com/apache/datafusion-python/pull/1386) (timsaucer) + +## Credits + +Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor. 
+
+```
+ 13 Tim Saucer
+ 4 dependabot[bot]
+ 2 Daniel Mesejo
+ 2 kosiew
+ 1 Adisa Mubarak (AdMub)
+ 1 Antoine Beyeler
+ 1 Dhanashri Prathamesh Iranna
+ 1 Marko Milenković
+ 1 Mimoune
+```
+
+Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release.
+
diff --git a/dev/changelog/53.0.0.md b/dev/changelog/53.0.0.md
new file mode 100644
index 000000000..3e27a852d
--- /dev/null
+++ b/dev/changelog/53.0.0.md
@@ -0,0 +1,107 @@
+
+
+# Apache DataFusion Python 53.0.0 Changelog
+
+This release consists of 52 commits from 9 contributors. See credits at the end of this changelog for more information.
+
+**Breaking changes:**
+
+- minor: remove deprecated interfaces [#1481](https://github.com/apache/datafusion-python/pull/1481) (timsaucer)
+
+**Implemented enhancements:**
+
+- feat: add to_time, to_local_time, to_date functions [#1387](https://github.com/apache/datafusion-python/pull/1387) (mesejo)
+- feat: Add FFI_TableProviderFactory support [#1396](https://github.com/apache/datafusion-python/pull/1396) (davisp)
+
+**Fixed bugs:**
+
+- fix: satisfy rustfmt check in lib.rs re-exports [#1406](https://github.com/apache/datafusion-python/pull/1406) (kevinjqliu)
+
+**Documentation updates:**
+
+- docs: clarify DataFusion 52 FFI session-parameter requirement for provider hooks [#1439](https://github.com/apache/datafusion-python/pull/1439) (kevinjqliu)
+
+**Other:**
+
+- Merge release 52.0.0 into main [#1389](https://github.com/apache/datafusion-python/pull/1389) (timsaucer)
+- Add workflow to verify release candidate on multiple systems [#1388](https://github.com/apache/datafusion-python/pull/1388) (timsaucer)
+- Allow running "verify release candidate" github workflow on Windows [#1392](https://github.com/apache/datafusion-python/pull/1392) (kevinjqliu)
+- ci: update pre-commit hooks, fix linting, and refresh dependencies [#1385](https://github.com/apache/datafusion-python/pull/1385) (dariocurr)
+- 
Add CI check for crates.io patches [#1407](https://github.com/apache/datafusion-python/pull/1407) (timsaucer) +- Enable doc tests in local and CI testing [#1409](https://github.com/apache/datafusion-python/pull/1409) (ntjohnson1) +- Upgrade to DataFusion 53 [#1402](https://github.com/apache/datafusion-python/pull/1402) (nuno-faria) +- Catch warnings in FFI unit tests [#1410](https://github.com/apache/datafusion-python/pull/1410) (timsaucer) +- Add docstring examples for Scalar trigonometric functions [#1411](https://github.com/apache/datafusion-python/pull/1411) (ntjohnson1) +- Create workspace with core and util crates [#1414](https://github.com/apache/datafusion-python/pull/1414) (timsaucer) +- Add docstring examples for Scalar regex, crypto, struct and other [#1422](https://github.com/apache/datafusion-python/pull/1422) (ntjohnson1) +- Add docstring examples for Scalar math functions [#1421](https://github.com/apache/datafusion-python/pull/1421) (ntjohnson1) +- Add docstring examples for Common utility functions [#1419](https://github.com/apache/datafusion-python/pull/1419) (ntjohnson1) +- Add docstring examples for Aggregate basic and bitwise/boolean functions [#1416](https://github.com/apache/datafusion-python/pull/1416) (ntjohnson1) +- Fix CI errors on main [#1432](https://github.com/apache/datafusion-python/pull/1432) (timsaucer) +- Add docstring examples for Scalar temporal functions [#1424](https://github.com/apache/datafusion-python/pull/1424) (ntjohnson1) +- Add docstring examples for Aggregate statistical and regression functions [#1417](https://github.com/apache/datafusion-python/pull/1417) (ntjohnson1) +- Add docstring examples for Scalar array/list functions [#1420](https://github.com/apache/datafusion-python/pull/1420) (ntjohnson1) +- Add docstring examples for Scalar string functions [#1423](https://github.com/apache/datafusion-python/pull/1423) (ntjohnson1) +- Add docstring examples for Aggregate window functions 
[#1418](https://github.com/apache/datafusion-python/pull/1418) (ntjohnson1) +- ci: pin third-party actions to Apache-approved SHAs [#1438](https://github.com/apache/datafusion-python/pull/1438) (kevinjqliu) +- minor: bump datafusion to release version [#1441](https://github.com/apache/datafusion-python/pull/1441) (timsaucer) +- ci: add swap during build, use tpchgen-cli [#1443](https://github.com/apache/datafusion-python/pull/1443) (timsaucer) +- Update remaining existing examples to make testable/standalone executable [#1437](https://github.com/apache/datafusion-python/pull/1437) (ntjohnson1) +- Do not run validate_pycapsule if pointer_checked is used [#1426](https://github.com/apache/datafusion-python/pull/1426) (Tpt) +- Implement configuration extension support [#1391](https://github.com/apache/datafusion-python/pull/1391) (timsaucer) +- Add a working, more complete example of using a catalog (docs) [#1427](https://github.com/apache/datafusion-python/pull/1427) (toppyy) +- chore: update dependencies [#1447](https://github.com/apache/datafusion-python/pull/1447) (timsaucer) +- Complete doc string examples for functions.py [#1435](https://github.com/apache/datafusion-python/pull/1435) (ntjohnson1) +- chore: enforce uv lockfile consistency in CI and pre-commit [#1398](https://github.com/apache/datafusion-python/pull/1398) (mesejo) +- CI: Add CodeQL workflow for GitHub Actions security scanning [#1408](https://github.com/apache/datafusion-python/pull/1408) (kevinjqliu) +- ci: update codespell paths [#1469](https://github.com/apache/datafusion-python/pull/1469) (timsaucer) +- Add missing datetime functions [#1467](https://github.com/apache/datafusion-python/pull/1467) (timsaucer) +- Add AI skill to check current repository against upstream APIs [#1460](https://github.com/apache/datafusion-python/pull/1460) (timsaucer) +- Add missing string function `contains` [#1465](https://github.com/apache/datafusion-python/pull/1465) (timsaucer) +- Add missing conditional 
functions [#1464](https://github.com/apache/datafusion-python/pull/1464) (timsaucer) +- Reduce peak memory usage during release builds to fix OOM on manylinux runners [#1445](https://github.com/apache/datafusion-python/pull/1445) (kevinjqliu) +- Add missing map functions [#1461](https://github.com/apache/datafusion-python/pull/1461) (timsaucer) +- minor: Fix pytest instructions in the README [#1477](https://github.com/apache/datafusion-python/pull/1477) (nuno-faria) +- Add missing array functions [#1468](https://github.com/apache/datafusion-python/pull/1468) (timsaucer) +- Add missing scalar functions [#1470](https://github.com/apache/datafusion-python/pull/1470) (timsaucer) +- Add missing aggregate functions [#1471](https://github.com/apache/datafusion-python/pull/1471) (timsaucer) +- Add missing Dataframe functions [#1472](https://github.com/apache/datafusion-python/pull/1472) (timsaucer) +- Add missing deregister methods to SessionContext [#1473](https://github.com/apache/datafusion-python/pull/1473) (timsaucer) +- Add missing registration methods [#1474](https://github.com/apache/datafusion-python/pull/1474) (timsaucer) +- Add missing SessionContext utility methods [#1475](https://github.com/apache/datafusion-python/pull/1475) (timsaucer) + +## Credits + +Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor. + +``` + 25 Tim Saucer + 13 Nick + 6 Kevin Liu + 2 Daniel Mesejo + 2 Nuno Faria + 1 Paul J. Davis + 1 Thomas Tanon + 1 Topias Pyykkönen + 1 dario curreri +``` + +Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release. 
+ diff --git a/dev/changelog/pre-43.0.0.md b/dev/changelog/pre-43.0.0.md new file mode 100644 index 000000000..ae3a2348a --- /dev/null +++ b/dev/changelog/pre-43.0.0.md @@ -0,0 +1,715 @@ + + +# DataFusion Python Changelog + +## [42.0.0](https://github.com/apache/datafusion-python/tree/42.0.0) (2024-10-06) + +This release consists of 20 commits from 6 contributors. See credits at the end of this changelog for more information. + +**Implemented enhancements:** + +- feat: expose between [#868](https://github.com/apache/datafusion-python/pull/868) (mesejo) +- feat: make register_csv accept a list of paths [#883](https://github.com/apache/datafusion-python/pull/883) (mesejo) +- feat: expose http object store [#885](https://github.com/apache/datafusion-python/pull/885) (mesejo) + +**Fixed bugs:** + +- fix: Calling `count` on a pyarrow dataset results in an error [#843](https://github.com/apache/datafusion-python/pull/843) (Michael-J-Ward) + +**Other:** + +- Upgrade datafusion [#867](https://github.com/apache/datafusion-python/pull/867) (emgeee) +- Feature/aggregates as windows [#871](https://github.com/apache/datafusion-python/pull/871) (timsaucer) +- Fix regression on register_udaf [#878](https://github.com/apache/datafusion-python/pull/878) (timsaucer) +- build(deps): upgrade setup-protoc action and protoc version number [#873](https://github.com/apache/datafusion-python/pull/873) (Michael-J-Ward) +- build(deps): bump prost-types from 0.13.2 to 0.13.3 [#881](https://github.com/apache/datafusion-python/pull/881) (dependabot[bot]) +- build(deps): bump prost from 0.13.2 to 0.13.3 [#882](https://github.com/apache/datafusion-python/pull/882) (dependabot[bot]) +- chore: remove XFAIL from passing tests [#884](https://github.com/apache/datafusion-python/pull/884) (Michael-J-Ward) +- Add user defined window function support [#880](https://github.com/apache/datafusion-python/pull/880) (timsaucer) +- build(deps): bump syn from 2.0.77 to 2.0.79 
[#886](https://github.com/apache/datafusion-python/pull/886) (dependabot[bot]) +- fix example of reading parquet from s3 [#896](https://github.com/apache/datafusion-python/pull/896) (sir-sigurd) +- release-testing [#889](https://github.com/apache/datafusion-python/pull/889) (Michael-J-Ward) +- chore(bench): fix create_tables.sql for tpch benchmark [#897](https://github.com/apache/datafusion-python/pull/897) (Michael-J-Ward) +- Add physical and logical plan conversion to and from protobuf [#892](https://github.com/apache/datafusion-python/pull/892) (timsaucer) +- Feature/instance udfs [#890](https://github.com/apache/datafusion-python/pull/890) (timsaucer) +- chore(ci): remove Mambaforge variant from CI [#894](https://github.com/apache/datafusion-python/pull/894) (Michael-J-Ward) +- Use OnceLock to store TokioRuntime [#895](https://github.com/apache/datafusion-python/pull/895) (Michael-J-Ward) + +## Credits + +Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor. + +``` + 7 Michael J Ward + 5 Tim Saucer + 3 Daniel Mesejo + 3 dependabot[bot] + 1 Matt Green + 1 Sergey Fedoseev +``` + +Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release. + +## [41.0.0](https://github.com/apache/datafusion-python/tree/41.0.0) (2024-09-09) + +This release consists of 19 commits from 6 contributors. See credits at the end of this changelog for more information. 
+ +**Implemented enhancements:** + +- feat: enable list of paths for read_csv [#824](https://github.com/apache/datafusion-python/pull/824) (mesejo) +- feat: better exception and message for table not found [#851](https://github.com/apache/datafusion-python/pull/851) (mesejo) +- feat: make cast accept built-in Python types [#858](https://github.com/apache/datafusion-python/pull/858) (mesejo) + +**Other:** + +- chore: Prepare for 40.0.0 release [#801](https://github.com/apache/datafusion-python/pull/801) (andygrove) +- Add typing-extensions dependency to pyproject [#805](https://github.com/apache/datafusion-python/pull/805) (timsaucer) +- Upgrade deps to datafusion 41 [#802](https://github.com/apache/datafusion-python/pull/802) (Michael-J-Ward) +- Fix SessionContext init with only SessionConfig [#827](https://github.com/apache/datafusion-python/pull/827) (jcrist) +- build(deps): upgrade actions/{upload,download}-artifact@v3 to v4 [#829](https://github.com/apache/datafusion-python/pull/829) (Michael-J-Ward) +- Run ruff format in CI [#837](https://github.com/apache/datafusion-python/pull/837) (timsaucer) +- Add PyCapsule support for Arrow import and export [#825](https://github.com/apache/datafusion-python/pull/825) (timsaucer) +- Feature/expose when function [#836](https://github.com/apache/datafusion-python/pull/836) (timsaucer) +- Add Window Functions for use with function builder [#808](https://github.com/apache/datafusion-python/pull/808) (timsaucer) +- chore: fix typos [#844](https://github.com/apache/datafusion-python/pull/844) (mesejo) +- build(ci): use proper mac runners [#841](https://github.com/apache/datafusion-python/pull/841) (Michael-J-Ward) +- Set of small features [#839](https://github.com/apache/datafusion-python/pull/839) (timsaucer) +- chore: fix docstrings, typos [#852](https://github.com/apache/datafusion-python/pull/852) (mesejo) +- chore: Use datafusion re-exported dependencies [#856](https://github.com/apache/datafusion-python/pull/856) 
(emgeee) +- add guidelines on separating python and rust code [#860](https://github.com/apache/datafusion-python/pull/860) (Michael-J-Ward) +- Update Aggregate functions to take builder parameters [#859](https://github.com/apache/datafusion-python/pull/859) (timsaucer) + +## Credits + +Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor. + +``` + 7 Tim Saucer + 5 Daniel Mesejo + 4 Michael J Ward + 1 Andy Grove + 1 Jim Crist-Harif + 1 Matt Green +``` + +Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release. + +## [40.0.0](https://github.com/apache/datafusion-python/tree/40.0.0) (2024-08-09) + +This release consists of 18 commits from 4 contributors. See credits at the end of this changelog for more information. + +- Update changelog for 39.0.0 [#742](https://github.com/apache/datafusion-python/pull/742) (andygrove) +- build(deps): bump uuid from 1.8.0 to 1.9.1 [#744](https://github.com/apache/datafusion-python/pull/744) (dependabot[bot]) +- build(deps): bump mimalloc from 0.1.42 to 0.1.43 [#745](https://github.com/apache/datafusion-python/pull/745) (dependabot[bot]) +- build(deps): bump syn from 2.0.67 to 2.0.68 [#746](https://github.com/apache/datafusion-python/pull/746) (dependabot[bot]) +- Tsaucer/find window fn [#747](https://github.com/apache/datafusion-python/pull/747) (timsaucer) +- Python wrapper classes for all user interfaces [#750](https://github.com/apache/datafusion-python/pull/750) (timsaucer) +- Expose array sort [#764](https://github.com/apache/datafusion-python/pull/764) (timsaucer) +- Upgrade protobuf and remove GH Action googletest-installer [#773](https://github.com/apache/datafusion-python/pull/773) (Michael-J-Ward) +- Upgrade Datafusion 40 [#771](https://github.com/apache/datafusion-python/pull/771) (Michael-J-Ward) +- Bugfix: Calling count with None arguments 
[#768](https://github.com/apache/datafusion-python/pull/768) (timsaucer) +- Add in user example that compares a two different approaches to UDFs [#770](https://github.com/apache/datafusion-python/pull/770) (timsaucer) +- Add missing exports for wrapper modules [#782](https://github.com/apache/datafusion-python/pull/782) (timsaucer) +- Add PyExpr to_variant conversions [#793](https://github.com/apache/datafusion-python/pull/793) (Michael-J-Ward) +- Add missing expressions to wrapper export [#795](https://github.com/apache/datafusion-python/pull/795) (timsaucer) +- Doc/cross reference [#791](https://github.com/apache/datafusion-python/pull/791) (timsaucer) +- Re-Enable `num_centroids` to `approx_percentile_cont` [#798](https://github.com/apache/datafusion-python/pull/798) (Michael-J-Ward) +- UDAF process all state variables [#799](https://github.com/apache/datafusion-python/pull/799) (timsaucer) + +## Credits + +Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor. + +``` + 9 Tim Saucer + 4 Michael J Ward + 3 dependabot[bot] + 2 Andy Grove +``` + +Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release. 
+ +## [39.0.0](https://github.com/apache/datafusion-python/tree/39.0.0) (2024-06-25) + +**Merged pull requests:** + +- ci: add substrait feature to linux builds [#720](https://github.com/apache/datafusion-python/pull/720) (Michael-J-Ward) +- Docs deploy action [#721](https://github.com/apache/datafusion-python/pull/721) (Michael-J-Ward) +- update deps [#723](https://github.com/apache/datafusion-python/pull/723) (Michael-J-Ward) +- Upgrade maturin [#725](https://github.com/apache/datafusion-python/pull/725) (Michael-J-Ward) +- Upgrade datafusion 39 [#728](https://github.com/apache/datafusion-python/pull/728) (Michael-J-Ward) +- use ScalarValue::to_pyarrow to convert to python object [#731](https://github.com/apache/datafusion-python/pull/731) (Michael-J-Ward) +- Pyo3 `Bound<'py, T>` api [#734](https://github.com/apache/datafusion-python/pull/734) (Michael-J-Ward) +- github test action: drop python 3.7, add python 3.12 [#736](https://github.com/apache/datafusion-python/pull/736) (Michael-J-Ward) +- Pyarrow filter pushdowns [#735](https://github.com/apache/datafusion-python/pull/735) (Michael-J-Ward) +- build(deps): bump syn from 2.0.66 to 2.0.67 [#738](https://github.com/apache/datafusion-python/pull/738) (dependabot[bot]) +- Pyo3 refactorings [#740](https://github.com/apache/datafusion-python/pull/740) (Michael-J-Ward) +- UDAF `sum` workaround [#741](https://github.com/apache/datafusion-python/pull/741) (Michael-J-Ward) + +## [38.0.1](https://github.com/apache/datafusion-python/tree/38.0.1) (2024-05-25) + +**Implemented enhancements:** + +- feat: add python bindings for ends_with function [#693](https://github.com/apache/datafusion-python/pull/693) (richtia) +- feat: expose `named_struct` in python [#700](https://github.com/apache/datafusion-python/pull/700) (Michael-J-Ward) + +**Merged pull requests:** + +- Add document about basics of working with expressions [#668](https://github.com/apache/datafusion-python/pull/668) (timsaucer) +- chore: Update Python release 
process now that DataFusion is TLP [#674](https://github.com/apache/datafusion-python/pull/674) (andygrove) +- Fix Docs [#676](https://github.com/apache/datafusion-python/pull/676) (Michael-J-Ward) +- Add examples from TPC-H [#666](https://github.com/apache/datafusion-python/pull/666) (timsaucer) +- fix conda nightly builds, attempt 2 [#689](https://github.com/apache/datafusion-python/pull/689) (Michael-J-Ward) +- Upgrade to datafusion 38 [#691](https://github.com/apache/datafusion-python/pull/691) (Michael-J-Ward) +- chore: update to maturin's recommended project layout for rust/python… [#695](https://github.com/apache/datafusion-python/pull/695) (Michael-J-Ward) +- chore: update cargo deps [#698](https://github.com/apache/datafusion-python/pull/698) (Michael-J-Ward) +- feat: add python bindings for ends_with function [#693](https://github.com/apache/datafusion-python/pull/693) (richtia) +- feat: expose `named_struct` in python [#700](https://github.com/apache/datafusion-python/pull/700) (Michael-J-Ward) +- Website fixes [#702](https://github.com/apache/datafusion-python/pull/702) (Michael-J-Ward) + +## [37.1.0](https://github.com/apache/datafusion-python/tree/37.1.0) (2024-05-08) + +**Implemented enhancements:** + +- feat: add execute_stream and execute_stream_partitioned [#610](https://github.com/apache/datafusion-python/pull/610) (mesejo) + +**Documentation updates:** + +- docs: update docs CI to install python-311 requirements [#661](https://github.com/apache/datafusion-python/pull/661) (Michael-J-Ward) + +**Merged pull requests:** + +- Switch to Ruff for Python linting [#529](https://github.com/apache/datafusion-python/pull/529) (andygrove) +- Remove sql-on-pandas/polars/cudf examples [#602](https://github.com/apache/datafusion-python/pull/602) (andygrove) +- build(deps): bump object_store from 0.9.0 to 0.9.1 [#611](https://github.com/apache/datafusion-python/pull/611) (dependabot[bot]) +- More missing array funcs 
[#605](https://github.com/apache/datafusion-python/pull/605) (judahrand) +- feat: add execute_stream and execute_stream_partitioned [#610](https://github.com/apache/datafusion-python/pull/610) (mesejo) +- build(deps): bump uuid from 1.7.0 to 1.8.0 [#615](https://github.com/apache/datafusion-python/pull/615) (dependabot[bot]) +- Bind SQLOptions and relative ctx method #567 [#588](https://github.com/apache/datafusion-python/pull/588) (giacomorebecchi) +- bugfix: no panic on empty table [#613](https://github.com/apache/datafusion-python/pull/613) (mesejo) +- Expose `register_listing_table` [#618](https://github.com/apache/datafusion-python/pull/618) (henrifroese) +- Expose unnest feature [#641](https://github.com/apache/datafusion-python/pull/641) (timsaucer) +- Update domain names and paths in asf yaml [#643](https://github.com/apache/datafusion-python/pull/643) (andygrove) +- use python 3.11 to publish docs [#645](https://github.com/apache/datafusion-python/pull/645) (andygrove) +- docs: update docs CI to install python-311 requirements [#661](https://github.com/apache/datafusion-python/pull/661) (Michael-J-Ward) +- Upgrade Datafusion to v37.1.0 [#669](https://github.com/apache/datafusion-python/pull/669) (Michael-J-Ward) + +## [36.0.0](https://github.com/apache/datafusion-python/tree/36.0.0) (2024-03-02) + +**Implemented enhancements:** + +- feat: Add `flatten` array function [#562](https://github.com/apache/datafusion-python/pull/562) (mobley-trent) + +**Documentation updates:** + +- docs: Add ASF attribution [#580](https://github.com/apache/datafusion-python/pull/580) (simicd) + +**Merged pull requests:** + +- Allow PyDataFrame to be used from other projects [#582](https://github.com/apache/datafusion-python/pull/582) (andygrove) +- docs: Add ASF attribution [#580](https://github.com/apache/datafusion-python/pull/580) (simicd) +- Add array functions [#560](https://github.com/apache/datafusion-python/pull/560) (ongchi) +- feat: Add `flatten` array function 
[#562](https://github.com/apache/datafusion-python/pull/562) (mobley-trent) + +## [35.0.0](https://github.com/apache/datafusion-python/tree/35.0.0) (2024-01-20) + +**Merged pull requests:** + +- build(deps): bump syn from 2.0.41 to 2.0.43 [#559](https://github.com/apache/datafusion-python/pull/559) (dependabot[bot]) +- build(deps): bump tokio from 1.35.0 to 1.35.1 [#558](https://github.com/apache/datafusion-python/pull/558) (dependabot[bot]) +- build(deps): bump async-trait from 0.1.74 to 0.1.77 [#556](https://github.com/apache/datafusion-python/pull/556) (dependabot[bot]) +- build(deps): bump pyo3 from 0.20.0 to 0.20.2 [#557](https://github.com/apache/datafusion-python/pull/557) (dependabot[bot]) + +## [34.0.0](https://github.com/apache/datafusion-python/tree/34.0.0) (2023-12-28) + +**Merged pull requests:** + +- Adjust visibility of crate private members & Functions [#537](https://github.com/apache/datafusion-python/pull/537) (jdye64) +- Update json.rst [#538](https://github.com/apache/datafusion-python/pull/538) (ray-andrew) +- Enable mimalloc local_dynamic_tls feature [#540](https://github.com/apache/datafusion-python/pull/540) (jdye64) +- Enable substrait feature to be built by default in CI, for nightlies … [#544](https://github.com/apache/datafusion-python/pull/544) (jdye64) + +## [33.0.0](https://github.com/apache/datafusion-python/tree/33.0.0) (2023-11-16) + +**Merged pull requests:** + +- First pass at getting architectured builds working [#350](https://github.com/apache/datafusion-python/pull/350) (charlesbluca) +- Remove libprotobuf dep [#527](https://github.com/apache/datafusion-python/pull/527) (jdye64) + +## [32.0.0](https://github.com/apache/datafusion-python/tree/32.0.0) (2023-10-21) + +**Implemented enhancements:** + +- feat: expose PyWindowFrame [#509](https://github.com/apache/datafusion-python/pull/509) (dlovell) +- add Binary String Functions;encode,decode [#494](https://github.com/apache/datafusion-python/pull/494) (jiangzhx) +- add 
bit_and,bit_or,bit_xor,bool_and,bool_or [#496](https://github.com/apache/datafusion-python/pull/496) (jiangzhx) +- add first_value last_value [#498](https://github.com/apache/datafusion-python/pull/498) (jiangzhx) +- add regr\_\* functions [#499](https://github.com/apache/datafusion-python/pull/499) (jiangzhx) +- Add random missing bindings [#522](https://github.com/apache/datafusion-python/pull/522) (jdye64) +- Allow for multiple input files per table instead of a single file [#519](https://github.com/apache/datafusion-python/pull/519) (jdye64) +- Add support for window function bindings [#521](https://github.com/apache/datafusion-python/pull/521) (jdye64) + +**Merged pull requests:** + +- Prepare 31.0.0 release [#500](https://github.com/apache/datafusion-python/pull/500) (andygrove) +- Improve release process documentation [#505](https://github.com/apache/datafusion-python/pull/505) (andygrove) +- add Binary String Functions;encode,decode [#494](https://github.com/apache/datafusion-python/pull/494) (jiangzhx) +- build(deps): bump mimalloc from 0.1.38 to 0.1.39 [#502](https://github.com/apache/datafusion-python/pull/502) (dependabot[bot]) +- build(deps): bump syn from 2.0.32 to 2.0.35 [#503](https://github.com/apache/datafusion-python/pull/503) (dependabot[bot]) +- build(deps): bump syn from 2.0.35 to 2.0.37 [#506](https://github.com/apache/datafusion-python/pull/506) (dependabot[bot]) +- Use latest DataFusion [#511](https://github.com/apache/datafusion-python/pull/511) (andygrove) +- add bit_and,bit_or,bit_xor,bool_and,bool_or [#496](https://github.com/apache/datafusion-python/pull/496) (jiangzhx) +- use DataFusion 32 [#515](https://github.com/apache/datafusion-python/pull/515) (andygrove) +- add first_value last_value [#498](https://github.com/apache/datafusion-python/pull/498) (jiangzhx) +- build(deps): bump regex-syntax from 0.7.5 to 0.8.1 [#517](https://github.com/apache/datafusion-python/pull/517) (dependabot[bot]) +- build(deps): bump pyo3-build-config from
0.19.2 to 0.20.0 [#516](https://github.com/apache/datafusion-python/pull/516) (dependabot[bot]) +- add regr\_\* functions [#499](https://github.com/apache/datafusion-python/pull/499) (jiangzhx) +- Add random missing bindings [#522](https://github.com/apache/datafusion-python/pull/522) (jdye64) +- build(deps): bump rustix from 0.38.18 to 0.38.19 [#523](https://github.com/apache/datafusion-python/pull/523) (dependabot[bot]) +- Allow for multiple input files per table instead of a single file [#519](https://github.com/apache/datafusion-python/pull/519) (jdye64) +- Add support for window function bindings [#521](https://github.com/apache/datafusion-python/pull/521) (jdye64) +- Small clippy fix [#524](https://github.com/apache/datafusion-python/pull/524) (andygrove) + +## [31.0.0](https://github.com/apache/datafusion-python/tree/31.0.0) (2023-09-12) + +[Full Changelog](https://github.com/apache/datafusion-python/compare/28.0.0...31.0.0) + +**Implemented enhancements:** + +- feat: add case function (#447) [#448](https://github.com/apache/datafusion-python/pull/448) (mesejo) +- feat: add compression options [#456](https://github.com/apache/datafusion-python/pull/456) (mesejo) +- feat: add register_json [#458](https://github.com/apache/datafusion-python/pull/458) (mesejo) +- feat: add basic compression configuration to write_parquet [#459](https://github.com/apache/datafusion-python/pull/459) (mesejo) +- feat: add example of reading parquet from s3 [#460](https://github.com/apache/datafusion-python/pull/460) (mesejo) +- feat: add register_avro and read_table [#461](https://github.com/apache/datafusion-python/pull/461) (mesejo) +- feat: add missing scalar math functions [#465](https://github.com/apache/datafusion-python/pull/465) (mesejo) + +**Documentation updates:** + +- docs: include pre-commit hooks section in contributor guide [#455](https://github.com/apache/datafusion-python/pull/455) (mesejo) + +**Merged pull requests:** + +- Build Linux aarch64 wheel 
[#443](https://github.com/apache/datafusion-python/pull/443) (gokselk) +- feat: add case function (#447) [#448](https://github.com/apache/datafusion-python/pull/448) (mesejo) +- enhancement(docs): Add user guide (#432) [#445](https://github.com/apache/datafusion-python/pull/445) (mesejo) +- docs: include pre-commit hooks section in contributor guide [#455](https://github.com/apache/datafusion-python/pull/455) (mesejo) +- feat: add compression options [#456](https://github.com/apache/datafusion-python/pull/456) (mesejo) +- Upgrade to DF 28.0.0-rc1 [#457](https://github.com/apache/datafusion-python/pull/457) (andygrove) +- feat: add register_json [#458](https://github.com/apache/datafusion-python/pull/458) (mesejo) +- feat: add basic compression configuration to write_parquet [#459](https://github.com/apache/datafusion-python/pull/459) (mesejo) +- feat: add example of reading parquet from s3 [#460](https://github.com/apache/datafusion-python/pull/460) (mesejo) +- feat: add register_avro and read_table [#461](https://github.com/apache/datafusion-python/pull/461) (mesejo) +- feat: add missing scalar math functions [#465](https://github.com/apache/datafusion-python/pull/465) (mesejo) +- build(deps): bump arduino/setup-protoc from 1 to 2 [#452](https://github.com/apache/datafusion-python/pull/452) (dependabot[bot]) +- Revert "build(deps): bump arduino/setup-protoc from 1 to 2 (#452)" [#474](https://github.com/apache/datafusion-python/pull/474) (viirya) +- Minor: fix wrongly copied function description [#497](https://github.com/apache/datafusion-python/pull/497) (viirya) +- Upgrade to Datafusion 31.0.0 [#491](https://github.com/apache/datafusion-python/pull/491) (judahrand) +- Add `isnan` and `iszero` [#495](https://github.com/apache/datafusion-python/pull/495) (judahrand) + +## 30.0.0 + +- Skipped due to a breaking change in DataFusion + +## 29.0.0 + +- Skipped + +## [28.0.0](https://github.com/apache/datafusion-python/tree/28.0.0) (2023-07-25) + +**Implemented 
enhancements:** + +- feat: expose offset in python API [#437](https://github.com/apache/datafusion-python/pull/437) (cpcloud) + +**Merged pull requests:** + +- File based input utils [#433](https://github.com/apache/datafusion-python/pull/433) (jdye64) +- Upgrade to 28.0.0-rc1 [#434](https://github.com/apache/datafusion-python/pull/434) (andygrove) +- Introduces utility for obtaining SqlTable information from a file like location [#398](https://github.com/apache/datafusion-python/pull/398) (jdye64) +- feat: expose offset in python API [#437](https://github.com/apache/datafusion-python/pull/437) (cpcloud) +- Use DataFusion 28 [#439](https://github.com/apache/datafusion-python/pull/439) (andygrove) + +## [27.0.0](https://github.com/apache/datafusion-python/tree/27.0.0) (2023-07-03) + +**Merged pull requests:** + +- LogicalPlan.to_variant() make public [#412](https://github.com/apache/datafusion-python/pull/412) (jdye64) +- Prepare 27.0.0 release [#423](https://github.com/apache/datafusion-python/pull/423) (andygrove) + +## [26.0.0](https://github.com/apache/datafusion-python/tree/26.0.0) (2023-06-11) + +[Full Changelog](https://github.com/apache/datafusion-python/compare/25.0.0...26.0.0) + +**Merged pull requests:** + +- Add Expr::Case when_then_else support to rex_call_operands function [#388](https://github.com/apache/datafusion-python/pull/388) (jdye64) +- Introduce BaseSessionContext abstract class [#390](https://github.com/apache/datafusion-python/pull/390) (jdye64) +- CRUD Schema support for `BaseSessionContext` [#392](https://github.com/apache/datafusion-python/pull/392) (jdye64) +- CRUD Table support for `BaseSessionContext` [#394](https://github.com/apache/datafusion-python/pull/394) (jdye64) + +## [25.0.0](https://github.com/apache/datafusion-python/tree/25.0.0) (2023-05-23) + +[Full Changelog](https://github.com/apache/datafusion-python/compare/24.0.0...25.0.0) + +**Merged pull requests:** + +- Prepare 24.0.0 Release 
[#376](https://github.com/apache/datafusion-python/pull/376) (andygrove) +- build(deps): bump uuid from 1.3.1 to 1.3.2 [#359](https://github.com/apache/datafusion-python/pull/359) (dependabot[bot]) +- build(deps): bump mimalloc from 0.1.36 to 0.1.37 [#361](https://github.com/apache/datafusion-python/pull/361) (dependabot[bot]) +- build(deps): bump regex-syntax from 0.6.29 to 0.7.1 [#334](https://github.com/apache/datafusion-python/pull/334) (dependabot[bot]) +- upgrade maturin to 0.15.1 [#379](https://github.com/apache/datafusion-python/pull/379) (Jimexist) +- Expand Expr to include RexType basic support [#378](https://github.com/apache/datafusion-python/pull/378) (jdye64) +- Add Python script for generating changelog [#383](https://github.com/apache/datafusion-python/pull/383) (andygrove) + +## [24.0.0](https://github.com/apache/datafusion-python/tree/24.0.0) (2023-05-09) + +[Full Changelog](https://github.com/apache/datafusion-python/compare/23.0.0...24.0.0) + +**Documentation updates:** + +- Fix link to user guide [#354](https://github.com/apache/datafusion-python/pull/354) (andygrove) + +**Merged pull requests:** + +- Add interface to serialize Substrait plans to Python Bytes. [#344](https://github.com/apache/datafusion-python/pull/344) (kylebrooks-8451) +- Add partition_count property to ExecutionPlan. [#346](https://github.com/apache/datafusion-python/pull/346) (kylebrooks-8451) +- Remove unsendable from all Rust pyclass types. [#348](https://github.com/apache/datafusion-python/pull/348) (kylebrooks-8451) +- Fix link to user guide [#354](https://github.com/apache/datafusion-python/pull/354) (andygrove) +- Fix SessionContext execute. 
[#353](https://github.com/apache/datafusion-python/pull/353) (kylebrooks-8451) +- Pub mod expr in lib.rs [#357](https://github.com/apache/datafusion-python/pull/357) (jdye64) +- Add benchmark derived from TPC-H [#355](https://github.com/apache/datafusion-python/pull/355) (andygrove) +- Add db-benchmark [#365](https://github.com/apache/datafusion-python/pull/365) (andygrove) +- First pass of documentation in mdBook [#364](https://github.com/apache/datafusion-python/pull/364) (MrPowers) +- Add 'pub' and '#[pyo3(get, set)]' to DataTypeMap [#371](https://github.com/apache/datafusion-python/pull/371) (jdye64) +- Fix db-benchmark [#369](https://github.com/apache/datafusion-python/pull/369) (andygrove) +- Docs explaining how to view query plans [#373](https://github.com/apache/datafusion-python/pull/373) (andygrove) +- Improve db-benchmark [#372](https://github.com/apache/datafusion-python/pull/372) (andygrove) +- Make expr member of PyExpr public [#375](https://github.com/apache/datafusion-python/pull/375) (jdye64) + +## [23.0.0](https://github.com/apache/datafusion-python/tree/23.0.0) (2023-04-23) + +[Full Changelog](https://github.com/apache/datafusion-python/compare/22.0.0...23.0.0) + +**Merged pull requests:** + +- Improve API docs, README, and examples for configuring context [#321](https://github.com/apache/datafusion-python/pull/321) (andygrove) +- Osx build linker args [#330](https://github.com/apache/datafusion-python/pull/330) (jdye64) +- Add requirements file for python 3.11 [#332](https://github.com/apache/datafusion-python/pull/332) (r4ntix) +- mac arm64 build [#338](https://github.com/apache/datafusion-python/pull/338) (andygrove) +- Add conda.yaml baseline workflow file [#281](https://github.com/apache/datafusion-python/pull/281) (jdye64) +- Prepare for 23.0.0 release [#335](https://github.com/apache/datafusion-python/pull/335) (andygrove) +- Reuse the Tokio Runtime [#341](https://github.com/apache/datafusion-python/pull/341) (kylebrooks-8451) + +## 
[22.0.0](https://github.com/apache/datafusion-python/tree/22.0.0) (2023-04-10) + +[Full Changelog](https://github.com/apache/datafusion-python/compare/21.0.0...22.0.0) + +**Merged pull requests:** + +- Fix invalid build yaml [#308](https://github.com/apache/datafusion-python/pull/308) (andygrove) +- Try fix release build [#309](https://github.com/apache/datafusion-python/pull/309) (andygrove) +- Fix release build [#310](https://github.com/apache/datafusion-python/pull/310) (andygrove) +- Enable datafusion-substrait protoc feature, to remove compile-time dependency on protoc [#312](https://github.com/apache/datafusion-python/pull/312) (andygrove) +- Fix Mac/Win release builds in CI [#313](https://github.com/apache/datafusion-python/pull/313) (andygrove) +- install protoc in docs workflow [#314](https://github.com/apache/datafusion-python/pull/314) (andygrove) +- Fix documentation generation in CI [#315](https://github.com/apache/datafusion-python/pull/315) (andygrove) +- Source wheel fix [#319](https://github.com/apache/datafusion-python/pull/319) (andygrove) + +## [21.0.0](https://github.com/apache/datafusion-python/tree/21.0.0) (2023-03-30) + +[Full Changelog](https://github.com/apache/datafusion-python/compare/20.0.0...21.0.0) + +**Merged pull requests:** + +- minor: Fix minor warning on unused import [#289](https://github.com/apache/datafusion-python/pull/289) (viirya) +- feature: Implement `describe()` method [#293](https://github.com/apache/datafusion-python/pull/293) (simicd) +- fix: Printed results not visible in debugger & notebooks [#296](https://github.com/apache/datafusion-python/pull/296) (simicd) +- add package.include and remove wildcard dependency [#295](https://github.com/apache/datafusion-python/pull/295) (andygrove) +- Update main branch name in docs workflow [#303](https://github.com/apache/datafusion-python/pull/303) (andygrove) +- Upgrade to DF 21 [#301](https://github.com/apache/datafusion-python/pull/301) (andygrove) + +## 
[20.0.0](https://github.com/apache/datafusion-python/tree/20.0.0) (2023-03-17) + +[Full Changelog](https://github.com/apache/datafusion-python/compare/0.8.0...20.0.0) + +**Implemented enhancements:** + +- Empty relation bindings [#208](https://github.com/apache/datafusion-python/pull/208) (jdye64) +- wrap display_name and canonical_name functions [#214](https://github.com/apache/datafusion-python/pull/214) (jdye64) +- Add PyAlias bindings [#216](https://github.com/apache/datafusion-python/pull/216) (jdye64) +- Add bindings for scalar_variable [#218](https://github.com/apache/datafusion-python/pull/218) (jdye64) +- Bindings for LIKE type expressions [#220](https://github.com/apache/datafusion-python/pull/220) (jdye64) +- Bool expr bindings [#223](https://github.com/apache/datafusion-python/pull/223) (jdye64) +- Between bindings [#229](https://github.com/apache/datafusion-python/pull/229) (jdye64) +- Add bindings for GetIndexedField [#227](https://github.com/apache/datafusion-python/pull/227) (jdye64) +- Add bindings for case, cast, and trycast [#232](https://github.com/apache/datafusion-python/pull/232) (jdye64) +- add remaining expr bindings [#233](https://github.com/apache/datafusion-python/pull/233) (jdye64) +- feature: Additional export methods [#236](https://github.com/apache/datafusion-python/pull/236) (simicd) +- Add Python wrapper for LogicalPlan::Union [#240](https://github.com/apache/datafusion-python/pull/240) (iajoiner) +- feature: Create dataframe from pandas, polars, dictionary, list or pyarrow Table [#242](https://github.com/apache/datafusion-python/pull/242) (simicd) +- Add Python wrappers for `LogicalPlan::Join` and `LogicalPlan::CrossJoin` [#246](https://github.com/apache/datafusion-python/pull/246) (iajoiner) +- feature: Set table name from ctx functions [#260](https://github.com/apache/datafusion-python/pull/260) (simicd) +- Explain bindings [#264](https://github.com/apache/datafusion-python/pull/264) (jdye64) +- Extension bindings 
[#266](https://github.com/apache/datafusion-python/pull/266) (jdye64) +- Subquery alias bindings [#269](https://github.com/apache/datafusion-python/pull/269) (jdye64) +- Create memory table [#271](https://github.com/apache/datafusion-python/pull/271) (jdye64) +- Create view bindings [#273](https://github.com/apache/datafusion-python/pull/273) (jdye64) +- Re-export Datafusion dependencies [#277](https://github.com/apache/datafusion-python/pull/277) (jdye64) +- Distinct bindings [#275](https://github.com/apache/datafusion-python/pull/275) (jdye64) +- Drop table bindings [#283](https://github.com/apache/datafusion-python/pull/283) (jdye64) +- Bindings for LogicalPlan::Repartition [#285](https://github.com/apache/datafusion-python/pull/285) (jdye64) +- Expand Rust return type support for Arrow DataTypes in ScalarValue [#287](https://github.com/apache/datafusion-python/pull/287) (jdye64) + +**Documentation updates:** + +- docs: Example of calling Python UDF & UDAF in SQL [#258](https://github.com/apache/datafusion-python/pull/258) (simicd) + +**Merged pull requests:** + +- Minor docs updates [#210](https://github.com/apache/datafusion-python/pull/210) (andygrove) +- Empty relation bindings [#208](https://github.com/apache/datafusion-python/pull/208) (jdye64) +- wrap display_name and canonical_name functions [#214](https://github.com/apache/datafusion-python/pull/214) (jdye64) +- Add PyAlias bindings [#216](https://github.com/apache/datafusion-python/pull/216) (jdye64) +- Add bindings for scalar_variable [#218](https://github.com/apache/datafusion-python/pull/218) (jdye64) +- Bindings for LIKE type expressions [#220](https://github.com/apache/datafusion-python/pull/220) (jdye64) +- Bool expr bindings [#223](https://github.com/apache/datafusion-python/pull/223) (jdye64) +- Between bindings [#229](https://github.com/apache/datafusion-python/pull/229) (jdye64) +- Add bindings for GetIndexedField [#227](https://github.com/apache/datafusion-python/pull/227) (jdye64) +- Add 
bindings for case, cast, and trycast [#232](https://github.com/apache/datafusion-python/pull/232) (jdye64) +- add remaining expr bindings [#233](https://github.com/apache/datafusion-python/pull/233) (jdye64) +- Pre-commit hooks [#228](https://github.com/apache/datafusion-python/pull/228) (jdye64) +- Implement new release process [#149](https://github.com/apache/datafusion-python/pull/149) (andygrove) +- feature: Additional export methods [#236](https://github.com/apache/datafusion-python/pull/236) (simicd) +- Add Python wrapper for LogicalPlan::Union [#240](https://github.com/apache/datafusion-python/pull/240) (iajoiner) +- feature: Create dataframe from pandas, polars, dictionary, list or pyarrow Table [#242](https://github.com/apache/datafusion-python/pull/242) (simicd) +- Fix release instructions [#238](https://github.com/apache/datafusion-python/pull/238) (andygrove) +- Add Python wrappers for `LogicalPlan::Join` and `LogicalPlan::CrossJoin` [#246](https://github.com/apache/datafusion-python/pull/246) (iajoiner) +- docs: Example of calling Python UDF & UDAF in SQL [#258](https://github.com/apache/datafusion-python/pull/258) (simicd) +- feature: Set table name from ctx functions [#260](https://github.com/apache/datafusion-python/pull/260) (simicd) +- Upgrade to DataFusion 19 [#262](https://github.com/apache/datafusion-python/pull/262) (andygrove) +- Explain bindings [#264](https://github.com/apache/datafusion-python/pull/264) (jdye64) +- Extension bindings [#266](https://github.com/apache/datafusion-python/pull/266) (jdye64) +- Subquery alias bindings [#269](https://github.com/apache/datafusion-python/pull/269) (jdye64) +- Create memory table [#271](https://github.com/apache/datafusion-python/pull/271) (jdye64) +- Create view bindings [#273](https://github.com/apache/datafusion-python/pull/273) (jdye64) +- Re-export Datafusion dependencies [#277](https://github.com/apache/datafusion-python/pull/277) (jdye64) +- Distinct bindings 
[#275](https://github.com/apache/datafusion-python/pull/275) (jdye64) +- build(deps): bump actions/checkout from 2 to 3 [#244](https://github.com/apache/datafusion-python/pull/244) (dependabot[bot]) +- build(deps): bump actions/upload-artifact from 2 to 3 [#245](https://github.com/apache/datafusion-python/pull/245) (dependabot[bot]) +- build(deps): bump actions/download-artifact from 2 to 3 [#243](https://github.com/apache/datafusion-python/pull/243) (dependabot[bot]) +- Use DataFusion 20 [#278](https://github.com/apache/datafusion-python/pull/278) (andygrove) +- Drop table bindings [#283](https://github.com/apache/datafusion-python/pull/283) (jdye64) +- Bindings for LogicalPlan::Repartition [#285](https://github.com/apache/datafusion-python/pull/285) (jdye64) +- Expand Rust return type support for Arrow DataTypes in ScalarValue [#287](https://github.com/apache/datafusion-python/pull/287) (jdye64) + +## [0.8.0](https://github.com/apache/datafusion-python/tree/0.8.0) (2023-02-22) + +[Full Changelog](https://github.com/apache/datafusion-python/compare/0.8.0-rc1...0.8.0) + +**Implemented enhancements:** + +- Add support for cuDF physical execution engine [\#202](https://github.com/apache/datafusion-python/issues/202) +- Make it easier to create a Pandas dataframe from DataFusion query results [\#139](https://github.com/apache/datafusion-python/issues/139) + +**Fixed bugs:** + +- Build error: could not compile `thiserror` due to 2 previous errors [\#69](https://github.com/apache/datafusion-python/issues/69) + +**Closed issues:** + +- Integrate with the new `object_store` crate [\#22](https://github.com/apache/datafusion-python/issues/22) + +**Merged pull requests:** + +- Update README in preparation for 0.8 release [\#206](https://github.com/apache/datafusion-python/pull/206) ([andygrove](https://github.com/andygrove)) +- Add support for cudf as a physical execution engine [\#205](https://github.com/apache/datafusion-python/pull/205) 
([jdye64](https://github.com/jdye64)) +- Run `maturin develop` instead of `cargo build` in verification script [\#200](https://github.com/apache/datafusion-python/pull/200) ([andygrove](https://github.com/andygrove)) +- Add tests for recently added functionality [\#199](https://github.com/apache/datafusion-python/pull/199) ([andygrove](https://github.com/andygrove)) +- Implement `to_pandas()` [\#197](https://github.com/apache/datafusion-python/pull/197) ([simicd](https://github.com/simicd)) +- Add Python wrapper for LogicalPlan::Sort [\#196](https://github.com/apache/datafusion-python/pull/196) ([andygrove](https://github.com/andygrove)) +- Add Python wrapper for LogicalPlan::Aggregate [\#195](https://github.com/apache/datafusion-python/pull/195) ([andygrove](https://github.com/andygrove)) +- Add Python wrapper for LogicalPlan::Limit [\#193](https://github.com/apache/datafusion-python/pull/193) ([andygrove](https://github.com/andygrove)) +- Add Python wrapper for LogicalPlan::Filter [\#192](https://github.com/apache/datafusion-python/pull/192) ([andygrove](https://github.com/andygrove)) +- Add experimental support for executing SQL with Polars and Pandas [\#190](https://github.com/apache/datafusion-python/pull/190) ([andygrove](https://github.com/andygrove)) +- Update changelog for 0.8 release [\#188](https://github.com/apache/datafusion-python/pull/188) ([andygrove](https://github.com/andygrove)) +- Add ability to execute ExecutionPlan and get a stream of RecordBatch [\#186](https://github.com/apache/datafusion-python/pull/186) ([andygrove](https://github.com/andygrove)) +- Dffield bindings [\#185](https://github.com/apache/datafusion-python/pull/185) ([jdye64](https://github.com/jdye64)) +- Add bindings for DFSchema [\#183](https://github.com/apache/datafusion-python/pull/183) ([jdye64](https://github.com/jdye64)) +- test: Window functions [\#182](https://github.com/apache/datafusion-python/pull/182) ([simicd](https://github.com/simicd)) +- Add bindings for 
Projection [\#180](https://github.com/apache/datafusion-python/pull/180) ([jdye64](https://github.com/jdye64)) +- Table scan bindings [\#178](https://github.com/apache/datafusion-python/pull/178) ([jdye64](https://github.com/jdye64)) +- Make session configurable [\#176](https://github.com/apache/datafusion-python/pull/176) ([andygrove](https://github.com/andygrove)) +- Upgrade to DataFusion 18.0.0 [\#175](https://github.com/apache/datafusion-python/pull/175) ([andygrove](https://github.com/andygrove)) +- Use latest DataFusion rev in preparation for DF 18 release [\#174](https://github.com/apache/datafusion-python/pull/174) ([andygrove](https://github.com/andygrove)) +- Arrow type bindings [\#173](https://github.com/apache/datafusion-python/pull/173) ([jdye64](https://github.com/jdye64)) +- Pyo3 bump [\#171](https://github.com/apache/datafusion-python/pull/171) ([jdye64](https://github.com/jdye64)) +- feature: Add additional aggregation functions [\#170](https://github.com/apache/datafusion-python/pull/170) ([simicd](https://github.com/simicd)) +- Make from_substrait_plan return DataFrame instead of LogicalPlan [\#164](https://github.com/apache/datafusion-python/pull/164) ([andygrove](https://github.com/andygrove)) +- feature: Implement count method [\#163](https://github.com/apache/datafusion-python/pull/163) ([simicd](https://github.com/simicd)) +- CI Fixes [\#162](https://github.com/apache/datafusion-python/pull/162) ([jdye64](https://github.com/jdye64)) +- Upgrade to DataFusion 17 [\#160](https://github.com/apache/datafusion-python/pull/160) ([andygrove](https://github.com/andygrove)) +- feature: Improve string representation of datafusion classes [\#159](https://github.com/apache/datafusion-python/pull/159) ([simicd](https://github.com/simicd)) +- Make PyExecutionPlan.plan public [\#156](https://github.com/apache/datafusion-python/pull/156) ([andygrove](https://github.com/andygrove)) +- Expose methods on logical and execution plans 
[\#155](https://github.com/apache/datafusion-python/pull/155) ([andygrove](https://github.com/andygrove)) +- Fix clippy for new Rust version [\#154](https://github.com/apache/datafusion-python/pull/154) ([andygrove](https://github.com/andygrove)) +- Add DataFrame methods for accessing plans [\#153](https://github.com/apache/datafusion-python/pull/153) ([andygrove](https://github.com/andygrove)) +- Use DataFusion rev 5238e8c97f998b4d2cb9fab85fb182f325a1a7fb [\#150](https://github.com/apache/datafusion-python/pull/150) ([andygrove](https://github.com/andygrove)) +- build\(deps\): bump async-trait from 0.1.61 to 0.1.62 [\#148](https://github.com/apache/datafusion-python/pull/148) ([dependabot[bot]](https://github.com/apps/dependabot)) +- Rename default branch from master to main [\#147](https://github.com/apache/datafusion-python/pull/147) ([andygrove](https://github.com/andygrove)) +- Substrait bindings [\#145](https://github.com/apache/datafusion-python/pull/145) ([jdye64](https://github.com/jdye64)) +- build\(deps\): bump uuid from 0.8.2 to 1.2.2 [\#143](https://github.com/apache/datafusion-python/pull/143) ([dependabot[bot]](https://github.com/apps/dependabot)) +- Prepare for 0.8.0 release [\#141](https://github.com/apache/datafusion-python/pull/141) ([andygrove](https://github.com/andygrove)) +- Improve README and add more examples [\#137](https://github.com/apache/datafusion-python/pull/137) ([andygrove](https://github.com/andygrove)) +- test: Expand tests for built-in functions [\#129](https://github.com/apache/datafusion-python/pull/129) ([simicd](https://github.com/simicd)) +- build\(deps\): bump object_store from 0.5.2 to 0.5.3 [\#126](https://github.com/apache/datafusion-python/pull/126) ([dependabot[bot]](https://github.com/apps/dependabot)) +- build\(deps\): bump mimalloc from 0.1.32 to 0.1.34 [\#125](https://github.com/apache/datafusion-python/pull/125) ([dependabot[bot]](https://github.com/apps/dependabot)) +- Introduce conda directory containing 
datafusion-dev.yaml conda enviro… [\#124](https://github.com/apache/datafusion-python/pull/124) ([jdye64](https://github.com/jdye64)) +- build\(deps\): bump bzip2 from 0.4.3 to 0.4.4 [\#121](https://github.com/apache/datafusion-python/pull/121) ([dependabot[bot]](https://github.com/apps/dependabot)) +- build\(deps\): bump tokio from 1.23.0 to 1.24.1 [\#119](https://github.com/apache/datafusion-python/pull/119) ([dependabot[bot]](https://github.com/apps/dependabot)) +- build\(deps\): bump async-trait from 0.1.60 to 0.1.61 [\#118](https://github.com/apache/datafusion-python/pull/118) ([dependabot[bot]](https://github.com/apps/dependabot)) +- Upgrade to DataFusion 16.0.0 [\#115](https://github.com/apache/datafusion-python/pull/115) ([andygrove](https://github.com/andygrove)) +- Bump async-trait from 0.1.57 to 0.1.60 [\#114](https://github.com/apache/datafusion-python/pull/114) ([dependabot[bot]](https://github.com/apps/dependabot)) +- Bump object_store from 0.5.1 to 0.5.2 [\#112](https://github.com/apache/datafusion-python/pull/112) ([dependabot[bot]](https://github.com/apps/dependabot)) +- Bump tokio from 1.21.2 to 1.23.0 [\#109](https://github.com/apache/datafusion-python/pull/109) ([dependabot[bot]](https://github.com/apps/dependabot)) +- Add entries for publishing production \(asf-site\) and staging docs [\#107](https://github.com/apache/datafusion-python/pull/107) ([martin-g](https://github.com/martin-g)) +- Add a workflow that builds the docs and deploys them at staged or production [\#104](https://github.com/apache/datafusion-python/pull/104) ([martin-g](https://github.com/martin-g)) +- Upgrade to DataFusion 15.0.0 [\#103](https://github.com/apache/datafusion-python/pull/103) ([andygrove](https://github.com/andygrove)) +- build\(deps\): bump futures from 0.3.24 to 0.3.25 [\#102](https://github.com/apache/datafusion-python/pull/102) ([dependabot[bot]](https://github.com/apps/dependabot)) +- build\(deps\): bump pyo3 from 0.17.2 to 0.17.3 
[\#101](https://github.com/apache/datafusion-python/pull/101) ([dependabot[bot]](https://github.com/apps/dependabot)) +- build\(deps\): bump mimalloc from 0.1.30 to 0.1.32 [\#98](https://github.com/apache/datafusion-python/pull/98) ([dependabot[bot]](https://github.com/apps/dependabot)) +- build\(deps\): bump rand from 0.7.3 to 0.8.5 [\#97](https://github.com/apache/datafusion-python/pull/97) ([dependabot[bot]](https://github.com/apps/dependabot)) +- Fix GitHub actions warnings [\#95](https://github.com/apache/datafusion-python/pull/95) ([martin-g](https://github.com/martin-g)) +- Fixes \#81 - Add CI workflow for source distribution [\#93](https://github.com/apache/datafusion-python/pull/93) ([martin-g](https://github.com/martin-g)) +- post-release updates [\#91](https://github.com/apache/datafusion-python/pull/91) ([andygrove](https://github.com/andygrove)) +- Build for manylinux 2014 [\#88](https://github.com/apache/datafusion-python/pull/88) ([martin-g](https://github.com/martin-g)) +- update release readme tag [\#86](https://github.com/apache/datafusion-python/pull/86) ([Jimexist](https://github.com/Jimexist)) +- Upgrade Maturin to 0.14.2 [\#85](https://github.com/apache/datafusion-python/pull/85) ([martin-g](https://github.com/martin-g)) +- Update release instructions [\#83](https://github.com/apache/datafusion-python/pull/83) ([andygrove](https://github.com/andygrove)) +- \[Functions\] - Add python function binding to `functions` [\#73](https://github.com/apache/datafusion-python/pull/73) ([francis-du](https://github.com/francis-du)) + +## [0.8.0-rc1](https://github.com/apache/datafusion-python/tree/0.8.0-rc1) (2023-02-17) + +[Full Changelog](https://github.com/apache/datafusion-python/compare/0.7.0-rc2...0.8.0-rc1) + +**Implemented enhancements:** + +- Add bindings for datafusion_common::DFField [\#184](https://github.com/apache/datafusion-python/issues/184) +- Add bindings for DFSchema/DFSchemaRef 
[\#181](https://github.com/apache/datafusion-python/issues/181) +- Add bindings for datafusion_expr Projection [\#179](https://github.com/apache/datafusion-python/issues/179) +- Add bindings for `TableScan` struct from `datafusion_expr::TableScan` [\#177](https://github.com/apache/datafusion-python/issues/177) +- Add a "mapping" struct for types [\#172](https://github.com/apache/datafusion-python/issues/172) +- Improve string representation of datafusion classes \(dataframe, context, expression, ...\) [\#158](https://github.com/apache/datafusion-python/issues/158) +- Add DataFrame count method [\#151](https://github.com/apache/datafusion-python/issues/151) +- \[REQUEST\] Github Actions Improvements [\#146](https://github.com/apache/datafusion-python/issues/146) +- Change default branch name from master to main [\#144](https://github.com/apache/datafusion-python/issues/144) +- Bump pyo3 to 0.18.0 [\#140](https://github.com/apache/datafusion-python/issues/140) +- Add script for Python linting [\#134](https://github.com/apache/datafusion-python/issues/134) +- Add Python bindings for substrait module [\#132](https://github.com/apache/datafusion-python/issues/132) +- Expand unit tests for built-in functions [\#128](https://github.com/apache/datafusion-python/issues/128) +- support creating arrow-datafusion-python conda environment [\#122](https://github.com/apache/datafusion-python/issues/122) +- Build Python source distribution in GitHub workflow [\#81](https://github.com/apache/datafusion-python/issues/81) +- EPIC: Add all functions to python binding `functions` [\#72](https://github.com/apache/datafusion-python/issues/72) + +**Fixed bugs:** + +- Build is broken [\#161](https://github.com/apache/datafusion-python/issues/161) +- Out of memory when sorting [\#157](https://github.com/apache/datafusion-python/issues/157) +- window_lead test appears to be non-deterministic [\#135](https://github.com/apache/datafusion-python/issues/135) +- Reading csv does not work 
[\#130](https://github.com/apache/datafusion-python/issues/130) +- Github actions produce a lot of warnings [\#94](https://github.com/apache/datafusion-python/issues/94) +- ASF source release tarball has wrong directory name [\#90](https://github.com/apache/datafusion-python/issues/90) +- Python Release Build failing after upgrading to maturin 14.2 [\#87](https://github.com/apache/datafusion-python/issues/87) +- Maturin build hangs on Linux ARM64 [\#84](https://github.com/apache/datafusion-python/issues/84) +- Cannot install on Mac M1 from source tarball from testpypi [\#82](https://github.com/apache/datafusion-python/issues/82) +- ImportPathMismatchError when running pytest locally [\#77](https://github.com/apache/datafusion-python/issues/77) + +**Closed issues:** + +- Publish documentation for Python bindings [\#39](https://github.com/apache/datafusion-python/issues/39) +- Add Python binding for `approx_median` [\#32](https://github.com/apache/datafusion-python/issues/32) +- Release version 0.7.0 [\#7](https://github.com/apache/datafusion-python/issues/7) + +## [0.7.0-rc2](https://github.com/apache/datafusion-python/tree/0.7.0-rc2) (2022-11-26) + +[Full Changelog](https://github.com/apache/datafusion-python/compare/0.7.0...0.7.0-rc2) + +## [Unreleased](https://github.com/datafusion-contrib/datafusion-python/tree/HEAD) + +[Full Changelog](https://github.com/datafusion-contrib/datafusion-python/compare/0.5.1...HEAD) + +**Merged pull requests:** + +- use \_\_getitem\_\_ for df column selection [\#41](https://github.com/datafusion-contrib/datafusion-python/pull/41) ([Jimexist](https://github.com/Jimexist)) +- fix demo in readme [\#40](https://github.com/datafusion-contrib/datafusion-python/pull/40) ([Jimexist](https://github.com/Jimexist)) +- Implement select_columns [\#39](https://github.com/datafusion-contrib/datafusion-python/pull/39) ([andygrove](https://github.com/andygrove)) +- update readme and changelog 
[\#38](https://github.com/datafusion-contrib/datafusion-python/pull/38) ([Jimexist](https://github.com/Jimexist)) +- Add PyDataFrame.explain [\#36](https://github.com/datafusion-contrib/datafusion-python/pull/36) ([andygrove](https://github.com/andygrove)) +- Release 0.5.0 [\#34](https://github.com/datafusion-contrib/datafusion-python/pull/34) ([Jimexist](https://github.com/Jimexist)) +- disable nightly in workflow [\#33](https://github.com/datafusion-contrib/datafusion-python/pull/33) ([Jimexist](https://github.com/Jimexist)) +- update requirements to 37 and 310, update readme [\#32](https://github.com/datafusion-contrib/datafusion-python/pull/32) ([Jimexist](https://github.com/Jimexist)) +- Add custom global allocator [\#30](https://github.com/datafusion-contrib/datafusion-python/pull/30) ([matthewmturner](https://github.com/matthewmturner)) +- Remove pandas dependency [\#25](https://github.com/datafusion-contrib/datafusion-python/pull/25) ([matthewmturner](https://github.com/matthewmturner)) +- upgrade datafusion and pyo3 [\#20](https://github.com/datafusion-contrib/datafusion-python/pull/20) ([Jimexist](https://github.com/Jimexist)) +- update maturin 0.12+ [\#17](https://github.com/datafusion-contrib/datafusion-python/pull/17) ([Jimexist](https://github.com/Jimexist)) +- Update README.md [\#16](https://github.com/datafusion-contrib/datafusion-python/pull/16) ([Jimexist](https://github.com/Jimexist)) +- apply cargo clippy --fix [\#15](https://github.com/datafusion-contrib/datafusion-python/pull/15) ([Jimexist](https://github.com/Jimexist)) +- update test workflow to include rust clippy and check [\#14](https://github.com/datafusion-contrib/datafusion-python/pull/14) ([Jimexist](https://github.com/Jimexist)) +- use maturin 0.12.6 [\#13](https://github.com/datafusion-contrib/datafusion-python/pull/13) ([Jimexist](https://github.com/Jimexist)) +- apply cargo fmt [\#12](https://github.com/datafusion-contrib/datafusion-python/pull/12) 
([Jimexist](https://github.com/Jimexist)) +- use stable not nightly [\#11](https://github.com/datafusion-contrib/datafusion-python/pull/11) ([Jimexist](https://github.com/Jimexist)) +- ci: test against more compilers, setup clippy and fix clippy lints [\#9](https://github.com/datafusion-contrib/datafusion-python/pull/9) ([cpcloud](https://github.com/cpcloud)) +- Fix use of importlib.metadata and unify requirements.txt [\#8](https://github.com/datafusion-contrib/datafusion-python/pull/8) ([cpcloud](https://github.com/cpcloud)) +- Ship the Cargo.lock file in the source distribution [\#7](https://github.com/datafusion-contrib/datafusion-python/pull/7) ([cpcloud](https://github.com/cpcloud)) +- add \_\_version\_\_ attribute to datafusion object [\#3](https://github.com/datafusion-contrib/datafusion-python/pull/3) ([tfeda](https://github.com/tfeda)) +- fix ci by fixing directories [\#2](https://github.com/datafusion-contrib/datafusion-python/pull/2) ([Jimexist](https://github.com/Jimexist)) +- setup workflow [\#1](https://github.com/datafusion-contrib/datafusion-python/pull/1) ([Jimexist](https://github.com/Jimexist)) + +## [0.5.1](https://github.com/datafusion-contrib/datafusion-python/tree/0.5.1) (2022-03-15) + +[Full Changelog](https://github.com/datafusion-contrib/datafusion-python/compare/0.5.1-rc1...0.5.1) + +## [0.5.1-rc1](https://github.com/datafusion-contrib/datafusion-python/tree/0.5.1-rc1) (2022-03-15) + +[Full Changelog](https://github.com/datafusion-contrib/datafusion-python/compare/0.5.0...0.5.1-rc1) + +## [0.5.0](https://github.com/datafusion-contrib/datafusion-python/tree/0.5.0) (2022-03-10) + +[Full Changelog](https://github.com/datafusion-contrib/datafusion-python/compare/0.5.0-rc2...0.5.0) + +## [0.5.0-rc2](https://github.com/datafusion-contrib/datafusion-python/tree/0.5.0-rc2) (2022-03-10) + +[Full Changelog](https://github.com/datafusion-contrib/datafusion-python/compare/0.5.0-rc1...0.5.0-rc2) + +**Closed issues:** + +- Add support for Ballista 
[\#37](https://github.com/datafusion-contrib/datafusion-python/issues/37) +- Implement DataFrame.explain [\#35](https://github.com/datafusion-contrib/datafusion-python/issues/35) + +## [0.5.0-rc1](https://github.com/datafusion-contrib/datafusion-python/tree/0.5.0-rc1) (2022-03-09) + +[Full Changelog](https://github.com/datafusion-contrib/datafusion-python/compare/4c98b8e9c3c3f8e2e6a8f2d1ffcfefda344c4680...0.5.0-rc1) + +**Closed issues:** + +- Investigate exposing additional optimizations [\#28](https://github.com/datafusion-contrib/datafusion-python/issues/28) +- Use custom allocator in Python build [\#27](https://github.com/datafusion-contrib/datafusion-python/issues/27) +- Why is pandas a requirement? [\#24](https://github.com/datafusion-contrib/datafusion-python/issues/24) +- Unable to build [\#18](https://github.com/datafusion-contrib/datafusion-python/issues/18) +- Setup CI against multiple Python version [\#6](https://github.com/datafusion-contrib/datafusion-python/issues/6) diff --git a/dev/check_crates_patch.py b/dev/check_crates_patch.py new file mode 100644 index 000000000..74e489e1f --- /dev/null +++ b/dev/check_crates_patch.py @@ -0,0 +1,61 @@ +#!/usr/bin/env python3 +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+ +"""Check that no Cargo.toml files contain [patch.crates-io] entries. + +Release builds must not depend on patched crates. During development it is +common to temporarily patch crates-io dependencies, but those patches must +be removed before creating a release. + +An empty [patch.crates-io] section is allowed. +""" + +import sys +from pathlib import Path + +import tomllib + + +def main() -> int: + errors: list[str] = [] + for cargo_toml in sorted(Path().rglob("Cargo.toml")): + if "target" in cargo_toml.parts: + continue + with Path.open(cargo_toml, "rb") as f: + data = tomllib.load(f) + patch = data.get("patch", {}).get("crates-io", {}) + if patch: + errors.append(str(cargo_toml)) + for name, spec in patch.items(): + errors.append(f" {name} = {spec}") + + if errors: + print("ERROR: Release builds must not contain [patch.crates-io] entries.") + print() + for line in errors: + print(line) + print() + print("Remove all [patch.crates-io] entries before creating a release.") + return 1 + + print("OK: No [patch.crates-io] entries found.") + return 0 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/dev/clean.sh b/dev/clean.sh new file mode 100755 index 000000000..0d86680e8 --- /dev/null +++ b/dev/clean.sh @@ -0,0 +1,62 @@ +#!/usr/bin/env bash +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. 
See the License for the +# specific language governing permissions and limitations +# under the License. +# + +# This cleans up the project by removing build artifacts and other generated files. + +# Function to remove a directory and print the action +remove_dir() { + if [ -d "$1" ]; then + echo "Removing directory: $1" + rm -rf "$1" + fi +} + +# Function to remove a file and print the action +remove_file() { + if [ -f "$1" ]; then + echo "Removing file: $1" + rm -f "$1" + fi +} + +# Remove .pytest_cache directory +remove_dir .pytest_cache/ + +# Remove target directory +remove_dir target/ + +# Remove any __pycache__ directories +find python/ -type d -name "__pycache__" -print | while read -r dir; do + remove_dir "$dir" +done + +# Remove pytest-coverage.lcov file +# remove_file .coverage +# remove_file pytest-coverage.lcov + +# Remove rust-coverage.lcov file +# remove_file rust-coverage.lcov + +# Remove pyo3 files +find python/ -type f -name '_internal.*.so' -print | while read -r file; do + remove_file "$file" +done + +echo "Cleanup complete." 
\ No newline at end of file diff --git a/dev/create_license.py b/dev/create_license.py index 2a67cb8fd..acbf8587c 100644 --- a/dev/create_license.py +++ b/dev/create_license.py @@ -20,12 +20,11 @@ import json import subprocess +from pathlib import Path -subprocess.check_output(["cargo", "install", "cargo-license"]) data = subprocess.check_output( [ - "cargo", - "license", + "cargo-license", "--avoid-build-deps", "--avoid-dev-deps", "--do-not-bundle", @@ -248,5 +247,5 @@ result += "------------------\n\n" result += f"### {name} {version}\n* source: [{repository}]({repository})\n* license: {license}\n\n" -with open("LICENSE.txt", "w") as f: +with Path.open("LICENSE.txt", "w") as f: f.write(result) diff --git a/dev/python_lint.sh b/dev/python_lint.sh index 949346294..2d867f29d 100755 --- a/dev/python_lint.sh +++ b/dev/python_lint.sh @@ -1,4 +1,4 @@ -#!/bin/bash +#!/usr/bin/env bash # Licensed to the Apache Software Foundation (ASF) under one # or more contributor license agreements. See the NOTICE file @@ -21,6 +21,6 @@ # DataFusion CI does set -e -source venv/bin/activate -flake8 --exclude venv --ignore=E501,W503 +source .venv/bin/activate +flake8 --exclude venv,benchmarks/db-benchmark --ignore=E501,W503 black --line-length 79 . diff --git a/dev/release/README.md b/dev/release/README.md index cec0eef5e..ed28f4aa6 100644 --- a/dev/release/README.md +++ b/dev/release/README.md @@ -56,33 +56,43 @@ Before creating a new release: - a PR should be created and merged to update the major version number of the project - A new release branch should be created, such as `branch-0.8` -### Update CHANGELOG.md +## Preparing a Release Candidate -Define release branch (e.g. `branch-0.8`), base version tag (e.g. `0.7.0`) and future version tag (e.g. `0.9.0`). Commits -between the base version tag and the release branch will be used to populate the changelog content. +### Change Log + +We maintain a `CHANGELOG.md` so our users know what has been changed between releases. 
+ +The changelog is generated using a Python script: ```bash -# create the changelog -CHANGELOG_GITHUB_TOKEN= ./dev/release/update_change_log-datafusion-python.sh main 0.8.0 0.7.0 -# review change log / edit issues and labels if needed, rerun until you are happy with the result -git commit -a -m 'Create changelog for release' +$ GITHUB_TOKEN= ./dev/release/generate-changelog.py 24.0.0 HEAD 25.0.0 > dev/changelog/25.0.0.md ``` -_If you see the error `"You have exceeded a secondary rate limit"` when running this script, try reducing the CPU -allocation to slow the process down and throttle the number of GitHub requests made per minute, by modifying the -value of the `--cpus` argument in the `update_change_log.sh` script._ +This script creates a changelog from GitHub PRs based on the labels associated with them, as well as by looking for +PR titles starting with `feat:`, `fix:`, or `docs:`. The script will produce output similar to: -You can add `invalid` or `development-process` label to exclude items from -release notes. +``` +Fetching list of commits between 24.0.0 and HEAD +Fetching pull requests +Categorizing pull requests +Generating changelog content +``` -Send a PR to get these changes merged into the release branch (e.g. `branch-0.8`). If new commits that could change the -change log content landed in the release branch before you could merge the PR, you need to rerun the changelog update -script to regenerate the changelog and update the PR accordingly. +### Update the version number -### Preparing a Release Candidate +The only place you should need to update the version is in the root `Cargo.toml`. +After updating the TOML file, run `cargo update` to update the Cargo lock file. +If you do not want to update all the dependencies, you can instead run `cargo build`, +which should only update the version number for `datafusion-python`. ### Tag the Repository
+ +Assuming you have set up a remote to the `apache` repository rather than your personal fork, +you need to push a tag to start the CI process for release candidates. The following assumes +the upstream repository is called `apache`. + ```bash git tag 0.8.0-rc1 git push apache 0.8.0-rc1 @@ -94,42 +104,7 @@ git push apache 0.8.0-rc1 ./dev/release/create-tarball.sh 0.8.0 1 ``` -This will also create the email template to send to the mailing list. Here is an example: - -``` -To: dev@arrow.apache.org -Subject: [VOTE][RUST][DataFusion] Release DataFusion Python Bindings 0.7.0 RC2 -Hi, - -I would like to propose a release of Apache Arrow DataFusion Python Bindings, -version 0.7.0. - -This release candidate is based on commit: bd1b78b6d444b7ab172c6aec23fa58c842a592d7 [1] -The proposed release tarball and signatures are hosted at [2]. -The changelog is located at [3]. -The Python wheels are located at [4]. - -Please download, verify checksums and signatures, run the unit tests, and vote -on the release. The vote will be open for at least 72 hours. - -Only votes from PMC members are binding, but all members of the community are -encouraged to test the release and vote with "(non-binding)". - -The standard verification procedure is documented at https://github.com/apache/arrow-datafusion-python/blob/main/dev/release/README.md#verifying-release-candidates. - -[ ] +1 Release this as Apache Arrow DataFusion Python 0.7.0 -[ ] +0 -[ ] -1 Do not release this as Apache Arrow DataFusion Python 0.7.0 because... - -Here is my vote: - -+1 - -[1]: https://github.com/apache/arrow-datafusion-python/tree/bd1b78b6d444b7ab172c6aec23fa58c842a592d7 -[2]: https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-datafusion-python-0.7.0-rc2 -[3]: https://github.com/apache/arrow-datafusion-python/blob/bd1b78b6d444b7ab172c6aec23fa58c842a592d7/CHANGELOG.md -[4]: https://test.pypi.org/project/datafusion/0.7.0/ -``` +This will also create the email template to send to the mailing list. 
Create a draft email using this content, but do not send until after completing the next step. @@ -142,17 +117,25 @@ This section assumes some familiarity with publishing Python packages to PyPi. F Pushing an `rc` tag to the release branch will cause a GitHub Workflow to run that will build the Python wheels. -Go to https://github.com/apache/arrow-datafusion-python/actions and look for an action named "Python Release Build" +Go to https://github.com/apache/datafusion-python/actions and look for an action named "Python Release Build" that has run against the pushed tag. -Click on the action and scroll down to the bottom of the page titled "Artifacts". Download `dist.zip`. +Click on the action and scroll down to the bottom of the page titled "Artifacts". Download `dist.zip`. It should +contain files such as: + +```text +datafusion-22.0.0-cp37-abi3-macosx_10_7_x86_64.whl +datafusion-22.0.0-cp37-abi3-macosx_11_0_arm64.whl +datafusion-22.0.0-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl +datafusion-22.0.0-cp37-abi3-win_amd64.whl +``` Upload the wheels to testpypi. ```bash unzip dist.zip python3 -m pip install --upgrade setuptools twine build -python3 -m twine upload --repository testpypi datafusion-0.7.0-cp37-abi3-*.whl +python3 -m twine upload --repository testpypi datafusion-22.0.0-cp37-abi3-*.whl ``` When prompted for username, enter `__token__`. When prompted for a password, enter a valid GitHub Personal Access Token @@ -162,7 +145,7 @@ When prompted for username, enter `__token__`. When prompted for a password, ent Download the source tarball created in the previous step, untar it, and run: ```bash -python3 -m build +maturin sdist ``` This will create a file named `dist/datafusion-0.7.0.tar.gz`. Upload this to testpypi: @@ -171,16 +154,62 @@ This will create a file named `dist/datafusion-0.7.0.tar.gz`. 
Upload this to tes python3 -m twine upload --repository testpypi dist/datafusion-0.7.0.tar.gz ``` +### Run Verify Release Candidate Workflow + +Before sending the vote email, run the manually triggered GitHub Actions workflow +"Verify Release Candidate" and confirm all matrix jobs pass across the OS/architecture matrix +(for example, Linux, macOS, and Windows runners): + +1. Go to https://github.com/apache/datafusion-python/actions/workflows/verify-release-candidate.yml +2. Click "Run workflow" +3. Set `version` to the release version (for example, `52.0.0`) +4. Set `rc_number` to the RC number (for example, `0`) +5. Wait for all jobs to complete successfully + +Include a short note in the vote email template that this workflow was run across all OS/architecture +matrix entries and that all jobs passed. + +```text +Verification note: The manually triggered "Verify Release Candidate" workflow was run for version and rc_number across all configured OS/architecture matrix entries, and all matrix jobs completed successfully. +``` + ### Send the Email Send the email to start the vote. 
## Verifying a Release -Install the release from testpypi: ```bash -pip install --extra-index-url https://test.pypi.org/simple/ datafusion==0.7.0 +git clone https://github.com/apache/datafusion-python.git +dev/release/verify-release-candidate.sh 48.0.0 1 +``` + +Alternatively, one can run unit tests against a testpypi release candidate: + +```bash +# clone a fresh repo +git clone https://github.com/apache/datafusion-python.git +cd datafusion-python + +# checkout the release commit +git fetch --tags +git checkout 40.0.0-rc1 +git submodule update --init --recursive + +# create the env +python3 -m venv .venv +source .venv/bin/activate + +# install release candidate +pip install --extra-index-url https://test.pypi.org/simple/ datafusion==40.0.0 + +# install test dependencies +pip install pytest numpy pytest-asyncio + +# run the tests +pytest --import-mode=importlib python/tests -vv ``` Try running one of the examples from the top-level README, or write some custom Python code to query some available @@ -198,6 +227,14 @@ Create the source release tarball: ./dev/release/release-tarball.sh 0.8.0 1 ``` +### Publishing Rust Crate to crates.io + +Some projects depend on the Rust crate directly, so we publish it to crates.io. + +```shell +cargo publish +``` + ### Publishing Python Artifacts to PyPi Go to the Test PyPI page of Datafusion, and download @@ -208,33 +245,59 @@ uploading them using `twine`: twine upload --repository pypi dist-release/* ``` -### Publish Python Artifacts to Anaconda +### Publish Python Artifacts to conda-forge + +PyPI packages are automatically uploaded to conda-forge via the [datafusion feedstock](https://github.com/conda-forge/datafusion-feedstock) -Publishing artifacts to Anaconda is similar to PyPi. First, Download the source tarball created in the previous step and untar it.
+### Push the Release Tag ```bash -# Assuming you have an existing conda environment named `datafusion-dev` if not see root README for instructions -conda activate datafusion-dev -conda build . +git checkout 0.8.0-rc1 +git tag 0.8.0 +git push apache 0.8.0 ``` -This will setup a virtual conda environment and build the artifacts inside of that virtual env. This step can take a few minutes as the entire build, host, and runtime environments are setup. Once complete a local filesystem path will be emitted for the location of the resulting package. Observe that path and copy to your clipboard. +### Add the release to Apache Reporter + +Add the release to https://reporter.apache.org/addrelease.html?datafusion with a version name prefixed with `DATAFUSION-PYTHON`, +for example `DATAFUSION-PYTHON-31.0.0`. + +The release information is used to generate a template for a board report (see example from Apache Arrow +[here](https://github.com/apache/arrow/pull/14357)). + +### Delete old RCs and Releases + +See the ASF documentation on [when to archive](https://www.apache.org/legal/release-policy.html#when-to-archive) +for more information. -Ex: `/home/conda/envs/datafusion/conda-bld/linux-64/datafusion-0.7.0.tar.bz2` +#### Deleting old release candidates from `dev` svn -Now you are ready to publish this resulting package to anaconda.org. This can be accomplished in a few simple steps. +Release candidates should be deleted once the release is published. 
+ +Get a list of DataFusion release candidates: ```bash -# First login to Anaconda with the datafusion credentials -anaconda login -# Upload the package -anaconda upload /home/conda/envs/datafusion/conda-bld/linux-64/datafusion-0.7.0.tar.bz2 +svn ls https://dist.apache.org/repos/dist/dev/datafusion | grep datafusion-python ``` -### Push the Release Tag +Delete a release candidate: ```bash -git checkout 0.8.0-rc1 -git tag 0.8.0 -git push apache 0.8.0 +svn delete -m "delete old DataFusion RC" https://dist.apache.org/repos/dist/dev/datafusion/apache-datafusion-python-7.1.0-rc1/ +``` + +#### Deleting old releases from `release` svn + +Only the latest release should be available. Delete old releases after publishing the new release. + +Get a list of DataFusion releases: + +```bash +svn ls https://dist.apache.org/repos/dist/release/datafusion | grep datafusion-python +``` + +Delete a release: + +```bash +svn delete -m "delete old DataFusion release" https://dist.apache.org/repos/dist/release/datafusion/datafusion-python-7.0.0 ``` diff --git a/dev/release/check-rat-report.py b/dev/release/check-rat-report.py index 30a01116b..72a35212e 100644 --- a/dev/release/check-rat-report.py +++ b/dev/release/check-rat-report.py @@ -21,17 +21,16 @@ import re import sys import xml.etree.ElementTree as ET +from pathlib import Path if len(sys.argv) != 3: - sys.stderr.write( - "Usage: %s exclude_globs.lst rat_report.xml\n" % sys.argv[0] - ) + sys.stderr.write("Usage: %s exclude_globs.lst rat_report.xml\n" % sys.argv[0]) sys.exit(1) exclude_globs_filename = sys.argv[1] xml_filename = sys.argv[2] -globs = [line.strip() for line in open(exclude_globs_filename, "r")] +globs = [line.strip() for line in Path.open(exclude_globs_filename)] tree = ET.parse(xml_filename) root = tree.getroot() diff --git a/dev/release/create-tarball.sh b/dev/release/create-tarball.sh index c05da5b75..d6ca76561 100755 --- a/dev/release/create-tarball.sh +++ b/dev/release/create-tarball.sh @@ -21,9 +21,9 @@ # Adapted 
from https://github.com/apache/arrow-rs/tree/master/dev/release/create-tarball.sh # This script creates a signed tarball in -# dev/dist/apache-arrow-datafusion-python--.tar.gz and uploads it to +# dev/dist/apache-datafusion-python--.tar.gz and uploads it to # the "dev" area of the dist.apache.arrow repository and prepares an -# email for sending to the dev@arrow.apache.org list for a formal +# email for sending to the dev@datafusion.apache.org list for a formal # vote. # # See release/README.md for full release instructions @@ -65,25 +65,25 @@ tag="${version}-rc${rc}" echo "Attempting to create ${tarball} from tag ${tag}" release_hash=$(cd "${SOURCE_TOP_DIR}" && git rev-list --max-count=1 ${tag}) -release=apache-arrow-datafusion-python-${version} +release=apache-datafusion-python-${version} distdir=${SOURCE_TOP_DIR}/dev/dist/${release}-rc${rc} tarname=${release}.tar.gz tarball=${distdir}/${tarname} -url="https://dist.apache.org/repos/dist/dev/arrow/${release}-rc${rc}" +url="https://dist.apache.org/repos/dist/dev/datafusion/${release}-rc${rc}" if [ -z "$release_hash" ]; then echo "Cannot continue: unknown git tag: ${tag}" fi -echo "Draft email for dev@arrow.apache.org mailing list" +echo "Draft email for dev@datafusion.apache.org mailing list" echo "" echo "---------------------------------------------------------" cat < ${tarball}.sha256 (cd ${distdir} && shasum -a 512 ${tarname}) > ${tarball}.sha512 -echo "Uploading to apache dist/dev to ${url}" -svn co --depth=empty https://dist.apache.org/repos/dist/dev/arrow ${SOURCE_TOP_DIR}/dev/dist +echo "Uploading to datafusion dist/dev to ${url}" +svn co --depth=empty https://dist.apache.org/repos/dist/dev/datafusion ${SOURCE_TOP_DIR}/dev/dist svn add ${distdir} -svn ci -m "Apache Arrow DataFusion Python ${version} ${rc}" ${distdir} +svn ci -m "Apache DataFusion Python ${version} ${rc}" ${distdir} diff --git a/dev/release/generate-changelog.py b/dev/release/generate-changelog.py new file mode 100755 index 
000000000..d86736773 --- /dev/null +++ b/dev/release/generate-changelog.py @@ -0,0 +1,179 @@ +#!/usr/bin/env python + +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse +import os +import re +import subprocess +import sys + +from github import Github + + +def print_pulls(repo_name, title, pulls) -> None: + if len(pulls) > 0: + print(f"**{title}:**") + print() + for pull, commit in pulls: + url = f"https://github.com/{repo_name}/pull/{pull.number}" + print(f"- {pull.title} [#{pull.number}]({url}) ({commit.author.login})") + print() + + +def generate_changelog(repo, repo_name, tag1, tag2, version) -> None: + # get a list of commits between two tags + print(f"Fetching list of commits between {tag1} and {tag2}", file=sys.stderr) + comparison = repo.compare(tag1, tag2) + + # get the pull requests for these commits + print("Fetching pull requests", file=sys.stderr) + unique_pulls = [] + all_pulls = [] + for commit in comparison.commits: + pulls = commit.get_pulls() + for pull in pulls: + # there can be multiple commits per PR if squash merge is not being used and + # in this case we should get all the author names, but for now just pick one + if pull.number not in unique_pulls: + unique_pulls.append(pull.number) + 
all_pulls.append((pull, commit)) + + # we split the pulls into categories + breaking = [] + bugs = [] + docs = [] + enhancements = [] + performance = [] + other = [] + + # categorize the pull requests based on GitHub labels + print("Categorizing pull requests", file=sys.stderr) + for pull, commit in all_pulls: + # see if PR title uses Conventional Commits + cc_type = "" + cc_breaking = "" + parts = re.findall(r"^([a-z]+)(\([a-z]+\))?(!)?:", pull.title) + if len(parts) == 1: + parts_tuple = parts[0] + cc_type = parts_tuple[0] # fix, feat, docs, chore + # cc_scope = parts_tuple[1] # component within project + cc_breaking = parts_tuple[2] == "!" + + labels = [label.name for label in pull.labels] + if "api change" in labels or cc_breaking: + breaking.append((pull, commit)) + elif "bug" in labels or cc_type == "fix": + bugs.append((pull, commit)) + elif "performance" in labels or cc_type == "perf": + performance.append((pull, commit)) + elif "enhancement" in labels or cc_type == "feat": + enhancements.append((pull, commit)) + elif "documentation" in labels or cc_type == "docs" or cc_type == "doc": + docs.append((pull, commit)) + else: + other.append((pull, commit)) + + # produce the changelog content + print("Generating changelog content", file=sys.stderr) + + # ASF header + print("""\n""") + + print(f"# Apache DataFusion Python {version} Changelog\n") + + # get the number of commits + commit_count = subprocess.check_output( + f"git log --pretty=oneline {tag1}..{tag2} | wc -l", shell=True, text=True + ).strip() + + # get number of contributors + contributor_count = subprocess.check_output( + f"git shortlog -sn {tag1}..{tag2} | wc -l", shell=True, text=True + ).strip() + + print( + f"This release consists of {commit_count} commits from {contributor_count} contributors. 
" + f"See credits at the end of this changelog for more information.\n" + ) + + print_pulls(repo_name, "Breaking changes", breaking) + print_pulls(repo_name, "Performance related", performance) + print_pulls(repo_name, "Implemented enhancements", enhancements) + print_pulls(repo_name, "Fixed bugs", bugs) + print_pulls(repo_name, "Documentation updates", docs) + print_pulls(repo_name, "Other", other) + + # show code contributions + credits = subprocess.check_output( + f"git shortlog -sn {tag1}..{tag2}", shell=True, text=True + ).rstrip() + + print("## Credits\n") + print( + "Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) " + "per contributor.\n" + ) + print("```") + print(credits) + print("```\n") + + print( + "Thank you also to everyone who contributed in other ways such as filing issues, reviewing " + "PRs, and providing feedback on this release.\n" + ) + + +def cli(args=None) -> None: + """Process command line arguments.""" + if not args: + args = sys.argv[1:] + + parser = argparse.ArgumentParser() + parser.add_argument("tag1", help="The previous commit or tag (e.g. 0.1.0)") + parser.add_argument("tag2", help="The current commit or tag (e.g. 
HEAD)") + parser.add_argument( + "version", help="The version number to include in the changelog" + ) + args = parser.parse_args(args) + + token = os.getenv("GITHUB_TOKEN") + project = "apache/datafusion-python" + + g = Github(token) + repo = g.get_repo(project) + generate_changelog(repo, project, args.tag1, args.tag2, args.version) + + +if __name__ == "__main__": + cli() diff --git a/dev/release/rat_exclude_files.txt b/dev/release/rat_exclude_files.txt index db5379d89..b2db144e8 100644 --- a/dev/release/rat_exclude_files.txt +++ b/dev/release/rat_exclude_files.txt @@ -42,4 +42,10 @@ Cargo.lock .history *rat.txt */.git -docs.yaml \ No newline at end of file +.github/* +benchmarks/tpch/queries/q*.sql +benchmarks/tpch/create_tables.sql +.cargo/config.toml +**/.cargo/config.toml +uv.lock +examples/tpch/answers_sf1/*.tbl \ No newline at end of file diff --git a/dev/release/release-tarball.sh b/dev/release/release-tarball.sh index f5e8eb1bf..2b82d1bac 100755 --- a/dev/release/release-tarball.sh +++ b/dev/release/release-tarball.sh @@ -43,7 +43,14 @@ fi version=$1 rc=$2 -tmp_dir=tmp-apache-arrow-datafusion-python-dist +read -r -p "Proceed to release tarball for ${version}-rc${rc}? [y/N]: " answer +answer=${answer:-no} +if [ "${answer}" != "y" ]; then + echo "Cancelled tarball release!" 
+ exit 1 +fi + +tmp_dir=tmp-apache-datafusion-python-dist echo "Recreate temporary directory: ${tmp_dir}" rm -rf ${tmp_dir} @@ -52,23 +59,23 @@ mkdir -p ${tmp_dir} echo "Clone dev dist repository" svn \ co \ - https://dist.apache.org/repos/dist/dev/arrow/apache-arrow-datafusion-python-${version}-rc${rc} \ + https://dist.apache.org/repos/dist/dev/datafusion/apache-datafusion-python-${version}-rc${rc} \ ${tmp_dir}/dev echo "Clone release dist repository" -svn co https://dist.apache.org/repos/dist/release/arrow ${tmp_dir}/release +svn co https://dist.apache.org/repos/dist/release/datafusion ${tmp_dir}/release echo "Copy ${version}-rc${rc} to release working copy" -release_version=arrow-datafusion-python-${version} +release_version=datafusion-python-${version} mkdir -p ${tmp_dir}/release/${release_version} cp -r ${tmp_dir}/dev/* ${tmp_dir}/release/${release_version}/ svn add ${tmp_dir}/release/${release_version} echo "Commit release" -svn ci -m "Apache Arrow DataFusion Python ${version}" ${tmp_dir}/release +svn ci -m "Apache DataFusion Python ${version}" ${tmp_dir}/release echo "Clean up" rm -rf ${tmp_dir} echo "Success! The release is available here:" -echo " https://dist.apache.org/repos/dist/release/arrow/${release_version}" +echo " https://dist.apache.org/repos/dist/release/datafusion/${release_version}" diff --git a/dev/release/update_change_log-datafusion-python.sh b/dev/release/update_change_log-datafusion-python.sh deleted file mode 100755 index a11447f0b..000000000 --- a/dev/release/update_change_log-datafusion-python.sh +++ /dev/null @@ -1,33 +0,0 @@ -#!/bin/bash -# -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. 
You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. See the License for the -# specific language governing permissions and limitations -# under the License. -# - -# Usage: -# CHANGELOG_GITHUB_TOKEN= ./update_change_log-datafusion.sh main 8.0.0 7.1.0 -# CHANGELOG_GITHUB_TOKEN= ./update_change_log-datafusion.sh maint-7.x 7.1.0 7.0.0 - -RELEASE_BRANCH=$1 -RELEASE_TAG=$2 -BASE_TAG=$3 - -SOURCE_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" -${SOURCE_DIR}/update_change_log.sh \ - "${BASE_TAG}" \ - --future-release "${RELEASE_TAG}" \ - --release-branch "${RELEASE_BRANCH}" diff --git a/dev/release/update_change_log.sh b/dev/release/update_change_log.sh deleted file mode 100755 index a0b398131..000000000 --- a/dev/release/update_change_log.sh +++ /dev/null @@ -1,87 +0,0 @@ -#!/bin/bash -# -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. See the License for the -# specific language governing permissions and limitations -# under the License. 
-# - -# Adapted from https://github.com/apache/arrow-rs/tree/master/dev/release/update_change_log.sh - -# invokes the changelog generator from -# https://github.com/github-changelog-generator/github-changelog-generator -# -# With the config located in -# arrow-datafusion/.github_changelog_generator -# -# Usage: -# CHANGELOG_GITHUB_TOKEN= ./update_change_log.sh - -set -e - -SOURCE_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" -SOURCE_TOP_DIR="$(cd "${SOURCE_DIR}/../../" && pwd)" - -if [[ "$#" -lt 1 ]]; then - echo "USAGE: $0 SINCE_TAG EXTRA_ARGS..." - exit 1 -fi - -SINCE_TAG=$1 -shift 1 - -OUTPUT_PATH="CHANGELOG.md" - -pushd ${SOURCE_TOP_DIR} - -# reset content in changelog -git checkout "${SINCE_TAG}" "${OUTPUT_PATH}" -# remove license header so github-changelog-generator has a clean base to append -sed -i.bak '1,18d' "${OUTPUT_PATH}" - -docker run -it --rm \ - --cpus "0.1" \ - -e CHANGELOG_GITHUB_TOKEN=$CHANGELOG_GITHUB_TOKEN \ - -v "$(pwd)":/usr/local/src/your-app \ - githubchangeloggenerator/github-changelog-generator \ - --user apache \ - --project arrow-datafusion-python \ - --since-tag "${SINCE_TAG}" \ - --base "${OUTPUT_PATH}" \ - --output "${OUTPUT_PATH}" \ - "$@" - -sed -i.bak "s/\\\n/\n\n/" "${OUTPUT_PATH}" - -echo ' -' | cat - "${OUTPUT_PATH}" > "${OUTPUT_PATH}".tmp -mv "${OUTPUT_PATH}".tmp "${OUTPUT_PATH}" diff --git a/dev/release/verify-release-candidate.sh b/dev/release/verify-release-candidate.sh index be86f69e0..9591e0335 100755 --- a/dev/release/verify-release-candidate.sh +++ b/dev/release/verify-release-candidate.sh @@ -1,4 +1,4 @@ -#!/bin/bash +#!/usr/bin/env bash # # Licensed to the Apache Software Foundation (ASF) under one # or more contributor license agreements. 
See the NOTICE file @@ -32,8 +32,8 @@ set -x set -o pipefail SOURCE_DIR="$(cd "$(dirname "${BASH_SOURCE[0]:-$0}")" && pwd)" -ARROW_DIR="$(dirname $(dirname ${SOURCE_DIR}))" -ARROW_DIST_URL='https://dist.apache.org/repos/dist/dev/arrow' +DATAFUSION_PYTHON_DIR="$(dirname $(dirname ${SOURCE_DIR}))" +DATAFUSION_PYTHON_DIST_URL='https://dist.apache.org/repos/dist/dev/datafusion' download_dist_file() { curl \ @@ -41,11 +41,11 @@ download_dist_file() { --show-error \ --fail \ --location \ - --remote-name $ARROW_DIST_URL/$1 + --remote-name $DATAFUSION_PYTHON_DIST_URL/$1 } download_rc_file() { - download_dist_file apache-arrow-datafusion-python-${VERSION}-rc${RC_NUMBER}/$1 + download_dist_file apache-datafusion-python-${VERSION}-rc${RC_NUMBER}/$1 } import_gpg_keys() { @@ -89,31 +89,40 @@ verify_dir_artifact_signatures() { setup_tempdir() { cleanup() { if [ "${TEST_SUCCESS}" = "yes" ]; then - rm -fr "${ARROW_TMPDIR}" + rm -fr "${DATAFUSION_PYTHON_TMPDIR}" else - echo "Failed to verify release candidate. See ${ARROW_TMPDIR} for details." + echo "Failed to verify release candidate. See ${DATAFUSION_PYTHON_TMPDIR} for details." fi } - if [ -z "${ARROW_TMPDIR}" ]; then - # clean up automatically if ARROW_TMPDIR is not defined - ARROW_TMPDIR=$(mktemp -d -t "$1.XXXXX") + if [ -z "${DATAFUSION_PYTHON_TMPDIR}" ]; then + # clean up automatically if DATAFUSION_PYTHON_TMPDIR is not defined + DATAFUSION_PYTHON_TMPDIR=$(mktemp -d -t "$1.XXXXX") trap cleanup EXIT else # don't clean up automatically - mkdir -p "${ARROW_TMPDIR}" + mkdir -p "${DATAFUSION_PYTHON_TMPDIR}" fi } test_source_distribution() { - # install rust toolchain in a similar fashion like test-miniconda + # install rust toolchain export RUSTUP_HOME=$PWD/test-rustup export CARGO_HOME=$PWD/test-rustup curl https://sh.rustup.rs -sSf | sh -s -- -y --no-modify-path - export PATH=$RUSTUP_HOME/bin:$PATH - source $RUSTUP_HOME/env + # On Unix, rustup creates an env file. 
On Windows GitHub runners (MSYS bash), + # that file may not exist, so fall back to adding Cargo bin directly. + if [ -f "$CARGO_HOME/env" ]; then + # shellcheck disable=SC1090 + source "$CARGO_HOME/env" + elif [ -f "$RUSTUP_HOME/env" ]; then + # shellcheck disable=SC1090 + source "$RUSTUP_HOME/env" + else + export PATH="$CARGO_HOME/bin:$PATH" + fi # build and test rust @@ -125,11 +134,21 @@ test_source_distribution() { git clone https://github.com/apache/arrow-testing.git testing git clone https://github.com/apache/parquet-testing.git parquet-testing - python3 -m venv venv - source venv/bin/activate - python3 -m pip install -U pip - python3 -m pip install -r requirements-310.txt - maturin develop + python3 -m venv .venv + if [ -x ".venv/bin/python" ]; then + VENV_PYTHON=".venv/bin/python" + elif [ -x ".venv/Scripts/python.exe" ]; then + VENV_PYTHON=".venv/Scripts/python.exe" + elif [ -x ".venv/Scripts/python" ]; then + VENV_PYTHON=".venv/Scripts/python" + else + echo "Unable to find python executable in virtual environment" + exit 1 + fi + + "$VENV_PYTHON" -m pip install -U pip + "$VENV_PYTHON" -m pip install -U maturin + "$VENV_PYTHON" -m maturin develop #TODO: we should really run tests here as well #python3 -m pytest @@ -142,11 +161,11 @@ test_source_distribution() { TEST_SUCCESS=no -setup_tempdir "arrow-${VERSION}" -echo "Working in sandbox ${ARROW_TMPDIR}" -cd ${ARROW_TMPDIR} +setup_tempdir "datafusion-python-${VERSION}" +echo "Working in sandbox ${DATAFUSION_PYTHON_TMPDIR}" +cd ${DATAFUSION_PYTHON_TMPDIR} -dist_name="apache-arrow-datafusion-python-${VERSION}" +dist_name="apache-datafusion-python-${VERSION}" import_gpg_keys fetch_archive ${dist_name} tar xf ${dist_name}.tar.gz diff --git a/dev/rust_lint.sh b/dev/rust_lint.sh index b1285cbc3..eeb9e2302 100755 --- a/dev/rust_lint.sh +++ b/dev/rust_lint.sh @@ -1,4 +1,4 @@ -#!/bin/bash +#!/usr/bin/env bash # Licensed to the Apache Software Foundation (ASF) under one # or more contributor license agreements. 
See the NOTICE file diff --git a/docs/.gitignore b/docs/.gitignore new file mode 100644 index 000000000..6e8a53b6f --- /dev/null +++ b/docs/.gitignore @@ -0,0 +1,4 @@ +pokemon.csv +yellow_trip_data.parquet +yellow_tripdata_2021-01.parquet + diff --git a/docs/Makefile b/docs/Makefile index e65c8e250..49ebae372 100644 --- a/docs/Makefile +++ b/docs/Makefile @@ -35,4 +35,4 @@ help: # Catch-all target: route all unknown targets to Sphinx using the new # "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). %: Makefile - @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) \ No newline at end of file + @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) --fail-on-warning \ No newline at end of file diff --git a/docs/README.md b/docs/README.md index a6f4998c8..502f1c2a1 100644 --- a/docs/README.md +++ b/docs/README.md @@ -19,46 +19,52 @@ # DataFusion Documentation -This folder contains the source content of the [python api](./source/api). -These are both published to https://arrow.apache.org/datafusion/ -as part of the release process. +This folder contains the source content of the [Python API](./source/api). +This is published to https://datafusion.apache.org/python by a GitHub action +when changes are merged to the main branch. ## Dependencies It's recommended to install build dependencies and build the documentation -inside a Python virtualenv. +inside a Python `venv` using `uv`. -- Python -- `pip install -r requirements.txt` +To prepare building the documentation run the following on the root level of the project: + +```bash +# Set up a virtual environment with the documentation dependencies +uv sync --dev --group docs --no-install-package datafusion +``` ## Build & Preview Run the provided script to build the HTML pages. 
```bash -./build.sh +# Build the repository +uv run --no-project maturin develop --uv +# Build the documentation +uv run --no-project docs/build.sh ``` -The HTML will be generated into a `build` directory. +The HTML will be generated into a `build` directory in `docs`. Preview the site on Linux by running this command. ```bash -firefox build/html/index.html +firefox docs/build/html/index.html ``` ## Release Process -The documentation is served through the -[arrow-site](https://github.com/apache/arrow-site/) repo. To release a new -version of the docs, follow these steps: +This documentation is hosted at https://datafusion.apache.org/python -1. Run `./build.sh` inside `docs` folder to generate the docs website inside the `build/html` folder. -2. Clone the arrow-site repo -3. Checkout to the `asf-site` branch (NOT `master`) -4. Copy build artifacts into `arrow-site` repo's `datafusion` folder with a command such as +When a PR is merged to the `main` branch of this +repository, a [GitHub workflow](https://github.com/apache/datafusion-python/blob/main/.github/workflows/build.yml) runs which: -- `cp -rT ./build/html/ ../../arrow-site/datafusion/` (doesn't work on mac) -- `rsync -avzr ./build/html/ ../../arrow-site/datafusion/` +1. Builds the HTML content +2. Pushes the HTML content to the [`asf-site`](https://github.com/apache/datafusion-python/tree/asf-site) branch in this repository. -5. Commit changes in `arrow-site` and send a PR. \ No newline at end of file +The Apache Software Foundation provides https://arrow.apache.org/, +which serves content based on the configuration in +[.asf.yaml](https://github.com/apache/datafusion-python/blob/main/.asf.yaml), +which specifies the target as https://datafusion.apache.org/python. 
diff --git a/docs/build.sh b/docs/build.sh old mode 100644 new mode 100755 index 3f24f8eec..f73330323 --- a/docs/build.sh +++ b/docs/build.sh @@ -1,4 +1,4 @@ -#!/bin/bash +#!/usr/bin/env bash # # Licensed to the Apache Software Foundation (ASF) under one # or more contributor license agreements. See the NOTICE file @@ -19,10 +19,23 @@ # set -e + +original_dir=$(pwd) +script_dir=$(dirname "$(realpath "$0")") +cd "$script_dir" || exit + +if [ ! -f pokemon.csv ]; then + curl -O https://gist.githubusercontent.com/ritchie46/cac6b337ea52281aa23c049250a4ff03/raw/89a957ff3919d90e6ef2d34235e6bf22304f3366/pokemon.csv +fi + +if [ ! -f yellow_tripdata_2021-01.parquet ]; then + curl -O https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet +fi + rm -rf build 2> /dev/null rm -rf temp 2> /dev/null mkdir temp cp -rf source/* temp/ -# replace relative URLs with absolute URLs -#sed -i 's/\.\.\/\.\.\/\.\.\//https:\/\/github.com\/apache\/arrow-datafusion\/blob\/master\//g' temp/contributor-guide/index.md -make SOURCEDIR=`pwd`/temp html \ No newline at end of file +make SOURCEDIR=`pwd`/temp html + +cd "$original_dir" || exit diff --git a/docs/mdbook/README.md b/docs/mdbook/README.md new file mode 100644 index 000000000..6dae6bc62 --- /dev/null +++ b/docs/mdbook/README.md @@ -0,0 +1,33 @@ + +# DataFusion Book + +This folder builds a DataFusion user guide using [mdBook](https://github.com/rust-lang/mdBook). + +## Build and run book locally + +Build the latest files with `mdbook build`. + +Open the book locally by running `open book/index.html`. + +## Install mdBook + +Download the `mdbook` binary or run `cargo install mdbook`. + +Then manually open it, so you have permissions to run it on your Mac. + +Add it to your path with a command like this so you can easily run the commands: `mv ~/Downloads/mdbook /Users/matthew.powers/.local/bin`. 
diff --git a/datafusion/common.py b/docs/mdbook/book.toml similarity index 85% rename from datafusion/common.py rename to docs/mdbook/book.toml index dd56640a4..089cb9a97 100644 --- a/datafusion/common.py +++ b/docs/mdbook/book.toml @@ -15,9 +15,9 @@ # specific language governing permissions and limitations # under the License. - -from ._internal import common - - -def __getattr__(name): - return getattr(common, name) +[book] +authors = ["Apache Arrow "] +language = "en" +multilingual = false +src = "src" +title = "DataFusion Book" diff --git a/docs/mdbook/src/SUMMARY.md b/docs/mdbook/src/SUMMARY.md new file mode 100644 index 000000000..23467ed4c --- /dev/null +++ b/docs/mdbook/src/SUMMARY.md @@ -0,0 +1,25 @@ + +# Summary + +- [Index](./index.md) +- [Installation](./installation.md) +- [Quickstart](./quickstart.md) +- [Usage](./usage/index.md) + - [Create a table](./usage/create-table.md) + - [Query a table](./usage/query-table.md) + - [Viewing Query Plans](./usage/query-plans.md) \ No newline at end of file diff --git a/docs/mdbook/src/images/datafusion-jupyterlab.png b/docs/mdbook/src/images/datafusion-jupyterlab.png new file mode 100644 index 000000000..c4d46884e Binary files /dev/null and b/docs/mdbook/src/images/datafusion-jupyterlab.png differ diff --git a/docs/mdbook/src/images/plan.svg b/docs/mdbook/src/images/plan.svg new file mode 100644 index 000000000..927147985 --- /dev/null +++ b/docs/mdbook/src/images/plan.svg @@ -0,0 +1,111 @@ + + + + + + +%3 + + +cluster_1 + +LogicalPlan + + +cluster_6 + +Detailed LogicalPlan + + + +2 + +Projection: my_table.a, SUM(my_table.b) + + + +3 + +Aggregate: groupBy=[[my_table.a]], aggr=[[SUM(my_table.b)]] + + + +2->3 + + + + + +4 + +Filter: my_table.a < Int64(3) + + + +3->4 + + + + + +5 + +TableScan: my_table + + + +4->5 + + + + + +7 + +Projection: my_table.a, SUM(my_table.b) +Schema: [a:Int64;N, SUM(my_table.b):Int64;N] + + + +8 + +Aggregate: groupBy=[[my_table.a]], aggr=[[SUM(my_table.b)]] +Schema: [a:Int64;N, 
SUM(my_table.b):Int64;N] + + + +7->8 + + + + + +9 + +Filter: my_table.a < Int64(3) +Schema: [a:Int64;N, b:Int64;N] + + + +8->9 + + + + + +10 + +TableScan: my_table +Schema: [a:Int64;N, b:Int64;N] + + + +9->10 + + + + + diff --git a/docs/mdbook/src/index.md b/docs/mdbook/src/index.md new file mode 100644 index 000000000..2c1d217f8 --- /dev/null +++ b/docs/mdbook/src/index.md @@ -0,0 +1,43 @@ + +# DataFusion Book + +DataFusion is a blazing fast query engine that lets you run data analyses quickly and reliably. + +DataFusion is written in Rust, but also exposes Python and SQL bindings, so you can easily query data in your language of choice. You don't need to know any Rust to be a happy and productive user of DataFusion. + +DataFusion lets you run queries faster than pandas. Let's compare query runtimes for a 5GB CSV file with 100 million rows of data. + +Take a look at a few rows of the data: + +``` ++-------+-------+--------------+-----+-----+-------+----+----+-----------+ +| id1 | id2 | id3 | id4 | id5 | id6 | v1 | v2 | v3 | ++-------+-------+--------------+-----+-----+-------+----+----+-----------+ +| id016 | id016 | id0000042202 | 15 | 24 | 5971 | 5 | 11 | 37.211254 | +| id039 | id045 | id0000029558 | 40 | 49 | 39457 | 5 | 4 | 48.951141 | +| id047 | id023 | id0000071286 | 68 | 20 | 74463 | 2 | 14 | 60.469241 | ++-------+-------+--------------+-----+-----+-------+----+----+-----------+ +``` + +Suppose you'd like to run the following query: `SELECT id1, sum(v1) AS v1 from the_table GROUP BY id1`. + +If you use pandas, then this query will take 43.6 seconds to execute. + +It only takes DataFusion 9.8 seconds to execute the same query. + +DataFusion is easy to use, powerful, and fast. Let's learn more! 
diff --git a/docs/mdbook/src/installation.md b/docs/mdbook/src/installation.md new file mode 100644 index 000000000..b29f3b66b --- /dev/null +++ b/docs/mdbook/src/installation.md @@ -0,0 +1,63 @@ + +# Installation + +DataFusion is easy to install, just like any other Python library. + +## Using uv + +If you do not yet have a virtual environment, create one: + +```bash +uv venv +``` + +You can add datafusion to your virtual environment with the usual: + +```bash +uv pip install datafusion +``` + +Or, to add to a project: + +```bash +uv add datafusion +``` + +## Using pip + +``` bash +pip install datafusion +``` + +## uv & JupyterLab setup + +This section explains how to install DataFusion in a uv environment with other libraries that allow for a nice Jupyter workflow. This setup is completely optional. These steps are only needed if you'd like to run DataFusion in a Jupyter notebook and have an interface like this: + +![DataFusion in Jupyter](https://github.com/MrPowers/datafusion-book/raw/main/src/images/datafusion-jupyterlab.png) + +Create a virtual environment with DataFusion, Jupyter, and other useful dependencies and start the desktop application. + +```bash +uv venv +uv pip install datafusion jupyterlab jupyterlab_code_formatter +uv run jupyter lab +``` + +## Examples + +See the [DataFusion Python Examples](https://github.com/apache/arrow-datafusion-python/tree/main/examples) for a variety of Python scripts that show DataFusion in action! diff --git a/docs/mdbook/src/quickstart.md b/docs/mdbook/src/quickstart.md new file mode 100644 index 000000000..bba0b36ae --- /dev/null +++ b/docs/mdbook/src/quickstart.md @@ -0,0 +1,77 @@ + +# DataFusion Quickstart + +You can easily query a DataFusion table with the Python API or with pure SQL. + +Let's create a small DataFrame and then run some queries with both APIs. + +Start by creating a DataFrame with four rows of data and two columns: `a` and `b`. 
+ +```python +from datafusion import SessionContext + +ctx = SessionContext() + +df = ctx.from_pydict({"a": [1, 2, 3, 1], "b": [4, 5, 6, 7]}, name="my_table") +``` + +Let's add a column that contains the sum of columns `a` and `b`, using the SQL API. + +``` +ctx.sql("select a, b, a + b as sum_a_b from my_table") + ++---+---+---------+ +| a | b | sum_a_b | ++---+---+---------+ +| 1 | 4 | 5 | +| 2 | 5 | 7 | +| 3 | 6 | 9 | +| 1 | 7 | 8 | ++---+---+---------+ +``` + +DataFusion makes it easy to run SQL queries on DataFrames. + +Now let's run the same query with the DataFusion Python API: + +```python +from datafusion import col + +df.select( + col("a"), + col("b"), + col("a") + col("b"), +) +``` + +We get the same result as before: + +``` ++---+---+-------------------------+ +| a | b | my_table.a + my_table.b | ++---+---+-------------------------+ +| 1 | 4 | 5 | +| 2 | 5 | 7 | +| 3 | 6 | 9 | +| 1 | 7 | 8 | ++---+---+-------------------------+ +``` + +DataFusion also allows you to query data with a well-designed Python interface. + +Python users have two great ways to query DataFusion tables. diff --git a/docs/mdbook/src/usage/create-table.md b/docs/mdbook/src/usage/create-table.md new file mode 100644 index 000000000..98870fac0 --- /dev/null +++ b/docs/mdbook/src/usage/create-table.md @@ -0,0 +1,59 @@ + +# DataFusion Create Table + +It's easy to create DataFusion tables from a variety of data sources. + +## Create Table from Python Dictionary + +Here's how to create a DataFusion table from a Python dictionary: + +```python +from datafusion import SessionContext + +ctx = SessionContext() + +df = ctx.from_pydict({"a": [1, 2, 3, 1], "b": [4, 5, 6, 7]}, name="my_table") +``` + +Supplying the `name` parameter is optional. You only need to name the table if you'd like to query it with the SQL API. 
+ +You can also create a DataFrame without a name that can be queried with the Python API: + +```python +from datafusion import SessionContext + +ctx = SessionContext() + +df = ctx.from_pydict({"a": [1, 2, 3, 1], "b": [4, 5, 6, 7]}) +``` + +## Create Table from CSV + +You can read a CSV into a DataFusion DataFrame. Here's how to read the `G1_1e8_1e2_0_0.csv` file into a table named `csv_1e8`: + +```python +ctx.register_csv("csv_1e8", "G1_1e8_1e2_0_0.csv") +``` + +## Create Table from Parquet + +You can read a Parquet file into a DataFusion DataFrame. Here's how to read the `yellow_tripdata_2021-01.parquet` file into a table named `taxi`: + +```python +ctx.register_parquet("taxi", "yellow_tripdata_2021-01.parquet") +``` diff --git a/docs/mdbook/src/usage/index.md b/docs/mdbook/src/usage/index.md new file mode 100644 index 000000000..1ef4406f7 --- /dev/null +++ b/docs/mdbook/src/usage/index.md @@ -0,0 +1,25 @@ + +# Usage + +This section shows how to create DataFusion DataFrames from a variety of data sources like CSV files and Parquet files. + +You'll learn more about the SQL statements that are supported by DataFusion. + +You'll also learn about DataFusion's Python API for querying data. + +The documentation will wrap up with a variety of real-world data processing tasks that are well suited for DataFusion. The lightning-fast speed and reliable execution make DataFusion the best technology for a variety of data processing tasks. diff --git a/docs/mdbook/src/usage/query-plans.md b/docs/mdbook/src/usage/query-plans.md new file mode 100644 index 000000000..a39aa9e42 --- /dev/null +++ b/docs/mdbook/src/usage/query-plans.md @@ -0,0 +1,170 @@ + + +# DataFusion Query Plans + +DataFusion's `DataFrame` is a wrapper around a query plan. In this chapter we will learn how to view +logical and physical query plans for DataFrames. + +## Sample Data + +Let's go ahead and create a simple DataFrame. You can do this in the Python shell or in a notebook. 
+ +```python +from datafusion import SessionContext + +ctx = SessionContext() + +df = ctx.from_pydict({"a": [1, 2, 3, 1], "b": [4, 5, 6, 7]}, name="my_table") +``` + +## Logical Plan + +Next, let's look at the logical plan for this DataFrame. + +```python +>>> df.logical_plan() +TableScan: my_table +``` + +The logical plan here consists of a single `TableScan` operator. Let's make a more interesting plan by creating a new +`DataFrame` representing an aggregate query with a filter. + +```python +>>> df = ctx.sql("SELECT a, sum(b) FROM my_table WHERE a < 3 GROUP BY a") +``` + +When we view the plan for this `DataFrame` we can see that there are now four operators in the plan, each +representing a logical transformation of the data. We start with a `TableScan` to read the data, followed by +a `Filter` to filter out rows that do not match the filter expression, then an `Aggregate` is performed. Finally, +a `Projection` is applied to ensure that the order of the final columns matches the `SELECT` part of the SQL query. + +```python +>>> df.logical_plan() +Projection: my_table.a, SUM(my_table.b) + Aggregate: groupBy=[[my_table.a]], aggr=[[SUM(my_table.b)]] + Filter: my_table.a < Int64(3) + TableScan: my_table +``` + +## Optimized Logical Plan + +DataFusion has a powerful query optimizer which will rewrite query plans to make them more efficient before they are +executed. We can view the output of the optimizer by viewing the optimized logical plan. + +```python +>>> df.optimized_logical_plan() +Aggregate: groupBy=[[my_table.a]], aggr=[[SUM(my_table.b)]] + Filter: my_table.a < Int64(3) + TableScan: my_table projection=[a, b] +``` + +We can see that there are two key differences compared to the unoptimized logical plan: + +- The `Projection` has been removed because it was redundant in this case (the output of the `Aggregate` plan + already had the columns in the correct order). 
+- The `TableScan` now has a projection pushed down so that it only reads the columns required to be able to execute + the query. In this case the table only has two columns and we referenced them both in the query, but this optimization + can be very effective in real-world queries against large tables. + +## Physical Plan + +Logical plans provide a representation of "what" the query should do. Physical plans explain "how" the query +should be executed. + +We can view the physical plan (also known as an execution plan) using the `execution_plan` method. + +```python +>>> df.execution_plan() +AggregateExec: mode=FinalPartitioned, gby=[a@0 as a], aggr=[SUM(my_table.b)] + CoalesceBatchesExec: target_batch_size=8192 + RepartitionExec: partitioning=Hash([Column { name: "a", index: 0 }], 48), input_partitions=48 + AggregateExec: mode=Partial, gby=[a@0 as a], aggr=[SUM(my_table.b)] + CoalesceBatchesExec: target_batch_size=8192 + FilterExec: a@0 < 3 + RepartitionExec: partitioning=RoundRobinBatch(48), input_partitions=1 + MemoryExec: partitions=1, partition_sizes=[1] +``` + +The `TableScan` has now been replaced by a more specific `MemoryExec` for scanning the in-memory data. If we were +querying a CSV file on disk then we would expect to see a `CsvExec` instead. + +This plan has additional operators that were not in the logical plan: + +- `RepartitionExec` has been added so that the data can be split into partitions and processed in parallel using + multiple cores. +- `CoalesceBatchesExec` will combine small batches into larger batches to ensure that processing remains efficient. + +The `Aggregate` operator now appears twice. This is because aggregates are performed in a two-step process. Data is +aggregated within each partition in parallel and then those results (which could contain duplicate grouping keys) are +combined and the aggregate operation is applied again. 
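
The two-step aggregation described above can be illustrated with a small pure-Python sketch. The partitions and values here are hypothetical; DataFusion performs the equivalent work internally, in parallel, over Arrow batches.

```python
from collections import defaultdict

# Hypothetical input already split into two partitions,
# as RepartitionExec would do
partitions = [
    [("a", 1), ("b", 4)],
    [("a", 2), ("b", 5), ("a", 1)],
]

# Step 1: partial aggregation within each partition (mode=Partial);
# each partition is processed independently, so this can run in parallel
partials = []
for part in partitions:
    acc = defaultdict(int)
    for key, value in part:
        acc[key] += value
    partials.append(acc)

# Step 2: final aggregation (mode=FinalPartitioned) combines the partial
# results, which may contain duplicate grouping keys across partitions
final = defaultdict(int)
for acc in partials:
    for key, subtotal in acc.items():
        final[key] += subtotal

print(dict(final))  # {'a': 4, 'b': 9}
```
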
+ +## Creating Query Plan Diagrams + +DataFusion supports generating query plan diagrams in [DOT format](https://graphviz.org/doc/info/lang.html). + +DOT is a language for describing graphs, and there are open source tools such as GraphViz that can render diagrams +from DOT files. + +We can use the following code to generate a DOT file for a logical query plan. + +```python +>>> diagram = df.logical_plan().display_graphviz() +>>> with open('plan.dot', 'w') as f: +...     f.write(diagram) +``` + +If we view the file, we will see the following content. + +``` +// Begin DataFusion GraphViz Plan (see https://graphviz.org) +digraph { + subgraph cluster_1 + { + graph[label="LogicalPlan"] + 2[shape=box label="Projection: my_table.a, SUM(my_table.b)"] + 3[shape=box label="Aggregate: groupBy=[[my_table.a]], aggr=[[SUM(my_table.b)]]"] + 2 -> 3 [arrowhead=none, arrowtail=normal, dir=back] + 4[shape=box label="Filter: my_table.a < Int64(3)"] + 3 -> 4 [arrowhead=none, arrowtail=normal, dir=back] + 5[shape=box label="TableScan: my_table"] + 4 -> 5 [arrowhead=none, arrowtail=normal, dir=back] + } + subgraph cluster_6 + { + graph[label="Detailed LogicalPlan"] + 7[shape=box label="Projection: my_table.a, SUM(my_table.b)\nSchema: [a:Int64;N, SUM(my_table.b):Int64;N]"] + 8[shape=box label="Aggregate: groupBy=[[my_table.a]], aggr=[[SUM(my_table.b)]]\nSchema: [a:Int64;N, SUM(my_table.b):Int64;N]"] + 7 -> 8 [arrowhead=none, arrowtail=normal, dir=back] + 9[shape=box label="Filter: my_table.a < Int64(3)\nSchema: [a:Int64;N, b:Int64;N]"] + 8 -> 9 [arrowhead=none, arrowtail=normal, dir=back] + 10[shape=box label="TableScan: my_table\nSchema: [a:Int64;N, b:Int64;N]"] + 9 -> 10 [arrowhead=none, arrowtail=normal, dir=back] + } +} +// End DataFusion GraphViz Plan +``` + +We can use GraphViz from the command-line to convert this DOT file into an image. 
+
+```bash
+dot -Tsvg plan.dot > plan.svg
+```
+
+This generates the following diagram:
+
+![Query Plan Diagram](../images/plan.svg)
diff --git a/docs/mdbook/src/usage/query-table.md b/docs/mdbook/src/usage/query-table.md
new file mode 100644
index 000000000..5e4e38001
--- /dev/null
+++ b/docs/mdbook/src/usage/query-table.md
@@ -0,0 +1,125 @@
+
+# DataFusion Query Table
+
+DataFusion tables can be queried with SQL or with the Python API.
+
+Let's create a small table and show the different types of queries that can be run.
+
+```python
+from datafusion import SessionContext, col, lit
+
+ctx = SessionContext()
+df = ctx.from_pydict(
+    {
+        "first_name": ["li", "wang", "ron", "amanda"],
+        "age": [25, 75, 68, 18],
+        "country": ["china", "china", "us", "us"],
+    },
+    name="some_people",
+)
+```
+
+Here's the data in the table:
+
+```
++------------+-----+---------+
+| first_name | age | country |
++------------+-----+---------+
+| li         | 25  | china   |
+| wang       | 75  | china   |
+| ron        | 68  | us      |
+| amanda     | 18  | us      |
++------------+-----+---------+
+```
+
+## DataFusion Filter DataFrame
+
+Here's how to find all individuals older than 65 with SQL:
+
+```
+ctx.sql("select * from some_people where age > 65")
+
++------------+-----+---------+
+| first_name | age | country |
++------------+-----+---------+
+| wang       | 75  | china   |
+| ron        | 68  | us      |
++------------+-----+---------+
+```
+
+Here's how to run the same query with Python:
+
+```python
+df.filter(col("age") > lit(65))
+```
+
+```
++------------+-----+---------+
+| first_name | age | country |
++------------+-----+---------+
+| wang       | 75  | china   |
+| ron        | 68  | us      |
++------------+-----+---------+
+```
+
+## DataFusion Select Columns from DataFrame
+
+Here's how to select the `first_name` and `country` columns from the DataFrame with SQL:
+
+```
+ctx.sql("select first_name, country from some_people")
+
++------------+---------+
+| first_name | country |
++------------+---------+
+| li         | china   |
+| wang       | china   |
+| ron        | us      |
+| amanda     | us      |
++------------+---------+
+```
+
+Here's how to run the same query with Python:
+
+```python
+df.select(col("first_name"), col("country"))
+```
+
+```
++------------+---------+
+| first_name | country |
++------------+---------+
+| li         | china   |
+| wang       | china   |
+| ron        | us      |
+| amanda     | us      |
++------------+---------+
+```
+
+## DataFusion Aggregation Query
+
+Here's how to run a group by aggregation query:
+
+```
+ctx.sql("select country, count(*) as num_people from some_people group by country")
+
++---------+------------+
+| country | num_people |
++---------+------------+
+| china   | 2          |
+| us      | 2          |
++---------+------------+
+```
diff --git a/docs/source/_static/images/2x_bgwhite_original.png b/docs/source/_static/images/2x_bgwhite_original.png
new file mode 100644
index 000000000..abb5fca6e
Binary files /dev/null and b/docs/source/_static/images/2x_bgwhite_original.png differ
diff --git a/docs/source/_static/images/DataFusion-Logo-Background-White.png b/docs/source/_static/images/DataFusion-Logo-Background-White.png
deleted file mode 100644
index 023c2373f..000000000
Binary files a/docs/source/_static/images/DataFusion-Logo-Background-White.png and /dev/null differ
diff --git a/docs/source/_static/images/DataFusion-Logo-Background-White.svg b/docs/source/_static/images/DataFusion-Logo-Background-White.svg
deleted file mode 100644
index b3bb47c5e..000000000
--- a/docs/source/_static/images/DataFusion-Logo-Background-White.svg
+++ /dev/null
@@ -1 +0,0 @@
-DataFUSION-Logo-Dark
\ No newline at end of file
diff --git a/docs/source/_static/images/DataFusion-Logo-Dark.png b/docs/source/_static/images/DataFusion-Logo-Dark.png
deleted file mode 100644
index cc60f12a0..000000000
Binary files a/docs/source/_static/images/DataFusion-Logo-Dark.png and /dev/null differ
diff --git a/docs/source/_static/images/DataFusion-Logo-Dark.svg b/docs/source/_static/images/DataFusion-Logo-Dark.svg
deleted file mode 100644
index e16f24443..000000000
---
a/docs/source/_static/images/DataFusion-Logo-Dark.svg +++ /dev/null @@ -1 +0,0 @@ -DataFUSION-Logo-Dark \ No newline at end of file diff --git a/docs/source/_static/images/DataFusion-Logo-Light.png b/docs/source/_static/images/DataFusion-Logo-Light.png deleted file mode 100644 index 8992213b0..000000000 Binary files a/docs/source/_static/images/DataFusion-Logo-Light.png and /dev/null differ diff --git a/docs/source/_static/images/DataFusion-Logo-Light.svg b/docs/source/_static/images/DataFusion-Logo-Light.svg deleted file mode 100644 index b3bef2193..000000000 --- a/docs/source/_static/images/DataFusion-Logo-Light.svg +++ /dev/null @@ -1 +0,0 @@ -DataFUSION-Logo-Light \ No newline at end of file diff --git a/docs/source/_static/images/original.png b/docs/source/_static/images/original.png new file mode 100644 index 000000000..687f94676 Binary files /dev/null and b/docs/source/_static/images/original.png differ diff --git a/docs/source/_static/images/original.svg b/docs/source/_static/images/original.svg new file mode 100644 index 000000000..6ba0ece99 --- /dev/null +++ b/docs/source/_static/images/original.svg @@ -0,0 +1,31 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/docs/source/_static/images/original2x.png b/docs/source/_static/images/original2x.png new file mode 100644 index 000000000..a7402109b Binary files /dev/null and b/docs/source/_static/images/original2x.png differ diff --git a/docs/source/_static/theme_overrides.css b/docs/source/_static/theme_overrides.css index 1e972cc6f..aaa40fba2 100644 --- a/docs/source/_static/theme_overrides.css +++ b/docs/source/_static/theme_overrides.css @@ -56,7 +56,7 @@ a.navbar-brand img { /* This is the bootstrap CSS style for "table-striped". 
Since the theme does
-not yet provide an easy way to configure this globaly, it easier to simply
+not yet provide an easy way to configure this globally, it is easier to simply
 include this snippet here than updating each table in all rst files to add
 ":class: table-striped" */
diff --git a/docs/source/_templates/docs-sidebar.html b/docs/source/_templates/docs-sidebar.html
index bc2bf0092..44deeed25 100644
--- a/docs/source/_templates/docs-sidebar.html
+++ b/docs/source/_templates/docs-sidebar.html
@@ -1,6 +1,6 @@
-
+