Skip to content

Conversation

@unidevel
Copy link

Test PR for branch staging-a35150612-rebase with head a351506

tanjialiang and others added 25 commits June 6, 2025 14:36
Summary:
Pull Request resolved: facebookincubator#13678

* Cleanup the priority in memory pool.
* Make abort candidate favor priority over bucket.

Reviewed By: xiaoxmeng

Differential Revision: D76018221

fbshipit-source-id: 7162f0c8ad967efdc0bd4e588901d42539c51ceb
…cubator#13639)

Summary:
Pull Request resolved: facebookincubator#13639

The current Task abstraction has a limitation when it comes to
operators that evaluate expressions - their expression statistics
are not directly accessible. The only way to access those is through
the ExpressionListner API. This change allows access to those
expressions stats via the operator stats. Since fetching these stats
can be expensive (especially if they are fetched often) a query
option `operator_track_expression_stats` is added to allows the user
to explicitly enable them.

Reviewed By: DanielMunozT, kgpai

Differential Revision: D75997924

fbshipit-source-id: 5fe62971d5e73a0f9f46be0f87def7aaa5dc20f0
…ator#13630)

Summary:
Fixes facebookincubator#13607

Stacked on facebookincubator#13613 which will fix the linux job fail.
The macos jobs fail because the dependency install_prefix wasn't set in the second step and xxhash is missing from the setup script.

Pull Request resolved: facebookincubator#13630

Reviewed By: kgpai

Differential Revision: D76073937

Pulled By: kKPulla

fbshipit-source-id: 8eb9b7ac113ac81ea2391053510b9eb6ce26fa28
…alysis and code refactor (facebookincubator#13655)

Summary:
Pull Request resolved: facebookincubator#13655

Add to support runtime stats tracking to breakdown the time spent inside reconstruct udf and report through operator runtime stats for offline analysis.
Some code refactor and minor optimizations

Reviewed By: juwentus1234, breeze1990

Differential Revision: D76011630

fbshipit-source-id: 77bb5cbad6663fb0815bad0f58159727cf2eb7a6
…incubator#13213)

Summary: Pull Request resolved: facebookincubator#13213

Reviewed By: xiaoxmeng

Differential Revision: D74249660

Pulled By: kgpai

fbshipit-source-id: 614ac257793c6a061dc30dbe23fa2ab8f486c2cf
…r#13680)

Summary:
Pull Request resolved: facebookincubator#13680

Enhance velox::exec::tryEvaluateConstantExpression to allow for propagating exceptions occurred during evaluation of a constant expression. Specifically, constant folding 5 / 0 fails with "division by zero". Before this change, the error would be swallowed. After this change the error would be propagated unless optional parameter suppressEvaluationFailures is set to true.

Remove velox::exec::evaluateConstantExpression as it is no longer used / needed.

Reviewed By: bikramSingh91

Differential Revision: D76152526

fbshipit-source-id: 190abf1c75f8034e58c70fa665abaaca4ba5a189
…acebookincubator#13687)

Summary:
Pull Request resolved: facebookincubator#13687

For sequence storage, we use unnest operator to per-event type optimization which might produce a number of output batches for each input by default which is not efficient for the followup index lookup. This PR introduces a config to enforce unnest operator to produce one output for each input for index lookup efficiency.

Reviewed By: tanjialiang

Differential Revision: D76196529

fbshipit-source-id: 769810efa20dc2315b914e80cb30c17d85177f33
Summary:
Pull Request resolved: facebookincubator#13679

Implement UnnestReplayer, a tool to be used to replay Unnest.
bypass-github-export-checks

Reviewed By: xiaoxmeng

Differential Revision: D76142525

fbshipit-source-id: e05087b9eab0f8edf908e19625837044b3840a6a
Summary:
Pull Request resolved: facebookincubator#13686

We shall not do time check after reclaim. Otherwise we wasted the just-now done reclaim.

Reviewed By: xiaoxmeng, skyelves

Differential Revision: D76172342

fbshipit-source-id: 734f5971432769bbca0ca74e039276034d8e74b3
Summary:
Add unit tests to HashJoinTest to cover the key features of left outer join. Currently these tests fail, as the left join is not implemented and we fallback to inner join, resulting to fewer results than expected.

Rollback Plan:

Pull Request resolved: facebookincubator#13638

Test Plan:
Related Test Files:
```
fbcode/velox/experimental/wave/exec/tests/HashJoinTest.cpp
```
Test Command:

To test the related test files and confirm the tests pass, run the command:
```
buck test @//mode/dbg //velox/experimental/wave/exec/tests:velox_wave_exec_test -- HashJoinTest
```

Currently, the left join tests are disabled. Running the left join tests would result to missing rows.

Rollback Plan:

Reviewed By: gggrace14

Differential Revision: D75990272

Pulled By: karatse

fbshipit-source-id: 64bd5bcebaeff4138bd1045b04c8ed75cb7efe1c
Summary:
The behaviour of lower function differs between Presto and Spark for 'İ'
(\u0130). Presto returns \u0069 and Spark returns \u0069 + \u0307. This PR
adds a template parameter `forSpark` for the `lowerUnicode` function and
`UpperLowerTemplateFunction` class to customize Spark's behaviour.

Spark's behaviour:
```
scala> val df = spark.sql("SELECT lower(CAST(X'C4B0' AS STRING))")
df: org.apache.spark.sql.DataFrame = [lower(CAST(X'C4B0' AS STRING)): string]

scala> df.foreach(row => println(row.getString(0).map(_.toInt.toHexString).mkString("\\u")))
69\u307
```

Presto's behaviour:

```
presto> SELECT to_hex(CAST(lower(U&'\0130') AS varbinary)) AS hex_result;
 hex_result
------------
 69
```

Pull Request resolved: facebookincubator#13158

Reviewed By: Yuhta

Differential Revision: D75963809

Pulled By: kKPulla

fbshipit-source-id: 99fbbb99b3e4e98f1184727eb31c77cf03fa0caf
…ator#13034)

Summary:
This PR adds the `varchar_type_write_side_check` function. This function trims
trailing space characters (ASCII 32) from input string and checks if it fits
within the specified length limit, throwing exceptions if the string exceeds
the limit after trimming or if the specified limit is not greater than 0.

Examples:

```
varchar_type_write_side_check("abc   ", 3) -- "abc"
varchar_type_write_side_check("abcd", 3) -- VeloxUserError: "Exceeds allowed length limitation: '3'"
varchar_type_write_side_check("   ", 0) -- VeloxUserError: "The length limit must be greater than 0."
```

**Note:** This PR is a splitted work of PR facebookincubator#12773, as outlined in PR facebookincubator#12772.

Pull Request resolved: facebookincubator#13034

Reviewed By: Yuhta

Differential Revision: D75963922

Pulled By: kKPulla

fbshipit-source-id: 34bba67e95645ab45f2493fd30f6635a6b3cf9a0
facebookincubator#13658)

Summary:
Presto supports
```
QUOTED_IDENTIFIER
    : '"' ( ~'"' | '""' )* '"'
    ;
```
The semantics are
```
Any character other than double quote is allowed between double quotes
And double double quote can be used to "escape" double quote
```

Fixes: facebookincubator#13648

Pull Request resolved: facebookincubator#13658

Reviewed By: mbasmanova, kagamiori

Differential Revision: D76071088

Pulled By: kKPulla

fbshipit-source-id: c1af839d997092fbaa075911d426359e236ecaf2
Summary:
Adds Spark trunc function.
Arguments:

date - date value or valid date string
fmt - the format representing the unit to be truncated to
"YEAR", "YYYY", "YY" - truncate to the first date of the year that the date falls in
"QUARTER" - truncate to the first date of the quarter that the date falls in
"MONTH", "MM", "MON" - truncate to the first date of the month that the date falls in
"WEEK" - truncate to the Monday of the week that the date falls in

https://spark.apache.org/docs/latest/api/sql/#trunc

Pull Request resolved: facebookincubator#13394

Reviewed By: kKPulla

Differential Revision: D75806942

Pulled By: kgpai

fbshipit-source-id: 80e866810061316f598ff8690d3533cf03b39701
facebookincubator#13688)

Summary:
Pull Request resolved: facebookincubator#13688

Support lazy decode in sequence storage index source and provide connector config to enable it in ss storage client

Reviewed By: tanjialiang, wenqiwooo

Differential Revision: D76198185

fbshipit-source-id: 885ecdd16b916b87a9fd664c0c08bcf93c336c12
Summary: Pull Request resolved: facebookincubator#13672

Reviewed By: Yuhta

Differential Revision: D76232786

Pulled By: kKPulla

fbshipit-source-id: ae64acbebce5177b71b76216577cec24e7b5d9ce
…epeatedColumnReader (facebookincubator#13348)

Summary:
Pull Request resolved: facebookincubator#13348

It is currently possible to specify a filter on map values and array elements in `ScanSpec`, but it will not have any effect at the moment.  This diff updates the `SelectiveRepeatedColumnReader` to add support for filtering based on map values and array elements.

Reviewed By: Yuhta

Differential Revision: D74757332

fbshipit-source-id: 22fa02b99b64d07d13c84bf71cb2d2cb809ad8fa
facebookincubator#13665)

Summary:
We have had issues with updates to dependencies breaking builds once the new image is generated, because the CI would still use the old image.  Most recently in facebookincubator#13663

This PR adds code to install the dependencies again if the relevant files have been touched in the PR.

Pull Request resolved: facebookincubator#13665

Reviewed By: Yuhta

Differential Revision: D76262919

Pulled By: kKPulla

fbshipit-source-id: 20bad99a4de5b3e5bf71562e53d1d1f71a1a4d2e
Summary:
Pull Request resolved: facebookincubator#13564

Adds ST_geometrytype udf to velox

Reviewed By: jagill, bikramSingh91

Differential Revision: D75643153

fbshipit-source-id: e43e10ecb357a68135c48bca155a621fb2cb3c33
Summary:
Pull Request resolved: facebookincubator#13565

Adds `st_distance` udf to velox

Reviewed By: jagill, bikramSingh91

Differential Revision: D75645148

fbshipit-source-id: 350c55b5e6950e5bd32841d9dd992e10207b43fa
…ookincubator#13570)

Summary:
Pull Request resolved: facebookincubator#13570

Previously we were throwing a `Status::UserError`, which did not propagate the error message to the user. This fixes that.

Reviewed By: bikramSingh91, amitkdutta

Differential Revision: D75702694

fbshipit-source-id: 68442cb32a03ee6eb3d46b9e3b2b2db6132f7204
…kincubator#13696)

Summary:
Pull Request resolved: facebookincubator#13696

The current implementation of json_extract relies on simdjson's
document parsing API to detect invalid JSONs early in the process.
However, this API only catches some invalid conditions as per the
documentation.
To handle errors during processing, non-exception throwing APIs are
used, which return an error code instead of throwing an exception.
Unfortunately, not all calls utilize these APIs, resulting in
exceptions being thrown when trying to access invalid portions of the
JSON.
This combination of factors allowed a specific type of invalid JSON
to bypass the initial check and encounter the exception-throwing
accessor API during parsing. As a result, the UDF threw an error
instead of returning a null result as expected. This change addresses
and resolves this issue by updating those exception throwing
callsites to use the nonexcept versions instead.

Reviewed By: kgpai

Differential Revision: D76276107

fbshipit-source-id: 789fe2bf69e443186eff7bf07ef6d06774aef1de
Summary:
This PR adds a [`.pre-commit-config.yaml`](https://pre-commit.com) with a number of hooks that mirror our previous `check.py` and some new ones to further improve code quality.

Hooks to replace check.py:
- cmake-format
- clang-format
- license header check
- ruff (python formatting and linting)
- clang-tidy (has to be called manually)

New Hooks:
- general quality checks
  - trailing white space removal
  - end of file fixer
  - large file check (is in the default config, we can probably drop it?)
  - check that files with shebangs are actually executable
  - and that executable scripts have shebangs
- Several checks to improve our GitHub Actions quality
  - zizmor, a security focused linter
  - action-validator, make sure actions are valid! (we have had issues with that before) also checks if paths globs are correct
  - yamlfmt/yamllint make sure we use a unified style in our yaml/workflows

I have added all fixes in this PR ~~but we could also merge them as a separate PR so it can be easier ignored when blaming.~~ Actually I think it is better to just handle everything in this PR to avoid causing a bunch of merge conflicts.

Compared to `check.py` this is also ***much*** faster. Running all hooks on all files in the repo takes under ten seconds!

Pull Request resolved: facebookincubator#13361

Reviewed By: kKPulla

Differential Revision: D76274374

Pulled By: kgpai

fbshipit-source-id: ba23f75fd1ff6e05d0e2a41d791028ac0220da56
Summary:
Fix binomial_cdf disparity

In java, if value >= numberOfTrials, directly return 1.0
```
public double cumulativeProbability(int x) {
        double ret;
        if (x < 0) {
            ret = (double)0.0F;
        } else if (x >= this.numberOfTrials) {
            ret = (double)1.0F;
        } else {
            ret = (double)1.0F - Beta.regularizedBeta(this.probabilityOfSuccess, (double)x + (double)1.0F, (double)(this.numberOfTrials - x));
        }

        return ret;
    }
```

In cpp, it throws an error in boost lib

```
if(k > N)
{
       *result = policies::raise_domain_error<RealType>(
          function,
          "Number of Successes argument is %1%, but must be <= Number of Trials !", k, pol);
       return false;
}
```

Reviewed By: amitkdutta

Differential Revision: D76301965

fbshipit-source-id: a1652021e7199dd0d1122d0bac2b144a7daf1bf8
…acebookincubator#13625)

Summary:
Pull Request resolved: facebookincubator#13625

**High-Level Goal:**
The noisy functions we are implementing aims to introduce a set of noisy aggregation functions in Presto, a SQL query engine, to facilitate the development of privacy-preserving systems and tools. The ultimate purpose is to simplify and streamline the implementation of differential privacy measures within Meta's data analysis and processing workflows.

**Purpose of the Diffs:**

This diff introduces the `noisy_count_gaussian` aggregation function, which counts the non-null values in a column and adds a normally distributed random double value with 0 mean and standard deviation of `noise_scale` to the true count. The noisy count is post-processed to be non-negative and rounded to bigint. This function is intended to replace traditional `count` operations with a version that incorporates Gaussian noise. This noise helps to obscure the exact count of records that meet a certain condition, making it more difficult for unauthorized parties to reverse-engineer sensitive information from aggregate data.

**Overall Impact:**
By adding these noisy aggregation functions, the diffs aim to make it easier for developers within Meta to create data analysis tools and systems that are compliant with differential privacy standards. This is a critical step towards protecting user privacy and ensuring that sensitive data is handled responsibly across various Meta platforms and services.

**Changes**
------------

* Added documentation for `noisy_count_gaussian` function in `fbcode/velox/docs/functions/presto/aggregate.rst`.
* Implemented `NoisyCountGaussianAggregate`
* Registered `NoisyCountGaussianAggregate` function in `fbcode/velox/functions/prestosql/aggregates/RegisterAggregateFunctions.cpp`.
* Added `kNoisyCountGaussian` constant in `fbcode/velox/functions/prestosql/aggregates/AggregateNames.h`.
* Added `noisy_count_gaussian` function to the list of non-deterministic functions in `fbcode/velox/expression/fuzzer/facebook/FacebookPrestoExpressionFuzzerTest.cpp`.

Reviewed By: kgpai

Differential Revision: D75912782

fbshipit-source-id: 672480e5a8203cc1635b7d954b8f59eb91859702
@unidevel unidevel closed this Jun 13, 2025
unidevel pushed a commit that referenced this pull request Jul 9, 2025
…ger-overflow (facebookincubator#13831)

Summary:
Pull Request resolved: facebookincubator#13831

This avoids the following errors:

```
fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/std_abs.h:56:41: runtime error: negation of -9223372036854775808 cannot be represented in type 'long'; cast to an unsigned type to negate this value to itself
    #0 0x000000346ce5 in std::abs(long) fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/std_abs.h:56
    #1 0x000000345879 in std::shared_ptr<facebook::velox::BiasVector<facebook::velox::test::EvalTypeHelper<long>::Type>> facebook::velox::test::VectorMaker::biasVector<long>(std::vector<std::optional<long>, std::allocator<std::optional<long>>> const&) fbcode/velox/vector/tests/utils/VectorMaker-inl.h:58
    #2 0x000000344d34 in facebook::velox::test::BiasVectorErrorTest::errorTest(std::vector<std::optional<long>, std::allocator<std::optional<long>>>) fbcode/velox/vector/tests/BiasVectorTest.cpp:39
    #3 0x00000033ec99 in facebook::velox::test::BiasVectorErrorTest_checkRangeTooLargeError_Test::TestBody() fbcode/velox/vector/tests/BiasVectorTest.cpp:44
    #4 0x7fe0a2342c46 in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) fbsource/src/gtest.cc:2727
    #5 0x7fe0a234275d in testing::Test::Run() fbsource/src/gtest.cc:2744
    #6 0x7fe0a2345fb3 in testing::TestInfo::Run() fbsource/src/gtest.cc:2890
    #7 0x7fe0a234c8eb in testing::TestSuite::Run() fbsource/src/gtest.cc:3068
    #8 0x7fe0a237b52b in testing::internal::UnitTestImpl::RunAllTests() fbsource/src/gtest.cc:6059
    #9 0x7fe0a237a0a2 in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) fbsource/src/gtest.cc:2727
    #10 0x7fe0a23797f5 in testing::UnitTest::Run() fbsource/src/gtest.cc:5599
    #11 0x7fe0a2239800 in RUN_ALL_TESTS() fbsource/gtest/gtest.h:2334
    #12 0x7fe0a223952c in main fbcode/common/gtest/LightMain.cpp:20
    #13 0x7fe09ec2c656 in __libc_start_call_main /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #14 0x7fe09ec2c717 in __libc_start_main@GLIBC_2.2.5 /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/csu/../csu/libc-start.c:409:3
    #15 0x00000033d8b0 in _start /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/csu/../sysdeps/x86_64/start.S:116

UndefinedBehaviorSanitizer: signed-integer-overflow fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/std_abs.h:56:41
```
Avoid overflow by using the expression (static_cast<uint64_t>(1) + ~static_cast<uint64_t>(min)) to calculate the absolute value of min without using std::abs

Reviewed By: dmm-fb, peterenescu

Differential Revision: D76901449

fbshipit-source-id: 7eb3bd0f83e42f44cdf34ea1759f3aa9e1042dae
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.