forked from facebookincubator/velox
-
Notifications
You must be signed in to change notification settings - Fork 0
feat(noisy_count): Implement noisy_count_gaussian(col, noise_scale) (#13625) #10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Summary: Pull Request resolved: facebookincubator#13678 * Cleanup the priority in memory pool. * Make abort candidate favor priority over bucket. Reviewed By: xiaoxmeng Differential Revision: D76018221 fbshipit-source-id: 7162f0c8ad967efdc0bd4e588901d42539c51ceb
…cubator#13639) Summary: Pull Request resolved: facebookincubator#13639 The current Task abstraction has a limitation when it comes to operators that evaluate expressions - their expression statistics are not directly accessible. The only way to access those is through the ExpressionListner API. This change allows access to those expressions stats via the operator stats. Since fetching these stats can be expensive (especially if they are fetched often) a query option `operator_track_expression_stats` is added to allows the user to explicitly enable them. Reviewed By: DanielMunozT, kgpai Differential Revision: D75997924 fbshipit-source-id: 5fe62971d5e73a0f9f46be0f87def7aaa5dc20f0
…ator#13630) Summary: Fixes facebookincubator#13607 Stacked on facebookincubator#13613 which will fix the linux job fail. The macos jobs fail because the dependency install_prefix wasn't set in the second step and xxhash is missing from the setup script. Pull Request resolved: facebookincubator#13630 Reviewed By: kgpai Differential Revision: D76073937 Pulled By: kKPulla fbshipit-source-id: 8eb9b7ac113ac81ea2391053510b9eb6ce26fa28
…alysis and code refactor (facebookincubator#13655) Summary: Pull Request resolved: facebookincubator#13655 Add to support runtime stats tracking to breakdown the time spent inside reconstruct udf and report through operator runtime stats for offline analysis. Some code refactor and minor optimizations Reviewed By: juwentus1234, breeze1990 Differential Revision: D76011630 fbshipit-source-id: 77bb5cbad6663fb0815bad0f58159727cf2eb7a6
…incubator#13213) Summary: Pull Request resolved: facebookincubator#13213 Reviewed By: xiaoxmeng Differential Revision: D74249660 Pulled By: kgpai fbshipit-source-id: 614ac257793c6a061dc30dbe23fa2ab8f486c2cf
…r#13680) Summary: Pull Request resolved: facebookincubator#13680 Enhance velox::exec::tryEvaluateConstantExpression to allow for propagating exceptions occurred during evaluation of a constant expression. Specifically, constant folding 5 / 0 fails with "division by zero". Before this change, the error would be swallowed. After this change the error would be propagated unless optional parameter suppressEvaluationFailures is set to true. Remove velox::exec::evaluateConstantExpression as it is no longer used / needed. Reviewed By: bikramSingh91 Differential Revision: D76152526 fbshipit-source-id: 190abf1c75f8034e58c70fa665abaaca4ba5a189
…acebookincubator#13687) Summary: Pull Request resolved: facebookincubator#13687 For sequence storage, we use unnest operator to per-event type optimization which might produce a number of output batches for each input by default which is not efficient for the followup index lookup. This PR introduces a config to enforce unnest operator to produce one output for each input for index lookup efficiency. Reviewed By: tanjialiang Differential Revision: D76196529 fbshipit-source-id: 769810efa20dc2315b914e80cb30c17d85177f33
Summary: Pull Request resolved: facebookincubator#13679 Implement UnnestReplayer, a tool to be used to replay Unnest. bypass-github-export-checks Reviewed By: xiaoxmeng Differential Revision: D76142525 fbshipit-source-id: e05087b9eab0f8edf908e19625837044b3840a6a
Summary: Pull Request resolved: facebookincubator#13686 We shall not do time check after reclaim. Otherwise we wasted the just-now done reclaim. Reviewed By: xiaoxmeng, skyelves Differential Revision: D76172342 fbshipit-source-id: 734f5971432769bbca0ca74e039276034d8e74b3
Summary: Add unit tests to HashJoinTest to cover the key features of left outer join. Currently these tests fail, as the left join is not implemented and we fallback to inner join, resulting to fewer results than expected. Rollback Plan: Pull Request resolved: facebookincubator#13638 Test Plan: Related Test Files: ``` fbcode/velox/experimental/wave/exec/tests/HashJoinTest.cpp ``` Test Command: To test the related test files and confirm the tests pass, run the command: ``` buck test @//mode/dbg //velox/experimental/wave/exec/tests:velox_wave_exec_test -- HashJoinTest ``` Currently, the left join tests are disabled. Running the left join tests would result to missing rows. Rollback Plan: Reviewed By: gggrace14 Differential Revision: D75990272 Pulled By: karatse fbshipit-source-id: 64bd5bcebaeff4138bd1045b04c8ed75cb7efe1c
Summary:
The behaviour of lower function differs between Presto and Spark for 'İ'
(\u0130). Presto returns \u0069 and Spark returns \u0069 + \u0307. This PR
adds a template parameter `forSpark` for the `lowerUnicode` function and
`UpperLowerTemplateFunction` class to customize Spark's behaviour.
Spark's behaviour:
```
scala> val df = spark.sql("SELECT lower(CAST(X'C4B0' AS STRING))")
df: org.apache.spark.sql.DataFrame = [lower(CAST(X'C4B0' AS STRING)): string]
scala> df.foreach(row => println(row.getString(0).map(_.toInt.toHexString).mkString("\\u")))
69\u307
```
Presto's behaviour:
```
presto> SELECT to_hex(CAST(lower(U&'\0130') AS varbinary)) AS hex_result;
hex_result
------------
69
```
Pull Request resolved: facebookincubator#13158
Reviewed By: Yuhta
Differential Revision: D75963809
Pulled By: kKPulla
fbshipit-source-id: 99fbbb99b3e4e98f1184727eb31c77cf03fa0caf
…ator#13034) Summary: This PR adds the `varchar_type_write_side_check` function. This function trims trailing space characters (ASCII 32) from input string and checks if it fits within the specified length limit, throwing exceptions if the string exceeds the limit after trimming or if the specified limit is not greater than 0. Examples: ``` varchar_type_write_side_check("abc ", 3) -- "abc" varchar_type_write_side_check("abcd", 3) -- VeloxUserError: "Exceeds allowed length limitation: '3'" varchar_type_write_side_check(" ", 0) -- VeloxUserError: "The length limit must be greater than 0." ``` **Note:** This PR is a splitted work of PR facebookincubator#12773, as outlined in PR facebookincubator#12772. Pull Request resolved: facebookincubator#13034 Reviewed By: Yuhta Differential Revision: D75963922 Pulled By: kKPulla fbshipit-source-id: 34bba67e95645ab45f2493fd30f6635a6b3cf9a0
facebookincubator#13658) Summary: Presto supports ``` QUOTED_IDENTIFIER : '"' ( ~'"' | '""' )* '"' ; ``` The semantics are ``` Any character other than double quote is allowed between double quotes And double double quote can be used to "escape" double quote ``` Fixes: facebookincubator#13648 Pull Request resolved: facebookincubator#13658 Reviewed By: mbasmanova, kagamiori Differential Revision: D76071088 Pulled By: kKPulla fbshipit-source-id: c1af839d997092fbaa075911d426359e236ecaf2
Summary: Adds Spark trunc function. Arguments: date - date value or valid date string fmt - the format representing the unit to be truncated to "YEAR", "YYYY", "YY" - truncate to the first date of the year that the date falls in "QUARTER" - truncate to the first date of the quarter that the date falls in "MONTH", "MM", "MON" - truncate to the first date of the month that the date falls in "WEEK" - truncate to the Monday of the week that the date falls in https://spark.apache.org/docs/latest/api/sql/#trunc Pull Request resolved: facebookincubator#13394 Reviewed By: kKPulla Differential Revision: D75806942 Pulled By: kgpai fbshipit-source-id: 80e866810061316f598ff8690d3533cf03b39701
facebookincubator#13688) Summary: Pull Request resolved: facebookincubator#13688 Support lazy decode in sequence storage index source and provide connector config to enable it in ss storage client Reviewed By: tanjialiang, wenqiwooo Differential Revision: D76198185 fbshipit-source-id: 885ecdd16b916b87a9fd664c0c08bcf93c336c12
Summary: Pull Request resolved: facebookincubator#13672 Reviewed By: Yuhta Differential Revision: D76232786 Pulled By: kKPulla fbshipit-source-id: ae64acbebce5177b71b76216577cec24e7b5d9ce
…epeatedColumnReader (facebookincubator#13348) Summary: Pull Request resolved: facebookincubator#13348 It is currently possible to specify a filter on map values and array elements in `ScanSpec`, but it will not have any effect at the moment. This diff updates the `SelectiveRepeatedColumnReader` to add support for filtering based on map values and array elements. Reviewed By: Yuhta Differential Revision: D74757332 fbshipit-source-id: 22fa02b99b64d07d13c84bf71cb2d2cb809ad8fa
facebookincubator#13665) Summary: We have had issues with updates to dependencies breaking builds once the new image is generated, because the CI would still use the old image. Most recently in facebookincubator#13663 This PR adds code to install the dependencies again if the relevant files have been touched in the PR. Pull Request resolved: facebookincubator#13665 Reviewed By: Yuhta Differential Revision: D76262919 Pulled By: kKPulla fbshipit-source-id: 20bad99a4de5b3e5bf71562e53d1d1f71a1a4d2e
Summary: Pull Request resolved: facebookincubator#13564 Adds ST_geometrytype udf to velox Reviewed By: jagill, bikramSingh91 Differential Revision: D75643153 fbshipit-source-id: e43e10ecb357a68135c48bca155a621fb2cb3c33
Summary: Pull Request resolved: facebookincubator#13565 Adds `st_distance` udf to velox Reviewed By: jagill, bikramSingh91 Differential Revision: D75645148 fbshipit-source-id: 350c55b5e6950e5bd32841d9dd992e10207b43fa
…ookincubator#13570) Summary: Pull Request resolved: facebookincubator#13570 Previously we were throwing a `Status::UserError`, which did not propagate the error message to the user. This fixes that. Reviewed By: bikramSingh91, amitkdutta Differential Revision: D75702694 fbshipit-source-id: 68442cb32a03ee6eb3d46b9e3b2b2db6132f7204
…kincubator#13696) Summary: Pull Request resolved: facebookincubator#13696 The current implementation of json_extract relies on simdjson's document parsing API to detect invalid JSONs early in the process. However, this API only catches some invalid conditions as per the documentation. To handle errors during processing, non-exception throwing APIs are used, which return an error code instead of throwing an exception. Unfortunately, not all calls utilize these APIs, resulting in exceptions being thrown when trying to access invalid portions of the JSON. This combination of factors allowed a specific type of invalid JSON to bypass the initial check and encounter the exception-throwing accessor API during parsing. As a result, the UDF threw an error instead of returning a null result as expected. This change addresses and resolves this issue by updating those exception throwing callsites to use the nonexcept versions instead. Reviewed By: kgpai Differential Revision: D76276107 fbshipit-source-id: 789fe2bf69e443186eff7bf07ef6d06774aef1de
Summary: This PR adds a [`.pre-commit-config.yaml`](https://pre-commit.com) with a number of hooks that mirror our previous `check.py` and some new ones to further improve code quality. Hooks to replace check.py: - cmake-format - clang-format - license header check - ruff (python formatting and linting) - clang-tidy (has to be called manually) New Hooks: - general quality checks - trailing white space removal - end of file fixer - large file check (is in the default config, we can probably drop it?) - check that files with shebangs are actually executable - and that executable scripts have shebangs - Several checks to improve our GitHub Actions quality - zizmor, a security focused linter - action-validator, make sure actions are valid! (we have had issues with that before) also checks if paths globs are correct - yamlfmt/yamllint make sure we use a unified style in our yaml/workflows I have added all fixes in this PR ~~but we could also merge them as a separate PR so it can be easier ignored when blaming.~~ Actually I think it is better to just handle everything in this PR to avoid causing a bunch of merge conflicts. Compared to `check.py` this is also ***much*** faster. Running all hooks on all files in the repo takes under ten seconds! Pull Request resolved: facebookincubator#13361 Reviewed By: kKPulla Differential Revision: D76274374 Pulled By: kgpai fbshipit-source-id: ba23f75fd1ff6e05d0e2a41d791028ac0220da56
Summary:
Fix binomial_cdf disparity
In java, if value >= numberOfTrials, directly return 1.0
```
public double cumulativeProbability(int x) {
double ret;
if (x < 0) {
ret = (double)0.0F;
} else if (x >= this.numberOfTrials) {
ret = (double)1.0F;
} else {
ret = (double)1.0F - Beta.regularizedBeta(this.probabilityOfSuccess, (double)x + (double)1.0F, (double)(this.numberOfTrials - x));
}
return ret;
}
```
In cpp, it throws an error in boost lib
```
if(k > N)
{
*result = policies::raise_domain_error<RealType>(
function,
"Number of Successes argument is %1%, but must be <= Number of Trials !", k, pol);
return false;
}
```
Reviewed By: amitkdutta
Differential Revision: D76301965
fbshipit-source-id: a1652021e7199dd0d1122d0bac2b144a7daf1bf8
…acebookincubator#13625) Summary: Pull Request resolved: facebookincubator#13625 **High-Level Goal:** The noisy functions we are implementing aims to introduce a set of noisy aggregation functions in Presto, a SQL query engine, to facilitate the development of privacy-preserving systems and tools. The ultimate purpose is to simplify and streamline the implementation of differential privacy measures within Meta's data analysis and processing workflows. **Purpose of the Diffs:** This diff introduces the `noisy_count_gaussian` aggregation function, which counts the non-null values in a column and adds a normally distributed random double value with 0 mean and standard deviation of `noise_scale` to the true count. The noisy count is post-processed to be non-negative and rounded to bigint. This function is intended to replace traditional `count` operations with a version that incorporates Gaussian noise. This noise helps to obscure the exact count of records that meet a certain condition, making it more difficult for unauthorized parties to reverse-engineer sensitive information from aggregate data. **Overall Impact:** By adding these noisy aggregation functions, the diffs aim to make it easier for developers within Meta to create data analysis tools and systems that are compliant with differential privacy standards. This is a critical step towards protecting user privacy and ensuring that sensitive data is handled responsibly across various Meta platforms and services. **Changes** ------------ * Added documentation for `noisy_count_gaussian` function in `fbcode/velox/docs/functions/presto/aggregate.rst`. * Implemented `NoisyCountGaussianAggregate` * Registered `NoisyCountGaussianAggregate` function in `fbcode/velox/functions/prestosql/aggregates/RegisterAggregateFunctions.cpp`. * Added `kNoisyCountGaussian` constant in `fbcode/velox/functions/prestosql/aggregates/AggregateNames.h`. * Added `noisy_count_gaussian` function to the list of non-deterministic functions in `fbcode/velox/expression/fuzzer/facebook/FacebookPrestoExpressionFuzzerTest.cpp`. Reviewed By: kgpai Differential Revision: D75912782 fbshipit-source-id: 672480e5a8203cc1635b7d954b8f59eb91859702
unidevel
pushed a commit
that referenced
this pull request
Jul 9, 2025
…ger-overflow (facebookincubator#13831) Summary: Pull Request resolved: facebookincubator#13831 This avoids the following errors: ``` fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/std_abs.h:56:41: runtime error: negation of -9223372036854775808 cannot be represented in type 'long'; cast to an unsigned type to negate this value to itself #0 0x000000346ce5 in std::abs(long) fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/std_abs.h:56 #1 0x000000345879 in std::shared_ptr<facebook::velox::BiasVector<facebook::velox::test::EvalTypeHelper<long>::Type>> facebook::velox::test::VectorMaker::biasVector<long>(std::vector<std::optional<long>, std::allocator<std::optional<long>>> const&) fbcode/velox/vector/tests/utils/VectorMaker-inl.h:58 #2 0x000000344d34 in facebook::velox::test::BiasVectorErrorTest::errorTest(std::vector<std::optional<long>, std::allocator<std::optional<long>>>) fbcode/velox/vector/tests/BiasVectorTest.cpp:39 #3 0x00000033ec99 in facebook::velox::test::BiasVectorErrorTest_checkRangeTooLargeError_Test::TestBody() fbcode/velox/vector/tests/BiasVectorTest.cpp:44 #4 0x7fe0a2342c46 in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) fbsource/src/gtest.cc:2727 #5 0x7fe0a234275d in testing::Test::Run() fbsource/src/gtest.cc:2744 #6 0x7fe0a2345fb3 in testing::TestInfo::Run() fbsource/src/gtest.cc:2890 #7 0x7fe0a234c8eb in testing::TestSuite::Run() fbsource/src/gtest.cc:3068 #8 0x7fe0a237b52b in testing::internal::UnitTestImpl::RunAllTests() fbsource/src/gtest.cc:6059 #9 0x7fe0a237a0a2 in bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) fbsource/src/gtest.cc:2727 #10 0x7fe0a23797f5 in testing::UnitTest::Run() fbsource/src/gtest.cc:5599 #11 0x7fe0a2239800 in RUN_ALL_TESTS() fbsource/gtest/gtest.h:2334 #12 0x7fe0a223952c in main fbcode/common/gtest/LightMain.cpp:20 #13 0x7fe09ec2c656 in __libc_start_call_main /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/csu/../sysdeps/nptl/libc_start_call_main.h:58:16 #14 0x7fe09ec2c717 in __libc_start_main@GLIBC_2.2.5 /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/csu/../csu/libc-start.c:409:3 #15 0x00000033d8b0 in _start /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/csu/../sysdeps/x86_64/start.S:116 UndefinedBehaviorSanitizer: signed-integer-overflow fbcode/third-party-buck/platform010/build/libgcc/include/c++/trunk/bits/std_abs.h:56:41 ``` Avoid overflow by using the expression (static_cast<uint64_t>(1) + ~static_cast<uint64_t>(min)) to calculate the absolute value of min without using std::abs Reviewed By: dmm-fb, peterenescu Differential Revision: D76901449 fbshipit-source-id: 7eb3bd0f83e42f44cdf34ea1759f3aa9e1042dae
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Test PR for branch staging-a35150612-rebase with head a351506