
Error Training LGBM Regressor on Fabric #2438

@muhammad-omer-dev

Description

Hi,
I am currently facing an issue training an LGBM Regressor on Microsoft Fabric. I have tried multiple solutions but haven't been able to resolve it. My data is around 50,000,000 rows and 80 columns, stored in Parquet format.
With the same data, the job crashes two or three times but then succeeds on a later attempt. I cannot understand why it crashes and then, after multiple retries without any changes, runs successfully.

My current Spark cluster is:
Driver Memory: 112 GB
Executor Memory: 112 GB
Executor Cores: 16
Driver Cores: 16

Synapse ML Version: 1.0.15

LGBM Params

```python
{
    "objective": "tweedie",
    "tweedieVariancePower": 1.5,
    "metric": "mape",
    "boostingType": "gbdt",
    "learningRate": 0.05,
    "featureFraction": 0.8,
    "baggingFraction": 0.8,
    "baggingFreq": 5,
    "lambdaL2": 0.1,
    "verbosity": 2,
    "earlyStoppingRound": 100,
    # "parallelism": "voting_parallel",
    "dataTransferMode": "bulk",
    "numLeaves": 48,
    "maxDepth": -1,
    "numIterations": 800,
    "maxBin": 127,
    "passThroughArgs": "min_child_samples=300",
}
```
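For reference, `passThroughArgs` is forwarded to native LightGBM as space-separated `key=value` pairs, separate from the regular estimator params. A minimal plain-Python sketch of how that string can be folded into the parameter dict for inspection (the helper name `merge_passthrough` is hypothetical, not part of SynapseML):

```python
# Parameter dict as posted in this issue.
params = {
    "objective": "tweedie",
    "tweedieVariancePower": 1.5,
    "metric": "mape",
    "boostingType": "gbdt",
    "learningRate": 0.05,
    "featureFraction": 0.8,
    "baggingFraction": 0.8,
    "baggingFreq": 5,
    "lambdaL2": 0.1,
    "verbosity": 2,
    "earlyStoppingRound": 100,
    "dataTransferMode": "bulk",
    "numLeaves": 48,
    "maxDepth": -1,
    "numIterations": 800,
    "maxBin": 127,
    "passThroughArgs": "min_child_samples=300",
}

def merge_passthrough(params: dict) -> dict:
    """Fold the space-separated passThroughArgs string into the dict,
    so all effective parameters can be inspected or logged in one place."""
    merged = {k: v for k, v in params.items() if k != "passThroughArgs"}
    for pair in params.get("passThroughArgs", "").split():
        key, _, value = pair.partition("=")
        merged[key] = value
    return merged

merged = merge_passthrough(params)
# merged now contains "min_child_samples": "300" alongside the estimator params
```

In SynapseML the estimator itself is typically constructed by passing these entries as keyword arguments to `LightGBMRegressor(...)`.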

Solutions Tried

  • Tried streaming mode, but it fails with a duplicate-columns error even though there are no duplicate columns
  • Increased driver and executor memory to 400 GB
  • Changed num_threads on the model
  • Reduced the number of columns to 50
  • Reduced maxBin from 127 to 50

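One possible cause of the streaming-mode duplicate-columns error: by default Spark resolves column names case-insensitively (`spark.sql.caseSensitive=false`), so names differing only in case can be reported as duplicates. A hedged plain-Python sketch (the helper `find_duplicate_columns` is hypothetical) for checking a DataFrame's `df.columns` list before fitting:

```python
from collections import Counter

def find_duplicate_columns(columns, case_sensitive=False):
    """Return column names that appear more than once.

    With case_sensitive=False this mimics Spark's default name
    resolution (spark.sql.caseSensitive=false), where 'ID' and 'id'
    would collide.
    """
    keys = columns if case_sensitive else [c.lower() for c in columns]
    counts = Counter(keys)
    # Collect every original name whose (folded) key occurs more than once.
    return sorted({c for c, k in zip(columns, keys) if counts[k] > 1})

# Example: 'ID' and 'id' differ only in case.
cols = ["id", "qty", "price", "ID"]
# find_duplicate_columns(cols) → ["ID", "id"]
```

If this reports collisions, renaming the offending columns before calling `fit` may let streaming mode run.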
Error After Which It Crashes

```
ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Container from a bad node: container_1760761562536_0001_01_000004 on host: vm-e6133356. Exit status: 134. Diagnostics: mic=ALL-UNNAMED' '--add-opens=java.base/jdk.internal.ref=ALL-UNNAMED' '--add-opens=java.base/sun.nio.ch=ALL-UNNAMED' '--add-opens=java.base/sun.nio.cs=ALL-UNNAMED' '--add-opens=java.base/sun.security.action=ALL-UNNAMED' '--add-opens=java.base/sun.util.calendar=ALL-UNNAMED' '--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED' '-Djdk.reflect.useDirectMethodHandle=false' '-Dlog4j2.configurationFile=file:/usr/hdp/current/spark3-client/conf/executor-log4j2.properties' '-Djavax.xml.parsers.SAXParserFactory=com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl' '-Dio.netty.tryReflectionSetAccessible=true' '-XX:+UseG1GC' '-XX:OnError=head -n $(($(grep -n '''Java frames''' hs_err_pid%p.log | sed -n '''2p''' | cut -d: -f1) - 2)) hs_err_pid%p.log 1>&2' -Djava.io.tmpdir=/mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/appcache/application_1760761562536_0001/container_1760761562536_0001_01_000004/tmp '-Dspark.rpc.askTimeout=600s' '-Dspark.onesecurity.systemcontext.port=9099' '-Dspark.synapse.history.rpc.port=18082' '-Dspark.ui.port=0' '-Dspark.driver.port=44503' '-Dspark.history.ui.port=18080' -Dspark.yarn.app.container.log.dir=/var/log/yarn-nm/userlogs/application_1760761562536_0001/container_1760761562536_0001_01_000004 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.YarnCoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@vm-9ab59426:44503 --executor-id 2 --hostname vm-e6133356 --cores 16 --app-id application_1760761562536_0001 --resourceProfileId 0 > /var/log/yarn-nm/userlogs/application_1760761562536_0001/container_1760761562536_0001_01_000004/stdout 2> /var/log/yarn-nm/userlogs/application_1760761562536_0001/container_1760761562536_0001_01_000004/stderr
Error files: stderr, stderr-active.
Last 8192 bytes of stderr :
rtitionTask [Executor task launch worker for task 8.0 in stage 75.0 (TID 3647)]: Done with cleanup for partition 8, task 3647
2025-10-18 04:37:31,880 INFO Executor [Executor task launch worker for task 8.0 in stage 75.0 (TID 3647)]: Finished task 8.0 in stage 75.0 (TID 3647). 2383 bytes result sent to driver
2025-10-18 04:37:31,908 INFO BulkPartitionTask [Executor task launch worker for task 9.0 in stage 75.0 (TID 3649)]: done with data preparation on partition 9, task 3649
2025-10-18 04:37:31,908 INFO BulkPartitionTask [Executor task launch worker for task 9.0 in stage 75.0 (TID 3649)]: Helper task 3649, partition 9 finished processing rows
2025-10-18 04:37:31,908 INFO BulkPartitionTask [Executor task launch worker for task 9.0 in stage 75.0 (TID 3649)]: Beginning cleanup for partition 9, task 3649
2025-10-18 04:37:31,908 INFO BulkPartitionTask [Executor task launch worker for task 9.0 in stage 75.0 (TID 3649)]: Done with cleanup for partition 9, task 3649
2025-10-18 04:37:31,910 INFO Executor [Executor task launch worker for task 9.0 in stage 75.0 (TID 3649)]: Finished task 9.0 in stage 75.0 (TID 3649). 2383 bytes result sent to driver
2025-10-18 04:37:42,740 INFO BulkPartitionTask [Executor task launch worker for task 24.0 in stage 75.0 (TID 3663)]: NetworkInit succeeded. LightGBM task listening on: 12435
2025-10-18 04:37:42,740 INFO BulkPartitionTask [Executor task launch worker for task 24.0 in stage 75.0 (TID 3663)]: Waiting for all data prep to be done, task 3663, partition 24
2025-10-18 04:37:42,740 INFO BulkPartitionTask [Executor task launch worker for task 24.0 in stage 75.0 (TID 3663)]: Getting final training Dataset for partition 24.
2025-10-18 04:37:42,742 INFO DenseSyncAggregatedColumns [Executor task launch worker for task 24.0 in stage 75.0 (TID 3663)]: LightGBM task 3663 part 24 generating dense dataset with 11073926 rows and 89 columns
2025-10-18 04:37:47,604 INFO DenseSyncAggregatedColumns [Executor task launch worker for task 24.0 in stage 75.0 (TID 3663)]: LightGBM task 3663 part 24 adding metadata
2025-10-18 04:37:47,619 INFO BulkPartitionTask [Executor task launch worker for task 24.0 in stage 75.0 (TID 3663)]: Creating LightGBM Booster for partition 24, task 3663
2025-10-18 04:37:47,624 INFO BulkPartitionTask [Executor task launch worker for task 24.0 in stage 75.0 (TID 3663)]: Beginning training on LightGBM Booster for task 3663, partition 24
2025-10-18 04:37:47,624 INFO BulkPartitionTask [Executor task launch worker for task 24.0 in stage 75.0 (TID 3663)]: LightGBM task starting iteration 0
2025-10-18 04:37:49,631 INFO ExecutorMetricReporter [metrics-Executor-Metric-Reporter-1-thread-1]: Report called
2025-10-18 04:37:49,632 INFO ExecutorMetricReporter [metrics-Executor-Metric-Reporter-1-thread-1]: Remote shuffle is not enabled, skipping task metrics reporting
2025-10-18 04:38:02,583 INFO BulkPartitionTask [Executor task launch worker for task 24.0 in stage 75.0 (TID 3663)]: LightGBM task starting iteration 1
2025-10-18 04:38:03,115 INFO BulkPartitionTask [Executor task launch worker for task 24.0 in stage 75.0 (TID 3663)]: LightGBM task starting iteration 2
2025-10-18 04:38:03,603 INFO BulkPartitionTask [Executor task launch worker for task 24.0 in stage 75.0 (TID 3663)]: LightGBM task starting iteration 3
2025-10-18 04:38:04,132 INFO BulkPartitionTask [Executor task launch worker for task 24.0 in stage 75.0 (TID 3663)]: LightGBM task starting iteration 4
2025-10-18 04:38:04,734 INFO BulkPartitionTask [Executor task launch worker for task 24.0 in stage 75.0 (TID 3663)]: LightGBM task starting iteration 5
2025-10-18 04:38:06,287 INFO BulkPartitionTask [Executor task launch worker for task 24.0 in stage 75.0 (TID 3663)]: LightGBM task starting iteration 6
2025-10-18 04:38:07,557 INFO BulkPartitionTask [Executor task launch worker for task 24.0 in stage 75.0 (TID 3663)]: LightGBM task starting iteration 7
2025-10-18 04:38:08,258 INFO BulkPartitionTask [Executor task launch worker for task 24.0 in stage 75.0 (TID 3663)]: LightGBM task starting iteration 8
2025-10-18 04:38:09,742 INFO BulkPartitionTask [Executor task launch worker for task 24.0 in stage 75.0 (TID 3663)]: LightGBM task starting iteration 9
2025-10-18 04:38:10,299 INFO BulkPartitionTask [Executor task launch worker for task 24.0 in stage 75.0 (TID 3663)]: LightGBM task starting iteration 10

A fatal error has been detected by the Java Runtime Environment:

SIGSEGV (0xb) at pc=0x00007d0d678034db, pid=6214, tid=6341

JRE version: OpenJDK Runtime Environment Microsoft-11371465 (11.0.27+6) (build 11.0.27+6-LTS)
Java VM: OpenJDK 64-Bit Server VM Microsoft-11371465 (11.0.27+6-LTS, mixed mode, tiered, g1 gc, linux-amd64)
Problematic frame:
C [lib_lightgbm.so+0x3a54db] LightGBM::SerialTreeLearner::SplitInner(LightGBM::Tree*, int, int*, int*, bool)+0xf7b

Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h" (or dumping to /mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/appcache/application_1760761562536_0001/container_1760761562536_0001_01_000004/core.6214)

If you would like to submit a bug report, please visit:
https://github.com/microsoft/openjdk/issues
The crash happened outside the Java Virtual Machine in native code.
See problematic frame for where to report the bug.

--------------- S U M M A R Y ------------

Command Line: -Djdk.jar.maxSignatureFileSize=2147483639 -Xmx114688m -Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/jdk.internal.ref=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false -Dlog4j2.configurationFile=file:/usr/hdp/current/spark3-client/conf/executor-log4j2.properties -Djavax.xml.parsers.SAXParserFactory=com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl -Dio.netty.tryReflectionSetAccessible=true -XX:+UseG1GC -XX:OnError=head -n $(($(grep -n 'Java frames' hs_err_pid%p.log | sed -n '2p' | cut -d: -f1) - 2)) hs_err_pid%p.log 1>&2 -Djava.io.tmpdir=/mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/appcache/application_1760761562536_0001/container_1760761562536_0001_01_000004/tmp -Dspark.rpc.askTimeout=600s -Dspark.onesecurity.systemcontext.port=9099 -Dspark.synapse.history.rpc.port=18082 -Dspark.ui.port=0 -Dspark.driver.port=44503 -Dspark.history.ui.port=18080 -Dspark.yarn.app.container.log.dir=/var/log/yarn-nm/userlogs/application_1760761562536_0001/container_1760761562536_0001_01_000004 -XX:OnOutOfMemoryError=kill %p org.apache.spark.executor.YarnCoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@vm-9ab59426:44503 --executor-id 2 --hostname vm-e6133356 --cores 16 --app-id 
application_1760761562536_0001 --resourceProfileId 0

Host: AMD EPYC 7763 64-Core Processor, 16 cores, 125G, CBL-Mariner 2.0.20250729
Time: Sat Oct 18 04:38:11 2025 UTC elapsed time: 327.326341 seconds (0d 0h 5m 27s)

--------------- T H R E A D ---------------

Current thread (0x00007d0db06d3800): JavaThread "Executor task launch worker for task 24.0 in stage 75.0 (TID 3663)" daemon [_thread_in_native, id=6341, stack(0x00007d0db6b00000,0x00007d0db6c00000)]

Stack: [0x00007d0db6b00000,0x00007d0db6c00000], sp=0x00007d0db6bfb2c0, free space=1004k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
C [lib_lightgbm.so+0x3a54db] LightGBM::SerialTreeLearner::SplitInner(LightGBM::Tree*, int, int*, int*, bool)+0xf7b
```
