
Error Training LGBM Regressor on Fabric #2438

@muhammad-omer-dev

Description

Hi,
I am currently facing an issue training an LGBM Regressor on Microsoft Fabric. I have tried multiple solutions but haven't been able to resolve it. My data is around 50,000,000 rows and 80 columns, stored in Parquet format.
With the same data, the job crashes two or three times but then succeeds on a later attempt. I cannot understand why it crashes and then, after multiple retries without any changes, runs successfully.

My current Spark cluster is:
Driver Memory: 112 GB
Executor Memory: 112 GB
Executor Cores: 16
Driver Cores: 16

Synapse ML Version: 1.0.15

LGBM Params

```python
{
    "objective": "tweedie",
    "tweedieVariancePower": 1.5,
    "metric": "mape",
    "boostingType": "gbdt",
    "learningRate": 0.05,
    "featureFraction": 0.8,
    "baggingFraction": 0.8,
    "baggingFreq": 5,
    "lambdaL2": 0.1,
    "verbosity": 2,
    "earlyStoppingRound": 100,
    # "parallelism": "voting_parallel",
    "dataTransferMode": "bulk",
    "numLeaves": 48,
    "maxDepth": -1,
    "numIterations": 800,
    "maxBin": 127,
    "passThroughArgs": "min_child_samples=300",
}
```
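For reference, `passThroughArgs` is forwarded to native LightGBM as space-separated `key=value` pairs, separate from the regular estimator params. A minimal plain-Python sketch of how that string can be folded into the parameter dict for inspection (the helper name `merge_passthrough` is hypothetical, not part of SynapseML):

```python
# Parameter dict as posted in this issue.
params = {
    "objective": "tweedie",
    "tweedieVariancePower": 1.5,
    "metric": "mape",
    "boostingType": "gbdt",
    "learningRate": 0.05,
    "featureFraction": 0.8,
    "baggingFraction": 0.8,
    "baggingFreq": 5,
    "lambdaL2": 0.1,
    "verbosity": 2,
    "earlyStoppingRound": 100,
    "dataTransferMode": "bulk",
    "numLeaves": 48,
    "maxDepth": -1,
    "numIterations": 800,
    "maxBin": 127,
    "passThroughArgs": "min_child_samples=300",
}

def merge_passthrough(params: dict) -> dict:
    """Fold the space-separated passThroughArgs string into the dict,
    so all effective parameters can be inspected or logged in one place."""
    merged = {k: v for k, v in params.items() if k != "passThroughArgs"}
    for pair in params.get("passThroughArgs", "").split():
        key, _, value = pair.partition("=")
        merged[key] = value
    return merged

merged = merge_passthrough(params)
# merged now contains "min_child_samples": "300" alongside the estimator params
```

In SynapseML the estimator itself is typically constructed by passing these entries as keyword arguments to `LightGBMRegressor(...)`.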

Solutions Tried

  • Tried streaming mode, but it fails with a duplicate-columns error even though there are no duplicate columns
  • Increased driver and executor memory to 400 GB
  • Changed num_threads on the model
  • Reduced the number of columns to 50
  • Reduced maxBin from 127 to 50

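One possible cause of the streaming-mode duplicate-columns error: by default Spark resolves column names case-insensitively (`spark.sql.caseSensitive=false`), so names differing only in case can be reported as duplicates. A hedged plain-Python sketch (the helper `find_duplicate_columns` is hypothetical) for checking a DataFrame's `df.columns` list before fitting:

```python
from collections import Counter

def find_duplicate_columns(columns, case_sensitive=False):
    """Return column names that appear more than once.

    With case_sensitive=False this mimics Spark's default name
    resolution (spark.sql.caseSensitive=false), where 'ID' and 'id'
    would collide.
    """
    keys = columns if case_sensitive else [c.lower() for c in columns]
    counts = Counter(keys)
    # Collect every original name whose (folded) key occurs more than once.
    return sorted({c for c, k in zip(columns, keys) if counts[k] > 1})

# Example: 'ID' and 'id' differ only in case.
cols = ["id", "qty", "price", "ID"]
# find_duplicate_columns(cols) → ["ID", "id"]
```

If this reports collisions, renaming the offending columns before calling `fit` may let streaming mode run.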
Error After Which It Crashes

```
ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Container from a bad node: container_1760761562536_0001_01_000004 on host: vm-e6133356. Exit status: 134. Diagnostics: mic=ALL-UNNAMED' '--add-opens=java.base/jdk.internal.ref=ALL-UNNAMED' '--add-opens=java.base/sun.nio.ch=ALL-UNNAMED' '--add-opens=java.base/sun.nio.cs=ALL-UNNAMED' '--add-opens=java.base/sun.security.action=ALL-UNNAMED' '--add-opens=java.base/sun.util.calendar=ALL-UNNAMED' '--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED' '-Djdk.reflect.useDirectMethodHandle=false' '-Dlog4j2.configurationFile=file:/usr/hdp/current/spark3-client/conf/executor-log4j2.properties' '-Djavax.xml.parsers.SAXParserFactory=com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl' '-Dio.netty.tryReflectionSetAccessible=true' '-XX:+UseG1GC' '-XX:OnError=head -n $(($(grep -n '''Java frames''' hs_err_pid%p.log | sed -n '''2p''' | cut -d: -f1) - 2)) hs_err_pid%p.log 1>&2' -Djava.io.tmpdir=/mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/appcache/application_1760761562536_0001/container_1760761562536_0001_01_000004/tmp '-Dspark.rpc.askTimeout=600s' '-Dspark.onesecurity.systemcontext.port=9099' '-Dspark.synapse.history.rpc.port=18082' '-Dspark.ui.port=0' '-Dspark.driver.port=44503' '-Dspark.history.ui.port=18080' -Dspark.yarn.app.container.log.dir=/var/log/yarn-nm/userlogs/application_1760761562536_0001/container_1760761562536_0001_01_000004 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.YarnCoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@vm-9ab59426:44503 --executor-id 2 --hostname vm-e6133356 --cores 16 --app-id application_1760761562536_0001 --resourceProfileId 0 > /var/log/yarn-nm/userlogs/application_1760761562536_0001/container_1760761562536_0001_01_000004/stdout 2> /var/log/yarn-nm/userlogs/application_1760761562536_0001/container_1760761562536_0001_01_000004/stderr
Error files: stderr, stderr-active.
Last 8192 bytes of stderr :
rtitionTask [Executor task launch worker for task 8.0 in stage 75.0 (TID 3647)]: Done with cleanup for partition 8, task 3647
2025-10-18 04:37:31,880 INFO Executor [Executor task launch worker for task 8.0 in stage 75.0 (TID 3647)]: Finished task 8.0 in stage 75.0 (TID 3647). 2383 bytes result sent to driver
2025-10-18 04:37:31,908 INFO BulkPartitionTask [Executor task launch worker for task 9.0 in stage 75.0 (TID 3649)]: done with data preparation on partition 9, task 3649
2025-10-18 04:37:31,908 INFO BulkPartitionTask [Executor task launch worker for task 9.0 in stage 75.0 (TID 3649)]: Helper task 3649, partition 9 finished processing rows
2025-10-18 04:37:31,908 INFO BulkPartitionTask [Executor task launch worker for task 9.0 in stage 75.0 (TID 3649)]: Beginning cleanup for partition 9, task 3649
2025-10-18 04:37:31,908 INFO BulkPartitionTask [Executor task launch worker for task 9.0 in stage 75.0 (TID 3649)]: Done with cleanup for partition 9, task 3649
2025-10-18 04:37:31,910 INFO Executor [Executor task launch worker for task 9.0 in stage 75.0 (TID 3649)]: Finished task 9.0 in stage 75.0 (TID 3649). 2383 bytes result sent to driver
2025-10-18 04:37:42,740 INFO BulkPartitionTask [Executor task launch worker for task 24.0 in stage 75.0 (TID 3663)]: NetworkInit succeeded. LightGBM task listening on: 12435
2025-10-18 04:37:42,740 INFO BulkPartitionTask [Executor task launch worker for task 24.0 in stage 75.0 (TID 3663)]: Waiting for all data prep to be done, task 3663, partition 24
2025-10-18 04:37:42,740 INFO BulkPartitionTask [Executor task launch worker for task 24.0 in stage 75.0 (TID 3663)]: Getting final training Dataset for partition 24.
2025-10-18 04:37:42,742 INFO DenseSyncAggregatedColumns [Executor task launch worker for task 24.0 in stage 75.0 (TID 3663)]: LightGBM task 3663 part 24 generating dense dataset with 11073926 rows and 89 columns
2025-10-18 04:37:47,604 INFO DenseSyncAggregatedColumns [Executor task launch worker for task 24.0 in stage 75.0 (TID 3663)]: LightGBM task 3663 part 24 adding metadata
2025-10-18 04:37:47,619 INFO BulkPartitionTask [Executor task launch worker for task 24.0 in stage 75.0 (TID 3663)]: Creating LightGBM Booster for partition 24, task 3663
2025-10-18 04:37:47,624 INFO BulkPartitionTask [Executor task launch worker for task 24.0 in stage 75.0 (TID 3663)]: Beginning training on LightGBM Booster for task 3663, partition 24
2025-10-18 04:37:47,624 INFO BulkPartitionTask [Executor task launch worker for task 24.0 in stage 75.0 (TID 3663)]: LightGBM task starting iteration 0
2025-10-18 04:37:49,631 INFO ExecutorMetricReporter [metrics-Executor-Metric-Reporter-1-thread-1]: Report called
2025-10-18 04:37:49,632 INFO ExecutorMetricReporter [metrics-Executor-Metric-Reporter-1-thread-1]: Remote shuffle is not enabled, skipping task metrics reporting
2025-10-18 04:38:02,583 INFO BulkPartitionTask [Executor task launch worker for task 24.0 in stage 75.0 (TID 3663)]: LightGBM task starting iteration 1
2025-10-18 04:38:03,115 INFO BulkPartitionTask [Executor task launch worker for task 24.0 in stage 75.0 (TID 3663)]: LightGBM task starting iteration 2
2025-10-18 04:38:03,603 INFO BulkPartitionTask [Executor task launch worker for task 24.0 in stage 75.0 (TID 3663)]: LightGBM task starting iteration 3
2025-10-18 04:38:04,132 INFO BulkPartitionTask [Executor task launch worker for task 24.0 in stage 75.0 (TID 3663)]: LightGBM task starting iteration 4
2025-10-18 04:38:04,734 INFO BulkPartitionTask [Executor task launch worker for task 24.0 in stage 75.0 (TID 3663)]: LightGBM task starting iteration 5
2025-10-18 04:38:06,287 INFO BulkPartitionTask [Executor task launch worker for task 24.0 in stage 75.0 (TID 3663)]: LightGBM task starting iteration 6
2025-10-18 04:38:07,557 INFO BulkPartitionTask [Executor task launch worker for task 24.0 in stage 75.0 (TID 3663)]: LightGBM task starting iteration 7
2025-10-18 04:38:08,258 INFO BulkPartitionTask [Executor task launch worker for task 24.0 in stage 75.0 (TID 3663)]: LightGBM task starting iteration 8
2025-10-18 04:38:09,742 INFO BulkPartitionTask [Executor task launch worker for task 24.0 in stage 75.0 (TID 3663)]: LightGBM task starting iteration 9
2025-10-18 04:38:10,299 INFO BulkPartitionTask [Executor task launch worker for task 24.0 in stage 75.0 (TID 3663)]: LightGBM task starting iteration 10

A fatal error has been detected by the Java Runtime Environment:

SIGSEGV (0xb) at pc=0x00007d0d678034db, pid=6214, tid=6341

JRE version: OpenJDK Runtime Environment Microsoft-11371465 (11.0.27+6) (build 11.0.27+6-LTS)
Java VM: OpenJDK 64-Bit Server VM Microsoft-11371465 (11.0.27+6-LTS, mixed mode, tiered, g1 gc, linux-amd64)
Problematic frame:
C [lib_lightgbm.so+0x3a54db] LightGBM::SerialTreeLearner::SplitInner(LightGBM::Tree*, int, int*, int*, bool)+0xf7b

Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h" (or dumping to /mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/appcache/application_1760761562536_0001/container_1760761562536_0001_01_000004/core.6214)

If you would like to submit a bug report, please visit:
https://github.com/microsoft/openjdk/issues
The crash happened outside the Java Virtual Machine in native code.
See problematic frame for where to report the bug.

--------------- S U M M A R Y ------------

Command Line: -Djdk.jar.maxSignatureFileSize=2147483639 -Xmx114688m -Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/jdk.internal.ref=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false -Dlog4j2.configurationFile=file:/usr/hdp/current/spark3-client/conf/executor-log4j2.properties -Djavax.xml.parsers.SAXParserFactory=com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl -Dio.netty.tryReflectionSetAccessible=true -XX:+UseG1GC -XX:OnError=head -n $(($(grep -n 'Java frames' hs_err_pid%p.log | sed -n '2p' | cut -d: -f1) - 2)) hs_err_pid%p.log 1>&2 -Djava.io.tmpdir=/mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/appcache/application_1760761562536_0001/container_1760761562536_0001_01_000004/tmp -Dspark.rpc.askTimeout=600s -Dspark.onesecurity.systemcontext.port=9099 -Dspark.synapse.history.rpc.port=18082 -Dspark.ui.port=0 -Dspark.driver.port=44503 -Dspark.history.ui.port=18080 -Dspark.yarn.app.container.log.dir=/var/log/yarn-nm/userlogs/application_1760761562536_0001/container_1760761562536_0001_01_000004 -XX:OnOutOfMemoryError=kill %p org.apache.spark.executor.YarnCoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@vm-9ab59426:44503 --executor-id 2 --hostname vm-e6133356 --cores 16 --app-id 
application_1760761562536_0001 --resourceProfileId 0

Host: AMD EPYC 7763 64-Core Processor, 16 cores, 125G, CBL-Mariner 2.0.20250729
Time: Sat Oct 18 04:38:11 2025 UTC elapsed time: 327.326341 seconds (0d 0h 5m 27s)

--------------- T H R E A D ---------------

Current thread (0x00007d0db06d3800): JavaThread "Executor task launch worker for task 24.0 in stage 75.0 (TID 3663)" daemon [_thread_in_native, id=6341, stack(0x00007d0db6b00000,0x00007d0db6c00000)]

Stack: [0x00007d0db6b00000,0x00007d0db6c00000], sp=0x00007d0db6bfb2c0, free space=1004k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
C [lib_lightgbm.so+0x3a54db] LightGBM::SerialTreeLearner::SplitInner(LightGBM::Tree*, int, int*, int*, bool)+0xf7b
```
