The columns of A don't match the number of elements of x. A: 768, x: 1536 #14362

SidWeng · 2024-08-08T03:47:35Z

SidWeng
Aug 8, 2024

I use the following pipeline with BioBERT sentence embeddings.
However, pipeline.fit() throws "The columns of A don't match the number of elements of x. A: 768, x: 1536". Tracing the code, I found that the dimension of randMatrix used by BucketedRandomProjectionLSHModel is determined by DatasetUtils.getNumFeatures().
Does this imply that something is wrong with the data I feed into fit()? The data is a DataFrame with a String column code and a String column text. The longest text has length 229.

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val embeddings = BertSentenceEmbeddings.pretrained("sent_biobert_clinical_base_cased", "en")
  .setInputCols("sentence")
  .setOutputCol("sentence_embeddings")

val document_similarity_ranker = new DocumentSimilarityRankerApproach()
  .setInputCols("sentence_embeddings")
  .setOutputCol("doc_similarity_rankings")
  .setSimilarityMethod("brp")
  .setNumberOfNeighbours(1)
  .setBucketLength(2.0)
  .setNumHashTables(3)
  .setVisibleDistances(true)
  .setIdentityRanking(false)

val document_similarity_ranker_finisher = new DocumentSimilarityRankerFinisher()
  .setInputCols("doc_similarity_rankings")
  .setOutputCols("finished_doc_similarity_rankings_id", "finished_doc_similarity_rankings_neighbors")
  .setExtractNearestNeighbor(true)

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    sentenceDetector,
    embeddings,
    document_similarity_ranker,
    document_similarity_ranker_finisher
  ))

24/08/08 03:19:13.581 [task-result-getter-3] WARN o.a.spark.scheduler.TaskSetManager - Lost task 7.2 in stage 10.0 (TID 370) (10.0.0.12 executor 4): org.apache.spark.SparkException: Failed to execute user defined function (LSHModel$$Lambda$5263/1056329262: (struct<type:tinyint,size:int,indices:array,values:array>) => array<struct<type:tinyint,size:int,indices:array,values:array>>)
at org.apache.spark.sql.errors.QueryExecutionErrors$.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala:177)
at org.apache.spark.sql.errors.QueryExecutionErrors.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.serializefromobject_doConsume_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.convert.Wrappers$IteratorWrapper.hasNext(Wrappers.scala:32)
at org.sparkproject.guava.collect.Ordering.leastOf(Ordering.java:670)
at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
at org.apache.spark.rdd.RDD.$anonfun$takeOrdered$2(RDD.scala:1539)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:855)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:855)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:136)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.IllegalArgumentException: requirement failed: The columns of A don't match the number of elements of x. A: 768, x: 1536
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.ml.linalg.BLAS$.gemv(BLAS.scala:579)
at org.apache.spark.ml.feature.BucketedRandomProjectionLSHModel.hashFunction(BucketedRandomProjectionLSH.scala:87)
at org.apache.spark.ml.feature.LSHModel.$anonfun$transform$1(LSH.scala:99)
... 22 more

Answered by SidWeng

Aug 15, 2024


SidWeng · 2024-08-10T04:26:26Z

SidWeng
Aug 10, 2024
Author

The exception is still raised even when I use sent_roberta_base.

1 reply

SidWeng Aug 14, 2024
Author

It can be reproduced by modifying a test case in DocumentSimilarityRankerTestSpec.scala.

SidWeng · 2024-08-15T06:10:04Z

SidWeng
Aug 15, 2024
Author

I finally found the root cause. Some texts in the dataset contain a period (.) in the middle, like this:

First document, this is my first sentence. This is my second sentence.

This text is treated as 2 sentences.
The output column (sentence_embeddings) of BertSentenceEmbeddings and RoBertaSentenceEmbeddings is then an array of size 2.
DocumentSimilarityRankerApproach.train() flattens sentence_embeddings.embeddings, which makes the dimension 1536 (768 * 2):

val similarityDataset: DataFrame = embeddingsDataset
  .withColumn(s"$LSH_INPUT_COL_NAME", array_to_vector(flatten(col(INPUT_EMBEDDINGS))))
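
To make the arithmetic concrete, here is a minimal Spark-free sketch of what flattening does to the vector length. The object and helper names (FlattenDimensionDemo, fakeEmbeddings, flattenRow, checkGemvShape) are made up for illustration and are not part of Spark or Spark NLP. The shape check only imitates the requirement that appears in the stack trace.

```scala
object FlattenDimensionDemo {
  val EmbeddingDim = 768 // size of one BERT-base sentence embedding

  // Stand-in for one row of sentence_embeddings.embeddings:
  // one Array[Float] per detected sentence (values are dummy data).
  def fakeEmbeddings(numSentences: Int): Seq[Array[Float]] =
    Seq.fill(numSentences)(Array.fill(EmbeddingDim)(0.1f))

  // Imitates the effect of flatten(...): concatenates all sentence vectors.
  def flattenRow(row: Seq[Array[Float]]): Array[Float] =
    row.flatMap(_.toSeq).toArray

  // Imitates the shape requirement seen in the stack trace.
  def checkGemvShape(numColsOfA: Int, x: Array[Float]): Unit =
    require(
      numColsOfA == x.length,
      s"The columns of A don't match the number of elements of x. A: $numColsOfA, x: ${x.length}")

  def main(args: Array[String]): Unit = {
    val oneSentence = flattenRow(fakeEmbeddings(1))
    val twoSentences = flattenRow(fakeEmbeddings(2))
    println(oneSentence.length)  // 768
    println(twoSentences.length) // 1536
    checkGemvShape(EmbeddingDim, oneSentence) // passes
    // checkGemvShape(EmbeddingDim, twoSentences) would throw IllegalArgumentException
  }
}
```

A single row with two detected sentences is therefore enough to produce a 1536-element vector and trigger the failure.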

The solution in my case is to set custom bounds on SentenceDetector:

.setCustomBounds(Array("\n"))
.setUseCustomBoundsOnly(true)
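
For context, the two setters above would sit on the sentenceDetector stage from the original pipeline, roughly as sketched below. This is a configuration fragment, not a full program: it assumes Spark NLP is on the classpath and that no single text value contains a newline, so each document yields exactly one sentence.

```scala
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector

// Replacement for the sentenceDetector stage in the question's pipeline.
val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")
  .setCustomBounds(Array("\n"))   // split only where a newline occurs
  .setUseCustomBoundsOnly(true)   // skip the default punctuation-based rules
```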

1 reply

danilojsl Aug 16, 2024
Maintainer

Hi @SidWeng

Thanks for letting us know about your workaround. We are working on adding a parameter to DocumentSimilarityRankerApproach to choose the aggregation method when a document has multiple sentences. I hope we can include it in the next release.
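
As an illustration of what an aggregation method could look like, the sketch below averages the sentence vectors element-wise so the result keeps the original size of 768. This is only one possible strategy and an assumption on my part; the thread does not say which methods the planned parameter will offer, and the names here (AverageSentenceEmbeddings, average) are hypothetical.

```scala
object AverageSentenceEmbeddings {
  // Element-wise mean of equally sized sentence vectors.
  def average(sentences: Seq[Array[Float]]): Array[Float] = {
    require(sentences.nonEmpty, "need at least one sentence embedding")
    val dim = sentences.head.length
    require(sentences.forall(_.length == dim), "all embeddings must have the same size")
    val sums = new Array[Float](dim)
    for (vec <- sentences; i <- 0 until dim) sums(i) += vec(i)
    sums.map(_ / sentences.size)
  }

  def main(args: Array[String]): Unit = {
    // Two dummy sentence embeddings for one document.
    val doc = Seq(Array.fill(768)(1.0f), Array.fill(768)(3.0f))
    val pooled = average(doc)
    println(pooled.length) // 768
    println(pooled.head)   // 2.0
  }
}
```

Unlike flattening, the output length does not depend on how many sentences the document has, so it would match the 768 columns the LSH model expects.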
