Description
Describe the bug
We are using CoBrix with PySpark and executing it on AWS EMR.
The EBCDIC file and its corresponding copybook are stored in an AWS S3 bucket. When we try to parse the EBCDIC file using the copybook, we get an error.
Error message :
py4j.protocol.Py4jJavaError : An error occurred while calling o2021.loa : za.co.absa.cobrix.cobol.parser.exceptions.SyntaxErrorException : Syntax error in the copybook at line 29 : Invalid input 'BBBB' at position 29:45
Code snippet that caused the issue
    import logging

    # spark, s3_bucket, ebcdic_file_path, copybook and ebcdic are defined earlier in the job
    try:
        file_path = f's3://{s3_bucket}/{ebcdic_file_path}'
        # Parentheses let the chained reader calls span multiple lines
        df = (
            spark.read
            .format("cobol")
            .option("copybook_contents", copybook)
            .option("encoding", ebcdic)
            .option("schema_retention_policy", "collapse_root")
            .option("generate_record_id", True)
            .load(file_path)
        )
    except Exception as e:
        log_message = f'spark job failed with error : {e}'
        logging.error(log_message)
        raise
Expected behavior
We expected Cobrix to parse the EBCDIC file successfully using the copybook, which contains a field whose PIC clause ends in 'BBBB'.
Context
PySpark Jar dependencies :
- cobol-parser_2.12-2.6.7.jar
- hadoop-lzo-0.4.3.jar
- scodec-bits_2.12-1.1.12.jar
- scodec-core_2.12-1.11.4.jar
- spark-cobol_2.12-2.6.7.jar
- Operating system: AWS EMR (Linux Image)
Copybook (if possible)
15 EL02-267-COLNAME-A
20 EL02-267-COLNAME-B
PIC X(19).
.........
.........
.........
20 EL02-267-COLNAME-C REDEFINES
EL02-267-COLNAME-D
PIC 9(06)BBBB. (We suspect this is what causes the issue)
GP5WHB 20 FILLER pic X(285). CLEAN-UP
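To help narrow this down, here is a minimal sketch for listing every PIC clause in a copybook that contains a `B` (in COBOL, `B` in a PICTURE string is the blank insertion character). It is not part of our job or of Cobrix: the helper name `find_b_insertion_pics` and the indentation of the sample are our own assumptions, and the check is deliberately simple, so it would also flag the `DB` debit symbol.

```python
import re

# Matches the picture string that follows PIC / PICTURE [IS] (case-insensitive).
PIC_RE = re.compile(r"\bPIC(?:TURE)?(?:\s+IS)?\s+(\S+)", re.IGNORECASE)


def find_b_insertion_pics(copybook_text):
    """Return (line_number, picture_string) for each PIC clause containing 'B'.

    Hypothetical helper for illustration only; line numbers are 1-based.
    Note: this simple check also flags the 'DB' (debit) picture symbol.
    """
    hits = []
    for line_no, line in enumerate(copybook_text.splitlines(), start=1):
        for match in PIC_RE.finditer(line):
            # Drop the period that terminates the COBOL statement.
            picture = match.group(1).rstrip(".")
            if "B" in picture.upper():
                hits.append((line_no, picture))
    return hits


# Fragment based on the copybook excerpt above (indentation is assumed).
sample = """\
    20 EL02-267-COLNAME-B
       PIC X(19).
    20 EL02-267-COLNAME-C REDEFINES
       EL02-267-COLNAME-D
       PIC 9(06)BBBB.
    20 FILLER pic X(285).
"""

print(find_b_insertion_pics(sample))  # [(5, '9(06)BBBB')]
```

Running this against the full copybook should show whether line 29 from the error message is the only place where this kind of PIC clause appears.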
Attach a small data file that can help reproduce the issue, if possible: The data is confidential, so we need to check whether sharing a sample is feasible. We will get back on this.