TLDR: When submitting batch Cloud Data Fusion (CDF) pipelines at scale via the REST API, pause for a few seconds between calls to give CDF time to catch up.
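The throttling idea can be sketched in a few lines. The helper below is a hypothetical wrapper, not CDF's API: `start_pipeline` stands in for whatever function actually issues the REST call (for example, a `requests.post` against the pipeline's start endpoint), and the pause length is an assumption you would tune for your instance.

```python
import time
from typing import Callable, Iterable


def submit_with_pause(pipelines: Iterable[str],
                      start_pipeline: Callable[[str], None],
                      pause_s: float = 5.0) -> int:
    """Start each pipeline in turn, sleeping between submissions so the
    CDF control plane can catch up instead of being flooded with calls.

    `start_pipeline` is a placeholder for the function that performs the
    actual REST call; it is injected here so the throttling logic stays
    independent of any HTTP client.
    """
    started = 0
    for name in pipelines:
        start_pipeline(name)   # fire the REST call for this pipeline
        started += 1
        time.sleep(pause_s)    # back off before the next submission
    return started
```

In practice `start_pipeline` would wrap your authenticated HTTP client; the point is simply that the sleep sits between submissions rather than inside a retry loop.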
Background: as part of a migration we’re involved in, our data science team is moving hundreds of legacy MS SQL Server ODS tables into BigQuery. While our engineering team handles the actual migration, we (the DS team) want the data, and control of it, ourselves so we can build, prototype, and migrate our models in GCP quickly, without waiting on the broader quality and scope requirements our engineering team is tasked with. Enter…
Recently, I submitted some PySpark ETL jobs on our data science EMR cluster, and not long after submission I hit a strange error:
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000005b7027000, 1234763776, 0) failed; error='Cannot allocate memory' (errno=12)
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 1234763776 bytes for committing reserved memory.
# An error report file with more information is saved as:
I was puzzled, since I had reserved plenty of memory (15G) for the driver as well as the executors:
nohup spark-submit --master yarn…
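One way this failure can happen even with "plenty" of memory reserved: YARN reserves each executor's heap plus a memory overhead (by default the larger of 384 MB or 10% of executor memory), and in client mode the driver JVM runs on the submitting host outside YARN's accounting, so the combined footprint can exceed the node's physical RAM even though every individual setting looks safe. A rough back-of-the-envelope check, using hypothetical node and executor sizes (the 10%/384 MB default matches Spark's `spark.executor.memoryOverhead`, but the cluster numbers below are made up for illustration):

```python
def yarn_container_mb(executor_memory_mb: int,
                      overhead_fraction: float = 0.10,
                      min_overhead_mb: int = 384) -> int:
    """Memory YARN actually reserves for one executor container:
    the requested heap plus the memory overhead, which defaults to
    max(384 MB, 10% of the executor memory)."""
    overhead = max(min_overhead_mb, int(executor_memory_mb * overhead_fraction))
    return executor_memory_mb + overhead


# Hypothetical setup: three 15 GB executors plus a 15 GB client-mode
# driver, all landing on a 64 GB master node.
executor_mb = 15 * 1024                       # 15360 MB requested heap
per_container = yarn_container_mb(executor_mb)  # 15360 + 1536 = 16896 MB

driver_mb = 15 * 1024                         # driver JVM, outside YARN
total_mb = 3 * per_container + driver_mb      # 50688 + 15360 = 66048 MB

node_mb = 64 * 1024                           # 65536 MB of physical RAM
# total_mb > node_mb, so the next mmap by any JVM can fail with
# errno=12 (Cannot allocate memory) despite the generous settings.
over_committed = total_mb > node_mb
```

Under these assumed numbers the requests overshoot the node by about half a gigabyte, which is exactly the shape of failure in the log above: the OS, not Spark, refuses the allocation.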
Data Science @ Rackspace