Making Sense of Big Data

Testing Cloud Data Fusion for data science workflows

A typical CDF pipeline to move data from legacy SQL Server to BigQuery. Image by Author

TLDR: When submitting batch Cloud Data Fusion (CDF) pipelines at scale via the REST API, pause for a few seconds between calls to allow CDF to catch up.
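As a rough illustration of that pacing, here is a minimal Python sketch. It is not the author's code: `start_pipeline` is a hypothetical callable standing in for whatever REST call you use to start one pipeline, and the 5-second default is only an assumed reading of "a few seconds".

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def submit_with_pause(
    pipelines: Iterable[str],
    start_pipeline: Callable[[str], T],
    pause_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[T]:
    """Start each pipeline in turn, pausing between consecutive calls.

    start_pipeline is a hypothetical wrapper around your own REST call;
    pause_seconds is an assumed value, so tune it for your instance.
    """
    results: List[T] = []
    for index, name in enumerate(pipelines):
        if index > 0:
            # Give the service time to catch up before the next submission.
            sleep(pause_seconds)
        results.append(start_pipeline(name))
    return results


if __name__ == "__main__":
    # Stand-in for a real REST call: it only echoes the pipeline name.
    def fake_start(name: str) -> str:
        print(f"submitted {name}")
        return name

    submit_with_pause(["table_a", "table_b"], fake_start, pause_seconds=0.1)
```

Passing the start function and the sleep function as arguments keeps the pacing logic separate from any particular HTTP client, and makes it easy to test without real delays.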

Background: as part of a migration we’re involved in, our data science team is migrating hundreds of legacy MS SQL Server ODS tables into BigQuery. While our engineering team handles the actual migration, we (the DS team) want the data, and control of it, ourselves so that we can build, prototype, and migrate our models in GCP quickly, without waiting on the quality and wide-scope requirements that our engineering team is tasked with. Enter…

Recently, I submitted some PySpark ETL jobs on our data science EMR cluster, and not long after submission, I encountered a strange error:

OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x00000005b7027000, 1234763776, 0) failed; error='Cannot allocate memory' (errno=12)
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 1234763776 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /home/hadoop/ce_studies/hs_err_pid28194.log

I was puzzled, since I had reserved plenty of memory (15G) on both the driver and the executors:

nohup spark-submit --master yarn…

Charlie Mueller

Data Science @ Rackspace
