AWS Glue

Glue configuration for optimal logging and cost efficiency

Custom Log4J for Spark on Glue to decrease CloudWatch cost!

Introduction

In today's cloud-centric computing environment, managing costs while ensuring optimal performance can be quite the balancing act, especially when dealing with large-scale data processing tasks. AWS Glue is frequently employed to prepare and transform data for analytics and streaming use cases. One area where costs can unexpectedly mount is logging, particularly when using services like Amazon CloudWatch. This blog post will guide you through configuring Log4J for Spark on AWS Glue to optimize logging and achieve cost efficiency.

Understanding the Basics

Before diving into the specifics, it’s essential to understand what AWS Glue, Spark, Log4J, and Amazon CloudWatch entail and how they interact:

AWS Glue: A serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and real-time application development.

Apache Spark: An open-source unified analytics engine for large-scale data processing.

Log4J: A reliable, fast, and flexible logging framework written in Java, used by many platforms including Apache Spark.

Amazon CloudWatch: A monitoring and observability service built for aggregating logs and metrics from different services.

The Cost of Logging

Logging is crucial for debugging and monitoring applications, but it can lead to high costs if not managed properly. Each log entry consumes storage and, depending on the verbosity of the logs, the cost associated with storing and querying these logs can escalate quickly.

CloudWatch costs can accrue based on several factors:

Log Data Ingestion: CloudWatch charges for each gigabyte of log data ingested. If your applications are highly verbose in their logging, this can result in significant data ingestion charges.

Log Storage: After ingestion, CloudWatch Logs retains this data, charging for the storage per gigabyte per month. The longer you retain logs and the more data you store, the higher the costs.

Data Transfer: While transferring data within the same AWS region is free, transferring log data across regions can incur additional costs.
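To see how ingestion and storage combine, here is a rough back-of-the-envelope sketch. The prices and log volumes are placeholders chosen for illustration, not real AWS rates, so substitute the current figures for your region. The function name and the steady-state retention model are also assumptions, and data transfer is left out.

```python
def estimate_monthly_log_cost(
    ingested_gb_per_month: float,
    retention_months: float,
    ingestion_price_per_gb: float,
    storage_price_per_gb_month: float,
) -> float:
    """Rough steady-state monthly CloudWatch Logs cost, in the pricing currency."""
    ingestion_cost = ingested_gb_per_month * ingestion_price_per_gb
    # In steady state, roughly `retention_months` worth of logs are stored at once.
    stored_gb = ingested_gb_per_month * retention_months
    storage_cost = stored_gb * storage_price_per_gb_month
    return ingestion_cost + storage_cost


# Placeholder prices for illustration only; NOT real AWS rates.
INGESTION_PRICE = 0.50
STORAGE_PRICE = 0.03

# Hypothetical volumes: a verbose job versus the same job with trimmed logging.
verbose = estimate_monthly_log_cost(200, 3, INGESTION_PRICE, STORAGE_PRICE)
trimmed = estimate_monthly_log_cost(20, 3, INGESTION_PRICE, STORAGE_PRICE)
print(f"Verbose logging: {verbose:.2f}")
print(f"Trimmed logging: {trimmed:.2f}")
```

Under these assumed numbers ingestion dominates the bill, which is why reducing log volume at the source tends to matter more than shortening retention.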

Configuring Log4J for Spark on AWS Glue

Here’s how you can tailor Log4J on Spark within AWS Glue to trim unnecessary logging and thus reduce costs:

Modify the Log4J Properties File: AWS Glue uses Apache Spark, which in turn uses Log4J for logging. You can customize the logging level by editing the log4j2.properties file. Lowering the log level from INFO to WARN or ERROR reduces the number of log entries generated, which directly lowers cost because less log data is sent to CloudWatch. The following configuration, written in the Log4j 2 properties syntax, sets the root logger to WARN and quiets several verbose third-party loggers:

  status = error

  rootLogger.level = warn

  # Console Appender
  rootLogger.appenderRef.stdout.ref = STDOUT
  appender.console.type = Console
  appender.console.name = STDOUT
  appender.console.target = SYSTEM_ERR
  appender.console.layout.type = PatternLayout
  appender.console.layout.pattern = %d{yyyy-MM-dd HH:mm:ss,SSS} %p [%t] %c{2} (%F:%M(%L)): %m%n

  # Loggers setting
  # Set the default spark-shell log level to WARN. When running the spark-shell, the
  # log level for this class is used to overwrite the root logger's log level, so that
  # the user can have different defaults for the shell and regular Spark apps.
  logger.spark_repl_main.name = org.apache.spark.repl.Main
  logger.spark_repl_main.additivity = false
  logger.spark_repl_main.level = warn

  logger.spark_deploy_yarn_client.name = org.apache.spark.deploy.yarn.Client
  logger.spark_deploy_yarn_client.additivity = false
  logger.spark_deploy_yarn_client.level = debug

  # Settings to quiet third party logs that are too verbose
  logger.spark_jetty.name = org.spark_project.jetty
  logger.spark_jetty.additivity = false
  logger.spark_jetty.level = warn

  logger.spark_jetty_util_abstract_lifecycle.name = org.spark_project.jetty.util.component.AbstractLifeCycle
  logger.spark_jetty_util_abstract_lifecycle.additivity = false
  logger.spark_jetty_util_abstract_lifecycle.level = error

  logger.spark_repel_expr_typer.name = org.apache.spark.repl.SparkIMain$exprTyper
  logger.spark_repel_expr_typer.additivity = false
  logger.spark_repel_expr_typer.level = warn

  logger.spark_repel_loop_interpreter.name = org.apache.spark.repl.SparkILoop$SparkILoopInterpreter
  logger.spark_repel_loop_interpreter.additivity = false
  logger.spark_repel_loop_interpreter.level = warn

  logger.apache_parquet.name = org.apache.parquet
  logger.apache_parquet.additivity = false
  logger.apache_parquet.level = error

  logger.parquet.name = parquet
  logger.parquet.additivity = false
  logger.parquet.level = error

  logger.sql_datasource_parquet.name = org.apache.spark.sql.execution.datasources.parquet
  logger.sql_datasource_parquet.additivity = false
  logger.sql_datasource_parquet.level = error

  logger.sql_datasource_file_scan_rdd.name = org.apache.spark.sql.execution.datasources.FileScanRDD
  logger.sql_datasource_file_scan_rdd.additivity = false
  logger.sql_datasource_file_scan_rdd.level = error

  logger.hadoop_codec_pool.name = org.apache.hadoop.io.compress.CodecPool
  logger.hadoop_codec_pool.additivity = false
  logger.hadoop_codec_pool.level = error

  # SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
  logger.hive_retrying_hms_handler.name = org.apache.hadoop.hive.metastore.RetryingHMSHandler
  logger.hive_retrying_hms_handler.additivity = false
  logger.hive_retrying_hms_handler.level = fatal

  logger.hive_function_registry.name = org.apache.hadoop.hive.ql.exec.FunctionRegistry
  logger.hive_function_registry.additivity = false
  logger.hive_function_registry.level = error

  # Remove DynamoDB tedious messages
  logger.ddb_page_result_multiplexer.name = org.apache.hadoop.dynamodb.preader.PageResultMultiplexer
  logger.ddb_page_result_multiplexer.additivity = false
  logger.ddb_page_result_multiplexer.level = off

  logger.ddb_read_worker.name = org.apache.hadoop.dynamodb.preader.ReadWorker
  logger.ddb_read_worker.additivity = false
  logger.ddb_read_worker.level = off

  packages = com.amazonaws.services.glue.cloudwatch

  # Progress Bar Appender; Progress bar content will not be added to dockerlog or cx error log
  appender.progress_bar.type = CloudWatchAppenderLog4j2
  appender.progress_bar.name = BAR
  appender.progress_bar.layout.type = PatternLayout
  appender.progress_bar.layout.pattern = %m%n

  appender.progress_bar.filter.bar_level_filter.type = CloudWatchLevelRangeFilter
  appender.progress_bar.filter.bar_level_filter.loggerToMatch = com.amazonaws.services.glue.ui.GlueConsoleProgressBar
  appender.progress_bar.filter.bar_level_filter.minLevel = fatal
  appender.progress_bar.filter.bar_level_filter.maxLevel = warn
  appender.progress_bar.filter.bar_level_filter.onMatch = accept
  appender.progress_bar.filter.bar_level_filter.onMismatch = deny

  loggers = BarLogger
  logger.BarLogger.name = com.amazonaws.services.glue.ui.GlueConsoleProgressBar
  logger.BarLogger.level = error
  logger.BarLogger.additivity = false
  logger.BarLogger.appenderRef.progress_bar.ref = BAR
  appender.progress_bar.flushInterval=5
  appender.progress_bar.maxRetries=5
  appender.progress_bar.logStream=progress-bar
  appender.progress_bar.logGroup=/aws-glue/jobs/logs-v2
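Before uploading a file like this, it can help to check which loggers end up at which level. The following is a minimal sketch of such a check, not part of Glue or Log4J. The function name is hypothetical, and the parser is deliberately simplified: it only understands `key = value` lines and `#` comments.

```python
def summarize_logger_levels(properties_text: str) -> dict:
    """Map each configured logger name to its level, plus the root logger."""
    props = {}
    for raw_line in properties_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not key = value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()

    levels = {}
    if "rootLogger.level" in props:
        levels["<root>"] = props["rootLogger.level"]
    for key, value in props.items():
        # Keys look like "logger.<id>.name"; the level lives in "logger.<id>.level".
        if key.startswith("logger.") and key.endswith(".name"):
            logger_id = key[len("logger."):-len(".name")]
            levels[value] = props.get(f"logger.{logger_id}.level", "(inherited)")
    return levels


sample = """
rootLogger.level = warn
# Quiet Parquet
logger.apache_parquet.name = org.apache.parquet
logger.apache_parquet.level = error
appender.progress_bar.flushInterval=5
"""

for name, level in summarize_logger_levels(sample).items():
    print(f"{level:>5}  {name}")
```

Running it against the full configuration above gives a quick overview of what will still be logged, and any logger left at `debug` or `info` is easy to spot.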

Modify Glue job to read the Log4J configuration

Next, update the Glue job so that it picks up the custom log properties.

If you are using the AWS console, enter the S3 path of the properties file in the job's Referenced files path field.

If you are using CDK, you can pass the path through the --extra-files parameter:

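  glue_alpha.Job(
      self,
      "GlueJob",
      # Other job properties are omitted for brevity.
      default_arguments={
          # S3 location of the custom Log4J configuration file
          "--extra-files": f"s3://{artifacts_bucket.bucket_name}/log4j2.properties",
      },
  )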
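If several jobs share the same configuration file, the argument can be built in one place. This is a small sketch under that assumption. The helper name and the bucket name are hypothetical, and the default key simply mirrors the file name used above.

```python
def log4j_default_arguments(bucket_name: str, key: str = "log4j2.properties") -> dict:
    """Build Glue default arguments that reference a Log4J properties file in S3."""
    if not bucket_name:
        raise ValueError("bucket_name must not be empty")
    # Strip a leading slash so the resulting S3 URI has no empty path segment.
    return {"--extra-files": f"s3://{bucket_name}/{key.lstrip('/')}"}


# "my-artifacts-bucket" is a placeholder bucket name.
args = log4j_default_arguments("my-artifacts-bucket")
print(args)
```

The returned dictionary can be passed as `default_arguments` in the CDK snippet above. The same key/value shape should also work for the `DefaultArguments` parameter when creating a job through the AWS SDK.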

Conclusion

Optimizing your logging strategy within AWS Glue by customizing Log4J for Spark applications is a practical step towards managing cloud expenditure effectively. By fine-tuning the logging levels and adopting a more strategic logging approach, you can significantly reduce costs while maintaining the necessary visibility into your applications’ performance and health. Regular monitoring and adjustments ensure that your logging remains both effective and economical.

If you would like more insight into your cloud costs, you can check how DataChef implemented its own CDK Budget monitoring constructs.