Glue configuration for optimal logging and cost efficiency
Custom Log4J for Spark on Glue to decrease CloudWatch cost!
Introduction
In today's cloud-centric computing environment, managing costs while ensuring optimal performance can be quite the balancing act, especially when dealing with large-scale data processing tasks. AWS Glue is frequently employed to prepare and transform data for analytics and streaming use cases. One area where costs can unexpectedly mount is logging, particularly when using services like Amazon CloudWatch. This blog post will guide you through configuring Log4J for Spark on AWS Glue to optimize logging and achieve cost efficiency.
Understanding the Basics
Before diving into the specifics, it’s essential to understand what AWS Glue, Spark, Log4J, and Amazon CloudWatch entail and how they interact:
AWS Glue: A serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and real-time application development.
Apache Spark: An open-source unified analytics engine for large-scale data processing.
Log4J: A reliable, fast, and flexible logging framework written in Java, used by many platforms, including Apache Spark.
Amazon CloudWatch: A monitoring and observability service built for aggregating logs and metrics from different services.
The Cost of Logging
Logging is crucial for debugging and monitoring applications, but it can lead to high costs if not managed properly. Each log entry consumes storage and, depending on the verbosity of the logs, the cost associated with storing and querying these logs can escalate quickly.
CloudWatch costs can accrue based on several factors (a rough cost sketch follows the list below):
Log Data Ingestion: CloudWatch charges for each gigabyte of log data ingested. If your applications are highly verbose in their logging, this can result in significant data ingestion charges.
Log Storage: After ingestion, CloudWatch Logs retains this data, charging for the storage per gigabyte per month. The longer you retain logs and the more data you store, the higher the costs.
Data Transfer: While transferring data within the same AWS region is free, transferring log data across regions can incur additional costs.
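To make these factors concrete, here is a rough back-of-the-envelope estimate in Python. The per-gigabyte prices are illustrative assumptions (roughly in line with published us-east-1 rates), and the storage term is a simplified steady-state approximation; always check the current CloudWatch pricing page for your region before relying on the numbers.

# Illustrative, assumed prices -- verify against the current CloudWatch pricing page.
INGESTION_USD_PER_GB = 0.50       # charged once, when log data is ingested
STORAGE_USD_PER_GB_MONTH = 0.03   # charged monthly for retained data

def monthly_log_cost(gb_ingested_per_day: float, retention_months: float) -> float:
    """Rough steady-state monthly CloudWatch Logs cost for a constant ingestion rate."""
    gb_per_month = gb_ingested_per_day * 30
    ingestion = gb_per_month * INGESTION_USD_PER_GB
    # Approximate the retained volume as (monthly ingest x retention period).
    storage = gb_per_month * retention_months * STORAGE_USD_PER_GB_MONTH
    return ingestion + storage

# A chatty Glue job logging at INFO vs. the same job at WARN:
print(monthly_log_cost(gb_ingested_per_day=20, retention_months=3))  # ~$354/month
print(monthly_log_cost(gb_ingested_per_day=2, retention_months=3))   # ~$35/month

Even with these rough numbers, cutting a chatty job's log volume by an order of magnitude translates almost directly into an order-of-magnitude smaller ingestion bill.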
Configuring Log4J for Spark on AWS Glue
Here’s how you can tailor Log4J on Spark within AWS Glue to trim unnecessary logging and thus reduce costs:
Modify the Log4J properties file
AWS Glue uses Apache Spark, which in turn uses Log4J for logging. You can customize the logging level by tweaking the log4j2.properties file. Lowering the log level from INFO to WARN or ERROR reduces the volume of log entries generated, which directly cuts the amount of log data sent to CloudWatch and therefore the cost. A working configuration looks like this:

status = error
rootLogger.level = warn

# Console Appender
rootLogger.appenderRef.stdout.ref = STDOUT
appender.console.type = Console
appender.console.name = STDOUT
appender.console.target = SYSTEM_ERR
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yyyy-MM-dd HH:mm:ss,SSS} %p [%t] %c{2} (%F:%M(%L)): %m%n

# Loggers setting
# Set the default spark-shell log level to WARN. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
logger.spark_repl_main.name = org.apache.spark.repl.Main
logger.spark_repl_main.additivity = false
logger.spark_repl_main.level = warn

logger.spark_deploy_yarn_client.name = org.apache.spark.deploy.yarn.Client
logger.spark_deploy_yarn_client.additivity = false
logger.spark_deploy_yarn_client.level = debug

# Settings to quiet third party logs that are too verbose
logger.spark_jetty.name = org.spark_project.jetty
logger.spark_jetty.additivity = false
logger.spark_jetty.level = warn

logger.spark_jetty_util_abstract_lifecycle.name = org.spark_project.jetty.util.component.AbstractLifeCycle
logger.spark_jetty_util_abstract_lifecycle.additivity = false
logger.spark_jetty_util_abstract_lifecycle.level = error

logger.spark_repel_expr_typer.name = org.apache.spark.repl.SparkIMain$exprTyper
logger.spark_repel_expr_typer.additivity = false
logger.spark_repel_expr_typer.level = warn

logger.spark_repel_loop_interpreter.name = org.apache.spark.repl.SparkILoop$SparkILoopInterpreter
logger.spark_repel_loop_interpreter.additivity = false
logger.spark_repel_loop_interpreter.level = warn

logger.apache_parquet.name = org.apache.parquet
logger.apache_parquet.additivity = false
logger.apache_parquet.level = error

logger.parquet.name = parquet
logger.parquet.additivity = false
logger.parquet.level = error

logger.sql_datasource_parquet.name = org.apache.spark.sql.execution.datasources.parquet
logger.sql_datasource_parquet.additivity = false
logger.sql_datasource_parquet.level = error

logger.sql_datasource_file_scan_rdd.name = org.apache.spark.sql.execution.datasources.FileScanRDD
logger.sql_datasource_file_scan_rdd.additivity = false
logger.sql_datasource_file_scan_rdd.level = error

logger.hadoop_codec_pool.name = org.apache.hadoop.io.compress.CodecPool
logger.hadoop_codec_pool.additivity = false
logger.hadoop_codec_pool.level = error

# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
logger.hive_retrying_hms_handler.name = org.apache.hadoop.hive.metastore.RetryingHMSHandler
logger.hive_retrying_hms_handler.additivity = false
logger.hive_retrying_hms_handler.level = fatal

logger.hive_function_registry.name = org.apache.hadoop.hive.ql.exec.FunctionRegistry
logger.hive_function_registry.additivity = false
logger.hive_function_registry.level = error

# Remove DynamoDB tedious messages
logger.ddb_page_result_multiplexer.name = org.apache.hadoop.dynamodb.preader.PageResultMultiplexer
logger.ddb_page_result_multiplexer.additivity = false
logger.ddb_page_result_multiplexer.level = off

logger.ddb_read_worker.name = org.apache.hadoop.dynamodb.preader.ReadWorker
logger.ddb_read_worker.additivity = false
logger.ddb_read_worker.level = off

packages = com.amazonaws.services.glue.cloudwatch

# Progress Bar Appender; Progress bar content will not be added to dockerlog or cx error log
appender.progress_bar.type = CloudWatchAppenderLog4j2
appender.progress_bar.name = BAR
appender.progress_bar.layout.type = PatternLayout
appender.progress_bar.layout.pattern = %m%n
appender.progress_bar.filter.bar_level_filter.type = CloudWatchLevelRangeFilter
appender.progress_bar.filter.bar_level_filter.loggerToMatch = com.amazonaws.services.glue.ui.GlueConsoleProgressBar
appender.progress_bar.filter.bar_level_filter.minLevel = fatal
appender.progress_bar.filter.bar_level_filter.maxLevel = warn
appender.progress_bar.filter.bar_level_filter.onMatch = accept
appender.progress_bar.filter.bar_level_filter.onMismatch = deny

loggers = BarLogger
logger.BarLogger.name = com.amazonaws.services.glue.ui.GlueConsoleProgressBar
logger.BarLogger.level = error
logger.BarLogger.additivity = false
logger.BarLogger.appenderRef.progress_bar.ref = BAR
appender.progress_bar.flushInterval = 5
appender.progress_bar.maxRetries = 5
appender.progress_bar.logStream = progress-bar
appender.progress_bar.logGroup = /aws-glue/jobs/logs-v2
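As a complement to the properties file, you can also raise the log level at runtime from inside the Glue script itself via Spark's setLogLevel. The following is a minimal sketch of a Glue (PySpark) job skeleton under that assumption; note that setLogLevel only adjusts the root level at runtime, so the per-logger tuning above still belongs in the properties file.

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

# Standard Glue boilerplate: a SparkContext wrapped by a GlueContext.
sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

# Raise the runtime log level so INFO chatter never reaches CloudWatch.
sc.setLogLevel("WARN")

job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# ... your transformations here ...

job.commit()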
Modify the Glue job to read the Log4J configuration
Now you need to change the Glue job so that it picks up the custom properties file and injects it into the Spark configuration.
- If you are using the AWS console, enter the S3 path of the properties file in the Referenced files path field of the job configuration.
- If you are using CDK, you can pass the path via the --extra-files parameter:

glue_alpha.Job(
    self,
    "GlueJob",
    default_arguments={
        "--extra-files": f"s3://{artifacts_bucket.bucket_name}/log4j2.properties"
    },
)
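For the CDK route to work, the log4j2.properties file must actually exist at that S3 path. One way to keep it in sync with the stack is to upload it as part of the deployment; the sketch below assumes an artifacts_bucket construct like the one referenced above and a local config/ folder containing the file, and uses the aws_s3_deployment module. Any other upload mechanism (for example aws s3 cp) works just as well.

from aws_cdk import aws_s3_deployment as s3_deployment

# Upload the local config/ folder (containing log4j2.properties) next to the other
# job artifacts so the --extra-files S3 URI in default_arguments resolves at run time.
s3_deployment.BucketDeployment(
    self,
    "Log4jConfigDeployment",
    sources=[s3_deployment.Source.asset("config")],
    destination_bucket=artifacts_bucket,
)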
Conclusion
Optimizing your logging strategy within AWS Glue by customizing Log4J for Spark applications is a practical step towards managing cloud expenditure effectively. By fine-tuning the logging levels and adopting a more strategic logging approach, you can significantly reduce costs while maintaining the necessary visibility into your applications’ performance and health. Regular monitoring and adjustments ensure that your logging remains both effective and economical.
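One way to do that ongoing monitoring is to track the IncomingBytes metric that CloudWatch Logs publishes per log group. The boto3 sketch below assumes your jobs write to the /aws-glue/jobs/logs-v2 log group used in the configuration above; substitute whichever log groups your jobs actually use.

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

# Daily log volume ingested by the continuous-logging log group over the last week.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/Logs",
    MetricName="IncomingBytes",
    Dimensions=[{"Name": "LogGroupName", "Value": "/aws-glue/jobs/logs-v2"}],
    StartTime=datetime.now(timezone.utc) - timedelta(days=7),
    EndTime=datetime.now(timezone.utc),
    Period=86400,
    Statistics=["Sum"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"].date(), f"{point['Sum'] / 1024 ** 2:.1f} MiB ingested")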
If you are interested in getting more insight into your cloud costs, you can check out how DataChef implemented its own CDK Budget monitoring constructs.