AWS Glue ETL Code Samples

AWS Samples

Last update: Jan 3, 2023

Related tags

Data Analysis aws-glue-samples

Overview

AWS Glue ETL Code Samples

This repository has samples that demonstrate various aspects of the new AWS Glue service, as well as various AWS Glue utilities.

You can find the AWS Glue open-source Python libraries in a separate repository at: awslabs/aws-glue-libs.

Content

FAQ and How-to

Helps you get started using the many ETL capabilities of AWS Glue, and answers some of the more common questions people have.

Examples

You can run these sample job scripts on any of AWS Glue ETL jobs, container, or local environment.

Join and Relationalize Data in S3

This sample ETL script shows you how to use AWS Glue to load, transform, and rewrite data in AWS S3 so that it can easily and efficiently be queried and analyzed.
Clean and Process

This sample ETL script shows you how to take advantage of both Spark and AWS Glue features to clean and transform data for efficient analysis.
The resolveChoice Method

This sample explores all four of the ways you can resolve choice types in a dataset using DynamicFrame's resolveChoice method.
Converting character encoding

This sample ETL script shows you how to use AWS Glue job to convert character encoding.

Utilities

Hive metastore migration

This utility can help you migrate your Hive metastore to the AWS Glue Data Catalog.
Crawler undo and redo

These scripts can undo or redo the results of a crawl under some circumstances.
Spark UI

You can use this Dockerfile to run Spark history server in your container. See details: Launching the Spark History Server and Viewing the Spark UI Using Docker
use only IAM access controls

AWS Lake Formation applies its own permission model when you access data in Amazon S3 and metadata in AWS Glue Data Catalog through use of Amazon EMR, Amazon Athena and so on. If you currently use Lake Formation and instead would like to use only IAM Access controls, this tool enables you to achieve it.

GlueCustomConnectors

AWS Glue provides built-in support for the most commonly used data stores such as Amazon Redshift, MySQL, MongoDB. Powered by Glue ETL Custom Connector, you can subscribe a third-party connector from AWS Marketplace or build your own connector to connect to data stores that are not natively supported.

Development

Development guide with examples of connectors with simple, intermediate, and advanced functionalities. These examples demonstrate how to implement Glue Custom Connectors based on Spark Data Source or Amazon Athena Federated Query interfaces and plug them into Glue Spark runtime.
Local Validation Tests

This user guide describes validation tests that you can run locally on your laptop to integrate your connector with Glue Spark runtime.
Validation

This user guide shows how to validate connectors with Glue Spark runtime in a Glue job system before deploying them for your workloads.
Glue Spark Script Examples

Python scripts examples to use Spark, Amazon Athena and JDBC connectors with Glue Spark runtime.
Create and Publish Glue Connector to AWS Marketplace

If you would like to partner or publish your Glue custom connector to AWS Marketplace, please refer to this guide and reach out to us at [email protected] for further details on your connector.

License Summary

This sample code is made available under the MIT-0 license. See the LICENSE file.

Comments

Spark history server: README.md to show using AWS_PROFILE

This worked for me:

docker run -itd -v /home/username/.aws:/root/.aws -e AWS_PROFILE=profile -e SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.fs.logDirectory=s3a://path_to_eventlog -Dspark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain" ....

instead of specifying access key/secret key etc. individually, this specifies the current AWS_PROFILE and exposes the current ~/.aws/credentials file into the docker container.

opened by wrschneider 23

Spark_UI docker container doesn't run

Following the steps at utilities/Spark_UI, when I run:

docker run -itd -e SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.fs.logDirectory=s3a://path_to_my_eventlog_dir -Dspark.hadoop.fs.s3a.access.key=my_key_id -Dspark.hadoop.fs.s3a.secret.key=my_secret_access_key" -p 18080:18080 glue/sparkui:latest "/opt/spark/bin/spark-class org.apache.spark.deploy.history.HistoryServer"

The container quickly craps out with:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/01/27 21:16:44 INFO HistoryServer: Started daemon with process name: 1@4b35fc26fd5a
20/01/27 21:16:44 INFO SignalUtils: Registered signal handler for TERM
20/01/27 21:16:44 INFO SignalUtils: Registered signal handler for HUP
20/01/27 21:16:44 INFO SignalUtils: Registered signal handler for INT
20/01/27 21:16:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/01/27 21:16:45 INFO SecurityManager: Changing view acls to: root
20/01/27 21:16:45 INFO SecurityManager: Changing modify acls to: root
20/01/27 21:16:45 INFO SecurityManager: Changing view acls groups to: 
20/01/27 21:16:45 INFO SecurityManager: Changing modify acls groups to: 
20/01/27 21:16:45 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
20/01/27 21:16:45 INFO FsHistoryProvider: History server ui acls disabled; users with admin permissions: ; groups with admin permissions
Exception in thread "main" java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:280)
	at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
Caused by: java.lang.NoSuchMethodError: com.fasterxml.jackson.databind.ObjectMapper.enable([Lcom/fasterxml/jackson/core/JsonParser$Feature;)Lcom/fasterxml/jackson/databind/ObjectMapper;
	at com.amazonaws.partitions.PartitionsLoader.<clinit>(PartitionsLoader.java:54)
	at com.amazonaws.regions.RegionMetadataFactory.create(RegionMetadataFactory.java:30)
	at com.amazonaws.regions.RegionUtils.initialize(RegionUtils.java:65)
	at com.amazonaws.regions.RegionUtils.getRegionMetadata(RegionUtils.java:53)
	at com.amazonaws.regions.RegionUtils.getRegion(RegionUtils.java:107)
	at com.amazonaws.services.s3.AmazonS3Client.createSigner(AmazonS3Client.java:4040)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5039)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4998)
	at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1413)
	at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1349)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.verifyBucketExists(S3AFileSystem.java:276)
	at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:236)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2812)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
	at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:117)
	at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:86)
	... 6 more

opened by woodchuck 17

unclear instructions

in this example: https://github.com/awslabs/aws-glue-samples/blob/master/examples/join_and_relationalize.md "1. Crawl our sample dataset" can you specify if user is supposed to copy the dataset from your sample bucket to their own bucket before proceeding to crawl, or just use your bucket as the source? the latter doesn't work. I got "[626034be-45ee-4e8d-ae26-e985e8131ff9] ERROR : Error Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: F24B84B451EAEC7A) retrieving file at s3://awsglue-datasets/examples/us-legislators/house/persons.json. Tables created did not infer schemas from this file."

also there's a typo in the s3 path (you have a dot at the end) "s3://awsglue-datasets/examples/us-legislators."

opened by angelarw 15

local execution of aws glue

Trying to run aws glue with AWSGlue.zip throws following error


~/opt/spark-2.2.0-bin-hadoop2.7/bin/pyspark
Python 2.7.15 |Anaconda, Inc.| (default, Dec 14 2018, 13:10:39) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
19/02/18 18:36:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/02/18 18:36:23 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0
      /_/

Using Python version 2.7.15 (default, Dec 14 2018 13:10:39)
SparkSession available as 'spark'.
>>> from awsglue.dynamicframe import DynamicFrameWriter
>>> glueContext = GlueContext(sc)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'GlueContext' is not defined
>>> from awsglue.context import GlueContext
>>> glueContext = GlueContext(sc)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "awsglue/context.py", line 44, in __init__
    self._glue_scala_context = self._get_glue_scala_context(**options)
  File "awsglue/context.py", line 64, in _get_glue_scala_context
    return self._jvm.GlueContext(self._jsc.sc())
TypeError: 'JavaPackage' object is not callable

opened by ghost 11

How to keep the partition structure of the folder after ETL?

I have original files in S3 with folder structure like: /data/year=2017/month=1/day=1/origin_files /data/year=2017/month=1/day=2/origin_files

I use glue crawler create a table data(partitioned) as source of glue jobs. Currently after I use glue job converting files to ORC, I get: /data_orc/transformed_data_files.orc

Is that possible to keep same partition structure after transforming jobs? like: /data_orc/year=2017/month=1/day=1/transformed_data_files.orc /data_orc/year=2017/month=1/day=2/transformed_data_files.orc

It doesn't have to be file to file matching, but I hope the data partition can keep same folder structure.

opened by veryveryfat 10

Spark-UI docker container : setting aws security credentials throws

After building the docker image and attempting to start it (env vars are exported):

docker run -it -e SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=$LOG_DIR -Dspark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID -Dspark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY -Dfs.s3n.awsAccessKeyId=$AWS_ACCESS_KEY_ID -Dfs.s3n.awsSecretAccessKey=$AWS_SECRET_ACCESS_KEY" -p 18080:18080 glue/sparkui:latest "/opt/spark/bin/spark-class org.apache.spark.deploy.history.HistoryServer"

I get:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/10/15 11:02:47 INFO HistoryServer: Started daemon with process name: 1@f3fe1ed456cf
20/10/15 11:02:47 INFO SignalUtils: Registered signal handler for TERM
20/10/15 11:02:47 INFO SignalUtils: Registered signal handler for HUP
20/10/15 11:02:47 INFO SignalUtils: Registered signal handler for INT
20/10/15 11:02:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/10/15 11:02:47 INFO SecurityManager: Changing view acls to: root
20/10/15 11:02:47 INFO SecurityManager: Changing modify acls to: root
20/10/15 11:02:47 INFO SecurityManager: Changing view acls groups to: 
20/10/15 11:02:47 INFO SecurityManager: Changing modify acls groups to: 
20/10/15 11:02:47 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
20/10/15 11:02:48 INFO FsHistoryProvider: History server ui acls disabled; users with admin permissions: ; groups with admin permissions
20/10/15 11:02:48 WARN FileSystem: S3FileSystem is deprecated and will be removed in future releases. Use NativeS3FileSystem or S3AFileSystem instead.
Exception in thread "main" java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:280)
	at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
Caused by: java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified by setting the fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey properties (respectively).
	at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:74)
	at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.initialize(Jets3tFileSystemStore.java:94)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346)
	at com.sun.proxy.$Proxy5.initialize(Unknown Source)
	at org.apache.hadoop.fs.s3.S3FileSystem.initialize(S3FileSystem.java:111)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2812)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
	at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:117)
	at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:86)
	... 6 more

opened by fbielejec 6

Issue With Direct Migration from Hive Metastore to Glue Data Catalog

Hello, I hope you are doing well. At my company, we are in the process of adopting Athena for our adhoc data analytics tasks and configured Athena to use Glue data catalogue as its metastore. However, the source of truth for our data is a Hive metastore hosted on an AWS RDS mysql instance and we want Glue data catalog to be in sync with our Hive metastore. For this, we followed instructions outlined on awslabs' aws-glue-samples repo (https://github.com/awslabs/aws-glue-samples/tree/master/utilities/Hive_metastore_migration) to setup a Glue job that we will run (eventually periodically) to directly migrate Hive metastore to Glue data catalog.

However, we are running into a situation where the job finishes successfully but we don't see anything migrated into our Glue data catalog. I am wondering if this issue is experienced by others and if it got resolved. I do see an issue similar to our issue here (I even replied to this thread): https://github.com/awslabs/aws-glue-samples/issues/6, but this is closed and I dont see a solution posted in it.

Any help is greatly appreciated.

Thanks, Kshitij

opened by kjudahlookout 6
AWS Glue job is failing for larger csv data on s3

For small s3 input files (~10GB), glue ETL job works fine but for the larger dataset (~200GB), the job is failing.

Adding a part of ETL code. // Converting Dynamic frame to dataframe df = dropnullfields3.toDF() // create new partition column partitioned_dataframe = df.withColumn('part_date', df['timestamp_utc'].cast('date')) // store the data in parquet format on s3 partitioned_dataframe.write.partitionBy(['part_date']).format("parquet").save(output_lg_partitioned_dir, mode="append")

Job executed for 4 hours and threw an error.

File "script_2017-11-23-15-07-32.py", line 49, in partitioned_dataframe.write.partitionBy(['part_date']).format("parquet").save(output_lg_partitioned_dir, mode="append") File "/mnt/yarn/usercache/root/appcache/application_1511449472652_0001/container_1511449472652_0001_02_000001/pyspark.zip/pyspark/sql/readwriter.py", line 550, in save File "/mnt/yarn/usercache/root/appcache/application_1511449472652_0001/container_1511449472652_0001_02_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in call File "/mnt/yarn/usercache/root/appcache/application_1511449472652_0001/container_1511449472652_0001_02_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco File "/mnt/yarn/usercache/root/appcache/application_1511449472652_0001/container_1511449472652_0001_02_000001/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o172.save. : org.apache.spark.SparkException: Job aborted. at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:147) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:121) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:101) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87) at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:492) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:198) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:280) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 3385 tasks (1024.1 MB) is bigger than spark.driver.maxResultSize (1024.0 MB) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:127) ... 30 more

End of LogType:stdout

opened by sumit-saurabh 6
What version of the AWS SDK are you using?

I've tried with the 1.11.297 and 2.0 preview versions and they have a very different API from what your samples show. For example: These versions don't have a GlueContext under com.amazonaws.services.glue

opened by james-neutron 5
Migrate Directly from Hive to AWS Glue | No tables created

hive_databases_S3.txt hive_tables_S3.txt

I am trying to migrate directly from Hive to AWS Glue. I created proper Glue job with Hive connection. Tested connection, and it successfully connected. Basically followed all steps and all was successful. But eventually I can't see tables in AWS Glue catalog. No error logs in job and normal logs say run status as succeeded.

I even tried Migrate from Hive to AWS Glue using Amazon S3 Objects. And that too was successful, but no tables were created in Glue catalog. I could find metastore in S3 buckets exported from Hive (files attached).

Now I thought of running this code from my local Eclipse to debug. Can you please tell me how to debug from local Eclipse or in Glue console.

opened by addictedanshul 5

Spark-UI docker container startup issue

Hi,

I followed the steps from the readme.md & unable to start spark UI on local, below is the error trace -

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
22/02/22 16:14:17 INFO HistoryServer: Started daemon with process name: 1@d4be68f80dce
22/02/22 16:14:17 INFO SignalUtils: Registering signal handler for TERM
22/02/22 16:14:17 INFO SignalUtils: Registering signal handler for HUP
22/02/22 16:14:17 INFO SignalUtils: Registering signal handler for INT
22/02/22 16:14:17 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/02/22 16:14:17 INFO SecurityManager: Changing view acls to: root
22/02/22 16:14:17 INFO SecurityManager: Changing modify acls to: root
22/02/22 16:14:17 INFO SecurityManager: Changing view acls groups to: 
22/02/22 16:14:17 INFO SecurityManager: Changing modify acls groups to: 
22/02/22 16:14:17 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
22/02/22 16:14:17 INFO FsHistoryProvider: History server ui acls disabled; users with admin permissions: ; groups with admin permissions: 
22/02/22 16:14:17 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
22/02/22 16:14:17 INFO MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
22/02/22 16:14:17 INFO MetricsSystemImpl: s3a-file-system metrics system started
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:300)
at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
Caused by: java.lang.NoSuchFieldError: SERVICE_ID
at com.amazonaws.services.s3.AmazonS3Client.createRequest(AmazonS3Client.java:4772)
at com.amazonaws.services.s3.AmazonS3Client.createRequest(AmazonS3Client.java:4758)
at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1434)
at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1374)
at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$verifyBucketExists$1(S3AFileSystem.java:381)

opened by RajaShyam 4

Request to Host Glue Spark UI Images on DockerHub

Currently I need to write a bunch of code to automatically download and build this docker image because I can't pull it directly from DockerHub.

echo "glue/sparkui image not found, building image..."
TEMPD=$(mktemp -d -t "glue-sparkui-XXXX")
if [ ! -e "$TEMPD" ]; then
    echo >&2 "Failed to create temp directory"
    exit 1
fi
cd "$TEMPD"
git clone https://github.com/aws-samples/aws-glue-samples.git
(cd aws-glue-samples/utilities/Spark_UI/glue-1_0-2_0 && docker build -t glue/sparkui:2.0 .)
(cd aws-glue-samples/utilities/Spark_UI/glue-3_0 && docker build -t glue/sparkui:3.0 .)
(cd aws-glue-samples/utilities/Spark_UI/glue-4_0 && docker build -t glue/sparkui:4.0 .)
cd -
rm -rf "$TEMPD"

Can these docker images be published directly to DockerHub so I can simply docker pull them and start using immediately?

I see that the aws-glue-libs are already hosted there: https://hub.docker.com/r/amazon/aws-glue-libs

Would be great to host the Spark UI as well!

opened by AlJohri 0

Spark UI Glue 4.0 Logging Not Working?

I get this message and not more logs when I run the Glue 4.0 Spark UI container:

log4j:WARN No appenders could be found for logger (org.apache.spark.deploy.history.HistoryServer).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

opened by AlJohri 0

Bump jackson-databind from 2.13.3 to 2.13.4.1 in /utilities/Spark_UI/glue-4_0
Bumps jackson-databind from 2.13.3 to 2.13.4.1.

Commits

See full diff in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

@dependabot rebase will rebase this PR

@dependabot recreate will recreate this PR, overwriting any edits that have been made to it

@dependabot merge will merge this PR after your CI passes on it

@dependabot squash and merge will squash and merge this PR after your CI passes on it

@dependabot cancel merge will cancel a previously requested merge and block automerging

@dependabot reopen will reopen this PR if it is closed

@dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually

@dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)

@dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)

@dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

@dependabot use these labels will set the current labels as the default for future PRs for this repo and language

@dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language

@dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language

@dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

You can disable automated security fix PRs for this repo from the Security Alerts page.

dependencies
opened by dependabot[bot] 0
Bump hadoop-common from 3.3.2 to 3.3.3 in /utilities/Spark_UI/glue-4_0
Bumps hadoop-common from 3.3.2 to 3.3.3.

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

@dependabot rebase will rebase this PR

@dependabot recreate will recreate this PR, overwriting any edits that have been made to it

@dependabot merge will merge this PR after your CI passes on it

@dependabot squash and merge will squash and merge this PR after your CI passes on it

@dependabot cancel merge will cancel a previously requested merge and block automerging

@dependabot reopen will reopen this PR if it is closed

@dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually

@dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)

@dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)

@dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

@dependabot use these labels will set the current labels as the default for future PRs for this repo and language

@dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language

@dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language

@dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

You can disable automated security fix PRs for this repo from the Security Alerts page.

dependencies
opened by dependabot[bot] 0
Bump jackson-databind from 2.10.5.1 to 2.12.7.1 in /utilities/Spark_UI/glue-1_0-2_0
Bumps jackson-databind from 2.10.5.1 to 2.12.7.1.

Commits

See full diff in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

@dependabot rebase will rebase this PR

@dependabot recreate will recreate this PR, overwriting any edits that have been made to it

@dependabot merge will merge this PR after your CI passes on it

@dependabot squash and merge will squash and merge this PR after your CI passes on it

@dependabot cancel merge will cancel a previously requested merge and block automerging

@dependabot reopen will reopen this PR if it is closed

@dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually

@dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)

@dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)

@dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

@dependabot use these labels will set the current labels as the default for future PRs for this repo and language

@dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language

@dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language

@dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

You can disable automated security fix PRs for this repo from the Security Alerts page.

dependencies
opened by dependabot[bot] 0

Owner

AWS Samples

GitHub

Educational project on how to build an ETL (Extract, Transform, Load) data pipeline, orchestrated with Airflow.

ETL Pipeline with Airflow, Spark, s3, MongoDB and Amazon Redshift

214 Jan 2, 2023

Pyspark Spotify ETL

This is my first Data Engineering project, it extracts data from the user's recently played tracks using Spotify's API, transforms data and then loads it into Postgresql using SQLAlchemy engine. Data is shown as a Spark Dataframe before loading and the whole ETL job is scheduled with crontab. Token never expires since an HTTP POST method with Spotify's token API is used in the beginning of the script.

16 Jun 9, 2022

Python ELT Studio, an application for building ELT (and ETL) data flows.

The Python Extract, Load, Transform Studio is an application for performing ELT (and ETL) tasks. Under the hood the application consists of a two parts.

55 Nov 18, 2022

ETL flow framework based on Yaml configs in Python

ETL framework based on Yaml configs in Python A light framework for creating data streams. Setting up streams through configuration in the Yaml file.

18 Jul 6, 2022

A Big Data ETL project in PySpark on the historical NYC Taxi Rides data

Processing NYC Taxi Data using PySpark ETL pipeline Description This is an project to extract, transform, and load large amount of data from NYC Taxi

2 Dec 12, 2021

An ETL framework + Monitoring UI/API (experimental project for learning purposes)

Fastlane An ETL framework for building pipelines, and Flask based web API/UI for monitoring pipelines. Project structure fastlane |- fastlane: (ETL fr

2 Jan 6, 2022

ETL pipeline on movie data using Python and postgreSQL

Movies-ETL ETL pipeline on movie data using Python and postgreSQL Overview This project consisted on a automated Extraction, Transformation and Load p

0 Jul 7, 2021

PrimaryBid - Transform application Lifecycle Data and Design and ETL pipeline architecture for ingesting data from multiple sources to redshift

Transform application Lifecycle Data and Design and ETL pipeline architecture for ingesting data from multiple sources to redshift This project is composed of two parts: Part1 and Part2

1 Jan 19, 2022

Airflow ETL With EKS EFS Sagemaker

Airflow ETL With EKS EFS & Sagemaker (en desarrollo) Diagrama de la solución Imp

1 Feb 14, 2022

Using Data Science with Machine Learning techniques (ETL pipeline and ML pipeline) to classify received messages after disasters.

1 Feb 11, 2022

An ETL Pipeline of a large data set from a fictitious music streaming service named Sparkify.

An ETL Pipeline of a large data set from a fictitious music streaming service named Sparkify. The ETL process flows from AWS's S3 into staging tables in AWS Redshift.

1 Feb 11, 2022

Incubator for useful bioinformatics code, primarily in Python and R

Collection of useful code related to biological analysis. Much of this is discussed with examples at Blue collar bioinformatics. All code, images and

560 Jan 3, 2023

fds is a tool for Data Scientists made by DAGsHub to version control data and code at once.

Fast Data Science, AKA fds, is a CLI for Data Scientists to version control data and code at once, by conveniently wrapping git and dvc

359 Dec 22, 2022

Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code

Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code. Tuplex has similar Python APIs to Apache Spark or Dask, but rather than invoking the Python interpreter, Tuplex generates optimized LLVM bytecode for the given pipeline and input data set.

791 Jan 4, 2023

Sample code for Harry's Airflow online trainng course

Sample code for Harry's Airflow online trainng course You can find the videos on youtube or bilibili. I am working on adding below things: the slide p

102 Dec 30, 2022

Code for the DH project "Dhimmis & Muslims – Analysing Multireligious Spaces in the Medieval Muslim World"

Damast This repository contains code developed for the digital humanities project "Dhimmis & Muslims – Analysing Multireligious Spaces in the Medieval

University of Stuttgart Visualization Research Center

2 Jul 1, 2022

[CVPR2022] This repository contains code for the paper "Nested Collaborative Learning for Long-Tailed Visual Recognition", published at CVPR 2022

Nested Collaborative Learning for Long-Tailed Visual Recognition This repository is the official PyTorch implementation of the paper in CVPR 2022: Nes

65 Dec 9, 2022

Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

AWS Data Wrangler Pandas on AWS Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretMana

3.3k Jan 4, 2023

Pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

AWS Data Wrangler Pandas on AWS Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretMana

3.3k Dec 31, 2022

AWS Glue PySpark - Apache Hudi Quick Start Guide

AWS Glue PySpark - Apache Hudi Quick Start Guide Disclaimer: This is a quick start guide for the Apache Hudi Python Spark connector, running on AWS Gl

8 Nov 14, 2022