AWS Glue ETL Code Samples

Overview

This repository contains samples that demonstrate various aspects of the AWS Glue service, as well as various AWS Glue utilities.

You can find the AWS Glue open-source Python libraries in a separate repository at: awslabs/aws-glue-libs.

Content

  • FAQ and How-to

    Helps you get started using the many ETL capabilities of AWS Glue, and answers some of the more common questions people have.

Examples

You can run these sample job scripts on AWS Glue ETL jobs, in a container, or in a local environment.

  • Join and Relationalize Data in S3

    This sample ETL script shows you how to use AWS Glue to load, transform, and rewrite data in Amazon S3 so that it can be queried and analyzed easily and efficiently.

  • Clean and Process

    This sample ETL script shows you how to take advantage of both Spark and AWS Glue features to clean and transform data for efficient analysis.

  • The resolveChoice Method

    This sample explores all four of the ways you can resolve choice types in a dataset using DynamicFrame's resolveChoice method (a minimal usage sketch follows this list).

  • Converting character encoding

    This sample ETL script shows you how to use an AWS Glue job to convert character encoding.
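
As a quick illustration of the resolveChoice transform referenced above, here is a minimal sketch (not taken from the sample itself); the database, table, and column names are placeholders, and only the cast and make_cols actions are shown:

    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glueContext = GlueContext(SparkContext.getOrCreate())

    # Placeholder database/table names for a table created by a crawler
    dyf = glueContext.create_dynamic_frame.from_catalog(database="payments", table_name="medicare")

    # 'cast' forces an ambiguous column to a single type;
    # 'make_cols' splits it into one column per observed type.
    resolved = dyf.resolveChoice(specs=[
        ("provider id", "cast:long"),
        ("total discharges", "make_cols"),
    ])
    resolved.printSchema()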

Utilities

GlueCustomConnectors

AWS Glue provides built-in support for the most commonly used data stores, such as Amazon Redshift, MySQL, and MongoDB. With Glue custom connectors, you can subscribe to a third-party connector from AWS Marketplace or build your own connector to connect to data stores that are not natively supported.

  • Development

    Development guide with examples of connectors with simple, intermediate, and advanced functionalities. These examples demonstrate how to implement Glue Custom Connectors based on Spark Data Source or Amazon Athena Federated Query interfaces and plug them into Glue Spark runtime.

  • Local Validation Tests

    This user guide describes validation tests that you can run locally on your laptop to integrate your connector with Glue Spark runtime.

  • Validation

    This user guide shows how to validate connectors with the Glue Spark runtime in the Glue job system before deploying them for your workloads.

  • Glue Spark Script Examples

    Python script examples that use Spark, Amazon Athena, and JDBC connectors with the Glue Spark runtime (a minimal connector-usage sketch follows this list).

  • Create and Publish Glue Connector to AWS Marketplace

    If you would like to partner with us or publish your Glue custom connector on AWS Marketplace, please refer to this guide and reach out to us at [email protected] for further details on your connector.
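
For reference, here is a minimal sketch of reading data through a subscribed Marketplace connector in a Glue job script; the connection name is a placeholder, and the remaining connection options depend on the specific connector you use:

    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glueContext = GlueContext(SparkContext.getOrCreate())

    # "my-connector-connection" is a placeholder Glue connection created for the
    # subscribed connector; add any connector-specific options (table name, etc.)
    dyf = glueContext.create_dynamic_frame.from_options(
        connection_type="marketplace.spark",
        connection_options={"connectionName": "my-connector-connection"},
    )
    dyf.printSchema()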

License Summary

This sample code is made available under the MIT-0 license. See the LICENSE file.

Comments
  • Spark history server: README.md to show using AWS_PROFILE

    This worked for me:

    docker run -itd -v /home/username/.aws:/root/.aws -e AWS_PROFILE=profile -e SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.fs.logDirectory=s3a://path_to_eventlog -Dspark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain" .... 
    

    Instead of specifying the access key/secret key individually, this specifies the current AWS_PROFILE and mounts the local ~/.aws directory (including the credentials file) into the Docker container.

    opened by wrschneider 23
  • Spark_UI docker container doesn't run

    Following the steps at utilities/Spark_UI, when I run:

    docker run -itd -e SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.fs.logDirectory=s3a://path_to_my_eventlog_dir -Dspark.hadoop.fs.s3a.access.key=my_key_id -Dspark.hadoop.fs.s3a.secret.key=my_secret_access_key" -p 18080:18080 glue/sparkui:latest "/opt/spark/bin/spark-class org.apache.spark.deploy.history.HistoryServer"
    

    The container quickly craps out with:

    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    20/01/27 21:16:44 INFO HistoryServer: Started daemon with process name: 1@4b35fc26fd5a
    20/01/27 21:16:44 INFO SignalUtils: Registered signal handler for TERM
    20/01/27 21:16:44 INFO SignalUtils: Registered signal handler for HUP
    20/01/27 21:16:44 INFO SignalUtils: Registered signal handler for INT
    20/01/27 21:16:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    20/01/27 21:16:45 INFO SecurityManager: Changing view acls to: root
    20/01/27 21:16:45 INFO SecurityManager: Changing modify acls to: root
    20/01/27 21:16:45 INFO SecurityManager: Changing view acls groups to: 
    20/01/27 21:16:45 INFO SecurityManager: Changing modify acls groups to: 
    20/01/27 21:16:45 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
    20/01/27 21:16:45 INFO FsHistoryProvider: History server ui acls disabled; users with admin permissions: ; groups with admin permissions
    Exception in thread "main" java.lang.reflect.InvocationTargetException
    	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    	at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:280)
    	at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
    Caused by: java.lang.NoSuchMethodError: com.fasterxml.jackson.databind.ObjectMapper.enable([Lcom/fasterxml/jackson/core/JsonParser$Feature;)Lcom/fasterxml/jackson/databind/ObjectMapper;
    	at com.amazonaws.partitions.PartitionsLoader.<clinit>(PartitionsLoader.java:54)
    	at com.amazonaws.regions.RegionMetadataFactory.create(RegionMetadataFactory.java:30)
    	at com.amazonaws.regions.RegionUtils.initialize(RegionUtils.java:65)
    	at com.amazonaws.regions.RegionUtils.getRegionMetadata(RegionUtils.java:53)
    	at com.amazonaws.regions.RegionUtils.getRegion(RegionUtils.java:107)
    	at com.amazonaws.services.s3.AmazonS3Client.createSigner(AmazonS3Client.java:4040)
    	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5039)
    	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4998)
    	at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1413)
    	at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1349)
    	at org.apache.hadoop.fs.s3a.S3AFileSystem.verifyBucketExists(S3AFileSystem.java:276)
    	at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:236)
    	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2812)
    	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
    	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849)
    	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831)
    	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
    	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
    	at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:117)
    	at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:86)
    	... 6 more
    
    opened by woodchuck 17
  • unclear instructions

    In this example, https://github.com/awslabs/aws-glue-samples/blob/master/examples/join_and_relationalize.md, step "1. Crawl our sample dataset": can you specify whether the user is supposed to copy the dataset from your sample bucket to their own bucket before proceeding to crawl, or just use your bucket as the source? The latter doesn't work; I got "[626034be-45ee-4e8d-ae26-e985e8131ff9] ERROR : Error Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: F24B84B451EAEC7A) retrieving file at s3://awsglue-datasets/examples/us-legislators/house/persons.json. Tables created did not infer schemas from this file."

    Also, there's a typo in the S3 path (there is a dot at the end): "s3://awsglue-datasets/examples/us-legislators."
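
    A minimal sketch of copying the sample dataset into your own bucket before crawling; the destination bucket name is a placeholder, and it assumes your credentials can read the public sample bucket:

    import boto3

    src_bucket = "awsglue-datasets"
    src_prefix = "examples/us-legislators/"
    dst_bucket = "your-bucket"  # placeholder: your own bucket

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=src_bucket, Prefix=src_prefix):
        for obj in page.get("Contents", []):
            # Copy each sample object into your bucket, preserving the key layout
            s3.copy({"Bucket": src_bucket, "Key": obj["Key"]}, dst_bucket, obj["Key"])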

    opened by angelarw 15
  • local execution of aws glue

    Trying to run AWS Glue locally with AWSGlue.zip throws the following error:

    
    ~/opt/spark-2.2.0-bin-hadoop2.7/bin/pyspark
    Python 2.7.15 |Anaconda, Inc.| (default, Dec 14 2018, 13:10:39) 
    [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    19/02/18 18:36:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    19/02/18 18:36:23 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 2.2.0
          /_/
    
    Using Python version 2.7.15 (default, Dec 14 2018 13:10:39)
    SparkSession available as 'spark'.
    >>> from awsglue.dynamicframe import DynamicFrameWriter
    >>> glueContext = GlueContext(sc)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    NameError: name 'GlueContext' is not defined
    >>> from awsglue.context import GlueContext
    >>> glueContext = GlueContext(sc)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "awsglue/context.py", line 44, in __init__
        self._glue_scala_context = self._get_glue_scala_context(**options)
      File "awsglue/context.py", line 64, in _get_glue_scala_context
        return self._jvm.GlueContext(self._jsc.sc())
    TypeError: 'JavaPackage' object is not callable
    
    opened by ghost 11
  • How to keep the partition structure of the folder after ETL?

    I have original files in S3 with a folder structure like:

    /data/year=2017/month=1/day=1/origin_files
    /data/year=2017/month=1/day=2/origin_files

    I use a Glue crawler to create a partitioned table, data, as the source of my Glue jobs. Currently, after I use a Glue job to convert the files to ORC, I get:

    /data_orc/transformed_data_files.orc

    Is it possible to keep the same partition structure after the transform job? For example:

    /data_orc/year=2017/month=1/day=1/transformed_data_files.orc
    /data_orc/year=2017/month=1/day=2/transformed_data_files.orc

    It doesn't have to be a file-to-file match, but I hope the output data can keep the same partitioned folder structure.
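
    A minimal sketch of one way to keep the partition structure when writing from a Glue job, assuming the crawler exposed year/month/day as partition columns; the database, table, and S3 path below are placeholders:

    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glueContext = GlueContext(SparkContext.getOrCreate())

    # Read the crawled, partitioned source table (placeholder names)
    dyf = glueContext.create_dynamic_frame.from_catalog(database="mydb", table_name="data")

    # partitionKeys makes the S3 sink write year=/month=/day= folders again
    glueContext.write_dynamic_frame.from_options(
        frame=dyf,
        connection_type="s3",
        connection_options={"path": "s3://my-bucket/data_orc/", "partitionKeys": ["year", "month", "day"]},
        format="orc",
    )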

    opened by veryveryfat 10
  • Spark-UI docker container : setting aws security credentials throws

    After building the docker image and attempting to start it (env vars are exported):

    docker run -it -e SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=$LOG_DIR -Dspark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID -Dspark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY -Dfs.s3n.awsAccessKeyId=$AWS_ACCESS_KEY_ID -Dfs.s3n.awsSecretAccessKey=$AWS_SECRET_ACCESS_KEY" -p 18080:18080 glue/sparkui:latest "/opt/spark/bin/spark-class org.apache.spark.deploy.history.HistoryServer"
    

    I get:

    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    20/10/15 11:02:47 INFO HistoryServer: Started daemon with process name: 1@f3fe1ed456cf
    20/10/15 11:02:47 INFO SignalUtils: Registered signal handler for TERM
    20/10/15 11:02:47 INFO SignalUtils: Registered signal handler for HUP
    20/10/15 11:02:47 INFO SignalUtils: Registered signal handler for INT
    20/10/15 11:02:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    20/10/15 11:02:47 INFO SecurityManager: Changing view acls to: root
    20/10/15 11:02:47 INFO SecurityManager: Changing modify acls to: root
    20/10/15 11:02:47 INFO SecurityManager: Changing view acls groups to: 
    20/10/15 11:02:47 INFO SecurityManager: Changing modify acls groups to: 
    20/10/15 11:02:47 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
    20/10/15 11:02:48 INFO FsHistoryProvider: History server ui acls disabled; users with admin permissions: ; groups with admin permissions
    20/10/15 11:02:48 WARN FileSystem: S3FileSystem is deprecated and will be removed in future releases. Use NativeS3FileSystem or S3AFileSystem instead.
    Exception in thread "main" java.lang.reflect.InvocationTargetException
    	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    	at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:280)
    	at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
    Caused by: java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified by setting the fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey properties (respectively).
    	at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:74)
    	at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.initialize(Jets3tFileSystemStore.java:94)
    	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    	at java.lang.reflect.Method.invoke(Method.java:498)
    	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409)
    	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)
    	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)
    	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
    	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346)
    	at com.sun.proxy.$Proxy5.initialize(Unknown Source)
    	at org.apache.hadoop.fs.s3.S3FileSystem.initialize(S3FileSystem.java:111)
    	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2812)
    	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:100)
    	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2849)
    	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2831)
    	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
    	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
    	at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:117)
    	at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:86)
    	... 6 more
    
    
    opened by fbielejec 6
  • Issue With Direct Migration from Hive Metastore to Glue Data Catalog

    Hello, I hope you are doing well. At my company, we are in the process of adopting Athena for our ad hoc data analytics tasks and have configured Athena to use the Glue Data Catalog as its metastore. However, the source of truth for our data is a Hive metastore hosted on an AWS RDS MySQL instance, and we want the Glue Data Catalog to stay in sync with our Hive metastore. For this, we followed the instructions outlined in awslabs' aws-glue-samples repo (https://github.com/awslabs/aws-glue-samples/tree/master/utilities/Hive_metastore_migration) to set up a Glue job that we will run (eventually periodically) to directly migrate the Hive metastore to the Glue Data Catalog.

    However, we are running into a situation where the job finishes successfully but nothing is migrated into our Glue Data Catalog. I am wondering if others have experienced this issue and whether it was resolved. I do see a similar issue here (I even replied to that thread): https://github.com/awslabs/aws-glue-samples/issues/6, but it is closed and I don't see a solution posted in it.

    Any help is greatly appreciated.

    Thanks, Kshitij

    opened by kjudahlookout 6
  • AWS Glue job is failing for larger csv data on s3

    For small S3 input files (~10GB), the Glue ETL job works fine, but for a larger dataset (~200GB) the job fails.

    Adding part of the ETL code:

    # Converting the DynamicFrame to a DataFrame
    df = dropnullfields3.toDF()
    # Create a new partition column
    partitioned_dataframe = df.withColumn('part_date', df['timestamp_utc'].cast('date'))
    # Store the data in Parquet format on S3
    partitioned_dataframe.write.partitionBy(['part_date']).format("parquet").save(output_lg_partitioned_dir, mode="append")

    The job executed for 4 hours and then threw an error:

    File "script_2017-11-23-15-07-32.py", line 49, in partitioned_dataframe.write.partitionBy(['part_date']).format("parquet").save(output_lg_partitioned_dir, mode="append") File "/mnt/yarn/usercache/root/appcache/application_1511449472652_0001/container_1511449472652_0001_02_000001/pyspark.zip/pyspark/sql/readwriter.py", line 550, in save File "/mnt/yarn/usercache/root/appcache/application_1511449472652_0001/container_1511449472652_0001_02_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in call File "/mnt/yarn/usercache/root/appcache/application_1511449472652_0001/container_1511449472652_0001_02_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco File "/mnt/yarn/usercache/root/appcache/application_1511449472652_0001/container_1511449472652_0001_02_000001/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o172.save. : org.apache.spark.SparkException: Job aborted. at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:147) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:121) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:101) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87) at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:492) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:198) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:280) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:748) Caused by: 
org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 3385 tasks (1024.1 MB) is bigger than spark.driver.maxResultSize (1024.0 MB) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:127) ... 30 more

    End of LogType:stdout
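
    The trace points at spark.driver.maxResultSize being exceeded. A minimal sketch of raising that limit from within the job script, assuming the SparkContext has not already been created when the script runs; the 4g value is only an example:

    from pyspark.conf import SparkConf
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    # Raise the driver result-size limit before the SparkContext is created
    conf = SparkConf().set("spark.driver.maxResultSize", "4g")
    sc = SparkContext.getOrCreate(conf=conf)
    glueContext = GlueContext(sc)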

    opened by sumit-saurabh 6
  • What version of the AWS SDK are you using?

    I've tried the 1.11.297 and 2.0 preview versions, and they have a very different API from what your samples show. For example, these versions don't have a GlueContext under com.amazonaws.services.glue.

    opened by james-neutron 5
  • Migrate Directly from Hive to AWS Glue | No tables created

    hive_databases_S3.txt hive_tables_S3.txt

    I am trying to migrate directly from Hive to AWS Glue. I created a Glue job with a Hive connection, tested the connection, and it connected successfully. I basically followed all the steps and everything succeeded, but in the end I can't see any tables in the AWS Glue catalog. There are no error logs for the job, and the normal logs show the run status as succeeded.

    I also tried "Migrate from Hive to AWS Glue using Amazon S3 Objects". That too was successful, but no tables were created in the Glue catalog. I could find the metastore exported from Hive in the S3 buckets (files attached).

    Now I am thinking of running this code from my local Eclipse to debug it. Can you please tell me how to debug it from local Eclipse or in the Glue console?

    opened by addictedanshul 5
  • Spark-UI docker container startup issue

    Hi,

    I followed the steps in the README.md but am unable to start the Spark UI locally; below is the error trace:

    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    22/02/22 16:14:17 INFO HistoryServer: Started daemon with process name: 1@d4be68f80dce
    22/02/22 16:14:17 INFO SignalUtils: Registering signal handler for TERM
    22/02/22 16:14:17 INFO SignalUtils: Registering signal handler for HUP
    22/02/22 16:14:17 INFO SignalUtils: Registering signal handler for INT
    22/02/22 16:14:17 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    22/02/22 16:14:17 INFO SecurityManager: Changing view acls to: root
    22/02/22 16:14:17 INFO SecurityManager: Changing modify acls to: root
    22/02/22 16:14:17 INFO SecurityManager: Changing view acls groups to: 
    22/02/22 16:14:17 INFO SecurityManager: Changing modify acls groups to: 
    22/02/22 16:14:17 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
    22/02/22 16:14:17 INFO FsHistoryProvider: History server ui acls disabled; users with admin permissions: ; groups with admin permissions: 
    22/02/22 16:14:17 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
    22/02/22 16:14:17 INFO MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
    22/02/22 16:14:17 INFO MetricsSystemImpl: s3a-file-system metrics system started
    Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:300)
    at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
    Caused by: java.lang.NoSuchFieldError: SERVICE_ID
    at com.amazonaws.services.s3.AmazonS3Client.createRequest(AmazonS3Client.java:4772)
    at com.amazonaws.services.s3.AmazonS3Client.createRequest(AmazonS3Client.java:4758)
    at com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1434)
    at com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1374)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$verifyBucketExists$1(S3AFileSystem.java:381)
    
    opened by RajaShyam 4
  • Request to Host Glue Spark UI Images on DockerHub

    Currently I need to write a bunch of code to automatically download and build this Docker image because I can't pull it directly from DockerHub:

    echo "glue/sparkui image not found, building image..."
    TEMPD=$(mktemp -d -t "glue-sparkui-XXXX")
    if [ ! -e "$TEMPD" ]; then
        echo >&2 "Failed to create temp directory"
        exit 1
    fi
    cd "$TEMPD"
    git clone https://github.com/aws-samples/aws-glue-samples.git
    (cd aws-glue-samples/utilities/Spark_UI/glue-1_0-2_0 && docker build -t glue/sparkui:2.0 .)
    (cd aws-glue-samples/utilities/Spark_UI/glue-3_0 && docker build -t glue/sparkui:3.0 .)
    (cd aws-glue-samples/utilities/Spark_UI/glue-4_0 && docker build -t glue/sparkui:4.0 .)
    cd -
    rm -rf "$TEMPD"
    

    Can these Docker images be published directly to DockerHub so I can simply docker pull them and start using them immediately?

    I see that the aws-glue-libs are already hosted there: https://hub.docker.com/r/amazon/aws-glue-libs

    Would be great to host the Spark UI as well!

    opened by AlJohri 0
  • Spark UI Glue 4.0 Logging Not Working?

    I get this message and no further logs when I run the Glue 4.0 Spark UI container:

    log4j:WARN No appenders could be found for logger (org.apache.spark.deploy.history.HistoryServer).
    log4j:WARN Please initialize the log4j system properly.
    log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
    
    opened by AlJohri 0
  • Bump jackson-databind from 2.13.3 to 2.13.4.1 in /utilities/Spark_UI/glue-4_0

    Bumps jackson-databind from 2.13.3 to 2.13.4.1.

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
  • Bump hadoop-common from 3.3.2 to 3.3.3 in /utilities/Spark_UI/glue-4_0

    Bumps hadoop-common from 3.3.2 to 3.3.3.

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
  • Bump jackson-databind from 2.10.5.1 to 2.12.7.1 in /utilities/Spark_UI/glue-1_0-2_0

    Bumps jackson-databind from 2.10.5.1 to 2.12.7.1.

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0