We would like to modify the SPEC in order to provide the following additional features:
- Schema Discovery - Allow a Tap to indicate what the Schema will be without actually streaming data
- Stream and Field Selection - Allow the user to select a subset of the available streams and fields
- Stream Renaming - Allow the user to provide names for streams in order to resolve name collisions
We'll refer to these features together as Schema Editing.
Motivation
Stream and Field Selection
Suppose we had a generic Tap that pulled data from a Postgres database. It could potentially produce one stream for every table in every schema, including every field in each table. But it's likely that users would want to only include a subset of the tables and a subset of the fields within each selected table. A Postgres Tap would not have a statically defined schema. It would need to connect to the database in order to find the available schemas, tables, and fields. So we need some way to allow the tap to query the database for the available schemas, tables, and fields, then allow the user to select which ones they want, then run another job taking the user's selections as input.
Stream Renaming
Currently the RECORD and SCHEMA messages use a "stream" field to identify a stream. For a data source that contains a hierarchy of namespaces, such as Postgres, it's not clear how a Tap should derive the name of the stream. For example, suppose we have a Postgres database called "prod", with a schema called "public", with a table called "users". If we were to call the stream "users", that might conflict with a table named "users" in some other schema. In order to disambiguate, we may want to allow the user to provide a mapping from the data source's notion of a stream or table to the Tap's stream name.
Note that Schema Discovery is required by both the Stream and Field Selection and Stream Renaming features.
Proposed Solution
Extend the specification as follows.
Add Discover Mode
Taps should allow an optional --discover command line flag. If --discover is provided, the Tap should print out a SCHEMA message for every stream that is available to it.
It is expected that the user will edit the discovered schemas through some interface in order to delete schemas for streams they don't want, or delete specific fields they don't want. Then they can pass the resulting pruned schemas back in via a --schemas option.
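As a sketch of how a Tap might implement this flag; the discover_streams helper and its return value are hypothetical, standing in for a real source inspection such as a Postgres catalog query:

```python
# Sketch of discover mode: emit one SCHEMA message per available stream
# and exit without emitting any RECORD messages.
import json
import sys


def discover_streams():
    """Hypothetical: query the source for its streams and fields."""
    return [
        {
            "stream": "public_users",
            "source": {"schema": "public", "table": "users"},
            "key_properties": ["id"],
            "schema": {
                "type": "object",
                "properties": {"id": {"type": "integer"}},
            },
        }
    ]


def do_discover():
    # Print a SCHEMA message for every stream available to the Tap.
    for stream in discover_streams():
        message = {"type": "SCHEMA", **stream}
        json.dump(message, sys.stdout)
        sys.stdout.write("\n")


if __name__ == "__main__":
    if "--discover" in sys.argv:
        do_discover()
```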
Add Schema Selection
A Tap should allow an optional --schemas SCHEMAS argument that points to a file containing the list of schemas describing the desired output. It is expected that the schemas provided will be a pruned version of the schemas produced by a previous run of the same Tap in discover mode. The Tap should attempt to produce output that conforms to the schemas provided with the --schemas option. If no --schemas option is provided, the Tap should fetch all fields of all streams available.
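A minimal sketch of how a Tap might honor a pruned schemas file, assuming the file holds one SCHEMA message per line; the helper names are illustrative and not part of the spec:

```python
# Sketch: load user-edited schemas and drop any record fields the user
# deleted from the pruned schema.
import json


def load_selected(path):
    """Map each stream name to its user-edited SCHEMA message.

    Keyed by stream name here for simplicity; a real Tap would likely
    key on the "source" field to survive stream renaming.
    """
    with open(path) as f:
        return {
            msg["stream"]: msg
            for msg in (json.loads(line) for line in f if line.strip())
        }


def select_record(record, schema_msg):
    """Keep only the properties the user left in the pruned schema."""
    allowed = schema_msg["schema"]["properties"]
    return {k: v for k, v in record.items() if k in allowed}
```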
Add "source" to SCHEMA message
Extend the SCHEMA message to add a "source" field, the structure of which is determined entirely by the Tap. The "source" field identifies the source of the stream. For a Tap that pulls from a database source, this could be something like
"source": {
"schema": "public",
"table": "users"
}
For a Tap that pulls from an API, it could be
"source": {
"endpoint": "users"
}
If the user wants to rename a stream, they can provide a --schemas argument that provides a new value for the "stream" field for the same source.
Example
Suppose we have a Postgres Tap, with a configuration that points to a database that has the following schema / table / field structure:
- public
- users
- orders
- id
- user_id
- amount
- credit_card_number
Suppose the Postgres tap normally names the stream "<schema>_<table>", so the stream names would be "public_users" and "public_orders".
So if we ran the Tap in discover mode:
$ tap_postgres --config config.json --discover > schemas.json
we would get the following output
{
"type": "SCHEMA",
"stream": "public_users",
"source": {"schema": "public", "table": "users"},
"key_properties": ["id"],
"schema": {
"type": "object",
"properties": {
"id": {"type": "integer"},
"first_name": {"type": "string"},
"last_name": {"type": "string"},
}
}
}
{
"type": "SCHEMA",
"stream": "public_orders",
"source": {"schema": "public", "table": "orders"},
"key_properties": ["id"],
"schema": {
"type": "object",
"properties": {
"id": {"type": "integer"},
"user_id": {"type": "integer"},
"amount": {"type": "number"},
"credit_card_number": {"type": "string"}
}
}
}
Now let's assume the user wants to make the following changes to the schema:
- Remove the public_ prefix from the stream names
- Get rid of the users table
- Get rid of the credit card field
The user could make those changes by deleting the schema message for the users table, deleting the schema property for the "credit_card_number" field, and changing the stream name for the orders table:
{
"type": "SCHEMA",
"stream": "orders",
"soruce": {"schema": "public",
"table": "orders"},
"key_properties": ["id"],
"schema": {
"type": "object",
"properties": {
"id": {"type": "integer"},
"user_id": {"type": "integer"},
"amount": {"type": "number"},
}
}
}
So now the user would run the Tap again, specifying the edited schema file as input:
$ tap_postgres --config config.json --schemas schemas_edited.json
Concerns
- Can we come up with a better name for "discover mode"?
- How do we keep this from overly complicating taps that don't need schema selection?
The schema editing adds a lot of complexity. For Taps that can provide very large sets of streams and fields, this is necessary. But what about a Tap with a small static schema, that doesn't need to support schema selection? In particular:
- Can a Tap choose not to support schema selection? If so,
- What should a Tap that doesn't support schema selection do if I call it with the --discover flag? Print out the schema and exit 0? Exit non-0?
- What should a Tap that doesn't support schema selection do if I call it with a --schemas SCHEMAS option? Ignore it, or fail?
- What if a Tap that does support schema selection is invoked with a --schemas SCHEMAS option where the schemas provided do not match the schema that's available to it?
Given the complexity introduced by these changes, I'm inclined to say that we should make Stream and Field Selection and Stream Renaming optional parts of the spec, and say that a Tap that does not want to support these features should fail hard if it is invoked with a --discover or --schemas option.
opened by mdelaurentis 7
I can only think of two real options for specifying which fields should be used as the primary key:
- Put a "key": true property on each of the key fields in the SCHEMA message. This is what we have right now, in that the target looks for that property and will set the key fields based on that.
- Put a top-level "key_names" property in the message. For example:
{"type": "SCHEMA",
"stream": "users",
"schema": { "..." },
"key_names": ["customer_id", "email"]}
The advantage of the first option is that we avoid cluttering up the top level of the message with additional properties, which I think is desirable. The advantage of the second option is that it would make the choice of primary key columns more explicit and obvious. If we went with option 1, it would be easy for a tap author to simply forget to mark some fields as keys; we have already forgotten to do that with the last four taps we've done. With option 2, we could require a "key_names" field, even if it points to an empty list.
I would vote for option 2, because I would prefer to be as explicit as possible about what the key fields are.
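For concreteness, here is a sketch of how a target might extract the key fields under each option; the function names are illustrative:

```python
# Sketch contrasting the two proposed ways a target could find the
# primary key fields in a SCHEMA message.


def keys_option_1(schema_msg):
    """Option 1: look for "key": true on individual properties."""
    props = schema_msg["schema"]["properties"]
    return [name for name, prop in props.items() if prop.get("key")]


def keys_option_2(schema_msg):
    """Option 2: read an explicit top-level "key_names" list.

    Requiring the field (even as an empty list) turns a forgotten
    key declaration into an immediate, visible error.
    """
    return schema_msg["key_names"]
```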
question
opened by mdelaurentis 7
Updated Schema, Record, and State links in Developing a Target section of Running and Developing Singer Taps and Targets documentation page.
Closes #70
opened by alexvaldez-edge 3
As a best practice, a Tap or Target should provide a Dry Run mode, where it just verifies that it can connect to its data source or destination using the configuration provided.
Motivation
It would be helpful if a Tap could give the user quick feedback as to whether it can connect to the data source using the configuration provided. Currently if you run a tap with invalid configuration, it will exit with a non-zero status rather quickly. But if you run it with valid credentials, it will start streaming data. Users may find it desirable to have a mode of operation where the Tap just makes a quick attempt to connect to the data source and then exits zero or non-zero to indicate success or failure.
Proposed Solution
A Tap or Target should support a -n and --dry-run option. This option indicates that the Tap or Target should just attempt to connect to the data source or destination with the configuration provided. If it can connect, exit 0. If it can't, exit non-zero with a useful error message.
question wontfix
opened by mdelaurentis 3
I've been adding experimental metrics logging to singer-python and a few taps. I'd like to add a best practice recommendation for logging metrics from Taps. I would really appreciate feedback on this best practice recommendation and the singer-python changes. Below are links to diffs for a few taps that use the new singer-python stats utilities, so you can see how the implementation would work in practice.
- Updated Best Practices guide: https://github.com/singer-io/getting-started/blob/metrics/BEST_PRACTICES.md (look at that rather than the diff)
- singer-python - https://github.com/singer-io/singer-python/pull/19
- tap-shippo - https://github.com/singer-io/tap-shippo/pull/3/files
- tap-facebook - https://github.com/singer-io/tap-facebook/pull/3/files
- tap-closeio - https://github.com/singer-io/tap-closeio/pull/6/files
I'm particularly interested in feedback in the following areas:
- What terminology should we use? "stats"? "metrics"?
- Are the field names clear?
- For the singer-python change, is the distinction between a Counter and a Timer clear enough?
opened by mdelaurentis 2
{"type": "RECORD", "stream": "stream": "users", "record": {"id": 2, "name": "Mike"}}
Stream stream
opened by criccomini 2
The table_name property should be table, as per this source code: https://github.com/singer-io/singer-python/blob/0c066de21111d8572425083b4a8792d193c80af1/singer/catalog.py#L21
cla-missing
opened by burmecia 1
This is a part of a lot of Singer taps, and if done right it will make handling child streams a lot easier. We've wanted to include something like this for a while, so I'm putting up this PR to start talking about how best to convey this guidance.
opened by dmosorast 1
Here's the error I saw:
Either git or ssh (required by git to clone through SSH) is not installed
in the image. Falling back to CircleCI's native git client but the
behavior may be different from official git. If this is an issue, please
use an image that has official git and ssh installed.
ssh: no key found
The first section appeared on the last successful run as well.
I rebuilt the container with SSH, connected, and got
root@400194e20721:~# git
-bash: git: command not found
So this PR will install git and then check out the code.
opened by luandy64 1
For someone coming to Singer for the first time, a lot of time can be wasted wondering why a stream is skipped when running a tap. It appears a stream is only selected to be synced after the selected property is added to its metadata.
Problem
The getting started documentation should state that a stream is only processed if selected is true.
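For example, a minimal catalog entry marking a stream as selected might look like this (the stream name and schema are illustrative):

```json
{
  "streams": [
    {
      "tap_stream_id": "users",
      "stream": "users",
      "schema": {
        "type": "object",
        "properties": {"id": {"type": "integer"}}
      },
      "metadata": [
        {"breadcrumb": [], "metadata": {"selected": true}}
      ]
    }
  ]
}
```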
opened by Jagjit-Thind 1
Hey there!
I'm developing a tap for BigCommerce here https://github.com/chrisgoddard/tap-bigcommerce
What is the process for getting new taps integrated in Stitch? This is my first one but I have 3-4 other platforms that I want to write taps for and ideally have available as Stitch connectors.
Thanks.
-Chris
opened by chrisgoddard 1
Hello,
We have a use case where we want to run our tap and target continuously rather than stopping after one run. That is, the tap runs, takes some data, and passes it to the target for delivery to the destination. We don't want to stop there: the tap should be executed again so that it can pick up new data.
Is this possible, and how can we achieve it?
opened by shubhransh-locale 0
I have a field called country that contains a list of country values. How do I insert these using Singer?
{"country":["india","usa","japan","china"]}
This is the format of the data. I want to send it to a Postgres table under a column named country. What should the schema look like?
opened by santhoshvempali 0
Hi Community,
We are using the SFMC integration to sync data from SFMC to BigQuery. Intermittent errors happen quite often. Sometimes they last for over a day, sometimes for a couple of hours, and they can recover later on without any work from our side. Could you help us understand
- the root cause of such errors
- whether the integration needs to end up omitting some data in order to recover from such errors (or what changes for the sync to move on without errors)?
Thanks!
One error log example attached below
```
2021-05-27 11:55:57,858Z target - INFO replicated 6805 records from "data_extension.PROD_Businesses" endpoint
2021-05-27 11:55:58,008Z tap - INFO Getting more results from 'DataExtensionObject' endpoint
2021-05-27 11:55:58,555Z tap - ERROR <suds.sax.document.Document object at 0x7f6e142ad470>
2021-05-27 11:55:58,555Z tap - ERROR :2:62: syntax error
2021-05-27 11:55:58,555Z tap - Traceback (most recent call last):
2021-05-27 11:55:58,555Z tap - File "/root/.pyenv/versions/3.5.2/lib/python3.5/xml/sax/expatreader.py", line 210, in feed
2021-05-27 11:55:58,555Z tap - self._parser.Parse(data, isFinal)
2021-05-27 11:55:58,555Z tap - xml.parsers.expat.ExpatError: syntax error: line 2, column 62
2021-05-27 11:55:58,555Z tap -
2021-05-27 11:55:58,555Z tap - During handling of the above exception, another exception occurred:
2021-05-27 11:55:58,555Z tap -
2021-05-27 11:55:58,556Z tap - Traceback (most recent call last):
2021-05-27 11:55:58,556Z tap - File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_exacttarget/__init__.py", line 136, in do_sync
2021-05-27 11:55:58,556Z tap - stream_accessor.sync()
2021-05-27 11:55:58,556Z tap - File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_exacttarget/dao.py", line 74, in sync
2021-05-27 11:55:58,556Z tap - return self.sync_data()
2021-05-27 11:55:58,556Z tap - File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_exacttarget/endpoints/data_extensions.py", line 283, in sync_data
2021-05-27 11:55:58,556Z tap - replication_key=replication_key)
2021-05-27 11:55:58,556Z tap - File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_exacttarget/endpoints/data_extensions.py", line 209, in _replicate
2021-05-27 11:55:58,556Z tap - for row in result:
2021-05-27 11:55:58,556Z tap - File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_exacttarget/client.py", line 153, in request_from_cursor
2021-05-27 11:55:58,556Z tap - response = tap_exacttarget__getMoreResults(cursor, batch_size=batch_size)
2021-05-27 11:55:58,556Z tap - File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_exacttarget/fuel_overrides.py", line 32, in tap_exacttarget__getMoreResults
2021-05-27 11:55:58,556Z tap - obj = TapExacttarget__ET_Continue(cursor.auth_stub, cursor.last_request_id, batch_size)
2021-05-27 11:55:58,556Z target - INFO Serializing batch with 1518 messages for table data_extension.PROD_Businesses
2021-05-27 11:55:58,557Z tap - File "/code/orchestrator/tap-env/lib/python3.5/site-packages/tap_exacttarget/fuel_overrides.py", line 26, in __init__
2021-05-27 11:55:58,557Z tap - response = auth_stub.soap_client.service.Retrieve(ws_continueRequest)
2021-05-27 11:55:58,557Z tap - File "/code/orchestrator/tap-env/lib/python3.5/site-packages/suds/client.py", line 521, in __call__
2021-05-27 11:55:58,557Z tap - return client.invoke(args, kwargs)
2021-05-27 11:55:58,557Z tap - File "/code/orchestrator/tap-env/lib/python3.5/site-packages/suds/client.py", line 581, in invoke
2021-05-27 11:55:58,557Z tap - result = self.send(soapenv)
2021-05-27 11:55:58,557Z tap - File "/code/orchestrator/tap-env/lib/python3.5/site-packages/suds/client.py", line 621, in send
2021-05-27 11:55:58,557Z tap - original_soapenv=original_soapenv)
2021-05-27 11:55:58,557Z tap - File "/code/orchestrator/tap-env/lib/python3.5/site-packages/suds/client.py", line 661, in process_reply
2021-05-27 11:55:58,557Z tap - replyroot = _parse(reply)
2021-05-27 11:55:58,557Z tap - File "/code/orchestrator/tap-env/lib/python3.5/site-packages/suds/client.py", line 832, in _parse
2021-05-27 11:55:58,557Z tap - return Parser().parse(string=string)
2021-05-27 11:55:58,558Z tap - File "/code/orchestrator/tap-env/lib/python3.5/site-packages/suds/sax/parser.py", line 133, in parse
2021-05-27 11:55:58,558Z tap - sax.parse(source)
2021-05-27 11:55:58,558Z tap - File "/root/.pyenv/versions/3.5.2/lib/python3.5/xml/sax/expatreader.py", line 110, in parse
2021-05-27 11:55:58,558Z tap - xmlreader.IncrementalParser.parse(self, source)
2021-05-27 11:55:58,558Z tap - File "/root/.pyenv/versions/3.5.2/lib/python3.5/xml/sax/xmlreader.py", line 125, in parse
2021-05-27 11:55:58,558Z tap - self.feed(buffer)
2021-05-27 11:55:58,558Z tap - File "/root/.pyenv/versions/3.5.2/lib/python3.5/xml/sax/expatreader.py", line 214, in feed
2021-05-27 11:55:58,558Z tap - self._err_handler.fatalError(exc)
2021-05-27 11:55:58,558Z tap - File "/root/.pyenv/versions/3.5.2/lib/python3.5/xml/sax/handler.py", line 38, in fatalError
2021-05-27 11:55:58,558Z tap - raise exception
2021-05-27 11:55:58,558Z tap - xml.sax._exceptions.SAXParseException: :2:62: syntax error
2021-05-27 11:55:58,558Z tap - ERROR Failed to sync endpoint, moving on!
```
opened by FionaYiZhao 0
Hi,
We are planning to integrate Singer in our application, but right now we see that a Singer config can have only a single account in the configuration.
Is it possible to add multiple accounts to the configuration? Say we have 100 B2B companies that are using Stripe and we would like to fetch their data through Singer. Is this something that can be done through Singer?
Thanks.
opened by azhard4int 0