Microsoft SQL Server Integration Services or SSIS is a SQL Server component, that performs data integration, transformation, and migration tasks. It was first released in 2005 to replace Microsoft’s Data Transformation Services and since has become one of the most widely used ETL tools. For a lot of companies that already have SQL Server, SSIS is the natural choice for an ETL tool - it reduces the number of vendors needed, there is a good amount of talent with the required skills, and the technology itself is good at minimizing data redundancies and transformations.
In recent years, however, Microsoft has switched its focus to cloud technologies. Many projects are being migrated to Azure and companies are starting to look for SSIS alternatives. In this article, we explore twelve of these alternatives, broken down into four categories:
- Fully managed
- Managed data lakes
We try to provide an objective overview of each technology, along with the pros and cons of each one. We break everything down by the following criteria:
- Pricing models.
- Replication modes.
- Compatibility, as measured by the number of connectors and other relevant details.
- Ease of recruitment for that technology.
- Ideal use cases.
Self-Hosted ETL tools
Self-hosted ETL tools are the least expensive solution that gives you the most flexibility in terms of querying data. The biggest advantage is the full control you have over your data, which is extremely important when cybersecurity and data privacy are concerned. Using self-hosted tools might also be the right choice for you if already have an experienced data engineering team, as they reduce infrastructure costs and allow you to scale custom solutions. In this section, we provide some details on SSIS and introduce our open-source data movement tool SQLpipe.
SQL Server Integration Services (SISS)
SSIS is Microsoft’s ETL tool that comes as a component of SQL Server. It works with data in various formats, such as unstructured flat file data, semi-structured XML data, or fully structured relational data. SSIS also works with OLTP / OLAP databases, as well as data warehouses. It allows for the automation of many administrative tasks and the GUI makes it easy to work with without writing any code. Even though many consider SSIS to be a legacy product that is soon to be succeeded by Azure Data Factory, it is still widely used across different industries.
However, some users argue that SSIS should be used just to move data between layers with high variability of data types and formats and to orchestrate the solution, rather than using it to write custom code, advance transformations, and business logic.
SQLpipe + Airflow
SQLpipe is an open-source data engineering tool that allows you to move the results of a query between data systems. It uses ODBC to connect to data sources. In addition to moving data, you can run queries and download the results locally as a CSV. A popular use case is moving data from your application or other IT systems to a data warehouse for analytical processing needs. SQLpipe can also migrate from one data system to another (such as from a paid database like Oracle to a free one like PostgreSQL) and consolidate data to implement master data management.
SQLpipe’s main strength is its flexibility - it allows you to run any query you want to extract data. It can connect to any data system ODBC can connect to, and is easy to deploy in your own environment. SQLpipe is distributed as a Docker image without any other dependencies. It is suitable for situations where you cannot allow third parties to manipulate or have access to your data; when you want to extract data with a query or when you want to connect to a system that does not have a connector, but for which an ODBC driver exists.
On the downside, the provided Docker image only includes the most popular ODBC connectors and if you want to connect to another system, you would have to install the driver yourself. Moreover, if you want to load incrementally you would have to do it through Timestamp rather than Change Data Capture.
Apache Airflow is an open-source framework for creating data pipelines that allows you to programmatically author, schedule, and monitor workflows. Workflows in Airflow are written as DAGs (Directed Acyclic Graphs) in Python. Users can specify when a workflow should start and end, and how often it should run. Airflow also provides an interactive interface for monitoring workflow status. You can easily combine data orchestration/workflow management from Airflow with data movement functionality from SQLpipe for a completely open-source ETL solution.
Note: Here we also need to mention the ability to self-host the open-source tools Presto, Spark, and Airbyte. However, we’ve decided to focus on Airbyte as a managed cloud platform and cover Prest and Spark in more detail under the Managed data lakes section.
As part of Microsoft SQL Server, the cost of SSIS would depend on the actual implementation you opt for. SQLpipe and Airflow, on the other hand, are open source and can be used free of charge.
SSIS supports both full and incremental loads. A full-load design in SSIS needs fewer components dragged into the package, which makes it simpler. Incremental loads can be done through Change Data Capture, DateTime columns, and lookups.
SQLpipe also supports full and incremental loads, however, the incremental loading is done through Timestamp rather than CDC.
SSIS can extract data from a wide variety of sources such as Application Systems, CRMs, Relational Databases, Queue Systems, and Protocols. Data can be loaded into various data stores, such as flat files, process analytic objects, and databases. SSIS also allows you to write custom destinations.
SQLpipe can connect to anything that has an ODBC driver as a source, but only has a few target connections, namely PostgreSQL, MS SQL Server, Snowflake, MySQL, MariaDB, CSV, and S3. Being an open-source program, you are able to add your own targets as well.
Ease of Recruitment
Even though SSIS may be considered a legacy system, there is still a significant demand for SSIS developers. According to Enlyft’s data, there are over 47 000 companies using SSIS (most of which are in the US). SQLpipe, on the other hand, is a relatively new open-source project, and all known development is done by the original team.
Fully Managed ETL Tools
Fully managed platforms are the easiest ETL solutions to work with - assuming they fit your use case. They are suitable for teams that are just starting or who want to reduce the time and effort spent on developing and maintaining data pipelines. However, these solutions provide the least control in terms of resource usage and are not suitable for situations where you cannot provide control over your data to a third party. In this section, we focus on three popular fully managed ETL services: Fivetran, Hevo Data, and Airbyte.
Fivetran was founded in 2012 and since has become one of the well-established ETL service providers. It offers a fully managed data pipeline as a service and sits in the ingestion layer of the data stack. Fivetran requires no configuration or coding, with its main focus being data ingestion and ETL. Over the years the company has built a large library of proprietary connectors (over 190 as of October 2022) many of which come with pre-defined schemas that provide a reliable way for the data to land in the destination warehouse.
Fivetran claims to be a no-code solution that requires almost no maintenance. It is good at replicating complicated sources like Salesforce and is a time-saver for pulling data from SaaS applications into a cloud destination. While the ease of setup and low maintenance are attractive, Fivetran can get quite pricey at higher usage levels. You can add your custom connectors, but only the ones developed by Fivetran will be managed by them, so it’s up to you to maintain your own connectors.
On the downside, Fivetran has been struggling to create an S3 destination for over 2 years now. The last update on the support ticket related to the issue was from September 2022, in which they stated they were still sorting out technical issues around it. Users have also reported that Fivetran is fragile and can easily break when it comes to replicating data from transactional databases, resulting in long and slow resync periods. Moreover, the cost of using Fivetran can get high if you are using it for database replication or as your company scales (more about that in the Pricing section).
Hevo Data is another no-code, end-to-end data pipeline platform. It offers real-time replication to destinations that include all major data warehouses and databases. Some of its features include data deduplication, reverse ETL, and custom objects.
Hevo seems to emphasize the robustness of their connectors, rather than their count, as well as the reverse ETL functionality. They offer 24/7 customer assistance and no-code GUI that makes setting up a new connector easy.
On the other hand, Hevo does not allow for existing connectors to be customized, or new custom ones to be built. Integration with a custom, in-house data infrastructure (e.g. Kubernetes, Airflow) is not supported either. Moreover, some users report that Hevo can be tricky in the beginning and has high CPU usage.
Airbyte is one of the fastest-growing ETL platform providers. It is an open-source data integration tool that you can self-host if you want. However, we’ve included it in this section, as Airbyte the company seems to push you towards using their managed service.
As part of their managed service, Airbyte provides a built-in scheduler, with no limitations on sync frequency, while, for instance, Fivetran makes the frequency dependent on your price tier. Data orchestration with Airbyte is not restricted either. They also emphasize that their pricing model can help companies scale without worrying about cost, unlike the approaches adopted by some competitors like Fivetran and Hevo. Airbyte can be easy to configure from the UI, which allows you to add your own custom connectors. A CLI was also recently added as well.
Unfortunately, there are some online discussions around its product's reliability. Some users do not consider the product to be production-ready, which can be seen in the number of open issues on Airbyte’s public issue log on GitHub (3000 issues as of October 2022). Others report that it can be really slow when it comes to CDC replication of SQL databases. Moreover, even if you are self-hosting Airbyte, you develop data pipelines through its UI. This can make it hard to do version control or validate junior or new teammates' work.
Fivetran’s pricing is consumption-based. You pay per monthly active rows (MAR) across all connectors and destinations, and there is no price difference between connectors. For example, if you move the same 1000 records that change every 5 minutes for a month, that would be 1000 monthly active rows. If you move 1000 different new records every 5 minutes, that would be approximately 8.6 million active rows. In short, if you have lots of updates, you (generally) don’t have a lot of MAR, but if you have to deal with a ton of new data, then your MAR would increase significantly. Though as the monthly consumption increases, the cost per row decreases, using Fivetran can become extremely costly when your company scales.
Hevo Data offers a free plan that includes 1 million free events (number of new records inserted or updated in a warehouse) on a limited set of connectors. Their paid plans start from $249/month for 5M events to $1449/month for 100M events. A 14-day free trial is also available. Just like Fivetran, the volume-based model adopted by Hevo Data might not be optimal for use cases that involve the replication of millions of rows.
Airbyte implements separating pricing between APIs, databases, and files which enables users to pay for value received depending on the source. According to the company’s website, their model is more affordable than the common monthly-active-rows (MAR) model. For instance, for 300M rows of APIs, and 300GBs of files and databases, Airbyte’s price is $5,400 while MAR model is $7,467.
Fivetran supports full table and incremental sync via Change Data Capture. It incrementally updates the data in the destination in batches that can be run from every 5 minutes to every 24 hours. The system also automatically recognizes schema changes in the data sources and persists these changes to the destination.
Hevo Data supports full table and incremental loads via SELECT or Replication key, Timestamp, and Change Data Capture.
Airbyte can also transfer full tables or do incremental loads via CDC (with pricing adapted particularly for incremental loading). Additionally, they provide an "at least once" guarantee of replicating each record that is present when the sync runs.
Fivetran: 192 connectors and 10 destinations( Azure Synapse, BigQuery, Databricks, MySQL, Panoply, Periscope, PostgreSQL Redshift, Snowflake, SQL Server)
Hevo Data: 150+ connectors and 2 types of destinations - data warehouses and databases (though they do not recommend DBs as destinations)
Airbyte: 125 sources, 26 destinations
Managed Data Lakes
Managed data lakes are services built on top of open-source ETL products. Such platforms may be the right choice for you if have large amounts of data, lots of concurrent users, or pre-existing data in blob storage, such as S3, that you want to do analytics on. Managed data lakes are fairly easy to use and offer great querying flexibility, but, depending on the tool you choose, you might need to perform your own job orchestration with Airflow. In this section, we take a look at a few open-source products, namely Apache Spark, Presto, and Trino, along with their respective managed platforms - Databricks, Ahana, and Starburst.
Apache Spark + Databricks
Apache Spark is an open-source program that excels at large-scale data processing. In 2013 it was given to the Apache foundation, and since then has become one of the most popular and active big data projects. Its goal was to provide a fast general-purpose framework for large-scale data processing and overcome the limitations of MapReduce, the most common data processing method in Hadoop at the time. It allows you to do large-scale data transformations and analysis and then immediately apply Machine Learning and AI algorithms to them.
Databricks was founded by the team that originally created Spark. It is a distributed computing framework and a managed platform for running Apache Spark. It also provides a point-and-click interface, accompanied by a sophisticated API allowing for task automation. Apache Spark has experienced substantial growth over the last decade and has become the essential data processing and AI engine in many companies because of its speed and ease of use. Databricks has delivered products built on top of Apache Spark that are more optimized and simpler to use than the original open-source product.
Databricks simplifies the creation of modern data warehouses that enable companies to provide self-service analytics and machine learning with enterprise-grade performance and governance.
It comes with built-in orchestration, GIT integrations, Spark SQL Endpoints with Photon, and easy mode integrations (Fivetran, Tableau, Power BI, dbt Cloud / CLI). Having built-in orchestration eliminates the need to use a separate workflow management tool like Airflow. It is also straightforward as it gives you an Airflow-like matrix that shows you tasks of a pipeline with historical runs so you can diagnose recurring issues easily.
There are some drawbacks to Databricks, such as it being extremely Notebook driven, which makes it tough to structure things into separate files/modules, thus making it hard to implement proper testing, version control, CI/CD, etc. Databricks can be overkill for small quantities of structured and semi-structured data. Moreover, some users have reported issues with the CLI and unclear documentation.
Presto + Ahana
Presto is an open-source distributed ANSI SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. It was created at Facebook and its initial use case was to run interactive queries to replace Apache Hive. Nowadays Facebook uses it for interactive queries against several internal data stores including their 300 petabytes data warehouse. Presto is now distributed under the Apache license and is optimized for low-latency interactive querying. You can write your own connector in Java and map a backend to Presto, or use the Thrift connector to write a Thrift service in any language to query virtually anything. Presto comes with a UI and CLI. It gives you the flexibility to reduce or even eliminate the time-consuming and expensive ETL and data integration processes by allowing you to query the data where it lives without having to move it or copy it into another location first.
Ahana Cloud offers a managed service for Presto on AWS. It comes fully integrated and gives you an open cloud data warehouse with a pre-built catalog that’s managed for you, a built-in integration with S3, as well as a BI sandbox. It is cloud native and runs on Amazon EKS for high scalability, availability, and manageability. It is good for data engineers, and platform developers who may not have deep distributed systems background or knowledge of the Hadoop ecosystem, or experience with open-source projects.
Trino + Starburst
Trino (formerly PrestoSQL) is a distributed query engine, with a number of plugins that connect to a database. It is where the creators of Presto now focus their efforts rather than the Facebook-run PrestoDB. In the query engine, Trino has a cost-based optimizer that helps define an optimal plan based on the table statistics and other info it receives from plugins. These statistics help Trino decide if it's better to schedule a big fact table as the outer loop and keep a smaller dimension table on an inner loop that can easily be loaded into memory.
Starburst Enterprise is built on Trino and claims to be the fastest SQL-based, Massively Parallel Processing (MPP) query engine. It requires no data moving or copying. It gives you access to the same library of pre-built connectors that you would find in Trino but also provides access to connectors not available through the open-source offering.
In essence, Starburst does nothing special to the core query engine to differentiate itself from the open-source project. Where the performance gains happen are in the custom Starburst connectors themselves. They expose both table/column level statistics which can help in a lot of cases based on the query. Many of the open-source connectors don't tap into the MPP architecture that Trino runs on and so making the connectors compatible to be run in parallel is another speedup.
Databricks’ pricing is based on your compute usage. Storage, networking, and related costs will vary depending on the services you choose and your cloud service provider. They also provide a 14-day free trial.
Ahana Cloud has a credit-based system, with $0.25 per Ahana cloud credit hour. For example Presto Cluster of 5 x r5.xlarge running every workday = $569/month. Free 14-trial is available.
Starburst comes with a 15-day trial and the actual cost is based on your configuration choices on AWS. For instance, the price for an r5.xlarge EC2 instance hosted in US East would be around $0.40/hr ($0.15/hr for Starburst enterprise + $0.25/hr for EC2) before the additional taxes and fees.
With Databricks you can perform incremental sync via Change Data Capture. Ahana Cloud allows you to do full table replication, as well incremental loads using Timestamp. There are a few companies trying to build Change Data Capture into Presto, but the main tool does not have that functionality. In Starburst cached tables can be refreshed incrementally or as a full refresh.
Databricks can ingest data from external databases, namely PostgreSQL, MySQL, MariaDB, and SQL Server through JDBC or ingest data from file types, such as Parquet, ORC, JSON, CSV, XML, S3, etc.
Ahana Cloud has listed 32 Presto connectors, along with Presto’s HTTP API, the JDBC and ODBC drivers, and several client libraries.
Starburst has 40+ supported connectors including high-performance connectors for Oracle, Teradata, and Snowflake.
Ease of Recruitment
Generally, in the US, it is easier to recruit for Spark than Presto (Presto and Trino require the same skill set). As of October 2022, there are more than 6300 active job postings (in the US) that require some experience with Spark vs 1638 jobs looking for Presto expertise. According to Enlyft’s data more than 16 000 companies use Spark and a little over 220 use Presto.
Presto: 1638 jobs
Spark: 6318 jobs
Cloud-native ETL tools are worth exploring if you already use a particular cloud service provider and their solutions fit your needs exactly. For instance, if you are happy with SSIS, and just want a managed platform, Azure Data Factory may be the right choice for you. Overall, such solutions provide a good level of flexibility but can prove difficult to work with. In this section, we look at Google Cloud’s Data Fusion and Dataflow, AWS Pipeline and Glue, as well as Azure Data Factory.
Google Cloud Data Fusion
Data Fusion is a fully managed cloud-native data integration service based on the open-source pipeline development tool CDAP. It is a flexible and portable solution for hybrid and multi-cloud environments. Data Fusion is focused on enabling data integration scenarios - reading from a source via an extensible set of connectors and writing to targets such as BigQuery. It supports major Hadoop distributions and it comes with a visual point-and-click interface.
Data Fusion gives you a fast development path and a good amount of out-of-the-box functionality. Some users choose it because of the visual-based graph development that leverages an extensible set of connectors and operators. Others believe that the service is not quite production-ready yet. There are limits on the API and you need a big Dataproc cluster to run many jobs, though it might work well for small-scale companies. Data Fusion is more suitable for data ingestions from a source to a destination with few transformations. Moreover, engineers at Google have recommended using Data Fusion if you are comfortable with Spark or have existing Spark scripts you want to port.
Google Cloud Dataflow
Dataflow is a fully managed, serverless data processing service in GCP that runs jobs using Apache Beam libraries. It offers a unified programming model to develop pipelines that can execute on a wide range of data processing patterns including ETL, batch computation, and stream-based processing. Each pipeline takes large amounts of data, potentially combining it with other data, and creates an enriched target dataset. The resulting dataset can be similar in size to the original, or a smaller summary dataset. Some of the additional Dataflow functionality includes vertical and horizontal autoscaling, fusion optimization, and real-time Change Data Capture.
Compared to Data Fusion, Dataflow gives you a more cost-effective solution, more flexibility, control, and fine-tuning, as well as more portability via Beam. However, if you're not comfortable with the Beam SDK, you might prefer Data Fusion. Dataflow might not be the right solution for you if you’re working with datasets that fit easily within memory on a single machine, because it is not the easiest framework to work with and it is likely to increase your development time. Moreover, the time to spin machines up and down (for batch processing) would unnecessarily increase your overall processing time.
AWS Data Pipeline & AWS Glue
Data Pipeline is a Cloud-based managed data workflow service used to automate the movement and transformation of data spread across various AWS tools while enabling monitoring from a single location. It offers flexibility in terms of the execution environment, access and control over the data processing code, and access and control over the compute resources that run your code.
Glue is a serverless, fully managed ETL platform that allows you to create and execute AWS ETL jobs with a few clicks in the AWS Management Console. It automatically discovers your data and stores the related metadata in the AWS Glue Data Catalog. After that data is available for querying. It is good for semi-structured data or when you want to run jobs on a serverless Apache Spark-based platform. Glue allows you to focus on the ETL job, as you don't have to configure or manage your compute resources.
There is a significant overlap between the two services. For instance, you can use them to natively integrate with S3, DynamoDB, RDS, and Redshift. You can move and transform data across different components within the AWS Cloud platform. Both services can assist you with ETL tasks. However, Glue is built on Apache Spark and its ETL jobs are Python or Scala-based. In general, Glue is more of a managed ETL service, and Data Pipeline is more of a managed workflow service.
As a general principle, you should use Data Pipeline for moving batch data between AWS components (e.g. replicating a database to S3 or DynamoDB backup), and running ETL jobs that don’t require the use of Spark. On the other hand, Glue is good for running jobs on serverless Spark-based platforms, designing complex ETL pipelines, or processing streaming.
Azure Data Factory
Data factory is another fully managed serverless data integration and transformation service. It is the successor to SSIS and it allows you to lift and shift existing SSIS packages to Azure. Data Factory is designed for centralized applications and many people consider it to be more streamlined and easier to use than other similar services such as Dataflow. It is a natural choice for people looking to modernize their SSIS workflows.
Overall, Data Factory works well if you are on using Azure. It is marketed as a GUI drag-n-drop, but it also supports the deployment of coding scripts. Some users, however, recommend against relying on scripts, even though it is technically possible. Data Factory is generally good for simpler software and it integrates especially well with other aspects of Azure. It should be looked at as an orchestration and monitoring tool, more than a transformational tool. It's cloud-first, but you can access on-premise data by using a self-hosted integration runtime. Supposedly Microsoft’s intention is to replace SSIS with Data Factory. That being said, there are a lot of similarities between both systems, so if you are coming from SSIS there won’t be much of a learning curve.
Cloud Data Fusion’s pricing depends on the interface instance hours. Data Fusion bills by the minute, but the usage is measured in hours (the period of time between an instance is created and the time it is deleted). For pipeline development, the price per instance per hour can vary between $250 and $3000 per month, depending on your pricing tier. For pipeline execution, you are charged for the Dataproc cluster that Data Fusion creates to run your pipeline.
Cloud Dataflow jobs are billed per second, based on the actual use of Dataflow batch or streaming workers.
With AWS Pipeline you are billed depending on how often your preconditions and activities are scheduled to run and where they run (on-premises or AWS).
AWS Glue, on the other hand, bills you by the second for crawlers discovering data in data stores and ETL jobs.
Azure Data Factory’s pricing is calculated based on the number of pipeline orchestration runs, compute hours for flow execution and debugging; and the number of Data Factory operations, such as pipeline monitoring.
Besides full data loading, Data Fusion provides a pluggable component called Replication that allows you to replicate data continuously and in real time from relational operational databases into BigQuery.
Dataflow supports full table and incremental replication via custom SELECT statements.
AWS Pipeline and Glue both offer full table sync. Pipeline can do incremental loading via Timestamp, while Glue uses CDC through AWS Database Migration Service (DMS).
Azure Data Factory supports full table load, as well as incremental load via CDC
Cloud Dataflow supports both batch and streaming ingestion. For batch, it can access both GCP-hosted and on-premises databases. For streaming, it uses PubSub. Cloud Dataflow doesn't support any SaaS data sources but can write data to Google Cloud Storage or BigQuery.
Cloud Data Fusion supports simple preload transformations — validating, formatting, and encrypting or decrypting data, among other operations — created in a graphical user interface.
It has a broad library of open-source transformations and more than 100 available plugins. Data Fusion supports almost 20 file and database sources and more than 20 destinations, including databases, file formats, and real-time resources.
AWS Data Pipeline supports 4 sources for native integration - DynamoDB, SQL, Redshift, and S3. It can be configured to integrate with a wide variety of other data sources ranging from AWS Elastic File System to on-premises data sources to run Java-based tasks.
Glue natively supports data stored in Aurora, RDS, Redshift, DynamoDB, S3, JDBC-type data stores (such as MS SQL Server and Oracle), and PostgreSQL databases in your Amazon Virtual Private Cloud, plus MongoDB client stores. Other integrations include Kafka, AWS Lambda, Apache Flink, MKS, and others.
Azure Data Factory integrates with about 80 data sources, including SaaS platforms, SQL and NoSQL databases, generic protocols, and various file types. It supports around 20 cloud and on-premises data warehouse and database destinations and has more than 90 built-in maintenance-free connectors.
Ease of recruitment
Experience with AWS Glue is by far the most in-demand when it comes to cloud-based ETL tools. According to Enlyft’s 2022 report, there are 859 companies currently using AWS Glue, such as Blue Cross Blue Shield and Red Hat. Cloud Dataflow comes second in terms of available positions looking for that skill set, and about 295 companies using the product. Enlyft reports that over 4600 companies use Azure Data Factory, but the open positions for that skillset are a little over 40 (in the US). We assume that’s because you generally don’t need a whole new skill set to work with Data Factory if you have experience with SSIS.
Cloud Dataflow: 528 jobs
Cloud Data Fusion: 227 jobs
AWS Data Pipeline: 38 jobs
AWS Glue: 3387 jobs
Azure Data Factory: 42 jobs
If you are looking for an SSIS alternative and don't know where to start, here are a few summary points for you to consider:
- Self-hosted ETL tools, such as SQLpipe or Airbyte, give you full control over your data and can reduce infrastructure costs.
- Fully managed solutions are easy to work with, do not require writing code, and have a good amount of built-in connectors.
- Managed data lakes are good for use cases where traditional data warehouses don’t suit your needs (e.g. you have a large number of concurrent users, or want to do analytics on preexisting data in blob storage).
- Cloud-native solutions can be useful if you already work with a cloud provider and their solutions fit your needs exactly.
You can find a summary table with the main points here.