GraphQL ETL: Overview & Helpful Tools for Data Engineers

Ethan
CEO, Portable

What Is GraphQL?

GraphQL is an open-source data query and manipulation language for accessing or aggregating data via an Application Programming Interface (API).

This technology helps clients access precise information without querying an entire data table. It also makes normalized data available across all API connectors. GraphQL was released publicly in 2015 and later came under the umbrella of the non-profit Linux Foundation.

GraphQL takes a distinctive approach to developing APIs. Clients define the structure of the data they need, and the server returns data in that same structure. This prevents clients from fetching excessive amounts of data from the server.
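To make the idea concrete, here is a minimal sketch in plain Python of a server returning only the fields a client asked for. It is not a GraphQL implementation: the record, the `select_fields` helper, and the nested-dict way of writing a selection are all assumptions made for illustration.

```python
# Hypothetical record held by the server; all field names and values are
# illustrative only.
STUDENT = {
    "id": 3,
    "name": {"text": "Sample Student", "language": "en"},
    "birthDate": {"year": 2001, "month": 5, "day": 17},
    "address": "12 Example Street",
}


def select_fields(data: dict, selection: dict) -> dict:
    """Return only the fields named in `selection`, keeping its nesting.

    A value of None in `selection` means "return this field as it is";
    a nested dict means "descend and select again".
    """
    result = {}
    for field, sub_selection in selection.items():
        value = data[field]
        if sub_selection is None:
            result[field] = value
        else:
            result[field] = select_fields(value, sub_selection)
    return result


# The client asks for the id and the birth year, and nothing else.
selection = {"id": None, "birthDate": {"year": None}}
print(select_fields(STUDENT, selection))
# {'id': 3, 'birthDate': {'year': 2001}}
```

The response has the same shape as the selection, and fields such as `address` never leave the server.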

What Are the Benefits of GraphQL?

Simplified data fetching

GraphQL simplifies data ingestion. It reduces client code by returning data in the required shape. Organizations can add this query language on top of their existing architecture.

REST and SOAP services are practical examples of such existing architecture. This brings a critical improvement to workflows: data organization is clean and unified, and the data is available all at once.

Standardized and scalable

GraphQL API technology fulfills the demanding requirements of modern front-end apps. It provides a standardized and scalable way of getting real-time updates, and it lets the app's frontend work with data in a more relational way. GraphQL also supports the implementation of services.

Reads and writes are expressed as queries and mutations. For new data updates, GraphQL offers subscriptions, which are commonly delivered over WebSockets.

Expressive API

This API technology also provides a built-in typed API schema and an expressive query language. The schema helps identify data integration bugs before they happen.

It also helps in narrowing the scope of custom API integrations.
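As a rough illustration of how a typed schema can catch integration bugs early, the sketch below checks a requested selection against a schema before any data is fetched. It is a toy check written for this article, not the validation algorithm of any GraphQL library. The `SCHEMA` layout, the `validate_selection` helper, and the nested-dict selection format are assumptions.

```python
# Hypothetical schema: each object type maps field names to either a scalar
# type name or the name of another object type.
SCHEMA = {
    "Student": {"id": "Int", "name": "Name", "birthDate": "Date"},
    "Name": {"text": "String"},
    "Date": {"year": "Int", "month": "Int", "day": "Int"},
}
SCALARS = {"Int", "String"}


def validate_selection(type_name: str, selection: dict, path: str = "") -> list:
    """Return a list of error messages; an empty list means the selection is valid."""
    errors = []
    fields = SCHEMA[type_name]
    for field, sub_selection in selection.items():
        where = f"{path}{field}"
        if field not in fields:
            errors.append(f"{where}: no such field on {type_name}")
            continue
        field_type = fields[field]
        if field_type in SCALARS:
            if sub_selection is not None:
                errors.append(f"{where}: scalar {field_type} cannot have sub-fields")
        elif sub_selection is None:
            errors.append(f"{where}: object {field_type} needs a sub-selection")
        else:
            errors.extend(validate_selection(field_type, sub_selection, where + "."))
    return errors


# A misspelled field is reported before any request reaches a data source.
print(validate_selection("Student", {"birthdate": {"year": None}}))
# ['birthdate: no such field on Student']
```

In a pipeline, a check like this can run when the job is deployed, so a renamed field surfaces as a validation error rather than as missing columns downstream.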

Multiple platforms and languages support

GraphQL has server implementations for many languages. For instance, GraphQL.js is the reference implementation for JavaScript, and an Apollo GraphQL API can be deployed to Azure in an Azure Function. The GraphQL community has built versions of the GraphQL runtime in many other languages.

The following list pairs each language with technologies for implementing GraphQL APIs:

  • Ruby — graphql-ruby
  • Scala — Caliban
  • Python — Graphene
  • TypeScript — Apollo Server
  • Java/Kotlin — GraphQL Java, GraphQL Kotlin, DGS
  • Go — gqlgen, graphql-go

Should You Use GraphQL for ETL?

No, you should not use GraphQL for ETL.

Data engineers use ETL processes to extract existing data and transform it into a new dataset. However, if this data has been restructured into several micro-pieces, GraphQL is an excellent choice for extracting it.

As an example, consider a typical e-commerce data project in which the backend is restructured into micro-services. Before refactoring, the website fetches its data from a MySQL replica in AWS RDS. After refactoring, a query is created for each micro-service, and Apache Airflow fetches the data. "GraphQLToS3Operator" is an operator that retrieves the data using the requests library and saves it in a data frame, which is later stored in Amazon S3. Subsequently, this data is available in Snowflake.

The output of the GraphQL resolvers decides the choice. If the resolvers simply run MySQL queries, it is better to read from the SQL server endpoint directly. But if the application adds its own variations on top, it is better to pull the data through GraphQL. Replicating application logic on the data engineering side is not a suitable choice, because stakeholders at different tiers do not communicate every change to the GraphQL queries.
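The extract-and-store step described above can be sketched in plain Python. This is not the actual "GraphQLToS3Operator" and does not use the Airflow API. The function name, its parameters, and the newline-delimited JSON output (used here instead of a data frame to avoid extra dependencies) are assumptions. The HTTP call and the S3 upload are passed in as callables, so the sketch runs without a network.

```python
import json
from typing import Callable


def graphql_to_object_store(
    query: str,
    root_field: str,
    post_json: Callable[[dict], dict],
    upload: Callable[[str, bytes], None],
    key: str,
) -> int:
    """Run one GraphQL query and store the rows as newline-delimited JSON.

    `post_json` stands in for an HTTP POST to the GraphQL endpoint (for
    example, made with the requests library). `upload` stands in for an
    S3 client. Returns the number of rows written.
    """
    response = post_json({"query": query})
    if response.get("errors"):
        raise RuntimeError(f"GraphQL errors: {response['errors']}")
    rows = response["data"][root_field]
    body = "".join(json.dumps(row, sort_keys=True) + "\n" for row in rows)
    upload(key, body.encode("utf-8"))
    return len(rows)


# Stand-ins for the micro-service endpoint and for the bucket.
def fake_post(payload: dict) -> dict:
    return {"data": {"orders": [{"id": 1, "total": 9.5}, {"id": 2, "total": 20.0}]}}


bucket = {}
count = graphql_to_object_store(
    "query { orders { id total } }",
    "orders",
    fake_post,
    bucket.__setitem__,  # a dict plays the role of the object store
    "orders/export.jsonl",
)
print(count)  # 2
```

In an orchestrator, one such call per micro-service query would run as its own task, and a later task would load the stored files into the warehouse.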

Transform Data with GraphQL: Use Cases

Artsy used a GraphQL layer to improve page speed, one of its key performance indicators. GraphQL connected many front-end apps with various APIs and moved business logic out of the client apps. This made it easier for developers to fetch data. It also narrowed the gap between web and iOS developers, who could work with the same API layer.

Artsy is already using Mural, a framework for React and GraphQL. A library named JoiQL came out of this project. It converts Joi schemas into GraphQL schemas. It was trivial to create a GraphQL-based endpoint with JoiQL.

GitHub's API journey began with a REST API, which snubbed XML in favor of JSON. GitHub later found GraphQL to be a problem solver and an opportunity provider for integrators, and made its API available through GraphQL.

GitHub started supporting GraphQL to solve two problems. The first was scalability. The second was inconsistency in collecting aggregate metadata about its endpoints. No standard other than GraphQL matched its API requirements.

A GraphQL request is constructed by defining the required resources. The client sends it to the server with a POST request, and the server responds in the format of the request. GraphQL also handles more complicated scenarios: all the required data is available from a single request.

To determine the effectiveness of GraphQL, engineers at GitHub decided to implement emoji reactions on comments as a small data model. They also modeled the types that define the GraphQL schema.

With GraphQL, the GitHub team implemented request logging and exception reporting, which helped in providing error responses. GitHub regards GraphQL as a significant shift in its platform strategy.

Load GraphQL API Data to a Data Warehouse

Export GraphQL API Data to JSON

JavaScript Object Notation (JSON) is a human-readable format for representing data. It is well suited to sending data from a server to a web page or vice versa. When clients connect to GraphQL servers over HTTPS, the servers expect the query in a JSON body. So, we create a JSON object from a GraphQL query.

GraphQL docs present queries in a form like this:

query {
  Student(id:3) {
    id
    birthDate {
      year
      month
      day
    }
    name {
      text
    }
  }
}

We can convert a GraphQL query to a JSON body in a simple way. First, remove the new lines and extra spaces from the GraphQL query. Then add it under a "query" key in a JSON object.

The JSON body can also carry the optional "operationName" and "variables" fields. According to the GraphQL documentation on POST requests, a standard GraphQL POST request should use the application/json content type and include a JSON-encoded body of the following form:

{
  "query": "...",
  "operationName": "...",
  "variables": {"myVariable": "someValue", ... }
}

So, the JSON object of the GraphQL query looks like the following:

{
  "query": "query { Student(id:3){ id birthDate { year,month,day } name { text } } }",
  "variables": {}
}
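The conversion steps above can be sketched in a few lines of Python using only the standard library. The helper name `build_graphql_payload` is an assumption made for illustration. The whitespace collapse is deliberately naive and would also alter whitespace inside any string literal in the query.

```python
import json
import re
from typing import Optional


def build_graphql_payload(
    query: str,
    variables: Optional[dict] = None,
    operation_name: Optional[str] = None,
) -> str:
    """Collapse whitespace in a GraphQL query and wrap it in a JSON body."""
    # Replace every run of spaces, tabs, and new lines with a single space.
    compact = re.sub(r"\s+", " ", query).strip()
    body = {"query": compact, "variables": variables or {}}
    if operation_name is not None:
        body["operationName"] = operation_name
    return json.dumps(body)


QUERY = """
query {
  Student(id:3) {
    id
    birthDate {
      year
      month
      day
    }
    name {
      text
    }
  }
}
"""

print(build_graphql_payload(QUERY))
# {"query": "query { Student(id:3) { id birthDate { year month day } name { text } } }", "variables": {}}
```

Collapsing the whitespace mainly keeps the body readable. `json.dumps` escapes new lines on its own, so passing the multi-line query through unchanged also yields valid JSON. The returned string is what goes in the body of the POST request, sent with the application/json content type.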

Dedicated Tools for Loading GraphQL API Data

There are several dedicated tools available for loading data from a GraphQL API. Some of the most popular tools include:

1. Portable

Portable is a cloud-based ETL tool for no-code data pipelines. It supports loading data from GraphQL APIs.

2. Apollo Client

Apollo Client is a popular JavaScript library for querying GraphQL APIs and managing client-side data. It provides a simple and intuitive API for making GraphQL requests, caching data, and handling real-time updates.

3. Relay

Relay is a JavaScript framework for building data-driven React applications explicitly designed for use with GraphQL APIs. It provides an optimized data-fetching strategy that makes loading and managing data in your React components easy.

4. URQL

URQL is a lightweight and flexible GraphQL client for React, focusing on simplicity and performance. It provides a simple API for querying GraphQL APIs and updating client-side data, as well as several advanced features such as real-time updates and optimistic UI.

5. GraphiQL

GraphiQL is a graphical user interface for exploring GraphQL APIs that provides an interactive environment for testing and debugging your GraphQL queries.

6. GraphQL Playground

GraphQL Playground is a web-based interactive environment for testing and exploring GraphQL APIs, focusing on usability and performance.

7. GraphQL Voyager

GraphQL Voyager renders a GraphQL API as an interactive graph, which makes it easier for people to read. It is helpful for discussing and designing data models.

8. Hygraph

Hygraph is the first native GraphQL federated content platform that allows you to unify and distribute content from anywhere to anywhere.

9. Swagger

Swagger is a set of tools for describing RESTful APIs. Its API definitions can be used when moving RESTful APIs to GraphQL.

10. Insomnia

Insomnia is an API design and testing tool. It helps developers build and test GraphQL APIs in a user-friendly way.

Besides the tools listed above, the GraphQL API in Microsoft Dynamics CRM enables pulling and pushing data. Tools like Splunk can consume CSV, so Python scripts can query a GraphQL API and convert the response to CSV.
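A script of that kind could look roughly like the sketch below, which covers only the conversion step. It assumes the response has already been fetched and parsed, and that the root field holds a list of objects. The helper names and the dotted column naming are assumptions made for illustration.

```python
import csv
import io


def flatten(record: dict, prefix: str = "") -> dict:
    """Turn nested objects into flat columns such as 'birthDate.year'."""
    flat = {}
    for key, value in record.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, column + "."))
        else:
            # Lists and scalars are written as they are.
            flat[column] = value
    return flat


def response_to_csv(response: dict, root_field: str) -> str:
    """Convert the list under data[root_field] into CSV text."""
    rows = [flatten(record) for record in response["data"][root_field]]
    # Collect the columns in first-seen order so that every row fits.
    columns = []
    for row in rows:
        for column in row:
            if column not in columns:
                columns.append(column)
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical parsed response; field names and values are illustrative only.
RESPONSE = {
    "data": {
        "students": [
            {"id": 3, "name": {"text": "Sample Student"},
             "birthDate": {"year": 2001, "month": 5, "day": 17}},
            {"id": 4, "name": {"text": "Other Student"}},
        ]
    }
}

print(response_to_csv(RESPONSE, "students"))
```

Rows that lack a nested object, like the second student here, get empty cells for the missing columns instead of raising an error.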

Building GraphQL Endpoints With Help From ETL

Although GraphQL enables clients to request the exact data they need, building an endpoint requires significant work, including designing a schema, defining resolvers, and connecting to data sources.

ETL (Extract, Transform, Load) plays a significant role in helping to engineer a future with GraphQL. ETL tools can extract data from various sources, such as databases and file systems, and transform it into a format that GraphQL APIs can consume.

Using ETL tools to support GraphQL helps organizations leverage their existing data infrastructure. Many organizations have invested heavily in databases, data warehouses, and other data sources. Rather than exposing production data directly, ETL tools can pipe data to a dedicated API endpoint optimized for GraphQL client interactions.
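A minimal sketch of this pattern follows, assuming an in-memory SQLite database as the dedicated serving store. The source rows, the table layout, and the `resolve_student` function are hypothetical. A real endpoint would register such a function as a resolver in a GraphQL server library, which is not shown here.

```python
import sqlite3

# Hypothetical rows already extracted from a source system.
SOURCE_ROWS = [
    {"student_id": 3, "full_name": " Sample Student ", "birth": "2001-05-17"},
    {"student_id": 4, "full_name": "Other Student", "birth": "2000-06-23"},
]


def load_serving_table(conn: sqlite3.Connection, rows: list) -> None:
    """Transform raw rows and load them into a table shaped for the API."""
    conn.execute(
        "CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, "
        "birth_year INTEGER, birth_month INTEGER, birth_day INTEGER)"
    )
    for row in rows:
        # Transform step: split the date and trim stray whitespace.
        year, month, day = (int(part) for part in row["birth"].split("-"))
        conn.execute(
            "INSERT INTO student VALUES (?, ?, ?, ?, ?)",
            (row["student_id"], row["full_name"].strip(), year, month, day),
        )
    conn.commit()


def resolve_student(conn: sqlite3.Connection, student_id: int):
    """Resolver-style lookup that returns the nested shape a client queries."""
    found = conn.execute(
        "SELECT id, name, birth_year, birth_month, birth_day "
        "FROM student WHERE id = ?",
        (student_id,),
    ).fetchone()
    if found is None:
        return None
    return {
        "id": found[0],
        "name": {"text": found[1]},
        "birthDate": {"year": found[2], "month": found[3], "day": found[4]},
    }


conn = sqlite3.connect(":memory:")
load_serving_table(conn, SOURCE_ROWS)
print(resolve_student(conn, 3))
# {'id': 3, 'name': {'text': 'Sample Student'}, 'birthDate': {'year': 2001, 'month': 5, 'day': 17}}
```

Because the resolver reads only from the serving table, GraphQL clients never query the production source directly.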

Ultimately, ETL helps organizations to actualize a modern data stack for scalable operations and gain competitive strengths in a rapidly changing landscape.