1. Introduction

Apache Kylin is an open-source OLAP engine built to bring sub-second query performance to massive datasets. Originally developed by eBay and later donated to the Apache Software Foundation, Kylin has grown into a widely adopted tool for big data analytics, particularly in environments dealing with trillions of records across complex pipelines.

The platform is known for blending OLAP performance with the scale of distributed systems. It bridges the gap between complex, large-scale data storage and the speed requirements of modern business intelligence tools, enabling faster decisions on fresher data.

In this tutorial, we’ll explore the core features that make Kylin stand out, walk through its architecture, and look at how it changes the game in big data analytics. Let’s get started!

2. Understanding Apache Kylin’s Core Capabilities

Let’s start by looking at what Apache Kylin does well.

Apache Kylin delivers sub-second latency even when operating on datasets that span trillions of rows. This is possible due to its heavy use of pre-computed data models and optimized indexing. When performance and speed are critical, Kylin shines.

Similarly, Kylin also easily handles high concurrency. Whether the system is serving hundreds of queries simultaneously or performing heavy aggregations, the underlying architecture is built to scale without becoming a bottleneck.

Another strength is Kylin’s unified big data warehouse architecture. It integrates natively with the Hadoop ecosystem and data lake platforms, making it a solid fit for organizations already invested in distributed storage. For visualization and business reporting, Kylin integrates seamlessly with tools like Tableau, Superset, and Power BI. It exposes query interfaces that allow us to explore data without needing to understand the underlying complexity.

Furthermore, if we’re looking for production-ready features, Kylin provides robust security, metadata management, and multi-tenant capabilities, making it suitable for enterprise use at scale. Kylin’s performance isn’t just luck; its components are engineered from the ground up using multidimensional modeling, smart indexing, and an efficient data-loading pipeline.

Let’s take a closer look at how each of these elements contributes to its capabilities.

2.1. Multidimensional Modeling and the Role of Models

At the heart of Kylin is its data model, which is built using star or snowflake schemas to define the relationships between the underlying data tables. In this structure, we define dimensions, which are the perspectives or categories we want to analyze (like region, product, or time). Alongside them are measures, which are aggregated numerical values such as total sales or average price.

Kylin also supports computed columns, which let us define new fields using expressions or transformations; these are useful for standardizing date formats or creating derived attributes. Joins are handled during the model definition stage, allowing Kylin to understand the table relationships and optimize the model accordingly.

Once a model is built, it becomes the foundation for index creation and data loading.

2.2. Index Design and Pre-Computation (CUBEs)

To achieve its speed, Kylin heavily relies on pre-computation. It builds indexes (also known as CUBEs) that aggregate data ahead of time based on the model dimensions and measures. There are two main types of indexes in Kylin:

  • Aggregate Indexes: These store pre-aggregated combinations of dimensions and measures, such as total revenue by product and month.
  • Table Indexes: These are multilevel indexes that help serve detailed or drill-down queries, like fetching the last 50 orders placed by a specific user.

By precomputing the possible combinations and storing them efficiently, Kylin avoids the need to scan raw data at query time. This drastically reduces latency, even for complex analytical queries.

Notably, index design is critical. The more targeted and efficient the indexes are, the less storage and processing power is consumed during query time.
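
To make the pre-computation idea concrete, here's a minimal, Kylin-free sketch in plain shell: we "build" an aggregate once from some hypothetical raw fact rows, and every later query becomes a cheap lookup against the aggregate rather than a scan of the raw data (the file names and sample data are made up for illustration):

```shell
# Hypothetical raw fact rows: product,month,revenue
cat > /tmp/sales_raw.csv <<'EOF'
widget,2024-01,100
widget,2024-01,50
gadget,2024-02,70
EOF

# "Build" step (done once, like a Kylin aggregate index):
# pre-aggregate revenue by (product, month)
awk -F, '{ sum[$1","$2] += $3 } END { for (k in sum) print k "," sum[k] }' \
  /tmp/sales_raw.csv | sort > /tmp/sales_agg.csv

# "Query" step: answering "revenue for widget in 2024-01"
# is now a lookup, not a scan of the raw rows
grep '^widget,2024-01,' /tmp/sales_agg.csv
# → widget,2024-01,150
```

This is, of course, a toy analogy; Kylin builds and stores its aggregates in a distributed fashion. But the trade-off is the same: more work at build time in exchange for far less work at query time.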

2.3. Data Loading Process

Once the model and indexes are in place, we need to load the data. Data loading in Kylin involves building the CUBEs and populating them with pre-computed results.

Traditionally, this is done in batch mode using offline data. Kylin reads from source tables, often from Hive or Parquet files in HDFS, and processes the data into its index structures.

There’s also support for streaming sources like Apache Kafka, enabling near real-time ingestion and analysis. This makes it possible to use Kylin in hybrid batch-streaming scenarios without changing the analytical layer.

Importantly, once we load the data, queries run against the pre-built indexes instead of raw datasets, providing consistent and predictable performance regardless of the underlying volume.

3. How to Run Apache Kylin in Docker

The fastest way to explore Apache Kylin is by spinning it up in a Docker container. This is perfect if we want to test out new features locally or evaluate the latest releases.

Let’s see a docker run command to start using Apache Kylin:

$ docker run -d \
  --name Kylin5-Machine \
  --hostname localhost \
  -e TZ=UTC \
  -m 10G \
  -p 7070:7070 \
  -p 8088:8088 \
  -p 9870:9870 \
  -p 8032:8032 \
  -p 8042:8042 \
  -p 2181:2181 \
  apachekylin/apache-kylin-standalone:5.0.0-GA

Here, we pull the standalone image and launch Apache Kylin 5.0 as a container, exposing the common service ports for easy access:

  • --name: assigns a name to the container
  • --hostname: sets the container’s hostname (helpful for internal references)
  • -e TZ=UTC: sets the timezone to UTC
  • -m 10G: limits the container’s memory usage to 10 GB (at least 10 GB is recommended for a smoother experience)
  • -p options: map essential Kylin and Hadoop-related service ports from the container to the host
  • apachekylin/apache-kylin-standalone:5.0.0-GA: the image, which includes all necessary services bundled together

While the docker run command itself doesn’t produce output beyond the container ID, we can validate that it’s running with docker ps:

$ docker ps --filter name=Kylin5-Machine
CONTAINER ID   IMAGE                                         STATUS          PORTS                                             NAMES
abc123456789   apachekylin/apache-kylin-standalone:5.0.0-GA   Up 10 seconds   0.0.0.0:7070->7070/tcp, ...                      Kylin5-Machine

Once we’re sure that the container is up, we can access the Kylin web UI at http://localhost:7070 and start exploring. This setup gives us everything we need to build models and explore datasets in a self-contained environment.

3.1. Verifying the Kylin Instance

Once the container is running, we can verify the instance using a simple health check via curl:

$ curl http://localhost:7070/kylin/api/system/health

If everything is working, we should see a response indicating the server status as UP:

{
    "status": "UP",
    "storage": {
        "status": "UP"
    },
    "metadata": {
        "status": "UP"
    },
    "query": {
        "status": "UP"
    }
}

This confirms that Kylin’s internal services (metadata, query engine, and storage) are running and ready to accept operations.

3.2. Accessing the Kylin Web Interface

The Kylin UI will be available at http://localhost:7070. We can use the default credentials to log in:

Username: ADMIN
Password: KYLIN

Once it’s up, we can also reach the bundled Spark and Hadoop UI components through the other exposed ports.

From here, we can create a project, upload a data model, and begin building CUBEs. The interface also includes sections for managing metadata, monitoring build jobs, and testing SQL queries interactively.

4. How to Define a Model and Build a CUBE in Apache Kylin Using SQL

With Kylin, we can also define models and kick off CUBE builds using plain SQL and the REST API. This makes the process cleaner, automatable, and perfect for dev-heavy environments. Let’s walk through it.
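
All of the API calls below authenticate with HTTP Basic auth as the default ADMIN/KYLIN user; the header value is simply the base64 encoding of username:password, which we can compute once and reuse:

```shell
# Compute the Basic auth header value once; reuse $AUTH in every curl call
# (printf avoids the trailing newline a plain echo could sneak into the encoding)
AUTH=$(printf '%s' 'ADMIN:KYLIN' | base64)
echo "$AUTH"
# → QURNSU46S1lMSU4=
```

We can then pass -H "Authorization: Basic $AUTH" to each request instead of recomputing the encoding inline every time.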

4.1. Loading a Table Into Kylin

Assuming the source table sales_data exists in Hive or a similar catalog, we begin by telling Kylin about it.

To do this, we can make a POST request to the /tables API via curl:

$ curl -X POST http://localhost:7070/kylin/api/tables/default.sales_data \
  -H "Authorization: Basic $(echo -n 'ADMIN:KYLIN' | base64)" \
  -d '{"project":"sales_analytics"}'

Here, we register our source table, sales_data, into the sales_analytics project. This tells Kylin to pull metadata for the sales_data table from the configured catalog (like Hive or JDBC). Let’s see an example output:

{
    "uuid": "fcbe5a9a-xxxx-xxxx-xxxx-87d8c1e6b2c5",
    "database": "default",
    "name": "sales_data",
    "project": "sales_analytics"
}

As we can see, once registered, it’s available for model creation.

4.2. Creating a Model From SQL

Here’s where things get interesting. We can now define a model using a SQL statement, and Kylin infers the dimensions, measures, and joins automatically.

Let’s see an example SQL:

SELECT
  order_id,
  product_id,
  region,
  order_date,
  SUM(order_amount) AS total_sales
FROM sales_data
GROUP BY order_id, product_id, region, order_date

This tells Kylin what the dimensions and measures are.

Now, let’s send this SQL to Kylin’s modeling engine via API:

$ curl -X POST http://localhost:7070/kylin/api/models/sql \
  -H "Authorization: Basic $(echo -n 'ADMIN:KYLIN' | base64)" \
  -H "Content-Type: application/json" \
  -d '{
    "project": "sales_analytics",
    "modelName": "sales_cube_model",
    "sql": "SELECT order_id, product_id, region, order_date, SUM(order_amount) AS total_sales FROM sales_data GROUP BY order_id, product_id, region, order_date"
  }'

If the request is successful, Kylin creates a new model that includes all the columns mentioned, along with a basic aggregation on order_amount:

{
    "model_id": "sales_cube_model",
    "status": "ONLINE",
    "fact_table": "sales_data",
    "dimensions": ["order_id", "product_id", "region", "order_date"],
    "measures": ["SUM(order_amount)"]
}

This creates a new model, sales_cube_model, inferring metadata directly from the SQL: Kylin automatically marks the grouping fields as dimensions and applies the aggregation as a measure.

4.3. Triggering a CUBE Build Job

Once the model is created, we can trigger a build job to materialize the index.

First, we get the model’s ID (or name), then we send a build request:

$ curl -X PUT http://localhost:7070/kylin/api/jobs \
  -H "Authorization: Basic $(echo -n 'ADMIN:KYLIN' | base64)" \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "sales_cube_model",
    "project": "sales_analytics",
    "build_type": "BUILD",
    "start_time": 0,
    "end_time": 2000000000000
  }'

After running this, Kylin starts building the CUBE using the default aggregation groups, and outputs the status:

{
    "uuid": "job_3f23c498-xxxx-xxxx-xxxx-9eab1a66f79c",
    "status": "PENDING",
    "exec_start_time": 1711700000000,
    "model_name": "sales_cube_model"
}

This schedules a full CUBE build (covering all time ranges) for the model. Kylin precomputes aggregates defined in the model. The timestamp range here is wide open, which works well for full builds.

4.4. Monitoring the Build Status

To monitor progress, we can check the status of the build job using the job API:

$ curl -X GET "http://localhost:7070/kylin/api/jobs?projectName=sales_analytics" \
  -H "Authorization: Basic $(echo -n 'ADMIN:KYLIN' | base64)"

[
    {
        "job_status": "FINISHED",
        "model_name": "sales_cube_model",
        "duration": 52300,
        "last_modified": 1711700150000
    }
]

The response shows job stages, status, duration, and whether the build succeeded. Once it reaches "job_status": "FINISHED", we’re ready to query.

Notably, Kylin supports index pruning based on query patterns. After a few queries, we can check the index usage stats in the API. We may find that some dimensions are rarely used together, and trimming those combinations from the index definition can improve build times and reduce storage without affecting query coverage.

In short, we’ve fully modeled a dataset, defined aggregations, and materialized a CUBE. We can now run queries against it, or automate the whole flow. For recurring data loads, we can script the CUBE build process with cron jobs or CI pipelines: Kylin’s REST API is script-friendly, so it’s easy to trigger builds at midnight, hourly, or whenever new data lands in the source system.
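
As a sketch of such automation (reusing the instance, model, and project from the previous sections; the helper names and the script itself are our own invention), a nightly cron-driven script could look like this:

```shell
#!/bin/sh
# Hypothetical nightly build script for the sales_cube_model defined above.
# KYLIN_HOST, the model name, and the project name are this tutorial's examples.
KYLIN_HOST=${KYLIN_HOST:-http://localhost:7070}
AUTH=$(printf '%s' 'ADMIN:KYLIN' | base64)

# Pull the "job_status" value out of a JSON response without needing jq
job_status() {
  sed -n 's/.*"job_status"[[:space:]]*:[[:space:]]*"\([A-Z_]*\)".*/\1/p'
}

# Trigger a full build, exactly like the manual PUT request above
trigger_build() {
  curl -s -X PUT "$KYLIN_HOST/kylin/api/jobs" \
    -H "Authorization: Basic $AUTH" \
    -H "Content-Type: application/json" \
    -d '{"model_name":"sales_cube_model","project":"sales_analytics","build_type":"BUILD","start_time":0,"end_time":2000000000000}'
}

# Poll the job API until the most recent job stops running
wait_for_build() {
  while :; do
    status=$(curl -s "$KYLIN_HOST/kylin/api/jobs?projectName=sales_analytics" \
      -H "Authorization: Basic $AUTH" | job_status | head -n 1)
    [ "$status" = "FINISHED" ] && break
    [ "$status" = "ERROR" ] && exit 1
    sleep 30
  done
}

# Entry point: uncomment to run against a live instance
# trigger_build >/dev/null && wait_for_build && echo "build finished"
```

Wiring this into cron is then a one-liner along the lines of 0 0 * * * /opt/scripts/kylin-build.sh (the path is, again, hypothetical).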

5. Conclusion

In this article, we explored Apache Kylin, a purpose-built tool for extreme scale and performance in big data analytics. It combines the power of OLAP modeling with distributed computing to deliver fast, reliable insights across massive datasets.

With its recent releases, the platform introduces streaming support, a native compute engine, automated modeling, and smarter metadata handling. These changes make it more approachable, more performant, and more aligned with modern data architectures.

Whether we’re building dashboards, powering real-time metrics, or democratizing data access, Kylin provides the tooling to get it done at scale, and at speed.
