This functionality is only available to users who have purchased the Data Distiller SKU. Contact your Adobe representative for more information.
Learn how to use hypercubes in Adobe Experience Platform’s Experience Query Service to perform advanced data analysis with enhanced efficiency. This document covers how to use advanced functions from the Apache Datasketches library to handle distinct counts and complex calculations incrementally, without having to reprocess historical data each time.
In big data analysis, generating metrics such as distinct counts, quantiles, most-frequent items, joins, and graph analysis often involves non-additive counting (where results can’t simply be summed up from subgroups). Traditional methods require reprocessing all historical data, which can be resource-intensive and time-consuming. Use sketches, which are compact summaries that use probabilities to represent large datasets, and advanced Query Service functions to streamline this process by reducing the need for recalculating.
Hypercubes offer several powerful functions to enhance data analysis efficiency and flexibility.
Use hypercubes to efficiently generate distinct counts for various user interactions without fully recalculating data each time. The following are some practical scenarios for their use:
In these situations, you can pre-calculate basic information for specific categories. However, when analyzing data across multiple dimensions and time periods, you have to either recalculate everything from raw data or use a Query Service hypercube. Hypercubes streamline the process by organizing data efficiently, which allows flexible filtering and multi-dimensional analysis without reprocessing. They use advanced functions to estimate results quickly and accurately to offer key benefits such as improved processing efficiency, scalability, and adaptability for complex analytical tasks.
Query Service can compress millions or billions of data points (for example, user IDs) into a compact form called a sketch. This sketch has a significantly reduced data size for query processing, which maintains scalability and makes it much easier and faster to work with. No matter how large the original data gets, the size of the sketch stays small, which makes analyzing big data much more manageable and efficient.
The diagram below illustrates how Commerce, Product Info, and Web dimension ExperienceEvents are processed into sketches, which are then used to approximate unique counts.
To avoid recalculating and enhance processing speed, you can merge sketches from different categories or groups. Query Service also simplifies the design by organizing your data into a hypercube, where each row becomes a summary of its partition (a collection of dimensions) alongside the sketch column. Each row of the hyper-cube contains the dimension combination but does not have any raw data. When executing a query, specify the dimensional columns you want to use for building additive metrics and merge the sketches for those rows.
Customer data is often large-scale, but you can eliminate the need to reprocess historical data by using incremental processing. Sketches are much smaller and allow for faster, real-time results while saving on compute resources and costs. This data transformation makes interactive queries more feasible and efficient.
This section outlines how each function optimizes data processing and enhances analytical capabilities through the efficient use of sketches and hypercubes. It details their purpose, example syntax, parameters, and expected output.
hll_build_agg
is an aggregate function that creates an HLL (HyperLogLog) sketch. This function is a compact, probabilistic method for estimating the number of unique values within a column or expression in a grouped dataset.
hll_build_agg(column [, lgConfigK])
Usage:
The following example demonstrates how the function can be structured within a query.
SELECT
[dim1, dim2 ... ,] hll_build_agg(coalesce(col1, col2, col3)) AS sketch_col
FROM fact_sketch_table
[GROUP BY dimension1, dimension2 ...]
Parameter | Description |
---|---|
column |
The column or column name on which to create a sketch. |
lgConfigK |
Int (Optional) The log-base-2 of K, where K is the number of buckets or slots for the HLL Sketch. Min value: 4. Max value: 12. Default value: 12. |
Output Column | Description |
---|---|
sketch_res |
A column of type string containing the stringified HLL sketch. |
The following example builds an aggregate sketch on the customer_id
column:
SELECT
country,
hll_build_agg(customer_id, 10) AS sketch
FROM
EXPLODE(
ARRAY<STRUCT<country STRING, customer_id STRING, invoice_id STRING>>[
('UA', 'customer_id_1', 'invoice_id_11'),
('CZ', 'customer_id_2', 'invoice_id_22'),
('CZ', 'customer_id_2', 'invoice_id_23'),
('BR', 'customer_id_3', 'invoice_id_31'),
('UA', 'customer_id_2', 'invoice_id_24')
])
GROUP BY country;
SQL example output:
Country | Sketch |
---|---|
UA | AgEHBAMAAgCR9mUEulKKCQAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA== |
CZ | AgEHBAMAAQC6UooJAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA== |
BR | AgEHBAMAAQCcmH0HAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA== |
hll_estimate
is a scalar function that provides an estimation of the distinct count within each row of a dataset. Unlike aggregate functions, hll_estimate
operates row-wise and is used for estimating the distinct count from a sketch within individual rows.
This function cannot be used as an aggregated function. For aggregated counts, use sketch_count
.
hll_estimate(sketch_col)
Usage:
The following example demonstrates how the function can be structured within a query.
SELECT
[col1, col2 ... ,] hll_estimate(sketch_column) AS estimate
FROM fact_sketch_table
Parameter | Description |
---|---|
sketch_column |
Column containing a stringified HLL sketch. It estimates the distinct count for the sketch in each row. |
Output Column | Description |
---|---|
estimate |
A column of type double that provides the estimation of the sketch, rounded to two decimal places. |
The following example estimates the distinct count of customers by country using the hll_estimate
function on an HLL sketch:
SELECT
country,
hll_estimate(hll_build_agg(customer_id, 10)) AS distinct_customers_by_country
FROM
(
SELECT
country,
hll_build_agg(customer_id, 10) AS sketch
FROM
EXPLODE(
ARRAY<STRUCT<country STRING, customer_id STRING, invoice_id STRING>>[
('UA', 'customer_id_1', 'invoice_id_11'),
('CZ', 'customer_id_2', 'invoice_id_22'),
('CZ', 'customer_id_2', 'invoice_id_23'),
('BR', 'customer_id_3', 'invoice_id_31'),
('UA', 'customer_id_2', 'invoice_id_24')
])
GROUP BY country
);
SQL Example Output:
Country | distinct_customers_by_country |
---|---|
UA | 2.00 |
CZ | 1.00 |
BR | 1.00 |
hll_merge_agg
hll_merge_agg
is an aggregate function that merges multiple HLL sketches within a group, producing a new sketch as the output. It allows the combination of sketches across partitions or dimensions, enhancing data analysis flexibility.
hll_merge_agg(sketch_col [, allowDifferentLgConfigK])
Usage:
The following example demonstrates how the function can be structured within a query.
SELECT
[dim1, dim2 ... ,] hll_merge_agg(sketch_column.sketch) AS estimate
FROM fact_sketch_table
[GROUP BY dimension1, dimension2 ...]
Parameter | Description |
---|---|
sketch_column |
Column containing the stringified HLL sketch. |
allowDifferentLgConfigK |
Boolean (Optional) If set to true, allows merging of sketches with different lgConfigK values. The default value is false. An exception is thrown if the value is false and sketches have different lgConfigK values. |
If allowDifferentLgConfigK
is set to false, merging sketches with different lgConfigK
values results in an UnsupportedOperationException
.
Output Column | Description |
---|---|
sketch_res |
A column of type HLL sketch containing the stringified merged HLL Sketch. |
The following example merges multiple HLL sketches on the customer_id
column:
SELECT
hll_merge_agg(hll_sketch) AS uniq_customers_with_invoice
FROM
(
SELECT
country,
hll_build_agg(customer_id) AS hll_sketch
FROM
EXPLODE(
ARRAY<STRUCT<country STRING, customer_id STRING, invoice_id STRING>>[
('UA', 'customer_id_1', 'invoice_id_11'),
('BR', 'customer_id_3', 'invoice_id_31'),
('CZ', 'customer_id_2', 'invoice_id_22'),
('CZ', 'customer_id_2', 'invoice_id_23'),
('BR', 'customer_id_3', 'invoice_id_31'),
('UA', 'customer_id_2', 'invoice_id_24')
])
GROUP BY country
UNION
SELECT
country,
hll_build_agg(customer_id) AS hll_sketch
FROM
EXPLODE(
ARRAY<STRUCT<country STRING, customer_id STRING, invoice_id STRING>>[
('UA', 'customer_id_1', 'invoice_id_21'),
('MX', 'customer_id_3', 'invoice_id_31'),
('MX', 'customer_id_2', 'invoice_id_21')
])
GROUP BY country
)
GROUP BY customer_id;
SQL Example Output:
Country | hll_merge_agg(sketch, true) |
---|---|
UA | AgEHDAMAAwiR9mUEulKKCQAAAAAAAAAAAAAAAA== |
CZ | AgEHDAMAAQi6UooJAAAAAAAAAAAAAAAAAAAAAAAA== |
BR | AgEHDAMAAQicmH0HAAAAAAAAAAAAAAAAAAAAAAAA== |
MX | AgEHFQMAAgiGL/kNdAAAAAAAAAAAAAAAAAAAAAAA== |
hll_merge_count_agg
hll_merge_count_agg
is an aggregate function that estimates the cardinality (number of unique elements) from one or more sketches within a column. It returns a single estimate for all sketches encountered within the grouping. This function is used for aggregating sketches and cannot be used as a row-wise transformation. For row-wise estimates, use sketch_estimate
.
hll_merge_count_agg(sketch_col [, allowDifferentLgConfigK])
Usage:
The following example demonstrates how the function can be structured within a query.
SELECT
[dim1, dim2 ... ,] hll_merge_count_agg(sketch_column) AS estimate
FROM fact_sketch_table
[GROUP BY dimension1, dimension2 ...]
Parameter | Description |
---|---|
sketch_column |
A column containing the stringified HLL sketch. |
allowDifferentLgConfigK |
Boolean (Optional) The default value is false. If set to true, it allows the merging of sketches with different lgConfigK values. Otherwise, an UnsupportedOperationException is thrown. |
Output Column | Description |
---|---|
estimate |
A column of type Double providing the estimation of the sketch. |
The following example estimates the number of unique customers with invoices using the hll_merge_count_agg
function:
SELECT
hll_merge_count_agg(hll_sketch) AS uniq_customers_with_invoice
FROM
(
SELECT
country,
hll_build_agg(customer_id) AS hll_sketch
FROM
EXPLODE(
ARRAY<STRUCT<country STRING, customer_id STRING, invoice_id STRING>>[
('UA', 'customer_id_1', 'invoice_id_11'),
('BR', 'customer_id_3', 'invoice_id_31'),
('CZ', 'customer_id_2', 'invoice_id_22'),
('CZ', 'customer_id_2', 'invoice_id_23'),
('BR', 'customer_id_3', 'invoice_id_31'),
('UA', 'customer_id_2', 'invoice_id_24')
])
GROUP BY country
UNION
SELECT
country,
hll_build_agg(customer_id) AS hll_sketch
FROM
EXPLODE(
ARRAY<STRUCT<country STRING, customer_id STRING, invoice_id STRING>>[
('UA', 'customer_id_1', 'invoice_id_21'),
('MX', 'customer_id_3', 'invoice_id_31'),
('MX', 'customer_id_2', 'invoice_id_21')
])
GROUP BY country
)
GROUP BY customer_id;
SQL Example Output:
Country | hll_merge_count_agg(sketch, true) |
---|---|
UA | 2.0 |
CZ | 1.0 |
BR | 1.0 |
MX | 2.0 |
Currently, sketches cannot be updated once created. Future updates will introduce the capability to update sketches. With this functionality you can handle missed runs and late-arriving data more effectively.
By reading this document, you now know how to use hypercubes and associated sketch functions to perform efficient data processing for complex, multi-dimensional analyses without the need to reprocess historical data. This approach saves time, reduces costs, and offers the flexibility required for real-time, interactive queries, making it a valuable tool for big data analysis in Adobe Experience Platform.
Next, explore other key concepts such as incremental loading and data dedupliction to deepen your understanding of how to use these functions effectively for your specific data needs.