The derived datasets feature provides a convenient way to generate datasets of your choice from other information available in the data lake. These datasets can be refreshed at a regular cadence and optionally published into your Real-Time Customer Profile data. Derived datasets address the need to build complex datasets, such as deciles, percentiles, and quartiles, on top of simpler aggregates such as max, count, and mean. These datasets can be calculated for an individual user or for a business entity. This enables you to derive datasets that can be directly attributed to an identifier, such as an email address, device ID, or phone number, and also to derive datasets that are indirectly associated with that user or business profile.
Derived datasets are needed for a variety of use cases when data is being analyzed on the data lake. This data can then be marked for use in Real-Time Customer Profile and used in downstream use cases such as creating highly focused audiences. Some potential use cases for this feature might include:
To create a ranking based on one or more metrics (such as revenue, viewership duration, and so on) on a particular dimension (category), complex derived datasets are required. Deciles, quartiles, and percentiles allow flexibility and precision when ranking data with derived datasets.
Deciles split a set of ranked data into 10 equal parts. When the data is divided into deciles, a decile rank is assigned to each row in the dataset. This allows the data to be sorted into descending or ascending order.
Decile ranks order the data from lowest to highest on a scale of 1 to 10, where each successive number corresponds to an increase of 10 percentage points.
Decile buckets represent the number of ranked groups and are used to assign a ranking to a dimension (category) in the dataset. The bucket can be a number or an expression that evaluates to a positive integer value for each partition. The buckets must not have a null value.
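The bucket assignment described above can be illustrated with a minimal Python sketch. It is not the Query Service implementation: it assumes behavior similar to the standard SQL `NTILE` window function, and the names `ntile` and `rank_by_category` and the `(category, key, metric)` row shape are hypothetical.

```python
from collections import defaultdict


def ntile(values, buckets=10):
    """Assign each value a bucket rank from 1 to `buckets`, lowest values first.

    Assumption: rows are spread as evenly as possible, with any remainder
    going to the lowest-numbered buckets (as SQL NTILE does).
    """
    if not isinstance(buckets, int) or buckets <= 0:
        raise ValueError("buckets must be a positive integer")
    # Indices of the values in ascending order of value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    base, extra = divmod(len(values), buckets)
    ranks = [0] * len(values)
    position = 0
    for bucket in range(1, buckets + 1):
        size = base + (1 if bucket <= extra else 0)
        for i in order[position:position + size]:
            ranks[i] = bucket
        position += size
    return ranks


def rank_by_category(rows, buckets=10):
    """Rank a metric within each category (partition).

    `rows` is an iterable of (category, key, metric) tuples.
    Returns a dict mapping (category, key) to a bucket rank.
    """
    partitions = defaultdict(list)
    for category, key, metric in rows:
        partitions[category].append((key, metric))
    result = {}
    for category, members in partitions.items():
        ranks = ntile([metric for _, metric in members], buckets)
        for (key, _), rank in zip(members, ranks):
            result[(category, key)] = rank
    return result
```

Calling `ntile(revenues)` with the default of 10 buckets yields decile ranks. Passing a different positive integer for `buckets` changes the number of ranked groups.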
Quartiles and percentiles work the same way: quartiles divide the distribution into four equal parts, and percentiles divide it into 100.
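As a rough illustration of how the number of parts changes the result, the sketch below uses Python's standard `statistics.quantiles` function to compute the boundaries between groups. The helper name `cut_points` and the sample data are hypothetical, and the interpolation method is an assumption chosen for the example, not a statement about how Query Service computes quantiles.

```python
from statistics import quantiles


def cut_points(values, parts):
    """Return the `parts - 1` boundaries that split `values` into `parts` groups."""
    return quantiles(values, n=parts, method="inclusive")


viewing_minutes = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical sample data

quartile_cuts = cut_points(viewing_minutes, 4)      # 3 boundaries
decile_cuts = cut_points(viewing_minutes, 10)       # 9 boundaries
percentile_cuts = cut_points(viewing_minutes, 100)  # 99 boundaries
```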
Query Service provides built-in functions, such as sessionization and last touch, that you can apply to any time-series data to generate business-related derived datasets. You can base these analytical derived datasets on one or more identities and optionally publish the data to Real-Time Customer Profile.
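To make the idea of sessionization concrete, here is a minimal Python sketch that groups one identity's events into sessions based on inactivity. It is an illustration only, not the Query Service built-in function. The function name `sessionize` and the 30-minute timeout are assumptions made for the example.

```python
from datetime import datetime, timedelta


def sessionize(timestamps, timeout=timedelta(minutes=30)):
    """Return a session number (starting at 1) for each timestamp, in sorted order.

    Assumption: a new session starts when the gap since the previous
    event exceeds `timeout`.
    """
    session_ids = []
    session = 0
    previous = None
    for ts in sorted(timestamps):
        if previous is None or ts - previous > timeout:
            session += 1
        session_ids.append(session)
        previous = ts
    return session_ids


# Hypothetical events for a single identity.
events = [
    datetime(2024, 1, 1, 9, 0),
    datetime(2024, 1, 1, 9, 10),
    datetime(2024, 1, 1, 10, 0),
    datetime(2024, 1, 1, 10, 5),
]
sessions = sessionize(events)
```

With these sample events, the 50-minute gap between the second and third event starts a second session.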
Some potential use cases for this type of derived attribute might include:
You are also able to calculate business metrics as a derived attribute and use them in conjunction with simple datasets such as zip code or an aggregated metric such as total count. For example, a total count based on a city or province, or total count based on a business category and a city/province.
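The aggregated counts mentioned above can be sketched in a few lines of Python. The field names (`city`, `category`) and the helper `count_by` are hypothetical and stand in for whatever dimensions your own datasets contain.

```python
from collections import Counter


def count_by(records, *fields):
    """Count records grouped by the given fields.

    Returns a Counter keyed by a tuple of the field values.
    """
    return Counter(tuple(record[field] for field in fields) for record in records)


# Hypothetical business records.
records = [
    {"city": "Austin", "category": "retail"},
    {"city": "Austin", "category": "retail"},
    {"city": "Austin", "category": "dining"},
    {"city": "Boston", "category": "retail"},
]

per_city = count_by(records, "city")
per_category_and_city = count_by(records, "category", "city")
```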
After reading this document, you should have a better understanding of how Query Service derived datasets support complex use cases and help you get more value from your data. Next, read the decile-based derived attribute use case to see how this feature is applied in a real-world scenario.