Data Science Workspace is no longer available for purchase.
This documentation is intended for existing customers with prior entitlements to Data Science Workspace.
Adobe Experience Platform allows you to use Structured Query Language (SQL) in Data Science Workspace by integrating Query Service into JupyterLab as a standard feature.
This tutorial demonstrates sample SQL queries for common use cases to explore, transform, and analyze Adobe Analytics data.
Before starting this tutorial, you must have the following prerequisites:
Access to Adobe Experience Platform. If you do not have access to an organization in Experience Platform, please speak to your system administrator before proceeding.
An Adobe Analytics dataset
A working understanding of the following key concepts used in this tutorial:
In Experience Platform, navigate to Notebooks from the left navigation column. Allow a moment for JupyterLab to load.
If a new Launcher tab does not appear automatically, open one by clicking File and then selecting New Launcher.
In the Launcher tab, click the Blank icon in a Python 3 environment to open an empty notebook.
Python 3 is currently the only supported environment for Query Service in notebooks.
On the left selection rail, click the Data icon and double-click the Datasets directory to list all datasets.
Find an Adobe Analytics dataset to explore, right-click its listing, and then click Query Data in Notebook to generate SQL queries in the empty notebook.
Click the first generated cell containing the function qs_connect() and execute it by clicking the play button. This function creates a connection between your notebook instance and Query Service.
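For reference, the generated connection cell contains a call along the lines of the following sketch (the exact generated code may differ in your environment):
# Generated by the Query Data in Notebook action; running it opens the
# Query Service connection used by the %%read_sql cells that follow.
qs_connect()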
Copy down the Adobe Analytics dataset name from the second generated SQL query; it is the value after the FROM keyword.
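For illustration only, in a generated query shaped like the following hypothetical sketch, the dataset name is the placeholder value that follows FROM:
%%read_sql -c QS_CONNECTION
-- Hypothetical sketch; your generated query and dataset name will differ
SELECT * FROM your_adobe_analytics_dataset_name LIMIT 10;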
Insert a new notebook cell by clicking the + button.
Copy, paste, and execute the following import statements in a new cell. These statements will be used to visualize your data:
import plotly.plotly as py        # Plotly's online plotting module (not used directly by the examples below)
import plotly.graph_objs as go    # graph objects used to build figures
from plotly.offline import iplot  # renders figures inline in the notebook
Next, copy and paste the following variables in a new cell. Modify their values as needed, then execute them.
target_table = "your Adobe Analytics dataset name"
target_year = "2019"
target_month = "04"
target_day = "01"
target_table: Name of your Adobe Analytics dataset.
target_year: The year the target data is from.
target_month: The month the target data is from.
target_day: The day the target data is from.
You can change these values at any time. When doing so, be sure to execute the variables cell for the changes to be applied.
Enter the following SQL queries in individual notebook cells. Execute a query by selecting its cell and then selecting the play button. Successful query results or error logs are displayed below the executed cell.
When a notebook is inactive for an extended period of time, the connection between the notebook and Query Service may break. In such cases, restart JupyterLab by selecting the Restart button located in the top-right corner next to the power button.
The notebook kernel resets, but the cells remain. Re-run all the cells to continue where you left off.
The following query returns the hourly visitor count for a specified date:
%%read_sql hourly_visitor -c QS_CONNECTION
SELECT Substring(timestamp, 1, 10) AS Day,
Substring(timestamp, 12, 2) AS Hour,
Count(DISTINCT concat(enduserids._experience.aaid.id,
_experience.analytics.session.num)) AS Visit_Count
FROM {target_table}
WHERE TIMESTAMP = to_timestamp('{target_year}-{target_month}-{target_day}')
GROUP BY Day, Hour
ORDER BY Hour;
In the above query, the timestamp in the WHERE clause is constructed from the target_year, target_month, and target_day variables. Include variables in SQL queries by enclosing them in curly brackets ({}).
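As an illustration of how the substitution works, the following plain-Python sketch mimics what the curly-bracket placeholders do (you do not need to run this; it only demonstrates the idea):
# Hypothetical illustration: the {target_table} placeholder is replaced with
# the value of the variable defined earlier, much like Python string formatting.
query_template = "SELECT * FROM {target_table} LIMIT 10"
print(query_template.format(target_table=target_table))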
The first line of the query contains the optional variable hourly_visitor. Query results will be stored in this variable as a Pandas dataframe. Storing results in a dataframe allows you to later visualize the query results using a desired Python package. Execute the following Python code in a new cell to generate a bar graph:
# Build a bar trace from the hourly_visitor dataframe returned by the query
trace = go.Bar(
    x = hourly_visitor['Hour'],
    y = hourly_visitor['Visit_Count'],
    name = "Visitor Count"
)
# Configure the chart title, size, and axis labels
layout = go.Layout(
    title = 'Visit Count by Hour of Day',
    width = 1200,
    height = 600,
    xaxis = dict(title = 'Hour of Day'),
    yaxis = dict(title = 'Count')
)
fig = go.Figure(data = [trace], layout = layout)
# Render the figure inline in the notebook
iplot(fig)
The following query returns the hourly actions count for a specified date:
%%read_sql hourly_actions -d -c QS_CONNECTION
SELECT Substring(timestamp, 1, 10) AS Day,
Substring(timestamp, 12, 2) AS Hour,
Count(concat(enduserids._experience.aaid.id,
_experience.analytics.session.num,
_experience.analytics.session.depth)) AS Count
FROM {target_table}
WHERE TIMESTAMP = to_timestamp('{target_year}-{target_month}-{target_day}')
GROUP BY Day, Hour
ORDER BY Hour;
Executing the above query will store the results in hourly_actions
as a dataframe. Execute the following function in a new cell to preview the results:
hourly_actions.head()
The above query can be modified to return the hourly actions count for a specified date range using logical operators in the WHERE clause:
%%read_sql hourly_actions_date_range -d -c QS_CONNECTION
SELECT Substring(timestamp, 1, 10) AS Day,
Substring(timestamp, 12, 2) AS Hour,
Count(concat(enduserids._experience.aaid.id,
_experience.analytics.session.num,
_experience.analytics.session.depth)) AS Count
FROM {target_table}
WHERE timestamp >= TO_TIMESTAMP('2019-06-01 00', 'YYYY-MM-DD HH')
AND timestamp <= TO_TIMESTAMP('2019-06-02 23', 'YYYY-MM-DD HH')
GROUP BY Day, Hour
ORDER BY Hour;
Executing the modified query stores the results in hourly_actions_date_range
as a dataframe. Execute the following function in a new cell to preview the results:
hourly_actions_date_range.head()
The following query returns the number of events per visitor session for a specified date:
%%read_sql events_per_session -c QS_CONNECTION
SELECT concat(enduserids._experience.aaid.id,
'-#',
_experience.analytics.session.num) AS aaid_sess_key,
Count(timestamp) AS Count
FROM {target_table}
WHERE TIMESTAMP = to_timestamp('{target_year}-{target_month}-{target_day}')
GROUP BY aaid_sess_key
ORDER BY Count DESC;
Execute the following Python code to generate a histogram for the number of events per visit session:
# Build a histogram of the number of events per visit session
data = [go.Histogram(x = events_per_session['Count'])]
layout = go.Layout(
    title = 'Histogram of Number of Events per Visit Session',
    xaxis = dict(title = 'Number of Events'),
    yaxis = dict(title = 'Count')
)
fig = go.Figure(data = data, layout = layout)
iplot(fig)
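To complement the histogram, you can also inspect summary statistics of the per-session event counts with standard pandas methods, since events_per_session is a dataframe:
# Summary statistics (count, mean, quartiles, and so on) of events per visit session
events_per_session['Count'].describe()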
The following query returns the ten most popular pages for a specified date:
%%read_sql popular_pages -c QS_CONNECTION
SELECT web.webpagedetails.name AS Page_Name,
Sum(web.webpagedetails.pageviews.value) AS Page_Views
FROM {target_table}
WHERE TIMESTAMP = to_timestamp('{target_year}-{target_month}-{target_day}')
GROUP BY web.webpagedetails.name
ORDER BY page_views DESC
LIMIT 10;
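As with the earlier queries, the results are stored in the popular_pages dataframe. If you would like to visualize them, the following minimal sketch reuses the plotly imports from earlier in this tutorial (the column names assume the aliases used in the query above):
# Bar chart of page views for the ten most popular pages
trace = go.Bar(
    x = popular_pages['Page_Name'],
    y = popular_pages['Page_Views'],
    name = "Page Views"
)
layout = go.Layout(
    title = 'Ten Most Popular Pages',
    xaxis = dict(title = 'Page Name'),
    yaxis = dict(title = 'Page Views')
)
fig = go.Figure(data = [trace], layout = layout)
iplot(fig)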
The following query returns the ten most active users for a specified date:
%%read_sql active_users -c QS_CONNECTION
SELECT enduserids._experience.aaid.id AS aaid,
Count(timestamp) AS Count
FROM {target_table}
WHERE TIMESTAMP = to_timestamp('{target_year}-{target_month}-{target_day}')
GROUP BY aaid
ORDER BY Count DESC
LIMIT 10;
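The results are stored in the active_users dataframe. Execute the following function in a new cell to preview them:
active_users.head(10)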
The following query returns the ten cities generating the most user activity for a specified date:
%%read_sql active_cities -c QS_CONNECTION
SELECT concat(placeContext.geo.stateProvince, ' - ', placeContext.geo.city) AS state_city,
Count(timestamp) AS Count
FROM {target_table}
WHERE TIMESTAMP = to_timestamp('{target_year}-{target_month}-{target_day}')
GROUP BY state_city
ORDER BY Count DESC
LIMIT 10;
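Similarly, the results are stored in the active_cities dataframe and can be previewed in a new cell:
active_cities.head(10)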
This tutorial demonstrated some sample use cases for using Query Service in Jupyter notebooks. Follow the Analyze your data using Jupyter Notebooks tutorial to see how similar operations can be performed using the Data Access SDK.