Data Science Workspace is no longer available for purchase.
This documentation is intended for existing customers with prior entitlements to Data Science Workspace.
The following document contains examples on how to access data using Python for use in Data Science Workspace. For information on accessing data using JupyterLab notebooks, visit the JupyterLab notebooks data access documentation.
After setting the environment variables and completing installation, your dataset can now be read into the pandas dataframe.
import pandas as pd
from .utils import get_client_context
from platform_sdk.dataset_reader import DatasetReader
def load(config_properties):
client_context = get_client_context(config_properties)
dataset_reader = DatasetReader(client_context, config_properties['DATASET_ID'])
df = dataset_reader.read()
df = dataset_reader.select(['column-a','column-b']).read()
client_context = get_client_context(config_properties)
dataset = Dataset(client_context).get_by_id({DATASET_ID})
partitions = dataset.get_partitions_info()
The DISTINCT clause allows you to fetch all the distinct values at a row/column level, removing all duplicate values from the response.
An example of using the distinct()
function can be seen below:
df = dataset_reader.select(['column-a']).distinct().read()
You can use certain operators in Python to help filter your dataset.
The functions used for filtering are case sensitive.
eq() = '='
gt() = '>'
ge() = '>='
lt() = '<'
le() = '<='
And = and operator
Or = or operator
An example of using these filtering functions can be seen below:
df = dataset_reader.where(experience_ds['timestamp'].gt(87879779797).And(experience_ds['timestamp'].lt(87879779797)).Or(experience_ds['a'].eq(123)))
The ORDER BY clause allows received results to be sorted by a specified column in a specific order (ascending or descending). This is done by using the sort()
function.
An example of using the sort()
function can be seen below:
df = dataset_reader.sort([('column_1', 'asc'), ('column_2', 'desc')])
The LIMIT clause allows you to limit the number of records received from the dataset.
An example of using the limit()
function can be seen below:
df = dataset_reader.limit(100).read()
The OFFSET clause allows you to skip rows, from the beginning, to start returning rows from a later point. In combination with LIMIT, this can be used to iterate rows in blocks.
An example of using the offset()
function can be seen below:
df = dataset_reader.offset(100).read()
To write to a dataset, you need to supply the pandas dataframe to your dataset.
client_context = get_client_context(config_properties)
# To fetch existing dataset
dataset = Dataset(client_context).get_by_id({DATASET_ID})
dataset_writer = DatasetWriter(client_context, dataset)
write_tracker = dataset_writer.write(<your_dataFrame>, file_format='json')
For longer running jobs, you may need to store intermediate steps. In instances like this, you can read and write to a userspace.
Paths to the data are not stored. You need to store the corresponding path to its respective data.
client_context = get_client_context(config_properties)
user_helper = UserSpaceHelper(client_context)
user_helper.write(data_frame=<data_frame>, path=<path_to_directory>, ref_dataset_id=<ref_dataset_id>)
client_context = get_client_context(config_properties)
user_helper = UserSpaceHelper(client_context)
my_df = user_helper.read(path=<path_to_directory>, ref_dataset_id=<ref_dataset_id>)
Adobe Experience Platform Data Science Workspace provides a recipe sample that uses the above code samples to read and write data. If you want to learn more about how to use Python for accessing your data, please review the Data Science Workspace Python GitHub Repository.