Extracting data from SC Navigator into Databricks

This article describes how you can get output data from your AIMMS SC Navigator cloud account into your own Databricks environment. You can download the example Jupyter notebook here and use the template directly in your Databricks environment, or use it locally if you have Python installed.

What to do in AIMMS

Each AIMMS SC Navigator cloud account is by default equipped with an Azure Data Lake Storage Gen2 (ADLS) account. For this route we will use external access to this storage account to retrieve the exported data. It is therefore crucial that, in the Save Results screen, you check the option to also export the results to the ADLS:

[Image: aimms-databricks-1.png - the Save Results screen with the ADLS export option checked]

The results will be saved in the parquet file format in the export folder of the ADLS.

You will also need to create a SAS token to allow access to the files in the export folder of the ADLS. You can find instructions on how to do this on the page SAS Tokens.
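For reference, the URL that is generated together with the SAS token points at the sc-navigator-export container and, assuming the standard ADLS Gen2 endpoint format, looks like the hypothetical placeholder in the sketch below. You can use a check like this to verify the value you copied:

# quick sanity check on the copied URL (the storage account name is a hypothetical placeholder)
url = "https://<your-storage-account>.dfs.core.windows.net/sc-navigator-export"
assert url.endswith("/sc-navigator-export"), "the export URL should point at the sc-navigator-export container"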

Configuration in Databricks

You can download the full Jupyter notebook example and replace the 'TODO' items in the notebook with your own variables. Below we describe the code fragments and how they can be used.

In Databricks you will use Python packages for the connection to the ADLS. First, these need to be installed:

%pip install azure-storage-file-datalake azure-identity

Then restart Python to apply them:

# restart python to apply installed packages
dbutils.library.restartPython()

Then you'll need to provide the SAS token and the storage URL, both of which are generated and shown in the previous step when creating the SAS token:

# TODO: provide the sas token and the URL of your storage
account_url_full = "<put here your URL>"
# remove the container path ("/sc-navigator-export", 20 characters) from the URL
account_url = account_url_full[:-20]
sas_token = "<put here your sas token>"
# the file system is fixed: this is the container SC Navigator writes its results to
file_system = "sc-navigator-export"

You'll also need to import the required packages:

import os
from azure.storage.filedatalake import (
    DataLakeServiceClient,
    DataLakeDirectoryClient,
    FileSystemClient
)
from azure.identity import DefaultAzureCredential

And create a DataLakeServiceClient to access the storage:

# create a DataLakeServiceClient to access the storage
dsc = DataLakeServiceClient(account_url=account_url, credential=sas_token)

You can now list all available scenarios on the storage:

# list all available scenarios on the storage
# note that a scenario needs to be saved from SC Navigator with data lake storage enabled before it becomes visible!
file_system_client = dsc.get_file_system_client(file_system=file_system)
paths = file_system_client.get_paths()
list_of_scenarios = list()
for path in paths:
    if "/" not in path.name and path.name != "info.txt":
        print(path.name + '\n')
        list_of_scenarios.append(path.name)

If this code block gives an error, please check the provided URL and the SAS token (the token needs to be valid and must not have expired). For further reference, see the Microsoft documentation: https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-directory-file-acl-python?tabs=azure-ad
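A common cause of authentication errors is an expired token. As a quick check, you can inspect the "se" (expiry) parameter that is part of every SAS token; a minimal sketch using only the standard library:

# quick check whether the SAS token has expired (the "se" parameter holds the expiry time in UTC)
from datetime import datetime, timezone
from urllib.parse import parse_qs

expiry_values = parse_qs(sas_token.lstrip("?")).get("se")
if expiry_values:
    expiry = datetime.fromisoformat(expiry_values[0].replace("Z", "+00:00"))
    print("SAS token expired" if expiry < datetime.now(timezone.utc) else "SAS token valid until " + str(expiry))
else:
    print("could not find an expiry ('se') parameter in the SAS token")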

Results on the ADLS are stored per scenario. In the next code fragment you can decide which scenarios you actually want to load; if you do not specify any, all scenarios will be loaded:

# the results are stored per scenario. You can change which scenarios to load by changing the list_of_scenarios variable.
# list_of_scenarios = ["<scenario_1>", "<scenario_2>"]
# this code creates a directory client per scenario to load its files
dictionary_of_scenario_clients = {}
for scenario in list_of_scenarios:
    client = dsc.get_directory_client(file_system=file_system, directory=scenario)
    dictionary_of_scenario_clients[scenario] = client
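If you want to inspect which files a scenario contains before loading everything, you can list the paths of that scenario's directory; a small optional sketch, reusing the file_system_client from the listing step above:

# optional: list the files of each selected scenario (the directory name equals the scenario name)
for scenario in list_of_scenarios:
    print(scenario)
    for path in file_system_client.get_paths(path=scenario):
        if not path.is_directory:
            print("  " + path.name.split("/")[-1])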

And subsequently load the files for the selected scenarios:

# load the files for the selected scenarios
import io
import pandas as pd
from collections import defaultdict

results = defaultdict(list)
paths = file_system_client.get_paths()
for path in paths:
    if "/" in path.name:
        scenario = path.name.split("/")[0]
        file_name = path.name.split("/")[1]
        if scenario in list_of_scenarios:
            try:
                # use the directory client of this scenario to download the file
                file = dictionary_of_scenario_clients[scenario].get_file_client(file_name)
                data_file = file.download_file()
                data_binary = data_file.readall()
                parquet_file = io.BytesIO(data_binary)
                df = pd.read_parquet(parquet_file)
                df["scenario_id"] = scenario
                results[file_name.split(".")[0]].append(df)
            except Exception:
                print(file_name + " could not be loaded into a dataframe")

Then it is useful to combine the list of dataframes into one dataframe per file type:

# combine the list of dataframes into one dataframe per file type
dataframes_for_visualization = dict()
for k, v in results.items():
    df = pd.concat(v)
    dataframes_for_visualization[k] = df
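To verify what has been loaded, it can help to print the size of the combined dataframe per file type (a small optional check, not part of the downloadable notebook):

# optional: inspect the size of the combined dataframe per file type
for file_type, df in dataframes_for_visualization.items():
    print(f"{file_type}: {df.shape[0]} rows, {df.shape[1]} columns")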

To list all available file types:

# list all available file types
[k for k, v in dataframes_for_visualization.items()]

In this notebook we now select the end-to-end report for all scenarios. In a similar way, you can select and save other data files.

# in this example we use the end-to-end-report only but other data can be processed in a similar manner
df_e2e = dataframes_for_visualization["end-to-end-report"]
display(df_e2e)

Then you'll need to do some final post-processing so the data can be stored as a Hive table: Spark cannot handle spaces or brackets in column names, and these are present in the exported Parquet files:

# do some final post-processing so the data can be stored as a hive table
# spark cannot handle spaces or brackets in column names
import numpy as np
df_e2e.columns = [x.replace(" ", "_") for x in df_e2e.columns]
df_e2e.columns = [x.replace("(", "") for x in df_e2e.columns]
df_e2e.columns = [x.replace(")", "") for x in df_e2e.columns]
df_e2e = df_e2e.replace({None: np.nan})
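If you want to process other exported files besides the end-to-end report, the same cleanup can be wrapped in a small helper. A sketch of such a (hypothetical) helper, using a regular expression instead of three separate replacements:

# hypothetical helper that applies the same cleanup to any exported dataframe
import re
import numpy as np

def clean_columns_for_spark(df):
    # replace spaces with underscores and drop brackets from the column names
    df = df.copy()
    df.columns = [re.sub(r"[()]", "", x).replace(" ", "_") for x in df.columns]
    return df.replace({None: np.nan})

Calling clean_columns_for_spark(dataframes_for_visualization["end-to-end-report"]) then gives the same result as the fragment above.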

In the last step we actually save the results:

# save to databricks hive storage
# you can choose to use another destination, for example csv files or a database
df_spark = spark.createDataFrame(df_e2e)
df_spark.write.saveAsTable("SC_Navigator_e2e")
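Note that, by default, saveAsTable fails when the table already exists, which happens as soon as you re-run the notebook; and, as mentioned in the comment, other destinations are possible. A sketch of both variations (the CSV output path is a hypothetical example):

# overwrite the table on subsequent runs instead of failing
df_spark.write.mode("overwrite").saveAsTable("SC_Navigator_e2e")

# alternative: write the results as csv files to a (hypothetical) DBFS location
df_spark.write.mode("overwrite").option("header", True).csv("dbfs:/tmp/sc_navigator/e2e_report")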

That's it! With the downloadable example Jupyter notebook you will have a quick start for getting your SC Navigator data into your Databricks environment. From Databricks you can easily load the data into Power BI or other visualization tools.