Create a Microsoft Azure dataset for Ingest Processor pipelines

Create a Microsoft Azure dataset in the Data Management app to define the Azure storage container that your pipelines send data to.

Note: In the Controlled Availability release stage, Splunk products may have limitations on customer access, features, maturity, and regional availability. For additional information on Controlled Availability, contact your Splunk representative.

To send data from Ingest Processor to an Azure Blob Storage container or an Azure Data Lake Storage container, you must create a Microsoft Azure dataset in the Data Management app on Splunk Cloud Platform. You can then use the dataset as a pipeline destination.

You can optionally configure the dataset to also support federated searches, so that you can use the same dataset to write data to and read data from Microsoft Azure.

The dataset uses a Microsoft Azure connection for authentication. You can create multiple datasets that use the same connection.

  1. In Splunk Cloud Platform, select Data Management from the Apps panel.
  2. Navigate to the Datasets page, and then select Create dataset.
  3. On the Select data store page, select Microsoft Azure, then select Next.
  4. On the Configure connection page, do one of the following:
    • If you have already created the necessary Microsoft Azure connection, select it from the Associated connection drop-down list and then select Next.
    • If you have not created the connection yet, select Create connection. You are prompted to navigate away from the current screen to create the connection. See Create a Microsoft Azure connection for Ingest Processor pipelines for more information.
  5. On the Define dataset page, configure the following options, and then select Next:
    Option name Configuration instructions
    Dataset name

    Enter a unique name for your dataset.

    Dataset description

    (Optional) Enter a description for your dataset.

    Azure container URL

    Enter the URL of the Azure storage container that you want to send data to.

    This URL must include the path to a directory in the container, and it cannot end in a file name. The format of a valid Azure container URL value is as follows: https://storage_account_name.blob.core.windows.net/container_name/path_to_directory

    Is your storage account hierarchical or flat?

    Specify whether you are using this dataset to send data to Azure Data Lake Storage (hierarchical namespace) or Azure Blob Storage (flat namespace).

    Usage

    Select Data routing and federated search.

    Note: If you set Usage to Federated search, then the dataset cannot be used as a pipeline destination and can only be used in federated searches.
  6. On the Configure dataset page, configure the following Data routing options:
    Option name Configuration instructions
    Output schema

    Select Pipeline output.

    Note: Avoid selecting Splunk HTTP Event Collector (HEC), especially if you intend to run federated searches on this dataset. Schema inference does not always work as expected on events that use the HEC schema.
    Output format

    Select the file format that you want to use to store your data in the Azure container.

    If you select Parquet, be aware that the following limitations apply:

    • For best results, process and route your data using a pipeline that's created from a template instead of a custom-configured pipeline. Pipeline templates can ensure that the schema of the resulting Parquet output is compatible with federated searches.

    • If you use a custom-configured pipeline that changes the schema of the events, the dataset formats each event according to the HEC event schema in order to produce Parquet output that is at least partially compatible with federated searches. The resulting events contain the top-level fields described in Event metadata in the Splunk Cloud Platform Get Data In manual.

    Compression type

    Select the compression format for your data.

    File name prefix

    (Optional) Enter a prefix for the name of the file that contains your output data in Azure.

    By default, file names are 32-digit UUIDs (universally unique identifiers) that are autogenerated by the system. You can make the file name more human-readable by adding a prefix. For example, if the autogenerated UUID is 2f0ff66a-e87a-4af5-befb-18dcafa6012f, then entering financial-report in the File name prefix field changes the resulting file name to financial-report-2f0ff66a-e87a-4af5-befb-18dcafa6012f.

  7. (Optional) To adjust the maximum number of events that this dataset can send in each batch of output data, expand Advanced settings and enter your desired maximum number of events in the Batch size field.
    Note: In most cases, the default Batch size value is sufficient. The actual size of each batch can vary depending on the rate at which Ingest Processor sends out data.
  8. Configure the following Federated search options:
    Option name Configuration instructions
    Activate

    Toggle this switch to either allow or disallow federated searches for this dataset.

    Queue URL

    (Optional) If federated searches are allowed for this dataset, you can arrange for the Splunk-managed data catalog to be updated automatically whenever data is added to or removed from the dataset.

    • To configure automated catalog updates, you first must create a queue in Azure Queue Storage that receives notifications, as well as an Azure Event Grid system topic that forwards blob lifecycle events from your Azure Storage Account to the queue. Then, set this option to Yes and enter the URL of the queue from Azure Queue Storage in the Queue URL field. For detailed instructions, see Ensure the Microsoft Azure dataset and its data catalog stay in sync in the Splunk Cloud Platform Federated Search manual.

    • If you don't want to configure automated catalog updates, then set this option to No.

  9. Select Next.
  10. On the Review page, ensure that all the entered information is correct, and then select Create Dataset to create your dataset.
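
The Azure container URL rules in step 5 can be checked before you enter the value. The following Python sketch is illustrative only and is not part of any Splunk or Azure tooling. The function name check_container_url is hypothetical, the host check assumes the blob.core.windows.net format shown in step 5, and treating a final path segment that has a file extension as a file name is a heuristic rather than a documented rule.

```python
import posixpath
from urllib.parse import urlparse

# Host suffix taken from the URL format shown in step 5.
BLOB_HOST_SUFFIX = ".blob.core.windows.net"


def check_container_url(url: str) -> list:
    """Return a list of problems; an empty list means the URL fits the pattern."""
    problems = []
    parsed = urlparse(url)
    host = parsed.hostname or ""

    if parsed.scheme != "https":
        problems.append("URL must start with https://")

    # Expect <storage_account_name>.blob.core.windows.net
    account = host[: -len(BLOB_HOST_SUFFIX)] if host.endswith(BLOB_HOST_SUFFIX) else ""
    if not account:
        problems.append("Host must look like storage_account_name" + BLOB_HOST_SUFFIX)

    # The path must contain a container name plus at least one directory.
    segments = [s for s in parsed.path.split("/") if s]
    if len(segments) < 2:
        problems.append("Path must include container_name/path_to_directory")
    elif posixpath.splitext(segments[-1])[1]:
        # Heuristic (assumption): a trailing segment with an extension is a file.
        problems.append("URL appears to end in a file name")

    return problems
```

For example, check_container_url("https://myaccount.blob.core.windows.net/logs/ingest") returns an empty list, while a URL that stops at the container name or ends in something like data.json returns one problem. The storage account, container, and directory names here are placeholders.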

You now have a Microsoft Azure dataset that can access the data in your Azure container.
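
To preview how the File name prefix option from step 6 affects output file names, the naming pattern can be sketched in Python. This is an illustration of the documented example only: the function output_file_name is hypothetical, and uuid.uuid4() merely stands in for the system's autogenerated UUID, because the UUID version that the system uses is not stated.

```python
import uuid
from typing import Optional


def output_file_name(prefix: str = "", file_id: Optional[uuid.UUID] = None) -> str:
    """Combine an optional prefix with a UUID, as in the File name prefix example."""
    # The real UUID is autogenerated by the system; uuid4() is a stand-in (assumption).
    file_id = file_id or uuid.uuid4()
    # With no prefix, the file name is just the UUID.
    return f"{prefix}-{file_id}" if prefix else str(file_id)


example_id = uuid.UUID("2f0ff66a-e87a-4af5-befb-18dcafa6012f")
print(output_file_name("financial-report", example_id))
# prints financial-report-2f0ff66a-e87a-4af5-befb-18dcafa6012f
```

Choosing a prefix that identifies the pipeline or data source makes it easier to tell output files apart when several datasets write to the same directory.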

To send data from Ingest Processor to your Azure container, create a pipeline that uses the Microsoft Azure dataset as a destination. Then, apply the pipeline to Ingest Processor. For more information, see the following pages:

For information about running federated searches on Microsoft Azure datasets, see Run federated searches over Microsoft Azure datasets in the Splunk Cloud Platform Federated Search manual.