Configure Microsoft Azure dataset details

Provide information about how Splunk software creates the Splunk-native data catalog that facilitates federated searches of your Microsoft Azure dataset.

Note: In the Controlled Availability release stage, Splunk products may have limitations on customer access, features, maturity, and regional availability. For additional information on Controlled Availability, contact your Splunk representative.

To run federated searches over a Microsoft Azure dataset, Splunk software requires that the dataset be backed by a data catalog that Splunk software creates for it. In this step, you determine how this data catalog is created and managed.

You decide whether the data catalog schema is created manually or inferred automatically by a crawler. Optionally, you specify whether the data catalog is automatically kept in sync with your dataset as it changes. You provide time field information if your dataset contains time-series data and you want to use time fields in your searches. You also provide partition field information as necessary to facilitate efficient federated searches.

  1. On the Configure dataset step of the Create dataset workflow, identify which of the following non-table formats your data is stored in: Parquet, CSV, or JSON.
  2. If your data is in CSV or JSON format, indicate whether the data is compressed with Gzip or is Uncompressed.
  3. (Optional) If this dataset is updated on an ongoing basis, and you want Splunk software to keep your Splunk-native data catalog in sync with the Microsoft Azure dataset it represents, set up the following things in the Microsoft Azure Portal:
    • An Azure Storage Queue that receives messages from downstream consumers.
    • An Azure Event Grid system topic scoped to your Azure Storage Account, with a subscription that forwards blob lifecycle events (created, deleted, and so on) to the Azure Storage Queue.

    For detailed setup instructions, see Ensure the Microsoft Azure dataset and its data catalog stay in synch with each other.

    After you set up the Azure Storage Queue, retrieve its URL and enter it in the Queue URL field.

    Note: Skip this step if your dataset is composed of historical data that is not subject to future updates, or if you are not interested in keeping your data catalog in sync with changes to your dataset.
  4. Indicate how you want to define the schema for your data catalog.
    • Select Define schema manually if you want to manually determine the columns in your dataset. You can use a Field list view or JSON view. If you select JSON view, your input must match the JSON data schema (dataSchema). For more information, see JSON standards for the data and partition schemas.
    • Select Discover schema via crawler if you want to have a crawler scan a sample set of files from your dataset to infer the overall schema. Use Number of files to scan to tell the crawler how many files it should scan.
    Note: The crawler can be applied only to data that is in Parquet, CSV, or JSON format. All files sampled by the crawler must follow the same schema. Inconsistent schemas across sampled files might result in a data catalog with incorrectly inferred fields in its schema. The crawler begins its scan after you complete the Time settings and Define partitions sections in the following steps and select Next to go to the Review step.
  5. (Optional) Select Define the time field if your dataset contains time-series data and you intend to use time-based filtering or SPL2 time functions when you run federated searches over it.

    If you select Define the time field, provide the Time field, Time format, and Unix time field.

    For more information, see Identify the time field in a Microsoft Azure dataset.

  6. If your dataset is partitioned into data subsets, indicate what those partitions are. Select one of the following answers to the question Are your partitions Hive-compatible?
    • Yes: Define your partitions manually or let Splunk software identify the partitions for you using a crawler. If you choose to define your partitions manually, you can use a Field list view or a JSON view. If you select JSON view, your input must match the partition schema (dataPartition.PartitionSchema). For more information, see JSON standards for the data and partition schemas.
      Note: When you choose to let Splunk software identify the partitions for you, the crawler cannot determine which of your partitions are time partitions. If your partitions include time partitions, define them manually in the next step.
    • No: Define partitions manually using a Field list view or a JSON view.
    • I don't have partitions: Select Next to go to the dataset Review page.
  7. (Optional) If you have time-based partitions, identify them under Time partition settings. See Identify time partitions in a Microsoft Azure dataset.
  8. Select Next.
  9. On the Review page, review your dataset definition. If the details appear correct, select Create Dataset to create your dataset.
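Two of the values that this workflow asks for follow conventions that you can check before you start: the Queue URL in step 3 and the Hive-compatible partition layout in step 6. The following Python sketch is illustrative only and is not part of Splunk software. It assumes the public Azure cloud endpoint form https://<account>.queue.core.windows.net/<queue> and the usual Hive convention of key=value directory names. The function names, the storage account name, and the sample paths are hypothetical.

```python
import re
from urllib.parse import urlparse

# Azure Storage queue names: 3-63 characters, lowercase letters, digits, and
# single hyphens, starting and ending with a letter or digit.
QUEUE_NAME = re.compile(r"^(?=.{3,63}$)[a-z0-9]+(-[a-z0-9]+)*$")


def looks_like_queue_url(url: str) -> bool:
    """Check that url has the shape https://<account>.queue.core.windows.net/<queue>.

    Assumption: public Azure cloud. Sovereign clouds use other endpoint suffixes.
    """
    parts = urlparse(url)
    host = parts.hostname or ""
    queue = parts.path.strip("/")
    return (
        parts.scheme == "https"
        and host.endswith(".queue.core.windows.net")
        and QUEUE_NAME.match(queue) is not None
    )


def hive_partitions(blob_path: str) -> dict:
    """Collect key=value directory segments from a blob path.

    An empty result means no Hive-style partition directories were found.
    """
    *dirs, _filename = blob_path.strip("/").split("/")
    found = {}
    for segment in dirs:
        key, sep, value = segment.partition("=")
        if sep and key and value:
            found[key] = value
    return found


# Hypothetical sample values.
print(looks_like_queue_url("https://mystorageacct.queue.core.windows.net/catalog-sync"))
print(hive_partitions("logs/year=2024/month=01/part-0.parquet"))
print(hive_partitions("logs/2024/01/part-0.parquet"))
```

A path such as logs/year=2024/month=01/part-0.parquet yields the partition fields year and month, which suggests a Hive-compatible layout. A path such as logs/2024/01/part-0.parquet yields no fields, which suggests that you answer No and define the partitions manually.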
After you create your Microsoft Azure dataset, there are two things you should do: