diff --git a/azure-blob-store/docs/AzureBlobStore-batchsource.md b/azure-blob-store/docs/AzureBlobStore-batchsource.md
index a448f20..7b01b81 100644
--- a/azure-blob-store/docs/AzureBlobStore-batchsource.md
+++ b/azure-blob-store/docs/AzureBlobStore-batchsource.md
@@ -3,59 +3,203 @@
 Description
 -----------
-Batch source to use Microsoft Azure Blob Storage as a source.
+Batch source to read from Microsoft Azure Blob Storage (WASB) or Azure Data Lake Storage Gen2 (ABFS).

 Use Case
 --------
-This source is used whenever you need to read from Microsoft Azure Blob Storage. For
-example, you may want to read in files from Microsoft Azure Blob Storage, parse them and
-then store them in a Microsoft SQL Server Database.
+This source is used whenever you need to read files from Azure Blob Storage or Azure Data Lake Storage Gen2.
+For example, you may want to read log files from Azure every hour and store the results in a database or
+another storage system.
+
+Supported Path Schemes
+----------------------
+
+| Scheme | Driver | Typical Use |
+|--------|--------|-------------|
+| `wasb://` | NativeAzureFileSystem (WASB) | Azure Blob Storage |
+| `wasbs://` | NativeAzureFileSystem (WASB, SSL) | Azure Blob Storage (secure) |
+| `abfs://` | AzureBlobFileSystem (ABFS) | Azure Data Lake Storage Gen2 |
+| `abfss://` | SecureAzureBlobFileSystem (ABFS, SSL) | Azure Data Lake Storage Gen2 (secure) |
+
+Supported Authentication Methods
+---------------------------------
+
+| Method | Works with | Credentials required |
+|--------|-----------|----------------------|
+| Storage Account Key | wasb, wasbs, abfs, abfss | Storage key |
+| SAS Token | wasb, wasbs | SAS token + container |
+| Service Principal | abfs, abfss | Tenant ID, Client ID, Client secret |
+| Managed Identity | abfs, abfss | None (identity from Azure runtime) |

 Properties
 ----------
-**Reference Name:** This will be used to uniquely identify this source for lineage, annotating metadata, etc.
-**Path:** The path on Microsoft Azure Blob Storage to use as input. The path uses filename expansion (globbing) to read
-files. The path must start with `wasb://` or `wasbs://`, for example, `wasb://mycontainer@mystorageaccount.blob.core.windows.net/filename.txt`. (Macro-enabled)
+**Reference Name:** Name used to uniquely identify this source for lineage, annotating metadata, etc.
+
+**Path:** The path to use as input. Uses filename expansion (globbing) to read files. The path scheme
+determines which driver and authentication methods are available. (Macro-enabled)
+
+- WASB example: `wasb://mycontainer@myaccount.blob.core.windows.net/data/`
+- ABFS example: `abfss://mycontainer@myaccount.dfs.core.windows.net/data/`
+
+**Account:** The Microsoft Azure Storage account FQDN. (Macro-enabled)
+
+- For `wasb://` / `wasbs://` paths, must end with `.blob.core.windows.net`
+  (e.g. `mystorageaccount.blob.core.windows.net`).
+- For `abfs://` / `abfss://` paths, must end with `.dfs.core.windows.net`
+  (e.g. `mystorageaccount.dfs.core.windows.net`).
+
+**Authentication Method:** The method used to authenticate with Azure. Defaults to `Storage Account Key`.
+
+---
+
+### Storage Account Key
+
+**Storage Key:** The base64-encoded storage key for the Azure Storage account.
+Required when authentication method is `Storage Account Key`. (Macro-enabled)
+
+---
+
+### SAS Token
+
+SAS Token authentication is supported for `wasb://` and `wasbs://` paths only.
+
+**SAS Token:** The Shared Access Signature token for the container.
+Required when authentication method is `SAS Token`. (Macro-enabled)
+
+**Container:** The blob container to connect to.
+Required when authentication method is `SAS Token`. (Macro-enabled)
+
+---
+
+### Service Principal (Azure AD)
+
+Service Principal authentication uses Azure Active Directory OAuth2 client credentials.
+Requires `abfs://` or `abfss://` paths.
+
+**Tenant ID:** The Azure Active Directory tenant (directory) ID.
+Required when authentication method is `Service Principal`. (Macro-enabled)
+
+**Client ID:** The application (client) ID of the registered Azure AD service principal.
+Required when authentication method is `Service Principal`. (Macro-enabled)
+
+**Client Secret:** The client secret value generated for the Azure AD service principal.
+Required when authentication method is `Service Principal`. (Macro-enabled)
+
+> The service principal must be assigned the **Storage Blob Data Reader** role (or equivalent)
+> on the storage account or container in Azure IAM.
+
+---
+
+### Managed Identity (Azure AD)
+
+Managed Identity authentication uses the Azure-assigned identity of the compute resource running
+the pipeline (e.g. an Azure VM, AKS node pool, or HDInsight cluster). No credentials are required.
+Requires `abfs://` or `abfss://` paths.

-**Account:** The Microsoft Azure Blob Storage account to use. The account must end with `.blob.core.windows.net`.
-For example, `mystorageaccount.blob.core.windows.net`, where `mystorageaccount` is the Microsoft
-Azure Storage account name. (Macro-enabled)
+> The managed identity must be assigned the **Storage Blob Data Reader** role (or equivalent)
+> on the storage account or container in Azure IAM.

-**Authentication Method:** The authentication method to use to connect to Microsoft Azure. Can be either
-`Storage Account Key` or `SAS Token`. Defaults to `Storage Account Key`.
+---

-**Storage Key:** The storage key for the container on the Microsoft Azure Storage account.
-Must be a valid base64 encoded storage key provided by Microsoft Azure. Required when authentication method is set
-to `Storage Account Key`. (Macro-enabled)
+**Format:** Format of the data to read. The format must be one of `avro`, `blob`, `csv`, `delimited`,
+`json`, `parquet`, `text`, `tsv`, or `xls`.
+- `blob`: Every input file is read into a separate record. Requires a schema with a field named `body` of type `bytes`.
+- `text`: Files are read line by line. Requires a schema with a field named `body` of type `string`.
+- `csv`: Comma-separated values. Supports header row and quoted values.
+- `tsv`: Tab-separated values. Supports header row and quoted values.
+- `delimited`: Delimiter-separated values using a configurable delimiter.
+- `avro`: Apache Avro binary format. Schema is read from the file.
+- `parquet`: Apache Parquet columnar format. Schema is read from the file.
+- `json`: JSON objects, one per line.
+- `xls`: Microsoft Excel spreadsheet format.

-**SAS Token:** The SAS token to use to connect to the specified container. Required when authentication method is set
-to `SAS Token`. (Macro-enabled)
+**Sample Size:** The maximum number of rows investigated for automatic data type detection.
+Default is 1000. Only used when format is `xls`.

-**Container:** The container to connect to. Required when authentication method is set to`SAS Token`. (Macro-enabled)
+**Override:** Per-column data type overrides that skip automatic type detection.
+Only used when format is `xls`.

-**Ignore Non-Existing Folders:** Identify if path needs to be ignored or not, for case when directory or file does not
-exists. If set to true it will treat the not present folder as 0 input and log a warning. Default is `false`.
+**Delimiter:** The delimiter to use when format is `delimited`. Ignored for other formats.

-**Recursive:** Boolean value to determine if files are to be read recursively from the path. Default is `false`.
+**Enable Quoted Values:** Whether to treat content between quotes as a value. Applies to `csv`, `tsv`,
+and `delimited` formats.

-Example
--------
+**Use First Row as Header:** Whether to use the first row as a header. Applies to `csv`, `tsv`,
+`delimited`, and `xls` formats.

-This example connects to Microsoft Azure Blob Storage and reads in files found in the
-specified directory. This example uses Microsoft Azure Storage 'mystorageaccount.blob.core.windows.net', using the
-'mystorageaccount' account name:
+**Terminate Reading After Empty Row:** Stop reading after the first empty row. Defaults to false.
+Only used when format is `xls`.
+
+**Select Sheet Using / Sheet Value:** Sheet selection by name or number (0-based). Only used when format is `xls`.
+
+**Maximum Split Size:** Maximum size in bytes for each input partition. Default is 128MB.
+
+**Regex Path Filter:** Regular expression that file paths must match to be included in the input.
+
+**Path Field:** Output field to store the file path each record was read from.
+Must be a string field in the output schema.
+
+**Path Filename Only:** Whether to use only the filename instead of the full URI in the path field.
+Default is false.
+
+**Read Files Recursively:** Whether to read files recursively from the path. Default is false.
+
+**Allow Empty Input:** Whether to allow an input path with no data. When false, the plugin errors if
+there is no data. Default is false.
+
+**File System Properties:** Additional Hadoop filesystem properties as a JSON object. (Macro-enabled)
+
+**File Encoding:** Character encoding for the file(s). Default is UTF-8.
+
+Examples
+--------
+
+### Storage Account Key (WASB)

     {
         "name": "AzureBlobStore",
         "type": "batchsource",
         "properties": {
-            "path": "`wasb://mycontainer@mystorageaccount.blob.core.windows.net/filename.txt",
-            "account": "mystorageaccount.blob.core.windows.net",
+            "referenceName": "azure_blob_input",
+            "path": "wasb://mycontainer@myaccount.blob.core.windows.net/data/",
+            "account": "myaccount.blob.core.windows.net",
             "authenticationMethod": "storageAccountKey",
             "storageKey": "XXXXXEEESSS/YYYY=",
-            "ignoreNonExistingFolders": "false",
-            "recursive": "false"
+            "format": "csv",
+            "skipHeader": "true"
+        }
+    }
+
+### Service Principal (ABFS)
+
+    {
+        "name": "AzureBlobStore",
+        "type": "batchsource",
+        "properties": {
+            "referenceName": "adls_gen2_input",
+            "path": "abfss://mycontainer@myaccount.dfs.core.windows.net/data/",
+            "account": "myaccount.dfs.core.windows.net",
+            "authenticationMethod": "servicePrincipal",
+            "tenantId": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
+            "clientId": "yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy",
+            "clientSecret": "your-client-secret",
+            "format": "parquet",
+            "recursiveRead": "true"
+        }
+    }
+
+### Managed Identity (ABFS)
+
+    {
+        "name": "AzureBlobStore",
+        "type": "batchsource",
+        "properties": {
+            "referenceName": "adls_gen2_input",
+            "path": "abfss://mycontainer@myaccount.dfs.core.windows.net/data/",
+            "account": "myaccount.dfs.core.windows.net",
+            "authenticationMethod": "managedIdentity",
+            "format": "json"
         }
-    }
\ No newline at end of file
+    }
diff --git a/azure-blob-store/pom.xml b/azure-blob-store/pom.xml
index 8900f01..6ab9aa6 100644
--- a/azure-blob-store/pom.xml
+++ b/azure-blob-store/pom.xml
@@ -31,14 +31,24 @@
 io.cdap.plugin
- filesource-common
- ${project.parent.version}
+ format-common
+ ${cdap.plugin.version}
 org.apache.hadoop
 hadoop-azure
 ${hadoop.version}
+
+ com.fasterxml.jackson.core
+ jackson-databind
+ ${jackson.databind.version}
+
+
+ org.apache.commons
+ commons-lang3
+ 3.14.0
+
+
diff --git a/azure-blob-store/src/main/java/io/cdap/plugin/source/AzureBatchSource.java b/azure-blob-store/src/main/java/io/cdap/plugin/source/AzureBatchSource.java
index 8fa14dd..090a9ec 100644
--- a/azure-blob-store/src/main/java/io/cdap/plugin/source/AzureBatchSource.java
+++ b/azure-blob-store/src/main/java/io/cdap/plugin/source/AzureBatchSource.java
@@ -17,128 +17,347 @@
 package io.cdap.plugin.source;

 import com.google.common.base.Strings;
+import com.google.gson.Gson;
+import com.google.gson.reflect.TypeToken;
 import io.cdap.cdap.api.annotation.Description;
 import io.cdap.cdap.api.annotation.Macro;
 import io.cdap.cdap.api.annotation.Name;
 import io.cdap.cdap.api.annotation.Plugin;
 import io.cdap.cdap.etl.api.FailureCollector;
 import io.cdap.cdap.etl.api.batch.BatchSource;
-import io.cdap.plugin.common.AbstractFileBatchSource;
-import io.cdap.plugin.common.FileSourceConfig;
+import io.cdap.cdap.etl.api.batch.BatchSourceContext;
+import io.cdap.plugin.common.Asset;
+import io.cdap.plugin.common.LineageRecorder;
+import io.cdap.plugin.format.input.PathTrackingInputFormat;
+import io.cdap.plugin.format.plugin.AbstractFileSource;
+import io.cdap.plugin.format.plugin.AbstractFileSourceConfig;

+import java.lang.reflect.Type;
+import java.util.Collections;
 import java.util.HashMap;
+import java.util.List;
 import java.util.Map;
 import javax.annotation.Nullable;

 /**
- * {@link BatchSource} for Azure Blob Store.
+ * {@link BatchSource} for Azure Blob Store and Azure Data Lake Storage Gen2.
+ *
+ * Supported path schemes:
+ *   wasb:// / wasbs:// — Azure Blob Storage (WASB driver)
+ *   abfs:// / abfss:// — Azure Data Lake Storage Gen2 (ABFS driver)
+ *
+ * Supported authentication methods:
+ *   storageAccountKey — shared key; works with wasb and abfs paths
+ *   sasToken — SAS token; wasb paths only
+ *   servicePrincipal — Azure AD OAuth2 client credentials; abfs paths only
+ *   managedIdentity — Azure AD managed identity; abfs paths only
  */
 @Plugin(type = BatchSource.PLUGIN_TYPE)
 @Name("AzureBlobStore")
-@Description("Batch source to read from Azure Blob Storage.")
-public class AzureBatchSource extends AbstractFileBatchSource {
-  private static final String PATH = "path";
-  private static final String ACCOUNT = "account";
-  private static final String AUTHENTICATION_METHOD = "authenticationMethod";
-  private static final String STORAGE_ACCOUNT_KEY = "storageKey";
-  private static final String SAS_TOKEN = "sasToken";
-  private static final String CONTAINER = "container";
-  private static final String STORAGE_ACCOUNT_KEY_AUTH_METHOD = "storageAccountKey";
-  private static final String SAS_TOKEN_AUTH_METHOD = "sasToken";
-
-  @SuppressWarnings("unused")
+@Description("Batch source to read from Azure Blob Storage or Azure Data Lake Storage Gen2.")
+public class AzureBatchSource extends AbstractFileSource {
+  private final AzureBatchConfig config;
+  private Asset asset;

   public AzureBatchSource(AzureBatchConfig config) {
     super(config);
     this.config = config;
   }

+  @Override
+  public void prepareRun(BatchSourceContext context) throws Exception {
+    asset = Asset.builder(config.getReferenceName())
+      .setFqn(config.getPath()).build();
+    super.prepareRun(context);
+  }
+
+  @Override
+  protected LineageRecorder getLineageRecorder(BatchSourceContext context) {
+    return new LineageRecorder(context, asset);
+  }
+
+  @Override
+  protected Map getFileSystemProperties(BatchSourceContext context) {
+    Map properties = new HashMap<>(config.getFilesystemProperties());
+
+    String path = config.getPath();
+
+    if (path != null && (path.startsWith("wasb://") || path.startsWith("wasbs://"))) {
+      properties.put("fs.wasb.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem");
+      properties.put("fs.wasb.impl.disable.cache", "true");
+      properties.put("fs.wasbs.impl.disable.cache", "true");
+      properties.put("fs.AbstractFileSystem.wasb.impl", "org.apache.hadoop.fs.azure.Wasb");
+    } else if (path != null && (path.startsWith("abfs://") || path.startsWith("abfss://"))) {
+      properties.put("fs.abfs.impl", "org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem");
+      properties.put("fs.abfss.impl", "org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem");
+      properties.put("fs.AbstractFileSystem.abfs.impl", "org.apache.hadoop.fs.azurebfs.Abfs");
+      properties.put("fs.AbstractFileSystem.abfss.impl", "org.apache.hadoop.fs.azurebfs.Abfss");
+    }
+
+    String authMethod = config.authenticationMethod;
+    String account = config.account;
+
+    if (AzureBatchConfig.AUTH_STORAGE_ACCOUNT_KEY.equalsIgnoreCase(authMethod)) {
+      properties.put(String.format("fs.azure.account.key.%s", account), config.storageKey);
+
+    } else if (AzureBatchConfig.AUTH_SAS_TOKEN.equalsIgnoreCase(authMethod)) {
+      properties.put(String.format("fs.azure.sas.%s.%s", config.container, account), config.sasToken);
+
+    } else if (AzureBatchConfig.AUTH_SERVICE_PRINCIPAL.equalsIgnoreCase(authMethod)) {
+      properties.put(String.format("fs.azure.account.auth.type.%s", account), "OAuth");
+      properties.put(String.format("fs.azure.account.oauth.provider.type.%s", account),
+        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider");
+      properties.put(String.format("fs.azure.account.oauth2.client.endpoint.%s", account),
+        String.format("https://login.microsoftonline.com/%s/oauth2/token", config.tenantId));
+      properties.put(String.format("fs.azure.account.oauth2.client.id.%s", account), config.clientId);
+      properties.put(String.format("fs.azure.account.oauth2.client.secret.%s", account), config.clientSecret);
+
+    } else if (AzureBatchConfig.AUTH_MANAGED_IDENTITY.equalsIgnoreCase(authMethod)) {
+      properties.put(String.format("fs.azure.account.auth.type.%s", account), "OAuth");
+      properties.put(String.format("fs.azure.account.oauth.provider.type.%s", account),
+        "org.apache.hadoop.fs.azurebfs.oauth2.MsiTokenProvider");
+    }
+
+    if (config.shouldCopyHeader()) {
+      properties.put(PathTrackingInputFormat.COPY_HEADER, "true");
+    }
+    if (config.getFileEncoding() != null && !config.getFileEncoding().equals(config.getDefaultFileEncoding())) {
+      properties.put(PathTrackingInputFormat.SOURCE_FILE_ENCODING, config.getFileEncoding());
+    }
+    return properties;
+  }
+
+  @Override
+  protected void recordLineage(LineageRecorder lineageRecorder, List outputFields) {
+    lineageRecorder.recordRead("Read", "Read from Azure Blob Storage.", outputFields);
+  }
+
+  @Override
+  protected boolean shouldGetSchema() {
+    return !config.containsMacro(AzureBatchConfig.NAME_PATH)
+      && !config.containsMacro("format")
+      && !config.containsMacro("delimiter")
+      && !config.containsMacro(AzureBatchConfig.NAME_ACCOUNT)
+      && !config.containsMacro(AzureBatchConfig.NAME_STORAGE_KEY)
+      && !config.containsMacro(AzureBatchConfig.NAME_SAS_TOKEN)
+      && !config.containsMacro(AzureBatchConfig.NAME_TENANT_ID)
+      && !config.containsMacro(AzureBatchConfig.NAME_CLIENT_ID)
+      && !config.containsMacro(AzureBatchConfig.NAME_CLIENT_SECRET)
+      && !config.containsMacro(AzureBatchConfig.NAME_FILE_SYSTEM_PROPERTIES);
+  }
+
   /**
    * Plugin config for {@link AzureBatchSource}.
    */
-  public static class AzureBatchConfig extends FileSourceConfig {
-    @Description("Path to file(s) to be read. If a directory is specified,terminate the path name with a '/'. " +
-      "The path must start with `wasb://` or `wasbs://`.")
+  public static class AzureBatchConfig extends AbstractFileSourceConfig {
+
+    public static final String NAME_PATH = "path";
+    public static final String NAME_ACCOUNT = "account";
+    public static final String NAME_STORAGE_KEY = "storageKey";
+    public static final String NAME_SAS_TOKEN = "sasToken";
+    public static final String NAME_TENANT_ID = "tenantId";
+    public static final String NAME_CLIENT_ID = "clientId";
+    public static final String NAME_CLIENT_SECRET = "clientSecret";
+    public static final String NAME_FILE_SYSTEM_PROPERTIES = "fileSystemProperties";
+
+    static final String AUTH_STORAGE_ACCOUNT_KEY = "storageAccountKey";
+    static final String AUTH_SAS_TOKEN = "sasToken";
+    static final String AUTH_SERVICE_PRINCIPAL = "servicePrincipal";
+    static final String AUTH_MANAGED_IDENTITY = "managedIdentity";
+
+    private static final String NAME_AUTHENTICATION_METHOD = "authenticationMethod";
+    private static final String NAME_CONTAINER = "container";
+    private static final Gson GSON = new Gson();
+    private static final Type MAP_STRING_STRING_TYPE = new TypeToken>() { }.getType();
+
+    @Description("Path to file(s) to be read. If a directory is specified, terminate the path name with a '/'. " +
+      "Supports wasb://, wasbs://, abfs://, and abfss:// schemes. " +
+      "Example WASB: wasb://mycontainer@myaccount.blob.core.windows.net/path/. " +
+      "Example ABFS: abfss://mycontainer@myaccount.dfs.core.windows.net/path/.")
     @Macro
     private String path;

-    @Description("The Microsoft Azure Storage account to use.")
+    @Description("The Microsoft Azure Storage account to use. " +
+      "For WASB paths, must end with `.blob.core.windows.net` " +
+      "(e.g. `mystorageaccount.blob.core.windows.net`). " +
+      "For ABFS paths, must end with `.dfs.core.windows.net` " +
+      "(e.g. `mystorageaccount.dfs.core.windows.net`).")
     @Macro
     private String account;

-    @Description("The authentication method to use to connect to Microsoft Azure. Can be either 'Storage Account Key'" +
-      " or 'SAS Token'. Defaults to 'Storage Account Key'.")
+    @Description("The authentication method to use to connect to Microsoft Azure. " +
+      "'Storage Account Key' and 'SAS Token' work with wasb:// paths. " +
+      "'Service Principal' and 'Managed Identity' work with abfs:// paths. " +
+      "Defaults to 'Storage Account Key'.")
     private String authenticationMethod;

-    @Description("The storage key for the specified container on the specified Azure Storage account. Must be a " +
-      "valid base64 encoded storage key provided by Microsoft Azure.")
+    // --- Storage Account Key ---
+
+    @Description("The storage key for the container on the Microsoft Azure Storage account. " +
+      "Must be a valid base64 encoded storage key provided by Microsoft Azure. " +
+      "Required when authentication method is 'Storage Account Key'.")
     @Nullable
     @Macro
     private String storageKey;

-    @Description("The SAS token to use to connect to the specified container. Required when authentication method " +
-      "is set to 'SAS Token'.")
+    // --- SAS Token ---
+
+    @Description("The SAS token to use to connect to the specified container. " +
+      "Required when authentication method is 'SAS Token'.")
     @Nullable
     @Macro
     private String sasToken;

-    @Description("The container to connect to. Required when authentication method is set to 'SAS Token'.")
+    @Description("The container to connect to. Required when authentication method is 'SAS Token'.")
     @Nullable
     @Macro
     private String container;

+    // --- Azure AD: Service Principal ---
+
+    @Description("The Azure Active Directory tenant (directory) ID. " +
+      "Required when authentication method is 'Service Principal'.")
+    @Nullable
+    @Macro
+    private String tenantId;
+
+    @Description("The client (application) ID of the Azure AD service principal. " +
+      "Required when authentication method is 'Service Principal'.")
+    @Nullable
+    @Macro
+    private String clientId;
+
+    @Description("The client secret of the Azure AD service principal. " +
+      "Required when authentication method is 'Service Principal'.")
+    @Nullable
+    @Macro
+    private String clientSecret;
+
+    // --- Extra FS properties ---
+
+    @Description("Any additional properties to use when reading from the filesystem. " +
+      "This is an advanced feature that requires knowledge of the properties supported by the underlying filesystem.")
+    @Nullable
+    @Macro
+    private String fileSystemProperties;
+
+    public AzureBatchConfig() {
+      fileSystemProperties = GSON.toJson(Collections.emptyMap());
+    }
+
     @Override
-    protected void validate(FailureCollector collector) {
+    public void validate(FailureCollector collector) {
       super.validate(collector);
-      if (!containsMacro(PATH) && (!path.startsWith("wasb://") && !path.startsWith("wasbs://"))) {
-        collector.addFailure("Path must start with wasb:// or wasbs:// for Windows Azure Blob Store input files.", null)
-          .withConfigProperty(PATH);
+
+      boolean isWasb = !containsMacro(NAME_PATH) &&
+        path != null && (path.startsWith("wasb://") || path.startsWith("wasbs://"));
+      boolean isAbfs = !containsMacro(NAME_PATH) &&
+        path != null && (path.startsWith("abfs://") || path.startsWith("abfss://"));
+
+      if (!containsMacro(NAME_PATH) && !isWasb && !isAbfs) {
+        collector.addFailure(
+          "Path must start with wasb://, wasbs://, abfs://, or abfss://.", null)
+          .withConfigProperty(NAME_PATH);
+      }
+
+      if (!containsMacro(NAME_ACCOUNT) && account != null) {
+        if (isWasb && !account.endsWith(".blob.core.windows.net")) {
+          collector.addFailure(
+            "Account must end with '.blob.core.windows.net' for wasb:// paths.", null)
+            .withConfigProperty(NAME_ACCOUNT);
+        }
+        if (isAbfs && !account.endsWith(".dfs.core.windows.net")) {
+          collector.addFailure(
+            "Account must end with '.dfs.core.windows.net' for abfs:// paths.", null)
+            .withConfigProperty(NAME_ACCOUNT);
+        }
+      }
+
+      boolean validAuthMethod = AUTH_STORAGE_ACCOUNT_KEY.equalsIgnoreCase(authenticationMethod)
+        || AUTH_SAS_TOKEN.equalsIgnoreCase(authenticationMethod)
+        || AUTH_SERVICE_PRINCIPAL.equalsIgnoreCase(authenticationMethod)
+        || AUTH_MANAGED_IDENTITY.equalsIgnoreCase(authenticationMethod);
+
+      if (!validAuthMethod) {
+        collector.addFailure(
+          "Authentication method must be one of 'Storage Account Key', 'SAS Token', " +
+          "'Service Principal', or 'Managed Identity'.", null)
+          .withConfigProperty(NAME_AUTHENTICATION_METHOD);
       }
-      if (!containsMacro(ACCOUNT) && !account.endsWith(".blob.core.windows.net")) {
-        collector.addFailure("Account must end with '.blob.core.windows.net' for Windows Azure Blob Store", null)
-          .withConfigProperty(ACCOUNT);
+
+      if (AUTH_STORAGE_ACCOUNT_KEY.equalsIgnoreCase(authenticationMethod)) {
+        if (!containsMacro(NAME_STORAGE_KEY) && Strings.isNullOrEmpty(storageKey)) {
+          collector.addFailure(
+            "Storage key must be provided when authentication method is 'Storage Account Key'.", null)
+            .withConfigProperty(NAME_STORAGE_KEY);
+        }
       }
-      if (!(STORAGE_ACCOUNT_KEY_AUTH_METHOD.equalsIgnoreCase(authenticationMethod) ||
-        SAS_TOKEN_AUTH_METHOD.equalsIgnoreCase(authenticationMethod))) {
-        collector.addFailure("Authentication method should be one of 'Storage Account Key' or 'SAS Token'",
-          null).withConfigProperty(AUTHENTICATION_METHOD);
+
+      if (AUTH_SAS_TOKEN.equalsIgnoreCase(authenticationMethod)) {
+        if (!containsMacro(NAME_PATH) && isAbfs) {
+          collector.addFailure("SAS Token authentication is only supported for wasb:// and wasbs:// paths.", null)
+            .withConfigProperty(NAME_AUTHENTICATION_METHOD);
+        }
+        if (!containsMacro(NAME_SAS_TOKEN) && Strings.isNullOrEmpty(sasToken)) {
+          collector.addFailure(
+            "SAS token must be provided when authentication method is 'SAS Token'.", null)
+            .withConfigProperty(NAME_SAS_TOKEN);
+        }
+        if (!containsMacro(NAME_CONTAINER) && Strings.isNullOrEmpty(container)) {
+          collector.addFailure(
+            "Container must be provided when authentication method is 'SAS Token'.", null)
+            .withConfigProperty(NAME_CONTAINER);
+        }
       }
-      if (STORAGE_ACCOUNT_KEY_AUTH_METHOD.equalsIgnoreCase(authenticationMethod) &&
-        !containsMacro(STORAGE_ACCOUNT_KEY) && Strings.isNullOrEmpty(storageKey)) {
-        collector.addFailure("Storage key must be provided when authentication method is set " +
-          "to 'Storage Account Key'", null).withConfigProperty(STORAGE_ACCOUNT_KEY);
+
+      if (AUTH_SERVICE_PRINCIPAL.equalsIgnoreCase(authenticationMethod)
+        || AUTH_MANAGED_IDENTITY.equalsIgnoreCase(authenticationMethod)) {
+        if (!containsMacro(NAME_PATH) && isWasb) {
+          collector.addFailure(
+            "Azure AD authentication (Service Principal / Managed Identity) requires abfs:// or abfss:// paths.", null)
+            .withConfigProperty(NAME_AUTHENTICATION_METHOD);
+        }
       }
-      if (SAS_TOKEN_AUTH_METHOD.equalsIgnoreCase(authenticationMethod)) {
-        if (!containsMacro(SAS_TOKEN) && Strings.isNullOrEmpty(sasToken)) {
-          collector.addFailure("SAS token must be provided when authentication method is set to 'SAS Token'",
-            null).withConfigProperty(SAS_TOKEN);
+
+      if (AUTH_SERVICE_PRINCIPAL.equalsIgnoreCase(authenticationMethod)) {
+        if (!containsMacro(NAME_TENANT_ID) && Strings.isNullOrEmpty(tenantId)) {
+          collector.addFailure(
+            "Tenant ID must be provided when authentication method is 'Service Principal'.", null)
+            .withConfigProperty(NAME_TENANT_ID);
+        }
+        if (!containsMacro(NAME_CLIENT_ID) && Strings.isNullOrEmpty(clientId)) {
+          collector.addFailure(
+            "Client ID must be provided when authentication method is 'Service Principal'.", null)
+            .withConfigProperty(NAME_CLIENT_ID);
         }
-        if (!containsMacro(CONTAINER) && Strings.isNullOrEmpty(container)) {
-          collector.addFailure("Container must be provided when authentication method is set to 'SAS Token'",
-            null).withConfigProperty(CONTAINER);
+        if (!containsMacro(NAME_CLIENT_SECRET) && Strings.isNullOrEmpty(clientSecret)) {
+          collector.addFailure(
+            "Client secret must be provided when authentication method is 'Service Principal'.", null)
+            .withConfigProperty(NAME_CLIENT_SECRET);
         }
       }
-    }

-    @Override
-    protected Map getFileSystemProperties() {
-      Map properties = new HashMap<>(super.getFileSystemProperties());
-      properties.put("fs.wasb.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem");
-      properties.put("fs.wasb.impl.disable.cache", "true");
-      properties.put("fs.wasbs.impl.disable.cache", "true");
-      properties.put("fs.AbstractFileSystem.wasb.impl", "org.apache.hadoop.fs.azure.Wasb");
-      if (STORAGE_ACCOUNT_KEY_AUTH_METHOD.equalsIgnoreCase(authenticationMethod)) {
-        properties.put(String.format("fs.azure.account.key.%s", account), storageKey);
-      } else if (SAS_TOKEN_AUTH_METHOD.equalsIgnoreCase(authenticationMethod)) {
-        properties.put(String.format("fs.azure.sas.%s.%s", container, account), sasToken);
+      if (!containsMacro(NAME_FILE_SYSTEM_PROPERTIES)) {
+        try {
+          getFilesystemProperties();
+        } catch (Exception e) {
+          collector.addFailure("File system properties must be a valid JSON.", null)
+            .withConfigProperty(NAME_FILE_SYSTEM_PROPERTIES).withStacktrace(e.getStackTrace());
+        }
       }
-      return properties;
     }

     @Override
-    protected String getPath() {
+    public String getPath() {
       return path;
     }
+
+    Map getFilesystemProperties() {
+      if (containsMacro(NAME_FILE_SYSTEM_PROPERTIES) || Strings.isNullOrEmpty(fileSystemProperties)) {
+        return new HashMap<>();
+      }
+      return GSON.fromJson(fileSystemProperties, MAP_STRING_STRING_TYPE);
+    }
   }
 }
diff --git a/azure-blob-store/widgets/AzureBlobStore-batchsource.json b/azure-blob-store/widgets/AzureBlobStore-batchsource.json
index 2759b8c..f14603a 100644
--- a/azure-blob-store/widgets/AzureBlobStore-batchsource.json
+++ b/azure-blob-store/widgets/AzureBlobStore-batchsource.json
@@ -15,12 +15,18 @@
         {
           "widget-type": "textbox",
           "label": "Path",
-          "name": "path"
+          "name": "path",
+          "widget-attributes": {
+            "placeholder": "wasb://@.blob.core.windows.net/path/ or abfss://@.dfs.core.windows.net/path/"
+          }
         },
         {
           "widget-type": "textbox",
           "label": "Account",
-          "name": "account"
+          "name": "account",
+          "widget-attributes": {
+            "placeholder": "mystorageaccount.blob.core.windows.net"
+          }
         },
         {
           "widget-type": "radio-group",
@@ -37,17 +43,25 @@
               {
                 "id": "sasToken",
                 "label": "SAS Token"
+              },
+              {
+                "id": "servicePrincipal",
+                "label": "Service Principal"
+              },
+              {
+                "id": "managedIdentity",
+                "label": "Managed Identity"
               }
             ]
           }
         },
         {
-          "widget-type": "textbox",
-          "label": "Azure Blob Store Storage Key",
+          "widget-type": "password",
+          "label": "Storage Key",
           "name": "storageKey"
         },
         {
-          "widget-type": "textbox",
+          "widget-type": "password",
           "label": "SAS Token",
           "name": "sasToken"
         },
@@ -58,63 +72,253 @@
         },
         {
           "widget-type": "textbox",
-          "label": "Maximum Split Size",
-          "name": "maxSplitSize"
+          "label": "Tenant ID",
+          "name": "tenantId",
+          "widget-attributes": {
+            "placeholder": "Azure AD tenant (directory) ID"
+          }
         },
         {
           "widget-type": "textbox",
-          "label": "Regex Path Filter",
-          "name": "fileRegex"
+          "label": "Client ID",
+          "name": "clientId",
+          "widget-attributes": {
+            "placeholder": "Azure AD application (client) ID"
+          }
         },
         {
-          "widget-type": "select",
-          "label": "Read files recursively",
-          "name": "recursive",
+          "widget-type": "password",
+          "label": "Client Secret",
+          "name": "clientSecret",
           "widget-attributes": {
-            "values": [
-              "true",
-              "false"
-            ],
-            "default": "false"
+            "placeholder": "Azure AD client secret"
           }
         }
       ]
     },
     {
-      "label": "Output Schema Properties",
+      "label": "Basic",
       "properties": [
         {
-          "widget-type": "textbox",
-          "label": "Path Field",
-          "name": "pathField",
+          "widget-type": "select",
+          "label": "Format",
+          "name": "format",
+          "widget-attributes": {
+            "values": [
+              "avro",
+              "blob",
+              "csv",
+              "delimited",
+              "json",
+              "parquet",
+              "text",
+              "tsv",
+              "xls"
+            ],
+            "default": "text"
+          },
           "plugin-function": {
             "method": "POST",
             "widget": "outputSchema",
-            "output-property": "schema",
+            "label": "Get Schema",
+            "required-fields": [
+              "path"
+            ],
+            "missing-required-fields-message": "Please provide the path field",
             "plugin-method": "getSchema"
           }
         },
+        {
+          "widget-type": "number",
+          "label": "Sample Size",
+ "name": "sampleSize", + "widget-attributes": { + "default": "1000", + "minimum": "1" + } + }, + { + "widget-type": "keyvalue-dropdown", + "label": "Override", + "name": "override", + "widget-attributes": { + "key-placeholder": "Field Name", + "value-placeholder": "Data Type", + "dropdownOptions": [ + "boolean", + "bytes", + "double", + "float", + "int", + "long", + "string", + "date", + "time", + "timestamp" + ] + } + }, + { + "widget-type": "textbox", + "label": "Delimiter", + "name": "delimiter", + "widget-attributes": { + "placeholder": "Delimiter if the format is 'delimited'" + } + }, + { + "widget-type": "toggle", + "name": "enableQuotedValues", + "label": "Enable Quoted Values", + "widget-attributes": { + "default": "false", + "on": { + "value": "true", + "label": "True" + }, + "off": { + "value": "false", + "label": "False" + } + } + }, + { + "widget-type": "toggle", + "label": "Use First Row as Header", + "name": "skipHeader", + "widget-attributes": { + "default": "false", + "on": { + "value": "true", + "label": "True" + }, + "off": { + "value": "false", + "label": "False" + } + } + }, + { + "widget-type": "toggle", + "label": "Terminate Reading After Empty Row", + "name": "terminateIfEmptyRow", + "widget-attributes": { + "default": "false", + "on": { + "value": "true", + "label": "True" + }, + "off": { + "value": "false", + "label": "False" + } + } + }, { "widget-type": "select", - "label": "Use File Name as Path Field", - "name": "filenameOnly", + "label": "Select Sheet Using", + "name": "sheet", "widget-attributes": { "values": [ - "true", - "false" + "Sheet Name", + "Sheet Number" ], - "default": "false" + "default": "Sheet Number" + } + }, + { + "widget-type": "textbox", + "label": "Sheet Value", + "name": "sheetValue", + "widget-attributes": { + "default": "0" } } ] }, { - "label": "Advanced Properties", + "label": "Advanced", "properties": [ { "widget-type": "textbox", - "label": "Input Format Class", - "name": "inputFormatClass" + "label": 
"Maximum Split Size", + "name": "maxSplitSize", + "widget-attributes": { + "placeholder": "Maximum split size for each partition specified in bytes" + } + }, + { + "widget-type": "textbox", + "label": "Regex Path Filter", + "name": "fileRegex", + "widget-attributes": { + "placeholder": "Regular expression for files to read" + } + }, + { + "widget-type": "textbox", + "label": "Path Field", + "name": "pathField", + "widget-attributes": { + "placeholder": "Output field to contain the path of the file that was read" + } + }, + { + "widget-type": "radio-group", + "name": "filenameOnly", + "label": "Path Filename Only", + "widget-attributes": { + "layout": "inline", + "default": "false", + "options": [ + { + "id": "true", + "label": "True" + }, + { + "id": "false", + "label": "False" + } + ] + } + }, + { + "widget-type": "radio-group", + "name": "recursiveRead", + "label": "Read Files Recursively", + "widget-attributes": { + "layout": "inline", + "default": "false", + "options": [ + { + "id": "true", + "label": "True" + }, + { + "id": "false", + "label": "False" + } + ] + } + }, + { + "widget-type": "radio-group", + "label": "Allow Empty Input", + "name": "allowEmptyInput", + "widget-attributes": { + "layout": "inline", + "default": "false", + "options": [ + { + "id": "true", + "label": "True" + }, + { + "id": "false", + "label": "False" + } + ] + } }, { "widget-type": "json-editor", @@ -123,20 +327,281 @@ }, { "widget-type": "select", - "label": "Ignore Non-Existing Folders", - "name": "ignoreNonExistingFolders", + "label": "File Encoding", + "name": "fileEncoding", "widget-attributes": { "values": [ - "true", - "false" + { + "label": "UTF-8", + "value": "UTF-8" + }, + { + "label": "UTF-32", + "value": "UTF-32" + }, + { + "label": "ISO-8859-1 (Latin-1 Western European)", + "value": "ISO-8859-1" + }, + { + "label": "ISO-8859-2 (Latin-2 Central European)", + "value": "ISO-8859-2" + }, + { + "label": "ISO-8859-3 (Latin-3 South European)", + "value": "ISO-8859-3" + }, + { + 
"label": "ISO-8859-4 (Latin-4 North European)", + "value": "ISO-8859-4" + }, + { + "label": "ISO-8859-5 (Latin/Cyrillic)", + "value": "ISO-8859-5" + }, + { + "label": "ISO-8859-6 (Latin/Arabic)", + "value": "ISO-8859-6" + }, + { + "label": "ISO-8859-7 (Latin/Greek)", + "value": "ISO-8859-7" + }, + { + "label": "ISO-8859-8 (Latin/Hebrew)", + "value": "ISO-8859-8" + }, + { + "label": "ISO-8859-9 (Latin-5 Turkish)", + "value": "ISO-8859-9" + }, + { + "label": "ISO-8859-11 (Latin/Thai)", + "value": "ISO-8859-11" + }, + { + "label": "ISO-8859-13 (Latin-7 Baltic Rim)", + "value": "ISO-8859-13" + }, + { + "label": "ISO-8859-15 (Latin-9)", + "value": "ISO-8859-15" + }, + { + "label": "Windows-1250", + "value": "Windows-1250" + }, + { + "label": "Windows-1251", + "value": "Windows-1251" + }, + { + "label": "Windows-1252", + "value": "Windows-1252" + }, + { + "label": "Windows-1253", + "value": "Windows-1253" + }, + { + "label": "Windows-1254", + "value": "Windows-1254" + }, + { + "label": "Windows-1255", + "value": "Windows-1255" + }, + { + "label": "Windows-1256", + "value": "Windows-1256" + }, + { + "label": "Windows-1257", + "value": "Windows-1257" + }, + { + "label": "Windows-1258", + "value": "Windows-1258" + }, + { + "label": "IBM00858", + "value": "IBM00858" + }, + { + "label": "IBM01140", + "value": "IBM01140" + }, + { + "label": "IBM01141", + "value": "IBM01141" + }, + { + "label": "IBM01142", + "value": "IBM01142" + }, + { + "label": "IBM01143", + "value": "IBM01143" + }, + { + "label": "IBM01144", + "value": "IBM01144" + }, + { + "label": "IBM01145", + "value": "IBM01145" + }, + { + "label": "IBM01146", + "value": "IBM01146" + }, + { + "label": "IBM01147", + "value": "IBM01147" + }, + { + "label": "IBM01148", + "value": "IBM01148" + }, + { + "label": "IBM01149", + "value": "IBM01149" + }, + { + "label": "IBM037", + "value": "IBM037" + }, + { + "label": "IBM1026", + "value": "IBM1026" + }, + { + "label": "IBM1047", + "value": "IBM1047" + }, + { + "label": 
"IBM273", + "value": "IBM273" + }, + { + "label": "IBM277", + "value": "IBM277" + }, + { + "label": "IBM278", + "value": "IBM278" + }, + { + "label": "IBM280", + "value": "IBM280" + }, + { + "label": "IBM284", + "value": "IBM284" + }, + { + "label": "IBM285", + "value": "IBM285" + }, + { + "label": "IBM290", + "value": "IBM290" + }, + { + "label": "IBM297", + "value": "IBM297" + }, + { + "label": "IBM420", + "value": "IBM420" + }, + { + "label": "IBM424", + "value": "IBM424" + }, + { + "label": "IBM437", + "value": "IBM437" + }, + { + "label": "IBM500", + "value": "IBM500" + }, + { + "label": "IBM775", + "value": "IBM775" + }, + { + "label": "IBM850", + "value": "IBM850" + }, + { + "label": "IBM852", + "value": "IBM852" + }, + { + "label": "IBM855", + "value": "IBM855" + }, + { + "label": "IBM857", + "value": "IBM857" + }, + { + "label": "IBM860", + "value": "IBM860" + }, + { + "label": "IBM861", + "value": "IBM861" + }, + { + "label": "IBM862", + "value": "IBM862" + }, + { + "label": "IBM863", + "value": "IBM863" + }, + { + "label": "IBM864", + "value": "IBM864" + }, + { + "label": "IBM865", + "value": "IBM865" + }, + { + "label": "IBM866", + "value": "IBM866" + }, + { + "label": "IBM868", + "value": "IBM868" + }, + { + "label": "IBM869", + "value": "IBM869" + }, + { + "label": "IBM870", + "value": "IBM870" + }, + { + "label": "IBM871", + "value": "IBM871" + }, + { + "label": "IBM918", + "value": "IBM918" + } ], - "default": "false" + "default": "UTF-8" } }, { - "widget-type": "textbox", - "label": "Time Table", - "name": "timeTable" + "widget-type": "hidden", + "name": "copyHeader" } ] } @@ -144,25 +609,16 @@ "outputs": [ { "name": "schema", - "widget-type": "schema", - "widget-attributes": { - "default-schema": { - "name": "fileRecord", - "type": "record", - "fields": [ - { - "name": "offset", - "type": "long" - }, - { - "name": "body", - "type": "string" - } - ] - } - } + "widget-type": "schema" } ], + "jump-config": { + "datasets": [ + { + "ref-property-name": 
"referenceName" + } + ] + }, "filters": [ { "name": "StorageAccountAuthentication", @@ -188,6 +644,111 @@ "name": "container" } ] + }, + { + "name": "ServicePrincipalAuthentication", + "condition": { + "expression": "authenticationMethod == 'servicePrincipal'" + }, + "show": [ + { + "name": "tenantId" + }, + { + "name": "clientId" + }, + { + "name": "clientSecret" + } + ] + }, + { + "name": "delimiter", + "condition": { + "expression": "format == 'delimited'" + }, + "show": [ + { + "name": "delimiter" + } + ] + }, + { + "name": "enableQuotedValues", + "condition": { + "expression": "format == 'delimited' || format == 'csv' || format == 'tsv'" + }, + "show": [ + { + "name": "enableQuotedValues" + } + ] + }, + { + "name": "skipHeader", + "condition": { + "expression": "format == 'delimited' || format == 'csv' || format == 'tsv' || format == 'xls'" + }, + "show": [ + { + "name": "skipHeader" + } + ] + }, + { + "name": "sheet", + "condition": { + "expression": "format == 'xls'" + }, + "show": [ + { + "name": "sheet" + } + ] + }, + { + "name": "sheetValue", + "condition": { + "expression": "format == 'xls'" + }, + "show": [ + { + "name": "sheetValue" + } + ] + }, + { + "name": "terminateIfEmptyRow", + "condition": { + "expression": "format == 'xls'" + }, + "show": [ + { + "name": "terminateIfEmptyRow" + } + ] + }, + { + "name": "sampleSize", + "condition": { + "expression": "format == 'xls'" + }, + "show": [ + { + "name": "sampleSize" + } + ] + }, + { + "name": "override", + "condition": { + "expression": "format == 'xls'" + }, + "show": [ + { + "name": "override" + } + ] } ] } diff --git a/pom.xml b/pom.xml index 8a6aac8..500d882 100644 --- a/pom.xml +++ b/pom.xml @@ -77,16 +77,17 @@ UTF-8 - 6.11.0 + 6.10.0 3.3.6 - 2.13.0 + 2.12.1 2.13.4.2 + 27.0-jre 1.11.4 widgets docs - system:cdap-data-pipeline[6.11.0-SNAPSHOT,7.0.0-SNAPSHOT),system:cdap-data-streams[6.11.0-SNAPSHOT,7.0.0-SNAPSHOT) + 
system:cdap-data-pipeline[6.10.0,7.0.0-SNAPSHOT),system:cdap-data-streams[6.10.0,7.0.0-SNAPSHOT) ${project.basedir} @@ -114,6 +115,11 @@ ${cdap.version} provided + + com.google.guava + guava + ${guava.version} + io.cdap.cdap hydrator-test