diff --git a/README.md b/README.md index cc2a78a4..3cc5b794 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,10 @@ # Skyflow Python SDK +[![PyPI version](https://img.shields.io/pypi/v/skyflow.svg)](https://pypi.org/project/skyflow/) +[![Python versions](https://img.shields.io/pypi/pyversions/skyflow.svg)](https://pypi.org/project/skyflow/) +[![CI Checks](https://github.com/skyflowapi/skyflow-python/actions/workflows/ci.yml/badge.svg)](https://github.com/skyflowapi/skyflow-python/actions/workflows/ci.yml) +[![License](https://img.shields.io/github/license/skyflowapi/skyflow-python.svg)](https://github.com/skyflowapi/skyflow-python/blob/main/LICENSE) + > **This is the current, recommended version of the Skyflow SDK.** V2.1.0 brings flexible auth, multi-vault support, native data types, and rich error diagnostics. > > Migrating from v1? See the **[Migration Guide](https://github.com/skyflowapi/skyflow-python/blob/main/docs/migrate_to_v2.md)** for step-by-step instructions. V1 is in maintenance mode and will reach End of Life on October 31, 2026. @@ -69,6 +74,9 @@ The Skyflow Python SDK is designed to help with integrating Skyflow into a Pytho The Skyflow SDK enables you to connect to your Skyflow Vault(s) to securely handle sensitive data at rest, in-transit, and in-use. +> [!TIP] +> Looking for the full list of request parameters, response object attributes, enums, client-management methods, and Detect helper classes? See the **[API Reference](docs/api_reference.md)**. + > [!IMPORTANT] > This readme documents SDK version 2. > For version 1 see the [v1.16.0 README](https://github.com/skyflowapi/skyflow-python/tree/v1). 
@@ -78,7 +86,7 @@ The Skyflow SDK enables you to connect to your Skyflow Vault(s) to securely hand ### Require -- Python 3.8.0 and above (tested with Python 3.8.0) +- Python 3.9 and above (tested with Python 3.9) ### Configuration @@ -92,6 +100,19 @@ pip install skyflow Get started quickly with the essential steps: authenticate, initialize the client, and perform a basic vault operation. This section shows you a minimal working example. +### Before you begin + +To run the examples below, you need a Skyflow account and a few values from the Skyflow Studio console. If you don't have an account yet, [request a demo](https://www.skyflow.com/get-demo). + +| Value | Where to find it | +|-------|------------------| +| `vault_id` | Your vault's details page in Skyflow Studio. | +| `cluster_id` | The first segment of your vault URL: `https://{cluster_id}.vault.skyflowapis.com`. | +| `env` | The environment your vault runs in — `Env.PROD`, `Env.SANDBOX`, `Env.DEV`, or `Env.STAGE` (defaults to `PROD`). | +| Credentials | Create a **service account** in Studio. Choose **API key** during creation for the simplest setup, or download the service-account `credentials.json` for token-based auth. See [Authentication & authorization](#authentication--authorization). | + +The quickstart below assumes a table named `table1` with `card_number` and `cardholder_name` columns. Create a matching table (or adjust the table/column names to your schema) in your vault before running it. See the [Skyflow docs](https://docs.skyflow.com/) for creating vaults, tables, and service accounts. + ### Authenticate You can use an API key or a personal bearer token to directly authenticate and authorize requests with the SDK. Use API keys for long-term service authentication. Use bearer tokens for optimal security. @@ -146,7 +167,7 @@ See [docs/advanced_initialization.md](docs/advanced_initialization.md) for advan Insert data into your vault using the `insert` method. 
Set `return_tokens=True` in the request to ensure values are tokenized in the response. -Create an insert request with the `InsertRequest` class, which includes the values to be inserted as a list of records. +Create an insert request with the [`InsertRequest`](docs/api_reference.md#insertrequest) class, which includes the values to be inserted as a list of records. Below is a simple example to get started. See the [Insert and tokenize data](#insert-and-tokenize-data-insertrequest) section for advanced options. @@ -168,6 +189,12 @@ insert_response = skyflow_client.vault('').insert(insert_request) print('Insert response:', insert_response) ``` +Returns an [`InsertResponse`](docs/api_reference.md#insertresponse) (`inserted_fields`, `errors`). With `return_tokens=True`, each entry includes the `skyflow_id` and a token per column: + +```text +Insert response: InsertResponse(inserted_fields=[{'skyflow_id': 'a8f0c2e1-7b3d-4f9a-8c21-1d2e3f4a5b6c', 'card_number': '5391-4629-3722-7102', 'cardholder_name': '0f6b8a2c-90ab-4cde-9def-567890abcdef'}], errors=None) +``` + ## Upgrade from v1 to v2 Upgrade from `skyflow-python` v1 using the dedicated guide in [docs/migrate_to_v2.md](docs/migrate_to_v2.md). @@ -202,6 +229,14 @@ response = skyflow_client.vault('').insert(insert_request) print('Insert response:', response) ``` +Returns an [`InsertResponse`](docs/api_reference.md#insertresponse): + +```text +Insert response: InsertResponse(inserted_fields=[{'skyflow_id': 'a8f0c2e1-7b3d-4f9a-8c21-1d2e3f4a5b6c', '': '', '': ''}], errors=None) +``` + +> With `continue_on_error=True`, each entry also carries a `request_index`, and `errors` is a list of `{request_index, request_id, error, http_code}` for the rows that failed. + #### Insert example with `continue_on_error` option Set the `continue_on_error` flag to `True` to allow insert operations to proceed despite encountering partial errors. 
@@ -227,7 +262,7 @@ insert_request = InsertRequest( Convert tokens back into plaintext values (or masked values) using the `.detokenize()` method. Detokenization accepts tokens and returns values. -Create a detokenization request with the `DetokenizeRequest` class, which requires a list of tokens and column groups as input. +Create a detokenization request with the [`DetokenizeRequest`](docs/api_reference.md#detokenizerequest) class, which requires a list of tokens and column groups as input. Provide optional parameters such as the redaction type and the option to continue on error. @@ -249,12 +284,18 @@ response = skyflow_client.vault('').detokenize(detokenize_request) print('Detokenization response:', response) ``` +Returns a [`DetokenizeResponse`](docs/api_reference.md#detokenizeresponse) (`detokenized_fields`, `errors`); each field has `token`, `value`, and `type`: + +```text +Detokenization response: DetokenizeResponse(detokenized_fields=[{'token': 'token1', 'value': '4111111111111111', 'type': 'STRING'}, {'token': 'token2', 'value': 'John Doe', 'type': 'STRING'}], errors=None) +``` + > [!TIP] > See the full example in the samples directory: [detokenize_records.py](samples/vault_api/detokenize_records.py) ### Get Record(s): `.get(request)` -Retrieve data using Skyflow IDs or unique column values with the `get` method. Create a get request with the `GetRequest` class, specifying parameters such as the table name, redaction type, Skyflow IDs, column names, and column values. +Retrieve data using Skyflow IDs or unique column values with the `get` method. Create a get request with the [`GetRequest`](docs/api_reference.md#getrequest) class, specifying parameters such as the table name, redaction type, Skyflow IDs, column names, and column values. > [!NOTE] > You can't use both Skyflow IDs and column name/value pairs in the same request. 
@@ -276,6 +317,12 @@ response = skyflow_client.vault('').get(get_request) print('Get response:', response) ``` +Returns a [`GetResponse`](docs/api_reference.md#getresponse) (`data`, `errors`), where `data` is a list of record dicts: + +```text +Get response: GetResponse(data=[{'skyflow_id': 'a8f0c2e1-7b3d-4f9a-8c21-1d2e3f4a5b6c', 'card_number': '4111111111111111', 'cardholder_name': 'John Doe'}], errors=None) +``` + #### Get by Skyflow IDs Retrieve specific records using Skyflow IDs. Use this method when you know the exact record IDs. @@ -295,6 +342,10 @@ response = skyflow_client.vault('').get(get_request) print('Data retrieval successful:', response) ``` +```text +Data retrieval successful: GetResponse(data=[{'skyflow_id': '', 'card_number': '4111111111111111', 'cardholder_name': 'John Doe'}], errors=None) +``` + #### Get tokens for records Return tokens for records to securely process sensitive data while maintaining data privacy. @@ -344,7 +395,7 @@ Use redaction types to control how sensitive data displays when retrieved from t ### Update Records -Update data in your vault using the `update` method. Create an update request with the `UpdateRequest` class, specifying parameters such as the table name and data (as a dictionary). +Update data in your vault using the `update` method. Create an update request with the [`UpdateRequest`](docs/api_reference.md#updaterequest) class, specifying parameters such as the table name and data (as a dictionary). You can pass options like `return_tokens` directly to the request. When `True`, Skyflow returns tokens for the updated records. When `False`, it returns IDs. @@ -366,12 +417,18 @@ response = skyflow_client.vault('').update(update_request) print('Update response:', response) ``` +Returns an [`UpdateResponse`](docs/api_reference.md#updateresponse) (`updated_field`, `errors`). 
With the default `return_tokens=False`, only the `skyflow_id` is returned; with `return_tokens=True`, tokens for the updated columns are included: + +```text +Update response: UpdateResponse(updated_field={'skyflow_id': ''}, errors=None) +``` + > [!TIP] > See the full example in the samples directory: [update_record.py](samples/vault_api/update_record.py) ### Delete Records -Delete records using Skyflow IDs with the `delete` method. Create a delete request with the `DeleteRequest` class, which accepts a list of Skyflow IDs: +Delete records using Skyflow IDs with the `delete` method. Create a delete request with the [`DeleteRequest`](docs/api_reference.md#deleterequest) class, which accepts a list of Skyflow IDs: ```python from skyflow.vault.data import DeleteRequest @@ -385,12 +442,18 @@ response = skyflow_client.vault('').delete(delete_request) print('Delete response:', response) ``` +Returns a [`DeleteResponse`](docs/api_reference.md#deleteresponse) (`deleted_ids`, `errors`): + +```text +Delete response: DeleteResponse(deleted_ids=['', '', ''], errors=None) +``` + > [!TIP] > See the full example in the samples directory: [delete_records.py](samples/vault_api/delete_records.py) ### Query -Retrieve data with SQL queries using the `query` method. Create a query request with the `QueryRequest` class, which takes the `query` parameter as follows: +Retrieve data with SQL queries using the `query` method. 
Create a query request with the [`QueryRequest`](docs/api_reference.md#queryrequest) class, which takes the `query` parameter as follows: ```python from skyflow.vault.data import QueryRequest @@ -403,6 +466,12 @@ response = skyflow_client.vault('').query(query_request) print('Query response:', response) ``` +Returns a [`QueryResponse`](docs/api_reference.md#queryresponse) (`fields`, `errors`), where `fields` is a list of matching record dicts (each also includes a `tokenized_data` map): + +```text +Query response: QueryResponse(fields=[{'card_number': '4111111111111111', 'cardholder_name': 'John Doe', 'tokenized_data': {}}], errors=None) +``` + > [!TIP] > See the full example in the samples directory: [query_records.py](samples/vault_api/query_records.py) @@ -410,7 +479,7 @@ Refer to [Query your data](https://docs.skyflow.com/query-data/) and [Execute Qu ### Upload File -Upload files to a Skyflow vault using the `upload_file` method. Create a file upload request with the `FileUploadRequest` class. +Upload files to a Skyflow vault using the `upload_file` method. Create a file upload request with the [`FileUploadRequest`](docs/api_reference.md#fileuploadrequest) class. **Upload a file to an existing record:** @@ -444,12 +513,18 @@ with open('path/to/file.pdf', 'rb') as file_obj: print('File upload:', response) ``` +Both forms return a [`FileUploadResponse`](docs/api_reference.md#fileuploadresponse) (`skyflow_id`, `errors`) with the ID of the record the file was attached to (or the newly created record): + +```text +File upload: FileUploadResponse(skyflow_id='a8f0c2e1-7b3d-4f9a-8c21-1d2e3f4a5b6c', errors=None) +``` + > [!TIP] > See the full example in the samples directory: [upload_file.py](samples/vault_api/upload_file.py) ### Retrieve Existing Tokens: `.tokenize(request)` -Retrieve tokens for values that already exist in the vault using the `.tokenize()` method. This method returns existing tokens only and does not generate new tokens. 
+Retrieve tokens for values that already exist in the vault using the `.tokenize()` method. This method returns existing tokens only and does not generate new tokens. Build the request with the [`TokenizeRequest`](docs/api_reference.md#tokenizerequest) class. #### Construct a `.tokenize()` request @@ -467,6 +542,12 @@ response = skyflow_client.vault('').tokenize(tokenize_request) print('Tokenization result:', response) ``` +Returns a [`TokenizeResponse`](docs/api_reference.md#tokenizeresponse) (`tokenized_fields`, `errors`); each field carries its `token`: + +```text +Tokenization result: TokenizeResponse(tokenized_fields=[{'token': 'a1b2c3d4-...'}, {'token': 'e5f6g7h8-...'}], errors=None) +``` + > [!TIP] > See the full example in the samples directory: [tokenize_records.py](samples/vault_api/tokenize_records.py) @@ -478,7 +559,7 @@ De-identify and reidentify sensitive data in text and files using Skyflow Detect De-identify or anonymize text using the `deidentify_text` method. -Create a de-identify text request with the `DeidentifyTextRequest` class. +Create a de-identify text request with the [`DeidentifyTextRequest`](docs/api_reference.md#deidentifytextrequest) class. ```python from skyflow.vault.detect import DeidentifyTextRequest, TokenFormat, Transformations, DateTransformation @@ -501,12 +582,18 @@ response = skyflow_client.detect('').deidentify_text(request) print('De-identify Text Response:', response) ``` +Returns a [`DeidentifyTextResponse`](docs/api_reference.md#deidentifytextresponse) (`processed_text`, `entities`, `word_count`, `char_count`, `errors`). 
`entities` is a list of [`EntityInfo`](docs/api_reference.md#entityinfo) describing each detected entity: + +```text +De-identify Text Response: DeidentifyTextResponse(processed_text='My SSN is [SSN_1].', entities=[...], word_count=4, char_count=18, errors=None) +``` + > [!TIP] > See the full example in the samples directory: [deidentify_text.py](samples/detect_api/deidentify_text.py) ### Re-identify Text: `.reidentify_text(request)` -Re-identify text using the `reidentify_text` method. Create a reidentify text request with the `ReidentifyTextRequest` class, which includes the redacted or de-identified text to be re-identified. +Re-identify text using the `reidentify_text` method. Create a reidentify text request with the [`ReidentifyTextRequest`](docs/api_reference.md#reidentifytextrequest) class, which includes the redacted or de-identified text to be re-identified. ```python from skyflow.vault.detect import ReidentifyTextRequest @@ -523,12 +610,18 @@ response = skyflow_client.detect().reidentify_text(request) print('Re-identify Text Response:', response) ``` +Returns a [`ReidentifyTextResponse`](docs/api_reference.md#reidentifytextresponse) (`processed_text`, `errors`): + +```text +Re-identify Text Response: ReidentifyTextResponse(processed_text='John lives in NYC', errors=None) +``` + > [!TIP] > See the full example in the samples directory: [reidentify_text.py](samples/detect_api/reidentify_text.py) ### De-identify File: `.deidentify_file(request)` -De-identify files using the `deidentify_file` method. Create a request with the `DeidentifyFileRequest` class, which includes the file to be deidentified. Provide optional parameters to control how entities are detected and deidentified. +De-identify files using the `deidentify_file` method. Create a request with the [`DeidentifyFileRequest`](docs/api_reference.md#deidentifyfilerequest) class, which includes the file to be deidentified. 
Provide optional parameters to control how entities are detected and deidentified. ```python from skyflow.vault.detect import DeidentifyFileRequest, TokenFormat, FileInput @@ -548,6 +641,12 @@ with open('path/to/file.pdf', 'rb') as file_obj: print('De-identify File Response:', response) ``` +Returns a [`DeidentifyFileResponse`](docs/api_reference.md#deidentifyfileresponse) with the processed file plus metadata (`file`, `type`, `extension`, `word_count`, `char_count`, `size_in_kb`, `entities`, `run_id`, `status`, `errors`, and more — see the [API Reference](docs/api_reference.md#response-objects)). If processing exceeds `wait_time`, only `run_id` and `status` are returned (poll with `get_detect_run`): + +```text +De-identify File Response: DeidentifyFileResponse(file_base64=None, file=, type='application/pdf', extension='pdf', ..., run_id='r-9c1f2a3b', status='SUCCESS', errors=None) +``` + **Supported file types:** - Documents: `doc`, `docx`, `pdf` @@ -569,7 +668,7 @@ with open('path/to/file.pdf', 'rb') as file_obj: ### Get Run: `.get_detect_run(request)` -Retrieve the results of a previously started file de-identification operation using the `get_detect_run` method. Initialize the request with the `run_id` returned from a prior .`deidentify_file` call. +Retrieve the results of a previously started file de-identification operation using the `get_detect_run` method. Build the request with the [`GetDetectRunRequest`](docs/api_reference.md#getdetectrunrequest) class, initialized with the `run_id` returned from a prior `deidentify_file` call. 
```python from skyflow.vault.detect import GetDetectRunRequest @@ -582,6 +681,12 @@ response = skyflow_client.detect().get_detect_run(request) print('Get Detect Run Response:', response) ``` +Returns a [`DeidentifyFileResponse`](docs/api_reference.md#deidentifyfileresponse) with the current `status` for the run (and the processed file once `status` is complete): + +```text +Get Detect Run Response: DeidentifyFileResponse(file_base64=None, file=None, ..., run_id='r-9c1f2a3b', status='IN_PROGRESS', errors=None) +``` + > [!TIP] > See the full example in the samples directory: [get_detect_run.py](samples/detect_api/get_detect_run.py) @@ -594,7 +699,7 @@ Securely send and receive data between your systems and first- or third-party se ### Invoke a connection -To invoke a connection, use the `invoke` method of the Skyflow client. +To invoke a connection, use the `invoke` method of the Skyflow client. Build the request with the [`InvokeConnectionRequest`](docs/api_reference.md#invokeconnectionrequest) class. #### Construct an invoke connection request @@ -614,7 +719,13 @@ response = skyflow_client.connection().invoke(invoke_request) print('Connection response:', response) ``` -`method` supports the following methods: +Returns an [`InvokeConnectionResponse`](docs/api_reference.md#invokeconnectionresponse) (`data`, `metadata`, `errors`), where `data` is the connection's response body: + +```text +Connection response: InvokeConnectionResponse(data={'message': 'success'}, metadata={'request_id': 'b7d3...'}, errors=None) +``` + +`method` supports the following methods (see [`RequestMethod`](docs/api_reference.md#requestmethod)): - `GET` - `POST` @@ -823,6 +934,28 @@ skyflow_client = ( ) ``` +## Using the client in production + +**Build the client once and reuse it.** `Skyflow.builder()...build()` returns a long-lived client that lazily creates and caches an HTTP client and bearer token per vault. 
Construct it once at startup (for example, as a module-level singleton or a dependency-injected instance) and reuse it across requests. Rebuilding the client on every request discards these caches and forces unnecessary token regeneration. + +```python +# At application startup +skyflow_client = ( + Skyflow.builder() + .add_vault_config(vault_config) + .set_log_level(LogLevel.ERROR) + .build() +) + +# Reuse `skyflow_client` for the lifetime of the process +``` + +**Bearer token refresh is automatic.** When you authenticate with a service-account credentials file/string (or API key), the SDK caches the generated bearer token and regenerates it automatically once it expires. You don't need to manage token lifecycle yourself for the common case. (For the rare expire-mid-request case, see [Bearer token expiration edge cases](#bearer-token-expiration-edge-cases).) + +**Configuration mutation is not concurrency-safe.** Methods that change client configuration at runtime — `add_vault_config`, `update_vault_config`, `remove_vault_config`, the `*_connection_config` methods, and `update_skyflow_credentials` — mutate shared client state without locking. Perform configuration changes during setup, not concurrently with in-flight requests from other threads. Once configured, reusing the built client to issue operations is the intended usage pattern. + +**Timeouts and retries.** The SDK does not currently expose request timeout or automatic-retry configuration. If you need strict timeout or retry guarantees, wrap your SDK calls with your own timeout/retry logic at the application layer. + ## Error handling ### Catching `SkyflowError` instances @@ -861,6 +994,24 @@ If you encounter this kind of error, retry the request. 
During the retry the SDK > See the full example in the samples directory: [bearer_token_expiry_example.py](samples/service_account/bearer_token_expiry_example.py) > See [docs.skyflow.com](https://docs.skyflow.com) for more details on authentication, access control, and governance for Skyflow. +## Troubleshooting + +Most first-run problems come from configuration mismatches. Every error raised by the SDK is a `SkyflowError` exposing `http_code`, `message`, and `details` — inspect these first (see [Error handling](#error-handling)). + +| Symptom | Likely cause | Fix | +|---------|--------------|-----| +| `pip install skyflow` fails / `RuntimeError: skyflow requires Python 3.9+` | Python older than 3.9 | Use Python 3.9 or above. | +| Connection/DNS failures, or 404 on every call | Wrong `cluster_id` | `cluster_id` is the first segment of your vault URL: `https://{cluster_id}.vault.skyflowapis.com`. | +| Requests hit the wrong host / unexpected auth failures | Wrong `env` | Match `env` to where your vault runs (`Env.PROD`, `Env.SANDBOX`, `Env.DEV`, `Env.STAGE`). | +| `401 Unauthorized` | Invalid or expired credentials | Verify your API key / service-account credentials. Regenerate if needed. | +| `403 Forbidden` | Service account lacks permission for the operation | Grant the service account a role with the required permissions, or use a [scoped token](#generate-bearer-tokens-scoped-to-certain-roles) with the right role. | +| `404` referencing a table or column | Table/column doesn't exist or name mismatch | Confirm the table and column names match your vault schema exactly (case-sensitive). | +| Vault not found / 404 with a valid `cluster_id` | Wrong `vault_id` | Copy `vault_id` from the vault's details page in Skyflow Studio. | +| `Authentication failed. Bearer token is expired.` | Token expired between verification and the API call | Retry the request; the SDK regenerates the token. See [Bearer token expiration edge cases](#bearer-token-expiration-edge-cases). 
| +| Unexpected credential is used | Multiple credentials provided | Only one credential type is used at a time; the last one added takes precedence. Provide exactly one. | + +If you're stuck, call `set_log_level(LogLevel.DEBUG)` during development for detailed SDK logs (see [Logging](#logging)). + ## Security ### Reporting a Vulnerability diff --git a/docs/api_reference.md b/docs/api_reference.md new file mode 100644 index 00000000..cacc2fe2 --- /dev/null +++ b/docs/api_reference.md @@ -0,0 +1,506 @@ +# API Reference + +A reference for the public Skyflow Python SDK surface: client-management methods, request and response objects, enums, Detect helper classes, and service-account functions. For task-oriented usage and examples, see the [README](../README.md). + +All attributes, parameters, and enum values below are taken directly from the SDK source. + +## Table of Contents + +- [Client management methods](#client-management-methods) +- [Request objects](#request-objects) +- [Response objects](#response-objects) +- [Enums](#enums) +- [Detect helper classes](#detect-helper-classes) +- [Service account functions](#service-account-functions) + +--- + +## Client management methods + +In addition to the builder methods (`add_vault_config`, `add_connection_config`, `add_skyflow_credentials`, `set_log_level`, `build`) and the operation accessors (`vault()`, `connection()`, `detect()`), a built `Skyflow` client exposes methods to mutate its configuration and logging at runtime. + +| Method | Purpose | +|--------|---------| +| `add_vault_config(config)` | Add a vault configuration after build. | +| `remove_vault_config(vault_id)` | Remove a vault configuration. | +| `update_vault_config(config)` | Update an existing vault configuration. | +| `get_vault_config(vault_id)` | Retrieve a vault configuration. | +| `add_connection_config(config)` | Add a connection configuration. | +| `remove_connection_config(connection_id)` | Remove a connection configuration. 
| +| `update_connection_config(config)` | Update a connection configuration. | +| `get_connection_config(connection_id)` | Retrieve a connection configuration. | +| `add_skyflow_credentials(credentials)` | Add common Skyflow credentials applied across configs. | +| `update_skyflow_credentials(credentials)` | Update the common Skyflow credentials. | +| `set_log_level(log_level)` | Set the log level (builder + client). | +| `update_log_level(log_level)` | Change the log level after initialization. | +| `get_log_level()` | Return the current log level. | +| `vault(vault_id=None)` | Get a vault controller for the given (or default) vault. | +| `connection(connection_id=None)` | Get a connection controller. | +| `detect(vault_id=None)` | Get a Detect controller. | + +```python +# Example: manage configuration after the client is built +skyflow_client.add_vault_config(another_vault_config) +skyflow_client.update_log_level(LogLevel.DEBUG) +current_level = skyflow_client.get_log_level() +``` + +--- + +## Request objects + +Parameters are listed with their defaults as defined in the constructors. + +### `InsertRequest` + +`skyflow.vault.data` — passed to `vault().insert()`. + +| Parameter | Default | Description | +|-----------|---------|-------------| +| `table` | _(required)_ | Target table name. | +| `values` | _(required)_ | List of record dicts to insert. | +| `tokens` | `None` | Bring-your-own-token values, aligned with `values` (used with `token_mode`). | +| `upsert` | `None` | Column name to use as the upsert index (must have a `unique` constraint). | +| `homogeneous` | `False` | Treat the batch as homogeneous (all records share the same columns). | +| `token_mode` | `TokenMode.DISABLE` | BYOT mode. See [`TokenMode`](#tokenmode). | +| `return_tokens` | `True` | Return tokens for inserted values. | +| `continue_on_error` | `False` | Continue the batch despite partial errors. | + +### `UpdateRequest` + +`skyflow.vault.data` — passed to `vault().update()`. 
+ +| Parameter | Default | Description | +|-----------|---------|-------------| +| `table` | _(required)_ | Target table name. | +| `data` | _(required)_ | Dict containing `skyflow_id` and the columns to update. | +| `tokens` | `None` | BYOT values for the updated columns. | +| `return_tokens` | `False` | Return tokens (vs. IDs) for updated records. | +| `token_mode` | `TokenMode.DISABLE` | BYOT mode. See [`TokenMode`](#tokenmode). | + +### `GetRequest` + +`skyflow.vault.data` — passed to `vault().get()`. + +| Parameter | Default | Description | +|-----------|---------|-------------| +| `table` | _(required)_ | Target table name. | +| `ids` | `None` | Skyflow IDs to retrieve. Mutually exclusive with `column_name`/`column_values`. | +| `redaction_type` | `None` | See [`RedactionType`](#redactiontype). | +| `return_tokens` | `False` | Return tokens instead of values. | +| `fields` | `None` | Specific fields/columns to return. | +| `offset` | `None` | Pagination offset. | +| `limit` | `None` | Pagination limit. | +| `download_url` | `None` | Return file download URLs for file columns. | +| `column_name` | `None` | Unique column to look up by. Mutually exclusive with `ids`. | +| `column_values` | `None` | Values for `column_name`. | + +### `FileUploadRequest` + +`skyflow.vault.data` — passed to `vault().upload_file()`. Provide exactly one file source: `file_object`, `file_path`, or `base64`. + +| Parameter | Default | Description | +|-----------|---------|-------------| +| `table` | _(required)_ | Target table name. | +| `column_name` | `None` | File column name. | +| `skyflow_id` | `None` | Existing record ID. Omit to create a new record. | +| `file_path` | `None` | Path to a file to upload. | +| `base64` | `None` | Base64-encoded file content. | +| `file_object` | `None` | An open binary file object. | +| `file_name` | `None` | Override the file name. | + +### `FileInput` + +`skyflow.vault.detect` — wrapper for a file passed to `DeidentifyFileRequest`. 
Provide one of: + +| Parameter | Default | Description | +|-----------|---------|-------------| +| `file` | `None` | An open binary file (`BufferedReader`). | +| `file_path` | `None` | Path to a file. | + +### `DeidentifyTextRequest` + +`skyflow.vault.detect` — passed to `detect().deidentify_text()`. + +| Parameter | Default | Description | +|-----------|---------|-------------| +| `text` | _(required)_ | Text to de-identify. | +| `entities` | `None` | Entity types to detect. See `DetectEntities`. | +| `allow_regex_list` | `None` | Regex patterns to always treat as detectable. | +| `restrict_regex_list` | `None` | Regex patterns to exclude from detection. | +| `token_format` | `None` | `TokenFormat` controlling token types per entity. | +| `transformations` | `None` | `Transformations` (e.g. date shifting). | + +### `DeidentifyFileRequest` + +`skyflow.vault.detect` — passed to `detect().deidentify_file()`. + +| Parameter | Default | Description | +|-----------|---------|-------------| +| `file` | `None` | A `FileInput`. | +| `entities` | `None` | Entity types to detect. | +| `allow_regex_list` | `None` | Regex patterns to always treat as detectable. | +| `restrict_regex_list` | `None` | Regex patterns to exclude. | +| `token_format` | `None` | `TokenFormat` per entity. | +| `transformations` | `None` | `Transformations` (not supported for Documents/Images/PDFs). | +| `output_processed_image` | `None` | Include the processed image in output. | +| `output_ocr_text` | `None` | Include OCR text in the response. | +| `masking_method` | `None` | See [`MaskingMethod`](#maskingmethod). | +| `pixel_density` | `None` | Pixel density for PDF processing. | +| `max_resolution` | `None` | Max resolution for PDF processing. | +| `output_processed_audio` | `None` | Include processed audio. | +| `output_transcription` | `None` | See [`DetectOutputTranscriptions`](#detectoutputtranscriptions). | +| `bleep` | `None` | Audio bleep config. See [`Bleep`](#bleep). 
| +| `output_directory` | `None` | Directory to write the processed file. | +| `wait_time` | `None` | Max seconds to wait (≤ 64). | + +### `DetokenizeRequest` + +`skyflow.vault.tokens` — passed to `vault().detokenize()`. + +| Parameter | Default | Description | +|-----------|---------|-------------| +| `data` | _(required)_ | List of `{token, redaction_type}` dicts to detokenize. See [`RedactionType`](#redactiontype). | +| `continue_on_error` | `False` | Continue despite per-token errors. | + +### `TokenizeRequest` + +`skyflow.vault.tokens` — passed to `vault().tokenize()`. + +| Parameter | Default | Description | +|-----------|---------|-------------| +| `values` | _(required)_ | List of `{value, column_group}` dicts to tokenize. | + +### `DeleteRequest` + +`skyflow.vault.data` — passed to `vault().delete()`. + +| Parameter | Default | Description | +|-----------|---------|-------------| +| `table` | _(required)_ | Target table name. | +| `ids` | _(required)_ | List of Skyflow IDs to delete. | + +### `QueryRequest` + +`skyflow.vault.data` — passed to `vault().query()`. + +| Parameter | Default | Description | +|-----------|---------|-------------| +| `query` | _(required)_ | The SQL query string to execute. | + +### `ReidentifyTextRequest` + +`skyflow.vault.detect` — passed to `detect().reidentify_text()`. + +| Parameter | Default | Description | +|-----------|---------|-------------| +| `text` | _(required)_ | The redacted/de-identified text to re-identify. | +| `redacted_entities` | `None` | Entity types to keep redacted. See `DetectEntities`. | +| `masked_entities` | `None` | Entity types to mask. | +| `plain_text_entities` | `None` | Entity types to reveal as plain text. | + +### `GetDetectRunRequest` + +`skyflow.vault.detect` — passed to `detect().get_detect_run()`. + +| Parameter | Default | Description | +|-----------|---------|-------------| +| `run_id` | _(required)_ | The `run_id` returned by a prior `deidentify_file` call. 
| + +### `InvokeConnectionRequest` + +`skyflow.vault.connection` — passed to `connection().invoke()`. + +| Parameter | Default | Description | +|-----------|---------|-------------| +| `method` | _(required)_ | HTTP method. See [`RequestMethod`](#requestmethod). | +| `body` | `None` | Request body (dict). | +| `path_params` | `None` | Path parameters (dict). | +| `query_params` | `None` | Query parameters (dict). | +| `headers` | `None` | Request headers (dict). | + +--- + +## Response objects + +Every vault, token, connection, and Detect operation returns a typed response object. Each attribute below lists its type and meaning. Types use `| None` to mark attributes that may be absent. + +> **The `errors` attribute** is common to most responses. It is `list[dict] | None` and is populated only on partial failure (for example when `continue_on_error=True`); it is `None` when there are no errors. Each error dict contains `request_index`, `request_id`, `error`, and `http_code`. The per-class tables below describe only the operation-specific attributes and refer back to this note for `errors`. + +```python +response = skyflow_client.vault('').insert(insert_request) +print(response.inserted_fields) # list of inserted records (with tokens if return_tokens=True) +print(response.errors) # None unless there was a partial failure +``` + +### `InsertResponse` + +`skyflow.vault.data` — returned by `vault().insert()`. + +| Attribute | Type | Description | +|-----------|------|-------------| +| `inserted_fields` | `list[dict]` | One entry per inserted record. Each has `skyflow_id`; with `return_tokens=True`, also a token per column; with `continue_on_error=True`, also a `request_index`. | +| `errors` | `list[dict] \| None` | See the note above. | + +### `GetResponse` + +`skyflow.vault.data` — returned by `vault().get()`. 
+ +| Attribute | Type | Description | +|-----------|------|-------------| +| `data` | `list[dict]` | Retrieved records as `field → value` dicts (tokens instead of values when `return_tokens=True`). Defaults to `[]`. | +| `errors` | `list[dict] \| None` | See the note above. | + +### `DeleteResponse` + +`skyflow.vault.data` — returned by `vault().delete()`. + +| Attribute | Type | Description | +|-----------|------|-------------| +| `deleted_ids` | `list[str] \| None` | Skyflow IDs of the deleted records. | +| `errors` | `list[dict] \| None` | See the note above. | + +### `UpdateResponse` + +`skyflow.vault.data` — returned by `vault().update()`. + +| Attribute | Type | Description | +|-----------|------|-------------| +| `updated_field` | `dict` | The updated record: `skyflow_id`, plus a token per updated column when `return_tokens=True`. | +| `errors` | `list[dict] \| None` | See the note above. | + +### `QueryResponse` + +`skyflow.vault.data` — returned by `vault().query()`. + +| Attribute | Type | Description | +|-----------|------|-------------| +| `fields` | `list[dict]` | Matching records. Each record dict also includes a `tokenized_data` map. | +| `errors` | `list[dict] \| None` | See the note above. | + +### `FileUploadResponse` + +`skyflow.vault.data` — returned by `vault().upload_file()`. + +| Attribute | Type | Description | +|-----------|------|-------------| +| `skyflow_id` | `str` | ID of the record the file was attached to (or of the newly created record). | +| `errors` | `list[dict] \| None` | See the note above. | + +### `DetokenizeResponse` + +`skyflow.vault.tokens` — returned by `vault().detokenize()`. + +| Attribute | Type | Description | +|-----------|------|-------------| +| `detokenized_fields` | `list[dict]` | One entry per token, each with `token`, `value` (plaintext or masked), and `type` (the value type). | +| `errors` | `list[dict] \| None` | See the note above. 
| + +### `TokenizeResponse` + +`skyflow.vault.tokens` — returned by `vault().tokenize()`. + +| Attribute | Type | Description | +|-----------|------|-------------| +| `tokenized_fields` | `list[dict]` | One entry per value, each with its `token`. | +| `errors` | `list[dict] \| None` | See the note above. | + +### `InvokeConnectionResponse` + +`skyflow.vault.connection` — returned by `connection().invoke()`. + +| Attribute | Type | Description | +|-----------|------|-------------| +| `data` | `dict` | The connection's response body. | +| `metadata` | `dict` | Response metadata (for example `request_id`). Defaults to `{}`. | +| `errors` | `list[dict] \| None` | See the note above. | + +### `DeidentifyTextResponse` + +`skyflow.vault.detect` — returned by `detect().deidentify_text()`. + +| Attribute | Type | Description | +|-----------|------|-------------| +| `processed_text` | `str` | The de-identified text. | +| `entities` | `list[EntityInfo]` | Detected entities. See [`EntityInfo`](#entityinfo). | +| `word_count` | `int` | Word count of the input text. | +| `char_count` | `int` | Character count of the input text. | +| `errors` | `list \| None` | See the note above. | + +### `ReidentifyTextResponse` + +`skyflow.vault.detect` — returned by `detect().reidentify_text()`. + +| Attribute | Type | Description | +|-----------|------|-------------| +| `processed_text` | `str` | The re-identified text. | +| `errors` | `list \| None` | See the note above. | + +### `DeidentifyFileResponse` + +`skyflow.vault.detect` — returned by `detect().deidentify_file()` and `detect().get_detect_run()`. All non-error attributes are optional (default `None`) and are populated based on the file type and processing status. If processing exceeds `wait_time`, only `run_id` and `status` are set; poll with `get_detect_run`. + +| Attribute | Type | Description | +|-----------|------|-------------| +| `file_base64` | `str \| None` | The processed file as a base64 string. 
| +| `file` | `File \| None` | The processed file wrapper. See [`File`](#file). | +| `type` | `str \| None` | MIME type of the processed file. | +| `extension` | `str \| None` | File extension of the processed file. | +| `word_count` | `int \| None` | Word count (text-bearing files). | +| `char_count` | `int \| None` | Character count (text-bearing files). | +| `size_in_kb` | `float \| None` | Size of the processed file in KB. | +| `duration_in_seconds` | `float \| None` | Duration in seconds (audio files). | +| `page_count` | `int \| None` | Page count (PDF/document files). | +| `slide_count` | `int \| None` | Slide count (presentation files). | +| `entities` | `list[EntityInfo]` | Detected entities. Defaults to `[]`. See [`EntityInfo`](#entityinfo). | +| `run_id` | `str \| None` | Run identifier; pass to `get_detect_run` to poll for results. | +| `status` | `str \| None` | Processing status of the run. | +| `errors` | `list \| None` | See the note above. | + +--- + +## Enums + +All enums are importable from `skyflow.utils.enums`. + +### `Env` + +Deployment environment. Values: `DEV`, `SANDBOX`, `PROD`, `STAGE`. + +### `EnvUrls` + +Vault hostnames per environment (used internally; exported for reference). + +| Member | Host | +|--------|------| +| `PROD` | `vault.skyflowapis.com` | +| `SANDBOX` | `vault.skyflowapis-preview.com` | +| `DEV` | `vault.skyflowapis.dev` | +| `STAGE` | `vault.skyflowapis.tech` | + +### `LogLevel` + +`DEBUG`, `INFO`, `WARN`, `ERROR`, `OFF`. See [Logging](../README.md#logging). + +### `RedactionType` + +How retrieved data is displayed. Values: `PLAIN_TEXT`, `MASKED`, `DEFAULT`, `REDACTED`. See [Redaction Types](../README.md#redaction-types). + +### `TokenMode` + +Bring-your-own-token mode for `InsertRequest`/`UpdateRequest`. + +| Member | Meaning | +|--------|---------| +| `DISABLE` | Do not accept caller-supplied tokens (default). | +| `ENABLE` | Accept caller-supplied tokens. 
| +| `ENABLE_STRICT` | Accept caller-supplied tokens with strict validation. | + +### `TokenType` + +Token format for Detect. Values: `VAULT_TOKEN` (`vault_token`), `ENTITY_UNIQUE_COUNTER` (`entity_unq_counter`), `ENTITY_ONLY` (`entity_only`). + +### `ContentType` + +Content type for connection requests. Values: `JSON`, `PLAINTEXT`, `XML`, `URLENCODED`, `FORMDATA`, `HTML`. + +### `RequestMethod` + +HTTP method for connections. Values: `GET`, `POST`, `PUT`, `PATCH`, `DELETE`, `NONE`. + +### `MaskingMethod` + +Image masking method for Detect file de-identification. Values: `BLACKBOX` (`blackbox`), `BLUR` (`blur`). + +### `DetectOutputTranscriptions` + +Audio transcription output type for Detect. Values: `DIARIZED_TRANSCRIPTION`, `MEDICAL_DIARIZED_TRANSCRIPTION`, `MEDICAL_TRANSCRIPTION`, `TRANSCRIPTION`, `PLAINTEXT_TRANSCRIPTION`. + +### `DetectEntities` + +Entity types Detect can identify (e.g. `SSN`, `CREDIT_CARD`, `NAME`, `DOB`). Import from `skyflow.utils.enums`. + +--- + +## Detect helper classes + +Importable from `skyflow.vault.detect`. + +### `EntityInfo` + +A detected entity, returned inside `DeidentifyTextResponse.entities`. + +| Attribute | Type | Description | +|-----------|------|-------------| +| `token` | `str` | The token replacing the entity. | +| `value` | `str` | The original entity value. | +| `text_index` | `TextIndex` | Position in the input text. | +| `processed_index` | `TextIndex` | Position in the processed text. | +| `entity` | `str` | Entity type. | +| `scores` | `Dict[str, float]` | Confidence scores. | + +### `TextIndex` + +| Attribute | Type | Description | +|-----------|------|-------------| +| `start` | `int` | Start offset. | +| `end` | `int` | End offset. | + +### `Bleep` + +Audio bleep configuration for `DeidentifyFileRequest`. + +| Attribute | Type | Description | +|-----------|------|-------------| +| `gain` | `float` | Loudness in dB. | +| `frequency` | `float` | Pitch in Hz. 
| +| `start_padding` | `float` | Padding at start (seconds). | +| `stop_padding` | `float` | Padding at end (seconds). | + +### `File` + +Wrapper around the processed file returned in `DeidentifyFileResponse.file`. + +| Member | Kind | Description | +|--------|------|-------------| +| `name` | property | File name. | +| `size` | property | Size in bytes. | +| `type` | property | MIME/type string. | +| `last_modified` | property | Last-modified timestamp. | +| `seek(offset, whence=0)` | method | Seek within the file. | +| `read(size=-1)` | method | Read file content. | + +--- + +## Service account functions + +Importable from `skyflow.service_account`. See [Authentication & authorization](../README.md#authentication--authorization) for `generate_bearer_token`, `generate_bearer_token_from_creds`, and `generate_signed_data_tokens`. + +### `is_expired(token, logger=None)` + +Returns `True` if the given bearer token is expired (or `None`). Useful for caching tokens and only regenerating when needed. + +```python +from skyflow.service_account import generate_bearer_token, is_expired + +if cached_token is None or is_expired(cached_token): + cached_token, _ = generate_bearer_token('path/to/credentials.json') +``` + +### `generate_signed_data_tokens_from_creds(credentials, options)` + +The credentials-string counterpart to `generate_signed_data_tokens(filepath, options)`. Accepts a JSON credentials string instead of a file path; `options` is the same (`data_tokens`, `time_to_live`, `ctx`). 
+ +```python +import os +from skyflow.service_account import generate_signed_data_tokens_from_creds + +signed_tokens = generate_signed_data_tokens_from_creds( + os.getenv('SKYFLOW_CREDENTIALS'), + { + 'data_tokens': ['dataToken1', 'dataToken2'], + 'time_to_live': 90, + 'ctx': 'user_12345', + }, +) +``` diff --git a/docs/onboarding-backlog.md b/docs/onboarding-backlog.md new file mode 100644 index 00000000..6ace9a25 --- /dev/null +++ b/docs/onboarding-backlog.md @@ -0,0 +1,292 @@ +# Onboarding & Documentation Backlog + +**Status:** Triage backlog (not an implementation plan) +**Created:** 2026-06-02 +**Scope:** Consolidates two audits of the Skyflow Python SDK into one prioritized, pick-up-ready backlog. Implementation scope/sequencing to be decided by the team during triage. + +## Source audits + +This document merges and de-duplicates: + +1. **Onboarding / first-run audit** — focuses on time-to-first-success, correctness drift between docs and code, packaging/typing, and contributor onboarding. (Captured in this repo.) +2. **API-coverage audit** — Confluence page *"Python"* (space `SDK1`), which catalogues ~80+ public interfaces and finds ~35+ undocumented in the README: + +## Executive summary + +The README is genuinely comprehensive and well-structured (clear ToC, consistent "construct request → call → print" pattern, runnable sample links, a real v1→v2 migration guide). The gaps are **not** about missing breadth. They cluster into three themes: + +- **Correctness drift** — the docs contradict the code in ways that break or mislead a new user before they write any code (e.g. Python version requirement). +- **Hard first-run barrier** — the quickstart assumes the user already has a vault, cluster ID, credentials, and a matching table; there is no "zero-to-running" / sandbox path, and **no operation shows its response shape**. 
+- **Undocumented public surface** — ~35+ exported enums, response classes, request parameters, and client-management methods never appear in the README, forcing users to read source. + +Legend: **Priority** P0 (broken/misleading) · P1 (high friction) · P2 (polish / bigger bet). **Effort** S (hours) · M (1–3 days) · L (week+). **Source** Onboarding · API-coverage · Both. + +Evidence tagging: claims verified against code carry a `file:line` reference. Items sourced from the API-coverage audit but **not yet re-verified in code** are marked _(unverified)_ so triage knows what still needs a code check. + +--- + +## Prioritized summary + +| ID | Title | Category | Priority | Effort | Source | +|----|-------|----------|----------|--------|--------| +| OB-1 | Python version contradiction (3.8 in README vs 3.9 in setup.py) | Correctness | P0 | S | Onboarding | +| OB-2 | "async" sample is actually thread-pool concurrency (misleading) | Correctness | P0 | S | Onboarding | +| OB-3 | Document response-object shapes for every operation | README-content | P1 | M | Both | +| OB-4 | Add `skyflow/py.typed` (PEP 561) for IDE/type-checker support | Packaging/Typing | P1 | S | Onboarding | +| OB-5 | Add "Get your credentials" / sandbox first-run section | README-content | P1 | M | Onboarding | +| OB-6 | Add a Troubleshooting / common-errors section | README-content | P1 | M | Onboarding | +| OB-7 | Document undocumented request parameters | README-content | P1 | M | API-coverage | +| OB-8 | Document undocumented enums (TokenMode, etc.) 
| README-content | P1 | S | API-coverage | +| OB-9 | Document Skyflow client-management methods | README-content | P1 | S | API-coverage | +| OB-10 | Document `is_expired` & `generate_signed_data_tokens_from_creds` | README-content | P1 | S | API-coverage | +| OB-11 | Add `samples/README.md` | Samples | P1 | M | Onboarding | +| OB-12 | Document client lifecycle / thread-safety / retries | README-content | P1 | M | Onboarding | +| OB-13 | Document Detect helper classes (EntityInfo, TextIndex, Bleep, File) | README-content | P2 | M | API-coverage | +| OB-14 | Add README badges (PyPI, Python versions, build, license) | Polish | P2 | S | Onboarding | +| OB-15 | Align `requirements.txt` ↔ `setup.py` dependency drift | Correctness | P2 | S | Onboarding | +| OB-16 | Remove/clean up `UploadFileRequest` empty stub | Cleanup | P2 | S | API-coverage | +| OB-17 | Typed config/request objects (TypedDict or pydantic) | Bigger-bet | P2 | L | Onboarding | +| OB-18 | Hosted API reference (Sphinx → ReadTheDocs) + split README | Bigger-bet | P2 | L | Onboarding | +| OB-19 | Framework integration guides (Django/Flask/FastAPI) | Bigger-bet | P2 | L | Onboarding | +| OB-20 | Add CONTRIBUTING.md, issue templates, CODE_OF_CONDUCT | Contributor | P2 | M | Onboarding | + +--- + +## P0 — Broken or misleading (fix first) + +### OB-1 · Python version contradiction +*Priority P0 · Effort S · Correctness · Onboarding* + +**Problem:** The README states Python 3.8 is supported and tested, but the package requires 3.9+. A 3.8 user follows the README and `pip install`/import fails before writing any code. +- [README.md:81](../README.md#L81): *"Python 3.8.0 and above (tested with Python 3.8.0)"* +- [setup.py:8](../setup.py#L8): `raise RuntimeError("skyflow requires Python 3.9+")`, plus `python_requires=">=3.9"`. + +**Proposed fix:** Decide the real floor (3.9 per `setup.py`) and make README, `setup.py`, and any CI matrix agree. 
Update the README "Require" line and the "tested with" claim to match what CI actually tests. + +**Acceptance criteria:** README, `setup.py`, and CI declare the same minimum Python version; the "tested with" version matches an actual CI job. + +### OB-2 · Misleading "async" sample +*Priority P0 · Effort S · Correctness · Onboarding* + +**Problem:** [samples/detect_api/deidentify_file_async.py](../samples/detect_api/deidentify_file_async.py) is named "async" but uses `concurrent.futures.ThreadPoolExecutor` — the SDK is sync-only (no `async def`/`await` in the package). The name implies asyncio support that does not exist. + +**Proposed fix:** Rename to convey thread-based concurrency (e.g. `deidentify_file_concurrent.py`) and/or add a comment clarifying the threading model. If/when true async is on the roadmap, note it explicitly instead. + +**Acceptance criteria:** Sample name/comments accurately reflect that concurrency is thread-based, not asyncio. No doc implies `await`-able clients exist. + +--- + +## P1 — High friction (core of the effort) + +### OB-3 · Document response-object shapes +*Priority P1 · Effort M · README-content · Both* + +**Problem:** Every README example ends with `print(...response)`, but the README never shows what comes back. Response classes and their attributes are undocumented — users must read source or guess. 
Verified attributes: +- `InsertResponse` → `inserted_fields`, `errors` ([_insert_response.py](../skyflow/vault/data/_insert_response.py)) +- Per the API-coverage audit _(unverified beyond InsertResponse)_: `GetResponse(data, errors)`, `DeleteResponse(deleted_ids, errors)`, `UpdateResponse(updated_field, errors)`, `QueryResponse(fields, errors)`, `FileUploadResponse(skyflow_id, errors)`, `DetokenizeResponse(detokenized_fields, errors)`, `TokenizeResponse(tokenized_fields, errors)`, `InvokeConnectionResponse(data, metadata, errors)`, `DeidentifyTextResponse(processed_text, entities, word_count, char_count)`, `ReidentifyTextResponse(processed_text)`, `DeidentifyFileResponse(file_base64, file, type, extension, word_count, char_count, size_in_kb, duration_in_seconds, page_count, slide_count, entities, run_id, status)`. + +**Proposed fix:** For each operation, add an "Example response" block showing the object and its attributes. Verify each attribute list against source before publishing. + +**Acceptance criteria:** Every documented operation shows its response object and attributes; all attribute lists verified against the response class source. + +### OB-4 · Add `skyflow/py.typed` +*Priority P1 · Effort S · Packaging/Typing · Onboarding* + +**Problem:** A `py.typed` marker exists only at `skyflow/generated/rest/py.typed`, not at the top-level `skyflow/` package. Per PEP 561, mypy/pyright therefore treat the public SDK as untyped — no autocomplete on `InsertRequest(...)`, no inline type errors — despite the package shipping pydantic. + +**Proposed fix:** Add an empty `skyflow/py.typed` and ensure it's included in the wheel (`package_data`/`include_package_data` in `setup.py`). + +**Acceptance criteria:** `import skyflow` resolves types under mypy/pyright in a consuming project; `py.typed` ships in the built wheel. 
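
The second acceptance criterion can be checked mechanically, since a wheel is a zip archive. The sketch below uses a hypothetical helper (`wheel_has_marker` is not part of the SDK or its build tooling) that looks for `skyflow/py.typed` in an archive. It builds two in-memory stand-in wheels so it runs without a real build.

```python
import io
import zipfile


def wheel_has_marker(wheel_file, marker='skyflow/py.typed') -> bool:
    """Return True if the wheel (a zip archive) contains the PEP 561 marker."""
    with zipfile.ZipFile(wheel_file) as wheel:
        return marker in wheel.namelist()


def fake_wheel(names):
    """Build an in-memory zip with empty files; a stand-in for a built wheel."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w') as wheel:
        for name in names:
            wheel.writestr(name, '')
    buffer.seek(0)
    return buffer


# Marker only under the generated sub-package (the situation described above).
before = fake_wheel(['skyflow/__init__.py', 'skyflow/generated/rest/py.typed'])
# Marker at the top-level package (the proposed fix).
after = fake_wheel(['skyflow/__init__.py', 'skyflow/py.typed'])

print(wheel_has_marker(before))  # False
print(wheel_has_marker(after))   # True
```

In CI, the same helper could be pointed at the wheel under `dist/` after a build; the exact wheel filename depends on the release and is left out here.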
+ +### OB-5 · "Get your credentials" / sandbox first-run section +*Priority P1 · Effort M · README-content · Onboarding* + +**Problem:** The quickstart ([README.md:91](../README.md#L91)) assumes the user already has `vault_id`, `cluster_id`, an `api_key`, and a table named `table1` with matching columns. Nothing explains how to obtain these. A brand-new user cannot run the first example. + +**Proposed fix:** Add a short pre-quickstart section: where to find `vault_id`/`cluster_id` (the `{clusterId}.vault.skyflowapis.com` URL), how to create a service account / API key, and the minimal table/schema the quickstart assumes. Link to the relevant Skyflow console docs. If a sandbox/test environment exists, show the no-setup path. + +**Acceptance criteria:** A new user can go from "fresh account" to a successful first call using only the README; every placeholder in the quickstart has a "where this comes from" pointer. + +### OB-6 · Troubleshooting / common-errors section +*Priority P1 · Effort M · README-content · Onboarding* + +**Problem:** Only the bearer-token-expiry edge case is documented ([README.md:850](../README.md#L850)). The errors new users actually hit — wrong `cluster_id`, wrong `env`, 403 (missing role permission), table/column not found — have no guidance. + +**Proposed fix:** Add an "error → likely cause → fix" table covering the top first-run failures, plus how to read `SkyflowError` (`http_code`, `message`, `details`). + +**Acceptance criteria:** Section covers the top ~8–10 first-run errors with actionable fixes. 
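
As a rough illustration of the "error → likely cause → fix" idea, the sketch below formats an error from the three attributes named above (`http_code`, `message`, `details`) and looks up a hint. `FakeSkyflowError`, `HINTS`, and `explain` are hypothetical stand-ins so the snippet runs without the SDK. Only the 403 hint comes from this backlog; any other entries would need to be verified before publishing.

```python
class FakeSkyflowError(Exception):
    """Stand-in with the attributes the backlog names; not the SDK's class."""

    def __init__(self, http_code, message, details=None):
        super().__init__(message)
        self.http_code = http_code
        self.message = message
        self.details = details


# Hypothetical hint table. Only the 403 entry is taken from the text above.
HINTS = {
    403: 'missing role permission on the service account',
}


def explain(error) -> str:
    """Summarize an error and attach a likely cause when one is recorded."""
    hint = HINTS.get(error.http_code, 'no hint recorded yet')
    return f'{error.http_code}: {error.message} (likely cause: {hint})'


try:
    raise FakeSkyflowError(403, 'Forbidden')
except FakeSkyflowError as err:
    summary = explain(err)

print(summary)
```

Keying the table on `http_code` keeps the lookup simple, but several first-run failures may share a status code, so a published table would probably also need to match on the message.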
+ +### OB-7 · Document undocumented request parameters +*Priority P1 · Effort M · README-content · API-coverage _(unverified)_* + +**Problem:** Several request parameters are exported and usable but absent from the README: +- `InsertRequest`: `tokens`, `homogeneous`, `token_mode` +- `UpdateRequest`: `tokens`, `token_mode` +- `GetRequest`: `fields`, `offset`, `limit`, `download_url` (field selection + pagination — commonly needed) +- `FileUploadRequest`: `file_path`, `base64`, `file_name` +- `FileInput`: `file_path` +- `DeidentifyFileRequest`: `allow_regex_list`, `restrict_regex_list`, `output_processed_image`, `output_ocr_text`, `masking_method`, `pixel_density`, `max_resolution`, `output_processed_audio`, `output_transcription`, `bleep` +- `DeidentifyTextRequest`: `allow_regex_list`, `restrict_regex_list` + +**Proposed fix:** Document each parameter (name, default, purpose) on the relevant request. Verify defaults against source before publishing. Prioritize `GetRequest` pagination and `InsertRequest`/`UpdateRequest` `token_mode`/`tokens` (BYOT). + +**Acceptance criteria:** Each request's documented parameter list matches its constructor signature; defaults verified. + +### OB-8 · Document undocumented enums +*Priority P1 · Effort S · README-content · API-coverage* + +**Problem:** Exported from `skyflow.utils.enums` ([__init__.py](../skyflow/utils/enums/__init__.py), verified) but undocumented: +- `TokenMode` → `DISABLE`, `ENABLE`, `ENABLE_STRICT` (verified, [token_mode.py](../skyflow/utils/enums/token_mode.py)) — used by Insert/Update for BYOT. +- `MaskingMethod` → `BLACKBOX`/`BLUR` (verified, [masking_method.py](../skyflow/utils/enums/masking_method.py)). +- `TokenType` full set (`VAULT_TOKEN`, `ENTITY_UNIQUE_COUNTER`, `ENTITY_ONLY`) — only two values shown in Detect examples. +- `ContentType` (`JSON`, `PLAINTEXT`, `XML`, `URLENCODED`, `FORMDATA`), `DetectOutputTranscriptions`, `Env` non-PROD variants (`SANDBOX`, `DEV`, `STAGE`), `EnvUrls`, `RequestMethod.NONE`.
_(values unverified)_ + +**Proposed fix:** Add a concise enum reference (table per enum) and link from the operations that consume them. + +**Acceptance criteria:** Every user-facing enum and its values are listed; values verified against source. + +### OB-9 · Document Skyflow client-management methods +*Priority P1 · Effort S · README-content · API-coverage* + +**Problem:** The README documents only the builder methods and `vault()/connection()/detect()` accessors. These public methods (verified in [skyflow/client/skyflow.py:27-67](../skyflow/client/skyflow.py#L27)) are undocumented: +`remove_vault_config`, `update_vault_config`, `get_vault_config`, `add_connection_config`, `remove_connection_config`, `update_connection_config`, `get_connection_config`, `add_skyflow_credentials`, `update_skyflow_credentials`, `update_log_level`, `get_log_level`. + +**Proposed fix:** Add a "Managing the client after build" subsection documenting post-build config mutation and log-level control. + +**Acceptance criteria:** All public, non-builder client methods are documented with signatures and purpose. + +### OB-10 · Document `is_expired` & `generate_signed_data_tokens_from_creds` +*Priority P1 · Effort S · README-content · API-coverage* + +**Problem:** Both are exported from `skyflow.service_account` (verified, [service_account/__init__.py](../skyflow/service_account/__init__.py)) but undocumented. `is_expired(token)` is commonly needed for token-lifecycle management; `generate_signed_data_tokens_from_creds` is the string-based counterpart to the documented file-path function. + +**Proposed fix:** Document both alongside the existing token-generation docs, mirroring the `_from_creds` pattern already shown for bearer tokens. + +**Acceptance criteria:** Both functions documented with a usage snippet. 
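
The caching pattern that `is_expired` enables can be wrapped in a small helper. This is a sketch only: `CachedToken` is a hypothetical class, not part of the SDK, and it takes the generator and the expiry check as callables so it runs here with stand-ins. It assumes the generator returns a two-element tuple whose first element is the token, as in the `generate_bearer_token` snippet in the API reference.

```python
from typing import Callable, Optional, Tuple


class CachedToken:
    """Regenerate a bearer token only when the cached one is missing or expired."""

    def __init__(
        self,
        generate: Callable[[], Tuple[str, str]],
        is_expired: Callable[[str], bool],
    ) -> None:
        self._generate = generate
        self._is_expired = is_expired
        self._token: Optional[str] = None

    def get(self) -> str:
        if self._token is None or self._is_expired(self._token):
            # Second tuple element is ignored, mirroring `token, _ = ...`.
            self._token, _ = self._generate()
        return self._token


# Stand-ins so the sketch runs without the SDK or real credentials.
calls = []
expired = set()


def fake_generate():
    calls.append(1)
    return f'token-{len(calls)}', 'unused'


cache = CachedToken(fake_generate, lambda token: token in expired)
first = cache.get()    # nothing cached yet, so a token is generated
second = cache.get()   # cached token is still valid, so it is reused
expired.add(first)     # simulate expiry of the cached token
third = cache.get()    # expired, so a new token is generated

print(first, second, third, len(calls))
```

With the SDK installed, the stand-ins would presumably be replaced by `lambda: generate_bearer_token('path/to/credentials.json')` and `is_expired` from `skyflow.service_account`. The helper is not thread-safe as written.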
+ +### OB-11 · `samples/README.md` +*Priority P1 · Effort M · Samples · Onboarding* + +**Problem:** 22 sample files across `samples/{vault_api,detect_api,service_account}` have no README explaining prerequisites or how to run them; each hardcodes placeholder values. + +**Proposed fix:** Add `samples/README.md` (prerequisites, how to run, what each sample shows, per-sample index). + +**Acceptance criteria:** A user can run any sample by following `samples/README.md` and replacing the inline placeholders with their own values. + +### OB-12 · Client lifecycle / thread-safety / retries +*Priority P1 · Effort M · README-content · Onboarding* + +**Problem:** The stated audience is Python backends, but the README never says whether `Skyflow` is thread-safe, should be reused as a singleton vs rebuilt per request, or how retries/timeouts/connection pooling are configured. + +**Proposed fix:** Add a short "Using the client in production" section covering reuse/lifecycle, thread-safety, and any timeout/retry configuration. Verify behavior against the client implementation before documenting. + +**Acceptance criteria:** Section answers reuse, thread-safety, and timeout/retry questions; claims verified against code. + +--- + +## P2 — Polish & bigger bets + +### OB-13 · Document Detect helper classes +*Priority P2 · Effort M · README-content · API-coverage _(unverified)_* + +**Problem:** Publicly exported from `skyflow.vault.detect` but undocumented: `EntityInfo` (`token`, `value`, `text_index`, `processed_index`, `entity`, `scores`), `TextIndex` (`start`, `end`), `Bleep` (`gain`, `frequency`, `start_padding`, `stop_padding`), and the Detect `File` wrapper (`name`, `size`, `type`, `last_modified`, `read()`, `seek()`). + +**Proposed fix:** Document these as part of the Detect response/request reference. Verify attributes against source. + +**Acceptance criteria:** Each helper class documented with attributes/methods; verified.
+ +### OB-14 · README badges +*Priority P2 · Effort S · Polish · Onboarding* + +**Problem:** The README has zero badges (verified). No PyPI version, supported-Python, build, coverage, or license signal at a glance. + +**Proposed fix:** Add standard badges to the README header. + +**Acceptance criteria:** Header shows at least PyPI version, supported Python versions, build status, and license. + +### OB-15 · Align `requirements.txt` ↔ `setup.py` +*Priority P2 · Effort S · Correctness · Onboarding* + +**Problem:** Dependency drift, e.g. `setuptools >= 21.0.0` ([requirements.txt](../requirements.txt)) vs `setuptools >= 75.3.3` ([setup.py:31](../setup.py#L31)). Risk of inconsistent installs. + +**Proposed fix:** Reconcile the two (or generate one from the other) so version constraints match. + +**Acceptance criteria:** No conflicting version constraints between the two files. + +### OB-16 · Clean up `UploadFileRequest` stub +*Priority P2 · Effort S · Cleanup · API-coverage* + +**Problem:** `UploadFileRequest` is an exported empty stub (`def __init__(self): pass`, verified [_upload_file_request.py](../skyflow/vault/data/_upload_file_request.py)), apparently superseded by `FileUploadRequest`. It pollutes the public surface. + +**Proposed fix:** Remove from exports (or document if it has a real purpose). Treat as a potential breaking change — confirm nothing depends on it and gate on a minor/major version bump. + +**Acceptance criteria:** `UploadFileRequest` is either removed from the public API or documented with a clear purpose; release notes mention the change if removed. + +### OB-17 · Typed config/request objects +*Priority P2 · Effort L · Bigger-bet · Onboarding* + +**Problem:** Config and credentials are untyped dicts (`{'vault_id': ..., 'cluster_id': ...}`), so a typo like `'clster_id'` fails at runtime instead of at edit time. The SDK already depends on pydantic but users get none of the type safety on inputs. 
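
The typo failure described above can be caught earlier with a `TypedDict`. The sketch below assumes a hypothetical `VaultConfig` type (per the problem statement, the SDK accepts plain dicts today) covering only the two keys named in the example. A type checker such as mypy or pyright would flag the misspelled key in an annotated literal at edit time. The `unknown_keys` helper, also hypothetical, shows an equivalent runtime check for plain dicts.

```python
from typing import List, TypedDict


class VaultConfig(TypedDict):
    """Hypothetical typed view of the vault config dict; not an SDK type."""
    vault_id: str
    cluster_id: str


def unknown_keys(config: dict) -> List[str]:
    """Return keys that VaultConfig does not declare, e.g. misspellings."""
    allowed = set(VaultConfig.__annotations__)
    return sorted(key for key in config if key not in allowed)


# Annotated literal: a type checker validates the keys statically.
good: VaultConfig = {'vault_id': 'my-vault', 'cluster_id': 'my-cluster'}

# Plain dict with a typo: nothing complains until something reads 'cluster_id'.
typo = {'vault_id': 'my-vault', 'clster_id': 'my-cluster'}

print(unknown_keys(good))  # []
print(unknown_keys(typo))  # ['clster_id']
```

Because a `TypedDict` is an ordinary dict at runtime, existing dict-based callers would keep working unchanged, which fits the back-compat constraint.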
+ +**Proposed fix:** Offer `TypedDict` (or pydantic model) variants for config/credentials and request constructors, keeping dict input for back-compat. Design as an additive, non-breaking enhancement. + +**Acceptance criteria:** Users get autocomplete and static validation on config/request inputs without breaking existing dict-based usage. + +### OB-18 · Hosted API reference + README split +*Priority P2 · Effort L · Bigger-bet · Onboarding* + +**Problem:** Public modules are `_`-prefixed, so users have only the 870-line README — no generated API reference. The single file is hard to scan for onboarding vs reference use. + +**Proposed fix:** Generate a hosted API reference (Sphinx → ReadTheDocs) from docstrings and split the README into a short "Getting Started" plus a linked reference. (Depends on docstring coverage — may pair with OB-3/OB-7/OB-8.) + +**Acceptance criteria:** A hosted, versioned API reference exists; README is scannable for first-run vs deep reference. + +### OB-19 · Framework integration guides +*Priority P2 · Effort L · Bigger-bet · Onboarding* + +**Problem:** No examples for the most common backends (Django/Flask/FastAPI), including where to construct/reuse the client. (Pairs with OB-12.) + +**Proposed fix:** Add one focused integration example per major framework showing client lifecycle and a representative operation. + +**Acceptance criteria:** At least one runnable integration example each for Django, Flask, and FastAPI. + +### OB-20 · Contributor onboarding files +*Priority P2 · Effort M · Contributor · Onboarding* + +**Problem:** No `CONTRIBUTING.md`, issue templates, or `CODE_OF_CONDUCT.md` (only a PR template exists under `.github/workflows/`). Blocks contributor onboarding; the PR template is also in a non-standard location. + +**Proposed fix:** Add `CONTRIBUTING.md` (dev setup, test/lint commands, branch/release flow), issue templates, and a code of conduct. Move the PR template to the conventional `.github/` root if appropriate. 
+
+**Acceptance criteria:** Standard GitHub community-health files present and discoverable.
+
+---
+
+## Appendix: full undocumented-interface inventory
+
+Reproduced from the API-coverage audit for completeness so nothing is lost as README items close. Verified entries are tagged; the rest are _(unverified)_ pending a code check during implementation.
+
+### Enums (exported from `skyflow.utils.enums`, export list verified)
+| Enum | Values | Notes |
+|------|--------|-------|
+| `EnvUrls` | — | Internal but exported _(unverified)_ |
+| `ContentType` | `JSON`, `PLAINTEXT`, `XML`, `URLENCODED`, `FORMDATA` | _(values unverified)_ |
+| `TokenMode` | `DISABLE`, `ENABLE`, `ENABLE_STRICT` | Verified; used by Insert/Update (BYOT) |
+| `TokenType` | `VAULT_TOKEN`, `ENTITY_UNIQUE_COUNTER`, `ENTITY_ONLY` | Partially documented; full set _(unverified)_ |
+| `DetectOutputTranscriptions` | `DIARIZED_TRANSCRIPTION`, `MEDICAL_DIARIZED_TRANSCRIPTION`, `MEDICAL_TRANSCRIPTION`, `TRANSCRIPTION` | _(values unverified)_ |
+| `MaskingMethod` | `BLACKBOX` (`blackbox`), `BLUR` (`blur`) | Verified |
+| `Env` non-PROD | `SANDBOX`, `DEV`, `STAGE` | Only `PROD` shown in README _(unverified)_ |
+| `RequestMethod.NONE` | — | `GET/POST/PUT/PATCH/DELETE` documented; `NONE` not _(unverified)_ |
+
+### Response classes (attributes per API-coverage audit; only `InsertResponse` verified here)
+`InsertResponse(inserted_fields, errors)` ✓ · `GetResponse(data, errors)` · `DeleteResponse(deleted_ids, errors)` · `UpdateResponse(updated_field, errors)` · `QueryResponse(fields, errors)` · `FileUploadResponse(skyflow_id, errors)` · `DetokenizeResponse(detokenized_fields, errors)` · `TokenizeResponse(tokenized_fields, errors)` · `InvokeConnectionResponse(data, metadata, errors)` · `DeidentifyTextResponse(processed_text, entities, word_count, char_count)` · `ReidentifyTextResponse(processed_text)` · `DeidentifyFileResponse(file_base64, file, type, extension, word_count, char_count, size_in_kb, duration_in_seconds, page_count, slide_count, entities, run_id, status)`
+
+### Detect helper classes (exported from `skyflow.vault.detect`)
+`EntityInfo(token, value, text_index, processed_index, entity, scores)` · `TextIndex(start, end)` · `Bleep(gain, frequency, start_padding, stop_padding)` · `File(name, size, type, last_modified, read(), seek())` — _(unverified)_
+
+### Client methods (verified, [skyflow/client/skyflow.py:27-67](../skyflow/client/skyflow.py#L27))
+`remove_vault_config`, `update_vault_config`, `get_vault_config`, `add_connection_config`, `remove_connection_config`, `update_connection_config`, `get_connection_config`, `add_skyflow_credentials`, `update_skyflow_credentials`, `update_log_level`, `get_log_level`
+
+### Service-account functions (verified, [service_account/__init__.py](../skyflow/service_account/__init__.py))
+`is_expired(token)`, `generate_signed_data_tokens_from_creds(credentials, options)`
+
+### Cleanup candidates
+`UploadFileRequest` (empty stub, verified) · `SkyflowMessages` (exported, internal) · `SDK_VERSION` (exported, undocumented) · `Audit`/`BinLookUp` (stub controllers, **not** exported — not user-reachable, verified [controller/__init__.py](../skyflow/vault/controller/__init__.py))
diff --git a/requirements.txt b/requirements.txt
index bc927eb5..d8c5feaf 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,5 +1,5 @@
 python_dateutil >= 2.5.3
-setuptools >= 21.0.0
+setuptools >= 75.3.3
 urllib3 >= 1.25.3, < 3
 pydantic >= 2
 typing-extensions >= 4.7.1
diff --git a/samples/README.md b/samples/README.md
new file mode 100644
index 00000000..f3532b68
--- /dev/null
+++ b/samples/README.md
@@ -0,0 +1,65 @@
+# Skyflow Python SDK — Samples
+
+Runnable examples for the Skyflow Python SDK, grouped by area. Start with the [README](../README.md) and [API Reference](../docs/api_reference.md) for full documentation.
+
+## Prerequisites
+
+- Python 3.9 or above
+- The SDK installed: `pip install skyflow`
+- A Skyflow account, a vault, and a service account (see [Before you begin](../README.md#before-you-begin))
+- Your `vault_id`, `cluster_id`, `env`, and one credential (API key, bearer token, or service-account credentials)
+
+## Configure
+
+The samples ship with inline placeholder strings. Replace the placeholders in the sample you want to run with your own values before running it.
+
+> Never commit real credentials.
+
+## Run a sample
+
+```bash
+python samples/vault_api/insert_records.py
+```
+
+## What's here
+
+### `vault_api/`
+Core vault data operations.
+
+| Sample | Demonstrates |
+|--------|--------------|
+| `client_operations.py` | Building and managing the Skyflow client |
+| `credentials_options.py` | The different credential types |
+| `insert_records.py` | Inserting and tokenizing records (`continue_on_error`) |
+| `insert_byot.py` | Bring-your-own-token inserts |
+| `get_records.py` | Getting records by Skyflow ID |
+| `get_column_values.py` | Getting records by column name/values |
+| `update_record.py` | Updating a record |
+| `delete_records.py` | Deleting records |
+| `query_records.py` | SQL queries |
+| `detokenize_records.py` | Detokenizing tokens |
+| `tokenize_records.py` | Retrieving existing tokens |
+| `upload_file.py` | Uploading a file to a record |
+| `invoke_connection.py` | Invoking a Skyflow Connection |
+
+### `detect_api/`
+Skyflow Detect (de-identification / re-identification).
+
+| Sample | Demonstrates |
+|--------|--------------|
+| `deidentify_text.py` | De-identifying text |
+| `reidentify_text.py` | Re-identifying text |
+| `deidentify_file.py` | De-identifying a file |
+| `deidentify_file_concurrent.py` | Running a file de-identification on a background thread (thread-based concurrency, not asyncio) |
+| `get_detect_run.py` | Polling a file de-identification run by `run_id` |
+
+### `service_account/`
+Bearer-token and signed-data-token generation.
+
+| Sample | Demonstrates |
+|--------|--------------|
+| `token_generation_example.py` | Generating a bearer token |
+| `scoped_token_generation_example.py` | Tokens scoped to specific roles |
+| `token_generation_with_context_example.py` | Tokens with context (`ctx`) |
+| `signed_token_generation_example.py` | Signed data tokens |
+| `bearer_token_expiry_example.py` | Handling token expiry / regeneration |
diff --git a/samples/detect_api/deidentify_file_async.py b/samples/detect_api/deidentify_file_concurrent.py
similarity index 91%
rename from samples/detect_api/deidentify_file_async.py
rename to samples/detect_api/deidentify_file_concurrent.py
index 23d2f40f..d3144f3c 100644
--- a/samples/detect_api/deidentify_file_async.py
+++ b/samples/detect_api/deidentify_file_concurrent.py
@@ -12,15 +12,18 @@ from concurrent.futures import ThreadPoolExecutor
 
 
 """
- * Skyflow Deidentify File Example
- *
- * This sample demonstrates how to use all available options for deidentifying files
- * using an asynchronous approach.
- * Supported file types: images (jpg, png, etc.), pdf, audio (mp3, wav), documents,
+ * Skyflow Deidentify File Example (concurrent)
+ *
+ * This sample demonstrates how to use all available options for deidentifying files.
+ * The SDK is synchronous; this example runs the (blocking) deidentify_file call on a
+ * background thread using concurrent.futures.ThreadPoolExecutor so the main thread can
+ * continue working. This is thread-based concurrency, not asyncio — the SDK does not
+ * expose async/await coroutines.
+ * Supported file types: images (jpg, png, etc.), pdf, audio (mp3, wav), documents,
  * spreadsheets, presentations, structured text.
 """
 
-def perform_file_deidentification_async():
+def perform_file_deidentification_concurrent():
     try:
         # Step 1: Configure Credentials
         credentials = {
diff --git a/setup.py b/setup.py
index 5e833f82..83c5b490 100644
--- a/setup.py
+++ b/setup.py
@@ -18,6 +18,11 @@
     author='Skyflow',
     author_email='service-ops@skyflow.com',
     packages=find_packages(where='.', exclude=['test*']),
+    # Ship PEP 561 markers so type checkers (mypy/pyright) see the SDK's types.
+    package_data={
+        'skyflow': ['py.typed'],
+        'skyflow.generated.rest': ['py.typed'],
+    },
     url='https://github.com/skyflowapi/skyflow-python/',
     license='LICENSE',
     description='Skyflow SDK for the Python programming language',
diff --git a/skyflow/py.typed b/skyflow/py.typed
new file mode 100644
index 00000000..e69de29b
diff --git a/skyflow/utils/enums/request_method.py b/skyflow/utils/enums/request_method.py
index 61efef3d..a002dcaa 100644
--- a/skyflow/utils/enums/request_method.py
+++ b/skyflow/utils/enums/request_method.py
@@ -4,5 +4,6 @@ class RequestMethod(Enum):
     GET = "GET"
     POST = "POST"
     PUT = "PUT"
+    PATCH = "PATCH"
     DELETE = "DELETE"
     NONE = "NONE"
\ No newline at end of file
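
Reviewer note: the `request_method.py` hunk adds a `PATCH` member, so a value lookup on the string `"PATCH"` should now resolve instead of raising `ValueError`. The sketch below mirrors the enum as it reads after the change. It is redefined locally so the snippet runs without the SDK installed, and `parse_method` is a hypothetical helper for illustration, not part of the SDK.

```python
from enum import Enum


# Local mirror of the enum after this hunk (an assumption for illustration;
# the real class lives in skyflow/utils/enums/request_method.py).
class RequestMethod(Enum):
    GET = "GET"
    POST = "POST"
    PUT = "PUT"
    PATCH = "PATCH"
    DELETE = "DELETE"
    NONE = "NONE"


def parse_method(raw: str) -> RequestMethod:
    """Hypothetical helper: map a case-insensitive string to a RequestMethod."""
    try:
        # Enum lookup by value raises ValueError for unknown strings.
        return RequestMethod(raw.strip().upper())
    except ValueError:
        # Falling back to NONE is a choice made for this sketch only;
        # the SDK may treat unknown verbs differently.
        return RequestMethod.NONE


if __name__ == "__main__":
    print(parse_method("patch"))
    print(parse_method("trace"))
```

Before this change, a lookup such as `RequestMethod("PATCH")` would have raised `ValueError`, which is one way to confirm the hunk in a unit test.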